Techniques for Measurement, Analysis, and Optimization of HPC Communication Performance
Abstract
Inter-node communication is a critical component of tightly coupled applications running on parallel high-performance computing systems. Surveys of high-performance computing benchmarks and applications show that most applications spend at least 20% of their execution time communicating, and some spend more than 50%. Inter-node communication performance is therefore important to the overall performance of parallel applications. Furthermore, as the scale of parallelism increases, communicating efficiently becomes more important and typically more difficult. Application developers often cannot address communication performance issues on their own, either because they lack useful diagnostic information or because the issues stem from system-level causes such as poor routing.
This dissertation describes several techniques for measuring, analyzing, and optimizing the communication performance of parallel applications running on a supercomputer with a fat tree interconnect. First, I describe a sampling-based monitoring technique that uses a small amount of performance-related data in each packet to reconstruct quantitative estimates of traffic and congestion correlated with both application contexts and individual links. Using this information, the technique can distinguish between problems with an application's communication pattern, its mapping onto a parallel system, and outside interference. Second, I propose an approach for generating optimized, traffic-aware routes on a statically routed network. The core of this approach is a combination of linear programming formulations for the optimal static routing problem. Third, I propose a technique for reconstructing application traffic patterns via compressed sensing from switch counters and other system-level information. The second and third contributions, combined to form a system called CoGARFrSN, use measurements of communication traffic to produce better static routes that reduce congestion, effectively turning a statically routed network into a coarse-grained adaptively routed network. Experiments with a network simulator show that, for several communication motifs, CoGARFrSN routes often yield a 4-7x speedup over the traffic-oblivious static routing strategy typically used in fat trees, and sometimes even perform significantly better than fine-grained hardware adaptive routing.
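The contrast between traffic-oblivious and traffic-aware static routing can be illustrated with a toy model. The sketch below is not the dissertation's method: it swaps the linear programming formulations for a simple greedy heuristic, and it assumes a hypothetical two-level fat tree (4 hosts per leaf switch, 2 spine switches) in which each destination host is assigned one spine. The oblivious baseline assumed here is a destination-modulo rule; all function names and constants are illustrative.

```python
from collections import defaultdict

HOSTS_PER_LEAF = 4   # assumed toy topology, not taken from the dissertation
NUM_SPINES = 2


def leaf_of(host):
    """Return the index of the leaf switch that a host attaches to."""
    return host // HOSTS_PER_LEAF


def max_load(flows, spine_for_dst):
    """Return the heaviest link load under a destination-based routing."""
    loads = defaultdict(float)
    for (src, dst), volume in flows.items():
        if leaf_of(src) == leaf_of(dst):
            continue  # intra-leaf traffic never crosses a spine
        spine = spine_for_dst[dst]
        loads[("up", leaf_of(src), spine)] += volume
        loads[("down", spine, leaf_of(dst))] += volume
    return max(loads.values(), default=0.0)


def oblivious_routes(flows):
    """Traffic-oblivious baseline: pick the spine from the destination ID."""
    return {dst: dst % NUM_SPINES for (_, dst) in flows}


def greedy_traffic_aware_routes(flows):
    """Greedy stand-in for an optimizer: place heavy destinations first,
    each on the spine that keeps its most loaded link lightest."""
    per_dst = defaultdict(lambda: defaultdict(float))  # dst -> src leaf -> volume
    for (src, dst), volume in flows.items():
        if leaf_of(src) != leaf_of(dst):
            per_dst[dst][leaf_of(src)] += volume

    loads = {}
    routes = {}
    for dst in sorted(per_dst, key=lambda d: -sum(per_dst[d].values())):
        def links(spine):
            # Links this destination's traffic would use, with added volume.
            added = {("down", spine, leaf_of(dst)): sum(per_dst[dst].values())}
            for src_leaf, volume in per_dst[dst].items():
                added[("up", src_leaf, spine)] = volume
            return added

        def cost(spine):
            return max(loads.get(k, 0.0) + v for k, v in links(spine).items())

        best = min(range(NUM_SPINES), key=cost)
        for k, v in links(best).items():
            loads[k] = loads.get(k, 0.0) + v
        routes[dst] = best
    return routes


# Two unit-volume flows whose destinations collide under the modulo rule.
flows = {(0, 4): 1.0, (1, 6): 1.0}
print(max_load(flows, oblivious_routes(flows)))             # 2.0
print(max_load(flows, greedy_traffic_aware_routes(flows)))  # 1.0
```

In this example both destinations map to spine 0 under the modulo rule, so two links each carry both flows, while the traffic-aware assignment spreads them across both spines. A greedy heuristic carries no optimality guarantee; it only shows why knowing the traffic pattern can reduce congestion on a statically routed network.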
Citation
Taffet, Philip Adam. "Techniques for Measurement, Analysis, and Optimization of HPC Communication Performance." (2021) Diss., Rice University. https://hdl.handle.net/1911/111185.