Browsing by Author "Taffet, Philip Adam"
Techniques for Measurement, Analysis, and Optimization of HPC Communication Performance (2021-07-21)
Taffet, Philip Adam; Mellor-Crummey, John M.
Inter-node communication is a critical component of tightly coupled applications running on parallel high performance computing systems. Surveys of high performance computing benchmarks and applications show that most applications spend at least 20% of their execution time communicating, and some spend more than 50%. Thus, inter-node communication performance is important to the overall performance of parallel applications. Furthermore, as the scale of parallelism increases, communicating efficiently becomes more important and typically more difficult. Application developers often cannot address communication performance issues on their own, whether because of a lack of useful diagnostic information, or because they stem from system-level issues such as poor routing. This dissertation describes several techniques for measuring, analyzing, and optimizing communication performance for parallel applications running on a supercomputer with a fat tree interconnect, all of which can aid in improving communication performance of applications. First, I describe a sampling-based monitoring technique that uses a small amount of performance-related data in each packet to reconstruct quantitative estimates of traffic and congestion correlated with both application contexts and individual links. Using this information, it can distinguish between problems with an application's communication pattern, its mapping onto a parallel system, and outside interference. Second, I propose an approach for generating optimized, traffic-aware routes on a statically routed network. The core of this approach is a combination of linear programming formulations for the optimal static routing problem. Third, I propose a technique for reconstructing application traffic patterns via compressed sensing from switch counters and other system-level information. The second and third contributions, combined to form a system called CoGARFrSN, use measures of communication traffic to produce better static routes that reduce congestion, which can be used effectively to turn a statically routed network into a coarse-grained adaptively routed network. Experiments with a network simulator show that CoGARFrSN routes often result in a 4-7x speedup over the traffic-oblivious static routing strategy typically used in fat trees for several communication motifs, and CoGARFrSN routes sometimes even perform significantly better than fine-grained hardware adaptive routing.

Understanding Congestion in High Performance Interconnection Networks Using Sampling (2018-04-30)
Taffet, Philip Adam; Mellor-Crummey, John M.
The computational needs of many applications outstrip the capabilities of a single compute node. Communication is necessary to employ multiple nodes, but slow communication often limits application performance on multiple nodes. To improve communication performance, developers need tools that enable them to understand how their application’s communication patterns interact with the network, especially when those interactions result in congestion. Since communication performance is difficult to reason about analytically and simulation is costly, measurement-based approaches are needed. This thesis describes a new sampling-based technique to collect information about the path a packet takes and congestion it encounters. Experiments with simulations show that this strategy can distinguish problems with an application's communication patterns, its mapping onto a parallel system, and outside interference. We describe a variant of this scheme that requires only 5-6 bits of information in a monitored packet, making it practical for use in next-generation networks.