Techniques for Measurement, Analysis, and Optimization of HPC Communication Performance

dc.contributor.advisor: Mellor-Crummey, John M.
dc.creator: Taffet, Philip Adam
dc.date.accessioned: 2021-08-16T18:14:58Z
dc.date.available: 2021-08-16T18:14:58Z
dc.date.created: 2021-08
dc.date.issued: 2021-07-21
dc.date.submitted: August 2021
dc.date.updated: 2021-08-16T18:14:58Z
dc.description.abstract: Inter-node communication is a critical component of tightly coupled applications running on parallel high performance computing systems. Surveys of high performance computing benchmarks and applications show that most applications spend at least 20% of their execution time communicating, and some spend more than 50%. Thus, inter-node communication performance is important to the overall performance of parallel applications. Furthermore, as the scale of parallelism increases, communicating efficiently becomes more important and typically more difficult. Application developers often cannot address communication performance issues on their own, either because they lack useful diagnostic information or because the issues stem from system-level causes such as poor routing. This dissertation describes several techniques for measuring, analyzing, and optimizing the communication performance of parallel applications running on a supercomputer with a fat tree interconnect. First, I describe a sampling-based monitoring technique that uses a small amount of performance-related data in each packet to reconstruct quantitative estimates of traffic and congestion correlated with both application contexts and individual links. Using this information, the technique can distinguish between problems with an application's communication pattern, problems with its mapping onto a parallel system, and outside interference. Second, I propose an approach for generating optimized, traffic-aware routes on a statically routed network. The core of this approach is a combination of linear programming formulations for the optimal static routing problem. Third, I propose a technique for reconstructing application traffic patterns via compressed sensing from switch counters and other system-level information.
The second and third contributions, combined to form a system called CoGARFrSN, use measures of communication traffic to produce better static routes that reduce congestion, effectively turning a statically routed network into a coarse-grained adaptively routed network. Experiments with a network simulator show that, for several communication motifs, CoGARFrSN routes often yield a 4-7x speedup over the traffic-oblivious static routing strategy typically used in fat trees, and they sometimes even perform significantly better than fine-grained hardware adaptive routing.
dc.format.mimetype: application/pdf
dc.identifier.citation: Taffet, Philip Adam. "Techniques for Measurement, Analysis, and Optimization of HPC Communication Performance." (2021) Diss., Rice University. https://hdl.handle.net/1911/111185.
dc.identifier.doi: https://doi.org/10.25611/FMS4-TY78
dc.identifier.uri: https://hdl.handle.net/1911/111185
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: HPC
dc.subject: high performance computing
dc.subject: networking
dc.subject: InfiniBand
dc.subject: Omni-Path
dc.subject: static routing
dc.subject: networks
dc.subject: compressed sensing
dc.subject: compressive sensing
dc.subject: reservoir sampling
dc.subject: integer linear programming
dc.subject: optimization
dc.title: Techniques for Measurement, Analysis, and Optimization of HPC Communication Performance
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
Files
Original bundle
Name: TAFFET-DOCUMENT-2021.pdf
Size: 6.07 MB
Format: Adobe Portable Document Format
License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text
Name: LICENSE.txt
Size: 2.61 KB
Format: Plain Text