Automated Diagnosis of Scalability Losses in Parallel Applications

Date
2020-02-25
Authors
Wei, Lai
Abstract

Each generation of supercomputers is more powerful than the last in an attempt to keep up with the growing ambition of scientific inquiry. Despite improvements in computational power, however, performance of many parallel applications has failed to scale. Many factors degrade the parallel performance of applications. The need to understand application behaviors and pinpoint causes of inefficiency has led to the development of a broad array of tools for measuring and analyzing application performance. Those performance analysis tools generally focus on collecting measurements, attributing them to program source code, and presenting them; responsibility for analysis and interpretation of performance measurement data falls to application developers.

Profiles generated by performance tools can usually identify the presence of scalability losses, whereas time series data are generally necessary to pinpoint the root causes of such losses. However, manual analysis of time series data can be difficult for executions with a large number of processes, long running times, and deep call chains. To address this problem, we developed an automated framework that analyzes time series of call path samples to present users with a performance diagnosis of parallel executions. Our automated framework incurs much lower time and space overhead than prior tools that analyze performance using instrumentation-based traces. The framework's automated diagnosis indicates the symptoms, severity, and causes of scalability losses found in a parallel execution. To support a broad array of parallel applications, our automated analysis is applicable to both SPMD and MPMD programs in both flat and hierarchical parallel models. We demonstrate the effectiveness of our framework by applying it to time-series measurements of three scientific codes.

Degree
Doctor of Philosophy
Type
Thesis
Keywords
performance, automated diagnosis, scalability losses, sample-based time series data
Citation

Wei, Lai. "Automated Diagnosis of Scalability Losses in Parallel Applications." (2020) Diss., Rice University. https://hdl.handle.net/1911/108077.

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.