Performance Measurement, Analysis, and Optimization of GPU-accelerated Applications

Date
2022-04-26
Abstract

With the end of Moore’s law, computing platforms are increasingly exploring heterogeneous processors for acceleration. Graphics Processing Units (GPUs) have emerged as a key component for accelerating applications in various domains, including deep learning, data analytics, and scientific simulations. While GPUs provide superior compute power and higher memory bandwidth than CPUs, writing efficient GPU code to achieve maximum possible performance is challenging because of the sophisticated programming models and architectural features. GPU performance tools are designed to pinpoint performance bottlenecks in GPU-accelerated applications and provide performance insights for users. However, existing performance tools are insufficient to identify hotspots and provide insights for complex applications.

This thesis describes novel GPU performance tools that measure and analyze GPU-accelerated applications to address these challenges. First, I describe a GPU profiler that uses API interception, instruction sampling, and binary instrumentation to collect GPU performance metrics. To lower the overhead caused by the profiler, I designed novel wait-free queues for communication between multiple threads, a GPU-accelerated method to process measurement data, and a metrics derivation method that derives multiple essential GPU performance metrics without replaying GPU operations. Then, I present a framework that attributes measurement data collected at runtime to call paths with low overhead. For offline analysis, I developed a binary analyzer that reconstructs approximate GPU calling contexts by analyzing instruction samples and GPU binaries. The analyzer also examines def-use relations among GPU instructions to attribute instruction stalls to their root causes and to identify the value type of memory instructions. Using performance metrics, program contexts, and instruction characteristics, I developed context-sensitive, instruction stall, and value redundancy analyzers to generate insightful performance reports. The context-sensitive analyzer focuses users' attention on hotspots with sophisticated program contexts. The instruction stall analyzer matches performance bottlenecks with potential optimizations, estimates speedups for each optimization, and outputs the optimization suggestions with the highest estimated speedups. The value redundancy analyzer identifies GPU operations involving significantly redundant values and constructs a value flow graph to visualize value changes across GPU operations. To demonstrate the effectiveness of these performance tools, I have studied many machine learning and HPC applications. Guided by the insightful performance reports generated by the tools, I have identified performance hotspots and proposed effective optimizations that ameliorate underlying causes for inefficiency.
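
The ranking step of the instruction stall analyzer can be illustrated with a minimal sketch. The abstract does not specify how speedups are estimated, so the code below assumes a simple Amdahl-style model: if an optimization removed every sample attributed to the stall reasons it targets, the speedup would be total samples divided by remaining samples. The stall-reason names, the sample counts, the `Optimization` record, and the optimization names are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Optimization:
    """Hypothetical record: an optimization and the stall reasons it targets."""
    name: str
    stall_reasons: tuple


def estimate_speedup(samples, opt):
    """Amdahl-style estimate (an assumption, not the thesis's estimator):
    assume the optimization removes all samples of the targeted stall reasons."""
    total = sum(samples.values())
    removed = sum(samples.get(reason, 0) for reason in opt.stall_reasons)
    remaining = total - removed
    if remaining <= 0:
        return float("inf")
    return total / remaining


def rank_optimizations(samples, optimizations, top_k=2):
    """Return the top_k (estimated_speedup, name) pairs, highest speedup first."""
    scored = [(estimate_speedup(samples, opt), opt.name) for opt in optimizations]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up instruction samples for one kernel, keyed by stall reason.
example_samples = {
    "issued": 400,                # samples in which the instruction issued
    "memory_dependency": 300,     # waiting on a memory operand
    "execution_dependency": 200,  # waiting on an arithmetic result
    "synchronization": 100,       # waiting at a barrier
}

example_optimizations = [
    Optimization("reduce global memory latency", ("memory_dependency",)),
    Optimization("increase instruction-level parallelism", ("execution_dependency",)),
    Optimization("remove redundant barriers", ("synchronization",)),
]

if __name__ == "__main__":
    for speedup, name in rank_optimizations(example_samples, example_optimizations):
        print(f"{name}: estimated speedup {speedup:.2f}x")
```

A model like this gives an optimistic upper bound, because it assumes the targeted stalls disappear entirely and that nothing else becomes the bottleneck; a real analyzer would likely need a more careful estimator.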

Degree
Doctor of Philosophy
Type
Thesis
Keywords
Graphics Processing Units, Performance Tools, Performance Measurement, Instrumentation, Instruction Sampling, Deep Learning, High Performance Computing, Performance Tuning
Citation

Zhou, Keren. "Performance Measurement, Analysis, and Optimization of GPU-accelerated Applications." (2022) Diss., Rice University. https://hdl.handle.net/1911/113506.

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.