Performance Measurement, Analysis, and Optimization of GPU-accelerated Applications
Abstract
With the end of Moore’s law, computing platforms are increasingly exploring heterogeneous processors for acceleration. Graphics Processing Units (GPUs) have emerged as a key component for accelerating applications in various domains, including deep learning, data analytics, and scientific simulations. While GPUs provide superior compute power and higher memory bandwidth than CPUs, writing efficient GPU code to achieve maximum possible performance is challenging because of the sophisticated programming models and architectural features. GPU performance tools are designed to pinpoint performance bottlenecks in GPU-accelerated applications and provide performance insights for users. However, existing performance tools are insufficient to identify hotspots and provide insights for complex applications.
This thesis describes novel GPU performance tools that measure and analyze GPU-accelerated applications to address these challenges. First, I describe a GPU profiler that uses API interception, instruction sampling, and binary instrumentation to collect GPU performance metrics. To lower the overhead caused by the profiler, I designed novel wait-free queues for communication between multiple threads, a GPU-accelerated method to process measurement data, and a metrics derivation method that derives multiple essential GPU performance metrics without replaying GPU operations. Then, I present a framework that attributes measurement data collected at runtime to call paths with low overhead. For offline analysis, I developed a binary analyzer that reconstructs approximate GPU calling contexts by analyzing instruction samples and GPU binaries. The analyzer also examines def-use relations among GPU instructions to attribute instruction stalls to their root causes and to identify the value type of memory instructions. Using performance metrics, program contexts, and instruction characteristics, I developed context-sensitive, instruction stall, and value redundancy analyzers to generate insightful performance reports. The context-sensitive analyzer focuses users' attention on hotspots with sophisticated program contexts. The instruction stall analyzer matches performance bottlenecks with potential optimizations, estimates speedups for each optimization, and outputs the optimization suggestions with the highest estimated speedups. The value redundancy analyzer identifies GPU operations involving significantly redundant values and constructs a value flow graph to visualize value changes across GPU operations. To demonstrate the effectiveness of our performance tools, I have studied many machine learning and HPC applications. Guided by the insightful performance reports generated by our tools, I have identified performance hotspots and proposed effective optimizations that ameliorate underlying causes for inefficiency.
Citation
Zhou, Keren. "Performance Measurement, Analysis, and Optimization of GPU-accelerated Applications." (2022) Diss., Rice University. https://hdl.handle.net/1911/113506.