Autotuning Memory-intensive Software for Node Architectures

dc.contributor.advisor: Mellor-Crummey, John
dc.contributor.committeeMember: Cooper, Keith
dc.contributor.committeeMember: Sarkar, Vivek
dc.creator: Wei, Lai
dc.date.accessioned: 2016-02-05T21:49:56Z
dc.date.available: 2016-02-05T21:49:56Z
dc.date.created: 2015-05
dc.date.issued: 2015-05-13
dc.date.submitted: May 2015
dc.date.updated: 2016-02-05T21:49:56Z
dc.description.abstract: Scientific computing plays an important role in scientific research, and supercomputers are built to support the computational needs of large-scale scientific applications. Achieving high performance on today's supercomputers is difficult, in large part because of the complexity of the node architectures, which include wide-issue instruction-level parallelism, SIMD operations, multiple cores, multiple threads per core, and a deep memory hierarchy. In addition, growth in compute performance has outpaced growth in memory bandwidth, making memory bandwidth a scarce resource. Various optimization methods, including tiling and prefetching, have been proposed to make better use of the memory hierarchy. However, because of architectural differences, code hand-tuned for one architecture is not necessarily efficient on others. For that reason, autotuning is often used to tailor high-performance code to different architectures. Common practice is to develop a parametric code generator that produces code according to different optimization parameters and then to pick the best among the resulting implementation alternatives for a given architecture. In this thesis, we use tensor transposition, a generalization of matrix transposition, as a motivating example to study the problem of autotuning memory-intensive codes for complex memory hierarchies. We developed a framework that produces optimized parallel tensor transposition code for node architectures. The framework has two components: a rule-based code generation and transformation system that generates code according to specified optimization parameters, and an autotuner that uses static analysis along with empirical autotuning to pick the best implementation scheme. In this work, we studied how to prune the autotuning search space and how to perform run-time code selection using hardware performance counters. Despite the complex memory access patterns of tensor transposition, experiments on two very different architectures show that our approach achieves more than 80% of the bandwidth of optimized memory copies when transposing most tensors. Our results show that autotuning is key to achieving peak application performance for memory-intensive codes across different node architectures.
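For readers unfamiliar with the techniques the abstract names, the following is a minimal illustrative sketch, not taken from the thesis: a tiled matrix transpose in C, the two-dimensional special case of tensor transposition. The tile size TILE is a hypothetical optimization parameter of the kind the framework's autotuner would search over.

    /* Minimal sketch (not the thesis's code): tiled transpose of an
     * n-by-m row-major matrix, the 2-D special case of tensor
     * transposition. TILE is a hypothetical tuning parameter. */
    #include <stddef.h>

    #define TILE 32  /* hypothetical tile size; an autotuner would vary this */

    /* out[j][i] = in[i][j]. Tiling keeps both the reads of `in` and the
     * strided writes of `out` within cache-sized blocks, making better
     * use of the memory hierarchy than a naive doubly nested loop. */
    void transpose_tiled(size_t n, size_t m,
                         const double *in, double *out)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t jj = 0; jj < m; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t j = jj; j < jj + TILE && j < m; j++)
                        out[j * n + i] = in[i * m + j];
    }

A parametric code generator in the spirit the abstract describes would emit variants of such a kernel for different tile sizes, loop orders, and prefetching schemes, and the autotuner would then select the best-performing variant for the target architecture.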
dc.format.mimetype: application/pdf
dc.identifier.citation: Wei, Lai. "Autotuning Memory-intensive Software for Node Architectures." (2015) Master's Thesis, Rice University. https://hdl.handle.net/1911/88422.
dc.identifier.uri: https://hdl.handle.net/1911/88422
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: autotuning
dc.subject: memory-intensive software
dc.subject: tensor transposition
dc.subject: hardware performance counters
dc.subject: memory hierarchy
dc.title: Autotuning Memory-intensive Software for Node Architectures
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Masters
thesis.degree.name: Master of Science
Files

Original bundle:
- WEI-DOCUMENT-2015.pdf (5.87 MB, Adobe Portable Document Format)

License bundle:
- PROQUEST_LICENSE.txt (5.84 KB, Plain Text)
- LICENSE.txt (2.6 KB, Plain Text)