Browsing by Author "Wei, Lai"
Item
Automated Diagnosis of Scalability Losses in Parallel Applications (2020-02-25)
Wei, Lai; Mellor-Crummey, John

Each generation of supercomputers is more powerful than the last in an attempt to keep up with the growing ambition of scientific inquiry. Despite improvements in computational power, however, the performance of many parallel applications has failed to scale. Many factors degrade parallel performance. The need to understand application behavior and pinpoint causes of inefficiency has led to the development of a broad array of tools for measuring and analyzing application performance. These tools generally focus on collecting measurements, attributing them to program source code, and presenting them; responsibility for analyzing and interpreting the measurement data falls to application developers. Profiles generated by performance tools can usually identify the presence of scalability losses, while time series data are generally necessary to pinpoint their root causes. However, manual analysis of time series data can be difficult for executions with many processes, long running times, and deep call chains. To address this problem, we developed an automated framework that analyzes time series of call path samples and presents users with a performance diagnosis of parallel executions. Our framework incurs much lower overhead in both time and space than prior tools that analyze performance using instrumentation-based traces. Its automated diagnosis indicates the symptoms, severity, and causes of scalability losses found in a parallel execution. To support a broad range of parallel applications, the analysis applies to both SPMD and MPMD programs under both flat and hierarchical parallel models. We demonstrate the framework's effectiveness by applying it to time-series measurements of three scientific codes.
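
As a rough illustration of the kind of expectation-based scalability analysis the abstract above describes (this is an editorial sketch, not code from the thesis), the C fragment below flags losses under weak scaling: per-process cost should stay constant as the process count grows, so any excess cost attributed to a call path in the larger run, normalized by that run's total cost, approximates the path's share of the scalability loss. Every call path name and sample count here is hypothetical.

    #include <stdio.h>

    /* Hypothetical record: sample counts attributed to one call path
       in a small-scale and a large-scale run of the same program. */
    typedef struct {
        const char *call_path;   /* e.g. "main>solve>MPI_Allreduce" */
        double samples_small;    /* samples in the small-scale run */
        double samples_large;    /* samples in the large-scale run */
    } path_cost;

    /* Under weak scaling, per-process cost should not grow with the
       process count; any excess in the large run, normalized by that
       run's total cost, is this path's share of the scalability loss. */
    double scaling_loss(const path_cost *p, double total_large)
    {
        double excess = p->samples_large - p->samples_small;
        if (excess < 0.0)
            excess = 0.0;        /* a path that sped up causes no loss */
        return excess / total_large;
    }

    int main(void)
    {
        /* Illustrative numbers only. */
        path_cost paths[] = {
            { "main>solve>compute",       900.0, 910.0 },
            { "main>solve>MPI_Allreduce", 100.0, 390.0 },
        };
        double total_large = 910.0 + 390.0;
        for (int i = 0; i < 2; i++)
            printf("%-28s loss = %4.1f%%\n", paths[i].call_path,
                   100.0 * scaling_loss(&paths[i], total_large));
        return 0;
    }

In this toy example the growth of samples in MPI_Allreduce would surface as the dominant symptom; the thesis automates this kind of attribution over full time series of call path samples rather than aggregate profiles, which is what lets it pinpoint causes rather than only symptoms.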
Item
Autotuning Memory-intensive Software for Node Architectures (2015-05-13)
Wei, Lai; Mellor-Crummey, John; Cooper, Keith; Sarkar, Vivek

Scientific computing plays an important role in scientific research, and supercomputers are built to support the computational needs of large-scale scientific applications. Achieving high performance on today's supercomputers is difficult, in large part because of the complexity of their node architectures, which feature wide-issue instruction-level parallelism, SIMD operations, multiple cores, multiple threads per core, and a deep memory hierarchy. In addition, the growth of compute performance has outpaced the growth of memory bandwidth, making memory bandwidth a scarce resource. Various optimization methods, including tiling and prefetching, have been proposed to make better use of the memory hierarchy. However, because of architectural differences, code hand-tuned for one architecture is not necessarily efficient on others. For that reason, autotuning is often used to tailor high-performance code to different architectures: common practice is to develop a parametric code generator that produces code according to different optimization parameters and then to pick the best among the resulting implementation alternatives for a given architecture. In this thesis, we use tensor transposition, a generalization of matrix transposition, as a motivating example to study the problem of autotuning memory-intensive codes for complex memory hierarchies. We developed a framework that produces optimized parallel tensor transposition code for node architectures. This framework has two components: a rule-based code generation and transformation system that generates code according to specified optimization parameters, and an autotuner that uses static analysis along with empirical autotuning to pick the best implementation scheme. In this work, we studied how to prune the autotuning search space and how to perform run-time code selection using hardware performance counters. Despite the complex memory access patterns of tensor transposition, experiments on two very different architectures show that our approach achieves more than 80% of the bandwidth of optimized memory copies when transposing most tensors. Our results show that autotuning is key to achieving peak performance for memory-intensive codes across different node architectures.
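
To make the memory-hierarchy discussion concrete, the sketch below shows a cache-blocked (tiled) matrix transpose in C; matrix transposition is the two-index special case of the tensor transposition studied in the thesis, and the TILE constant stands in for the kind of parameter a parametric code generator would expose. This is an editorial illustration under those assumptions, not the thesis's generated code.

    #include <stddef.h>

    /* Tile edge length: the tunable optimization parameter. */
    #define TILE 32

    /* Cache-blocked transpose of an n x n matrix: B = A^T.
       Processing one TILE x TILE block at a time keeps both the
       reads of A and the writes of B in a small cache footprint. */
    void transpose_blocked(size_t n, const double *A, double *B)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t j = jj; j < jj + TILE && j < n; j++)
                        B[j * n + i] = A[i * n + j];
    }

With TILE = 32 and 8-byte doubles, the two active blocks together occupy 16 KB, small enough to fit in a typical L1 data cache; that residency is exactly what tiling is meant to buy, and because the best tile size varies across architectures, it is a natural target for autotuning.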
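
In its simplest form, an empirical autotuner just times each candidate implementation and keeps the fastest. The brute-force sweep below tunes only the tile size of a transpose like the one above; the thesis's autotuner goes well beyond this, pruning the search space with static analysis and selecting code at run time using hardware performance counters. All sizes and candidate values here are illustrative.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Same blocked transpose, with the tile size as a run-time
       parameter so one binary can evaluate many candidates. */
    static void transpose_tiled(size_t n, size_t tile,
                                const double *A, double *B)
    {
        for (size_t ii = 0; ii < n; ii += tile)
            for (size_t jj = 0; jj < n; jj += tile)
                for (size_t i = ii; i < ii + tile && i < n; i++)
                    for (size_t j = jj; j < jj + tile && j < n; j++)
                        B[j * n + i] = A[i * n + j];
    }

    int main(void)
    {
        size_t n = 2048;                       /* illustrative size */
        double *A = malloc(n * n * sizeof *A); /* error checks omitted */
        double *B = malloc(n * n * sizeof *B);
        for (size_t i = 0; i < n * n; i++)
            A[i] = (double)i;

        size_t candidates[] = { 8, 16, 32, 64, 128 };
        size_t n_cand = sizeof candidates / sizeof candidates[0];
        size_t best_tile = candidates[0];
        double best_time = 1e30;

        /* Time each candidate tile size and keep the fastest. */
        for (size_t c = 0; c < n_cand; c++) {
            clock_t t0 = clock();
            transpose_tiled(n, candidates[c], A, B);
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("tile %4zu: %.4f s\n", candidates[c], secs);
            if (secs < best_time) {
                best_time = secs;
                best_tile = candidates[c];
            }
        }
        printf("best tile: %zu\n", best_tile);
        free(A);
        free(B);
        return 0;
    }

A production autotuner would repeat each measurement to reduce noise and persist the winner per architecture; the single timing here is only to show the shape of the search loop.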