Browsing by Author "Mellor-Crummey, John M"
Code Generation for Extreme Scale Parallel Systems
(2017-03-08) Srinivasa Murthy, Karthik; Mellor-Crummey, John M
Power consumption and fabrication limitations play an increasingly significant role in the design of extreme scale parallel systems. These factors are leading system designers to support higher on-node computing capability with throughput-optimized processors instead of latency-optimized processors. However, the inter- and intra-processor communication capabilities of such systems are not increasing at the same rate as on-node computing capability. Consequently, achieving high performance requires careful orchestration of both single- and multiprocessor parallelism. This thesis shows that compiler technology and expressive programming model constructs can help applications exploit both forms of parallelism more effectively. Compilers play an important role in harnessing the short vector parallelism supported by cores in modern processors. Over the last ten years, vector widths have increased dramatically, from the 64-bit vectors supported by Intel's Pentium MMX processor to the 512-bit vectors supported by Intel's Knights Corner processor. However, the vectorization capabilities of state-of-the-art compilers are still immature and fail in the presence of complex control flow and data dependencies. This thesis presents compiler transformations that enable efficient vector parallelism in the presence of common kinds of complex dependencies. To enable efficient multiprocessor parallelism, this thesis develops compiler technology to support sophisticated algorithms that minimize interprocessor communication. The class of .5D communication-avoiding algorithms was developed to address the interprocessor communication bottleneck. Efficiently mapping these algorithms to complex architectures is tedious even for expert programmers.
To address this issue, this thesis presents the Maunam compiler, which generates efficient parallel code from a high-level, global-view sketch of a .5D algorithm expressed using symbolic data sizes and numbers of processors. To mitigate the cost of communication for multiprocessor parallelism, this thesis develops a novel compiler transformation that overlaps communication with computation for systolic computations. Additionally, to help manage the completion of non-blocking communication, this thesis presents two synchronization constructs: cofence and distributed phasers.

Point-to-Point and Barrier Synchronization in Distributed SPMD Systems
(2019-11-08) Milakovic, Srdan; Mellor-Crummey, John M; Sarkar, Vivek; Budimlić, Zoran
Distributed memory programming models are very often the only way to scale up large scientific applications. To ensure correctness and optimal performance in distributed applications, it is necessary to use general, high-level, but efficient synchronization constructs. Implementing distributed applications using one-sided communication libraries is becoming more popular, as opposed to the two-sided communication used in the MPI model. However, in most cases, those libraries support only high-level collective barrier synchronization and low-level point-to-point synchronization. The phaser synchronization construct is an attractive mechanism because it unifies collective and point-to-point synchronization in a simple, easy-to-use, high-level construct. In this thesis, we propose several novel algorithms for phaser synchronization on distributed-memory systems with one-sided communication. We also present several improvements to the distributed barrier algorithms in the OpenSHMEM reference implementation. We establish a very high level of confidence in the correctness of our algorithms by checking them with the SPIN model checker.
We evaluated our phaser algorithm using several benchmark applications on large supercomputers, and we show that using phasers can reduce synchronization time by up to 47% and improve total execution time by up to 26%. This thesis shows that high-level, efficient, and intuitive synchronization is possible on distributed systems with one-sided communication.

Understanding Congestion in High Performance Interconnection Networks Using Sampling
(2018-04-30) Taffet, Philip Adam; Mellor-Crummey, John M
The computational needs of many applications outstrip the capabilities of a single compute node. Communication is necessary to employ multiple nodes, but slow communication often limits application performance on multiple nodes. To improve communication performance, developers need tools that enable them to understand how their application’s communication patterns interact with the network, especially when those interactions result in congestion. Since communication performance is difficult to reason about analytically and simulation is costly, measurement-based approaches are needed. This thesis describes a new sampling-based technique for collecting information about the path a packet takes and the congestion it encounters. Experiments with simulations show that this strategy can distinguish among problems with an application's communication patterns, problems with its mapping onto a parallel system, and outside interference. We describe a variant of this scheme that requires only 5-6 bits of information in a monitored packet, making it practical for use in next-generation networks.