Browsing by Author "Barik, Rajkishore"
Now showing 1 - 7 of 7
Item: Communication Optimizations for Distributed-Memory X10 Programs (2010-04-10)
Barik, Rajkishore; Budimlić, Zoran; Grove, David; Peshansky, Igor; Sarkar, Vivek; Zhao, Jisheng

X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and communication structures with higher productivity than other distributed-memory programming models. However, this productivity often comes at the cost of high performance overhead when the language is used in its full generality. This paper introduces high-level compiler optimizations and transformations to reduce communication and synchronization overheads in distributed-memory implementations of X10 programs. Specifically, we focus on locality optimizations such as scalar replacement and task localization, combined with supporting transformations such as loop distribution, scalar expansion, loop tiling, and loop splitting. We have completed a prototype implementation of these high-level optimizations and performed a performance evaluation that shows significant improvements in performance, scalability, communication volume, and number of tasks. We evaluated the communication optimizations on three platforms: a 128-node BlueGene/P cluster, a 32-node Nehalem cluster, and a 16-node Power7 cluster. On the BlueGene/P cluster, we observed a maximum performance improvement of 31.46× relative to the unoptimized case (for the MolDyn benchmark). On the Nehalem cluster, we observed a maximum performance improvement of 3.01× (for the NQueens benchmark), and on the Power7 cluster, we observed a maximum performance improvement of 2.73× (for the MolDyn benchmark). In addition, there was no case in which the optimized code was slower than the unoptimized code. We also believe that the optimizations presented in this paper will be necessary for any high-productivity PGAS language based on modern object-oriented principles that is designed for execution on future Extreme Scale systems, which place a high premium on locality improvement for performance and energy efficiency.

Item: Efficient optimization of memory accesses in parallel programs (2010)
Barik, Rajkishore; Sarkar, Vivek

The power, frequency, and memory wall problems have caused a major shift in mainstream computing by introducing processors that contain multiple low-power cores. As multi-core processors become ubiquitous, software trends in both parallel programming languages and dynamic compilation have added new challenges to program compilation for multi-core processors. This thesis proposes a combination of high-level and low-level compiler optimizations to address these challenges. The high-level optimizations introduced in this thesis include new approaches to May-Happen-in-Parallel analysis and Side-Effect analysis for parallel programs, and a novel parallelism-aware Scalar Replacement for Load Elimination transformation. A new Isolation Consistency (IC) memory model is described that permits more scalar replacement opportunities than many existing memory models. The low-level optimizations include a novel approach to register allocation that retains the compile-time and space efficiency of Linear Scan while delivering runtime performance superior to both Linear Scan and Graph Coloring. The allocation phase is modeled as an optimization problem on a Bipartite Liveness Graph (BLG) data structure. The assignment phase focuses on reducing the number of spill instructions by using register-to-register move and exchange instructions wherever possible. Experimental evaluations of our Scalar Replacement for Load Elimination transformation in the Jikes RVM dynamic compiler show decreases in dynamic counts of getfield operations of up to 99.99%, and performance improvements of up to 1.76× on 1 core and 1.39× on 16 cores, compared with the load elimination algorithm available in Jikes RVM. A prototype implementation of our BLG register allocator in Jikes RVM demonstrates runtime performance improvements of up to 3.52× relative to Linear Scan on an x86 processor. When compared to the Graph Coloring register allocator in the GCC compiler framework, our allocator resulted in an execution time improvement of up to 5.8%, with an average improvement of 2.3%, on a POWER5 processor. Based on these experimental evaluations and the foundations presented in this thesis, we believe that the proposed high-level and low-level optimizations are useful in addressing some of the new challenges emerging in the optimization of parallel programs for multi-core architectures.
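Both items above center on the same core transformation: replacing repeated memory accesses with scalar temporaries once analysis proves that safe. The following minimal before/after sketch in Java shows the shape of scalar replacement for load elimination; the class and method names are hypothetical, and the May-Happen-in-Parallel and side-effect analyses that establish safety in the parallel setting are elided.

```java
// Illustrative sketch only; hypothetical names, not the papers' implementation.
class Particle {
    double force;
}

class LoadElimination {
    // Before: each iteration performs a getfield on p.force.
    static double sumForcesBefore(Particle p, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < w.length; i++) {
            sum += p.force * w[i];   // repeated field load in the loop
        }
        return sum;
    }

    // After: the field is loaded once into a scalar temporary. A compiler may
    // apply this only when its analyses prove that no concurrent task can
    // write p.force while the loop runs.
    static double sumForcesAfter(Particle p, double[] w) {
        double t = p.force;          // single load into a scalar temporary
        double sum = 0.0;
        for (int i = 0; i < w.length; i++) {
            sum += t * w[i];
        }
        return sum;
    }
}
```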
Item: Efficient Selection of Vector Instructions using Dynamic Programming (2010-06-17)
Barik, Rajkishore; Sarkar, Vivek; Zhao, Jisheng

Accelerating program performance via SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, VSE, and VSX SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and open-source compilers fall short of using the full potential of vector units and only generate vector code for simple innermost loops. In this paper, we present the design and implementation of an auto-vectorization framework in the backend of a dynamic compiler that not only generates optimized vector code but is also well integrated with the instruction scheduler and register allocator. The framework includes a novel compile-time efficient dynamic programming-based vector instruction selection algorithm for straight-line code that expands opportunities for vectorization in the following ways: (1) scalar packing packs multiple scalar variables into short vectors; (2) shuffle and horizontal vector operations are used judiciously where possible; and (3) algebraic reassociation expands opportunities for vectorization through algebraic simplification. We report performance results on the impact of auto-vectorization on a set of standard numerical benchmarks using the Jikes RVM dynamic compilation environment. Our results show a performance improvement of up to 57.71% on an Intel Xeon processor, compared to non-vectorized execution, with a modest increase in compile time ranging from 0.87% to 9.992%. An investigation of the SIMD parallelization performed by v11.1 of the Intel Fortran Compiler (IFC) on three benchmarks shows that our system achieves speedup with vectorization in all three cases, whereas IFC does not. Finally, a comparison of our approach with an implementation of the Superword Level Parallelism (SLP) algorithm from [21] shows that our approach yields a performance improvement of up to 13.78% relative to SLP.
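To make the dynamic programming flavor of the vector instruction selection concrete, here is a deliberately simplified toy in Java, not the paper's actual cost model: for a chain of independent operations it chooses whether to leave each operation scalar or pack adjacent pairs into 2-wide vector operations, under assumed costs.

```java
// Toy cost-based selection in the spirit of the abstract above; the costs,
// the 2-wide packing model, and all names are illustrative assumptions.
public class VectorSelectToy {
    static final double SCALAR = 1.0;  // cost of one scalar op (assumed)
    static final double VECTOR = 1.2;  // cost of one 2-wide vector op (assumed)
    static final double PACK   = 0.5;  // overhead of packing two scalars (assumed)

    // best[i] = minimal cost of covering the first i independent operations,
    // where any two adjacent operations may be packed into one vector op.
    static double minCost(int n) {
        double[] best = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            best[i] = best[i - 1] + SCALAR;                      // leave op i scalar
            if (i >= 2) {
                best[i] = Math.min(best[i],
                                   best[i - 2] + VECTOR + PACK); // pack ops i-1 and i
            }
        }
        return best[n];
    }

    public static void main(String[] args) {
        // With these costs, packing pays off: 8 ops cost 6.8 instead of 8.0.
        System.out.println(minCost(8));
    }
}
```

The real algorithm works on straight-line code and also weighs shuffles, horizontal operations, and algebraic reassociation, but the recurrence above captures the pack-or-not decision at its core.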
Item: Interprocedural Strength Reduction of Critical Sections in Explicitly-Parallel Programs (2013-05-01)
Barik, Rajkishore; Sarkar, Vivek; Zhao, Jisheng

In this paper, we introduce novel compiler optimization techniques to reduce the number of operations performed in critical sections that occur in explicitly-parallel programs. Specifically, we focus on three code transformations: (1) Partial Strength Reduction (PSR) of critical sections, which replaces critical sections with non-critical sections on certain control-flow paths; (2) Critical Load Elimination (CLE), which replaces memory accesses within a critical section by accesses to scalar temporaries that contain values loaded outside the critical section; and (3) Non-critical Code Motion (NCM), which hoists thread-local computations out of critical sections. The effectiveness of the first two transformations is further increased by interprocedural analysis. The effectiveness of our techniques has been demonstrated for critical-section constructs from three different explicitly-parallel programming models: the isolated construct in Habanero Java (HJ), the synchronized construct in standard Java, and transactions in the Java-based Deuce software transactional memory system. We used two SMP platforms (a 16-core Intel Xeon SMP and a 32-core IBM Power7 SMP) to evaluate our optimizations on 17 explicitly-parallel benchmark programs that span all three models. Our results show that the optimizations introduced in this paper deliver measurable performance improvements that increase in magnitude when a program is run with a larger number of processor cores. These results underscore the importance of optimizing critical sections, and the benefits of such optimizations will continue to grow with the increasing core counts of future many-core processors.
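As a concrete illustration of Non-critical Code Motion, the simplest of the three transformations, here is a minimal before/after sketch on a Java synchronized block. The names and the hash constant are hypothetical; this is not the paper's implementation.

```java
// Illustrative sketch of Non-critical Code Motion on a synchronized block.
class Histogram {
    private final int[] bins = new int[64];
    private final Object lock = new Object();

    // Before: the bucket computation is thread-local work, yet it is
    // performed while holding the lock.
    void addBefore(long key) {
        synchronized (lock) {
            int b = (int) (key * 0x9E3779B97F4A7C15L >>> 58); // thread-local
            bins[b]++;                                        // shared update
        }
    }

    // After NCM: the thread-local computation is hoisted out, shrinking the
    // critical section to just the shared update.
    void addAfter(long key) {
        int b = (int) (key * 0x9E3779B97F4A7C15L >>> 58);
        synchronized (lock) {
            bins[b]++;
        }
    }
}
```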
Item: Portable Programming Models for Heterogeneous Platforms (2015-09-14)
Majeti, Deepak; Sarkar, Vivek; Mellor-Crummey, John; Warburton, Timothy C; Barik, Rajkishore

With the end of Dennard scaling and the emergence of dark silicon, the bets are high on heterogeneous architectures to achieve both application performance and energy efficiency. However, diversity in heterogeneous architectures poses severe programming challenges in terms of data layout, memory coherence, task partitioning, data distribution, and sharing of virtual addresses. Existing high-level programming languages are inadequate to address these new architectural features, since they lack the necessary abstractions to address the challenges mentioned above. It is necessary for existing languages to be extended minimally with high-level constructs while maintaining existing standards of portability, performance, and productivity. The compiler and runtime together must efficiently map these constructs to a target architecture. We develop Concord, a C++-based programming model that extends the Intel Threading Building Blocks onto integrated heterogeneous CPU+GPU architectures in which the CPU and GPU do not share the same virtual address space. Concord supports many C++ features, including virtual functions. We implement Shared Virtual Memory to map applications with pointer-intensive data structures onto heterogeneous architectures that do not share the same virtual address space. We introduce Heterogeneous Habanero-C (H2C), an implementation of the Habanero execution model targeting modern heterogeneous architectures with multiple devices. H2C provides high-level constructs to specify the computation, communication, and synchronization in a given application. The H2C compiler and runtime frameworks efficiently map these high-level constructs onto the underlying heterogeneous hardware. The highlights of H2C include: a data layout framework that generates code with the best data layout for a given memory hierarchy; constructs to specify a task partition, leaving the complex analysis of determining the resulting data distribution to the compiler; and a unified event framework that allows a programmer to write macro data-flow applications for heterogeneous architectures. Experimental results show that Concord and H2C provide good portability, productivity, and performance. In the future, we believe heterogeneous architectures will be more diverse and more pervasive, and that programming systems like H2C and Concord, with their tight integration of language, compiler, and runtime, are the right way to target current and future heterogeneous systems.
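The data layout choice that H2C's framework automates can be illustrated with the classic array-of-structures versus structure-of-arrays trade-off. The sketch below is written in Java for consistency with the other examples (H2C itself extends Habanero-C), and all names are hypothetical.

```java
// AoS vs. SoA: the kind of layout decision a data-layout framework makes
// per memory hierarchy. Illustrative only; not H2C code.
class Layouts {
    // Array-of-Structures: the fields of one point are adjacent in memory.
    static class PointAoS { double x, y; }
    static double sumXAoS(PointAoS[] pts) {
        double s = 0.0;
        for (PointAoS p : pts) s += p.x;   // strided access: skips over y
        return s;
    }

    // Structure-of-Arrays: each field is a contiguous array, so a kernel
    // that touches only x streams through memory at unit stride, which is
    // typically the better layout for GPU-style memory hierarchies.
    static class PointsSoA { double[] x; double[] y; }
    static double sumXSoA(PointsSoA pts) {
        double s = 0.0;
        for (double v : pts.x) s += v;     // unit-stride access
        return s;
    }
}
```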
Item: Register Allocation using Bipartite Liveness Graphs (2010-10-12)
Barik, Rajkishore; Sarkar, Vivek; Zhao, Jisheng

Register allocation is an essential optimization for all compilers. A number of sophisticated register allocation algorithms have been developed based on Graph Coloring (GC) over the years. However, these algorithms pose three major limitations in practice. First, construction of a full interference graph can be a major source of space and time overhead in the compiler. Second, the interference graph lacks information on the program points at which two variables may interfere. Third, the integration of coloring and coalescing leads to a coupling between register allocation and register assignment, which can further compromise the effectiveness of the solution. This paper addresses these limitations by making a clean separation between the register allocation and register assignment phases. Allocation is modeled as an optimization problem on a new data structure called the Bipartite Liveness Graph (BLG). We model the register assignment phase as a separate optimization problem that avoids some spill instructions by generating register-to-register move and exchange instructions and, at the same time, performs move coalescing and handles register class constraints. We implemented our BLG allocator in both the LLVM static compiler infrastructure and the Jikes RVM dynamic compiler infrastructure. In the LLVM evaluation, our BLG register allocator achieves a performance improvement of up to 7.8% on SPEC CPU2006 benchmarks and significantly lower compile-time overhead compared to a Chaitin-Briggs GC register allocator, with both allocators using the same spill-code generator. The BLG allocator delivers performance comparable to the existing LLVM 2.7 Linear Scan register allocator, which includes additional optimizations, such as live-range splitting and backtracking techniques, that are currently not present in the BLG allocator. In the Jikes RVM evaluation, the BLG register allocator delivers runtime performance improvements of up to 30.7% on Java Grande Forum benchmarks and up to 9% on DaCapo benchmarks relative to Linear Scan (LS), with a modest increase in compile time.

Item: Work-First and Help-First Scheduling Policies for Terminally Strict Parallel Programs (2008-11-13)
Barik, Rajkishore; Guo, Yi; Raman, Raghavan; Sarkar, Vivek

Multiple programming models are emerging to address an increased need for dynamic task parallelism in applications for multicore processors and shared-address-space parallel computing. Examples include OpenMP 3.0, Java Concurrency Utilities, the Microsoft Task Parallel Library, Intel Threading Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work stealing, as embodied in Cilk's implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this paper, we address the problem of an efficient and scalable implementation of X10's terminally strict async-finish task parallelism, which is more general than Cilk's fully strict spawn-sync parallelism. We introduce a new work-stealing scheduler with compiler support for async-finish task parallelism that can accommodate both work-first and help-first scheduling policies. Performance results on two different multicore SMP platforms show significant improvements due to our new work-stealing algorithm compared to the existing work-sharing scheduler for X10, and also provide insight into scenarios in which the help-first policy yields better results than the work-first policy and vice versa.
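The policy difference is easiest to see at a single worker's task deque. The sketch below is a schematic contrast in Java, not the X10 runtime's implementation: real work-first scheduling needs the compiler support mentioned above to make continuations stealable, and the stealing machinery itself is omitted.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Schematic contrast of the two spawn policies for one worker's deque.
// Hypothetical names; synchronization and thieves are elided.
class SpawnPolicies {
    final Deque<Runnable> deque = new ArrayDeque<>();

    // Help-first: publish the spawned task for thieves to steal and keep
    // executing the parent's continuation.
    void asyncHelpFirst(Runnable task, Runnable continuation) {
        deque.push(task);         // child task becomes available to other workers
        continuation.run();       // parent continues immediately
    }

    // Work-first: execute the spawned task eagerly and leave the parent's
    // continuation stealable, as in Cilk-style work stealing.
    void asyncWorkFirst(Runnable task, Runnable continuation) {
        deque.push(continuation); // continuation becomes stealable
        task.run();               // worker dives into the child task
    }
}
```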