Computer Science Technical Reports
Browsing Computer Science Technical Reports by Author "Barik, Rajkishore"
Item: Communication Optimizations for Distributed-Memory X10 Programs (2010-04-10)
Authors: Barik, Rajkishore; Budimlić, Zoran; Grove, David; Peshansky, Igor; Sarkar, Vivek; Zhao, Jisheng

X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and communication structures with higher productivity than other distributed-memory programming models. However, this productivity often comes at the cost of high performance overhead when the language is used in its full generality. This paper introduces high-level compiler optimizations and transformations to reduce communication and synchronization overheads in distributed-memory implementations of X10 programs. Specifically, we focus on locality optimizations such as scalar replacement and task localization, combined with supporting transformations such as loop distribution, scalar expansion, loop tiling, and loop splitting. We have completed a prototype implementation of these high-level optimizations and performed a performance evaluation that shows significant improvements in performance, scalability, communication volume, and number of tasks. We evaluated the communication optimizations on three platforms: a 128-node BlueGene/P cluster, a 32-node Nehalem cluster, and a 16-node Power7 cluster. On the BlueGene/P cluster, we observed a maximum performance improvement of 31.46× relative to the unoptimized case (for the MolDyn benchmark). On the Nehalem cluster, we observed a maximum performance improvement of 3.01× (for the NQueens benchmark), and on the Power7 cluster, we observed a maximum performance improvement of 2.73× (for the MolDyn benchmark). In no case was the optimized code slower than the unoptimized code. We also believe that the optimizations presented in this paper will be necessary for any high-productivity PGAS language based on modern object-oriented principles that is designed for execution on future Extreme Scale systems, which place a high premium on locality improvement for performance and energy efficiency.
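To make the locality optimizations above concrete, here is a minimal Java sketch of the scalar-replacement idea, assuming a hypothetical RemoteRef interface whose fetch() models one remote (cross-place) read; the actual optimization operates on X10 places and at-expressions rather than Java:

    // Scalar replacement of a loop-invariant remote read.
    // RemoteRef and fetch() are hypothetical stand-ins for an X10 remote access.
    final class ScalarReplacementSketch {
        interface RemoteRef { double fetch(); } // each call models one message

        // Before: one remote fetch per iteration (a.length messages).
        static double sumBefore(RemoteRef r, double[] a) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * r.fetch(); // remote read inside the loop
            }
            return sum;
        }

        // After: the loop-invariant remote value is scalar-replaced by a
        // temporary loaded once outside the loop (one message in total).
        static double sumAfter(RemoteRef r, double[] a) {
            double t = r.fetch(); // single remote read
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * t;
            }
            return sum;
        }
    }

This rewrite is only sound when the remote value cannot change during the loop, which is the kind of condition a compiler analysis must establish before applying scalar replacement.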
Item: Efficient Selection of Vector Instructions using Dynamic Programming (2010-06-17)
Authors: Barik, Rajkishore; Sarkar, Vivek; Zhao, Jisheng

Accelerating program performance via SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, VSE, and VSX SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and open-source compilers fall short of using the full potential of vector units and generate vector code only for simple innermost loops. In this paper, we present the design and implementation of an auto-vectorization framework in the backend of a dynamic compiler that not only generates optimized vector code but is also well integrated with the instruction scheduler and register allocator. The framework includes a novel compile-time-efficient, dynamic-programming-based vector instruction selection algorithm for straight-line code that expands opportunities for vectorization in the following ways: (1) scalar packing explores opportunities for packing multiple scalar variables into short vectors; (2) shuffle and horizontal vector operations are used judiciously when possible; and (3) algebraic reassociation expands opportunities for vectorization through algebraic simplification. We report performance results on the impact of auto-vectorization on a set of standard numerical benchmarks using the Jikes RVM dynamic compilation environment. Our results show a performance improvement of up to 57.71% on an Intel Xeon processor, compared to non-vectorized execution, with a modest increase in compile time ranging from 0.87% to 9.992%. An investigation of the SIMD parallelization performed by v11.1 of the Intel Fortran Compiler (IFC) on three benchmarks shows that our system achieves speedup with vectorization in all three cases, whereas IFC does not. Finally, a comparison of our approach with an implementation of the Superword Level Parallelism (SLP) algorithm from [21] shows that our approach yields a performance improvement of up to 13.78% relative to SLP.

Item: Interprocedural Strength Reduction of Critical Sections in Explicitly-Parallel Programs (2013-05-01)
Authors: Barik, Rajkishore; Sarkar, Vivek; Zhao, Jisheng

In this paper, we introduce novel compiler optimization techniques to reduce the number of operations performed in critical sections that occur in explicitly-parallel programs. Specifically, we focus on three code transformations: 1) Partial Strength Reduction (PSR) of critical sections, which replaces critical sections by non-critical sections on certain control-flow paths; 2) Critical Load Elimination (CLE), which replaces memory accesses within a critical section by accesses to scalar temporaries that contain values loaded outside the critical section; and 3) Non-critical Code Motion (NCM), which hoists thread-local computations out of critical sections. The effectiveness of the first two transformations is further increased by interprocedural analysis. The effectiveness of our techniques has been demonstrated for critical section constructs from three different explicitly-parallel programming models: the isolated construct in Habanero Java (HJ), the synchronized construct in standard Java, and transactions in the Java-based Deuce software transactional memory system. We used two SMP platforms (a 16-core Intel Xeon SMP and a 32-core IBM Power7 SMP) to evaluate our optimizations on 17 explicitly-parallel benchmark programs that span all three models. Our results show that the optimizations introduced in this paper can deliver measurable performance improvements that increase in magnitude when the program is run with a larger number of processor cores. These results underscore the importance of optimizing critical sections, and the fact that the benefits from such optimizations will continue to grow with increasing core counts in future many-core processors.
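For the dynamic-programming selection algorithm in "Efficient Selection of Vector Instructions using Dynamic Programming" above, the following Java sketch captures the core recurrence: each expression-tree node is costed in a scalar mode and a vector mode, and a pack or unpack penalty is charged whenever a child's mode differs from the mode in which its parent consumes it. The two-mode model and cost constants are illustrative assumptions, not the paper's actual cost functions:

    // Bottom-up DP over an expression tree: best cost per (node, mode).
    import java.util.EnumMap;

    final class VectorSelectionSketch {
        enum Mode { SCALAR, VECTOR }

        static final int SCALAR_OP = 2, VECTOR_OP = 1; // per-op costs (assumed)
        static final int PACK = 3, UNPACK = 3;         // mode-conversion costs (assumed)

        static final class Node {
            final Node left, right;                    // null children mark a leaf
            final EnumMap<Mode, Integer> cost = new EnumMap<>(Mode.class);
            Node(Node l, Node r) { left = l; right = r; }
        }

        // Fill in the best cost of producing this node's value in each mode.
        static void solve(Node n) {
            if (n.left == null) {                      // leaf: a scalar live-in
                n.cost.put(Mode.SCALAR, 0);
                n.cost.put(Mode.VECTOR, PACK);         // must be packed for vector use
                return;
            }
            solve(n.left);
            solve(n.right);
            for (Mode m : Mode.values()) {
                int op = (m == Mode.VECTOR) ? VECTOR_OP : SCALAR_OP;
                n.cost.put(m, op + childCost(n.left, m) + childCost(n.right, m));
            }
        }

        // Child consumed in mode m: its cheaper mode plus any conversion penalty.
        static int childCost(Node c, Mode m) {
            int same = c.cost.get(m);
            int other = c.cost.get(m == Mode.SCALAR ? Mode.VECTOR : Mode.SCALAR)
                      + (m == Mode.SCALAR ? UNPACK : PACK);
            return Math.min(same, other);
        }

        public static void main(String[] args) {
            Node tree = new Node(new Node(null, null), new Node(null, null));
            solve(tree);
            System.out.println("best scalar cost = " + tree.cost.get(Mode.SCALAR)
                             + ", best vector cost = " + tree.cost.get(Mode.VECTOR));
        }
    }

And for the critical-section transformations just described, a minimal before/after sketch using Java's synchronized construct (one of the three models the paper targets); the field and method names are illustrative, and the paper derives such rewrites automatically, guided by interprocedural analysis:

    // Critical Load Elimination (CLE) and Non-critical Code Motion (NCM).
    final class CriticalSectionSketch {
        static final Object LOCK = new Object();
        static int shared;                       // protected by LOCK
        static final int[] table = new int[256]; // read-only after initialization

        // Before: an immutable load and thread-local work done under the lock.
        static void updateBefore(int i, int x) {
            synchronized (LOCK) {
                int t = table[i];           // read-only data loaded inside the lock
                int y = expensiveLocal(x);  // thread-local computation inside the lock
                shared += t + y;
            }
        }

        // After: CLE hoists the immutable load, NCM hoists the thread-local
        // computation; only the genuinely shared update stays critical.
        static void updateAfter(int i, int x) {
            int t = table[i];               // CLE: loaded outside the critical section
            int y = expensiveLocal(x);      // NCM: computed outside the critical section
            synchronized (LOCK) {
                shared += t + y;            // shorter critical section, less contention
            }
        }

        static int expensiveLocal(int x) { return x * x; }
    }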
Item: Register Allocation using Bipartite Liveness Graphs (2010-10-12)
Authors: Barik, Rajkishore; Sarkar, Vivek; Zhao, Jisheng

Register allocation is an essential optimization in all compilers. A number of sophisticated register allocation algorithms based on Graph Coloring (GC) have been developed over the years. However, these algorithms suffer from three major limitations in practice. First, construction of a full interference graph can be a major source of space and time overhead in the compiler. Second, the interference graph lacks information on the program points at which two variables may interfere. Third, the integration of coloring and coalescing leads to a coupling between register allocation and register assignment, which can further compromise the effectiveness of the solution. This paper addresses these limitations by making a clean separation between the register allocation and register assignment phases. Allocation is modeled as an optimization problem on a new data structure called the Bipartite Liveness Graph (BLG). We model the register assignment phase as a separate optimization problem that avoids some spill instructions by generating register-to-register moves and exchange instructions and, at the same time, performs move coalescing and handles register-class constraints. We implemented our BLG allocator in both the LLVM static compiler infrastructure and the Jikes RVM dynamic compiler infrastructure. In the LLVM evaluation, our BLG register allocator delivers a performance improvement of up to 7.8% on the SPEC CPU2006 benchmarks and a significantly lower compile-time overhead than a Chaitin-Briggs GC register allocator, with both allocators using the same spill-code generator. The BLG allocator delivers performance comparable to the existing LLVM 2.7 Linear Scan register allocator, which includes additional optimizations, such as live-range splitting and backtracking, that are currently not present in the BLG allocator. In the Jikes RVM evaluation, the BLG register allocator delivers runtime performance improvements of up to 30.7% for the Java Grande Forum benchmarks and up to 9% for the DaCapo benchmarks relative to Linear Scan (LS), with a modest increase in compile time.

Item: Work-First and Help-First Scheduling Policies for Terminally Strict Parallel Programs (2008-11-13)
Authors: Barik, Rajkishore; Guo, Yi; Raman, Raghavan; Sarkar, Vivek

Multiple programming models are emerging to address an increased need for dynamic task parallelism in applications for multicore processors and shared-address-space parallel computing. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Threading Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work stealing, as embodied in Cilk's implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this paper, we address the problem of efficient and scalable implementation of X10's terminally strict async-finish task parallelism, which is more general than Cilk's fully strict spawn-sync parallelism. We introduce a new work-stealing scheduler with compiler support for async-finish task parallelism that can accommodate both work-first and help-first scheduling policies. Performance results on two different multicore SMP platforms show significant improvements due to our new work-stealing algorithm compared to the existing work-sharing scheduler for X10, and also provide insight into scenarios in which the help-first policy yields better results than the work-first policy, and vice versa.
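To illustrate the allocation/assignment split in "Register Allocation using Bipartite Liveness Graphs" above, the following Java sketch treats liveness as a bipartite relation between variables and program points and greedily un-allocates (spills) the highest-degree variable until register pressure fits at every point. The greedy heuristic and constants are illustrative assumptions; the paper formulates allocation as an optimization problem on the BLG and handles assignment, moves, and coalescing as a separate phase:

    // Allocation only: decide WHICH variables live in registers, not which
    // concrete register each one gets (that is the later assignment phase).
    import java.util.*;

    final class BlgAllocatorSketch {
        public static void main(String[] args) {
            int K = 2;                                   // available registers (assumed)
            // liveAt.get(p) = variables live at program point p (the BLG edges)
            List<Set<String>> liveAt = List.of(
                Set.of("a", "b"),
                Set.of("a", "b", "c"),                   // pressure 3 > K here
                Set.of("b", "c"));

            Set<String> inRegister = new HashSet<>();
            liveAt.forEach(inRegister::addAll);          // optimistically allocate all

            // While some point is over-pressured, spill the variable live at
            // the most over-pressured points (highest degree in the BLG).
            while (true) {
                Map<String, Integer> degree = new HashMap<>();
                boolean over = false;
                for (Set<String> live : liveAt) {
                    Set<String> liveRegs = new HashSet<>(live);
                    liveRegs.retainAll(inRegister);
                    if (liveRegs.size() > K) {
                        over = true;
                        for (String v : liveRegs) degree.merge(v, 1, Integer::sum);
                    }
                }
                if (!over) break;
                String spill = Collections.max(degree.entrySet(),
                        Map.Entry.comparingByValue()).getKey();
                inRegister.remove(spill);                // spill decision; no color chosen
            }
            System.out.println("kept in registers: " + inRegister);
        }
    }

And for the work-first versus help-first policies in the last report, a minimal single-worker Java sketch of the two spawn strategies; the class and method names are illustrative, and in a real work-stealing runtime idle workers would steal from the opposite end of each worker's deque:

    // Contrast of spawn policies: what the worker runs now vs. what it defers.
    import java.util.ArrayDeque;
    import java.util.Deque;

    final class SpawnPolicySketch {
        final Deque<Runnable> deque = new ArrayDeque<>(); // this worker's task deque

        // Work-first (Cilk-style): run the child now, defer the continuation,
        // so the worker follows the sequential execution order.
        void asyncWorkFirst(Runnable child, Runnable continuation) {
            deque.push(continuation);   // continuation becomes available to thieves
            child.run();                // spawning worker executes the child eagerly
        }

        // Help-first: defer the child and keep going, which helps when one
        // task spawns many children (e.g., under a single finish scope).
        void asyncHelpFirst(Runnable child, Runnable continuation) {
            deque.push(child);          // child becomes available to thieves
            continuation.run();         // spawning worker continues past the async
        }

        public static void main(String[] args) {
            SpawnPolicySketch w = new SpawnPolicySketch();
            w.asyncWorkFirst(() -> System.out.println("child runs first"),
                             () -> System.out.println("continuation deferred"));
            w.asyncHelpFirst(() -> System.out.println("child deferred"),
                             () -> System.out.println("continuation runs first"));
            System.out.println("deferred tasks on deque: " + w.deque.size());
        }
    }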