R-3 Repository :: Browsing by Author "Raman, Raghavan"

Browsing by Author "Raman, Raghavan"

Now showing 1 - 5 of 5

Compiler Support for Work-Stealing Parallel Runtime Systems
(2010-03-03) Raman, Raghavan; Zhao, Jisheng; Budimlić, Zoran; Sarkar, Vivek
Multiple programming models are emerging to address an increased need for dynamic task parallelism in multicore shared-memory multiprocessors. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Threading Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work-stealing, as embodied in Cilk’s implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this paper, we focus on the compiler support needed to extend work-stealing for dynamic async-finish task parallelism as supported by X10 and Habanero-Java (HJ). We discuss the compiler support needed for workstealing with both the work-first and help-first policies. Performance results obtained using our compiler and the HJ work-stealing runtime show significant improvement compared to the earlier work-sharing runtime from X10 v1.5. We also propose and implement three optimizations, Dynamic-Loop-Chunking, Redundant-Frame-Store, and Objects-As-Frames, that can be performed in the compiler to improve the code generated for work-stealing schedulers. Performance results show that the Dynamic-Loop-Chunking optimization significantly improves the performance of loop based benchmarks using work-stealing schedulers with work-first policy. The Redundant-Frame-Store optimizations provide a significant reduction in the code size. The results also show that our novel Objects-As-Frames optimization yields performance improvement in many cases. To the best of our knowledge, this is the first implementation of compiler support for work-stealing schedulers for async-finish parallelism with both the work-first and help-first policies and support for task migration.
Compiler support for work-stealing parallel runtime systems
(2009) Raman, Raghavan; Sarkar, Vivek
Multiple programming models are emerging to address an increased need for dynamic task parallelism in applications for multicore processors and shared-address-space parallel computing. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Threading Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work-stealing, as embodied in Cilk's implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this thesis, we focus on the compiler support needed to extend work-stealing for dynamic async-finish task parallelism as in X10 and HabaneroJava (HJ). We also discuss the compiler support needed for work-stealing with both the work-first and help-first policies. Performance results obtained using our compiler and the HJ work-stealing runtime show significant improvement compared to the earlier work-sharing runtime. We then propose and implement two optimizations that can be performed in the compiler to improve the code generated for work-stealing schedulers. Performance results show that the Frame-Store optimizations provide a significant reduction in the code size and the number of frame-store statements executed dynamically, but these reductions do not result in execution time improvements on current multicore systems. We also show that the Objects-As-Frames optimization yields an improvement in performance for small number of threads. Finally, we propose topics for future work which include extending work-stealing for additional language constructs as well as new optimizations.
Dynamic Data Race Detection for Structured Parallelism
(2013-07-24) Raman, Raghavan; Sarkar, Vivek; Mellor-Crummey, John; Zhong, Lin
With the advent of multicore processors and an increased emphasis on parallel computing, parallel programming has become a fundamental requirement for achieving available performance. Parallel programming is inherently hard because, to reason about the correctness of a parallel program, programmers have to consider large numbers of interleavings of statements in different threads in the program. Though structured parallelism imposes some restrictions on the programmer, it is an attractive approach because it provides useful guarantees such as deadlock-freedom. However, data races remain a challenging source of bugs in parallel programs. Data races may occur only in few of the possible schedules of a parallel program, thereby making them extremely hard to detect, reproduce, and correct. In the past, dynamic data race detection algorithms have suffered from at least one of the following limitations: some algorithms have a worst-case linear space and time overhead, some algorithms are dependent on a specific scheduling technique, some algorithms generate false positives and false negatives, some have no empirical evaluation as yet, and some require sequential execution of the parallel program. In this thesis, we introduce dynamic data race detection algorithms for structured parallel programs that overcome past limitations. We present a race detection algorithm called ESP-bags that requires the input program to be executed sequentially and another algorithm called SPD3 that can execute the program in parallel. While the ESP-bags algorithm addresses all the above mentioned limitations except sequential execution, the SPD3 algorithm addresses the issue of sequential execution by scaling well across highly parallel shared memory multiprocessors. Our algorithms incur constant space overhead per memory location and time overhead that is independent of the number of processors on which the programs execute. Our race detection algorithms support a rich set of parallel constructs (including async, finish, isolated, and future) that are found in languages such as HJ, X10, and Cilk. Our algorithms for async, finish, and future are precise and sound for a given input. In the presence of isolated, our algorithms are precise but not sound. Our experiments show that our algorithms (for async, finish, and isolated) perform well in practice, incurring an average slowdown of under 3x over the original execution time on a suite of 15 benchmarks. SPD3 is the first practical dynamic race detection algorithm for async-finish parallel programs that can execute the input program in parallel and use constant space per memory location. This takes us closer to our goal of building dynamic data race detectors that can be "always-on" when developing parallel applications.
Scalable and Precise Dynamic Datarace Detection for Structured Parallelism
(2012-07-06) Raman, Raghavan; Zhao, Jisheng; Sarkar, Vivek; Vechev, Martin; Yahav , Eran
Existing dynamic race detectors suffer from at least one of the following three limitations: i) space overhead per memory location grows linearly with the number of parallel threads [13], severely limiting the parallelism that the algorithm can handle. (ii) sequentialization: the parallel program must be processed in a sequential order, usually depth-first [12, 24]. This prevents the analysis from scaling with available hardware parallelism, inherently limiting its performance. (iii) inefficiency: even though race detectors with good theoretical complexity exist, they do not admit efficient implementations and are unsuitable for practical use [4, 18]. We present a new precise dynamic race detector that leverages structured parallelism in order to address these limitations. Our algorithm requires constant space per memory location, works in parallel, and is efficient in practice. We implemented and evaluated our algorithm on a set of 15 benchmarks. Our experimental results indicate an average (geometric mean) slowdown of 2.78× on a 16core SMP system.
Work-First and Help-First Scheduling Policies for Terminally Strict Parallel Programs
(2008-11-13) Barik, Rajkishore; Guo, Yi; Raman, Raghavan; Sarkar, Vivek
Multiple programming models are emerging to address an increased need for dynamic task parallelism in applications for multicore processors and shared-address space parallel computing. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Thread Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work stealing, as embodied in Cilk’s implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this paper, we address the problem of efficient and scalable implementation of X10’s terminally strict async-finish task parallelism, which is more general than Cilk’s fully strict spawn-sync parallelism. We introduce a new work-stealing scheduler with compiler support for async-finish task parallelism that can accommodate both work-first and help-first scheduling policies. Performance results on two different multicore SMP platforms show significant improvements due to our new work-stealing algorithm compared to the existing work-sharing scheduler for X10, and also provide insight on scenarios in which the help-first policy yields better results than the work-first policy and vice versa.

Browsing by Author "Raman, Raghavan"

Results Per Page

Sort Options