Browsing by Author "Sarkar, Vivek"
A Hierarchical Region-Based Static Single Assignment Form (2009-12-14)
Sarkar, Vivek; Zhao, Jisheng
Modern compilation systems face the challenge of incrementally reanalyzing a program’s intermediate representation each time a code transformation is performed. Current approaches typically either re-analyze the entire program after an individual transformation or limit the analysis information that is available after a transformation. To address both efficiency and precision goals in an optimizing compiler, we introduce a hierarchical static single-assignment form called Region Static Single-Assignment (Region-SSA) form. Static single assignment (SSA) form is an efficient intermediate representation that is well suited for solving many data flow analysis and optimization problems. By partitioning the program into hierarchical regions, Region-SSA form maintains a local SSA form for each region. Region-SSA supports a demand-driven re-computation of SSA form after a transformation is performed, since only the updated region’s SSA form needs to be reconstructed along with a potential propagation of exposed defs and uses. In this paper, we introduce the Region-SSA data structure, and present algorithms for construction and incremental reconstruction of Region-SSA form. The Region-SSA data structure includes a tree-based region hierarchy, a region-based control flow graph, and region-based SSA forms. We have implemented Region-SSA form in the Habanero-Java (HJ) research compiler. Our experimental results show significant improvements in compile-time compared to traditional approaches that recompute the entire procedure’s SSA form exhaustively after transformation. For loop unrolling transformations, compile-time speedups up to 35.8× were observed using Region-SSA form relative to standard SSA form. For loop interchange transformations, compile-time speedups up to 205.6× were observed.
We believe that Region-SSA form is an attractive foundation for future compiler frameworks that need to incorporate sophisticated incremental program analyses and transformations.

A Scalable Locality-aware Adaptive Work-stealing Scheduler for Multi-core Task Parallelism (2011)
Guo, Yi; Sarkar, Vivek
Recent trends have made it clear that processor makers are committed to multicore chip designs. The number of cores per chip is increasing, while there is little or no increase in the clock speed. This parallelism trend poses a significant and urgent challenge for computer software because programs have to be written or transformed into a multi-threaded form to take full advantage of future hardware advances. Task parallelism has been identified as one of the prerequisites for software productivity. In task parallelism, programmers focus on decomposing the problem into subcomputations that can run in parallel and leave the compiler and runtime to handle the scheduling details. This separation of concerns between task decomposition and scheduling provides productivity to the programmer but poses challenges to the runtime scheduler. Our thesis is that work-stealing schedulers with adaptive scheduling policies and locality-awareness can provide a scalable and robust runtime foundation for multicore task parallelism. We evaluate our thesis using the new Scalable Locality-aware Adaptive Work-stealing (SLAW) runtime scheduler developed for the Habanero-Java programming language, a task-parallel variant of Java. SLAW's adaptive task scheduling is motivated by the study of two common scheduling policies in a work-stealing scheduler, specifically, the work-first and the help-first policies. Both policies exhibit limitations in performance and resource usage in different situations. The variances make it hard to determine the best policy a priori.
SLAW addresses these limitations by supporting both policies simultaneously and selecting policies adaptively on a per-task basis at runtime. Our results show that SLAW achieves 0.98x to 9.2x speedup over the help-first scheduler and 0.97x to 4.5x speedup over the work-first scheduler. Further, for large irregular parallel computations, SLAW supports data sizes and achieves performance that cannot be delivered by the use of any single fixed policy. SLAW's locality-aware scheduling framework aims to overcome the cache unfriendliness of work-stealing due to randomized stealing. The SLAW scheduler is designed for programming models where locality hints are provided to the runtime by the programmer or compiler. Our results show that locality-aware scheduling can improve performance by increasing temporal data reuse for iterative data-parallel applications.

Accelerated Plane-wave Discontinuous Galerkin for Heterogeneous Scattering Problems (2015-04-23)
Atcheson, Thomas Reid; Warburton, Timothy; Symes, William; Sorensen, Dan; Sarkar, Vivek
This thesis considers algorithmic and computational acceleration of numerical wave modelling at high frequencies. Numerical propagation of linear waves at high frequencies poses a significant challenge to modern simulation techniques. Although the potential practical benefits have drawn a great deal of attention to this problem, current research has yet to provide a general and performant method to solve it. I consider the finite element method as a possible solution because it can handle the geometric complexity of heterogeneous domains, but unfortunately it suffers from the “pollution” effect, which imposes a prohibitively large memory requirement to handle high frequencies. One recent step towards enabling the finite element method to solve high frequency wave propagation in the frequency domain involves using a plane-wave basis rather than the standard polynomial basis.
This allows highly compressed representations of scattering waves but otherwise appeared to limit users to nearly-homogeneous problems. This thesis explores the use of plane-waves in a discontinuous Galerkin method (PWDG) for highly heterogeneous problems possibly containing a point source. The low-memory nature of PWDG and the fact that its expressions can be computed in an entirely symbolic manner without quadratures furthermore permit an efficient graphics processing unit (GPU) implementation such that problems with very high frequencies can be solved on a single workstation. This thesis includes computational results for frequencies in excess of 100 hertz on the Marmousi model, solved using only a single GPU.

Array optimizations for high productivity programming languages (2009)
Joyner, Mackale; Kennedy, Ken; Sarkar, Vivek; Budimlic, Zoran
While the HPCS languages (Chapel, Fortress and X10) have introduced improvements in programmer productivity, several challenges still remain in delivering high performance. In the absence of optimization, the high-level language constructs that improve productivity can result in order-of-magnitude runtime performance degradations. This dissertation addresses the problem of efficient code generation for high-level array accesses in the X10 language. The X10 language supports rank-independent specification of loop and array computations using regions and points. Three aspects of high-level array accesses in X10 are important for productivity but also pose significant performance challenges: high-level accesses are performed through Point objects rather than integer indices, variables containing references to arrays are rank-independent, and array subscripts are verified as legal array indices during runtime program execution. Our solution to the first challenge is to introduce new analyses and transformations that enable automatic inlining and scalar replacement of Point objects.
Our solution to the second challenge is a hybrid approach. We use an interprocedural rank analysis algorithm to automatically infer ranks of arrays in X10. We use rank analysis information to enable storage transformations on arrays. If rank-independent array references still remain after compiler analysis, the programmer can use X10's dependent type system to safely annotate array variable declarations with additional information for the rank and region of the variable, and to enable the compiler to generate efficient code in cases where the dependent type information is available. Our solution to the third challenge is to use a new interprocedural array bounds analysis approach using regions to automatically determine when runtime bounds checks are not needed. Our performance results show that our optimizations deliver performance that rivals the performance of hand-tuned code with explicit rank-specific loops and lower-level array accesses, and is up to two orders of magnitude faster than unoptimized, high-level X10 programs. These optimizations also result in scalability improvements of X10 programs as we increase the number of CPUs. While we perform the optimizations primarily in X10, these techniques are applicable to other high-productivity languages such as Chapel and Fortress.

Automatic Detection of Inter-application Permission Leaks in Android Applications (2013-01-23)
Burke, Michael G.; Guarnieri, Salvatore; Pistoia, Marco; Sarkar, Vivek; Sbîrlea, Dragoș
Due to their growing prevalence, smartphones can access an increasing amount of sensitive user information. To better protect this information, modern mobile operating systems provide permission-based security, which restricts applications to only access a clearly defined subset of system APIs and user data.
The Android operating system builds upon already successful permission systems, but complements them by allowing application components to be reused within and across applications through a single communication mechanism, called the Intent mechanism. In this paper we identify three types of inter-application Intent-based attacks that rely on information flows in applications to obtain unauthorized access to permission-protected information. Two of these attacks are of previously known types: confused deputy and permission collusion attacks. The third attack, private activity invocation, is new and relies on the existence of difficult-to-detect misconfigurations introduced because Intents can be used for both intra-application and inter-application communication. Such misconfigured applications allow protected information meant for intra-application communication to leak into unauthorized applications. This breaks a fundamental security guarantee of permission systems: that applications can only access information if they own the corresponding permission. We formulate the detection of the vulnerabilities on which these attacks rely as a static taint propagation problem based on rules. We show that the rules describing the permission-protected information can be automatically generated through static analysis of the Android libraries, an improvement over previous work. To test our approach we built Permission Flow, a tool that can reliably and accurately identify the presence of vulnerable information flows in Android applications. Our automated analysis of popular applications found that 56% of the top 313 Android applications actively use inter-component information flows; by ensuring the absence of inter-application permission leaks, the proposed analysis would be highly beneficial to the Android ecosystem.
Of the tested applications, Permission Flow found four exploitable vulnerabilities.

Autotuning Memory-intensive Software for Node Architectures (2015-05-13)
Wei, Lai; Mellor-Crummey, John; Cooper, Keith; Sarkar, Vivek
Today, scientific computing plays an important role in scientific research. People build supercomputers to support the computational needs of large-scale scientific applications. Achieving high performance on today's supercomputers is difficult, in large part due to the complexity of the node architectures, which include wide-issue instruction-level parallelism, SIMD operations, multiple cores, multiple threads per core, and a deep memory hierarchy. In addition, growth of compute performance has outpaced the growth of memory bandwidth, making memory bandwidth a scarce resource. People have proposed various optimization methods, including tiling and prefetching, to make better use of the memory hierarchy. However, due to architectural differences, code hand-tuned for one architecture is not necessarily efficient for others. For that reason, autotuning is often used to tailor high-performance code for different architectures. Common practice is to develop a parametric code generator that generates code according to different optimization parameters and then picks the best among various implementation alternatives for a given architecture. In this thesis, we use tensor transposition, a generalization of matrix transposition, as a motivating example to study the problem of autotuning memory-intensive codes for complex memory hierarchies. We developed a framework to produce optimized parallel tensor transposition code for node architectures. This framework has two components: a rule-based code generation and transformation system that generates code according to specified optimization parameters, and an autotuner that uses static analysis along with empirical autotuning to pick the best implementation scheme.
In this work, we studied how to prune the autotuning search space and perform run-time code selection using hardware performance counters. Despite the complex memory access patterns of tensor transposition, experiments on two very different architectures show that our approach achieves more than 80% of the bandwidth of optimized memory copies when transposing most tensors. Our results show that autotuning is the key to achieving peak application performance across different node architectures for memory-intensive codes.

BMS-CnC: Bounded Memory Scheduling of Dynamic Task Graphs (2013-10-24)
Budimlić, Zoran; Sarkar, Vivek; Sbîrlea, Dragoș
It is now widely recognized that increased levels of parallelism are a necessary condition for improved application performance on multicore computers. However, as the number of cores increases, the memory-per-core ratio is expected to further decrease, making per-core memory efficiency of parallel programs an even more important concern in future systems. For many parallel applications, the memory requirements can be significantly larger than for their sequential counterparts and, more importantly, their memory utilization depends critically on the schedule used when running them. To address this problem we propose bounded memory scheduling (BMS) for parallel programs expressed as dynamic task graphs, in which an upper bound is imposed on the program’s peak memory. Using the inspector/executor model, BMS tailors the set of allowable schedules to either guarantee that the program can be executed within the given memory bound, or throw an error during the inspector phase without running the computation if no feasible schedule can be found. Since solving BMS is NP-hard, we propose an approach in which we first use our heuristic algorithm, and if it fails we fall back on a more expensive optimal approach which is sped up by the best-effort result of the heuristic.
Through evaluation on seven benchmarks, we show that BMS gracefully spans the spectrum between fully parallel and serial execution with decreasing memory bounds. Comparison with OpenMP shows that BMS-CnC can execute in 53% of the memory required by OpenMP while running at 90% (or more) of OpenMP’s performance.

Communication Optimizations for Distributed-Memory X10 Programs (2010-04-10)
Barik, Rajkishore; Budimlić, Zoran; Grove, David; Peshansky, Igor; Sarkar, Vivek; Zhao, Jisheng
X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and communication structures with higher productivity than other distributed-memory programming models. However, this productivity often comes at the cost of high performance overhead when the language is used in its full generality. This paper introduces high-level compiler optimizations and transformations to reduce communication and synchronization overheads in distributed-memory implementations of X10 programs. Specifically, we focus on locality optimizations such as scalar replacement and task localization, combined with supporting transformations such as loop distribution, scalar expansion, loop tiling, and loop splitting. We have completed a prototype implementation of these high-level optimizations, and performed a performance evaluation that shows significant improvements in performance, scalability, communication volume and number of tasks. We evaluated the communication optimizations on three platforms: a 128-node BlueGene/P cluster, a 32-node Nehalem cluster, and a 16-node Power7 cluster. On the BlueGene/P cluster, we observed a maximum performance improvement of 31.46× relative to the unoptimized case (for the MolDyn benchmark).
On the Nehalem cluster, we observed a maximum performance improvement of 3.01× (for the NQueens benchmark) and on the Power7 cluster, we observed a maximum performance improvement of 2.73× (for the MolDyn benchmark). In addition, there was no case in which the optimized code was slower than the unoptimized case. We also believe that the optimizations presented in this paper will be necessary for any high-productivity PGAS language based on modern object-oriented principles that is designed for execution on future Extreme Scale systems that place a high premium on locality improvement for performance and energy efficiency.

Compiler Support for Work-Stealing Parallel Runtime Systems (2010-03-03)
Raman, Raghavan; Zhao, Jisheng; Budimlić, Zoran; Sarkar, Vivek
Multiple programming models are emerging to address an increased need for dynamic task parallelism in multicore shared-memory multiprocessors. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Threading Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work-stealing, as embodied in Cilk’s implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this paper, we focus on the compiler support needed to extend work-stealing for dynamic async-finish task parallelism as supported by X10 and Habanero-Java (HJ). We discuss the compiler support needed for work-stealing with both the work-first and help-first policies. Performance results obtained using our compiler and the HJ work-stealing runtime show significant improvement compared to the earlier work-sharing runtime from X10 v1.5. We also propose and implement three optimizations, Dynamic-Loop-Chunking, Redundant-Frame-Store, and Objects-As-Frames, that can be performed in the compiler to improve the code generated for work-stealing schedulers.
Performance results show that the Dynamic-Loop-Chunking optimization significantly improves the performance of loop-based benchmarks using work-stealing schedulers with the work-first policy. The Redundant-Frame-Store optimizations provide a significant reduction in the code size. The results also show that our novel Objects-As-Frames optimization yields performance improvement in many cases. To the best of our knowledge, this is the first implementation of compiler support for work-stealing schedulers for async-finish parallelism with both the work-first and help-first policies and support for task migration.

Compiler support for work-stealing parallel runtime systems (2009)
Raman, Raghavan; Sarkar, Vivek
Multiple programming models are emerging to address an increased need for dynamic task parallelism in applications for multicore processors and shared-address-space parallel computing. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Threading Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work-stealing, as embodied in Cilk's implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this thesis, we focus on the compiler support needed to extend work-stealing for dynamic async-finish task parallelism as in X10 and Habanero-Java (HJ). We also discuss the compiler support needed for work-stealing with both the work-first and help-first policies. Performance results obtained using our compiler and the HJ work-stealing runtime show significant improvement compared to the earlier work-sharing runtime. We then propose and implement two optimizations that can be performed in the compiler to improve the code generated for work-stealing schedulers.
Performance results show that the Frame-Store optimizations provide a significant reduction in the code size and the number of frame-store statements executed dynamically, but these reductions do not result in execution time improvements on current multicore systems. We also show that the Objects-As-Frames optimization yields an improvement in performance for small numbers of threads. Finally, we propose topics for future work, which include extending work-stealing for additional language constructs as well as new optimizations.

Cooperative Execution of Parallel Tasks with Synchronization Constraints (2015-10-14)
Imam, Shams Mahmood; Sarkar, Vivek; Mellor-Crummey, John; Chaudhuri, Swarat; Zhong, Lin
The topic of this thesis is the effective execution of parallel applications on emerging multicore and manycore systems in the presence of modern synchronization and coordination constraints. Synchronization and coordination can contribute significant productivity and performance overheads to the development and execution of parallel programs. Higher-level programming models, such as the Task Parallel Model and Actor Model, provide abstractions that can be used to simplify writing parallel programs, in contrast to lower-level programming models that directly expose locks, threads and processes. However, these higher-level models often lack efficient support for general synchronization patterns that are necessary for a wide range of applications. Many modern synchronization and coordination constructs in parallel programs can incur significant performance overheads on current runtime systems, or significant productivity overheads when the programmer is forced to complicate their code to mitigate these performance overheads. We believe that cooperation between the programmer and the runtime system is necessary to reduce the parallel overhead and to execute the available parallelism efficiently in the presence of synchronization constraints.
In a cooperative approach, an executing entity yields control to other entities at well-defined points during its execution. This thesis shows that the use of cooperative techniques is critical to performance and scalability of certain parallel programming models, especially in the presence of modern synchronization and coordination constraints such as asynchronous tasks, futures, phasers, data-driven tasks, and actors. In particular, we focus on cooperative extensions and runtimes for the async-finish Task Parallel Model and the Actor Model in this thesis. Our work shows that cooperative techniques simplify programmability and deliver significant performance improvements by reducing the overhead in modern parallel programming models.

Data Race Detection for Event-Driven Parallel Runtime Systems (2017-12-01)
Yu, Lechen; Sarkar, Vivek
Event-Driven Parallel (EDP) runtime systems (or more simply, EDP runtimes) are growing in popularity in the high-performance computing area because they provide a promising foundation for new programming systems that can support heterogeneous architectures and ever-increasing hardware complexity. EDP runtimes allow the programmer to focus on program logic, such as control and data dependences, thereby enabling portability across a wide range of platforms and system configurations. However, the applications written on top of EDP runtimes remain vulnerable to data races. Existing data race detection tools either do not support the primitives in EDP runtimes, or incur intractably large overheads by failing to utilize the structural information available in event-driven programs. In this dissertation, we propose a graph-traversal based data race detection method for EDP runtimes. It introduces a reachability graph, which encodes the dependences in a program, to check the happens-before relation between memory accesses.
In order to reduce the time complexity for race detection, we propose a few optimizations, such as a reachability cache and a reversed reachability graph to avoid unnecessary graph traversals, and path compression to reduce the number of steps performed for graph traversal. Based on our race detection technique, we have developed a prototype implementation for the Open Community Runtime (OCR). Our evaluation on a set of open source OCR benchmarks shows that our tool handles all OCR constructs, and that the time overhead for race detection is comparable to that of past work on race detection for more constrained (e.g., fork-join) runtimes.

Debugging, Repair, and Synthesis of Task-Parallel Programs (2017-03-08)
Surendran, Rishi; Sarkar, Vivek
Parallelizing sequential programs to effectively utilize modern multicore architectures is a key challenge facing application developers and domain experts. Therefore, there is a pressing need to create tools that aid programmers in developing correct and efficient parallel programs. In this thesis, we present algorithms for debugging, repairing, and synthesizing task-parallel programs that can provide a foundation for creating such tools. Our work focuses on task-parallel programs with both imperative async-finish parallelism and functional-style parallelism using futures. First, we address the problem of detecting races in parallel programs with async, finish and future constructs. Existing dynamic determinacy race detectors for task-parallel programs are limited to programs with strict computation graphs in which a task can only wait for some subset of its descendant tasks to complete. In this thesis, we present the first known determinacy race detector for non-strict computation graphs generated using futures. The space and time complexity of our algorithm degenerate to those of the classical SP-bags algorithm when using only structured parallel constructs such as spawn-sync and async-finish.
In the presence of point-to-point synchronization using futures, the complexity of the algorithm increases by a factor determined by the number of future task creation and get operations as well as the number of non-tree edges in the computation graph. Next, we introduce a hybrid static+dynamic test-driven approach to repairing data races in task-parallel programs. Past solutions to the problem of repairing parallel programs have used static-only or dynamic-only approaches, both of which incur significant limitations in practice. Static approaches can guarantee soundness in many cases but are limited in precision when analyzing medium or large-scale software with accesses to pointer-based data structures in multiple procedures. Dynamic approaches are more precise, but their proposed repairs are limited to a single input and are not reflected back in the original source program. Our approach includes a novel coupling between static and dynamic analyses. First, we execute the program on a concrete test input and determine the set of data races for this input dynamically. Next, we compute a set of static "finish" placements that repairs these races and also respects the static scoping rules of the program while maximizing parallelism. Finally, we introduce a novel approach to automatically synthesize task-parallel programs with futures from sequential programs through identification of pure method calls. Our approach is built on three new techniques to address the challenge of automatic parallelization via future synthesis: candidate future synthesis, parallelism benefit analysis, and threshold expression synthesis. In candidate future synthesis, our system annotates pure method calls as async expressions and synthesizes a parallel program with future objects and their type declarations that are more precise than those from past work. 
Next, the system performs a novel parallelism benefit analysis to determine which async expressions may need to be executed sequentially due to overhead reasons, based on execution profile information collected from multiple test inputs. Finally, threshold expression synthesis uses the output from parallelism benefit analysis to synthesize predicate expressions that can be used to determine at runtime if a specific pure method call should be executed sequentially or in parallel. These algorithms have been implemented and evaluated on a range of benchmark programs. The evaluation establishes the effectiveness of our approach with respect to dynamic data race detection overhead, compile-time overhead, and precision and performance of the repaired and synthesized code.

Design Space Exploration of Parallel Algorithms and Architectures for Wireless Communication and Mobile Computing Systems (2014-10-30)
Wang, Guohui; Cavallaro, Joseph R.; Sarkar, Vivek; Zhong, Lin; Juntti, Markku
Over the past several years, modern mobile SoC (system-on-chip) chipsets have started to incorporate in a single chip the functionality of several general purpose processors and application-specific accelerators to reduce the cost, the power consumption and the communication overhead. Given the ever-growing performance requirements and strict power constraints, the existence of different types of signal processing workloads has posed challenges to the mapping of computationally intensive algorithms to the heterogeneous architecture of the mobile SoCs. Many such signal processing workloads, such as channel decoding for wireless communication modems and mobile computer vision applications, have high computational complexity, which requires accelerators implemented with parallel algorithms and architectures to meet the performance requirements.
Partitioning the workloads and deploying them on the appropriate components of mobile chipsets are crucial to fully utilize the mobile SoC's heterogeneous architecture. The goal of this thesis is to study parallel algorithms and architectures of high performance signal processing accelerators for several representative application workloads in wireless communication and mobile computing systems. We explore the design space of the parallel algorithms and architectures and highlight the workload partitioning and architecture-aware optimization schemes, including algorithmic optimization, data structure optimization, and memory access optimization, to improve the throughput performance and hardware (or energy) efficiency. As case studies, we first propose a contention-free interleaver architecture for parallel turbo decoding, which enables a high throughput multi-standard turbo decoding ASIC (application-specific integrated circuit) with efficient hardware. Secondly, we propose a massively parallel LDPC (low-density parity-check) decoding algorithm and implementation using a GPU (graphics processing unit), which leads to high throughput and low latency LDPC decoding for practical SDR (software-defined radio) systems. Furthermore, we take advantage of the heterogeneous mobile CPU and GPU to accelerate representative mobile computer vision algorithms such as image editing and local feature extraction algorithms. Based on algorithm analysis and experimental results from the above case studies, we finally explore the design space and compare the performance of accelerator architectures for wireless communication and mobile vision use cases.
We will show that the heterogeneous architecture of mobile systems is the key to efficiently accelerating parallel algorithms in order to meet the growing requirements of performance, efficiency, and flexibility.
Item Distributed Communication Middleware for an Selector Model (2017-08-10) Xue, Bing; Sarkar, Vivek
The problem sizes that the community is dealing with today in both scientific research and day-to-day computing exceed the capacity of modern shared-memory systems. With the increasing prevalence of powerful multi-core/heterogeneous processors on portable devices and cloud computing clusters, the demand for mainstream programming models supporting scalable, portable and extensible distributed computing is also rapidly growing. In this dissertation, we present two distributed programming runtime libraries based on the distributed selector model: the cluster-based Habanero Java Distributed Selector (HJDS) and the mobile-platform-based Distributed Actor Model for Mobile Platforms (DAMMP), which extends the HJDS implementation. This work focuses on enabling distributed message passing by building the communication middleware for an actor/selector model. On clusters, it supports a fully actor-based runtime communication layer; on mobile devices, it provides a highly decoupled and customizable communication middleware and publish-subscribe enabled application-level runtime event handling. Together these address the need for an easy-to-use, portable, reusable and scalable framework for small to medium sized distributed applications. 
We demonstrate the scalability of computationally intensive applications using distributed cluster-based and mobile-based platforms, and discuss the future steps for expanding the HJDS and DAMMP frameworks.
Item Dynamic Data Race Detection for Structured Parallelism (2013-07-24) Raman, Raghavan; Sarkar, Vivek; Mellor-Crummey, John; Zhong, Lin
With the advent of multicore processors and an increased emphasis on parallel computing, parallel programming has become a fundamental requirement for achieving available performance. Parallel programming is inherently hard because, to reason about the correctness of a parallel program, programmers have to consider large numbers of interleavings of statements in different threads in the program. Though structured parallelism imposes some restrictions on the programmer, it is an attractive approach because it provides useful guarantees such as deadlock-freedom. However, data races remain a challenging source of bugs in parallel programs. Data races may occur in only a few of the possible schedules of a parallel program, thereby making them extremely hard to detect, reproduce, and correct. In the past, dynamic data race detection algorithms have suffered from at least one of the following limitations: some algorithms have a worst-case linear space and time overhead, some algorithms are dependent on a specific scheduling technique, some algorithms generate false positives and false negatives, some have no empirical evaluation as yet, and some require sequential execution of the parallel program. In this thesis, we introduce dynamic data race detection algorithms for structured parallel programs that overcome past limitations. We present a race detection algorithm called ESP-bags that requires the input program to be executed sequentially and another algorithm called SPD3 that can execute the program in parallel. 
While the ESP-bags algorithm addresses all the above-mentioned limitations except sequential execution, the SPD3 algorithm addresses the issue of sequential execution by scaling well across highly parallel shared memory multiprocessors. Our algorithms incur constant space overhead per memory location and time overhead that is independent of the number of processors on which the programs execute. Our race detection algorithms support a rich set of parallel constructs (including async, finish, isolated, and future) that are found in languages such as HJ, X10, and Cilk. Our algorithms for async, finish, and future are precise and sound for a given input. In the presence of isolated, our algorithms are precise but not sound. Our experiments show that our algorithms (for async, finish, and isolated) perform well in practice, incurring an average slowdown of under 3x over the original execution time on a suite of 15 benchmarks. SPD3 is the first practical dynamic race detection algorithm for async-finish parallel programs that can execute the input program in parallel and use constant space per memory location. This takes us closer to our goal of building dynamic data race detectors that can be "always-on" when developing parallel applications.
Item Efficient optimization of memory accesses in parallel programs (2010) Barik, Rajkishore; Sarkar, Vivek
The power, frequency, and memory wall problems have caused a major shift in mainstream computing by introducing processors that contain multiple low-power cores. As multi-core processors are becoming ubiquitous, software trends in both parallel programming languages and dynamic compilation have added new challenges to program compilation for multi-core processors. This thesis proposes a combination of high-level and low-level compiler optimizations to address these challenges. 
The high-level optimizations introduced in this thesis include new approaches to May-Happen-in-Parallel analysis and Side-Effect analysis for parallel programs and a novel parallelism-aware Scalar Replacement for Load Elimination transformation. A new Isolation Consistency (IC) memory model is described that permits several additional scalar replacement transformation opportunities compared to many existing memory models. The low-level optimizations include a novel approach to register allocation that retains the compile time and space efficiency of Linear Scan, while delivering runtime performance superior to both Linear Scan and Graph Coloring. The allocation phase is modeled as an optimization problem on a Bipartite Liveness Graph (BLG) data structure. The assignment phase focuses on reducing the number of spill instructions by using register-to-register move and exchange instructions wherever possible. Experimental evaluations of our scalar replacement for load elimination transformation in the Jikes RVM dynamic compiler show decreases in dynamic counts for getfield operations of up to 99.99%, and performance improvements of up to 1.76x on 1 core, and 1.39x on 16 cores, when compared with the load elimination algorithm available in Jikes RVM. A prototype implementation of our BLG register allocator in Jikes RVM demonstrates runtime performance improvements of up to 3.52x relative to Linear Scan on an x86 processor. When compared to the Graph Coloring register allocator in the GCC compiler framework, our allocator resulted in an execution time improvement of up to 5.8%, with an average improvement of 2.3% on a POWER5 processor. 
With the experimental evaluations combined with the foundations presented in this thesis, we believe that the proposed high-level and low-level optimizations are useful in addressing some of the new challenges emerging in the optimization of parallel programs for multi-core architectures.
Item Efficient Selection of Vector Instructions using Dynamic Programming (2010-06-17) Barik, Rajkishore; Sarkar, Vivek; Zhao, Jisheng
Accelerating program performance via SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, VSE, and VSX SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and open-source compilers fall short of using the full potential of vector units, and only generate vector code for simple innermost loops. In this paper, we present the design and implementation of an auto-vectorization framework in the backend of a dynamic compiler that not only generates optimized vector code but is also well integrated with the instruction scheduler and register allocator. The framework includes a novel compile-time efficient dynamic programming-based vector instruction selection algorithm for straight-line code that expands opportunities for vectorization in the following ways: (1) scalar packing explores opportunities of packing multiple scalar variables into short vectors; (2) judicious use of shuffle and horizontal vector operations, when possible; and (3) algebraic reassociation expands opportunities for vectorization by algebraic simplification. We report performance results on the impact of auto-vectorization on a set of standard numerical benchmarks using the Jikes RVM dynamic compilation environment. 
Our results show performance improvement of up to 57.71% on an Intel Xeon processor, compared to non-vectorized execution, with a modest increase in compile time in the range from 0.87% to 9.992%. An investigation of the SIMD parallelization performed by v11.1 of the Intel Fortran Compiler (IFC) on three benchmarks shows that our system achieves speedup with vectorization in all three cases and IFC does not. Finally, a comparison of our approach with an implementation of the Superword Level Parallelization (SLP) algorithm from [21] shows that our approach yields a performance improvement of up to 13.78% relative to SLP.
Item Elastic Tasks: Unifying Task Parallelism and SPMD Parallelism with an Adaptive Runtime (2015-02-11) Agrawal, Kunal; Sarkar, Vivek; Sbîrlea, Alina
In this paper, we introduce elastic tasks, a new high-level parallel programming primitive that can be used to unify task parallelism and SPMD parallelism in a common adaptive scheduling framework. Elastic tasks are internally parallel tasks and can run on a single worker or expand to take over multiple workers. An elastic task can be an ordinary task or an SPMD region that must be executed by one or more workers simultaneously, in a tightly coupled manner. The gains obtained by using elastic tasks, as demonstrated in this paper, are three-fold: (1) they offer theoretical guarantees: given a computation with work W and span S executing on P cores, a work-sharing runtime guarantees a completion time of O(W/P + S + E), and a work-stealing runtime completes the computation in expected time O(W/P + S + E lg P), where E is the number of elastic tasks in the computation; (2) they offer performance benefits in practice by co-scheduling tightly coupled parallel/SPMD subcomputations within a single elastic task; and (3) they can adapt at runtime to the state of the application and workload of the machine. 
We also introduce ElastiJ, a runtime system that includes work-sharing and work-stealing scheduling algorithms to support computations with regular and elastic tasks. This scheduler dynamically decides the allocation for each elastic task in a non-centralized manner, and provides close to asymptotically optimal running times for computations that use elastic tasks. We have created an implementation of ElastiJ and present experimental results showing that elastic tasks provide the aforementioned benefits. We also study the sensitivity of elastic tasks to the theoretical assumptions and the user parameters.
Item Enabling Distributed Reconfiguration In An Actor Model (2017-08-11) Chatterjee, Ronnie; Sarkar, Vivek
The demand for portable mainstream programming models supporting scalable, reactive and versatile distributed computing is growing dramatically with the proliferation of manycore/heterogeneous processors on portable devices and cloud computing clusters that can be elastically and dynamically allocated. With such changes, distributed software systems and applications are increasingly shifting towards service-oriented architectures (SOA) that consist of dynamically replaceable components connected via loosely coupled, interactive networks that can support more complex coordination and synchronization patterns. In this dissertation, we address the dynamic reconfiguration challenges that arise in distributed implementations of the Selector Model. We focus on the Selector Model (a generalization of the actor model) in this work because of its support for multiple guarded mailboxes, which enables the programmer to easily specify coordination patterns that are more general than those supported by the actor model. The contributions of this dissertation are demonstrated in two implementations of distributed selectors, one for distributed servers and another for distributed Android devices. 
Both implementations run on distributed JVMs and feature the automated bootstrap and global termination capabilities introduced in this dissertation. In addition, the distributed Android implementation supports dynamic joining and leaving of devices, which is also part of the dynamic reconfiguration capabilities introduced in this dissertation.