Browsing by Author "Budimlić, Zoran"
Now showing 1 - 10 of 10
Item: BMS-CnC: Bounded Memory Scheduling of Dynamic Task Graphs (2013-10-24)
Authors: Budimlić, Zoran; Sarkar, Vivek; Sbîrlea, Dragoș
Abstract: It is now widely recognized that increased levels of parallelism are a necessary condition for improved application performance on multicore computers. However, as the number of cores increases, the memory-per-core ratio is expected to decrease further, making the per-core memory efficiency of parallel programs an even more important concern in future systems. For many parallel applications, the memory requirements can be significantly larger than for their sequential counterparts and, more importantly, their memory utilization depends critically on the schedule used when running them. To address this problem, we propose bounded memory scheduling (BMS) for parallel programs expressed as dynamic task graphs, in which an upper bound is imposed on the program's peak memory. Using the inspector/executor model, BMS tailors the set of allowable schedules to either guarantee that the program can be executed within the given memory bound, or report an error during the inspector phase, without running the computation, if no feasible schedule can be found. Since solving BMS is NP-hard, we propose an approach in which we first use our heuristic algorithm and, if it fails, fall back on a more expensive optimal approach that is sped up by the best-effort result of the heuristic. Through evaluation on seven benchmarks, we show that BMS gracefully spans the spectrum between fully parallel and serial execution with decreasing memory bounds. Comparison with OpenMP shows that BMS-CnC can execute in 53% of the memory required by OpenMP while running at 90% (or more) of OpenMP's performance.
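To make the inspector/executor idea concrete, below is a minimal Java sketch of a greedy bounded-memory inspector. The task and memory model (per-task output footprints, outputs freed once their last consumer has run), the smallest-footprint-first policy, and every name are illustrative assumptions, not the paper's BMS-CnC implementation.

```java
import java.util.*;

/** Minimal sketch of a bounded-memory-scheduling (BMS) inspector.
 *  The task/memory model and all names here are illustrative assumptions,
 *  not the actual BMS-CnC implementation. */
public class BmsInspector {
    static final class Task {
        final String name;
        final long outputBytes;                       // bytes its outputs occupy
        final List<Task> successors = new ArrayList<>();
        final List<Task> predecessors = new ArrayList<>();
        int unmetDeps = 0;                            // predecessors not yet run
        int pendingConsumers = 0;                     // successors not yet run
        Task(String name, long outputBytes) { this.name = name; this.outputBytes = outputBytes; }
        void dependsOn(Task p) {
            predecessors.add(p); p.successors.add(this);
            unmetDeps++; p.pendingConsumers++;
        }
    }

    /** Greedy heuristic: simulate a serial schedule, always running the ready
     *  task with the smallest output footprint; a task's outputs are freed once
     *  all of its consumers have run. Returns a feasible schedule, or null if
     *  none is found. */
    static List<Task> inspect(Collection<Task> tasks, long memoryBound) {
        PriorityQueue<Task> ready =
            new PriorityQueue<>(Comparator.comparingLong((Task t) -> t.outputBytes));
        for (Task t : tasks) if (t.unmetDeps == 0) ready.add(t);
        List<Task> schedule = new ArrayList<>();
        long liveBytes = 0;
        while (!ready.isEmpty()) {
            Task t = ready.poll();
            if (liveBytes + t.outputBytes > memoryBound) return null; // nothing fits
            liveBytes += t.outputBytes;
            schedule.add(t);
            for (Task p : t.predecessors)             // free fully consumed inputs
                if (--p.pendingConsumers == 0) liveBytes -= p.outputBytes;
            if (t.pendingConsumers == 0) liveBytes -= t.outputBytes; // no readers
            for (Task s : t.successors)
                if (--s.unmetDeps == 0) ready.add(s);
        }
        return schedule.size() == tasks.size() ? schedule : null;
    }
}
```

If the heuristic returns null, a real system would fall back to the exact, more expensive search that the abstract describes, rather than giving up.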
Item: Communication Optimizations for Distributed-Memory X10 Programs (2010-04-10)
Authors: Barik, Rajkishore; Budimlić, Zoran; Grove, David; Peshansky, Igor; Sarkar, Vivek; Zhao, Jisheng
Abstract: X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and communication structures with higher productivity than other distributed-memory programming models. However, this productivity often comes at the cost of high performance overhead when the language is used in its full generality. This paper introduces high-level compiler optimizations and transformations to reduce communication and synchronization overheads in distributed-memory implementations of X10 programs. Specifically, we focus on locality optimizations such as scalar replacement and task localization, combined with supporting transformations such as loop distribution, scalar expansion, loop tiling, and loop splitting. We have completed a prototype implementation of these high-level optimizations and performed a performance evaluation that shows significant improvements in performance, scalability, communication volume, and number of tasks. We evaluated the communication optimizations on three platforms: a 128-node BlueGene/P cluster, a 32-node Nehalem cluster, and a 16-node Power7 cluster. On the BlueGene/P cluster, we observed a maximum performance improvement of 31.46× relative to the unoptimized case (for the MolDyn benchmark). On the Nehalem cluster, we observed a maximum performance improvement of 3.01× (for the NQueens benchmark), and on the Power7 cluster, we observed a maximum performance improvement of 2.73× (for the MolDyn benchmark). In addition, there was no case in which the optimized code was slower than the unoptimized case. We also believe that the optimizations presented in this paper will be necessary for any high-productivity PGAS language based on modern object-oriented principles that is designed for execution on future Extreme Scale systems, which place a high premium on locality improvement for performance and energy efficiency.
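As one concrete instance of the locality optimizations named in the abstract above, here is a before/after sketch of scalar replacement applied to a loop-invariant remote access, written in plain Java rather than X10; RemoteRef and its get() round-trip are hypothetical stand-ins for a PGAS runtime's remote read, not X10's actual API.

```java
/** Before/after sketch of scalar replacement for a loop-invariant remote
 *  access. RemoteRef and get() are hypothetical stand-ins for a PGAS
 *  runtime's remote read. */
final class ScalarReplacementSketch {
    interface RemoteRef<T> { T get(); }  // each call is one communication round-trip

    // Before: every iteration re-fetches the same remote value,
    // paying one communication per access.
    static double sumNaive(RemoteRef<Double> remoteScale, double[] local) {
        double sum = 0;
        for (double v : local) sum += v * remoteScale.get(); // repeated remote read
        return sum;
    }

    // After: the compiler hoists the loop-invariant remote read into a local
    // scalar, so the loop body touches only local data.
    static double sumOptimized(RemoteRef<Double> remoteScale, double[] local) {
        double scale = remoteScale.get(); // single communication
        double sum = 0;
        for (double v : local) sum += v * scale;
        return sum;
    }
}
```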
Item: Compiler and Runtime Optimization of Computational Kernels for Irregular Applications (2023-08-17)
Authors: Milakovic, Srdan; Mellor-Crummey, John; Budimlić, Zoran; Varman, Peter J; Mamouras, Konstantinos
Abstract: Many computationally intensive workloads do not fit on individual compute nodes due to their size. As a consequence, such workloads are usually executed on multiple heterogeneous compute nodes of a cluster or supercomputer. However, due to the complexity of the hardware, developing efficient and scalable code for modern compute nodes is difficult. Another challenge with sophisticated applications is that data structures, communication, and control patterns are often irregular and unknown before program execution. This lack of regularity makes static analysis especially difficult, and very often impossible. To overcome these issues, programmers use high-level, implicitly parallel programming models or domain-specific libraries that consist of composable building blocks. This dissertation explores compiler and runtime optimizations for automatic granularity selection in the context of two programming paradigms: Concurrent Collections (CnC), a declarative, dynamic-single-assignment, data-race-free programming model, and GraphBLAS, a domain-specific Application Programming Interface (API). Writing fine-grained CnC programs is easy and intuitive for domain experts because the programmers do not have to worry about parallelism. Additionally, fine-grained programs expose maximum parallelism. However, fine-grained programs can significantly increase the runtime overhead of CnC program execution, due to the large number of data accesses and dependencies between computation tasks relative to the amount of computation done by each fine-grained task. Runtime overhead can be reduced by coarsening the data accesses and task dependencies, but coarsening is tedious, and it is not easy even for domain experts. For some applications, coarse-grained code can be generated by a compiler; however, not all fine-grained applications can be converted to coarse-grained ones because not all information is statically known. In this dissertation, we introduce the concept of micro-runtimes. A micro-runtime is a Hierarchical CnC construct that enables fusion of multiple steps into a higher-level step during program execution. Another way for users to develop applications that efficiently exploit modern hardware is through domain-specific APIs that define composable building blocks. One such API specification is GraphBLAS, which allows users to specify graph algorithms using (sparse) linear algebra building blocks. Even though GraphBLAS libraries usually consist of highly hand-optimized building blocks, they provide limited or no support for inter-kernel optimization. In this dissertation, we investigate several approaches to inter-kernel optimization, including runtime optimizations and compile-time optimizations. Our optimizations reduce the number of arithmetic operations, the number of memory accesses, and the memory required for temporary objects.

Item: Compiler Support for Work-Stealing Parallel Runtime Systems (2010-03-03)
Authors: Raman, Raghavan; Zhao, Jisheng; Budimlić, Zoran; Sarkar, Vivek
Abstract: Multiple programming models are emerging to address an increased need for dynamic task parallelism in multicore shared-memory multiprocessors. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Threading Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work-stealing, as embodied in Cilk's implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this paper, we focus on the compiler support needed to extend work-stealing to the dynamic async-finish task parallelism supported by X10 and Habanero-Java (HJ). We discuss the compiler support needed for work-stealing under both the work-first and help-first policies. Performance results obtained using our compiler and the HJ work-stealing runtime show significant improvement compared to the earlier work-sharing runtime from X10 v1.5. We also propose and implement three optimizations, Dynamic-Loop-Chunking, Redundant-Frame-Store, and Objects-As-Frames, that can be performed in the compiler to improve the code generated for work-stealing schedulers. Performance results show that the Dynamic-Loop-Chunking optimization significantly improves the performance of loop-based benchmarks using work-stealing schedulers with the work-first policy. The Redundant-Frame-Store optimization provides a significant reduction in code size. The results also show that our novel Objects-As-Frames optimization yields performance improvements in many cases. To the best of our knowledge, this is the first implementation of compiler support for work-stealing schedulers for async-finish parallelism with both the work-first and help-first policies and support for task migration.
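The work-first/help-first distinction can be illustrated in a few lines. The sketch below is a hypothetical Java skeleton of the two spawn policies over a work-stealing deque; it is not the HJ runtime, and the Runnable-based task representation is an assumption made for brevity.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of the work-first vs. help-first spawn policies for a work-stealing
 *  scheduler. A hypothetical skeleton for illustration, not the HJ runtime. */
final class SpawnPolicySketch {
    private final Deque<Runnable> deque = new ArrayDeque<>(); // this worker's deque

    /** Work-first: on async, the worker executes the spawned child immediately
     *  and leaves the parent's continuation in the deque for thieves to steal. */
    void asyncWorkFirst(Runnable child, Runnable parentContinuation) {
        deque.push(parentContinuation);
        child.run();
    }

    /** Help-first: on async, the worker publishes the child for thieves and
     *  keeps executing the parent itself; this tends to win when many children
     *  are spawned in a loop, since the worker never abandons the parent. */
    void asyncHelpFirst(Runnable child) {
        deque.push(child);
        // ...the worker simply falls through and continues the parent here...
    }

    /** A thief steals from the opposite end of the victim's deque. */
    Runnable trySteal() { return deque.pollLast(); }
}
```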
Item: Composability for Application-Specific Transactional Optimizations (2010-01-21)
Authors: Zhang, Rui; Budimlić, Zoran; Scherer, William N., III
Abstract: Software Transactional Memory (STM) has made great advances toward acceptance into mainstream programming by promising a programming model that greatly reduces the complexity of writing concurrent programs. Unfortunately, the mechanisms in current STM implementations that enforce the fundamental properties of transactions (atomicity, consistency, and isolation) also introduce considerable performance overhead. This performance impact can be so significant that, in practice, programmers are tempted to leverage their knowledge of a specific application to carefully bypass STM calls and instead access shared memory directly. While this technique can be very effective in improving performance, it breaks the consistency and isolation properties of transactions, which then have to be handled manually by the programmer for the specific application. It also tends to break another desirable property of transactions: composability. In this paper, we identify the composability problem and propose two STM system extensions that provide transaction composability in the presence of direct shared-memory reads by transactions. Our proposed extensions give the programmer a level of flexibility and performance similar to existing practice when optimizing an STM application, while preserving composability. We evaluate our extensions on several benchmarks on a 16-way SMP. The results show that our extensions provide performance competitive with hand-optimized, non-composable techniques while still maintaining transactional composability.

Item: Inside Time-based Software Transactional Memory (2007-07-06)
Authors: Zhang, Rui; Budimlić, Zoran; Scherer, William N., III
Abstract: We present a comprehensive analysis and experimental evaluation of time-based validation techniques for Software Transactional Memory (STM). Time-based validation techniques have recently emerged as an effective way to reduce the validation overhead of STM systems. In a time-based strategy, information derived from a global clock enables the system to avoid a full validation pass in many cases, namely whenever it can quickly prove that no consistency violation is possible given the time information for the current transaction and the object it is attempting to open. We show that none of the current time-based strategies offers the best performance across all applications and thread counts. We also present an adaptive technique that has the potential to achieve the best overall performance based on time information, and report some preliminary results.
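The time-based validation idea analyzed above can be sketched as follows, in the spirit of global-clock STM designs such as TL2/LSA. All names are illustrative assumptions; a production STM would also recheck versions after the read and manage commit-time clock updates, which this sketch elides.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of time-based validation in an STM, in the spirit of global-clock
 *  designs such as TL2/LSA. All names are illustrative. */
final class TimeBasedStmSketch {
    static final AtomicLong globalClock = new AtomicLong(); // advanced at commits

    static final class VersionedObject {
        volatile long version;   // global-clock time of the last committed write
        volatile Object payload;
    }

    static final class Transaction {
        final long startTime = globalClock.get(); // snapshot at transaction begin

        /** Opening an object for reading: if its last committed write happened
         *  at or before our start time, no committed writer can have invalidated
         *  anything we have read so far, so the full validation pass is skipped. */
        Object open(VersionedObject o) {
            Object value = o.payload;
            if (o.version > startTime) {
                fullValidationOrAbort(); // slow path: revalidate the whole read set
            }
            return value;
        }

        private void fullValidationOrAbort() {
            // Re-check every previously read object's version; abort on conflict.
        }
    }
}
```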
Item: Point-to-Point and Barrier Synchronization in Distributed SPMD Systems (2019-11-08)
Authors: Milakovic, Srdan; Mellor-Crummey, John M; Sarkar, Vivek; Budimlić, Zoran
Abstract: Distributed-memory programming models are very often the only way to scale up large scientific applications. To ensure correctness and optimal performance in distributed applications, it is necessary to use general, high-level, yet efficient synchronization constructs. Implementing distributed applications using one-sided communication libraries is growing in popularity, as opposed to the two-sided communication used in the MPI model. However, in most cases, those libraries support only high-level collective barrier synchronization and low-level point-to-point synchronization. The phaser is a very attractive synchronization mechanism because it unifies collective and point-to-point synchronization in a simple, easy-to-use, high-level construct. In this thesis, we propose several novel algorithms for phaser synchronization on distributed-memory systems with one-sided communication. We also present several improvements to the distributed barrier algorithms in the OpenSHMEM reference implementation. We establish a very high level of confidence in the algorithms' correctness by checking them with the SPIN model checker. We evaluated our phaser algorithm using several benchmark applications on large supercomputers, and we show that using phasers can reduce synchronization time by up to 47% and improve total execution time by up to 26%. This thesis shows that high-level, efficient, and intuitive synchronization is possible on distributed systems with one-sided communication.

Item: Scheduling Tasks to Maximize Usage of Aggregate Variables In Place (2006-08-21)
Authors: Budimlić, Zoran; Kennedy, Ken; Mahmeed, Samah; McCosh, Cheryl; Rogers, Steve
Abstract: We present an algorithm for greedy in-placeness that runs in O(T log T + E_w·V + V^2) time, where T is the number of in-placeness opportunities, E_w is the aggregate number of wire successors, and V is the number of virtual instruments in a program graph.

Item: Support for Complex Numbers in Habanero (2009-05-18)
Authors: Zhao, Jisheng; Cavé, Vincent; Yan, Yonghong; Budimlić, Zoran; Sarkar, Vivek
Abstract: No abstract.

Item: The Concurrent Collections Programming Model (2010-01-04)
Authors: Budimlić, Zoran; Burke, Michael G.; Cavé, Vincent; Knobe, Kathleen; Lowney, Geoff; Palsberg, Jens; Peixotto, David; Sarkar, Vivek; Schlimbach, Frank; Taşırlar, Sağnak
Abstract: We introduce the Concurrent Collections (CnC) programming model. In this model, programs are written in terms of high-level operations that are partially ordered only by their semantic constraints. These partial orderings correspond to data dependences and control dependences. The role of the domain expert, whose interest and expertise is in the application domain, and the role of the tuning expert, whose interest and expertise is in performance on a specific architecture, can be viewed as separate concerns. The CnC programming model provides a high-level specification that can be used as a common language between the two experts, raising the level of their discourse. The model facilitates a significant degree of separation, which simplifies the task of the domain expert, who can focus on the application rather than on scheduling concerns and mapping to the target architecture. This separation also simplifies the work of the tuning expert, who is given the maximum possible freedom to map the computation onto the target architecture and is not required to understand the details of the domain. However, the domain and tuning expert may still be the same person. We formally describe the execution semantics of CnC and prove that this model guarantees deterministic computation. We evaluate the performance of CnC implementations on several applications and show that CnC can effectively exploit several different kinds of parallelism and offer performance and scalability equivalent to or better than that offered by current low-level parallel programming models. Further, with respect to ease of programming, we discuss the tradeoffs between CnC and other parallel programming models on these applications.
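To ground the model's terminology, here is a toy Java sketch of the dynamic-single-assignment item collection at the heart of CnC's determinism guarantee; the shapes and names are illustrative assumptions, not an actual CnC runtime API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy sketch of the CnC ingredients: a dynamic-single-assignment item
 *  collection and a step interface. Illustrative only, not a CnC runtime. */
final class CnCSketch {
    /** Item collection: each tag may be written at most once. This single-
     *  assignment rule is what lets any schedule consistent with the data and
     *  control dependences produce the same (deterministic) result. */
    static final class ItemCollection<T, V> {
        private final Map<T, V> items = new ConcurrentHashMap<>();
        void put(T tag, V value) {
            if (items.putIfAbsent(tag, value) != null)
                throw new IllegalStateException("tag written twice: " + tag);
        }
        V get(T tag) { return items.get(tag); } // a real runtime blocks or defers
    }

    /** Step collection: a step is a function of its tag that communicates only
     *  through item collections, leaving the runtime free to choose any legal
     *  schedule (the tuning expert's freedom described above). */
    interface Step<T> { void compute(T tag); }
}
```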