
Browsing by Author "Varman, Peter J"

Now showing 1 - 5 of 5
    Compiler and Runtime Optimization of Computational Kernels for Irregular Applications
    (2023-08-17) Milakovic, Srdan; Mellor-Crummey, John; Budimlić, Zoran; Varman, Peter J; Mamouras, Konstantinos
Many computationally intensive workloads do not fit on individual compute nodes due to their size. As a consequence, such workloads are usually executed on multiple heterogeneous compute nodes of a cluster or supercomputer. However, due to the complexity of the hardware, developing efficient and scalable code for modern compute nodes is difficult. Another challenge with sophisticated applications is that data structures, communication, and control patterns are often irregular and unknown before program execution. This lack of regularity makes static analysis especially difficult and often impossible. To overcome these issues, programmers use high-level, implicitly parallel programming models or domain-specific libraries built from composable building blocks. This dissertation explores compiler and runtime optimizations for automatic granularity selection in the context of two programming paradigms: Concurrent Collections (CnC), a declarative, dynamic single-assignment, data-race-free programming model, and GraphBLAS, a domain-specific application programming interface (API). Writing fine-grained CnC programs is easy and intuitive for domain experts because the programmers do not have to worry about parallelism. Additionally, fine-grained programs expose maximum parallelism. However, fine-grained programs can significantly increase the runtime overhead of CnC program execution, because the number of data accesses and dependencies between computation tasks is large relative to the amount of computation done by each fine-grained task. Runtime overhead can be reduced by coarsening the data accesses and task dependencies. However, coarsening is tedious and not easy even for domain experts. For some applications, coarse-grained code can be generated by a compiler, but not all fine-grained applications can be converted to coarse-grained applications because not all information is statically known. In this dissertation, we introduce the concept of micro-runtimes. A micro-runtime is a Hierarchical CnC construct that enables fusion of multiple steps into a higher-level step during program execution. Another way for users to develop applications that efficiently exploit modern hardware is through domain-specific APIs that define composable building blocks. One such API specification is GraphBLAS, which allows users to specify graph algorithms using (sparse) linear algebra building blocks. Even though GraphBLAS libraries usually consist of highly hand-optimized building blocks, they provide limited or no support for inter-kernel optimization. In this dissertation, we investigate several approaches to inter-kernel optimization, including both runtime and compile-time optimizations. Our optimizations reduce the number of arithmetic operations, the number of memory accesses, and the memory required for temporary objects.
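
The GraphBLAS building blocks mentioned in this abstract express graph algorithms as sparse linear algebra. As a rough illustration of that idea (not of the dissertation's optimizations), the sketch below computes BFS levels with repeated sparse matrix-vector products; it uses scipy.sparse rather than any GraphBLAS library, and the toy graph is made up.

```python
# Minimal sketch: BFS as repeated sparse matrix-vector products, the core
# GraphBLAS idea. Illustration only; not the dissertation's implementation.
import numpy as np
from scipy.sparse import csr_matrix

def bfs_levels(adj: csr_matrix, source: int) -> np.ndarray:
    """Return the BFS level of every vertex (-1 if unreachable)."""
    n = adj.shape[0]
    level = np.full(n, -1, dtype=int)
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    depth = 0
    while frontier.any():
        level[frontier] = depth
        # One "vector times matrix" step: reach all out-neighbors of the
        # current frontier, then mask out vertices that already have a level.
        reached = adj.T.dot(frontier.astype(np.int32)) > 0
        frontier = reached & (level == -1)
        depth += 1
    return level

# Toy directed graph: 0 -> 1, 1 -> 2, 0 -> 2, 2 -> 3
rows, cols = [0, 1, 0, 2], [1, 2, 2, 3]
A = csr_matrix((np.ones(4, dtype=np.int32), (rows, cols)), shape=(4, 4))
print(bfs_levels(A, 0))   # [0 1 1 2]
```
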
    Heterogeneous Resource Allocation in Datacenters
    (2020-04-24) Parvez Khan, Mohammad Shahriar; Varman, Peter J
Virtualized datacenters offer considerable cost savings and convenience to their clients, who routinely deploy clusters of communicating Virtual Machines (VMs) on physical infrastructure to create distributed ensembles of servers for Web applications. These applications require multiple resources, such as compute, memory, storage, and network bandwidth, motivating the need for fair allocation policies. We present a new model for allocating multiple resources among clients, and our results show that it obtains significantly better utilization than existing approaches while provably maintaining good fairness properties such as Envy Freedom and Sharing Incentive. We also look at datacenter optimization from the opposite standpoint: VM placement. Maximizing the benefits of a shared infrastructure requires placing a large number of virtual servers per physical host. We present a framework that automates the placement of VMs in several scenarios, along with the corresponding ILP optimization models and numerical results.
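
The Envy Freedom property cited in this abstract can be illustrated with a small, generic check (this is not the allocation policy proposed in the thesis): client i envies client j if it could run strictly more of its own tasks with j's allocation than with its own. All demand and allocation numbers below are invented for illustration.

```python
# Generic envy-freedom check for a multi-resource allocation; illustrative only.
import numpy as np

def tasks_runnable(alloc: np.ndarray, demand: np.ndarray) -> float:
    """How many of this client's tasks fit inside a given allocation vector."""
    return float(np.min(alloc / demand))

def is_envy_free(allocs: np.ndarray, demands: np.ndarray) -> bool:
    n = len(allocs)
    for i in range(n):
        own = tasks_runnable(allocs[i], demands[i])
        for j in range(n):
            if j != i and tasks_runnable(allocs[j], demands[i]) > own + 1e-9:
                return False          # client i would prefer client j's share
    return True

# Two clients sharing <CPU, RAM>: per-task demands and a candidate allocation.
demands = np.array([[1.0, 4.0],      # client 0: CPU-light, memory-heavy tasks
                    [3.0, 1.0]])     # client 1: CPU-heavy, memory-light tasks
allocs  = np.array([[3.0, 12.0],
                    [9.0,  3.0]])
print(is_envy_free(allocs, demands))   # True: each client prefers its own share
```
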
    Memory and Communication Optimizations for Macro-dataflow Programs
    (2015-06-23) Sbirlea, Dragos Dumitru; Sarkar, Vivek; Cooper, Keith D; Varman, Peter J
It is now widely recognized that increased levels of parallelism are a necessary condition for improved application performance on multicore computers. However, the memory-per-core ratio is already low and, as the number of cores increases, it is expected to decrease further, making per-core memory efficiency of parallel programs an even more important concern in future systems. Further, the memory requirements of parallel applications can be significantly larger than those of their sequential counterparts, and their memory utilization depends critically on the schedule used when running them. This thesis proposes techniques that enable awareness and control of the trade-off between a program's memory usage and its performance. It does so by taking advantage of the computation structure that is made explicit in macro-dataflow programs, one of the benefits of macro-dataflow as a programming model for modern multicore applications. To address this challenge, we first introduce folding, a memory management technique that enables programmers to map multiple data values to the same memory slot. Folding reduces the memory requirement of the program while preserving its macro-dataflow execution semantics. We then propose an approach that allows dynamic macro-dataflow programs running on shared-memory multicore systems to obey a user-specified memory bound. Using the inspector/executor model, we tailor the set of allowable schedules to either guarantee that the program can execute within the given memory bound, or report an error during the inspector phase, without running the computation, if no feasible schedule can be found. We show that our technique can gracefully span the spectrum (with decreasing memory bounds) from fully parallel to fully serial execution, with several intermediate points between the two. Comparison with OpenMP shows that it can execute in 53% of the memory required by OpenMP while running at 90% (or more) of OpenMP's performance. Finally, we turn our attention to distributed systems, where memory size is often not a limiting factor but communication and load balancing are. For these systems, we show that data and task distributions can be selected automatically even for applications expressed as dynamic task graphs, freeing the programmer from this cumbersome selection process. We show that optimal selection can be achieved for certain classes of distributions and cost functions that capture the trade-off between communication and load balance.
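
As a rough sketch of the folding idea (the class and folding function below are hypothetical, not the thesis implementation), a single-assignment item collection can map keys onto a fixed pool of reusable slots. This is only safe when the program's dataflow guarantees that no more keys than slots are live at once, as in the toy stencil below, which only ever reads the previous iteration.

```python
# Hypothetical illustration of folding: many keys share a bounded set of slots.
class FoldedCollection:
    def __init__(self, slots: int):
        self.slots = slots
        self.data = [None] * slots      # reused storage
        self.owner = [None] * slots     # which key currently occupies each slot

    def put(self, key: int, value) -> None:
        s = key % self.slots            # the folding function: key -> slot
        self.data[s], self.owner[s] = value, key

    def get(self, key: int):
        s = key % self.slots
        assert self.owner[s] == key, f"value for key {key} is no longer resident"
        return self.data[s]

# 1-D relaxation: iteration t reads only iteration t-1, so 2 slots suffice no
# matter how many iterations run, instead of one array per iteration.
grid = FoldedCollection(slots=2)
grid.put(0, [0.0, 10.0, 0.0, 10.0, 0.0])
for t in range(1, 100):
    prev = grid.get(t - 1)
    grid.put(t, [prev[i] if i in (0, len(prev) - 1)
                 else 0.5 * (prev[i - 1] + prev[i + 1])
                 for i in range(len(prev))])
print(grid.get(99)[:3])
```
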
    Query Processing and Optimization for Database Stochastic Analytics
    (2014-12-03) Perez, Luis Leopoldo; Jermaine, Christopher M; Ng, T.S. Eugene; Varman, Peter J
The application of relational database systems to analytical processing has been an active area of research for about two decades, motivated by constant surges in the scale of the data and in the complexity of the analysis tasks. Simultaneously, stochastic techniques have become commonplace in large-scale data analytics. This work is concerned with applying relational database systems to stochastic analytical tasks, particularly the query evaluation and optimization phases. Three problems are addressed in the context of MCDB/SimSQL, a relational database system for uncertain data management and analytics. The first contribution is a set of efficient techniques for evaluating queries that must satisfy a probability threshold, such as "Which pending orders are estimated to be processed and shipped by the end of the month, with a probability of at least 95%?", where the processing and shipment times of each order are generated by an arbitrary stochastic process. Results show that these techniques make sensible use of resources, weeding out data elements that require relatively few samples during the early stages of query evaluation. The second problem concerns recycling the materialized intermediate results of a query to optimize future queries. Under the assumption that a history of past queries provides an accurate picture of the workload, I describe query optimization techniques that evaluate the costs and benefits of materializing intermediate results, with the objective of minimizing the hypothetical cost of future queries subject to constraints on disk space. Results show a substantial improvement over conventional query caching techniques in workload and average query execution times. Finally, this work addresses the problem of evaluating queries over stochastic generative models specified in a high-level notation that treats random variables as first-class objects and allows operations on structured objects such as vectors and matrices. I describe a notation that, relying on the syntax of comprehensions, provides a language for denoting generative models with a guaranteed correspondence to relational algebra expressions, along with techniques for translating a model into a database schema and a set of relational queries.
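
The probability-threshold queries described in this abstract can be illustrated with a generic Monte Carlo sketch (not the evaluation strategy developed in the thesis): for each tuple, samples of the stochastic predicate are drawn until a Hoeffding bound places the estimated probability confidently above or below the threshold, so many tuples are accepted or pruned after only a few samples. The shipping-time model below is invented.

```python
# Generic early-stopping probability-threshold test; illustration only.
import math, random

def prob_at_least(sample_event, threshold: float,
                  max_samples: int = 10_000, delta: float = 1e-3):
    """Return (decision, samples_used) once P(event) is confidently
    above or below the threshold, per a Hoeffding confidence radius."""
    hits = 0
    for n in range(1, max_samples + 1):
        hits += sample_event()
        eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))   # Hoeffding radius
        p_hat = hits / n
        if p_hat - eps >= threshold:
            return True, n          # confidently above the threshold
        if p_hat + eps < threshold:
            return False, n         # confidently below: prune this tuple early
    return p_hat >= threshold, max_samples

# Hypothetical order: ships if processing plus transit time fits in 30 days.
def ships_on_time() -> bool:
    return random.gauss(10, 2) + random.gauss(8, 3) <= 30

print(prob_at_least(ships_on_time, threshold=0.95))
```
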
    System Support for Loosely Coupled Resources in Mobile Computing
    (2014-07-31) Lin, Xiaozhu; Zhong, Lin; Cox, Alan L; Varman, Peter J
Modern mobile platforms embrace not only heterogeneous but also loosely coupled computational resources. For instance, a smartphone usually incorporates multiple processor cores that have no hardware cache coherence. Loosely coupled resources allow a high degree of resource heterogeneity that can greatly improve system energy efficiency for a wide range of mobile workloads. However, loosely coupled resources make application programming difficult: both resources and program state are distributed, which calls for explicit communication to maintain consistency. This difficulty is further exacerbated by the large number of mobile developers and mobile applications. To ease application programming over loosely coupled resources, this thesis explores system support, at both the user level and the OS level, that bridges desirable programming abstractions and the underlying hardware. We study three loosely coupled architectures widely seen in mobile computing: i) a smartphone accompanied by wearable sensors, ii) a mobile device encompassing multiple processors that share no memory, and iii) a mobile System-on-Chip (SoC) with multiple cores sharing incoherent memory. To address these three architectures, this thesis contributes three closely related research projects. In project Dandelion, we propose a Remote Method Invocation scheme to hide communication details from application components that synchronize over wireless links. In project Reflex, we design an energy-efficient software Distributed Shared Memory (DSM) to automatically keep user state consistent; the DSM always employs a low-power processor to host shared memory objects in order to maximize the sleep periods of high-power processors. In project K2, we identify and apply a shared-most OS model to construct a single OS image over loosely coupled processor cores. Following the shared-most model, high-level OS services (e.g., device drivers and file systems) are mostly unmodified, with their state kept transparently consistent; low-level OS services (e.g., the page allocator) are implemented as separate instances with independent state to minimize communication overhead. We report the research prototypes, our experiences building them, and experimental measurements. We discuss future directions, in particular how our principles for treating loosely coupled resources can be used to improve other key system aspects, such as scalability.
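
The Remote Method Invocation idea behind Dandelion can be sketched generically (this is not the Dandelion implementation): the application holds a local proxy, and method calls are serialized and shipped to wherever the object actually lives, hiding the communication from the caller. The transport below is a trivial in-process stand-in for a wireless link, and all names are illustrative.

```python
# Generic RMI-style proxy; all classes and names here are hypothetical.
import json

class Transport:
    """Stand-in for the communication channel (e.g., a Bluetooth/Wi-Fi link)."""
    def __init__(self, remote_dispatch):
        self.remote_dispatch = remote_dispatch
    def send(self, payload: str) -> str:
        return self.remote_dispatch(payload)        # simulated round trip

class Proxy:
    """Local stub: turns ordinary method calls into serialized remote calls."""
    def __init__(self, transport: Transport, obj_id: str):
        self._transport, self._obj_id = transport, obj_id
    def __getattr__(self, method):
        def call(*args):
            msg = json.dumps({"obj": self._obj_id, "method": method, "args": args})
            return json.loads(self._transport.send(msg))["result"]
        return call

# "Remote" side: a sensor object living on a wearable device.
class StepCounter:
    def __init__(self): self.steps = 0
    def add(self, n): self.steps += n; return self.steps
    def read(self): return self.steps

objects = {"steps0": StepCounter()}

def dispatch(payload: str) -> str:
    req = json.loads(payload)
    result = getattr(objects[req["obj"]], req["method"])(*req["args"])
    return json.dumps({"result": result})

# Application code sees only ordinary method calls on the proxy.
counter = Proxy(Transport(dispatch), "steps0")
counter.add(120)
print(counter.read())   # 120
```
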