Browsing by Author "Budimlic, Zoran"
Now showing 1 - 5 of 5
Item Array optimizations for high productivity programming languages (2009) Joyner, Mackale; Kennedy, Ken; Sarkar, Vivek; Budimlic, Zoran
While the HPCS languages (Chapel, Fortress and X10) have introduced improvements in programmer productivity, several challenges still remain in delivering high performance. In the absence of optimization, the high-level language constructs that improve productivity can result in order-of-magnitude runtime performance degradations. This dissertation addresses the problem of efficient code generation for high-level array accesses in the X10 language. The X10 language supports rank-independent specification of loop and array computations using regions and points. Three aspects of high-level array accesses in X10 are important for productivity but also pose significant performance challenges: high-level accesses are performed through Point objects rather than integer indices, variables containing references to arrays are rank-independent, and array subscripts are verified as legal array indices during runtime program execution. Our solution to the first challenge is to introduce new analyses and transformations that enable automatic inlining and scalar replacement of Point objects. Our solution to the second challenge is a hybrid approach. We use an interprocedural rank analysis algorithm to automatically infer ranks of arrays in X10. We use rank analysis information to enable storage transformations on arrays. If rank-independent array references still remain after compiler analysis, the programmer can use X10's dependent type system to safely annotate array variable declarations with additional information for the rank and region of the variable, and to enable the compiler to generate efficient code in cases where the dependent type information is available.
Our solution to the third challenge is to use a new interprocedural array bounds analysis approach using regions to automatically determine when runtime bounds checks are not needed. Our performance results show that our optimizations deliver performance that rivals that of hand-tuned code with explicit rank-specific loops and lower-level array accesses, and is up to two orders of magnitude faster than unoptimized, high-level X10 programs. These optimizations also result in scalability improvements of X10 programs as we increase the number of CPUs. While we perform the optimizations primarily in X10, these techniques are applicable to other high-productivity languages such as Chapel and Fortress.

Item Compiling Java for high performance and the Internet (2001) Budimlic, Zoran; Kennedy, Ken
Java is the first widely accepted language that addresses heterogeneous resources, security, and portability problems, making it attractive for scientific computation. It also encourages programmers to use object-oriented techniques in programming. Unfortunately, such object-oriented programs also incur unacceptable performance penalties. For example, using a polymorphic number hierarchy in a linear algebra package resulted in code that is four times shorter, more extensible and less bug-prone than the equivalent Fortran-style code, but also many times slower. To address the poor performance problem, this dissertation introduces several new compilation techniques that can improve the performance of scientific Java programs written in a polymorphic, object-oriented style to within a factor of two of the equivalent hand-coded Fortran-style programs. These techniques also maintain an acceptable level of Java byte-code portability and flexibility, thus rewarding, rather than penalizing, good object-oriented programming practice. This dissertation first discards the typical one-class-at-a-time Java compilation model in favor of a whole-program model.
It then introduces two novel whole-program optimizations, class specialization and object inlining, which improve the performance of high-level, object-oriented, scientific Java programs by up to two orders of magnitude, effectively eliminating the penalty of object-oriented design. Next, this dissertation introduces a new almost-whole-program compilation model. This model improves the flexibility of the generated code, while still permitting whole-program optimizations and incurring only modest performance penalties. It enables the programmer to balance performance and flexibility of the program after the development phase, instead of compromising the design for performance. Furthermore, this dissertation reduces the restrictions that Java imposes upon classical optimization techniques by introducing exception hiding and SSA conversion algorithms. Exception hiding transforms the code to create exception-free zones, in which code motion transformations can move the code without restraint. The new, nearly linear-time SSA-to-CFG conversion algorithm considerably reduces the number of copies inserted in the conversion process, improving the effectiveness of classical optimizations. Finally, this dissertation lays the groundwork for further research, particularly for fast register allocation, precise type analysis, coordinated compilation, and exception recovery.

Item Improving object inlining for high-performance Java scientific applications (2005) Joyner, Mackale; Kennedy, Ken; Budimlic, Zoran
Java is a popular programming language that enables many developers to achieve high productivity. Previous work in Java improved runtime performance by using object inlining. This thesis extends prior object inlining work by both analyzing the code and performing optimizations to further improve application runtime performance. Two impediments to object inlining and to increased runtime performance are object and array aliasing and binary method invocations.
This thesis implements object and array alias strategies to address the aliasing problem while utilizing an idea from Telescoping Languages to address the binary method invocation problem. Application runtime gains of up to 20% result from employing these techniques. The improvements made to the compile-time object inlining optimization should increase the scientific community's acceptance of the Java programming language in the development of high-performance scientific applications by decreasing the performance.

Item Mapping a Dataflow Programming Model onto Heterogeneous Architectures (2012-09-05) Sbirlea, Alina; Sarkar, Vivek; Cooper, Keith D.; Mellor-Crummey, John; Budimlic, Zoran
This thesis describes and evaluates how extending Intel's Concurrent Collections (CnC) programming model can address the problem of hybrid programming with high performance and low energy consumption, while retaining the ease of use of data-flow programming. The CnC model is a declarative, dynamic, lightweight, task-based parallel programming model and is implicitly deterministic by enforcing the single assignment rule, properties which ensure that problems are modelled in an intuitive way. CnC offers a separation of concerns by allowing algorithms to be expressed as a two-stage process: first by decomposing a problem into components and specifying how components interact with each other, and second by providing an implementation for each component. By facilitating the separation between a domain expert, who can provide an accurate problem specification at a high level, and a tuning expert, who can tune the individual components for better performance, we ensure that tuning and future development, such as replacement of a subcomponent with a more efficient algorithm, become straightforward. A recent trend in mainstream desktop systems is the use of graphics processing units (GPUs) to obtain order-of-magnitude performance improvements relative to general-purpose CPUs.
In addition, the use of FPGAs has seen a significant increase for applications that can take advantage of such dedicated hardware. We see that computing is evolving from using many-core CPUs to "co-processing" on the CPU, GPU and FPGA; however, hybrid programming models that support the interaction between multiple heterogeneous components are not widely accessible to mainstream programmers and domain experts who have a real need for such resources. We propose a C-based implementation of the CnC model for enabling parallelism across heterogeneous processor components in a flexible way, with high resource utilization and high programmability. We use the task-parallel HabaneroC language (HC) as the platform for implementing CnC-HabaneroC (CnC-HC), a language also used to implement the computation steps in CnC-HC, for interaction with GPU or FPGA steps and which offers the desired flexibility and extensibility of interacting with any other C-based language. First, we extend the CnC model with tag functions and ranges to enable automatic code generation of high-level operations for inter-task communication. This improves programmability and also makes the code more analysable, opening the door for future optimizations. Second, we introduce a way to specify steps that are data parallel and thus are fit to execute on the GPU, and the notion of task affinity, a tuning annotation in the specification language. Affinity is used by the runtime during scheduling and can be fine-tuned based on application needs to achieve better (faster, lower power, etc.) results. Third, we introduce and develop a novel, data-driven runtime for the CnC model, using HabaneroC (HC) as a base language. In addition, we also create an implementation of the previous runtime approach and conduct a study to compare the performance. Next, we expand the HabaneroC dynamic work-stealing runtime to allow cross-device stealing based on task affinity.
Cross-device dynamic work-stealing is used to achieve load balancing across heterogeneous platforms for improved performance. Finally, we implement and use a series of benchmarks for testing the model in different scenarios and show that our proposed approach can yield significant performance benefits and low power usage when using a hybrid execution.

Item Runtime Systems for Extreme Scale Platforms (2013-12-06) Chatterjee, Sanjay; Sarkar, Vivek; Mellor-Crummey, John; Zhong, Lin; Budimlic, Zoran
Future extreme-scale systems are expected to contain homogeneous and heterogeneous many-core processors, with O(10^3) cores per node and O(10^6) nodes overall. Effective combination of inter-node and intra-node parallelism is recognized to be a major software challenge for such systems. Further, applications will have to deal with constrained energy budgets as well as frequent faults and failures. To help programmers manage these complexities and enhance programmability, much of recent research has focused on designing state-of-the-art software runtime systems. Such runtime systems are expected to be a critical component of the software ecosystem for the management of parallelism, locality, load balancing, energy and resilience on extreme-scale systems. In this dissertation, we address three key challenges faced by a runtime system using a dynamic task parallel framework for extreme-scale computing. First, we address the challenge of integrating an intra-node task parallel runtime with a communication system for scalable performance. We present a runtime communication system, called HC-COMM, designed to use dedicated communication cores on a system. We introduce the HCMPI programming model which integrates the Habanero-C asynchronous dynamic task parallel language with the MPI message passing communication model on the HC-COMM runtime. We also introduce the HAPGNS model that enables dataflow programming for extreme-scale systems in which the user does not require knowledge of MPI.
Second, we address the challenge of separating locality optimizations from a programmer with domain-specific knowledge. We present a tuning framework, through which performance experts can optimize existing applications by specifying runtime operations aimed at co-scheduling of affinitized tasks. Finally, we address the challenge of scalable synchronization for long-running tasks on a dynamic task parallel runtime. We use the phaser construct to present a generalized tree-based synchronization algorithm and support unified collective operations at both inter-node and intra-node levels. Overcoming these runtime challenges is a first step towards effective programming on extreme-scale systems.
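
Note on tree-based synchronization: the sketch below is only an illustration of the general idea behind tree-based collective operations mentioned in the last abstract. It is not the phaser algorithm from the dissertation. The function name `tree_reduce` and its parameters are hypothetical, and the participants are modelled as plain list entries rather than tasks or nodes. It shows why combining contributions up a tree needs only a logarithmic number of rounds in the number of participants.

```python
from typing import Callable, List, Tuple


def tree_reduce(
    values: List[int],
    combine: Callable[[int, int], int],
    arity: int = 2,
) -> Tuple[int, int]:
    """Combine values level by level up a tree.

    Returns (result, rounds), where rounds is the number of tree levels
    traversed. Hypothetical helper for illustration only.
    """
    if not values:
        raise ValueError("need at least one participant")
    if arity < 2:
        raise ValueError("arity must be at least 2")

    level = list(values)
    rounds = 0
    while len(level) > 1:
        next_level = []
        # Each group of up to `arity` siblings reports to one parent.
        for start in range(0, len(level), arity):
            group = level[start:start + arity]
            acc = group[0]
            for v in group[1:]:
                acc = combine(acc, v)
            next_level.append(acc)
        level = next_level
        rounds += 1
    return level[0], rounds


if __name__ == "__main__":
    total, rounds = tree_reduce(list(range(1, 9)), lambda a, b: a + b)
    print(total, rounds)
```

With eight participants and a binary tree, the values are combined in three rounds; a wider tree (for example `arity=4`) trades fewer rounds for more work per parent.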