Browsing by Author "Dwarkadas, Sandhya"
Now showing 1 - 3 of 3
Item An Integrated Compile-Time/Run-Time Software Distributed Shared Memory System (1997-11-17) Cox, Alan; Dwarkadas, Sandhya; Zwaenepoel, Willy
High Performance Fortran (HPF), as well as its predecessor Fortran D, has attracted considerable attention as a promising language for writing portable parallel programs for a wide variety of distributed-memory architectures. Programmers express data parallelism using Fortran90 array operations and use data layout directives to direct the partitioning of the data and computation among the processors of a parallel machine. For HPF to gain acceptance as a vehicle for parallel scientific programming, it must achieve high performance on problems for which it is well suited. To achieve high performance with an HPF program on a distributed-memory parallel machine, an HPF compiler must do a superb job of translating Fortran90 data-parallel array constructs into an efficient sequence of operations that minimize the overhead associated with data movement and also maximize data locality. This dissertation presents and analyzes a set of advanced optimizations designed to improve the execution performance of HPF programs on distributed-memory architectures. Presented is a methodology for performing deep analysis of Fortran90 programs, eliminating the reliance upon pattern matching to drive the optimizations as is done in many Fortran90 compilers. The optimizations address the overhead of data movement, both interprocessor and intraprocessor movement, that results from the translation of Fortran90 array constructs. Additional optimizations address the issues of scalarizing array assignment statements, loop fusion, and data locality. The combination of these optimizations results in a compiler that is capable of optimizing dense matrix stencil computations more completely than all previous efforts in this area.
This work is distinguished by advanced compile-time analysis and optimizations performed at the whole-array level, as opposed to analysis and optimization performed at the loop or array-element levels.
Item Efficient methods for cache performance prediction (1989) Dwarkadas, Sandhya; Jump, J. Robert; Sinclair, James B.
The goal of our work is to develop techniques that accurately and efficiently simulate the behavior of computer systems with cache memories. This thesis describes the design, analysis, and validation of three such methods of cache performance prediction. Execution-driven simulation is a technique that avoids the high overhead associated with instruction-level simulation while retaining most of the accuracy of that technique. We have extended the execution-driven paradigm to develop a time- and space-efficient technique for address trace generation and cache simulation, as well as to provide estimates of overall execution time. The second method that we have developed is an analytical model for the prediction of cache miss ratios using single-process traces. Finally, a simple and efficient estimative simulation technique based on the analytical model and the execution-driven paradigm has been outlined. This approach is demonstrated in the simulation of cache-based multiprocessor systems in conjunction with the Rice Parallel Processing Testbed, which simulates concurrent algorithms on parallel architectures.
Item Synchronization, coherence, and consistency for high performance shared memory multiprocessing (1993) Dwarkadas, Sandhya; Jump, J. Robert; Sinclair, James B.
Although improved device technology has increased the performance of computer systems, fundamental hardware limitations and the need to build faster systems using existing technology have led many computer system designers to consider parallel designs with multiple computing elements. Unfortunately, the design of efficient and scalable multiprocessors has proven to be an elusive goal.
This dissertation describes a hierarchical bus-based multiprocessor architecture, an adaptive cache coherence protocol, and efficient and simple synchronization support that together meet this challenge. We have also developed an execution-driven tool for the simulation of shared-memory multiprocessors, which we use to evaluate the proposed architectural enhancements. Our simulator offers substantial advantages in terms of reduced time and space overheads when compared to instruction-driven or trace-driven simulation techniques, without significant loss of accuracy. The simulator generates correctly interleaved parallel traces at run time, allowing the accurate simulation of a variety of architectural alternatives for a number of programs. Our results provide a quantitative analysis of the viability of large-scale bus-based memory hierarchies. We evaluate the effect on performance of several architectural enhancements, and discuss the tradeoffs between reducing contention and increasing latency as the number of levels in the memory hierarchy is increased. Toward this end, we have developed a cache coherence protocol for a hierarchical bus-based architecture that minimizes total communication overhead by utilizing all available (bus-provided) information. Based on our evaluation, we propose an integrated set of architectural design decisions. These include synchronization using a conditional test&set operation that eliminates excess bus traffic and contention; conditional access scheduling, where bus traffic is reduced by keeping track of pending bus accesses for every cache line; adaptive caching, where each cache line is assigned a coherence protocol based upon the expected or observed access behavior for that line; and the use of relaxed memory consistency models, where writes are aggressively buffered.
We also present a new classification of memory consistency models that, in addition to unifying all existing models into a common framework, provides insight into the implications of these models with respect to access ordering.