Browsing by Author "Adve, Sarita V."
Now showing 1 - 20 of 21
Item: An evaluation of memory consistency models for shared-memory systems with ILP processors (1997)
Ranganathan, Parthasarathy; Adve, Sarita V.
The memory consistency model of a shared-memory multiprocessor determines the extent to which memory operations may be overlapped or reordered for better performance. Studies on previous-generation shared-memory multiprocessors have shown that relaxed memory consistency models like release consistency (RC) can significantly outperform the conceptually simpler model of sequential consistency (SC). Current and next-generation multiprocessors use commodity microprocessors that aggressively exploit instruction-level parallelism (ILP) using methods such as multiple issue, dynamic scheduling, and non-blocking reads. For such processors, researchers have conjectured that two techniques, hardware-controlled non-binding prefetching and speculative reads, have the potential to equalize the hardware performance of memory consistency models. These techniques have recently begun to appear in commercial microprocessors, and reopen the question of whether the performance benefits of release consistency justify its added programming complexity. This thesis performs the first detailed quantitative comparison of several implementations of sequential consistency and release consistency optimized for aggressive ILP processors. Our results indicate that although hardware prefetching and speculative reads dramatically improve the performance of sequential consistency, the simplest RC version continues to significantly outperform the most optimized SC version. Additionally, the performance of SC is highly sensitive to the cache write policy and the aggressiveness of the cache-coherence protocol, while the performance of RC is generally stable across all implementations. Overall, our results show that RC hardware has significant performance benefits over SC hardware and, at the same time, requires less system complexity with ILP processors.
Memory write latencies that hardware prefetching and speculative loads cannot hide are the main reason for the performance difference between SC and RC.

Item: Analytic Evaluation of Shared-Memory Systems with ILP Processors (1998-06-20)
Sorin, Daniel J.; Pai, Vijay S.; Adve, Sarita V.; Vernon, Mary K.; Wood, David A.; CITI (http://citi.rice.edu/)

Item: Code Transformations to Improve Memory Parallelism (2000-05-20)
Pai, Vijay S.; Adve, Sarita V.; CITI (http://citi.rice.edu/)
Current microprocessors incorporate techniques to exploit instruction-level parallelism (ILP). However, previous work has shown that these ILP techniques are less effective in removing memory stall time than CPU time, making the memory system a greater bottleneck in ILP-based systems than in previous-generation systems. These deficiencies arise largely because applications present limited opportunities for an out-of-order issue processor to overlap multiple read misses, the dominant source of memory stalls. This work proposes code transformations to increase parallelism in the memory system by overlapping multiple read misses within the same instruction window, while preserving cache locality. We present an analysis and transformation framework suitable for compiler implementation. Our simulation experiments show execution time reductions averaging 20% in a multiprocessor and 30% in a uniprocessor. A substantial part of these reductions comes from increases in memory parallelism.
We see similar benefits on a Convex Exemplar.

Item: Code Transformations to Improve Memory Parallelism (1999-11-20)
Pai, Vijay S.; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item: Comparing and Combining Read Miss Clustering and Software Prefetching (2001-09-20)
Pai, Vijay S.; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item: A Customized MVA Model for ILP Multiprocessors (1998-04-20)
Sorin, Daniel J.; Vernon, Mary K.; Pai, Vijay S.; Adve, Sarita V.; Wood, David A.; CITI (http://citi.rice.edu/)
This paper provides the customized MVA equations for an analytical model for evaluating architectural alternatives for shared-memory multiprocessors with processors that aggressively exploit instruction-level parallelism (ILP). Compared to simulation, the analytical model is many orders of magnitude faster to solve, yielding highly accurate system performance estimates in seconds.

Item: An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors (1996-10-20)
Pai, Vijay S.; Ranganathan, Parthasarathy; Adve, Sarita V.; Harton, Tracy; CITI (http://citi.rice.edu/)

Item: Exploiting instruction-level parallelism for memory system performance (2000)
Pai, Vijay Sadananda; Adve, Sarita V.
Current microprocessors improve performance by exploiting instruction-level parallelism (ILP). ILP hardware techniques such as multiple instruction issue, out-of-order (dynamic) issue, and non-blocking reads can accelerate both computation and data memory references. Since computation speeds have been improving faster than data memory access times, memory system performance is quickly becoming the primary obstacle to achieving high performance. This dissertation focuses on exploiting ILP techniques to improve memory system performance; it includes both an analysis of ILP memory system performance and optimizations developed from the insights of that analysis.
First, this dissertation shows that ILP hardware techniques, used in isolation, are often unsuccessful at improving memory system performance because they fail to extract parallelism among data reads that miss in the processor's caches. The previously studied latency-tolerance technique of software prefetching provides some improvement by initiating data read misses earlier, but also suffers from limitations caused by exposed startup latencies, excessive fetch-ahead distances, and references that are hard to prefetch. This dissertation then uses the above insights to develop compile-time software transformations that improve memory system parallelism and performance. These transformations improve the effectiveness of ILP hardware, reducing exposed latency by over 80% for a latency-detection microbenchmark and reducing execution time by an average of 25% across 14 multiprocessor and uniprocessor cases studied in simulation and by an average of 21% across 12 cases on a real system. These transformations also combine with software prefetching to address key limitations of either latency-tolerance technique alone, providing the best performance when both techniques are combined for most of the uniprocessor and multiprocessor codes that we study. Finally, this dissertation also explores appropriate evaluation methodologies for ILP shared-memory multiprocessors. Memory system parallelism is a key feature determining ILP performance, but is neglected in previous-generation fast simulators. This dissertation highlights the errors possible in such simulators and presents new evaluation methodologies that improve the tradeoff between accuracy and evaluation speed.

Item: Fine-grain producer-initiated communication in cache-coherent multiprocessors (1997)
Abdel-Shafi, Hazim M.; Adve, Sarita V.
Shared-memory multiprocessors are becoming increasingly popular as a high-performance, easy-to-program, and relatively inexpensive choice for parallel computation.
However, the performance of shared-memory multiprocessors is limited by memory latency. Memory latencies are higher in multiprocessors due to physical constraints and cache-coherence overheads. In addition, synchronization operations, which are necessary to ensure correctness in parallel programs, add further communication overhead in shared-memory multiprocessors. Software-controlled non-binding data prefetching is a widely used consumer-initiated mechanism to hide communication latency and is currently supported on most architectures. However, on an invalidation-based cache-coherent multiprocessor, prefetching is inapplicable or insufficient for some communication patterns, such as irregular communication, fine-grain pipelined loops, and synchronization. For these cases, a combination of two fine-grain, producer-initiated primitives (referred to as remote writes) is better able to reduce the latency of communication. This work demonstrates experimentally that remote writes provide significant performance benefits in cache-coherent shared-memory multiprocessors both with and without prefetching. Further, the combination of remote writes and prefetching eliminates most of the memory system overheads in our applications, except for misses due to cache conflicts.

Item: General-purpose architectures for media processing and database workloads (2000)
Ranganathan, Parthasarathy; Adve, Sarita V.
Workloads on general-purpose computing systems have changed dramatically over the past few years, with greater emphasis on emerging compute-intensive applications such as media processing and databases. However, until recently, most high-performance computing studies have focused primarily on scientific and engineering workloads, potentially leading to designs not suitable for these emerging workloads. This dissertation addresses this limitation.
Our key contributions include (i) the first detailed quantitative simulation-based studies of the performance of media processing and database workloads on systems using state-of-the-art processors, and (ii) cost-effective architectural solutions targeted at achieving the higher performance requirements of future systems running these workloads.

The first part of the dissertation focuses on media processing workloads. We study the effectiveness of state-of-the-art features (techniques to extract instruction-level parallelism, media instruction-set extensions, software prefetching, and large caches). Our results identify two key trends: (i) media workloads on current general-purpose systems are primarily compute-bound, and (ii) current trends towards devoting a large fraction of on-chip transistors (up to 80%) to caches can often be ineffective for media workloads. In response to these trends, we propose and evaluate a new cache organization, called reconfigurable caches. Reconfigurable caches allow the on-chip cache transistors to be dynamically divided into partitions that can be used for other activities (e.g., instruction memoization, application-controlled memory, and prefetching buffers), including optimizations that address the compute bottleneck. Our design of the reconfigurable cache requires relatively few modifications to existing cache structures and has a small impact on cache access times.

The second part of the dissertation evaluates the performance of database workloads, such as online transaction processing and decision support systems, on shared-memory multiprocessor servers with state-of-the-art processors. Our main results show that the key performance-limiting characteristics of online transaction processing workloads are (i) large instruction footprints (leading to instruction cache misses) and (ii) frequent data communication (leading to cache-to-cache misses).
We show that both these inefficiencies can be addressed with simple, cost-effective optimizations. Additionally, our analysis of optimized memory consistency models with state-of-the-art processors suggests that the choice of the hardware consistency model may not be a dominant factor for database workloads.

Item: The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors (1999-02-20)
Pai, Vijay S.; Ranganathan, Parthasarathy; Abdel-Shafi, Hazim; Adve, Sarita V.; CITI (http://citi.rice.edu/)
Current microprocessors incorporate techniques to aggressively exploit instruction-level parallelism (ILP). This paper evaluates the impact of such processors on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching. Our results show that, while ILP techniques substantially reduce CPU time in multiprocessors, they are less effective in removing memory stall time. Consequently, despite the inherent latency-tolerance features of ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. The main reasons for these deficiencies are insufficient opportunities in the applications to overlap multiple load misses and increased contention for resources in the system. We also find that software prefetching does not change the memory-bound nature of most of our applications on our ILP multiprocessor, mainly due to a large number of late prefetches and resource contention.
Our results suggest the need for additional latency-hiding or latency-reducing techniques for ILP systems, such as software clustering of load misses and producer-initiated communication.

Item: The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology (1997-02-20)
Pai, Vijay S.; Ranganathan, Parthasarathy; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item: Improving the Accuracy vs. Speed Tradeoff for Simulating Shared-Memory Multiprocessors with ILP Processors (1999-01-20)
Durbhakula, Murthy; Pai, Vijay S.; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item: Improving the speed vs. accuracy tradeoff for simulating shared-memory multiprocessors with ILP processors (1998)
Durbhakula, Suryanarayana N. Murthy; Adve, Sarita V.
Current simulators for shared-memory multiprocessor architectures involve a large tradeoff between simulation speed and accuracy. Most simulators assume much simpler processors than the current generation of processors that aggressively exploit instruction-level parallelism (ILP), which can result in large simulation inaccuracies. A few newer simulators model current ILP processors more accurately, but are about ten times slower. This study proposes and evaluates a new simulation technique that requires almost no compromise in accuracy and far less compromise in speed compared to the state of the art. This technique uses a novel adaptation of direct execution, a methodology widely used for simulating multiprocessors with simple processors. We develop a new simulator based on this technique, called DirectRSIM, and compare its performance and accuracy with three other simulators: two current direct-execution simulators that use a simple processor model, and RSIM, a state-of-the-art detailed simulator for multiprocessors with ILP processors.
For various combinations of applications and system configurations, we find that DirectRSIM is on average 4 times faster than RSIM, with an average relative error of 1.6%. In contrast, the current direct-execution simulators show large and variable errors relative to RSIM: around 40% on average with the best methodology and 130% with the most commonly used methodology. Despite its superior accuracy, DirectRSIM achieves a speed within a factor of 2.7 of that achieved by the current direct-execution simulators with simple processors. Although the performance advantage of simple-processor-based simulators is still significant, it may no longer be enough to justify the errors that such simulators incur in modeling the performance of shared-memory systems with state-of-the-art processors.

Item: The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems (1997-06-20)
Ranganathan, Parthasarathy; Pai, Vijay S.; Abdel-Shafi, Hazim; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item: Recent Advances in Memory Consistency Models for Hardware Shared Memory Systems (1999-03-20)
Adve, Sarita V.; Pai, Vijay S.; Ranganathan, Parthasarathy; CITI (http://citi.rice.edu/)

Item: RSIM Reference Manual: Version 1.0 (1997-08-20)
Pai, Vijay S.; Ranganathan, Parthasarathy; Adve, Sarita V.; CITI (http://citi.rice.edu/)
Simulation has emerged as an important method for evaluating new ideas in both uniprocessor and multiprocessor architecture. Compared to building real hardware, simulation provides at least two advantages. First, it provides the flexibility to modify various architectural parameters and components and to analyze the benefits of such modifications. Second, simulation allows detailed statistics collection, providing a better understanding of the tradeoffs involved and facilitating further performance tuning. This document describes RSIM, the Rice Simulator for ILP Multiprocessors (Version 1.0).
RSIM is an execution-driven simulator primarily designed to study shared-memory multiprocessor architectures built from state-of-the-art processors. Compared to other currently available public shared-memory simulators, the key advantage of RSIM is that it supports a processor model that aggressively exploits instruction-level parallelism (ILP) and is more representative of current and near-future processors. Currently available shared-memory simulators assume a much simpler processor model and can exhibit significant inaccuracies when used to study the behavior of shared-memory multiprocessors built from state-of-the-art ILP processors. The cost of RSIM's increased accuracy and detail is that it is slower than simulators that do not model the processor. We have used RSIM at Rice for our research in computer architecture, as well as for undergraduate and graduate architecture courses covering both uniprocessor and multiprocessor architectures.

Item: RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors (1997-10-20)
Pai, Vijay S.; Ranganathan, Parthasarathy; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item: Rsim: Simulating Shared-Memory Multiprocessors with ILP Processors (2002-02-20)
Hughes, Christopher J.; Pai, Vijay S.; Ranganathan, Parthasarathy; Adve, Sarita V.; CITI (http://citi.rice.edu/)
Rsim is a publicly available architecture simulator for shared-memory systems built from processors that aggressively exploit instruction-level parallelism. Modeling ILP features in a multiprocessor is particularly important for applications that exhibit parallelism among read misses.

Item: The impact of instruction-level parallelism on multiprocessor performance and simulation methodology (1997)
Pai, Vijay Sadananda; Adve, Sarita V.
Current microprocessors exploit high levels of instruction-level parallelism (ILP). This thesis presents the first detailed analysis of the impact of such processors on shared-memory multiprocessors.
We find that ILP techniques substantially reduce CPU time in multiprocessors, but are less effective in reducing memory stall time for our applications. Consequently, despite the latency-tolerating techniques incorporated in ILP processors, memory stall time becomes a larger component of execution time and parallel efficiencies are generally poorer in our ILP-based multiprocessor than in an otherwise equivalent previous-generation multiprocessor. We identify clustering independent read misses together in the processor instruction window as a key optimization to exploit the ILP features of current processors. We also use the above analysis to examine the validity of direct-execution simulators with previous-generation processor models to approximate ILP-based multiprocessors. We find that, with appropriate approximations, such simulators can reasonably characterize the behavior of applications with poor overlap of read misses. However, they can be highly inaccurate for applications with high overlap of read misses.
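Several of the items above center on read-miss clustering: transforming loops so that multiple independent read misses fall within the same instruction window, where an out-of-order processor can overlap them. As a rough illustration only (a sketch in the spirit of the unroll-and-jam transformation these works discuss, not code from any of the papers; the array shape and unroll factor are arbitrary choices):

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 64
#define COLS 64

/* Baseline: column-wise traversal of a row-major array. Successive
   inner iterations touch addresses COLS doubles apart, so on a large
   enough array most iterations miss, and only one miss at a time is
   exposed to the out-of-order window. */
double sum_baseline(double a[ROWS][COLS]) {
    double s = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}

/* Unroll-and-jam: unroll the outer (j) loop by 4 and fuse the copies.
   Each inner iteration now issues four independent reads to four
   different columns, so up to four misses can be outstanding and
   overlapped instead of being serviced one after another. */
double sum_clustered(double a[ROWS][COLS]) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t j = 0; j < COLS; j += 4)   /* assumes COLS % 4 == 0 */
        for (size_t i = 0; i < ROWS; i++) {
            s0 += a[i][j];
            s1 += a[i][j + 1];
            s2 += a[i][j + 2];
            s3 += a[i][j + 3];
        }
    return s0 + s1 + s2 + s3;
}
```

Whether the clustered version actually wins depends on cache geometry, the number of outstanding misses the processor and memory system support, and register pressure; quantifying exactly these interactions, and combining the transformation with software prefetching, is what the papers listed here evaluate.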