Browsing by Author "Pai, Vijay S."
Now showing 1 - 20 of 25
Item Analytic Evaluation of Shared-Memory Systems with ILP Processors (1998-06-20)
Sorin, Daniel J.; Pai, Vijay S.; Adve, Sarita V.; Vernon, Mary K.; Wood, David A.; CITI (http://citi.rice.edu/)

Item Challenges in Computer Architecture Evaluation (2003-08-20)
Skadron, Kevin; Martonosi, Margaret; August, David; Hill, Mark; Lilja, David; Pai, Vijay S.
A report to the US National Science Foundation argues that simulation and benchmarking technology will require a leap in capability within the next few years to maintain ongoing innovation in computer systems.

Item Code Transformations to Improve Memory Parallelism (2000-05-20)
Pai, Vijay S.; Adve, Sarita V.; CITI (http://citi.rice.edu/)
Current microprocessors incorporate techniques to exploit instruction-level parallelism (ILP). However, previous work has shown that these ILP techniques are less effective in removing memory stall time than CPU time, making the memory system a greater bottleneck in ILP-based systems than in previous-generation systems. These deficiencies arise largely because applications present limited opportunities for an out-of-order issue processor to overlap multiple read misses, the dominant source of memory stalls. This work proposes code transformations to increase parallelism in the memory system by overlapping multiple read misses within the same instruction window, while preserving cache locality. We present an analysis and transformation framework suitable for compiler implementation. Our simulation experiments show execution time reductions averaging 20% in a multiprocessor and 30% in a uniprocessor. A substantial part of these reductions comes from increases in memory parallelism.
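As a rough illustration of the kind of miss-clustering transformation this abstract describes, consider unroll-and-jam in C. This is a sketch under our own assumptions (the function names, the 4-way unroll factor, and the matrix size are illustrative, not taken from the paper):

```c
#include <stddef.h>

#define N 1024  /* illustrative matrix dimension, divisible by the unroll factor */

/* Baseline: one row stream at a time. Each cache-line miss on a[i][j]
 * tends to drain the instruction window before the next miss can issue,
 * so read misses are serviced largely one after another. */
double sum_rows_baseline(double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* After unroll-and-jam on the outer loop: four independent row streams
 * are interleaved, so up to four read misses can be outstanding within
 * the same instruction window, while each stream still walks its cache
 * lines sequentially (spatial locality is preserved). */
double sum_rows_clustered(double a[N][N]) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < N; i += 4) {
        for (size_t j = 0; j < N; j++) {
            s0 += a[i][j];
            s1 += a[i + 1][j];
            s2 += a[i + 2][j];
            s3 += a[i + 3][j];
        }
    }
    return s0 + s1 + s2 + s3;
}
```

Whether the hardware actually overlaps the misses depends on the number of outstanding misses the caches support and the instruction window size, which is part of why the paper pairs the transformation with an analysis framework.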
We see similar benefits on a Convex Exemplar.

Item Code Transformations to Improve Memory Parallelism (1999-11-20)
Pai, Vijay S.; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item Comparing and Combining Read Miss Clustering and Software Prefetching (2001-09-20)
Pai, Vijay S.; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item A Customized MVA Model for ILP Multiprocessors (1998-04-20)
Sorin, Daniel J.; Vernon, Mary K.; Pai, Vijay S.; Adve, Sarita V.; Wood, David A.; CITI (http://citi.rice.edu/)
This paper provides the customized MVA equations for an analytical model for evaluating architectural alternatives for shared-memory multiprocessors with processors that aggressively exploit instruction-level parallelism (ILP). Compared to simulation, the analytical model is many orders of magnitude faster to solve, yielding highly accurate system performance estimates in seconds.

Item An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors (1996-10-20)
Pai, Vijay S.; Ranganathan, Parthasarathy; Adve, Sarita V.; Harton, Tracy; CITI (http://citi.rice.edu/)

Item Exploiting Instruction-Level Parallelism for Memory System Performance (2000-08-20)
Pai, Vijay S.; CITI (http://citi.rice.edu/)
Current microprocessors improve performance by exploiting instruction-level parallelism (ILP). ILP hardware techniques such as multiple instruction issue, out-of-order (dynamic) issue, and non-blocking reads can accelerate both computation and data memory references. Since computation speeds have been improving faster than data memory access times, memory system performance is quickly becoming the primary obstacle to achieving high performance. This dissertation focuses on exploiting ILP techniques to improve memory system performance. It includes both an analysis of ILP memory system performance and optimizations developed using the insights of this analysis.
First, this dissertation shows that ILP hardware techniques, used in isolation, are often unsuccessful at improving memory system performance because they fail to extract parallelism among data reads that miss in the processor's caches. The previously studied latency-tolerance technique of software prefetching provides some improvement by initiating data read misses earlier, but it suffers from limitations caused by exposed startup latencies, excessive fetch-ahead distances, and references that are hard to prefetch. This dissertation then uses the above insights to develop compile-time software transformations that improve memory system parallelism and performance. These transformations improve the effectiveness of ILP hardware, reducing exposed latency by over 80% for a latency-detection microbenchmark and reducing execution time by an average of 25% across 14 multiprocessor and uniprocessor cases studied in simulation and by an average of 21% across 12 cases on a real system. These transformations also combine with software prefetching to address key limitations of either latency-tolerance technique alone, providing the best performance when both techniques are combined for most of the uniprocessor and multiprocessor codes that we study. Finally, this dissertation also explores appropriate evaluation methodologies for ILP shared-memory multiprocessors. Memory system parallelism is a key feature determining ILP performance, but it is neglected in previous-generation fast simulators.
This dissertation highlights the errors possible in such simulators and presents new evaluation methodologies to improve the tradeoff between accuracy and evaluation speed.

Item Exploiting Task-Level Concurrency in a Programmable Network Interface (2003-06-20)
Kim, Hyong-youb; Pai, Vijay S.; Rixner, Scott
Programmable network interfaces provide the potential to extend the functionality of network services but lead to instruction processing overheads when compared to application-specific network interfaces. This paper aims to offset those performance disadvantages by exploiting task-level concurrency in the workload to parallelize the network interface firmware for a programmable controller with two processors. By carefully partitioning the handler procedures that process various events related to the progress of a packet, the system can minimize sharing, achieve load balance, and efficiently utilize on-chip storage. Compared to the uniprocessor firmware released by the manufacturer, the parallelized network interface firmware increases throughput by 65% for bidirectional UDP traffic of maximum-sized packets, 157% for bidirectional UDP traffic of minimum-sized packets, and 32-107% for real network services. This parallelization results in performance within 10-20% of a modern ASIC-based network interface for real network services.

Item A Flexible and Efficient Application Programming Interface (API) for a Customizable Proxy Cache (2003-03-20)
Pai, Vivek S.; Cox, Alan; Pai, Vijay S.; Zwaenepoel, Willy
This paper describes the design, implementation, and performance of a simple yet powerful Application Programming Interface (API) for providing extended services in a proxy cache. This API facilitates the development of customized content adaptation, content management, and specialized administration features.
We have developed several modules that exploit this API to perform various tasks within the proxy, including a module to support the Internet Content Adaptation Protocol (ICAP) without any changes to the proxy core. The API design parallels those of high-performance servers, enabling its implementation to have minimal overhead on a high-performance cache. At the same time, it provides the infrastructure required to process HTTP requests and responses at a high level, shielding developers from low-level HTTP and socket details and enabling modules that perform interesting tasks without significant amounts of code. We have implemented this API in the portable and high-performance iMimic DataReactor™ proxy cache. We show that implementing the API imposes negligible performance overhead and that realistic content-adaptation services achieve high performance levels without substantially hindering a background benchmark load running at a high throughput level.

Item The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors (1999-02-20)
Pai, Vijay S.; Ranganathan, Parthasarathy; Abdel-Shafi, Hazim; Adve, Sarita V.; CITI (http://citi.rice.edu/)
Current microprocessors incorporate techniques to aggressively exploit instruction-level parallelism (ILP). This paper evaluates the impact of such processors on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching. Our results show that, while ILP techniques substantially reduce CPU time in multiprocessors, they are less effective in removing memory stall time. Consequently, despite the inherent latency tolerance features of ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors.
The main reasons for these deficiencies are insufficient opportunities in the applications to overlap multiple load misses and increased contention for resources in the system. We also find that software prefetching does not change the memory-bound nature of most of our applications on our ILP multiprocessor, mainly due to a large number of late prefetches and resource contention. Our results suggest the need for additional latency-hiding or latency-reducing techniques for ILP systems, such as software clustering of load misses and producer-initiated communication.

Item The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology (1997-02-20)
Pai, Vijay S.; Ranganathan, Parthasarathy; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology (1997-04-20)
Pai, Vijay S.; CITI (http://citi.rice.edu/)
Current microprocessors exploit high levels of instruction-level parallelism (ILP). This thesis presents the first detailed analysis of the impact of such processors on shared-memory multiprocessors. We find that ILP techniques substantially reduce CPU time in multiprocessors, but are less effective in reducing memory stall time for our applications. Consequently, despite the latency-tolerating techniques incorporated in ILP processors, memory stall time becomes a large component of execution time and parallel efficiencies are generally poorer in our ILP-based multiprocessor than in an otherwise equivalent previous-generation multiprocessor. We identify clustering independent read misses together in the processor instruction window as a key optimization to exploit the ILP features of current processors. We also use the above analysis to examine the validity of direct-execution simulators with previous-generation processor models to approximate ILP-based multiprocessors.
We find that, with appropriate approximations, such simulators can reasonably characterize the behavior of applications with poor overlap of read misses. However, they can be highly inaccurate for applications with high overlap of read misses.

Item Improving networking server performance with programmable network interfaces (2003)
Kim, Hyong-Youb; Rixner, Scott; Pai, Vijay S.
Networking servers, such as web servers, have been widely deployed in recent years. While developments in the operating system and applications continue to improve server performance, programmable network interfaces with local memory provide new opportunities to improve server performance through extended network services on the network interface. However, due to their embedded nature, programmable processors on the network interface may suffer from inadequate processing power when compared to non-programmable application-specific network interfaces. This thesis first shows that exploiting a multiprocessor architecture and task-level concurrency in network interface processing enables programmable network interfaces to overcome the performance disadvantages, relative to application-specific network interfaces, that result from programmability. Then, the thesis presents a network service on a programmable network interface that exploits the storage capacity of the interface to alleviate the local I/O interconnect bottleneck, thereby improving server performance. Thus, these two results show that programmable network interfaces can offset the performance disadvantages due to programmability and improve networking server performance through extended network services that exploit their computation power and storage capacity.

Item Improving the Accuracy vs. Speed Tradeoff for Simulating Shared-Memory Multiprocessors with ILP Processors (1999-01-20)
Durbhakula, Murthy; Pai, Vijay S.; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item Increasing Web Server Throughput with Network Interface Data Caching (2002-10-20)
Kim, Hyong-youb; Pai, Vijay S.; Rixner, Scott; Center for Multimedia Communications (http://cmc.rice.edu/)
This paper introduces network interface data caching, a new technique to reduce local interconnect traffic on networking servers by caching frequently-requested content on a programmable network interface. The operating system on the host CPU determines which data to store in the cache and for which packets it should use data from the cache. To facilitate data reuse across multiple packets and connections, the cache only stores application-level response content (such as HTTP data), with application-level and networking headers generated by the host CPU. Network interface data caching can reduce PCI traffic by up to 57% on a prototype implementation of a uniprocessor web server. This traffic reduction results in up to 31% performance improvement, leading to a peak server throughput of 1571 Mb/s.

Item The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems (1997-06-20)
Ranganathan, Parthasarathy; Pai, Vijay S.; Abdel-Shafi, Hazim; Adve, Sarita V.; CITI (http://citi.rice.edu/)

Item Isolating the Performance Impacts of Network Interface Cards through Microbenchmarks (2004-06-01)
Pai, Vijay S.; Rixner, Scott; Kim, Hyong-youb
This paper studies the impact of network interface cards (NICs) on network server performance, testing six different Gigabit Ethernet NICs. Even with all other hardware and software configurations unchanged, a network service running on a PC-based server can achieve up to 150% more throughput when using the most effective NIC instead of the least effective one.
This paper proposes a microbenchmark suite that isolates the micro-level behaviors of each NIC that shape these performance effects and relates these behaviors back to application performance. Unlike previous networking microbenchmark suites, the new suite focuses only on performance rather than aiming to achieve portability. This choice allows tight integration with the operating system, eliminating nearly all operating system overheads outside of the device driver for the network interface. The results show that the throughputs achieved by both a web server application and a software router have an evident relationship with the microbenchmarks related to handling bidirectional streams and small frames, but not with sends or receives of large frames.

Item Recent Advances in Memory Consistency Models for Hardware Shared Memory Systems (1999-03-20)
Adve, Sarita V.; Pai, Vijay S.; Ranganathan, Parthasarathy; CITI (http://citi.rice.edu/)
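Several abstracts above contrast read-miss clustering with software prefetching and note prefetching's exposed startup latencies and fetch-ahead distances. A minimal sketch of the latter technique, assuming a GCC/Clang toolchain for the non-standard `__builtin_prefetch` intrinsic (the function name and the fetch-ahead distance of 16 elements are illustrative choices of ours, not taken from the papers):

```c
#include <stddef.h>

/* Fetch-ahead distance in elements: too small exposes the miss latency,
 * too large risks evicting data before use -- the kinds of limitations
 * the dissertation abstract above attributes to prefetching. */
#define AHEAD 16

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Non-binding hint to start the read for a[i + AHEAD] now, so its
         * latency overlaps with the next AHEAD iterations of useful work.
         * Arguments: address, rw = 0 (read), locality hint = 1 (low). */
        if (i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD], 0, 1);
        s += a[i];
    }
    return s;
}
```

Unlike the clustering transformations, this hides latency behind computation rather than behind other misses, which is why the papers find the two techniques complementary.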