Browsing by Author "Cox, Alan L."
Now showing 1 - 20 of 36
Item A consistent and transparent solution for caching dynamic Web content (2005) Mittal, Sumit; Cox, Alan L.
Caching is an effective means of reducing load on web servers, especially those that generate documents dynamically. While adding caching to a web application can greatly reduce response times, the logic to ensure consistency with the back-end database requires considerable effort to develop. Much of the complexity lies in minimizing unnecessary page invalidations, a key goal for improving the cache hit rate and response times. In this thesis I explore a range of invalidation policies that are progressively more precise; one policy is more precise than another if it produces fewer false positives (removals of valid pages). A contribution of this work is achieving precise invalidations at the application-server layer automatically. To explore these issues, I introduce AutoWebCache, a system that automatically adds server-side caching of dynamic content to web applications with a back-end database. To achieve automation, it uses aspect-oriented programming to inject the cache code into the application. Dependencies between read and write requests are determined automatically at run time. Formulating the dependencies requires SQL query analysis at run time, which is costly; I demonstrate how to reduce this dynamic-analysis overhead through effective caching of intermediate analysis results. On two e-commerce benchmarks, RUBiS and TPC-W, I show the method can be highly effective, reducing response times by 63% and 97%, respectively.
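A minimal sketch of the kind of read/write dependency tracking this abstract describes, at table granularity; the regex, class, and helper names are illustrative assumptions, not AutoWebCache's actual design:

```python
import re

# Toy dependency analysis for a dynamic-content cache: a cached page is
# invalidated when a write touches any table its queries read from.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN|UPDATE|INTO)\s+(\w+)", re.IGNORECASE)

def tables_of(sql):
    """Very rough read/write-set extraction at table granularity."""
    return set(t.lower() for t in TABLE_RE.findall(sql))

class PageCache:
    def __init__(self):
        self.pages = {}           # url -> rendered page
        self.read_sets = {}       # url -> tables the page's queries read
        self.analysis_cache = {}  # sql -> tables (caches the costly analysis)

    def analyze(self, sql):
        if sql not in self.analysis_cache:        # reuse intermediate results
            self.analysis_cache[sql] = tables_of(sql)
        return self.analysis_cache[sql]

    def store(self, url, page, queries):
        self.pages[url] = page
        self.read_sets[url] = set().union(*(self.analyze(q) for q in queries))

    def on_write(self, sql):
        written = self.analyze(sql)
        stale = [u for u, r in self.read_sets.items() if r & written]
        for u in stale:                            # invalidate only true overlaps
            self.pages.pop(u)
            self.read_sets.pop(u)

cache = PageCache()
cache.store("/item/3", "<html>...</html>", ["SELECT * FROM items WHERE id=3"])
cache.on_write("UPDATE users SET name='x' WHERE id=7")  # no overlap: page kept
cache.on_write("UPDATE items SET price=9 WHERE id=3")   # overlap: page evicted
print("/item/3" in cache.pages)  # False
```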
The increasing number of datacenter applications with heavy one-to-many communications has raised the need for an efficient group data delivery solution. We propose a clean-slate architecture that uses optical multicast technology to enable ultra-fast, energy-efficient, low cost, and highly reliable group data delivery in the datacenter. Since the optical components are agnostic of existing communication protocols, I design novel control mechanisms to coordinate datacenter applications with the optical network. Applications send explicit requests for group data delivery through an API exposed by a centralized controller. Based on the collected traffic demands, the controller computes optical resource allocations using a proposed control algorithm to maximize utilization of the optical network. Finally, the controller changes the optical network topology according to the computation decision and sets forwarding rules to route traffic to the correct data paths. I evaluate the optimality and complexity of the control algorithm with real datacenter traffic. It achieves near optimal solutions in almost all experiment cases and can finish computation instantaneously on a large datacenter setting. I also develop a set of simulators to compare the performance of our system against a number of state-of-the-art group data delivery approaches, such as the non-blocking datacenter architecture, datacenter BitTorrent, datacenter IP multicast, etc. Extensive simulations using synthetic traffic show our solution can provide an order of magnitude performance improvement. Tradeoffs of our system are analyzed quantitatively as well.Item Designing Scalable Networks for Future Large Datacenters(2012-09-05) Stephens, Brent; Cox, Alan L.; Rixner, Scott; Ng, T. S. Eugene; Carter, JohnModern datacenters require a network with high cross-section bandwidth, fine-grained security, support for virtualization, and simple management that can scale to hundreds of thousands of hosts at low cost. This thesis first presents the firmware for Rain Man, a novel datacenter network architecture that meets these requirements, and then performs a general scalability study of the design space. The firmware for Rain Man, a scalable Software-Defined Networking architecture, employs novel algorithms and uses previously unused forwarding hardware. This allows Rain Man to scale at high performance to networks of forty thousand hosts on arbitrary network topologies. In the general scalability study of the design space of SDN architectures, this thesis identifies three different architectural dimensions common among the networks: source versus hop-by-hop routing, the granularity at which flows are routed, and arbitrary versus restrictive routing and finds that a source-routed, host-pair granularity network with arbitrary routes is the most scalable.Item Efficient virtualization of network interfaces without sacrificing safety and transparency(2009) Ram, Kaushik Kumar; Cox, Alan L.In modern day data centers economics is motivating server consolidation. Today, machine virtualization is being widely used to implement server consolidation. While great strides have been made in efficient virtualization of the machine's processors and memory, virtualization of I/O devices still incurs significant overheads. Xen uses the driver domain I/O model to support I/O virtualization. This model offers benefits such as fault isolation and device transparency. 
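The abstract leaves the control algorithm itself abstract. A minimal greedy sketch of the flavor of the allocation problem, assuming a fixed pool of optical splitter ports and per-group traffic demands; the names and the greedy heuristic are illustrative, not the paper's algorithm:

```python
# Toy allocator: give the optical multicast fabric to the groups that move
# the most traffic, subject to a limited number of splitter ports.
def allocate(groups, splitter_ports):
    """groups: list of (group_id, receiver_count, demand_bytes)."""
    chosen, remaining = [], splitter_ports
    # Serve heavy groups first to maximize traffic carried optically.
    for gid, receivers, demand in sorted(groups, key=lambda g: -g[2]):
        if receivers <= remaining:
            chosen.append(gid)
            remaining -= receivers
    return chosen  # everything else falls back to the packet network

groups = [("shuffle", 40, 9e9), ("vm-clone", 12, 4e9), ("backup", 80, 2e9)]
print(allocate(groups, splitter_ports=64))  # ['shuffle', 'vm-clone']
```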
Item Designing Scalable Networks for Future Large Datacenters (2012-09-05) Stephens, Brent; Cox, Alan L.; Rixner, Scott; Ng, T. S. Eugene; Carter, John
Modern datacenters require a network with high cross-section bandwidth, fine-grained security, support for virtualization, and simple management that can scale to hundreds of thousands of hosts at low cost. This thesis first presents the firmware for Rain Man, a novel datacenter network architecture that meets these requirements, and then performs a general scalability study of the design space. The firmware for Rain Man, a scalable Software-Defined Networking architecture, employs novel algorithms and uses previously unused forwarding hardware, allowing Rain Man to scale at high performance to networks of forty thousand hosts on arbitrary network topologies. In the general scalability study of the design space of SDN architectures, this thesis identifies three architectural dimensions common among the networks: source versus hop-by-hop routing, the granularity at which flows are routed, and arbitrary versus restrictive routing. It finds that a source-routed network with host-pair flow granularity and arbitrary routes is the most scalable.

Item Efficient virtualization of network interfaces without sacrificing safety and transparency (2009) Ram, Kaushik Kumar; Cox, Alan L.
In modern datacenters, economics is motivating server consolidation, and machine virtualization is widely used to implement it. While great strides have been made in efficient virtualization of the machine's processors and memory, virtualization of I/O devices still incurs significant overheads. Xen uses the driver domain I/O model to support I/O virtualization. This model offers benefits such as fault isolation and device transparency. However, the processing overheads incurred in the driver domain to achieve these benefits limit overall I/O performance. This thesis presents mechanisms and optimizations that reduce the overhead of network interface virtualization under the driver domain model without sacrificing its benefits. In particular, it demonstrates the effectiveness of two approaches to reducing the CPU overhead of network I/O virtualization. First, Xen is modified to support multi-queue network interfaces, eliminating the software overheads of packet de-multiplexing and copying. Second, a new grant mechanism is developed to reduce memory-sharing overheads. The thesis also presents and evaluates a series of optimizations that substantially reduce the I/O virtualization overheads in the guest domain. In combination, these mechanisms and optimizations increase the maximum receive throughput achieved by guest domains from 3.0 Gb/s to the full 10 Gigabit Ethernet link rate.
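A minimal sketch of the general idea the grant-mechanism contribution gestures at, reusing grants over a pool of recycled I/O buffers instead of issuing and revoking a grant per packet; the classes and counters are illustrative assumptions, not Xen's actual grant-table interface:

```python
class GrantTable:
    """Toy stand-in for hypervisor-mediated memory sharing."""
    def __init__(self):
        self.ops = 0  # each grant/revoke models an expensive operation

    def grant(self, buf):
        self.ops += 1
        return id(buf)

    def revoke(self, ref):
        self.ops += 1

def naive_receive(table, n_packets):
    # One grant + one revoke per packet: sharing costs grow with traffic.
    for _ in range(n_packets):
        ref = table.grant(bytearray(2048))
        table.revoke(ref)

def pooled_receive(table, n_packets, pool_size=4):
    # Grant a small pool of buffers once, then recycle them across packets.
    refs = [table.grant(bytearray(2048)) for _ in range(pool_size)]
    delivered = [refs[i % pool_size] for i in range(n_packets)]  # no new grants
    for ref in refs:
        table.revoke(ref)

t1, t2 = GrantTable(), GrantTable()
naive_receive(t1, 1000)
pooled_receive(t2, 1000)
print(t1.ops, t2.ops)  # 2000 vs 8
```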
Item Enhancing ethernet's reliability and scalability (2008) Elmeleegy, Khaled; Cox, Alan L.
Ethernet is pervasive, due in part to its ease of use: equipment can be added to an Ethernet network with little or no manual configuration, and Ethernet is self-healing in the event of equipment failure or removal. Unfortunately, it suffers from significant reliability and scalability problems. Ethernet's distributed forwarding-topology computation protocol, the Rapid Spanning Tree Protocol (RSTP), is known to suffer from a classic count-to-infinity problem, yet the cause and implications of this problem are neither documented nor understood. In this dissertation, we identify the exact conditions under which the count-to-infinity problem manifests itself, and we characterize its effect on forwarding-topology convergence. We have also discovered that a forwarding loop can form during count to infinity, and we provide a detailed explanation. In addition to count-to-infinity-induced forwarding loops, Ethernet is known to suffer from other types of forwarding loops. A forwarding loop can cause packet loss and duplication, which in some cases may persist indefinitely. To address these reliability problems, we propose two solutions. First, we introduce the EtherFuse, a new device that can be inserted into an existing Ethernet. In the event of failures, the EtherFuse speeds the reconfiguration of the spanning tree and suppresses packet duplication. EtherFuse is backward compatible and requires no change to the existing hardware, software, or protocols. We describe a prototype EtherFuse implementation and experimentally demonstrate its effectiveness: specifically, we characterize how quickly it responds to network failures, its ability to reduce packet loss and duplication, and its benefits for the end-to-end performance of common applications. Second, we propose a simple yet effective modification to the standard RSTP protocol, called RSTP with Epochs. This solution guarantees that the forwarding topology converges in at most one round-trip time across the network and eliminates the possibility of count-to-infinity-induced forwarding loops. In addition to these reliability problems, Ethernet also faces scalability challenges due to broadcast traffic. We studied and characterized broadcast traffic in Ethernet networks using real traces, and found that broadcast is mainly used for service and resource discovery. For example, the Address Resolution Protocol (ARP) uses broadcast to discover the MAC address that corresponds to a network-layer address. To avoid broadcast requests for service and resource discovery, we propose that the network cache this information and serve those requests from the cache. To this end, we introduce a new device, the EtherProxy*, which uses caching to suppress broadcast traffic. EtherProxy is backward compatible, requires no changes to existing hardware, software, or protocols, and requires no configuration. In our evaluation, we used real and synthetic workloads. The synthetic workloads simulate broadcast traffic in very large Ethernet networks and are constructed with the aid of our characterization of broadcast traffic in real traces. Using both workloads, we experimentally demonstrate the effectiveness of the EtherProxy. (*Not to be confused with Proxy ARP [40].)
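A minimal sketch of the broadcast-suppression idea behind an EtherProxy-style device, using ARP as the example protocol; the class and its methods are illustrative, not the dissertation's implementation:

```python
class BroadcastSuppressor:
    """Answers ARP-style discovery requests from a cache when it can."""
    def __init__(self):
        self.arp_cache = {}   # IP address -> MAC address
        self.broadcasts = 0   # requests that still had to be flooded

    def on_request(self, sender_ip, sender_mac, target_ip):
        self.arp_cache[sender_ip] = sender_mac   # learn from the request
        if target_ip in self.arp_cache:
            return ("reply", self.arp_cache[target_ip])  # broadcast suppressed
        self.broadcasts += 1
        return ("flood", None)   # unknown binding: fall back to broadcast

    def on_reply(self, ip, mac):
        self.arp_cache[ip] = mac                 # learn from the reply

proxy = BroadcastSuppressor()
print(proxy.on_request("10.0.0.1", "aa:00:00:01", "10.0.0.2"))  # ('flood', None)
proxy.on_reply("10.0.0.2", "aa:00:00:02")
print(proxy.on_request("10.0.0.3", "aa:00:00:03", "10.0.0.2"))  # cache hit
```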
Item Exploiting address space contiguity to accelerate TLB miss handling (2010) Barr, Thomas W.; Cox, Alan L.
The traditional CPU-bound applications of the past have been replaced by multiple concurrent data-driven applications that use large amounts of memory. These applications, including databases and virtualization, put high stress on the virtual memory system, which can impose up to a 50% performance overhead for some applications. Virtualization compounds this problem, where the overhead can be upwards of 90%. While much research has been done on reducing the number of TLB misses, they cannot be eliminated entirely. This thesis examines three techniques for reducing the cost of TLB miss handling. We test each against real-world workloads and find that the techniques that exploit coarse-grained locality in virtual address use, and the contiguity found in page tables, show the best performance. The first technique reduces the overhead of multi-level page tables, such as those used in x86-64, with a dedicated MMU cache. We show that the most effective MMU caches are translation caches, which store partial translations and allow the page-walk hardware to skip one or more levels of the page table. In recent years, both AMD and Intel processors have implemented MMU caches; however, their implementations are quite different and represent distinct points in the design space. This thesis introduces three new MMU cache structures that round out the design space and directly compares the effectiveness of all five organizations. This comparison shows that two of the newly introduced structures, both translation cache variants, are better than existing structures in many situations. Second, this thesis examines the relative effectiveness of different page table organizations. Generally speaking, earlier studies concluded that organizations based on hashing, such as the inverted page table, outperformed organizations based on radix trees for supporting large virtual address spaces. However, these studies did not take into account the possibility of caching page table entries from the higher levels of the radix tree. This work shows that any of the five MMU cache structures will reduce radix-tree page table DRAM accesses far below those of an inverted page table. Finally, we present a novel device, the SpecTLB, that exploits alignment in the mapping from virtual address to physical address to interpolate translations without any memory accesses at all. Operating system support for automatic page size selection leaves many small pages aligned within large-page "reservations". While large pages improve TLB coverage, they limit the control the operating system has over memory allocation and protection. Our device allows the latency penalty of small pages to be avoided while maintaining fine-grained allocation and protection.
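A minimal sketch of the interpolation idea behind SpecTLB, assuming 4 KB pages aligned inside 2 MB reservations; the structures and addresses are illustrative, and the real device verifies each speculation with a concurrent page walk:

```python
PAGE, RESERVATION = 4096, 2 * 1024 * 1024

# Known reservations: aligned virtual base -> aligned physical base.
reservations = {0x7f0000000000: 0x42200000}   # illustrative addresses

def speculate_translation(va):
    """Predict a physical address by interpolating within a reservation."""
    res_base = va & ~(RESERVATION - 1)        # reservation this va falls in
    if res_base in reservations:
        offset = va - res_base                # small pages keep their offset
        return reservations[res_base] + offset
    return None                               # no prediction: normal TLB miss

va = 0x7f0000000000 + 5 * PAGE + 0x123
pa = speculate_translation(va)
print(hex(pa))  # 0x42205123, usable speculatively while the page walk verifies
```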
Item GD-Wheel: A Cost-Aware Replacement Policy for Key-Value Stores (2014-05-15) Li, Conglong; Cox, Alan L.; Rixner, Scott; Mellor-Crummey, John
Various memory-based key-value stores, such as Memcached and Redis, are used to speed up dynamic web applications. Specifically, they are used to cache the results of computations, such as database queries. Currently, these key-value stores use either LRU or an LRU approximation as the replacement policy for choosing a key-value pair to evict from the store. However, if the cost of recomputing cached values varies significantly, as in the RUBiS and TPC-W benchmarks, then neither of these replacement policies is the best choice: when deciding which key-value pair to replace, it can be advantageous to take the cost of recomputation into consideration. To that end, this thesis proposes a new cost-aware replacement policy, GD-Wheel, which seamlessly integrates recency of access and cost of recomputation. The thesis applies GD-Wheel to Memcached and evaluates its performance using the Yahoo! Cloud Serving Benchmark. The evaluation shows that GD-Wheel, when compared to LRU, greatly reduces the total recomputation cost, as well as the average and 99th-percentile read access latency for the application.
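GD-Wheel builds on the GreedyDual family of cost-aware policies. A minimal sketch of that underlying idea, integrating recency and recomputation cost into one credit value; this linear-scan version is illustrative only, since the actual design uses hierarchical cost wheels to make these operations cheap:

```python
class GreedyDualCache:
    """Evict the entry with the lowest (recency-inflated + cost) credit."""
    def __init__(self, capacity):
        self.capacity, self.inflation = capacity, 0.0
        self.credit = {}   # key -> priority
        self.values = {}

    def put(self, key, value, cost):
        if len(self.values) >= self.capacity and key not in self.values:
            victim = min(self.credit, key=self.credit.get)
            # Aging: future entries start from the victim's credit level,
            # so long-untouched entries eventually become evictable.
            self.inflation = self.credit.pop(victim)
            del self.values[victim]
        self.credit[key] = self.inflation + cost  # costly items resist eviction
        self.values[key] = value

    def get(self, key, recompute, cost):
        if key not in self.values:
            self.put(key, recompute(), cost)      # pay the recomputation cost
        else:
            self.credit[key] = self.inflation + cost  # refresh on hit
        return self.values[key]

c = GreedyDualCache(capacity=2)
c.put("cheap", 1, cost=1.0)
c.put("expensive", 2, cost=50.0)
c.put("new", 3, cost=1.0)      # evicts "cheap", not the costly entry
print(sorted(c.values))        # ['expensive', 'new']
```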
Item Handling Congestion and Routing Failures in Data Center Networking (2015-09-01) Stephens, Brent; Cox, Alan L.; Rixner, Scott; Ng, T. S. Eugene; Zhong, Lin
Today's data center networks are built from highly reliable components. Nonetheless, given the current scale of data center networks and the bursty traffic patterns of data center applications, at any given point in time it is likely that the network is experiencing either a routing failure or a congestion failure. This thesis introduces new solutions to each of these problems individually, and the first combined solutions to both problems for data center networks. To solve routing failures, which can lead to both packet loss and loss of connectivity, this thesis proposes a new approach to local fast failover that allows traffic to be quickly rerouted. Because forwarding-table state limits both the fault tolerance and the largest network size implementable with local fast failover, this thesis introduces a new forwarding-table compression algorithm and Plinko, a compressible forwarding model. Combined, these contributions enable forwarding tables that contain routes for all pairs of hosts and that can reroute traffic even given multiple arbitrary link failures, on topologies with tens of thousands of hosts. To solve congestion failures, this thesis presents TCP-Bolt, which uses lossless Ethernet to prevent packets from ever being dropped. Unlike prior work, this thesis demonstrates that enabling lossless Ethernet does not reduce aggregate forwarding throughput in data center networks. Further, TCP-Bolt can significantly reduce flow completion times for medium-sized flows by allowing TCP slow start to be eliminated. Unfortunately, using lossless Ethernet to solve congestion failures introduces a new failure mode, deadlock, which can render the entire network unusable. No existing fault-tolerant forwarding model is deadlock-free, so this thesis introduces deadlock-free Plinko and deadlock-free edge-disjoint spanning tree (DF-EDST) resilience, the first deadlock-free fault-tolerant forwarding models for data center networks. The thesis shows that deadlock-free Plinko does not impact forwarding throughput, although the number of virtual channels it requires increases with topology size and fault tolerance. DF-EDST, on the other hand, provides deadlock-free local fast failover without needing virtual channels. With DF-EDST resilience, fewer than one in a million of the flows in data center networks with thousands of hosts are expected to fail, even given tens of failures, and the cost is only a small reduction in the maximal achievable aggregate throughput of the network, which is acceptable given the overall decrease in flow completion times achieved by enabling lossless forwarding.

Item I/O-oriented applications on a software distributed-shared memory system (1999) Parker, Timothy Paul; Cox, Alan L.
This thesis evaluates the use of a software distributed shared memory (DSM) system, TreadMarks, as a platform for supporting an I/O-intensive application, specifically the Postgres database. Software DSM systems allow applications to run on cheap and powerful networks of workstations without the complexity of explicit message passing. Such systems are usually used for computationally intensive scientific applications; I/O-intensive applications have significantly different characteristics. Despite this, Postgres needed only minimal changes to run on TreadMarks, partly because we wrote emulation layers for many APIs Postgres already used. We created additional support pieces for TreadMarks to handle the problems that arose from the different application characteristics. These fell into three areas: some issues were related to forms of communication other than shared memory, some to TreadMarks itself, and some to the UNIX API. We discuss and evaluate solutions to these problems.

Item Improving the Efficiency of Map-Reduce Task Engine (2014-10-03) Chadha, Mehul; Cox, Alan L.; Rixner, Scott; Sarkar, Vivek
Map-Reduce is a popular distributed programming framework for parallelizing computation over huge datasets on large numbers of compute nodes. This year completes a decade since Google invented it in 2004; Hadoop, a popular open-source implementation of Map-Reduce, was introduced by Yahoo in 2005. Over these years many researchers have worked on problems related to Map-Reduce and similar distributed programming models, and Hadoop itself has been the subject of various research projects. Prior work in this field has focused on making Map-Reduce more efficient for iterative processing, or on pipelining across different jobs, which has improved the performance of iterative applications. However, little attention has been given to the task engine that carries out the Map-Reduce computation itself. Our analysis of applications running on Hadoop shows that more than 50% of the time is spent in the framework on tasks such as sorting, serialization, and deserialization. We address this problem by introducing an extension to the Map-Reduce programming model. This extension allows us to use more efficient data structures, such as hash tables, and to lower the cost of serializing and deserializing key-value pairs. With these efforts we have lowered the overheads of the framework, and the performance of certain important applications, such as PageRank and Join, has improved by 1.5 to 2.5 times.
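A minimal sketch of why a hash table can displace the sort step in a Map-Reduce task engine: grouping map output by key needs only equality, not order. Word count is used as the example; this is illustrative, not Hadoop's implementation:

```python
from collections import defaultdict

pairs = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1), ("the", 1)]

# Sort-based grouping (classic engine): O(n log n) comparisons, plus
# serialization of every intermediate pair in a real system.
def sort_reduce(pairs):
    out = {}
    for key, value in sorted(pairs):   # sort brings equal keys together
        out[key] = out.get(key, 0) + value
    return out

# Hash-based grouping: one pass, no ordering work at all.
def hash_reduce(pairs):
    out = defaultdict(int)
    for key, value in pairs:
        out[key] += value              # equality is all grouping needs
    return dict(out)

assert sort_reduce(pairs) == hash_reduce(pairs) == {"the": 3, "cat": 1, "sat": 1}
```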
Item LAIO: Lazy asynchronous I/O for event-driven servers (2004) Elmeleegy, Khaled; Cox, Alan L.
In this thesis, I introduce Lazy Asynchronous I/O (LAIO), a new API for performing I/O that is well suited, but not limited, to the needs of high-performance event-driven servers. In addition, I describe and evaluate an implementation of LAIO that demonstrably addresses certain critical limitations of the asynchronous and non-blocking I/O support in present Unix-like systems. LAIO is implemented entirely at user level, without modification to the operating system's kernel, and utilizes scheduler activations. In a micro-benchmark, LAIO was shown to be more than 3 times faster than AIO when the data was already available in memory, and it performed comparably to AIO when actual I/O had to be performed. An event-driven web server (thttpd) achieved more than a 38% increase in throughput using LAIO. The Flash web server's throughput, originally achieved with kernel modifications, was matched using LAIO without making kernel modifications.

Item Maestro: A System for Scalable OpenFlow Control (2010-12-04) Cai, Zheng; Cox, Alan L.; Ng, T. S. Eugene
The fundamental feature of an OpenFlow network is that the controller is responsible for the initial establishment of every flow by contacting the related switches; thus the performance of the controller can be a bottleneck. This paper shows how this fundamental problem is addressed by parallelism. The state-of-the-art OpenFlow controller, NOX, achieves a simple programming model for control-function development through a single-threaded event loop, but it does not exploit parallelism. We propose Maestro, which keeps the simple programming model for programmers while exploiting parallelism throughout, together with additional throughput-optimization techniques. We show experimentally that the throughput of Maestro achieves near-linear scalability on an eight-core server machine.

Item Maestro: Balancing Fairness, Latency and Throughput in the OpenFlow Control Plane (2011-12-20) Cai, Zheng; Cox, Alan L.; Ng, T. S. Eugene
The fundamental feature of an OpenFlow network is that the controller is responsible for configuring the switches for every traffic flow. This feature brings programmability and flexibility, but it also puts the controller in a critical role in the performance of an OpenFlow network. Fairly servicing requests from different switches, achieving low request-handling latency, and scaling effectively on multi-core processors are fundamental controller design requirements. With these requirements in mind, we explore multiple workload-distribution designs within our system, Maestro. These designs are evaluated against the requirements, together with the static-partitioning and static-batching designs found in other available multi-threaded controllers, NOX and Beacon. We find that a Maestro design based on the abstraction that each individual thread services switches in a round-robin manner can achieve excellent throughput scalability (second only to another Maestro design) while maintaining far superior and near-optimal max-min fairness. At the same time, low latency is achieved even at high throughput, thanks to Maestro's workload-adaptive request batching.
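A minimal sketch of the round-robin abstraction the second Maestro paper describes: each worker thread drains a bounded batch from each switch's request queue in turn, so one chatty switch cannot starve the others. The queue contents and batch bound are illustrative assumptions:

```python
from collections import deque

def round_robin_service(switch_queues, batch_limit=3):
    """Drain at most batch_limit requests per switch per pass."""
    served = []
    while any(switch_queues.values()):
        for switch, queue in switch_queues.items():  # visit switches in turn
            for _ in range(min(batch_limit, len(queue))):
                served.append((switch, queue.popleft()))
    return served

# A chatty switch (s1) cannot monopolize the worker: s2's lone request
# is handled within the first pass.
queues = {"s1": deque(range(8)), "s2": deque(["flow-setup"])}
order = round_robin_service(queues)
print(order.index(("s2", "flow-setup")))  # 3, not 8
```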
Item Method and system for scalable ethernet (2014-06-24) Rixner, Scott; Cox, Alan L.; Foss, Michael; Shafer, Jeffrey; Rice University; United States Patent and Trademark Office
A computer-readable medium comprising computer-readable code for data transfer. The computer-readable code, when executed, performs a method. The method includes receiving, at a first Axon, an ARP request from a source host directed to a target host. The method also includes obtaining a first route from the first Axon to a second Axon, and generating a target identification corresponding to the target host. The method further includes sending an Axon-ARP request to the second Axon using the first route, and receiving an Axon-ARP reply from the second Axon, where the Axon-ARP reply includes a second route. The method further includes storing the first route in storage space on the first Axon, where the storage space is indexed by the target identification, and sending an ARP reply to the source host, where the source host is configured to send a packet to the target host.

Item Microcontroller Programming for the Modern World (2014-04-25) Barr, Thomas William; Rixner, Scott; Cox, Alan L.; O'Malley, Marcia K.
Microcontroller development is much too hard, not only for beginners but also for experts. While the programming languages community has developed rich high-level languages and run-time systems that make programming traditional large systems easy and fun, the microcontroller developer languishes in a world of direct register access, incomplete C compilers, and manual memory management. For the past four years, the Rice Computer Architecture Group has been addressing this by developing Owl, an open-source microcontroller development system for the modern world. Owl includes support for the proven and easy-to-use language Python. It also supports Medusa, a new language designed specifically for embedded, concurrent programming. Finally, it introduces Hoot, a distributed computing environment that allows a programmer to treat a heterogeneous collection of controllers and networks as a single large application. This thesis presents the design of Owl as well as a detailed quantitative evaluation of it. The results show that not only is it possible to run sophisticated system software on a microcontroller, but that doing so makes building applications much easier. The results and innovations presented here are adaptable to the embedded run-times of the future and have the potential to make microcontroller development easier for everyone.

Item Multi-tier caching of dynamic content for database-driven Web sites (2002) Rajamani, Karthick; Cox, Alan L.
Web sites have gradually shifted from delivering just static HTML pages and images to customized, user-specific content and a plethora of online services. These new features and facilities are made possible by dynamic content, which is produced at request time. Multi-tiered, database-driven web sites form the predominant infrastructure for most structured and scalable approaches to dynamic content delivery. However, even with these scalable approaches, the request-time computation and high resource demands of dynamic content generation result in significantly higher latencies and lower throughputs than for sites with only static content. This thesis proposes the caching of dynamic content as a solution for improving the performance of web sites with a significant amount of dynamic content. The work shows that there is significant locality in the data accesses and computations of content generation, which caching can exploit to improve performance. It introduces a novel multi-tier caching architecture that incorporates multiple independent caching components, enabling easy deployment and effective performance over the prevalent multi-tiered, database-driven architecture for dynamic content delivery. The dynamic content infrastructure and the proposed caching strategy are evaluated with e-commerce workloads from the TPC-W benchmark. The evaluation of the system without caching shows that content generation overheads are dominated by the database component for e-commerce workloads. With multi-tier caching, each caching component overcomes specific overheads during content generation, while the combination provides overall performance improvements significantly greater than the individual contributions. Peak throughputs with caching range from 1.58 to 8.72 times the peak throughputs without caching, at similar or significantly reduced average response times. At the same load as the peak throughput without caching, response times were reduced by 90% to 97% with caching. The evaluations also establish the effectiveness of the strategy across variations in platform and site configuration. Overall, the proposed multi-tier caching strategy brings dramatic improvements in performance for dynamic content delivery.
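A minimal sketch of the multi-tier idea: independent caches at different tiers, so a request can be satisfied before it ever reaches the database. The two-tier split and the names below are illustrative assumptions, not the thesis's exact components:

```python
def make_tiers():
    page_cache, query_cache = {}, {}
    db_hits = [0]

    def run_query(sql):                 # tier 2: cache query results
        if sql not in query_cache:
            db_hits[0] += 1             # only a double miss reaches the DB
            query_cache[sql] = f"rows({sql})"
        return query_cache[sql]

    def serve(url, sql):                # tier 1: cache whole rendered pages
        if url not in page_cache:
            page_cache[url] = f"<html>{run_query(sql)}</html>"
        return page_cache[url]

    return serve, db_hits

serve, db_hits = make_tiers()
serve("/best", "SELECT * FROM books ORDER BY sales")   # page miss, query miss
serve("/best", "SELECT * FROM books ORDER BY sales")   # page-cache hit
serve("/best2", "SELECT * FROM books ORDER BY sales")  # page miss, query hit
print(db_hits[0])  # 1: the database ran the query only once
```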
Item New Architectures and Mechanisms for the Network Subsystem in Virtualized Servers (2013-07-24) Ram, Kaushik Kumar; Cox, Alan L.; Rixner, Scott; Varman, Peter J.
Machine virtualization has become a cornerstone of modern datacenters. It enables server consolidation as a means to reduce costs and increase efficiency. The communication endpoints within the datacenter are now virtual machines (VMs), not physical servers; consequently, the datacenter network now extends into the server, and last-hop switching occurs inside the server. Today, thanks to increasing core counts on processors, server VM densities are on the rise. This trend is placing enormous pressure on the network I/O subsystem and the last-hop virtual switch to support efficient communication, both internal and external to the server, and current state-of-the-art solutions fall short of these requirements. This thesis presents new architectures and mechanisms for the network subsystem in virtualized servers, making three primary contributions. First, it presents a new mechanism to reduce memory-sharing overheads in driver-domain-based I/O architectures. The key idea is to enable a guest operating system to reuse its I/O buffers that are shared with a driver domain. Second, it describes Hyper-Switch, a highly streamlined, efficient, and scalable software-based virtual switching architecture, specifically for hypervisors that support driver domains. The Hyper-Switch combines the best of the existing architectures by hosting the device drivers in a driver domain, to isolate any faults, while placing the virtual switch in the hypervisor, to perform efficient packet switching. Further, the Hyper-Switch implements several optimizations, such as virtual machine state-aware batching, preemptive copying, and dynamic offloading of packet processing to idle CPU cores, to enable efficient packet processing, better utilization of the available CPU resources, and higher concurrency. This architecture eliminates the memory-sharing overheads associated with driver domains. Third, the thesis proposes an alternative virtual switching architecture, sNICh, which explores the idea of server/switch integration. The sNICh is a combined network interface card (NIC) and datacenter switching accelerator. It takes the Hyper-Switch architecture one step further, offloading the data plane of the switch to the network device and eliminating driver domains entirely.
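A minimal sketch of the flavor of VM state-aware batching named above: notifications are coalesced while the destination VM is busy (a full batch justifies the interrupt), but delivered immediately when the VM is idle, so latency stays bounded. The two-state model, threshold, and names are illustrative assumptions, not Hyper-Switch's actual mechanism:

```python
class VirtualSwitchPort:
    """Coalesce packet notifications according to destination VM state."""
    def __init__(self):
        self.vm_running = True
        self.pending = 0
        self.interrupts = 0

    def deliver(self, batch_threshold=32):
        self.pending += 1
        if not self.vm_running or self.pending >= batch_threshold:
            # Idle VM: notify now so the packet is not left waiting.
            # Busy VM: notify only once a full batch has accumulated.
            self.interrupts += 1
            self.pending = 0

port = VirtualSwitchPort()
for _ in range(64):
    port.deliver()          # busy VM: 64 packets cost only 2 interrupts
port.vm_running = False
port.deliver()              # idle VM: immediate notification
print(port.interrupts)      # 3
```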
Item Optimizing network I/O virtualization through guest-driven scheduler bypass (2010) Crompton, Joanna; Cox, Alan L.
Virtualization is increasingly utilized for consolidating server resources to improve efficiency by conserving power and space. However, significant hurdles remain in achieving satisfactory performance in a virtualized system; notably, virtualization of network I/O continues to be a performance barrier. The driver domain model of I/O virtualization suffers from an inherent network performance disadvantage due to the necessity of scheduling a driver domain, yet this virtualization model is desirable because of its fault-tolerance and isolation properties. In this work, I argue that it is possible to overcome the barrier of network I/O performance while maintaining domain protection, by providing a mechanism that enables guests to operate the driver domain on their own behalf without the intervention of the scheduler. I describe my implementation of the worldswitch mechanism and evaluate its performance, showing that with the worldswitch enabled, guests achieve higher bandwidth and lower latency than in an unmodified system.

Item Performance Analysis and Configuration Selection for Applications in the Cloud (2015-05-29) Liu, Ruiqi; Ng, T. S. Eugene; Cox, Alan L.; Jermaine, Christopher M.
Cloud computing is becoming increasingly popular and widely used in both industry and academia. Making the best use of cloud computing resources is critically important. Default resource configurations provided by cloud platforms are often not tailored to applications, and hardware heterogeneity in cloud computers such as Amazon EC2 leads to wide variation in performance, which provides an avenue for research in saving cost and improving performance by exploiting that heterogeneity. In this thesis, I conduct exhaustive measurement studies on the Amazon EC2 cloud platform. I characterize the heterogeneity of its resources and analyze the suitability of different resource configurations for various applications. The measurement results show significant performance diversity across resource configurations of different virtual machine sizes and processor types. Diversity in resource capacity is not the only reason for this performance diversity; diagnostic measurements reveal that the influence of the cloud provider's scheduling policy is also an important factor. Furthermore, I propose a nearest-neighbor shortlisting algorithm that selects a configuration leading to superior performance for an application by matching the characteristics of the application with those of known benchmark programs. My experimental evaluations show that nearest neighbor greatly reduces testing overhead, since only the shortlisted top configurations rather than all configurations need to be tested, and it achieves high accuracy because the target application chooses the configuration for itself via testing. Even without any testing, nearest neighbor obtains a configuration with less than 5% performance loss for 80% of applications.
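A minimal sketch of nearest-neighbor shortlisting, assuming each program is summarized by a small resource-usage vector; the features, benchmark data, EC2-style instance names, and k are illustrative assumptions, not the thesis's measured values:

```python
import math

# Benchmarks with known best cloud configurations, summarized by
# (cpu intensity, memory intensity, I/O intensity) in [0, 1].
benchmarks = {
    "compute-bound": ((0.9, 0.2, 0.1), ["c3.large", "c3.xlarge", "m3.large"]),
    "memory-bound":  ((0.3, 0.9, 0.2), ["r3.large", "m3.xlarge", "r3.xlarge"]),
    "io-bound":      ((0.2, 0.3, 0.9), ["i2.xlarge", "m3.large", "c3.large"]),
}

def shortlist(app_vector, k=2):
    """Return the top-k configs of the benchmark nearest to the app."""
    nearest = min(benchmarks.values(),
                  key=lambda b: math.dist(app_vector, b[0]))
    return nearest[1][:k]   # only these few configs need live testing

print(shortlist((0.85, 0.3, 0.1)))  # ['c3.large', 'c3.xlarge']
```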