Browsing by Author "Cox, Alan L"
Now showing 1 - 7 of 7
Item
BitFlood: Multicast Enabled P2P Data Sharing in Datacenters (2017-02-28)
Nazaripouya, Omidreza; Cox, Alan L
One-to-many data transfers are a common activity within datacenters. Iterative machine learning algorithms, code and VM distribution, and fragment-replicate joins in Hadoop all perform one-to-many data transfers. Moreover, one-to-many data transfers can be costly: some data analytics applications distribute hundreds of gigabytes of data from a single node to hundreds of receiver nodes. Lastly, the time it takes to perform these transfers can represent a large part of an application's overall execution time. For example, data analytics applications used by Twitter and Netflix are reported to spend 30% to 45% of their execution time performing one-to-many data transfers. To address this problem, we propose BitFlood, which exploits the multicast capabilities of commodity switches in datacenters to speed up data transfers and lessen their impact on other applications sharing the network. BitFlood is an extension of the BitTorrent protocol that utilizes IP multicast. IP multicast is advantageous because, in a one-to-many transmission, the sender transmits the data once and the network's switches duplicate it. Since IP multicast does not guarantee delivery, BitFlood uses its P2P mechanism to recover lost data locally. To evaluate BitFlood, we implement it and compare it with two state-of-the-art data transmission approaches: NORM, a reliable multicast protocol, and BitTorrent, a P2P protocol. To achieve a fair comparison, we tune the implementations of both NORM and BitTorrent to perform better in a datacenter network. Our analysis shows that BitFlood achieves 10% to 50% faster transfer times than both approaches when transferring data to multiple receivers. Furthermore, compared with the P2P approach of BitTorrent, BitFlood reduces the load on the network by a factor of n for a data transfer to n receivers.
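As a concrete illustration of the multicast mechanism this abstract relies on, the following minimal sketch shows a receiver joining an IP multicast group through the standard sockets API. It is not BitFlood's code; the group address and port are placeholders.

```c
/* Minimal sketch of the IP multicast mechanism BitFlood builds on:
 * a receiver joins a multicast group and reads datagrams that the
 * sender transmitted only once. The group address and port are
 * illustrative placeholders, not values from BitFlood itself. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) return 1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);                        /* placeholder port */
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    /* Ask the kernel (and, via IGMP, the datacenter switches) to deliver
     * traffic for this group; the switches duplicate each packet toward
     * every member, so the sender transmits it only once. */
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr("239.1.2.3"); /* placeholder group */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    char block[1500];
    ssize_t n = recv(sock, block, sizeof(block), 0);    /* one data block */
    printf("received %zd bytes\n", n);

    close(sock);
    return 0;
}
```

Because delivery over such a socket is best-effort, a BitFlood-style system layers BitTorrent-like peer exchange on top to repair lost blocks locally.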
Item (Embargo)
Effective Techniques for Managing Intermediate-Sized Superpages (2024-08-09)
Solomon, Eliot Hutton; Cox, Alan L
Translation lookaside buffers (TLBs) are hardware caches that store the results of expensive address translations, improving the performance of the virtual memory system. Design constraints make it impossible for TLBs to store more than a few thousand entries, so "superpages" allow the operating system to instruct the TLB to cache a larger block of memory using a single entry. For small, frequently used memory objects like files and shared libraries, it can be difficult for the operating system to appropriately trade off the memory fragmentation induced by creating a 2 MB superpage against the performance benefits that doing so provides. Because of this, we investigate emerging hardware support for smaller "intermediate-sized" superpages. The first phase of our work explores PTE Coalescing, a feature of AMD Ryzen processors that transparently forms 16 KB or 32 KB superpages from aligned and contiguous groups of 4 KB base pages. We develop a custom microbenchmark to infer details of PTE Coalescing's hardware implementation. We then determine that the contiguity generated by the Linux and FreeBSD physical memory allocators is insufficient to enable much coalescing, and that reservation-based allocation is a good technique for generating the additional contiguity needed to enhance PTE Coalescing. In the second phase of our work, we introduce the first production system capable of simultaneously managing two superpage sizes for file-backed and anonymous mappings by implementing support in the FreeBSD kernel for non-transparent 64 KB superpages on the ARM architecture, using the latter's Contiguous bit feature. We observe a 13.83% improvement in an exec() microbenchmark, a 6.83% boost in Node.js rendering performance, and an 11.18% speedup in a compilation-centric workload. More aggressive superpage promotion policies can further increase these benefits; the right policy boosts the speedup on the compilation-heavy workload to 15.67%.
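The abstract mentions a custom microbenchmark for probing PTE Coalescing. A minimal sketch of that style of measurement might look like the following; the thesis's actual benchmark is more sophisticated, and the sizes and trial counts here are assumptions.

```c
/* Sketch of a PTE Coalescing probe: map far more 4 KB pages than the
 * TLB can hold and time accesses that land on a new page almost every
 * iteration. If the hardware coalesces aligned, contiguous PTEs into
 * 16 KB or 32 KB entries, effective TLB reach grows and the measured
 * per-access latency drops. Sizes and counts are illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE   4096UL
#define NPAGES 65536UL                 /* 256 MB: well beyond TLB reach */
#define TRIALS (10 * 1000 * 1000UL)

int main(void) {
    uint8_t *buf = mmap(NULL, NPAGES * PAGE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    for (uint64_t p = 0; p < NPAGES; p++)
        buf[p * PAGE] = 1;             /* fault in every 4 KB page */

    /* Touch a pseudo-random page each iteration so nearly every access
     * needs a fresh translation; coalesced PTEs mean fewer TLB misses. */
    struct timespec t0, t1;
    uint64_t x = 1, sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < TRIALS; i++) {
        x = x * 6364136223846793005UL + 1442695040888963407UL;  /* LCG */
        sum += buf[((x >> 16) % NPAGES) * PAGE];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns/access (checksum %lu)\n", ns / TRIALS, (unsigned long)sum);
    return 0;
}
```

Whether coalescing actually occurs depends on the physical contiguity the kernel's allocator happens to provide, which is exactly the behavior the first phase of the thesis measures and then improves with reservation-based allocation.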
Item
Leaky Buffer: A Novel Abstraction for Relieving Memory Pressure from Cluster Data Processing Frameworks (2015-07-13)
Liu, Zhaolei; Ng, T. S. Eugene; Cox, Alan L; Jermaine, Christopher M
The shift to the in-memory data processing paradigm has had a major influence on the development of cluster data processing frameworks. Numerous frameworks from industry, the open source community, and academia are adopting the in-memory paradigm to achieve functionality and performance breakthroughs. However, despite the advantages of these in-memory frameworks, in practice they are susceptible to memory-pressure-related performance collapse and failures. The contributions of this thesis are two-fold. First, we conduct a detailed diagnosis of the memory pressure problem and identify three preconditions for the performance collapse. These preconditions not only explain the problem but also shed light on possible solution strategies. Second, we propose a novel programming abstraction called the leaky buffer that eliminates one of the preconditions, thereby addressing the underlying problem. We have implemented the leaky buffer abstraction in Spark for two distinct use cases. Experiments on a range of memory-intensive aggregation operations show that the leaky buffer abstraction can drastically reduce the occurrence of memory-related failures, improve performance by up to 507%, and reduce memory usage by up to 87.5%.
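To make the abstraction concrete: the thesis implements leaky buffers inside Spark, but the core idea, a fixed-capacity buffer that "leaks" records to downstream processing instead of growing without bound, can be sketched in a few lines of C. All names and sizes below are illustrative, not the thesis's implementation.

```c
/* Conceptual sketch of leaky-buffer-style aggregation. Instead of
 * buffering every record before processing (the precondition for
 * memory-pressure collapse), the buffer holds a small fixed number of
 * records and "leaks" them to the consumer whenever it fills, so
 * memory use stays bounded regardless of input size. */
#include <stdio.h>

#define CAPACITY 4                   /* illustrative; real sizes are tuned */

struct leaky_buffer {
    int items[CAPACITY];
    int count;
    void (*consume)(int item);       /* downstream aggregation step */
};

static void lb_flush(struct leaky_buffer *lb) {
    for (int i = 0; i < lb->count; i++)
        lb->consume(lb->items[i]);   /* leak buffered records downstream */
    lb->count = 0;
}

static void lb_add(struct leaky_buffer *lb, int item) {
    lb->items[lb->count++] = item;
    if (lb->count == CAPACITY)       /* full: leak instead of growing */
        lb_flush(lb);
}

static long long total;
static void add_to_sum(int item) { total += item; }

int main(void) {
    struct leaky_buffer lb = { .count = 0, .consume = add_to_sum };
    for (int i = 1; i <= 10; i++)
        lb_add(&lb, i);              /* memory stays O(CAPACITY) */
    lb_flush(&lb);                   /* drain the tail */
    printf("sum = %lld\n", total);   /* prints 55 */
    return 0;
}
```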
Item
Reliability and Optimization for Resource-Constrained Embedded Systems (2015-04-23)
Smith, Rebecca Jane; Rixner, Scott; Cox, Alan L; Cooper, Keith
Embedded systems are ubiquitous, powering countless devices ranging from cars to appliances. As the software requirements of these systems grow increasingly complex, it is necessary to develop new approaches to simplify embedded systems programming. Recently, managed run-time systems have emerged as a means of increasing the productivity of writing embedded applications. Along with increased productivity, these run-time systems bring an intrinsic structure that provides new opportunities for addressing fundamental challenges faced by resource-constrained embedded systems. This thesis presents novel mechanisms that utilize the structure imposed by managed run-time systems to address two key challenges of embedded systems programming: reliability and memory management. Though a wealth of past work explores these challenges in the context of conventional computing systems, the stringent resource constraints of embedded systems demand a more economical approach. Therefore, this thesis presents new techniques designed to accommodate the unique properties of embedded systems. First, this thesis presents Phoenix, a semi-automated system for recovering from hardware peripheral failures that is integrated into the run-time system. The design of Phoenix is uniquely tailored to embedded systems, inspired by novel insights into the characteristics of these systems as they pertain to reliability. Second, this thesis proposes a new technique for memory optimization and analysis in embedded systems that capitalizes on the structure of a managed run-time system. It presents GEM, an extensible framework that implements this technique, and highlights the framework's versatility through the implementation and evaluation of four use cases. Through these two systems, this thesis demonstrates the power of managed run-time systems to improve the future of developing safe and efficient embedded applications.

Item
System Support for Loosely Coupled Resources in Mobile Computing (2014-07-31)
Lin, Xiaozhu; Zhong, Lin; Cox, Alan L; Varman, Peter J
Modern mobile platforms are embracing not only heterogeneous but also loosely coupled computational resources. For instance, a smartphone usually incorporates multiple processor cores that have no hardware cache coherence. Loosely coupled resources allow a high degree of resource heterogeneity that can greatly improve system energy efficiency for a wide range of mobile workloads. However, loosely coupled resources create application programming difficulty: both resources and program state are distributed, which calls for explicit communication to maintain consistency. This difficulty is further exacerbated by the large numbers of mobile developers and mobile applications. To ease application programming over loosely coupled resources, in this thesis work we explore system support, at both user level and OS level, that bridges desirable programming abstractions with the underlying hardware. We study three loosely coupled architectures widely seen in mobile computing: (i) a smartphone accompanied by wearable sensors, (ii) a mobile device encompassing multiple processors that share no memory, and (iii) a mobile System-on-Chip (SoC) with multiple cores sharing incoherent memory. To address these three architectures, this thesis contributes three closely related research projects. In project Dandelion, we propose a Remote Method Invocation scheme to hide communication details from application components that synchronize over wireless links. In project Reflex, we design an energy-efficient software Distributed Shared Memory (DSM) to automatically keep user state consistent; the DSM always employs a low-power processor to host shared memory objects in order to maximize the sleep periods of high-power processors. In project K2, we identify and apply a shared-most OS model to construct a single OS image over loosely coupled processor cores. Following the shared-most model, high-level OS services (e.g., device drivers and file systems) are mostly unmodified, with their state transparently kept consistent; low-level OS services (e.g., the page allocator) are implemented as separate instances with independent state to minimize communication overhead. We report on the research prototypes, our experiences building them, and our experimental measurements. We discuss future directions, in particular how our principles for treating loosely coupled resources can be used to improve other key system aspects, such as scalability.

Item
Towards Efficient and Effective IOMMU-based Protection from DMA Attacks (2018-04-20)
Gutstein, Brett Ferdosi; Cox, Alan L
Malicious actors can carry out direct memory access (DMA) attacks to compromise computer systems. In such attacks, peripheral devices abuse their ability to read and write physical memory independently of the CPU to violate the confidentiality or integrity of a system's data. Relatively recently, commodity architectures have incorporated the I/O memory management unit (IOMMU), which allows the CPU to govern peripheral devices' memory access. This thesis demonstrates that IOMMU usage in existing operating systems does not protect against DMA attacks effectively and comes with a prohibitively high performance cost. It introduces Thunderclap, a novel DMA attack platform used to carry out new attacks that completely compromise FreeBSD, macOS, Linux, and Windows, even with their current IOMMU-based protections enabled. It then presents and evaluates strategies for IOMMU usage that make strides towards efficient and effective protection from DMA attacks.

Item
Virtual Memory Management for Emerging Accelerators and Large-memory Applications (2022-12-02)
Zhu, Weixi; Rixner, Scott; Cox, Alan L
Today, the operating system (OS) is called upon to support a variety of applications that process large amounts of data using an ever-growing collection of specialized hardware accelerators. Nonetheless, current OSes still fail to (1) ease the development of drivers for new accelerators that need access to in-memory data and (2) provide efficient access to that data by both the CPU and accelerators. Applications need virtual memory abstractions to securely isolate data and hide hardware details of the CPU and accelerators. Currently, OS memory management is designed for managing the CPU's memory and cannot be directly used for many accelerators. The absence of better OS memory management support for devices forces driver authors to implement ad-hoc, specialized virtual memory management that reinvents many existing OS mechanisms. Unfortunately, the complexity of virtual memory management makes efficient ad-hoc implementations difficult, so accelerator users may suffer poor performance. Furthermore, the continued growth of data set sizes amplifies the performance impact of hardware limitations in both the CPU and accelerators. These limitations can be alleviated independently by innovative optimizations in OS memory management and in drivers' ad-hoc memory management, but doing so further complicates sharing those innovations. This thesis presents GMEM, generalized memory management, which refactors OS memory management to provide a high-level interface through which both the CPU and emerging accelerators share existing memory management mechanisms and innovative optimizations. For instance, the GMEM-based driver of a simulated device takes fewer than 100 hardware-independent lines of code to provide a virtual memory abstraction similar to that of Nvidia's GPU driver. Additionally, this thesis presents two innovative memory management optimizations, for FreeBSD and for Nvidia's GPU driver, in response to applications' ever larger memory footprints. For example, the optimization for Nvidia's GPU driver enables a deep learning application to obtain 60% higher training throughput. These two innovations are to be merged into mainstream FreeBSD and Nvidia's GPU driver, respectively; more importantly, they are sharable via GMEM.
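To convey the shape of the interface the abstract describes, here is a purely hypothetical C sketch; every identifier below is invented for illustration and none of it is GMEM's actual API. The point it conveys is that a driver supplies only a few hardware-specific callbacks and inherits the rest of virtual memory management from shared OS code.

```c
/* Hypothetical sketch of a GMEM-style high-level interface. A driver
 * registers a handful of MMU callbacks; address-space bookkeeping,
 * mapping policy, and coherence with CPU-side state would live in
 * shared, hardware-independent OS code. All names are invented. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t dev_addr_t;

/* The hardware-dependent part a driver author would still write. */
struct gm_mmu_ops {
    int  (*map)(void *hw, dev_addr_t va, uint64_t pa, size_t len);
    int  (*unmap)(void *hw, dev_addr_t va, size_t len);
    void (*tlb_invalidate)(void *hw, dev_addr_t va, size_t len);
};

struct gm_device {
    const char *name;
    void *hw;                          /* driver-private hardware handle */
    const struct gm_mmu_ops *ops;
};

/* Stub for the registration entry point such a layer might expose. */
static int gm_register_device(struct gm_device *dev) {
    printf("%s now managed by generic memory management\n", dev->name);
    return 0;
}

/* Example: a simulated accelerator's entire MMU glue. */
static int sim_map(void *hw, dev_addr_t va, uint64_t pa, size_t len) {
    (void)hw;
    printf("map dev va %#lx -> pa %#lx (%zu bytes)\n",
           (unsigned long)va, (unsigned long)pa, len);
    return 0;
}
static int  sim_unmap(void *hw, dev_addr_t va, size_t len) {
    (void)hw; (void)va; (void)len; return 0;
}
static void sim_inval(void *hw, dev_addr_t va, size_t len) {
    (void)hw; (void)va; (void)len;
}

static const struct gm_mmu_ops sim_ops = { sim_map, sim_unmap, sim_inval };

int main(void) {
    struct gm_device dev = { "simdev", NULL, &sim_ops };
    gm_register_device(&dev);
    dev.ops->map(dev.hw, 0x1000, 0x200000, 4096); /* generic layer's call */
    return 0;
}
```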