Browsing by Author "Rixner, Scott"
Now showing 1 - 20 of 43
Item: A Browser-based Program Execution Visualizer for Learning Interactive Programming in Python (2015-04-23)
Tang, Lei; Warren, Joe; Rixner, Scott; Jermaine, Christopher
Good educational programming tools help students practice programming skills and build a better understanding of basic concepts and logic. As Rice University started offering free Massive Open Online Courses (MOOCs) on the internet, we developed a web-based programming environment to teach an introductory programming course in Python. The course is now one of the top-rated MOOC courses, which is believed to be largely due to the successful web-based educational programming environment. Here we introduce the thought processes behind the design and then focus on the key innovations incorporated in it. The main contribution of this thesis is an entirely browser-based Python program execution visualizer that graphically presents execution information to help students understand the dynamics of program execution. In particular, this tool can also be used to visualize and debug event-driven programs. The design details and unit test infrastructure for the program execution visualizer are both introduced in this thesis.

Item: A storage architecture for data-intensive computing (2010)
Shafer, Jeffrey; Rixner, Scott
The assimilation of computing into our daily lives is enabling the generation of data at unprecedented rates. In 2008, IDC estimated that the "digital universe" contained 486 exabytes of data [9]. The computing industry is being challenged to develop methods for the cost-effective processing of data at these large scales. The MapReduce programming model has emerged as a scalable way to perform data-intensive computations on commodity cluster computers. Hadoop is a popular open-source implementation of MapReduce. To manage storage resources across the cluster, Hadoop uses a distributed user-level filesystem.
This filesystem --- HDFS --- is written in Java and designed for portability across heterogeneous hardware and software platforms. The efficiency of a Hadoop cluster depends heavily on the performance of this underlying storage system. This thesis is the first to analyze the interactions between Hadoop and storage. It describes how the user-level Hadoop filesystem, instead of efficiently capturing the full performance potential of the underlying cluster hardware, actually degrades application performance significantly. Architectural bottlenecks in the Hadoop implementation result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Further, HDFS implicitly makes assumptions about how the underlying native platform manages storage resources, even though native filesystems and I/O schedulers vary widely in design and behavior. Methods to eliminate these bottlenecks in HDFS are proposed and evaluated both in terms of their application performance improvement and impact on the portability of the Hadoop framework. In addition to improving the performance and efficiency of the Hadoop storage system, this thesis also focuses on improving its flexibility. The goal is to allow Hadoop to coexist in cluster computers shared with a variety of other applications through the use of virtualization technology. The introduction of virtualization breaks the traditional Hadoop storage architecture, where persistent HDFS data is stored on local disks installed directly in the computation nodes. To overcome this challenge, a new flexible network-based storage architecture is proposed, along with changes to the HDFS framework. 
Network-based storage enables Hadoop to operate efficiently in a dynamic virtualized environment and furthers the spread of the MapReduce parallel programming model to new applications.

Item: An Automated System for Interactively Learning Software Testing (Association for Computing Machinery, 2017)
Smith, Rebecca; Tang, Terry; Warren, Joe; Rixner, Scott
Testing is an important, time-consuming, and often difficult part of the software development process. It is therefore critical to introduce testing early in the computer science curriculum, and to provide students with frequent opportunities for practice and feedback. This paper presents an automated system to help introductory students learn how to test software. Students submit test cases to the system, which uses a large corpus of buggy programs to evaluate these test cases. In addition to gauging the quality of the test cases, the system immediately presents students with feedback in the form of buggy programs that nonetheless pass their tests. This enables students to understand why their test cases are deficient and gives them a starting point for improvement. The system has proven effective in an introductory class: students who trained using the system were later able to write better test cases -- even without any feedback -- than those who were not. Further, students reported additional benefits such as an improved ability to read code written by others and to understand multiple approaches to the same problem.

Item: A Bandwidth-Efficient Architecture for Media Processing (1998-11-20)
Rixner, Scott; Dally, William J.; Kapasi, Ujval J.; Khailany, Brucek; Lopez-Lagunas, Abelardo; Mattson, Peter; Owens, John D.
Media applications are characterized by large amounts of available parallelism, little data reuse, and a high computation to memory access ratio.
While these characteristics are poorly matched to conventional microprocessor architectures, they are a good fit for modern VLSI technology, with its high arithmetic capacity but limited global bandwidth. The stream programming model, in which an application is coded as streams of data records passing through computation kernels, exposes both parallelism and locality in media applications that can be exploited by VLSI architectures. The Imagine architecture supports the stream programming model by providing a bandwidth hierarchy tailored to the demands of media applications. Compared to a conventional scalar processor, Imagine reduces the global register and memory bandwidth required by typical applications by factors of 13 and 21, respectively. This bandwidth efficiency enables a single-chip Imagine processor to achieve a peak performance of 16.2 GFLOPS (single-precision floating point) and sustained performance of up to 8.5 GFLOPS on media processing kernels.

Item: Characterization of block memory operations (2006)
Calhoun, Michael; Rixner, Scott
Block memory operations are frequently performed by the operating system and consume an increasing fraction of kernel execution time. These operations include memory copies, page zeroing, interprocess communication, and networking. This thesis demonstrates that the performance of these common OS operations is highly dependent on the cache state and future use pattern of the data. This thesis argues that prediction of both initial cache state and data reuse patterns can be used to dynamically select the optimal algorithm. It describes an innovative method for predicting the state of the cache by using a single cache-line probe. The performance of networking, which is dominated by kernel copies, is improved by the addition of dedicated hardware in the network interface.
Finally, based upon the behavior of block memory operations, this thesis proposes improvements such as a hardware cache probe instruction, a dedicated memory controller copy engine, and centralized handling of block memory operations to improve performance in future systems.

Item: Computer Science Education at Scale: Providing Personalized and Interactive Learning Experiences Within Large Introductory Courses (2019-12-05)
Smith, Rebecca; Rixner, Scott
Enrollment in undergraduate computer science programs has expanded rapidly. While the influx of talent into the field will undoubtedly lead to countless technological developments, this growth also brings new pedagogical challenges. Educational resources, ranging from instructional time to classroom space, are limited. In the face of these resource constraints, it is difficult to scale courses in a manner that still retains the personalization and interaction that are characteristic of a high-quality education. The challenges of scale are particularly pronounced in introductory courses, which typically attract large numbers of majors and non-majors alike. This thesis aims to explore and tackle the pedagogical challenges within large introductory courses using three orthogonal means: data analysis, pedagogical tools, and structural innovations. First, this thesis presents a series of analyses of student-written code in order to characterize the mistakes that novice programmers make, and subsequently to inform the pedagogical choices that instructors make. Second, this thesis describes the design and implementation of two automated pedagogical tools, VizQuiz and Compigorithm. These tools provide interactive learning experiences that can scale to meet the demands of the growing numbers of students pursuing computer science without increasing the burden on the instructor.
Finally, this thesis examines the viability of structural innovations, in particular collaborative online learning experiences, to scale an introductory computational thinking course, ultimately finding minimal statistically significant differences between the online and in-person sections of the course. Together, these three complementary lines of work advance the field of computer science education by empowering instructors of large computer science courses to provide learning experiences that are personalized, interactive, and scalable.

Item: Design and evaluation of FPGA-based gigabit-Ethernet/PCI network interface card (2004)
Mohsenin, Tinoosh; Rixner, Scott
The continuing advances in the performance of network servers make it essential for network interface cards (NICs) to provide more sophisticated services and data processing. Modern network interfaces provide fixed functionality and are optimized for sending and receiving large packets. One of the key challenges for researchers is to find effective ways to investigate novel architectures for these new services and to evaluate their performance characteristics on a real network interface platform. This thesis presents the design and evaluation of a flexible and configurable Gigabit Ethernet/PCI network interface card using FPGAs. The FPGA-based NIC includes multiple memories, including an SDRAM SODIMM, for adding new network services. The experimental results at the Gigabit Ethernet receive interface indicate that the NIC can receive packets of all sizes and store them in SDRAM at Gigabit Ethernet line rate. This is promising, since no existing NICs use SDRAM due to its latency.

Item: Design space exploration for real-time embedded stream processors (2004-07-01)
Rajagopal, Sridhar; Cavallaro, Joseph R.; Rixner, Scott; Center for Multimedia Communications (http://cmc.rice.edu/)
We present a design framework for rapidly exploring the design space for stream processors in real-time embedded systems.
Stream processors enable hundreds of arithmetic units in programmable processors by using clusters of functional units. However, to meet a given real-time requirement for an embedded system, there is a trade-off among the number of arithmetic units per cluster, the number of clusters, and the clock frequency, as each solution meets the real-time requirement with a different power consumption. We have developed a design exploration tool that explores this trade-off and presents a heuristic that minimizes the power consumption in the (functional units, clusters, frequency) design space. Our design methodology relates the instruction-level parallelism, subword parallelism, and data parallelism to the organization of the functional units in an embedded stream processor. We show that the power minimization methodology also provides insights into the functional unit utilization of the processor. The design exploration tool exploits the static nature of signal processing workloads, providing an extremely fast design space exploration, and provides an initial lower-bound estimate of the real-time performance of the embedded processor. A sensitivity analysis of the design tool results with respect to the technology and modeling also enables the designer to check the robustness of the design exploration.

Item: Designing Scalable Networks for Future Large Datacenters (2012-09-05)
Stephens, Brent; Cox, Alan L.; Rixner, Scott; Ng, T. S. Eugene; Carter, John
Modern datacenters require a network with high cross-section bandwidth, fine-grained security, support for virtualization, and simple management that can scale to hundreds of thousands of hosts at low cost. This thesis first presents the firmware for Rain Man, a novel datacenter network architecture that meets these requirements, and then performs a general scalability study of the design space. The firmware for Rain Man, a scalable Software-Defined Networking architecture, employs novel algorithms and uses previously unused forwarding hardware.
This allows Rain Man to scale at high performance to networks of forty thousand hosts on arbitrary network topologies. In the general scalability study of the design space of SDN architectures, this thesis identifies three architectural dimensions common among the networks: source versus hop-by-hop routing, the granularity at which flows are routed, and arbitrary versus restrictive routing. It finds that a source-routed network with host-pair flow granularity and arbitrary routes is the most scalable.

Item: DSP architectural considerations for optimal baseband processing (2002-08-20)
Rajagopal, Sridhar; Rixner, Scott; Cavallaro, Joseph R.; Aazhang, Behnaam; Center for Multimedia Communications (http://cmc.rice.edu/)
The data rate requirements for future wireless systems have increased by orders of magnitude (from Kbps to several Mbps), requiring more sophisticated algorithms for their implementation. This tutorial explores different architectural issues to consider for optimal wireless baseband processing. It looks at research into real-time architectural design issues such as the number of functional units, data access from memory, and sequential traceback for Viterbi decoding using digital signal processors.

Item: Evaluate Namespace as a Labeling System for Malware Detection (2021-12-02)
Ding, Chenkai; Rixner, Scott
Today's kernel tracing tools are built on a limited set of Linux features. In this thesis, we explore a new method to improve kernel tracing. We modified Memorizer, a novel kernel tracing tool that offers comprehensive coverage of kernel accesses, and combined it with the Linux namespace system. Because namespaces are a native compartmentalization feature of Linux, introducing them gives us a chance to describe kernel accesses and exploit behaviors from a different perspective. Experiments showed that our modified Memorizer can provide novel insights into how the kernel operates across modules and containers.
Moreover, we propose a series of analysis methods that allow us to extract a small and unique profile for a given exploit, which could contribute to the development of malware detection software in the future.

Item: Exploiting Task-Level Concurrency in a Programmable Network Interface (2003-06-20)
Kim, Hyong-youb; Pai, Vijay S.; Rixner, Scott
Programmable network interfaces provide the potential to extend the functionality of network services but incur instruction processing overheads when compared to application-specific network interfaces. This paper aims to offset those performance disadvantages by exploiting task-level concurrency in the workload to parallelize the network interface firmware for a programmable controller with two processors. By carefully partitioning the handler procedures that process various events related to the progress of a packet, the system can minimize sharing, achieve load balance, and efficiently utilize on-chip storage. Compared to the uniprocessor firmware released by the manufacturer, the parallelized network interface firmware increases throughput by 65% for bidirectional UDP traffic of maximum-sized packets, 157% for bidirectional UDP traffic of minimum-sized packets, and 32-107% for real network services. This parallelization results in performance within 10-20% of a modern ASIC-based network interface for real network services.

Item: Exploring Superpage Promotion Policies for Efficient Address Translation (2019-03-19)
Zhu, Weixi; Rixner, Scott
Address translation performance for modern applications depends heavily upon the number of translation entries cached in the hardware TLB (translation look-aside buffer). Therefore, the efficiency of address translation relies directly on the TLB hit rate. The number of TLB entries continues to fall further behind the growth of memory consumption for modern applications.
Superpages, which are pages with larger sizes, can increase the efficiency of the TLB by enabling each translation entry to cover a larger memory region. Without requiring more TLB entries, using superpages can increase the TLB hit rate and benefit address translation. However, superpages can also introduce overhead. The TLB uses a single dirty bit to mark a page as dirty during address translation before modifying the page, so the granularity of the dirty bit corresponds to the coverage of the translation entry. As a result, the OS (operating system) pays extra I/O cost when it allocates, or writes back to disk, an underutilized superpage. Such extra overhead can easily surpass the address translation benefits of superpages. This thesis discusses the performance trade-offs of superpages by exploring the design space of superpage promotion policies in the OS. A data collection infrastructure is built on QEMU, with kernel instrumentation in FreeBSD, to collaboratively collect both memory accesses and kernel events. Then, the TLB behavior of Intel Skylake x86 processors is simulated. The simulation has been validated to be faithful and consistent with real-world performance. Finally, this thesis evaluates and compares both the TLB performance benefits and the I/O overheads of the superpage promotion policies to discuss the trade-offs in the design space.

Item: GD-Wheel: A Cost-Aware Replacement Policy for Key-Value Stores (2014-05-15)
Li, Conglong; Cox, Alan L.; Rixner, Scott; Mellor-Crummey, John
Various memory-based key-value stores, such as Memcached and Redis, are used to speed up dynamic web applications. Specifically, they are used to cache the results of computations, such as database queries. Currently, these key-value stores use either LRU or an LRU approximation as the replacement policy for choosing a key-value pair to be evicted from the store.
However, if the cost of recomputing cached values varies significantly, as in the RUBiS and TPC-W benchmarks, then neither of these replacement policies is the best choice. When deciding which key-value pair to replace, it can be advantageous to take the cost of recomputation into consideration. To that end, this thesis proposes a new cost-aware replacement policy, GD-Wheel, which seamlessly integrates recency of access and cost of recomputation. This thesis applies GD-Wheel to Memcached and evaluates its performance using the Yahoo! Cloud Serving Benchmark. The evaluation shows that GD-Wheel, when compared to LRU, greatly reduces the total recomputation cost, as well as the average and 99th-percentile read access latency for the application.

Item: Handling Congestion and Routing Failures in Data Center Networking (2015-09-01)
Stephens, Brent; Cox, Alan L.; Rixner, Scott; Ng, T. S. Eugene; Zhong, Lin
Today's data center networks are made of highly reliable components. Nonetheless, given the current scale of data center networks and the bursty traffic patterns of data center applications, at any given point in time it is likely that the network is experiencing either a routing failure or a congestion failure. This thesis introduces new solutions to each of these problems individually, as well as the first combined solutions to these problems for data center networks. To solve routing failures, which can lead to both packet loss and a loss of connectivity, this thesis proposes a new approach to local fast failover, which allows traffic to be quickly rerouted. Because forwarding table state limits both the fault tolerance and the largest network size that is implementable given local fast failover, this thesis introduces both a new forwarding table compression algorithm and Plinko, a compressible forwarding model.
Combined, these contributions enable forwarding tables that contain routes for all pairs of hosts and that can reroute traffic even given multiple arbitrary link failures on topologies with tens of thousands of hosts. To solve congestion failures, this thesis presents TCP-Bolt, which uses lossless Ethernet to prevent packets from ever being dropped. Unlike prior work, this thesis demonstrates that enabling lossless Ethernet does not reduce aggregate forwarding throughput in data center networks. Further, this thesis also demonstrates that TCP-Bolt can significantly reduce flow completion times for medium-sized flows by allowing TCP slow-start to be eliminated. Unfortunately, using lossless Ethernet to solve congestion failures introduces a new failure mode, deadlock, which can render the entire network unusable. No existing fault-tolerant forwarding models are deadlock-free, so this thesis introduces both deadlock-free Plinko and deadlock-free edge-disjoint spanning tree (DF-EDST) resilience, the first deadlock-free fault-tolerant forwarding models for data center networks. This thesis shows that deadlock-free Plinko does not impact forwarding throughput, although the number of virtual channels required by deadlock-free Plinko increases as either topology size or fault tolerance increases. On the other hand, this thesis demonstrates that DF-EDST provides deadlock-free local fast failover without needing virtual channels. This thesis shows that, with DF-EDST resilience, fewer than one in a million flows in data center networks with thousands of hosts are expected to fail even given tens of failures.
Further, this thesis shows that doing so incurs only a small impact on the maximum achievable aggregate throughput of the network, which is acceptable given the overall decrease in flow completion times achieved by enabling lossless forwarding.

Item: High-performance MPI libraries for Ethernet (2005)
Majumder, Supratik; Rixner, Scott
An MPI library performs two tasks: computation on behalf of the application, and communication in the form of sending and receiving messages among the processes forming the application. Efficient communication is key to a high-performance MPI library, and the use of specialized interconnect technologies has been a common way to achieve this goal. However, these custom technologies lack the portability and simplicity of a generic communication solution like TCP over Ethernet. This thesis first shows that even though TCP is a higher-overhead protocol than UDP, as a messaging medium it performs better than UDP because of library-level reliability overheads with UDP. Then, the thesis presents a technique to separate the computation and communication aspects of an MPI library and to handle each with the most efficient mechanism. The results show a significant improvement in the performance of MPI libraries with this technique, bringing Ethernet closer to the specialized networks.

Item: Imagine: Media Processing with Streams (2001-03-20)
Khailany, Brucek; Dally, William J.; Kapasi, Ujval J.; Mattson, Peter; Namkoong, Jinyung; Owens, John D.; Towles, Brian; Chang, Andrew; Rixner, Scott
The power-efficient Imagine stream processor achieves performance densities comparable to those of special-purpose embedded processors.
Executing programs mapped to streams and kernels, a single Imagine processor is expected to have a peak performance of 20 GFLOPS and sustain 18.3 GOPS on MPEG-2 encoding.

Item: Improving Fairness in I/O Scheduling for Virtualized Environments (2017-11-30)
Gibson, Riley; Rixner, Scott
Modern virtualization systems must balance fair access to I/O resources while still maintaining high utilization of those resources. It is difficult to balance fairness and efficiency when scheduling disk accesses due to the non-uniform nature of disk I/O. Current open-source virtualization systems, including Xen and KVM, utilize the stock Linux disk scheduler to provide access to storage. Although the Linux disk scheduler can provide good I/O performance for individual virtual machines, it does not necessarily provide equal access to disk I/O resources across competing virtual machines. This can result in unfair and unpredictable application I/O performance among virtual machines. This thesis presents the Virtual Deadline I/O Scheduler, a new disk scheduler that improves the fairness of scheduling I/O resources across virtual machines. The virtual deadline scheduler makes the Linux deadline I/O scheduler virtualization-aware, enabling it to schedule I/O requests more adaptively and fairly. In particular, request deadlines are dynamically determined based upon the level of service that has been provided to the virtual machine from which the request originated. The virtual deadline scheduler increases the fairness of I/O performance while minimizing aggregate performance degradation.

Item: Improving networking server performance with programmable network interfaces (2003)
Kim, Hyong-Youb; Rixner, Scott; Pai, Vijay S.
Networking servers, such as web servers, have been widely deployed in recent years.
While developments in the operating system and applications continue to improve server performance, programmable network interfaces with local memory provide new opportunities to improve server performance through extended network services on the network interface. However, due to their embedded nature, programmable processors on the network interface may suffer from inadequate processing power when compared to non-programmable, application-specific network interfaces. This thesis first shows that exploiting a multiprocessor architecture and task-level concurrency in network interface processing enables programmable network interfaces to overcome the performance disadvantages, relative to application-specific network interfaces, that result from programmability. Then, the thesis presents a network service on a programmable network interface that exploits the storage capacity of the interface to alleviate the local I/O interconnect bottleneck, thereby improving server performance. Together, these two results show that programmable network interfaces can offset the performance disadvantages due to programmability and improve networking server performance through extended network services that exploit their computation power and storage capacity.

Item: Improving power efficiency in stream processors through dynamic cluster reconfiguration (2004-12-01)
Rajagopal, Sridhar; Rixner, Scott; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
Stream processors support hundreds of functional units in a programmable architecture by clustering functional units and utilizing a bandwidth hierarchy. Clusters are the dominant source of power consumption in stream processors. When the data parallelism falls below the number of clusters, unutilized clusters can be turned off to save power.
This paper improves power efficiency in stream processors by dynamically reconfiguring the number of clusters in a stream processor to match the time-varying data parallelism of an application. We explore three mechanisms for dynamic reconfiguration: using memory, conditional streams, and a multiplexer network. A 32-user wireless basestation is a prime example of a workload that benefits from such reconfiguration. When the number of users supported by the basestation dynamically changes from 32 to 4, the reconfiguration from a 32-cluster stream processor to a 4-cluster stream processor yields 15-85% power savings over and above a stream processor that uses conventional power saving techniques such as dynamic voltage and frequency scaling. The dynamic reconfiguration support extends stream processors from traditional high-performance applications to power-sensitive applications in which the data parallelism varies dynamically and falls below the number of clusters.
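The trade-off described in the last abstract above, choosing a cluster configuration that still meets a real-time requirement while drawing the least power, can be sketched as a small selection routine. This is an illustrative sketch only, not code from the paper: the function name `min_power_config` and the throughput and power figures in `configs` are invented for the example.

```python
# Illustrative sketch: pick the lowest-power cluster configuration that
# still finishes the workload within its real-time deadline, mirroring
# the idea of matching active clusters to available data parallelism.
# All numbers below are made up for illustration.

def min_power_config(workload_ops, deadline_s, configs):
    """configs: iterable of (num_clusters, ops_per_second, watts).

    Returns (num_clusters, watts) of the cheapest feasible
    configuration, or None if no configuration meets the deadline.
    """
    feasible = [(watts, n) for n, rate, watts in configs
                if workload_ops / rate <= deadline_s]
    if not feasible:
        return None
    watts, clusters = min(feasible)  # lowest power among feasible options
    return clusters, watts

# Hypothetical configurations: (clusters, ops/s, watts)
configs = [(4, 4e9, 1.0), (8, 8e9, 2.1), (16, 16e9, 4.5), (32, 32e9, 9.0)]

# A light workload (the 4-user case) fits in the smallest configuration:
print(min_power_config(1e9, 0.5, configs))  # → (4, 1.0)
```

The same routine returns a larger configuration as the workload grows, which is the point of the paper's dynamic reconfiguration: the processor can drop to fewer active clusters whenever the data parallelism falls.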