
Browsing by Author "Bennett, John K."

Now showing 1 - 12 of 12
  • EDIF netlist optimization of pipelined designs
    (2000) Balabanos, Vasileios; Bennett, John K.
    This thesis describes the design, implementation, and evaluation of a software system for optimizing synthesized logic circuits. The particular implementation described is targeted to the Xilinx Virtex family of FPGAs, but the techniques developed are relevant to other families of array-based semi-custom programmable logic circuits. One of the unique aspects of my approach is that the optimization occurs after the circuit is mapped onto the logic array. Prior to this work it was commonly believed that optimization after mapping was infeasible. The advantages of this approach include the ability to optimize a design without having the VHDL source code, the opportunity to selectively optimize only parts of a circuit, and the preservation of the original state encoding. The optimizations are also transparent to the synthesis process. This is a powerful and versatile method, which gives the designer considerable freedom in optimizing parts of the design according to his or her preferences. The optimization process proceeds as follows. The behavioral or structural description of the design is first written in VHDL. The design is then synthesized using the Workview Office synthesis tool and extracted to an EDIF (Electronic Design Interchange Format) mapped netlist targeting Xilinx's Virtex family of FPGAs. This netlist is then analyzed, and an internal representation of the given circuit is created. Any pipelines (blocks of combinational logic feeding one or more registers) that exist in the circuit are then identified, and common blocks of logic that reside between the pipeline registers are extracted. Multilevel minimization algorithms in the SIS framework are applied in order to optimize the design. The optimized equations are then converted to an EDIF-compatible format, and all the necessary modifications are computed in order to restructure the original netlist to produce the optimized one. The resultant remapped circuit is then placed and routed as before.
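The pipeline-identification step described in the abstract above (finding the blocks of combinational logic that feed each register) can be sketched roughly as a backward traversal of the mapped netlist. This is a minimal illustrative sketch only; the netlist encoding, cell types, and names here are assumptions, not the thesis's actual data structures.

```python
# Sketch: given a mapped netlist, find the cone of combinational logic
# feeding each register. Cell type "FF" marks a register; nets with no
# driver are treated as primary inputs. All names are illustrative.

# cell name -> (cell type, list of input nets)
NETLIST = {
    "g1": ("LUT", ["a", "b"]),
    "g2": ("LUT", ["g1", "c"]),
    "r1": ("FF",  ["g2"]),      # pipeline register fed by g1 -> g2
    "g3": ("LUT", ["r1", "d"]),
    "r2": ("FF",  ["g3"]),      # next pipeline stage
}

def combinational_cone(net, netlist):
    """Collect combinational cells feeding `net`, stopping at register
    boundaries and at primary inputs (nets with no driver)."""
    cone, stack, seen = [], [net], set()
    while stack:
        n = stack.pop()
        if n in seen or n not in netlist:
            continue                     # already visited, or primary input
        seen.add(n)
        ctype, ins = netlist[n]
        if ctype == "FF":
            continue                     # register boundary: stop here
        cone.append(n)
        stack.extend(ins)
    return cone

def pipeline_blocks(netlist):
    """Map each register to the combinational block it latches."""
    return {cell: combinational_cone(ins[0], netlist)
            for cell, (ctype, ins) in netlist.items() if ctype == "FF"}

blocks = pipeline_blocks(NETLIST)
print(blocks)   # r1 latches the g1/g2 cone; r2 latches g3
```

Each extracted cone is the unit that a minimizer such as SIS would then optimize before the netlist is restructured.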
  • Effectiveness and performance analysis of a class of parallel robot controllers with fault tolerance
    (1996) Hamilton, Deirdre Lynne; Walker, Ian D.; Bennett, John K.
    In the past, robots were only applied to simple repetitive tasks, such as assembly line work. However, robotics research now encompasses a broad spectrum of application possibilities. Robots are being considered for use in more advanced manufacturing applications, medical and space applications, and numerous other tasks. Speed and precision of control are two primary issues for current and future applications of robotic systems. Fault tolerance is also increasingly important for many robot tasks. This work focuses on improving the efficiency and fault tolerance capabilities of robot controllers. Here we address the following questions: "How can robot control be improved from the perspective of the algorithm implementation? What combination of speed and precision can we achieve for good overall performance?" Due to the coupling in the dynamics equations, coarse-grain parallelization of robot control algorithms is particularly difficult. In this thesis, we develop a new parallel control algorithm for robots based on the Newton-Euler dynamics formulation that overcomes the serial nature of these equations, allowing a high level of parallelism. Our controller uses data from a previous control step in current calculations to allow many more tasks to be executed in parallel, thus providing higher control update rates. The use of 'stale' data is an effective solution to the speedup problem, but presents some special difficulties. A stability issue encountered when using 'stale' data in previous algorithmic approaches is discussed here. The incorporation of fault tolerance techniques into robot systems improves reliability, but also increases the hardware and computational requirements of the overall system. Since all of these factors affect system design, it is not always clear how to evaluate the merit, or 'effectiveness', of different fault tolerance approaches for a given application.
In this thesis, we present a new set of performance criteria designed to measure and compare the effectiveness of robot fault tolerance strategies. The measures, which are designed to evaluate fault tolerance/performance/cost tradeoffs, can also be used to evaluate pure performance or pure fault tolerance strategies. We show their usefulness using a variety of proposed fault tolerance approaches in the literature, focusing on multiprocessor control architectures.
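The stale-data idea described in the abstract above can be illustrated with a toy recursion: in a strict backward pass, each link's result depends on its neighbor's result from the same step, which serializes the computation, whereas using the neighbor's value from the previous control step lets every link update concurrently. The dynamics below are stand-in placeholders, not the thesis's Newton-Euler equations.

```python
# Toy illustration of stale-data parallelization. The "dynamics" here is
# a simple backward sum, standing in for the serial backward recursion.

def serial_step(local_forces):
    """Exact backward recursion: f[i] = local[i] + f[i+1] (inherently serial)."""
    n = len(local_forces)
    f = [0.0] * n
    for i in range(n - 1, -1, -1):
        f[i] = local_forces[i] + (f[i + 1] if i + 1 < n else 0.0)
    return f

def stale_step(local_forces, f_prev):
    """Parallelizable update: each link combines its local term with the
    neighbor's value from the PREVIOUS control step, so there is no
    intra-step dependency and all links can be computed concurrently."""
    n = len(local_forces)
    return [local_forces[i] + (f_prev[i + 1] if i + 1 < n else 0.0)
            for i in range(n)]

local = [1.0, 2.0, 3.0]
f = [0.0, 0.0, 0.0]
for _ in range(10):            # over successive control steps the stale
    f = stale_step(local, f)   # estimates converge toward the exact values
print(f, serial_step(local))
```

For a slowly varying input the stale estimate tracks the exact recursion closely; the lag between the two is the source of the stability issue the abstract mentions.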
  • Efficient runtime support for cluster-based distributed shared memory multiprocessors
    (1998) Speight, William Evan; Bennett, John K.
    Distributed shared memory (DSM) systems provide a shared memory programming paradigm on top of a physically distributed network of computers. The DSM system removes the necessity for programmers to move data explicitly between processors. The principal challenge in the development of an efficient DSM system lies in reducing the amount of communication necessary to maintain coherence to an absolute minimum. This thesis presents Brazos, a DSM system for use in an environment of symmetric multiprocessor (SMP) personal computers that are networked together by industry-standard 100 Mbps Fast Ethernet. Brazos is distinguished by its use of application-level multithreading, selective multicast, adaptive runtime mechanisms, and a unique performance history mechanism. Through the detailed analysis of twelve scientific programs, we show that Brazos outperforms the current state-of-the-art software DSM system by an average of 83%, and outperforms a version of the same DSM system that has been altered to take advantage of SMP personal computers by an average of 32%. Our results indicate that networks of commodity personal computers using available PC networks and operating systems can perform comparably, on a wide variety of scientific applications, to more traditional networks of high-end engineering workstations.
  • Implementing multicast in a software emulation of the virtual interface architecture
    (2000) Dobric, Damian; Bennett, John K.
    The Virtual Interface Architecture (VIA) is an emerging standard for low-latency, high-bandwidth, user-level communication designed to achieve high performance by minimizing data copying and kernel/user transitions. Currently very few network controllers provide VIA support, and the current specification for VIA does not include multicast, a useful mechanism for distributed applications. This thesis experimentally tests two ideas: whether a software implementation of VIA can provide useful performance enhancement, and whether multicast support can be incorporated into VIA with tangible benefit. I designed a Windows NT driver software implementation of VIA for Gigabit Ethernet that achieved an average of 57% lower latency than Ethernet (UDP) for messages of one to 64K bytes. These low-level benefits translated to a reduction in execution time of 10--14% over UDP for several distributed applications, and with multicast, an additional reduction of 1% to 15%. We conclude that multicast support would be a useful extension to the VIA specification that could be added without difficulty.
  • Memory architecture in multi-channel optically interconnected distributed shared memory multiprocessor systems
    (1997) Xiao, Yanyang; Bennett, John K.
    Multi-channel optical networks, although common in telecommunication applications, have only recently found application in computer systems. Multi-channel optical networks offer the potential for high performance interconnects for both local computer networks and multiprocessor systems. In addition to providing high bandwidth, multi-channel optical networks exhibit the capability for efficient broadcast. The absence of an efficient broadcast mechanism in point-to-point networks has governed the choice of memory subsystem architecture for many parallel computer systems, and has in particular favored cache-coherent non-uniform memory access (CC-NUMA) over cache-only memory access (COMA) architecture. This thesis examined the choice of memory system architecture in the presence of high bandwidth and efficient broadcast. Using computer simulation, we compared the performance of CC-NUMA and COMA memory architectures in a multi-channel optical network multiprocessing system. Seven well-known parallel benchmarks were used in the study. Our results indicate that COMA consistently and significantly outperforms CC-NUMA in the multi-channel network. We also examined the performance of multi-channel optical networks for a varying number of optical channels. We found that for the simulated architecture, the sampled benchmarks exhibit significant performance gain using only a small number of channels, relative to the number of nodes in the system. We further found that multiple channels offer better performance than that of a single channel with the same aggregate bandwidth.
  • ParaView: Performance debugging through visualization of shared data
    (1994) Speight, William Evan; Bennett, John K.
    Performance debugging is the process of isolating and correcting performance problems in an otherwise correct parallel program. Problems not immediately visible to the parallel programmer often lead to poor application performance. This thesis describes the design, implementation, and use of ParaView, a tool to locate performance inefficiencies in programs written for shared-memory multiprocessors. ParaView supplies an intuitive, graphical interface based upon the X Window System. ParaView aids parallel applications programmers in uncovering performance bugs relating to poor cache performance, load balancing, false sharing, and inefficient synchronization. Eleven parallel programs have been analyzed using ParaView, and performance limitations in five of these were improved. Reductions in overall execution times range from 25% to 86% for sixteen-processor simulations. Our experience demonstrates that ParaView facilitates parallel program performance debugging by reducing the amount of time required to uncover and correct performance problems relating to poor data partitioning, false sharing, contention for shared data constructs, and unnecessary synchronization.
  • Performance and reliability of a parallel robot controller
    (1992) Hamilton, Deirdre Lynne; Bennett, John K.; Walker, Ian D.
    Most robot controllers today are uniprocessor architectures. As robot control algorithms become more complex, these serial controllers have difficulty providing the desired response times. Additionally, with robots being used in environments that are hazardous or inaccessible to humans, fault-tolerant robotic systems are particularly desirable. A uniprocessor control architecture cannot offer tolerance of processor faults. Use of multiple processors for robot control offers two advantages over single processor systems. Parallel control provides a faster response, which in turn allows a finer granularity of control. Processor fault tolerance is also made possible by the existence of multiple processors. There is a trade-off between performance and level of fault tolerance provided. The work of this thesis shows that a shared memory multiprocessor robot controller can provide higher performance than a uniprocessor controller, as well as processor fault tolerance. The trade-off between these two attributes is also demonstrated.
  • Reliable parallel computing on clusters of multiprocessors
    (2000) Abdel-Shafi, Hazim M.; Bennett, John K.
    This dissertation describes the design, implementation, and performance of two mechanisms that address reliability and system management problems associated with parallel computing clusters: thread migration and checkpoint/recovery. A unique aspect of this work is the integration of these two mechanisms. Although there has been considerable prior work on each of these mechanisms in isolation, their integration offers synergistic benefit to both functionality and performance. Used in conjunction, these mechanisms facilitate failure recovery, and node addition and removal with minimal disruption of executing applications. Our implementation differs from previous work in the following ways. First, by using thread migration instead of process migration, the overhead of moving computation among nodes is reduced. Second, because our implementation of checkpoint/recovery separates computation and data, it is possible to distribute data and threads among other nodes during recovery. This is possible because the underlying support for thread migration in the system allows the recovery of a thread from any checkpoint on any node. Third, our implementation does not require repartitioning of a running parallel application when resources are added or removed. Finally, the checkpoint/recovery and thread migration mechanisms are both implemented at user level. The benefits of a user-level implementation include ease of development since operating system source code is not required, adaptability to other platforms, and simple upgrades to new versions of the underlying operating system and hardware. The prototype implementation described in this thesis was developed as an extension to the Brazos software distributed shared memory system. Brazos allows multithreaded parallel applications to execute on networks of multiprocessor servers running the Windows NT/2000 operating system.
  • Simulation of shared memory parallel systems
    (1990) Mukherjee, Rajat; Bennett, John K.
    This thesis describes a method to simulate parallel programs written for shared memory multiprocessors. We have extended execution-driven simulation to facilitate the simulation of shared memory. We have developed a shared memory profiler which, at compile time, inserts simulation support code into the assembly code of the program so that data address references can be extracted at run time. From each data address, we determine the nature of the reference, simulate the access, and account for it. Programs to be simulated are written using Presto, an object-oriented parallel programming environment for shared memory multiprocessors based on C++. To validate the accuracy of our simulation methods, we have developed and evaluated an architecture model for the BBN Butterfly shared memory multiprocessor. The results of these tests are presented and discussed. We also describe extensions that would allow the simulation of shared memory systems with caches using execution-driven simulation techniques.
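The reference-capture idea in the abstract above (instrumenting memory accesses so a simulator can account for each one) can be illustrated at a much higher level than assembly rewriting: wrap a shared store so that every read and write is logged as an (operation, address) event. This is a toy analogue with invented names, not the thesis's profiler.

```python
# Toy analogue of a shared-memory reference profiler: every access to
# the wrapped store is recorded, producing the reference stream that an
# execution-driven simulator would consume. Names are illustrative.

class TracedMemory:
    """Array-like store that logs (op, address) for each access."""
    def __init__(self, size):
        self._data = [0] * size
        self.trace = []          # the reference stream fed to a simulator

    def __getitem__(self, addr):
        self.trace.append(("read", addr))
        return self._data[addr]

    def __setitem__(self, addr, value):
        self.trace.append(("write", addr))
        self._data[addr] = value

mem = TracedMemory(4)
mem[0] = 7            # logged as a write to address 0
mem[1] = mem[0] + 1   # logged as a read of 0, then a write to 1
print(mem.trace)      # [('write', 0), ('read', 0), ('write', 1)]
```

A cache or coherence simulator would replay such a trace (or, as in execution-driven simulation, process each event as it occurs) to charge each reference its simulated cost.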
  • The design of a high performance interconnect for distributed shared memory multiprocessing
    (1997) Filippo, Michael Alan; Bennett, John K.
    This thesis describes and evaluates the design of a high performance interconnect for use in a distributed shared memory multiprocessor. The network is based on the Peripheral Component Interconnect (PCI) bus and is fully compliant with PCI Specification Revision 2.1. It includes a high performance crossbar switch with support for up to sixteen fully concurrent, packet-switched, duplex communication channels, each operating at 528 Mb/s. The design of the network was approached from three perspectives. First, we examined the architectural aspects of the network to determine its critical features. Second, we performed detailed simulations of three parallel applications, a useful approach for architectural validation and for providing precise approximations of subsystem requirements. Finally, we developed a complete logical and physical description of the datapaths and control logic used in the major subsystems, and created a state-accurate Verilog model to verify the physical design. All aspects of the resulting design were optimized to maximize network performance. Preliminary results indicate the network is capable of sustained throughput in excess of 3.5 Gb/s on real applications, sustained packet bandwidth exceeding 4.3 million 64-byte packets per second, and packet latencies below 1 µs.
  • The design of a scalable, hierarchical-bus, shared-memory multiprocessor
    (1992) Greenwood, Jay Alan; Bennett, John K.
    The hierarchical-bus architecture is an attractive solution to many of the problems associated with connecting processors together into a multiprocessing system, but it also poses a number of design challenges. This thesis evaluates several architectural features of a hierarchical-bus multiprocessor. Our results show that applications with significant amounts of shared data achieve higher performance when run on a multiprocessor with a hierarchy of buses than on a single-bus multiprocessor. Also, applications with a significant number of write accesses to private data perform better using a cache protocol that modifies data within the cache (a copy-back protocol). This thesis describes a copy-back protocol for a hierarchical-bus multiprocessor and compares it with a cache protocol that broadcasts writes on the bus (a write-through protocol).
  • The interaction of architecture and operating system in the designing of a scalable shared memory multiprocessor
    (1995) Mukherjee, Rajat; Bennett, John K.
    This dissertation describes the implementation and evaluation of operating system design techniques that can be used to achieve scalability and to improve performance in large-scale shared memory multiprocessors with non-uniform memory hierarchies. We describe the implementation of SALSA, an operating system that incorporates these techniques and that executes on a commercially available processor. The contributions of this dissertation include the implementation of a technique that masks memory latency and increases processor utilization via rapid context switching, and a detailed study of the effects of cache organization and caching policy on latency hiding. The dissertation presents the relative performance of several alternatives for context caching on a register window architecture and shows that write-back, set-associative caches provide the best latency-hiding performance, especially with constructive cache interference. We have demonstrated significant improvements in program performance (120%) with latency hiding when cache miss latency is high, even with low cache miss rates (1-2%). We show that direct-mapped caches are unsuitable when operating system code is highly sensitive to cache misses, as in the case of context switching trap code. We also show that increased processor utilization can significantly increase contention on the underlying network. In architectures with non-uniform memory access behavior, the exploitation of thread and data placement by the operating system is mandatory for improved performance. The organization of the SALSA kernel exploits the underlying memory architecture. We describe a programming model that takes into account the clustering in the system, and provides primitives for hierarchical data placement and hierarchical thread scheduling. We show that proper data placement can double performance in a three-level memory hierarchy such as Willow.
SALSA also provides user control over memory allocation for fine-tuning a program's memory requirements, which was shown to improve program performance by up to 20%. Although the techniques described in this dissertation have been evaluated on a hierarchical bus-based architecture similar to Willow, they are applicable to any large-scale multiprocessor characterized by non-uniform memory access behavior and large memory access latency.

Physical Address:

6100 Main Street, Houston, Texas 77005

Mailing Address:

MS-44, P.O.BOX 1892, Houston, Texas 77251-1892