Browsing by Author "Sinclair, James B."
Now showing 1 - 15 of 15
Item: An efficient implementation of Batcher's odd-even merge algorithm and its application in parallel sorting schemes (1981)
Authors: Kumar, Manoj; Hirschberg, Daniel S.; Sinclair, James B.; Jump, J. Robert
An algorithm is presented to merge two subfiles of size n/2 each, stored in the left and right halves of a linearly connected processor array, in 3n/2 route steps and log n compare-exchange steps. This algorithm is extended to merge two horizontally adjacent subfiles of size m×n/2 each, stored in an m×n mesh-connected processor array in row-major order, in m+2n route steps and log mn compare-exchange steps. These algorithms are faster than their previously proposed counterparts. Next, an algorithm is presented to merge two vertically aligned subfiles, stored in a mesh-connected processor array in row-major order. Finally, a sorting scheme is proposed that requires 11n route steps and 2 log n compare-exchange steps to sort n² elements stored in an n×n mesh-connected processor array. The previous best sorting algorithm requires 14n route steps (for practical values of n, 4 < n ≤ 512).

Item: Analytical performance prediction of parallel systems (1993)
Authors: Dawkins, William Price; Sinclair, James B.
The need for more computing power for many computer applications has increased beyond the capabilities of traditional von Neumann architectures. This has generated interest in using parallel computers to improve processing power. Because of the high cost of implementing parallel computers, accurate and efficient tools are needed to predict the performance of parallel computer designs and to determine which designs should be built. This thesis presents an analytical technique for predicting the performance of parallel algorithms realized on parallel architectures. The technique models arbitrary non-deterministic task execution times, explicit precedence constraints, resource contention, and a variety of resource scheduling policies. The program model used by this technique is a task graph.
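For orientation, the compare-exchange recursion behind Batcher's odd-even merge (the basis of the route and compare-exchange step counts in the first item above) can be sketched sequentially. This is only an illustrative Python sketch of the classical merge on two sorted runs of equal power-of-two length, not the array-embedded algorithm of the thesis:

```python
def odd_even_merge(a, b):
    """Batcher's odd-even merge of two sorted lists of equal
    power-of-two length: merge the even- and odd-indexed elements
    recursively, then do one compare-exchange pass over neighbours."""
    n = len(a) + len(b)
    if n == 2:
        return [min(a[0], b[0]), max(a[0], b[0])]
    even = odd_even_merge(a[::2], b[::2])
    odd = odd_even_merge(a[1::2], b[1::2])
    merged = [None] * n
    merged[::2] = even
    merged[1::2] = odd
    # final compare-exchange step on pairs (1,2), (3,4), ...
    for i in range(1, n - 1, 2):
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged
```

On a processor array, each recursion level corresponds to route steps that bring the compared elements together, which is where the 3n/2 and m+2n route-step counts come from.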
The nodes of a task graph represent parts of a parallel algorithm that require service from resources. The edges connecting nodes represent precedence constraints. Each task is assigned a random variable representing its execution time. The technique approximates arbitrarily distributed task execution time random variables with simple random variables. The different sequences of events that can occur due to resource contention are represented in a sequencing tree. A fully constructed sequencing tree represents all the possible sequences of events that can occur during the execution of a task graph. The technique allows the user to trade accuracy for efficiency by basing performance predictions on partial constructions of the sequencing trees. The technique is implemented in a program called ES. ES predicts both the mean and standard deviation of the execution time of a task graph representing a parallel algorithm. ES predictions are compared to predictions of simulations and to timings of real parallel algorithms executing on a real parallel computer. Two mergesort algorithms were implemented on a hypercube multicomputer. ES estimates of the mean execution times of these algorithms differ by at most 7.0% from the mean execution times measured on the hypercube. ES estimates of the execution time of an FFT algorithm are compared to estimates obtained from execution-driven simulations. The ES estimates are within ±0.15% of the estimates predicted by simulation.

Item: CSIM: an efficient implementation of a discrete-event simulator (1985)
Authors: Covington, Richard Glenn; Jump, J. Robert; Sinclair, James B.; Briggs, Faye A.
We discuss the design and efficient implementation of the Rice C Simulation Package (CSIM), a software tool, written in C, designed to work compatibly with the standard C compiler. The tool provides support for discrete-event or process-interaction simulation, especially for digital logic and queueing-theoretic models.
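CSIM itself is a C package; purely to illustrate the event-list mechanism that any discrete-event simulator is built around, here is a minimal Python sketch using a binary heap as the event list (the TL algorithm the abstract refers to is a more elaborate, more efficient event-list structure):

```python
import heapq

def simulate(initial_events, horizon):
    """Minimal event-list (discrete-event) loop. Events are
    (time, action) pairs; an action is called with the current clock
    and may schedule further events by returning (delay, action) pairs.
    Returns the timestamps of all events processed up to the horizon."""
    seq = 0                      # tie-breaker for simultaneous events
    heap = []
    for t, action in initial_events:
        heapq.heappush(heap, (t, seq, action))
        seq += 1
    log = []
    while heap:
        clock, _, action = heapq.heappop(heap)
        if clock > horizon:
            break
        log.append(clock)
        for delay, nxt in action(clock):
            heapq.heappush(heap, (clock + delay, seq, nxt))
            seq += 1
    return log
```

A self-rescheduling "clock tick" process, for example, produces one event per time unit until the horizon is reached.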
We first discuss the existing modeling formalism necessary to abstract a discrete-event model from a real system. We then introduce a set of primitives sufficient for preparing an algorithmic specification of the abstract model. Finally, we report on the successful realization of the primitives, discussing or clarifying existing modeling methodology and establishing new methodology where necessary. We also describe the implementation of a recently proposed efficient event-list algorithm (the TL algorithm) and present a study of its complexity.

Item: Design of a graphical input simulation tool for extended queueing network models (1985)
Authors: Madala, Sridhar; Sinclair, James B.; Briggs, Faye A.; Varman, Peter J.
In analyzing the performance of computer and digital communication systems, contention for finite-capacity resources is often seen to be a dominant factor. Extended Queueing Network (EQN) models are appropriate for modeling such systems and have been used with considerable success. EQN models can be solved by exact analysis, by approximate analysis, or by simulation. This thesis is concerned with the design of a high-level tool for solving EQN models by simulation that accepts specifications in a natural graphical manner. The primary motivation for this research was to prove the feasibility of providing a versatile tool that is easy to learn and use and is complete with respect to EQN models. Existing software tools for EQN modeling do not take advantage of the fact that a natural way to specify such models is graphical. Our modeling tool, the Graphical Input Simulation Tool (GIST), achieves these objectives 1) by using a transaction-oriented approach rather than a language-based approach, 2) by providing two user interfaces, a graphical interface and a textual interface, that permit specification of EQN models at a very high level of abstraction, and 3) by means of a versatile set of modeling abstractions.
In terms of modeling capabilities, GIST provides analogs of most abstractions commonly found in other high-level EQN modeling tools and also includes abstractions that have no counterparts in other tools. We demonstrate the feasibility and utility of providing a GIST-like tool and conclude that the transaction-oriented approach is also applicable and appropriate for building modeling tools in areas other than EQNs. GIST can also be extended to further research in performance evaluation tools and in performance evaluation in general.

Item: Dynamic memory interconnections for rapid access (1981)
Authors: Iyer, Balakrishna R.; Sinclair, James B.; Hirschberg, Daniel S.; Jump, J. Robert
A dynamic memory is a storage facility for fixed-size data items. The memory is comprised of cells, each cell capable of storing one datum. Data paths between cells are provided by a memory interconnection network; each cell is directly connected to only a small number of cells. At every clock pulse, data items migrate from cell to cell via the data paths. The memory cells may be divided into several groups. A control mechanism provides each group of memory cells with a control signal, which determines the data paths to be taken by data items contained in all cells within the group. Many dynamic memory organizations have been proposed. These exhibit trade-offs between the time to access a datum randomly and the time to access serially a block of logically contiguous data. The access times for these organizations are derived where necessary and compared. A new organization, called the deck memory organization, is proposed. Access times for the deck are determined and compared with those derived for other organizations.

Item: Efficient methods for cache performance prediction (1989)
Authors: Dwarkadas, Sandhya; Jump, J. Robert; Sinclair, James B.
The goal of our work is to develop techniques that accurately and efficiently simulate the behavior of computer systems with cache memories.
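As a deliberately simplified instance of cache simulation over an address trace, a direct-mapped miss-ratio calculator fits in a few lines. This sketch is only to fix ideas; the techniques in the thesis are precisely about being far more efficient than such a naive per-reference pass:

```python
def miss_ratio(trace, lines, line_size):
    """Simulate a direct-mapped cache over a list of byte addresses
    and return the miss ratio. `lines` is the number of cache lines,
    `line_size` the line size in bytes (illustrative parameters)."""
    tags = [None] * lines          # tag store, initially empty (cold cache)
    misses = 0
    for addr in trace:
        block = addr // line_size  # memory block holding this address
        idx = block % lines        # direct-mapped: one candidate line
        if tags[idx] != block:     # cold miss or conflict miss
            tags[idx] = block
            misses += 1
    return misses / len(trace)
```

Two addresses whose blocks differ by a multiple of `lines` conflict in the same cache line, which is the effect trace-driven miss-ratio prediction has to capture.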
This thesis describes the design, analysis, and validation of three such methods of cache performance prediction. Execution-driven simulation is a technique that avoids the high overhead associated with instruction-level simulation while retaining most of its accuracy. We have extended the execution-driven paradigm to develop a time- and space-efficient technique for address trace generation and cache simulation, as well as to provide estimates of overall execution time. The second method that we have developed is an analytical model for the prediction of cache miss ratios using single-process traces. Finally, a simple and efficient estimative simulation technique based on the analytical model and the execution-driven paradigm is outlined. This approach is demonstrated in the simulation of cache-based multiprocessor systems in conjunction with the Rice Parallel Processing Testbed, which simulates concurrent algorithms on parallel architectures.

Item: Efficient simulation and utilization of a parallel digital signal processing architecture (1989)
Authors: Foundoulis, William James; Jump, J. Robert; Sinclair, James B.
In this study we discuss the development and validation of an efficient and accurate execution-driven simulation of the Texas Instruments Odyssey System, a parallel configuration of digital signal processors. We also evaluate the performance of a high-level parallel programming interface, Odyssey Concurrent C, designed to effectively utilize the parallelism available in the Odyssey architecture. Parallel versions of three dissimilar algorithms (merge sort, 2-dimensional convolution, and successive over-relaxation) have been run on both the Odyssey and the simulator. Quantitative differences between performance results obtained on the Odyssey and those predicted by simulation are enumerated, and shown to validate the accuracy of the execution-driven approach.
The simulation is also shown to be efficient relative to the degree of accuracy obtainable. Finally, the Odyssey Concurrent C utilities are shown to provide a flexible and effective mechanism for managing parallelism in the Odyssey environment.

Item: Extended queuing network modeling (1985)
Authors: Doshi, Kshitij Arun; Sinclair, James B.; Briggs, Faye A.; Zwaenepoel, Willy
Evaluating the performance of a system is of central concern in making engineering decisions. When direct measurement of performance is not possible or feasible, evaluation consists of two phases: specification of an appropriate performance model, and evaluation of the model to obtain the performance measures. Broadly, a performance model can be evaluated by exact or approximate analysis, or by simulation. A class of models popular for the evaluation of a number of systems, computer systems in particular, is that of Extended Queuing Network (EQN) models. Software tools are typically used for building EQN models for evaluation through analysis or simulation. This thesis describes an experiment with an approach to the design and implementation of a tool for performance evaluation of EQN models via simulation. The objective is to design a tool that is easy and intuitive to use, yet versatile and powerful in its modeling capabilities. The tool we have implemented is called the Graphical Input Simulation Tool (GIST).
GIST meets its design objectives by (1) providing a pair of user interfaces that accept the abstract EQN model specification directly, are easy and intuitive to learn and use, and support quick model specification with reduced likelihood of semantic and syntactic specification errors, and (2) incorporating into its set of EQN objects the capabilities perceived necessary for realistic modeling of the activities that characterize the systems of interest.

Item: Incremental compilation and code generation (1980)
Authors: Bruce, Robert Ewing; Kennedy, Kenneth W.; Jump, J. Robert; Sinclair, James B.
Time-sharing compilers are typically batch compilers that have been modified, via inclusion of a symbolic debugger, to "emulate" an interactive environment. The trend is away from interaction with the language processor and toward interaction with a common symbolic debugger. There are several problems with this approach: 1) it causes long delays when a problem is isolated and a source change is necessary to correct a "bug", since an editor, compiler, and linkage editor must be used to get back to the symbolic debugger; 2) using a symbolic debugger requires knowledge of yet another language (the debugger's language); and 3) a symbolic debugger is typically written to work with more than one language and therefore has to house (sometimes incompatible) constructs for all of the languages it supports. The incremental compiler, on the other hand, responds rapidly to source changes. There is no need for a complete recompilation and linkage edit in order to re-execute the program after a change is made; the time required to make a change is proportional to the size of the change, not to the size of the program. The BASIC language processing system discussed in this work can operate as an incremental language processor as well as a Link-Ready (LR) code generating compiler.
The term "link-ready" denotes a type of relocatable object code that is suitable for linking with the BASIC library as well as with other user-written routines that have been separately compiled by BASIC or any other language processor. The compiler system operates in two modes: a link-ready code generating mode, commonly called a batch compiler, and an incremental or interactive mode that allows the user to enter source lines in any order, receiving error messages (if a source line is in error) as each line is entered. A BASIC program is first developed using the incremental compiler. Once the program is "debugged", it is compiled using the batch compiler to produce a more efficient executable module.

Item: Message passing in buffered delta networks (1985)
Authors: Walkup, James W.; Jump, J. Robert; Sinclair, James B.; Briggs, Faye A.
Delta networks are multistage interconnection networks that can be used to communicate rapidly between components of a modular system. In this thesis their performance is examined and evaluated for two types of message passing: circuit switching and packet switching. In circuit switching, a start packet enters the network as the first packet in the message and reserves a path for the rest of the packets; at the end of the message, an end packet cleans up the path by releasing the switches set by the start packet. In packet switching there is no start packet to reserve the path, so each packet of the message contends on its own for every switch in its path. The performance of circuit switching is found to be as good as or better than that of packet switching in terms of throughput and message delay, with packet switching being better for initial delay. The effects of changes in network parameters are also examined for both schemes and the differences are discussed.
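The property that lets each packet contend for switches on its own is that delta networks are self-routing: the path is determined entirely by the destination address. Assuming, purely for illustration, a network built from 2×2 switches, destination-tag routing can be sketched as:

```python
def route(dest, stages):
    """Destination-tag (self-)routing through a delta network of 2x2
    switches: at stage k the packet leaves on the switch output given
    by bit k of the destination address, most significant bit first.
    Returns the list of output ports taken, one per stage."""
    return [(dest >> (stages - 1 - k)) & 1 for k in range(stages)]
```

Because no global routing decision is needed, a start packet in circuit switching can set each switch as it passes, and independent packets in packet switching can compute their next hop locally at every stage.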
The network parameters varied are the time taken to pass a packet through a switch, the time taken to decode the address of the next switch, the size of the buffers between stages, the size of the network, and the length of the messages.

Item: Module assignment in distributed systems (1984)
Authors: Lu, Mi; Sinclair, James B.; Varman, Peter J.; Jump, J. Robert
The problem of finding an optimal assignment of a modular program to n processors in a distributed system is studied. We characterize distributed programs by Stone's graph model and attempt to find an assignment of modules to processors that minimizes the sum of module execution costs and intermodule communication costs. The problem is NP-complete for more than three processors. We first show how to identify all modules that must be assigned to a particular processor under any optimal assignment; this usually results in a significant reduction in the complexity of the optimal assignment problem. We also present a heuristic algorithm for finding assignments and experimentally verify that it almost always finds an optimal assignment.

Item: Performance of multicomputers using high-speed communication links (1993)
Authors: Rizvi, Haider Abbas; Sinclair, James B.
This thesis presents the results of a simulation study of the performance of a message-passing multicomputer using high-speed point-to-point communication links. The multicomputer system consists of IBM RS/6000 machines linked by 220-megabit-per-second fiber-optic links. This system is simulated using RIOSIM, a fast, accurate, and flexible execution-driven parallel architecture simulator. An accurate timing profiler, simulating the superscalar capabilities of the RS/6000, dynamically generates timing estimates for the instructions executed at run time. The simulation results are validated against actual measurements on a two-processor system, using a variety of algorithms; the errors are typically around 8%.
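To make the cost function of the module-assignment item above concrete: for two processors and a handful of modules the optimal assignment can simply be found by brute force. The cost tables below are hypothetical, chosen only for illustration:

```python
from itertools import product

def optimal_assignment(exec_cost, comm_cost):
    """Exhaustive two-processor module assignment minimizing the sum of
    execution and communication costs (Stone's model).
    exec_cost[m][p]: cost of running module m on processor p (p in {0, 1});
    comm_cost[(i, j)]: cost incurred when modules i and j are on
    different processors. Feasible only for small module counts."""
    n = len(exec_cost)
    best_cost, best_assign = float("inf"), None
    for assign in product(range(2), repeat=n):
        total = sum(exec_cost[m][assign[m]] for m in range(n))
        # intermodule communication is paid only across the processor cut
        total += sum(c for (i, j), c in comm_cost.items()
                     if assign[i] != assign[j])
        if total < best_cost:
            best_cost, best_assign = total, assign
    return best_cost, best_assign
```

Stone's result reduces this two-processor case to a minimum-cut computation on the module graph; the exhaustive search here only serves to show what is being minimized, and why the general n-processor problem the thesis tackles is hard.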
The validated model is used to study systems with more than two processors. Simulation results indicate that this setup is suitable for coarse-grained parallel algorithms, some of which show almost linear speedups. For fine-grained algorithms, the high overhead of message passing proves to be a serious bottleneck, resulting in less-than-linear speedups.

Item: Performance of synchronous parallel algorithms with regular structures (1988)
Authors: Madala, Sridhar; Sinclair, James B.
The ability to model the execution and predict the performance of parallel algorithms is necessary if parallel computer systems are to be designed and utilized effectively. One factor inherent to the structure of an algorithm is the delay that arises when one task must await the completion of another. This thesis is concerned with the problem of evaluating the performance of parallel algorithms on MIMD computer systems when tasks in the algorithms have non-deterministic execution times. It addresses the following questions: How does non-determinism affect synchronization delays and ultimately the performance of parallel algorithms? Is it possible to predict this effect independent of implementation on a specific architecture? How can such predictions be validated? The approach adopted views a parallel algorithm as a collection of tasks with constraints on their relative order of execution. Important classes of parallel algorithms that exhibit regularity in their task graph structures are identified, and the performance of these structures is predicted using results from order statistics and extreme value theory. These classes are algorithms whose task graphs have multiphase, partitioning, and pipeline structures.
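The role order statistics play here can be illustrated with a small sketch: in a multiphase (barrier-synchronized) structure, a phase finishes only when its slowest task does, so the mean phase time is the expected maximum of the task execution times. Assuming i.i.d. Exp(1) task times purely for illustration, a Monte Carlo estimate can be checked against the known closed form:

```python
import random

def mean_barrier_time(n, trials=20000, seed=1):
    """Monte Carlo estimate of the expected completion time of a phase
    of n parallel tasks with i.i.d. Exp(1) execution times: the barrier
    waits for the maximum of the n task times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0) for _ in range(n))
    return total / trials

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, the exact expected maximum of
    n i.i.d. Exp(1) random variables."""
    return sum(1.0 / k for k in range(1, n + 1))
```

Even with mean task time 1, eight synchronized tasks take about H_8 ≈ 2.72 on average per phase, which is the kind of non-determinism penalty the thesis bounds and approximates for general distributions.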
The Rice Parallel Processing Testbed (RPPT), a program-driven simulation tool for the performance evaluation of parallel algorithms on parallel architectures, is used along with distribution-driven simulation to validate the results. Variations in the execution times of tasks generally increase the length of synchronization delays and adversely affect the performance of a parallel algorithm. In the case of an algorithm with a regular task graph structure, it is possible to quantify the performance degradation due to non-determinism under certain assumptions about the task execution times and the number of processors available to execute the tasks. In particular, it is possible to place bounds on and approximate the mean execution time of the multiphase, partitioning, and pipeline structures. Distribution-driven simulations and results from sorting algorithms run on the RPPT indicate that these bounds and approximations, which are based on independence assumptions about task execution time distributions, are robust even in the presence of dependencies among task execution times. (Abstract shortened with permission of author.)

Item: Synchronization, coherence, and consistency for high performance shared memory multiprocessing (1993)
Authors: Dwarkadas, Sandhya; Jump, J. Robert; Sinclair, James B.
Although improved device technology has increased the performance of computer systems, fundamental hardware limitations and the need to build faster systems using existing technology have led many computer system designers to consider parallel designs with multiple computing elements. Unfortunately, the design of efficient and scalable multiprocessors has proven to be an elusive goal. This dissertation describes a hierarchical bus-based multiprocessor architecture, an adaptive cache coherence protocol, and efficient and simple synchronization support that together meet this challenge.
We have also developed an execution-driven tool for the simulation of shared-memory multiprocessors, which we use to evaluate the proposed architectural enhancements. Our simulator offers substantial advantages in reduced time and space overheads compared to instruction-driven or trace-driven simulation techniques, without significant loss of accuracy. The simulator generates correctly interleaved parallel traces at run time, allowing the accurate simulation of a variety of architectural alternatives for a number of programs. Our results provide a quantitative analysis of the viability of large-scale bus-based memory hierarchies. We evaluate the effect on performance of several architectural enhancements, and discuss the trade-offs between reducing contention and increasing latency as the number of levels in the memory hierarchy is increased. Toward this end, we have developed a cache coherence protocol for a hierarchical bus-based architecture that minimizes total communication overhead by utilizing all available (bus-provided) information. Based on our evaluation, we propose an integrated set of architectural design decisions: synchronization using a conditional test&set operation that eliminates excess bus traffic and contention; conditional access scheduling, where bus traffic is reduced by keeping track of pending bus accesses for every cache line; adaptive caching, where each cache line is assigned a coherence protocol based upon the expected or observed access behavior for that line; and the use of relaxed memory consistency models, where writes are aggressively buffered. We also present a new classification of memory consistency models that, in addition to unifying all existing models into a common framework, provides insight into the implications of these models with respect to access ordering.

Item: Variable length chain coding with applications (1979)
Authors: Pau, Charles King-Chiu; Figueiredo, Rui J. P. de; Johnson, Don H.; Sinclair, James B.
A special scheme for coding a digitized picture is presented. This scheme encodes a picture by coding the boundaries of areas in the picture as sets of line segments. It is shown that this scheme is superior to the conventional schemes in terms of the data compression factor and faithful reproduction of the picture. One specific application of this scheme is in template matching. Experimental results are presented which show that template matching using this scheme is faster than the conventional template matching technique by a factor of at least 1.
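For background on the last item, the conventional fixed-length chain code that the variable-length, line-segment scheme improves upon is the Freeman chain code, which records each boundary step as one of eight directions. A minimal sketch (direction numbering is the standard one, assumed here; it is not taken from the thesis):

```python
# Freeman 8-direction codes: 0 = east, numbered counterclockwise
# with y increasing upward.
DIRS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
        (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(points):
    """Encode a boundary, given as a sequence of pixel coordinates in
    which each step moves to an 8-neighbour, as a Freeman chain code
    (one 3-bit direction per boundary step)."""
    return [DIRS[(x2 - x1, y2 - y1)]
            for (x1, y1), (x2, y2) in zip(points, points[1:])]
```

A variable-length scheme instead approximates runs of such steps by longer line segments, which is where the data-compression advantage described in the abstract comes from.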