Browsing by Author "Cooper, Keith D."
Now showing 1 - 20 of 45

Item A simple, fast dominance algorithm (2006-01-11)
Cooper, Keith D.; Harvey, Timothy J.; Kennedy, Ken
The problem of finding the dominators in a control-flow graph has a long history in the literature. The original algorithms suffered from a large asymptotic complexity but were easy to understand. Subsequent work improved the time bound, but generally sacrificed both simplicity and ease of implementation. This paper returns to a simple formulation of dominance as a global data-flow problem. Some insights into the nature of dominance lead to an implementation of an O(N²) algorithm that runs faster, in practice, than the classic Lengauer-Tarjan algorithm, which has a time bound of O(E · log(N)). We compare the algorithm to Lengauer-Tarjan because it is the best known and most widely used of the fast algorithms for dominance. Working from the same implementation insights, we also rederive (from earlier work on control dependence by Ferrante et al.) a method for calculating dominance frontiers that we show is faster than the original algorithm by Cytron et al. The aim of this paper is not to present a new algorithm but, rather, to make an argument, based on empirical evidence, that algorithms with discouraging asymptotic complexities can be faster in practice than those more commonly employed. We show that, in some cases, careful engineering of simple algorithms can overcome theoretical advantages, even when problems grow beyond realistic sizes. Further, we argue that the algorithms presented herein are intuitive and easily implemented, making them excellent teaching tools.
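
The paper's engineered-simplicity argument is easiest to see in code. Below is a minimal sketch of the iterative, data-flow-style dominator computation in the spirit of the paper, assuming the CFG is given as predecessor lists plus a reverse-postorder numbering; the function and variable names (dominators, intersect, preds, rpo) are illustrative, not the paper's.

```python
def intersect(b1, b2, idom, rpo_num):
    # Walk the two candidate dominators up toward the root until they meet.
    while b1 != b2:
        while rpo_num[b1] > rpo_num[b2]:
            b1 = idom[b1]
        while rpo_num[b2] > rpo_num[b1]:
            b2 = idom[b2]
    return b1

def dominators(entry, preds, rpo):
    """preds: node -> predecessor list; rpo: all nodes in reverse postorder."""
    rpo_num = {b: i for i, b in enumerate(rpo)}
    idom = {entry: entry}
    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for b in rpo:
            if b == entry:
                continue
            # start from any already-processed predecessor ...
            new_idom = next(p for p in preds[b] if p in idom)
            # ... and fold in the rest with the intersection operator
            for p in preds[b]:
                if p != new_idom and p in idom:
                    new_idom = intersect(p, new_idom, idom, rpo_num)
            if idom.get(b) != new_idom:
                idom[b] = new_idom
                changed = True
    return idom

# A diamond CFG: A -> {B, C} -> D.  Every node's immediate dominator is A.
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(dominators("A", preds, ["A", "B", "C", "D"]))
```

The paper's dense doms array and postorder numbering make intersect() very cheap; the dictionary version here trades that constant factor away for brevity.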

Item A type-based prototype compiler for telescoping languages (2009)
McCosh, Cheryl; Kennedy, Ken; Cooper, Keith D.
Scientists want to encode their applications in domain languages with high-level operators that reflect the way they conceptualize computations in their domains. The telescoping-languages approach calls for automatically generating optimizing compilers for these languages by pre-compiling the underlying libraries that define them to generate multiple variants optimized for use in different possible contexts, including different argument types. The resulting compiler replaces calls to the high-level constructs with calls to the optimized variants. This approach aims to automatically derive high-performance executables from programs written in high-level domain-specific languages. TeleGen is a prototype telescoping-languages compiler that performs type-based specializations. For the purposes of this dissertation, types include any set of variable properties, such as intrinsic type, size, and array sparsity pattern. Type inference and specialization are cornerstones of the telescoping-languages strategy. Because optimization of library routines must occur before their full calling contexts are available, type inference provides the critical information needed to determine which specialized variants to generate, as well as how best to optimize each variant to achieve the highest performance. To build the prototype compiler, we developed a precise type-inference algorithm that infers all legal type tuples, or type configurations, for the program variables, including routine arguments, for all legal calling contexts. We use the type information inferred by our algorithm to drive specialization and optimization. We demonstrate the practical value of our type-inference algorithm and the type-based specialization strategy in TeleGen.

Item ACME: Adaptive Compilation Made Efficient/Easy (2005-06-17)
Cooper, Keith D.; Grosul, Alexander; Harvey, Timothy J.; Reeves, Steven W.; Subramanian, Devika; Torczon, Linda
Research over the past five years has shown that significant performance improvements are possible using adaptive compilation. An adaptive compiler uses a compile-execute-analyze feedback loop to guide a series of compilations toward some performance goal, such as minimizing execution time. Despite its ability to improve performance, adaptive compilation has not seen widespread use because of two obstacles: the complexity inherent in a feedback-driven adaptive system makes it difficult to build and hard to use, and the large amount of time that the system needs to perform the many compilations and executions prohibits most users from adopting these techniques. We have developed a technique called "virtual execution" to decrease the time requirements for adaptive compilation. Virtual execution runs the program a single time and preserves information that allows us to accurately predict performance with different optimization sequences. This technology significantly reduces the time required by our adaptive compiler. In conjunction with this performance boost, we have developed a graphical user interface (GUI) that provides a controlled view of the compilation process. It limits the amount of information that the user must provide to get started by supplying appropriate defaults. At the same time, it lets the user exert fine-grained control over the parameters that govern the system. In particular, the user has direct and obvious control over the maximum amount of time the compiler can spend, as well as the ability to choose the number of routines to be examined. (The tool uses profiling to identify the most-executed procedures.) The GUI provides an output screen so that the user can monitor the progress of the compilation.
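
As a toy illustration of the idea behind virtual execution as the abstract states it (run the program once, then predict the performance of other optimization sequences from the recorded information), consider predicting run time from basic-block frequencies. The block-frequency cost model below is an illustrative assumption, not ACME's actual machinery.

```python
def profile_once(block_trace):
    # One real execution yields an execution count for every basic block.
    freq = {}
    for b in block_trace:
        freq[b] = freq.get(b, 0) + 1
    return freq

def predicted_time(freq, cycles_per_block):
    # Estimate a variant's run time without re-running it:
    # sum of (recorded frequency) x (per-block cost under that variant).
    return sum(count * cycles_per_block[b] for b, count in freq.items())

freq = profile_once(["entry", "loop", "loop", "loop", "exit"])
print(predicted_time(freq, {"entry": 4, "loop": 10, "exit": 2}))  # 36
print(predicted_time(freq, {"entry": 4, "loop": 7, "exit": 2}))   # 27: cheaper loop body
```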

Item Adaptive compilation and inlining (2006)
Waterman, Todd; Cooper, Keith D.
Adaptive compilation uses a feedback-driven process to leverage additional compilation time into improved executable performance. Previous work on adaptive compilation has demonstrated its benefit at an inter-optimization level. This dissertation investigates the ability of adaptive techniques to improve the performance of individual compiler optimizations. We first examine the ability to use adaptive compilation with current commercial compilers. We use adaptive techniques to find good blocking sizes with the MIPSpro compiler. However, we also observe that current compilers are poorly parameterized for adaptive compilation. We then construct an adaptive inlining system that demonstrates the potential of adaptive compilation to improve individual optimizations. We design the inliner to accept condition strings that determine which call sites are inlined. We develop an adaptive controller for the inliner based on a detailed understanding of the search space that the condition strings provide. Our adaptive inlining system consistently finds good sets of inlining decisions and outperforms static techniques. In addition, we demonstrate the inability of static techniques to provide a universal inlining solution and, hence, the necessity of adaptive inlining. Adaptive inlining demonstrates the capacity of adaptive compilation to improve the performance of a single, carefully designed optimization.

Item Adaptive ordering of code transformations in an optimizing compiler (2005)
Grosul, Alexander; Cooper, Keith D.
It has long been known that the quality of the code produced by an optimizing compiler depends on the ordering of the transformations applied to the code. In this dissertation, we show that the best orderings vary in unpredictable ways according to the properties of the input code and the performance objectives, making adaptation a necessity to obtain the best results. We further identify the most practical techniques to search the spaces of transformation orderings. Our analysis of six exhaustively enumerated subspaces of limited size determines the choice and parameters of the search algorithms described and implemented in this work: random sampling, greedy methods, variations of the stochastic hillclimber, and genetic algorithms. We then apply the search algorithms to the full spaces of all available transformations, which are too big to enumerate. We evaluate the performance and cost of running these algorithms and discuss the tradeoffs between the quality of the discovered orderings and the effort to find them. Stochastic hillclimbers discover effective orderings within approximately 500 evaluations. Compared to a fixed ordering of transformations, they yield 5%-40% improvements for a variety of input programs and performance objectives. To reduce the computational overhead of finding these orderings, we introduce and analyze a novel approach to precise static estimation of the runtime frequencies of basic blocks. Termed "Estimated Virtual Execution", this approach reduces the search time by 40%-60%.
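
A compact sketch of one of the searches evaluated above, a stochastic hillclimber with random restarts over the space of transformation orderings; the pass names and the synthetic evaluate() are stand-ins for a real compile-and-measure step.

```python
import random

PASSES = ["constprop", "dce", "gvn", "licm", "inline", "peel"]

def evaluate(seq):
    # Stand-in for "compile with this ordering and measure the objective";
    # deterministic per sequence, so repeated visits agree.
    rng = random.Random(hash(tuple(seq)))
    return rng.uniform(0.5, 1.5)

def hillclimb(length=10, budget=500, restarts=5):
    best, best_cost = None, float("inf")
    for _ in range(restarts):
        cur = [random.choice(PASSES) for _ in range(length)]
        cur_cost = evaluate(cur)
        for _ in range(budget // restarts):
            nxt = list(cur)
            nxt[random.randrange(length)] = random.choice(PASSES)  # one-pass mutation
            nxt_cost = evaluate(nxt)
            if nxt_cost <= cur_cost:          # accept non-worsening neighbors
                cur, cur_cost = nxt, nxt_cost
        if cur_cost < best_cost:
            best, best_cost = cur, cur_cost
    return best, best_cost

print(hillclimb())
```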

Item An experimental analysis of a set of compiler algorithms (2003)
Harvey, Timothy John; Cooper, Keith D.
The thesis of this dissertation is that experimental analysis in computer science is an essential component of understanding algorithmic behavior. In three different experiments, we compare and contrast well-chosen algorithms and show empirical evidence of performance differences. In all cases herein, the algorithms that were assumed to be the best are shown to have superior, sometimes surprising, alternatives, justifying our thesis. In each case, we developed a set of solutions and evaluated and refined them until we discovered better methods. In the first experiment, we examine different methods for building an interference graph, the pivotal structure of a graph-coloring register allocator. Our results show that a twenty-year-old assumption about the graph's characteristics is flawed, and we show extensive data to explain the best way to build this graph. In the second experiment, we take an O(n²) algorithm for computing dominance and show how to make it run, in general, faster than the commonly used dominator algorithm due to Lengauer and Tarjan, which runs in time nearly linear in the size of the graph (within a factor of the inverse Ackermann function). The dominance work led to our third experiment, which seeks to show the best way to compute iterative data-flow analyses. There is a well-understood theoretical specification, due to Kam and Ullman, that bounds the time necessary to compute specific types of data-flow equations. We compare their algorithm against a simple worklist algorithm that takes its intuition from a study of how Kam and Ullman's algorithm propagates information, and we show that the worklist algorithm is faster in practice. In many ways, the value of this work is in the engineering improvements to necessary and complicated phases of the compiler. (This work is based in part upon work supported by the Texas Advanced Technology Program under Grant No. 003604-015 and by DARPA through Army Contract DABT63-95-C-0115.)

Item An Experimental Evaluation of List Scheduling (1998-09-30)
Cooper, Keith D.; Schielke, Philip; Subramanian, Devika
While altering the scope of instruction scheduling has a rich heritage in the compiler literature, instruction scheduling algorithms have received little coverage in recent times. The widely held belief is that greedy heuristic techniques such as list scheduling are "good" enough for most practical purposes. The evidence supporting this belief is largely anecdotal, with a few exceptions. In this paper we examine some hard evidence in support of list scheduling. To this end, we present two alternatives to list scheduling that use randomization: randomized backward-and-forward list scheduling, and iterative repair. Using these alternative algorithms, we are better able to examine the conditions under which list scheduling performs well and poorly. Specifically, we explore the efficacy of list scheduling in light of the available parallelism, the list-scheduling priority heuristic, and the number of functional units. While the generic list-scheduling algorithm does indeed perform quite well overall, there are important situations that may warrant the use of alternative algorithms.
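
For reference, a compact sketch of the greedy list-scheduling baseline the paper examines, assuming fully pipelined units that each issue one operation per cycle and using the common critical-path priority heuristic; the dependence-graph encoding is illustrative.

```python
def list_schedule(succs, latency, n_units):
    """succs: op -> successor list (a DAG); latency: op -> cycles."""
    preds = {v: set() for v in succs}
    for u in succs:
        for v in succs[u]:
            preds[v].add(u)
    prio = {}
    def critical_path(v):               # latency-weighted longest path to a leaf
        if v not in prio:
            prio[v] = latency[v] + max((critical_path(s) for s in succs[v]), default=0)
        return prio[v]
    for v in succs:
        critical_path(v)
    ready = {v for v in succs if not preds[v]}
    in_flight, schedule, cycle = {}, {}, 0
    while ready or in_flight:
        for v in [v for v, done in in_flight.items() if done <= cycle]:
            del in_flight[v]            # op completes; wake its successors
            for s in succs[v]:
                preds[s].discard(v)
                if not preds[s]:
                    ready.add(s)
        for v in sorted(ready, key=lambda v: -prio[v])[:n_units]:
            ready.remove(v)             # issue the highest-priority ready ops
            schedule[v] = cycle
            in_flight[v] = cycle + latency[v]
        cycle += 1
    return schedule

succs = {"a": ["c"], "b": ["c"], "c": []}
print(list_schedule(succs, {"a": 2, "b": 1, "c": 1}, n_units=2))  # {'a': 0, 'b': 0, 'c': 2}
```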

Item Automatic Tuning of Scientific Applications (2007)
Qasem, Apan; Cooper, Keith D.
Over the last several decades, we have witnessed tremendous change in the landscape of computer architecture. New architectures have emerged at a rapid pace, with computing capabilities that have often exceeded our expectations. However, the rapid rate of architectural innovation has also been a source of major concern for the high-performance computing community. Each new architecture, or even a new model of a given architecture, has brought with it new features that add to the complexity of the target platform. As a result, it has become increasingly difficult to exploit the full potential of modern architectures for complex scientific applications. The gap between the theoretical peak and the actual achievable performance has increased with every step of architectural innovation. As multi-core platforms become more pervasive, this performance gap is likely to grow. To deal with the changing nature of computer architecture and its ever-increasing complexity, application developers laboriously retarget code by hand, which often costs many person-months even for a single application. To address this problem, we developed a software-based strategy that can automatically tune applications to different architectures to deliver portable high performance. This dissertation describes our automatic tuning strategy. Our strategy combines architecture-aware cost models with heuristic search to find the most suitable optimization parameters for the target platform. The key contribution of this work is a novel strategy for pruning the search space of transformation parameters. By focusing on architecture-dependent model parameters instead of the transformation parameters themselves, we show that we can dramatically reduce the size of the search space and still achieve most of the benefits of the best tuning possible with exhaustive search. We present an evaluation of our strategy on a set of scientific applications and kernels on several different platforms. The experimental results presented in this dissertation suggest that our approach can produce significant performance improvement on a range of architectures at a cost that is not overly demanding.
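
A hedged, much-simplified illustration of the pruning idea described above: tune a single architecture-dependent model parameter (an "effective cache fraction") and derive concrete tile sizes from it, instead of searching per-nest tile sizes directly. The model and constants are illustrative, not the dissertation's.

```python
import math

CACHE_BYTES = 32 * 1024                  # assumed L1 data-cache capacity

def tile_size(fraction, elem_bytes=8, arrays_in_nest=3):
    # Choose T so the nest's working set fits in the chosen cache fraction:
    # arrays_in_nest * T^2 * elem_bytes <= fraction * CACHE_BYTES.
    return int(math.sqrt(fraction * CACHE_BYTES / (arrays_in_nest * elem_bytes)))

def measured_time(fraction):
    # Stand-in for compiling with the derived tile sizes and timing the run.
    return abs(fraction - 0.55) + 1.0

# A one-dimensional sweep over the model parameter replaces an N-dimensional
# search over per-nest transformation parameters.
best = min((f / 10 for f in range(1, 10)), key=measured_time)
print(best, tile_size(best))             # e.g. 0.5 and a ~26x26 tile
```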

Item BUBO: An Experiment in Local Spilling for Global Register Allocation (2017-08-15)
Li, Lung; Cooper, Keith D.
This thesis presents the design of the Bubble-Out, Bottom-Up Partial Live-Range Spiller (BUBO), a tool designed to be used as a pre-conditioner for a graph-coloring global register allocator. The job of a register allocator is to decide, at each point in the code, which values should be kept in registers and which should be kept, instead, in memory, and, for the values kept in registers, which register should hold each one. The strength of graph-coloring allocators is that they do an excellent job of assigning registers effectively. Their primary weakness lies in the methods that they use to decide which values cannot be kept in a register, a process called spilling. The idea behind BUBO is to pre-condition the problem: to transform the code into a form for which the graph-coloring allocator will produce better code. BUBO makes the spill decisions and modifies the code to reflect them. It presents the allocator with a version of the program that has, at each point, a demand for registers that can be satisfied by the processor's register set. By building a tool that focuses intensely on spilling, and that builds on and extends the strong spilling techniques developed for local register allocation, BUBO should improve the overall quality of the code produced by a compiler's back end. BUBO extends the local spilling ideas of Best's classic algorithm to a global scope, while retaining a focus on those regions in the code where spilling is actually needed. It extends the notion of distance used in local allocation to include both a forward distance, the distance to the next use, to estimate the cost of restores, and a backward distance, the distance to the previous access, to estimate the effectiveness of spilling. It extends those measures from local scope to global scope by using branch probabilities. It reconciles conflicting local allocation decisions when it processes merge points in the reverse CFG, using a careful analysis to find a solution with low estimated runtime cost. It avoids placing spill code inside loops, unless doing so is necessary, to reduce the execution frequency of the inserted code. As a result, BUBO introduces spills in a way that is sensitive to the demand for registers (it only spills values that are live in regions where the number of such values exceeds the number of hardware registers available), to the structure of the program (it tries to place the inserted operations at points that execute less frequently than the accesses to the value), and to the runtime cost of spill code (it estimates the cost of restores rather than assuming a fixed cost, and it places spills as close to the corresponding definitions as possible, so that their latencies are likely to be hidden by scheduling). During the design of BUBO, several notions were introduced to describe the spilling problem. The key notions are high-pressure regions, spill decision points (SDPs), and restore decision points (RDPs). The notion of a high-pressure region can be used to understand, and thus to evaluate, the effectiveness of a spill. A key insight is that a distance metric for spilling approximates high-pressure-region coverage. The notions of SDPs and RDPs lead to better estimates of spill cost: a spill must happen before an SDP, and a restore can be placed, at the earliest, after an RDP. This insight makes it possible to reason statically, at compile time, about the runtime cost of spills and restores, and thus leads to better spill decisions.
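
The local heuristic that BUBO generalizes is easiest to see over straight-line code. Below is a minimal sketch of Best's-style (Belady-style) spilling, which evicts the value whose next use lies farthest in the future; encoding a block as a plain list of value uses is an illustrative simplification.

```python
def spill_choices(uses, k):
    """Simulate k registers over one block's use list; report evictions."""
    in_regs, spilled = set(), []
    for i, v in enumerate(uses):
        if v in in_regs:
            continue                      # value already resident
        if len(in_regs) == k:
            def next_use(r):              # distance to r's next use, if any
                for j in range(i + 1, len(uses)):
                    if uses[j] == r:
                        return j
                return float("inf")
            victim = max(sorted(in_regs), key=next_use)
            in_regs.remove(victim)        # evict the farthest-next-use value
            spilled.append((victim, i))
        in_regs.add(v)
    return spilled

print(spill_choices(["a", "b", "c", "a", "d", "b"], k=3))  # [('a', 4)]
```

The forward distance here corresponds to the abstract's cost-of-restore estimate; BUBO additionally weights such distances by branch probabilities to carry them across block boundaries.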

Item Büchi Automata as Specifications for Reactive Systems (2013-06-05)
Fogarty, Seth; Vardi, Moshe Y.; Cooper, Keith D.; Nakhleh, Luay K.; Simar, Ray
Computation is employed to incredible success in a massive variety of applications, and yet it is difficult to formally state what our computations are. Finding a way to model computations is not only valuable to understanding them, but central to automatic manipulation and formal verification. Often the most interesting computations are not functions with inputs and outputs, but ongoing systems that continuously react to user input. In the automata-theoretic approach, computations are modeled as words, sequences of letters representing traces of a computation. Each automaton accepts a set of words, called its language. To model reactive computation, we use Büchi automata: automata that operate over infinite words. Although the computations we are modeling are not infinite, they are unbounded, and we are interested in their ongoing properties. For thirty years, Büchi automata have been recognized as the right model for reactive computations. In order to formally verify computations, however, we must also be able to create specifications that embody the properties we want to prove these systems possess. To date, challenging algorithmic problems have prevented Büchi automata from being used as specifications. I address two challenges to the use of Büchi automata as specifications in formal verification. The first, complementation, is required to check program adherence to a specification. The second, determinization, is used in domains such as synthesis, probabilistic verification, and module checking. I present both an empirical analysis of existing complementation constructions and a new theoretical contribution that provides more deterministic complementation and a full determinization construction.

Item Building a Control-flow Graph from Scheduled Assembly Code (2002-02-01)
Cooper, Keith D.; Harvey, Timothy J.; Waterman, Todd
A variety of applications have arisen where it is worthwhile to apply code optimizations directly to the machine code (or assembly code) produced by a compiler. These include link-time whole-program analysis and optimization, code compression, binary-to-binary translation, and bit-transition reduction (for power). Many, if not most, optimizations assume the presence of a control-flow graph (CFG). Compiled, scheduled code has properties that can make CFG construction more complex than it is inside a typical compiler. In this paper, we examine the problems posed by scheduled code on architectures that have multiple delay slots. In particular, if branch delay slots contain other branches, the classic algorithms for building a CFG produce incorrect results. We explain the problem using two simple examples. We then present an algorithm for building correct CFGs from scheduled assembly code that includes branches in branch-delay slots. The algorithm works by building an approximate CFG and then refining it to reflect the actions of delayed branches. If all branches have explicit targets, the complexity of the refining step is linear in the number of branches in the code. Analysis of the kind presented in this paper is a necessary first step for any system that analyzes or translates compiled, assembly-level code. We have implemented this algorithm in our power-consumption experiments based on the TMS320C6200 architecture from Texas Instruments. The development of our algorithm was motivated by the output of TI's compiler.

Item Building Adaptive Compilers (2005-01-29)
Almagor, L.; Cooper, Keith D.; Grosul, Alexander; Harvey, Timothy J.; Reeves, Steven W.; Subramanian, Devika; Torczon, Linda; Waterman, Todd
Traditional compilers treat all programs equally; that is, they apply the same set of techniques to every program that they compile. Compilers that adapt their behavior to fit specific input programs can produce better results. This paper describes our experience building and using adaptive compilers. It presents experimental evidence for two problems on which adaptive behavior can lead to better results: choosing compilation orders and choosing block sizes. It presents data from experimental characterizations of the search spaces in which these adaptive systems operate and describes search algorithms that successfully operate in these spaces. Building these systems has taught us a number of lessons about the construction of modular and reconfigurable compilers. The paper describes some of the problems that we encountered and the solutions that we adopted. It also outlines a number of fertile areas for future research in adaptive compilation.

Item Cluster assignment and instruction scheduling for partitioned register-set machines (2000)
He, Jingsong; Cooper, Keith D.
For half a century, computer architects have been striving to improve uniprocessor performance. Many of their successful designs, such as VLIW and superscalar machines, use multiple functional units to try to exploit instruction-level parallelism in computer programs. As the number of functional units rises, another hardware constraint enters the picture: the number of register-file ports needed grows directly with the number of functional units. At some point, the multiplexing logic on the register ports can come to dominate the processor's cycle time. A reasonable solution is to partition the register file into independent sets and associate each functional unit with a specific register set. Such partitioned register sets have appeared in a number of commercial machines, such as the Texas Instruments TMS320C6xxx DSP chips. Partitioned register-set architectures present a new set of challenges to compiler designers: the compiler must assign each operation to a specific cluster and coordinate data movement between clusters. In this thesis, we investigate five instruction-scheduling methods with different scopes to find one suitable for partitioned register-set architectures. Next, we examine previous algorithms for the combined cluster-assignment and scheduling problem and propose two new algorithms that improve upon the prior art. Then we study the difficulties introduced by a limited number of registers and provide an approach to handle them. Finally, we take several other measurements of partitioned register-set architectures that may shed light on some of the architectural decisions.
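
A small, hedged sketch of a greedy cluster-assignment heuristic of the general kind the thesis studies: place each operation on the cluster that holds most of its operands, break ties toward the least-loaded cluster, and count the cross-cluster copies the choice implies. The details are illustrative, not the thesis's algorithms.

```python
def assign_clusters(ops, n_clusters):
    """ops: list of (name, operand_names) in dependence order."""
    home, load, copies = {}, [0] * n_clusters, 0
    for name, operands in ops:
        votes = [0] * n_clusters
        for o in operands:
            votes[home[o]] += 1           # where do my operands live?
        # most operands first, then the least-loaded cluster
        c = max(range(n_clusters), key=lambda k: (votes[k], -load[k]))
        copies += sum(1 for o in operands if home[o] != c)
        home[name] = c
        load[c] += 1
    return home, copies

ops = [("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c"])]
print(assign_clusters(ops, n_clusters=2))  # ({'a': 0, 'b': 1, 'c': 0, 'd': 0}, 1)
```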

Item Combining analyses, combining optimizations (1995)
Click, Clifford Noel, Jr.; Cooper, Keith D.
This thesis presents a framework for describing optimizations. It shows how to combine two such frameworks and how to reason about the properties of the resulting framework. The structure of the framework provides insight into when a combination yields better results. Also presented is a simple iterative algorithm for solving these frameworks. A framework is shown that combines Constant Propagation, Unreachable Code Elimination, Global Congruence Finding, and Global Value Numbering. For these optimizations, the iterative algorithm runs in O(n²) time. This thesis then presents an O(n log n) algorithm for combining the same optimizations. This technique also finds many of the common subexpressions found by Partial Redundancy Elimination. However, it requires a global code-motion pass, also presented, to make the optimized code correct. The global code-motion algorithm removes some partially dead code as a side effect. An implementation demonstrates that the algorithm has shorter compile times than repeated passes of the separate optimizations while producing run-time speedups of 4%-7%. While global analyses are stronger, peephole analyses can be unexpectedly powerful. This thesis demonstrates parse-time peephole optimizations that find more than 95% of the constants and common subexpressions found by the best combined analysis. Finding constants and common subexpressions while parsing reduces peak intermediate-representation size. This speeds up the later global analyses, reducing total compilation time by 10%. In conjunction with global code motion, these peephole optimizations generate excellent code very quickly, a useful feature for compilers that stress compilation speed over code quality.

Item Compilation Order Matters: Exploring the Structure of the Space of Compilation Sequences Using Randomized Search Algorithms (2004-06-18)
Almagor, L.; Cooper, Keith D.; Grosul, Alexander; Harvey, Timothy J.; Reeves, Steven W.; Subramanian, Devika; Torczon, Linda; Waterman, Todd
Most modern compilers operate by applying a fixed sequence of code optimizations, called a compilation sequence, to all programs. Compiler writers determine a small set of good, general-purpose compilation sequences by extensive hand-tuning over particular benchmarks. The compilation sequence makes a significant difference in the quality of the generated code; in particular, we know that a single universal compilation sequence does not produce the best results over all programs. Three questions arise in customizing compilation sequences: (1) What is the incremental benefit of using a customized sequence instead of a universal sequence? (2) What is the average computational cost of constructing a customized sequence? (3) When does the benefit exceed the cost? We present one of the first empirically derived cost-benefit tradeoff curves for custom compilation sequences. These curves are for two randomized sampling algorithms: descent with randomized restarts and genetic algorithms. They demonstrate the dominance of these two methods over simple random sampling in sequence spaces where the probability of finding a good sequence is very low. Further, these curves allow compilers to decide whether custom sequence generation is worthwhile by explicitly relating the computational effort required to obtain a program-specific sequence to the incremental improvement in the quality of the code generated by that sequence.

Item Digital computer register allocation and code spilling using interference graph coloring (1993-09-28)
Briggs, Preston P.; Cooper, Keith D.; Kennedy, Kenneth W., Jr.; Torczon, Linda M.; Rice University; United States Patent and Trademark Office
A method is disclosed for allocating internal machine registers in a digital computer for use in storing values defined and referenced by a computer program. An allocator in accordance with the present invention constructs an interference graph having a node for the live range of each value defined by the computer program, and an edge between every two nodes whose associated live ranges interfere with each other. The allocator models the register-allocation process as a graph-coloring problem, such that, for a computer having R registers, the allocator of the present invention iteratively attempts to R-color the interference graph. The interference graph is colored to the extent possible on each iteration before a determination is made that one or more live ranges must be spilled. After spill code has been added to the program to transform spilled live ranges into multiple smaller live ranges, the allocator constructs a new interference graph and the process is repeated.
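
A brief sketch of the simplify-then-select coloring loop at the heart of the allocator the patent describes, with Briggs-style optimism: nodes are removed from the graph (trivially colorable ones first) and colored in reverse removal order; a node that finds no free color becomes a spill candidate. Spill-code insertion and the rebuild loop are omitted, and the names are illustrative.

```python
def r_color(interference, R):
    """interference: node -> set of neighbors. Returns (coloring, spill list)."""
    graph = {n: set(adj) for n, adj in interference.items()}
    work, stack = set(graph), []
    while work:
        # simplify: prefer a node with degree < R among the remaining nodes
        n = next((n for n in sorted(work) if len(graph[n] & work) < R), None)
        if n is None:                     # none is trivially colorable:
            n = max(sorted(work), key=lambda m: len(graph[m] & work))
        work.remove(n)                    # push optimistically; may spill later
        stack.append(n)
    coloring, spills = {}, []
    for n in reversed(stack):             # select: color in reverse removal order
        used = {coloring[m] for m in graph[n] if m in coloring}
        free = [c for c in range(R) if c not in used]
        if free:
            coloring[n] = free[0]
        else:
            spills.append(n)              # would force spill code and a rebuild
    return coloring, spills

ig = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": set()}
print(r_color(ig, R=2))                   # ({'c': 0, 'b': 1, 'd': 0}, ['a'])
```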

Item Dynamically reconfigurable data caches in low-power computing (2003)
Brogioli, Michael C.; Cooper, Keith D.
In order to curb microprocessor power consumption, we propose an L1 data cache that can be reconfigured dynamically at runtime according to the cache requirements of a given application. A two-phase approach is used, involving both compile-time information and the runtime monitoring of program performance. The compiler predicts the L1 data-cache requirements of loop nests in the input program and instructs the hardware on how much L1 data cache to enable during a loop nest's execution. For regions of the program not analyzable at compile time, the hardware itself monitors program performance and reconfigures the L1 data cache so as to maintain cache performance while minimizing cache power consumption. In addition, we provide a study of data reuse inside the loop nests of the SPEC CPU2000 and MediaBench benchmarks. The sensitivity of data reuse to L1 data-cache associativity is analyzed to illustrate the potential power savings a reconfigurable L1 data cache can achieve.

Item Foundations for Automatic, Adaptable Compilation (2011)
Sandoval, Jeffrey Andrew; Cooper, Keith D.
Computational science demands extreme performance because the running time of an application often determines the size of the experiment that a scientist can reasonably compute. Unfortunately, traditional compiler technology is ill-equipped to harness the full potential of today's computing platforms, forcing scientists to spend time manually tuning their application's performance. Although improving compiler technology should alleviate this problem, two challenges obstruct this goal: hardware platforms are rapidly changing, and application software is difficult to statically model and predict. To address these problems, this thesis presents two techniques that aim to improve a compiler's adaptability: automatic resource characterization and selective, dynamic optimization. Resource characterization empirically measures a system's performance-critical characteristics, which can be provided to a parameterized compiler that specializes programs accordingly. Measuring these characteristics is important because a system's physical characteristics do not always match its observed characteristics. Consequently, resource characterization provides an empirical performance model of a system's actual behavior, which is better suited to guiding compiler optimizations than a purely theoretical model. This thesis presents techniques for determining a system's data-cache and TLB capacity, line size, and associativity, as well as its instruction-cache capacity. Even with a perfect architectural model, compilers will still often generate suboptimal code because of the difficulty of statically analyzing and predicting a program's behavior. This thesis presents two techniques that enable selective, dynamic optimization for cases in which static compilation fails to deliver adequate performance. First, intermediate-representation (IR) annotation generates a fully optimized native binary tagged with a higher-level compiler representation of itself. The native binary benefits from static optimization and code generation, while the IR annotation allows targeted and aggressive dynamic optimization. Second, adaptive code selection allows a program to empirically tune its performance throughout execution by automatically identifying and favoring the best-performing variant of a routine. This technique can be used to choose dynamically among different static-compilation strategies, or it can be combined with IR annotation to perform dynamic, feedback-directed optimization.

Item Global register allocation using program structure (2005)
Eckhardt, Jason; Cooper, Keith D.
The Chaitin-Briggs approach to register allocation by graph coloring is the dominant method used in industrial and research compilers. It usually produces highly efficient allocations, but it sometimes exhibits pathological spilling behavior, so that some programs execute significantly more spill operations than necessary. This thesis examines and improves two previously proposed approaches to attacking this problem. Passive splitting attempts a lazy form of live-range splitting that can substantially reduce the dynamic spill count compared to Chaitin-Briggs. We incorporate program structure into the passive-splitting framework to better guide splitting decisions and to place splits in infrequently executed regions of the code. Also investigated is the hierarchical graph-coloring approach, which uses program structure during allocation. We provide an empirical evaluation of this poorly understood algorithm and propose some improvements.

Item Grid-centric scheduling strategies for workflow applications (2010)
Zhang, Yang; Cooper, Keith D.
Grid computing faces a great challenge because its resources are not localized, but distributed, heterogeneous, and dynamic. Thus, it is essential to provide a set of programming tools that execute an application on the Grid's resources with as little input from the user as possible. The thesis of this work is that Grid-centric scheduling techniques for workflow applications can provide good usability of the Grid environment by reliably executing the application on a large-scale distributed system with good performance. We support our thesis with new and effective approaches in the following five aspects. First, we modeled the performance of the existing scheduling approaches in a multi-cluster Grid environment. We implemented several widely used scheduling algorithms and identified the best candidate. The study further introduced a new measurement, based on our experiments, that can improve the schedule quality of some scheduling algorithms as much as 20-fold in a multi-cluster Grid environment. Second, we studied the scalability of the existing Grid scheduling algorithms. To deal with Grid systems consisting of hundreds of thousands of resources, we designed and implemented a novel approach that performs explicit resource selection decoupled from scheduling. Our experimental evaluation confirmed that our decoupled approach can scale in such an environment without sacrificing more than 10% of schedule quality. Third, we proposed solutions that address the dynamic nature of Grid computing with a new cluster-based hybrid scheduling mechanism. Our experimental results, collected from real executions on production clusters, demonstrated that this approach produces programs that run 30% to 100% faster than those produced by the other scheduling approaches we implemented, on both reserved and shared resources. Fourth, we improved the reliability of Grid computing by incorporating fault-tolerance and recovery mechanisms into workflow-application execution. Our experiments on a simulated multi-cluster Grid environment demonstrated the effectiveness of our approach and also characterized the three-way trade-off between reliability, performance, and resource usage when executing a workflow application. Finally, we addressed the long batch-queue wait times often found in production Grid clusters. We developed a novel approach that partitions the workflow application and submits the pieces judiciously to achieve a lower total batch-queue wait time. Experimental results derived from production-site batch-queue logs show that our approach can reduce the total wait time by as much as 70%. Combined, our approaches can greatly improve the usability of Grid computing while increasing the performance of workflow applications in a multi-cluster Grid environment.