Browsing by Author "Ding, Chen"
Item Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse (2000) Ding, Chen; Kennedy, Ken
While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfer and the second communicates useless data. Both waste memory bandwidth. This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data and then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. In order to carry out this strategy to its full extent, this research has developed a set of compiler transformations that perform computation fusion and data grouping over the whole program and during the entire execution. The major new techniques and their unique contributions are:
Maximal loop fusion: an algorithm that achieves maximal fusion among all program statements and bounded reuse distance within a fused loop.
Inter-array data regrouping: the first to selectively group global data structures and to do so with guaranteed profitability and compile-time optimality.
Locality grouping and dynamic packing: the first set of compiler-inserted and compiler-optimized computation and data transformations at run time.
These optimizations have been implemented in a research compiler and evaluated on real-world applications on SGI Origin2000. The result shows that, on average, the new strategy eliminates 41% of memory loads in regular applications and 63% in irregular and dynamic programs. As a result, the overall execution time is shortened by 12% to 77%. In addition to compiler optimizations, this research has developed a performance model and designed a performance tool. The former allows precise measurement of the memory bandwidth bottleneck; the latter enables effective user tuning and accurate performance prediction for large applications: neither goal was achieved before this thesis.
Item Improving Effective Bandwidth through Compiler Enhancement of Global and Dynamic Cache Reuse (2000-01-21) Ding, Chen
While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfer and the second communicates useless data. Both waste memory bandwidth. This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data and then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest.
In order to carry out this strategy to its full extent, this research has developed a set of compiler transformations that perform computation fusion and data grouping over the whole program and during the entire execution. The major new techniques and their unique contributions are:
Maximal loop fusion: an algorithm that achieves maximal fusion among all program statements and bounded reuse distance within a fused loop.
Inter-array data regrouping: the first to selectively group global data structures and to do so with guaranteed profitability and compile-time optimality.
Locality grouping and dynamic packing: the first set of compiler-inserted and compiler-optimized computation and data transformations at run time.
These optimizations have been implemented in a research compiler and evaluated on real-world applications on SGI Origin2000. The result shows that, on average, the new strategy eliminates 41% of memory loads in regular applications and 63% in irregular and dynamic programs. As a result, the overall execution time is shortened by 12% to 77%. In addition to compiler optimizations, this research has developed a performance model and designed a performance tool. The former allows precise measurement of the memory bandwidth bottleneck; the latter enables effective user tuning and accurate performance prediction for large applications: neither goal was achieved before this thesis.
Item Resource Constrained Loop Fusion (2003-09-03) Ding, Chen; Kennedy, Ken
Embedded processors have limited on-chip memory. Fusing loops that use the same data can reduce the distance between accesses to the same memory location, avoiding costly off-chip memory transfer. Most existing greedy fusion algorithms solve the unconstrained problem: they do not guard against negative effects of excessive fusion. When a large program contains a great number of loops, unconstrained fusion may generate huge loops that overflow on-chip memory, leading to lower performance.
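As a rough illustration of how fusion shortens the distance between accesses to the same location, the Python sketch below compares the access trace of two separate loops over an array with the trace of a single fused loop. It is not taken from the paper: the trace representation and the helper names `reuse_distances`, `unfused_trace`, and `fused_trace` are assumptions made for this example, and reuse distance is taken to mean the number of distinct locations touched between two accesses to the same location.

```python
def reuse_distances(trace):
    """For each repeated access, count the distinct locations touched
    since the previous access to the same location."""
    last_pos = {}
    distances = []
    for i, loc in enumerate(trace):
        if loc in last_pos:
            between = trace[last_pos[loc] + 1:i]
            distances.append(len(set(between)))
        last_pos[loc] = i
    return distances


def unfused_trace(n):
    # Two separate loops: the first reads a[0..n-1], the second reads it again.
    first_loop = [("a", i) for i in range(n)]
    second_loop = [("a", i) for i in range(n)]
    return first_loop + second_loop


def fused_trace(n):
    # One fused loop: both uses of a[i] happen in the same iteration.
    trace = []
    for i in range(n):
        trace.append(("a", i))
        trace.append(("a", i))
    return trace


n = 8
max_unfused = max(reuse_distances(unfused_trace(n)))
max_fused = max(reuse_distances(fused_trace(n)))
print(max_unfused, max_fused)
```

In the unfused trace every reuse is separated by the rest of the array, so the distance grows with the array size; in the fused trace the reuse is immediate. A memory that holds fewer than `n` elements would therefore miss on every reuse in the first case and hit in the second.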
This paper studies the problem of constrained weighted fusion, in which the graph edges carry weights indicating the profitability of fusing the inputs and vertices are annotated with resource requirements. The optimal solution of a constrained weighted fusion problem is a collection of vertex sets such that the total weight associated with pairs of vertices within clusters is maximized and the aggregate resource requirement of every cluster is less than a fixed upper bound R. Finding the optimal solution to a weighted fusion problem (constrained or unconstrained) is NP-complete, so we use heuristics. We present two methods. The first picks a group of loops at each fusion step. To ease the resource calculation and fusibility test, the second method picks only a pair of candidate loops at each step. The paper presents the two algorithms, their complexity, and an experimental evaluation.
Item Scalability and Data Placement on SGI Origin (1998-04-01) Chauhan, Arun; Ding, Chen; Sheraw, Berry
Cache-coherent non-uniform memory access (ccNUMA) architectures have attracted considerable academic and industry interest as a promising direction for large-scale parallel computing. Data placement has been used as a major optimization method on such machines. This study examined the scalability and the effect of data placement on a state-of-the-art ccNUMA machine, SGI Origin, using 16 processors. Three applications from SPLASH-2 are used: FFT, Radix, and Barnes-Hut. The results showed that FFT and Radix cannot scale to 16 processors with 70% efficiency even for the largest data sizes tested. Barnes-Hut does not scale for small data sizes but scales linearly for large input sizes. The results also showed that data placement makes no difference to performance for any of the three applications. We attribute these results to the effect of the advanced uni-processor used on the Origin, R10K, the optimizing compiler, and the aggressive communication architecture.
Some of our results are quite different from the predictions of two recent simulation studies on directory-based ccNUMA machines (Holt:ISCA96) and (Pai:HPCA97), especially on FFT. These differences are partly due to the fact that the machine models used in previous simulation studies differ from the Origin machine in some important respects. Our results also include data sizes that are larger than those in any of the previous simulation studies. To increase our confidence in the latency numbers and data placement tools, we also measured memory latencies using micro-benchmarks.
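The constrained weighted fusion problem described in the Resource Constrained Loop Fusion abstract above can be sketched in a few lines of Python. This is only an illustration of a pairwise greedy heuristic under stated assumptions, not the algorithm from the paper: the function name `greedy_pairwise_fusion`, the loop names, and the example weights are hypothetical; the resource need of a merged cluster is assumed to be the sum of its members' needs; and fusibility tests (such as dependence checks) are omitted entirely.

```python
from itertools import combinations


def greedy_pairwise_fusion(resources, weights, bound):
    """Greedily merge the most profitable pair of clusters at each step.

    resources: dict mapping loop name -> resource requirement
    weights:   dict mapping frozenset({u, v}) -> profit of fusing u and v
    bound:     upper bound R; a cluster's total need must stay below it
    """
    clusters = [frozenset([v]) for v in resources]

    def need(cluster):
        # Assumption: resource requirements simply add up within a cluster.
        return sum(resources[v] for v in cluster)

    def gain(c1, c2):
        # Total edge weight that becomes internal if c1 and c2 are merged.
        return sum(weights.get(frozenset((u, v)), 0) for u in c1 for v in c2)

    while True:
        best = None
        for c1, c2 in combinations(clusters, 2):
            g = gain(c1, c2)
            if g > 0 and need(c1) + need(c2) < bound:
                if best is None or g > best[0]:
                    best = (g, c1, c2)
        if best is None:
            return clusters  # no profitable merge fits under the bound
        _, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]


# Hypothetical example: four loops in a chain, with resource bound R = 6.
resources = {"L1": 2, "L2": 2, "L3": 3, "L4": 1}
weights = {
    frozenset(("L1", "L2")): 5,
    frozenset(("L2", "L3")): 4,
    frozenset(("L3", "L4")): 3,
}
result = greedy_pairwise_fusion(resources, weights, bound=6)
print(sorted(sorted(c) for c in result))
```

In this example the heuristic first fuses L1 with L2, then rejects adding L3 to that cluster because the combined need would reach the bound, and fuses L3 with L4 instead. Being greedy, it carries no optimality guarantee, which matches the abstract's point that heuristics are used because the exact problem is intractable.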