R-3 Repository :: Browsing by Author "Kapasi, Ujval J."

A Bandwidth-Efficient Architecture for Media Processing
(1998-11-20) Rixner, Scott; Dally, William J.; Kapasi, Ujval J.; Khailany, Brucek; Lopez-Lagunas, Abelardo; Mattson, Peter; Owens, John D.
Media applications are characterized by large amounts of available parallelism, little data reuse, and a high ratio of computation to memory access. While these characteristics are poorly matched to conventional microprocessor architectures, they are a good fit for modern VLSI technology with its high arithmetic capacity but limited global bandwidth. The stream programming model, in which an application is coded as streams of data records passing through computation kernels, exposes both parallelism and locality in media applications that can be exploited by VLSI architectures. The Imagine architecture supports the stream programming model by providing a bandwidth hierarchy tailored to the demands of media applications. Compared to a conventional scalar processor, Imagine reduces the global register and memory bandwidth required by typical applications by factors of 13 and 21, respectively. This bandwidth efficiency enables a single-chip Imagine processor to achieve a peak performance of 16.2 GFLOPS (single-precision floating point) and sustained performance of up to 8.5 GFLOPS on media processing kernels.
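As a rough illustration of the stream programming model described in this abstract, the sketch below passes a stream of records through two chained kernels. It is not Imagine code: the kernel names (`kernel_scale`, `kernel_clamp`), the `run_pipeline` helper, and the use of Python generators as streams are all assumptions made for illustration.

```python
# Minimal sketch of "streams of records passing through kernels".
# All names here are hypothetical; this is not the Imagine toolchain.

def kernel_scale(stream, gain):
    """Kernel: multiply every record in the input stream by a gain."""
    for record in stream:
        yield record * gain


def kernel_clamp(stream, lo, hi):
    """Kernel: clamp every record in the input stream to [lo, hi]."""
    for record in stream:
        yield max(lo, min(hi, record))


def run_pipeline(pixels):
    """Chain the two kernels; records flow through one at a time."""
    scaled = kernel_scale(iter(pixels), 2)
    clamped = kernel_clamp(scaled, 0, 255)
    return list(clamped)


if __name__ == "__main__":
    print(run_pipeline([10, 100, 200]))
```

Each record is consumed by the next kernel as soon as it is produced, so the intermediate stream never has to be written out in full. That producer-consumer locality is the property the abstract says a bandwidth hierarchy can exploit.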
Imagine: Media Processing with Streams
(2001-03-20) Khailany, Brucek; Dally, William J.; Kapasi, Ujval J.; Mattson, Peter; Namkoong, Jinyung; Owens, John D.; Towles, Brian; Chang, Andrew; Rixner, Scott
The power-efficient Imagine stream processor achieves performance densities comparable to those of special-purpose embedded processors. Executing programs mapped to streams and kernels, a single Imagine processor is expected to have a peak performance of 20 GFLOPS and sustain 18.3 GOPS on MPEG-2 encoding.
Memory Access Scheduling
(2000-06-20) Rixner, Scott; Dally, William J.; Kapasi, Ujval J.; Mattson, Peter; Owens, John D.
The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the "3-D" structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.
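The reordering idea in this abstract can be sketched with a toy DRAM model. This is not the paper's scheduler: the cycle costs, the `Ref` record, and the greedy "oldest row hit first" policy are assumptions chosen for illustration, and the sketch ignores ordering constraints between reads and writes to the same address.

```python
from collections import namedtuple

# Hypothetical reference record: which bank, row, and column is accessed.
Ref = namedtuple("Ref", ["bank", "row", "col"])

# Assumed costs, for illustration only.
ROW_HIT_CYCLES = 1   # column access to the row already open in the bank
ROW_MISS_CYCLES = 7  # precharge + activate + column access


def service(refs_in_order):
    """Total cycles to service the references in the given order."""
    open_row = {}  # bank -> currently open row
    cycles = 0
    for ref in refs_in_order:
        if open_row.get(ref.bank) == ref.row:
            cycles += ROW_HIT_CYCLES
        else:
            cycles += ROW_MISS_CYCLES
            open_row[ref.bank] = ref.row
    return cycles


def reorder_row_hit_first(refs):
    """Greedy reordering: take the oldest pending reference that hits an
    open row; if none hits, take the oldest pending reference."""
    pending = list(refs)
    open_row = {}
    order = []
    while pending:
        pick = next(
            (r for r in pending if open_row.get(r.bank) == r.row),
            pending[0],
        )
        pending.remove(pick)
        open_row[pick.bank] = pick.row
        order.append(pick)
    return order


if __name__ == "__main__":
    trace = [Ref(0, 0, 0), Ref(0, 1, 0), Ref(0, 0, 1), Ref(0, 1, 1)]
    print(service(trace), service(reorder_row_hit_first(trace)))
```

With the assumed costs, the example trace alternates between two rows of one bank, so every in-order access is a row miss. Grouping references to the same row turns half of them into row hits.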
Programmable Stream Processors
(2003-08-20) Kapasi, Ujval J.; Rixner, Scott; Dally, William J.; Khailany, Brucek; Ahn, Jung Ho; Mattson, Peter; Owens, John D.; CITI (http://citi.rice.edu/)
Stream processing promises to bridge the gap between inflexible special-purpose solutions and current programmable architectures that cannot meet the computational demands of media-processing applications.
Register Organization for Media Processing
(2000-01-20) Rixner, Scott; Dally, William J.; Khailany, Brucek; Mattson, Peter; Kapasi, Ujval J.; Owens, John D.
Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis, and image understanding, require arithmetic rates of up to 10^11 operations per second. As the number of arithmetic units in a processor increases to meet these demands, register storage and communication between the arithmetic units dominate the area, delay, and power of the arithmetic units. In this paper we show that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance. We develop a taxonomy of register architectures by partitioning across the data-parallel, instruction-level parallel, and memory hierarchy axes, and by optimizing the hierarchical register organization to operate on streams of data. Compared to a centralized global register file, the most compact of these organizations reduces the register file area, delay, and power dissipation of a media processor by factors of 195, 20, and 430, respectively. This reduction in cost is achieved with a performance degradation of only 8% on a representative set of media processing benchmarks.
