Browsing by Author "Rajagopal, Sridhar"
Item Arithmetic Acceleration Techniques for Wireless Communication Receivers (1999-10-20) Das, Suman; Rajagopal, Sridhar; Sengupta, Chaitali; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
We develop techniques to accelerate the implementation of the next generation wireless communication algorithms in hardware. We discuss an implementation of a key computationally intensive baseband algorithm for joint multiuser channel estimation and detection for this purpose and study its real-time requirements. An analysis of the bottlenecks present in the algorithm is made. We present an acceleration technique using task decomposition to take advantage of the existing pipelining and parallelism flow in the algorithm. We show that an application specific system design with multiple processing elements is more effective than the conventional single processor approach as it can satisfy the high data rate requirements of the next generation wireless communication systems. Our analysis is done independent of the final mapping of the processing elements in hardware.

Item Baseband Architecture Design for Future Wireless Base-Station Receivers (2000-05-20) Rajagopal, Sridhar; Center for Multimedia Communications (http://cmc.rice.edu/)
This thesis demonstrates designing efficient algorithms and architectures to meet the real-time requirements of future wireless base-station receivers. Next generation receivers require orders-of-magnitude performance improvements in order to provide support for features such as Multimedia, Quality-Of-Service and extremely high data rates. The sophisticated, compute-intensive algorithms proposed to integrate these features make their real-time implementation difficult on current DSP-based receivers. A real-time implementation can be achieved by (1) making the algorithms computationally efficient, without significant loss in error rate performance, (2) task partitioning, and (3)
designing hardware to exploit available pipelining, parallelism and bit-level computations. Multiuser Channel Estimation and Detection, two of the most compute-intensive baseband tasks in the receiver, are studied on DSPs for performance evaluation. A reduced complexity iterative channel estimation scheme for slow fading channels is proposed for a fixed point, area-time efficient and real-time VLSI architecture. The multiuser detection algorithm is modified for a simple, pipelined structure. A GPP or DSP based architecture with reconfigurable support suited for wireless communications is proposed and extensions are developed to accelerate the implementation of wireless communication algorithms.

Item Baseband architecture design for future wireless base-station receivers (2000) Rajagopal, Sridhar; Cavallaro, Joseph R.
This thesis demonstrates the use of designing efficient algorithms and architectures to meet the real-time requirements of future wireless base-station receivers. Next generation receivers will require orders-of-magnitude performance improvements in order to provide support for features such as Multimedia, Quality-Of-Service and extremely high data rates. The sophisticated, compute-intensive algorithms proposed to integrate these features make their real-time implementation difficult on current Digital Signal Processor (DSP)-based receivers. A real-time implementation can be achieved by (1) making the algorithms computationally efficient, without significant loss in error rate performance, (2) task partitioning and (3) designing hardware to exploit available pipelining, parallelism and bit-level computations. Multiuser Channel Estimation and Detection, two of the most compute-intensive baseband tasks in the receiver, are implemented on DSPs for performance evaluation. A reduced complexity iterative channel estimation scheme for slow fading channels is proposed for a fixed point, area-time efficient and real-time VLSI architecture.
The multiuser detection algorithm is modified for a simple, pipelined structure. A General Purpose Processor (GPP) or DSP based architecture with reconfigurable support suited for different wireless communication standards is proposed and extensions are developed to accelerate the implementation of wireless communication algorithms.

Item A bit-streaming pipelined multiuser detector for wireless communications (2001-05-20) Rajagopal, Sridhar; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
This paper presents a bit-streaming, pipelined and reduced complexity architecture to meet real-time requirements for asynchronous multiuser detection in wireless communication CDMA receivers. Typically, asynchronous multiuser detection involves multishot detection, which involves block-based computations and matrix inversions. Hence, iterative based suboptimal schemes have been studied to decrease the computational complexity and eliminate the need for matrix inversions. However, we show that such low-complexity schemes can have an added advantage of avoiding multishot detection if they start from a matched filter estimate. The stages of the iteration can be pipelined and bits processed in a streaming fashion. We show that such an implementation scheme reduces the latency of the bits by the detection window length D and eliminates the storage requirements for block computation, which helps in DSP implementations. We also avoid edge-bit computation effects, which reduces the computation by 2/D per detection stage. This scheme also results in a simple, bit-streaming and pipelined architecture.
DSP simulations show that data rates of 800 Kbps for a single user to 50 Kbps for 32 users can be processed in real-time with additional FPGAs in a pipelined fashion for a spreading gain of 31, giving at least a 4X speedup over a single DSP implementation.

Item Communication Processors (Wiley, 2005-07-01) Rajagopal, Sridhar; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
Communication processors are processors with specific optimizations to support communication systems. Communication processors exist in a wide variety of forms and can be categorized based on the communication system, such as wired or wireless and based on the layer in the communication system, such as the physical layer, the medium access control layer or the network layer. Communication processors can be further categorized based on the application, such as audio, video or data and the end system requiring the communication system such as a laptop, a cell phone or a personal computer. In this book chapter, we present a brief outline of the different types of communication processors and the need and requirements of each of these processors. However, we will focus on the challenges in the physical layer design in communication processors with the increase in data rates, increase in algorithm complexity, need for flexibility to adapt to different protocols and environments, need to optimize over varying constraints such as area, power, performance and the need for supporting multiple interfaces, devices and applications.

Item Data-parallel digital signal processors: Algorithm mapping, architecture scaling and workload adaptation (2004) Rajagopal, Sridhar; Cavallaro, Joseph R.
Emerging applications such as high definition television (HDTV), streaming video, image processing in embedded applications and signal processing in high-speed wireless communications are driving a need for high performance digital signal processors (DSPs) with real-time processing.
This class of applications demonstrates significant data parallelism, finite precision, need for power-efficiency and the need for 100's of arithmetic units in the DSP to meet real-time requirements. Data-parallel DSPs meet these requirements by employing clusters of functional units, enabling 100's of computations every clock cycle. These DSPs exploit instruction level parallelism and subword parallelism within clusters, similar to a traditional VLIW (Very Long Instruction Word) DSP, and exploit data parallelism across clusters, similar to vector processors. Stream processors are data-parallel DSPs that use a bandwidth hierarchy to support dataflow to 100's of arithmetic units and are used for evaluating the contributions of this thesis. Different software realizations of the dataflow in the algorithms can affect the performance of stream processors by greater than an order-of-magnitude. The thesis first presents the design of signal processing algorithms that map efficiently on stream processors by parallelizing the algorithms and by re-ordering the flow of data. The design space for stream processors also exhibits trade-offs between arithmetic units per cluster, clusters and the clock frequency to meet the real-time requirements of a given application. This thesis provides a design space exploration tool for stream processors that meets real-time requirements while minimizing power consumption. The presented exploration methodology rapidly searches this design space at compile time to minimize power consumption and selects the number of adders, multipliers, clusters and the real-time clock frequency in the processor. Finally, the thesis improves the power efficiency in the designed stream processor by adapting the compute resources to run-time variations in the workload. The thesis presents an adaptive multiplexer network that allows the number of active clusters to be varied during run-time by turning off unused clusters. 
Thus, by efficient mapping of algorithms, exploring the architecture design space, and by compute resource adaptation, this thesis improves power efficiency in stream processors and enhances their suitability for high performance, power-aware, signal processing applications.

Item Data-parallel Digital Signal Processors: Algorithm Mapping, Architecture Scaling and Workload Adaptation (2004-05-01) Rajagopal, Sridhar; Center for Multimedia Communications (http://cmc.rice.edu/)
Emerging applications such as high definition television (HDTV), streaming video, image processing in embedded applications and signal processing in high-speed wireless communications are driving a need for high performance digital signal processors (DSPs) with real-time processing. This class of applications demonstrates significant data parallelism, finite precision, need for power-efficiency and the need for 100's of arithmetic units in the DSP to meet real-time requirements. Data-parallel DSPs meet these requirements by employing clusters of functional units, enabling 100's of computations every clock cycle. These DSPs exploit instruction level parallelism and subword parallelism within clusters, similar to a traditional VLIW (Very Long Instruction Word) DSP, and exploit data parallelism across clusters, similar to vector processors. Stream processors are data-parallel DSPs that use a bandwidth hierarchy to support dataflow to 100's of arithmetic units and are used for evaluating the contributions of this thesis. Different software realizations of the dataflow in the algorithms can affect the performance of stream processors by greater than an order-of-magnitude. The thesis first presents the design of signal processing algorithms that map efficiently on stream processors by parallelizing the algorithms and by re-ordering the flow of data.
The design space for stream processors also exhibits trade-offs between arithmetic units per cluster, clusters and the clock frequency to meet the real-time requirements of a given application. This thesis provides a design space exploration tool for stream processors that meets real-time requirements while minimizing power consumption. The presented exploration methodology rapidly searches this design space at compile time to minimize power consumption and selects the number of adders, multipliers, clusters and the real-time clock frequency in the processor. Finally, the thesis improves the power efficiency in the designed stream processor by adapting the compute resources to run-time variations in the workload. The thesis presents an adaptive multiplexer network that allows the number of active clusters to be varied during run-time by turning off unused clusters. Thus, by efficient mapping of algorithms, exploring the architecture design space, and by compute resource adaptation, this thesis improves power efficiency in stream processors and enhances their suitability for high performance, power-aware, signal processing applications.

Item Design space exploration for real-time embedded stream processors (2004-07-01) Rajagopal, Sridhar; Cavallaro, Joseph R.; Rixner, Scott; Center for Multimedia Communications (http://cmc.rice.edu/)
We present a design framework for rapidly exploring the design space for stream processors in real-time embedded systems. Stream processors enable hundreds of arithmetic units in programmable processors by using clusters of functional units. However, to meet a certain real-time requirement for an embedded system, there is a trade-off between the number of arithmetic units in a cluster, number of clusters and the clock frequency as each solution meets real-time with a different power consumption.
We have developed a design exploration tool that explores this trade-off and presents a heuristic that minimizes the power consumption in the (functional units, clusters, frequency) design space. Our design methodology relates the instruction level parallelism, subword parallelism and data parallelism to the organization of the functional units in an embedded stream processor. We show that the power minimization methodology also provides insights into the functional unit utilization of the processor. The design exploration tool exploits the static nature of signal processing workloads, providing an extremely fast design space exploration and provides an initial lower bound estimate of the real-time performance of the embedded processor. A sensitivity analysis of the design tool results to the technology and modeling also enables the designer to check the robustness of the design exploration.

Item DSP architectural considerations for optimal baseband processing (2002-08-20) Rajagopal, Sridhar; Rixner, Scott; Cavallaro, Joseph R.; Aazhang, Behnaam; Center for Multimedia Communications (http://cmc.rice.edu/)
The data rate requirements for future wireless systems have increased by orders-of-magnitude (from Kbps to several Mbps), requiring more sophisticated algorithms for their implementation. This tutorial will explore different architectural issues to consider for optimal wireless baseband processing.
It will look at research into real-time architectural design issues such as number of functional units, data access from memory and sequential traceback for Viterbi decoding using digital signal processors.

Item Efficient VLSI Architectures for Baseband Signal Processing for Wireless Base-Station Receivers (2000-07-20) Rajagopal, Sridhar; Bhashyam, Srikrishna; Cavallaro, Joseph R.; Aazhang, Behnaam; Center for Multimedia Communications (http://cmc.rice.edu/)
A real-time VLSI architecture is designed for multiuser channel estimation, one of the core baseband processing operations in wireless base-station receivers. Future wireless base-station receivers will need to use sophisticated algorithms to support extremely high data rates and multimedia. Current DSP architectures are unable to fully exploit the parallelism and bit level arithmetic present in these algorithms. These features can be revealed and efficiently implemented by task partitioning the algorithms for a VLSI solution. We modify the channel estimation algorithm for a reduced complexity fixed-point hardware implementation. We show the complexity and hardware required for three different area-time tradeoffs: an area-constrained, a time-constrained and an area-time efficient architecture. The area-constrained architecture achieves low data rates with minimum hardware, which may be used in picocell base-stations. The time-constrained solution exploits the entire available parallelism and determines the maximum theoretical data rates. The area-time efficient architecture meets real-time requirements with minimum area overhead. The orders-of-magnitude difference between area and time constrained solutions reveals significant inherent parallelism in the algorithm.
All proposed VLSI solutions exhibit better time performance than a previous DSP implementation.

Item Efficient VLSI architectures for multiuser channel estimation in wireless base-station receivers (Kluwer Academic Publishers, 2002-06-20) Rajagopal, Sridhar; Bhashyam, Srikrishna; Cavallaro, Joseph R.; Aazhang, Behnaam; Center for Multimedia Communications (http://cmc.rice.edu/)
This paper presents a reduced-complexity, fixed-point algorithm and efficient real-time VLSI architectures for multiuser channel estimation, one of the core baseband processing operations in wireless base-station receivers for CDMA. Future wireless base-station receivers will need to use sophisticated algorithms to support extremely high data rates and multimedia. Current DSP implementations of these algorithms are unable to meet real-time requirements. However, there is massive parallelism and bit level arithmetic present in these algorithms that can be revealed and efficiently implemented in a VLSI architecture. We re-design an existing channel estimation algorithm from an implementation perspective for a reduced complexity, fixed-point hardware implementation. Fixed point simulations are presented to evaluate the precision requirements of the algorithm. A dependence graph of the algorithm is presented and area-time trade-offs are developed. An area-constrained architecture achieves low data rates with minimum hardware, which may be used in pico-cell base-stations. A time-constrained solution exploits the entire available parallelism and determines the maximum theoretical data processing rates.
An area-time efficient architecture meets real-time requirements with minimum area overhead.

Item Implementation of Channel Estimation and Multiuser Detection Algorithms for W-CDMA on Digital Signal Processors (1999-08-20) Rajagopal, Sridhar; Xu, Gang; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
Proposed algorithms for Third Generation W-CDMA communication systems have extremely high performance requirements. In this paper, we study the implementation issues involved for one of the proposed multiuser channel estimation and detection algorithms for base-stations in the uplink using TI's TMS320C6x DSP Evaluation Modules (EVM). It was found that these proposed algorithms for multiuser channel estimation and detection have different processing and precision requirements. While the detector can be implemented using the C6201 16-bit fixed point DSP, the proposed channel estimation algorithm may be more suitable for a floating point implementation using the C6701 floating point DSP. We study the effects of the specialized approximate instructions available on the C6701 DSP on channel estimation. Then, the advantage of multistep optimizations and use of assembly code is studied for both the algorithms. Memory issues involved in the implementation of these algorithms are also investigated. It was found that the data memory requirements for channel estimation for the chosen system parameters necessitate the use of external memory while the multistage detection algorithm could be placed in the available internal data memory.
We finally discuss the current and future trends of DSPs and their utilization for such wireless communication applications.

Item Improving power efficiency in stream processors through dynamic cluster reconfiguration (2004-12-01) Rajagopal, Sridhar; Rixner, Scott; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
Stream processors support hundreds of functional units in a programmable architecture by clustering functional units and utilizing a bandwidth hierarchy. Clusters are the dominant source of power consumption in stream processors. When the data parallelism falls below the number of clusters, unutilized clusters can be turned off to save power. This paper improves power efficiency in stream processors by dynamically reconfiguring the number of clusters in a stream processor to match the time varying data parallelism of an application. We explore 3 mechanisms for dynamic reconfiguration: using memory, conditional streams and a multiplexer network. A 32-user wireless basestation is a prime example of a workload that benefits from such reconfiguration. When the number of users supported by the basestation dynamically changes from 32 to 4, the reconfiguration from a 32-cluster stream processor to a 4-cluster stream processor yields 15-85% power savings over and above a stream processor that uses conventional power saving techniques such as dynamic voltage and frequency scaling. The dynamic reconfiguration support extends stream processors from traditional high performance applications to power-sensitive applications in which the data parallelism varies dynamically and falls below the number of clusters.

Item On-line Arithmetic for Detection in Digital Communication Receivers (2001-06-20) Rajagopal, Sridhar; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
This paper demonstrates the advantages of using on-line arithmetic for traditional and advanced detection algorithms for communication systems.
Detection is one of the core computationally-intensive physical layer operations in a communication receiver and determines the communication data rates. Detection algorithms typically involve hard decisions (sign based testing) to find the sign of the transmitted information bit. This results in extraneous computations in a conventional number system as the sign is obtained only at the end due to the Least Significant Digit First (LSDF) nature of computations. On-line arithmetic, based on a signed digit number representation, provides Most Significant Digit First (MSDF) computation. Hence, the computations can stop after the first non-zero MSD (sign) is computed and additional computations for the successive digits can be avoided. Back-conversion to a conventional number system is not required as the sign of the digit represents the detected bit. A comparison of a radix-4 serial digit on-line multiuser detector with an 8-bit parallel conventional arithmetic multiuser detector shows a decrease in latency by 1.95X, a 3X increase in throughput, and possible savings in area.

Item A programmable baseband processor design for software defined radios (IEEE, 2002-08-20) Rajagopal, Sridhar; Rixner, Scott; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
Future wireless systems need extremely fast and flexible architectures to support varying standards, algorithms and protocols with data rates in the range of 10-100 Mbps. Software Defined Radios (SDRs) based on DSP-FPGAs are a widely proposed solution for these systems. However, these SDR solutions have not been able to meet real-time requirements. We propose a programmable architecture solution for SDRs using a stream-based architecture based on the Imagine media processor. The configurable Imagine simulator allows us to investigate issues such as memory bottlenecks, number and type of functional units needed, and the utilization of those functional units.
To evaluate stream-based architectures for baseband processing, we parallelize and implement sophisticated baseband algorithms including multiuser estimation, multiuser detection and Viterbi decoding on this simulator. We present the bottlenecks in such a stream-based architecture for efficient communications processing. Comparisons with current generation DSP-based solutions show orders-of-magnitude performance improvements, both due to the stream-based nature of computations as well as the increase in the number of functional units having a high utilization factor. The result is a baseband processor designed with broad system functionality and flexibility that approaches real-time performance for future wireless systems.

Item Real-Time Algorithms and Architectures for Multiuser Channel Estimation and Detection in Wireless Base-Station Receivers (IEEE, 2002-07-20) Rajagopal, Sridhar; Bhashyam, Srikrishna; Cavallaro, Joseph R.; Aazhang, Behnaam; Center for Multimedia Communications (http://cmc.rice.edu/)
This paper presents algorithms and architecture designs that can meet real-time requirements of multiuser channel estimation and detection in future wireless base-station receivers. Sophisticated algorithms proposed to implement multiuser channel estimation and detection make their real-time implementation difficult on current Digital Signal Processor (DSP)-based receivers. A maximum-likelihood based multiuser channel estimation scheme requiring matrix inversions is redesigned from an implementation perspective for a reduced complexity, iterative scheme with a simple fixed-point VLSI architecture. A reduced-complexity, bit-streaming multiuser detection algorithm that avoids the need for multishot detection is also developed for a simple, pipelined VLSI architecture.
Thus, we show that real-time solutions can be achieved for third generation wireless systems by (1) designing the algorithms from a fixed-point implementation perspective, without significant loss in error rate performance, (2) task partitioning and (3) designing bit-streaming fixed-point VLSI architectures that explore pipelining, parallelism and bit-level computations to achieve real-time with minimum area overhead.

Item Real-Time DSP Multiprocessor Implementation for Future Wireless Base-Station Receivers (2000-08-20) Jones, Bryan Allen; Rajagopal, Sridhar; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
The convergence of cellular phones, the Internet, and laptop computers into a single small, lightweight, wireless information appliance drives a need for a high data rate, low-power digital wireless communication link to enable the creation of such a device. A simulation environment supporting rapid prototyping is developed, and used to evaluate the real-time data-rate performance of these receivers implemented on a multiprocessor DSP board. Simulations of a multiprocessor implementation of joint multiuser channel estimation and detection algorithms are projected to achieve combined performance of 15.6 Kb/user/sec for 10 users. Performance gains over a single-processor implementation range from 5% for the three-user case to 69% for a 15-user case.

Item Reconfigurable stream processors for wireless base-stations (2003-10-20) Rajagopal, Sridhar; Rixner, Scott; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
This paper presents the design and use of reconfigurable stream processors for the physical layer processing in wireless base-stations. Stream processors, traditionally used for high performance media processing, use clusters of functional units to provide support for hundreds of functional units in a programmable architecture.
We provide hardware support for reconfiguration in stream processors, enabling them to be power-efficient by adapting to the compute requirements of the application. We demonstrate the real-time implementation of a 32-user wireless base-station, employing multiuser channel estimation, multiuser detection and Viterbi decoding physical layer algorithms, supporting a data rate of 128 Kbps/user. The reconfigurable stream processor runs at 1.2 GHz and has an estimated power consumption of 12.38 W at full workload. However, basestations rarely operate at full capacity. When the base-station workload decreases, the reconfigurable stream processor adapts the number of clusters, functional units, voltage and frequency dynamically for power efficiency. When the application workload changes to 4 users, the reconfiguration support reduces the power to 300 mW at 433 MHz, providing a 41.27X decrease in power consumption. The cluster reconfiguration yields an additional 15-85% power savings over a stream processor with dynamic voltage and frequency scaling.

Item Task Partitioning Wireless Base-station Receiver Algorithms on Multiple DSPs and FPGAs (2000-10-20) Rajagopal, Sridhar; Jones, Bryan Allen; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
This paper presents a multiprocessor solution to meet real-time requirements of implementing advanced algorithms for multiuser channel estimation and detection for third and fourth generation wireless base-station receivers. We identify the key bottlenecks in the algorithms and task-partition the algorithms on multiple processors. We get speedups, ranging from 1.19X to 5.92X for a dual-DSP implementation due to both additional computational power and additional internal memory compared to a single DSP implementation using external memory. We also identify parts of the algorithm that exhibit bit-level parallelism, not utilized by DSPs.
FPGAs can then be used to accelerate these parts and meet real-time requirements of 128 Kbps for next generation wireless systems.

Item Truncated on-line arithmetic with applications to communication systems (2006-09-01) Rajagopal, Sridhar; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
Truncation and saturation in digit-precision are very important and common operations in embedded system design for bounding the required finite precision and for area-time-power savings. In this paper, we present the use of on-line arithmetic to provide truncated computations with communication systems as one of the applications. In contrast to truncation in conventional arithmetic, on-line arithmetic can truncate dynamically and produce both area and time benefits due to the digit-serial nature of computations. This is of great advantage in communication systems where the precision requirements can change dynamically with the environment. While truncation in conventional arithmetic can have significant truncation errors, the redundancy and most significant digit first nature of on-line arithmetic produces truncation error only in the least significant digit of the truncated result. As an application that uses significant truncation in precision, a code matched filter detector for wireless systems is designed using truncated on-line arithmetic. The detector can provide both hard decisions and soft(er) decisions dynamically as well as interface with other conventional arithmetic circuits or act as a DSP co-processor. Thus, optimized communication receivers with co-existing conventional arithmetic for saturation and on-line arithmetic for truncation can now be built. The truncated on-line arithmetic detector was also verified with a VLSI implementation in an AMI 0.5 micron MOSIS Tiny Chip process and is currently under fabrication.
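
As an illustrative aside on the on-line arithmetic abstracts above, the following is a minimal Python sketch of two properties they describe: a hard decision can be read from the first non-zero digit when digits arrive most significant digit first, and truncating a signed-digit result leaves an error smaller than one unit of the last kept digit. The radix-2 digit set {-1, 0, 1}, the example digit vector, and the helper names sd_value and msdf_sign are assumptions made for illustration; this is not the authors' detector design.

```python
from fractions import Fraction

def sd_value(digits, radix=2):
    """Exact value of a signed-digit fraction given MSD first:
    sum of d_i * radix**(-i) for i = 1..n."""
    return sum(Fraction(d, radix ** i) for i, d in enumerate(digits, start=1))

def msdf_sign(digits):
    """Hard decision: sign of the first non-zero digit, scanning MSD first.
    Later digits never need to be examined (or computed)."""
    for d in digits:
        if d != 0:
            return 1 if d > 0 else -1
    return 0  # every digit was zero

# Hypothetical radix-2 result with digits drawn from {-1, 0, 1}.
digits = [0, 1, -1, 0, -1, 1]

full = sd_value(digits)        # 7/64, using all six digits
short = sd_value(digits[:3])   # 1/8, keeping only the three leading digits

print(msdf_sign(digits), full, short, abs(full - short))
```

Here the sign read from the second digit agrees with the sign of the full value, and the truncation error of 1/64 is below 2**-3, the weight of the last kept digit. The bound holds for any finite digit string whose digit magnitudes do not exceed radix - 1.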