Browsing by Author "Sun, Yang"
Now showing 1 - 20 of 33
Results Per Page
Sort Options
Item Application-Specific Accelerators for Communications(Springer Science+Business Media, LLC, 2010-01-01) Sun, Yang; Amiri, Kiarash; Brogioli, Michael; Cavallaro, Joseph R.; Center for Multimedia CommunicationFor computation-intensive digital signal processing algorithms, complexity is exceeding the processing capabilities of general-purpose digital signal processors (DSPs). In some of these applications, DSP hardware accelerators have been widely used to off-load a variety of algorithms from the main DSP host, including FFT, FIR/IIR filters, multiple-input multiple-output (MIMO) detectors, and error correction codes (Viterbi, Turbo, LDPC) decoders. Given power and cost considerations, simply implementing these computationally complex parallel algorithms with high-speed general-purpose DSP processor is not very efficient. However, not all DSP algorithms are appropriate for off-loading to a hardware accelerator. First, these algorithms should have data-parallel computations and repeated operations that are amenable to hardware implementation. Second, these algorithms should have a deterministic dataflow graph that maps to parallel datapaths. The accelerators that we consider are mostly coarse grain to better deal with streaming data transfer for achieving both high performance and low power. In this chapter, we focus on some of the basic and advanced digital signal processing algorithms for communications and cover major examples of DSP accelerators for communications.Item Architectures for Cognitive Radio Testbeds and Demonstrators – An Overview(IEEE, 2010-06-01) Gustafsson, Oscar; Amiri, Kiarash; Andersson, Dennis; Blad, Anton; Bonner, Christian; Cavallaro, Joseph R.; Declerck, Jeroen; Dejonghe, Antoine; Eliardsson, Patrik; Glasse, Miguel; Hayar, Aawatif; Hollevoet, Lieven; Hunter, Chris; Joshi, Madhura; Kaltenberger, Florian; Knopp, Raymond; Le, Khanh; Miljanic, Zoran; Murphy, Patrick; Naessens, Frederik; Nikaein, Navid; Nussbaum, Dominique; Pacalet, Renaud; Raghavan, Praveen; Sabharwal, Ashutosh; Sarode, Onkar; Spasojevic, Predrag; Sun, Yang; Tullberg, Hugo M.; Vander Aa, Tom; Van der Perre, Liesbet; Wetterwald, Michelle; Wu, Michael; Center for Multimedia CommunicationWireless communication standards are developed at an ever-increasing rate of pace, and significant amounts of effort is put into research for new communication methods and concepts. On the physical layer, such topics include MIMO, cooperative communication, and error control coding, whereas research on the medium access layer includes link control, network topology, and cognitive radio. At the same time, implementations are moving from traditional fixed hardware architectures towards software, allowing more efficient development. Today, field-programmable gate arrays (FPGAs) and regular desktop computers are fast enough to handle complete baseband processing chains, and there are several platforms, both open-source and commercial, providing such solutions. The aims of this paper is to give an overview of five of the available platforms and their characteristics, and compare the features and performance measures of the different systems.Item Configurable and Scalable High Throughput Turbo Decoder Architecture for Multiple 4GWireless Standards(IEEE, 2008-07-01) Sun, Yang; Zhu, Yuming; Goel, Manish; Cavallaro, Joseph R.; Center for Multimedia CommunicationIn this paper, we propose a novel multi-code turbo decoder architecture for 4G wireless systems. To support various 4G standards, a configurable multi-mode MAP (maximum a posteriori) decoder is designed for both binary and duo-binary turbo codes with small resource overhead (less than 10%) compared to the single-mode architecture. To achieve high data rates in 4G, we present a parallel turbo decoder architecture with scalable parallelism tailored to the given throughput requirements. High-level parallelism is achieved by employing contention-free interleavers. Multi-banked memory structure and routing network among memories and MAP decoders are designed to operate at full speed with parallel interleavers. We designed a very low-complexity recursive on-line address generator supporting multiple interleaving patterns, which avoids the interleaver address memory. Design trade-offs in terms of area and power efficiency are explored to find the optimal architectures. A 711 Mbps data rate is feasible with 32 Radix-4 MAP decoders running at 200 MHz clock rate.Item Configurable and Scalable Turbo Decoder for 4G Wireless Receivers(IGI-Global Press, 2010-01-01) Sun, Yang; Cavallaro, Joseph R.; Zhu, Yuming; Manish, Goel; Center for Multimedia CommunicationThe increasing requirements of high data rates and quality of service (QoS) in fourth-generation (4G) wireless communication require the implementation of practical capacity approaching codes. In this chapter, the application of Turbo coding schemes that have recently been adopted in the IEEE 802.16e WiMax standard and 3GPP Long Term Evolution (LTE) standard are reviewed. In order to process several 4G wireless standards with a common hardware module, a reconfigurable and scalable Turbo decoder architecture is presented. A parallel Turbo decoding scheme with scalable parallelism tailored to the target throughput is applied to support high data rates in 4G applications. High-level decoding parallelism is achieved by employing contention-free interleavers. A multi-banked memory structure and routing network among memories and MAP decoders are designed to operate at full speed with parallel interleavers. A new on-line address generation technique is introduced to support multiple Turbo interleaving patterns, which avoids the interleaver address memory that is typically necessary in the traditional designs. Design trade-offs in terms of area and power efficiency are analyzed for different parallelism and clock frequency goals.Item Efficient hardware implementation of a highly-parallel 3GPP LTE/LTE-advance turbo decoder(Elsevier, 2011-09-01) Sun, Yang; Cavallaro, Joseph R.; Center for Multimedia CommunicationWe present an efficient VLSI architecture for 3GPP LTE/LTE-Advance Turbo decoder by utilizing the algebraic-geometric properties of the quadratic permutation polynomial (QPP) interleaver. The high throughput 3GPP LTE/LTE-Advance Turbo codes require a highly-parallel decoder architecture. Turbo interleaver is known to be the main obstacle to the decoder parallelism due to the collisions it introduces in accesses to memory. The QPP interleaver solves the memory contention issues when several MAP decoders are used in parallel to improve Turbo decoding throughput. In this paper, we propose a low-complexity QPP interleaving address generator and a multi-bank memory architecture to enable parallel Turbo decoding. Design trade-offs in terms of area and throughput efficiency are placed an routed in a 65-nm CMOS technology with a core area of 8.3mm2 and a maximum clock frequency of 400 MHz. This parallel decoder, comprising 64 MAP decoder cores, can achieve a maximum decoding throughput of 1.28 Gbps at 6 iterations.Item A Flexible LDPC/Turbo Decoder Architecture(Springer, 2011-07-01) Sun, Yang; Cavallaro, Joseph R.; Center for Multimedia CommunicationLow-density parity-check (LDPC) codes and convolutional Turbo codes are two of the most powerful error correcting codes that are widely used in modern communication systems. In a multi-mode baseband receiver, both LDPC and Turbo decoders may be required. However, the different decoding approaches for LDPC and Turbo codes usually lead to different hardware architectures. In this paper we propose a unified message passing algorithm for LDPC and Turbo codes and introduce a flexible soft-input soft-output (SISO) module to handle LDPC/Turbo decoding. We employ the trellis-based maximum a posteriori (MAP) algorithm as a bridge between LDPC and Turbo codes decoding. We view the LDPC code as a concatenation of n super-codes where each super-code has a simpler trellis structure so that the MAP algorithm can be easily applied to it. We propose a flexible functional unit (FFU) for MAP processing of LDPC and Turbo codes with a low hardware overhead (about 15% area and timing overhead). Based on the FFU, we propose an area-efficient flexible SISO decoder architecture to support LDPC/Turbo codes decoding. Multiple such SISO modules can be embedded into a parallel decoder for higher decoding throughput. As a case study, a flexible LDPC/Turbo decoder has been synthesized on a TSMC 90 nm CMOS technology with a core area of 3.2 mm2. The decoder can support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE Turbo codes. Running at 500 MHz clock frequency, the decoder can sustain up to 600 Mbps LDPC decoding or 450 Mbps Turbo decoding.Item FPGA Prototyping of A High Data Rate LTE Uplink Baseband Receiver(IEEE, 2009-11-01) Wang, Guohui; Yin, Bei; Amiri, Kiarash; Sun, Yang; Wu, Michael; Cavallaro, Joseph R.; Center for Multimedia CommunicationThe Third Generation Partnership Project (3GPP) Long Term Evolution (LTE) standard is becoming the appropriate choice to pave the way for the next generation wireless and cellular standards. While the popular OFDM technique has been adopted and implemented in previous standards and also in the LTE downlink, it suffers from high peak-to-average-power ratio (PAPR). High PAPR requires more sophisticated power amplifiers (PAs) in the handsets and would result in lower efficiency PAs. In order to combat such effects, the LTE uplink choice of transmission is the novel Single Carrier Frequency Division Multiple Access (SC-FDMA) scheme which has lower PAPR due to its inherent signal structure. While reducing the PAPR, the SC-FDMA requires a more complicated detector structure in the base station for multi-antenna and multi-user scenarios. Since the multi-antenna and multi-user scenarios are critical parts of the LTE standard to deliver high performance and data rate, it is important to design novel architectures to ensure high reliability and data rate in the receiver. In this paper, we propose a flexible architecture of a high data rate LTE uplink receiver with multiple receive antennas and implemented a single FPGA prototype of this architecture. The architecture is verified on the WARPLab (a software defined radio platform based on Rice Wireless Open-access Research Platform) and tested in the real over-the-air indoor channel.Item GPU Accelerated Scalable Parallel Decoding of LDPC Codes(IEEE, 2011-11-01) Wang, Guohui; Wu, Michael; Sun, Yang; Center for Multimedia CommunicationThis paper proposes a flexible low-density parity-check (LDPC) decoder which leverages graphic processor units (GPU) to provide high decoding throughput. LDPC codes are widely adopted by the new emerging standards for wireless communication systems and storage applications due to their near-capacity error correcting performance. To achieve high decoding throughput on GPU, we leverage the parallelism embedded in the check-node computation and variable-node computation and propose a parallel strategy of partitioning the decoding jobs among multi-processors in GPU. In addition, we propose a scalable multi-codeword decoding scheme to fully utilize the computation resources of GPU. Furthermore, we developed a novel adaptive performance-tuning method to make our decoder implementation more flexible and scalable. The experimental results show that our LDPC decoder is scalable and flexible, and the adaptive performance-tuning method can deliver the peak performance based on the GPU architecture.Item A GPU Implementation of a Real-Time MIMO Detector(IEEE, 2009-10-01) Wu, Michael; Gupta, Siddharth; Sun, Yang; Cavallaro, Joseph R.; Center for Multimedia CommunicationMultiple-input multiple-output (MIMO) is an existing technique that can significantly increase throughput of the system by employing multiple antennas at the transmitter and the receiver. Realizing maximum benefit from this technique requires computationally intensive detectors which poses significant challenges to receiver design. Furthermore, a flexible detector or multiple detectors are needed to handle different configurations. Graphical Processor Unit (GPU), a highly parallel commodity programmable co-processor, can deliver extremely high computation throughput and is well suited for signal processing applications. However, careful architecture aware design is needed to leverage performance offered by GPU. We show we can achieve good performance while maintaining flexibility by employing an optimized trellis-based MIMO detector on GPU.Item High Throughput VLSI Architecture for Soft-Output MIMO Detection Based on A Greedy Graph Algorithm(ACM, 2009-05-01) Sun, Yang; Cavallaro, Joseph R.; Center for Multimedia CommunicationMaximum-likelihood (ML) decoding is a very computational-intensive task for multiple-input multiple-output (MIMO) wireless channel detection. This paper presents a new graph based algorithm to achieve near ML performance for soft MIMO detection. Instead of using the traditional tree search based structure, we represent the search space of the MIMO signals with a directed graph and a greedy algorithm is applied to compute the a posteriori probability (APP) for each transmitted bit. The proposed detector has two advantages: 1) it keeps a fixed throughput and has a regular and parallel datapath structure which makes it amenable to high speed VLSI implementation, and 2) it attempts to maximize the a posteriori probability by making the locally optimum choice at each stage with the hope of finding the global minimum Euclidean distance for every transmitted bit. Compared to the soft K-best detector, the proposed solution significantly reduces the complexity because sorting is not required, while still maintaining good bit error rate (BER) performance. The proposed greedy detection algorithm has been designed and synthesized for a 4 x 4 16-QAM MIMO system in a TSMC 65 nm CMOS technology. The detector achieves a maximum throughput of 600 Mbps with a 0.79 mm2 core area.Item High Throughput VLSI Architecture for Soft-Output MIMO Detection Based on A Greedy Graph Algorithm(ACM, 2009-05-10) Sun, Yang; Cavallaro, Joseph R.; CMCMaximum-likelihood (ML) decoding is a very computational- intensive task for multiple-input multiple-output (MIMO) wireless channel detection. This paper presents a new graph based algorithm to achieve near ML performance for soft MIMO detection. Instead of using the traditional tree search based structure, we represent the search space of the MIMO signals with a directed graph and a greedy algorithm is ap- plied to compute the a posteriori probability (APP) for each transmitted bit. The proposed detector has two advantages: 1) it keeps a fixed throughput and has a regular and parallel datapath structure which makes it amenable to high speed VLSI implementation, and 2) it attempts to maximize the a posteriori probability by making the locally optimum choice at each stage with the hope of finding the global minimum Euclidean distance for every transmitted bit x_k element of {-1, +1}. Compared to the soft K-best detector, the proposed solution significantly reduces the complexity because sorting is not required, while still maintaining good bit error rate (BER) performance. The proposed greedy detection algorithm has been designed and synthesized for a 4 x 4 16-QAM MIMO system in a TSMC 65 nm CMOS technology. The detector achieves a maximum throughput of 600 Mbps with a 0.79 mm2 core area.Item HIGH THROUGHPUT, PARALLEL, SCALABLE LDPC ENCODER/DECODER ARCHITECTURE FOR OFDM SYSTEMS(IEEE, 2006-10-01) Sun, Yang; Karkooti, Marjan; Cavallaro, Joseph R.; Center for Multimedia CommunicationThis paper presents a high throughput, parallel, scalable and irregular LDPC coding and decoding system hardware implementation that supports twelve combinations of block lengths 648, 1296, 1944 bits and code rates 1/2, 2/3, 3/4, 5/6 based on IEEE 802.11n standard. Based on architecture-aware LDPC codes, we propose an efficient joint LDPC coding and decoding hardware architecture. The prototype architecture is being implemented on FPGA and tested over the air on our wireless OFDM testbed, which is a highly capable, scalable and extensible platform for advanced wireless research. The ASIC resource requirements of the decoder are reported and a trade-off between pipelined and non-pipelined implementation is describedItem High-Level Design Tools for Complex DSP Applications(Elsevier, Waltham, MA, 2012-07-12) Sun, Yang; Amiri, Kiarash; Wang, Guohui; Yin, Bei; Cavallaro, Joseph R.; Ly, Tai; Center for Multimedia CommunicationHigh-level synthesis design methodology - High level synthesis (HLS) [1], also known as behavioral synthesis and algorithmic synthesis, is a design process in which a high level, functional description of a design is automatically compiled into a RTL implementation that meets certain user specified design constraints. The HLS design description is ‘high level’ compared to RTL in two aspects: design abstraction, and specification language.Item High-Throughput Contention-Free Concurrent Interleaver Architecture for Multi-Standard Turbo Decoder(IEEE, 2011-09-01) Wang, Guohui; Sun, Yang; Cavallaro, Joseph R.; Guo, Yuanbin; Center for Multimedia CommunicationTo meet the higher data rate requirement of emerging wireless communication technology, numerous parallel turbo decoder architectures have been developed. However, the interleaver has become a major bottleneck that limits the achievable throughput in the parallel decoders due to the massive memory conflicts. In this paper, we propose a flexible Double-Buffer based Contention-Free (DBCF) interleaver architecture that can efficiently solve the memory conflict problem for parallel turbo decoders with very high parallelism. The proposed DBCF architecture enables high throughput concurrent interleaving for multi-standard turbo decoders that support UMTS/HSPA+, LTE and WiMAX, with small datapath delays and low hardware cost. We implemented the DBCF interleaver with a 65nm CMOS technology. The implementation of this highly efficient DBCF interleaver architecture shows significant improvement in terms of the maximum throughput and occupied chip area compared to the previous work.Item High-Throughput Soft-Output MIMO Detector Based on Path-Preserving Trellis-Search Algorithm(IEEE, 2012-07-01) Sun, Yang; Cavallaro, Joseph R.; Center for Multimedia CommunicationIn this paper, we propose a novel path-preserving trellis-search (PPTS) algorithm and its high-speed VLSI architecture for soft-output multiple-input-multiple-output (MIMO) detection. We represent the search space of the MIMO signal with an unconstrained trellis, where each node in stage of the trellis maps to a possible complex-valued symbol transmitted by antenna. Based on the trellis model, we convert the soft-output MIMO detection problem into a multiple shortest paths problem subject to the constraint that every trellis node must be covered in this set of paths. The PPTS detector is guaranteed to have soft information for every possible symbol transmitted on every antenna so that the log-likelihood ratio (LLR) for each transmitted data bit can be more accurately formed. Simulation results show that the PPTS algorithm can achieve near-optimal error performance with a low search complexity. The PPTS algorithm is a hardware-friendly data-parallel algorithm because the search operations are evenly distributed among multiple trellis nodes for parallel processing. As a case study, we have designed and synthesized a fully-parallel systolic-array detector and two folded detectors for a 4x4 16-QAM system using a 1.08 V TSMC 65-nm CMOS technology.With a 1.18 mm2 core area, the folded detector can achieve a throughput of 2.1 Gbps.With a 3.19 mm2 core area, the fully-parallel systolic-array detector can achieve a throughput of 6.4 Gbps.Item Implementation of a High Throughput 3GPP Turbo Decoder on GPU(Springer, 2011-11-01) Wu, Michael; Sun, Yang; Wang, Guohui; Cavallaro, Joseph R.; Center for Multimedia CommunicationTurbo code is a computationally intensive channel code that is widely used in current and upcoming wireless standards. General-purpose graphics processor unit (GPGPU) is a programmable commodity processor that achieves high performance computation power by using many simple cores. In this paper, we present a 3GPP LTE compliant Turbo decoder accelerator that takes advantage of the processing power of GPU to offer fast Turbo decoding throughput. Several techniques are used to improve the performance of the decoder. To fully utilize the computational resources on GPU, our decoder can decode multiple codewords simultaneously, divide the workload for a single codeword across multiple cores, and pack multiple codewords to fit the single instruction multiple data (SIMD) instruction width. In addition, we use shared memory judiciously to enable hundreds of concurrent multiple threads while keeping frequently used data local to keep memory access fast. To improve efficiency of the decoder in the high SNR regime, we also present a low complexity early termination scheme based on average extrinsic LLR statistics. Finally, we examine how different workload partitioning choices affect the error correction performance and the decoder throughput.Item Implementation of a High Throughput Soft MIMO Detector on GPU(Springer, 2011-07-01) Wu, Michael; Sun, Yang; Gupta, Siddharth; Cavallaro, Joseph R.; Center for Multimedia CommunicationMultiple-input multiple-output (MIMO) significantly increases the throughput of a communication system by employing multiple antennas at the transmitter and the receiver. To extract maximum performance from a MIMO system, a computationally intensive search based detector is needed. To meet the challenge of MIMO detection, typical suboptimal MIMO detectors are ASIC or FPGA designs. We aim to show that a MIMO detector on Graphic processor unit (GPU), a low-cost parallel programmable co-processor, can achieve high throughput and can serve as an alternative to ASIC/FPGA designs. However, careful architecture aware software design is needed to leverage the performance offered by GPU. We propose a novel soft MIMO detection algorithm, multi-pass trellis traversal (MTT), and show that we can achieve ASIC/FPGA-like performance and handle different configurations in software on GPU. The proposed design can be used to accelerate wireless physical layer simulations and to offload MIMO detection processing in wireless testbed platforms.Item Low complexity scalable MIMO sphere detection through antenna detection reordering(Springer, 2012-07-01) Wu, Michael; Dick, Chris; Sun, Yang; Cavallaro, Joseph R.; Center for Multimedia CommunicationThis paper describes a novel low complexity scalable multiple-input multiple-output (MIMO) detector that does not require preprocessing and the optimal squared l2-norm computations to achieve good bit error (BER) performance. Unlike existing detectors such as Flexsphere that use preprocessing before MIMO detection to improve performance, the proposed detector instead performs multiple search passes, where each search pass detects the transmit stream with a different permuted detection order. In addition, to reduce the number of multipliers required in the design, we use l1-norm in place of the optimal squared l2-norm. To ameliorate the BER performance loss due to l1- norm, we propose squaring then scaling the l1-norm. By changing the number of parallel search passes and using norm scaling, we show that this design achieves comparable performance to Flexsphere with reduced resource requirement or achieves BER performance close to exhaustive search with increased resource requirement.Item LOW-COMPLEXITY AND HIGH-PERFORMANCE SOFT MIMO DETECTION BASED ON DISTRIBUTED M-ALGORITHM THROUGH TRELLIS-DIAGRAM(IEEE, 2010-03-01) Sun, Yang; Cavallaro, Joseph R.; Center for Multimedia CommunicationThis paper presents a novel low-complexity multiple-input multipleoutput (MIMO) detection scheme using a distributed M-algorithm (DM) to achieve high performance soft MIMO detection. To reduce the searching complexity, we build a MIMO trellis graph and split the searching operations among different nodes, where each node will apply the M-algorithm. Instead of keeping a global candidate list as the traditional detector does, this algorithm keeps multiple small candidate lists to generate soft information. Since the DM algorithm can achieve good BER performance with a small M, the sorting cost of the DM algorithm is lower than that of the conventional K-best MIMO algorithm. The proposed algorithm is very suitable for high speed parallel processing.Item A LOW-POWER 1-Gbps RECONFIGURABLE LDPC DECODER DESIGN FOR MULTIPLE 4G WIRELESS STANDARDS(IEEE, 2008-09-01) Sun, Yang; Cavallaro, Joseph R.; Center for Multimedia CommunicationIn this paper we present an efficient system-on-chip implementation of a 1-Gbps LDPC decoder for 4G (or beyond 3G) wireless standards. The decoder has a scalable data path and can be dynamically reconfigured to support multiple 4G standards. We utilize a pipelined version of the layered belief propagation algorithm to achieve partial-parallel decoding of structured LDPC codes. Instead of using the sub-optimal Minsum algorithm, we propose to use the powerful belief propagation (BP) decoding algorithm by designing an area-efficient soft-input soft-output (SISO) decoder. Two power saving schemes are employed to reduce the power consumption up to 65%. The decoder has been synthesized, placed, and routed on a TSMC 90nm 1.0V 8-metal layer CMOS technology with a total area of 3.5 mm2. The maximum clock frequency is 450 MHz and the estimated peak power consumption is 410 mW.