Browsing by Author "Guo, Yuanbin"
Now showing 1 - 20 of 36
Item Advanced MIMO-CDMA receiver for interference suppression: Algorithms, system-on-chip architectures and design methodology (2005) Guo, Yuanbin; Cavallaro, Joseph R.
MIMO (Multiple Input Multiple Output) technology is proposed in CDMA systems for much higher rate packet services. The receiver architecture is essential for the mobile devices to support high speed multimedia service. The design challenges come from both detection algorithms and hardware architectures. Much more complicated algorithms are required to suppress various interferences. However, the current hardware design architecture and methodology is falling far behind the requirements of small size, low cost and power consumption. System-On-Chip (SoC) architectures are a major revolution taking place in the design of integrated circuits due to many advantages in the power consumption and compact size. The VLSI-oriented complexity reduction of the numerical algorithms plays an essential role to design efficient real-time architectures. Thus, the thesis contributes to three major aspects: to propose high performance algorithms with realistic complexity in different channel conditions; to propose real-time SoC architectures with area/speed/power efficiency; and to propose an efficient design methodology for modelling, partitioning/binding, verification and synthesis of the wireless systems. Specifically, to cut the design cycle and enable extensive architecture tradeoff study, an integrated wireless development methodology by High-Level-Synthesis for joint algorithm and architecture optimization is proposed. To address the performance/complexity tradeoff, we propose two LMMSE equalizer algorithms and SoC architectures for different channel conditions. Both an FFT circulant MIMO equalizer and a frequency domain iterative equalizer are proposed to avoid Direct-Matrix-Inverse for the well-conditioned channel as well as long channels working in bad conditions respectively.
We then propose a displacement Kalman equalizer with VLSI-oriented architectural optimization for better performance in fast fading environments. For systems with the multi-users' signaling, we propose an adaptive Parallel-Residue-Compensation architecture with stage and user specific weights by viewing the multiple transmitter antennas as virtual users to cancel the interferences explicitly. The increased accuracy in interference cancellation leads to significant performance gain over both the complete and partial PIC. The complexity is reduced by using the commonality to avoid the direct interference cancellation. Finally, dynamic power management schemes are proposed to reduce the power consumption in the VLSI architectures using the inherent features of the interference suppression algorithms.
Item Advanced MIMO-CDMA Receiver for Interference Suppression: Algorithms, System-on-Chip Architectures and Design Methodology (2005-05-01) Guo, Yuanbin; Center for Multimedia Communications (http://cmc.rice.edu/)
MIMO (Multiple Input Multiple Output) technology is proposed in CDMA systems for much higher rate packet services. The receiver architecture is essential for the mobile devices to support high speed multimedia service. The design challenges come from both detection algorithms and hardware architectures. Much more complicated algorithms are required to suppress various interferences. However, the current hardware design architecture and methodology is falling far behind the requirements of small size, low cost and power consumption. System-On-Chip (SoC) architectures are a major revolution taking place in the design of integrated circuits due to many advantages in the power consumption and compact size. The VLSI-oriented complexity reduction of the numerical algorithms plays an essential role to design efficient real-time architectures.
Thus, the thesis contributes to three major aspects: to propose high performance algorithms with realistic complexity in different channel conditions; to propose real-time SoC architectures with area/speed/power efficiency; and to propose an efficient design methodology for modelling, partitioning/binding, verification and synthesis of the wireless systems. Specifically, to cut the design cycle and enable extensive architecture tradeoff study, an integrated wireless development methodology by High-Level-Synthesis for joint algorithm and architecture optimization is proposed. To address the performance/complexity tradeoff, we propose two LMMSE equalizer algorithms and SoC architectures for different channel conditions. Both an FFT circulant MIMO equalizer and a frequency domain iterative equalizer are proposed to avoid Direct-Matrix-Inverse for the well-conditioned channel as well as long channels working in bad conditions respectively. We then propose a displacement Kalman equalizer with VLSI-oriented architectural optimization for better performance in fast fading environments. For systems with the multi-users' signaling, we propose an adaptive Parallel-Residue-Compensation architecture with stage and user specific weights by viewing the multiple transmitter antennas as virtual users to cancel the interferences explicitly. The increased accuracy in interference cancellation leads to significant performance gain over both the complete and partial PIC. The complexity is reduced by using the commonality to avoid the direct interference cancellation.
Finally, dynamic power management schemes are proposed to reduce the power consumption in the VLSI architectures using the inherent features of the interference suppression algorithms.
Item Compact Hardware Accelerator for Functional Verification and Rapid Prototyping of 4G Wireless Communication Systems (2004-11-01) Guo, Yuanbin; McCain, Dennis; Center for Multimedia Communications (http://cmc.rice.edu/)
In this paper, we propose an FPGA-based hardware accelerator platform with Xilinx Virtex-II V3000 in a compact PCMCIA form factor. By partitioning the complex algorithms in the 4G simulator to the hardware accelerator, we apply an efficient Catapult-C methodology to quickly evaluate the area/speed tradeoffs and rapidly schedule synthesizable RTL models for implementation. The simulation time is accelerated by 100× for a QRD-M algorithm. This not only enables much faster verification in the 4G standard environment, but also provides software/hardware co-design and rapid prototyping of the core algorithm in a realistic fixed-point platform.
Item Displacement MIMO Kalman equalizer architecture for CDMA downlink in fast fading channels (2005-07-01) Guo, Yuanbin; Zhang, Jianzhong (Charlie); McCain, Dennis; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
In this paper, we explore the displacement structure in a Kalman equalizer for MIMO-CDMA downlink. A streamlined MIMO Kalman equalizer architecture is proposed to extract the commonality in the data path by exploiting the displacement structure of the transition matrix and the block-Toeplitz structure of the channel matrix. Numerical matrix multiplications with O(F^3) complexity are eliminated by simple data loading process. Utilizing the block Toeplitz structure of the channel matrix, an FFT-based acceleration is proposed to avoid direct matrix multiplications in the time domain.
Finally, an iterative Conjugate-Gradient based algorithm is proposed to avoid the inversion of the innovation correlation matrix in Kalman gain calculation. The proposed architecture not only reduces the numerical complexity to O(F log2 F) per chip, but also facilitates the parallel and pipelined VLSI implementation for real-time processing.
Item Displacement MIMO Kalman Equalizer for CDMA Downlink in Fast Fading Channels (2005-11-01) Guo, Yuanbin; Zhang, Jianzhong (Charlie); McCain, Dennis; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
In this paper, a streamlined MIMO Kalman equalizer architecture is proposed to extract the commonality in the data path by jointly considering the displacement structure of the transition matrix and the block-Toeplitz structure of the channel matrix. Finally, an iterative Conjugate-Gradient based algorithm is proposed to avoid the inverse of the Hermitian symmetric innovation correlation matrix in Kalman gain processor. The proposed architecture not only reduces the numerical complexity to O(F log F) per chip, but also facilitates the parallel and pipelined VLSI implementation in real-time processing.
Item An Efficient Circulant MIMO Equalizer for CDMA Downlink: Algorithm and VLSI Architecture (Hindawi Publishing Corporation, 2006-02-01) Guo, Yuanbin; Zhang, Jianzhong; McCain, Dennis; Cavallaro, Joseph R.; Center for Multimedia Communication
We present an efficient circulant approximation-based MIMO equalizer architecture for the CDMA downlink. This reduces the direct matrix inverse (DMI) of size (NF×NF) with O((NF)³) complexity to some FFT operations with O(NF log2(F)) complexity and the inverse of some (N×N) submatrices. We then propose parallel and pipelined VLSI architectures with Hermitian optimization and reduced-state FFT for further complexity optimization. Generic VLSI architectures are derived for the (4×4) high-order receiver from partitioned (2 × 2) submatrices.
This leads to more parallel VLSI design with 3× further complexity reduction. Comparative study with both the conjugate-gradient and DMI algorithms shows very promising performance/complexity tradeoff. VLSI design space in terms of area/time efficiency is explored extensively for layered parallelism and pipelining with a Catapult C high-level-synthesis methodology.
Item An Efficient Circulant MIMO Equalizer for CDMA Downlink: Algorithm and VLSI Architecture (2005-12-01) Guo, Yuanbin; Zhang, Jianzhong (Charlie); McCain, Dennis; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
In this paper, we present an efficient circulant approximation based MIMO equalizer architecture for the CDMA downlink. This reduces the Direct-Matrix-Inverse (DMI) of size (NF x NF) with O((NF)³) complexity to some FFT operations with O(NF log2(F)) complexity and the inverse of some (N x N) sub-matrices. We then propose parallel and pipelined VLSI architectures with Hermitian optimization and reduced-state FFT for further complexity optimization. Generic VLSI architectures are derived for the (4 x 4) high-order receiver from partitioned (2 x 2) sub-matrices. This leads to more parallel VLSI design with 3x further complexity reduction. Comparative study with both the Conjugate-Gradient and DMI algorithms shows very promising performance/complexity tradeoff. VLSI design space in terms of area/time efficiency is explored extensively for layered parallelism and pipelining with a Catapult C High-Level-Synthesis methodology.
Item Efficient MIMO equalization for downlink multi-code CDMA: complexity optimization and comparative study (2004-11-01) Guo, Yuanbin; Zhang, Jianzhong (Charlie); McCain, Dennis; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
In this paper, we present an efficient LMMSE chip equalizer to suppress the interference caused by the multipath fading channel in the MIMO multi-code CDMA downlink.
The block-Toeplitz structure in the correlation matrix is approximated with a block circulant matrix. An FFT-based algorithm is applied to avoid the Direct-Matrix-Inverse (DMI) in the system equation. Hermitian optimization is proposed to further reduce the complexity. A comparative study in both performance and complexity with the Conjugate-Gradient (CG) algorithm is then presented. The simulation shows very promising results for the FFT-based equalizer compared with both the DMI and CG algorithms.
Item Efficient VLSI Architectures for Recursive Vandermonde QR Decomposition in Broadband OFDM Pre-distortion (2005-03-01) Guo, Yuanbin; Center for Multimedia Communications (http://cmc.rice.edu/)
The Vandermonde system is used in OFDM predistortion to enhance the power efficiency dramatically. In this paper, we study efficient FPGA architectures of a recursive algorithm for the Cholesky and QR factorization of the Vandermonde system. We identify the key bottlenecks of the algorithm for the real-time constraints and resource consumption. Several architecture/resource tradeoffs are studied to find the commonalities in the architectures for a best partitioning. Hardware resources are reused according to the algorithmic parallelism and data dependency to achieve the best timing/area performance in hardware. The architectures are implemented in Xilinx FPGA and tested in Aptix real-time hardware platform with 11348 cycles at 25ns clock rate.
Item Enhanced Power Efficiency of Mobile OFDM Radio using Pre-distortion and Post-compensation (2002-09-20) Guo, Yuanbin; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
High power efficiency is an important requirement in cellular transmitter radios. OFDM systems require very linear transmission because of high peak-to-average power-ratio (PAPR). However, linear power amplifiers (PA) have very low efficiency.
To enhance the power efficiency, non-linear PAs are applied and even forced to work near the saturation point. This generates inevitably huge non-linear distortion and spectrum re-growth. In this paper we propose a novel scheme to enhance the power efficiency of the mobile OFDM radio using polynomial-based pre-distortion and post-compensation. We estimate both the non-linearity and inverse non-linearity of the RF transmitter modeled by a memory-less polynomial with Least Square Error (LSE) method. These parameters can then be used to either construct a closed-form pre-distorter or easily generate the post-compensation signal to cancel the non-linear distortion at the RF output.
Item FFT-Accelerated Iterative MIMO Chip Equalizer Architecture For CDMA Downlink (2005-03-01) Guo, Yuanbin; McCain, Dennis; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
In this paper, we present a novel FFT-accelerated iterative Linear MMSE chip equalizer in the MIMO CDMA downlink receiver. The reversed form time-domain matrix multiplication in the Conjugate Gradient iteration is accelerated by an equivalent frequency-domain circular convolution with FFT-based "overlap-save" architecture. The iteration rapidly refines a crude initial approximation to the actual final equalizer taps. This avoids the Direct-Matrix-Inverse with O((NL)³) complexity, and reduces the standard CG complexity from O((NL)²) to O(NL log2(NL)). Simulation demonstrates strong numerical stability and promising performance/complexity tradeoff, especially for very long channels.
Item Hermitian Optimization and Scalable VLSI Architecture for Circulant Approximated MIMO Equalizer in CDMA Downlink (2005-09-01) Guo, Yuanbin; McCain, Dennis; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
In this paper, we propose a parallel and pipelined VLSI architecture for a circulant approximated equalizer for the MIMO-CDMA systems.
The FFT-based tap solver reduces the Direct-Matrix-Inverse of the size (NF x NF) to the inverse of O(N) sub-matrices of the size (N x N). Hermitian optimization and tree pruning is proposed to reduce the number and complexity of the FFTs. A divide-and-conquer method partitions the 4x4 sub-matrices into 2x2 sub-matrices and simplifies the inverse of sub-matrices. Generic VLSI architecture is derived to eliminate the redundancies in the complex operations. Multiple level parallelism and pipelining is investigated with a Catapult C High-Level-Synthesis (HLS) methodology. This leads to efficient VLSI architectures with 3x further complexity reduction. The scalable VLSI architectures are prototyped with the Xilinx FPGAs and achieve area/time efficiency.
Item High-Throughput Contention-Free Concurrent Interleaver Architecture for Multi-Standard Turbo Decoder (IEEE, 2011-09-01) Wang, Guohui; Sun, Yang; Cavallaro, Joseph R.; Guo, Yuanbin; Center for Multimedia Communication
To meet the higher data rate requirement of emerging wireless communication technology, numerous parallel turbo decoder architectures have been developed. However, the interleaver has become a major bottleneck that limits the achievable throughput in the parallel decoders due to the massive memory conflicts. In this paper, we propose a flexible Double-Buffer based Contention-Free (DBCF) interleaver architecture that can efficiently solve the memory conflict problem for parallel turbo decoders with very high parallelism. The proposed DBCF architecture enables high throughput concurrent interleaving for multi-standard turbo decoders that support UMTS/HSPA+, LTE and WiMAX, with small datapath delays and low hardware cost. We implemented the DBCF interleaver with a 65nm CMOS technology.
The implementation of this highly efficient DBCF interleaver architecture shows significant improvement in terms of the maximum throughput and occupied chip area compared to the previous work.
Item Highly Scalable On-the-Fly Interleaved Address Generation for UMTS/HSPA+ Parallel Turbo Decoder (24th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2013-06-01) Vosoughi, Aida; Wang, Guohui; Shen, Hao; Cavallaro, Joseph R.; Guo, Yuanbin; CMC
High throughput parallel interleaver design is a major challenge in designing parallel turbo decoders that conform to high data rate requirements of advanced standards such as HSPA+. The hardware complexity of the HSPA+ interleaver makes it difficult to scale to high degrees of parallelism. We propose a novel algorithm and architecture for on-the-fly parallel interleaved address generation in UMTS/HSPA+ standard that is highly scalable. Our proposed algorithm generates an interleaved memory address from an original input address without building the complete interleaving pattern or storing it; the generated interleaved address can be used directly for interleaved writing to memory blocks. We use an extended Euclidean algorithm for modular multiplicative inversion as a step towards reversed intra-row permutations in UMTS/HSPA+ standard. As a result, we can determine interleaved addresses from original addresses. We also propose an efficient and scalable hardware architecture for our method.
Our design generates 32 interleaved addresses in one cycle and satisfies the data rate requirement of 672 Mbps in HSPA+ while the silicon area and frequency is improved compared to recent related works.
Item A low complexity and low power SoC design architecture for adaptive MAI suppression in CDMA systems (2005-05-01) Guo, Yuanbin; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
In this paper, we propose a reduced complexity and power efficient System-on-Chip (SoC) architecture for adaptive interference suppression in CDMA systems. The adaptive Parallel-Residue-Compensation architecture leads to significant performance gain over the conventional interference cancellation algorithms. The multi-code commonality is explored to avoid the direct Interference Cancellation (IC), which reduces the IC complexity from O(K^2N) to O(KN). The physical meaning of the complete versus weighted IC is applied to clip the weights above a certain threshold so as to reduce the VLSI circuit activity rate. Novel scalable SoC architectures based on simple combinational logic are proposed to eliminate dedicated multipliers with at least 10X saving in hardware resource. A Catapult C High Level Synthesis methodology is applied to explore the VLSI design space extensively and achieve at least 4X speedup. Multi-stage Convergence-Masking-Vector combined with clock gating is proposed to reduce the VLSI dynamic power consumption by up to 90%.
Item Low Complexity System-On-Chip Architectures Of Optimal Parallel-Residue-Compensation In CDMA Systems (2004-05-01) Guo, Yuanbin; McCain, Dennis; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
In this paper, we propose a novel multi-stage Parallel-Residue-Compensation (PRC) receiver architecture for enhanced suppression of the MAI in CDMA systems. We extract the commonality to avoid the direct Interference Cancellation and reduce the algorithm complexity from O(K²N) to O(KN).
In the second part, scalable VLSI architectures are implemented in an FPGA prototyping system with an efficient Precision-C System-on-Chip (SOC) design methodology. Hardware efficiency is achieved by investigating multi-level parallelism and pipelines. The design of Sum-Sub-MUX Unit (SMU) combinational logic avoids the usage of dedicated multipliers with at least 10X saving in hardware resources. The most area/timing efficient design only uses area similar to the most area constraint architecture but gives at least 4X speedup over a conventional design.
Item Low Power VLSI Architecture for Adaptive MAI Suppression in CDMA Using Multi-stage Convergence Masking Vector (2005-09-01) Guo, Yuanbin; McCain, Dennis; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
In this paper, we propose a novel low power and low complexity multi-stage Parallel-Residue-Compensation (PRC) architecture for enhanced MAI suppression in the CDMA systems. The accuracy of the interference cancellation is improved with a set of weights computed from an adaptive Normalized Least-Mean-Square (NLMS) algorithm. The physical meaning of the complete versus weighted interference cancellation is applied to clip the weights above a certain threshold. Multistage Convergence-Masking-Vector (CMV) is then proposed to combine with the clock gating as a dynamic power management scheme in the VLSI receiver architecture. This reduces the dynamic power consumption in the VLSI architecture by up to 90% with a negligible performance loss.
Item Modeling of Welkin RF in a DSSS System (2000-09-20) Guo, Yuanbin; Center for Multimedia Communications (http://cmc.rice.edu/)
In this report, the author studied the methodology of distortion-true software modeling for RF (radio frequency) in a DSSS W-CDMA system. A complete end-to-end testbed based on the Welkin RF system is built by SystemView with specifications from real world RF components.
This testbed includes digital baseband transmitter, RF transmitter, wireless channel, RF receiver, and baseband receiver for WLAN applications. Several implementation issues will be discussed. Various time/frequency domain analysis methods help to understand the non-linearity of real RF components in each stage and provide a reference to the hardware design. To support the capability of multi-domain simulation, Matlab custom tokens were integrated. The simulation results show the effects of noise figure introduced by RF on the receiver's sensitivity.
Item A Novel Adaptive Pre-Distorter Using LS Estimation of SSPA Non-Linearity in Mobile OFDM Systems (2002-05-01) Guo, Yuanbin; Cavallaro, Joseph R.; Center for Multimedia Communications (http://cmc.rice.edu/)
Several LUT-based pre-distortion schemes using Saleh's model of traveling wave tube (TWT) HPA for compensation of RF non-linearity in various systems have been studied recently. In this paper we propose a novel parameter-based pre-distorter to enhance the power efficiency of mobile OFDM transmitters using solid-state power amplifiers (SSPA). Both the non-linearity and inverse of the non-linearity are estimated as memory-less polynomials using Least Squares (LS) criterion. A fully digital pre-distorter is then constructed with a closed-form inverse non-linearity to reduce the in-band non-linear distortion and out-of-band spectrum re-growth in the transmitter. Simulation results in 16-QAM OFDM show promising performance and simple implementation compared with the LUT method.
Item Parallel Interleaver Architecture with New Scheduling Scheme for High Throughput Configurable Turbo Decoder (IEEE, 2013-05) Wang, Guohui; Vosoughi, Aida; Shen, Hao; Cavallaro, Joseph R.; Guo, Yuanbin
Parallel architecture is required for high throughput turbo decoder to meet the data rate requirements of the emerging wireless communication systems.
However, due to the severe memory conflict problem caused by parallel architectures, the interleaver design has become a major challenge that limits the achievable throughput. Moreover, the high complexity of the interleaver algorithm makes the parallel interleaving address generation hardware very difficult to implement. In this paper, we propose a parallel interleaver architecture that can generate multiple interleaving addresses on-the-fly. We devised a novel scheduling scheme with which we can use more efficient buffer structures to eliminate memory contention. The synthesis results show that the proposed architecture with the new scheduling scheme can significantly reduce memory usage and hardware complexity. The proposed architecture also shows great flexibility and scalability compared to prior work.