Browsing by Author "Wang, Guohui"
Now showing 1 - 17 of 17
Item: Accelerating Computer Vision Algorithms Using OpenCL Framework on Mobile Devices - A Case Study (IEEE, 2013-06)
Authors: Wang, Guohui; Xiong, Y.; Yun, Jay; Cavallaro, Joseph R.
Recently, general-purpose computing on graphics processing units (GPGPU) has been enabled on mobile devices thanks to emerging heterogeneous programming models such as OpenCL. The capability of GPGPU on mobile devices opens a new era for mobile computing and can enable many computationally demanding computer vision algorithms on mobile devices. As a case study, this paper proposes to accelerate an exemplar-based inpainting algorithm for object removal on a mobile GPU using OpenCL. We discuss the methodology of exploring the parallelism in the algorithm as well as several optimization techniques. Experimental results demonstrate that our optimization strategies for mobile GPUs significantly reduce the processing time and make computationally intensive computer vision algorithms feasible on a mobile device. To the best of the authors' knowledge, this work is the first published implementation of general-purpose computing using OpenCL on mobile GPUs.
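To make the OpenCL workflow concrete, here is a minimal host-side sketch that dispatches a simple RGBA-to-grayscale kernel (a common vision preprocessing step) to a GPU device. It illustrates the general OpenCL dispatch pattern the paper builds on, not the authors' inpainting kernels; the kernel, image size, and weights are illustrative, and error handling is omitted for brevity.

```cpp
// Minimal OpenCL 1.1 host sketch: run an RGBA->gray kernel on a GPU.
// Illustrative only -- not the paper's inpainting code; error checks trimmed.
#include <CL/cl.h>
#include <vector>
#include <cstdio>

static const char* kSrc =
    "__kernel void rgba_to_gray(__global const uchar4* in,\n"
    "                           __global uchar* out) {\n"
    "  size_t i = get_global_id(0);\n"
    "  uchar4 p = in[i];\n"
    "  out[i] = (uchar)((77*p.x + 150*p.y + 29*p.z) >> 8);\n"
    "}\n";

int main() {
  cl_platform_id plat; cl_device_id dev;
  clGetPlatformIDs(1, &plat, nullptr);
  clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
  cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
  cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

  cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
  clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
  cl_kernel k = clCreateKernel(prog, "rgba_to_gray", nullptr);

  const size_t n = 640 * 480;                      // one VGA frame
  std::vector<unsigned char> rgba(4 * n, 128), gray(n);
  cl_mem din  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               rgba.size(), rgba.data(), nullptr);
  cl_mem dout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n, nullptr, nullptr);

  clSetKernelArg(k, 0, sizeof(din), &din);
  clSetKernelArg(k, 1, sizeof(dout), &dout);
  clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
  clEnqueueReadBuffer(q, dout, CL_TRUE, 0, n, gray.data(), 0, nullptr, nullptr);
  printf("gray[0] = %d\n", (int)gray[0]);
  // Resource releases omitted for brevity.
  return 0;
}
```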
Item: Design Space Exploration of Parallel Algorithms and Architectures for Wireless Communication and Mobile Computing Systems (2014-10-30)
Authors: Wang, Guohui; Cavallaro, Joseph R.; Sarkar, Vivek; Zhong, Lin; Juntti, Markku
Over the past several years, modern mobile SoC (system-on-chip) chipsets have begun to incorporate on a single chip the functionality of several general-purpose processors and application-specific accelerators, in order to reduce cost, power consumption, and communication overhead. Given ever-growing performance requirements and strict power constraints, the diversity of signal processing workloads poses challenges for mapping computationally intensive algorithms onto the heterogeneous architecture of mobile SoCs. Many such workloads, including channel decoding for wireless communication modems and mobile computer vision applications, have high computational complexity and require accelerators implemented with parallel algorithms and architectures to meet performance requirements. Partitioning the workloads and deploying them on the appropriate components of mobile chipsets is crucial to fully utilizing the mobile SoC's heterogeneous architecture. The goal of this thesis is to study parallel algorithms and architectures of high-performance signal processing accelerators for several representative application workloads in wireless communication and mobile computing systems. We explore the design space of the parallel algorithms and architectures and highlight workload partitioning and architecture-aware optimization schemes, including algorithmic optimization, data structure optimization, and memory access optimization, to improve throughput performance and hardware (or energy) efficiency. As case studies, we first propose a contention-free interleaver architecture for parallel turbo decoding, which enables a hardware-efficient, high-throughput multi-standard turbo decoding ASIC (application-specific integrated circuit). Second, we propose a massively parallel LDPC (low-density parity-check) decoding algorithm and implementation on a GPU (graphics processing unit), which leads to high-throughput, low-latency LDPC decoding for practical SDR (software-defined radio) systems. Furthermore, we take advantage of heterogeneous mobile CPUs and GPUs to accelerate representative mobile computer vision algorithms such as image editing and local feature extraction. Based on algorithm analysis and experimental results from the above case studies, we finally explore the design space and compare the performance of accelerator architectures for wireless communication and mobile vision use cases. We show that the heterogeneous architecture of mobile systems is the key to efficiently accelerating parallel algorithms to meet the growing requirements of performance, efficiency, and flexibility.

Item: A Fast and Efficient SIFT Detector Using the Mobile GPU (IEEE, 2013-06)
Authors: Rister, Blaine; Wang, Guohui; Wu, Michael; Cavallaro, Joseph R.
Emerging mobile applications, such as augmented reality, demand robust feature detection at high frame rates. We present an implementation of the popular Scale-Invariant Feature Transform (SIFT) feature detection algorithm that incorporates the powerful graphics processing unit (GPU) in mobile devices. Where the usual GPU methods are inefficient on mobile hardware, we propose a heterogeneous dataflow scheme. By methodically partitioning the computation, compressing the data for memory transfers, and taking into account the unique challenges that arise out of the mobile GPU, we achieve a speedup of 4-7x over an optimized CPU version and a 6.4x speedup over a published GPU implementation. Additionally, we reduce energy consumption by 87 percent per image. We achieve near-real-time detection without compromising the original algorithm.

Item: FPGA Prototyping of a High Data Rate LTE Uplink Baseband Receiver (IEEE, 2009-11-01)
Authors: Wang, Guohui; Yin, Bei; Amiri, Kiarash; Sun, Yang; Wu, Michael; Cavallaro, Joseph R.; Center for Multimedia Communication
The Third Generation Partnership Project (3GPP) Long Term Evolution (LTE) standard is becoming the appropriate choice to pave the way for the next generation of wireless and cellular standards. While the popular OFDM technique has been adopted and implemented in previous standards and in the LTE downlink, it suffers from a high peak-to-average power ratio (PAPR). High PAPR requires more sophisticated power amplifiers (PAs) in the handsets and results in lower-efficiency PAs. To combat these effects, the LTE uplink transmission choice is the novel Single-Carrier Frequency-Division Multiple Access (SC-FDMA) scheme, which has lower PAPR due to its inherent signal structure. While it reduces PAPR, SC-FDMA requires a more complicated detector structure in the base station for multi-antenna and multi-user scenarios. Since the multi-antenna and multi-user scenarios are critical parts of the LTE standard for delivering high performance and data rates, it is important to design novel architectures that ensure high reliability and data rates in the receiver. In this paper, we propose a flexible architecture for a high data rate LTE uplink receiver with multiple receive antennas and implement a single-FPGA prototype of this architecture. The architecture is verified on WARPLab (a software-defined radio platform based on the Rice Wireless Open-Access Research Platform) and tested over a real over-the-air indoor channel.
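As a concrete illustration of the PAPR metric that motivates SC-FDMA in the abstract above, the sketch below computes the peak-to-average power ratio of a block of complex baseband samples. The sample values are toy data, not from the paper.

```cpp
// PAPR of a sampled baseband block: peak power over average power, in dB.
// A toy illustration of the metric the SC-FDMA uplink is designed to reduce.
#include <algorithm>
#include <complex>
#include <vector>
#include <cmath>
#include <cstdio>

double papr_db(const std::vector<std::complex<double>>& x) {
  double peak = 0.0, sum = 0.0;
  for (const auto& s : x) {
    double p = std::norm(s);          // |s|^2, the instantaneous power
    peak = std::max(peak, p);
    sum += p;
  }
  return 10.0 * std::log10(peak / (sum / x.size()));
}

int main() {
  // Example: a 4-sample block; values are arbitrary test data.
  std::vector<std::complex<double>> x = {{1, 0}, {0, 1}, {-1, 0}, {3, 0}};
  printf("PAPR = %.2f dB\n", papr_db(x));
}
```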
Item: GPU Accelerated Scalable Parallel Decoding of LDPC Codes (IEEE, 2011-11-01)
Authors: Wang, Guohui; Wu, Michael; Sun, Yang; Center for Multimedia Communication
This paper proposes a flexible low-density parity-check (LDPC) decoder which leverages graphics processing units (GPUs) to provide high decoding throughput. LDPC codes are widely adopted by emerging standards for wireless communication systems and storage applications due to their near-capacity error-correcting performance. To achieve high decoding throughput on the GPU, we exploit the parallelism embedded in the check-node and variable-node computations and propose a strategy for partitioning the decoding jobs among the multiprocessors of the GPU. In addition, we propose a scalable multi-codeword decoding scheme to fully utilize the computation resources of the GPU. Furthermore, we develop a novel adaptive performance-tuning method to make our decoder implementation more flexible and scalable. The experimental results show that our LDPC decoder is scalable and flexible, and that the adaptive performance-tuning method delivers peak performance for the given GPU architecture.
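For readers unfamiliar with the check-node computation such a decoder parallelizes, here is a sequential reference sketch of the common min-sum check-node update: the message toward each participating variable combines the sign product and minimum magnitude of the *other* incoming LLRs. This is a textbook formulation for illustration, not the paper's GPU kernel, which is restructured for multiprocessor partitioning and multi-codeword batching.

```cpp
// Reference min-sum check-node update for one parity check. A GPU decoder
// assigns many of these row-level updates to threads in parallel; this is a
// sequential sketch of the arithmetic only.
#include <algorithm>
#include <vector>
#include <cmath>
#include <cstdio>

std::vector<double> check_node_min_sum(const std::vector<double>& in) {
  const int d = static_cast<int>(in.size());
  std::vector<double> out(d);
  for (int j = 0; j < d; ++j) {
    double sign = 1.0, mn = 1e30;
    for (int i = 0; i < d; ++i) {
      if (i == j) continue;                  // exclude the target edge
      sign *= (in[i] < 0.0) ? -1.0 : 1.0;
      mn = std::min(mn, std::fabs(in[i]));
    }
    out[j] = sign * mn;                      // sign product x min magnitude
  }
  return out;
}

int main() {
  std::vector<double> msgs = {0.9, -1.4, 0.3, 2.2};   // toy incoming LLRs
  for (double m : check_node_min_sum(msgs)) printf("%.2f ", m);
  printf("\n");
}
```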
Item: High-Level Design Tools for Complex DSP Applications (Elsevier, Waltham, MA, 2012-07-12)
Authors: Sun, Yang; Amiri, Kiarash; Wang, Guohui; Yin, Bei; Cavallaro, Joseph R.; Ly, Tai; Center for Multimedia Communication
High-level synthesis design methodology: high-level synthesis (HLS) [1], also known as behavioral synthesis and algorithmic synthesis, is a design process in which a high-level, functional description of a design is automatically compiled into an RTL implementation that meets certain user-specified design constraints. The HLS design description is "high level" compared to RTL in two aspects: design abstraction and specification language.

Item: High-Throughput Contention-Free Concurrent Interleaver Architecture for Multi-Standard Turbo Decoder (IEEE, 2011-09-01)
Authors: Wang, Guohui; Sun, Yang; Cavallaro, Joseph R.; Guo, Yuanbin; Center for Multimedia Communication
To meet the higher data rate requirements of emerging wireless communication technology, numerous parallel turbo decoder architectures have been developed. However, the interleaver has become a major bottleneck that limits the achievable throughput of parallel decoders due to massive memory conflicts. In this paper, we propose a flexible Double-Buffer-based Contention-Free (DBCF) interleaver architecture that can efficiently solve the memory conflict problem for parallel turbo decoders with very high parallelism. The proposed DBCF architecture enables high-throughput concurrent interleaving for multi-standard turbo decoders that support UMTS/HSPA+, LTE, and WiMAX, with small datapath delays and low hardware cost. We implemented the DBCF interleaver in a 65 nm CMOS technology. The implementation of this highly efficient DBCF interleaver architecture shows significant improvement in maximum throughput and occupied chip area compared to previous work.

Item: Highly Scalable On-the-Fly Interleaved Address Generation for UMTS/HSPA+ Parallel Turbo Decoder (24th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2013-06-01)
Authors: Vosoughi, Aida; Wang, Guohui; Shen, Hao; Cavallaro, Joseph R.; Guo, Yuanbin; CMC
High-throughput parallel interleaver design is a major challenge in designing parallel turbo decoders that conform to the high data rate requirements of advanced standards such as HSPA+. The hardware complexity of the HSPA+ interleaver makes it difficult to scale to high degrees of parallelism. We propose a novel algorithm and architecture for on-the-fly parallel interleaved address generation for the UMTS/HSPA+ standard that is highly scalable. Our proposed algorithm generates an interleaved memory address from an original input address without building or storing the complete interleaving pattern; the generated interleaved address can be used directly for interleaved writing to memory blocks. We use an extended Euclidean algorithm for modular multiplicative inversion as a step towards reversing the intra-row permutations of the UMTS/HSPA+ standard. As a result, we can determine interleaved addresses from original addresses. We also propose an efficient and scalable hardware architecture for our method. Our design generates 32 interleaved addresses in one cycle and satisfies the 672 Mbps data rate requirement of HSPA+, while silicon area and frequency are improved compared to recent related works.
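The modular multiplicative inversion step named in the abstract above is the classic extended Euclidean algorithm. Below is a minimal software sketch of that step; the permutation form and constants in the demo are illustrative stand-ins, not the actual 3GPP interleaver parameters.

```cpp
// Modular multiplicative inverse via the extended Euclidean algorithm --
// the kind of inversion used to reverse a multiplicative intra-row
// permutation. Demo values are illustrative, not the 3GPP constants.
#include <cstdint>
#include <cstdio>

// Returns g = gcd(a, b) and sets x, y so that a*x + b*y = g.
int64_t egcd(int64_t a, int64_t b, int64_t& x, int64_t& y) {
  if (b == 0) { x = 1; y = 0; return a; }
  int64_t x1, y1;
  int64_t g = egcd(b, a % b, x1, y1);
  x = y1;
  y = x1 - (a / b) * y1;
  return g;
}

// Inverse of a modulo m, assuming gcd(a, m) == 1.
int64_t mod_inverse(int64_t a, int64_t m) {
  int64_t x, y;
  egcd(((a % m) + m) % m, m, x, y);
  return ((x % m) + m) % m;
}

int main() {
  // If a permutation maps i -> (v * i) mod p, the reverse map multiplies by
  // v^{-1} mod p. Toy values: v = 3, p = 257 (a prime row length).
  int64_t inv = mod_inverse(3, 257);
  printf("3^-1 mod 257 = %lld, check: %lld\n",
         (long long)inv, (long long)(3 * inv % 257));
}
```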
Item: Implementation of a High Throughput 3GPP Turbo Decoder on GPU (Springer, 2011-11-01)
Authors: Wu, Michael; Sun, Yang; Wang, Guohui; Cavallaro, Joseph R.; Center for Multimedia Communication
Turbo codes are computationally intensive channel codes that are widely used in current and upcoming wireless standards. The general-purpose graphics processing unit (GPGPU) is a programmable commodity processor that achieves high computational performance by using many simple cores. In this paper, we present a 3GPP LTE compliant turbo decoder accelerator that takes advantage of the processing power of the GPU to offer fast turbo decoding throughput. Several techniques are used to improve the performance of the decoder. To fully utilize the computational resources of the GPU, our decoder can decode multiple codewords simultaneously, divide the workload for a single codeword across multiple cores, and pack multiple codewords to fit the single-instruction multiple-data (SIMD) instruction width. In addition, we use shared memory judiciously to enable hundreds of concurrent threads while keeping frequently used data local, so that memory access remains fast. To improve the efficiency of the decoder in the high-SNR regime, we also present a low-complexity early termination scheme based on average extrinsic LLR statistics. Finally, we examine how different workload partitioning choices affect the error correction performance and the decoder throughput.
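A sketch in the spirit of the early termination scheme mentioned above: stop iterating once the mean extrinsic LLR magnitude clears a threshold, which at high SNR usually signals a converged codeword. The threshold value and function names here are placeholders, not the paper's tuned constants.

```cpp
// Early-termination sketch: average the extrinsic LLR magnitudes after a
// half-iteration and stop decoding once the statistic exceeds a threshold.
// Threshold and data are illustrative placeholders.
#include <vector>
#include <cmath>
#include <cstdio>

bool should_terminate(const std::vector<float>& extrinsic_llr,
                      float threshold /* e.g. tuned per code rate */) {
  double acc = 0.0;
  for (float v : extrinsic_llr) acc += std::fabs(v);
  return acc / extrinsic_llr.size() > threshold;
}

int main() {
  std::vector<float> llr = {8.5f, -9.1f, 7.8f, -10.2f};  // toy values
  printf("terminate: %s\n", should_terminate(llr, 5.0f) ? "yes" : "no");
}
```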
Item: Low Complexity Opportunistic Decoder for Network Coding (IEEE, 2012-12-01)
Authors: Yin, Bei; Wu, Michael; Wang, Guohui; Cavallaro, Joseph R.; CMC
In this paper, we propose a novel opportunistic decoding scheme for the network coding decoder that significantly reduces decoder complexity and increases throughput. Network coding was proposed to improve network throughput and reliability, especially for multicast transmissions. Although network coding increases network performance, the complexity of the network coding decoding algorithm is still high, especially for higher-dimensional finite fields or larger network codes. Different software and hardware approaches have been proposed to accelerate the decoding algorithm, but the decoder remains the bottleneck for high-speed data transmission. We propose a novel decoding scheme that exploits the structure of the network coding matrix to reduce decoder complexity and improve throughput. We also implemented the proposed scheme on a Virtex-7 FPGA and compared our implementation to the widely used Gaussian elimination approach.
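The Gaussian elimination baseline that this paper compares against works as follows: each received packet carries a vector of coding coefficients, and reducing the coefficient matrix to the identity (applying the same row operations to the payloads) recovers the source packets. The sketch below uses GF(2), where addition is XOR, to keep the example short; the paper also treats larger fields such as GF(2^8), and the payloads here are toy 32-bit words.

```cpp
// Baseline network-coding decoder: Gaussian elimination over GF(2).
// coef[i] holds the coding coefficients of received packet i; pkt[i] is its
// (toy, 32-bit) payload. Reducing coef to the identity recovers the sources.
#include <bitset>
#include <vector>
#include <utility>
#include <cstdint>
#include <cstdio>

constexpr int N = 4;  // number of combined source packets

bool ge_gf2(std::vector<std::bitset<N>>& coef, std::vector<uint32_t>& pkt) {
  for (int col = 0; col < N; ++col) {
    int pivot = -1;
    for (int r = col; r < N; ++r)
      if (coef[r][col]) { pivot = r; break; }
    if (pivot < 0) return false;          // combinations not full rank
    std::swap(coef[pivot], coef[col]);
    std::swap(pkt[pivot], pkt[col]);
    for (int r = 0; r < N; ++r)
      if (r != col && coef[r][col]) {     // XOR-cancel the pivot column
        coef[r] ^= coef[col];
        pkt[r]  ^= pkt[col];
      }
  }
  return true;                            // pkt[] now holds the sources
}

int main() {
  uint32_t s[N] = {0x11, 0x22, 0x33, 0x44};   // toy source packets
  // Four innovative combinations: s0^s1, s1, s2^s3, s3 (bit i = source i).
  std::vector<std::bitset<N>> coef = {
      std::bitset<N>("0011"), std::bitset<N>("0010"),
      std::bitset<N>("1100"), std::bitset<N>("1000")};
  std::vector<uint32_t> pkt = {s[0] ^ s[1], s[1], s[2] ^ s[3], s[3]};
  if (ge_gf2(coef, pkt))
    for (int i = 0; i < N; ++i) printf("src %d = 0x%02x\n", i, pkt[i]);
}
```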
Item: A Massively Parallel Implementation of QC-LDPC Decoder on GPU (IEEE, 2011-06-01)
Authors: Wang, Guohui; Wu, Michael; Sun, Yang; Cavallaro, Joseph R.; Center for Multimedia Communication
The graphics processing unit (GPU) provides a low-cost and flexible software-based multi-core architecture for high-performance computing. However, it is still very challenging to efficiently map real-world applications to the GPU and fully utilize its computational power. As a case study, we present a GPU-based implementation of a real-world digital signal processing (DSP) application: a low-density parity-check (LDPC) decoder. The paper describes our efforts to map the algorithm onto the massively parallel architecture of the GPU and to fully utilize the GPU's computational resources to significantly boost performance. Moreover, several efficient data structures are proposed to reduce memory access latency and memory bandwidth requirements. Experimental results show that the proposed GPU-based LDPC decoding accelerator can take advantage of the multi-core computational power of the GPU and achieve throughput of up to 100.3 Mbps.
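One reason quasi-cyclic (QC) LDPC codes suit compact data structures: the parity check matrix is built from Z x Z cyclically shifted identity blocks, so H can be stored as one small record per nonzero block instead of full rows, cutting memory traffic. The sketch below shows that indexing under these assumptions; it is not the paper's exact data structure, and the block list is toy data.

```cpp
// QC-LDPC structure sketch: H is a grid of Z x Z shifted-identity blocks,
// so storing (block-row, block-col, shift) per nonzero block suffices.
// Illustrative layout, not the paper's exact GPU data structure.
#include <vector>
#include <cstdio>

struct Block { int brow, bcol, shift; };   // one shifted-identity sub-matrix

// Row r of a block checks exactly one variable:
// column = bcol*Z + (shift + r) % Z.
int var_index(const Block& b, int r, int Z) {
  return b.bcol * Z + (b.shift + r) % Z;
}

int main() {
  const int Z = 27;                        // e.g. an 802.11n sub-matrix size
  std::vector<Block> h = {{0, 0, 1}, {0, 3, 25}, {1, 0, 0}};  // toy blocks
  // Variables checked by parity row 5 of block-row 0:
  for (const Block& b : h)
    if (b.brow == 0)
      printf("row 5 checks variable %d\n", var_index(b, 5, Z));
}
```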
Item: Measurement-Based Analysis, Modeling, and Synthesis of the Internet Delay Space for Large Scale Simulation (2006-10-04)
Authors: Zhang, Bo; Ng, T. S. Eugene; Nandi, Animesh; Riedi, Rudolf H.; Druschel, Peter; Wang, Guohui
The characteristics of packet delays among edge networks in the Internet can have a significant impact on the performance and scalability of global-scale distributed systems. Designers rely on simulation to study design alternatives for such systems at scale, which requires an appropriate model of the Internet delay space. The model must preserve the geometry and density distribution of the delay space, which are known, for instance, to influence the effectiveness of self-organization algorithms used in overlay networks. In this paper, we characterize measured delays between Internet edge networks with respect to a set of relevant metrics. We show that existing Internet models differ dramatically from measured delays relative to these metrics. Then, based on measured data, we derive a model of the Internet delay space. The model preserves the relevant metrics, allows for a compact representation, and can be used to synthesize delay data for large-scale simulations. Moreover, specific metrics of the delay space can be adjusted in a principled manner, allowing systems designers to study the robustness of their designs to such variations.

Item: Multi-Layer Parallel Decoding Algorithm and VLSI Architecture for Quasi-Cyclic LDPC Codes (IEEE, 2011-05-01)
Authors: Sun, Yang; Wang, Guohui; Cavallaro, Joseph R.; Center for Multimedia Communication
We propose a multi-layer parallel decoding algorithm and VLSI architecture for decoding structured quasi-cyclic low-density parity-check codes. In the conventional layered decoding algorithm, the block-rows of the parity check matrix are processed sequentially, layer after layer, and the maximum number of rows that can be processed simultaneously is limited to the sub-matrix size. To remove this limitation and support layer-level parallelism, we extend the conventional layered decoding algorithm and architecture to enable simultaneous processing of multiple (K) layers of a parity check matrix, leading to a roughly K-fold throughput increase. As a case study, we have designed a double-layer parallel LDPC decoder for the IEEE 802.11n standard. The decoder was synthesized for a TSMC 45 nm CMOS technology. With a synthesis area of 0.81 mm² and a maximum clock frequency of 815 MHz, the decoder achieves a maximum throughput of 3.0 Gbps at 15 iterations.

Item: On the design principles of network coordinates systems (2008)
Authors: Wang, Guohui; Ng, T. S. Eugene
Since its inception, the concept of network coordinates has been successfully applied to solve a wide variety of problems such as overlay optimization, network routing, network localization, and network modeling. Despite these successes, several practical problems limit the benefits of network coordinates today. First, triangle inequality violations (TIVs) among Internet delays degrade the application performance of network coordinates; how can the impact of TIVs on network coordinate systems be reduced? Second, how can network coordinates be stabilized without losing accuracy in a distributed fashion, so that they can be cached by applications? Third, how can network coordinates be secured such that legitimate nodes' coordinates are not impacted by misbehaving nodes? Although these problems have been discussed extensively, the solutions remain unclear. This thesis presents analytical studies for understanding these problems and reveals several new findings: (1) analysis of existing Internet delay measurements demonstrates the irregular behavior of TIVs among Internet delays, which implies the difficulty of modeling TIVs; (2) a new TIV alert mechanism can identify the edges causing severe TIVs and reduce the impact of TIVs on network coordinates; (3) a new model of coordinate stabilization based on error elimination can achieve stability without hurting accuracy, and a novel algorithm based on this model is presented; (4) recently proposed statistical detection mechanisms cannot achieve an acceptable level of security against aggressive attacks; and (5) an accountability protocol can completely protect coordinate computation, and a TIV alert detection mechanism can effectively protect network coordinates against delay attacks. These findings offer guidelines for the design and application of network coordinates systems.

Item: Optics and virtualization as data center network infrastructure (2012)
Authors: Wang, Guohui; Ng, T. S. Eugene
The emerging cloud services have motivated a fresh look at the design of data center network infrastructure in multiple layers. To transfer the huge amount of data generated by many data-intensive applications, the data center network has to be fast, scalable, and power-efficient. To support flexible and efficient sharing in cloud services, service providers deploy a virtualization layer as part of the data center infrastructure. This thesis explores the design and performance analysis of data center network infrastructure in both the physical network and the virtualization layer. On the physical network design front, we present a hybrid packet/circuit switched network architecture which uses circuit-switched optics to augment traditional packet-switched Ethernet in modern data centers. We show that this technique has substantial potential to improve bisection bandwidth and application performance in a cost-effective manner. To push the adoption of optical circuits in real cloud data centers, we further explore and address the circuit control issues in shared data center environments. On the virtualization layer, we present an analytical study of the network performance of virtualized data centers. Using Amazon EC2 as an experimental platform, we quantify the impact of virtualization on network performance in a commercial cloud. Our findings provide valuable insights to cloud users moving legacy applications into the cloud and to service providers improving the virtualization infrastructure to support better cloud services.

Item: Parallel Interleaver Architecture with New Scheduling Scheme for High Throughput Configurable Turbo Decoder (IEEE, 2013-05)
Authors: Wang, Guohui; Vosoughi, Aida; Shen, Hao; Cavallaro, Joseph R.; Guo, Yuanbin
A parallel architecture is required for a high-throughput turbo decoder to meet the data rate requirements of emerging wireless communication systems. However, due to the severe memory conflict problem caused by parallel architectures, interleaver design has become a major challenge that limits the achievable throughput. Moreover, the high complexity of the interleaver algorithm makes parallel interleaving address generation hardware very difficult to implement. In this paper, we propose a parallel interleaver architecture that can generate multiple interleaving addresses on the fly. We devise a novel scheduling scheme that enables more efficient buffer structures to eliminate memory contention. Synthesis results show that the proposed architecture with the new scheduling scheme significantly reduces memory usage and hardware complexity. The proposed architecture also shows great flexibility and scalability compared to prior work.

Item: Parallel Nonbinary LDPC Decoding on GPU (IEEE, 2012-12-01)
Authors: Wang, Guohui; Shen, Hao; Yin, Bei; Wu, Michael; Sun, Yang; Cavallaro, Joseph R.
Nonbinary low-density parity-check (LDPC) codes are a class of error-correcting codes constructed over the Galois field GF(q) for q > 2. As extensions of binary LDPC codes, nonbinary LDPC codes can provide better error-correcting performance when the code length is short or moderate, but at the cost of higher decoding complexity. This paper proposes a massively parallel implementation of a nonbinary LDPC decoding accelerator based on a graphics processing unit (GPU) to achieve both great flexibility and scalability. The implementation maps the Min-Max decoding algorithm to the GPU's massively parallel architecture. We highlight the methodology of partitioning the decoding task across a heterogeneous platform consisting of a CPU and a GPU. The experimental results show that our GPU-based implementation achieves high throughput while still providing great flexibility and scalability.
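To illustrate the Min-Max check-node computation the last abstract refers to, below is a naive reference sketch for a degree-3 check over GF(2^m) with unit edge coefficients, so field addition reduces to XOR. That is a simplifying assumption for illustration; real codes use nonbinary coefficients, and the paper's GPU kernel is organized very differently.

```cpp
// Naive Min-Max check-node update, degree-3 check, GF(2^m), unit
// coefficients. Messages are in the "min-domain": entry [a] is the cost of
// symbol a, with 0 at the likeliest symbol. The message toward one edge
// scans all symbol pairs (b, c) on the other two edges that satisfy the
// check: out[a] = min over b^c == a of max(in1[b], in2[c]).
// A q^2 reference sketch, not the paper's restructured GPU kernel.
#include <algorithm>
#include <vector>
#include <cstdio>

std::vector<float> min_max_cn(const std::vector<float>& in1,
                              const std::vector<float>& in2) {
  const int q = static_cast<int>(in1.size());
  std::vector<float> out(q, 1e30f);
  for (int b = 0; b < q; ++b)
    for (int c = 0; c < q; ++c) {
      int a = b ^ c;                        // symbols XOR-ing to a
      out[a] = std::min(out[a], std::max(in1[b], in2[c]));
    }
  return out;
}

int main() {
  // GF(4) toy messages: symbol 0 is likeliest on both incoming edges.
  std::vector<float> m1 = {0.0f, 2.1f, 3.0f, 4.5f};
  std::vector<float> m2 = {0.0f, 1.2f, 0.8f, 2.6f};
  for (float v : min_max_cn(m1, m2)) printf("%.1f ", v);
  printf("\n");
}
```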