R-3 Repository :: Browsing by Author "Cavallaro, Joseph"



Decentralized Baseband Processing for Massive MU-MIMO Systems
(2019-04-19) Li, Kaipeng; Cavallaro, Joseph
Achieving high spectral efficiency in realistic massive multi-user (MU) multiple-input multiple-output (MIMO) wireless systems requires computationally complex algorithms for data detection in the uplink (users transmit to the base-station) and beamforming in the downlink (the base-station transmits to users). Most existing algorithms are designed to be executed on centralized computing hardware at the base-station (BS), which results in prohibitive complexity for systems with hundreds or thousands of antennas and generates raw baseband data rates that exceed the limits of current interconnect technology and chip I/O interfaces. This thesis proposes novel decentralized baseband processing architectures that alleviate these bottlenecks by partitioning the BS antenna array into clusters, each associated with independent radio-frequency chains, analog and digital modulation circuitry, and computing hardware. For those decentralized architectures, we develop novel decentralized data detection and beamforming algorithms that only access local channel-state information and require low communication bandwidth among the clusters. We first propose a decentralized consensus-sharing architecture. With this architecture, each cluster performs local baseband processing in parallel and then shares its local results, with only a small amount of data transfer, to compute a global consensus at a centralized processing element; the consensus is then broadcast to each cluster for another round of local updates. After a few rounds of local updates and consensus sharing, a converged global consensus result is obtained. Under this architecture, we solve uplink data detection and downlink beamforming problems using the alternating direction method of multipliers (ADMM) and conjugate gradient methods in a decentralized manner, and show error-rate performance with minimal loss compared to centralized solutions.
To reduce the data transfer latency across clusters, we further propose a decentralized feedforward architecture that only requires one-shot message passing among clusters to arrive at global detection or beamforming results. With this architecture, we develop multiple variations of detection and beamforming algorithms with non-linear or linear local solvers, and with partially or fully decentralized schemes, that realize trade-offs between error-rate performance, computational complexity, and interconnect bandwidth. To evaluate the hardware efficiency of our proposed methods, we implement the above decentralized detection and beamforming algorithms on multi-GPU systems using parallel and distributed programming techniques to optimize the data rate performance. Our implementations achieve less than 1 ms latency and over 1 Gbps data throughput on a high-end multi-GPU platform, and demonstrate high scalability to support hundreds to thousands of antennas for massive MU-MIMO systems.
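As a rough illustration of the one-shot feedforward idea, the sketch below lets each antenna cluster compute only local statistics (a Gram matrix and a matched-filter vector) from its own channel rows, and lets a fusion node sum them once and solve a linear MMSE system. This is an assumption about one possible linear, partially decentralized scheme and is not taken from the thesis; all function names and the toy channel values are hypothetical.

```python
# Hypothetical sketch of one-shot (feedforward) linear detection.
# The function names and the fusion rule are illustrative assumptions.

def local_statistics(H_c, y_c):
    """Per-cluster work: Gram matrix H_c^H H_c and matched filter H_c^H y_c."""
    users = len(H_c[0])
    gram = [[sum(row[i].conjugate() * row[j] for row in H_c)
             for j in range(users)] for i in range(users)]
    mf = [sum(row[i].conjugate() * y for row, y in zip(H_c, y_c))
          for i in range(users)]
    return gram, mf


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0j] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def fuse_and_detect(stats, noise_var):
    """Fusion node: sum per-cluster statistics once, then solve (G + N0 I) x = z."""
    users = len(stats[0][1])
    G = [[sum(g[i][j] for g, _ in stats) for j in range(users)]
         for i in range(users)]
    z = [sum(m[i] for _, m in stats) for i in range(users)]
    for i in range(users):
        G[i][i] += noise_var
    return solve(G, z)


# Toy, noise-free example: 4 BS antennas, 2 users, 2 clusters of 2 antennas.
H = [[complex(v) for v in row] for row in
     [[1 + 1j, 0.5], [0.2j, 1], [1, -1j], [0.3, 0.7 + 0.2j]]]
x_true = [1 + 1j, -1 + 1j]
y = [sum(h * x for h, x in zip(row, x_true)) for row in H]

clusters = [(H[:2], y[:2]), (H[2:], y[2:])]
stats = [local_statistics(H_c, y_c) for H_c, y_c in clusters]
x_hat = fuse_and_detect(stats, noise_var=0.0)
```

Because the Gram matrix and matched-filter vector are sums over antennas, adding the per-cluster statistics reproduces the centralized linear estimate while each cluster only exchanges a small matrix and vector whose size depends on the number of users, not on the number of antennas.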
GPU Accelerated Reconfigurable Detector and Precoder for Massive MIMO SDR Systems
(2015-12-02) Li, Kaipeng; Cavallaro, Joseph; Aazhang, Behnaam; Zhong, Lin
We present a reconfigurable GPU-based unified detector and precoder for massive MIMO software-defined radio systems. To enable high throughput, we implement the linear minimum mean square error detector/precoder and further reduce the algorithm complexity by numerical approximation without sacrificing the error-rate performance. For efficient GPU implementation, we explore the algorithm's inherent parallelism and take advantage of the GPU's numerous computing cores and hierarchical memories to optimize kernel computations. Furthermore, we perform multi-stream scheduling and multi-GPU workload deployment to pipeline multiple detection or precoding tasks on GPU streams, reducing host-device memory copy overhead. The flexible design supports both detection and precoding and can switch between a Cholesky-based mode and a conjugate-gradient-based mode to trade off accuracy against complexity. The GPU implementation exceeds 250 Mb/s detection and precoding throughput for a 128x16 antenna system.
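To make the accuracy/complexity knob of a conjugate-gradient mode concrete, here is a minimal CPU-only sketch that solves a Hermitian positive-definite system A x = b (for example, A = H^H H + N0 I in linear MMSE detection) with a fixed iteration budget. It is an illustrative assumption, not the GPU kernel described in the abstract; the function names and the 2x2 test matrix are hypothetical.

```python
# Hypothetical sketch: textbook conjugate gradient for a Hermitian
# positive-definite system. Fewer iterations cost less but are less accurate.

def matvec(A, v):
    """Matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def vdot(u, v):
    """Inner product u^H v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def conjugate_gradient(A, b, iterations):
    x = [0j] * len(b)
    r = list(b)                      # residual b - A x for the zero start
    p = list(r)                      # first search direction
    rs_old = vdot(r, r).real
    for _ in range(iterations):
        if rs_old < 1e-24:           # already converged; avoid dividing by ~0
            break
        Ap = matvec(A, p)
        alpha = rs_old / vdot(p, Ap).real
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = vdot(r, r).real
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small Hermitian positive-definite example.
A = [[4 + 0j, 1 + 1j],
     [1 - 1j, 3 + 0j]]
x_ref = [1 + 2j, -1j]
b = matvec(A, x_ref)
x_cg = conjugate_gradient(A, b, iterations=5)
```

An exact factorization such as Cholesky pays a fixed cost for a full-accuracy solve, whereas the iteration count here can be lowered to save computation at the price of a larger residual.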
High-speed Track and Hold Amplifiers in CMOS for Enabling Pulse-based Direct Modulation, Secure Communication and Precision Localization
(2015-08-06) Aggrawal, Himanshu; Babakhani, Aydin; Cavallaro, Joseph; Mittleman, Daniel
The last few decades have seen a strong desire for fast communication links that has shaped the evolution of high-speed circuits and silicon-based technology. This desire, together with a large consumer market, has fueled the development of ever-shrinking, faster technology nodes. These advanced nodes open doors for designers to develop new ways of transferring data with unprecedented speed and accuracy. There are a number of challenges in building high-speed, secure communication links. One is the lack of fast Analog to Digital Converters (ADCs), which form the front end of a receiver. Even in advanced technology nodes, the leakage in the transmission gate due to parasitic source-drain capacitance provides an alternate path for signals to pass, thus lowering the performance of the ADCs at high frequencies. Second, current communication schemes use beam-forming or Direct Antenna Modulation (DAM) to narrow the information beam and point it in the direction of communication. Such techniques still have a wide information beam compared to pulse-based directional modulation, as discussed in this thesis. In this dissertation, we address the issue of parasitic leakage in the transmission gate of a fast sampler by introducing active cancellation. A track-and-hold amplifier with active cancellation is designed and fabricated in 45 nm CMOS SOI technology, and it can operate at 40 GSample/second in real time. In addition, we study a pulse-based directional modulation scheme that can be used for secure communication, imaging, and localization. Two coherent pulse generators with pulse widths of less than 200 ps were used to attain an information beamwidth of less than 1 degree and to localize objects with millimeter accuracy.
Linkify: A Web-Based Collaborative Content Tagging System for Machine Learning Algorithms
(2014-12-03) Soares, Dante Mattos de Salles; Baraniuk, Richard; Cavallaro, Joseph; Burrus, C. Sidney
Automated tutoring systems that use machine learning algorithms are a relatively new development which promises to revolutionize education by providing students on a large scale with an experience that closely resembles one-on-one tutoring. Machine learning algorithms are essential for these systems, as they are able to perform, with fairly good results, certain data processing tasks that have usually been considered difficult for artificial intelligence. However, the high performance of several machine learning algorithms relies on the existence of information about what is being processed in the form of tags, which have to be manually added to the content. Therefore, there is a strong need today for tagged educational resources. Unfortunately, tagging can be a very time-consuming task. Proven strategies for the mass tagging of content already exist: collaborative tagging systems, such as Delicious, StumbleUpon and CiteULike, have been growing in popularity in recent years. These websites allow users to tag content and browse previously tagged content that is relevant to the user’s interests. However, attempting to apply this particular strategy towards educational resource tagging presents several problems. Tags for educational resources to be used in tutoring systems need to be highly accurate, as mistakes in recommending or assigning material to students can be very detrimental to their learning, so ideally subject-matter experts would perform the resource tagging. The issue with hiring experts is that they can sometimes be not only scarce but also expensive, therefore limiting the number of resources that could potentially be tagged. Even if non-experts are used, another issue arises from the fact that a large user base would be required to tag large amounts of resources, and acquiring large numbers of users can be a challenge in itself. 
To solve these problems, we present Linkify, a system that allows the more accurate tagging of large amounts of educational resources by combining the efforts of users with certain existing machine learning algorithms that are also capable of tagging resources. This thesis will discuss Linkify in detail, presenting its database structure and components, and discussing the design choices made during its development. We will also discuss a novel model for tagging errors based on a binary asymmetric channel. From this model, we derive an EM algorithm which can be used to combine tags entered into the Linkify system by multiple users and machine learning algorithms, producing the most likely set of relevant tags for each given educational resource. Our goal is to enable automated tutoring systems to use this tagging information in the future in order to improve their capability of assessing student knowledge and predicting student performance. At the same time, Linkify’s standardized structure for data input and output will facilitate the development and testing of new machine learning algorithms.
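The sketch below shows one generic way an EM algorithm can fuse binary tag votes when each tagger is modeled as a binary asymmetric channel, with separate false-positive and false-negative rates. It is an assumption about the general form of such a model, not the derivation from the thesis; the function name, the initial rates, and the toy vote matrix are hypothetical.

```python
# Hypothetical sketch: EM over a binary asymmetric channel per tagger.
# votes[i][k] is 1 if tagger k applied the tag to resource i, else 0.

def em_fuse_tags(votes, iterations=20, eps=1e-6):
    n_items, n_taggers = len(votes), len(votes[0])
    prior = 0.5                     # P(tag is truly relevant)
    fp = [0.2] * n_taggers          # P(vote = 1 | tag not relevant)
    fn = [0.2] * n_taggers          # P(vote = 0 | tag relevant)
    post = [0.5] * n_items          # P(tag relevant | votes) per resource

    def clamp(p):
        return min(max(p, eps), 1.0 - eps)

    for _ in range(iterations):
        # E-step: posterior relevance of the tag for each resource.
        for i, row in enumerate(votes):
            like1, like0 = prior, 1.0 - prior
            for k, v in enumerate(row):
                like1 *= (1.0 - fn[k]) if v else fn[k]
                like0 *= fp[k] if v else (1.0 - fp[k])
            post[i] = like1 / (like1 + like0)
        # M-step: re-estimate the prior and each tagger's error rates.
        total1 = sum(post)
        total0 = n_items - total1
        prior = clamp(total1 / n_items)
        for k in range(n_taggers):
            missed = sum(p for p, row in zip(post, votes) if row[k] == 0)
            false_alarm = sum(1.0 - p for p, row in zip(post, votes)
                              if row[k] == 1)
            fn[k] = clamp(missed / max(total1, eps))
            fp[k] = clamp(false_alarm / max(total0, eps))
    return post, fp, fn


# Toy data: taggers 0 and 1 agree everywhere; tagger 2 disagrees on two items.
votes = [[1, 1, 1],
         [1, 1, 0],
         [0, 0, 0],
         [0, 0, 1],
         [1, 1, 1],
         [0, 0, 0]]
post, fp, fn = em_fuse_tags(votes)
relevant = [p > 0.5 for p in post]
```

In this toy run the posterior follows the two consistent taggers, and the estimated error rates of the third tagger come out higher, which is how the fused result can weight human and machine taggers by their inferred reliability.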
RT-RCG: Neural Network and Accelerator Search Towards Effective and Real-time ECG Reconstruction from Intracardiac Electrograms
(ACM, 2022) Zhang, Yongan; Banta, Anton; Fu, Yonggan; John, Mathews M.; Post, Allison; Razavi, Mehdi; Cavallaro, Joseph; Aazhang, Behnaam; Lin, Yingyan
There exists a gap in terms of the signals provided by pacemakers (i.e., intracardiac electrogram (EGM)) and the signals doctors use (i.e., 12-lead electrocardiogram (ECG)) to diagnose abnormal rhythms. Therefore, the former, even if remotely transmitted, are not sufficient for doctors to provide a precise diagnosis, let alone make a timely intervention. To close this gap and make a heuristic step towards real-time critical intervention in instant response to irregular and infrequent ventricular rhythms, we propose a new framework dubbed RT-RCG to automatically search for (1) efficient Deep Neural Network (DNN) structures and then (2) corresponding accelerators, to enable Real-Time and high-quality Reconstruction of ECG signals from EGM signals. Specifically, RT-RCG proposes a new DNN search space tailored for ECG reconstruction from EGM signals and incorporates a differentiable acceleration search (DAS) engine to efficiently navigate over the large and discrete accelerator design space to generate optimized accelerators. Extensive experiments and ablation studies under various settings consistently validate the effectiveness of our RT-RCG. To the best of our knowledge, RT-RCG is the first to leverage neural architecture search (NAS) to simultaneously tackle both reconstruction efficacy and efficiency.
ShuFFLE: Automated Framework for HArdware Accelerated Iterative Big Data Analysis
(2014-10-22) Mohammadgholi Songhori, Ebrahim; Koushanfar, Farinaz; Baraniuk, Richard; Cavallaro, Joseph
This thesis introduces ShuFFLE, a set of novel methodologies and tools for automated analysis and hardware acceleration of large and dense (non-sparse) Gram matrices. Such matrices arise in most contemporary data mining; they are hard to handle because of the complexity of known matrix transformation algorithms and the inseparability of non-sparse correlations. ShuFFLE learns the properties of the Gram matrices and their rank for each particular application domain. It then utilizes the underlying properties for reconfiguring accelerators that scalably operate on the data in that domain. The learning is based on new factorizations that work at the limit of the matrix rank to optimize the hardware implementation by minimizing costly off-chip memory accesses as well as I/O interactions. ShuFFLE also provides users with a new Application Programming Interface (API) to implement a customized iterative least squares solver for analyzing big and dense matrices in a scalable way. This API is readily integrated within the Xilinx Vivado High Level Synthesis tool to translate the user's code to Hardware Description Language (HDL). As a case study, we implement the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) as an l1-regularized least squares solver. Experimental results show that, during FISTA computation on a Field-Programmable Gate Array (FPGA) platform, ShuFFLE attains a 1800x iteration speed improvement compared to the conventional solver and about a 24x improvement compared to our factorized solver on a general-purpose processor with SSE4 architecture, for a Gram matrix with 4.6 billion non-zero elements.
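For readers unfamiliar with FISTA, the following is a minimal software sketch of the standard algorithm for the l1-regularized least squares problem, minimizing 0.5 * ||A x - b||^2 + lam * ||x||_1. It is a plain reference version and says nothing about the ShuFFLE API, its factorizations, or the FPGA implementation; the function names and the step-size bound are illustrative assumptions.

```python
import math

# Hypothetical sketch: textbook FISTA for
#   minimize 0.5 * ||A x - b||^2 + lam * ||x||_1


def soft_threshold(v, t):
    """Shrink each entry toward zero by t (proximal step for the l1 norm)."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]


def fista(A, b, lam, iterations):
    n = len(A[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of
    # A^T A, so 1 / L is a safe (if conservative) gradient step size.
    L = sum(a * a for row in A for a in row)
    x = [0.0] * n
    y = list(x)                      # extrapolated point
    t = 1.0                          # momentum parameter
    for _ in range(iterations):
        residual = [sum(a * yi for a, yi in zip(row, y)) - bi
                    for row, bi in zip(A, b)]
        grad = [sum(A[i][j] * residual[i] for i in range(len(A)))
                for j in range(n)]
        x_new = soft_threshold([yi - g / L for yi, g in zip(y, grad)],
                               lam / L)
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [xn + ((t - 1.0) / t_new) * (xn - xo)
             for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x


# With an identity A, the minimizer is simply b soft-thresholded by lam.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [3.0, -0.5, 0.2]
x_sparse = fista(A, b, lam=1.0, iterations=200)
```

In the identity case the iterates approach [2, 0, 0], which matches the closed-form soft-thresholded solution and gives a quick sanity check before trying a dense matrix.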
Sub-Band Digital Predistortion for Noncontiguous Carriers: Implementation and Testing
(2016-04-22) Tarver, Chance A; Cavallaro, Joseph
To cope with increasing data-rate demands and spectrum scarcity, non-contiguous transmission schemes are becoming more popular. However, the non-contiguous carriers of such schemes intermodulate due to the nonlinear nature of power amplifiers (PAs). This may cause emissions that interfere with nearby channels or with one's own receiver in a frequency division duplexing transceiver. We implement a low-complexity, sub-band, block-adaptive, digital predistortion (DPD) solution that corrects the distortion in a real PA. Using WARPLab we correct up to ninth-order nonlinearities, and using a real-time FPGA design we correct up to third-order nonlinearities. This is done by targeting the most problematic spurious distortion components at the PA output and performing least mean squares training to adapt an inverse spur to inject. This sub-band method reduces processing complexity compared to full-band predistortion solutions. Using these techniques, we are able to suppress spurious emissions in WARP by over 20 dB.
TinyGarble: Efficient, Scalable, and Versatile Privacy-Preserving Computation Through Sequential Garbled Circuit
(2017-04-20) Mohammadgholi Songhori, Ebrahim; Koushanfar, Farinaz; Cavallaro, Joseph
Privacy-preserving computation is a standing challenge central to several modern-world applications which require computing on sensitive data. Secure Function Evaluation (SFE) refers to provably secure techniques aiming to address this problem by enabling multiple parties to compute an arbitrary function jointly on their private inputs. The most promising two-party SFE method is the Garbled Circuit (GC) protocol introduced by Andrew Yao. The protocol relies on representing the function as a Boolean circuit and encrypting/communicating at the logic gate level. Despite several significant improvements in GC, the efficiency, scalability, and ease of use of the available methods are limited by the naive circuit representation as a directed acyclic graph, ad-hoc logic optimizations, and custom compilers. In this thesis, we propose a holistic solution to enhance the efficiency, scalability, and simplicity of the GC protocol. Our approach has three main pillars to address these key challenges: GC synthesis, sequential GC, and garbled processor. The GC synthesis is a novel automated methodology based on logic synthesis techniques for generating optimized Boolean circuits for the GC protocol. Using sequential GC, we achieve an unprecedented level of compactness and scalability using sequential circuit descriptions. We combine GC synthesis and sequential GC in an open-source framework called TinyGarble. The preliminary implementation of benchmark functions using TinyGarble demonstrates a high degree of memory-footprint compactness as well as improvement in overall efficiency compared to results of existing tools. Our sequential description also enables us, for the first time, to design and realize a garbled processor to reduce the problem of private function evaluation to a conventional SFE problem. In addition, the garbled processor allows users to develop SFE applications in high-level languages (e.g., C) and eliminates the need for Boolean circuit generation.
We present ARM2GC, a garbled processor framework based on TinyGarble and the ARM processor. It allows users to develop GC applications using high-level programming languages with comparable efficiency to the best previous results. The primary enabler to make this construction practical and efficient is the introduction of SkipGate, a new algorithm that omits the communication cost of a Boolean gate when its output is independent of the private data. Benchmark evaluations demonstrate efficiency and usability of ARM2GC compared with the prior art in high-level GC compilation.
