Browsing by Author "Lin, Yingyan"
Now showing 1 - 7 of 7
Item: Algorithm-Hardware Co-Design Towards Efficient and Robust Edge Vision Applications (2022-08-12). Fu, Yonggan; Lin, Yingyan.
Recent breakthroughs in deep neural networks (DNNs) and the advent of billions of Internet of Things (IoT) devices have fueled an explosive demand for intelligent IoT devices equipped with domain-specific DNN accelerators. However, deploying DNN-accelerator-enabled intelligent functionality into real-world IoT devices remains particularly challenging. First, powerful DNNs often come with prohibitive complexity, whereas IoT devices operate under stringent resource constraints. Second, DNNs are vulnerable to adversarial attacks, especially on IoT devices exposed to complex real-world environments, yet many IoT applications require strict security. Existing DNN accelerators mostly tackle only one of these two challenges (i.e., efficiency or adversarial robustness) while neglecting or even sacrificing the other. To this end, we propose the 2-in-1 Accelerator, an integrated algorithm-hardware co-design framework that aims to win both the adversarial robustness and the efficiency of DNN accelerators. Specifically, we first propose a Random Precision Switch (RPS) algorithm that can effectively defend DNNs against adversarial attacks by enabling random DNN quantization as an in-situ model switch during training and inference. Furthermore, we propose a new precision-scalable accelerator featuring (1) a new precision-scalable MAC unit architecture that spatially tiles the temporal MAC units to boost both the achievable efficiency and flexibility, and (2) a systematically optimized dataflow searched by our generic accelerator optimizer. Extensive experiments and ablation studies validate that our 2-in-1 Accelerator can not only aggressively boost both the adversarial robustness and the efficiency of DNN accelerators under various attacks, but also naturally support instantaneous robustness-efficiency trade-offs that adapt to varied resources without requiring DNN retraining. We believe our 2-in-1 Accelerator opens up an exciting perspective for robust and efficient accelerator design.
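As a rough illustration of the RPS idea above, the following PyTorch sketch quantizes activations to a randomly drawn precision on every forward pass, acting as an in-situ model switch. The candidate bitwidths, the assumed [0, 1] activation range, and all names here are illustrative assumptions rather than the paper's actual implementation.

```python
# A minimal sketch, assuming candidate bitwidths {4, 6, 8} and activations
# normalized to [0, 1]; names are illustrative, not from the paper's code.
import random
import torch
import torch.nn as nn

CANDIDATE_BITS = [4, 6, 8]  # assumed candidate precisions to switch among

class UniformQuantize(torch.autograd.Function):
    """Uniform fake-quantization with a straight-through gradient estimator."""
    @staticmethod
    def forward(ctx, x, bits):
        scale = (2 ** bits) - 1
        return torch.round(x.clamp(0, 1) * scale) / scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through: pass gradients unchanged

class RandomPrecisionAct(nn.Module):
    """Draws a random precision on every call, switching the model in situ."""
    def forward(self, x):
        return UniformQuantize.apply(x, random.choice(CANDIDATE_BITS))
```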
Item: Automated Deep Learning Algorithm and Accelerator Co-search for Both Boosted Hardware Efficiency and Task Accuracy (2023-04-24). Zhang, Yongan; Lin, Yingyan.
Powerful yet complex deep neural networks (DNNs) have fueled a booming demand for efficient DNN solutions that bring DNN-powered intelligence into numerous applications. Jointly optimizing the networks and their accelerators is promising for delivering optimal performance. However, the great potential of such solutions has yet to be unleashed, due to the challenge of simultaneously exploring the vast, entangled, yet distinct design spaces of the networks and their accelerators. To this end, we propose DIAN, a DIfferentiable Accelerator-Network co-search framework for automatically searching for matched networks and accelerators that maximize both task accuracy and acceleration efficiency. Specifically, DIAN integrates two enablers: (1) a generic design space for DNN accelerators that is applicable to both FPGA- and ASIC-based DNN accelerators and compatible with DNN frameworks such as PyTorch, enabling algorithmic exploration of more efficient DNNs and their accelerators; and (2) a joint DNN network and accelerator co-search algorithm that enables the simultaneous search for optimal DNN structures and for their accelerators' micro-architectures and mapping methods. Experiments and ablation studies based on FPGA measurements and ASIC synthesis show that the matched networks and accelerators generated by DIAN consistently outperform state-of-the-art (SOTA) DNNs and DNN accelerators (e.g., 3.04× better FPS with 5.46% higher accuracy on ImageNet), while requiring notably reduced search time (up to 1234.3×) over SOTA co-exploration methods, when evaluated against ten SOTA baselines on three datasets.

Item: Boosting the Efficiency of Graph Convolutional Networks via Algorithm and Accelerator Co-Design (2022-08-11). You, Haoran; Lin, Yingyan.
Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art graph learning model. However, GCN inference over large graph datasets can be notoriously challenging, limiting GCNs' application to large real-world graphs and hindering the exploration of deeper and more sophisticated GCN models. This is because real-world graphs can be extremely large and sparse. Furthermore, node degrees tend to follow a power-law distribution, yielding highly irregular adjacency matrices that incur prohibitive inefficiencies in both data processing and data movement, substantially limiting the achievable GCN acceleration efficiency. To this end, this paper proposes a GCN algorithm and accelerator co-design framework dubbed GCoD, which can largely alleviate the aforementioned GCN irregularity and boost GCNs' inference efficiency. Specifically, on the algorithm level, GCoD integrates a split-and-conquer GCN training strategy that polarizes the graphs to be either denser or sparser in local neighborhoods without compromising model accuracy, resulting in graph adjacency matrices that (mostly) have merely two levels of workload and enjoy largely enhanced regularity and thus ease of acceleration. On the hardware level, we further develop a dedicated two-pronged accelerator with a separate engine for each of the aforementioned denser and sparser workloads, further boosting the overall utilization and acceleration efficiency. Extensive experiments and ablation studies validate that GCoD consistently reduces the number of off-chip accesses, leading to speedups of 15286×, 294×, 7.8×, and 2.5× over CPUs, GPUs, and the prior-art GCN accelerators HyGCN and AWB-GCN, respectively, while maintaining or even improving task accuracy. Additionally, we visualize GCoD-trained graph adjacency matrices for a better understanding of its advantages.
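To make GCoD's two-level-workload idea concrete, here is a minimal PyTorch sketch that splits nodes by degree into a denser and a sparser group and dispatches each group's aggregation to its own routine, loosely mimicking the two-pronged accelerator. The degree threshold and the two "engines" are assumptions for illustration; GCoD additionally trains the graph itself to polarize neighborhoods, which this sketch does not do.

```python
# A minimal sketch, assuming a degree threshold of 8; the two matmul paths
# stand in for GCoD's dedicated dense and sparse hardware engines.
import torch

def two_pronged_aggregate(adj, feats, degree_threshold=8):
    """Split rows into denser/sparser workloads and aggregate each separately."""
    deg = (adj != 0).sum(dim=1)
    dense_rows = deg >= degree_threshold          # regular, high-degree workload
    sparse_rows = ~dense_rows                     # irregular, low-degree workload
    out = torch.empty_like(feats)
    # "Dense engine": plain dense matmul over the high-degree rows.
    out[dense_rows] = adj[dense_rows] @ feats
    # "Sparse engine": sparse matmul over the low-degree rows.
    out[sparse_rows] = torch.sparse.mm(adj[sparse_rows].to_sparse(), feats)
    return out

adj = (torch.rand(32, 32) > 0.8).float()          # toy adjacency matrix
feats = torch.randn(32, 16)                       # toy node features
out = two_pronged_aggregate(adj, feats)
```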
Item: Marrying Application-Level Opportunities with Algorithm-Hardware Co-Design towards Ubiquitous Edge Intelligence (2023-04-21). Zhao, Yang; Lin, Yingyan.
Artificial Intelligence (AI) algorithms, especially Deep Neural Networks (DNNs), have recently achieved record-breaking performance (i.e., task accuracy) in a wide range of applications. This has motivated a growing demand for bringing powerful AI-powered functionalities into edge devices, such as Virtual Reality/Augmented Reality (VR/AR) and medical devices, towards ubiquitous edge intelligence. On the other hand, the powerful performance of AI algorithms comes with much-increased computational complexity and memory storage requirements, which stand at odds with the limited compute/storage resources on edge devices. To close this gap and enable more extensive AI-powered edge intelligence, we advocate harmonizing AI algorithms and dedicated accelerators via algorithm-accelerator co-design, and leveraging application-level opportunities to minimize redundant computations in the processing pipeline. First, to tackle the efficiency bottleneck caused by the massive dynamic random-access memory (DRAM) accesses required when accelerating DNNs, we propose an algorithm-accelerator co-design technique called SmartExchange, which trades higher-cost memory storage/accesses for lower-cost computations to boost the acceleration efficiency of both DNN inference and training. In particular, on the algorithm level, we enforce a hardware-friendly DNN weight structure in which only a small basis matrix and a sparse, readily quantized coefficient matrix need to be stored for each layer, while the remaining majority of the weights can be recovered through lower-cost computations. On the hardware level, we further design a dedicated accelerator that leverages the SmartExchange-enforced algorithm structure to improve both the energy efficiency and the processing latency of acceleration. Second, motivated by the promising results achieved by SmartExchange, we explore and develop dedicated algorithm-accelerator co-design techniques for two real-world applications, further empowered with application-level opportunities to maximize the achievable efficiency. In particular, we consider two representative and increasingly demanded AI-powered intelligent applications: eye tracking on VR/AR devices, which estimates the gaze directions of human eyes, and cardiac detection on medical implants, which performs intracardiac electrogram (EGM) to electrocardiogram (ECG) conversion (i.e., EGM-to-ECG conversion) on pacemakers. In both applications, we find that there consistently exist application-level opportunities that can be leveraged to largely reduce the computation and data-movement redundancy within the processing pipeline. Therefore, we develop a tailored processing pipeline for each application and then pair it with dedicated algorithm-accelerator co-design techniques to further boost the overall system efficiency while maintaining the task performance, as elaborated below. For the eye tracking application, we propose a predict-then-focus pipeline that first extracts regions-of-interest (ROIs), which on average cover only 24% of the original eye images, for gaze estimation to reduce computational redundancy. Additionally, the temporal correlation of eyes across frames is leveraged so that only 5% of the frames require ROI adjustment over time. On top of these, we develop a dedicated accelerator and integrate both the algorithm and the accelerator into a real hardware prototype system, dubbed i-FlatCam, consisting of a lensless camera and a chip prototype fabricated in a 28nm CMOS technology for validation. Real-hardware measurements show that the i-FlatCam system is the first to simultaneously meet all three eye-tracking requirements of next-generation AR/VR devices. Building on this, we take another big leap and accelerate a segmentation-involved pipeline for more general eye tracking in AR/VR, where the segmentation result can enable more downstream tasks in addition to eye tracking. The resulting system, called EyeCoD, is validated with a multi-chip hardware prototype setting. For the EGM-to-ECG conversion application, we propose an application-aware processing pipeline in which a precise but more complex conversion algorithm is invoked only in instant response to a detected anomaly (i.e., arrhythmia), while a coarse conversion is activated otherwise to avoid unnecessary computations. Furthermore, we develop a dedicated accelerator called e-G2C, tailored to the above processing pipeline, to further boost energy efficiency. For evaluation, the e-G2C processor is fabricated in a 28nm CMOS technology and achieves an energy efficiency of 0.14-8.31 μJ/inference, outperforming prior art under similar complexity while enabling real-time detection/conversion and promising possibly life-critical interventions.
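The SmartExchange weight structure described above can be illustrated numerically: store only a small basis matrix and a sparse coefficient matrix per layer, and recover the full weight matrix as their product on the fly. The shapes and the sparsity threshold below are illustrative assumptions, not the paper's settings.

```python
# A toy numerical sketch, assuming a 64x256 layer, an 8-row basis, and a hard
# magnitude threshold for sparsifying the coefficients.
import torch

out_ch, in_dim, basis_dim = 64, 256, 8

B = torch.randn(basis_dim, in_dim)                      # small basis matrix (stored)
C = torch.randn(out_ch, basis_dim)
C = torch.where(C.abs() > 0.8, C, torch.zeros_like(C))  # sparse coefficients (stored)

W = C @ B                                               # full weights recovered via cheap compute

dense_params = out_ch * in_dim
stored_params = B.numel() + int((C != 0).sum())         # ignoring sparse-index overhead
print(f"stored vs. dense parameters: {stored_params / dense_params:.2%}")
```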
Item: RT-RCG: Neural Network and Accelerator Search Towards Effective and Real-time ECG Reconstruction from Intracardiac Electrograms (ACM, 2022). Zhang, Yongan; Banta, Anton; Fu, Yonggan; John, Mathews M.; Post, Allison; Razavi, Mehdi; Cavallaro, Joseph; Aazhang, Behnaam; Lin, Yingyan.
There exists a gap between the signals provided by pacemakers (i.e., intracardiac electrograms (EGM)) and the signals doctors use to diagnose abnormal rhythms (i.e., 12-lead electrocardiograms (ECG)). Therefore, the former, even if remotely transmitted, are not sufficient for doctors to provide a precise diagnosis, let alone make a timely intervention. To close this gap and take a step towards real-time critical intervention in instant response to irregular and infrequent ventricular rhythms, we propose a new framework dubbed RT-RCG that automatically searches for (1) efficient Deep Neural Network (DNN) structures and then (2) corresponding accelerators, to enable Real-Time and high-quality Reconstruction of ECG signals from EGM signals. Specifically, RT-RCG proposes a new DNN search space tailored for ECG reconstruction from EGM signals, and incorporates a differentiable acceleration search (DAS) engine to efficiently navigate the large and discrete accelerator design space and generate optimized accelerators. Extensive experiments and ablation studies under various settings consistently validate the effectiveness of RT-RCG. To the best of our knowledge, RT-RCG is the first to leverage neural architecture search (NAS) to simultaneously tackle both reconstruction efficacy and efficiency.
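For context on the reconstruction task RT-RCG targets, the following sketch shows a plain 1-D convolutional network mapping EGM channels to 12-lead ECG signals under a reconstruction loss. The channel counts and layer choices are assumptions for illustration; the actual RT-RCG network is discovered by the architecture search rather than hand-designed.

```python
# A minimal sketch, assuming 4 EGM channels, 12 ECG leads, and a fixed
# three-layer 1-D CNN; the real RT-RCG network comes out of the NAS instead.
import torch
import torch.nn as nn

class EGMToECG(nn.Module):
    """Maps a multi-channel EGM segment to a 12-lead ECG segment."""
    def __init__(self, egm_channels=4, ecg_leads=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(egm_channels, 32, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(32, ecg_leads, kernel_size=1),
        )

    def forward(self, egm):          # egm: (batch, channels, time)
        return self.net(egm)

model = EGMToECG()
egm = torch.randn(2, 4, 1024)        # dummy EGM segment
ecg_hat = model(egm)                 # reconstructed 12-lead ECG, same length
loss = nn.functional.mse_loss(ecg_hat, torch.randn_like(ecg_hat))
```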
Item: SACoD: Sensor Algorithm Co-Design Towards Efficient CNN-powered Intelligent PhlatCam (2021-04-30). Wang, Yue; Lin, Yingyan.
There has been a growing demand for integrating Convolutional Neural Network (CNN)-powered functionalities into Internet-of-Things (IoT) devices to enable ubiquitous intelligent "IoT cameras". However, two challenges limit the real-world application of CNN-powered IoT devices. First, some applications, especially medicine- and biology-related ones, impose strict requirements on camera size. Second, powerful CNNs often require a large number of parameters, which translates into considerable computing, storage, and memory-bandwidth demands, whereas IoT devices have only limited resources. PhlatCam, thanks to its potentially orders-of-magnitude reduced form factor, provides a promising solution to the first challenge, while the second remains a bottleneck. Existing compression techniques, which focus merely on the CNN algorithm itself, show some promise yet remain limited. To this end, this work proposes SACoD, a Sensor Algorithm Co-Design framework that enables energy-efficient CNN-powered PhlatCam. In particular, the mask coded in the PhlatCam sensor and the CNN model are jointly optimized in terms of both parameters and architectures based on differentiable neural architecture search. Extensive experiments, including both simulation and actual physical measurements on manufactured masks, show that the proposed SACoD framework achieves aggressive model compression and energy savings while maintaining or even boosting task accuracy, when benchmarked against two state-of-the-art (SOTA) designs on six datasets spanning two different tasks.

Item: TACoS: Transformer and Accelerator Co-Search Towards Ubiquitous Vision Transformer Acceleration (2023-12-06). Puckett, Daniel; Lin, Yingyan.
Recent works have combined pruned Vision Transformer (ViT) models with specialized accelerators to achieve strong accuracy/latency trade-offs in many computer vision tasks. However, adapting these systems to real-world scenarios with specific accuracy, latency, power, and/or area constraints takes a significant amount of expert labor. Automating the design and exploration of these systems is a promising solution but is hampered by two unsolved problems: (1) existing methods for pruning the attention maps of a ViT model involve fully training the model, pruning its attention maps, and then fine-tuning the model, which is infeasible when exploring a design space containing millions of model architectures; and (2) the design space is complicated, and the system's area efficiency, scalability, and data movement suffer because we lack a unified accelerator template that efficiently computes each operation in sparse ViT models. To solve these problems, I propose TACoS: Transformer and Accelerator Co-Search, the first automated method to co-design pruned ViT model and accelerator pairs. TACoS answers the above challenges with (1) a novel ViT search algorithm that simultaneously prunes and fine-tunes many models at many different sparsity ratios, and (2) the first unified ViT accelerator template, which efficiently accelerates each operation in sparse ViT models using adaptable PEs and reconfigurable PE lanes. With these innovations, the TACoS framework quickly and automatically designs state-of-the-art systems for real-world applications and achieves accuracy/latency trade-offs superior to those of hand-crafted ViT models and accelerators.
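As a loose illustration of the kind of attention-map pruning TACoS searches over, this sketch keeps only the top-k attention scores per query and masks the rest before the softmax. The keep ratio stands in for the sparsity knob a search would tune; the exact pruning scheme in TACoS may differ.

```python
# A minimal sketch, assuming top-k pruning per query row; keep_ratio is an
# illustrative sparsity knob, not TACoS's actual parameterization.
import torch

def pruned_attention(q, k, v, keep_ratio=0.5):
    """Scaled dot-product attention keeping only the top-k scores per query."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    n_keep = max(1, int(scores.shape[-1] * keep_ratio))
    kept = scores.topk(n_keep, dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, kept, 0.0)                   # 0 where kept, -inf elsewhere
    return torch.softmax(scores + mask, dim=-1) @ v

q = k = v = torch.randn(1, 4, 16, 32)              # (batch, heads, tokens, dim)
out = pruned_attention(q, k, v, keep_ratio=0.25)
```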