Lin, Yingyan
2023-06-15
2023-05
2023-04-21
May 2023
Zhao, Yang. "Marrying Application-Level Opportunities with Algorithm-Hardware Co-Design towards Ubiquitous Edge Intelligence." (2023) Diss., Rice University. https://hdl.handle.net/1911/114913.
https://hdl.handle.net/1911/114913
Artificial Intelligence (AI) algorithms, especially Deep Neural Networks (DNNs), have recently achieved record-breaking performance (i.e., task accuracy) in a wide range of applications. This has motivated a growing demand for bringing powerful AI-powered functionalities to edge devices, such as Virtual Reality/Augmented Reality (VR/AR) and medical devices, towards ubiquitous edge intelligence. On the other hand, the powerful performance of AI algorithms comes with much higher computational complexity and memory storage requirements, which stand at odds with the limited compute/storage resources on edge devices. To close this gap and enable more extensive AI-powered edge intelligence, we advocate harmonizing AI algorithms and dedicated accelerators via algorithm-accelerator co-design and leveraging application-level opportunities to minimize redundant computations in the processing pipeline. First, to tackle the efficiency bottleneck caused by the massive dynamic random-access memory (DRAM) accesses required when accelerating DNNs, we propose an algorithm-accelerator co-design technique called SmartExchange, which trades higher-cost memory storage/accesses for lower-cost computations to boost the acceleration efficiency of both DNN inference and training. In particular, on the algorithm level, we enforce a hardware-friendly DNN weight structure, where only a small basis matrix and a sparse and readily-quantized coefficient matrix need to be stored for each layer, and the remaining majority of weights can be recovered via lower-cost computations.
On the hardware level, we further design a dedicated accelerator that leverages the SmartExchange-enforced algorithm structure to improve both the energy efficiency and processing latency of acceleration. Second, motivated by the promising results achieved by SmartExchange, we explore and develop dedicated algorithm-accelerator co-design techniques for two real-world applications, which are further empowered with application-level opportunities to maximize the achievable efficiency. In particular, we consider two representative and increasingly demanded AI-powered intelligent applications: eye tracking on VR/AR devices, which estimates the gaze directions of human eyes, and cardiac detection on medical implants, which performs intracardiac electrogram (EGM) to electrocardiogram (ECG) conversion (i.e., EGM-to-ECG conversion) on pacemakers. In both applications, we find that there consistently exist application-level opportunities that can be leveraged to largely reduce the computation/data movement redundancy within the processing pipeline. Therefore, we develop a tailored processing pipeline for each application and then pair it with dedicated algorithm-accelerator co-design techniques to further boost the overall system efficiency while maintaining the task performance, as elaborated below. For the eye tracking application, we propose a predict-then-focus pipeline that first extracts regions of interest (ROIs), which cover only 24% (on average) of the original eye images, for gaze estimation to reduce computational redundancy. Additionally, the temporal correlation of eyes across frames is leveraged so that only 5% of the frames require ROI adjustment over time.
On top of these, we develop a dedicated accelerator and integrate both the algorithm and accelerator into a real hardware prototype system, dubbed i-FlatCam, consisting of a lensless camera and a chip prototype fabricated in a 28nm CMOS technology for validation. Real-hardware measurements show that the i-FlatCam system is the first to simultaneously meet all three eye-tracking requirements of next-generation AR/VR devices. After that, we take another big leap by accelerating an eye segmentation-involved pipeline, towards more general eye tracking in AR/VR, where the segmentation result can enable more downstream tasks in addition to eye tracking. The resulting system is called EyeCoD and is validated with a multi-chip hardware prototype setting. For the EGM-to-ECG conversion application, we propose an application-aware processing pipeline in which a precise and more complex conversion algorithm is invoked only in instant response to a detected anomaly (i.e., arrhythmia), and a coarse conversion is activated otherwise to avoid unnecessary computations. Furthermore, we develop a dedicated accelerator called e-G2C, which is tailored for the above processing pipeline to further boost energy efficiency. For evaluation, the e-G2C processor is fabricated in a 28nm CMOS technology and achieves an energy efficiency of 0.14-8.31 μJ/inference, outperforming prior art under similar complexity, enabling real-time detection/conversion, and promising possibly life-critical interventions.
application/pdf
eng
Copyright is held by the author, unless otherwise indicated.
Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
Artificial Intelligence
Deep Neural Network
Hardware acceleration
Algorithm-hardware co-design
Integrated Circuit
Marrying Application-Level Opportunities with Algorithm-Hardware Co-Design towards Ubiquitous Edge Intelligence
Thesis
2023-06-15