Desai, Aditya. Elastic Parameter Memory for Efficient Machine Learning. PhD diss., Rice University, August 2024. Advisor: Anshumali Shrivastava. https://hdl.handle.net/1911/117787

Standard machine learning (ML) models are known to contain redundancies. We have repeatedly seen that ML models can be sparsified or quantized, and that low-rank components can replace parts of a model, often without affecting its quality. Extracting these redundancies is not just a curiosity but a necessity in an era of ever-increasing model sizes and exorbitant costs to train and deploy ML models. Sparsity, quantization, and low-rank methods have been the key themes at the core of many efficiency approaches proposed over the past five years. However, as we show in this thesis, these methods are limited in the efficiency they can deliver. In this work, we introduce a novel approach, Elastic Parameter Memory (EM), which repurposes traditionally data-consuming probabilistic algorithms and data structures for the learning setting, where we learn compact representations of ML models. EM is an example of the confluence of probabilistic algorithms and data structures with ML, which opens up new research areas and unlocks the potential to push the efficiency frontiers of ML. The core idea in EM is hashing-based weight retrieval, which enables parameter-space multiplexing.

The majority of this thesis is about developing EM by solving critical issues motivated by the practical application of EM to real systems. Our contributions are multifold. (1) Memory-bandwidth-efficient hash functions: the randomized hash functions that give probabilistic algorithms and data structures their accuracy guarantees also cause severe cache-performance deterioration in EM. In the process of developing cache-efficient hash functions, we discover a new class of hash functions that is not only cache-efficient but also strictly better than standard hash functions for projection. (2) Parameter multiplexing: to optimize parameter efficiency, we devise a memory-multiplexing approach in which all modules of the model share the same parameter space. (3) Stability of training: we show that naively using EM can lead to unstable convergence, and we devise a gradient-scaling mechanism that provably removes this instability. (4) Optimal parameter usage in EM: we devise hash functions that optimally use the parameters in EM without compromising the quality or cache efficiency of EM.

We also theoretically and empirically contrast and combine EM with popular efficiency approaches. We show that, in terms of parameter-memory efficiency, EM is theoretically strictly better than the popular sparsity approach. While the analysis is restricted to linear models, the results carry over to deep learning, as confirmed by rigorous empirical evaluation. Quantization is a sharper compression technique: while it maintains most of the accuracy at lower compression levels, its quality deteriorates faster at higher compressions, and it cannot provide more than $16\times$ compression (for 16-bit precision). However, quantization can be combined with EM to obtain efficiency not demonstrated by either method alone; we show this theoretically in the dimensionality-reduction setup. Furthermore, we find that with particular choices of hash functions, EM can even reduce the computational workload of machine learning.
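To make the hashing-based weight retrieval idea concrete, the following is a minimal, illustrative Python sketch rather than the thesis's actual implementation: a layer's virtual weight matrix is never stored; each virtual weight is instead fetched from a small shared parameter bank through a hash of its index. The class name HashedLinear, the simple multiply-mod hash, and the bank_size parameter are assumptions introduced here for illustration; the cache-efficient and parameter-optimal hash functions developed in the thesis are more involved.

import numpy as np

class HashedLinear:
    """Illustrative hashing-based weight retrieval: the virtual
    (out_dim x in_dim) weight matrix is never materialized; each virtual
    weight is looked up in a small shared parameter bank via a hash of
    its flat index."""

    def __init__(self, in_dim, out_dim, bank_size, seed=0):
        rng = np.random.default_rng(seed)
        # Shared parameter bank: the only trainable memory for this layer.
        self.bank = rng.normal(0.0, 0.01, size=bank_size)
        self.bank_size = bank_size
        # Random odd multiplier for a simple multiply-mod hash (illustration only).
        self.a = rng.integers(1, 2**31, dtype=np.int64) | 1
        self.in_dim, self.out_dim = in_dim, out_dim

    def _hash(self, idx):
        # Map flat virtual weight indices to slots in the parameter bank.
        return (idx * self.a % (2**31 - 1)) % self.bank_size

    def weight(self):
        # Materialize the virtual weight matrix on the fly from the bank.
        flat_idx = np.arange(self.out_dim * self.in_dim, dtype=np.int64)
        return self.bank[self._hash(flat_idx)].reshape(self.out_dim, self.in_dim)

    def forward(self, x):
        return x @ self.weight().T

# Usage: a 512 -> 256 layer (131,072 virtual weights) backed by only 4,096 parameters.
layer = HashedLinear(in_dim=512, out_dim=256, bank_size=4096)
x = np.random.randn(8, 512)
y = layer.forward(x)
print(y.shape)  # (8, 256)

In this sketch, the 131,072 virtual weights are backed by 4,096 trainable parameters, a $32\times$ reduction, and several layers could hash into the same bank, which is one way to read the parameter-space multiplexing described above.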
We additionally explore how EM can provide a single backbone for heterogeneous model training, where different-sized models are deployed on different systems, and we show applications in federated learning. We demonstrate the practical implications of EM, showing that it can significantly reduce the memory utilization, bandwidth utilization, compute, and thus latency of popular machine learning workloads. To highlight some impactful results, we show that EM can reduce the parameter-memory usage of deep learning recommendation models (DLRM) by $10000\times$ without compromising the model's accuracy, leading to a $3.1\times$ improvement in latency and orders-of-magnitude improvements in the carbon footprint and cost of training and deploying DLRM. We also show that EM can improve the throughput of large language models (LLMs) by $1.31\times$ without compromising model quality. Moreover, we demonstrate that EM can be combined with quantization and sparsity to further improve memory and throughput.

Keywords: efficiency, machine learning, hashing, projections, parameter memory

Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.