Parameter and Data Sparsity for Efficient Training of Large Neural Networks

Date
2023-12-01
Abstract

The size of deep neural networks has grown tremendously with recent advances in generative AI and the development of Large Language Models (LLMs). Simultaneously, the exponential growth in data volumes has compounded the challenges of training these expansive models. This thesis uses randomized algorithms to leverage the notion of dynamic sparsity, in which only a small subset of parameters is crucial for each input, and thereby mitigates some of the bottlenecks in training large deep neural networks.

In the first chapter, we utilize parameter sparsity to address the training bottleneck in neural networks with a large output layer, which arises in many applications such as recommendation systems and information retrieval. Computing the full softmax over such a layer is costly in both computation and energy. Various sampling approaches, popularly known as negative sampling (NS), have been proposed to overcome this challenge; however, existing NS approaches trade off either efficiency or adaptivity. We propose two classes of distributions for which the sampling scheme is truly adaptive and provably generates negative samples in near-constant time. Our proposal is based on the Locality Sensitive Hashing (LSH) data structure and significantly outperforms the baselines in terms of accuracy and wall-clock time.
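To illustrate the general idea behind LSH-based adaptive negative sampling, the sketch below is a minimal, hypothetical example (class and parameter names are assumptions, and it is not the thesis's exact sampling distributions): output-class weight vectors are bucketed with a signed-random-projection (SimHash) table, and negatives are drawn from the bucket that collides with the current input embedding, falling back to uniform sampling when the bucket is too small.

    import numpy as np

    class SimHashNegativeSampler:
        """Bucket output-layer weight vectors by signed random projections."""

        def __init__(self, class_weights, num_bits=12, seed=0):
            self.rng = np.random.default_rng(seed)
            self.num_classes, dim = class_weights.shape
            # Random hyperplanes define the SimHash code of a vector.
            self.planes = self.rng.standard_normal((num_bits, dim))
            self.buckets = {}
            codes = (class_weights @ self.planes.T) > 0
            for cls, code in enumerate(codes):
                self.buckets.setdefault(code.tobytes(), []).append(cls)

        def sample(self, embedding, positive, k=5):
            # Classes whose weights collide with the input embedding act as
            # adaptive ("hard") negatives; otherwise sample uniformly.
            code = ((self.planes @ embedding) > 0).tobytes()
            candidates = [c for c in self.buckets.get(code, []) if c != positive]
            if len(candidates) >= k:
                return list(self.rng.choice(candidates, size=k, replace=False))
            pool = [c for c in range(self.num_classes) if c != positive]
            return list(self.rng.choice(pool, size=k, replace=False))

Because the hash codes depend on the current embedding, the candidate set adapts to the input at near-constant query cost, which is the property the chapter's proposed distributions formalize.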

In the second chapter, we propose an efficient data sampling scheme based on the LSH data structure. Data sampling is an effective way to speed up the training of large deep neural networks, motivated by the observation that not all data points are equally important. We propose a novel dynamic sampling distribution based on nonparametric kernel regression. To make this approach computationally feasible, we develop an efficient sketch-based approximation to the Nadaraya-Watson estimator and provide exponential convergence guarantees. We demonstrate that our sampling algorithm outperforms the baseline in terms of wall-clock time and accuracy across multiple datasets.
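As a rough illustration of the underlying estimator (the function names and the use of per-example losses as the regression target are assumptions), the brute-force Nadaraya-Watson estimate below smooths each point's score over its neighbors with a Gaussian kernel; normalizing the scores gives a sampling distribution. The brute-force O(n^2) cost is exactly what the thesis's sketch-based (LSH) approximation is designed to avoid.

    import numpy as np

    def nadaraya_watson(x, X, y, bandwidth=1.0):
        # Kernel-weighted average: f(x) = sum_i K(x, x_i) * y_i / sum_i K(x, x_i)
        dists = np.linalg.norm(X - x, axis=1)
        weights = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))  # Gaussian kernel
        return float(weights @ y) / max(float(weights.sum()), 1e-12)

    def sampling_distribution(X, losses, bandwidth=1.0):
        # Brute-force version for clarity; a sketch-based approximation
        # replaces this loop with near-constant per-query cost.
        scores = np.array([nadaraya_watson(x, X, losses, bandwidth) for x in X])
        return scores / scores.sum()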

Finally, in the third chapter, we study the system design and implementation of networks with dynamic parameter sparsity on CPUs. A notable work in this direction is the SLIDE system. SLIDE is a C++ implementation of sparse, hash-table-based forward and backward propagation that was shown to train neural models with hundreds of millions of parameters significantly faster on CPUs than on GPUs. In this chapter, we argue that SLIDE’s current implementation is sub-optimal and does not exploit several features available in modern CPUs. In particular, we show that by exploiting opportunities such as memory optimization, quantization, and vectorization, we obtain up to a 7x speedup on the same hardware.
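As a toy example of just one of these techniques, the sketch below shows symmetric int8 weight quantization, which trades a small amount of precision for a 4x reduction in weight-memory traffic. The specific quantization scheme, as well as the C++-level vectorization and memory-layout optimizations actually used in the thesis, are not represented here; this is only an illustrative assumption in Python.

    import numpy as np

    def quantize_int8(weights):
        # Symmetric per-tensor quantization: 8-bit integers plus one float scale.
        scale = max(float(np.abs(weights).max()), 1e-12) / 127.0
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.default_rng(0).standard_normal((256, 128)).astype(np.float32)
    q, s = quantize_int8(w)
    print("max reconstruction error:", np.abs(w - dequantize_int8(q, s)).max())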

Degree
Doctor of Philosophy
Type
Thesis
Keywords
Deep Learning, Locality Sensitive Hashing, Negative Sampling, Data Selection, Efficient Training of Neural Networks
Citation

Daghaghi, Shabnam. "Parameter and Data Sparsity for Efficient Training of Large Neural Networks." (2023). PhD diss., Rice University. https://hdl.handle.net/1911/115391

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.