Parameter and Data Sparsity for Efficient Training of Large Neural Networks

dc.contributor.committeeMember: Shrivastava, Anshumali
dc.contributor.committeeMember: Baraniuk, Richard
dc.contributor.committeeMember: Hu, Xia (Ben)
dc.creator: Daghaghi, Shabnam
dc.date.accessioned: 2024-01-24T21:40:33Z
dc.date.available: 2024-01-24T21:40:33Z
dc.date.created: 2023-12
dc.date.issued: 2023-12-01
dc.date.submitted: December 2023
dc.date.updated: 2024-01-24T21:40:33Z
dc.description.abstract: The size of deep neural networks has grown tremendously with recent advances in generative AI and the development of Large Language Models (LLMs). Simultaneously, the exponential growth in data volume has compounded the challenges of training these expansive models. This thesis uses randomized algorithms to exploit dynamic sparsity, the observation that only a small subset of parameters is crucial for each input, and to mitigate several bottlenecks in the training of large deep neural networks. In the first chapter, we utilize parameter sparsity to address the training bottleneck in neural networks with a large output layer, which arises in many applications such as recommendation systems and information retrieval. Computing the full softmax is costly in both computation and energy. Various sampling approaches, popularly known as negative sampling (NS), have been proposed to overcome this challenge; however, existing NS approaches sacrifice either efficiency or adaptivity. We propose two classes of distributions for which the sampling scheme is truly adaptive and provably generates negative samples in near-constant time. Our proposal is based on the Locality Sensitive Hashing (LSH) data structure and significantly outperforms the baselines in both accuracy and wall-clock time. In the second chapter, we propose an efficient data sampling scheme, also based on the LSH data structure. Data sampling is an effective way to improve the training speed of large deep neural networks, since not all data points are equally important. We propose a novel dynamic sampling distribution based on nonparametric kernel regression. To make this approach computationally feasible, we develop an efficient sketch-based approximation to the Nadaraya-Watson estimator and provide exponential convergence guarantees. We demonstrate that our sampling algorithm outperforms the baseline in wall-clock time and accuracy across multiple datasets. Finally, in the third chapter, we study the system design and implementation of networks with dynamic parameter sparsity on CPUs. A notable work in this direction is the SLIDE system, a C++ implementation of sparse, hash-table-based feed-forward and back-propagation that was shown to be significantly faster than GPUs at training neural models with hundreds of millions of parameters. In this chapter, we argue that SLIDE's current implementation is sub-optimal and does not exploit several features available in modern CPUs. In particular, we show that by exploiting opportunities such as memory optimization, quantization, and vectorization, we obtain up to a 7x speedup on the same hardware.
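The abstract describes the first chapter's LSH-based adaptive negative sampling only at a high level. The following is a minimal sketch of one way such a sampler can work, assuming random-hyperplane (SimHash) hash tables built over the output-layer class embeddings; the class name, parameters, and collision-based candidate retrieval are illustrative assumptions, not the thesis's actual implementation.

# Hypothetical sketch: LSH (SimHash) negative sampler over output-layer
# class embeddings. Illustrative only; not the thesis's implementation.
import numpy as np

class SimHashNegativeSampler:
    def __init__(self, class_embeddings, num_bits=8, num_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = class_embeddings.shape[1]
        self.num_tables = num_tables
        # One set of random hyperplanes per hash table.
        self.planes = rng.standard_normal((num_tables, num_bits, self.dim))
        self.tables = [dict() for _ in range(num_tables)]
        self._index(class_embeddings)

    def _signature(self, t, vec):
        # SimHash signature: sign pattern of projections onto random hyperplanes.
        return ((self.planes[t] @ vec) > 0).tobytes()

    def _index(self, class_embeddings):
        # Bucket every class id by its signature in every table.
        for cls, w in enumerate(class_embeddings):
            for t in range(self.num_tables):
                self.tables[t].setdefault(self._signature(t, w), []).append(cls)

    def sample_negatives(self, query_embedding, positive_class, k=16):
        # Classes whose embeddings collide with the query act as "hard"
        # negatives; only the colliding buckets are touched, so the lookup
        # cost does not grow with the total number of classes.
        candidates = set()
        for t in range(self.num_tables):
            sig = self._signature(t, query_embedding)
            candidates.update(self.tables[t].get(sig, []))
        candidates.discard(positive_class)
        return list(candidates)[:k]

# Usage on synthetic data: 100k output classes, 128-dimensional embeddings.
rng = np.random.default_rng(1)
W = rng.standard_normal((100_000, 128))
sampler = SimHashNegativeSampler(W)
negatives = sampler.sample_negatives(rng.standard_normal(128), positive_class=42)

Because the output-layer embeddings change during training, an adaptive scheme of this kind would rebuild the tables periodically so that the sampling distribution keeps tracking the current parameters.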
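For the second chapter's data-sampling scheme, the quantity being approximated is the classical Nadaraya-Watson kernel regression estimator. The thesis's sketch-based approximation is not reproduced here; only the standard estimator it targets is shown, written for a kernel $K_h$ with bandwidth $h$ over $n$ training points:

\[
\hat{m}_h(x) \;=\; \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{i=1}^{n} K_h(x - x_i)}
\]

Evaluating these sums naively costs O(n) per query, which is what motivates an efficient sketch-based approximation when n is the size of the training set.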
dc.format.mimetype: application/pdf
dc.identifier.citation: Daghaghi, Shabnam. "Parameter and Data Sparsity for Efficient Training of Large Neural Networks." PhD diss., Rice University, 2023. https://hdl.handle.net/1911/115391
dc.identifier.uri: https://hdl.handle.net/1911/115391
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Deep Learning
dc.subject: Locality Sensitive Hashing
dc.subject: Negative Sampling
dc.subject: Data Selection
dc.subject: Efficient Training of Neural Networks
dc.title: Parameter and Data Sparsity for Efficient Training of Large Neural Networks
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Electrical and Computer Engineering
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
Files

Original bundle:
DAGHAGHI-DOCUMENT-2023.pdf (2.99 MB, Adobe Portable Document Format)

License bundle:
PROQUEST_LICENSE.txt (5.85 KB, Plain Text)
LICENSE.txt (2.98 KB, Plain Text)