Parameter and Data Sparsity for Efficient Training of Large Neural Networks

dc.contributor.committeeMember: Shrivastava, Anshumali
dc.contributor.committeeMember: Baraniuk, Richard
dc.contributor.committeeMember: Hu, Xia (Ben)
dc.creator: Daghaghi, Shabnam
dc.date.accessioned: 2024-01-24T21:40:33Z
dc.date.available: 2024-01-24T21:40:33Z
dc.date.created: 2023-12
dc.date.issued: 2023-12-01
dc.date.submitted: December 2023
dc.date.updated: 2024-01-24T21:40:33Z
dc.description.abstract: The size of deep neural networks has grown tremendously with recent advances in generative AI and the development of Large Language Models (LLMs). Simultaneously, the exponential growth in data volume has compounded the challenges of training these expansive models. This thesis uses randomized algorithms to exploit dynamic sparsity, the observation that only a small subset of parameters is crucial for each input, and to mitigate several bottlenecks in the training of large deep neural networks. In the first chapter, we utilize parameter sparsity to address the training bottleneck in neural networks with a large output layer, which arises in many applications such as recommendation systems and information retrieval. Computing the full softmax is costly in both computation and energy. Various sampling approaches, popularly known as negative sampling (NS), have been proposed to overcome this challenge; however, existing NS approaches sacrifice either efficiency or adaptivity. We propose two classes of distributions for which the sampling scheme is truly adaptive and provably generates negative samples in near-constant time. Our proposal is based on the Locality Sensitive Hashing (LSH) data structure and significantly outperforms the baselines in both accuracy and wall-clock time. In the second chapter, we propose an efficient data sampling scheme, also based on the LSH data structure. Data sampling is an effective way to improve the training speed of large deep neural networks, since not all data points are equally important. We propose a novel dynamic sampling distribution based on nonparametric kernel regression. To make this approach computationally feasible, we develop an efficient sketch-based approximation to the Nadaraya-Watson estimator and provide exponential convergence guarantees. We demonstrate that our sampling algorithm outperforms the baseline in wall-clock time and accuracy across multiple datasets. Finally, in the third chapter, we study the system design and implementation of networks with dynamic parameter sparsity on CPUs. A notable work in this direction is the SLIDE system, a C++ implementation of sparse, hash-table-based feed-forward and back-propagation that was shown to be significantly faster than GPUs at training neural models with hundreds of millions of parameters. In this chapter, we argue that SLIDE's current implementation is sub-optimal and does not exploit several features available in modern CPUs. In particular, we show that by exploiting opportunities such as memory optimization, quantization, and vectorization, we obtain up to a 7x speedup on the same hardware.
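The abstract describes the first chapter's LSH-based adaptive negative sampling only at a high level. The following is a minimal sketch of one way such a sampler can work, assuming random-hyperplane (SimHash) hash tables built over the output-layer class embeddings; the class name, parameters, and collision-based candidate retrieval are illustrative assumptions, not the thesis's actual implementation.

# Hypothetical sketch: LSH (SimHash) negative sampler over output-layer
# class embeddings. Illustrative only; not the thesis's implementation.
import numpy as np

class SimHashNegativeSampler:
    def __init__(self, class_embeddings, num_bits=8, num_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = class_embeddings.shape[1]
        self.num_tables = num_tables
        # One set of random hyperplanes per hash table.
        self.planes = rng.standard_normal((num_tables, num_bits, self.dim))
        self.tables = [dict() for _ in range(num_tables)]
        self._index(class_embeddings)

    def _signature(self, t, vec):
        # SimHash signature: sign pattern of projections onto random hyperplanes.
        return ((self.planes[t] @ vec) > 0).tobytes()

    def _index(self, class_embeddings):
        # Bucket every class id by its signature in every table.
        for cls, w in enumerate(class_embeddings):
            for t in range(self.num_tables):
                self.tables[t].setdefault(self._signature(t, w), []).append(cls)

    def sample_negatives(self, query_embedding, positive_class, k=16):
        # Classes whose embeddings collide with the query act as "hard"
        # negatives; only the colliding buckets are touched, so the lookup
        # cost does not grow with the total number of classes.
        candidates = set()
        for t in range(self.num_tables):
            sig = self._signature(t, query_embedding)
            candidates.update(self.tables[t].get(sig, []))
        candidates.discard(positive_class)
        return list(candidates)[:k]

# Usage on synthetic data: 100k output classes, 128-dimensional embeddings.
rng = np.random.default_rng(1)
W = rng.standard_normal((100_000, 128))
sampler = SimHashNegativeSampler(W)
negatives = sampler.sample_negatives(rng.standard_normal(128), positive_class=42)

Because the output-layer embeddings change during training, an adaptive scheme of this kind would rebuild the tables periodically so that the sampling distribution keeps tracking the current parameters.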
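For the second chapter's data-sampling scheme, the quantity being approximated is the classical Nadaraya-Watson kernel regression estimator. The thesis's sketch-based approximation is not reproduced here; only the standard estimator it targets is shown, written for a kernel $K_h$ with bandwidth $h$ over $n$ training points:

\[
\hat{m}_h(x) \;=\; \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{i=1}^{n} K_h(x - x_i)}
\]

Evaluating these sums naively costs O(n) per query, which is what motivates an efficient sketch-based approximation when n is the size of the training set.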
dc.format.mimetype: application/pdf
dc.identifier.citation: Daghaghi, Shabnam. "Parameter and Data Sparsity for Efficient Training of Large Neural Networks." PhD diss., Rice University, 2023. https://hdl.handle.net/1911/115391
dc.identifier.uri: https://hdl.handle.net/1911/115391
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Deep Learning
dc.subject: Locality Sensitive Hashing
dc.subject: Negative Sampling
dc.subject: Data Selection
dc.subject: Efficient Training of Neural Networks
dc.title: Parameter and Data Sparsity for Efficient Training of Large Neural Networks
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Electrical and Computer Engineering
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
Files

Original bundle:
DAGHAGHI-DOCUMENT-2023.pdf (2.99 MB, Adobe Portable Document Format)

License bundle:
PROQUEST_LICENSE.txt (5.85 KB, Plain Text)
LICENSE.txt (2.98 KB, Plain Text)