Scaling Deep Learning through Optimizing Data- and Management-Plane Communications

dc.contributor.advisor: Ng, T. S. Eugene
dc.creator: Wang, Zhuang
dc.date.accessioned: 2024-01-22T22:21:08Z
dc.date.available: 2024-01-22T22:21:08Z
dc.date.created: 2023-12
dc.date.issued: 2023-11-29
dc.date.submitted: December 2023
dc.date.updated: 2024-01-22T22:21:08Z
dc.description: EMBARGO NOTE: This item is embargoed until 2024-06-01
dc.description.abstract: The record-breaking performance of deep neural networks (DNNs) has brought remarkable success to many domains, such as computer vision, natural language processing, and recommendation systems. However, these powerful DNNs come with significant computational cost due to growing training data and model sizes. Because network bandwidth upgrades in GPU clouds have not kept pace with improvements in GPU compute capacity or with the rapid growth of dataset and model sizes, deep learning practitioners have struggled to scale up DNN training efficiently. This thesis identifies and addresses research challenges in scaling distributed deep learning (DDL) by optimizing communications in both the data plane and the management plane. The first half of the thesis focuses on data-plane communications, which support gradient synchronization in DDL. We introduce Zen, a system that fully leverages the sparsity in gradient tensors to scale deep learning with a provably optimal scheme for sparse tensor synchronization. We then propose Espresso and Cupcake to improve the scalability of compression-enabled DDL, which applies gradient compression algorithms to reduce communication traffic for DNNs with low sparsity. The second half of the thesis focuses on management-plane communications, which provide fault tolerance for DDL. We introduce Gemini, a scalable distributed training system that checkpoints model states at the optimal frequency to minimize failure recovery overhead in DDL, especially for large DNNs approaching trillions of parameters.
dc.embargo.lift: 2024-06-01
dc.embargo.terms: 2024-06-01
dc.format.mimetype: application/pdf
dc.identifier.citation: Wang, Zhuang. "Scaling Deep Learning through Optimizing Data- and Management-Plane Communications." (2023) PhD diss., Rice University. https://hdl.handle.net/1911/115360
dc.identifier.uri: https://hdl.handle.net/1911/115360
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Distributed deep learning
dc.subject: gradient synchronization
dc.subject: gradient compression
dc.subject: fault tolerance
dc.title: Scaling Deep Learning through Optimizing Data- and Management-Plane Communications
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
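The abstract above refers to compression-enabled synchronization that reduces the traffic volume of gradient exchange. The following is a minimal, hypothetical sketch of that general idea (top-k gradient sparsification in PyTorch); it is not taken from the thesis, and all function names, shapes, and the compression ratio are illustrative assumptions.

```python
# Illustrative sketch only (not the thesis's algorithms or APIs): top-k gradient
# sparsification, a common gradient compression technique of the kind the
# abstract refers to. Names and parameters are assumptions for illustration.
import math
import torch

def topk_compress(grad: torch.Tensor, k: int):
    """Keep the k largest-magnitude entries of a gradient tensor.

    Returns the kept values, their flat indices, and the original shape,
    which is all a worker would need to send instead of the dense tensor.
    """
    flat = grad.flatten()
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def topk_decompress(values: torch.Tensor, indices: torch.Tensor, shape) -> torch.Tensor:
    """Rebuild a dense tensor that is zero except at the kept entries."""
    flat = torch.zeros(math.prod(shape), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

# Example: transmit roughly 1% of a ~1M-element gradient instead of the full tensor.
grad = torch.randn(1024, 1024)
k = max(1, int(0.01 * grad.numel()))
values, indices, shape = topk_compress(grad, k)
restored = topk_decompress(values, indices, shape)
```

Beyond compressing a single tensor, real systems must also aggregate the resulting sparse tensors efficiently across many workers, which is the synchronization problem the data-plane chapters of the thesis address.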
Files
Original bundle
WANG-DOCUMENT-2023.pdf (13.11 MB, Adobe Portable Document Format)
License bundle
PROQUEST_LICENSE.txt (5.84 KB, Plain Text)
LICENSE.txt (2.98 KB, Plain Text)