Scaling Deep Learning through Optimizing Data- and Management-Plane Communications

dc.contributor.advisor: Ng, T. S. Eugene
dc.creator: Wang, Zhuang
dc.date.accessioned: 2024-01-22T22:21:08Z
dc.date.available: 2024-01-22T22:21:08Z
dc.date.created: 2023-12
dc.date.issued: 2023-11-29
dc.date.submitted: December 2023
dc.date.updated: 2024-01-22T22:21:08Z
dc.description: EMBARGO NOTE: This item is embargoed until 2024-06-01
dc.description.abstract: The record-breaking performance of deep neural networks (DNNs) has brought remarkable success to many domains, such as computer vision, natural language processing, and recommendation systems. However, these powerful DNNs come with significant computational cost due to growing training data and model sizes. Because network bandwidth upgrades in GPU clouds have not kept pace with improvements in GPU compute capacity or with the rapid growth of dataset and model sizes, deep learning practitioners have struggled to scale up DNN training efficiently. This thesis identifies and addresses research challenges in scaling distributed deep learning (DDL) by optimizing communications in both the data plane and the management plane. The first half of the thesis focuses on data-plane communications, which support gradient synchronization in DDL. We introduce Zen, a system that fully leverages the sparsity in gradient tensors to scale deep learning with a provably optimal scheme for sparse tensor synchronization. We then propose Espresso and Cupcake to improve the scalability of compression-enabled DDL, which applies gradient compression algorithms to reduce communication traffic for DNNs with low sparsity. The second half of the thesis focuses on management-plane communications, which provide fault tolerance for DDL. We introduce Gemini, a scalable distributed training system that checkpoints model states at the optimal frequency to minimize failure recovery overhead in DDL, especially for large DNNs approaching trillions of parameters.
dc.embargo.lift: 2024-06-01
dc.embargo.terms: 2024-06-01
dc.format.mimetype: application/pdf
dc.identifier.citation: Wang, Zhuang. "Scaling Deep Learning through Optimizing Data- and Management-Plane Communications." (2023) PhD diss., Rice University. https://hdl.handle.net/1911/115360
dc.identifier.uri: https://hdl.handle.net/1911/115360
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Distributed deep learning
dc.subject: gradient synchronization
dc.subject: gradient compression
dc.subject: fault tolerance
dc.title: Scaling Deep Learning through Optimizing Data- and Management-Plane Communications
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
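The abstract above refers to compression-enabled synchronization that reduces the traffic volume of gradient exchange. The following is a minimal, hypothetical sketch of that general idea (top-k gradient sparsification in PyTorch); it is not taken from the thesis, and all function names, shapes, and the compression ratio are illustrative assumptions.

```python
# Illustrative sketch only (not the thesis's algorithms or APIs): top-k gradient
# sparsification, a common gradient compression technique of the kind the
# abstract refers to. Names and parameters are assumptions for illustration.
import math
import torch

def topk_compress(grad: torch.Tensor, k: int):
    """Keep the k largest-magnitude entries of a gradient tensor.

    Returns the kept values, their flat indices, and the original shape,
    which is all a worker would need to send instead of the dense tensor.
    """
    flat = grad.flatten()
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def topk_decompress(values: torch.Tensor, indices: torch.Tensor, shape) -> torch.Tensor:
    """Rebuild a dense tensor that is zero except at the kept entries."""
    flat = torch.zeros(math.prod(shape), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

# Example: transmit roughly 1% of a ~1M-element gradient instead of the full tensor.
grad = torch.randn(1024, 1024)
k = max(1, int(0.01 * grad.numel()))
values, indices, shape = topk_compress(grad, k)
restored = topk_decompress(values, indices, shape)
```

Beyond compressing a single tensor, real systems must also aggregate the resulting sparse tensors efficiently across many workers, which is the synchronization problem the data-plane chapters of the thesis address.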
Files
Original bundle
WANG-DOCUMENT-2023.pdf (13.11 MB, Adobe Portable Document Format)
License bundle
PROQUEST_LICENSE.txt (5.84 KB, Plain Text)
LICENSE.txt (2.98 KB, Plain Text)