Scaling Deep Learning through Optimizing Data- and Management-Plane Communications

Date
2023-11-29
Abstract

The record-breaking performance of deep neural networks (DNNs) has brought remarkable success to many domains, such as computer vision, natural language processing, and recommendation systems. However, these powerful DNNs come with significant computational cost due to ever-growing training data and model sizes. Because network bandwidth in GPU clouds has not kept pace with the improvements in GPU compute capacity or with the rapid growth of datasets and models, deep learning practitioners have struggled to scale up DNN training efficiently. This thesis identifies and addresses research challenges in scaling distributed deep learning (DDL) by optimizing communications in both the data plane and the management plane.

The first half of the thesis focuses on data-plane communications, which support gradient synchronization in DDL. We introduce Zen, a system that fully exploits the sparsity in gradient tensors to scale deep learning, using a provably optimal scheme for sparse tensor synchronization. We then propose Espresso and Cupcake, which improve the scalability of compression-enabled DDL, where gradient compression algorithms reduce the communication traffic volume for DNNs with low gradient sparsity.
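For intuition, the sketch below shows one common building block of compression-enabled synchronization: top-k gradient sparsification with error feedback, in which each worker transmits only the k largest-magnitude gradient entries and locally accumulates what was dropped so it is eventually sent. This is a generic, widely used technique offered only for illustration; it is not Zen's, Espresso's, or Cupcake's actual scheme, and the class and function names here are hypothetical.

```python
import torch

def topk_compress(grad: torch.Tensor, k: int):
    """Keep the k largest-magnitude entries; return (indices, values)."""
    flat = grad.flatten()
    _, idx = flat.abs().topk(k)
    return idx, flat[idx]

class ErrorFeedbackCompressor:
    """Accumulates the compression error so dropped mass is sent later."""
    def __init__(self):
        self.residual = None

    def compress(self, grad: torch.Tensor, k: int):
        if self.residual is None:
            self.residual = torch.zeros_like(grad)
        corrected = grad + self.residual            # add back past error
        idx, vals = topk_compress(corrected, k)
        sparse = torch.zeros_like(corrected.flatten())
        sparse[idx] = vals                          # what actually gets sent
        self.residual = corrected - sparse.view_as(grad)  # new local error
        return idx, vals                            # (indices, values) to transmit
```

In a real DDL system the returned (indices, values) pairs would be exchanged among workers (e.g., via an allgather) instead of the dense tensor, which is where the bandwidth savings come from when gradients are sparse or heavily compressible.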

The second half of the thesis focuses on management-plane communications, which provide fault tolerance to DDL. We introduce Gemini, a scalable distributed training system that checkpoints the model states at the optimal frequency to minimize failure recovery overhead in DDL, especially for large DNNs approaching trillions of parameters.
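To illustrate the checkpoint-frequency trade-off, a classical first-order baseline is Young's approximation, which balances the cost of writing checkpoints against the expected recomputation lost after a failure. Gemini's actual optimization is more sophisticated; the sketch below is only this textbook rule, with hypothetical parameter values.

```python
import math

def young_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation: optimal interval ~ sqrt(2 * C * MTBF).

    checkpoint_cost_s: time to write one checkpoint (seconds)
    mtbf_s: mean time between failures of the training job (seconds)
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical numbers: a 30 s checkpoint and one failure every 12 hours
interval = young_checkpoint_interval(30.0, 12 * 3600)
print(f"checkpoint every ~{interval / 60:.1f} minutes")  # ~26.8 minutes
```

Checkpointing too often wastes time writing state; too rarely, each failure forfeits more training progress, so the optimal interval grows with checkpoint cost and shrinks as failures become more frequent.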

Description
EMBARGO NOTE: This item is embargoed until 2024-06-01
Degree
Doctor of Philosophy
Type
Thesis
Keywords
Distributed deep learning, gradient synchronization, gradient compression, fault tolerance
Citation

Wang, Zhuang. "Scaling Deep Learning through Optimizing Data- and Management-Plane Communications." (2023) PhD diss., Rice University. https://hdl.handle.net/1911/115360

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.