Scaling Deep Learning through Optimizing Data- and Management-Plane Communications

Date
2023-11-29
Abstract

The record-breaking performance of deep neural networks (DNNs) has brought remarkable success to many domains, such as computer vision, natural language processing, and recommendation systems. However, these powerful DNNs come with significant computational cost due to ever-growing training data and model sizes. Because network bandwidth in GPU clouds has not kept pace with the improvements in GPU compute capacity or with the rapid growth of datasets and models, deep learning practitioners have struggled to scale up DNN training efficiently. This thesis identifies and addresses research challenges in scaling distributed deep learning (DDL) by optimizing communications in both the data plane and the management plane.

The first half of the thesis focuses on data-plane communications, which support gradient synchronization in DDL. We introduce Zen, a system that fully exploits the sparsity in gradient tensors to scale deep learning, using a provably optimal scheme for sparse tensor synchronization. We then propose Espresso and Cupcake, which improve the scalability of compression-enabled DDL, where gradient compression algorithms reduce the communication traffic volume for DNNs with low gradient sparsity.
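For intuition, the sketch below shows one common building block of compression-enabled synchronization: top-k gradient sparsification with error feedback, in which each worker transmits only the k largest-magnitude gradient entries and locally accumulates what was dropped so it is eventually sent. This is a generic, widely used technique offered only for illustration; it is not Zen's, Espresso's, or Cupcake's actual scheme, and the class and function names here are hypothetical.

```python
import torch

def topk_compress(grad: torch.Tensor, k: int):
    """Keep the k largest-magnitude entries; return (indices, values)."""
    flat = grad.flatten()
    _, idx = flat.abs().topk(k)
    return idx, flat[idx]

class ErrorFeedbackCompressor:
    """Accumulates the compression error so dropped mass is sent later."""
    def __init__(self):
        self.residual = None

    def compress(self, grad: torch.Tensor, k: int):
        if self.residual is None:
            self.residual = torch.zeros_like(grad)
        corrected = grad + self.residual            # add back past error
        idx, vals = topk_compress(corrected, k)
        sparse = torch.zeros_like(corrected.flatten())
        sparse[idx] = vals                          # what actually gets sent
        self.residual = corrected - sparse.view_as(grad)  # new local error
        return idx, vals                            # (indices, values) to transmit
```

In a real DDL system the returned (indices, values) pairs would be exchanged among workers (e.g., via an allgather) instead of the dense tensor, which is where the bandwidth savings come from when gradients are sparse or heavily compressible.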

The second half of the thesis focuses on management-plane communications, which provide fault tolerance to DDL. We introduce Gemini, a scalable distributed training system that checkpoints the model states at the optimal frequency to minimize failure recovery overhead in DDL, especially for large DNNs approaching trillions of parameters.
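To illustrate the checkpoint-frequency trade-off, a classical first-order baseline is Young's approximation, which balances the cost of writing checkpoints against the expected recomputation lost after a failure. Gemini's actual optimization is more sophisticated; the sketch below is only this textbook rule, with hypothetical parameter values.

```python
import math

def young_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation: optimal interval ~ sqrt(2 * C * MTBF).

    checkpoint_cost_s: time to write one checkpoint (seconds)
    mtbf_s: mean time between failures of the training job (seconds)
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical numbers: a 30 s checkpoint and one failure every 12 hours
interval = young_checkpoint_interval(30.0, 12 * 3600)
print(f"checkpoint every ~{interval / 60:.1f} minutes")  # ~26.8 minutes
```

Checkpointing too often wastes time writing state; too rarely, each failure forfeits more training progress, so the optimal interval grows with checkpoint cost and shrinks as failures become more frequent.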

Description
EMBARGO NOTE: This item is embargoed until 2024-06-01
Degree
Doctor of Philosophy
Type
Thesis
Keywords
Distributed deep learning, gradient synchronization, gradient compression, fault tolerance
Citation

Wang, Zhuang. "Scaling Deep Learning through Optimizing Data- and Management-Plane Communications." (2023) PhD diss., Rice University. https://hdl.handle.net/1911/115360

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.