Author: Wang, Zhuang
Advisor: Ng, T. S. Eugene
Title: Scaling Deep Learning through Optimizing Data- and Management-Plane Communications
Type: Thesis (PhD diss., Rice University)
Date issued: December 2023
Date available: 2024-01-22
Citation: Wang, Zhuang. "Scaling Deep Learning through Optimizing Data- and Management-Plane Communications." PhD diss., Rice University, 2023. https://hdl.handle.net/1911/115360
Embargo note: This item is embargoed until 2024-06-01.
Format: application/pdf
Language: English
Keywords: distributed deep learning; gradient synchronization; gradient compression; fault tolerance
Rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.

Abstract:
The record-breaking performance of deep neural networks (DNNs) has brought remarkable success to many domains, such as computer vision, natural language processing, and recommendation systems. However, powerful DNNs come with significant computational complexity due to growing training data and model sizes. Because network bandwidth upgrades in GPU clouds have not kept pace with improvements in GPU computation capacity or with the rapid growth in dataset and model sizes, deep learning practitioners have struggled to scale DNN training efficiently. This thesis identifies and addresses research challenges in scaling distributed deep learning (DDL) by optimizing communications in both the data plane and the management plane.

The first half of the thesis focuses on data-plane communications, which support gradient synchronization in DDL. We introduce Zen, a system that fully leverages the sparsity in gradient tensors to scale deep learning with a provably optimal scheme for sparse tensor synchronization. We then propose Espresso and Cupcake to optimize the scalability of compression-enabled DDL, which applies gradient compression algorithms to reduce communication traffic volume for DNNs with low sparsity.

The second half of the thesis focuses on management-plane communications, which provide fault tolerance to DDL. We introduce Gemini, a scalable distributed training system that checkpoints the model states at the optimal frequency to minimize failure recovery overhead in DDL, especially for large DNNs approaching trillions of parameters.
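To make the gradient-compression idea in the abstract concrete, below is a minimal, generic top-k sparsification sketch in PyTorch. It illustrates the general technique of communicating only the largest-magnitude gradient entries; the function names (`topk_compress`, `topk_decompress`) and the 1% ratio are illustrative assumptions, and this is not the actual algorithm used by Zen, Espresso, or Cupcake.

```python
# Generic top-k gradient sparsification sketch (illustration only; NOT the
# algorithm of Zen, Espresso, or Cupcake from this thesis).
import math
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.

    Returns (values, indices) so that only a small payload needs to be
    communicated instead of the full dense tensor.
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices

def topk_decompress(values: torch.Tensor, indices: torch.Tensor, shape):
    """Scatter the sparse payload back into a dense tensor of `shape`."""
    flat = torch.zeros(math.prod(shape), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

# Example: compress a fake gradient, then reconstruct a sparse approximation.
g = torch.randn(1024, 1024)
vals, idx = topk_compress(g, ratio=0.01)     # ~1% of entries survive
g_hat = topk_decompress(vals, idx, g.shape)  # sparse approximation of g
```

In practice, compression schemes like this pair the sparse payload with error feedback and a synchronization protocol; the thesis's contribution lies in making such pipelines scale, which the sketch does not attempt to show.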
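The abstract also says Gemini checkpoints model states "at the optimal frequency." As a rough illustration of why such an optimum exists (checkpointing too often wastes time saving state, too rarely wastes time recomputing after a failure), here is a back-of-the-envelope sketch using Young's classic approximation, T_opt = sqrt(2 * C * MTBF), where C is the checkpoint cost. This is a textbook heuristic shown for intuition only; it is an assumption here and not Gemini's actual frequency-selection model.

```python
# Young's approximation for the checkpoint interval that roughly minimizes
# total overhead (time spent checkpointing + expected recomputation after a
# failure). Illustration only; NOT Gemini's actual model from this thesis.
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Return the interval in seconds between checkpoints, per Young's formula."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: a 30 s checkpoint on a cluster with an 8-hour mean time between
# failures suggests checkpointing roughly every ~22 minutes.
interval = optimal_checkpoint_interval(checkpoint_cost_s=30.0, mtbf_s=8 * 3600)
print(f"checkpoint every ~{interval / 60:.1f} minutes")
```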