A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios

Date
2023-11-30
Abstract

Over the past decades of development of machine learning systems, there has been an enduring conflict: model performance versus model scale versus computation resources. The never-ending drive to improve model performance keeps increasing the size of machine learning models, the size of training datasets, and the training time, while the available computation resources are generally limited by the memory and compute power of the devices and by restrictions on data usage (due to data storage or user privacy). Broadly, there are two main research directions that attempt to resolve this conflict. The first focuses on decreasing the required computation resources. Accordingly, synchronous distributed training systems (such as data parallelism and model parallelism) and asynchronous distributed training systems have been widely studied. Further, federated learning systems have been developed to address the additional restrictions on data usage arising from data privacy or data storage. The second direction instead focuses on improving model performance at a limited model scale, notably through Mixture of Experts (MoE) systems. Observing a hidden shared essence between these two directions, we aim to create a general methodology that solves the problems encountered in both. We propose a novel methodology that partitions, randomly or by a controlled method, a large neural network model into smaller subnetworks, each of which is distributed to a local worker, trained independently, and synchronized periodically. For the first direction, we demonstrate, with theoretical guarantees and empirical experiments, that this methodology can be applied in both synchronous and asynchronous systems, across different model architectures, and in both distributed training and federated learning, in most cases significantly reducing communication, memory, and computation costs. For the second direction, we demonstrate that this methodology can significantly improve model performance in MoE systems without increasing model scale, by guiding the training of specialized experts. We also demonstrate that our methodology can be applied to MoE systems built on both traditional deep learning models and recent Large Language Models (LLMs).
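To make the core idea concrete, the following is a minimal, illustrative sketch (not the thesis implementation) of the partition-train-synchronize loop described above. It uses a toy least-squares model in NumPy; the worker count, partitioning scheme, local-step schedule, and objective are all assumptions chosen for illustration.

# Minimal sketch of the partition-train-synchronize idea: the weight matrix is
# split into disjoint subnetworks by a random neuron partition, each "worker"
# updates only its own subnetwork for a few local steps, and the pieces are
# merged back periodically. The toy objective and all constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_workers, hidden, sync_rounds, local_steps, lr = 4, 16, 5, 10, 0.1

W = rng.normal(size=(hidden, 8))   # global model: one toy weight matrix
X = rng.normal(size=(64, 8))       # synthetic inputs
y = rng.normal(size=(64, hidden))  # synthetic targets

for _ in range(sync_rounds):
    # Randomly partition the hidden units into disjoint subnetworks, one per worker.
    parts = np.array_split(rng.permutation(hidden), n_workers)
    new_W = np.zeros_like(W)
    for rows in parts:
        W_local = W[rows].copy()      # worker receives only its subnetwork
        for _ in range(local_steps):  # independent local training
            grad = (X @ W_local.T - y[:, rows]).T @ X / len(X)
            W_local -= lr * grad
        new_W[rows] = W_local         # worker returns only its rows
    W = new_W                         # periodic synchronization: reassemble the model

print("final loss:", np.mean((X @ W.T - y) ** 2))

Because the subnetworks in this sketch are disjoint, reassembly needs no averaging; with overlapping partitions, or federated clients holding different local data, the synchronization step would instead average the parameters returned for each shared weight.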

Degree
Doctor of Philosophy
Type
Thesis
Keywords
Distributed Machine Learning, Federated Learning
Citation

Dun, Chen. "A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios." (2023). PhD thesis, Rice University. https://hdl.handle.net/1911/115431

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.