A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios

dc.contributor.advisor: Kyrillidis, Anastasios
dc.creator: Dun, Chen
dc.date.accessioned: 2024-01-25T15:36:12Z
dc.date.available: 2024-01-25T15:36:12Z
dc.date.created: 2023-12
dc.date.issued: 2023-11-30
dc.date.submitted: December 2023
dc.date.updated: 2024-01-25T15:36:12Z
dc.description.abstract: Throughout the development of machine learning systems there has been a persistent tension between model performance, model scale, and available computation resources. The drive to improve model performance keeps increasing model size, training dataset size, and training time, while the available resources remain limited by device memory and compute power, and by restrictions on data usage arising from data storage constraints or user privacy. Two main research directions address this tension. The first focuses on reducing the required computation resources: synchronous distributed training systems (such as data parallelism and model parallelism) and asynchronous distributed training systems have been widely studied, and federated learning systems have been developed to handle the additional restrictions on data usage imposed by privacy and storage. The second direction focuses on improving model performance at a fixed model scale through Mixture-of-Experts (MoE) systems. Observing a shared underlying structure between these two directions, we aim to create a general methodology that addresses the problems arising in both. We propose a novel methodology that partitions a large neural network, randomly or by a controlled method, into smaller subnetworks, each of which is distributed to a local worker, trained independently, and synchronized periodically. For the first direction, we demonstrate, with theoretical guarantees and empirical experiments, that this methodology applies to both synchronous and asynchronous systems, to different model architectures, and to both distributed training and federated learning, in most cases significantly reducing communication, memory, and computation costs. For the second direction, we demonstrate that the methodology significantly improves model performance in MoE systems without increasing model scale, by guiding the training of specialized experts, and that it applies to MoE systems built on both traditional deep learning models and recent Large Language Models (LLMs).
dc.format.mimetype: application/pdf
dc.identifier.citation: Dun, Chen. "A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios." PhD thesis, Rice University, 2023. https://hdl.handle.net/1911/115431
dc.identifier.uri: https://hdl.handle.net/1911/115431
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Distributed Machine Learning
dc.subject: Federated Learning
dc.title: A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
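As a rough illustration of the methodology described in the abstract (partition a large network into subnetworks, train each subnetwork independently on a local worker, and synchronize periodically), the following NumPy sketch may help. It is not the author's implementation: the single hidden layer, four workers, synthetic data shards, disjoint random neuron partition, and plain SGD are all assumptions made here to keep the example small and runnable.

# Minimal sketch of the partition-train-synchronize idea from the abstract.
# All specifics below (layer sizes, 4 workers, synthetic shards, plain SGD,
# disjoint random neuron partition) are illustrative assumptions, not the
# thesis implementation.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 32, 1          # assumed full-model sizes
n_workers, local_steps, rounds, lr = 4, 20, 10, 0.05

# Full one-hidden-layer model: y = relu(x @ W1) @ W2
W1 = rng.normal(scale=0.1, size=(d_in, d_hidden))
W2 = rng.normal(scale=0.1, size=(d_hidden, d_out))

# One synthetic data shard per worker (stand-in for local / federated data).
def make_shard():
    X = rng.normal(size=(64, d_in))
    y = np.sin(X.sum(axis=1, keepdims=True))
    return X, y

shards = [make_shard() for _ in range(n_workers)]

def local_train(W1_sub, W2_sub, X, y):
    # Plain SGD on the worker's subnetwork only; the rest of the model is untouched.
    for _ in range(local_steps):
        H = np.maximum(X @ W1_sub, 0.0)       # forward pass through the subnetwork
        err = H @ W2_sub - y                  # MSE gradient w.r.t. predictions (up to a constant)
        gW2 = H.T @ err / len(X)
        gW1 = X.T @ ((err @ W2_sub.T) * (H > 0)) / len(X)
        W1_sub -= lr * gW1
        W2_sub -= lr * gW2
    return W1_sub, W2_sub

for r in range(rounds):
    # Randomly partition the hidden neurons into disjoint groups, one per worker.
    groups = np.array_split(rng.permutation(d_hidden), n_workers)

    for worker, idx in enumerate(groups):
        X, y = shards[worker]
        # Each worker receives only the weight slices for its neurons...
        W1_sub, W2_sub = local_train(W1[:, idx].copy(), W2[idx, :].copy(), X, y)
        # ...and periodic synchronization writes the trained slices back into
        # the full model (groups are disjoint, so there is nothing to average).
        W1[:, idx] = W1_sub
        W2[idx, :] = W2_sub

    mse = np.mean([np.mean((np.maximum(X @ W1, 0.0) @ W2 - y) ** 2) for X, y in shards])
    print(f"round {r}: full-model MSE {mse:.4f}")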
Files
Original bundle
DUN-DOCUMENT-2023.pdf (3.94 MB, Adobe Portable Document Format)
License bundle
PROQUEST_LICENSE.txt (5.84 KB, Plain Text)
LICENSE.txt (2.97 KB, Plain Text)