A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios

dc.contributor.advisor: Kyrillidis, Anastasios
dc.creator: Dun, Chen
dc.date.accessioned: 2024-01-25T15:36:12Z
dc.date.available: 2024-01-25T15:36:12Z
dc.date.created: 2023-12
dc.date.issued: 2023-11-30
dc.date.submitted: December 2023
dc.date.updated: 2024-01-25T15:36:12Z
dc.description.abstract: Throughout the development of machine learning systems there has been a persistent tension between model performance, model scale, and available computation resources. The drive to improve model performance keeps increasing model size, training dataset size, and training time, while the available resources remain limited by device memory and compute power, and by restrictions on data usage arising from data storage constraints or user privacy. Two main research directions address this tension. The first focuses on reducing the required computation resources: synchronous distributed training systems (such as data parallelism and model parallelism) and asynchronous distributed training systems have been widely studied, and federated learning systems have been developed to handle the additional restrictions on data usage imposed by privacy and storage. The second direction focuses on improving model performance at a fixed model scale through Mixture-of-Experts (MoE) systems. Observing a shared underlying structure between these two directions, we aim to create a general methodology that addresses the problems arising in both. We propose a novel methodology that partitions a large neural network, randomly or by a controlled method, into smaller subnetworks, each of which is distributed to a local worker, trained independently, and synchronized periodically. For the first direction, we demonstrate, with theoretical guarantees and empirical experiments, that this methodology applies to both synchronous and asynchronous systems, to different model architectures, and to both distributed training and federated learning, in most cases significantly reducing communication, memory, and computation costs. For the second direction, we demonstrate that the methodology significantly improves model performance in MoE systems without increasing model scale, by guiding the training of specialized experts, and that it applies to MoE systems built on both traditional deep learning models and recent Large Language Models (LLMs).
dc.format.mimetype: application/pdf
dc.identifier.citation: Dun, Chen. "A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios." PhD thesis, Rice University, 2023. https://hdl.handle.net/1911/115431
dc.identifier.uri: https://hdl.handle.net/1911/115431
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Distributed Machine Learning
dc.subject: Federated Learning
dc.title: A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
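As a rough illustration of the methodology described in the abstract (partition a large network into subnetworks, train each subnetwork independently on a local worker, and synchronize periodically), the following NumPy sketch may help. It is not the author's implementation: the single hidden layer, four workers, synthetic data shards, disjoint random neuron partition, and plain SGD are all assumptions made here to keep the example small and runnable.

# Minimal sketch of the partition-train-synchronize idea from the abstract.
# All specifics below (layer sizes, 4 workers, synthetic shards, plain SGD,
# disjoint random neuron partition) are illustrative assumptions, not the
# thesis implementation.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 32, 1          # assumed full-model sizes
n_workers, local_steps, rounds, lr = 4, 20, 10, 0.05

# Full one-hidden-layer model: y = relu(x @ W1) @ W2
W1 = rng.normal(scale=0.1, size=(d_in, d_hidden))
W2 = rng.normal(scale=0.1, size=(d_hidden, d_out))

# One synthetic data shard per worker (stand-in for local / federated data).
def make_shard():
    X = rng.normal(size=(64, d_in))
    y = np.sin(X.sum(axis=1, keepdims=True))
    return X, y

shards = [make_shard() for _ in range(n_workers)]

def local_train(W1_sub, W2_sub, X, y):
    # Plain SGD on the worker's subnetwork only; the rest of the model is untouched.
    for _ in range(local_steps):
        H = np.maximum(X @ W1_sub, 0.0)       # forward pass through the subnetwork
        err = H @ W2_sub - y                  # MSE gradient w.r.t. predictions (up to a constant)
        gW2 = H.T @ err / len(X)
        gW1 = X.T @ ((err @ W2_sub.T) * (H > 0)) / len(X)
        W1_sub -= lr * gW1
        W2_sub -= lr * gW2
    return W1_sub, W2_sub

for r in range(rounds):
    # Randomly partition the hidden neurons into disjoint groups, one per worker.
    groups = np.array_split(rng.permutation(d_hidden), n_workers)

    for worker, idx in enumerate(groups):
        X, y = shards[worker]
        # Each worker receives only the weight slices for its neurons...
        W1_sub, W2_sub = local_train(W1[:, idx].copy(), W2[idx, :].copy(), X, y)
        # ...and periodic synchronization writes the trained slices back into
        # the full model (groups are disjoint, so there is nothing to average).
        W1[:, idx] = W1_sub
        W2[idx, :] = W2_sub

    mse = np.mean([np.mean((np.maximum(X @ W1, 0.0) @ W2 - y) ** 2) for X, y in shards])
    print(f"round {r}: full-model MSE {mse:.4f}")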
Files
Original bundle
DUN-DOCUMENT-2023.pdf (3.94 MB, Adobe Portable Document Format)
License bundle
PROQUEST_LICENSE.txt (5.84 KB, Plain Text)
LICENSE.txt (2.97 KB, Plain Text)