A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios

dc.contributor.advisor: Kyrillidis, Anastasios
dc.creator: Dun, Chen
dc.date.accessioned: 2024-01-25T15:36:12Z
dc.date.available: 2024-01-25T15:36:12Z
dc.date.created: 2023-12
dc.date.issued: 2023-11-30
dc.date.submitted: December 2023
dc.date.updated: 2024-01-25T15:36:12Z
dc.description.abstract: Throughout the development of machine learning systems, there has been a persistent tension between model performance, model scale, and computation resources. The drive to improve model performance keeps increasing model size, training-set size, and training time, while the available resources remain constrained by the limited memory and compute of individual devices and by restrictions on data usage (arising from data storage or user privacy). Two main research directions address this tension. The first focuses on reducing the required computation resources: synchronous distributed training systems (such as data parallelism and model parallelism) and asynchronous distributed training systems have been widely studied, and federated learning systems have been developed to handle the additional restrictions on data usage imposed by privacy or storage constraints. The second direction instead focuses on improving model performance at a fixed model scale through Mixture-of-Experts (MoE) systems. Observing a shared structure underlying these two directions, we aim to create a general methodology that addresses the problems arising in both. We propose a methodology that partitions a large neural network, randomly or in a controlled fashion, into smaller subnetworks, each of which is distributed to a local worker, trained independently, and synchronized periodically. For the first direction, we demonstrate, with theoretical guarantees and empirical experiments, that this methodology applies to both synchronous and asynchronous systems, to different model architectures, and to both distributed training and federated learning, in most cases significantly reducing communication, memory, and computation costs. For the second direction, we demonstrate that the methodology significantly improves model performance in MoE systems without increasing model scale, by guiding the training of specialized experts. We also show that it applies to MoE systems built on both traditional deep learning models and recent Large Language Models (LLMs). A minimal illustrative sketch of this partition, train, and synchronize scheme is included after the metadata record below.
dc.format.mimetype: application/pdf
dc.identifier.citation: Dun, Chen. "A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios." PhD thesis, Rice University, 2023. https://hdl.handle.net/1911/115431
dc.identifier.uri: https://hdl.handle.net/1911/115431
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Distributed Machine Learning
dc.subject: Federated Learning
dc.title: A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
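
The following is a minimal NumPy sketch, not taken from the thesis itself, of the partition, train, and synchronize loop described in the abstract: the hidden neurons of a toy network are randomly split across workers, each worker trains only its own subnetwork, and the updated blocks are merged back into the full model at each synchronization round. All names, model sizes, and hyperparameters are illustrative assumptions, and the workers are simulated sequentially rather than in parallel.

import numpy as np

rng = np.random.default_rng(0)

# Toy single-hidden-layer network; the hidden neurons are the units that
# get partitioned across workers.
n_in, n_hidden, n_out = 4, 8, 1
W1 = rng.normal(size=(n_in, n_hidden)) * 0.1
W2 = rng.normal(size=(n_hidden, n_out)) * 0.1

def local_sgd(W1_sub, W2_sub, X, y, lr=0.05, steps=20):
    # A few SGD steps on the subnetwork defined by the sampled hidden
    # neurons (squared loss, ReLU activation).
    for _ in range(steps):
        h = np.maximum(X @ W1_sub, 0.0)
        err = h @ W2_sub - y
        grad_W2 = h.T @ err / len(X)
        grad_W1 = X.T @ ((err @ W2_sub.T) * (h > 0)) / len(X)
        W1_sub -= lr * grad_W1
        W2_sub -= lr * grad_W2
    return W1_sub, W2_sub

# Synthetic regression data standing in for a worker's local data.
X = rng.normal(size=(64, n_in))
y = rng.normal(size=(64, n_out))

n_workers, sync_rounds = 2, 5
for _ in range(sync_rounds):
    # Randomly partition the hidden neurons across workers (no overlap).
    partition = rng.permutation(n_hidden).reshape(n_workers, -1)
    for neurons in partition:
        # Each worker trains only its subnetwork, independently of the others.
        W1_sub, W2_sub = local_sgd(W1[:, neurons].copy(),
                                   W2[neurons, :].copy(), X, y)
        # Periodic synchronization: write the updated blocks back into the
        # full model.
        W1[:, neurons] = W1_sub
        W2[neurons, :] = W2_sub

loss = 0.5 * np.mean((np.maximum(X @ W1, 0.0) @ W2 - y) ** 2)
print(f"loss after {sync_rounds} synchronization rounds: {loss:.4f}")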
Files
Original bundle:
DUN-DOCUMENT-2023.pdf (3.94 MB, Adobe Portable Document Format)
License bundle:
PROQUEST_LICENSE.txt (5.84 KB, Plain Text)
LICENSE.txt (2.97 KB, Plain Text)