Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

dc.contributor.advisor: Shrivastava, Anshumali
dc.creator: Yan, Minghao
dc.date.accessioned: 2022-09-23T20:48:38Z
dc.date.available: 2022-09-23T20:48:38Z
dc.date.created: 2022-05
dc.date.issued: 2022-04-21
dc.date.submitted: May 2022
dc.date.updated: 2022-09-23T20:48:38Z
dc.description.abstract: More than 70% of cloud computing capacity is paid for but sits idle. A large fraction of this idle compute consists of cheap CPUs with few cores that go unused during off-peak hours. This paper aims to put those CPU cycles to work training heavyweight AI models. Our goal runs counter to mainstream frameworks, which focus on leveraging expensive, specialized, ultra-high-bandwidth interconnects to address the communication bottleneck in distributed neural network training. This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth. We build upon the adaptive sparse training framework introduced by the SLIDE algorithm. By carefully deploying sparsity over distributed nodes, we demonstrate model-parallel training several orders of magnitude faster than Horovod, the main engine behind most commercial distributed-training software. We show that, with the communication reduction afforded by sparsity, we can train a model with close to a billion parameters on simple 4-16 core CPU nodes connected by a basic low-bandwidth interconnect. Moreover, the training time is on par with some of the best hardware accelerators.
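The abstract's central idea, using sparsity so that model-parallel workers exchange far less data, can be illustrated with a minimal, single-process sketch. The snippet below uses a SimHash-style LSH table (the hash family used in the SLIDE line of work) to pick a small set of active neurons per input and compares the payload a node would ship for a dense activation vector versus a sparse (index, value) list. All dimensions, hash parameters, and names are illustrative assumptions; this is not the thesis' actual implementation or communication layer.

import numpy as np

rng = np.random.default_rng(0)

D_IN, D_OUT = 1024, 4096      # layer dimensions (illustrative)
K_BITS, ACTIVE = 12, 64       # hash length; neurons kept per input

# One node owns this weight shard in a model-parallel layout.
W = rng.standard_normal((D_OUT, D_IN)).astype(np.float32)

# SimHash: bucket neurons by the sign pattern of random projections.
planes = rng.standard_normal((K_BITS, D_IN)).astype(np.float32)
powers = 2 ** np.arange(K_BITS)              # bit weights for packing a bucket key
neuron_keys = (W @ planes.T > 0) @ powers    # (D_OUT,) bucket key per neuron

buckets = {}
for neuron, key in enumerate(neuron_keys):
    buckets.setdefault(int(key), []).append(neuron)

def active_neurons(x):
    # Neurons sharing the input's hash bucket form the active set; pad randomly if short.
    key = int((x @ planes.T > 0) @ powers)
    cand = list(buckets.get(key, []))
    if len(cand) < ACTIVE:
        cand += rng.choice(D_OUT, ACTIVE - len(cand), replace=False).tolist()
    return np.array(cand[:ACTIVE])

x = rng.standard_normal(D_IN).astype(np.float32)
idx = active_neurons(x)
sparse_out = W[idx] @ x                      # compute only the active neurons

# Communication: dense ships every float32 activation; sparse ships (int32 index, float32 value).
dense_bytes = D_OUT * 4
sparse_bytes = ACTIVE * (4 + 4)
print(f"dense payload:  {dense_bytes} bytes")
print(f"sparse payload: {sparse_bytes} bytes ({dense_bytes // sparse_bytes}x smaller)")

With these illustrative numbers, each node ships roughly 32x fewer bytes per activation exchange, which is the kind of reduction that lets commodity CPU nodes on low-bandwidth links keep up with training rather than stalling on communication.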
dc.format.mimetype: application/pdf
dc.identifier.citation: Yan, Minghao. "Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity." (2022) Master's Thesis, Rice University. https://hdl.handle.net/1911/113304.
dc.identifier.uri: https://hdl.handle.net/1911/113304
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Deep Learning
dc.subject: Distributed Training
dc.title: Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Masters
thesis.degree.name: Master of Science
Files
Original bundle: YAN-DOCUMENT-2022.pdf (1.4 MB, Adobe Portable Document Format)
License bundle: PROQUEST_LICENSE.txt (5.84 KB, Plain Text); LICENSE.txt (2.6 KB, Plain Text)