Computer Systems for Distributed Machine Learning

dc.contributor.advisor: Jermaine, Chris (en_US)
dc.creator: Gao, Zekai (en_US)
dc.date.accessioned: 2019-05-17T13:50:38Z (en_US)
dc.date.available: 2019-05-17T13:50:38Z (en_US)
dc.date.created: 2018-05 (en_US)
dc.date.issued: 2018-01-31 (en_US)
dc.date.submitted: May 2018 (en_US)
dc.date.updated: 2019-05-17T13:50:39Z (en_US)
dc.description.abstract: My thesis considers the design and development of state-of-the-art distributed data analytics systems supporting the implementation and execution of machine learning algorithms. Specifically, I consider how to support iterative, large-scale machine learning problems on a prototype distributed relational database system called SimSQL. SimSQL allows a programmer to leverage the power of declarative programming and data independence to specify what a computation does, and not how to implement it. This increases programmer productivity and means that the same implementation can be used for data sets of different sizes and complexities, and on different hardware. The thesis considers three specific problems in the context of adapting SimSQL for the implementation and execution of large-scale machine learning algorithms. First, during learning, when a user-defined function is parameterized with a data object and a statistical model used to process that object, the fully parameterized model can be huge. How do we deal with the potentially massive blowup in size during distributed learning? Second, although the idea of data independence (a fundamental design principle upon which relational database systems are built) supports the notion of "one implementation, any model/data size and compute hardware", such systems lack sufficient support for recursive computations in deep learning and other applications. How should such a system be modified to support these applications? Third, some key features of distributed platforms are aimed at more general data processing applications and are not always the best fit for large-scale machine learning and distributed linear algebra. Can we achieve higher efficiency on these platforms by avoiding some common pitfalls?
My thesis addresses the issues above by first describing and studying the ubiquitous join-and-co-group pattern for user-defined function parameterization, and carefully describing the alternatives for implementing this pattern on top of both SimSQL and Apache Spark. Second, I enhance SimSQL to support declarative recursion via multidimensional tables, then modify the query optimization framework so that it can handle the massive query plans that result from complicated recursive computations. I benchmark the resulting system, comparing it with TensorFlow and Spark. Third, I examine various performance bottlenecks associated with SimSQL in running large-scale machine learning applications, and consider three enhancements: partitioning of large vector-type or matrix-type data, choice of physical plans for complicated operations, and runtime compilation. (en_US)
dc.format.mimetype: application/pdf (en_US)
dc.identifier.citation: Gao, Zekai. "Computer Systems for Distributed Machine Learning." (2018) Diss., Rice University. https://hdl.handle.net/1911/105638. (en_US)
dc.identifier.uri: https://hdl.handle.net/1911/105638 (en_US)
dc.language.iso: eng (en_US)
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. (en_US)
dc.subject: Distributed systems (en_US)
dc.subject: machine learning (en_US)
dc.subject: relational database systems (en_US)
dc.title: Computer Systems for Distributed Machine Learning (en_US)
dc.type: Thesis (en_US)
dc.type.material: Text (en_US)
thesis.degree.department: Computer Science (en_US)
thesis.degree.discipline: Engineering (en_US)
thesis.degree.grantor: Rice University (en_US)
thesis.degree.level: Doctoral (en_US)
thesis.degree.name: Doctor of Philosophy (en_US)
Files
Original bundle
Name: GAO-DOCUMENT-2018.pdf
Size: 2.46 MB
Format: Adobe Portable Document Format
License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text
Name: LICENSE.txt
Size: 2.6 KB
Format: Plain Text