Computer Systems for Distributed Machine Learning

dc.contributor.advisorJermaine, Chris
dc.creatorGao, Zekai
dc.date.accessioned2019-05-17T13:50:38Z
dc.date.available2019-05-17T13:50:38Z
dc.date.created2018-05
dc.date.issued2018-01-31
dc.date.submittedMay 2018
dc.date.updated2019-05-17T13:50:39Z
dc.description.abstractMy thesis considers the design and development of state-of-the-art distributed data analytics systems supporting the implementation and execution of machine learning algorithms. Specifically, I consider how to support iterative, large-scale machine algorithms problems on a prototype distributed relational database system called SimSQL. SimSQL allows a programmer to leverage the power of declarative programming and data independence to specify what a computation does, and not how to implement it. This increases programmer productivity and means that the same implementation can be used for different data sets of different sizes and complexities, and different hardwares. The thesis considers three specific problems in the context of adapting SimSQL for the implementation and execution of large-scale machine learning algorithms. First, during learning, when a user-defined function is parameterized with a data object and a statistical model used to process that object, the fully parameterized model can be huge. How do we deal with the potential massive blowup in size during distributed learning? Second, although the idea of data independence—a fundamental design principle upon which relational database systems are built—supports the notion of “one implementation, any model/data size and compute hardware”, such systems lack sufficient support for recursive computations in deep learning and other applications. How should such a system be modified to support these applications? Third, some key features of distributed platforms aim at more general applications in data processing and are not always the best fit for large-scale machine learning and distributed linear algebra. Can we achieve higher efficiency on these platforms by avoiding some widely existing pitfalls? My thesis addresses the issues above by first describing and studying the ubiquitous join-and-co-group pattern for user-defined function parameterization, and carefully describing the alternatives for implementing this pattern on top of both SimSQL and Apache Spark. Second, I enhance SimSQL to support declarative recursion via multidimensional tables, then modify the query optimization framework so that it can handle the massive query plans that result from complicated recursive computations. I benchmark the resulting system, comparing it with TensorFlow and Spark. Third, I examine various performance bottlenecks associated with SimSQL in running large-scale machine learning applications, and consider three enhancements in large vector-type or matrix-type data partitioning, choice of physical plans for complicated operations as well as runtime compilation respectively.
dc.format.mimetypeapplication/pdf
dc.identifier.citationGao, Zekai. "Computer Systems for Distributed Machine Learning." (2018) Diss., Rice University. <a href="https://hdl.handle.net/1911/105638">https://hdl.handle.net/1911/105638</a>.
dc.identifier.urihttps://hdl.handle.net/1911/105638
dc.language.isoeng
dc.rightsCopyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subjectDistributed systems
dc.subjectmachine learning
dc.subjectrelational database systems
dc.titleComputer Systems for Distributed Machine Learning
dc.typeThesis
dc.type.materialText
thesis.degree.departmentComputer Science
thesis.degree.disciplineEngineering
thesis.degree.grantorRice University
thesis.degree.levelDoctoral
thesis.degree.nameDoctor of Philosophy
Files
Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
GAO-DOCUMENT-2018.pdf
Size:
2.46 MB
Format:
Adobe Portable Document Format
License bundle
Now showing 1 - 2 of 2
No Thumbnail Available
Name:
PROQUEST_LICENSE.txt
Size:
5.84 KB
Format:
Plain Text
Description:
No Thumbnail Available
Name:
LICENSE.txt
Size:
2.6 KB
Format:
Plain Text
Description: