Computer Systems for Distributed Machine Learning
Date
Authors
Abstract
My thesis considers the design and development of state-of-the-art distributed data analytics systems that support the implementation and execution of machine learning algorithms. Specifically, I consider how to support iterative, large-scale machine learning algorithms on a prototype distributed relational database system called SimSQL. SimSQL allows a programmer to leverage the power of declarative programming and data independence to specify what a computation does rather than how to implement it. This increases programmer productivity and means that the same implementation can be used for data sets of different sizes and complexities, and on different hardware.

The thesis considers three specific problems in the context of adapting SimSQL for the implementation and execution of large-scale machine learning algorithms. First, during learning, when a user-defined function is parameterized with a data object and the statistical model used to process that object, the fully parameterized model can be huge. How do we deal with the potentially massive blowup in size during distributed learning? Second, although the idea of data independence (a fundamental design principle upon which relational database systems are built) supports the notion of “one implementation, any model/data size and compute hardware”, such systems lack sufficient support for the recursive computations that arise in deep learning and other applications. How should such a system be modified to support these applications? Third, some key features of distributed platforms target general-purpose data processing and are not always the best fit for large-scale machine learning and distributed linear algebra. Can we achieve higher efficiency on these platforms by avoiding some common pitfalls?

My thesis addresses these issues as follows. First, I describe and study the ubiquitous join-and-co-group pattern for user-defined function parameterization, and carefully describe the alternatives for implementing this pattern on top of both SimSQL and Apache Spark. Second, I enhance SimSQL to support declarative recursion via multidimensional tables, then modify the query optimization framework so that it can handle the massive query plans that result from complicated recursive computations; I benchmark the resulting system against TensorFlow and Spark. Third, I examine various performance bottlenecks that arise when SimSQL runs large-scale machine learning applications, and consider three enhancements: partitioning of large vector- and matrix-typed data, choice of physical plans for complicated operations, and runtime compilation.
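To make the join-and-co-group pattern mentioned above concrete, the sketch below shows one way it might look in PySpark. It is an illustrative example only, not code from the thesis: the names (data, model, apply_udf) and the toy inputs are hypothetical, and it simply shows how cogroup can bring each data object together with the model part it needs, one way the size blowup discussed above can be reasoned about.

    # Illustrative PySpark sketch of a join-and-co-group style UDF
    # parameterization; all names are hypothetical and this is not
    # code from the thesis.
    from pyspark import SparkContext

    sc = SparkContext(appName="join-and-cogroup-sketch")

    # (key, data_object) pairs: data objects keyed by the model part they need.
    data = sc.parallelize([(0, "doc_a"), (1, "doc_b"), (0, "doc_c")])

    # (key, model_part) pairs: per-key parameters of a statistical model.
    model = sc.parallelize([(0, {"w": 0.1}), (1, {"w": 0.7})])

    def apply_udf(group):
        # group is (key, (iterable of data objects, iterable of model parts));
        # pair each data object with its (single) model part.
        key, (docs, parts) = group
        parts = list(parts)
        return [(key, doc, parts[0]) for doc in docs]

    # cogroup ships each model part to its group once, rather than duplicating
    # it alongside every matching data object as a plain join would.
    parameterized = data.cogroup(model).flatMap(apply_udf)
    print(parameterized.collect())

    sc.stop()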
Description
Advisor
Degree
Type
Keywords
Citation
Gao, Zekai. "Computer Systems for Distributed Machine Learning." (2018) Diss., Rice University. https://hdl.handle.net/1911/105638.