Very Large Scale Bayesian Machine Learning

Cai, Zhuhua

dc.contributor.advisor	Jermaine, Christopher	en_US
dc.contributor.committeeMember	Nakhleh, Luay	en_US
dc.contributor.committeeMember	Zhong, Lin	en_US
dc.creator	Cai, Zhuhua	en_US
dc.date.accessioned	2016-01-06T20:43:11Z	en_US
dc.date.available	2016-01-06T20:43:11Z	en_US
dc.date.created	2014-12	en_US
dc.date.issued	2014-07-30	en_US
dc.date.submitted	December 2014	en_US
dc.date.updated	2016-01-06T20:43:11Z	en_US
dc.description.abstract	This thesis aims to scale Bayesian machine learning (ML) to very large datasets. First, I propose a pairwise Gaussian random field model (PGRF) for high-dimensional data imputation. The PGRF is a graphical, factor-based model. In addition to its high accuracy, the PGRF is more efficient and scalable than the Gaussian Markov random field model (GMRF). Experiments show that the PGRF followed by linear regression (LR) or a support vector machine (SVM) reduces the RMSE by 10% to 45% compared with mean imputation followed by LR or an SVM. Furthermore, the PGRF scales imputation to very large datasets distributed across a 100-machine cluster, which the GMRF or Gaussian methods could not handle at all. Unfortunately, the PGRF model is hard to implement: it took approximately 18,000 lines of Hadoop code and 4 months of distributed debugging and running. To reduce this human effort, I designed a database system called SimSQL. SimSQL supports rich analytical methods such as Bayesian ML, and scales such methods to terabytes of data distributed over 100 machines. SimSQL extends the analytical power of relational database systems while keeping their merits, such as a declarative language, transparent optimization, and automatic parallelization. SimSQL builds upon the MCDB uncertainty database and allows the definition of recursive stochastic tables. SimSQL is an ideal platform for Markov chain simulations or iterative algorithms such as PageRank. To show SimSQL's performance, I introduce an objective benchmark that compares SimSQL with Giraph, GraphLab, and Spark on five Bayesian ML problems. The results show that SimSQL provides the best programmability and competitive performance. To run a general Bayesian ML model, SimSQL takes 1X less code than Spark, 6X less than GraphLab, and 12X less code than Giraph, while its running time is within a 5X slowdown in the worst case compared with Giraph and GraphLab.
In brief, I consider both modeling and inference for large-scale Bayesian ML. The goals on both sides are the same: scaling Bayesian ML to very large datasets, achieving better performance, and reducing the time spent on the design, implementation, and execution of ML algorithms.	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.citation	Cai, Zhuhua. "Very Large Scale Bayesian Machine Learning." (2014) Diss., Rice University. https://hdl.handle.net/1911/87724.	en_US
dc.identifier.uri	https://hdl.handle.net/1911/87724	en_US
dc.language.iso	eng	en_US
dc.rights	Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.	en_US
dc.subject	Bayesian inference	en_US
dc.subject	large scale machine learning	en_US
dc.subject	MapReduce	en_US
dc.title	Very Large Scale Bayesian Machine Learning	en_US
dc.type	Thesis	en_US
dc.type.material	Text	en_US
thesis.degree.department	Computer Science	en_US
thesis.degree.discipline	Engineering	en_US
thesis.degree.grantor	Rice University	en_US
thesis.degree.level	Doctoral	en_US
thesis.degree.name	Doctor of Philosophy	en_US

Files

Original bundle

Name: CAI-DOCUMENT-2014.pdf
Size: 1.32 MB
Format: Adobe Portable Document Format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 5.83 KB
Format: Plain Text

Name: LICENSE.txt
Size: 2.6 KB
Format: Plain Text

Collections

Rice University Theses and Dissertations