Very Large Scale Bayesian Machine Learning

dc.contributor.advisor: Jermaine, Christopher [en_US]
dc.contributor.committeeMember: Nakhleh, Luay [en_US]
dc.contributor.committeeMember: Zhong, Lin [en_US]
dc.creator: Cai, Zhuhua [en_US]
dc.date.accessioned: 2016-01-06T20:43:11Z [en_US]
dc.date.available: 2016-01-06T20:43:11Z [en_US]
dc.date.created: 2014-12 [en_US]
dc.date.issued: 2014-07-30 [en_US]
dc.date.submitted: December 2014 [en_US]
dc.date.updated: 2016-01-06T20:43:11Z [en_US]
dc.description.abstract: This thesis aims to scale Bayesian machine learning (ML) to very large datasets. First, I propose a pairwise Gaussian random field model (PGRF) for high-dimensional data imputation. The PGRF is a graphical, factor-based model. Besides being highly accurate, the PGRF is more efficient and scalable than the Gaussian Markov random field model (GMRF). Experiments show that PGRF imputation followed by linear regression (LR) or a support vector machine (SVM) reduces the RMSE by 10% to 45% compared with mean imputation followed by LR or an SVM. Furthermore, the PGRF scales imputation to very large datasets distributed across a 100-machine cluster, which the GMRF and Gaussian methods could not handle at all. Unfortunately, the PGRF model is hard to implement: it took approximately 18,000 lines of Hadoop code and 4 months of distributed debugging and running. To reduce this human effort, I designed a database system called SimSQL. SimSQL supports rich analytical methods such as Bayesian ML, and scales such methods to terabytes of data distributed over 100 machines. SimSQL extends the analytical power of relational database systems while keeping their merits, such as a declarative language, transparent optimization, and automatic parallelization. SimSQL builds upon the MCDB uncertainty database and allows the definition of recursive stochastic tables. This makes SimSQL an ideal platform for Markov chain simulations and for iterative algorithms such as PageRank. To evaluate SimSQL's performance, I introduce an objective benchmark that compares SimSQL with Giraph, GraphLab, and Spark on five Bayesian ML problems. The results show that SimSQL provides the best programmability and competitive performance. To run a general Bayesian ML model, SimSQL takes 1X less code than Spark, 6X less code than GraphLab, and 12X less code than Giraph, while in the worst case it runs within a 5X slowdown of Giraph and GraphLab.
In brief, I consider both modeling and inference for large-scale Bayesian ML. The goals on both sides are the same: scaling Bayesian ML to very large datasets, achieving better performance, and reducing the time spent on the design, implementation, and execution of ML algorithms. [en_US]
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Cai, Zhuhua. "Very Large Scale Bayesian Machine Learning." (2014) Diss., Rice University. https://hdl.handle.net/1911/87724. [en_US]
dc.identifier.uri: https://hdl.handle.net/1911/87724 [en_US]
dc.language.iso: eng [en_US]
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. [en_US]
dc.subject: Bayesian inference [en_US]
dc.subject: large scale machine learning [en_US]
dc.subject: MapReduce [en_US]
dc.title: Very Large Scale Bayesian Machine Learning [en_US]
dc.type: Thesis [en_US]
dc.type.material: Text [en_US]
thesis.degree.department: Computer Science [en_US]
thesis.degree.discipline: Engineering [en_US]
thesis.degree.grantor: Rice University [en_US]
thesis.degree.level: Doctoral [en_US]
thesis.degree.name: Doctor of Philosophy [en_US]
Files
Original bundle
Name: CAI-DOCUMENT-2014.pdf
Size: 1.32 MB
Format: Adobe Portable Document Format
License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.83 KB
Format: Plain Text
Name: LICENSE.txt
Size: 2.6 KB
Format: Plain Text