Large Scale Online Aggregation Via Distributed Systems

Pansare, Niketan

Large Scale Online Aggregation Via Distributed Systems

dc.contributor.advisor	Jermaine, Chris	en_US
dc.contributor.committeeMember	Nakhleh, Luay	en_US
dc.contributor.committeeMember	Merényi, Erzsébet	en_US
dc.creator	Pansare, Niketan	en_US
dc.date.accessioned	2016-02-05T22:13:49Z	en_US
dc.date.available	2016-02-05T22:13:49Z	en_US
dc.date.created	2014-12	en_US
dc.date.issued	2014-12-04	en_US
dc.date.submitted	December 2014	en_US
dc.date.updated	2016-02-05T22:13:49Z	en_US
dc.description.abstract	From movie recommendations to fraud detection to personalized health care, there is growing need to analyze huge amounts of data quickly. To deal with huge amounts of data, many analysts use MapReduce, a software framework that parallelizes computations across a compute cluster. However, due to the sheer volume of data, MapReduce is sometimes still not fast enough to perform complicated analysis. In this thesis, I address this problem by developing a statistical estimation framework on top of MapReduce to provide for interactive data analysis. I present three projects I have worked on under this topic. In the first project, I consider extending Online Aggregation (OLA) to a MapReduce environment. Online aggregation (OLA) allows the user to compute an arbitrary aggregation function over a data set and output probabilistic bounds on accuracy in online fashion. OLA in a relational database system uses classical sampling theory to estimate confidence bounds. The key difference in a large-scale distributed computing environment is the importance of block-based processing. At a random time instance, the system is likely processing blocks that take longer to process. Hence blocks that take longer to process are less likely to be taken into account when an estimate is generated. Since one might expect correlation between processing time and the aggregated value of a block, the estimates for the aggregate can be biased. To address the inspection paradox, I propose a Bayesian model that utilizes a joint prior over the values to be aggregated and time take to process/schedule each block. Since the model is taking timing information into account, the bias is removed. This model is implemented on Hyracks, an open-source project similar to Hadoop, the most popular implementation of MapReduce. In the second project, I consider implementing gradient descent on top of MapReduce. Gradient descent is an optimization algorithm that finds the local minima of a function L(w) by starting with an initial point $w_0$ and then taking steps in direction of negative gradient of the function to be optimized. The computation of the gradient is referred to as an epoch and gradient descent computes many epochs iteratively until completion. If the number of data points N is very large it can take lot of time to compute the aggregate for an epoch k. Since the gradient descent algorithm is essentially a user-defined aggregate function, the OLA framework developed in the first part of my thesis can be used to speed up this algorithm in a MapReduce framework. The key technical question that must be answered is “When do we stop the OLA estimation for a given epoch?”. In this thesis, I propose and evaluate a new statistical model for addressing this question Finally, I design, implement, and evaluate a particular machine learning algorithm. An extremely popular feature selection methodology is topic modeling . A topic is defined as a probability distribution over sets of words or phrases and each document in the corpus is drawn from mixture of these topics. A topic model for a corpus specifies the set of topics, as well as the proportion in which they are present in any given document. The recent interest in topic models has been driven by the explosion of electronic, text based data that are available for analysis. From web pages to emails to microblogs, text-based data are everywhere. However, not all electronically-available natural language corpora are text-based. In my thesis, I consider the problem of learning topic models over spoken language. My work is motivated by our involvement with the Spoken Web (also called the World Wide Telecomm Web), which allows users in rural India to post farming-related questions and responses to an audio forum using mobile phones. I propose a new topic model that leverages the statistical algorithms used in most modern speech-to-text software. I develop alternative version of the popular LDA topic model called the spoken topic model, or STM for short. This model uses a Bayesian interpretation of the output of a speech-to-text software that takes into account the software's explicit uncertainty description (the phrases and weights) in a principled fashion.	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.citation	Pansare, Niketan. "Large Scale Online Aggregation Via Distributed Systems." (2014) Diss., Rice University. <a href="https://hdl.handle.net/1911/88434">https://hdl.handle.net/1911/88434</a>.	en_US
dc.identifier.uri	https://hdl.handle.net/1911/88434	en_US
dc.language.iso	eng	en_US
dc.rights	Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.	en_US
dc.subject	Online aggregation	en_US
dc.subject	Gradient Descent	en_US
dc.subject	MapReduce	en_US
dc.subject	Hadoop	en_US
dc.subject	Topic Models	en_US
dc.title	Large Scale Online Aggregation Via Distributed Systems	en_US
dc.type	Thesis	en_US
dc.type.material	Text	en_US
thesis.degree.department	Computer Science	en_US
thesis.degree.discipline	Engineering	en_US
thesis.degree.grantor	Rice University	en_US
thesis.degree.level	Doctoral	en_US
thesis.degree.name	Doctor of Philosophy	en_US

Files

Original bundle

Now showing 1 - 1 of 1

Name:: PANSARE-DOCUMENT-2014.pdf
Size:: 2.9 MB
Format:: Adobe Portable Document Format

Download

License bundle

Now showing 1 - 2 of 2

Name:: PROQUEST_LICENSE.txt
Size:: 5.84 KB
Format:: Plain Text
Description:

Download

Name:: LICENSE.txt
Size:: 2.61 KB
Format:: Plain Text
Description:

Download

Collections

Rice University Theses and Dissertations