Large Scale Online Aggregation Via Distributed Systems

Date
2014-12-04
Abstract

From movie recommendations to fraud detection to personalized health care, there is a growing need to analyze huge amounts of data quickly. To deal with such volumes, many analysts use MapReduce, a software framework that parallelizes computations across a compute cluster. However, due to the sheer volume of data, MapReduce is sometimes still not fast enough to perform complicated analyses. In this thesis, I address this problem by developing a statistical estimation framework on top of MapReduce to enable interactive data analysis. I present three projects I have worked on under this topic.

In the first project, I consider extending Online Aggregation (OLA) to a MapReduce environment. OLA allows the user to compute an arbitrary aggregation function over a data set and outputs probabilistic bounds on accuracy in an online fashion. OLA in a relational database system uses classical sampling theory to estimate confidence bounds. The key difference in a large-scale distributed computing environment is the importance of block-based processing. At a random time instant, the system is likely to be in the middle of processing blocks that take longer; hence, blocks that take longer to process are less likely to be taken into account when an estimate is generated. Since one might expect correlation between the processing time of a block and its aggregated value, the estimates for the aggregate can be biased; this is an instance of the inspection paradox. To address it, I propose a Bayesian model that uses a joint prior over the values to be aggregated and the time taken to process and schedule each block. Because the model takes timing information into account, the bias is removed. This model is implemented on Hyracks, an open-source project similar to Hadoop, the most popular implementation of MapReduce.
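
To make the inspection-paradox bias concrete, the following Python sketch simulates it. All of the specifics here (the exponential timing distribution, the strength of the value-time correlation, the worker count) are invented for the demonstration and are not from the thesis; the crude printout only shows the direction of the bias, not the thesis's Bayesian correction.

import random

random.seed(42)

# Hypothetical setup: a block's aggregate value is correlated with its
# processing time (e.g., bigger blocks take longer AND contribute more).
N_BLOCKS, N_WORKERS, SNAPSHOT_T = 5000, 50, 20.0
blocks = []
for _ in range(N_BLOCKS):
    t = random.expovariate(0.5)            # processing time of the block
    v = 5.0 * t + random.gauss(0.0, 1.0)   # value grows with time
    blocks.append((t, v))
true_mean = sum(v for _, v in blocks) / N_BLOCKS

# Workers pull blocks in random order; snapshot the system at SNAPSHOT_T.
# A worker is more likely to be caught INSIDE a long block, so long blocks
# are under-represented among the completed ones.
random.shuffle(blocks)
completed = []
for w in range(N_WORKERS):
    clock = 0.0
    for t, v in blocks[w::N_WORKERS]:
        if clock + t > SNAPSHOT_T:
            break                          # this long block is still in flight
        clock += t
        completed.append((t, v))

naive = sum(v for _, v in completed) / len(completed)
print(f"true mean : {true_mean:.2f}")
print(f"naive OLA : {naive:.2f}  (biased low: long, high-value blocks missing)")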

In the second project, I consider implementing gradient descent on top of MapReduce. Gradient descent is an optimization algorithm that finds a local minimum of a function L(w) by starting from an initial point w_0 and repeatedly taking steps in the direction of the negative gradient of the function to be optimized. Each computation of the gradient is referred to as an epoch, and gradient descent computes many epochs iteratively until completion. If the number of data points N is very large, it can take a long time to compute the aggregate for an epoch k. Since the gradient computation is essentially a user-defined aggregate function, the OLA framework developed in the first part of my thesis can be used to speed up this algorithm in a MapReduce framework. The key technical question that must be answered is "When do we stop the OLA estimation for a given epoch?" In this thesis, I propose and evaluate a new statistical model for addressing this question.
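
As a sketch of how OLA can drive an epoch, the following Python example estimates the gradient from sampled blocks and stops early once the estimate looks reliable. It is a minimal illustration only: the least-squares objective, block size, and relative-standard-error stopping rule are assumptions made for the example, not the statistical model proposed in the thesis.

import numpy as np

def ola_gradient(w, X, y, rel_tol=0.05, block_size=256, rng=None):
    """Estimate the full-data least-squares gradient from sampled blocks,
    stopping early once the relative standard error is small.
    The stopping rule here is illustrative only."""
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.permutation(len(X))
    block_grads = []
    for start in range(0, len(X), block_size):
        b = idx[start:start + block_size]
        resid = X[b] @ w - y[b]
        block_grads.append(2.0 * X[b].T @ resid / len(b))  # per-block gradient
        if len(block_grads) >= 5:       # need a few blocks to estimate spread
            g = np.mean(block_grads, axis=0)
            se = np.std(block_grads, axis=0, ddof=1) / np.sqrt(len(block_grads))
            if np.linalg.norm(se) < rel_tol * np.linalg.norm(g):
                break                   # estimate is reliable; end the epoch early
    return np.mean(block_grads, axis=0)

# Toy usage: gradient descent where each epoch's gradient is an OLA estimate.
rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=100_000)
w = np.zeros(10)
for epoch in range(50):
    w -= 0.05 * ola_gradient(w, X, y, rng=rng)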

Finally, I design, implement, and evaluate a particular machine learning algorithm. An extremely popular feature selection methodology is topic modeling. A topic is defined as a probability distribution over sets of words or phrases, and each document in the corpus is drawn from a mixture of these topics. A topic model for a corpus specifies the set of topics, as well as the proportion in which they are present in any given document. The recent interest in topic models has been driven by the explosion of electronic, text-based data available for analysis. From web pages to emails to microblogs, text-based data are everywhere. However, not all electronically available natural language corpora are text-based. In my thesis, I consider the problem of learning topic models over spoken language. My work is motivated by our involvement with the Spoken Web (also called the World Wide Telecomm Web), which allows users in rural India to post farming-related questions and responses to an audio forum using mobile phones. I propose a new topic model that leverages the statistical algorithms used in most modern speech-to-text software. I develop an alternative version of the popular LDA topic model, called the spoken topic model, or STM for short. This model uses a Bayesian interpretation of the output of speech-to-text software, taking into account the software's explicit uncertainty description (the candidate phrases and their weights) in a principled fashion.
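
The following Python sketch illustrates the general idea of folding a recognizer's weighted candidates into an LDA-style sampler: each spoken token carries a list of candidate words with confidence weights, and a collapsed Gibbs sweep treats the word identity itself as latent alongside its topic. This is a minimal sketch only; the toy data, vocabulary, priors, and the particular update are my own illustrative assumptions, not the STM model as specified in the thesis.

import numpy as np

rng = np.random.default_rng(0)
V, K, ALPHA, BETA = 50, 3, 0.1, 0.01   # vocab size, topics, Dirichlet priors

# Hypothetical ASR output: each token is a list of (word_id, weight)
# candidates, i.e., the recognizer's explicit uncertainty description.
docs = [[[(rng.integers(V), 0.7), (rng.integers(V), 0.3)]
         for _ in range(20)] for _ in range(5)]

# Counts for collapsed Gibbs sampling; word identity v and topic z are
# BOTH latent, initialized from the top candidate and a random topic.
n_dk = np.zeros((len(docs), K)); n_kv = np.zeros((K, V)); n_k = np.zeros(K)
state = []
for d, doc in enumerate(docs):
    for i, cands in enumerate(doc):
        v, z = cands[0][0], rng.integers(K)
        state.append((d, i, v, z))
        n_dk[d, z] += 1; n_kv[z, v] += 1; n_k[z] += 1

for _ in range(100):                              # Gibbs sweeps
    for s, (d, i, v, z) in enumerate(state):
        n_dk[d, z] -= 1; n_kv[z, v] -= 1; n_k[z] -= 1
        cands = docs[d][i]
        # Joint conditional over (candidate word, topic): the ASR weight
        # times the usual collapsed-LDA factor for that word and topic.
        p = np.array([[wt * (n_dk[d, k] + ALPHA) * (n_kv[k, w] + BETA)
                       / (n_k[k] + V * BETA)
                       for k in range(K)] for w, wt in cands])
        flat = p.ravel() / p.sum()
        c = rng.choice(len(flat), p=flat)
        v, z = cands[c // K][0], c % K            # row = candidate, col = topic
        state[s] = (d, i, v, z)
        n_dk[d, z] += 1; n_kv[z, v] += 1; n_k[z] += 1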

Degree
Doctor of Philosophy
Type
Thesis
Keywords
Online aggregation, Gradient Descent, MapReduce, Hadoop, Topic Models
Citation

Pansare, Niketan. "Large Scale Online Aggregation Via Distributed Systems." (2014) Diss., Rice University. https://hdl.handle.net/1911/88434.

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.