Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference

Drummond, Anna

dc.contributor.advisor	Jermaine, Christopher M.	en_US
dc.contributor.committeeMember	Nakhleh, Luay K.	en_US
dc.contributor.committeeMember	Chaudhuri, Swarat	en_US
dc.contributor.committeeMember	Allen, Genevera	en_US
dc.creator	Drummond, Anna	en_US
dc.date.accessioned	2014-08-26T18:45:29Z	en_US
dc.date.available	2014-08-26T18:45:29Z	en_US
dc.date.created	2014-05	en_US
dc.date.issued	2014-04-25	en_US
dc.date.submitted	May 2014	en_US
dc.date.updated	2014-08-26T18:45:31Z	en_US
dc.description.abstract	This work concentrates on mining textual data. In particular, I apply Statistical Machine Learning to document clustering, predictive modeling, and document classification tasks undertaken in three different application domains. I have designed novel statistical Bayesian models for each application domain and derived Markov Chain Monte Carlo (MCMC) algorithms for model inference. First, I investigate the usefulness of topic models, such as the popular Latent Dirichlet Allocation (LDA) and its extensions, as a pre-processing feature selection step for unsupervised document clustering. Documents are clustered using the proportions of the various topics present in each document; the topic proportion vectors are used as input to an unsupervised clustering algorithm. I analyze two approaches to topic model design for the pre-processing step: (1) a traditional topic model, such as LDA, and (2) a novel topic model that integrates a discrete mixture to simultaneously learn the clustering structure and a topic model conducive to the learned structure. I propose two variants of the second approach, one of which is experimentally found to be the best option. Given that clustering is one of the most common data mining tasks, it is a natural application for topic modeling. Second, I focus on automatically evaluating the quality of programming assignments produced by students in a Massive Open Online Course (MOOC), specifically an interactive game programming course, where automated test-based grading is not applicable due to the character of the assignments (i.e., interactive computer games). Automatically evaluating interactive computer games is difficult because such programs lack any sort of well-defined logical specification, so it is hard to devise a testing platform that can play a student-coded game to determine whether it is correct.
I propose a stochastic model that, given a set of user-defined metrics and graded example programs, can learn, without running the programs and without a grading rubric, to assign scores to ungraded assignments that are predictive of what a human (i.e., a peer grader) would give. The main goal of the third problem I consider is email/document classification. I concentrate on incorporating information about the senders, receivers, and authors of a document to solve a supervised classification problem. I propose a novel vectorized representation for the people associated with a document. People are placed in a latent space of a chosen dimensionality and have a set of weights specific to the roles they can play (e.g., in the email case, the role categories would be TO, FROM, CC, and BCC). The latent space positions together with the weights are used to map a set of people to a vector by taking a weighted average. In particular, a multi-labeled email classification problem is considered, where an email can be relevant to all, some, or none of the desired categories. I develop three stochastic models that can be used to learn to predict multiple labels, taking into account correlations.	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.citation	Drummond, Anna. "Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference." (2014) Diss., Rice University. https://hdl.handle.net/1911/76711.	en_US
dc.identifier.uri	https://hdl.handle.net/1911/76711	en_US
dc.language.iso	eng	en_US
dc.rights	Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.	en_US
dc.subject	Bayesian modeling	en_US
dc.subject	Text mining	en_US
dc.subject	Machine learning	en_US
dc.subject	MCMC	en_US
dc.title	Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference	en_US
dc.type	Thesis	en_US
dc.type.material	Text	en_US
thesis.degree.department	Computer Science	en_US
thesis.degree.discipline	Engineering	en_US
thesis.degree.grantor	Rice University	en_US
thesis.degree.level	Doctoral	en_US
thesis.degree.name	Doctor of Philosophy	en_US
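
The abstract describes mapping a set of people to a single vector by taking a weighted average of their latent space positions, with weights specific to each person's role. The following is only a minimal sketch of that idea and not the thesis's actual model: the function name `embed_people`, the data layout, the example values, and the choice to normalize by the sum of the weights are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def embed_people(
    participants: List[Tuple[str, str]],
    positions: Dict[str, List[float]],
    role_weights: Dict[str, Dict[str, float]],
) -> List[float]:
    """Map (person, role) pairs to one vector via a weighted average.

    Hypothetical layout: positions[person] is that person's latent
    position; role_weights[person][role] is that person's weight when
    acting in the given role (e.g. TO, FROM, CC, BCC).
    """
    dim = len(next(iter(positions.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for person, role in participants:
        w = role_weights[person][role]
        weight_sum += w
        for k, x in enumerate(positions[person]):
            total[k] += w * x
    if weight_sum == 0.0:
        # Assumption: with no weight at all, fall back to the zero vector.
        return total
    return [t / weight_sum for t in total]


# Made-up example values in a 2-dimensional latent space.
positions = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
role_weights = {
    "alice": {"FROM": 3.0, "TO": 1.0},
    "bob": {"FROM": 2.0, "TO": 1.0},
}

# An email sent FROM alice TO bob: weights 3.0 and 1.0.
email_vector = embed_people(
    [("alice", "FROM"), ("bob", "TO")], positions, role_weights
)
print(email_vector)
```

In a setup like this, the resulting vector could be passed to a classifier alongside text features. How the positions and weights are actually learned is part of the thesis and is not reproduced here.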

Files

Original bundle

Name: DRUMMOND-DOCUMENT-2014.pdf
Size: 1.57 MB
Format: Adobe Portable Document Format

License bundle

Name: LICENSE.txt
Size: 2.61 KB
Format: Plain Text

Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text

Collections

Rice University Theses and Dissertations