Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference

dc.contributor.advisor Jermaine, Christopher M.
dc.contributor.committeeMember Nakhleh, Luay K.
dc.contributor.committeeMember Chaudhuri, Swarat
dc.contributor.committeeMember Allen, Genevera
dc.creator Drummond, Anna
dc.date.accessioned 2014-08-26T18:45:29Z
dc.date.available 2014-08-26T18:45:29Z
dc.date.created 2014-05
dc.date.issued 2014-04-25
dc.date.submitted May 2014
dc.date.updated 2014-08-26T18:45:31Z
dc.description.abstract This work concentrates on mining textual data. In particular, I apply Statistical Machine Learning to document clustering, predictive modeling, and document classification tasks undertaken in three different application domains. I have designed novel statistical Bayesian models for each application domain, and derived Markov Chain Monte Carlo (MCMC) algorithms for model inference. First, I investigate the usefulness of topic models, such as the popular Latent Dirichlet Allocation (LDA) and its extensions, as a pre-processing feature selection step for unsupervised document clustering. Documents are clustered using the proportions of the various topics present in each document; the topic proportion vectors are then used as input to an unsupervised clustering algorithm. I analyze two approaches to topic model design in the pre-processing step: (1) a traditional topic model, such as LDA; and (2) a novel topic model integrating a discrete mixture to simultaneously learn the clustering structure and a topic model that is conducive to the learned structure. I propose two variants of the second approach, one of which is experimentally found to be the best option. Given that clustering is one of the most common data mining tasks, it is a natural application for topic modeling. Second, I focus on automatically evaluating the quality of programming assignments produced by students in a Massive Open Online Course (MOOC), specifically an interactive game programming course, where automated test-based grading is not applicable due to the character of the assignments (i.e., interactive computer games). Automatically evaluating interactive computer games is not easy because such programs lack any sort of well-defined logical specification, so it is difficult to devise a testing platform that can play a student-coded game to determine whether it is correct. I propose a stochastic model that, given a set of user-defined metrics and graded example programs, can learn, without running the programs and without a grading rubric, to assign scores to ungraded assignments that are predictive of what a human (i.e., a peer grader) would give. The main goal of the third problem I consider is email/document classification. I concentrate on incorporating information about the senders, receivers, and authors of a document to solve a supervised classification problem. I propose a novel vectorized representation for the people associated with a document. People are placed in a latent space of a chosen dimensionality and have a set of weights specific to the roles they can play (e.g., in the email case, the categories would be TO, FROM, CC, and BCC). The latent space positions, together with the weights, are used to map a set of people to a vector by taking a weighted average. In particular, a multi-labeled email classification problem is considered, where an email can be relevant to all, some, or none of the desired categories. I develop three stochastic models that can be used to learn to predict multiple labels, taking into account correlations.
dc.format.mimetype application/pdf
dc.identifier.citation Drummond, Anna. "Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference." (2014) Diss., Rice University. https://hdl.handle.net/1911/76711.
dc.identifier.uri https://hdl.handle.net/1911/76711
dc.language.iso eng
dc.rights Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject Bayesian modeling
dc.subject Text mining
dc.subject Machine learning
dc.subject MCMC
dc.title Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference
dc.type Thesis
dc.type.material Text
thesis.degree.department Computer Science
thesis.degree.discipline Engineering
thesis.degree.grantor Rice University
thesis.degree.level Doctoral
thesis.degree.name Doctor of Philosophy
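
Illustrative note on the abstract: the mapping from a set of people to a vector "by taking a weighted average" can be sketched as below. This is only a minimal sketch, not the thesis's implementation. The names, the 2-dimensional latent positions, the weight values, and the choice to normalize by the sum of the weights are all assumptions made for illustration; in the thesis these quantities are learned through MCMC inference rather than fixed by hand.

```python
from typing import Dict, List, Tuple

# Hypothetical latent positions (2-D for readability); in the thesis these
# would be inferred, not hand-picked.
positions: Dict[str, List[float]] = {
    "alice": [1.0, 0.0],
    "bob": [0.0, 1.0],
    "carol": [1.0, 1.0],
}

# Hypothetical role-specific weights for each person (TO, FROM, CC, BCC).
role_weights: Dict[str, Dict[str, float]] = {
    "alice": {"FROM": 2.0, "TO": 1.0, "CC": 1.0, "BCC": 1.0},
    "bob": {"FROM": 2.0, "TO": 1.0, "CC": 1.0, "BCC": 1.0},
    "carol": {"FROM": 2.0, "TO": 1.0, "CC": 1.0, "BCC": 1.0},
}


def people_to_vector(participants: List[Tuple[str, str]]) -> List[float]:
    """Map (person, role) pairs to one vector via a weighted average.

    Assumption: the average is normalized by the sum of the weights.
    """
    dim = len(next(iter(positions.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for person, role in participants:
        w = role_weights[person][role]
        weight_sum += w
        for k in range(dim):
            total[k] += w * positions[person][k]
    if weight_sum == 0.0:
        # No participants: return the zero vector.
        return total
    return [t / weight_sum for t in total]


# One email: alice sends to bob and copies carol.
email = [("alice", "FROM"), ("bob", "TO"), ("carol", "CC")]
vector = people_to_vector(email)
print(vector)
```

In this sketch the sender's position counts twice as much as each recipient's, so the resulting vector is pulled toward the sender; a classifier could then take this vector as an extra feature alongside the document text.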
Files
Original bundle
Name: DRUMMOND-DOCUMENT-2014.pdf
Size: 1.57 MB
Format: Adobe Portable Document Format