Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference

dc.contributor.advisor: Jermaine, Christopher M. (en_US)
dc.contributor.committeeMember: Nakhleh, Luay K. (en_US)
dc.contributor.committeeMember: Chaudhuri, Swarat (en_US)
dc.contributor.committeeMember: Allen, Genevera (en_US)
dc.creator: Drummond, Anna (en_US)
dc.date.accessioned: 2014-08-26T18:45:29Z (en_US)
dc.date.available: 2014-08-26T18:45:29Z (en_US)
dc.date.created: 2014-05 (en_US)
dc.date.issued: 2014-04-25 (en_US)
dc.date.submitted: May 2014 (en_US)
dc.date.updated: 2014-08-26T18:45:31Z (en_US)
dc.description.abstract: This work concentrates on mining textual data. In particular, I apply Statistical Machine Learning to document clustering, predictive modeling, and document classification tasks undertaken in three different application domains. I have designed novel statistical Bayesian models for each application domain, as well as derived Markov Chain Monte Carlo (MCMC) algorithms for the model inference. First, I investigate the usefulness of topic models, such as the popular Latent Dirichlet Allocation (LDA) and its extensions, as a pre-processing feature selection step for unsupervised document clustering. Documents are clustered using the proportion of the various topics that are present in each document; the topic proportion vectors are then used as an input to an unsupervised clustering algorithm. I analyze two approaches to topic model design utilized in the pre-processing step: (1) a traditional topic model, such as LDA, and (2) a novel topic model integrating a discrete mixture to simultaneously learn the clustering structure and the topic model that is conducive to the learned structure. I propose two variants of the second approach, one of which is experimentally found to be the best option. Given that clustering is one of the most common data mining tasks, it seems like an obvious application for topic modeling. Second, I focus on automatically evaluating the quality of programming assignments produced by students in a Massive Open Online Course (MOOC), specifically an interactive game programming course, where automated test-based grading is not applicable due to the character of the assignments (i.e., interactive computer games). Automatically evaluating interactive computer games is not easy because such programs lack any sort of well-defined logical specification, so it is difficult to devise a testing platform that can play a student-coded game to determine whether it is correct. I propose a stochastic model that, given a set of user-defined metrics and graded example programs, can learn, without running the programs and without a grading rubric, to assign scores that are predictive of what a human (i.e., peer-grader) would give to ungraded assignments. The main goal of the third problem I consider is email/document classification. I concentrate on incorporating the information about senders/receivers/authors of a document to solve a supervised classification problem. I propose a novel vectorized representation for people associated with a document. People are placed in a latent space of a chosen dimensionality and have a set of weights specific to the roles they can play (e.g., in the email case, the categories would be TO, FROM, CC, and BCC). The latent space positions together with the weights are used to map a set of people to a vector by taking a weighted average. In particular, a multi-labeled email classification problem is considered, where an email can be relevant to all/some/none of the desired categories. I develop three stochastic models that can be used to learn to predict multiple labels, taking into account correlations. (en_US)
dc.format.mimetype: application/pdf (en_US)
dc.identifier.citation: Drummond, Anna. "Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference." (2014) Diss., Rice University. https://hdl.handle.net/1911/76711. (en_US)
dc.identifier.uri: https://hdl.handle.net/1911/76711 (en_US)
dc.language.iso: eng (en_US)
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. (en_US)
dc.subject: Bayesian modeling (en_US)
dc.subject: Text mining (en_US)
dc.subject: Machine learning (en_US)
dc.subject: MCMC (en_US)
dc.title: Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference (en_US)
dc.type: Thesis (en_US)
dc.type.material: Text (en_US)
thesis.degree.department: Computer Science (en_US)
thesis.degree.discipline: Engineering (en_US)
thesis.degree.grantor: Rice University (en_US)
thesis.degree.level: Doctoral (en_US)
thesis.degree.name: Doctor of Philosophy (en_US)
Files
Original bundle
Name: DRUMMOND-DOCUMENT-2014.pdf
Size: 1.57 MB
Format: Adobe Portable Document Format
License bundle
Name: LICENSE.txt
Size: 2.61 KB
Format: Plain Text
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text