Statistical Machine Learning Methodology for Feature Selection, Structured Data, and Graphical Model Selection

dc.contributor.advisor: Allen, Genevera I
dc.creator: Yao, Tianyi
dc.date.accessioned: 2022-10-05T20:56:41Z
dc.date.available: 2022-10-05T20:56:41Z
dc.date.created: 2022-05
dc.date.issued: 2022-04-07
dc.date.submitted: May 2022
dc.date.updated: 2022-10-05T20:56:41Z
dc.description.abstract: With the rapidly increasing richness and volume of modern data sets, finding important structure, whether informative features, relationships between entities, or group patterns, is crucial for making data-driven discoveries in many domains such as genetics and neuroscience. In this thesis, I develop three methodologies for tackling these problems. The first project considers feature selection. While many feature selection techniques have been proposed, there are typically two key challenges in practice: computational intractability in huge-data settings and deteriorating statistical accuracy of selected features in high-dimensional, high-correlation scenarios. I tackle these issues by developing Stable Minipatch Selection (STAMPS) and AdaSTAMPS. These are meta-algorithms that build ensembles of selection events of base feature selectors trained on many tiny, random or adaptively chosen subsets of both the observations and features of the data, termed minipatches. Through extensive empirical experiments, I demonstrate that my approaches, especially AdaSTAMPS, achieve superior performance in terms of feature selection accuracy and computational time in challenging high-dimensional, high-correlation settings. The second project considers estimating the structure of Gaussian graphical models, which are powerful statistical approaches for studying conditional dependence relationships between nodes. Despite recent advancements, conducting graphical model selection on data with a huge number of nodes still poses great computational and statistical challenges in practice. I develop a highly scalable computational approach to Gaussian graphical model selection named Minipatch Graph (MPGraph) that ensembles thresholded graph estimators trained on many tiny, random minipatches. I demonstrate the efficacy of MPGraph through extensive empirical studies, showing that it not only yields more accurate graph estimation, but also achieves substantial speed improvements over existing techniques for huge data. The third project considers the problem of uncovering the functional groupings of large neuronal populations from neuronal activity data, which can lead to a better understanding of the structure of interconnected neural circuits and thus the operating mechanisms of the brain. I develop the Clustered Gaussian Graphical Model, which uses a novel symmetric convex clustering penalty, for finding functionally coherent groups in a data-driven manner. All three methodologies can aid in the discovery of useful structure from large data sets in many applications.
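The minipatch idea described in the abstract (fit a base selector on many tiny random subsets of both observations and features, then aggregate the selection events) can be illustrated with a minimal Python sketch. This is not the thesis's STAMPS or AdaSTAMPS implementation. The function name `minipatch_selection_frequencies`, all parameter values, the marginal-correlation base selector, and the synthetic data are assumptions chosen only for illustration.

```python
import numpy as np


def minipatch_selection_frequencies(X, y, n_patches=500, n_obs=50,
                                    n_feat=10, top_k=3, seed=0):
    """Illustrative sketch: per-feature selection frequency over random minipatches."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    sampled = np.zeros(n_cols)   # times each feature appeared in a minipatch
    selected = np.zeros(n_cols)  # times the base selector kept it
    for _ in range(n_patches):
        # A minipatch: a random subset of observations AND of features.
        rows = rng.choice(n_rows, size=n_obs, replace=False)
        cols = rng.choice(n_cols, size=n_feat, replace=False)
        Xp = X[np.ix_(rows, cols)]
        yp = y[rows]
        # Hypothetical base selector (an assumption, not from the thesis):
        # keep the top_k features by absolute correlation with the response.
        Xc = Xp - Xp.mean(axis=0)
        yc = yp - yp.mean()
        denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
        scores = np.abs(Xc.T @ yc) / denom
        keep = cols[np.argsort(scores)[-top_k:]]
        sampled[cols] += 1
        selected[keep] += 1
    # Frequency = selections / appearances (guard against never-sampled features).
    return selected / np.maximum(sampled, 1)


# Synthetic example: only the first three of 40 features carry signal.
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((200, 40))
y = 3 * X[:, 0] - 3 * X[:, 1] + 3 * X[:, 2] + 0.5 * data_rng.standard_normal(200)

freq = minipatch_selection_frequencies(X, y)
stable = np.flatnonzero(freq >= 0.5)  # 0.5 is an arbitrary illustrative threshold
```

Features whose selection frequency exceeds the chosen threshold are treated as stable. Because each base fit sees only a small block of the data, the per-fit cost stays low even when the full data matrix is very large, which is the computational motivation the abstract gives for minipatches.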
dc.format.mimetype: application/pdf
dc.identifier.citation: Yao, Tianyi. "Statistical Machine Learning Methodology for Feature Selection, Structured Data, and Graphical Model Selection." (2022) Diss., Rice University. https://hdl.handle.net/1911/113512.
dc.identifier.uri: https://hdl.handle.net/1911/113512
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Statistical machine learning
dc.subject: feature selection
dc.subject: graphical models
dc.subject: clustering
dc.title: Statistical Machine Learning Methodology for Feature Selection, Structured Data, and Graphical Model Selection
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Statistics
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
Files
Original bundle
Name: YAO-DOCUMENT-2022.pdf
Size: 9.9 MB
Format: Adobe Portable Document Format