Allen, Genevera I
2022-10-05
2022-10-05
2022-05
2022-04-07
May 2022
Yao, Tianyi. "Statistical Machine Learning Methodology for Feature Selection, Structured Data, and Graphical Model Selection." (2022) Diss., Rice University. https://hdl.handle.net/1911/113512.
With the rapidly increasing richness and volume of modern data sets, finding important structure, whether informative features, relationships between entities, or group patterns, is crucial for making data-driven discoveries in many domains such as genetics and neuroscience. In this thesis, I develop three methodologies for tackling these problems. The first project considers feature selection. While many feature selection techniques have been proposed, there are typically two key challenges in practice: computational intractability in huge-data settings and deteriorating statistical accuracy of selected features in high-dimensional, high-correlation scenarios. I tackle these issues by developing Stable Minipatch Selection (STAMPS) and AdaSTAMPS. These are meta-algorithms that build ensembles of selection events of base feature selectors trained on many tiny, random or adaptively-chosen subsets of both the observations and features of the data, termed minipatches. Through extensive empirical experiments, I demonstrate that my approaches, especially AdaSTAMPS, achieve superior performance in terms of feature selection accuracy and computational time in challenging high-dimensional, high-correlation settings. The second project considers estimating the structure of Gaussian graphical models, which are powerful statistical approaches for studying conditional dependence relationships between nodes. Despite recent advancements, conducting graphical model selection on data with a huge number of nodes still poses great computational and statistical challenges in practice.
I develop a highly scalable computational approach to Gaussian graphical model selection named Minipatch Graph (MPGraph) that ensembles thresholded graph estimators trained on many tiny, random minipatches. I demonstrate the efficacy of MPGraph through extensive empirical studies, showing that it not only yields more accurate graph estimation, but also achieves extensive speed improvement over existing techniques for huge data. The third project considers the problem of uncovering the functional groupings of large neuronal populations from neuronal activity data, which can lead to a better understanding of structures of interconnected neural circuits and thus the operating mechanisms of the brain. The Clustered Gaussian Graphical Model with a novel symmetric convex clustering penalty is developed for finding functionally coherent groups in a data-driven manner. All three methodologies can aid in discoveries of useful structure from large data sets in many applications.
application/pdf
eng
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
Statistical machine learning; feature selection; graphical models; clustering
Statistical Machine Learning Methodology for Feature Selection, Structured Data, and Graphical Model Selection
Thesis
2022-10-05