Statistical Approaches for Interpretable Machine Learning

dc.contributor.advisor: Allen, Genevera I
dc.creator: Gan, Luqin
dc.date.accessioned: 2023-08-09T18:52:27Z
dc.date.created: 2023-05
dc.date.issued: 2023-04-17
dc.date.submitted: May 2023
dc.date.updated: 2023-08-09T18:52:27Z
dc.description.abstract: New technologies have produced vast troves of large, complex datasets across many scientific domains and industries. Machine learning techniques are routinely used to process, visualize, and analyze this big data in a wide range of high-stakes applications. Interpretations of machine learning systems provide an understanding of the data, the model itself, or the fitted outcome, and human-interpretable insights are critical not only for building trust and transparency in the ML system but also for generating new knowledge and making data-driven discoveries. In this thesis, I develop interpretable machine learning (IML) methodologies and inference procedures for IML methods, and I conduct a large-scale empirical study of the reliability of existing IML methods.

The first project considers feature importance in clustering methods for high-dimensional, large-scale data sets such as single-cell RNA-seq data. I develop IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering, which ensembles cluster co-occurrences from tiny subsets of both observations and features, termed minipatches. My approach leverages adaptive sampling schemes over minipatches to address the computational inefficiency of standard consensus clustering while yielding interpretable solutions by quickly learning the most relevant features that differentiate clusters.

Beyond clustering, interpretable machine learning has been applied to many other tasks, but there has not yet been a systematic evaluation of the reliability of machine learning interpretations. Hence, my second project studies the reliability of the interpretations of popular machine learning models for tabular data. I run an extensive reliability study covering three major machine learning interpretation tasks with a variety of IML techniques, benchmark data sets, and robust consistency metrics. I also build an interactive dashboard for users to explore and visualize the full results. My results show that interpretations are not necessarily reliable under small data perturbations and that the accuracy of the predictive model is not correlated with the consistency of its interpretations.

These surprising results motivate my third project, which seeks to quantify the uncertainty of machine learning interpretations, focusing on feature importance. To this end, I propose a mostly model-agnostic, distribution-free, and assumption-light inference framework for feature importance interpretations. Through comprehensive empirical studies, I demonstrate that the approach is applicable to both regression and classification tasks and is computationally efficient and statistically powerful.

Collectively, my work has major implications for understanding how and when interpretations of machine learning systems are reliable and can be trusted. The IML methodologies I develop are widely applicable to a number of societally and scientifically critical areas, potentially leading to increased utility of and trust in machine learning systems and to reliable knowledge discoveries.
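The minipatch consensus idea described in the abstract can be illustrated with a toy sketch. This is not the IMPACC algorithm itself: it uses uniform rather than adaptive minipatch sampling and a bare-bones k-means, and all function names and parameter values are illustrative assumptions. It shows only the core mechanism, averaging cluster co-occurrences over tiny random subsets of observations and features into a consensus matrix.

```python
import numpy as np

def _kmeans(X, k, rng, iters=10):
    """Bare-bones Lloyd's k-means; returns cluster labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def minipatch_consensus(X, n_clusters=2, n_patches=100, m_frac=0.2, f_frac=0.2, seed=0):
    """Uniform (non-adaptive) minipatch consensus clustering sketch:
    repeatedly cluster a tiny random subset of rows and columns, then
    return, for each pair of observations, the fraction of co-sampled
    patches in which the pair landed in the same cluster."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = max(n_clusters + 1, int(m_frac * n))   # observations per minipatch
    f = max(1, int(f_frac * p))                # features per minipatch
    co = np.zeros((n, n))    # times a pair fell in the same cluster
    cnt = np.zeros((n, n))   # times a pair was sampled together
    for _ in range(n_patches):
        rows = rng.choice(n, size=m, replace=False)
        cols = rng.choice(p, size=f, replace=False)
        labels = _kmeans(X[np.ix_(rows, cols)], n_clusters, rng)
        same = labels[:, None] == labels[None, :]
        co[np.ix_(rows, rows)] += same
        cnt[np.ix_(rows, rows)] += 1
    return co / np.maximum(cnt, 1)   # consensus matrix in [0, 1]
```

On two well-separated Gaussian blobs, within-blob consensus values concentrate near 1 and between-blob values near 0, so thresholding the consensus matrix recovers the clusters; the adaptive sampling in IMPACC additionally prioritizes informative observations and features so that far fewer patches are needed.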
dc.embargo.lift: 2024-05-01
dc.embargo.terms: 2024-05-01
dc.format.mimetype: application/pdf
dc.identifier.citation: Gan, Luqin. "Statistical Approaches for Interpretable Machine Learning." (2023) Diss., Rice University. https://hdl.handle.net/1911/115154.
dc.identifier.uri: https://hdl.handle.net/1911/115154
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Interpretable machine learning
dc.subject: consensus clustering
dc.subject: model-agnostic feature importance
dc.subject: leave-one-covariate-out inference
dc.subject: minipatch ensembles
dc.subject: predictive inference
dc.subject: conformal inference
dc.title: Statistical Approaches for Interpretable Machine Learning
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Statistics
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
Files

Original bundle (1 of 1)
- Name: GAN-DOCUMENT-2023.pdf; Size: 41.41 MB; Format: Adobe Portable Document Format

License bundle (2 of 2)
- Name: PROQUEST_LICENSE.txt; Size: 5.84 KB; Format: Plain Text
- Name: LICENSE.txt; Size: 2.6 KB; Format: Plain Text