Statistical Approaches for Interpretable Machine Learning
dc.contributor.advisor | Allen, Genevera I | en_US |
dc.creator | Gan, Luqin | en_US |
dc.date.accessioned | 2023-08-09T18:52:27Z | en_US |
dc.date.created | 2023-05 | en_US |
dc.date.issued | 2023-04-17 | en_US |
dc.date.submitted | May 2023 | en_US |
dc.date.updated | 2023-08-09T18:52:27Z | en_US |
dc.description.abstract | New technologies have produced vast troves of large, complex data sets across many scientific domains and industries. Machine learning techniques are routinely used to process, visualize, and analyze this big data in a wide range of high-stakes applications. Interpretations obtained from machine learning systems provide an understanding of the data, the model itself, or the fitted outcome. Having human-interpretable insights is critical not only for building trust and transparency in the ML system but also for generating new knowledge and making data-driven discoveries. In this thesis, I develop interpretable machine learning (IML) methodologies, develop inference for IML methods, and conduct a large-scale empirical study of the reliability of existing IML methods. The first project considers feature importance in clustering methods for high-dimensional, large-scale data sets, such as single-cell RNA-seq data. I develop IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering, which ensembles cluster co-occurrences from tiny subsets of both observations and features, termed minipatches. My approach leverages adaptive sampling schemes for minipatches to address the computational inefficiency of standard consensus clustering while yielding interpretable solutions, quickly learning the most relevant features that differentiate clusters. Beyond clustering, interpretable machine learning has been applied to many other tasks, but there has not yet been a systematic evaluation of the reliability of machine learning interpretations. Hence, my second project studies the reliability of the interpretations of popular machine learning models for tabular data. I run an extensive reliability study covering three major machine learning interpretation tasks with a variety of IML techniques, benchmark data sets, and robust consistency metrics, and I build an interactive dashboard for users to explore and visualize the full results. My results show that interpretations are not necessarily reliable under small data perturbations and that the accuracy of the predictive model is not correlated with the consistency of its interpretations. These surprising results motivate my third project, which seeks to quantify the uncertainty of machine learning interpretations, focusing on feature importance. To this end, I propose a mostly model-agnostic, distribution-free, and assumption-light inference framework for feature importance interpretations. I demonstrate through comprehensive empirical studies that the approach applies to both regression and classification tasks and is computationally efficient and statistically powerful. Collectively, my work has major implications for understanding how and when interpretations of machine learning systems are reliable and can be trusted. The IML methodologies I develop are widely applicable to societally and scientifically critical areas, potentially leading to increased utility of and trust in machine learning systems and to reliable knowledge discoveries. | en_US |
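To make the minipatch idea from the first project concrete, below is a minimal sketch of consensus clustering over tiny random subsets of both observations and features. It uses uniform sampling for simplicity; IMPACC itself relies on adaptive sampling schemes that this sketch omits, and the function name and defaults here are illustrative, not the thesis's implementation.

```python
# Illustrative minipatch consensus clustering: cluster many tiny
# (rows x columns) subsets, accumulate a cluster co-occurrence
# matrix, and cluster the resulting consensus matrix.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def minipatch_consensus_cluster(X, k, n_patches=200, m_frac=0.1, p_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = max(k + 1, int(m_frac * n))   # observations per minipatch
    q = max(1, int(p_frac * p))       # features per minipatch
    co = np.zeros((n, n))             # cluster co-occurrence counts
    seen = np.zeros((n, n))           # co-sampling counts
    for _ in range(n_patches):
        rows = rng.choice(n, size=m, replace=False)
        cols = rng.choice(p, size=q, replace=False)
        labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(
            X[np.ix_(rows, cols)])
        same = labels[:, None] == labels[None, :]
        co[np.ix_(rows, rows)] += same
        seen[np.ix_(rows, rows)] += 1
    # Consensus = fraction of co-sampled runs in which a pair co-clustered.
    consensus = np.divide(co, seen, out=np.zeros_like(co), where=seen > 0)
    dist = 1.0 - consensus
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")
```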
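For the second project, one generic way to quantify interpretation consistency under small data perturbations is to refit the model on slightly perturbed data and compare the resulting feature-importance rankings, for example with Kendall's tau. The perturbation scheme, model, and metric below are stand-ins, not the thesis's exact reliability protocol.

```python
# Minimal consistency check: perturb X with small Gaussian noise,
# refit, and correlate importance rankings with the baseline.
import numpy as np
from scipy.stats import kendalltau
from sklearn.ensemble import RandomForestRegressor

def importance_consistency(X, y, n_repeats=10, noise_sd=0.05, seed=0):
    rng = np.random.default_rng(seed)
    base = RandomForestRegressor(random_state=seed).fit(X, y).feature_importances_
    taus = []
    for _ in range(n_repeats):
        Xp = X + rng.normal(0.0, noise_sd * X.std(axis=0), size=X.shape)
        imp = RandomForestRegressor(random_state=seed).fit(Xp, y).feature_importances_
        taus.append(kendalltau(base, imp)[0])
    return float(np.mean(taus))  # near 1.0 => stable interpretations
```

A value well below 1.0 under tiny perturbations is exactly the kind of instability the abstract reports: accurate models whose interpretations shift from fit to fit.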
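The third project's feature-importance inference is in the leave-one-covariate-out (LOCO) family, one of the record's subject terms. Below is a hedged sketch of the basic LOCO construction: compare held-out prediction errors with and without feature j and form a paired-difference interval. The normal-approximation interval is a textbook simplification; the thesis's framework is distribution-free and more refined than this.

```python
# LOCO-style interval for the importance of feature j: the mean
# increase in held-out absolute error when feature j is removed.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def loco_interval(X, y, j, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    full = RandomForestRegressor(random_state=seed).fit(X_tr, y_tr)
    reduced = RandomForestRegressor(random_state=seed).fit(
        np.delete(X_tr, j, axis=1), y_tr)
    # Per-observation increase in absolute error without feature j.
    delta = (np.abs(y_te - reduced.predict(np.delete(X_te, j, axis=1)))
             - np.abs(y_te - full.predict(X_te)))
    m, se = delta.mean(), delta.std(ddof=1) / np.sqrt(len(delta))
    z = 1.645  # two-sided 90% normal quantile
    return m - z * se, m + z * se  # interval excluding 0 => feature j matters
```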
dc.embargo.lift | 2024-05-01 | en_US |
dc.embargo.terms | 2024-05-01 | en_US |
dc.format.mimetype | application/pdf | en_US |
dc.identifier.citation | Gan, Luqin. "Statistical Approaches for Interpretable Machine Learning." (2023) Diss., Rice University. https://hdl.handle.net/1911/115154. | en_US |
dc.identifier.uri | https://hdl.handle.net/1911/115154 | en_US |
dc.language.iso | eng | en_US |
dc.rights | Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. | en_US |
dc.subject | Interpretable machine learning | en_US |
dc.subject | consensus clustering | en_US |
dc.subject | model-agnostic feature importance | en_US |
dc.subject | leave-one-covariate-out inference | en_US |
dc.subject | minipatch ensembles | en_US |
dc.subject | predictive inference | en_US |
dc.subject | conformal inference | en_US |
dc.title | Statistical Approaches for Interpretable Machine Learning | en_US |
dc.type | Thesis | en_US |
dc.type.material | Text | en_US |
thesis.degree.department | Statistics | en_US |
thesis.degree.discipline | Engineering | en_US |
thesis.degree.grantor | Rice University | en_US |
thesis.degree.level | Doctoral | en_US |
thesis.degree.name | Doctor of Philosophy | en_US |