Statistical Machine Learning Approaches for Data Integration and Graphical Models

Wang, Minjie

Files

WANG-DOCUMENT-2021.pdf (5.85 MB)

Date

2021-04-26

Authors

Wang, Minjie

Abstract

Unsupervised learning aims to identify underlying patterns in unlabeled data. In this thesis, we develop methodologies for two popular unsupervised learning problems: clustering, with application to data integration, and graphical models. As the volume and variety of data grow, data integration, which analyzes multiple sources of data simultaneously, has gained increasing popularity. We study mixed multi-view data, where multiple sets of diverse features are measured on the same set of samples.

In the first project, by integrating all available data sources, we seek to uncover common group structure among the samples from unlabeled mixed multi-view data that may be hidden in separate cluster analyses of individual data views. To achieve this, we propose and develop a convex formalization that inherits the strong mathematical and empirical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views, with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.

In the second project, we seek more meaningful interpretations of clustering, which has often been challenging due to its unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy “supervising auxiliary variables”, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. We propose and develop a new statistical pattern discovery method named Supervised Convex Clustering (SCC) that borrows strength from both unlabeled data and the so-called supervising auxiliary variable in order to find more interpretable patterns with a joint convex fusion penalty.

Graphical models, statistical machine learning models defined on graphs, have been widely studied to understand conditional dependencies among a collection of random variables. In the third project, we consider graph selection in the presence of latent variables, a challenging problem in neuroscience, where existing technologies can only record from a small subset of neurons. We propose an incredibly simple solution: apply a hard thresholding operator to existing graph selection methods. We demonstrate that thresholding the graphical Lasso, neighborhood selection, or CLIME estimators has superior theoretical properties in terms of graph selection consistency, as well as stronger empirical results, than existing approaches for the latent variable graphical model problem. We also demonstrate the applicability of our approach through a neuroscience case study on calcium-imaging data to estimate functional neural connections.
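As a rough illustration of the hard thresholding idea in the third project, the sketch below zeroes out small off-diagonal entries of an estimated precision matrix and reads the remaining nonzero entries as graph edges. It is a minimal sketch and not the thesis's implementation: the function names `hard_threshold` and `selected_edges`, the matrix `theta_hat`, and the threshold `tau` are all hypothetical, and the abstract does not say how the threshold level is chosen.

```python
def hard_threshold(theta, tau):
    """Return a copy of the square matrix `theta` in which every off-diagonal
    entry with absolute value <= tau is set to zero (diagonal is kept)."""
    p = len(theta)
    return [
        [theta[i][j] if i == j or abs(theta[i][j]) > tau else 0.0
         for j in range(p)]
        for i in range(p)
    ]


def selected_edges(theta):
    """List the index pairs (i, j), i < j, whose entry in `theta` is nonzero."""
    p = len(theta)
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if theta[i][j] != 0.0]


# Hypothetical precision-matrix estimate for 4 observed variables, standing in
# for the output of an existing estimator such as the graphical Lasso.
theta_hat = [
    [1.00, 0.45, 0.03, 0.00],
    [0.45, 1.00, -0.02, 0.30],
    [0.03, -0.02, 1.00, 0.04],
    [0.00, 0.30, 0.04, 1.00],
]
tau = 0.1  # assumed threshold level, chosen only for this example

edges = selected_edges(hard_threshold(theta_hat, tau))
print(edges)  # [(0, 1), (1, 3)]
```

In this toy example, only the two entries larger in magnitude than `tau` survive as edges; the small entries are discarded.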

Advisor

Allen, Genevera I.

Degree

Doctor of Philosophy

Type

Thesis

Keywords

Statistical machine learning, data integration, clustering, graphical models, convex optimization

Citation

Wang, Minjie. "Statistical Machine Learning Approaches for Data Integration and Graphical Models." (2021) Diss., Rice University. https://hdl.handle.net/1911/110427.

Rights

Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.

Citable link to this page

https://hdl.handle.net/1911/110427

Collections

Rice University Theses and Dissertations
