Browsing by Author "Allen, Genevera I"

Computation, Visualization, and Applications of Convex Clustering (2018-10-04)
Nagorski, John; Allen, Genevera I
Clustering is a ubiquitous tool for exploratory data analysis across the sciences, with the general aim of identifying groups of similar objects. Recent work has recast the clustering problem within the framework of convex optimization, addressing many shortcomings of traditional methods such as interpretability, stability, and parameter selection. The method of Convex Clustering has proven to be a canonical example of such an approach, and its extensions and applications will be the focus of this work. We begin by considering the application of Convex Clustering in the novel setting of region detection for high-throughput genomic data. We illustrate the versatility of Convex Clustering by developing a novel extension, Spatial Convex Clustering (SpaCC), specifically catered to multivariate spatially correlated genomics data. We demonstrate that SpaCC achieves state-of-the-art performance on the well-studied problem of Copy Number Segmentation, and show it to be similarly successful in the novel setting of DNA Methylation region detection. Next, we address several shortcomings of Convex Clustering, including slow computation and a lack of familiar visualizations relative to its traditional counterparts. To do so, we introduce algorithms for the fast approximation of the Convex Clustering solution path and provide both theoretical guarantees of error control and empirical investigations. Next, we provide a suite of visualization techniques to aid in the interpretation of the clustering solution path, exploring their insights via several real data examples.
Finally, we introduce the R package clustRviz, which gives practitioners direct access to the fast computation and dynamic visualizations introduced throughout.

Graphical Models for Functional Neuronal Connectivity (2022-09-29)
Chang, Andersen; Allen, Genevera I
With modern calcium imaging technology, activities of thousands of neurons can be recorded in vivo. These experiments can potentially provide new insights into intrinsic functional neuronal connectivity, defined as contemporaneous correlations between neuronal activities. As a common tool for estimating conditional dependencies in high-dimensional settings, graphical models are a natural choice for estimating functional connectivity networks. However, raw neuronal activity data presents several statistical challenges when applying graphical models. In this project, we develop new methods to estimate scientifically meaningful functional neuronal connectivity networks using the graphical model framework. One unique facet of calcium imaging data is that the important information lies in rare extreme value observations that indicate neuronal firing, rather than in the observations near the mean. Thus, a graphical modeling technique which finds conditional dependencies between the extreme values of features is required in order to estimate scientifically meaningful functional connectivity networks from calcium imaging data. To address this, we develop a novel class of graphical models, called the Subbotin graphical model, which can be used to find sparse conditional dependency structures for extreme values. We first derive the form of the Subbotin graphical model and show the conditions under which it is normalizable. We then study the empirical performance of the Subbotin graphical model on simulations as well as real-world data.
Additionally, in many modern calcium imaging data sets, the complete data set often consists of multiple individual recording sessions of partially overlapping subsets of neurons. Thus, in order to estimate a graph on the full data, conditional dependencies in the missing portion of the covariance must be inferred; this is known as the graph quilting problem. We introduce several graph quilting methods that can be applied to calcium imaging data, which utilize a low-rankness assumption to impute the full covariance matrix. Through several empirical studies, we investigate the efficacy of these methods for estimating graphical models for functional connectivity in the presence of missing joint observations. We also develop new methods for covariate and dynamic latent variable adjustment for functional neuronal data, which can arise from the presence of stimuli, unobserved neurons, and physical activity. We first introduce two models to infer functional connectivity from neuronal activity data after adjusting for dynamic latent brain states, and we use simulation studies to compare their performance to traditional, unconditional graphical models. We then propose a new method for sparse high-dimensional linear regression for extreme values, called the Extreme Lasso. We prove consistency and variable selection consistency for our regression method, and we analyze the theoretical impact of extreme value observations on the model parameter estimates using the concept of influence functions. We then study the empirical performance of the Extreme Lasso for selecting features associated with extreme values in high-dimensional regression. In our work, we demonstrate the applicability of each of our developed methods to finding functional connectivity networks through studies on several real-world calcium imaging data sets.
In particular, we compare these network estimates to those from existing methods from both the graphical model and neuroscience literature, and we show that our methods can provide more scientifically sensible functional connectivity estimates.

Methods and Applications for Mixed Graphical Models (2017-10-06)
Baker, Yulia; Allen, Genevera I
"Multi-view Data" is a term used to describe heterogeneous data measured on the same set of observations but collected from different sources and of potentially different types (continuous, discrete, count). This type of data is prevalent in various fields, such as imaging genetics, national security, social networking, Internet advertising, and our particular motivation, high-throughput integrative genomics. There have been limited efforts directed at statistically modeling such mixed data jointly. In this thesis, we address this by introducing a novel class of Mixed Markov Random Field (Mixed MRF) and Mixed Chain Markov Random Field (Mixed Chain MRF) distributions, or graphical models. Mixed MRFs assume that each node-conditional distribution arises from a different exponential family model. Mixed Chain MRFs incorporate directed and undirected edges, in addition to different exponential family models, to produce more flexible models with less restrictive normalizability constraints. Mixed MRFs and Mixed Chain MRFs both yield joint densities, which can directly parameterize dependencies over mixed variables. Fitting these models to perform mixed graph selection entails estimating penalized generalized linear models with mixed covariates. Model selection with mixed covariates in a high-dimensional setting, however, poses many challenges due to differences in the scale and potential signal interference between variables. In this thesis, we introduce this novel class of Mixed MRFs and Mixed Chain MRFs, study model estimation challenges theoretically and empirically, and propose a new iterative block estimation strategy.
Our methods are applied to infer a gene regulatory network in three ovarian cancer studies that integrate methylation, micro-RNA expression, mutation, and gene expression data to fully understand regulatory relationships in ovarian cancer.

Statistical and Algorithmic Methods for High-Dimensional and Highly-Correlated Data (2016-05-20)
Hu, Yue; Allen, Genevera I
Technological advances have led to a proliferation of high-dimensional and highly correlated data. This sort of data poses enormous challenges for statistical analysis, pushing the limits of distributed optimization, predictive modeling, and statistical inference. We propose new methods, motivated by biomedical applications, for predictive modeling and variable selection in this challenging setting. First, we build predictive models for multi-subject neuroimaging data. This is an ultra-high-dimensional problem that consists of a highly spatially and temporally correlated matrix of covariates (brain locations by time points) for each subject; few methods currently exist to fit supervised models directly to this tensor data. We propose a novel modeling and algorithmic strategy, Local Aggregate Modeling, to apply generalized linear models (GLMs) to this massive tensor data that not only has better prediction accuracy and interpretability, but can also be fit in a distributed manner. Second, we propose a novel method, Algorithmic Regularization Paths, for variable selection with high-dimensional and highly correlated data. Existing penalized regression methods such as the Lasso solve a relaxation of the best subsets problem that runs in polynomial time; however, the Lasso can only correctly recover the true sparsity pattern if the design matrix satisfies the so-called Irrepresentability Condition or related conditions, which are easily violated when the data is highly correlated.
Our method achieves better variable selection performance and faster computation in ultra-high-dimensional and high-correlation settings where the Lasso and many other standard methods fail.

Statistical Approaches for Interpretable Machine Learning (2023-04-17)
Gan, Luqin; Allen, Genevera I
New technologies have led to vast troves of large and complex datasets across many scientific domains and industries. People routinely use machine learning techniques to process, visualize, and analyze this big data in a wide range of high-stakes applications. Interpretations obtained from the machine learning systems provide an understanding of the data, the model itself, or the fitted outcome. Having human-interpretable insights is critically important not only to build trust and transparency in the ML system but also to generate new knowledge or make data-driven discoveries. In this thesis, I develop interpretable machine learning (IML) methodologies and inference for IML methods, and I conduct a large-scale empirical study on the reliability of existing IML methods. The first project considers feature importance in clustering methods for high-dimensional and large-scale data sets, such as single-cell RNA-seq data. I develop IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering, which ensembles cluster co-occurrences from tiny subsets of both observations and features, termed minipatches. My approach leverages adaptive sampling schemes of minipatches to address the computational inefficiency of standard consensus clustering and, at the same time, to yield interpretable solutions by quickly learning the most relevant features that differentiate clusters. Going beyond clustering, interpretable machine learning has been applied to many other tasks, but there has not yet been a systematic evaluation of the reliability of machine learning interpretations.
Hence, my second project aims to study the reliability of the interpretations of popular machine learning models for tabular data. I run an extensive reliability study for three major machine learning interpretation tasks with a variety of IML techniques, benchmark data sets, and robust consistency metrics. I also build a user-interactive dashboard for users to explore and visualize the full results. My results show that interpretations are not necessarily reliable when there are small data perturbations, and the accuracy of the predictive model is not correlated with the consistency of the interpretation. These surprising results motivate my third project, which seeks to quantify the uncertainty of machine learning interpretations, focusing on feature importance. To this end, I propose a mostly model-agnostic, distribution-free, and assumption-light inference framework for feature importance interpretations. Through comprehensive empirical studies, I demonstrate that the approach is applicable to both regression and classification tasks and is computationally efficient and statistically powerful. Collectively, my work has major implications for understanding how and when interpretations of machine learning systems are reliable and can be trusted. My developed IML methodologies are widely applicable to a number of societally and scientifically critical areas, potentially leading to increased utility and trust in machine learning systems and reliable knowledge discoveries.

Statistical Machine Learning Methodology and Inference for Structured Variable Selection (2018-02-01)
Campbell, Frederick; Allen, Genevera I
Structured variable selection is a powerful tool for modeling a wide range of real-world phenomena. In this work, we develop methodology based on structured variable selection for three different problems. In the first, we develop methodology for problems with pre-defined group structure.
Our goal is to select at least one variable from each group in the context of predictive regression modeling. This problem is NP-hard, but we propose the tightest convex relaxation: a composite penalty that is a combination of the l1 and l2 norms. Our so-called Exclusive Lasso method performs structured variable selection by ensuring that at least one variable is selected from each group. In the next problem, we investigate the neurological response to speech by developing a method for the brain decoding problem with electrocorticography (ECoG) data. Electrocorticography measures brain activity at a range of frequencies over time at multiple locations in the brain, resulting in highly structured spatial-temporal data. Effective brain decoding relies on identifying relevant features in the data, motivating us to propose a new method for brain decoding based on partial least squares called Regularized Higher-Order Partial Least Squares (RHOP). Our method RHOP (pronounced "Rope") organizes the data into a tensor and reduces the dimensionality of the data by factoring it into a sparse node factor that identifies important regions in the brain, a smooth time factor that identifies important time points, and a smooth frequency factor that identifies frequencies carrying information about the patient stimuli. Lastly, we develop statistical tests for clustering that help determine whether the clustering assignment is due to random sampling variation or due to actual structure in the population. Clustering is widely applied, but there are currently few methods for inference on clustered data. As a result, we develop new tests and statistics for inference after clustering with Convex Clustering.
Our tests are based on the geometric interpretation of Hotelling's T squared test and allow us to evaluate the quality of our clustering assignment.

Statistical Machine Learning Methodology for Feature Selection, Structured Data, and Graphical Model Selection (2022-04-07)
Yao, Tianyi; Allen, Genevera I
With the rapidly increasing richness and volume of modern data sets, finding important structure, whether informative features, relationships between entities, or group patterns, is crucial for making data-driven discoveries in many domains such as genetics and neuroscience. In this thesis, I develop three methodologies for tackling these problems. The first project considers feature selection. While many feature selection techniques have been proposed, there are typically two key challenges in practice: computational intractability in huge-data settings and deteriorating statistical accuracy of selected features in high-dimensional, high-correlation scenarios. I tackle these issues by developing Stable Minipatch Selection (STAMPS) and AdaSTAMPS. These are meta-algorithms that build ensembles of selection events of base feature selectors trained on many tiny, random or adaptively-chosen subsets of both the observations and features of the data, termed minipatches. Through extensive empirical experiments, I demonstrate that my approaches, especially AdaSTAMPS, achieve superior performance in terms of feature selection accuracy and computational time in challenging high-dimensional, high-correlation settings. The second project considers estimating the structure of Gaussian graphical models, which are powerful statistical approaches for studying conditional dependence relationships between nodes. Despite recent advancements, conducting graphical model selection on data with a huge number of nodes still poses great computational and statistical challenges in practice.
I develop a highly scalable computational approach to Gaussian graphical model selection named Minipatch Graph (MPGraph) that ensembles thresholded graph estimators trained on many tiny, random minipatches. I demonstrate the efficacy of MPGraph through extensive empirical studies, showing that it not only yields more accurate graph estimation, but also achieves extensive speed improvement over existing techniques for huge data. The third project considers the problem of uncovering the functional groupings of large neuronal populations from neuronal activity data, which can lead to a better understanding of structures of interconnected neural circuits and thus the operating mechanisms of the brain. The Clustered Gaussian Graphical Model with a novel symmetric convex clustering penalty is developed for finding functionally coherent groups in a data-driven manner. All three methodologies can aid in discoveries of useful structure from large data sets in many applications.

The International Conference on Intelligent Biology and Medicine (ICIBM) 2016: from big data to big analytical tools (BioMed Central, 2017-10-03)
Liu, Zhandong; Zheng, W. Jim; Allen, Genevera I; Liu, Yin; Ruan, Jianhua; Zhao, Zhongming
The 2016 International Conference on Intelligent Biology and Medicine (ICIBM 2016) was held on December 8–10, 2016 in Houston, Texas, USA. ICIBM included eight scientific sessions, four tutorials, one poster session, four highlighted talks and four keynotes that covered topics on 3D genomics structural analysis, next generation sequencing (NGS) analysis, computational drug discovery, medical informatics, cancer genomics, and systems biology.
Here, we present a summary of the nine research articles selected from the ICIBM 2016 program for publication in BMC Bioinformatics.

The International Conference on Intelligent Biology and Medicine (ICIBM) 2016: summary and innovation in genomics (BioMed Central, 2017-10-03)
Zhao, Zhongming; Liu, Zhandong; Chen, Ken; Guo, Yan; Allen, Genevera I; Zhang, Jiajie; Jim Zheng, W.; Ruan, Jianhua
In this editorial, we first summarize the 2016 International Conference on Intelligent Biology and Medicine (ICIBM 2016) that was held on December 8–10, 2016 in Houston, Texas, USA, and then briefly introduce the ten research articles included in this supplement issue. ICIBM 2016 included four workshops or tutorials, four keynote lectures, four conference invited talks, eight concurrent scientific sessions and a poster session for 53 accepted abstracts, covering current topics in bioinformatics, systems biology, intelligent computing, and biomedical informatics. Through our call for papers, a total of 77 original manuscripts were submitted to ICIBM 2016. After peer review, 11 articles were selected for this special issue, covering topics such as single cell RNA-seq analysis method, genome sequence and variation analysis, bioinformatics method for vaccine development, and cancer genomics.

XMRF: an R package to fit Markov Networks to high-throughput genetics data (BioMed Central, 2016)
Wan, Ying-Wooi; Allen, Genevera I; Baker, Yulia; Yang, Eunho; Ravikumar, Pradeep; Anderson, Matthew; Liu, Zhandong
Background: Technological advances in medicine have led to a rapid proliferation of high-throughput "omics" data. Tools to mine this data and discover disrupted disease networks are needed as they hold the key to understanding complicated interactions between genes, mutations and aberrations, and epi-genetic markers.
Results: We developed an R software package, XMRF, that can be used to fit Markov Networks to various types of high-throughput genomics data.
Encoding the models and estimation techniques of the recently proposed exponential family Markov Random Fields (Yang et al., 2012), our software can be used to learn genetic networks from RNA-sequencing data (counts via Poisson graphical models), mutation and copy number variation data (categorical via Ising models), and methylation data (continuous via Gaussian graphical models).
Conclusions: XMRF is the only tool that allows network structure learning using the native distribution of the data instead of the standard Gaussian. Moreover, the parallelization feature of the implemented algorithms computes the large-scale biological networks efficiently. XMRF is available from CRAN and GitHub (https://github.com/zhandong/XMRF).