Browsing by Author "Allen, Genevera I."
Item Crowdsourced estimation of cognitive decline and resilience in Alzheimer's disease (Elsevier, 2016)
Allen, Genevera I.; Amoroso, Nicola; Anghel, Catalina; Balagurusamy, Venkat; Bare, Christopher J.; Beaton, Derek; Bellotti, Roberto; Bennett, David A.; Boehme, Kevin L.; Boutros, Paul C.; Caberlotto, Laura; Caloian, Cristian; Campbell, Frederick; Neto, Elias Chaibub; Chang, Yu-Chuan; Chen, Beibei; Chen, Chien-Yu; Chien, Ting-Ying; Clark, Tim; Das, Sudeshna; Davatzikos, Christos; Deng, Jieyao; Dillenberger, Donna; Dobson, Richard J.B.; Dong, Qilin; Doshi, Jimit; Duma, Denise; Errico, Rosangela; Erus, Guray; Everett, Evan; Fardo, David W.; Friend, Stephen H.; Fröhlich, Holger; Gan, Jessica; St George-Hyslop, Peter; Ghosh, Satrajit S.; Glaab, Enrico; Green, Robert C.; Guan, Yuanfang; Hong, Ming-Yi; Huang, Chao; Hwang, Jinseub; Ibrahim, Joseph; Inglese, Paolo; Iyappan, Anandhi; Jiang, Qijia; Katsumata, Yuriko; Kauwe, John S.K.; Klein, Arno; Kong, Dehan; Krause, Roland; Lalonde, Emilie; Lauria, Mario; Lee, Eunjee; Lin, Xihui; Liu, Zhandong; Livingstone, Julie; Logsdon, Benjamin A.; Lovestone, Simon; Ma, Tsung-wei; Malhotra, Ashutosh; Mangravite, Lara M.; Maxwell, Taylor J.; Merrill, Emily; Nagorski, John; Namasivayam, Aishwarya; Narayan, Manjari; Naz, Mufassra; Newhouse, Stephen J.; Norman, Thea C.; Nurtdinov, Ramil N.; Oyang, Yen-Jen; Pawitan, Yudi; Peng, Shengwen; Peters, Mette A.; Piccolo, Stephen R.; Praveen, Paurush; Priami, Corrado; Sabelnykova, Veronica Y.; Senger, Philipp; Shen, Xia; Simmons, Andrew; Sotiras, Aristeidis; Stolovitzky, Gustavo; Tangaro, Sabina; Tateo, Andrea; Tung, Yi-An; Tustison, Nicholas J.; Varol, Erdem; Vradenburg, George; Weiner, Michael W.; Xiao, Guanghua; Xie, Lei; Xie, Yang; Xu, Jia; Yang, Hojin; Zhan, Xiaowei; Zhou, Yunyun; Zhu, Fan; Zhu, Hongtu; Zhu, Shanfeng; Alzheimer’s Disease Neuroimaging Initiative
Identifying accurate biomarkers of cognitive decline is essential for advancing early diagnosis and prevention therapies in
Alzheimer's disease. The Alzheimer's disease DREAM Challenge was designed as a computational crowdsourced project to benchmark the current state-of-the-art in predicting cognitive outcomes in Alzheimer's disease based on high dimensional, publicly available genetic and structural imaging data. This meta-analysis failed to identify a meaningful predictor developed from either data modality, suggesting that alternate approaches should be considered for prediction of cognitive performance.

Item Functional screening of lysosomal storage disorder genes identifies modifiers of alpha-synuclein neurotoxicity (Public Library of Science, 2023)
Yu, Meigen; Ye, Hui; De-Paula, Ruth B.; Mangleburg, Carl Grant; Wu, Timothy; Lee, Tom V.; Li, Yarong; Duong, Duc; Phillips, Bridget; Cruchaga, Carlos; Allen, Genevera I.; Seyfried, Nicholas T.; Al-Ramahi, Ismael; Botas, Juan; Shulman, Joshua M.
Heterozygous variants in the glucocerebrosidase (GBA) gene are common and potent risk factors for Parkinson’s disease (PD). GBA also causes the autosomal recessive lysosomal storage disorder (LSD), Gaucher disease, and emerging evidence from human genetics implicates many other LSD genes in PD susceptibility. We have systematically tested 86 conserved fly homologs of 37 human LSD genes for requirements in the aging adult Drosophila brain and for potential genetic interactions with neurodegeneration caused by α-synuclein (αSyn), which forms Lewy body pathology in PD. Our screen identifies 15 genetic enhancers of αSyn-induced progressive locomotor dysfunction, including knockdown of fly homologs of GBA and other LSD genes with independent support as PD susceptibility factors from human genetics (SCARB2, SMPD1, CTSD, GNPTAB, SLC17A5). For several genes, results from multiple alleles suggest dose-sensitivity and context-dependent pleiotropy in the presence or absence of αSyn.
Homologs of two genes causing cholesterol storage disorders, Npc1a / NPC1 and Lip4 / LIPA, were independently confirmed as loss-of-function enhancers of αSyn-induced retinal degeneration. The enzymes encoded by several modifier genes are upregulated in αSyn transgenic flies, based on unbiased proteomics, revealing a possible, albeit ineffective, compensatory response. Overall, our results reinforce the important role of lysosomal genes in brain health and PD pathogenesis, and implicate several metabolic pathways, including cholesterol homeostasis, in αSyn-mediated neurotoxicity.

Item Genomic region detection via Spatial Convex Clustering (Public Library of Science, 2018)
Nagorski, John; Allen, Genevera I.
Several modern genomic technologies, such as DNA-Methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are by themselves less interesting scientifically; instead scientists seek to discover biologically interpretable genomic regions comprised of contiguous groups of probes which may act as biomarkers of disease or serve as a dimension-reducing pre-processing step for downstream analyses. In this paper, we introduce an unsupervised feature learning technique which maps technological units (probes) to biological units (genomic regions) that are common across all subjects. We use ideas from fusion penalties and convex clustering to introduce a method for Spatial Convex Clustering, or SpaCC. Our method is specifically tailored to detecting multi-subject regions of methylation, but we also test our approach on the well-studied problem of detecting segments of copy number variation. We formulate our method as a convex optimization problem, develop a massively parallelizable algorithm to find its solution, and introduce automated approaches for handling missing values and determining tuning parameters.
Through simulation studies based on real methylation and copy number variation data, we show that SpaCC exhibits significant performance gains relative to existing methods. Finally, we illustrate SpaCC’s advantages as a pre-processing technique that reduces large-scale genomics data into a smaller number of genomic regions through several cancer epigenetics case studies on subtype discovery, network estimation, and epigenetic-wide association.

Item Graphical Models via Univariate Exponential Family Distributions (JMLR, 2015)
Yang, Eunho; Ravikumar, Pradeep; Allen, Genevera I.; Liu, Zhandong
Undirected graphical models, or Markov networks, are a popular class of statistical models, used in a wide variety of applications. Popular instances of this class include Gaussian graphical models and Ising models. In many settings, however, it might not be clear which subclass of graphical models to use, particularly for non-Gaussian and non-categorical data. In this paper, we consider a general sub-class of graphical models where the node-wise conditional distributions arise from exponential families. This allows us to derive multivariate graphical model distributions from univariate exponential family distributions, such as the Poisson, negative binomial, and exponential distributions. Our key contributions include a class of M-estimators to fit these graphical model distributions; and rigorous statistical analysis showing that these M-estimators recover the true graphical model structure exactly, with high probability.
We provide examples of genomic and proteomic networks learned via instances of our class of graphical models derived from Poisson and exponential distributions.

Item Imaging genetics via sparse canonical correlation analysis (IEEE, 2013)
Chi, Eric C.; Allen, Genevera I.; Zhou, Hua; Kohannim, Omid; Lange, Kenneth; Thompson, Paul M.
The collection of brain images from populations of subjects who have been genotyped with genome-wide scans makes it feasible to search for genetic effects on the brain. Even so, multivariate methods are sorely needed that can search both images and the genome for relationships, making use of the correlation structure of both datasets. Here we investigate the use of sparse canonical correlation analysis (CCA) to home in on sets of genetic variants that explain variance in a set of images. We extend recent work on penalized matrix decomposition to account for the correlations in both datasets. Such methods show promise in imaging genetics as they exploit the natural covariance in the datasets. They also avoid an astronomically heavy statistical correction for searching the whole genome and the entire image for promising associations.

Item Inferential Methods to Find Differences in Population of Graphical Models with Applications to Functional Connectomics (2016-03-18)
Narayan, Manjari; Baraniuk, Richard G.; Allen, Genevera I.
In many neuroimaging modalities, scientists observe neural activity at distinct units of brain function but seek to study and manipulate functional connectivity or unobserved latent relationships between these units. Functional connectivity is commonly described using networks where nodes correspond to brain locations or regions, electrodes, circuits or neurons while edges correspond to some notion of statistical dependence.
Such network models are increasingly used in clinical neuroimaging where scientists seek to find robust network biomarkers to detect specific brain-based disorders, explain underlying disease mechanisms and guide personalized treatment regimes. However, functional connectivity networks are never observed but estimated from complex and noisy data, and as a result, estimated networks are prone to statistical errors. This dissertation shows that failure to account for such statistical errors compromises subsequent inferential analyses to find differences in functional connectivity and proposes a new statistical framework that ameliorates these problems, thus improving the reproducibility of functional connectivity studies. Formally, this dissertation identifies a new statistical problem, Population Post-Selection Inference or popPSI, that arises in functional neuroimaging when scientists ask inferential questions such as: How do network metrics differ between a population of unhealthy subjects and healthy controls? How do individual networks vary with symptom severity? To investigate popPSI issues in such questions, we use two-level models to study network differences, specifically employing Gaussian graphical models (GGMs) for functional connectivity. Whereas standard test statistics do not adequately control type I and type II errors for such models, R^3, our novel methodological approach based on resampling, random penalization, and random effects test statistics, addresses the deficiencies of current test statistics employed in neuroimaging. Our framework is general and can be used to test general linear hypotheses of the network at the edge, node or global level. Using extensive simulation studies for a wide variety of sample sizes and network structures, we show that R^3 offers improvements in statistical power and error for various network metrics.
Real data case studies reveal that our methods find meaningful and clinically relevant network differences in synesthesia, neurofibromatosis-1 and autism spectrum disorders.

Item Integrative Generalized Convex Clustering Optimization and Feature Selection for Mixed Multi-View Data (JMLR, 2021)
Wang, Minjie; Allen, Genevera I.
In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets.
Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.

Item Mixed Effects Models for Resampled Network Statistics Improves Statistical Power to Find Differences in Multi-Subject Functional Connectivity (Frontiers, 2016)
Narayan, Manjari; Allen, Genevera I.
Many complex brain disorders, such as autism spectrum disorders, exhibit a wide range of symptoms and disability. To understand how brain communication is impaired in such conditions, functional connectivity studies seek to understand individual differences in brain network structure in terms of covariates that measure symptom severity. In practice, however, functional connectivity is not observed but estimated from complex and noisy neural activity measurements. Imperfect subject network estimates can compromise subsequent efforts to detect covariate effects on network structure. We address this problem in the case of Gaussian graphical models of functional connectivity, by proposing novel two-level models that treat both subject level networks and population level covariate effects as unknown parameters. To account for imperfectly estimated subject level networks when fitting these models, we propose two related approaches: R^2, based on resampling and random effects test statistics, and R^3, which additionally employs random adaptive penalization. Simulation studies using realistic graph structures reveal that R^2 and R^3 have superior statistical power to detect covariate effects compared to existing approaches, particularly when the number of within subject observations is comparable to the size of subject networks.
Using our novel models and methods to study parts of the ABIDE dataset, we find evidence of hypoconnectivity associated with symptom severity in autism spectrum disorders, in frontoparietal and limbic systems as well as in anterior and posterior cingulate cortices.

Item Molecular pathway identification using biological network-regularized logistic models (BioMed Central, 2013)
Zhang, Wen; Wan, Ying-wooi; Allen, Genevera I.; Pang, Kaifang; Anderson, Matthew L.; Liu, Zhandong
Background: Selecting genes and pathways indicative of disease is a central problem in computational biology. This problem is especially challenging when parsing multi-dimensional genomic data. A number of tools, such as L1-norm based regularization and its extensions elastic net and fused lasso, have been introduced to deal with this challenge. However, these approaches tend to ignore the vast amount of a priori biological network information curated in the literature. Results: We propose the use of graph Laplacian regularized logistic regression to integrate biological networks into disease classification and pathway association problems. Simulation studies demonstrate that the performance of the proposed algorithm is superior to elastic net and lasso analyses. Utility of this algorithm is also validated by its ability to reliably differentiate breast cancer subtypes using a large breast cancer dataset recently generated by the Cancer Genome Atlas (TCGA) consortium. Many of the protein-protein interaction modules identified by our approach are further supported by evidence published in the literature. Source code of the proposed algorithm is freely available at http://www.github.com/zhandong/Logit-Lapnet. Conclusion: Logistic regression with graph Laplacian regularization is an effective algorithm for identifying key pathways and modules associated with disease subtypes.
With the rapid expansion of our knowledge of biological regulatory networks, this approach will become more accurate and increasingly useful for mining transcriptomic, epi-genomic, and other types of genome wide association studies.

Item Neural Networks of Colored Sequence Synesthesia (Society for Neuroscience, 2013)
Tomson, Steffie N.; Narayan, Manjari; Allen, Genevera I.; Eagleman, David M.
Synesthesia is a condition in which normal stimuli can trigger anomalous associations. In this study, we exploit synesthesia to understand how the synesthetic experience can be explained by subtle changes in network properties. Of the many forms of synesthesia, we focus on colored sequence synesthesia, a form in which colors are associated with overlearned sequences, such as numbers and letters (graphemes). Previous studies have characterized synesthesia using resting-state connectivity or stimulus-driven analyses, but it remains unclear how network properties change as synesthetes move from one condition to another. To address this gap, we used functional MRI in humans to identify grapheme-specific brain regions, thereby constructing a functional “synesthetic” network. We then explored functional connectivity of color and grapheme regions during a synesthesia-inducing fMRI paradigm involving rest, auditory grapheme stimulation, and audiovisual grapheme stimulation. Using Markov networks to represent direct relationships between regions, we found that synesthetes had more connections during rest and auditory conditions. We then expanded the network space to include 90 anatomical regions, revealing that synesthetes tightly cluster in visual regions, whereas controls cluster in parietal and frontal regions. Together, these results suggest that synesthetes have increased connectivity between grapheme and color regions, and that synesthetes use visual regions to a greater extent than controls when presented with dynamic grapheme stimulation.
These data suggest that synesthesia is better characterized by studying global network dynamics than by individual properties of a single brain region.

Item On the Reproducibility of TCGA Ovarian Cancer MicroRNA Profiles (Public Library of Science, 2014)
Wan, Ying-Wooi; Mach, Claire M.; Allen, Genevera I.; Anderson, Matthew L.; Liu, Zhandong
Dysregulated microRNA (miRNA) expression is a well-established feature of human cancer. However, the role of specific miRNAs in determining cancer outcomes remains unclear. Using Level 3 expression data from the Cancer Genome Atlas (TCGA), we identified 61 miRNAs that are associated with overall survival in 469 ovarian cancers profiled by microarray (p<0.01). We also identified 12 miRNAs that are associated with survival when miRNAs were profiled in the same specimens using Next Generation Sequencing (miRNA-Seq) (p<0.01). Surprisingly, only 1 miRNA transcript is associated with ovarian cancer survival in both datasets. Our analyses indicate that this discrepancy is due to the fact that miRNA levels reported by the two platforms correlate poorly, even after correcting for potential issues inherent to signal detection algorithms. Corrections for false discovery and microRNA abundance had minimal impact on this discrepancy. Further investigation is warranted.

Item Quantifying cognitive resilience in Alzheimer’s Disease: The Alzheimer’s Disease Cognitive Resilience Score (Public Library of Science, 2020)
Yao, Tianyi; Sweeney, Elizabeth; Nagorski, John; Shulman, Joshua M.; Allen, Genevera I.
Even though there is a clear link between Alzheimer’s Disease (AD) related neuropathology and cognitive decline, numerous studies have observed that healthy cognition can exist in the presence of extensive AD pathology, a phenomenon sometimes called Cognitive Resilience (CR).
To better understand and study CR, we develop the Alzheimer’s Disease Cognitive Resilience Score (AD-CR Score), which we define as the difference between the observed and expected cognition given the observed level of AD pathology. Unlike other definitions of CR, our AD-CR Score is a fully non-parametric, stand-alone, individual-level quantification of CR that is derived independently of other factors or proxy variables. Using data from two ongoing, longitudinal cohort studies of aging, the Religious Orders Study (ROS) and the Rush Memory and Aging Project (MAP), we validate our AD-CR Score by showing strong associations with known factors related to CR such as baseline and longitudinal cognition, non AD-related pathology, education, personality, APOE, parkinsonism, depression, and life activities. Even though the proposed AD-CR Score cannot be directly calculated during an individual’s lifetime because it uses postmortem pathology, we also develop a machine learning framework that achieves promising results in terms of predicting whether an individual will have an extremely high or low AD-CR Score using only measures available during the lifetime. Given this, our AD-CR Score can be used for further investigations into mechanisms of CR, and potentially for subject stratification prior to clinical trials of personalized therapies.

Item Regularized partial least squares with an application to NMR spectroscopy (John Wiley & Sons, Inc., 2013)
Allen, Genevera I.; Peterson, Christine; Vannucci, Marina; Maletic-Savatic, Mirjana
High-dimensional data common in genomics, proteomics, and chemometrics often contains complicated correlation structures. Recently, partial least squares (PLS) and Sparse PLS methods have gained attention in these areas as dimension reduction techniques in the context of supervised data analysis. We introduce a framework for Regularized PLS by solving a relaxation of the SIMPLS optimization problem with penalties on the PLS loadings vectors.
Our approach enjoys many advantages including flexibility, general penalties, easy interpretation of results, and fast computation in high-dimensional settings. We also outline extensions of our methods leading to novel methods for non-negative PLS and generalized PLS, an adoption of PLS for structured data. We demonstrate the utility of our methods through simulations and a case study on proton Nuclear Magnetic Resonance (NMR) spectroscopy data.

Item Resting state functional MRI reveals abnormal network connectivity in neurofibromatosis 1 (Wiley, 2015)
Tomson, Steffie N.; Schreiner, Matthew J.; Narayan, Manjari; Rosser, Tena; Enrique, Nicole; Silva, Alcino J.; Allen, Genevera I.; Bookheimer, Susan Y.; Bearden, Carrie E.
Neurofibromatosis type I (NF1) is a genetic disorder caused by mutations in the neurofibromin 1 gene at locus 17q11.2. Individuals with NF1 have an increased incidence of learning disabilities, attention deficits, and autism spectrum disorders. As a single-gene disorder, NF1 represents a valuable model for understanding gene–brain–behavior relationships. While mouse models have elucidated molecular and cellular mechanisms underlying learning deficits associated with this mutation, little is known about functional brain architecture in human subjects with NF1. To address this question, we used resting state functional connectivity magnetic resonance imaging (rs-fcMRI) to elucidate the intrinsic network structure of 30 NF1 participants compared with 30 healthy demographically matched controls during an eyes-open rs-fcMRI scan. Novel statistical methods were employed to quantify differences in local connectivity (edge strength) and modularity structure, in combination with traditional global graph theory applications. Our findings suggest that individuals with NF1 have reduced anterior–posterior connectivity, weaker bilateral edges, and altered modularity clustering relative to healthy controls.
Further, edge strength and modular clustering indices were correlated with IQ and internalizing symptoms. These findings suggest that Ras signaling disruption may lead to abnormal functional brain connectivity; further investigation into the functional consequences of these alterations in both humans and in animal models is warranted.

Item Statistical Machine Learning Approaches for Data Integration and Graphical Models (2021-04-26)
Wang, Minjie; Allen, Genevera I.
Unsupervised learning aims to identify underlying patterns in unlabeled data. In this thesis, we develop methodologies involving two popular unsupervised learning problems: clustering with application to data integration and graphical models. As the volume and variety of data grows, data integration, which analyzes multiple sources of data simultaneously, has gained increasing popularity. We study mixed multi-view data, where multiple sets of diverse features are measured on the same set of samples. In the first project, by integrating all available data sources, we seek to uncover common group structure among the samples from unlabeled mixed multi-view data that may be hidden in individualistic cluster analyses of a single data view. To achieve this, we propose and develop a convex formalization that inherits the strong mathematical and empirical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers.
Our iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data. In the second project, we seek to come up with more meaningful interpretations of clustering, which has often been challenging due to its unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy “supervising auxiliary variables”, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. We propose and develop a new statistical pattern discovery method named Supervised Convex Clustering (SCC) that borrows strength from both unlabeled data and the so-called supervising auxiliary variable in order to find more interpretable patterns with a joint convex fusion penalty. Graphical models, statistical machine learning models defined on graphs, have been widely studied to understand conditional dependencies among a collection of random variables. In the third project, we consider graph selection in the presence of latent variables, a quite challenging problem in neuroscience where existing technologies can only record from a small subset of neurons. 
We propose an incredibly simple solution: apply a hard thresholding operator to existing graph selection methods, and demonstrate that thresholding the graphical Lasso, neighborhood selection, or CLIME estimators has superior theoretical properties in terms of graph selection consistency as well as stronger empirical results than existing approaches for the latent variable graphical model problem. We also demonstrate the applicability of our approach through a neuroscience case study on calcium-imaging data to estimate functional neural connections.

Item Supervised convex clustering (Wiley, 2023)
Wang, Minjie; Yao, Tianyi; Allen, Genevera I.
Clustering has long been a popular unsupervised learning approach to identify groups of similar objects and discover patterns from unlabeled data in many applications. Yet, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to their unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named supervised convex clustering (SCC) that borrows strength from both information sources and guides towards finding more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's disease genomics.
Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's disease that can potentially lead to better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.