Browsing by Author "Li, Meng"
Now showing 1 - 18 of 18
Item: Advanced Bayesian Models for Dependent Data (2023-04-18). Zeng, Zijian; Li, Meng; Vannucci, Marina.
Over the past few years, there has been a noticeable increase in the amount of available data with complex dependent structure. Bayesian statistics is an approach to inference based on Bayes' theorem, which is interpretable and provides uncertainty quantification. These advantages have made Bayesian methods widely used across various applied fields, including the social sciences, ecology, genetics, medicine, and more. In this thesis, we advance the application of Bayesian methods to three different types of dependent data. For the first project, we develop a Bayesian median autoregressive model for time series forecasting. This model utilizes time-varying quantile regression at the median, which inherits the robustness of median regression in contrast to the widely used mean-based methods. We use Bayesian model averaging to account for model uncertainty, including the uncertainty in the autoregressive order, in addition to a Bayesian model selection approach. The second project addresses image-on-scalar regression. We consider a Bayesian hierarchical Gaussian process model for image smoothing that uses a flexible Inverse-Wishart process prior to handle within-image dependency. We propose a general global-local spatial selection prior that achieves simultaneous global (i.e., at the covariate level) and local (i.e., at the pixel/voxel level) selection. We introduce participation rate parameters that measure the probability for individual covariates to affect the observed images. This, along with a hard-thresholding strategy, leads to dependency between selections at the two levels, introduces extra sparsity at the local level, and allows the global selection to be informed by the local selection, all in a model-based manner. The last project is on Gaussian graphical regression models with covariates.
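The participation-rate-with-hard-thresholding idea described above can be sketched numerically. This is a minimal illustration, not the thesis's actual model: the coefficient maps, threshold, and cutoff below are all hypothetical choices made up for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
P = 100  # pixels per image (hypothetical)

# hypothetical posterior-mean coefficient maps for 3 covariates
beta = np.zeros((3, P))
beta[0, :40] = 1.5 + 0.1 * rng.standard_normal(40)  # broadly active covariate
beta[1, :3] = 0.2                                   # weak, sub-threshold covariate
# beta[2] is entirely inactive

thresh = 0.5                     # hard threshold for local (pixel-level) selection
local = np.abs(beta) > thresh    # pixel-level selection map
rate = local.mean(axis=1)        # participation rate per covariate
global_sel = rate > 0.05         # covariate-level selection informed by local selection
```

Here the weak covariate is excluded globally because too few of its pixels survive the hard threshold, which is the sense in which local selection informs global selection.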
We use a tensor representation of the regression coefficients to describe the multi-level selection achieved by the proposed prior: covariate-level, edge-level, and local-level. Simultaneous multi-level selection is done by nesting a global-local spike-and-slab prior within a sparse group selection prior. This nested prior first achieves global-level selection, excluding a covariate, by measuring the probability of the covariate being influential, and then, conditional on the outcome, performs edge-level selection in the manner of conventional Gaussian graphical regression models. In a fully Bayes approach, we design Markov chain Monte Carlo (MCMC) samplers for all three models and show in simulations and real data applications (with U.S. macroeconomic data, autism brain imaging data, and human gene expression data, respectively) that the proposed Bayesian methods are competitive with existing models. Furthermore, the proposed Bayesian methods are highly interpretable and able to provide joint uncertainty quantification via posterior samples for prediction and/or inference.

Item: Advanced Nonparametric Methods for Function Derivatives and Related Localized Features (2024-04-15). Liu, Zejian; Li, Meng.
In this thesis, we propose nonparametric statistical methods to study the derivatives and localized features of functions, areas of paramount importance across diverse scientific disciplines such as cosmology, environmental science, and neuroscience. We first develop a plug-in kernel ridge regression estimator for derivatives of arbitrary order. We provide both non-asymptotic and asymptotic error bounds for this estimator and show its substantial improvements over existing techniques, particularly for high-order derivatives and multi-dimensional settings.
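The plug-in idea behind a kernel ridge regression derivative estimator can be sketched in one dimension: fit the function with a ridge-penalized kernel expansion, then differentiate the fitted expansion analytically. This is a toy sketch, not the thesis's estimator; the RBF kernel, bandwidth, and ridge penalty below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.05 * rng.standard_normal(x.size)  # noisy observations of sin(x)

ell, lam = 0.5, 1e-3  # hypothetical bandwidth and ridge penalty
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))
alpha = np.linalg.solve(K + lam * np.eye(x.size), y)  # ridge-regularized weights

def deriv(x0):
    """Plug-in first derivative: differentiate the fitted kernel expansion at x0."""
    k = np.exp(-(x0 - x) ** 2 / (2 * ell ** 2))
    return np.sum(alpha * (-(x0 - x) / ell ** 2) * k)
```

Since the true derivative of sin is cos, `deriv(np.pi)` should be close to -1 for this toy data.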
We then transition to exploring the application of Gaussian processes (GPs) in modeling derivative functionals, addressing a long-standing skepticism surrounding derivative estimation strategies using GPs. We show that the GP prior exhibits a remarkable plug-in property with minimax optimality, which, to the best of our knowledge, provides the first positive result for using GPs to estimate function derivatives. A data-driven empirical Bayes approach is studied for hyperparameter tuning, which achieves theoretical optimality and computational efficiency. Lastly, we tackle the challenging problem of identifying and analyzing localized features of functions, such as local extrema, in the presence of noise. To this end, we introduce a semiparametric Bayesian method built upon derivative-constrained GP priors. We establish large-sample frequentist properties for the proposed method, from which point and interval estimates are derived, offering the advantage of fast implementation without requiring complex sampling procedures. Overall, this thesis blends theoretical and methodological innovation with practical application, aiming to equip researchers and practitioners with more powerful and versatile tools for the analysis of nonparametric functions. The advancements presented herein not only contribute to the statistical community but also have broad implications across various fields where derivative estimation and localized feature analysis are essential. Our methods are illustrated via a climate change application analyzing global sea-level rise, and an application to cognitive science analyzing event-related potentials.

Item: Advanced Statistical Learning Methods in Image Processing (2020-11-30). Liu, Rongjie; Li, Meng.
With the rapid development of modern technology, massive imaging datasets have been routinely collected in a wide range of applications, e.g., ImageNet in natural image processing and the Human Connectome Project in computational neuroscience.
These big and complex imaging datasets have facilitated a surge of interest in image processing at the interplay of statistics, computer vision, and medical science. While new approaches are constantly being developed, scalability, interpretability, and uncertainty quantification are still considered the main challenges. This thesis encompasses three projects, in which we develop advanced statistical methods to address daunting challenges in three key image processing problems. First, in the image compression problem, we develop a scalable and model-based method called Compression through Adaptive Recursive Partitioning (CARP) to compress m-dimensional images in a unified fashion. CARP uses an optimal permutation of the image pixels, inferred from a Bayesian probabilistic model on recursive partitions of the image, to reduce its effective dimensionality, achieving a parsimonious representation that preserves information. Extensive experiments show that CARP dominates state-of-the-art image/video compression approaches, including JPEG, JPEG2000, BPG, MPEG-4, HEVC, and a neural network-based method, across all image/video types and often on nearly all of the individual images. Second, in the image reconstruction problem, we establish posterior concentration rates for wavelets with adaptive recursive partitioning for multi-dimensional imaging data, which are minimax optimal up to a logarithmic factor under the supremum norm. This provides strong theoretical support for a fast reconstruction method that scales linearly with the number of pixels and provides uncertainty quantification. Third, in the brain image parcellation problem, we propose a new principal parcellation analysis for learning the relevant regions of interest (ROIs) automatically based on white matter fiber bundles.
We develop a novel framework that conducts clustering analysis on the fibers' ending points to redefine parcellation and eventually predict human traits, which dramatically improves parsimony and prediction compared to connectomes based on anatomical parcellation. It eliminates the need to choose ROIs manually, reducing subjectivity and leading to a substantially different representation of the connectome.

Item: Analysis of sex-based differences in clinical and molecular responses to ischemia reperfusion after lung transplantation (Springer Nature, 2021). Chacon-Alberty, Lourdes; Ye, Shengbin; Daoud, Daoud; Frankel, William C.; Virk, Hassan; Mase, Jonathan; Hochman-Mendez, Camila; Li, Meng; Sampaio, Luiz C.; Taylor, Doris A.; Loor, Gabriel.
Background: Sex and hormones influence immune responses to ischemia reperfusion (IR) and could, therefore, cause sex-related differences in lung transplantation (LTx) outcomes. We compared men's and women's clinical and molecular responses to post-LTx IR. Methods: In 203 LTx patients, we used the 2016 International Society for Heart and Lung Transplantation guidelines to score primary graft dysfunction (PGD). In a subgroup of 40 patients with blood samples collected before LTx (T0) and 6, 24, 48 (T48), and 72 h (T72) after lung reperfusion, the molecular response to IR was examined through serial analysis of circulating cytokine expression. Results: After adjustment, women had less grade 3 PGD than men at T48, but not at T72. PGD grade decreased from T0 to T72 more often in women than in men. The evolution of PGD (the difference in mean PGD between T72 and T0) was greater in men. However, the evolution of IL-2, IL-7, IL-17a, and basic fibroblast growth factor levels was more often sustained throughout the 72 h in women. In the full cohort, we noted no sex differences in secondary clinical outcomes, but women had significantly lower peak lactate levels than men across the 72 h.
Conclusions: Men and women differ in the evolution of PGD and cytokine secretion after LTx: women have a more sustained proinflammatory response than men despite a greater reduction in PGD over time. This interaction between cytokine and PGD responses warrants investigation. Additionally, there may be important sex-related differences that could be used to tailor treatment during or after transplantation.

Item: BayesBD: An R Package for Bayesian Inference on Image Boundaries (The R Foundation, 2017). Syring, Nicholas; Li, Meng.
We present the BayesBD package, which provides Bayesian inference for boundaries of noisy images. The BayesBD package implements flexible Gaussian process priors indexed by the circle to recover the boundary in a binary or Gaussian-noised image. The boundary recovered by BayesBD has the practical advantages of guaranteed geometric restrictions and convenient joint inferences under certain assumptions, in addition to its desirable theoretical property of achieving a (nearly) minimax optimal rate adaptively to the unknown smoothness. The core sampling tasks for our model have linear complexity and are implemented in C++ for computational efficiency using the packages Rcpp and RcppArmadillo. Users can access the full functionality of the package both from the command line and through the corresponding Shiny application. Additionally, the package includes numerous utility functions to aid users in data preparation and analysis of results.
We compare BayesBD with selected existing packages using both simulations and real data applications, demonstrating the excellent performance and flexibility of BayesBD even when the observation contains complicated structural information that may violate its assumptions.

Item: Bayesian Image-on-Scalar Regression with a Spatial Global-Local Spike-and-Slab Prior (Project Euclid, 2024). Zeng, Zijian; Li, Meng; Vannucci, Marina.
In this article, we propose a novel spatial global-local spike-and-slab selection prior for image-on-scalar regression. We consider a Bayesian hierarchical Gaussian process model for image smoothing that uses a flexible Inverse-Wishart process prior to handle within-image dependency, and propose a general global-local spatial selection prior that broadly relates to a rich class of well-studied selection priors. Unlike existing constructions, we achieve simultaneous global (i.e., at the covariate level) and local (i.e., at the pixel/voxel level) selection by introducing participation rate parameters that measure the probability for the individual covariates to affect the observed images. This, along with a hard-thresholding strategy, leads to dependency between selections at the two levels, introduces extra sparsity at the local level, and allows the global selection to be informed by the local selection, all in a model-based manner. We design an efficient Gibbs sampler that allows inference for large image data. We show on simulated data that parameters are interpretable and lead to efficient selection. Finally, we demonstrate the performance of the proposed model using data from the Autism Brain Imaging Data Exchange (ABIDE) study (Di Martino et al., 2014).

Item: Fake News Detection with Headlines (Rice University, 2023-12-12). Ramirez, Gared; Li, Meng.
Fake news has become an increasing problem due to the rising use of the Internet and social media.
It is important to be able to distinguish sources of fake and misleading news articles to ensure that misinformation does not sow discord, erode trust in credible sources, and negatively impact our personal and societal well-being. Moreover, in an age where many people only skim headlines without delving into the full articles, the ability to discern fake news from headlines alone becomes even more crucial. To detect and classify fake news, we implement and compare five machine learning models (naive Bayes, logistic regression, decision tree, random forest, and support vector machine) on two different datasets: a benchmark dataset and a dataset with full articles and headlines. We utilize measures such as term frequency-inverse document frequency (TF-IDF) and sentiment scores as predictors in our models. We find that naive Bayes consistently performs best on both datasets, with accuracies of 64.40% and 92.56%, respectively.

Item: Feature Learning and Bayesian Functional Regression for High-Dimensional Complex Data (2021-12-02). Zohner, Ye Emma M; Li, Meng; Morris, Jeffrey S.
In recent years, technological innovations have facilitated the collection of complex, high-dimensional data that pose substantial modeling challenges. Most of the time, these complex objects are strongly characterized by internal structure that makes sparse representations possible. If we can learn a sparse set of features that accurately captures the salient features of a given object, then we can model these features using standard statistical tools, including clustering, regression, and classification. The key question is how well this sparse set of features captures the salient information in the objects. In this thesis, we develop methodology for evaluating latent feature representations for functional data and for using these latent features within functional regression frameworks to build flexible models.
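The TF-IDF-plus-naive-Bayes approach from the fake news detection entry above can be sketched with scikit-learn. The toy headlines below are invented for illustration and are not from the study's datasets; the study also compared several other classifiers and predictors.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical labeled headlines, invented for this sketch
headlines = [
    "miracle cure doctors hate this one weird trick",
    "shocking secret celebrities will never tell you",
    "aliens secretly control the weather claims insider",
    "senate passes budget bill after lengthy debate",
    "local school opens new science laboratory",
    "city council approves funding for road repairs",
]
labels = ["fake", "fake", "fake", "real", "real", "real"]

# TF-IDF features feeding a multinomial naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(headlines, labels)
```

On real data one would of course evaluate on a held-out test set rather than the training headlines.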
In the first project, we introduce a graphical latent feature representation tool (GLaRe) to learn features and assess how well a given feature learning approach captures the salient information in a data object. In the second project, we build on this feature learning methodology to propose a basis strategy for fitting functional regression models when the domain is a closed manifold. This methodology is applied to MRI data to characterize patterns of infant cortical thickness development in the first two years of life. In the third project, we adapt our feature learning and Bayesian functional regression methodology to high-frequency data streams. We model high-frequency intraocular pressure data streams using custom bases for quantile representations of the underlying distribution, and provide insights into the etiology of glaucoma.

Item: Improved data quality and statistical power of trial-level event-related potentials with Bayesian random-shift Gaussian processes (Springer Nature, 2024). Pluta, Dustin; Hadj-Amar, Beniamino; Li, Meng; Zhao, Yongxiang; Versace, Francesco; Vannucci, Marina.
Studies of cognitive processes via electroencephalogram (EEG) recordings often analyze group-level event-related potentials (ERPs) averaged over multiple subjects and trials. This averaging procedure can obscure scientifically relevant variability across subjects and trials, but has been necessary due to the difficulties posed by inference of trial-level ERPs. We introduce the Bayesian Random Phase-Amplitude Gaussian Process (RPAGP) model for inference of trial-level amplitude, latency, and ERP waveforms. We apply RPAGP to data from a study of ERP responses to emotionally arousing images. The model estimates of trial-specific signals are shown to greatly improve statistical power in detecting significant differences between experimental conditions compared to existing methods.
Our results suggest that replacing the observed data with the de-noised RPAGP predictions can potentially improve the sensitivity and accuracy of many existing ERP analysis pipelines.

Item: Network Modeling Approaches Leveraging Higher Order Dependence for Complex Biomedical Data (2021-12-21). Desai, Neel Mehul; Morris, Jeffrey S; Baladandayuthapani, Veera; Li, Meng.
In this work I propose three network modeling approaches that estimate and leverage dependence in novel ways across and within networks to enhance inference for complex modern biomedical data. Specifically, our methods are motivated by the personalized drug discovery problem in precision medicine and by the identification of factors explaining inter-subject variability in functional connectivity brain networks. I first present NetCellMatch, a network-based multiscale matching algorithm designed for mapping patient tumors to in-vitro cancer cell lines for personalized drug discovery. Our algorithm first constructs a global network across all patient-cell line samples using their genomic similarity. Then, a multi-scale community detection algorithm integrates information across topologically meaningful clustering scales to obtain Network-Based Matching Scores (NBMS). NBMS are measures of cluster robustness that map patient tumors to cell lines. I apply NetCellMatch to reverse-phase protein array data obtained from the Cancer Genome Atlas for patients and from the MD Anderson Cell Lines Project for cell lines. Along with "avatar" cell line identification for subgroups of patients, I evaluate connectivity patterns for breast, lung, and colon cancer and explore the proteomic profiles of avatars and their corresponding top matching patients. Our results demonstrate our framework's ability to identify both patient-cell line matches and potential proteomic drivers of similarity. Our algorithm is general and can be easily adapted to integrate multi-omic datasets.
Next, I present general methodology to regress subject-specific networks on a set of covariates that produces both multiplicity-adjusted hypothesis tests for which covariates affect networks and statistical measures indicating which network edges are driving these differences. Our strategy projects subject-specific empirical correlation matrices into an alternative space using a matrix logarithm transform, which ensures positive-semidefiniteness and justifies Gaussian modeling. Using a Gaussian multivariate regression framework in this space with cutting-edge sparsity priors, I regress the networks on predictors while discovering and accounting for second-order dependence across network edges, which I show leads to greater efficiency and power for statistical inference. I validate our framework via extensive simulation and apply our approach to analyze functional connectivity networks of 1003 healthy young subjects from the Human Connectome Project (HCP), demonstrating concordance between results in the transformed and original spaces. Our second project is limited to the consideration of strictly linear associations between covariates and network edges. Seeking to extend the framework of the previous project, I conclude by developing methodology to solve multivariate sparse generalized additive models. I jointly select between null, linear, and non-linear effects in an efficient, theoretically justified, parallelizable manner while accounting for estimated sparse residual structure. I provide validation for our method via simulation, demonstrating the benefits of accounting for residual structure in the selection and estimation of linear and non-linear associations, in a manner analogous to the principles of seemingly unrelated regression.
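The matrix logarithm projection used in the network regression project above can be sketched with SciPy: the log of a correlation matrix lives in a linear (unconstrained) space, and the matrix exponential maps back to a valid positive-definite matrix. The toy 3x3 correlation matrix below is a hypothetical example.

```python
import numpy as np
from scipy.linalg import logm, expm

# a toy empirical correlation matrix (symmetric positive definite)
C = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

L = logm(C)        # matrix logarithm: entries are now unconstrained reals,
                   # so Gaussian modeling of the (vectorized) entries is sensible
C_back = expm(L)   # matrix exponential inverts the transform exactly
```

Any symmetric matrix modeled in log-space maps back through `expm` to a symmetric positive-definite matrix, which is why the transform "ensures positive-semidefiniteness" in the abstract's sense.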
Our method is applied to the aforementioned dataset from the HCP to explore potential non-linear effects between covariates and network edges.

Item: Prediction of WilderHill Clean Energy Index Directional Movement (Rice University, 2023-05-08). Du, Yolanda; Lu, Lu; Ding, Hongkai; McGuffey, Elizabeth; Li, Meng.
The popularity of clean energy has risen recently due to concerns about climate change and the exhaustion of traditional energy sources. The stock price of clean energy companies reflects the public's attention to the industry's growth potential, and clean energy stocks are among the riskiest stocks to invest in. Thus, it is important to apply quantitative methods to analyze the financial risks and returns of renewable energy stocks. Prior work on the topic has focused mainly on inference rather than prediction of renewable energy stock prices. In this investigation, the directional movement of the WilderHill Clean Energy Index is predicted using machine learning methods including logistic regression, random forest, and neural networks. Using data including technical indicators and macroeconomic variables, the aim is to predict the movement of the WilderHill Clean Energy Index with high accuracy. The results suggest that for the classification models with two directions, random forest and neural networks outperform full logistic regression and stepwise logistic regression. For the classification models with a three-category target variable, random forest and neural network models outperform full logistic regression and stepwise logistic regression in overall accuracy; however, the methods give varying results for different outcome classes with regard to sensitivity and specificity. In addition, the relationship between renewable energy stock directional movement and the independent variables is investigated.
The results suggest that two important macroeconomic variables are West Texas Intermediate crude oil prices and the NYSE Arca Tech 100 Index.

Item: Pregnant and Peripartum Women with COVID-19 Have High Survival with Extracorporeal Membrane Oxygenation: An Extracorporeal Life Support Organization Registry Analysis (American Thoracic Society, 2022). O'Neil, Erika R.; Lin, Huiming; Shamshirsaz, Amir A.; Naoum, Emily E.; Rycus, Peter R.; Alexander, Peta M. A.; Ortoleva, Jamel P.; Li, Meng; Anders, Marc M.

Item: Shape-constrained Regression and False Discovery Rate Control for Functional Data with Applications to Human Intracranial Electroencephalography Study (2021-06-22). Wang, Zhengjia; Li, Meng.
Intracranial electroencephalography (iEEG) is a neuroscience technique that allows for recordings of human brain activity with high spatial and temporal resolution. Since iEEG data consist of continuous recordings of voltage signals from electrodes at different brain locations, functional data analysis (FDA) could be a useful tool for transforming iEEG data into meaningful discoveries about brain function. Shape constraints that arise from domain knowledge are crucial for a flexible nonparametric model to be interpretable. This thesis advances methods, theory, computation, and visualization for functional data with complex dependent structure and shape constraints by addressing statistical questions raised by the application of FDA to iEEG data. The first project focuses on locally sparse regression functions and develops a weighted group bridge approach for simultaneous function estimation and support recovery in function-on-scalar mixed effect models. We use locally supported B-splines to transform nonparametric functions into vectors of diverging dimension with group sparsity, and propose a fast non-convex optimization algorithm using a nested alternating direction method of multipliers (ADMM) for estimation.
We show that the estimated coefficient functions achieve the minimax optimal rate and exhibit a phase transition phenomenon. For support estimation, we establish selection consistency under approximate sparsity and provide a simple sufficient regularity condition for strict selection consistency. The second project controls the false discovery rate (FDR) for a continuum of hypotheses related to functional data. A key contribution of this project is a set-theoretic framework to embed topological constraints, such as the connectedness of significant regions, into a two-stage multiple testing procedure. In the first stage, we discover significant clusters by controlling a newly proposed extended false cluster rate criterion that allows for overlapping clusters. In the second stage, we control the FDR for individual hypotheses by post-selection inference based on derived conditional p-values given the rejected clusters in the first stage. We propose a testing procedure that reduces the number of unique tests from uncountably infinite to possibly linear in the number of discrete observations sampled from the functional data, substantially facilitating the computation. We show that the FDR is controlled asymptotically at both the cluster and individual levels. In simulations to assess finite sample performance, the proposed methods compare favorably to several recently proposed methods. Methods developed in the first two projects are applied to iEEG data to study multisensory integration in the human brain. The third project develops the software package RAVE ("R" Analysis and Visualization of iEEG data).
RAVE performs all of the steps necessary to analyze iEEG data, producing publication-ready graphics with an easy-to-use graphical user interface, using statistically valid analyses.

Item: Standardization of multivariate Gaussian mixture models and background adjustment of PET images in brain oncology (The Institute of Mathematical Statistics, 2018). Li, Meng; Schwartzman, Armin.
In brain oncology, it is routine to evaluate the progress or remission of the disease based on the differences between a pre-treatment and a post-treatment Positron Emission Tomography (PET) scan. Background adjustment is necessary to reduce confounding by tissue-dependent changes not related to the disease. When modeling the voxel intensities for the two scans as a bivariate Gaussian mixture, background adjustment translates into standardizing the mixture at each voxel, while tumor lesions present themselves as outliers to be detected. In this paper, we address the question of how to standardize the mixture to a standard multivariate normal distribution, so that the outliers (i.e., tumor lesions) can be detected using a statistical test. We show theoretically and numerically that the tail distribution of the standardized scores is favorably close to standard normal in a wide range of scenarios while being conservative at the tails, validating voxelwise hypothesis testing based on standardized scores. To address standardization in spatially heterogeneous image data, we propose a spatial and robust multivariate expectation-maximization (EM) algorithm, in which prior class membership probabilities are provided by transformations of spatial probability template maps and the estimation of the class means and covariances is robust to outliers. Simulations in both univariate and bivariate cases suggest that standardized scores with soft assignment have tail probabilities that are either very close to or more conservative than standard normal.
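The standardize-then-test idea in the PET entry above can be sketched for a single known Gaussian background class: whiten each bivariate intensity so background voxels become approximately standard normal, then flag large standardized scores as candidate lesions. The mean, covariance, cutoff, and "lesion" point below are hypothetical; the paper's EM algorithm additionally estimates mixture classes robustly, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([2.0, 3.0])                        # hypothetical background mean
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])       # hypothetical background covariance
Lc = np.linalg.cholesky(Sigma)

background = mu + rng.standard_normal((5000, 2)) @ Lc.T  # simulated background voxels
lesion = np.array([[8.0, 9.0]])                          # one outlying "tumor" voxel
data = np.vstack([background, lesion])

# standardize: z = Lc^{-1}(x - mu), so z ~ N(0, I) for background voxels
z = np.linalg.solve(Lc, (data - mu).T).T
score = np.linalg.norm(z, axis=1)   # chi-distributed under the background model
flag = score > 4.0                  # conservative tail cutoff for outlier detection
```

Under the background model, P(score > 4) is about 3e-4 for bivariate data, so almost no background voxels are flagged while the outlying voxel is.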
The proposed methods are applied to a real data set from a PET phantom experiment, yet they are generic and can be used in other contexts.

Item: Statistical Approaches for Large-Scale and Complex Omics Data (2019-12-05). Liu, Yusha; Li, Meng; Morris, Jeffrey S.
In this thesis, we propose several novel statistical approaches to analyzing large-scale and complex omics data. This thesis consists of three projects. In the first project, with the goal of characterizing gene-level relationships between DNA methylation and gene expression, we introduce a sequential penalized regression approach to identify methylation-expression quantitative trait loci (methyl-eQTLs), a term we have coined to represent, for each gene and tissue type, a sparse set of CpG loci best explaining gene expression, with accompanying weights indicating the direction and strength of association, which can be used to construct gene-level methylation summaries that are maximally correlated with gene expression for use in integrative models. Using TCGA and MD Anderson colorectal cohorts to build and validate our models, we demonstrate that our strategy explains expression variability much better than commonly used integrative methods. In the second project, we propose a unified Bayesian framework to perform quantile regression on functional responses (FQR). Our approach represents functional coefficients with basis functions to borrow strength from nearby locations, and places a global-local shrinkage prior on the basis coefficients to achieve adaptive regularization. We develop a scalable Gibbs sampler to implement the approach. Simulation studies show that our method has superior performance against competing methods. We apply our method to a mass spectrometry dataset and identify proteomic biomarkers of pancreatic cancer that were entirely missed by mean-regression-based approaches. The third project is a theoretical investigation of the FQR problem, extending the previous project.
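The check (pinball) loss that underlies quantile regression, including the FQR framework above and the median autoregressive model in the first entry of this listing, can be illustrated minimally: the value minimizing the average check loss at level tau is the tau-quantile. The exponential data and bounds below are arbitrary choices for this sketch.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
y = rng.exponential(scale=2.0, size=4000)  # skewed toy data
tau = 0.5                                  # the median; any tau in (0,1) works

def pinball(q):
    """Average check loss: tau * r if r >= 0, (tau - 1) * r otherwise."""
    r = y - q
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

qhat = minimize_scalar(pinball, bounds=(0.0, 20.0), method="bounded").x
```

For tau = 0.5 the minimizer `qhat` matches the empirical median, which is the robustness-to-outliers property median regression inherits over squared-error (mean) regression.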
We propose an interpolation-based estimator that can be strongly approximated by a sequence of Gaussian processes, based upon which we can derive the convergence rate of the estimator and construct simultaneous confidence bands for the functional coefficient. The strong approximation results also build a theoretical foundation for the development of alternative approaches that are shown to have better finite-sample performance in simulation studies.

Item: Statistical Methods for Multivariate Outcomes with Applications to Biomedical Data (2021-11-29). Ding, Maomao; Li, Meng; Ning, Jing; Li, Ruosha.
The analysis of multivariate outcomes is common practice in biomedical studies when independence between outcomes cannot be assumed. Various mechanisms can create dependence in multivariate outcomes. In this thesis, we consider statistical tools for three settings: competing risks data, multivariate outcomes for comprehensively capturing multidimensional symptoms of a disease, and gene co-expression networks. First, competing risks data arise naturally in biomedical studies where subjects are at risk of more than one failure cause, with the causes mutually exclusive. For example, in a study of monoclonal gammopathy of undetermined significance (MGUS), the competing risks outcomes involved time until progression to a plasma cell malignancy (PCM) and time to death without PCM. Second, when no single outcome is sufficient to quantify the multi-dimensional deterioration of a disease, investigators often rely on multiple outcomes for a comprehensive assessment of the global disease status. This is exemplified by a study of Parkinson's disease, where investigators chose five outcomes to jointly capture global disease progression. Last, in gene co-expression network analysis, it is of interest to detect pairs of genes that exhibit a significant co-expression relationship.
Identification of the gene co-expression network helps in understanding the underlying biological processes in a systematic way. In the first project, we propose an estimator of the Polytomous Discrimination Index applicable to competing risks data, which can quantify a prognostic model's ability to discriminate among subjects from different outcome groups. The proposed estimator allows the prediction model to be subject to model misspecification and enjoys desirable asymptotic properties. We also develop an efficient computation algorithm that features a computational complexity of O(n log n). A perturbation resampling scheme is developed to achieve consistent variance estimation. Numerical results suggest that the estimator performs well under realistic sample sizes. We apply the proposed method to a study of monoclonal gammopathy of undetermined significance and evaluate the performance of the Fine-Gray model on this dataset. In the second project, we develop a sensible semiparametric regression strategy for single or multiple outcomes in longitudinal studies. Our method requires minimal assumptions and can accommodate missing data through the inverse probability weighting technique. We estimate the model parameters by maximizing a rank-correlation-type objective function. Under mild regularity conditions, the proposed estimators asymptotically follow a normal distribution, and the asymptotic variance can be estimated by the perturbation-resampling method. We further smooth the original discontinuous objective function by kernel smoothing, and the resulting estimators have the same asymptotic distribution as the original estimators. We propose a computationally stable and efficient procedure for the optimization, which addresses the challenge posed by the non-convexity of the objective function. Numerical studies show that our method performs well under realistic settings.
We apply the proposed method to data from a Parkinson's disease (PD) clinical trial to examine risk factors associated with the global disease burden and/or the progression of PD. In the third project, we propose a conditional independence test to evaluate whether Y1 and Y2 are independent conditional on covariates X. Our method relies on modeling the density functions of Y1|X and Y2|X, but the resulting test is valid as long as at least one of the density functions is correctly specified. Under mild regularity conditions, our test statistic is asymptotically normal, and the asymptotic variance can be consistently estimated. Compared with existing methods, our method is computationally efficient, as no bootstrap or hyper-parameter tuning is required. We extend our method to infer the conditional independence graph and propose a multiple testing procedure to control the false discovery rate. Numerical results suggest that our method performs well under a variety of settings and is robust to density function misspecification. We apply the proposed method to a gastric cancer gene expression data set to understand the associations between genes belonging to the transforming growth factor β signalling pathway.

Item Uncertainty quantification in high-dimensional models and post-selection procedures(2022-11-28) Lin, Huiming; Li, Meng

In the era of big data, the role of uncertainty quantification has become increasingly recognized in wide-ranging areas for transparent, trustworthy, and reproducible data science. This ubiquitous task of quantifying uncertainty can be approached in a multifaceted, context-specific manner.
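The multiple testing step used above to infer the conditional independence graph can be sketched with the Benjamini–Hochberg step-up procedure, a standard FDR-controlling choice; the thesis's exact procedure may differ:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini–Hochberg step-up procedure.

    Returns the indices (into pvals) of the rejected hypotheses while
    controlling the false discovery rate at level alpha.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose sorted p-value clears its threshold
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])
```

In the graph setting, each p-value would correspond to one gene pair, and the rejected pairs form the estimated edge set.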
For example, in the Bayesian paradigm, the goal of uncertainty quantification is to incorporate domain knowledge via the prior specification and base the inference on the posterior distribution; in large-scale hypothesis testing, we may be interested in controlling false positives among selected variables; in frequentist inference settings, it is of fundamental importance to construct confidence intervals with the intended coverage. However, coupling principled uncertainty quantification with interpretability in modern data science faces daunting challenges ranging from modeling and computation to theoretical understanding. In response to these challenges, this thesis includes three projects addressing uncertainty quantification in high-dimensional structured ensembles, high-dimensional false discovery control, and valid confidence intervals in post-selection inference. In the first project, we introduce the concept of structured high-dimensional probability simplexes, motivated by the “forecast combination puzzle” in economics, in which most components are zero or near zero and the remaining ones are close to each other. We propose a novel class of double spike Dirichlet priors to encode this structure, leading to a Bayesian method for structured weighting that is useful for forecast combination and for improving random forests, while enabling uncertainty quantification. Posterior contraction rates are established to characterize the large-sample behavior of the posterior distribution. We demonstrate the wide applicability and competitive performance of the proposed methods through extensive simulations and two real data applications, using the European Central Bank Survey of Professional Forecasters data set and a data set from the UC Irvine Machine Learning Repository.
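At its simplest, the forecast-combination setting underlying the first project is a weighted average with weights on the probability simplex. The sketch below uses hypothetical numbers and only illustrates the "double spike" weight pattern (a few equal weights, the rest at zero); the double spike Dirichlet prior and its posterior computation are not shown:

```python
def combine_forecasts(weights, forecasts):
    """Combine model forecasts with weights on the probability simplex.

    weights: one nonnegative weight per model, summing to 1.
    forecasts: list of per-model forecast sequences of equal length.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    horizon = len(forecasts[0])
    return [sum(w * f[t] for w, f in zip(weights, forecasts))
            for t in range(horizon)]

# A structured-simplex weight vector: mass shared equally by two
# models, the remaining components at zero (hypothetical values).
weights = [0.5, 0.5, 0.0, 0.0]
forecasts = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [-50.0, 0.0]]
```

Encoding this pattern as a prior, rather than fixing the weights, is what lets the method quantify uncertainty about which forecasters matter.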
In the second project, we focus on the false discovery control problem in variable selection for high-dimensional linear models and develop scalable Bayesian estimators that achieve simultaneous false discovery rate and exceedance control. The proposed methods select variables within a sequence of posterior contours centered at a Bayes estimator via constrained optimization, leading to a fast procedure with large-sample guarantees and a finite-sample correction. Extensive numerical studies demonstrate improved false discovery control and robustness over popular alternatives under a wide range of data generation settings. The proposed methods are illustrated by analyzing a Human Immunodeficiency Virus (HIV) data set to detect mutations associated with drug resistance. In the third project, we focus on post-selection inference under best-subset selection criteria, including the commonly used Akaike information criterion (AIC) and Bayesian information criterion (BIC). We characterize the model selection event, which consists of a series of pairwise model comparisons, and derive the conditional distribution of linear estimators for any sample size. Our results elucidate the invalid coverage of conventional confidence intervals and provide a non-asymptotic formulation based on which we construct post-selection confidence intervals with guaranteed frequentist coverage. Simulation studies confirm the coverage of the proposed confidence intervals in finite-sample settings. We use a US consumption data set to show how post-selection inference can arrive at conclusions different from those of conventional inference methods.

Item Using statistical learning to predict interactions between single metal atoms and modified MgO(100) supports(Springer Nature, 2020) Liu, Chun-Yen; Zhang, Shijia; Martinez, Daniel; Li, Meng; Senftle, Thomas P.

Metal/oxide interactions mediated by charge transfer influence reactivity and stability in numerous heterogeneous catalysts.
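Best-subset selection by an information criterion, the setting of the third project above, can be sketched as an exhaustive search. This minimal BIC-based version (with a hypothetical synthetic design) omits the intercept and any post-selection adjustment; it only shows the selection event being conditioned on:

```python
import math
from itertools import combinations
import numpy as np

def best_subset_bic(X, y):
    """Exhaustive best-subset selection by BIC (sketch).

    BIC(S) = n * log(RSS_S / n) + |S| * log(n); smaller is better.
    Returns the selected column indices and their BIC value.
    """
    n, p = X.shape
    best = (math.inf, ())
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = float(np.sum((y - X[:, cols] @ beta) ** 2))
            bic = n * math.log(max(rss, 1e-12) / n) + k * math.log(n)
            if bic < best[0]:
                best = (bic, cols)
    return best[1], best[0]
```

The selected subset is the outcome of all pairwise BIC comparisons at once, which is exactly the selection event that conventional confidence intervals ignore.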
In this work, we use density functional theory (DFT) and statistical learning (SL) to derive models for predicting how the adsorption strength of metal atoms on MgO(100) surfaces can be enhanced by modifications of the support. MgO(100) in its pristine form is relatively unreactive, and thus is ideal for examining ways in which its electronic interactions with metals can be enhanced, tuned, and controlled. We find that the charge transfer characteristics of MgO are readily modified either by adsorbates on the surface (e.g., H, OH, F, and NO2) or by dopants in the oxide lattice (e.g., Li, Na, B, and Al). We use SL methods (i.e., LASSO, the Horseshoe prior, and the Dirichlet–Laplace prior) trained against DFT data to identify physical descriptors for predicting how the adsorption energy of metal atoms will change in response to support modification. These SL-derived feature selection tools are used to screen more than one million candidate descriptors generated from simple chemical properties of the adsorbed metals, MgO, the dopants, and the adsorbates. Among the tested SL tools, we demonstrate that the Dirichlet–Laplace prior predicts metal adsorption energies on MgO most accurately, while also identifying descriptors that are most transferable to chemically similar oxides, such as CaO, BaO, and ZnO.
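Of the three SL tools named above, LASSO is the simplest to sketch. The following is a minimal cyclic coordinate-descent implementation with soft-thresholding, assuming standardized features; it illustrates how an L1 penalty zeroes out uninformative descriptors and is not the pipeline used in the paper:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """LASSO via cyclic coordinate descent with soft-thresholding.

    Minimizes (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.
    Assumes the columns of X are standardized so that x_j'x_j = n.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]  # partial residual
            rho = X[:, j] @ r / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)
    return beta
```

Descriptors whose coefficients are driven exactly to zero are screened out; the surviving descriptors are the candidates for physical interpretation.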