Li, Meng
2023-01-04
2023-01-04
2022-12
2022-11-28
December 2
Lin, Huiming. "Uncertainty quantification in high-dimensional models and post-selection procedures." (2022) Diss., Rice University. <a href="https://hdl.handle.net/1911/114201">https://hdl.handle.net/1911/114201</a>.
https://hdl.handle.net/1911/114201
In the era of big data, the role of uncertainty quantification has become increasingly recognized in wide-ranging areas for transparent, trustworthy, and reproducible data science. This ubiquitous task of quantifying uncertainty can be approached in a multifaceted, context-specific manner. For example, in the Bayesian paradigm, the goal of uncertainty quantification is to incorporate domain knowledge via the prior specification and base the inference on the posterior distribution; in large-scale hypothesis testing, we may be interested in controlling false positives among selected variables; in frequentist inference settings, it is of fundamental importance to construct confidence intervals with the intended coverage. However, coupling principled uncertainty quantification with interpretability in modern data science faces daunting challenges, from modeling and computation to theoretical understanding. In response to these challenges, this thesis includes three projects addressing uncertainty quantification in high-dimensional structured ensembles, high-dimensional false discovery control, and valid confidence intervals in post-selection inference. In the first project, motivated by the “forecast combination puzzle” in economics, we introduce the concept of structured high-dimensional probability simplexes, in which most components are zero or near zero and the remaining ones are close to each other. We propose a novel class of double spike Dirichlet priors to encode this structure, leading to a Bayesian method for structured weighting that is useful for forecast combination and improving random forests, while enabling uncertainty quantification.
Posterior contraction rates are established to study the large-sample behavior of the posterior distribution. We demonstrate the wide applicability and competitive performance of the proposed methods through extensive simulations and two real data applications using the European Central Bank Survey of Professional Forecasters data set and a data set from the UC Irvine Machine Learning Repository. In the second project, we focus on the false discovery control problem in variable selection for high-dimensional linear models and develop scalable Bayesian estimators that achieve simultaneous false discovery rate and exceedance control. The proposed methods select variables within a sequence of posterior contours centered at a Bayes estimator via constrained optimization, leading to a fast procedure with large-sample guarantees and a finite-sample correction. Extensive numerical studies demonstrate improved false discovery control and robustness over popular alternatives under a wide range of data generation settings. The proposed methods are illustrated by analyzing a Human Immunodeficiency Virus (HIV) data set to detect mutations associated with drug resistance. In the third project, we focus on post-selection inference under best-subset selection criteria, including the commonly used Akaike information criterion (AIC) and Bayesian information criterion (BIC). We characterize the model selection event, which consists of a series of pairwise model comparisons, and derive the conditional distribution of linear estimators for any sample size. Our results elucidate the invalid coverage of conventional confidence intervals and provide a non-asymptotic formulation, based on which we construct post-selection confidence intervals with guaranteed frequentist coverage. Simulation studies confirm the coverage of the proposed confidence intervals in finite-sample settings.
We use a US consumption data set to show how post-selection inference can arrive at different conclusions compared with conventional inference methods.
application/pdf
eng
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
False discovery control
Forecast combination
High-dimensional linear models
Post-selection inference
Uncertainty quantification in high-dimensional models and post-selection procedures
Thesis
2023-01-04