Recent Advances in Bayesian Copula Models for Mixed Data and Quantile Regression

Date
2023-04-13
Abstract

This thesis advances novel Bayesian approaches to joint modeling of mixed data types and to quantile regression. In the first part of this work, we develop methodological and theoretical properties of the Bayesian Gaussian copula, and deploy these models in a variety of applications. Copula models link arbitrary univariate marginal distributions under a multivariate dependence structure to define a valid joint distribution for a random vector. Estimating the joint distribution of a multivariate random vector gives access to a wide range of information, from marginal properties and conditional relationships to multivariate dependence structures. The final portion of this thesis introduces a novel technique for quantile regression that is broadly compatible with any Bayesian predictive model, including copulas. We utilize posterior summarization to estimate coherent and interpretable quantile functions with the added benefit of quantile-specific variable selection.
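As a minimal sketch of the copula construction described above, and not the thesis's Bayesian model, the following Python draws pairs from a bivariate Gaussian copula with fixed correlation and then maps each coordinate through a chosen inverse marginal CDF. The function names, the exponential and Bernoulli margins, and the correlation value are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(n, rho, inv_cdf_x, inv_cdf_y, seed=0):
    """Draw n pairs whose dependence comes from a bivariate Gaussian copula
    with correlation rho and whose margins are set by two inverse CDFs."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    draws = []
    for _ in range(n):
        # Correlated standard normal latent variables.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Map to uniforms, then through each marginal quantile function.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        draws.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return draws


def exponential_inv_cdf(u, rate=1.0):
    """Quantile function of an exponential margin (illustrative choice)."""
    u = min(u, 1.0 - 1e-12)  # guard against log(0)
    return -math.log(1.0 - u) / rate


def bernoulli_inv_cdf(u, p=0.3):
    """Quantile function of a binary margin with P(Y = 1) = p (illustrative)."""
    return 1 if u > 1.0 - p else 0


if __name__ == "__main__":
    pairs = sample_gaussian_copula(5, 0.8, exponential_inv_cdf, bernoulli_inv_cdf)
    for x, y in pairs:
        print(round(x, 3), y)
```

The same latent Gaussian layer supports a continuous and a discrete margin at once, which is the sense in which a copula accommodates mixed data types. In a Bayesian treatment the correlation and the margins would be estimated rather than fixed.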

In the first chapter, we deploy the Gaussian copula towards the generation of privacy-preserving fully synthetic data. Often, the dissemination of data sets containing information on real individuals poses harmful privacy risks. However, the lack of rich, publicly available data hinders policy and decision making, as well as statistics education. Synthetic data are a promising alternative for data sharing: they are simulated from a model estimated on the confidential data, which destroys any one-to-one correspondences between synthetic and real individuals. If the synthetic data are shown to be sufficiently useful and private, they may be disseminated and studied with minimal adverse privacy implications.

In this chapter, we synthesize a data set comprising dozens of sensitive health and academic achievement measurements on nearly 20,000 children from North Carolina; the sensitivity of these measurements precludes public release of the data. In addition, the data set contains mixed continuous, count, ordinal, and nominal data types, which pose substantial modeling challenges. We develop a novel Bayesian Gaussian copula model for synthesis of the North Carolina data based on the Extended Rank-Probit Likelihood (RPL), which modifies existing copula models to additionally handle nominal variables. We demonstrate state-of-the-art utility of synthetic data generated under the RPL copula model, and study the post-hoc privacy implications of synthetic data releases.

In the second chapter, we apply copula models to the imputation of missing values, which are commonplace in modern data analysis. With abundant missing values, it is problematic to conduct a complete case analysis, which proceeds using only observations for which all variables are observed. Thus, imputation is necessary, but it is limited by the ability of the model to jointly predict missing values of mixed data types. Recognizing the broad compatibility of RPL copula models with mixed data types, we develop a novel Bayesian mixture copula for flexible imputation. Most notably, we introduce a technique for marginal distribution estimation, the margin adjustment, which enables automated and consistent estimation of marginal distribution functions in the presence of missing data. Our Bayesian mixture copula demonstrates exceptional performance in simulation, and we apply the model to a subset of variables from the National Health and Nutrition Examination Survey subject to abundant missing data. Our results demonstrate the risks of a complete case analysis, and how a suitable model for imputation can correct these shortcomings.

We conclude with new perspectives on Bayesian quantile regression, which provides a more robust view into how covariates affect the distribution of a response variable. Given any Bayesian predictive model, we view the quantile function as a posterior functional, which enables point estimation through decision theory. Our technique unifies estimation of quantile-specific functions under a single, coherent model, which alleviates issues of quantile crossing. Furthermore, through careful justification of the loss function in our posited decision analysis, we develop quantile-specific variable selection techniques. Thus, this work connects the extensive literature on valid quantile function estimation (i.e., techniques to prevent quantile crossing) with variable selection in the mean regression setting. Extensive simulation highlights the vast improvements of the proposed approach over existing Bayesian and frequentist methods in terms of prediction, inference, and variable selection.
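To illustrate the decision-theoretic view of a quantile, the following Python sketch takes draws from a single predictive distribution and finds the point estimate that minimizes the average check (pinball) loss at each quantile level. This is a generic illustration under assumed inputs, not the thesis's estimator or its variable selection procedure; the function names and the simulated normal draws are hypothetical.

```python
import random


def check_loss(residual, tau):
    """Check (pinball) loss for quantile level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def expected_check_loss(q, draws, tau):
    """Monte Carlo estimate of the expected check loss of the estimate q."""
    return sum(check_loss(y - q, tau) for y in draws) / len(draws)


def predictive_quantile(draws, tau):
    """Point estimate minimizing the average check loss over the draws.

    The empirical check loss is minimized at one of the draws, so the
    search is restricted to those candidate values.
    """
    return min(sorted(draws), key=lambda q: expected_check_loss(q, draws, tau))


if __name__ == "__main__":
    rng = random.Random(1)
    # Stand-in for predictive draws of a response at one covariate value.
    draws = [rng.gauss(2.0, 1.0) for _ in range(500)]
    for tau in (0.1, 0.5, 0.9):
        print(tau, round(predictive_quantile(draws, tau), 3))
```

Because every quantile level is read off the same set of predictive draws, the resulting estimates are ordered in the quantile level, which is the intuition behind avoiding quantile crossing under one coherent model.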

Degree
Doctor of Philosophy
Type
Thesis
Keywords
Synthetic Data, Missing Data, Quantile Regression, Copula
Citation

Feldman, Joseph. "Recent Advances in Bayesian Copula Models for Mixed Data and Quantile Regression." (2023) Diss., Rice University. https://hdl.handle.net/1911/115126.

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.