R-3 Repository :: Browsing by Author "Feldman, Joseph"



Bayesian data synthesis and the utility-risk trade-off for mixed epidemiological data
(Project Euclid, 2022) Feldman, Joseph; Kowal, Daniel R.
Much of the microdata used for epidemiological studies contains sensitive measurements on real individuals. As a result, such microdata cannot be published because of privacy concerns, and without public access to these data, any statistical analyses originally published on them are nearly impossible to reproduce. To promote the dissemination of key datasets for analysis without jeopardizing the privacy of individuals, we introduce a cohesive Bayesian framework for the generation of fully synthetic high-dimensional microdatasets of mixed categorical, binary, count, and continuous variables. This process centers around a joint Bayesian model that is simultaneously compatible with all of these data types, enabling the creation of mixed synthetic datasets through posterior predictive sampling. Furthermore, a focal point of epidemiological data analysis is the study of conditional relationships between various exposures and key outcome variables through regression analysis. We design a modified data synthesis strategy to target and preserve these conditional relationships, including both nonlinearities and interactions. The proposed techniques are deployed to create a synthetic version of a confidential dataset containing dozens of health, cognitive, and social measurements on nearly 20,000 North Carolina children.
Recent Advances in Bayesian Copula Models for Mixed Data and Quantile Regression
(2023-04-13) Feldman, Joseph; Kowal, Daniel; Balakrishnan, Guha
This thesis advances novel Bayesian approaches towards joint modeling of mixed data types and quantile regression. In the first part of this work, we advance methodological and theoretical properties of the Bayesian Gaussian copula, and deploy these models in a variety of applications. Copula models link arbitrary univariate marginal distributions under a multivariate dependence structure to define a valid joint distribution for a random vector. By estimating the joint distribution of a multivariate random vector, we are granted access to a myriad of information, from marginal properties and conditional relationships to multivariate dependence structures. The final portion of this thesis introduces a novel technique for quantile regression that is broadly compatible with any Bayesian predictive model, including copulas. We utilize posterior summarization to estimate coherent and interpretable quantile functions with the added benefit of quantile-specific variable selection. In the first chapter, we deploy the Gaussian copula towards the generation of privacy-preserving fully synthetic data. Often, the dissemination of data sets containing information on real individuals poses harmful privacy risks. However, the lack of rich, publicly available data hinders policy and decision making, as well as statistics education. Synthetic data are a promising alternative for data sharing: they are simulated from a model estimated on the confidential data, which destroys any one-to-one correspondences between synthetic and real individuals. If the synthetic data are shown to be sufficiently useful and private, they may be disseminated and studied with minimal adverse privacy implications. In this chapter, we synthesize a data set comprising dozens of sensitive health and academic achievement measurements on nearly 20,000 children from North Carolina; the sensitivity of these measurements precludes public release of the data set. In addition, the data set comprises mixed continuous, count, ordinal, and nominal data types, which poses substantial modeling challenges. We develop a novel Bayesian Gaussian copula model for synthesis of the North Carolina data based on the Extended Rank-Probit Likelihood (RPL), which modifies existing copula models to additionally handle nominal variables. We demonstrate state-of-the-art utility of synthetic data synthesized under the RPL copula model, and study the post-hoc privacy implications of synthetic data releases. In the second chapter, we apply copula models towards imputation of missing values, which are commonplace in modern data analysis. With abundant missing values, it is problematic to conduct a complete case analysis, which proceeds using only observations for which all variables are observed. Thus, imputation is necessary, but limited by the ability of the model to jointly predict missing values of mixed data types. Recognizing the broad compatibility of RPL copula models with mixed data types, we develop a novel Bayesian mixture copula for flexible imputation. Most notably, we introduce a technique for marginal distribution estimation, the margin adjustment, which enables automated and consistent estimation of marginal distribution functions in the presence of missing data. Our Bayesian mixture copula demonstrates exceptional performance in simulation, and we apply the model to a subset of variables from the National Health and Nutrition Examination Survey subject to abundant missing data. Our results demonstrate the risks of a complete case analysis, and how a suitable model for imputation can correct these shortcomings. We conclude with new perspectives on Bayesian quantile regression, which provides a more robust view into how covariates affect the distribution of a response variable. Given any Bayesian predictive model, we view the quantile function as a posterior functional, which enables point estimation through decision theory. Our technique unifies estimation of quantile-specific functions under a singular, coherent model, which alleviates issues of quantile crossing. Furthermore, through careful justification of the loss function in our posited decision analysis, we develop quantile-specific variable selection techniques. Thus, this work connects the extensive literature on valid quantile function estimation (i.e., techniques to prevent quantile crossing) with variable selection in the mean regression setting. Extensive simulation highlights the vast improvements of the proposed approach over existing Bayesian and frequentist methods in terms of prediction, inference, and variable selection.
Spatial Variability in Relationships between Early Childhood Lead Exposure and Standardized Test Scores in Fourth Grade North Carolina Public School Students (2013–2016)
(National Institute of Environmental Health Sciences, National Institutes of Health, 2024) Bravo, Mercedes A.; Kowal, Daniel R.; Zephyr, Dominique; Feldman, Joseph; Ensor, Katherine; Miranda, Marie Lynn
Background: Exposure to lead during childhood is detrimental to children’s health. The extent to which the association between lead exposure and elementary school academic outcomes varies across geography is not known. Objective: Estimate associations between blood lead levels (BLLs) and fourth grade standardized test scores in reading and mathematics in North Carolina using models that allow associations between BLL and test scores to vary spatially across communities. Methods: We link geocoded, individual-level, standardized test score data for North Carolina public school students in fourth grade (2013–2016) with detailed birth records and blood lead testing data retrieved from the North Carolina childhood blood lead state registry on samples typically collected at 1–6 y of age. BLLs were categorized as: 1 μg/dL (reference), 2 μg/dL, 3–4 μg/dL, and ≥5 μg/dL. We then fit spatially varying coefficient models that incorporate information sharing (smoothness) across neighboring communities via a Gaussian Markov random field to provide a global estimate of the association between BLL and test scores, as well as census tract–specific estimates (i.e., spatial coefficients). Models adjusted for maternal- and child-level covariates and were fit separately for reading and math. Results: The average BLL across the 91,706 individuals in the analysis dataset was 2.84 μg/dL. Individuals were distributed across 2,002 (out of 2,195) census tracts in North Carolina. In models adjusting for child sex, birth weight percentile for gestational age, and Medicaid participation as well as maternal race/ethnicity, educational attainment, marital status, and tobacco use, BLLs of 2 μg/dL, 3–4 μg/dL, and ≥5 μg/dL were associated with overall lower reading test scores of −0.28 [95% confidence interval (CI): −0.43, −0.12], −0.53 (−0.69, −0.38), and −0.79 (−0.99, −0.604), respectively. For BLLs of 1 μg/dL, 2 μg/dL, 3–4 μg/dL, and ≥5 μg/dL, spatial coefficients (that is, tract-specific adjustments in reading test score relative to the “global” coefficient) ranged from −9.70 to 2.52, −3.19 to 3.90, −11.14 to 7.85, and −4.73 to 4.33, respectively. Results for mathematics were similar to those for reading. Conclusion: The association between lead exposure and reading and mathematics test scores exhibits considerable heterogeneity across North Carolina communities. These results emphasize the need for prevention and mitigation efforts with respect to lead exposures everywhere, with special attention to locations where the cognitive impact is elevated. https://doi.org/10.1289/EHP13898
