Browsing by Author "Kowal, Daniel"
Item: Bayesian Adaptive and Interpretable Functional Regression Models (2023-11-28)
Gao, Yunan; Kowal, Daniel

Scalar-on-function regression (SOFR) is a widely used tool in the medical and behavioral sciences, which elucidates the association between a scalar response and data collected repeatedly across a continuous domain. However, estimating and interpreting SOFR models pose significant challenges due to the high autocorrelation and dimensionality of functional predictors. This work presents novel estimation and inference tools for Bayesian SOFR models. First, we propose a locally adaptive and highly scalable Bayesian SOFR model. By combining a B-spline basis expansion with dynamic shrinkage priors on the regression coefficient function, our model achieves more accurate point estimates and more precise uncertainty quantification, particularly when capturing both smooth and rapidly changing features. Second, we provide decision analysis tools for Bayesian SOFR models that extract locally constant summaries based on the posterior predictive distribution. These summaries help identify critical windows: regions in the domain of the functional covariates that predict the scalar response. Leveraging the proposed Bayesian SOFR model and decision analysis tools, we investigate the relationship between prenatal daily PM2.5 exposure and standardized 4th-grade reading test scores in a large cohort of North Carolina students. Our findings indicate that prenatal PM2.5 exposure during early and late pregnancy has the most adverse impact on test scores. Lastly, we extend the proposed Bayesian SOFR model and decision analysis strategy to handle multiple functional covariates, nonlinear relationships, and binary indicator response variables. Using this generalized framework, we explore the effects of prenatal temperature and PM2.5 exposure on birth weight in Michigan.
Our analysis reveals that higher temperature and PM2.5 exposure are associated with lower birth weights, with higher temperature exhibiting a stronger effect than PM2.5. We provide an R package (BaiSOFR) that implements the proposed model and decision analysis strategy, along with a vignette illustrating its application to simulated data.

Item: Economic Forecasting with News Headlines and Natural Language Processing (Rice University, 2023)
Fuad, Gazi; Kowal, Daniel

Consumer sentiment, which measures how confident individuals feel in the strength of the economy, is a crucial indicator of the overall health of the economy. However, due to the time and cost of collecting the survey responses behind the Index of Consumer Sentiment (ICS), along with the delay in releasing this information, there is motivation to find alternative data sources to the ICS. In this project, we investigated news headlines as an alternative signal to gauge consumer sentiment in the United States. More specifically, we used natural language processing techniques such as latent Dirichlet allocation (LDA) and sentiment analysis to extract quantifiable topics and sentiments from news headlines on the front pages of top publications' websites. We subsequently used that information as predictors for the monthly personal saving and labor force participation rates. The topics and sentiments served as exogenous inputs in a Seasonal Autoregressive Integrated Moving Average with eXogenous regressors (SARIMAX) model to predict the actual rates, and as covariates in classification models to predict the direction of rate movement.
Our findings showed that topic-sentiment combinations from news headlines have considerable predictive power in modeling future economic conditions, even when compared with the predictive power of the ICS.

Item: Racial residential segregation shapes the relationship between early childhood lead exposure and fourth-grade standardized test scores (National Academy of Sciences, 2022)
Bravo, Mercedes A.; Zephyr, Dominique; Kowal, Daniel; Ensor, Katherine; Miranda, Marie Lynn

Racial/ethnic disparities in academic performance may result from a confluence of adverse exposures that arise from structural racism and accrue to specific subpopulations. This study investigates childhood lead exposure, racial residential segregation, and early educational outcomes. Geocoded North Carolina birth data is linked to blood lead surveillance data and fourth-grade standardized test scores (n = 25,699). We constructed a census tract-level measure of racial isolation (RI) of the non-Hispanic Black (NHB) population. We fit generalized additive models of reading and mathematics test scores regressed on individual-level blood lead level (BLL) and neighborhood RI of NHB (RINHB). Models included an interaction term between BLL and RINHB. BLL and RINHB were associated with lower reading scores; among NHB children, an interaction was observed between BLL and RINHB. Reading scores for NHB children with BLLs of 1 to 3 µg/dL were similar across the range of RINHB values. For NHB children with BLLs of 4 µg/dL, reading scores were similar to those of NHB children with BLLs of 1 to 3 µg/dL at lower RINHB values (less racial isolation/segregation). At higher RINHB levels (greater racial isolation/segregation), children with BLLs of 4 µg/dL had lower reading scores than children with BLLs of 1 to 3 µg/dL. This pattern becomes more marked at higher BLLs. Higher BLL was associated with lower mathematics test scores among NHB and non-Hispanic White (NHW) children, but there was no evidence of an interaction.
In conclusion, NHB children with high BLLs residing in high RINHB neighborhoods had worse reading scores.

Item: Recent Advances in Bayesian Copula Models for Mixed Data and Quantile Regression (2023-04-13)
Feldman, Joseph; Kowal, Daniel; Balakrishnan, Guha

This thesis advances novel Bayesian approaches towards joint modeling of mixed data types and quantile regression. In the first part of this work, we advance methodological and theoretical properties of the Bayesian Gaussian copula, and deploy these models in a variety of applications. Copula models link arbitrary univariate marginal distributions under a multivariate dependence structure to define a valid joint distribution for a random vector. By estimating the joint distribution of a multivariate random vector, we are granted access to a myriad of information, from marginal properties and conditional relationships, to multivariate dependence structures. The final portion of this thesis introduces a novel technique for quantile regression that is broadly compatible with any Bayesian predictive model, including copulas. We utilize posterior summarization to estimate coherent and interpretable quantile functions with the added benefit of quantile-specific variable selection. In the first chapter, we deploy the Gaussian copula towards the generation of privacy-preserving fully synthetic data. Often, the dissemination of data sets containing information on real individuals poses harmful privacy risks. However, the lack of rich, publicly available data hinders policy and decision making, as well as statistics education. Synthetic data are a promising alternative for data sharing: they are simulated from a model estimated on the confidential data, which destroys any one-to-one correspondences between synthetic and real individuals. If the synthetic data are shown to be sufficiently useful and private, they may be disseminated and studied with minimal adverse privacy implications.
In this chapter, we synthesize a data set comprising dozens of sensitive health and academic achievement measurements on nearly 20,000 children from North Carolina, the sensitive nature of which precludes its public release. In addition, the data set comprises mixed continuous, count, ordinal, and nominal data types, which poses substantial modeling challenges. We develop a novel Bayesian Gaussian copula model for synthesis of the North Carolina data based on the Extended Rank-Probit Likelihood (RPL), which modifies existing copula models to additionally handle nominal variables. We demonstrate state-of-the-art utility of synthetic data generated under the RPL copula model, and study the post-hoc privacy implications of synthetic data releases. In the second chapter, we apply copula models towards imputation of missing values, which are commonplace in modern data analysis. With abundant missing values, it is problematic to conduct a complete case analysis, which proceeds using only observations for which all variables are observed. Thus, imputation is necessary, but limited by the ability of the model to jointly predict missing values of mixed data types. Recognizing the broad compatibility of RPL copula models with mixed data types, we develop a novel Bayesian mixture copula for flexible imputation. Most notably, we introduce a technique for marginal distribution estimation, the margin adjustment, which enables automated and consistent estimation of marginal distribution functions in the presence of missing data. Our Bayesian mixture copula demonstrates exceptional performance in simulation, and we apply the model to a subset of variables from the National Health and Nutrition Examination Survey subject to abundant missing data. Our results demonstrate the risks of a complete case analysis, and how a suitable model for imputation can correct these shortcomings.
We conclude with new perspectives on Bayesian quantile regression, which provides a more robust view into how covariates affect the distribution of a response variable. Given any Bayesian predictive model, we view the quantile function as a posterior functional, which enables point estimation through decision theory. Our technique unifies estimation of quantile-specific functions under a single, coherent model, which alleviates issues of quantile crossing. Furthermore, through careful justification of the loss function in our posited decision analysis, we develop quantile-specific variable selection techniques. Thus, this work connects the extensive literature on valid quantile function estimation (i.e., techniques to prevent quantile crossing) with variable selection in the mean regression setting. Extensive simulation highlights the vast improvements of the proposed approach over existing Bayesian and frequentist methods in terms of prediction, inference, and variable selection.
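
As a small illustration of the decision-theoretic view of quantiles mentioned in the abstract above, the sketch below uses the standard fact that the value minimizing expected check (pinball) loss under a distribution is that distribution's quantile. It is not the thesis's method: the function names `check_loss` and `quantile_by_decision` are hypothetical, and the normal draws merely stand in for posterior predictive samples from some fitted Bayesian model.

```python
import numpy as np

def check_loss(y, q, tau):
    """Check (pinball) loss of predicting q for outcomes y at quantile level tau."""
    u = y - q
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

def quantile_by_decision(draws, tau):
    """Pick the candidate (among the draws) with the smallest average check loss."""
    candidates = np.sort(draws)
    # Rows index candidates, columns index draws; average the loss over draws.
    losses = check_loss(draws[None, :], candidates[:, None], tau).mean(axis=1)
    return candidates[np.argmin(losses)]

# Assumption: these draws are a stand-in for posterior predictive samples.
rng = np.random.default_rng(0)
draws = rng.normal(loc=1.0, scale=2.0, size=2000)

for tau in (0.1, 0.5, 0.9):
    est = quantile_by_decision(draws, tau)
    # The loss-minimizing action should sit next to the empirical quantile.
    print(tau, est, np.quantile(draws, tau))
```

Because every quantile level here is computed from the same set of draws, the resulting estimates are automatically ordered in tau, which is the intuition behind deriving all quantiles from one coherent predictive model.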