Browsing by Author "Kowal, Daniel"
Item Bayesian Adaptive and Interpretable Functional Regression Models (2023-11-28) Gao, Yunan; Kowal, Daniel
Scalar-on-function regression (SOFR) is a widely used tool in the medical and behavioral sciences, which elucidates the association between a scalar response and data collected repeatedly across a continuous domain. However, estimating and interpreting SOFR models pose significant challenges due to the high autocorrelation and dimensionality of functional predictors. This work presents novel estimation and inference tools for Bayesian SOFR models. Firstly, we propose a locally adaptive and highly scalable Bayesian SOFR model. By combining a B-spline basis expansion with dynamic shrinkage priors on the regression coefficient function, our model achieves more accurate point estimates and precise uncertainty quantification, particularly when capturing both smooth and rapidly changing features. Secondly, we provide decision analysis tools for Bayesian SOFR models that extract locally constant summaries based on the posterior predictive distribution. These summaries help identify critical windows: regions in the domain of the functional covariates that predict the scalar response. Leveraging the proposed Bayesian SOFR model and decision analysis tools, we investigate the relationship between prenatal daily PM2.5 exposure and standardized 4th-grade reading test scores in a large cohort of North Carolina students. Our findings indicate that prenatal PM2.5 exposure during early and late pregnancy has the most adverse impact on the test scores. Lastly, we extend the proposed Bayesian SOFR model and decision analysis strategy to handle multiple functional covariates, nonlinear relationships, and binary indicator response variables. Using this generalized framework, we explore the effects of prenatal temperature and PM2.5 exposure on birth weight in Michigan.
Our analysis reveals that higher temperature and PM2.5 exposure are associated with lower birth weights, with higher temperature exhibiting a stronger effect than PM2.5. We provide an R package (BaiSOFR) that implements the proposed model and decision analysis strategy, along with a vignette illustrating its application to simulated data.

Item Economic Forecasting with News Headlines and Natural Language Processing (Rice University, 2023) Fuad, Gazi; Kowal, Daniel
Consumer sentiment, which measures how confident individuals feel in the strength of the economy, is a crucial indicator of the overall health of the economy. However, due to the time and costs associated with collecting the survey responses associated with the Index of Consumer Sentiment (ICS), along with the delayed nature of releasing this information, there is motivation to find alternative data sources to the ICS. In this project, we investigated utilizing news headlines as an alternative signal to gauge consumer sentiment in the United States. More specifically, we utilized natural language processing techniques such as latent Dirichlet allocation (LDA) and sentiment analysis to extract quantifiable topics and sentiments from news headlines on the front page of top publications' websites. We subsequently used that information as predictors for the monthly personal saving and labor force participation rates. The topics and sentiments served as exogenous inputs in a Seasonal Autoregressive Integrated Moving Average with eXogenous regressors (SARIMAX) model to predict the actual rates, and as covariates in classification models to predict the direction of rate movement.
Our findings showed that topic-sentiment combinations from news headlines have considerable predictive power in modeling future economic conditions, even when compared with the predictive power of the ICS.

Item Racial residential segregation shapes the relationship between early childhood lead exposure and fourth-grade standardized test scores (National Academy of Sciences, 2022) Bravo, Mercedes A.; Zephyr, Dominique; Kowal, Daniel; Ensor, Katherine; Miranda, Marie Lynn
Racial/ethnic disparities in academic performance may result from a confluence of adverse exposures that arise from structural racism and accrue to specific subpopulations. This study investigates childhood lead exposure, racial residential segregation, and early educational outcomes. Geocoded North Carolina birth data is linked to blood lead surveillance data and fourth-grade standardized test scores (n = 25,699). We constructed a census tract-level measure of racial isolation (RI) of the non-Hispanic Black (NHB) population. We fit generalized additive models of reading and mathematics test scores regressed on individual-level blood lead level (BLL) and neighborhood RI of NHB (RINHB). Models included an interaction term between BLL and RINHB. BLL and RINHB were associated with lower reading scores; among NHB children, an interaction was observed between BLL and RINHB. Reading scores for NHB children with BLLs of 1 to 3 µg/dL were similar across the range of RINHB values. For NHB children with BLLs of 4 µg/dL, reading scores were similar to those of NHB children with BLLs of 1 to 3 µg/dL at lower RINHB values (less racial isolation/segregation). At higher RINHB levels (greater racial isolation/segregation), children with BLLs of 4 µg/dL had lower reading scores than children with BLLs of 1 to 3 µg/dL. This pattern becomes more marked at higher BLLs. Higher BLL was associated with lower mathematics test scores among NHB and non-Hispanic White (NHW) children, but there was no evidence of an interaction.
In conclusion, NHB children with high BLLs residing in high RINHB neighborhoods had worse reading scores.

Item Recent Advances in Bayesian Copula Models for Mixed Data and Quantile Regression (2023-04-13) Feldman, Joseph; Kowal, Daniel; Balakrishnan, Guha
This thesis advances novel Bayesian approaches towards joint modeling of mixed data types and quantile regression. In the first part of this work, we advance methodological and theoretical properties of the Bayesian Gaussian copula, and deploy these models in a variety of applications. Copula models link arbitrary univariate marginal distributions under a multivariate dependence structure to define a valid joint distribution for a random vector. By estimating the joint distribution of a multivariate random vector, we are granted access to a myriad of information, from marginal properties and conditional relationships, to multivariate dependence structures. The final portion of this thesis introduces a novel technique for quantile regression that is broadly compatible with any Bayesian predictive model, including copulas. We utilize posterior summarization to estimate coherent and interpretable quantile functions with the added benefit of quantile-specific variable selection. In the first chapter, we deploy the Gaussian copula towards the generation of privacy-preserving fully synthetic data. Often, the dissemination of data sets containing information on real individuals poses harmful privacy risks. However, the lack of rich, publicly available data hinders policy and decision making, as well as statistics education. Synthetic data are a promising alternative for data sharing: they are simulated from a model estimated on the confidential data, which destroys any one-to-one correspondences between synthetic and real individuals. If the synthetic data are shown to be sufficiently useful and private, they may be disseminated and studied with minimal adverse privacy implications.
In this chapter, we synthesize a data set comprising dozens of sensitive health and academic achievement measurements on nearly 20,000 children from North Carolina; the sensitivity of these measurements precludes public release of the data. In addition, the data set comprises mixed continuous, count, ordinal, and nominal data types, which poses substantial modeling challenges. We develop a novel Bayesian Gaussian copula model for synthesis of the North Carolina data based on the Extended Rank-Probit Likelihood (RPL), which modifies existing copula models to additionally handle nominal variables. We demonstrate state-of-the-art utility of synthetic data synthesized under the RPL copula model, and study the post-hoc privacy implications of synthetic data releases. In the second chapter, we apply copula models towards imputation of missing values, which are commonplace in modern data analysis. With abundant missing values, it is problematic to conduct a complete case analysis, which proceeds using only observations for which all variables are observed. Thus, imputation is necessary, but limited by the ability of the model to jointly predict missing values of mixed data types. Recognizing the broad compatibility of RPL copula models with mixed data types, we develop a novel Bayesian mixture copula for flexible imputation. Most uniquely, we introduce a technique for marginal distribution estimation, the margin adjustment, which enables automated and consistent estimation of marginal distribution functions in the presence of missing data. Our Bayesian mixture copula demonstrates exceptional performance in simulation, and we apply the model to a subset of variables from the National Health and Nutrition Examination Survey subject to abundant missing data. Our results demonstrate the risks of a complete case analysis, and how a suitable model for imputation can correct these shortcomings.
We conclude with new perspectives on Bayesian quantile regression, which provides a more robust view into how covariates affect the distribution of a response variable. Given any Bayesian predictive model, we view the quantile function as a posterior functional, which enables point estimation through decision theory. Our technique unifies estimation of quantile-specific functions under a singular, coherent model, which alleviates issues of quantile crossing. Furthermore, through careful justification of the loss function in our posited decision analysis, we develop quantile-specific variable selection techniques. Thus, this work connects the extensive literature on valid quantile function estimation (i.e., techniques to prevent quantile crossing) with variable selection in the mean regression setting. Extensive simulation highlights the vast improvements of the proposed approach over existing Bayesian and frequentist methods in terms of prediction, inference, and variable selection.

Item Embargo State space models for Bayesian analysis of non-Gaussian time series (2024-07-29) Zito, John; Kowal, Daniel
In this work we develop new state space models and Bayesian inference techniques for generating accurate probabilistic predictions of time series that possess various non-Gaussian properties. In the first chapter, we consider multivariate time series data that include a heterogeneous mix of non-Gaussian distributional features (asymmetry, multimodality, heavy tails, etc.) and data types (continuous and discrete variables). Traditional multivariate time series methods based on convenient parametric families of probability distributions are typically ill-equipped to model this heterogeneity. Copula models provide an appealing alternative, but it is challenging to estimate them in a fully Bayesian way that incorporates uncertainty from all model unobservables, which is crucial for probabilistic time series forecasting.
To meet this challenge, we propose a novel method for posterior approximation in copula time series models, and we apply it to a Gaussian copula built from a dynamic factor model. This framework provides flexible, scalable, and computationally tractable Bayesian inference for both the dependence structure and the heterogeneous marginal behavior of a multivariate time series. We validate our posterior approximation by providing model-trusting posterior consistency theory, and we provide simulation evidence that consistency is still achieved under model misspecification. In a diverse array of forecast comparisons on real and simulated data, we show that our proposed approach provides accurate point, interval, and density forecasts compared to a basket of popular alternatives. Taken together, these results demonstrate that our proposed method is a versatile, general-purpose utility for multivariate time series forecasting that works well across a range of applications with minimal user-intensive tuning. In the second chapter, we consider time series on the unit $n$-sphere, which arise in directional statistics, compositional data analysis, and many scientific fields. There are few models for such data, and the ones that exist suffer from several limitations: they are often computationally challenging to fit, many of them apply only to the circular case of $n=2$, and they are usually based on families of distributions that are not flexible enough to capture the complexities observed in real data. Furthermore, there is little work on Bayesian methods for spherical time series. To address these shortcomings, we propose a state space model based on the projected normal distribution that can be applied to spherical time series of arbitrary dimension.
We describe how to perform fully Bayesian offline inference for this model using a simple and efficient Gibbs sampling algorithm, and we develop a Rao-Blackwellized particle filter to perform online inference for streaming data. In analyses of wind direction and energy market time series, we show that the proposed model outperforms competitors in terms of point, set, and density forecasting. In the last chapter, we develop new methods for data on the Stiefel manifold. Such data arise again in directional statistics, where we measure the orientation of an object in space and record this information in the form of an orthonormal matrix. There is a rich literature on distributional theory and inference for these data from a classical point of view, but there has been very little Bayesian work until recently, and almost no discussion of time series. To fill these gaps, we describe how to use data augmentation to perform Bayesian inference for a class of models on the Stiefel manifold that are based on the matrix normal distribution. The chapter culminates in the first fully Bayesian method for time series on the Stiefel manifold. We propose a new state space model for this purpose, and show how to access its full posterior using a Markov chain Monte Carlo sampler.