Browsing by Author "Guerra, Rudy"
Now showing 1 - 12 of 12
Item: A Bayesian hierarchical model for detecting associations between haplotypes and disease using unphased SNPs (2008). Fox, Garrett Reed; Guerra, Rudy.
This thesis addresses using haplotypes to detect disease-predisposing chromosomal regions based on a Bayesian hierarchical model for case-control data. By utilizing the Stochastic Search Variable Selection (SSVS) procedure of George and McCulloch (1997), the number of parameters is not constrained by the sample size, as it is in frequentist methods. Haplotype information is used in the form of estimated haplotype frequencies, which enter the model as if they were the true population frequencies. A Bayesian hierarchical probit model was developed by estimating the distribution of haplotype pairs for an individual from these estimated population frequencies and using SSVS to make decisions about model selection. To date, Bayesian models for haplotype-based case-control data assume either that the haplotypes are known, or that haplotypes can be clustered such that every haplotype within a cluster has the same effect on disease status. A simulation was performed analyzing the testing properties of this Bayesian model and comparing it to a popular frequentist method (Schaid, 2002). Both real genotype data from the Dallas Heart Study (DHS) and simulated data were used to study the operating characteristics of the new model. The Bayesian method is shown to have higher power than Schaid's frequentist method when there are a limited number of common haplotypes in a region, a situation that appears to be common (Gabriel, 2002). An approach based on the maximum of chi-squared statistics at each marker locus performed surprisingly well against both haplotype methods in various cases. These simulations contribute to the ongoing debate on the efficacy of haplotype methods. The most surprising result was the ability of the genotype methods to outperform the haplotype methods in various instances where there were cis-acting interactions. The Bayesian haplotype method performed better in comparison when dealing with low penetrance in highly conserved blocks. Additionally, a set of simulations was based on a number of genes from the DHS data set with multiple haplotype block regions. This demonstrated the similarities of the haplotype methods and the added flexibility gained from analyzing posterior distributions. We also demonstrate that interactions between loci in separate blocks can be detected without including interaction terms in the regression model. Future work should focus on more efficient methods of detecting these and other complex interactions.
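As an editorial illustration of the SSVS idea referenced in this abstract (George and McCulloch), the sketch below shows the standard spike-and-slab indicator update for regression coefficients within a Gibbs sampler. It is not code from the thesis; the hyperparameters tau, c, and prior_inclusion are hypothetical choices.

```python
# Minimal sketch of the SSVS indicator update (George & McCulloch style),
# shown only to illustrate the spike-and-slab idea; not code from the thesis.
# tau, c, and prior_inclusion are hypothetical hyperparameter choices.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
tau, c, prior_inclusion = 0.05, 10.0, 0.5   # spike sd, slab multiplier, P(gamma_j = 1)

def sample_inclusion_indicators(beta):
    """Given current effects beta_j, draw gamma_j from its full conditional:
    spike N(0, tau^2) if excluded, slab N(0, (c*tau)^2) if included."""
    spike = (1 - prior_inclusion) * norm.pdf(beta, 0.0, tau)
    slab = prior_inclusion * norm.pdf(beta, 0.0, c * tau)
    p_include = slab / (spike + slab)
    return rng.random(beta.shape) < p_include

beta_current = np.array([0.02, -0.9, 0.4, 0.001])   # toy current draws of effects
print(sample_inclusion_indicators(beta_current))     # large effects favour the slab
```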
Item: Clustering time-course gene-expression array data (2008). Gershman, Jason Andrew; Guerra, Rudy.
This thesis examines methods used to cluster time-course gene expression array data. In the past decade, various model-based methods have been published and advocated for clustering this type of data in place of classic non-parametric techniques like K-means and hierarchical clustering. On simulated data, where the variance between clusters is large, I show that the model-based MCLUST outperforms model-based SSClust and non-model-based K-means clustering. I also show that the number of genes or the number of clusters has no significant effect on the performance of these model-based clustering techniques. On two real data sets, where the variance between clusters is smaller, I show that model-based SSClust outperforms both MCLUST and K-means clustering. Since the "truth" is often not known for real data sets, I use the clustered data as "truth" and then perturb the data by adding pointwise noise before clustering the noisy data. Throughout my analysis of real and simulated expression data, I use the misclassification rate and the overall success rate as measures of success of the clustering algorithm. Overall, the model-based methods appear to cluster the data better than the non-model-based methods. Later, I examine the role of gene ontology (GO) and the use of GO data in clustering gene expression data. I find that clustering expression data using a synthesis of gene expression and gene ontology not only provides clustering with biological meaning but also clusters the data well. I also introduce an algorithm for clustering expression profiles on both gene expression and gene ontology data when some of the genes are missing ontology data. Unlike methods that ignore the missing data or lump it all into a miscellaneous cluster, I use classification and inferential techniques to cluster using all of the available data, and this method shows promising results. I also examine which ontology, among molecular function, biological process, and cellular component, is best for clustering expression data. This analysis shows that biological process is the preferred ontology for clustering expression data.
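To illustrate the kind of comparison this thesis describes between model-based and non-model-based clustering, the following sketch clusters simulated time-course profiles with a Gaussian mixture model (used here only as a stand-in for MCLUST; SSClust is not shown) and with K-means, then scores agreement with the simulated truth. All data and parameters are invented.

```python
# A minimal illustration of model-based vs. non-model-based clustering of
# simulated time-course profiles. GaussianMixture stands in for MCLUST here;
# it is not the MCLUST or SSClust software used in the thesis.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
timepoints, genes_per_cluster = 8, 50
templates = [np.sin(np.linspace(0, np.pi, timepoints)),   # three toy expression profiles
             np.linspace(1, -1, timepoints),
             np.zeros(timepoints)]
X = np.vstack([t + rng.normal(0, 0.3, (genes_per_cluster, timepoints)) for t in templates])
truth = np.repeat(np.arange(3), genes_per_cluster)

gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("GMM ARI:   ", adjusted_rand_score(truth, gmm_labels))
print("KMeans ARI:", adjusted_rand_score(truth, km_labels))
```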
Item: Comparative Genomics of Cephalochordates (2015-04-23). Yue, Jiaxing; Kohn, Michael H.; Nakhleh, Luay K.; Shamoo, Yousif; Guerra, Rudy; Putnam, Nicholas H.
Cephalochordates, commonly known as lancelets or amphioxus, represent an ancient chordate lineage falling at the boundary between invertebrates and vertebrates. They are considered the best living proxy for the common ancestor of all chordate animals and hold the key to understanding chordate evolution. Despite such great importance, current studies on cephalochordates are generally limited to the Branchiostoma genus, leaving the other two genera, Asymmetron and Epigonichthys, largely unexplored. In this dissertation, I set out to fill this gap by developing an array of genomic resources for the Bahama cephalochordate, Asymmetron lucayanum, using both RNA-Seq and whole-genome shotgun (WGS) sequencing. The transcriptome and genome of this representative cephalochordate species were assembled and characterized via state-of-the-art comparative genomics approaches. By comparing its transcriptome and genome sequences with those of a distantly related cephalochordate species, Branchiostoma floridae, as well as with several representative vertebrate species, many aspects of their genome biology were illuminated, including lineage-specific molecular evolution rates, fast-evolving genes, evolutionary time frames, conserved non-coding elements, and germline-related genes. The raw genomic resources, technical pipelines, and biological results and insights generated by this dissertation work will benefit the whole cephalochordate research community by providing a powerful guide for formulating new hypotheses and designing new experiments towards a better understanding of the biology and evolution of cephalochordates, as well as the evolutionary transition from invertebrates to vertebrates.

Item: Computing diversity in undergraduate admissions decisions (2009). Chatman, Jamie; Guerra, Rudy.
The Supreme Court decision in the University of Michigan case in 2003 ruled the university's admissions procedures unconstitutional, finding that they gave minority applicants an unfair advantage in acceptance. The ruling stated that race may still be used in admissions decisions to achieve diversity, but that race could not be used to give applicants preferential treatment in the admissions process. Motivated by this case, a researcher, Juan Gilbert, developed a computer-based clustering method to aid admissions committees in choosing diverse entering classes. This method was evaluated using undergraduate admissions data sets from two public universities. Gilbert's method suggested diverse entering classes but did not select well based on merit. An improved method is introduced that maintains the academic characteristics of the university through classification, while suggesting diverse entering classes that are more academically similar to those actually accepted.

Item: Haplotype block and genetic association (2006). Yu, Zhaoxia; Guerra, Rudy.
The recently identified block-like structure in the human genome (Daly et al. 2001; Patil et al. 2001) has attracted much attention since each haplotype block contains limited sequence variation, which can reduce the complexity of genetic mapping studies. This dissertation focuses on estimating haplotype block structures and their application to genetic mapping using single nucleotide polymorphisms (SNPs) from unrelated individuals. Among other issues, the traditional single-marker association study leads to the problem of multiple testing, which is still not well understood in the context of genome-wide association studies. The haplotype-based approach is one way to lessen certain problems caused by multiple testing. There is also evidence that haplotype-based tests have higher statistical power. We first propose a novel approach to estimate haplotype blocks based on pairwise linkage disequilibrium (LD). Application to simulated data shows that our new approach has higher power than several existing methods in identifying haplotype blocks. We also examine the impact of marker density and different tagging strategies on the estimation of haplotype blocks. We introduce a new statistic to measure the difference between two block partitions. Applying the new statistic to real and simulated data, we show that a higher marker density is needed than previously expected in order to recover the true block structure over a given region. Finally, we analyzed a real SNP data set. A comparison of the haplotype-SNP based method to the more traditional single-SNP based method shows that the two methods tend to agree more when haplotype block sizes are small. On the other hand, the haplotype-SNP based approach does not always have higher power than the single-SNP based study, as is also supported by theoretical considerations. Indeed, long haplotype blocks, where the LD structure might be very complex, can lead to inferior power compared to single-SNP approaches. In practice, it is recommended that single-SNP analyses be run routinely, especially in the presence of moderate to long blocks.
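Pairwise LD is the building block of the block-estimation approach described in the abstract above. The sketch below computes the standard r-squared LD statistic between biallelic SNPs from phased haplotypes; it is not the dissertation's block-partition algorithm, and the toy haplotype matrix is invented.

```python
# Pairwise LD (r^2) between biallelic SNPs from phased haplotypes, the kind of
# statistic that LD-based block definitions start from. This is only a sketch;
# it is not the block-partition algorithm proposed in the dissertation.
import numpy as np

def pairwise_r2(haplotypes):
    """haplotypes: (n_haplotypes, n_snps) array of 0/1 alleles."""
    p = haplotypes.mean(axis=0)                       # allele frequencies
    n_snps = haplotypes.shape[1]
    r2 = np.ones((n_snps, n_snps))
    for i in range(n_snps):
        for j in range(i + 1, n_snps):
            p_ab = np.mean(haplotypes[:, i] * haplotypes[:, j])  # joint allele frequency
            d = p_ab - p[i] * p[j]                               # disequilibrium D
            denom = p[i] * (1 - p[i]) * p[j] * (1 - p[j])
            r2[i, j] = r2[j, i] = 0.0 if denom == 0 else d**2 / denom
    return r2

toy = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0], [1, 1, 1]])  # 4 haplotypes, 3 SNPs
print(pairwise_r2(toy).round(2))
```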
Item: Impact of hypothermia on implementation of CPAP for neonatal respiratory distress syndrome in a low-resource setting (Public Library of Science, 2018). Carns, Jennifer; Kawaza, Kondwani; Quinn, M.K.; Miao, Yinsen; Guerra, Rudy; Molyneux, Elizabeth; Oden, Maria; Richards-Kortum, Rebecca. Bioengineering; Statistics.
Background: Neonatal hypothermia is widely associated with increased risks of morbidity and mortality, but remains a pervasive global problem. No studies have examined the impact of hypothermia on outcomes for preterm infants treated with CPAP for respiratory distress syndrome (RDS). Methods: This retrospective analysis assessed the impact of hypothermia on outcomes of 65 neonates diagnosed with RDS and treated with either nasal oxygen (N = 17) or CPAP (N = 48) in a low-resource setting. A classification tree approach was used to develop a model predicting survival for subjects diagnosed with RDS. Findings: Survival to discharge was accurately predicted based on three variables: mean temperature, treatment modality, and mean respiratory rate. None of the 23 neonates with a mean temperature during treatment below 35.8°C survived to discharge, regardless of treatment modality. Among neonates with a mean temperature exceeding 35.8°C, the survival rate was 100% for the 31 neonates treated with CPAP and 36.4% for the 11 neonates treated with nasal oxygen (p<0.001). For neonates treated with CPAP, outcomes were poor if more than 50% of measured temperatures indicated hypothermia (5.6% survival). In contrast, all 30 neonates treated with CPAP and with more than 50% of temperature measurements above 35.8°C survived to discharge, regardless of initial temperature. Conclusion: The results of our study suggest that successful implementation of CPAP to treat RDS in low-resource settings will require aggressive action to prevent persistent hypothermia. However, our results show that even babies who are initially cold can do well on CPAP with proper management of hypothermia.
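The classification-tree model described in the Findings can be illustrated with a small sketch on synthetic data: a decision tree on mean temperature, treatment modality, and mean respiratory rate. The data below are invented, and the fitted thresholds will not reproduce the study's 35.8°C split.

```python
# A sketch of the classification-tree idea on synthetic data: predicting survival
# from mean temperature, treatment modality, and mean respiratory rate. The data
# here are invented and will not reproduce the study's 35.8 °C split.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "mean_temp": rng.normal(36.0, 0.8, n),
    "cpap": rng.integers(0, 2, n),                 # 1 = CPAP, 0 = nasal oxygen
    "mean_resp_rate": rng.normal(60, 10, n),
})
# Toy outcome loosely mimicking the reported pattern: warm plus CPAP does best.
survived = ((df.mean_temp > 35.8) & ((df.cpap == 1) | (rng.random(n) < 0.35))).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(df, survived)
print(export_text(tree, feature_names=list(df.columns)))
```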
Item: Incorporating annotation data in quantitative trait loci mapping with mRNA transcripts (2009). Christian, James Blair; Guerra, Rudy.
Microarrays allow measurements of the quantity of every mRNA transcript in a subject and of the particular versions of its genes. Understanding the relationship between a particular genetic location and its expression is fundamental to elucidating the relationships among genes, other genes' transcripts, and the proteins translated from those transcripts. Currently, few statistical labs have developed models that use all available biological information. This research helps develop the knowledge base used by the 21st century's pioneering researchers in oncology, metabolic engineering, and pharmacogenetics. To strengthen the available models, I introduced a covariance matrix based on biological distance. Using simulated data, I examined the incorporation of biological distance in statistical genetics, specifically into expression quantitative trait loci mapping. I used receiver operating characteristic curves to compare these approaches and generated recommendations for when it is advantageous to include annotation information in gene mapping. The greatest benefit arises in pleiotropic relationships where each transcript has low heritability, although using excessively noisy annotations is disadvantageous. These tools fill a small part of the gap in our understanding of the complex dynamical system that is molecular or systems biology.
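One plausible, purely illustrative reading of the idea above: turn an annotation-derived distance matrix among transcripts into a covariance matrix via an exponential kernel, and compare mapping approaches with an ROC summary. The distance matrix D, the length-scale rho, and the score vectors below are all invented; this is not the model developed in the thesis.

```python
# Purely illustrative: one way to turn a biological (annotation-based) distance
# matrix among transcripts into a covariance matrix, and to compare detection
# scores with an ROC summary. This is not the model developed in the thesis.
import numpy as np
from sklearn.metrics import roc_auc_score

D = np.array([[0.0, 0.2, 0.9],            # hypothetical annotation distances
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]])
rho = 0.5                                  # hypothetical length-scale
K = np.exp(-D / rho)                       # small distances map to covariance near 1
print(np.round(K, 2))

truth = [1, 1, 0, 0, 1, 0]                 # toy indicator of true eQTL signals
scores_with_annotation = [0.9, 0.7, 0.2, 0.4, 0.8, 0.1]
scores_without = [0.6, 0.5, 0.4, 0.7, 0.55, 0.3]
print("AUC with annotation:   ", roc_auc_score(truth, scores_with_annotation))
print("AUC without annotation:", roc_auc_score(truth, scores_without))
```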
Item: Proteomic analyses reveal distinct chromatin-associated and soluble transcription factor complexes (EMBO, 2015). Li, Xu; Wang, Wenqi; Wang, Jiadong; Malovannaya, Anna; Xi, Yuanxin; Li, Wei; Guerra, Rudy; Hawke, David H.; Qin, Jun; Chen, Junjie.
Current knowledge of how transcription factors (TFs), the ultimate targets and executors of cellular signalling pathways, are regulated by protein–protein interactions remains limited. Here, we performed proteomics analyses of soluble and chromatin-associated complexes of 56 TFs, including the targets of many signalling pathways involved in development and cancer, and 37 members of the Forkhead box (FOX) TF family. Using tandem affinity purification followed by mass spectrometry (TAP/MS), we performed 214 purifications and identified 2,156 high-confidence protein–protein interactions. We found that most TFs form very distinct protein complexes on and off chromatin. Using this data set, we categorized transcription-related and unrelated regulators of general or specific TFs. Our study offers a valuable resource of protein–protein interaction networks for a large number of TFs and underscores the general principle that TFs form distinct location-specific protein complexes that are associated with the different regulation and diverse functions of these TFs.

Item: Robust CT ventilation from the integral formulation of the Jacobian (Wiley, 2019). Castillo, Edward; Castillo, Richard; Vinogradskiy, Yevgeniy; Dougherty, Michele; Solis, David; Myziuk, Nicholas; Thompson, Andrew; Guerra, Rudy; Nair, Girish; Guerrero, Thomas.
Computed tomography (CT) derived ventilation algorithms estimate the apparent voxel volume changes within an inhale/exhale CT image pair. Transformation-based methods compute these estimates solely from the spatial transformation acquired by applying a deformable image registration (DIR) algorithm to the image pair. However, approaches based on finite difference approximations of the transformation's Jacobian have been shown to be numerically unstable. As a result, transformation-based CT ventilation is poorly reproducible with respect to both DIR algorithm and CT acquisition method. PURPOSE: We introduce a novel Integrated Jacobian Formulation (IJF) method for estimating voxel volume changes under a DIR-recovered spatial transformation. The method is based on computing volume estimates of DIR-mapped subregions using the hit-or-miss sampling algorithm for integral approximation. The novel approach allows for regional volume change estimates that (a) respect the resolution of the digital grid and (b) are based on approximations with quantitatively characterized and controllable levels of uncertainty. As such, the IJF method is designed to be robust to variations in DIR solutions and thus overall more reproducible. METHODS: Numerically, Jacobian estimates are recovered by solving a simple constrained linear least squares problem that guarantees the recovered global volume change is equal to the global volume change obtained from the inhale and exhale lung segmentation masks. Reproducibility of the IJF method with respect to the DIR solution was assessed using the expert-determined landmark point pairs and inhale/exhale phases from 10 four-dimensional computed tomographies (4DCTs) available on www.dir-lab.com. Reproducibility with respect to CT acquisition was assessed on the 4DCT and 4D cone beam CT (4DCBCT) images acquired for five lung cancer patients prior to radiotherapy. RESULTS: The ten DIR-Lab 4DCT cases were registered twice with the same DIR algorithm but with different smoothing parameters. Finite difference Jacobian (FDJ) and IJF images were computed for both solutions. The average spatial errors (300 landmarks per case) for the two DIR solutions were 0.98 (1.10) and 1.02 (1.11). The average Pearson correlation between the FDJ images computed from the two DIR solutions was 0.83 (0.03), while for the IJF images it was 1.00 (0.00). For intermodality assessment, the IJF and FDJ images were computed from the 4DCT and 4DCBCT of five patients. The average Pearson correlation of the spatially aligned FDJ images was 0.27 (0.11), while it was 0.77 (0.13) for the IJF method. CONCLUSION: The mathematical theory underpinning the IJF method allows for the generation of ventilation images that are (a) computed with respect to DIR spatial accuracy on the digital voxel grid and (b) based on DIR-measured subregional volume change estimates acquired with quantifiable and controllable levels of uncertainty. Analyses of the experiments are consistent with the mathematical theory and indicate that IJF ventilation imaging has higher reproducibility with respect to both DIR algorithm and CT acquisition method, in comparison to the standard finite difference approach.
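The constrained linear least squares step described in METHODS can be illustrated generically: minimize a least-squares objective subject to the regional estimates summing to a prescribed global volume change. The sketch below solves such a problem through its KKT system; the matrices and the global value are toy numbers, not the paper's formulation or data.

```python
# A generic equality-constrained least-squares solve, shown only to illustrate
# the kind of constraint the IJF method is described as imposing (regional
# volume-change estimates consistent with the global volume change). The
# matrices below are toy numbers, not data or code from the paper.
import numpy as np

def constrained_lstsq(A, b, c, d):
    """Minimise ||A x - b||^2 subject to c @ x = d, via the KKT system."""
    n = A.shape[1]
    kkt = np.block([[2 * A.T @ A, c[:, None]],
                    [c[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([2 * A.T @ b, [d]])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:n]                        # drop the Lagrange multiplier

A = np.array([[1.0, 0.2, 0.0],
              [0.1, 1.0, 0.3],
              [0.0, 0.2, 1.0]])
b = np.array([1.05, 0.92, 1.10])          # unconstrained regional estimates (toy)
c = np.ones(3)                            # constraint: regional changes sum to ...
d = 3.0                                   # ... the global volume change (toy value)
x = constrained_lstsq(A, b, c, d)
print(x, "sum =", x.sum())                # the sum equals d by construction
```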
Item: Statistical Approaches to Identifying Therapeutic Vulnerabilities from Cancer Genomics Data (2023-04-17). Kong, Elisabeth K; Guerra, Rudy; Korkut, Anil.
An increased understanding of the molecular mechanisms that regulate cellular processes has resulted in the growing prevalence of genomically targeted cancer therapies. However, durable responses to these therapies are rarely achieved in the majority of cancers due to cancer cells' resistance to therapy. There is a significant challenge in identifying which patients will benefit from targeted therapies and in matching the correct therapy to each patient. This demonstrates the need for a comprehensive statistical analysis of genomic alterations across patient cohorts to define driver events that can guide the selection of targeted therapies. Here, we address therapeutic vulnerability identification and precision therapy selection through two data collections. First, we present a dataset of cancers featuring EGFR and HER2 aberrations from The Cancer Genome Atlas. We assess the landscape of aberrations in these genes across disease types and observe differences in their pathway scores and most common variants. Second, we present a dataset from the ARTEMIS longitudinal trial composed of triple-negative breast cancer samples. We detail the batch effect correction process performed to enable integrated analysis across multiple sequencing batches. Through differential gene expression analysis, we observe increases in genes encoding potentially actionable proteins that can be matched to targeted therapies for patients who were resistant to neoadjuvant therapy. Additionally, we observe a decrease in immune infiltration in patients who were resistant to neoadjuvant therapy compared to those who were sensitive to therapy. We utilize these two datasets to demonstrate two algorithms. First, the machine learning tool REFLECT maps the landscape of recurrent oncogenic co-alterations in cancers to propose targeted combination therapies. We evaluate REFLECT in the EGFR- and HER2-aberrant and ARTEMIS datasets to match specific co-alteration signatures to targeted cancer drugs. Then, we consider ImogiMap, a statistical analysis tool that identifies interactions between oncogenic processes and immune checkpoint receptors based on their impact on immune phenotypes. We evaluate the algorithm in EGFR-aberrant lung adenocarcinomas and HER2-aberrant breast cancers, and we compare the cohorts that were sensitive and resistant to neoadjuvant therapy from the ARTEMIS trial. We expect this analysis will guide future precision medicine applications.

Item: Statistical Methods for Bioinformatics: Estimation of Copy Number and Detection of Gene Interactions (2011). Guo, Beibei; Guerra, Rudy.
Identification of copy number aberrations in the human genome has been an important area in cancer research. In the first part of my thesis, I propose a new model for determining genomic copy numbers using high-density single nucleotide polymorphism genotyping microarrays. The method is based on a Bayesian spatial normal mixture model with an unknown number of components corresponding to true copy numbers. A reversible jump Markov chain Monte Carlo algorithm is used to implement the model and perform posterior inference. The second part of the thesis describes a new method for the detection of gene-gene interactions using gene expression data extracted from microarray experiments. The method is based on a two-step genetic algorithm, with the first step detecting main effects and the second step looking for interacting gene pairs. The performance of both algorithms is examined on simulated data and real cancer data and is compared with popular existing algorithms. Conclusions are given and possible extensions are discussed.

Item: Using Multiple Imputation, Survival Analysis, And Propensity Score Analysis In Cancer Data With Missingness (2015-12-01). Berliner, Nathan K; Hess, Kenneth; Vannucci, Marina; Scott, David; Guerra, Rudy; Shen, Yu.
In this thesis, multiple imputation, survival analysis, and propensity score analysis are combined in order to answer questions about treatment efficacy in cancer data with missingness. While each of these fields has been studied individually, there has been little work on using all three together. Starting with an incomplete dataset, the goal is to impute the missing data and then run survival and propensity score analyses on each of the imputed datasets to answer clinically relevant questions. Along the way, many theoretical and analytical decisions are made and justified. The methodology is then applied to an observational cancer survival dataset of patients with brain metastases from breast cancer to determine the effectiveness of chemotherapeutic and HER2-directed therapies.
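A compressed, hypothetical sketch of the pipeline described above on toy data: impute missing covariates, estimate propensity scores, and fit an inverse-probability-weighted Cox model. Proper multiple imputation would repeat the imputation several times and pool estimates with Rubin's rules; this shows a single pass, with invented column names and data (not the thesis dataset or code).

```python
# One pass of an impute -> propensity score -> weighted Cox workflow on toy data.
# Real multiple imputation would repeat the imputation M times and pool results
# (Rubin's rules). Column names and data are invented, not the thesis dataset.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({
    "age": rng.normal(55, 10, n),
    "er_status": rng.integers(0, 2, n).astype(float),
    "treated": rng.integers(0, 2, n),
    "time": rng.exponential(24, n),
    "event": rng.integers(0, 2, n),
})
df.loc[rng.random(n) < 0.2, "er_status"] = np.nan           # inject missingness

covars = ["age", "er_status"]
df[covars] = IterativeImputer(random_state=0).fit_transform(df[covars])  # one imputation

ps = LogisticRegression().fit(df[covars], df["treated"]).predict_proba(df[covars])[:, 1]
df["ipw"] = np.where(df["treated"] == 1, 1 / ps, 1 / (1 - ps))           # inverse-probability weights

cph = CoxPHFitter()
cph.fit(df[["time", "event", "treated", "ipw"]], duration_col="time",
        event_col="event", weights_col="ipw", robust=True)
cph.print_summary()
```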