Computer Science Publications
Browsing Computer Science Publications by Title
A Chromosome-length Assembly of the Black Petaltail (Tanypteryx hageni) Dragonfly
(Oxford University Press, 2023) Tolman, Ethan R; Beatty, Christopher D; Bush, Jonas; Kohli, Manpreet; Moreno, Carlos M; Ware, Jessica L; Weber, K Scott; Khan, Ruqayya; Maheshwari, Chirag; Weisz, David; Dudchenko, Olga; Aiden, Erez Lieberman; Frandsen, Paul B; Center for Theoretical Biological Physics
We present a chromosome-length genome assembly and annotation of the Black Petaltail dragonfly (Tanypteryx hageni). This habitat specialist diverged from its sister species over 70 million years ago, and separated from the most closely related Odonata with a reference genome 150 million years ago. Using PacBio HiFi reads and Hi-C data for scaffolding we produce one of the most high-quality Odonata genomes to date. A scaffold N50 of 206.6 Mb and a single copy BUSCO score of 96.2% indicate high contiguity and completeness.

A Chromosome-Length Reference Genome for the Endangered Pacific Pocket Mouse Reveals Recent Inbreeding in a Historically Large Population
(Oxford University Press, 2022) Wilder, Aryn P; Dudchenko, Olga; Curry, Caitlin; Korody, Marisa; Turbek, Sheela P; Daly, Mark; Misuraca, Ann; Wang, Gaojianyong; Khan, Ruqayya; Weisz, David; Fronczek, Julie; Aiden, Erez Lieberman; Houck, Marlys L; Shier, Debra M; Ryder, Oliver A; Steiner, Cynthia C; Center for Theoretical Biological Physics
High-quality reference genomes are fundamental tools for understanding population history, and can provide estimates of genetic and demographic parameters relevant to the conservation of biodiversity. The federally endangered Pacific pocket mouse (PPM), which persists in three small, isolated populations in southern California, is a promising model for studying how demographic history shapes genetic diversity, and how diversity in turn may influence extinction risk.
To facilitate these studies in PPM, we combined PacBio HiFi long reads with Omni-C and Hi-C data to generate a de novo genome assembly, and annotated the genome using RNAseq. The assembly comprised 28 chromosome-length scaffolds (N50 = 72.6 MB) and the complete mitochondrial genome, and included a long heterochromatic region on chromosome 18 not represented in the previously available short-read assembly. Heterozygosity was highly variable across the genome of the reference individual, with 18% of windows falling in runs of homozygosity (ROH) >1 MB, and nearly 9% in tracts spanning >5 MB. Yet outside of ROH, heterozygosity was relatively high (0.0027), and historical Ne estimates were large. These patterns of genetic variation suggest recent inbreeding in a formerly large population. Currently the most contiguous assembly for a heteromyid rodent, this reference genome provides insight into the past and recent demographic history of the population, and will be a critical tool for management and future studies of outbreeding depression, inbreeding depression, and genetic load.

A CRISPR toolbox for generating intersectional genetic mouse models for functional, molecular, and anatomical circuit mapping
(Springer Nature, 2022) Lusk, Savannah J.; McKinney, Andrew; Hunt, Patrick J.; Fahey, Paul G.; Patel, Jay; Chang, Andersen; Sun, Jenny J.; Martinez, Vena K.; Zhu, Ping Jun; Egbert, Jeremy R.; Allen, Genevera; Jiang, Xiaolong; Arenkiel, Benjamin R.; Tolias, Andreas S.; Costa-Mattioli, Mauro; Ray, Russell S.
The functional understanding of genetic interaction networks and cellular mechanisms governing health and disease requires the dissection, and multifaceted study, of discrete cell subtypes in developing and adult animal models. Recombinase-driven expression of transgenic effector alleles represents a significant and powerful approach to delineate cell populations for functional, molecular, and anatomical studies.
In addition to single recombinase systems, the expression of two recombinases in distinct, but partially overlapping, populations allows for more defined target expression. Although the application of this method is becoming increasingly popular, its experimental implementation has been broadly restricted to manipulations of a limited set of common alleles that are often commercially produced at great expense, with costs and technical challenges associated with production of intersectional mouse lines hindering customized approaches to many researchers. Here, we present a simplified CRISPR toolkit for rapid, inexpensive, and facile intersectional allele production.

A deep learning solution for crystallographic structure determination
(International Union of Crystallography, 2023) Pan, T.; Jin, S.; Miller, M. D.; Kyrillidis, A.; Phillips, G. N.
The general de novo solution of the crystallographic phase problem is difficult and only possible under certain conditions. This paper develops an initial pathway to a deep learning neural network approach for the phase problem in protein crystallography, based on a synthetic dataset of small fragments derived from a large well curated subset of solved structures in the Protein Data Bank (PDB). In particular, electron-density estimates of simple artificial systems are produced directly from corresponding Patterson maps using a convolutional neural network architecture as a proof of concept.

A divide-and-conquer method for scalable phylogenetic network inference from multilocus data
(Oxford University Press, 2019) Zhu, Jiafan; Liu, Xinhao; Ogilvie, Huw A.; Nakhleh, Luay K.
Motivation: Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting.
However, these methods can only handle a small number of loci from a handful of genomes. Results: In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference.

A Machine Learning Model for Risk Stratification of Postdiagnosis Diabetic Ketoacidosis Hospitalization in Pediatric Type 1 Diabetes: Retrospective Study
(JMIR, 2024) Subramanian, Devika; Sonabend, Rona; Singh, Ila
Background: Diabetic ketoacidosis (DKA) is the leading cause of morbidity and mortality in pediatric type 1 diabetes (T1D), occurring in approximately 20% of patients, with an economic cost of $5.1 billion/year in the United States. Despite multiple risk factors for postdiagnosis DKA, there is still a need for explainable, clinic-ready models that accurately predict DKA hospitalization in established patients with pediatric T1D. Objective: We aimed to develop an interpretable machine learning model to predict the risk of postdiagnosis DKA hospitalization in children with T1D using routinely collected time-series of electronic health record (EHR) data. Methods: We conducted a retrospective case-control study using EHR data from 1787 patients from among 3794 patients with T1D treated at a large tertiary care US pediatric health system from January 2010 to June 2018.
We trained a state-of-the-art, explainable, gradient-boosted ensemble (XGBoost) of decision trees with 44 regularly collected EHR features to predict postdiagnosis DKA. We measured the model’s predictive performance using the area under the receiver operating characteristic curve, weighted F1-score, weighted precision, and recall, in a 5-fold cross-validation setting. We analyzed Shapley values to interpret the learned model and gain insight into its predictions. Results: Our model distinguished the cohort that develops DKA postdiagnosis from the one that does not (P<.001). It predicted postdiagnosis DKA risk with an area under the receiver operating characteristic curve of 0.80 (SD 0.04), a weighted F1-score of 0.78 (SD 0.04), and a weighted precision and recall of 0.83 (SD 0.03) and 0.76 (SD 0.05) respectively, using a relatively short history of data from routine clinic follow-ups post diagnosis. On analyzing Shapley values of the model output, we identified key risk factors predicting postdiagnosis DKA both at the cohort and individual levels. We observed sharp changes in postdiagnosis DKA risk with respect to 2 key features (diabetes age and glycated hemoglobin at 12 months), yielding time intervals and glycated hemoglobin cutoffs for potential intervention. By clustering model-generated Shapley values, we automatically stratified the cohort into 3 groups with 5%, 20%, and 48% risk of postdiagnosis DKA. Conclusions: We have built an explainable, predictive, machine learning model with potential for integration into clinical workflow. The model risk-stratifies patients with pediatric T1D and identifies patients with the highest postdiagnosis DKA risk using limited follow-up data starting from the time of diagnosis. The model identifies key time points and risk factors to direct clinical interventions at both the individual and cohort levels.
Further research with data from multiple hospital systems can help us assess how well our model generalizes to other populations. The clinical importance of our work is that the model can predict patients most at risk for postdiagnosis DKA and identify preventive interventions based on mitigation of individualized risk factors.

A maximum pseudo-likelihood approach for phylogenetic networks
(BioMed Central, 2015) Yu, Yun; Nakhleh, Luay K.
Background: Several phylogenomic analyses have recently demonstrated the need to account simultaneously for incomplete lineage sorting (ILS) and hybridization when inferring a species phylogeny. A maximum likelihood approach was introduced recently for inferring species phylogenies in the presence of both processes, and showed very good results. However, computing the likelihood of a model in this case is computationally infeasible except for very small data sets. Results: Inspired by recent work on the pseudo-likelihood of species trees based on rooted triples, we introduce the pseudo-likelihood of a phylogenetic network, which, when combined with a search heuristic, provides a statistical method for phylogenetic network inference in the presence of ILS. Unlike trees, networks are not always uniquely encoded by a set of rooted triples. Therefore, even when given sufficient data, the method might converge to a network that is equivalent under rooted triples to the true one, but not the true one itself. The method is computationally efficient and has produced very good results on the data sets we analyzed. The method is implemented in PhyloNet, which is publicly available in open source. Conclusions: Maximum pseudo-likelihood allows for inferring species phylogenies in the presence of hybridization and ILS, while scaling to much larger data sets than is currently feasible under full maximum likelihood.
The nonuniqueness of phylogenetic networks encoded by a system of rooted triples notwithstanding, the proposed method infers the correct network under certain scenarios, and provides candidates for further exploration under other criteria and/or data in other scenarios.

A Polynomial Blossom for the Askey–Wilson Operator
(Springer, 2018) Simeonov, Plamen; Goldman, Ron
We introduce a blossoming procedure for polynomials related to the Askey–Wilson operator. This new blossom is symmetric, multiaffine, and reduces to the complex representation of the polynomial on a certain diagonal. This Askey–Wilson blossom can be used to find the Askey–Wilson derivative of a polynomial of any order. We also introduce a corresponding Askey–Wilson Bernstein basis for which this new blossom provides the dual functionals. We derive a partition of unity property and a Marsden identity for this Askey–Wilson Bernstein basis, which turn out to be the terminating versions of Rogers’ ₆ϕ₅ summation formula and a very-well-poised ₈ϕ₇ summation formula. Recurrence and symmetry relations and differentiation and degree elevation formulas for the Askey–Wilson Bernstein bases, as well as degree elevation formulas for Askey–Wilson Bézier curves, are also given.

A review of parameters and heuristics for guiding metabolic pathfinding
(Springer International Publishing, 2017-09-15) Kim, Sarah M.; Peña, Matthew I.; Moll, Mark; Bennett, George N.; Kavraki, Lydia E.
Recent developments in metabolic engineering have led to the successful biosynthesis of valuable products, such as the precursor of the antimalarial compound, artemisinin, and opioid precursor, thebaine. Synthesizing these traditionally plant-derived compounds in genetically modified yeast cells introduces the possibility of significantly reducing the total time and resources required for their production, and in turn, allows these valuable compounds to become cheaper and more readily available.
Most biosynthesis pathways used in metabolic engineering applications have been discovered manually, requiring a tedious search of existing literature and metabolic databases. However, the recent rapid development of available metabolic information has enabled the development of automated approaches for identifying novel pathways. Computer-assisted pathfinding has the potential to save biochemists time in the initial discovery steps of metabolic engineering. In this paper, we review the parameters and heuristics used to guide the search in recent pathfinding algorithms. These parameters and heuristics capture information on the metabolic network structure, compound structures, reaction features, and organism-specificity of pathways. No one metabolic pathfinding algorithm or search parameter stands out as the best to use broadly for solving the pathfinding problem, as each method and parameter has its own strengths and shortcomings. As assisted pathfinding approaches continue to become more sophisticated, the development of better methods for visualizing pathway results and integrating these results into existing metabolic engineering practices is also important for encouraging wider use of these pathfinding methods.

A scientific machine learning framework to understand flash graphene synthesis
(Royal Society of Chemistry, 2023) Sattari, Kianoosh; Eddy, Lucas; Beckham, Jacob L.; Wyss, Kevin M.; Byfield, Richard; Qian, Long; Tour, James M.; Lin, Jian; NanoCarbon Center; Welch Institute for Advanced Materials
Flash Joule heating (FJH) is a far-from-equilibrium (FFE) processing method for converting low-value carbon-based materials to flash graphene (FG). Despite its promises in scalability and performance, attempts to explore the reaction mechanism have been limited due to the complexities involved in the FFE process.
Data-driven machine learning (ML) models effectively account for the complexities, but the model training requires a considerable amount of experimental data. To tackle this challenge, we constructed a scientific ML (SML) framework trained by using both direct processing variables and indirect, physics-informed variables to predict the FG yield. The indirect variables include current-derived features (final current, maximum current, and charge density) predicted from the proxy ML models and reaction temperatures simulated from multi-physics modeling. With the combined indirect features, the final ML model achieves an average R² score of 0.81 ± 0.05 and an average RMSE of 12.1% ± 2.0% in predicting the FG yield, which is significantly higher than the model trained without them (R² of 0.73 ± 0.05 and an RMSE of 14.3% ± 2.0%). Feature importance analysis validates the key roles of these indirect features in determining the reaction outcome. These results illustrate the promise of this SML to elucidate FFE material synthesis outcomes, thus paving a new avenue to processing other datasets from the materials systems involving the same or different FFE processes.

Accelerating High-Order Stencils on GPUs
(IEEE, 2020) Sai, Ryuichi; Mellor-Crummey, John; Meng, Xiaozhu; Araya-Polo, Mauricio; Meng, Jie
While implementation strategies for low-order stencils on GPUs have been well-studied in the literature, not all of the techniques work well for high-order stencils, such as those used for seismic imaging. In this paper, we study practical seismic imaging computations on GPUs using high-order stencils on large domains with meaningful boundary conditions. We manually crafted a collection of implementations of a 25-point seismic modeling stencil in CUDA along with code to apply the boundary conditions. We evaluated our stencil code shapes, memory hierarchy usage, data-fetching patterns, and other performance attributes.
We conducted an empirical evaluation of these stencils using several mature and emerging tools and discuss our quantitative findings. Some of our implementations achieved twice the performance of a proprietary code developed in C and mapped to GPUs using OpenACC. Additionally, several of our implementations have excellent performance portability.

An automated respiratory data pipeline for waveform characteristic analysis
(Wiley, 2023) Lusk, Savannah; Ward, Christopher S.; Chang, Andersen; Twitchell-Heyne, Avery; Fattig, Shaun; Allen, Genevera; Jankowsky, Joanna L.; Ray, Russell S.
Comprehensive and accurate analysis of respiratory and metabolic data is crucial to modelling congenital, pathogenic and degenerative diseases converging on autonomic control failure. A lack of tools for high-throughput analysis of respiratory datasets remains a major challenge. We present Breathe Easy, a novel open-source pipeline for processing raw recordings and associated metadata into operative outcomes, publication-worthy graphs and robust statistical analyses including QQ and residual plots for assumption queries and data transformations. This pipeline uses a facile graphical user interface for uploading data files, setting waveform feature thresholds and defining experimental variables. Breathe Easy was validated against manual selection by experts, which represents the current standard in the field. We demonstrate Breathe Easy's utility by examining a 2-year longitudinal study of an Alzheimer's disease mouse model to assess contributions of forebrain pathology in disordered breathing. Whole body plethysmography has become an important experimental outcome measure for a variety of diseases with primary and secondary respiratory indications. Respiratory dysfunction, while not an initial symptom in many of these disorders, often drives disability or death in patient outcomes.
Breathe Easy provides an open-source respiratory analysis tool for all respiratory datasets and represents a necessary improvement upon current analytical methods in the field.
Key points:
Respiratory dysfunction is a common endpoint for disability and mortality in many disorders throughout life.
Whole body plethysmography in rodents represents a high face-value method for measuring respiratory outcomes in rodent models of these diseases and disorders.
Analysis of key respiratory variables remains hindered by manual annotation and analysis that leads to low throughput results that often exclude a majority of the recorded data.
Here we present a software suite, Breathe Easy, that automates the process of data selection from raw recordings derived from plethysmography experiments and the analysis of these data into operative outcomes and publication-worthy graphs with statistics.
We validate Breathe Easy with a terabyte-scale Alzheimer's dataset that examines the effects of forebrain pathology on respiratory function over 2 years of degeneration.

An Automated System for Interactively Learning Software Testing
(Association for Computing Machinery, 2017) Smith, Rebecca; Tang, Terry; Warren, Joe; Rixner, Scott
Testing is an important, time-consuming, and often difficult part of the software development process. It is therefore critical to introduce testing early in the computer science curriculum, and to provide students with frequent opportunities for practice and feedback. This paper presents an automated system to help introductory students learn how to test software. Students submit test cases to the system, which uses a large corpus of buggy programs to evaluate these test cases. In addition to gauging the quality of the test cases, the system immediately presents students with feedback in the form of buggy programs that nonetheless pass their tests.
This enables students to understand why their test cases are deficient and gives them a starting point for improvement. The system has proven effective in an introductory class: students that trained using the system were later able to write better test cases, even without any feedback, than those who were not. Further, students reported additional benefits such as improved ability to read code written by others and to understand multiple approaches to the same problem.

An Evaluation of Methods for Inferring Boolean Networks from Time-Series Data
(Public Library of Science, 2013) Berestovsky, Natalie; Nakhleh, Luay
Regulatory networks play a central role in cellular behavior and decision making. Learning these regulatory networks is a major task in biology, and devising computational methods and mathematical models for this task is a major endeavor in bioinformatics. Boolean networks have been used extensively for modeling regulatory networks. In this model, the state of each gene can be either ‘on’ or ‘off’, and the next state of a gene is updated, synchronously or asynchronously, according to a Boolean rule that is applied to the current state of the entire system. Inferring a Boolean network from a set of experimental data entails two main steps: first, the experimental time-series data are discretized into Boolean trajectories, and then, a Boolean network is learned from these Boolean trajectories. In this paper, we consider three methods for data discretization, including a new one we propose, and three methods for learning Boolean networks, and study the performance of all possible nine combinations on four regulatory systems of varying dynamics complexities. We find that employing the right combination of methods for data discretization and network learning results in Boolean networks that capture the dynamics well and provide predictive power.
Our findings are in contrast to a recent survey that placed Boolean networks on the low end of the “faithfulness to biological reality” and “ability to model dynamics” spectra. Further, contrary to the common argument in favor of Boolean networks, we find that a relatively large number of time points in the time-series data is required to learn good Boolean networks for certain data sets. Last but not least, while methods have been proposed for inferring Boolean networks, as discussed above, missing still are publicly available implementations thereof. Here, we make our implementation of the methods available publicly in open source at http://bioinfo.cs.rice.edu/.

An incremental constraint-based framework for task and motion planning
(Sage, 2018) Dantam, Neil T.; Kingston, Zachary K.; Chaudhuri, Swarat; Kavraki, Lydia E.
We present a new constraint-based framework for task and motion planning (TMP). Our approach is extensible, probabilistically complete, and offers improved performance and generality compared with a similar, state-of-the-art planner. The key idea is to leverage incremental constraint solving to efficiently incorporate geometric information at the task level. Using motion feasibility information to guide task planning improves scalability of the overall planner. Our key abstractions address the requirements of manipulation and object rearrangement. We validate our approach on a physical manipulator and evaluate scalability on scenarios with many objects and long plans, showing order-of-magnitude gains compared with the benchmark planner and improved scalability from additional geometric guidance.
Finally, in addition to describing a new method for TMP and its implementation on a physical robot, we also put forward requirements and abstractions for the development of similar planners in the future.

Analysis of bronchoalveolar lavage fluid metatranscriptomes among patients with COVID-19 disease
(Springer Nature, 2022) Jochum, Michael; Lee, Michael D.; Curry, Kristen; Zaksas, Victoria; Vitalis, Elizabeth; Treangen, Todd; Aagaard, Kjersti; Ternus, Krista L.
To better understand the potential relationship between COVID-19 disease and hologenome microbial community dynamics and functional profiles, we conducted a multivariate taxonomic and functional microbiome comparison of publicly available human bronchoalveolar lavage fluid (BALF) metatranscriptome samples amongst COVID-19 (n = 32), community acquired pneumonia (CAP) (n = 25), and uninfected samples (n = 29). We then performed a stratified analysis based on mortality amongst the COVID-19 cohort with known outcomes of deceased (n = 10) versus survived (n = 15). Our overarching hypothesis was that there are detectable and functionally significant relationships between BALF microbial metatranscriptomes and the severity of COVID-19 disease onset and progression. We observed 34 functionally discriminant gene ontology (GO) terms in COVID-19 disease compared to the CAP and uninfected cohorts, and 21 GO terms functionally discriminant to COVID-19 mortality (q < 0.05). GO terms enriched in the COVID-19 disease cohort included hydrolase activity, and significant GO terms under the parental terms of biological regulation, viral process, and interspecies interaction between organisms. Notable GO terms associated with COVID-19 mortality included nucleobase-containing compound biosynthetic process, organonitrogen compound catabolic process, pyrimidine-containing compound biosynthetic process, and DNA recombination, RNA binding, magnesium and zinc ion binding, oxidoreductase activity, and endopeptidase activity.
A Dirichlet multinomial mixtures clustering analysis resulted in a best model fit using three distinct clusters that were significantly associated with COVID-19 disease and mortality. We additionally observed discriminant taxonomic differences associated with COVID-19 disease and mortality in the genus Sphingomonas, belonging to the Sphingomonadaceae family, Variovorax, belonging to the Comamonadaceae family, and in the class Bacteroidia, belonging to the order Bacteroidales. To our knowledge, this is the first study to evaluate significant differences in taxonomic and functional signatures between BALF metatranscriptomes from COVID-19, CAP, and uninfected cohorts, as well as associating these taxa and microbial gene functions with COVID-19 mortality. Collectively, while this data does not speak to causality nor directionality of the association, it does demonstrate a significant relationship between the human microbiome and COVID-19. The results from this study have rendered testable hypotheses that warrant further investigation to better understand the causality and directionality of host–microbiome–pathogen interactions.

Annotation-free delineation of prokaryotic homology groups
(Public Library of Science, 2022) Yin, Yongze; Ogilvie, Huw A.; Nakhleh, Luay
Phylogenomic studies of prokaryotic taxa often assume conserved marker genes are homologous across their length. However, processes such as horizontal gene transfer or gene duplication and loss may disrupt this homology by recombining only parts of genes, causing gene fission or fusion. We show using simulation that it is necessary to delineate homology groups in a set of bacterial genomes without relying on gene annotations to define the boundaries of homologous regions.
To solve this problem, we have developed a graph-based algorithm to partition a set of bacterial genomes into Maximal Homologous Groups of sequences (MHGs) where each MHG is a maximal set of maximum-length sequences which are homologous across the entire sequence alignment. We applied our algorithm to a dataset of 19 Enterobacteriaceae species and found that MHGs cover much greater proportions of genomes than markers and, relatedly, are less biased in terms of the functions of the genes they cover. We zoomed in on the correlation between each individual marker and their overlapping MHGs, and show that few phylogenetic splits supported by the markers are supported by the MHGs while many marker-supported splits are contradicted by the MHGs. A comparison of the species tree inferred from marker genes with the species tree inferred from MHGs suggests that the increased bias and lack of genome coverage by markers causes incorrect inferences as to the overall relationship between bacterial taxa.

APE-Gen: A Fast Method for Generating Ensembles of Bound Peptide-MHC Conformations
(MDPI, 2019) Abella, Jayvee R.; Antunes, Dinler A.; Clementi, Cecilia; Kavraki, Lydia E.
The Class I Major Histocompatibility Complex (MHC) is a central protein in immunology as it binds to intracellular peptides and displays them at the cell surface for recognition by T-cells. The structural analysis of bound peptide-MHC complexes (pMHCs) holds the promise of interpretable and general binding prediction (i.e., testing whether a given peptide binds to a given MHC). However, structural analysis is limited in part by the difficulty in modelling pMHCs given the size and flexibility of the peptides that can be presented by MHCs. This article describes APE-Gen (Anchored Peptide-MHC Ensemble Generator), a fast method for generating ensembles of bound pMHC conformations.
APE-Gen generates an ensemble of bound conformations by iterated rounds of (i) anchoring the ends of a given peptide near known pockets in the binding site of the MHC, (ii) sampling peptide backbone conformations with loop modelling, and then (iii) performing energy minimization to fix steric clashes, accumulating conformations at each round. APE-Gen takes only minutes on a standard desktop to generate tens of bound conformations, and we show the ability of APE-Gen to sample conformations found in X-ray crystallography even when only sequence information is used as input. APE-Gen has the potential to be useful for its scalability (i.e., modelling thousands of pMHCs or even non-canonical longer peptides) and for its use as a flexible search tool. We demonstrate an example for studying cross-reactivity.

Assessing the performance of methods for copy number aberration detection from single-cell DNA sequencing data
(Public Library of Science, 2020) Mallory, Xian F.; Edrisi, Mohammadamin; Navin, Nicholas; Nakhleh, Luay
Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. We benchmarked three widely used methods (Ginkgo, HMMcopy, and CopyNumber) on simulated as well as real datasets. To facilitate this, we developed a novel simulator of single-cell genome evolution in the presence of CNAs. Furthermore, to assess performance on empirical data where the ground truth is unknown, we introduce a phylogeny-based measure for identifying potentially erroneous inferences.
While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, our findings show that even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient.

Association of education attainment, smoking status, and alcohol use disorder with dementia risk in older adults: a longitudinal observational study
(Springer Nature, 2024) Tang, Huilin; Shaaban, C. Elizabeth; DeKosky, Steven T.; Smith, Glenn E.; Hu, Xia; Jaffee, Michael; Salloum, Ramzi G.; Bian, Jiang; Guo, Jingchuan
Previous research on the risk of dementia associated with education attainment, smoking status, and alcohol use disorder (AUD) has yielded inconsistent results, indicating potential heterogeneous treatment effects (HTEs) of these factors on dementia risk. Thus, this study aimed to identify the important variables that may contribute to HTEs of these factors in older adults.