Computer Science Publications

Recent Submissions

  • Item
    A scientific machine learning framework to understand flash graphene synthesis
    (Royal Society of Chemistry, 2023) Sattari, Kianoosh; Eddy, Lucas; Beckham, Jacob L.; Wyss, Kevin M.; Byfield, Richard; Qian, Long; Tour, James M.; Lin, Jian; NanoCarbon Center; Welch Institute for Advanced Materials
    Flash Joule heating (FJH) is a far-from-equilibrium (FFE) processing method for converting low-value carbon-based materials to flash graphene (FG). Despite its promises in scalability and performance, attempts to explore the reaction mechanism have been limited due to the complexities involved in the FFE process. Data-driven machine learning (ML) models effectively account for the complexities, but the model training requires a considerable amount of experimental data. To tackle this challenge, we constructed a scientific ML (SML) framework trained by using both direct processing variables and indirect, physics-informed variables to predict the FG yield. The indirect variables include current-derived features (final current, maximum current, and charge density) predicted from the proxy ML models and reaction temperatures simulated from multi-physics modeling. With the combined indirect features, the final ML model achieves an average R² score of 0.81 ± 0.05 and an average RMSE of 12.1% ± 2.0% in predicting the FG yield, which significantly outperforms the model trained without them (R² of 0.73 ± 0.05 and an RMSE of 14.3% ± 2.0%). Feature importance analysis validates the key roles of these indirect features in determining the reaction outcome. These results illustrate the promise of this SML to elucidate FFE material synthesis outcomes, thus paving a new avenue to processing other datasets from the materials systems involving the same or different FFE processes.
  • Item
    Joint embedding of biological networks for cross-species functional alignment
    (Oxford University Press, 2023) Li, Lechuan; Dannenfelser, Ruth; Zhu, Yu; Hejduk, Nathaniel; Segarra, Santiago; Yao, Vicky
    Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this cross-species transfer, sequence similarity does not imply functional similarity, and thus, several current approaches incorporate protein–protein interactions to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem. We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structure and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within and between species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA’s embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for potential opportunities for drug repurposing and translational studies.
  • Item
    Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
    (Oxford University Press, 2023) Kille, Bryce; Garrison, Erik; Treangen, Todd J; Phillippy, Adam M
    The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. MashMap3 is available at
  • Item
    Supervised convex clustering
    (Wiley, 2023) Wang, Minjie; Yao, Tianyi; Allen, Genevera I.
    Clustering has long been a popular unsupervised learning approach to identify groups of similar objects and discover patterns from unlabeled data in many applications. Yet, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to their unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named supervised convex clustering (SCC) that borrows strength from both information sources and guides towards finding more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's disease that can potentially lead to better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.
  • Item
    Real-time, deep-learning aided lensless microscope
    (Optica Publishing Group, 2023) Wu, Jimin; Boominathan, Vivek; Veeraraghavan, Ashok; Robinson, Jacob T.
    Traditional miniaturized fluorescence microscopes are critical tools for modern biology. Invariably, they struggle to simultaneously image with a high spatial resolution and a large field of view (FOV). Lensless microscopes offer a solution to this limitation. However, real-time visualization of samples is not possible with lensless imaging, as image reconstruction can take minutes to complete. This poses a challenge for usability, as real-time visualization is a crucial feature that assists users in identifying and locating the imaging target. The issue is particularly pronounced in lensless microscopes that operate at close imaging distances. Imaging at close distances requires shift-varying deconvolution to account for the variation of the point spread function (PSF) across the FOV. Here, we present a lensless microscope that achieves real-time image reconstruction by eliminating the use of an iterative reconstruction algorithm. The neural network-based reconstruction method we show here achieves a more than 10,000-fold increase in reconstruction speed compared to iterative reconstruction. The increased reconstruction speed allows us to visualize the results of our lensless microscope at more than 25 frames per second (fps), while achieving better than 7 µm resolution over a FOV of 10 mm². This ability to reconstruct and visualize samples in real time enables a more user-friendly interaction with lensless microscopes. Users are able to use these microscopes much as they currently do conventional microscopes.
  • Item
    An automated respiratory data pipeline for waveform characteristic analysis
    (Wiley, 2023) Lusk, Savannah; Ward, Christopher S.; Chang, Andersen; Twitchell-Heyne, Avery; Fattig, Shaun; Allen, Genevera; Jankowsky, Joanna L.; Ray, Russell S.
    Comprehensive and accurate analysis of respiratory and metabolic data is crucial to modelling congenital, pathogenic and degenerative diseases converging on autonomic control failure. A lack of tools for high-throughput analysis of respiratory datasets remains a major challenge. We present Breathe Easy, a novel open-source pipeline for processing raw recordings and associated metadata into operative outcomes, publication-worthy graphs and robust statistical analyses including QQ and residual plots for assumption queries and data transformations. This pipeline uses a facile graphical user interface for uploading data files, setting waveform feature thresholds and defining experimental variables. Breathe Easy was validated against manual selection by experts, which represents the current standard in the field. We demonstrate Breathe Easy's utility by examining a 2-year longitudinal study of an Alzheimer's disease mouse model to assess contributions of forebrain pathology in disordered breathing. Whole body plethysmography has become an important experimental outcome measure for a variety of diseases with primary and secondary respiratory indications. Respiratory dysfunction, while not an initial symptom in many of these disorders, often drives disability or death in patient outcomes. Breathe Easy provides an open-source respiratory analysis tool for all respiratory datasets and represents a necessary improvement upon current analytical methods in the field.
    Key points: Respiratory dysfunction is a common endpoint for disability and mortality in many disorders throughout life. Whole body plethysmography in rodents represents a high face-value method for measuring respiratory outcomes in rodent models of these diseases and disorders. Analysis of key respiratory variables remains hindered by manual annotation and analysis that leads to low throughput results that often exclude a majority of the recorded data. Here we present a software suite, Breathe Easy, that automates the process of data selection from raw recordings derived from plethysmography experiments and the analysis of these data into operative outcomes and publication-worthy graphs with statistics. We validate Breathe Easy with a terabyte-scale Alzheimer's dataset that examines the effects of forebrain pathology on respiratory function over 2 years of degeneration.
  • Item
    Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes
    (Springer Nature, 2023) Chin, Chen-Shan; Behera, Sairam; Khalak, Asif; Sedlazeck, Fritz J.; Sudmant, Peter H.; Wagner, Justin; Zook, Justin M.
    Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions. Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes. This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.
  • Item
    Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree
    (Springer Nature, 2023) Dylus, David; Altenhoff, Adrian; Majidian, Sina; Sedlazeck, Fritz J.; Dessimoz, Christophe
    Current methods for inference of phylogenetic trees require running complex pipelines at substantial computational and labor costs, with additional constraints in sequencing coverage, assembly and annotation quality, especially for large datasets. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy. In a benchmark encompassing a broad variety of datasets, Read2Tree is 10–100 times faster than assembly-based approaches and in most cases more accurate—the exception being when sequencing coverage is high and reference species very distant. Here, to illustrate the broad applicability of the tool, we reconstruct a yeast tree of life of 435 species spanning 590 million years of evolution. We also apply Read2Tree to >10,000 Coronaviridae samples, accurately classifying highly diverse animal samples and near-identical severe acute respiratory syndrome coronavirus 2 sequences on a single tree. The speed, accuracy and versatility of Read2Tree enable comparative genomics at scale.
  • Item
    Genomic variant benchmark: if you cannot measure it, you cannot improve it
    (Springer Nature, 2023) Majidian, Sina; Agustinho, Daniel Paiva; Chin, Chen-Shan; Sedlazeck, Fritz J.; Mahmoud, Medhat
    Genomic benchmark datasets are essential to driving the field of genomics and bioinformatics. They provide a snapshot of the performances of sequencing technologies and analytical methods and highlight future challenges. However, they depend on sequencing technology, reference genome, and available benchmarking methods. Thus, creating a genomic benchmark dataset is laborious and highly challenging, often involving multiple sequencing technologies, different variant calling tools, and laborious manual curation. In this review, we discuss the available benchmark datasets and their utility. Additionally, we focus on the most recent benchmark of genes with medical relevance and challenging genomic complexity.
  • Item
    Charge-based interactions through peptide position 4 drive diversity of antigen presentation by human leukocyte antigen class I molecules
    (Oxford University Press, 2022) Jackson, Kyle R; Antunes, Dinler A; Talukder, Amjad H; Maleki, Ariana R; Amagai, Kano; Salmon, Avery; Katailiha, Arjun S; Chiu, Yulun; Fasoulis, Romanos; Rigo, Maurício Menegatti; Abella, Jayvee R; Melendez, Brenda D; Li, Fenge; Sun, Yimo; Sonnemann, Heather M; Belousov, Vladislav; Frenkel, Felix; Justesen, Sune; Makaju, Aman; Liu, Yang; Horn, David; Lopez-Ferrer, Daniel; Huhmer, Andreas F; Hwu, Patrick; Roszik, Jason; Hawke, David; Kavraki, Lydia E; Lizée, Gregory
    Human leukocyte antigen class I (HLA-I) molecules bind and present peptides at the cell surface to facilitate the induction of appropriate CD8+ T cell-mediated immune responses to pathogen- and self-derived proteins. The HLA-I peptide-binding cleft contains dominant anchor sites in the B and F pockets that interact primarily with amino acids at peptide position 2 and the C-terminus, respectively. Nonpocket peptide–HLA interactions also contribute to peptide binding and stability, but these secondary interactions are thought to be unique to individual HLA allotypes or to specific peptide antigens. Here, we show that two positively charged residues located near the top of the peptide-binding cleft facilitate interactions with negatively charged residues at position 4 of presented peptides, which occur at elevated frequencies across most HLA-I allotypes. Loss of these interactions was shown to impair HLA-I/peptide binding and complex stability, as demonstrated by both in vitro and in silico experiments. Furthermore, mutation of these Arginine-65 (R65) and/or Lysine-66 (K66) residues in HLA-A*02:01 and A*24:02 significantly reduced HLA-I cell surface expression while also reducing the diversity of the presented peptide repertoire by up to 5-fold. The impact of the R65 mutation demonstrates that nonpocket HLA-I/peptide interactions can constitute anchor motifs that exert an unexpectedly broad influence on HLA-I-mediated antigen presentation. These findings provide fundamental insights into peptide antigen binding that could broadly inform epitope discovery in the context of viral vaccine development and cancer immunotherapy.
  • Item
    Birational Quadratic Planar Maps with Generalized Complex Rational Representations
    (MDPI, 2023) Wang, Xuhui; Han, Yuhao; Ni, Qian; Li, Rui; Goldman, Ron
    Complex rational maps have been used to construct birational quadratic maps based on two special syzygies of degree one. Similar to complex rational curves, rational curves over generalized complex numbers have also been constructed by substituting the imaginary unit with a new independent quantity. We first establish the relationship between degree one, generalized, complex rational Bézier curves and quadratic rational Bézier curves. Then we provide conditions to determine when a quadratic rational planar map has a generalized complex rational representation. Thus, a rational quadratic planar map can be made birational by suitably choosing the middle Bézier control points and their corresponding weights. In contrast to the edges of complex rational maps of degree one, which are circular arcs, the edges of the planar maps can be generalized to hyperbolic and parabolic arcs by invoking the hyperbolic and parabolic numbers.
  • Item
    Stratification of Pediatric COVID-19 Cases Using Inflammatory Biomarker Profiling and Machine Learning
    (MDPI, 2023) Subramanian, Devika; Vittala, Aadith; Chen, Xinpu; Julien, Christopher; Acosta, Sebastian; Rusin, Craig; Allen, Carl; Rider, Nicholas; Starosolski, Zbigniew; Annapragada, Ananth; Devaraj, Sridevi
    While pediatric COVID-19 is rarely severe, a small fraction of children infected with SARS-CoV-2 go on to develop multisystem inflammatory syndrome (MIS-C), with substantial morbidity. An objective method with high specificity and high sensitivity to identify current or imminent MIS-C in children infected with SARS-CoV-2 is highly desirable. The aim was to learn about an interpretable novel cytokine/chemokine assay panel providing such an objective classification. This retrospective study was conducted on four groups of pediatric patients seen at multiple sites of Texas Children’s Hospital, Houston, TX who consented to provide blood samples to our COVID-19 Biorepository. Standard laboratory markers of inflammation and a novel cytokine/chemokine array were measured in blood samples of all patients. Group 1 consisted of 72 COVID-19, 70 MIS-C and 63 uninfected control patients seen between May 2020 and January 2021 and predominantly infected with pre-alpha variants. Group 2 consisted of 29 COVID-19 and 43 MIS-C patients seen between January and May 2021 infected predominantly with the alpha variant. Group 3 consisted of 30 COVID-19 and 32 MIS-C patients seen between August and October 2021 infected with alpha and/or delta variants. Group 4 consisted of 20 COVID-19 and 46 MIS-C patients seen between October 2021 and January 2022 infected with delta and/or omicron variants. Group 1 was used to train an L1-regularized logistic regression model which was tested using five-fold cross validation, and then separately validated against the remaining naïve groups. The area under receiver operating curve (AUROC) and F1-score were used to quantify the performance of the cytokine/chemokine assay-based classifier. Standard laboratory markers predict MIS-C with a five-fold cross-validated AUROC of 0.86 ± 0.05 and an F1 score of 0.78 ± 0.07, while the cytokine/chemokine panel predicted MIS-C with a five-fold cross-validated AUROC of 0.95 ± 0.02 and an F1 score of 0.91 ± 0.04, with only sixteen of the forty-five cytokines/chemokines sufficient to achieve this performance. Tested on Group 2 the cytokine/chemokine panel yielded AUROC = 0.98 and F1 = 0.93, on Group 3 it yielded AUROC = 0.89 and F1 = 0.89, and on Group 4 AUROC = 0.99 and F1 = 0.97. Adding standard laboratory markers to the cytokine/chemokine panel did not improve performance. A top-10 subset of these 16 cytokines achieves equivalent performance on the validation data sets. Our findings demonstrate that a sixteen-cytokine/chemokine panel as well as the top ten subset provides a highly sensitive and specific method to identify MIS-C in patients infected with SARS-CoV-2 of all the major variants identified to date.
  • Item
    A deep learning solution for crystallographic structure determination
    (International Union of Crystallography, 2023) Pan, T.; Jin, S.; Miller, M. D.; Kyrillidis, A.; Phillips, G. N.
    The general de novo solution of the crystallographic phase problem is difficult and only possible under certain conditions. This paper develops an initial pathway to a deep learning neural network approach for the phase problem in protein crystallography, based on a synthetic dataset of small fragments derived from a large well curated subset of solved structures in the Protein Data Bank (PDB). In particular, electron-density estimates of simple artificial systems are produced directly from corresponding Patterson maps using a convolutional neural network architecture as a proof of concept.
  • Item
    PME: pruning-based multi-size embedding for recommender systems
    (Frontiers Media S.A., 2023) Liu, Zirui; Song, Qingquan; Li, Li; Choi, Soo-Hyun; Chen, Rui; Hu, Xia
    Embedding is widely used in recommendation models to learn feature representations. However, the traditional embedding technique that assigns a fixed size to all categorical features may be suboptimal for the following reasons. In the recommendation domain, the majority of categorical features' embeddings can be trained with less capacity without impacting model performance, so storing embeddings with equal length may incur unnecessary memory usage. Existing work that tries to allocate customized sizes for each feature usually either simply scales the embedding size with the feature's popularity or formulates this size allocation problem as an architecture selection problem. Unfortunately, most of these methods either suffer a large performance drop or incur significant extra time cost for searching proper embedding sizes. In this article, instead of formulating the size allocation problem as an architecture selection problem, we approach the problem from a pruning perspective and propose the Pruning-based Multi-size Embedding (PME) framework. During the search phase, we prune the dimensions that have the least impact on model performance in the embedding to reduce its capacity. Then, we show that the customized size of each token can be obtained by transferring the capacity of its pruned embedding with significantly less search cost. Experimental results validate that PME can efficiently find proper sizes and hence achieve strong performance while significantly reducing the number of parameters in the embedding layer.
  • Item
    EnGens: a computational framework for generation and analysis of representative protein conformational ensembles
    (Oxford University Press, 2023) Conev, Anja; Rigo, Mauricio Menegatti; Devaurs, Didier; Fonseca, André Faustino; Kalavadwala, Hussain; de Freitas, Martiela Vaz; Clementi, Cecilia; Zanatta, Geancarlo; Antunes, Dinler Amaral; Kavraki, Lydia E
    Proteins are dynamic macromolecules that perform vital functions in cells. A protein structure determines its function, but this structure is not static, as proteins change their conformation to achieve various functions. Understanding the conformational landscapes of proteins is essential to understand their mechanism of action. Sets of carefully chosen conformations can summarize such complex landscapes and provide better insights into protein function than single conformations. We refer to these sets as representative conformational ensembles. Recent advances in computational methods have led to an increase in the number of available structural datasets spanning conformational landscapes. However, extracting representative conformational ensembles from such datasets is not an easy task and many methods have been developed to tackle it. Our new approach, EnGens (short for ensemble generation), collects these methods into a unified framework for generating and analyzing representative protein conformational ensembles. In this work, we: (1) provide an overview of existing methods and tools for representative protein structural ensemble generation and analysis; (2) unify existing approaches in an open-source Python package, and a portable Docker image, providing interactive visualizations within a Jupyter Notebook pipeline; (3) test our pipeline on a few canonical examples from the literature. Representative ensembles produced by EnGens can be used for many downstream tasks such as protein–ligand ensemble docking, Markov state modeling of protein dynamics and analysis of the effect of single-point mutations.
  • Item
    Enabling accurate and early detection of recently emerged SARS-CoV-2 variants of concern in wastewater
    (Springer Nature, 2023) Sapoval, Nicolae; Liu, Yunxi; Lou, Esther G.; Hopkins, Loren; Ensor, Katherine B.; Schneider, Rebecca; Stadler, Lauren B.; Treangen, Todd J.
    As clinical testing declines, wastewater monitoring can provide crucial surveillance on the emergence of SARS-CoV-2 variants of concern (VoCs) in communities. In this paper we present QuaID, a novel bioinformatics tool for VoC detection based on quasi-unique mutations. The benefits of QuaID are three-fold: (i) it provides up to 3-week earlier VoC detection, (ii) it detects VoCs accurately (>95% precision on simulated benchmarks), and (iii) it leverages all mutational signatures (including insertions and deletions).
  • Item
    PepSim: T-cell cross-reactivity prediction via comparison of peptide sequence and peptide-HLA structure
    (Frontiers Media S.A., 2023) Hall-Swan, Sarah; Slone, Jared; Rigo, Mauricio M.; Antunes, Dinler A.; Lizée, Gregory; Kavraki, Lydia E.
    Introduction: Peptide-HLA class I (pHLA) complexes on the surface of tumor cells can be targeted by cytotoxic T-cells to eliminate tumors, and this is one of the bases for T-cell-based immunotherapies. However, there exist cases where therapeutic T-cells directed towards tumor pHLA complexes may also recognize pHLAs from healthy normal cells. The process where the same T-cell clone recognizes more than one pHLA is referred to as T-cell cross-reactivity and this process is driven mainly by features that make pHLAs similar to each other. T-cell cross-reactivity prediction is critical for designing T-cell-based cancer immunotherapies that are both effective and safe. Methods: Here we present PepSim, a novel score to predict T-cell cross-reactivity based on the structural and biochemical similarity of pHLAs. Results and discussion: We show our method can accurately separate cross-reactive from non-crossreactive pHLAs in a diverse set of datasets including cancer, viral, and self-peptides. PepSim can be generalized to work on any dataset of class I peptide-HLAs and is freely available as a web server at
  • Item
    Improved understanding of biorisk for research involving microbial modification using annotated sequences of concern
    (Frontiers Media S.A., 2023) Godbold, Gene D.; Hewitt, F. Curtis; Kappell, Anthony D.; Scholz, Matthew B.; Agar, Stacy L.; Treangen, Todd J.; Ternus, Krista L.; Sandbrink, Jonas B.; Koblentz, Gregory D.
    Regulation of research on microbes that cause disease in humans has historically been focused on taxonomic lists of ‘bad bugs’. However, given our increased knowledge of these pathogens through inexpensive genome sequencing, 5 decades of research in microbial pathogenesis, and the burgeoning capacity of synthetic biologists, the limitations of this approach are apparent. With heightened scientific and public attention focused on biosafety and biosecurity, and an ongoing review by US authorities of dual-use research oversight, this article proposes the incorporation of sequences of concern (SoCs) into the biorisk management regime governing genetic engineering of pathogens. SoCs enable pathogenesis in all microbes infecting hosts that are ‘of concern’ to human civilization. Here we review the functions of SoCs (FunSoCs) and discuss how they might bring clarity to potentially problematic research outcomes involving infectious agents. We believe that annotation of SoCs with FunSoCs has the potential to improve the likelihood that dual use research of concern is recognized by both scientists and regulators before it occurs.
  • Item
    Genome-Wide Analysis of Structural Variants in Parkinson Disease
    (Wiley, 2023) Billingsley, Kimberley J.; Ding, Jinhui; Jerez, Pilar Alvarez; Illarionova, Anastasia; Levine, Kristin; Grenn, Francis P.; Makarious, Mary B.; Moore, Anni; Vitale, Daniel; Reed, Xylena; Hernandez, Dena; Torkamani, Ali; Ryten, Mina; Hardy, John; UK Brain Expression Consortium (UKBEC); Chia, Ruth; Scholz, Sonja W.; Traynor, Bryan J.; Dalgard, Clifton L.; Ehrlich, Debra J.; Tanaka, Toshiko; Ferrucci, Luigi; Beach, Thomas G.; Serrano, Geidy E.; Quinn, John P.; Bubb, Vivien J.; Collins, Ryan L; Zhao, Xuefang; Walker, Mark; Pierce-Hoffman, Emma; Brand, Harrison; Talkowski, Michael E.; Casey, Bradford; Cookson, Mark R; Markham, Androo; Nalls, Mike A.; Mahmoud, Medhat; Sedlazeck, Fritz J; Blauwendraat, Cornelis; Gibbs, J. Raphael; Singleton, Andrew B.
    Objective: Identification of genetic risk factors for Parkinson disease (PD) has to date been primarily limited to the study of single nucleotide variants, which only represent a small fraction of the genetic variation in the human genome. Consequently, causal variants for most PD risk are not known. Here we focused on structural variants (SVs), which represent a major source of genetic variation in the human genome. We aimed to discover SVs associated with PD risk by performing the first large-scale characterization of SVs in PD. Methods: We leveraged a recently developed computational pipeline to detect and genotype SVs from 7,772 Illumina short-read whole genome sequencing samples. Using this set of SVs, we performed a genome-wide association study using 2,585 cases and 2,779 controls and identified SVs associated with PD risk. Furthermore, to validate the presence of these variants, we generated a subset of matched whole-genome long-read sequencing data. Results: We genotyped and tested 3,154 common SVs, representing over 412 million nucleotides of previously uncatalogued genetic variation. Using long-read sequencing data, we validated the presence of three novel deletion SVs that are associated with risk of PD from our initial association analysis, including a 2 kb intronic deletion within the gene LRRN4. Interpretation: We identified three SVs associated with genetic risk of PD. This study represents the most comprehensive assessment of the contribution of SVs to the genetic risk of PD to date. ANN NEUROL 2023;93:1012–1022
  • Item
    Intratumoral Heterogeneity and Clonal Evolution Induced by HPV Integration
    (AACR, 2023) Akagi, Keiko; Symer, David E.; Mahmoud, Medhat; Jiang, Bo; Goodwin, Sara; Wangsa, Darawalee; Li, Zhengke; Xiao, Weihong; Dan Dunn, Joe; Ried, Thomas; Coombes, Kevin R.; Sedlazeck, Fritz J.; Gillison, Maura L.
    The human papillomavirus (HPV) genome is integrated into host DNA in most HPV-positive cancers, but the consequences for chromosomal integrity are unknown. Continuous long-read sequencing of oropharyngeal cancers and cancer cell lines identified a previously undescribed form of structural variation, “heterocateny,” characterized by diverse, interrelated, and repetitive patterns of concatemerized virus and host DNA segments within a cancer. Unique breakpoints shared across structural variants facilitated stepwise reconstruction of their evolution from a common molecular ancestor. This analysis revealed that virus and virus–host concatemers are unstable and, upon insertion into and excision from chromosomes, facilitate capture, amplification, and recombination of host DNA and chromosomal rearrangements. Evidence of heterocateny was detected in extrachromosomal and intrachromosomal DNA. These findings indicate that heterocateny is driven by the dynamic, aberrant replication and recombination of an oncogenic DNA virus, thereby extending known consequences of HPV integration to include promotion of intratumoral heterogeneity and clonal evolution. Long-read sequencing of HPV-positive cancers revealed “heterocateny,” a previously unreported form of genomic structural variation characterized by heterogeneous, interrelated, and repetitive genomic rearrangements within a tumor. Heterocateny is driven by unstable concatemerized HPV genomes, which facilitate capture, rearrangement, and amplification of host DNA, and promotes intratumoral heterogeneity and clonal evolution. See related commentary by McBride and White, p. 814. This article is highlighted in the In This Issue feature, p. 799