Computer Science Publications

Recent Submissions

  • Item
    The GIAB genomic stratifications resource for human reference genomes
    (Springer Nature, 2024) Dwarshuis, Nathan; Kalra, Divya; McDaniel, Jennifer; Sanio, Philippe; Alvarez Jerez, Pilar; Jadhav, Bharati; Huang, Wenyu (Eddy); Mondal, Rajarshi; Busby, Ben; Olson, Nathan D.; Sedlazeck, Fritz J.; Wagner, Justin; Majidian, Sina; Zook, Justin M.
    Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome. Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software. Here we describe a set of “stratifications,” which are BED files that define distinct contexts throughout the genome. We define these for GRCh37/38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions which are critical for understanding performance as the field progresses. Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare the benchmarking performance with each reference and show the performance penalty brought about by these additional difficult regions in CHM13. Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example. The means to generate these stratifications are available as a snakemake pipeline at https://github.com/usnistgov/giab-stratifications. We anticipate this being useful in enabling precise risk-reward calculations when building sequencing pipelines for any of the commonly-used reference genomes.
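    As a rough illustration of how BED-format stratifications can be applied, the minimal sketch below labels a genomic position with every stratification whose intervals contain it. The interval data, the stratification names, and the helper `strata_for_position` are hypothetical and are not taken from the GIAB resource; the only convention assumed is that BED coordinates are 0-based and half-open.

```python
from typing import Dict, List, Tuple

# (chrom, start, end) with 0-based, half-open coordinates, as in BED.
Interval = Tuple[str, int, int]


def parse_bed(text: str) -> List[Interval]:
    """Parse the first three columns of BED-formatted text."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def strata_for_position(
    strata: Dict[str, List[Interval]], chrom: str, pos: int
) -> List[str]:
    """Return the sorted names of stratifications containing a 0-based position."""
    return sorted(
        name
        for name, intervals in strata.items()
        if any(c == chrom and s <= pos < e for c, s, e in intervals)
    )


# Made-up intervals standing in for real stratification files.
strata = {
    "hard_to_map": parse_bed("chr1\t100\t200\nchr1\t500\t650\n"),
    "gc_rich": parse_bed("chr1\t150\t300\n"),
}

print(strata_for_position(strata, "chr1", 160))
```

    A linear scan is enough for a sketch; genome-scale files would normally call for an interval index or a dedicated BED toolkit.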
  • Item
    Single-cell somatic copy number variants in brain using different amplification methods and reference genomes
    (Springer Nature, 2024) Kalef-Ezra, Ester; Turan, Zeliha Gozde; Perez-Rodriguez, Diego; Bomann, Ida; Behera, Sairam; Morley, Caoimhe; Scholz, Sonja W.; Jaunmuktane, Zane; Demeulemeester, Jonas; Sedlazeck, Fritz J.; Proukakis, Christos
    The presence of somatic mutations, including copy number variants (CNVs), in the brain is well recognized. Comprehensive study requires single-cell whole genome amplification, with several methods available, prior to sequencing. Here we compare PicoPLEX with two recent adaptations of multiple displacement amplification (MDA): primary template-directed amplification (PTA) and droplet MDA, across 93 human brain cortical nuclei. We demonstrate different properties for each, with PTA providing the broadest amplification, PicoPLEX the most even, and distinct chimeric profiles. Furthermore, we perform CNV calling on two brains with multiple system atrophy and one control brain using different reference genomes. We find that 20.6% of brain cells have at least one Mb-scale CNV, with some supported by bulk sequencing or single-cells from other brain regions. Our study highlights the importance of selecting whole genome amplification method and reference genome for CNV calling, while supporting the existence of somatic CNVs in healthy and diseased human brain.
  • Item
    StratoMod: predicting sequencing and variant calling errors with interpretable machine learning
    (Springer Nature, 2024) Dwarshuis, Nathan; Tonner, Peter; Olson, Nathan D.; Sedlazeck, Fritz J.; Wagner, Justin; Zook, Justin M.
    Despite the variety in sequencing platforms, mappers, and variant callers, no single pipeline is optimal across the entire human genome. Therefore, developers, clinicians, and researchers need to make tradeoffs when designing pipelines for their application. Currently, assessing such tradeoffs relies on intuition about how a certain pipeline will perform in a given genomic context. We present StratoMod, which addresses this problem using an interpretable machine-learning classifier to predict germline variant calling errors in a data-driven manner. We show StratoMod can precisely predict recall using HiFi or Illumina and leverage StratoMod’s interpretability to measure contributions from difficult-to-map and homopolymer regions for each respective outcome. Furthermore, we use StratoMod to assess the effect of mismapping on predicted recall using linear vs. graph-based references, and identify the hard-to-map regions where graph-based methods excelled and by how much. For these we utilize our draft benchmark based on the Q100 HG002 assembly, which contains previously-inaccessible difficult regions. Furthermore, StratoMod presents a new method of predicting clinically relevant variants likely to be missed, which is an improvement over current pipelines which only filter variants likely to be false. We anticipate this being useful for performing precise risk-reward analyses when designing variant calling pipelines.
  • Item
    A best-match approach for gene set analyses in embedding spaces
    (Cold Spring Harbor Laboratory Press, 2024) Li, Lechuan; Dannenfelser, Ruth; Cruz, Charlie; Yao, Vicky
    Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine-learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose an Algorithm for Network Data Embedding and Similarity (ANDES), a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein–protein interactions, can be used as a novel overrepresentation- and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multiorganism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.
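    The general idea of a best-match comparison between gene sets can be sketched as follows: each gene in one set is paired with its most similar gene in the other set, and the best-match similarities are averaged in both directions. This is a simplified illustration under stated assumptions, not the published ANDES procedure; the cosine similarity, the symmetric averaging, the toy embeddings, and the function names are all assumptions, and any statistical normalization the method applies is omitted.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def best_match_score(
    set_a: List[str], set_b: List[str], emb: Dict[str, Sequence[float]]
) -> float:
    """Average best-match similarity between two gene sets, in both directions."""

    def directed(src: List[str], dst: List[str]) -> float:
        # For each source gene, keep only its single best match in the other set.
        return sum(max(cosine(emb[g], emb[h]) for h in dst) for g in src) / len(src)

    return 0.5 * (directed(set_a, set_b) + directed(set_b, set_a))


# Toy two-dimensional embeddings (hypothetical values).
emb = {
    "g1": [1.0, 0.0],
    "g2": [0.0, 1.0],
    "g3": [1.0, 0.0],
    "g4": [0.0, 1.0],
}

print(best_match_score(["g1", "g2"], ["g3", "g4"], emb))
```

    Because only the best match per gene contributes, a set containing several functionally distinct groups can still score highly against a set that matches each group separately, which is one way to read "reconciling gene set diversity".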
  • Item
    Association of education attainment, smoking status, and alcohol use disorder with dementia risk in older adults: a longitudinal observational study
    (Springer Nature, 2024) Tang, Huilin; Shaaban, C. Elizabeth; DeKosky, Steven T.; Smith, Glenn E.; Hu, Xia; Jaffee, Michael; Salloum, Ramzi G.; Bian, Jiang; Guo, Jingchuan
    Previous research on the risk of dementia associated with education attainment, smoking status, and alcohol use disorder (AUD) has yielded inconsistent results, indicating potential heterogeneous treatment effects (HTEs) of these factors on dementia risk. Thus, this study aimed to identify the important variables that may contribute to HTEs of these factors in older adults.
  • Item
    Sampling-Based Motion Planning: A Comparative Review
    (Annual Reviews, 2024) Orthey, Andreas; Chamzas, Constantinos; Kavraki, Lydia E.
    Sampling-based motion planning is one of the fundamental paradigms to generate robot motions, and a cornerstone of robotics research. This comparative review provides an up-to-date guide and reference manual for the use of sampling-based motion planning algorithms. It includes a history of motion planning, an overview of the most successful planners, and a discussion of their properties. It also shows how planners can handle special cases and how extensions of motion planning can be accommodated. To put sampling-based motion planning into a larger context, a discussion of alternative motion generation frameworks highlights their respective differences from sampling-based motion planning. Finally, a set of sampling-based motion planners are compared on 24 challenging planning problems in order to provide insights into which planners perform well in which situations and where future research would be required. This comparative review thereby provides not only a useful reference manual for researchers in the field but also a guide for practitioners to make informed algorithmic decisions.
  • Item
    Singly exponential translation of alternating weak Büchi automata to unambiguous Büchi automata
    (Elsevier, 2024) Li, Yong; Schewe, Sven; Vardi, Moshe Y.
    We introduce a method for translating an alternating weak Büchi automaton (AWA), which corresponds to a Linear Dynamic Logic (LDL) formula, to an unambiguous Büchi automaton (UBA). Our translations generalize constructions for Linear Temporal Logic (LTL), a less expressive specification language than LDL. In classical constructions, LTL formulas are first translated to alternating very weak Büchi automata (AVAs), which are automata that have only singleton strongly connected components (SCCs); these AVAs are then handled by efficient disambiguation procedures. However, general AWAs can have larger SCCs, which complicates disambiguation. Currently, the only available disambiguation procedure has to go through an intermediate construction of nondeterministic Büchi automata (NBAs), which would incur an exponential blow-up of its own. We introduce a translation from general AWAs to UBAs with a singly exponential blow-up, which also immediately provides a singly exponential translation from LDL to UBAs. Interestingly, the complexity of our translation is smaller than the best known disambiguation algorithm for NBAs (broadly (0.53n)^n vs. (0.76n)^n), while the input of our construction can be exponentially more succinct.
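    As a quick worked comparison of the two bounds quoted in the abstract, and treating the "broadly" stated expressions as exact purely for illustration, their ratio is:

```latex
\[
  \frac{(0.76\,n)^{n}}{(0.53\,n)^{n}}
  = \left(\frac{0.76}{0.53}\right)^{n}
  \approx 1.43^{\,n}
\]
```

    Under that reading, the gap between the two bounds itself grows exponentially in n.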
  • Item
    Machine Learning to Enhance Electronic Detection of Diagnostic Errors
    (American Medical Association, 2024) Zimolzak, Andrew J.; Wei, Li; Mir, Usman; Gupta, Ashish; Vaghani, Viralkumar; Subramanian, Devika; Singh, Hardeep
  • Item
    Impact and characterization of serial structural variations across humans and great apes
    (Springer Nature, 2024) Höps, Wolfram; Rausch, Tobias; Jendrusch, Michael; Korbel, Jan O.; Sedlazeck, Fritz J.
    Modern sequencing technology enables the systematic detection of complex structural variation (SV) across genomes. However, extensive DNA rearrangements arising through a series of mutations, a phenomenon we refer to as serial SV (sSV), remain underexplored, posing a challenge for SV discovery. Here, we present NAHRwhals (https://github.com/WHops/NAHRwhals), a method to infer repeat-mediated series of SVs in long-read genomic assemblies. Applying NAHRwhals to haplotype-resolved human genomes from 28 individuals reveals 37 sSV loci of various length and complexity. These sSVs explain otherwise cryptic variation in medically relevant regions such as the TPSAB1 gene, 8p23.1, 22q11 and Sotos syndrome regions. Comparisons with great ape assemblies indicate that most human sSVs formed recently, after the human-ape split, and involved non-repeat-mediated processes in addition to non-allelic homologous recombination. NAHRwhals reliably discovers and characterizes sSVs at scale and independent of species, uncovering their genomic abundance and suggesting broader implications for disease.
  • Item
    Polyphest: fast polyploid phylogeny estimation
    (Oxford University Press, 2024) Yan, Zhi; Cao, Zhen; Nakhleh, Luay
    Despite the widespread occurrence of polyploids across the Tree of Life, especially in the plant kingdom, very few computational methods have been developed to handle the specific complexities introduced by polyploids in phylogeny estimation. Furthermore, methods that are designed to account for polyploidy often disregard incomplete lineage sorting (ILS), a major source of heterogeneous gene histories, or are computationally very demanding. Therefore, there is a great need for efficient and robust methods to accurately reconstruct polyploid phylogenies. We introduce Polyphest (POLYploid PHylogeny ESTimation), a new method for efficiently and accurately inferring species phylogenies in the presence of both polyploidy and ILS. Polyphest bypasses the need for extensive network space searches by first generating a multilabeled tree based on gene trees, which is then converted into a (uniquely labeled) species phylogeny. We compare the performance of Polyphest to that of two polyploid phylogeny estimation methods, one of which does not account for ILS, namely PADRE, and another that accounts for ILS, namely MPAllopp. Polyphest is more accurate than PADRE and achieves comparable accuracy to MPAllopp, while being significantly faster. We also demonstrate the application of Polyphest to empirical data from the hexaploid bread wheat and confirm the allopolyploid origin of bread wheat along with the closest relatives for each of its subgenomes. Polyphest is available at https://github.com/NakhlehLab/Polyphest.
  • Item
    Reference-free structural variant detection in microbiomes via long-read co-assembly graphs
    (Oxford University Press, 2024) Curry, Kristen D; Yu, Feiqiao Brian; Vance, Summer E; Segarra, Santiago; Bhaya, Devaki; Chikhi, Rayan; Rocha, Eduardo P C; Treangen, Todd J
    Motivation: The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing all metagenomic samples in a series (time or other metric) into a single co-assembly graph. The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining. Results: We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between successive time and temperature samples, suggesting host advantage. Our approach leverages previous work in assembly graph structural and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial gene flux. Availability and implementation: rhea is open source and available at: https://github.com/treangenlab/rhea.
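    The log fold change step described in the abstract can be sketched as below. This is an illustration only and not rhea's implementation: the per-node coverage table, the base-2 logarithm, the pseudocount, the threshold, and the labels are all assumptions chosen for the example.

```python
import math
from typing import Dict, List


def log_fold_changes(
    coverage: Dict[str, List[float]], pseudocount: float = 1.0
) -> Dict[str, List[float]]:
    """Log2 fold change in coverage between each pair of successive samples.

    `coverage` maps a graph node (hypothetical IDs) to its coverage in each
    sample, ordered along the series. The pseudocount avoids division by zero.
    """
    return {
        node: [
            math.log2((after + pseudocount) / (before + pseudocount))
            for before, after in zip(series, series[1:])
        ]
        for node, series in coverage.items()
    }


def label_changes(
    lfc: Dict[str, List[float]], threshold: float = 1.0
) -> Dict[str, List[str]]:
    """Label each successive-sample change as increasing, decreasing, or stable."""

    def label(change: float) -> str:
        if change >= threshold:
            return "increasing"
        if change <= -threshold:
            return "decreasing"
        return "stable"

    return {node: [label(c) for c in changes] for node, changes in lfc.items()}


# Made-up coverages for two nodes across three successive samples.
coverage = {"node_a": [3.0, 7.0, 15.0], "node_b": [15.0, 7.0, 7.0]}
lfc = log_fold_changes(coverage)
print(lfc)
print(label_changes(lfc))
```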
  • Item
    CrysFormer: Protein structure determination via Patterson maps, deep learning, and partial structure attention
    (AIP Publishing LLC, 2024) Pan, Tom; Dun, Chen; Jin, Shikai; Miller, Mitchell D.; Kyrillidis, Anastasios; Phillips, George N., Jr.
    Determining the atomic-level structure of a protein has been a decades-long challenge. However, recent advances in transformers and related neural network architectures have enabled researchers to significantly improve solutions to this problem. These methods use large datasets of sequence information and corresponding known protein template structures, if available. Yet, such methods only focus on sequence information. Other available prior knowledge could also be utilized, such as constructs derived from x-ray crystallography experiments and the known structures of the most common conformations of amino acid residues, which we refer to as partial structures. To the best of our knowledge, we propose the first transformer-based model that directly utilizes experimental protein crystallographic data and partial structure information to calculate electron density maps of proteins. In particular, we use Patterson maps, which can be directly obtained from x-ray crystallography experimental data, thus bypassing the well-known crystallographic phase problem. We demonstrate that our method, CrysFormer, achieves precise predictions on two synthetic datasets of peptide fragments in crystalline forms, one with two residues per unit cell and the other with fifteen. These predictions can then be used to generate accurate atomic models using established crystallographic refinement programs.
  • Item
    Profiling complex repeat expansions in RFC1 in Parkinson’s disease
    (Springer Nature, 2024) Alvarez Jerez, Pilar; Daida, Kensuke; Miano-Burkhardt, Abigail; Iwaki, Hirotaka; Malik, Laksh; Cogan, Guillaume; Makarious, Mary B.; Sullivan, Roisin; Vandrovcova, Jana; Ding, Jinhui; Gibbs, J. Raphael; Markham, Androo; Nalls, Mike A.; Kesharwani, Rupesh K.; Sedlazeck, Fritz J.; Casey, Bradford; Hardy, John; Houlden, Henry; Blauwendraat, Cornelis; Singleton, Andrew B.; Billingsley, Kimberley J.
    A biallelic (AAGGG) expansion in the poly(A) tail of an AluSx3 transposable element within the gene RFC1 is a frequent cause of cerebellar ataxia, neuropathy, vestibular areflexia syndrome (CANVAS), and more recently, has been reported as a rare cause of Parkinson’s disease (PD) in the Finnish population. Here, we investigate the prevalence of RFC1 (AAGGG) expansions in PD patients of non-Finnish European ancestry in 1609 individuals from the Parkinson’s Progression Markers Initiative study. We identified four PD patients carrying the biallelic RFC1 (AAGGG) expansion and did not identify any carriers in controls.
  • Item
    A Machine Learning Model for Risk Stratification of Postdiagnosis Diabetic Ketoacidosis Hospitalization in Pediatric Type 1 Diabetes: Retrospective Study
    (JMIR, 2024) Subramanian, Devika; Sonabend, Rona; Singh, Ila
    Background: Diabetic ketoacidosis (DKA) is the leading cause of morbidity and mortality in pediatric type 1 diabetes (T1D), occurring in approximately 20% of patients, with an economic cost of $5.1 billion/year in the United States. Despite multiple risk factors for postdiagnosis DKA, there is still a need for explainable, clinic-ready models that accurately predict DKA hospitalization in established patients with pediatric T1D. Objective: We aimed to develop an interpretable machine learning model to predict the risk of postdiagnosis DKA hospitalization in children with T1D using routinely collected time-series of electronic health record (EHR) data. Methods: We conducted a retrospective case-control study using EHR data from 1787 patients from among 3794 patients with T1D treated at a large tertiary care US pediatric health system from January 2010 to June 2018. We trained a state-of-the-art, explainable, gradient-boosted ensemble (XGBoost) of decision trees with 44 regularly collected EHR features to predict postdiagnosis DKA. We measured the model’s predictive performance using the area under the receiver operating characteristic curve, weighted F1-score, weighted precision, and recall, in a 5-fold cross-validation setting. We analyzed Shapley values to interpret the learned model and gain insight into its predictions. Results: Our model distinguished the cohort that develops DKA postdiagnosis from the one that does not (P<.001). It predicted postdiagnosis DKA risk with an area under the receiver operating characteristic curve of 0.80 (SD 0.04), a weighted F1-score of 0.78 (SD 0.04), and a weighted precision and recall of 0.83 (SD 0.03) and 0.76 (SD 0.05) respectively, using a relatively short history of data from routine clinic follow-ups post diagnosis. On analyzing Shapley values of the model output, we identified key risk factors predicting postdiagnosis DKA both at the cohort and individual levels. We observed sharp changes in postdiagnosis DKA risk with respect to 2 key features (diabetes age and glycated hemoglobin at 12 months), yielding time intervals and glycated hemoglobin cutoffs for potential intervention. By clustering model-generated Shapley values, we automatically stratified the cohort into 3 groups with 5%, 20%, and 48% risk of postdiagnosis DKA. Conclusions: We have built an explainable, predictive, machine learning model with potential for integration into clinical workflow. The model risk-stratifies patients with pediatric T1D and identifies patients with the highest postdiagnosis DKA risk using limited follow-up data starting from the time of diagnosis. The model identifies key time points and risk factors to direct clinical interventions at both the individual and cohort levels. Further research with data from multiple hospital systems can help us assess how well our model generalizes to other populations. The clinical importance of our work is that the model can predict patients most at risk for postdiagnosis DKA and identify preventive interventions based on mitigation of individualized risk factors.
  • Item
    MethPhaser: methylation-based long-read haplotype phasing of human genomes
    (Springer Nature, 2024) Fu, Yilei; Aganezov, Sergey; Mahmoud, Medhat; Beaulaurier, John; Juul, Sissel; Treangen, Todd J.; Sedlazeck, Fritz J.
    The assignment of variants across haplotypes, phasing, is crucial for predicting the consequences, interaction, and inheritance of mutations and is a key step in improving our understanding of phenotype and disease. However, phasing is limited by read length and stretches of homozygosity along the genome. To overcome this limitation, we designed MethPhaser, a method that utilizes methylation signals from Oxford Nanopore Technologies to extend Single Nucleotide Variation (SNV)-based phasing. We demonstrate that haplotype-specific methylations extensively exist in human genomes and that the advent of long-read technologies has enabled direct reporting of methylation signals. For ONT R9 and R10 cell line data, we increase the phase length N50 by 78%-151% at a phasing accuracy of 83.4-98.7%. To assess the impact of tissue purity and random methylation signals due to inactivation, we also applied MethPhaser on blood samples from 4 patients, still showing improvements over SNV-only phasing. MethPhaser further improves phasing across HLA and multiple other medically relevant genes, improving our understanding of how mutations interact across multiple phenotypes. The concept of MethPhaser can also be extended to non-human diploid genomes. MethPhaser is available at https://github.com/treangenlab/methphaser.
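    For readers unfamiliar with the metric, the phase length N50 reported above can be computed as in the following sketch: it is the block length at which blocks of that length or longer cover at least half of the total phased length. The block lengths are made up, and the merge shown is only a stand-in for the effect of extending phase blocks; it does not reproduce MethPhaser itself.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that blocks of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0  # No blocks given.


# Hypothetical phase block lengths before and after two blocks are joined.
before = [100, 100, 100, 100]
after = [200, 100, 100]

increase = 100.0 * (n50(after) - n50(before)) / n50(before)
print(n50(before), n50(after), increase)
```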
  • Item
    Olivar: towards automated variant aware primer design for multiplex tiled amplicon sequencing of pathogens
    (Springer Nature, 2024) Wang, Michael X.; Lou, Esther G.; Sapoval, Nicolae; Kim, Eddie; Kalvapalle, Prashant; Kille, Bryce; Elworth, R. A. Leo; Liu, Yunxi; Fu, Yilei; Stadler, Lauren B.; Treangen, Todd J.
    Tiled amplicon sequencing has served as an essential tool for tracking the spread and evolution of pathogens. Over 15 million complete SARS-CoV-2 genomes are now publicly available, most sequenced and assembled via tiled amplicon sequencing. While computational tools for tiled amplicon design exist, they require downstream manual optimization both computationally and experimentally, which is slow and costly. Here we present Olivar, a first step towards a fully automated, variant-aware design of tiled amplicons for pathogen genomes. Olivar converts each nucleotide of the target genome into a numeric risk score, capturing undesired sequence features that should be avoided. In a direct comparison with PrimalScheme, we show that Olivar has fewer mismatches overlapping with primers and fewer predicted PCR byproducts. We also compare Olivar head-to-head with ARTIC v4.1, the most widely used primer set for SARS-CoV-2 sequencing, and show that Olivar yields read mapping rates similar to (~90%), and coverage better than, the manually designed ARTIC v4.1 amplicons. We also evaluate Olivar on real wastewater samples and find that Olivar has up to 3-fold higher mapping rates while retaining similar coverage. In summary, Olivar automates and accelerates the generation of tiled amplicons, even in situations of high mutation frequency and/or density. Olivar is available online as a web application at https://olivar.rice.edu and can be installed locally as a command line tool with Bioconda. Source code, installation guide, and usage are available at https://github.com/treangenlab/Olivar.
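    One way a per-nucleotide risk score could be put to use is sketched below: a sliding window finds the primer-length stretch with the lowest total risk. This is a hypothetical illustration of the concept, not Olivar's actual scoring or optimization; the risk values, the window width, and the function name are assumptions.

```python
from typing import List, Tuple


def lowest_risk_window(risk: List[int], width: int) -> Tuple[int, int]:
    """Return (start, total_risk) of the lowest-risk window of the given width.

    Ties are resolved in favor of the leftmost window.
    """
    if width <= 0 or width > len(risk):
        raise ValueError("width must be between 1 and len(risk)")
    window_sum = sum(risk[:width])
    best_start, best_sum = 0, window_sum
    for start in range(1, len(risk) - width + 1):
        # Slide by one base: add the entering base, drop the leaving base.
        window_sum += risk[start + width - 1] - risk[start - 1]
        if window_sum < best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_sum


# Made-up per-base risk scores (higher means less desirable for a primer).
risk = [5, 1, 1, 4, 0, 0, 0, 3]
print(lowest_risk_window(risk, 3))
```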
  • Item
    Detection of diffusely abnormal white matter in multiple sclerosis on multiparametric brain MRI using semi-supervised deep learning
    (Springer Nature, 2024) Musall, Benjamin C.; Gabr, Refaat E.; Yang, Yanyu; Kamali, Arash; Lincoln, John A.; Jacobs, Michael A.; Ly, Vi; Luo, Xi; Wolinsky, Jerry S.; Narayana, Ponnada A.; Hasan, Khader M.
    In addition to focal lesions, diffusely abnormal white matter (DAWM) is seen on brain MRI of multiple sclerosis (MS) patients and may represent early or distinct disease processes. The role of MRI-observed DAWM is understudied due to a lack of automated assessment methods. Supervised deep learning (DL) methods are highly capable in this domain, but require large sets of labeled data. To overcome this challenge, a DL-based network (DAWM-Net) was trained using semi-supervised learning on a limited set of labeled data for segmentation of DAWM, focal lesions, and normal-appearing brain tissues on multiparametric MRI. DAWM-Net segmentation performance was compared to a previous intensity thresholding-based method on an independent test set from expert consensus (N = 25). Segmentation overlap by Dice Similarity Coefficient (DSC) and Spearman correlation of DAWM volumes were assessed. DAWM-Net showed DSC > 0.93 for normal-appearing brain tissues and DSC > 0.81 for focal lesions. For DAWM-Net, the DAWM DSC was 0.49 ± 0.12 with a moderate volume correlation (ρ = 0.52, p < 0.01). The previous method showed lower DAWM DSC of 0.26 ± 0.08 and lacked a significant volume correlation (ρ = 0.23, p = 0.27). These results demonstrate the feasibility of DL-based DAWM auto-segmentation with semi-supervised learning. This tool may facilitate future investigation of the role of DAWM in MS.
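    The Dice Similarity Coefficient used above is defined for two segmentations A and B as 2|A ∩ B| / (|A| + |B|). A minimal sketch follows, representing each segmentation as a set of voxel indices; the voxel sets are made up, and returning 1.0 when both segmentations are empty is a convention chosen here rather than something stated in the abstract.

```python
from typing import Hashable, Iterable


def dice(pred: Iterable[Hashable], truth: Iterable[Hashable]) -> float:
    """Dice Similarity Coefficient between two sets of labeled voxels."""
    pred_set, truth_set = set(pred), set(truth)
    if not pred_set and not truth_set:
        return 1.0  # Assumed convention: two empty segmentations agree fully.
    overlap = len(pred_set & truth_set)
    return 2 * overlap / (len(pred_set) + len(truth_set))


# Hypothetical voxel indices labeled as DAWM by a model and by experts.
predicted = {1, 2, 3}
expert = {2, 3, 4}
print(dice(predicted, expert))
```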
  • Item
    Characterizing a complex CT-rich haplotype in intron 4 of SNCA using large-scale targeted amplicon long-read sequencing
    (Springer Nature, 2024) Alvarez Jerez, Pilar; Daida, Kensuke; Grenn, Francis P.; Malik, Laksh; Miano-Burkhardt, Abigail; Makarious, Mary B.; Ding, Jinhui; Gibbs, J. Raphael; Moore, Anni; Reed, Xylena; Nalls, Mike A.; Shah, Syed; Mahmoud, Medhat; Sedlazeck, Fritz J.; Dolzhenko, Egor; Park, Morgan; Iwaki, Hirotaka; Casey, Bradford; Ryten, Mina; Blauwendraat, Cornelis; Singleton, Andrew B.; Billingsley, Kimberley J.
    Parkinson’s disease (PD) is a common neurodegenerative disorder with a significant risk proportion driven by genetics. While much progress has been made, most of the heritability remains unknown. This is in part because previous genetic studies have focused on the contribution of single nucleotide variants. More complex forms of variation, such as structural variants and tandem repeats, are already associated with several synucleinopathies. However, because more sophisticated sequencing methods are usually required to detect these regions, little is understood regarding their contribution to PD. One example is a polymorphic CT-rich region in intron 4 of the SNCA gene. This haplotype has been suggested to be associated with risk of Lewy Body (LB) pathology in Alzheimer’s Disease and SNCA gene expression, but is yet to be investigated in PD. Here, we attempt to resolve this CT-rich haplotype and investigate its role in PD. We performed targeted PacBio HiFi sequencing of the region in 1375 PD cases and 959 controls. We replicate the previously reported associations and a novel association between two PD risk SNVs (rs356182 and rs5019538) and haplotype 4, the largest haplotype. Through quantitative trait locus analyses we identify a significant haplotype 4 association with alternative CAGE transcriptional start site usage, not leading to significant differential SNCA gene expression in post-mortem frontal cortex brain tissue. Therefore, disease association in this locus might not be biologically driven by this CT-rich repeat region. Our data demonstrate the complexity of this SNCA region and highlight that further follow-up functional studies are warranted.
  • Item
    Inverted triplications formed by iterative template switches generate structural variant diversity at genomic disorder loci
    (Elsevier, 2024) Grochowski, Christopher M.; Bengtsson, Jesse D.; Du, Haowei; Gandhi, Mira; Lun, Ming Yin; Mehaffey, Michele G.; Park, KyungHee; Höps, Wolfram; Benito, Eva; Hasenfeld, Patrick; Korbel, Jan O.; Mahmoud, Medhat; Paulin, Luis F.; Jhangiani, Shalini N.; Hwang, James Paul; Bhamidipati, Sravya V.; Muzny, Donna M.; Fatih, Jawid M.; Gibbs, Richard A.; Pendleton, Matthew; Harrington, Eoghan; Juul, Sissel; Lindstrand, Anna; Sedlazeck, Fritz J.; Pehlivan, Davut; Lupski, James R.; Carvalho, Claudia M. B.
    The duplication-triplication/inverted-duplication (DUP-TRP/INV-DUP) structure is a complex genomic rearrangement (CGR). Although it has been identified as an important pathogenic DNA mutation signature in genomic disorders and cancer genomes, its architecture remains unresolved. Here, we studied the genomic architecture of DUP-TRP/INV-DUP by investigating the DNA of 24 patients identified by array comparative genomic hybridization (aCGH) on whom we found evidence for the existence of 4 out of 4 predicted structural variant (SV) haplotypes. Using a combination of short-read genome sequencing (GS), long-read GS, optical genome mapping, and single-cell DNA template strand sequencing (strand-seq), the haplotype structure was resolved in 18 samples. The point of template switching in 4 samples was shown to be a segment of ∼2.2–5.5 kb of 100% nucleotide similarity within inverted repeat pairs. These data provide experimental evidence that inverted low-copy repeats act as recombinant substrates. This type of CGR can result in multiple conformers generating diverse SV haplotypes in susceptible dosage-sensitive loci.
  • Item
    Crykey: Rapid identification of SARS-CoV-2 cryptic mutations in wastewater
    (Springer Nature, 2024) Liu, Yunxi; Sapoval, Nicolae; Gallego-García, Pilar; Tomás, Laura; Posada, David; Treangen, Todd J.; Stadler, Lauren B.
    Wastewater surveillance for SARS-CoV-2 provides early warnings of emerging variants of concern and can be used to screen for novel cryptic linked-read mutations, which are co-occurring single nucleotide mutations that are rare, or entirely missing, in existing SARS-CoV-2 databases. While previous approaches have focused on specific regions of the SARS-CoV-2 genome, there is a need for computational tools capable of efficiently tracking cryptic mutations across the entire genome and investigating their potential origin. We present Crykey, a tool for rapidly identifying rare linked-read mutations across the genome of SARS-CoV-2. We evaluated the utility of Crykey on over 3,000 wastewater and over 22,000 clinical samples; our findings are three-fold: i) we identify hundreds of cryptic mutations that cover the entire SARS-CoV-2 genome, ii) we track the presence of these cryptic mutations across multiple wastewater treatment plants and over three years of sampling in Houston, and iii) we find a handful of cryptic mutations in wastewater mirror cryptic mutations in clinical samples and investigate their potential to represent real cryptic lineages. In summary, Crykey enables large-scale detection of cryptic mutations in wastewater that represent potential circulating cryptic lineages, serving as a new computational tool for wastewater surveillance of SARS-CoV-2.