Browsing by Author "Sedlazeck, Fritz J."
Now showing 1 - 17 of 17
Item Characterizing a complex CT-rich haplotype in intron 4 of SNCA using large-scale targeted amplicon long-read sequencing (Springer Nature, 2024)
Alvarez Jerez, Pilar; Daida, Kensuke; Grenn, Francis P.; Malik, Laksh; Miano-Burkhardt, Abigail; Makarious, Mary B.; Ding, Jinhui; Gibbs, J. Raphael; Moore, Anni; Reed, Xylena; Nalls, Mike A.; Shah, Syed; Mahmoud, Medhat; Sedlazeck, Fritz J.; Dolzhenko, Egor; Park, Morgan; Iwaki, Hirotaka; Casey, Bradford; Ryten, Mina; Blauwendraat, Cornelis; Singleton, Andrew B.; Billingsley, Kimberley J.
Parkinson’s disease (PD) is a common neurodegenerative disorder with a significant risk proportion driven by genetics. While much progress has been made, most of the heritability remains unknown. This is in part because previous genetic studies have focused on the contribution of single nucleotide variants. More complex forms of variation, such as structural variants and tandem repeats, are already associated with several synucleinopathies. However, because more sophisticated sequencing methods are usually required to detect these regions, little is understood regarding their contribution to PD. One example is a polymorphic CT-rich region in intron 4 of the SNCA gene. This haplotype has been suggested to be associated with risk of Lewy Body (LB) pathology in Alzheimer’s Disease and SNCA gene expression, but is yet to be investigated in PD. Here, we attempt to resolve this CT-rich haplotype and investigate its role in PD. We performed targeted PacBio HiFi sequencing of the region in 1375 PD cases and 959 controls. We replicate the previously reported associations and a novel association between two PD risk SNVs (rs356182 and rs5019538) and haplotype 4, the largest haplotype. Through quantitative trait locus analyses we identify a significant haplotype 4 association with alternative CAGE transcriptional start site usage, not leading to significant differential SNCA gene expression in post-mortem frontal cortex brain tissue.
Therefore, disease association in this locus might not be biologically driven by this CT-rich repeat region. Our data demonstrate the complexity of this SNCA region and highlight that further follow-up functional studies are warranted.

Item FixItFelix: improving genomic analysis by fixing reference errors (Springer Nature, 2023)
Behera, Sairam; LeFaive, Jonathon; Orchard, Peter; Mahmoud, Medhat; Paulin, Luis F.; Farek, Jesse; Soto, Daniela C.; Parker, Stephen C. J.; Smith, Albert V.; Dennis, Megan Y.; Zook, Justin M.; Sedlazeck, Fritz J.
The current version of the human reference genome, GRCh38, contains a number of errors including 1.2 Mbp of falsely duplicated and 8.04 Mbp of collapsed regions. These errors impact the variant calling of 33 protein-coding genes, including 12 with medical relevance. Here, we present FixItFelix, an efficient remapping approach, together with a modified version of the GRCh38 reference genome that improves the subsequent analysis across these genes within minutes for an existing alignment file while maintaining the same coordinates. We showcase these improvements over multi-ethnic control samples, demonstrating improvements for population variant calling as well as eQTL studies.

Item Genomic variant benchmark: if you cannot measure it, you cannot improve it (Springer Nature, 2023)
Majidian, Sina; Agustinho, Daniel Paiva; Chin, Chen-Shan; Sedlazeck, Fritz J.; Mahmoud, Medhat
Genomic benchmark datasets are essential to driving the field of genomics and bioinformatics. They provide a snapshot of the performances of sequencing technologies and analytical methods and highlight future challenges. However, they depend on sequencing technology, reference genome, and available benchmarking methods. Thus, creating a genomic benchmark dataset is laborious and highly challenging, often involving multiple sequencing technologies, different variant calling tools, and laborious manual curation.
In this review, we discuss the available benchmark datasets and their utility. Additionally, we focus on the most recent benchmark of genes with medical relevance and challenging genomic complexity.

Item Germline structural variation globally impacts the cancer transcriptome including disease-relevant genes (Elsevier, 2024)
Chen, Fengju; Zhang, Yiqun; Sedlazeck, Fritz J.; Creighton, Chad J.
Germline variation and somatic alterations contribute to the molecular profile of cancers. We combine RNA with whole genome sequencing across 1,218 cancer patients to determine the extent to which germline structural variants (SVs) impact expression of nearby genes. For hundreds of genes, recurrent and common germline SV breakpoints within 100 kb associate with increased or decreased expression in tumors spanning various tissues of origin. A significant fraction of germline SV expression associations involves duplication of intergenic enhancers or 3′ UTR disruption. Genes altered by both somatic and germline SVs include ATRX and CEBPA. Genes essential in cancer cell lines include BARD1 and IRS2. Genes with both expression and germline SV breakpoint patterns associated with patient survival include GCLM. Our results capture a class of phenotypic variation at work in the disease setting, including genes with cancer roles. Specific germline SVs represent potential cancer risk variants for genetic testing, including those involving genes with targeting implications.

Item The GIAB genomic stratifications resource for human reference genomes (Springer Nature, 2024)
Dwarshuis, Nathan; Kalra, Divya; McDaniel, Jennifer; Sanio, Philippe; Alvarez Jerez, Pilar; Jadhav, Bharati; Huang, Wenyu (Eddy); Mondal, Rajarshi; Busby, Ben; Olson, Nathan D.; Sedlazeck, Fritz J.; Wagner, Justin; Majidian, Sina; Zook, Justin M.
Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome.
Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software. Here we describe a set of “stratifications,” which are BED files that define distinct contexts throughout the genome. We define these for GRCh37/38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions which are critical for understanding performance as the field progresses. Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare the benchmarking performance with each reference and show the performance penalty brought about by these additional difficult regions in CHM13. Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example. The means to generate these stratifications are available as a Snakemake pipeline at https://github.com/usnistgov/giab-stratifications. We anticipate this being useful in enabling precise risk-reward calculations when building sequencing pipelines for any of the commonly-used reference genomes.

Item Impact and characterization of serial structural variations across humans and great apes (Springer Nature, 2024)
Höps, Wolfram; Rausch, Tobias; Jendrusch, Michael; Korbel, Jan O.; Sedlazeck, Fritz J.
Modern sequencing technology enables the systematic detection of complex structural variation (SV) across genomes. However, extensive DNA rearrangements arising through a series of mutations, a phenomenon we refer to as serial SV (sSV), remain underexplored, posing a challenge for SV discovery. Here, we present NAHRwhals (https://github.com/WHops/NAHRwhals), a method to infer repeat-mediated series of SVs in long-read genomic assemblies.
Applying NAHRwhals to haplotype-resolved human genomes from 28 individuals reveals 37 sSV loci of various length and complexity. These sSVs explain otherwise cryptic variation in medically relevant regions such as the TPSAB1 gene, 8p23.1, 22q11 and Sotos syndrome regions. Comparisons with great ape assemblies indicate that most human sSVs formed recently, after the human-ape split, and involved non-repeat-mediated processes in addition to non-allelic homologous recombination. NAHRwhals reliably discovers and characterizes sSVs at scale and independent of species, uncovering their genomic abundance and suggesting broader implications for disease.

Item Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree (Springer Nature, 2023)
Dylus, David; Altenhoff, Adrian; Majidian, Sina; Sedlazeck, Fritz J.; Dessimoz, Christophe
Current methods for inference of phylogenetic trees require running complex pipelines at substantial computational and labor costs, with additional constraints in sequencing coverage, assembly and annotation quality, especially for large datasets. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy. In a benchmark encompassing a broad variety of datasets, Read2Tree is 10–100 times faster than assembly-based approaches and in most cases more accurate—the exception being when sequencing coverage is high and reference species very distant. Here, to illustrate the broad applicability of the tool, we reconstruct a yeast tree of life of 435 species spanning 590 million years of evolution. We also apply Read2Tree to >10,000 Coronaviridae samples, accurately classifying highly diverse animal samples and near-identical severe acute respiratory syndrome coronavirus 2 sequences on a single tree.
The speed, accuracy and versatility of Read2Tree enable comparative genomics at scale.

Item Intratumoral Heterogeneity and Clonal Evolution Induced by HPV Integration (AACR, 2023)
Akagi, Keiko; Symer, David E.; Mahmoud, Medhat; Jiang, Bo; Goodwin, Sara; Wangsa, Darawalee; Li, Zhengke; Xiao, Weihong; Dan Dunn, Joe; Ried, Thomas; Coombes, Kevin R.; Sedlazeck, Fritz J.; Gillison, Maura L.
The human papillomavirus (HPV) genome is integrated into host DNA in most HPV-positive cancers, but the consequences for chromosomal integrity are unknown. Continuous long-read sequencing of oropharyngeal cancers and cancer cell lines identified a previously undescribed form of structural variation, “heterocateny,” characterized by diverse, interrelated, and repetitive patterns of concatemerized virus and host DNA segments within a cancer. Unique breakpoints shared across structural variants facilitated stepwise reconstruction of their evolution from a common molecular ancestor. This analysis revealed that virus and virus–host concatemers are unstable and, upon insertion into and excision from chromosomes, facilitate capture, amplification, and recombination of host DNA and chromosomal rearrangements. Evidence of heterocateny was detected in extrachromosomal and intrachromosomal DNA. These findings indicate that heterocateny is driven by the dynamic, aberrant replication and recombination of an oncogenic DNA virus, thereby extending known consequences of HPV integration to include promotion of intratumoral heterogeneity and clonal evolution.
Long-read sequencing of HPV-positive cancers revealed “heterocateny,” a previously unreported form of genomic structural variation characterized by heterogeneous, interrelated, and repetitive genomic rearrangements within a tumor.
Heterocateny is driven by unstable concatemerized HPV genomes, which facilitate capture, rearrangement, and amplification of host DNA, and promotes intratumoral heterogeneity and clonal evolution.
See related commentary by McBride and White, p. 814. This article is highlighted in the In This Issue feature, p. 799.

Item Inverted triplications formed by iterative template switches generate structural variant diversity at genomic disorder loci (Elsevier, 2024)
Grochowski, Christopher M.; Bengtsson, Jesse D.; Du, Haowei; Gandhi, Mira; Lun, Ming Yin; Mehaffey, Michele G.; Park, KyungHee; Höps, Wolfram; Benito, Eva; Hasenfeld, Patrick; Korbel, Jan O.; Mahmoud, Medhat; Paulin, Luis F.; Jhangiani, Shalini N.; Hwang, James Paul; Bhamidipati, Sravya V.; Muzny, Donna M.; Fatih, Jawid M.; Gibbs, Richard A.; Pendleton, Matthew; Harrington, Eoghan; Juul, Sissel; Lindstrand, Anna; Sedlazeck, Fritz J.; Pehlivan, Davut; Lupski, James R.; Carvalho, Claudia M. B.
The duplication-triplication/inverted-duplication (DUP-TRP/INV-DUP) structure is a complex genomic rearrangement (CGR). Although it has been identified as an important pathogenic DNA mutation signature in genomic disorders and cancer genomes, its architecture remains unresolved. Here, we studied the genomic architecture of DUP-TRP/INV-DUP by investigating the DNA of 24 patients identified by array comparative genomic hybridization (aCGH) on whom we found evidence for the existence of 4 out of 4 predicted structural variant (SV) haplotypes. Using a combination of short-read genome sequencing (GS), long-read GS, optical genome mapping, and single-cell DNA template strand sequencing (strand-seq), the haplotype structure was resolved in 18 samples. The point of template switching in 4 samples was shown to be a segment of ∼2.2–5.5 kb of 100% nucleotide similarity within inverted repeat pairs. These data provide experimental evidence that inverted low-copy repeats act as recombinant substrates.
This type of CGR can result in multiple conformers generating diverse SV haplotypes in susceptible dosage-sensitive loci.

Item MethPhaser: methylation-based long-read haplotype phasing of human genomes (Springer Nature, 2024)
Fu, Yilei; Aganezov, Sergey; Mahmoud, Medhat; Beaulaurier, John; Juul, Sissel; Treangen, Todd J.; Sedlazeck, Fritz J.
The assignment of variants across haplotypes, phasing, is crucial for predicting the consequences, interaction, and inheritance of mutations and is a key step in improving our understanding of phenotype and disease. However, phasing is limited by read length and stretches of homozygosity along the genome. To overcome this limitation, we designed MethPhaser, a method that utilizes methylation signals from Oxford Nanopore Technologies to extend Single Nucleotide Variation (SNV)-based phasing. We demonstrate that haplotype-specific methylations extensively exist in human genomes and the advent of long-read technologies enabled direct report of methylation signals. For ONT R9 and R10 cell line data, we increase the phase length N50 by 78%-151% at a phasing accuracy of 83.4-98.7%. To assess the impact of tissue purity and random methylation signals due to inactivation, we also applied MethPhaser on blood samples from 4 patients, still showing improvements over SNV-only phasing. MethPhaser further improves phasing across HLA and multiple other medically relevant genes, improving our understanding of how mutations interact across multiple phenotypes. The concept of MethPhaser can also be extended to non-human diploid genomes.
MethPhaser is available at https://github.com/treangenlab/methphaser.

Item Multiple genome alignment in the telomere-to-telomere assembly era (Springer Nature, 2022)
Kille, Bryce; Balaji, Advait; Sedlazeck, Fritz J.; Nute, Michael; Treangen, Todd J.
With the arrival of telomere-to-telomere (T2T) assemblies of the human genome comes the computational challenge of efficiently and accurately constructing multiple genome alignments at an unprecedented scale. By identifying nucleotides across genomes which share a common ancestor, multiple genome alignments commonly serve as the bedrock for comparative genomics studies. In this review, we provide an overview of the algorithmic template that most multiple genome alignment methods follow. We also discuss prospective areas of improvement of multiple genome alignment for keeping up with continuously arriving high-quality T2T assembled genomes and for unlocking clinically-relevant insights.

Item Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes (Springer Nature, 2023)
Chin, Chen-Shan; Behera, Sairam; Khalak, Asif; Sedlazeck, Fritz J.; Sudmant, Peter H.; Wagner, Justin; Zook, Justin M.
Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions.
Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes. This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.

Item Profiling complex repeat expansions in RFC1 in Parkinson’s disease (Springer Nature, 2024)
Alvarez Jerez, Pilar; Daida, Kensuke; Miano-Burkhardt, Abigail; Iwaki, Hirotaka; Malik, Laksh; Cogan, Guillaume; Makarious, Mary B.; Sullivan, Roisin; Vandrovcova, Jana; Ding, Jinhui; Gibbs, J. Raphael; Markham, Androo; Nalls, Mike A.; Kesharwani, Rupesh K.; Sedlazeck, Fritz J.; Casey, Bradford; Hardy, John; Houlden, Henry; Blauwendraat, Cornelis; Singleton, Andrew B.; Billingsley, Kimberley J.
A biallelic (AAGGG) expansion in the poly(A) tail of an AluSx3 transposable element within the gene RFC1 is a frequent cause of cerebellar ataxia, neuropathy, vestibular areflexia syndrome (CANVAS), and more recently, has been reported as a rare cause of Parkinson’s disease (PD) in the Finnish population. Here, we investigate the prevalence of RFC1 (AAGGG) expansions in PD patients of non-Finnish European ancestry in 1609 individuals from the Parkinson’s Progression Markers Initiative study. We identified four PD patients carrying the biallelic RFC1 (AAGGG) expansion and did not identify any carriers in controls.

Item Rescuing low frequency variants within intra-host viral populations directly from Oxford Nanopore sequencing data (Springer Nature, 2022)
Liu, Yunxi; Kearney, Joshua; Mahmoud, Medhat; Kille, Bryce; Sedlazeck, Fritz J.; Treangen, Todd J.
Infectious disease monitoring on Oxford Nanopore Technologies (ONT) platforms offers rapid turnaround times and low cost.
Tracking low frequency intra-host variants provides important insights with respect to elucidating within-host viral population dynamics and transmission. However, given the higher error rate of ONT, accurate identification of intra-host variants with low allele frequencies remains an open challenge with no viable computational solutions available. In response to this need, we present Variabel, a novel approach and first method designed for rescuing low frequency intra-host variants from ONT data alone. We evaluate Variabel on both synthetic data (SARS-CoV-2) and patient-derived datasets (Ebola virus, norovirus, SARS-CoV-2); our results show that Variabel can accurately identify low frequency variants below 0.5 allele frequency, outperforming existing state-of-the-art ONT variant callers for this task. Variabel is open-source and available for download at: www.gitlab.com/treangenlab/variabel.

Item SARS-CoV-2 genomic diversity and the implications for qRT-PCR diagnostics and transmission (Cold Spring Harbor Laboratory Press, 2021)
Sapoval, Nicolae; Mahmoud, Medhat; Jochum, Michael D.; Liu, Yunxi; Elworth, R. A. Leo; Wang, Qi; Albin, Dreycey; Ogilvie, Huw A.; Lee, Michael D.; Villapol, Sonia; Hernandez, Kyle M.; Berry, Irina Maljkovic; Foox, Jonathan; Beheshti, Afshin; Ternus, Krista; Aagaard, Kjersti M.; Posada, David; Mason, Christopher E.; Sedlazeck, Fritz J.; Treangen, Todd J.
The COVID-19 pandemic has sparked an urgent need to uncover the underlying biology of this devastating disease. Though RNA viruses mutate more rapidly than DNA viruses, there are a relatively small number of single nucleotide polymorphisms (SNPs) that differentiate the main SARS-CoV-2 lineages that have spread throughout the world. In this study, we investigated 129 RNA-seq data sets and 6928 consensus genomes to contrast the intra-host and inter-host diversity of SARS-CoV-2. Our analyses yielded three major observations.
First, the mutational profile of SARS-CoV-2 highlights intra-host single nucleotide variant (iSNV) and SNP similarity, albeit with differences in C > U changes. Second, iSNV and SNP patterns in SARS-CoV-2 are more similar to MERS-CoV than SARS-CoV-1. Third, a significant fraction of insertions and deletions contribute to the genetic diversity of SARS-CoV-2. Altogether, our findings provide insight into SARS-CoV-2 genomic diversity, inform the design of detection tests, and highlight the potential of iSNVs for tracking the transmission of SARS-CoV-2.

Item Single-cell somatic copy number variants in brain using different amplification methods and reference genomes (Springer Nature, 2024)
Kalef-Ezra, Ester; Turan, Zeliha Gozde; Perez-Rodriguez, Diego; Bomann, Ida; Behera, Sairam; Morley, Caoimhe; Scholz, Sonja W.; Jaunmuktane, Zane; Demeulemeester, Jonas; Sedlazeck, Fritz J.; Proukakis, Christos
The presence of somatic mutations, including copy number variants (CNVs), in the brain is well recognized. Comprehensive study requires single-cell whole genome amplification, with several methods available, prior to sequencing. Here we compare PicoPLEX with two recent adaptations of multiple displacement amplification (MDA): primary template-directed amplification (PTA) and droplet MDA, across 93 human brain cortical nuclei. We demonstrate different properties for each, with PTA providing the broadest amplification, PicoPLEX the most even, and distinct chimeric profiles. Furthermore, we perform CNV calling on two brains with multiple system atrophy and one control brain using different reference genomes. We find that 20.6% of brain cells have at least one Mb-scale CNV, with some supported by bulk sequencing or single-cells from other brain regions.
Our study highlights the importance of selecting whole genome amplification method and reference genome for CNV calling, while supporting the existence of somatic CNVs in healthy and diseased human brain.

Item StratoMod: predicting sequencing and variant calling errors with interpretable machine learning (Springer Nature, 2024)
Dwarshuis, Nathan; Tonner, Peter; Olson, Nathan D.; Sedlazeck, Fritz J.; Wagner, Justin; Zook, Justin M.
Despite the variety in sequencing platforms, mappers, and variant callers, no single pipeline is optimal across the entire human genome. Therefore, developers, clinicians, and researchers need to make tradeoffs when designing pipelines for their application. Currently, assessing such tradeoffs relies on intuition about how a certain pipeline will perform in a given genomic context. We present StratoMod, which addresses this problem using an interpretable machine-learning classifier to predict germline variant calling errors in a data-driven manner. We show StratoMod can precisely predict recall using HiFi or Illumina and leverage StratoMod’s interpretability to measure contributions from difficult-to-map and homopolymer regions for each respective outcome. Furthermore, we use StratoMod to assess the effect of mismapping on predicted recall using linear vs. graph-based references, and identify the hard-to-map regions where graph-based methods excelled and by how much. For these we utilize our draft benchmark based on the Q100 HG002 assembly, which contains previously-inaccessible difficult regions. Furthermore, StratoMod presents a new method of predicting clinically relevant variants likely to be missed, which is an improvement over current pipelines, which only filter variants likely to be false. We anticipate this being useful for performing precise risk-reward analyses when designing variant calling pipelines.