Browsing by Author "Zook, Justin M."

Item
FixItFelix: improving genomic analysis by fixing reference errors (Springer Nature, 2023)
Behera, Sairam; LeFaive, Jonathon; Orchard, Peter; Mahmoud, Medhat; Paulin, Luis F.; Farek, Jesse; Soto, Daniela C.; Parker, Stephen C. J.; Smith, Albert V.; Dennis, Megan Y.; Zook, Justin M.; Sedlazeck, Fritz J.
The current version of the human reference genome, GRCh38, contains a number of errors including 1.2 Mbp of falsely duplicated and 8.04 Mbp of collapsed regions. These errors impact the variant calling of 33 protein-coding genes, including 12 with medical relevance. Here, we present FixItFelix, an efficient remapping approach, together with a modified version of the GRCh38 reference genome that improves the subsequent analysis across these genes within minutes for an existing alignment file while maintaining the same coordinates. We showcase these improvements over multi-ethnic control samples, demonstrating improvements for population variant calling as well as eQTL studies.

Item
The GIAB genomic stratifications resource for human reference genomes (Springer Nature, 2024)
Dwarshuis, Nathan; Kalra, Divya; McDaniel, Jennifer; Sanio, Philippe; Alvarez Jerez, Pilar; Jadhav, Bharati; Huang, Wenyu (Eddy); Mondal, Rajarshi; Busby, Ben; Olson, Nathan D.; Sedlazeck, Fritz J.; Wagner, Justin; Majidian, Sina; Zook, Justin M.
Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome. Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software. Here we describe a set of “stratifications,” which are BED files that define distinct contexts throughout the genome. We define these for GRCh37/38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions which are critical for understanding performance as the field progresses.
Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare the benchmarking performance with each reference and show the performance penalty brought about by these additional difficult regions in CHM13. Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example. The means to generate these stratifications are available as a snakemake pipeline at https://github.com/usnistgov/giab-stratifications. We anticipate this being useful in enabling precise risk-reward calculations when building sequencing pipelines for any of the commonly-used reference genomes.

Item
Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes (Springer Nature, 2023)
Chin, Chen-Shan; Behera, Sairam; Khalak, Asif; Sedlazeck, Fritz J.; Sudmant, Peter H.; Wagner, Justin; Zook, Justin M.
Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions. Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes.
This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.

Item
StratoMod: predicting sequencing and variant calling errors with interpretable machine learning (Springer Nature, 2024)
Dwarshuis, Nathan; Tonner, Peter; Olson, Nathan D.; Sedlazeck, Fritz J.; Wagner, Justin; Zook, Justin M.
Despite the variety in sequencing platforms, mappers, and variant callers, no single pipeline is optimal across the entire human genome. Therefore, developers, clinicians, and researchers need to make tradeoffs when designing pipelines for their application. Currently, assessing such tradeoffs relies on intuition about how a certain pipeline will perform in a given genomic context. We present StratoMod, which addresses this problem using an interpretable machine-learning classifier to predict germline variant calling errors in a data-driven manner. We show StratoMod can precisely predict recall using HiFi or Illumina and leverage StratoMod’s interpretability to measure contributions from difficult-to-map and homopolymer regions for each respective outcome. Furthermore, we use StratoMod to assess the effect of mismapping on predicted recall using linear vs. graph-based references, and identify the hard-to-map regions where graph-based methods excelled and by how much. For these we utilize our draft benchmark based on the Q100 HG002 assembly, which contains previously inaccessible difficult regions. Furthermore, StratoMod presents a new method of predicting clinically relevant variants likely to be missed, which is an improvement over current pipelines, which only filter variants likely to be false. We anticipate this being useful for performing precise risk-reward analyses when designing variant calling pipelines.