StratoMod: predicting sequencing and variant calling errors with interpretable machine learning

dc.citation.articleNumber: 1316 [en_US]
dc.citation.journalTitle: Communications Biology [en_US]
dc.citation.volumeNumber: 7 [en_US]
dc.contributor.author: Dwarshuis, Nathan [en_US]
dc.contributor.author: Tonner, Peter [en_US]
dc.contributor.author: Olson, Nathan D. [en_US]
dc.contributor.author: Sedlazeck, Fritz J. [en_US]
dc.contributor.author: Wagner, Justin [en_US]
dc.contributor.author: Zook, Justin M. [en_US]
dc.date.accessioned: 2024-11-04T16:25:13Z [en_US]
dc.date.available: 2024-11-04T16:25:13Z [en_US]
dc.date.issued: 2024 [en_US]
dc.description.abstract: Despite the variety in sequencing platforms, mappers, and variant callers, no single pipeline is optimal across the entire human genome. Therefore, developers, clinicians, and researchers need to make tradeoffs when designing pipelines for their application. Currently, assessing such tradeoffs relies on intuition about how a certain pipeline will perform in a given genomic context. We present StratoMod, which addresses this problem using an interpretable machine-learning classifier to predict germline variant calling errors in a data-driven manner. We show StratoMod can precisely predict recall using HiFi or Illumina and leverage StratoMod’s interpretability to measure contributions from difficult-to-map and homopolymer regions for each respective outcome. Furthermore, we use StratoMod to assess the effect of mismapping on predicted recall using linear vs. graph-based references, and identify the hard-to-map regions where graph-based methods excelled and by how much. For these we utilize our draft benchmark based on the Q100 HG002 assembly, which contains previously-inaccessible difficult regions. Furthermore, StratoMod presents a new method of predicting clinically relevant variants likely to be missed, which is an improvement over current pipelines which only filter variants likely to be false. We anticipate this being useful for performing precise risk-reward analyses when designing variant calling pipelines. [en_US]
dc.identifier.citation: Dwarshuis, N., Tonner, P., Olson, N. D., Sedlazeck, F. J., Wagner, J., & Zook, J. M. (2024). StratoMod: Predicting sequencing and variant calling errors with interpretable machine learning. Communications Biology, 7(1), 1–14. https://doi.org/10.1038/s42003-024-06981-1 [en_US]
dc.identifier.digital: s42003-024-06981-1 [en_US]
dc.identifier.doi: https://doi.org/10.1038/s42003-024-06981-1 [en_US]
dc.identifier.uri: https://hdl.handle.net/1911/118004 [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Springer Nature [en_US]
dc.rights: Except where otherwise noted, this work is licensed under a Creative Commons Attribution (CC BY) license. Permission to reuse, publish, or reproduce the work beyond the terms of the license or beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ [en_US]
dc.title: StratoMod: predicting sequencing and variant calling errors with interpretable machine learning [en_US]
dc.type: Journal article [en_US]
dc.type.dcmi: Text [en_US]
dc.type.publication: publisher version [en_US]
Files
Original bundle
Name: s42003-024-06981-1.pdf
Size: 2.33 MB
Format: Adobe Portable Document Format