Browsing by Author "Treangen, Todd J"
Now showing 1 - 10 of 10
Item Embargo: Accurate and Efficient Computational Approaches for Long-read Alignment and Genome Phasing of Human Genomes (2023-12-01) Fu, Yilei; Treangen, Todd J

The arrival of long-read sequencing technologies has enabled analysis of human genomes at unprecedented resolution. Long-read technologies have facilitated telomere-to-telomere assembly of the human genome and shed light on difficult-to-resolve structural variations, single-nucleotide variations, and epigenetic modifications, all of which play a critical role in disease etiology and individual genetic diversity. Despite these technological advances, novel computational methods are still needed to fully leverage long reads. In this dissertation, I tackle three key computational questions by leveraging long-read sequences of human genomes: (1) I improve the efficiency and precision of long-read alignment; (2) I develop a novel variant phasing technique based on methylation signals; and (3) I provide a novel method for clinical analysis specific to cancer samples and tumor purity estimation. These contributions are represented by three software tools I have developed: Vulcan, MethPhaser, and MethPhaser-Cancer, respectively. Vulcan is a read-mapping pipeline that uses two distinct gap penalty modes, which is referred to as dual-mode alignment. Read aligners before Vulcan used only one type of scoring scheme during the pairwise alignment stage, which can struggle with the variable diversity across the human genome. With Vulcan's dual-mode alignment algorithm, read-to-reference mapping quality and efficiency for Oxford Nanopore Technology (ONT) long reads are improved on both simulated and real datasets. Notably, we also show that Vulcan improves structural variation detection, increasing the SV detection F1 score on 30X human ONT reads from 82.66% (minimap2) to 84.94%.
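For context, the F1 score quoted above is the harmonic mean of precision and recall, the standard summary metric in SV benchmarking. A minimal sketch follows; the precision/recall inputs are illustrative, not the values behind the reported 82.66%/84.94% figures:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, as used in SV benchmarking."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values only; the abstract reports F1 directly, not P/R.
print(round(f1_score(0.85, 0.80), 4))  # -> 0.8242
```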
MethPhaser is the first method that utilizes methylation, an epigenetic marker, from Oxford Nanopore Technologies reads to extend SNV-based phasing. Long-read human genomic variant phasing is limited by read length and by stretches of homozygosity along the genome. The key innovation of MethPhaser is its use of haplotype-specific long-read methylation signals. In benchmarks on human samples, MethPhaser nearly triples the phase length N50 while incurring only a minimal increase in switch error, from 0.06% to 0.07%, using ONT R10 reads at 60X coverage. As an extension to existing long-read SNV-based phasing workflows, MethPhaser offers substantial enhancements with a negligible rise in switch error rates. Building upon MethPhaser, I have also developed an algorithmic extension named MethPhaser-Cancer that uses methylation signals to assess tumor purity and to categorize reads. Tumor purity estimation is an important step in clinical treatment, informing patient-specific therapeutic strategies and, more broadly, personalized medicine. MethPhaser-Cancer identifies hypomethylated regions within human tumor samples and applies the k-means algorithm to sort the reads into two distinct groups. This represents a pioneering approach in the long-read sequencing field: considering whole-genome methylation profiles in simulated clinical samples to automatically estimate tumor purity and distinguish long reads within specific regions between two samples. To conclude, this dissertation presents a set of novel and efficient approaches that enhance long-read human genomic analysis.
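The k-means read partitioning described for MethPhaser-Cancer can be sketched in miniature: cluster reads by their mean methylation fraction with a 1-D k-means (k=2) and treat the hypomethylated cluster's share as a crude purity estimate. This is a hedged illustration with toy data, not MethPhaser-Cancer's actual implementation:

```python
def kmeans_1d(values, iters=50):
    """Tiny 1-D k-means with k=2, seeded at the min and max values."""
    centers = [min(values), max(values)]
    for _ in range(iters):
        clusters = ([], [])
        for v in values:
            # Assign each value to the nearer of the two centers.
            clusters[0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1].append(v)
        new = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new == centers:  # converged
            break
        centers = new
    return centers, clusters

# Mean methylation fraction per read (toy data): tumor-derived reads
# are hypomethylated relative to normal reads.
reads = [0.15, 0.12, 0.18, 0.10, 0.80, 0.85, 0.78, 0.82, 0.14, 0.79]
centers, clusters = kmeans_1d(reads)
low = 0 if centers[0] < centers[1] else 1  # hypomethylated cluster
purity = len(clusters[low]) / len(reads)
print(f"estimated tumor purity: {purity:.1f}")  # -> 0.5
```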
In practice, Vulcan, MethPhaser, and MethPhaser-Cancer support long-read alignment, human genome variant phasing, and tumor purity estimation, respectively.

Item Embargo: Finding Needles in the Haystack: Computational Tools for Contaminant Detection and Error Correction in Genomic and Metagenomic Datasets (2024-04-17) Liu, Yunxi; Treangen, Todd J

The scale and complexity of genomic studies have expanded alongside the volume of sequencing data, thanks to recent developments in next-generation and third-generation sequencing technologies. However, errors introduced during sample collection, sample preparation, sequencing, and computational data analysis can distort results and contribute to erroneous interpretations. In this work we present a set of studies that explore anomaly detection and error correction in genomic data from different points of view. Broadly, the topics of this thesis fall into two categories: those related to metagenomics, where the projects focus on accurate profiling of microbiome communities in terms of contamination identification and false-positive detection in taxonomic classification; and those related to viral genomics, where the projects focus on variant calling with high-error-rate long-read sequencing and cryptic mutation detection for SARS-CoV-2 in wastewater. These methodologies underscore the significance of rare occurrences in high-throughput sequencing procedures, paving the way for advancements in metagenomics and viral genomics.

Item: Interrogating Microbial Populations: from Large-scale Data to Algorithms to Field-deployed Software (2024-04-17) Sapoval, Nick; Treangen, Todd J

In this work we present a set of studies that explore genomic sequencing data and offer computational methods to process these data at scale.
Broadly, the topics of this thesis can be grouped into two categories: those that bridge the gap between efficient large-scale data analysis and applications in public health, and those that explore algorithmic solutions for analyzing clinically relevant metagenomic data. Across these topics we make several contributions spanning scientific data analysis, algorithm design, and software development. In the realm of public health, we contribute an exploratory study of genomic variation within SARS-CoV-2 and its impact on our ability to track the virus and its spread. We also propose an efficient pipeline for characterizing wastewater-derived SARS-CoV-2 samples, which is employed for routine monitoring in Houston, USA. On the clinical metagenomics side, we explore a scalable, database-free approach for characterizing longitudinal changes in the human gut microbiota. We also propose laptop-friendly software for taxonomic profiling of long-read metagenomic samples. Together, the contributions of this thesis span two major application areas and cover topics of data-driven algorithm and software design.

Item: Limited genomic reconstruction of SARS-CoV-2 transmission history within local epidemiological clusters (Oxford University Press, 2022) Gallego-García, Pilar; Varela, Nair; Estévez-Gómez, Nuria; De Chiara, Loretta; Fernández-Silva, Iria; Valverde, Diana; Sapoval, Nicolae; Treangen, Todd J; Regueiro, Benito; Cabrera-Alvargonzález, Jorge Julio; del Campo, Víctor; Pérez, Sonia; Posada, David

A detailed understanding of how and when severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission occurs is crucial for designing effective prevention measures. Other than contact tracing, genome sequencing provides information to help infer who infected whom. However, the effectiveness of the genomic approach in this context depends on both (high enough) mutation and (low enough) transmission rates.
To date, the level of resolution that can be achieved when describing SARS-CoV-2 outbreaks using genomic information alone remains unclear. To answer this question, we sequenced forty-nine SARS-CoV-2 patient samples from ten local clusters in NW Spain for which partial epidemiological information was available and inferred transmission history using genomic variants. Importantly, we obtained high-quality genomic data, sequencing each sample twice and using unique barcodes to exclude cross-sample contamination. Phylogenetic and cluster analyses showed that consensus genomes were generally sufficient to discriminate among independent transmission clusters. However, levels of intrahost variation were low, which in most cases prevented the unambiguous identification of direct transmission events. After filtering out recurrent variants across clusters, the genomic data were generally compatible with the epidemiological information but did not support specific transmission events over possible alternatives. We estimated the effective transmission bottleneck size to be one to two viral particles for sample pairs whose donor-recipient relationship was likely. Our analyses suggest that intrahost genomic variation in SARS-CoV-2 might be generally limited and that homoplasy and recurrent errors complicate the identification of shared intrahost variants. Reliable reconstruction of direct SARS-CoV-2 transmission based solely on genomic data seems hindered by a slow mutation rate, potential convergent events, and technical artifacts. Detailed contact tracing seems essential in most cases to study SARS-CoV-2 transmission at high resolution.

Item: Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation (Oxford University Press, 2023) Kille, Bryce; Garrison, Erik; Treangen, Todd J; Phillippy, Adam M

The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity.
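As a toy illustration of the k-mer Jaccard proxy: two sequences are decomposed into their sets of k-mers and compared with J = |A ∩ B| / |A ∪ B|. Real tools such as MashMap estimate this from winnowed sketches rather than full sets; the sequences below are made up:

```python
def kmers(seq: str, k: int) -> set:
    """All k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

a = kmers("ACGTACGTAC", 4)
b = kmers("ACGTACGTTC", 4)
jaccard = len(a & b) / len(a | b)
print(round(jaccard, 3))  # -> 0.667
```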
By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by using a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. MashMap3 is available at https://github.com/marbl/MashMap.

Item: Parsnp 2.0: scalable core-genome alignment for massive microbial datasets (Oxford University Press, 2024) Kille, Bryce; Nute, Michael G; Huang, Victor; Kim, Eddie; Phillippy, Adam M; Treangen, Todd J

Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there had been no major release since it first appeared in 2014. To address this gap, we developed Parsnp v2, which significantly improves on the original release. Parsnp v2 gives users more control over executions of the program, allowing Parsnp to be better tailored to different use cases.
We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes that are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors need only be conserved within their partition rather than across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. Parsnp v2 is available at https://github.com/marbl/parsnp.

Item: Reference-free structural variant detection in microbiomes via long-read co-assembly graphs (Oxford University Press, 2024) Curry, Kristen D; Yu, Feiqiao Brian; Vance, Summer E; Segarra, Santiago; Bhaya, Devaki; Chikhi, Rayan; Rocha, Eduardo P C; Treangen, Todd J

Motivation: The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing all metagenomic samples in a series (time or other metric) into a single co-assembly graph.
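rhea then compares graph coverage between successive samples. A hedged sketch of that comparison, flagging graph nodes whose coverage rises or falls sharply via log2 fold change; the node names, coverage values, pseudocount, and ±1 threshold are all hypothetical, not rhea's actual parameters:

```python
import math

def log2_fc(cov_t0: float, cov_t1: float, pseudo: float = 1.0) -> float:
    """log2 fold change in coverage between successive samples,
    with a pseudocount to avoid division by zero."""
    return math.log2((cov_t1 + pseudo) / (cov_t0 + pseudo))

# Toy per-node coverage at two successive time points.
coverage = {"nodeA": (40.0, 5.0), "nodeB": (10.0, 11.0), "nodeC": (3.0, 60.0)}
calls = {n: ("declining" if log2_fc(*c) < -1 else
             "thriving" if log2_fc(*c) > 1 else "stable")
         for n, c in coverage.items()}
print(calls)  # nodeA declining, nodeB stable, nodeC thriving
```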
The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining.

Results: We show that rhea outperforms existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and as strain diversity increases. We additionally demonstrate use cases for rhea on series metagenomic data from environmental and fermented-food microbiomes, detecting specific sequence alterations between successive time and temperature samples and suggesting host advantage. Our approach leverages previous work on assembly graph structure and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities, enabling more comprehensive insights into microbial gene flux.

Availability and implementation: rhea is open source and available at https://github.com/treangenlab/rhea.

Item: RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification (BioMed Central, 2018-10-30) Nasko, Daniel J; Koren, Sergey; Phillippy, Adam M; Treangen, Todd J

Abstract: In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor (LCA) taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes.
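A toy sketch of k-mer LCA classification helps explain why more species per genus pushes reads up to genus-level assignments: once a read's k-mers hit two sibling species, the call falls back to their common ancestor. The taxonomy, k-mer table, and read below are illustrative, not RefSeq data, and real classifiers weight hits rather than taking a plain LCA:

```python
# Toy taxonomy: child -> parent (root's parent is None).
PARENT = {"E.coli": "Escherichia", "E.fergusonii": "Escherichia",
          "Escherichia": "Enterobacteriaceae", "Enterobacteriaceae": None}

def lineage(taxon):
    """Path from taxon up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT.get(taxon)
    return path

def lca(taxa):
    """Lowest common ancestor of a list of taxa."""
    paths = [lineage(t) for t in taxa]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    for node in paths[0]:          # walk up from the leaf...
        if node in common:         # ...first shared node is the LCA
            return node

# Toy k-mer-to-taxon table and a 6 bp read classified with k=4.
KMER_DB = {"ACGT": "E.coli", "CGTA": "E.coli", "GTAC": "E.fergusonii"}
read = "ACGTAC"
hits = [KMER_DB[read[i:i + 4]] for i in range(len(read) - 3)
        if read[i:i + 4] in KMER_DB]
print(lca(hits))  # hits span two species, so the call drops to genus level
```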
These results suggest a need for new classification approaches specially adapted to large databases.

Item Embargo: The Microbiome in its Entirety: Community-Oriented Computational Tools for Deciphering Metagenomic Diversity (2024-04-19) Curry, Kristen; Treangen, Todd J

Microbiome. An ecosystem composed of microscopic organisms. Although unseen by the naked eye, these communities can have powerful impacts on their hosts and surrounding environments. Yet we are only beginning to scratch the surface of who these tiny critters are, how they survive, and what their overarching purpose is in the tree of life. This thesis presents software methods developed to improve understanding of these communities by leveraging the advent of high-throughput sequencing and viewing each ecosystem holistically, motivated by the goal of improving gut microbiome analysis in concussion recovery. We dive into three computational tools developed to better characterize the diversity within microbial communities: Emu for taxonomic community profiling, Rhea for structural variant detection, and Kiwi for P4 phage satellite detection. Each of these algorithms was designed with a view of the microbiome as a single evolving entity rather than a sum of unique individuals. Viewing microbiomes through this lens and incorporating computer science techniques in expectation-maximization, graph motif extraction, and substring minimizers allowed us to develop software for each of these tasks that improves upon existing methods.

Item: Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment (Oxford University Press, 2021) Fu, Yilei; Mahmoud, Medhat; Muraliraman, Viginesh Vaibhav; Sedlazeck, Fritz J; Treangen, Todd J

Long-read sequencing has enabled unprecedented surveys of structural variation across the entire human genome.
To maximize the potential of long-read sequencing in this context, novel mapping methods have emerged that primarily focus on either speed or accuracy. Various heuristics and scoring schemes have been implemented in widely used read mappers (minimap2 and NGMLR) to optimize for speed or accuracy, with variable performance across different genomic regions and for specific structural variants. Our hypothesis is that constraining read mapping to a single gap penalty across distinct mutational hot spots reduces read alignment accuracy and impedes structural variant detection. We tested our hypothesis by implementing a read-mapping pipeline called Vulcan that uses two distinct gap penalty modes, which we refer to as dual-mode alignment. The high-level idea is that Vulcan leverages the normalized edit distance of reads mapped via minimap2 to identify poorly aligned reads and realigns them using the more accurate yet computationally more expensive long-read mapper (NGMLR). In support of our hypothesis, we show that Vulcan improves alignments of Oxford Nanopore Technology long reads for both simulated and real datasets. These improvements, in turn, lead to improved structural variant calling performance on human genome datasets compared to either read-mapping method alone. Vulcan is the first long-read mapping framework to combine two distinct gap penalty modes for improved structural variant recall and precision. Vulcan is open source and available under the MIT License at https://gitlab.com/treangenlab/vulcan.
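The triage step at the heart of the dual-mode idea can be sketched as follows: compute each read's normalized edit distance (edit distance divided by read length) from the fast mapper's output and route reads above a cutoff to the slower, more accurate mapper. The threshold, field layout, and read records below are hypothetical, not Vulcan's actual defaults:

```python
THRESHOLD = 0.10  # hypothetical cutoff on normalized edit distance

def needs_realignment(edit_distance: int, read_length: int,
                      threshold: float = THRESHOLD) -> bool:
    """True if the first-pass alignment looks poor enough to redo
    with the slower, more accurate mapper."""
    return edit_distance / read_length > threshold

# (read name, edit distance from first-pass mapping, read length)
reads = [("read1", 120, 10_000), ("read2", 2_400, 12_000), ("read3", 90, 8_000)]
to_realign = [name for name, nm, length in reads
              if needs_realignment(nm, length)]
print(to_realign)  # -> ['read2']
```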