Treangen, Todd J.
2023-06-13
2023-05
2023-02-27
May 2023
Balaji, Advait. "Journey into the unknown: graph and machine learning based approaches for improved characterization of novel pathogens." (2023) Diss., Rice University. https://hdl.handle.net/1911/114907.
The advent of efficient high-throughput sequencing technologies has led to petabyte-scale genomic datasets. A significant contributor to this genomic data deluge is the field of metagenomics, which comprises the analysis of microbial communities from biological samples. Metagenomics has contributed novel insights into infectious disease spread and public health; however, scalable and accurate tools to identify and characterize sequences of interest (e.g., pathogens) in metagenomic samples remain limited. In this thesis, we present three computational tools that contribute to novel pathogen detection via taxonomy-oblivious functional characterization of DNA sequences harbored within metagenomes. In the first part, we introduce SeqScreen, a tool that uses ensemble learning for sensitive functional screening of pathogenic sequences. We show that our ensemble classifier, consisting of neural networks and support vector classifiers, can assign pathogenic labels known as Functions of Sequences of Concern (FunSoCs) to short-read sequences. Our classifier achieves 90% precision and 82% recall on an imbalanced multi-class, multi-label classification task across 32 FunSoC labels. We highlight the advantages of FunSoCs over state-of-the-art taxonomic classifiers in distinguishing near-neighbor pathogens. We also simulate a novel-pathogen use case and show that, in contrast to other tools, SeqScreen can sensitively detect trace amounts of SARS-CoV-2 in a metagenomic sample obtained from COVID-19 patients.
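To make the precision and recall figures for a multi-label task concrete, the following is a minimal sketch of how such scores can be computed when each sequence may carry several labels at once. It assumes micro-averaging (pooling true positives, false positives, and false negatives over all labels); the abstract does not state which averaging scheme was used, and the function name and label strings here are hypothetical placeholders, not real FunSoC names or SeqScreen code.

```python
from typing import List, Set, Tuple


def micro_precision_recall(
    true_labels: List[Set[str]],
    predicted_labels: List[Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall for multi-label predictions.

    Each element of the two lists is the set of labels for one sequence.
    Micro-averaging is an assumption made for this illustration.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("inputs must have the same number of sequences")

    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)   # labels correctly assigned
        fp += len(pred - truth)   # labels assigned but not true
        fn += len(truth - pred)   # true labels that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for three sequences; the third carries no label at all.
truth = [{"label_a", "label_b"}, {"label_c"}, set()]
pred = [{"label_a"}, {"label_c", "label_d"}, set()]
precision, recall = micro_precision_recall(truth, pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

On an imbalanced label set, micro-averaged scores are dominated by the frequent labels, so per-label or macro-averaged scores are often reported alongside them.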
Second, we discuss KOMB, a software tool for reference-free characterization of function-rich Copy Number Variations (CNVs) in metagenomes. KOMB presents one of the first applications of K-core graph decomposition to metagenomes, offering an exact, linear-time O(Edges + Vertices) solution for identifying repeats in metagenome graphs, in contrast to state-of-the-art tools based on betweenness centrality. On a mock metagenome, KOMB detects repeats more accurately across different copy numbers, providing a sample-wide characterization of CNVs. Using longitudinal metagenome data, we show that KOMB can be used to analyze and visualize shifts caused by disruptions. We also show that KOMB can identify sequences with potentially unique functional profiles using an anomaly detection method previously applied to social networks.
Finally, we present SeqScreen-Nano, a tool for pathogen detection and identification in metagenomes using long-read data. Using simulated nanopore reads from isolate genomes, we first show that the mapping stage of SeqScreen-Nano is optimized to accurately predict Open Reading Frames (ORFs) along the length of the raw nanopore read and to assign functional labels more accurately than other mappers and functional characterization tools. We also propose a majority voting approach and a greedy weighted minimum set cover algorithm to predict a single taxonomic label per read. Further, we develop a reference inference pipeline that assigns a probabilistic coverage score based on ORF assignments to accurately predict species in two mock metagenomic communities, achieving higher precision and recall than state-of-the-art taxonomic classifiers. In summary, this thesis presents efficient and accurate software for pathogen detection and de novo characterization of copy number variation.
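The linear-time claim for K-core decomposition can be illustrated with a minimal sketch of the standard bucket-based peeling algorithm (commonly attributed to Batagelj and Zaversnik). This is a generic textbook version, not KOMB's implementation: the function name, the adjacency-dictionary input format, and the toy graph are assumptions made for illustration. The input is assumed to be an undirected graph with symmetric adjacency sets and no self-loops.

```python
from typing import Dict, Hashable, List, Set


def core_numbers(adj: Dict[Hashable, Set[Hashable]]) -> Dict[Hashable, int]:
    """Return the core number of every vertex in O(V + E) time.

    `adj` maps each vertex to the set of its neighbours (hypothetical
    input format chosen for this sketch).
    """
    if not adj:
        return {}

    core = {v: len(nbrs) for v, nbrs in adj.items()}  # starts as the degree
    max_deg = max(core.values())

    # Bucket-sort the vertices by degree.
    buckets: List[List[Hashable]] = [[] for _ in range(max_deg + 1)]
    for v, d in core.items():
        buckets[d].append(v)
    order = [v for bucket in buckets for v in bucket]
    pos = {v: i for i, v in enumerate(order)}

    # bin_start[d] is the index in `order` of the first vertex with value d.
    bin_start = [0] * (max_deg + 1)
    start = 0
    for d in range(max_deg + 1):
        bin_start[d] = start
        start += len(buckets[d])

    # Peel vertices in non-decreasing order of their current value.
    for i in range(len(order)):
        v = order[i]
        for u in adj[v]:
            if core[u] > core[v]:
                du, pu = core[u], pos[u]
                pw = bin_start[du]
                w = order[pw]
                if u != w:
                    # Move u to the front of its bin so it can drop one bin.
                    order[pu], order[pw] = w, u
                    pos[u], pos[w] = pw, pu
                bin_start[du] += 1
                core[u] -= 1
    return core


# Toy graph: a triangle (a, b, c), a pendant vertex d, and an isolated vertex e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
print(core_numbers(graph))
```

Each vertex is moved between bins at most once per incident edge, which is where the O(Edges + Vertices) bound comes from. Computing betweenness centrality, by contrast, requires a shortest-path traversal from every vertex.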
Our work presents novel computational frameworks and algorithmic applications with the potential for broad impact across the scientific community, from clinical metagenomics to microbial forensics.
application/pdf
eng
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
Pathogen detection, metagenomics, machine learning, graph theory, software
Journey into the unknown: graph and machine learning based approaches for improved characterization of novel pathogens.
Thesis
2023-06-13