Journey into the unknown: graph and machine learning based approaches for improved characterization of novel pathogens.

dc.contributor.advisorTreangen, Todd J.
dc.creatorBalaji, Advait
dc.date.accessioned2023-06-13T15:54:38Z
dc.date.created2023-05
dc.date.issued2023-02-27
dc.date.submittedMay 2023
dc.date.updated2023-06-13T15:54:38Z
dc.description.abstractThe advent of efficient high-throughput sequencing technologies has led to petabyte-scale genomic datasets. A significant contributor to this genomic data deluge is the field of metagenomics, which comprises the analysis of microbial communities from biological samples. Metagenomics has contributed novel insights into infectious disease spread and public health; however, scalable and accurate tools to identify and characterize sequences of interest (e.g., pathogens) from metagenomic samples remain limited. In this thesis, we present three computational tools that contribute to novel pathogen detection via taxonomy-oblivious functional characterization of DNA sequences harbored within metagenomes. In the first part, we introduce SeqScreen, a tool that uses ensemble learning for sensitive functional screening of pathogenic sequences. We show that our ensemble classifier, consisting of neural networks and support vector classifiers, can assign pathogenic labels known as Functions of Sequences of Concern (FunSoCs) to short-read sequences. Our classifier achieves 90% precision and 82% recall on an imbalanced multi-class, multi-label classification task across 32 FunSoC labels. We highlight the advantages of FunSoCs over state-of-the-art taxonomic classifiers in distinguishing near-neighbor pathogens. We also simulate a novel-pathogen use case and show that, in contrast to other tools, SeqScreen can sensitively detect trace amounts of SARS-CoV-2 in a metagenomic sample obtained from COVID-19 patients. Second, we discuss KOMB, a software tool for reference-free characterization of function-rich copy number variations (CNVs) in metagenomes. KOMB presents one of the first applications of k-core graph decomposition to metagenomes, offering an exact linear-time, O(Edges + Vertices), solution for identifying repeats in metagenome graphs, in contrast to state-of-the-art tools based on betweenness centrality.
On a mock metagenome, KOMB detects repeats more accurately across different copy numbers, providing a sample-wide characterization of CNVs. Using longitudinal metagenome data, we show that KOMB can be used to analyze and visualize shifts caused by disruptions. We also show that KOMB can identify sequences with potentially unique functional profiles using an anomaly detection method previously used to analyze social networks. Finally, we present SeqScreen-Nano, a tool for pathogen detection and identification in metagenomes using long-read data. Using simulated nanopore reads from isolate genomes, we first show that the mapping stage of SeqScreen-Nano is optimized to accurately predict open reading frames (ORFs) along the length of the raw nanopore read and assigns functional labels accurately in comparison with other mappers and functional characterization tools. We also propose a majority voting approach and a greedy weighted minimum set cover algorithm to predict a single taxonomic label per read. Further, we develop a reference inference pipeline that assigns a probabilistic coverage score based on ORF assignments to accurately predict species in two mock metagenomic communities, achieving higher precision and recall than state-of-the-art taxonomic classifiers. In summary, this thesis presents efficient and accurate software for pathogen detection and de novo characterization of copy number variation. Our work presents novel computational frameworks and algorithmic applications with the potential for broad impact across the scientific community, from clinical metagenomics to microbial forensics.
dc.embargo.lift2023-11-01
dc.embargo.terms2023-11-01
dc.format.mimetypeapplication/pdf
dc.identifier.citationBalaji, Advait. "Journey into the unknown: graph and machine learning based approaches for improved characterization of novel pathogens." (2023) Diss., Rice University. https://hdl.handle.net/1911/114907.
dc.identifier.urihttps://hdl.handle.net/1911/114907
dc.language.isoeng
dc.rightsCopyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subjectPathogen detection
dc.subjectmetagenomics
dc.subjectmachine learning
dc.subjectgraph theory
dc.subjectsoftware
dc.titleJourney into the unknown: graph and machine learning based approaches for improved characterization of novel pathogens.
dc.typeThesis
dc.type.materialText
thesis.degree.departmentComputer Science
thesis.degree.disciplineEngineering
thesis.degree.grantorRice University
thesis.degree.levelDoctoral
thesis.degree.nameDoctor of Philosophy
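The abstract states that k-core decomposition gives an exact O(Edges + Vertices) solution. As an illustration of why the decomposition is linear-time, here is a minimal sketch of the standard bucket-based peeling procedure (in the style of Batagelj and Zaversnik). It is not KOMB's implementation: the function name `core_numbers`, the edge-list input, and the toy graph are assumptions made for this example, and the abstract does not describe how KOMB constructs its graph.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {vertex: core number} for an undirected simple graph.

    Hypothetical helper for illustration only; input is an iterable of
    (u, v) pairs. Self-loops are ignored and parallel edges collapse.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    if not adj:
        return {}

    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    max_deg = max(degree.values())

    # buckets[d] holds the unprocessed vertices whose current degree is d.
    buckets = [set() for _ in range(max_deg + 1)]
    for v, d in degree.items():
        buckets[d].add(v)

    core = {}
    for d in range(max_deg + 1):
        bucket = buckets[d]
        while bucket:
            v = bucket.pop()
            core[v] = d  # v is peeled at level d
            for u in adj[v]:
                # Only neighbors still above the current level lose a degree,
                # so no vertex ever drops below the bucket being processed.
                if u not in core and degree[u] > d:
                    buckets[degree[u]].discard(u)
                    degree[u] -= 1
                    buckets[degree[u]].add(u)
    return core


# Toy graph (assumed for the example): a triangle a-b-c with a pendant vertex d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_cores = core_numbers(toy_edges)
```

Each vertex is popped once and each edge is examined at most twice, which is where the O(Edges + Vertices) bound comes from (with expected constant-time set operations). In the toy graph, the triangle vertices end up in the 2-core and the pendant vertex in the 1-core.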
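The abstract mentions a greedy weighted minimum set cover algorithm for predicting a taxonomic label per read but does not give its details. The sketch below shows only the generic textbook greedy heuristic, under stated assumptions: the universe is taken to be the ORFs on a read, each candidate taxon is taken to cover a subset of those ORFs, and each weight is treated as a cost where lower is better. The names `greedy_weighted_set_cover`, `orfs`, `candidates`, and `costs` are hypothetical and are not taken from SeqScreen-Nano.

```python
def greedy_weighted_set_cover(universe, subsets, weights):
    """Generic greedy heuristic for weighted set cover.

    universe: iterable of elements to cover
    subsets:  dict mapping a label to the set of elements it covers
    weights:  dict mapping a label to a positive cost (assumption: lower is better)

    Returns (chosen labels in pick order, elements that no subset could cover).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, float("inf")
        for label, members in subsets.items():
            gain = len(members & uncovered)
            if gain == 0:
                continue
            ratio = weights[label] / gain  # cost per newly covered element
            if ratio < best_ratio:
                best, best_ratio = label, ratio
        if best is None:
            break  # nothing left can cover the remaining elements
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen, uncovered


# Toy inputs (assumed for the example): four ORFs and three candidate taxa.
orfs = {"orf1", "orf2", "orf3", "orf4"}
candidates = {
    "taxon_A": {"orf1", "orf2", "orf3"},
    "taxon_B": {"orf3", "orf4"},
    "taxon_C": {"orf4"},
}
costs = {"taxon_A": 1.0, "taxon_B": 1.0, "taxon_C": 0.9}
picked, leftover = greedy_weighted_set_cover(orfs, candidates, costs)
```

Here the heuristic first picks `taxon_A` (lowest cost per newly covered ORF) and then `taxon_C` for the remaining ORF. How a cover would be reduced to the single label per read that the abstract describes is not specified there, so that step is left out of the sketch.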
Files
Original bundle
Name: BALAJI-DOCUMENT-2023.pdf
Size: 14.03 MB
Format: Adobe Portable Document Format
License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text
Name: LICENSE.txt
Size: 2.61 KB
Format: Plain Text