GenomeDepot: Computational Methods for Decoding Biological Information Encoded in Engineered DNA and Microbial Genomes

Date
2021-12-03
Abstract

Although DNA sequencing and genome engineering have achieved great successes, our ability to fully elucidate the biological information encoded in genomic data and to fully control biological systems remains limited. My research has focused on deciphering signatures hidden in genomic data, specifically in engineered synthetic sequences and metagenomes. Recent advances in genome engineering and editing have enabled researchers to create novel genetic parts and redesign biological systems. As genome engineering develops, there is heightened awareness of its potential misuse and the associated biosafety concerns. In parallel, metagenomics now allows us to study microbial communities at unprecedented resolution. Previous efforts in this area allow us to identify the species composition of a given microbial community and estimate its metabolic functions. Despite this great progress, fine-grained knowledge of the bacteria driving microbial interactions within microbiomes is still lacking, limiting our ability to fully understand and control microbial communities. In the first part of my thesis, I developed PlasmidHawk, a linear-time, pan-genome alignment-based pipeline to predict the lab-of-origin of unknown sequences. Compared to the previous deep learning method, PlasmidHawk has higher prediction accuracy. PlasmidHawk correctly predicts the depositing lab of an unknown sequence 76% of the time, and the correct lab is among its top 10 candidates 85% of the time. In addition, PlasmidHawk can precisely single out the signature sub-sequences that are responsible for the lab-of-origin detection. PlasmidHawk represents an explainable and accurate tool for lab-of-origin prediction of synthetic plasmid sequences. In the second part of my thesis, I developed Bakdrive, a novel method for identifying driver species within microbiomes.
Bakdrive has three key innovations in this space: (i) it leverages inherent information from metagenomic sequencing samples to identify driver species, (ii) it explicitly takes host-specific variation into consideration, and (iii) it does not require a known ecological network. Using simulated and real datasets, we demonstrate that by detecting driver species in healthy donor samples and introducing them into disease samples, we can restore the gut microbiome of recurrent Clostridioides difficile infection patients to a healthy state. In summary, Bakdrive provides a novel approach for teasing apart microbial interactions and facilitates future personalized probiotic design. In conclusion, GenomeDepot represents a collection of novel, computationally efficient software tools and algorithms suited for deciphering biological information encoded in engineered and microbial genomes. Real-world applications of GenomeDepot have included lab-of-origin prediction and detection of driver species in healthy and disease-associated microbiomes, feeding back into biosecurity decisions and human health.

Degree
Doctor of Philosophy
Type
Thesis
Keywords
metagenome, synthetic biology, lab-of-origin
Citation

Wang, Qi X. "GenomeDepot: Computational Methods for Decoding Biological Information Encoded in Engineered DNA and Microbial Genomes." (2021) Diss., Rice University. https://hdl.handle.net/1911/111741.

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.