Browsing by Author "Balaji, Advait"
Item Current progress and open challenges for applying deep learning across the biosciences (Springer Nature, 2022) Sapoval, Nicolae; Aghazadeh, Amirali; Nute, Michael G.; Antunes, Dinler A.; Balaji, Advait; Baraniuk, Richard; Barberan, C.J.; Dannenfelser, Ruth; Dun, Chen; Edrisi, Mohammadamin; Elworth, R.A. Leo; Kille, Bryce; Kyrillidis, Anastasios; Nakhleh, Luay; Wolfe, Cameron R.; Yan, Zhi; Yao, Vicky; Treangen, Todd J.
Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL in five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.

Item Journey into the unknown: graph and machine learning based approaches for improved characterization of novel pathogens (2023-02-27) Balaji, Advait; Treangen, Todd J.
The advent of efficient high-throughput sequencing technologies has led to petabyte-scale genomic datasets. A significant contributor to this genomic data deluge is the field of metagenomics, which comprises the analysis of microbial communities from biological samples. Metagenomics has contributed novel insights into infectious disease spread and public health; however, scalable and accurate tools to identify and characterize sequences of interest (e.g., pathogens) from metagenomic samples remain limited.
In this thesis, we present three computational tools that encompass contributions towards novel pathogen detection via taxonomy-oblivious functional characterization of DNA sequences harbored within metagenomes. In the first part, we introduce SeqScreen, a tool that utilizes ensemble learning for sensitive functional screening of pathogenic sequences. We show that our ensemble classifier consisting of Neural Networks and Support Vector Classifiers can assign pathogenic labels known as Functions of Sequences of Concern (FunSoCs) to short read sequences. Our classifier achieves 90% precision and 82% recall on an imbalanced multi-class, multi-label classification task across 32 FunSoC labels. We highlight the advantages of FunSoCs over state-of-the-art taxonomic classifiers in distinguishing near-neighbor pathogens. We also simulate a novel-pathogen use case and show that, in contrast to other tools, SeqScreen can sensitively detect trace amounts of SARS-CoV-2 in a metagenomic sample obtained from COVID-19 patients. Second, we discuss KOMB, a software tool for reference-free characterization of function-rich Copy Number Variations (CNVs) in metagenomes. KOMB presents one of the first applications of K-core graph decomposition to metagenomes, thereby offering an exact O(Edges + Vertices) linear-time solution to identifying repeats in graph metagenomes, in contrast to state-of-the-art tools based on betweenness centrality. On a mock metagenome, KOMB offers more accurate detection of repeats across different copy numbers, offering a sample-wide characterization of CNVs. Using longitudinal metagenome data, we show that KOMB can be used to analyze and visualize shifts caused by disruptions. We also show that KOMB can identify sequences with potentially unique functional profiles using an anomaly detection method previously used to analyze social networks. Finally, we present SeqScreen-Nano, a tool for pathogen detection and identification in metagenomes using long read data.
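As an illustration of the K-core decomposition mentioned above, the following is a minimal sketch of the standard bucket-based peeling procedure, which runs in expected O(Edges + Vertices) time. It is not KOMB's implementation; the function name `core_numbers`, the adjacency-dictionary input, and the toy graph are assumptions made for this example only.

```python
def core_numbers(adj):
    """Return the core number of every vertex of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours
         (hypothetical input format; must be symmetric).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    max_deg = max(degree.values(), default=0)

    # buckets[d] holds the unprocessed vertices whose current degree is d.
    buckets = [set() for _ in range(max_deg + 1)]
    for v, d in degree.items():
        buckets[d].add(v)

    core = {}
    d = 0
    remaining = len(adj)
    while remaining:
        if not buckets[d]:
            d += 1  # no vertex left at this level, move to the next one
            continue
        v = buckets[d].pop()
        core[v] = d  # v is peeled at level d
        remaining -= 1
        for u in adj[v]:
            # Lower the degree of unprocessed neighbours, but never below d.
            if u not in core and degree[u] > d:
                buckets[degree[u]].discard(u)
                degree[u] -= 1
                buckets[degree[u]].add(u)
    return core


# Toy graph: a triangle (a, b, c) with one pendant vertex d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(core_numbers(toy))
```

In this toy graph the triangle vertices end up with core number 2 and the pendant vertex with core number 1. Each vertex is peeled once and each edge is examined a constant number of times, which is where the linear bound comes from.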
Using simulated nanopore reads from isolate genomes, we first show that the mapping stage of SeqScreen-Nano is optimized to accurately predict Open Reading Frames (ORFs) along the length of the raw nanopore read and to accurately assign functional labels in comparison to other mappers and functional characterization tools. We also propose a majority voting approach and a greedy weighted minimum-set cover algorithm to predict a single taxonomic label per read. Further, we develop a reference inference pipeline that assigns a probabilistic coverage score based on ORF assignments to accurately predict species in two mock metagenomic communities and has higher precision and recall compared to state-of-the-art taxonomic classifiers. In summary, this thesis presents efficient and accurate software for pathogen detection and de novo characterization of copy number variation. Our work presents novel computational frameworks and algorithmic applications that have the potential for broad impact across the scientific community, ranging from clinical metagenomics to microbial forensics.

Item Multiple genome alignment in the telomere-to-telomere assembly era (Springer Nature, 2022) Kille, Bryce; Balaji, Advait; Sedlazeck, Fritz J.; Nute, Michael; Treangen, Todd J.
With the arrival of telomere-to-telomere (T2T) assemblies of the human genome comes the computational challenge of efficiently and accurately constructing multiple genome alignments at an unprecedented scale. By identifying nucleotides across genomes that share a common ancestor, multiple genome alignments commonly serve as the bedrock for comparative genomics studies. In this review, we provide an overview of the algorithmic template that most multiple genome alignment methods follow. We also discuss prospective areas of improvement of multiple genome alignment for keeping up with continuously arriving high-quality T2T assembled genomes and for unlocking clinically relevant insights.