Accurate and Efficient Computational Approaches for Long-read Alignment and Genome Phasing of Human Genomes

Date
2023-12-01
Abstract

The arrival of long-read sequencing technologies has enabled analysis of human genomes at unprecedented resolution. Long-read technologies have facilitated telomere-to-telomere assembly of the human genome and shed light on difficult-to-resolve structural variations, single nucleotide variations and epigenetic modifications, all of which play a critical role in disease etiology and individual genetic diversity. Despite these technological advances, novel computational methods are still needed to fully leverage long reads. In this dissertation, I tackle three key computational questions using long-read sequences of human genomes: (1) I improve the efficiency and precision of long-read alignment, (2) I develop a novel variant phasing technique based on methylation signal, and (3) I provide a novel method for clinical analysis specific to cancer samples and tumor purity estimation. These contributions are embodied in three software tools I have developed: Vulcan, MethPhaser and MethPhaser-Cancer, respectively.

Vulcan is a read mapping pipeline that uses two distinct gap penalty modes, an approach referred to as dual-mode alignment. Read aligners before Vulcan use only one type of scoring scheme during the pairwise alignment stage, which can struggle because diversity varies across the human genome. With Vulcan’s dual-mode alignment algorithm, read-to-reference mapping quality and efficiency for Oxford Nanopore Technologies (ONT) long reads are improved on both simulated and real datasets. Notably, I also show that Vulcan improves structural variation (SV) detection: it increased the SV detection F1 score on 30X human ONT reads from 82.66% (minimap2) to 84.94%.

MethPhaser is the first method that uses methylation, an epigenetic marker, from ONT data to extend SNV-based phasing. Long-read human genomic variant phasing is limited by read length and by stretches of homozygosity along the genome. The key innovation of MethPhaser is its use of haplotype-specific long-read methylation signals. In benchmarking on human samples, MethPhaser nearly triples the phase length N50 while incurring a minimal increase in switch error, from 0.06% to 0.07%, using ONT R10 reads at 60X coverage. MethPhaser is designed as an extension to existing long-read SNV-based phasing workflows, so these gains come on top of the phasing those workflows already produce.

Building on MethPhaser, I have also developed an algorithmic extension, MethPhaser-Cancer, that uses methylation signals to assess tumor purity and to categorize reads. Tumor purity estimation is an important step in clinical treatment, both for tailoring patient-specific therapeutic strategies and in the broader context of personalized medicine. MethPhaser-Cancer identifies hypomethylated regions within human tumor samples and uses the k-means algorithm to sort the reads into two distinct groups. It is a pioneering approach in the long-read sequencing field that considers whole-genome methylation profiles in simulated clinical samples, and it can automatically estimate tumor purity and distinguish long reads within specific regions between two samples.

To conclude, this dissertation presents a set of novel and efficient approaches that enhance long-read human genomic analysis. The practical applications of Vulcan, MethPhaser and MethPhaser-Cancer include long-read alignment, human genome variant phasing and tumor purity estimation.
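
The read-grouping step described for MethPhaser-Cancer can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the dissertation's implementation: the abstract does not say which features are clustered or how purity is derived from the clusters. Here each read is assumed to be summarized by a single hypothetical value (its mean methylation fraction within a hypomethylated region), reads are split with a plain one-dimensional k-means (k = 2), and the share of reads in the lower-methylation cluster is used as a stand-in purity-like quantity. The function name `two_means_1d` and the input values are invented for illustration.

```python
def two_means_1d(values, iterations=100):
    """Lloyd's k-means with k=2 on a list of floats.

    Returns (labels, centers); cluster 0 is the lower-valued cluster.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    # Deterministic initialisation: start from the extremes.
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center
        # (ties go to cluster 0).
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Update step: recompute each center as the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(
                sum(members) / len(members) if members else centers[k]
            )
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers


# Hypothetical per-read mean methylation fractions in one region
# (made-up numbers, not data from the dissertation).
per_read_methylation = [0.82, 0.79, 0.85, 0.31, 0.28, 0.80, 0.35, 0.77]

labels, centers = two_means_1d(per_read_methylation)

# ASSUMPTION: treat the fraction of reads in the lower-methylation
# cluster as a purity-like estimate for this region.
low_fraction = labels.count(0) / len(labels)
```

With these made-up values, three of the eight reads fall into the lower-methylation cluster. A real workflow would need per-read methylation calls from the aligned reads and a principled way to combine regions, neither of which is specified in the abstract.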

Description
EMBARGO NOTE: This item is embargoed until 2024-12-01
Degree
Doctor of Philosophy
Type
Thesis
Keywords
Human Genome, Long-read, Read Alignment, Genome Phasing, Cancer Genomics
Citation

Fu, Yilei. "Accurate and Efficient Computational Approaches for Long-read Alignment and Genome Phasing of Human Genomes." (2023). PhD diss., Rice University. https://hdl.handle.net/1911/115440

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.