GEM Incorporating Context into Genomic Distance Estimation

Date
2019-06-04
Abstract

A pivotal question in computational biology is how similar two organisms are based on their genomic sequences. Unfortunately, classical sequence alignment-based methods for estimating genomic distances do not scale well to the massive number of organisms that have been sequenced to date. Recently, composition-based methods have gained interest due to their computational efficiency on massive distance estimation problems. However, these methods reduce the computation time at the cost of distorting the genomic distances. The main problem with composition-based methods is their reliance on the occurrence of length-k subsequences of the genome, known as k-mers, which ignores their ordering, i.e., their context in the genome. In this thesis, we take inspiration from computational linguistics to develop a new genomic distance estimation approach that exploits not only the frequency of the k-mers but also their context. In our Genomic distance EstiMation (GEM) algorithm, we first learn a context-aware, low-dimensional embedding for k-mers by training on a large corpus of FASTA files comprising 159 million bases of whole genome sequence data from microbial organisms in the National Center for Biotechnology Information (NCBI) repository. We then define the distance between two organisms using a generalization of the Jaccard similarity that incorporates the context-aware embedding of the constituent k-mers. A range of experiments demonstrates that GEM estimates the distance between unseen organisms with up to 2 times lower error than state-of-the-art algorithms while incurring a similar running time. As a bonus, the GEM context reveals a distinct structure in the ordering of k-mers in bacteria, viruses, and fungi, a finding that motivates follow-up evolutionary studies.
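
To make the limitation described in the abstract concrete, the following is a minimal sketch of the classical composition-based baseline: a Jaccard distance over k-mer sets. It is not the GEM algorithm, whose embedding-based generalization is defined in the thesis itself. The function names (`kmers`, `jaccard_distance`) and the toy sequences are assumptions made for illustration only.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k):
    """Composition-based distance: 1 - |A & B| / |A | B| over k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k; treat them as identical.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


# Toy sequences, made up for illustration (not real genomes).
s1 = "ACGTAC"
s2 = "GTACGT"  # same 2-mer set as s1, but arranged in a different order
s3 = "ACGTTT"  # shares three of its four distinct 2-mers with s1

d12 = jaccard_distance(s1, s2, k=2)
d13 = jaccard_distance(s1, s3, k=2)
```

Here `d12` is zero even though `s1` and `s2` are different sequences, because a set of k-mers carries no information about where each k-mer occurs. This is the loss of context that the abstract identifies as the main weakness of composition-based methods.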

Degree
Master of Science
Type
Thesis
Keywords
Machine learning, genomics, k-mers
Citation

Barberan, CJ. "GEM Incorporating Context into Genomic Distance Estimation." (2019) Master’s Thesis, Rice University. https://hdl.handle.net/1911/106144.

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.