GEM Incorporating Context into Genomic Distance Estimation

dc.contributor.advisor: Baraniuk, Richard G
dc.creator: Barberan, CJ
dc.date.accessioned: 2019-07-16T18:47:41Z
dc.date.available: 2019-07-16T18:47:41Z
dc.date.created: 2018-12
dc.date.issued: 2019-06-04
dc.date.submitted: December 2018
dc.date.updated: 2019-07-16T18:47:41Z
dc.description.abstract: A pivotal question in computational biology is how similar two organisms are based on their genomic sequences. Unfortunately, classical sequence alignment-based methods for estimating genomic distances do not scale well to the massive number of organisms that have been sequenced to date. Recently, composition-based methods have gained interest due to their computational efficiency for massive distance estimation problems. However, these methods reduce the computation time at the cost of distorting the genomic distances. The main problem with composition-based methods is their reliance on the occurrence of length-k subsequences of the genome, known as k-mers, which ignores their ordering, i.e., their context in the genome. In this thesis, we take inspiration from computational linguistics to develop a new genomic distance estimation approach that exploits not only the frequency of the k-mers but also their context. In our Genomic distance EstiMation (GEM) algorithm, we first learn a context-aware, low-dimensional embedding for k-mers by training on a large corpus of FASTA files comprising 159 million bases of whole genome sequence data from microbial organisms in the National Center for Biotechnology Information (NCBI) repository. We then define the distance between two organisms using a generalization of the Jaccard similarity that incorporates the context-aware embedding of the constituent k-mers. A range of experiments demonstrates that GEM estimates the distance between unseen organisms with up to 2 times less error than state-of-the-art algorithms while incurring a similar running time. As a bonus, the GEM context reveals a distinct structure in the ordering of k-mers in bacteria, viruses, and fungi, a finding that motivates follow-up evolutionary studies.
dc.format.mimetype: application/pdf
dc.identifier.citation: Barberan, CJ. "GEM Incorporating Context into Genomic Distance Estimation." (2019) Master’s Thesis, Rice University. https://hdl.handle.net/1911/106144.
dc.identifier.uri: https://hdl.handle.net/1911/106144
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Machine learning, genomics, k-mers
dc.title: GEM Incorporating Context into Genomic Distance Estimation
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Electrical and Computer Engineering
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Masters
thesis.degree.name: Master of Science
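
The abstract's criticism of composition-based methods can be illustrated with a minimal sketch of the plain k-mer Jaccard distance. This is not the GEM algorithm and not code from the thesis: the function names (`kmers`, `jaccard`, `jaccard_distance`) and the toy sequences are assumptions made for illustration only. The example shows two different sequences whose k-mer sets are identical, so a set-based distance that ignores ordering reports them as the same.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of length-k substrings (k-mers) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two k-mer sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def jaccard_distance(seq_a: str, seq_b: str, k: int) -> float:
    """Composition-based distance: 1 minus the Jaccard similarity of the k-mer sets."""
    return 1.0 - jaccard(kmers(seq_a, k), kmers(seq_b, k))


# Hypothetical toy sequences: different strings, same set of 2-mers.
seq_a = "ACGTAC"  # 2-mers: AC, CG, GT, TA
seq_b = "GTACGT"  # 2-mers: GT, TA, AC, CG
distance = jaccard_distance(seq_a, seq_b, 2)  # ordering is invisible to the set
```

Because only the presence of each k-mer enters the computation, any rearrangement that preserves the k-mer set leaves the distance unchanged. That lost ordering is the context the abstract says GEM recovers through learned k-mer embeddings.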
Files
Original bundle
Name: BARBERAN-DOCUMENT-2018.pdf
Size: 2.06 MB
Format: Adobe Portable Document Format
License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text
Name: LICENSE.txt
Size: 2.6 KB
Format: Plain Text