GEM Incorporating Context into Genomic Distance Estimation

dc.contributor.advisor: Baraniuk, Richard G [en_US]
dc.creator: Barberan, CJ [en_US]
dc.date.accessioned: 2019-07-16T18:47:41Z [en_US]
dc.date.available: 2019-07-16T18:47:41Z [en_US]
dc.date.created: 2018-12 [en_US]
dc.date.issued: 2019-06-04 [en_US]
dc.date.submitted: December 2018 [en_US]
dc.date.updated: 2019-07-16T18:47:41Z [en_US]
dc.description.abstract: A pivotal question in computational biology is how similar two organisms are based on their genomic sequences. Unfortunately, classical sequence alignment-based methods for estimating genomic distances do not scale well to the massive number of organisms that have been sequenced to date. Recently, composition-based methods have gained interest due to their computational efficiency for massive distance estimation problems. However, these methods reduce the computation time at the cost of distorting the genomic distances. The main problem with composition-based methods is their reliance on the occurrence of length-k subsequences of the genome, known as k-mers, which ignores their ordering, i.e., their context in the genome. In this thesis, we take inspiration from computational linguistics to develop a new genomic distance estimation approach that exploits not only the frequency of the k-mers but also their context. In our Genomic distance EstiMation (GEM) algorithm, we first learn a context-aware, low-dimensional embedding for k-mers by training on a large corpus of FASTA files comprising 159 million bases of whole-genome sequence data from microbial organisms in the National Center for Biotechnology Information (NCBI) repository. We then define the distance between two organisms using a generalization of the Jaccard similarity that incorporates the context-aware embedding of the constituent k-mers. A range of experiments demonstrates that GEM estimates the distance between unseen organisms with up to 2 times less error than state-of-the-art algorithms while incurring a similar running time. As a bonus, the GEM context reveals a distinct structure in the ordering of k-mers in bacteria, viruses, and fungi, a finding that motivates follow-up evolutionary studies. [en_US]
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Barberan, CJ. "GEM Incorporating Context into Genomic Distance Estimation." (2019) Master’s Thesis, Rice University. https://hdl.handle.net/1911/106144. [en_US]
dc.identifier.uri: https://hdl.handle.net/1911/106144 [en_US]
dc.language.iso: eng [en_US]
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. [en_US]
dc.subject: Machine learning [en_US]
dc.subject: genomics [en_US]
dc.subject: k-mers [en_US]
dc.title: GEM Incorporating Context into Genomic Distance Estimation [en_US]
dc.type: Thesis [en_US]
dc.type.material: Text [en_US]
thesis.degree.department: Electrical and Computer Engineering [en_US]
thesis.degree.discipline: Engineering [en_US]
thesis.degree.grantor: Rice University [en_US]
thesis.degree.level: Masters [en_US]
thesis.degree.name: Master of Science [en_US]
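
Illustrative note (not part of the thesis record): the abstract contrasts GEM with composition-based methods that compare genomes by their k-mer content alone. The following is a minimal Python sketch of that plain k-mer Jaccard baseline, assuming short toy DNA strings as input. The function names kmer_set and jaccard_distance are hypothetical and chosen for illustration; GEM's context-aware embedding and its generalized Jaccard similarity are not reproduced here.

```python
def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(seq_a: str, seq_b: str, k: int) -> float:
    """Plain Jaccard distance between the k-mer sets of two sequences.

    This is the order-agnostic baseline, not the GEM distance.
    """
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    if not union:
        # Both sequences are shorter than k; treat them as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy sequences (illustrative only, not real genomes).
d_same_kmers = jaccard_distance("ACGACG", "CGACGA", k=2)
d_partial = jaccard_distance("ACGA", "ACGT", k=2)
d_disjoint = jaccard_distance("ACGT", "TTTT", k=2)
```

Because only the set of k-mers is compared, "ACGACG" and "CGACGA" receive a distance of 0 even though the sequences differ. This is the loss of ordering information that the abstract identifies as the main weakness of composition-based methods.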
Files

Original bundle
Name: BARBERAN-DOCUMENT-2018.pdf
Size: 2.06 MB
Format: Adobe Portable Document Format

License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text

Name: LICENSE.txt
Size: 2.6 KB
Format: Plain Text