Baraniuk, Richard G
2019-07-16
2019-07-16
2018-12
2019-06-04
December 2
Barberan, CJ. "GEM Incorporating Context into Genomic Distance Estimation." (2019) Master’s Thesis, Rice University. https://hdl.handle.net/1911/106144.

A pivotal question in computational biology is how similar two organisms are based on their genomic sequences. Unfortunately, classical sequence alignment-based methods for estimating genomic distances do not scale well to the massive number of organisms that have been sequenced to date. Recently, composition-based methods have gained interest due to their computational efficiency for massive distance estimation problems. However, these methods reduce the computation time at the cost of distorting the genomic distances. The main problem with composition-based methods is their reliance on the occurrence of length-k subsequences of the genome, known as k-mers, which ignores their ordering, i.e., their context in the genome. In this thesis, we take inspiration from computational linguistics to develop a new genomic distance estimation approach that exploits not only the frequency of the k-mers but also their context. In our Genomic distance EstiMation (GEM) algorithm, we first learn a context-aware, low-dimensional embedding for k-mers by training on a large corpus of FASTA files comprising 159 million bases of whole genome sequence data from microbial organisms in the National Center for Biotechnology Information (NCBI) repository. We then define the distance between two organisms using a generalization of the Jaccard similarity that incorporates the context-aware embedding of the constituent k-mers. A range of experiments demonstrates that GEM estimates the distance between unseen organisms with up to 2 times less error than state-of-the-art algorithms while incurring a similar running time. As a bonus, the GEM context reveals a distinct structure in the ordering of k-mers in bacteria, viruses, and fungi, a finding that motivates follow-up evolutionary studies.

application/pdf
eng
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
Machine learning, genomics, k-mers
GEM Incorporating Context into Genomic Distance Estimation
Thesis
2019-07-16
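
The abstract contrasts GEM with composition-based methods that compare genomes through their k-mer content alone. The sketch below illustrates only that classical baseline, a plain Jaccard distance over k-mer sets, and is not the GEM algorithm or the thesis code. The function names `kmers` and `jaccard_distance`, the default `k=3`, and the toy sequences are assumptions made for illustration.

```python
def kmers(sequence, k):
    """Return the set of length-k substrings (k-mers) of a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=3):
    """Classical Jaccard distance between the k-mer sets of two sequences.

    This is the order-agnostic baseline, not GEM's context-aware
    generalization (assumption: k-mer presence/absence, no counts).
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two different sequences that share exactly the same set of 3-mers.
same_kmers = jaccard_distance("ACGTACGT", "TACGTACG", k=3)

# Two sequences with no 3-mers in common.
disjoint = jaccard_distance("AAAA", "CCCC", k=3)
```

In this toy case the two rearranged sequences contain the same 3-mers, so the set-based distance cannot tell them apart. This is the loss of ordering information that the abstract identifies as the main weakness of composition-based methods.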