Machine Learning in Large-scale Genomics: Sensing, Processing, and Analysis

dc.contributor.advisor: Baraniuk, Richard G (en_US)
dc.creator: Aghazadeh Mohandesi, Amir Ali (en_US)
dc.date.accessioned: 2017-08-01T18:39:19Z (en_US)
dc.date.available: 2018-05-01T05:01:09Z (en_US)
dc.date.created: 2017-05 (en_US)
dc.date.issued: 2017-04-19 (en_US)
dc.date.submitted: May 2017 (en_US)
dc.date.updated: 2017-08-01T18:39:19Z (en_US)
dc.description.abstract: Advances in genomics, the branch of biology concerned with the structure, function, and evolution of genomes, have led to dramatic reductions in the price of sequencing machines. As a result, torrents of genomic data are being produced every day, posing huge challenges and opportunities for engineers, scientists, and researchers in various fields. Here, we propose novel machine learning tools and algorithms to sense, process, and analyze large-scale genomic data more efficiently. To begin with, we develop a novel universal microbial diagnostics (UMD) platform to sense microbial organisms in an infectious sample, using a small number of random DNA probes that are agnostic to the target genomic DNA sequences. Our platform leverages the theory of sparse signal recovery (compressive sensing) to identify the composition of a microbial sample that potentially contains thousands of novel or mutant species. We next develop a new sensor selection algorithm that finds the subset of sensors that best recovers a sparse vector in sparse recovery problems. Our proposed algorithm, Insense, minimizes a coherence-based cost function adapted from classical results in sparse recovery theory, and it outperforms traditional selection algorithms in finding optimal DNA probes for the microbial diagnostics problem. Inspired by recent progress in robust optimization, we then develop a novel hashing algorithm, dubbed RHash, that minimizes the worst-case distortion among pairs of points in a dataset using an $\ell_\infty$-norm minimization technique. We develop practical and efficient implementations of RHash, based on the alternating direction method of multipliers (ADMM) framework and column generation, that scale well to large datasets. Finally, we develop a novel machine learning algorithm that uses techniques from the deep learning and natural language processing literature to embed DNA sequences of arbitrary length into a single low-dimensional space. Our so-called Kmer2Vec platform learns biological concepts such as drug resistance by parsing raw DNA sequences of microbial organisms with no prior biology knowledge. (en_US)
dc.embargo.terms: 2018-05-01 (en_US)
dc.format.mimetype: application/pdf (en_US)
dc.identifier.citation: Aghazadeh Mohandesi, Amir Ali. "Machine Learning in Large-scale Genomics: Sensing, Processing, and Analysis." (2017) Diss., Rice University. https://hdl.handle.net/1911/96111. (en_US)
dc.identifier.uri: https://hdl.handle.net/1911/96111 (en_US)
dc.language.iso: eng (en_US)
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. (en_US)
dc.subject: Compressive Sensing (en_US)
dc.subject: Universal Microbial Diagnostics (en_US)
dc.subject: Hashing (en_US)
dc.subject: Sensor Selection (en_US)
dc.subject: DNA Embedding (en_US)
dc.title: Machine Learning in Large-scale Genomics: Sensing, Processing, and Analysis (en_US)
dc.type: Thesis (en_US)
dc.type.material: Text (en_US)
thesis.degree.department: Electrical and Computer Engineering (en_US)
thesis.degree.discipline: Engineering (en_US)
thesis.degree.grantor: Rice University (en_US)
thesis.degree.level: Doctoral (en_US)
thesis.degree.name: Doctor of Philosophy (en_US)
Files
Original bundle
Name: AGHAZADEHMOHANDESI-DOCUMENT-2017.pdf
Size: 3.02 MB
Format: Adobe Portable Document Format
License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.86 KB
Format: Plain Text

Name: LICENSE.txt
Size: 2.62 KB
Format: Plain Text