Resource-Efficient Machine Learning via Count-Sketches and Locality-Sensitive Hashing (LSH)

dc.contributor.advisor: Shrivastava, Anshumali
dc.creator: Spring, Ryan Daniel
dc.date.accessioned: 2020-04-27T19:24:37Z
dc.date.available: 2020-04-27T19:24:37Z
dc.date.created: 2020-05
dc.date.issued: 2020-04-24
dc.date.submitted: May 2020
dc.date.updated: 2020-04-27T19:24:38Z
dc.description.abstract: Machine learning problems are increasing in complexity, and models are growing correspondingly larger to handle these datasets (e.g., large-scale transformer networks for language modeling). The growth in the number of input features, model size, and output classification space is straining our limited computational resources. Given vast amounts of data and limited computational resources, how do we scale machine learning algorithms to gain meaningful insights? Randomized algorithms are an essential tool in our algorithmic toolbox for solving these challenges. They achieve significant improvements in computational cost or memory usage by incurring some approximation error. They work because most large-scale datasets follow a power-law distribution in which a small subset of the data contains most of the information, so we can avoid wasting computational resources by focusing only on the most relevant items. In this thesis, we explore how to use locality-sensitive hashing (LSH) and the count-sketch data structure to address the computational and memory challenges in four distinct areas. (1) The LSH Sampling algorithm uses the LSH data structure as an adaptive sampler; we demonstrate this approach by accurately estimating the partition function in large output spaces. (2) MISSION is a large-scale feature-extraction algorithm that uses the count-sketch data structure to store a compressed representation of the entire feature space. (3) The Count-Sketch Optimizer minimizes the memory footprint of popular first-order gradient optimizers (e.g., Adam, Adagrad, Momentum). (4) Finally, we show the usefulness of our compressed-memory optimizer by efficiently training a synthetic question generator, which uses large-scale transformer networks to generate high-quality, human-readable question-answer pairs. (An illustrative count-sketch example is given after the metadata record below.)
dc.format.mimetype: application/pdf
dc.identifier.citation: Spring, Ryan Daniel. "Resource-Efficient Machine Learning via Count-Sketches and Locality-Sensitive Hashing (LSH)." (2020) Diss., Rice University. https://hdl.handle.net/1911/108402.
dc.identifier.uri: https://hdl.handle.net/1911/108402
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Deep Learning
dc.subject: Machine Learning
dc.subject: Locality-Sensitive Hashing
dc.subject: Count-Sketch
dc.subject: Stochastic Optimization
dc.subject: Natural Language Processing
dc.subject: Question Answering
dc.subject: Question Generation
dc.subject: Meta-genomics
dc.subject: Feature Selection
dc.subject: Mutual Information
dc.subject: Importance Sampling
dc.subject: Partition
dc.title: Resource-Efficient Machine Learning via Count-Sketches and Locality-Sensitive Hashing (LSH)
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
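
Illustrative example: the count-sketch data structure named in the abstract is the compressed table behind both MISSION and the Count-Sketch Optimizer. The following is a minimal Python sketch under assumed names and parameters (CountSketch, depth, width, an MD5-based hash), not the implementation described in the thesis; it only shows the general idea of storing signed counts in a small fixed-size table and recovering approximate values by a median of signed estimates.

# Minimal, illustrative count-sketch (not the thesis code): a depth x width
# array of counters with per-row bucket and sign hashes. Each update adds a
# signed value to one counter per row; a query returns the median of the
# signed counter readings, giving a low-memory approximate count.
import hashlib
import statistics

class CountSketch:
    def __init__(self, depth=5, width=2**16):
        self.depth = depth          # number of independent hash rows
        self.width = width          # counters per row
        self.table = [[0.0] * width for _ in range(depth)]

    def _hash(self, key, row):
        h = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
        bucket = int(h[:8], 16) % self.width        # which counter in this row
        sign = 1 if int(h[8], 16) % 2 == 0 else -1  # +/-1 sign hash
        return bucket, sign

    def update(self, key, value=1.0):
        # e.g., accumulate a feature count or a gradient component
        for row in range(self.depth):
            bucket, sign = self._hash(key, row)
            self.table[row][bucket] += sign * value

    def query(self, key):
        # median of the signed estimates across rows
        estimates = []
        for row in range(self.depth):
            bucket, sign = self._hash(key, row)
            estimates.append(sign * self.table[row][bucket])
        return statistics.median(estimates)

# Usage: track counts for a feature space far larger than the table itself.
cs = CountSketch()
for token in ["ACGT", "TTGA", "ACGT", "GGCC", "ACGT"]:
    cs.update(token)
print(round(cs.query("ACGT")))   # approximately 3

The memory cost is fixed at depth * width counters regardless of how many distinct keys are inserted, which is why the same structure can compress either a massive feature space or an optimizer's auxiliary state.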
Files

Original bundle
Name: SPRING-DOCUMENT-2020.pdf
Size: 5.7 MB
Format: Adobe Portable Document Format

License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text

Name: LICENSE.txt
Size: 2.6 KB
Format: Plain Text