Resource-Efficient Machine Learning via Count-Sketches and Locality-Sensitive Hashing (LSH)

Date
2020-04-24
Abstract

Machine learning problems are increasing in complexity, and models are growing correspondingly larger to handle the associated datasets (e.g., large-scale transformer networks for language modeling). The growth in the number of input features, model size, and output classification space is straining limited computational resources.

Given vast amounts of data and limited computational resources, how do we scale machine learning algorithms to gain meaningful insights? Randomized algorithms are an essential tool for solving these challenges. They achieve significant reductions in computational cost or memory usage in exchange for some approximation error. They work because most large-scale datasets follow a power-law distribution, in which a small subset of the data carries most of the information. Therefore, we can avoid wasting computational resources by focusing only on the most relevant items.

In this thesis, we explore how to use locality-sensitive hashing (LSH) and the count-sketch data structure to address the computational and memory challenges in four distinct areas. (1) The LSH Sampling algorithm uses the LSH data structure as an adaptive sampler. We demonstrate this approach by accurately estimating the partition function in large output spaces. (2) MISSION is a large-scale feature extraction algorithm that uses the count-sketch data structure to store a compressed representation of the entire feature space. (3) The Count-Sketch Optimizer is an algorithm for minimizing the memory footprint of popular first-order gradient optimizers (e.g., Adam, Adagrad, Momentum). (4) Finally, we show the usefulness of our compressed-memory optimizer by efficiently training a synthetic question generator, which uses large-scale transformer networks to generate high-quality, human-readable question-answer pairs.
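
For readers unfamiliar with the count-sketch data structure mentioned above, the following is a minimal, generic sketch in Python. It is not the thesis implementation: the class name `CountSketch`, the `depth` and `width` parameters, and the use of `hashlib.blake2b` for the per-row hashes are assumptions chosen for illustration. It shows the core idea of storing many keyed values in a small fixed table of signed counters and recovering an approximate value with a median over rows.

```python
import hashlib
from statistics import median


class CountSketch:
    """Illustrative count-sketch: `depth` rows of `width` signed counters.

    Hypothetical helper for exposition only; parameter names are assumptions.
    """

    def __init__(self, depth=5, width=256):
        self.depth = depth
        self.width = width
        # Memory is depth * width counters, independent of the number of keys.
        self.table = [[0.0] * width for _ in range(depth)]

    def _bucket_and_sign(self, key, row):
        # One salted hash per row yields both a bucket index and a +/-1 sign.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        value = int.from_bytes(digest, "little")
        sign = 1 if value & 1 else -1
        bucket = (value >> 1) % self.width
        return bucket, sign

    def update(self, key, delta):
        # Add a signed copy of `delta` to one counter in every row.
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(key, row)
            self.table[row][bucket] += sign * delta

    def estimate(self, key):
        # Undo the sign in each row and take the median to damp collisions.
        readings = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(key, row)
            readings.append(sign * self.table[row][bucket])
        return median(readings)


# Usage: accumulate per-feature values, then query one feature.
sketch = CountSketch(depth=5, width=64)
sketch.update("feature_42", 1.5)
sketch.update("feature_42", 0.5)
approx = sketch.estimate("feature_42")
```

When only a few keys carry large values, as under the power-law assumption described in the abstract, collisions with the many small-valued keys tend to contribute little noise, which is why a table much smaller than the key space can still give useful estimates for the heavy items.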

Description
Degree
Doctor of Philosophy
Type
Thesis
Keywords
Deep Learning, Machine Learning, Locality-Sensitive Hashing, Count-Sketch, Stochastic Optimization, Natural Language Processing, Question Answering, Question Generation, Meta-genomics, Feature Selection, Mutual Information, Importance Sampling, Partition
Citation

Spring, Ryan Daniel. "Resource-Efficient Machine Learning via Count-Sketches and Locality-Sensitive Hashing (LSH)." (2020) Diss., Rice University. https://hdl.handle.net/1911/108402.

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.