Medini, Tharun. "Randomized Algorithms for Training Deep Models with Large Outputs." Diss., Rice University, 2022. https://hdl.handle.net/1911/113541

Keywords: Information Retrieval; Deep Learning; Large Scale Machine Learning; Count Min Sketch; Locality Sensitive Hashing

In the last decade, it has been shown that many hard AI tasks, especially in NLP, can be naturally modeled as extreme classification problems, leading to improved precision. However, such models are prohibitively expensive to train due to the memory blow-up in the last layer. As an example, we delve into a real Amazon Search dataset, for which a simple fully connected neural network with a reasonably sized hidden layer can easily exceed 100 billion parameters (> 400 GB of memory). This memory requirement is too large to fit even on a very expensive NVIDIA DGX box equipped with 8 V100 GPUs, each with 32 GB of RAM. To cater to problems of this scale, my work presents several principled solutions, building on a fundamental algorithm called Merged-Average Classifiers via Hashing (MACH). MACH is a generic K-class classification algorithm whose memory provably scales as O(log K) without any strong assumptions on the classes. This thesis is divided into three main chapters.

The first chapter is 'Extreme Classification in Log Memory', in which we rethink the problem of Extreme Classification (or Extreme Multi-label Learning, XML) as a sketching problem. MACH is subtly a count-min sketch structure in disguise: it uses universal hashing to reduce classification with a large number of classes to a few embarrassingly parallel, independent classification tasks with a small (constant) number of classes. MACH naturally provides a technique for zero-communication model parallelism. Evaluated on 6 datasets, some multiclass and some multilabel, MACH shows consistent improvement over the respective state-of-the-art baselines. In particular, we train an end-to-end deep classifier on a private product search dataset sampled from the Amazon Search Engine with 70 million queries and 49.46 million products. MACH outperforms, by a significant margin, the state-of-the-art extreme classification models deployed on commercial search engines: Parabel and DSSM (Deep Semantic Search Model). The largest model that we trained has 6.4 billion parameters and takes less than 35 hours to train on a single p3.16x machine. Our training times are 7-10 times faster, and our memory footprints are 2-4 times smaller, than those of the best baselines. This training time is also significantly lower than the one reported by Google's mixture-of-experts (MoE) language model on a comparable model size and hardware.
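To make the count-min-sketch view concrete, here is a minimal NumPy sketch of the MACH reduction: R independent universal hashes map the K classes into B buckets, R small classifiers are trained independently (with zero communication between them), and a class is scored by averaging the probabilities of its buckets across repetitions. The values of K, B, R, d and the hash family below are illustrative assumptions, not the thesis configuration, and the random heads stand in for trained networks.

```python
# Minimal, illustrative sketch of the MACH reduction (not the thesis code).
import numpy as np

K = 1_000_000   # number of classes
B = 1_000       # buckets per repetition (B << K)
R = 16          # independent repetitions; O(log K) suffices in theory
d = 128         # feature / hidden dimension

rng = np.random.default_rng(0)

# One 2-universal-style hash per repetition: class id -> bucket id.
PRIME = 2_147_483_647
a = rng.integers(1, PRIME, size=R)
b = rng.integers(0, PRIME, size=R)

def bucket(r, class_id):
    return int((a[r] * class_id + b[r]) % PRIME % B)

# R small last layers (d x B each) replace one huge d x K layer; each head
# can live on its own GPU and be trained with no communication.
W = [rng.normal(scale=0.01, size=(d, B)) for _ in range(R)]

def class_scores(x, candidate_classes):
    """Score candidate classes by averaging their bucket probabilities over the R heads."""
    probs = []
    for r in range(R):
        logits = x @ W[r]
        p = np.exp(logits - logits.max())   # softmax over B buckets
        probs.append(p / p.sum())
    return {c: float(np.mean([probs[r][bucket(r, c)] for r in range(R)]))
            for c in candidate_classes}

x = rng.normal(size=d)  # a query representation from the feature layers
print(class_scores(x, candidate_classes=[3, 42, 999_999]))
```

Because each of the R heads only models B buckets, the last-layer memory is O(R·B·d) rather than O(K·d), which is where the logarithmic scaling in K comes from.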
In the second chapter, we realize that MACH is effectively a variant of an embedding model, with the critical difference that it trains high-dimensional sparse embeddings (contrary to the usual low-dimensional dense embedding models). Dense embedding models are commonly deployed in commercial search engines, wherein all the document vectors are pre-computed and a near-neighbor search (NNS) is performed with the query vector to find relevant documents. However, the bottleneck of indexing a large number of dense vectors and performing NNS hurts both the query time and the accuracy of these models. In this chapter, we argue that high-dimensional, ultra-sparse embeddings are a significantly superior alternative to dense low-dimensional embeddings for both query efficiency and accuracy. Extreme sparsity eliminates the need for NNS by replacing it with simple lookups (see the lookup sketch below), while high dimensionality ensures that the embeddings remain informative even when sparse. However, learning extremely high-dimensional embeddings leads to a blow-up in the model size. To make training feasible, we use MACH's partitioning approach, which learns such high-dimensional embeddings across multiple GPUs without any communication. This yields a novel asymmetric mixture of Sparse, Orthogonal, Learned And Random (SOLAR) embeddings. The label vectors are random, sparse, and near-orthogonal by design, while the query vectors are learned and sparse. We theoretically prove that SOLAR's one-sided learning is equivalent to learning both query and label embeddings. Thanks to these unique properties, we can successfully train 500K-dimensional SOLAR embeddings for the tasks of searching through 1.6M books and multi-label classification on the three largest public XML datasets. We achieve superior precision and recall compared to the respective state-of-the-art baselines for each task, with up to 10 times faster training and 2 times faster inference.

In the third and final chapter, we discuss the challenges that representation learning has brought into Information Retrieval (IR). Learning-to-Index (LTI) has emerged as a key technique for solving most IR problems. Since query time is the most critical aspect affecting the scalability of a search system, there is an inherent tension between accuracy, scalability, and the ability to load-balance in distributed settings. We discuss an algorithm called Iterative Repartitioning for Learning to Index (IRLI), in which we retain the best features of SOLAR (sparsity and load balance) while additionally obtaining the Locality Sensitive Hashing (LSH) property. IRLI iteratively refines partitions of items by learning the relevant buckets directly from the query-item relevance data. To ensure that the buckets are balanced, IRLI uses the power-of-k-choices strategy (also sketched below). Due to its design, IRLI can be used for both extreme classification and near-neighbor retrieval. In practice, IRLI surpasses the best baseline's precision for multi-label classification while being 5 times faster at inference.
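Below is a minimal sketch of the lookup-based retrieval that extreme sparsity enables: each label gets a random set of active coordinates in a very high-dimensional space (near-orthogonal because coordinate collisions are rare), an inverted index maps coordinates back to labels, and a query is resolved by counting coordinate overlaps instead of running NNS. The dimensionality, sparsity level, and the hand-built sparse query below are illustrative assumptions; in SOLAR the query side is a learned sparse encoder.

```python
# Illustrative sketch of retrieval with sparse, random, near-orthogonal label
# codes in the spirit of SOLAR (all sizes are assumptions, not thesis settings).
import numpy as np
from collections import defaultdict

D = 500_000        # embedding dimensionality (high)
m = 32             # nonzeros per label code (ultra-sparse)
num_labels = 100_000

rng = np.random.default_rng(1)

# Random sparse label codes: with m << D, two labels rarely share coordinates,
# so the codes are near-orthogonal by construction.
label_coords = [rng.choice(D, size=m, replace=False) for _ in range(num_labels)]

# Inverted index: coordinate -> labels whose code activates it.
inverted = defaultdict(list)
for label, coords in enumerate(label_coords):
    for c in coords:
        inverted[int(c)].append(label)

def retrieve(query_active_coords, top_k=5):
    """Replace near-neighbor search with simple lookups:
    count how many active coordinates each label shares with the query."""
    votes = defaultdict(int)
    for c in query_active_coords:
        for label in inverted.get(int(c), []):
            votes[label] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])[:top_k]

# Pretend a learned sparse query encoder fired on half of label 7's coordinates
# plus some noise coordinates; label 7 should dominate the vote count.
query = list(label_coords[7][:16]) + list(rng.choice(D, size=16, replace=False))
print(retrieve(query))
```

Because each label activates only m of the D coordinates, the per-query cost depends on the query's active coordinates and the sizes of the touched buckets, not on the total number of labels.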
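Below is a minimal sketch of the power-of-k-choices balancing step mentioned for IRLI: each item looks at its k highest-scoring buckets and is assigned to the least loaded among them, which keeps partition sizes even. The random scores stand in for the learned query-item relevance model, and all sizes are illustrative assumptions.

```python
# Illustrative sketch of power-of-k-choices bucket assignment (not IRLI's code).
import numpy as np

num_items = 50_000
num_buckets = 1_000
k = 4                      # each item considers its k best-scoring buckets

rng = np.random.default_rng(2)

loads = np.zeros(num_buckets, dtype=int)
assignment = np.empty(num_items, dtype=int)

for item in range(num_items):
    # Stand-in for this item's learned relevance scores over all buckets.
    scores = rng.random(num_buckets)
    # Take the item's top-k buckets by score ...
    top_k = np.argpartition(scores, -k)[-k:]
    # ... and, among those, the currently least-loaded one.
    chosen = top_k[np.argmin(loads[top_k])]
    assignment[item] = chosen
    loads[chosen] += 1

print("max bucket load:", loads.max(), "mean load:", loads.mean())
```

In the method described above, an assignment step like this would alternate with relearning the bucket predictor from query-item relevance data, which is the iterative repartitioning the name refers to.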