Browsing by Author "Shrivastava, Anshumali"
Now showing 1 - 20 of 28
Item: AI and algorithms for ultra large scale information retrieval systems (2024-02-20)
Gupta, Gaurav; Shrivastava, Anshumali

Information Retrieval (IR) entails the task of matching a specific query to a single item or a group of items within a database of potential matches. This pivotal operation serves numerous applications, such as genome sequence and document search, near-neighbor search, and classification. The exponential increase in data generation - encompassing text, audio, images, DNA sequences, and time series - from sources like sensors and the web has resulted in data volumes exceeding standard hardware's storage capabilities. Additionally, modern AI applications rely on substantial data quantities, ranging from gigabytes to terabytes, with key-label pair sets spanning billions to trillions of entries. Hashing-based algorithms have shown promise in enabling efficient large-scale retrieval due to their low latency and minimal hardware storage requirements. This research examines two search modalities: exact-match search and similarity search. We propose RAMBO, a Bloom-filter-based sub-linear search index for exact-match search. Particularly potent in genome sequence searches, this efficient algorithm indexes terabytes of DNA data for E. coli species, dramatically reducing indexing times from weeks to hours and search times from hours to seconds. This progress empowers any laboratory to search extensive genome archives using standard computers. We extend this work with IDL (Identity with Locality), a hash function that preserves the spatial locality and identity of keys for cache-efficient exact-match queries. Conversely, recommendation engines often lean on similarity matching when handling real-world data such as text, images, and audio. In this context, we introduce BLISS (Balanced Index for Scalable Search), a mechanism capable of learning and associating any two real-world data entities within an acceptable approximation. BLISS can index a billion items through an iterative learning algorithm, and its high accuracy, coupled with a small memory footprint, speeds up retrieval by a factor of 5 compared to current benchmarks. Moreover, BLISS offers the dual functionality of near-neighbor search and extreme classification. In practical retrieval engines, similarity match and exact match form the core components, with embeddings and filters serving as the query. Current techniques involve a cascade of vector similarity match and exact match on filters, which often leads to slower and less precise results. To address this, we introduce CAPS (Constrained Approximate Partition Search), a unified, single-stage index designed for filter-based near-neighbor search. Our findings demonstrate that CAPS not only streamlines the search process but also significantly improves the efficiency of Amazon's search system.
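RAMBO's building block is the Bloom filter, which answers set-membership queries in sub-linear memory at the cost of occasional false positives. As a point of reference only, a minimal Bloom filter over genome k-mers might look like the following sketch; the class, parameters, and hashing scheme are illustrative and not RAMBO's actual implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: sub-linear-memory set membership with false positives."""
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _hashes(self, key: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str):
        for h in self._hashes(key):
            self.bits[h] = 1

    def __contains__(self, key: str) -> bool:
        # Never a false negative; false positive rate depends on the load.
        return all(self.bits[h] for h in self._hashes(key))

# Index all 31-mers of a sequence, then answer exact-match queries.
bf = BloomFilter(num_bits=1 << 20, num_hashes=4)
seq = "ACGTACGTTAGC" * 10
for i in range(len(seq) - 30):
    bf.add(seq[i:i + 31])
print(seq[0:31] in bf)   # True
print("A" * 31 in bf)    # almost certainly False
```

RAMBO's contribution is in how many such filters are arranged over groups of datasets so that a query needs only a few probes, which is where the sub-linear search time comes from.
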
Item: Cache-Efficient Graph Algorithms for Near Neighbor Search (2021-11-01)
Coleman, Ben; Shrivastava, Anshumali

Graph search has recently become one of the most successful algorithmic trends for near neighbor search. Several of the most popular and empirically successful algorithms are, at their core, a simple walk along a pruned near neighbor graph. Such methods consistently outperform other approaches and are a central component of industrial-scale information retrieval and recommendation systems. However, graph algorithms often suffer from issues related to the memory access pattern of graph traversal. Our measurements show that near neighbor search is no exception to this rule: popular graph indices have poor cache performance and rely on complex heuristics with a large memory cost. To address this problem, we apply graph reordering methods to near neighbor graphs. Graph reordering is a memory layout optimization that groups commonly-accessed nodes together in memory. We present exhaustive experiments that apply several reordering algorithms to the hierarchical navigable small-world (HNSW) graph, and we analyze the algorithms under the ideal cache model. We find that reordering improves the query time by up to 40%. We also demonstrate that popular heuristics can be replaced by simpler alternatives with no performance loss, and we show that the time needed to reorder the graph is negligible compared to the time required to construct the index.
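To make the reordering idea concrete: a breadth-first relabeling, which places each node's neighbors at nearby memory addresses, is one of the cheapest members of this family. The sketch below is illustrative and is not one of the specific reordering algorithms evaluated above.

```python
from collections import deque

def bfs_reorder(adj: list[list[int]]) -> list[int]:
    """Return a permutation that relabels nodes in BFS order, so neighbors
    (which graph search tends to visit together) sit close in memory."""
    n = len(adj)
    new_id = [-1] * n
    next_id = 0
    for start in range(n):
        if new_id[start] != -1:
            continue
        new_id[start] = next_id; next_id += 1
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if new_id[v] == -1:
                    new_id[v] = next_id; next_id += 1
                    queue.append(v)
    return new_id

def apply_order(adj, new_id):
    out = [[] for _ in adj]
    for u, nbrs in enumerate(adj):
        out[new_id[u]] = sorted(new_id[v] for v in nbrs)
    return out

adj = [[3], [2], [1, 3], [0, 2]]        # a toy near neighbor graph
print(apply_order(adj, bfs_reorder(adj)))
```
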
Item: Camera-based positioning system using learning (2021-05-04)
Shrivastava, Anshumali; Luo, Chen; Palem, Krishna; Moon, Yongshik; Noh, Soonhyun; Park, Daedong; Hong, Seongsoo; Rice University; Seoul National University R&DB Foundation; United States Patent and Trademark Office

A device, system, and methods are described to perform machine-learning camera-based indoor mobile positioning. The indoor mobile positioning may utilize inexact computing, wherein a small decrease in accuracy is traded for significant computational efficiency. Hence, the positioning may be performed with a smaller memory overhead, at a faster rate, and at a lower energy cost than previous implementations. The positioning may not involve any communication (or data transfer) with any other device or the cloud, providing privacy and security to the device. A hashing-based image matching algorithm may be used which is cheaper, in both energy and computation cost, than existing state-of-the-art matching techniques. This significant reduction allows end-to-end computation to be performed locally on the mobile device. The ability to run the complete algorithm on the mobile device may eliminate the need for the cloud, resulting in a privacy-preserving localization algorithm by design, since network communication with other devices may not be required.

Item: Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity (2022-04-21)
Yan, Minghao; Shrivastava, Anshumali

More than 70% of cloud computing is paid for but sits idle. A large fraction of this idle compute consists of cheap CPUs with few cores that are not utilized during the less busy hours. This paper aims to enable those CPU cycles to train heavyweight AI models. Our goal runs counter to mainstream frameworks, which focus on leveraging expensive, specialized, ultra-high-bandwidth interconnects to address the communication bottleneck in distributed neural network training. This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth. We build upon the adaptive sparse training framework introduced by the SLIDE algorithm. By carefully deploying sparsity over distributed nodes, we demonstrate several orders of magnitude faster model-parallel training than Horovod, the main engine behind most commercial software. We show that with reduced communication, due to sparsity, we can train a model with close to a billion parameters on simple 4-16 core CPU nodes connected by basic low-bandwidth interconnect. Moreover, the training time is on par with some of the best hardware accelerators.
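The SLIDE-style adaptive sparsity underlying this framework selects, per input, only the neurons whose weight vectors fall in the same LSH bucket as the activation. A toy version of that selection step, assuming SimHash as the LSH family and a single hash table (real implementations use several), might be:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_neurons, num_bits = 64, 10_000, 12

W = rng.standard_normal((num_neurons, d))    # one weight vector per neuron
planes = rng.standard_normal((num_bits, d))  # SimHash hyperplanes

def simhash(x):
    return int("".join("1" if p @ x > 0 else "0" for p in planes), 2)

# Build the hash table once: bucket -> neurons whose weights land there.
table = {}
for j in range(num_neurons):
    table.setdefault(simhash(W[j]), []).append(j)

# Per input, activate only the colliding neurons (the dynamic sparse set).
x = rng.standard_normal(d)
active = table.get(simhash(x), [])
sparse_out = {j: W[j] @ x for j in active}   # compute only these dot products
print(f"activated {len(active)} of {num_neurons} neurons")
```

The distributed setting then only needs to communicate the few active neurons per node, which is the source of the reduced bandwidth requirement described above.
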
Item: Dynamic Sparsity for Efficient Machine Learning (2024-04-15)
Liu, Zichang; Shrivastava, Anshumali

Over the past decades, machine learning (ML) models have delivered remarkable accomplishments in various applications. For example, large language models have ushered in a new wave of excitement in artificial intelligence. Interestingly, these accomplishments also unveil the scaling law in machine learning: larger models, equipped with more parameters and trained on more extensive datasets, often significantly outperform their smaller counterparts. However, the trend of increasing model size inevitably introduces unprecedented computational resource requirements, creating substantial challenges in model training and deployment. This thesis aims to improve the efficiency of ML models through algorithmic advancements. Specifically, we exploit the dynamic sparsity pattern inside ML models to achieve efficiency goals. Dynamic sparsity refers to the subset of parameters or activations that are important for a given input, and different inputs may have different dynamic sparsity patterns. We advocate identifying the dynamic sparsity pattern for each input and focusing computation and memory resources on it. The first part of this thesis centers on the inference stage. We verify the existence of dynamic sparsity in trained ML models, namely within the classification layer, attention mechanism, and transformer layers of trained models. Further, we demonstrate that such dynamic sparsity can be cheaply predicted for each input and leveraged to improve inference efficiency. The subsequent part of the dissertation shifts its focus to the training stage, where dynamic sparsity emerges as a tool to mitigate catastrophic forgetting and data heterogeneity in federated learning and thereby improve training efficiency.

Item: Elastic Parameter Memory for Efficient Machine Learning (2024-08-08)
Desai, Aditya; Shrivastava, Anshumali

Standard machine learning (ML) models are known to have redundancies. We have repeatedly seen that ML models can be sparsified or quantized, and that low-rank components can replace parts of the model, often without affecting its quality. Extracting these redundancies is not just a curiosity but a necessity in the era of ever-increasing model sizes and the exorbitant costs required to train and deploy ML models. Sparsity, quantization, and low-rank methods have been the key themes at the core of many approaches proposed for efficiency in the past five years. However, as we will see in this thesis, these methods are limited in the amount of efficiency they can bring. In this work, we introduce a novel approach, Elastic Parameter Memory (EM), which repurposes traditionally data-consuming probabilistic algorithms and data structures to the setting of learning, where we learn compact representations of ML models. EM is an example of the confluence of probabilistic algorithms and data structures with ML, which opens up new research areas and unlocks the potential to push the efficiency frontiers in ML. The core idea in EM is hashing-based weight retrieval, enabling parameter space multiplexing. The majority of this thesis is about developing EM by solving critical issues motivated by the practical application of EM to real systems. Our contributions are multifold. (1) Memory-bandwidth-efficient hash functions: the randomized hash functions that provide probabilistic algorithms and data structures with accuracy guarantees also cause severe cache performance deterioration in EM. In the process of developing cache-efficient hash functions, we discover a new class of hash functions that is not only cache-efficient but also strictly better than standard hash functions for projection. (2) Parameter multiplexing: to optimize parameter efficiency, we devise a memory multiplexing approach where all the modules of the model share parameter space. (3) Stability of training: we show that naively using EM can lead to unstable convergence, and we devise a gradient scaling mechanism that provably removes this instability. (4) Optimal parameter usage in EM: we devise hash functions that optimally use the parameters in EM without compromising the quality or cache efficiency of EM. We also theoretically and empirically contrast and combine EM with popular efficiency approaches. We show that in terms of parameter memory efficiency, EM is theoretically strictly better than the popular sparsity approach. While the analysis is restricted to linear models, the results carry over to deep learning, as confirmed via rigorous empirical evaluation. Quantization is a sharper compression technique: while it maintains most of the accuracy at lower regimes of compression, the quality deteriorates quickly at higher compression ratios, and we cannot obtain more than $16\times$ compression (for 16-bit precision). However, it can be combined with EM to obtain efficiency not demonstrated by either of the methods alone. We show this theoretically in the dimensionality reduction setup. Furthermore, we find that with particular choices of hash functions, EM can even reduce the computational workload of machine learning. We additionally explore how EM can provide a single backbone for heterogeneous model training, where different-sized models are deployed on different systems, and show applications in federated learning. We demonstrate the practical implications of EM, showing that it can significantly reduce the memory utilization, bandwidth utilization, computation, and thus latency of popular machine learning workloads. To highlight some impactful results, we show that EM can reduce the parameter memory usage of deep learning recommendation models (DLRM) by $10000\times$ without compromising the model's accuracy, leading to a $3.1\times$ improvement in latency and orders-of-magnitude improvements in the carbon footprint and cost of training and deploying DLRM. We also show that EM can improve the throughput of Large Language Models (LLMs) by $1.31\times$ without compromising model quality. Moreover, we demonstrate that EM can be combined with quantization and sparsity to further improve memory and throughput.
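The hashing-based weight retrieval at the core of EM can be pictured as a single flat parameter array that many virtual weights share. The sketch below is a minimal, forward-only illustration under assumed names (`HashedParameterMemory`, `weight`); it omits the gradient scaling and the cache-efficient hash functions that the thesis develops.

```python
import numpy as np

class HashedParameterMemory:
    """One flat parameter array; each virtual weight (layer, i, j) is
    retrieved by hashing its identity into the array, so many virtual
    weights multiplex the same physical parameter."""
    def __init__(self, memory_size: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.memory = rng.standard_normal(memory_size) * 0.01
        self.size = memory_size

    def _slot(self, layer: int, i: int, j: int) -> int:
        # Simple universal-style hash of the weight's identity (illustrative).
        return hash((layer, i, j)) % self.size

    def weight(self, layer: int, shape: tuple[int, int]) -> np.ndarray:
        rows, cols = shape
        idx = np.array([[self._slot(layer, i, j) for j in range(cols)]
                        for i in range(rows)])
        return self.memory[idx]  # virtual matrix materialized on demand

mem = HashedParameterMemory(memory_size=1_000)  # 1K physical parameters...
W1 = mem.weight(layer=0, shape=(256, 128))      # ...backing ~32K virtual weights
W2 = mem.weight(layer=1, shape=(128, 64))
print(W1.shape, W2.shape, mem.size)
```
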
Item: Enhancing Exploration in Reinforcement Learning through Multi-Step Actions (2020-12-03)
Medini, Tharun; Shrivastava, Anshumali

The paradigm of Reinforcement Learning (RL) has been plagued by slow and uncertain training owing to the poor exploration in existing techniques. This can be mainly attributed to the lack of training data beforehand. Further, querying a neural network after every step is a wasteful process, as some states are conducive to multi-step actions. Since we train with data generated on-the-fly, it is hard to pre-identify action sequences that consistently yield great rewards. Prior research in RL has focused on designing algorithms that can train multiple agents in parallel and accumulate information from these agents to train faster. Concurrently, research has also been done to dynamically identify action sequences that are suited for a specific input state. In this work, we provide insights into the necessity and training methods for RL with multi-step action sequences in conjunction with the main actions of an RL environment. We broadly discuss two approaches. The first is A4C - Anticipatory Asynchronous Advantage Actor-Critic - a method that squeezes twice the gradients from the same number of episodes and thereby achieves higher scores and converges faster. The second is an alternative to Imitation Learning that mitigates the need for state-action pairs from an expert. With as few as 20 action trajectories from an expert, we can identify the most frequent action pairs and append them to the novice's action space. We show the power of our approaches by consistently and significantly outperforming the state-of-the-art GPU-enabled A3C (GA3C) on popular ATARI games.

Item: Generalized Zero-Shot Learning through Similarity Distribution Matching (2022-02-18)
Daghaghi, Shabnam; Shrivastava, Anshumali

Recent advances in supervised learning methods in vision, specifically deep learning frameworks, are primarily built on the abundance of labeled images. However, image labeling is a laborious task, and therefore many visual categories are unlabeled or even unavailable. Zero-Shot Learning (ZSL) deals with classifying unseen visual categories that have no samples during the training phase. In particular, ZSL is a classification task where some classes, referred to as unseen classes, have no training images. Instead, we only have side information about seen and unseen classes, often in the form of semantic or descriptive attributes. The lack of training images from a set of classes restricts the use of standard classification techniques and losses, including the widespread cross-entropy loss. We introduce a novel Similarity Distribution Matching Network (SDM-Net), a standard fully connected neural-network architecture with a non-trainable penultimate layer consisting of class attributes. The output layer of SDM-Net consists of both seen and unseen classes. To enable zero-shot learning, during training we regularize the model such that the predicted distribution over the unseen classes is close in KL divergence to the distribution of similarities between the correct seen class and all the unseen classes. We evaluate the proposed model on five benchmark datasets for zero-shot learning: AwA1, AwA2, aPY, SUN and CUB. We show that, despite its simplicity, our approach achieves competitive performance with state-of-the-art methods in the Generalized-ZSL setting on all of these datasets.
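One plausible reading of the SDM regularizer as a computation: build a target distribution over unseen classes from attribute similarities to the true seen class, and penalize the KL divergence from the model's predicted unseen-class distribution. All shapes and names below are illustrative, not the paper's exact loss.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
A_seen = rng.standard_normal((10, 85))    # attribute vectors, 10 seen classes
A_unseen = rng.standard_normal((5, 85))   # attribute vectors, 5 unseen classes

def sdm_regularizer(logits_unseen, true_seen_class):
    """KL(target || predicted), where the target distribution over unseen
    classes comes from attribute similarities to the correct seen class."""
    sims = A_unseen @ A_seen[true_seen_class]   # similarity scores
    target = softmax(sims)                      # similarity distribution
    pred = softmax(logits_unseen)               # model's unseen-class output
    return float(np.sum(target * (np.log(target) - np.log(pred))))

logits_unseen = rng.standard_normal(5)          # stand-in for network output
print(sdm_regularizer(logits_unseen, true_seen_class=3))
```
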
Item: Kernel Sum Sketches for Large Scale Learning (2023-01-03)
Coleman, Ben Ray; Shrivastava, Anshumali

Kernel methods play a central role in machine learning and statistics, but algorithms for such methods scale poorly to large, high-dimensional datasets. Kernel sum computations are often the bottleneck, as they must aggregate all pairwise interactions between a query and each element of the dataset. Prior research has resulted in fast methods to approximate this sum with coresets, kernel approximations and adaptive sampling. However, existing methods still have prohibitively high memory and computation costs, especially for emerging applications in web-scale learning, genomics and streaming data. In my work, I have developed a compressed summary of the dataset, or sketch, that supports fast approximate sum queries for a special class of kernels. The sketch requires memory that is sub-linear in the data size and dimension, can be constructed in a single pass, and comes with strong theoretical guarantees on the approximation error. In this thesis, I argue that kernel sum sketches are a new, useful tool for large-scale analysis and learning. I use the sketch to improve the resource-accuracy tradeoff by an order of magnitude for i) differentially private density estimation, linear regression and classification, ii) fast inverse propensity sampling, and iii) memory-efficient near-neighbor search.
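One way such a kernel sum sketch can work, for kernels that coincide with an LSH family's collision probability, is an array of counters indexed by hash codes: inserting a point increments one counter per repetition, and a query averages the counters it hashes to. This is a rough sketch under those assumptions, not the thesis's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, reps, num_bits = 16, 50, 4
planes = rng.standard_normal((reps, num_bits, d))  # one SimHash per repetition

def buckets(X):
    """Bucket id of each row of X under each of the `reps` SimHashes."""
    bits = np.einsum("rbd,nd->rnb", planes, np.atleast_2d(X)) > 0
    return bits @ (1 << np.arange(num_bits))        # shape (reps, n)

# Build the sketch in one pass: reps x 2^num_bits counters, nothing else kept.
data = rng.standard_normal((5000, d))
counts = np.zeros((reps, 2 ** num_bits))
b = buckets(data)
for r in range(reps):
    np.add.at(counts[r], b[r], 1)

def kernel_sum(q):
    """Estimate sum_i k(q, x_i), where k is the SimHash collision probability."""
    return float(np.mean(counts[np.arange(reps), buckets(q)[:, 0]]))

# Sanity check against the exact sum of collision probabilities.
q = data[0]
cos = data @ q / (np.linalg.norm(data, axis=1) * np.linalg.norm(q))
exact = np.sum((1 - np.arccos(np.clip(cos, -1, 1)) / np.pi) ** num_bits)
print(kernel_sum(q), "vs exact", float(exact))
```
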
Item: Leveraging Physics-based Models in Data-driven Computational Imaging (2019-04-19)
Chen, George; Veeraraghavan, Ashok; Baraniuk, Richard; Shrivastava, Anshumali

Deep Learning (DL) has revolutionized various applications in computational imaging and computer vision. However, existing DL frameworks are mostly data-driven, which largely disregards decades of prior work focused on signal processing theory and physics-based models. As a result, many DL-based image reconstruction methods generate eye-pleasing results but face serious drawbacks, including 1) output that is not physically correct and 2) the requirement for large datasets with labor-intensive annotations. In this thesis, we propose several computational imaging frameworks that leverage both physics-based models and data-driven deep learning. By formulating the physical model as an integrated and differentiable layer of the larger learning network, we are able to a) constrain the results to be closer to physical reality, b) perform self-supervised network training using the physical constraints as loss functions, avoiding manually labeled data, and c) develop true end-to-end imaging systems with jointly optimized front-end sensors and back-end algorithms. In particular, we show that the proposed approach is suitable for a wide range of applications, including motion de-blurring, 3D imaging and super-resolution microscopy.

Item: Locality Sensitive Sampling for Extreme-Scale Optimization and Deep Learning (2020-08-11)
Chen, Beidi; Shrivastava, Anshumali

The exponential growth of data poses a number of challenges for scaling learning algorithms in machine learning and deep learning problems. This thesis aims to explore and tackle these computational challenges with randomized hashing algorithms and to shed new light on Locality Sensitive Hashing (LSH) as an adaptive sampler for large-scale estimation. We first introduce the chicken-and-egg loop problem in large-scale optimization algorithms. SGD estimates the gradient by uniform sampling with sample size one. Several other works suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than that of calculating the full gradient itself. We break this barrier by providing the first demonstration of an LSH-sampled stochastic gradient descent (LGD) that leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of uniform sampling. We then demonstrate the power of LSH sampling in our SLIDE (SUb-LInear Deep learning Engine), which drastically reduces the computations of extreme-scale neural network training and outperforms an optimized implementation of TensorFlow (TF) on the best available GPU using only a CPU. Our evaluations on industry-scale recommendation datasets, with large fully connected architectures, show that training with SLIDE on a 44-core CPU is more than 3.5 times (1 hour vs. 3.5 hours) faster than the same network trained using TF on a Tesla V100 at any given accuracy level. In addition to exploring these new possibilities for LSH, we also extend several classical hashing algorithms from the literature.
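The LSH-as-sampler trick behind LGD can be illustrated in a few lines: pre-hash the dataset into buckets, then draw each sample from the bucket that the current query hashes to, so that similar points are drawn with higher probability at near-uniform cost. A toy SimHash version, with all names illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, num_bits = 10_000, 32, 10
X = rng.standard_normal((n, d))
planes = rng.standard_normal((num_bits, d))

# One-time preprocessing: SimHash every point into a bucket.
codes = (X @ planes.T > 0) @ (1 << np.arange(num_bits))
table = {}
for i, c in enumerate(codes):
    table.setdefault(int(c), []).append(i)

def lsh_sample(query):
    """Sample a data point with probability that grows with its LSH
    collision probability against the query, instead of uniformly."""
    c = int((planes @ query > 0) @ (1 << np.arange(num_bits)))
    bucket = table.get(c)
    return rng.choice(bucket) if bucket else int(rng.integers(n))  # fallback

print(lsh_sample(X[42]))  # very likely an index of a point similar to X[42]
```

The point of the construction is that the per-sample cost is one hash computation, so the adaptive distribution never has to be materialized or maintained.
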
Item: Memory efficient computation for large scale machine learning and data inference (2022-08-11)
Dai, Zhenwei; Shrivastava, Anshumali

With the fast growth of large-scale and high-dimensional datasets, large-scale machine learning and statistical inference have become more and more common in many daily applications. Although the development of modern computation hardware (like GPUs) has brought an exponential speed-up in computation efficiency, memory remains expensive and has become a main bottleneck for these large-scale learning and inference tasks. This thesis focuses on developing scalable and memory-efficient learning and inference algorithms with probabilistic data structures. We first aim to solve the low-memory and high-speed membership testing problem. Membership testing tries to answer whether a query $q$ is in a set $S$. Membership testing has many applications in web services, such as malicious URL testing and search query caching. However, due to the limited memory budget and constrained response time, membership testing has to be fast and memory efficient. We propose two learned Bloom filter algorithms, which smartly combine a machine learning classifier with Bloom filters, to achieve low memory usage, high inference speed, and state-of-the-art inference false positive rates (FPR). Secondly, we show a novel use of a probabilistic data structure (the Count Sketch) to solve the high-dimensional covariance matrix estimation problem. High-dimensional covariance matrix estimation plays a critical role in many machine learning and statistical inference problems. However, the memory cost of storing a covariance matrix increases quadratically with the dimension. Hence, when the dimension increases to the scale of millions, storing the whole covariance matrix in memory is almost impossible. However, the sparse nature of most high-dimensional covariance matrices gives us hope of recovering only the large covariance entries. We incorporate active sampling into the Count Sketch algorithm to project the covariances into a compressed data structure. It costs only sub-linear memory while being able to locate the large covariance entries with high accuracy. Finally, we explore memory- and communication-efficient algorithms for extreme classification tasks under the federated learning setup. Federated learning enables many local devices to train a deep learning model jointly without sharing the local data. Currently, most federated training schemes learn a global model by averaging the parameters of local models. However, this suffers from high communication costs resulting from transmitting full local model parameters, especially for federated learning tasks involving extreme classification: 1) communication becomes the main bottleneck, since the model size increases proportionally to the number of output classes; 2) extreme classification (such as user recommendation) normally has extremely imbalanced classes and heterogeneous data on different devices. We propose to reduce the model size by compressing the output classes with a Count Sketch. This can significantly reduce memory usage while still maintaining the information of the major classes.
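A minimal sketch of the learned Bloom filter idea referenced above: a classifier screens queries first, and a small backup Bloom filter stores only the keys the classifier misses, preserving the no-false-negative guarantee. The toy model, threshold, and sizes below are illustrative, not the paper's configuration.

```python
import hashlib

class LearnedBloomFilter:
    """Classifier screens queries first; a backup Bloom filter covers
    only the keys the classifier would miss, so there are no false negatives."""
    def __init__(self, model, keys, threshold, num_bits=1 << 16, num_hashes=3):
        self.model, self.threshold = model, threshold
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)
        for k in keys:
            if model(k) < threshold:        # classifier would reject this key...
                for h in self._hashes(k):   # ...so the backup filter stores it
                    self.bits[h] = 1

    def _hashes(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def __contains__(self, key):
        if self.model(key) >= self.threshold:
            return True                     # may be a false positive
        return all(self.bits[h] for h in self._hashes(key))

# Toy "classifier": scores URLs by a handcrafted feature (illustrative stand-in).
model = lambda url: 1.0 if "login" in url else 0.0
keys = ["evil.com/login", "bad.net/x", "scam.org/y"]   # the set S
lbf = LearnedBloomFilter(model, keys, threshold=0.5)
print("bad.net/x" in lbf, "good.com/home" in lbf)       # True False
```
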
Item: Mining Massive-Scale Time Series Data using Hashing (2017-05-09)
Luo, Chen; Shrivastava, Anshumali

Similarity search on time series is a frequent operation in large-scale data-driven applications. Sophisticated similarity measures are standard for time series matching, as time series are usually misaligned. Dynamic Time Warping (DTW) is the most widely used similarity measure for time series because it combines alignment and matching at the same time. However, the alignment makes DTW slow. To speed up the expensive similarity search with DTW, branch-and-bound based pruning strategies are adopted. However, branch-and-bound pruning is only useful for very short queries (low-dimensional time series), and the bounds are quite weak for longer queries. Due to these loose bounds, the branch-and-bound pruning strategy boils down to a brute-force search. To circumvent this issue, we design SSH (Sketch, Shingle, & Hashing), an efficient and approximate hashing scheme that is much faster than the state-of-the-art branch-and-bound searching technique, the UCR suite. SSH uses a novel combination of sketching, shingling and hashing techniques to produce (probabilistic) indexes which align (near perfectly) with the DTW similarity measure. The generated indexes are then used to create hash buckets for sub-linear search. Empirical results on two large-scale benchmark time series datasets show that our proposed method prunes around 95% of time series candidates and can be around 20 times faster than the state-of-the-art package (the UCR suite) without any significant loss in accuracy.
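A rough SSH-flavored pipeline, with all three stages compressed into a few lines. The filter length, shingle length, and hash choices below are illustrative, not the paper's tuned settings.

```python
import numpy as np

rng = np.random.default_rng(0)
filt = rng.standard_normal(8)   # shared random filter (the "Sketch" step)

def ssh_signature(ts, filt, num_hashes=16):
    # 1) Sketch: sign of the filter slid along the series -> bit stream.
    bits = "".join("1" if filt @ ts[i:i + len(filt)] > 0 else "0"
                   for i in range(len(ts) - len(filt)))
    # 2) Shingle: overlapping n-grams of the bit stream form a set.
    shingles = {bits[i:i + 8] for i in range(len(bits) - 7)}
    # 3) Hash: MinHash the shingle set into a short signature.
    return [min(hash((s, r)) for s in shingles) for r in range(num_hashes)]

a = np.sin(np.linspace(0, 8 * np.pi, 300))
b = np.sin(np.linspace(0.3, 8 * np.pi + 0.3, 300))  # misaligned copy of a
c = rng.standard_normal(300)                        # unrelated series
sig_a, sig_b, sig_c = (ssh_signature(x, filt) for x in (a, b, c))
print(sum(x == y for x, y in zip(sig_a, sig_b)), "matches: shifted pair")
print(sum(x == y for x, y in zip(sig_a, sig_c)), "matches: random pair")
```

Because the shingle set is nearly invariant to small shifts, the misaligned pair shares most MinHash values and lands in the same buckets, which is the property that lets the index approximate DTW-style alignment.
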
Item: Objective Sociability Measures from Multi-modal Smartphone Data and Unconstrained Day-long Audio Streams (2019-04-16)
Cao, Jian; Sabharwal, Ashutosh; Shrivastava, Anshumali

Sociability is defined as a tendency to affiliate with others and to prefer being with others instead of staying alone. Sociability is well known to influence many aspects of an individual's life, e.g., their quality of life, health and well-being, workplace performance, and even learning outcomes. Despite its importance, existing sociability measures rely mainly on subjective self-reports; hence they only provide a sparse sampling of an individual's social experiences and put extra burdens on both researchers and participants. In this thesis, I propose and develop objective measurements to capture social interactions from multi-modal smartphone data and unconstrained day-long audio streams. The goal is to automatically capture two forms of social interaction (remote social interactions, including phone calls and text messages, and in-person social interactions in the form of verbal conversations) and to investigate the correlations between objectively assessed social interactions and well-being and mental health outcomes. Towards that goal, I develop a smartphone app that integrates multi-modal sensor and usage data to measure remote social interaction. I also propose and develop the SocialSense framework, which automatically captures in-person verbal interactions from unconstrained audio recordings made by wearable devices. SocialSense consists of an utterance segmentation frontend, an unsupervised speaker indexing stage using a Siamese Convolutional Neural Network (Siamese-CNN), and a SpeakerRank algorithm to track the most frequent speakers. I evaluate the performance of SocialSense on both public datasets and a private Rice Speech Corpus across different ambient backgrounds, voice clip lengths, and numbers of speakers, and the results indicate that SocialSense performs reasonably well on unconstrained audio data. Using the smartphone app and SocialSense, I conduct three trials with a clinical population to validate that objectively assessed social interactions are correlated with mental health outcomes. Specifically, the SOLVD-Adult and SOLVD-Teen trials focus on investigating the correlations between remote social interactions and depressive symptoms assessed by clinical instruments. The SocioNet study aims at studying the consistency between objectively captured in-person social interactions and self-reported sociability levels and qualitative clinical observations for depression and psychosis patients. The results indicate that the smartphone app and SocialSense provide an objective, continuous and unobtrusive approach to capturing both remote and in-person social interactions. The sensor-based sociability measures correlate well with both self-reports and clinical instruments. In addition, SocialSense is able to capture transient behavioral markers that are of significant clinical importance but are hard to detect with previous measures. Hence, objective sociability measures have great potential for applications in mental health, team science, and other behavioral research.

Item: Parameter and Data Sparsity for Efficient Training of Large Neural Networks (2023-12-01)
Daghaghi, Shabnam; Shrivastava, Anshumali; Baraniuk, Richard; Hu, Xia (Ben)

The size of deep neural networks has increased tremendously due to recent advancements in generative AI and the development of Large Language Models (LLMs). Simultaneously, the exponential growth in data volumes has compounded the challenges associated with training these expansive models. This thesis utilizes randomized algorithms to leverage the notion of dynamic sparsity, where only a small subset of parameters is crucial for each input, and to mitigate some of the bottlenecks in training large deep neural networks. In the first chapter, we utilize parameter sparsity to address the training bottleneck in neural networks with a large output layer, which occurs in many applications such as recommendation systems and information retrieval. Computing the full softmax is costly from both computational and energy perspectives. There have been various sampling approaches to overcome this challenge, popularly known as negative sampling (NS). However, existing NS approaches trade away either efficiency or adaptivity. We propose two classes of distributions where the sampling scheme is truly adaptive and provably generates negative samples in near-constant time. Our proposal is based on the Locality Sensitive Hashing (LSH) data structure and significantly outperforms the baselines in terms of accuracy and wall-clock time. In the second chapter, we propose an efficient data sampling scheme based on the LSH data structure. Data sampling is an effective method to improve the training speed of large deep neural networks, and it originates from the fact that not all data points are equally important. We propose a novel dynamic sampling distribution based on nonparametric kernel regression. To make this approach computationally feasible, we develop an efficient sketch-based approximation to the Nadaraya-Watson estimator and provide exponential convergence guarantees. We demonstrate that our sampling algorithm outperforms the baseline in terms of wall-clock time and accuracy across multiple datasets. Finally, in the third chapter, we study the system design and implementation of networks with dynamic parameter sparsity on CPUs. A notable work in this direction is the SLIDE system, a C++ implementation of sparse hash-table-based feed-forward and back-propagation that was shown to be significantly faster than GPUs in training neural models with hundreds of millions of parameters. In this chapter, we argue that SLIDE's current implementation is sub-optimal and does not exploit several features available in modern CPUs. In particular, we show that by utilizing opportunities such as memory optimization, quantization, and vectorization, we obtain up to a 7x speedup in the computations on the same hardware.
Item: Randomized Algorithms for Mega-AI Models (2023-08-08)
Xu, Zhaozhuo; Shrivastava, Anshumali; Baraniuk, Richard; Hu, Xia

Over the past few years, we have witnessed remarkable accomplishments in machine learning (ML) models due to increases in their sizes. However, the growth in model size has outpaced upgrades to hardware and network bandwidth, resulting in difficulties in training these Mega-AI models within current system infrastructures. Additionally, the shift towards training ML models on user devices, in light of global data privacy protection trends, has constrained hardware resources, exacerbating the tension between effectiveness and efficiency. Moreover, there exists an accuracy-efficiency trade-off in current ML algorithms and systems, where reducing computation and memory usage results in accuracy losses during both training and inference. This thesis aims to demonstrate algorithmic advancements that improve this trade-off in training Mega-AI models. Rather than relying on big data, we propose a focus on good data and sparse models, i.e., models with many parameters that activate only a subset during training for efficiency. We frame the pursuit of good data and activated parameters as an information retrieval problem and develop hashing algorithms and data structures to maintain training accuracy while improving efficiency. This thesis begins with work on data sparsity and presents a hash-based sampling algorithm for Mega-AI models that adaptively selects data samples during training. We also demonstrate how this approach improves a machine teaching algorithm, with 425.12x speedups and 99.76% energy savings on edge devices. We then discuss our recent success in model sparsity and present a provably efficient hashing algorithm that adaptively selects and updates a subset of parameters during training. We also introduce methods to mitigate the accuracy decline of sparse Mega-AI models in the post-training process. Finally, we present DRAGONN, a system that utilizes hash algorithms to achieve near-optimal communication for sparse and distributed ML. To demonstrate the utility of these scalable and sustainable ML algorithms, we apply them to personalized education, seismic imaging, and bioinformatics. Specifically, we show how modifying the ML algorithm can reduce seismic processing time from 10 months to 10 minutes.

Item: Randomized Algorithms for Training Deep Models with Large Outputs (2022-03-01)
Medini, Tharun; Shrivastava, Anshumali; Baraniuk, Richard; Kyrillidis, Anastasios

In the last decade, it has been shown that many hard AI tasks, especially in NLP, can be naturally modeled as extreme classification problems, leading to improved precision. However, such models are prohibitively expensive to train due to the memory blow-up in the last layer. As an example, we delve into a real Amazon Search dataset, for which a simple fully connected neural network with a reasonable hidden layer can easily reach well beyond 100 billion parameters (> 400 GB of memory). This memory requirement is too big to fit even on a very expensive NVIDIA DGX box equipped with 8 V100 GPUs, each with 32 GB of RAM. To cater to problems of this scale, my work presents several principled solutions, building on a fundamental algorithm called Merged-Average Classifiers via Hashing (MACH). MACH is a generic K-classification algorithm where memory provably scales at O(log K) without any strong assumptions on the classes. This thesis is divided into three main chapters. The first chapter is 'Extreme Classification in Log Memory', in which we rethink the problem of Extreme Classification (or Extreme Multi-label Learning, XML) as a sketching problem. MACH is subtly a count-min sketch structure in disguise, which uses universal hashing to reduce classification with a large number of classes to a few embarrassingly parallel and independent classification tasks with a small (constant) number of classes.
MACH naturally provides a technique for zero-communication model parallelism. When evaluated on 6 datasets, some multiclass and some multilabel, MACH shows consistent improvement over the respective state-of-the-art baselines. In particular, we train an end-to-end deep classifier on a private product search dataset sampled from the Amazon Search Engine, with 70 million queries and 49.46 million products. MACH outperforms, by a significant margin, the state-of-the-art extreme classification models deployed on commercial search engines: Parabel and DSSM (Deep Semantic Search Model). The largest model that we trained has 6.4 billion parameters and takes less than 35 hours to train on a single p3.16x machine. Our training times are 7-10 times faster, and our memory footprints are 2-4 times smaller, than the best baselines. This training time is also significantly lower than the one reported by Google's mixture-of-experts (MoE) language model on a comparable model size and hardware. In the second chapter, we observe that MACH is effectively a variant of an embedding model, with the critical difference that it trains high-dimensional sparse embeddings (contrary to the usual low-dimensional dense embedding models). Dense embedding models are commonly deployed in commercial search engines, wherein all the document vectors are pre-computed and near-neighbor search (NNS) is performed with the query vector to find relevant documents. However, the bottleneck of indexing a large number of dense vectors and performing an NNS hurts the query time and accuracy of these models. In this chapter, we argue that high-dimensional, ultra-sparse embeddings are a significantly superior alternative to dense low-dimensional embeddings for both query efficiency and accuracy. Extreme sparsity eliminates the need for NNS by replacing it with simple lookups, while high dimensionality ensures that the embeddings remain informative even when sparse. However, learning extremely high-dimensional embeddings leads to a blow-up in the model size. To make the training feasible, we use MACH's partitioning algorithm, which learns such high-dimensional embeddings across multiple GPUs without any communication. This yields a novel asymmetric mixture of Sparse, Orthogonal, Learned And Random (SOLAR) embeddings. The label vectors are random, sparse, and near-orthogonal by design, while the query vectors are learned and sparse. We theoretically prove that SOLAR's one-sided learning is equivalent to learning both query and label embeddings. Thanks to these unique properties, we can successfully train 500K-dimensional SOLAR embeddings for the tasks of searching through 1.6M books and multi-label classification on the three largest public XML datasets. We achieve superior precision and recall compared to the respective state-of-the-art baselines for each task, with up to 10 times faster training and 2 times faster inference. In the third and final chapter, we discuss the challenges that representation learning has brought to Information Retrieval (IR). Learning-to-Index (LTI) has emerged as a key technique for solving most IR problems. Since query time is the most critical aspect affecting the scalability of a search system, there is an inherent tension between accuracy, scalability, and the ability to load-balance in distributed settings. We discuss an algorithm called Iterative Repartitioning for Learning to Index (IRLI), in which we retain the best features of SOLAR (sparsity and load balance) while additionally having the Locality Sensitive Hashing (LSH) property. IRLI iteratively refines partitions of items by learning the relevant buckets directly from the query-item relevance data. To ensure that the buckets are balanced, IRLI uses the power-of-k-choices strategy. Due to its design, IRLI can be used for both extreme classification and near-neighbor retrieval. In practice, IRLI surpasses the best baseline's precision for multi-label classification while being 5 times faster for inference.
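The count-min flavor of MACH can be sketched compactly: hash K classes into B buckets, independently R times; train one small B-class classifier per repetition; and score a class by aggregating its buckets' probabilities across repetitions. The sketch below fakes the trained classifiers with hand-built probability vectors and is purely illustrative.

```python
import numpy as np

K, B, R = 1_000_000, 1_000, 4   # K classes -> R tables of B buckets (R*B << K)
rng = np.random.default_rng(0)

# Universal-style hash of a class id into B buckets, one per repetition
# (the constants and form are illustrative).
P = 2**31 - 1
a = rng.integers(1, P, size=R)
b = rng.integers(0, P, size=R)
def h(r, c):
    return int((a[r] * c + b[r]) % P) % B

# Stand-in for R trained B-class classifiers: each outputs a probability
# vector over its buckets. Here we fake ones that fire on class 123456.
true_class = 123_456
probs = np.full((R, B), 1e-6)
for r in range(R):
    probs[r, h(r, true_class)] = 1.0
probs /= probs.sum(axis=1, keepdims=True)

def mach_score(c):
    """Score class c by averaging its bucket probability across the R tables."""
    return float(np.mean([probs[r, h(r, c)] for r in range(R)]))

candidates = [7, 42, 123_456, 999_999]   # illustrative candidate classes
print(max(candidates, key=mach_score))   # recovers 123456 from R*B memory
```

Because a wrong class must collide with the true class in many independent tables to score highly, the memory needed for reliable decoding grows only logarithmically in K.
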
Item: Resource-Efficient Machine Learning via Count-Sketches and Locality-Sensitive Hashing (LSH) (2020-04-24)
Spring, Ryan Daniel; Shrivastava, Anshumali

Machine learning problems are increasing in complexity, so models are growing correspondingly larger to handle these datasets (e.g., large-scale transformer networks for language modeling). The increase in the number of input features, model size, and output classification space is straining our limited computational resources. Given vast amounts of data and limited computational resources, how do we scale machine learning algorithms to gain meaningful insights? Randomized algorithms are an essential tool in our algorithmic toolbox for solving these challenges. These algorithms achieve significant improvements in computational cost or memory usage by incurring some approximation error. They work because most large-scale datasets follow a power-law distribution, where a small subset of the data contains most of the information. Therefore, we can avoid wasting computational resources by focusing only on the most relevant items. In this thesis, we explore how to use locality-sensitive hashing (LSH) and the count-sketch data structure to address the computational and memory challenges in four distinct areas. (1) The LSH Sampling algorithm uses the LSH data structure as an adaptive sampler; we demonstrate this approach by accurately estimating the partition function in large-output spaces. (2) MISSION is a large-scale feature extraction algorithm that uses the count-sketch data structure to store a compressed representation of the entire feature space. (3) The Count-Sketch Optimizer is an algorithm for minimizing the memory footprint of popular first-order gradient optimizers (e.g., Adam, Adagrad, Momentum). (4) Finally, we show the usefulness of our compressed-memory optimizer by efficiently training a synthetic question generator, which uses large-scale transformer networks to generate high-quality, human-readable question-answer pairs.
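For reference, the count-sketch data structure that MISSION and the Count-Sketch Optimizer build on supports increments and unbiased point queries in sub-linear memory. A minimal version, with illustrative hashing choices, looks like this:

```python
import numpy as np

class CountSketch:
    """Count-Sketch: approximate per-key value estimates in sub-linear memory,
    using a bucket hash and a sign hash per row, with a median across rows."""
    def __init__(self, rows=5, width=1 << 12, seed=0):
        self.rows, self.width = rows, width
        self.table = np.zeros((rows, width))
        rng = np.random.default_rng(seed)
        self.salt = rng.integers(0, 2**31, size=rows)

    def _loc(self, key):
        for r in range(self.rows):
            hv = hash((int(self.salt[r]), key))
            # Bucket from the low bits, sign from a higher bit (illustrative).
            yield r, hv % self.width, 1 if (hv >> 40) & 1 else -1

    def update(self, key, value=1.0):
        for r, col, sign in self._loc(key):
            self.table[r, col] += sign * value

    def query(self, key):
        # The median across rows controls the noise from colliding keys.
        return float(np.median([sign * self.table[r, col]
                                for r, col, sign in self._loc(key)]))

cs = CountSketch()
for _ in range(1000):
    cs.update("feature_42", 1.0)   # a heavy hitter survives compression
cs.update("rare_key", 1.0)
print(cs.query("feature_42"), cs.query("rare_key"))
```

The same table can store compressed gradient moments instead of counts, which is essentially how a count-sketch-backed optimizer keeps its auxiliary state small.
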
Item: Rethinking Image Compression for the Object Detection Task (2015-12-03)
Barua, Souptik; Veeraraghavan, Ashok; Baraniuk, Richard; Shrivastava, Anshumali

Traditionally, image compression algorithms such as JPEG have been designed for human viewers' satisfaction. Increasingly, however, more and more images are viewed by computers performing computer vision tasks such as object detection. Image compression and object detection have largely been independent areas of research so far. However, several applications, such as surveillance and medical imaging, impose severe bandwidth and power restrictions. These constraints make the quality and/or size of the compressed image a critical factor in object detection performance. My work presents three compressed image representations that enable fast and accurate object detection. The first is a saliency-guided wavelet representation, which modifies traditional wavelet compression using knowledge of saliency to improve both compression and detection performance compared to JPEG images. The second, called the event stream representation, comes directly from the new DVS sensor, which has ultra-low bandwidth and power requirements. We show, for the first time, high-speed video reconstruction and direct detection on the event data, achieving detection performance comparable to that on conventional JPEG images. Finally, we explore an abstract compressed representation called the patch-wise binary representation, which represents an image (patch-wise) as a collection of short binary strings. We demonstrate two ways of generating these binary strings, called hashing and feature binarization, which enable 10x faster detection. We show promising detection and reconstruction results for both approaches.