Browsing by Author "Gupta, Gaurav"
Now showing 1 - 2 of 2
Item: AI and algorithms for ultra large scale information retrieval systems (2024-02-20)
Gupta, Gaurav; Shrivastava, Anshumali

Information Retrieval (IR) entails the task of matching a specific query to a single item or a group of items within a database of potential matches. This pivotal operation serves numerous applications, such as genome sequence and document search, near-neighbor search, and classification. The exponential increase in data generation - encompassing text, audio, images, DNA sequences, and time series - from sources like sensors and the web has resulted in data volumes exceeding standard hardware's storage capabilities. Additionally, modern AI applications rely on substantial data quantities, ranging from gigabytes to terabytes, and on key-label pair sets numbering in the billions to trillions. Hashing-based algorithms have shown promise in enabling efficient large-scale retrieval due to their low latency and minimal hardware storage requirements. This research examines two search modalities: exact-match search and similarity search.

We propose RAMBO, a Bloom filter-based sub-linear search index for exact-match search. Particularly potent in genome sequence search, this efficient algorithm indexes terabytes of DNA data for E. coli species, dramatically reducing indexing times from weeks to hours and search times from hours to seconds. This progress empowers any laboratory to scour extensive genome archives using standard computers. We also extend this work by proposing a hash function, IDL (Identity with Locality), that preserves the spatial locality and identity of keys for cache-efficient exact-match queries.

Conversely, recommendation engines often lean on similarity matching when handling real-world data such as text, images, and audio. In this context, we introduce BLISS (Balanced Index for Scalable Search), a mechanism capable of learning and associating any two real-world data entities within an acceptable approximation. BLISS can index a billion items through an iterative learning algorithm, and its high accuracy, coupled with a small memory footprint, speeds up retrieval by a factor of 5 compared to current benchmarks. Moreover, BLISS offers the dual functionality of near-neighbor search and extreme classification.

In practical retrieval engines, similarity match and exact match form the core components, with embeddings and filters together serving as the query. Current techniques cascade a vector similarity match with an exact match on filters, which often leads to slower and less precise results. To address this, we introduce CAPS (Constrained Approximate Partition Search), a unified, single-stage index designed for filter-based near-neighbor search. Our findings demonstrate that CAPS not only streamlines the search process but also significantly improves the efficiency of Amazon's search system.
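The RAMBO construction itself is not detailed in the abstract, but the Bloom-filter idea behind exact-match membership search can be illustrated. Below is a minimal sketch, assuming a plain one-filter-per-dataset layout; the actual RAMBO index additionally merges datasets into repeated random groups to achieve sub-linear query time, which this simplification omits. Names such as BloomFilter, KmerIndex, kmer_size, and the hash parameters are hypothetical placeholders, not identifiers from the thesis.

```python
# Minimal Bloom-filter membership index for k-mer search (illustrative only).
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # One bit position per hash seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(key.encode(), digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, key):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(key))


class KmerIndex:
    """One Bloom filter per dataset: a query returns every dataset that
    (probably) contains the k-mer, with no false negatives."""

    def __init__(self, kmer_size=31):
        self.kmer_size = kmer_size
        self.filters = {}

    def index_sequence(self, dataset_id, sequence):
        bf = self.filters.setdefault(dataset_id, BloomFilter())
        for i in range(len(sequence) - self.kmer_size + 1):
            bf.add(sequence[i:i + self.kmer_size])

    def query(self, kmer):
        return [d for d, bf in self.filters.items() if bf.contains(kmer)]
```

Indexing a few sequences under different dataset identifiers and calling query on a k-mer returns the identifiers whose filters report it; Bloom filters admit occasional false positives but never miss a true match, which is what makes this style of index safe for exact-match genome search.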
Item: SKETCH TOWARD ONLINE RISK MINIMIZATION (2022-01-27)
Gupta, Gaurav; Shrivastava, Anshumali

Empirical risk minimization (ERM) is perhaps the most influential idea in statistical learning, with applications to nearly all scientific and technical domains in the form of regression and classification models. Growing concerns about the high energy cost of training and the increasing prevalence of massive streaming datasets have led many ML practitioners to look for approximate ERM models with low memory and latency costs for training.

To this end, we propose STORM (Sketch Toward Online Risk Minimization), an online sketching-based method for empirical risk minimization. STORM compresses a data stream into a tiny array of integer counters. This sketch is sufficient to estimate a variety of surrogate losses over the original dataset. We provide rigorous theoretical analysis and show that STORM can estimate a carefully chosen surrogate loss for regularized least-squares regression and a margin loss for classification. We perform an exhaustive experimental comparison for regression and classification training on real-world datasets, achieving an approximate solution whose size is even smaller than a single data sample.
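The abstract describes compressing a data stream into a small counter array that is sufficient to estimate a surrogate loss. The following is a minimal illustrative sketch of that idea, assuming a count-sketch of the least-squares sufficient statistics (X^T X and X^T y); it is not the STORM construction from the thesis, which uses integer counters and also covers a margin loss for classification, and every name and parameter here is a hypothetical placeholder.

```python
# Illustrative count-sketch of least-squares sufficient statistics,
# allowing the regularized squared loss to be estimated from the sketch alone.
import numpy as np


class LeastSquaresSketch:
    def __init__(self, dim, width=4096, seed=0):
        rng = np.random.default_rng(seed)
        self.dim, self.width = dim, width
        # Random counter assignment and sign for every entry of X^T X and X^T y.
        self.h_gram = rng.integers(0, width, size=(dim, dim))
        self.s_gram = rng.choice([-1.0, 1.0], size=(dim, dim))
        self.h_xty = rng.integers(0, width, size=dim)
        self.s_xty = rng.choice([-1.0, 1.0], size=dim)
        self.gram_counters = np.zeros(width)
        self.xty_counters = np.zeros(width)
        self.yy = 0.0
        self.n = 0

    def update(self, x, y):
        """Fold one streaming sample (x, y) into the counters."""
        x = np.asarray(x, dtype=float)
        outer = np.outer(x, x)
        np.add.at(self.gram_counters, self.h_gram.ravel(),
                  (self.s_gram * outer).ravel())
        np.add.at(self.xty_counters, self.h_xty, self.s_xty * (y * x))
        self.yy += y * y
        self.n += 1

    def loss(self, w, reg=1e-3):
        """Estimate (1/n) * ||Xw - y||^2 + reg * ||w||^2 from the sketch."""
        w = np.asarray(w, dtype=float)
        gram_est = self.s_gram * self.gram_counters[self.h_gram]   # ~ X^T X
        xty_est = self.s_xty * self.xty_counters[self.h_xty]       # ~ X^T y
        sq = w @ gram_est @ w - 2.0 * (w @ xty_est) + self.yy
        return sq / self.n + reg * float(w @ w)
```

The counters here hold real values rather than STORM's integer counters, but they show the key property the abstract states: after the stream has been folded into a fixed-size array, the surrogate loss of any candidate weight vector can be estimated from the sketch without revisiting the original data.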