AI and algorithms for ultra large scale information retrieval systems

Advisor: Shrivastava, Anshumali
Dates: 2024-05-20; 2024-05; 2024-02-20; May 2024
Citation: Gupta, Gaurav. AI and algorithms for ultra large scale information retrieval systems. (2024). PhD diss., Rice University. https://hdl.handle.net/1911/115904
URI: https://hdl.handle.net/1911/115904

Abstract:

Information Retrieval (IR) is the task of matching a query to a single item or a group of items within a database of potential matches. This operation underlies numerous applications, such as genome sequence and document search, near-neighbor search, and classification. The exponential increase in data generation (text, audio, images, DNA sequences, and time series) from sources like sensors and the web has produced data volumes that exceed the storage capacity of standard hardware. Additionally, modern AI applications rely on substantial data quantities, ranging from gigabytes to terabytes, with key-label pair sets spanning billions to trillions. Hashing-based algorithms have shown promise in enabling efficient large-scale retrieval due to their low latency and minimal hardware storage requirements. This research examines two search modalities: exact-match search and similarity search.

We propose RAMBO, a Bloom filter-based sub-linear search index for exact-match search. Particularly effective in genome sequence search, this algorithm indexes terabytes of DNA data for E. coli species, reducing indexing times from weeks to hours and search times from hours to seconds. This progress enables any laboratory to search extensive genome archives using standard computers. We also extend this work by proposing a hash function, IDL (Identity with Locality), that preserves the spatial locality and identity of keys for cache-efficient exact-match queries.

Conversely, recommendation engines often rely on similarity matching when handling real-world data such as text, images, and audio. In this context, we introduce BLISS (Balanced Index for Scalable Search), a mechanism capable of learning and associating any two real-world data entities within an acceptable approximation. BLISS can index a billion items through an iterative learning algorithm, and its high accuracy, coupled with a small memory footprint, speeds up retrieval by a factor of 5 compared to current benchmarks. Moreover, BLISS offers the dual functionality of near-neighbor search and extreme classification.

In practical retrieval engines, similarity match and exact match form the core components, with embeddings and filters serving as the query. Current techniques involve a cascade of vector similarity match and exact match on filters, which often leads to slower and less precise results. To address this, we introduce CAPS (Constrained Approximate Partition Search), a unified, single-stage index designed for filter-based near-neighbor searches. Our findings demonstrate that CAPS not only streamlines the search process but also significantly improves the efficiency of Amazon’s search system.

Format: application/pdf
Language: eng
Rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
Keywords: Hashing; Information Retrieval; Search; Similarity
Type: Thesis