AI and algorithms for ultra large scale information retrieval systems

Gupta, Gaurav

dc.contributor.advisor	Shrivastava, Anshumali	en_US
dc.creator	Gupta, Gaurav	en_US
dc.date.accessioned	2024-05-20T19:18:39Z	en_US
dc.date.available	2024-05-20T19:18:39Z	en_US
dc.date.created	2024-05	en_US
dc.date.issued	2024-02-20	en_US
dc.date.submitted	May 2024	en_US
dc.date.updated	2024-05-20T19:18:39Z	en_US
dc.description.abstract	Information Retrieval (IR) is the task of matching a query to a single item or a group of items within a database of potential matches. This operation underpins numerous applications, such as genome sequence and document search, near-neighbor search, and classification. The exponential increase in data generation (text, audio, images, DNA sequences, and time series) from sources like sensors and the web has resulted in data volumes exceeding standard hardware's storage capabilities. Additionally, modern AI applications rely on substantial data quantities, ranging from gigabytes to terabytes, and on sets of key-label pairs numbering in the billions to trillions. Hashing-based algorithms have shown promise in enabling efficient large-scale retrieval due to their low latency and minimal hardware storage requirements. This research examines two search modalities: exact-match search and similarity search. We propose RAMBO, a Bloom filter-based sub-linear search index for exact-match search. Particularly effective in genome sequence search, this algorithm indexes terabytes of DNA data for E. coli species, reducing indexing times from weeks to hours and search times from hours to seconds. This progress enables any laboratory to search extensive genome archives using standard computers. We also extend this work by proposing a hash function, IDL (Identity with Locality), that preserves the spatial locality and identity of keys for cache-efficient exact-match queries. In contrast, recommendation engines often rely on similarity matching when handling real-world data such as text, images, and audio. In this context, we introduce BLISS (Balanced Index for Scalable Search), a mechanism capable of learning and associating any two real-world data entities within an acceptable approximation. 
BLISS can index a billion items through an iterative learning algorithm, and its high accuracy, coupled with a small memory footprint, speeds up retrieval by a factor of 5 compared to current benchmarks. Moreover, BLISS offers the dual functionality of near-neighbor search and extreme classification. In practical retrieval engines, similarity matching and exact matching form the core components, with embeddings and filters serving as the query. Current techniques cascade a vector similarity match with an exact match on filters, which often leads to slower and less precise results. To address this, we introduce CAPS (Constrained Approximate Partition Search), a unified, single-stage index designed for filter-based near-neighbor search. Our findings demonstrate that CAPS not only streamlines the search process but also significantly improves the efficiency of Amazon’s search system.	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.citation	Gupta, Gaurav. AI and algorithms for ultra large scale information retrieval systems. (2024). PhD diss., Rice University. https://hdl.handle.net/1911/115904	en_US
dc.identifier.uri	https://hdl.handle.net/1911/115904	en_US
dc.language.iso	eng	en_US
dc.rights	Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.	en_US
dc.subject	Hashing	en_US
dc.subject	Information Retrieval	en_US
dc.subject	Search	en_US
dc.subject	Similarity	en_US
dc.title	AI and algorithms for ultra large scale information retrieval systems	en_US
dc.type	Thesis	en_US
dc.type.material	Text	en_US
thesis.degree.department	Electrical and Computer Engineering	en_US
thesis.degree.discipline	Engineering	en_US
thesis.degree.grantor	Rice University	en_US
thesis.degree.level	Doctoral	en_US
thesis.degree.name	Doctor of Philosophy	en_US