AI and algorithms for ultra large scale information retrieval systems

dc.contributor.advisor: Shrivastava, Anshumali
dc.creator: Gupta, Gaurav
dc.date.accessioned: 2024-05-20T19:18:39Z
dc.date.available: 2024-05-20T19:18:39Z
dc.date.created: 2024-05
dc.date.issued: 2024-02-20
dc.date.submitted: May 2024
dc.date.updated: 2024-05-20T19:18:39Z
dc.description.abstract: Information Retrieval (IR) is the task of matching a query to a single item or a group of items within a database of potential matches. This operation underlies numerous applications, such as genome sequence and document search, near-neighbor search, and classification. The exponential increase in data generation from sources like sensors and the web, encompassing text, audio, images, DNA sequences, and time series, has resulted in data volumes exceeding standard hardware's storage capabilities. Additionally, modern AI applications rely on substantial data quantities, ranging from gigabytes to terabytes, and key-label pair sets spanning billions to trillions. Hashing-based algorithms have shown promise in enabling efficient large-scale retrieval due to their low latency and minimal hardware storage requirements. This research examines two search modalities: exact-match search and similarity search. We propose RAMBO, a Bloom filter-based sub-linear search index for exact-match search. Particularly potent in genome sequence searches, this efficient algorithm indexes terabytes of DNA data for E. coli, dramatically reducing indexing times from weeks to hours and search times from hours to seconds. This progress empowers any laboratory to search extensive genome archives using standard computers. We also extend this work by proposing a hash function, IDL (Identity with Locality), that preserves the spatial locality and identity of keys for cache-efficient exact-match queries. Conversely, recommendation engines often rely on similarity matching when handling real-world data such as text, images, and audio. In this context, we introduce BLISS (Balanced Index for Scalable Search), a mechanism capable of learning and associating any two real-world data entities within an acceptable approximation. BLISS can index a billion items through an iterative learning algorithm, and its high accuracy, coupled with a small memory footprint, speeds up retrieval by a factor of 5 compared to current benchmarks. Moreover, BLISS offers the dual functionality of near-neighbor search and extreme classification. In practical retrieval engines, similarity match and exact match form the core components, with embeddings and filters serving as the query. Current techniques involve a cascade of vector similarity match and exact match on filters, which often leads to slower and less precise results. To address this, we introduce CAPS (Constrained Approximate Partition Search), a unified, single-stage index designed for filter-based near-neighbor searches. Our findings demonstrate that CAPS not only streamlines the search process but also significantly improves the efficiency of Amazon’s search system.
dc.format.mimetype: application/pdf
dc.identifier.citation: Gupta, Gaurav. AI and algorithms for ultra large scale information retrieval systems. (2024). PhD diss., Rice University. https://hdl.handle.net/1911/115904
dc.identifier.uri: https://hdl.handle.net/1911/115904
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Hashing
dc.subject: Information Retrieval
dc.subject: Search
dc.subject: Similarity
dc.title: AI and algorithms for ultra large scale information retrieval systems
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Electrical and Computer Engineering
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
Files
Original bundle
Name: GUPTA-DOCUMENT-2024.pdf
Size: 9.26 MB
Format: Adobe Portable Document Format
Name: CertificateOfCompletion.pdf
Size: 117.37 KB
Format: Adobe Portable Document Format
License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text
Name: LICENSE.txt
Size: 2.98 KB
Format: Plain Text