Long-Context Sequence Models for Image Retrieval

Date
2024-10-25
Abstract

Image retrieval is an important problem in computer vision with many applications. Retrieval is typically cast as a metric learning problem in which a model is trained under a distance or similarity objective to compare pairs of inputs. In this thesis, we introduce the Extractive Image Re-ranker (ExtReranker), a model that takes as input local features corresponding to a query image and a group of gallery images, and outputs a refined ranking list in a single forward pass. This model suits the common retrieval pipeline in which a query image is first compared to a large database using global features, and the retrieved gallery of images is then re-ranked using more refined local features. ExtReranker formulates re-ranking as a span extraction task, analogous to text span extraction in natural language processing. In contrast to pair-wise correspondence learning, our approach leverages long-context sequence models to effectively capture list-wise dependencies between query and gallery images at the local-feature level. Our approach outperforms other re-rankers on established image retrieval benchmarks (CUB-200, SOP, and In-Shop). ExtReranker also achieves state-of-the-art re-ranking performance relative to alternative methods on ROxford and RParis while using 10x fewer local descriptors and incurring 5x lower forward latency.
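The list-wise formulation described in the abstract can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the thesis's actual architecture: query and gallery local descriptors are concatenated into one long sequence, processed jointly by a single self-attention pass (a stand-in for a long-context sequence model), and pooled into per-image relevance scores. All weights here are random placeholders, and the function names (`rerank`, `self_attention`) are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # single-head scaled dot-product attention over the whole sequence
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return A @ V

def rerank(query_feats, gallery_feats, rng):
    # query_feats: (Lq, d) local descriptors of the query image
    # gallery_feats: (G, Lg, d) local descriptors of G gallery images
    G, Lg, d = gallery_feats.shape
    # one long sequence: query tokens followed by all gallery tokens
    seq = np.concatenate([query_feats, gallery_feats.reshape(G * Lg, d)], axis=0)
    # random stand-in weights (a trained model would learn these)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    H = self_attention(seq, Wq, Wk, Wv)  # joint, list-wise forward pass
    gal = H[query_feats.shape[0]:].reshape(G, Lg, d)
    w = rng.standard_normal(d)           # stand-in per-image scoring head
    scores = gal.mean(axis=1) @ w        # pool tokens, then score each image
    return np.argsort(-scores)           # refined ranking, best first

# toy usage: 1 query with 8 local descriptors, 5 gallery images with 8 each
rng = np.random.default_rng(0)
order = rerank(rng.standard_normal((8, 64)), rng.standard_normal((5, 8, 64)), rng)
```

The key contrast with pair-wise re-ranking is that every gallery image's tokens attend to every other gallery image's tokens in the same pass, so the ranking can reflect dependencies across the whole candidate list rather than independent query-gallery comparisons.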

Degree
Master of Science
Type
Thesis
Keywords
image retrieval, long-context language models