Long-Context Sequence Models for Image Retrieval

Xiao, Zilin

Files

XIAO-DOCUMENT-2024.pdf (3.04 MB)

Date

2024-10-25

Authors

Xiao, Zilin

Abstract

Image retrieval is an important problem in computer vision with many applications. Retrieval is usually cast as a metric learning problem in which a model is trained under a distance or similarity objective to compare pairs of inputs. In this thesis, we introduce the Extractive Image Re-ranker (ExtReranker), a model that takes as input the local features of a query image and of a group of gallery images and outputs a refined ranking list in a single forward pass. It fits the standard two-stage retrieval setting, in which a query image is first compared against a large database using global features and the retrieved gallery is then re-ranked using finer-grained local features. ExtReranker formulates re-ranking as a span extraction task, analogous to text span extraction in natural language processing. In contrast to pair-wise correspondence learning, our approach leverages long-context sequence models to capture the list-wise dependencies between query and gallery images at the local-feature level. Our approach outperforms other re-rankers on established image retrieval benchmarks (CUB-200, SOP, and In-Shop). ExtReranker also achieves state-of-the-art re-ranking performance relative to alternative methods on ROxford and RParis while using 10× fewer local descriptors and having 5× lower forward latency.
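
As a rough illustration of the two-stage setting the abstract describes, the sketch below shortlists gallery images by global-feature similarity and then re-ranks the shortlist from a single concatenated sequence of local features, with one span per gallery image. It is not the thesis model: the function names (`shortlist`, `build_sequence`, `token_scores`, `rerank`) are hypothetical, the toy vectors are made up, and the learned long-context sequence model is replaced by a simple cosine-matching stand-in so that the example stays self-contained. Non-empty query and gallery feature lists are assumed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def shortlist(query_global, gallery_globals, k):
    """Stage 1: rank the database by global-feature similarity, keep the top k."""
    order = sorted(
        range(len(gallery_globals)),
        key=lambda i: cosine(query_global, gallery_globals[i]),
        reverse=True,
    )
    return order[:k]


def build_sequence(query_locals, gallery_locals):
    """Concatenate query and gallery local features into one token sequence.

    Returns the tokens and, for each gallery image, its (start, end) span.
    """
    tokens = list(query_locals)
    spans = []
    for feats in gallery_locals:
        start = len(tokens)
        tokens.extend(feats)
        spans.append((start, len(tokens)))
    return tokens, spans


def token_scores(tokens, n_query):
    """Stand-in for a learned sequence model (an assumption, not the thesis model).

    Each gallery token is scored by its best cosine match among the query tokens.
    """
    query = tokens[:n_query]
    return [max(cosine(t, q) for q in query) for t in tokens[n_query:]]


def rerank(query_locals, gallery_locals, candidate_ids):
    """Stage 2: score every span in one pass over the sequence, then sort."""
    selected = [gallery_locals[i] for i in candidate_ids]
    tokens, spans = build_sequence(query_locals, selected)
    n_query = len(query_locals)
    scores = token_scores(tokens, n_query)
    span_scores = []
    for start, end in spans:
        seg = scores[start - n_query:end - n_query]
        span_scores.append(sum(seg) / len(seg) if seg else 0.0)
    order = sorted(range(len(candidate_ids)), key=lambda j: span_scores[j], reverse=True)
    return [candidate_ids[j] for j in order]


# Toy data: three gallery images with 2-D features.
query_global = [1.0, 0.0]
gallery_globals = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
query_locals = [[1.0, 0.0], [0.0, 1.0]]
gallery_locals = [
    [[1.0, 1.0], [1.0, 1.0]],  # image 0: only a partial local match
    [[1.0, 0.0]],              # image 1: dropped by the global shortlist
    [[1.0, 0.0], [0.0, 1.0]],  # image 2: exact local match
]

candidates = shortlist(query_global, gallery_globals, k=2)
reranked = rerank(query_locals, gallery_locals, candidates)
print(candidates)  # [0, 2]
print(reranked)    # [2, 0]
```

In this toy case the global stage prefers image 0, while the local-feature stage moves image 2 to the top, which is the kind of refinement a re-ranker is meant to provide.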

Advisor

Ordóñez-Román, Vicente

Degree

Master of Science

Type

Thesis

Keywords

image retrieval, long-context language models

Citable link to this page

https://hdl.handle.net/1911/118198

Collections

Rice University Theses and Dissertations
