Long-Context Sequence Models for Image Retrieval
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Image retrieval is an important problem in computer vision with many applications. In general, retrieval is usually cast as a metric learning problem where a model is trained under a distance or similarity objective to compare pairs of inputs. In this thesis, we introduce Extractive Image Re-ranker, a solution that takes as input local features corresponding to an image query and a group of gallery images, and outputs a refined ranking list through a single forward pass. This model can be used for image retrieval where typically a query image is compared to a large database of images using global features, and then a retrieved gallery of images is re-ranked based on more refined local features. ExtReranker formulates the re-ranking problem as a span extraction task analogous to the text span extraction problem in natural language processing. In contrast to pair-wise correspondence learning, our approach leverages long-context sequence models to effectively capture the list-wise dependencies between query and gallery images at the local-feature level. Our approach achieves superior performance compared with other re-rankers on established image retrieval benchmarks (CUB-200, SOP, and In-Shop). ExtReranker also achieves state-of-the-art re-ranking performance to alternative methods on ROxford and RParis while using 10X fewer local descriptors and having 5X lower forward latency.