Vision and Language: Information Integration and Transformation
Abstract
Images and text are large-scale data sources that allow deep learning models to mimic how humans perceive and understand multimodal information. Combining images and text can produce better feature representations by incorporating complementary information from different sources. For example, text descriptions may omit commonsense information such as "an apple is red," but such information can be learned effectively from images of red apples. Learning independently from a single modality is not sufficient for complex tasks that require an understanding of the interactions and connections between modalities, such as image-text retrieval, visual question answering, visual grounding, and image captioning. In particular, we explore and develop techniques that combine image and text information to improve transformation between modalities. First, we study multilingual image captioning and multimodal machine translation, using additional visual information to transfer information from images and one language to another language, or directly between two languages. Second, we focus on the visual grounding task, which aims to locate the image regions corresponding to query phrases. Finally, we introduce a novel relation detection task and our solution to it. This thesis consists of three parts:
In the first part, we propose a pipeline that combines image and text information from a source language to generate text in a target language at inference time. We design a feedback propagation process for image captioning models at inference time, showing that information can be transferred between images and different languages to improve text generation without any additional training. By providing additional text in one language, this technique constructs better multimodal representations for generating text in another language.
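As a rough illustration, the sketch below shows how inference-time feedback propagation could work on top of a frozen captioning model: the provided source-language caption supplies a loss whose gradients update only the shared multimodal features, which are then decoded into the target language. The model interface (encode_image, decode, generate) and the hyperparameters are hypothetical placeholders, not the exact components used in the thesis.

```python
# Minimal sketch of inference-time feedback propagation.
# The model interface below is a hypothetical stand-in for a pretrained
# multilingual captioning model; weights stay frozen throughout.
import torch
import torch.nn.functional as F

def feedback_prop_caption(model, image, src_caption_ids, steps=10, lr=0.1):
    """Refine shared multimodal features using a source-language caption,
    then decode a caption in the target language."""
    model.eval()
    with torch.no_grad():
        feats = model.encode_image(image)        # visual/multimodal features
    feats = feats.clone().requires_grad_(True)   # optimize features, not weights
    opt = torch.optim.SGD([feats], lr=lr)

    for _ in range(steps):
        logits = model.decode(feats, lang="src")     # source-language token scores
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               src_caption_ids.view(-1))
        opt.zero_grad()
        loss.backward()                              # gradients flow only into feats
        opt.step()

    with torch.no_grad():
        return model.generate(feats, lang="tgt")     # target-language caption
```

Because only the intermediate features are updated, the pretrained weights remain unchanged and no additional training pass is required.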
In the second part, we demonstrate that directly improving the gradient-based explanations of vision-language models produces superior visual grounding results. Visual grounding is a well-defined task that requires a model to identify the image regions corresponding to given text phrases. Most previous work studies this problem by extracting image region features and measuring the similarities between image and text features. We observe that visual explanations for text phrases can be used to solve the visual grounding task directly, and we propose a margin-based loss for fine-tuning joint vision-language models so that their gradient-based visual explanations are consistent with region-level annotations provided by humans.
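The following minimal sketch illustrates one plausible form of such a margin-based consistency loss, assuming a precomputed gradient-based explanation heatmap (e.g., Grad-CAM) for a phrase and a binary mask derived from the human-annotated region; the exact formulation in the thesis may differ.

```python
# Hedged sketch of a margin-based consistency loss between a gradient-based
# explanation heatmap and an annotated region mask (illustrative only).
import torch

def explanation_margin_loss(heatmap, region_mask, margin=0.5):
    """heatmap:     (H, W) non-negative explanation scores for a phrase
       region_mask: (H, W) binary mask, 1 inside the annotated box, 0 outside
       Encourages the explanation mass inside the region to exceed the mass
       outside it by at least `margin`."""
    heatmap = heatmap / (heatmap.sum() + 1e-8)       # normalize to a distribution
    inside = (heatmap * region_mask).sum()
    outside = (heatmap * (1 - region_mask)).sum()
    return torch.clamp(margin + outside - inside, min=0.0)
```

Adding a term of this kind to the usual training objective pushes the model's explanations, and hence its grounding behavior, toward the annotated regions.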
In the last part, we introduce a new setting for relationship prediction, called Subject-Conditional Relation Detection (SCoRD), which consists of enumerating relations, objects, and boxes conditioned on an input subject in an image. We design a model that efficiently combines image and text inputs to solve this new task. Specifically, we explore a generation-based method that grounds related objects by providing a subject and its location as the text inputs.
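For illustration only, the snippet below sketches how a subject and its box might be serialized into a text prompt for a generation-based model, so that the decoder can also emit discrete location tokens for the grounded objects. The token format and function name are hypothetical and are not the exact scheme used in the thesis.

```python
# Illustrative prompt construction for subject-conditional relation detection.
def build_scord_prompt(subject, box, num_bins=1000):
    """box = (x1, y1, x2, y2), normalized to [0, 1]; coordinates are quantized
    into discrete location tokens so a text decoder can generate boxes too."""
    loc_tokens = " ".join(f"<loc_{int(round(v * (num_bins - 1)))}>" for v in box)
    return f"relations for: {subject} {loc_tokens}"

# Example: condition on the subject "person" occupying the left half of the image.
prompt = build_scord_prompt("person", (0.05, 0.10, 0.45, 0.95))
# A generation-based model could then produce outputs such as
# "riding horse <loc_312> <loc_401> ..." enumerating relations, objects, and boxes.
```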
Citation
Yang, Ziyan. "Vision and Language: Information Integration and Transformation." PhD diss., Rice University, 2023. https://hdl.handle.net/1911/115421