Vision and Language: Information Integration and Transformation

dc.contributor.advisor: Ordonez-Roman, Vicente
dc.contributor.committeeMember: Balakrishnan, Guha
dc.creator: Yang, Ziyan
dc.date.accessioned: 2024-01-24T22:54:55Z
dc.date.available: 2024-01-24T22:54:55Z
dc.date.created: 2023-12
dc.date.issued: 2023-11-30
dc.date.submitted: December 2023
dc.date.updated: 2024-01-24T22:54:55Z
dc.description.abstract: Images and text are large-scale data sources that deep learning models use to mimic how humans perceive and understand multimodal information. Combining images and text can produce better feature representations by drawing on complementary information from the two sources. For example, text descriptions often omit commonsense facts such as "apples are red," but such facts can be learned effectively from images of red apples. Learning from a single modality in isolation is not enough for complex tasks that require an understanding of the interactions and connections between modalities, such as image-text retrieval, visual question answering, visual grounding, and image captioning. In this thesis, we explore and develop techniques that combine image and text information to improve transformation between modalities. First, we study the multilingual image captioning and multimodal machine translation tasks, using additional visual information to transfer information from images and one language to another language, or directly between two languages. Second, we focus on the visual grounding task, which aims to localize the image regions that correspond to query phrases. Finally, we introduce a novel relation detection task and our solution to it.

This thesis consists of three parts. In the first part, we propose a pipeline that combines image and text information from a source language to generate text in a target language at inference time. We design a feedback propagation process for image captioning models at inference time, showing that information can be transferred between images and different languages to improve text generation without any additional training. By providing additional text in one language, this technique constructs better multimodal representations for generating text in another language.

In the second part, we demonstrate that directly improving the gradient-based explanations of vision-language models produces superior visual grounding results. Visual grounding is a well-defined task that requires a model to identify the image regions corresponding to given text phrases. Most previous work approaches this problem by extracting image region features and measuring similarities between image and text features. We observe that the visual explanations for text phrases can be used to solve the visual grounding task directly. We therefore propose a margin-based loss for tuning joint vision-language models so that their gradient-based visual explanations are consistent with region-level annotations provided by humans; a schematic sketch of this objective is given below.

In the last part, we introduce a new setting for relationship prediction, called Subject-Conditional Relation Detection (SCoRD), which consists of enumerating relations, objects, and boxes conditioned on an input subject in an image. We design a model that efficiently combines image and text inputs to solve this new task. Specifically, we explore a generation-based method that grounds related objects given a subject and its location as text inputs.
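The margin-based objective from the second part can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's actual implementation: it assumes a Grad-CAM-style attribution computed from a convolutional feature map of a vision-language model, and the function names (gradcam_heatmap, explanation_margin_loss) and the margin value are hypothetical.

    import torch
    import torch.nn.functional as F

    def gradcam_heatmap(score, conv_features):
        # score: scalar image-text matching score for one text phrase.
        # conv_features: (C, H, W) feature map from the visual backbone;
        # it must be part of the autograd graph.
        # create_graph=True keeps the attribution differentiable so the
        # model can be tuned through its own explanation.
        grads = torch.autograd.grad(score, conv_features, create_graph=True)[0]
        weights = grads.mean(dim=(1, 2), keepdim=True)      # (C, 1, 1) channel weights
        cam = F.relu((weights * conv_features).sum(dim=0))  # (H, W) attribution map
        return cam / (cam.sum() + 1e-8)                     # normalize to unit mass

    def explanation_margin_loss(cam, box_mask, margin=0.1):
        # box_mask: (H, W) binary mask, 1 inside the human-annotated region.
        inside = (cam * box_mask).sum()
        outside = (cam * (1 - box_mask)).sum()
        # Hinge penalty: require the attribution mass inside the annotated
        # box to exceed the mass outside it by at least `margin`.
        return F.relu(margin + outside - inside)

Minimizing this hinge term alongside the usual image-text matching objective pushes the model's gradient-based explanation for a phrase to concentrate inside the human-annotated region, which is the consistency property the abstract describes.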
dc.format.mimetype: application/pdf
dc.identifier.citation: Yang, Ziyan. "Vision and Language: Information Integration and Transformation." PhD diss., Rice University, 2023. https://hdl.handle.net/1911/115421
dc.identifier.uri: https://hdl.handle.net/1911/115421
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Computer Vision
dc.subject: Natural Language Processing
dc.subject: Multimodal Machine Learning
dc.title: Vision and Language: Information Integration and Transformation
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
Files

Original bundle:
YANG-DOCUMENT-2023.pdf (16.33 MB, Adobe Portable Document Format)

License bundle:
PROQUEST_LICENSE.txt (5.84 KB, Plain Text)
LICENSE.txt (2.98 KB, Plain Text)