Vision and Language: Information Integration and Transformation

dc.contributor.advisor: Ordonez-Roman, Vicente
dc.contributor.committeeMember: Balakrishnan, Guha
dc.creator: Yang, Ziyan
dc.date.accessioned: 2024-01-24T22:54:55Z
dc.date.available: 2024-01-24T22:54:55Z
dc.date.created: 2023-12
dc.date.issued: 2023-11-30
dc.date.submitted: December 2023
dc.date.updated: 2024-01-24T22:54:55Z
dc.description.abstract: Images and text are large-scale data sources that deep learning models use to mimic how humans perceive and understand multimodal information. Combining images and text can produce better feature representations by drawing on complementary information from the two sources. For example, text descriptions often omit commonsense facts such as "apples are red," but such facts can be learned effectively from images of red apples. Learning from a single modality in isolation is not enough for complex tasks that require an understanding of the interactions and connections between modalities, such as image-text retrieval, visual question answering, visual grounding, and image captioning. In this thesis, we explore and develop techniques that combine image and text information to improve transformation between modalities. First, we study the multilingual image captioning and multimodal machine translation tasks, using additional visual information to transfer information from images and one language to another language, or directly between two languages. Second, we focus on the visual grounding task, which aims to localize the image regions that correspond to query phrases. Finally, we introduce a novel relation detection task and our solution to it.

This thesis consists of three parts. In the first part, we propose a pipeline that combines image and text information from a source language to generate text in a target language at inference time. We design a feedback propagation process for image captioning models at inference time, showing that information can be transferred between images and different languages to improve text generation without any additional training. By providing additional text in one language, this technique constructs better multimodal representations for generating text in another language.

In the second part, we demonstrate that directly improving the gradient-based explanations of vision-language models produces superior visual grounding results. Visual grounding is a well-defined task that requires a model to identify the image regions corresponding to given text phrases. Most previous work approaches this problem by extracting image region features and measuring similarities between image and text features. We observe that the visual explanations for text phrases can be used to solve the visual grounding task directly. We therefore propose a margin-based loss for tuning joint vision-language models so that their gradient-based visual explanations are consistent with region-level annotations provided by humans; a schematic sketch of this objective is given below.

In the last part, we introduce a new setting for relationship prediction, called Subject-Conditional Relation Detection (SCoRD), which consists of enumerating relations, objects, and boxes conditioned on an input subject in an image. We design a model that efficiently combines image and text inputs to solve this new task. Specifically, we explore a generation-based method that grounds related objects given a subject and its location as text inputs.
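The margin-based objective from the second part can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's actual implementation: it assumes a Grad-CAM-style attribution computed from a convolutional feature map of a vision-language model, and the function names (gradcam_heatmap, explanation_margin_loss) and the margin value are hypothetical.

    import torch
    import torch.nn.functional as F

    def gradcam_heatmap(score, conv_features):
        # score: scalar image-text matching score for one text phrase.
        # conv_features: (C, H, W) feature map from the visual backbone;
        # it must be part of the autograd graph.
        # create_graph=True keeps the attribution differentiable so the
        # model can be tuned through its own explanation.
        grads = torch.autograd.grad(score, conv_features, create_graph=True)[0]
        weights = grads.mean(dim=(1, 2), keepdim=True)      # (C, 1, 1) channel weights
        cam = F.relu((weights * conv_features).sum(dim=0))  # (H, W) attribution map
        return cam / (cam.sum() + 1e-8)                     # normalize to unit mass

    def explanation_margin_loss(cam, box_mask, margin=0.1):
        # box_mask: (H, W) binary mask, 1 inside the human-annotated region.
        inside = (cam * box_mask).sum()
        outside = (cam * (1 - box_mask)).sum()
        # Hinge penalty: require the attribution mass inside the annotated
        # box to exceed the mass outside it by at least `margin`.
        return F.relu(margin + outside - inside)

Minimizing this hinge term alongside the usual image-text matching objective pushes the model's gradient-based explanation for a phrase to concentrate inside the human-annotated region, which is the consistency property the abstract describes.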
dc.format.mimetype: application/pdf
dc.identifier.citation: Yang, Ziyan. "Vision and Language: Information Integration and Transformation." PhD diss., Rice University, 2023. https://hdl.handle.net/1911/115421
dc.identifier.uri: https://hdl.handle.net/1911/115421
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Computer Vision
dc.subject: Natural Language Processing
dc.subject: Multimodal Machine Learning
dc.title: Vision and Language: Information Integration and Transformation
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Computer Science
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
Files

Original bundle:
YANG-DOCUMENT-2023.pdf (16.33 MB, Adobe Portable Document Format)

License bundle:
PROQUEST_LICENSE.txt (5.84 KB, Plain Text)
LICENSE.txt (2.98 KB, Plain Text)