Browsing by Author "Yang, Ziyan"
Now showing 1 - 2 of 2
Item
Backpropagation-Based Decoding for Multimodal Machine Translation (Frontiers Media S.A., 2022)
Yang, Ziyan; Pinto-Alva, Leticia; Dernoncourt, Franck; Ordonez, Vicente
People are able to describe images using thousands of languages, but languages share only one visual world. The aim of this work is to use the learned intermediate visual representations from a deep convolutional neural network to transfer information across languages for which paired data is not available in any form. Our work proposes using backpropagation-based decoding coupled with transformer-based multilingual-multimodal language models to obtain translations between any pair of languages used during training. We particularly show the capabilities of this approach on German-Japanese and Japanese-German translation, given training data of images freely associated with text in English, German, and Japanese, but for which no single image contains annotations in both Japanese and German. Moreover, we demonstrate that our approach is also generally useful for multilingual image captioning when sentences in a second language are available at test time. Our results also compare favorably on the Multi30k dataset against recently proposed methods that likewise leverage images as an intermediate source for translation.
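The backpropagation-based decoding above refines intermediate visual representations at inference time rather than updating model weights. The following is a minimal PyTorch sketch of that idea, assuming a frozen multimodal model whose forward pass maps visual features and teacher-forced tokens to per-token logits; the function name, model interface, optimizer choice, and step count are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of feedback propagation / backpropagation-based decoding at inference.
# All names and the model interface are illustrative, not the authors' code.
import torch
import torch.nn.functional as F

def feedback_refine_visual_features(model, visual_feats, src_tokens, steps=5, lr=0.1):
    """Refine intermediate visual features by backpropagating the loss of an
    available source-language sentence through a frozen captioning model."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)                 # model weights stay frozen

    feats = visual_feats.clone().detach().requires_grad_(True)
    optimizer = torch.optim.SGD([feats], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        # model(...) is assumed to return per-token logits under teacher forcing
        logits = model(feats, src_tokens[:, :-1])
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            src_tokens[:, 1:].reshape(-1),
        )
        loss.backward()                         # gradients flow only into the features
        optimizer.step()

    return feats.detach()

# The refined features would then be decoded in the target language, e.g.:
# refined = feedback_refine_visual_features(model, visual_feats, german_tokens)
# japanese_output = model.generate(refined, lang="ja")   # hypothetical decoding API
```

Because only the visual features are updated, this procedure needs no additional training and can transfer information from whichever source-language sentence happens to be available at test time.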
Item
Vision and Language: Information Integration and Transformation (2023-11-30)
Yang, Ziyan; Ordonez-Roman, Vicente; Balakrishnan, Guha
Images and text are large-scale data sources for deep learning models to mimic how humans perceive and understand multimodal information. Combining images and text can construct better feature representations by incorporating complementary information from different sources. For example, text descriptions may omit commonsense information such as "apples are red," but such information can be effectively learned from images with red apples. Learning from a single modality alone is not enough for complex tasks that require an understanding of the interactions and connections between modalities, such as image-text retrieval, visual question answering, visual grounding, and image captioning. In particular, we explore and develop techniques that combine image and text information to improve transformation between modalities. First, we address the multilingual image captioning and multimodal machine translation tasks, using additional visual information to transfer information from images and one language to another language, or directly between two languages. Second, we focus on the visual grounding task, which aims to localize the image regions corresponding to query phrases. Finally, we introduce a novel relation detection task and our solution for it. This thesis consists of three parts.
In the first part, we propose a pipeline that combines image and text information from a source language to generate text in a target language at inference time. We design a feedback propagation process for image captioning models at inference time, showing that information can be transferred between images and different languages to improve text generation without any additional training. By providing additional text in one language, we show that this technique constructs better multimodal representations for generating text in another language.
In the second part, we demonstrate that directly improving the gradient-based explanations of vision-language models produces superior visual grounding results. Visual grounding is a well-defined task that requires a model to identify the image regions corresponding to given text phrases. Most previous works study this problem by extracting image region features and measuring the similarities between image and text features. We observe that the visual explanations for text phrases can be used to solve the visual grounding task directly. We then propose a margin-based loss for tuning joint vision-language models so that their gradient-based visual explanations are consistent with region-level annotations provided by humans.
In the last part, we introduce a new setting for relationship prediction, called Subject-Conditional Relation Detection (SCoRD), which consists of enumerating relations, objects, and boxes conditioned on an input subject in an image. We design a model that efficiently combines image and text inputs to solve this new task. Specifically, we explore a generation-based method that grounds related objects given a subject and its location as text inputs.
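The second part of the thesis ties a model's gradient-based explanation to human box annotations. The sketch below illustrates one way such a constraint can be written, assuming a Grad-CAM-style heatmap computed from an image-text matching score and a binary mask rasterized from the annotated region; the heatmap construction and the exact margin formulation here are assumptions, not the thesis' precise objective.

```python
# Hedged sketch: a margin-based loss that pushes a Grad-CAM-style explanation
# to concentrate inside a human-annotated region. Illustrative, not the
# thesis' exact recipe.
import torch
import torch.nn.functional as F

def gradcam_heatmap(score, feature_map):
    """Grad-CAM-style explanation: channel weights come from gradients of the
    image-text score with respect to a (B, C, H, W) feature map."""
    grads = torch.autograd.grad(score.sum(), feature_map, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)            # (B, C, 1, 1)
    cam = F.relu((weights * feature_map).sum(dim=1))          # (B, H, W)
    cam = cam / (cam.flatten(1).max(dim=1).values.view(-1, 1, 1) + 1e-6)
    return cam

def margin_grounding_loss(cam, region_mask, margin=0.1):
    """Encourage the mean explanation value inside the annotated region to
    exceed the mean value outside it by at least `margin`."""
    inside = (cam * region_mask).sum(dim=(1, 2)) / (region_mask.sum(dim=(1, 2)) + 1e-6)
    outside = (cam * (1.0 - region_mask)).sum(dim=(1, 2)) / ((1.0 - region_mask).sum(dim=(1, 2)) + 1e-6)
    return F.relu(margin - (inside - outside)).mean()
```

In a fine-tuning setup, a term like this would be added to the model's original image-text objective (with `create_graph=True` so the explanation remains differentiable), so the gradient-based explanations become consistent with the annotations while the matching model keeps training on its primary task.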