Browsing by Author "Ordonez-Roman, Vicente"
Item: Taming Data and Transformers for Audio Generation (2024-12-05)
Haji Ali, Moayed; Ordonez-Roman, Vicente

Generating ambient sounds is a challenging task due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle this problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. By using a compact audio representation and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of 83.2, a 3.2% improvement over the best available captioning model, at four times faster inference speed. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. Using AutoCap to caption clips from existing audio datasets, we demonstrate the benefits of data scaling with synthetic captions as well as model size scaling. When compared to state-of-the-art audio generators trained at a similar size and data scale, GenAu obtains improvements of 4.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly higher quality of generated audio than previous works. Moreover, we propose an efficient and scalable pipeline for collecting audio datasets, enabling us to compile 57M ambient audio clips and form AutoReCap-XL, the largest available audio-text dataset, at 90 times the scale of existing ones. Our code, model checkpoints, and dataset will be made publicly available upon acceptance.
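The abstract above describes a caption-then-scale pipeline: an automatic captioner (AutoCap) annotates large pools of unlabeled clips, and the resulting synthetic captions are used to train a larger generator (GenAu). The sketch below illustrates only the outline of that idea; the `Clip` record, the `caption_fn` interface, the confidence filter, and the JSONL manifest format are assumptions for illustration, not the released AutoCap/AutoReCap-XL tooling.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

@dataclass
class Clip:
    path: str
    metadata: dict  # e.g. title/tags scraped alongside the audio (hypothetical fields)

def build_synthetic_caption_dataset(
    caption_fn: Callable[[Clip], Tuple[str, float]],  # returns (caption, confidence)
    clips: Iterable[Clip],
    out_path: str,
    min_confidence: float = 0.5,
) -> int:
    """Caption each unlabeled clip, keep confident pairs, write a JSONL manifest."""
    kept = 0
    with open(out_path, "w") as f:
        for clip in clips:
            caption, confidence = caption_fn(clip)   # captioning model call, abstracted away
            if confidence >= min_confidence:         # crude quality filter
                f.write(json.dumps({"audio": clip.path, "caption": caption}) + "\n")
                kept += 1
    return kept
```

The generator would then be trained on the resulting audio-text pairs; scaling the clip pool and the model size together is what the abstract reports as the main benefit.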
Item: Vision and Language: Information Integration and Transformation (2023-11-30)
Yang, Ziyan; Ordonez-Roman, Vicente; Balakrishnan, Guha

Images and text are large-scale data sources that let deep learning models mimic how humans perceive and understand multimodal information. Combining images and text can produce better feature representations by bringing in complementary information from different sources. For example, text descriptions may omit commonsense facts such as "apples are red," but such information can be learned effectively from images of red apples. Learning from a single modality alone is not enough for complicated tasks that require an understanding of the interactions and connections between modalities, such as image-text retrieval, visual question answering, visual grounding, and image captioning. In particular, we explore and develop techniques that combine image and text information to improve transformation between modalities. First, we address the multilingual image captioning and multimodal machine translation tasks, using additional visual information to transfer information from images and one language to another language, or directly between two languages. Second, we focus on the visual grounding task, which aims to localize the image regions corresponding to query phrases. Finally, we introduce a novel relation detection task and our solution for it. This thesis consists of three parts. In the first part, we propose a pipeline that combines image information and text in a source language to generate text in a target language at inference time. We design a feedback propagation process for image captioning models at inference time, showing that information can be transferred between images and different languages to improve text generation without any additional training. By providing additional text information in one language, we show that this technique constructs better multimodal representations for generating text in another language. In the second part, we demonstrate that directly improving the gradient-based explanations of vision-language models produces superior visual grounding results. Visual grounding is a well-defined task that requires the model to identify the image regions corresponding to given text phrases. Most previous works study this problem by extracting image region features and measuring the similarities between image and text features. We observe that the visual explanations for text phrases can be used to solve the visual grounding task directly. We then propose a margin-based loss for tuning joint vision-language models so that their gradient-based visual explanations are consistent with region-level annotations provided by humans. In the last part, we introduce a new setting for relationship prediction, called Subject-Conditional Relation Detection (SCoRD), which consists of enumerating relations, objects, and boxes conditioned on an input subject in an image. We design a model that efficiently combines image and text inputs to solve this new task. Specifically, we explore a generation-based method that grounds related objects by providing a subject and its location as the text input.
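The first part of the abstract above describes inference-time feedback propagation: a known caption in one language is used to refine the shared visual representation before decoding in another language, without any extra training. Below is a minimal sketch of that loop, assuming an encoder-decoder captioner that exposes hypothetical `encode_image`, `caption_nll`, and `generate` methods; these interfaces are illustrative stand-ins, not the thesis code.

```python
import torch

def refine_features_with_feedback(model, image, src_caption_ids, steps=3, lr=0.1):
    """Nudge the visual features with the loss of a known source-language caption,
    then decode the target-language caption from the refined features."""
    feats = model.encode_image(image).detach().requires_grad_(True)
    optimizer = torch.optim.SGD([feats], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = model.caption_nll(feats, src_caption_ids, lang="src")  # teacher-forced NLL
        loss.backward()
        optimizer.step()                    # updates the features, not the model weights
    with torch.no_grad():
        return model.generate(feats, lang="tgt")  # decode in the target language
```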
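The second part tunes a joint vision-language model so that its gradient-based explanation agrees with human region annotations via a margin-based loss. The sketch below shows one way such an objective could be written, assuming the image-text score is computed from a spatial feature map and the explanation is a Grad-CAM-style heatmap; the margin value, the normalization, and the exact attribution method are illustrative choices rather than the thesis implementation.

```python
import torch
import torch.nn.functional as F

def explanation_consistency_loss(score, image_feats, region_mask, margin=0.1):
    """Encourage the explanation heatmap for `score` to concentrate inside `region_mask`.

    score:       scalar image-text matching score for one phrase
    image_feats: (C, H, W) feature map that `score` was computed from (requires grad)
    region_mask: (H, W) binary mask built from the human box annotation
    """
    grads, = torch.autograd.grad(score, image_feats, create_graph=True)
    weights = grads.mean(dim=(1, 2), keepdim=True)     # per-channel importance
    cam = F.relu((weights * image_feats).sum(dim=0))   # (H, W) Grad-CAM-style heatmap
    cam = cam / (cam.sum() + 1e-6)                     # normalize to a spatial distribution
    inside = (cam * region_mask).sum()
    outside = (cam * (1.0 - region_mask)).sum()
    return F.relu(margin - (inside - outside))         # hinge: inside mass should win by `margin`
```

In this kind of setup, the term above would be added to the model's usual image-text matching objective during fine-tuning, so the explanation and the annotated region are pulled into agreement.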
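The last part's SCoRD setting is described as generation-based: the subject and its location are given as text, and relations, objects, and boxes are enumerated as output. Purely as an illustration of what such a text interface could look like (the token conventions and box serialization below are assumptions, not the format defined in the thesis):

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]

def scord_prompt(subject: str, box: Box) -> str:
    """Serialize the conditioning subject and its box as the model's text input."""
    x1, y1, x2, y2 = box
    return f"subject: {subject} <box>{x1} {y1} {x2} {y2}</box> relations:"

def parse_scord_output(text: str) -> List[Tuple[str, str, Box]]:
    """Recover (relation, object, box) triples from the generated text."""
    pattern = r"(\w+)\s+([\w ]+?)\s*<box>(\d+) (\d+) (\d+) (\d+)</box>"
    return [(rel, obj.strip(), (int(x1), int(y1), int(x2), int(y2)))
            for rel, obj, x1, y1, x2, y2 in re.findall(pattern, text)]
```

For example, a generated string like "holding cup <box>120 40 210 160</box>" would parse to ("holding", "cup", (120, 40, 210, 160)).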