Taming Data and Transformers for Audio Generation

dc.contributor.advisor: Ordonez-Roman, Vicente (en_US)
dc.creator: Haji Ali, Moayed (en_US)
dc.date.accessioned: 2025-01-17T17:25:46Z (en_US)
dc.date.available: 2025-01-17T17:25:46Z (en_US)
dc.date.created: 2024-12 (en_US)
dc.date.issued: 2024-12-05 (en_US)
dc.date.submitted: December 2024 (en_US)
dc.date.updated: 2025-01-17T17:25:46Z (en_US)
dc.description.abstract: Generating ambient sounds is a challenging task because data is scarce and caption quality is often insufficient, which makes it difficult to employ large-scale generative models for the task. In this work, we tackle this problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. By using a compact audio representation and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of 83.2, a 3.2% improvement over the best available captioning model, at four times faster inference speed. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. Using AutoCap to caption clips from existing audio datasets, we demonstrate the benefits of data scaling with synthetic captions as well as model size scaling. When compared to state-of-the-art audio generators trained at similar size and data scale, GenAu obtains improvements of 4.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating higher quality of generated audio than previous works. Moreover, we propose an efficient and scalable pipeline for collecting audio datasets, enabling us to compile 57M ambient audio clips, forming AutoReCap-XL, the largest available audio-text dataset, at 90 times the scale of existing ones. Our code, model checkpoints, and dataset will be made publicly available upon acceptance. (en_US)
dc.format.mimetype: application/pdf (en_US)
dc.identifier.uri: https://hdl.handle.net/1911/118239 (en_US)
dc.language.iso: en (en_US)
dc.subject: Generative models (en_US)
dc.subject: Audio Generation (en_US)
dc.subject: Audio Captioning (en_US)
dc.subject: Audio Dataset (en_US)
dc.subject: Dataset (en_US)
dc.subject: Transformers (en_US)
dc.subject: Text-to-Audio (en_US)
dc.title: Taming Data and Transformers for Audio Generation (en_US)
dc.type: Thesis (en_US)
dc.type.material: Text (en_US)
thesis.degree.department: Computer Science (en_US)
thesis.degree.discipline: Computer Science, Computer Science (en_US)
thesis.degree.grantor: Rice University (en_US)
thesis.degree.level: Masters (en_US)
thesis.degree.name: Master of Science (en_US)
Files
Original bundle
Name: HAJIALI-DOCUMENT-2024.pdf
Size: 2.84 MB
Format: Adobe Portable Document Format
License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text
Name: LICENSE.txt
Size: 2.98 KB
Format: Plain Text