Finetuning a Text-to-Audio Model for Room Impulse Response Generation
Room Impulse Responses (RIRs) enable realistic acoustic simulation, with applications ranging from multimedia production to speech data augmentation. However, acquiring high-quality real-world RIRs is labor-intensive, and data scarcity remains a challenge for data-driven RIR generation approaches. In this paper, we propose a novel approach to RIR generation by fine-tuning a pre-trained text-to-audio model, demonstrating for the first time that large-scale generative audio priors can be effectively leveraged for the task. To address the lack of text-RIR paired data, we establish a labeling pipeline utilizing vision-language models to extract acoustic descriptions from existing image-RIR datasets. We introduce an in-context learning strategy to accommodate free-form user prompts during inference. Evaluations involving MUSHRA listening tests and downstream ASR performance demonstrate that our model generates plausible RIRs and serves as an effective tool for speech data augmentation.
💡 Research Summary
The paper tackles the problem of generating realistic room impulse responses (RIRs) from natural‑language descriptions, a task that traditionally requires either labor‑intensive acoustic measurements or physics‑based simulations that need detailed geometric and material parameters. Recent data‑driven approaches have tried to generate RIRs conditioned on images or acoustic metrics, but they suffer from a lack of large, paired text‑RIR datasets and often rely on synthetic RIRs that deviate from real‑world physics.
To overcome these limitations, the authors propose to fine‑tune a large, pre‑trained text‑to‑audio (TTA) model—Stable Audio Open—on a modest collection of real RIRs. Stable Audio Open has been trained on 7,300 h of diverse audio (speech, music, environmental sounds) and therefore encodes rich acoustic priors. By freezing its T5 text encoder and variational autoencoder (VAE) and updating only the diffusion transformer (DiT) for five epochs, the model adapts its generative capabilities to the RIR domain while preserving the broad audio knowledge acquired during pre‑training.
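The selective fine-tuning recipe (freeze the T5 encoder and VAE, update only the DiT) can be sketched as a simple parameter-selection rule. The component names below are illustrative placeholders, not Stable Audio Open's actual module paths:

```python
# Sketch of the selective fine-tuning recipe: only parameters belonging to
# the diffusion transformer (DiT) receive gradient updates; the T5 text
# encoder and the VAE stay frozen. Names are hypothetical, for illustration.

def trainable_params(param_names, trainable_prefixes=("dit.",)):
    """Return the parameter names that should be updated during fine-tuning."""
    return [name for name in param_names
            if name.startswith(trainable_prefixes)]

# Toy parameter list mimicking the model's three components.
params = [
    "t5_encoder.block0.weight",        # frozen text encoder
    "vae.decoder.conv1.weight",        # frozen autoencoder
    "dit.layer0.attn.q_proj.weight",   # trainable
    "dit.layer0.mlp.fc1.weight",       # trainable
]

print(trainable_params(params))
# only the two "dit." parameters are selected for training
```

In a real training loop the frozen parameters would additionally have gradient tracking disabled, so only the DiT consumes optimizer state.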
A major obstacle is the absence of paired text‑RIR data. The authors solve this by exploiting an existing image‑RIR dataset (BUT ReverbDB). They run three state‑of‑the‑art vision‑language models (Llama‑3.2‑Vision‑90B, Qwen2.5‑VL‑72B‑Instruct, Molmo2‑8B) on each room image, prompting them to act as acoustic experts and describe geometry, surface materials, and expected reverberation characteristics. An LLM‑as‑judge (Llama‑3.3‑70B‑Instruct) scores each caption; only images for which at least two VLMs obtain a score > 3 are retained. The highest‑scoring caption is merged with the room's metadata (volume, RT60, etc.) via another LLM to produce a standardized natural‑language prompt. This pipeline yields 1,736 high‑quality text‑RIR pairs for training.
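The filtering rule described above reduces to a small selection function. The data here is hypothetical; only the thresholds (at least two captions scoring above 3, keep the highest-scoring one) come from the summary:

```python
# Sketch of the caption-filtering step: an image is kept only if at least
# two VLM captions clear the judge's score threshold; the highest-scoring
# caption then feeds the prompt-construction stage. Captions are made up.

def select_caption(scored_captions, min_votes=2, threshold=3):
    """scored_captions: list of (caption, judge_score) pairs for one image.
    Returns the winning caption, or None if the image is rejected."""
    passing = [(c, s) for c, s in scored_captions if s > threshold]
    if len(passing) < min_votes:
        return None  # fewer than two VLMs cleared the bar -> drop the image
    return max(passing, key=lambda cs: cs[1])[0]

captions = [
    ("small tiled bathroom, hard reflective surfaces", 4),
    ("medium office with carpet and acoustic ceiling", 5),
    ("large hall", 2),
]
print(select_caption(captions))  # -> "medium office with carpet and acoustic ceiling"
```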
To make the system usable with arbitrary user inputs, the authors introduce an in‑context learning (ICL) module. When a user supplies a free‑form description, a language model receives a system prompt plus five example pairs (raw caption ↔ refined prompt) and generates an intermediate acoustic caption, which is then transformed into the standardized prompt format. Cosine similarity between T5 embeddings of refined prompts and ground‑truth prompts reaches 0.955, far above the 0.744 similarity of raw free‑form text, confirming that ICL aligns user queries with the model’s training distribution.
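The alignment numbers above (0.955 vs. 0.744) are cosine similarities between prompt embeddings. A minimal sketch of the metric, with toy vectors standing in for real T5 embeddings:

```python
# Cosine similarity between an embedding of a refined prompt and an
# embedding of the ground-truth prompt. The three-dimensional vectors
# below are stand-ins; real T5 embeddings are high-dimensional.
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

refined_emb = [0.9, 0.1, 0.4]
ground_truth_emb = [1.0, 0.0, 0.5]
print(round(cosine_similarity(refined_emb, ground_truth_emb), 3))
```

Identical directions score 1.0 and orthogonal ones 0.0, so the jump from 0.744 to 0.955 indicates the ICL refinement moves user text much closer to the training-prompt distribution.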
Experimental evaluation is thorough. Quantitatively, the fine‑tuned model achieves a mean RT60 error of 5.56 % (median –31.73 %) on the test rooms, dramatically outperforming Image2Reverb (≈96 % mean error) and matching the performance reported by PromptReverb despite using roughly 1 % of the training data (1,736 vs. 145,976 samples). Subjectively, a MUSHRA listening test with 17 valid participants rates the proposed model at 55 ± 2.2 (0–100 scale), significantly higher than Image2Reverb variants (≈42–47) and even the low‑pass anchor (51). The hidden reference (ground‑truth RIR) scores 99, indicating a perceptual gap that remains to be closed.
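The RT60 error metric implied above can be sketched as a signed relative error per room, aggregated as mean and median. The RT60 values below are hypothetical, not the paper's test rooms:

```python
# Signed relative RT60 error in percent, with mean and median aggregates.
# A negative value means the generated RIR is less reverberant than the
# reference, consistent with the negative median reported in the paper.
from statistics import mean, median

def rt60_errors(estimated, reference):
    """Per-room signed relative RT60 error, in percent."""
    return [100.0 * (est - ref) / ref for est, ref in zip(estimated, reference)]

# Hypothetical RT60 values in seconds.
est = [0.45, 0.80, 1.10]
ref = [0.50, 0.75, 1.60]
errs = rt60_errors(est, ref)
print(mean(errs), median(errs))
```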
Finally, the authors assess downstream utility by augmenting LibriSpeech test utterances with generated RIRs and measuring ASR performance using WhisperX. Word error rate (WER) rises only from 0.08 % (ground‑truth RIRs) to 0.12 % (generated RIRs), a non‑significant difference (p = 0.728). PESQ and STOI scores are slightly higher for generated RIRs, reflecting the model's tendency to produce less reverberant (i.e., cleaner) speech, consistent with the negative median RT60 error.
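The augmentation step above amounts to convolving clean speech with an RIR to simulate reverberant recordings. A minimal numpy sketch, with random signals standing in for real audio:

```python
# Simulating a reverberant utterance by convolving "speech" with a room
# impulse response. Random signals replace real LibriSpeech audio here.
import numpy as np

def apply_rir(speech, rir):
    """Full linear convolution of a speech signal with an RIR."""
    return np.convolve(speech, rir)

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                               # 1 s at 16 kHz
rir = rng.standard_normal(8000) * np.exp(-np.linspace(0, 8, 8000))  # decaying tail
wet = apply_rir(speech, rir)
print(wet.shape)  # full convolution length: len(speech) + len(rir) - 1
```

In practice the convolved signal is usually truncated back to the original length and loudness-normalized before being fed to the ASR system.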
In summary, the paper demonstrates that (1) large‑scale generative audio priors can be efficiently transferred to the RIR domain with minimal fine‑tuning, (2) VLM‑driven automatic captioning can create high‑quality text‑RIR pairs without manual annotation, and (3) an ICL‑based prompt refinement enables robust handling of free‑form user queries. The approach achieves data‑efficiency, competitive acoustic accuracy, and practical usefulness for speech‑technology pipelines. Future work may explore fine‑tuning the frozen encoder and VAE for finer acoustic control, expanding evaluation to a broader variety of room sizes and shapes, and hybridizing the generative model with physics‑based simulators to capture extreme reverberation scenarios.