ALARM: Audio-Language Alignment for Reasoning Models
Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs), whose built-in chain-of-thought traces expose the textual surrogate input, yielding unnatural responses. We propose self-rephrasing, converting self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on audio-reasoning benchmarks, while preserving textual capabilities at low training cost. Notably, we achieve the best open-source result on the MMAU-speech and MMSU benchmarks and rank third among all models.
💡 Research Summary
ALARM: Audio‑Language Alignment for Reasoning Models tackles a critical gap in current audio‑language models (ALMs) that integrate large language models (LLMs) with auditory perception. While many existing ALMs freeze the LLM and train a lightweight adapter on self‑generated textual targets, this strategy breaks down for reasoning‑capable LLMs (RLMs) that employ chain‑of‑thought (CoT) prompting. In such models, the internal reasoning process reveals that the input is textual, leading to unnatural, “text‑only” responses when the model is later asked to handle raw audio.
To resolve this, the authors introduce a self‑rephrasing pipeline. First, a frozen RLM (Qᵣ) generates an initial answer R₀ conditioned on the textual metadata (A_text) and a user prompt (P). Second, the same frozen RLM is prompted to rewrite R₀ into an audio‑grounded style (R_text) according to a set of rephrasing rules (I_reph) that explicitly forbid references to the underlying text or metadata. This two‑stage process preserves the original distribution of the RLM’s outputs while ensuring that the final targets are compatible with an audio‑only input. A token budget (B = 1536) controls the length of the rephrasing chain‑of‑thought, balancing quality and computational cost.
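The two-stage process above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the `rlm` callable and the rule wording in `REPHRASE_RULES` are placeholder assumptions.

```python
# Hypothetical rephrasing rules I_reph; the paper's actual instructions
# are more detailed.
REPHRASE_RULES = (
    "Rewrite the answer so it reads as if produced from the audio itself. "
    "Never mention transcripts, captions, metadata, or any textual input."
)

def self_rephrase(rlm, audio_metadata, user_prompt, token_budget=1536):
    """Two-stage self-rephrasing with a frozen reasoning LLM `rlm`.

    Stage 1: answer the prompt P conditioned on the textual surrogate A_text.
    Stage 2: rewrite R0 into an audio-grounded target R_text under I_reph.
    """
    # Stage 1: initial answer R0 from the textual metadata.
    r0 = rlm(f"{audio_metadata}\n\n{user_prompt}", max_tokens=token_budget)
    # Stage 2: the same frozen RLM rewrites R0 under the rephrasing rules.
    r_text = rlm(
        f"{REPHRASE_RULES}\n\nOriginal answer:\n{r0}",
        max_tokens=token_budget,
    )
    return r_text
```

Because both stages use the same frozen model, the rewritten targets stay close to the RLM's native output distribution, which is the property the paper argues is lost when targets are generated by a different model.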
The paper also presents a large, high‑quality multimodal corpus called the ALARM corpus. It contains 6 million training instances (2.5 million unique prompts) covering 19,000 hours of speech, music, and general sound. Prompt generation leverages a pretrained instruction LLM (Q) to sample 20 candidate prompts per audio sample, then filters them for (i) answerability given the available metadata and (ii) avoidance of textual‑input cues. The final prompt is uniformly sampled from the filtered set. This pipeline dramatically reduces hallucinations that plagued earlier public releases such as DeSTA‑A‑QA5M, which only offered 7K hours and 7K prompts. The corpus draws from diverse sources (e.g., LibriSpeech, VoxCeleb, GTZAN, AudioSet) and includes additional instruction data (HeySQuAD, Instructs2s) to enrich question‑answer diversity.
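The filter-then-sample step can be sketched as below. The keyword list `TEXT_CUES` and the injected `is_answerable` predicate are simplifying assumptions; the paper applies LLM-based checks rather than surface keyword matching.

```python
import random

# Illustrative surface cues that would betray a textual input.
TEXT_CUES = ("transcript", "caption", "metadata", "according to the text")

def select_prompt(candidates, is_answerable, rng=None):
    """Filter the 20 sampled candidate prompts, then uniformly sample one.

    A candidate survives if (i) it is answerable from the available
    metadata and (ii) it contains no textual-input cue.
    """
    rng = rng or random.Random()
    kept = [p for p in candidates
            if is_answerable(p)
            and not any(cue in p.lower() for cue in TEXT_CUES)]
    return rng.choice(kept) if kept else None
```

Returning `None` when nothing survives corresponds to dropping the audio sample from the prompt pool rather than forcing a bad prompt through.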
Model architecture – ALARM builds on a frozen 4 B‑parameter reasoning LLM (Qwen3‑4B‑Thinking‑2507) and augments it with four specialized audio encoders: Whisper (speech‑oriented), W2V‑BERT‑2.0 (broad auditory cues), MuQ (music), and SS‑LAM (general sound). Each encoder’s hidden layers are combined via a learnable weighted average, then compressed temporally to a token rate of 25 Hz (or 50 Hz for the ensemble). Adapter modules (two‑layer convolutions for most encoders, an MLP for MuQ) map these compressed features into the LLM’s embedding space.
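The layer-weighting and temporal compression steps can be sketched with numpy. Note the hedges: the paper compresses with two-layer convolutions (or an MLP for MuQ), so the average-pooling below is only a stand-in for the rate reduction, and the softmax normalization of the layer weights is an assumption.

```python
import numpy as np

def fuse_layers(hidden_states, layer_logits):
    """Learnable weighted average over one encoder's hidden layers.

    hidden_states: (num_layers, T, D) stacked layer outputs.
    layer_logits:  (num_layers,) learnable weights (softmax-normalized
                   here; the exact normalization is an assumption).
    """
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()
    return np.tensordot(w, hidden_states, axes=1)  # (T, D)

def compress_rate(features, in_rate=50, out_rate=25):
    """Average-pool along time down to the target token rate (25 Hz,
    or 50 Hz for the ensemble). Stand-in for the learned conv adapter."""
    stride = in_rate // out_rate
    t = features.shape[0] // stride * stride  # drop a ragged tail frame
    return features[:t].reshape(-1, stride, features.shape[1]).mean(axis=1)
```

After compression, the adapter modules project these features into the LLM's embedding dimension so they can be interleaved with text tokens.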
Three fusion strategies are explored:
- ALARM‑CA – a stack of cross‑attention blocks that iteratively refines a primary encoder’s representation (typically Whisper) using the other encoders as context.
- ALARM‑P – Whisper remains the main stream, while three Perceiver modules compress the remaining encoders into fixed‑size prefix embeddings that are prepended to Whisper’s token sequence.
- ALARM‑E – an ensemble that combines the CA approach with Whisper at a 50 Hz token rate, offering a sweet spot between computational cost and performance.
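To make the ALARM‑P variant concrete, a single-head Perceiver-style compressor can be sketched as K learnable latent queries cross-attending over one auxiliary encoder's tokens to yield a fixed-size prefix. This is a simplified sketch: the real modules presumably use multi-head attention with learned projections, and the variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_prefix(latents, enc_tokens):
    """Compress a variable-length token stream into a fixed-size prefix.

    latents:    (K, D) learnable query vectors.
    enc_tokens: (T, D) output tokens from one auxiliary encoder.
    Returns a (K, D) prefix, independent of the input length T.
    """
    d = latents.shape[1]
    attn = softmax(latents @ enc_tokens.T / np.sqrt(d))  # (K, T)
    return attn @ enc_tokens                             # (K, D)

# In ALARM-P, the prefixes from the three auxiliary encoders would be
# prepended to Whisper's token sequence before entering the LLM, e.g.:
# seq = np.concatenate([p_w2vbert, p_muq, p_sslam, whisper_tokens], axis=0)
```

The key property is that the prefix length K is fixed, so the auxiliary encoders add only a constant number of tokens regardless of audio duration.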
Training updates only the adapters and fusion modules; the RLM’s parameters stay frozen, preserving its strong textual capabilities while efficiently learning audio‑text alignment. Experiments show that the 4 B‑parameter ALARM‑E model outperforms same‑size baselines and even surpasses many larger ALMs on benchmarks such as MMSU and MMAU‑speech. It achieves the best open‑source result on MMAU‑speech and ranks third overall when closed‑source systems are included. Importantly, this performance is attained with substantially lower training cost and data volume compared to prior work.
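Operationally, adapter-only training just means excluding the frozen RLM (and encoder) parameters from the optimizer. A sketch with hypothetical parameter-name prefixes, which are not ALARM's actual module names:

```python
def trainable_parameters(named_params):
    """Select only adapter/fusion parameters for the optimizer.

    The RLM and the pretrained audio encoders stay frozen; the prefixes
    below ("adapter.", "fusion.", "layer_weights") are illustrative.
    """
    prefixes = ("adapter.", "fusion.", "layer_weights")
    return {name: p for name, p in named_params.items()
            if name.startswith(prefixes)}
```

Because gradients never touch the RLM's weights, the model's original text-only reasoning ability is preserved by construction, and the trainable parameter count (adapters plus fusion modules) stays small relative to the 4B-parameter backbone.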
The paper’s contributions are fourfold:
- A novel self‑rephrasing method that adapts chain‑of‑thought reasoning models to audio without distribution shift.
- Elimination of ASR/VAD dependence by fusing multiple domain‑specialized encoders, enabling robust handling of both vocal and non‑vocal signals.
- Construction of a large, diverse, and carefully filtered multimodal corpus that mitigates hallucination risks.
- Release of code, data collection scripts, and model checkpoints to foster reproducibility and future research.
In summary, ALARM demonstrates that reasoning‑capable LLMs can be efficiently extended to high‑fidelity audio understanding through clever target re‑phrasing and multi‑encoder fusion, setting a new benchmark for open‑source audio‑language models.