SAM: A Mamba-2 State-Space Audio-Language Model

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.


💡 Research Summary

The paper introduces SAM (State‑space Audio‑Language Model), a multimodal model that replaces the transformer backbone typical of audio‑language models (ALMs) with the recent Mamba‑2 state‑space model (SSM). The architecture consists of four components:

  1. An audio encoder (EAT‑base, 88 M parameters) that converts mel‑spectrograms into 512 tokens of dimension 768.

  2. A multimodal connector that projects or rearranges these tokens into a sequence suitable for the SSM. Three variants are explored: (a) concatenation with dimensionality reduction to 64 tokens, (b) time‑major ordering preserving temporal continuity, and (c) frequency‑major ordering preserving spectral locality, each augmented with separator tokens to mark boundaries.

  3. A text encoder that embeds prompts and captions.

  4. A Mamba‑2 language model that processes the concatenated audio‑text embeddings.

Mamba‑2 differs from Mamba‑1 by sharing a scalar‑times‑identity state matrix across heads and enlarging the state size, yielding 2–8× faster training while maintaining or improving performance.
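To make the three connector orderings concrete, the sketch below rearranges a hypothetical token grid. The paper does not specify how the 512 tokens factor into time and frequency or how the 64‑token reduction is implemented, so the 64 × 8 factorization and the mean‑pooling stand‑in here are illustrative assumptions, not the authors' method:

```python
import numpy as np

# Assumed shapes: the encoder emits 512 tokens of dim 768, treated here as a
# hypothetical grid of 64 time steps x 8 frequency bins.
T, F, D = 64, 8, 768
tokens = np.random.randn(T, F, D)
SEP = np.zeros((1, D))        # separator token marking row/column boundaries

def time_major(tok):
    """Variant (b): walk frequency bins within each time step, separator after each step."""
    rows = [np.concatenate([tok[t], SEP], axis=0) for t in range(tok.shape[0])]
    return np.concatenate(rows, axis=0)

def freq_major(tok):
    """Variant (c): walk time steps within each frequency band, separator after each band."""
    rows = [np.concatenate([tok[:, f], SEP], axis=0) for f in range(tok.shape[1])]
    return np.concatenate(rows, axis=0)

def compressed(tok, out_len=64):
    """Variant (a): reduce to 64 tokens. Mean-pooling each time step's frequency
    tokens is an assumed stand-in for the paper's (unspecified) reduction."""
    flat = tok.reshape(-1, tok.shape[-1])                          # (512, D)
    return flat.reshape(out_len, -1, tok.shape[-1]).mean(axis=1)   # (64, D)

print(time_major(tokens).shape)   # (64*(8+1), 768) = (576, 768)
print(freq_major(tokens).shape)   # (8*(64+1), 768) = (520, 768)
print(compressed(tokens).shape)   # (64, 768)
```

Note how the two uncompressed variants produce sequences roughly 8–9× longer than the compressed one, which is the trade-off examined in ablation 2 below.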

Training follows the LTU curriculum on the OpenAQA dataset (≈5.6 M QA pairs). LoRA adapters (α = 2r) are inserted into each Mamba‑2 block, with rank r varied from 8 to 256 to study parameter efficiency. Mixed‑precision fp16 training on two RTX 4090 GPUs takes 0.5–2 days depending on model size.
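The α = 2r convention means the LoRA scaling factor α/r is fixed at 2 for every rank, so varying r changes only the adapter's capacity. A minimal sketch, following the standard LoRA recipe rather than the paper's exact code (class name and initialization are illustrative):

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: frozen weight W plus low-rank update scale * (B @ A).

    With alpha = 2r (as in the summary), scale = alpha / r = 2 at every rank.
    """
    def __init__(self, d_in, d_out, r):
        self.W = np.random.randn(d_out, d_in) * 0.02   # frozen base weight
        self.A = np.random.randn(r, d_in) * 0.01       # trainable, Gaussian init
        self.B = np.zeros((d_out, r))                  # trainable, zero init
        self.scale = (2 * r) / r                       # alpha / r with alpha = 2r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=768, d_out=768, r=8)
x = np.random.randn(4, 768)
# B is zero-initialized, so the adapter is a no-op at the start of training.
assert np.allclose(layer(x), x @ layer.W.T)
```

Raising r from 8 to 256 grows only the A and B matrices; the frozen backbone weights are untouched, which is why the paper can sweep ranks cheaply.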

Evaluation uses the LTU protocol: zero‑shot audio classification on benchmarks such as AudioSet, ESC‑50, DCASE, and audio captioning measured by SPICE. Audio‑text similarity is computed with CLAP embeddings. Additionally, the MMAU‑Sound suite (binary and multiple‑choice questions) assesses reasoning ability.

Key results: SAM‑2.7B (2.7 B parameters) achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, surpassing larger 7 B transformer‑based models (e.g., LTU‑7B, GAMA‑7B). Smaller variants (130 M, 780 M) also reach competitive scores. Increasing LoRA rank consistently improves performance (≈1–3 % absolute gain). Mamba‑2 requires ~20 % less training time than Mamba‑1 at comparable LoRA settings.

Three systematic ablations provide deeper insight:

  1. Joint audio‑encoder fine‑tuning – Models trained with the audio encoder frozen (E7‑9) underperform those fine‑tuned end‑to‑end (E4‑6). The authors measure the τ‑effective rank of the audio token matrices and find that smaller SSMs produce lower‑rank, higher‑similarity token sets after joint training, indicating that the encoder adapts to the limited recurrent state capacity of the SSM. Swapping encoders across model sizes confirms that size‑matched encoders yield the best results, reinforcing the need for co‑adaptation.

  2. Compressed vs. uncompressed audio tokens – While SSMs scale linearly with sequence length, feeding the full 512‑token sequence (variants b and c) does not consistently outperform the compressed 64‑token design (variant a). Uncompressed variants exhibit lower τ‑effective rank, especially in smaller models, suggesting that longer sequences impose a heavier burden on the recurrent state without delivering additional useful information. Thus, compact, information‑dense token representations are more beneficial than merely increasing token count.

  3. Instruction‑following supervision for reasoning – By augmenting training with OpenReasonAQA (structured binary and multiple‑choice questions derived from AudioCaps and Clotho), SAM’s reasoning performance on MMAU‑Sound improves dramatically. SAM+OR‑2.7B reaches 61.86 % on the “Sound” sub‑task, outpacing the strong Gemma‑3n‑4B baseline (55.86 %). Gains appear across all model scales, indicating that SSMs can acquire sophisticated audio reasoning when given appropriate supervision.
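The rank diagnostic behind ablations 1 and 2 can be sketched as follows. The paper's exact τ‑effective rank formula is not given in this summary, so the code assumes the common cumulative‑energy definition (the smallest number of singular values capturing a fraction τ of the total squared singular‑value mass); lower values mean a more compressible, more redundant token matrix:

```python
import numpy as np

def tau_effective_rank(X, tau=0.99):
    """Assumed definition: smallest k such that the top-k singular values
    account for a fraction tau of the total energy sum(s_i^2)."""
    s = np.linalg.svd(X, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, tau) + 1)

rng = np.random.default_rng(0)
# A near-rank-1 token matrix (strong redundancy across tokens)...
low = np.outer(rng.standard_normal(64), rng.standard_normal(768))
low += 0.01 * rng.standard_normal((64, 768))
# ...versus a full-rank random one: the diagnostic separates them cleanly.
full = rng.standard_normal((64, 768))
print(tau_effective_rank(low), tau_effective_rank(full))
```

Under this measure, the paper's observation that smaller SSMs end up with lower‑rank token sets after joint training corresponds to the encoder concentrating information into fewer directions, matching the limited recurrent state.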

Additional observations include: LoRA rank scaling yields modest but consistent gains; Mamba‑2’s matrix‑multiplication kernel contributes to faster training; and the separator token strategy helps the SSM retain structural cues despite its inherently sequential processing.

In conclusion, SAM demonstrates that Mamba‑2 SSMs serve as powerful, parameter‑efficient backbones for audio‑language tasks. The paper establishes three practical design principles: (1) jointly fine‑tune the audio encoder with the SSM, (2) prefer compact, high‑information audio token representations over long uncompressed sequences, and (3) incorporate structured instruction‑following data to boost reasoning capabilities. Future work will extend SAM to speech understanding by integrating speech‑specific encoders and datasets, and will further explore token compression and training curricula tailored to SSM characteristics.

