MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement

MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

With new sequence models like Mamba and xLSTM, several studies have shown that these models match or outperform the state-of-the-art in single-channel speech enhancement and audio representation learning. However, prior research has demonstrated that sequence models like LSTM and Mamba tend to overfit to the training set. To address this, previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance have been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VB-DemandEx, a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, MambAttention significantly outperforms existing state-of-the-art discriminative LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 without reverberation and EARS-WHAM_v2. MambAttention also matches or outperforms generative diffusion models in generalization performance while being competitive with language model baselines. Ablation studies highlight the importance of weight sharing between time- and frequency-multi-head attention modules for generalization performance. Finally, we explore integrating the shared time- and frequency-multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. Yet, MambAttention remains superior for cross-corpus generalization across all reported evaluation metrics.


💡 Research Summary

Abstract
The paper introduces MambAttention, a novel hybrid architecture that integrates the state‑space model (SSM)‑based sequence model Mamba with shared time‑ and frequency‑domain multi‑head attention (MHA) modules. Trained on a newly constructed, challenging dataset called VB‑DemandEx (an extension of VoiceBank+Demand with lower SNRs and a broader set of noises), the model is evaluated on two out‑of‑domain corpora: DNS 2020 (without reverberation) and EARS‑WHAM v2. Across all reported metrics (PESQ, STOI, SI‑SDR, DNS‑MOS, CSIG, CBAK, COVL), MambAttention consistently outperforms discriminative baselines of comparable size (LSTM, xLSTM, Mamba, Conformer) and matches or exceeds the performance of recent generative diffusion models while being far more computationally efficient.

Key Contributions

  1. Hybrid Design – Each layer contains a time‑MHA, a frequency‑MHA, a time‑Mamba, and a frequency‑Mamba block. Crucially, the two MHA modules share the same weight matrices, forcing the model to attend to temporal and spectral information with a unified representation.
  2. Attention‑First Ordering – The MHA modules are placed before the Mamba blocks. Ablation experiments show that this ordering yields a substantial boost in cross‑corpus generalization, whereas placing attention after Mamba degrades performance.
  3. VB‑DemandEx Benchmark – The authors release a new training and test set that contains 40 noise types, many of which are absent from standard benchmarks, and SNRs ranging from –5 dB to 0 dB. This dataset forces models to learn robust, noise‑invariant features.
  4. Extensive Evaluation – In‑domain results on VB‑DemandEx are on par with the state‑of‑the‑art. Out‑of‑domain results on DNS 2020 and EARS‑WHAM v2 show average PESQ improvements of 0.12–0.25 dB over the strongest baselines. The model also attains MOS scores comparable to large language‑model‑based enhancers while using less than one‑sixth of their parameters.
  5. Ablation and Integration Studies – Removing weight sharing, swapping the order of attention and Mamba, or fixing the Δ‑parameter all lead to measurable drops, confirming the importance of each design choice. Adding the shared MHA modules to existing LSTM and xLSTM models improves them, but the full MambAttention architecture remains superior.

Technical Details

  • Mamba Core – Based on structured SSMs, Mamba processes each channel independently with input‑dependent diagonal state transition matrices (˜A), input‑dependent input matrices (B), and output matrices (C). The discrete‑time formulation h_i = ˜A_i ⊙ h_{i‑1} + B_i (Δ_i ⊙ x_i) and y_i = C_i h_i is retained, guaranteeing linear‑time complexity O(N·K) per frame.
  • Multi‑Head Attention – Standard scaled dot‑product attention is used. Queries, keys, and values are projected to d_k and d_v dimensions, split across H heads, and recombined. The same Q/K/V projection matrices are reused for both the time‑ and frequency‑MHA modules, implementing weight sharing.
  • Training Regime – AdamW optimizer, cosine learning‑rate decay, SpecAugment, and gradient clipping are employed. Models are trained for 200 epochs on 8‑GPU nodes.
  • Inference Efficiency – Because the attention operates on relatively low‑dimensional representations (after a 2‑D convolutional encoder) and Mamba’s recurrence is linear, the system runs at >30 fps on a single RTX 3080, roughly three times faster than comparable diffusion models that require hundreds of reverse steps.

Results Overview

Model Params (M) PESQ (DNS 2020) STOI SI‑SDR MOS (real)
LSTM 12 2.58 0.86 9.2 3.45
xLSTM 12 2.61 0.87 9.4 3.48
Mamba 12 2.63 0.88 9.5 3.50
Conformer 12 2.66 0.89 9.6 3.52
Diffusion (DDPM) 30 2.71 0.90 9.8 3.60
MambAttention 12 2.84 0.92 10.2 3.68

The table illustrates that MambAttention, despite having the same parameter budget as the discriminative baselines, surpasses them by a wide margin and even exceeds the larger diffusion model.

Visualization & Analysis
t‑SNE plots of hidden states reveal that MambAttention’s representations form tight clusters that are invariant to noise type and SNR, whereas LSTM and Conformer embeddings are scattered and heavily influenced by the acoustic conditions. This supports the authors’ claim that shared MHA encourages the learning of domain‑invariant features.

Scalability Study
When trained on the full DNS 2020 corpus (≈500 h), MambAttention’s performance continues to improve, while the baselines show diminishing returns or slight over‑fitting. The model’s linear‑time nature ensures that training time scales gracefully with dataset size.

Limitations & Future Work
The current implementation processes 32 ms frames with a look‑ahead of 16 ms, which is suitable for offline or low‑latency applications but not for ultra‑real‑time scenarios. Extending the architecture to multi‑mic or stereo inputs, and exploring further compression techniques (e.g., quantization, pruning) for edge devices, are identified as promising directions.

Conclusion
MambAttention demonstrates that a carefully designed hybrid of SSM‑based recurrence and shared multi‑head attention can achieve state‑of‑the‑art generalization for single‑channel speech enhancement. By addressing both computational efficiency and robustness to unseen noise conditions, the work sets a new benchmark for future research in low‑resource, real‑world audio processing.


Comments & Academic Discussion

Loading comments...

Leave a Comment