Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR
The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end (E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel speech encoders. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, on par with the top-ranked Track 1 systems, even though it uses only the 1,500 hours of baseline training data rather than their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fully fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at https://github.com/1535176727/MLC-SLM.
💡 Research Summary
The paper tackles the problem of multilingual conversational automatic speech recognition (ASR) within the INTERSPEECH 2025 Multilingual Conversational Speech Language Modeling (MLC‑SLM) Challenge. The authors compare two families of systems that use the same limited training data (1 500 h of multilingual conversational speech covering 11 languages): (1) an end‑to‑end (E2E) Whisper‑Large‑v3 encoder‑decoder model, and (2) a Speech‑LLM architecture that combines two heterogeneous speech encoders—Whisper and mHuBERT—in parallel and feeds their fused representations into a large language model (LLM, Qwen2.5‑7B).
First, they explore two fine‑tuning strategies for Whisper: LoRA (low‑rank adaptation, rank 32, α 64) and full‑parameter fine‑tuning. Starting from a zero‑shot Whisper baseline (WER ≈ 16 % on the development set), LoRA reaches 10.71 % WER on the official evaluation set, while full fine‑tuning yields a marginally better 10.07 % but degrades out‑of‑domain robustness (WER ≈ 13 % on a CommonVoice‑derived OOD set). This confirms that parameter‑efficient adaptation can achieve most of the in‑domain gains without sacrificing generalization.
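The LoRA recipe above (rank 32, α 64, frozen base weights) can be sketched as a minimal PyTorch module. Only the rank and α values come from the paper; the wrapper class, layer sizes, and zero-initialization convention are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: a frozen base linear layer plus a trainable
    low-rank update, scaled by alpha / rank (= 2.0 for the paper's 32/64)."""

    def __init__(self, base: nn.Linear, rank: int = 32, alpha: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # update starts at zero,
        self.scale = alpha / rank              # so initial output == base output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Hypothetical usage: 1280 is the Whisper-large encoder width.
layer = LoRALinear(nn.Linear(1280, 1280))
out = layer(torch.randn(2, 10, 1280))
print(out.shape)  # torch.Size([2, 10, 1280])
```

Because `lora_b` is zero-initialized, training begins exactly at the pretrained model's behavior, which is part of what makes LoRA a gentle, generalization-preserving adaptation.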
Second, the authors design five cross‑attention based fusion mechanisms for the parallel encoders: (i) Direct Feature Concatenation (DFC), (ii) Unidirectional Cross‑Attention Fusion with Residual (Res‑Uni‑CAF), (iii) Bidirectional Cross‑Attention Fusion with Residual (Res‑Bi‑CAF), (iv) Gated Res‑Bi‑CAF (adds learnable sigmoid gates to control each encoder's contribution), and (v) a hybrid Res‑Gated‑Bi‑CAF + DFC that additionally concatenates the raw DFC features.
Third, they evaluate two projector designs that map the fused speech features into the LLM embedding space: a lightweight linear projector (1‑D convolutions + MLP) and a more complex Q‑Former that uses learnable queries to summarize the fused sequence.
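The lightweight linear projector (1-D convolutions + MLP) can be sketched like this. The kernel size, stride, and single-convolution depth are illustrative assumptions; only the overall shape of the design (convolutional downsampling followed by an MLP into the LLM embedding space) comes from the paper. The target width of 3584 is Qwen2.5-7B's hidden size.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Sketch of the lightweight projector: a strided 1-D convolution
    shortens the fused speech sequence, then an MLP maps each frame into
    the LLM embedding space."""

    def __init__(self, in_dim: int, llm_dim: int = 3584):
        super().__init__()
        # stride=2 halves the time axis (one of possibly several such layers)
        self.conv = nn.Conv1d(in_dim, in_dim, kernel_size=3, stride=2, padding=1)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> conv expects (batch, channels, time)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.mlp(x)  # (batch, time // 2, llm_dim)
```

Its simplicity is the point: unlike the Q-Former, it preserves the (downsampled) temporal structure of the speech sequence one-to-one, which the results suggest makes cross-modal alignment easier.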
Training proceeds in two stages rather than the three‑stage pipeline used previously: (1) each speech encoder is fine‑tuned independently (Whisper with LoRA or full fine‑tuning, mHuBERT with CTC), and (2) the projector is first trained alone, then jointly optimized with the LLM (LLM parameters adapted via LoRA) while keeping the speech encoders frozen.
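The freeze/unfreeze schedule for the second part of the pipeline can be sketched as a small helper. The submodule names (`projector`, `llm`) and the convention of identifying LoRA adapters by parameter name are hypothetical, for illustration only.

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Sketch of the two-phase schedule for the projector/LLM training:
    speech encoders stay frozen throughout; phase 1 trains only the
    projector, phase 2 additionally unfreezes the LLM's LoRA adapters."""
    for p in model.parameters():
        p.requires_grad = False               # freeze everything, incl. encoders
    for p in model.projector.parameters():
        p.requires_grad = True                # projector trains in both phases
    if stage == 2:
        for name, p in model.llm.named_parameters():
            if "lora" in name:                # only LoRA adapters, not the base LLM
                p.requires_grad = True
```

Keeping the encoders frozen here is what makes Stage 1 of the overall recipe (independent encoder fine-tuning) cleanly separable from the projector/LLM optimization.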
Results on the official development (Dev), evaluation (Eval), and out‑of‑domain (CV‑Test) sets reveal several key findings:
- Whisper fine‑tuning – LoRA yields a 0.7 % absolute WER reduction on Eval; full fine‑tuning gives an extra 0.6 % but harms OOD performance.
- Projector choice – The simple linear projector consistently outperforms the Q‑Former across all sets (e.g., Dev 11.91 % vs 12.52 %). Simplicity translates into better alignment and robustness.
- Fusion mechanisms – In Stage 1 (projector‑only training), gated bidirectional cross‑attention (Res‑Gated‑Bi‑CAF) achieves the lowest WER (≈ 10.77 % on Dev). However, after Stage 2 (LLM LoRA + projector joint training), all fusion variants converge to a narrow band (10.69 %–10.90 % on Eval), indicating that once the encoders are well adapted, the exact fusion strategy matters less.
- Overall Speech‑LLM performance – The best Speech‑LLM configuration (LoRA‑fine‑tuned Whisper + fully fine‑tuned mHuBERT + Res‑Gated‑Bi‑CAF + linear projector) attains 10.69 % WER on Eval, matching top‑ranked challenge submissions that relied on massive external corpora.
Nevertheless, the authors acknowledge a residual gap: the fully fine‑tuned Whisper E2E model still scores 10.07 % WER on Eval, about 0.6 % better than the best Speech‑LLM system trained on the same data. This suggests that current Speech‑LLM pipelines lose some information during the speech‑to‑LLM interface, perhaps due to imperfect cross‑modal alignment or limited capacity of the LLM to absorb raw acoustic nuances.
The paper concludes with several actionable insights for future research: (1) prioritize parameter‑efficient Whisper adaptation (LoRA) for a good trade‑off between in‑domain accuracy and OOD robustness; (2) employ gated bidirectional cross‑attention when fusing heterogeneous encoders, but recognize that its advantage diminishes after joint LLM training; (3) retain simple linear projectors for stability and efficiency; (4) explore richer LLM adapters or multimodal alignment layers to close the remaining performance gap; and (5) consider scaling the number of attention heads or introducing hierarchical fusion to better exploit complementary encoder information.
Overall, the work provides a thorough empirical comparison between Speech‑LLM and E2E Whisper architectures, demonstrates that a carefully fine‑tuned Speech‑LLM can rival state‑of‑the‑art systems with modest data, and highlights the remaining challenges that must be addressed to surpass pure E2E models.