Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning
Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.
💡 Research Summary
The paper tackles the problem of domain adaptation for Speech Large Language Models (Speech‑LLMs) in low‑resource scenarios where paired speech‑text data are scarce. Conventional approaches rely on fine‑tuning the entire model with additional domain‑specific audio‑text pairs, which is often infeasible and prone to over‑fitting. The authors propose a novel text‑only fine‑tuning strategy that adapts only the LoRA (Low‑Rank Adaptation) modules inserted in the LLM decoder, using abundant in‑domain text while keeping the speech encoder and projection layers frozen.
The overall training pipeline consists of two stages. In the first stage, a generic Speech‑LLM is pretrained on a large source domain (LibriSpeech, 1,000 h) by jointly optimizing the speech encoder (Whisper‑large‑v3), a two‑layer linear projector, and the LLM decoder (Qwen2.5‑7B‑Instruct). This stage establishes a strong cross‑modal alignment between acoustic features and textual representations. In the second stage, the model is adapted to a target domain (e.g., SlideSpeech, Medical) using only target‑domain text. The LoRA adapters (rank 64, α = 16, dropout 5 %) are updated to minimize the standard language‑modeling loss.
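The LoRA update itself keeps the base weight frozen and learns only a low-rank residual. A minimal pure-Python sketch of the adapted forward pass (tiny dimensions and random values for illustration only — the paper uses rank 64 and α = 16 inside a 7B-parameter decoder):

```python
import random

def lora_forward(x, W, A, B, alpha, r):
    """y = x · (W + (alpha/r) · B·A): frozen base weight W plus a
    low-rank update B·A scaled by alpha/r, as in standard LoRA."""
    d_out, d_in = len(W), len(W[0])
    scale = alpha / r
    y = []
    for i in range(d_out):
        acc = 0.0
        for j in range(d_in):
            # Low-rank delta for entry (i, j): row i of B times column j of A.
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            acc += (W[i][j] + scale * delta) * x[j]
        y.append(acc)
    return y

# Toy example: d_in = d_out = 4, rank r = 2 (the paper: r = 64, alpha = 16).
random.seed(0)
d, r = 4, 2
W = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]  # frozen
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]  # B starts at zero: no change at init
x = [1.0, 0.5, -0.5, 2.0]

# With B = 0 the adapted layer reproduces the frozen base layer exactly.
base = [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]
assert lora_forward(x, W, A, B, alpha=16, r=r) == base
```

Because only A and B (plus the scaling) are trained, the frozen encoder–projector–decoder pathway is untouched between updates, which is what makes the temporary adapter merge in the next stage cheap.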
A key innovation is the “real‑time alignment evaluation” mechanism. After each gradient update on text data, the updated LoRA parameters are temporarily integrated with the frozen encoder, projector, and base LLM, and the model’s speech‑recognition loss is computed on a small held‑out set of paired speech‑text data. This continuous monitoring ensures that the text‑only fine‑tuning does not degrade the previously learned speech‑text alignment. If the alignment loss rises sharply, learning‑rate adjustments or early stopping are triggered.
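The monitoring logic described above can be sketched as a simple training-loop guard. Everything here is illustrative: `train_step` and `eval_alignment_loss` stand in for the real text-only LoRA update and the held-out paired-data ASR evaluation, and the spike threshold, learning-rate halving, and patience values are invented, not taken from the paper:

```python
def adapt_with_alignment_guard(text_batches, eval_alignment_loss, train_step,
                               lr=1e-4, spike_ratio=1.5, patience=2):
    """Text-only adaptation with a real-time alignment check: after every
    update, re-evaluate the speech-recognition loss on a small held-out
    paired set; on a sharp rise, lower the LR and eventually stop early."""
    baseline = eval_alignment_loss()  # alignment loss before adaptation
    strikes = 0
    for step, batch in enumerate(text_batches):
        train_step(batch, lr)               # LoRA update on target-domain text
        align_loss = eval_alignment_loss()  # probe speech-text alignment
        if align_loss > spike_ratio * baseline:
            strikes += 1
            lr *= 0.5                       # learning-rate adjustment
            if strikes > patience:
                return step, "early_stop"   # alignment degraded: stop
        else:
            strikes = 0
    return len(text_batches), "completed"

# Toy run: simulated alignment losses that spike after a few text updates.
losses = iter([1.0, 1.0, 1.05, 1.1, 2.0, 2.5, 3.0, 3.0, 3.0])
result = adapt_with_alignment_guard(
    text_batches=range(8),
    eval_alignment_loss=lambda: next(losses),
    train_step=lambda batch, lr: None,  # no-op stand-in for the LoRA update
)
print(result)  # → (5, 'early_stop')
```

The key design point is that the probe set is tiny and the encoder, projector, and base LLM stay frozen, so each check only costs a forward pass with the current adapters merged in.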
The authors compare three adaptation strategies: (1) pure text‑only fine‑tuning (LoRA only), (2) full speech‑text fine‑tuning (all parameters), and (3) text‑then‑speech fine‑tuning (text‑only followed by speech‑text fine‑tuning). Experiments are conducted on three target domains: SlideSpeech (470 h, online meetings), a Medical corpus (8 h, clinical terminology), and GigaSpeech (10 k h, large‑scale general evaluation). Word Error Rate (WER) is the primary metric.
Results show that text‑only fine‑tuning incurs only a marginal WER increase on the source domain (≈0.2 % absolute) while achieving comparable or better performance on target domains relative to full speech‑text fine‑tuning. Notably, on the low‑resource Medical set, text‑only adaptation lowers WER by 1.2 % relative to the baseline, demonstrating the method’s strength on specialized vocabularies. The real‑time alignment evaluation successfully prevents catastrophic forgetting of the acoustic‑text mapping, as evidenced by stable WER on LibriSpeech and GigaSpeech after adaptation. Moreover, because only the LoRA parameters (≈161 M) are trainable, the approach reduces GPU memory consumption and training time by roughly 65‑70 % compared with full-model fine‑tuning.
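The efficiency claim follows directly from the parameter counts reported above; a quick back-of-the-envelope check (7 B is the nominal Qwen2.5‑7B size, so the exact fraction differs slightly):

```python
lora_params = 161e6   # trainable LoRA parameters reported in the summary
base_params = 7e9     # nominal parameter count of Qwen2.5-7B-Instruct

# Fraction of the decoder that is actually updated during adaptation.
fraction = lora_params / base_params
print(f"Trainable fraction: {fraction:.1%}")  # → Trainable fraction: 2.3%
```

Only a few percent of weights receive gradients, which is what drives the memory and wall-clock savings relative to full fine-tuning.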
The paper’s contributions are threefold: (i) introducing the first text‑only domain adaptation technique for Speech‑LLMs, (ii) proposing a lightweight yet effective real‑time alignment monitoring strategy, and (iii) demonstrating that LoRA‑based lightweight adaptation can achieve competitive performance while drastically lowering computational cost. This work sidesteps the need for synthetic speech generation pipelines, which suffer from limited naturalness and high computational overhead.
Future directions suggested include extending the alignment evaluation to more sophisticated metrics (e.g., CTC‑based alignment scores), integrating multimodal prompts to further enrich domain knowledge, testing the approach on a broader set of LLM backbones and low‑resource languages, and exploring continual learning scenarios where the model can adapt online to evolving domains without sacrificing previously learned capabilities.