KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization
This paper presents KIT’s submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
💡 Research Summary
This paper describes the Karlsruhe Institute of Technology (KIT) submissions to the IWSLT 2025 low‑resource speech translation track, covering three language pairs: Bemba → English, North Levantine Arabic → English, and Tunisian Arabic → English. The authors explore both traditional cascaded pipelines (ASR + MT) and end‑to‑end (E2E) speech translation (ST) models, leveraging a wide range of publicly available resources and recent multilingual pre‑trained models under the unconstrained condition.
Two synthetic data generation strategies are investigated to compensate for the scarcity of parallel ST data. The first, "MT‑augmented ST," uses a trained MT model to translate the transcripts of ASR corpora, thereby creating synthetic (speech, translation) pairs. The second, "TTS‑augmented ST," employs text‑to‑speech (TTS) models to synthesize speech from MT corpora, turning (source text, translation) pairs into (synthetic speech, translation) pairs. For TTS, the authors train two state‑of‑the‑art non‑autoregressive models, E2 TTS and VITS, fine‑tuning them on the limited available data and using classifier‑free guidance and random speaker prompts to increase acoustic diversity. Experiments show that synthetic data improves both ASR and ST performance for Bemba, and that for North Levantine Arabic, where no real ST training data exist, a model trained solely on synthetic data slightly outperforms the cascaded system trained on real ASR and MT data.
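The MT-augmented ST recipe can be sketched in a few lines. The sketch below is illustrative only: `mt_augment` and the stub translator are hypothetical names, standing in for the paper's trained MT model; the point is simply that the audio is kept while the transcript is replaced by its machine translation.

```python
# Sketch of MT-augmented ST data generation (hypothetical helper names).
# An ASR corpus of (audio, transcript) pairs becomes a synthetic ST corpus
# of (audio, translation) pairs by translating each transcript with a
# trained MT model.

from typing import Callable, List, Tuple

def mt_augment(
    asr_corpus: List[Tuple[str, str]],   # (audio_path, transcript)
    translate: Callable[[str], str],     # trained MT model: source -> English
) -> List[Tuple[str, str]]:
    """Keep the audio, replace the transcript with its machine translation."""
    return [(audio, translate(text)) for audio, text in asr_corpus]

if __name__ == "__main__":
    # Toy usage with a dictionary stub standing in for the real MT system.
    toy_asr = [("utt1.wav", "mwapoleni mukwai"), ("utt2.wav", "natotela")]
    stub_mt = {"mwapoleni mukwai": "greetings", "natotela": "thank you"}.get
    st_data = mt_augment(toy_asr, lambda s: stub_mt(s, "<unk>"))
    print(st_data)  # synthetic (audio, translation) pairs
```

In practice the translator would be the fine-tuned MT model (e.g. NLLB) decoding each transcript, but the data-flow is exactly this pairing step.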
Model regularization is addressed through intra‑distillation (ID). Building on prior work that applied ID to low‑resource MT, the authors extend it to ASR, MT, and ST. Rather than training with a single joint loss, they adopt a two‑stage fine‑tuning recipe: (1) vanilla adaptation of the pre‑trained model to the downstream task, followed by (2) ID fine‑tuning, in which the model is regularized by minimizing the divergence between the outputs of multiple stochastic forward passes, encouraging all parameters to contribute. This approach yields consistent gains: reductions of about 1% absolute WER for ASR, and BLEU improvements of 0.5–1.0 points for MT and ST across all language pairs. The benefit is most pronounced for the SeamlessM4T‑large(v2) model, which, despite not having seen Bemba during pre‑training, achieves strong performance after ID.
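A minimal sketch of an ID-style objective, under the assumption (from the intra-distillation literature) that the network is run twice with different random dropout masks and the divergence between the two predictive distributions is added to the task loss. All names here are illustrative, not the authors' implementation:

```python
# Toy intra-distillation-style loss: task loss + symmetric KL between two
# stochastic forward passes of the same model. Pure-Python sketch with
# hypothetical helper names; a real system would use framework tensors.

import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def id_loss(task_loss, p1, p2, alpha=1.0):
    """Task loss plus a symmetric-KL penalty between the distributions p1, p2
    produced by two forward passes with different dropped parameters."""
    return task_loss + alpha * 0.5 * (kl(p1, p2) + kl(p2, p1))

def dropout_pass(weights, features, rate=0.3, rng=random):
    # Toy "network": a linear layer with inverted dropout, then softmax.
    logits = [sum(w * f for w, f in zip(row, features)
                  if rng.random() >= rate) / (1.0 - rate)
              for row in weights]
    return softmax(logits)
```

Running `dropout_pass` twice and feeding both outputs to `id_loss` mirrors the idea: when the two passes agree perfectly the penalty vanishes and only the task loss remains.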
The study evaluates several large multilingual pre‑trained models: SeamlessM4T‑large(v2) (multilingual speech‑text‑translation), NLLB‑1.3B (multilingual MT, covering Arabic dialects but not Bemba), MMS (multilingual CTC‑based ASR), and XEUS (multilingual encoder‑based ASR with dereverberation). Language‑model shallow fusion further improves MMS and XEUS ASR results, cutting WER by roughly four points. For Arabic dialects, a two‑step fine‑tuning—first on all available data including Modern Standard Arabic, then on dialect‑specific data—helps mitigate domain mismatch.
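The shallow-fusion idea reduces to interpolating acoustic-model and language-model scores in log space. The sketch below shows it as n-best rescoring with a hypothetical stub LM (real systems typically fuse inside beam search, and the fusion weight is tuned on development data):

```python
# Sketch of LM shallow fusion as n-best rescoring: fused score is
# log P_asr(y|x) + lam * log P_lm(y). All scores here are made-up toy values.

def shallow_fusion_rescore(hyps, lm_logprob, lam=0.3):
    """hyps: list of (text, asr_logprob) pairs from the ASR n-best list.
    Returns the text with the highest fused score."""
    return max(hyps, key=lambda h: h[1] + lam * lm_logprob(h[0]))[0]

if __name__ == "__main__":
    nbest = [("recognize speech", -1.0), ("wreck a nice beach", -0.9)]
    stub_lm = {"recognize speech": -2.0, "wreck a nice beach": -8.0}
    # The LM prefers the fluent hypothesis and flips the ASR-only ranking.
    print(shallow_fusion_rescore(nbest, stub_lm.get, lam=0.3))
```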
Results are reported in Table 3. For Bemba ASR, the best configuration (MMS + LM + ID) reaches 9.1% WER on the test set. MT experiments show that fine‑tuning on the development data alone sometimes outperforms using all resources, especially for Bemba, where the development set is more representative. In ST, both the cascaded and E2E systems benefit from synthetic data, achieving BLEU scores above 30. The final step combines the 50‑best hypotheses from the cascaded and E2E systems using Minimum Bayes Risk (MBR) decoding with BLEU as the utility function, yielding an average gain of about 1.5 BLEU points over the best single system.
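The MBR combination step selects, from the pooled hypothesis list, the candidate with the highest expected utility against all the others. A minimal sketch of this selection, using a toy unigram-overlap F1 in place of the BLEU utility the paper uses (function names are illustrative):

```python
# Sketch of MBR selection over a pooled n-best list. The paper pools the
# 50-best lists of the cascaded and E2E systems and uses BLEU as the
# utility; here a toy unigram-overlap F1 stands in for BLEU.

def unigram_f1(hyp, ref):
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    p, rec = overlap / len(h), overlap / len(r)
    return 2 * p * rec / (p + rec)

def mbr_select(hypotheses, utility=unigram_f1):
    """Pick the hypothesis maximizing total utility against the others,
    i.e. each other hypothesis acts as a uniformly weighted pseudo-reference."""
    def expected_utility(h):
        return sum(utility(h, r) for r in hypotheses if r is not h)
    return max(hypotheses, key=expected_utility)
```

The "consensus" hypothesis wins: a candidate that overlaps well with many others scores higher than an outlier, which is how MBR exploits the complementary errors of the cascaded and E2E systems.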
Overall, the paper demonstrates three key findings: (1) high‑quality synthetic data—whether generated via MT or TTS—can substantially close the gap caused by limited parallel ST resources; (2) intra‑distillation is a generic regularization technique that consistently boosts performance across ASR, MT, and ST, regardless of the underlying pre‑trained architecture; (3) system combination through MBR decoding effectively leverages complementary strengths of cascaded and end‑to‑end models. These insights provide a practical roadmap for building robust speech translation systems in truly low‑resource scenarios and set a strong baseline for future IWSLT challenges.