TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation

Notice: This research summary and analysis were generated automatically using AI technology. For authoritative details, please refer to the original arXiv source.

Speech-LLM models have demonstrated great performance in multi-modal and multi-task speech understanding. A typical speech-LLM paradigm integrates the speech modality with a large language model (LLM). While the Whisper encoder was frequently adopted in previous studies for speech input, it shows limitations regarding input format, model scale, and semantic performance. To this end, we propose a lightweight TTA model specialized in speech semantics for more effective LLM integration. With large-scale training on 358k hours of speech data across multilingual speech recognition (ASR), speech translation (ST), and speech-text alignment tasks, TTA is capable of producing robust cross-lingual speech representations. Extensive evaluations across diverse benchmarks, including ASR/ST, speech retrieval, and ASR-LLM performance assessments, demonstrate TTA’s superiority over Whisper. Furthermore, we rigorously validate the interplay between cross-lingual capabilities and ASR/ST performance. The model weights and training recipes of TTA will be released as part of an audio understanding toolkit, Auden.


💡 Research Summary

This paper addresses the limitations of the Whisper encoder—fixed 30‑second input window, large model size, and weak cross‑lingual semantic capability—when integrating speech models with large language models (LLMs). The authors propose a lightweight speech‑semantic foundation model called TTA (Transcribe, Translate, and Alignment) that is under 250 M parameters and is specifically designed for efficient LLM integration.

Model Architecture
TTA adopts a hybrid Zipformer‑Transducer (ZT) and attention‑based encoder‑decoder (AED) architecture. The Zipformer encoder, a memory‑efficient variant of Conformer, processes 80‑dim log‑mel filter‑bank features into high‑level representations H. Three parallel branches operate on H: (1) a Transducer branch for streaming ASR, (2) an AED branch for non‑streaming transcription or translation, and (3) a speech‑text alignment branch. The alignment branch uses a frozen multilingual BERT (bert‑base‑multilingual‑uncased) to obtain text embeddings T from the AED output Ỹ. H is linearly projected, average‑pooled, and then trained with a SigLIP contrastive loss against T, encouraging language‑agnostic, semantically aligned embeddings. A small weight (0.1) balances this loss with the standard Transducer loss.
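The alignment branch described above can be sketched as follows. This is a minimal NumPy illustration of the "project, average-pool, SigLIP contrastive loss" pipeline, not the authors' implementation; the `scale` and `bias` values are illustrative assumptions (SigLIP-style losses typically learn them), and the helper names are hypothetical.

```python
import numpy as np

def pool_and_project(H, W):
    """Project encoder states H (T, d_enc) with W (d_enc, d_text),
    average-pool over time, and L2-normalize the result."""
    e = (H @ W).mean(axis=0)
    return e / np.linalg.norm(e)

def siglip_loss(speech_emb, text_emb, scale=10.0, bias=-10.0):
    """SigLIP-style pairwise sigmoid contrastive loss.

    speech_emb, text_emb: (B, d) L2-normalized embeddings; row i of each
    forms a positive pair, all cross pairs are negatives. `scale`/`bias`
    are illustrative, not the paper's values."""
    logits = scale * speech_emb @ text_emb.T + bias   # (B, B) pair logits
    labels = 2.0 * np.eye(len(logits)) - 1.0          # +1 diagonal, -1 elsewhere
    # -log sigmoid(label * logit), averaged over all B*B pairs
    return np.logaddexp(0.0, -labels * logits).mean()

# Per the summary, the total objective weights alignment at 0.1:
#   total = transducer_loss + 0.1 * siglip_loss(S, T)
```

As a sanity check, matched speech/text embeddings should score a lower loss than a mismatched (shuffled) pairing.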

Training Data and Procedure
The authors compile 358 k hours of multilingual ASR data covering ten languages (zh, en, ja, ko, ru, vi, id, fr, es, pt) from both public corpora (Aishell, LibriSpeech, MLS, VoxPopuli, etc.) and in‑house sources. Quality control is performed using Whisper Large‑v3 for language‑label verification and a WER threshold of 10‑20 %. For speech translation (ST), they use supervised X→EN pairs from CoVoSTv2 and Europarl‑ST, supplemented by LLM‑generated synthetic translations of the ASR corpus, yielding ~217 k hours of ST data. The data are mixed with a 3:2 ASR:ST ratio.
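The WER-based filtering step can be sketched as below. The function names are hypothetical and the 15 % default is only an illustrative midpoint of the 10–20 % band mentioned above; the paper does not publish its filtering code.

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over word tokens,
    normalized by reference length (single-row DP)."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (rw != hw)) # substitution/match
    return d[-1] / max(len(r), 1)

def keep_utterance(ref, whisper_hyp, threshold=0.15):
    """Keep a training utterance when the transcript agrees with the
    Whisper Large-v3 hypothesis to within the WER threshold."""
    return wer(ref, whisper_hyp) <= threshold
```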

Training proceeds in three stages: (1) a pure ZT model on ASR data for 250 k steps; (2) initialization of ZT‑AED and ZT‑Align from the ZT checkpoint, continued on ASR data for 200 k steps with a reduced learning rate; (3) joint ASR‑ST training for 500 k steps, gradually lowering the temperature parameter t from 1.0 to 0.2 to mitigate language imbalance. The optimizer is Scaled Adam with a peak LR of 0.035 and an Eden scheduler; dynamic bucketing handles variable‑length inputs.
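The temperature-based rebalancing can be sketched as follows, assuming the common convention p_l ∝ (n_l / N)^t: t = 1.0 samples languages proportionally to their data volume, and lowering t toward 0.2 flattens the distribution, up-weighting low-resource languages. The exact convention used in the paper is not spelled out in the summary.

```python
import numpy as np

def language_sampling_probs(hours_per_lang, t):
    """Temperature-scaled sampling distribution over languages.

    hours_per_lang: dict lang -> hours of data; t: temperature in (0, 1].
    t=1.0 reproduces proportional sampling; smaller t flattens it."""
    h = np.asarray(list(hours_per_lang.values()), dtype=float)
    p = (h / h.sum()) ** t
    p /= p.sum()
    return dict(zip(hours_per_lang, p))
```

For a 100:1 data imbalance, dropping t from 1.0 to 0.2 raises the low-resource language's share from about 1 % to roughly 28 %.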

Evaluation
The paper evaluates TTA on several fronts:

  1. Multilingual ASR – On CommonVoice, MLS, VoxPopuli, and other benchmarks, TTA achieves lower word error rates (WER) than Whisper Medium and approaches Whisper Large‑v3 despite having far fewer parameters. For example, on CommonVoice the average WER drops from 8.30 % (Whisper Large‑v3) to 6.70 % (TTA). Zero‑shot performance on Fleurs is slightly behind Whisper Large but still better than Whisper Medium.

  2. Speech Translation – On CoVoSTv2, TTA reaches a BLEU score of 35.12, surpassing Whisper Medium (≈34.7) while trailing Whisper Large‑v3 (≈38.8). A larger‑capacity variant of TTA (double hidden dimension) shows a notable BLEU increase, confirming that translation performance is primarily model‑size limited.

  3. Language Identification (LID) – TTA attains 100 % accuracy on all ten languages in the Fleurs test set, outperforming Whisper Large‑v3 which drops to 81 % on Indonesian.

  4. Speech‑Text Alignment Ablation – Models without the alignment module (ZT‑AED) perform worse on ST validation loss than those with alignment (ZT‑Align, TTA), indicating that the contrastive loss provides a stronger multilingual semantic anchor.

  5. Cross‑Lingual Speech Retrieval – Using 500 semantically paired utterances across the ten languages, cosine‑similarity retrieval shows that TTA exceeds Whisper Large‑v2, achieving the highest accuracy among all tested systems. Retrieval is strongest among Indo‑European language pairs, reflecting linguistic proximity.

  6. Interaction Between Cross‑Lingual Ability and ASR/ST – Adding the alignment loss yields a modest ASR degradation (<0.1 % WER) but improves ST BLEU by ~0.6 points. The authors argue that enforcing stronger language‑agnostic representations benefits translation and retrieval but may slightly conflict with the fine‑grained phonetic discrimination needed for optimal ASR.
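The cross-lingual retrieval protocol in point 5 can be sketched as below — a minimal cosine-similarity nearest-neighbour search over pooled embeddings, assuming query and gallery rows are semantically paired by index. This is an illustration, not the authors' evaluation code.

```python
import numpy as np

def retrieve(query_embs, gallery_embs):
    """Return, for each query row, the index of the most cosine-similar
    gallery row. Inputs are (N, d) embedding matrices."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argmax(q @ g.T, axis=1)

def retrieval_accuracy(query_embs, gallery_embs):
    """Fraction of queries whose top-1 match is their paired utterance
    (pairing assumed to be row i <-> row i)."""
    pred = retrieve(query_embs, gallery_embs)
    return float(np.mean(pred == np.arange(len(pred))))
```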

LLM Integration
Although not the primary focus, the authors report that plugging TTA’s encoder into a Qwen‑7B LLM yields better downstream generation quality and consistency compared to using Whisper encoders, highlighting the practical advantage of the richer semantic embeddings produced by TTA.
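One common way to wire a speech encoder into an LLM is to project encoder outputs to the LLM's embedding width and prefix them to the prompt token embeddings. The sketch below shows only that generic pattern; the summary does not specify TTA's actual connector, so all shapes and names here are assumptions.

```python
import numpy as np

def build_llm_inputs(speech_feats, proj, text_embs):
    """Prefix projected speech features before text token embeddings.

    speech_feats: (T, d_enc) encoder outputs; proj: (d_enc, d_llm)
    hypothetical linear adapter; text_embs: (L, d_llm) prompt embeddings.
    Returns a (T + L, d_llm) input sequence for the LLM."""
    speech_tokens = speech_feats @ proj               # (T, d_llm)
    return np.concatenate([speech_tokens, text_embs], axis=0)
```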

Contributions and Impact
The paper’s key contributions are:

  • A sub‑250 M parameter model that outperforms Whisper Medium on multilingual ASR, ST, and retrieval tasks.
  • A novel hybrid ZT‑AED architecture that combines streaming and non‑streaming capabilities.
  • A contrastive speech‑text alignment loss that explicitly aligns speech embeddings with multilingual BERT, substantially improving cross‑lingual semantic representation.
  • Empirical analysis of how cross‑lingual alignment interacts with ASR and ST performance, providing guidance for future multi‑task speech foundation models.
  • Release of model weights and training recipes as part of the Auden audio‑understanding toolkit, facilitating reproducibility and community adoption.

In summary, TTA demonstrates that a carefully designed lightweight architecture, large‑scale multilingual joint training, and explicit cross‑modal alignment can together deliver strong cross‑lingual speech representations suitable for integration with modern LLMs, offering a compelling alternative to the heavyweight Whisper‑based pipelines currently dominant in speech‑LLM research.

