Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
Flow-matching-based text-to-speech (TTS) models have demonstrated high-quality speech synthesis. However, most current flow-matching-based TTS models still rely on a reference transcript of the audio prompt for synthesis. This dependency prevents cross-lingual voice cloning when audio prompt transcripts are unavailable, particularly for unseen languages. The key challenges in removing the audio prompt transcript from flow-matching-based TTS are identifying word boundaries during training and determining an appropriate duration during inference. In this paper, we introduce Cross-Lingual F5-TTS, a framework that enables cross-lingual voice cloning without audio prompt transcripts. Our method preprocesses audio prompts with forced alignment to obtain word boundaries, enabling direct synthesis from audio prompts while excluding their transcripts during training. To address the duration modeling challenge, we train speaking-rate predictors at different linguistic granularities to derive duration from the speaker's pace. Experiments show that our approach matches the performance of F5-TTS while enabling cross-lingual voice cloning.
💡 Research Summary
The paper introduces Cross‑Lingual F5‑TTS, a novel framework that enables zero‑shot, cross‑lingual voice cloning without requiring transcripts for the audio prompt. Existing flow‑matching TTS models such as F5‑TTS rely on the reference transcript to determine word boundaries and to estimate the duration of the target utterance. This dependency blocks true cross‑lingual cloning, especially for languages lacking reliable transcriptions.
To remove the transcript requirement, the authors first apply the Massively Multilingual Speech (MMS) forced‑alignment toolkit to the training corpus, which provides word‑level timestamps for each utterance. During training, a random word boundary is selected; the audio segment before the boundary becomes the transcript‑free prompt, while the segment after the boundary is masked and used as the reconstruction target. The model therefore learns to generate speech conditioned only on an audio prompt and the target text, without ever seeing the prompt's transcript.
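The paper's preprocessing code is not reproduced here; the following is a minimal sketch of the boundary-based split, assuming forced alignment yields (word, start, end) tuples and that the features are mel-spectrogram frames. All function and parameter names (e.g., `split_on_word_boundary`, `hop_seconds`) are illustrative, not taken from the authors' implementation.

```python
import random

def split_on_word_boundary(mel, alignment, hop_seconds):
    """Split an utterance at a random word boundary into a
    transcript-free audio prompt and a masked reconstruction target.

    mel:         mel-spectrogram frames, shape (T, n_mels)
    alignment:   list of (word, start_sec, end_sec) from forced alignment
    hop_seconds: duration of one mel frame in seconds
    """
    # Word end times are candidate boundaries; exclude the last word's
    # end so at least one word remains on the target side.
    boundaries = [end for _, _, end in alignment[:-1]]
    boundary_sec = random.choice(boundaries)
    boundary_frame = int(boundary_sec / hop_seconds)

    prompt = mel[:boundary_frame]   # conditioned on as raw audio only
    target = mel[boundary_frame:]   # masked; the model learns to infill it
    target_text = " ".join(word for word, start_sec, _ in alignment
                           if start_sec >= boundary_sec)
    return prompt, target, target_text
```

Note that no transcript of the prompt segment is ever exposed to the model; only the target text (the words after the boundary) enters the conditioning.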
The second challenge is duration prediction. In the original F5‑TTS, duration is derived from a simple length ratio between the prompt transcript and the target text, which fails when the prompt and target languages differ. The authors instead propose speaking‑rate predictors that infer the speaker's pace directly from the acoustic features of the prompt. Three separate predictors are trained for phonemes per second, syllables per second, and words per second. Each predictor treats rate estimation as a discrete classification problem over uniformly spaced bins (72 bins for phonemes, 32 for syllables and words). A Gaussian Cross‑Entropy loss is introduced to respect the ordinal nature of the bins, assigning higher weight to bins near the ground‑truth bin. At inference time, the predicted speaking rate is combined with a count of the linguistic units in the target text to compute the target duration (duration = unit count / predicted rate).
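The summary does not spell out the exact loss; a common way to implement a Gaussian Cross‑Entropy over ordinal bins is to replace the one‑hot target with a Gaussian centered on the ground‑truth bin index. A hedged PyTorch sketch, where the function names and the sigma value are assumptions:

```python
import torch
import torch.nn.functional as F

def gaussian_ce_loss(logits, target_bins, sigma=1.0):
    """Cross-entropy against a Gaussian-smoothed target, so that bins
    close to the ground-truth bin are penalized less than distant ones.

    logits:      (batch, n_bins) raw outputs of a rate predictor
    target_bins: (batch,) index of the ground-truth rate bin
    sigma:       width of the Gaussian over bin indices (assumed value)
    """
    n_bins = logits.size(-1)
    bin_idx = torch.arange(n_bins, device=logits.device, dtype=torch.float)
    # Soft target: Gaussian centered on the true bin, normalized per row.
    soft = torch.exp(-0.5 * ((bin_idx - target_bins.unsqueeze(1).float()) / sigma) ** 2)
    soft = soft / soft.sum(dim=-1, keepdim=True)
    return -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def target_duration_seconds(unit_count, predicted_rate):
    """duration = linguistic units in the target text / predicted units per second."""
    return unit_count / predicted_rate
```

For instance, an English target containing 12 syllables at a predicted rate of 4 syllables per second yields an estimated duration of 3 seconds.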
Experiments are conducted on the large‑scale Emilia multilingual dataset (≈95k hours of English and Chinese speech). The speaking‑rate predictors are evaluated on LibriSpeech‑PC test‑clean and the Seed‑TTS test sets (English and Chinese) using Mean Absolute Error (MAE) and Mean Relative Error (MRE). The phoneme‑level predictor (M1) achieves the lowest MAE/MRE for English, while the syllable‑level predictor (M2) performs best for Chinese; the word‑level predictor (M3) consistently lags behind.
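For reference, a conventional computation of the two metrics over predicted versus ground‑truth rates is shown below; normalizing MRE by the ground‑truth rate is an assumption, as the summary does not define the metric explicitly.

```python
import numpy as np

def mae(pred, true):
    """Mean Absolute Error between predicted and ground-truth rates."""
    return float(np.mean(np.abs(pred - true)))

def mre(pred, true):
    """Mean Relative Error, normalized by the ground-truth rate (assumed)."""
    return float(np.mean(np.abs(pred - true) / true))
```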
Intra‑lingual voice cloning results show that Cross‑Lingual F5‑TTS (CL‑F5) matches or slightly exceeds the baseline F5‑TTS on standard metrics: WER, speaker similarity (SIM‑o), and UTMOS. For example, on LibriSpeech‑PC, CL‑F5 with M1 obtains a WER of 2.08 % versus 2.20 % for the baseline, while maintaining comparable speaker similarity.
Cross‑lingual experiments use a multilingual test set derived from FLEURS, covering German, French, Hindi, and Korean prompts to synthesize English and Chinese speech. CL‑F5 with the appropriate fine‑grained predictor (M1 for English targets, M2 for Chinese targets) achieves low WERs (≈2.5 % for English, ≈2.8 % for Chinese) and reasonable speaker similarity, whereas the coarse word‑level predictor leads to severe degradation (WER up to 16 %).
The study demonstrates that (1) forced alignment can replace transcripts for learning word boundaries, (2) speaking‑rate prediction at a fine linguistic granularity provides a language‑agnostic duration model, and (3) flow‑matching TTS can be extended to true cross‑lingual voice cloning without sacrificing synthesis quality. The Gaussian Cross‑Entropy loss is highlighted as an effective way to handle ordered classification tasks. The authors suggest future work on finer phoneme‑level alignment, handling low‑resource or dialectal languages, and moving toward fully unsupervised training pipelines.