Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech
Text-to-speech (TTS) development is limited by the scarcity of high-quality, publicly available speech data for most languages outside a handful of high-resource ones. We present Nord-Parl-TTS, an open TTS dataset for Finnish and Swedish based on speech found in the wild. Using recordings of Nordic parliamentary proceedings, we extract 900 hours of Finnish and 5090 hours of Swedish speech suitable for TTS training. The dataset is built using an adapted version of the Emilia data processing pipeline and includes unified evaluation sets to support model development and benchmarking. By offering open, large-scale data for Finnish and Swedish, Nord-Parl-TTS narrows the resource gap in TTS between high- and lower-resourced languages.
💡 Research Summary
The paper introduces Nord‑Parl‑TTS, a large‑scale, open‑source text‑to‑speech (TTS) corpus for Finnish and Swedish derived from publicly available parliamentary recordings. Recognizing the scarcity of high‑quality, studio‑recorded TTS datasets for these languages—existing resources total only 20–60 hours for Finnish and none for Swedish—the authors adapt the Emilia multilingual speech‑processing pipeline to harvest “in‑the‑wild” audio.
Data acquisition begins with 3 000 hours of Finnish parliamentary sessions (2015–2020) and 23 000 hours of Swedish RixVox‑v2 parliamentary speech. Audio is extracted, resampled to 24 kHz mono, and cleaned with a pretrained UVR‑MDX‑Net source‑separation model. Speaker diarization (Pyannote) and fine‑grained voice‑activity detection (Silero VAD) isolate single‑speaker utterances. For Finnish, the authors found Whisper‑large‑v3 alone insufficient, so they additionally employ a wav2vec2‑large model fine‑tuned on Finnish. Only utterances whose normalized Whisper and wav2vec2 transcripts differ by less than 5 % are retained, leveraging Whisper's punctuation and capitalization while using wav2vec2 as a confidence check. For Swedish, Whisper‑large alone is used with a more permissive word‑error‑rate (WER) threshold of 10 %, since the source timestamps already reflect sentence‑level segmentation. Finally, utterances with DNS‑MOS (P.835 OVRL) scores below 3.0 are excluded, ensuring overall audio quality.
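The dual‑ASR agreement step above can be sketched as a word‑error‑rate comparison between the two transcripts. This is a minimal, pure‑Python illustration, not the authors' pipeline code: the normalization rule and function names are assumptions, and the 5 % threshold is taken from the summary.

```python
import re


def edit_distance(ref, hyp):
    # Classic single-row dynamic-programming Levenshtein distance over tokens.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # d[j] = previous row (deletion), d[j-1] = current row (insertion),
            # prev = previous row diagonal (substitution or match).
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


def normalize(text: str) -> str:
    # Strip punctuation and casing so Whisper's formatting does not count
    # against agreement with the plain wav2vec2 transcript (assumed rule).
    return re.sub(r"[^\w\s]", "", text.lower())


def keep_utterance(whisper_text: str, wav2vec2_text: str,
                   threshold: float = 0.05) -> bool:
    # Retain the utterance only if the two ASR systems agree closely.
    return word_error_rate(normalize(whisper_text),
                           normalize(wav2vec2_text)) <= threshold
```

For Swedish, the same check would be run with a single ASR reference and the looser 10 % threshold described above.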
The resulting corpora contain roughly 900 hours of Finnish speech and 5 090 hours of Swedish speech—orders of magnitude larger than any previously released Finnish or Swedish TTS datasets. To facilitate benchmarking, the authors curate evaluation sets of 500 prompt‑target pairs per language. Finnish prompts are drawn from the Perso‑Synteesi dataset, balanced for gender (250 male, 250 female) and filtered to 3–20 seconds in length with at least ten characters. Swedish prompts are sampled from Common Voice, limited to 30 utterances per speaker, gender‑balanced, and manually vetted by a native speaker for clarity, volume, accent, and background noise.
Two state‑of‑the‑art non‑autoregressive diffusion TTS models are trained on the new data: Matcha‑TTS (which uses explicit monotonic alignment search and a pretrained SimAMResNet speaker encoder) and F5‑TTS‑Base (which relies on implicit alignment via Diffusion Transformers). Matcha‑TTS is trained on a single Nvidia A100 for 500 k updates (batch size 64), while F5‑TTS‑Base follows the original configuration with 1.2 M updates across 24 AMD MI250X GPUs. Finnish models use character input (Finnish orthography is near‑phonemic), whereas Swedish models employ phoneme sequences generated by Phonemizer.
Objective evaluation synthesizes each prompt‑target pair, runs an ASR model (Finnish wav2vec2‑large, Swedish Whisper‑large) on the output, and computes character error rate (CER) and cosine speaker similarity (SIM) using a WavLM‑large speaker verification model. Subjective evaluation collects human judgments via Prolific: Comparative Mean Opinion Score (CMOS) for naturalness and Speaker MOS (SMOS) for perceived speaker identity, with language‑skill checks and confidence intervals reported.
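The two objective metrics reduce to a character‑level edit distance and a cosine over embedding vectors. A minimal sketch, assuming the speaker embeddings (e.g. from WavLM) are plain float vectors; this is not the authors' evaluation code:

```python
import math


def character_error_rate(reference: str, hypothesis: str) -> float:
    # Levenshtein distance at character level, divided by reference length.
    ref, hyp = list(reference), list(hypothesis)
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / max(len(ref), 1)


def cosine_similarity(a, b):
    # Cosine similarity between two speaker embeddings; higher means the
    # synthesized voice is closer to the target speaker.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

In the paper's setup, `reference` would be the input text and `hypothesis` the ASR transcript of the synthesized audio.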
Results show that Matcha‑TTS achieves substantially lower CER (Finnish 2.55 % vs 6.72 %; Swedish 4.66 % vs 13.64 %) and higher speaker similarity for Finnish (0.566 vs 0.538). F5‑TTS‑Base, however, attains higher CMOS (more human‑like) and better SMOS for Swedish (3.41 vs 2.27), indicating that implicit alignment can improve perceived naturalness at the cost of intelligibility. Hallucination rates are low for both models, with only a few utterances filtered out (CER > 100 %).
The authors conclude that large‑scale “in‑the‑wild” parliamentary recordings can be transformed into high‑quality TTS corpora with a carefully designed pipeline. Nord‑Parl‑TTS closes a major resource gap for Finnish and Swedish TTS research, provides standardized evaluation sets, and demonstrates that modern diffusion TTS models can be effectively trained on such data. Future work includes extending the approach to other Nordic languages (Danish, Norwegian), releasing curation tools and visualizations, and expanding benchmark model coverage.