Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track

Reading time: 5 minutes

📝 Original Info

  • Title: Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track
  • ArXiv ID: 2512.17293
  • Date: 2025-12-19
  • Authors: Not specified in the paper (submitting team: Team T02)

📝 Abstract

This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model Supertonic (https://github.com/supertone-inc/supertonic) with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text-speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM.


📄 Full Content

Text-to-speech (TTS) research has traditionally relied on clean, high-fidelity studio recordings and carefully curated datasets. Although such data enable stable text-speech alignment and high naturalness, they limit the scalability and accessibility of TTS development. In contrast, in-the-wild speech, characterized by background noise, reverberation, device variability, and labeling inconsistencies, offers a more realistic but substantially more challenging training regime for robust TTS.

The WildSpoof Challenge 2026 [1] provides a benchmark for evaluating TTS systems trained under such unconstrained conditions, using large-scale in-the-wild speech data collected from diverse speakers and recording environments. Systems are evaluated on intelligibility, measured by automatic speech recognition (ASR) word error rate (WER); perceptual quality, assessed with UTMOS and DNSMOS; and faithfulness, measured by speaker similarity (SPKsim) and Mel Cepstral Distance (MCD). Building a model that performs reliably in this setting requires handling label noise, unpredictable duration variation, and degraded alignment signals.
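For concreteness, the snippet below shows how the intelligibility metrics can be computed with the open-source jiwer package; this is a minimal illustrative sketch, not the challenge's official scoring pipeline.

```python
# Minimal WER/CER sketch using the jiwer package (an illustrative
# choice; the challenge uses its own official scoring scripts).
import jiwer

reference = "robust text to speech in the wild"
hypothesis = "robust text to speech in the wind"

wer = jiwer.wer(reference, hypothesis)  # word error rate
cer = jiwer.cer(reference, hypothesis)  # character error rate
print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```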

To address these challenges, we build upon Supertonic [2], a lightweight TTS architecture composed of a speech autoencoder for continuous latent representations, a flow matching text-to-latent generator, and an utterance-level duration predictor. With its compact latent space and cross-attention modules, this architecture provides a strong foundation for adaptation to noisy environments. However, the raw in-the-wild data from the challenge include mislabeled samples and misaligned text-speech pairs, issues that conventional flow matching pipelines do not handle well.
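To make the three-stage design concrete, here is a hypothetical sketch of the inference pipeline; all component names and signatures are our own assumptions for illustration, not the actual Supertonic API.

```python
# Hypothetical sketch of a Supertonic-style inference pipeline
# (autoencoder + flow matching generator + duration predictor).
# Every name and shape here is an assumption, not the real API.
import torch

@torch.no_grad()
def synthesize(text_tokens, ref_audio, encoder, duration_predictor,
               flow_generator, decoder, num_steps=32):
    # 1) Encode a reference utterance into the continuous latent space.
    ref_latent = encoder(ref_audio)
    # 2) Predict an utterance-level duration to size the latent sequence.
    n_frames = duration_predictor(text_tokens, ref_latent)
    # 3) Integrate the learned velocity field from noise to a latent;
    #    text conditioning is applied via cross-attention inside the model.
    x = torch.randn(1, n_frames, ref_latent.shape[-1])
    for i in range(num_steps):
        t = torch.full((1,), i / num_steps)
        v = flow_generator(x, t, text_tokens, ref_latent)
        x = x + v / num_steps  # Euler step of dx/dt = v
    # 4) Decode the latent sequence back to a waveform.
    return decoder(x)
```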


We therefore adopt Self-Purifying Flow Matching (SPFM) [3], a training-time data selection mechanism for conditional flow matching models. SPFM leverages the model's own conditional and unconditional objectives to detect unreliable labels on the fly and route them to unconditional training. We fine-tune this SPFM-augmented Supertonic system on the challenge-provided datasets. Despite the compact architecture and the difficulty of the dataset, our submission achieves:

• Best WER among all participating teams, demonstrating strong linguistic consistency and alignment.

• Second-highest UTMOS/DNSMOS, showing strong perceptual quality despite the noisy training domain.

These results suggest that combining Supertonic's flow matching training with SPFM provides an efficient and effective solution for robust TTS in real-world noisy conditions.

We start from the publicly available English Supertonic checkpoint and adapt it to the WildSpoof in-the-wild domain. For fine-tuning, we use the two subsets released by the challenge, TITW-easy and TITW-hard, and construct each training batch with a 1:1 sampling ratio between the two sets to balance relatively clean and noisy conditions. In total, the model is fine-tuned for 10,000 iterations with batch size 32. Training is performed on four NVIDIA A100 GPUs.
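A minimal sketch of the 1:1 batch construction follows; the dataset objects and sampling helper are illustrative assumptions, not our actual data loader.

```python
# Sketch of 1:1 sampling between TITW-easy and TITW-hard per batch.
# The subsets are assumed to be indexable sequences of training pairs.
import random

def mixed_batch(titw_easy, titw_hard, batch_size=32):
    """Draw half the batch from each subset, then shuffle."""
    half = batch_size // 2
    batch = [random.choice(titw_easy) for _ in range(half)]
    batch += [random.choice(titw_hard) for _ in range(half)]
    random.shuffle(batch)
    return batch
```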

During fine-tuning, we apply SPFM [3] to mitigate the substantial annotation noise present in in-the-wild data. SPFM operates within the classifier-free guidance framework of conditional flow matching.

For each text-speech pair $(x_1, c)$, we first sample a source $x_0$ from a standard normal distribution and an interpolation time $t'$, and compute the interpolated sample

$$x_{t'} = (1 - t')\,x_0 + t'\,x_1.$$

We then evaluate two flow matching losses at the same interpolation point: a conditional loss

$$\mathcal{L}_{\text{cond}} = \left\| v_\theta(x_{t'}, t', c) - (x_1 - x_0) \right\|^2,$$

and an unconditional loss

$$\mathcal{L}_{\text{uncond}} = \left\| v_\theta(x_{t'}, t', \varnothing) - (x_1 - x_0) \right\|^2,$$

where $v_\theta$ denotes the model-predicted velocity field and $\varnothing$ indicates the absence of conditioning.
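The following PyTorch sketch computes the interpolated sample and the two per-sample losses defined above; the tensor shapes and the velocity-model signature are assumptions for illustration.

```python
# Per-sample conditional/unconditional flow matching losses at a fixed
# interpolation time t'. The model signature v_theta(x, t, c) is an
# assumed interface, not the actual Supertonic code.
import torch

def spfm_losses(v_theta, x1, cond, t_prime=0.5):
    """Return per-sample L_cond and L_uncond for a batch of pairs (x1, c)."""
    x0 = torch.randn_like(x1)                  # source sample x0 ~ N(0, I)
    xt = (1.0 - t_prime) * x0 + t_prime * x1   # interpolated sample x_{t'}
    target = x1 - x0                           # flow matching velocity target
    v_cond = v_theta(xt, t_prime, cond)        # conditioned on the text label
    v_uncond = v_theta(xt, t_prime, None)      # unconditional (c = ∅)
    dims = tuple(range(1, x1.dim()))           # reduce over non-batch dims
    l_cond = ((v_cond - target) ** 2).mean(dim=dims)
    l_uncond = ((v_uncond - target) ** 2).mean(dim=dims)
    return l_cond, l_uncond
```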

The key intuition is that, when the text label $c$ is correct, the conditional objective is expected not to exceed the unconditional one, i.e., $\mathcal{L}_{\text{cond}} \le \mathcal{L}_{\text{uncond}}$ in expectation. SPFM exploits this intuition by comparing $\mathcal{L}_{\text{cond}}$ and $\mathcal{L}_{\text{uncond}}$ on a per-sample basis. If $\mathcal{L}_{\text{cond}} > \mathcal{L}_{\text{uncond}}$, the label is treated as potentially unreliable, and the sample is used only for unconditional training in that step. Otherwise, training proceeds with ordinary conditional flow matching. In practice, SPFM is activated after an initial warm-up phase of 1,000 steps to avoid spurious detections while the model is still undertrained, and we use a fixed interpolation time $t'$ near the midpoint of the trajectory, as suggested in prior work. This mechanism allows Supertonic to learn conditional generation primarily from trusted text-speech pairs while still benefiting from the acoustic coverage of noisy samples through unconditional training.
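Putting the pieces together, one SPFM training step under the same assumptions (reusing the spfm_losses helper from the sketch above) could look like this:

```python
# SPFM routing sketch: after warm-up, samples whose conditional loss
# exceeds the unconditional one contribute only the unconditional term.
# Builds on the assumed spfm_losses() helper defined above.
import torch

def spfm_step(v_theta, x1, cond, step, warmup_steps=1000):
    l_cond, l_uncond = spfm_losses(v_theta, x1, cond)
    if step < warmup_steps:
        # Ordinary conditional flow matching during warm-up.
        return l_cond.mean()
    # Per-sample routing: suspicious pairs train unconditionally only.
    suspicious = l_cond > l_uncond
    loss = torch.where(suspicious, l_uncond, l_cond)
    return loss.mean()
```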

We evaluate our system on four validation sets: two from the original TITW dataset (KS-KT and KS-UT), and two optional sets derived from LibriSpeech and VoxCeleb (US-KT and US-UT). These subsets combine Known Speakers (KS) or Unknown Speakers (US) with Known Text (KT) or Unknown Text (UT). Following the official TTS track evaluation plan [1], we compute:

• Word Error Rate (WER) and Character Error Rate (CER),

• Perceptual quality metrics (UTMOS and DNSMOS), and

• Faithfulness metrics: speaker similarity (SPKsim) and Mel Cepstral Distance (MCD).
