DSFlow: Dual Supervision and Step-Aware Architecture for One-Step Flow Matching Speech Synthesis
Flow-matching models have enabled high-quality text-to-speech synthesis, but their iterative sampling process during inference incurs substantial computational cost. Although distillation is widely used to reduce the number of inference steps, existing methods often suffer from process variance due to endpoint error accumulation. Moreover, directly reusing continuous-time architectures for discrete, fixed-step generation introduces structural parameter inefficiencies. To address these challenges, we introduce DSFlow, a modular distillation framework for few-step and one-step synthesis. DSFlow reformulates generation as a discrete prediction task and explicitly adapts the student model to the target inference regime. It improves training stability through a dual supervision strategy that combines endpoint matching with deterministic mean-velocity alignment, enforcing consistent generation trajectories across inference steps. In addition, DSFlow improves parameter efficiency by replacing continuous-time timestep conditioning with lightweight step-aware tokens, aligning model capacity with the significantly reduced timestep space of the discrete task. Extensive experiments across diverse flow-based text-to-speech architectures demonstrate that DSFlow consistently outperforms standard distillation approaches, achieving strong few-step and one-step synthesis quality while reducing model parameters and inference cost.
💡 Research Summary
DSFlow presents a novel distillation framework that enables flow‑matching based text‑to‑speech (TTS) models to generate high‑quality speech with as few as one inference step. Traditional flow‑matching models achieve state‑of‑the‑art audio quality but require tens to hundreds of neural function evaluations (NFEs) during inference, which hampers real‑time deployment. Existing distillation approaches mainly rely on endpoint supervision—training a student model to match the teacher’s final output—leading to high variance because errors accumulate across intermediate steps. Moreover, the continuous‑time conditioning mechanisms (e.g., adaLN‑Zero) used in teacher models allocate a large number of parameters to model a smooth time variable that becomes unnecessary once the student operates on a discrete, fixed set of steps.
The core contributions of DSFlow are twofold. First, it introduces dual supervision, a weighted combination of endpoint loss and a deterministic mean‑velocity loss. The mean‑velocity target is computed by linearly interpolating the teacher's start and end states for each integration interval, thus providing dense trajectory guidance without requiring Jacobian‑vector products (JVPs). By setting the weighting factor α = 0.7, DSFlow emphasizes endpoint accuracy while still benefiting from intermediate supervision, which stabilizes training and accelerates convergence.
Second, DSFlow replaces the continuous‑time modulation network with step‑aware tokenization. For each allowed inference step (e.g., 1, 2, 4), a small set of learnable tokens is prepended to the input sequence. These tokens are processed through the Transformer’s self‑attention, delivering step‑specific conditioning without per‑layer scaling and shifting. The parameter complexity drops from O(L·D²) (where L is the number of layers and D the hidden dimension) to O(K·D) (K is the number of discrete steps). In practice, the time‑conditioning parameters shrink from 38 million to about 1.5 thousand, yielding substantial memory and compute savings.
A lightweight weak classifier‑free guidance (CFG) regularization is also added. While the student implicitly learns the teacher’s guided behavior during distillation, a small regularization term preserves a functional unconditional branch, allowing a modest CFG scale to be applied at inference time for fine‑grained quality‑diversity control.
Experiments are conducted on several flow‑based TTS backbones (e.g., FastSpeech‑2, Glow‑TTS) across standard datasets such as LJSpeech and VCTK. DSFlow is evaluated under 1‑step, 2‑step, and 4‑step configurations. Results show that DSFlow consistently outperforms baseline distillation methods: MOS scores are within 0.05–0.12 of the teacher even with a single step, PESQ and MCD metrics follow the same trend, and model size is reduced by 30–45%. Inference speed improves by a factor of 8–12, achieving a real‑time factor (RTF) well below 0.1, suitable for interactive voice assistants.
Ablation studies confirm that (i) removing the mean‑velocity component leads to unstable training and degraded quality, (ii) replacing step‑aware tokens with continuous‑time embeddings inflates parameters without quality gains, and (iii) the weak CFG regularizer enables modest quality adjustments at inference without harming the distilled alignment.
In summary, DSFlow delivers a practical solution for ultra‑low‑latency TTS: dual supervision supplies dense, stable learning signals, and step‑aware tokenization aligns model capacity with the discrete nature of distilled inference. The framework achieves teacher‑level audio quality with dramatically fewer NFEs, reduced parameters, and faster runtime, opening the door for real‑time, high‑fidelity speech synthesis on resource‑constrained devices. Future work may explore multilingual extensions, richer conditioning (e.g., emotion, style), and adaptive token designs for variable‑step inference.