MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows
Zero-shot voice conversion (VC) aims to transfer the timbre of any unseen target speaker onto source speech while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.
💡 Research Summary
MeanVC introduces a novel framework for streaming zero‑shot voice conversion that simultaneously satisfies three critical constraints: low latency, lightweight architecture, and high‑fidelity output. Existing streaming VC approaches fall into two categories. Autoregressive (AR) models such as StreamVoice achieve high naturalness and speaker similarity but suffer from high inference latency due to sequential decoding and require large model sizes (≈100 M parameters), making real‑time deployment on CPUs impractical. Non‑autoregressive (NAR) models like DualVC2 and Seed‑VC improve speed through parallel generation, yet either compromise zero‑shot generalization (simpler conformer‑based NAR models) or fragment long‑range context with fixed‑size sliding windows (diffusion‑based NAR models).
MeanVC bridges this gap by combining a chunk‑wise autoregressive denoising strategy with a mean‑flow diffusion technique, enabling a single‑step (1‑NFE) generation while preserving long‑term speaker consistency. The pipeline consists of five components: (1) a streaming ASR encoder (Fast‑U2++) that extracts bottleneck features (BNFs) from the source waveform in 160 ms chunks; (2) a timbre encoder that fuses BNFs with fine‑grained timbre information from a reference mel‑spectrogram via cross‑attention; (3) a pre‑trained speaker encoder (ECAPA‑TDNN) that provides speaker embeddings; (4) a Diffusion Transformer (DiT) decoder that performs chunk‑wise autoregressive denoising under a causal mask, conditioning each noisy chunk on a limited number of preceding clean chunks; and (5) a high‑quality vocoder (Vocos) that converts the generated mel‑spectrogram into 16 kHz audio.
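The chunk-wise autoregressive loop over these five components can be sketched in a few lines. This is an illustrative stand-in, not the released API: the function names, feature shapes, `MAX_CTX`, and the stub bodies are all hypothetical; only the overall data flow (160 ms chunks → BNFs → timbre fusion → single-step DiT denoising conditioned on past clean chunks → vocoder) follows the summary.

```python
# Hypothetical sketch of MeanVC's streaming pipeline; all names and
# shapes below are illustrative stand-ins, not the released API.
import numpy as np

MAX_CTX = 2   # assumed: number of preceding clean mel chunks the DiT attends to
N_MELS = 80   # assumed mel-spectrogram bin count

def asr_encoder(audio_chunk):
    # Fast-U2++ stand-in: bottleneck features (BNFs) for one 160 ms chunk
    return np.tanh(audio_chunk.reshape(-1, 1) @ np.ones((1, 256)))

def timbre_encoder(bnf, ref_mel):
    # cross-attention stand-in: fuse BNFs with reference-mel timbre cues
    return bnf[:, :N_MELS] + ref_mel.mean(axis=0)

def dit_denoise(cond, spk_emb, history):
    # stand-in for single-step (1-NFE) mean-flow denoising, conditioned
    # on speaker embedding and a limited context of past clean chunks
    ctx = np.mean(history, axis=0) if history else 0.0
    return cond + 0.1 * ctx + spk_emb[:N_MELS]

def vocoder(mel):
    # Vocos stand-in: mel-spectrogram -> waveform samples
    return mel.flatten()

def convert_stream(chunks, ref_mel, spk_emb):
    history, out = [], []
    for chunk in chunks:                         # chunk-wise autoregressive loop
        bnf = asr_encoder(chunk)
        cond = timbre_encoder(bnf, ref_mel)
        mel = dit_denoise(cond, spk_emb, history)
        history = (history + [mel])[-MAX_CTX:]   # keep limited clean context
        out.append(vocoder(mel))
    return out
```

The key design point visible here is that each chunk is denoised in one step while conditioning only on a bounded window of already-generated clean chunks, which is what keeps both latency and memory flat as the stream grows.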
The core technical contribution is the integration of mean flows. Traditional conditional flow matching (CFM) requires many function evaluations to solve the ODE, which is infeasible for real‑time use. Mean flows regress the average velocity field u over a time interval, allowing the ODE solution to be expressed as a simple linear update using this average velocity. During training, MeanVC minimizes the L2 distance between the network output fθ and a target field derived from the conditional velocity vₜ = ε − x and its Jacobian‑vector product, as formalized in Equation (2) of the paper. At inference time, the model computes z₀ = z₁ − fθ(z₁, 0, 1) with z₁ sampled from a standard Gaussian, achieving high‑quality mel generation in a single diffusion step.
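The one-step update can be sanity-checked on a toy case. For the linear path zₜ = (1−t)·x + t·ε, the conditional velocity v = ε − x is constant in t, so the average velocity over [0, 1] equals v and its time derivative vanishes (the Jacobian-vector-product term in the mean-flow target drops out, leaving vₜ itself). In the real model fθ is a learned approximation of this average velocity; the numpy sketch below just verifies the arithmetic of the update z₀ = z₁ − u:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=80)     # "data" endpoint (e.g. one mel frame)
eps = rng.normal(size=80)   # Gaussian noise endpoint

# Linear path z_t = (1 - t) * x + t * eps has conditional velocity
# d z_t / d t = eps - x, constant in t, so the average velocity over
# [0, 1] is also eps - x (and its time derivative is zero, so the
# mean-flow training target reduces to v_t itself for this toy path).
u_avg = eps - x

z1 = eps             # one-step sampling starts from pure noise at t = 1
z0 = z1 - u_avg      # mean-flow update: z_0 = z_1 - (1 - 0) * u

assert np.allclose(z0, x)   # the single step lands exactly on the data
```

For a learned, state-dependent field the average velocity is no longer available in closed form, which is exactly why the paper regresses it with a network and evaluates the time derivative via a Jacobian-vector product during training.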
To address the common over‑smoothing artifact of diffusion models, MeanVC adds a Diffusion Adversarial Post‑Training (D‑APT) stage. The DiT decoder weights are reused to initialize a discriminator that incorporates cross‑attention‑only transformer blocks at the second and fourth layers, producing a global feature vector for a scalar real/fake logit. A least‑squares GAN loss is applied to both generator and discriminator, encouraging the generator to restore high‑frequency details while preserving the 1‑NFE efficiency.
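The least-squares objective used in D-APT is the standard LSGAN formulation. A minimal sketch with hypothetical logit arrays (per the summary, the discriminator emits one scalar real/fake logit per utterance):

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # discriminator pushes real logits toward 1 and fake logits toward 0
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # generator pushes the discriminator's fake logits toward 1
    return np.mean((d_fake - 1.0) ** 2)

# perfect discriminator -> zero discriminator loss
assert lsgan_d_loss(np.ones(4), np.zeros(4)) == 0.0
# fully fooled discriminator -> zero generator loss
assert lsgan_g_loss(np.ones(4)) == 0.0
```

Because the generator here is the already-trained one-step DiT, this adversarial stage adds no sampling cost at inference; it only fine-tunes the weights that produce the single mean-flow step.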
MeanVC is remarkably compact: the entire model contains only 14 M parameters. The DiT decoder comprises four transformer blocks with hidden size 512 and two attention heads; the timbre encoder uses two cross‑attention modules (hidden size 256, four heads). Despite its size, the model delivers state‑of‑the‑art performance on both zero‑shot and known‑speaker conversion. Experiments on a filtered 10 k‑hour Mandarin corpus (Emilia) and fine‑tuning on Aishell3 show that MeanVC outperforms StreamVoice and Seed‑VC in subjective naturalness (NMOS 3.82 ± 0.05 vs. 3.81/3.76) and speaker similarity (SMOS 3.87 ± 0.06 vs. 3.67/3.62), and achieves the lowest character error rate (5.01 %) and the highest SSIM (0.687). Its DNS‑MOS (3.76) is slightly lower than Seed‑VC's (3.84), likely due to the smaller parameter budget.
From an efficiency standpoint, MeanVC’s VC module runs at a real‑time factor (RTF) of 0.136 on a single‑core AMD EPYC 7542 CPU, far faster than StreamVoice (13.632) and Seed‑VC (7.039). Including the ASR encoder (RTF 0.120) and Vocos vocoder (RTF 0.066), the full pipeline achieves an overall RTF of 0.322, comfortably below the real‑time threshold of 1.0. With a 160 ms chunk size, the internal processing latency per chunk is 51.52 ms, leading to a total end‑to‑end latency of 211.52 ms (chunk duration + processing), which is suitable for interactive applications.
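The latency figures follow directly from the RTF arithmetic; a quick check using the numbers reported above:

```python
# Latency arithmetic for the streaming pipeline, using the RTFs
# reported in the paper (measured on a single-core AMD EPYC 7542 CPU).
chunk_ms = 160.0
rtf = {"asr_encoder": 0.120, "vc_module": 0.136, "vocoder": 0.066}

total_rtf = sum(rtf.values())     # 0.322, comfortably below real time (1.0)
proc_ms = total_rtf * chunk_ms    # processing time for one 160 ms chunk
latency_ms = chunk_ms + proc_ms   # wait for the chunk to arrive, then process

assert abs(total_rtf - 0.322) < 1e-9
assert abs(proc_ms - 51.52) < 1e-6     # per-chunk processing latency
assert abs(latency_ms - 211.52) < 1e-6 # end-to-end latency
```

Note that the chunk duration itself dominates the end-to-end latency: shrinking the chunk would cut latency further, at the cost of less acoustic context per step.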
In summary, MeanVC presents a compelling solution for streaming zero‑shot voice conversion: it leverages chunk‑wise autoregressive conditioning to preserve long‑range speaker identity, employs mean‑flow diffusion to reduce sampling to a single step, and refines output quality through adversarial post‑training—all within a 14 M‑parameter model that runs in real time on modest CPU hardware. This combination opens the door to practical deployments in live dubbing, virtual avatar speech, personalized pronunciation assistants, and privacy‑preserving voice transformation, where both latency and model footprint are critical constraints.