MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage either continuous or quantized linguistic features to balance intelligibility and speaker similarity, and can use or omit the pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
💡 Research Summary
MaskVCT (Masked Voice Codec Transformer) is a novel zero‑shot voice conversion (VC) system that unifies masked language modeling, codec token generation, and multiple classifier‑free guidance (CFG) terms to achieve fine‑grained controllability over speaker identity, linguistic content, and prosody. Traditional VC models either condition on a single factor (e.g., a speaker prompt or a pitch contour) or rely on “leaky” self‑supervised features that unintentionally retain source pitch and timbre. Consequently, they struggle to fully replace the source’s intonation or to balance intelligibility against speaker similarity. MaskVCT addresses these limitations through three key innovations.
First, it employs dual linguistic conditioning. The model ingests both (i) discrete syllabic tokens produced by the pre‑trained SylBoost model (quantized at 8.33 Hz, providing a coarse, pitch‑stripped representation) and (ii) continuous self‑supervised features (e.g., wav2vec‑2.0) that preserve fine‑grained phoneme alignment. These two embeddings are summed element‑wise, allowing the user at inference time to prioritize either representation or blend them. The discrete tokens suppress source speaker information, improving target‑speaker fidelity, while the continuous stream boosts intelligibility and word‑level accuracy.
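As a rough sketch of this dual conditioning, the snippet below sums the two embedding streams element-wise; the blending weight `alpha` is an illustrative assumption standing in for whatever mechanism the user employs to prioritize one representation over the other at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 100, 1024  # frames at the model rate, hidden size

# Hypothetical stand-ins for the two linguistic streams:
# discrete syllabic-token embeddings (coarse, pitch-stripped) and
# continuous SSL features projected to the model dimension.
discrete_emb = rng.standard_normal((T, D))
continuous_emb = rng.standard_normal((T, D))

def blend_linguistic(discrete, continuous, alpha=0.5):
    """Element-wise combination of the two linguistic streams.

    alpha=1.0 uses only the discrete tokens (better speaker fidelity),
    alpha=0.0 uses only the continuous features (better intelligibility).
    """
    return alpha * discrete + (1.0 - alpha) * continuous

cond = blend_linguistic(discrete_emb, continuous_emb, alpha=0.7)
```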
Second, optional pitch conditioning is realized via a log‑scale sinusoidal embedding applied at a 50 Hz frame rate. This design makes the system agnostic to the pitch extractor (Praat is used in the experiments) and enables the model to operate with or without an explicit pitch contour. When pitch is supplied, the generated speech follows the target’s intonation; when omitted, the model relies solely on linguistic cues, which can be advantageous for speaker similarity.
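A minimal sketch of such a log-scale sinusoidal pitch embedding is shown below. The F0 range, embedding dimension, and the convention of zeroing unvoiced frames are illustrative assumptions; the paper only specifies that log-scale sinusoidal embeddings are applied at a 50 Hz frame rate, independent of the pitch extractor.

```python
import numpy as np

def pitch_embedding(f0_hz, dim=256, f_min=50.0, f_max=600.0):
    """Embed an F0 contour with log-scale sinusoidal features.

    Log-F0 is normalized over an assumed vocal range and used as the
    "position" of a standard sinusoidal embedding. Unvoiced frames
    (f0 == 0) are mapped to the zero vector (an assumption here).
    """
    f0 = np.asarray(f0_hz, dtype=np.float64)
    voiced = f0 > 0.0
    pos = np.zeros_like(f0)
    pos[voiced] = (np.log(f0[voiced]) - np.log(f_min)) / (np.log(f_max) - np.log(f_min))
    half = dim // 2
    rates = 10000.0 ** (-np.arange(half) / half)   # geometric frequency ladder
    angles = pos[:, None] * rates[None, :]
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    emb[~voiced] = 0.0
    return emb

emb = pitch_embedding([0.0, 110.0, 220.0, 440.0])  # one 50 Hz frame each
```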
Third, MaskVCT extends the classifier‑free guidance paradigm to triple CFG. Conventional CFG subtracts an unconditional logit from a conditioned logit, but VC must preserve the source text. Therefore, the authors subtract a “linguistic‑only” logit instead, yielding the following formulation:
log p̃ = log p(· | L)
      + ω_all · (log p(· | Ap, L, P) − log p(· | L))
      + ω_spk · (log p(· | Ap, L) − log p(· | L))
      + ω_ling · (log p(· | L) − log p(· | ∅))
Here Ap denotes the speaker prompt, L the linguistic condition, P the pitch contour, and ω_all, ω_spk, ω_ling are scalar weights that the user can tune at inference. This mechanism lets practitioners trade off intelligibility (ω_ling), speaker fidelity (ω_spk), and prosody adherence (ω_all) on the fly.
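The triple-CFG combination is a simple linear recombination of four logit sets, and can be sketched directly from the formula above (the function and argument names are ours):

```python
import numpy as np

def triple_cfg(logp_full, logp_spk_ling, logp_ling, logp_uncond,
               w_all=1.5, w_spk=1.0, w_ling=1.0):
    """Combine conditional log-probabilities per the triple-CFG formula.

    logp_full     : log p(· | Ap, L, P)  -- speaker prompt + linguistic + pitch
    logp_spk_ling : log p(· | Ap, L)     -- speaker prompt + linguistic
    logp_ling     : log p(· | L)         -- linguistic only (the anchor,
                                            used instead of the unconditional
                                            logit to preserve source text)
    logp_uncond   : log p(· | ∅)         -- fully unconditional
    """
    return (logp_ling
            + w_all * (logp_full - logp_ling)
            + w_spk * (logp_spk_ling - logp_ling)
            + w_ling * (logp_ling - logp_uncond))
```

Setting `w_all=1, w_spk=0, w_ling=0` recovers the fully conditioned logits, while raising any one weight pushes generation toward that factor.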
The backbone is a 16‑layer Transformer encoder (1024‑dimensional hidden states, 16 heads, ReLU‑FFN of size 4096) equipped with Pre‑LayerNorm and rotary positional embeddings. After the encoder, separate classification heads predict the masked tokens for each of the nine RVQ codebooks extracted from a 16 kHz DAC codec. Training follows a masked token reconstruction paradigm: a random mask time u and a target codebook layer q are sampled, and a cosine‑scheduled binary mask is applied. Only masked positions contribute to a cross‑entropy loss. To improve robustness, the authors apply SpecAugment, PhaseAug, and a 50 % pitch‑shift augmentation that creates “perturbed” versions of the source speech.
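The cosine-scheduled masking step can be sketched as follows. This mirrors the MaskGIT-style recipe the description implies (sample a mask time u, map it through a cosine schedule, mask that fraction of positions); the exact schedule endpoints are an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_mask(num_frames, rng):
    """Sample a binary mask over codec-token frames for training.

    A mask time u ~ U(0, 1) is drawn and mapped through a cosine
    schedule to a mask ratio; that fraction of frames is masked.
    Only the masked (True) positions would contribute to the
    cross-entropy loss.
    """
    u = rng.uniform(0.0, 1.0)               # random mask time u
    ratio = np.cos(0.5 * np.pi * u)         # cosine schedule, ratio in (0, 1]
    n_mask = max(1, int(round(ratio * num_frames)))
    mask = np.zeros(num_frames, dtype=bool)
    mask[rng.choice(num_frames, size=n_mask, replace=False)] = True
    return mask

mask = sample_training_mask(200, rng)
```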
Training details: 250 k steps, batch size 168, AdamW with lr = 2e‑4, on two NVIDIA A100 GPUs. The model contains 234 M parameters (including pre‑trained components). Inference proceeds from an all‑masked state and iteratively unmasks tokens over 64 steps, using Gumbel‑Softmax sampling with top‑k = 35 and top‑p = 0.9.
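The per-step sampling inside that 64-step unmasking loop can be sketched as top-k/top-p logit filtering followed by Gumbel-max sampling (which is equivalent to sampling from the softmax). The filtering order and implementation details below are our assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def filter_logits(logits, top_k=35, top_p=0.9):
    """Keep only the top-k logits whose cumulative probability mass
    stays within top_p; everything else is set to -inf."""
    out = np.full_like(logits, -np.inf)
    order = np.argsort(logits)[::-1][:top_k]          # best top_k indices
    probs = np.exp(logits[order] - logits[order].max())
    probs /= probs.sum()
    keep = (np.cumsum(probs) - probs) < top_p         # best token always kept
    out[order[keep]] = logits[order[keep]]
    return out

def gumbel_sample(logits, rng):
    """Sample a token index via the Gumbel-max trick."""
    g = rng.gumbel(size=logits.shape)
    return int(np.argmax(logits + g))

# One masked position: filter its logits, then sample a codec token.
logits = rng.standard_normal(1024)                    # hypothetical vocab size
token = gumbel_sample(filter_logits(logits), rng)
```

Across the 64 steps, progressively more positions are unmasked (typically the most confident ones first), until all nine codebooks are filled.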
Experiments compare MaskVCT against five strong baselines: Diff‑HierVC, FA‑Codec, MaskGCT‑S2A, FreeVC, and GenVC. Training data span LibriTTS‑R, MLS‑en, VCTK, LibriHeavy‑Large, HiFi‑TTS, LJSpeech, and RAVDESS, totaling over 100 k hours. Evaluation uses 511 source‑target pairs (≤10 s, 3 s prompts) and the L2‑ARCTIC accented English corpus for accent conversion. Metrics include word error rate (WER), character error rate (CER), speaker similarity (S‑SIM), speaker MOS (SS‑MOS), UTMOS, and Q‑MOS.
Two operating modes are highlighted:
- MaskVCT‑All (continuous linguistic features, pitch‑conditioned, CFG weights ω_all = 1.5, ω_spk = 1.0, ω_ling = 1.0). This configuration excels at intelligibility (WER = 4.68 %, CER = 2.22 %) and follows the target pitch contour, but sacrifices some speaker similarity.
- MaskVCT‑Spk (discrete syllabic tokens, pitch omitted, CFG weights ω_all = 0, ω_spk = 2.0, ω_ling = 0.5). This mode achieves the highest speaker and accent similarity (S‑SIM ≈ 0.89, SS‑MOS ≈ 3.54) while maintaining acceptable intelligibility.
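For reference, the two reported presets can be written as plain configuration dictionaries (field names are ours; the values are taken from the settings listed above):

```python
# The two highlighted operating modes, as configuration presets.
PRESETS = {
    "MaskVCT-All": {            # intelligibility- and prosody-oriented
        "linguistic": "continuous",
        "pitch_conditioned": True,
        "w_all": 1.5, "w_spk": 1.0, "w_ling": 1.0,
    },
    "MaskVCT-Spk": {            # speaker/accent-similarity-oriented
        "linguistic": "discrete",
        "pitch_conditioned": False,
        "w_all": 0.0, "w_spk": 2.0, "w_ling": 0.5,
    },
}
```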
In accent conversion, pitch‑conditioned generation suffered from background noise in L2‑ARCTIC, so only the Spk mode was evaluated, still delivering strong speaker‑accent alignment.
Overall, MaskVCT demonstrates that multi‑guidance masked modeling can deliver a single, unified VC model capable of dynamic trade‑offs between intelligibility, speaker fidelity, and prosody. This flexibility opens avenues for applications such as multilingual TTS, voice anonymization, and expressive speech synthesis where user‑controlled balance of attributes is essential. Future work may explore real‑time streaming, broader language coverage, and user‑friendly interfaces for CFG weight manipulation.