TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization


Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems suffer from a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with <80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization over state-of-the-art streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.


💡 Research Summary

TVTSyn tackles the dual challenges of real‑time voice conversion (VC) and speaker anonymization (SA) by eliminating the longstanding mismatch between dynamic content representations and static speaker embeddings. Traditional streaming VC/SA pipelines encode linguistic content as a time‑varying sequence while conditioning the decoder on a single global speaker vector. This static‑dynamic disparity forces the decoder to reconcile incompatible time scales, often resulting in over‑smoothed timbre, loss of emotional nuance, and insufficient privacy protection.

The proposed system introduces a content‑synchronous, time‑varying timbre (TVT) representation that evolves frame‑by‑frame in lockstep with the content. TVTSyn consists of four tightly integrated modules:

  1. Streaming Content Encoder – A fully causal 1‑D CNN with four down‑sampling stages (strides 8, 5, 4, 2) produces 512‑dimensional frame embeddings at a 20 ms hop. Eight causal multi‑head self‑attention (MHSA) blocks extend the receptive field, using a fixed look‑back window of 2 seconds and allowing up to four future frames (≈80 ms) via an expanded attention mask. A KV‑cache maintains a rolling context, keeping latency low while preserving long‑range coherence.

  2. Factorized Vector‑Quantized Bottleneck – The encoder output is compressed to an 8‑dimensional latent, quantized with a learnable 4096‑entry codebook, and projected back to 512 dimensions. This “compress‑then‑discretize” pipeline forces the representation to discard residual speaker cues while retaining phonetic detail. Training is self‑supervised: HuBERT‑base 9th‑layer activations are clustered into 200 pseudo‑labels, and a cross‑entropy loss aligns the quantized output with these labels.

  3. Time‑Varying Timbre (TVT) Block & Global Timbre Memory (GTM) – A global speaker embedding g is formed by concatenating a noise‑robust X‑vector and an ECAPA‑TDNN embedding. g is projected via an MLP into a set of K key‑value pairs (k_i, v_i). Each pair combines a speaker‑specific component (MLP(g)) with universal timbre prototypes (learnable priors). At every time step t, the content embedding c_t attends over the keys, yielding a weighted timbre vector v_t that reflects the most relevant timbre facet for the current phoneme or prosodic context.

    A gating network predicts a scalar α_t ∈ [0, 1] for each frame, regulating how far the timbre may vary from the global identity; spherical interpolation between the global embedding and the attended timbre vector then preserves identity geometry while enabling smooth local changes.
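
The attention masking in step 1 can be sketched as a boolean window mask. This is a minimal NumPy illustration, assuming a 100-frame look-back (2 s at the 20 ms hop) and 4 future frames; the function name and exact masking mechanics are illustrative, not taken from the paper:

```python
import numpy as np

def streaming_attention_mask(T, lookback=100, lookahead=4):
    """Boolean mask where frame t may attend to frames in [t-lookback, t+lookahead]."""
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]          # rel[t, s] = s - t
    return (rel >= -lookback) & (rel <= lookahead)

# Small example: 8 frames, 3-frame look-back, 2-frame lookahead.
mask = streaming_attention_mask(8, lookback=3, lookahead=2)
```

In a streaming decoder this mask would be applied per chunk, with the KV-cache supplying the look-back context so only new frames are recomputed.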

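The compress-then-discretize pipeline of step 2 can be illustrated with a toy nearest-neighbor quantizer. The random projection matrices and Euclidean code lookup below are stand-ins for the learned layers, using the dimensions stated above (512 → 8, 4096 codes):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, K = 512, 8, 4096

W_down = rng.standard_normal((D, d)) / np.sqrt(D)   # compress 512 -> 8
codebook = rng.standard_normal((K, d))              # 4096-entry codebook
W_up = rng.standard_normal((d, D)) / np.sqrt(d)     # project back 8 -> 512

def fvq_bottleneck(h):
    z = h @ W_down                                       # (T, 8) low-dim latent
    dists = ((z[:, None, :] - codebook[None]) ** 2).sum(-1)  # (T, K) squared distances
    codes = dists.argmin(-1)                             # nearest codebook index per frame
    z_q = codebook[codes]                                # discretized latent
    return z_q @ W_up, codes

h = rng.standard_normal((10, D))                         # 10 frames of encoder output
out, codes = fvq_bottleneck(h)
```

The narrow 8-dimensional latent is what squeezes out residual speaker cues; the discrete codes are what the HuBERT-derived pseudo-labels supervise during training.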

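Step 3 and the gating mechanism can be sketched together. The facet count K, the random keys/values standing in for MLP(g) plus the learned priors, and the slerp-based blend of global and attended timbre are assumptions consistent with the description above, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 512, 8                           # embedding dim; number of timbre facets (assumed)

keys = rng.standard_normal((K, D))      # stand-ins for speaker-specific + prior keys
values = rng.standard_normal((K, D))    # stand-ins for the timbre facet values

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def slerp(a, b, t):
    """Spherical interpolation between (normalized) vectors a and b."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))
    if omega < 1e-6:
        return a_n
    return (np.sin((1 - t) * omega) * a_n + np.sin(t * omega) * b_n) / np.sin(omega)

def tvt_step(c_t, g, alpha_t):
    attn = softmax(c_t @ keys.T / np.sqrt(D))   # content attends over timbre facets
    v_t = attn @ values                          # weighted timbre vector for this frame
    return slerp(g, v_t, alpha_t)                # gate controls drift from global identity

g = rng.standard_normal(D)      # global speaker embedding (X-vector + ECAPA concat)
c_t = rng.standard_normal(D)    # one frame of content embedding
timbre_t = tvt_step(c_t, g, alpha_t=0.2)
```

With α_t = 0 the output collapses to the (normalized) global embedding, so the gate interpolates between a static-speaker regime and fully content-driven timbre variation.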