Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models
Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codecs (NACs) provide superior speaker feature disentanglement and linguistic fidelity. NACs can also be paired with causal language models (LMs) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization, lacking the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% relative UAR improvement) compared to the previous state-of-the-art streaming method DarkStream, while maintaining comparable latency (180 ms vs. 200 ms) and privacy protection against lazy-informed attackers, though showing 15% relative degradation against semi-informed attackers.
💡 Research Summary
Stream‑Voice‑Anon introduces a novel streaming speaker anonymization (SA) framework that leverages neural audio codecs (NAC) and causal language models (LM) to achieve high utility while preserving privacy in real‑time voice applications. The system consists of three main components: a content encoder, an acoustic encoder, and a speaker encoder. The content encoder extracts speaker‑invariant tokens from raw audio using a lightweight ConvNeXt backbone followed by an 8‑layer decoder‑only transformer; these tokens are quantized through a vector‑quantization (VQ) layer with an 8,192‑entry codebook, effectively separating linguistic information from speaker characteristics. The acoustic encoder, based on the FishSpeech architecture, produces multi‑codebook acoustic tokens (eight codebooks per frame) using causal convolutions, ensuring frame‑synchronous generation. A pre‑trained speaker verification model supplies a global speaker embedding that conditions the downstream model.
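The VQ step described above can be sketched as a nearest-neighbor lookup into the 8,192-entry codebook: each frame's continuous content feature is replaced by the index of its closest codebook vector. The following is a minimal illustration; the feature dimension, the random codebook, and the distance computation are assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch of the content encoder's vector-quantization step.
# The 8,192-entry codebook size comes from the summary; DIM is illustrative.
rng = np.random.default_rng(0)
CODEBOOK_SIZE, DIM = 8192, 256

codebook = rng.standard_normal((CODEBOOK_SIZE, DIM))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map (T, DIM) continuous features to (T,) discrete content tokens."""
    # Squared Euclidean distance to every codebook entry, expanded as
    # ||x||^2 - 2 x.c + ||c||^2 to avoid materializing a (T, 8192, DIM) array.
    d = (frames ** 2).sum(1, keepdims=True) \
        - 2 * frames @ codebook.T \
        + (codebook ** 2).sum(1)
    return d.argmin(axis=1)  # index of nearest codebook entry per frame

tokens = quantize(rng.standard_normal((10, DIM)))
print(tokens.shape)  # (10,) — one discrete content token per frame
```

Because the output is a sequence of discrete indices rather than continuous features, fine-grained speaker cues have a narrow channel through which to leak, which is the disentanglement property the anonymization strategy later exploits.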
The core of the system is a two‑stage autoregressive voice conversion (AR‑VC) model. A “Slow‑AR” transformer (12 layers, 768 hidden units) processes the content tokens and the global speaker embedding to generate a per‑frame latent state zₜ. A lightweight “Fast‑AR” transformer (4 layers) then autoregressively emits the eight acoustic codebook indices for the current frame conditioned on zₜ. This hierarchical decoding reduces the computational burden of predicting all codebooks simultaneously and improves synthesis quality.
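The hierarchical Slow-AR/Fast-AR decoding loop can be sketched as follows. The two `*_step` functions below are hypothetical stubs standing in for the 12-layer Slow-AR and 4-layer Fast-AR transformers; only the control flow (one latent zₜ per frame, then eight sequentially emitted codebook indices) reflects the description above.

```python
import numpy as np

# Sketch of the two-stage AR decoding loop. slow_step / fast_step are
# illustrative stubs, not the paper's models.
rng = np.random.default_rng(0)
N_CODEBOOKS, HIDDEN = 8, 768

def slow_step(content_token, spk_emb, prev_z):
    # One Slow-AR step: fuse the content token, the global speaker
    # embedding, and the previous state into a per-frame latent z_t.
    return np.tanh(0.5 * prev_z + spk_emb + content_token)

def fast_step(z_t):
    # Fast-AR emits the 8 acoustic codebook indices one at a time,
    # each conditioned on z_t and the indices already emitted this frame.
    indices, h = [], z_t.copy()
    for _ in range(N_CODEBOOKS):
        idx = int(abs(h.sum() * 1000)) % 1024  # stub "sampling"
        indices.append(idx)
        h = np.roll(h, 1) + idx / 1024.0       # condition on emitted index
    return indices

spk_emb = rng.standard_normal(HIDDEN)
z = np.zeros(HIDDEN)
frames = []
for tok in rng.standard_normal((5, HIDDEN)):   # 5 content frames
    z = slow_step(tok, spk_emb, z)
    frames.append(fast_step(z))

print(len(frames), len(frames[0]))  # 5 frames, 8 codebook indices each
```

The design choice is that the expensive Slow-AR model runs once per frame while only the small Fast-AR model iterates over the eight codebooks, instead of a single large model predicting all codebooks jointly.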
A key contribution is the introduction of dynamic‑delay training. Instead of fixing the look‑ahead delay (e.g., d = 4 frames) during training, the authors sample the delay uniformly from 1 to 8 frames for each training utterance. Consequently, the model learns to operate under variable future context, allowing inference‑time selection of any delay between 130 ms and 440 ms without retraining. This flexibility enables service providers to trade latency for intelligibility on the fly.
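The dynamic-delay idea reduces to sampling a fresh look-ahead per training utterance, so a single model covers the whole latency range at inference. A minimal sketch, assuming a fixed per-frame hop plus a base pipeline latency (both values below are illustrative back-fits to the reported 130–440 ms range, not figures from the paper):

```python
import random

FRAME_MS = 44   # assumed look-ahead hop per frame (illustrative)
BASE_MS = 86    # assumed fixed pipeline latency (illustrative)

def sample_training_delay(rng: random.Random) -> int:
    # Per-utterance delay drawn uniformly from {1, ..., 8} frames,
    # as described for dynamic-delay training.
    return rng.randint(1, 8)

def delay_to_latency_ms(d: int) -> int:
    # Map a frame delay chosen at inference time to end-to-end latency.
    return BASE_MS + d * FRAME_MS

rng = random.Random(0)
delays = [sample_training_delay(rng) for _ in range(1000)]
print(min(delays), max(delays))                        # 1 8
print(delay_to_latency_ms(1), delay_to_latency_ms(8))  # 130 438
```

At deployment, a provider can pick any `d` in 1..8 per stream without retraining, trading latency for intelligibility on the fly.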
Privacy protection is achieved through a combination of pseudo‑speaker embedding sampling and prompt‑based randomization. A prompt pool P is built from four diverse corpora (VCTK, ESD, VoxCeleb1, CREMA‑D). For each inference, K prompts are randomly selected, shuffled, and their content and acoustic tokens concatenated to form a rich conditioning context. Speaker embeddings from the selected prompts are averaged into ḡ, then blended with a Gaussian‑sampled embedding gₛ using a mixing coefficient α = 0.9, yielding the anonymized target embedding gₐₙₒₙ = α·ḡ + (1−α)·gₛ. This process is independent of the source speaker, allowing pre‑computation and zero additional latency during streaming.
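The embedding mix above is a one-line computation. The sketch below implements gₐₙₒₙ = α·(mean of selected prompt embeddings) + (1−α)·gₛ with α = 0.9 as stated; the embedding dimension, K = 4, and the final unit-normalization are assumptions for illustration.

```python
import numpy as np

# Sketch of the anonymized target-embedding computation.
# ALPHA = 0.9 is from the summary; K, DIM, and the normalization are assumed.
rng = np.random.default_rng(0)
ALPHA, K, DIM = 0.9, 4, 192

def anonymized_embedding(prompt_embs: np.ndarray) -> np.ndarray:
    g_avg = prompt_embs.mean(axis=0)            # average of K prompt embeddings
    g_s = rng.standard_normal(DIM)              # Gaussian pseudo-speaker sample
    g_anon = ALPHA * g_avg + (1 - ALPHA) * g_s  # blend toward real-prompt average
    return g_anon / np.linalg.norm(g_anon)      # unit norm (assumed convention)

prompt_embs = rng.standard_normal((K, DIM))     # stand-in prompt embeddings
g = anonymized_embedding(prompt_embs)
print(g.shape)  # (192,)
```

Since nothing here depends on the source utterance, the target embedding can be drawn ahead of time and cached, which is why the scheme adds no latency to the streaming path.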
Experiments follow the VoicePrivacy 2024 Challenge protocol. Training uses LibriHeavy and CommonVoice; evaluation employs LibriSpeech (ASR and privacy) and IEMOCAP (emotion recognition). Two attacker models are considered: a lazy‑informed attacker (ASV‑eval) trained on original speech, and a semi‑informed attacker (ASV‑anon‑eval) that is fine‑tuned on anonymized data. Results show that Stream‑Voice‑Anon achieves a relative word error rate (WER) reduction of up to 46 % compared to the state‑of‑the‑art streaming method DarkStream, while improving unweighted average recall (UAR) for emotion recognition by up to 28 % with the “cremad‑emo‑4rnd” prompt strategy. Privacy is comparable under the lazy‑informed threat (EER ≈ 47 %, near random guessing) but degrades by about 15 % under the semi‑informed threat (EER ≈ 19 %). Latency is kept at 180 ms, slightly lower than DarkStream’s 200 ms.
A detailed ablation on prompt diversity demonstrates that increasing the number and heterogeneity of prompts (e.g., from a single fixed VCTK utterance to cross‑dataset four‑prompt mixes) consistently raises EER against semi‑informed attackers, confirming that prompt variability hampers attacker adaptation. Conversely, the impact on lazy‑informed attackers is modest because they do not exploit anonymization patterns.
Overall, the paper showcases how NAC’s discrete tokenization combined with causal LM conditioning can disentangle speaker identity from linguistic content in a streaming setting, enabling high‑quality, low‑latency anonymization. The dynamic‑delay mechanism provides operational flexibility, and the prompt‑mixing strategy offers a practical, pre‑computable privacy shield. Limitations include a noticeable drop in privacy against semi‑informed attackers and a performance gap relative to offline anonymization methods, suggesting avenues for future work such as more sophisticated speaker embedding sampling, multimodal prompts, and large‑scale real‑world deployment studies.