Content Anonymization for Privacy in Long-form Audio
Voice anonymization techniques have been found to successfully obscure a speaker’s acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these settings, many utterances from the same speaker are available, which poses a significantly greater privacy risk: an attacker could exploit an individual’s vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose a new approach that performs contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech. We then show how the proposed content-based anonymization methods mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.
💡 Research Summary
The paper addresses a critical gap in current voice‑anonymization research: while existing methods successfully hide a speaker’s acoustic identity in short, isolated utterances (as demonstrated in the VoicePrivacy Challenge), they ignore the privacy risk posed by the linguistic content of long‑form audio such as interviews, phone calls, and meetings. In these scenarios, many utterances from the same speaker are available, and an attacker can exploit vocabulary, syntax, and discourse patterns to re‑identify the speaker even when the voice has been fully anonymized.
To quantify this threat, the authors use the Fisher Speech Corpus and adopt a “hard” speaker‑verification setting in which positive trials pair different conversations from the same speaker (on different topics) and negative trials pair different speakers discussing the same topic. This removes topic as a shortcut and forces the attribution model to rely on subtle stylistic cues. A state‑of‑the‑art authorship‑attribution model (SLUAR) serves as a content‑only attacker. Experiments show that as the number of utterances available to the attacker grows, the Equal Error Rate (EER) of the content attack drops dramatically, confirming that linguistic content is a powerful biometric side‑channel in long‑form audio.
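To make the attack evaluation concrete, here is a minimal sketch of how an EER can be computed from an attacker’s trial scores. The score distributions below are synthetic and the helper is illustrative, not the paper’s evaluation code:

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """Compute the EER from similarity scores and binary trial labels
    (1 = same speaker, 0 = different speakers)."""
    fars, frrs = [], []
    # Sweep every observed score as a decision threshold.
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))   # false accepts on negative trials
        frrs.append(np.mean(~accept[labels == 1]))  # false rejects on positive trials
    fars, frrs = np.array(fars), np.array(frrs)
    # EER is the operating point where FAR and FRR cross.
    i = np.argmin(np.abs(fars - frrs))
    return float((fars[i] + frrs[i]) / 2)

# Hypothetical usage with synthetic attacker scores: positive trials score
# slightly higher on average than negative trials.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.6, 0.1, 500), rng.normal(0.4, 0.1, 500)])
labels = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
print(f"EER: {equal_error_rate(scores, labels):.3f}")
```

An EER near 0.5 means the attacker performs no better than chance; the lower the EER, the more reliably the attacker distinguishes same-speaker from different-speaker trials.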
The proposed defense inserts a contextual paraphrasing step into the standard ASR‑TTS anonymization pipeline. After automatic speech recognition (Whisper‑medium) produces a transcript, a language model rewrites the text to remove speaker‑specific style while preserving meaning. Two paraphrasing strategies are explored: (1) utterance‑by‑utterance paraphrasing using GPT‑4o‑mini, and (2) segment‑based paraphrasing where a sliding window of 8–16 consecutive utterances is paraphrased jointly. The latter captures discourse‑level patterns and mitigates the problem of extremely short utterances lacking sufficient context. Prompt engineering includes instructions to condense content, adjust utterance length, and strip PII.
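A minimal sketch of the segment‑based strategy, assuming the OpenAI chat‑completions API for GPT‑4o‑mini; the prompt wording, helper names, and the non‑overlapping windowing (the paper describes a sliding window of 8–16 utterances) are illustrative rather than the authors’ exact setup:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompt wording is illustrative; the paper's instructions include condensing
# content, adjusting utterance length, and stripping PII.
SYSTEM_PROMPT = (
    "Paraphrase each numbered utterance so that speaker-specific wording, "
    "syntax, and discourse habits are removed while the meaning is preserved. "
    "Condense where possible and remove any personally identifying details. "
    "Return one paraphrase per line, keeping the numbering."
)

def paraphrase_segment(utterances: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Jointly paraphrase a window of consecutive utterances so the model
    sees discourse-level context, not just one short turn."""
    numbered = "\n".join(f"{i + 1}. {u}" for i, u in enumerate(utterances))
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": numbered},
        ],
    )
    lines = resp.choices[0].message.content.strip().splitlines()
    return [line.split(". ", 1)[-1] for line in lines if line.strip()]

def paraphrase_transcript(utterances: list[str], window: int = 8) -> list[str]:
    """Chunk the transcript into windows (8-16 utterances, per the paper)
    and paraphrase each segment jointly."""
    out: list[str] = []
    for start in range(0, len(utterances), window):
        out.extend(paraphrase_segment(utterances[start:start + window]))
    return out
```

Paraphrasing a window jointly gives the model enough context to rewrite very short turns (“yeah”, “uh‑huh”) consistently with the surrounding discourse, which utterance‑by‑utterance rewriting cannot do.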
Three anonymization configurations are evaluated: (a) voice‑only (ASR‑TTS without text modification), (b) content‑only (paraphrasing only, original voice retained), and (c) combined voice + content (paraphrasing followed by synthesis with a pseudo‑target speaker embedding derived from VoxCeleb2). Results demonstrate that voice‑only anonymization is highly vulnerable to the content attack (EER as low as 0.1), while content‑only protects against the acoustic attacker but leaves the content attacker effective. The combined approach flattens the EER curve, achieving values near 50 % even when the attacker has access to dozens of utterances, effectively reducing the attacker’s advantage to random guessing.
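The three configurations differ only in which pipeline stages are active. The sketch below wires them together, assuming Whisper‑medium via the openai-whisper package for ASR; `synthesize` and the speaker embeddings are hypothetical placeholders for any TTS system that conditions on a target speaker embedding (e.g., one drawn from VoxCeleb2):

```python
import whisper  # pip install openai-whisper

asr = whisper.load_model("medium")  # Whisper-medium, as in the paper

def anonymize(wav_path: str, mode: str, synthesize, original_emb, pseudo_emb):
    """Schematic wiring of the three evaluated configurations.
    `synthesize(text, embedding)` is a placeholder for a TTS system that
    accepts a target speaker embedding."""
    text = asr.transcribe(wav_path)["text"]
    if mode in ("content-only", "voice+content"):
        text = " ".join(paraphrase_transcript([text]))  # see the sketch above
    # Voice-only and combined modes swap in a pseudo-speaker embedding;
    # content-only keeps the original voice.
    target = pseudo_emb if mode in ("voice-only", "voice+content") else original_emb
    return synthesize(text, target)
```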
Utility is assessed with multiple metrics. Audio naturalness, measured by UTMOS, remains high (≈3.8–4.2), indicating that the synthesized speech sounds natural. Semantic preservation is quantified with greedy alignment and Dynamic Time Warping (DTW) similarity between original and paraphrased transcripts, both yielding scores above 0.85. Detectability of the paraphrased text is evaluated with the zero‑shot detector Binoculars; the results indicate that the output does not exhibit obvious machine‑generated artifacts.
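A sketch of the two semantic‑preservation scores, assuming a generic sentence encoder from sentence-transformers (the specific model id is an assumption, not the paper’s choice):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def greedy_similarity(orig: list[str], para: list[str]) -> float:
    """Greedy alignment: match each original utterance to its most similar
    paraphrase and average the cosine similarities."""
    a = encoder.encode(orig, normalize_embeddings=True)
    b = encoder.encode(para, normalize_embeddings=True)
    return float((a @ b.T).max(axis=1).mean())

def dtw_similarity(orig: list[str], para: list[str]) -> float:
    """DTW alignment over pairwise cosine similarities; returns the mean
    similarity along the optimal warping path."""
    a = encoder.encode(orig, normalize_embeddings=True)
    b = encoder.encode(para, normalize_embeddings=True)
    cost = 1.0 - a @ b.T
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack to count the path length, then convert cost back to similarity.
    i, j, steps = n, m, 0
    while (i, j) != (0, 0):
        steps += 1
        candidates = [(acc[i - 1, j - 1], i - 1, j - 1),
                      (acc[i - 1, j], i - 1, j),
                      (acc[i, j - 1], i, j - 1)]
        _, i, j = min(c for c in candidates if c[1] >= 0 and c[2] >= 0)
    return 1.0 - acc[n, m] / steps
```

DTW is useful here because segment‑based paraphrasing can merge or split utterances, so original and paraphrased transcripts need not align one‑to‑one.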
A key contribution is the systematic comparison of large API‑based models (e.g., GPT‑5) versus locally‑run open‑source models (Gemma‑3‑4B). GPT‑5 delivers slightly higher paraphrasing quality but requires sending data to external services, raising privacy concerns. Gemma‑3‑4B runs entirely on‑device, preserving data confidentiality at the cost of occasional quality degradation on complex discourse. The authors argue that both families can achieve sufficient privacy‑utility trade‑offs, and the choice should be guided by deployment constraints.
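For the on‑device option, here is a sketch of local paraphrasing with Hugging Face transformers. The model id, gating details, and prompt are assumptions rather than the authors’ exact configuration:

```python
from transformers import pipeline  # pip install transformers accelerate

# Model id is an assumption; Gemma weights are gated and require accepting
# the license on the Hugging Face Hub. On recent transformers versions the
# multimodal Gemma 3 checkpoints may need task="image-text-to-text" instead.
generator = pipeline("text-generation", model="google/gemma-3-4b-it",
                     device_map="auto")

def paraphrase_on_device(utterance: str) -> str:
    """Rewrite a single utterance locally; the transcript never leaves the machine."""
    messages = [{
        "role": "user",
        "content": ("Paraphrase the following so that personal style and any "
                    f"identifying details are removed, keeping the meaning:\n{utterance}"),
    }]
    result = generator(messages, max_new_tokens=128, do_sample=False)
    # The pipeline returns the full chat with the assistant reply appended last.
    return result[0]["generated_text"][-1]["content"].strip()
```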
In summary, the study demonstrates that protecting privacy in long‑form audio demands joint anonymization of both acoustic and linguistic channels. Context‑aware, window‑based paraphrasing effectively neutralizes the stylistic fingerprint that survives traditional voice‑only pipelines. The methodology is compatible with existing ASR‑TTS systems, and the paper provides practical guidance on model selection, prompt design, and window sizing. Future work is suggested on real‑time streaming scenarios and multilingual extensions, but the current findings already offer a concrete, scalable path for industry and researchers to safeguard speaker identity in realistic conversational recordings.