VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models


As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user's confidential schedule to another, a failure of what we term interactional privacy. The ability to generate speaker-aware responses is therefore essential for the safe deployment of SLMs. Current SLM benchmarks test dialogue ability but overlook speaker identity. Multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses. Privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextually privacy-sensitive information (e.g., a user's private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most open-source models perform close to random chance (around 50% accuracy) on conditional privacy decisions, while even strong closed-source systems fall short on proactive privacy inference. We further validate these findings on Real-VoxPrivacy, a human-recorded subset, confirming that failures observed on synthetic data persist in real speech. Finally, we demonstrate a viable path forward: by fine-tuning on a new 4,000-hour training set, we improve privacy-preserving abilities while maintaining robustness. To support future work, we release the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to foster the development of safer and more context-aware SLMs.


💡 Research Summary

The paper introduces VoxPrivacy, the first benchmark specifically designed to evaluate “interactional privacy” in speech language models (SLMs) deployed in shared, multi‑user environments such as smart homes. Interactional privacy refers to the model’s ability to keep information disclosed by one user private from other users who may later query the same device. Existing SLM benchmarks focus on dialogue quality or speaker identification but do not test whether the model uses speaker identity to gate information, while privacy benchmarks only address globally sensitive data (e.g., passwords) and ignore context‑dependent secrets (e.g., a personal calendar entry).

VoxPrivacy is organized into three difficulty tiers. Tier 1 (Direct Command Secrecy) checks whether the model obeys an explicit “do not share” instruction regardless of who asks later. Tier 2 (Speaker‑Verified Secrecy) adds a conditional: the model must use the speaker’s voice as a biometric key, disclosing the secret only to the original speaker and denying all others. Tier 3 (Proactive Privacy Protection) is the hardest; without any explicit instruction, the model must infer from content, context, and voice that an utterance is inherently private (e.g., medical concerns) and automatically enforce speaker‑conditioned access.
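The three tiers amount to progressively stricter disclosure policies. A minimal sketch of the speaker-conditioned decision rule each tier tests is shown below; all names here (`Tier`, `SecretRecord`, `should_disclose`) are illustrative, not from the paper, and Tier 1 is modeled as refusing everyone once an explicit secrecy command is given.

```python
# Illustrative sketch of the disclosure policies tested by the three tiers.
# This is not the authors' code; it only makes the tier logic concrete.
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    DIRECT_COMMAND = 1    # Tier 1: explicit "do not share" instruction
    SPEAKER_VERIFIED = 2  # Tier 2: voice acts as a biometric key
    PROACTIVE = 3         # Tier 3: sensitivity must be inferred


@dataclass
class SecretRecord:
    owner_id: str             # speaker who disclosed the information
    content: str
    explicit_secret: bool     # was a "keep this private" command given?
    inferred_sensitive: bool  # e.g. medical content flagged by the model


def should_disclose(record: SecretRecord, querier_id: str, tier: Tier) -> bool:
    """Return True if the model may reveal the record to the querier."""
    if tier is Tier.DIRECT_COMMAND:
        # An explicit secrecy command blocks disclosure outright.
        return not record.explicit_secret
    if tier is Tier.SPEAKER_VERIFIED:
        # Only the original speaker may retrieve the secret.
        return querier_id == record.owner_id
    # Tier 3: no instruction was given; the model must infer sensitivity
    # and then fall back to speaker-conditioned access.
    if record.inferred_sensitive:
        return querier_id == record.owner_id
    return True
```

The benchmark's difficulty gradient falls out of this structure: Tier 1 needs only instruction following, Tier 2 adds speaker verification, and Tier 3 additionally requires inferring `inferred_sensitive` from content and context alone.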

The benchmark comprises 7,107 examples (≈32 hours) in English and Chinese, generated through a four‑stage pipeline: (1) multi‑LLM text generation (Deepseek, Gemini, ChatGPT) to create diverse secret/non‑secret statements, (2) deduplication and linguistic refinement, (3) structuring into tiered dialogues, and (4) assigning synthetic speakers from balanced gender pools (AISHELL‑2, WenetSpeech) and synthesizing audio with Cosyvoice2 TTS. A small human‑recorded validation set, Real‑VoxPrivacy, with 18 volunteers, confirms that synthetic results transfer to real speech.
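The four-stage pipeline can be sketched as a chain of functions. The stage implementations below are stubs standing in for the external systems the paper names (Deepseek/Gemini/ChatGPT for text, Cosyvoice2 for TTS); all function names and the dialogue schema are assumptions made for illustration.

```python
# Skeleton of the four-stage data-generation pipeline, with stubbed stages.

def generate_statements(n: int) -> list[str]:
    """Stage 1: multi-LLM generation of secret/non-secret statements (stub)."""
    return [f"statement {i}" for i in range(n)]


def deduplicate_and_refine(statements: list[str]) -> list[str]:
    """Stage 2: deduplication and linguistic refinement (stubbed as exact dedup)."""
    seen, out = set(), []
    for s in statements:
        key = s.lower().strip()
        if key not in seen:
            seen.add(key)
            out.append(s)
    return out


def build_tiered_dialogue(statement: str, tier: int) -> dict:
    """Stage 3: wrap a statement into a tiered multi-speaker dialogue."""
    return {"tier": tier,
            "turns": [("speaker_A", statement),
                      ("speaker_B", "What did they say earlier?")]}


def synthesize_audio(dialogue: dict, voice_pool: list[str]) -> dict:
    """Stage 4: assign balanced synthetic voices and (stub) TTS output paths."""
    dialogue["audio"] = [(voice_pool[i % len(voice_pool)], f"utt_{i}.wav")
                         for i in range(len(dialogue["turns"]))]
    return dialogue


# End-to-end run over a toy gender-balanced voice pool
voices = ["female_01", "male_01"]
examples = [synthesize_audio(build_tiered_dialogue(s, tier=2), voices)
            for s in deduplicate_and_refine(generate_statements(3))]
```

In the real pipeline, Stage 4 draws speakers from AISHELL-2 and WenetSpeech pools rather than a toy list, and Stage 2 involves LLM-based refinement rather than exact-match deduplication.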

The authors evaluate nine state‑of‑the‑art SLMs (seven open‑source, two closed‑source) using both automatic metrics (accuracy, precision, recall, F1) and human/LLM judges. Results show that open‑source models perform at chance (~50% accuracy) on conditional privacy decisions (tiers 2 and 3), while even strong closed‑source systems fall short on proactive privacy (tier 3), achieving at best ~70% accuracy. Controlled ablations reveal that failures stem from an inability to retain conversational context and to integrate speaker embeddings into the decision‑making pipeline, rather than from general language understanding deficits.
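Since each privacy decision is binary (refuse vs. disclose), the automatic metrics reduce to standard binary classification scores. A plain-Python sketch, treating "refuse" as the positive class, is shown below; this is not the authors' evaluation code.

```python
# Accuracy, precision, recall, and F1 for binary refuse(1)/disclose(0) labels.

def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    """Compute the four metrics from gold and predicted decisions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```

Under this framing, a model that ignores speaker identity and answers every query lands near 50% accuracy whenever refuse and disclose cases are balanced, which matches the chance-level behavior reported for the open-source models.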

To demonstrate a path forward, the authors assemble a 4,000‑hour multilingual, multi‑speaker training corpus and fine‑tune a base SLM. The fine‑tuned model improves tier 2 accuracy from 68 % to 84 % and tier 3 from 55 % to 73 %, without sacrificing overall dialogue quality. This indicates that privacy‑aware training can substantially close the gap.

Contributions are: (1) the release of the VoxPrivacy benchmark, its synthetic and real validation sets, the large training corpus, and a fine‑tuned privacy‑enhanced model; (2) a comprehensive empirical study showing that interactional privacy remains an unsolved problem for current SLMs; (3) diagnostic experiments pinpointing context‑handling as the key weakness and proposing speaker‑embedding‑driven policy networks as a promising direction; (4) a discussion of future work, including real‑time deployment, policy‑learning frameworks, and broader multimodal privacy considerations.

In sum, VoxPrivacy provides a rigorous, reproducible framework for measuring and improving how speech models protect user‑specific secrets in multi‑user settings, highlighting an urgent need to embed privacy reasoning directly into SLM architectures and training regimes.

