Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study


The rapid advances in text-to-speech (TTS) technologies have made audio deepfakes increasingly realistic and accessible, raising significant security and trust concerns. While existing research has largely focused on detecting single-speaker audio deepfakes, real-world malicious applications in multi-speaker conversational settings are emerging as a major, underexplored threat. To address this gap, we propose a conceptual taxonomy of multi-speaker conversational audio deepfakes, distinguishing between partial manipulations (one or multiple speakers altered) and full manipulations (entire conversations synthesized). As a first step, we introduce a new Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two-speaker conversations, generated using VITS and SoundStorm-based NotebookLM models to simulate natural dialogue with variations in speaker gender and conversational spontaneity. MsCADD is currently limited to TTS-based deepfakes. We benchmark three neural baseline models (LFCC-LCNN, RawNet2, and Wav2Vec 2.0) on this dataset and report performance in terms of F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR). These baseline models provide a useful benchmark; however, the results also highlight a significant gap in multi-speaker deepfake research: reliably detecting synthetic voices under varied conversational dynamics remains difficult. Our dataset and benchmarks provide a foundation for future research on deepfake detection in conversational scenarios, an underexplored area of research and a major threat to trustworthy information in audio settings. The MsCADD dataset is publicly available to support reproducibility and benchmarking by the research community.


💡 Research Summary

The paper addresses a critical gap in audio deepfake research: the detection of synthetic speech in multi-speaker conversational settings. While most prior work has focused on single-speaker spoofing, real-world threats increasingly involve dialogues where two or more participants exchange speech, making detection substantially more challenging due to speaker overlap, diarization ambiguity, and contextual dynamics.

The authors first propose a three‑dimensional taxonomy that classifies multi‑speaker conversational deepfakes along (1) Context (e.g., casual chat, interview, debate), (2) Speaker Composition (two speakers versus three or more, with gender combinations), and (3) Manipulation Scope (partial manipulation of one or several speakers, or full synthesis of the entire conversation). This taxonomy provides a clear conceptual framework for future dataset creation and model evaluation.
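The three taxonomy dimensions can be encoded as a small label schema. The following is an illustrative sketch only: the enum member names are assumptions based on the examples in the summary, not an encoding published with the paper.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical encoding of the paper's three taxonomy dimensions.
class Context(Enum):
    CASUAL_CHAT = "casual chat"
    INTERVIEW = "interview"
    DEBATE = "debate"

class SpeakerComposition(Enum):
    TWO_SPEAKERS = 2
    THREE_OR_MORE = 3

class ManipulationScope(Enum):
    PARTIAL = "partial"  # one or several speakers altered
    FULL = "full"        # entire conversation synthesized

@dataclass(frozen=True)
class TaxonomyLabel:
    context: Context
    composition: SpeakerComposition
    scope: ManipulationScope

# Every synthetic clip in MsCADD falls under full manipulation of a
# two-speaker conversation:
mscadd_label = TaxonomyLabel(
    context=Context.CASUAL_CHAT,
    composition=SpeakerComposition.TWO_SPEAKERS,
    scope=ManipulationScope.FULL,
)
```

A schema like this makes the paper's observation explicit in code: MsCADD covers only one cell of the taxonomy (full synthesis, two speakers), leaving partial manipulations and larger speaker groups for future datasets.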

To operationalize the taxonomy, the authors introduce the Multi‑speaker Conversational Audio Deepfake Dataset (MsCADD). The dataset contains 2,830 WAV clips, split into 1,148 genuine human‑recorded two‑speaker dialogues and 1,682 fully synthetic dialogues. Synthetic dialogues are generated from the English Conversation Corpus (66 YouTube videos) using two state‑of‑the‑art TTS systems: VITS (Coqui) for diverse speaker timbres and SoundStorm‑based generation via Google NotebookLM for more spontaneous, human‑like exchanges (including laughter, back‑channel cues, and short pauses). Each clip lasts 10–22 seconds, includes clean and noisy backgrounds, and features random gender pairings (male‑female, female‑female, male‑male). Seven unique synthetic voices and five real speaker groups are used, ensuring a wide acoustic variety.
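A minimal indexing sketch for a dataset with this real/synthetic split might look as follows. The directory names "real/" and "fake/" are assumptions about the release layout, not documented structure of the MsCADD distribution.

```python
from pathlib import Path

def load_index(root):
    """Build a list of {path, label} records from a two-folder layout.

    Assumes genuine clips live under root/real/ and synthetic clips
    under root/fake/ (a hypothetical layout for illustration).
    """
    index = []
    for label in ("real", "fake"):
        for wav in sorted(Path(root, label).glob("*.wav")):
            index.append({"path": str(wav), "label": label})
    return index

def class_balance(index):
    """Count clips per label, e.g. {'real': 1148, 'fake': 1682} for MsCADD."""
    counts = {"real": 0, "fake": 0}
    for item in index:
        counts[item["label"]] += 1
    return counts
```

For MsCADD the expected balance would be 1,148 real and 1,682 fake clips, totalling 2,830; an index check like this is a cheap sanity test before training.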

The paper benchmarks three widely used deep‑fake detection models on MsCADD:

  1. LFCC‑LCNN – a feature‑based approach using Linear Frequency Cepstral Coefficients fed into a Light Convolutional Neural Network (the ASVspoof 2021 baseline). It achieves an F1 score of 0.65, overall accuracy of 0.62, a high true‑positive rate (TPR) of 92.8 % for synthetic speech, but a low true‑negative rate (TNR) of 53.4 % for genuine speech, indicating difficulty distinguishing real voices in multi‑speaker overlap.

  2. RawNet2 – an end‑to‑end raw‑waveform model with learnable band‑pass filters and gated recurrent layers. It reaches an F1 of 0.88, accuracy of 0.90, TPR of 84.27 % and TNR of 97.39 %, showing a balanced performance across both classes.

  3. Wav2Vec 2.0 – a self‑supervised speech representation model fine‑tuned for deep‑fake detection. It records an F1 of 0.89, accuracy of 0.89, TPR of 82.5 % and TNR of 97.8 %, comparable to RawNet2 and slightly better at rejecting genuine audio.
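The four reported metrics all derive from a binary confusion matrix with synthetic speech as the positive class. A minimal sketch of that computation (using a toy confusion matrix, not the paper's test split):

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute F1, accuracy, TPR, and TNR from confusion-matrix counts.

    Convention: "positive" = synthetic (deepfake) speech, so TPR is the
    detection rate on fake clips and TNR is the pass rate on genuine clips.
    """
    tpr = tp / (tp + fn)                        # recall on synthetic clips
    tnr = tn / (tn + fp)                        # specificity on genuine clips
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"f1": f1, "accuracy": accuracy, "tpr": tpr, "tnr": tnr}

# Toy example: 10 synthetic clips (8 caught, 2 missed),
# 10 genuine clips (9 passed, 1 falsely flagged).
m = detection_metrics(tp=8, fp=1, tn=9, fn=2)
```

This convention explains why LFCC-LCNN's high TPR coexists with mediocre accuracy: with a low TNR, many genuine clips are falsely flagged, dragging down the overall score even though most fakes are caught.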

Overall, the results reveal that while existing single‑speaker detection models can detect fully synthetic conversations, they struggle with the nuanced dynamics of multi‑speaker audio. LFCC‑LCNN’s reliance on spectral features makes it vulnerable to speaker overlap and background noise, whereas raw‑waveform and self‑supervised models handle these complexities more robustly.

The authors emphasize that the current dataset only covers full‑conversation synthesis; future work should expand to partial manipulations where only one or a subset of speakers are spoofed, as well as to longer dialogues with more than two participants, overlapping speech, and multimodal cues (video, transcripts). By releasing MsCADD publicly on GitHub, the paper provides a reproducible benchmark that can catalyze the development of detection algorithms specifically tailored to conversational, multi‑speaker deepfakes—a rapidly emerging threat to political discourse, corporate communications, and public trust.

