CompSpoof: A Dataset and Joint Learning Framework for Component-Level Audio Anti-spoofing Countermeasures


Component-level audio spoofing (Comp-Spoof) targets a new form of audio manipulation in which only specific components of a signal, such as speech or environmental sound, are forged or substituted while the other components remain genuine. Existing anti-spoofing datasets and methods treat an utterance or a segment as entirely bona fide or entirely spoofed, and thus cannot accurately detect component-level spoofing. To address this, we construct a new dataset, CompSpoof, covering multiple combinations of bona fide and spoofed speech and environmental sound. We further propose a separation-enhanced joint learning framework that separates audio components and applies an anti-spoofing model to each. Joint learning preserves information relevant for detection. Extensive experiments demonstrate that our method outperforms the baseline, highlighting the necessity of separating components and of detecting spoofing in each component individually. Datasets and code are available at: https://github.com/XuepingZhang/CompSpoof.


💡 Research Summary

The paper addresses a newly identified threat in audio deep‑fake detection: component‑level spoofing, where only specific parts of an audio mixture—typically the speech signal or the background/environmental sound—are forged while the other component remains authentic. Existing anti‑spoofing datasets (e.g., ASVspoof, ADD) and models assume that an entire utterance is either bona‑fide or spoofed, which makes them ineffective against such fine‑grained attacks. To fill this gap, the authors introduce two major contributions.

First, they construct CompSpoof, a publicly released dataset containing 2,500 audio clips evenly distributed across five classes that represent all possible combinations of genuine and spoofed speech and environmental sound. The classes are: (0) both speech and environment genuine, (1) both genuine but mixed from different sources, (2) speech spoofed / environment genuine, (3) speech genuine / environment spoofed, and (4) both spoofed. Speech sources are drawn from ASVspoof 5 and CommonVoice (genuine) and from ASVspoof 5 and SSTC (spoofed). Environmental sounds come from VGGSound (genuine) and VCapAV (spoofed) and cover indoor, street, and natural scenes. All files are resampled to 16 kHz, trimmed to the shorter component's length, and the relative level of environment to speech is controlled via a predefined SNR. The dataset is split into training (70%), development (10%), and evaluation (20%) sets with stratified sampling to preserve class balance.
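The five-class scheme above can be captured as a small mapping from per-component authenticity flags to a class id. This is an illustrative sketch based on the class descriptions in the summary; the helper name `comp_spoof_label` and its signature are not from the released code.

```python
# Hypothetical labeling helper for the five CompSpoof classes described above.
COMP_SPOOF_CLASSES = {
    0: "speech bona fide + environment bona fide (original mixture)",
    1: "speech bona fide + environment bona fide (re-mixed from different sources)",
    2: "speech spoofed + environment bona fide",
    3: "speech bona fide + environment spoofed",
    4: "speech spoofed + environment spoofed",
}

def comp_spoof_label(speech_spoofed: bool, env_spoofed: bool,
                     remixed: bool = False) -> int:
    """Map component-level authenticity flags to a CompSpoof class id."""
    if speech_spoofed and env_spoofed:
        return 4
    if speech_spoofed:
        return 2
    if env_spoofed:
        return 3
    # Both components genuine: distinguish original vs. re-mixed pairs.
    return 1 if remixed else 0
```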

Second, the authors propose a separation‑enhanced joint learning (SEJ) framework. The pipeline consists of four modules: (1) a binary mixture‑detection model (XLSR‑AASIST) that decides whether an input contains any spoofed component; (2) a UNet‑based speech‑separation network operating in the STFT domain, which predicts a complex mask for speech and uses an adaptive soft‑mask (based on a tanh function and a scaling factor α) to extract the residual environmental sound; (3) a speech‑specific anti‑spoofing classifier (again XLSR‑AASIST) that processes the separated speech; and (4) an environment‑specific anti‑spoofing classifier that processes the separated background.

A key novelty is the joint training of the separation network together with the two anti‑spoofing classifiers. The total loss combines: (i) mean‑squared‑error separation loss, (ii) binary mixture‑detection cross‑entropy, (iii) speech‑ and environment‑specific spoof‑classification cross‑entropy, and (iv) a consistency loss computed as the KL‑divergence between the predictions on the separated signals and those on the original mixture. A weighting factor κ (set to 10) balances the separation loss against the classification terms. This joint objective forces the separator to retain spoof‑relevant cues while still producing clean component estimates.
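The combined objective can be written out as a small sketch. Which side of the objective κ scales, and the exact direction of the KL terms, are assumptions here; only the four loss components and κ = 10 come from the summary.

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    """KL divergence between two discrete distributions (consistency term)."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def cross_entropy(probs, label, eps=1e-8):
    """Negative log-likelihood of the true class."""
    return float(-np.log(max(probs[label], eps)))

def total_loss(sep_mse, det_probs, det_label,
               sp_probs, sp_label, env_probs, env_label,
               sp_probs_mix, env_probs_mix, kappa=10.0):
    """Illustrative joint objective: kappa-weighted separation MSE,
    mixture-detection CE, per-component spoof CE, and KL consistency
    between predictions on the mixture and on the separated signals."""
    l_sep = kappa * sep_mse
    l_det = cross_entropy(det_probs, det_label)
    l_cls = cross_entropy(sp_probs, sp_label) + cross_entropy(env_probs, env_label)
    l_cons = kl_div(sp_probs_mix, sp_probs) + kl_div(env_probs_mix, env_probs)
    return l_sep + l_det + l_cls + l_cons
```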

During training, the models are first trained independently for four epochs, then jointly optimized from epoch 5 onward. Audio is chunked into 4‑second windows with 2‑second overlap; STFT uses a 64 ms window and 16 ms hop. Adam optimizer is used with learning rates 1e‑3 for the separator and 1e‑5 for the classifiers. Evaluation is performed at the segment level (per chunk) and aggregated to file level via majority voting.
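The chunking and file-level aggregation described above can be sketched in a few lines. The 16 kHz rate, 4-second window, 2-second overlap, and majority voting come from the summary; the function names are illustrative.

```python
import numpy as np
from collections import Counter

SR = 16_000          # sample rate (16 kHz)
CHUNK = 4 * SR       # 4-second windows
HOP = 2 * SR         # 2-second overlap -> 2-second hop between windows

def chunk_audio(wav):
    """Split a waveform into overlapping 4 s chunks with a 2 s hop."""
    if len(wav) <= CHUNK:
        return [wav]
    starts = range(0, len(wav) - CHUNK + 1, HOP)
    return [wav[s:s + CHUNK] for s in starts]

def file_level_decision(chunk_labels):
    """Aggregate per-chunk class predictions to one file label by majority vote."""
    return Counter(chunk_labels).most_common(1)[0][0]
```

For example, a 10-second file yields four overlapping chunks, and the file label is whichever class the per-chunk classifier predicts most often.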

Experimental results on the development and evaluation subsets show that the baseline (a single XLSR-AASIST model extended to five classes) achieves an overall F1 of 0.84. Adding only the separator (SEF) degrades performance on mixed classes (overall F1 ≈ 0.71), because separation alone discards spoof-relevant information. The full SEJ framework (SEF+JL) dramatically improves results, reaching an overall F1 of 0.91 on both dev and eval sets. Notably, for the challenging "speech-spoofed / environment-bona-fide" (class 2) and "speech-bona-fide / environment-spoofed" (class 3) categories, F1 rises from ~0.84 (baseline) to ~0.92 (SEF+JL). Segment-level analysis confirms the same trend: speech-only anti-spoofing F1 improves from 0.72 to 0.86, and environment-only from 0.72 to 0.85, when joint learning is employed. The authors also observe that the environment-specific classifier lags behind the speech classifier, likely because XLSR-AASIST was originally pre-trained on speech data; this points to future work on dedicated background-sound models.

In summary, the paper makes three substantive contributions: (1) defining and formalizing component‑level audio spoofing, (2) releasing the first dataset that systematically covers all genuine/spoofed combinations of speech and environment, and (3) designing a joint separation‑and‑classification architecture that demonstrably outperforms conventional single‑stream anti‑spoofing systems. The work opens new research directions, including extending to multi‑speaker mixtures, real‑time lightweight implementations, and developing pre‑trained models tailored to environmental sounds, thereby advancing the robustness of audio deep‑fake detection in realistic, partially‑forged scenarios.

