BEAT2AASIST model with layer fusion for ESDD 2026 Challenge
Recent advances in audio generation have increased the risk of realistic environmental sound manipulation, motivating the ESDD 2026 Challenge as the first large-scale benchmark for Environmental Sound Deepfake Detection (ESDD). We propose BEAT2AASIST, which extends BEATs-AASIST by splitting BEATs-derived representations along the frequency or channel dimension and processing them with dual AASIST branches. To enrich feature representations, we incorporate top-k transformer layer fusion using concatenation, CNN-gated, and SE-gated strategies. In addition, vocoder-based data augmentation is applied to improve robustness against unseen spoofing methods. Experimental results on the official test sets demonstrate that the proposed approach achieves competitive performance across the challenge tracks.
💡 Research Summary
The paper introduces BEAT2AASIST, an advanced system for detecting environmental‑sound deepfakes, and evaluates it on the ESDD 2026 Challenge (Tracks 1 and 2). The core idea builds on the BEATs‑AASIST baseline by splitting the BEATs‑derived token sequence either along the frequency axis or the channel dimension, feeding each half into a separate AASIST graph‑based anti‑spoofing branch, and then concatenating the two branch outputs. Frequency‑based splitting isolates high‑ and low‑frequency spectral cues, while channel‑based splitting captures complementary information embedded in the 768‑dimensional token channels.
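The split operation itself is simple tensor slicing. The sketch below shows both variants on a hypothetical BEATs token grid (the 62×8 patch layout and 768 channels are illustrative assumptions; the AASIST branches themselves are not implemented here):

```python
import numpy as np

# Hypothetical BEATs output: 62 time patches x 8 frequency patches,
# each token with 768 channels (shapes are illustrative assumptions).
T, F, C = 62, 8, 768
tokens = np.random.randn(T * F, C).astype(np.float32)
grid = tokens.reshape(T, F, C)

# Frequency split: low- vs high-frequency patch rows.
low_freq = grid[:, :F // 2, :].reshape(-1, C)    # (62*4, 768)
high_freq = grid[:, F // 2:, :].reshape(-1, C)   # (62*4, 768)

# Channel split: first vs second half of the 768 channels.
chan_a = tokens[:, :C // 2]   # (496, 384)
chan_b = tokens[:, C // 2:]   # (496, 384)

# Each half would feed its own AASIST branch; the two branch embeddings
# are concatenated before the final classifier.
```

Either way, each branch sees only half of the representation, so the two branches are pushed toward complementary cues.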
A second major contribution is “top‑k layer fusion”. Instead of using only the final transformer layer, the authors select the top k (k = 4–10) hidden layers of the pre‑trained BEATs model and combine them with three strategies: (a) simple concatenation, (b) a CNN‑gate that computes layer‑wise weights from the input mel‑spectrogram via three 2‑D convolutions followed by global pooling and a softmax, and (c) an SE‑gate that derives weights directly from the selected layer tensors using a squeeze‑excitation mechanism. These fusion schemes allow the system to dynamically emphasize the most informative transformer layers for each input, yielding richer representations than a single‑layer approach.
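The SE‑gate variant can be sketched as a squeeze‑excitation over the stacked layer outputs: pool each layer to a scalar descriptor, pass the descriptors through a small bottleneck, and softmax the result into per‑layer weights. This is a minimal NumPy illustration; the paper's exact projection sizes and pooling are assumptions, and random matrices stand in for learned weights:

```python
import numpy as np

def se_gate_fusion(layers, reduction=4, seed=0):
    """SE-style gating over k stacked transformer layer outputs.
    layers: array of shape (k, n_tokens, channels).
    Returns a weighted sum of the layers, shape (n_tokens, channels)."""
    k, n, c = layers.shape
    # Squeeze: global average pool each layer to one descriptor.
    z = layers.mean(axis=(1, 2))                       # (k,)
    # Excitation: tiny bottleneck MLP (random weights as stand-ins).
    rng = np.random.default_rng(seed)
    hidden = max(k // reduction, 1)
    w1 = rng.standard_normal((k, hidden))
    w2 = rng.standard_normal((hidden, k))
    h = np.maximum(z @ w1, 0.0) @ w2                   # (k,)
    gate = np.exp(h - h.max())
    gate /= gate.sum()                                 # softmax over layers
    # Fuse: gate-weighted sum across the layer axis.
    return np.tensordot(gate, layers, axes=(0, 0))     # (n, c)

stacked = np.random.randn(4, 496, 768)   # k = 4 selected layers
fused = se_gate_fusion(stacked)
```

Concatenation instead stacks the k layers along the channel axis, and the CNN‑gate derives the same kind of per‑layer weights from the mel‑spectrogram rather than from the layer tensors.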
To improve robustness against unseen spoofing methods, the authors employ vocoder‑based data augmentation. They generate copy‑synthesized fake audio using three high‑quality neural vocoders—HiFi‑GAN, BigVGAN, and UnivNet—thereby injecting a variety of synthesis artifacts without needing full TTA or ATA pipelines. In Track 1 only HiFi‑GAN is used, while Track 2 incorporates all three vocoders, reflecting the more challenging black‑box scenario.
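Copy‑synthesis amounts to analyzing a real recording into a mel‑spectrogram and re‑synthesizing it with a vocoder, so the output keeps the original content but carries vocoder artifacts. The sketch below shows only the data flow; `StubVocoder` and `mel_fn` are placeholders for a real vocoder (e.g. HiFi‑GAN) and mel front‑end, not actual APIs:

```python
import numpy as np

class StubVocoder:
    """Stand-in for a neural vocoder (HiFi-GAN / BigVGAN / UnivNet in the
    paper). A real vocoder maps a mel-spectrogram to a waveform; this stub
    only reproduces the output length to illustrate the pipeline."""
    def __init__(self, hop_length=256):
        self.hop_length = hop_length

    def __call__(self, mel):
        n_frames = mel.shape[1]
        return np.random.randn(n_frames * self.hop_length).astype(np.float32)

def copy_synthesize(real_wave, mel_fn, vocoder):
    """Analyze a *real* recording into a mel-spectrogram and re-synthesize
    it. The result is labeled 'fake' for augmentation: same content, but
    with the vocoder's synthesis artifacts."""
    mel = mel_fn(real_wave)
    return vocoder(mel)

# Illustrative front-end: 128 mel bins, hop of 256 samples (not a real STFT).
mel_fn = lambda w: np.random.randn(128, len(w) // 256)
fake = copy_synthesize(np.random.randn(64000).astype(np.float32),
                       mel_fn, StubVocoder())
```

Because only a vocoder is needed, this produces artifact‑bearing fakes far more cheaply than running a full generation pipeline.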
Experiments are conducted on the EnvSDD dataset, which contains 45.25 h of real recordings and 316.7 h of fake audio. Track 1 training data consist of 27,811 real and 111,244 TTA/ATA‑generated fakes; Track 2 adds 270 real and 1,083 black‑box fakes. All models fine‑tune the BEATs‑iter3 backbone (pre‑trained on AudioSet), use 4‑second 128‑mel‑bin spectrograms, apply SpecAug, and adopt class weighting (fake = 0.1, real = 0.9) to counter the imbalance. Training runs for 20 epochs with batch size 32.
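The class weighting counters the roughly 4:1 fake/real imbalance by down‑weighting the fake class in the loss. A minimal sketch, assuming a binary cross‑entropy formulation (the paper's exact loss form is not stated here):

```python
import numpy as np

def weighted_bce(probs, labels, w_fake=0.1, w_real=0.9):
    """Class-weighted binary cross-entropy with the paper's weights
    (fake = 0.1, real = 0.9). labels: 1 = real, 0 = fake."""
    probs = np.clip(np.asarray(probs, float), 1e-7, 1 - 1e-7)
    labels = np.asarray(labels, float)
    w = np.where(labels == 1, w_real, w_fake)
    losses = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return float(np.mean(w * losses))
```

With these weights, a misclassified real sample costs nine times as much as a misclassified fake, offsetting the fakes' dominance in the training set.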
Results are reported as Equal Error Rate (EER). In Track 1, the frequency‑split model with SE‑gate (k = 4) achieves 1.69 % Eval EER and 1.92 % Test EER; the channel‑split model with SE‑gate (k = 10) improves to 1.66 % Eval EER and 1.70 % Test EER. An ensemble (0.4 × model 1 + 0.6 × model 3) further reduces Test EER to 1.60 %. In Track 2, the frequency‑split model with CNN‑gate (k = 4) attains the lowest errors (0.42 % Eval EER, 0.46 % Test EER). Ensembles that blend the baseline (concatenation) with the CNN‑gate model achieve 0.30 % Eval EER and 0.40 % Test EER, placing the system among the top performers (third overall).
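Both the metric and the ensemble are straightforward to reproduce. EER is the operating point where the false‑acceptance and false‑rejection rates meet, and the ensemble is a weighted sum of per‑model scores (the weights below mirror the paper's 0.4/0.6 blend; the score values are invented for illustration):

```python
import numpy as np

def eer(scores, labels):
    """Equal Error Rate: sweep thresholds over the observed scores and
    return (FAR + FRR)/2 at the threshold where |FAR - FRR| is smallest.
    scores: higher = more likely real; labels: 1 = real, 0 = fake."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    best_gap, best_eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)   # fakes accepted
        frr = np.mean(scores[labels == 1] < t)    # reals rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

# Score-level ensemble as in the paper: 0.4 * model1 + 0.6 * model3.
s1 = np.array([0.90, 0.80, 0.20, 0.10])   # invented scores
s3 = np.array([0.85, 0.70, 0.30, 0.05])
labels = np.array([1, 1, 0, 0])
ensemble = 0.4 * s1 + 0.6 * s3
print(eer(ensemble, labels))   # perfectly separable toy scores -> 0.0
```

Score‑level blending like this only helps when the component models err on different inputs, which is consistent with the paper pairing different fusion strategies in its ensembles.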
The authors highlight several strengths: dual‑branch processing captures complementary spectral and channel cues; multi‑layer fusion enriches representations beyond the final transformer layer; and vocoder augmentation supplies diverse spoofing artifacts, enhancing generalization to unseen attacks. They also acknowledge limitations: the choice of k and fusion strategy is hyper‑parameter sensitive; computational cost and latency of the dual‑branch plus fusion pipeline are not analyzed, which could hinder real‑time deployment; and the vocoder‑generated fakes may not fully represent all possible adversarial synthesis techniques. Future work is suggested to explore lightweight fusion mechanisms, systematic hyper‑parameter search, and broader augmentation strategies. Overall, BEAT2AASIST demonstrates that careful architectural extensions to self‑supervised audio backbones can substantially improve environmental sound deepfake detection in both open‑set and black‑box settings.