HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection
Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross-domain generation techniques and recording conditions.
💡 Research Summary
HierCon introduces a novel hierarchical contrastive attention framework for detecting audio deepfakes generated by state‑of‑the‑art text‑to‑speech (TTS) and voice‑conversion (VC) systems. The authors start from the widely used self‑supervised learning (SSL) backbone XLS‑R, which provides 24 transformer layers of contextualized representations. Prior work such as Sensitive Layer Selection (SLS) treats each layer independently, applying a scalar weight to the entire layer output. This ignores two crucial sources of information: (1) temporal dynamics within a layer—only certain frames contain synthesis artefacts, and (2) inter‑layer dependencies—shallow layers capture acoustic cues while deeper layers encode prosodic and semantic information, and their interactions are often decisive for deepfake detection.
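The baseline's independent scalar layer weighting can be sketched in a few lines (a minimal NumPy illustration, not the SLS authors' code; the random logits stand in for learned parameters, the layer count matches XLS-R, and the feature dimension is shrunk for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, D = 24, 201, 128                # XLS-R layers, frames, (shrunken) feature dim

hidden = rng.normal(size=(L, T, D))   # stacked per-layer SSL outputs

# One learned scalar per layer, normalised with softmax. The same weight
# applies to every frame of a layer, so temporal structure is ignored --
# exactly the limitation HierCon targets.
logits = rng.normal(size=L)           # stand-in for learned layer logits
w = np.exp(logits - logits.max())
w /= w.sum()

pooled = np.einsum("l,ltd->td", w, hidden)   # (T, D) fused representation
print(pooled.shape)
```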
HierCon addresses these gaps with a three‑stage hierarchical attention mechanism and a margin‑based contrastive loss. First, Temporal Attention computes frame‑wise weights for each layer using a two‑layer MLP, producing a weighted layer token. Second, Intra‑Group Attention groups the 24 layers into eight clusters of three consecutive layers (motivated by the observation that neighboring layers learn similar abstractions). Within each group, an attention‑pooling operation followed by an MLP aggregates the three tokens, allowing the model to capture local dependencies among acoustically similar layers. Third, Inter‑Group Attention aggregates the eight group vectors into a single utterance embedding, letting the network learn which abstraction level (acoustic, prosodic, semantic) contributes most to a particular decision.
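The three stages can be sketched as a single forward pass (a NumPy illustration under the shapes stated above; the MLP weights are random stand-ins for learned parameters, and for brevity one shared scoring MLP serves all three stages, whereas the actual model would use separate parameters per stage):

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, D = 24, 201, 128   # layers, frames, (shrunken) feature dim
G, S = 8, 3              # 8 groups of 3 consecutive layers

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mlp_scores(x, w1, b1, w2, b2):
    # two-layer MLP producing one scalar attention score per input vector
    h = np.tanh(x @ w1 + b1)
    return (h @ w2 + b2).squeeze(-1)

w1, b1 = rng.normal(size=(D, D // 4)), np.zeros(D // 4)
w2, b2 = rng.normal(size=(D // 4, 1)), np.zeros(1)

hidden = rng.normal(size=(L, T, D))          # stacked SSL layer outputs

# Stage 1: temporal attention -> one weighted token per layer
a_t = softmax(mlp_scores(hidden, w1, b1, w2, b2), axis=-1)      # (L, T)
layer_tokens = np.einsum("lt,ltd->ld", a_t, hidden)             # (L, D)

# Stage 2: intra-group attention over each group of 3 neighbouring layers
groups = layer_tokens.reshape(G, S, D)                          # consecutive triples
a_g = softmax(mlp_scores(groups, w1, b1, w2, b2), axis=-1)      # (G, S)
group_vecs = np.einsum("gs,gsd->gd", a_g, groups)               # (G, D)

# Stage 3: inter-group attention -> single utterance embedding
a_u = softmax(mlp_scores(group_vecs, w1, b1, w2, b2), axis=-1)  # (G,)
utt_emb = a_u @ group_vecs                                      # (D,)
print(utt_emb.shape)
```

Each stage reduces one axis (frames, then layers within a group, then groups), so the model can down-weight uninformative frames and abstraction levels independently.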
To enforce domain‑invariant representations, the authors add a margin‑based contrastive loss. For each sample in a batch, positive pairs are defined as other samples sharing the same label (real or fake), while negatives belong to the opposite class. The loss pushes the average cosine similarity of positives above that of negatives by a margin m, and is combined with binary cross‑entropy (BCE) using a small weighting factor λ_con = 0.05. This dual‑objective training prevents the hierarchical attention from over‑fitting to dataset‑specific artefacts and encourages a compact, discriminative embedding space.
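A minimal sketch of this dual objective (NumPy, with hypothetical embeddings and detector scores; the margin value m = 0.2 is an assumption, and the paper's loss may differ in details such as per-anchor versus batch-level averaging):

```python
import numpy as np

def margin_contrastive_loss(emb, labels, m=0.2):
    """Hinge loss pushing each anchor's mean positive cosine similarity
    above its mean negative similarity by a margin m (m is assumed)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    per_anchor = []
    for i in range(len(labels)):
        pos = labels == labels[i]
        pos[i] = False                      # exclude the anchor itself
        neg = labels != labels[i]
        if pos.any() and neg.any():
            gap = sim[i][neg].mean() - sim[i][pos].mean() + m
            per_anchor.append(max(0.0, gap))
    return float(np.mean(per_anchor)) if per_anchor else 0.0

def bce(probs, targets, eps=1e-7):
    p = np.clip(probs, eps, 1 - eps)
    return float(-(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean())

rng = np.random.default_rng(1)
emb = rng.normal(size=(8, 16))                  # hypothetical utterance embeddings
labels = np.array([0, 0, 1, 1, 0, 1, 0, 1])     # 0 = real, 1 = fake
probs = rng.uniform(size=8)                     # hypothetical detector outputs

lam_con = 0.05                                  # weighting factor from the paper
total = bce(probs, labels.astype(float)) + lam_con * margin_contrastive_loss(emb, labels)
print(round(total, 4))
```

Once positives are at least m more similar than negatives, the hinge zeroes out and the gradient comes entirely from the BCE term, which is what keeps the regulariser from dominating classification.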
Experiments are conducted on three benchmarks: ASVspoof 2021 LA (in‑domain), ASVspoof 2021 DF (cross‑generation, 100+ spoofing pipelines), and In‑the‑Wild (real‑world recordings). Using XLS‑R‑300M as the feature extractor, 4‑second audio windows, and standard data augmentation (RawBoost), HierCon achieves equal‑error rates (EER) of 2.46 % (LA), 1.93 % (DF), and 6.87 % (ITW). Compared with the XLS‑R + SLS baseline (3.88 %, 2.09 %, 8.87 % respectively), these results correspond to relative improvements of 36.6 % on LA, 7.7 % on DF, and 22.5 % on ITW. Notably, the In‑the‑Wild dataset—known for its severe domain shift—shows one of the largest gains, confirming the method’s robustness to unseen recording conditions and synthesis techniques.
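The reported metric, equal error rate, can be computed from detector scores as follows (a standard sketch, not the official ASVspoof scoring script; the score values are hypothetical, with higher meaning more likely bona fide):

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """EER: operating point where the spoof false-accept rate equals
    the bona fide false-reject rate (higher score = more bona fide)."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])  # spoofs accepted
    frr = np.array([(bona_scores < t).mean() for t in thresholds])    # bona fides rejected
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

bona = np.array([0.9, 0.8, 0.7, 0.6, 0.4])    # hypothetical bona fide scores
spoof = np.array([0.5, 0.3, 0.2, 0.1, 0.05])  # hypothetical spoof scores
print(f"EER = {compute_eer(bona, spoof):.2%}")  # one bona fide below one spoof -> 20%
```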
Ablation studies isolate the contributions of hierarchical attention and contrastive learning. Adding hierarchical attention alone improves LA by 23.5 % relative and ITW marginally, but slightly degrades DF (2.13 % vs 2.09 %). When contrastive learning is incorporated, performance rises across all three datasets, with the biggest boost on ITW (27.7 % relative). This demonstrates that while hierarchical modeling captures richer multi‑level cues, contrastive regularisation is essential for stabilising training and preserving cross‑domain generalisation.
Interpretability analyses visualize attention weights at each stage. Temporal attention consistently focuses on the central 40‑70 % of each utterance, aligning with forensic observations that synthetic artefacts tend to appear during sustained speech rather than at onset or offset. Intra‑group attention shifts emphasis from shallow to deeper layers across groups, reflecting the multi‑level nature of artefacts (spectral distortions in early layers, prosodic irregularities in later layers). Inter‑group attention peaks at mid‑level groups (layers 12‑14), suggesting that the most discriminative evidence lies in the interaction zone between acoustic and semantic representations. Importantly, these patterns remain stable across diverse spoofing methods (vocoder‑based, diffusion‑based, VC), indicating that HierCon learns generic artefact signatures rather than memorising pipeline‑specific fingerprints.
In summary, HierCon makes three key contributions: (1) a three‑stage hierarchical attention architecture that jointly models temporal, intra‑group, and inter‑group layer dependencies; (2) a margin‑based contrastive loss that enforces domain‑invariant embeddings and mitigates over‑fitting; and (3) an interpretable framework that reveals which frames, layers, and groups drive decisions. The combination of these components yields state‑of‑the‑art performance on challenging audio deepfake benchmarks and offers a promising direction for future research, including scaling to larger multimodal models and real‑time deployment.