HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework that explicitly models these synergistic HOIs through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments demonstrate that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.

💡 Research Summary

The paper addresses a critical gap in audio deepfake detection (ADD): the overwhelming reliance on local temporal/spectral features or pairwise relationships, which fails to capture the complex, multi‑component interactions that synthetic speech often exhibits. To quantify these interactions, the authors adopt the O‑information framework, a recent information‑theoretic measure that distinguishes redundancy‑dominated systems (positive O‑information) from synergy‑dominated systems (negative O‑information). Empirical analysis suggests that deepfake artifacts are largely synergy‑dominated, meaning that discriminative cues emerge only when several feature dimensions are considered jointly.
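The sign convention can be checked on toy data. The sketch below is not the authors' estimator. It assumes jointly Gaussian variables so that every entropy reduces to a log-determinant, and it evaluates the standard definition Ω(X) = (n − 2)·H(X) + Σ_j [H(X_j) − H(X_{−j})]. The function name and both toy covariance matrices are illustrative assumptions.

```python
import numpy as np

def gaussian_o_information(cov):
    """O-information (in nats) of a Gaussian with covariance matrix `cov`.

    Implements Omega = (n - 2) H(X) + sum_j [H(X_j) - H(X_{-j})].
    The (2*pi*e) constants of the Gaussian entropies cancel out,
    so only log-determinants remain.
    """
    cov = np.asarray(cov, dtype=float)
    n = cov.shape[0]
    _, logdet_all = np.linalg.slogdet(cov)
    total = (n - 2) * logdet_all
    for j in range(n):
        # Covariance of all variables except variable j.
        rest = np.delete(np.delete(cov, j, axis=0), j, axis=1)
        _, logdet_rest = np.linalg.slogdet(rest)
        total += np.log(cov[j, j]) - logdet_rest
    return 0.5 * total

# Toy redundant system (assumption): three noisy copies of one shared signal.
redundant_cov = np.ones((3, 3)) + 0.5 * np.eye(3)

# Toy synergistic system (assumption): x3 = x1 + x2 + small noise,
# with x1 and x2 independent.
synergistic_cov = np.array([[1.0, 0.0, 1.0],
                            [0.0, 1.0, 1.0],
                            [1.0, 1.0, 2.1]])

omega_redundant = gaussian_o_information(redundant_cov)      # positive
omega_synergistic = gaussian_o_information(synergistic_cov)  # negative
```

In the second system no pair (x1, x2) carries shared information, yet the triple is strongly constrained. That is the kind of structure a pairwise model would miss.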

Motivated by this insight, the authors propose HyperPotter, a hypergraph‑based detection framework that explicitly models high‑order synergistic interactions (HOIs). The pipeline begins with raw waveforms processed by a pretrained XLS‑R model and a RawNet2 encoder, yielding two complementary node sets (temporal and spectral). These node embeddings are fed into a hypergraph construction module that uses Fuzzy C‑Means (FCM) clustering to create soft hyperedges. Unlike hard clustering, FCM allows each node to belong to multiple hyperedges with varying membership strengths, naturally representing overlapping high‑order relationships.
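A minimal NumPy sketch of how Fuzzy C‑Means memberships can serve as a soft node‑to‑hyperedge incidence matrix follows. It uses the textbook FCM updates. The fuzzifier `m`, the iteration count, and the two‑blob toy data are assumptions and are not taken from the paper.

```python
import numpy as np

def _memberships(nodes, centroids, m, eps):
    # Squared node-to-centroid distances, shape (N, K).
    d2 = ((nodes[:, None, :] - centroids[None, :, :]) ** 2).sum(-1) + eps
    # Textbook FCM rule: u_ik is proportional to d_ik^(-2 / (m - 1)).
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_c_means(nodes, init_centroids, m=2.0, n_iter=20, eps=1e-9):
    """Return (membership, centroids).

    membership[i, k] is the degree to which node i belongs to hyperedge k;
    each row sums to 1, so a node can sit in several hyperedges at once.
    """
    nodes = np.asarray(nodes, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        u = _memberships(nodes, centroids, m, eps)
        w = u ** m
        # Each centroid is the mean of all nodes weighted by u_ik^m.
        centroids = (w.T @ nodes) / w.sum(axis=0)[:, None]
    return _memberships(nodes, centroids, m, eps), centroids

# Toy node embeddings (assumption): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.5, size=(20, 2))
blob_b = rng.normal(5.0, 0.5, size=(20, 2))
nodes = np.vstack([blob_a, blob_b])

# Seed one centroid inside each blob.
membership, centroids = fuzzy_c_means(nodes, nodes[[0, -1]])
```

The `membership` matrix plays the role of a weighted incidence matrix. A hard clustering would instead give exactly one nonzero entry per row.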

A key novelty is the class‑aware prototype‑guided hyperedge initialization. For each class (real vs. fake), class prototypes are extracted from a memory bank of embeddings and used to initialize the centroids of the hyperedges. This ensures that hyperedges are semantically meaningful and discriminative from the start, reducing the instability often observed in unsupervised hypergraph construction.
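One way such an initialization could look is sketched below. The summary does not say how prototypes are extracted from the memory bank, so the per‑class k‑means refinement, the `per_class` parameter, and the function name are all assumptions chosen for illustration.

```python
import numpy as np

def class_prototypes(bank, labels, per_class=2, n_iter=10):
    """Stack `per_class` prototypes for every class present in `labels`."""
    bank = np.asarray(bank, dtype=float)
    labels = np.asarray(labels)
    prototypes = []
    for c in np.unique(labels):
        members = bank[labels == c]
        # Deterministic seeds: evenly spaced members of this class.
        idx = np.linspace(0, len(members) - 1, per_class).astype(int)
        centers = members[idx].copy()
        # Plain Lloyd (k-means) refinement restricted to this class.
        for _ in range(n_iter):
            d2 = ((members[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(axis=1)
            for k in range(per_class):
                if np.any(assign == k):
                    centers[k] = members[assign == k].mean(axis=0)
        prototypes.append(centers)
    return np.vstack(prototypes)

# Toy memory bank (assumption): label 0 = real, label 1 = fake.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(50, 8))
fake = rng.normal(3.0, 1.0, size=(50, 8))
bank = np.vstack([real, fake])
labels = np.array([0] * 50 + [1] * 50)

# Rows 0-1 are "real" prototypes, rows 2-3 are "fake" prototypes.
prototypes = class_prototypes(bank, labels, per_class=2)
```

The stacked prototypes can then be passed as initial centroids to the clustering step, so that each hyperedge starts out tied to one class.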

The hypergraph is then processed by a Memory‑Enhanced Hypergraph Attention GNN (HA‑GNN). The HA‑GNN performs node‑to‑hyperedge aggregation, where each hyperedge representation is a weighted sum of its member nodes. The weights are a product of the soft membership values from FCM and an attention score learned via a relational artifact amplification module. This module explicitly boosts features that exhibit synergistic patterns, while suppressing redundant or noisy components. After hyperedge representations are formed, a hyperedge‑to‑node message passing step updates node embeddings, allowing high‑order relational information to flow back into the node space. The final graph‑level representation is obtained by a read‑out operation and fed to a binary classifier.
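The two message‑passing directions can be sketched in a few lines of NumPy. This is a simplified stand‑in, not the paper's HA‑GNN. The single attention vector `w_att` replaces the relational artifact amplification module, and `w_out`, the residual connection, and the `tanh` nonlinearity are assumptions.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hypergraph_attention_layer(nodes, membership, w_att, w_out):
    """One node -> hyperedge -> node round of message passing.

    nodes:      (N, D) node embeddings
    membership: (N, K) soft incidence, e.g. FCM memberships
    w_att:      (D,)   hypothetical attention vector scoring each node
    w_out:      (D, D) hypothetical output projection
    """
    # Node -> hyperedge: weight = soft membership x attention score.
    att = softmax(nodes @ w_att, axis=0)                     # (N,)
    weights = membership * att[:, None]                      # (N, K)
    weights = weights / weights.sum(axis=0, keepdims=True)   # per-hyperedge normalisation
    edges = weights.T @ nodes                                # (K, D)

    # Hyperedge -> node: each node gathers from the hyperedges it belongs to.
    back = membership / membership.sum(axis=1, keepdims=True)
    messages = back @ edges                                  # (N, D)
    return np.tanh((nodes + messages) @ w_out), edges

# Toy inputs (assumption): 6 nodes, 4 feature dimensions, 3 hyperedges.
rng = np.random.default_rng(0)
nodes = rng.normal(size=(6, 4))
membership = softmax(rng.normal(size=(6, 3)), axis=1)
w_att = rng.normal(size=4)
w_out = rng.normal(size=(4, 4))

updated, edges = hypergraph_attention_layer(nodes, membership, w_att, w_out)
graph_embedding = updated.mean(axis=0)  # simple mean read-out
```

Because the node‑to‑hyperedge weights in each column are non‑negative and sum to 1, every hyperedge feature is a convex combination of node features.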

Experiments were conducted on 11 datasets and, for cross‑domain evaluation, on 4 challenging datasets that vary in language, speaker identity, and synthesis algorithm (e.g., Voice‑Clone, WaveGlow). HyperPotter is compared against strong baselines such as RawNet2‑GAT‑ST, AASIST, ViHGNN, and LHGNN. According to the abstract, it surpasses its own baseline by an average relative gain of 22.15% across the 11 datasets and outperforms state‑of‑the‑art methods by 13.96% on the 4 cross‑domain benchmarks. Ablation studies confirmed that both the prototype‑guided hyperedge initialization and the relational artifact amplification contribute significantly to the gains.
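For readers unfamiliar with the term, an "average relative gain" is usually computed per dataset and then averaged. The sketch below assumes a lower‑is‑better error metric such as EER, which is an assumption because the abstract does not name the metric. The numbers are made‑up placeholders, not results from the paper.

```python
def average_relative_gain(baseline_errors, model_errors):
    """Mean over datasets of (baseline - model) / baseline, in percent."""
    gains = [(b - m) / b for b, m in zip(baseline_errors, model_errors)]
    return 100.0 * sum(gains) / len(gains)

# Hypothetical error rates (percent) on three made-up datasets.
# These values are placeholders and do NOT come from the paper.
baseline = [4.0, 10.0, 20.0]
model = [3.0, 8.0, 17.0]

gain = average_relative_gain(baseline, model)  # per-dataset gains: 25%, 20%, 15%
```

Averaging per‑dataset relative gains gives every dataset equal weight. Computing a single relative gain from the averaged error rates would instead be dominated by the hardest datasets.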

The authors also analyze computational complexity. FCM clustering requires O(N·K·I) node‑to‑centroid distance computations (N nodes, K hyperedges, I iterations), the same order as conventional K‑Means, which is acceptable for offline training. The HA‑GNN adds only a modest overhead relative to standard graph attention networks because the message‑passing steps are linear in the number of hyperedges and nodes.

In conclusion, HyperPotter demonstrates that explicitly modeling high‑order synergistic interactions via hypergraphs improves the robustness and generalization of audio deepfake detectors, as reflected in the reported gains over its baseline and over state‑of‑the‑art methods on cross‑domain data. By moving beyond pairwise models, the framework captures subtle, multi‑dimensional artifacts that pairwise relations alone would miss. The paper suggests several directions for future work, including extending hypergraph modeling to multimodal (audio‑visual) deepfake detection, developing lightweight online variants for real‑time streaming, and exploring dynamic prototype updating mechanisms to adapt to evolving synthesis techniques.
