WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection


Front-end design for speech deepfake detectors falls into two main categories. Hand-crafted filterbank features are transparent but limited in capturing high-level semantic detail, often leaving a performance gap relative to self-supervised learning (SSL) features. SSL features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), which integrates wavelets with nonlinearities in a manner analogous to deep convolutional networks. We investigate 1D and 2D WSTs to extract acoustic details and higher-order structural anomalies, respectively. Experimental results on the recent and challenging Deepfake-Eval-2024 dataset indicate that WST-X outperforms existing front-ends by a wide margin. Our analysis reveals that a small averaging scale ($J$), combined with high frequency and directional resolutions ($Q$, $L$), is critical for capturing subtle artifacts. This underscores the value of translation-invariant and deformation-stable features for robust and interpretable speech deepfake detection.


💡 Research Summary

This paper addresses a fundamental trade-off in speech deepfake detection (SDD) systems: hand-crafted DSP front-ends (e.g., mel-spectrograms, LFCC, CQCC) are transparent and computationally cheap but miss high-level semantic cues, while self-supervised learning (SSL) models such as XLSR-300M, HuBERT, and MMS provide powerful representations at the cost of interpretability and heavy compute. To bridge this gap, the authors introduce the WST-X series, a family of front-ends built on the Wavelet Scattering Transform (WST). WST combines multi-scale complex Morlet wavelets with a modulus nonlinearity, yielding representations that are provably translation-invariant, stable to small deformations, and hierarchical (zeroth-, first-, and higher-order coefficients).
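For reference, the scattering coefficients can be written in standard Mallat-style notation, which is general to the transform rather than specific to this paper: $\phi_J$ is a low-pass averaging filter of scale $2^J$, $\psi_\lambda$ are band-pass wavelets, and each order recovers high-frequency detail discarded by the averaging at the previous order.

```latex
S_0 x(t) = (x \ast \phi_J)(t)
S_1 x(t, \lambda_1) = \big( \lvert x \ast \psi_{\lambda_1} \rvert \ast \phi_J \big)(t)
S_2 x(t, \lambda_1, \lambda_2)
  = \Big( \big\lvert \lvert x \ast \psi_{\lambda_1} \rvert \ast \psi_{\lambda_2} \big\rvert \ast \phi_J \Big)(t)
```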

Two variants are explored. WST‑X1 uses a parallel architecture: a 1‑dimensional WST applied directly to the raw waveform (controlled by averaging scale J, wavelets‑per‑octave Q, and scattering order M) runs alongside a prompt‑tuned XLSR‑300M (PT‑XLSR). After global average pooling and linear projection, the two streams are concatenated, producing a fused feature vector. WST‑X2 adopts a cascaded design: the same PT‑XLSR extracts latent feature maps, which are treated as a 2‑D “image” (time × feature) and fed into a 2‑D WST. The 2‑D transform is governed by averaging scale J, angular resolution L (number of orientations), and order M. The resulting scattering tensor is flattened and linearly projected before classification.
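The following minimal sketch illustrates both designs using the open-source kymatio library and PyTorch. It is an illustration under stated assumptions, not the authors' implementation: the prompt-tuned XLSR-300M is replaced by a hypothetical `DummySSL` stub, and the feature-map shape, frame length, and projection width are illustrative choices. `nn.LazyLinear` is used so the scattering channel count need not be computed by hand.

```python
# Sketch of the WST-X1 (parallel) and WST-X2 (cascaded) feature paths.
# Assumes kymatio >= 0.3 with the PyTorch backend; `DummySSL` is a
# hypothetical stand-in for the paper's prompt-tuned XLSR-300M (PT-XLSR).
import torch
import torch.nn as nn
from kymatio.torch import Scattering1D, Scattering2D

N_SAMPLES = 64_000                                   # 4 s of 16 kHz audio

class DummySSL(nn.Module):
    """Placeholder SSL encoder: waveform -> (batch, time, feature) map."""
    def __init__(self, frame=320, dim=1024):
        super().__init__()
        self.frame = frame
        self.proj = nn.Linear(frame, dim)
    def forward(self, wav):                          # (batch, N_SAMPLES)
        frames = wav.unfold(1, self.frame, self.frame)  # (batch, 200, 320)
        return self.proj(frames)                     # (batch, 200, 1024)

class WSTX1(nn.Module):
    """Parallel design: 1-D WST on the raw waveform next to the SSL stream;
    both are average-pooled, linearly projected, and concatenated."""
    def __init__(self, ssl, J=2, Q=10, proj_dim=256):
        super().__init__()
        self.ssl = ssl
        self.wst = Scattering1D(J=J, shape=N_SAMPLES, Q=Q, max_order=2)
        self.proj_wst = nn.LazyLinear(proj_dim)      # channel count inferred lazily
        self.proj_ssl = nn.LazyLinear(proj_dim)
    def forward(self, wav):
        s = self.wst(wav).mean(dim=-1)               # pool over time: (batch, C)
        e = self.ssl(wav).mean(dim=1)                # pool over frames: (batch, D)
        return torch.cat([self.proj_wst(s), self.proj_ssl(e)], dim=-1)

class WSTX2(nn.Module):
    """Cascaded design: the SSL feature map is treated as a 2-D image
    (time x feature) and passed through a 2-D WST, then flattened."""
    def __init__(self, ssl, feat_shape=(200, 1024), J=2, L=10, proj_dim=256):
        super().__init__()
        self.ssl = ssl
        self.wst2d = Scattering2D(J=J, shape=feat_shape, L=L, max_order=2)
        self.proj = nn.LazyLinear(proj_dim)
    def forward(self, wav):
        f = self.ssl(wav)                            # (batch, 200, 1024)
        s = self.wst2d(f)                            # (batch, C, 200/2**J, 1024/2**J)
        return self.proj(s.flatten(start_dim=1))

wav = torch.randn(2, N_SAMPLES)                      # two random 4 s "utterances"
ssl = DummySSL()
print(WSTX1(ssl)(wav).shape)                         # torch.Size([2, 512])
print(WSTX2(ssl)(wav).shape)                         # torch.Size([2, 256])
```

Note how the two designs differ structurally: WST-X1 fuses two independently pooled streams, whereas WST-X2 lets the 2-D scattering operate on the joint time-feature geometry of the SSL map before any pooling or projection.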

Experiments are conducted on the recent Deepfake‑Eval‑2024 (DE2024) dataset, a large‑scale, multilingual, in‑the‑wild collection (≈56 h, 80+ web sources, 40 languages). Audio is resampled to 16 kHz, split into non‑overlapping 4‑second chunks (≈50k files), and evaluated using minDCF (the primary metric), EER, F1, and AUC, with 1,000 bootstrap repetitions for confidence intervals.
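A minimal sketch of that preprocessing step, assuming torchaudio; the mono downmix and the dropping of the trailing remainder are our assumptions, not details stated in the summary:

```python
# Resample to 16 kHz and cut into non-overlapping 4-second chunks.
import torchaudio
import torchaudio.functional as F

TARGET_SR = 16_000
CHUNK = 4 * TARGET_SR                         # 64,000 samples per chunk

def load_chunks(path):
    wav, sr = torchaudio.load(path)           # (channels, samples)
    wav = wav.mean(dim=0)                     # downmix to mono (assumption)
    if sr != TARGET_SR:
        wav = F.resample(wav, orig_freq=sr, new_freq=TARGET_SR)
    n = wav.numel() // CHUNK                  # whole chunks only; drop remainder
    return wav[: n * CHUNK].reshape(n, CHUNK)

chunks = load_chunks("some_clip.wav")         # hypothetical input file
```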

Parameter sweeps reveal that a small averaging scale (J = 2) is crucial: larger J values progressively degrade performance because high‑frequency details are overly smoothed. For the 1‑D WST‑X1, the optimal setting is J = 2, Q = 10, M = 2, achieving minDCF = 0.3408, EER = 14.18%, F1 = 81.66%, and AUC = 92.50%, outperforming traditional mel, linear, and constant‑Q filterbanks by 6–8% absolute EER. For the 2‑D WST‑X2, the best configuration is J = 2, L = 10, M = 2, yielding minDCF = 0.3567, EER = 14.84%, F1 = 81.83%, and AUC = 92.43%. The high angular resolution captures directional structures (horizontal harmonics, slanted formants, vertical onsets) that are indicative of synthetic artifacts.
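A back-of-the-envelope reading of why small J matters, assuming the common convention (kymatio's default) that the low-pass filter $\phi_J$ averages over $2^J$ samples:

```latex
T_{\mathrm{avg}} = \frac{2^J}{f_s}, \qquad
J = 2 \;\Rightarrow\; \frac{4}{16\,000\ \mathrm{Hz}} = 0.25\ \mathrm{ms}, \qquad
J = 8 \;\Rightarrow\; \frac{256}{16\,000\ \mathrm{Hz}} = 16\ \mathrm{ms}.
```

Under this convention, each increment of J doubles the smoothing window, so large J washes out exactly the millisecond-scale, high-frequency artifacts the sweep identifies as discriminative.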

Visualization (Fig. 3) shows that first‑ and second‑order scattering coefficients highlight subtle spectral anomalies in fake utterances that are barely visible in conventional spectrograms, providing human‑readable cues for forensic analysts. The parallel WST‑X1 is computationally lighter and easier to integrate, while the cascaded WST‑X2 leverages the structural relationships within SSL embeddings for richer time‑frequency interaction modeling.

Overall, the WST‑X series delivers a threefold advantage: (1) mathematically grounded invariance and stability, (2) high‑resolution frequency and directional information, and (3) synergy with SSL semantic embeddings. Compared to pure DSP front‑ends, it reduces EER by roughly 6–8% and improves AUC by 2–3%; compared to SSL alone, it adds interpretability without substantial extra cost. The authors suggest future work on broader multilingual/generalization studies, real‑time lightweight implementations, and extensions to other deformation‑stable transforms (e.g., graph scattering).

In conclusion, the paper presents a novel, interpretable, and high‑performing front‑end for speech deepfake detection, establishing wavelet scattering as a powerful bridge between transparent signal processing and data‑driven representation learning.

