Dual Frequency Branch Framework with Reconstructed Sliding Windows Attention for AI-Generated Image Detection
The rapid advancement of Generative Adversarial Networks (GANs) and diffusion models has enabled the creation of highly realistic synthetic images, presenting significant societal risks such as misinformation and deception. As a result, detecting AI-generated images has emerged as a critical challenge. Existing methods emphasize extracting fine-grained features to enhance detector generalization, yet they often overlook the importance and interdependencies of internal elements within local regions and are limited to a single frequency domain, hindering the capture of general forgery traces. To overcome these limitations, we first use a sliding window to restrict the attention mechanism to a local window and reconstruct the features within it, modeling the relationships between neighboring internal elements of the local region. We then design a dual frequency-domain branch framework, consisting of the four DWT frequency subbands and the phase component of the FFT, to enrich the extraction of local forgery features from different perspectives. Through the feature enrichment of the dual frequency-domain branches and the fine-grained feature extraction of reconstructed sliding window attention, our method achieves superior generalization on images generated by both GANs and diffusion models. Evaluated on diverse datasets comprising images from 65 distinct generative models, our approach achieves a 2.13% improvement in detection accuracy over state-of-the-art methods.
💡 Research Summary
The paper addresses the pressing problem of detecting highly realistic images generated by modern generative models such as GANs and diffusion models, which pose significant societal risks. Existing detection approaches either focus on coarse global cues or rely on a single frequency domain, and they often overlook the nuanced importance and interdependencies of elements within local image regions. To overcome these limitations, the authors propose a novel framework that combines two complementary innovations: a Reconstructed Sliding Window Attention (RSW Attention) mechanism and a Dual Frequency Branch architecture.
The RSW Attention module first enhances channel interactions using a 1×1 convolution followed by a 3×3 depthwise‑separable convolution. It then applies Discrete Wavelet Transform (DWT) to decompose each feature map into four sub‑bands (LL, LH, HL, HH). These sub‑band features are tiled into a 4 × (H·W) representation, and a 4 × 4 sliding window is moved across this tiled tensor. Within each window, a local self‑attention operation is performed, effectively reconstructing the features while restricting attention to a confined spatial region. This design enables the attention mechanism to assign fine‑grained importance weights to individual pixels and to capture complex dependencies among neighboring elements, which are essential for exposing subtle forgery artifacts. A parallel Global Window MLP layer (GMLP) supplies complementary long‑range context, ensuring that the model does not lose global information while focusing on local details.
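The core of this pipeline, DWT decomposition followed by windowed local attention over the tiled sub-bands, can be sketched roughly as follows. This is a toy NumPy illustration, not the paper's implementation: the Haar filters, the non-overlapping window stride, and the single-head unprojected attention are all simplifying assumptions, and the channel-mixing convolutions and GMLP branch are omitted.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar DWT: returns (LL, LH, HL, HH), each H/2 x W/2."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row-wise low-pass
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row-wise high-pass
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low-low: coarse content
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return LL, LH, HL, HH

def local_window_attention(tokens, window=4):
    """Self-attention restricted to windows of `window` consecutive tokens.

    Simplified: non-overlapping windows, no learned Q/K/V projections.
    """
    n, d = tokens.shape
    out = np.empty_like(tokens)
    for s in range(0, n, window):
        w = tokens[s:s + window]                     # (window, d)
        scores = w @ w.T / np.sqrt(d)                # scaled dot-product
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)      # softmax over the window
        out[s:s + window] = attn @ w
    return out

# Toy single-channel 8x8 feature map
x = np.arange(64, dtype=np.float64).reshape(8, 8)
LL, LH, HL, HH = haar_dwt2(x)

# Tile the four sub-bands: each spatial position becomes a 4-dim token
# (one value per sub-band), giving a (H/2 * W/2) x 4 tensor.
tiled = np.stack([b.ravel() for b in (LL, LH, HL, HH)], axis=1)  # (16, 4)

# A window of 4 tokens covers a 4x4 block of the tiled representation;
# attention is computed only among neighbors inside each window.
mixed = local_window_attention(tiled, window=4)
```

Restricting the softmax to each window keeps the cost linear in the number of tokens while still letting the model weigh how much each sub-band value at one position should influence its immediate neighbors.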
The Dual Frequency Branch enriches the feature space by processing two distinct frequency perspectives in parallel. The first branch utilizes the four DWT sub‑bands, which together encode both low‑frequency (overall color and illumination) and high‑frequency (edges, textures) information. The second branch extracts the phase component of the Fast Fourier Transform (FFT). Through a controlled experiment where the phase of real images was swapped with that of fake images, the authors demonstrated that the phase carries stronger forgery cues than the amplitude, which mainly reflects color and brightness. By concatenating the DWT‑derived and FFT‑phase features and passing them through a shared MLP, the model obtains a comprehensive representation that captures both local high‑frequency anomalies and global structural inconsistencies.
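The phase-swap experiment described above can be reproduced in a few lines: build a hybrid spectrum from one image's amplitude and another's phase, then invert the FFT. This is a minimal NumPy sketch of the controlled experiment, with random arrays standing in for real and fake images; it is not the paper's evaluation code.

```python
import numpy as np

def phase_swap(img_amp, img_phase):
    """Reconstruct an image from img_amp's FFT amplitude and img_phase's FFT phase."""
    F_amp = np.fft.fft2(img_amp)
    F_phase = np.fft.fft2(img_phase)
    hybrid = np.abs(F_amp) * np.exp(1j * np.angle(F_phase))
    return np.real(np.fft.ifft2(hybrid))

rng = np.random.default_rng(0)
real = rng.random((32, 32))   # stand-in for a real photograph
fake = rng.random((32, 32))   # stand-in for a generated image

# Real amplitude combined with fake phase: if a detector now flags this
# hybrid as fake, the forgery cue lives in the phase, not the amplitude.
hybrid = phase_swap(real, fake)
```

Note that swapping an image's phase with its own phase is a no-op (the spectrum is reassembled exactly), which is a handy sanity check that the decomposition into amplitude and phase is lossless.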
The overall architecture integrates the RSW Attention blocks and the Dual Frequency Branch into a Vision‑Transformer‑style backbone, followed by a simple fully‑connected classifier for binary (real vs. fake) prediction. The method was evaluated on an extensive benchmark comprising images generated by 65 distinct models, including StyleGAN, StyleGAN2, BigGAN, ProGAN, various versions of Stable Diffusion, Midjourney, DALL·E 2/3, and many others, as well as authentic photographs. In cross‑model (cross‑domain) testing, the proposed approach outperformed state‑of‑the‑art detectors such as C2P‑CLIP and AIDE by 5.8% and 7.0% absolute accuracy, respectively, and achieved an overall 2.13% improvement in detection accuracy across the entire dataset. Importantly, the model maintained high performance on unseen generative methods, demonstrating strong generalization.
In summary, the paper introduces a powerful combination of locally focused, reconstruction‑aware attention and multi‑frequency feature extraction, which together enable more precise detection of AI‑generated images across a wide variety of generation techniques. The work opens avenues for further research into lightweight real‑time detectors, extension to video and audio forgeries, and integration with broader multimedia authentication pipelines.