A Convolutional LSTM based Residual Network for Deepfake Video Detection
Research Summary
The paper addresses the pressing need for a deepfake video detector that can generalize across multiple manipulation techniques and exploit temporal cues inherent in video data. The authors propose CLRNet, a Convolutional LSTM-based Residual Network, which processes a short sequence of consecutive frames (five frames per sample) rather than isolated images. By employing ConvLSTM cells, the model retains spatial dimensions while learning temporal dependencies through convolutional gate operations, thereby capturing subtle inter-frame inconsistencies, such as sudden brightness shifts, contrast changes, or minor facial-part deformations, that are characteristic of deepfake videos but absent from pristine footage.
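The ConvLSTM idea described above can be sketched in plain NumPy: every gate is computed by convolving both the current frame and the previous hidden state, so the hidden state keeps its spatial layout instead of being flattened. The 3×3 kernels, the single-channel simplification, and the omission of the original formulation's peephole terms are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def conv2d_same(x, k):
    """'Same'-padded 2D cross-correlation of a single-channel map x
    with kernel k, written with explicit loops for clarity."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h_prev, c_prev, W):
    """One ConvLSTM time step on a single-channel H×W frame.
    W maps gate names to (kernel_x, kernel_h, bias) triples.
    Peephole connections are omitted for brevity."""
    def gate(name, act):
        kx, kh_, b = W[name]
        return act(conv2d_same(x, kx) + conv2d_same(h_prev, kh_) + b)
    i = gate("i", sigmoid)      # input gate
    f = gate("f", sigmoid)      # forget gate
    g = gate("c", np.tanh)      # candidate cell state
    o = gate("o", sigmoid)      # output gate
    c = f * c_prev + i * g      # gating is element-wise; weights are convolutional
    h = o * np.tanh(c)          # hidden state keeps its spatial H×W layout
    return h, c

# Run the cell over a five-frame sample, as in the paper's input format.
rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 8, 8))          # five consecutive 8×8 "frames"
W = {g: (rng.standard_normal((3, 3)) * 0.1,
         rng.standard_normal((3, 3)) * 0.1,
         0.0) for g in "ifco"}
h = c = np.zeros((8, 8))
for x in frames:
    h, c = convlstm_step(x, h, c, W)
```

Because the recurrence is convolutional, the final hidden state `h` is still an 8×8 map, which is what lets later residual blocks treat it like any other feature map.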
CLRNet's architecture mirrors the residual design of ResNet but replaces standard convolutional layers with two custom blocks: the CL block (ConvLSTM → Dropout → BatchNorm → ReLU) and the ID block (identical to the CL block but with a direct shortcut connection). Each block produces two outputs: one passes through an addition operation, and the other undergoes batch normalization and ReLU before feeding the next block. This residual wiring mitigates vanishing gradients and enables deeper networks to be trained effectively on video sequences.
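One plausible reading of that two-output wiring is sketched below, with a shape-preserving callable standing in for the ConvLSTM layer, Dropout skipped (as at inference time), and a parameter-free BatchNorm stand-in. The exact ordering inside the authors' blocks may differ; this only illustrates the shortcut-plus-addition pattern the paragraph describes.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Parameter-free normalization stand-in (no learned scale/shift).
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def relu(x):
    return np.maximum(x, 0.0)

def id_block(x, convlstm):
    """ID-block wiring as described in the summary: ConvLSTM →
    (Dropout, omitted at inference) → BatchNorm → ReLU, plus an
    identity shortcut added to the transformed path. Returns the two
    outputs the text mentions: the residual sum, and the BN+ReLU
    output fed to the next block. `convlstm` is any shape-preserving
    callable standing in for the ConvLSTM layer."""
    y = relu(batch_norm(convlstm(x)))
    residual_sum = y + x                           # output routed through the addition
    next_input = relu(batch_norm(residual_sum))    # output feeding the next block
    return residual_sum, next_input

# Usage with a trivial linear stand-in for the ConvLSTM layer.
x = np.random.default_rng(1).standard_normal((4, 4))
residual_sum, next_input = id_block(x, lambda z: 0.5 * z)
```

The key property is that the shortcut path is untouched, so gradients can flow through the addition even if the ConvLSTM path saturates.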
The experimental pipeline uses the FaceForensics++ suite (Pristine, DeepFake, FaceSwap, Face2Face, NeuralTextures) plus the DeepFakeDetection (DFD) dataset. From each video, 16 samples are extracted, each containing five consecutive frames. Faces are detected and aligned with MTCNN, cropped, and resized to 240 × 240 pixels. Data augmentation (brightness, channel shift, zoom, rotation, horizontal flip) expands the training distribution. The authors allocate 750 videos per class for training, 125 for validation, and 125 for testing; DFD uses a reduced subset due to its larger size.
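The 16-samples-of-five-frames extraction reduces to picking clip start indices; a minimal sketch follows. The even spacing of start points is an assumption, since the summary does not specify how starts are chosen.

```python
def sample_clips(num_frames, samples_per_video=16, clip_len=5):
    """Return frame-index lists for `samples_per_video` clips of
    `clip_len` consecutive frames (paper settings: 16 clips of 5
    frames). Evenly spaced starts are an illustrative assumption."""
    max_start = num_frames - clip_len
    if max_start < 0:
        raise ValueError("video too short for one clip")
    starts = [round(i * max_start / max(samples_per_video - 1, 1))
              for i in range(samples_per_video)]
    return [list(range(s, s + clip_len)) for s in starts]

clips = sample_clips(300)   # e.g. a 300-frame video
```

Each returned index list would then be mapped through the MTCNN crop-and-resize step before entering the network.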
A central contribution is the systematic evaluation of three few-shot transfer-learning strategies: (1) single-source → single-target, (2) multi-source → single-target, and (3) single-source → multi-target. In each case, the model is pre-trained on a large source domain and then fine-tuned on as few as ten real and ten fake videos from the target domain. This approach reflects realistic scenarios in which newly emerging deepfake methods lack extensive labeled data.
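The few-shot fine-tuning set described above (ten real and ten fake target-domain videos) amounts to a stratified draw; the sketch below assumes a simple `(video_id, label)` representation and random selection, both of which are illustrative rather than taken from the paper.

```python
import random

def few_shot_subset(target_videos, shots=10, seed=0):
    """Draw `shots` real and `shots` fake videos from the target
    domain for fine-tuning. `target_videos` is a list of
    (video_id, label) pairs with label 'real' or 'fake'."""
    rng = random.Random(seed)
    reals = [v for v in target_videos if v[1] == "real"]
    fakes = [v for v in target_videos if v[1] == "fake"]
    return rng.sample(reals, shots) + rng.sample(fakes, shots)

# Usage: a hypothetical target domain with 50 real and 50 fake videos.
videos = ([(f"real_{i}", "real") for i in range(50)] +
          [(f"fake_{i}", "fake") for i in range(50)])
subset = few_shot_subset(videos)
```

Keeping the draw balanced mirrors the paper's ten-real/ten-fake setup and avoids biasing the fine-tuned decision threshold.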
Results demonstrate that CLRNet outperforms five recent state-of-the-art detectors (including Xception, MesoNet, Two-Stream, ForensicsTransfer, and others) on both in-domain and cross-domain tests. On average, CLRNet achieves a 3–5 percentage-point gain in accuracy over baselines when evaluated on the same dataset. More importantly, its cross-domain performance degrades far less than that of competing methods, confirming the efficacy of the temporal modeling and transfer-learning scheme. Visualizations of frame-difference maps illustrate that deepfake videos exhibit pronounced inter-frame artifacts, which the ConvLSTM layers successfully learn to discriminate.
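A frame-difference map of the kind visualized in the paper is just the per-pixel absolute difference between consecutive frames; the scalar mean score below is a simplification added here for illustration, not a metric from the paper.

```python
import numpy as np

def frame_difference_maps(clip):
    """Absolute inter-frame differences for a (T, H, W) clip.
    Deepfake clips tend to show larger, spatially concentrated
    differences (flicker, local deformations) than pristine ones."""
    clip = np.asarray(clip, dtype=float)
    diffs = np.abs(np.diff(clip, axis=0))   # (T-1, H, W) difference maps
    return diffs, diffs.mean()              # maps plus a crude summary score

# A temporally stable clip versus one that flickers every frame.
stable = np.ones((5, 4, 4))
flicker = np.array([np.full((4, 4), t % 2) for t in range(5)], dtype=float)
_, stable_score = frame_difference_maps(stable)
_, flicker_score = frame_difference_maps(flicker)
```

The ConvLSTM layers effectively learn richer versions of such inter-frame statistics instead of this hand-crafted score.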
The paper also discusses limitations: ConvLSTM incurs higher computational and memory costs than standard CNNs, potentially hindering real-time deployment; the fixed five-frame window may not capture longer-range temporal patterns; and the evaluation is confined to face-centric, heavily compressed videos, leaving open questions about robustness to diverse content, varying compression levels, and background motion. Future work is suggested in three directions: (i) designing lightweight ConvLSTM variants or hybrid attention mechanisms to reduce overhead, (ii) supporting variable-length sequences or hierarchical temporal modeling, and (iii) integrating multimodal cues (audio, textual metadata) for a more comprehensive deepfake detection framework.
In summary, the authors present a well-motivated, technically sound architecture that leverages spatio-temporal modeling and few-shot transfer learning to achieve superior generalization across deepfake generation methods, marking a notable advance in video forensics.