GenVidBench: A 6-Million Benchmark for AI-Generated Video Detection
The rapid advancement of video generation models has made it increasingly challenging to distinguish AI-generated videos from real ones. This issue underscores the urgent need for effective AI-generated video detectors to prevent the dissemination of false information via such videos. However, the development of high-performance AI-generated video detectors is currently impeded by the lack of large-scale, high-quality datasets specifically designed for generative video detection. To this end, we introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages: 1) Large-scale video collection: The dataset contains 6.78 million videos and is currently the largest dataset for AI-generated video detection. 2) Cross-Source and Cross-Generator: The cross-source generation reduces the interference of video content on detection. The cross-generator design ensures diversity in video attributes between the training and test sets, preventing them from being overly similar. 3) State-of-the-Art Video Generators: The dataset includes videos from 11 state-of-the-art AI video generators, ensuring that it covers the latest advancements in the field of video generation. These generators ensure that the dataset is not only large in scale but also diverse, aiding in the development of generalized and effective detection models. Additionally, we present extensive experimental results with advanced video classification models. With GenVidBench, researchers can efficiently develop and evaluate AI-generated video detection models. Datasets and code are available at https://genvidbench.github.io.
💡 Research Summary
The paper addresses the pressing need for robust detection of AI‑generated videos, a problem that has become increasingly urgent as generative video models such as OpenAI’s Sora, MuseV, and others produce content that is virtually indistinguishable from real footage. Existing detection research is hampered by the lack of large‑scale, high‑quality datasets that reflect the diversity of modern video generation pipelines. To fill this gap, the authors introduce GenVidBench, a benchmark comprising 6.78 million video clips—both real and synthetically generated—making it the largest publicly available dataset for AI‑generated video detection to date.
Dataset Construction and Design Principles
GenVidBench is built around two core design principles: Cross‑Source and Cross‑Generator. Real videos are sourced from two distinct repositories, Vript and HD‑VG‑130M, ensuring that the real‑video distribution is not biased toward a single collection. For each real video, the authors generate paired synthetic videos using the same textual prompt or reference image, but with different generative models. The first set (used for training) includes videos from Pika, VideoCraftV2, ModelScope, and T2V‑Zero; the second set (used for testing) contains videos from MuseV, SVD, CogVideo, Mora, Sora, and Kling. By keeping the semantic content identical across the two sets while varying the generation source, the dataset forces detection models to rely on subtle forensic cues rather than obvious content differences.
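The cross-generator split can be pictured as a simple routing rule over clip records. This is an illustrative sketch only: the generator lists come from the summary above, but the function name and the clip-record layout are assumptions, not the authors' actual pipeline code.

```python
# Generators used for the training split vs. the held-out test split,
# per the dataset description; real videos are handled separately.
TRAIN_GENERATORS = {"Pika", "VideoCraftV2", "ModelScope", "T2V-Zero"}
TEST_GENERATORS = {"MuseV", "SVD", "CogVideo", "Mora", "Sora", "Kling"}

def assign_split(clip: dict) -> str:
    """Route a clip to a split by its generator (hypothetical record schema)."""
    gen = clip.get("generator")  # assumed to be None for real videos
    if gen is None:
        return "real"            # real videos are divided into both splits later
    if gen in TRAIN_GENERATORS:
        return "train"
    if gen in TEST_GENERATORS:
        return "test"
    raise ValueError(f"unknown generator: {gen}")
```

Because no generator appears in both sets, any detector evaluated this way is guaranteed to face generation pipelines it never saw during training.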
The synthetic portion spans 11 state‑of‑the‑art generators, covering both Text‑to‑Video (T2V) and Image‑to‑Video (I2V) pipelines. Resolutions range from low‑quality 256 × 256 up to full HD 1920 × 1080, and frame rates vary from 4 FPS to 30 FPS, reflecting the heterogeneity of real‑world video streams. Moreover, each video is annotated with semantic labels along three dimensions: object category (e.g., people, animals, buildings), action (e.g., sitting, moving, observing), and location (e.g., indoor, natural landscape, city). Although only a subset of the 6.78 M videos carries these annotations, the sampling strategy preserves the overall distribution, enabling researchers to conduct fine‑grained, scenario‑specific analyses.
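Scenario-specific analysis over the three semantic dimensions amounts to filtering clips on their labels. The sketch below is a hypothetical illustration: the field names (`labels`, `object`, `action`, `location`) and example records are assumed, not the dataset's published schema.

```python
def filter_clips(clips, *, obj=None, action=None, location=None):
    """Yield annotated clips matching every semantic label that was given."""
    for c in clips:
        labels = c.get("labels")
        if labels is None:                      # only a subset is annotated
            continue
        if obj and labels["object"] != obj:
            continue
        if action and labels["action"] != action:
            continue
        if location and labels["location"] != location:
            continue
        yield c

# Toy records in the assumed schema.
clips = [
    {"id": 1, "labels": {"object": "people", "action": "sitting",
                         "location": "indoor"}},
    {"id": 2, "labels": {"object": "animals", "action": "moving",
                         "location": "natural landscape"}},
    {"id": 3},  # unannotated clip, skipped by the filter
]
indoor_people = [c["id"] for c in filter_clips(clips, obj="people",
                                               location="indoor")]
```

Such a filter makes it straightforward to score a detector separately on, say, indoor people versus static landscapes, the two regimes contrasted in the error analysis below.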
Experimental Evaluation
The authors evaluate several leading video classification and detection architectures, including VideoSwin‑tiny, DeMamba, and UniformerV2. When training and testing on videos generated by the same model, accuracies exceed 97 %, with some subsets (e.g., T2V‑Zero, ModelScope) achieving near‑perfect scores (>99 %). However, the cross‑generator setting reveals a stark performance drop: accuracies fall to the 50‑60 % range, and the worst case (training on Pika, testing on SVD) yields only 54.66 % accuracy. This demonstrates that current models heavily overfit to generator‑specific artifacts and lack the ability to generalize to unseen generation pipelines—a critical limitation for real‑world deployment where the generator is unknown.
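The cross-generator protocol described above boils down to scoring one trained detector separately against each unseen generator's clips. A minimal sketch, assuming a generic `predict` callable and a per-generator test-set layout that are illustrative rather than the authors' evaluation code:

```python
def cross_generator_accuracy(predict, test_sets):
    """Score a trained detector separately on each test generator.

    predict:   callable mapping a clip to 0 (real) or 1 (generated);
               stands in for any of the evaluated architectures.
    test_sets: {generator_name: [(clip, label), ...]} with label in {0, 1}.
    Returns   {generator_name: accuracy}.
    """
    results = {}
    for gen, pairs in test_sets.items():
        correct = sum(predict(clip) == label for clip, label in pairs)
        results[gen] = correct / len(pairs)
    return results
```

Reporting accuracy per test generator, rather than one pooled number, is what exposes the 97 %-to-54.66 % collapse: a pooled score would average away the worst generator pairs.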
A deeper analysis using the semantic labels shows that certain categories are particularly challenging. Clips involving “people” performing “active” actions in “indoor” settings exhibit the highest error rates, likely because complex motion and varied backgrounds obscure forensic traces. Conversely, more static scenes (e.g., natural landscapes with minimal motion) are easier to classify. These findings suggest that incorporating semantic context into detection models could improve robustness.
Scalability Considerations and the 143k Subset
Training on a 6.78 M‑video corpus demands substantial computational resources. To facilitate rapid prototyping, the authors release GenVidBench‑143k, a carefully sampled lightweight subset that retains the original dataset’s statistical properties. Experiments on this subset mirror the trends observed on the full set, confirming its suitability for algorithmic development and hyper‑parameter tuning without incurring prohibitive costs.
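A subset that "retains the original dataset's statistical properties" is naturally built by stratified sampling: group clips by a stratum key and sample each group proportionally. This is a hedged sketch in the spirit of the 143k subset, not the paper's actual recipe; the stratum key and sampling rate are assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(clips, key, rate, seed=0):
    """Sample each stratum (e.g., per generator) at the same rate."""
    rng = random.Random(seed)           # fixed seed for a reproducible subset
    strata = defaultdict(list)
    for c in clips:
        strata[key(c)].append(c)
    subset = []
    for group in strata.values():
        k = max(1, round(len(group) * rate))  # keep every stratum represented
        subset.extend(rng.sample(group, k))
    return subset
```

Stratifying on generator (and, where available, the semantic labels) keeps the subset's per-source proportions aligned with the full corpus, which is why trends measured on the small set can be expected to track the full 6.78 M-video set.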
Contributions and Future Directions
The paper’s contributions are fourfold: (1) the introduction of the largest AI‑generated video detection benchmark, (2) a novel cross‑source and cross‑generator split that rigorously tests generalization, (3) multi‑dimensional semantic annotations that enable scenario‑aware evaluation, and (4) a lightweight 143k‑sample version for efficient experimentation. The authors argue that GenVidBench will become a standard reference for the community, encouraging the development of detection methods that go beyond generator‑specific cues.
Future research avenues highlighted include: (a) integrating multimodal forensic signals such as audio, embedded metadata, and textual subtitles; (b) applying domain adaptation, meta‑learning, or contrastive learning to mitigate generator uncertainty; (c) leveraging the semantic labels for error diagnosis and model interpretability; and (d) continuously expanding the benchmark as new generative models emerge. By providing both the data and the evaluation protocols, GenVidBench sets a solid foundation for the next generation of robust, scalable AI‑generated video detectors.