Artifact-Aware Evaluation for High-Quality Video Generation
With the rapid advancement of video generation techniques, evaluating and auditing generated videos have become increasingly crucial. Existing approaches typically offer coarse video quality scores, lacking detailed localization and categorization of specific artifacts. In this work, we introduce a comprehensive evaluation protocol focusing on three key aspects affecting human perception: Appearance, Motion, and Camera. We define these axes through a taxonomy of 10 prevalent artifact categories reflecting common generative failures observed in video generation. To enable robust artifact detection and categorization, we introduce GenVID, a large-scale dataset of 80k videos generated by various state-of-the-art video generation models, each carefully annotated for the defined artifact categories. Leveraging GenVID, we develop DVAR, a Dense Video Artifact Recognition framework for fine-grained identification and classification of generative artifacts. Extensive experiments show that our approach significantly improves artifact detection accuracy and enables effective filtering of low-quality content.
💡 Research Summary
The paper addresses the growing need for fine‑grained evaluation of AI‑generated videos. Traditional video quality assessment (VQA) metrics such as SSIM or VMAF provide only a single scalar score and fail to capture the complex, often subtle artifacts introduced by modern generative models. Even recent multimodal large language models (MLLMs) that have been adapted for VQA still output coarse‑grained scores, offering little interpretability for downstream moderation or model improvement.
To bridge this gap, the authors propose an artifact‑aware evaluation framework that decomposes video quality into three perceptual axes—Appearance, Motion, and Camera—each populated with four representative artifact categories, yielding a total of ten common failure modes (e.g., blurred visuals, unnatural lighting, flickering motion, unstable camera trajectories). This taxonomy is derived from a large‑scale human study involving 20 raters who labeled 10 k generated clips, ensuring that the selected categories reflect real‑world viewer sensitivities.
A cornerstone of the work is the GenVID dataset, a curated collection of 80 k videos generated by three state‑of‑the‑art video synthesis systems (WAN 2.1, CogVideoX, Open‑Sora). Each video is annotated for the presence or absence of every artifact category. To make the annotations compatible with multimodal model training, the authors convert the multi‑label information into a question‑answer format: “Does this video exhibit {artifact}?”, producing 960 k binary QA pairs. This format enables straightforward fine‑tuning of vision‑language models without requiring custom loss functions.
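The label-to-QA conversion described above is straightforward to sketch. The snippet below is a minimal illustration of the idea, not the paper's actual pipeline; the function name and record format are assumptions, and only artifact names mentioned in the text are used.

```python
def to_qa_pairs(video_id, labels):
    """Convert one video's multi-label annotation into binary QA pairs.

    labels: dict mapping artifact category -> bool (artifact present or not).
    Returns one "Does this video exhibit {artifact}?" question per category.
    """
    pairs = []
    for category, present in labels.items():
        pairs.append({
            "video": video_id,
            "question": f"Does this video exhibit {category}?",
            "answer": "yes" if present else "no",
        })
    return pairs

# Example annotation for one clip (category names taken from the summary).
annotation = {"blurred visuals": True, "flickering motion": False}
for qa in to_qa_pairs("clip_0001.mp4", annotation):
    print(qa["question"], "->", qa["answer"])
```

Because each question is a self-contained text prompt with a yes/no answer, the resulting pairs plug directly into standard vision-language fine-tuning recipes.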
Recognizing that exhaustive frame processing is computationally prohibitive, the paper introduces Flow‑Magnitude‑Guided Dynamic Frame Sampling (FMG‑DFS). The method computes dense optical flow for the entire video, smooths the per‑frame motion magnitude, and selects peaks that likely correspond to temporally localized defects. By sampling a small, fixed number of frames (e.g., 10) around the top‑K peaks, FMG‑DFS retains the most informative temporal segments while dramatically reducing memory and compute requirements. The algorithm also includes post‑processing steps to avoid overlap and to fill any gaps if the total frame budget is not met.
The Dense Video Artifact Recognition (DVAR) system combines FMG‑DFS with a frozen visual encoder and a fine‑tuned multimodal language model (Qwen2.5‑VL). During training, only the language model parameters are updated, allowing the model to learn the mapping from visual features to binary artifact decisions while preserving the rich visual representations learned during pre‑training. At inference time, a video is first passed through FMG‑DFS, the selected frames are encoded, and a textual prompt for each artifact category is fed to the language model, which outputs “yes” or “no”.
Extensive experiments on the GenVID test split demonstrate that DVAR‑Mean‑7B outperforms a range of strong baselines, including GPT‑5, GPT‑4o, LLaVA‑NeXT, VideoChat, and smaller Qwen2.5‑VL variants. DVAR achieves 84.9 % accuracy on Appearance, 78.5 % on Camera, and 76.7 % on Motion, yielding an overall 80.0 % accuracy—substantially higher than the best baseline (≈67 %). Ablation studies confirm that FMG‑DFS contributes roughly 4–5 % absolute gain over random or uniform frame sampling, while increasing model size from 3 B to 7 B parameters yields only marginal improvements (~1 %).
The authors also discuss practical deployment: integrating DVAR as a filtering module in content pipelines successfully suppresses low‑quality outputs while preserving the diversity of high‑quality generations. Limitations are acknowledged: FMG‑DFS may miss artifacts in static scenes, the taxonomy is fixed to ten categories, and freezing the visual encoder could limit detection of ultra‑fine texture defects. Future work is suggested in multi‑modal sampling (e.g., incorporating audio cues), self‑supervised discovery of new artifact types, and fine‑tuning visual backbones for higher resolution detail.
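A filtering stage of the kind described above amounts to thresholding per-video artifact counts. The helper below is a hypothetical sketch of such a gate, not the paper's deployment code; the tolerance parameter is an assumption.

```python
def filter_videos(detections, max_artifacts=0):
    """Keep videos whose detected artifact count is within the tolerance.

    detections: dict of video_id -> dict of category -> bool (detected or not).
    max_artifacts=0 keeps only clips with no detected artifacts at all.
    """
    kept = []
    for video_id, flags in detections.items():
        if sum(flags.values()) <= max_artifacts:
            kept.append(video_id)
    return kept

batch = {
    "a.mp4": {"blurred visuals": False, "flickering motion": False},
    "b.mp4": {"blurred visuals": True, "flickering motion": False},
}
print(filter_videos(batch))  # strict setting keeps only artifact-free clips
```

Raising `max_artifacts` trades strictness for recall, which is how such a gate would preserve diversity among borderline high-quality generations.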
In summary, the paper delivers a comprehensive, human‑centric evaluation protocol, a large‑scale annotated dataset, an efficient motion‑guided sampling strategy, and a dense artifact recognition model that together set a new benchmark for assessing the quality of AI‑generated videos. This work has immediate implications for automated moderation, quality assurance, and iterative improvement of generative video models.