How do people watch AI-generated videos of physical scenes?
The growing prevalence of realistic AI-generated videos on media platforms increasingly blurs the line between fact and fiction, eroding public trust. Understanding how people watch AI-generated videos offers a human-centered perspective for improving AI detection and guiding advances in video generation. However, existing studies have not investigated human gaze behavior in response to AI-generated videos of physical scenes. Here, we collect and analyze eye movements from 40 participants during video understanding and AI detection tasks involving a mix of real-world and AI-generated videos. We find that, given the high realism of AI-generated videos, gaze behavior is driven less by a video’s actual authenticity than by the viewer’s perception of its authenticity. Our results demonstrate that the mere awareness of potential AI generation may transform media consumption from passive viewing into an active search for anomalies.
💡 Research Summary
The paper investigates how people watch and evaluate AI‑generated videos of physical scenes by recording eye‑movement data from 40 participants. Two video sets were used: (1) physics‑related clips drawn from the Physics‑IQ dataset and (2) professionally edited stock videos (nature, wildlife, food, sports). Each set paired 20 real‑world videos with 20 AI‑generated counterparts produced by the latest text‑to‑video model at the time (Google VEO 3.1, October 2025) from identical prompts and keyframes; all clips were trimmed to five seconds and standardized to 960 × 720 px.
The experiment comprised two tasks. In the video‑understanding task, participants watched 20 videos (half real, half AI) and described each clip in a single sentence, mimicking ordinary online viewing. In the AI‑detection task, participants watched another 20 videos and indicated whether each was real or AI‑generated, additionally naming the visual cues that led to a “fake” judgment. Eye tracking was performed with a Gazepoint GP3 (60 Hz) system; the extracted metrics included fixation count, fixation duration, saccade magnitude, scan‑path length, and mean pupil diameter (MPD). After the tasks, participants completed a questionnaire about demographics, prior experience with generative video tools, confidence in detection, and the strategies they employed (logical rule‑based vs. intuitive).
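As a rough illustration of how such metrics are derived, the summary statistics listed above can all be computed from a sequence of fixations. The record schema below (`x`, `y`, `duration`, `pupil`) is invented for the sketch and is not the paper's data format:

```python
import math

def gaze_metrics(fixations):
    """Summarize a fixation sequence into the gaze metrics used in the study.

    Each fixation is a dict with 'x', 'y' (pixels), 'duration' (seconds),
    and 'pupil' (mm). These field names are illustrative assumptions.
    """
    n = len(fixations)
    # Saccade magnitude: Euclidean distance between consecutive fixations.
    saccades = [
        math.hypot(b["x"] - a["x"], b["y"] - a["y"])
        for a, b in zip(fixations, fixations[1:])
    ]
    return {
        "fixation_count": n,
        "mean_fixation_duration": sum(f["duration"] for f in fixations) / n,
        "mean_saccade_magnitude": sum(saccades) / len(saccades),
        "scanpath_length": sum(saccades),  # total distance traversed
        "mean_pupil_diameter": sum(f["pupil"] for f in fixations) / n,
    }
```

Under this scheme, the H1 comparisons below amount to contrasting these per-trial summaries between the understanding and detection tasks.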
A total of 21,379 fixations and 1,573 scan‑paths were recorded (27 scan‑paths were lost to hardware glitches). During the AI‑detection task, participants made 800 judgments (796 with accompanying eye‑tracking data). The authors tested three hypotheses:
- H1 – Task‑dependent gaze behavior: When participants knew they might encounter AI‑generated content, their eye‑movement patterns changed. Compared with the understanding task, the detection task showed a modest, non‑significant increase in fixation count (+13 %, p = 0.13) and a significant reduction in fixation duration (p < 0.01). Saccade magnitudes and scan‑path lengths were larger (p < 0.05), indicating broader, quicker sampling of the visual field. Pupil diameter was slightly smaller (p < 0.01), suggesting lower cognitive load for anomaly hunting than for narrative comprehension.
- H2 – Real vs. AI video differences: Contrary to expectations, there were no significant differences in any eye‑tracking metric between genuine and AI‑generated videos. The realism of current generative models is sufficient to make the visual system treat both types similarly. However, participants’ subjective judgment (whether they believed a video was real) did affect gaze: videos judged as “real” elicited more concentrated fixations, whereas “fake” judgments produced more dispersed scanning.
- H3 – Strategy‑dependent gaze behavior: Post‑experiment self‑reports revealed two strategy groups. Participants who reported a logical, rule‑based approach (e.g., checking for physics violations or texture inconsistencies) displayed lower spatial variance in fixations, more consistent scan‑paths, and higher detection accuracy (mean ≈ 71 %). Those relying on intuition showed higher fixation dispersion and lower accuracy (mean ≈ 58 %).
Performance also varied by video set. Detection accuracy was higher for physics videos (70.8 % ± 24.3) than for professional stock videos (62.0 % ± 23.9), likely because physical scenarios have clearer normative constraints that expose generative artefacts.
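The accuracy breakdowns above (by strategy and by video set) reduce to grouping binary judgments against ground-truth labels. A minimal sketch, with an invented record format rather than the released dataset's schema:

```python
from collections import defaultdict

def accuracy_by_group(judgments):
    """Compute detection accuracy per group (e.g., strategy or video set).

    `judgments` is a list of (group, predicted_is_ai, actual_is_ai)
    tuples -- an illustrative format, not the paper's data layout.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in judgments:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {g: correct[g] / total[g] for g in total}
```

The same aggregation applied with video set as the group would yield the 70.8 % vs. 62.0 % contrast reported above.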
The authors interpret these findings as evidence of a cognitive mode shift: awareness of possible AI manipulation triggers a “search‑for‑anomalies” mode, characterized by rapid, wide‑area sampling rather than deep processing of any single region. This mode is reflected in the eye‑tracking signatures identified. The lack of real‑vs‑AI differences underscores the challenge of detecting modern AI video content based solely on low‑level visual cues.
Importantly, the study proposes that eye‑movement data can serve as a human‑centric metric for AI‑generated video evaluation. Real‑time heat‑maps or saccade‑based alerts could be integrated into user interfaces to assist viewers in spotting inconsistencies. By releasing the full dataset (eye‑tracking logs, video stimuli, and judgment labels) on GitHub, the authors aim to foster further research on multimodal trustworthiness, human‑AI collaborative detection, and the design of assistive tools.
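A real-time heat-map overlay of the kind proposed could start from duration-weighted binning of fixation coordinates. The following is a toy stand-in under assumed inputs (plain `(x, y, duration)` tuples and a fixed grid), not the authors' implementation:

```python
def fixation_heatmap(fixations, width, height, bins=20):
    """Bin fixations into a coarse grid, weighting each cell by dwell time.

    Cells with high accumulated duration mark where attention concentrated;
    an interface could render this grid as a translucent overlay.
    """
    grid = [[0.0] * bins for _ in range(bins)]
    for x, y, duration in fixations:
        col = min(int(x / width * bins), bins - 1)
        row = min(int(y / height * bins), bins - 1)
        grid[row][col] += duration
    return grid
```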
Limitations include the short 5‑second clip duration, a relatively homogeneous participant pool, and reliance on a single generative model. Future work should explore longer narrative videos, broader demographic samples, and comparisons across multiple state‑of‑the‑art text‑to‑video systems (e.g., Sora, upcoming diffusion‑based models).
In summary, the paper demonstrates that while the visual realism of AI‑generated videos now rivals that of real footage, people’s gaze behavior is driven more by their perceived authenticity and the task at hand than by the actual source. Logical detection strategies lead to more systematic eye movements and better performance, suggesting that training users in explicit evaluation heuristics could improve media literacy in an era of increasingly convincing synthetic video.