Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
📝 Abstract
Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, this progress motivates us to ask: can video models reason via video generation? Compared with a discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, making it an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench, a comprehensive benchmark designed to systematically evaluate the reasoning capabilities of video models. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that supervised fine-tuning (SFT) can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10–20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
📄 Content
With the rapid development of diffusion-based and autoregressive generative architectures, video models have witnessed tremendous success in high-fidelity video generation. Previous works, such as Stable Video Diffusion [3] and Imagen Video [15], showcase the capability of video models to generate physically realistic and temporally consistent videos conditioned on their input instructions. Recent studies further reveal that advanced video models are capable of performing a diverse range of visual tasks beyond generation itself, including perception, understanding, and even reasoning. These findings suggest that video models are evolving from pure generative models into general-purpose visual intelligence models. Analogous to the evolution of language models from text generation to text-based reasoning, the development of video models leads to a question: "Can video models reason via video generation?"
Crucially, the spatiotemporal nature of the video modality offers a new perspective on reasoning. The traditional paradigm, which we term reasoning via text, uses language as the medium for expressing intermediate reasoning steps. Representative work, such as Chain-of-Thought prompting [39,44,49-52,54], achieves this by eliciting large language models (LLMs) to generate a coherent textual reasoning chain. Recently, this reasoning via text paradigm has been introduced to visual domains, including multimodal question answering and video understanding. However, even in these multimodal settings, current paradigms still express reasoning through textual continuation rather than visual or physical dynamics. In contrast, video represents reasoning as a process of visual continuation over time. Each frame builds upon its predecessors, capturing the dynamics of motion, spatial consistency, and temporal causality within 2D and 3D space. The continuous and structured nature of frames makes video an ideal substrate for multimodal reasoning. Building on this insight, we propose reasoning via video, where reasoning emerges through next-frame generation rather than next-token prediction.
However, a comprehensive testbed for reasoning via video is lacking. To this end, we introduce VR-Bench, a dedicated benchmark designed to systematically assess the reasoning capabilities of video generation models. As shown in Figure 1, we ground our benchmark in the maze-solving task, a natural fit for visual reasoning due to its open-ended solution space and rich trajectory-based supervision. Each instance inherently demands spatial planning, dynamic tracking, and multi-step reasoning, making it an ideal testbed for evaluating model inference quality over time. Our dataset comprises 7,920 procedurally generated maze-centric videos, each paired with a corresponding Trace Reasoning Task that requires models to infer the optimal path. To ensure broad generalizability and challenge model robustness, VR-Bench spans five distinct maze types (Regular Maze, Irregular Maze, 3D Maze, Sokoban, and Trapfield), covering a wide spectrum of spatial structures and decision patterns. Additionally, each maze is rendered in diverse visual styles across more than a dozen themes, enabling fine-grained analysis of how well models generalize across varied visual domains and increasing the realism and complexity of the reasoning tasks.
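To make the procedural-generation idea concrete, here is a minimal sketch of how a regular maze can be generated with the classic recursive-backtracking (depth-first search) algorithm. This is an illustrative stand-in, not the authors' actual pipeline: the function name, grid representation, and seeding scheme are assumptions for the example.

```python
import random

def generate_maze(width, height, seed=None):
    """Carve a perfect maze on a width x height cell grid via
    depth-first search with backtracking. Returns, for each cell,
    the set of directions with an open passage."""
    rng = random.Random(seed)
    passages = {(x, y): set() for x in range(width) for y in range(height)}
    moves = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
    opposite = {"N": "S", "S": "N", "E": "W", "W": "E"}

    stack = [(0, 0)]
    visited = {(0, 0)}
    while stack:
        x, y = stack[-1]
        # Collect unvisited in-grid neighbours of the current cell.
        options = [(d, (x + dx, y + dy)) for d, (dx, dy) in moves.items()
                   if (x + dx, y + dy) in passages
                   and (x + dx, y + dy) not in visited]
        if not options:
            stack.pop()  # dead end: backtrack
            continue
        d, nxt = rng.choice(options)
        passages[(x, y)].add(d)          # carve the passage both ways
        passages[nxt].add(opposite[d])
        visited.add(nxt)
        stack.append(nxt)
    return passages

maze = generate_maze(8, 8, seed=0)
# A perfect maze is a spanning tree: (cells - 1) passages in total.
total_passages = sum(len(v) for v in maze.values()) // 2
print(total_passages)  # 63 for an 8x8 grid
```

Because the result is a spanning tree of the grid, every pair of cells is connected by exactly one path, which is what gives each maze instance a well-defined ground-truth trajectory for supervision.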
Building upon the proposed VR-Bench, we conduct a systematic study of the reasoning via video paradigm. We construct instruction-following datasets derived from VR-Bench to elicit the reasoning capability of open-source video models. After supervised fine-tuning (SFT), these models exhibit significant performance gains across all reasoning tasks in VR-Bench. Moreover, SFT endows video models with strong out-of-domain generalization under diverse distribution shifts, including task difficulty, background style, and task type. Compared with vision-language models (VLMs) [2,5,6,19,22] that reason via text, video models consistently outperform their counterparts on high-complexity reasoning tasks, showing greater stability and even superior performance as task difficulty increases, across diverse scenarios and tasks. This finding confirms that video serves as a more expressive substrate for spatial reasoning, enabling video models to leverage temporal continuity and dynamic visual context. Interestingly, we further observe that video models exhibit a test-time scaling effect analogous to that of LLMs: as the inference budget increases, their performance improves substantially. By employing diverse sampling strategies at test time, video models effectively explore multiple reasoning trajectories, reducing uncertainty and achieving an average performance gain of 10-20%. These empirical results highlight the unique potential and scalability of the reasoning via video paradigm.
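The test-time scaling effect described above can be sketched as a simple best-of-N procedure: draw several diverse samples and keep one that passes a task-level validity check (for mazes, whether the traced path actually reaches the goal). The names `sample_trajectory` and `is_valid` below are hypothetical stand-ins for video generation and trajectory verification; this is a toy model of the mechanism, not the paper's implementation.

```python
import random

def best_of_n(sample_trajectory, is_valid, n=8, seed=0):
    """Best-of-N test-time scaling: draw up to n diverse samples and
    return the first that passes a validity check, else None."""
    rng = random.Random(seed)
    for _ in range(n):
        traj = sample_trajectory(rng)
        if is_valid(traj):
            return traj
    return None  # no sample solved the task within the budget

# Toy illustration: if each independent sample succeeds with
# probability p, best-of-N success rises to 1 - (1 - p)**n.
p = 0.5
sample = lambda rng: rng.random() < p   # stands in for one generated video
valid = lambda traj: traj               # stands in for path verification
wins = sum(best_of_n(sample, valid, n=8, seed=s) is True
           for s in range(200))
print(wins / 200)  # close to 1 - 0.5**8, i.e. about 0.996
```

The key design point is that maze solving is cheaply verifiable: a generated trajectory either reaches the goal or it does not, so extra inference budget converts directly into reliability.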
Our contributions are summarized as follows: • We make an early and systematic exploration of the reasoning via video paradigm, where reasoning emerges through next-frame generation rather than next-token prediction.