Demystifying Video Reasoning

Ruisi Wang1, Zhongang Cai†,1, Fanyi Pu1,2, Junxiang Xu1, Wanqi Yin1, Maijunxian Wang3, Ran Ji4, Chenyang Gu1, Bo Li2, Ziqi Huang2, Hokin Deng5, Dahua Lin1, Ziwei Liu2, Lei Yang†,1
† Corresponding Author
1 SenseTime Research  2 Nanyang Technological University  3 University of California, Berkeley  4 University of California, San Diego  5 Carnegie Mellon University

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations.
Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

Homepage: https://www.wruisi.com/demystifying_video_reasoning

1 Introduction

Video generation models have transformed the landscape of the movie, gaming, and entertainment industries. However, most research has focused primarily on their ability to produce high-fidelity, realistic, and visually appealing videos. Recent advances have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities in spatiotemporally consistent visual environments [62]. Prior work attributes this behavior to a Chain-of-Frames (CoF) mechanism, suggesting that reasoning unfolds sequentially across video frames. Despite this intriguing discovery, the underlying mechanisms of video reasoning remain largely unexplored. With the recent release of large-scale video reasoning datasets and open-source foundation models [58], we now have the opportunity to systematically investigate this capability. Leveraging these resources, we conduct the first comprehensive dissection of video reasoning and uncover a fundamentally different mechanism: reasoning in diffusion-based video models primarily emerges along the denoising process rather than across frames.

Figure 1 Chain-of-Steps. We discover that video reasoning occurs along the diffusion steps, with surprising emergent behaviors such as making multiple possible moves (e.g., paths) simultaneously at early steps, gradually pruning suboptimal choices during middle steps, and reaching a final decision at the late steps. This maze-solving example asks the model to start from the green circle in the top-left corner and find the red rectangle. Key regions of interest are color-coded and enlarged on the right.

Our key discovery challenges the prevailing Chain-of-Frames (CoF) hypothesis [62, 66], which assumes that video reasoning unfolds sequentially across frames. Instead, we find that reasoning does not primarily operate along the temporal dimension. Rather, it emerges along the diffusion denoising steps, progressing throughout generation. We term this mechanism Chain-of-Steps (CoS). This finding suggests a fundamentally different view of how diffusion-based video models reason. Due to bidirectional attention over the entire sequence, reasoning is performed across all frames simultaneously at each denoising step, with intermediate hypotheses progressively refined as the process unfolds. Qualitative analysis reveals intriguing dynamics. In early denoising steps, the model often entertains multiple possibilities (populating alternative trajectories or superimposing candidate outcomes) before gradually converging to a final solution in later steps. Moreover, noise perturbation analysis shows that disruptions at specific denoising steps significantly degrade performance, whereas frame-wise perturbations have a much weaker impact. Further information propagation analysis identifies that the conclusion primarily solidifies during the middle diffusion steps.

Furthermore, we uncover several surprising emergent behaviors in video reasoning models that are strikingly similar to those observed in early studies of Large Language Models (LLMs). First, these models exhibit a form of working memory that is crucial for tasks requiring persistent references (e.g., object permanence).
Second, we observe that video models can self-correct errors during the CoS reasoning process, rather than committing to incorrect trajectories throughout generation. Third, video models exhibit a "perception before action" behavior, where early diffusion steps prioritize localizing target objects before subsequent steps perform more complex reasoning and manipulation.

We further conduct a fine-grained analysis of the Diffusion Transformer by examining token representations within a single diffusion step. This reveals self-evolved, diverse, task-agnostic functional layers throughout the network. Within a diffusion step, early layers focus on dense perceptual understanding (e.g., separating foreground from background and identifying basic geometric structures), while a set of critical middle layers performs the bulk of the reasoning. The final layers then consolidate the latent representation to produce the video state for the next step.

Motivated by these insights, we present a simple training-free method as a proof-of-concept for improving video reasoning models. Given that the model inherently explores multiple reasoning paths during the diffusion process, we propose an inference-time ensemble strategy that merges latents produced by three identical models with different random seeds. This approach encourages the model to retain a richer set of candidate reasoning trajectories during generation. As a result, the model explores more diverse reasoning paths and is more likely to converge to the correct solution, illustrating a way to use our findings to design more effective video reasoning systems.

In summary, we investigate the underlying mechanisms of video reasoning in diffusion models and identify Chain-of-Steps (CoS), a reasoning process that unfolds along the denoising trajectory. We further uncover several emergent reasoning behaviors that arise in these models.
Building on these insights, we demonstrate how such mechanisms can be exploited through a simple training-free strategy for reasoning path ensembling. We believe our findings provide a foundation for understanding and advancing video reasoning, positioning it as a promising next-generation substrate for machine intelligence.

2 Related Works

2.1 Reasoning in Language and Multimodal Models

Recent studies show that large language models (LLMs) exhibit remarkable reasoning capabilities. Early work identifies emergent behaviors that arise as models scale in size and data [60], and demonstrates that Chain-of-Thought (CoT) prompting, which elicits intermediate reasoning steps, significantly improves performance [61]. Subsequent work explores mechanisms such as self-reflection, correction, and action [21, 39, 69, 72]. Coconut further suggests that reasoning can also occur implicitly within latent representations [18]. Meanwhile, research has increasingly explored extending reasoning beyond language into multimodal settings. Early progress in vision-language models (VLMs) enables reasoning over images in addition to text [1, 2, 6, 32, 37], whereas recent work has studied unified architectures that jointly model language and vision [4, 8, 14, 33, 42, 46, 54, 56, 64, 65, 80, 81]. These architectures empower reasoning for generation [10, 16, 27, 34, 50, 67, 79], enable reasoning with generation through visual CoT [5, 7, 13, 23, 25, 31, 45, 51, 68], and extend to embodied scenarios [74-76]. Together, these findings suggest that reasoning over multimodal signals opens up avenues for advanced reasoning capabilities. However, these efforts remain limited to discrete text and static images, making it challenging to leverage spatiotemporally consistent priors. Our work aims to investigate video as the next substrate for reasoning in intelligent systems.
2.2 Video Generation Models

Video generation has advanced rapidly with the development of diffusion models [20, 47] and high-fidelity variational autoencoders (VAEs) [9, 26, 77]. While early approaches focused primarily on generating short clips, the emergence of Diffusion Transformers (DiTs) [43] has enabled effective scaling of data and model size. As a result, recent video generators [11, 15, 28, 30, 41, 48, 57, 59] achieve impressive visual fidelity. Despite these advances, major challenges remain in physical plausibility [73, 78], commonsense knowledge [78], and spatiotemporal reasoning [38, 58, 62]. Consequently, recent research has begun shifting toward investigating the reasoning capabilities of video generation models. One line of work leverages the reasoning abilities of multimodal LLMs to guide video synthesis. For example, VChain [23] and MetaCanvas [35] incorporate external reasoning modules into pre-trained generators, while Omni-Video [53] uses symbolic reasoning from LLMs to guide generation. More recently, several studies ask whether video generators themselves can perform reasoning without external supervision, treating them as zero-shot learners operating in spatiotemporal environments [19, 55, 62]. However, the mechanisms underlying this capability remain unexplored. Our work addresses this gap by investigating the internal reasoning processes of diffusion-based video models.

2.3 Similarities to Biological Brains

The diffusion model may be doing something analogous to how biological brains plan and think. For example, when a rat is deciding which path to take to reach food, researchers have observed that multiple simulated trajectories are rolled out in the hippocampus during the planning phase. In these experiments, the rat is first held still, and only after a delay period is it allowed to move [44].
Recent work suggests that human brains may employ analogous mechanisms for planning and internal simulation during conceptual reasoning and decision-making [3, 40].

3 Chain-of-Steps: Reasoning along Diffusion Steps

Prior work [62] hypothesizes a Chain-of-Frames (CoF) mechanism in which reasoning in video models unfolds frame by frame: generated frames appear to exhibit a "causal" property where later frames gradually build conclusions conditioned on earlier frames. However, our analysis of the underlying video reasoning mechanism reveals evidence to the contrary. First, we empirically analyze a wide range of reasoning tasks and find that the core logical reasoning in video generation models occurs across the diffusion denoising steps (Sec. 3.1). Diffusion steps do more than merely refine visual texture; instead, they explore multiple possibilities, evaluate their plausibility, and gradually converge to the correct outcome through the denoising process. Second, we introduce noise perturbations to disrupt information flow at both the frame and step levels (Sec. 3.2). Our findings reaffirm that CoS, rather than CoF, more accurately characterizes the reasoning mechanism in video models.

Figure 2 Chain-of-Steps elicits reasoning along the diffusion process. Panel prompts: (a) The robot drives to the white paper area. (b) Modify one 'O' to 'X' in Tic-Tac-Toe to secure a win. (c) Place the green plant on the far left of the same tier. (d) Find the diamond in the figure. (e) Complete the box according to the pattern. (f) Imitate the rotation pattern. We observe that video reasoning models explore multiple possible solutions simultaneously in the early denoising steps before converging to a final outcome in later steps. Specifically, we observe: (a) two potential routes (cyan arrows highlight the "imaginary traces") for the robot; (b) two possible placements of the "O" piece; (c) multiple candidate end positions for the plant; (d) simultaneous selection of two diamonds; (e) large and small circles overlapping with each other; and (f) all possible rotations of the L-shaped object superimposed.

3.1 Diffusion Steps as the Primary Axis of Reasoning

Unless otherwise stated, we base our study on VBVR-Wan2.2 [58], the latest video reasoning model finetuned from the powerful Wan2.2-I2V-A14B [57] on unprecedentedly large-scale video reasoning data. We extract test cases mainly from video reasoning benchmarks such as VBVR [58] and general video generation benchmarks such as VBench [22, 24]. To observe the model's internal decision-making dynamics, we examine the estimated clean latent x̂_0 at each diffusion step s. Diffusion-based generative models progressively transform noise into structured data through an iterative denoising process. When trained with flow matching [36], the latent evolves along a continuous transport path between noise and data:

x_s = (1 - s) x_0 + s x_1    (1)

where x_0 is the clean latent and x_1 ~ N(0, I) is noise. The model learns a velocity field v_θ(x_s, s, c) conditioned on prompt c, describing how the latent moves along this trajectory. The noise scale σ_s controls the magnitude of perturbation at each step. Therefore, the intermediate decoded state is estimated by removing the predicted noise component:

x̂_0 = x_s - σ_s · v_θ(x_s, s, c)    (2)

By decoding x̂_0 at each diffusion step, we can visualize how semantic decisions evolve and analyze the model's intermediate reasoning dynamics.
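As a minimal NumPy sketch of Eq. (2) (the helper name and toy shapes are our own, not from the paper's codebase), the clean-latent estimate can be computed from the current latent and the predicted velocity. For the linear flow-matching path of Eq. (1), the ground-truth velocity is x_1 - x_0 and σ_s = s, so a perfect prediction recovers x_0 exactly:

```python
import numpy as np

def estimate_clean_latent(x_s, v_pred, sigma_s):
    """Estimate the clean latent x0_hat at a diffusion step (Eq. 2):
    x0_hat = x_s - sigma_s * v_pred, where v_pred stands in for the
    model's predicted velocity v_theta(x_s, s, c)."""
    return x_s - sigma_s * v_pred

# Toy example with the linear flow-matching interpolation of Eq. (1):
# x_s = (1 - s) * x0 + s * x1, with x1 ~ N(0, I).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))   # stand-in for the clean latent
x1 = rng.standard_normal((4, 4))   # pure noise
s = 0.3
x_s = (1.0 - s) * x0 + s * x1

# For this path the true velocity is x1 - x0 and sigma_s = s,
# so the estimate recovers x0 exactly.
x0_hat = estimate_clean_latent(x_s, x1 - x0, sigma_s=s)
assert np.allclose(x0_hat, x0)
```

In the actual pipeline, x̂_0 is then passed through the VAE decoder at every step to produce the visualizations analyzed below.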
To our surprise, video reasoning models exhibit behavior along the diffusion denoising steps analogous to chain-of-thought reasoning in LLMs, where the model gradually reaches its conclusion. Specifically, we consistently observe a shared behavioral pattern in which early diffusion steps act as a high-level heuristic search. During this stage, the model populates the latent workspace with multiple hypotheses. As denoising progresses, the model effectively "prunes" the solution tree, converging toward a logically consistent output. This is exemplified in Fig. 1: for complex navigational tasks such as maze-solving, the decoded latent predictions x̂_0 after early diffusion steps appear as a probabilistic cloud in which several plausible paths are spawned and explored in parallel. Over subsequent steps, suboptimal trajectories are gradually suppressed, converging toward the final solution. By analyzing intermediate latent predictions at each step, we move beyond the "Chain-of-Frames" (CoF) temporal analogy and identify two distinct modes of step-wise reasoning: Multi-Path Exploration and Superposition-based Exploration.

3.1.1 Multi-Path Exploration

In high-complexity logical tasks, the diffusion process resembles a breadth-first search (BFS) or a multi-choice elimination procedure, where the model explores a tree of possible solutions and gradually prunes incorrect branches. This behavior is reminiscent of the parallel reasoning trajectories explicitly studied in the LLM community (e.g., Tree of Thoughts [71]). However, video generation models naturally explore multiple solution paths in parallel during the diffusion process, inherently performing a similar form of structured search within their latent space. In some tasks involving object movements, the model explicitly visualizes this exploration process through multiple motion trajectories.
In other tasks where the model must select an action from a discrete set of alternatives, we observe that the model initially considers several actions simultaneously and progressively discards candidates as the denoising process proceeds, until only a single valid outcome remains.

• Fig. 2 (a) Robot Navigation. The intermediate steps show the robot simultaneously exploring both the upper and lower routes through the maze. As the diffusion process proceeds, the trajectory corresponding to the lower path becomes increasingly dominant, while the alternative route gradually disappears, indicating that the model chooses the final path.
• Fig. 2 (b) Tic-Tac-Toe. During the early reasoning stage, the model simultaneously highlights multiple candidate cells for a winning move.
• Fig. 2 (c) Object Movement. In this example, it is clearly observable that at the early stage, the model proposes four potential trajectories corresponding to the four layers on the left side of the shelf. As the denoising steps continue, these alternatives gradually collapse toward placing the plant on the first layer, producing a clear and consistent motion path.
• Fig. 2 (d) Diamond Detection. The model initially marks two candidate shapes that might satisfy the query. Through iterative refinement, the incorrect candidate fades; only the correct diamond remains circled in the end.

3.1.2 Superposition-based Exploration

Another distinctive mode observable along the diffusion trajectory is superposition-based exploration, where the model temporarily represents multiple mutually exclusive logical states simultaneously. Instead of committing early to a single configuration, the model maintains overlapping hypotheses that gradually resolve as noise is removed. This phenomenon is particularly evident in tasks involving object reordering and spatial alignment.

• Fig. 2 (e) Size Pattern Completion. The size pattern follows a repeating "large-medium-small" sequence.
When predicting the next element, the model initially generates overlapping circles of different sizes, representing competing hypotheses about the correct continuation of the sequence.
• Fig. 2 (f) Objects Rotation. In this task, rather than rotating discretely from one angle to another, the model produces a blurred representation of several candidate orientations.

Figure 3 Noise perturbation and information flow. (a) Illustration of noise injection schemes; "Noise at Step" suffers more significant corruption than "Noise at Frame". (b) Performance drop with the two noise injection schemes. The x-axis is the injection index (either diffusion step or frame). (c) Information flow across denoising steps (CKA dissimilarity: 1.0 indicates complete corruption, 0.0 indicates no effect).

3.2 Noise Perturbation and Information Flow

Our hypothesis is further validated through targeted noise injection experiments. We compare two settings to isolate where the core reasoning process occurs: 1) "Noise at Step": x_{s, ∀f} ← N(0, I), i.e., disruptive Gaussian noise is injected into all frames at a specific diffusion step. 2) "Noise at Frame": x_{∀s, f} ← N(0, I), i.e., Gaussian noise is injected into a specific frame across all diffusion steps. The two settings are illustrated in Fig. 3 (a). In Fig. 3 (b), we evaluate model performance under these two noise injection schemes. Compared to the baseline without noise, the "Noise at Step" setting causes the final score to collapse from 0.685 to below 0.3, indicating that the reasoning trajectory is highly sensitive to disruptions along the diffusion steps. Noise injected at a particular diffusion step therefore leads to a significant interruption of the model's reasoning process.
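The two injection schemes, together with the linear CKA score behind the dissimilarity analysis in Fig. 3 (c), can be sketched as follows (a NumPy toy sketch; the helper names and array shapes are our own, and the real experiments operate on the model's video latents):

```python
import numpy as np

def noise_at_step(latents_by_step, step_idx, rng):
    """'Noise at Step': replace the latent of ALL frames at one
    specific denoising step with Gaussian noise."""
    out = [x.copy() for x in latents_by_step]
    out[step_idx] = rng.standard_normal(out[step_idx].shape)
    return out

def noise_at_frame(latents_by_step, frame_idx, rng):
    """'Noise at Frame': replace ONE frame with Gaussian noise at
    every denoising step."""
    out = [x.copy() for x in latents_by_step]
    for x in out:
        x[frame_idx] = rng.standard_normal(x[frame_idx].shape)
    return out

def linear_cka(X, Y):
    """Linear CKA between two (samples, features) matrices; the
    dissimilarity plotted in Fig. 3 (c) is 1 - CKA."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Toy trajectory: 5 denoising steps over a latent of 4 frames.
rng = np.random.default_rng(0)
traj = [rng.standard_normal((4, 8, 8)) for _ in range(5)]
step_perturbed = noise_at_step(traj, step_idx=2, rng=rng)
frame_perturbed = noise_at_frame(traj, frame_idx=1, rng=rng)
```

Comparing clean and perturbed representations with 1 - linear_cka then yields a corruption curve analogous to Fig. 3 (c).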
In contrast, under "Noise at Frame" injection, the model demonstrates greater robustness, with a much smaller performance drop. This behavior can be explained by the architecture of diffusion transformers: each denoising step has full observation of the preceding latent sequence through bidirectional attention, allowing the model to refine the entire video latent jointly. Consequently, corrupted frames can be recovered by leveraging the uncorrupted information from neighboring frames during subsequent denoising steps.

In Fig. 3 (c), we further analyze information propagation by measuring divergence after injecting noise at step s_t. We visualize CKA dissimilarity [29], where 1.0 indicates complete corruption and 0.0 indicates no effect. The results show that perturbations introduced in early diffusion steps propagate throughout the entire trajectory, fundamentally altering the final reasoning outcome. Notably, there is little recovery until the final stages, and the model does not fully recover. Moreover, the red dotted line highlights step-wise sensitivity to disruptive noise, which gradually increases and peaks around steps 20-30. This observation aligns with our qualitative analysis. Although steps 20-30 are not the earliest stages where we first observe reasoning phenomena, by this point the model has already pruned its reasoning trajectory toward the final conclusion. Consequently, perturbations at these steps have a large impact, as they can disrupt a reasoning process that is nearly finalized. Later steps, in contrast, appear less critical for the model's reasoning capability.

Figure 4 Emergent reasoning behaviors: memory and self-correction. Panel prompts: (a) Move the center object out of the frame and back. (b) Move the largest teddy bear to the left. (c) Predict where the ball will bounce. (d) Rotate the shape 180° clockwise in the top view. Observations: (a) The center point is retained to guide the return motion. (b) The contour of the occluded small teddy bear is preserved, enabling the model to address object permanence. (c) The trajectory of the ball gradually extends and becomes complete. (d) The missing cube only appears in the later diffusion steps. Cyan boxes are added for illustration; they are not part of the generated video.

4 Emergent Reasoning Behaviors

Similar to the emergent reasoning behaviors observed in Large Language Models (LLMs), we identify three surprising properties that are critical to effective video reasoning: working memory (Sec. 4.1), which retains essential information throughout the reasoning process; self-correction and enhancement (Sec. 4.2), which enables the model to revise intermediate hypotheses or refine previously generated answers, gradually adjusting toward the optimal solution even when it is not present initially; and perception before action (Sec. 4.3), through which the model spontaneously develops a universal protocol within its architecture to handle diverse video reasoning tasks.

4.1 Working Memory

Reasoning requires the maintenance of "working memory", or a state. Our demonstrations show that the diffusion process naturally establishes persistent anchors that preserve critical information across generation steps.

• Fig. 4 (a) Object Reappearance. The model consistently preserves the object's initial position throughout the diffusion steps, enabling the circle to return to its original location and remain consistent with the initial condition.
• Fig. 4 (b) Teddy Bear Relocation. During the movement task, the largest teddy bear temporarily occludes the small teddy bear on the left. Despite this occlusion, the early diffusion steps retain the state of the small bear to ensure consistent generation throughout the whole video.
4.2 Self-correction and Enhancement

During the diffusion process, we observe several stochastic "aha moments", where the model initially selects an incorrect option but later revises its reasoning after a few diffusion steps, exploring an alternative strategy. These behaviors are functionally analogous to the internal backtracking and "slow thinking" discussed in long-thinking Large Language Models (LLMs) [69]. Importantly, such transitions are not limited to correcting mistakes. The model may also refine an initially incomplete answer into a logically richer and more comprehensive one, reflecting a form of latent self-improvement rather than simple error repair. In contrast to the "Chain-of-Frames" theory, which would require such corrections to happen sequentially across time, these reversals take place globally across all frames simultaneously within a single diffusion step. This provides strong evidence that the video generation model prioritizes global logical integrity over local, sequential frame-wise updates.

• Fig. 4 (c) Hit Target After Bounce. Initially, the ball's trajectory is incomplete and ambiguous. As diffusion progresses, the model gradually completes the trajectory, making it increasingly clear, and the outcome converges from four candidate points to a single correct point.
• Fig. 4 (d) 3D Shape Rotation. At the first diffusion step, the rotated cubes are generated with incorrect quantities and arrangements. After several diffusion steps, the model gradually corrects both the number and the spatial configuration, producing a coherent and accurate final result.

4.3 Perception before Action

We observe a phenomenon suggesting that the diffusion trajectory first addresses the "what" and "where" of a scene before determining the "how" and "why" of its thinking progression. This suggests a "Perception before Action" transition, characterized by a shift from static grounding to dynamic reasoning.
As illustrated in Fig. 5, the initial diffusion step primarily identifies the foreground entity (e.g., the car or the door) specified in the prompt. At this stage, no explicit motion planning or relational transformation is observed. Instead, dynamic structure begins to emerge in later diffusion steps, where the model moves beyond static grounding and starts coordinating object motion and inter-object interactions.

5 Layer-wise Mechanistic Analysis

Inspired by the discovery of vision function layers in vision-language models [49], we investigate how diffusion transformers process visual information during video reasoning by analyzing the internal representations across transformer layers. Rather than focusing solely on generated outputs, we examine how hidden states evolve within the DiT backbone and how different layers contribute to semantic grounding and reasoning behaviors. Specifically, we study the model from two complementary perspectives: first, we visualize token-level activations across layers to analyze how attention and representation energy distribute over spatio-temporal regions; second, we conduct a layer-wise latent swapping experiment to causally evaluate how intermediate representations influence the final reasoning outcome. Together, these analyses provide a fine-grained view of how information is organized and progressively transformed inside the model.

5.1 Layer-wise Token-Level Visualization

To investigate this transition further, we analyze the internal activations of DiT blocks. During each diffusion step, we register forward hooks on the transformer blocks of the Wan2.2-I2V-A14B model. We specifically capture the hidden states from the first forward pass (the positive CFG pass) to isolate the model's primary reasoning trajectory.
The raw features are captured as a sequence of tokens of shape (B, N, D), feat ∈ R^{B×N×D}, where N is the total number of tokens and D = 5120 is the embedding dimension. To restore the visual context, we use the grid dimensions (f, h, w) captured from the model's patch_embedding layer to reshape the features into a 5D tensor of shape (B, f, h, w, D). For each token at every spatio-temporal coordinate, we compute the L2 norm across the channel dimension D. This reduction produces a scalar value representing the activation intensity, or "energy", of that specific patch. The final visualization is organized into a matrix where rows represent specific DiT layers (e.g., L0, L10, L20, ..., L39) and columns represent sequential video frames. Each cell in the grid displays a heatmap of the calculated norms, allowing us to observe how the model's attention shifts from coarse global structures in early layers to fine-grained logical reasoning in the deeper blocks.

Figure 5 Emergent reasoning behavior: understanding before reasoning. Panel prompts: (a) Get the car running. (b) Correct the incorrect parts of the house. Observations: (a) Early diffusion steps identify the car as the object of interest, while later steps introduce motion and simulate physical interactions. (b) Early steps recognize the door as the target object, and later steps manipulate it.

We observe that within a single diffusion step, the earliest layers (Layers 0-9) primarily attend to global structures and background context. As computation proceeds through the layers of the same step, attention progressively shifts toward foreground entities and those specified in the prompt. From around Layer 9 onward, activations become increasingly concentrated on semantically relevant objects, accompanied by higher channel variance in localized regions corresponding to target entities.
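The token-to-energy reduction described in Sec. 5.1 can be sketched as follows (a NumPy sketch with a toy patch grid; in the actual setup D = 5120 and (f, h, w) comes from the patch_embedding layer, while the function name is our own):

```python
import numpy as np

def token_energy_map(feat, grid):
    """Reduce DiT hidden states to per-patch activation "energy".

    feat: (B, N, D) token features captured by a forward hook.
    grid: (f, h, w) spatio-temporal patch grid with N == f * h * w.
    Returns a (B, f, h, w) array of L2 norms over the channel dim,
    i.e., one heatmap value per spatio-temporal patch.
    """
    B, N, D = feat.shape
    f, h, w = grid
    assert N == f * h * w, "token count must match the patch grid"
    return np.linalg.norm(feat.reshape(B, f, h, w, D), axis=-1)

# Toy example: 1 sample, a 2x3x4 patch grid, 8-dim embeddings.
rng = np.random.default_rng(0)
feat = rng.standard_normal((1, 2 * 3 * 4, 8))
energy = token_energy_map(feat, (2, 3, 4))
assert energy.shape == (1, 2, 3, 4)
```

Stacking these maps per layer and per frame yields the heatmap matrix described above.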
Notably, reasoning-related features also begin to emerge at this stage, with activations correlating with object motion and interactions. This within-step progression is consistently observed across diffusion steps, indicating a recurrent hierarchy from global context to object-centric reasoning.

5.2 Layer-wise Latent Swapping Experiment

As illustrated in Fig. 6 (b), to provide causal evidence for this transition, we conduct a layer-wise latent swapping experiment on object recognition and grounding tasks at the first diffusion step. We utilize a controlled environment featuring a blank background and two distinct sets of objects (O_A, O_B) to perform a pair-inference task (in this case, a cat and two bicycles).

Figure 6 Layer specialization. (a) Layer-wise activation visualization shows that early layers of the video reasoning DiT tend to focus on background structures, whereas later layers carry out reasoning-related computations; in-figure annotations note that the first few layers focus on the background, while middle layers focus on semantically relevant objects, with attention shifting across layers. (b) Layer-wise latent swapping (prompts "Please circle the cats in the image." and "Please circle the bicycles in the image.") reveals that certain middle layers (e.g., Layer 21 in this case) contain critical reasoning information that strongly influences the final outcome.

Let U^(l) represent the latent representations (vision tokens) at layer l of the transformer backbone.
T o quantify the individual contrib ution of each layer to the final logical output, we implement a swapping operation: ˜ U ( k ) ← U ( k ) alt , subject to U ( l  = k ) = U ( l ) orig 10 In-Domain by Category Out-of-Domain by Category Models Overall A vg. Abst. Know . Perc. Spat. T rans. A vg. Abst. Know . Perc. Spat. T rans. Human 0.974 0.960 0.919 0.956 1.00 0.95 1.00 0.988 1.00 1.00 0.990 1.00 0.970 Open-source V ideo Models CogV ideoX1.5-5B-I2V [ 70 ] 0.273 0.283 0.241 0.328 0.257 0.328 0.305 0.262 0.281 0.235 0.250 0.254 0.282 HunyuanV ideo-I2V [ 28 ] 0.273 0.280 0.207 0.357 0.293 0.280 0.316 0.265 0.175 0.369 0.290 0.253 0.250 W an2.2-I2V -A14B [ 57 ] 0.371 0.412 0.430 0.382 0.415 0.404 0.419 0.329 0.405 0.308 0.343 0.236 0.307 L TX-2 [ 17 ] 0.313 0.329 0.316 0.362 0.326 0.340 0.306 0.297 0.244 0.337 0.317 0.231 0.311 Proprietary V ideo Models Runway Gen-4 T urbo [ 48 ] 0.403 0.392 0.396 0.409 0.429 0.341 0.363 0.414 0.515 0.429 0.419 0.327 0.373 Sora 2 [ 41 ] 0.546 0.569 0.602 0.477 0.581 0.572 0.597 0.523 0.546 0.472 0.525 0.462 0.546 Kling 2.6 [ 30 ] 0.369 0.408 0.465 0.323 0.375 0.347 0.519 0.330 0.528 0.135 0.272 0.356 0.359 V eo 3.1 [ 15 ] 0.480 0.531 0.611 0.503 0.520 0.444 0.510 0.429 0.577 0.277 0.420 0.441 0.404 V ideo Reasoning Models VBVR-W an2.2 [ 58 ] 0.685 0.760 0.724 0.750 0.782 0.745 0.833 0.610 0.768 0.572 0.547 0.618 0.615 VBVR-W an2.2 + Training-Free Ensemble 0.716 0.780 0.760 0.744 0.809 0.749 0.858 0.650 0.803 0.705 0.531 0.639 0.716 T able 1 Benchmarking results on VBVR-Bench. Overall In-Domain (ID) and Out-of-Domain (OOD) scores are reported alongside category-wise performance. Higher is better . Bold : best in group; underline: second best. where U ( k ) alt contains the latent features of an alternativ e object configuration. The representations at all other layers remain unchanged. Strikingly , we observe that swapping the representations at layer 20 leads to a complete rev ersal of the inference result. 
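In code, the swap amounts to replacing one layer's hidden states with those from the alternative configuration while leaving every other layer untouched. A minimal sketch under assumed shapes; `latents_orig`/`latents_alt` stand in for the cached per-layer vision tokens of the two object configurations, and the sizes are illustrative:

```python
import numpy as np

num_layers, tokens, dim = 40, 16, 32  # illustrative sizes
rng = np.random.default_rng(0)

# Cached per-layer latents from two runs with different object sets (O_A, O_B).
latents_orig = [rng.normal(size=(tokens, dim)) for _ in range(num_layers)]
latents_alt = [rng.normal(size=(tokens, dim)) for _ in range(num_layers)]

def swap_layer(orig, alt, k):
    """Return latents where only layer k comes from the alternative run."""
    return [alt[l] if l == k else orig[l] for l in range(len(orig))]

swapped = swap_layer(latents_orig, latents_alt, k=20)
assert np.allclose(swapped[20], latents_alt[20])  # layer 20 replaced
assert np.allclose(swapped[0], latents_orig[0])   # all other layers untouched
```

In practice this substitution would be applied inside the forward pass (e.g., via a forward hook on the probed block) rather than on cached arrays.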
That is, the model's predicted identity of the target object flips after the substitution. This suggests that middle-to-late vision layers encode semantically decisive information that directly governs the grounding outcome.

6 Training-Free Ensemble

We hypothesize that while individual inference runs may exhibit stochasticity in their decisions, the reasoning manifold (the internal latent space on which the model bases its reasoning capability) often contains a shared probabilistic bias toward the correct outcome. More importantly, since the model develops multi-path reasoning during the early diffusion steps (Sec. 3.1), it is natural to exploit this property. Inspired by Model Soup [63], which merges models within the same optimization basin, we implement a multi-seed ensemble at the latent level during the early diffusion steps that are critical for the reasoning trajectory (Sec. 3.2). Specifically, we execute three independent forward passes using different initial noise seeds. During the first diffusion step (s = 0), we extract the hidden representations U(l) from the transformer backbone. Guided by our observation that reasoning-active features emerge in the mid-layers (Sec. 4.3), we perform spatial-temporal averaging of the latents across layers 20 to 29. By aggregating representations within this specific reasoning window, we effectively perform a latent-space ensemble resembling expert voting. This operation filters out seed-specific noise and biases the model's probability distribution toward a more stable and logically consistent latent state. We apply this training-free ensemble to VBVR-Wan2.2 and evaluate it on VBVR-Bench, a benchmark specifically designed for comprehensive assessment of video reasoning. Despite its simplicity, the ensemble method yields a 2% absolute improvement over the strong baseline in benchmark score (Tab. 1).
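One possible reading of the aggregation step can be sketched as follows: latents from the three seeded passes are averaged only inside the reasoning-active layer window, while layers outside it are left as in a single pass. This is an illustrative NumPy sketch, not the authors' implementation; shapes and the choice of which pass supplies the out-of-window layers are assumptions:

```python
import numpy as np

num_layers, tokens, dim = 40, 16, 32  # illustrative sizes
rng = np.random.default_rng(0)

# Hidden states U(l) from three forward passes at diffusion step s = 0,
# one per noise seed: (seeds, layers, tokens, dim).
seed_latents = rng.normal(size=(3, num_layers, tokens, dim))

def ensemble_latents(seed_latents, lo=20, hi=29):
    """Average latents across seeds inside the reasoning-active layer window."""
    out = seed_latents[0].copy()          # keep the first pass outside the window
    window = seed_latents[:, lo:hi + 1]   # (seeds, window, tokens, dim)
    out[lo:hi + 1] = window.mean(axis=0)  # expert-voting-style aggregation
    return out

merged = ensemble_latents(seed_latents)
assert merged.shape == (num_layers, tokens, dim)
```

Averaging within the window filters seed-specific noise while leaving the rest of the trajectory intact, which matches the intuition of merging runs that lie in the same "basin".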
This performance gain confirms that the model's internal reasoning can be "steered" toward the correct answer simply by aggregating the latents from multiple stochastic trajectories during the critical early steps of the "Perception before Action" transition, effectively exploiting the probabilistic bias inherent in the reasoning manifold.

7 Conclusion

In this work, we investigate the mechanisms underlying reasoning in diffusion-based video generation models. Contrary to the previously hypothesized Chain-of-Frames (CoF) mechanism, we show through qualitative analysis and targeted perturbation experiments that reasoning primarily unfolds along the diffusion steps, a process we term Chain-of-Steps (CoS). Our study further reveals several emergent reasoning behaviors, including working memory retention, self-correction during generation, and layer specialization within the DiT architecture. Motivated by these insights, we propose a simple training-free reasoning-path ensemble method that achieves performance improvements over a strong baseline.

References

[1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
[2] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)
[3] Behrens, T.E., Muller, T.H., Whittington, J.C., Mark, S., Baram, A.B., Stachenfeld, K.L., Kurth-Nelson, Z.: What is a cognitive map? Organizing knowledge for flexible behavior.
Neuron 100(2), 490–509 (2018)
[4] Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: BLIP3-o: A family of fully open unified multimodal models: architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025)
[5] Chen, L.L., Ma, H., Fan, Z., Huang, Z., Sinha, A., Dai, X., Wang, J., He, Z., Yang, J., Li, C., et al.: Unit: Unified multimodal chain-of-thought test-time scaling. arXiv preprint arXiv:2602.12279 (2026)
[6] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24185–24198 (2024)
[7] Chern, E., Hu, Z., Chern, S., Kou, S., Su, J., Ma, Y., Deng, Z., Liu, P.: Thinking with generated images. arXiv preprint arXiv:2505.22525 (2025)
[8] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)
[9] Denton, E., Fergus, R.: Stochastic video generation with a learned prior. In: International Conference on Machine Learning. pp. 1174–1183. PMLR (2018)
[10] Duan, C., Fang, R., Wang, Y., Wang, K., Huang, L., Zeng, X., Li, H., Liu, X.: GoT-R1: Unleashing reasoning capability of MLLM for visual generation with reinforcement learning. arXiv preprint arXiv:2505.17022 (2025)
[11] Fan, W., Si, C., Song, J., Yang, Z., He, Y., Zhuo, L., Huang, Z., Dong, Z., He, J., Pan, D., et al.: Vchitect-2.0: Parallel transformer for scaling up video diffusion models. arXiv preprint arXiv:2501.08453 (2025)
[12] Fan, X., Qiu, Z., Wu, Z., Wang, F., Lin, Z., Ren, T., Lin, D., Gong, R., Yang, L.: Phased DMD: Few-step distribution matching distillation via score matching within subintervals.
arXiv preprint arXiv:2510.27684 (2025)
[13] Fang, R., Duan, C., Wang, K., Huang, L., Li, H., Yan, S., Tian, H., Zeng, X., Zhao, R., Dai, J., et al.: GoT: Unleashing reasoning capability of multimodal large language model for visual generation and editing. arXiv preprint arXiv:2503.10639 (2025)
[14] Ge, Y., Zhao, S., Zhu, J., Ge, Y., Yi, K., Song, L., Li, C., Ding, X., Shan, Y.: SEED-X: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396 (2024)
[15] Google DeepMind: Veo 3.1: Ingredients to Video. Technical Report, Google DeepMind (January 2026), https://blog.google/innovation-and-ai/technology/ai/veo-3-1-ingredients-to-video/, released January 13, 2026
[16] Guo, Z., Zhang, R., Tong, C., Zhao, Z., Gao, P., Li, H., Heng, P.A.: Can we generate images with CoT? Let's verify and reinforce image generation step by step. arXiv preprint arXiv:2501.13926 (2025)
[17] HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., Richardson, E., Shiran, G., Chachy, I., Chetboun, J., Finkelson, M., Kupchick, M., Zabari, N., Guetta, N., Kotler, N., Bibi, O., Gordon, O., Panet, P., Benita, R., Armon, S., Kulikov, V., Inger, Y., Shiftan, Y., Melumian, Z., Farbman, Z.: LTX-2: Efficient joint audio-visual foundation model (2026), submitted 6 Jan 2026
[18] Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., Tian, Y.: Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769 (2024)
[19] He, X., Fan, Z., Li, H., Zhuo, F., Xu, H., Cheng, S., Weng, D., Liu, H., Ye, C., Wu, B.: Ruler-Bench: Probing rule-based reasoning abilities of next-level video generation models for vision foundation intelligence. arXiv preprint (2025)
[20] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models.
In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
[21] Huang, J., Gu, S., Hou, L., Wu, Y., Wang, X., Yu, H., Han, J.: Large language models can self-improve. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 1051–1068 (2023)
[22] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
[23] Huang, Z., Yu, N., Chen, G., Qiu, H., Debevec, P., Liu, Z.: VChain: Chain-of-visual-thought for reasoning in video generation. arXiv preprint arXiv:2510.05094 (2025)
[24] Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., et al.: VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
[25] Jiang, D., Guo, Z., Zhang, R., Zong, Z., Li, H., Zhuo, L., Yan, S., Heng, P.A., Li, H.: T2I-R1: Reinforcing image generation with collaborative semantic-level and token-level CoT. arXiv preprint arXiv:2505.00703 (2025)
[26] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
[27] Koh, J.Y., Fried, D., Salakhutdinov, R.: Generating images with multimodal language models. NeurIPS (2023)
[28] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: HunyuanVideo: A systematic framework for large video generative models.
arXiv preprint arXiv:2412.03603 (2024)
[29] Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network representations revisited. In: International Conference on Machine Learning. pp. 3519–3529. PMLR (2019)
[30] Kuaishou Technology: Kling AI launches video 2.6 model with "simultaneous audio-visual generation" capability, redefining AI video creation workflow. Press Release (December 2025), model released December 3, 2025; press release published December 5, 2025
[31] Li, C., Wu, W., Zhang, H., Xia, Y., Mao, S., Dong, L., Vulić, I., Wei, F.: Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542 (2025)
[32] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730–19742. PMLR (2023)
[33] Li, T., Lu, Q., Zhao, L., Li, H., Zhu, X., Qiao, Y., Zhang, J., Shao, W.: UniFork: Exploring modality alignment for unified multimodal understanding and generation. arXiv preprint arXiv:2506.17202 (2025)
[34] Liao, C., Liu, L., Wang, X., Luo, Z., Zhang, X., Zhao, W., Wu, J., Li, L., Tian, Z., Huang, W.: Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472 (2025)
[35] Lin, H., Pan, X., Huang, Z., Hou, J., Wang, J., Chen, W., He, Z., Juefei-Xu, F., Sun, J., Fan, Z., et al.: Exploring MLLM-diffusion information transfer with MetaCanvas. arXiv preprint arXiv:2512.11464 (2025)
[36] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling (2023)
[37] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)
[38] Luo, Y., Zhao, X., Lin, B., Zhu, L., Tang, L., Liu, Y., Chen, Y.C., Qian, S., Wang, X., You, Y.: V-ReasonBench: Toward unified reasoning benchmark suite for video generation models.
arXiv preprint arXiv:2511.16668 (2025)
[39] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Welleck, S., Majumder, B.P., Gupta, S., Yazdanbakhsh, A., Clark, P.: Self-Refine: Iterative refinement with self-feedback (2023)
[40] Mattar, M.G., Lengyel, M.: Planning in the brain. Neuron 110(6), 914–934 (2022)
[41] OpenAI: Sora: OpenAI's text-to-video model (September 2025), https://openai.com/index/sora-is-here, publicly released September 2025
[42] Pan, X., Shukla, S.N., Singh, A., Zhao, Z., Mishra, S.K., Wang, J., Xu, Z., Chen, J., Li, K., Juefei-Xu, F., et al.: Transfer between modalities with MetaQueries. arXiv preprint arXiv:2504.06256 (2025)
[43] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)
[44] Pfeiffer, B.E., Foster, D.J.: Hippocampal place-cell sequences depict future paths to remembered goals. Nature 497(7447), 74–79 (2013)
[45] Qin, L., Gong, J., Sun, Y., Li, T., Yang, M., Yang, X., Qu, C., Tan, Z., Li, H.: Uni-CoT: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606 (2025)
[46] Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: TokenFlow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2545–2555 (2025)
[47] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
10684–10695 (2022)
[48] Runway Research: Introducing Runway Gen-4: Next-generation AI models for media generation and world consistency (March 2025), https://runwayml.com/research/introducing-runway-gen-4, accessed January 27, 2026
[49] Shi, C., Yu, Y., Yang, S.: Vision function layer in multimodal LLMs. arXiv preprint arXiv:2509.24791 (2025)
[50] Shi, W., Han, X., Zhou, C., Liang, W., Lin, X.V., Zettlemoyer, L., Yu, L.: LMFusion: Adapting pretrained language models for multimodal generation. arXiv preprint arXiv:2412.15188 (2024)
[51] Shi, W., Yu, A., Fang, R., Ren, H., Wang, K., Zhou, A., Tian, C., Fu, X., Hu, Y., Lu, Z., et al.: MathCanvas: Intrinsic visual chain-of-thought for multimodal mathematical reasoning. arXiv preprint arXiv:2510.14958 (2025)
[52] Su, J., Lu, Y., Pan, S., Wen, B., Liu, Y.: RoFormer: Enhanced transformer with rotary position embedding (2021)
[53] Tan, Z., Yang, H., Qin, L., Gong, J., Yang, M., Li, H.: Omni-Video: Democratizing unified video understanding and generation. arXiv preprint arXiv:2507.06119 (2025)
[54] Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024). https://doi.org/10.48550/arXiv.2405.09818, https://github.com/facebookresearch/chameleon
[55] Tong, J., Mou, Y., Li, H., Li, M., Yang, Y., Zhang, M., Chen, Q., Liang, T., Hu, X., Zheng, Y., et al.: Thinking with video: Video generation as a promising multimodal reasoning paradigm. arXiv preprint arXiv:2511.04570 (2025)
[56] Tong, S., Fan, D., Li, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: MetaMorph: Multimodal understanding and generation via instruction tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17001–17012 (2025)
[57] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models.
arXiv preprint arXiv:2503.20314 (2025)
[58] Wang, M., Wang, R., Lin, J., Ji, R., Wiedemer, T., Gao, Q., Luo, D., Qian, Y., Huang, L., Hong, Z., Ge, J., Ma, Q., He, H., Zhou, Y., Guo, L., Mei, L., Li, J., Xing, H., Zhao, T., Yu, F., Xiao, W., Jiao, Y., Hou, J., Zhang, D., Xu, P., Zhong, B., Zhao, Z., Fang, G., Kitaoka, J., Xu, Y., Xu, H., Blacutt, K., Nguyen, T., Song, S., Sun, H., Wen, S., He, L., Wang, R., Wang, Y., Yang, M., Ma, Z., Millière, R., Shi, F., Vasconcelos, N., Khashabi, D., Yuille, A., Du, Y., Liu, Z., Lin, D., Liu, Z., Kumar, V., Li, Y., Yang, L., Cai, Z., Deng, H.: A very big video reasoning suite. arXiv preprint arXiv:2602.20159 (2026)
[59] Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al.: LaVie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision 133(5), 3059–3078 (2025)
[60] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)
[61] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
[62] Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025)
[63] Wortsman, M., Ilharco, G., Gadre, S.Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A.S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., et al.: Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In: International Conference on Machine Learning. pp. 23965–23998.
PMLR (2022)
[64] Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848 (2024)
[65] Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)
[66] Wu, J.Z., Ren, X., Shen, T., Cao, T., He, K., Lu, Y., Gao, R., Xie, E., Lan, S., Alvarez, J.M., Gao, J., Fidler, S., Wang, Z., Ling, H.: ChronoEdit: Towards temporal reasoning for image editing and world simulation. arXiv preprint arXiv:2510.04290 (2025)
[67] Xiao, Y., Song, L., Chen, Y., Luo, Y., Chen, Y., Gan, Y., Huang, W., Li, X., Qi, X., Shan, Y.: MindOmni: Unleashing reasoning generation in vision language models with RGPO. arXiv preprint arXiv:2505.13031 (2025)
[68] Xu, Y., Li, C., Zhou, H., Wan, X., Zhang, C., Korhonen, A., Vulić, I.: Visual planning: Let's think only with images (2025)
[69] Yang, S., Wu, J., Chen, X., Xiao, Y., Yang, X., Wong, D.F., Wang, D.: Understanding aha moments: From external observations to internal mechanisms. arXiv preprint arXiv:2504.02956 (2025)
[70] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
[71] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.: Tree of Thoughts: Deliberate problem solving with large language models. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 11809–11822. Curran Associates, Inc. (2023), https://proceedings.
neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf
[72] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (ICLR) (2023)
[73] Yue, J., Huang, Z., Chen, Z., Wang, X., Wan, P., Liu, Z.: Simulating the visual world with artificial intelligence: A roadmap. arXiv preprint arXiv:2511.08585 (2025)
[74] Zawalski, M., Chen, W., Pertsch, K., Mees, O., Finn, C., Levine, S.: Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693 (2024)
[75] Zeng, S., Chang, X., Xie, M., Liu, X., Bai, Y., Pan, Z., Xu, M., Wei, X.: FutureSightDrive: Thinking visually with spatio-temporal CoT for autonomous driving. arXiv preprint arXiv:2505.17685 (2025)
[76] Zhao, Q., Lu, Y., Kim, M.J., Fu, Z., Zhang, Z., Wu, Y., Li, Z., Ma, Q., Han, S., Finn, C., et al.: CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1702–1713 (2025)
[77] Zhao, S., Zhang, Y., Cun, X., Yang, S., Niu, M., Li, X., Hu, W., Shan, Y.: CV-VAE: A compatible video VAE for latent generative video models. Advances in Neural Information Processing Systems 37, 12847–12871 (2024)
[78] Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Zhang, Y., He, J., Zheng, W.S., Qiao, Y., Liu, Z.: VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)
[79] Zheng, K., He, X., Wang, X.E.: MiniGPT-5: Interleaved vision-and-language generation via generative vokens (2023)
[80] Zhuang, X., Xie, Y., Deng, Y., Liang, L., Ru, J., Yin, Y., Zou, Y.: VARGPT: Unified understanding and generation in a visual autoregressive multimodal large language model.
arXiv preprint arXiv:2501.12327 (2025)
[81] Zou, K., Huang, Z., Dong, Y., Tian, S., Zheng, D., Liu, H., He, J., Liu, B., Qiao, Y., Liu, Z.: Uni-MMMU: A massive multi-discipline multimodal unified benchmark. arXiv preprint arXiv:2510.13759 (2025)

Appendix A More Experiments on Training-Free Ensemble

| Aggregated Layers | Overall | ID Avg. | ID Abst. | ID Know. | ID Perc. | ID Spat. | ID Trans. | OOD Avg. | OOD Abst. | OOD Know. | OOD Perc. | OOD Spat. | OOD Trans. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| baseline (VBVR-Wan2.2) [58] | 0.685 | 0.760 | 0.724 | 0.750 | 0.782 | 0.745 | 0.833 | 0.610 | 0.768 | 0.572 | 0.547 | 0.618 | 0.615 |
| 0–9 | 0.688 | 0.774 | 0.754 | 0.741 | 0.808 | 0.746 | 0.835 | 0.602 | 0.805 | 0.635 | 0.485 | 0.597 | 0.642 |
| 0–39 | 0.690 | 0.767 | 0.733 | 0.737 | 0.807 | 0.747 | 0.825 | 0.613 | 0.830 | 0.606 | 0.482 | 0.630 | 0.657 |
| 20–29 (Training-Free Ensemble) | 0.716 | 0.780 | 0.760 | 0.744 | 0.809 | 0.749 | 0.858 | 0.650 | 0.803 | 0.705 | 0.531 | 0.639 | 0.716 |

Table 2 Comparison of VBVR-Bench performance across different layer windows at diffusion step s = 0. Mid-layer aggregation (20–29) achieves the best overall performance (0.716) by capturing the critical reasoning-active window. Bold: best in group; underline: second best.

To examine the impact of the aggregation window, we conduct an experiment over different layer ranges when performing the latent ensemble. For a fair comparison, all variants perform the ensemble at the first diffusion step (s = 0), while only the aggregated layer window is varied. As shown in Tab. 2, aggregating representations from the early layers (0–9) yields only a marginal improvement over the baseline, increasing the overall score from 0.685 to 0.688, with limited gains in both in-domain and out-of-domain settings. This suggests that early-layer representations primarily encode low-level perceptual features and have not yet formed the semantic structures required for reasoning.
Expanding the aggregation to all layers (0–39) produces a slightly higher overall score (0.690), but the improvement remains modest and inconsistent across categories, indicating that averaging across the entire depth introduces noise from layers that are either too early (perceptual) or too late (already specialized for generation). In contrast, aggregating mid-layer representations (layers 20–29) achieves the best performance, reaching an overall score of 0.716 and consistently improving most categories. This result aligns with our earlier analysis that the middle layers correspond to the transition stage between understanding and reasoning, where the model actively integrates semantic concepts and forms reasoning trajectories. Consequently, performing the ensemble within this reasoning-active window provides a more stable and informative latent representation, effectively filtering stochastic noise across seeds while preserving the semantic structures that guide correct reasoning.

B The Impact of the Number of Frames

| Configuration | Overall | In-Domain | Out-of-Domain |
|---|---|---|---|
| Chronoedit | 0.581 | 0.637 | 0.524 |
| 5 frames | 0.619 | 0.688 | 0.549 |
| 9 frames | 0.632 | 0.716 | 0.549 |
| 17 frames | 0.663 | 0.743 | 0.582 |
| 33 frames | 0.685 | 0.685 | 0.685 |
| 65 frames | 0.675 | 0.760 | 0.591 |
| VBVR-Wan2.2 | 0.685 | 0.760 | 0.610 |

Table 3 Comparison of model performance across frame counts. ChronoEdit here can be considered a single-frame version of VBVR-Wan2.2. The original VBVR-Wan2.2 operates on ∼100 frames on average.

Although Sec. 3.1 shows that reasoning primarily occurs across diffusion steps rather than across frames, we observe that the number of frames still plays an important role. In practice, frames serve as a latent spatiotemporal workspace (or "scratchpad") that enables the diffusion model to store essential visual information throughout the diffusion process. In Tab. 3, following ChronoEdit [66], we conduct an experiment that repurposes the video generation model as an image editing model to simulate single-frame reasoning. In this setting, the 3D-factorized Rotary Position Embedding (RoPE) [52] is modified to anchor the input image at time step 0 and the output image at a predefined time step T, while intermediate frames are dropped after a few steps. However, this configuration performs substantially worse than all multi-frame settings. This result suggests that maintaining multiple frames helps the model capture spatiotemporal coherence, which is critical for effective video reasoning. We further investigate the effect of reducing the number of generated frames in VBVR-Wan2.2. The performance drop is relatively minor when the number of frames decreases from the original ∼100 to around 17. However, further reducing the frame count leads to noticeable degradation. This observation reinforces our hypothesis that although reasoning does not occur strictly in a frame-wise manner, maintaining a minimum level of temporal continuity is still necessary to accommodate the key events required for correct inference.

C Performance on 4-Step Distilled Model

We investigate how distillation affects reasoning in video generation models using a distilled 4-step Wan2.2-I2V-14B model. Distillation significantly compresses the denoising trajectory, raising the question of whether the reasoning dynamics remain observable under such a shortened inference process. To study this, we simultaneously apply two LoRA models with scaling weights of 0.5 each: one based on VBVR-Wan2.2 that enhances reasoning capability, and the other based on a 4-step model distilled via Phased DMD [12] that improves generation speed. We find that although the number of denoising steps is drastically reduced from 50 to 4, the steps required for reasoning cannot be compressed proportionally.
In particular, the characteristic reasoning activity that typically emerges during the early diffusion steps persists even after distillation. However, we also observe that in some tasks the noise scheduler reduces the noise level too aggressively in the first step, collapsing the latent exploration phase where reasoning signals usually emerge. As a result, intermediate reasoning patterns become difficult to observe, and the overall performance on VBVR-Bench drops significantly from 0.685 to 0.605. These results suggest that even for distilled models, preserving sufficient latent evolution during the initial diffusion step is crucial for maintaining effective reasoning capability.

Figure 7 Qualitative visualizations of the distilled model (Step 0 to final step). Panels: Multi-Path Exploration ("Complete the box according to the pattern."; "Rotate the object and move to fit in the dotted place."; "Traverse from blue grid to red grid and avoid obstacles."), Superposition Exploration, Memory, and Self-correction and Enhancement ("Ball bounces against the walls.").

D Full Layer-wise Analysis

Fig. 8 presents a comprehensive visualization of token activation energies across all 40 DiT blocks and video frames, covering all tasks reported in Sec. 5.1 and Fig. 6. Each row corresponds to a transformer block, and each column corresponds to a video frame. The heatmaps show the spatial distribution of token activation magnitudes. In Sec. 5.1, we discuss a layer-wise transition in which early layers focus on global structures, while middle layers increasingly attend to prompt-relevant foreground objects and exhibit reasoning-related features associated with object motion and interactions. Here we discuss two additional findings.

D.0.1 High sparsity in token activations. Across most layers, a large fraction of spatial tokens exhibit very low activation norms (dark purple regions). This indicates that only a small subset of tokens carries significant signal at any given layer. In practice, this suggests that the model performs highly sparse computation in token space, where meaningful reasoning is concentrated in localized patches corresponding to salient visual structures. Such sparsity becomes particularly pronounced in the middle layers, where entire spatial regions remain near zero while only a few tokens remain strongly active.

D.0.2 High concentration of token activations at middle layers. Beginning in the intermediate layers, regular grid-like patterns become clearly visible. These patterns might align with the underlying patch tokenization structure of the transformer, potentially indicating spatial awareness. At this stage, the model may begin organizing information along patch boundaries, resulting in checkerboard- or lattice-like activation patterns.

E More Visualization

In this section, more visualization examples are provided to further illustrate that reasoning happens across diffusion steps in video generation models. These examples illustrate several recurring phenomena discussed in the main paper. Specifically, Fig. 9 presents additional cases of the Multi-Path Exploration phenomenon described in Sec. 3.1.1, where the model explores multiple candidate structures before converging to a coherent generation. Fig. 10 provides further examples of the Superposition-based Exploration discussed in Sec. 3.1.2, highlighting how multiple hypotheses may coexist in intermediate representations. Fig. 11 illustrates the Working Memory phenomenon from Sec. 4.1, showing how the model memorises key information such as the trajectory. Finally, Figs. 12 and 13 demonstrate additional instances of Self-correction and Enhancement described in Sec. 4.2, where early imperfect structures are gradually refined and improved through the denoising steps.

Figure 8 Layer-wise token activation visualization across all 40 DiT blocks. Rows correspond to layers 0–39 (from top to bottom), while columns represent video frames.

Figure 9 More visualizations of the "Multi-Path Exploration" phenomenon in Sec. 3.1.1. Prompts: "Swap the third and fourth shapes."; "Remove the shapes one by one from top to bottom."; "Traverse from the root to find the red node."; "Choose the largest sector of the pie chart."; "Choose the rectangle containing the maximum number of red points."; "Select the largest number."

Figure 10 More visualizations of the "Superposition-based Exploration" phenomenon in Sec. 3.1.2. Prompts: "Move and rotate the shapes on the left into the dashed frame on the right."; "Imitate the rotation pattern."

Figure 11 More visualizations of the "Memory" phenomenon in Sec. 4.1. Prompts: "Move the shapes on the left into the dashed frame on the right."; "Insert the blue rectangle on the left, keeping it in ascending order as much as possible."; "Align the two circles centrally until they are mutually tangent."; "Sort the stars by size in non-descending order."

Figure 12 More visualizations of the "Self-correction and Enhancement" phenomenon in Sec. 4.2. Prompts: "Maximize path sum for the yellow circle from green to red via the shortest route."; "Move the yellow circle from green to red via three intermediate yellow cells."; "Infer the fifth square."

Figure 13 More visualizations of the "Self-correction and Enhancement" phenomenon in Sec. 4.2. Prompts: "Place a glass of water in front of the mirror."; "Predict the trajectory of the pinball."; "Integrate red, orange, green, and blue groups in sequence."
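The first-step noise collapse discussed above for distilled models can be made concrete with a toy calculation. The sketch below is a minimal illustration, not the paper's actual scheduler: it uses a Karras-style sigma spacing (an assumption, chosen only because it is a common schedule) and compares how much of the initial noise level a many-step and a few-step schedule remove at step 0. The aggressive first-step drop of the few-step schedule is exactly what shrinks the latent exploration window.

```python
import numpy as np

def karras_sigmas(n_steps, sigma_min=0.02, sigma_max=80.0, rho=7.0):
    """Illustrative Karras-style sigma schedule (assumed; not the paper's scheduler)."""
    ramp = np.linspace(0.0, 1.0, n_steps + 1)
    inv_max, inv_min = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return (inv_max + ramp * (inv_min - inv_max)) ** rho

def first_step_noise_drop(sigmas):
    """Fraction of the initial noise level removed by the very first denoising step."""
    return 1.0 - sigmas[1] / sigmas[0]

# A few-step (distillation-style) schedule removes far more noise at step 0
# than a many-step schedule, leaving little room for latent exploration.
print(f"50-step: {first_step_noise_drop(karras_sigmas(50)):.1%} of noise removed at step 0")
print(f" 4-step: {first_step_noise_drop(karras_sigmas(4)):.1%} of noise removed at step 0")
```

Under these assumed parameters the 4-step schedule removes the large majority of the noise in its very first step, which mirrors the collapsed-exploration behavior observed on VBVR-Bench.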
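The per-block token activation energies plotted in Fig. 8, and the sparsity statistic discussed in Sec. D.0.1, can be computed post hoc from hooked activations. The sketch below runs on random stand-in data so it is self-contained; in practice `acts` would be captured with forward hooks on each DiT block, and the frame count, token grid, and hidden size used here are placeholder assumptions (only the 40-block count matches the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for hooked activations: (blocks, frames, spatial_tokens, hidden_dim).
# Real values would be captured from the 40 DiT blocks during a denoising step;
# the frame count, 16x16 token grid, and hidden size are placeholder assumptions.
acts = rng.standard_normal((40, 8, 16 * 16, 64))

# Token activation energy: L2 norm of each token's hidden vector, i.e. the
# quantity shown in the Fig. 8 heatmaps (one heatmap per block x frame).
energy = np.linalg.norm(acts, axis=-1)                        # (blocks, frames, tokens)

# Sparsity (Sec. D.0.1): fraction of tokens whose energy falls below a global
# low quantile -- the "near-zero" dark regions of the heatmaps.
threshold = np.quantile(energy, 0.10)
sparsity_per_block = (energy < threshold).mean(axis=(1, 2))   # one value per block

print("energy maps:", energy.shape)  # (40, 8, 256)
print("sparsity of block 0: %.3f" % sparsity_per_block[0])
```

On real activations, `sparsity_per_block` would peak in the middle layers, matching the observation that sparsity is most pronounced there.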
