Interpreting Physics in Video World Models

Notice: This research summary and analysis were generated automatically with AI. For accuracy, please refer to the original arXiv source.

A long-standing question in physical reasoning is whether video-based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a task-specific, distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition – which we call the Physics Emergence Zone – at which physical variables become accessible. Physics-related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high-dimensional population structure with circular geometry, requiring coordinated multi-feature intervention to control. These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.


💡 Research Summary

The paper tackles a fundamental question in video‑based physical reasoning: do modern video world models store physical variables in compact, factorised latent states (as in a classic physics engine), or do they encode such information in a task‑specific, distributed manner? To answer this, the authors conduct the first large‑scale interpretability study of video transformer encoders, focusing on two state‑of‑the‑art model families—V‑JEPA 2 (in its Large, Huge, and Giant configurations) and VideoMAE‑v2‑G.

Methodology
The authors probe every residual layer of the frozen encoders using two complementary probe families: (1) linear probes applied to mean‑pooled spatiotemporal tokens, which reveal what information is linearly readable, and (2) attention‑MLP probes that preserve patch‑level structure. Probes are trained to predict (a) a binary “possible vs. impossible” label on the IntPhys benchmark (which requires integration of high‑level motion dynamics) and (b) explicit physical quantities—Cartesian velocity components (vₓ, v_y), Cartesian acceleration components (aₓ, a_y), speed |v|, motion direction θ, and acceleration magnitude |a|—on a synthetic toy‑ball dataset generated with Kubric where ground‑truth motion parameters are known.
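
The first probe family described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the activations are random stand-ins for per-layer features from a frozen encoder, and the array shapes and labels are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for one layer's encoder activations: (n_videos, n_tokens, d_model).
# In the paper these would come from a frozen V-JEPA 2 / VideoMAE-v2 encoder.
n_videos, n_tokens, d_model = 200, 64, 32
activations = rng.normal(size=(n_videos, n_tokens, d_model))
labels = rng.integers(0, 2, size=n_videos)  # possible (0) vs. impossible (1)

# Mean-pool the spatiotemporal tokens, then fit a linear probe on top.
pooled = activations.mean(axis=1)  # (n_videos, d_model)
X_tr, X_te, y_tr, y_te = train_test_split(pooled, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out probe accuracy: {probe.score(X_te, y_te):.2f}")
```

Repeating this at every residual layer (and swapping the target for a regression probe on velocity, speed, direction, etc.) yields the layerwise accuracy curves from which the emergence depth is read off. With the random features above, accuracy hovers near chance by construction.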

Key Findings

  1. Physics Emergence Zone (PEZ) – Across all model scales, probe accuracy on the possible‑impossible task exhibits a sharp transition at roughly one‑third of the network depth. The authors term this transition the Physics Emergence Zone. Within this zone, the model abruptly acquires the ability to integrate the spatiotemporal cues required for physical plausibility judgments.

  2. Intermediate‑layer peak – After PEZ, probe performance continues to improve, reaching a maximum in the middle third of the encoder, then declines toward the final layers. This mirrors findings in vision models where intermediate representations are richer for downstream perception tasks, suggesting that the final layers are more tuned to the pre‑training objective (e.g., reconstruction or masked prediction) rather than preserving explicit physical structure.

  3. Scalar quantities emerge early – Velocity components, speed, and acceleration magnitude become linearly decodable from the earliest layers. Notably, acceleration can be predicted directly without an explicit velocity intermediate; a simple MLP on the early representation suffices, indicating that the model does not rely on a sequential derivation (velocity → acceleration) but learns a direct mapping.

  4. Direction emerges later and is high‑dimensional – Motion direction θ is essentially absent in early layers and only becomes accessible at PEZ. Moreover, direction is not stored in a single scalar dimension. Subspace analysis reveals a circular geometry spread across dozens of approximately orthogonal components, reminiscent of population codes observed in neuroscience. Manipulating a single dimension does not reliably change the decoded direction; coordinated changes across many dimensions are required, confirming a distributed circular code.

  5. Orthogonal subspaces for different tasks – The subspace encoding direction is nearly orthogonal to the subspace used for the possible‑impossible classification. Although both abilities emerge at the same depth, they rely on distinct representations, contradicting the physics‑engine hypothesis that a shared latent state (e.g., direction) would support multiple downstream physical tasks.

  6. Attention‑head specialization – Within PEZ, a small set of attention heads exhibit unusually local spatiotemporal receptive fields. Ablating these heads dramatically harms performance on the possible‑impossible task and on a temporal reasoning benchmark (detecting shuffled videos), while leaving static image classification largely unchanged. This points to a dedicated circuit‑level substrate for physical reasoning that is separate from the pathways used for static visual tasks.
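
Finding 4's circular, multi-dimensional code can be illustrated with a toy sketch. A standard way to probe an angular variable (not necessarily the authors' exact setup) is to regress the pair (cos θ, sin θ) and recover the angle with arctan2; the synthetic features below deliberately spread the circle across many mixed dimensions, so the readout must combine them all.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Synthetic stand-in for intermediate-layer features that embed motion
# direction theta as a circular code mixed across many dimensions.
n, d = 500, 64
theta = rng.uniform(-np.pi, np.pi, size=n)
basis = rng.normal(size=(2, d))  # random mixing: no single "direction" axis
feats = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ basis
feats += 0.05 * rng.normal(size=feats.shape)

# The probe predicts (cos, sin) jointly; the angle is then recovered with
# arctan2, so the decoder necessarily pools many feature dimensions.
targets = np.stack([np.cos(theta), np.sin(theta)], axis=1)
probe = Ridge(alpha=1.0).fit(feats, targets)
pred = probe.predict(feats)
theta_hat = np.arctan2(pred[:, 1], pred[:, 0])
err = np.abs(np.angle(np.exp(1j * (theta_hat - theta))))  # wrapped error
print(f"median angular error (rad): {np.median(err):.3f}")
```

Because the code lives in a subspace rather than a single coordinate, zeroing or shifting any one feature dimension barely moves the decoded angle, matching the paper's observation that coordinated multi-feature interventions are needed to steer direction.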

Implications
The findings collectively argue that modern video transformers do not implement a compact, reusable physics engine. Instead, they develop a distributed, task‑specific representation where scalar motion attributes are readily available, but directional information is encoded as a high‑dimensional circular population code that only appears after a specific depth. Physical reasoning emerges from a combination of these distributed codes and a set of specialized attention heads that perform local spatiotemporal integration.

Consequently, future work on physically aware video models should shift focus from enforcing explicit latent state factorisation toward understanding and possibly controlling these distributed codes. Techniques such as targeted interventions on the identified attention heads or manipulation of the high‑dimensional direction subspace could enable more interpretable or controllable physical predictions, bridging the gap between black‑box performance and scientific insight.
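
The head-level interventions discussed above amount to zeroing selected heads' contributions during the forward pass. The toy multi-head self-attention below (a minimal sketch, not the models' actual architecture, and omitting the output projection for brevity) shows the mechanic: an ablated head's output slice is dropped while all other heads are untouched.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mha(x, Wq, Wk, Wv, n_heads, ablate=()):
    """Toy multi-head self-attention; heads listed in `ablate` are zeroed."""
    T, d = x.shape
    dh = d // n_heads
    out = np.zeros_like(x)
    for h in range(n_heads):
        sl = slice(h * dh, (h + 1) * dh)
        q, k, v = x @ Wq[:, sl], x @ Wk[:, sl], x @ Wv[:, sl]
        attn = softmax(q @ k.T / np.sqrt(dh))  # (T, T) attention weights
        if h not in ablate:  # ablation = drop this head's contribution
            out[:, sl] = attn @ v
    return out

T, d, H = 16, 32, 4
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
full = mha(x, Wq, Wk, Wv, H)
ablated = mha(x, Wq, Wk, Wv, H, ablate=(1,))  # knock out head 1 only
print("max change from ablating head 1:", np.abs(full - ablated).max())
```

In the paper's setup the analogous intervention is applied inside the frozen encoder, and the downstream probe is re-evaluated with and without the targeted heads to measure their causal contribution.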

