How Much 3D Do Video Foundation Models Encode?

Reading time: 5 minutes

📝 Abstract

Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.

📄 Content

How Much 3D Do Video Foundation Models Encode?

Zixuan Huang¹* Xiang Li¹* Zhaoyang Lv² James M. Rehg¹
¹University of Illinois at Urbana-Champaign, ²Impossible, Inc.
*Both authors contributed equally to this work.
https://vidfm-3d-probe.github.io/

Figure 1. We study the emergence of 3D in video foundation models by probing their features (from WAN, OpenSora2, CogVideoX, and V-JEPA) with 3D reconstruction tasks (depth maps, camera poses, 3D points). Our study reveals that state-of-the-art video generators develop strong 3D understanding, even compared to 3D experts, despite being trained only on 2D data.

1. Introduction

Recovering 3D structure from 2D visual observations is a long-standing research problem in computer vision, with broad applications in AR/VR and embodied AI. Despite significant progress, the availability of high-quality 3D data at scale remains the bottleneck for current data-driven approaches. This fundamentally limits the scaling of 3D foundation models and makes it questionable whether we can learn truly generalizable models primarily from 3D data.

Compared to native 3D assets, videos are much easier to acquire at scale, with multiple large curated datasets already available [1, 4, 8, 35]. The diversity and complexity of video data, together with the fact that videos are 2D projections of 3D worlds, make them a promising pathway for scalable 3D learning. Recent works study how to utilize video models for 3D, either by adding 3D control [3, 17, 18, 44, 56] or by producing 3D caches/estimations [15, 21, 22, 25, 29, 32, 33, 41, 47, 55, 60, 63, 64] alongside the original frame-synthesis target. These works suggest that video priors are useful for 3D, but 3D-inconsistency artifacts, the requirement of 3D fine-tuning, and task-specific engineering leave it unclear whether video data alone can induce strong 3D awareness in a general-purpose setting. These confounds motivate a direct, model-agnostic evaluation.

In this paper, we present the first model-agnostic framework to probe the 3D awareness of video foundation models (VidFMs) pretrained on large-scale video data. We ask whether VidFMs develop internal representations of 3D structure and ego-motion and, if so, how strong and practically useful these representations are. We operationalize this question along four axes: 1) Extent: How does the 3D awareness of VidFMs compare to that of image models or specialized 3D models? 2) Factor: Which factors impact 3D awareness? Here, we focus on the effects of temporal reasoning, 3D fine-tuning, and model scaling. 3) Localization: In which network layers, and at which timesteps in diffusion models, is this 3D information most concentrated? 4) Implication: Under limited resources (3D data and compute), can VidFM features be practically useful for 3D reconstruction tasks?

We posit that if a video model understands 3D worlds, it should be feasible to extract accurate 3D properties using shallow readout modules in a feedforward manner, without any post-optimization or fine-tuning of the base model. Unlike prior works that evaluate image models using depth and cross-view consistency [12], or per-scene optimization with off-the-shelf initialization [9], our shallow feedforward readouts, which estimate different 3D attributes from VidFMs' feature space, are a more direct probe of globally consistent 3D properties in pretrained video models.

Specifically, we extract frozen spatiotemporal features from VidFMs and design a probe model that predicts 3D points, camera poses, and depth maps from these features. The probe model is a shallow VGGT [51]-like transformer, consisting of four alternating-attention layers and three read-out heads: two dense prediction heads for 3D points and depth maps, and one camera head. We train the probe model on top of various video features, including features extracted from self-supervised video models and video generation models of different performance and sizes. We measure the performance of po…
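To make the probe design above concrete, here is a minimal PyTorch sketch of a VGGT-style probe over frozen VidFM features. It is a sketch under stated assumptions, not the paper's exact configuration: the (B, T, N, C) token layout, hidden width, head count, and 9-D camera parametrization are illustrative, and the two dense heads are reduced to per-token linear layers where a full implementation would use dense-prediction decoders that upsample tokens back to pixels.

```python
# Minimal sketch of a shallow VGGT-style 3D probe on frozen VidFM features.
# Assumed feature layout: (B, T, N, C) = batch, frames, tokens/frame, channels.
import torch
import torch.nn as nn


class AlternatingAttentionBlock(nn.Module):
    """One alternating-attention layer: frame-wise self-attention,
    then global self-attention across the tokens of all frames."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, T, N, C)
        B, T, N, C = x.shape
        # Frame-wise attention: tokens attend only within their own frame.
        f = x.reshape(B * T, N, C)
        h = self.norm1(f)
        f = f + self.frame_attn(h, h, h, need_weights=False)[0]
        # Global attention: all T*N tokens of the clip attend to each other.
        g = f.reshape(B, T * N, C)
        h = self.norm2(g)
        g = g + self.global_attn(h, h, h, need_weights=False)[0]
        return g.reshape(B, T, N, C)


class Shallow3DProbe(nn.Module):
    """Four alternating-attention layers + three read-out heads
    (3D points, depth maps, per-frame camera poses)."""

    def __init__(self, feat_dim: int, dim: int = 512, pose_dim: int = 9):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)  # map frozen VidFM features in
        self.blocks = nn.ModuleList(
            AlternatingAttentionBlock(dim) for _ in range(4)
        )
        # Dense heads, reduced here to per-token linear read-outs.
        self.point_head = nn.Linear(dim, 3)   # (x, y, z) per token
        self.depth_head = nn.Linear(dim, 1)   # depth per token
        # Camera head: one pose vector per frame from mean-pooled tokens
        # (pose_dim = 9 is an assumed translation/rotation parametrization).
        self.camera_head = nn.Linear(dim, pose_dim)

    def forward(self, feats):  # feats: (B, T, N, feat_dim), kept frozen
        x = self.proj(feats)
        for blk in self.blocks:
            x = blk(x)
        points = self.point_head(x)              # (B, T, N, 3)
        depth = self.depth_head(x)               # (B, T, N, 1)
        poses = self.camera_head(x.mean(dim=2))  # (B, T, pose_dim)
        return points, depth, poses


if __name__ == "__main__":
    feats = torch.randn(2, 8, 256, 1024)  # stand-in for frozen VidFM features
    probe = Shallow3DProbe(feat_dim=1024)
    points, depth, poses = probe(feats)
    print(points.shape, depth.shape, poses.shape)
```

The alternating pattern follows the VGGT recipe cited above: frame-wise attention models per-frame structure cheaply, while the interleaved global attention lets the probe aggregate evidence across frames, which is what globally consistent points and camera poses require.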

This content is AI-processed based on arXiv data.
