Cross-View World Models
World models enable agents to plan by imagining future states, but existing approaches operate from a single viewpoint, typically egocentric, even when other perspectives would make planning easier; navigation, for instance, benefits from a bird’s-eye view. We introduce Cross-View World Models (XVWM), trained with a cross-view prediction objective: given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken. Enforcing cross-view consistency acts as geometric regularization: because the input and output views may share little or no visual overlap, predicting across viewpoints forces the model to learn view-invariant representations of the environment’s 3D structure. We train on synchronized multi-view gameplay data from Aimlabs, an aim-training platform providing precisely aligned multi-camera recordings with high-frequency action labels. The resulting model gives agents parallel imagination streams across viewpoints, enabling planning in whichever frame of reference best suits the task while executing from the egocentric view. Our results show that multi-view consistency provides a strong learning signal for spatially grounded representations. Finally, predicting the consequences of one’s actions from another viewpoint may offer a foundation for perspective-taking in multi-agent settings.
💡 Research Summary
The paper addresses a fundamental limitation of current world models: they are trained to predict future states from a single, usually egocentric, viewpoint. The authors propose Cross‑View World Models (XVWM), which are trained with a cross‑view prediction objective. Given a short sequence of frames from any viewpoint together with the executed action, the model must predict the future frame from either the same viewpoint or a different one. Because the input and output views may have little visual overlap, the model is forced to learn view‑invariant, 3D‑aware representations rather than relying on pixel‑level correlations.
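The cross-view objective amounts to sampling (context view, target view) pairs from synchronized recordings. The following sketch illustrates one plausible way to draw such training tuples; the data layout (`episode['frames'][view][t]`, `episode['actions'][t]`) and the function name are assumptions for illustration, not the paper's actual pipeline.

```python
import random

def sample_cross_view_pair(episode, context_len=4, n_views=4):
    """Draw one cross-view training example from a synchronized episode.

    Hypothetical layout: episode['frames'][view][t] holds the frame of
    `view` at time t; episode['actions'][t] holds the action taken at t.
    """
    T = len(episode['actions'])
    t = random.randrange(context_len, T)      # index of the frame to predict
    src = random.randrange(n_views)           # view supplying the context frames
    tgt = random.randrange(n_views)           # view to predict (may equal src)
    context = [episode['frames'][src][k] for k in range(t - context_len, t)]
    action = episode['actions'][t - 1]        # action executed before the target frame
    target = episode['frames'][tgt][t]        # future frame in the target view
    return context, action, (src, tgt), target
```

Because `src` and `tgt` are drawn independently, all 16 view pairs of a four-view setup occur during training, including the same-view case that matches a conventional world-model objective.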
Training data consist of synchronized multi‑camera recordings from the Aimlabs first‑person shooter, providing four aligned views (egocentric, bird’s‑eye view, over‑the‑shoulder, front) at 5 FPS. The architecture builds on the Conditional Diffusion Transformer (CDiT) from NWM, adding a learnable view‑embedding table that is injected into each transformer block via adaptive layer normalization. Three variants are evaluated: a single‑view baseline (egocentric only), a two‑view model (egocentric + BEV), and a four‑view model (all four views).
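The view-embedding injection described above can be sketched as an adaptive layer norm: each view id indexes a learnable embedding, which a small linear head maps to a per-view scale and shift applied after normalization. This is a minimal numpy illustration of the mechanism, not the paper's CDiT implementation; all weights here are randomly initialized stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_views = 8, 4

# Learnable per-view embedding table (randomly initialized stand-in).
view_emb = rng.normal(size=(n_views, d_model))
# Linear head mapping a view embedding to AdaLN scale and shift.
W = rng.normal(size=(d_model, 2 * d_model)) * 0.02
b = np.zeros(2 * d_model)

def adaln(x, view_id):
    """Normalize tokens, then modulate them with a view-conditioned
    scale and shift (one sketch of adaptive layer normalization)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    h = (x - mu) / (sigma + 1e-5)
    scale, shift = np.split(view_emb[view_id] @ W + b, 2)
    return h * (1 + scale) + shift

tokens = rng.normal(size=(3, d_model))   # a few transformer tokens
out = adaln(tokens, view_id=2)           # same tokens, BEV-conditioned output
```

The same token sequence thus produces different activations depending on the requested output view, which is what lets a single backbone serve all input-output view pairs.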
Quantitative results use perceptual similarity metrics (LPIPS, DreamSim) and localization error of the agent marker in the BEV. The two‑view XVWM achieves the best same‑view prediction quality, outperforming the baseline despite seeing fewer egocentric samples, thanks to the complementary information supplied by the BEV marker. The four‑view model scores slightly lower on raw metrics because each view receives fewer training samples, but it can predict any of the 16 possible input‑output view pairs, offering a universal imagination engine.
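The BEV localization error can be read as the mean Euclidean distance between the predicted and ground-truth marker positions on the top-down map. A minimal sketch, assuming positions are given as (x, y) coordinates in map units (the paper's exact extraction of marker positions from generated frames is not specified here):

```python
import numpy as np

def bev_localization_error(pred_xy, true_xy):
    """Mean Euclidean distance between predicted and ground-truth
    agent-marker positions in the BEV (units follow the map)."""
    pred = np.asarray(pred_xy, dtype=float)
    true = np.asarray(true_xy, dtype=float)
    return float(np.linalg.norm(pred - true, axis=-1).mean())

bev_localization_error([[0, 0], [3, 4]], [[0, 0], [0, 0]])  # → 2.5
```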
Qualitatively, the model can infer its global position and orientation in the BEV solely from egocentric visual cues, effectively constructing a cognitive‑map‑like representation. It also learns scale‑consistent transformations: fast egocentric motions correspond to slow marker movements in the BEV, and vice versa. This cross‑view capability suggests applications in navigation, where planning can be performed in the most convenient frame (e.g., a top‑down map) while actions are executed egocentrically, and in multi‑agent settings where agents must reason about what others can see.
Overall, XVWM introduces a novel self‑supervised objective that regularizes world‑model dynamics with geometric consistency, yields view‑invariant 3D representations, and opens the door to flexible, perspective‑aware planning in embodied AI. Future work may integrate explicit camera geometry, extend to continuous action spaces, or couple the model directly with policy networks for end‑to‑end planning.