Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning with verifiable rewards (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
Large-scale video world models have emerged as powerful priors for modeling perception and control for embodied agents [2,4,9,17,19]. By learning to predict future observations from past frames and actions, these models approximate the transition dynamics of the physical world, enabling simulation, planning, and policy evaluation. Operating in the pixel domain aligns them with real-world sensors and exploits the vast implicit supervision available in video, allowing unified modeling across domains such as manipulation, driving, and navigation. Yet despite their impressive generative fidelity, these models are often incentivized to capture the appearance of motion more than its structure. Their rollouts remain visually plausible but geometrically and temporally inconsistent: poses drift, depths wobble, and trajectories lose alignment over time. Even subtle deviations in inferred geometry accumulate into compounding spatial errors that corrupt metric structure. These instabilities limit the use of current models for closed-loop tasks such as localization, mapping, and planning, where a physically consistent representation is essential.
We define world model grounding as aligning learned dynamics with physically verifiable spatial and temporal invariants, so that rollouts honor geometry and time in addition to reproducing surface appearance. Grounding shifts the objective of world modeling from visual plausibility to structural consistency, ensuring that the model’s internal dynamics respect the constraints of real motion and scene structure. To this end, we introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that refines pretrained world models using verifiable geometric and perceptual rewards derived from model rollouts. RLWG extends the principle of Reinforcement Learning with Verifiable Rewards (RLVR) from language models [14] to the embodied domain, replacing text-based logical verification with geometric and temporal verification. In RLWG, a pretrained world model is treated as a policy that generates multiple candidate rollouts from the same context; each rollout is automatically scored using verifiable grounding rewards that quantify spatial and temporal coherence, such as pose cycle-consistency, depth reprojection agreement and action adherence. Unlike reconstruction losses that only penalize pixel error, these rewards measure physical correctness of the rollouts.
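To make the scoring step concrete, the sketch below computes two grounding rewards in the spirit described here: a pose cycle-consistency term and a depth-reprojection inlier term. It assumes the relative poses and warped depths have already been produced by frozen off-the-shelf estimators; the function names, the exponential pose penalty, and the inlier threshold `tau` are illustrative assumptions, not the paper's exact reward definitions.

```python
import numpy as np

def pose_cycle_consistency_reward(T_fwd: np.ndarray, T_bwd: np.ndarray) -> float:
    """Reward is high when the forward and backward relative poses (4x4
    homogeneous matrices, e.g. estimated by a frozen pose network on a
    rollout and on its reversed frame order) compose to the identity."""
    residual = T_fwd @ T_bwd                       # ideally the identity transform
    trans_err = np.linalg.norm(residual[:3, 3])    # translation residual
    # geodesic rotation error from the trace of the rotation block
    cos_angle = np.clip((np.trace(residual[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.arccos(cos_angle)
    # simple combined penalty mapped into (0, 1]; the functional form is an assumption
    return float(np.exp(-(trans_err + rot_err)))

def depth_reprojection_reward(depth_t: np.ndarray,
                              depth_next_warped: np.ndarray,
                              tau: float = 0.05) -> float:
    """Inlier ratio: fraction of pixels whose predicted depth at frame t agrees
    with the next frame's depth after warping it into frame t using the
    estimated relative pose."""
    rel_err = np.abs(depth_t - depth_next_warped) / np.maximum(depth_t, 1e-6)
    return float((rel_err < tau).mean())
```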
To optimize these verifiable rewards efficiently, we adopt Group Relative Policy Optimization (GRPO) [21] as our training mechanism, yielding our algorithm, GrndCtrl. For each context (and actions when available), the model generates a group of rollouts that are ranked by their grounding rewards; relative advantages are computed within the group, and the latent transition operator is updated using a clipped policy gradient objective regularized toward the pretrained model. This formulation preserves visual quality while progressively aligning the model's dynamics with measurable structure in the real world. The process requires no human annotations or external simulators, operating entirely through self-supervised reinforcement grounded in the model's own predictions. Conceptually, GrndCtrl extends the success of GRPO-based alignment in generative modeling to the geometric domain, grounding visual world models in verifiable 3D and temporal coherence.
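A minimal sketch of the group-relative update is shown below, assuming the log-probability of each rollout can be evaluated under the current model, the model that sampled the group, and the frozen pretrained reference (how these likelihoods are obtained depends on the underlying generative backbone). The clipping range, KL coefficient, and KL estimator are illustrative choices, not the paper's reported hyperparameters.

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # (G,) rollout log-probs under the current model
              logp_old: torch.Tensor,   # (G,) log-probs under the model that sampled the group
              logp_ref: torch.Tensor,   # (G,) log-probs under the frozen pretrained model
              rewards: torch.Tensor,    # (G,) scalar grounding reward per rollout
              clip_eps: float = 0.2,
              kl_coef: float = 0.05) -> torch.Tensor:
    """Group-relative policy update in the style of GRPO: advantages are the
    rewards standardized within the group, the probability ratio is clipped as
    in PPO, and a KL-style penalty keeps the policy near the pretrained model."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group-relative advantage
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_term = -torch.min(unclipped, clipped).mean()
    # unbiased, non-negative estimator of KL(new || ref), regularizing toward the pretrained model
    kl_term = (torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0).mean()
    return policy_term + kl_coef * kl_term
```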
This paradigm reframes the role of post-training in world modeling. Rather than optimizing for perceptual fidelity or next-frame likelihood, RLWG drives the model toward internal representations that are self-consistent and physically grounded. It establishes a structural analogue to the self-alignment processes that have improved reasoning in large language models: where RLVR grounds language in logic, RLWG grounds world models in geometry. The resulting models are self-grounded, spatially coherent, and dynamically stable, capable not only of rendering the world vividly, but of representing it in actionable, physically consistent form. Through this lens, we move beyond visually coherent generation toward structurally consistent simulation, bridging the gap between generative video modeling and physical world understanding, and opening a path toward world models that can both imagine and inhabit the real world.
The main contributions of this work are:
1. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised grounding framework that uses verifiable geometric and temporal rewards from frozen evaluators, without labels or simulators.
2. We construct GrndCtrl, a method that extends GRPO to the RLWG regime through multi-reward alignment over stochastic rollouts, optimizing Translation, Rotation, Depth Temporal Reprojection Inlier ratio, and perceptual quality with pretrained frozen evaluators (a minimal aggregation sketch follows this list).
3. We provide a comprehensive evaluation of GrndCtrl across multiple datasets, showing reduced pose error means and variances, with strong gains in spatial coherence and navigation stability over supervised fine-tuning.
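The multi-reward alignment referenced in contribution 2 can be realized as a simple weighted combination of the individual reward terms. The sketch below assumes each term has already been normalized into [0, 1] by its frozen evaluator; the term names and the equal-weight default are illustrative assumptions rather than the paper's configuration.

```python
from typing import Dict, Optional

def aggregate_rewards(terms: Dict[str, float],
                      weights: Optional[Dict[str, float]] = None) -> float:
    """Combine per-rollout grounding reward terms (each assumed to lie in
    [0, 1]) into the single scalar used by the group-relative update."""
    weights = weights or {name: 1.0 for name in terms}
    total_w = sum(weights[name] for name in terms)
    return sum(weights[name] * value for name, value in terms.items()) / total_w

# example: aggregating one rollout's (hypothetical) reward terms
reward = aggregate_rewards({
    "translation": 0.8,
    "rotation": 0.9,
    "depth_reprojection_inliers": 0.7,
    "perceptual_quality": 0.85,
})
```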