ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, redundant modeling of static regions and the lack of deep interaction with trajectories prevent world models from realizing their full potential. In this paper, we propose the Temporal Residual World Model (TR-World), which focuses on dynamic-object modeling. By computing temporal residuals of scene representations, information about dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input and thus predicts the future spatial distribution of dynamic objects more precisely. Combining this prediction with the static-object information contained in the current BEV features yields accurate future BEV features. Furthermore, we propose the Future-Guided Trajectory Refinement (FGTR) module, which enables interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module not only exploits future road conditions to refine trajectories, but also provides sparse spatial-temporal supervision on the future BEV features to prevent world-model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art planning performance. The code is available at https://github.com/mengtan00/ResWorld.git.


💡 Research Summary

The paper introduces ResWorld, a novel end‑to‑end autonomous driving framework that improves planning accuracy by focusing the world model on dynamic objects and by tightly coupling future scene prediction with trajectory refinement. Traditional world‑model‑based approaches predict the entire future bird's‑eye‑view (BEV) scene, redundantly modeling static elements such as road surfaces and buildings, and they typically do not feed the predicted future scene back into the planner. ResWorld addresses these shortcomings with two key components: the Temporal Residual World Model (TR‑World) and the Future‑Guided Trajectory Refinement (FGTR) module.

Temporal Residual Extraction
Multi‑view images from several past timestamps are first transformed into high‑quality BEV features using GeoBEV. All BEV tensors are aligned to the current timestamp’s coordinate frame and concatenated, producing a fused BEV (B_fuse). A TokenLearner extracts a set of sparse scene queries from B_fuse. By subtracting queries of adjacent timestamps, the method obtains temporal residuals that represent only the changes at each spatial location—effectively isolating moving vehicles, pedestrians, and other dynamic agents while canceling static background.
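The residual-extraction step described above can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the TokenLearner is approximated by a 1×1 convolution that predicts one spatial attention map per scene query, and all tensor shapes (channels, query count, grid size) are assumed for demonstration.

```python
import torch
import torch.nn as nn


class TemporalResidualExtractor(nn.Module):
    """Sketch: pool each timestamp's BEV map into S sparse scene queries
    (TokenLearner-style), then take differences between queries of
    adjacent timestamps to isolate dynamic content."""

    def __init__(self, channels: int, num_queries: int):
        super().__init__()
        # One spatial attention map per scene query, predicted from the BEV.
        self.attn = nn.Conv2d(channels, num_queries, kernel_size=1)

    def queries(self, bev: torch.Tensor) -> torch.Tensor:
        # bev: (B, C, H, W) -> queries: (B, S, C)
        w = self.attn(bev).flatten(2).softmax(dim=-1)   # (B, S, H*W)
        feats = bev.flatten(2)                          # (B, C, H*W)
        return torch.einsum("bsn,bcn->bsc", w, feats)

    def forward(self, bev_seq: torch.Tensor) -> torch.Tensor:
        # bev_seq: (B, T, C, H, W), already ego-aligned to the current frame.
        B, T, C, H, W = bev_seq.shape
        q = self.queries(bev_seq.flatten(0, 1)).view(B, T, -1, C)
        # Residuals between adjacent timestamps cancel static background.
        return q[:, 1:] - q[:, :-1]                     # (B, T-1, S, C)
```

Because static structure produces near-identical queries at every timestamp, the subtraction leaves mostly the signal of moving agents.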

TR‑World
The temporal residuals are processed by a stack of self‑attention layers, aggregating information across time to form a compact representation of future dynamic objects (R̂). This representation is then expanded back onto the spatial grid using a TokenFuser, which combines R̂ with the original fused BEV. The resulting future BEV (B_future) retains the exact static layout from the current frame (since static elements are unchanged) and adds the predicted positions of dynamic objects. Consequently, the world model's capacity is devoted solely to modeling motion, leading to higher spatial‑temporal resolution and fewer parameters than full‑scene predictors.
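The TR‑World step can be sketched in the same spirit. Layer counts, head counts, and the TokenFuser approximation (per‑location mixing weights over the predicted tokens) are illustrative assumptions:

```python
import torch
import torch.nn as nn


class TRWorldSketch(nn.Module):
    """Sketch: self-attention aggregates residual tokens across time into
    predicted dynamic tokens R_hat; a TokenFuser-style step scatters R_hat
    back onto the BEV grid and adds it to the current fused BEV."""

    def __init__(self, channels: int, num_heads: int = 2, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(channels, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.fuse_proj = nn.Linear(channels, channels)

    def forward(self, residuals: torch.Tensor, bev_fuse: torch.Tensor) -> torch.Tensor:
        # residuals: (B, T-1, S, C); bev_fuse: (B, C, H, W)
        B, Tm1, S, C = residuals.shape
        r_hat = self.encoder(residuals.reshape(B, Tm1 * S, C))  # predicted dynamics
        H, W = bev_fuse.shape[-2:]
        grid = bev_fuse.flatten(2).transpose(1, 2)              # (B, H*W, C)
        # Per-location mixing weights over tokens (TokenFuser approximation).
        w = torch.einsum("bnc,bmc->bnm", grid, self.fuse_proj(r_hat))
        dyn = torch.einsum("bnm,bmc->bnc", w.softmax(-1), r_hat)
        # Future BEV = unchanged static layout + predicted dynamic update.
        return bev_fuse + dyn.transpose(1, 2).view(B, C, H, W)
```

The additive form makes the division of labor explicit: the current BEV carries the static layout unchanged, and the world model only has to predict the dynamic correction.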

Prior Trajectory Prediction
A set of waypoint queries encodes the desired future timestamps. Cross‑attention between these queries and the fused BEV yields a prior trajectory (T_prior), a sequence of (x, y) coordinates for the ego vehicle.
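A minimal sketch of this head, assuming learned waypoint query embeddings and a small MLP decoder (both illustrative choices, not confirmed details of the paper):

```python
import torch
import torch.nn as nn


class PriorTrajectoryHead(nn.Module):
    """Sketch: learned waypoint queries (one per future timestamp)
    cross-attend to the fused BEV tokens; an MLP decodes each attended
    query into an (x, y) ego waypoint."""

    def __init__(self, channels: int, num_waypoints: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_waypoints, channels))
        self.cross_attn = nn.MultiheadAttention(channels, num_heads=2,
                                                batch_first=True)
        self.decode = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                    nn.Linear(channels, 2))

    def forward(self, bev_fuse: torch.Tensor) -> torch.Tensor:
        # bev_fuse: (B, C, H, W) -> T_prior: (B, N, 2)
        B = bev_fuse.shape[0]
        kv = bev_fuse.flatten(2).transpose(1, 2)        # (B, H*W, C)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        ctx, _ = self.cross_attn(q, kv, kv)
        return self.decode(ctx)
```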

Future‑Guided Trajectory Refinement (FGTR)
FGTR bridges the gap between prediction and planning. Deformable attention is applied between the waypoint queries and B_future, using T_prior as reference points. This operation lets each future waypoint gather contextual information from the predicted scene around its anticipated location, enabling the system to detect imminent collisions or drivable‑area violations before they occur. The refined waypoints are then decoded by an MLP into the final trajectory (T_final). In parallel, the sparse spatial‑temporal supervision derived from the reference points provides a lightweight training signal for B_future, preventing the world model from collapsing into a trivial mapping that ignores scene diversity.
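The refinement step can be sketched as below. For simplicity, the deformable attention is replaced by a single bilinear sample of B_future at each prior waypoint (a deliberate simplification), and `bev_range` is an assumed metric extent used to normalize waypoints into grid coordinates:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FGTRSketch(nn.Module):
    """Sketch: each prior waypoint gathers future-BEV context around its
    anticipated location (one bilinear sample standing in for deformable
    attention), and an MLP predicts a refinement offset."""

    def __init__(self, channels: int, bev_range: float = 50.0):
        super().__init__()
        self.bev_range = bev_range  # metres covered by half the BEV grid
        self.refine = nn.Sequential(nn.Linear(channels + 2, channels), nn.ReLU(),
                                    nn.Linear(channels, 2))

    def forward(self, t_prior: torch.Tensor, bev_future: torch.Tensor) -> torch.Tensor:
        # t_prior: (B, N, 2) in metres; bev_future: (B, C, H, W)
        # Normalize waypoints to grid_sample's [-1, 1] coordinate range.
        grid = (t_prior / self.bev_range).clamp(-1, 1).unsqueeze(2)  # (B, N, 1, 2)
        feats = F.grid_sample(bev_future, grid, align_corners=False)
        feats = feats.squeeze(-1).transpose(1, 2)                    # (B, N, C)
        offset = self.refine(torch.cat([feats, t_prior], dim=-1))
        return t_prior + offset                                      # T_final
```

The sampled locations double as the sparse supervision sites mentioned above: gradients flow into B_future only where the trajectory actually looks, which is what keeps the world model from collapsing to a trivial output.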

Experimental Validation
ResWorld is evaluated on two large‑scale benchmarks: nuScenes and NAVSIM. On nuScenes, it outperforms prior state‑of‑the‑art methods across all reported metrics, including lower L2 displacement error, reduced collision rate, and better long‑horizon (3 s) average error. NAVSIM experiments, which feature complex intersections and high‑speed highway segments, show similar gains, confirming the method’s robustness in diverse environments. Ablation studies demonstrate that (1) removing the static‑object redundancy (i.e., using only temporal residuals) yields a substantial accuracy boost, (2) omitting FGTR degrades final trajectory quality, and (3) the sparse supervision is essential for preventing world‑model collapse. The model achieves roughly 30 % fewer parameters than comparable full‑scene world models while maintaining real‑time inference (>20 FPS).

Contributions and Impact

  1. Dynamic‑only world modeling – By explicitly separating static and dynamic information via temporal residuals, the approach reduces unnecessary computation and improves motion prediction fidelity.
  2. Bidirectional planning‑prediction loop – FGTR makes future scene predictions an active part of the planner rather than a passive proxy task, leading to more informed and safer trajectories.
  3. Stability through sparse supervision – The introduced supervision scheme mitigates the common collapse problem in unsupervised world‑model training, enhancing reliability for large‑scale raw‑data learning.

Limitations and Future Work
The current design assumes static background remains unchanged, which may not hold in construction zones or dynamic map updates. Temporal residuals are computed only between adjacent frames, potentially limiting long‑range motion modeling; multi‑scale or non‑linear residual extraction could address this. FGTR relies on deformable attention, and its scalability to multi‑agent or dense traffic scenarios remains to be explored.

In summary, ResWorld presents a compelling new paradigm for end‑to‑end autonomous driving: a world model that concentrates on dynamic entities via temporal residuals and a planner that actively leverages predicted future scenes for trajectory refinement. The publicly released code and thorough experimental analysis make it a valuable baseline for future research aiming to integrate perception, prediction, and planning more tightly.

