Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models
Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb-freiburg.github.io/orbis.github.io/.
💡 Research Summary
Robust world models are central to autonomous driving because they predict future driving scenes, which supports safe path planning. Existing world models face two main hurdles: long-horizon prediction, where errors accumulate over time, and generalization to unseen, complex environments. To mitigate these issues, many previous approaches rely on auxiliary inputs such as high-definition maps, depth information, or multi-camera setups to give the model additional context.
In this paper, the authors introduce “Orbis,” a driving world model that tackles these challenges with a deliberately simple design. Orbis requires no additional supervision and no extra sensors such as maps, depth, or multiple cameras; it learns directly from video data. With only 469 million parameters and 280 hours of training video, it achieves state-of-the-art (SOTA) performance. It stands out particularly in difficult scenarios such as turning maneuvers and urban traffic, where existing models often struggle.
A central contribution of this research is a comparison between discrete token models and continuous models based on flow matching. To allow a side-by-side evaluation, the authors set up a hybrid tokenizer that is compatible with both approaches. The study concludes in favor of the continuous autoregressive model, which is more powerful than the model built on discrete tokens. It is also “less brittle,” meaning its performance is less sensitive to individual design choices.
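To make the two routes in this comparison concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy 2-D latent, a hypothetical three-entry codebook, and the straight-line noise-to-data path often used with flow matching; the summary does not specify Orbis's exact formulation. The discrete route maps a latent to the index of its nearest codebook entry, while the continuous route builds a regression target (a velocity) for the same latent.

```python
import random


def quantize(latent, codebook):
    """Discrete route: index of the nearest codebook entry (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))


def flow_matching_pair(noise, latent, t):
    """Continuous route: point on the straight noise-to-data path and its velocity target.

    Assumes the linear path x_t = (1 - t) * noise + t * latent, whose velocity
    is constant and equal to latent - noise.
    """
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, latent)]
    velocity = [x - n for n, x in zip(noise, latent)]
    return x_t, velocity


# Hypothetical values, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latent = [0.9, 0.8]

# Discrete model: the training target is a class index into the codebook.
token = quantize(latent, codebook)

# Continuous model: the training target is a velocity vector at a sampled time t.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in latent]
x_t, velocity = flow_matching_pair(noise, latent, t=0.25)
```

In this sketch the discrete target discards the offset between the latent and its codebook entry, whereas the continuous target keeps the latent exactly: following the velocity from `x_t` for the remaining time `1 - t` lands back on `latent`.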
Overall, Orbis suggests that strong driving world models do not have to depend on additional sensor inputs or supervision; a continuous autoregressive model trained on video alone can be enough. With a comparatively small model and simple design choices, it reports state-of-the-art results, including in difficult scenarios. The code, models, and qualitative results are publicly available, so other researchers can build on these findings.