DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
💡 Research Summary
DreamDojo tackles two major bottlenecks in robot world‑model research: the scarcity of large‑scale, diverse interaction data and the lack of fine‑grained action labels. The authors curate a massive egocentric human‑video dataset (DreamDojo‑HV) comprising 44 711 hours (≈1.18 M trajectories), covering roughly 9 869 unique scenes, 6 015 distinct tasks, and over 43 000 objects. Compared with prior robot‑learning datasets, DreamDojo‑HV is 15× longer, 96× richer in skills, and 2 000× broader in environments, providing a rich source of physical dynamics despite the embodiment gap between humans and robots.
To overcome missing action annotations, the paper introduces continuous latent actions—a self‑supervised proxy that encodes the motion between consecutive frames into a compact latent vector. An "action encoder‑decoder" learns to map frame pairs to these vectors and back, enabling the model to treat heterogeneous video sources (in‑lab, EgoDex, DreamDojo‑HV) uniformly. Relative robot actions are used instead of absolute joint angles, and actions are chunked into four‑frame groups (matching the 4× temporal compression of the WAN2.2 tokenizer) before being injected into the corresponding latent frame. Injecting each chunk only into its own latent frame respects causal ordering, prevents leakage of information from future actions, and dramatically improves controllability.
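The chunking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the chunk size of 4 matches the stated tokenizer compression, but the flattening layout and the 7‑DoF action dimension are assumptions for the example.

```python
import numpy as np

def chunk_actions(actions: np.ndarray, chunk: int = 4) -> np.ndarray:
    """Group per-frame actions into chunks aligned with latent frames.

    actions: (T, D) per-frame action vectors; T must be divisible by `chunk`
             (matching the tokenizer's 4x temporal compression).
    Returns: (T // chunk, chunk * D) -- one flattened action chunk per
             latent frame, ready to be injected as conditioning.
    """
    T, D = actions.shape
    assert T % chunk == 0, "pad or trim the trajectory to a multiple of the chunk size"
    return actions.reshape(T // chunk, chunk * D)

# 16 video frames with hypothetical 7-DoF relative actions -> 4 latent frames
acts = np.random.randn(16, 7)
chunks = chunk_actions(acts)
print(chunks.shape)  # (4, 28)
```

Each row of the output conditions exactly one latent frame, so no latent frame ever sees actions from its future.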
The backbone is built on Cosmos‑Predict2.5, a latent‑video diffusion model that predicts future frames conditioned on text, timestep embeddings, and the latent actions. Training employs a flow‑matching loss that aligns the denoiser’s output with ground‑truth velocity fields, ensuring precise pixel‑level dynamics. After pre‑training on the massive human video corpus, DreamDojo undergoes post‑training on a small target‑robot dataset (e.g., GR‑1). Only the action‑conditioning layer is re‑initialized, allowing the model to quickly adapt to the robot’s specific joint space while preserving the physics learned from human videos.
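The flow‑matching objective mentioned above can be sketched in a few lines. This is a generic conditional flow‑matching loss on a linear interpolation path (a standard formulation), not the exact loss from Cosmos‑Predict2.5; the toy "perfect denoiser" exists only to show that the loss vanishes when the predicted velocity matches the true one.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x0, x1, t, predict_velocity):
    """Conditional flow-matching loss on a linear probability path.

    Interpolate x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1;
    the regression target is the constant velocity v = x1 - x0.
    """
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = predict_velocity(xt, t)
    return float(np.mean((v_pred - v_target) ** 2))

x0 = rng.standard_normal((8, 16))   # noise latents
x1 = rng.standard_normal((8, 16))   # "clean" video latents (toy stand-in)
t = rng.uniform(size=(8, 1))        # per-sample timesteps in [0, 1]

# A hypothetical perfect model outputs the true velocity, so the loss is zero.
perfect = lambda xt, t: x1 - x0
print(flow_matching_loss(x0, x1, t, perfect))  # 0.0
```

In the real model the velocity predictor is the action- and text-conditioned diffusion transformer rather than a closed-form lambda.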
Real‑time inference is achieved through a knowledge‑distillation pipeline inspired by Self‑Forcing. A lightweight student network is trained to mimic the teacher (the large pre‑trained model) while operating at 640 × 480 resolution and 10.81 FPS for arbitrarily long horizons. Because the student is trained to stay consistent over short temporal contexts, distillation also curbs error accumulation during long autoregressive rollouts, improving long‑horizon consistency.
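The short‑context rollout idea can be illustrated schematically. This sketch is not the distilled model itself; the toy `step` dynamics and the context length of 8 are invented for the example. The point is structural: by conditioning only on the most recent frames, per‑step cost stays constant regardless of horizon.

```python
import numpy as np

def rollout(step_fn, first_frames, actions, context: int = 8):
    """Autoregressive rollout with a fixed short context window.

    step_fn(ctx, a) -> next latent frame; only the most recent `context`
    frames are kept, so memory and compute per step do not grow with
    the rollout length.
    """
    frames = list(first_frames)
    for a in actions:
        ctx = np.stack(frames[-context:])   # sliding window, never the full history
        frames.append(step_fn(ctx, a))
    return np.stack(frames)

# Toy dynamics stand-in: next frame = mean of context + action
step = lambda ctx, a: ctx.mean(axis=0) + a
init = [np.zeros(4)]
acts = [np.ones(4) * 0.1] * 100
traj = rollout(step, init, acts)
print(traj.shape)  # (101, 4)
```

In the distilled student, `step_fn` would be a single-pass network call, which is what makes the 10.81 FPS live setting feasible.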
Extensive evaluation covers a suite of out‑of‑distribution (OOD) benchmarks: unseen objects, novel scenes, and complex contact‑rich tasks. DreamDojo demonstrates superior zero‑shot generalization, accurate physical predictions, and fine‑grained action controllability compared to prior world‑model baselines. Downstream applications include live teleoperation (the model predicts future frames in real time to guide a human operator), policy evaluation (rapid simulation‑based scoring of candidate policies before real‑world deployment), and model‑based planning (generating action sequences that lead to a desired future state). In all cases, the model’s predictions align closely with real robot outcomes, reducing the sim‑to‑real gap.
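The model‑based planning use case can be sketched with random‑shooting search, one of the simplest ways to plan through a learned world model. The paper does not specify its planner; this is a generic illustration in which the world model, the integrator dynamics, and all hyperparameters are stand‑ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_random_shooting(world_model, state, goal,
                         horizon=5, n_samples=256, act_dim=2):
    """Sample candidate action sequences, simulate each with the world
    model, and keep the one whose predicted final state is closest to
    the desired goal state."""
    best_cost, best_seq = np.inf, None
    for _ in range(n_samples):
        seq = rng.standard_normal((horizon, act_dim)) * 0.5
        s = state
        for a in seq:
            s = world_model(s, a)           # imagined next state
        cost = np.linalg.norm(s - goal)     # distance of predicted outcome to goal
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

# Toy world model: simple integrator dynamics (a stand-in for DreamDojo)
wm = lambda s, a: s + a
seq, cost = plan_random_shooting(wm, np.zeros(2), np.array([1.0, 1.0]))
print(round(cost, 3))
```

With a generative video world model, the "state" would be latent frames and the cost could be a learned or goal‑image distance; the search structure stays the same.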
The paper acknowledges limitations: residual embodiment mismatch can cause errors in high‑speed or high‑force contacts; the latent actions lack explicit semantic interpretability, making human‑robot explanations harder; and video quality varies across the massive dataset. Future work is suggested on multimodal extensions (audio, force sensing), automated quality control, and tighter integration of latent‑action semantics.
In summary, DreamDojo establishes a new paradigm: leveraging massive, unlabeled human egocentric videos to pre‑train a robot world model, using continuous latent actions as a universal proxy for behavior, and refining the model through lightweight post‑training and distillation to achieve real‑time, generalizable, and controllable simulation of open‑world, contact‑rich robot tasks. This work significantly lowers data‑collection costs and opens pathways toward truly generalist robot intelligence.