Sora as a World Model? A Complete Survey on Text-to-Video Generation
The evolution of video generation from text, from animating MNIST to simulating the world with Sora, has progressed at a breakneck speed. Here, we systematically discuss how far text-to-video generation technology supports essential requirements in world modeling. We curate 250+ studies on text-based video synthesis and world modeling. We then observe that recent models increasingly support spatial, action, and strategic intelligences in world modeling through adherence to completeness, consistency, and invention, as well as human interaction and control. We conclude that text-to-video generation is adept at world modeling, although open problems in several aspects, such as the diversity-consistency trade-off, remain to be addressed.
💡 Research Summary
This paper presents a comprehensive survey of text‑to‑video (T2V) generation with a particular focus on its suitability as a visual world model. The authors begin by framing world modeling within the perception‑action loop of an intelligent agent, identifying three core intelligences—spatial, action, and strategic—that a world model must support. They formalize the requirements of a world model as three properties: completeness (coverage and cohesion of generated content), consistency (entity, compositional, and world‑level plausibility), and invention (creative generation that respects internal consistency).
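As a rough illustration of this framing, the following minimal Python sketch organizes the three properties and their sub‑components as an evaluation checklist; the class and field names (e.g., `WorldModelRequirements`, `entity_consistency`) are hypothetical and not an interface defined by the authors.

```python
from dataclasses import dataclass

# Hypothetical checklist mirroring the survey's three world-model properties;
# the field names are illustrative, not an API defined by the authors.
@dataclass
class WorldModelRequirements:
    # Completeness: coverage and cohesion of the generated content
    coverage: bool = False
    cohesion: bool = False
    # Consistency: entity-, composition-, and world-level plausibility
    entity_consistency: bool = False
    compositional_consistency: bool = False
    world_consistency: bool = False
    # Invention: creative generation that stays internally consistent
    invention: bool = False

    def satisfied(self) -> list[str]:
        """Return the names of the requirements a model currently meets."""
        return [name for name, met in vars(self).items() if met]

# Example: a model with good coverage and stable objects but weak physics.
report = WorldModelRequirements(coverage=True, entity_consistency=True)
print(report.satisfied())  # ['coverage', 'entity_consistency']
```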
To assess the state of the art, the authors employ the PRISMA systematic‑review methodology, collecting over 400 papers from major venues (AAAI, CVPR, ICCV, NeurIPS, etc.) and arXiv up to November 2025. After applying exclusion criteria (non‑text‑conditioned video synthesis, retrieval/editing tasks, unrelated world‑model work, and other surveys), they retain a curated set of more than 250 studies.
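A minimal sketch of that screening step, assuming each candidate paper is represented as a simple record with hypothetical `title` and `tags` fields (the tag vocabulary below is illustrative, not the authors' actual coding scheme):

```python
# Papers matching any exclusion criterion are dropped from the curated set.
EXCLUDE_TAGS = {
    "non-text-conditioned",   # video synthesis without text conditioning
    "retrieval-or-editing",   # retrieval / editing tasks
    "unrelated-world-model",  # world-model work unrelated to video generation
    "survey",                 # other surveys
}

def screen(papers: list[dict]) -> list[dict]:
    """Keep papers that match none of the exclusion criteria."""
    return [p for p in papers if not (set(p.get("tags", [])) & EXCLUDE_TAGS)]

candidates = [
    {"title": "Text-to-video diffusion model", "tags": []},
    {"title": "Video retrieval benchmark", "tags": ["retrieval-or-editing"]},
]
print([p["title"] for p in screen(candidates)])  # ['Text-to-video diffusion model']
```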
The survey then outlines why video, as a spatio‑temporal medium, is a natural substrate for world modeling. Real‑world video data are costly and biased, while video games provide controlled but limited diversity. T2V models can synthesize arbitrary scenarios from textual prompts, effectively generating “imagined worlds” for training and evaluation.
The core technical analysis is organized around the three world‑model properties. Completeness is examined through visual fidelity, temporal coherence, and scene richness. Recent large‑scale models such as OpenAI’s Sora achieve impressive high‑resolution rendering and smooth motion, yet still struggle with long‑range narrative cohesion and multi‑object interactions. Consistency is broken down into entity consistency (stable appearance of objects across frames), compositional consistency (spatial relationships), and world consistency (physics and semantics). The authors note that most diffusion‑based T2V systems maintain entity consistency but often violate physical laws (e.g., gravity, collisions) because they lack an explicit physics engine. Invention concerns the diversity‑consistency trade‑off: encouraging novel scenes can lead to implausible outputs, while enforcing strict consistency yields conservative generations. The survey highlights emerging solutions such as reinforcement‑learning‑driven diversity rewards, latent‑space exploration via lattice sampling, and meta‑learning for domain adaptation.
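The survey itself does not prescribe a metric for these properties, but entity consistency is commonly approximated by comparing per‑frame features from a pretrained image encoder (e.g., CLIP). The sketch below assumes such embeddings have already been extracted and uses random vectors as stand‑ins:

```python
import numpy as np

def entity_consistency(frame_embeddings: np.ndarray) -> float:
    """Mean cosine similarity between consecutive frame embeddings.

    frame_embeddings: (T, D) array, one feature vector per frame, e.g. from
    a pretrained image encoder such as CLIP (not included here).
    Values near 1.0 suggest stable object appearance across frames.
    """
    normed = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    sims = np.sum(normed[:-1] * normed[1:], axis=1)  # cosine sim of (t, t+1) pairs
    return float(sims.mean())

# Toy example with random features standing in for real frame embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 512))
print(round(entity_consistency(frames), 3))
```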
Human‑in‑the‑loop control and strategic intelligence are also discussed. Current models primarily accept a single textual prompt; however, the authors argue for multi‑modal control signals (keyboard, mouse, voice, gesture) and step‑wise verification mechanisms to enable long‑horizon planning, opponent modeling, and rule‑based gameplay generation. They cite early work on conditional diffusion, prompt weighting, and seed manipulation as stepping stones toward richer interaction.
The paper reviews the leading T2V architectures: diffusion‑based models (Make‑A‑Video, Imagen Video), transformer‑based autoregressive models (Phenaki, Video‑LM), and proprietary large‑scale systems such as OpenAI's Sora. Design choices such as temporal diffusion, latent‑space conditioning, and large‑scale multimodal pre‑training are compared. While Sora's training on massive text‑video pairs yields superior visual quality, it still lacks integrated physics simulation and fine‑grained user control.
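To make the shared recipe concrete, here is a deliberately toy sketch of a text‑conditioned, latent‑space reverse‑diffusion loop; `toy_denoiser` and the update rule are placeholders standing in for a learned spatio‑temporal network, not the actual procedure of any system named above:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(z: np.ndarray, t: int, text_emb: np.ndarray) -> np.ndarray:
    """Stand-in for a learned denoiser conditioned on a text embedding.

    In a real T2V system this would be a spatio-temporal U-Net or
    transformer predicting the noise in latent z at timestep t.
    """
    return 0.1 * z + 0.01 * float(text_emb.mean())  # placeholder noise estimate

def sample_latent_video(text_emb: np.ndarray, shape=(8, 4, 16, 16), steps=50):
    """Simplified reverse-diffusion loop over a (frames, ch, h, w) latent."""
    z = rng.normal(size=shape)                     # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps_hat = toy_denoiser(z, t, text_emb)     # predicted noise at step t
        z = z - eps_hat / steps                    # crude denoising update
        if t > 0:
            z = z + 0.01 * rng.normal(size=shape)  # small stochastic term
    return z  # a real pipeline would decode this latent into video frames

latents = sample_latent_video(text_emb=rng.normal(size=(77, 512)))
print(latents.shape)  # (8, 4, 16, 16)
```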
Dataset and evaluation sections critique existing video corpora (HowTo100M, Kinetics, WebVid‑10M) for domain bias and insufficient annotation of physical or strategic aspects. Evaluation metrics remain dominated by visual quality scores (PSNR, FVD) and lack measures for temporal, physical, and strategic consistency. The authors propose a multi‑dimensional benchmark combining perceptual metrics, temporal consistency scores, physics‑engine validation, and human‑in‑the‑loop task performance.
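Of the metrics mentioned, PSNR is straightforward to reproduce, whereas FVD requires a pretrained video network and is omitted here. A minimal per‑frame reference implementation (the toy frames below are illustrative):

```python
import numpy as np

def psnr(reference: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames (higher is better).

    PSNR = 10 * log10(MAX^2 / MSE); it is purely per-frame, so it says
    nothing about temporal, physical, or strategic consistency.
    """
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10((max_val ** 2) / mse)

# Toy example on random 8-bit frames.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64, 3))
gen = np.clip(ref + rng.integers(-5, 6, size=ref.shape), 0, 255)
print(round(psnr(ref, gen), 2))
```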
Finally, the survey outlines future research directions: (1) tighter integration of differentiable physics engines to improve world consistency; (2) reinforcement‑learning frameworks that balance diversity and consistency; (3) development of multi‑modal, step‑wise control interfaces for strategic planning; (4) creation of comprehensive, task‑oriented benchmarks that assess completeness, consistency, and invention holistically.
In conclusion, the authors argue that text‑to‑video generation has rapidly progressed toward fulfilling many requirements of a visual world model, especially in visual fidelity and spatial reasoning. Nevertheless, achieving a full AGI‑level world model will require advances in physical plausibility, diversity management, and interactive control. The survey serves as a roadmap for researchers aiming to bridge the remaining gaps between T2V technology and robust, generalizable world modeling.