Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms


Authors: Muyang He, Hanzhong Guo, Junxiong Lin, Yizhou Yu

Muyang He⋆, Hanzhong Guo⋆, Junxiong Lin⋆, Yizhou Yu†, Fellow, IEEE

Abstract—The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap still remains between the theoretical capacity for world simulation and the heavy computational costs of spatiotemporal modeling. To address this, we comprehensively and systematically review video generation frameworks and techniques that consider efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy in three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.

Index Terms—Video Generation, World Models, Interactive Simulation, Diffusion Models, Embodied AI.

I. INTRODUCTION

In the rapidly evolving landscape of generative artificial intelligence, video generation has received remarkable attention due to its potential to simulate complex world dynamics. This field has undergone a transformative journey, progressing from early generative adversarial networks (GANs) [1], [2] and pixel-level auto-regressive (AR) models [3], [4] to high-fidelity diffusion-based approaches [5]–[13], and more recently to large-scale architectures that function as "World Simulators" capable of modeling physical laws and long-horizon causalities [14], [15].
This progression marks a substantial leap in generative capabilities, enabling models not only to synthesize visual content but also to understand and predict the underlying physics of the environment, thereby paving the way for AGI [16], [17]. To fully appreciate this leap, it is essential to understand how video generation can achieve world modeling. The concept of world modeling seeks to move beyond simple pattern matching toward a fundamental understanding of environmental dynamics. A world model is generally defined as an internal representation of environmental dynamics that enables the prediction of future states based on historical contexts and, optionally, actions [16]. In the context of visual synthesis, video-based world models treat the generative process as a simulation of the physical world, where the objective is to model the underlying causal mechanisms such as gravity, collision, and object permanence rather than just pixel transitions. Mathematically, this can be viewed as learning the transition function P(s_{t+1} | s_t, a_t), where s represents the state (video frames or latents) and a represents the conditions or actions (e.g., text prompts or camera trajectories). As emphasized in the development of Sora [14], scaling video generation models leads to the emergence of simulation capabilities, where the model demonstrates an initial comprehension of physical laws without explicit hard-coding.

Muyang He and Yizhou Yu are with the School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China, and the Hong Kong Generative AI Research and Development Center, Hong Kong SAR, China (E-mail: muyanghe@connect.hku.hk; yizhouy@acm.org). Hanzhong Guo and Junxiong Lin are with the School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China (E-mail: hanzhong@connect.hku.hk; junxionglin26@outlook.com). ⋆ Equal contribution. † Corresponding author.
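The transition-function view P(s_{t+1} | s_t, a_t) can be made concrete with a minimal interface sketch. Everything here is illustrative: `TransitionModel` and `rollout` are our own names, and the toy one-dimensional dynamics stand in for what would, in a real world model, be a learned neural network over video latents.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TransitionModel:
    # step maps (state, action) -> next state; in a learned world model this
    # would be a neural network operating on video frames or latents.
    step: Callable[[float, float], float]

    def rollout(self, s0: float, actions: List[float]) -> List[float]:
        """Imagine future states by repeatedly applying the transition."""
        states = [s0]
        for a in actions:
            states.append(self.step(states[-1], a))
        return states

# Toy dynamics: state is a 1-D position, action is a velocity command.
model = TransitionModel(step=lambda s, a: s + 0.1 * a)
traj = model.rollout(s0=0.0, actions=[1.0, 1.0, -2.0])
```

The point of the interface is that prediction is conditioned only on the current state and action, which is exactly the Markovian structure the transition function above expresses.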
This alignment between video generation and world modeling offers several advantages:

Emergent Physics: Large-scale training on diverse video data allows models to learn complex interactions, such as agent-environment interactions or fluid dynamics, which are difficult to model via traditional analytical engines.

Latent Imagination: Modern world models often operate in compact latent spaces [16], [17], allowing the imagination of future scenarios to occur at a lower computational cost than high-resolution pixel rendering. This inherently links the concept of world modeling to computational efficiency.

Unified Reasoning: By treating video generation as world modeling, the same architecture can be applied to diverse domains ranging from media production to autonomous driving [18], [19] and robotic manipulation [20], where the model acts as a general-purpose simulator for decision-making.

Despite this immense conceptual potential, realizing the capabilities of video-based world models requires overcoming severe hardware limitations. Video generators serving as world simulators must possess diverse capabilities, such as maintaining long-term spatiotemporal consistency, adhering to physical constraints, and supporting high-resolution interactive generation [20], [21]. However, due to the high dimensionality of video data and the complexity of physically based dynamics, these models face massive computational cost and memory consumption. For example, auto-regressive models must manage growing key-value (KV) caches to prevent memory explosion during long-sequence generation [22], [23]. Diffusion models, while powerful, require efficient sampling strategies to overcome the latency of iterative denoising. In addition, the vast redundancy in video frames must be reduced so that useful semantic information can be retained without overwhelming hardware costs [24], [25].
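A back-of-envelope calculation illustrates the KV-cache pressure mentioned above: the memory needed to cache keys and values grows linearly with the number of tokens in context. Every model dimension below is an illustrative assumption, not a figure from any system in the literature.

```python
# Back-of-envelope KV-cache size for an autoregressive video model.
# All dimensions here (layers, heads, latent grid) are illustrative.

def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    """Memory for keys + values across all layers, fp16 by default."""
    return 2 * layers * heads * head_dim * tokens * bytes_per_elem

# Suppose each 24-frame clip is tokenized into a 24 x 32 x 32 latent grid.
tokens_per_clip = 24 * 32 * 32            # 24,576 tokens
cache = kv_cache_bytes(layers=32, heads=32, head_dim=128,
                       tokens=tokens_per_clip)
print(f"{cache / 2**30:.1f} GiB per clip of context")  # -> 12.0 GiB
```

Even under these modest assumptions, a single clip of context already costs on the order of tens of gigabytes, which is why KV-cache management is treated as a first-class efficiency concern for long-sequence video generation.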
Moreover, under high-resolution settings, suitable parallel computing topologies must be determined so that devices can distribute the workload effectively. Without efficiency optimization, traditional video generators struggle to scale or interact in real time. Therefore, given the abundant redundancy in video data, efficient architectures and algorithms emerge as promising ways to address the aforementioned challenges, transforming heavy and slow generative processes into agile and scalable forms that are amenable to practical deployment.

[Fig. 1 here. Caption: A taxonomy of representative topics related to efficiency improvement for video generation-based world models. Top-level branches: Efficient Modeling (Sec. III), Efficient Architecture (Sec. IV), Efficient Inference (Sec. V), Applications (Sec. VI), and More Related Work (Sec. VII).]

Taxonomy. As shown in Figure 1, this article systematically investigates the role of efficiency in the aspects of modeling, architectures, and algorithms for video-based world models, covering the spectrum between AR-based and diffusion-based paradigms. Our discussion is structured around three core dimensions: Efficient Modeling (covering efficiency-oriented modeling paradigms), Efficient Architectures (designs such as VAEs, memory mechanisms, and efficient attention), and Efficient Inference (system deployment considerations including parallelism, caching, pruning, and quantization).
Furthermore, this article also explores how these efficient models are used in downstream application scenarios, such as autonomous driving, embodied AI, and games/interactive simulation. By reviewing comprehensive insights in this rapidly evolving field, we aim to catalyze new advances in video-based world models that leverage efficient computing to tackle increasingly sophisticated simulation challenges.

Within the existing literature, previous studies have primarily explored general video generation or specific diffusion-model-based techniques. More recently, amid the significant advances in Sora-like models [14], some works have begun to address the computational demands of video generation. However, a systematic review specifically elucidating how efficiency improvement techniques can benefit a video-based world model is notably absent. To the best of our knowledge, this article presents the first systematic exploration dedicated to the intersection of efficiency improvement techniques and the multiple facets of video-based world models. The main contributions of this paper are summarized as follows:

• We provide the first comprehensive review of the critical intersection between efficiency improvement techniques and video-based world models.
• We introduce a novel taxonomy that provides a structured perspective on efficiency across three dimensions: modeling paradigms, architectural designs, and inference optimizations.
• We detail how these efficiency improvement techniques empower critical applications such as autonomous driving, embodied AI, and interactive simulation.
• We further discuss key challenges and future opportunities in efficient video-based world modeling.

The remainder of this paper is organized as follows. We introduce background knowledge on video generative paradigms and foundations in Section II. Next, a review of efficient modeling paradigms is given in Section III.
Efficient architectures and inference algorithms are presented in detail in Section IV and Section V, respectively. In addition, promising applications and more related works are discussed in Section VI and Section VII. Finally, a summary of this paper is presented in Section VIII.

II. BACKGROUND

The field of video generation has evolved from modeling short, low-resolution transitions to simulating complex world dynamics. To understand the challenges of efficient video modeling, it is essential to first comprehend the foundational paradigms of image generation and how these paradigms are extended to the temporal dimension to generate videos. This chapter outlines the mathematical principles and architectural innovations that define modern video generation.

A. Generative Paradigms

Modern video generation models are largely built upon paradigms established in image synthesis. We introduce the mathematical formulations of these generative models, focusing on Diffusion Models and Flow Matching as the current dominant approaches, followed by auto-regressive models.

1) Denoising Diffusion Probabilistic Models (DDPM): Diffusion models [26] formulate generation as a denoising process. To improve efficiency, most state-of-the-art models operate in the latent space of a pre-trained variational autoencoder (VAE), known as Latent Diffusion Models (LDMs) [6].

Forward Process. Given a data sample x_0 ~ q(x_0) (or its latent representation z_0), the forward process is a fixed Markov chain that gradually adds Gaussian noise according to a variance schedule β_t ∈ (0, 1). The transition probability is defined as:

q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right) \quad (1)

Using the notation α_t = 1 − β_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, we can sample x_t at any timestep t directly from x_0:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I) \quad (2)

Reverse Process and Training. The generative process reverses this noise addition.
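The closed-form sample in Eq. (2) is what makes diffusion training tractable: x_t is produced in one step rather than by iterating the Markov chain of Eq. (1). A minimal numerical sketch follows; the linear β schedule and toy tensor shapes are illustrative choices, not prescribed by the text, and `q_sample` is our own name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # variance schedule beta_t (illustrative)
alphas_bar = np.cumprod(1.0 - betas)     # abar_t = prod_{s<=t} (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))         # toy "clean" sample
eps = rng.standard_normal((4, 8))        # Gaussian noise

def q_sample(x0, t, eps):
    """One-step forward sample of Eq. (2), no chain iteration needed."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x_t = q_sample(x0, t=500, eps=eps)
# As t -> T, abar_t -> 0, so x_t approaches pure Gaussian noise.
```

During training, a (t, ε) pair is drawn per example and the network regresses ε from x_t, exactly the objective in Eq. (3) below.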
Since the true posterior q(x_{t-1} | x_t) is intractable, we approximate it with a parameterized distribution p_θ(x_{t-1} | x_t). In practice, the model is trained to predict the added noise ε or the velocity v. The simplified training objective is often the mean squared error (MSE) between the actual noise ε and the predicted noise ε_θ:

\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0, t, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right] \quad (3)

Once trained, the model generates data by iteratively denoising pure Gaussian noise x_T to x_0.

2) Flow Matching: While DDPMs rely on the pre-defined forward process in Eq. (2), which transports samples through a fixed and typically curved noising trajectory, Flow Matching (FM) [27], [28] instead models generation as a continuous-time probability path governed by ordinary differential equations (ODEs). FM defines a probability density path p_t that transforms a simple prior distribution into the data distribution through a time-dependent vector field v_t(x):

\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)), \qquad \phi_0(x) = x \quad (4)

where φ_t maps samples from t = 0 to t. The goal is to learn a parameterized vector field v_θ(x, t) that matches the target velocity field associated with the chosen probability path. Since directly regressing the marginal target velocity field is generally intractable for complex data distributions, flow matching is commonly implemented in a conditional form. Given a source sample z_0 and a target data sample x_1, one defines a conditional probability path p_t(x | z_0, x_1) together with a tractable conditional target vector field u_t(x | z_0, x_1). The resulting conditional flow matching (CFM) objective is

\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t, z_0, x_1, x_t \sim p_t(\cdot \mid z_0, x_1)}\left[ \| v_\theta(x_t, t) - u_t(x_t \mid z_0, x_1) \|^2 \right] \quad (5)

In common straight-line path formulations, the conditional path is chosen as a linear interpolation between noise z_0 and data x_1, namely x_t = t x_1 + (1 − t) z_0.
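Under this straight-line path, both the interpolant and its velocity are trivial to compute, which is a large part of FM's practical appeal. A minimal numerical sketch (toy dimensions and the name `interpolant` are our own; the finite-difference check is only for illustration):

```python
import numpy as np

# Linear (straight-line) conditional path:
#   x_t = t * x_1 + (1 - t) * z_0, whose time derivative is x_1 - z_0.
rng = np.random.default_rng(0)
z0 = rng.standard_normal(6)     # source noise sample
x1 = rng.standard_normal(6)     # data sample

def interpolant(t: float) -> np.ndarray:
    return t * x1 + (1.0 - t) * z0

# The path's velocity does not depend on t: a finite difference at any
# point on the path recovers the same vector x_1 - z_0.
dt = 1e-6
finite_diff = (interpolant(0.5 + dt) - interpolant(0.5)) / dt
target_velocity = x1 - z0
```

The regression target for v_θ is therefore the same vector at every t along a given (z_0, x_1) pair, which is what makes the straight-line objective so simple.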
In this case, the target velocity becomes a constant, i.e., u_t = x_1 − z_0, and the objective reduces to

\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t, z_0, x_1}\left[ \| v_\theta(x_t, t) - (x_1 - z_0) \|^2 \right] \quad (6)

3) Auto-regressive (AR) Models: AR models decompose the joint probability distribution of a sequence x into a product of conditional probabilities. In a canonical visual generation formulation, x represents a flattened sequence of discrete visual tokens derived from a VQ-VAE-style tokenizer [29], where an encoder maps patches or frames to continuous latents that are snapped to a learned finite codebook via nearest-neighbor vector quantization (VQ), although more general autoregressive video models may also operate on other compressed latent token sequences. For a sequence of length N:

p(x) = \prod_{i=1}^{N} p(x_i \mid x_{<i})
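The chain-rule factorization can be checked numerically with a toy first-order (bigram) AR model, where each conditional p(x_i | x_{<i}) simplifies to p(x_i | x_{i-1}). The 3-symbol vocabulary and the transition matrix are arbitrary illustrative choices, not from the text.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])              # p(x_1)
trans = np.array([[0.6, 0.3, 0.1],             # row i: p(x_t = . | x_{t-1} = i)
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7]])

def sequence_prob(tokens):
    """p(x) as a product of per-token conditionals (Markov simplification)."""
    p = prior[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        p *= trans[prev, cur]
    return p

p = sequence_prob([0, 1, 2])   # 0.5 * 0.3 * 0.3 = 0.045
```

Summing `sequence_prob` over all sequences of a fixed length yields 1, confirming the factorization defines a valid distribution; a full AR video model replaces the bigram table with a network conditioned on the entire token prefix.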