Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms


Authors: Muyang He, Hanzhong Guo, Junxiong Lin, Yizhou Yu

Muyang He⋆, Hanzhong Guo⋆, Junxiong Lin⋆, Yizhou Yu†, Fellow, IEEE

Abstract—The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap still remains between the theoretical capacity for world simulation and the heavy computational costs of spatiotemporal modeling. To address this, we comprehensively and systematically review video generation frameworks and techniques that consider efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy in three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.

Index Terms—Video Generation, World Models, Interactive Simulation, Diffusion Models, Embodied AI.

I. INTRODUCTION

In the rapidly evolving landscape of generative artificial intelligence, video generation has received remarkable attention due to its potential to simulate complex world dynamics. This field has undergone a transformative journey, progressing from early generative adversarial networks (GANs) [1], [2] and pixel-level auto-regressive (AR) models [3], [4] to high-fidelity diffusion-based approaches [5]–[13], and more recently to large-scale architectures that function as "World Simulators" capable of modeling physical laws and long-horizon causalities [14], [15].
This progression marks a substantial leap in generative capabilities, enabling models not only to synthesize visual content but also to understand and predict the underlying physics of the environment, thereby paving the way for AGI [16], [17]. To fully appreciate this leap, it is essential to understand how video generation can achieve world modeling. The concept of world modeling seeks to move beyond simple pattern matching toward a fundamental understanding of environmental dynamics. A world model is generally defined as an internal representation of environmental dynamics that enables the prediction of future states based on historical contexts and, optionally, actions [16]. In the context of visual synthesis, video-based world models treat the generative process as a simulation of the physical world, where the objective is to model the underlying causal mechanisms such as gravity, collision, and object permanence rather than just pixel transitions. Mathematically, this can be viewed as learning the transition function P(s_{t+1} | s_t, a_t), where s represents the state (video frames or latents) and a represents the conditions or actions (e.g., text prompts or camera trajectories). As emphasized in the development of Sora [14], scaling video generation models leads to the emergence of simulation capabilities, where the model demonstrates an initial comprehension of physical laws without explicit hard-coding.

Muyang He and Yizhou Yu are with the School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China, and the Hong Kong Generative AI Research and Development Center, Hong Kong SAR, China (E-mail: muyanghe@connect.hku.hk; yizhouy@acm.org). Hanzhong Guo and Junxiong Lin are with the School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China (E-mail: hanzhong@connect.hku.hk; junxionglin26@outlook.com). ⋆ Equal contribution. † Corresponding author.
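The transition-function view P(s_{t+1} | s_t, a_t) can be made concrete with a minimal interface sketch. Everything here is illustrative: `TransitionModel` and `rollout` are our own names, and the toy one-dimensional dynamics stand in for what would, in a real world model, be a learned neural network over video latents.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TransitionModel:
    # step maps (state, action) -> next state; in a learned world model this
    # would be a neural network operating on video frames or latents.
    step: Callable[[float, float], float]

    def rollout(self, s0: float, actions: List[float]) -> List[float]:
        """Imagine future states by repeatedly applying the transition."""
        states = [s0]
        for a in actions:
            states.append(self.step(states[-1], a))
        return states

# Toy dynamics: state is a 1-D position, action is a velocity command.
model = TransitionModel(step=lambda s, a: s + 0.1 * a)
traj = model.rollout(s0=0.0, actions=[1.0, 1.0, -2.0])
```

The point of the interface is that prediction is conditioned only on the current state and action, which is exactly the Markovian structure the transition function above expresses.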
This alignment between video generation and world modeling offers several advantages:

Emergent Physics: Large-scale training on diverse video data allows models to learn complex interactions, such as agent-environment interactions or fluid dynamics, which are difficult to model via traditional analytical engines.

Latent Imagination: Modern world models often operate in compact latent spaces [16], [17], allowing the imagination of future scenarios to occur at a lower computational cost than high-resolution pixel rendering. This inherently links the concept of world modeling to computational efficiency.

Unified Reasoning: By treating video generation as world modeling, the same architecture can be applied to diverse domains ranging from media production to autonomous driving [18], [19] and robotic manipulation [20], where the model acts as a general-purpose simulator for decision-making.

Despite this immense conceptual potential, realizing the capabilities of video-based world models requires overcoming severe hardware limitations. Video generators serving as world simulators must possess diverse capabilities, such as maintaining long-term spatiotemporal consistency, adhering to physical constraints, and supporting high-resolution interactive generation [20], [21]. However, due to the high dimensionality of video data and the complexity of physically based dynamics, these models face massive computational cost and memory consumption. For example, auto-regressive models must manage growing key-value (KV) caches to prevent memory explosion during long-sequence generation [22], [23]. Diffusion models, while powerful, require efficient sampling strategies to overcome the latency of iterative denoising. In addition, the vast redundancy in video frames must be reduced so that useful semantic information can be retained without overwhelming hardware costs [24], [25].
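A back-of-envelope calculation illustrates the KV-cache pressure mentioned above: the memory needed to cache keys and values grows linearly with the number of tokens in context. Every model dimension below is an illustrative assumption, not a figure from any system in the literature.

```python
# Back-of-envelope KV-cache size for an autoregressive video model.
# All dimensions here (layers, heads, latent grid) are illustrative.

def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    """Memory for keys + values across all layers, fp16 by default."""
    return 2 * layers * heads * head_dim * tokens * bytes_per_elem

# Suppose each 24-frame clip is tokenized into a 24 x 32 x 32 latent grid.
tokens_per_clip = 24 * 32 * 32            # 24,576 tokens
cache = kv_cache_bytes(layers=32, heads=32, head_dim=128,
                       tokens=tokens_per_clip)
print(f"{cache / 2**30:.1f} GiB per clip of context")  # -> 12.0 GiB
```

Even under these modest assumptions, a single clip of context already costs on the order of tens of gigabytes, which is why KV-cache management is treated as a first-class efficiency concern for long-sequence video generation.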
Moreover, under high-resolution settings, suitable parallel computing topologies must be determined so that devices can distribute the workload effectively. Without efficiency optimization, traditional video generators struggle to scale or interact in real time. Therefore, given the abundant redundancy in video data, efficient architectures and algorithms emerge as promising ways to address the aforementioned challenges, transforming heavy and slow generative processes into agile and scalable forms that are amenable to practical deployment.

[Fig. 1 here. Caption: A taxonomy of representative topics related to efficiency improvement for video generation-based world models. Top-level branches: Efficient Modeling (Sec. III), Efficient Architecture (Sec. IV), Efficient Inference (Sec. V), Applications (Sec. VI), and More Related Work (Sec. VII).]

Taxonomy. As shown in Figure 1, this article systematically investigates the role of efficiency in the aspects of modeling, architectures, and algorithms for video-based world models, covering the spectrum between AR-based and diffusion-based paradigms. Our discussion is structured around three core dimensions: Efficient Modeling (covering efficiency-oriented modeling paradigms), Efficient Architectures (designs such as VAEs, memory mechanisms, and efficient attention), and Efficient Inference (system deployment considerations including parallelism, caching, pruning, and quantization).
Furthermore, this article also explores how these efficient models are used in downstream application scenarios, such as autonomous driving, embodied AI, and games/interactive simulation. By reviewing comprehensive insights in this rapidly evolving field, we aim to catalyze new advances in video-based world models that leverage efficient computing to tackle increasingly sophisticated simulation challenges.

Within the existing literature, previous studies have primarily explored general video generation or specific diffusion-model-based techniques. More recently, amid the significant advances in Sora-like models [14], some works have begun to address the computational demands of video generation. However, a systematic review specifically elucidating how efficiency improvement techniques can benefit a video-based world model is notably absent. To the best of our knowledge, this article presents the first systematic exploration dedicated to the intersection of efficiency improvement techniques and the multiple facets of video-based world models. The main contributions of this paper are summarized as follows:

• We provide the first comprehensive review of the critical intersection between efficiency improvement techniques and video-based world models.
• We introduce a novel taxonomy that provides a structured perspective on efficiency across three dimensions: modeling paradigms, architectural designs, and inference optimizations.
• We detail how these efficiency improvement techniques empower critical applications such as autonomous driving, embodied AI, and interactive simulation.
• We further discuss key challenges and future opportunities in efficient video-based world modeling.

The remainder of this paper is organized as follows. We introduce background knowledge on video generative paradigms and foundations in Section II. Next, a review of efficient modeling paradigms is given in Section III.
Efficient architectures and inference algorithms are presented in detail in Section IV and Section V, respectively. In addition, promising applications and more related works are discussed in Section VI and Section VII. Finally, a summary of this paper is presented in Section VIII.

II. BACKGROUND

The field of video generation has evolved from modeling short, low-resolution transitions to simulating complex world dynamics. To understand the challenges of efficient video modeling, it is essential to first comprehend the foundational paradigms of image generation and how these paradigms are extended to the temporal dimension to generate videos. This chapter outlines the mathematical principles and architectural innovations that define modern video generation.

A. Generative Paradigms

Modern video generation models are largely built upon paradigms established in image synthesis. We introduce the mathematical formulations of these generative models, focusing on Diffusion Models and Flow Matching as the current dominant approaches, followed by auto-regressive models.

1) Denoising Diffusion Probabilistic Models (DDPM): Diffusion models [26] formulate generation as a denoising process. To improve efficiency, most state-of-the-art models operate in the latent space of a pre-trained variational autoencoder (VAE), known as Latent Diffusion Models (LDMs) [6].

Forward Process. Given a data sample x_0 ~ q(x_0) (or its latent representation z_0), the forward process is a fixed Markov chain that gradually adds Gaussian noise according to a variance schedule β_t ∈ (0, 1). The transition probability is defined as:

q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right) \quad (1)

Using the notation α_t = 1 − β_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, we can sample x_t at any timestep t directly from x_0:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I) \quad (2)

Reverse Process and Training. The generative process reverses this noise addition.
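The closed-form sample in Eq. (2) is what makes diffusion training tractable: x_t is produced in one step rather than by iterating the Markov chain of Eq. (1). A minimal numerical sketch follows; the linear β schedule and toy tensor shapes are illustrative choices, not prescribed by the text, and `q_sample` is our own name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # variance schedule beta_t (illustrative)
alphas_bar = np.cumprod(1.0 - betas)     # abar_t = prod_{s<=t} (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))         # toy "clean" sample
eps = rng.standard_normal((4, 8))        # Gaussian noise

def q_sample(x0, t, eps):
    """One-step forward sample of Eq. (2), no chain iteration needed."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x_t = q_sample(x0, t=500, eps=eps)
# As t -> T, abar_t -> 0, so x_t approaches pure Gaussian noise.
```

During training, a (t, ε) pair is drawn per example and the network regresses ε from x_t, exactly the objective in Eq. (3) below.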
Since the true posterior q(x_{t-1} | x_t) is intractable, we approximate it with a parameterized distribution p_θ(x_{t-1} | x_t). In practice, the model is trained to predict the added noise ε or the velocity v. The simplified training objective is often the mean squared error (MSE) between the actual noise ε and the predicted noise ε_θ:

\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0, t, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right] \quad (3)

Once trained, the model generates data by iteratively denoising pure Gaussian noise x_T to x_0.

2) Flow Matching: While DDPMs rely on the pre-defined forward process in Eq. (2), which transports samples through a fixed and typically curved noising trajectory, Flow Matching (FM) [27], [28] instead models generation as a continuous-time probability path governed by ordinary differential equations (ODEs). FM defines a probability density path p_t that transforms a simple prior distribution into the data distribution through a time-dependent vector field v_t(x):

\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)), \qquad \phi_0(x) = x \quad (4)

where φ_t maps samples from t = 0 to t. The goal is to learn a parameterized vector field v_θ(x, t) that matches the target velocity field associated with the chosen probability path. Since directly regressing the marginal target velocity field is generally intractable for complex data distributions, flow matching is commonly implemented in a conditional form. Given a source sample z_0 and a target data sample x_1, one defines a conditional probability path p_t(x | z_0, x_1) together with a tractable conditional target vector field u_t(x | z_0, x_1). The resulting conditional flow matching (CFM) objective is

\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t, z_0, x_1, x_t \sim p_t(\cdot \mid z_0, x_1)}\left[ \| v_\theta(x_t, t) - u_t(x_t \mid z_0, x_1) \|^2 \right] \quad (5)

In common straight-line path formulations, the conditional path is chosen as a linear interpolation between noise z_0 and data x_1, namely x_t = t x_1 + (1 − t) z_0.
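Under this straight-line path, both the interpolant and its velocity are trivial to compute, which is a large part of FM's practical appeal. A minimal numerical sketch (toy dimensions and the name `interpolant` are our own; the finite-difference check is only for illustration):

```python
import numpy as np

# Linear (straight-line) conditional path:
#   x_t = t * x_1 + (1 - t) * z_0, whose time derivative is x_1 - z_0.
rng = np.random.default_rng(0)
z0 = rng.standard_normal(6)     # source noise sample
x1 = rng.standard_normal(6)     # data sample

def interpolant(t: float) -> np.ndarray:
    return t * x1 + (1.0 - t) * z0

# The path's velocity does not depend on t: a finite difference at any
# point on the path recovers the same vector x_1 - z_0.
dt = 1e-6
finite_diff = (interpolant(0.5 + dt) - interpolant(0.5)) / dt
target_velocity = x1 - z0
```

The regression target for v_θ is therefore the same vector at every t along a given (z_0, x_1) pair, which is what makes the straight-line objective so simple.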
In this case, the target velocity becomes a constant, i.e., u_t = x_1 − z_0, and the objective reduces to

\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t, z_0, x_1}\left[ \| v_\theta(x_t, t) - (x_1 - z_0) \|^2 \right] \quad (6)

3) Auto-regressive (AR) Models: AR models decompose the joint probability distribution of a sequence x into a product of conditional probabilities. In a canonical visual generation formulation, x represents a flattened sequence of discrete visual tokens derived from a VQ-VAE-style tokenizer [29], where an encoder maps patches or frames to continuous latents that are snapped to a learned finite codebook via nearest-neighbor vector quantization (VQ), although more general autoregressive video models may also operate on other compressed latent token sequences. For a sequence of length N:

p(x) = \prod_{i=1}^{N} p(x_i \mid x_{<i})
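The chain-rule factorization can be checked numerically with a toy first-order (bigram) AR model, where each conditional p(x_i | x_{<i}) simplifies to p(x_i | x_{i-1}). The 3-symbol vocabulary and the transition matrix are arbitrary illustrative choices, not from the text.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])              # p(x_1)
trans = np.array([[0.6, 0.3, 0.1],             # row i: p(x_t = . | x_{t-1} = i)
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7]])

def sequence_prob(tokens):
    """p(x) as a product of per-token conditionals (Markov simplification)."""
    p = prior[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        p *= trans[prev, cur]
    return p

p = sequence_prob([0, 1, 2])   # 0.5 * 0.3 * 0.3 = 0.045
```

Summing `sequence_prob` over all sequences of a fixed length yields 1, confirming the factorization defines a valid distribution; a full AR video model replaces the bigram table with a network conditioned on the entire token prefix.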