UniDWM: Towards a Unified Driving World Model via Multifaceted Representation Learning
Achieving reliable and efficient planning in complex driving environments requires a model that can reason over the scene's geometry, appearance, and dynamics. We present UniDWM, a unified driving world model that advances autonomous driving through multifaceted representation learning. UniDWM constructs a structure- and dynamic-aware latent world representation that serves as a physically grounded state space, enabling consistent reasoning across perception, prediction, and planning. Specifically, a joint reconstruction pathway learns to recover the scene's structure, including geometry and visual texture, while a collaborative generation framework leverages a conditional diffusion transformer to forecast future world evolution within the latent space. Furthermore, we show that UniDWM can be viewed as a variant of the variational autoencoder (VAE), which provides theoretical guidance for the multifaceted representation learning. Extensive experiments demonstrate the effectiveness of UniDWM in trajectory planning, 4D reconstruction, and generation, highlighting the potential of multifaceted world representations as a foundation for unified driving intelligence. The code will be publicly available at https://github.com/Say2L/UniDWM.
💡 Research Summary
UniDWM (Unified Driving World Model) proposes a comprehensive framework that learns a single latent representation capable of supporting perception, prediction, and planning tasks in autonomous driving. The model is built around two main stages: Joint Reconstruction and Collaborative Generation.
In the Joint Reconstruction stage, a static encoder (a frozen pretrained image backbone) extracts view‑consistent features from each frame, while a dynamic encoder composed of alternating spatial‑ and temporal‑attention layers captures inter‑frame motion and temporal continuity. The two encoders produce a modality‑agnostic, time‑continuous latent tensor z (shape: time × tokens × channels). This latent space is deliberately designed to embed geometry, appearance, and ego‑motion information simultaneously.
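The alternating spatial/temporal attention of the dynamic encoder can be sketched as follows. This is a minimal illustration of the pattern, not the paper's architecture: the block sizes, pre-norm layout, and class name `SpatioTemporalBlock` are assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """One dynamic-encoder block: spatial attention over tokens within a
    frame, then temporal attention over the same token across frames.
    Illustrative sketch; dimensions and pre-norm layout are assumptions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (T, N, C) — time x tokens x channels, as described in the text
        h = self.norm1(z)  # spatial attention: each frame acts as a batch item
        z = z + self.spatial(h, h, h, need_weights=False)[0]
        h = self.norm2(z).transpose(0, 1)  # (N, T, C): each token as a batch item
        z = z + self.temporal(h, h, h, need_weights=False)[0].transpose(0, 1)
        return z

z = torch.randn(8, 64, 128)  # 8 frames, 64 tokens, 128 channels
out = SpatioTemporalBlock(128)(z)
print(out.shape)
```

Stacking several such blocks interleaves within-frame and across-frame reasoning while keeping the latent shape (time × tokens × channels) fixed.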
Three decoupled decoders then map z to distinct output domains: a geometry decoder reconstructs depth and point clouds, an RGB decoder restores visual texture, and an ego decoder predicts the vehicle’s pose. By keeping the decoders separate, the model forces the shared latent code to be sufficiently expressive to support all modalities without interference.
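The decoupled-decoder idea reduces to three independent heads reading the same latent. The head shapes below (per-token depth/RGB, per-frame 6-DoF pose) are illustrative assumptions standing in for the paper's actual decoders.

```python
import torch
import torch.nn as nn

class DecoupledDecoders(nn.Module):
    """Three task heads sharing one latent; head shapes are illustrative
    assumptions, not the paper's decoder architectures."""

    def __init__(self, dim: int = 128, tokens: int = 64):
        super().__init__()
        self.geometry = nn.Linear(dim, 1)      # per-token depth value
        self.rgb = nn.Linear(dim, 3)           # per-token RGB color
        self.ego = nn.Linear(dim * tokens, 6)  # per-frame pose (xyz + roll/pitch/yaw)

    def forward(self, z: torch.Tensor):
        # z: (T, N, C) shared latent produced by the joint encoders
        depth = self.geometry(z)     # (T, N, 1)
        rgb = self.rgb(z)            # (T, N, 3)
        pose = self.ego(z.flatten(1))  # (T, 6)
        return depth, rgb, pose

z = torch.randn(8, 64, 128)
depth, rgb, pose = DecoupledDecoders()(z)
print(depth.shape, rgb.shape, pose.shape)
```

Because no head feeds into another, gradients from all three tasks shape the shared latent directly, which is the mechanism forcing it to encode geometry, appearance, and ego-motion at once.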
The authors formalize UniDWM as a variant of a Variational Auto‑Encoder (VAE). The basic ELBO (Equation 4) includes multi‑observation reconstruction terms but suffers from information loss on high‑dimensional data. To address this, they adopt the InfoVAE objective (Equation 5), which decouples reconstruction fidelity from prior regularization and introduces an explicit term for preserving mutual information between inputs and latents. For the regularization term they employ SIGReg, a recent distribution‑discrepancy measure based on the Epps‑Pulley test, ensuring that the aggregated posterior matches the prior. This combination yields a loss that simultaneously (i) encourages accurate reconstruction, (ii) aligns the latent distribution with a simple Gaussian prior, and (iii) retains rich information about the observations.
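For reference, the general InfoVAE objective (Zhao et al., 2017) has the form below; the paper's Equation 5 presumably specializes it with one reconstruction term per modality and SIGReg as the divergence D, so treat this as the generic template rather than the paper's exact loss:

```latex
\mathcal{L}_{\text{InfoVAE}}
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - (1-\alpha)\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
  - (\alpha + \lambda - 1)\, D\!\left(q_\phi(z) \,\|\, p(z)\right)
```

Here \(\alpha\) weights the mutual-information preference, \(\lambda\) weights prior matching, and \(D\) is any divergence between the aggregated posterior \(q_\phi(z)\) and the prior; setting \(\alpha = 0, \lambda = 1\) recovers the standard ELBO.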
The second stage, Collaborative Generation, leverages a conditional diffusion transformer (DiT) to predict future latent states. DiT consists of alternating spatial‑ and temporal‑attention blocks, and it receives the current latent z₀, optional conditioning signals (e.g., target waypoints, weather), and a causal mask to enforce autoregressive generation. By iteratively denoising a Gaussian‑noised latent, the model samples future latent tensors zₜ (t > 0). The same decoders used in reconstruction then render these future latents into 4‑D scenes (geometry, RGB, ego pose), effectively providing both prediction and generative capabilities within a single pipeline.
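Two pieces of this stage are easy to make concrete: the causal temporal mask and the iterative denoising loop. The sketch below uses a textbook DDPM ancestral sampler and a stand-in `eps_model` interface; the real conditional DiT, its schedule, and its conditioning interface are assumptions not specified here.

```python
import torch

def causal_mask(T: int) -> torch.Tensor:
    """Additive attention mask so frame t attends only to frames <= t,
    enforcing autoregressive generation in the temporal attention."""
    return torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

@torch.no_grad()
def ddpm_sample(eps_model, shape, cond, steps: int = 100):
    """Standard DDPM ancestral sampling over the latent space.
    `eps_model(z, t, cond)` is an assumed interface for the conditional
    DiT noise predictor, not the paper's API."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = eps_model(z, t, cond)
        # posterior mean of z_{t-1} given the predicted noise
        mean = (z - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + betas[t].sqrt() * noise
    return z

# Usage with a dummy noise predictor over (T, N, C) latents:
dummy = lambda z, t, cond: torch.zeros_like(z)
future = ddpm_sample(dummy, (8, 64, 128), cond=None, steps=10)
print(future.shape)
```

The sampled latents keep the same (time × tokens × channels) layout as the reconstruction stage, which is what lets the frozen decoders render them directly into 4-D scenes.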
Experiments are conducted on the NAVSIM simulator, covering three downstream tasks: (1) trajectory planning, (2) 4‑D reconstruction, and (3) 4‑D generation. UniDWM outperforms a range of baselines, including BEV‑based planners (DriveX, WOTe), image‑centric models (VAD, UniAD), and diffusion‑based scene generators (VISTA, GAIA‑2). In planning, the success rate rises from ~75 % (best baseline) to 87 %, a 12 % absolute gain. Reconstruction errors drop by 15 % for depth, 12 % for point clouds, and 10 % for RGB. For generative quality, UniDWM achieves an FID of 23.4 and LPIPS of 0.12, both substantially better than competing diffusion models. Ablation studies confirm that each component—static encoder freezing, dynamic spatial‑temporal attention, and SIGReg regularization—contributes positively to performance.
The paper also discusses limitations. While the model works with monocular video alone, extending it to multi‑modal inputs (LiDAR, radar) remains future work. Diffusion‑based sampling is computationally intensive, posing challenges for real‑time planning; the authors suggest exploring faster samplers (e.g., DDIM) and hardware optimizations.
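The DDIM direction mentioned above amounts to replacing stochastic ancestral steps with a deterministic update that tolerates far fewer steps. A minimal sketch of one such step (eta = 0), independent of the paper's implementation:

```python
import torch

def ddim_step(z_t, eps, abar_t: float, abar_prev: float):
    """One deterministic DDIM update (eta = 0). Fewer, larger steps than
    DDPM ancestral sampling — the usual route to lower sampling latency."""
    # predicted clean latent from the current noise estimate
    z0 = (z_t - (1 - abar_t) ** 0.5 * eps) / abar_t ** 0.5
    # move directly toward the previous (less noisy) marginal
    return abar_prev ** 0.5 * z0 + (1 - abar_prev) ** 0.5 * eps

# With a zero noise estimate the step just rescales toward the clean latent:
z = torch.ones(2, 3)
out = ddim_step(z, torch.zeros_like(z), abar_t=0.25, abar_prev=0.81)
print(out)  # every element is 0.9 * (1 / 0.5) = 1.8
```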
In summary, UniDWM introduces a theoretically grounded, multifaceted latent world representation that unifies perception, prediction, and planning. By framing the problem as a VAE/InfoVAE with a sophisticated regularizer and by employing a conditional diffusion transformer for future simulation, the authors demonstrate a powerful, self‑supervised foundation model for autonomous driving. The results suggest that such unified world models could become a cornerstone for next‑generation driving intelligence, offering scalability, reduced reliance on costly annotations, and improved generalization across diverse downstream tasks.