InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA’s autopilot to procedurally and stochastically simulate rare but safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems. Our project page is https://shanpoyang654.github.io/InstaDrive/page.html.


💡 Research Summary

InstaDrive addresses two critical shortcomings of existing driving video generation models: the lack of instance‑level temporal consistency and insufficient spatial geometric fidelity. The authors propose a novel framework that integrates two dedicated modules—Instance Flow Guider (IFG) and Spatial Geometric Aligner (SGA)—into a diffusion‑based video generation pipeline built on the OpenSora V1.1 backbone (VAE encoder, T5 text encoder, and ST‑DiT spatial‑temporal transformer).

The Instance Flow Guider ensures that each object’s visual attributes (color, texture, category) remain stable across frames. It leverages tracking IDs and visibility flags to compute a 3‑D offset between the current frame and the most recent visible frame for each instance. These offsets are projected onto the 2‑D image plane using the instance’s 3‑D bounding box, forming a per‑pixel motion map that encodes the displacement vector (x, y, z). The motion map is compressed by the same VAE used for video latents, then injected into the diffusion model via ControlNet. During denoising, temporal attention layers can directly attend to the motion condition, allowing the model to retrieve and propagate instance‑specific semantic features from previous frames. This mechanism eliminates the color shifts and texture flickering observed in prior works that rely only on global temporal attention.
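The paper does not publish code for this step; the following is a minimal sketch of how a per‑pixel motion map could be rasterized from per‑instance 3‑D offsets, assuming instance centers, projected 2‑D boxes, and visibility flags are already available (all function and parameter names here are hypothetical, not from the paper):

```python
import numpy as np

def build_motion_map(curr_centers, prev_centers, boxes_2d, visible, hw):
    """Rasterize per-instance 3-D offsets into a per-pixel motion map.

    curr_centers / prev_centers: (N, 3) instance centers, with prev_centers
        taken from the most recent frame in which the instance was visible
    boxes_2d: (N, 4) projected boxes as (x0, y0, x1, y1) in pixels
    visible:  (N,) bool flags; offsets are only drawn for visible instances
    hw:       (H, W) image size
    """
    h, w = hw
    motion_map = np.zeros((h, w, 3), dtype=np.float32)  # (dx, dy, dz) per pixel
    for i in range(len(curr_centers)):
        if not visible[i]:
            continue  # leave zeros where the instance is not currently seen
        offset = curr_centers[i] - prev_centers[i]  # displacement since last visible frame
        x0, y0, x1, y1 = boxes_2d[i].astype(int)
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, w), min(y1, h)
        motion_map[y0:y1, x0:x1] = offset  # paint the box footprint with the offset
    return motion_map
```

In the described pipeline, a map like this would then be compressed by the shared VAE and injected via ControlNet as a denoising condition.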

The Spatial Geometric Aligner tackles spatial misalignment and incorrect occlusion ordering. It transforms each 3‑D bounding box into the camera’s first‑person view using intrinsic and extrinsic parameters, producing precise 2‑D projections that serve as control signals for object placement. To model occlusion hierarchy, the method computes the depth of each box corner along the optical axis, encodes these depths with Fourier embeddings, and passes them through an MLP to obtain an explicit depth‑order representation. This representation is fed to the diffusion model, enabling it to learn correct depth ordering and produce realistic occlusion relationships (e.g., a near pedestrian correctly occluding a distant bus).
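As a rough illustration of the two geometric operations described above—projecting box corners into the camera view and Fourier‑embedding their optical‑axis depths—here is a numpy sketch (function names and the frequency schedule are assumptions, not taken from the paper; the subsequent MLP is omitted):

```python
import numpy as np

def project_corners(corners_world, extrinsic, intrinsic):
    """Project 3-D box corners into the image; return pixel coords and depths.

    corners_world: (8, 3) box corners in world/ego coordinates
    extrinsic:     (4, 4) world-to-camera transform
    intrinsic:     (3, 3) camera matrix
    """
    homo = np.hstack([corners_world, np.ones((8, 1))])   # homogeneous coords
    cam = (extrinsic @ homo.T).T[:, :3]                  # camera-frame points
    depths = cam[:, 2]                                   # depth along the optical axis
    pix = (intrinsic @ cam.T).T
    pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None)  # perspective divide
    return pix, depths

def fourier_embed(depths, num_freqs=4):
    """Sinusoidal (Fourier) embedding of corner depths, fed to an MLP downstream."""
    freqs = 2.0 ** np.arange(num_freqs)                  # geometric frequency ladder
    angles = depths[:, None] * freqs[None, :]            # (8, num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```

The projected 2‑D corners serve as placement control signals, while the embedded depths give the diffusion model an explicit, learnable notion of which box lies in front of which.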

Control signals—including text prompts, HDMap layouts, camera poses, and the projected 2‑D boxes—are incorporated through 13 duplicated ControlNet blocks inserted into the first 13 layers of the ST‑DiT architecture. A parameter‑free view‑inflated attention mechanism reshapes the tensor to maintain multi‑view consistency without adding extra parameters, improving computational efficiency.
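The key idea of view‑inflated attention—exchanging information across camera views by reshaping the token tensor rather than adding weights—can be sketched as follows (a simplified single‑head version with hypothetical names; the real model applies this inside ST‑DiT's attention layers):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def view_inflated_attention(x, num_views):
    """Parameter-free multi-view attention via reshaping.

    x: (B * V, L, C) tokens for B scenes, V camera views, L tokens per view.
    Tokens from all views of one scene are folded into a single sequence so
    self-attention can exchange information across cameras without any new
    learnable parameters.
    """
    bv, l, c = x.shape
    b = bv // num_views
    seq = x.reshape(b, num_views * l, c)                 # merge views into one sequence
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(c)   # scaled dot-product, q = k = v
    out = softmax(scores) @ seq
    return out.reshape(bv, l, c)                         # restore the per-view layout
```

Because the operation is a pure reshape around existing attention, it adds cross‑view consistency at essentially no extra parameter or memory cost.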

The authors evaluate InstaDrive on the nuScenes dataset. Quantitatively, it achieves lower Fréchet Inception Distance (FID) and Fréchet Video Distance (FVD) than state‑of‑the‑art baselines such as MagicDrive‑V2 and Panacea. Instance‑level metrics—color consistency rate, bounding‑box alignment error, and occlusion hierarchy accuracy—show substantial gains. For downstream tasks, models for object detection, multi‑object tracking, and motion planning trained on InstaDrive‑generated videos perform comparably to those trained on real sensor data, demonstrating the practical utility of the synthetic videos.

Beyond dataset‑driven experiments, the paper introduces a pipeline that uses CARLA’s autopilot to procedurally generate rare, safety‑critical driving scenarios across diverse maps and regions. These scenarios are rendered by InstaDrive, providing a scalable source of challenging test cases for autonomous vehicle safety evaluation.

Limitations include reliance on accurate tracking IDs; errors in ID assignment can corrupt the motion map and degrade temporal consistency. The occlusion model uses only corner‑point depths, which may be insufficient for irregularly shaped objects or complex inter‑object interactions. Future work could integrate learned ID association, richer depth estimation, and interaction modeling to further improve realism.

In summary, InstaDrive presents the first diffusion‑based driving world model that simultaneously guarantees instance‑level temporal coherence and precise spatial geometry. By delivering high‑fidelity, controllable synthetic driving videos at low cost, it opens new avenues for large‑scale data generation, rare‑scenario simulation, and safety validation in autonomous driving research.

