ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.


💡 Research Summary

ConsisDrive tackles the pervasive “identity drift” problem in diffusion‑based driving world models, where objects change appearance or even category across frames. The authors identify three root causes: (1) lack of explicit instance identity conditioning, (2) attention mechanisms that are not instance‑aware, leading to information leakage between objects, and (3) uniform loss supervision that dilutes foreground learning because background pixels dominate the scene.

To solve these issues, ConsisDrive introduces two complementary components.

  1. Instance‑Masked Attention (IMA) – The model first extracts 3D bounding boxes for every object in each frame. Using the camera intrinsics and extrinsics, the boxes are projected onto the image plane, rasterized, and trilinear‑interpolated into the latent space, yielding soft instance masks B̃M_i. A token‑to‑instance indicator function I(v_k) determines which instance(s) a visual token belongs to. A binary mask matrix M is then built so that during 3D self‑attention (a) only tokens belonging to the same instance can attend to each other across time (the trajectory mask) and (b) each token can attend only to its own instance's identity embeddings (the identity mask). This prevents color, texture, or semantic information from leaking between different agents and enforces long‑range temporal consistency.

  2. Instance‑Masked Loss (IML) – Instead of applying a global MSE loss uniformly, the authors construct a foreground mask L_t for each frame. The training objective becomes a weighted sum: λ·‖L_t⊙(x̂‑x)‖² + (1‑λ)·‖x̂‑x‖², where λ is gradually increased during training. A probabilistic masking strategy randomly perturbs L_t each batch, encouraging the network to remain robust while still focusing on foreground details. This mitigates supervision dilution and improves the model’s ability to preserve fine‑grained identity cues for small objects such as pedestrians.
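The instance-restricted attention in (1) can be sketched as follows. This is a minimal NumPy reconstruction for illustration, not the authors' implementation: the token layout (T frames × N tokens per frame), the convention that instance id 0 means background, and the function names are all assumptions.

```python
import numpy as np

def build_instance_attention_mask(instance_ids):
    # instance_ids: (T, N) integer array; 0 = background, >0 = tracked instance id.
    # Returns a boolean (T*N, T*N) matrix: True where attention is allowed.
    flat = instance_ids.reshape(-1)              # (T*N,)
    same = flat[:, None] == flat[None, :]        # same instance, any frame pair
    background = (flat == 0)
    # Foreground tokens attend only within their own instance's trajectory;
    # background tokens attend freely among themselves.
    return same | (background[:, None] & background[None, :])

def masked_attention(q, k, v, allow):
    # q, k, v: (L, d) token features; allow: (L, L) boolean mask.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allow, scores, -1e9)       # block disallowed pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because the mask is applied additively before the softmax (as a large negative score), blocked token pairs receive effectively zero attention weight, which is how cross-instance leakage is suppressed.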
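The weighted objective in (2) can likewise be sketched in a few lines. This is an illustrative NumPy version under stated assumptions: the exact form of the probabilistic masking (here, independently dropping foreground pixels with probability `drop_p`) and the function signature are guesses at the paper's strategy, not its actual code.

```python
import numpy as np

def instance_masked_loss(pred, target, fg_mask, lam=0.7, drop_p=0.1, rng=None):
    # pred, target: (T, H, W) latents; fg_mask: (T, H, W) in {0, 1}.
    # lam weights the foreground term; the paper increases it during training.
    rng = np.random.default_rng() if rng is None else rng
    # Probabilistic masking: randomly drop foreground pixels each batch so
    # the model stays robust to imperfect masks (assumed mechanism).
    keep = rng.random(fg_mask.shape) >= drop_p
    m = fg_mask * keep
    sq = (pred - target) ** 2
    fg_term = (m * sq).mean()        # λ·‖L_t⊙(x̂−x)‖² term
    global_term = sq.mean()          # (1−λ)·‖x̂−x‖² term
    return lam * fg_term + (1 - lam) * global_term
```

Errors confined to the background are down-weighted by the factor (1−λ), while foreground errors are counted in both terms, which is the intended bias toward instance detail.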

The overall architecture builds on OpenSora V2.0. Visual tokens V are obtained from a VAE encoder, textual scene descriptions from T5‑XXL, and semantic category embeddings from CLIP‑Large. Instance identity conditions (category, 3D size, tracking ID) are encoded via Fourier mapping and a small MLP, producing global embeddings g_i that are concatenated with V and fed into the IMA module. Control signals (3D box projections, road maps, scene text) are injected through a ControlNet‑style branch that duplicates the first 19 blocks of the MMDiT backbone, ensuring that conditioning information propagates throughout the denoising process.
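The Fourier mapping used to encode scalar instance conditions (category index, 3D size, tracking ID) before the MLP can be illustrated with a standard sinusoidal feature map. The geometric frequency schedule below is a common convention and an assumption here; the paper does not specify the exact band structure.

```python
import numpy as np

def fourier_features(x, num_bands=8):
    # x: array of scalar condition values, e.g. box dimensions or a
    # normalized tracking ID. Each scalar is lifted to 2*num_bands features
    # via sin/cos at geometrically spaced frequencies, giving the MLP a
    # higher-frequency view of the raw values.
    freqs = 2.0 ** np.arange(num_bands)          # (num_bands,)
    ang = x[..., None] * freqs                   # (..., num_bands)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
```

A small MLP applied on top of these features would then produce the per-instance global embeddings g_i that are concatenated with the visual tokens V.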

Experiments on the nuScenes multi‑view dataset demonstrate substantial gains: ConsisDrive achieves FID = 12.3 and FVD = 84.7, outperforming MagicDrive‑V2 (FID ≈ 19, FVD ≈ 130) and DriveDreamer2 (FID ≈ 22, FVD ≈ 150). Qualitatively, generated videos show stable colors, shapes, and categories for cars, buses, cyclists, and pedestrians across long sequences.

Downstream evaluation further validates the utility of the synthetic data. Models trained on ConsisDrive‑generated videos achieve higher 3D object detection mAP (0.55 vs. 0.48), multi‑object tracking IDF1 (0.78 vs. 0.71), and planning success rate (92% vs. 84%) compared to those trained on prior synthetic generators, approaching the performance of models trained on real sensor data. Ablation studies reveal that removing IMA drops performance by ~15% and removing IML by ~12%, confirming that the two components are essential and synergistic.

Limitations include increased computational cost for high‑resolution mask rasterization and occasional failure to capture very small objects’ boundaries accurately. The authors suggest future work on integrating learned segmentation networks for mask generation and exploring sparse‑attention schemes to reduce runtime.

In summary, ConsisDrive presents a principled, instance‑aware framework that dramatically reduces identity drift in driving video synthesis, delivering higher‑quality synthetic data that can reliably support perception, tracking, and planning tasks in autonomous driving pipelines.

