FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving
Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.
💡 Research Summary
FAR‑Drive presents a novel framework for closed‑loop autonomous driving simulation that generates multi‑camera video frames in an autoregressive, frame‑by‑frame manner. The authors identify three fundamental challenges in this setting: (1) maintaining long‑horizon temporal and cross‑view consistency, (2) preventing quality degradation caused by exposure bias when the model repeatedly conditions on its own predictions, and (3) meeting the low‑latency requirements of interactive driving loops. To address these, FAR‑Drive introduces a multi‑view diffusion transformer (MMDiT) architecture consisting of a backbone transformer for primary video synthesis and a parallel control transformer that processes structured driving cues such as camera intrinsics, 3D bounding boxes, bird’s‑eye‑view (BEV) maps, ego‑motion matrices, and textual scene descriptions. The control signals are encoded by modality‑specific encoders, fused into a unified scene prompt, and injected into the backbone via zero‑initialized projection layers, allowing the model to gradually learn to modulate generation without disrupting pretrained visual priors.
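The zero-initialized injection pattern described above can be sketched in a few lines. This is a minimal illustration of the general technique (as popularized by ControlNet-style conditioning), not the authors' actual code; the module name and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn


class ZeroInitControlInjection(nn.Module):
    """Inject fused control features into backbone hidden states.

    The projection is zero-initialized, so at the start of training the
    control branch contributes nothing and the backbone's pretrained
    visual prior is undisturbed. (Hypothetical sketch, not the paper's
    exact module.)
    """

    def __init__(self, control_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(control_dim, hidden_dim)
        nn.init.zeros_(self.proj.weight)  # zero-init: identity at step 0
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # hidden:  (batch, tokens, hidden_dim) backbone activations
        # control: (batch, tokens, control_dim) fused scene prompt
        return hidden + self.proj(control)


# Sanity check: with zero-initialized weights the injection is an identity,
# so gradients can gradually "open" the control pathway during training.
h = torch.randn(2, 16, 64)
c = torch.randn(2, 16, 32)
out = ZeroInitControlInjection(control_dim=32, hidden_dim=64)(h, c)
assert torch.allclose(out, h)
```

As training proceeds, the projection weights move away from zero and the structured cues begin to modulate generation, which is what lets the model learn control without disrupting its pretrained priors.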
Training proceeds in two stages. First, Adaptive Reference Horizon Conditioning dynamically expands the temporal conditioning window during training, moving from a single previous frame (position only) to two frames (velocity) and three frames (acceleration), thereby providing explicit motion information and improving long‑term consistency. Second, Blend‑Forcing gradually mixes ground‑truth frames with self‑generated frames using a blending factor that increases over epochs. This mitigates exposure bias by exposing the model to its own predictions while still grounding it in real data, leading to more stable rollouts.
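The two schedules above can be sketched as simple functions of the training epoch. The stage lengths and the linear ramp are assumptions for illustration; the summary only states that the horizon grows from one to three frames and that the blending factor increases over epochs.

```python
import random


def reference_horizon(epoch: int, stage_epochs: int = 10) -> int:
    """Adaptive Reference Horizon Conditioning: 1 frame (position) ->
    2 frames (velocity) -> 3 frames (acceleration).
    `stage_epochs` is an assumed schedule parameter, not from the paper."""
    return min(1 + epoch // stage_epochs, 3)


def blend_factor(epoch: int, total_epochs: int) -> float:
    """Blend-Forcing factor: probability of conditioning on a
    self-generated frame instead of ground truth. An assumed linear
    ramp from 0 to 1 over training."""
    return min(epoch / max(total_epochs - 1, 1), 1.0)


def pick_condition_frame(gt_frame, generated_frame, epoch, total_epochs,
                         rng=random):
    """With probability blend_factor, condition on the model's own
    prediction, exposing it to its rollout distribution during training."""
    if rng.random() < blend_factor(epoch, total_epochs):
        return generated_frame
    return gt_frame
```

Early in training the model conditions almost exclusively on ground truth; by the end it mostly sees its own predictions, which is the standard remedy for exposure bias in autoregressive rollouts.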
For inference speed, the authors compress the diffusion latent space with a high‑compression variational autoencoder (VAE) and employ KV‑Cache and Control‑Cache mechanisms to reuse attention keys/values and control encoder outputs across timesteps. Combined with a step‑reduction schedule and distribution‑matching distillation, these system‑level optimizations reduce the per‑frame latency to sub‑second (≈0.8 s on a single RTX 3090 GPU).
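The caching idea is the same one used in autoregressive language models: keys and values from already-generated frames are stored so that each new frame only computes attention inputs for its own tokens. The sketch below shows the pattern in isolation; the class and shapes are assumptions, not the authors' implementation (their Control-Cache applies the analogous reuse to control-encoder outputs).

```python
import torch


class KVCache:
    """Accumulate attention keys/values across autoregressive frame steps.

    Each step appends only the new frame's K/V along the token axis,
    avoiding recomputation for all previously generated frames.
    (Hypothetical sketch of the caching pattern.)
    """

    def __init__(self):
        self.k = None  # (batch, heads, cached_tokens, head_dim)
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v


# Three autoregressive frame steps, 8 tokens per frame: the cache grows
# to 24 tokens while each step only computed K/V for its own 8.
cache = KVCache()
for _ in range(3):
    k_all, v_all = cache.append(torch.randn(1, 4, 8, 32),
                                torch.randn(1, 4, 8, 32))
assert k_all.shape == (1, 4, 24, 32)
```

Combined with the high-compression VAE (fewer latent tokens per frame) and a distilled few-step sampler, this per-step reuse is what brings the per-frame latency under one second.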
Experiments on the nuScenes dataset, which provides six synchronized camera views, demonstrate that FAR‑Drive outperforms prior closed‑loop simulators (e.g., SimNet, DriveGAN) across quantitative metrics such as Fréchet Video Distance, LPIPS, and a multi‑view IoU measure of geometric alignment. Qualitatively, generated sequences preserve object trajectories and cross‑view geometry over long horizons, and the simulator reacts instantly to changes in acceleration or steering commands, confirming its suitability for real‑time interaction.
The paper concludes with a discussion of limitations, including the current focus on 256×256 resolution and the lack of lidar/radar modalities, and outlines future directions such as high‑resolution scaling, multimodal sensor fusion, and online fine‑tuning for sim‑to‑real transfer. Overall, FAR‑Drive offers a comprehensive solution that unifies advanced diffusion‑based video synthesis, structured control, robust training strategies, and efficient inference to enable high‑fidelity, low‑latency closed‑loop autonomous driving simulation.