RAPiD: Real-time Deterministic Trajectory Planning via Diffusion Behavior Priors for Safe and Efficient Autonomous Driving


Diffusion-based trajectory planners have demonstrated strong capability for modeling the multimodal nature of human driving behavior, but their reliance on iterative stochastic sampling poses critical challenges for real-time, safety-critical deployment. In this work, we present RAPiD, a deterministic policy extraction framework that distills a pretrained diffusion-based planner into an efficient policy, eliminating iterative diffusion sampling at inference time. Using score-regularized policy optimization, we leverage the score function of the pretrained diffusion planner as a behavior prior to regularize policy learning. To promote safety and passenger comfort, the policy is optimized with a critic trained to imitate a predictive driver controller, providing dense, safety-focused supervision beyond conventional imitation learning. Evaluations demonstrate that RAPiD achieves competitive performance on closed-loop nuPlan scenarios with an 8x speedup over diffusion baselines, while achieving state-of-the-art generalization among learning-based planners on the interPlan benchmark. Code: https://github.com/ruturajreddy/RAPiD.


💡 Research Summary

RAPiD (Real-time Deterministic Trajectory Planning via Diffusion Behavior Priors) tackles the fundamental deployment bottleneck of diffusion‑based planners for autonomous driving: high latency and stochastic sampling. The authors start from a state‑of‑the‑art diffusion planner (DiffusionPlanner) that jointly predicts ego‑vehicle and surrounding‑vehicle trajectories using a transformer‑based diffusion model. While this model captures the multimodal nature of human driving, inference requires tens to hundreds of denoising steps, leading to delays unsuitable for safety‑critical, split‑second decisions.

To eliminate sampling while preserving the expressive power of the diffusion prior, RAPiD adopts Score‑Regularized Policy Optimization (SRPO). SRPO formulates policy learning as a reverse‑KL objective that simultaneously maximizes a learned Q‑value and penalizes divergence from the behavior distribution μ(a|s). Crucially, the KL term is replaced by the score function ∇ₐ log μ(a|s), which can be directly approximated by the pretrained diffusion model’s noise‑prediction network (εψ). By using the score at infinitesimal diffusion time (t→0), the method bypasses the iterative denoising process entirely. The resulting policy πθ is deterministic and can be evaluated in a single forward pass, achieving an 8× speedup over the original diffusion sampler.
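The mechanics of this score-regularized update can be illustrated with a toy one-dimensional example (not from the paper): a quadratic critic stands in for Q, and a Gaussian behavior prior with a closed-form score stands in for the diffusion model's noise-prediction network εψ. Gradient ascent on Q(a) + β·log μ(a) then drives the action toward a compromise between the critic's optimum and the prior's mode:

```python
import numpy as np

# Toy 1-D sketch of score-regularized policy extraction (SRPO-style).
# Assumptions (illustrative, not the paper's setup): a quadratic critic
# Q(a) = -(a - a_star)^2 and a Gaussian behavior prior mu = N(m, sigma2),
# whose score grad_a log mu(a) = -(a - m) / sigma2 is known in closed form.
# In RAPiD the score would instead come from the pretrained diffusion
# model's noise-prediction network at diffusion time t -> 0.

a_star, m, sigma2, beta = 2.0, 0.0, 1.0, 1.0

def q_grad(a):
    # grad_a Q(a) for the toy quadratic critic
    return -2.0 * (a - a_star)

def prior_score(a):
    # grad_a log mu(a|s) for the Gaussian behavior prior
    return -(a - m) / sigma2

# The deterministic "policy" is a single scalar action here; we ascend the
# reverse-KL-regularized objective Q(a) + beta * log mu(a|s) directly.
a, lr = 0.5, 0.05
for _ in range(2000):
    a += lr * (q_grad(a) + beta * prior_score(a))

# The fixed point solves 2(a_star - a) = beta * (a - m) / sigma2:
expected = (2 * a_star + beta * m / sigma2) / (2 + beta / sigma2)
print(a, expected)  # both ~ 1.333: between the critic optimum and prior mode
```

Because the score is evaluated once per update rather than unrolled through a denoising chain, the extracted policy needs only a single forward pass at inference, which is the source of the reported speedup.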

Safety is enforced through a Predictive Driver Model (PDM) scorer that evaluates trajectories on traffic‑rule compliance, time‑to‑collision, drivable‑area adherence, and passenger comfort. Ground‑truth trajectories from the nuPlan dataset are scored by PDM and stored in an offline replay buffer together with latent state embeddings extracted from the frozen diffusion encoder. A critic is trained via Implicit Q‑Learning (IQL) using expectile regression, which decouples critic learning from the policy and avoids sampling from an explicit behavior policy during Q‑value estimation. The critic provides a dense, safety‑focused learning signal, guiding the policy toward trajectories that are not only high‑scoring on nuPlan metrics but also robust in closed‑loop simulation.
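The expectile-regression loss at the heart of IQL can be sketched in isolation. This is a minimal, illustrative example rather than the paper's implementation: a single scalar v plays the role of the value network, and skewed random returns stand in for PDM-scored targets. With τ > 0.5, the asymmetric weighting pulls the estimate above the mean, yielding an optimistic value without ever sampling actions from a policy:

```python
import numpy as np

# Minimal sketch of the expectile regression used by Implicit Q-Learning.
# Assumptions (illustrative only): scalar returns from a skewed exponential
# distribution stand in for PDM-scored trajectory targets, and a scalar v
# stands in for the value network V(s).

rng = np.random.default_rng(0)
returns = rng.exponential(scale=1.0, size=10_000)  # stand-in for Q targets

def expectile_grad(v, targets, tau):
    # d/dv of mean_i |tau - 1[u_i < 0]| * u_i^2, with u_i = targets_i - v:
    # overestimation errors (u < 0) are down-weighted by (1 - tau).
    u = targets - v
    w = np.where(u < 0, 1.0 - tau, tau)
    return np.mean(-2.0 * w * u)

tau, v, lr = 0.9, 0.0, 0.1
for _ in range(500):
    v -= lr * expectile_grad(v, returns, tau)

# With tau > 0.5 the fitted expectile sits above the mean: an in-sample,
# optimistic value estimate that never queries out-of-distribution actions.
print(v, returns.mean())
```

The loss is convex in v, so plain gradient descent converges; in the full method the same asymmetric weighting is applied per state by the value network.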

Training proceeds in three stages: (1) offline buffer creation with PDM‑scored trajectories, (2) critic training via IQL, and (3) deterministic policy extraction using SRPO. The policy network is a transformer that receives the latent state and outputs a single trajectory. During policy updates, the gradient combines the critic’s Q‑gradient (promoting safety) with the diffusion score regularization (maintaining realism).
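Stage (1) reduces each logged trajectory to a scalar PDM-style score stored in the buffer. The snippet below sketches one plausible aggregation following nuPlan's gates-times-weighted-average convention; the metric set and weights are illustrative assumptions, not the paper's exact configuration:

```python
# Hedged sketch of assembling a PDM-style scalar reward for the offline
# buffer. The multiplicative-gates-times-weighted-average structure follows
# nuPlan's closed-loop driving score; the specific metrics and weights here
# are illustrative assumptions.

def pdm_style_score(metrics: dict) -> float:
    # Hard gates: any safety violation zeroes the whole score.
    gates = (
        metrics["no_at_fault_collision"]      # 0 or 1
        * metrics["drivable_area_compliance"]  # 0 or 1
    )
    # Soft terms in [0, 1], combined as a weighted average.
    weights = {"progress": 5.0, "time_to_collision": 5.0, "comfort": 2.0}
    soft = sum(weights[k] * metrics[k] for k in weights) / sum(weights.values())
    return gates * soft

safe = pdm_style_score({
    "no_at_fault_collision": 1, "drivable_area_compliance": 1,
    "progress": 0.8, "time_to_collision": 1.0, "comfort": 0.9,
})
crash = pdm_style_score({
    "no_at_fault_collision": 0, "drivable_area_compliance": 1,
    "progress": 1.0, "time_to_collision": 1.0, "comfort": 1.0,
})
print(safe, crash)  # 0.9 and 0.0
```

The hard gates are what make this supervision denser and more safety-aligned than pure imitation: a trajectory that tracks the expert closely but clips the drivable area still receives zero reward.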

Experiments span the nuPlan closed‑loop splits (val14, test14, test14‑hard) in both non‑reactive and reactive settings, plus the interPlan generalization benchmark. RAPiD matches or slightly exceeds DiffusionPlanner's performance on safety‑centric PDM metrics while reducing inference latency from ~100 ms to ~12 ms, an 8× improvement. In reactive scenarios a modest performance gap remains, attributed to imperfect preservation of multimodal information during distillation. Nonetheless, RAPiD consistently outperforms conventional imitation‑learning baselines in collision avoidance and comfort.

The paper’s contributions are threefold: (1) the first application of SRPO to autonomous‑driving trajectory planning, enabling deterministic, real‑time inference from a diffusion prior; (2) integration of a PDM‑based safety critic that aligns learned rewards with real‑world safety requirements; and (3) a comprehensive evaluation demonstrating state‑of‑the‑art generalization and a substantial speedup. Limitations include the residual gap in highly reactive situations and reliance on accurate score approximations. Future work will explore higher‑fidelity score estimation, richer multimodal policy architectures, and real‑vehicle validation to bridge the remaining performance gap.

