IRL-DAL: Safe and Adaptive Trajectory Planning for Autonomous Driving via Energy-Guided Diffusion Models

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

This paper proposes IRL-DAL, an inverse reinforcement learning framework with a diffusion-based adaptive lookahead planner for autonomous vehicles. Training begins with imitation of an expert finite state machine (FSM) controller, which provides a stable initialization. The policy is then fine-tuned with Proximal Policy Optimization (PPO) under a hybrid reward that combines a sparse environment term with a dense signal from an IRL discriminator, aligning the agent with expert goals. A conditional diffusion model acts as a safety supervisor, planning trajectories that stay in lane, avoid obstacles, and remain smooth. A learnable adaptive mask (LAM) further improves perception by shifting visual attention according to vehicle speed and nearby hazards. Training is run in the Webots simulator with a two-stage curriculum. The system reaches a 96% success rate and reduces collisions to 0.05 per 1,000 steps, setting a new benchmark for safe navigation. The resulting agent not only keeps its lane but also handles unsafe conditions at an expert level, increasing robustness. We make our code publicly available.


💡 Research Summary

The paper introduces IRL‑DAL, a unified framework that combines imitation, inverse reinforcement learning, reinforcement learning, diffusion‑based safety supervision, and adaptive perception for autonomous driving. The training pipeline begins with Behavioral Cloning (BC) from an expert finite‑state‑machine (FSM) controller, providing a stable initialization that captures basic lane‑keeping and obstacle‑avoidance behaviors with minimal data. After this warm‑start, the policy is fine‑tuned using Proximal Policy Optimization (PPO) under a hybrid reward that blends an environment‑defined term (r_env) with a dense reward learned by an IRL discriminator (r_IRL). A phase‑dependent weight w_IRL gradually shifts emphasis from the sparse environmental signal to the expert‑derived IRL signal, addressing the classic reward‑design bottleneck and reducing covariate shift.
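The phase-dependent blend of r_env and r_IRL can be sketched as below. The linear schedule for w_IRL is an assumption: the summary states only that the weight gradually shifts emphasis from the environment term to the IRL term.

```python
def hybrid_reward(r_env, r_irl, phase, total_phases):
    """Blend environment and IRL rewards with a phase-dependent weight.

    w_irl ramps linearly from 0 to 1 over the training phases, shifting
    emphasis from the sparse environment signal to the dense IRL signal.
    The linear ramp is an illustrative assumption.
    """
    w_irl = min(1.0, phase / total_phases)
    return (1.0 - w_irl) * r_env + w_irl * r_irl

# Early training: the environment reward dominates.
print(hybrid_reward(r_env=1.0, r_irl=0.2, phase=1, total_phases=10))   # 0.92
# Late training: the IRL reward dominates.
print(hybrid_reward(r_env=1.0, r_irl=0.2, phase=10, total_phases=10))  # 0.2
```

Any monotone schedule (e.g. cosine or step-wise) would serve the same purpose; the key property is that the expert-derived signal takes over only after the policy has learned from the environment term.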

Safety is enforced by a conditional diffusion model that acts as an on‑demand supervisor. During rollout, when the PPO‑selected action is deemed risky, the diffusion planner generates candidate short‑horizon trajectories. An energy‑based objective penalizes collisions, abrupt steering changes, and violations of vehicle dynamics. By computing the gradient of this energy with respect to the noisy diffusion state and adding it to the diffusion step (gradient‑guided sampling), the planner produces “safe” trajectories that are fed back into the replay buffer. This mechanism allows the policy to learn from corrected experiences without requiring a separate open‑loop planner at inference time.
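A minimal sketch of the gradient-guided correction, using a toy energy over 2D waypoints. The specific energy terms, their weights, and the finite-difference gradient are illustrative assumptions; in the actual method the guidance gradient is added inside each reverse-diffusion step rather than applied as a stand-alone descent.

```python
import numpy as np

def energy(traj, obstacle, min_dist=1.0, smooth_w=0.1):
    """Toy energy: penalizes proximity to an obstacle and abrupt waypoint
    changes. The paper's energy also covers dynamics violations."""
    d = np.linalg.norm(traj - obstacle, axis=-1)
    collision = np.sum(np.maximum(0.0, min_dist - d) ** 2)
    smoothness = smooth_w * np.sum(np.diff(traj, axis=0) ** 2)
    return collision + smoothness

def energy_grad(traj, obstacle, eps=1e-4):
    """Finite-difference gradient of the energy w.r.t. the noisy state."""
    g = np.zeros_like(traj)
    for idx in np.ndindex(traj.shape):
        e = np.zeros_like(traj)
        e[idx] = eps
        g[idx] = (energy(traj + e, obstacle) - energy(traj - e, obstacle)) / (2 * eps)
    return g

def guided_step(traj, obstacle, guide_scale=0.1):
    """One guidance update: nudge the sample downhill in energy."""
    return traj - guide_scale * energy_grad(traj, obstacle)

# A 5-waypoint trajectory passing close to an obstacle at the origin.
traj = np.linspace([-2.0, 0.2], [2.0, 0.2], 5)
obstacle = np.array([0.0, 0.0])
corrected = traj.copy()
for _ in range(50):
    corrected = guided_step(corrected, obstacle)
print(energy(corrected, obstacle) < energy(traj, obstacle))  # True
```

In the full pipeline the corrected trajectory would be written back into the replay buffer, so the PPO policy learns from the safe experience rather than the risky one it originally proposed.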

Perception is enhanced through a Learnable Adaptive Mask (LAM). LAM receives two scalar cues: normalized vehicle speed (v_norm) and a hazard level derived from the minimum LiDAR range (h). Two learnable scalars (α_speed, α_lidar) modulate multiplicative factors f_speed and f_hazard, which in turn adjust a vertical intensity weight w_lower. A smooth vertical gradient mask is constructed per image row, normalized, and concatenated as a fourth channel to the RGB input. Consequently, at high speeds the lower part of the image (road surface) receives more attention for precise lane keeping, while in close‑proximity hazard scenarios the mask boosts the upper region, highlighting pedestrians or obstacles. LAM is trained jointly with BC loss, enabling the perception module to discover context‑aware attention patterns without heavy self‑attention overhead.
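The mask construction can be sketched as below. The functional forms chosen for f_speed, f_hazard, and w_lower are assumptions, since the summary specifies only their roles (speed boosts the lower image region, hazards boost the upper region).

```python
import numpy as np

def lam_mask(height, v_norm, hazard, alpha_speed=1.0, alpha_lidar=1.0):
    """Build a smooth vertical attention mask per image row.

    f_speed and f_hazard modulate the lower-region weight w_lower; the
    exact forms here are illustrative assumptions.
    """
    f_speed = 1.0 + alpha_speed * v_norm    # fast driving -> road surface
    f_hazard = 1.0 + alpha_lidar * hazard   # close hazard -> upper region
    w_lower = f_speed / f_hazard
    rows = np.linspace(0.0, 1.0, height)    # 0 = top row, 1 = bottom row
    mask = (1.0 - rows) + rows * w_lower    # smooth vertical gradient
    return mask / mask.max()                # normalize to [0, 1]

def attach_mask(rgb, v_norm, hazard):
    """Concatenate the mask as a fourth channel to an HxWx3 image."""
    h, w, _ = rgb.shape
    mask = lam_mask(h, v_norm, hazard)
    mask_plane = np.repeat(mask[:, None], w, axis=1)[..., None]
    return np.concatenate([rgb, mask_plane], axis=-1)

img = np.random.rand(64, 64, 3).astype(np.float32)
fast = attach_mask(img, v_norm=0.9, hazard=0.0)
print(fast.shape)                        # (64, 64, 4)
print(fast[-1, 0, 3] > fast[0, 0, 3])    # bottom row weighted more -> True
```

In training, α_speed and α_lidar would be learnable parameters optimized jointly with the BC loss; here they are fixed defaults for clarity.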

The overall learning process follows a two‑stage curriculum: (1) BC + LAM training to obtain a competent baseline; (2) PPO fine‑tuning with the hybrid reward, during which the diffusion safety supervisor intervenes only in high‑risk states, correcting actions and storing safe experiences. Experiments are conducted in the Webots simulator across diverse traffic configurations, dynamic obstacles, and varying weather conditions. The proposed system achieves a 96% success rate and reduces collisions to 0.05 per 1,000 steps, outperforming recent hybrid IL‑RL baselines by a large margin. Ablation studies show that removing LAM increases collision rates by over 30%, and disabling the diffusion supervisor leads to unstable policies with frequent off‑lane excursions.
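The stage-2 intervention logic can be sketched as follows. The risk score, its threshold, and `diffusion_correct` are hypothetical placeholders standing in for the energy-guided diffusion planner and the scene-derived risk assessment.

```python
def diffusion_correct(action):
    """Stand-in for the energy-guided diffusion planner; here it simply
    damps the commanded controls (an illustrative assumption)."""
    return [a * 0.5 for a in action]

def rollout_step(policy_action, risk, buffer, risk_threshold=0.8):
    """One interaction step of stage-2 training: the supervisor intervenes
    only when the PPO action is judged risky, and the corrected experience
    is stored so the policy can learn from it."""
    if risk > risk_threshold:
        action = diffusion_correct(policy_action)
        buffer.append((action, "corrected"))
    else:
        action = policy_action
        buffer.append((action, "policy"))
    return action

buffer = []
rollout_step([1.0, -0.4], risk=0.9, buffer=buffer)  # risky -> corrected
rollout_step([0.3, 0.0], risk=0.1, buffer=buffer)   # safe -> passed through
print([tag for _, tag in buffer])  # ['corrected', 'policy']
```

Because the supervisor fires only above the risk threshold, its planning cost is paid rarely, which is what lets the method avoid running a separate open-loop planner at every inference step.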

All code, trained models, and curriculum scripts are released publicly, facilitating reproducibility and future extensions to real‑world platforms. By tightly integrating expert imitation, IRL‑derived dense rewards, diffusion‑based safety guidance, and adaptive perception, IRL‑DAL sets a new benchmark for safe, efficient, and human‑like autonomous driving.

