MTDrive: Multi-turn Interactive Reinforcement Learning for Autonomous Driving

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Trajectory planning is a core task in autonomous driving, requiring the prediction of safe and comfortable paths across diverse scenarios. Integrating Multi-modal Large Language Models (MLLMs) with Reinforcement Learning (RL) has shown promise in addressing “long-tail” scenarios. However, existing methods are constrained to single-turn reasoning, limiting their ability to handle complex tasks requiring iterative refinement. To overcome this limitation, we present MTDrive, a multi-turn framework that enables MLLMs to iteratively refine trajectories based on environmental feedback. MTDrive introduces Multi-Turn Group Relative Policy Optimization (mtGRPO), which mitigates reward sparsity by computing relative advantages across turns. We further construct an interactive trajectory understanding dataset from closed-loop simulation to support multi-turn training. Experiments on the NAVSIM benchmark demonstrate superior performance compared to existing methods, validating the effectiveness of our multi-turn reasoning paradigm. Additionally, we implement system-level optimizations to reduce data transfer overhead caused by high-resolution images and multi-turn sequences, achieving a 2.5× improvement in training throughput. Our data, models, and code will be made available soon.


💡 Research Summary

MTDrive introduces a novel multi‑turn interactive reinforcement‑learning framework for autonomous‑driving trajectory planning that tightly integrates multimodal large language models (MLLMs) with a specially designed RL algorithm. The authors observe that existing vision‑language approaches for driving are limited to single‑turn reasoning, which hampers their ability to iteratively refine trajectories in complex, long‑tail scenarios. To address this, MTDrive establishes a closed‑loop interaction where, at each turn, the MLLM receives front‑view images, ego‑vehicle state, navigation commands, and the accumulated feedback from previous turns, and then generates a new trajectory. A PDM (Predictive Driver Model) agent evaluates the proposed trajectory on safety‑critical metrics—No Collisions (NC), Drivable Area Compliance (DAC), and Time‑to‑Collision (TTC)—and returns a textual description of any violations. This textual feedback is appended to the prompt for the next turn, enabling the model to self‑reflect and correct its mistakes.
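The interaction loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual interface: the `Turn` record, the `multi_turn_rollout` function, and the `[feedback]` prompt tag are hypothetical names introduced here.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    prompt: str
    trajectory: list   # waypoints proposed by the model this turn
    feedback: str      # textual PDM feedback; "" means no violations

def multi_turn_rollout(model, pdm_agent, base_prompt, max_turns=3):
    """Closed-loop refinement: each turn's prompt carries the accumulated
    textual feedback from all previous turns, and the loop exits early
    once the PDM agent reports no NC / DAC / TTC violations."""
    turns, feedback_log = [], []
    for _ in range(max_turns):
        prompt = base_prompt + "".join(f"\n[feedback] {f}" for f in feedback_log)
        trajectory = model(prompt)           # MLLM proposes a trajectory
        feedback = pdm_agent(trajectory)     # e.g. "DAC violated at waypoint 4"
        turns.append(Turn(prompt, trajectory, feedback))
        if not feedback:                     # no violations: stop refining
            break
        feedback_log.append(feedback)
    return turns
```

The early exit reflects the self-correction behavior: a rollout only continues for as many turns as the PDM agent keeps reporting problems.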

The core algorithmic contribution is Multi‑Turn Group Relative Policy Optimization (mtGRPO). Standard GRPO or PPO assigns a single scalar reward to an entire episode, which leads to severe reward sparsity when multiple refinement turns are involved. mtGRPO computes a separate reward for each turn by linearly combining the PDM score (weight 0.8) with a format‑preservation score (weight 0.2). These per‑turn rewards are mapped onto the tokens generated in the corresponding turn, and a turn‑level relative advantage is estimated by normalizing across rollouts. Consequently, the contribution of each refinement step is explicitly credited, stabilizing policy updates and accelerating convergence.
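A sketch of the per-turn reward and advantage computation, under one reading of the description above: normalizing each turn index separately across the group of rollouts is an assumption here, and the 0.8/0.2 weights are the only numbers taken from the summary.

```python
import numpy as np

W_PDM, W_FMT = 0.8, 0.2  # reward weights stated in the summary

def turn_rewards(pdm_scores, format_scores):
    """Per-turn scalar reward: linear mix of PDM score and format score."""
    return W_PDM * np.asarray(pdm_scores) + W_FMT * np.asarray(format_scores)

def turn_level_advantages(rewards, eps=1e-8):
    """rewards: (G, T) array of per-turn rewards for G rollouts x T turns.
    GRPO-style relative advantage, normalized within the rollout group
    independently at each turn index, so every refinement step is
    credited rather than one sparse episode-level signal."""
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    return (rewards - mean) / (std + eps)
```

In training, each entry of the resulting `(G, T)` advantage matrix would be broadcast onto the tokens the policy generated in that rollout's corresponding turn.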

Data curation is another pillar of the work. Using the NAVSIM closed‑loop simulator, the authors construct an “interactive trajectory understanding” dataset comprising three components: (1) single‑turn examples for basic trajectory generation, (2) multi‑turn examples generated via a bootstrap process where a model’s output is fed back to the PDM agent and then concatenated with the original prompt to form the next‑turn sample, and (3) PDM‑understanding QA pairs that teach the model to interpret metric feedback. For RL training, a filtered subset of the NAVSIM training set is selected based on low initial scores or non‑empty feedback after the first turn, ensuring that the RL agent focuses on challenging cases.
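The RL data-filtering rule described above can be expressed as a one-pass selection. The field names and the score threshold below are illustrative assumptions; the summary only states the two selection criteria, not a cutoff value.

```python
def select_rl_cases(samples, score_threshold=0.8):
    """Filter NAVSIM training scenes for RL: keep a scene if its
    first-turn rollout either scored below a threshold (hypothetical
    value) or drew non-empty PDM feedback, so that RL training
    concentrates on challenging cases."""
    return [
        s for s in samples
        if s["first_turn_score"] < score_threshold or s["first_turn_feedback"]
    ]
```

Scenes the model already solves cleanly on the first turn contribute little gradient signal under a group-relative objective, which motivates this filtering.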

On the systems side, MTDrive builds on the veRL framework and introduces two optimizations to mitigate the heavy data transfer caused by high‑resolution images and long turn sequences: (i) image compression and streaming pipelines, and (ii) batch‑level caching of token embeddings. These optimizations yield a 2.5× increase in training throughput without sacrificing model performance.
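One way the embedding-caching idea could work in a multi-turn setting: the same high-resolution frames recur across turns of a rollout, so their vision-token embeddings can be computed once and reused instead of being re-encoded and re-transferred each turn. The content-hash keying below is an illustrative assumption, not the paper's documented mechanism.

```python
import hashlib

class EmbeddingCache:
    """Cache vision-token embeddings keyed by image content, so repeated
    frames across turns hit the cache instead of the encoder."""

    def __init__(self, encoder):
        self.encoder = encoder   # maps raw image bytes -> embeddings
        self._cache = {}

    def embed(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.encoder(image_bytes)
        return self._cache[key]
```

Because every extra turn re-sends the full image context, the savings compound with the number of refinement turns per rollout.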

Experimental evaluation on the NAVSIM benchmark demonstrates that MTDrive achieves a PDMS (Predictive Driver Model Score) of 96.2 when privileged ground‑truth perception is provided, and 91.1 under realistic perception conditions that rely only on current‑frame sensors and kinematic prediction. Both numbers surpass prior state‑of‑the‑art methods by a substantial margin. Ablation studies confirm that mtGRPO’s per‑turn advantage estimation and the multi‑turn dataset are critical for the observed gains. Qualitative analyses show that the model successfully resolves long‑tail scenarios such as complex intersections, sudden lane changes, and emergency braking by iteratively correcting its trajectory based on the PDM feedback.

In summary, MTDrive contributes (1) a full multi‑turn interactive loop for trajectory refinement, (2) a novel RL objective (mtGRPO) that solves reward sparsity via turn‑level relative advantages, (3) a curated multi‑turn dataset that enables supervised fine‑tuning and RL, and (4) system‑level engineering that makes large‑scale multimodal RL practical. The work opens a path toward deploying LLM‑enhanced, feedback‑driven planning modules in real autonomous vehicles, and suggests future extensions to richer sensor suites, real‑world road testing, and human‑in‑the‑loop collaboration.

