Closing the Loop: Motion Prediction Models beyond Open-Loop Benchmarks

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Fueled by motion prediction competitions and benchmarks, recent years have seen the emergence of increasingly large learning-based prediction models, many with millions of parameters, focused on improving open-loop prediction accuracy by mere centimeters. However, these benchmarks fail to assess whether such improvements translate to better performance when integrated into an autonomous driving stack. In this work, we systematically evaluate the interplay between state-of-the-art motion predictors and motion planners. Our results show that higher open-loop accuracy does not always correlate with better closed-loop driving behavior and that other factors, such as temporal consistency of predictions and planner compatibility, also play a critical role. Furthermore, we investigate downsized variants of these models, and, surprisingly, find that in some cases models with up to 86% fewer parameters yield comparable or even superior closed-loop driving performance. Our code is available at https://github.com/aumovio/pred2plan.


💡 Research Summary

The paper addresses a critical gap in the evaluation of motion‑prediction models for autonomous driving. While recent research has produced increasingly large deep‑learning predictors that shave a few centimeters off open‑loop (OL) error metrics such as minADE and minFDE, these benchmarks ignore how the predictions are actually used by downstream planners. The authors ask whether improvements in OL accuracy translate into better closed‑loop (CL) driving performance, and whether other factors—temporal consistency, uncertainty representation, and planner compatibility—play a more decisive role.
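The open-loop metrics mentioned above, minADE and minFDE, can be stated compactly in code. This is a minimal sketch of the standard definitions (best-of-K average and final displacement error); the function name and array layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def min_ade_fde(predictions, ground_truth):
    """Open-loop displacement errors over K predicted modes.

    predictions:  (K, T, 2) array of K candidate trajectories, T timesteps.
    ground_truth: (T, 2) array of the observed future trajectory.
    Returns (minADE, minFDE): best average / final displacement over modes.
    """
    # Per-mode, per-timestep Euclidean distance to ground truth: shape (K, T)
    dists = np.linalg.norm(predictions - ground_truth[None], axis=-1)
    min_ade = dists.mean(axis=1).min()   # best mode by average error
    min_fde = dists[:, -1].min()         # best mode by final-step error
    return min_ade, min_fde

# Example: two modes, one exactly on the ground truth
gt = np.stack([np.arange(5, dtype=float), np.zeros(5)], axis=-1)
preds = np.stack([gt, gt + np.array([0.0, 1.0])])  # second mode offset by 1 m
ade, fde = min_ade_fde(preds, gt)
print(ade, fde)  # 0.0 0.0 -- the perfect mode wins
```

Because only the best mode counts, these metrics say nothing about the quality of the other modes or their probabilities, which is part of the gap the paper highlights.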

To answer these questions, the authors build a comprehensive CL evaluation framework by integrating two previously independent systems: UniTraj, which provides a unified data interface and cross‑dataset handling, and nuPlan, a large‑scale simulation platform that runs realistic planning scenarios derived from over 1,300 hours of real‑world driving logs. The framework allows any motion‑prediction model to be paired with any planner and evaluated on the same set of scenarios with non‑reactive traffic agents, thereby isolating the effect of the predictor.
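The pairing described above can be sketched as a minimal interface. All names here (`Predictor.predict`, `Planner.plan`, the `scenario` methods) are illustrative assumptions, not the actual UniTraj or nuPlan API; the point is the structure that lets any predictor be swapped in behind a fixed planner:

```python
from typing import List, Protocol, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints

class Predictor(Protocol):
    """Hypothetical interface: maps agent histories to multimodal futures."""
    def predict(self, agents: dict) -> List[Trajectory]: ...

class Planner(Protocol):
    """Hypothetical interface: consumes predictions, emits an ego trajectory."""
    def plan(self, ego_state: dict, predictions: List[Trajectory]) -> Trajectory: ...

def run_closed_loop(predictor: Predictor, planner: Planner, scenario, horizon: int):
    """Evaluate one predictor/planner pair on a scenario with non-reactive
    agents (they replay logged motion, isolating the predictor's effect)."""
    for step in range(horizon):
        obs = scenario.observe(step)                # logged agent states
        preds = predictor.predict(obs["agents"])    # multimodal forecasts
        ego_traj = planner.plan(obs["ego"], preds)  # plan around forecasts
        scenario.advance_ego(ego_traj)              # only the ego moves freely
    return scenario.score()                         # closed-loop driving score
```

Holding the scenario set and planner fixed while varying only the predictor is what allows the CL comparisons reported below.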

The study evaluates three state‑of‑the‑art transformer‑based predictors—Motion‑Transformer (MTR), Wayformer, and Autobot—each in its original size and in a “Mini” version with dramatically reduced parameters (up to 86% fewer). In addition, a simple constant‑velocity kinematic predictor, a multi‑hypothesis kinematic variant, and an oracle predictor (ground‑truth replay) are included as baselines. Model capacities range from 65 M parameters (MTR) down to 1.5 M (Autobot), with corresponding FLOPs from 10.34 G to 1.66 G. Two planners are used: the optimization‑based planner from nuPlan and a Model‑Predictive Contouring Controller (MPCC) from UniTraj, both operating under a two‑stage control architecture that feeds planner outputs to an LQR controller for feasibility.

Key findings:

  1. Non‑linear relationship between OL and CL performance – High OL accuracy does not guarantee superior CL scores. For example, the full‑size MTR achieves a CL score of 0.78, while the 86% smaller MTR‑Mini scores 0.81, outperforming its larger counterpart. Conversely, Wayformer‑Mini, despite modest OL degradation, suffers a large CL drop because its output format does not match the planner’s expectations.

  2. Temporal consistency matters – Predictors that produce smooth, temporally coherent trajectories enable planners to generate stable control commands. Models with abrupt trajectory changes force planners into frequent braking or acceleration, increasing collision risk and passenger discomfort. The Mini models, despite fewer parameters, often exhibit smoother predictions, leading to better CL outcomes.

  3. Planner‑predictor compatibility is crucial – The nuPlan planner explicitly uses the probabilities and negative log‑likelihood (NLL) of each predicted mode for risk assessment. Predictors that provide accurate uncertainty estimates allow the planner to adopt more aggressive yet safe maneuvers. When the number of modes, probability distribution, or temporal resolution does not align with the planner’s requirements, the planner defaults to conservative behavior (speed reduction, lane‑keeping), degrading overall performance.

  4. Model compression can be beneficial – Reducing parameters dramatically cuts inference latency and power consumption, which are critical for on‑vehicle deployment. The experiments demonstrate that an 86% reduction in MTR’s parameters yields comparable or better CL performance, suggesting that large models may be over‑parameterized for the planning task.

  5. Cross‑dataset generalization – By unifying nuScenes, Argoverse 2, Waymo Open, and the Shifts dataset through UniTraj, the authors show that training on a diverse set improves CL robustness. Models trained on the Shifts dataset, which contains a wide variety of geographic and traffic conditions, maintain more consistent CL scores across all test sets.
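The temporal-consistency notion in finding 2 can be made concrete with a simple metric. This is a hedged sketch of one plausible formulation—mean displacement between the overlapping portions of forecasts issued one replanning cycle apart—not the paper's exact definition:

```python
import numpy as np

def temporal_inconsistency(pred_t, pred_t1, dt_steps=1):
    """Hypothetical consistency metric between consecutive forecasts.

    pred_t:  (T, 2) trajectory predicted at time t.
    pred_t1: trajectory predicted dt_steps timesteps later; after shifting
             pred_t by dt_steps, both cover the same future timestamps.
    Lower values mean smoother, more stable inputs for the planner.
    """
    overlap_a = pred_t[dt_steps:]          # timestamps t+1 .. t+T-1
    overlap_b = pred_t1[:len(overlap_a)]   # same timestamps, newer forecast
    return float(np.linalg.norm(overlap_a - overlap_b, axis=-1).mean())

# A predictor that re-emits the same future each cycle scores 0
traj = np.stack([np.arange(6, dtype=float), np.zeros(6)], axis=-1)
print(temporal_inconsistency(traj, traj[1:]))  # 0.0
```

A predictor could score well on minADE at every frame yet flip between modes from one cycle to the next; a metric like this exposes exactly the jitter that forces planners into abrupt braking.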
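Finding 3 notes that the nuPlan planner consumes mode probabilities and NLL for risk assessment. As a minimal sketch of what such a likelihood term could look like, here is the NLL of an observed position under a mixture of isotropic Gaussians centred on the predicted modes; the isotropic-Gaussian assumption and the function name are simplifications of my own, not the paper's likelihood model:

```python
import math

def mode_nll(point, modes, probs, sigma=1.0):
    """NLL of a 2-D position under a Gaussian mixture over predicted modes.

    point: observed (x, y); modes: list of (x, y) mode centres;
    probs: mode weights summing to 1; sigma: shared isotropic std-dev.
    A well-calibrated predictor assigns low NLL to what actually happens.
    """
    x, y = point
    likelihood = 0.0
    for (mx, my), p in zip(modes, probs):
        d2 = (x - mx) ** 2 + (y - my) ** 2
        likelihood += p * math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    return -math.log(likelihood + 1e-12)  # guard against log(0)

# Mass on the mode that actually occurs -> low NLL
good = mode_nll((0.0, 0.0), [(0.0, 0.0), (5.0, 0.0)], [0.9, 0.1])
# Same modes, mass on the wrong one -> higher NLL
bad = mode_nll((0.0, 0.0), [(0.0, 0.0), (5.0, 0.0)], [0.1, 0.9])
print(good < bad)  # True
```

This illustrates why calibrated probabilities matter beyond minADE/minFDE: both predictors above contain the correct mode, but only the first gives the planner a trustworthy risk signal.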

The paper concludes that evaluating motion‑prediction models solely on OL displacement errors is insufficient for real autonomous‑driving systems. Instead, a system‑level CL assessment that accounts for temporal consistency, uncertainty quality, and planner‑predictor interface design provides a more meaningful measure of a model’s utility. Moreover, the results encourage the development of lightweight predictors that are tailored to the needs of downstream planners, rather than pursuing ever‑larger networks for marginal OL gains. Future work is suggested in three directions: (i) incorporating reactive traffic agents to study bidirectional interaction, (ii) jointly optimizing predictor and planner in an end‑to‑end fashion, and (iii) establishing standardized interfaces and benchmark protocols for CL evaluation.

Overall, the study offers a rigorous methodology, extensive empirical evidence, and actionable insights that should reshape how the research community evaluates and designs motion‑prediction models for autonomous driving.

