Addressing the Waypoint-Action Gap in End-to-End Autonomous Driving via Vehicle Motion Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

End-to-End Autonomous Driving (E2E-AD) systems are typically grouped by the nature of their outputs: (i) waypoint-based models that predict a future trajectory, and (ii) action-based models that directly output throttle, steer and brake. Most recent benchmark protocols and training pipelines are waypoint-based, which makes action-based policies harder to train and compare, slowing their progress. To bridge this waypoint-action gap, we propose a novel, differentiable vehicle-model framework that rolls out predicted action sequences to their corresponding ego-frame waypoint trajectories while supervising in waypoint space. Our approach enables action-based architectures to be trained and evaluated, for the first time, within waypoint-based benchmarks without modifying the underlying evaluation protocol. We extensively evaluate our framework across multiple challenging benchmarks and observe consistent improvements over the baselines. In particular, on NAVSIM navhard our approach achieves state-of-the-art performance. Our code will be made publicly available upon acceptance.

💡 Research Summary

The paper tackles a growing mismatch in the autonomous driving research community: most recent benchmarks and training pipelines expect end‑to‑end (E2E) models to output future waypoints, while a parallel line of work produces low‑level control actions (throttle, steering, brake). Because the standard evaluation protocols only accept waypoint trajectories, action‑based policies cannot be fairly compared or trained without an additional conversion step, creating a “waypoint‑action gap” that hinders progress of action‑centric approaches.

To bridge this gap, the authors introduce a differentiable, deterministic vehicle‑motion framework that lifts a sequence of predicted actions into an ego‑frame waypoint trajectory. The core component is a “lifting operator” Fϕ, which is decomposed into three modular functions: (1) a control‑activation mapping ψϕ that converts raw network outputs into bounded physical controls, (2) a dynamics rollout fϕ that propagates the vehicle state forward using a chosen kinematic model, and (3) a pose‑projection h that extracts planar (x, y) positions from the full state. By ensuring each sub‑module is C¹ continuous, they prove (Proposition 3.1) that Fϕ is deterministic and continuously differentiable, allowing gradients to flow from waypoint loss back through the vehicle model to the policy network parameters θ while keeping the vehicle model parameters ϕ fixed.
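The three-stage decomposition described above can be illustrated with a minimal sketch. It assumes a rear-axle kinematic bicycle model with forward-Euler integration and a simplified two-dimensional action (acceleration, steering) instead of throttle, steer and brake. The constants and the function names `activate`, `step`, `project` and `lift` are hypothetical and not taken from the paper.

```python
import math

# Hypothetical constants; the paper's actual vehicle parameters are not given here.
WHEELBASE_M = 2.7      # assumed wheelbase (m)
MAX_STEER_RAD = 0.6    # assumed steering limit (rad)
MAX_ACCEL = 3.0        # assumed acceleration limit (m/s^2)
DT = 0.5               # assumed time between waypoints (s)

def activate(raw_accel, raw_steer):
    """psi: squash unbounded network outputs into bounded physical controls."""
    return MAX_ACCEL * math.tanh(raw_accel), MAX_STEER_RAD * math.tanh(raw_steer)

def step(state, control):
    """f: one forward-Euler step of a rear-axle kinematic bicycle model."""
    x, y, yaw, v = state
    accel, steer = control
    return (
        x + v * math.cos(yaw) * DT,
        y + v * math.sin(yaw) * DT,
        yaw + v / WHEELBASE_M * math.tan(steer) * DT,
        v + accel * DT,
    )

def project(state):
    """h: keep only the planar (x, y) position."""
    return state[0], state[1]

def lift(raw_actions, v0):
    """F: map a raw action sequence to ego-frame waypoints via psi, f and h."""
    state = (0.0, 0.0, 0.0, v0)  # ego frame: origin, heading along +x
    waypoints = []
    for raw_accel, raw_steer in raw_actions:
        state = step(state, activate(raw_accel, raw_steer))
        waypoints.append(project(state))
    return waypoints

straight = lift([(0.0, 0.0)] * 4, v0=5.0)
left_turn = lift([(0.0, 1.0)] * 4, v0=5.0)
```

Every operation in this sketch (tanh, sine, cosine, and tangent on a bounded steering angle) is smooth, which is the property an autodiff framework would rely on to pass waypoint gradients back to the raw actions.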

Two concrete kinematic models are instantiated: the classic Kinematic Bicycle Model (KBM) and a Continuous‑Curvature Path Planner (CCPP) that uses clothoid arcs and integrates along arc‑length rather than time. A third, purely data‑driven MLP lifting operator is also evaluated for comparison, though it lacks physical guarantees.
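To illustrate what integrating along arc length means for a clothoid, the sketch below uses the standard definition of a clothoid arc, whose curvature varies linearly with arc length s, and integrates position with the midpoint rule. It is a generic illustration rather than the paper's CCPP formulation, which this summary does not specify. The names `heading_at` and `clothoid_end_pose` and the step count are assumptions.

```python
import math

def heading_at(s, kappa0, sharpness):
    """Heading after arc length s when curvature is kappa0 + sharpness * s."""
    return kappa0 * s + 0.5 * sharpness * s * s

def clothoid_end_pose(kappa0, sharpness, arc_length, num_steps=200):
    """Integrate x' = cos(theta), y' = sin(theta) over arc length (midpoint rule).

    Returns the end pose (x, y, heading) expressed in the start frame.
    """
    ds = arc_length / num_steps
    x = y = 0.0
    for i in range(num_steps):
        theta = heading_at((i + 0.5) * ds, kappa0, sharpness)
        x += math.cos(theta) * ds
        y += math.sin(theta) * ds
    return x, y, heading_at(arc_length, kappa0, sharpness)

# Constant curvature 0.1 1/m over a quarter turn is a circular arc of radius 10 m.
quarter_circle = clothoid_end_pose(0.1, 0.0, (math.pi / 2) / 0.1)
# Curvature ramping up from zero bends the path progressively to the left.
spiral = clothoid_end_pose(0.0, 0.01, 10.0)
```

Because the independent variable is distance rather than time, the shape of the path in this sketch does not depend on the speed profile.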

Training proceeds as follows: at each decision step t, the policy network Nθ receives sensor observations o_t and a high‑level command c_t (e.g., “turn left”) and outputs a sequence of actions a_t. The lifting operator rolls out these actions into predicted waypoints ŵ_t, which are supervised against ground‑truth waypoints w_gt,t using a simple L1 position loss (optionally weighted over time). The loss is back‑propagated through ψϕ, fϕ, and h, updating only the policy parameters. Heading errors are omitted from the loss, a design choice justified by the focus on position accuracy and the desire to remain comparable with existing waypoint‑only methods.
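The position-only objective can be sketched as a weighted L1 distance between predicted and ground-truth waypoints. Dividing by the sum of the weights is an assumption made here for illustration, as are the function name `l1_waypoint_loss` and the example values. The summary does not state how the paper normalizes or weights the loss.

```python
def l1_waypoint_loss(predicted, target, weights=None):
    """Weighted mean of per-waypoint L1 position errors (heading is ignored)."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must cover the same horizon")
    if weights is None:
        weights = [1.0] * len(predicted)  # uniform weighting over time
    total = 0.0
    for (px, py), (tx, ty), w in zip(predicted, target, weights):
        total += w * (abs(px - tx) + abs(py - ty))
    return total / sum(weights)  # assumed normalization

predicted = [(1.0, 0.0), (2.0, 0.5)]
target = [(1.0, 0.0), (2.0, 0.0)]

uniform = l1_waypoint_loss(predicted, target)                # 0.25
late_heavy = l1_waypoint_loss(predicted, target, [1.0, 3.0])  # 0.375
```

In the second call the later waypoint carries three times the weight, so the same 0.5 m lateral error contributes more to the loss.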

The framework is evaluated on four challenging benchmarks: NAVSIM navhard, NAVSIM navtest, Bench2Drive, and a CARLA‑based evaluation protocol. Results show that action‑based policies equipped with the proposed lifting operator achieve state‑of‑the‑art performance on NAVSIM navhard (vision‑only), come within 1.5% of the best vision‑only model on navtest, improve the Bench2Drive baseline Driving Score (DS) by up to 61.1%, and exhibit the strongest correlation between loss and closed‑loop driving outcomes in the CARLA protocol. Importantly, these gains are obtained without modifying any benchmark interfaces; the same waypoint‑based loss and evaluation metrics are used for both waypoint‑ and action‑based models.

Key contributions are: (1) a waypoint‑based training objective that can be applied directly to action‑based policies, yielding more stable evaluation and better closed‑loop performance; (2) the first deterministic, differentiable vehicle‑dynamics lifting framework that bridges the two paradigms; (3) a unified experimental platform that demonstrates comparable or superior performance of action‑based methods across multiple benchmarks.

Limitations include the exclusion of heading and velocity terms from the loss, reliance on relatively simple kinematic models (no tire dynamics or friction modeling), and the lack of real‑vehicle validation. Future work could extend the framework with richer dynamics, multi‑modal sensor fusion, heading‑aware losses, and on‑road testing to further close the gap between simulation and deployment.

Overall, the paper provides a solid methodological contribution that enables fair training, evaluation, and comparison of action‑based autonomous driving policies within existing waypoint‑centric benchmarks, thereby opening new avenues for research that leverages the strengths of both paradigms.
