PlanTRansformer: Unified Prediction and Planning with Goal-conditioned Transformer


Trajectory prediction and planning are fundamental yet disconnected components in autonomous driving. Prediction models forecast surrounding agent motion under unknown intentions, producing multimodal distributions, while planning assumes known ego objectives and generates deterministic trajectories. This mismatch creates a critical bottleneck: prediction lacks supervision for agent intentions, while planning requires this information. Existing prediction models, despite strong benchmark performance, often remain disconnected from planning constraints such as collision avoidance and dynamic feasibility. We introduce PlanTRansformer (PTR), a unified Gaussian Mixture Transformer framework integrating goal-conditioned prediction, dynamic feasibility, interaction awareness, and lane-level topology reasoning. A teacher-student training strategy progressively masks surrounding agent commands during training to align with inference conditions, where agent intentions are unavailable. PTR achieves a 4.3%/3.5% improvement in marginal/joint mAP over the baseline Motion Transformer (MTR) and a 15.5% planning error reduction at the 5s horizon compared to GameFormer. The architecture-agnostic design enables application to diverse Transformer-based prediction models. Project Website: https://github.com/SelzerConst/PlanTRansformer


💡 Research Summary

PlanTRansformer (PTR) addresses a fundamental mismatch in autonomous driving pipelines: trajectory prediction models generate multimodal forecasts for surrounding agents under unknown intentions, while planning modules assume known ego goals and produce deterministic, safety‑constrained trajectories. This asymmetry leads to two problems: prediction lacks supervision for agent intents, and planning cannot directly exploit the rich probabilistic information from prediction. PTR proposes a unified Gaussian‑Mixture‑Transformer framework that bridges this gap by conditioning the prediction process on high‑level navigation commands and reachable lane information, and by embedding differentiable dynamic‑feasibility and collision‑avoidance constraints into the training loss.

Core Architecture
PTR builds on the Motion Transformer (MTR) encoder‑decoder backbone. The encoder ingests three polyline‑based modalities: (1) agent histories, (2) map geometry, and (3) reachable lanes (i.e., lanes that satisfy a given navigation command). Each modality is encoded with a PointNet‑style MLP and then concatenated. A local‑attention transformer processes the concatenated sequence, restricting attention to the k‑nearest neighboring polylines to preserve spatial locality while keeping memory usage modest. The resulting fused features are split into refined agent (F′_A), map (F′_M), and lane (F′_L) tokens for the decoder.
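The k-nearest-neighbor restriction of the encoder's local attention can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation; the function names (`knn_attention_mask`, `local_attention`) and the use of polyline centroids for the distance computation are assumptions for the sake of the example.

```python
import numpy as np

def knn_attention_mask(centroids: np.ndarray, k: int) -> np.ndarray:
    """Boolean mask letting each polyline token attend only to its
    k nearest neighbors (including itself) by centroid distance."""
    n = centroids.shape[0]
    d2 = ((centroids[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]          # indices of the k closest
    mask = np.zeros((n, n), dtype=bool)
    np.put_along_axis(mask, nearest, True, axis=1)
    return mask

def local_attention(tokens: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention restricted by the mask."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = np.where(mask, scores, -1e9)            # block non-neighbors
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens
```

Because each row of the attention matrix has only k active entries, memory grows with n·k rather than n², which is what keeps the polyline encoder tractable on dense scenes.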

The decoder introduces two novel mechanisms:

  1. Goal‑conditioned query initialization – High‑level commands (HLCs) such as left turn, straight, right turn, stop, unknown, and vulnerable road user are derived from rule‑based heuristics on agent geometry and dynamics during training, and from planner‑provided waypoints at inference. Each command type has a learned embedding e_c that initializes the content of motion query pairs (static intention and dynamic search queries). This biases the decoder toward intention‑aligned modes and accelerates convergence.

  2. Reachable lane cross‑attention – Refined lane tokens F′_L are projected into the decoder’s embedding space and attended to alongside agent and map tokens. This explicitly enforces route feasibility: predicted trajectories are encouraged to stay within lanes that are reachable given the current command.
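The goal-conditioned query initialization can be illustrated with a short sketch. In the model these embeddings are learned jointly with the network; here they are random placeholders, and the names (`command_embed`, `init_motion_queries`), the embedding dimension, and the additive combination with the static intention queries are illustrative assumptions.

```python
import numpy as np

COMMANDS = ["left_turn", "straight", "right_turn", "stop", "unknown", "vru"]

rng = np.random.default_rng(0)
D, K = 32, 6                      # embedding dim, motion modes (illustrative)
command_embed = {c: rng.normal(size=D) for c in COMMANDS}  # learned in practice
intention_queries = rng.normal(size=(K, D))  # static intention queries

def init_motion_queries(command: str) -> np.ndarray:
    """Initialize the content of the K motion query pairs by adding the
    command embedding e_c to every static intention query."""
    return intention_queries + command_embed[command][None, :]
```

A different command thus shifts all K query contents toward a command-specific region of the embedding space, biasing the decoder toward intention-aligned modes before any cross-attention takes place.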

The decoder iteratively refines queries through self‑attention and cross‑attention, finally outputting a mixture of Gaussians (GMM) for each agent. Each mixture component provides a mean trajectory, covariance, and a mode probability, preserving multimodality while being grounded in navigation constraints.
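The shape of the per-agent GMM head output can be sketched as below. The exact parameterization in the paper is not specified here; this sketch assumes diagonal covariances (per-step log standard deviations) and illustrative mode/horizon counts, with random values standing in for real network outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 6, 80                      # modes, future timesteps (illustrative)

# Per-agent decoder head output (shapes only; values are random here):
means = rng.normal(size=(K, T, 2))        # mean trajectory per mode
log_sigmas = rng.normal(size=(K, T, 2))   # diagonal covariance (log std)
logits = rng.normal(size=K)               # unnormalized mode scores

mode_probs = np.exp(logits - logits.max())
mode_probs /= mode_probs.sum()            # softmax → mode probabilities
```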

Teacher‑Student Masking Strategy
A key training innovation is progressive masking of surrounding‑agent commands. Early epochs use full command information (teacher mode) to stabilize learning; later epochs gradually hide these signals (student mode) so that the model learns to predict under the realistic inference condition where other agents’ intents are unavailable. This curriculum‑style approach yields a model that is both accurate when intents are known and robust when they are not.
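The masking curriculum can be sketched as a schedule on the probability that a surrounding agent's command remains visible. The warmup length, linear decay, and the use of "unknown" as the masked value are assumptions for illustration; the paper's actual schedule may differ.

```python
import numpy as np

def command_keep_prob(epoch: int, warmup: int, ramp: int) -> float:
    """Probability that a surrounding agent's command stays visible:
    full teacher signal during warmup, then a linear decay to zero."""
    if epoch < warmup:
        return 1.0
    return max(0.0, 1.0 - (epoch - warmup) / ramp)

def mask_commands(commands, epoch, warmup=5, ramp=20, rng=None):
    """Replace each surrounding agent's command with 'unknown'
    with probability 1 - keep_prob(epoch)."""
    rng = rng or np.random.default_rng()
    p = command_keep_prob(epoch, warmup, ramp)
    return [c if rng.random() < p else "unknown" for c in commands]
```

At the end of the schedule the model sees only "unknown" commands for surrounding agents, matching the inference condition described above.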

Loss Functions
PTR optimizes a multi‑objective loss:

  • Dense ℓ1 loss for auxiliary future predictions (captures interaction cues).
  • Negative log‑likelihood of the selected Gaussian component (GMM loss) to maximize likelihood of ground‑truth positions.
  • Cross‑entropy on mode probabilities (classification loss).
  • Dynamic feasibility penalties (velocity/acceleration limits).
  • Lane‑violation penalties (distance outside reachable lanes).
  • Collision‑avoidance penalties (hinge loss on inter‑agent distances).

All terms are differentiable, enabling end‑to‑end training that simultaneously improves prediction accuracy and enforces safety constraints.
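The dynamic-feasibility and collision-avoidance penalties can be sketched as hinge losses on finite-difference kinematics and inter-agent distance. The limits, time step, and safety radius below are placeholder values, and the finite-difference formulation is an assumption; it is shown in NumPy for readability, whereas the paper's losses are differentiable terms inside the training graph.

```python
import numpy as np

def dynamics_penalty(traj, dt=0.1, v_max=20.0, a_max=4.0):
    """Hinge penalties on speed and acceleration limits for one
    predicted trajectory of shape (T, 2)."""
    vel = np.diff(traj, axis=0) / dt
    speed = np.linalg.norm(vel, axis=1)
    acc = np.linalg.norm(np.diff(vel, axis=0) / dt, axis=1)
    return (np.maximum(speed - v_max, 0.0).sum()
            + np.maximum(acc - a_max, 0.0).sum())

def collision_penalty(traj_a, traj_b, d_safe=2.0):
    """Hinge loss on inter-agent distance: penalizes timesteps where
    two agents come closer than the safety radius."""
    dist = np.linalg.norm(traj_a - traj_b, axis=1)
    return np.maximum(d_safe - dist, 0.0).sum()
```

Both terms are zero for feasible, well-separated trajectories and grow linearly with the magnitude of the violation, so their gradients push predicted means back inside the feasible region.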

Experimental Evaluation
Experiments on the Waymo Open Motion Dataset compare PTR against the baseline Motion Transformer and the planning‑oriented GameFormer. Results show:

  • Prediction – marginal mAP improves by 4.3% and joint mAP by 3.5% over MTR, demonstrating that goal‑conditioning and lane constraints sharpen multimodal forecasts.
  • Planning – 5‑second horizon trajectory error drops by 15.5% relative to GameFormer, confirming that the embedded feasibility and collision losses produce safer ego trajectories.
  • Ablation studies show that removing command embeddings, lane cross‑attention, or progressive masking each leads to noticeable performance degradation, underscoring the necessity of all three components.

Generality and Limitations
Because PTR’s modifications are applied at the level of the transformer encoder‑decoder rather than MTR‑specific modules, the approach can be transplanted to other transformer‑based prediction models (e.g., SceneTransformer, TNT). Limitations include reliance on rule‑based command labeling (which may not capture nuanced intent in complex traffic) and the computational overhead of additional cross‑attention layers, which may require optimization for real‑time deployment.

Conclusion
PlanTRansformer offers a principled, end‑to‑end solution that unifies trajectory prediction and planning. By conditioning on high‑level navigation commands, explicitly modeling reachable lanes, and enforcing differentiable safety constraints, PTR narrows the gap between probabilistic forecasting and deterministic, feasible motion planning. The teacher‑student masking curriculum further equips the model to operate under realistic information scarcity. This work paves the way for future autonomous driving systems where a single neural network can simultaneously predict surrounding agents’ possible futures and generate safe, goal‑aligned ego trajectories.

