ShipTraj-R1: Reinforcing Ship Trajectory Prediction in Large Language Models via Group Relative Policy Optimization
Recent advancements in reinforcement fine-tuning have significantly improved the reasoning ability of large language models (LLMs). In particular, methods such as group relative policy optimization (GRPO) have demonstrated strong capabilities across various fields. However, applying LLMs to ship trajectory prediction remains largely unexplored. In this paper, we propose ShipTraj-R1, a novel LLM-based framework that reformulates ship trajectory prediction as a text-to-text generation problem. (1) We design a dynamic prompt containing trajectory information about conflicting ships to guide the model to achieve adaptive chain-of-thought (CoT) reasoning. (2) We introduce a comprehensive rule-based reward mechanism to incentivize the reasoning format and prediction accuracy of the model. (3) Our ShipTraj-R1 is reinforced through the GRPO mechanism guided by domain-specific prompts and rewards, and uses Qwen3 as the model backbone. Extensive experimental results on two complex, real-world maritime datasets show that the proposed ShipTraj-R1 achieves the lowest error compared with state-of-the-art deep learning and LLM-based baselines.
💡 Research Summary
The paper introduces ShipTraj‑R1, a novel framework that leverages large language models (LLMs) for ship trajectory prediction by reformulating the task as a text‑to‑text generation problem enriched with chain‑of‑thought (CoT) reasoning. Traditional deep‑learning approaches (LSTM, GRU, CNN, graph‑based networks) excel at modeling temporal and spatial dependencies but struggle to incorporate collision‑avoidance logic that depends on the dynamic context of nearby vessels. ShipTraj‑R1 addresses this gap through three key innovations.
First, it designs a dynamic prompt that embeds the historical trajectory of the target ship together with the trajectories of neighboring ships that pose a collision risk. Risk assessment is performed using a Quaternion Ship Domain (QSD) model, which quantifies safe water zones around each vessel and selects only those ships whose domain overlap exceeds a predefined threshold. This selective inclusion keeps the prompt concise while providing the LLM with the necessary situational awareness.
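The QSD-based screening described above can be sketched in a few lines. This is an illustrative simplification: the per-quadrant radii values, the exponent k = 2, and the use of a point-in-domain test (rather than a full domain-overlap computation) are assumptions, not the paper's exact procedure.

```python
import math

def in_quaternion_domain(dx, dy, r_fore, r_aft, r_starb, r_port, k=2.0):
    """Return True if a point (dx, dy) in the own-ship body frame
    (x ahead, y to starboard) lies inside the quaternion ship domain.
    Each quadrant borrows the radii of its two bounding half-axes."""
    rx = r_fore if dx >= 0 else r_aft
    ry = r_starb if dy >= 0 else r_port
    return (abs(dx) / rx) ** k + (abs(dy) / ry) ** k <= 1.0

def select_conflicting(own, neighbors, radii):
    """Keep only neighbors that intrude into the own ship's domain,
    a simplified proxy for the paper's overlap-threshold test.
    `own` is (x, y, heading_rad); `neighbors` are (x, y) positions."""
    ox, oy, heading = own
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    selected = []
    for nx, ny in neighbors:
        ex, ey = nx - ox, ny - oy
        # rotate the offset into the own-ship body frame
        dx = ex * cos_h + ey * sin_h    # ahead component
        dy = -ex * sin_h + ey * cos_h   # starboard component
        if in_quaternion_domain(dx, dy, *radii):
            selected.append((nx, ny))
    return selected
```

Only the ships returned by a filter like this would be serialized into the prompt, keeping it short while retaining the collision-relevant context.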
Second, the authors develop a rule-based reward function composed of two complementary components. The "thinking format reward" is binary and enforces a strict output structure: the model must wrap its reasoning in dedicated thinking tags and place the final coordinate prediction in a separate answer block, earning the reward only when this structure is followed exactly. The second component, a prediction accuracy reward, scores how closely the predicted coordinates match the ground truth, providing a dense signal that steers the model toward precise outputs.
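A minimal sketch of such a two-part rule-based reward follows. The specific tag names, the exponential shaping of the accuracy term, and the component weights are assumptions made for illustration, not the paper's exact formulation.

```python
import math
import re

# Assumed output format: reasoning in <think> tags, prediction in <answer> tags.
THINK_ANSWER_RE = re.compile(r"^<think>.+</think>\s*<answer>.+</answer>$", re.DOTALL)

def format_reward(text):
    """Binary thinking-format reward: 1.0 only for well-structured output."""
    return 1.0 if THINK_ANSWER_RE.match(text.strip()) else 0.0

def accuracy_reward(pred, target, scale=0.01):
    """Continuous accuracy reward decaying with Euclidean error (illustrative)."""
    return math.exp(-scale * math.dist(pred, target))

def total_reward(text, pred, target, w_fmt=1.0, w_acc=1.0):
    """Weighted sum of the two components (weights are assumptions)."""
    return w_fmt * format_reward(text) + w_acc * accuracy_reward(pred, target)
```

Because both components are computed by fixed rules rather than a learned reward model, the signal is cheap to evaluate and immune to reward-model drift.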
Third, ShipTraj‑R1 applies Group Relative Policy Optimization (GRPO) to fine‑tune the LLM (Qwen‑3) under the defined rewards. For each input, M candidate completions (including CoT reasoning and predicted coordinates) are sampled from the current policy. Their scalar rewards are normalized relative to the group mean and standard deviation, yielding a relative advantage. This advantage is clipped and combined with a KL‑divergence regularization term to form the GRPO objective, which encourages higher‑advantage outputs while preventing drastic policy drift. The process iteratively refines the model’s ability to generate coherent reasoning tailored to the specific conflict context and to output highly accurate coordinate predictions.
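The group-relative advantage and the clipped surrogate at the core of this procedure can be sketched as below; the normalization epsilon and the clipping threshold are illustrative assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO advantage: normalize each completion's scalar reward against
    the group's mean and standard deviation, A_i = (r_i - mean) / (std + eps)."""
    m = sum(rewards) / len(rewards)
    var = sum((r - m) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - m) / (std + eps) for r in rewards]

def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate used inside the GRPO objective;
    `ratio` is pi_new(o|q) / pi_old(o|q) for one completion."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return min(unclipped, clipped)
```

In the full objective, the clipped term is averaged over the M sampled completions and a KL penalty against a frozen reference policy is subtracted, which is what keeps the policy from drifting far from its starting point.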
The experimental evaluation uses real AIS data from two maritime regions: ChengShanJiao Promontory (CSJP) and CaoFeiDian Port (CFDP). After cubic‑spline interpolation to a uniform 5‑second sampling interval, the datasets contain 2,649 and 2,948 complete trajectories respectively. The data split is 90 % training, 5 % validation, and 5 % testing. Evaluation metrics include Final Displacement Error and Mean Absolute Error. ShipTraj‑R1 outperforms a suite of baselines—classical deep‑learning models (LSTM, CNN, Graph Convolutional Networks) and recent LLM‑based methods (LMTraj‑SUP, LG‑Traj)—achieving the lowest errors on both datasets. Notably, in high‑risk collision scenarios, the CoT‑driven reasoning contributes significantly to the performance gain, and the structured output facilitates interpretability and safety verification.
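The two reported metrics can be computed as follows. The exact averaging convention for MAE (here, absolute error averaged over both coordinates of every predicted step) is an assumption, since papers vary on this detail.

```python
import math

def final_displacement_error(pred, truth):
    """FDE: Euclidean distance between the last predicted and last true points.
    `pred` and `truth` are equal-length sequences of (x, y) coordinates."""
    return math.dist(pred[-1], truth[-1])

def mean_absolute_error(pred, truth):
    """MAE: absolute coordinate error averaged over every step and both axes."""
    total = sum(abs(p - t) for pp, tt in zip(pred, truth) for p, t in zip(pp, tt))
    return total / (2 * len(pred))
```

Lower values are better for both; FDE isolates long-horizon drift while MAE reflects accuracy along the whole predicted path.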
The authors acknowledge limitations: converting continuous coordinates to textual tokens may introduce quantization error; long prompts can approach token limits of the underlying LLM; and the QSD parameters may need domain‑specific tuning. Future work is suggested on high‑precision coordinate tokenizers, prompt compression techniques, and multimodal extensions incorporating radar or map imagery.
In summary, ShipTraj‑R1 demonstrates that LLMs, when equipped with domain‑aware prompts, rule‑based reinforcement signals, and GRPO fine‑tuning, can surpass traditional deep‑learning approaches for ship trajectory prediction, delivering both higher accuracy and explainable reasoning essential for maritime safety.