Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology
While Vision-Language Models (VLMs) show increasingly strong reasoning capabilities, their application in meteorology is constrained by both a domain gap and a reasoning-faithfulness gap. Specifically, mainstream Reinforcement Fine-Tuning (RFT) can induce Self-Contradictory Reasoning (Self-Contra), where the model’s reasoning contradicts its final answer, which is unacceptable in such a high-stakes domain. To address these challenges, we construct WeatherQA, a novel multimodal reasoning benchmark in meteorology. We also propose Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT), which resolves Self-Contra by introducing a logical consistency reward. Furthermore, we introduce Weather-R1, to the best of our knowledge the first reasoning VLM with logical faithfulness in meteorology. Experiments demonstrate that Weather-R1 improves performance on WeatherQA by 9.8 percentage points over the baseline, outperforming Supervised Fine-Tuning and RFT, and even surpassing the original Qwen2.5-VL-32B. These results highlight the effectiveness of our LoCo-RFT and the superiority of Weather-R1. Our benchmark and code are available at https://github.com/Marcowky/Weather-R1.
💡 Research Summary
The paper tackles the challenge of applying Vision‑Language Models (VLMs) to high‑stakes meteorological reasoning. First, the authors construct WeatherQA, a multimodal benchmark covering four thematic areas—precipitation, weather phenomena, temperature, and weather systems—across seven image modalities (e.g., cumulative precipitation maps, infrared cloud images, geopotential height and wind fields). The dataset contains 15,400 multiple‑choice items, split chronologically (train 2017‑2021, validation 2022, test 2023), and expert validation of a 5 % sample confirms over 95 % annotation quality.
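The chronological split described above can be sketched in a few lines. This is an illustrative partition by year, not the authors' actual data pipeline; the item structure and the `"year"` field are assumptions for the example.

```python
def split_by_year(items):
    """Partition QA items into train/val/test by year, following the
    chronological split reported for WeatherQA:
    train 2017-2021, validation 2022, test 2023."""
    splits = {"train": [], "val": [], "test": []}
    for item in items:
        year = item["year"]  # assumed field holding the observation year
        if 2017 <= year <= 2021:
            splits["train"].append(item)
        elif year == 2022:
            splits["val"].append(item)
        elif year == 2023:
            splits["test"].append(item)
    return splits
```

A time-based split like this, rather than a random one, prevents the model from seeing future weather patterns during training and gives a more honest estimate of forecast-time generalization.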
Second, they identify a critical flaw in standard Reinforcement Fine‑Tuning (RFT): it rewards only the correctness of the final answer, leading to Self‑Contradictory Reasoning (Self‑Contra) where the generated reasoning does not support the answer. Empirical analysis shows ~30 % of RFT outputs exhibit this inconsistency across all tasks.
To remedy this, they propose Logically Consistent Reinforcement Fine‑Tuning (LoCo‑RFT). LoCo‑RFT adds a logical consistency reward (R_LoCo) that uses an external LLM (gpt‑oss‑20b) as a judge to verify whether the reasoning text aligns with the answer. The total reward is a weighted sum of format reward (0.1), logical consistency reward (0.3), and answer accuracy reward (0.6). This design preserves the efficiency of the Group Relative Policy Optimization (GRPO) algorithm while dramatically reducing Self‑Contra rates to around 10 %.
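The weighted reward above can be sketched as a simple function. The weights (0.1 format, 0.3 logical consistency, 0.6 accuracy) come from the summary; the function name, the binary reward components, and the judge interface are illustrative assumptions, not the authors' implementation.

```python
def loco_rft_reward(format_ok: bool, judge_consistent: bool, answer_correct: bool) -> float:
    """Total LoCo-RFT reward as a weighted sum of three components.

    format_ok:        whether the output follows the required response format
    judge_consistent: verdict of the external LLM judge (e.g. gpt-oss-20b)
                      on whether the reasoning supports the final answer
    answer_correct:   whether the final answer matches the ground truth
    """
    r_format = 1.0 if format_ok else 0.0
    r_loco = 1.0 if judge_consistent else 0.0
    r_acc = 1.0 if answer_correct else 0.0
    # Weights reported in the paper summary: 0.1 / 0.3 / 0.6.
    return 0.1 * r_format + 0.3 * r_loco + 0.6 * r_acc
```

Note that a response with the correct answer but self-contradictory reasoning caps out at 0.7 under this scheme, so GRPO's group-relative advantage now penalizes Self-Contra outputs relative to consistent ones in the same rollout group.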
Using LoCo‑RFT, the authors fine‑tune a 7 B‑parameter model (Weather‑R1) initialized from Qwen2.5‑VL‑7B‑Instruct. Weather‑R1 achieves 52.9 % average accuracy on WeatherQA, a 9.8‑percentage‑point gain over the baseline, and surpasses the much larger 32 B Qwen2.5‑VL model. It also outperforms supervised fine‑tuning (49.8 %) and standard RFT (51.4 %). On an out‑of‑domain ScienceQA subset focused on weather and climate, Weather‑R1 reaches 86.46 % accuracy, exceeding all baselines.
The study demonstrates that incorporating logical consistency into the reinforcement signal yields VLMs that are both more accurate and more trustworthy for meteorological reasoning. The authors suggest that LoCo‑RFT could be generalized to other high‑risk domains such as medicine or law, and they encourage future work on scaling the approach and testing broader domain applicability.