Scaling Medical Reasoning Verification via Tool-Integrated Reinforcement Learning
Large language models have achieved strong performance on medical reasoning benchmarks, yet their deployment in clinical settings demands rigorous verification to ensure factual accuracy. While reward models offer a scalable approach to reasoning-trace verification, existing methods face two limitations: they produce only scalar reward values without explicit justification, and they rely on single-pass retrieval that precludes adaptive knowledge access as verification unfolds. We introduce $\method$, an agentic framework that addresses these limitations by training medical reasoning verifiers to iteratively query external medical corpora during evaluation. Our approach combines tool-augmented verification with an iterative reinforcement learning paradigm that requires only trace-level supervision, alongside an adaptive curriculum mechanism that dynamically adjusts the training data distribution. Across four medical reasoning benchmarks, $\method$ achieves substantial gains over existing methods, improving MedQA accuracy by 23.5% and MedXpertQA accuracy by 32.0% relative to the base generator. Crucially, $\method$ achieves an $\mathbf{8\times}$ reduction in sampling budget compared to prior reward-model baselines. These findings establish that grounding verification in dynamically retrieved evidence offers a principled path toward more reliable medical reasoning systems.
💡 Research Summary
The paper addresses the critical need for reliable verification of multi‑step medical reasoning generated by large language models (LLMs). Existing verification approaches rely on scalar reward models that provide only a single numeric score and use a static retrieval‑augmented generation (RAG) pipeline where a fixed set of documents is fetched once before evaluation. These methods suffer from two major drawbacks: they lack interpretability because they do not expose the evidence behind a judgment, and they cannot adaptively acquire additional knowledge as verification proceeds.
To overcome these limitations, the authors propose Med‑TIV (Medical Tool‑Integrated Verification), an agentic framework in which a verifier model is equipped with tool‑use capabilities. Given a medical question q and a reasoning trace τ produced by a frozen generator, the verifier iteratively performs three actions at each step k: (1) generate a natural‑language analysis rₖ, (2) formulate a search query aₖ, and (3) retrieve documents oₖ from an external medical corpus via a search engine E. The retrieved documents are appended to the verifier's context, allowing it to ground subsequent analysis in concrete evidence. The process continues until the verifier emits a final judgment ℓ (error‑free or contains errors) wrapped in a designated answer tag.
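The three-action loop described above can be sketched as follows. The function names and the callable interfaces (`analyze`, `formulate_query`, `retrieve`) are illustrative stand-ins, not the paper's actual API; the real verifier is a single LLM whose tagged output interleaves all three roles.

```python
def verify(question, trace, analyze, formulate_query, retrieve, max_steps=10):
    """Iteratively verify a reasoning trace against an external corpus.

    analyze(context)        -> (judgment or None, analysis text r_k)
    formulate_query(context) -> search query a_k
    retrieve(query)          -> retrieved documents o_k
    """
    context = [("question", question), ("trace", trace)]
    for _ in range(max_steps):
        judgment, analysis = analyze(context)   # natural-language analysis r_k
        context.append(("analysis", analysis))
        if judgment is not None:                # final judgment ell emitted
            return judgment                     # "error-free" or "contains errors"
        query = formulate_query(context)        # search query a_k
        docs = retrieve(query)                  # documents o_k from engine E
        context.append(("evidence", docs))      # ground the next analysis step
    return "contains errors"                    # conservative fallback at budget
```

The key design point is that retrieved evidence is fed back into the context, so each subsequent query can depend on what earlier retrievals revealed, unlike a one-shot RAG pipeline.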
Training relies on reinforcement learning (RL) with only trace‑level correctness labels (whether the whole reasoning trace is correct). No step‑level expert annotations are needed. The RL algorithm is Dr‑GRPO, a variant of GRPO (itself derived from PPO). Two complementary reward components are defined: (i) Correctness Reward (R_c), which is 1 if the verifier's final judgment matches the ground‑truth label and 0 otherwise; (ii) Format Reward (R_f), which is 1 only if the output respects the tag structure and does not exceed a maximum of ten search calls.
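A minimal sketch of the two reward components, assuming binary labels and a tag-based output format (the exact tag schema and the ten-call budget interpretation are assumptions based on the description above):

```python
def correctness_reward(predicted_label: str, gold_label: str) -> float:
    """R_c: 1 if the verifier's final judgment matches the trace-level
    ground-truth label, 0 otherwise."""
    return 1.0 if predicted_label == gold_label else 0.0

def format_reward(output_tags, num_search_calls: int, max_calls: int = 10) -> float:
    """R_f: 1 only if the required tags are present and the search budget
    is respected. The required-tag list here is illustrative."""
    required = ("analysis", "answer")
    well_formed = all(tag in output_tags for tag in required)
    return 1.0 if well_formed and num_search_calls <= max_calls else 0.0
```

Because both rewards are trace-level and binary, no per-step credit assignment annotations are needed; the policy gradient distributes credit across the whole trajectory.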
A key innovation is an adaptive curriculum that dynamically filters training instances each iteration. For each sampled instance, the current policy generates G verification trajectories and computes their rewards. Instances where all G rewards are identical (all successes or all failures) are discarded because they provide little gradient signal. Only “boundary” cases with non‑zero reward variance are retained, ensuring the policy learns from examples where it is uncertain. The curriculum size is kept constant (20 K instances per iteration) by resampling until the desired number of informative examples is reached. This self‑adjusting difficulty schedule eliminates the need for manually designed curricula and aligns training difficulty with the model’s evolving competence.
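The variance-based filtering step can be sketched as below. `rollout_fn` is a hypothetical stand-in for sampling G verification trajectories with the current policy and scoring them; the real system does this with LLM rollouts, not a Python callable.

```python
def is_informative(rewards) -> bool:
    """Keep an instance only if its G rollout rewards disagree (non-zero
    variance): all-success or all-failure groups yield no gradient signal."""
    return len(set(rewards)) > 1

def filter_curriculum(instance_stream, rollout_fn, target_size):
    """Scan candidate instances, keeping boundary cases until target_size
    informative examples are collected (20K per iteration in the paper)."""
    kept = []
    for inst in instance_stream:
        if len(kept) >= target_size:
            break
        if is_informative(rollout_fn(inst)):
            kept.append(inst)
    return kept
```

Discarding zero-variance groups is what makes the curriculum self-adjusting: as the policy improves, previously hard instances saturate to all-success and drop out, so the retained set tracks the current decision boundary.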
The entire training loop consists solely of RL; there is no alternating supervised fine‑tuning (SFT) or rejection sampling stage. This self‑bootstrapping approach follows the RL‑Zero paradigm, allowing the verifier to improve its own verification abilities without dense expert demonstrations.
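For context, the advantage computation that distinguishes Dr‑GRPO from standard GRPO can be sketched as follows (this reflects the published Dr. GRPO formulation, not code from this paper): each reward is centered on its group mean, with no division by the group standard deviation and no per-sequence length normalization in the loss.

```python
def dr_grpo_advantages(group_rewards):
    """Dr. GRPO-style advantages for one group of G sampled trajectories:
    center on the group mean only (no std normalization)."""
    mean = sum(group_rewards) / len(group_rewards)
    return [r - mean for r in group_rewards]
```

Note that a zero-variance group (all rewards equal) yields all-zero advantages, which is exactly why the adaptive curriculum above discards such instances.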
Experiments were conducted on four medical reasoning benchmarks: MedQA, MedXpertQA, PubMedQA, and a diagnostic scenario set. A 7‑billion‑parameter generator model was kept frozen, and Med‑TIV was attached as a plug‑in verifier. During inference, multiple candidate traces were sampled (Best‑of‑N) and scored by the verifier; final answers were selected either by highest score or by verification‑based majority voting. Relative to the frozen base generator, Med‑TIV improved MedQA accuracy by 23.5% and MedXpertQA accuracy by 32.0%. Moreover, it required 8× fewer sampled traces than prior reward‑model baselines to reach comparable accuracy, demonstrating a substantial gain in sampling efficiency.
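The two inference-time selection strategies can be sketched as below; the function names are illustrative, and the fallback behavior when no trace passes verification is an assumption.

```python
from collections import Counter

def best_of_n(answers, scores):
    """Return the answer of the highest-scoring sampled trace."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return answers[best]

def verified_majority_vote(answers, verdicts):
    """Majority vote restricted to traces the verifier judged error-free;
    fall back to a plain majority vote if no trace passes (assumed)."""
    passed = [a for a, ok in zip(answers, verdicts) if ok]
    pool = passed if passed else answers
    return Counter(pool).most_common(1)[0][0]
```

Verification-based voting is what drives the sampling-efficiency result: filtering out traces flagged as erroneous lets a small N match the accuracy that an unverified vote only reaches with a much larger sampling budget.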
Ablation studies confirmed the importance of each component: removing tool use (reverting to a pure text‑based verifier) caused a steep drop in performance; disabling the format reward led to malformed outputs and higher error rates; and training without the adaptive curriculum resulted in slower convergence and lower final accuracy.
The authors acknowledge limitations: the verifier’s performance heavily depends on the quality and coverage of the external medical corpus; current corpora are primarily English‑language guidelines and may miss region‑specific knowledge. The binary reward design also limits the ability to capture nuanced logical faults or varying evidence reliability. Future work is outlined to incorporate multimodal evidence (e.g., medical images, tables), develop richer continuous reward signals (partial credit for partially correct reasoning), and blend human feedback with RL to further enhance robustness and generalization.
In summary, Med‑TIV introduces a novel tool‑integrated reinforcement‑learning framework that simultaneously improves factual grounding, interpretability, and sampling efficiency for medical reasoning verification. By enabling dynamic evidence retrieval and learning solely from trace‑level supervision, it offers a practical pathway toward trustworthy LLM‑assisted clinical decision support systems.