THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent pipeline for constructing high-quality datasets of tool-integrated reasoning paths that align with the policy and generalize well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both episode-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer’s correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
💡 Research Summary
Paper Overview
The paper addresses a persistent limitation of large language models (LLMs): while they have become adept at natural‑language reasoning, they still falter on high‑precision tasks such as exact numerical computation, symbolic manipulation, and formal proof. Recent work has shown that integrating external code‑based tools (e.g., Python interpreters, symbolic libraries) can compensate for this weakness, but three major challenges remain: (1) constructing high‑quality tool‑integrated reasoning (TIR) data that matches the policy model’s style, (2) performing fine‑grained optimization beyond coarse episode‑level reinforcement learning (RL), and (3) exploiting immediate tool feedback during inference.
Key Contributions
- TIRGen – a Generator‑Refiner data pipeline
- Generator produces natural‑language reasoning steps (rₜ) up to a fixed length.
- Refiner examines each step, extracts any sub‑task that can be solved with code, and converts it into an executable Python snippet (aₜ) while preserving the surrounding logical narrative (rₜ^logic).
- The snippet is executed in a sandbox interpreter S, producing an observation oₜ that is fed back to the Generator, which continues reasoning from the updated context.
- This loop yields full trajectories τ = (r₁, a₁, o₁, …, rₙ).
- Multi‑stage filtering (format consistency, code quality, difficulty & call‑round balancing) produces a clean, policy‑aligned dataset D_SFT. Because the Refiner never sees the whole problem statement or final answer, the resulting data stay in‑distribution with the Generator’s intrinsic style, reducing the “distribution shift” that plagues data synthesized by large external models (e.g., GPT‑4o).
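The Generator‑Refiner loop above can be sketched in a few lines. This is an illustrative skeleton, not the paper's actual implementation: `generate_step`, `refine_step`, and `run_sandbox` are hypothetical stand‑ins for the Generator, the Refiner, and the sandbox interpreter S.

```python
def tirgen_trajectory(problem, generate_step, refine_step, run_sandbox, max_rounds=8):
    """Build one tool-integrated reasoning trajectory tau = (r1, a1, o1, ..., rn).

    generate_step(context) -> natural-language reasoning step r_t
    refine_step(r_t)       -> None (no code sub-task) or (r_logic, a_t)
    run_sandbox(a_t)       -> observation o_t from executing the snippet
    """
    context = problem
    trajectory = []
    for _ in range(max_rounds):
        r_t = generate_step(context)
        refined = refine_step(r_t)
        if refined is None:
            # No executable sub-task found; keep the step as pure reasoning.
            trajectory.append((r_t, None, None))
            break
        r_logic, a_t = refined            # preserved narrative + Python snippet
        o_t = run_sandbox(a_t)            # execute in sandbox, get observation
        trajectory.append((r_logic, a_t, o_t))
        # Feed the observation back so the Generator continues from updated context.
        context = context + r_logic + a_t + str(o_t)
    return trajectory
```

The key design point reflected here is that the Refiner only sees the single step `r_t`, never the full problem or answer, which keeps the refined data close to the Generator's own distribution.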
- Hierarchical Reinforcement Learning
- Episode‑level RL: Uses GRPO (Group Relative Policy Optimization) to maximize the binary reward of final answer correctness. This coarse optimization improves the overall problem‑solving strategy.
- Step‑level RL: Introduces an auxiliary reward for each tool call based on execution success. Empirical analysis shows a strong correlation between intermediate tool success and final answer correctness; leveraging this signal mitigates the sparse‑reward problem typical of pure episode‑level RL.
- The combined objective jointly updates the policy π_θ to be better at both selecting when to call a tool and generating correct code for each call.
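A minimal sketch of how the two reward levels might combine, assuming a simple weighted sum. The weight `alpha` and the exact shaping are illustrative assumptions; the paper's precise reward formulation may differ.

```python
def hierarchical_reward(final_correct, tool_results, alpha=0.5):
    """Combine episode-level and step-level signals into one scalar reward.

    final_correct : bool        -- did the final answer match the ground truth?
    tool_results  : list[bool]  -- execution success of each tool call in the episode
    alpha         : float       -- illustrative weight on the step-level term
    """
    # Episode-level: binary reward for final answer correctness.
    episode_r = 1.0 if final_correct else 0.0
    # Step-level: fraction of tool calls that executed successfully.
    # This denser signal mitigates the sparse-reward problem.
    step_r = sum(1.0 for ok in tool_results if ok) / max(len(tool_results), 1)
    return episode_r + alpha * step_r
```

The step‑level term is what exploits the paper's key observation: intermediate tool‑call success correlates strongly with final answer correctness, so rewarding it provides useful gradient signal even when the final answer is wrong.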
- Self‑Correction during Inference
- When a code execution fails, the model backtracks to the most recent reasoning step, re‑examines the failure, and generates an alternative code snippet or reasoning path.
- This dynamic revision uses the immediate observation (error message, traceback) as a corrective signal, turning a single‑pass inference pipeline into an iterative, error‑aware process.
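The inference‑time revision loop can be sketched as below. `run_sandbox` and `revise` are hypothetical stand‑ins for the executor and the model's regeneration step; the retry budget is an assumed parameter, not a value from the paper.

```python
def solve_with_self_correction(step_code, run_sandbox, revise, max_retries=3):
    """Execute a tool call; on failure, feed the error back and regenerate.

    run_sandbox(code) -> (ok: bool, result)  -- result is the output on success,
                                                or the error message / traceback on failure
    revise(code, err) -> new code snippet conditioned on the error feedback
    """
    code = step_code
    for _ in range(max_retries):
        ok, result = run_sandbox(code)
        if ok:
            return result
        # Use the immediate observation (error message, traceback) as a
        # corrective signal and regenerate the snippet.
        code = revise(code, result)
    return None  # give up after exhausting the retry budget
```

This turns single‑pass decoding into an iterative, error‑aware process: the model only backtracks when the sandbox reports a failure, so successful paths pay no extra cost.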
Experimental Setup
Benchmarks: Math500, AIME 2024/2025, AMC, Minerva Math, OlympiadBench (mathematics) and HumanEval, MBPP, LiveCodeBench (code generation).
Model scale: Primary experiments on 7‑billion‑parameter models (both reasoning‑oriented and standard language models).
Baselines: Toolformer, Aimo‑2, recent tool‑integrated RL methods, and pure chain‑of‑thought (CoT) models.
Results
- On Math500, THOR improves accuracy by ~5.2 percentage points over the strongest baseline.
- On AIME 2025, THOR achieves 81.3 % accuracy, surpassing the previous state‑of‑the‑art by 7 percentage points.
- Code benchmarks see absolute gains of 3–5 percentage points, confirming that step‑level RL indeed sharpens code generation.
- The self‑correction mechanism yields a 10 % relative boost on problems requiring multiple tool calls, at the cost of a modest 1.2× increase in average inference latency.
Ablation Studies
- Removing step‑level rewards drops performance by ~3 percentage points on Math500, indicating the importance of intermediate feedback.
- Replacing TIRGen with a naïve GPT‑4o‑generated dataset reduces accuracy by ~4 percentage points, highlighting the benefit of policy‑aligned data.
- Disabling self‑correction leads to a noticeable drop on multi‑step problems, confirming its role in robustness.
Analysis & Discussion
The paper convincingly demonstrates that (a) high‑quality, policy‑aligned TIR data can be generated without massive external models, (b) hierarchical RL that rewards intermediate tool success alleviates sparse‑reward issues and directly improves code generation, and (c) leveraging tool feedback at inference time transforms the LLM from a static predictor into an interactive problem‑solver. The approach is model‑agnostic, working for both reasoning‑specialized and generic LLMs, and the computational overhead remains practical for real‑world deployment.
Limitations & Future Work
- The current implementation focuses on Python‑based numerical/symbolic tools; extending to theorem provers, graph libraries, or domain‑specific APIs will require additional Refiner capabilities.
- The self‑correction loop may still struggle with deeply nested failures where the correct alternative is non‑trivial; integrating human‑in‑the‑loop feedback or more sophisticated search strategies could help.
- Scaling to larger models (e.g., 70B) and evaluating on open‑ended scientific reasoning tasks are natural next steps.
Conclusion
THOR introduces a comprehensive framework—TIRGen for data creation, hierarchical RL for fine‑grained policy refinement, and self‑correction for inference—that collectively pushes LLMs toward reliable, tool‑augmented mathematical reasoning. The reported gains across a wide suite of benchmarks set a new state‑of‑the‑art for models of comparable size and suggest a promising direction for future research on interactive, tool‑aware language models.