ASTER: Agentic Scaling with Tool-integrated Extended Reasoning

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv paper.

Reinforcement learning (RL) has emerged as a dominant paradigm for eliciting long-horizon reasoning in Large Language Models (LLMs). However, scaling Tool-Integrated Reasoning (TIR) via RL remains challenging due to interaction collapse: a pathological state where models fail to sustain multi-turn tool usage, instead degenerating into heavy internal reasoning with only trivial, post-hoc code verification. We systematically study three questions: (i) how cold-start SFT induces an agentic, tool-using behavioral prior, (ii) how the interaction density of cold-start trajectories shapes exploration and downstream RL outcomes, and (iii) how the RL interaction budget affects learning dynamics and generalization under varying inference-time budgets. We then introduce ASTER (Agentic Scaling with Tool-integrated Extended Reasoning), a framework that circumvents this collapse through a targeted cold-start strategy prioritizing interaction-dense trajectories. We find that a small expert cold-start set of just 4K interaction-dense trajectories yields the strongest downstream performance, establishing a robust prior that enables superior exploration during extended RL training. Extensive evaluations demonstrate that ASTER-4B achieves state-of-the-art results on competitive mathematical benchmarks, reaching 90.0% on AIME 2025, surpassing leading frontier open-source models, including DeepSeek-V3.2-Exp.


💡 Research Summary

The paper tackles a fundamental challenge in scaling tool‑integrated reasoning (TIR) for large language models (LLMs): the “interaction collapse” phenomenon, where during reinforcement learning (RL) the model abandons multi‑turn tool usage and reverts to heavy internal reasoning with only superficial post‑hoc code verification. The authors systematically investigate three research questions: (i) how supervised fine‑tuning (SFT) as a cold‑start influences the emergence of an agentic, tool‑using behavioral prior; (ii) how the interaction density of cold‑start trajectories shapes exploration and downstream RL performance; and (iii) how the RL interaction budget affects learning dynamics and generalization under varying inference‑time budgets.

To address these questions, they propose ASTER (Agentic Scaling with Tool‑integrated Extended Reasoning), a two‑stage framework. In the first stage, instead of using sparse or synthetically short tool demonstrations (as in prior “Zero‑TIR” or “ReTool” approaches), they construct an interaction‑dense cold‑start dataset. Using GPT‑OSS‑20B, they generate 45 K tool‑augmented solutions for math problems, then curate a high‑quality subset of 4 K trajectories that contain nine or more tool calls. This dataset exhibits a markedly higher proportion of long‑horizon tool interactions (≈17 % of trajectories have ≥5 tool calls) than existing datasets, which are heavily skewed toward 1–2 calls.
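The curation step described above can be sketched as a simple density filter. The turn schema (a `role` field with tool invocations marked `"tool_call"`) and the helper names are hypothetical illustrations, since the paper does not specify its trajectory format:

```python
from collections import Counter

def count_tool_calls(trajectory):
    """Count tool-invocation turns in one trajectory.

    Assumes a hypothetical schema where each turn is a dict with a
    'role' field and tool calls are tagged role == 'tool_call'.
    """
    return sum(1 for turn in trajectory if turn.get("role") == "tool_call")

def filter_interaction_dense(trajectories, min_calls=9):
    """Keep only trajectories with at least `min_calls` tool calls,
    mirroring the paper's curation of its 4K interaction-dense subset."""
    return [t for t in trajectories if count_tool_calls(t) >= min_calls]

# Toy example: three trajectories with 1, 5, and 9 tool calls each.
raw = [
    [{"role": "tool_call"}] * n + [{"role": "assistant"}]
    for n in (1, 5, 9)
]
dense = filter_interaction_dense(raw)            # only the 9-call trajectory survives
density = Counter(count_tool_calls(t) for t in raw)  # distribution of call counts
```

On a real corpus one would also deduplicate problems and verify final answers before filtering; the threshold of nine calls is the paper's reported cutoff, not a tuned constant.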

Training proceeds with SFT on this dense dataset, deliberately prioritizing the formation of a high‑entropy behavioral prior rather than immediate post‑SFT accuracy. The resulting model retains strong exploratory capacity for tool usage. In the second stage, they fine‑tune the model with reinforcement learning using Group Relative Policy Optimization (GRPO). GRPO samples a group of G outputs per query, using relative performance within the group as a baseline, which reduces memory overhead and stabilizes training compared with traditional PPO. They also set a generous tool‑invocation limit (up to 50 calls per trajectory) to allow the policy to explore extended tool sequences.
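The group-relative baseline at the heart of GRPO can be sketched as follows. This is a minimal illustration of the advantage computation only; a full implementation would add PPO-style ratio clipping, a KL penalty against the reference policy, and token-level credit assignment, none of which are shown here:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: for G outputs sampled
    for the same query, normalize each trajectory reward by the
    group's mean and standard deviation. This replaces the learned
    value baseline used by traditional PPO, which is what reduces
    memory overhead (no critic network to train)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# G = 4 sampled solutions for one math query, with binary
# correctness rewards: two solutions verified correct, two wrong.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct solutions receive positive advantages and incorrect ones negative, purely from within-group comparison; queries where all G samples succeed (or all fail) yield near-zero advantages and contribute no gradient signal.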

Empirical evaluation spans four competitive mathematics benchmarks: AIME 2024, AIME 2025, HMMT 2025, and BeyondAIME. Under a 30 K‑token inference budget, ASTER‑4B achieves 85 % accuracy on AIME 2025, already competitive with the far larger Qwen3‑235B‑A22B‑Thinking (235 B parameters). When the inference budget is increased to 90 K tokens, ASTER‑4B reaches 90 % on AIME 2025, surpassing DeepSeek‑V3.2‑Exp (671 B parameters), which scores 89.3 % on the same benchmark, and demonstrating strong test‑time scaling. Across all benchmarks, ASTER‑4B consistently outperforms strong baselines such as ReTool‑32B, DemyAgent‑4B, and POLARIS‑4B‑Preview, despite having orders of magnitude fewer parameters.

The paper’s analysis of the three research questions yields clear insights: (RQ1) Different cold‑start strategies induce distinct behavioral priors; an agentic judge (based on GPT‑5) rates ASTER’s prior highest across planning, code modeling, error handling, and tool efficiency. (RQ2) Interaction density is the key factor; dense trajectories preserve multi‑turn tool usage throughout RL training, whereas sparse priors quickly collapse to minimal tool calls. (RQ3) Larger RL interaction budgets improve learning stability and enable the model to maintain performance when inference‑time tool budgets are constrained, indicating that the dense prior confers robustness to budget variations.

Overall, ASTER demonstrates that a carefully curated, interaction‑dense cold‑start dataset—combined with a group‑based RL optimizer—can overcome interaction collapse and unlock scalable agentic intelligence in LLMs. The work highlights data efficiency: a mere 4 K high‑quality trajectories suffice to endow a 4 B parameter model with capabilities that rival or exceed models with hundreds of billions of parameters. The authors acknowledge that their experiments focus on mathematical reasoning; extending the approach to other domains (e.g., programming, scientific simulation) and automating the generation of interaction‑dense data are promising directions for future research.

