Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Agentic Reinforcement Learning (ARL) focuses on training large language models (LLMs) to interleave reasoning with external tool execution to solve complex tasks. Most existing ARL methods train a single shared set of model parameters to support both reasoning and tool-use behaviors, implicitly assuming that joint training improves overall agent performance. Despite its widespread adoption, this assumption has rarely been examined empirically. In this paper, we systematically investigate it by introducing the Linear Effect Attribution System (LEAS), which provides quantitative evidence of interference between reasoning and tool-use behaviors. Through an in-depth analysis, we show that these two capabilities often induce misaligned gradient directions, producing training interference that undermines the effectiveness of joint optimization and challenges the prevailing ARL paradigm. To address this issue, we propose Disentangled Action Reasoning Tuning (DART), a simple and efficient framework that explicitly decouples parameter updates for reasoning and tool use via separate low-rank adaptation modules. Experimental results show that DART consistently outperforms baseline methods, with an average improvement of 6.35 percentage points, and, while using a single model, achieves performance comparable to multi-agent systems that explicitly separate tool use and reasoning.


💡 Research Summary

Agentic Reinforcement Learning (ARL) aims to turn large language models (LLMs) into autonomous agents that can both reason and invoke external tools. The prevailing design trains a single set of model parameters to handle both capabilities, assuming that joint optimization will improve overall performance. This paper challenges that assumption through systematic empirical analysis.

The authors introduce the Linear Effect Attribution System (LEAS), a diagnostic framework that isolates the contributions of reasoning and tool‑use and quantifies their interaction. By constructing six model variants—base, reasoning‑only, tool‑only, unified (jointly trained), tool‑hybrid, and reasoning‑hybrid—and encoding each variant with a binary capability vector, they formulate a linear system z = Xλ where λ contains main‑effect and interaction coefficients for each question. Solving this system yields interaction coefficients λ₍₂₃₎; negative values indicate interference between reasoning and tool‑use. Across two large‑scale QA benchmarks (Natural Questions and HotpotQA) and multiple model sizes (Qwen2.5‑3B and 7B), the majority of λ₍₂₃₎ are negative, revealing a “seesaw” phenomenon: improving tool performance often degrades reasoning, and vice versa.
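The attribution idea can be sketched in a few lines: encode each variant as a binary capability vector, build the design matrix X with main-effect and interaction columns, and solve the least-squares system for λ. This is a minimal illustration, not the paper's implementation; in particular, the binary codes assigned to the two hybrid variants below are assumptions, and `scores` is toy per-question data.

```python
import numpy as np

def leas_interaction(scores):
    """Fit z = X @ lam for one question and return the interaction coefficient.

    scores: dict mapping variant name -> score on this question.
    Columns of X: [intercept, reasoning main effect, tool main effect,
    reasoning x tool interaction]. lam[3] plays the role of lambda_23;
    a negative value indicates interference between the two capabilities.
    """
    # Hypothetical binary capability codes (reasoning, tool) for the six
    # variants; the paper's exact encoding of the hybrids is an assumption.
    codes = {
        "base": (0, 0), "reasoning-only": (1, 0), "tool-only": (0, 1),
        "unified": (1, 1), "tool-hybrid": (0, 1), "reasoning-hybrid": (1, 0),
    }
    X = np.array([[1, r, t, r * t] for r, t in codes.values()], dtype=float)
    z = np.array([scores[name] for name in codes], dtype=float)
    lam, *_ = np.linalg.lstsq(X, z, rcond=None)
    return lam[3]

# Toy example: each capability alone solves the question, but the jointly
# trained model gains nothing beyond that -> negative interaction.
scores = {"base": 0.0, "reasoning-only": 1.0, "tool-only": 1.0,
          "unified": 1.0, "tool-hybrid": 1.0, "reasoning-hybrid": 1.0}
print(leas_interaction(scores))  # negative: the capabilities interfere
```

Solving this per question and inspecting the sign distribution of the interaction term is what reveals the "seesaw" pattern described above.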

To uncover the root cause, the paper examines gradient dynamics. By masking gradients at the token level (separating reasoning tokens from tool tokens), they compute the cosine similarity between the gradients of the two loss components. The similarity is consistently and strongly negative, showing that the two objectives push the shared backbone in opposite directions. This gradient conflict becomes more pronounced in larger models, suggesting that the limited shared parameter space cannot simultaneously accommodate both skill sets without compromise.
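The diagnostic amounts to splitting the per-token loss by token type and comparing gradient directions on the same parameters. A minimal PyTorch sketch, with a single linear layer standing in for the LLM backbone and a synthetic quadratic loss in place of the actual RL objective (both assumptions for illustration):

```python
import torch

torch.manual_seed(0)

# Toy shared "backbone": one linear layer standing in for the LLM.
model = torch.nn.Linear(8, 8)
tokens = torch.randn(6, 8)                                    # 6 token embeddings
is_tool = torch.tensor([0, 0, 1, 1, 0, 1], dtype=torch.bool)  # token-type mask

def flat_grad(loss):
    """Gradient of `loss` w.r.t. all model parameters, flattened to one vector."""
    grads = torch.autograd.grad(loss, list(model.parameters()), retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

out = model(tokens)
per_token_loss = out.pow(2).mean(dim=1)  # stand-in for a per-token training loss

# Mask the loss by token type, then compare the two gradient directions.
g_reason = flat_grad(per_token_loss[~is_tool].sum())
g_tool = flat_grad(per_token_loss[is_tool].sum())
cos = torch.nn.functional.cosine_similarity(g_reason, g_tool, dim=0)
print(f"cosine similarity: {cos.item():.3f}")
```

In the paper's setting, this similarity computed on the real losses is what comes out strongly negative, i.e., the reasoning update and the tool update point in conflicting directions in parameter space; the toy loss here only demonstrates the measurement, not that result.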

In response, the authors propose Disentangled Action‑Reasoning Tuning (DART). DART freezes the pretrained backbone and attaches two independent low‑rank adaptation (LoRA) modules: one dedicated to reasoning tokens and one to tool‑use tokens. A router ℓ(t) directs each token’s gradient to the appropriate LoRA, ensuring that reasoning updates never affect the tool LoRA and vice‑versa. Because LoRA adds only a small set of trainable matrices (rank r ≪ model dimension), DART incurs negligible parameter overhead while fully eliminating gradient interference.
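The disentangled update rule can be illustrated with a single linear layer carrying two LoRA adapters and a per-token route. This is a sketch of the idea, not the paper's code: the class name, the boolean routing interface, and the toy loss are all assumptions.

```python
import torch

torch.manual_seed(0)

class DualLoRALinear(torch.nn.Module):
    """Frozen linear layer plus two independent LoRA adapters, routed per token.

    Sketch of DART's design: reasoning tokens only ever update the reasoning
    adapter, tool tokens only the tool adapter, so neither objective's
    gradient touches the other's parameters or the frozen backbone.
    """
    def __init__(self, d_in, d_out, rank=4):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # frozen backbone
        self.base.bias.requires_grad_(False)

        def lora():  # standard LoRA init: A random, B zero
            A = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
            B = torch.nn.Parameter(torch.zeros(d_out, rank))
            return torch.nn.ParameterList([A, B])

        self.reason, self.tool = lora(), lora()

    def forward(self, x, is_tool):
        # x: (seq, d_in); is_tool: (seq,) bool, the router label l(t) per token
        def delta(p):
            A, B = p
            return x @ A.t() @ B.t()  # low-rank update B @ A applied to x
        gate = is_tool.unsqueeze(-1).float()
        return self.base(x) + (1 - gate) * delta(self.reason) + gate * delta(self.tool)

layer = DualLoRALinear(8, 8)
x = torch.randn(5, 8)
is_tool = torch.tensor([0, 1, 1, 0, 1], dtype=torch.bool)
loss = layer(x, is_tool)[~is_tool].pow(2).sum()  # loss on reasoning tokens only
loss.backward()
# The gate zeroes the tool path at reasoning positions, so a reasoning-only
# loss leaves the tool adapter's gradient identically zero.
print(torch.all(layer.tool[0].grad == 0).item())
```

Because only the two rank-r adapter pairs are trainable, the trainable-parameter count is a small fraction of the backbone's, which is how DART avoids interference at negligible overhead.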

Experimental evaluation on seven tool‑augmented QA datasets demonstrates that DART consistently outperforms standard joint‑training baselines, achieving an average exact‑match improvement of 6.35 percentage points. Moreover, DART’s performance is on par with multi‑agent systems that explicitly separate reasoning and tool modules into distinct models, despite using a single model. Ablation studies confirm that (1) removing the token‑level routing (single LoRA) collapses performance, and (2) disabling either LoRA module harms the corresponding capability, underscoring the necessity of the disentangled design.

The paper concludes that reasoning and tool‑use are inherently competing objectives when trained on a shared backbone, and that simple, low‑cost architectural changes can resolve this competition. DART offers a practical path forward for ARL research, enabling more reliable and scalable agent behavior without the complexity of managing multiple models. Future work may explore extending the disentanglement to additional skills (e.g., planning, memory) or integrating dynamic routing mechanisms for even finer‑grained specialization.

