RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents
Multi-turn tool calling is challenging for Large Language Models (LLMs) because rewards are sparse and exploration is expensive. A common recipe, SFT followed by GRPO, can stall when within-group reward variation is low (e.g., all rollouts in a group receive a reward of 0, or all receive 1), making the group-normalized advantage uninformative and yielding vanishing updates. To address this problem, we propose RC-GRPO (Reward-Conditioned Group Relative Policy Optimization), which treats exploration as a controllable steering problem via discrete reward tokens. We first fine-tune a Reward-Conditioned Trajectory Policy (RCTP) on mixed-quality trajectories with reward-goal special tokens (e.g., <|high_reward|>, <|low_reward|>) injected into the prompts, enabling the model to generate trajectories of distinct quality on demand. Then, during RL, we sample diverse reward tokens within each GRPO group and condition rollouts on the sampled tokens, which improves within-group diversity and restores informative advantages. On the Berkeley Function Calling Leaderboard v4 (BFCLv4) multi-turn benchmark, our method consistently outperforms baselines, and with Qwen-2.5-7B-Instruct it even surpasses all closed-source API models.
💡 Research Summary
The paper tackles the difficulty of multi‑turn tool‑calling with large language models (LLMs), where rewards are extremely sparse and exploration is costly. Existing pipelines that first perform supervised fine‑tuning (SFT) and then apply Group‑Relative Policy Optimization (GRPO) often stall because the SFT policy becomes highly peaked: rollouts generated from the same prompt are almost identical, leading to near‑zero variance of rewards within each GRPO group. Since GRPO normalizes advantages by the group’s mean and standard deviation, a vanishing standard deviation makes the advantage term degenerate and policy updates disappear – a phenomenon the authors call “gradient collapse.”
To solve this, the authors introduce Reward‑Conditioned GRPO (RC‑GRPO), a two‑stage method that deliberately injects diversity into each GRPO group via discrete reward tokens.
Stage 1 – Reward‑Conditioned Trajectory Policy (RCTP).
A base LLM is fine‑tuned on a mixed‑quality dataset where each trajectory is labeled with a special token indicating its expected quality: <|high_reward|> for successful (binary reward = 1) trajectories and <|low_reward|> for failures (reward = 0). The model learns a conditional distribution πθ(aₜ | hₜ, r), where r is the reward token. This step equips the model with the ability to deliberately generate high‑quality or low‑quality behavior on demand.
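A minimal sketch of the Stage-1 data preparation: the reward-token names follow the paper, but the exact prompt template and field names are assumptions for illustration.

```python
# Stage 1 (RCTP) data preparation sketch: prepend a reward-goal token
# to each trajectory's prompt so supervised fine-tuning teaches the
# conditional policy pi_theta(a_t | h_t, r).
HIGH, LOW = "<|high_reward|>", "<|low_reward|>"

def tag_trajectory(system_prompt: str, trajectory: str, reward: int) -> dict:
    """Label one trajectory with its quality token.

    reward is the binary trajectory outcome: 1 = success -> <|high_reward|>,
    0 = failure -> <|low_reward|>. The prompt/completion layout here is a
    hypothetical SFT format, not the paper's exact template.
    """
    token = HIGH if reward == 1 else LOW
    return {
        "prompt": f"{token}\n{system_prompt}",
        "completion": trajectory,
    }

example = tag_trajectory(
    "You may call the provided tools to answer the user.",
    "<tool_call>get_weather(city='SF')</tool_call>",
    reward=1,
)
```

Because failed trajectories are kept (with the low-reward token) rather than filtered out, the mixed-quality dataset teaches the model both behaviors, which Stage 2 then exploits.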
Stage 2 – Reward‑Conditioned GRPO.
Starting from the RCTP as a reference policy πref, the algorithm performs GRPO but modifies the sampling process: for each prompt, a group of G rollouts is generated, and each rollout is first assigned a reward token rⱼ sampled from a fixed distribution Psample(r) (probability p for <|high_reward|> matching the proportion of successful trajectories in the RCTP data). The rollout is then conditioned on rⱼ, producing a mixture of high‑ and low‑quality trajectories within the same group. This guarantees non‑zero within‑group reward variance σg, which in turn yields informative, non‑degenerate advantages Aj = (R(τⱼ) − μg)/σg. The loss follows the PPO‑style clipped objective used in GRPO, with an additional KL‑penalty that keeps the learned policy close to πref.
Experiments.
The method is evaluated on the Berkeley Function Calling Leaderboard v4 (BFCLv4) multi‑turn benchmark using two open‑source models: LLaMA‑3.1‑8B‑Instruct and Qwen‑2.5‑7B‑Instruct. Compared to the standard SFT + GRPO baseline, RC‑GRPO improves average success rates by 4–7 percentage points. Notably, Qwen‑2.5‑7B‑Instruct with RC‑GRPO outperforms all closed‑source API agents evaluated on the same leaderboard. Training‑dynamics analysis shows that RC‑GRPO increases the correlation between entropy and reward, and dramatically widens the distribution of group‑normalized advantages, confirming that the injected token‑based variance is the key driver of improvement. Ablations demonstrate that merely increasing temperature or entropy does not replicate the effect.
Theoretical Insight.
The authors provide a variance‑based analysis showing that sampling reward tokens with probability p guarantees a minimum within‑group variance of p(1 − p), ensuring that the denominator in the advantage calculation never collapses. They also discuss how the KL‑regularization and clipping parameters affect convergence speed and sample efficiency.
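The p(1 − p) bound can be checked numerically under an idealized assumption (not stated this strongly in the summary): conditioning perfectly controls the outcome, so high-token rollouts earn reward 1 and low-token rollouts earn reward 0.

```python
def group_reward_variance(G: int, num_high: int) -> float:
    """Within-group reward variance when num_high of G rollouts succeed
    (reward 1) and the rest fail (reward 0), i.e. the idealized case
    where the reward token fully determines the binary outcome."""
    rewards = [1] * num_high + [0] * (G - num_high)
    mu = sum(rewards) / G
    return sum((r - mu) ** 2 for r in rewards) / G

# When a fraction p of the group is conditioned on <|high_reward|>,
# the variance equals p * (1 - p), maximized at p = 0.5.
p = 0.25
variance = group_reward_variance(8, int(8 * p))
```

In practice conditioning is imperfect, so p(1 − p) acts as a design target rather than an exact guarantee, but the check illustrates why the advantage denominator stays bounded away from zero.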
Conclusion.
RC‑GRPO resolves the “perfect SFT kills exploration” paradox by turning reward variance into a controllable design variable rather than a stochastic by‑product. By conditioning on discrete reward tokens, the method restores informative gradients to GRPO without needing a separate value network, preserving GRPO’s memory efficiency while achieving state‑of‑the‑art performance on a challenging multi‑turn tool‑calling benchmark. Future work may explore continuous reward tokens, hierarchical token schemes, and application to other POMDP‑style agent tasks.