TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a “Double Homogenization Dilemma.” This manifests as (1) process homogenization, where the thinking, reasoning, and tool use involved in generation are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards degrade intra-group advantage estimation in methods like Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively.
💡 Research Summary
The paper addresses a fundamental limitation of current reinforcement‑learning (RL) approaches for tool‑integrated, multi‑turn reasoning with large language models (LLMs). Existing methods typically assign a sparse binary reward only at the final turn (outcome‑level reward). The authors identify two intertwined problems they call the “Double Homogenization Dilemma.” First, process‑level homogenization: trajectories that differ widely in intermediate reasoning or evidence retrieval receive identical rewards, because a correct answer is required for any positive reward. This masks useful learning signals from partial successes such as retrieving the correct evidence but failing to synthesize it. Second, intra‑group homogenization: Group Relative Policy Optimization (GRPO) normalizes rewards within a batch of trajectories sampled for the same question, but binary rewards cause many groups to have zero variance (all‑wrong or all‑correct). When the standard deviation is zero, the advantage estimates vanish and the policy receives no gradient signal, stalling training.
To break this dilemma, the authors propose Turn‑level Stage‑aware Policy Optimization (TSPO). The core idea is the First‑Occurrence Latent Reward (FOLR) mechanism. By analyzing a large set of test trajectories, they show that the presence of the gold answer in any intermediate retrieval feedback is highly predictive of final correctness (χ² test p < 0.001). Consequently, they define the first turn t* at which the gold answer appears in the reasoning trace. For every turn k they assign: a full reward of 1 if the final answer is correct; a partial reward α (0 ≤ α ≤ 1) for all turns up to and including t* when the final answer is wrong; and 0 thereafter. This yields dense, turn‑level rewards that give positive credit to “Near‑Miss” trajectories (those that retrieve the correct evidence but synthesize incorrectly).
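The FOLR rule above can be sketched as a small function. This is an illustrative reconstruction from the summary, not the authors' code: the function name, the case-insensitive substring check for the gold answer, and the input format (one string per turn) are all assumptions.

```python
def folr_rewards(turns, gold_answer, final_correct, alpha=1.0):
    """First-Occurrence Latent Reward (FOLR) sketch.

    turns: list of strings, each the retrieval/reasoning trace of one turn.
    gold_answer: the ground-truth answer string.
    final_correct: whether the trajectory's final answer is correct.
    alpha: partial-reward coefficient, 0 <= alpha <= 1.

    Returns one reward per turn.
    """
    T = len(turns)
    # Case 1: correct final answer -> full reward of 1 at every turn.
    if final_correct:
        return [1.0] * T
    # Find t*, the first turn whose trace contains the gold answer
    # (a simple case-insensitive substring match, as an assumption).
    t_star = next(
        (k for k, trace in enumerate(turns)
         if gold_answer.lower() in trace.lower()),
        None,
    )
    # Case 2: gold answer never surfaced -> no latent signal, all zeros.
    if t_star is None:
        return [0.0] * T
    # Case 3: "Near-Miss" trajectory -> partial reward alpha for all
    # turns up to and including t*, and 0 thereafter.
    return [alpha if k <= t_star else 0.0 for k in range(T)]
```

A Near-Miss trajectory that retrieves the gold answer at turn 2 but answers incorrectly would thus receive `[alpha, alpha, alpha, 0, …]` instead of an all-zero reward vector.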
TSPO reformulates the problem as a turn‑level Markov Decision Process. The state encodes the dialogue history and retrieved evidence; the action is either a tool query or a synthesis step. The per‑turn reward r_i,k follows the FOLR rule. Within each question‑specific group of G sampled trajectories, the authors compute a group‑relative advantage for each turn:
Â_i,k = (r_i,k − mean_k) / (std_k + ε)
where mean_k and std_k are the mean and standard deviation of the turn‑k rewards across the group. This per‑turn normalization restores variance even in groups that would otherwise be all‑wrong under binary rewards. The policy update uses a PPO‑style clipped surrogate loss with a KL‑penalty, applied separately at each turn. Variable‑length trajectories are padded to the maximum turn length in the group, and padded positions are masked so they do not affect gradient computation.
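The per-turn normalization and padding-mask logic described above can be sketched as follows. This is a minimal NumPy reconstruction under stated assumptions: the function name, the use of the population standard deviation, and the convention of computing turn-k statistics only over trajectories that actually reach turn k are illustrative choices, not details confirmed by the paper.

```python
import numpy as np

def turnwise_advantages(group_rewards, eps=1e-6):
    """Group-relative, per-turn advantage estimation with padding masks.

    group_rewards: list of G per-trajectory reward lists (variable length),
                   e.g. the outputs of a FOLR-style reward rule.
    Returns (advantages, mask), both of shape (G, T_max); padded positions
    have mask 0 and advantage 0, so they contribute no gradient.
    """
    G = len(group_rewards)
    T_max = max(len(r) for r in group_rewards)
    # Pad variable-length trajectories to the group's maximum turn count.
    R = np.zeros((G, T_max))
    mask = np.zeros((G, T_max))
    for i, r in enumerate(group_rewards):
        R[i, :len(r)] = r
        mask[i, :len(r)] = 1.0
    adv = np.zeros_like(R)
    for k in range(T_max):
        valid = mask[:, k] > 0  # trajectories that reach turn k
        if not valid.any():
            continue
        mean_k = R[valid, k].mean()
        std_k = R[valid, k].std()
        # Â_i,k = (r_i,k - mean_k) / (std_k + eps)
        adv[valid, k] = (R[valid, k] - mean_k) / (std_k + eps)
    return adv, mask
```

With binary trajectory-level rewards, an all-wrong group yields zero advantage everywhere; under FOLR-style turn-level rewards, a Near-Miss trajectory in the same group produces non-zero per-turn advantages, which is exactly the variance-restoring effect the method targets.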
The method requires no external reward model, human annotations, or additional inference passes; it operates solely on the standard (question, gold answer) pairs and the model’s own retrieval/reasoning trace.
Empirical evaluation is conducted on seven open‑domain QA datasets using Qwen2.5‑3B and Qwen2.5‑7B models. Compared with strong baselines—standard GRPO, PPO, and RLHF‑style fine‑tuning—TSPO achieves average accuracy improvements of 24% for the 3B model and 13.6% for the 7B model. The gains are especially pronounced early in training, where “all‑wrong” groups constitute ≥40% of the sampled batches; TSPO’s turn‑level rewards generate non‑zero advantages for Near‑Miss trajectories, enabling the optimizer to make progress where baselines receive no gradient. Ablation studies vary the partial‑reward coefficient α (0.5, 0.8, 1.0) and confirm that α = 1.0 provides the most stable and highest performance. Additional ablations disabling the FOLR mechanism or using only trajectory‑level rewards reproduce the homogenization problem, confirming the necessity of the proposed mechanism.
In summary, TSPO offers a simple yet powerful solution to the double homogenization dilemma in multi‑turn, search‑augmented LLM reasoning. By leveraging the first occurrence of the ground‑truth answer as an intrinsic latent signal, it creates dense, turn‑level rewards that preserve process‑level information and re‑introduce reward variance within sampling groups—all without extra annotation or model overhead. This work paves the way for more efficient and effective RL‑based training of LLM agents that must interact with external tools over multiple reasoning steps.