Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling
Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail of the trajectory. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question-answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at https://github.com/YubaoZhao/BranPO.
💡 Research Summary
Agentic reinforcement learning (RL) has turned large language models (LLMs) into autonomous agents capable of multi‑turn reasoning and tool use. However, training such agents for long‑horizon tasks remains difficult because reward signals are typically sparse and only provided at the end of an episode. Existing tree‑based approaches try to alleviate this by branching from intermediate states and estimating rewards via Monte‑Carlo rollouts, but they suffer from high variance and large computational overhead.
The authors first conduct an empirical analysis of search agents that follow the ReAct paradigm (alternating between query generation, retrieval, and synthesis). They discover a consistent failure pattern: early steps across different rollouts are almost identical, while errors concentrate in the final stages where the model must synthesize evidence or generate the final answer. A simple experiment that truncates a completed trajectory and resamples only the last answer step shows that Pass@K improves dramatically, indicating that the “tail” of the trajectory carries the most informative learning signal.
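The truncate-and-resample probe above is naturally measured with Pass@K. A minimal sketch of the standard unbiased Pass@K estimator (the specific rollout counts below are illustrative, not the paper's numbers):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K: probability that at least one of k samples drawn
    without replacement from n rollouts (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect rollouts exist, so success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical illustration: resampling only the final answer step raises
# the per-question count of correct rollouts, which lifts Pass@K.
baseline = pass_at_k(n=32, c=4, k=8)
tail_resampled = pass_at_k(n=32, c=10, k=8)
```

Because only the last step is regenerated, any Pass@K gain here is attributable to tail decisions, which is the paper's diagnostic argument.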
Motivated by this observation, the paper introduces Branching Relative Policy Optimization (BranPO), a value‑free, contrastive learning framework built on top of Group Relative Policy Optimization (GRPO). The key idea is to keep the shared prefix of a rollout fixed, truncate the trajectory near its end, and generate multiple alternative suffixes (branches). If the original rollout succeeded, the method searches for a failing suffix; if it failed, it searches for a succeeding one. Only branches whose outcomes differ from the original are retained, ensuring each training pair provides a meaningful contrast.
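The branch-construction rule above can be sketched as follows. All names (`sample_contrastive_branches`, `resample_suffix`) are illustrative stand-ins, not the paper's API; the key point is the contrast filter that keeps only outcome-flipping suffixes:

```python
def sample_contrastive_branches(prefix, original_reward, resample_suffix,
                                max_tries: int = 4):
    """Sketch of BranPO branch construction (names are hypothetical):
    keep the shared prefix fixed, resample alternative suffixes, and
    retain only branches whose outcome differs from the original rollout.
    `resample_suffix(prefix)` is assumed to return (suffix, reward)."""
    # If the original rollout succeeded, search for a failing suffix;
    # if it failed, search for a succeeding one.
    target = 0.0 if original_reward > 0 else 1.0
    branches = []
    for _ in range(max_tries):
        suffix, reward = resample_suffix(prefix)
        if reward == target:  # outcome flipped: this pair carries contrast
            branches.append((suffix, reward))
    return branches
```

Discarding same-outcome branches is what guarantees every retained prefix/suffix pair yields a non-degenerate preference signal.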
To allocate computation efficiently, BranPO employs difficulty‑aware branch sampling. After an initial rollout, the group‑level accuracy of a task is used as a proxy for difficulty. Easy instances receive “simple” branching (a single set of alternative suffixes), while hard instances or failed rollouts trigger recursive branching, i.e., deeper exploration of alternative continuations. This adaptive budgeting concentrates effort where it matters most.
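A minimal sketch of this difficulty routing, using group-level accuracy as the proxy. The thresholds and mode names below are assumptions for illustration; the paper's exact budgeting rule may differ:

```python
def branching_mode(group_rewards, easy_threshold=0.75, hard_threshold=0.25):
    """Hypothetical difficulty-aware router: the accuracy of a task's
    initial rollout group decides how much branching budget it receives."""
    acc = sum(group_rewards) / len(group_rewards)
    if acc >= easy_threshold:
        return "simple"     # one set of alternative suffixes
    if acc <= hard_threshold:
        return "recursive"  # deeper, repeated branching on continuations
    return "standard"
```

Routing hard instances to recursive branching is what concentrates the fixed sampling budget on the tasks where contrastive suffixes are hardest to find.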
The authors also add Redundant Step Masking (RSM). During branch generation, steps that are duplicated or uninformative (e.g., repeated search queries that do not change the context) are identified and their advantage contributions are masked to zero. This prevents the policy update from being polluted by noise and reduces “continuation bias” where the model would otherwise be encouraged to generate longer, but useless, sequences.
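The masking step can be sketched as below. The duplicate-detection rule here (exact repetition of an action/observation pair) is an illustrative stand-in for whatever redundancy criterion the paper uses:

```python
def redundant_step_mask(steps):
    """Sketch of Redundant Step Masking: return a per-step weight that
    zeroes the advantage contribution of steps repeating an earlier
    action without changing the context. `steps` is a list of
    (action, observation) pairs; the dedup rule is hypothetical."""
    seen = set()
    mask = []
    for action, observation in steps:
        key = (action, observation)
        if key in seen:
            mask.append(0.0)  # uninformative repeat: excluded from the update
        else:
            seen.add(key)
            mask.append(1.0)
    return mask
```

Multiplying per-token advantages by this mask removes the incentive to pad trajectories with repeated searches, which is the "continuation bias" the paragraph describes.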
Formally, BranPO retains the group‑normalized advantage computation of GRPO but splits the advantage estimation into two parts: one for the shared prefix (trunk) and one for each divergent suffix (branch). This preserves GRPO’s stable gradient properties while injecting step‑level preference signals derived from contrastive outcomes. The authors provide a theoretical justification that BranPO’s gradient estimator is unbiased under the contrastive sampling scheme.
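The shared building block is GRPO's group-normalized advantage; a minimal sketch is below. The trunk/branch split shown in the usage lines is structural illustration only, assuming the trunk is normalized over the full-trajectory group and each branch set over the contrastive suffixes sharing one prefix:

```python
def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: center and scale outcome rewards within a group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Illustrative split (assumed structure, not the paper's exact estimator):
# trunk advantages come from the group of full rollouts, while each
# divergent branch set is normalized separately over its own suffixes.
trunk_adv = group_normalized_advantages([1.0, 0.0, 1.0, 0.0])
branch_adv = group_normalized_advantages([1.0, 0.0])  # suffixes of one prefix
```

Normalizing branches within their own group keeps the contrastive signal zero-mean, which is consistent with the unbiasedness claim for the overall gradient estimator.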
Extensive experiments are conducted on multi‑hop question answering benchmarks (HotpotQA, 2WikiMultihopQA) and web‑search tasks. Compared against strong baselines—including GRPO, Tree‑GRPO, GiGPO, StepSearch, and MT‑GRPO—BranPO consistently yields higher accuracy. In Pass@K evaluations, BranPO improves Pass@32 by 10–25 % relative to the best RL‑trained baselines, despite using the same total number of training tokens. Ablation studies show that removing difficulty‑aware sampling or RSM degrades performance by 1.5–2.3 % absolute, confirming their contribution. Moreover, BranPO converges faster, achieving comparable performance with roughly half the number of training steps required by GRPO.
The paper acknowledges limitations: the method assumes that early prefixes are largely shared; in tasks where early decisions diverge significantly, the benefit of tail‑focused branching may diminish. Additionally, for extremely easy examples where few informative alternative suffixes exist, the sampling budget may be under‑utilized. Future work could explore dynamic prefix clustering and extensions to environments with multiple tools or more complex action spaces.
In summary, BranPO offers a principled, efficient, and empirically validated solution to the credit‑assignment problem in long‑horizon agentic RL. By concentrating contrastive supervision on the most uncertain tail decisions, adapting branching depth to task difficulty, and masking redundant steps, it achieves superior performance without increasing overall training cost. The authors release their code, facilitating further research and practical deployment of long‑horizon search agents.