SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents


Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.


💡 Research Summary

The paper addresses a critical limitation of current deep search agents built on large reasoning models (LRMs): they are typically trained with outcome‑only supervision, which provides sparse rewards and ignores the quality of intermediate thoughts and search actions. To fill this gap, the authors introduce SRR‑Judge, a step‑level evaluation and refinement framework designed specifically for search‑integrated reasoning in real‑world web environments.

SRR‑Judge works by scoring each thought‑action pair in a ReAct‑style reasoning trajectory according to four criteria: clarity & conciseness, logical structure, query appropriateness (or answer fidelity), and coverage & improvement potential. It takes the history of previous steps as context, produces a quality rating (1‑5), an explanatory comment, and optionally a refined thought and action when the rating is low. The rating threshold is set to 4; if no candidate exceeds it, SRR‑Judge refines the highest‑scoring candidate and the loop repeats until an answer is generated or a maximum iteration count is reached. This “rate‑and‑refine” loop enables best‑of‑N sampling (N=1 for fast inference, N=5 for high‑quality trajectory generation) while keeping computational cost low.
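The accept-or-refine decision at the heart of this loop can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function name is invented, ties go to the first best candidate, and acceptance is taken to mean "rating at or above the threshold of 4", which the paper's description leaves slightly ambiguous.

```python
def select_or_refine(candidates, ratings, refine_fn, threshold=4):
    """Best-of-N step selection: accept the highest-rated candidate if it
    clears the threshold; otherwise refine it and signal another loop pass.

    candidates : sampled thought-action pairs (N=1 or N=5 in the paper)
    ratings    : SRR-Judge scores (1-5), one per candidate
    refine_fn  : produces a refined thought-action pair from a low-rated one
    """
    best = max(range(len(candidates)), key=lambda i: ratings[i])
    if ratings[best] >= threshold:
        return candidates[best], False        # accepted as-is
    return refine_fn(candidates[best]), True  # refined; loop continues
```

In the full workflow this decision runs inside the ReAct loop, repeating until an answer step is produced or the maximum iteration count is reached.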

Because manually annotating step‑level data at scale is prohibitive, the authors use a strong agentic foundation model, DeepSeek‑V3.1, as a teacher. They run multiple models through the ReAct workflow, collect their full trajectories, and apply a self‑consistency technique: each step is annotated five times by DeepSeek‑V3.1, and a majority vote determines the final rating. To improve data balance, they up‑sample rating‑2 examples and synthesize rating‑1 negatives by pairing unrelated thought‑action pairs with DeepSeek‑V3.1 annotations. After filtering trajectories whose average step rating conflicts with the binary correctness of the final answer (point‑biserial correlation < 0.7), they obtain a high‑quality dataset for fine‑tuning.
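The self-consistency vote over the five teacher annotations amounts to taking the mode of the repeated ratings. A minimal sketch (the tie-breaking rule toward the lower rating is an assumption; the paper specifies only majority voting):

```python
from collections import Counter

def majority_vote(ratings):
    """Self-consistency aggregation: the final step rating is the most
    frequent of the repeated teacher annotations. Ties break toward the
    lower (more conservative) rating -- an assumed rule."""
    counts = Counter(ratings)
    top = max(counts.values())
    return min(r for r, c in counts.items() if c == top)
```

For example, five DeepSeek-V3.1 annotations of `[4, 4, 5, 3, 4]` would aggregate to a final rating of 4.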

SRR‑Judge itself is a 32‑billion‑parameter Qwen‑based model (QwQ‑32B) fine‑tuned for one epoch on this dataset. Empirical evaluation shows that SRR‑Judge’s step‑level ratings correlate strongly with final answer correctness—average point‑biserial correlation of 0.479, outperforming the teacher model and smaller baselines. Notably, the first‑step, last‑step, and overall average ratings all exhibit higher correlations after SRR fine‑tuning, confirming that the model has learned effective step‑level judgment.
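The point-biserial correlation used throughout this evaluation is simply the Pearson correlation between a binary variable (final answer correct or not) and a continuous one (the step rating). A minimal implementation, assuming both correct and incorrect examples are present:

```python
import math

def point_biserial(correct, ratings):
    """Point-biserial correlation between binary answer correctness (0/1)
    and a continuous score such as the average step rating.
    Assumes both classes occur at least once."""
    n = len(ratings)
    g1 = [r for c, r in zip(correct, ratings) if c == 1]
    g0 = [r for c, r in zip(correct, ratings) if c == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    mean = sum(ratings) / n
    s = math.sqrt(sum((r - mean) ** 2 for r in ratings) / n)  # population std
    return (m1 - m0) / s * math.sqrt(len(g1) * len(g0) / n ** 2)
```

A correlation near 1 means high-rated trajectories almost always end in correct answers; the same statistic underlies both the 0.7 data-filtering threshold and the reported 0.479 evaluation score.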

The authors then integrate SRR‑Judge into a “rate‑and‑refine” inference workflow and use the resulting high‑quality trajectories to improve the base search agent via iterative rejection‑sampling fine‑tuning (RFT). In each RFT iteration, the current policy generates new trajectories, which SRR‑Judge evaluates and refines; only the top‑rated trajectories are kept for the next fine‑tuning round. After two iterations, the policy shows substantial gains on three real‑world web QA benchmarks—BrowseComp, BrowseComp‑ZH, and XBench‑DeepSearch—achieving more than a 10 % absolute increase in pass@1 compared to the original model.
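One round of this rejection-sampling loop can be sketched as below. The three callables are assumed interfaces standing in for the policy's trajectory generator, SRR-Judge's rate-and-refine pass, and the fine-tuning step; the real pipeline operates on full ReAct trajectories rather than opaque values.

```python
def rft_round(generate, rate, finetune, questions, keep_threshold=4):
    """One iteration of rejection-sampling fine-tuning (sketch):
    sample a trajectory per question, keep those whose judge rating
    clears the threshold, then fine-tune the policy on the survivors.

    generate : question -> trajectory (current policy)
    rate     : trajectory -> overall SRR-Judge rating
    finetune : kept trajectories -> next-round policy
    """
    kept = [t for t in (generate(q) for q in questions)
            if rate(t) >= keep_threshold]
    return finetune(kept)
```

The paper runs two such iterations, with each round's fine-tuned policy generating the candidate trajectories for the next.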

Key contributions are: (1) a novel step‑level evaluation framework tailored to the black‑box nature of web search, (2) a cost‑effective data annotation pipeline that leverages a larger teacher model and statistical filtering, and (3) a combined inference and alignment strategy (rate‑and‑refine + RFT) that demonstrably boosts deep search performance. The work suggests that enabling agents to self‑assess and refine their intermediate reasoning steps can lead to more reliable, efficient, and scalable search‑integrated AI systems.

