PRInTS: Reward Modeling for Long-Horizon Information Seeking

Reading time: 5 minutes
...

📝 Original Info

  • Title: PRInTS: Reward Modeling for Long-Horizon Information Seeking
  • ArXiv ID: 2511.19314
  • Date: 2025-11-24
  • Authors: Not specified in the paper body, so they cannot be listed here.

📝 Abstract

Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.

💡 Deep Analysis

📄 Full Content

A long-standing goal in artificial intelligence has been to develop agents that can answer novel queries by intelligently seeking information (Bachman et al., 2016; Yuan et al., 2020), thereby enabling them to tackle challenging tasks in mathematics (Liu et al., 2025a;b), software engineering (Yang et al., 2025b; Pan et al., 2024), and research (Li et al., 2025b; Wu et al., 2025a). Large Language Models (LLMs) have shown promise as agents for such tasks when equipped with frameworks like ReAct (Yao et al., 2023), which interleaves LLM reasoning with external tool interactions. However, long-horizon information-seeking tasks, which require agents to gather and synthesize information across multiple steps (Su et al., 2025; Shao et al., 2025), remain challenging even for recent LLMs with tool-use training, which still perform far below human level (Mialon et al., 2024; Wei et al., 2025). While fine-tuning LLMs as information-seeking agents has shown promise (Li et al., 2025b; Wu et al., 2025a; Tao et al., 2025), it is limited to specific model families and is highly computationally demanding (Gao et al., 2025). An alternative way to boost a variety of agents is to build reward models (e.g., as done for math reasoning and instruction following (Wang et al., 2024a; Zou et al., 2025)). These models approximate the expected reward of a step or sequence of steps, enabling test-time scaling by ranking and selecting higher-quality actions or trajectories to successfully tackle long-horizon tasks. Specifically, Process Reward Models (PRMs) (Zou et al., 2025; Choudhury, 2025; Chae et al., 2025) offer a promising, model-agnostic way of improving performance by scoring the quality of each of an agent's steps.
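To make the test-time scaling idea concrete, here is a minimal sketch of best-of-n step selection with a process reward model. The interfaces (`agent.propose_step`, `prm.score`, `step.is_final_answer`) are hypothetical stand-ins for the agent backbone and the PRM, not the paper's actual API.

```python
# Minimal sketch of PRM-guided best-of-n step selection (hypothetical
# interfaces, not the paper's implementation): at each step, sample several
# candidate next steps and keep the one the process reward model scores highest.

def select_next_step(agent, prm, task, trajectory, n_candidates=8):
    """Sample n candidate next steps and return the highest-scoring one."""
    candidates = [agent.propose_step(task, trajectory) for _ in range(n_candidates)]
    scores = [prm.score(task, trajectory, step) for step in candidates]
    best = max(range(n_candidates), key=lambda i: scores[i])
    return candidates[best]


def run_episode(agent, prm, task, max_steps=30):
    """Roll out one trajectory, letting the PRM rank candidates at every step."""
    trajectory = []
    for _ in range(max_steps):
        step = select_next_step(agent, prm, task, trajectory)
        trajectory.append(step)
        if step.is_final_answer:  # hypothetical attribute on the step object
            break
    return trajectory
```

Because the PRM only ranks candidates, this wrapper is model-agnostic: the same reward model can steer different backbone agents without fine-tuning them.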

Figure 1: Comparison between existing PRMs and PRINTS. Top: existing PRMs are limited for long-horizon information-seeking because they evaluate a short reasoning unit (e.g., one- to two-sentence inferences) with coarse feedback, which cannot capture the multi-faceted quality factors arising from tool interactions; they also struggle with the rapidly accumulating reasoning context (left). Bottom: in contrast, PRINTS evaluates a complete trajectory step (reasoning + tool interactions), considers multiple trajectory step quality dimensions to produce dense scores for finer-grained guidance at each step, and maintains compact trajectory summaries that keep the key information needed for evaluation.

While past work has developed PRMs for tasks such as mathematics and logical reasoning, these methods are insufficient for long-horizon information-seeking tasks for two critical reasons. (1) Tool-Reasoning Evaluation Granularity: existing PRMs evaluate short reasoning units in isolation, typically one- to two-sentence logical or mathematical inferences (Xiong et al., 2025; Zhao et al., 2025), providing binary judgments based on logical or mathematical validity. In contrast, long-horizon information-seeking requires jointly evaluating a complete trajectory step, which encompasses a reasoning step combined with tool interactions (e.g., web search, web browsing, code execution). Moreover, step quality depends on multiple factors (e.g., interpretation of tool outputs, tool call informativeness, plan for the next action) that coarse feedback cannot capture, increasing the granularity of the guidance needed to effectively steer agents toward good trajectories. (2) Context Accumulation: existing PRMs cannot manage the ever-growing reasoning context that arises over multiple trajectory steps. As illustrated in Figure 1 (Top-left), the information-seeking trajectory, which interleaves reasoning steps, tool calls, and tool outputs, grows rapidly as tool responses at each step introduce lengthy, noisy content, creating computational overhead. Furthermore, recent studies show that models struggle to process long, accumulated contexts (Tang et al., 2025; Yen et al., 2025; Kuratov et al., 2024), resulting in noisy evaluations. This necessitates compressing trajectories into compact forms instead of processing entire historical contexts.
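One way to act on this requirement, sketched below under assumed interfaces (the `prm.summarize` call and its arguments are hypothetical, not the paper's API), is to keep a rolling summary that is refreshed after every step, so evaluation conditions on a bounded context instead of the full interleaved history.

```python
# Hedged sketch of rolling trajectory summarization (hypothetical interface):
# fold each new step into a compact running summary rather than re-reading the
# entire history of reasoning, tool calls, and tool outputs.

def update_summary(prm, summary, new_step, token_budget=1024):
    """Compress the latest step into the running summary under a token budget."""
    return prm.summarize(
        previous_summary=summary,   # compact record of the trajectory so far
        new_step=new_step,          # latest reasoning + tool call + tool output
        token_budget=token_budget,  # cap on the compressed context length
    )
```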

Our work aims to fill these gaps by introducing Progress Reward via Information gain scoring and Trajectory Summarization (PRINTS), a novel generative PRM for long-horizon information-seeking tasks. PRINTS is a unified model jointly trained with two key abilities, addressing both the need for fine-grained guidance and the challenge of context accumulation within a single PRM. First, PRINTS acts as a scorer that evaluates candidate next trajectory steps by generating Chain-of-Thought (Wei et al., 2022) analyses across multiple quality dimensions and outputting dense scores derived from this generative reasoning, as illustrated in Figure 1 (Bottom). Crucially, we frame step evaluation as information gain estimation, which quantifies how much each trajectory step increases the probability of reaching the correct answer. This formulation enables training via reinforcement learning with information gain estimation and preference prediction objectives.
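As a rough illustration of the information-gain framing (the estimator here is a stand-in, not the paper's method; `estimate_answer_prob` is a hypothetical helper), a candidate step can be scored by how much it raises the estimated probability of eventually answering the task correctly:

```python
# Illustrative sketch of information-gain step scoring (hypothetical helper,
# not the paper's estimator): score a candidate step by the increase in the
# estimated probability of reaching the correct answer.

def information_gain(estimate_answer_prob, task, summary, candidate_step):
    """info_gain(step) ~ P(correct | context + step) - P(correct | context)."""
    p_with = estimate_answer_prob(task, summary, candidate_step)
    p_without = estimate_answer_prob(task, summary, None)
    return p_with - p_without
```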

Reference

This content is AI-processed based on open access ArXiv data.
