ICA: Information-Aware Credit Assignment for Visually Grounded Long-Horizon Information-Seeking Agents
Despite the strong performance achieved by reinforcement learning-trained information-seeking agents, learning in open-ended web environments remains severely constrained by low signal-to-noise feedback. Text-based parsers often discard layout semantics and introduce unstructured noise, while long-horizon training typically relies on sparse outcome rewards that obscure which retrieval actions actually matter. We propose a visual-native search framework that represents webpages as visual snapshots, allowing agents to leverage layout cues to quickly localize salient evidence and suppress distractors. To learn effectively from these high-dimensional observations, we introduce Information-Aware Credit Assignment (ICA), a post-hoc method that estimates each retrieved snapshot’s contribution to the final outcome via posterior analysis and propagates dense learning signals back to key search turns. Integrated with a GRPO-based training pipeline, our approach consistently outperforms text-based baselines on diverse information-seeking benchmarks, providing evidence that visual snapshot grounding with information-level credit assignment alleviates the credit-assignment bottleneck in open-ended web environments. The code and datasets will be released at https://github.com/pc-inno/ICA_MM_deepsearch.git.
💡 Research Summary
The paper tackles two fundamental challenges that hinder reinforcement‑learning (RL) based information‑seeking agents operating on the open web: (1) the loss of layout and visual cues when webpages are linearized into raw text, which injects abundant noise and reduces the signal‑to‑noise ratio of observations; and (2) the sparsity of supervision in long‑horizon trajectories, where only a final answer reward is available, making it difficult to attribute success to specific search or click actions.
To address (1), the authors propose a visual‑native observation pipeline that renders each visited webpage as an image snapshot rather than extracting HTML text. The snapshot preserves headings, tables, highlighted regions, figures, charts and other visual structures, allowing the agent to exploit stable spatial cues for rapid evidence localization and distractor suppression. Because the observation is an image, token‑length limits of language models are avoided, and the agent can process the entire page at once.
For (2), they introduce Information‑Aware Credit Assignment (ICA), a post‑hoc credit‑allocation method that operates after a batch of trajectories has been collected. Each trajectory is labeled as successful or not by an LLM‑as‑a‑judge. The set of atomic evidence units—defined as the smallest self‑contained pieces of external information (a single search result or a fetched snapshot)—is then extracted from every trajectory. For each evidence unit e, ICA computes the empirical success probability when e is present, P(R=1 | I_e=1), and when it is absent, P(R=1 | I_e=0). Their difference, Δ_e = P(R=1 | I_e=1) − P(R=1 | I_e=0), serves as a marginal contribution score. This score is then assigned as a dense turn‑level reward r̃_t to the specific search or fetch action that first introduced e, and propagated backward through the reasoning steps.
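The two steps above—estimating Δ_e from batch statistics and mapping it onto the turn that first surfaced each unit—can be sketched as follows. This is a minimal illustration from the description in the summary, not the paper's implementation; the function names and data shapes are assumptions.

```python
def ica_scores(trajectories):
    """Estimate each evidence unit's marginal contribution Delta_e.

    `trajectories` is a list of (evidence_set, success) pairs, where
    evidence_set contains the atomic evidence units retrieved along one
    trajectory and success is the 0/1 judge label. (Illustrative shape.)
    """
    all_units = set().union(*(ev for ev, _ in trajectories))
    scores = {}
    for e in all_units:
        with_e = [r for ev, r in trajectories if e in ev]        # R given I_e = 1
        without_e = [r for ev, r in trajectories if e not in ev]  # R given I_e = 0
        p_with = sum(with_e) / len(with_e) if with_e else 0.0
        p_without = sum(without_e) / len(without_e) if without_e else 0.0
        scores[e] = p_with - p_without  # Delta_e
    return scores

def turn_rewards(turn_evidence, scores):
    """Assign Delta_e as a dense reward to the turn that FIRST introduced e."""
    rewards = [0.0] * len(turn_evidence)
    seen = set()
    for t, units in enumerate(turn_evidence):
        for e in units:
            if e not in seen:
                seen.add(e)
                rewards[t] += scores.get(e, 0.0)
    return rewards
```

With a toy batch such as `[({"a","b"}, 1), ({"a"}, 1), ({"b"}, 0), ({"c"}, 0)]`, unit "a" appears only in successful trajectories and receives Δ_a = 1.0, while the uninformative "b" scores 0.0—matching the intuition that Δ_e isolates evidence that actually moves the outcome.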
ICA is integrated with Group Relative Policy Optimization (GRPO), a value‑free RL algorithm that normalizes each trajectory's reward against a group of rollouts sampled for the same query rather than learning a value network. The dense, information‑level rewards supplied by ICA replace the sparse terminal reward, reducing gradient variance and improving learning stability.
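How ICA's dense turn rewards plug into GRPO's group-relative advantage can be sketched in a few lines. This is a schematic of the standard GRPO advantage computation, with the per-turn rewards r̃_t summed into a trajectory return; the function name and input shape are assumptions, not the paper's API.

```python
import statistics

def grpo_advantages(group_turn_rewards):
    """Group-relative advantages from dense turn-level rewards.

    `group_turn_rewards` is a list of per-trajectory reward lists, all
    rollouts for the same query. Each trajectory's return is normalized
    against the group mean and std (no value network needed).
    """
    returns = [sum(r) for r in group_turn_rewards]  # sum dense r~_t per trajectory
    mu = statistics.mean(returns)
    sigma = statistics.pstdev(returns) or 1.0       # guard against a degenerate group
    return [(g - mu) / sigma for g in returns]
```

Because the ICA rewards already differentiate trajectories that found decisive evidence from those that did not, the resulting advantages are better separated within each group than under a single terminal reward shared by all near-miss rollouts.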
Experiments are conducted on a ReAct‑style agent with three actions (SEARCH, FETCH, ANSWER) across a spectrum of benchmarks ranging from single‑hop (NQ) to multi‑hop (Bamboogle, Xbench‑DS) and deep‑search tasks (BrowseComp, SealQA). Models from 7B to 70B parameters are evaluated. Compared with strong text‑based baselines that parse HTML into linear text, the visual‑snapshot + ICA agents achieve consistent absolute gains of 3–7 percentage points in accuracy, with especially pronounced improvements on trajectories longer than ten steps. Attention visualizations show that the agents focus on layout‑derived regions such as headings, tables, and charts, confirming that visual cues are being exploited. Moreover, the variance of reward signals during training is markedly lower, leading to smoother convergence curves.
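The three-action ReAct loop driving these experiments can be sketched minimally. The `policy` and `tools` objects below are stand-ins (a real agent would call a VLM and return snapshot images as observations); this is a control-flow illustration only.

```python
def run_episode(policy, tools, max_turns=10):
    """Schematic ReAct loop over SEARCH / FETCH / ANSWER.

    `policy(history)` returns an (action, argument) pair; `tools` maps
    SEARCH and FETCH to functions returning observations (e.g. a page
    snapshot). ANSWER terminates the episode. (Illustrative interfaces.)
    """
    history = []
    for _ in range(max_turns):
        action, arg = policy(history)        # e.g. ("SEARCH", "query terms")
        if action == "ANSWER":
            return arg, history              # final answer ends the trajectory
        observation = tools[action](arg)     # search results or a fetched snapshot
        history.append((action, arg, observation))
    return None, history                     # turn budget exhausted
```

Each `(action, arg, observation)` turn in `history` is exactly the unit to which ICA's dense reward r̃_t is attached during training.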
The analysis demonstrates that ICA provides finer‑grained credit than prior turn‑level shaping or LLM‑as‑judge scoring, because it directly measures the causal impact of each evidence unit on success. This sidesteps the need for state similarity assumptions required by methods like anchor‑state grouping, which often break in tool‑use settings where minor parameter changes produce divergent observations.
In summary, the work introduces a novel paradigm—visual snapshot grounding combined with information‑aware credit assignment—that simultaneously mitigates observation noise and credit‑assignment sparsity in long‑horizon web information seeking. The authors suggest future extensions such as multimodal encoders tailored to snapshots, automatic clustering of evidence units, hybrid human‑in‑the‑loop ICA, and efficient on‑the‑fly snapshot rendering for real‑time interactive agents.