Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, yet optimizing the retrieval process remains challenging due to the lack of dense, principled reward signals. In this paper, we introduce InfoReasoner, a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model’s belief states, establishing guarantees, including non-negativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model’s output distributions using semantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress, enabling efficient training via Group Relative Policy Optimization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to 5.4% average accuracy improvement. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval. The code is available at https://github.com/dl-m9/InfoReasoner

💡 Research Summary

The paper introduces InfoReasoner, a unified framework that equips large reasoning models (LRMs) with a principled, dense reward signal for optimizing their retrieval actions during agentic reasoning. The authors begin by formalizing the reasoning process as a Partially Observable Markov Decision Process (POMDP) and define a belief state \(b_t\) as the posterior distribution over a latent answer variable \(Y\) given all observations up to time \(t\). They propose an uncertainty functional \(U\) that satisfies minimality, concavity, and expected monotonicity; Shannon entropy is used as a concrete instantiation.
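To make the belief-state picture concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a small discrete set of candidate answers, and the names `Belief`, `shannon_entropy`, and `bayes_update`, as well as the toy prior and likelihood values, are hypothetical choices made for illustration. It shows a belief state b_t as a probability table over Y, Shannon entropy as the uncertainty functional U, and one Bayesian update after an observation.

```python
import math
from typing import Dict

# Hypothetical representation: candidate answer -> probability.
Belief = Dict[str, float]


def shannon_entropy(belief: Belief) -> float:
    """Shannon entropy (in nats) of a discrete belief state."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0.0)


def bayes_update(belief: Belief, likelihood: Dict[str, float]) -> Belief:
    """Posterior over answers after one observation, via Bayes' rule.

    likelihood[y] is the assumed probability of the observation given answer y.
    """
    unnormalized = {y: p * likelihood.get(y, 0.0) for y, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return {y: w / total for y, w in unnormalized.items()}


# Toy numbers chosen for illustration only.
prior: Belief = {"Paris": 0.4, "Lyon": 0.3, "Nice": 0.3}
likelihood = {"Paris": 0.9, "Lyon": 0.1, "Nice": 0.1}
posterior = bayes_update(prior, likelihood)

print("U(prior)     =", shannon_entropy(prior))
print("U(posterior) =", shannon_entropy(posterior))
```

In this toy case the observation favors one answer, so the entropy drops. A single observation can also raise entropy; the expected-monotonicity property named above concerns the average over observations, not every individual one.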

Information gain (IG) is redefined as the reduction in this uncertainty.
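The exact formula is not recoverable from this summary, so the sketch below rests on stated assumptions rather than on the paper's definition. It assumes the gain at a step is the uncertainty before retrieval minus the uncertainty after it, and it estimates each uncertainty from sampled answers grouped by bidirectional entailment, using empirical cluster frequencies. The function names are hypothetical, and `toy_entails` is a string-matching placeholder standing in for a real textual entailment model.

```python
import math
from typing import Callable, List

Entails = Callable[[str, str], bool]


def semantic_clusters(answers: List[str], entails: Entails) -> List[List[str]]:
    """Greedily group answers that entail each other in both directions."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(answers: List[str], entails: Entails) -> float:
    """Entropy (in nats) of the empirical distribution over semantic clusters."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    n = len(answers)
    return -sum(
        (len(c) / n) * math.log(len(c) / n)
        for c in semantic_clusters(answers, entails)
    )


def information_gain(before: List[str], after: List[str], entails: Entails) -> float:
    """Assumed form of the gain: uncertainty before minus uncertainty after."""
    return semantic_entropy(before, entails) - semantic_entropy(after, entails)


def toy_entails(a: str, b: str) -> bool:
    """Placeholder for an entailment model: case-insensitive string match."""
    return a.strip().lower() == b.strip().lower()


# Hypothetical answers sampled before retrieval and after each of two retrievals.
step_samples = [
    ["Paris", "Lyon", "Nice", "Marseille"],
    ["Paris", "paris", "Lyon", "Nice"],
    ["Paris", "paris ", "PARIS", "Lyon"],
]
gains = [
    information_gain(step_samples[i], step_samples[i + 1], toy_entails)
    for i in range(len(step_samples) - 1)
]
total = semantic_entropy(step_samples[0], toy_entails) - semantic_entropy(
    step_samples[-1], toy_entails
)
print("per-step gains:", gains)
print("sum of gains:", sum(gains), "overall reduction:", total)
```

Under this assumed form, the per-step gains sum to the overall drop in uncertainty between the first and last step, which illustrates the telescoping additivity mentioned in the abstract. With a finite set of samples, an individual estimated gain in this sketch can come out negative.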
