On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning, where agents must strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand this phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent’s belief based on collected evidence. We show that deficient AS and BT capabilities limit information exploration during RL training. Insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve the issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy-to-obtain directional critiques to help the agent escape self-locking. Extensive experiments on 7 datasets show that our approach significantly mitigates information self-locking, bringing improvements of up to 60%.
💡 Research Summary
The paper investigates a previously undocumented failure mode that emerges when large language model (LLM) agents are trained with outcome‑based reinforcement learning (RL) for active reasoning tasks—situations where the agent must ask a series of strategic questions to acquire missing information before producing a final answer. The authors call this phenomenon Information Self‑Locking (SeL). In the SeL regime, agents gradually stop asking informative queries and fail to internalize the evidence they have already collected, even though the overall task reward continues to improve.
Decomposition into Action Selection and Belief Tracking
To dissect the problem, the authors split the agent’s behavior into two coupled components:
- Action Selection (AS) – a belief‑conditioned policy π_Q that decides which question to ask at each turn, based on the current belief state b_t.
- Belief Tracking (BT) – an internal update operator π_U that incorporates the observation o_t returned by the environment into a new belief b_{t+1}.
Both components are modeled within a Partially Observable Markov Decision Process (POMDP). The paper argues that deficiencies in either component can restrict the other, creating a feedback loop that locks the agent into a low‑information regime.
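The AS/BT decomposition can be pictured as a minimal interaction loop. Everything below — the dictionary belief representation, the heuristic query picker `pi_Q`, the overwrite-style updater `pi_U`, and the toy environment — is an illustrative assumption to make the decomposition concrete, not the paper's implementation:

```python
# Sketch of the AS/BT decomposition: pi_Q (Action Selection) chooses which
# question to ask given the current belief; pi_U (Belief Tracking) folds the
# environment's observation back into the belief. All names are hypothetical.

def pi_Q(belief, candidate_questions):
    """Action Selection: ask about the attribute the belief is least sure of."""
    return min(candidate_questions, key=lambda q: belief.get(q, 0.0))

def pi_U(belief, question, observation):
    """Belief Tracking: absorb the observation into an updated belief."""
    new_belief = dict(belief)
    new_belief[question] = observation  # replace uncertainty with evidence
    return new_belief

def run_episode(env_answer, questions, num_turns=3):
    belief = {q: 0.0 for q in questions}  # 0.0 = fully uncertain
    for _ in range(num_turns):
        q = pi_Q(belief, questions)   # AS: choose the next query
        o = env_answer(q)             # environment returns an observation o_t
        belief = pi_U(belief, q, o)   # BT: update belief b_t -> b_{t+1}
    return belief

# Toy environment: the "user" holds fixed ground-truth attribute values.
truth = {"color": 0.9, "size": 0.7, "price": 0.4}
final = run_episode(lambda q: truth[q], list(truth))
```

The point of the sketch is the coupling: a weak `pi_Q` starves `pi_U` of observations, and a weak `pi_U` wastes whatever `pi_Q` gathers.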
Empirical Observations
Two interactive benchmarks are used:
- PE‑G – a preference‑estimation task where the agent queries a user about attribute subspaces of items.
- MediQ – a medical diagnosis task where the agent asks a simulated patient for diagnostic facts.
For each benchmark, the authors track per-turn AS quality (the information gain of a query) and BT quality (the improvement in belief similarity or hypothesis margin). Across seven datasets, they observe three key patterns:
- Reward–information decoupling – total episode reward rises during training, but AS and BT metrics plateau or even decline.
- Weak BT masks informative AS – when the agent’s own belief updates are poor, the correlation between informative queries and final reward is low. Replacing the agent’s BT with a strong oracle (human‑defined rules or a frontier model) dramatically increases this correlation, showing that the agent’s BT is hiding the contribution of good questions.
- Conservative AS limits BT – when the policy becomes risk‑averse and asks few or trivial questions, BT receives little useful signal, encouraging the agent to rely on early context and ignore interaction altogether.
These observations suggest a bidirectional coupling: AS needs BT to translate gathered evidence into reward, while BT needs AS to receive enough evidence to learn.
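Per-turn metrics of this kind can be sketched as follows, assuming the agent maintains a discrete distribution over candidate hypotheses. The paper's exact metrics may differ; this only illustrates the style of measurement — entropy reduction as an AS proxy, probability mass moved onto the true hypothesis as a BT proxy:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def as_information_gain(belief_before, belief_after):
    """AS proxy: entropy reduction induced by the answer to a query."""
    return entropy(belief_before) - entropy(belief_after)

def bt_margin_improvement(belief_before, belief_after, target_idx):
    """BT proxy: probability mass the update moved onto the true hypothesis."""
    return belief_after[target_idx] - belief_before[target_idx]

before = [0.25, 0.25, 0.25, 0.25]   # uniform over 4 hypotheses
after  = [0.70, 0.10, 0.10, 0.10]   # evidence favors hypothesis 0

gain = as_information_gain(before, after)                 # ~0.64 bits
margin = bt_margin_improvement(before, after, target_idx=0)  # 0.45
```

Under self-locking, both quantities would flatline per turn even as episode reward rises.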
Theoretical Framework
The authors formalize the concepts:
- AS informativeness I_th(ω) – expected total belief improvement under an oracle Bayesian update when following policy π_Q(ω).
- BT absorption C_BT(ω) – expected sum of positive belief changes actually absorbed by the agent’s own update operator.
A self‑locking region R_{δ,ε} is defined as the set of parameters where both I_th ≤ δ and C_BT ≤ ε. Within this region, the policy gradient can be decomposed into separate AS and BT components (g_J,Q and g_J,U). The paper proves (informally, Theorem 3.4) that if the agent resides in R_{δ,ε}, gradient updates to improve AS do not increase I_th because BT is too weak to reflect the benefit, and updates to improve BT do not increase C_BT because AS supplies insufficient informative queries. Hence the system gets stuck.
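In symbols, the quantities above can be written roughly as follows. This is a reconstruction from the summary's wording, using an oracle Bayesian belief change $\Delta b_t^{\text{oracle}}$ and the agent's own belief change $\Delta b_t^{\pi_U}$; the paper's exact notation may differ:

```latex
I_{\mathrm{th}}(\omega) = \mathbb{E}_{\pi_Q(\omega)}\Big[\sum_{t} \Delta b_t^{\text{oracle}}\Big],
\qquad
C_{\mathrm{BT}}(\omega) = \mathbb{E}\Big[\sum_{t} \max\big(0,\, \Delta b_t^{\pi_U}\big)\Big],
\qquad
R_{\delta,\varepsilon} = \big\{\omega : I_{\mathrm{th}}(\omega) \le \delta \ \text{and}\ C_{\mathrm{BT}}(\omega) \le \varepsilon\big\}.
```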
Proposed Remedy: AReW (Advantage Re‑Weighting)
To break the loop, the authors introduce directional critiques—binary signals that are cheap to obtain and indicate whether a query yielded new evidence (e.g., “Did the user provide a novel fact?”). These signals are used to re‑weight the advantage terms in the policy gradient for both AS and BT:
- A_J,Q (AS advantage) ← A_J,Q × critique_AS
- A_J,U (BT advantage) ← A_J,U × critique_BT
By amplifying the gradient when a query is known to be informative, the method restores a non‑degenerate learning signal even when the agent’s own BT is poor. The approach requires only a lightweight wrapper around standard policy‑gradient RL and is robust to noisy critiques.
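The re-weighting step itself is small enough to sketch directly from the update rules quoted above. The function below is a hedged illustration, not the authors' code: each turn's advantage is multiplied by its binary critique before entering the standard policy-gradient loss, so uninformative turns contribute no (or unscaled) gradient and informative ones keep theirs.

```python
def reweight_advantages(advantages, critiques):
    """AReW-style re-weighting: A_t <- A_t * critique_t.

    advantages: per-turn scalar advantages (A_J,Q or A_J,U)
    critiques:  0/1 flags ("did this query yield new evidence?"),
                applied elementwise per the rule quoted above
    """
    return [a * c for a, c in zip(advantages, critiques)]

adv = [0.5, -0.2, 0.8]
crit = [1, 0, 1]  # turns 1 and 3 produced novel evidence
reweighted = reweight_advantages(adv, crit)  # [0.5, 0.0, 0.8]
```

Because the critique is a per-turn scalar multiplier, the wrapper slots into any advantage-based objective (PPO, REINFORCE with baseline) without touching the underlying optimizer.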
Experimental Validation
AReW is evaluated on seven datasets spanning preference elicitation, medical diagnosis, and other interactive reasoning tasks. Results show:
- Consistent increases in both AS and BT proxy metrics throughout training.
- Final task success rates improve by up to 60% compared with vanilla outcome‑based RL.
- The method works across model families (Qwen‑2.5‑7B‑Instruct, GPT‑4) and RL algorithms (PPO, REINFORCE).
- Adding random noise to the critiques degrades performance only marginally, demonstrating robustness.
Implications and Limitations
The study provides the first quantitative and theoretical explanation for why RL‑trained LLM agents may stop asking questions, linking it to a mutual deficiency in action selection and belief updating. AReW shows that internal, easily obtainable signals can substitute for complex external reward shaping, offering a practical tool for improving interactive AI systems such as tutoring agents, diagnostic assistants, and collaborative bots.
Limitations include the reliance on a simple binary notion of “new evidence”. More sophisticated information‑value measures (e.g., expected information gain, entropy reduction) could further enhance the re‑weighting. Additionally, the theoretical analysis assumes a specific decomposition of the gradient; extending it to more complex architectures (e.g., hierarchical policies) remains an open direction.
Conclusion
Information self‑locking arises from a two‑way coupling between an LLM agent’s ability to select informative actions and its capacity to integrate the resulting observations. By injecting cheap directional critiques that re‑weight policy‑gradient advantages, the proposed AReW method restores a meaningful learning signal, enabling agents to escape the low‑information trap and achieve substantially higher performance on active reasoning tasks.