Large Language Model Agents Are Not Always Faithful Self-Evolvers


Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the first systematic investigation of experience faithfulness, the causal dependence of an agent’s decisions on the experience it is given, in self-evolving LLM agents. Using controlled causal interventions on both raw and condensed forms of experience, we comprehensively evaluate four representative frameworks across 10 LLM backbones and 9 environments. Our analysis uncovers a striking asymmetry: while agents consistently depend on raw experience, they often disregard or misinterpret condensed experience, even when it is the only experience provided. This gap persists across single- and multi-agent configurations and across backbone scales. We trace its underlying causes to three factors: the semantic limitations of condensed content, internal processing biases that suppress experience, and task regimes where pretrained priors already suffice. These findings challenge prevailing assumptions about self-evolving methods and underscore the need for more faithful and reliable approaches to experience integration.


💡 Research Summary

The paper presents the first systematic study of “experience faithfulness” in self‑evolving large language model (LLM) agents, i.e., the degree to which an agent’s decisions causally depend on the past experience it is given. The authors distinguish two experience modalities: raw experience, which consists of concrete successful trajectories (observations, actions, intermediate states), and condensed experience, which is a distilled, high‑level summary or heuristic extracted from many trajectories. To probe whether agents truly leverage these inputs, the authors design a suite of causal interventions that perturb either raw or condensed experience in controlled ways (emptying content, shuffling steps, inserting irrelevant trajectories, corrupting key tokens, or replacing with filler symbols).
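To make the intervention suite concrete, the sketch below expresses each perturbation as a simple transformation over the textual experience inserted into the agent’s prompt. This is a hypothetical illustration, assuming a plain-text, one-step-per-line trajectory format; the function names and format are ours, not the paper’s released code.

```python
import random

def empty(experience: str) -> str:
    """Remove the experience block entirely (tests dependence on its presence)."""
    return ""

def shuffle_steps(experience: str, seed: int = 0) -> str:
    """Destroy temporal structure by permuting trajectory steps.
    Assumes one step per line, which may differ from the paper's format."""
    steps = experience.splitlines()
    random.Random(seed).shuffle(steps)
    return "\n".join(steps)

def corrupt_tokens(experience: str, key_tokens: list[str]) -> str:
    """Overwrite task-critical tokens (e.g., entities or action names)."""
    for token in key_tokens:
        experience = experience.replace(token, "[MASK]")
    return experience

def filler(experience: str) -> str:
    """Replace the content with meaningless symbols of comparable length."""
    return "# " * (len(experience) // 2)

def insert_irrelevant(experience: str, unrelated_pool: list[str], seed: int = 0) -> str:
    """Substitute a trajectory drawn from an unrelated task."""
    return random.Random(seed).choice(unrelated_pool)
```

Comparing task success with the intact experience against success under each transformation isolates the causal contribution of that experience channel.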

Four representative self‑evolving frameworks are evaluated: the offline single‑agent ExpeL system, the online single‑agent Dynamic CheatSheet and ReasoningBank systems, and the online multi‑agent G‑Memory system. Experiments span ten LLM backbones, including the closed‑source GPT‑4o, GPT‑4o‑mini, and Gemini‑2.5‑Flash as well as open‑weight Qwen3 models ranging from a 1.7B dense model to a 235B mixture‑of‑experts, covering a wide range of model scales and architectures. The agents are tested on nine benchmarks across four domains: knowledge‑intensive QA (HotpotQA, FEVER, GPQA‑Diamond, MMLU‑Pro Eng.), mathematical reasoning (AIME 2024, Game of 24), embodied interaction (ALFWorld), and web interaction (WebArena, WebShop).
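For orientation, the evaluation can be pictured as a grid over frameworks, backbones, and benchmarks. The snippet below is only a schematic of that grid (identifiers are ours, and only a subset of the ten backbones is listed), not the authors’ actual harness.

```python
from itertools import product

FRAMEWORKS = ["ExpeL", "DynamicCheatSheet", "ReasoningBank", "G-Memory"]
BACKBONES = ["gpt-4o", "gpt-4o-mini", "gemini-2.5-flash",
             "qwen3-1.7b", "qwen3-235b-moe"]  # subset, for illustration
BENCHMARKS = {
    "knowledge_qa": ["HotpotQA", "FEVER", "GPQA-Diamond", "MMLU-Pro-Eng"],
    "math": ["AIME-2024", "GameOf24"],
    "embodied": ["ALFWorld"],
    "web": ["WebArena", "WebShop"],
}

for framework, backbone in product(FRAMEWORKS, BACKBONES):
    for domain, tasks in BENCHMARKS.items():
        for task in tasks:
            ...  # run the agent with intact vs. intervened experience
```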

Results are strikingly consistent across all of these dimensions. When both raw and condensed experience are supplied, removing or corrupting raw experience (e.g., emptying the content or replacing it with irrelevant trajectories) leads to large drops in success rate, often 20–30 percentage points, demonstrating that agents rely heavily on the concrete temporal and semantic structure of raw trajectories. In contrast, analogous perturbations to condensed experience (empty, corrupt, irrelevant, filler) produce little or no measurable impact on performance; even omitting condensed experience entirely has only a marginal effect. This pattern holds for offline and online paradigms, for single‑ and multi‑agent settings, and across all model sizes, from the 1.7B dense model to the 235B MoE.
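A minimal way to quantify this asymmetry is the drop in success rate between intact and intervened experience. The function and the numbers below are illustrative, not values taken from the paper:

```python
def faithfulness_drop(success_intact: float, success_intervened: float) -> float:
    """Drop in success rate, in percentage points, caused by an intervention.
    A large drop means the agent causally depends on that experience channel;
    a near-zero drop means the channel is effectively ignored."""
    return 100.0 * (success_intact - success_intervened)

# Illustrative numbers matching the qualitative pattern reported above:
raw_drop = faithfulness_drop(0.62, 0.38)        # ~24 pp: raw experience matters
condensed_drop = faithfulness_drop(0.62, 0.61)  # ~1 pp: condensed is ignored
```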

The authors then investigate why condensed experience is largely ignored. Three intertwined causes are identified: (1) Semantic limitations – condensed summaries are often overly abstract or generic, lacking the specificity required to guide concrete actions; (2) Internal processing biases – the frozen LLM’s attention mechanisms prioritize the immediate task prompt and recent context over inserted summary blocks, effectively suppressing the influence of external condensed knowledge; (3) Task regime effects – in knowledge‑intensive tasks the pretrained model’s internal knowledge is already sufficient, reducing the marginal utility of any external experience. Notably, in tasks where pretrained priors dominate (e.g., GPQA‑Diamond, MMLU‑Pro Eng.), sensitivity to raw and condensed experience becomes more balanced because neither channel contributes much; the raw‑versus‑condensed asymmetry is therefore most pronounced in regimes where external experience is genuinely needed.

The paper contributes a rigorous causal‑intervention framework for measuring experience faithfulness, a comprehensive empirical map of how current self‑evolving agents treat raw versus condensed experience, and a diagnostic of the underlying failure modes. The findings challenge the prevailing assumption that simply storing and retrieving past experience guarantees its effective use. They also highlight a critical design gap: existing agents lack mechanisms to reliably integrate high‑level distilled knowledge.

Future work, as suggested by the authors, should explore richer condensed representations (e.g., more detailed procedural templates), architectural or prompting modifications that elevate the attention given to summary blocks, and explicit reward or loss terms that penalize ignoring provided experience. Such directions could close the faithfulness gap and enable truly self‑improving LLM agents that make full use of both concrete trajectories and abstract lessons learned from past interactions.

