Live-Evo: Online Evolution of Agentic Memory from Continuous Feedback

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Large language model (LLM) agents are increasingly equipped with memory: stored experiences and reusable guidance that can improve task-solving performance. Recent *self-evolving* systems update memory based on interaction outcomes, but most existing evolution pipelines are developed for static train/test splits and only approximate online learning by folding static benchmarks, making them brittle under true distribution shift and continuous feedback. We introduce Live-Evo, an online self-evolving memory system that learns from a stream of incoming data over time. Live-Evo decouples *what happened* from *how to use it* via an Experience Bank and a Meta-Guideline Bank, compiling task-adaptive guidelines from retrieved experiences for each task. To manage memory online, Live-Evo maintains experience weights and updates them from feedback: experiences that consistently help are reinforced and retrieved more often, while misleading or stale experiences are down-weighted and gradually forgotten, analogous to reinforcement and decay in human memory. On the live *Prophet Arena* benchmark over a 10-week horizon, Live-Evo improves Brier score by 20.8% and increases market returns by 12.9%, while also transferring to deep-research benchmarks with consistent gains over strong baselines. Our code is available at https://github.com/ag2ai/Live-Evo.


💡 Research Summary

Live-Evo tackles a fundamental limitation of current self‑evolving LLM agents: the reliance on static train‑test splits and the inability to truly adapt to a continuous stream of tasks and feedback. The authors propose a dual‑bank architecture—an Experience Bank (E) that stores past interactions as structured question‑experience pairs with associated weights, and a Meta‑Guideline Bank (M) that holds higher‑level instructions on how to transform those experiences into task‑specific guidance. By separating "what happened" from "how to use it," the system can evolve its memory usage policies over time.
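The dual-bank separation can be sketched with two simple record types. This is a hypothetical illustration of the architecture described above; the field names (`question`, `outcome`, `weight`, `instruction`) are assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the two banks; field names are illustrative assumptions.

@dataclass
class Experience:
    question: str        # the task that produced this entry
    outcome: str         # "what happened": structured experience text
    weight: float = 1.0  # dynamic usefulness weight, updated from feedback

@dataclass
class MetaGuideline:
    instruction: str     # "how to use it": rule for compiling experiences

experience_bank: list[Experience] = []       # E
meta_guideline_bank: list[MetaGuideline] = []  # M

experience_bank.append(
    Experience("Will candidate X win?", "Polling averages lagged late market moves")
)
meta_guideline_bank.append(
    MetaGuideline("Prefer recent evidence when sources conflict")
)
```

Keeping weights on experiences rather than on guidelines is what lets the system later reinforce or decay individual memories independently of how they are composed.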

For each incoming task, Live-Evo follows a four‑stage loop:

  1. Retrieve – The agent generates search queries from the task and retrieves the top‑k experiences and a relevant meta‑guideline. Retrieval scores combine semantic similarity with the dynamic weight of each experience, ensuring that historically useful experiences are prioritized.

  2. Compile – Using the retrieved experiences and the selected meta‑guideline, a large language model synthesizes a concise, task‑adapted guideline. This step implements meta‑cognitive compilation: extracting regularities across experiences, grounding them in the current context, and instantiating an executable plan.

  3. Act – The agent solves the task twice: once with the compiled guideline (memory‑on) and once without any memory (memory‑off). The contrastive evaluation yields a performance gain Δr = r_on – r_off, providing a direct causal estimate of the memory’s contribution.

  4. Update – Δr drives reinforcement‑and‑decay updates to the weights of the experiences used in the current task. Positive gains increase weights, making those experiences more likely to be retrieved later; negative or zero gains decrease weights, gradually “forgetting” stale or misleading entries. If the guideline harms performance, a new meta‑guideline is generated and added to M. After processing a batch, the system identifies the worst‑performing fraction of tasks, summarizes their memory‑on trajectories, re‑evaluates the summarized experience, and commits it to E only if it yields a statistically significant improvement. This selective write‑back controls memory growth while ensuring that new entries are justified by measurable benefit.
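The retrieve and update stages above can be sketched in a few lines. This is a minimal sketch under stated assumptions: a multiplicative `similarity × weight` retrieval score and an additive reinforcement with fixed decay. The paper's exact scoring and update formulas may differ.

```python
# Sketch of the retrieve/act/update stages; the scoring rule and the
# learning-rate/decay constants here are illustrative assumptions.

def retrieve(experiences, similarities, k=2):
    """Rank experiences by semantic similarity combined with dynamic weight."""
    scored = sorted(
        zip(experiences, similarities),
        key=lambda pair: pair[1] * pair[0]["weight"],
        reverse=True,
    )
    return [exp for exp, _ in scored[:k]]

def update_weights(used, delta_r, lr=0.1, decay=0.05):
    """Reinforce experiences that helped (delta_r > 0); decay the rest."""
    for exp in used:
        if delta_r > 0:
            exp["weight"] += lr * delta_r
        else:
            exp["weight"] = max(0.0, exp["weight"] - decay)

bank = [
    {"id": "a", "weight": 1.0},
    {"id": "b", "weight": 0.2},
    {"id": "c", "weight": 0.9},
]
used = retrieve(bank, similarities=[0.5, 0.9, 0.6], k=2)

# Contrastive evaluation: reward with memory minus reward without it.
r_on, r_off = 0.8, 0.6
delta_r = r_on - r_off
update_weights(used, delta_r)
```

Note that experience "b" has the highest raw similarity (0.9) but a low weight, so it is outranked by "a" and "c"—exactly the behavior the weighting scheme is meant to produce: historically useful experiences win over merely similar ones.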

The authors draw an explicit analogy to human memory: repeated successful use strengthens a memory trace, whereas disuse or error leads to decay. This dynamic is especially valuable in domains with non‑stationary distributions, such as financial markets.

Experiments are conducted on two fronts. The primary benchmark, Prophet Arena, provides 500 future‑prediction tasks over a ten‑week horizon, each requiring probability forecasts and offering both Brier score and market‑return metrics. Live-Evo, built on GPT‑4.1‑mini, improves the Brier score by 20.8% and boosts market returns by 12.9% compared to strong static baselines. The second evaluation uses Xbench‑DeepResearch, a traditional deep‑research benchmark, where Live-Evo also outperforms state‑of‑the‑art methods, confirming its generality beyond time‑series prediction.
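For readers unfamiliar with the primary metric: the Brier score is the mean squared error between forecast probabilities and realized binary outcomes, so lower is better, and a 20.8% improvement means the score dropped by that fraction. A minimal sketch with illustrative numbers (not taken from the paper):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecaster, 0.25 is an uninformed
    coin-flip forecaster on binary events.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Illustrative numbers only: a sharper forecaster scores lower.
uninformed = brier_score([0.5, 0.5, 0.5], [1, 0, 1])  # 0.25
sharper = brier_score([0.7, 0.3, 0.8], [1, 0, 1])
```
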

Ablation studies systematically remove each component (experience weight updates, meta‑guidelines, contrastive evaluation, selective write‑back). Every ablation leads to a noticeable drop in performance, underscoring that the four mechanisms are jointly necessary. Notably, without meta‑guidelines the system reverts to naïve concatenation of experiences, reducing Brier‑score gains to under 12%.

In summary, Live-Evo introduces a principled, feedback‑driven memory evolution framework for LLM agents operating in truly online settings. By maintaining weighted experiences, learning how to compose them via meta‑guidelines, and continuously measuring their causal impact, the system achieves robust adaptation under distribution shift. The work opens avenues for deploying self‑evolving agents in real‑time applications such as finance, robotics, and interactive assistants, where continual learning and memory management are critical.

