A Continuous Streaming Benchmark and Framework for Evolving Memory


📝 Abstract

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.
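The abstract names ExpRAG only at a high level: retrieve prior experience relevant to the current task and feed it to the model. As a rough illustration of that retrieve-then-prompt pattern, the sketch below stores (task, experience) pairs and ranks them by similarity for a new task. The `ExperienceStore` class and the bag-of-words cosine similarity are illustrative assumptions standing in for the paper's actual retriever and embedding model, not its implementation.

```python
from collections import Counter
import math


class ExperienceStore:
    """Minimal experience memory: stores (task, experience) pairs and
    retrieves the most similar past experiences for a new task."""

    def __init__(self):
        self.experiences = []  # list of (task_text, experience_text)

    def add(self, task, experience):
        self.experiences.append((task, experience))

    @staticmethod
    def _similarity(a, b):
        # Bag-of-words cosine similarity stands in for a learned
        # embedding model here.
        ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
        na = math.sqrt(sum(v * v for v in ca.values()))
        nb = math.sqrt(sum(v * v for v in cb.values()))
        return dot / (na * nb) if na and nb else 0.0

    def retrieve(self, task, k=2):
        ranked = sorted(self.experiences,
                        key=lambda e: self._similarity(task, e[0]),
                        reverse=True)
        return ranked[:k]


def build_prompt(task, store, k=2):
    """Prepend the retrieved experiences to the new task's prompt."""
    context = "\n".join(f"- Past task: {t}\n  Lesson: {x}"
                        for t, x in store.retrieve(task, k))
    return f"Relevant past experience:\n{context}\n\nNew task: {task}"
```

With two stored experiences (a quadratic equation and an embodied placement task), a new quadratic query retrieves the equation-solving lesson, mirroring the reuse pattern in Figure 2.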

📄 Content

2025-11-27 · Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Tianxin Wei†,1, Noveen Sachdeva2, Benjamin Coleman2, Zhankui He2, Yuanchen Bei1, Xuying Ning1, Mengting Ai1, Yunzhe Li†,1, Jingrui He1, Ed H. Chi2, Chi Wang2, Shuo Chen2, Fernando Pereira2, Wang-Cheng Kang2 and Derek Zhiyuan Cheng2

†Work done while at Google DeepMind, 1University of Illinois Urbana-Champaign, 2Google DeepMind
Keywords: LLMs, Agentic Memory, Test-time Learning, Self-evolving Agents, Lifelong Intelligence

1. Introduction

Large Language Models (LLMs) have rapidly evolved from simple chatbots into capable systems that can write code, control browsers, and perform advanced question answering (Comanici et al., 2025). These advances have been driven by improvements in inference, planning, and tool use, as shown by benchmarks emphasizing logical reasoning and multi-step actions. Yet a fundamental capability, memory, remains largely underexplored. Memory allows LLMs to maintain state across interactions, accumulate experience, and adapt strategies over time. Recent studies have introduced memory modules that track dialogue histories through compression, indexing, or retrieval (Maharana et al., 2024b), improving conversational recall and personalization. However, most of these systems only reuse static dialogue context rather than learning from experience to improve future reasoning or decision-making. Despite these advances, existing LLM memory systems remain largely static, retrieving information passively rather than evolving through use. Current evaluations test whether models can recall past context but rarely assess their ability to reuse experience. In essence, agents remember what was said but not what was learned. Conversational recall retrieves prior facts, whereas experience reuse abstracts reasoning strategies for future tasks. Without such reuse, models repeatedly solve similar problems from scratch, as long-term assistants often recall context yet fail to adapt across sessions.

Corresponding author(s): twei10@illinois.edu. © 2025 Google DeepMind. All rights reserved. arXiv:2511.20857v1 [cs.CL] 25 Nov 2025

Figure 1 | Conversational recall retrieves past facts (e.g., the solutions to 2x² + 3x − 1 = 0), whereas experience reuse recalls reasoning strategies (e.g., applying the quadratic formula).

Figure 2 | Illustration of different task types and experience reuse. A stateful agent encounters both multi-turn tasks (e.g., embodied manipulation) and single-turn tasks (e.g., solving equations), and should learn reusable experience from past interactions.

Several recent benchmarks have begun examining static adaptation but remain limited in scope. StreamBench (Wu et al., 2024a) evaluates
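ReMem, the action–think–memory refine pipeline proposed in the abstract, is only named in this excerpt. One hypothetical way to picture such a loop is sketched below: within an episode the agent alternates thinking and acting, and after the episode it refines memory so later tasks can reuse what was learned. Here `think`, `act`, and `refine` are caller-supplied stand-ins for LLM calls, and the control flow is an assumption rather than the paper's specification.

```python
def remem_episode(task, memory, think, act, refine, max_steps=5):
    """Run one ReMem-style episode on a single task.

    The three callables stand in for LLM calls:
      think(task, memory, trace)  -> a reasoning string
      act(task, thought, trace)   -> (action, done) pair
      refine(memory, task, trace) -> the updated memory
    """
    trace = []
    for _ in range(max_steps):
        thought = think(task, memory, trace)      # reason over task + memory
        action, done = act(task, thought, trace)  # take a task action
        trace.append((thought, action))
        if done:
            break
    # Refine memory from the full episode so future tasks can reuse it,
    # rather than merely appending the raw dialogue.
    memory = refine(memory, task, trace)
    return trace, memory
```

Running this loop over a stream of tasks, with `memory` threaded from one episode to the next, matches the benchmark's test-time evolution setting: retrieve, integrate, and update memory continuously during deployment.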

