Evaluating Memory Structure in LLM Agents
Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly difficult to analyze their capabilities and guide future memory designs. Most long-term memory benchmarks focus on simple fact retention, multi-hop recall, and time-based changes. While undoubtedly important, these capabilities can often be achieved with simple retrieval-augmented LLMs and do not exercise complex memory hierarchies. To bridge this gap, we propose StructMemEval, a benchmark that tests an agent's ability to organize its long-term memory, not just its factual recall. We gather a suite of tasks that humans solve by organizing their knowledge into a specific structure: transaction ledgers, to-do lists, trees, and others. Our initial experiments show that simple retrieval-augmented LLMs struggle with these tasks, whereas memory agents can solve them reliably when prompted on how to organize their memory. However, we also find that modern LLMs do not always recognize the appropriate memory structure when not prompted to do so. This highlights an important direction for future improvements in both LLM training and memory frameworks.
💡 Research Summary
The paper introduces StructMemEval, a benchmark specifically designed to assess how well large‑language‑model (LLM) agents can organize their long‑term memory rather than merely retrieve stored facts. Existing memory benchmarks such as LOCOMO and LongMemEval focus on fact recall, multi‑hop retrieval, and time‑based updates—tasks that can often be solved with simple retrieval‑augmented LLMs and do not stress the hierarchical or structural capabilities of more sophisticated memory systems. To fill this gap, the authors collect a suite of tasks that humans naturally solve by arranging information into concrete structures: hierarchical trees (family or corporate hierarchies), state‑tracking ledgers (neighbor relationships that change over time), accounting‑style transaction ledgers (netting circular debts), and counting‑based data curation scenarios.
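To make the transaction-ledger task type concrete, here is a minimal sketch of the "netting circular debts" idea the benchmark draws on. The function and names are illustrative, not the paper's implementation: an agent that maintains per-person net balances can answer settlement questions in constant space, while circular debts cancel out entirely.

```python
from collections import defaultdict

def net_balances(transactions):
    """Reduce (debtor, creditor, amount) entries to per-person net positions.

    An organized ledger memory only needs these net positions to answer
    settlement questions; replaying raw messages grows with the number
    of transactions.
    """
    balance = defaultdict(float)
    for debtor, creditor, amount in transactions:
        balance[debtor] -= amount   # debtor owes more
        balance[creditor] += amount  # creditor is owed more
    return dict(balance)

# A circular debt nets out to zero for every participant:
tx = [("alice", "bob", 10), ("bob", "carol", 10), ("carol", "alice", 10)]
print(net_balances(tx))  # {'alice': 0.0, 'bob': 0.0, 'carol': 0.0}
```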
Dataset construction
- 73 distinct conversational scenarios, accompanied by 544 evaluation questions in total.
- Scenarios are synthetic but grounded in realistic domains (accounting, personal assistants, knowledge curation) to avoid privacy concerns.
- For each scenario an optional “memory organization hint” is provided, describing how a human would structure the information (e.g., “store the relations as a tree”).
- Evaluation is performed in two modes: with hints (to diagnose whether failures stem from poor organization) and without hints (the primary setting).
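The scenario-plus-optional-hint design can be sketched as a small data structure with the two evaluation modes toggled at prompt-construction time. All field names and the prompt wording here are assumptions for illustration, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scenario:
    # Illustrative schema, not the benchmark's actual format.
    messages: list[str]               # the conversational stream fed to the agent
    questions: list[tuple[str, str]]  # (question, reference answer) pairs
    hint: Optional[str] = None        # e.g. "store the relations as a tree"

def build_preamble(scenario: Scenario, with_hint: bool) -> str:
    """Assemble the system preamble for one of the two evaluation modes."""
    preamble = "Maintain a long-term memory of the following conversation."
    if with_hint and scenario.hint:
        preamble += f" Hint: {scenario.hint}"
    return preamble
```

Running the same scenario in both modes isolates whether a failure comes from not knowing the right structure (fixed by the hint) or from being unable to maintain it at all.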
Memory systems compared
- Retrieval‑augmented LLM using OpenAI embeddings (text‑embedding‑3‑large) with top‑k retrieval budgets (15 for tree tasks, 10 for counting, 5 for state tracking).
- Mem‑agent – a markdown‑based note‑taking memory that can create interlinked notes.
- Mem0 – a graph‑oriented memory framework.
All agents use Gemini‑2.5‑pro or Gemini‑3‑pro as the underlying LLM; additional backbones (Gemini‑flash, GPT‑4o‑mini) are reported in the appendix.
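The retrieval baseline's per-task top-k budgets can be sketched as follows. The budgets come from the paper; everything else (function names, the stubbed vectors) is illustrative, and in the benchmark the embeddings would come from OpenAI's text-embedding-3-large rather than being supplied by hand:

```python
import math

# Per-task retrieval budgets reported for the baseline.
TOP_K = {"tree": 15, "counting": 10, "state_tracking": 5}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, memory, task):
    """Return the top-k stored texts ranked by similarity to the query.

    `memory` is a list of (text, vector) pairs. Anything relevant that
    falls outside the top-k budget is invisible to the agent, which is
    why performance collapses once instances outgrow the budget.
    """
    ranked = sorted(memory, key=lambda e: cosine(query_vec, e[1]), reverse=True)
    return [text for text, _ in ranked[:TOP_K[task]]]
```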
Key findings
- Retrieval‑only agents solve very small instances but performance collapses as the number of edges, state transitions, or transactions grows beyond the retrieval budget.
- Both Mem‑agent and Mem0 dramatically outperform retrieval when a hint is supplied, achieving near‑perfect accuracy across all difficulty levels.
- Without hints, memory agents still beat retrieval by 20‑30 absolute percentage points, yet the gap between the "hint" and "no‑hint" conditions can exceed 50 points for some tasks. This indicates that current LLMs do not spontaneously infer the appropriate memory schema.
- Failure analysis reveals two dominant error modes: (i) lack of organization – the model stores raw messages without structuring them, leading to incomplete or ambiguous answers; (ii) hallucinated memories – especially after hundreds of consecutive updates, the model invents spurious entries that corrupt downstream reasoning.
- Even state‑of‑the‑art Gemini models, which can execute algorithmic reasoning (trees, heaps, state machines) in code, fail to apply the same algorithmic knowledge to their own external memory unless explicitly prompted.
Implications and future work
The study argues that memory organization should be treated as a first‑class capability in LLM agent evaluation. It suggests (a) training or fine‑tuning LLMs on datasets that pair tasks with explicit memory‑structuring instructions, (b) designing memory frameworks that can infer or enforce a suitable schema without external hints, and (c) expanding the benchmark to cover additional structures such as ordered to‑do lists, DAGs, assignment maps, and multi‑structure coordination.
Overall, StructMemEval provides a concrete, implementation‑agnostic yardstick for measuring whether an LLM agent can shape its long‑term knowledge into useful structures, revealing a significant gap between current retrieval‑centric approaches and the more ambitious goal of truly organized, agentic memory.