Structured Episodic Event Memory

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original ArXiv source.

Current approaches to memory in Large Language Models (LLMs) predominantly rely on static Retrieval-Augmented Generation (RAG), which often results in scattered retrieval and fails to capture the structural dependencies required for complex reasoning. For autonomous agents, these passive and flat architectures lack the cognitive organization necessary to model the dynamic and associative nature of long-term interaction. To address this, we propose Structured Episodic Event Memory (SEEM), a hierarchical framework that synergizes a graph memory layer for relational facts with a dynamic episodic memory layer for narrative progression. Grounded in cognitive frame theory, SEEM transforms interaction streams into structured Episodic Event Frames (EEFs) anchored by precise provenance pointers. Furthermore, we introduce an agentic associative fusion and Reverse Provenance Expansion (RPE) mechanism to reconstruct coherent narrative contexts from fragmented evidence. Experimental results on the LoCoMo and LongMemEval benchmarks demonstrate that SEEM significantly outperforms baselines, enabling agents to maintain superior narrative coherence and logical consistency.


💡 Research Summary

The paper introduces Structured Episodic Event Memory (SEEM), a hierarchical memory architecture designed to overcome the long‑term memory limitations of large language model (LLM) agents. Conventional Retrieval‑Augmented Generation (RAG) approaches rely on flat vector similarity search, which often yields scattered retrievals and fails to capture multi‑hop dependencies or temporal/causal relations needed for complex reasoning. Recent graph‑based extensions such as GraphRAG, RAPTOR, and Mem0 improve factual organization but still bind all information to a single static schema, limiting dynamic re‑structuring as new knowledge arrives.

SEEM addresses these gaps by coupling two complementary layers:

  1. Episodic Memory Layer (EML) – each dialogue turn is transformed into an Episodic Event Frame (EEF). An EEF follows Fillmore’s frame semantics and contains a high‑level summary plus six semantic roles: participants, action, time, location, causality, and manner. Extraction is performed by an LLM‑driven module (F_ext). Crucially, every EEF stores a provenance pointer (ρ_eml) linking it back to the original passage, preserving traceability.

  2. Graph Memory Layer (GML) – static factual statements are extracted as quadruples (subject, relation, object, temporal validity) and inserted into a schema‑agnostic knowledge graph. Each node/edge also carries a provenance pointer (ρ_gml) to its source text. Entity merging based on vector similarity unifies lexical variants, keeping the graph compact yet expressive.
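The two record types above can be sketched as plain data structures. This is a minimal illustration, not the paper's exact schema; the field and class names (`EpisodicEventFrame`, `FactQuadruple`, `provenance`) are assumptions chosen to mirror the description.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicEventFrame:
    """One EEF extracted from a dialogue turn (Episodic Memory Layer).

    Holds a high-level summary plus the six semantic roles from
    Fillmore's frame semantics, and rho_eml provenance pointers back
    to the source passages.
    """
    summary: str
    participants: list[str]
    action: str
    time: str
    location: str
    causality: str
    manner: str
    provenance: list[str] = field(default_factory=list)  # rho_eml: source passage IDs

@dataclass
class FactQuadruple:
    """One static fact stored in the Graph Memory Layer."""
    subject: str
    relation: str
    obj: str
    temporal_validity: str
    provenance: str = ""  # rho_gml: source passage ID
```

Keeping the provenance pointer on every record is what later enables Reverse Provenance Expansion: any retrieved fact or frame can be traced back to the exact text it came from.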

The core integration mechanism is Reverse Provenance Expansion (RPE). At inference time a query is first mapped to relational patterns, which are used to retrieve the most relevant facts (K_top) from the GML. A graph propagation step then yields a set of source passages (P_ret). Because a single event may be split across many turns, SEEM next looks up the EEFs attached to P_ret, gathers their provenance pointers, and expands the evidence set to include all passages that contributed to those frames. The final context (P_final) therefore contains every textual fragment that belongs to the activated events, eliminating the “scattered retrieval” problem.
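The expansion step can be sketched as a two-hop lookup over the provenance indices. This is a simplified sketch assuming dict-based in-memory indices; the retrieval scoring and graph propagation that produce `k_top` are stubbed out, and all variable names are illustrative.

```python
def reverse_provenance_expansion(k_top, passage_to_eefs, eef_index):
    """Expand retrieved facts into the full evidence set P_final.

    k_top           -- facts retrieved from the GML, each carrying a
                       provenance pointer to one source passage
    passage_to_eefs -- maps a passage ID to the EEF IDs anchored to it
    eef_index       -- maps an EEF ID to all passage IDs in its
                       rho_eml provenance list
    """
    # Step 1: graph propagation yields the directly supporting passages P_ret.
    p_ret = {fact["provenance"] for fact in k_top}

    # Step 2: look up the EEFs attached to those passages, then follow each
    # frame's provenance pointers back to every passage that contributed to
    # it, so no fragment of an activated event is left behind.
    p_final = set(p_ret)
    for pid in p_ret:
        for eef_id in passage_to_eefs.get(pid, []):
            p_final.update(eef_index[eef_id])
    return p_final
```

For example, if a retrieved fact points only at passage `p1`, but `p1` belongs to an event frame whose provenance also spans `p2` and `p3`, the expansion pulls in all three passages.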

The final reasoning context C is a serialized mixture of (i) expanded passages, (ii) the corresponding EEF structures, and (iii) the selected relational facts. A downstream LLM decoder conditions on (query, C) to generate the answer, allowing simultaneous reasoning over high‑level graph semantics and low‑level episodic details.
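The serialization of C might look like the following. The section markers and layout here are purely illustrative assumptions; the paper does not specify a concrete prompt format.

```python
def build_context(passages, eefs, facts):
    """Serialize (i) expanded passages, (ii) EEF summaries, and
    (iii) relational facts into one reasoning context C."""
    parts = ["[Passages]"]
    parts += passages
    parts.append("[Events]")
    parts += [f"- {e}" for e in eefs]
    parts.append("[Facts]")
    # Each fact is a (subject, relation, object, temporal_validity) quadruple.
    parts += [f"- {s} --{r}--> {o} ({t})" for (s, r, o, t) in facts]
    return "\n".join(parts)
```

The downstream decoder then receives the pair (query, C), letting it reason jointly over graph-level facts and turn-level episodic detail.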

Experiments were conducted on two demanding long‑conversation benchmarks:

  • LoCoMo – up to 16k tokens per dialogue, 32 sessions, 1,986 QA items covering single‑hop, multi‑hop, temporal, open‑domain, and adversarial reasoning.
  • LongMemEval – 500 curated questions probing information extraction, multi‑session synthesis, timeline reasoning, and knowledge updates.

Metrics included token‑level F1, BLEU‑1, and LLM‑as‑Judge factual consistency scores. SEEM outperformed strong baselines (standard RAG, GraphRAG, HippoRAG2, Mem0) by 4–5% absolute on most metrics. Notably, temporal ordering and multi‑hop chain reasoning errors dropped by over 30% relative to the best prior system. Ablation studies confirmed that both the EEF fusion and the RPE expansion contribute substantially to the gains.
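For reference, token‑level F1 is the standard bag‑of‑tokens overlap measure used by QA benchmarks; a common formulation (assuming whitespace tokenization and lowercasing, which may differ in detail from the benchmarks' official scorers) is:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over the
    multiset of overlapping tokens between prediction and reference."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```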

Limitations are acknowledged: EEF extraction depends on the quality of the underlying LLM, so parsing errors can propagate; graph construction still hinges on accurate relation extraction; the current provenance system is text‑only, leaving multimodal evidence (images, audio) unsupported; and scalability of provenance pointers may require additional indexing tricks for real‑time agents.

Future work plans to (1) introduce automated frame validation and error correction, (2) extend provenance to multimodal evidence such as images and audio, and (3) develop memory compression and forgetting policies to keep the system lightweight for continual deployment.

In summary, SEEM demonstrates that a dual‑layer memory—static relational graph plus dynamic episodic frames—combined with reverse provenance expansion can fundamentally solve the scattered retrieval issue of flat RAG systems. By preserving fine‑grained episodic context while still leveraging structured factual knowledge, SEEM enables LLM agents to maintain narrative coherence and logical consistency over very long interaction histories, marking a significant step toward truly long‑term, memory‑augmented AI assistants.

