CAST: Character-and-Scene Episodic Memory for Agents
Episodic memory, a central component of human memory, is the ability to recall coherent events grounded in who, when, and where. However, most agent memory systems emphasize only semantic recall and store experience in key-value, vector, or graph structures, which makes it hard for them to represent and retrieve coherent events. To address this challenge, we propose CAST, a Character-and-Scene based memory architecture inspired by dramatic theory. Specifically, CAST constructs 3D scenes (time/place/topic) and organizes them into character profiles that summarize the events of a character to represent episodic memory. Moreover, CAST complements this episodic memory with a graph-based semantic memory, which yields a robust dual memory design. Experiments show that CAST improves over baselines by an average of 8.11% F1 and 10.21% J (LLM-as-a-Judge) across various datasets, especially on open and time-sensitive conversational questions.
💡 Research Summary
The paper introduces CAST (Character‑and‑Scene Episodic Memory for Agents), a novel memory architecture for large‑language‑model (LLM) agents that draws directly from cognitive and dramatic theory. Human memory separates semantic knowledge from episodic recollection, the latter being organized around “who, when, and where”. Existing agent memory designs—key‑value stores, dense vector stores, or graph‑based stores—treat experience as flat or loosely linked data, making it difficult to retrieve coherent, context‑rich episodes. CAST addresses this gap by building two complementary indices: an episodic index that captures 3‑dimensional scenes (time, place, topic) and a semantic index that stores a heterogeneous graph of extracted triples together with dense passage retrieval (DPR) capabilities.
Episodic Index Construction
The raw dialogue is first split into short windows called “views”. Each view is annotated with a timestamp, a location label, a topic (derived via a topic model), and the set of participants mentioned. Views that are close in all three dimensions are grouped by a greedy 3‑D clustering algorithm to form a “scene”. A scene is defined by thresholds Δ_t, Δ_ℓ, and Δ_τ on temporal, spatial, and topical distances, mirroring the classical unities of drama (unity of time, place, and action). For each scene, the system creates a summary vector ϕ(s) and records the participant set P(s). Within a scene, participants are labeled as Main Character (MC) or Supporting Character (SC) based on speaking turns and narrative importance.
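The greedy grouping of views into scenes can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `View`, `Scene`, and `greedy_scenes`, the 0/1 distance on location labels, the cosine distance on topic vectors, and the rule of comparing each view with the most recent view of every open scene are all assumptions made for illustration. Only the three thresholds (here `d_t`, `d_l`, `d_tau` for Δ_t, Δ_ℓ, Δ_τ) and the participant set P(s) come from the description above.

```python
import math
from dataclasses import dataclass, field


@dataclass(frozen=True)
class View:
    t: float                 # timestamp, in hours (assumed unit)
    place: str               # location label
    topic: tuple             # topic vector from some topic model (assumed form)
    participants: frozenset  # participants mentioned in the view


@dataclass
class Scene:
    views: list = field(default_factory=list)

    def participants(self):
        # P(s): union of the participant sets of all member views.
        return set().union(*(v.participants for v in self.views))


def topic_distance(a, b):
    # Cosine distance; an assumed choice of topical metric.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def place_distance(a, b):
    # 0/1 distance on labels; an assumed stand-in for a spatial metric.
    return 0.0 if a == b else 1.0


def greedy_scenes(views, d_t, d_l, d_tau):
    """Assign each view, in time order, to the first scene whose latest
    view lies within all three thresholds; otherwise open a new scene."""
    scenes = []
    for v in sorted(views, key=lambda x: x.t):
        for s in scenes:
            last = s.views[-1]
            if (v.t - last.t <= d_t
                    and place_distance(v.place, last.place) <= d_l
                    and topic_distance(v.topic, last.topic) <= d_tau):
                s.views.append(v)
                break
        else:
            # No existing scene was close enough in time, place, and topic.
            scenes.append(Scene([v]))
    return scenes


views = [
    View(0.0, "cafe", (1.0, 0.0), frozenset({"Ann", "Bob"})),
    View(0.5, "cafe", (0.9, 0.1), frozenset({"Ann"})),
    View(5.0, "office", (0.0, 1.0), frozenset({"Bob", "Cy"})),
]
scenes = greedy_scenes(views, d_t=1.0, d_l=0.0, d_tau=0.2)
```

With these toy inputs the first two views fall into one scene and the third opens a second one, because it exceeds the temporal threshold. A single pass of this kind is order-dependent, which is the usual trade-off of a greedy scheme against a global clustering method.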
Character Profiles
All scenes containing a given participant are ordered chronologically, producing a character profile π(c) =