ENGRAM: Effective, Lightweight Memory Orchestration for Conversational Agents
Large language models (LLMs) deployed in user-facing applications require long-horizon consistency: the ability to remember prior interactions, respect user preferences, and ground reasoning in past events. However, contemporary memory systems often adopt complex architectures such as knowledge graphs, multi-stage retrieval pipelines, and OS-style schedulers, which introduce engineering complexity and reproducibility challenges. We present ENGRAM, a lightweight memory system that organizes conversation into three canonical memory types (episodic, semantic, and procedural) through a single router and retriever. Each user turn is converted into typed memory records with normalized schemas and dense embeddings, then stored in a database. At query time, the system retrieves top-k dense neighbors for each type, merges results with simple set operations, and provides the most relevant evidence as context to the model. ENGRAM attains state-of-the-art results on LoCoMo, a multi-session conversational QA benchmark for long-horizon memory, and exceeds the full-context baseline by 15 points on LongMemEval while using only about 1% of the tokens. These results show that careful memory typing and straightforward dense retrieval can enable effective long-term memory management in language models without requiring complex architectures.
💡 Research Summary
The paper introduces ENGRAM, a lightweight external memory architecture designed to give large language models (LLMs) long‑horizon consistency in conversational agents without the engineering overhead of complex graph‑based or multi‑stage retrieval systems. ENGRAM’s central premise is to mirror human memory by separating stored information into three canonical types—episodic (time‑ordered events), semantic (stable facts or user preferences), and procedural (instructions or workflows). A single router examines each incoming user turn and emits a three‑bit mask indicating which of the three stores should receive the turn. For each selected type, a lightweight extractor converts the turn into a normalized JSON record, obtains a dense embedding via a shared encoder, and persists the record together with its embedding in a local SQLite database.
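The write path described above can be sketched in a few lines. Note this is an illustrative reconstruction, not the authors' code: the keyword-based router stands in for ENGRAM's actual routing model, and the table schema, `store_turn` helper, and `embed` callback are all assumptions made for the example.

```python
import json
import sqlite3

MEMORY_TYPES = ("episodic", "semantic", "procedural")

def route(turn: str) -> tuple[int, int, int]:
    """Stand-in for ENGRAM's router: emit a three-bit mask over
    (episodic, semantic, procedural). Trivial keyword heuristics
    are used here purely for illustration."""
    text = turn.lower()
    semantic = int(any(w in text for w in ("prefer", "always", "favorite")))
    procedural = int(any(w in text for w in ("how to", "step", "first,")))
    episodic = 1  # assume most turns record at least an event
    return episodic, semantic, procedural

def store_turn(db: sqlite3.Connection, turn: str, timestamp: str, embed) -> int:
    """Convert one user turn into typed JSON records and persist each
    record with its embedding; returns the number of records written."""
    mask = route(turn)
    written = 0
    for mem_type, selected in zip(MEMORY_TYPES, mask):
        if not selected:
            continue
        record = {"type": mem_type, "text": turn, "timestamp": timestamp}
        vec = embed(turn)  # one shared encoder for all three types
        db.execute(
            "INSERT INTO memories (type, record, embedding) VALUES (?, ?, ?)",
            (mem_type, json.dumps(record), json.dumps(vec)),
        )
        written += 1
    db.commit()
    return written
```

A turn such as "I prefer window seats" would be routed to both the episodic and semantic stores under these toy heuristics, producing two typed records sharing one embedding.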
At query time, the user’s question is embedded with the same encoder. For each memory type, ENGRAM computes cosine similarity between the query embedding and all stored embeddings, then selects the top‑k (k=20 in the experiments) nearest neighbors. The three per‑type result sets are merged, deduplicated, and truncated to a fixed budget of K=25 snippets. These snippets are serialized with a timestamp prefix, combined with the original question using a deterministic, non‑learned template, and fed to the LLM as context. The final answer is generated by the LLM (GPT‑4o‑mini in the reported experiments).
The authors evaluate ENGRAM on two long‑horizon conversational benchmarks. LoCoMo consists of ten multi‑session dialogues (≈600 turns each) covering single‑hop, multi‑hop, open‑domain, and temporal reasoning questions. LongMemEval embeds 500 QA pairs in very long chat histories (≈115 K tokens per problem) and tests information extraction, multi‑session reasoning, temporal reasoning, knowledge updates, and abstention. Baselines include a variety of existing memory systems (Mem0, MemOS, LangMem, Zep), a retrieval‑augmented generation (RAG) approach, and a full‑context control that feeds the entire conversation to the LLM.
Results show that ENGRAM achieves the highest LLM‑as‑Judge semantic correctness scores on LoCoMo (overall 77.55 %), outperforming all baselines across most categories, especially multi‑hop (79.79 %) and open‑domain (72.92 %). Importantly, ENGRAM does so while using only about 916 tokens of evidence on average—roughly a 35 % reduction compared with other systems that consume 1.5–4 K tokens. On LongMemEval, ENGRAM exceeds the full‑context baseline by 15 absolute points despite using only ~1 % of the tokens, demonstrating strong horizon generalization.
The paper emphasizes that ENGRAM’s simplicity—single router, single dense retriever, SQLite persistence—makes it easy to implement, debug, and reproduce. Typed records provide interpretability and allow straightforward ablations of the routing logic. The authors acknowledge limitations: using a single encoder for all types may miss type‑specific representation nuances, and SQLite may not scale to production‑level workloads. Future work is suggested on type‑specific encoders, distributed vector stores, and reinforcement‑learning‑based routing policies.
In summary, ENGRAM proves that careful memory typing combined with straightforward dense retrieval can deliver state‑of‑the‑art long‑term memory performance for LLM‑driven conversational agents, challenging the prevailing trend toward increasingly elaborate memory architectures.