SimpleMem: Efficient Lifelong Memory for LLM Agents

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original ArXiv source.

To support long-term interaction in complex environments, LLM agents require memory systems that manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which distills unstructured interactions into compact, multi-view indexed memory units; (2) Online Semantic Synthesis, an intra-session process that instantly integrates related context into unified abstract representations to eliminate redundancy; and (3) Intent-Aware Retrieval Planning, which infers search intent to dynamically determine retrieval scope and construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% in LoCoMo while reducing inference-time token consumption by up to 30-fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming-lab/SimpleMem.


💡 Research Summary

SimpleMem tackles the pressing problem of memory management for large‑language‑model (LLM) agents that must operate over long‑horizon, multi‑turn interactions. Existing solutions fall into two unsatisfactory categories: (1) naïvely extending the full conversation context, which quickly exhausts the fixed token window with redundant, low‑information content; and (2) iterative filtering approaches that repeatedly invoke the LLM to prune noise, incurring prohibitive inference latency and token costs. SimpleMem proposes a unified “semantic lossless compression” framework that simultaneously reduces redundancy, preserves downstream utility, and keeps token usage low.

The system is built around a three‑stage pipeline.

  1. Semantic Structured Compression splits incoming dialogue into overlapping sliding windows (size 20). Instead of a separate classifier, the LLM itself acts as a semantic density gate: it generates either an empty set (discarding low‑entropy windows) or a set of compact “memory units” that have already undergone coreference resolution, absolute timestamp conversion, and atomization into self‑contained factual statements. Each unit is indexed in three complementary views – a dense semantic embedding (for fuzzy matching), a sparse lexical inverted index (for exact keyword matches), and symbolic metadata (timestamps, entity types). This multi‑view indexing enables flexible retrieval later on.
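The multi-view indexing of memory units can be sketched in a few lines. This is a minimal toy, not the paper's implementation: it keeps only the sparse lexical inverted index and the symbolic metadata view (a real system would add a dense embedding index, e.g. via FAISS or a vector store), and the `MemoryUnit` fields are illustrative names, not the authors' schema.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryUnit:
    uid: int
    text: str                 # atomized, self-contained factual statement
    timestamp: str            # absolute timestamp after conversion
    entities: set = field(default_factory=set)

class MultiViewIndex:
    """Toy two-view index: sparse lexical + symbolic metadata.
    A dense semantic-embedding view would be the third component."""
    def __init__(self):
        self.units = {}
        self.inverted = {}    # token  -> set of unit ids (lexical view)
        self.by_entity = {}   # entity -> set of unit ids (symbolic view)

    def add(self, unit):
        self.units[unit.uid] = unit
        for tok in unit.text.lower().split():
            self.inverted.setdefault(tok, set()).add(unit.uid)
        for ent in unit.entities:
            self.by_entity.setdefault(ent, set()).add(unit.uid)

    def lexical(self, query):
        """Exact keyword match: union of postings for each query token."""
        hits = set()
        for tok in query.lower().split():
            hits |= self.inverted.get(tok, set())
        return hits

    def symbolic(self, entity):
        """Metadata lookup by entity."""
        return self.by_entity.get(entity, set())
```

The same unit is thus reachable by fuzzy meaning, exact keyword, or structured metadata, which is what makes the later retrieval planning stage able to issue view-specific sub-queries.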

  2. Online Semantic Synthesis performs on‑the‑fly consolidation during the write phase. As new units are created, the model examines the current session’s observations and merges semantically related fragments into higher‑level abstractions. For example, three separate statements about a user’s coffee preference are synthesized into a single concise entry. This real‑time synthesis prevents the memory from fragmenting into a massive bag of low‑density facts, keeping the memory topology compact and highly informative.
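The consolidation step can be illustrated with a greedy grouping sketch. In the paper the LLM itself judges which fragments are semantically related and writes the merged abstraction; here a simple `topic` key and a string-joining `merge_fn` stand in for both, purely for illustration.

```python
def synthesize(units, merge_fn=None):
    """Online synthesis sketch: group session units by topic and merge each
    multi-member group into a single abstract entry. The `topic` field and
    `merge_fn` are stand-ins for the LLM's relatedness judgment and
    abstraction step described in the paper."""
    merge_fn = merge_fn or (lambda texts: "; ".join(texts))
    groups = {}
    for u in units:
        groups.setdefault(u["topic"], []).append(u["text"])
    # Singleton groups pass through untouched; related fragments collapse.
    return {topic: (texts[0] if len(texts) == 1 else merge_fn(texts))
            for topic, texts in groups.items()}
```

With three coffee-preference statements and one unrelated fact, the memory shrinks from four fragments to two entries, matching the example in the text.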

  3. Intent‑Aware Retrieval Planning uses the LLM as a planner. Given a user query q and the interaction history H, the planner outputs a plan consisting of three view‑specific sub‑queries (q_sem, q_lex, q_sym) and an estimated retrieval depth d. The depth determines how many top‑k results to pull from each view (k scales with d, e.g., 3 for simple look‑ups, up to 20 for complex multi‑hop reasoning). Retrieval is executed in parallel across the semantic, lexical, and symbolic indexes, after which the three result sets are unioned with ID‑based deduplication to form the final context C_q. This adaptive strategy ensures that token budget is spent only on the most relevant information, avoiding both under‑retrieval (missed facts) and over‑retrieval (unnecessary tokens).

The authors evaluate SimpleMem on two demanding benchmarks: LoCoMo (long‑context conversational reasoning) and LongMemEval‑S (extreme‑length interaction histories). Experiments span a range of backbone models, from GPT‑4o and GPT‑4.1‑mini to various Qwen families (1.5B to 8B parameters). SimpleMem consistently outperforms strong baselines (ReadAgent, MemoryBank, MemGPT, A‑Mem, LightMem, Mem0). On LoCoMo with GPT‑4.1‑mini, SimpleMem achieves an average F1 of 43.24, a 26.4% improvement over the best prior method, while cutting inference‑time token consumption by up to 30×. Similar gains are observed on LongMemEval‑S. Additional analyses show robustness to adversarial distractors and confirm that each component (semantic gating, online synthesis, intent‑aware planning) contributes meaningfully to the overall improvement.

Key contributions are: (1) a semantic‑density gating mechanism that lets the LLM act as a lossless compressor; (2) a real‑time synthesis step that maintains a high‑density, low‑redundancy memory graph; (3) a planning‑driven, multi‑view retrieval process that dynamically adjusts scope based on inferred intent. Together these innovations deliver a superior trade‑off between accuracy and efficiency, establishing a new state‑of‑the‑art for lifelong memory in LLM agents. Future work may explore quantitative metrics for compression loss, extensions to multimodal memories (images, audio), and self‑supervised training of the compression/synthesis modules to further reduce reliance on external LLM calls.

