InfMem: Learning System-2 Memory Control for Long-Context Agent

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Reasoning over ultra-long documents requires synthesizing sparse evidence scattered across distant segments under strict memory constraints. While streaming agents enable scalable processing, their passive memory update strategy often fails to preserve low-salience bridging evidence required for multi-hop reasoning. We propose InfMem, a control-centric agent that instantiates System-2-style control via a PreThink-Retrieve-Write protocol. InfMem actively monitors evidence sufficiency, performs targeted in-document retrieval, and applies evidence-aware joint compression to update a bounded memory. To ensure reliable control, we introduce a practical SFT-to-RL training recipe that aligns retrieval, writing, and stopping decisions with end-task correctness. On ultra-long QA benchmarks from 32k to 1M tokens, InfMem consistently outperforms MemAgent across backbones. Specifically, InfMem improves average absolute accuracy by +10.17, +11.84, and +8.23 points on Qwen3-1.7B, Qwen3-4B, and Qwen2.5-7B, respectively, while reducing inference time by 3.9× on average (up to 5.1×) via adaptive early stopping.


💡 Research Summary

InfMem tackles the challenge of answering questions that require evidence scattered across ultra‑long documents while operating under a strict memory budget. Existing streaming agents such as MemAgent process a document chunk‑by‑chunk and overwrite a fixed‑size memory after each step. This passive strategy often discards low‑salience “bridging” facts that are crucial for multi‑hop reasoning, leading to poor performance on tasks where a few distant pieces of information must be combined.

Inspired by the dual‑process theory of human cognition, the authors introduce a System‑2‑style control loop called PreThink‑Retrieve‑Write with an early‑stop mechanism. The loop consists of four explicit stages:

  1. PreThink – Given the question q and the current memory mₜ₋₁, a lightweight controller predicts whether the memory already contains enough evidence. It outputs a structured decision tuple (aₜ, uₜ, kₜ) where aₜ ∈ {STOP, RETRIEVE}, uₜ is a dynamically generated retrieval query, and kₜ is the number of fine‑grained retrieval units to fetch.

  2. Retrieve – If aₜ = RETRIEVE, the system performs a global in‑document search over a pre‑indexed set of paragraph‑level units {pⱼ}. The top‑kₜ units are concatenated into a compact context rₜ. This step allows the agent to jump arbitrarily forward or backward in the document, overcoming the linear‑only access of pure streaming.

  3. Write – The agent then jointly compresses the newly observed chunk cₜ and the retrieved evidence rₜ together with the question and the previous memory. The compression is evidence‑aware: it explicitly identifies “bridging” facts and links, and overwrites the bounded memory mₜ (size ≤ M) with the most task‑relevant tokens.

  4. Early Stop – If aₜ = STOP, the loop terminates and the final answer is generated from the current memory. Otherwise the loop repeats until the document ends or sufficient evidence is gathered, dramatically reducing unnecessary retrieval‑write cycles.
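The four-stage loop above can be sketched in Python. This is a minimal toy instantiation: `prethink`, `retrieve`, and `write` are trivial stand-ins (a keyword check, a lexical scorer, and string truncation) for what are LLM calls in InfMem, and all names are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """Structured decision tuple (a_t, u_t, k_t) emitted by PreThink."""
    action: str        # "STOP" or "RETRIEVE"
    query: str = ""    # u_t, a dynamically generated retrieval query
    k: int = 0         # k_t, number of retrieval units to fetch

def prethink(question: str, memory: str) -> Decision:
    # Toy sufficiency check: stop once both target entities appear in memory.
    if "Paris" in memory and "France" in memory:
        return Decision("STOP")
    return Decision("RETRIEVE", query=question, k=2)

def retrieve(units: list[str], query: str, k: int) -> str:
    # Toy lexical scorer over pre-indexed paragraph-level units {p_j};
    # InfMem can jump arbitrarily forward or backward in the document.
    scored = sorted(units, key=lambda p: -sum(w in p for w in query.split()))
    return " ".join(scored[:k])

def write(memory: str, chunk: str, retrieved: str, budget: int = 200) -> str:
    # Evidence-aware joint compression of (m_{t-1}, c_t, r_t), here reduced
    # to concatenation truncated to the memory budget M.
    merged = (memory + " " + chunk + " " + retrieved).strip()
    return merged[:budget]

def run(question: str, chunks: list[str], units: list[str]) -> str:
    memory = ""
    for chunk in chunks:
        d = prethink(question, memory)
        if d.action == "STOP":          # early stop: skip remaining chunks
            break
        r = retrieve(units, d.query, d.k) if d.k else ""
        memory = write(memory, chunk, r)
    return memory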

Training proceeds in two stages. First, a Supervised Fine‑Tuning (SFT) phase distills a strong teacher model (e.g., Qwen‑3‑32B) into a smaller student model using protocol‑consistent trajectories. The student learns to emit only inference‑valid actions and to produce the correct formatted decision records. Only trajectories that lead to a correct final answer are retained, ensuring high‑quality supervision.
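The rejection-filtering step for SFT data can be sketched as follows. The trajectory record format (`steps`, `answer`, `gold`) and function names here are hypothetical illustrations of the two stated criteria: protocol-consistent actions and a verified-correct final answer.

```python
def build_sft_corpus(trajectories, verify):
    """Keep only teacher trajectories that (a) contain exclusively
    inference-valid actions and (b) end in a correct final answer.

    `trajectories` is a list of dicts with 'steps' (decision records),
    'answer', and 'gold'; `verify` compares answer to gold label.
    Record format is illustrative, not the paper's.
    """
    corpus = []
    for traj in trajectories:
        well_formed = all(
            step.get("action") in {"STOP", "RETRIEVE"}
            for step in traj["steps"]
        )
        if well_formed and verify(traj["answer"], traj["gold"]):
            corpus.append(traj)
    return corpus
```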

Second, a Verifier‑based Reinforcement Learning (RL) phase aligns the discrete control decisions with the ultimate task reward. A verifier checks whether the generated answer matches the ground truth; the reward combines answer correctness with a penalty for extra retrieval/write steps, encouraging both accuracy and efficiency. This RL fine‑tuning teaches the controller to stop early when possible and to request just enough evidence to close reasoning gaps.

Empirical evaluation spans three backbone LLMs—Qwen3-1.7B, Qwen3-4B, and Qwen2.5-7B—on ultra-long QA benchmarks ranging from 32k to 1M tokens. Across all settings, InfMem outperforms the baseline MemAgent by +10.17, +11.84, and +8.23 absolute accuracy points, respectively. Moreover, the adaptive early stopping reduces inference latency by 3.9× on average (up to 5.1× in the most favorable cases). Ablation studies confirm that each component—monitoring, targeted retrieval, evidence-aware joint compression, and early stopping—contributes significantly to the gains.

In summary, InfMem demonstrates that a System‑2‑style, state‑dependent controller can effectively manage bounded memory for ultra‑long context reasoning. By explicitly monitoring evidence sufficiency, performing goal‑directed re‑access, and compressing with awareness of bridging facts, the model preserves critical information that would otherwise be lost in naïve streaming pipelines. The proposed SFT‑to‑RL recipe provides a practical blueprint for training long‑horizon control policies in LLM‑based agents, opening avenues for applying similar techniques to multimodal documents, massive codebases, or any domain where sparse, distributed evidence must be synthesized under tight resource constraints.

