When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning
While reasoning over long contexts is crucial for many real-world applications, it remains challenging for large language models (LLMs), whose performance degrades as the context length grows. The recent MemAgent approach tackles this by processing the context chunk by chunk in an RNN-like loop, updating a textual memory that is used for final answering. However, this naive recurrent memory update has two crucial drawbacks: (i) the memory can quickly explode because it is updated indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, incurring unnecessary computation even after sufficient evidence has been collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. In GRU-Mem, the memory is updated only when the update gate is open, and the recurrent loop exits immediately once the exit gate is open. To endow the model with these capabilities, we introduce two reward signals, $r^{\text{update}}$ and $r^{\text{exit}}$, within end-to-end RL, rewarding correct updating and exiting behavior respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent while accelerating inference by up to 400%.
💡 Research Summary
The paper tackles the persistent problem that large language models (LLMs) struggle with reasoning over extremely long contexts, often suffering severe performance degradation as the input length exceeds the model’s token window. Recent work, MemAgent, reframed long‑context question answering as a recurrent memory process: the context is split into fixed‑size chunks, and a memory agent reads each chunk sequentially, updating a textual memory that the answer agent finally uses to produce an answer. While this RNN‑like approach mitigates the “single‑pass” limitation, it inherits two critical drawbacks. First, the memory updates indiscriminately on every chunk, even those that contain no evidence, leading to uncontrolled growth of the memory (the “memory explosion” problem). Second, the loop processes all chunks regardless of whether sufficient evidence has already been gathered, lacking an early‑exit mechanism and thus wasting computation.
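The vanilla recurrent memory process described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; `llm` stands in for the underlying language model call, and the prompt wording and function names are assumptions.

```python
# Illustrative sketch of the vanilla MemAgent loop: the context is split
# into fixed-size chunks and the memory is rewritten on EVERY chunk,
# even when a chunk carries no evidence.

def chunk(text, size):
    """Split the long context into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def mem_agent(question, context, llm, chunk_size=4096):
    memory = ""                       # textual memory, starts empty
    for c in chunk(context, chunk_size):
        # Memory agent: unconditionally produce an updated memory.
        memory = llm(f"Question: {question}\nMemory: {memory}\nChunk: {c}\n"
                     "Update the memory with any relevant evidence.")
    # Answer agent: produce the final answer from the last memory alone.
    return llm(f"Question: {question}\nMemory: {memory}\nAnswer:")
```

Because the loop always runs over all chunks and always rewrites the memory, both of the drawbacks above (memory bloat and wasted computation) are visible directly in this structure.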
To resolve these issues, the authors propose GRU‑Mem, a gated recurrent memory framework that augments the vanilla memory agent with two text‑controlled binary gates: an Update Gate (UG) and an Exit Gate (EG), inspired by the gating mechanisms of GRUs in classic RNNs. At each step t, the memory agent receives the question Q, the current chunk Cₜ, and the previous memory Mₜ₋₁. It then outputs three items: (1) the status of the update gate Uₜ (yes/no), (2) a candidate memory M̂ₜ, and (3) the status of the exit gate Eₜ (continue/end). If Uₜ is “yes”, the candidate memory replaces the current memory; otherwise the previous memory is retained. If Eₜ is “end”, the recurrent loop terminates immediately and the final memory is fed to the answer agent. The gates are expressed as natural‑language tokens within a structured prompt, allowing the same underlying LLM to decide whether to update or stop based solely on its internal reasoning.
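The gated loop can be sketched as below. This is a hedged illustration under the assumption that `llm` returns the three structured fields (gate statuses and candidate memory) already parsed; the actual prompt format and parsing in the paper may differ.

```python
# Sketch of the GRU-Mem recurrence with text-controlled gates.
# `llm(question, chunk, memory)` is assumed to return a tuple:
#   (update_gate, candidate_memory, exit_gate)
# where update_gate is "yes"/"no" and exit_gate is "continue"/"end".

def gru_mem(question, chunks, llm):
    memory = ""                                  # M_0
    for c in chunks:                             # step t over chunks C_t
        update, candidate, exit_ = llm(question, c, memory)
        if update == "yes":                      # update gate U_t open
            memory = candidate                   # M_t <- candidate memory
        # otherwise M_{t-1} is retained unchanged
        if exit_ == "end":                       # exit gate E_t open
            break                                # terminate the loop early
    return memory                                # fed to the answer agent
```

Keeping the gates as natural-language tokens means no architectural change is needed: the same model that writes the candidate memory also emits the gate decisions.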
Training GRU‑Mem still relies on end‑to‑end reinforcement learning (RL) as in MemAgent, but the reward function is enriched with two additional signals. Update reward r_update gives +1 when the model correctly predicts “yes” on evidence‑containing chunks and “no” on evidence‑free chunks, and –1 otherwise. Exit reward r_exit gives +0.5 when the model exits exactly on the chunk that contains the last required evidence, –0.75 if it exits too early, and 0 if it exits too late. These are combined with the standard outcome reward (1 for a correct final answer, 0 otherwise) in the Multi‑Conv DAPO algorithm, producing per‑step advantages that guide the policy gradients. By explicitly rewarding gate correctness, the model learns to focus its memory updates on informative parts of the context and to stop processing as soon as the answer can be inferred.
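The reward values stated above can be written out directly. The numbers come from the summary; the function names and the evidence-label inputs (which chunk holds evidence, where the last required evidence sits) are illustrative assumptions about how the labels are supplied.

```python
# Sketch of the three reward signals described above.

def update_reward(predicted_yes: bool, chunk_has_evidence: bool) -> float:
    """r_update: +1 for a correct update decision, -1 otherwise."""
    return 1.0 if predicted_yes == chunk_has_evidence else -1.0

def exit_reward(exit_step: int, last_evidence_step: int) -> float:
    """r_exit: +0.5 for exiting exactly on the chunk with the last
    required evidence, -0.75 for exiting too early, 0 for too late."""
    if exit_step == last_evidence_step:
        return 0.5
    return -0.75 if exit_step < last_evidence_step else 0.0

def outcome_reward(answer_correct: bool) -> float:
    """Standard outcome reward: 1 for a correct final answer, else 0."""
    return 1.0 if answer_correct else 0.0
```

Note the asymmetry in the exit reward: exiting early is penalized more heavily than exiting late, since a premature exit can discard evidence needed for the answer, whereas a late exit only wastes computation.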
The authors evaluate GRU‑Mem on several long‑context QA benchmarks, including NarrativeQA‑Long, HotpotQA‑Long, Multi‑Doc QA, and a custom dataset with contexts spanning up to a million tokens. Experiments cover three model sizes (≈2.7 B, 6.7 B, and 13 B parameters). Across the board, GRU‑Mem outperforms the vanilla MemAgent by 3.2–5.8 percentage points in accuracy. More strikingly, inference speed improves dramatically: average latency is reduced by 2.1× to 4.0×, with the largest model achieving up to a 400% speed‑up. Memory consumption also drops by roughly 45% because the update gate prevents unnecessary accumulation of irrelevant text. Ablation studies confirm that each gate contributes uniquely—removing the update gate re‑introduces memory bloat, while removing the exit gate eliminates the speed gains—while the combination yields the best trade‑off.
Beyond empirical gains, the paper offers several conceptual contributions. First, it demonstrates that text‑based gating can be seamlessly integrated into LLMs without architectural changes, leveraging the model’s own language understanding to make gating decisions. Second, the reward design shows how to align intermediate procedural behaviors (updating, exiting) with the ultimate task objective, a pattern that could be reused for other multi‑step LLM pipelines (e.g., tool use, planning). Third, GRU‑Mem provides a practical pathway for LLMs to operate on contexts far beyond their native window, opening doors to applications such as full‑book summarization, large‑scale legal document analysis, and agentic systems that must ingest massive knowledge bases.
The authors acknowledge limitations: the chunk size and maximum number of chunks are still hyper‑parameters set a priori, and the gating decisions depend heavily on prompt phrasing, suggesting future work on dynamic chunking and automated prompt optimization. Extending the framework to multimodal inputs or incorporating human‑in‑the‑loop feedback for reward shaping are promising directions.
In summary, GRU‑Mem introduces a simple yet powerful gated recurrent memory mechanism that curbs memory explosion and eliminates unnecessary computation in long‑context reasoning. By marrying text‑controlled gates with reinforcement‑learning rewards, it achieves both higher accuracy and substantially faster inference, marking a significant step toward scalable, efficient LLM reasoning over massive textual corpora.