G-MemLLM: Gated Latent Memory Augmentation for Long-Context Reasoning in Large Language Models
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, yet they remain constrained by the finite capacity of their context windows and the inherent difficulty of maintaining long-term factual consistency during multi-hop reasoning. Existing methods that rely on context compression or recurrent tokens often suffer from "context rot" or the dilution of information over long horizons. In this paper, we propose G-MemLLM, a memory-augmented architecture that integrates a frozen LLM backbone with a trainable Latent Memory Bank. Our key innovation is a GRU-style gated update mechanism that allows the model to selectively update, preserve, or overwrite latent memory slots, preventing the gradual vanishing of knowledge common in recurrent systems. We evaluate G-MemLLM across scales, from GPT-2 (124M) to Llama 3.1 (8B), on the HotpotQA and Zero-Shot Relation Extraction (ZsRE) benchmarks. Our results demonstrate that G-MemLLM significantly enhances multi-hop reasoning and relational precision, achieving a 13.3% relative accuracy gain on ZsRE for Llama 3.1-8B, and it also yields improvements across model scales, boosting Answer F1 by 8.56 points for GPT-2 and increasing Supporting Fact F1 by 6.89 points for Llama 3.1-8B on HotpotQA.
💡 Research Summary
The paper introduces G‑MemLLM, a memory‑augmented architecture that equips a frozen large language model (LLM) with a trainable latent memory bank and a GRU‑style gated update mechanism. The authors argue that existing solutions for extending context—such as context compression (e.g., Gist Tokens, Recurrent Context Compression) and recurrent state‑passing (e.g., Recurrent Memory Transformer, M+ framework)—suffer from information loss (context rot) or gradual forgetting of early facts (vanishing knowledge). To address these issues, G‑MemLLM separates linguistic processing from knowledge retention: the frozen backbone (GPT‑2 124M or Llama 3.1 8B) generates hidden states, while a latent memory bank of fixed‑size slots stores intermediate representations that can be selectively refreshed, preserved, or overwritten.
The latent memory bank consists of S slots, each a Dₘ‑dimensional learnable vector. An encoder compresses the LLM’s hidden states into a lower‑dimensional space, and a decoder projects memory slots back to the LLM dimension. During each interaction, the current hidden states act as keys and values in a cross‑attention operation where memory slots are queries. The attention output (M_attended) is combined with the previous memory (M_old) via a gate g produced by a small neural network: M_new = (1 − g) ⊙ M_old + g ⊙ M_attended. This gating mirrors the update gate of a GRU, allowing the model to keep long‑term facts when g is low and to incorporate new evidence when g is high, thereby mitigating drift and overwriting problems typical of pure recurrent memories.
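The gated update described above can be sketched compactly. The following is a minimal NumPy illustration, not the authors' implementation: the weight matrices (`Wq`, `Wk`, `Wv`) for the cross-attention projections and the gate parameters (`Wg`, `bg`) are hypothetical single-head, single-layer stand-ins, and the encoder/decoder projections between the LLM dimension and Dₘ are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_memory_update(M_old, H, Wq, Wk, Wv, Wg, bg):
    """One gated memory refresh (illustrative sketch).

    M_old: (S, Dm) current memory slots -- act as attention queries.
    H:     (T, Dm) encoded hidden states -- act as keys and values.
    Wg/bg: hypothetical gate parameters over [M_old; M_attended].
    """
    Q = M_old @ Wq                                   # slot queries
    K, V = H @ Wk, H @ Wv                            # context keys/values
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))      # (S, T) cross-attention
    M_attended = A @ V                               # evidence gathered per slot
    # GRU-style update gate: g near 0 preserves the slot, g near 1 overwrites it
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([M_old, M_attended], -1) @ Wg + bg)))
    # M_new = (1 - g) * M_old + g * M_attended, as in the paper's update rule
    return (1.0 - g) * M_old + g * M_attended
```

A strongly negative gate bias drives g toward 0, so the memory passes through almost unchanged, which is the "preserve long-term facts" regime the gating is meant to enable.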
Training optimizes a composite loss: (1) a standard language‑model cross‑entropy on the memory‑augmented logits, (2) an L1 sparsity term encouraging the model to activate only a few memory slots, and (3) a negative‑entropy term that discourages reliance on a single slot by maximizing the entropy of slot importance scores. Hyperparameters λ_s and λ_e balance these regularizers.
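The three-term objective can be written out as a short sketch. This is an interpretation of the description above, not the paper's code; the placeholder values for λ_s and λ_e and the use of simple importance scores per slot are assumptions.

```python
import numpy as np

def composite_loss(logits, targets, slot_scores, lam_s=0.01, lam_e=0.01):
    """Sketch of the composite training loss described above.

    logits:      (T, V) memory-augmented next-token logits.
    targets:     (T,) gold token ids.
    slot_scores: (S,) nonnegative per-slot importance scores (assumed form).
    lam_s/lam_e: the lambda_s / lambda_e weights; values here are placeholders.
    """
    # (1) standard language-model cross-entropy on the augmented logits
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    ce = -logp[np.arange(len(targets)), targets].mean()
    # (2) L1 sparsity: encourage only a few memory slots to activate
    l1 = np.abs(slot_scores).sum()
    # (3) negative entropy of normalized slot importances: minimizing it
    # maximizes entropy, discouraging reliance on a single slot
    p = slot_scores / slot_scores.sum()
    neg_entropy = (p * np.log(p + 1e-12)).sum()
    return ce + lam_s * l1 + lam_e * neg_entropy
```

Note the tension the two regularizers encode: the L1 term prunes inactive slots, while the entropy term spreads importance across the slots that remain active, so λ_s and λ_e must be balanced against each other.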
Experiments evaluate G‑MemLLM on two benchmarks: HotpotQA (multi‑hop question answering) and Zero‑Shot Relation Extraction (ZsRE). Across both model scales, adding the memory module yields consistent gains. For GPT‑2, Answer F1 improves from 45.52 to 54.08 (+8.56) and Joint F1 from 30.72 to 38.51 (+7.79). For Llama 3.1 8B, Supporting Fact F1 rises from 76.53 to 83.42 (+6.89) and Joint F1 from 72.15 to 78.23 (+6.08). On ZsRE, Llama 8B's accuracy jumps from 55.63 % to 63.03 % (+7.40 points, a 13.3 % relative gain). An ablation on memory‑slot count shows that 1024 slots strike the best trade‑off between accuracy and inference overhead; increasing to 2048 slots yields only a marginal 0.28 % gain, indicating saturation.
The authors discuss that the memory bank acts as a scratchpad for small models, while for larger models it functions as an indexing layer that organizes the model’s internal knowledge more efficiently. Limitations include the fixed slot count, which may constrain extremely long‑range reasoning, and the reliance on a simple sigmoid gate; more sophisticated multi‑scale gating or reinforcement‑learning‑based memory management could further improve performance. Additionally, experiments are limited to frozen backbones; integrating the memory module with fully fine‑tuned models remains an open question.
In conclusion, G‑MemLLM demonstrates that a lightweight, trainable latent memory, coupled with a gated update, can substantially extend the effective context window of LLMs without requiring full model fine‑tuning. The approach yields notable improvements in multi‑hop reasoning and relational extraction, scales well from 124 M to 8 B parameters, and adds less than 3 % extra parameters. Future work may explore dynamic slot allocation, hierarchical gating, and joint training with large, fine‑tuned LLMs to further close the gap between short‑term context processing and long‑term factual consistency.