Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: attention demand should decrease over time as recurring patterns become familiar. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does not decrease during training. Motivated by this observation, we introduce CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, CRAM exhibits decreasing attention utilization over training, achieving a 37.8× reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is impossible without consolidation: any static routing scheme requires Ω(f·n) attention for tasks with recurring patterns of frequency f. On our proposed SRCD benchmark, CRAM achieves 100% retrieval accuracy at 1.6% attention compute (vs. 68% for baselines), and consolidated patterns transfer to unseen tasks with 48–52% attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology (γ = 0.43 vs. γ_human ≈ 0.4–0.5). Code and benchmarks are available at [anonymized].


💡 Research Summary

The paper addresses a fundamental inefficiency in modern sequence models that combine state-space models (SSMs) with self-attention. While hybrid architectures such as Jamba, SeqBoat, and TransMamba achieve impressive quality-efficiency trade-offs, they all allocate a fixed amount of attention compute regardless of whether the model has already learned a pattern. By probing pretrained GPT-2 models, the authors discover that 88% of attention operations retrieve information that is already predictable from the hidden state; this redundancy persists throughout training because standard objectives provide no signal to reduce compute.

Motivated by this observation, the authors propose CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired mechanism that gradually "consolidates" episodic retrievals into a parametric semantic memory. CRAM consists of three memory tiers: (1) a continuous-time working memory (CT) that handles local dynamics via an ODE-style update, (2) an episodic KV buffer accessed with full attention (O(n) cost) for novel events, and (3) a low-rank semantic adapter that learns to predict the output of the episodic buffer. A consolidation-aware router receives a feature vector containing the time gap, the CT dynamics magnitude, and a quality signal qₜ measuring how well the semantic adapter matches the episodic output. Using a Gumbel-Softmax, the router selects among CT-only, episodic retrieval, or semantic approximation for each token.
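The routing step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the feature vector, the linear routing head, and the route names are assumptions; the paper's router is presumably a learned neural module trained end-to-end.

```python
import numpy as np

ROUTES = ("ct_only", "episodic", "semantic")  # the three memory tiers

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Sample soft route probabilities via the Gumbel-Softmax trick."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Gumbel(0, 1) noise; small epsilons guard against log(0)
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-10) + 1e-10)
    y = (logits + g) / tau
    e = np.exp(y - y.max())          # numerically stable softmax
    return e / e.sum()

class ConsolidationAwareRouter:
    """Hypothetical router over [time_gap, ct_dynamics_magnitude, q_t] features."""
    def __init__(self, d_feat=3, n_routes=3, seed=0):
        self.rng = np.random.default_rng(seed)
        # In the paper this weight would be learned; random init here
        self.W = self.rng.normal(scale=0.1, size=(n_routes, d_feat))

    def route(self, time_gap, ct_dyn_mag, q_t, tau=0.5):
        feats = np.array([time_gap, ct_dyn_mag, q_t])
        probs = gumbel_softmax(self.W @ feats, tau=tau, rng=self.rng)
        return ROUTES[int(np.argmax(probs))], probs
```

A high qₜ would, after training, push probability mass from the episodic route toward the semantic route, which is the mechanism behind the decreasing attention usage.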

The consolidation loss L_cons trains the semantic adapter to approximate the episodic output, while the overall objective penalizes episodic attention when qₜ is high and rewards semantic routing. As training proceeds, qₜ rises for recurring patterns, causing a sharp phase transition around 3K steps: attention usage drops by a factor of 37.8×, moving from O(n) to O(1) per token. This dynamic reduction is absent in static sparse-attention methods, which maintain a constant attention budget.
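The loss structure described above can be sketched as follows. The MSE form of L_cons, the mapping qₜ = exp(−L_cons), and the penalty weights lam/mu are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def consolidation_loss(semantic_out, episodic_out):
    """L_cons: train the semantic adapter to mimic the episodic output
    (MSE here; the paper's exact distillation loss is not specified)."""
    return float(np.mean((np.asarray(semantic_out) - np.asarray(episodic_out)) ** 2))

def quality_signal(l_cons):
    """q_t in (0, 1]: high when the semantic adapter matches the episodic
    output well (hypothetical monotone mapping)."""
    return float(np.exp(-l_cons))

def total_objective(task_loss, l_cons, q_t, used_episodic, lam=1.0, mu=0.5):
    """Overall objective: task loss + consolidation term + a compute penalty
    that grows when episodic attention is used despite high q_t."""
    compute_penalty = mu * q_t * float(used_episodic)
    return task_loss + lam * l_cons + compute_penalty
```

Once qₜ saturates for a recurring pattern, the compute penalty makes episodic retrieval strictly more expensive than the semantic route, which is one plausible reading of the sharp phase transition.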

Theoretical contributions include: (i) a lower bound proving that any static routing scheme must spend Ω(f·n) attention on tasks where a fraction f of positions require retrieval of recurring patterns; (ii) a corollary showing that consolidation can break this bound, achieving sub‑linear attention proportional to the fraction of patterns that fail to consolidate; (iii) convergence guarantees for the semantic adapter based on stochastic optimization, and an attention‑reduction guarantee that scales with the proportion ρ of recurring, Lipschitz‑smooth patterns.
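The two complexity results above can be stated schematically as follows. This is a paraphrase with reconstructed notation, not the paper's theorem statements; in particular, the O((1 − ρ)·f·n) form for the consolidated case is our reading of corollary (ii).

```latex
% (i) Static routing lower bound: if a fraction $f$ of the $n$ positions
% require retrieval of a recurring pattern, any static router incurs
\[
  \mathbb{E}[\text{attention ops}] \;=\; \Omega(f \cdot n).
\]
% (ii) With consolidation: if a fraction $\rho$ of the recurring patterns
% consolidates into the semantic adapter, only the remainder needs attention:
\[
  \mathbb{E}[\text{attention ops}] \;=\; O\bigl((1 - \rho)\, f \cdot n\bigr).
\]
```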

Empirically, the authors introduce the SRCD benchmark (Sparse Retrieval in Continuous Dynamics), which mixes recurring motifs with irregular time gaps. CRAM attains 100% retrieval accuracy while using only 1.6% of the attention compute, far surpassing baselines (e.g., SeqBoat at 68% accuracy). Consolidated patterns transfer to unseen tasks, yielding a 48–52% reduction in attention without any retraining. Moreover, the learned growth curve of qₜ matches human episodic-to-semantic memory transition curves (γ = 0.43 vs. γ_human ≈ 0.4–0.5).

In summary, CRAM introduces a novel adaptive‑compute paradigm: instead of learning only “what to attend to,” it learns “when attention is unnecessary” by continuously compressing episodic memories into a fast parametric store. This yields dramatic compute savings, aligns with cognitive theories of memory consolidation, and opens new avenues for energy‑efficient large‑scale language models and other sequence‑processing systems.

