Learning to Remember, Learn, and Forget in Attention-Based Models
In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.
💡 Research Summary
The paper tackles the memory and learning limitations of fixed‑size attention models such as linear transformers and state‑space models (SSMs). While these architectures replace the growing key‑value cache with a constant‑size memory matrix, they suffer from catastrophic interference: new key‑value pairs overwrite older information when the keys are not orthogonal. The authors reinterpret in‑context learning (ICL) as an online continual‑learning problem and introduce a Bayesian metaplasticity framework to resolve the stability‑plasticity dilemma.
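The interference problem described above is easy to see in the basic outer-product write used by linear attention. The following is a minimal illustrative sketch (not the paper's code; the `write`/`read` helpers are our own names) showing that recall is exact for orthogonal keys but corrupted once a new key overlaps an old one:

```python
# Hypothetical sketch of interference in a fixed-size linear-attention memory S.
import numpy as np

def write(S, k, v):
    """Outer-product write used by linear attention: S <- S + v k^T."""
    return S + np.outer(v, k)

def read(S, k):
    """Recall the value associated with key k: v_hat = S k."""
    return S @ k

d = 4
S = np.zeros((d, d))

# Orthogonal keys: recall is exact.
k1, k2 = np.eye(d)[0], np.eye(d)[1]
v1, v2 = np.array([1.0, 0, 0, 0]), np.array([0.0, 1, 0, 0])
S = write(write(S, k1, v1), k2, v2)
print(np.allclose(read(S, k1), v1))  # True: k1 and k2 do not overlap

# A non-orthogonal key corrupts earlier recalls.
k3 = (k1 + k2) / np.sqrt(2)          # overlaps with k1
v3 = np.array([0.0, 0, 1, 0])
S = write(S, k3, v3)
print(np.allclose(read(S, k1), v1))  # False: read(S, k1) = v1 + v3 / sqrt(2)
```

Because the memory matrix has fixed size, every non-orthogonal write bleeds into previously stored associations; this is the catastrophic interference that motivates the metaplastic design.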
Palimpsa, the proposed attention block, treats each memory state S as a Gaussian distribution with mean μ and variance σ². An importance factor β and an input‑dependent forgetting gate αₜ = exp(−A·dₜ) control how much prior knowledge is retained versus how aggressively new evidence is incorporated. The gate effectively discards a fraction 1/Nₜ of accumulated prior weight at each step, preventing “catastrophic remembering” where the posterior becomes overly concentrated and new data have negligible impact.
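The gated write can be sketched in a few lines. This is an illustrative reconstruction from the quantities named in the text (αₜ, A, dₜ, β); the exact parameterization in the paper may differ, and `gated_write` is our own name:

```python
# Hedged sketch of a forgetting-gated, importance-weighted memory write.
import numpy as np

def gated_write(S, k, v, d_t, A=1.0, beta=1.0):
    """Decay the old state by alpha_t = exp(-A * d_t), then add beta * v k^T.

    d_t is an input-dependent signal: large d_t -> strong forgetting,
    d_t = 0 -> the prior state is kept intact.
    """
    alpha_t = np.exp(-A * d_t)  # input-dependent forgetting gate in (0, 1]
    return alpha_t * S + beta * np.outer(v, k)
```

Geometric decay of this kind keeps the accumulated prior weight bounded, which is exactly the mechanism the text credits with preventing catastrophic remembering.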
The authors formulate the attention update as Bayesian linear regression and solve it via variational inference. In this conjugate setting the variational solution matches the exact posterior, and a diagonal covariance approximation yields a computationally cheap, linear‑time update. Two internal states are maintained: a plastic state that quickly absorbs new key‑value pairs, and a stable state that preserves information deemed important by the uncertainty‑driven learning rate. The learning rate for each individual state adapts in‑context, a property the authors call true metaplasticity.
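The diagonal-covariance update described above can be sketched as a per-row conjugate Gaussian (Kalman-style) recursion. This is our own reconstruction under the stated assumptions (mean μ, diagonal variance σ², observation v = Sk + noise); the function name `bayes_update` and the exact gain formula are illustrative, not taken from the paper:

```python
# Hypothetical sketch of the diagonal-covariance Bayesian memory update.
# Each entry of the memory keeps a mean mu_ij and a variance sigma2_ij;
# the per-entry learning rate is uncertainty-driven: high-variance (plastic)
# entries absorb the prediction error, low-variance (stable) entries resist it.
import numpy as np

def bayes_update(mu, sigma2, k, v, obs_noise=1.0):
    """One conjugate update of S ~ N(mu, diag(sigma2)) after observing (k, v)."""
    err = v - mu @ k                        # per-row prediction error
    denom = sigma2 @ (k ** 2) + obs_noise   # per-row innovation variance
    gain = (sigma2 * k) / denom[:, None]    # uncertainty-driven learning rate
    mu = mu + gain * err[:, None]           # move toward the new evidence
    sigma2 = sigma2 - gain * (sigma2 * k)   # variance shrinks where entries were used
    return mu, sigma2
```

Because `gain` is proportional to `sigma2`, the learning rate of each individual state adapts in context, which matches the "true metaplasticity" property described in the text.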
The paper demonstrates that several existing gated linear attention models (e.g., Linear Transformer, Longhorn, Mesanet) emerge as special cases of Palimpsa under particular posterior approximations or fixed plasticity assumptions. Moreover, Mamba2 corresponds to the regime where forgetting dominates, showing that Palimpsa subsumes it as a limiting case. Consequently, any non‑metaplastic model can be transformed into a metaplastic version by applying the derived conversion rules, thereby expanding its effective memory capacity.
Empirical evaluation uses two backbones: Palimpsa‑D (based on Deltanet) and Palimpsa‑M (based on Mamba2). Experiments on the Multi‑Query Associative Recall (MQAR) benchmark and several commonsense reasoning datasets (Winogrande, ARC‑Easy/Challenge) reveal consistent gains over baselines. Across memory sizes (N = 64, 128, 256), Palimpsa improves MQAR accuracy by roughly 4–7 percentage points compared to Linear Transformer, Longhorn, and Mesanet, and yields 2–4 percentage‑point improvements on commonsense tasks. The gains are especially pronounced in low‑memory regimes, confirming that the Bayesian metaplasticity mechanism effectively balances forgetting and retention.
In summary, Palimpsa introduces a principled Bayesian metaplasticity approach to fixed‑size attention memories, providing a theoretical framework that unifies prior gated linear models, offers a practical method for converting existing architectures, and demonstrates tangible performance benefits on both synthetic and real‑world benchmarks.