Fast-weight Product Key Memory
Tianyu Zhao1 and Llion Jones1
1Sakana AI
Sequence modeling layers in modern language models typically face a trade-off between storage capacity
and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic
costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight
Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse
Product Key Memory (PKM) from a static module into a dynamic, “fast-weight” episodic memory. Unlike PKM,
FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient
descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences.
Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic
memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in
Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only
4K-token sequences.
Contents

1 Introduction
2 Product Key Memory
3 Fast-weight Product Key Memory
4 Experiments
5 Interpretability Analyses
6 Cost Analyses
7 Related Work
8 Conclusion
A Detailed Training Settings
B Ablation Study
C More Visualization Examples
Corresponding author(s): Tianyu Zhao (tianyu@sakana.ai)
arXiv:2601.00671v1 [cs.CL] 2 Jan 2026
1. Introduction
Sequence modeling layers, or token mixers, are the foundational components in modern language
models. The most successful architectures today can be fundamentally understood as forms of
associative memory (Dao and Gu, 2024; Peng et al., 2025; Vaswani et al., 2017; Yang et al., 2024b,
2025), characterized by their ability to maintain key-value associations, execute retrieval, and perform
memorization (Gershman et al., 2025).
Within this framework, existing layers lie on a spectrum defined by the trade-off between storage
capacity and computational efficiency. Standard softmax attention (Vaswani et al., 2017) acts as
an associative memory with unbounded storage, yet its computational cost becomes increasingly
prohibitive as the sequence length grows (Zhong et al., 2025). Conversely, linear attention variants (Behrouz et al., 2025c; Dao and Gu, 2024; Gu and Dao, 2024; Katharopoulos et al., 2020; Schlag
et al., 2021b; Sun et al., 2025; Yang et al., 2025) provide efficient, sub-quadratic mechanisms but
rely on fixed storage capacities that often struggle to capture the same depth of information.
We focus our investigation on resolving this specific tension: balancing large-scale storage with
low computational overhead. We posit that an ideal associative memory should satisfy four key
properties:
1. Key-value Association: The ability to link keys to values.
2. Large Storage: Capacity that is extensive, if not unbounded.
3. Low Cost: Sub-quadratic computational complexity w.r.t. input length.
4. Retrieval and Memorization: The capability to retrieve information and, crucially, memorize
new key-value pairs from inputs at any time.
Product Key Memory (PKM, Lample et al. 2019) is an architecture that elegantly satisfies the first
three properties. Its sparse key-value design handles an enormous number of memory slots (e.g.,
$N = 10^6$) at a fixed, low computational cost. However, PKM was originally designed as a “slow-weight”
channel mixer – similar to Feed-Forward Networks (FFN) – meaning it is updated only during training
and remains frozen during inference. Consequently, it lacks the ability to rapidly adapt to new inputs
during deployment, failing property 4.
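To make the sparse lookup concrete, below is a minimal sketch of the product-key trick behind PKM (Lample et al., 2019): the query is split into two halves, each half is scored against only $\sqrt{N}$ sub-keys, and the final Top-$k$ slots are chosen from the combined candidate pool, so $N = 10^6$ slots can be addressed while scoring only on the order of $2\sqrt{N}$ keys. The PyTorch framing, function name, and tensor names are our own illustrative choices (batching and multi-head queries are omitted), not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def product_key_lookup(q, sub_keys_1, sub_keys_2, values, k=32):
    """Sketch of a product-key Top-k memory read.

    q:          (D,)        query vector, split into two halves of size D/2
    sub_keys_1: (n, D/2)    first sub-key table  (n = sqrt(N))
    sub_keys_2: (n, D/2)    second sub-key table
    values:     (n*n, D_V)  one value vector per composite slot (i, j)
    """
    n = sub_keys_1.size(0)
    q1, q2 = q.chunk(2)                           # split the query into two halves
    s1, s2 = sub_keys_1 @ q1, sub_keys_2 @ q2     # n scores per half instead of n*n
    top1, idx1 = s1.topk(k)                       # best k sub-keys for each half
    top2, idx2 = s2.topk(k)
    # A full key is a (sub-key_1, sub-key_2) pair, so its score is the sum of the
    # two half-scores; enumerate the k*k candidate combinations.
    cand_scores = (top1[:, None] + top2[None, :]).reshape(-1)
    cand_slots = (idx1[:, None] * n + idx2[None, :]).reshape(-1)
    best_scores, pos = cand_scores.topk(k)        # final Top-k over the candidates
    weights = F.softmax(best_scores, dim=-1)      # normalize the selected scores
    return weights @ values[cand_slots[pos]]      # weighted sum of k value vectors
```

The point of the Cartesian-product structure is that the layer keeps $n^2$ addressable slots while the per-query cost stays around $O(n + k^2)$ score computations.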
In this paper, we propose to convert PKM from a static, slow-weight module into Fast-weight Product
Key Memory (FwPKM). By redesigning PKM to update its parameters dynamically at both training
and inference time, we enable it to function as a high-fidelity episodic memory. FwPKM can store
“episodes” directly from input sequences and carry that memory across different contexts, offering a
promising new path for continual learning and personalized AI agents.
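Before the formal description in the following sections, here is a rough sketch of what a “fast-weight” update could look like for such a memory module: on each chunk of the input, the module takes a local gradient step so that it can reproduce the chunk's key-value pairs. The squared-error objective, the plain SGD step, and the learning rate below are illustrative assumptions, not FwPKM's actual chunk-level update rule.

```python
import torch

def chunk_level_update(memory, chunk_keys, chunk_values, lr=0.1):
    """One local, chunk-level gradient step on a differentiable key-value module.

    `memory` is any torch.nn.Module that maps query/key vectors to value
    predictions (e.g., a PKM-style layer). The loss and the update rule here
    are placeholders chosen for illustration.
    """
    memory.zero_grad()
    predicted = memory(chunk_keys)                    # read with the current fast weights
    loss = (predicted - chunk_values).pow(2).mean()   # local memorization loss (assumed)
    loss.backward()                                   # gradients w.r.t. the memory parameters only
    with torch.no_grad():
        for p in memory.parameters():
            if p.grad is not None:
                p -= lr * p.grad                      # write the chunk into the fast weights
    return loss.item()
```

As described above, such a local step would run at both training and inference time, once per incoming chunk, so the memory keeps absorbing new key-value pairs from the sequence it is processing.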
2. Product Key Memory
Top-$k$ Key-value Memory
A standard key-value memory consists of a key matrix $K \in \mathbb{R}^{N \times D_K}$ and a value matrix $V \in \mathbb{R}^{N \times D_V}$, where $N$ represents the number of memory slots and $D_{\{K,V\}}$ are the hidden dimensions. A common approach to learning a large memory without sacrificing computational efficiency is to exploit sparsity via a Top-$k$ operation (Rae et al., 2016; Weston et al., 2015). Given an input query vector $\mathbf{q}$, the model computes a score $s_i$ for each memory slot as the inner product between the query and the keys. A Top-$k$ operation $\mathcal{T}_k$ then selects the indices of the $k$ slots with the highest scores. The selected scores are normalized via softmax to produce weights $\{s'_i\}$, and the
Table 1 | Comparison of PKM and FwPKM
             PKM          FwPKM
Weight Type  Slow Weig