Online Vector Quantized Attention


Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long-context tasks but incurs quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long-context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, its memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long-context tasks and on long-context language modeling. OVQ-attention shows significant improvements over linear attention baselines and over the original VQ-attention, which inspired OVQ-attention. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up to 64k sequence length, despite using a small fraction of the memory of full self-attention.


💡 Research Summary

The paper tackles a fundamental trade‑off in modern large language models (LLMs): self‑attention offers strong long‑range capabilities but incurs quadratic compute and linear memory, while linear‑attention and state‑space models (SSMs) run in linear time with constant memory but struggle with long‑context fidelity. Building on the recently introduced Vector‑Quantized Attention (VQ‑attention), which quantizes keys to a fixed dictionary of centroids and updates values online, the authors identify a critical limitation: the key dictionary (Dₖ) is pretrained and static, leading to substantial quantization error when processing sequences far longer than those seen during training.
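To make the VQ-attention mechanism described above concrete, here is a minimal NumPy sketch of one online memory-update step: an incoming key is snapped to its nearest centroid in a fixed key dictionary, and only that slot's running-mean value and count are updated. Function and variable names are illustrative assumptions, not the paper's API.

```python
import numpy as np

def vq_attention_step(k, v, centroids, values, counts):
    """One online VQ-attention-style memory update (illustrative sketch).

    The key k is quantized to its nearest centroid in a fixed dictionary;
    the running-mean value and count for that slot are then updated.
    Names and shapes are assumptions, not the paper's interface.
    """
    # Quantize: index of the centroid closest to k (squared Euclidean distance).
    j = int(np.argmin(((centroids - k) ** 2).sum(axis=-1)))
    # Sparse online update: only slot j is touched, all other slots are untouched.
    counts[j] += 1
    values[j] += (v - values[j]) / counts[j]  # running mean of values routed to slot j
    return j
```

The sparsity of this update is what lets the memory state grow large without a matching growth in per-token compute: each token touches one slot, not the whole state.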

To overcome this, they propose Online Vector‑Quantized Attention (OVQ‑attention), a novel sequence‑mixing layer that learns both the key dictionary Dₖ and the value dictionary Dᵥ online during inference. The theoretical foundation is Gaussian Mixture Regression (GMR). GMR fits a Gaussian mixture model (GMM) to paired key‑value data and predicts values for a query by weighting each mixture component’s mean value (µᵥ) with a softmax over the squared distance between the query and the component’s mean key (µₖ), multiplied by the component’s prior (proportional to its count). The authors prove that the GMR prediction formula exactly matches the linear‑time form of VQ‑attention:

E[v | q] = Σⱼ nⱼ exp(−‖q − µₖⱼ‖²) µᵥⱼ / Σⱼ′ nⱼ′ exp(−‖q − µₖⱼ′‖²),

where nⱼ is the number of key‑value pairs assigned to component j, so the prior πⱼ ∝ nⱼ.
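The GMR readout described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming unit isotropic component variances and the prose definition of the weights (softmax over negative squared key distance, scaled by count-proportional priors); variable names are assumptions.

```python
import numpy as np

def gmr_predict(q, mu_k, mu_v, counts):
    """Gaussian-mixture-regression readout (illustrative sketch).

    Weights each component's mean value mu_v[j] by a softmax over the
    negative squared distance between query q and mean key mu_k[j],
    scaled by the component prior (proportional to its count).
    Assumes unit isotropic variance for simplicity.
    """
    d2 = ((mu_k - q) ** 2).sum(axis=-1)   # squared distances ||q - mu_k[j]||^2
    logits = np.log(counts) - d2          # log prior + (negative) squared distance
    w = np.exp(logits - logits.max())     # numerically stable softmax
    w /= w.sum()
    return w @ mu_v                       # E[v | q]: mixture of component mean values
```

Note that the cost of this readout is linear in the number of mixture components, not in the sequence length, which is what gives the layer its linear-time, constant-memory profile.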

