ZeroS: Zero-Sum Linear Attention for Efficient Transformers

Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.


💡 Research Summary

The paper “ZeroS: Zero‑Sum Linear Attention for Efficient Transformers” investigates why existing linear‑time attention mechanisms, despite reducing the quadratic cost of classic softmax attention to $O(N)$, consistently lag behind it in accuracy. The authors identify two fundamental shortcomings that affect linear attention, and even softmax attention itself, when viewed through a Taylor‑expansion lens.

First, the convex‑combination bottleneck: softmax produces strictly non‑negative weights that sum to one, forcing the output to lie inside the convex hull of the value vectors. Consequently, a single attention head can only perform additive blending; subtraction, contrast, or differential operations cannot be expressed directly, so deeper stacks of layers must emulate such behaviours.
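The convex-hull restriction is easy to see numerically. Below is a minimal NumPy sketch (the vectors and scores are illustrative, not from the paper): no softmax weighting of two value vectors can produce their difference, while a zero-sum weight vector expresses it directly.

```python
import numpy as np

def softmax(s):
    """Standard softmax: weights are non-negative and sum to 1."""
    e = np.exp(s - s.max())
    return e / e.sum()

# Two value vectors; the contrastive target is their difference v[0] - v[1].
v = np.array([[1.0, 0.0],
              [0.0, 1.0]])
target = v[0] - v[1]                 # = [1, -1]

# Any convex combination w @ v keeps every coordinate in [0, 1],
# so it can never reach [1, -1], which lies outside the convex hull.
w = softmax(np.array([5.0, -5.0]))   # heavily favours v[0]
out = w @ v
assert out.min() >= 0.0 and out.max() <= 1.0

# A zero-sum weight vector expresses the subtraction in one step.
w_zero = np.array([1.0, -1.0])       # sums to zero
assert np.allclose(w_zero @ v, target)
```

However extreme the scores, the softmax output only approaches a vertex of the hull; the contrast itself is unreachable without signed weights.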

Second, the uniform weight bias: the zero‑order term (the constant $1$, or $1/t$ after normalization, in the Taylor series of $\exp(q \cdot k)$) injects an equal‑weight averaging component into every attention distribution. In long sequences this component dominates, diluting the influence of higher‑order interactions and leading to the well‑known “attention dilution” problem.
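Dilution can be demonstrated with a toy score pattern (my own illustrative setup, not an experiment from the paper): a single relevant token scoring $1$ among $t-1$ background tokens scoring $0$ receives softmax weight $e/(e+t-1)$, only a constant factor above the uniform $1/t$ floor, so its absolute share of attention vanishes as the context grows.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# One "relevant" token (score 1.0) among t - 1 background tokens (score 0.0).
for t in (8, 128, 2048):
    w = softmax(np.concatenate([[1.0], np.zeros(t - 1)]))
    # The relevant token's weight e / (e + t - 1) exceeds the uniform
    # baseline 1/t by only ~e (about 2.72x), independent of t, so the
    # uniform component soaks up almost all attention mass at large t.
    print(f"t={t:5d}  w_relevant={w[0]:.5f}  uniform 1/t={1 / t:.5f}")
```

At $t = 2048$ the relevant token holds roughly $0.13\%$ of the attention mass, even though it is the only token with a nonzero score.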

To address both issues, the authors propose Zero‑Sum Linear Attention (ZeroS). The key idea is to remove the zero‑order term from the softmax expansion, thereby forcing the remaining weights to sum to zero. After subtraction, the residual first‑order term $\delta_{t,i} = s_{t,i} - \bar{s}_t$ and the higher‑order remainder $\epsilon_{t,i}$ are reweighted by learned, step‑dependent scalar gates $\sigma^{(1)}_t$ and $\sigma^{(h)}_t$. The final weight for token $i$ at step $t$ is

$$w_{t,i} = \sigma^{(1)}_t \, \delta_{t,i} + \sigma^{(h)}_t \, \epsilon_{t,i},$$

which sums to zero over $i$ by construction, admitting both positive and negative values.
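The weighting can be sketched in NumPy under two assumptions of mine: the gates are taken as plain scalars (the paper makes them learned and step-dependent), and $\epsilon_{t,i}$ is taken to be whatever remains of the exact softmax weight after removing the uniform $1/t$ term and $\delta_{t,i}$.

```python
import numpy as np

def zeros_weights(s, sigma1=1.0, sigma_h=1.0):
    """Illustrative sketch of ZeroS weighting, not the paper's exact code.

    s       : attention scores s_{t,i} for tokens i = 1..t at step t
    sigma1  : stand-in for the learned gate sigma^{(1)}_t
    sigma_h : stand-in for the learned gate sigma^{(h)}_t
    """
    t = len(s)
    a = np.exp(s - s.max())
    a = a / a.sum()               # exact softmax weights, sum to 1
    delta = s - s.mean()          # first-order residual, sums to 0
    eps = a - 1.0 / t - delta     # assumed higher-order remainder, sums to 0
    return sigma1 * delta + sigma_h * eps

w = zeros_weights(np.array([0.5, -0.2, 0.1, 0.4]))
assert abs(w.sum()) < 1e-9        # zero-sum: negative weights are allowed
```

With unit gates this reduces to softmax minus the uniform baseline $1/t$; the gates let the model rescale the first-order and higher-order contrastive parts independently.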

