Long-Context Generalization with Sparse Attention

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $α$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $α$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $α$-entmax baselines, achieving up to 1000$\times$ length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ training length.


💡 Research Summary

The paper tackles a fundamental limitation of transformer models when scaling to very long sequences: the softmax attention mechanism distributes probability mass across all tokens, which leads to three intertwined problems—representational collapse, over‑squashing, and attention dispersion. As the context length grows, non‑informative tokens accumulate attention weight, causing token representations to become indistinguishable, gradients to vanish across the O(n^L) paths, and the attention distribution to approach a uniform (maximum‑entropy) state.
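The dispersion effect can be seen numerically in a toy sketch (not from the paper): one token carries a fixed logit margin over Gaussian-noise tokens, yet its softmax weight shrinks toward zero as the sequence grows.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
weights = []
for n in (64, 1024, 65536):
    # One informative token with a fixed logit margin over n-1 noise tokens.
    logits = np.concatenate(([4.0], rng.normal(0.0, 1.0, n - 1)))
    weights.append(softmax(logits)[0])
    print(f"n={n:6d}  weight on informative token: {weights[-1]:.4f}")
```

Even though the informative token's logit advantage never changes, its attention weight decays as non‑informative tokens accumulate mass — exactly the dispersion that sparse attention is designed to avoid.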

To address these issues, the authors replace softmax with the α‑entmax transformation, a differentiable sparse mapping that can assign exact zero probability to tokens whose logits fall below a learned threshold. They first provide a rigorous theoretical analysis. Lemma 1 (Non‑Vanishing Attention Property) shows that, unlike softmax where adding a new token strictly reduces the weight of existing tokens, α‑entmax can keep the weight of a token unchanged if the new token’s logit is below the threshold. Definition 1 formalizes “attention dispersion” via normalized entropy, and Proposition 1 proves that softmax inevitably exhibits complete dispersion (entropy → log n) while α‑entmax can maintain bounded entropy O(log s) when the support size s grows sub‑linearly with sequence length n. Proposition 2 further demonstrates that sparse attention preserves token‑wise L1 distances (preventing representational collapse) and reduces the number of effective gradient paths from O(n^L) to O(s^L), thereby mitigating over‑squashing.
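Both the exact-zero behavior and Lemma 1 can be illustrated with sparsemax, the α = 2 special case of α‑entmax, which has a simple closed form (the paper's models use more general, learnable α, so treat this as a sketch of the properties rather than the paper's exact transform):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (the α = 2 case of α-entmax): Euclidean projection of the
    logits onto the probability simplex, which yields exact zeros."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum        # tokens kept in the support
    k_star = k[support][-1]                    # support size
    tau = (cumsum[k_star - 1] - 1) / k_star    # learned-threshold analogue
    return np.maximum(z - tau, 0.0)

p = sparsemax([1.6, 1.2, 0.3, -0.5])
print(p)   # low-logit tokens receive exactly zero probability

# Lemma 1 in action: appending a token whose logit falls below the
# threshold tau (= 0.9 here) leaves the existing weights unchanged.
p2 = sparsemax([1.6, 1.2, 0.3, -0.5, 0.8])
print(p2)
```

With softmax, appending that fifth token would strictly shrink every existing weight; under sparsemax the distribution is untouched because the new logit lies below the threshold.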

However, a fixed α and temperature are insufficient for arbitrarily long contexts because the range of logits typically expands as O(√log n) for Gaussian‑distributed scores, causing the attention to become overly peaky or overly diffuse. To overcome this, the authors introduce Adaptive‑Scalable Entmax (ASEntmax). ASEntmax learns head‑specific scaling coefficients β, γ, and offset δ as functions of the input and the logarithm of the sequence length. The attention logits are rescaled by (δ + β·(log n)^γ) before applying α‑entmax. This dynamic temperature mechanism allows the model to automatically increase or decrease sparsity as the context grows, preserving a stable support size without manual tuning. The formulation reduces to standard α‑entmax when β = 0, ensuring a smooth continuum between sparse and dense regimes.
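The rescaling can be sketched as a scalar function of sequence length; the values of β, γ, and δ below are fixed illustrative numbers standing in for the input‑dependent, head‑specific parameters the model actually learns.

```python
import numpy as np

# Hypothetical fixed values standing in for ASEntmax's learned per-head
# parameters (in the paper they are functions of the input and of log n).
beta, gamma, delta = 0.5, 1.0, 1.0

def asentmax_logit_scale(n, beta=beta, gamma=gamma, delta=delta):
    # Multiplicative factor applied to the attention logits before α-entmax.
    return delta + beta * np.log(n) ** gamma

for n in (64, 1024, 65536):
    print(f"n={n:6d}  logit scale: {asentmax_logit_scale(n):.3f}")
```

The scale grows logarithmically with n, sharpening the logits just enough to keep the entmax support size roughly stable as the context lengthens; with β = 0 the scale is the constant δ, recovering plain α‑entmax as the text notes.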

Empirical evaluation is conducted on two fronts. First, a synthetic “Multi‑query Multi‑token Associative Recall” benchmark tests length extrapolation. Models trained on sequences of length 64 are evaluated on lengths up to 65K (a 1000× extrapolation). ASEntmax achieves 95.3% exact‑match accuracy, dramatically outperforming Scalable‑Softmax (SSMax) and Adaptive‑Temperature baselines, and demonstrating robust pattern recall even when the context is orders of magnitude larger than seen during training. Second, language‑modeling experiments on WikiText‑103 and OpenWebText assess real‑world performance. When trained with an 8× longer context, ASEntmax yields lower perplexity and higher retrieval accuracy than softmax, while matching or slightly surpassing softmax on standard short‑context evaluations. Ablation studies confirm that the learned β and γ parameters indeed follow a logarithmic trend with sequence length, and that the sparsity level adapts appropriately across layers and heads. Gradient‑norm analyses corroborate the theoretical claim of O(s^L) effective paths, showing substantially larger gradient magnitudes for distant tokens compared to softmax.

In summary, the paper makes three key contributions: (1) a formal proof that α‑entmax eliminates attention vanishing and maintains concentration, (2) the Adaptive‑Scalable Entmax mechanism that learns to modulate sparsity with context length, and (3) extensive empirical evidence that these ideas enable transformers to generalize to sequences thousands of tokens longer than those seen during training without sacrificing short‑context performance. The work suggests that dynamic sparse attention is a promising direction for building next‑generation large language models capable of handling extremely long contexts efficiently and reliably.

