Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Efficient attention mechanisms enable long-context transformers but often miss globally important tokens, degrading modeling quality. We introduce a pre-scoring framework that assigns a query-independent global-importance prior over keys before applying hierarchical approximate attention. Using clustering-based or leverage-style scoring, pre-scoring identifies structurally informative keys and restricts computation to this prioritized subset. Integrated with HyperAttention, pre-scoring substantially improves approximation quality on long-context language modeling: on ChatGLM with 131k-token contexts, perplexity decreases from 12.0 to 9.5 under a fixed interaction budget while retaining subquadratic efficiency. Clustering-based scoring consistently outperforms leverage-based selection under identical key budgets. Beyond language, replacing self-attention in Vision Transformers preserves most of the baseline accuracy, showing that the approach generalizes across modalities. We provide structural guarantees under a planted-subspace model, showing that clustering recovers the same heavy-key sets as leverage-based methods. Overall, pre-scoring improves the efficiency-accuracy trade-off of approximate attention by better prioritizing informative keys without sacrificing scalability.


💡 Research Summary

The paper tackles the quadratic cost of self‑attention in transformers, especially when processing very long sequences, by introducing a query‑independent “pre‑scoring” step that selects a subset of globally important keys before any approximate attention computation. Two scoring mechanisms are explored: (1) a lightweight clustering approach (K‑means or K‑median) that partitions the key matrix into d + 1 clusters (one for each embedding dimension plus a noise bucket) and picks the s keys closest to each centroid, and (2) a fast leverage‑score approximation that ranks keys by their statistical influence. The selected key set S is then fed into an existing efficient attention kernel—in the experiments, HyperAttention, which uses angular locality‑sensitive hashing (LSH) and low‑rank compression inside hash buckets. By restricting the LSH‑based computation to the pre‑scored keys, the method dramatically improves recall of heavy‑weight attention scores while keeping the overall interaction budget sub‑quadratic.
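To make the clustering-based scoring concrete, here is a minimal pure-NumPy sketch of the selection step: a few Lloyd (K-means) iterations over ℓ₂-normalized keys, then keeping the s keys nearest each centroid. All function and parameter names are ours for illustration, not the paper's code, and K-median and the fast leverage variant are omitted.

```python
import numpy as np

def prescore_keys(K, n_clusters, s, iters=10, seed=0):
    """Clustering-based pre-scoring (illustrative sketch, not the paper's code).

    Partitions the key matrix K (n x d_k) into `n_clusters` groups (the paper
    uses d + 1: one per embedding dimension plus a noise bucket) with a few
    Lloyd iterations, then keeps the `s` keys closest to each centroid.
    Returns the sorted indices of the selected key set S.
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    # l2-normalize keys; the paper notes this is needed so that high-norm
    # noise vectors cannot dominate the clustering objective.
    Kn = K / (np.linalg.norm(K, axis=1, keepdims=True) + 1e-12)
    centroids = Kn[rng.choice(n, n_clusters, replace=False)].copy()
    for _ in range(iters):
        # Assign each key to its nearest centroid, then recompute centroids.
        dists = np.linalg.norm(Kn[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = Kn[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    # For each centroid, keep the s globally-indexed closest keys.
    dists = np.linalg.norm(Kn[:, None, :] - centroids[None, :, :], axis=2)
    selected = set()
    for c in range(n_clusters):
        selected.update(np.argsort(dists[:, c])[:s].tolist())
    return np.array(sorted(selected))
```

The returned index set S would then be passed to the downstream efficient-attention kernel (HyperAttention in the paper) in place of the full key range.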

Empirically, the authors evaluate on two domains. In language modeling, a 131 k‑token context with ChatGLM shows perplexity dropping from 17.54 (plain HyperAttention) to 9.53 when combined with pre‑scoring, and to 10.38 with pre‑scoring alone—demonstrating that most of the gain comes from better key selection. In vision, a ViT‑Large model retains 84.46 % of baseline ImageNet‑1k accuracy when only 128 keys are kept via clustering, compared to 77.17 % when using leverage‑score selection. Across all settings, clustering consistently outperforms leverage‑based selection under identical key budgets, suggesting that geometric structure in the key embeddings is more informative than purely algebraic influence measures.

Theoretical contributions are provided under a planted‑subspace model that mirrors the structure of transformer keys: d disjoint signal clusters plus a large noise cluster, each with bounded within‑cluster variance and a clear separation Δ between signal and noise. Under this model, the authors prove that both K‑means clustering and leverage‑score ranking recover all ϵ‑heavy keys with probability 1 − exp(−c n), i.e., with exponentially small failure probability. The clustering proof hinges on setting the number of clusters to d + 1, aligning each cluster with a latent orthogonal direction, and normalizing keys to unit ℓ₂ norm to avoid pathological outliers. This analysis shows that clustering can achieve the same worst‑case guarantees as LevAttention while being computationally cheaper and empirically stronger.
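In symbols, one way to write this model down (our reconstruction from the summary; the notation μ, z, σ, c(i) is assumed, not taken from the paper) is:

```latex
% Planted-subspace model, sketched from the summary (notation is ours).
% Each unit-norm key lies in one of d signal clusters or a noise cluster:
\[
  k_i = \mu_{c(i)} + z_i, \qquad \|k_i\|_2 = 1, \qquad
  \mathbb{E}\,\|z_i\|_2^2 \le \sigma^2 ,
\]
% with the signal centers separated from the noise cluster by a margin:
\[
  \min_{j \in [d]} \operatorname{dist}(\mu_j, \text{noise cluster}) \ge \Delta .
\]
% Claimed guarantee: running K-means with d + 1 clusters recovers every
% eps-heavy key with probability at least
\[
  1 - \exp(-c\,n).
\]
```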

Complexity analysis reveals that pre‑scoring adds only O(n d_k k I) time for clustering (with I ≤ 10 iterations) or O(n d_k log d_k) for leverage scores, and it is performed once per layer. No additional trainable parameters are introduced, and back‑propagation does not flow through the clustering step. For autoregressive decoding, the selected key set can be cached or refreshed periodically, avoiding an O(n) clustering cost at every generation step.
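For contrast with the clustering route, exact leverage scores of the key matrix can be read off a thin QR factorization: they are the squared row norms of Q. The paper uses a faster approximation with O(n d_k log d_k) cost; this O(n d_k²) exact version is only an illustrative baseline, and the names are ours.

```python
import numpy as np

def leverage_scores(K):
    """Exact statistical leverage scores of the rows of K (n x d_k).

    Illustrative baseline, not the paper's fast approximation: with
    K = QR (thin QR), the leverage score of row i is ||Q_i||_2^2.
    """
    Q, _ = np.linalg.qr(K, mode='reduced')
    return np.sum(Q * Q, axis=1)

def top_keys_by_leverage(K, budget):
    """Rank keys by leverage score and keep the top `budget` indices."""
    scores = leverage_scores(K)
    return np.argsort(scores)[::-1][:budget]
```

Leverage scores always lie in [0, 1] and sum to the rank of K, which makes them a natural query-independent influence measure to truncate at a fixed key budget.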

Limitations include the necessity of ℓ₂‑normalizing keys (the authors note that without normalization, high‑norm noise vectors can dominate the clustering objective) and the lack of formal guarantees for softmax attention—the theoretical results apply to polynomial kernels, with softmax behavior validated empirically.

In summary, the proposed pre‑scoring framework provides a simple yet powerful way to embed a global importance prior into existing efficient attention mechanisms. By coupling query‑independent key selection with query‑dependent locality (e.g., HyperAttention’s LSH), the method shifts the accuracy‑efficiency frontier: it achieves lower perplexity or higher accuracy at the same computational budget, and it generalizes across modalities from language to vision. This work opens a new design axis for future efficient transformer research, emphasizing the complementary role of global, structure‑aware key prioritization.

