Prism: Spectral-Aware Block-Sparse Attention
Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level search or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a “blind spot” for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.
💡 Research Summary
Block‑sparse attention has emerged as a promising technique for accelerating the pre‑filling phase of large language models (LLMs) that need to handle very long contexts. The key challenge lies in efficiently estimating which blocks of tokens are important without computing the full quadratic attention matrix. Existing training‑free methods typically compress each block by mean‑pooling its token embeddings and then compute a coarse‑grained attention matrix on these pooled vectors. However, when rotary positional embeddings (RoPE) are used, mean‑pooling unintentionally acts as a low‑pass filter: high‑frequency dimensions—those that encode fine‑grained relative positions—undergo destructive interference and their signal magnitude collapses to near zero. The authors provide a rigorous theoretical analysis, showing that the pooled vector’s magnitude in dimension $j$ follows $\lambda_j(B) = \frac{1}{B}\cdot\frac{\sin(B\theta_j/2)}{\sin(\theta_j/2)}$, which approaches zero whenever the block size $B$ spans an integer number of rotation periods (i.e., whenever $B\theta_j$ is a multiple of $2\pi$). Empirically, they demonstrate that for a typical block size of 128, the first ~30 dimensions (the “Blind Spot”) lose almost all energy after pooling, while low‑frequency dimensions retain their signal.
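The attenuation factor above is easy to verify numerically. The sketch below is illustrative, not the authors' code: it assumes standard RoPE frequencies θ_j = base^(−2j/d) with head dimension 128 and base 10000, and evaluates λ_j(B) for block size B = 128.

```python
import numpy as np

# Illustrative sketch (not the paper's code): evaluate the pooling
# attenuation factor lambda_j(B) = (1/B) * sin(B*theta_j/2) / sin(theta_j/2)
# under standard RoPE frequencies theta_j = base**(-2j/d).
d, base, B = 128, 10000.0, 128           # head dim, RoPE base, block size (assumed values)
j = np.arange(d // 2)                    # one frequency per rotated coordinate pair
theta = base ** (-2.0 * j / d)           # high frequencies at small j, low at large j

lam = np.abs(np.sin(B * theta / 2) / (B * np.sin(theta / 2)))

# High-frequency dimensions (small j) are crushed toward zero after
# mean pooling -- the "blind spot" -- while low-frequency dimensions
# (large j) pass through nearly unattenuated.
print(lam[:4])    # near-zero
print(lam[-4:])   # close to 1
```

Running this reproduces the qualitative picture described in the summary: the first few dozen frequency indices carry almost no energy after pooling, which is exactly the local positional signal (e.g., slash patterns) that block selection then fails to see.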
To overcome this spectral bias, the paper introduces Prism, a training‑free framework that separates block importance estimation into two parallel branches: a high‑frequency branch (the first half of the embedding dimensions) and a low‑frequency branch (the second half). Each branch independently mean‑pools its respective sub‑vectors, computes block‑level query/key dot products, and applies softmax to obtain block scores. Because the high‑frequency branch suffers from attenuated magnitudes, Prism applies an energy‑based temperature calibration: the temperature for each branch is derived from the average RMS energy of that branch across all blocks, effectively scaling up the logits of the weakened high‑frequency scores. The two calibrated scores are then linearly combined (with an automatically determined weighting) and top‑k or top‑p selection is performed, yielding the final block mask. Crucially, all operations remain at the block level; no token‑level rescoring is required, eliminating the selection overhead that plagues prior methods.
The authors evaluate Prism on a wide range of long‑context tasks, including language modeling (PG‑19), comprehension (LongBench), retrieval (RULER), and video understanding (VideoMME, LongVideoBench). Across these benchmarks, Prism matches full‑attention accuracy within 0.1 % and consistently outperforms existing block‑sparse baselines. In terms of speed, Prism achieves up to 5.1× acceleration over FlashAttention‑based full attention at 128 K tokens, and 2.3–3.8× over the best prior sparse methods. Ablation studies confirm that both the spectral separation and the temperature calibration are essential: removing either component degrades accuracy by 1–2 % or re‑introduces instability.
In summary, the paper makes three major contributions: (1) a theoretical insight that mean‑pooling under RoPE behaves as a low‑pass filter, creating a “blind spot” for local positional information; (2) the Prism algorithm, which restores the lost high‑frequency signals via dual‑branch scoring and energy‑based temperature scaling, all without any training or token‑level computation; and (3) extensive empirical evidence that Prism delivers state‑of‑the‑art accuracy‑speed trade‑offs for long‑context LLMs. This work opens the door to more practical deployment of very long‑context models in real‑time applications.