Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
📝 Abstract
Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages two known observations: 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g. tile-level operations) across both prefill and decode attention. The Top-k selection and reuse in Kascade is head-aware, and we show in our experiments that this is critical for high accuracy. Kascade achieves up to 4.1× speedup in decode attention and 2.2× speedup in prefill attention over a FlashAttention-3 baseline on H100 GPUs, while closely matching dense attention accuracy on long-context benchmarks such as LongBench and AIME-24.
📄 Content
Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee

The source code of Kascade will be available at https://github.com/microsoft/kascade.

1 INTRODUCTION

Large language models are increasingly deployed in settings that demand long contexts: chain-of-thought style reasoning, multi-step tool use, retrieval-augmented generation over multi-document corpora, coding agents, etc. In long-context inference, the computation cost is dominated by the attention operation in both the prefill phase (where attention is O(n²) for context length n, compared to the O(n) MLP operation) and the decode phase (O(n) attention vs. O(1) MLP).
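The prefill/decode asymmetry above can be made concrete with a rough per-layer FLOP model. This is an illustrative sketch, not the paper's analysis; the model dimensions (`d_head`, `n_heads`, `d_model`, `ffn_mult`) are hypothetical defaults.

```python
def attention_vs_mlp_share(n, d_head=128, n_heads=32, d_model=4096, ffn_mult=4):
    """Fraction of per-layer FLOPs spent in attention, for prefill over an
    n-token prompt and for decoding one token against an n-token KV cache.

    Rough model: QK^T and PV each cost ~2*q*m*d_head FLOPs per head
    (q queries, m keys), and the MLP costs ~2 * 2 * d_model * (ffn_mult *
    d_model) FLOPs per token (up- and down-projection matmuls).
    """
    mlp_per_token = 2 * 2 * d_model * (ffn_mult * d_model)
    # Prefill: every one of n queries attends to up to n keys -> O(n^2).
    prefill_attn = 2 * 2 * n * n * d_head * n_heads
    prefill_share = prefill_attn / (prefill_attn + n * mlp_per_token)
    # Decode: a single new query attends to all n cached keys -> O(n).
    decode_attn = 2 * 2 * 1 * n * d_head * n_heads
    decode_share = decode_attn / (decode_attn + mlp_per_token)
    return prefill_share, decode_share
```

Under this model the attention share grows with context length in both phases, matching the O(n²)-vs-O(n) and O(n)-vs-O(1) scaling above.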
Moreover, decode attention is memory-bandwidth bound and therefore does not benefit much from batching, making it inefficient on modern GPUs. The attention operation is expensive because each token has to attend to all previous context tokens. A common way to decrease this cost is sparse attention, where the attention function is approximated using only a subset of the context tokens. Numerous sparse attention methods have been proposed, including fixed-pattern (Beltagy et al., 2020; Xiao et al., 2023; Zaheer et al., 2020; Jiang et al., 2024), workload-aware (Gim et al., 2024; Yao et al., 2025; Lu et al., 2024; Ma et al., 2025), and dynamic-sparsity variants (Singhania et al., 2024; Zhang et al., 2023; Tang et al., 2024; Yang et al., 2025c; Gao et al., 2024; 2025). However, some of these methods require model retraining or sacrifice generality across tasks.

In this paper, we present Kascade, a dynamic-sparsity technique that reduces the cost of attention significantly while retaining the accuracy of dense attention. Compared to other training-free sparse attention schemes, we find that Kascade achieves the best accuracy on AIME-24 at a given sparsity ratio, as shown in Table 2.

Kascade leverages two known observations: 1) the post-softmax attention scores are inherently sparse, and 2) the sparsity structure is stable across nearby layers. Figure 1 shows the sparsity inherent in the attention operation. As shown, only 256 tokens (about 10%) contribute over 95% of the softmax output. This is intuitive, as the softmax operation exponentially amplifies the relative magnitude of larger values compared to smaller ones. Thus, given an oracle that determines the Top-k tokens contributing most to the attention operation, we can obtain a very accurate approximation of the operation at a fraction of the cost.

¹Microsoft Research India. Correspondence to: Saurabh Goyal <saurabh.goyal@microsoft.com>.
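The effect of this concentration can be seen in a small numpy sketch. It uses synthetic data (a single head whose score distribution is made peaked by construction; real models exhibit this concentration naturally, as in Figure 1), counts how many tokens cover 95% of the softmax mass, and compares oracle Top-k attention against the dense output:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, hot = 2048, 64, 32  # context length, head dim, "important" tokens

# One synthetic attention head: a handful of keys are boosted toward the
# query direction so the score distribution is peaked, mimicking the
# concentration observed in trained models.
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
K[:hot] += 8.0 * q / np.linalg.norm(q)
V = rng.standard_normal((n, d))

scores = K @ q / np.sqrt(d)
w = np.exp(scores - scores.max())
w /= w.sum()  # post-softmax attention weights

# How many tokens cover 95% of the softmax mass?
order = np.argsort(w)[::-1]
n95 = int(np.searchsorted(np.cumsum(w[order]), 0.95)) + 1

# Oracle Top-k attention: softmax restricted to the k largest scores.
topk = order[:n95]
w_k = np.exp(scores[topk] - scores[topk].max())
w_k /= w_k.sum()
rel_err = np.linalg.norm(w @ V - w_k @ V[topk]) / np.linalg.norm(w @ V)
```

Here `n95` comes out far below `n`, and the Top-k output stays close to the dense output, even though only a small fraction of keys and values are read.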
Figure 2 shows the accuracy of Oracle Top-k with varying values of k. As shown, with just 2.5% of tokens, one can recover almost the full accuracy of dense attention. However, computing these Top-k values efficiently is a fundamental challenge, as exact computation entails reading all O(n) keys and computing the softmax. This is where we leverage the second observation: the exact Top-k of layer i is very close to the exact Top-k of layer i+m for reasonable values of m. Figure 3 illustrates this observation. For example, the Top-k of layer 16 captures 99% of the Top-k attention of layers 17 and 18. These observations motivate our solution: we compute full attention, and identify Top-k tokens, only on a subset of layers, which we call anchor layers, and reuse those Top-k tokens to compute sparse attention in the intermediate layers. To identify the best subset of anchor layers, we propose an automated dynamic-programming scheme that maximizes cross-layer similarity scores.

arXiv:2512.16391v1 [cs.LG] 18 Dec 2025
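The anchor-selection idea can be sketched as a segmentation problem. This is an illustrative reconstruction, not the paper's exact objective or implementation: assume a matrix `sim[a][j]` estimating, on a development set, how well layer `a`'s Top-k indices cover layer `j`'s (e.g. mean Top-k overlap); assign every layer to its nearest preceding anchor, so the layers split into contiguous segments, which a dynamic program solves exactly.

```python
def select_anchors(sim, num_anchors):
    """Pick `num_anchors` anchor layers maximizing total cross-layer
    similarity, where each layer reuses the Top-k of the nearest
    preceding anchor (layer 0 must therefore be an anchor)."""
    L = len(sim)
    # seg[a][e]: total similarity if layer a anchors layers a..e.
    seg = [[0.0] * L for _ in range(L)]
    for a in range(L):
        acc = 0.0
        for e in range(a, L):
            acc += sim[a][e]
            seg[a][e] = acc
    NEG = float("-inf")
    # best[m][e]: best score covering layers 0..e with m anchors;
    # choice[m][e]: anchor (start) of the last segment.
    best = [[NEG] * L for _ in range(num_anchors + 1)]
    choice = [[-1] * L for _ in range(num_anchors + 1)]
    for e in range(L):
        best[1][e] = seg[0][e]
        choice[1][e] = 0
    for m in range(2, num_anchors + 1):
        for e in range(m - 1, L):
            for a in range(m - 1, e + 1):  # position of the last anchor
                cand = best[m - 1][a - 1] + seg[a][e]
                if cand > best[m][e]:
                    best[m][e] = cand
                    choice[m][e] = a
    # Backtrack the chosen anchor layers.
    anchors, e, m = [], L - 1, num_anchors
    while m >= 1:
        a = choice[m][e]
        anchors.append(a)
        e, m = a - 1, m - 1
    return anchors[::-1]
```

For instance, if layers form two blocks whose Top-k sets agree only within each block, the DP places one anchor at the start of each block. The real method additionally tracks Top-k per head (head-aware reuse), which this sketch omits.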