HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing
This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer’s token selection and KV caches directly from the preceding full attention layer. This architecture resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without saving KV cache. HySparse enables sparse attention layers to reuse the full attention KV cache, thereby reducing both computation and memory. We evaluate HySparse on both 7B dense and 80B MoE models. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers employ full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.
💡 Research Summary
The paper introduces HySparse, a hybrid sparse attention architecture that interleaves full‑attention layers with several sparse‑attention layers. The key innovation lies in using the immediately preceding full‑attention layer as an "oracle" that both selects the most important tokens and provides the key‑value (KV) caches for the subsequent sparse layers. This eliminates the auxiliary token‑importance predictors that many prior sparse‑attention methods rely on, simplifying the design and improving selection accuracy. Moreover, by reusing the KV cache generated by the full‑attention layer, HySparse cuts memory consumption by nearly an order of magnitude while also reducing computational cost.
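The interleaving pattern can be sketched as a simple layer schedule. This is an illustrative sketch only: the helper name and the assumption of a fixed, evenly spaced pattern are mine, and the paper does not specify exactly where its full‑attention layers are placed.

```python
def hybrid_schedule(n_layers, sparse_per_full):
    """Illustrative HySparse-style layer schedule: one full-attention
    layer followed by `sparse_per_full` sparse layers, each of which
    reuses the preceding full layer's KV cache and token selection.
    (Hypothetical helper; the paper's actual layer placement may differ.)"""
    pattern = ["full"] + ["sparse"] * sparse_per_full
    return [pattern[i % len(pattern)] for i in range(n_layers)]
```

With 49 layers and 9 sparse layers per full layer, this schedule yields 5 full‑attention layers, matching the 80B MoE configuration described in the paper.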
The architecture alternates between full‑attention and sparse‑attention blocks. In a full‑attention block, standard Q, K, V projections are computed and the attention‑score matrix S is formed. S is then aggregated block‑wise (e.g., in windows of 128 tokens), and the top‑K important tokens are identified per block. Their indices are stored for use by the subsequent sparse‑attention blocks. The sparse block employs a block‑sparse attention mechanism: within each block, full attention is performed, while across blocks only the previously selected tokens are attended to. Crucially, the sparse block neither recomputes nor stores new K and V matrices; it reads them directly from the preceding full‑attention block, sharing its cache. This cache sharing eliminates the extra memory overhead that typically plagues sparse‑attention designs.
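The select-and-reuse step above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration under stated assumptions, not the paper's exact algorithm: the function names are hypothetical, and the aggregation rule here (summing the attention mass each key block receives over all queries, then keeping the top-k blocks) is one simple reading of the block-wise aggregation the paper describes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def oracle_select_blocks(S, block_size, k):
    """From the full layer's attention-score matrix S (n_q x n_kv),
    sum the attention mass received by each key block and return the
    sorted indices of the top-k blocks. (Simplified aggregation rule.)"""
    n_q, n_kv = S.shape
    n_blocks = n_kv // block_size
    mass = S[:, :n_blocks * block_size] \
        .reshape(n_q, n_blocks, block_size).sum(axis=(0, 2))
    return np.sort(np.argsort(mass)[-k:])

def sparse_attention(Q, K_shared, V_shared, block_ids, block_size):
    """Sparse layer: attend only to the selected key blocks, reading
    K and V directly from the preceding full layer's cache (no new
    KV projections are computed or stored)."""
    cols = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in block_ids])
    scores = Q @ K_shared[cols].T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V_shared[cols]
```

Note that `sparse_attention` takes `K_shared` and `V_shared` as inputs rather than computing its own projections; this is precisely where the KV cache sharing happens.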
Experiments were conducted on two model families: a 7‑billion‑parameter dense transformer and an 80‑billion‑parameter mixture‑of‑experts (MoE) model. In the 80B MoE, only five of the 49 layers use full attention; the remaining layers use HySparse's sparse attention. Across all settings, HySparse consistently outperformed both the pure full‑attention baseline and hybrid sliding‑window attention (SWA) baselines in perplexity and downstream‑task accuracy. In the 80B MoE case, KV cache storage was reduced almost tenfold without sacrificing inference speed, demonstrating that HySparse can alleviate the memory bottleneck that limits the deployment of very large LLMs.
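The reported memory figure follows directly from the layer counts: if only full-attention layers store a KV cache while sparse layers read the shared one, per-token KV storage scales with the number of full layers. A quick back-of-envelope check, assuming roughly equal per-layer cache sizes:

```python
full_layers, total_layers = 5, 49
# Only the 5 full-attention layers hold fresh KV; the other 44 sparse
# layers reuse those caches, so storage shrinks by total/full.
reduction = total_layers / full_layers
print(f"{reduction:.1f}x")  # prints "9.8x", i.e. "nearly 10x"
```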
In summary, HySparse resolves two fundamental limitations of earlier sparse‑attention approaches: (1) it replaces external token‑importance proxies with an exact oracle derived from the full‑attention layer, and (2) it enables sparse layers to share the KV cache, achieving simultaneous reductions in computation and memory. The result is a more efficient, scalable attention mechanism that delivers better performance for large language models while dramatically lowering resource requirements.