NOSA: Native and Offloadable Sparse Attention

Reading time: 2 minutes
...

📝 Original Info

  • Title: NOSA: Native and Offloadable Sparse Attention
  • ArXiv ID: 2510.13602
  • Date: 2025-10-15
  • Authors: Not listed in the provided data (to be confirmed from the paper body or the arXiv page)

📝 Abstract

Decoding throughput improvements from larger inference batches are limited by GPU memory, which is largely consumed by the key-value (KV) cache. Prior training-free KV cache offloading alleviates this by keeping redundant context on the CPU and fetching only a sparse subset for attention, but it often degrades long-generation quality due to training-inference mismatch on sparse patterns. Meanwhile, trainable sparse attention is incompatible with efficient offloading, as unconstrained KV accesses may force large CPU-to-GPU transfers and erase throughput gains. To this end, we propose NOSA, a trainable sparse attention mechanism natively designed for KV cache offloading. NOSA explicitly constrains the volume of CPU-GPU KV transfers, thereby achieving low communication overhead and high decoding throughput. We further build NOSI, a KV cache offloading inference system that fully unlocks NOSA's efficiency. Empirical results on 1B, 3B, and 8B LLMs demonstrate that NOSA outperforms KV cache offloading baselines on general, long-input, and long-generation tasks, while boosting decoding throughput by up to 5.04x, 1.92x, and 1.83x over FullAttn, InfLLMv2, and ShadowKV, respectively. We release our code at https://github.com/thunlp/NOSA.
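To make the mechanism in the abstract concrete, below is a minimal, illustrative PyTorch sketch of the general offloading-with-sparse-fetch pattern: the full KV cache stays in CPU memory, a bounded number of KV blocks is selected per decoding step, and only those blocks are copied to the GPU for attention. This is not NOSA's actual implementation; the block-scoring heuristic, function names, and tensor shapes are assumptions made for illustration (NOSA's selection is learned, and its transfer constraint is enforced natively during training).

```python
# Illustrative sketch only (hypothetical names), not the NOSA codebase:
# the full KV cache lives on the CPU; each decoding step fetches at most
# top_k_blocks KV blocks to the GPU, which bounds per-step transfer volume.
import torch

def sparse_offloaded_attention(q, k_cpu, v_cpu, block_size=64, top_k_blocks=8):
    """q: (heads, dim) query for the current token, on the GPU.
    k_cpu, v_cpu: (seq_len, heads, dim) KV cache kept in (pinned) CPU memory.
    Returns attention output computed over only the fetched KV blocks."""
    heads, dim = q.shape
    seq_len = k_cpu.shape[0]
    n_blocks = (seq_len + block_size - 1) // block_size

    # 1) Score blocks cheaply on the CPU with per-block mean keys
    #    (a common proxy; NOSA's learned selection rule differs).
    pad = n_blocks * block_size - seq_len
    k_padded = torch.nn.functional.pad(k_cpu, (0, 0, 0, 0, 0, pad))
    block_means = k_padded.view(n_blocks, block_size, heads, dim).mean(dim=1)
    scores = torch.einsum("hd,bhd->bh", q.cpu(), block_means).sum(dim=-1)
    chosen = scores.topk(min(top_k_blocks, n_blocks)).indices

    # 2) Fetch only the chosen blocks to the GPU: the bounded CPU-to-GPU transfer.
    idx = torch.cat([
        torch.arange(b * block_size, min((b + 1) * block_size, seq_len))
        for b in chosen.tolist()
    ])
    k_gpu = k_cpu[idx].to(q.device, non_blocking=True)  # (selected, heads, dim)
    v_gpu = v_cpu[idx].to(q.device, non_blocking=True)

    # 3) Standard scaled dot-product attention over the fetched subset only.
    attn = torch.einsum("hd,shd->hs", q, k_gpu) / dim ** 0.5
    weights = attn.softmax(dim=-1)
    return torch.einsum("hs,shd->hd", weights, v_gpu)
```

In a real offloading system the selected blocks would be copied asynchronously from pinned CPU memory and reused across steps when possible; the point of capping `top_k_blocks` is that per-step transfer volume stays bounded regardless of context length, which is the property NOSA bakes into training.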

💡 Deep Analysis

📄 Full Content

📸 Image Gallery

  • attn_ratio.png
  • cache_hit_rate.png
  • comm_kernel.png
  • framework.png
  • infllmv2_locality.png
  • locality_compare_with_stats.png
  • memory_hierarchy.png
  • memory_manager.png

Reference

This content was AI-processed from open-access arXiv data.
