Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling

February 23, 2026

Reading time: 6 minutes


📝 Original Info

Title: Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling
ArXiv ID: 2512.20198
Date: 2025-12-23
Authors: Huizheng Wang, Taiquan Wei, Hongbin Wang, Zichuan Wang, Xinru Tang, Zhiheng Yue, Shaojun Wei, Yang Hu, Shouyi Yin

📝 Abstract

Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces a leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to 9.2$\times$ speedup and 71.2$\times$ energy efficiency over A100, and surpassing SOTA accelerators by up to 16.1$\times$ energy and 27.1$\times$ area efficiency gains. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1$\times$ throughput improvement.

💡 Deep Analysis

Deep Dive into Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling.

📄 Full Content

Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling

Huizheng Wang, Taiquan Wei, Hongbin Wang, Zichuan Wang, Xinru Tang, Zhiheng Yue, Student Member, IEEE, Shaojun Wei, Fellow, IEEE, Yang Hu, Senior Member, IEEE, Shouyi Yin, Fellow, IEEE

Abstract—Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm–hardware co-design tailored for Transformer inference under LTPP. STAR introduces a leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to 9.2× speedup and 71.2× energy efficiency over A100, and surpassing SOTA accelerators by up to 16.1× energy and 27.1× area efficiency gains. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1× throughput improvement.

Index Terms—Transformer, attention sparsity, FlashAttention, top-k, tiling, distributed attention, spatial architecture.

This work was supported in part by the National Science and Technology Major Project under Grant 2022ZD0115200; in part by the NSFC under Grant 62125403, Grant U24A20234, Grant 92464302 and Grant U24B20164; in part by the Beijing S&T Project Z251100008425010; in part by Shanghai Municipal Science and Technology Major Project; the Natural Science Foundation of Jiangsu Province Basic Research Program under Grant BK20243042; in part by the Beijing National Research Center for Information Science and Technology; in part by the Northern IC Technology Innovation Center (Beijing) Co., Ltd under Grant QYJS20232801B; and in part by the Beijing Advanced Innovation Center for Integrated Circuits. An earlier version of this paper was presented at the IEEE 57th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024 [DOI: 10.1109/MICRO61859.2024.00093]. (Corresponding author: Yang Hu, email: hu yang@tsinghua.edu.cn).

Huizheng Wang, Taiquan Wei, Hongbin Wang, Zichuan Wang, Xinru Tang, Zhiheng Yue, Shaojun Wei, and Yang Hu are with the School of Integrated Circuits, Tsinghua University, Beijing, 100084, China. Shouyi Yin is with the School of Integrated Circuits, Tsinghua University, Beijing, 100084, China, and Shanghai AI Lab, Shanghai, 200232, China.

I. INTRODUCTION

Empowered by self-attention, large language models (LLMs) have revolutionized fields such as chatbots [1] and code generation [2]. The self-attention processes three matrices: Q (query), K (key) and V (value). First, the attention matrix $A \in \mathbb{R}^{S \times S}$ is computed by $Q \times K^T$, where $S$ denotes the sequence length. The resulting matrix $A$ is then passed through a softmax function for normalization before being multiplied by V to generate the final output.
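The attention computation described in the introduction (scores from $Q \times K^T$, a row-wise softmax, then multiplication by V) can be sketched in a few lines of plain Python. This is only an illustrative dense baseline, not the STAR algorithm: the function names `softmax` and `dense_attention` are hypothetical, matrices are represented as lists of rows, and the usual $1/\sqrt{d}$ score scaling is omitted because the text above does not mention it.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)  # subtract the row maximum so exp() cannot overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dense_attention(Q, K, V):
    """Dense self-attention softmax(Q K^T) V for S x H matrices given as lists of rows.

    Illustrative sketch only: no scaling, masking, or multi-head handling.
    """
    # Step 1: score matrix A (S x S); A[i][j] is the dot product of query i and key j.
    A = [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K] for q_row in Q]
    # Step 2: normalize every row of A with softmax.
    P = [softmax(row) for row in A]
    # Step 3: each output row is a weighted sum of the value rows.
    width = len(V[0])
    return [
        [sum(p * v_row[h] for p, v_row in zip(p_row, V)) for h in range(width)]
        for p_row in P
    ]


# Tiny example with S = 2 tokens and H = 2 features (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(dense_attention(Q, K, V))
```

Note that the intermediate matrices `A` and `P` each hold S × S entries, which is where the quadratic cost in sequence length comes from.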
[Fig. 1: (a) Normalized memory requirement for attention across BERT-B, BERT-L, GPT-2, Bloom-3B, Llama-13B, and LLaMA 4-Maverick. (b) Computation breakdown for the Llama 13B (QKV, attention, and FFN), plotted as normalized complexity (%) against prompt sequence length from 2048 to 262144.]

LLMs increasingly demand faster inference and higher throughput, particularly for long-context tasks. However, unlike the linear complexity of $O(SH^2)$ in the feed-forward network (FFN), where $H$ is the hidden dimension, the quadratic complexity $O(S^2H)$ of self-attention severely hinders the efficiency of LLMs on long sequences. From early models such as BERT [3] to recent ones like LLaMA 4-Maverick [4], the maximum sequence length has expanded over 32× (512 to 16k), whereas the hidden dimension $H$ has increased only 10× (768 to 8k). This dramatic growth of sequence length results in more than a 2000× increase in attention memory footprint, as depicted in Fig. 1(a), creating significant barriers to deploying LLMs across both cloud and edge environments.

Moreover, the quadratic computation of self-attention emerges as a critical bottleneck for fast inference. As shown in Fig. 1(b), when the sequence length reaches 16k tokens, attention surpasses the FFN as the mo

…(Full text truncated)…
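To make the complexity comparison from the introduction concrete, the sketch below evaluates the two asymptotic terms, $O(S^2H)$ for attention and $O(SH^2)$ for the FFN, with all constant factors dropped. Under that simplification the ratio reduces to $S/H$, so attention dominates once the sequence length exceeds the hidden dimension. The function names and the choice of $H = 8192$ (standing in for the "8k" mentioned in the text) are assumptions for illustration; the sketch ignores head count, FFN expansion ratio, and other constants, so it makes no attempt to reproduce the exact values in Fig. 1.

```python
def attention_cost(S, H):
    """Asymptotic attention cost S^2 * H (constant factors dropped)."""
    return S * S * H


def ffn_cost(S, H):
    """Asymptotic feed-forward cost S * H^2 (constant factors dropped)."""
    return S * H * H


def attention_to_ffn_ratio(S, H):
    """Ratio of the two asymptotic terms; algebraically equal to S / H."""
    return attention_cost(S, H) / ffn_cost(S, H)


# Assumed hidden dimension, standing in for the "8k" in the text.
H = 8192

# Sequence lengths taken from the x-axis of Fig. 1(b).
for S in (2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144):
    print(f"S={S:>6}  attention/FFN ratio = {attention_to_ffn_ratio(S, H):.2f}")
```

In this simplified model, doubling the sequence length doubles the attention-to-FFN ratio, which is the trend the computation breakdown in Fig. 1(b) illustrates.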

📄 Read Full PDF on ArXiv