Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling

Reading time: 6 minutes

📝 Original Info

  • Title: Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling
  • ArXiv ID: 2512.20198
  • Date: 2025-12-23
  • Authors: Huizheng Wang, Taiquan Wei, Hongbin Wang, Zichuan Wang, Xinru Tang, Zhiheng Yue, Shaojun Wei, Yang Hu, Shouyi Yin (Tsinghua University)

📝 Abstract

Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm–hardware co-design tailored for Transformer inference under LTPP. STAR introduces a leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to 9.2× speedup and 71.2× energy efficiency over A100, and surpassing SOTA accelerators by up to 16.1× energy and 27.1× area efficiency gains. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1× throughput improvement.
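The abstract names a "leading-zero-based sparsity prediction using log-domain add-only operations" without further detail in this excerpt. The sketch below is only a hedged illustration of the general principle that phrase suggests, not the paper's circuit: a leading-zero counter yields floor(log2 |x|) essentially for free in hardware, and in the log domain a multiply becomes an add, so candidate query–key products can be ranked without any multipliers. The helper names (`lzc_log2`, `predicted_score`) and the sample magnitudes are hypothetical.

```python
def lzc_log2(x, width=16):
    """floor(log2 x) for a positive integer, as a leading-zero counter gives it:
    floor(log2 x) = width - 1 - clz(x). `width` and this helper are illustrative."""
    assert 0 < x < (1 << width)
    return x.bit_length() - 1  # equivalent to width - 1 - leading_zero_count(x)

def predicted_score(q_mag, k_mag):
    # Log-domain, add-only proxy for the product |q| * |k|:
    # log2(|q| * |k|) = log2|q| + log2|k|, approximated by integer LZC logs.
    return lzc_log2(q_mag) + lzc_log2(k_mag)

# Rank candidate (|q|, |k|) magnitude pairs with the cheap proxy instead of
# real multiplies; a top-k over such scores could then select which attention
# entries to compute exactly.
pairs = [(200, 3), (40, 40), (7, 5)]
ranked = sorted(pairs, key=lambda p: predicted_score(*p), reverse=True)
print(ranked)
```

Here the proxy ranks the pairs in the same order as their exact products (1600, 600, 35) while using only counts and additions; the trade-off is the usual one-bit coarseness of an integer log estimate.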


📄 Full Content

Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling

Huizheng Wang, Taiquan Wei, Hongbin Wang, Zichuan Wang, Xinru Tang, Zhiheng Yue, Student Member, IEEE, Shaojun Wei, Fellow, IEEE, Yang Hu, Senior Member, IEEE, Shouyi Yin, Fellow, IEEE

Abstract—Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm–hardware co-design tailored for Transformer inference under LTPP. STAR introduces a leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to 9.2× speedup and 71.2× energy efficiency over A100, and surpassing SOTA accelerators by up to 16.1× energy and 27.1× area efficiency gains. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1× throughput improvement.

Index Terms—Transformer, attention sparsity, FlashAttention, top-k, tiling, distributed attention, spatial architecture.

This work was supported in part by the National Science and Technology Major Project under Grant 2022ZD0115200; in part by the NSFC under Grants 62125403, U24A20234, 92464302, and U24B20164; in part by the Beijing S&T Project Z251100008425010; in part by the Shanghai Municipal Science and Technology Major Project; in part by the Natural Science Foundation of Jiangsu Province Basic Research Program under Grant BK20243042; in part by the Beijing National Research Center for Information Science and Technology; in part by the Northern IC Technology Innovation Center (Beijing) Co., Ltd under Grant QYJS20232801B; and in part by the Beijing Advanced Innovation Center for Integrated Circuits. An earlier version of this paper was presented at the 57th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024 [DOI: 10.1109/MICRO61859.2024.00093]. (Corresponding author: Yang Hu, email: hu yang@tsinghua.edu.cn.) Huizheng Wang, Taiquan Wei, Hongbin Wang, Zichuan Wang, Xinru Tang, Zhiheng Yue, Shaojun Wei, and Yang Hu are with the School of Integrated Circuits, Tsinghua University, Beijing 100084, China. Shouyi Yin is with the School of Integrated Circuits, Tsinghua University, Beijing 100084, China, and Shanghai AI Lab, Shanghai 200232, China.

I. INTRODUCTION

Empowered by self-attention, large language models (LLMs) have revolutionized fields such as chatbots [1] and code generation [2]. Self-attention processes three matrices: Q (query), K (key), and V (value). First, the attention matrix A ∈ R^{S×S} is computed as Q × K^T, where S denotes the sequence length. The resulting matrix A is then passed through a softmax function for normalization before being multiplied by V to generate the final output.
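The attention computation just described (scores A = Q K^T, a row-wise softmax, then a multiply by V) can be sketched in a few lines of plain Python. This is the standard dense formulation the paper starts from, not STAR's sparse variant; the 1/sqrt(H) scaling is the conventional normalization.

```python
import math

def self_attention(Q, K, V):
    """Dense self-attention over S x H lists: softmax(Q K^T / sqrt(H)) V."""
    S, H = len(Q), len(Q[0])
    # A = Q K^T / sqrt(H): the S x S attention score matrix
    A = [[sum(Q[i][d] * K[j][d] for d in range(H)) / math.sqrt(H)
          for j in range(S)] for i in range(S)]
    # Row-wise softmax, with max-subtraction for numerical stability
    P = []
    for row in A:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        P.append([x / s for x in e])
    # Output = P V: each output row is a convex combination of the rows of V
    return [[sum(P[i][j] * V[j][d] for j in range(S)) for d in range(H)]
            for i in range(S)]

Q = [[1.0, 0.0], [0.0, 1.0]]
O = self_attention(Q, Q, [[2.0, 0.0], [0.0, 2.0]])
print(len(O), len(O[0]))  # S x H output: 2 2
```

Because every query attends to every key, both the score matrix and the two matmuls scale as S², which is exactly the quadratic cost the rest of the introduction is concerned with.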
[Fig. 1: (a) Normalized memory requirement for attention. (b) Computation breakdown (QKV / Atten / FFN) for Llama 13B as prompt sequence length grows from 2048 to 262144.]

LLMs increasingly demand faster inference and higher throughput, particularly for long-context tasks. However, unlike the linear complexity of O(SH²) in the feed-forward network (FFN), where H is the hidden dimension, the quadratic complexity O(S²H) of self-attention severely hinders the efficiency of LLMs on long sequences. From early models such as BERT [3] to recent ones like LLaMA 4-Maverick [4], the maximum sequence length has expanded over 32× (512 to 16k), whereas the hidden dimension H has increased only 10× (768 to 8k). This dramatic growth of sequence length results in more than a 2000× increase in attention memory footprint, as depicted in Fig. 1(a), creating significant barriers to deploying LLMs across both cloud and edge environments. Moreover, the quadratic computation of self-attention emerges as a critical bottleneck for fast inference. As shown in Fig. 1(b), when the sequence length reaches 16k tokens, attention surpasses the FFN as the mo
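The crossover in Fig. 1(b) follows directly from the two complexity classes: attention FLOPs grow as S²H and FFN FLOPs as SH², so their ratio grows linearly in S. A back-of-envelope check, under assumed constants (4× FFN expansion, the two attention matmuls, a multiply–add counted as 2 FLOPs, and H = 5120 for a Llama-13B-class model; all illustrative, not from the paper):

```python
H = 5120  # hidden dimension of a Llama-13B-class model (illustrative)

def ffn_flops(S, H, expansion=4):
    # Two projections per token, H -> 4H and 4H -> H; 2 FLOPs per multiply-add.
    return 2 * (2 * S * H * expansion * H)

def attn_flops(S, H):
    # The Q K^T and A V matmuls: S*S*H multiply-adds each, 2 FLOPs per MAC.
    return 2 * (2 * S * S * H)

for S in (2048, 8192, 16384, 65536):
    print(S, attn_flops(S, H) / ffn_flops(S, H))  # ratio simplifies to S / (4H)
```

With these rough constants the ratio reaches about 0.8 at S = 16k and passes 1 near S ≈ 20k, broadly consistent with the 16k crossover the figure reports; the exact point depends on the FFN expansion factor and which attention terms are counted.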

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
