PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only to critical key-value blocks, it suffers degradation at high sparsity because it discards context. In this work, we discover that attention scores of non-critical blocks exhibit distributional stability, allowing them to be approximated accurately and efficiently rather than discarded, an insight of central importance for sparse attention design. Motivated by this key insight, we propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity. Unlike the conventional keep-or-drop paradigm, which simply drops non-critical block information, PISA introduces a novel exact-or-approximate strategy: it maintains exact computation for critical blocks while efficiently approximating the remainder through block-wise Taylor expansion. This design allows PISA to serve as a faithful proxy for full attention, effectively bridging the gap between speed and quality. Experimental results demonstrate that PISA achieves 1.91× and 2.57× speedups on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest quality among sparse attention methods. Notably, even for image generation on FLUX, PISA achieves a 1.2× acceleration without compromising visual quality. Code is available at: https://github.com/xie-lab-ml/piecewise-sparse-attention.


💡 Research Summary

Diffusion Transformers (DiTs) have become a cornerstone for high‑fidelity image and video generation, yet their self‑attention mechanism scales quadratically with sequence length, creating a severe bottleneck for high‑resolution or long‑duration generation. Existing block‑sparse attention methods alleviate this by selecting a subset of “critical” key‑value blocks and completely discarding the rest. While effective at modest sparsity, these “keep‑or‑drop” approaches suffer dramatic quality degradation when the sparsity ratio is high because they lose valuable contextual information.

The authors of this paper make a key observation: the pre‑softmax attention scores (QKᵀ) of non‑critical blocks exhibit a stable, symmetric distribution centered around zero or negative values. This distributional stability makes the scores highly amenable to approximation via a mean‑centered Taylor expansion. Leveraging this insight, they propose PISA (Piecewise Sparse Attention), a training‑free sparse attention mechanism that retains the full attention span while reducing computational complexity to sub‑quadratic levels.

PISA works by partitioning the query, key, and value matrices into blocks of size B. For each query block, a Top‑K selection identifies a set of critical blocks Sᵢ; the remaining blocks form the long‑tail set Uᵢ. Critical blocks are processed exactly as in conventional block‑sparse attention. For the long‑tail blocks, PISA applies a two‑stage approximation:
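As a concrete illustration, the partitioning and Top‑K step might look like the following NumPy sketch. All function and variable names are ours, and the block‑importance score (mean‑pooled queries against mean‑pooled keys) is one common choice in block‑sparse attention; the paper's actual selection criterion may differ:

```python
import numpy as np

def select_critical_blocks(Q, K, B, top_k):
    """Partition Q and K into blocks of size B and, for each query block,
    pick the top_k key blocks with the highest coarse block-level score.
    Hypothetical sketch: importance is estimated from mean-pooled blocks."""
    L, d = Q.shape
    nb = L // B                              # number of blocks (assume B | L)
    Qb = Q.reshape(nb, B, d).mean(axis=1)    # mean query per block
    Kb = K.reshape(nb, B, d).mean(axis=1)    # mean key per block
    scores = Qb @ Kb.T                       # (nb, nb) block score map
    # indices of the top_k key blocks for every query block
    return np.argsort(-scores, axis=1)[:, :top_k]

rng = np.random.default_rng(0)
Q = rng.standard_normal((64, 8))
K = rng.standard_normal((64, 8))
crit = select_critical_blocks(Q, K, B=16, top_k=2)
print(crit.shape)  # one row of critical-block indices per query block
```

The selected rows form the sets Sᵢ; every other block index falls into the long‑tail set Uᵢ.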

  1. Zero‑order (0‑th order) approximation – each block is represented by its mean key (\bar{k}_j) and the sum of its values (\sum_n v_{j,n}). The contribution is computed as (\exp(q_t \cdot \bar{k}_j) \sum_n v_{j,n}).

  2. First‑order correction – the block‑wise deviation ((k_{j,n} - \bar{k}_j) v_{j,n}) is aggregated into a matrix (H_j). Rather than handling each (H_j) individually (which would be memory‑bound), PISA replaces the weighted sum of the (H_j) with a global average (\bar{H}) multiplied by a query‑dependent scalar (\beta_t). The scalar is derived from the average of (\exp(q_t \cdot \bar{k}_j)) over all unselected blocks, providing a more accurate slope than a naïve Taylor coefficient.
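The two stages follow from the expansion (\exp(q_t \cdot k_{j,n}) \approx \exp(q_t \cdot \bar{k}_j)\,(1 + q_t \cdot (k_{j,n} - \bar{k}_j))), which splits a block's contribution into the zero‑order term and a first‑order term driven by (H_j). A toy numeric check of that split (our own illustration, not the paper's kernel):

```python
import numpy as np

# One long-tail block: compare its exact contribution against the
# zero-order term alone and against zero-order plus first-order.
rng = np.random.default_rng(0)
B, d = 16, 8
K = rng.standard_normal((B, d))
V = rng.standard_normal((B, d))
q = 0.1 * rng.standard_normal(d)        # small scores: Taylor regime

kbar = K.mean(axis=0)                    # mean key of the block
H = (K - kbar).T @ V                     # H = sum_n (k_n - kbar) v_n^T

exact = np.exp(K @ q) @ V                # true block contribution
zero = np.exp(q @ kbar) * V.sum(axis=0)  # zero-order approximation
first = np.exp(q @ kbar) * (q @ H)       # first-order correction

err0 = np.linalg.norm(exact - zero) / np.linalg.norm(exact)
err1 = np.linalg.norm(exact - (zero + first)) / np.linalg.norm(exact)
print(f"zero-order error: {err0:.3f}, with first-order: {err1:.3f}")
```

The residual after the first‑order correction is second order in the score deviations, which is why it shrinks sharply when attention scores within a block stay close to their mean.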

Both the numerator and denominator of the softmax are modified with these approximations, preserving the intrinsic normalization of attention weights. The authors prove (Theorem 3.1) that the error introduced by the global first‑order replacement is bounded by the product of the tail probability mass (\rho_t) and the variance of the block matrices (M). Since (\rho_t) is small for non‑critical blocks (their exponentials are tiny), the overall error remains negligible.
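In our notation, reconstructed from the description above (so details may differ from the paper's exact statement), the resulting piecewise softmax for query (q_t) reads:

```latex
o_t \approx
\frac{\displaystyle\sum_{j\in S_t}\sum_n e^{q_t\cdot k_{j,n}}\, v_{j,n}
      \;+\; \sum_{j\in U_t} e^{q_t\cdot \bar{k}_j}\sum_n v_{j,n}
      \;+\; \beta_t\,\lvert U_t\rvert\,\big(q_t \bar{H}\big)}
     {\displaystyle\sum_{j\in S_t}\sum_n e^{q_t\cdot k_{j,n}}
      \;+\; B\sum_{j\in U_t} e^{q_t\cdot \bar{k}_j}},
\qquad
\beta_t = \frac{1}{\lvert U_t\rvert}\sum_{j\in U_t} e^{q_t\cdot \bar{k}_j}
```

The denominator needs no first‑order term because the within‑block key deviations sum to zero by construction.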

Implementation-wise, PISA follows a three‑phase pipeline:

  • Prepare Phase – a single pass pre‑computes block‑wise means of Q and K, block sums of V, and the global statistic (\bar{H}).
  • Mask Phase – a Top‑K operation selects critical blocks and builds a binary mask.
  • Fused Attention Kernel – a custom GPU kernel dynamically switches between exact computation for selected blocks and the two‑stage approximation for the rest. The global first‑order term (\bar{H}) is loaded once, avoiding the memory‑bound bottleneck of per‑block corrections.
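The three phases can be sketched end‑to‑end on toy data, with NumPy standing in for the fused GPU kernel. This is a minimal sketch under our own naming and a small‑score regime where the Taylor approximation is accurate; a keep‑or‑drop baseline is included for contrast:

```python
import numpy as np

def sparse_attention(Q, K, V, B=16, top_k=2, mode="pisa"):
    """mode="pisa": exact-or-approximate tail; mode="drop": keep-or-drop."""
    L, d = Q.shape
    nb = L // B                                # assume B divides L
    Kb, Vb = K.reshape(nb, B, d), V.reshape(nb, B, d)

    # Prepare phase: one pass of block-wise statistics.
    kbar = Kb.mean(axis=1)                     # mean key per block
    vsum = Vb.sum(axis=1)                      # value sum per block
    H = np.einsum('jbd,jbe->jde', Kb - kbar[:, None, :], Vb)
    Hbar = H.mean(axis=0)                      # global first-order statistic

    # Mask phase: Top-K critical key blocks per query block.
    qbar = Q.reshape(nb, B, d).mean(axis=1)
    crit = np.argsort(-(qbar @ kbar.T), axis=1)[:, :top_k]

    # "Fused" phase: exact for selected blocks, approximate the rest.
    out = np.zeros_like(Q)
    for t in range(L):
        q, sel = Q[t], set(crit[t // B])
        num, den, w_tail = np.zeros(d), 0.0, []
        for j in range(nb):
            if j in sel:                       # exact block computation
                s = np.exp(Kb[j] @ q)
                num += s @ Vb[j]
                den += s.sum()
            elif mode == "pisa":               # zero-order approximation
                w = np.exp(q @ kbar[j])
                num += w * vsum[j]
                den += w * B
                w_tail.append(w)
        if mode == "pisa":
            beta = np.mean(w_tail)             # query-dependent slope
            num += beta * len(w_tail) * (q @ Hbar)  # shared 1st-order term
        out[t] = num / den
    return out

rng = np.random.default_rng(1)
L, d = 64, 8
Q = 0.1 * rng.standard_normal((L, d))          # small scores
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

S = Q @ K.T                                    # dense reference
A = np.exp(S - S.max(axis=1, keepdims=True))
dense = (A / A.sum(axis=1, keepdims=True)) @ V

pisa = sparse_attention(Q, K, V, mode="pisa")
drop = sparse_attention(Q, K, V, mode="drop")
err_pisa = np.linalg.norm(pisa - dense) / np.linalg.norm(dense)
err_drop = np.linalg.norm(drop - dense) / np.linalg.norm(dense)
print(f"pisa error: {err_pisa:.3f}, keep-or-drop error: {err_drop:.3f}")
```

Even on unstructured random data, covering the tail with the approximation tracks dense attention far more closely than discarding it; on real attention maps, where tail weights are tiny, the gap is larger still.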

Empirical evaluation spans three major diffusion models:

  • Wan2.1‑14B (video generation) – PISA achieves a 2.14× speedup (latency reduced from 1468 s to 687 s) while maintaining PSNR ≈ 24.7 and LPIPS ≈ 0.13, essentially matching dense attention quality.
  • Hunyuan‑Video – a 2.57× acceleration is reported with comparable visual fidelity.
  • FLUX (text‑to‑image) – despite the lower inherent sparsity of image generation, PISA still delivers a 1.2× speedup without perceptible degradation.

Across all benchmarks, PISA consistently outperforms prior sparse attention methods (e.g., SparseAttn) in both speed and quality, especially at high sparsity ratios (r = 85%). Importantly, because PISA does not alter the attention distribution, it can reuse pretrained weights directly, eliminating the need for costly fine‑tuning.

In summary, PISA introduces a principled “exact‑or‑approximate” paradigm that leverages the statistical regularities of non‑critical attention scores. By integrating block‑wise zero‑order approximations with a globally shared first‑order correction inside the softmax, it achieves near‑lossless fidelity at sub‑quadratic complexity. This work represents a significant step toward practical, real‑time, high‑resolution diffusion generation without sacrificing the quality of pretrained models.

