FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.


💡 Research Summary

FlashBlock addresses the inefficiency that remains in block diffusion models when handling long contexts. Block diffusion already improves inference speed by caching key‑value (KV) states and processing tokens in contiguous blocks, but each diffusion step still recomputes attention over the entire growing cache. The authors observe a strong cross‑step redundancy: attention contributions from tokens outside the current block (block‑external attention) remain almost unchanged across adjacent diffusion steps, whereas attention among tokens inside the block (block‑internal attention) varies significantly as the block is refined.

Leveraging this observation, FlashBlock introduces a caching mechanism for block‑external attention. At the first diffusion step of a block, the model computes the full attention, stores the external attention output A_out and its log‑normalizer L_out, and then reuses these cached tensors in subsequent steps. Only the internal attention (A_in, L_in) is recomputed each step. The final attention is obtained by log‑space composition of the cached external part and the freshly computed internal part, following the numerically stable approach used in FlashAttention. This eliminates repeated KV‑cache reads for the external portion and reduces the overall number of attention operations.
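This log-space composition is the standard way to merge two softmax-attention partials, as in FlashAttention: each partial carries its weighted value sum plus a log-sum-exp normalizer, and the two are reweighted by their share of the total softmax mass. A minimal NumPy sketch (function names are illustrative, not the paper's implementation):

```python
import numpy as np

def partial_attention(q, k, v):
    # Attention restricted to a subset of keys/values: returns the
    # subset-normalized output and the subset's log-sum-exp normalizer.
    s = q @ k.T                                        # (n_q, n_k) scores
    m = s.max(axis=1, keepdims=True)                   # for numerical stability
    l = m + np.log(np.exp(s - m).sum(axis=1, keepdims=True))
    return np.exp(s - l) @ v, l                        # (n_q, d), (n_q, 1)

def merge_attention(a_ext, l_ext, a_int, l_int):
    # Log-space composition of the cached external partial (a_ext, l_ext)
    # with the freshly computed internal partial (a_int, l_int).
    l = np.logaddexp(l_ext, l_int)                     # combined normalizer
    return np.exp(l_ext - l) * a_ext + np.exp(l_int - l) * a_int
```

Because the composition is exact, reusing a cached `(a_ext, l_ext)` reproduces full attention as long as the external keys and values have not changed.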

To avoid accuracy loss when the block changes dramatically, FlashBlock employs a selective reuse policy. If the number of tokens updated within the block at a given step (M) is below a threshold τ, the cached external attention is reused; otherwise it is recomputed and the cache refreshed. For video diffusion, a head‑wise similarity estimate γ determines which attention heads can safely reuse their external component, accounting for higher variability in visual sequences.
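The reuse decision itself is a simple threshold test. A hypothetical sketch of the policy (class and parameter names are illustrative; `compute_external` stands in for the full external-attention pass):

```python
class ExternalAttnCache:
    """Reuse the cached external partial when few block tokens changed
    this step; otherwise recompute it and refresh the cache."""

    def __init__(self, compute_external, tau):
        self.compute_external = compute_external  # recomputes (a_ext, l_ext)
        self.tau = tau                            # update-count threshold
        self.cache = None
        self.recomputes = 0                       # for bookkeeping/profiling

    def get(self, q, num_updated_tokens):
        if self.cache is None or num_updated_tokens >= self.tau:
            self.cache = self.compute_external(q)  # full external pass
            self.recomputes += 1
        return self.cache                          # cheap reuse path
```

The head-wise variant for video would apply the same test per attention head, gated by the similarity estimate γ instead of a raw update count.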

FlashBlock is orthogonal to sparse‑attention techniques. In a sparse setting, the set of keys selected by the sparsity pattern forms J_in, while the remaining keys constitute J_out. The method simply caches the residual attention from J_out, allowing it to be combined with any sparse‑attention computation on J_in. Consequently, FlashBlock can be layered on top of existing sparse‑attention methods, mitigating the quality degradation that often accompanies aggressive sparsification.
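A minimal NumPy sketch of this residual reuse (assumed interface, not the paper's code): the sparse part over J_in is recomputed every step, while the residual partial over J_out is computed once and merged back in log space.

```python
import numpy as np

def residual_sparse_attention(q, k, v, j_in, cached_residual=None):
    # Sparse attention over the selected keys J_in each step, plus a cached
    # residual partial (a_out, l_out) over the complement J_out.
    def partial(idx):
        s = q @ k[idx].T
        m = s.max(axis=1, keepdims=True)
        l = m + np.log(np.exp(s - m).sum(axis=1, keepdims=True))
        return np.exp(s - l) @ v[idx], l

    a_in, l_in = partial(j_in)                       # recomputed every step
    if cached_residual is None:                      # first step: fill cache
        j_out = np.setdiff1d(np.arange(len(k)), j_in)
        cached_residual = partial(j_out)
    a_out, l_out = cached_residual                   # reused thereafter
    l = np.logaddexp(l_in, l_out)                    # log-space composition
    return np.exp(l_in - l) * a_in + np.exp(l_out - l) * a_out, cached_residual
```

Any sparse-attention kernel can replace the dense `partial(j_in)` here; the residual merge is what restores the mass that sparsification would otherwise discard.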

Experiments on large diffusion language models (e.g., an 8‑billion‑parameter transformer) and on long‑horizon video diffusion models demonstrate substantial speedups. Token throughput improves by up to 1.44×, and the wall‑clock time spent on attention drops by up to 1.6×, while standard generation metrics (BLEU and ROUGE for text; FID for video) show negligible differences (<0.1%). When combined with sparse attention, FlashBlock maintains high quality even at sparsity levels of 70% or more, effectively compensating for the information loss introduced by sparsification.

In summary, FlashBlock exploits the inherent stability of block‑external attention across diffusion steps, providing a simple yet powerful caching strategy that reduces both computation and memory traffic without altering the diffusion process. Its compatibility with sparse‑attention further broadens its applicability, making it a valuable addition to any long‑context diffusion model seeking faster inference while preserving generation quality.

