VideoNSA: Native Sparse Attention Scales Video Understanding

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We adopt a hardware-aware hybrid attention scheme, preserving dense attention for text while applying NSA to video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) learnable combined sparse attention helps induce dynamic attention sinks. Project Page: https://enxinsong.com/VideoNSA-web/, Code: https://github.com/Espere-1119-Song/VideoNSA


💡 Research Summary

VideoNSA tackles a fundamental bottleneck in multimodal large language models (MLLMs): the inability to process long video sequences due to quadratic attention costs. The authors adapt Native Sparse Attention (NSA), a learnable, hardware‑aware sparse mechanism, to the video branch of Qwen2.5‑VL‑7B while keeping dense Grouped‑Query Attention (GQA) for text. NSA consists of three complementary sub‑branches—Compression (CMP), Selection (SLC), and Sliding‑Window (SW)—each dynamically weighted by a two‑layer MLP gate. The Compression branch aggregates consecutive KV blocks via a learnable MLP, reducing token count; the Selection branch scores blocks and keeps only the top‑n most salient ones; the Sliding‑Window branch retains a fixed recent window to guarantee local temporal coverage. By combining these branches with learned gates, VideoNSA can attend to 128 K vision tokens while using only about 3.6 % of the total attention budget.
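The gated three-branch combination described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: it uses mean pooling in place of the learnable compression MLP, fixed gate weights in place of the learned two-layer MLP gate, a single query vector, and no GQA or hardware-aware kernels.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Standard scaled dot-product attention for one query vector."""
    if len(K) == 0:
        return np.zeros_like(q)
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def nsa_output(q, K, V, block=4, top_n=2, window=6, gates=(1/3, 1/3, 1/3)):
    """Toy sketch of NSA's three branches mixed by gate weights.

    In the real model the gates come from a learned MLP conditioned on
    the query; here they are fixed constants for illustration.
    """
    T = len(K)
    nb = T // block

    # Compression (CMP) branch: pool consecutive KV blocks into one KV
    # pair per block (mean pooling stands in for the learnable MLP).
    Kc = K[:nb * block].reshape(nb, block, -1).mean(axis=1)
    Vc = V[:nb * block].reshape(nb, block, -1).mean(axis=1)
    out_cmp = attend(q, Kc, Vc)

    # Selection (SLC) branch: score blocks via their compressed keys,
    # keep only the top-n most salient blocks at full resolution.
    block_scores = Kc @ q
    top = sorted(np.argsort(block_scores)[-top_n:])
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
    out_slc = attend(q, K[idx], V[idx])

    # Sliding-window (SW) branch: always attend to the most recent tokens.
    out_sw = attend(q, K[-window:], V[-window:])

    g_cmp, g_slc, g_sw = gates
    return g_cmp * out_cmp + g_slc * out_slc + g_sw * out_sw
```

With gates of (0, 0, 1) the output reduces to plain attention over the recent window, which makes the branch decomposition easy to sanity-check.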

Training is performed end‑to‑end on a curated 216 K video‑instruction dataset (4 fps, 350‑550 frames per video). The maximum context length per instance is limited to 36 K tokens, and the model is trained for 4600 GPU‑hours on H100s using the SWIFT training framework and the NSA implementation from the FLA library. Block size is set to 64 tokens, block stride to 32, and sliding‑window size to 256, yielding a hardware‑friendly cache layout.
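As a rough illustration of the stated cache layout, the block size and stride determine how many overlapping KV blocks the compression branch produces for a given sequence length. The sketch below uses the reported values; the actual boundary handling in the FLA kernels may differ.

```python
# Reported NSA cache-layout hyperparameters (from the summary above).
BLOCK_SIZE = 64       # tokens per compression block
BLOCK_STRIDE = 32     # step between block starts -> 50% overlap
SLIDING_WINDOW = 256  # recent tokens kept by the sliding-window branch

def num_compressed_blocks(seq_len: int,
                          block_size: int = BLOCK_SIZE,
                          stride: int = BLOCK_STRIDE) -> int:
    """Number of overlapping KV blocks a sequence of `seq_len` tokens
    yields under the stated block size and stride (illustrative; the
    real kernel's edge handling may round differently)."""
    if seq_len < block_size:
        return 0
    return (seq_len - block_size) // stride + 1
```

For example, a 256-token sequence yields 7 overlapping blocks rather than the 4 that non-overlapping 64-token blocks would give, which is what makes the stride-32 layout denser in coverage at the same block size.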

Empirical evaluation spans three families of benchmarks: long‑video understanding (LongVideoBench, LongTimeScope), temporal reasoning (TimeScope, Tomato), and spatial reasoning (VSIBench). VideoNSA consistently outperforms strong baselines, including dense Qwen2.5‑VL, quantized variants (AWQ), token‑compression methods (FastV, VScan, VisionZip), and training‑free sparse approaches (Tri‑Shape, FlexPrefill, XAttention). Notably, VideoNSA narrows the gap to state‑of‑the‑art on ultra‑long videos (up to 10 hours) and achieves the highest accuracy on the Tomato temporal‑reasoning suite, demonstrating its ability to capture fine‑grained transitions that compression‑based models miss.

Ablation studies reveal four key findings: (1) the model scales reliably to 128 K tokens, extrapolating beyond the training length with minimal degradation; (2) an optimal global‑local attention split (global branches = CMP + SLC, local branch = SW) maximizes performance under a fixed token budget, and the optimal split is task‑dependent; (3) the gating distribution evolves across layers—early layers rely heavily on selection and compression, while deeper layers shift toward sliding‑window, indicating a hierarchical processing of salient versus contextual information; (4) learned sparse attention weights remain beneficial when transferred to a dense setting, improving dense‑NSA performance over the baseline by up to 20 % relative on some tasks.

Further analysis of attention patterns shows that the Selection branch creates almost no “attention sinks,” whereas the Compression branch generates periodic sinks that help re‑route information in ultra‑long contexts. The Sliding‑Window branch consistently provides coverage of recent frames, preventing loss of short‑range continuity.

In summary, VideoNSA demonstrates that a hardware‑aware, learnable sparse attention mechanism can be seamlessly integrated into a vision‑language model to dramatically extend its effective context length while preserving or improving performance on temporal and spatial reasoning tasks. By allocating only a tiny fraction of the attention budget to video, the approach offers a practical path toward real‑time, long‑form video understanding in future multimodal LLMs, opening doors to applications such as live sports commentary, long‑form video summarization, and continuous video‑driven dialogue systems.

