BlindSight: Harnessing Sparsity for Efficient Vision-Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large vision-language models (VLMs) enable joint processing of text and images. However, incorporating vision data significantly increases the prompt length, resulting in a longer time to first token (TTFT). This bottleneck can be alleviated by leveraging the inherent sparsity in the attention computation. Analyzing these attention patterns in VLMs when processing a series of images, we observe the absence of inter-image attention in a substantial portion of layers. Based on this, we propose BlindSight: an approach to optimize multi-image VLM inference using an input-template-aware attention sparsity mask with no runtime overhead. We utilize a dataset to derive a prompt-agnostic categorization for attention heads: Dense, Sink, Intra-Image, and Intra-Image+Sink. We develop a Triton-based GPU kernel to leverage this sparsity. BlindSight achieves a 1.8-3.2x speedup in the attention computation (prompt length 36K-300K). BlindSight generalizes across VLMs (Qwen2-VL, Qwen2.5-VL, Gemma 3), with only a 0.78% absolute accuracy degradation on average on multi-image comprehension benchmarks. Finally, we advocate for the design of efficient VLMs that combine BlindSight-inspired sparse and dense layers.


💡 Research Summary

BlindSight tackles the growing latency problem of large vision‑language models (VLMs) when processing prompts that contain many images. Because each image is tokenized into thousands of visual tokens, the overall context length can easily exceed 100 K tokens, and the quadratic cost of the attention operation dominates the “time‑to‑first‑token” (TTFT) during the pre‑fill stage. Existing solutions such as MMInference or Look‑M try to discover sparsity on‑the‑fly, but they incur extra memory accesses and runtime overhead.

The key observation of this work is that, across several state‑of‑the‑art VLMs (Qwen2‑VL, Qwen2.5‑VL, Gemma 3), a large fraction of attention heads never attend across different images. Instead, attention is concentrated either on a “sink” token that appears right after a text‑to‑image or image‑to‑image transition, or it stays completely within a single image. Based on extensive visual inspection of attention matrices, the authors categorize heads into four static patterns:

  1. Dense – no discernible sparsity, behaves like standard full‑attention.
  2. Sink – only the sink token receives significant attention; no inter‑image connections.
  3. Intra‑Image – attention is limited to tokens belonging to the same image, with no sink behavior.
  4. Intra‑Image+Sink – combines intra‑image attention with a sink token.
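The four patterns above can be expressed as boolean attention masks. The sketch below is illustrative only; the function name `build_mask`, the span/sink representation, and the always-allowed diagonal are assumptions for demonstration, not the paper's implementation:

```python
import numpy as np

def build_mask(head_type, n_tokens, image_spans, sink_positions):
    """Build a causal attention mask for one head (hypothetical helper).

    head_type: 'dense', 'sink', 'intra', or 'intra_sink'
    image_spans: list of (start, end) token ranges per image, end exclusive
    sink_positions: indices of sink tokens (e.g., the first token after
        each text-to-image or image-to-image transition)
    """
    causal = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
    if head_type == "dense":
        return causal
    allowed = np.zeros((n_tokens, n_tokens), dtype=bool)
    # Let every query attend to itself so the softmax is always defined
    # (an assumption of this sketch).
    np.fill_diagonal(allowed, True)
    if head_type in ("sink", "intra_sink"):
        allowed[:, sink_positions] = True     # sink columns are visible
    if head_type in ("intra", "intra_sink"):
        for s, e in image_spans:
            allowed[s:e, s:e] = True          # attention stays within an image
    return allowed & causal
```

Note that only the Dense pattern depends on nothing but causality; the other three are fully determined by the prompt template (image boundaries and transition points), which is why the mask can be precomputed with no runtime discovery cost.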

BlindSight’s pipeline consists of two offline stages.

Prompt‑Level Characterization: For a given prompt, each head is evaluated with the four candidate masks. The dense attention output is taken as reference; the normalized mean‑squared error (NMSE) between the reference and each sparse variant is computed. If a mask yields NMSE below a user‑defined threshold α and also reduces theoretical FLOPs, that mask is selected for the head; otherwise the head remains dense.
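A minimal sketch of this selection rule follows. The helper names `nmse` and `select_mask` and the dictionary-based interface are assumptions; the sketch presumes the attention outputs under each candidate mask have already been computed:

```python
import numpy as np

def nmse(ref, approx):
    # Normalized mean-squared error between the dense reference output
    # and the output under a sparse candidate mask.
    return np.sum((ref - approx) ** 2) / np.sum(ref ** 2)

def select_mask(ref_out, candidate_outs, candidate_flops, dense_flops, alpha):
    """Pick the cheapest candidate mask whose NMSE stays below alpha.

    candidate_outs:  {mask_name: attention output under that mask}
    candidate_flops: {mask_name: theoretical FLOP count of that mask}
    Falls back to 'dense' when no candidate qualifies.
    """
    best, best_flops = "dense", dense_flops
    for name, out in candidate_outs.items():
        if nmse(ref_out, out) < alpha and candidate_flops[name] < best_flops:
            best, best_flops = name, candidate_flops[name]
    return best
```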

Dataset‑Level Aggregation: The above step is repeated over a large multi‑image benchmark (MMIU). For each layer‑head pair the most frequently chosen mask is recorded. A rule‑based aggregation then decides the final mask: if the dense pattern dominates beyond a fraction γd, keep dense; else pick Sink, Intra‑Image, or fall back to Intra‑Image+Sink based on dominance thresholds γs and γi. This yields a prompt‑independent, per‑head sparsity map that can be baked into the model.
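The aggregation rule can be sketched as follows. The threshold values shown are illustrative placeholders, not the paper's tuned γ settings, and `aggregate_head` is a hypothetical name:

```python
from collections import Counter

def aggregate_head(mask_choices, gamma_d=0.3, gamma_s=0.5, gamma_i=0.5):
    """Collapse per-prompt mask choices for one head into one static mask.

    mask_choices: list of per-prompt selections, e.g. ['sink', 'dense', ...]
    gamma_d / gamma_s / gamma_i: dominance thresholds (illustrative values).
    """
    freq = Counter(mask_choices)
    n = len(mask_choices)
    if freq["dense"] / n >= gamma_d:      # dense appears often: stay dense
        return "dense"
    if freq["sink"] / n >= gamma_s:       # sink-only pattern dominates
        return "sink"
    if freq["intra"] / n >= gamma_i:      # intra-image pattern dominates
        return "intra"
    return "intra_sink"                   # most permissive sparse fallback
```

The ordering matters: checking the dense threshold first biases the rule toward correctness (a head that is sometimes dense is never forced sparse), while Intra-Image+Sink serves as the fallback because it subsumes the other two sparse patterns.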

To exploit the map at inference time, the authors implement a Triton‑based GPU kernel that mirrors FlashAttention’s tile‑wise computation but adds four specialized sub‑routines for the different mask types. The kernel receives the head type, tile indices, and the boundaries of text versus image tokens. Tiles that are completely sparse are skipped, and the appropriate sub‑routine constructs the mask on‑the‑fly with minimal branching.
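The tile-skipping decision at the heart of such a kernel can be illustrated in plain Python (not Triton); `tile_needed` and its interface are assumptions of this sketch, not the paper's kernel code:

```python
def tile_needed(head_type, q_tile, k_tile, tile, image_spans, sink_positions):
    """Decide whether a FlashAttention-style kernel must process a K/V tile.

    Tiles cover token ranges [i*tile, (i+1)*tile); a tile is skipped only
    when the head's mask is zero everywhere inside it.
    """
    q0, q1 = q_tile * tile, (q_tile + 1) * tile
    k0, k1 = k_tile * tile, (k_tile + 1) * tile
    if k0 > q1 - 1:                      # entirely above the causal diagonal
        return False
    if head_type == "dense":
        return True
    if q_tile == k_tile:                 # diagonal tiles are always kept
        return True
    if head_type in ("sink", "intra_sink"):
        if any(k0 <= p < k1 for p in sink_positions):
            return True                  # tile contains a sink column
    if head_type in ("intra", "intra_sink"):
        # Keep the tile if some query and some key fall inside the same image.
        for s, e in image_spans:
            if q0 < e and s < q1 and k0 < e and s < k1:
                return True
    return False
```

Because skipped tiles are never loaded from memory, the saving grows with the number of images: for Sink and Intra-Image heads the kept tiles form a narrow band plus a few sink columns, rather than the full lower triangle.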

Empirical results show that for sequence lengths between 36K and 300K tokens, the BlindSight kernel achieves a 1.8‑3.2× speed‑up in the attention portion of the pre‑fill stage compared with a standard Triton FlashAttention implementation. Across three VLM families (Qwen2‑VL 7B, Qwen2.5‑VL 32B, Gemma 3 12B) the overall inference latency improves proportionally, while the average accuracy drop on multi‑image comprehension benchmarks is only 0.78 percentage points.

The paper also discusses how BlindSight’s static sparsity can be combined with token‑compression techniques (e.g., pruning redundant visual tokens) to achieve sub‑quadratic overall complexity. Finally, the authors advocate for a new VLM architecture that deliberately mixes sparse and dense attention layers, using the identified patterns as a design guideline for future hardware‑friendly models.

In summary, BlindSight demonstrates that a careful offline analysis of inter‑image attention reveals a robust, model‑agnostic sparsity structure. By turning this insight into a static mask and a custom GPU kernel, the method delivers substantial speed‑ups with negligible accuracy loss, paving the way for real‑time, large‑scale vision‑language applications.

