AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering and image captioning, but their inference cost remains a significant challenge due to the large number of vision tokens processed during the prefill stage. Existing pruning methods often rely directly on attention patterns or on static text-prompt guidance, failing to exploit the dynamic internal signals generated during inference. To address these issues, we propose AdaptInfer, a plug-and-play framework for adaptive vision token pruning in VLMs. First, we introduce a fine-grained, dynamic text-guided pruning mechanism that reuses layer-wise text-to-text attention maps to construct soft priors over text-token importance, allowing more informed scoring of vision tokens at each stage. Second, we perform an offline analysis of cross-modal attention shifts and identify consistent inflection points during inference, which motivate a more principled and efficient pruning schedule. Our method is lightweight, plug-and-play, and generalizes across multimodal tasks. Experiments verify its effectiveness: for example, it reduces CUDA latency by 61.3% while maintaining an average accuracy of 93.1% on vanilla LLaVA-1.5-7B, and under the same token budget it surpasses state-of-the-art methods in accuracy.


💡 Research Summary

Vision‑language models (VLMs) have achieved remarkable performance on tasks such as visual question answering, image captioning, and multimodal retrieval, but their inference cost remains a major bottleneck. The dominant source of overhead is the large number of visual tokens generated by the image encoder, which can be an order of magnitude larger than the textual tokens. Existing acceleration techniques either redesign the visual encoder to produce fewer tokens or prune visual tokens during the language model’s pre‑fill stage. A recent line of work, exemplified by SparseVLM, uses static text‑prompt guidance to rank visual tokens, but it assumes that the importance of text tokens is fixed throughout inference. In practice, the relevance of each word in a prompt evolves across transformer layers as the model refines its internal representation.

AdaptInfer addresses these two shortcomings with a fully plug‑and‑play framework that (1) dynamically estimates text‑token importance at every pruning layer using the model’s own text‑to‑text (t2t) attention maps, and (2) schedules pruning operations at layers where cross‑modal attention exhibits significant shifts. The method requires no additional training and incurs negligible extra FLOPs because it reuses attention matrices already computed in the forward pass.

Dynamic text guidance. For each predefined pruning layer ℓ, the model’s t2t attention matrix A⁽ℓ⁾_t2t ∈ ℝ^{T×T} (T = number of text tokens) is extracted from all heads and summed across the query dimension to obtain a soft prior w ∈ ℝ^{T}. This prior reflects how much each text token is attended to by the rest of the prompt at layer ℓ. The prior is then transposed and multiplied with the t2v attention matrix A⁽ℓ⁾_t2v ∈ ℝ^{T×V} (V = remaining visual tokens) to produce a weighted visual‑token score vector s ∈ ℝ^{V}: s = (1/H) Σ_h wᵀ A⁽ℓ⁾_{t2v,h}. All text tokens contribute to s in proportion to their current importance, allowing the ranking of visual tokens to adapt to the evolving textual focus. The top‑k visual tokens are retained, and the rest are discarded. Because A_t2t and A_t2v are already available, the extra computation is limited to a few matrix‑vector products (≈ T² + 2TV FLOPs), which is negligible compared with the main transformer cost (≈ 4nd² + 2n²d + 3ndm).
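The scoring step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `score_visual_tokens`, the `keep_ratio` parameter, and the averaging order over heads are assumptions for clarity.

```python
import numpy as np

def score_visual_tokens(attn_t2t, attn_t2v, keep_ratio=0.3):
    """Sketch of AdaptInfer-style text-guided visual token scoring.

    attn_t2t: (H, T, T) text-to-text attention at the pruning layer.
    attn_t2v: (H, T, V) text-to-visual attention at the same layer.
    Returns indices of the top-k visual tokens to retain.
    """
    # Soft prior over text-token importance: total attention each text
    # token receives across query positions, averaged over heads.
    w = attn_t2t.mean(axis=0).sum(axis=0)                        # (T,)
    # Weight the t2v attention by text-token importance, average heads:
    # s_v = (1/H) * sum_h sum_t w_t * A[h, t, v]
    s = np.einsum('t,htv->v', w, attn_t2v) / attn_t2v.shape[0]   # (V,)
    k = max(1, int(keep_ratio * s.shape[0]))
    return np.argsort(s)[::-1][:k]                                # top-k
```

Because both attention tensors are reused from the forward pass, the only new work here is one matrix-vector product per head, matching the ≈ T² + 2TV FLOPs estimate.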

Pruning schedule derived from attention shifts. The authors analyze 1,000 samples from the MME and TextVQA benchmarks, tracking cumulative t2v attention for the top‑10 % visual tokens across layers. Change‑point detection reveals consistent inflection points at layers 1, 10, and 20 for LLaVA‑1.5‑7B, and at layers 0, 9, and 19 for Qwen2‑VL‑2B. These points correspond either to the moment a visual token becomes salient or to the moment its information has been fully extracted. Pruning immediately after these layers therefore removes redundant tokens while preserving the most informative ones.
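The offline shift detection can be approximated with a simple slope-change criterion on the per-layer attention curve. This is a hypothetical sketch (the paper does not specify its change-point algorithm); `find_inflection_layers` and the second-difference heuristic are assumptions.

```python
import numpy as np

def find_inflection_layers(attn_curve, n_points=3):
    """Return the layers where the cumulative t2v attention of the
    top-10% visual tokens changes slope most sharply.

    attn_curve: per-layer cumulative attention mass (one value per layer).
    """
    curve = np.asarray(attn_curve, dtype=float)
    slope = np.diff(curve)            # first difference between layers
    shift = np.abs(np.diff(slope))    # magnitude of the slope change
    # shift[i] measures the change at layer i+1, hence the +1 offset
    return sorted(np.argsort(shift)[::-1][:n_points] + 1)
```

On a piecewise-linear curve with kinks at layers 5 and 15, the function recovers exactly those layers; on real attention traces, the largest shifts would play the role of the reported inflection points (e.g., layers 1, 10, 20 for LLaVA-1.5-7B).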

Complexity analysis. Let n = T + V be the current sequence length, d the hidden dimension, and m the feed‑forward projection size. The FLOPs for a standard pre‑fill transformer layer are: F_pre = 4 n d² + 2 n² d + 3 n d m. The additional cost for a pruning layer is: F_prune = T² + 2 T V, which is orders of magnitude smaller than F_pre. During decoding, the FLOPs are similarly dominated by the standard transformer term, confirming that AdaptInfer’s overhead is minimal.

Experimental evaluation. The method is tested on two large VLMs—LLaVA‑1.5‑7B and Qwen2‑VL‑2B—across five multimodal benchmarks (MME, TextVQA, COCO‑Caption, VQAv2, and RefCOCO). Token budgets of 30 % and 50 % of the original visual token count are examined. On LLaVA‑1.5‑7B with a 30 % budget, AdaptInfer reduces CUDA latency by 61.3 % while maintaining an average accuracy of 93.1 % (baseline 94.0 %). Compared to state‑of‑the‑art pruning baselines (SparseVLM, FlowCut, etc.), it achieves 1.5–2.2 % higher accuracy under the same token budget. Ablation studies show that (i) dynamic t2t‑based guidance outperforms static prompt selection by ~1.8 % accuracy, (ii) pruning at detected shift points yields ~4 % latency improvement over uniform pruning, and (iii) the trade‑off between token budget and performance is smooth, with <2 % accuracy loss at 20 % budget and negligible loss at 70 % budget.

Limitations and future work. AdaptInfer relies on rich t2t attention, which may be weaker in smaller language models, potentially limiting its effectiveness. The pruning layers are manually set based on offline analysis; a fully adaptive, possibly reinforcement‑learning‑driven schedule could further improve efficiency. Extending the approach to non‑transformer architectures or to scenarios with multiple visual inputs (e.g., video) remains an open direction.

Conclusion. AdaptInfer introduces a principled, lightweight, and training‑free mechanism for adaptive visual token pruning in VLMs. By exploiting dynamic text‑token importance and data‑driven pruning points, it dramatically cuts inference time while preserving (or even slightly improving) task performance. Its plug‑and‑play nature makes it readily applicable to existing multimodal models, offering a practical solution to the growing computational demands of large‑scale vision‑language systems.

