CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models
Efficient inference in Large Vision-Language Models is constrained by the high cost of processing thousands of visual tokens, yet it remains unclear which tokens and computations can be safely removed. While attention scores are commonly used to estimate visual token importance, they are an imperfect proxy for actual contribution. We show that Attention Contribution, which weights attention probabilities by value vector magnitude, provides a more accurate criterion for visual token selection. Our empirical analysis reveals that visual attention sinks are functionally heterogeneous, comprising Probability Dumps with low contribution that can be safely pruned, and Structural Anchors with high contribution that are essential for maintaining model performance. Further, we identify substantial redundancy in Feed-Forward Networks (FFNs) associated with visual tokens, particularly in intermediate layers where image tokens exhibit linear behavior. Based on these findings, we introduce CAPA (Contribution-Aware Pruning and FFN Approximation), a dual-strategy framework that prunes visual tokens using attention contribution at critical functional transitions and reduces FFN computation through efficient linear approximations. Experiments on diverse benchmarks and baselines show that CAPA achieves competitive efficiency–performance trade-offs with improved robustness.
💡 Research Summary
The paper tackles the inference inefficiency of large vision‑language models (VLMs), which stems from processing thousands of visual tokens and the heavy computation of Feed‑Forward Networks (FFNs). Existing acceleration methods mainly rely on raw attention scores to prune visual tokens or apply generic FFN compression techniques borrowed from large language models. The authors argue that these approaches overlook two crucial aspects: (1) attention scores alone do not reflect the true informational contribution of a token because they ignore the magnitude of the associated value vectors, and (2) visual tokens exhibit markedly different non‑linearity requirements in FFNs compared with text tokens.
Key Empirical Findings
- Attention Contribution vs. Raw Attention – By jointly measuring the “Sink Value” (activation magnitude in outlier dimensions) and a newly defined “Attention Contribution” (Cᵢ), which multiplies attention probability by the ℓ₂ norm of the projected value vector, the authors discover a functional dichotomy among high‑attention visual tokens. “Probability Dumps” (Type I) have high attention mass but negligible value magnitude, contributing almost nothing to the residual stream. “Structural Anchors” (Type II) combine high attention with large value vectors, acting as essential biases in the model’s hidden state. Pruning based solely on raw attention indiscriminately removes both groups, leading to severe performance drops; using Cᵢ preserves the anchors while safely discarding the dumps.
- Modality‑Specific FFN Redundancy – The authors compute cosine similarity between a layer’s input hidden state x and the post‑FFN residual output y = x + FFN(x). For visual tokens, similarity stays above 0.96 across most intermediate layers, indicating that the FFN behaves almost like an identity mapping (i.e., linear). In contrast, text tokens show lower similarity in early and middle layers, confirming that they rely on genuine non‑linear transformations. This pattern holds across three major VLM backbones (LLaVA‑1.5, Qwen2.5‑VL, InternVL3), suggesting a systematic redundancy that can be exploited.
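The contribution criterion above can be sketched in a few lines. This is a minimal toy, not the paper's model: the shapes, the random tensors, and the simulated "Probability Dump" at token 0 are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16                              # toy count of visual tokens / head dim

attn = rng.random(n)
values = rng.normal(size=(n, d))          # projected value vectors

# Simulate a "Probability Dump" at token 0: it attracts a large share of
# attention mass, but its value vector has negligible magnitude.
values[0] *= 1e-3
attn[0] = attn.max() * 2
attn /= attn.sum()                        # softmax-normalized attention row

# Attention Contribution C_i = a_i * ||v_i||_2, as defined above.
contribution = attn * np.linalg.norm(values, axis=-1)

print(attn.argmax())                      # 0: raw attention picks the dump
print(contribution.argmax())              # not 0: contribution discounts it
```

A raw-attention criterion would rank the dump first and keep it at the expense of informative tokens, while the contribution score pushes it to the bottom of the ranking.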
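The linearity diagnostic can likewise be sketched with a toy check. The 0.95 cutoff is the example threshold from this summary; both `ffn` callables below are hypothetical stand-ins for real layers, chosen so one behaves near-identically and one does not.

```python
import numpy as np

def residual_cosine(x, ffn):
    """Cosine similarity between the layer input x and its post-FFN
    residual output y = x + FFN(x); values near 1 suggest the block
    acts almost as an identity (near-linear) map for this token."""
    y = x + ffn(x)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(1)
x = rng.normal(size=64)                    # toy hidden state

weak = lambda h: 0.01 * np.tanh(h)         # near-identity residual update
strong = lambda h: 3.0 * np.tanh(3.0 * h)  # genuinely non-linear update

threshold = 0.95                           # example cutoff from the summary
print(residual_cosine(x, weak) > threshold)    # True: flag as redundant
print(residual_cosine(x, strong) > threshold)  # False: keep the full FFN
```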
CAPA Framework
CAPA (Contribution‑Aware Pruning and FFN Approximation) integrates the two insights into a unified acceleration pipeline:
- Contribution‑Aware Pruning (CAP‑Step) – At each generation step t, the model computes Cᵢ for every visual token with respect to the current query token qₜ. Tokens are ranked by Cᵢ, and only the top‑k are retained in the key‑value cache; the rest are pruned on‑the‑fly. This dynamic, query‑dependent pruning ensures that visual context adapts to the evolving textual generation.
- FFN Approximation – Layers whose input‑output cosine similarity exceeds a preset threshold (e.g., 0.95) are flagged as redundant for visual tokens. In these layers, the dense O(d²) FFN is replaced by a lightweight learned element‑wise Hadamard product (O(d)). The replacement is trained to mimic the original FFN’s linear effect on visual tokens, while text tokens continue to use the full FFN in non‑redundant layers.
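One pruning step of the first strategy can be sketched as follows; `cap_step` is a name of our own choosing, and the shapes are toys (the real method operates on the model's key-value cache per attention head).

```python
import numpy as np

def cap_step(attn_probs, values, k):
    """Sketch of one contribution-aware pruning step: rank visual tokens
    by C_i = a_i * ||v_i||_2 under the current query's attention row and
    keep only the indices of the top-k contributors."""
    contribution = attn_probs * np.linalg.norm(values, axis=-1)
    keep = np.argsort(contribution)[-k:]      # indices of top-k tokens
    return np.sort(keep)

rng = np.random.default_rng(2)
n_visual, d = 32, 16
attn = rng.random(n_visual)
attn /= attn.sum()                            # attention row for query q_t
vals = rng.normal(size=(n_visual, d))

keep = cap_step(attn, vals, k=8)
pruned_values = vals[keep]                    # retained KV-cache entries
print(keep.shape, pruned_values.shape)        # (8,) (8, 16)
```

Because the scores are recomputed against each new query token, the retained set can change from step to step, which is what makes the pruning query-dependent rather than a one-shot cut.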
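The second strategy can be sketched on a layer flagged as redundant. Here the near-linear toy `ffn` and the closed-form per-dimension least-squares fit are stand-ins of our own for the paper's trained Hadamard module; only the O(d²) → O(d) shape of the replacement is taken from the summary.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 32, 256

# Visual-token hidden states at a layer whose input-output cosine
# similarity exceeded the threshold (toy data, not real activations).
X = rng.normal(size=(n, d))

# Hypothetical stand-in for the original FFN at such a layer: on visual
# tokens it acts almost linearly (a per-dimension rescaling plus a small
# non-linear term).
a = 0.1 * rng.normal(size=d)
ffn = lambda x: a * x + 0.01 * np.tanh(x)
Y = ffn(X)

# Hadamard replacement: ffn(x) ≈ s * x with one learned scale per
# dimension, O(d) per token instead of O(d^2). A per-dimension
# least-squares fit stands in for the training step described above.
s = (X * Y).sum(axis=0) / (X * X).sum(axis=0)

rel_err = np.linalg.norm(s * X - Y) / np.linalg.norm(Y)
print(f"relative error of the O(d) approximation: {rel_err:.3f}")
```

The fit succeeds here precisely because the layer was selected for near-linear behavior; applying the same element-wise replacement to a strongly non-linear (text-dominant) layer would leave a large residual, matching the ablation result below.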
Experimental Validation
CAPA is evaluated on a suite of multimodal benchmarks (VQAv2, COCO‑Caption, RefCOCO, GQA) and three VLM architectures. Results show:
- FLOPs reduction of 1.5×–2.3× and inference speed‑ups of 1.2×–1.8× on GPU.
- Negligible accuracy loss (≤ 0.2% absolute drop in top‑1/top‑5 metrics).
- Preservation of fine‑grained visual reasoning (e.g., color or spatial queries) where prior pruning methods suffered 5%–10% larger degradations.
Ablation studies confirm that removing Structural Anchors dramatically harms performance, while pruning only Probability Dumps yields almost identical results to the full model. Likewise, applying the Hadamard approximation to text‑dominant layers leads to noticeable drops, underscoring the importance of modality‑aware selection.
Significance and Limitations
CAPA demonstrates that (i) incorporating value‑vector magnitude into token importance metrics yields a safer, more expressive pruning criterion, and (ii) recognizing the near‑linear behavior of visual‑token FFNs enables substantial computational savings without sacrificing model capability. The approach, however, still requires manual tuning of pruning ratios and similarity thresholds, and its efficacy on ultra‑high‑resolution inputs or video‑language tasks remains to be explored. Future work could automate hyper‑parameter selection via meta‑learning and extend the contribution‑aware paradigm to other modalities such as audio or video.
In summary, CAPA offers a principled, dual‑pronged strategy that jointly trims unnecessary visual tokens and streamlines redundant FFN computations, setting a new benchmark for efficient inference in large vision‑language models.