ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models
Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. Existing token reduction methods typically exploit either vision-encoder saliency (broad but query-agnostic) or LLM cross-attention (query-aware but sparse and costly). We show that neither signal alone is sufficient: fusing them consistently improves performance over unimodal visual token ranking. However, making such fusion practical is non-trivial: cross-modal saliency is usually only available inside the LLM (too late for efficient pre-LLM pruning), and the two signals are inherently asymmetric, so naive fusion underutilizes their complementary strengths. We propose ConsensusDrop, a training-free framework that derives a consensus ranking by reconciling vision-encoder saliency with query-aware cross-attention, retaining the most informative tokens while compressing the remainder via encoder-guided token merging. Across LLaVA-1.5/NeXT, Video-LLaVA, and other open-source VLMs, ConsensusDrop consistently outperforms prior pruning methods under identical token budgets and delivers a stronger accuracy-efficiency Pareto frontier, preserving near-baseline accuracy even at aggressive token reductions while reducing TTFT and KV-cache footprint. Our code will be open-sourced.
💡 Research Summary
Vision‑Language Models (VLMs) suffer from high computational cost because a frozen vision encoder produces hundreds of visual tokens that the large language model (LLM) must process. Existing token‑reduction strategies fall into two camps: vision‑only pruning, which ranks tokens using the encoder’s internal saliency (broad but query‑agnostic), and cross‑modal pruning, which uses text‑vision attention inside the LLM (query‑aware but sparse, costly, and disruptive to FlashAttention). The authors first demonstrate that neither signal alone yields reliable importance estimates; vision scores give a strong base ranking, while cross‑modal scores provide targeted corrections for a small subset of tokens. They quantify agreement, disagreement, and recovery rates across six benchmarks, showing that fusing the two consistently improves accuracy, especially at aggressive retention ratios.
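The agreement analysis above can be illustrated with a simple rank-overlap metric. The function below is a hypothetical sketch, not the paper's exact agreement/recovery definition: it just measures what fraction of tokens both signals place in their respective top-k.

```python
import numpy as np

def topk_overlap(vision_scores, cross_scores, k):
    """Fraction of visual tokens that BOTH saliency signals rank in their top-k."""
    v_top = set(np.argsort(vision_scores)[::-1][:k])  # indices of top-k by vision saliency
    c_top = set(np.argsort(cross_scores)[::-1][:k])   # indices of top-k by cross-modal saliency
    return len(v_top & c_top) / k

# Toy example: 8 visual tokens scored by two signals that mostly disagree.
vision = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.05, 0.6, 0.3])
cross  = np.array([0.2, 0.1, 0.9, 0.8, 0.1, 0.7, 0.05, 0.3])
print(topk_overlap(vision, cross, k=4))  # -> 0.25: only token 2 is in both top-4 sets
```

A low overlap at small k is exactly the regime where fusing the two rankings can recover tokens that either signal alone would drop.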
To make this fusion practical, they introduce ConsensusDrop, a training‑free framework consisting of three components:
1) Static Cross‑Attention Probe (SCAP) replicates the first decoder layer of the LLM and runs a single self‑attention pass on the full multimodal sequence, extracting query‑conditioned cross‑modal scores before the LLM proper.
2) A Multi‑modal Fuser combines the vision‑side saliency and SCAP scores via a vision‑biased weighted sum, producing a consensus ranking from which the top‑K tokens are retained.
3) Encoder‑Guided Token Merge (EGTM) compresses the remaining low‑saliency tokens by clustering them in the vision encoder's feature space and merging each cluster into a single token in the projector space, preserving their information while dramatically reducing token count.
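The three stages can be sketched numerically. The NumPy code below is a minimal illustration under stated assumptions, not the paper's implementation: `scap_scores` stands in for the probe with a single text→vision softmax-attention pass, the fusion weight `alpha` and cluster count `n_merged` are invented hyperparameters, and nearest-seed averaging is a simple stand-in for the encoder-guided clustering in EGTM.

```python
import numpy as np

def scap_scores(text_q, vis_k):
    """Stand-in for the Static Cross-Attention Probe: one softmax attention
    pass from text queries to visual keys, averaged over the text tokens,
    yielding one query-conditioned score per visual token."""
    logits = text_q @ vis_k.T / np.sqrt(vis_k.shape[1])
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn.mean(axis=0)

def consensus_prune(vision_scores, cross_scores, feats, keep_k,
                    alpha=0.7, n_merged=2):
    """Fuse the two saliency signals with a vision-biased weighted sum,
    retain the top-K tokens, and merge the dropped tokens by nearest-seed
    averaging in feature space. All hyperparameters are illustrative."""
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-8)
    consensus = alpha * norm(vision_scores) + (1 - alpha) * norm(cross_scores)

    order = np.argsort(consensus)[::-1]              # best tokens first
    kept, dropped = order[:keep_k], order[keep_k:]

    # Use the highest-consensus dropped tokens as cluster seeds, assign
    # every dropped token to its nearest seed, then average each cluster.
    seeds = dropped[:n_merged]
    dists = np.linalg.norm(feats[dropped][:, None] - feats[seeds][None], axis=-1)
    assign = dists.argmin(axis=1)
    merged = np.stack([feats[dropped[assign == j]].mean(axis=0)
                       for j in range(n_merged)])
    return feats[kept], merged
```

Setting `alpha > 0.5` mirrors the summary's "vision-biased" fusion: the encoder's broad ranking dominates, while the probe's sparse query-aware scores apply targeted corrections. The output sequence length is `keep_k + n_merged` instead of the original token count.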
ConsensusDrop is evaluated on LLaVA‑1.5, LLaVA‑Next, Video‑LLaVA, and other open‑source VLMs. Under identical token budgets, it outperforms prior pruning methods by 2–4 percentage points in normalized accuracy and achieves 30–50 % reductions in KV‑cache usage and time‑to‑first‑token latency. Notably, even with only 25 % of the original visual tokens, the model retains near‑baseline performance, making it suitable for latency‑sensitive deployments such as robotics and autonomous driving. The approach requires no additional training, retains FlashAttention, and is plug‑and‑play, with code to be released publicly.