IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. While visual token pruning has emerged as a promising solution, existing methods that focus primarily on semantic relevance often discard tokens that are crucial for spatial reasoning. We address this gap through a novel insight into *how LVLMs perform spatial reasoning*. Specifically, we reveal that LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), where specific token positions serve as **implicit visual coordinates** (IVC tokens) that are essential for spatial reasoning. Based on this insight, we propose **IVC-Prune**, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens. IVC tokens are identified through a theoretical analysis of the mathematical properties of RoPE, targeting positions at which its rotation matrices approximate the identity matrix or the 90° rotation matrix. Foreground tokens are identified through a robust two-stage process: semantic seed discovery followed by contextual refinement via value-vector similarity. Extensive evaluations across four representative LVLMs and twenty diverse benchmarks show that IVC-Prune reduces visual tokens by approximately 50% while maintaining ≥ 99% of the original performance, even achieving improvements on several benchmarks. Source code is available at https://github.com/FireRedTeam/IVC-Prune.


💡 Research Summary

The paper tackles the prohibitive inference cost of Large Vision‑Language Models (LVLMs) when processing high‑resolution images, a problem that stems from the massive number of visual tokens generated. While recent visual token pruning methods have shown promise, they primarily focus on semantic relevance and consequently discard tokens that are crucial for spatial reasoning tasks such as visual grounding. The authors uncover a previously unnoticed property of Rotary Position Embeddings (RoPE), the positional encoding scheme widely adopted in modern LVLMs. By mathematically analyzing RoPE’s rotation matrices, they demonstrate that certain token positions act as implicit visual coordinate (IVC) tokens: positions where the rotation matrix approximates either the identity transformation (real‑axis reference) or a 90° counter‑clockwise rotation (imaginary‑axis reference). These positions provide absolute spatial anchors that enable LVLMs to reason about object locations despite RoPE’s original design for relative positioning.
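The observation above can be illustrated numerically: under standard RoPE frequencies, summing the cosine of each per-dimension rotation angle measures how close a position's rotation matrix is to the identity (real-axis reference), and summing the sine measures how close it is to a 90° counter-clockwise rotation (imaginary-axis reference). The sketch below assumes the usual RoPE frequency schedule; the function name, head dimension, and cutoff `k` are illustrative and not taken from the paper.

```python
import numpy as np

def ivc_scores(num_positions, dim=128, base=10000.0):
    """Score each position m by how closely its RoPE rotation matrices
    approximate the identity (real axis, cosine sum) or a 90-degree
    rotation (imaginary axis, sine sum), summed over rotary dimensions.
    A sketch of the paper's observation, not its exact criterion."""
    # standard RoPE frequencies: theta_i = base^(-2i/d), one per 2x2 block
    freqs = base ** (-np.arange(0, dim, 2) / dim)        # (dim/2,)
    angles = np.outer(np.arange(num_positions), freqs)   # (m, dim/2)
    V = np.cos(angles).sum(axis=1)  # high when rotations ~ identity
    U = np.sin(angles).sum(axis=1)  # high when rotations ~ 90-degree turn
    return V, U

V, U = ivc_scores(1024)
# candidate IVC positions: top-k under each axis score (k is illustrative)
k = 8
ivc_positions = set(np.argsort(-V)[:k]) | set(np.argsort(-U)[:k])
```

Position 0 is a trivial real-axis anchor (every rotation angle is 0, so every cosine is 1); the interesting anchors are the later positions whose many incommensurate frequencies happen to align near 0 or near 90°.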

To exploit this insight, the authors propose IVC‑Prune, a training‑free, prompt‑aware pruning strategy that preserves two categories of tokens: (1) IVC tokens identified by ranking positions according to the summed cosine (real‑axis score V(m)) and sine (imaginary‑axis score U(m)) components of RoPE across all dimensions, and (2) foreground tokens that are semantically aligned with the textual prompt. Foreground token selection proceeds in two stages to mitigate the positional bias inherent in attention scores. First, a small set of “semantic seeds” is extracted by computing similarity between the value vectors of text and image tokens—these vectors are free of RoPE’s positional influence. Second, the seed set is combined with all text tokens to form an expanded query set; a second round of value‑vector similarity yields a refined relevance score for every visual token, and the top‑k tokens become the retained foreground set.
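The two-stage foreground selection can be sketched with cosine similarity over value vectors, which (as the paper notes) carry no RoPE positional signal. Everything below is a minimal illustration under that assumption: the function name, the max-pooling over queries, and the default sizes are hypothetical choices, not the paper's exact formulation.

```python
import numpy as np

def select_foreground(text_vals, image_vals, num_seeds=16, top_k=256):
    """Two-stage foreground token selection (sketch).
    Stage 1: pick 'semantic seed' image tokens by value-vector similarity
    to the text tokens. Stage 2: expand the query set with those seeds
    and re-score every image token against the expanded set."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    t = normalize(text_vals)    # (num_text, d) text value vectors
    v = normalize(image_vals)   # (num_image, d) image value vectors

    # Stage 1: each image token's best cosine match to any text token
    seed_scores = (v @ t.T).max(axis=1)
    seeds = np.argsort(-seed_scores)[:num_seeds]

    # Stage 2: queries = all text tokens + seed image tokens;
    # refined relevance is the best match to any query
    queries = np.concatenate([t, v[seeds]], axis=0)
    refined = (v @ queries.T).max(axis=1)
    return np.argsort(-refined)[:top_k]
```

The second stage lets image tokens that resemble the seeds (e.g. other patches of the same object) score highly even when their direct similarity to the text is weak, which is the contextual-refinement effect described above.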

IVC‑Prune applies a single‑selection pruning scheme: the token set chosen at a designated intermediate layer is fixed and reused to prune KV‑caches in earlier layers and to guide token selection in later layers, preserving original position IDs while dramatically reducing memory consumption. The method is evaluated on four representative LVLMs—Qwen2.5‑VL, InternVL‑2.5, DeepSeek‑VL2, and LLaVA‑v1.5—across twenty diverse benchmarks covering visual grounding (RefCOCO, RefCOCO+, RefCOCOg), visual reasoning, hallucination detection, OCR, and VQA. Results show that IVC‑Prune cuts visual token counts by roughly 50% while retaining ≥ 99% of the original performance; on spatially sensitive tasks it even surpasses the full‑model baseline, delivering up to a 10% absolute gain in grounding accuracy. Moreover, augmenting existing pruning methods with the identified IVC tokens consistently improves their spatial reasoning capabilities.
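The single-selection mechanics reduce to indexing cached keys/values with one fixed retained set while carrying the tokens' original position IDs forward. The sketch below assumes a simplified cache layout (one `(K, V)` array pair per layer, sequence-major); the function name and layout are illustrative, not the authors' implementation.

```python
import numpy as np

def prune_kv_cache(kv_cache, position_ids, keep_idx):
    """Prune cached keys/values to one fixed retained token set, keeping
    the tokens' ORIGINAL position IDs so RoPE geometry is unchanged.
    kv_cache: list of (K, V) arrays, each shaped (seq_len, head_dim)."""
    keep_idx = np.sort(np.asarray(keep_idx))  # preserve original ordering
    pruned = [(K[keep_idx], V[keep_idx]) for K, V in kv_cache]
    return pruned, position_ids[keep_idx]
```

Because the retained set is chosen once at the designated layer, the same `keep_idx` can shrink earlier layers' caches and gate token selection in later layers, which is where the memory savings come from.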

The contributions are twofold: (1) a novel theoretical analysis revealing that LVLMs implicitly construct a visual coordinate system through RoPE’s periodic orthogonal rotations, providing a concrete explanation for how absolute spatial information is encoded; (2) a practical, training‑free pruning pipeline that jointly preserves implicit coordinate anchors and semantically relevant foreground tokens, achieving substantial efficiency gains without sacrificing—and sometimes improving—task performance. This work opens a new direction for efficient multimodal inference, highlighting the importance of preserving spatial reference tokens when compressing LVLMs for real‑time, high‑resolution applications.

