Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning
Vision token pruning has proven to be an effective acceleration technique for efficient Vision-Language Models (VLMs). However, existing pruning methods preserve performance well on visual question answering (VQA) yet suffer substantial degradation on visual grounding (VG) tasks. Our analysis of the VLM's processing pipeline reveals that strategies relying on global semantic similarity and attention scores lose the global spatial reference frame, which is derived from the interactions of tokens' positional information. Motivated by these findings, we propose $\text{Nüwa}$, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. In the first stage, after the vision encoder, we apply three operations, namely separation, alignment, and aggregation, which are inspired by swarm intelligence algorithms, to retain information-rich global spatial anchors. In the second stage, within the LLM, we perform text-guided pruning to retain task-relevant visual tokens. Extensive experiments demonstrate that $\text{Nüwa}$ achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
💡 Research Summary
The paper addresses a critical shortcoming of current visual token pruning techniques for Vision‑Language Models (VLMs). While pruning can dramatically reduce the number of visual tokens and thus accelerate inference, existing methods preserve performance on Visual Question Answering (VQA) but cause severe degradation on Visual Grounding (VG) tasks. The authors trace this discrepancy to the loss of a global spatial reference frame that is built from the interaction of token positional information throughout the VLM pipeline.
Through systematic experiments, they first categorize pruning methods into three families—vision‑encoder‑side, LLM single‑layer, and LLM multi‑layer pruning—and compare them against simple baselines (random sampling and average pooling). The results show that advanced methods offer little advantage over baselines on VQA, and all methods suffer systematic performance drops on object‑centric grounding tasks. Further analysis of attention flows and gradient‑weighted attribution reveals a two‑stage visual processing pipeline: an early stage that integrates global multimodal information, and a middle stage where the model forms object‑level, fine‑grained representations. Grounding tasks rely heavily on this middle stage, making them sensitive to any disruption of spatial structure.
To remedy this, the authors propose Nüwa, a two‑stage spatial‑aware token pruning framework. The first stage operates immediately after the vision encoder and consists of three operations inspired by the Boids swarm‑intelligence algorithm:
- Separation – partitions the token map into localized regions, reducing token density while preserving spatial continuity.
- Alignment – selects representative tokens that best align with a global spatial anchor (derived from absolute token coordinates and a positional histogram) and have high information density.
- Aggregation – merges neighboring tokens around each representative using a weighted combination of semantic similarity and spatial distance, producing a compact yet spatially coherent feature.
These operations keep a set of “global spatial anchors” that maintain the integrity of the positional reference frame even after aggressive token reduction.
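The three operations above can be sketched as a single pass over the token grid. This is a minimal NumPy sketch, not the paper's implementation: the window size, the feature-norm informativeness score, and the exponential similarity-vs-distance weighting are illustrative assumptions standing in for the paper's anchor selection and merging rules.

```python
import numpy as np

def boids_prune(tokens, grid_hw, window=3, alpha=0.5):
    """Hedged sketch of the Separation/Alignment/Aggregation pipeline.

    tokens : (H*W, D) visual token features from the vision encoder.
    grid_hw: (H, W) spatial layout of the token map.
    window : side length of each local region (Separation).
    alpha  : trade-off between semantic similarity and spatial
             proximity when merging neighbours (Aggregation).
    """
    H, W = grid_hw
    D = tokens.shape[1]
    feats = tokens.reshape(H, W, D)
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    kept = []
    # Separation: partition the token map into non-overlapping windows,
    # reducing token density while preserving spatial continuity.
    for y0 in range(0, H, window):
        for x0 in range(0, W, window):
            patch = feats[y0:y0 + window, x0:x0 + window].reshape(-1, D)
            py = ys[y0:y0 + window, x0:x0 + window].reshape(-1)
            px = xs[y0:y0 + window, x0:x0 + window].reshape(-1)
            # Alignment: pick the most informative token in the region
            # as its representative (feature norm as a proxy score).
            score = np.linalg.norm(patch, axis=1)
            r = int(score.argmax())
            rep, ry, rx = patch[r], py[r], px[r]
            # Aggregation: merge neighbours into the representative,
            # weighted by semantic similarity and spatial distance.
            sim = patch @ rep / (np.linalg.norm(patch, axis=1)
                                 * np.linalg.norm(rep) + 1e-8)
            dist = np.hypot(py - ry, px - rx)
            w = np.exp(alpha * sim - (1 - alpha) * dist)
            kept.append((w[:, None] * patch).sum(0) / w.sum())
    return np.stack(kept)  # (num_regions, D) spatially anchored tokens
```

A 6x6 token map with `window=3` collapses to 4 region tokens, each anchored at its window's most informative position, which is what keeps the positional reference frame intact after reduction.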
The second stage occurs inside the LLM decoder. Here, a text‑guided pruning step uses the question or instruction to extract key nouns and relations, then retains only those visual tokens that exhibit strong cross‑modal attention with the extracted textual cues. This task‑specific refinement ensures that grounding tasks keep the necessary object tokens while still discarding irrelevant visual information.
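Assuming the extracted nouns and relations are available as text embeddings, the second stage reduces to scoring visual tokens by their strongest cross-modal attention from any textual cue and keeping the top fraction. The softmax attention, max-over-cues relevance score, and `keep_ratio` below are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

def text_guided_prune(vis_tokens, text_tokens, keep_ratio=0.5):
    """Retain the visual tokens most attended to by the textual cues.

    vis_tokens : (Nv, D) visual tokens inside the LLM decoder.
    text_tokens: (Nt, D) embeddings of the extracted nouns/relations.
    keep_ratio : fraction of visual tokens to retain (illustrative).
    """
    D = vis_tokens.shape[1]
    # Scaled dot-product attention from each text cue to visual tokens.
    logits = text_tokens @ vis_tokens.T / np.sqrt(D)        # (Nt, Nv)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # A visual token's relevance is its strongest attention from any cue,
    # so tokens tied to any mentioned object survive the cut.
    relevance = attn.max(axis=0)                            # (Nv,)
    k = max(1, int(keep_ratio * len(relevance)))
    keep = np.sort(np.argsort(-relevance)[:k])
    return vis_tokens[keep], keep
```

Keeping the retained indices in their original order (the final `np.sort`) preserves the tokens' spatial ordering inside the LLM sequence.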
Experiments are conducted on the LLaVA‑1.5 7B model across 12 datasets covering both VQA (e.g., GQA, VQAv2, MM‑Bench) and VG (RefCOCO, RefCOCO+, RefCOCOg). Compared with baselines and state‑of‑the‑art pruning methods such as VisionZip, FastV, and SparseVLM, Nüwa achieves:
- VQA: performance retention improves from 94 % to 95 % while reducing visual tokens by 88.9 %.
- VG: accuracy jumps from a mere 7 % (with existing methods) to 47 %, a more than six‑fold gain.
- Efficiency: 89 % reduction in TFLOPs, 62 % reduction in pre‑fill time, and 88.9 % token reduction overall.
The authors also introduce two fine‑grained metrics—Visual Attention Entropy (VAE) and Object‑Centric Cohesion (OCC)—to quantify spatial integrity. Nüwa consistently yields lower VAE (indicating more focused attention) and higher OCC (indicating stronger alignment of tokens with ground‑truth objects) across layers, confirming that spatial structure is preserved.
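Under the obvious readings of these metrics, VAE is the Shannon entropy of a normalized attention distribution over visual tokens, and OCC is the share of attention mass landing on tokens inside the ground-truth object region. The definitions below are assumed from those descriptions, not taken from the paper:

```python
import numpy as np

def visual_attention_entropy(attn):
    """Entropy of an attention distribution over visual tokens;
    lower values indicate more focused attention (assumed VAE).

    attn: (Nv,) non-negative attention weights for one query.
    """
    p = attn / attn.sum()
    p = p[p > 0]  # 0 * log(0) contributes nothing
    return float(-(p * np.log(p)).sum())

def object_centric_cohesion(attn, in_object_mask):
    """Fraction of attention mass on tokens inside the ground-truth
    object region; higher means stronger cohesion (assumed OCC).

    in_object_mask: (Nv,) boolean mask of tokens within the object.
    """
    return float(attn[in_object_mask].sum() / attn.sum())
```

Uniform attention over N tokens gives the maximal VAE of log(N), while attention concentrated on a single token gives 0, matching the "lower is more focused" reading above.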
Ablation studies show that each of the three Boids‑inspired operations contributes uniquely: removing Separation leads to token crowding and loss of local detail; omitting Alignment degrades the selection of globally informative anchors; skipping Aggregation reduces the compactness of the final representation. Text‑guided pruning in the LLM further refines the token set for task‑specific relevance.
The paper releases code and model weights, facilitating reproducibility. Limitations are acknowledged: the current design is tuned for ViT‑based vision encoders, and the text‑guided stage assumes explicit object mentions in the query, which may not hold for more ambiguous prompts. Future work could extend Nüwa to other backbone architectures, explore dynamic pruning ratios during streaming inference, and integrate learned spatial anchors rather than handcrafted histograms.
In summary, Nüwa offers a principled solution to the “spatial integrity” problem in VLM token pruning. By preserving a global spatial reference frame through a swarm‑inspired token reduction and then applying task‑aware textual guidance, it simultaneously achieves state‑of‑the‑art accuracy on both VQA and Visual Grounding while delivering substantial computational savings. This work sets a new benchmark for efficient, task‑robust multimodal inference.