iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or within the Large Language Model (LLM) stage to lower computational cost. This overlooks other major bottlenecks, particularly the image encoder, which itself requires substantial computation. As a result, these methods fall short of achieving true end-to-end acceleration. Importantly, the image encoder is the primary contributor of input tokens to the LLM. Thus, reducing visual redundancy at the encoder stage not only speeds up the encoder itself but also significantly lightens the workload for the subsequent LLM. Motivated by this, we investigate how to jointly optimize the image encoder and the LLM along with other LVLM components for comprehensive acceleration. To mitigate the risk of performance degradation from token reduction, we propose a novel token merging strategy that recycles useful information from otherwise discarded tokens. Our approach, iLLaVA, delivers consistent improvements across both image and video understanding tasks, achieving up to a 2× throughput boost and a 4× reduction in prefilling time. Notably, iLLaVA enables a larger model (e.g., InternVL-2.5 26B) to surpass a smaller counterpart (e.g., InternVL-2.5 8B) in both accuracy and efficiency. Extensive comparisons with state-of-the-art token pruning and merging techniques demonstrate the clear superiority of our method. Finally, we provide detailed visualizations for the merging steps of iLLaVA, offering deeper insights into how different LVLM components contribute to efficient computation.


💡 Research Summary

Large Vision‑Language Models (LVLMs) have achieved impressive performance across image, video, and multimodal reasoning tasks, but their inference cost remains prohibitive due to the quadratic complexity of self‑attention in both the visual encoder and the large language model (LLM). Existing efficiency methods focus almost exclusively on reducing the number of visual tokens that are fed into the LLM, either by pruning or compressing them just before or inside the language model. This narrow focus ignores a second, equally important bottleneck: the visual encoder itself, which consumes a large share of FLOPs and produces the majority of tokens that later burden the LLM.

The paper introduces iLLaVA, a method that jointly accelerates the image encoder and the LLM by applying a two‑stage token merging strategy. Instead of simply discarding low‑importance tokens, iLLaVA merges them with the most informative ones, thereby recycling useful information that would otherwise be lost. Token importance is measured by attention scores: in the encoder, the multi‑head attention maps highlight the few spatial regions that actually drive the model’s predictions; in the LLM, attention scores similarly indicate which visual tokens are most relevant to the textual context.
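As an illustration only (not the authors' code), attention-based token scoring of this kind can be sketched as follows; the array shapes and the choice of averaging over heads and queries are assumptions for the sketch, not details taken from the paper:

```python
import numpy as np

def token_importance(attn: np.ndarray) -> np.ndarray:
    """Score each token by the attention it receives.

    attn: (heads, queries, tokens) attention weights from one block.
    Returns a (tokens,) importance vector, averaged over heads and queries.
    """
    return attn.mean(axis=(0, 1))

def split_tokens(scores: np.ndarray, r: int):
    """Keep the highest-scoring tokens; mark the r lowest for merging."""
    order = np.argsort(scores)            # indices sorted by ascending score
    drop, keep = order[:r], order[r:]
    return np.sort(keep), np.sort(drop)   # restore original token order
```

In this sketch, `keep` indexes the informative tokens that survive a merging step and `drop` indexes the low-importance tokens that are recycled rather than discarded outright.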

Implementation details: a lightweight Token‑Merge module is inserted after the attention sub‑layer of selected encoder blocks and after specific LLM blocks. For each selected block, a fixed number of tokens (R_v in the encoder, R_t in the LLM) are merged, reducing the token count from N to N – R_v·B_v – R_t·B_t, where B_v and B_t are the numbers of blocks where merging occurs. The merging operation computes a weighted sum of the discarded token embeddings using the attention distribution, and adds the result to the retained token embeddings, preserving salient information while shrinking the sequence length.
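A minimal sketch of that merging step (hypothetical names and shapes; the paper's exact weighting scheme may differ): each discarded token's embedding is folded into the retained tokens as an attention-weighted sum, so the sequence shrinks by R while the discarded information is preserved:

```python
import numpy as np

def merge_tokens(x: np.ndarray, keep: np.ndarray, drop: np.ndarray,
                 weights: np.ndarray) -> np.ndarray:
    """Fold dropped token embeddings into the retained ones.

    x:       (N, D) token embeddings.
    keep:    indices of the N - R retained tokens.
    drop:    indices of the R tokens to merge away.
    weights: (R, N - R) row-stochastic matrix, e.g. the attention each
             dropped token pays to each retained token, renormalized.
    Returns the (N - R, D) merged token sequence.
    """
    retained = x[keep]
    recycled = weights.T @ x[drop]   # (N - R, D) weighted sum of dropped tokens
    return retained + recycled
```

Because `weights` is derived from the attention distribution, tokens that attended strongly to a retained token contribute most of their embedding to it, which is how the sketch preserves salient information while shortening the sequence.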

Extensive experiments were conducted on more than ten benchmark datasets covering single‑image classification, multi‑image question answering, and video understanding. Results show that iLLaVA can remove 30–70% of visual tokens while retaining over 95% of the original performance. Compared with pruning the same number of tokens only within the LLM, performing merging already in the encoder yields an average +25.3% throughput boost and a 21.2% reduction in memory consumption. Overall, iLLaVA achieves up to a 2× increase in inference throughput and a 4× reduction in prefilling time. Notably, a larger InternVL‑2.5 26B model equipped with iLLaVA outperforms a smaller InternVL‑2.5 8B model both in accuracy and efficiency, demonstrating that the method scales favorably with model size.

The paper benchmarks iLLaVA against recent token‑pruning and token‑merging approaches such as FastV, SparseVLM, Faster‑VLM, VisionZip, PyramidDrop, DiVPrune, AdaFV, and AIM. Across all metrics—FLOPs, latency, memory, and task accuracy—iLLaVA consistently leads. Visualizations of attention maps and merging steps illustrate that selected tokens (highlighted in red) correspond to semantically important regions, while discarded tokens (black) are merged in a way that preserves their contextual contribution.

Limitations include the use of fixed merging ratios, which may not be optimal for every input, and the modest additional parameters introduced by the merge modules. Future work could explore adaptive, input‑dependent merging schedules, apply the technique to non‑transformer visual backbones (e.g., CNN‑based encoders), and integrate the merging policy into end‑to‑end training for further gains.

In summary, iLLaVA demonstrates that substantial redundancy exists not only in the token stream fed to the LLM but also within the visual encoder itself. By intelligently merging tokens at both stages, the method achieves true end‑to‑end acceleration of LVLMs, setting a new benchmark for efficient multimodal inference.

