When LLaVA Meets Objects: Token Composition for Vision-Language-Models


Current autoregressive Vision Language Models (VLMs) typically rely on a large number of visual tokens to represent images, which drives up compute cost, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that combines visual features at different levels of granularity into a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object representations with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly drop tokens, in particular the mask-based object tokens, at test time, so the token count can be adapted during inference without retraining and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks and obtain results competitive with current token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time with good performance.


💡 Research Summary

Mask‑LLaVA addresses the growing computational burden of autoregressive vision‑language models (VLMs) that rely on hundreds of visual tokens per image. The authors propose a multi‑granularity token composition strategy that combines three distinct visual representations: a global CLS token from a Vision Transformer (ViT), locally pooled patch tokens, and mask‑based object tokens derived from automatically generated segmentation masks. During training, all three token types are concatenated, resulting in a visual sequence that is roughly 75 % smaller than the standard LLaVA baseline while preserving rich visual information.
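The claimed reduction can be sanity-checked with a little arithmetic. The counts below are inferred from this summary (24 × 24 patch grid for LLaVA-1.5, 36 pooled tokens, 100 object masks plus background) and may differ from the paper's exact configuration:

```python
# Rough token budget for Mask-LLaVA vs. the LLaVA-1.5 baseline.
# Counts are inferred from this summary and may differ from the paper.
LLAVA_BASELINE_TOKENS = 576        # 24 x 24 ViT patch grid in LLaVA-1.5

cls_tokens = 1                     # global CLS token
pooled_tokens = 6 * 6              # 24x24 grid average-pooled with a 4x4 kernel
object_tokens = 100 + 1            # top-100 object masks plus one background mask

mask_llava_tokens = cls_tokens + pooled_tokens + object_tokens   # 138 tokens
reduction = 1 - mask_llava_tokens / LLAVA_BASELINE_TOKENS        # ~0.76
```

With these assumptions the visual sequence shrinks from 576 to 138 tokens, i.e. roughly a 75 % reduction, matching the figure quoted above.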

The pipeline begins by feeding an input image into a pretrained ViT to obtain the CLS token and a dense grid of patch embeddings. The patch grid (24 × 24) is average‑pooled with a 4 × 4 kernel, yielding a compact set of 6 × 6 (36) tokens that capture local context. In parallel, an objectness detector selects the top‑100 bounding boxes, which are then refined into segmentation masks using the Segment‑Anything Model (SAM). For each mask, the Mask‑inversion technique learns a dedicated embedding that aligns an explainability map with the mask region; a sinusoidal positional encoding based on the box centre is added to preserve spatial relationships. A background mask is also included to guarantee full image coverage.
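The pooling step above is a standard non-overlapping average pool over the patch grid. A minimal NumPy sketch (the 1024-dim feature size is an assumption for illustration):

```python
import numpy as np

def pool_patches(patch_grid: np.ndarray, kernel: int = 4) -> np.ndarray:
    """Average-pool an (H, W, D) patch grid with a non-overlapping kernel."""
    h, w, d = patch_grid.shape
    assert h % kernel == 0 and w % kernel == 0
    # split each spatial axis into (blocks, kernel) and average within blocks
    blocks = patch_grid.reshape(h // kernel, kernel, w // kernel, kernel, d)
    return blocks.mean(axis=(1, 3))

patches = np.random.randn(24, 24, 1024)   # 24x24 ViT patch grid (dim assumed)
local_tokens = pool_patches(patches)      # -> (6, 6, 1024), i.e. 36 tokens
```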

Because the three token families have different magnitude distributions, the authors introduce a scaling step: the mean (µ_local) and standard deviation (σ_local) of the patch token norms are computed, and both the CLS and object tokens are linearly transformed to match this distribution (F̂ = F·σ_local + µ_local). This normalization mitigates imbalance during the multimodal projection.
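A minimal sketch of this scaling step, taking the formula literally; whether the CLS/object features are pre-normalized before the affine map is not specified in the summary, so the helper below applies it directly:

```python
import numpy as np

def scale_to_local_stats(tokens: np.ndarray, patch_tokens: np.ndarray) -> np.ndarray:
    """Affinely rescale CLS/object tokens toward the patch-token norm statistics.

    Literal sketch of F_hat = F * sigma_local + mu_local; any pre-normalization
    of F is an open assumption, as it is not described in the summary.
    """
    norms = np.linalg.norm(patch_tokens, axis=-1)       # per-token L2 norms
    mu_local, sigma_local = norms.mean(), norms.std()
    return tokens * sigma_local + mu_local
```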

The multimodal projector P(·) aligns the concatenated visual features with the language space. Training follows the two‑stage protocol of LLaVA: (1) vision‑language pre‑training where only P(·) is updated on caption data while the vision backbone and LLM remain frozen, and (2) instruction tuning where P(·) and the LLM are jointly fine‑tuned on a suite of multimodal tasks (VQA, ScienceQA‑IMG, etc.) with the vision encoder still frozen.
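The two-stage schedule reduces to a simple trainability table; the module names below are illustrative, not the authors' identifiers:

```python
# Which modules are trainable in each LLaVA-style training stage.
# Module names are illustrative; the vision encoder stays frozen throughout.
STAGES = {
    "pretrain":         {"vision_encoder": False, "projector": True, "llm": False},
    "instruction_tune": {"vision_encoder": False, "projector": True, "llm": True},
}

def trainable_modules(stage: str) -> list[str]:
    """Modules whose parameters would get requires_grad=True in this stage."""
    return [name for name, trainable in STAGES[stage].items() if trainable]
```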

A key advantage of Mask‑LLaVA is dynamic token reduction at inference time. The 101 object masks often contain redundancy; the authors apply an IoU‑based filter (removing masks with IoU ≥ 0.5) and then prune the remaining masks by descending objectness confidence. This allows the number of object tokens to be flexibly adjusted without retraining, and even a 50 % reduction in object tokens incurs negligible performance loss. Patch tokens can also be further pruned if desired.
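The filter described above can be sketched as a greedy, NMS-style pass over binary masks; the 0.5 IoU threshold comes from the summary, while the `keep` budget is a free parameter:

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two binary masks (boolean arrays of the same shape)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def prune_masks(masks, scores, iou_thresh=0.5, keep=50):
    """Greedy sketch of the inference-time filter: drop masks that overlap an
    already-kept mask (IoU >= thresh), then cap at the top-`keep` by objectness.
    """
    order = np.argsort(scores)[::-1]          # descending confidence
    kept = []
    for i in order:
        if all(iou(masks[i], masks[j]) < iou_thresh for j in kept):
            kept.append(int(i))
        if len(kept) == keep:
            break
    return kept
```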

Extensive experiments on eight benchmarks—VQA‑v2, GQA, VizWiz, ScienceQA‑IMG, POPE, MME, MMBench, and MM‑Vet—demonstrate that Mask‑LLaVA achieves competitive or slightly superior accuracy compared to the original LLaVA‑1.5 while using only a fraction of the visual tokens. Ablation studies reveal that (i) the scaling step is crucial; without it, performance drops by 3–5 %, (ii) pooled patch tokens alone already yield a strong baseline, but adding CLS and object tokens consistently improves results by 1–2 %, and (iii) the IoU‑based mask pruning outperforms random token dropping.

In summary, the paper contributes three main innovations: (1) a token composition framework that fuses global, local, and object‑level visual information to drastically reduce token count, (2) a simple yet effective norm‑scaling technique to harmonize heterogeneous token types, and (3) a flexible inference‑time token pruning mechanism that enables VLM deployment under varying computational budgets. By demonstrating that object tokens over‑sampled during training can be safely under‑sampled at test time, Mask‑LLaVA opens a practical pathway toward efficient, high‑performing vision‑language models in real‑world applications.

