Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models


Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. These improvements are evident across VLMs built on a variety of base models, alignment architectures, and training data. However, recent work shows that these models lag behind on traditional image classification benchmarks, which test fine-grained visual knowledge. We evaluate a large number of recent VLMs on fine-grained classification benchmarks and identify factors that contribute to the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores roughly equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we find that the pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining. These insights pave the way for enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.


💡 Research Summary

This paper investigates the fine‑grained visual knowledge of contemporary vision‑language models (VLMs) and identifies the design and training factors that most influence performance on fine‑grained classification tasks. The authors evaluate 15 recent VLMs—ranging from LLaVA and Phi‑3‑Vision to Qwen2‑VL and Molmo—on four well‑established fine‑grained benchmarks (ImageNet‑1K, Oxford Flowers‑102, Oxford‑IIIT Pets‑37, and Food‑101). To fit the VLM paradigm, each dataset is transformed into a five‑choice multiple‑choice format using a separate OpenCLIP model to select hard negative options.
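The hard-negative option selection described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: in practice the label embeddings would come from an OpenCLIP text encoder, whereas here they are plain toy vectors, and the function name `pick_hard_negatives` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_hard_negatives(true_label, label_embeddings, k=4):
    """Return the k labels most similar to the true label's embedding.

    These become the distractor options, so each question is a
    (k+1)-way multiple choice with confusable, rather than random,
    wrong answers.
    """
    anchor = label_embeddings[true_label]
    others = [(label, cosine(anchor, emb))
              for label, emb in label_embeddings.items()
              if label != true_label]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in others[:k]]

# Toy embeddings standing in for OpenCLIP text features.
embs = {
    "golden retriever": [1.0, 0.1],
    "labrador": [0.9, 0.2],
    "tabby cat": [0.1, 1.0],
    "siamese cat": [0.2, 0.9],
    "toaster": [-1.0, 0.0],
}
print(pick_hard_negatives("golden retriever", embs, k=2))
# → ['labrador', 'siamese cat']
```

With five options per question, chance accuracy is 20%, so the hard negatives make the benchmark discriminative: a model must actually distinguish similar categories rather than rule out unrelated ones.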

The study reveals a striking divergence: models that achieve comparable scores on eight general VQA benchmarks can differ by up to 19 percentage points on fine‑grained classification. This indicates that fine‑grained recognition constitutes a distinct axis of visual intelligence not captured by standard VLM evaluations.

Through 22 systematic ablation experiments, the authors isolate four key contributors: (1) Language model quality – larger, better‑pretrained LLMs uniformly boost both VQA and fine‑grained tasks; (2) Vision encoder strength – a more powerful encoder (e.g., CLIP‑ViT‑L/14, DFN‑CLIP) disproportionately improves fine‑grained accuracy while having limited effect on VQA; (3) Pretraining stage – large‑scale image‑caption pretraining markedly raises fine‑grained performance, especially when the LLM’s weights are unfrozen during this stage; (4) Weight‑updating strategy – jointly training the vision‑LLM connector and the LLM yields higher fine‑grained scores than training only the connector. Data quality (noise level, label diversity) shows a modest impact when the LLM is frozen.
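The weight-updating strategies compared in factors (3) and (4) can be expressed as a simple stage configuration. This is a hedged sketch, not the authors' training code: the strategy names and the `trainable_modules` helper are illustrative, and in a real PyTorch pipeline these flags would map onto `requires_grad` settings for each module's parameters.

```python
# Which VLM components receive gradient updates under each strategy.
# Per the ablations summarized above, training the connector together
# with an unfrozen LLM during pretraining yields the best fine-grained
# scores; the vision encoder is kept frozen in both settings here.
STRATEGIES = {
    "connector_only":     {"vision_encoder": False, "connector": True, "llm": False},
    "connector_plus_llm": {"vision_encoder": False, "connector": True, "llm": True},
}

def trainable_modules(strategy):
    """Return the sorted names of modules whose weights are updated."""
    flags = STRATEGIES[strategy]
    return sorted(name for name, trainable in flags.items() if trainable)

print(trainable_modules("connector_plus_llm"))
# → ['connector', 'llm']
```

In PyTorch terms, applying a strategy would amount to iterating over each listed module and setting `p.requires_grad = True` on its parameters while freezing the rest.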

Direct comparisons between VLMs and their underlying CLIP vision encoders expose a consistent gap: most VLMs fall 4–18 percentage points behind the encoder‑only baseline, underscoring that current multimodal integration often sacrifices pure visual discrimination.

The authors conclude that to endow VLMs with robust real‑world capabilities—such as distinguishing poisonous mushrooms, diagnosing subtle medical conditions, or recognizing nuanced traffic signs—future work must prioritize stronger vision encoders, allow LLM weight updates during pretraining, and design training pipelines that jointly optimize visual and linguistic representations. This roadmap points toward “vision‑centric” VLMs that balance language reasoning with high‑fidelity visual perception.

