📝 Original Info
- Title: Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies
- ArXiv ID: 2512.10384
- Date: 2025-12-11
- Authors: Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, Xin Lou
📝 Abstract
Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs and built with GPT-4o. Building on this benchmark, we propose a novel optimization strategy covering two perspectives, data construction and the training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1%, while open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pretraining phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.
📄 Full Content
Towards Fine-Grained Recognition with Large Visual
Language Models: Benchmark and Optimization
Strategies
Cong Pang∗
ShanghaiTech University
pangcong2022@shanghaitech.edu.cn
Hongtao Yu∗
Southeast University
yuht@njust.edu.cn
Zixuan Chen
SenseTime Research
chenzixuan3@sensetime.com
Lewei Lu
SenseTime Research
luotto@sensetime.com
Xin Lou†
ShanghaiTech University
louxin@shanghaitech.edu.cn
Abstract
Large Vision Language Models (LVLMs) have made remarkable progress, enabling
sophisticated vision-language interaction and dialogue applications. However, ex-
isting benchmarks primarily focus on reasoning tasks, often neglecting fine-grained
recognition, which is crucial for practical application scenarios. To address this
gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark,
designed for detailed evaluation of LVLMs and built with GPT-4o. Building on this benchmark, we
propose a novel optimization strategy covering two perspectives, data construction
and the training process, to improve the performance of LVLMs. Our dataset includes
mosaic data, which combines multiple short-answer responses, and open-world
data, generated from real-world questions and answers using GPT-4o, creating
a comprehensive framework for evaluating fine-grained recognition in LVLMs.
Experiments show that mosaic data improves category recognition accuracy by 1%
and open-world data boosts FROW benchmark accuracy by 10%-20% and content
accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-
training phase can improve the model’s category recognition accuracy by up to 10%.
The benchmark will be available at https://github.com/pc-inno/FROW.
1 Introduction
Large Language Models (LLMs) have demonstrated exceptional performance in open-domain lan-
guage tasks, marking significant strides toward the realization of Artificial General Intelligence (AGI).
Similarly, advancements in Large Vision-Language Models (LVLMs) [1, 2, 3, 4, 5, 6, 7] have enabled
sophisticated vision-language interactions and complex dialogue capabilities. To evaluate these mod-
els, a variety of benchmarks have been introduced, spanning from general-purpose to domain-specific
tasks [8, 9, 10, 11, 12, 13, 14]. However, few evaluations focus on fine-grained image tasks—a
critical aspect of computer vision—which involve distinguishing objects among multiple subordinate
categories. Current benchmarks [15, 16] predominantly evaluate fine-grained capabilities using
multiple-choice questions, simplifying the task by limiting the search space. Consequently, the true
extent of LVLMs’ fine-grained recognition abilities remains unclear. For instance, as illustrated in
∗Equal contribution.
†Corresponding author.
Preprint. Under review.
arXiv:2512.10384v1 [cs.CV] 11 Dec 2025
[Figure 1: two radar charts whose axes are the six fine-grained datasets (VegFru, Food101, Flowers102, Stanford Dog, CUB-200-2011, Aircraft); the left panel covers multiple-choice questions and the right the FROW benchmark, with curves for GPT-4o, InternVL-2.5, LLaVA-1.5, Qwen-VL-chat-72B, InternVL-2.5-enhance, and LLaVA-1.5-enhance.]
Figure 1: The left panel depicts the model’s recognition accuracy on fine-grained multiple-choice questions,
while the right panel showcases its accuracy on the FROW benchmark. Detailed scores are provided in Table 1.
Dashed lines represent the recognition accuracy of LVLMs optimized using the proposed strategy, highlighting
performance improvements for both InternVL and LLaVA. The images used in the evaluation are drawn from
six fine-grained datasets: FGVC-Aircraft [17], Caltech-UCSD Birds-200-2011 [18], Food-101 [19], Stanford
Dogs [20], Oxford Flowers-102 [21], and VegFru [22].
Figure 1, GPT-4o achieved near-perfect accuracy in multiple-choice fine-grained recognition tasks,
highlighting the limitations of existing evaluations.
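The narrowing of the search space that multiple-choice questions introduce can be made concrete with a back-of-the-envelope comparison (the numbers below are illustrative, not from the paper):

```python
# Illustrative only: chance-level accuracy under a 4-option
# multiple-choice format vs. an open-world setting where the model
# must name one of all fine-grained categories (e.g. the 200 bird
# classes of CUB-200-2011).
def chance_accuracy(num_candidates: int) -> float:
    """Expected accuracy of uniform random guessing over the candidates."""
    return 1.0 / num_candidates

mc = chance_accuracy(4)      # multiple-choice: 25% by guessing alone
ow = chance_accuracy(200)    # open-world: 0.5% by guessing alone
print(f"multiple-choice chance: {mc:.1%}, open-world chance: {ow:.1%}")
```

A 50x gap in chance accuracy is one simple reason near-perfect multiple-choice scores say little about true open-world recognition ability.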
To address the gap, we develop a more challenging Fine-grained Recognition Open-World benchmark
(FROW) based on images and Wikipedia content. To thoroughly assess LVLMs and their fine-grained
recognition capabilities, we employ expert models to evaluate various models on this benchmark.
Composed entirely of open-ended questions, FROW requires models to identify objects in images
accurately before providing correct answers. The benchmark assesses responses using two primary
metrics: recognition accuracy and content accuracy. Recognition accuracy evaluates whether a
model correctly identifies objects in images, whereas content accuracy gauges whether the model’s
responses are factually correct. As shown in Figure 1, the results demonstrate that both proprietary
and open-source LVLMs exhibit significant deficiencies in fine-grained recognition and domain-
specific knowledge. These findings highlight the critical need for improving LVLMs’ performance
in such tasks to enable meaningful reasoning.
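The two metrics can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' implementation: in the paper an expert judge model (GPT-4o) produces the verdicts, which the substring check and the `content_ok` flag stand in for here.

```python
# Hypothetical sketch of FROW-style scoring. Names and interfaces are
# illustrative; in practice a judge model compares each response against
# the ground-truth category and reference (e.g. Wikipedia) content.

def recognition_correct(answer: str, label: str) -> bool:
    """Did the response name the correct fine-grained category?"""
    return label.lower() in answer.lower()

def score(responses):
    """responses: list of dicts with 'answer', 'label', 'content_ok',
    where 'content_ok' stands in for a judge model's factuality verdict."""
    n = len(responses)
    rec = sum(recognition_correct(r["answer"], r["label"]) for r in responses) / n
    con = sum(r["content_ok"] for r in responses) / n
    return {"recognition_acc": rec, "content_acc": con}

demo = [
    {"answer": "This is a Boeing 737-800.", "label": "Boeing 737-800", "content_ok": True},
    {"answer": "A generic airliner.", "label": "Airbus A320", "content_ok": False},
]
print(score(demo))  # {'recognition_acc': 0.5, 'content_acc': 0.5}
```

Keeping the two metrics separate matters: a model can produce fluent, factually plausible text (high content accuracy) while still misidentifying the object (low recognition accuracy), which is exactly the failure mode an open-ended benchmark exposes.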
…(Full text truncated)…
This content is AI-processed based on ArXiv data.