📝 Original Info
- Title: Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies
- ArXiv ID: 2512.10384
- Date: 2025-12-11
- Authors: Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, Xin Lou
📝 Abstract
Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs and built with GPT-4o. Building on this benchmark, we propose a novel optimization strategy covering two perspectives, data construction and the training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1%, while open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pretraining phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.
📄 Full Content
Towards Fine-Grained Recognition with Large Visual
Language Models: Benchmark and Optimization
Strategies
Cong Pang∗
ShanghaiTech University
pangcong2022@shanghaitech.edu.cn
Hongtao Yu∗
Southeast University
yuht@njust.edu.cn
Zixuan Chen
SenseTime Research
chenzixuan3@sensetime.com
Lewei Lu
SenseTime Research
luotto@sensetime.com
Xin Lou†
ShanghaiTech University
louxin@shanghaitech.edu.cn
Abstract
Large Vision Language Models (LVLMs) have made remarkable progress, enabling
sophisticated vision-language interaction and dialogue applications. However, ex-
isting benchmarks primarily focus on reasoning tasks, often neglecting fine-grained
recognition, which is crucial for practical application scenarios. To address this
gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark,
designed for detailed evaluation of LVLMs and built with GPT-4o. Building on this benchmark, we
propose a novel optimization strategy covering two perspectives, data construction
and the training process, to improve the performance of LVLMs. Our dataset includes
mosaic data, which combines multiple short-answer responses, and open-world
data, generated from real-world questions and answers using GPT-4o, creating
a comprehensive framework for evaluating fine-grained recognition in LVLMs.
Experiments show that mosaic data improves category recognition accuracy by 1%
and open-world data boosts FROW benchmark accuracy by 10%-20% and content
accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-
training phase can improve the model’s category recognition accuracy by up to 10%.
The benchmark will be available at https://github.com/pc-inno/FROW.
1 Introduction
Large Language Models (LLMs) have demonstrated exceptional performance in open-domain lan-
guage tasks, marking significant strides toward the realization of Artificial General Intelligence (AGI).
Similarly, advancements in Large Vision-Language Models (LVLMs) [1, 2, 3, 4, 5, 6, 7] have enabled
sophisticated vision-language interactions and complex dialogue capabilities. To evaluate these mod-
els, a variety of benchmarks have been introduced, spanning from general-purpose to domain-specific
tasks [8, 9, 10, 11, 12, 13, 14]. However, few evaluations focus on fine-grained image tasks—a
critical aspect of computer vision—which involve distinguishing objects among multiple subordinate
categories. Current benchmarks [15, 16] predominantly evaluate fine-grained capabilities using
multiple-choice questions, simplifying the task by limiting the search space. Consequently, the true
extent of LVLMs’ fine-grained recognition abilities remains unclear. For instance, as illustrated in
∗Equal contribution.
†Corresponding author.
Preprint. Under review.
arXiv:2512.10384v1 [cs.CV] 11 Dec 2025
[Figure 1: two radar charts whose axes are the six fine-grained datasets (VegFru, Food101, Flowers102, Stanford Dog, CUB-200-2011, Aircraft); the left panel covers multiple-choice questions and the right the FROW benchmark, with curves for GPT-4o, InternVL-2.5, LLaVA-1.5, Qwen-VL-chat-72B, InternVL-2.5-enhance, and LLaVA-1.5-enhance.]
Figure 1: The left panel depicts the model’s recognition accuracy on fine-grained multiple-choice questions,
while the right panel showcases its accuracy on the FROW benchmark. Detailed scores are provided in Table 1.
Dashed lines represent the recognition accuracy of LVLMs optimized using the proposed strategy, highlighting
performance improvements for both InternVL and LLaVA. The images used in the evaluation are drawn from
six fine-grained datasets: FGVC-Aircraft [17], Caltech-UCSD Birds-200-2011 [18], Food-101 [19], Stanford
Dogs [20], Oxford Flowers-102 [21], and VegFru [22].
Figure 1, GPT-4o achieved near-perfect accuracy in multiple-choice fine-grained recognition tasks,
highlighting the limitations of existing evaluations.
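The narrowing of the search space that multiple-choice questions introduce can be made concrete with a back-of-the-envelope comparison (the numbers below are illustrative, not from the paper):

```python
# Illustrative only: chance-level accuracy under a 4-option
# multiple-choice format vs. an open-world setting where the model
# must name one of all fine-grained categories (e.g. the 200 bird
# classes of CUB-200-2011).
def chance_accuracy(num_candidates: int) -> float:
    """Expected accuracy of uniform random guessing over the candidates."""
    return 1.0 / num_candidates

mc = chance_accuracy(4)      # multiple-choice: 25% by guessing alone
ow = chance_accuracy(200)    # open-world: 0.5% by guessing alone
print(f"multiple-choice chance: {mc:.1%}, open-world chance: {ow:.1%}")
```

A 50x gap in chance accuracy is one simple reason near-perfect multiple-choice scores say little about true open-world recognition ability.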
To address the gap, we develop a more challenging Fine-grained Recognition Open-World benchmark
(FROW) based on images and Wikipedia content. To thoroughly assess LVLMs and their fine-grained
recognition capabilities, we employ expert models to evaluate various models on this benchmark.
Composed entirely of open-ended questions, FROW requires models to identify objects in images
accurately before providing correct answers. The benchmark assesses responses using two primary
metrics: recognition accuracy and content accuracy. Recognition accuracy evaluates whether a
model correctly identifies objects in images, whereas content accuracy gauges whether the model’s
responses are factually correct. As shown in Figure 1, the results demonstrate that both proprietary
and open-source LVLMs exhibit significant deficiencies in fine-grained recognition and domain-
specific knowledge. These findings highlight the critical need for improving LVLMs’ performance
in such tasks to enable meaningful reasoning.
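The two metrics can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' implementation: in the paper an expert judge model (GPT-4o) produces the verdicts, which the substring check and the `content_ok` flag stand in for here.

```python
# Hypothetical sketch of FROW-style scoring. Names and interfaces are
# illustrative; in practice a judge model compares each response against
# the ground-truth category and reference (e.g. Wikipedia) content.

def recognition_correct(answer: str, label: str) -> bool:
    """Did the response name the correct fine-grained category?"""
    return label.lower() in answer.lower()

def score(responses):
    """responses: list of dicts with 'answer', 'label', 'content_ok',
    where 'content_ok' stands in for a judge model's factuality verdict."""
    n = len(responses)
    rec = sum(recognition_correct(r["answer"], r["label"]) for r in responses) / n
    con = sum(r["content_ok"] for r in responses) / n
    return {"recognition_acc": rec, "content_acc": con}

demo = [
    {"answer": "This is a Boeing 737-800.", "label": "Boeing 737-800", "content_ok": True},
    {"answer": "A generic airliner.", "label": "Airbus A320", "content_ok": False},
]
print(score(demo))  # {'recognition_acc': 0.5, 'content_acc': 0.5}
```

Keeping the two metrics separate matters: a model can produce fluent, factually plausible text (high content accuracy) while still misidentifying the object (low recognition accuracy), which is exactly the failure mode an open-ended benchmark exposes.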
…(Full text truncated)…
This content is AI-processed based on ArXiv data.