Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies

Reading time: 6 minutes

📝 Original Info

  • Title: Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies
  • ArXiv ID: 2512.10384
  • Date: 2025-12-11
  • Authors: Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, Xin Lou

📝 Abstract

Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. Building on this benchmark, we propose a novel optimization strategy from two perspectives, data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1%, while open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pretraining phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.
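The abstract describes mosaic data as combining multiple short-answer responses into one sample. A minimal sketch of what such a combination step might look like is shown below; the function name and the `q`/`a` field names are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch of "mosaic" data construction: several short QA
# pairs are merged into one multi-part training sample. Field names
# ("q", "a", "question", "answer") are illustrative, not from the paper.
def build_mosaic_sample(qa_pairs):
    """Combine multiple short-answer QA pairs into a single numbered sample."""
    question = " ".join(f"({i + 1}) {qa['q']}" for i, qa in enumerate(qa_pairs))
    answer = " ".join(f"({i + 1}) {qa['a']}" for i, qa in enumerate(qa_pairs))
    return {"question": question, "answer": answer}

sample = build_mosaic_sample([
    {"q": "What breed is the dog?", "a": "Border Collie."},
    {"q": "What flower is shown?", "a": "A sunflower."},
])
print(sample["question"])  # (1) What breed is the dog? (2) What flower is shown?
```

The idea is that forcing the model to answer several fine-grained questions in one turn increases the density of recognition supervision per training sample; the actual pairing and formatting used by the authors may differ.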


📄 Full Content

Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies

Cong Pang∗ (ShanghaiTech University, pangcong2022@shanghaitech.edu.cn), Hongtao Yu∗ (Southeast University, yuht@njust.edu.cn), Zixuan Chen (SenseTime Research, chenzixuan3@sensetime.com), Lewei Lu (SenseTime Research, luotto@sensetime.com), Xin Lou† (ShanghaiTech University, louxin@shanghaitech.edu.cn)

Abstract

Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pretraining phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.

1 Introduction

Large Language Models (LLMs) have demonstrated exceptional performance in open-domain language tasks, marking significant strides toward the realization of Artificial General Intelligence (AGI).
Similarly, advancements in Large Vision-Language Models (LVLMs) [1, 2, 3, 4, 5, 6, 7] have enabled sophisticated vision-language interactions and complex dialogue capabilities. To evaluate these models, a variety of benchmarks have been introduced, spanning from general-purpose to domain-specific tasks [8, 9, 10, 11, 12, 13, 14]. However, few evaluations focus on fine-grained image tasks, a critical aspect of computer vision, which involve distinguishing objects among multiple subordinate categories. Current benchmarks [15, 16] predominantly evaluate fine-grained capabilities using multiple-choice questions, simplifying the task by limiting the search space. Consequently, the true extent of LVLMs' fine-grained recognition abilities remains unclear.

∗Equal contribution. †Corresponding author. Preprint. Under review. arXiv:2512.10384v1 [cs.CV] 11 Dec 2025

Figure 1: The left panel depicts the model's recognition accuracy on fine-grained multiple-choice questions, while the right panel showcases its accuracy on the FROW benchmark. Detailed scores are provided in Table 1. Dashed lines represent the recognition accuracy of LVLMs optimized using the proposed strategy, highlighting performance improvements for both InternVL and LLaVA. The images used in the evaluation are drawn from six fine-grained datasets: FGVC-Aircraft [17], Caltech-UCSD Birds-200-2011 [18], Food-101 [19], Stanford Dogs [20], Oxford Flowers-102 [21], and VegFru [22].

For instance, as illustrated in
Figure 1, GPT-4o achieved near-perfect accuracy in multiple-choice fine-grained recognition tasks, highlighting the limitations of existing evaluations. To address this gap, we develop a more challenging Fine-grained Recognition Open-World benchmark (FROW) based on images and Wikipedia content. To thoroughly assess LVLMs and their fine-grained recognition capabilities, we employ expert models to evaluate various models on this benchmark. Composed entirely of open-ended questions, FROW requires models to identify objects in images accurately before providing correct answers. The benchmark assesses responses using two primary metrics: recognition accuracy and content accuracy. Recognition accuracy evaluates whether a model correctly identifies objects in images, whereas content accuracy gauges whether the model's responses are factually correct. As shown in Figure 1, the results demonstrate that both proprietary and open-source LVLMs exhibit significant deficiencies in fine-grained recognition and domain-specific knowledge. These findings highlight the critical need for improving LVLMs' performance in such tasks to enable meaningful reasoning.
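The two metrics just described reduce to simple per-sample aggregation once an expert model (GPT-4o in the paper) has judged each response. A minimal sketch, assuming pre-computed boolean judgments and illustrative field names not taken from the paper:

```python
# Illustrative FROW-style scoring: recognition accuracy (did the model
# name the right fine-grained category?) and content accuracy (were its
# factual claims judged correct?). The judging itself would be done by
# an expert model such as GPT-4o; here the judgments are given as input.
def frow_scores(records):
    """Return (recognition_accuracy, content_accuracy) over judged records."""
    n = len(records)
    recognition = sum(r["recognized"] for r in records) / n
    content = sum(r["content_correct"] for r in records) / n
    return recognition, content

recognition, content = frow_scores([
    {"recognized": True,  "content_correct": True},
    {"recognized": True,  "content_correct": False},
    {"recognized": False, "content_correct": False},
    {"recognized": True,  "content_correct": True},
])
print(recognition, content)  # 0.75 0.5
```

Note that a response can be recognized correctly yet still fail the content check, which is why the paper reports the two metrics separately.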

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
