Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning
Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated to discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of “visual analysis, candidate sub-categories, comparison, and prediction”, transitioning the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise for knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous. Code is available at https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026.
💡 Research Summary
Fine‑R1 tackles the long‑standing challenge of fine‑grained visual recognition (FGVR) for multimodal large language models (MLLMs). While MLLMs excel at coarse‑grained vision‑language tasks, they lag behind contrastive CLIP models on FGVR due to high intra‑class variance, low inter‑class variance, and the scarcity of annotated fine‑grained data. Fine‑R1 introduces a two‑stage training framework that dramatically improves FGVR performance with only four labeled examples per category.
The first stage, Chain‑of‑Thought Supervised Fine‑tuning (CoT‑SFT), builds a high‑quality FGVR CoT dataset. For each sub‑category a single image is selected, and a strong vision‑language model (Qwen2.5‑VL‑32B) generates multiple captions that capture diverse visual attributes. An information‑bottleneck filter retains the most discriminative concepts, which are then concatenated with a structured prompt that forces the model to follow four reasoning steps: visual analysis, candidate sub‑category generation, detailed comparison, and final prediction. The resulting 404 examples are rigorously filtered (ensuring correct language, matching predictions, and consistency with ground‑truth) to provide reliable supervision. Fine‑tuning on this dataset teaches the MLLM to articulate human‑like reasoning chains, integrating domain knowledge before making a decision.
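The CoT‑SFT data pipeline above can be sketched as follows. This is a hypothetical illustration, not the paper's released code: the prompt template wording, the function names, and the simple prediction‑matching filter are assumptions standing in for the full correctness and consistency checks.

```python
# Hypothetical sketch of the CoT-SFT data construction described above.
# The template text and filtering rule are illustrative assumptions.

REASONING_TEMPLATE = (
    "Follow four steps:\n"
    "1. Visual analysis: describe the object's discriminative attributes.\n"
    "2. Candidate sub-categories: list plausible fine-grained labels.\n"
    "3. Comparison: contrast each candidate against the observed attributes.\n"
    "4. Prediction: output the single most likely sub-category."
)

def build_cot_prompt(concepts):
    """Concatenate the filtered discriminative concepts with the
    structured four-step reasoning prompt."""
    concept_block = "Discriminative concepts: " + "; ".join(concepts)
    return concept_block + "\n" + REASONING_TEMPLATE

def filter_examples(examples, ground_truth):
    """Keep only generated examples whose final prediction matches the
    ground-truth label (a simplified stand-in for the paper's filters)."""
    return [
        ex for ex in examples
        if ex["prediction"].strip().lower() == ground_truth.strip().lower()
    ]

# Toy usage: two model-generated rationales, only one predicts correctly.
examples = [
    {"rationale": "...", "prediction": "Indigo Bunting"},
    {"rationale": "...", "prediction": "Blue Grosbeak"},
]
kept = filter_examples(examples, "Indigo Bunting")
print(len(kept))  # 1
```

The key design point is that the filter discards any trajectory whose conclusion disagrees with the label, so the 404 retained examples supervise both the reasoning format and a correct final answer.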
The second stage, Triplet Augmented Policy Optimization (T‑APO), refines the model via reinforcement learning with contrastive signals. For each anchor image x, a positive image x_pos from the same sub‑category and a negative image x_neg from the most visually similar but different sub‑category are sampled, forming a triplet (x, x_pos, x_neg). Intra‑class augmentation mixes rollouts generated from both the anchor and the positive image, aggregating their rewards while updating the policy only on the anchor, thereby exposing the model to diverse intra‑class variations without diluting the decision focus. Inter‑class augmentation uses the negative image to push the policy to increase the response gap between similar but distinct categories. T‑APO builds on Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), inheriting Clip‑Higher, dynamic sampling, and token‑level policy‑gradient losses, but augments them with the triplet structure to explicitly address FGVR’s unique variance profile.
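A minimal sketch of the triplet sampling and intra‑class reward mixing might look like the following. This is not the paper's implementation: the scalar similarity function, the list‑based reward representation, and the group‑mean baseline (DAPO/GRPO‑style) are simplifying assumptions; in practice rewards come from full rollouts of the MLLM.

```python
# Illustrative sketch of T-APO's triplet construction and intra-class
# reward mixing; all names and the baseline formula are assumptions.
import random

def sample_triplet(anchor, pool, similarity):
    """Return a positive (same sub-category) and the hardest negative
    (most similar image from a different sub-category) for the anchor."""
    positives = [x for x in pool if x["label"] == anchor["label"] and x is not anchor]
    negatives = [x for x in pool if x["label"] != anchor["label"]]
    pos = random.choice(positives)
    neg = max(negatives, key=lambda x: similarity(anchor, x))
    return pos, neg

def intra_class_advantages(anchor_rewards, positive_rewards):
    """Pool rewards from anchor and positive rollouts into one baseline,
    but return advantages only for the anchor rollouts, so the policy
    gradient is applied on the anchor while still seeing intra-class
    reward variation from the positive image."""
    mixed = anchor_rewards + positive_rewards
    baseline = sum(mixed) / len(mixed)
    return [r - baseline for r in anchor_rewards]

# Toy usage: reward 1.0 = correct prediction, 0.0 = wrong.
adv = intra_class_advantages([1.0, 0.0], [1.0, 1.0])
print(adv)  # [0.25, -0.75]
```

Inter‑class augmentation would then add a term rewarding divergence between responses conditioned on the anchor and on `neg`; since the paper does not spell out that objective here, it is left out of the sketch. Note that because the positive's rewards enter only the baseline, a positive image that is consistently answered correctly raises the bar for the anchor's rollouts without ever receiving gradient updates itself.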
Extensive experiments on six FGVR benchmarks (birds, plants, cars, aircraft, etc.) under a 4‑shot base‑to‑new setting demonstrate that Fine‑R1‑3B outperforms strong baselines. In closed‑world evaluation it exceeds Qwen2.5‑VL‑7B by +8.51% absolute accuracy and DeepPerception‑7B by +5.59%; in open‑world evaluation the gains rise to +23.75% and +30.98% respectively. For unseen sub‑categories, Fine‑R1 improves over standard supervised fine‑tuning (+15.59%), CLS‑RL (+10.28%), and No‑Thinking‑RL (+10.05%). Moreover, the model delivers more accurate answers on non‑classification tasks that require object recognition (e.g., ImageWikiQA) while maintaining or surpassing performance on general VQA.
Ablation analyses reveal that the visual embeddings change little after training; the primary benefit lies in better deployment of existing knowledge through structured reasoning and contrastive policy signals. The intra‑class augmentation enhances robustness to diverse appearances within a class, while inter‑class augmentation sharpens discriminability among visually similar classes.
In summary, Fine‑R1 shows that a carefully crafted CoT fine‑tuning phase combined with triplet‑based reinforcement learning can turn a generic MLLM into a state‑of‑the‑art FGVR system with minimal annotation effort. This approach opens the door to practical fine‑grained recognition in domains where expert labeling is expensive or infeasible, and suggests future work on scaling the CoT dataset, extending to other modalities, and integrating human‑in‑the‑loop feedback for further efficiency.