No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers

Reading time: 2 minutes

📝 Original Info

  • Title: No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
  • ArXiv ID: 2512.08889
  • Date: 2025-12-09
  • Authors: Damiano Marsili, Georgia Gkioxari

📝 Abstract

Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches, which use pretrained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design com...

📄 Full Content

Visual reasoning is a key skill for artificial intelligence: to understand and act in the world, systems must not only identify objects in images but also reason about their spatial relationships and attributes. For example, answering the query in Fig. 1 requires grounding objects (fireplace, coffee table, sofa), inferring 3D size from 2D cues, and combining attributes to produce the final answer.
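To make the "3D size from 2D cues" step concrete, here is a minimal Python sketch of how a size comparison might be decomposed once objects are grounded. It is an illustration only, not the paper's method: the `Detection` type, the `metric_width` and `widest` helpers, the box coordinates, the depths, and the focal length are all hypothetical. The only fact it relies on is the standard pinhole-camera relation, under which real width is roughly pixel width times depth divided by focal length.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """A hypothetical grounded object: a 2D box plus an estimated depth."""
    label: str
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    depth_m: float  # estimated distance from the camera, in meters


def metric_width(det: Detection, focal_px: float) -> float:
    """Approximate real-world width with the pinhole model: pixels * depth / focal."""
    x_min, _, x_max, _ = det.box
    return (x_max - x_min) * det.depth_m / focal_px


def widest(detections: list[Detection], focal_px: float) -> str:
    """Return the label of the object with the largest estimated metric width."""
    return max(detections, key=lambda d: metric_width(d, focal_px)).label


# Made-up detections for illustration only; none of these values come from the paper.
scene = [
    Detection("fireplace", (100.0, 200.0, 300.0, 400.0), depth_m=4.0),
    Detection("coffee table", (350.0, 420.0, 650.0, 520.0), depth_m=2.0),
    Detection("sofa", (50.0, 380.0, 450.0, 560.0), depth_m=3.0),
]
FOCAL_PX = 1000.0  # assumed focal length in pixels

if __name__ == "__main__":
    for det in scene:
        print(det.label, round(metric_width(det, FOCAL_PX), 2))
    print("widest:", widest(scene, FOCAL_PX))
```

In this toy scene the coffee table spans more pixels than the fireplace but is closer to the camera, so its estimated metric width is smaller. That is why pixel extent alone is not enough, and why an error in either the box or the depth estimate can flip the final answer.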

Visual reasoning methods fall into two categories. The first integrates grounding with language reasoning, where vision-language models (VLMs) generate chain-of-thought explanations in text (Fan et al., 2025; Sarch et al., 2025; OpenAI,

…(Content truncated for length.)

Reference

This content is AI-processed based on open access ArXiv data.
