No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers

February 09, 2026

Reading time: 2 minutes


📝 Original Info

Title: No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
ArXiv ID: 2512.08889
Date: 2025-12-09
Authors: Damiano Marsili, Georgia Gkioxari

📝 Abstract

Visual reasoning is challenging, requiring both precise object grounding and an understanding of complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches, which use pretrained models and avoid training but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design com...

📄 Full Content

Visual reasoning is a key skill for artificial intelligence: to understand and act in the world, systems must not only identify objects in images but also reason about their spatial relationships and attributes. For example, answering the query in Fig. 1 requires grounding objects (fireplace, coffee table, sofa), inferring 3D size from 2D cues, and combining attributes to produce the final answer.
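To make the "3D size from 2D cues" step concrete, here is a minimal sketch, not the authors' method. It assumes a pinhole camera model, under which an object's metric width is roughly its pixel width times its depth divided by the focal length. The `Detection` class, the helper names, the bounding boxes, the depths, and the focal length are all hypothetical values chosen for illustration; the actual query and image in Fig. 1 are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """A hypothetical grounded object: label, 2D box, and estimated depth."""
    label: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    depth_m: float  # assumed distance from the camera, in meters


def approx_width_m(det: Detection, focal_px: float) -> float:
    """Pinhole approximation: metric width = pixel width * depth / focal length."""
    x_min, _, x_max, _ = det.box
    return (x_max - x_min) * det.depth_m / focal_px


def widest(dets: List[Detection], focal_px: float) -> str:
    """Return the label of the object with the largest estimated metric width."""
    return max(dets, key=lambda d: approx_width_m(d, focal_px)).label


# Made-up detections for the three objects named in the text.
detections = [
    Detection("fireplace", (100, 200, 300, 400), 4.0),
    Detection("coffee table", (350, 420, 650, 520), 2.0),
    Detection("sofa", (200, 300, 700, 500), 3.0),
]
FOCAL_PX = 500.0  # assumed focal length in pixels

for det in detections:
    print(det.label, round(approx_width_m(det, FOCAL_PX), 2))
print("widest:", widest(detections, FOCAL_PX))
```

In this toy setup the coffee table spans more pixels than the fireplace (300 versus 200) but is estimated to be narrower in meters because it sits closer to the camera. That gap between 2D appearance and 3D size is why grounding alone does not answer such queries.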

Visual reasoning methods fall into two categories. The first integrates grounding with language reasoning, where vision-language models (VLMs) generate chain-of-thought explanations in text (Fan et al., 2025; Sarch et al., 2025; OpenAI,

…(Content truncated for length.)
