📝 Original Info
- Title: V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
- ArXiv ID: 2512.11995
- Date: 2025-12-12
- Authors: Chenrui Fan, Yijun Liang, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou
📝 Abstract
While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification, as an AI detective would, but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, ``Visual Reasoning with multi-step EXploration (V-REX)'', which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capabilities into (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect information for deriving the final answer. By curating a finite set of question and answer options per step, V-REX achieves reliable, quantitative, and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-source VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.
📄 Full Content
V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
Chenrui Fan*, Yijun Liang*, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou
University of Maryland, College Park
{cfan42,yliang17,minglii}@umd.edu
Project: https://github.com/tianyi-lab/VREX
Abstract
While many vision-language models (VLMs) are developed to passively answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of active exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification, as an AI detective would, but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, "Visual Reasoning with multi-step EXploration (V-REX)", which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capabilities into (1) Planning: breaking down an open-ended task by dynamically selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect information for deriving the final answer. By curating a finite set of question and answer options per step, V-REX achieves reliable, quantitative, and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-source VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.
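To make the Chain-of-Questions setup concrete, the sketch below shows one plausible way to represent a V-REX item with per-step finite options, as described above. The schema and all field names are our own illustration; the paper's actual data format is not shown in this excerpt.

```python
from dataclasses import dataclass

@dataclass
class CoQItem:
    """One V-REX benchmark item cast as a Chain-of-Questions (hypothetical schema)."""
    original_question: str                 # the open-ended task to be solved
    gt_questions: list[str]                # ground-truth exploratory sub-questions, in order
    gt_answers: list[str]                  # ground-truth answer to each sub-question
    distractor_questions: list[list[str]]  # per step: unhelpful candidate sub-questions
    distractor_answers: list[list[str]]    # per step: wrong answer options for that step
    final_answer: str                      # ground-truth answer to the original question
```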
1. Introduction
Many practical applications of vision-language models (VLMs) require sophisticated multi-step visual reasoning [2, 6, 19, 20, 22, 43, 45, 48] to solve user queries. Recent studies reveal the weakness of existing VLMs on exploratory reasoning tasks, showing that they often rely on brute-force search over the input image to locate the potential objects of interest, and rarely adapt their plans to the collected clues [4, 18, 32].
*These authors contributed equally to this work.
Figure 1. Overview of Chain-of-Questions (CoQ). The left represents the manually formulated ground-truth QA chain. The middle represents the Planning task, evaluating the model's capability to select sub-questions that are helpful for answering the original question. The right represents the Following task, evaluating the model's capability to answer each sub-question.
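Based on this caption, the Planning task can be read as a per-step multiple-choice selection over the ground-truth sub-question and its distractors. The following sketch assumes the hypothetical `CoQItem` schema above and a `model.choose(...)` interface of our own invention; it illustrates the protocol rather than reproducing the authors' code.

```python
import random

def evaluate_planning(model, item: CoQItem) -> float:
    """Fraction of steps where the model selects the helpful (ground-truth) sub-question."""
    correct, history = 0, []
    for step, gt_q in enumerate(item.gt_questions):
        options = [gt_q, *item.distractor_questions[step]]
        random.shuffle(options)  # avoid positional bias in the multiple-choice options
        # The model sees the original question and the QA history so far, then picks one option.
        choice = model.choose(item.original_question, history, options)
        correct += int(choice == gt_q)
        # Regardless of the choice, continue along the ground-truth chain for the next step.
        history.append((gt_q, item.gt_answers[step]))
    return correct / len(item.gt_questions)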
This weakness substantially limits the application of VLMs in challenging open environments where the goals cannot be fully specified at the outset but require progressive planning on the fly: for example, guessing the location from a street-view image [22], detecting cheating from posted images [18], or other similarly complicated tasks [34]. Tasks like these require multiple rounds of active exploration, sub-goal proposal, and answering of sub-questions to collect sufficient contextual clues and identify the final targets, and poor exploration may significantly undermine or derail such reasoning processes.
However, recent visual reasoning benchmarks mainly focus on math problems with visual contexts [23, 44, 46] or puzzle games in toy environments [24, 33, 37], which feature well-defined questions with highly specified targets rather than open-ended exploration.
[Figure 2 graphic: the original question is "Who is mainly responsible for the accident?". The ground-truth CoQ proceeds through sub-questions such as "What is the black car doing?" (Backing) and "What is the silver car doing?" (Driving forward), ending with the final question "Based on the previous results, who is mainly responsible for the accident?" and its ground-truth answer "The silver car" (with "The black car" as a distractor). Each step also carries distractor questions (e.g., "How many cars are visible?", "Is the ground wet or dry?", "What color is the sign?") together with the correct answers to those distractor questions (e.g., 6, Wet, Red). In the planning evaluation step, the model chooses among candidate sub-questions; in the following evaluation step, it answers each curated sub-question from options such as Backing, Driving forward, Parking, and Turning left.]
Figure 2. An example from V-REX with corresponding planning and following tasks. In the planning task, the model is given the original question and asked to select, at each step, a sub-question that is necessary and helpful for solving the original question. In the following task, the model is asked to answer each curated sub-question sequentially, collecting the information needed to derive the final answer.
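Analogously, the Following task asks the model to answer each curated sub-question in sequence from a finite set of options. A minimal sketch, again assuming the hypothetical `CoQItem` schema and an invented `model.answer(...)` interface:

```python
import random

def evaluate_following(model, item: CoQItem) -> float:
    """Fraction of curated sub-questions the model answers correctly (sketch)."""
    correct, history = 0, []
    for step, (q, gt_a) in enumerate(zip(item.gt_questions, item.gt_answers)):
        options = [gt_a, *item.distractor_answers[step]]
        random.shuffle(options)  # present the finite answer options in random order
        answer = model.answer(q, history, options)  # answer the curated sub-question
        correct += int(answer == gt_a)
        history.append((q, gt_a))  # accumulated QA context for deriving the final answer
    return correct / len(item.gt_questions)
```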