V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Reading time: 5 minutes

📝 Original Info

  • Title: V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
  • ArXiv ID: 2512.11995
  • Date: 2025-12-12
  • Authors: Chenrui Fan, Yijun Liang, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou

📝 Abstract

While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, "Visual Reasoning with multi-step EXploration (V-REX)", which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capability into (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-source VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.
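
To make the Planning/Following split concrete, below is a minimal Python sketch of how a Chain-of-Questions item and a per-step Following score could be represented. The field names, data layout, and exact-match scoring rule are illustrative assumptions, not the official V-REX data format or evaluation code.

```python
# Hypothetical sketch of a Chain-of-Questions (CoQ) benchmark item and a
# per-step "Following" score. Field names and the exact-match scoring rule
# are illustrative assumptions, not the official V-REX format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CoQStep:
    question: str                 # ground-truth sub-question for this step
    answer: str                   # ground-truth answer to the sub-question
    answer_options: List[str]     # finite answer options shown to the model
    distractor_questions: List[str] = field(default_factory=list)  # candidate pool for the Planning task

@dataclass
class CoQItem:
    image_path: str
    original_question: str        # the open-ended task
    chain: List[CoQStep]          # curated ground-truth QA chain
    final_answer: str

def following_score(item: CoQItem, predictions: List[str]) -> float:
    """Fraction of sub-questions answered correctly when the model is
    walked through the curated chain step by step (the Following task)."""
    correct = sum(
        pred.strip().lower() == step.answer.strip().lower()
        for pred, step in zip(predictions, item.chain)
    )
    return correct / len(item.chain)
```

Under this sketch, the finite option set per step is what lets intermediate steps be scored quantitatively rather than judged free-form, which is the property the abstract highlights.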

💡 Deep Analysis

📄 Full Content

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Chenrui Fan*, Yijun Liang*, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou
University of Maryland, College Park
{cfan42,yliang17,minglii}@umd.edu
Project: https://github.com/tianyi-lab/VREX
*These authors contributed equally to this work.

Abstract

While many vision-language models (VLMs) are developed to passively answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of active exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, "Visual Reasoning with multi-step EXploration (V-REX)", which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capability into (1) Planning: breaking down an open-ended task by dynamically selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-source VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.

Figure 1. Overview of Chain-of-Questions (CoQ). The left represents the manually formulated ground-truth QA chain. The middle represents the Planning task, evaluating the model's capability in selecting sub-questions that are helpful to answer the original question. The right represents the Following task, evaluating the model's capability in answering each sub-question.

1. Introduction

Various practical applications of vision-language models (VLMs) need to perform sophisticated multi-step visual reasoning [2, 6, 19, 20, 22, 43, 45, 48] to solve user queries. Recent studies reveal the weakness of existing VLMs on exploratory reasoning tasks, showing that they often rely on brute-force search in the input image to locate the potential objects of interest, and rarely adjust their plans to be adaptive to collected clues [4, 18, 32]. This weakness substantially limits the application of VLMs in challenging open environments where the goals cannot be fully specified at the very beginning but need progressive planning on the fly: for example, guessing the location based on a street-view image [22], detecting cheating from posted images [18], or simply complicated tasks [34]. Tasks like these require multiple rounds of active exploration, sub-goal proposal, and answering the sub-questions to collect sufficient contextual clues and identify the final targets, while poor exploration may significantly undermine or distract such reasoning processes. However, recent visual reasoning benchmarks mainly focus on math problems with visual contexts [23, 44, 46] or puzzle games with toy environments [24, 33, 37], which …

Figure 2. An example from V-REX with corresponding planning and following tasks. The original question is "Who is mainly responsible for the accident?"; the ground-truth chain asks what the black car is doing (backing) and what the silver car is doing (driving forward), alongside distractor sub-questions, and leads to the final answer that the silver car is mainly responsible. In the planning task, the model is given the original question and asked to select a sub-question in each step that is necessary and helpful for solving the original question. In the following task, the model is asked to answer each sub-question …
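
As a concrete illustration of one planning evaluation step, here is a minimal Python sketch scored against the Figure 2 example. The prompt wording, option shuffling, and the ask_model callback are assumptions for illustration, not the exact V-REX protocol or prompt format.

```python
# Hypothetical sketch of one Planning-task step on the street-accident example
# from Figure 2. The prompt text, shuffling, and ask_model callback are
# illustrative assumptions, not the exact V-REX evaluation protocol.
import random

def planning_step(original_question: str,
                  gt_question: str,
                  distractor_questions: list[str],
                  ask_model) -> bool:
    """Show the model a finite pool of candidate sub-questions and check
    whether it selects the ground-truth sub-question for this step."""
    candidates = [gt_question] + distractor_questions
    random.shuffle(candidates)
    prompt = (
        f"Original question: {original_question}\n"
        "Which sub-question is necessary and helpful to answer next?\n"
        + "\n".join(f"{i + 1}. {q}" for i, q in enumerate(candidates))
    )
    choice = int(ask_model(prompt))            # model returns a 1-based option index
    return candidates[choice - 1] == gt_question

# Step 1 of the Figure 2 example (my_vlm is a hypothetical VLM wrapper that
# also receives the accident image):
# planning_step(
#     "Who is mainly responsible for the accident?",
#     "What is the black car doing?",
#     ["How many cars are visible?", "Is the ground wet or dry?", "What color is the sign?"],
#     ask_model=my_vlm,
# )
```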

Reference

This content is AI-processed based on open access ArXiv data.
