Diagnosing VLM Errors through Transparent Reasoning and Consistency Evaluation

Reading time: 5 minutes

📝 Abstract

Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models (VLMs). Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sub-question-answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.

📄 Content

Evaluating large vision-language models (VLMs) has predominantly focused on final-answer correctness. However, this metric is often insufficient and misleading. When a model produces an incorrect answer, it remains unclear where in the reasoning process the failure occurred or how errors propagated through the computational graph. Conversely, a correct final answer does not guarantee a coherent reasoning process; models can reach correct outcomes through flawed or inconsistent intermediate steps, masking conceptual misunderstandings or accidental self-corrections. These limitations raise critical questions: How can we pinpoint failures in multi-step, multimodal reasoning? What patterns produce “silent errors,” where incorrect intermediate steps still yield correct answers? And to what extent does final-answer accuracy truly reflect reasoning ability versus chance or memorization?

To address these challenges, we introduce TRACE (Transparent Reasoning and Consistency Evaluation), a framework that enhances diagnostic evaluation of VLM reasoning. TRACE systematically decomposes complex multimodal tasks into Auxiliary Reasoning Sets (ARS): interpretable sub-questions with structured dependencies. By assessing consistency across ARS, TRACE precisely localizes reasoning failures and exposes error propagation. Unlike final-answer-centric evaluation, this decomposition reveals how consistently a model reasons across sub-questions and which steps exhibit the most variation across reasoning paths, highlighting potentially unstable or ambiguous reasoning steps. In addition, TRACE maps ARS dependencies into a reasoning graph to identify the First Failure Step (FFS). As illustrated in Figure 1, this allows tracing inconsistencies across intermediate reasoning steps, revealing which sub-questions or dependencies exhibit the most variation, even when the final model answer does not fully reflect these differences.
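An ARS with dependency metadata can be sketched as a small record type. The field names and the example sub-questions below are illustrative, not the paper's concrete schema:

```python
from dataclasses import dataclass

@dataclass
class SubQuestion:
    """One entry of an Auxiliary Reasoning Set (ARS).

    Field names are illustrative; the paper's schema may differ.
    """
    qid: str            # identifier, e.g. "q1"
    question: str       # the sub-question text
    answer: str         # reference (gold) answer
    deps: tuple = ()    # qids of predecessor sub-questions

# A toy ARS for a triangle problem, with made-up content: each step
# depends only on raw inputs or earlier sub-questions.
ars = [
    SubQuestion("q1", "What are the coordinates of A, B, C?",
                "A=(0,0), B=(4,1), C=(1,3)"),
    SubQuestion("q2", "What is the slope of AB?", "1/4", deps=("q1",)),
    SubQuestion("q3", "What is the slope of AC?", "3", deps=("q1",)),
    SubQuestion("q4", "What is tan A?", "11/7", deps=("q2", "q3")),
]
```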

  1. We propose TRACE, a diagnostic framework for transparent multimodal reasoning, which decomposes complex tasks into Auxiliary Reasoning Sets (ARS) with a dependency-aware evaluation protocol for tracking error propagation.

  2. We introduce novel consistency metrics, including path mean consistency, global mean consistency, and the diagnostic First Failure Step (FFS), which reliably localize reasoning failures overlooked by traditional final-answer metrics.

  3. We construct a benchmark of 3.7k ARS question-answer pairs across 630 reasoning paths, enabling evaluation at both intermediate and final-answer levels.

  4. We present experiments demonstrating that TRACE uncovers reasoning errors missed by standard final-answer evaluation, enhancing interpretability, robustness, and evaluation quality.
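The consistency metrics named in contribution 2 can be sketched as follows. The paper's exact definitions are not reproduced here, so treat these formulas (mean per-step correctness along a path, mean over paths, and the earliest sub-threshold step) as plausible stand-ins rather than the official metrics:

```python
def path_consistency(path_scores):
    """Mean consistency along one reasoning path.

    path_scores: per-step scores in [0, 1], e.g. 1.0 if the model's
    answer to that sub-question matched the reference answer.
    """
    return sum(path_scores) / len(path_scores)

def global_consistency(all_paths):
    """Mean of path-level consistencies over all sampled paths."""
    return sum(path_consistency(p) for p in all_paths) / len(all_paths)

def first_failure_step(path_scores, threshold=1.0):
    """1-based index of the earliest step whose score falls below
    threshold, or None if the whole path is consistent."""
    for i, score in enumerate(path_scores, start=1):
        if score < threshold:
            return i
    return None

paths = [[1.0, 1.0, 0.0, 1.0],   # silent error: step 3 wrong, final right
         [1.0, 1.0, 1.0, 1.0]]
print(path_consistency(paths[0]))    # 0.75
print(global_consistency(paths))     # 0.875
print(first_failure_step(paths[0]))  # 3
```

Note how the first path illustrates a silent error: the final step scores 1.0 even though step 3 failed, which final-answer evaluation alone would miss.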

TRACE decomposes each problem into an Auxiliary Reasoning Set (ARS) consisting of sub-question-answer pairs that explicitly capture intermediate reasoning steps. An ARS is defined such that each sub-question q_i, paired with an answer a_i, satisfies three properties:

• Completeness: The sub-questions collectively provide all information needed to solve the problem, with no redundancy. The ARS also extracts any required information from the image, so that the main question can be answered without directly referencing the figure.

• Independence: Each sub-question depends only on raw inputs or explicitly specified predecessors.

• Soundness: Sub-questions are answerable, non-overlapping, and do not leak the final answer.
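Of these properties, independence is the most mechanically checkable (completeness and soundness require semantic judgment). A minimal sketch of such a check, with helper names of my own:

```python
def check_independence(ars):
    """Verify each sub-question depends only on raw inputs ('text',
    'image') or on sub-questions appearing earlier in the list."""
    seen = {"text", "image"}
    for sq in ars:
        unresolved = [d for d in sq["deps"] if d not in seen]
        if unresolved:
            return False, f"{sq['id']} has unresolved deps: {unresolved}"
        seen.add(sq["id"])
    return True, "ok"

# Toy ARS whose dependency lists all resolve to earlier entries.
ars = [
    {"id": "q1", "deps": ["image"]},
    {"id": "q2", "deps": ["q1"]},
    {"id": "q3", "deps": ["q1", "q2"]},
]
ok, msg = check_independence(ars)
print(ok)  # True
```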

The model then uses this structured set to produce its final answer, making the reasoning process transparent and enabling fine-grained evaluation of reasoning behavior.

Example. Consider a geometry problem illustrated in Figure 1, where the vertices of triangle ABC lie on a square grid and the task is to compute tan A. The ARS decomposes the problem into sub-questions such as identifying the coordinates of points A, B, and C, and computing the slopes of lines AB and AC. This decomposition allows TRACE to analyze the model’s reasoning process, track how intermediate answers propagate, and provide a structured view of the entire solution pathway.
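Since Figure 1 is not reproduced here, the coordinates below are made up for illustration; the arithmetic nonetheless mirrors the sub-question chain (coordinates, then slopes, then tan A) using the standard two-slope angle formula:

```python
import math

# Hypothetical grid coordinates (Figure 1 is not shown in this text).
A, B, C = (0, 0), (4, 1), (1, 3)

def slope(p, q):
    """Slope of the line through points p and q."""
    return (q[1] - p[1]) / (q[0] - p[0])

m_ab = slope(A, B)   # 0.25
m_ac = slope(A, C)   # 3.0

# Angle at vertex A between rays AB and AC:
# tan A = |(m2 - m1) / (1 + m1 * m2)|
tan_a = abs((m_ac - m_ab) / (1 + m_ab * m_ac))
print(tan_a)  # 11/7 ≈ 1.5714
```

Each line above corresponds to one ARS sub-question, so a wrong intermediate value (say, a misread coordinate) is localized to a specific step rather than surfacing only as a wrong final answer.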

Construction. ARS are generated using two complementary strategies:

• Exploration: Given the original question and a specialized prompt, the model generates diverse sub-questions along with their dependencies.

• Exploitation: Sub-questions are generated from candidate reasoning chains in two steps:

Step 1: Given the question and image, the model produces a step-by-step reasoning answer.

Step 2: Using the original question, the generated reasoning answer, and a prompt, the model generates sub-questions and their dependencies, ensuring the ARS captures the critical steps necessary to solve the problem.
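The two-step exploitation strategy can be sketched as below. The paper does not specify a model API or prompt wording, so the `model` callable and both prompts are placeholders of my own:

```python
def exploitation_ars(question, image, model):
    """Two-step 'exploitation' ARS construction.

    `model` is any callable (prompt, image) -> text; a real VLM call
    would go here. Prompt wording is illustrative, not the paper's.
    """
    # Step 1: elicit a step-by-step reasoning answer.
    reasoning = model(f"Solve step by step:\n{question}", image)
    # Step 2: convert the reasoning chain into sub-questions with
    # dependencies, covering every critical step.
    return model(
        "Given the question and this reasoning, list sub-questions "
        "and their dependencies, covering every critical step.\n"
        f"Question: {question}\nReasoning: {reasoning}", image)

# Usage with a dummy model that echoes the first line of its prompt:
dummy = lambda prompt, image: prompt.splitlines()[0]
out = exploitation_ars("Compute tan A.", None, dummy)
```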

Reasoning Graph. Each sub-question is annotated with metadata specifying its dependencies on other questions, text, or image inputs. Together, these dependencies form a directed acyclic graph (DAG) that encodes the structure of the reasoning process and defines the order in which the model answers sub-questions. Intermediate answer
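A valid answering order over such dependency metadata is just a topological sort of the DAG; the sub-question ids below are illustrative. A minimal sketch using the standard library:

```python
from graphlib import TopologicalSorter

# Dependency metadata per sub-question (ids are illustrative);
# raw text/image inputs are omitted since they are not graph nodes.
deps = {
    "q1": set(),          # coordinates, read from the image
    "q2": {"q1"},         # slope of AB
    "q3": {"q1"},         # slope of AC
    "q4": {"q2", "q3"},   # tan A from the two slopes
}

# A valid answering order respects the DAG: predecessors come first.
order = list(TopologicalSorter(deps).static_order())
print(order)  # e.g. ['q1', 'q2', 'q3', 'q4']
```

`TopologicalSorter` also raises `CycleError` on cyclic dependencies, which doubles as an acyclicity check for the reasoning graph.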

This content is AI-processed based on ArXiv data.
