MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models

Reading time: 5 minutes

📝 Original Info

  • Title: MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
  • ArXiv ID: 2512.08228
  • Date: 2025-12-09
  • Authors: Jusheng Zhang¹, Kaitong Cai¹, Xiaoyang Guo¹, Sidi Liu¹, Qinhan Lv¹, Ruiqi Chen¹, Jing Yang¹, Yijia Fan¹, Xiaofei Sun², Jian Wang³, Ziliang Chen¹, Liang Lin¹, Keze Wang¹ (¹Sun Yat-sen University, ²Alibaba Group, ³Snap Inc.)

📝 Abstract

The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating freeform explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
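To make the selection task concrete, below is a minimal sketch of how one MM-CoT item could be represented. The `Violation` labels, field names, and `MMCoTItem` type are illustrative assumptions on our part, not the authors' released schema:

```python
from dataclasses import dataclass
from enum import Enum


class Violation(Enum):
    # Hypothetical labels for which orthogonal constraint a candidate breaks.
    NONE = "valid"                  # the single correct chain
    VISUAL = "visual_consistency"   # a step is not anchored in observable evidence
    LOGICAL = "logical_coherence"   # causal/commonsense transitions are invalid


@dataclass
class EventChain:
    # Triadic A -> B -> C structure described in the paper.
    condition: str        # A: initiating condition
    mediator: str         # B: visually grounded mediating step
    outcome: str          # C: logically entailed outcome
    violation: Violation  # which constraint this candidate breaks, if any


@dataclass
class MMCoTItem:
    # One verification instance: exactly one candidate has Violation.NONE.
    media_path: str               # path to the image or video
    candidates: list[EventChain]  # 1 valid chain + K distractors
```

Because every distractor breaks exactly one constraint, the label on a model's chosen candidate directly classifies each error as a visual-grounding failure or a logical-coherence failure.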

💡 Deep Analysis

Figure 1: given an image or video, the model must select the unique valid event chain from adversarially constructed candidates; each distractor violates either visual consistency or logical coherence.

📄 Full Content

1. Introduction

Large-scale Multimodal Models (MMs), especially Vision–Language Models (VLMs) [1–3, 16, 25, 30, 42, 50, 57] equipped with Chain-of-Thought (CoT) prompting [7, 22, 45, 48, 49], now generate remarkably detailed multi-step explanations for visual tasks. Yet this apparent sophistication often masks a critical weakness: models can replicate familiar reasoning templates without genuinely understanding the causal or visual structure underlying the scene [30, 47, 51]. This gap between fluent narration and true inferential depth raises a central question: are multimodal CoT explanations truly grounded in visual evidence, and do they follow coherent, causally valid progressions?

Despite rapid progress, existing multimodal benchmarks [4, 6, 11, 29, 31, 53, 55] overwhelmingly emphasize generation. They reward models for producing plausible answers or fluent rationales, but largely overlook verification, i.e., the ability to assess whether a reasoning chain is visually faithful and logically sound. This omission becomes evident when models produce narratives that reference nonexistent objects, misinterpret key visual cues, or violate causal order [24, 33, 44]. Such failures indicate that current multimodal CoT behaviors remain predominantly pattern-driven rather than evidence-driven, implying that correctness-based evaluations provide an incomplete, and at times misleading, picture of visual reasoning ability.

To address this limitation, we introduce MM-CoT, a diagnostic benchmark that reframes multimodal CoT reasoning as a discriminative verification task rather than open-ended generation. As illustrated in Fig. 1, the model is given an image or video and must select the unique valid event chain from a set of carefully constructed candidates. Each chain follows a triadic structure A→B→C: an initiating condition (A), a visually grounded mediating step (B), and a logically entailed outcome (C). Distractors are adversarially designed to violate exactly one of two orthogonal constraints: (i) visual consistency, i.e., each step must be anchored in observable evidence, and (ii) logical coherence, i.e., causal and temporal transitions must be physically and commonsensically valid. Distractors are intentionally written to be linguistically plausible, preventing models from exploiting textual shortcuts and forcing genuine visual-and-causal verification [20, 28, 39, 47, 57].

MM-CoT consists of 5,616 image-based and 2,100 video-based reasoning instances. Each item includes a single valid chain and K distractors (K=3 for images, K=4 for videos), enabling controlled evaluation of both perceptual grounding and multi-step causal reasoning across increasing difficulty tiers. This design separates visual plausibility…

[Figure 1 examples (Image/Video panels): "The nozzle of the fire hose begins to slightly loosen and shake due to high water pressure"; "Firefighters may cut water output temporarily or pause flow to 'stabilize the hose' and reduce vibration, then continue spraying"; "The dog stops in time and avoids falling into the water."]
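With one valid chain among K+1 candidates, chance accuracy is 1/(K+1): 25% on image items (K=3) and 20% on video items (K=4). Building on the hypothetical schema sketched above, a minimal scoring loop might look like the following; `model_choose` is a stand-in for an actual VLM selection call, not a real API:

```python
import random


def model_choose(media_path: str, candidates: list[EventChain]) -> int:
    # Placeholder for a real VLM call; here, a chance-level random baseline.
    return random.randrange(len(candidates))


def evaluate(items: list[MMCoTItem]) -> float:
    # Accuracy over single-correct multiple-choice verification items.
    correct = 0
    for item in items:
        # Index of the unique candidate that violates neither constraint.
        gold = next(i for i, c in enumerate(item.candidates)
                    if c.violation is Violation.NONE)
        if model_choose(item.media_path, item.candidates) == gold:
            correct += 1
    return correct / len(items)
```

Swapping the random baseline for a real model call leaves the rest of the harness unchanged, and tallying `item.candidates[pred].violation` on wrong predictions yields the visual-versus-logical failure breakdown.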

Reference

This content is AI-processed based on open access ArXiv data.
