MM-THEBench: Do Reasoning MLLMs Think Reasonably?


Recent advances in multimodal large language models (MLLMs) mark a shift from non-thinking models to post-trained reasoning models capable of solving complex problems through thinking. However, whether such thinking mitigates hallucinations in multimodal perception and reasoning remains unclear. Self-reflective reasoning enhances robustness but introduces additional hallucinations, and subtle perceptual errors still result in incorrect or coincidentally correct answers. Existing benchmarks primarily focus on models before the emergence of reasoning MLLMs, neglecting the internal thinking process and failing to measure the hallucinations that occur during thinking. To address these challenges, we introduce MM-THEBench, a comprehensive benchmark for assessing hallucinations of intermediate CoTs in reasoning MLLMs. MM-THEBench features a fine-grained taxonomy grounded in cognitive dimensions, diverse data with verified reasoning annotations, and a multi-level automated evaluation framework. Extensive experiments on mainstream reasoning MLLMs reveal insights into how thinking affects hallucination and reasoning capability in various multimodal tasks.


💡 Research Summary

The paper addresses a critical gap in the evaluation of multimodal large language models (MLLMs) that have recently been equipped with chain‑of‑thought (CoT) reasoning capabilities. While existing benchmarks focus almost exclusively on the correctness of the final answer, they ignore the intermediate reasoning steps where hallucinations—outputs that are inconsistent with the visual evidence, factual knowledge, or logical context—often arise. To fill this void, the authors introduce MM‑THEBench (Multimodal Thinking Hallucination Evaluation Benchmark), a comprehensive framework designed to assess hallucinations that occur within the CoT of reasoning MLLMs.

Benchmark Design and Taxonomy
MM‑THEBench is built around a two‑level hallucination taxonomy grounded in cognitive dimensions. The top‑level categories are Knowledge, Perception, and Reasoning, each further divided into fine‑grained sub‑categories (e.g., Knowledge‑World, Knowledge‑Commonsense, Perception‑Spatial, Reasoning‑Deductive, etc.). This hierarchical taxonomy enables precise attribution of errors to specific cognitive processes, something prior work has lacked.
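The two-level taxonomy can be pictured as a simple mapping from top-level cognitive dimensions to fine-grained sub-categories. The sketch below is illustrative only: it lists just the sub-categories named above, and the full benchmark defines more.

```python
# Illustrative sketch of MM-THEBench's two-level hallucination taxonomy.
# Only the sub-categories mentioned in this summary are listed; the
# benchmark's actual taxonomy is richer.
TAXONOMY = {
    "Knowledge": ["World", "Commonsense"],
    "Perception": ["Spatial"],
    "Reasoning": ["Deductive"],
}

def attribute(label: str) -> tuple[str, str]:
    """Split a 'Category-SubCategory' label into its two taxonomy levels."""
    category, sub = label.split("-", 1)
    if sub not in TAXONOMY.get(category, []):
        raise ValueError(f"Unknown taxonomy label: {label}")
    return category, sub
```

A lookup like `attribute("Perception-Spatial")` then attributes an error to both its cognitive dimension and its fine-grained sub-category.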

Data Construction
The benchmark re‑uses eight high‑quality multimodal datasets (MathVision, MM‑vet‑v2, MMMU‑pro, HallusionBench, Omni‑Spatial, CharXiv, GUI‑Agent, Video‑MME) to assemble 1,340 diverse questions covering single‑image, multi‑image, and video modalities, as well as multiple‑choice and open‑ended formats. For each question, the authors automatically generate a provisional step‑by‑step reasoning chain using the state‑of‑the‑art reasoning model Gemini‑2.5‑pro. Human annotators then verify, correct, and enrich these chains, ensuring that every atomic reasoning step is necessary and correct. Based on the verified chains, a rubric is created for each question; each rubric item encodes (1) a binary satisfaction flag, (2) a difficulty/importance score, and (3) a label indicating the relevant cognitive dimension and sub‑category.
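A rubric item as described above bundles three pieces of information per verified reasoning step. A minimal data-structure sketch, with field names that are hypothetical rather than taken from the benchmark's release:

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One rubric item for a verified reasoning step.

    Field names are illustrative; the benchmark's own schema may differ.
    """
    description: str   # what the atomic step must establish
    satisfied: bool    # (1) binary satisfaction flag
    weight: float      # (2) difficulty/importance score
    dimension: str     # (3) cognitive dimension, e.g. "Perception"
    subcategory: str   # (3) fine-grained sub-category, e.g. "Spatial"
```

Each question's rubric is then simply a list of such items, one per atomic step in its verified chain.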

Automated Evaluation Pipeline
Evaluation is performed by an “LLM‑as‑a‑Judge” model, specifically Qwen‑3‑32B, chosen for its strong instruction‑following and judgment abilities while being runnable on modest hardware. The judge carries out four tasks: (i) extract the final answer from the model output, (ii) segment the generated CoT into atomic steps, (iii) align these steps with the gold‑standard chain, and (iv) apply the rubric to assign fine‑grained scores and detect hallucinations. This pipeline yields three complementary metrics: answer‑level accuracy, step‑level consistency (precision, recall, F1), and a Hallucination‑free score (H‑score) that reflects the proportion of rubric items satisfied without hallucination.
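The two non-trivial metrics above can be sketched numerically. The formulas below are standard definitions under plausible assumptions, not the paper's exact scoring code: step-level precision/recall/F1 over gold-aligned steps, and an H-score taken here as the weight-normalized share of satisfied rubric items rescaled to the 0-30 range quoted in the findings.

```python
def step_prf(matched: int, generated: int, gold: int) -> tuple[float, float, float]:
    """Step-level consistency: precision, recall, F1 for generated CoT
    steps that align with the gold-standard chain."""
    p = matched / generated if generated else 0.0
    r = matched / gold if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def h_score(items: list[tuple[bool, float]], scale: float = 30.0) -> float:
    """Hallucination-free score: weighted proportion of (satisfied, weight)
    rubric items met without hallucination, rescaled to 0..scale.
    The exact aggregation used by the benchmark is an assumption here."""
    total = sum(w for _, w in items)
    hit = sum(w for ok, w in items if ok)
    return scale * hit / total if total else 0.0
```

For example, a chain with 8 of 10 generated steps matching a 10-step gold chain scores F1 = 0.8, and a rubric with two of three equally weighted items satisfied scores 20/30.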

Experimental Findings
Fourteen recent reasoning MLLMs—including GPT‑5 (OpenAI, 2025a) and OpenAI‑o3 (OpenAI, 2025b)—are evaluated. The results reveal a striking dichotomy: while final‑answer accuracies are high (often >80 %), the intermediate CoTs exhibit non‑trivial hallucination rates. On average, models achieve an H‑score of 27/30, indicating that roughly one in ten reasoning steps contains a hallucination. A detailed breakdown shows that Perception‑related hallucinations are the most frequent (≈49 % of all detected hallucinations) but correlate only weakly with final‑answer errors. In contrast, Reasoning‑related hallucinations—especially those involving spatial, deductive, or causal reasoning—are far more predictive of incorrect answers. Mixed hallucinations that combine Knowledge and Reasoning errors also strongly degrade performance. Notably, spatial hallucinations dominate both the Perception and Reasoning sub‑categories, accounting for over 40 % of all detected hallucinations and highlighting a systemic weakness in current models' handling of geometric and layout information.

Implications and Future Directions
The study empirically confirms the intuition that longer, more elaborate CoTs do not automatically guarantee trustworthy reasoning; instead, they introduce additional failure points. Consequently, developers should incorporate intermediate‑step monitoring, hallucination detection, and possibly corrective feedback loops into the training and inference pipelines of reasoning MLLMs. MM‑THEBench itself offers a reusable, extensible benchmark that can serve as a standard for future work aiming to improve the fidelity of multimodal reasoning. Potential extensions include (1) integrating adversarial prompting to stress‑test hallucination detectors, (2) developing training objectives that penalize hallucinated steps, and (3) expanding the taxonomy to cover emerging modalities such as 3‑D point clouds or embodied sensor streams.

In summary, MM‑THEBench provides the first large‑scale, fine‑grained, automated evaluation suite for hallucinations within the chain‑of‑thought of multimodal reasoning models. Its taxonomy, data annotations, and judgment pipeline together reveal that current state‑of‑the‑art models, despite impressive final‑answer performance, still suffer from systematic intermediate reasoning errors—particularly in spatial perception and logical deduction—that must be addressed to achieve truly reliable multimodal AI.

