AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process
Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to switch dynamically between tool-augmented visual reasoning and text-only reasoning to improve both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench uses a Matthews Correlation Coefficient (MCC) metric to evaluate how rationally models select among reasoning modes, isolating this meta-cognitive ability by dynamically identifying task difficulty based on each model’s capability boundary. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. Conversely, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.
💡 Research Summary
The paper introduces AdaptMMBench, a comprehensive benchmark designed to evaluate adaptive multimodal reasoning in Vision‑Language Models (VLMs). Existing evaluations rely on static difficulty labels and coarse metrics such as final accuracy or token‑level reduction, which do not capture the dynamic nature of difficulty that varies with model capacity, nor do they separate the ability to select an appropriate reasoning mode from the subsequent reasoning performance. To address these gaps, the authors construct a dataset of 1,420 samples spanning five domains—real‑world images, OCR, graphical user interfaces (GUI), knowledge, and mathematics. Each sample is represented as a quintuple (I, Q, A, E, K): the image I, textual query Q, ground‑truth answer A, visual‑tool annotation E (specifying required regions, rotations, contrast adjustments, etc.), and an ordered list of human‑verified key reasoning steps K that describe the logical solution path.
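The quintuple structure described above can be sketched as a simple container type. This is an illustrative schema only: the field names and value types are assumptions, not the benchmark's released format.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One AdaptMMBench item as the quintuple (I, Q, A, E, K).

    Field names and types are illustrative; the published
    annotation schema may differ.
    """
    image: str            # I: path or identifier of the input image
    query: str            # Q: the textual question about the image
    answer: str           # A: ground-truth answer
    tool_annotation: dict # E: required regions, rotations, contrast edits, ...
    key_steps: list       # K: ordered, human-verified key reasoning steps

# Example (hypothetical data):
item = Sample(
    image="receipt_017.png",
    query="What is the total amount on the receipt?",
    answer="42.80",
    tool_annotation={"crop": [120, 340, 480, 520], "contrast": 1.4},
    key_steps=["locate the total line", "read the printed amount"],
)
```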
AdaptMMBench evaluates two distinct aspects: (1) mode selection and (2) reasoning process quality. For mode selection, the benchmark dynamically determines task difficulty based on each model’s performance boundary, then uses the Matthews Correlation Coefficient (MCC) to quantify how well the model decides between text‑only reasoning and tool‑augmented visual reasoning. MCC treats true positives (correctly invoking tools on difficult items), true negatives (correctly staying text‑only on easy items), false positives (unnecessary tool calls on easy items), and false negatives (failing to call tools on hard items) equally, providing a balanced measure even under class imbalance.
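Treating "invoke tools on a difficult item" as the positive class, the MCC described above follows the standard formula; a minimal sketch:

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient over mode-selection decisions.

    Positive class = invoking visual tools on a difficult item:
      tp: tools invoked on hard items     tn: text-only on easy items
      fp: needless tool calls on easy items  fn: no tool call on hard items
    Returns a value in [-1, 1]; 0 means no better than chance.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:  # degenerate case, e.g. the model always picks one mode
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Note that a model that always invokes tools (or never does) scores 0 regardless of how imbalanced the easy/hard split is, which is exactly the balance property the benchmark relies on.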
For the reasoning process, three dimensions are measured: key‑step coverage, tool effectiveness, and computational efficiency. Key‑step coverage assesses the overlap between the model‑generated reasoning steps and the human‑provided K, reflecting logical coherence. Tool effectiveness evaluates whether each invoked tool actually supplies the needed visual information and contributes to the correct answer, distinguishing successful from redundant or harmful tool usage. Computational efficiency aggregates token count, number of reasoning turns, and tool‑call frequency to estimate inference cost.
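Key-step coverage as described is an overlap measure between generated steps and the annotated list K. A minimal sketch, assuming a pluggable step-matching judge (exact string match here for illustration; the paper presumably uses a stronger semantic matcher):

```python
def key_step_coverage(model_steps, key_steps, match=lambda k, s: k == s):
    """Fraction of annotated key steps K that appear in the model's reasoning.

    `match(k, s)` decides whether model step `s` covers key step `k`.
    The default exact-match judge is a placeholder assumption.
    """
    if not key_steps:
        return 1.0  # nothing required, trivially covered
    covered = sum(any(match(k, s) for s in model_steps) for k in key_steps)
    return covered / len(key_steps)
```

The same counting style extends naturally to the efficiency dimension (summing tokens, turns, and tool calls per sample), though the paper's exact aggregation weights are not specified here.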
The authors evaluate a range of open‑source models (e.g., Qwen3‑VL‑23.5B, InternVL) and several closed‑source systems (e.g., GPT‑5). Findings reveal that MCC scores increase with model size, indicating that larger models possess stronger meta‑cognitive abilities to recognize task difficulty and select appropriate modes. However, the correlation between MCC and final accuracy is relatively weak, suggesting that good mode selection does not automatically translate into higher correctness. In contrast, key‑step coverage shows a strong positive correlation with accuracy, highlighting the importance of faithfully following the logical solution path. Tool effectiveness varies widely across architectures; some models invoke many tools but often fail to obtain useful visual cues, leading to inefficiencies.
The study underscores that adaptive multimodal reasoning comprises at least two separable competencies: (i) the meta‑cognitive skill of difficulty awareness and mode selection, and (ii) the execution skill of constructing coherent, tool‑aware reasoning chains. By introducing MCC‑based dynamic difficulty assessment and a multi‑dimensional process metric suite, AdaptMMBench provides a more nuanced evaluation framework that can guide future VLM development toward both smarter mode selection and higher‑quality reasoning. The benchmark, along with its annotation pipeline and evaluation code, is publicly released to foster reproducibility and further research in adaptive multimodal intelligence.