MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts
With the rapid progress of Multimodal Large Language Models (MLLMs), evaluating their mathematical reasoning capabilities has become an increasingly important research direction. In particular, visual-textual mathematical reasoning serves as a key indicator of an MLLM’s ability to comprehend and solve complex, multi-step quantitative problems. While existing benchmarks such as MathVista and MathVerse have advanced the evaluation of multimodal math proficiency, they primarily rely on digitally rendered content and fall short of capturing the complexity of real-world scenarios. To bridge this gap, we introduce MathScape, a novel benchmark focused on assessing MLLMs’ reasoning ability in realistic mathematical contexts. MathScape comprises 1,369 high-quality math problems paired with human-captured real-world images, closely reflecting the challenges encountered in practical educational settings. We conduct a thorough multi-dimensional evaluation across nine leading closed-source MLLMs, three open-source MLLMs with over 20 billion parameters, and seven smaller-scale MLLMs. Our results show that even state-of-the-art models struggle with real-world math tasks and lag behind human performance, highlighting critical limitations in current model capabilities. Moreover, we find that strong performance on synthetic or digitally rendered images does not guarantee similar effectiveness on real-world tasks. This underscores the necessity of MathScape for the next stage of multimodal mathematical reasoning research.
💡 Research Summary
The paper introduces MathScape, a novel benchmark designed to evaluate multimodal large language models (MLLMs) on realistic, real‑world mathematical problems. Existing multimodal math benchmarks such as MathVista, MathVerse, and Math‑V rely heavily on synthetically rendered images, which fail to capture the visual noise, lighting variations, and layout ambiguities that users encounter when photographing textbooks, worksheets, or screens. To address this gap, the authors construct a dataset of 1,369 high‑quality math problems sourced from primary, middle, and high‑school curricula in China. Each problem is converted into a PDF, rendered as an image, and then photographed or screenshot‑captured to emulate real‑world acquisition conditions. Five graduate‑student annotators performed rigorous validation, at a total cost of roughly $8,000, ensuring that every question and solution reached consensus. The dataset is finely annotated along three axes: (1) question type (multiple‑choice, fill‑in‑the‑blank/solution, proof), (2) knowledge domain (algebra, geometry, probability, statistics, functions, equations), and (3) educational stage (primary, middle, high school). Each label is cross‑checked by multiple annotators to guarantee consistency.
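The three annotation axes above can be pictured as a small record type. The following is a minimal sketch; the class, enum, and field names are hypothetical and chosen for illustration, and the released dataset's actual schema may differ.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical labels mirroring the three annotation axes described above.
class QuestionType(Enum):
    MULTIPLE_CHOICE = "multiple-choice"
    FILL_IN_SOLUTION = "fill-in-the-blank/solution"
    PROOF = "proof"

class KnowledgeDomain(Enum):
    ALGEBRA = "algebra"
    GEOMETRY = "geometry"
    PROBABILITY = "probability"
    STATISTICS = "statistics"
    FUNCTIONS = "functions"
    EQUATIONS = "equations"

class Stage(Enum):
    PRIMARY = "primary"
    MIDDLE = "middle"
    HIGH = "high"

@dataclass
class MathScapeItem:
    image_path: str              # human-captured photo or screenshot of the problem
    question_type: QuestionType  # axis 1
    domain: KnowledgeDomain      # axis 2
    stage: Stage                 # axis 3
    reference_answer: str        # consensus solution verified by annotators
```

Cross-checking by multiple annotators then amounts to requiring that independent `MathScapeItem` labelings of the same image agree on all three enum fields.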
For evaluation, the authors propose a two‑step pipeline. First, a language model is prompted to segment long, narrative answers into discrete sub‑answers, each focusing on a specific sub‑problem. Second, another LLM evaluates each sub‑answer individually using a dedicated scoring prompt. Human judges verified the automated scores, achieving a 97% agreement rate, which validates the reliability of the automatic scoring mechanism and enables granular error analysis.
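The two-step pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: `call_llm` stands in for any chat-completion client, and both prompts are hypothetical placeholders rather than the paper's actual scoring prompts.

```python
import json
from typing import Callable, List

def split_into_subanswers(call_llm: Callable[[str], str], answer: str) -> List[str]:
    """Step 1: ask an LLM to segment a long narrative answer into discrete
    sub-answers, returned as a JSON list of strings (illustrative prompt)."""
    prompt = (
        "Split the following solution into a JSON list of sub-answers, "
        f"one per sub-problem:\n{answer}"
    )
    return json.loads(call_llm(prompt))

def score_subanswers(call_llm: Callable[[str], str],
                     sub_answers: List[str], reference: str) -> float:
    """Step 2: score each sub-answer independently against the reference
    solution (here a binary verdict) and return the mean score."""
    scores = []
    for sub in sub_answers:
        verdict = call_llm(
            f"Reference solution:\n{reference}\n\nCandidate sub-answer:\n{sub}\n"
            "Reply with 1 if the sub-answer is correct, otherwise 0."
        )
        scores.append(1.0 if verdict.strip().startswith("1") else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Scoring sub-answers individually, rather than grading the whole narrative at once, is what makes the granular error analysis described above possible.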
The experimental suite includes nine closed‑source models (GPT‑4o, GPT‑4V, GPT‑4‑Turbo, GeminiPro, Claude‑3‑Opus, Baichuan‑VL, Qwen‑Max, Qwen‑Plus, GLM4V) and three open‑source models with more than 20 B parameters (Yi‑VL‑34B, Qwen2‑VL‑Instruct‑72B, LLaVA‑One‑Vision‑72B). Additionally, seven smaller open‑source models (DeepSeek‑VL‑2‑4.5B, LLaVA‑1.6‑7B, Qwen2‑VL‑Instruct‑7B, LLaVA‑One‑Vision‑7B, Llama‑3.2‑11B‑Vision) and two math‑specialized variants (Math‑LLaVA, G‑LLaVA‑7B) are evaluated. All models are tested in a zero‑shot setting with identical inference parameters (max tokens = 2048, top‑k = 5, temperature = 0.3, repetition penalty = 1.05) on NVIDIA H100 GPUs.
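The shared decoding settings reported above can be collected into a single configuration applied to every model. The dictionary below is a sketch: the parameter names follow common Hugging Face conventions and are an assumption, not the authors' exact code.

```python
# Identical zero-shot inference parameters for all evaluated models,
# per the values reported in the paper. Key names are assumed
# (Hugging Face-style), not taken from the authors' code.
GENERATION_CONFIG = {
    "max_new_tokens": 2048,    # max tokens = 2048
    "top_k": 5,                # top-k sampling
    "temperature": 0.3,
    "repetition_penalty": 1.05,
    "do_sample": True,         # nonzero temperature implies sampling
}
```

Holding these parameters fixed across all nineteen models is what makes the accuracy comparisons in the next paragraph meaningful.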
Results reveal that GPT‑4o, the strongest closed‑source model, attains the highest overall accuracy yet still falls short of human performance, especially on higher‑difficulty, proof‑type, and geometry questions. When the same models are fed digitally rendered PDFs (synthetic images), GPT‑4o reaches roughly 78% accuracy, whereas on human‑captured photos its accuracy drops below 42%. This stark contrast demonstrates that proficiency on clean, synthetic visuals does not transfer to noisy, real‑world inputs. Moreover, performance variability across multiple runs is non‑trivial (5–8% fluctuation), indicating limited robustness to visual perturbations. Smaller models generally achieve under 20% accuracy, and math‑specific fine‑tuning (e.g., Math‑LLaVA) yields only marginal gains over generic vision‑language models.
The authors conclude that current MLLMs are not yet ready for practical deployment in real‑world educational settings. Key limitations include sensitivity to image quality, insufficient chain‑of‑thought reasoning for complex multi‑step problems, and unstable predictions. They recommend future work focus on (1) enhancing visual robustness through noise‑aware pre‑training or augmentation, (2) integrating more sophisticated CoT prompting and hierarchical reasoning frameworks, and (3) domain‑adaptive fine‑tuning that jointly leverages textual math corpora and real‑world multimodal data. MathScape itself is positioned as a critical benchmark for the next generation of multimodal models, providing a realistic testbed that bridges the gap between laboratory performance and everyday educational use cases.