ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models


Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks – where frontier models perform similarly and near saturation – our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.


💡 Research Summary

The paper “ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision‑Language Models” addresses a critical gap in multimodal evaluation: most existing chart‑question‑answering benchmarks rely heavily on textual cues, allowing large vision‑language models (LVLMs) to succeed without genuine visual reasoning. To expose this weakness, the authors first conduct a synthetic experiment using charts that contain no textual annotations, varying visual complexity through overlay and subplot configurations across five chart types (histogram, density, line, scatter, and violin plots). They find that as the number of overlaid figures (n) increases from 3 to 9, Claude‑3.7‑Sonnet’s accuracy drops sharply, while human participants maintain near‑perfect performance. This demonstrates that current LVLMs struggle with pure visual inference, especially under high visual load.

Motivated by these findings, the authors introduce ChartMuseum, a new benchmark comprising 1,162 (image, question, short answer) triples drawn from 928 real‑world charts sourced from 184 distinct websites (academic papers, infographics, Tableau dashboards, etc.). Thirteen computer‑science researchers curated the dataset without any assistance from large language models, ensuring that each question has an objective answer and a sufficiently large answer space (minimum four options). Questions are manually classified into four reasoning categories: (1) Textual Reasoning, (2) Visual Reasoning, (3) Text/Visual Mixed Reasoning, and (4) Synthesis Reasoning (requiring both modalities). Subjective “why/how” questions and compound queries are deliberately excluded to keep evaluation objective.
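The benchmark items described above can be pictured as a simple record type. The sketch below is purely illustrative: the field names, category labels, and sample items are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical schema for a ChartMuseum-style item; field names and
# category labels are illustrative, not the dataset's actual format.
@dataclass(frozen=True)
class ChartItem:
    image_path: str  # real-world chart image
    question: str    # objective question with a single short answer
    answer: str      # short reference answer
    category: str    # "textual" | "visual" | "text_visual" | "synthesis"

def by_category(items, category):
    """Filter benchmark items to one reasoning category."""
    return [it for it in items if it.category == category]

# Toy examples (invented for illustration).
items = [
    ChartItem("charts/gdp.png",
              "Which region's bar is tallest in 2020?", "Asia", "visual"),
    ChartItem("charts/gdp.png",
              "What is the labeled 2020 value for Asia?", "34.2", "textual"),
]
assert len(by_category(items, "visual")) == 1
```

Keeping the reasoning category on each item is what makes the per-category breakdowns reported later in the paper possible.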

The benchmark’s difficulty is validated by a comparison with the widely used ChartQA dataset. When Claude‑3.7‑Sonnet is given only explicitly extracted textual information from ChartQA charts, it achieves 74.1% accuracy, just 13 points below its full‑image performance (87.4%). In contrast, on ChartMuseum the same model’s accuracy collapses from 61.3% with images to 15.2% with extracted text, a 46‑point gap that reflects the presence of inherently visual information.

Extensive evaluation on ChartMuseum covers 10 open‑source and 11 proprietary LVLMs. Human performance reaches 93% overall. The best proprietary model, Gemini‑2.5‑Pro, scores 63.0%; the strongest open‑source model, Qwen2.5‑VL‑72B‑Instruct, attains only 38.5%. Broken down by reasoning type, models lose 35–55% of their accuracy on visual‑reasoning questions relative to textual‑reasoning questions.
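The metrics in this breakdown reduce to two small computations: exact-match accuracy over (prediction, reference) pairs and the relative drop between categories. The sketch below uses toy inputs, and the normalization (case-insensitive string match) is an assumption, not necessarily the paper's exact scoring rule.

```python
# Sketch of the per-category accuracy breakdown; the prediction/reference
# pairs and the normalization rule here are illustrative assumptions.
def accuracy(pairs):
    """Fraction of (prediction, reference) pairs matching after normalization."""
    if not pairs:
        return 0.0
    return sum(p.strip().lower() == r.strip().lower() for p, r in pairs) / len(pairs)

def relative_drop(text_acc, visual_acc):
    """Relative performance drop on visual- vs. text-reasoning questions."""
    return (text_acc - visual_acc) / text_acc

# A model at 80% on textual and 40% on visual questions shows a 50% relative
# drop, within the 35-55% range the paper reports.
assert accuracy([("Asia", "asia"), ("12", "13")]) == 0.5
assert relative_drop(0.80, 0.40) == 0.5
```

Measuring the *relative* rather than absolute drop makes the comparison fair across models with very different overall accuracy.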

A qualitative error analysis reveals systematic weaknesses: (i) visual comparisons (e.g., determining which bar is taller), (ii) marker‑based object identification (e.g., locating a specific colored point), and (iii) trajectory reasoning on line charts (e.g., extrapolating trends). Models tend to over‑rely on textual extraction, even when the question explicitly demands visual inference, leading to predictable failures.

The authors conclude that while LVLMs have made rapid progress on text‑centric multimodal tasks, they remain far from human‑level visual reasoning, especially in domains where visual patterns cannot be trivially translated into text. Future work should focus on expanding visual encoder capacity, designing multi‑step visual reasoning pipelines, and integrating training objectives that jointly optimize textual and visual inference. ChartMuseum thus provides a rigorous, real‑world testbed for measuring and guiding such advancements.

