START: A Spatial and Textual Learning Framework for Chart Understanding
📝 Abstract
Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property); grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM’s understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart’s visual structure, addressing challenges that existing methods cannot handle. To evaluate a model’s ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data, and models will be publicly available.
📄 Content
START: Spatial and Textual Learning for Chart Understanding
Zhuoming Liu1*, Xiaofeng Gao2, Feiyang Niu2, Qiaozi Gao2, Liu Liu3, Robinson Piramuthu2
1University of Wisconsin-Madison 2Amazon AGI 3MIT
- Introduction
The rapid advancement of multimodal large language models (MLLMs) has opened new frontiers in artificial intelligence, enabling processing and reasoning across text, images, and other modalities simultaneously. As their capabilities continue to grow, successful deployment in real-world applications increasingly hinges on their ability to accurately understand and interpret complex visual information.
*Work done during internship at Amazon AGI.
Figure 1. A: overview of START, which leverages spatial and textual learning for chart understanding. B: a challenging question sample from CharXiv [59]. Answering the question requires chart-element grounding and step-by-step reasoning.
Among various types of visual content, charts represent a particularly challenging yet essential domain for MLLMs, especially in real-world scenarios such as analyzing scientific papers, technical reports, and financial documents. However, despite significant progress in general multimodal understanding, current MLLMs often struggle to understand the complicated visual structure and details in charts. Even the best vision reasoning model, OpenAI o3 [44], still lags behind human-level understanding of charts [59]. Figure 1-B shows a sample question from CharXiv [59], which requires step-by-step reasoning and chart-element grounding based on the instruction in the question. Qwen2.5-VL [1], one of the best open-source MLLMs, makes a mistake because it does not ground the "condition" to the x-axis correctly, illustrating the difficulty of chart understanding.
Unlike natural images, which primarily convey semantic content through objects and scenes, charts are artificial visual inputs that pair a structured spatial layout with an underlying textual data representation. They typically include subplots, titles, legends, and axes, and are instantiated from data sources, such as tables, or rendered by code (e.g., Python [12, 66]). Motivated by these properties of charts, we raise two research questions: 1. Can explicitly learning the spatial structure of a chart and recovering its textual details help chart understanding? and 2. To facilitate spatial and textual learning, how should we construct the dataset?
arXiv:2512.07186v1 [cs.CV] 8 Dec 2025
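Since charts are typically rendered by code, their underlying data can be expressed as an executable program. A minimal sketch of such a chart-to-code target is shown below; the function name and code template are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch: serialize a small data table into executable
# matplotlib code, the kind of "textual" target used in chart-to-code
# generation. table_to_chart_code is a hypothetical helper, not from START.

def table_to_chart_code(title, x_label, y_label, series):
    """series: dict mapping legend label -> (xs, ys)."""
    lines = [
        "import matplotlib.pyplot as plt",
        "",
        "fig, ax = plt.subplots()",
    ]
    for label, (xs, ys) in series.items():
        lines.append(f"ax.plot({list(xs)}, {list(ys)}, label={label!r})")
    lines += [
        f"ax.set_title({title!r})",
        f"ax.set_xlabel({x_label!r})",
        f"ax.set_ylabel({y_label!r})",
        "ax.legend()",
        "fig.savefig('chart.png')",
    ]
    return "\n".join(lines)

code = table_to_chart_code(
    "Mean RMS by condition", "condition", "Mean RMS",
    {"car": ([0, 1, 2], [1.2, 1.8, 1.5]), "lab": ([0, 1, 2], [0.9, 1.0, 1.1])},
)
print(code)
```

Because the output is executable Python, the recovered data representation can be validated simply by running it, which is what makes a code-based textual target convenient for dataset construction.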
To address these questions, we propose START, Spatial and Textual learning for chART understanding, as shown in Figure 1-A. Specifically, we formalize spatial and textual learning in supervised finetuning (SFT)
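One way to obtain grounding supervision of the kind chart-element grounding requires is to render a chart from its code and read back the on-canvas position of an element. The sketch below illustrates this idea with matplotlib's legend bounding box; it is a minimal illustration under our own assumptions, not START's data-generation pipeline.

```python
# Minimal sketch: render a chart headlessly, then recover the pixel-space
# bounding box of a chart element (the legend). Instrumenting executable
# chart code like this yields element positions for grounding supervision.
import matplotlib
matplotlib.use("Agg")  # headless rendering backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [1.2, 1.8, 1.5], label="car")
ax.plot([0, 1, 2], [0.9, 1.0, 1.1], label="lab")
legend = ax.legend()
fig.canvas.draw()  # layout is only final after drawing

# Bounding box in display (pixel) coordinates, origin at the lower left.
renderer = fig.canvas.get_renderer()
bbox = legend.get_window_extent(renderer)
print(f"legend bbox: ({bbox.x0:.0f}, {bbox.y0:.0f}) -> ({bbox.x1:.0f}, {bbox.y1:.0f})")
```

The same `get_window_extent` call applies to other artists (titles, tick labels, axes), so one rendered chart can yield positions for all of its elements.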