START: A Spatial and Textual Learning Framework for Chart Understanding

Reading time: 6 minutes

📝 Abstract

Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property); grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset, generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data, and models will be publicly available.
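The pipeline step that "ascertains the positions of chart elements" from executable chart code can be pictured with a short sketch. This is our own illustration using matplotlib's artist API, not the paper's actual implementation; the chosen elements and the `[x0, y0, x1, y1]` schema are assumptions. Once a chart exists as code, rendering it lets us read off pixel bounding boxes for elements like the title and legend, which can then serve as grounding targets:

```python
# Hypothetical sketch: render chart code headlessly and extract
# pixel-space bounding boxes of chart elements via matplotlib artists.
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 3], label="series A")
ax.set_title("Example chart")
legend = ax.legend()

fig.canvas.draw()  # lay out all artists so extents are valid
renderer = fig.canvas.get_renderer()

# Display-coordinate bounding boxes of a few key chart elements.
elements = {
    "title": ax.title.get_window_extent(renderer),
    "legend": legend.get_window_extent(renderer),
    "x_axis": ax.xaxis.get_tightbbox(renderer),
}
# bbox.extents is (x0, y0, x1, y1); cast to ints as grounding targets.
boxes = {name: [int(v) for v in bbox.extents] for name, bbox in elements.items()}
```

`boxes` then maps each element name to a pixel box, which is exactly the kind of supervision chart-element grounding needs and which is hard to obtain from a chart image alone.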

📄 Content

START: Spatial and Textual Learning for Chart Understanding

Zhuoming Liu1*, Xiaofeng Gao2, Feiyang Niu2, Qiaozi Gao2, Liu Liu3, Robinson Piramuthu2
1University of Wisconsin-Madison, 2Amazon AGI, 3MIT

1. Introduction

The rapid advancement of multimodal large language models (MLLMs) has opened new frontiers in artificial intelligence, enabling processing and reasoning across text, images, and other modalities simultaneously. As their capabilities continue to grow, successful deployment in real-world applications increasingly hinges on their ability to accurately understand and interpret complex visual information. (*Work done during internship at Amazon AGI.)

[Figure 1. A: overview of START, which leverages spatial and textual learning for chart understanding. B: a challenging question sample from CharXiv [59]; answering it requires chart-element grounding and step-by-step reasoning.]
Among various types of visual content, charts represent a particularly challenging yet essential domain for MLLMs, especially in real-world scenarios such as analyzing scientific papers, technical reports, and financial documents. However, despite significant progress in general multimodal understanding, current MLLMs often struggle to understand the complicated visual structure and details of charts. Even the best vision reasoning model, OpenAI o3 [44], still lags behind human-level understanding of charts [59]. Figure 1-B shows a sample question from CharXiv [59], which requires step-by-step reasoning and chart-element grounding based on the instruction in the question. Qwen2.5-VL [1], one of the best open-source MLLMs, makes a mistake because it does not ground the "condition" to the x-axis correctly, illustrating the difficulty of chart understanding.
Unlike natural images, which primarily convey semantic content through objects and scenes, charts are artificial visual inputs that pair a structured spatial layout with an underlying textual data representation. They typically include subplots, titles, legends, and axes, and are instantiated from data sources, such as tables, or rendered by code (e.g., Python [12, 66]). Motivated by these properties of charts, we raise two research questions: 1. Can explicitly learning the spatial structure of a chart and recovering its textual details help chart understanding? 2. How should we construct a dataset that facilitates spatial and textual learning?

To address these questions, we propose START, Spatial and Textual learning for chART understanding, as shown in Figure 1-A. Specifically, we formalize spatial and textual learning in supervised fine-tuning (SFT)

(arXiv:2512.07186v1 [cs.CV], 8 Dec 2025)
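The two tasks introduced above, chart-element grounding and chart-to-code generation, must ultimately be serialized as SFT examples. The sketch below is a minimal hypothetical schema of our own devising; the field names and prompt wording are assumptions, not the paper's actual data format:

```python
# Hypothetical SFT-example schema for START's two tasks.
# Field names and prompts are assumed, not the paper's actual format.

def grounding_example(image_path, element, bbox):
    """Chart-element grounding (spatial): the target is a pixel bounding box."""
    return {
        "image": image_path,
        "prompt": f"Locate the {element} in the chart.",
        "response": {"bbox": bbox},  # [x0, y0, x1, y1] in pixels
    }

def chart_to_code_example(image_path, code):
    """Chart-to-code generation (textual): the target is executable plotting code."""
    return {
        "image": image_path,
        "prompt": "Convert the chart to Python code.",
        "response": code,  # raw matplotlib source recovered for the chart
    }

spatial = grounding_example("chart_001.png", "legend", [300, 120, 380, 160])
textual = chart_to_code_example(
    "chart_001.png",
    "import matplotlib.pyplot as plt\nplt.plot([1, 2, 3])",
)
```

Pairing both example types for the same chart image is what lets a single SFT run teach the model the chart's visual structure and its underlying data simultaneously.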

<div style="text-align:center; margin:30px 0;">
  <a href="https://arxiv.org/pdf/2512.07186.pdf" target="_blank" style="padding:12px 25px; background:#007bff; color:white; border-radius:8px; text-decoration:none; font-weight:bold;"> View Original ArXiv</a>
</div>

<p style="font-size:0.8em; color:gray;">This content is AI-processed based on ArXiv data.</p>