Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation
Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation rather than faithful modeling of underlying chart structure, which often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. With only 3K training samples, we achieve strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: https://github.com/Mighten/chart-specification-paper
💡 Research Summary
The paper tackles the challenging task of generating executable plotting code from static chart images, a problem that sits at the intersection of visual perception and program synthesis. While recent Vision‑Language Models (VLMs) have shown promise in cross‑modal tasks, the authors argue that direct supervised fine‑tuning of VLMs on chart‑to‑code data suffers from a fundamental mismatch: the supervision signal is dominated by surface‑level token imitation rather than the underlying visual structure that charts encode. This leads to three typical failure modes—structural hallucinations, misaligned token‑level loss, and sparse binary execution feedback—resulting in code that may be syntactically correct but semantically divergent from the source chart.
To address these issues, the authors introduce Chart Specification (ℱ), a compact, structured intermediate representation inspired by the Grammar of Graphics and the Vega‑Lite visual grammar. ℱ captures four orthogonal aspects of a chart: (1) Topology (layout, axes, legends), (2) CoordDomain (coordinate transformations), (3) SeriesDomain (data series and their bindings), and (4) AuxiliaryStats (labels, tick marks, statistical annotations). By parsing both the reference plotting script and the model‑generated script into ℱ, the system obtains a modality‑agnostic, semantically grounded description of the chart's intent.
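A minimal sketch of what such a four-part specification and a script-to-spec parser might look like. The field names and the string-matching "parser" are illustrative assumptions, not the authors' actual schema (a real implementation would walk the script's AST rather than pattern-match source text):

```python
from dataclasses import dataclass, field

@dataclass
class ChartSpec:
    """Hypothetical container for the four components described above."""
    topology: dict = field(default_factory=dict)        # layout, axes, legends
    coord_domain: dict = field(default_factory=dict)    # coordinate transforms
    series_domain: dict = field(default_factory=dict)   # series -> axis bindings
    auxiliary_stats: dict = field(default_factory=dict) # labels, ticks, annotations

def parse_script(source: str) -> ChartSpec:
    """Toy parser: extract a few structural facts from a matplotlib script.

    Only pattern-matches a couple of calls to illustrate the idea of
    mapping code to a modality-agnostic specification.
    """
    spec = ChartSpec()
    spec.topology["has_legend"] = "plt.legend(" in source
    spec.topology["n_series"] = source.count("plt.plot(")
    spec.coord_domain["log_y"] = "plt.yscale('log')" in source
    return spec

script = "plt.plot(x, y)\nplt.plot(x, z)\nplt.legend()\n"
spec = parse_script(script)
print(spec.topology["n_series"])  # 2
```

Because both the reference script and the model's output can be parsed into the same structure, two syntactically different scripts that draw the same chart map to the same specification.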
Using ℱ as a filter, the authors curate ChartStruct, a structurally balanced training corpus. Unlike prior large‑scale datasets that are heavily skewed toward a few chart types and contain noisy syntactic variations, ChartStruct deliberately balances samples across fine‑grained structural categories (e.g., line charts defined by explicit functional mappings versus those derived from interpolated discrete data). This balanced curation enables the model to learn complex data dependencies even with a modest number of examples.
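The balancing step can be sketched as a group-and-cap pass over a type-skewed candidate pool. The category key and per-category cap below are hypothetical stand-ins for the paper's fine-grained structural categories:

```python
import random
from collections import defaultdict

def balance_by_structure(samples, key, cap):
    """Group samples by a structural category and keep at most `cap` per
    category, instead of sampling uniformly from a skewed pool."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[key(s)].append(s)
    rng = random.Random(0)  # fixed seed for a reproducible subset
    balanced = []
    for items in buckets.values():
        rng.shuffle(items)
        balanced.extend(items[:cap])
    return balanced

# A 90/10 skewed pool becomes a 10/10 balanced subset.
pool = [{"type": "line"}] * 90 + [{"type": "heatmap"}] * 10
subset = balance_by_structure(pool, key=lambda s: s["type"], cap=10)
print(len(subset))  # 20
```

Capping rather than oversampling keeps the corpus small, which is consistent with the paper's emphasis on data efficiency.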
The second major contribution is the Spec‑Align Reward, a dense, verifiable reinforcement‑learning signal that directly measures structural fidelity between generated and reference specifications. The reward is organized as a hierarchical Reward Tree with three phases: (i) Integrity Gate checks the presence and format of required fields; (ii) Semantic Topology Gate evaluates the correctness of layout, axis relationships, and series‑axis bindings; (iii) Code Metric Gate computes fine‑grained execution metrics such as pixel‑level MSE or SSIM between the rendered chart from generated code and the ground‑truth image. By providing step‑wise feedback, the reward closes the “reward gap” that has limited prior RL attempts on chart‑to‑code tasks, where binary compile‑or‑run success was too coarse.
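The gated, hierarchical structure can be sketched as follows. The equal weighting of the topology and image-similarity terms, and the specific fields checked, are assumptions for illustration; the point is that later gates are only reached once earlier ones pass, so feedback is dense rather than binary:

```python
def spec_align_reward(gen_spec: dict, ref_spec: dict, ssim: float) -> float:
    """Sketch of a three-phase reward tree over parsed specifications.

    `ssim` stands in for the Code Metric Gate's rendered-image similarity,
    computed elsewhere by executing the generated code.
    """
    # Phase 1: Integrity Gate - required fields must be present at all.
    required = ("topology", "series_domain")
    if not all(k in gen_spec for k in required):
        return 0.0

    # Phase 2: Semantic Topology Gate - fraction of structural fields
    # that agree with the reference specification.
    topo_keys = set(ref_spec["topology"])
    matched = sum(gen_spec["topology"].get(k) == ref_spec["topology"][k]
                  for k in topo_keys)
    topo_score = matched / max(len(topo_keys), 1)

    # Phase 3: Code Metric Gate - blend in pixel-level similarity,
    # only rewarded once the structure is sound.
    return 0.5 * topo_score + 0.5 * ssim

r = spec_align_reward({"topology": {"n_series": 2}, "series_domain": {}},
                      {"topology": {"n_series": 2}}, ssim=0.8)
```

A generation that fails the Integrity Gate scores zero regardless of how well its rendered image matches, which prevents the policy from gaming pixel similarity with structurally invalid code.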
For policy optimization, the paper adopts Group Relative Policy Optimization (GRPO). Samples are grouped by their Chart Specification, and advantage estimates are computed relative to the group mean. This group‑wise normalization mitigates the long‑tail distribution of chart types and prevents over‑rewarding complex structures that are under‑represented in the data.
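The core of GRPO's advantage computation is a per-group normalization of rewards. A minimal sketch, where each list holds the rewards of completions sharing one Chart Specification group:

```python
def group_advantages(rewards, eps=1e-8):
    """Normalize each reward by its group's mean and standard deviation,
    so advantages are relative within the group rather than absolute."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

adv = group_advantages([0.2, 0.8])
# roughly symmetric around zero: the better completion gets ~+1, the worse ~-1
```

Because normalization happens within each group, a hard, under-represented chart type with uniformly low raw rewards still yields informative positive and negative advantages, which is how the group-wise scheme mitigates the long-tail distribution described above.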
Experiments are conducted on three public benchmarks (ChartCoder, Plot2Code, and DePlot). Remarkably, with only 3K training pairs, the proposed method outperforms prior supervised baselines that use up to 160K pairs, achieving up to 61.7% improvement on complex structural metrics. Scaling to 4K samples yields new state‑of‑the‑art results across all evaluated metrics, with substantial gains in structural correctness (e.g., axis scaling, series mapping) and a higher proportion of executable code. Ablation studies confirm that both the Chart Specification representation and the Spec‑Align Reward contribute significantly to the observed gains.
In summary, the paper demonstrates that structural supervision via an intermediate specification, combined with fine‑grained, verifiable reinforcement learning, enables VLMs to reason about chart geometry and data relationships far more efficiently than raw token‑level supervision. The approach not only improves data efficiency and fidelity for chart‑to‑code generation but also offers a blueprint for other visual‑to‑code domains where structural correctness is paramount.