ChartE$^{3}$: A Comprehensive Benchmark for End-to-End Chart Editing


Charts are a fundamental visualization format for structured data analysis. Enabling end-to-end chart editing according to user intent is of great practical value, yet remains challenging due to the need for both fine-grained control and global structural consistency. Most existing approaches adopt pipeline-based designs, where natural language or code serves as an intermediate representation, limiting their ability to faithfully execute complex edits. We introduce ChartE$^{3}$, an End-to-End Chart Editing benchmark that directly evaluates models without relying on intermediate natural language programs or code-level supervision. ChartE$^{3}$ focuses on two complementary editing dimensions: local editing, which involves fine-grained appearance changes such as font or color adjustments, and global editing, which requires holistic, data-centric transformations including data filtering and trend line addition. ChartE$^{3}$ contains over 1,200 high-quality samples constructed via a well-designed data pipeline with human curation. Each sample is provided as a triplet of a chart image, its underlying code, and a multimodal editing instruction, enabling evaluation from both objective and subjective perspectives. Extensive benchmarking of state-of-the-art multimodal large language models reveals substantial performance gaps, particularly on global editing tasks, highlighting critical limitations in current end-to-end chart editing capabilities.


💡 Research Summary

The paper introduces ChartE³, a novel benchmark designed to evaluate end‑to‑end chart editing without relying on intermediate code or program representations. Existing chart‑editing approaches typically follow a pipeline where a natural‑language instruction is first translated into chart‑specific code (e.g., Matplotlib, Vega‑Lite) and then rendered, which limits the assessment to code correctness rather than the visual fidelity of the final edited chart. ChartE³ reframes the problem as a direct image‑to‑image transformation: given an original chart image and a textual editing instruction, the model must produce the edited chart image directly.

ChartE³ focuses on two complementary editing dimensions. Local editing covers fine‑grained appearance changes such as font size, color, label position, and axis styling. Global editing involves data‑centric transformations, including data filtering, aggregation, trend‑line addition, and axis range adjustments. These two dimensions are further divided into twelve fine‑grained task types, providing a comprehensive spectrum of realistic chart‑editing scenarios.

To construct the benchmark, the authors design a five‑stage pipeline. First, they collect roughly 10 K chart images from both real‑world sources (ChartBench, Chart2Code) and synthetic datasets (ChartX, ChartM³), covering over 40 chart types. Diversity filtering is performed by extracting CLIP image embeddings for each chart, applying k‑means clustering, and selecting cluster centroids to obtain about 100 representative images per type, ensuring visual variety and reducing redundancy.
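The centroid-selection step can be sketched with a minimal NumPy k-means; this is a stand-in for the paper's CLIP-plus-clustering pipeline, and the random vectors below are placeholders for real CLIP image embeddings:

```python
import numpy as np

def select_representatives(embeddings: np.ndarray, k: int,
                           iters: int = 20, seed: int = 0) -> list[int]:
    """Pick up to k diverse samples: run k-means on the embeddings and
    return the index of the embedding closest to each cluster centroid."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as its cluster mean (keep it if the cluster is empty).
        for c in range(k):
            members = embeddings[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Map each centroid back to the nearest real sample.
    dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
    return sorted(set(dists.argmin(axis=0).tolist()))

# Toy usage: 200 fake "CLIP embeddings" of dimension 8, keep 5 representatives.
emb = np.random.default_rng(1).normal(size=(200, 8))
reps = select_representatives(emb, k=5)
```

Selecting the sample nearest each centroid (rather than the centroid itself) guarantees that every kept item is a real chart image, which is what the diversity-filtering stage requires.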

Next, each image is paired with a rendering code representation. For datasets lacking code, Gemini‑2.5‑Pro is used to synthesize code. A reflection‑style iterative refinement process renders the generated code, checks execution success, and measures CLIP similarity (>0.7) with the original image; if either check fails, the model is prompted to revise the code. This yields a high‑fidelity chart‑code pair that serves only as an annotation aid.
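The refinement loop might look like the sketch below. `generate_code`, `render`, and `clip_similarity` are hypothetical stand-ins for the Gemini-2.5-Pro call, the chart renderer, and a CLIP scorer; the retry budget is an assumption, only the 0.7 threshold comes from the summary:

```python
CLIP_THRESHOLD = 0.7   # acceptance threshold stated in the pipeline description
MAX_ROUNDS = 5         # assumed retry budget, not specified in the paper

def refine_chart_code(image, generate_code, render, clip_similarity):
    """Reflection-style loop: render candidate code, check execution and
    visual fidelity, and feed failures back to the model for revision."""
    code, feedback = generate_code(image, feedback=None), None
    for _ in range(MAX_ROUNDS):
        try:
            rendered = render(code)                 # execution check
        except Exception as err:
            feedback = f"execution failed: {err}"
        else:
            score = clip_similarity(image, rendered)
            if score > CLIP_THRESHOLD:              # fidelity check
                return code
            feedback = f"low CLIP similarity: {score:.2f}"
        code = generate_code(image, feedback=feedback)  # prompt a revision
    return None  # give up after MAX_ROUNDS failed attempts
```

The key design point is that both failure modes (a crash and a visually unfaithful render) are converted into textual feedback for the next generation round.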

Using the chart‑code pairs, editing instructions are automatically generated for the twelve task types. Human curators then verify each sample, discarding ambiguous or erroneous cases. The final benchmark comprises over 1,200 curated editing samples, each consisting of an original image, a multimodal editing instruction, a target edited image, and the underlying rendering code.
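As a rough illustration, one curated sample could be represented by a record like the following; the field names and example values are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ChartEditSample:
    original_image: str   # path to the source chart image
    instruction: str      # multimodal editing instruction
    target_image: str     # path to the ground-truth edited chart
    source_code: str      # underlying rendering code (annotation aid only)
    task_type: str        # one of the twelve fine-grained task types
    edit_scope: str       # "local" or "global"

# Hypothetical local-editing sample.
sample = ChartEditSample(
    original_image="charts/0001.png",
    instruction="Make the chart title bold and increase its font size.",
    target_image="charts/0001_edited.png",
    source_code="import matplotlib.pyplot as plt  # ...",
    task_type="font_adjustment",
    edit_scope="local",
)
```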

Evaluation metrics combine objective and subjective components. Objective metrics include Structural Similarity Index Measure (SSIM) to assess pixel‑level fidelity and CLIP‑based similarity to capture cross‑modal alignment. Subjective evaluation employs a GPT‑4‑Turbo model that scores semantic correctness, edit faithfulness, and visual distortion on a 1‑5 scale, providing a human‑like quality assessment.
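To make the objective metrics concrete, here is a single-window SSIM (a simplification of the usual sliding-window formulation used in practice) together with the cosine similarity that would be applied to CLIP embeddings; this is a sketch of the metric definitions, not the benchmark's scoring code:

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """SSIM computed over the whole image as a single window:
    (2*mu_x*mu_y + c1)(2*cov + c2) / ((mu_x^2 + mu_y^2 + c1)(var_x + var_y + c2))."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def cosine_sim(a, b) -> float:
    """Cosine similarity between two embedding vectors (e.g., CLIP features)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

img = np.random.default_rng(0).random((64, 64))
score = global_ssim(img, img)  # identical images score 1.0
```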

Extensive experiments benchmark both closed‑source (GPT‑4o, Gemini‑Pro) and open‑source multimodal large language models (LLaVA‑1.5, InternVL‑2, etc.). Results reveal a clear performance gap: local editing tasks are relatively easy, with most models achieving >70 % success and SSIM ≥ 0.85, whereas global editing tasks see success rates drop below 30 %. Error analysis identifies three primary failure modes: (1) misinterpretation of textual instructions, leading to incorrect edit intent; (2) inability to accurately extract underlying data values from the chart image, which hampers data‑centric transformations; and (3) omission of necessary structural adjustments (e.g., axis rescaling) after applying edits, resulting in visual and numerical inconsistencies.

These findings highlight that current multimodal models lack robust integration of visual perception, language understanding, and data extraction required for faithful chart editing, especially for global, data‑driven modifications. The authors suggest future research directions: (a) joint training of chart‑specific visual encoders with data extraction modules to recover numeric information directly from images; (b) diffusion‑based image editing frameworks augmented with structural constraints to better handle global edits; and (c) reinforcement learning with human‑in‑the‑loop feedback to improve instruction grounding and edit fidelity.

In summary, ChartE³ provides the first comprehensive, image‑centric benchmark for end‑to‑end chart editing, exposing critical limitations of existing multimodal models and offering a valuable testbed for advancing visual‑language‑data integration in future AI systems.

