What Lies Beneath: A Call for Distribution-based Visual Question & Answer Datasets
Visual Question Answering (VQA) has become an important benchmark for assessing how large multimodal models (LMMs) interpret images. However, most VQA datasets focus on real-world images or simple diagrammatic analysis, and few focus on interpreting complex scientific charts. Indeed, many VQA datasets that analyze charts either omit the underlying data behind those charts or assume a 1-to-1 correspondence between chart marks and underlying data. In reality, charts are transformations (i.e., analyses, simplifications, or modifications) of data. This distinction introduces a reasoning challenge in VQA that current datasets do not capture. In this paper, we argue for a dedicated VQA benchmark for scientific charts in which there is no 1-to-1 correspondence between chart marks and underlying data. To do so, we survey existing VQA datasets and highlight the limitations of the field. We then generate synthetic histogram charts from known ground-truth data and ask both humans and a large reasoning model questions whose precise answers depend on access to the underlying data. We release the open-source dataset, including figures, underlying data, the distribution parameters used to generate the data, and bounding boxes for all figure marks and text, for future research.
💡 Research Summary
The paper “What Lies Beneath: A Call for Distribution‑based Visual Question & Answer Datasets” identifies a critical gap in current visual question answering (VQA) research: most chart‑focused VQA datasets assume a one‑to‑one correspondence between visual marks (bars, lines, points) and the underlying raw data. In real scientific practice, charts are often transformed representations—data are aggregated, binned, or otherwise altered—so the visual elements do not directly expose the original distribution. This mismatch creates a reasoning challenge that existing benchmarks do not capture.
To address this, the authors propose a “distribution‑based” VQA benchmark. They generate synthetic histogram figures using Python’s matplotlib library. Underlying data are sampled from Gaussian mixture models (GMMs) with 1–5 components, with values drawn from a uniform range (−1 to 1) and 5–10% random noise added. The pipeline automatically produces JPEG images, JSON metadata describing the distribution parameters, and bounding‑box annotations for every visual element (bars, axis labels, titles, etc.). Although the current release focuses on single‑panel histograms, the code can also create line, scatter, and contour plots with varied styles (colors, fonts, log/linear scales, DPI, aspect ratios).
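The pipeline as summarized can be sketched in a few lines of Python. This is a minimal sketch, not the authors’ released code: the per‑component spreads, mixture weights, noise model, and file naming are all assumptions, and it writes a PNG rather than the release’s JPEG to avoid an extra image‑codec dependency. Bounding‑box export is omitted.

```python
import json
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def sample_gmm(n_points=1000, n_components=None):
    """Draw points from a random Gaussian mixture; parameter ranges are assumptions."""
    if n_components is None:
        n_components = int(rng.integers(1, 6))          # 1-5 components, as in the paper
    means = rng.uniform(-1.0, 1.0, size=n_components)   # values drawn from (-1, 1)
    stds = rng.uniform(0.05, 0.3, size=n_components)    # assumed per-component spread
    which = rng.integers(0, n_components, size=n_points)
    data = rng.normal(means[which], stds[which])
    noise_frac = float(rng.uniform(0.05, 0.10))         # 5-10% noise, as described
    data = data + noise_frac * data.std() * rng.standard_normal(n_points)
    return data, {"n_components": int(n_components),
                  "means": means.tolist(),
                  "stds": stds.tolist(),
                  "noise_frac": noise_frac}

def render_histogram(data, params, n_bins=50, stem="figure_0000"):
    """Save the figure plus JSON metadata describing the generating distribution."""
    fig, ax = plt.subplots()
    ax.hist(data, bins=n_bins)
    ax.set_xlabel("value")
    ax.set_ylabel("count")
    fig.savefig(f"{stem}.png", dpi=100)
    plt.close(fig)
    meta = {**params, "n_bins": n_bins, "median": float(np.median(data))}
    with open(f"{stem}.json", "w") as f:
        json.dump(meta, f)
    return meta

data, params = sample_gmm(n_components=2)
meta = render_histogram(data, params, n_bins=50)
```

Storing the ground‑truth median and component count alongside each image is what makes answers scorable even though the histogram itself only exposes binned counts.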
For each figure the authors automatically generate a set of questions. They adopt a “level” taxonomy (from simple to complex) and decompose each question into four slots: persona (the role of the model), context (background information), question (core query), and format (desired answer format). In this initial study they limit themselves to two statistical questions: (1) “What is the median value of the data in this figure panel?” and (2) “How many Gaussians were used to generate the data for the plot in the figure panel?” The format slot for the second question explicitly asks the model to return an integer between 1 and 5, mirroring the known range of the synthetic data.
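The four‑slot decomposition can be mirrored in a small prompt builder. The persona and context wording below is hypothetical; only the second question and its integer‑range format constraint are taken from the summary above.

```python
def build_prompt(persona: str, context: str, question: str, fmt: str) -> str:
    """Assemble a VQA prompt from the four slots: persona, context, question, format."""
    return "\n".join([persona, context, question, fmt])

prompt = build_prompt(
    persona="You are a statistician analyzing a scientific figure.",        # assumed wording
    context="The figure shows a single-panel histogram of synthetic data.",  # assumed wording
    question=("How many Gaussians were used to generate the data "
              "for the plot in the figure panel?"),
    fmt="Answer with a single integer between 1 and 5.",
)
print(prompt)
```

Keeping the slots separate makes it easy to vary one dimension (say, the persona) while holding the core query and answer format fixed across figures.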
The evaluation involves two human annotators recruited via the Zooniverse platform and a large multimodal model, GPT‑5‑nano (the smallest variant of the newly announced GPT‑5 family). Human annotators view each histogram, draw a vertical line indicating the perceived median, and type the estimated number of Gaussians. The model receives the same image together with the structured prompt and returns textual answers. The authors test 80 histograms in total, varying two key visual factors: (a) the number of histogram bars (10, 20, 45, 60) while fixing the number of Gaussians to 2, and (b) the number of Gaussians (1, 2, 3, 5) while fixing the bar count to 50.
Statistical analysis shows that median estimates from humans and the model are not significantly different (Kruskal‑Wallis test, p > 0.05). Residuals for all three groups deviate from normality, prompting non‑parametric testing. Variance comparisons (Levene’s test) reveal that Annotator 2’s residuals have higher variance than both Annotator 1 and the model, possibly reflecting differences in statistical training. The number of bars does not affect median accuracy for any participant group. However, the number of Gaussians does impact estimation error: as the true number of components increases, all participants’ errors increase, a trend confirmed by a linear mixed‑effects model (p < 0.01). This aligns with intuition—overlapping Gaussian components become visually indistinguishable, making component counting harder.
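The reported test sequence, a non‑parametric location comparison across the three groups followed by a variance comparison, can be reproduced with scipy. The residuals below are fabricated placeholders, with Annotator 2 given a wider spread to mimic the reported pattern; the real values come from the paper’s annotation data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Fabricated median-estimation residuals (estimate - true median) for 80 figures.
res_a1 = rng.normal(0.0, 0.05, 80)   # statistically trained annotator
res_a2 = rng.normal(0.0, 0.15, 80)   # wider spread, as reported for Annotator 2
res_m  = rng.normal(0.0, 0.05, 80)   # the model

# Kruskal-Wallis: do the three groups differ in location?
# (non-parametric, used because the residuals deviate from normality)
_, p_kw = stats.kruskal(res_a1, res_a2, res_m)

# Levene: do the groups differ in variance?
_, p_lev = stats.levene(res_a1, res_a2, res_m)

print(f"Kruskal-Wallis p={p_kw:.3f}, Levene p={p_lev:.2e}")
```

With these placeholder spreads the Levene test flags the variance difference while the location test does not, matching the qualitative pattern the authors describe.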
Interestingly, GPT‑5‑nano’s performance on both tasks is comparable to that of the statistically trained human (Annotator 1) and often better than that of the annotator without statistical training. Nevertheless, the model sometimes fails to obey the prescribed answer format, producing “X” markers in the result tables when it refuses to answer or returns a value outside the expected range. Larger model variants (e.g., GPT‑5‑mini) did not improve accuracy and exhibited more hallucinations, suggesting that scaling alone does not guarantee better VQA performance on distribution‑based tasks.
The authors acknowledge several limitations. The dataset currently contains only histograms and only two question types, limiting ecological validity. Synthetic data, while controllable, may not fully capture the visual complexity of real scientific figures. Moreover, the evaluation uses a single LMM and a small number of human annotators, which restricts generalizability.
Future work outlined includes: (1) expanding the benchmark to other chart types (line, scatter, heatmaps) and more sophisticated statistical queries (confidence intervals, hypothesis tests, correlation coefficients); (2) increasing visual diversity (logarithmic axes, error bars, multi‑panel layouts); (3) scaling up human annotation to a broader pool with varied expertise levels; (4) fine‑tuning multimodal models on the released dataset and investigating prompt‑engineering strategies to enforce answer format compliance; and (5) integrating real‑world scientific charts from published articles to assess transferability from synthetic to authentic data.
In conclusion, this paper makes a compelling case for “distribution‑based” VQA datasets that break the simplistic 1‑to‑1 data‑visual correspondence assumption. By releasing the synthetic histogram dataset, code, and baseline human/LMM results, the authors provide a concrete starting point for the community to develop richer, statistically grounded VQA benchmarks that better reflect the reasoning demands of scientific data visualization.