VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering
Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questions about figures from scientific papers. A key bottleneck is the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in their resulting QA pairs, stemming from LVLMs’ inherent limitations and information asymmetry between figures and text. To address these challenges, we propose a Cross-Modal verification framework that generates questions and answers purely from figure-citing paragraphs, then verifies them against the figures themselves, leveraging the inherent text-figure alignment in scientific papers to filter out erroneous QA pairs. We instantiate this framework to curate VeriSciQA, a dataset of 20,272 QA pairs spanning 20 scientific domains and 12 figure types. Difficulty assessment reveals a notable accuracy gap between the best open-source model (65%) and the best proprietary model (80.5%), demonstrating room for improvement. Moreover, models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with performance gains that scale with data size, surpassing models trained on existing datasets. Human evaluation further validates the improved quality of VeriSciQA. These results demonstrate that continued data expansion via our scalable framework can further advance SVQA capability in the open-source community. Our dataset is publicly available at https://huggingface.co/datasets/datajuicer/VeriSciQA.
💡 Research Summary
The paper addresses a critical bottleneck in scientific visual question answering (SVQA): the scarcity of large‑scale, high‑quality training data that is both diverse and visually grounded. While recent works have leveraged large vision‑language models (LVLMs) to synthesize QA pairs at scale, systematic errors arise due to two fundamental issues: (i) model‑intrinsic hallucination, where LVLMs generate visual claims that are not present in the figure, and (ii) information asymmetry, because generation pipelines typically feed only the figure (and sometimes its caption) to the model, omitting the rich contextual information found in figure‑citing paragraphs of the paper. An audit of the widely used ArXivQA dataset revealed a 37 % error rate, with four recurring error categories (E1–E4): incorrectly visually grounded, figure‑intent misaligned, non‑visual, and outside‑knowledge questions.
To overcome these problems, the authors propose a Cross‑Modal Verification framework that decouples generation and verification across modalities. The pipeline consists of two stages:
- Generation – Figure‑citing paragraphs are first extracted from the LaTeX source of arXiv papers. A text‑only large language model (LLM) parses each paragraph into atomic claims of the form “The figure shows…”. For each claim, the LLM generates a natural‑language question and its correct answer, deliberately skipping claims that lack concrete visual grounding. A vision‑enabled LVLM then creates plausible distractor options conditioned on the figure, yielding a complete multiple‑choice QA pair.
- Verification – A cascade of filters is applied. (a) A source‑consistency filter checks that the question and answer are faithful to the original claim. (b) A visual‑dependence filter ensures that the answer truly requires visual evidence, again without looking at the image. (c) Finally, two vision‑based checks, implemented with a second LVLM, verify (i) that the answer can be inferred from the figure’s caption and (ii) that the answer is consistent with the visual content of the figure itself, using self‑consistency voting. Only QA pairs that pass every filter are retained.
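The two-stage cascade above can be sketched in code. This is a minimal illustration only, not the authors' implementation: the filter functions below are hypothetical stand-ins for the text-only LLM and LVLM calls the paper describes, and the voting threshold is an assumed parameter.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class QAPair:
    claim: str        # atomic claim extracted from a figure-citing paragraph
    question: str
    answer: str
    options: list     # correct answer plus LVLM-generated distractors

# Hypothetical stand-ins: in the real pipeline each filter would call a
# text-only LLM (a, b) or a vision-enabled LVLM (c).
def source_consistent(qa: QAPair) -> bool:
    # (a) Is the QA pair faithful to the original claim? (text-only check)
    return qa.answer in qa.claim

def visually_dependent(qa: QAPair) -> bool:
    # (b) Does answering genuinely require looking at the figure?
    return "figure" in qa.question.lower()

def vision_vote(qa: QAPair, lvlm_sample, k: int = 5, threshold: float = 0.6) -> bool:
    # (c) Self-consistency voting: sample the verifier LVLM k times and
    # keep the pair only if a clear majority reproduces the answer.
    votes = [lvlm_sample(qa) for _ in range(k)]
    top, count = Counter(votes).most_common(1)[0]
    return top == qa.answer and count / k >= threshold

def verify(qa: QAPair, lvlm_sample) -> bool:
    # Cascade order: cheap text-only checks first, costly vision checks last.
    return (source_consistent(qa)
            and visually_dependent(qa)
            and vision_vote(qa, lvlm_sample))
```

Running the cheap textual filters before the vision-based vote mirrors the cascade design: most erroneous pairs are discarded without ever invoking the verifier LVLM.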
Using this framework, the authors curate VeriSciQA, a dataset of 20,272 QA pairs spanning 20 scientific domains (e.g., computer vision, biology, physics) and 12 figure types (line charts, bar charts, heatmaps, microscopy images, etc.). The dataset includes five question types and is deliberately balanced across difficulty levels.
Empirical evaluation shows a substantial performance gap between open‑source and proprietary models on VeriSciQA: the best open‑source LVLM (LLaVA‑13B) achieves 65% accuracy, while the best proprietary model (GPT‑4V) reaches 80.5%, a 15.5‑point difference. Fine‑tuning open‑source LVLMs on VeriSciQA yields consistent improvements on three downstream SVQA benchmarks (including CharXiv and SciQA), outperforming models trained on existing synthetic datasets. Moreover, performance scales monotonically with data size: training on 500 examples yields modest gains, while using the full 20k set delivers an average +2.05% absolute improvement over the base model. Human evaluation corroborates these findings, reporting lower error rates and higher perceived quality compared to prior datasets.
The paper also releases the full dataset on Hugging Face and provides Data‑Juicer operators that implement each step of the pipeline, facilitating reproducibility and future extensions. Limitations are acknowledged: verification still depends on the current LVLM’s visual reasoning capabilities, and more sophisticated visual consistency checks (e.g., using specialized chart parsers or multimodal ensembles) could further reduce residual errors. Nonetheless, the cross‑modal verification paradigm demonstrates that high‑quality SVQA data can be generated automatically at scale, narrowing the gap between open‑source and commercial models and paving the way for more reliable scientific AI assistants.