SIQA: Toward Reliable Scientific Image Quality Assessment
Scientific images fundamentally differ from natural and AI-generated images in that they encode structured domain knowledge rather than merely depict visual scenes. Assessing their quality therefore requires evaluating not only perceptual fidelity but also scientific correctness and logical completeness. However, existing image quality assessment (IQA) paradigms primarily focus on perceptual distortions or image-text alignment, implicitly assuming that depicted content is factually valid. This assumption breaks down in scientific contexts, where visually plausible figures may still contain conceptual errors or incomplete reasoning. To address this gap, we introduce Scientific Image Quality Assessment (SIQA), a framework that models scientific image quality along two complementary dimensions: Knowledge (Scientific Validity and Scientific Completeness) and Perception (Cognitive Clarity and Disciplinary Conformity). To operationalize this formulation, we design two evaluation protocols: SIQA-U (Understanding), which measures semantic comprehension of scientific content through multiple-choice tasks, and SIQA-S (Scoring), which evaluates alignment with expert quality judgments. We further construct the SIQA Challenge, consisting of an expert-annotated benchmark and a large-scale training set. Experiments across representative multimodal large language models (MLLMs) reveal a consistent discrepancy between scoring alignment and scientific understanding. While models can achieve strong agreement with expert ratings under SIQA-S, their performance on SIQA-U remains substantially lower. Fine-tuning improves both metrics, yet gains in scoring consistently outpace improvements in understanding. These results suggest that rating consistency alone may not reliably reflect scientific comprehension, underscoring the necessity of multidimensional evaluation for scientific image quality assessment.
💡 Research Summary
The paper introduces Scientific Image Quality Assessment (SIQA), a novel framework designed to evaluate the quality of scientific images—a class of visual media that encodes structured domain knowledge rather than merely depicting visual scenes. Traditional image quality assessment (IQA) methods, whether full‑reference (e.g., PSNR, SSIM) or no‑reference (e.g., HyperIQA, CLIP‑Score), focus on perceptual fidelity, distortion, or image‑text alignment and implicitly assume that the content is factually correct. This assumption collapses for scientific figures, where a visually polished image can still contain factual inaccuracies, logical omissions, or violations of disciplinary conventions.
SIQA models quality along two complementary dimensions:
- Knowledge – comprising Scientific Validity (consistency with established scientific facts) and Scientific Completeness (inclusion of all necessary elements for sound inference).
- Perception – comprising Cognitive Clarity (ease of visual interpretation, layout, labeling) and Disciplinary Conformity (adherence to field‑specific standards such as IUPAC notation, geographic conventions, etc.).
To operationalize these dimensions, the authors propose two evaluation protocols:
- SIQA‑U (Understanding) – a multiple‑choice reasoning test that probes a model’s semantic comprehension of the image across the four dimensions. Each image is paired with several questions; correct answer rates are aggregated into Knowledge and Perception scores.
- SIQA‑S (Scoring) – a classic MOS (Mean Opinion Score) alignment test where a model predicts a scalar quality rating and its correlation with expert human scores is measured.
The authors construct the SIQA Challenge, consisting of:
- An expert‑annotated benchmark (≈2,000 images, 12,000 QA pairs, 5,000 MOS ratings) covering diverse scientific domains (chemistry, geology, biology, physics, etc.).
- A large‑scale training set (≈100,000 images, 600,000 QA pairs) for fine‑tuning.
Data collection involved aggregating images from existing scientific multimodal datasets, then having domain experts annotate each image for the four quality dimensions, generate dimension‑specific MCQs, and assign MOS ratings. The resulting dataset explicitly disentangles semantic understanding from rating agreement.
Experiments evaluate several state‑of‑the‑art multimodal large language models (MLLMs) such as LLaVA‑2, Kosmos‑2.5, and Qwen‑VL. Without any task‑specific adaptation, these models achieve relatively high correlations (0.78–0.84 Pearson) on SIQA‑S, indicating they can learn to mimic expert rating distributions. However, on SIQA‑U their average accuracy hovers around 45 %, revealing a substantial gap in genuine scientific comprehension. Fine‑tuning on the SIQA training set improves SIQA‑S scores modestly (≈5 % absolute gain) but yields only marginal gains on SIQA‑U (≈3 % absolute gain). Dimension‑wise analysis shows a weak correlation (≈0.42) between Perception and Knowledge scores, confirming that the two axes capture largely independent aspects of quality.
Key contributions:
- Framework – The first systematic, multidimensional definition of scientific image quality that integrates epistemic correctness with perceptual clarity.
- Evaluation Protocols – A clear separation between rating alignment (SIQA‑S) and semantic understanding (SIQA‑U), exposing the risk of conflating the two.
- Dataset – A publicly released benchmark and large‑scale fine‑tuning corpus, enabling reproducible research on scientific image assessment.
- Empirical Insight – Demonstration that current MLLMs excel at mimicking human quality scores but struggle to reason about the factual correctness of scientific visuals, highlighting a need for dedicated knowledge‑aware training.
The paper concludes with several future directions: extending the Knowledge dimension to finer, domain‑specific criteria (e.g., reaction mechanisms, biological pathways), integrating human‑in‑the‑loop feedback for continual model improvement, and applying SIQA as a quality control signal for scientific image generation systems. By providing both a conceptual foundation and concrete resources, the work opens a new research avenue toward trustworthy, knowledge‑grounded visual AI in scientific domains.