SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Reading time: 5 minutes
...
📝 Original Info
Title: SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
ArXiv ID: 2512.22334
Date: 2025-12-26
Authors: Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai
📝 Abstract
We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
💡 Deep Analysis
📄 Full Content
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Shanghai Artificial Intelligence Laboratory and Community Contributors∗
Abstract
We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
Page https://opencompass.org.cn/Intern-Discovery-Eval/rank
Code https://github.com/InternScience/SciEvalKit
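To make the batch-evaluation workflow described in the abstract concrete, the sketch below shows what evaluating models against benchmarks through a pluggable registry could look like. This is an illustrative toy, not SciEvalKit's actual API: the registry functions, the Sample dataclass, and the exact-match metric are assumptions chosen for brevity; the toolkit's real interfaces live in the repository linked above.

```python
# Hypothetical sketch of a batch-evaluation pipeline with pluggable models and
# benchmarks. Names below are illustrative assumptions, not SciEvalKit's API.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Sample:
    question: str
    answer: str  # gold answer used for exact-match scoring


# Registries for user-contributed benchmarks and model adapters.
BENCHMARKS: Dict[str, List[Sample]] = {}
MODELS: Dict[str, Callable[[str], str]] = {}


def register_benchmark(name: str, samples: List[Sample]) -> None:
    BENCHMARKS[name] = samples


def register_model(name: str, predict_fn: Callable[[str], str]) -> None:
    MODELS[name] = predict_fn


def run_batch(model_names: List[str], benchmark_names: List[str]) -> Dict[Tuple[str, str], float]:
    """Evaluate every (model, benchmark) pair and return exact-match accuracy."""
    results: Dict[Tuple[str, str], float] = {}
    for m in model_names:
        predict = MODELS[m]
        for b in benchmark_names:
            samples = BENCHMARKS[b]
            correct = sum(predict(s.question).strip() == s.answer.strip() for s in samples)
            results[(m, b)] = correct / len(samples)
    return results


# Toy usage: a two-item chemistry benchmark and a trivial lookup "model".
register_benchmark("toy_chem", [
    Sample("Chemical symbol for gold?", "Au"),
    Sample("Number of protons in carbon?", "6"),
])
register_model("lookup_baseline",
               lambda q: {"Chemical symbol for gold?": "Au"}.get(q, "unknown"))

if __name__ == "__main__":
    # Expected output: {('lookup_baseline', 'toy_chem'): 0.5}
    print(run_batch(["lookup_baseline"], ["toy_chem"]))
```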
[Figure: the six supported disciplines (Physics, Chemistry, Astronomy, Earth Science, Life Science, Materials Science) arranged around SciEvalKit's seven core capabilities (Scientific Multimodal Perception, Scientific Multimodal Understanding, Scientific Multimodal Reasoning, Scientific Symbolic Reasoning, Scientific Code Generation, Scientific Hypothesis Generation, Scientific Knowledge Understanding), linked to research-workflow stages such as Literature Review, Experimental Design, Data Processing, Result Interpretation, Idea Generation, and Paper Writing.]
Figure 1 | Overview of the SciEvalKit scientific intelligence evaluation framework.
∗SciEvalKit contributors can join the author list of this report based on their contributions to the repository. Specifically, joining requires three major contributions (implementing a new benchmark, adding a foundation model, or contributing a major feature). We will update the report quarterly, and an additional section detailing each developer's contribution will be appended in the next update.
Contents
1 Introduction
2 Benchmark Suite
2.1 Core Competencies Taxonomy of Scientific Intelligence
2.2 Scientific Discipline Coverage
2.3 Expert-Aligned Benchmark Construction
2.3.1 Principles of Expert-Aligned Benchmark Design
2.3.2 Benchmark Overview
3 Evaluation Framework
3.1 Abstraction Layer
3.2 Unified interface for prompt construction and prediction
3.3 Capability-Oriented Evaluation
3.4 Evaluation Modes
4 Evaluation Results
5 Conclusion and Discussion
References
A Appendix
A.1 Authors
A.2 Full Evaluation Results Across Core Benchmarks
B Benchmark Description
C Representative Task Cases
C.1 MaScQA
C.2 Chembench
C.3 SciCode
C.4 PHYSICS
C.5 CMPhysBench
C.6 ClimaQA
C.7 EarthSE
C.8 ProteinLMBench
C.9 TRQA
C.10 ResearchBench
C.11 MSEarth
C.12 AstroVisBench
C.13 SLAKE
C.14 SFE
1. Introduction
Advances in large language models (LLMs) have demonstrated remarkable general-purpose reasoning [1, 2, 3, 4] and broad knowledge retrieval [5, 6, 7]. Recently, researchers have become increasingly interested in probing whether these models exhibit key facets of scientific intelligence such as conceptual understanding [8, 9, 10, 11], symbolic reasoning [12, 13, 14], and hypothesis-driven exploration [15, 16, 17, 18]. Despite encouraging progress on individual benchmarks [19, 20, 21], current evaluations largely focus on surface-level correctness or narrow task-specific metrics, and therefore fail to assess whether LLMs can truly operate across the full spectrum of scientific reasoning. Real-world scientific problem solving fundamentally differs from generic reasoning: it requires conceptual abstraction, symbolic manipulation, hypothesis formation, multi-step procedural thinking, and the ability to interpret structured visual representations such as chemical diagrams [22, 23] and protein structures [24, 25]. Yet existing benchmarks neither capture this holistic view nor systematically evaluate these capabilities across scientific disciplines, modalities, and cognitive dimensions.
From a cognitive perspective, scientific reasoning is inherently structural, relational, and multi-
representational. The famous DSRP [26] Theory which represents Distinctions, Sys