SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Reading time: 5 minutes
...
📝 Original Info
Title: SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
ArXiv ID: 2512.22334
Date: 2025-12-26
Authors: Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai
📝 Abstract
We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
💡 Deep Analysis
📄 Full Content
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Shanghai Artificial Intelligence Laboratory and Community Contributors∗
Abstract
We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
Page https://opencompass.org.cn/Intern-Discovery-Eval/rank
Code https://github.com/InternScience/SciEvalKit
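To make the batch-evaluation workflow described in the abstract concrete, the sketch below shows what evaluating models against benchmarks through a pluggable registry could look like. This is an illustrative toy, not SciEvalKit's actual API: the registry functions, the Sample dataclass, and the exact-match metric are assumptions chosen for brevity; the toolkit's real interfaces live in the repository linked above.

```python
# Hypothetical sketch of a batch-evaluation pipeline with pluggable models and
# benchmarks. Names below are illustrative assumptions, not SciEvalKit's API.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Sample:
    question: str
    answer: str  # gold answer used for exact-match scoring


# Registries for user-contributed benchmarks and model adapters.
BENCHMARKS: Dict[str, List[Sample]] = {}
MODELS: Dict[str, Callable[[str], str]] = {}


def register_benchmark(name: str, samples: List[Sample]) -> None:
    BENCHMARKS[name] = samples


def register_model(name: str, predict_fn: Callable[[str], str]) -> None:
    MODELS[name] = predict_fn


def run_batch(model_names: List[str], benchmark_names: List[str]) -> Dict[Tuple[str, str], float]:
    """Evaluate every (model, benchmark) pair and return exact-match accuracy."""
    results: Dict[Tuple[str, str], float] = {}
    for m in model_names:
        predict = MODELS[m]
        for b in benchmark_names:
            samples = BENCHMARKS[b]
            correct = sum(predict(s.question).strip() == s.answer.strip() for s in samples)
            results[(m, b)] = correct / len(samples)
    return results


# Toy usage: a two-item chemistry benchmark and a trivial lookup "model".
register_benchmark("toy_chem", [
    Sample("Chemical symbol for gold?", "Au"),
    Sample("Number of protons in carbon?", "6"),
])
register_model("lookup_baseline",
               lambda q: {"Chemical symbol for gold?": "Au"}.get(q, "unknown"))

if __name__ == "__main__":
    # Expected output: {('lookup_baseline', 'toy_chem'): 0.5}
    print(run_batch(["lookup_baseline"], ["toy_chem"]))
```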
[Figure: the six supported disciplines (Physics, Chemistry, Astronomy, Earth Science, Life Science, Materials Science) arranged around SciEvalKit's seven core capabilities (Scientific Multimodal Perception, Scientific Multimodal Understanding, Scientific Multimodal Reasoning, Scientific Symbolic Reasoning, Scientific Code Generation, Scientific Hypothesis Generation, Scientific Knowledge Understanding), linked to research-workflow stages such as Literature Review, Experimental Design, Data Processing, Result Interpretation, Idea Generation, and Paper Writing.]
Figure 1 | Overview of the SciEvalKit scientific intelligence evaluation framework.
∗SciEvalKit contributors can join the author list of this report based on their contributions to the repository. Specifically, joining requires three major contributions (implementing a new benchmark, adding a foundation model, or contributing a major feature). We will update the report quarterly, and an additional section detailing each developer's contribution will be appended in the next update.
Contents
1 Introduction
2 Benchmark Suite
2.1 Core Competencies Taxonomy of Scientific Intelligence
2.2 Scientific Discipline Coverage
2.3 Expert-Aligned Benchmark Construction
2.3.1 Principles of Expert-Aligned Benchmark Design
2.3.2 Benchmark Overview
3 Evaluation Framework
3.1 Abstraction Layer
3.2 Unified interface for prompt construction and prediction
3.3 Capability-Oriented Evaluation
3.4 Evaluation Modes
4 Evaluation Results
5 Conclusion and Discussion
References
A Appendix
A.1 Authors
A.2 Full Evaluation Results Across Core Benchmarks
B Benchmark Description
C Representative Task Cases
C.1 MaScQA
C.2 Chembench
C.3 SciCode
C.4 PHYSICS
C.5 CMPhysBench
C.6 ClimaQA
C.7 EarthSE
C.8 ProteinLMBench
C.9 TRQA
C.10 ResearchBench
C.11 MSEarth
C.12 AstroVisBench
C.13 SLAKE
C.14 SFE
1. Introduction
Advances in large language models (LLMs) have demonstrated remarkable general-purpose reasoning [1, 2, 3, 4] and broad knowledge retrieval [5, 6, 7]. Recently, researchers have become increasingly interested in probing whether these models exhibit key facets of scientific intelligence such as conceptual understanding [8, 9, 10, 11], symbolic reasoning [12, 13, 14], and hypothesis-driven exploration [15, 16, 17, 18]. Despite encouraging progress on individual benchmarks [19, 20, 21], current evaluations largely focus on surface-level correctness or narrow task-specific metrics, and therefore fail to assess whether LLMs can truly operate across the full spectrum of scientific reasoning. Real-world scientific problem solving fundamentally differs from generic reasoning: it requires conceptual abstraction, symbolic manipulation, hypothesis formation, multi-step procedural thinking, and the ability to interpret structured visual representations such as chemical diagrams [22, 23] and protein structures [24, 25]. Yet existing benchmarks neither capture this holistic view nor systematically evaluate these capabilities across scientific disciplines, modalities, and cognitive dimensions.
From a cognitive perspective, scientific reasoning is inherently structural, relational, and multi-
representational. The famous DSRP [26] Theory which represents Distinctions, Sys