HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery
📝 Original Info
- Title: HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery
- ArXiv ID: 2512.22899
- Date: 2025-12-28
- Authors: Yaping Zhang, Qixuan Zhang, Xingquan Zhang, Zhiyuan Chen, Wenwen Zhuang, Yupu Liang, Lu Xiang, Yang Zhao, Jiajun Zhang, Yu Zhou, Chengqing Zong
📝 Abstract
The rapid advancement of large language models (LLMs) and multimodal foundation models has sparked growing interest in their potential for scientific research. However, scientific intelligence encompasses a broad spectrum of abilities ranging from understanding fundamental knowledge to conducting creative discovery, and existing benchmarks remain fragmented. Most focus on narrow tasks and fail to reflect the hierarchical and multi-disciplinary nature of real scientific inquiry. We introduce HiSciBench, a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow: Scientific Literacy (L1), Literature Parsing (L2), Literature-based Question Answering (L3), Literature Review Generation (L4), and Scientific Discovery (L5). HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines, including mathematics, physics, chemistry, biology, geography, and astronomy, and supports multimodal inputs including text, equations, figures, and tables, as well as cross-lingual evaluation. Unlike prior benchmarks that assess isolated abilities, HiSciBench provides an integrated, dependency-aware framework that enables detailed diagnosis of model capabilities across different stages of scientific reasoning. Comprehensive evaluations of leading models, including GPT-5, DeepSeek-R1, and several multimodal systems, reveal substantial performance gaps: while models achieve up to 69% accuracy on basic literacy tasks, performance declines sharply to 25% on discovery-level challenges. HiSciBench establishes a new standard for evaluating scientific intelligence and offers actionable insights for developing models that are not only more capable but also more reliable. The benchmark will be publicly released to facilitate future research.
📄 Full Content
Fig. 1. Overview of HiSciBench, a hierarchical benchmark for evaluating scientific intelligence in large language models. It covers six scientific disciplines, including mathematics, physics, chemistry, biology, geography, and astronomy, shown on the left, and five progressively structured tasks (L1-L5) on the right, encompassing Scientific Literacy, Literature Parsing, Literature QA, Review Generation, and Scientific Discovery. The hierarchy mirrors the full scientific workflow, progressing from L1 factual understanding and L2 literature parsing, to L3 contextual reasoning, L4 integrative synthesis, and finally L5 creative discovery.
We identify two primary gaps in current evaluation methods. First, existing benchmarks are often fragmented and narrow in scope. Most focus on isolated skills, such as factual recall [3], visual reasoning [4], or specific problem-solving in chemistry or physics [5], [6]. Even comprehensive efforts like SciEval [7] tend to treat scientific ability as a flat set of independent tasks. In reality, scientific inquiry is hierarchical. According to Bloom’s Taxonomy [8], models often perform well on lower-order tasks like remembering facts but struggle with higher-order tasks like synthesizing information or creating new hypotheses. Currently, there is no unified benchmark that evaluates LLMs across the sequential stages of the actual research process.
Second, existing evaluations lack sufficient multimodal and multilingual coverage. Scientific research rarely relies on text alone; it integrates equations, figures, tables, and data across multiple languages. While some datasets have introduced multimodal elements [4], [9], they often focus on simple image recognition rather than the complex integration of cross-modal information found in scientific papers. Furthermore, most benchmarks are English-centric, failing to account for the global nature of modern science. As a result, they do not fully reflect the diverse inputs that researchers handle daily.
To address these limitations, we introduce HiSciBench, a Hierarchical multi-disciplinary Benchmark designed to evaluate scientific intelligence from reading to discovery. HiSciBench includes 8,735 curated instances across six core disciplines: mathematics, physics, chemistry, biology, geography, and astronomy.
Unlike previous “flat” evaluations, HiSciBench is organized into five levels (L1-L5) that follow the logical progression of a scientific workflow:
• L1: Scientific Literacy (factual knowledge and concepts).
• L2: Literature Parsing (multimodal document extraction and translation).
• L3: Literature QA (deep comprehension of specific papers).
• L4: Review Generation (synthesizing information from multiple sources).
• L5: Scientific Discovery (data-driven exploration and hypothesis generation).
By structuring the benchmark this way, we can pinpoint exactly where models fail. Our testing shows that while state-of-the-art models perform well on foundational tasks, their performance drops significantly in the synthesis (L4) and discovery (L5) stages. This suggests that current AI is more capable of acting as a knowledge repository than as an active research assistant.
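For illustration only, the hierarchy can be made explicit as a simple data schema. The sketch below is hypothetical: the class and field names are not taken from the released benchmark and merely show how an instance might carry its level, discipline, modality, and language tags.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Five levels mirroring the scientific workflow (L1-L5)."""
    L1_SCIENTIFIC_LITERACY = 1
    L2_LITERATURE_PARSING = 2
    L3_LITERATURE_QA = 3
    L4_REVIEW_GENERATION = 4
    L5_SCIENTIFIC_DISCOVERY = 5


@dataclass
class BenchmarkInstance:
    """Hypothetical schema for a single evaluation instance."""
    level: Level
    discipline: str        # e.g. "physics", "biology"
    modality: str          # e.g. "text", "image-text", "structured"
    language: str          # e.g. "en", or "zh" for cross-lingual tasks
    question: str
    reference_answer: str


# Example: a cross-lingual literature QA item (L3)
item = BenchmarkInstance(
    level=Level.L3_LITERATURE_QA,
    discipline="chemistry",
    modality="image-text",
    language="zh",
    question="...",
    reference_answer="...",
)
```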
The main contributions of this work are as follows:
• We propose a hierarchical evaluation framework that moves beyond isolated tasks to reflect the actual multi-stage workflow of scientific research.
• We provide a multimodal and multilingual dataset of 8,735 instances across six disciplines, using authentic scientific literature to ensure high difficulty and realism.
• We conduct an extensive evaluation of leading LLMs, identifying key weaknesses in high-level reasoning and synthesis that provide a roadmap for future AI for Science development.
Early benchmarks like MMLU [1] and its STEM subset evaluate scientific knowledge through multiple-choice questions but do not probe deeper reasoning. ScienceQA [2] introduces multimodal science questions with visual contexts, though it is limited to elementary-level reasoning. SciBench [30] addresses college-level problem-solving with 692 physics, chemistry, and mathematics problems, while GPQA [11] provides 448 graduate-level questions requiring expert knowledge.
Recent large-scale efforts include MMMU [9] with 11,500 multimodal college-level questions across six disciplines, SciKnowEval [12] with 28,392 questions, and SuperGPQA [13] with 26,529 expert-level questions. Domain-specific benchmarks such as MathVista [4], ChemBench [6], and BioASQ [14] provide deep evaluation within particular fields. However, these benchmarks primarily focus on comprehension and reasoning (L3), overlooking higher-order cognitive processes such as synthesis (L4) and innovation (L5).
Several benchmarks evaluate scientific literature understanding. SciRepEval [16] and PaperQA [17] assess paper classification and question answering. SciAssess [18] provides 10,000 instances across four domains with partial synthesis evaluation, while M3SciQA [19] introduces multi-document questions with visual reasoning. ScholarBench [20] evaluates scholarly understanding with 5,200 instances, and ArXivQA [31] offers 100,000 multimodal questions from arXiv papers.
For code generation and scientific discovery, DS-1000 [25] evaluates data science coding, SciCode [27] targets scientific programming across six disciplines, and ScienceAgentBench [29] assesses autonomous agents with 102 tasks. While valuable, these benchmarks evaluate isolated capabilities rather than the complete research workflow, and lack systematic integration across cognitive levels.
Overall, existing benchmarks exhibit three fundamental limitations: (1) fragmented evaluation of isolated skills without complete workflow coverage, (2) limited multimodal and multilingual support, and (3) disconnection between agent-based evaluation and cognitive hierarchy. As shown in Table II, no benchmark fully supports all five cognitive levels from perception to innovation.
In this section, we present the comprehensive design of HiSciBench, a hierarchical benchmark system that systematically evaluates multimodal large language models across the full spectrum of scientific research capabilities. Our design philosophy centers on simulating the cognitive progression inherent in human scientific inquiry, spanning from foundational knowledge acquisition to advanced research innovation.
Figure 2 and Table III provide a comprehensive overview of the HiSciBench dataset. HiSciBench spans five different levels (L1-L5) across six scientific disciplines, covering diverse modalities such as text, image-text, and structured data. This hierarchical design enables systematic evaluation of models’ scientific reasoning, from factual recall to data-driven discovery.
As shown in Figure 2(a), Level 3 tasks (Monolingual and Cross-lingual Literature QA) dominate the dataset, accounting for 63% of all instances. This reflects HiSciBench’s focus on reasoning within real-world scientific literature. In terms of disciplinary coverage (Figure 2(b)), Biology (26.6%) and Physics (26.4%) constitute the largest portions, followed by Mathematics and Chemistry, ensuring balanced representation across major scientific domains. The modality analysis in Figure 2(c) shows that most tasks (84.7%) involve structured or image-text inputs, underscoring the benchmark’s emphasis on multimodal scientific comprehension. Table III details the scale and data sources for each task level. Lower-level tasks (L1-L2) focus on scientific literacy and information extraction, while higher levels (L3-L5) involve reasoning, synthesis, and discovery. Data are collected from a combination of public datasets (e.g., SuperGPQA, ScienceAgentBench) and self-curated scientific materials (e.g., arXiv and Nature papers), ensuring both reliability and diversity. Overall, HiSciBench includes 8,735 instances, supporting comprehensive yet interpretable evaluation of scientific intelligence across multiple cognitive and modality dimensions.
The architecture of HiSciBench is grounded in the recognition that scientific research expertise develops through distinct cognitive stages. Rather than treating scientific capability as a monolithic skill, we decompose it into a hierarchical framework that mirrors the natural progression of researchers: from acquiring basic scientific literacy, through mastering literature comprehension and analysis, to ultimately conducting original research and making novel discoveries.
Our benchmark encompasses five progressive levels, each representing a critical stage in the research lifecycle: Level 1 (L1) assesses fundamental scientific literacy; Level 2 (L2) evaluates information extraction from scientific literature; Level 3 (L3) tests comprehension and reasoning over multimodal research content; Level 4 (L4) examines synthesis and innovation through literature review generation; and Level 5 (L5) measures practical problem-solving through data-driven scientific discovery tasks.
L1: Scientific Literacy.
The foundation of scientific research capability rests upon a broad understanding of fundamental scientific concepts. Level 1 evaluates models’ grasp of core knowledge across major scientific disciplines through carefully curated question-answering tasks.
Task Description: Models are presented with multiple-choice questions covering essential concepts in six major scientific domains: mathematics, physics, chemistry, astronomy, geography, and biology. These questions assess not merely factual recall, but conceptual understanding and the ability to apply fundamental principles to novel scenarios.
Dataset Construction: We construct the L1 dataset by systematically sampling from SuperGPQA, an established benchmark for scientific knowledge evaluation. For each of the six disciplines, we select 200 high-quality multiple-choice questions.
L2: Literature Parsing.
Task Description: Level 2 evaluates a model’s ability to extract, interpret, and translate information from multimodal scientific documents that contain dense combinations of text, equations, tables, and figures. It reflects the model’s competence in acquiring information from real-world research materials through two subtasks. L2.1 Scientific Document Parsing focuses on recognizing and reconstructing multimodal content from scientific pages, requiring accurate optical character recognition (OCR) and layout understanding to handle complex mathematical formulas, chemical structures, and specialized diagrams. L2.2 Cross-lingual Scientific Translation extends this by testing models’ ability to translate scientific texts across languages while preserving technical semantics, symbolic accuracy, and structural alignment, thereby supporting multilingual accessibility and global scientific communication.
Dataset Construction: For L2.1, we construct the dataset using LaTeX source files and their corresponding PDF documents collected from arXiv across multiple scientific disciplines. Each LaTeX source is converted to HTML and then to Markdown to preserve structural fidelity. The Markdown files are segmented according to page boundaries in the corresponding PDFs, and each page is rasterized into an image. This process yields a paired dataset of 629 samples encompassing mathematics (208), physics (357), astronomy (19), and other disciplines.
L3: Literature Question Answering.
Task Description: Level 3 evaluates comprehension and reasoning over scientific papers through two subtasks: L3.1 Monolingual Literature QA and L3.2 Cross-lingual Literature QA, the latter requiring models to transfer across languages when queries and source materials differ linguistically. Together, these tasks assess whether models can connect heterogeneous modalities and maintain semantic consistency across linguistic contexts.
Dataset Construction: For L3.1, we construct a large-scale dataset by aggregating question-answer pairs from established benchmarks such as ScholarChemQA, ArXivQA, and PubMedQA, and augmenting them with additional samples derived from scientific papers collected via DOIs and official publication sources. Expert annotators define sub-domains for each discipline and extract the ten most frequent keywords from abstracts to map QA pairs to the most relevant scientific domain, ensuring precise annotation. The resulting dataset includes 5,514 QA pairs covering mathematics (821), physics (1,025), and other disciplines, each grounded in specific sections or figures of the source papers. For L3.2, we extend L3.1 to multilingual settings by leveraging the translated documents from Level 2. Using Qwen2.5-VL-72B, we automatically generate cross-lingual QA pairs, followed by manual verification of 20% of the data to ensure factual and linguistic accuracy. This setup enables evaluation of models’ robustness in scientific reasoning under multilingual variation and cross-lingual information transfer.
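For illustration, the minimal sketch below shows how such cross-lingual QA pairs could be generated and sampled for manual review, assuming Qwen2.5-VL-72B is served behind an OpenAI-compatible endpoint. The endpoint URL, model identifier string, and prompt wording are illustrative placeholders, not the exact pipeline used.

```python
import base64
import json
import random

from openai import OpenAI  # assumes an OpenAI-compatible server hosting Qwen2.5-VL-72B

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint


def generate_crosslingual_qa(image_path: str) -> dict:
    """Ask the VLM for one Chinese question/answer pair grounded in an English page image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="Qwen2.5-VL-72B-Instruct",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text",
                 "text": "Based on this English paper page, write one question in Chinese "
                         "and its answer, as JSON with keys 'question' and 'answer'."},
            ],
        }],
    )
    # A production pipeline would validate the JSON; this sketch assumes a well-formed reply.
    return json.loads(resp.choices[0].message.content)


def sample_for_review(items: list, ratio: float = 0.2) -> list:
    """Draw the 20% subset that goes to manual verification, as described above."""
    k = max(1, int(len(items) * ratio))
    return random.sample(items, k)
```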
L4: Literature Review Generation.
The synthesis of existing research into coherent literature reviews represents a cornerstone of scientific writing, requiring both comprehensive understanding of a research area and the ability to identify patterns, gaps, and future directions. Level 4 evaluates models’ capacity for producing high-quality scientific reviews.
Task Description: Given a research topic specified through keywords and a set of core papers, models must generate comprehensive literature reviews that synthesize current knowledge, identify research trends, critically analyze methodologies, and highlight open questions. This task demands not only comprehension of individual papers but sophisticated integration of ideas across multiple sources.
Dataset Construction: We compile review topics spanning six disciplines: mathematics, physics, chemistry, astronomy, geography and biology. For each discipline, we identify 10 recently published review paper titles from Nature and arXiv, ensuring coverage of diverse subfields and current research frontiers.
L5: Scientific Discovery.
The ultimate manifestation of scientific research capability lies in conducting original investigations and solving novel problems through data analysis and computational methods. Level 5 evaluates models’ ability to translate scientific questions into executable solutions.
Task Description: Models are presented with scientific research problems requiring data-driven analysis, each accompanied by relevant datasets and domain-specific background knowledge. The task requires formulating appropriate computational approaches, implementing them in Python code, executing analyses, and interpreting results, mirroring the complete workflow of computational scientific research.
Dataset Construction: We adopt the ScienceAgentBench dataset, which provides carefully curated problems from four computation-intensive scientific disciplines. The dataset encompasses 67 distinct tasks, each comprising a problem description, associated data files, relevant domain knowledge, and expert-annotated reference solutions. Tasks span diverse analytical paradigms including statistical inference, numerical simulation, data visualization, and machine learning applications in scientific contexts.
A. Experimental Setup
1) Evaluated Models: We evaluate a comprehensive collection of 18 state-of-the-art large language models spanning multiple categories: closed-source frontier models, open-source foundation models, vision-language architectures, and specialized research-oriented systems. This diverse model selection enables systematic assessment across the scientific literacy hierarchy, from basic knowledge understanding (L1) to advanced research capabilities (L5).
Closed-source Multimodal Models. We include GPT-5 [32], a frontier multimodal foundation model supporting both text and visual inputs. GPT-5 serves as the primary reference for state-of-the-art performance across all levels of HiSciBench, participating in both text-input and vision-language evaluation tracks where applicable.
Open-source Text-only Models. Our open-source text-only cohort includes nine models covering different parameter scales: (i) large-scale models (>60B): Llama-3.1-Instruct-70B [33], DeepSeek-v3 [34], and DeepSeek-R1 [35]; (ii) mid-scale models (32B): Qwen3-32B [36], QWQ-32B [37], and DeepSeek-R1-Distill-Qwen-32B (DeepSeek-R1-Distill-32B) [35]. These models are evaluated on text-centric tasks, including L1 (Scientific Literacy), L2.2 (Cross-lingual Translation with text input), and text-based L3 (Literature QA with extracted content).
Open-source Vision-Language Models. We further evaluate five multimodal architectures for document-centric scientific tasks requiring visual understanding: (i) the Qwen-VL family: Qwen3-VL-8B and Qwen2.5-VL-7B [38]; (ii) the InternVL family: Intern-VL3-8B and Intern-VL3.5-8B [39]; and (iii) GLM-4.5V [40]. These models handle L2.1 (OCR and Document Parsing), L2.2 (Translation from visual inputs), and L3 (Literature QA with PDF-based inputs), enabling evaluation of visual-textual grounding and multimodal reasoning in scientific contexts.
Research-oriented Specialized Models. Finally, we include four systems explicitly designed for scientific research workflows: (i) Tongyi-DeepResearch [41], a deep reasoning system optimized for multi-step research problem solving; (ii) the S1 family (Orion Star), research-oriented large language models including S1-Base-Pro-32B (general scientific reasoning) and S1-Literature (literature comprehension and synthesis); and (iii) SurveyX, a specialized system for automated literature review generation. These systems primarily target higher-level reasoning tasks, L4 (Literature Review Generation) and L5 (Scientific Discovery), representing the current frontier of AI-assisted scientific research and synthesis.
2) Evaluation Metrics: To ensure comprehensive and fair assessment across our hierarchical task structure, we employ evaluation metrics specifically tailored to each level’s cognitive demands and output characteristics.
L1: Scientific Literacy. For multiple-choice questions testing fundamental knowledge, we employ classification accuracy as the primary metric. This straightforward measure effectively captures whether models possess the requisite scientific foundations.
L2: Literature Parsing. Document parsing and translation tasks necessitate evaluating the fidelity of generated structured representations against the ground truth. We use word-level accuracy to assess OCR performance (L2.1) [42] and the BLEU score to evaluate translation quality (L2.2) [43].
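As one plausible reading of these metrics, the sketch below computes word-level accuracy as one minus the word error rate (via jiwer) and corpus-level BLEU (via sacrebleu); the exact tokenization and normalization are implementation choices not fixed by the benchmark description.

```python
import jiwer       # word error rate, commonly used for OCR-style evaluation
import sacrebleu   # corpus BLEU for translation quality


def word_accuracy(references: list[str], hypotheses: list[str]) -> float:
    """Word-level accuracy for L2.1 parsing, taken here as 1 - WER (clipped at 0)."""
    wer = jiwer.wer(references, hypotheses)
    return max(0.0, 1.0 - wer)


def bleu_score(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level BLEU (0-100) for L2.2 translation."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score


refs = ["the reaction rate doubles when temperature increases by ten kelvin"]
hyps = ["the reaction rate doubles when the temperature rises by ten kelvin"]
print(word_accuracy(refs, hyps), bleu_score(refs, hyps))
```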
L3: Literature Question Answering. Similar to L1, we evaluate QA tasks using accuracy, as questions are designed with definitive answers. This metric directly measures whether models correctly comprehend and reason over scientific literature.
L4: Literature Review. Evaluating literature review quality requires multidimensional assessment, as reviews must satisfy both content quality and citation integrity requirements. We decompose evaluation into two primary dimensions: Content Quality. Following the methodology established by SurveyX [44], we employ LLM-as-Judge evaluation across five dimensions: (1) Coverage: the comprehensiveness with which the review addresses the topic; (2) Structure: the logical organization and coherence of presentation; (3) Relevance: the degree to which content directly pertains to the topic; (4) Synthesis: the effectiveness of integrating ideas across sources; and (5) Critical Analysis: the depth of methodological critique and identification of research gaps. Each dimension is scored autonomously by advanced LLMs.
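For illustration, a minimal LLM-as-Judge harness over these five dimensions could look like the sketch below; the prompt wording and the judge_llm callable are placeholders rather than the exact evaluation setup.

```python
import json

RUBRIC_DIMENSIONS = ["Coverage", "Structure", "Relevance", "Synthesis", "Critical Analysis"]

JUDGE_PROMPT = """You are grading a literature review on the topic: {topic}.
Score each dimension from 1 to 5 and return JSON, e.g. {{"Coverage": 4, ...}}.
Dimensions: {dims}

Review to grade:
{review}
"""


def judge_review(review: str, topic: str, judge_llm) -> dict[str, int]:
    """Score one generated review; `judge_llm` is any callable str -> str (placeholder)."""
    prompt = JUDGE_PROMPT.format(topic=topic, dims=", ".join(RUBRIC_DIMENSIONS), review=review)
    raw = judge_llm(prompt)
    scores = json.loads(raw)
    # Keep only the expected dimensions so malformed judge output fails loudly.
    return {dim: int(scores[dim]) for dim in RUBRIC_DIMENSIONS}
```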
Citation Quality. Beyond content accuracy, scientific reviews must exhibit rigorous citation practices to ensure credibility and scholarly integrity. We evaluate citation quality from four complementary perspectives: (1) Verifiability, which measures whether the cited references truly exist and whether their bibliographic information is accurate and properly formatted. This dimension includes metrics such as Verifiability Rate and Metadata Accuracy. (2) Coverage and Representativeness, which captures the breadth and diversity of citations, including the total Citation Count, the number of Unique Sources, and the Source Distribution Entropy reflecting balance across publication venues. (3) Recency, which quantifies the proportion of recently published papers among all citations, indicating the review’s awareness of the latest research progress. (4) Faithfulness, which assesses whether each citation in the text accurately reflects the claims and findings of the original referenced work, ensuring that cited evidence is used in a truthful and contextually appropriate manner.
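To make these citation dimensions concrete, the sketch below computes Verifiability Rate, Recency, and Source Distribution Entropy over a list of parsed citations; the field names and the five-year recency window are illustrative assumptions rather than the benchmark's fixed definitions.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Citation:
    venue: str      # publication venue
    year: int
    verified: bool  # does the reference resolve to a real, correctly described paper?


def verifiability_rate(cites: list[Citation]) -> float:
    """Fraction of citations that exist and are correctly described."""
    return sum(c.verified for c in cites) / len(cites)


def recency(cites: list[Citation], current_year: int = 2025, window: int = 5) -> float:
    """Fraction of citations published within the last `window` years (window is assumed)."""
    return sum(c.year >= current_year - window for c in cites) / len(cites)


def source_entropy(cites: list[Citation]) -> float:
    """Shannon entropy of the venue distribution; higher means more balanced sourcing."""
    counts = Counter(c.venue for c in cites)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```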
L5: Scientific Discovery. Evaluating code-based scientific problem solving requires metrics that reflect both functional correctness and overall code quality. We adopt the Success Rate (SR) metric, which assesses whether the generated code successfully fulfills the task-specific objectives. Beyond simple execution, SR measures whether the program’s outputs meet defined success criteria, such as achieving target performance on test data, producing correct predictions, or generating visualizations that accurately represent the underlying data. If the code fails to execute or produces invalid results, the SR is set to zero. This metric provides a direct and interpretable measure of task completion.
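A simplified scorer consistent with this definition might look like the following sketch, where the task-specific meets_criteria check is a placeholder; real checkers would also inspect output files and figures rather than stdout alone.

```python
import subprocess


def success_rate(programs: list[str], checkers: list) -> float:
    """Fraction of generated programs that run to completion AND satisfy their success criteria.

    `programs[i]` is a path to a generated Python script; `checkers[i]` is a callable that
    returns True when the run satisfies the task's success criteria (placeholder).
    """
    successes = 0
    for script, meets_criteria in zip(programs, checkers):
        try:
            result = subprocess.run(
                ["python", script], capture_output=True, text=True, timeout=600
            )
        except subprocess.TimeoutExpired:
            continue                 # non-terminating code scores zero
        if result.returncode != 0:
            continue                 # execution failure scores zero
        if meets_criteria(result.stdout):
            successes += 1
    return successes / len(programs)
```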
Figure 3 presents the overall performance of representative models across HiSciBench levels (L1-L5). Each level corresponds to an increasing cognitive complexity, ranging from factual recall (L1) to scientific discovery (L5). All results are averaged over six scientific disciplines, with a score of 0 indicating that a model does not support the corresponding task.
For the S1 family, two task-specific variants are used: S1-Literature for L4.1 (Scientific Literature Review Generation), and S1-Base-Pro (32B) for all other levels (L1-L3, L5). This series is designed for fundamental scientific research, emphasizing structured reasoning, domain-specific knowledge grounding, and controllable text generation.
Across all levels, GPT-5 exhibits the strongest and most consistent performance, achieving the best or near-best scores. Overall, a notable performance gap remains between current systems and the ideal benchmark target (≥ 60%), highlighting the challenge of achieving generalized scientific intelligence across modalities and reasoning levels.
L1: Scientific Literacy Assessment.
Table IV presents performance on L1, evaluating factual recall and conceptual understanding across six scientific disciplines as the foundation of HiSciBench. GPT-5 achieves the highest average accuracy (69.17%), excelling in mathematics (84.50%) and physics (70.50%), demonstrating that large-scale multimodal pretraining effectively captures symbolic and conceptual scientific patterns. Deepseek-r1 ranks second (67.17%) with competitive chemistry and astronomy results but weaker biology and geography performance. Among 32B-scale open models, S1-Base-Pro-32B leads (59.17%), outperforming Qwen3-32B (57.85%) and substantially exceeding DeepSeek-R1-Distill-32B (38.58%), whose 28.6-point drop from the full model (67.17%) reveals that aggressive compression severely erodes factual completeness and reasoning fidelity. The retrieval-oriented Tongyi-DeepResearch (49.83%) shows limited generalization beyond its specialized focus. Performance varies considerably across disciplines: mathematics achieves the highest accuracy (65.56%), while biology remains most challenging (50.22%) due to greater linguistic and conceptual variability. General-purpose models consistently outperform specialized agents by 10.4 percentage points, emphasizing the importance of large-scale factual priors. However, even GPT-5’s 70% accuracy indicates current LLMs still fall short of human-level scientific reasoning.
These findings reveal that scientific comprehension scales with model size and multimodal pretraining, yet achieving human-comparable literacy remains an open challenge.
L2: Scientific Document Parsing. Table V and Figure 4 report results on L2: Scientific Literature Parsing and Translation, which evaluate a model’s ability to read, understand, and translate scientific documents across modalities. L2.1 examines structured document parsing from PDFs (vision-language input), while L2.2 measures cross-lingual translation accuracy under text and visual inputs.
Across both tasks, GPT-5 consistently achieves the best performance, with BLEU scores of 67.61 on document parsing and 43.29 on translation. In L2.1, GPT-5 and Qwen3-VL-8B perform comparably in physics and astronomy, suggesting strong visual parsing capabilities. By contrast, smaller models such as Intern-VL3 and VL3.5 degrade sharply (BLEU < 10), underscoring the importance of large-scale multimodal pretraining for robust text-vision alignment and formula extraction. In L2.2, GPT-5 again leads, followed by S1-base-Pro (41.28) and Deepseek-v3 (38.98). S1-base-Pro benefits from scientific text fine-tuning, achieving strong terminology preservation but lower fluency than GPT-5. Text-only models consistently outperform their vision-based counterparts, indicating that visual inputs introduce noise into linguistic reasoning.
Overall, these findings reveal two key insights: (1) Large multimodal models like GPT-5 excel in both document understanding and cross-lingual reasoning; (2) Cross-modal consistency remains a major bottleneck: current vision-language systems fail to match the semantic precision of text-only models. Bridging this gap will require more effective visual-text fusion and pretraining strategies explicitly optimized for document-level semantic preservation.
L3: Scientific Literature QA.
Table VI summarizes model performance on L3: Scientific Literature QA, which evaluates scientific reasoning and comprehension at the document level. L3.1 measures monolingual reasoning within English scientific papers, while L3.2 tests cross-lingual understanding where questions are posed in Chinese but source documents remain in English. These tasks require models to perform contextual inference, information integration, and domain-aware grounding beyond simple factual recall.
L3.1 Monolingual Literature QA. Across both text-only and vision-language settings, large-scale foundation models exhibit strong comprehension capabilities. Among text-based systems, Deepseek-v3 achieves the highest accuracy (96.20%), closely followed by Deepseek-r1 (93.43%) and S1-Base-Pro-32B (91.00%). These results indicate that monolingual reasoning benefits substantially from large language model pretraining and scientific corpus alignment. By contrast, smaller specialized agents such as S1-Base-8B (42.71%) or distilled variants show limited transfer ability, suggesting that scale and retained factual priors are key for document-level consistency.
In the vision-language configuration, GLM-4.5V and GPT-5 achieve the best performance (average 80.45% and 76.75%, respectively), demonstrating robust multimodal understanding across text-figure integration. However, a noticeable accuracy gap remains between fragment-based and full-text reasoning, implying that document segmentation can simplify inference by focusing attention on semantically dense regions. Performance drops sharply in smaller visual models (e.g., Intern-VL3/3.5), revealing their weak cross-modal grounding and limited ability to parse scientific figures and equations. L3.2 Cross-lingual Literature QA. This task evaluates whether models can align multilingual semantics with scientific context. GPT-5 again achieves the best results in both text (63.00%) and vision-language (86.28%) settings, confirming strong cross-lingual reasoning and terminology transfer. Deepseek-r1 (67.53%) also performs competitively, highlighting that bilingual fine-tuning and instruction-level alignment can narrow the gap with general-purpose multilingual models. Smaller multimodal models (Intern-VL3/3.5) struggle significantly, often failing to map cross-language entity correspondences.
In summary, L3 tasks demonstrate that scientific comprehension requires deep integration of factual memory, contextual inference, and multilingual grounding. While frontier models like GPT-5 and Deepseek-v3 approach human-level monolingual understanding, sustaining comparable performance in cross-lingual and multimodal contexts remains a key frontier for next-generation research models.
L4: Literature Review Generation. Table VII summarizes performance on L4, covering content and citation quality across multiple dimensions. Content quality is exceptionally high: GPT-5 achieves a near-perfect Overall Score (4.99/5.0), with Tongyi-DeepResearch closely following (4.94). However, Critical Analysis remains the weakest dimension across all models (3.97-4.93), with Deepseek models particularly struggling in both Synthesis (4.03-4.32) and Critical Analysis, indicating that generating sophisticated analytical perspectives remains challenging for current LLMs.
A striking gap emerges in citation reliability. Specialized systems significantly outperform general models: SurveyX leads with 71.4% Verifiability Rate and 45.6% Metadata Accuracy, vastly exceeding GPT-5’s 19.3% and 2.6%, respectively. General models exhibit severe citation hallucination, with verifiability rates below 20% (GPT-5: 19.3%, Deepseek-v3: 17.9%), generating non-existent or unverifiable references despite producing fluent content. These findings highlight that Retrieval-Augmented Generation architectures are crucial for grounding citations to authentic sources, whereas current general-purpose LLMs prioritize surface coherence over factual traceability.
L5: Scientific Discovery and Problem Solving. As shown in Table VIII, all models achieve relatively low success rates, indicating the difficulty of data-driven scientific reasoning. GPT-5 achieves the best overall performance (24.75%), followed by Deepseek-r1 (21.05%), while other models remain below 15%. Across disciplines, Geography yields the highest accuracy (33.33%), and Chemistry remains the most challenging (15.00%). Although Deepseek-r1 shows stronger performance in Biology and Chemistry, GPT-5 demonstrates more balanced generalization. Smaller or distilled models (e.g., Deepseek-v3, QWQ-32B) suffer severe degradation, highlighting the dependence on model scale. Overall, results suggest that current models can execute structured reasoning but still lack the inductive and hypothesis-forming abilities required for genuine scientific discovery.
Q1. Does modality affect performance? Figure 4 compares GPT-5’s cross-lingual translation performance across four scientific disciplines (Mathematics, Physics, Astronomy, and Biology) under text-only and vision-language input modalities. A consistent degradation is observed when visual information is introduced alongside text, indicating that multimodal inputs can interfere with semantic translation fidelity rather than enhance it.
Q2. Can LLMs judge their own scientific quality? Figure 5 reveals a striking disparity between self-assessed content quality and citation reliability in L4. LLM-as-Judge evaluations show models achieve exceptionally high content scores (88.8-99.8%), consistently surpassing the 80% benchmark. However, external verification exposes a severe 80% factuality gap: verifiability rates plummet to 17-22%. While models generate fluent, well-structured reviews, their references lack grounding in authentic sources, indicating systematic overestimation of scientific coherence. Notably, even specialized systems (e.g., S1-Literature: 96.4% content quality, 22.4% citation validity) exhibit this tendency, demonstrating that domain specialization alone cannot guarantee factual reliability. These findings highlight a fundamental limitation: surface-level coherence does not imply verifiable grounding in the literature.
Case analysis of L5 failures reveals two representative error patterns. In the first case (Incomplete Solution), the model correctly processes demographic data and generates visualizations but produces three separate subplots instead of a composite map, revealing deficient conceptual synthesis in integrating spatial and demographic layers. In the second case (Code Logic Error), the model attempts to count five-year warm spells but reduces the array along the latitude axis rather than the temporal dimension, yielding a zero-valued map. Though syntactically correct, the code lacks spatiotemporal reasoning consistency. These failures demonstrate that L5 errors stem not from coding mistakes but from misalignment between scientific intent and computational reasoning, exposing the gap between code generation and genuine scientific understanding.
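To illustrate the second failure mode, the following toy reconstruction (not the benchmark's actual task code) shows how reducing a (year, lat, lon) array over the latitude axis instead of the temporal axis destroys the intended per-cell map.

```python
import numpy as np

# Toy temperature-anomaly cube: 30 years on a 4 x 5 grid, shaped (year, lat, lon).
rng = np.random.default_rng(0)
temps = rng.normal(loc=0.0, scale=1.0, size=(30, 4, 5))
warm = temps > 0.5                 # boolean mask of "warm" years per grid cell

# Faulty reduction (as in the described failure): collapses latitude, not time,
# so the result has shape (30, 5) and cannot be a per-cell map of warm-year counts.
wrong = warm.sum(axis=1)

# Intended reduction: collapse the temporal axis, yielding one count per grid cell.
right = warm.sum(axis=0)           # shape (4, 5)

print(wrong.shape, right.shape)    # (30, 5) (4, 5)
```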
We introduced HiSciBench, a hierarchical benchmark for evaluating scientific intelligence in large language models (LLMs) across six disciplines and five cognitive levels (L1-L5). Built from authentic scientific literature, HiSciBench captures the full research workflow, from factual recall to hypothesis generation, enabling systematic, interpretable assessment of scientific reasoning. Experiments show that current LLMs excel at factual and linguistic understanding (L1-L2) but struggle with higher-order cognition. Performance declines markedly in multimodal and cross-lingual tasks (L2-L3), citation faithfulness in literature synthesis (L4), and scientific code reasoning (L5). These results reveal persistent weaknesses in epistemic grounding, multimodal integration, and procedural reasoning. HiSciBench thus provides a unified framework for diagnosing such limitations and guiding the development of next-generation scientific AI: models capable of reliable understanding, faithful synthesis, and verifiable discovery.