FrontierScience: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

We introduce FrontierScience, a benchmark evaluating expert-level scientific reasoning in frontier language models. Recent model progress has nearly saturated existing science benchmarks, which often rely on multiple-choice knowledge questions or already published information. FrontierScience addresses this gap through two complementary tracks: (1) Olympiad, consisting of international olympiad problems at the level of IPhO, IChO, and IBO, and (2) Research, consisting of PhD-level, open-ended problems representative of sub-tasks in scientific research. FrontierScience contains several hundred questions (including 160 in the open-sourced gold set) covering subfields across physics, chemistry, and biology, from quantum electrodynamics to synthetic organic chemistry. All Olympiad problems are originally produced by international Olympiad medalists and national team coaches to ensure standards of difficulty, originality, and factuality. All Research problems are research sub-tasks written and verified by PhD scientists (doctoral candidates, postdoctoral researchers, or professors). For Research, we introduce a granular rubric-based evaluation framework to assess model capabilities throughout the process of solving a research task, rather than judging only a standalone final answer.


💡 Research Summary

FrontierScience is introduced as a novel benchmark designed to evaluate large language models (LLMs) on expert‑level scientific reasoning, addressing the saturation of existing science benchmarks that rely heavily on multiple‑choice or fact‑recall tasks. The benchmark consists of two complementary tracks. The Olympiad track contains problems modeled after International Physics, Chemistry, and Biology Olympiads. These problems are authored from scratch by 42 former international medalists and national team coaches, ensuring high difficulty, originality, and factual correctness. Over 500 candidate questions were created, filtered through multiple rounds of peer review and internal model testing, and a curated gold set of 100 Olympiad items is publicly released. Each problem provides all necessary variables, units, and a single numeric, algebraic, or fuzzy‑string answer, enabling fully automated evaluation via equivalence checking.
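The fully automated Olympiad grading described above can be sketched as follows. This is an illustrative checker, not the paper's implementation: the function names and the 1% tolerance are assumptions, and true algebraic equivalence checking would additionally require a computer-algebra system.

```python
import math
import re

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace for fuzzy-string matching."""
    return re.sub(r"\s+", " ", text.strip().lower())

def answers_match(predicted: str, gold: str, rel_tol: float = 1e-2) -> bool:
    """Return True if a model answer matches the gold answer.

    Numeric answers are compared with a relative tolerance; non-numeric
    answers fall back to normalized string equality.
    """
    try:
        return math.isclose(float(predicted), float(gold), rel_tol=rel_tol)
    except ValueError:
        return normalize(predicted) == normalize(gold)

# Examples:
answers_match("3.1416", "3.14")              # numeric, within 1% -> True
answers_match(" Glycolysis ", "glycolysis")  # fuzzy string -> True
```

Because every Olympiad problem supplies all variables and units and admits a single answer, a checker of this shape suffices; no model-based judge is needed on this track.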

The Research track targets PhD‑level, open‑ended sub‑tasks that a researcher might encounter during a real project. Forty‑five qualified scientists (post‑docs, professors, or doctoral candidates) authored more than 200 such tasks, each expected to require at least three to five hours of work. A granular 10‑point rubric accompanies every task, breaking down evaluation into concrete, objectively assessable items such as “derives the correct intermediate equation,” “states the hypothesis clearly,” and “interprets the result appropriately.” A gold set of 60 research items (20 per discipline) is released. Because open‑ended answers cannot be judged by simple string matching, the authors employ a model‑based judge: GPT‑5 run at “high reasoning effort” receives the model’s answer and the rubric and returns a numeric score. Prompt templates for the judge are provided in the appendix.
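The 10-point rubric structure can be sketched as a simple data model. The item descriptions below are taken from the summary; the point weights and function names are hypothetical, and in the actual pipeline the per-item judgments come from the GPT-5 judge rather than a boolean list.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str   # a concrete, objectively assessable criterion
    points: int        # weight within the 10-point rubric (assumed split)

def score_response(items: list[RubricItem], satisfied: list[bool]) -> int:
    """Sum the points for rubric items the judge marked as satisfied."""
    return sum(item.points for item, ok in zip(items, satisfied) if ok)

rubric = [
    RubricItem("states the hypothesis clearly", 2),
    RubricItem("derives the correct intermediate equation", 4),
    RubricItem("interprets the result appropriately", 4),
]
score = score_response(rubric, [True, True, False])  # -> 6
passed = score >= 7  # pass threshold used for Research-track accuracy
```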

The authors evaluate nine frontier models, including GPT‑4o, GPT‑5, GPT‑5.1, GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, Grok 4, and OpenAI‑o3, on both tracks. All models are run at a "high" reasoning-effort setting, except GPT‑5, which is run at "x‑high." On the Olympiad set, the best model (GPT‑5.2) achieves 77% accuracy, with chemistry problems being the easiest (≈73%) and biology the hardest. On the Research set, performance is markedly lower: GPT‑5.2 scores 25% (counting a task as passed at ≥7 rubric points), while most other models hover around 15–22%. Notably, GPT‑5 outperforms its successor GPT‑5.1 on research tasks, and Gemini 3 Pro matches GPT‑5.2 on Olympiad but lags on research. Error analysis reveals four dominant failure modes: logical, step‑wise reasoning errors; misunderstanding of niche domain concepts; arithmetic or algebraic miscalculations; and factual inaccuracies.
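The ≥7-point pass threshold translates directly into the Research-track accuracy metric; a minimal sketch, assuming a list of per-task rubric scores (the function name is an assumption, not from the paper):

```python
def research_accuracy(scores: list[int], threshold: int = 7) -> float:
    """Fraction of research tasks whose rubric score meets the pass threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# A model scoring [9, 6, 7, 3] on four tasks passes two of them:
research_accuracy([9, 6, 7, 3])  # -> 0.5
```

Note that a thresholded metric of this kind discards partial credit: a model scoring 6/10 on every task reports 0% accuracy, which is one reason the raw rubric scores are worth inspecting alongside the headline number.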

The discussion acknowledges several limitations. First, the Research track still presents a constrained problem statement, limiting assessment of genuine hypothesis generation or novel research direction formulation. Second, rubric‑based grading, while more nuanced than single‑answer scoring, depends on the reliability of the model judge and may diverge from human expert grading. Third, the benchmark is text‑only; real scientific work often involves images, spectra, code execution, or wet‑lab interactions that are not captured. Fourth, no human baseline is provided, making it difficult to contextualize model scores relative to expert performance.

Despite these caveats, FrontierScience represents a significant step forward: it offers a large, expert‑verified dataset of original questions spanning physics, chemistry, and biology; it introduces a systematic rubric‑based evaluation for open‑ended scientific tasks; and it quantifies a clear performance gap between state‑of‑the‑art LLMs and true expert capability, especially on research‑style problems. The authors suggest future work on expanding multimodal inputs, refining rubric reliability, and establishing human baselines to further solidify the benchmark's role in guiding AI development toward genuine scientific discovery.

