FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either rely heavily on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that provide coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents with frontier LLM backbones such as gpt-5 on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success (<50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.


💡 Research Summary

FIRE‑Bench (Full‑cycle Insight Rediscovery Evaluation) is introduced as a rigorous benchmark for assessing autonomous research agents powered by large language models (LLMs). Unlike existing benchmarks that either require agents to generate full papers judged by other LLMs (subjective and hard to scale) or focus on a single performance metric (e.g., leaderboard score), FIRE‑Bench asks agents to “rediscover” a verified empirical finding from a recent high‑impact machine‑learning paper.

Benchmark construction
The authors curated 30 empirical analysis papers on LLM behavior from top‑tier venues (ICLR, ICML, NeurIPS) published in 2024‑2025. Each paper is parsed into a hierarchical research‑problem tree: a root node (high‑level research question), intermediate nodes (sub‑questions), and leaf nodes (concrete experiments with dataset, method, and evaluation criteria). Tree extraction is automated with a gpt‑5‑based extractor and validated by human experts. For each paper, the leaf node representing the central figure/table is selected as the ground‑truth result, and its parent intermediate node becomes the task prompt. Agents receive only this high‑level question plus the scope (datasets, evaluation metrics) but no details of the original experimental design or conclusions.

Evaluation protocol
Agents must autonomously plan hypotheses, design experiments, write and run code, and finally produce a conclusion. The generated conclusion and the original paper’s conclusion are both decomposed into atomic, verifiable claims using a fixed‑prompt LLM extractor (gpt‑5.2). Claims are matched via an LLM‑based entailment classifier; precision, recall, and F1 are computed at the claim level. This claim‑centric approach mirrors RAG‑Checker and avoids reliance on subjective LLM judges.
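Once each claim pair has a binary match decision, the scoring reduces to standard precision/recall/F1 over claims. A minimal sketch follows; the matching itself is done by an LLM entailment classifier in the paper, so exact string equality here is only a stand-in.

```python
def claim_f1(agent_claims: set[str], gold_claims: set[str]) -> tuple[float, float, float]:
    """Claim-level precision/recall/F1. In FIRE-Bench the match decision comes
    from an LLM-based entailment classifier; exact match stands in for it here."""
    matched = agent_claims & gold_claims
    precision = len(matched) / len(agent_claims) if agent_claims else 0.0
    recall = len(matched) / len(gold_claims) if gold_claims else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical atomic claims decomposed from the two conclusions
agent = {"CoT improves EM on multi-hop QA", "gains shrink with model size"}
gold = {"CoT improves EM on multi-hop QA", "gains are larger for small models"}
p, r, f = claim_f1(agent, gold)  # p = 0.5, r = 0.5, f = 0.5
```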

Experiments
State‑of‑the‑art agents—including OpenHands, OpenAI Codex, and Anthropic Claude‑Code—were evaluated with frontier backbones such as gpt‑5 and Claude‑4‑Sonnet. Each agent was run 3–5 times per task. Across the 30 tasks, average F1 scores ranged from 38 to 45, with the best single run barely crossing 49. Performance exhibited high variance (standard deviation ≈ 12), indicating low reproducibility.

Error analysis
A four‑stage error taxonomy (Research Planning, Implementation, Execution, Conclusion Formation) was applied. Failures clustered in Research Planning (poor hypothesis generation, inappropriate experimental design) and Conclusion Formation (misinterpretation of results, weak scientific phrasing). Implementation and execution were comparatively stable. The authors also examined potential data contamination by correlating performance with paper publication year and task difficulty; no systematic contamination was detected, suggesting the benchmark’s design successfully isolates genuine rediscovery ability.

Significance and limitations
FIRE‑Bench demonstrates that while current LLM agents can write and run code, they still lack robust meta‑cognitive abilities required to decide what to test and how to articulate findings scientifically. The benchmark’s constraints—public datasets, modest compute (≤ 24 h on an 80 GB A100), and omission of proprietary resources—make it scalable but also limit its applicability to more complex, large‑scale scientific investigations that demand extensive model training or specialized equipment.

Future directions

  1. Strengthening planning – integrate meta‑prompt engineering, Bayesian experiment selection, and systematic hypothesis exploration to improve the Planning stage.
  2. Evidence‑based reasoning – link generated claims to statistical validation (e.g., confidence intervals, hypothesis tests) for more trustworthy conclusions.
  3. Multimodal evidence – incorporate graphs, visualizations, and execution logs alongside text to create richer proof chains.
  4. Human‑agent collaboration – explore hybrid workflows where limited expert feedback (e.g., design review) guides the autonomous agent, potentially boosting reliability.
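For direction 2, one lightweight way an agent could attach statistical evidence to a claimed metric difference is a paired bootstrap confidence interval. The sketch below uses invented per-example scores and is not prescribed by the paper; it only illustrates the kind of validation the authors call for.

```python
import random

def bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """95% bootstrap CI for the mean paired difference between two per-example
    score lists. If the interval excludes 0, the observed difference is
    unlikely to be resampling noise. Illustrative only."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]       # resample with replacement
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-example accuracies (1 = correct) for two methods
a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
low, high = bootstrap_ci(a, b)
```

A claim such as "method A outperforms B" could then be emitted only when `low > 0`, turning a qualitative conclusion into an evidence-backed one.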

In sum, FIRE‑Bench provides a concrete, process‑level yardstick for full‑cycle scientific discovery by autonomous agents. Its results reveal substantial gaps in current systems, especially in research planning and conclusion synthesis, and it offers a clear roadmap for the next generation of AI‑driven scientific assistants.

