DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original ArXiv source.

The ability to research and synthesize knowledge is central to human expertise and progress. A new class of AI systems, designed for generative research synthesis, aims to automate this process by retrieving information from the live web and producing long-form, cited reports. Yet, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short, factual answers, while expert-curated datasets risk staleness and data contamination. Neither captures the complexity and evolving nature of real research synthesis tasks. We introduce DeepScholar-bench, a live benchmark and automated evaluation framework for generative research synthesis. DeepScholar-bench draws queries and human-written exemplars from recent, high-quality ArXiv papers and evaluates a real synthesis task: generating a related work section by retrieving, synthesizing, and citing prior work. Our automated framework holistically measures performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability. To further future work, we also contribute DeepScholar-ref, a simple, open-source reference pipeline, which is implemented on the LOTUS framework and provides a strong baseline. Using DeepScholar-bench, we systematically evaluate prior open-source systems, search agents with strong models, OpenAI’s DeepResearch, and DeepScholar-ref. We find DeepScholar-bench is far from saturated: no system surpasses a geometric mean of 31% across all metrics. These results highlight both the difficulty and importance of DeepScholar-bench as a foundation for advancing AI systems capable of generative research synthesis. We make our benchmark code and data available at https://github.com/guestrin-lab/deepscholar-bench.


💡 Research Summary

DeepScholar‑Bench introduces a live, continuously updating benchmark for evaluating generative research synthesis systems—AI agents that retrieve information from the live web and produce long‑form, cited reports. Existing benchmarks focus on short factual QA or static expert‑curated datasets, which are either too narrow or become stale and risk data contamination. To address this, the authors automatically harvest recent, high‑quality arXiv papers accepted at conferences, extract their “Related Works” sections as human‑written exemplars, and use the paper abstract as the query description. A monthly data pipeline ensures that the benchmark reflects the latest research landscape while avoiding versioning issues and low‑quality papers.

The benchmark task is to generate a “Related Works” section for a given paper: the system must retrieve relevant prior work, synthesize key facts, and cite sources appropriately. Evaluation is fully automated and spans three core dimensions—knowledge synthesis, retrieval quality, and verifiability—through seven fine‑grained metrics:

  • Knowledge synthesis: (i) Organization & Coherency, measured by pairwise LLM‑as‑judge comparisons against the human exemplar, and (ii) Nugget Coverage, the proportion of essential factual “nuggets” from the exemplar that appear in the generated text.
  • Retrieval quality: (i) Relevance Rate, graded 0‑2 per document by an LLM judge following the Cranfield model; (ii) Reference Coverage, the fraction of “important” references (identified from the exemplar) that are retrieved; and (iii) Document Importance, a citation‑count‑based weight reflecting the impact of retrieved documents.
  • Verifiability: (i) Citation Precision, the percentage of cited sources that actually support the accompanying claim, and (ii) Claim Coverage, the proportion of claims fully backed by citations.
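Several of these metrics reduce to simple proportions over per-item judge labels. A minimal sketch of that reduction, assuming the LLM judge's per-item verdicts are already available as booleans or 0-2 grades (the function names here are illustrative and not taken from the benchmark's codebase):

```python
def nugget_coverage(nugget_present: list[bool]) -> float:
    """Fraction of essential exemplar nuggets found in the generated text."""
    return sum(nugget_present) / len(nugget_present) if nugget_present else 0.0

def relevance_rate(grades: list[int]) -> float:
    """Mean relevance of retrieved documents, with 0-2 grades normalized to [0, 1]."""
    return sum(g / 2 for g in grades) / len(grades) if grades else 0.0

def citation_precision(citation_supported: list[bool]) -> float:
    """Share of cited sources that actually support their accompanying claim."""
    return sum(citation_supported) / len(citation_supported) if citation_supported else 0.0
```

The same ratio pattern applies to Reference Coverage (important references retrieved) and Claim Coverage (claims fully backed by citations); only the judged items differ.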

All metrics rely on LLM‑based judgments, but the authors validate their correlation with human assessments to ensure reliability.

The authors evaluate 14 existing systems—including open‑source tools (STORM, OpenScholar), search agents powered by strong proprietary models, and OpenAI’s DeepResearch—alongside a newly released open‑source reference pipeline, DeepScholar‑ref, built on the LOTUS framework. Results show that no system exceeds a geometric mean of 31% across all metrics. DeepResearch achieves the highest Nugget Coverage (≈39%) and modest Reference Coverage (≈19%), yet lags in verifiability. DeepScholar‑ref, while competitive on most dimensions, attains a markedly higher verifiability score (over six times that of DeepResearch). Overall, performance on Nugget Coverage, Reference Coverage, and Document Importance remains below 40%, highlighting the substantial gap between current capabilities and the demands of authentic research synthesis.
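The headline number aggregates the per-metric scores with a geometric mean, which penalizes unevenness: a system that collapses on one dimension (say, verifiability) scores far worse than its arithmetic average would suggest. A short sketch of that aggregation (the benchmark's exact weighting may differ):

```python
import math

def geometric_mean(scores: list[float]) -> float:
    """Geometric mean of metric scores in [0, 1]; zero if any score is zero."""
    if not scores or any(s == 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

# A hypothetical system strong on synthesis (0.5) but weak on
# verifiability (0.05) is pulled down sharply:
geometric_mean([0.5, 0.05])   # ≈ 0.158, vs. an arithmetic mean of 0.275
```

This is why DeepScholar-ref's much higher verifiability matters even where its other scores are merely competitive.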

Key contributions are: (1) a live benchmark that automatically curates up‑to‑date research queries and human exemplars; (2) a holistic, automated evaluation framework with validated LLM‑based metrics across synthesis, retrieval, and verification; and (3) an open‑source baseline pipeline (DeepScholar‑ref) that sets a strong reference point for future work. Limitations include reliance on LLM judges (potential bias), focus solely on the “Related Works” section (excluding methods, experiments, etc.), and limited external validation of citation correctness. Future directions suggested are expanding to multi‑modal retrieval (code, datasets), incorporating human‑LLM collaborative evaluation, and building community‑driven verification mechanisms to sustain the benchmark over time.

In sum, DeepScholar‑Bench provides a rigorous, scalable platform for measuring progress toward AI systems capable of genuine, trustworthy research synthesis, and it reveals ample room for improvement across all evaluated dimensions.

