Dr. Bench: A Multidimensional Evaluation for Deep Research Agents, from Answers to Reports
Deep Research Agents (DRAs), an embodiment of the shift toward interconnected agent architectures, exhibit capabilities in task decomposition, cross-source retrieval, multi-stage reasoning, information integration, and structured output, which markedly enhance performance on complex, open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formats, and scoring mechanisms, limiting their effectiveness in assessing such agents. This paper introduces Dr. Bench, a multidimensional evaluation framework tailored to DRAs and long-form, report-style responses. The benchmark comprises 214 expert-curated, challenging tasks across 10 broad domains, each accompanied by a manually constructed reference bundle to support composite evaluation. The framework incorporates metrics for semantic quality, topical focus, and retrieval trustworthiness, enabling comprehensive evaluation of the long reports generated by DRAs. Extensive experiments confirm that mainstream DRAs outperform web-search-tool-augmented reasoning models, yet reveal considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement of DRAs.
💡 Research Summary
The paper introduces Dr. Bench, a comprehensive, multidimensional benchmark designed specifically for evaluating Deep Research Agents (DRAs) that generate long-form, report-style outputs. Existing benchmarks focus on short, discrete answers and fail to capture the full DRA pipeline: task decomposition, cross-source retrieval, multi-step reasoning, information synthesis, and structured reporting. Dr. Bench addresses this gap by curating 214 high-difficulty tasks across ten broad domains (Academia & Research, News & Current Affairs, Sports & Competitions, Common Sense & Education, Law & Politics, Business & Finance, Technology & AI, Environment & Sustainability, History & Social Sciences, Health & Medicine).
Each task is paired with a manually constructed reference bundle consisting of five modules (see the sketch after this list):
1. Query-Specific Rubrics (QSR): evaluate factual accuracy, logical validity, mechanism explanation, structured expression, source verification, and other task-specific criteria using binary or ternary scoring.
2. General-Report Rubrics (GRR): assess universal report qualities such as organization, logical clarity, depth, citation quality, originality, rigor of data analysis, and formatting consistency, for a total of 48 rubrics and 73 points.
3. Trustworthy-Source Links (TSL): vetted external URLs the agent is expected to cite.
4. Focus-Anchor Keywords (FAK): the core thematic anchors a high-quality report should emphasize.
5. Focus-Deviation Keywords (FDK): terms that flag off-topic or extraneous content.
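For illustration only, the snippet below is a minimal sketch of how one such reference bundle could be represented as a data structure; the field names, types, and the example entries are assumptions made for readability, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a Dr. Bench reference bundle; field names and the
# example entries are assumptions, not the benchmark's actual schema.

@dataclass
class RubricItem:
    criterion: str    # e.g. factual accuracy, mechanism explanation, citation quality
    max_points: int   # binary (0/1) or ternary (0/1/2) scoring

@dataclass
class ReferenceBundle:
    task_id: str
    qsr: list[RubricItem] = field(default_factory=list)  # Query-Specific Rubrics
    grr: list[RubricItem] = field(default_factory=list)  # General-Report Rubrics
    tsl: list[str] = field(default_factory=list)         # Trustworthy-Source Links (URLs)
    fak: list[str] = field(default_factory=list)         # Focus-Anchor Keywords
    fdk: list[str] = field(default_factory=list)         # Focus-Deviation Keywords

# Hypothetical example bundle:
bundle = ReferenceBundle(
    task_id="business_finance_017",
    qsr=[RubricItem("States the correct acquisition price", max_points=1)],
    grr=[RubricItem("Logical organization and clear sectioning", max_points=2)],
    tsl=["https://example.com/official-filing"],
    fak=["merger terms", "regulatory approval"],
    fdk=["unrelated product launches"],
)
```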
The evaluation framework combines these components into a composite score: IntegratedScore = Quality × (1 − SemanticDrift) × TrustBoost × FAK × FDK. “Quality” aggregates the QSR and GRR scores; “SemanticDrift” measures deviation in meaning using an LLM-based judge; “TrustBoost” rewards citation of the vetted TSL sources; and the FAK and FDK factors reward coverage of the focus-anchor keywords and penalize focus-deviation content, respectively. This formulation simultaneously captures semantic adequacy, topical focus, and source credibility, dimensions that traditional exact-match or BLEU-style metrics overlook.
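To make the scoring pipeline concrete, here is a minimal sketch of how the composite score might be computed from the bundle components. The normalizations, equal QSR/GRR weighting, and the exact forms of the TrustBoost, FAK, and FDK factors are illustrative assumptions, not the paper's implementation.

```python
def integrated_score(
    qsr_points: float, qsr_max: float,       # query-specific rubric points earned / available
    grr_points: float, grr_max: float,       # general-report rubric points earned / available
    semantic_drift: float,                   # LLM-judged meaning deviation in [0, 1]
    cited_trusted: int, total_trusted: int,  # vetted TSL links cited vs. provided
    fak_hit: int, fak_total: int,            # focus-anchor keywords covered
    fdk_hit: int, fdk_total: int,            # focus-deviation keywords present in the report
) -> float:
    """Illustrative composite score; weightings and factor forms are assumptions."""
    # Quality aggregates QSR and GRR as normalized rubric scores (assumed equal weight).
    quality = 0.5 * (qsr_points / qsr_max) + 0.5 * (grr_points / grr_max)

    # TrustBoost rewards citing the vetted trustworthy-source links (assumed linear form).
    trust_boost = 1.0 + 0.5 * (cited_trusted / total_trusted if total_trusted else 0.0)

    # FAK factor rewards topical focus; FDK factor penalizes off-topic content.
    fak = fak_hit / fak_total if fak_total else 1.0
    fdk = 1.0 - (fdk_hit / fdk_total if fdk_total else 0.0)

    # IntegratedScore = Quality × (1 − SemanticDrift) × TrustBoost × FAK × FDK
    return quality * (1.0 - semantic_drift) * trust_boost * fak * fdk
```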
Experimental evaluation covers five mainstream DRAs (Alibaba's Tongyi DeepResearch, xAI's Grok Deep Search, Perplexity's Sonar Deep Research, OpenAI's o3 Deep Research, and HKUST DeepSearch), one advanced agent model, and seven reasoning models augmented with web-search tools. Results show that DRAs consistently outperform the tool-augmented baselines, achieving 12-18% higher IntegratedScore and excelling in particular on the GRR dimensions of structure and logical coherence. However, QSR analyses reveal persistent factual inaccuracies and incomplete mechanism explanations in roughly 20% of cases, and some TSLs fail to reflect the most recent information, which reduces trust scores.
The authors argue that Dr. Bench provides a rigorous, reproducible, and extensible platform for assessing the full capabilities of DRAs, filling a critical evaluation gap. They also acknowledge limitations: the need for more reliable automated scoring models, continual updating of reference bundles to keep pace with evolving knowledge, and the addition of domain-specific rubrics for specialized tasks. Suggested future work includes meta-learning-based automatic judges, real-time source-verification pipelines, and feedback-driven rubric refinement.
In summary, Dr. Bench establishes a new standard for DRA evaluation, offering detailed, multidimensional metrics that reflect real-world research workflows and enabling more targeted development of next-generation autonomous research agents.