DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity

We present DRACO (Deep Research Accuracy, Completeness, and Objectivity), a benchmark of complex deep research tasks. These tasks, which span 10 domains and draw on information sources from 40 countries, originate from anonymized real-world usage patterns within a large-scale deep research system. Tasks are sampled from a de-identified dataset of Perplexity Deep Research requests, then filtered and augmented to ensure that the tasks are anonymized, open-ended and complex, objectively evaluable, and representative of the broad scope of real-world deep research use cases. Outputs are graded against task-specific rubrics along four dimensions: factual accuracy (accuracy), breadth and depth of analysis (including completeness), presentation quality (including objectivity), and citation quality. DRACO is publicly available at https://hf.co/datasets/perplexity-ai/draco.


💡 Research Summary

The paper introduces DRACO (Deep Research Accuracy, Completeness, and Objectivity), a new benchmark designed to evaluate deep-research language models across a wide variety of real-world tasks. Unlike prior benchmarks that rely on synthetic or narrowly scoped queries, DRACO draws its 100 test items directly from anonymized user requests made to Perplexity's Deep Research service during a two-month period in late 2025. The authors first sampled 1,000 high-difficulty queries, identified by negative user feedback such as thumbs-down ratings, and then applied a multi-stage pipeline to transform these raw requests into well-specified, privacy-preserving research tasks.
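As a rough illustration of this sampling step, the following Python sketch filters a request log by an explicit negative-feedback signal. The log schema, field names, and the sample_hard_queries helper are assumptions made here for illustration; the paper does not publish its sampling code.

```python
import random

def sample_hard_queries(log: list[dict], k: int = 1000, seed: int = 0) -> list[str]:
    """Pick up to k queries that drew explicit negative feedback.

    The 'feedback' field and 'thumbs_down' value are hypothetical; the paper
    does not specify the request-log schema.
    """
    hard = [record["query"] for record in log if record.get("feedback") == "thumbs_down"]
    random.seed(seed)
    return random.sample(hard, min(k, len(hard)))
```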

The pipeline consists of four stages (sketched in code below):

1. Automated de-identification and ambiguity reduction using a large language model, ensuring no personally identifiable information ever reaches a human reviewer.
2. Systematic augmentation along two axes, Context (persona, desired deliverable format, explicit source instructions) and Scope (temporal extension, comparative elements, geographic expansion), to increase task complexity and better reflect the multi-step planning, retrieval, and synthesis required of deep-research agents.
3. LLM-driven filtering that retains only tasks that are objective (clear success criteria), tractable (bounded scope), and genuinely challenging (requiring non-trivial information gathering and reasoning).
4. Final curation by in-house domain experts, resulting in a balanced set covering ten domains (Finance, Academic, Technology, General Knowledge, UX Design, Law, Medicine, "Needle in a Haystack", Personalized Assistant, Shopping/Product Comparison) and drawing information from 40 countries across four continents.
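A minimal sketch of how stages 1–3 could be wired together, assuming a generic call_llm client; the prompts and function names below are invented for illustration and are not the authors' actual implementation:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM endpoint the authors used."""
    raise NotImplementedError

def deidentify(query: str) -> str:
    # Stage 1: strip PII and reduce ambiguity before any human sees the task.
    return call_llm(f"Remove all personal identifiers and clarify ambiguities:\n{query}")

def augment(task: str) -> str:
    # Stage 2: expand along the Context axis (persona, deliverable format,
    # source instructions) and the Scope axis (time, comparisons, geography).
    return call_llm(
        "Add a persona, deliverable format, and source instructions; "
        f"extend the temporal, comparative, and geographic scope:\n{task}"
    )

def is_suitable(task: str) -> bool:
    # Stage 3: keep only objective, tractable, genuinely challenging tasks.
    verdict = call_llm(
        "Answer YES or NO: is this task objective, tractable, "
        f"and non-trivially challenging?\n{task}"
    )
    return verdict.strip().upper().startswith("YES")

def build_candidates(raw_queries: list[str]) -> list[str]:
    candidates = (augment(deidentify(q)) for q in raw_queries)
    # Stage 4, expert curation, happens offline with human reviewers.
    return [task for task in candidates if is_suitable(task)]
```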

Each task is paired with a domain-specific rubric created through a rigorous four-stage expert workflow, carried out by twenty-six domain specialists (including clinicians, attorneys, financial analysts, engineers, and designers) in collaboration with The LLM Data Company. An initial rubric is drafted by Expert 1 with LLM assistance, reviewed and revised by Expert 2, stress-tested by running Perplexity Deep Research on the task (if the model scores above 90%, the rubric is deemed too lenient and sent back for revision), and finally approved by an in-house domain expert and an AI specialist. The resulting rubrics contain on average 39.3 criteria per task, split across four evaluation axes: factual accuracy (≈20.5 criteria), breadth and depth of analysis (≈8.6), presentation quality (≈5.6), and citation quality (≈4.8). Criteria are weighted; for example, harmful medical misinformation can incur penalties of up to –500, while non-medical errors range from –10 to –25. Both positive (desired-property) and negative (error-type) criteria are included, with 415 negative criteria across the whole suite.
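The paper does not publish its rubric schema, but the structure described above might be represented along these lines; the field names and example criteria below are illustrative assumptions, not items from the actual benchmark.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str      # what the judge checks for in the report
    axis: str      # "accuracy" | "breadth_depth" | "presentation" | "citation"
    weight: float  # positive for desired properties, negative for error types

# Hypothetical excerpt of a Medicine-domain rubric, echoing the weighting
# scheme described above (error penalties can reach -500 in medical tasks).
rubric = [
    Criterion("States the approved dosing range for the drug", "accuracy", 10.0),
    Criterion("Compares guidelines from at least three countries", "breadth_depth", 5.0),
    Criterion("Every claim is backed by a cited source", "citation", 4.0),
    Criterion("Contains harmful medical misinformation", "accuracy", -500.0),
]
```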

Scoring is performed with an open-source LLM-as-a-judge protocol. For each criterion, the judge outputs a binary verdict (MET/UNMET) plus a brief justification. The final task score is the weighted sum over criteria judged MET: UNMET criteria contribute zero, positive-weight criteria add to the total, and negative-weight (error-type) criteria subtract from it when the error they describe is present. This approach dramatically reduces human evaluation cost while preserving the nuanced judgments encoded in the expert rubrics.
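A minimal sketch of this aggregation rule, with illustrative names (the actual judge prompts and scoring code may differ):

```python
def task_score(criteria: list[tuple[float, bool]]) -> float:
    """criteria: (weight, met) pairs, one per rubric criterion.

    MET criteria contribute their weight, which may be negative for
    error-type criteria; UNMET criteria contribute nothing.
    """
    return sum(weight for weight, met in criteria if met)

# Example: two positive criteria met, one missed, one -25 error triggered.
print(task_score([(10.0, True), (5.0, True), (8.0, False), (-25.0, True)]))  # -10.0
```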

Using DRACO, the authors evaluated four publicly available deep‑research systems: OpenAI Deep Research, Gemini Deep Research, Claude Opus, and Perplexity Deep Research (the system that supplied the original queries). Across all domains and all rubric axes, Perplexity Deep Research achieved the highest overall scores and pass rates, indicating a consistent advantage in factual correctness, analytical completeness, clear presentation, and proper citation. The other systems showed particular weaknesses in high‑stakes domains such as Law and Medicine, underscoring current limitations of LLMs in handling risk‑sensitive content.

The paper concludes with a candid discussion of limitations. DRACO currently contains only English‑language tasks, limiting its applicability to non‑English contexts. The reliance on LLM‑as‑a‑judge, while efficient, still requires further validation against human judges to ensure reliability, especially for nuanced criteria. Future work will focus on expanding multilingual coverage, strengthening the robustness of automated judging, and automating the entire task‑generation pipeline to enable continuous benchmark refreshes.

In sum, DRACO represents a comprehensive, production‑grounded benchmark that combines realistic, privacy‑preserving task sourcing, systematic difficulty augmentation, expert‑crafted multi‑axis rubrics, and scalable LLM‑based evaluation. It fills a gap in the evaluation landscape for deep‑research agents and provides a solid foundation for tracking progress and guiding improvements in the next generation of AI‑driven research assistants.

