Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies
Deep Research Agents increasingly automate survey generation, yet whether they match human experts in two core abilities remains unclear: retrieving essential papers and organizing them into expert-like taxonomies. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics fail to capture hierarchical taxonomy structure. We introduce TaxoBench, a benchmark built from 72 highly-cited LLM surveys containing expert-authored taxonomy trees with 3,815 papers mapped to paper categories as ground truth. TaxoBench evaluates both abilities: (1) retrieval, measuring whether agents retrieve expert-cited papers; and (2) organization, assessed at two levels: the leaf-level measures paper-to-category assignment, while the hierarchy-level measures taxonomy structure via novel metrics, Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). TaxoBench supports two evaluation modes: Deep Research tests end-to-end capability given only a topic, while Bottom-Up provides the expert paper set to isolate organization ability. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck: the best agent retrieves only 20.92% of expert-cited papers, and even with perfect input, the best model achieves only 31.24% ARI with substantial structural gaps. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
💡 Research Summary
The paper introduces TaxoBench, a benchmark designed to evaluate whether Deep Research Agents (DRAs) can match human experts in two essential capabilities for survey generation: retrieving the core set of papers that experts would cite, and organizing those papers into expert‑like hierarchical taxonomies. Existing benchmarks focus on surface aspects such as writing quality or citation correctness and lack ground‑truth paper sets and hierarchical structures needed to assess these capabilities. To fill this gap, the authors collect 72 highly‑cited LLM‑focused survey papers across eight research domains (e.g., multimodal learning, reinforcement learning, alignment, agents). From each survey they manually extract the expert‑authored taxonomy tree, resulting in a dataset of 3,815 cited papers mapped to 14,000+ category nodes (average 14 categories per taxonomy, depth ≈5). This provides a realistic gold‑standard of both paper‑to‑category assignments and the internal parent‑child relations of the taxonomy.
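The ground truth described above, an expert taxonomy tree with papers filed under category nodes, can be pictured with a small sketch. The schema below (field names, helper functions, and the toy taxonomy) is illustrative and hypothetical, not the benchmark's actual data format:

```python
# Hypothetical sketch of a TaxoBench-style ground-truth entry: an expert
# taxonomy as a tree of category nodes, with cited papers filed at nodes.
# The schema and helpers here are illustrative, not the benchmark's format.

def make_node(label, children=(), papers=()):
    """A taxonomy category: a label, child categories, and papers under it."""
    return {"label": label, "children": list(children), "papers": list(papers)}

def count_categories(node):
    """Total number of category nodes in the taxonomy."""
    return 1 + sum(count_categories(c) for c in node["children"])

def tree_depth(node):
    """Depth of the taxonomy (root counts as level 1)."""
    return 1 + max((tree_depth(c) for c in node["children"]), default=0)

def paper_paths(node, prefix=()):
    """Map each paper to its root-to-leaf category path -- the gold
    assignment that leaf-level and path-level evaluation compare against."""
    path = prefix + (node["label"],)
    mapping = {p: path for p in node["papers"]}
    for c in node["children"]:
        mapping.update(paper_paths(c, path))
    return mapping

# A toy taxonomy in the spirit of the LLM-agent surveys in the benchmark.
toy = make_node("LLM Agents", children=[
    make_node("Planning", papers=["ReAct", "Tree-of-Thoughts"]),
    make_node("Memory", children=[
        make_node("Long-term Memory", papers=["MemGPT"]),
    ]),
])
```

Representing each survey this way yields exactly the two things the benchmark scores: a paper-to-category assignment (the leaves) and the parent-child relations of the tree.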
TaxoBench supports two evaluation modes. In Deep Research mode, an agent receives only a topic string and must perform end‑to‑end web search, paper collection, and taxonomy construction. Retrieval performance is measured with standard Recall, Precision, and F1 against the expert paper set. In Bottom‑Up mode the expert paper set is supplied, isolating the organization ability of the model. Organization is evaluated at two levels. The leaf‑level uses Adjusted Rand Index (ARI) and V‑Measure to assess how well papers are assigned to the correct leaf categories. Because these flat clustering metrics ignore hierarchical structure, the authors propose three hierarchy‑aware metrics: Unordered Semantic Tree Edit Distance (US‑TED), its normalized version US‑NTED, and Semantic Path Similarity (Sem‑Path). US‑TED extends classic tree edit distance by ignoring sibling order and assigning a semantic rename cost based on cosine similarity of label embeddings (MiniLM‑L6‑v2). US‑NTED normalizes the distance by the total number of nodes, enabling cross‑instance comparison. Sem‑Path compares the root‑to‑leaf label sequences of matched papers, measuring path‑level semantic consistency. Together these metrics capture both global structural mismatches and local path errors that flat metrics miss.
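The semantic rename cost and the path comparison can be illustrated with a minimal sketch. A toy bag-of-words embedding stands in for the MiniLM-L6-v2 sentence embeddings the paper uses, and `sem_path` is one plausible reading of path-level similarity (average label similarity over aligned positions, penalizing length mismatch); the paper's exact formulas are not reproduced here:

```python
import math
from collections import Counter

def embed(label):
    # Toy bag-of-words embedding, standing in for MiniLM-L6-v2
    # sentence embeddings (an assumption for this sketch).
    return Counter(label.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rename_cost(a, b):
    # US-TED-style semantic rename cost: near-synonymous labels are
    # cheap to rename, unrelated labels cost up to 1.
    return 1.0 - cosine(embed(a), embed(b))

def sem_path(gold_path, pred_path):
    # One plausible reading of Sem-Path: mean label similarity over
    # aligned root-to-leaf positions, with length mismatch penalized
    # by dividing by the longer path.
    n = max(len(gold_path), len(pred_path))
    sims = [cosine(embed(g), embed(p)) for g, p in zip(gold_path, pred_path)]
    return sum(sims) / n if n else 1.0
```

Under this sketch, an identical path scores 1.0, a path with one renamed but related category scores somewhere in between, and a paper routed down an unrelated branch scores near 0, which is the kind of local path error the flat clustering metrics cannot see.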
The authors evaluate seven state‑of‑the‑art Deep Research Agents and twelve frontier LLMs on TaxoBench. In Deep Research mode, the best agent retrieves only 20.92% of the expert‑cited papers, indicating a severe retrieval bottleneck. In Bottom‑Up mode, even when given the exact expert paper set, the best model achieves only 31.24% ARI, showing that organization remains a major challenge. US‑NTED scores cluster around 0.28–0.29 for all models, and Sem‑Path scores are similarly narrow, suggesting a shared limitation in hierarchical reasoning beyond simple clustering. Error analysis reveals three dominant failure modes: (1) missing critical papers, (2) assigning papers to incorrect sibling categories, and (3) mis‑placing entire sub‑trees under wrong parent nodes.
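To make the leaf-level numbers concrete: ARI is the standard pair-counting Adjusted Rand Index, so a 31.24% score measures agreement on which paper pairs share a category, corrected for chance. A minimal stdlib implementation (not the authors' evaluation code) looks like this:

```python
from collections import Counter
from math import comb

def ari(gold, pred):
    """Adjusted Rand Index between two flat labelings of the same papers."""
    assert len(gold) == len(pred)
    n = len(gold)
    pair = Counter(zip(gold, pred))          # contingency-table cells
    sum_a = sum(comb(c, 2) for c in Counter(gold).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    index = sum(comb(c, 2) for c in pair.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions trivial
    return (index - expected) / (max_index - expected)
```

A perfect assignment scores 1.0 regardless of how categories are named, and a chance-level assignment scores near 0, which is why a best score of 31.24% leaves substantial headroom even before hierarchy is considered.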
The paper’s contributions are threefold: (1) the creation and public release of TaxoBench, a large‑scale benchmark with expert‑authored taxonomies and paper mappings; (2) the introduction of hierarchy‑aware evaluation metrics (US‑TED/US‑NTED, Sem‑Path) that capture structural and semantic alignment of taxonomies; and (3) a comprehensive empirical study exposing a dual bottleneck in current DRAs—limited retrieval recall and weak hierarchical organization—even for the most advanced LLMs. The findings highlight the need for future research to improve both web‑scale retrieval of seminal works and the ability of models to reason about and construct coherent hierarchical knowledge structures. TaxoBench and its metrics are likely to become standard tools for assessing and guiding progress in LLM‑driven literature synthesis and survey automation.