SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation
The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods adhere to the distinct standards of various academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to yield high-quality surveys compliant with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. Subsequently, we propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which utilizes LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation to rigorously measure content coverage and synthesis quality against human-written survey papers. We conduct extensive experiments by evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.
💡 Research Summary
The paper addresses a critical gap in the evaluation of Automatic Survey Generation (ASG) systems: existing benchmarks and metrics are heavily biased toward Computer Science and rely on generic n‑gram or semantic‑similarity scores that do not capture the structural and stylistic nuances of different academic disciplines. To remedy this, the authors introduce SurveyLens, the first discipline‑aware benchmark for ASG, together with a curated dataset called SurveyLens‑1k comprising 1,000 high‑quality human‑written survey papers across ten research fields, including Physics, Engineering, Computer Science, Medicine, Environment, Business, Sociology, Education, and Psychology.
Dataset Construction
The authors collected roughly 3,000 candidate surveys using keyword filters (“a survey of”, “a review”, etc.) and citation‑based ranking, then applied a multi‑stage cleaning pipeline: (1) rule‑based extraction of outlines, figures, tables, equations, and references using the PDF‑to‑Markdown tool MinerU; (2) LLM‑assisted refinement to correct hierarchical nesting and reference parsing; and (3) manual verification by annotators. Each paper is represented as a Structured Survey Representation (SSR) triplet (Outline, Content, References), enabling fine‑grained analysis of section hierarchy, textual content, and bibliographic metadata. Statistical analysis shows stark cross‑disciplinary differences: for example, Physics papers average 411 equations and 23k words, while Sociology papers are much shorter and contain few equations.
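The SSR triplet can be pictured as a small schema. The sketch below is illustrative only: the field names and `level`-based hierarchy encoding are assumptions for clarity, not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class Section:
    """One node in a survey's hierarchical outline."""
    title: str
    level: int        # 1 = top-level section, 2 = subsection, ...
    text: str = ""    # body text belonging to this section

@dataclass
class StructuredSurveyRepresentation:
    """SSR triplet: (Outline, Content, References)."""
    outline: list[Section]   # ordered; hierarchy encoded via `level`
    content: str             # full cleaned body text of the survey
    references: list[str]    # parsed bibliographic entries

# Toy instance: hierarchy and bibliography are queryable independently.
ssr = StructuredSurveyRepresentation(
    outline=[Section("Introduction", 1), Section("Background", 1),
             Section("Related Models", 2)],
    content="...",
    references=["Doe et al., 2021. A Survey of X."],
)
top_level = [s.title for s in ssr.outline if s.level == 1]
```

Keeping outline, content, and references as separate components is what enables the fine-grained structural statistics (equation counts, section depth, reference styles) reported per discipline.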
Evaluation Framework
SurveyLens proposes a dual‑lens evaluation:
- Discipline‑Aware Rubric Evaluation – Field‑specific rubrics are automatically distilled from the SurveyLens‑1k corpus. An LLM (Gemini‑3‑Pro) acts as a judge, scoring generated surveys against these rubrics. To align the LLM’s scoring with expert preferences, the authors collect pairwise preference data from human evaluators and fit a Bradley‑Terry model, producing weight vectors that bias the rubric scores toward what domain experts value (e.g., evidence hierarchy in Medicine, benchmark tables in Computer Science).
- Canonical Alignment Evaluation – This measures factual grounding and content coverage relative to the human reference surveys via two complementary metrics:
  - RAMS (Redundancy‑Aware Matching Score) – Uses Hungarian matching to pair sections of the generated survey with those of the reference, explicitly penalizing repeated or duplicated content.
  - TAMS (Thresholded Average Maximum Similarity) – Computes the average of the maximum semantic similarity between generated and reference sections, capped by a threshold to avoid inflating scores through superficial similarity. Together, RAMS and TAMS capture both the richness of synthesis and the avoidance of redundancy, a failure mode that traditional ROUGE‑type scores tend to mask.
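The Bradley‑Terry fit used for the preference‑aligned rubric weights can be sketched with standard minorization‑maximization (Zermelo) updates. This is a generic illustration of the model class, not the authors' implementation; the function name and interface are made up for the example.

```python
def bradley_terry(n_items, pairs, iters=200):
    """Fit Bradley-Terry strengths from pairwise preference data.

    pairs: list of (winner_index, loser_index) comparisons.
    Returns strengths normalized to sum to 1, via MM (Zermelo) updates:
        p_i <- W_i / sum over comparisons involving i of 1 / (p_i + p_j)
    """
    wins = [0.0] * n_items
    for w, _ in pairs:
        wins[w] += 1.0
    p = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for w, l in pairs:
            d = 1.0 / (p[w] + p[l])
            denom[w] += d
            denom[l] += d
        p = [wins[i] / denom[i] if denom[i] else p[i] for i in range(n_items)]
        s = sum(p)
        p = [x / s for x in p]
    return p

# Item 0 is preferred in 2 of 3 comparisons against item 1.
weights = bradley_terry(2, [(0, 1), (0, 1), (1, 0)])
```

With the normalized strengths in hand, rubric dimensions that experts consistently prefer receive proportionally larger weight in the judge's aggregate score.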
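The two alignment metrics above can be sketched in a few lines, given a precomputed section‑similarity matrix. This is a minimal illustration of the underlying mechanics (one‑to‑one Hungarian matching for RAMS, thresholded row maxima for TAMS); the exact scoring formulas, the duplicate‑detection step, and the parameters `dup_penalty` and `tau` are assumptions, not the paper's definitions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def rams(sim, dup_penalty=0.2, n_dup=0):
    """Redundancy-aware matching sketch: one-to-one Hungarian matching
    between generated sections (rows) and reference sections (cols),
    minus a penalty for each near-duplicate generated section."""
    r, c = linear_sum_assignment(-sim)   # negate to maximize similarity
    matched = sim[r, c].mean()
    return max(0.0, matched - dup_penalty * n_dup)

def tams(sim, tau=0.9):
    """Thresholded average maximum similarity: for each reference
    section, take its best-matching generated section, capped at tau."""
    return np.minimum(sim.max(axis=0), tau).mean()

# 3 generated sections x 2 reference sections, similarities in [0, 1].
sim = np.array([[0.95, 0.30],
                [0.40, 0.80],
                [0.92, 0.28]])
```

The key design difference shows up here: a generated survey that restates one reference section three times can still score well under a max‑similarity metric like TAMS, but Hungarian matching forces each reference section to be covered by a *distinct* generated section, so the redundancy is exposed.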
Experimental Setup
Eleven state‑of‑the‑art ASG methods are evaluated, spanning three paradigms: (a) Vanilla LLMs (e.g., GPT‑4, Gemini‑Pro), (b) specialized ASG pipelines (AutoSurvey, SurveyForge, etc.), and (c) multi‑agent Deep Research Agents (Gemini Deep Research, Qwen Deep Research). For each method, 100 surveys are generated (10 per discipline) using topics extracted from SurveyLens‑1k. The generated outputs are then scored using the discipline‑aware rubric (with the LLM judge) and the RAMS/TAMS metrics, and compared against the corresponding human‑written reference surveys.
Key Findings
- Structural Rigor vs. Narrative Richness – Specialized ASG pipelines excel in fields that demand strict organization, such as Physics, Engineering, and Computer Science, achieving high rubric scores for structural coherence, citation accuracy, and tabular/figure integration. Conversely, Vanilla LLMs outperform these pipelines in humanities and social sciences (Sociology, Education, Psychology) by delivering more fluid narratives, better thematic synthesis, and higher novelty scores.
- Deep Research Agents – While agents produce content with higher semantic depth and broader cross‑disciplinary connections, they often miss discipline‑specific formatting conventions (e.g., equation numbering, reference style), leading to lower rubric scores for formal precision.
- Metric Correlation with Human Judgment – Both the discipline‑aware rubric scores and the RAMS/TAMS composite correlate strongly (Pearson > 0.78) with independent human expert ratings, substantially surpassing the correlation achieved by ROUGE or BERTScore alone. This validates the benchmark’s ability to reflect real scholarly quality.
- Trade‑offs Identified – The study reveals a clear trade‑off: methods optimized for exhaustive coverage (high TAMS) may suffer from redundancy (low RAMS), while those tuned for concise structure may miss nuanced synthesis.
Contributions and Impact
- SurveyLens‑1k – A publicly released, multi‑disciplinary dataset that captures field‑specific structural and stylistic variance, filling a long‑standing resource gap.
- Discipline‑Aware Evaluation – A novel, LLM‑as‑judge framework that incorporates human‑derived preference weights, enabling interpretable, field‑specific scoring.
- Redundancy‑Sensitive Metrics – RAMS and TAMS together provide a more nuanced assessment of content quality than traditional n‑gram metrics.
- Comprehensive Empirical Insights – The benchmark reveals where current ASG technologies succeed or fail across disciplines, offering concrete guidance for practitioners choosing tools for specific scholarly domains.
Future Directions
The authors plan to expand SurveyLens to additional fields such as Law and Arts, integrate dynamic rubric updates as new citation practices emerge, and explore hybrid human‑agent evaluation loops that could further close the gap between automated scores and expert judgment.
In summary, SurveyLens establishes a rigorous, discipline‑sensitive benchmark that not only standardizes ASG evaluation across the scholarly spectrum but also drives the next generation of survey‑generation systems toward truly interdisciplinary competence.