Agent Benchmarks Fail Public Sector Requirements
Deploying Large Language Model-based agents (LLM agents) in the public sector requires ensuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be process-based, realistic, and public-sector-specific, and must report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers for these criteria using an expert-validated, LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings are a call to action both for researchers, to develop public-sector-relevant benchmarks, and for public-sector officials, to apply these criteria when evaluating their own agentic use cases.
💡 Research Summary
The paper addresses a pressing gap in the responsible deployment of large language model (LLM)-based agents within public-sector organisations (PSOs). While benchmarks are the primary technical tool for measuring model capabilities, it is unclear whether existing benchmarks capture the legal, procedural, and structural constraints that public institutions must satisfy, such as procedural legitimacy, political neutrality, transparency, and impersonality.
To answer this, the authors first conduct a theory-driven survey of public-administration literature (automation theory, digital-era governance, artificial discretion, diffusion of innovations, etc.) and of psychometric principles of benchmark design. From this synthesis they derive a set of concrete criteria in two categories. The first covers structural requirements for the benchmark itself: a Task-Based Model (process-oriented), Realistic Tasks, and Public-Sector-Specific Tasks. The second covers requirements for the reported metrics: Top-Level Performance (including cost), Fairness, Rigorous Statistical Methodology, and Construct Validity.
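To make the two criterion groups concrete, the sketch below encodes them as a simple checklist data structure. The field names and the pass/fail framing are our illustrative assumptions, not the paper's formal definitions.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkAssessment:
    """Illustrative checklist for the criteria described above (names are ours)."""
    # Structural requirements for the benchmark itself
    task_based_model: bool = False        # tasks modelled as end-to-end processes
    realistic_tasks: bool = False         # tasks drawn from authentic workflows
    public_sector_specific: bool = False  # tasks grounded in public-sector settings
    # Requirements for the reported metrics
    top_level_performance: bool = False   # headline performance, including cost
    fairness: bool = False                # fairness across protected attributes
    statistical_methodology: bool = False # CIs, significance tests, contamination checks
    construct_validity: bool = False      # metrics measure what they claim to measure

    def satisfies_all(self) -> bool:
        # A benchmark "passes" only if every criterion is met.
        return all(vars(self).values())
```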
Using these criteria, they examine 1,304 papers that propose benchmarks for LLM agents. An LLM-assisted extraction pipeline gathers information about task definitions, data sources, and evaluation metrics; a panel of five domain experts validates a sample of the extractions, achieving Cohen's κ = 0.82, indicating high annotation reliability. The analysis reveals that no benchmark satisfies all of the criteria. The most commonly met criteria are Task-Based Model (≈68% of papers) and Realistic Tasks (≈55%). In stark contrast, only a handful of benchmarks address public-sector specificity or report cost and fairness metrics (each under 3%). Consequently, the current benchmark ecosystem fails to provide the multidimensional evidence needed for public-sector risk assessment.
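For readers unfamiliar with the agreement statistic cited above, the following minimal sketch computes Cohen's κ between the pipeline's labels and one expert's labels on a hypothetical validation sample. The data are invented for illustration and do not come from the paper.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical validation sample: the same papers labelled by the
# LLM-assisted pipeline and by one human expert for a single criterion.
pipeline_labels = ["met", "not_met", "met", "met", "not_met", "met", "not_met", "met"]
expert_labels   = ["met", "not_met", "met", "not_met", "not_met", "met", "not_met", "met"]

# Cohen's kappa corrects raw agreement for agreement expected by chance;
# values above roughly 0.8 are conventionally read as strong agreement.
kappa = cohen_kappa_score(pipeline_labels, expert_labels)
print(f"Cohen's kappa = {kappa:.2f}")
```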
The authors discuss why this mismatch exists: benchmark design is driven by technical convenience and data availability rather than policy relevance, leading to synthetic or overly simplified tasks, single‑dimensional accuracy scores, and weak statistical rigor. This creates a “benchmark‑deployment validity gap” that can cause LLM agents to underperform or produce biased outcomes when confronted with real bureaucratic workflows.
To close the gap, the paper proposes a practical roadmap: (1) co‑design benchmarks with public‑sector stakeholders to obtain authentic case documents, citizen queries, and procedural logs; (2) structure benchmarks around modular tasks that reflect end‑to‑end processes, preserving inter‑task dependencies; (3) adopt a standardized set of multidimensional metrics that include monetary or token cost and fairness across protected attributes; (4) embed rigorous statistical testing (confidence intervals, significance tests, contamination checks) into benchmark pipelines; and (5) create open repositories and continuous evaluation frameworks that link benchmark outcomes directly to policy‑making decisions.
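As one way to realise point (4), the sketch below computes a percentile-bootstrap 95% confidence interval for a per-task success rate. The simulated outcomes and the resampling setup are illustrative assumptions rather than the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task outcomes from one benchmark run (1 = success, 0 = failure).
outcomes = rng.binomial(1, 0.72, size=200)

# Percentile bootstrap: resample tasks with replacement and recompute the score.
n_boot = 10_000
boot_means = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(n_boot)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])

print(f"success rate = {outcomes.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Reporting an interval alongside the point estimate lets readers judge whether differences between agents exceed sampling noise, which is exactly the kind of evidence the roadmap asks benchmark pipelines to provide.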
In conclusion, the study makes three key contributions: it formalises a theory‑grounded set of public‑sector benchmark criteria; it empirically demonstrates that existing LLM‑agent benchmarks fall short of these standards; and it offers concrete design and governance recommendations for future benchmarks. By aligning benchmark construction with the unique demands of public administration, researchers and policymakers can obtain reliable, actionable evidence for the safe and accountable integration of LLM agents into the public sector.