Scalable Delphi: Large Language Models for Structured Risk Estimation


Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard - the Delphi method - produces calibrated, auditable judgments but requires months of coordination and specialist time, placing rigorous risk assessment out of reach for most applications. We investigate whether Large Language Models (LLMs) can serve as scalable proxies for structured expert elicitation. We propose Scalable Delphi, adapting the classical protocol for LLMs with diverse expert personas, iterative refinement, and rationale sharing. Because target quantities are typically unobservable, we develop an evaluation framework based on necessary conditions: calibration against verifiable proxies, sensitivity to evidence, and alignment with human expert judgment. We evaluate in the domain of AI-augmented cybersecurity risk, using three capability benchmarks and independent human elicitation studies. LLM panels achieve strong correlations with benchmark ground truth (Pearson r=0.87-0.95), improve systematically as evidence is added, and align with human expert panels - in one comparison, closer to a human panel than the two human panels are to each other. This demonstrates that LLM-based elicitation can extend structured expert judgment to settings where traditional methods are infeasible, reducing elicitation time from months to minutes.


💡 Research Summary

The paper tackles the long‑standing bottleneck of structured expert elicitation in high‑stakes risk assessment by replacing human panels with large language models (LLMs). Traditional Delphi studies provide calibrated, auditable probability estimates, but they require months of coordination and costly expert time, making them infeasible for many organizations and for rapidly evolving domains. The authors introduce “Scalable Delphi,” a fully automated protocol that mirrors the classic Delphi process but uses multiple LLM agents instantiated with distinct expert personas.

In each round, every persona receives the same evidence set and a description of the quantity to be estimated. Each agent produces an independent probability estimate together with a written rationale. A mediator LLM aggregates these responses into anonymized feedback: summary statistics, key arguments for higher and lower values, and points of disagreement, without attributing statements to any specific persona. This summary is returned to all agents for the next round, allowing iterative refinement. After a predetermined number of rounds, the final estimate is the average of the panel’s last‑round estimates (a linear opinion pool), and the spread across agents is reported as a 95% confidence interval.

Because the target quantities are inherently unobservable, the authors devise an evaluation framework based on necessary conditions rather than direct ground‑truth comparison. The two core conditions are (1) calibration—correlation with observable proxies and appropriate coverage of confidence intervals, and (2) evidence sensitivity—demonstrating that adding or removing relevant evidence moves the estimates in the expected direction. As corroborating evidence, they compare LLM outputs to independent human expert panels and qualitatively inspect the generated rationales.
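The evidence-sensitivity condition can be operationalized as a directional check: toggle one evidence item per case and count how often the panel's estimate moves in the expected direction. The sketch below assumes a hypothetical `elicit` function standing in for a full panel run; the toy scorer exists only to make the snippet executable.

```python
def evidence_sensitivity(elicit, cases):
    """Fraction of cases where the estimate moves the expected way.

    elicit(evidence) -> probability estimate (stand-in for a panel run).
    Each case is (base_evidence, extra_item, expected_direction), with
    expected_direction +1 (adding the item should raise the estimate)
    or -1 (it should lower it).
    """
    hits = 0
    for base, extra, direction in cases:
        without = elicit(base)
        with_extra = elicit(base + [extra])
        if (with_extra - without) * direction > 0:
            hits += 1
    return hits / len(cases)

def toy_elicit(evidence):
    # Toy scorer: each 'supports' item adds 0.1, each 'refutes' subtracts 0.1.
    p = 0.5 + 0.1 * sum(1 if tag == "supports" else -1 for tag in evidence)
    return min(max(p, 0.0), 1.0)

cases = [
    ([], "supports", +1),
    (["supports"], "refutes", -1),
]
score = evidence_sensitivity(toy_elicit, cases)  # → 1.0
```

A perfectly insensitive elicitor (e.g., one regurgitating memorized answers regardless of input) would score near 0.5 on balanced cases, which is what makes this a usable necessary condition.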

The experimental domain is AI‑augmented cybersecurity risk, which offers three publicly available benchmarks with known success rates for autonomous agents: BountyBench, Cybench, and CyberGym. Two state‑of‑the‑art models are evaluated: OpenAI’s GPT‑5.1 (knowledge cut‑off September 2024) and Anthropic’s Claude Opus 4.1 (cut‑off January 2025). Benchmark choice is driven by contamination concerns: two of the benchmarks post‑date the models’ cut‑offs and therefore allow a clean test of reasoning from the supplied evidence, while Cybench predates them.

Results show strong calibration: Pearson correlations between LLM estimates and benchmark ground truth range from 0.87 to 0.95, with Spearman coefficients similarly high and mean absolute errors well below simple baseline heuristics (global or task‑wise means). Evidence‑sensitivity experiments reveal systematic shifts in estimates as evidence is added or removed, confirming that the models are responsive to the supplied information rather than relying on memorized outcomes. In a head‑to‑head comparison with two human expert panels, the LLM panel’s mean absolute difference from a human panel is only 5.0%, whereas the two human panels differ from each other by 16.6%, indicating that LLMs can approximate expert consensus closely. Convergence behavior mirrors traditional Delphi: variance across agents shrinks and confidence intervals narrow over successive rounds.
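A minimal pure-Python sketch of these calibration metrics (the paper presumably uses standard statistical tooling; these helpers and the illustrative numbers below are not from the paper): Pearson r on raw values, Spearman as Pearson on ranks, and MAE compared against a global-mean baseline.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson on the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))

def mae(preds, truth):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, truth)) / len(preds)

preds = [0.10, 0.35, 0.60, 0.80]   # panel estimates (illustrative numbers)
truth = [0.05, 0.30, 0.70, 0.75]   # benchmark solve rates (illustrative)
baseline = [sum(truth) / len(truth)] * len(truth)  # predict the global mean
```

Beating the global-mean (or task-wise-mean) baseline on MAE is the relevant bar here: a method that merely predicts the average solve rate can look harmless while carrying no per-task information.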

The paper’s contributions are threefold: (1) a concrete Scalable Delphi protocol that defines persona creation, prompt structure, round‑wise feedback, and aggregation; (2) an evaluation methodology for latent‑quantity estimation that relies on calibration, evidence sensitivity, and expert alignment; (3) empirical evidence across multiple benchmarks and models that LLM‑based elicitation can achieve accuracy and consistency comparable to human experts while reducing elicitation time from months to minutes.

Beyond the core findings, the authors discuss practical advantages of LLM‑based elicitation: unlimited repeatability for stress‑testing, controllable independence (fresh context vs. sequential history), and the ability to conduct value‑of‑information analyses automatically. They also acknowledge limitations such as over‑confidence, potential mode collapse when personas are insufficiently diverse, and the risk of data contamination for benchmarks released before the models’ training cut‑offs. Future work is suggested on post‑hoc calibration techniques, automated persona diversification, and integration of Bayesian updating to further improve reliability.

In summary, Scalable Delphi demonstrates that large language models can serve as scalable, fast, and cost‑effective proxies for structured expert judgment, opening the door to rigorous risk assessment in settings where traditional Delphi is prohibitively expensive or too slow.

