Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation

As large language models become components of larger agentic systems, evaluation reliability becomes critical: unreliable sub-agents introduce brittleness into downstream system behavior. Yet current evaluation practice, reporting a single accuracy number from a single run, obscures the variance underlying these results, making it impossible to distinguish genuine capability improvements from lucky sampling. We propose adopting the Intraclass Correlation Coefficient (ICC), a metric from measurement science, to characterize this variance. ICC decomposes observed variance into between-query variance (task difficulty) and within-query variance (agent inconsistency), revealing whether reported results reflect true capability or measurement noise. We evaluate on GAIA (Levels 1-3, measuring agentic capabilities across varying reasoning complexity) and FRAMES (measuring retrieval and factuality across multiple documents). We find that ICC varies dramatically with task structure: reasoning and retrieval tasks (FRAMES) exhibit ICC = 0.4955-0.7118 across models, while agentic tasks (GAIA) exhibit ICC = 0.304-0.774. For sub-agent replacement decisions in agentic systems, accuracy improvements are only trustworthy if ICC also improves. We demonstrate that ICC converges by n = 8-16 trials for structured tasks and n >= 32 for complex reasoning, enabling practitioners to set evidence-based resampling budgets. We recommend reporting accuracy alongside ICC and within-query variance as standard practice, and propose updated Evaluation Cards capturing these metrics. By making evaluation stability visible, we aim to transform agentic benchmarking from opaque leaderboard competition into trustworthy experimental science. Our code is open-sourced at https://github.com/youdotcom-oss/stochastic-agent-evals.


💡 Research Summary

The paper addresses a critical gap in the evaluation of large language models (LLMs) when they are deployed as autonomous agents within larger systems. Traditional benchmarking practices report a single performance metric—typically accuracy, F1, or BLEU—derived from one run. This approach masks the stochastic nature of LLMs, which can produce different outputs for the same input depending on sampling temperature, top‑p, prompt variations, and internal randomness. Consequently, a seemingly higher score may be the result of lucky sampling rather than a genuine capability improvement, leading to brittle downstream behavior when such agents are integrated into complex pipelines.

To make evaluation reliability visible, the authors borrow the Intraclass Correlation Coefficient (ICC) from measurement science. ICC decomposes total observed variance into two components: (1) between‑query variance, reflecting the intrinsic difficulty or diversity of the test items, and (2) within‑query variance, capturing the inconsistency of the model when the same query is presented multiple times. Mathematically, ICC = σ²_between / (σ²_between + σ²_within). An ICC close to 1 indicates that most variability stems from the queries themselves and that the model behaves consistently; an ICC near 0 signals that the model’s stochasticity dominates, rendering any single accuracy figure unreliable.
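The decomposition above can be estimated from a (queries × trials) score matrix using one-way ANOVA mean squares. The following is a minimal sketch, not the authors' actual implementation; it assumes per-trial 0/1 correctness scores and the standard one-way random-effects estimator for σ²_between and σ²_within:

```python
import numpy as np

def icc_oneway(scores: np.ndarray) -> float:
    """One-way random-effects ICC from a (queries x trials) score matrix.

    Estimates sigma^2_between and sigma^2_within via ANOVA mean squares,
    then returns ICC = sigma^2_between / (sigma^2_between + sigma^2_within).
    A negative between-query variance estimate is clipped to zero.
    """
    k, n = scores.shape                       # k queries, n trials per query
    query_means = scores.mean(axis=1)
    grand_mean = scores.mean()
    # Mean square between queries and mean square within queries
    ms_between = n * np.sum((query_means - grand_mean) ** 2) / (k - 1)
    ms_within = np.sum((scores - query_means[:, None]) ** 2) / (k * (n - 1))
    var_between = max((ms_between - ms_within) / n, 0.0)
    var_within = ms_within
    denom = var_between + var_within
    return var_between / denom if denom > 0 else 0.0
```

A model that answers every query the same way on every trial yields ICC = 1 (all variance is between queries); a model whose per-query means are identical but whose trials disagree yields ICC = 0.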

The authors evaluate ICC on two benchmark suites that target different aspects of agentic capability. GAIA (a benchmark for General AI Assistants) spans three levels of difficulty, from simple command execution (Level 1) to complex multi‑step reasoning and planning (Level 3). FRAMES (Factuality, Retrieval, And reasoning MEasurement Set) focuses on information retrieval and factual verification across multiple documents. For each benchmark, they run a set of contemporary models—including GPT‑3.5‑Turbo, Claude‑2, and Llama‑2‑70B—across multiple repetitions (8, 16, 32, and 64 trials per query). They compute standard accuracy metrics as well as the variance components needed for ICC.

Key findings reveal that ICC is highly task‑dependent. In FRAMES, most models achieve ICC values between 0.4955 and 0.7118, suggesting a balanced contribution of query difficulty and model noise. Retrieval tasks show higher within‑query variance, while factual verification exhibits more stable outputs, raising the ICC. GAIA displays a broader spread: Levels 1 and 2 often reach ICCs of 0.70–0.77, indicating reliable performance on straightforward tasks. However, Level 3 reasoning drops ICC dramatically to the 0.30–0.55 range for many models, highlighting pronounced stochasticity when complex planning is required. This divergence underscores that a single accuracy number cannot capture the nuanced reliability profile of an agentic system.

The paper also investigates how many repetitions are needed for ICC to converge. Structured, low‑entropy tasks (e.g., multiple‑choice or simple retrieval) stabilize after 8–16 trials, whereas high‑entropy reasoning tasks require at least 32 repetitions, with diminishing returns beyond 64 runs. This empirical guideline enables practitioners to allocate evaluation budgets efficiently: they can set a resampling budget that guarantees a stable ICC estimate without excessive computational cost.
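One way to apply this guideline in practice is to recompute ICC at increasing trial budgets by subsampling the trials already collected, and stop once the estimate stabilizes. This sketch is an illustration of that idea, not the paper's procedure; it assumes a score matrix with enough trial columns to subsample from:

```python
import numpy as np

def _icc(scores: np.ndarray) -> float:
    # One-way ICC: sigma^2_between / (sigma^2_between + sigma^2_within).
    k, n = scores.shape
    qm = scores.mean(axis=1)
    msb = n * np.sum((qm - scores.mean()) ** 2) / (k - 1)
    msw = np.sum((scores - qm[:, None]) ** 2) / (k * (n - 1))
    vb = max((msb - msw) / n, 0.0)
    return vb / (vb + msw) if (vb + msw) > 0 else 0.0

def icc_by_budget(scores: np.ndarray, budgets=(4, 8, 16, 32), seed=0) -> dict:
    """Recompute ICC from subsampled trial columns at each budget n.

    Plotting the resulting curve shows where the ICC estimate stabilizes,
    so a resampling budget can be chosen before committing more compute.
    """
    rng = np.random.default_rng(seed)
    out = {}
    for n in budgets:
        # Draw n trials per query without replacement and re-estimate ICC.
        cols = rng.choice(scores.shape[1], size=n, replace=False)
        out[n] = _icc(scores[:, cols])
    return out
```

For structured tasks, the curve should flatten by n = 8-16; for high-entropy reasoning tasks, the summary's finding suggests budgeting at least 32 trials before trusting the estimate.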

Based on these insights, the authors propose two concrete recommendations for the community. First, every benchmark report should include, alongside accuracy, the within‑query variance and the ICC value. This three‑part reporting format makes it explicit whether observed gains are due to genuine capability improvements or merely reduced measurement noise. Second, they extend the existing “Evaluation Card” format to incorporate these reliability metrics, encouraging reproducible and transparent reporting across papers and leaderboards. By doing so, the field moves from opaque leaderboard competition toward a more scientific, evidence‑based evaluation paradigm.
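The paper's exact Evaluation Card schema is not reproduced here, but the three-part reporting format it recommends might be captured by a record like the following hypothetical sketch (all field names are illustrative assumptions, not the authors' schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class EvalCard:
    """Hypothetical reliability-aware evaluation record: accuracy is
    reported alongside the two stability metrics the summary recommends."""
    benchmark: str            # e.g. "FRAMES" or "GAIA Level 3"
    model: str
    n_trials: int             # repetitions per query used for the estimates
    accuracy: float           # mean score across queries and trials
    icc: float                # sigma^2_between / (sigma^2_between + sigma^2_within)
    within_query_var: float   # sigma^2_within, the agent's inconsistency

# Illustrative values only, not results from the paper.
card = EvalCard("FRAMES", "example-model", 16, 0.62, 0.55, 0.08)
```

Reporting all three fields makes it immediately visible when an accuracy gain is accompanied by a drop in ICC, i.e., when the gain may be measurement noise rather than capability.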

In conclusion, the study demonstrates that ICC is a practical, interpretable statistic for quantifying the stochastic consistency of agentic LLMs. It reveals that many high‑performing models can still be unreliable on complex tasks, and that improvements should be judged on both accuracy and ICC trends. The open‑source codebase (https://github.com/youdotcom-oss/stochastic-agent-evals) provides the community with tools to compute ICC, plan resampling strategies, and generate enriched evaluation cards. Future work may explore complementary reliability measures (e.g., Cronbach’s α, test‑retest reliability) and extend the methodology to domains such as code generation, robotic control, and multimodal agents, further solidifying trustworthy benchmarking practices for the next generation of autonomous AI systems.

