When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents


Run the same LLM agent on the same task twice: do you get the same behavior? We find the answer is often no. In a study of 3,000 agent runs across three models (Llama 3.1 70B, GPT-4o, and Claude Sonnet 4.5) on HotpotQA, we observe that ReAct-style agents produce 2.0–4.2 distinct action sequences per 10 runs on average, even with identical inputs. More importantly, this variance predicts failure: tasks with consistent behavior ($\leq$2 unique paths) achieve 80–92% accuracy, while highly inconsistent tasks ($\geq$6 unique paths) achieve only 25–60%, a 32–55 percentage point gap depending on model. We trace variance to early decisions: 69% of divergence occurs at step 2, the first search query. Our results suggest that monitoring behavioral consistency during execution could enable early error detection and improve agent reliability.


💡 Research Summary

This paper investigates a fundamental reliability question for large‑language‑model (LLM) agents: given exactly the same input, does an agent follow the same reasoning and take the same actions each time it is run? To answer this, the authors build a ReAct‑style agent that interleaves “thought” generation with tool use (Search, Retrieve, Finish) and evaluate it on 100 hard‑difficulty HotpotQA questions. Each question‑model pair is executed ten times with temperature 0.7, yielding a total of 3,000 runs across three contemporary models—Llama 3.1 70B (open‑source), GPT‑4o, and Claude Sonnet 4.5.
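The replicated-runs protocol can be sketched as follows. This is an illustrative stand-in, not the authors' code: `stub_agent` is a toy rollout whose first search query varies randomly, mimicking the step-2 divergence the paper observes in real LLM agents; only the bookkeeping mirrors the study design.

```python
# Sketch of the replicated-runs protocol: execute the same question
# several times and tally the distinct action sequences produced.
import random
from collections import Counter

def stub_agent(question: str, rng: random.Random) -> tuple[str, ...]:
    """Toy ReAct rollout. A real agent would sample an LLM at temperature
    0.7; here only the first search query varies, mimicking step-2 divergence."""
    query = rng.choice([f"Search[{question}]", f"Search[{question} history]"])
    return ("Think", query, "Retrieve[top-1]", "Finish[answer]")

def behavioral_profile(question: str, n_runs: int = 10, seed: int = 0) -> Counter:
    """Run the same question n_runs times and count each distinct trajectory."""
    rng = random.Random(seed)
    return Counter(stub_agent(question, rng) for _ in range(n_runs))

profile = behavioral_profile("Who founded the older of the two companies?")
print(len(profile))  # number of unique action sequences across 10 identical-input runs
```

With a real agent substituted for the stub, `len(profile)` is exactly the "unique action sequences per ten runs" statistic reported in the paper (2.0 for Claude, 4.2 for Llama).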

Key empirical findings

  1. Behavioral diversity – The number of unique action sequences per ten runs ranges from 2.0 (Claude) to 4.2 (Llama). Even the best‑performing closed‑source model exhibits multiple distinct trajectories, indicating that stochastic sampling produces non‑trivial variance in tool‑calling decisions.
  2. Consistency predicts correctness – Tasks whose runs produce ≤2 distinct sequences achieve 80–92% accuracy, whereas tasks with ≥6 distinct sequences drop to 25–60%. The gap (32–55 percentage points) holds across all three models and is statistically significant (e.g., t = 2.70, p = 0.011 for Llama). This strong association suggests that early‑stage agreement can serve as a runtime reliability signal.
  3. Where divergence occurs – For the most variable model (Llama 3.1 70B), 69% of the divergence happens at step 2, i.e., the first search query after the initial reasoning step. The authors show that tasks remaining consistent through step 2 achieve 85.8% accuracy, while those diverging at step 2 achieve only 71.7%, highlighting the pivotal role of the initial query in shaping the entire trajectory.
  4. Path length as a proxy – Short, consistent trajectories (≈3.4 steps on average) attain ≈86% correctness, whereas long, inconsistent trajectories (≈7.8 steps on average) fall to ≈43% (correlation r = −0.34). Longer sequences reflect more backtracking and uncertainty, providing a simple heuristic for flagging potentially erroneous runs.
  5. Temperature effects – Lowering the sampling temperature for Llama 3.1 70B from 0.7 to 0.0 reduces unique sequences per ten runs from 4.2 to 2.2 and raises accuracy from 77.4% to 82.8%. While lower temperature mitigates stochastic noise, it does not eliminate divergence, indicating that architectural or prompting factors also drive inconsistency.
  6. Question‑type differences – Bridge (multi‑hop) questions show higher consistency but slightly lower accuracy than Comparison (yes/no) questions. The constrained answer space of yes/no questions boosts raw accuracy, yet explanations vary more, lowering measured consistency.

Implications and recommendations

  • Runtime monitoring: Running multiple parallel instances and checking for early agreement can trigger automatic retries, human review, or confidence scoring.
  • Focus on the first search: Since most variance originates at step 2, improving query formulation (better prompts, query expansion, learned retrievers) is likely to yield the biggest gains in both consistency and overall performance.
  • Model selection: Claude Sonnet 4.5 simultaneously attains the highest accuracy (≈82%) and the lowest behavioral variance, making it a strong candidate for reliability‑critical deployments. For open‑source settings, Llama 3.1 70B with a reduced temperature (≈0.0–0.3) offers a reasonable trade‑off.
  • Scalability concerns: The study uses a minimal toolset (three tools) and relatively short trajectories (≤ 8 steps). Real‑world agents often involve dozens of tools and longer reasoning chains, suggesting that the combinatorial explosion of decision points could dramatically amplify inconsistency. Monitoring consistency becomes even more crucial as agents scale in capability.
  • Future directions: Extending the analysis to more complex benchmarks (SWE‑bench, WebArena), multimodal inputs, and larger action spaces; exploring reinforcement‑learning‑based policy refinement to stabilize the first‑step query; and systematically studying how model size, training data, and architecture affect consistency.
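The runtime-monitoring recommendation can be made concrete with an early-agreement check: launch a few parallel rollouts, compare their first couple of actions, and escalate when they disagree. The sketch below is a hypothetical policy; the agreement depth of 2 follows the paper's step-2 finding, but the 0.8 threshold and the escalation rule are illustrative choices, not taken from the study.

```python
# Early-agreement monitor over the prefixes of parallel rollouts.
from collections import Counter

def early_agreement(prefixes: list[tuple[str, ...]], depth: int = 2) -> float:
    """Fraction of rollouts whose first `depth` actions match the modal prefix."""
    heads = Counter(p[:depth] for p in prefixes)
    return heads.most_common(1)[0][1] / len(prefixes)

def should_escalate(prefixes: list[tuple[str, ...]],
                    depth: int = 2, threshold: float = 0.8) -> bool:
    """Trigger a retry, human review, or lowered confidence score when
    early agreement falls below the threshold."""
    return early_agreement(prefixes, depth) < threshold

prefixes = [
    ("Think", "Search[capital of X]"),
    ("Think", "Search[capital of X]"),
    ("Think", "Search[X country]"),
]
print(should_escalate(prefixes))  # True: only 2 of 3 rollouts agree on the first query
```

Because divergence concentrates at step 2, checking only short prefixes keeps the monitoring overhead small: the extra rollouts can be cut off after their first search call rather than run to completion.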

Limitations – The work is confined to a single QA benchmark and a lexical search implementation; the temperature ablation uses a small subset of questions; and the consistency metrics focus on discrete action sequences rather than probabilistic uncertainty measures.

Conclusion – Behavioral consistency is a measurable, predictive, and actionable property of LLM‑based agents. Early divergence, especially at the first search step, signals higher error risk, while shorter, stable trajectories correlate with higher accuracy. Simple engineering levers such as temperature tuning and improved query generation can substantially improve both consistency and performance, offering practical pathways toward more reliable agent deployments. The authors release code and data at https://github.com/amanmehta-maniac/agent-consistency.

