Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis
Root cause analysis (RCA) is essential for diagnosing failures within complex software systems to ensure system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA efforts, particularly for multi-hop fault propagation, where symptoms appear far from their true causes. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance automated RCA. However, their practical value for RCA depends on the fidelity of reasoning and decision-making. Existing work relies on historical incident corpora, operates directly on high-volume telemetry beyond current LLM capacity, or embeds reasoning inside complex multi-agent pipelines – conditions that obscure whether failures arise from reasoning itself or from peripheral design choices. We present a focused empirical evaluation that isolates an LLM’s reasoning behavior. We design a controlled experimental framework that foregrounds the LLM by using a simplified experimental setting. We evaluate six LLMs under two agentic workflows (ReAct and Plan-and-Execute) and a non-agentic baseline on two real-world case studies (GAIA and OpenRCA). In total, we executed 48,000 simulated failure scenarios, totaling 228 days of execution time. We measure both root-cause accuracy and the quality of intermediate reasoning traces. We produce a labeled taxonomy of 16 common RCA reasoning failures and use an LLM-as-a-Judge for annotation. Our results clarify where current open-source LLMs succeed and fail in multi-hop RCA, quantify sensitivity to input data modalities, and identify reasoning failures that predict final correctness. Together, these contributions provide transparent and reproducible empirical results and a failure taxonomy to guide future work on reasoning-driven system diagnosis.
💡 Research Summary
This paper presents a systematic, controlled study of large language models (LLMs) applied to root‑cause analysis (RCA) in cloud‑native microservice environments, focusing specifically on multi‑hop fault propagation. The authors argue that prior work obscures the true reasoning capabilities of LLMs by embedding them in complex multi‑agent pipelines or by processing raw, high‑volume telemetry that exceeds model context windows. To isolate reasoning, they build a simplified experimental framework that (1) extracts alerts from logs, metrics, and traces using deterministic parsers (Drain for logs, Isolation Forest for traces, 3‑sigma rule for metrics), (2) unifies these alerts either chronologically or by reporting element, and (3) constructs an explicit knowledge graph (KG) encoding services, data‑flow, control‑flow, and deployment topology.
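The paper's exact detector configurations are not reproduced here, but the 3-sigma rule it applies to metrics is simple enough to sketch. A minimal, self-contained version (function name and data shape are illustrative assumptions, not the authors' code) might look like:

```python
import statistics

def three_sigma_alerts(values):
    """Flag indices whose value lies more than 3 standard deviations
    from the series mean -- the classic 3-sigma anomaly rule."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std dev of the window
    if sigma == 0:
        return []  # constant series: nothing can be anomalous
    return [i for i, v in enumerate(values) if abs(v - mean) > 3 * sigma]

# A flat latency series with one spike: only the spike is flagged.
latencies = [10.0] * 50 + [100.0]
print(three_sigma_alerts(latencies))  # -> [50]
```

In a real pipeline this would run per-metric over a sliding window; the sketch uses the whole series as the baseline for brevity.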
Three diagnostic workflows are evaluated: a non‑agentic “Straight‑Shot” baseline, ReAct (an observe‑act‑observe loop), and Plan‑and‑Execute (explicit planning followed by execution and replanning). Six open‑source LLMs (including Llama‑2‑70B, Mistral‑7B, and Gemma‑2B) are run under each workflow on two real‑world case studies (GAIA and OpenRCA). In total, 48,000 simulated failure scenarios are generated, consuming 228 days of compute time.
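The two agentic workflows differ in when the model commits to a plan. The skeleton below is an illustrative sketch under assumed interfaces (the tool names, the `llm` callables, and the omission of Plan-and-Execute's replanning step are all simplifications, not the paper's implementation):

```python
def react_loop(llm, tools, question, max_steps=5):
    """ReAct: interleave reasoning and acting, feeding each tool
    observation back into the context before the next decision."""
    context = [question]
    for _ in range(max_steps):
        action, arg = llm(context)          # model chooses the next tool call
        if action == "finish":
            return arg                      # model's final root-cause answer
        context.append(tools[action](arg))  # observe the tool result
    return None                             # step budget exhausted

def plan_and_execute(planner, executor, tools, question):
    """Plan-and-Execute: draft the full tool plan up front, run every
    step, then synthesize an answer (replanning omitted for brevity)."""
    plan = planner(question)                # e.g. [("get_alerts", "checkout")]
    evidence = [tools[tool](arg) for tool, arg in plan]
    return executor(question, evidence)
```

The contrast the paper measures is visible even in this skeleton: ReAct decides one step at a time from accumulated observations, while Plan-and-Execute fixes the investigation strategy before seeing any evidence.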
The authors introduce an “LLM‑as‑a‑Judge” evaluator that automatically scores intermediate reasoning traces. This judge is trained on 3,073 manually labeled traces and achieves a Cohen’s κ of 0.78 against human raters, providing a reproducible metric for trace quality. Using this evaluator, they derive a taxonomy of 16 distinct reasoning failures (e.g., Stalled: no progress; Biased: incorrect assumptions; Confused: mixed propagation paths).
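Cohen's κ, the agreement statistic reported for the judge, corrects raw agreement for the agreement expected by chance. A minimal implementation of the standard formula (not the authors' evaluation code):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected from each rater's label frequencies."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label]
              for label in set(freq_a) | set(freq_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two raters agree on 3 of 4 labels; chance agreement is 0.5, so kappa = 0.5.
print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1]))  # -> 0.5
```

A κ of 0.78, as reported for the judge, falls in the range conventionally described as substantial agreement.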
Key findings:
- Plan‑and‑Execute outperforms ReAct by ~7 percentage points, indicating that explicit planning helps LLMs manage multi‑step diagnosis.
- Multi‑modal alert inputs improve overall accuracy by ~9 percentage points compared with single‑modality inputs, highlighting the importance of fused telemetry.
- The average root‑cause accuracy across models is 62%, revealing substantial room for improvement.
- Specific failure types strongly predict final outcomes: “Stalled” errors drop accuracy to 35%, while the co‑occurrence of “Stalled” and “Confused” raises failure probability to 92%.
- Model‑specific biases emerge, with Llama‑2 showing a higher incidence of “Biased” errors.
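The conditional statistics above (e.g., the 92% failure probability when “Stalled” and “Confused” co-occur) are simple frequency ratios over labeled traces. A sketch of that computation, with a hypothetical trace record shape that is not taken from the paper:

```python
def conditional_failure_rate(traces, required_errors):
    """Estimate P(final answer wrong | trace exhibits all required error
    types) as a frequency over the labeled traces."""
    matching = [t for t in traces if required_errors <= t["errors"]]
    if not matching:
        return None  # no traces exhibit this error combination
    return sum(not t["correct"] for t in matching) / len(matching)

# Toy labeled traces: each has its judged error types and final correctness.
traces = [
    {"errors": {"Stalled", "Confused"}, "correct": False},
    {"errors": {"Stalled"},             "correct": True},
    {"errors": set(),                   "correct": True},
]
print(conditional_failure_rate(traces, {"Stalled"}))  # -> 0.5
```

At the paper's scale (48,000 scenarios), such ratios become reliable enough to rank which reasoning failures most strongly predict a wrong root cause.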
The study concludes that current open‑source LLMs are not yet reliable for autonomous multi‑hop RCA, especially when reasoning traces contain stalls or confusion. The provided taxonomy and automated judging pipeline constitute reusable resources for future research, enabling systematic benchmarking, targeted model fine‑tuning, and the design of more transparent, reasoning‑centric RCA systems.