Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Agentic AI systems combine large language model (LLM) reasoning with external tool invocation and long-horizon task execution. Although these systems are increasingly deployed in practice, their architectural composition introduces reliability challenges that differ from those in traditional software systems and standalone LLM applications. However, there is limited empirical understanding of how faults originate, manifest, and propagate in real-world agentic AI systems. To address this gap, we conduct a large-scale empirical study of faults in agentic AI systems. We collect 13,602 issues and pull requests from 40 open-source agentic AI repositories and apply stratified sampling to select 385 faults for in-depth qualitative analysis. Using grounded theory, we derive taxonomies of fault types, observable symptoms, and root causes. We further apply Apriori-based association rule mining to identify statistically significant relationships among faults, symptoms, and root causes, revealing common fault propagation patterns. Finally, we validate the taxonomy through a developer study with 145 practitioners. Our analysis identifies 37 distinct fault types grouped into 13 higher-level fault categories, along with 13 classes of observable symptoms and 12 categories of root causes. The results show that many failures originate from mismatches between probabilistically generated artifacts and deterministic interface constraints, frequently involving dependency integration, data validation, and runtime environment handling. Association rule mining further reveals recurring propagation pathways across system components, such as token management faults leading to authentication failures and datetime handling defects causing scheduling anomalies. Practitioners rated the taxonomy as representative of real-world failures (mean = 3.97/5), and 83.8% reported that it covered faults they had encountered.


💡 Research Summary

This paper presents a large‑scale empirical investigation of faults in agentic AI systems—software that couples large language model (LLM) reasoning with tool invocation, state management, and long‑horizon task execution. Recognizing that such systems differ fundamentally from both traditional deterministic software and pure conversational LLM applications, the authors set out to characterize how faults arise, how they manifest, and how they propagate across system components.

Data collection began with a systematic GitHub search for repositories tagged “AI agents” that had at least 1,000 stars and 30 issues as of June 2025. After language filtering (Python only) and manual annotation (Cohen’s κ = 0.83), 40 high‑quality open‑source agentic AI projects were retained. Closed issues and merged pull requests were harvested from these projects; an initial keyword filter narrowed the raw set to 19,947 items, and a manual audit of 68 samples showed 68.6 % relevance. To further prune noise, the authors employed GPT‑4.1 as a secondary filter, achieving 83 % precision and 97 % recall on a ground‑truth subset and yielding the final corpus of 13,602 candidate fault reports.
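Filter quality figures like the 83 % precision and 97 % recall reported above are computed by comparing the automated filter's keep/discard decisions against a manually labeled ground truth. A minimal sketch, with invented toy labels rather than the paper's data:

```python
# Compute precision and recall for a binary keep/discard filter,
# as done when evaluating an LLM-based noise filter against a
# manually labeled ground-truth subset. Labels here are illustrative.

def precision_recall(ground_truth, predicted):
    """1 = genuine fault report, 0 = noise, for both label lists."""
    tp = sum(1 for g, p in zip(ground_truth, predicted) if g and p)
    fp = sum(1 for g, p in zip(ground_truth, predicted) if not g and p)
    fn = sum(1 for g, p in zip(ground_truth, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: the filter keeps one noise item and drops one real fault.
truth     = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 1, 1, 0, 1, 0, 0]
p, r = precision_recall(truth, predicted)
print(f"precision={p:.2f} recall={r:.2f}")  # → precision=0.80 recall=0.80
```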

From the cleaned corpus, stratified sampling was used to select 385 fault instances that were representative across repository types, sizes, and domains. The authors then applied grounded theory: open coding generated an initial pool of concepts, axial coding clustered them into higher‑level groups, and selective coding produced the final taxonomy. The result is a hierarchical classification comprising five architectural fault dimensions (Cognitive Control, Runtime Execution, Environment Grounding, Data/Storage, Interface/Integration) and 37 concrete fault types (e.g., Prompt Design Error, Tool‑Parameter Mismatch, Token Expiration Mishandling, Date‑Time Conversion Error, State Synchronization Failure).
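Proportional stratified sampling of the kind described above can be sketched in a few lines: group candidates by stratum (here, repository name), then allocate each stratum a share of the sample proportional to its size. The data, stratum choice, and sample size below are toy values, not the study's:

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, n, seed=0):
    """Draw ~n items, allocating to each stratum in proportion to its size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    total = len(items)
    sample = []
    for members in strata.values():
        # Proportional allocation, guaranteeing at least one per stratum.
        k = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Toy corpus: three hypothetical repositories of different sizes.
faults = [{"repo": "agent-a", "id": i} for i in range(60)] + \
         [{"repo": "agent-b", "id": i} for i in range(30)] + \
         [{"repo": "agent-c", "id": i} for i in range(10)]
picked = stratified_sample(faults, lambda f: f["repo"], 10)
print(len(picked))  # 10 items, spread across all three repositories
```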

Observable symptoms were grouped into 13 classes, including Authentication Failure, Date/Time Inconsistency, Memory Leak, Tool‑Invocation Failure, and Insufficient Logging. Root causes were organized into 12 categories; the most prevalent were Dependency & Integration Failures (19.5 %) and Data & Type Handling Failures (17.6 %). These findings highlight a systematic mismatch between probabilistically generated artifacts (LLM outputs) and the deterministic contracts enforced by external tools, APIs, and runtime environments.

To uncover fault propagation patterns, the authors performed Apriori‑based association‑rule mining on the 385‑case dataset, using a minimum support of 5 % and confidence of 60 %. High‑lift rules revealed recurring pathways such as:

  • Token‑Management Fault → Authentication Failure (lift = 181.5)
  • Incorrect Date‑Time Conversion → Scheduling Anomaly (lift = 121.0)
  • State‑Management Defect → Memory‑Related Symptom (lift = 97.3)
  • Tool‑Response Parsing Error → Data‑Type Mismatch (lift = 85.4)

These rules demonstrate that many faults are not isolated; weak error handling and limited observability cause errors to cascade across component boundaries, turning simple implementation mistakes into hard‑to‑diagnose failures.
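The pairwise mining behind such rules can be sketched without a library: treat each analyzed case as a "transaction" of co-occurring labels (fault type, symptom, root cause), then compute support, confidence, and lift for label pairs that clear the thresholds. The cases and labels below are invented for illustration:

```python
from itertools import combinations
from collections import Counter

def mine_rules(transactions, min_support=0.05, min_confidence=0.6):
    """Pairwise association rules with support, confidence, and lift."""
    n = len(transactions)
    single = Counter(label for t in transactions for label in set(t))
    pair = Counter(frozenset(c) for t in transactions
                   for c in combinations(sorted(set(t)), 2))
    rules = []
    for p, count in pair.items():
        support = count / n
        if support < min_support:
            continue
        a, b = sorted(p)
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / single[lhs]
            if confidence >= min_confidence:
                # lift > 1 means the pair co-occurs more than chance predicts.
                lift = confidence / (single[rhs] / n)
                rules.append((lhs, rhs, support, confidence, lift))
    return rules

cases = [
    {"token-management fault", "authentication failure"},
    {"token-management fault", "authentication failure"},
    {"datetime conversion error", "scheduling anomaly"},
    {"state-management defect", "memory leak"},
]
for lhs, rhs, s, c, l in mine_rules(cases):
    print(f"{lhs} -> {rhs}: support={s:.2f} conf={c:.2f} lift={l:.1f}")
```

The extreme lift values reported in the paper (e.g., 181.5) arise because individually rare labels co-occur almost exclusively with each other.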

The taxonomy’s external validity was assessed through a developer survey involving 145 practitioners experienced with agentic AI. Participants rated the relevance of each fault category on a 5‑point Likert scale, yielding an average rating of 3.97 and a Cronbach’s α of 0.904, indicating strong internal consistency. Moreover, 83.8 % of respondents confirmed that the taxonomy covered faults they had encountered. Qualitative feedback suggested extensions for multi‑agent coordination issues and enhanced observability, but respondents overall endorsed the taxonomy’s completeness.
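The Cronbach's α reported for the survey measures how consistently participants rate the items; it is computed from per-item variances and the variance of each participant's total score. A self-contained sketch, with an invented ratings matrix rather than the survey data:

```python
# Cronbach's alpha: alpha = (k / (k - 1)) * (1 - sum(item variances) / total variance)
# where k is the number of items. The Likert matrix below is illustrative.

def cronbach_alpha(ratings):
    """ratings: list of participants, each a list of per-item scores."""
    k = len(ratings[0])                      # number of survey items
    def var(xs):                             # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[i] for row in ratings]) for i in range(k)]
    total_var = var([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

likert = [                                   # 4 participants x 4 items, 1-5 scale
    [4, 5, 4, 4],
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 3, 2, 3],
]
print(f"alpha = {cronbach_alpha(likert):.3f}")  # → alpha = 0.971
```

Values above roughly 0.9, like the 0.904 reported, indicate that the items behave as a coherent scale.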

In the discussion, the authors argue that the taxonomy can guide the design of debugging pipelines, automated test generation, and runtime monitoring for agentic systems. They recommend embedding contract verification between LLM‑generated prompts and tool interfaces, auto‑generating type schemas, and wrapping external tool calls with robust error‑handling layers to mitigate the dominant Dependency & Integration and Data/Type handling failures. Visualizing high‑lift propagation pathways can inform proactive monitoring of high‑risk components (e.g., token refresh logic, datetime handling modules).
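The contract-verification idea can be made concrete: before executing a tool call, parse the LLM's probabilistic text output and check it against the tool's declared parameter schema, wrapping the whole call in an error-handling layer. The schema format, tool name `schedule_meeting`, and helper names below are hypothetical, a minimal sketch of the recommendation rather than the paper's implementation:

```python
import json

# Hypothetical per-tool parameter schemas (field name -> expected type).
TOOL_SCHEMAS = {
    "schedule_meeting": {"title": str, "start_iso": str, "duration_min": int},
}

class ToolCallError(Exception):
    """Raised when an LLM-generated payload violates a tool's contract."""

def validate_and_call(tool_name, raw_llm_output, registry):
    """Parse, type-check, and execute a tool call; fail loudly on mismatch."""
    try:
        args = json.loads(raw_llm_output)    # LLM output is free-form text
    except json.JSONDecodeError as e:
        raise ToolCallError(f"{tool_name}: output is not valid JSON: {e}")
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        raise ToolCallError(f"unknown tool: {tool_name}")
    for field, expected in schema.items():
        if field not in args:
            raise ToolCallError(f"{tool_name}: missing field '{field}'")
        if not isinstance(args[field], expected):
            raise ToolCallError(
                f"{tool_name}: '{field}' should be {expected.__name__}")
    extra = set(args) - set(schema)
    if extra:
        raise ToolCallError(f"{tool_name}: unexpected fields {sorted(extra)}")
    return registry[tool_name](**args)

def schedule_meeting(title, start_iso, duration_min):
    return f"booked '{title}' at {start_iso} for {duration_min} min"

registry = {"schedule_meeting": schedule_meeting}
good = '{"title": "sync", "start_iso": "2025-06-01T10:00", "duration_min": 30}'
print(validate_and_call("schedule_meeting", good, registry))
```

Rejecting malformed payloads at the boundary converts the silent probabilistic/deterministic mismatches described above into explicit, diagnosable errors.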

Threats to validity include potential sampling bias (the study focuses on popular Python projects), subjectivity in manual labeling, and reliance on GPT‑4.1 for noise reduction, which may miss nuanced fault reports. Future work is proposed to broaden language and platform coverage, incorporate real‑time operational logs, and evaluate automated mitigation strategies derived from the taxonomy.

Overall, the paper demonstrates that faults in agentic AI are structured, hybrid phenomena that blend traditional software bugs with probabilistic LLM errors. By empirically grounding a detailed taxonomy, revealing statistically significant fault propagation patterns, and validating the findings with a sizable practitioner cohort, the study provides a solid foundation for reliability engineering, observability, and systematic debugging of the next generation of AI‑driven autonomous agents.

