Root Cause Analysis of Outliers with Missing Structural Knowledge
The goal of Root Cause Analysis (RCA) is to explain why an anomaly occurred by identifying where the fault originated. Several recent works model the anomalous event as resulting from a change in the causal mechanism at the root cause, i.e., as a soft intervention. RCA is then the task of identifying which causal mechanism changed. In real-world applications, one often has either few or only a single sample from the post-intervention distribution: a severe limitation for most methods, which assume one knows or can estimate the distribution. However, even those that do not are statistically ill-posed due to the need to probe regression models in regions of low probability density. In this paper, we propose simple, efficient methods to overcome both difficulties in the case where there is a single root cause and the causal graph is a polytree. When one knows the causal graph, we give guarantees for a traversal algorithm that requires only marginal anomaly scores and does not depend on specifying an arbitrary anomaly score cut-off. When one does not know the causal graph, we show that the heuristic of identifying root causes as the variables with the highest marginal anomaly scores is causally justified. To this end, we prove that anomalies with small scores are unlikely to cause those with larger scores in polytrees and give upper bounds for the likelihood of causal pathways with non-monotonic anomaly scores.
💡 Research Summary
The paper tackles the problem of Root Cause Analysis (RCA) under three realistic constraints: only a single anomalous observation is available, there is exactly one root cause, and the underlying causal graph is a polytree (a directed acyclic graph where the underlying undirected graph is a tree). Traditional RCA methods either require many post‑intervention samples to estimate the anomalous distribution or assume full knowledge of the structural causal model, both of which are often infeasible in practice.
The authors introduce two simple, computationally cheap methods that work under these constraints. Central to both methods is an “information‑theoretic” (IT) anomaly score. Given any scalar anomaly detector τ (e.g., a z‑score), the marginal score for a value x is defined as S(x)=−log P(τ(X)≥τ(x)), i.e., the negative log‑p‑value of observing a detector output at least as extreme as that of x under the distribution of the normal (non‑anomalous) regime. A conditional version S(x|pa)=−log P(τ(X)≥τ(x) | PA=pa) is also defined, but the key insight is that, for polytrees, marginal scores alone carry enough information to locate the root cause.
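As a rough illustration of the marginal score, the sketch below estimates S(x) empirically from detector outputs collected in the normal regime. The function name `it_score`, the add-one smoothing, and the choice of |z| as the detector τ are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def it_score(tau_reference, tau_x):
    """Estimate S(x) = -log P(tau(X) >= tau(x)) from normal-regime detector outputs.

    tau_reference: detector outputs tau(X) on samples from the normal regime.
    tau_x: detector output tau(x) for the observation being scored.
    """
    tau_reference = np.asarray(tau_reference, dtype=float)
    n = tau_reference.size
    exceed = np.count_nonzero(tau_reference >= tau_x)
    # Add-one smoothing (an assumption of this sketch) keeps the estimate
    # finite when no reference sample is as extreme as tau_x.
    p_hat = (exceed + 1) / (n + 1)
    return -np.log(p_hat)

# Hypothetical usage: tau is the absolute z-score of a standard normal variable.
rng = np.random.default_rng(0)
reference = np.abs(rng.normal(size=10_000))
print(it_score(reference, 0.5))  # typical value, small score
print(it_score(reference, 4.0))  # extreme value, large score
```

With this estimator the score cannot exceed log(n+1) for n reference samples, so very extreme observations all saturate at roughly the same value; a parametric tail model would be needed to distinguish them.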
The theoretical development begins with a two‑variable cause‑effect pair X→Y. The authors define “score typicality”, a condition relating the conditional score S(y|x) to the absolute difference |S(y)−S(x)|, and prove several lemmas: (1) H_X⁰ (the hypothesis that X’s mechanism is unchanged) can be rejected at level e^{−S(x)}; (2) H_Y⁰ (the hypothesis that Y’s conditional mechanism is unchanged) can be rejected at level e^{−|S(y)−S(x)|⁺} under score typicality; (3) the probability that an anomaly at X causes a much larger anomaly at Y decays exponentially with the score gap. Further lemmas show that score typicality holds with high probability when X is absolutely continuous, and holds exactly when the mapping x↦S(x) is injective and monotonic.
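The rejection level in (1) follows from the standard fact that a p-value is stochastically no smaller than a uniform variable under the null hypothesis. The short derivation below is a sketch of that reasoning using the summary's notation; the symbol p(x) for the p-value is introduced here for convenience and may differ from the paper's.

```latex
\begin{align*}
p(x) &:= P\bigl(\tau(X) \ge \tau(x)\bigr), \qquad S(x) = -\log p(x), \\
P\bigl(S(X) \ge s\bigr) &= P\bigl(p(X) \le e^{-s}\bigr) \;\le\; e^{-s}
  \quad \text{for all } s \ge 0, \text{ when } X \text{ follows the normal regime.}
\end{align*}
```

For example, an observed score of S(x)=4.6 would allow rejecting H_X⁰ at level e^{−4.6}≈0.01.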
Extending to arbitrary DAGs, the authors note that without structural constraints the problem becomes ill‑posed because estimating conditional distributions in low‑density regions is statistically unstable. However, when the DAG is a polytree, a “score jump” property holds along directed paths: an anomaly with a small score is unlikely to cause one with a much larger score, so a large increase in score from parent to child points to a changed mechanism at the child. This leads to the first algorithm, SMOOTH TRAVERSAL, which assumes the causal graph is known. Starting from the root(s) of the polytree, the algorithm traverses downstream edges only when the child’s marginal score exceeds the parent’s score. The traversal stops at the first node where the score does not increase; that node is returned as the root cause. The algorithm requires only the set of marginal scores, avoids any conditional density estimation, and does not need an arbitrary anomaly‑score threshold. The authors prove that, under the score‑typicality assumption, SMOOTH TRAVERSAL identifies the true root cause with probability at least 1−δ, where δ depends on the exponential bounds derived earlier.
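The following is a minimal sketch of one literal reading of the traversal described above: follow an edge only when the child's marginal score is strictly higher, and report the highest-scoring node at which no further step is possible. It is not the authors' implementation. The function name `greedy_score_traversal`, the adjacency-dict representation, the handling of several graph roots, and the tie-breaking are all assumptions of this example.

```python
def greedy_score_traversal(children, scores):
    """Follow edges parent -> child only where the child's score is strictly higher.

    children: dict mapping each node to a list of its child nodes.
    scores:   dict mapping every node to its marginal anomaly score.
    Returns the highest-scoring node at which the traversal stops.
    """
    has_parent = {c for kids in children.values() for c in kids}
    roots = [v for v in scores if v not in has_parent]

    stack, seen, stops = list(roots), set(), []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        higher = [c for c in children.get(node, []) if scores[c] > scores[node]]
        if higher:
            stack.extend(higher)   # keep descending along increasing scores
        else:
            stops.append(node)     # no child scores higher: traversal stops here
    # Assumption: with several stopping points, report the most anomalous one.
    return max(stops, key=lambda v: scores[v])

# Hypothetical chain A -> B -> C -> D with a score jump at C.
children = {"A": ["B"], "B": ["C"], "C": ["D"]}
scores = {"A": 0.3, "B": 0.9, "C": 6.2, "D": 4.1}
print(greedy_score_traversal(children, scores))
```

Under this literal reading, a normal upstream node whose score happens to dip below its parent's would halt the descent before the root cause is reached; the paper's algorithm may treat that case differently.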
When the causal graph is unknown, the authors propose SCORE ORDERING. All variables are ranked by their marginal IT scores, and the top‑k variables are taken as candidate root causes. The theoretical justification rests on the same exponential bound: in a polytree, the probability that a node with a lower score causes a node with a higher score is exponentially small in the score gap. Consequently, the highest‑scoring variables are the most plausible root causes. The authors acknowledge that this heuristic can fail in the presence of multiple simultaneous interventions or in graphs with cycles, but they argue that for many engineering and biomedical applications the single‑root‑cause, polytree assumption is reasonable.
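The ranking step can be sketched in a few lines. The function name `score_ordering`, the default k, and the service names in the usage example are hypothetical; the scores are assumed to be marginal IT scores computed beforehand.

```python
def score_ordering(scores, k=3):
    """Return the k variables with the highest marginal anomaly scores.

    scores: dict mapping variable name to its marginal IT score.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical microservice scores for a single anomalous observation.
observed = {"gateway": 2.1, "auth": 0.4, "db": 7.8, "cache": 5.3}
print(score_ordering(observed, k=2))
```

No graph is needed here, so the cost is dominated by sorting the scores.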
Empirical evaluation includes synthetic data generated from linear and nonlinear structural equations, a real‑world cloud‑computing microservice dataset where performance degradations are known, and a personalized‑medicine dataset where each patient’s disease is treated as a unique anomaly. The proposed methods are compared against state‑of‑the‑art RCA techniques such as counterfactual contribution analysis, Bayesian network inference, and regression‑based residual analysis. Results show that SMOOTH TRAVERSAL achieves near‑perfect root‑cause identification when the graph is supplied, while SCORE ORDERING attains competitive precision and recall without any graph information. Both methods are orders of magnitude faster than baselines because they avoid iterative likelihood estimation or graph learning.
In summary, the paper makes three main contributions: (1) it demonstrates that marginal IT anomaly scores are sufficient for identifying a single root cause in polytrees; (2) it provides a linear‑time traversal algorithm for known graphs, with probabilistic guarantees under score typicality; (3) it offers a theoretically grounded heuristic for unknown graphs. The work broadens the applicability of RCA to settings where data are scarce, down to a single anomalous observation, while retaining rigorous causal guarantees. Future directions mentioned include extending the theory to multiple simultaneous interventions, handling graphs with loops, and adapting the framework to high‑dimensional, non‑Gaussian mechanisms.