Can Causality Cure Confusion Caused By Correlation (in Software Analytics)?

Notice: This research summary and analysis were generated automatically using AI. For complete accuracy, please refer to the original arXiv source.

Background: Symbolic models, particularly decision trees, are widely used in software engineering for explainable analytics in defect prediction, configuration tuning, and software quality assessment. Most of these models rely on correlational split criteria, such as variance reduction or information gain, which identify statistical associations between X and Y but cannot establish causation. Recent empirical studies in software engineering show that both correlational models and causal discovery algorithms suffer from pronounced instability. This instability arises from two complementary issues: (1) correlation-based methods conflate association with causation, and (2) causal discovery algorithms rely on heuristic approximations to cope with the NP-hard nature of structure learning, causing their inferred graphs to vary widely under minor input perturbations. Together, these issues undermine trust, reproducibility, and the reliability of explanations in real-world SE tasks.

Objective: This study investigates whether incorporating causality-aware split criteria into symbolic models can improve their stability and robustness, and whether such gains come at the cost of predictive or optimization performance. We additionally examine how the stability of human expert judgments compares to that of automated models.

Method: Using 120+ tasks from the MOOT repository of multi-objective optimization problems, we evaluate stability through a preregistered bootstrap-ensemble protocol that measures the variance of win-score assignments. We compare the stability of human causal assessments with that of correlation-based decision trees (EZR) and of causality-aware trees, which leverage conditional-entropy split criteria and confounder filtering. Stability and performance differences are analyzed with statistical methods (variance, Gini impurity, the Kolmogorov-Smirnov test, and Cliff's delta).


💡 Research Summary

The paper addresses a critical shortcoming of symbolic models—especially decision trees—commonly used in software engineering for explainable analytics. While these models (e.g., the EZR framework) excel at providing concise, human‑readable rules for tasks such as defect prediction, configuration tuning, and multi‑objective optimization, they rely on correlational split criteria (variance reduction, information gain, Gini impurity). Such criteria treat any statistical association as equally valid, ignoring whether the relationship is truly causal, reversed, or confounded. Recent work (the “Shaky Structures” study) has shown that both correlation‑based learners and causal discovery algorithms suffer from pronounced instability: minor changes in sampling, preprocessing, or algorithmic parameters can produce wildly different trees or causal graphs, undermining trust, reproducibility, and the reliability of explanations.

The authors formulate three research questions: (RQ1) How stable are human expert causal judgments compared to automated models? (RQ2) Does incorporating causality‑aware split criteria improve model stability relative to traditional correlation‑based splits? (RQ3) If stability improves, does it come at the cost of predictive or optimization performance?

To answer these questions, the study uses over 120 multi‑objective optimization problems from the MOOT repository, spanning software configuration tuning, project health prediction, video encoding, and more. A preregistered bootstrap‑ensemble protocol generates many resampled training sets; each is used to train a model, and the resulting win‑score (derived from distance‑to‑heaven, d2h) variance across ensembles quantifies instability.
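The bootstrap-ensemble protocol described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual code: the toy `train_and_pick` learner, the `d2h` normalization details, and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def d2h(row, goals):
    """Distance-to-heaven: how far a row's normalized objective values
    lie from the ideal point (0 for 'min' goals, 1 for 'max' goals),
    scaled to [0, 1]. Lower is better."""
    heaven = np.array([0.0 if g == "min" else 1.0 for g in goals])
    return float(np.linalg.norm(row - heaven) / np.sqrt(len(goals)))

def bootstrap_instability(data, goals, train_and_pick, n_boot=20):
    """Train one model per bootstrap resample, score the row each model
    recommends via d2h, and return the variance of those win scores:
    the higher the variance, the less stable the learner."""
    scores = []
    n = len(data)
    for _ in range(n_boot):
        sample = data[rng.integers(0, n, size=n)]  # resample with replacement
        best_row = train_and_pick(sample)          # the model's recommendation
        scores.append(d2h(best_row, goals))
    return float(np.var(scores)), scores

# Toy stand-in for a learner: recommend the row minimizing objective 0.
data = rng.random((100, 2))  # 100 rows, 2 objectives, already in [0, 1]
variance, scores = bootstrap_instability(
    data, goals=["min", "min"],
    train_and_pick=lambda s: s[np.argmin(s[:, 0])])
```

In the study, `train_and_pick` would be a full EZR or causal-aware tree; here it is a one-liner just to make the protocol runnable end to end.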

Two model families are compared. The baseline is EZR, which selects splits that maximize variance reduction—a purely correlational approach. The proposed causal‑aware variant replaces the split criterion with a conditional‑entropy based score: for each candidate feature X, compute H(Y|X) and normalize it by H(Y) to obtain CausalScore = H(Y|X)/H(Y). Features with the lowest scores are chosen, reflecting a stronger reduction in uncertainty about the target Y and thus a higher likelihood of a causal X→Y relationship. Additionally, a pre‑pruning step removes potential confounders: features whose mutual information with Y disappears when conditioning on another variable Z (i.e., I(X;Y|Z) falls below a threshold) are excluded, effectively blocking Pearl’s back‑door paths.
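For discrete features, the split score and the confounder filter just described can be sketched as below. This is a simplified illustration: the function names, the 0.05 threshold, and the pairwise search over a single conditioning variable Z are assumptions, not the paper's implementation.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a discrete sequence."""
    counts = np.array(list(Counter(list(values)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(y, x):
    """H(Y|X) = sum over x of p(x) * H(Y | X=x); x values must be hashable."""
    y, x = list(y), list(x)
    h = 0.0
    for xv, cnt in Counter(x).items():
        ys = [yi for yi, xi in zip(y, x) if xi == xv]
        h += (cnt / len(x)) * entropy(ys)
    return h

def causal_score(y, x):
    """CausalScore = H(Y|X) / H(Y): 0 means X fully determines Y,
    1 means X tells us nothing about Y. Lower scores are split first."""
    hy = entropy(y)
    return conditional_entropy(y, x) / hy if hy > 0 else 1.0

def conditional_mutual_info(x, y, z):
    """I(X;Y|Z) = H(X|Z) - H(X|Y,Z) for discrete variables."""
    yz = list(zip(y, z))
    return conditional_entropy(x, z) - conditional_entropy(x, yz)

def drop_confounded(features, y, threshold=0.05):
    """Pre-pruning step: exclude feature X whenever some other feature Z
    'explains away' its link to Y, i.e. I(X;Y|Z) < threshold."""
    return {name: x for name, x in features.items()
            if not any(conditional_mutual_info(x, y, z) < threshold
                       for other, z in features.items() if other != name)}
```

As a sanity check, a feature identical to Y scores 0 (maximal uncertainty reduction), while a feature independent of Y scores 1 and would never be chosen for a split.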

The experimental results are striking. Human expert judgments exhibit moderate stability but are still subject to variability across tasks. Causal‑aware trees demonstrate a 35‑50 % reduction in win‑score variance compared to EZR trees, with statistical significance confirmed by Kolmogorov‑Smirnov tests (p < 0.01) and large effect sizes (Cliff’s delta > 0.6). Crucially, predictive accuracy (measured by win‑score or d2h) and multi‑objective optimization quality show no meaningful degradation; in several datasets, the causal trees even marginally outperform the baseline. Thus, incorporating causal reasoning yields substantially more robust explanations without sacrificing performance.
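The two statistics behind these comparisons can be computed with plain NumPy, as in the self-contained sketch below (the paper presumably uses library implementations such as `scipy.stats.ks_2samp`; this version is for illustration only).

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical gap
    between the two empirical CDFs (the 'statistic' half of a KS test)."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

def cliffs_delta(a, b):
    """Cliff's delta: P(A > B) - P(A < B) over all cross-pairs.
    |delta| > 0.474 is conventionally interpreted as a large effect."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    gt = sum(int((x > b).sum()) for x in a)
    lt = sum(int((x < b).sum()) for x in a)
    return (gt - lt) / (len(a) * len(b))
```

For instance, if `a` held the per-dataset win-score variances of the causal-aware trees and `b` those of the EZR baseline, a strongly negative delta would indicate that the causal variant's variances are systematically smaller.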

The authors conclude that causality‑informed split criteria and systematic confounder filtering can dramatically improve the reliability of symbolic models in software analytics. This enhances practitioner trust, supports reproducible research, and opens avenues for human‑machine collaboration where automated models provide stable, causally plausible insights while humans contribute domain expertise. Future work is suggested on extending the approach to richer causal structures (mediators, moderators) and on integrating online learning for real‑time software systems.

