A Framework for Causal Concept-based Model Explanations


This work presents a conceptual framework for causal concept-based post-hoc Explainable Artificial Intelligence (XAI), built on the requirement that explanations for non-interpretable models be both understandable and faithful to the model being explained. Local and global explanations are generated by calculating the probability of sufficiency of concept interventions. Example explanations are presented, generated with a proof-of-concept implementation that explains classifiers trained on the CelebA dataset. Understandability is demonstrated through a clear concept-based vocabulary, subject to an implicit causal interpretation. Fidelity is addressed by highlighting key framework assumptions, stressing that the context in which an explanation is interpreted must align with the context in which it was generated.


💡 Research Summary

The paper introduces a novel post‑hoc explainable AI (XAI) framework that generates causal, concept‑based explanations for black‑box models. The authors argue that a satisfactory explanation must satisfy two criteria: understandability for human users and fidelity to the underlying model. To meet these goals, they propose to abstract the model’s input space into a set of high‑level, human‑interpretable concepts (e.g., “wearing glasses”, “has beard”, “smiling”) and then evaluate the effect of intervening on each concept using the probability of sufficiency (PoS). PoS is defined as the probability that, when a particular concept is present (or forced to be present), the model will output a target class. By estimating PoS through Monte‑Carlo simulations of concept interventions, the framework yields both local (instance‑specific) and global (dataset‑wide) explanations.
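The Monte Carlo estimate of PoS described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `model`, `intervene` function, and sample set are hypothetical stand-ins for whatever classifier and concept-editing mechanism are actually used.

```python
import random

def estimate_pos(model, intervene, samples, concept, target_class, n_trials=1000):
    """Monte Carlo estimate of the probability of sufficiency (PoS):
    the fraction of trials in which the model outputs the target class
    after the concept is forced to be present via an intervention."""
    hits = 0
    for _ in range(n_trials):
        x = random.choice(samples)              # draw an instance
        x_do = intervene(x, concept, on=True)   # do(concept = present)
        if model(x_do) == target_class:
            hits += 1
    return hits / n_trials
```

A local explanation would call this on perturbations of a single instance; a global one averages the estimate over the whole dataset.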

The methodology proceeds in four stages. First, a domain‑expert curated concept vocabulary is mapped onto raw data. For images, this involves labeling each sample with binary concept indicators; for text, analogous semantic tags could be used. Second, the concepts are aligned with internal representations of the target model, typically by learning a linear probe or a small neural head that predicts concept presence from intermediate activations. This alignment step ensures that the concepts are grounded in the model’s learned features. Third, concept interventions are simulated: the selected concept is “turned on” or “turned off” while all other inputs remain unchanged, and the model’s predictions are recomputed across many perturbed instances. The proportion of times the target class is predicted under the intervention provides an empirical estimate of PoS. Fourth, the PoS values are visualized. Local explanations highlight which concepts most strongly drive a single prediction, while global explanations rank concepts by their average PoS across the whole dataset, often displayed as bar charts or causal diagrams.
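The alignment step (stage two) can be sketched as a linear probe, i.e. a logistic regression from intermediate activations to binary concept labels. The function names, shapes, and training hyperparameters below are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def train_concept_probe(activations, concept_labels, lr=0.1, epochs=200):
    """Fit a linear probe (logistic regression) that predicts the presence
    of a binary concept from a model's intermediate activations.

    activations: (n_samples, d) float array of hidden-layer activations.
    concept_labels: (n_samples,) array with values in {0, 1}.
    Returns probe weights w (d,) and bias b.
    """
    n, d = activations.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        logits = activations @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - concept_labels        # gradient of the log-loss
        w -= lr * activations.T @ grad / n   # full-batch gradient step
        b -= lr * grad.mean()
    return w, b

def probe_predict(w, b, activations):
    """Predicted probability of concept presence for each activation vector."""
    return 1.0 / (1.0 + np.exp(-(activations @ w + b)))
```

A probe that predicts concept presence well from the activations gives evidence that the concept is grounded in the model's learned features, which is the precondition for the interventions in stage three to be meaningful.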

The authors demonstrate the approach on a ResNet‑50 classifier trained on the CelebA facial attribute dataset. They define a set of 40 visual concepts and compute PoS for each with respect to gender classification. For instance, activating the “glasses” concept raises the probability of predicting “male” from 0.68 to 0.85, indicating that the model has learned an implicit causal link between glasses and gender in the training distribution. Global analysis reveals that “beard” and “thick eyebrows” have the highest sufficiency for the male class, whereas “smile” and “makeup” are most sufficient for the female class. These findings align with human intuition, supporting the claim that the explanations are understandable.

Importantly, the paper is transparent about its assumptions. The sufficiency calculation treats concepts as independent interventions, ignoring potential interactions (e.g., “glasses” and “beard” may co‑occur). The fidelity of the explanations therefore depends on the validity of this independence assumption and on the quality of the concept vocabulary. Moreover, estimating PoS requires many forward passes, which can be computationally expensive for large models or datasets. The authors suggest future work on learning explicit causal graphs among concepts and on using Bayesian optimization to reduce the number of required simulations.

In summary, the proposed framework bridges the gap between low‑level feature importance methods and high‑level, human‑centric explanations by grounding explanations in causal concept interventions. It offers a systematic way to quantify how much each concept contributes to a model’s decision, delivering both local interpretability and global insight while maintaining a clear link to the model’s actual behavior. The work advances the state of XAI by providing a principled, theoretically motivated, and empirically validated approach that could be extended to other domains beyond facial attribute classification.