Beyond single-model XAI: aggregating multi-model explanations for enhanced trustworthiness
The use of Artificial Intelligence (AI) models in real-world and high-risk applications has intensified the discussion about their trustworthiness and ethical usage, from both a technical and a legislative perspective. The field of eXplainable Artificial Intelligence (XAI) addresses this challenge by proposing explanations that bring to light the decision-making processes of complex black-box models. Despite being an essential property, the robustness of explanations is often an overlooked aspect during development: only robust explanation methods can increase the trust in the system as a whole. This paper investigates the role of robustness through the usage of a feature importance aggregation derived from multiple models ($k$-nearest neighbours, random forest and neural networks). Preliminary results showcase the potential in increasing the trustworthiness of the application, while leveraging multiple models’ predictive power.
💡 Research Summary
The paper addresses a pressing issue in the deployment of AI systems in high‑risk domains: the need for trustworthy and robust explanations of black‑box models. While the field of eXplainable AI (XAI) has produced many methods for generating local or global explanations, the robustness of those explanations—i.e., their stability under small, realistic input perturbations—has received comparatively little attention. The authors formulate two research questions: (1) can the disagreement problem, where multiple explanation methods give conflicting attributions for the same instance, be mitigated by aggregating explanations from several heterogeneous models; and (2) can a quantitative robustness score be computed to decide whether a particular explanation should be trusted?
To explore these questions, the authors select three fundamentally different classifiers: k‑Nearest Neighbours (k‑NN), Random Forests (RF), and Neural Networks (NN). For each model they devise a dedicated feature‑importance extraction algorithm that yields a vector of attributions compatible with the other two.
- For k‑NN, the algorithm identifies the k nearest neighbours belonging to the predicted class (N_c) and to the opposite class (N_¬c). It computes the average feature‑wise distance from the query point to each set (D_c and D_¬c) and defines the attribution as e = D_¬c – D_c, followed by L2 normalisation. This captures which features most separate the two classes in the local neighbourhood.
- For Random Forests, the method traverses each tree’s decision path for the query point, accumulates the impurity reduction (Gini or cross‑entropy) of every split node, and stores the sums in two vectors: e_c for trees that agree with the ensemble’s majority vote and e_¬c for those that disagree. The final attribution is a weighted combination (p_¬c + ε)·e_c – p_c·e_¬c, where p_c and p_¬c are the class probabilities, and ε = 0.01 prevents division by zero. The result is also L2‑normalised.
- For Neural Networks, the authors employ DeepLIFT, a back‑propagation‑like technique that distributes the difference between the output and a reference activation to the input features. The resulting vector is normalised to unit length to match the other two.
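The k‑NN rule and the Random Forest weighting described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`knn_attribution`, `rf_combine`, and the helpers) are hypothetical, "feature‑wise distance" is read as the absolute per‑feature difference, neighbours are ranked by Euclidean distance, and the two Random Forest vectors e_c and e_¬c are assumed to have been computed already.

```python
import math

def l2_normalise(v):
    """Scale v to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def mean_featurewise_distance(query, points):
    """Average absolute per-feature difference between query and points
    (assumed reading of 'feature-wise distance')."""
    n = len(points)
    return [sum(abs(p[j] - query[j]) for p in points) / n
            for j in range(len(query))]

def knn_attribution(query, X, y, predicted_class, k=3):
    """Hypothetical helper: e = D_not_c - D_c, followed by L2 normalisation."""
    def nearest(points):
        # Assumption: neighbours are ranked by Euclidean distance to the query.
        return sorted(points, key=lambda p: math.dist(p, query))[:k]

    same = nearest([x for x, lab in zip(X, y) if lab == predicted_class])
    other = nearest([x for x, lab in zip(X, y) if lab != predicted_class])
    d_c = mean_featurewise_distance(query, same)
    d_not_c = mean_featurewise_distance(query, other)
    return l2_normalise([b - a for a, b in zip(d_c, d_not_c)])

def rf_combine(e_c, e_not_c, p_c, p_not_c, eps=0.01):
    """Hypothetical helper: (p_not_c + eps) * e_c - p_c * e_not_c, L2-normalised.
    e_c / e_not_c are the accumulated impurity reductions of agreeing and
    disagreeing trees, assumed to be given."""
    raw = [(p_not_c + eps) * a - p_c * b for a, b in zip(e_c, e_not_c)]
    return l2_normalise(raw)

# Toy data: the two classes differ only in the first feature.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
y = [0, 0, 1, 1]
print(knn_attribution([0.0, 0.5], X, y, predicted_class=0, k=2))

# Unanimous forest: p_not_c = 0 and e_not_c is all zeros.
print(rf_combine([3.0, 4.0], [0.0, 0.0], p_c=1.0, p_not_c=0.0))
```

In the toy k‑NN call, only the first feature separates the classes, so all of the attribution mass lands on it. The second call illustrates one plausible reading of the role of ε: with a unanimous forest the unweighted combination would be the zero vector, which cannot be scaled to unit length.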
Having obtained three unit‑norm attribution vectors (a_1, a_2, a_3), the authors aggregate them by a simple arithmetic mean: a_agg = (1/L) Σ_l a_l, with L = 3. This averaging has two desirable properties: it penalises strong sign disagreements (since opposite signs cancel out) and it limits the influence of any single model that is uncertain about a feature (i.e., has a near‑zero coefficient). When the three models predict different classes, the authors flip the sign of the attributions belonging to the disagreeing model(s) before averaging, ensuring that all vectors refer to the same output class.
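The averaging step, including the sign flip for disagreeing models, might look like the sketch below. It is a minimal illustration under assumptions the summary does not spell out: the task is treated as binary, the class that all vectors should refer to is passed in explicitly as `target_class`, and the function name `aggregate` is hypothetical.

```python
def aggregate(attributions, predictions, target_class):
    """Arithmetic mean of per-model attribution vectors.

    attributions: list of equal-length, unit-norm vectors (one per model).
    predictions:  class predicted by each model for the same instance.
    target_class: class the aggregated explanation should refer to
                  (assumption: supplied by the caller, binary setting).
    """
    # Flip the sign of vectors from models that predicted a different class,
    # so that every vector explains the same output class.
    aligned = [
        a if pred == target_class else [-x for x in a]
        for a, pred in zip(attributions, predictions)
    ]
    n_models = len(aligned)
    # Feature-wise mean: a_agg = (1/L) * sum_l a_l
    return [sum(col) / n_models for col in zip(*aligned)]

# Two models agree on class 0; the third predicts class 1 and is flipped.
print(aggregate([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]], [0, 0, 1], target_class=0))

# All models predict class 0, but two of them disagree on the sign of feature 0.
print(aggregate([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]], [0, 0, 0], target_class=0))
```

The second call shows the cancellation property: opposite signs on the first feature average to zero, so only the second feature keeps a non‑zero (and reduced) weight in the aggregate.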
Robustness is measured using the estimator introduced in a prior work (reference