Improving Perturbation-based Explanations by Understanding the Role of Uncertainty Calibration
📝 Abstract
Perturbation-based explanations are widely utilized to enhance the transparency of machine-learning models in practice. However, their reliability is often compromised by the unknown model behavior under the specific perturbations used. This paper investigates the relationship between uncertainty calibration - the alignment of model confidence with actual accuracy - and perturbation-based explanations. We show that models systematically produce unreliable probability estimates when subjected to explainability-specific perturbations and theoretically prove that this directly undermines global and local explanation quality. To address this, we introduce ReCalX, a novel approach to recalibrate models for improved explanations while preserving their original predictions. Empirical evaluations across diverse models and datasets demonstrate that ReCalX consistently reduces perturbation-specific miscalibration most effectively while enhancing explanation robustness and the identification of globally important input features.
📄 Content
The ability to explain model decisions and to provide accurate confidence estimates are fundamental requirements for deploying machine learning systems responsibly [34]. Perturbation-based techniques [54] have been established as a popular way to enhance model transparency in practice [6,14]. Such methods systematically modify input features to quantify their importance by evaluating and aggregating the subsequent changes in model outputs [11]. This intuitive principle, together with the flexibility to explain any prediction in a model-agnostic way, has led to widespread adoption across various domains [31].
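The core principle described above — perturb one feature, observe the change in the model's output, and use that change as an importance score — can be illustrated with a minimal occlusion-style sketch. The function names and the toy model below are illustrative, not from the paper:

```python
import numpy as np

def occlusion_importance(predict, x, baseline=0.0):
    """Score each feature by the drop in the predicted class's
    probability when that feature is replaced by a baseline value."""
    p = predict(x)                  # class probabilities for the original input
    k = int(np.argmax(p))           # class being explained
    scores = np.zeros(len(x))
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = baseline        # perturb a single feature
        scores[i] = p[k] - predict(x_pert)[k]  # aggregate the output change
    return scores

# Toy classifier: feature 0 dominates the logit of class 1.
def toy_predict(x):
    z = np.array([1.0, 2.0 * x[0] + 0.1 * x[1]])
    e = np.exp(z - z.max())
    return e / e.sum()

imp = occlusion_importance(toy_predict, np.array([1.5, 0.5]))
```

As the paper argues, the reliability of such scores hinges on the model producing meaningful probabilities on the perturbed inputs `x_pert`, which lie off the training distribution.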
Nevertheless, the application of perturbation-based techniques faces a fundamental challenge: these methods operate by generating inputs that differ substantially from the training distribution [26,20], and models often produce invalid outputs for such perturbed samples [18,8,32]. Consequently, forming explanations by aggregating misleading predictions under perturbations can significantly distort the outcome and compromise its fidelity. Moreover, this can also contribute to the frequently observed instabilities of perturbation-based explanations [1,4,38,13], which reduce their effectiveness and could even be exploited for malicious manipulations [58,3]. These issues naturally raise the question of how to attain reliable model outputs under the specific perturbations used when deriving explanations. A classical approach for this purpose is uncertainty calibration [46,45,25]. It aims to ensure that a model's confidence aligns with its actual accuracy, which is crucial for obtaining meaningful probabilistic predictions. Consider, for instance, the situation in Figure 1, where a classifier detects a bee in the image with 99% confidence. This value is only reliably interpretable if, among all predictions made with that confidence, precisely 99% are indeed correct. While calibration has been extensively studied in the machine learning literature [63], its role in explanation methods remains largely unexplored. First empirical evidence suggests that basic calibration might indeed benefit explainability [55,39,40], but a rigorous theoretical understanding is still missing. In this work, we provide the first comprehensive analysis of the relationship between uncertainty calibration and perturbation-based explanations. Our findings establish calibration as a fundamental prerequisite for reliable model explanations, and we propose a practical solution for enhancing the quality of perturbation-based explanation methods (see Figure 1).

Figure 1: Perturbation-based explanation methods typically query the model on modified inputs and aggregate the resulting prediction changes to identify relevant features. However, we show that models typically produce significantly miscalibrated output probabilities under commonly used perturbations. This means that the underlying predictions used to derive explanations do not reflect actual changes in class likelihoods, obscuring true feature importance. To mitigate this, we propose ReCalX as a simple recalibration technique that enables reliable outputs under explainability-specific perturbations, leading to more informative explanation results.
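Miscalibration of the kind discussed here is commonly quantified with the expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between average confidence and empirical accuracy is averaged across bins. The paper's exact metric is not given in this excerpt; the following is a minimal sketch of the standard binned ECE, which one could evaluate separately on clean and on perturbed inputs:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted average of |accuracy - confidence|
    over equal-width confidence bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # weight each bin by the fraction of samples it contains
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

A perfectly calibrated model (e.g., 80% accuracy among predictions made with 0.8 confidence) yields an ECE of zero, whereas systematic overconfidence on perturbed samples drives the ECE up.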
More precisely, we make the following contributions:
• We provide a rigorous theoretical analysis revealing how poor calibration can bias the results of common perturbation-based explainability techniques.
• We show that common neural classifiers for tabular and image data exhibit high levels of miscalibration under explainability-related perturbations.
• We propose ReCalX, a novel approach that increases the reliability of model outputs under the particular perturbations used to derive explanations and validate its effectiveness.
• We demonstrate that after applying ReCalX, explanation results are significantly more robust and better identify input features that are relevant for high performance.
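The excerpt states that ReCalX recalibrates a model under explainability-specific perturbations while preserving its original predictions, but does not spell out the mechanism. One standard family of recalibration maps with exactly this prediction-preserving property is temperature scaling; the sketch below is an assumption-laden illustration of that general idea (fitting the temperature on perturbed held-out samples), not the paper's actual method:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def nll(T, logits, labels):
    """Negative log-likelihood of temperature-scaled probabilities."""
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing NLL on held-out samples
    (here imagined to be *perturbed* inputs, in the spirit of ReCalX).
    Dividing logits by T > 0 is monotone, so argmax predictions
    are preserved while confidences are rescaled."""
    return min(grid, key=lambda T: nll(T, logits, labels))
```

Because the map is monotone in the logits, the recalibrated model agrees with the original on every predicted class, matching the prediction-preservation requirement stated above.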
Basic Notation Consider a probability space (Ω, F, P), where Ω is the sample space, F is a σ-algebra of events, and P is a probability measure. Let X : Ω → X be a random variable taking values in the feature space X, and let Y : Ω → Y be a random variable taking values in the target space Y. The joint data distribution of (X, Y) is denoted by P_{X,Y}, the conditional distribution of Y given X by P_{Y|X}, and the marginal distributions by P_X and P_Y. During our theoretical analysis, we will also make use of the following quantities. The mutual information between two random variables X and Y measures the reduction in uncertainty of one variable given the other and, for discrete variables, is defined as:

$$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{X,Y}(x, y) \log \frac{P_{X,Y}(x, y)}{P_X(x)\, P_Y(y)}$$
A related measure is the Kullback-Leibler (KL) divergence, which quantifies the difference between two probability distributions P and Q over the same sample space:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
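These two quantities are directly related: the mutual information I(X; Y) equals the KL divergence between the joint distribution P_{X,Y} and the product of its marginals P_X ⊗ P_Y. A minimal numeric check of this identity for discrete distributions (function names are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions, summed where p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(joint):
    """I(X; Y) as the KL divergence between the joint distribution
    and the product of its marginals."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y (columns)
    return kl_divergence(joint.ravel(), (px * py).ravel())

# Independent variables carry zero mutual information.
indep = np.outer([0.3, 0.7], [0.5, 0.5])
```

For a perfectly correlated joint such as [[0.5, 0], [0, 0.5]], the same function returns log 2, the maximum for a binary variable.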
Perturbation-based Explanations Perturbation-based explanations quantify the importance of individual input features by evaluating how a model's output changes under specific input corruptions. Let f : X → [0, 1]^K be a classifier.
This content is AI-processed based on ArXiv data.