"Faithful to What?" On the Limits of Fidelity-Based Explanations
In explainable AI, surrogate models are commonly evaluated by their fidelity to a neural network’s predictions. Fidelity, however, measures alignment to a learned model rather than alignment to the data-generating signal underlying the task. This work introduces the linearity score $λ(f)$, a diagnostic that quantifies the extent to which a regression network’s input–output behavior is linearly decodable. $λ(f)$ is defined as an $R^2$ measure of surrogate fit to the network. Across synthetic and real-world regression datasets, we find that surrogates can achieve high fidelity to a neural network while failing to recover the predictive gains that distinguish the network from simpler models. In several cases, high-fidelity surrogates underperform even linear baselines trained directly on the data. These results demonstrate that explaining a model’s behavior is not equivalent to explaining the task-relevant structure of the data, highlighting a limitation of fidelity-based explanations when used to reason about predictive performance.
💡 Research Summary
The paper tackles a fundamental flaw in many explainable-AI (XAI) approaches that rely on surrogate models: the common practice of evaluating a surrogate by its fidelity to a neural network's predictions does not guarantee that the surrogate captures the underlying data-generating signal that drives predictive performance. Fidelity merely measures agreement between the surrogate and the learned model, not alignment with the true task-relevant structure. To expose and quantify this gap, the authors introduce a diagnostic called the linearity score λ(f). For a regression network f : ℝⁿ → ℝ, λ(f) is defined as the coefficient of determination (R²) between f and its optimal linear surrogate g, where g is the affine function that minimizes the mean-squared error with respect to f over the input domain. Formally, g = arg min_{g′∈L} E[(f(X) − g′(X))²], where L denotes the class of affine functions and the expectation is taken over the input distribution.
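The definition above can be sketched in a few lines of NumPy. This is an illustrative estimator, not the authors' implementation: it treats the network `f` as a black-box callable, approximates the expectation with a finite sample `X` from the input domain, fits the affine surrogate by least squares, and reports the R² of that fit as an estimate of λ(f).

```python
import numpy as np

def linearity_score(f, X):
    """Estimate lambda(f): the R^2 of the best affine surrogate to f.

    f : callable mapping an (m, n) input array to (m,) predictions.
    X : (m, n) sample of inputs approximating the input distribution.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    y = f(X)  # network predictions on the sample
    # Affine design matrix: inputs plus an intercept column.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    # Least-squares affine surrogate g minimizing E[(f(X) - g(X))^2].
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot  # R^2 of g with respect to f

# Sanity check: a perfectly linear "network" should score lambda(f) = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
f_lin = lambda X: X @ np.array([2.0, -1.0, 0.5]) + 3.0
print(round(linearity_score(f_lin, X), 6))  # → 1.0
```

A nonlinear network (e.g. one computing a sum of squared inputs) would score well below 1 under this estimator, which is exactly the regime where the paper argues a high-fidelity linear surrogate can still miss the network's predictive gains.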