Deep neural networks have demonstrated remarkable performance across various domains, yet their decision-making processes remain opaque. Although many explanation methods aim to shed light on the inner workings of DNNs, they exhibit significant limitations: post-hoc explanation methods often struggle to faithfully reflect model behaviors, while self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. To address these challenges, we propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training, yielding two key advances: 1) a fair allocation of the model's prediction scores to image patches, ensuring that explanations inherently align with the model's decision logic, and 2) enhanced interpretability with only minor structural modifications, preserving model performance and compatibility. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art interpretability.
Deep neural networks (DNNs) have achieved remarkable success across numerous applications (Redmon et al. 2016; Vinyals et al. 2015; Antol et al. 2015). Despite their impressive capabilities, a significant challenge persists: the inherent lack of interpretability in their decision-making processes. This limitation raises critical concerns about the reliability and safety of DNNs, particularly in high-stakes applications where interpretability is crucial for ensuring accountability (Molnar 2020; Borys et al. 2023).
Current approaches for explaining DNN predictions can be broadly categorized into two paradigms: post-hoc explanation and self-explaining methods. Post-hoc explanation methods, including gradient-based (Sundararajan, Taly, and Yan 2017; Selvaraju et al. 2017; Yang, Wang, and Bilgic 2023; Li et al. 2023a), perturbation-based (Petsiuk, Das, and Saenko 2018; Fong, Patrick, and Vedaldi 2019; Jethani et al. 2022; Covert, Kim, and Lee 2023), counterfactual-generation-based (Bass et al. 2022; Xie et al. 2024), and attention-based techniques (Abnar and Zuidema 2020; Chefer, Gur, and Wolf 2021a; Qiang et al. 2022; Wu et al. 2024a), are typically applied independently of model training. While widely adopted, these methods often produce unfaithful explanations that inadequately represent model behaviors (Adebayo et al. 2018; Yang and Kim 2019; Kindermans et al. 2019; Hesse, Schaub-Meyer, and Roth 2024). In contrast, self-explaining neural networks (SENNs) (Chen et al. 2019; Brendel and Bethge 2019; Wang, Wang, and Inouye 2021; Hesse, Schaub-Meyer, and Roth 2021; Böhle, Fritz, and Schiele 2022; Chen et al. 2023; Nauta et al. 2023; De Santi et al. 2024; Arya et al. 2024) integrate interpretability directly into the model architecture through specialized designs. By construction, SENNs generate intrinsic explanations aligned with the model’s decision logic, offering greater faithfulness than post-hoc methods. However, SENNs face three major limitations: 1) they often require training from scratch, limiting their compatibility with pre-trained models (Arya et al. 2024); 2) their specialized architectural designs (Chen et al. 2019; Hesse, Schaub-Meyer, and Roth 2021; Chen et al. 2023) often lead to degraded performance compared to standard DNNs; and 3) the inclusion of interpretability modules (Brendel and Bethge 2019; Wang, Wang, and Inouye 2021; Chen et al. 2023) introduces memory and computational overhead, hindering scalability.
Recent research (Lundberg and Lee 2017; Jethani et al. 2022; Covert, Kim, and Lee 2023) establishes the Shapley value (Shapley 1953) as a principled approach to model interpretation, as it quantifies the marginal contribution of individual input components (e.g., image patches) to predictions. While many approaches have made progress in incorporating Shapley values for model interpretation, they still face significant limitations in computational efficiency and attribution accuracy. Conventional approaches (Castro, Gómez, and Tejada 2009; Strumbelj and Kononenko 2010; Lundberg and Lee 2017; Covert and Lee 2021; Mitchell et al. 2022) require extensive model inferences to approximate Shapley values, making them computationally costly and impractical for many applications. Moreover, the discrepancy between the masked data used for Shapley value estimation at test time and the unmasked data seen during training may introduce additional attribution errors. Alternative methods such as FastSHAP (Jethani et al. 2022) and ViT-Shapley (Covert, Kim, and Lee 2023) introduce an auxiliary surrogate model to process masked images and then train an explainer to rapidly estimate Shapley values. However, the explainer ultimately interprets the surrogate model rather than the model to be explained.
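For concreteness, with $N$ denoting the set of image patches and $v(S)$ the model prediction obtained when only the patches in $S$ are retained (notation introduced here purely for illustration), the Shapley value of patch $i$ is the standard quantity

\[
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!} \,\bigl( v(S \cup \{i\}) - v(S) \bigr).
\]

Because the attributions sum to $v(N) - v(\varnothing)$, the prediction score is fairly allocated across patches; the summation over all coalitions is also what makes exact computation intractable and motivates the approximations discussed above.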
To address these challenges, we propose a multitask learning framework that integrates Shapley value estimation directly into the model’s optimization process. By simultaneously optimizing the model for the primary task (e.g., image classification) and for Shapley value estimation with an appropriate trade-off parameter, our framework (see Figure 2) achieves self-interpretable predictions without compromising performance on the primary task. Unlike post-hoc explanation methods that risk misrepresenting a model’s decision logic, our approach learns explanations as part of the model’s reasoning process, ensuring alignment between explanations and decision logic (see Figure 1). Compared to existing SENNs, our framework requires only minor architectural modifications and does not rely on external interpretation modules, preserving computational efficiency and model compatibility. Furthermore, by naturally incorporating masked images during training, our approach circumvents the masking discrepancy associated with post-hoc Shapley estimation without the need for a surrogate model.
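To make the joint objective concrete, the following is a minimal PyTorch-style sketch of one training step, assuming a backbone whose forward pass model(images, mask) returns both class logits and per-patch Shapley estimates phi. The interface names, the uniform mask sampling, the FastSHAP-style least-squares regression target, and the trade-off weight lambda_shap are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def joint_training_step(model, images, labels, optimizer,
                        num_patches=196, lambda_shap=1.0):
    # Hypothetical interface: model(images, mask) -> (logits, phi),
    # where phi has shape (batch, num_patches) and is the model's own
    # per-patch Shapley estimate. All names here are illustrative.
    device = images.device
    b = images.size(0)

    # Primary task: classification on the unmasked image.
    full_mask = torch.ones(b, num_patches, device=device)
    logits_full, phi = model(images, full_mask)
    loss_cls = F.cross_entropy(logits_full, labels)

    # Value function v(S): the logit of the target class under mask S.
    idx = labels[:, None]
    empty_mask = torch.zeros(b, num_patches, device=device)
    rand_mask = (torch.rand(b, num_patches, device=device) > 0.5).float()
    with torch.no_grad():
        v_empty = model(images, empty_mask)[0].gather(1, idx).squeeze(1)
    v_full = logits_full.gather(1, idx).squeeze(1)
    v_rand = model(images, rand_mask)[0].gather(1, idx).squeeze(1)

    # Efficiency constraint: attributions should sum to v(N) - v(empty);
    # enforce it with an additive normalization of phi.
    phi = phi + ((v_full - v_empty - phi.sum(dim=1)) / num_patches)[:, None]

    # FastSHAP-style least-squares regression: the attributions of the
    # sampled coalition should reconstruct v(S) - v(empty). A faithful
    # estimator would draw coalitions from the Shapley kernel distribution;
    # uniform masks are used here only to keep the sketch short.
    loss_shap = ((rand_mask * phi).sum(dim=1) - (v_rand - v_empty)).pow(2).mean()

    # Joint objective with the trade-off parameter lambda_shap.
    loss = loss_cls + lambda_shap * loss_shap
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_cls.item(), loss_shap.item()

Because phi is produced by the same backbone that makes the prediction, no separate explainer or surrogate model is trained in this sketch, and lambda_shap plays the role of the trade-off parameter mentioned above.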
In summary, our key contributions are as follows.