📝 Original Info
- Title: EvoXplain: When Machine Learning Models Agree on Predictions but Disagree on Why – Measuring Mechanistic Multiplicity Across Training Runs
- ArXiv ID: 2512.22240
- Date: 2025-12-23
- Authors: Chama Bensmail (University of Hertfordshire; Omics Data Solutions LTD)
📝 Abstract
Machine learning models are primarily judged by predictive performance, especially in applied settings. Once a model reaches high accuracy, its explanation is often assumed to be correct and trustworthy. This assumption raises an overlooked question: when two models achieve high accuracy, do they rely on the same internal logic, or do they reach the same outcome via different and potentially competing mechanisms? We introduce EvoXplain, a diagnostic framework that measures the stability of model explanations across repeated training. Rather than analysing the explanation of a single trained model, EvoXplain treats explanations as samples drawn from the training and model selection pipeline itself, without aggregating predictions or constructing ensembles. It examines whether these samples form a single coherent explanatory basin or separate into multiple structured explanatory basins. We evaluate EvoXplain on the Adult Income and Breast Cancer datasets using deep neural networks and Logistic Regression. Although all models achieve high predictive accuracy, explanation stability differs across pipelines. Deep neural networks on Breast Cancer converge to a single explanatory basin, while the same architecture on Adult Income separates into distinct explanatory basins despite identical training conditions. Logistic Regression on Breast Cancer exhibits conditional multiplicity, where basin accessibility is controlled by regularisation configuration. EvoXplain does not attempt to select a correct explanation. Instead, it makes explanatory structure visible and quantifiable, revealing when single instance explanations obscure the existence of multiple admissible predictive mechanisms. More broadly, EvoXplain reframes interpretability as a property of the training pipeline under repeated instantiation, rather than of any single trained model.
📄 Full Content
EvoXplain: When Machine Learning Models Agree on
Predictions but Disagree on Why
Measuring Mechanistic Multiplicity Across Training Runs
Chama Bensmail
University of Hertfordshire
Omics Data Solutions LTD
bensmail.chama@gmail.com
Jan 2026
Abstract
Machine learning models are primarily judged by their predictive performance, especially in applied settings. Once a model reaches high accuracy, its explanation is often assumed to be both correct and trustworthy. However, this assumption raises an often overlooked question: when two models achieve high accuracy, do they depend on the same internal logic, or do they reach the same outcome via different, and potentially competing, mechanisms?

We introduce EvoXplain, a simple diagnostic framework that measures the stability of model explanations across repeated training. Rather than analysing the explanation of a single trained model, EvoXplain treats explanations as samples drawn from the training and model selection pipeline itself, without aggregating predictions or constructing ensembles, and examines whether these samples form a single coherent explanation or instead separate into structured regions of explanation space corresponding to distinct explanatory modes.

We evaluate EvoXplain on the Breast Cancer and COMPAS datasets across two widely used model classes: Logistic Regression and tree-based models (Random Forests). Although all models achieve high predictive accuracy, their explanations frequently exhibit clear multimodality. Even models commonly assumed to be stable, such as Logistic Regression, can give rise to structured, low-dimensional explanation manifolds that separate into recurrent regions under repeated training on the same data split. Crucially, these distinct explanation modes coexist even at near-identical hyperparameter configurations, indicating genuine explanation non-identifiability rather than smooth sensitivity to regularisation strength.

EvoXplain does not attempt to select a "correct" explanation. Instead, it makes explanatory instability visible and quantifiable, revealing when averaged or single-instance explanations obscure the existence of multiple underlying mechanisms. This provides a practical check on interpretability claims and highlights a failure mode of common explanation-averaging practices used in ensemble and AutoML pipelines. More broadly, EvoXplain reframes interpretability as a property of a model class under repeated instantiation, rather than of any single trained model.
Code: https://github.com/bensmailchama-boop/EvoXplain
arXiv:2512.22240v3 [cs.LG] 2 Feb 2026
1 Introduction
Machine learning models are increasingly deployed in domains where explanations matter, including healthcare, public policy, and scientific discovery. Although practitioners routinely retrain models during development and performance tuning, interpretability is typically assessed using a single trained instance, or by averaging explanations across runs, rather than by explicitly examining whether explanations themselves remain consistent under retraining. This practice implicitly assumes that high predictive accuracy corresponds to a single, stable way of reasoning [Lipton, 2018, Rudin, 2019].

There have been sustained challenges to this assumption in the literature. Perhaps the best known is the Rashomon effect: distinct models may achieve similar predictive performance while relying on different internal logics [Breiman, 2001, Marx et al., 2020]. Related work on underspecification highlights that modern learning pipelines often admit many solutions that are indistinguishable in terms of accuracy, leaving training and model selection procedures to select arbitrarily among them [D'Amour et al., 2022]. Other studies have shown that explanations can vary substantially when models are retrained, or when features are correlated or perturbed, including documented failures of attribution methods under retraining, feature overlap, or input transformations [Kindermans et al., 2017, Nogueira et al., 2018, Kumar et al., 2020, Frye et al., 2020]. Collectively, these results suggest that explanation stability cannot be assumed a priori. However, existing work does not provide a practical way to quantify how much explanatory variability exists in a given setting, nor whether such variability is structured or merely diffuse.

To address this gap, we propose EvoXplain, a simple and easy-to-apply framework for assessing the stability of model explanations across repeated training. Rather than evaluating a single attribution vector, EvoXplain treats explanations as outcomes of repeated training and model selection and analyses them as a distribution induced by the training pipeline itself. By clustering explanation samples obtained from repeated instantiations of the same model class, without aggregating predictions or constructing ensembles, EvoXplain determines whether training consistently yields a s
…(Full text truncated)…
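To make the procedure described above concrete, here is a minimal sketch of an EvoXplain-style stability check. It is not the authors' released implementation (see the repository linked in the paper); the choice of the scikit-learn Breast Cancer dataset, bootstrap resampling as the source of pipeline variability, and a k=2 silhouette score as the multimodality probe are all illustrative assumptions. The idea it demonstrates is the paper's core loop: retrain the same model class repeatedly on a fixed split, treat each run's attribution vector as a sample, and ask whether those samples form one explanatory basin or several.

```python
# Sketch of an explanation-stability check across repeated training runs
# (illustrative only; not the EvoXplain reference implementation).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # fixed split: only training varies
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

explanations = []
for seed in range(30):
    # Bootstrap-resample the training set so each run sees a perturbed
    # instantiation of the same pipeline.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    clf = LogisticRegression(max_iter=2000).fit(X_train[idx], y_train[idx])
    coefs = clf.coef_.ravel()
    # Normalise so we compare explanation *directions*, not magnitudes.
    explanations.append(coefs / np.linalg.norm(coefs))

E = np.array(explanations)  # one row per run, one column per feature
# A high silhouette score at k=2 suggests the runs split into two
# explanatory basins; a low score suggests a single coherent basin.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(E)
print(f"silhouette(k=2) = {silhouette_score(E, labels):.3f}")
```

In a fuller analysis one would sweep the number of clusters and compare against a null model, but even this sketch shows how multimodality in explanation space can be made quantifiable rather than anecdotal.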
Reference
This content is AI-processed based on ArXiv data.