EvoXplain: When Machine Learning Models Agree on Predictions but Disagree on Why -- Measuring Mechanistic Multiplicity Across Training Runs

Reading time: 6 minutes

📝 Original Info

  • Title: EvoXplain: When Machine Learning Models Agree on Predictions but Disagree on Why – Measuring Mechanistic Multiplicity Across Training Runs
  • ArXiv ID: 2512.22240
  • Date: 2025-12-23
  • Authors: Chama Bensmail (University of Hertfordshire; Omics Data Solutions LTD)

📝 Abstract

Machine learning models are primarily judged by predictive performance, especially in applied settings. Once a model reaches high accuracy, its explanation is often assumed to be correct and trustworthy. This assumption raises an overlooked question: when two models achieve high accuracy, do they rely on the same internal logic, or do they reach the same outcome via different and potentially competing mechanisms? We introduce EvoXplain, a diagnostic framework that measures the stability of model explanations across repeated training. Rather than analysing the explanation of a single trained model, EvoXplain treats explanations as samples drawn from the training and model selection pipeline itself, without aggregating predictions or constructing ensembles. It examines whether these samples form a single coherent explanatory basin or separate into multiple structured explanatory basins. We evaluate EvoXplain on the Adult Income and Breast Cancer datasets using deep neural networks and Logistic Regression. Although all models achieve high predictive accuracy, explanation stability differs across pipelines. Deep neural networks on Breast Cancer converge to a single explanatory basin, while the same architecture on Adult Income separates into distinct explanatory basins despite identical training conditions. Logistic Regression on Breast Cancer exhibits conditional multiplicity, where basin accessibility is controlled by regularisation configuration. EvoXplain does not attempt to select a correct explanation. Instead, it makes explanatory structure visible and quantifiable, revealing when single instance explanations obscure the existence of multiple admissible predictive mechanisms. More broadly, EvoXplain reframes interpretability as a property of the training pipeline under repeated instantiation, rather than of any single trained model.
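To make the sampling idea concrete, the sketch below (our own illustration, not the authors' released implementation) retrains one model class many times and collects one attribution vector per run, so that explanations become samples from the pipeline rather than a property of a single fit. The choice of unit-norm Logistic Regression coefficients as the "explanation", the 30-run budget, and the resampled split per seed are all illustrative assumptions.

```python
# Minimal sketch of the EvoXplain sampling step (illustrative, not the
# paper's code): repeatedly instantiate the same training pipeline and
# collect one explanation vector per run. Here the "explanation" is the
# unit-normalized coefficient vector of a Logistic Regression model.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def explanation_sample(seed):
    # One instantiation of the pipeline: resample the split, refit,
    # and return the fitted model's attribution vector plus its accuracy.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    scaler = StandardScaler().fit(X_tr)
    clf = LogisticRegression(max_iter=5000).fit(scaler.transform(X_tr), y_tr)
    acc = clf.score(scaler.transform(X_te), y_te)
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w), acc  # unit norm makes runs comparable

samples, accs = zip(*(explanation_sample(s) for s in range(30)))
E = np.stack(samples)  # (n_runs, n_features): one explanation per run
print(E.shape, round(float(np.mean(accs)), 3))
```

The resulting matrix `E` is the raw material the framework analyses: if its rows concentrate around one point of explanation space, the pipeline has a single explanatory basin; if they separate into recurrent regions, accuracy alone was hiding mechanistic multiplicity.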

📄 Full Content

EvoXplain: When Machine Learning Models Agree on Predictions but Disagree on Why -- Measuring Mechanistic Multiplicity Across Training Runs

Chama Bensmail
University of Hertfordshire; Omics Data Solutions LTD
bensmail.chama@gmail.com
Jan 2026
arXiv:2512.22240v3 [cs.LG], 2 Feb 2026

Abstract

Machine learning models are primarily judged by their predictive performance, especially in applied settings. Once a model reaches a high accuracy, its explanation is often assumed to be both correct and trustworthy. However, this assumption raises an often overlooked question: when two models achieve high accuracy, do they depend on the same internal logic, or do they reach the same outcome via different, and potentially competing, mechanisms? We introduce EvoXplain, a simple diagnostic framework that measures the stability of model explanations across repeated training. Rather than analysing the explanation of a single trained model, EvoXplain treats explanations as samples drawn from the training and model selection pipeline itself, without aggregating predictions or constructing ensembles, and examines whether these samples form a single coherent explanation or instead separate into structured regions of explanation space corresponding to distinct explanatory modes. We evaluate EvoXplain on the Breast Cancer and COMPAS datasets across two widely used model classes: Logistic Regression and tree-based models (Random Forests). Although all models achieve high predictive accuracy, their explanations frequently exhibit clear multimodality. Even models commonly assumed to be stable, such as Logistic Regression, can give rise to structured, low-dimensional explanation manifolds that separate into recurrent regions under repeated training on the same data split. Crucially, these distinct explanation modes coexist even at near-identical hyperparameter configurations, indicating genuine explanation non-identifiability rather than smooth sensitivity to regularisation strength.

EvoXplain does not attempt to select a “correct” explanation. Instead, it makes explanatory instability visible and quantifiable, revealing when averaged or single-instance explanations obscure the existence of multiple underlying mechanisms. This provides a practical check on interpretability claims and highlights a failure mode of common explanation-averaging practices used in ensemble and AutoML pipelines. More broadly, EvoXplain reframes interpretability as a property of a model class under repeated instantiation, rather than of any single trained model.

Code: https://github.com/bensmailchama-boop/EvoXplain

1 Introduction

Machine learning models are increasingly deployed in domains where explanations matter, including healthcare, public policy, and scientific discovery. Despite the fact that practitioners routinely retrain models during development and performance tuning, interpretability is typically assessed using a single trained instance, or by averaging explanations across runs, rather than by explicitly examining whether explanations themselves remain consistent under retraining. This practice implicitly assumes that high predictive accuracy corresponds to a single, stable way of reasoning [Lipton, 2018, Rudin, 2019].

There have been sustained challenges to this assumption in the literature. Perhaps the most well-known is the Rashomon effect, which observes that distinct models may achieve similar predictive performance while relying on different internal logics [Breiman, 2001, Marx et al., 2020]. Related work on underspecification highlights that modern learning pipelines often admit many solutions that are indistinguishable in terms of accuracy, leaving training and model selection procedures to select arbitrarily among them [D’Amour et al., 2022]. Other studies have shown that explanations can vary substantially when models are retrained, or when features are correlated or perturbed, including documented failures of attribution methods under retraining, feature overlap, or input transformations [Kindermans et al., 2017, Nogueira et al., 2018, Kumar et al., 2020, Frye et al., 2020]. Collectively, these results suggest that explanation stability cannot be assumed a priori. However, existing work does not provide a practical way to quantify how much explanatory variability exists in a given setting, nor whether such variability is structured or merely diffuse.

To address this gap, we propose EvoXplain, a simple and easy-to-apply framework for assessing the stability of model explanations across repeated training. Rather than evaluating a single attribution vector, EvoXplain treats explanations as outcomes of repeated training and model selection and analyses them as a distribution induced by the training pipeline itself. By clustering explanation samples obtained from repeated instantiations of the same model class, without aggregating predictions or constructing ensembles, EvoXplain determines whether training consistently yields a s
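The clustering step the introduction describes, grouping per-run explanation vectors into recurrent regions, can be sketched as follows. This is our own illustrative reconstruction, not the paper's released code: the two-mode synthetic data, the `n_basins` helper, and the 0.5 silhouette threshold are all assumptions made for the example.

```python
# Illustrative sketch of basin detection (not the paper's code): given a
# matrix of per-run explanation vectors, cluster them and score candidate
# cluster counts with the silhouette coefficient; a low best score is
# read as a single basin. Synthetic two-mode data stands in for real
# explanation samples collected across training runs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two planted explanatory "basins": each run lands near one of two
# recurrent attribution patterns.
mode_a = np.array([0.8, 0.1, 0.1, 0.0])
mode_b = np.array([0.1, 0.1, 0.0, 0.8])
E = np.vstack([m + rng.normal(0, 0.03, size=(25, 4))
               for m in (mode_a, mode_b)])  # (50 runs, 4 features)

def n_basins(E, k_max=5, threshold=0.5):
    # Try k = 2..k_max clusterings and keep the best silhouette score;
    # below the threshold we treat the samples as one coherent basin.
    best_k, best_s = 1, -1.0
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(E)
        s = silhouette_score(E, labels)
        if s > best_s:
            best_k, best_s = k, s
    return best_k if best_s > threshold else 1

print(n_basins(E))
```

On this synthetic matrix the two planted modes are well separated relative to the noise, so the helper reports two basins; on real pipelines the same machinery distinguishes a single coherent explanation from structured multiplicity.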

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
