The Dead Salmons of AI Interpretability

Reading time: 5 minutes
...

📝 Original Info

  • Title: The Dead Salmons of AI Interpretability
  • ArXiv ID: 2512.18792
  • Date: 2025-12-21
  • Authors: Maxime Méloux, Giada Dirupo, François Portet, Maxime Peyrard

📝 Abstract

In a striking neuroscience study, the authors placed a dead salmon in an MRI scanner and showed it images of humans in social situations. Astonishingly, standard analyses of the time reported brain regions predictive of social emotions. The explanation, of course, was not supernatural cognition but a cautionary tale about misapplied statistical inference. In AI interpretability, reports of similar "dead salmon" artifacts abound: feature attribution, probing, sparse auto-encoding, and even causal analyses can produce plausible-looking explanations for randomly initialized neural networks. In this work, we examine this phenomenon and argue for a pragmatic statistical-causal reframing: explanations of computational systems should be treated as parameters of a (statistical) model, inferred from computational traces. This perspective goes beyond simply measuring statistical variability of explanations due to finite sampling of input data; interpretability methods become statistical estimators, and findings should be tested against explicit and meaningful alternative computational hypotheses, with uncertainty quantified with respect to the postulated statistical model. It also highlights important theoretical issues, such as the identifiability of common interpretability queries, which we argue is critical to understand the field's susceptibility to false discoveries, poor generalizability, and high variance. More broadly, situating interpretability within the standard toolkit of statistical inference opens promising avenues for future work aimed at turning AI interpretability into a pragmatic and rigorous science.

💡 Deep Analysis

Figure 1: A minimal dead-salmon artifact, in which correlation analysis and probing applied to a fully randomized BERT model on a sentiment analysis task both return highly significant explanations (illustration: dead_salmom_illus.png).

📄 Full Content

In 2009, researchers placed a dead salmon in an MRI scanner, showed it photographs of humans in social situations, and ostensibly asked it to judge their emotions (Bennett et al., 2009). Standard analysis pipelines commonly used at the time surprisingly returned brain voxels as significantly predictive of the emotional situations. The error arose from a failure to correct for multiple comparisons within the statistical analysis pipeline. The "dead salmon" demonstration of false positives contributed to a larger reckoning in neuroscience. For instance, an influential study showed that different research groups obtained different results even when analyzing the same dataset and the same research question (Botvinik-Nezer et al., 2020). Subsequent work identified several sources of statistical fragility. Widely used statistical procedures embedded in standard analysis pipelines were shown to inflate false-positive rates (Eklund et al., 2016), an effect worsened by non-independent analyses producing spuriously large brain-behavior correlations (Vul et al., 2009). Early neuroimaging research was also constrained by small samples and limited data availability (Button et al., 2013; Marek et al., 2022), exacerbating overfitting and spurious associations. Moreover, fMRI has been criticized for offering predictive rather than functional explanations, limiting its clinical relevance (Lyon, 2017). Finally, reverse inference emerged as a central interpretative problem, given that individual neural systems are not uniquely associated with specific cognitive functions (Poldrack, 2006; Duncan & Owen, 2000).
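
To see the statistical mechanism in miniature, the following back-of-the-envelope sketch (not the original study's pipeline) tests thousands of noise "voxels" against a condition label: at an uncorrected p < 0.05 threshold, hundreds of voxels look significant by chance alone, while a Bonferroni correction removes them.

```python
# Minimal illustration (not the original fMRI analysis): mass-univariate
# testing on pure noise, with and without multiple-comparison correction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_scans, n_voxels = 40, 10_000                      # hypothetical scan and voxel counts
condition = np.repeat([0, 1], n_scans // 2)         # "emotional" vs. "neutral" stimuli
noise = rng.standard_normal((n_scans, n_voxels))    # a dead salmon: no signal at all

# One t-test per voxel, as in a naive mass-univariate pipeline.
pvals = stats.ttest_ind(noise[condition == 1], noise[condition == 0], axis=0).pvalue

alpha = 0.05
print("Uncorrected 'significant' voxels:", int((pvals < alpha).sum()))        # roughly 500
print("Bonferroni-corrected voxels:", int((pvals < alpha / n_voxels).sum()))  # typically 0
```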

AI interpretability now faces its own dead salmon issues, similarly calling for a broader reevaluation of its statistical foundations. A growing body of work has shown that many influential methods, including feature attribution (Adebayo et al., 2018), probing classifiers (Ravichander et al., 2021), sparse autoencoders (Heap et al., 2025), circuit discovery (Méloux et al., 2025), and causal abstractions (Sutter et al., 2025), can yield plausible-looking explanations even when applied to random neural networks. In Figure 1, we report a minimal dead-salmon artifact obtained by analyzing the activations of a fully randomized BERT model on a sentiment analysis task, where both correlation analysis and probing find highly significant explanations. Such striking failure modes are particularly troubling as modern AI systems are increasingly deployed in high-stakes domains where AI interpretability should be essential for transparency, accountability, and error diagnosis (Mehrabi et al., 2021; Barnes & Hutson, 2024; Ramachandram et al., 2025). Interpretability methods have the potential to surface critical failure modes (Kim & Canny, 2017; Zech et al., 2018; Caruana et al., 2015; Meng et al., 2022; Monea et al., 2024; Nguyen et al., 2025) and to offer levers for mitigating bias and systematic errors (Arrieta et al., 2019; Kristofik, 2025; Lepori et al., 2025).
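
The artifact is easy to reproduce in miniature. The sketch below is a simplified stand-in rather than the authors' setup: Gaussian noise plays the role of a randomly initialized BERT's activations, and arbitrary binary labels stand in for sentiment labels. With many hidden units and few examples, an uncorrected per-unit correlation analysis finds "significant" units, and an in-sample linear probe fits the labels almost perfectly.

```python
# Minimal sketch of a probing/correlation "dead salmon" (a simplified stand-in,
# not the authors' BERT experiment): the features are pure noise, so any
# "explanation" found here is an artifact of the analysis itself.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_examples, n_units = 100, 768                      # few examples, BERT-sized hidden width
labels = rng.integers(0, 2, n_examples)             # arbitrary binary "sentiment" labels
acts = rng.standard_normal((n_examples, n_units))   # stand-in for random-model activations

# "Correlation analysis": test every unit against the labels and keep the best
# hits, with no multiple-comparison correction.
pvals = np.array([stats.pearsonr(acts[:, j], labels)[1] for j in range(n_units)])
print(f"{int((pvals < 0.05).sum())} of {n_units} units look 'significant'; "
      f"best p = {pvals.min():.1e}")

# "Probing": with 768 features and 100 examples, an in-sample linear probe
# separates the labels almost perfectly, even though they are pure noise.
probe = LogisticRegression(max_iter=5000).fit(acts, labels)
print("In-sample probe accuracy on pure noise:", probe.score(acts, labels))
```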

Yet, despite frequent analogies casting the field as a neuroscience (Barrett et al., 2019), biology (Lindsey et al., 2025), or physics (Allen-Zhu & Li, 2023; Allen-Zhu, 2024) of neural networks, the practice of AI interpretability remains in its early foundational stages. Striking dead-salmon artifacts are accompanied by a general statistical fragility: small perturbations to inputs (Ghorbani et al., 2019; Kindermans et al., 2019; Zhang et al., 2025) or changes in random initialization (Adebayo et al., 2018; Zafar et al., 2021) can radically change explanations. Explanations often fail to generalize to new settings and input distributions (Hoelscher-Obermaier et al., 2023). Moreover, multiple incompatible explanations can be discovered for the same behavior (Méloux et al., 2025; Dombrowski et al., 2019). While the dead salmon study demonstrated a simple statistical oversight correctable through multiple-comparison adjustments, AI interpretability's difficulties stem from more fundamental issues. In particular, we argue that, for common interpretability queries, computational traces do not uniquely determine explanations. Beyond neuroscience and AI, such challenges are not unprecedented. Psychology and the social sciences faced a similar reckoning during the replication crisis, when questionable research practices produced widespread false positives (Open Science Collaboration, 2015; Simmons et al., 2011; Ioannidis, 2005; Schimmack, 2020). These fields responded with methodological reforms: pre-registration, registered reports, increased statistical power, and explicit multiple-comparison corrections (Munafò et al., 2017; Korbmacher et al., 2023). Likewise, econometrics drew on causal inference (Pearl, 2009) to formalize the distinction between correlation and causation, developing identification criteria, sensitivity analyses, and robustness tests (Imbens & Rubin, 2015; Angrist & Pischke, 2009; Heckman, 2007). Now, AI interpretability can also begin to build its own methodological guardrails. As argued bef
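
One way to make such guardrails concrete (a sketch under assumed tooling, continuing the toy example above, not a procedure taken from the paper) is to test a probing result against an explicit null obtained by permuting the labels, and to correct per-unit tests for multiple comparisons. On pure noise, both guardrails typically make the spurious "findings" disappear.

```python
# Sketch of two statistical guardrails for the toy probing example above
# (assumed tooling; not a prescription from the paper).
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
acts = rng.standard_normal((100, 768))   # noise stand-in for a random model's activations
labels = rng.integers(0, 2, 100)

# Guardrail 1: compare the probe's cross-validated accuracy to an explicit null
# built from label permutations. On noise, the CV score hovers near chance and
# the permutation p-value is typically non-significant.
score, _, pval = permutation_test_score(
    LogisticRegression(max_iter=5000), acts, labels,
    cv=5, n_permutations=100, random_state=0)
print(f"CV probe accuracy: {score:.2f}, permutation p-value: {pval:.2f}")

# Guardrail 2: correct per-unit significance tests for multiple comparisons
# (Benjamini-Hochberg FDR). The spurious per-unit "hits" typically vanish.
pvals = np.array([stats.pearsonr(acts[:, j], labels)[1] for j in range(acts.shape[1])])
reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("Units surviving FDR correction:", int(reject.sum()))
```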

📸 Image Gallery

dead_salmom_illus.png

Reference

This content is AI-processed based on open access ArXiv data.
