Systematic Evaluation of Single-Cell Foundation Model Interpretability Reveals Attention Captures Co-Expression Rather Than Unique Regulatory Signal

Reading time: 4 minutes

📝 Original Info

  • Title: Systematic Evaluation of Single-Cell Foundation Model Interpretability Reveals Attention Captures Co-Expression Rather Than Unique Regulatory Signal
  • ArXiv ID: 2602.17532
  • Date: 2026-02-19
  • Authors: Not provided in the source metadata (check the original paper and add if needed)

📝 Abstract

We present a systematic evaluation framework - thirty-seven analyses, 153 statistical tests, four cell types, two perturbation modalities - for assessing mechanistic interpretability in single-cell foundation models. Applying this framework to scGPT and Geneformer, we find that attention patterns encode structured biological information with layer-specific organisation - protein-protein interactions in early layers, transcriptional regulation in late layers - but this structure provides no incremental value for perturbation prediction: trivial gene-level baselines outperform both attention and correlation edges (AUROC 0.81-0.88 versus 0.70), pairwise edge scores add zero predictive contribution, and causal ablation of regulatory heads produces no degradation. These findings generalise from K562 to RPE1 cells; the attention-correlation relationship is context-dependent, but gene-level dominance is universal. Cell-State Stratified Interpretability (CSSI) addresses an attention-specific scaling failure, improving GRN recovery by up to 1.85×. The framework establishes reusable quality-control standards for the field.

📄 Full Content

The emergence of transformer-based foundation models for single-cell transcriptomics represents a paradigm shift in computational biology [Cui et al., 2024, Theodoris et al., 2023, Yang et al., 2022, Hao et al., 2024]. These models, trained on millions of cells across diverse tissues, learn contextual representations that have shown promise for cell type annotation, perturbation response prediction, and gene regulatory network (GRN) inference [Chen et al., 2024, Rosen et al., 2024]. A particularly compelling promise is mechanistic interpretability: extracting biologically meaningful regulatory circuits directly from attention-derived edge scores. Both scGPT [Cui et al., 2024] and Geneformer [Theodoris et al., 2023] highlight attention-derived gene network inference as a key application, and downstream studies have adopted attention-derived edge scores as regulatory proxies without rigorous validation [Zheng et al., 2024].
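To make the object of study concrete, the sketch below shows one common way such edge scores are derived: pooling a model's post-softmax attention weights over cells and heads into one gene-gene matrix per layer. The array shapes and the `attention_edge_scores` helper are illustrative assumptions, not the actual extraction API of scGPT or Geneformer.

```python
# Sketch: turning per-cell attention maps into gene-gene edge scores.
# Shapes and names are hypothetical; the real extraction interface
# differs between scGPT and Geneformer.
import numpy as np

def attention_edge_scores(attn, symmetrise=True):
    """attn: (n_cells, n_layers, n_heads, n_genes, n_genes) post-softmax
    attention weights over gene tokens. Returns one (n_genes, n_genes)
    edge-score matrix per layer, since the paper reports layer-specific
    structure (PPI early, transcriptional regulation late)."""
    per_layer = attn.mean(axis=(0, 2))  # average cells and heads; keep layers
    if symmetrise:
        # Attention is directional; symmetrising yields undirected edge scores.
        per_layer = 0.5 * (per_layer + per_layer.transpose(0, 2, 1))
    return per_layer

# Toy example: 8 cells, 4 layers, 2 heads, 50 genes.
toy = np.random.default_rng(0).random((8, 4, 2, 50, 50))
edges = attention_edge_scores(toy)
print(edges.shape)  # (4, 50, 50): one candidate GRN per layer
```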

This promise draws on parallel advances in large language model interpretability, where techniques such as activation patching [Meng et al., 2022, Goldowsky-Dill et al., 2023] and automated circuit discovery [Conmy et al., 2023] have identified computational circuits for well-defined behaviours [Elhage et al., 2021, Olsson et al., 2022, Wang et al., 2022]. However, translating these approaches to biology faces unique challenges. Gene regulatory relationships are context-dependent, combinatorial, and only partially captured in reference databases such as TRRUST [Han et al., 2018] and DoRothEA [Garcia-Alonso et al., 2019], which contain only a fraction of true regulatory interactions [Pratapa et al., 2020].
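For readers unfamiliar with activation patching, the toy sketch below shows the core move on a small PyTorch module: cache an activation from a "clean" run, splice it into a "corrupted" run, and check whether the clean behaviour is restored. The module and variable names are illustrative only, not taken from any cited work.

```python
# Minimal activation-patching sketch on a toy module (illustrative, not
# the cited papers' setup).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
layer = model[1]  # patch at the hidden activation

clean_x, corrupt_x = torch.randn(1, 16), torch.randn(1, 16)

# 1) Cache the activation from the "clean" run.
cache = {}
h = layer.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
clean_out = model(clean_x)
h.remove()

# 2) Re-run on the "corrupted" input, splicing in the cached activation
#    (a forward hook that returns a tensor replaces the layer's output).
h = layer.register_forward_hook(lambda m, i, o: cache["act"])
patched_out = model(corrupt_x)
h.remove()

# If patching restores clean-run behaviour, the patched site carries the
# causal signal for that behaviour.
print((patched_out - clean_out).abs().max())
```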

Current practices in single-cell foundation model interpretability rest on several critical and largely untested assumptions. First, that attention patterns directly reflect causal regulatory relationships, an assumption already challenged in the NLP literature [Jain and Wallace, 2019, Serrano and Smith, 2019, Bibal et al., 2022]. Second, that larger datasets consistently improve the reliability of mechanistic interpretations. Third, that attention-derived predictions align with experimental perturbation outcomes from CRISPR screens [Dixit et al., 2016]. Fourth, that mechanistic insights transfer reliably across biological contexts. We address these through a two-tier evaluation framework. The core tier directly evaluates foundation model internal representations (attention weights, intervention effects, and perturbation-outcome prediction) and proposes Cell-State Stratified Interpretability (CSSI) as a constructive method. The boundary-condition tier uses correlation-based edge scores to establish limits that any edge-scoring method must contend with, including cross-species transfer, pseudotime directionality, technical leakage, and uncertainty calibration (Supplementary Notes 7-10).
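As a reference point for that boundary-condition tier, a correlation-based edge score can be as simple as the matrix below. The `correlation_edges` helper is a hypothetical illustration; the paper does not specify its exact estimator in this excerpt.

```python
# Sketch of a correlation-based edge score: a co-expression matrix from a
# cells x genes expression array (illustrative estimator choice).
import numpy as np
from scipy.stats import spearmanr

def correlation_edges(expr):
    """expr: (n_cells, n_genes) log-normalised expression.
    Returns |Spearman rho| between every gene pair as an edge score."""
    rho, _ = spearmanr(expr)        # full (n_genes, n_genes) matrix
    np.fill_diagonal(rho, 0.0)      # ignore self-edges
    return np.abs(rho)

expr = np.random.default_rng(1).random((200, 30))
edges = correlation_edges(expr)
print(edges.shape, edges.max())
```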

Prior benchmarks have evaluated GRN inference methods [Pratapa et al., 2020] and individual foundation model capabilities [Zheng et al., 2024], but none has systematically assessed whether attention-derived edge scores add mechanistic information beyond expression statistics, nor tested this with causal interventions. We address this gap with a reusable evaluation framework integrating thirty-seven complementary analyses (including trivial-baseline comparison, conditional incremental-value testing, expression residualisation, propensity-matched benchmarks, and causal ablation with intervention-fidelity diagnostics) across two foundation model architectures (scGPT, Geneformer V2-316M), four cell types (K562, primary T cells, RPE1 retinal epithelial, iPSC neurons), and two perturbation modalities (CRISPRi, CRISPRa), with 153 statistical tests under Benjamini-Hochberg FDR correction (Supplementary Table 1; Supplementary Note 16).

The framework yields three principal findings. First, attention patterns encode layer-specific biological structure (protein-protein interactions in early layers, transcriptional regulation in late layers; Supplementary Note 17), but this information provides no incremental value for perturbation prediction: trivial gene-level baselines outperform both attention and correlation edges, pairwise edge scores add zero predictive contribution, and causal ablation of “regulatory” heads produces no behavioural effect. Second, the attention-correlation relationship is context-dependent across cell types, but the underlying confound pattern (gene-level features dominate) generalises. Third, CSSI addresses an attention-specific scaling failure through cell-state stratification, providing an immediately deployable constructive tool. The framework itself, with its battery of tests, controls, and diagnostic checks, constitutes a reusable quality-control standard for evaluating mechanistic interpretability claims in single-cell foundation models.
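This excerpt does not spell out CSSI's procedure, but its core idea, scoring edges within cell-state strata rather than pooling all cells, can be sketched as below. The clustering method and the max-over-states aggregation are assumptions made for illustration; the paper's exact recipe may differ.

```python
# Sketch of the idea behind Cell-State Stratified Interpretability (CSSI):
# aggregate edge scores within cell states, then combine across states,
# instead of averaging over all cells at once.
import numpy as np
from sklearn.cluster import KMeans

def cssi_edges(expr, per_cell_edges, n_states=5):
    """expr: (n_cells, n_genes), used only to define cell states.
    per_cell_edges: (n_cells, n_genes, n_genes) per-cell edge scores
    (e.g., attention maps). Returns a stratified edge matrix."""
    states = KMeans(n_clusters=n_states, n_init=10,
                    random_state=0).fit_predict(expr)
    per_state = np.stack([per_cell_edges[states == s].mean(axis=0)
                          for s in range(n_states)])
    # Max over states, so edges active in only one cell state are not
    # washed out by global averaging (the scaling failure CSSI targets).
    return per_state.max(axis=0)

rng = np.random.default_rng(2)
expr = rng.random((120, 20))
edges = cssi_edges(expr, rng.random((120, 20, 20)))
print(edges.shape)  # (20, 20)
```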

We designed an evaluation framework comprising five interlocking test families (Figures 1-6; Supplementary Notes 1-16). (i) Trivial-baseline comparison tests whether pairwise e
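The logic of that first test family can be illustrated on synthetic data: rank candidate regulator-target pairs once with gene-level features alone and once with a pairwise edge score, then compare AUROC. Everything below (features, labels, data) is synthetic and illustrative; only the comparison design mirrors the paper's.

```python
# Sketch of the trivial-baseline comparison: do gene-level features alone
# (no pairwise information) rank perturbation-responsive pairs as well as
# pairwise edge scores? Synthetic data; feature choices are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n_pairs = 2000
# Gene-level features of the target gene (e.g., mean expression, variance).
gene_feats = rng.random((n_pairs, 2))
# Pairwise edge score (e.g., attention or correlation) for the TF-target pair.
edge_score = rng.random(n_pairs)
# Ground-truth label: did perturbing the TF change the target (CRISPRi hit)?
y = (gene_feats[:, 0] + 0.1 * rng.standard_normal(n_pairs) > 0.5).astype(int)

auroc_gene = roc_auc_score(
    y, LogisticRegression().fit(gene_feats, y).predict_proba(gene_feats)[:, 1])
auroc_edge = roc_auc_score(y, edge_score)
print(f"gene-level AUROC={auroc_gene:.2f}, edge AUROC={auroc_edge:.2f}")
# The paper's finding is that the gene-level baseline wins (AUROC 0.81-0.88
# vs 0.70) and adding edge scores contributes nothing on top of it.
```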

Reference

This content is AI-processed based on open access ArXiv data.
