Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models, such as TranscriptFormer, that use transformer-based architectures to learn a generative model for gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without utilizing computationally intensive deep learning-based representations. Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near-SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, including outperforming foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single-cell gene expression data.
Advances in single-cell transcriptomics have transformed our ability to discover and experimentally characterize cell types across tissues and organisms [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. A major achievement of this research program has been the creation of "cell atlases" that catalogue cellular gene expression at single-cell resolution [9,16]. Cell atlases often contain data from hundreds of millions of cells, with the amount of data expected to grow as sequencing costs continue to drop and new experimental techniques and modalities emerge [17]. For this reason, there is immense interest in leveraging single-cell RNA-sequencing (scRNA-seq) data to extract biological insights about the molecular basis of cellular identity.
Although every cell in an organism is genomically identical, different cell types express different subsets of genes, giving rise to observed phenotypic and functional differences. One seductive idea is that the essential aspects of cellular identity are encoded in the statistical properties of cellular gene expression profiles. If true, this suggests that learning good representations of single-cell data offers a powerful route for understanding cellular identity across cell types, tissues, and species. This perspective has motivated a broad range of statistical approaches for analyzing scRNA-seq datasets.
The hope that complex statistical models can be used to learn new biology is buttressed by the tremendous success of protein language models at tasks such as sequence-structure prediction (see [18,19] for recent reviews). Motivated in part by this work, several groups have trained large self-supervised “foundation models” on cell atlas data [20][21][22][23][24][25][26][27]. Models such as TranscriptFormer use transformer-based architectures to learn a generative model for gene expression (i.e. mRNA counts) by embedding genes into a latent vector space. Gene embeddings from foundation models have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. This is often cited as evidence that models like TranscriptFormer learn general-purpose biological representations of gene expression that can serve as inputs for more ambitious “virtual cell” models [28].
Still, it remains unclear if the representations learned by foundation models capture biological structure beyond that which is already present in appropriately processed scRNA-seq data [29,30]. A growing body of work has shown that simple, interpretable methods can perform remarkably well across diverse single-cell analysis tasks. For example, linear and physics-inspired approaches such as Single-cell Type Order Parameters (scTOP) enable accurate cell-type classification, interpretable visualization of developmental dynamics, and principled analyses of cell fate transitions without requiring large-scale model training [31][32][33][34][35][36]. These observations raise two basic but under-explored questions: how complex is the structure of scRNA-seq data itself? What level of representational sophistication is required to extract the biologically relevant variation captured by current benchmarks?
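To make the flavor of such linear approaches concrete, the following is a minimal sketch of cell-type classification by linear projection in the spirit of scTOP: a query cell is scored against reference cell-type profiles via cosine similarity after a rank-based normalization. The specific normalization and the toy data here are illustrative assumptions; the published scTOP method has its own normalization and projection details.

```python
import numpy as np

def rank_normalize(x):
    """Rank-transform an expression vector and z-score the ranks.
    Rank transforms reduce sensitivity to sequencing depth and
    outliers (an illustrative choice, not the exact scTOP recipe)."""
    ranks = np.argsort(np.argsort(x)).astype(float)
    return (ranks - ranks.mean()) / ranks.std()

def projection_scores(cell, reference):
    """Score one cell against reference cell-type profiles by
    linear projection (cosine similarity after normalization).

    cell      : (n_genes,) raw counts for one cell
    reference : (n_types, n_genes) average profile per cell type
    """
    c = rank_normalize(cell)
    scores = []
    for profile in reference:
        p = rank_normalize(profile)
        scores.append(np.dot(c, p) / (np.linalg.norm(c) * np.linalg.norm(p)))
    return np.array(scores)

# Toy example: two "cell types" with disjoint marker genes.
rng = np.random.default_rng(0)
type_a = np.array([10.0, 8.0, 0.5, 0.2])
type_b = np.array([0.3, 0.4, 9.0, 11.0])
reference = np.vstack([type_a, type_b])

query = type_a + rng.poisson(1.0, size=4)   # noisy cell of type A
scores = projection_scores(query, reference)
predicted = int(np.argmax(scores))          # index 0 corresponds to type A
```

Because there is no training step, the entire "model" is the reference matrix of cell-type profiles, which is what makes such pipelines cheap and interpretable.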
There are several good reasons to believe that embeddings from single-cell foundation models may be less powerful than those from protein language models [37,38]. In contrast to protein sequences, which are discrete, high-quality, and tightly constrained by biophysics, scRNA-seq data are sparse, noisy, and characterized by substantial technical variability, including dropout effects and batch-specific artifacts [39]. Moreover, cellular identity reflects a complex interplay between gene expression, signaling, environmental context, and post-transcriptional regulation, rather than being fully specified by transcript counts alone. This is in stark contrast with proteins, where the information needed to determine a protein’s three-dimensional structure is contained almost entirely in its sequence (Anfinsen’s principle) [40][41][42][43].
Inspired by these considerations, we systematically compared the performance of simple, interpretable pipelines to the reported performance of large-scale single-cell foundation models on common downstream tasks (see Fig. 1) [29]. We find that by carefully choosing pre-processing steps and normalization procedures, it is possible to achieve SOTA or near-SOTA performance using simple representations where cells are viewed as vectors in gene expression space. The performance of these pipelines often exceeds that of foundation models, despite the fact that they require orders of magnitude less computational resources and have almost no free parameters (See SI.B 3 for an extended comparison).
These results motivate a unifying interpretation: much of the biologically relevant structure present in current scRNA-seq benchmarks is already accessible through low-complexity linear representations. Consequent