Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics

Reading time: 5 minutes
...

📝 Original Info

  • Title: Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics
  • ArXiv ID: 2602.15253
  • Date: 2026-02-16
  • Authors: Not listed in the provided source (check the original paper and add if needed)

📝 Abstract

Neural scaling laws -- power-law relationships between loss, model size, and data -- have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we construct two experimental regimes: a data-rich regime (512 highly variable genes, 200,000 cells) and a data-limited regime (1,024 genes, 10,000 cells). Across seven model sizes spanning nearly six orders of magnitude in parameter count (533 to 3.4 × 10^8 parameters), we fit the parametric scaling law to validation mean squared error (MSE). The data-rich regime exhibits clear power-law scaling with an irreducible loss floor of c ≈ 1.44, while the data-limited regime shows negligible scaling, indicating that model capacity is not the binding constraint when data are scarce. These results establish that scaling laws analogous to those observed in natural language processing do emerge in single-cell transcriptomics when sufficient data are available, and they identify the data-to-parameter ratio as a critical determinant of scaling behaviour. A preliminary conversion of the data-rich asymptotic floor to information-theoretic units yields an estimate of approximately 2.30 bits of entropy per masked gene position. We discuss implications for the design of single-cell foundation models and outline the additional measurements needed to refine this entropy estimate.

💡 Deep Analysis

📄 Full Content

A central empirical finding in deep learning is that, under appropriate conditions, the loss of a neural network follows a power law as model size, dataset size, or training compute increases (Kaplan et al., 2020; Hestness et al., 2017). These scaling laws have proven remarkably consistent across modalities: they hold for autoregressive language models (Kaplan et al., 2020; Hoffmann et al., 2022), vision transformers (Zhai et al., 2022), and multimodal systems (Cherti et al., 2023). Beyond their theoretical interest, scaling laws have immediate practical value: they enable researchers to predict the performance of expensive large-scale training runs from cheap small-scale experiments, thereby guiding resource allocation and architectural choices.

Single-cell RNA sequencing (scRNA-seq) has become a cornerstone of modern biology, generating expression profiles for millions of individual cells across diverse tissues, species, and disease states (Regev et al., 2017). In response, the community has developed increasingly large foundation models for single-cell data, including scVI (Lopez et al., 2018), scGPT (Cui et al., 2024), Geneformer (Theodoris et al., 2023), scBERT (Yang et al., 2022), and scFoundation (Hao et al., 2024). These models are typically evaluated on downstream tasks such as cell-type annotation, batch correction, or perturbation prediction, but the fundamental question of how pretraining loss scales with model size and data volume has received almost no attention. This gap matters for several reasons. First, without scaling laws, practitioners cannot determine whether a larger model will yield meaningful improvements or whether the current bottleneck is data, compute, or intrinsic noise. Second, the asymptotic loss floor c in a scaling law of the form L = aP^{-α} + c provides an empirical upper bound on the reducible error, which can be related to the intrinsic entropy of the data-generating process (Kaplan et al., 2020; Hoffmann et al., 2022). For transcriptomics, such an estimate would quantify the fundamental information content of gene expression measurements, a quantity of independent biological interest.
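As a concrete illustration of how such a parametric law can be fit in practice, the sketch below uses SciPy's curve_fit on synthetic (parameter count, validation MSE) pairs generated to echo the reported α ≈ 0.23 and floor c ≈ 1.44. The data, initial guesses, and bounds here are illustrative assumptions, not the paper's code or measurements.

```python
# A minimal sketch (not the authors' code) of fitting the parametric law
#   L(P) = a * P**(-alpha) + c
# to (model size, validation MSE) pairs. The data are synthetic, generated
# to echo the reported alpha ≈ 0.23 and floor c ≈ 1.44, purely for illustration.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(P, a, alpha, c):
    """Power law in parameter count P with an irreducible loss floor c."""
    return a * P ** (-alpha) + c

rng = np.random.default_rng(0)
params = np.logspace(3, 8.5, 7)                            # seven model sizes
val_mse = scaling_law(params, a=40.0, alpha=0.23, c=1.44)
val_mse *= 1.0 + 0.02 * rng.standard_normal(params.size)   # small synthetic noise

(a_hat, alpha_hat, c_hat), _ = curve_fit(
    scaling_law, params, val_mse,
    p0=(10.0, 0.2, 1.0),
    bounds=([0.0, 0.0, 0.0], [np.inf, 1.0, np.inf]),        # keep alpha and c physical
)
print(f"alpha ≈ {alpha_hat:.2f}, irreducible floor c ≈ {c_hat:.2f}")
```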

In this work, we take a first step towards establishing scaling laws for single-cell transcriptomics. We train masked-reconstruction transformers, a pretraining paradigm analogous to masked language modelling (Devlin et al., 2019), on scRNA-seq data drawn from the CELLxGENE Census (CZI Single-Cell Biology, 2023), varying the model size across seven presets that span from 533 to 3.4 × 10^8 trainable parameters. We study two regimes that differ in both gene vocabulary size and dataset size, fit power-law models to the resulting loss curves, and analyse the conditions under which scaling behaviour emerges or breaks down.
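To make the pretraining objective concrete, the following PyTorch sketch shows one way masked reconstruction over expression profiles can be set up: a random subset of gene positions has its value replaced by a learned mask embedding, and the model regresses the hidden values with an MSE loss computed only on those masked positions. The embedding scheme, mask fraction, and layer sizes are assumptions for illustration, not the paper's architecture.

```python
# A rough PyTorch sketch (assumptions, not the paper's architecture) of a
# masked-reconstruction objective for scRNA-seq: each cell is a sequence of
# gene tokens carrying a continuous expression value; masked values are hidden
# behind a learned mask embedding and regressed with MSE on masked positions.
import torch
import torch.nn as nn

class MaskedExpressionModel(nn.Module):
    def __init__(self, n_genes=512, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, d_model)        # gene identity
        self.value_proj = nn.Linear(1, d_model)               # expression value
        self.mask_token = nn.Parameter(torch.zeros(d_model))  # stands in for masked values
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                     # predict expression

    def forward(self, expr, mask):
        # expr: (batch, n_genes) expression values; mask: bool, True where masked.
        gene_ids = torch.arange(expr.size(1), device=expr.device)
        val = self.value_proj(expr.unsqueeze(-1))
        val = torch.where(mask.unsqueeze(-1), self.mask_token, val)  # hide masked values
        tok = self.gene_emb(gene_ids)[None] + val                    # keep gene identity visible
        return self.head(self.encoder(tok)).squeeze(-1)

model = MaskedExpressionModel()
expr = torch.rand(8, 512)            # toy batch of normalised expression profiles
mask = torch.rand(8, 512) < 0.15     # mask ~15% of gene positions (assumed fraction)
pred = model(expr, mask)
loss = nn.functional.mse_loss(pred[mask], expr[mask])  # loss on masked positions only
loss.backward()
```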

  1. We present, to our knowledge, the first systematic investigation of neural scaling laws for masked-reconstruction pretraining on scRNA-seq data.

  2. We demonstrate that power-law scaling (α ≈ 0.23, R² ≈ 0.82) emerges in a data-rich regime (200,000 cells, 512 genes), with an identifiable irreducible loss floor.

  3. We show that scaling breaks down (α ≈ 0.009, R² ≈ 0.02) in a data-limited regime (10,000 cells, 1,024 genes), providing direct evidence that data scarcity, not model capacity, is the binding constraint.

  4. We identify the key experimental ingredients (information-theoretic loss functions, matched datasets, and comprehensive token accounting) needed to progress from loss-vs-parameters curves to transcriptomic entropy estimates; one reading of that entropy conversion is sketched after this list.
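One plausible reading of the preliminary entropy conversion mentioned in the abstract treats the residual error at the loss floor as Gaussian with variance c, whose differential entropy is 0.5 · log2(2πe·c) bits; with c ≈ 1.44 this lands close to the ≈ 2.30 bits per masked gene position quoted above. The Gaussian assumption is ours and may differ from the paper's exact derivation.

```python
# Converting an MSE floor to bits under a Gaussian assumption (our reading,
# not necessarily the paper's derivation): if the residual error at the floor
# is Gaussian with variance c, its differential entropy is 0.5*log2(2*pi*e*c).
import math

c = 1.44  # irreducible validation-MSE floor reported for the data-rich regime
bits_per_masked_gene = 0.5 * math.log2(2 * math.pi * math.e * c)
print(f"≈ {bits_per_masked_gene:.2f} bits per masked gene position")  # ≈ 2.31
```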

Neural scaling laws. Hestness et al. (2017) first documented power-law scaling of test error with dataset size across several domains. Kaplan et al. (2020) established that language model cross-entropy loss follows L(P) = (P_c / P)^{α_P} over several orders of magnitude in parameter count P, with α_P ≈ 0.076 for autoregressive transformers. Importantly, they found that scaling with model size, data size, and compute each follows its own power law, and that training can be either model-limited or data-limited depending on the regime. Hoffmann et al. (2022) refined these findings, demonstrating that Kaplan et al.'s protocol over-allocates parameters relative to data and proposing compute-optimal (Chinchilla) scaling rules. Subsequent work has extended scaling laws to vision transformers (Zhai et al., 2022), contrastive multimodal models (Cherti et al., 2023), and various other architectures and tasks (Clark et al., 2022; Muennighoff et al., 2024).
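For a sense of what α_P ≈ 0.076 implies, the short calculation below (our arithmetic, not a result from the paper) shows how much the predicted loss drops when the parameter count is multiplied by 2, 10, or 100 under Kaplan et al.'s fit.

```python
# Back-of-the-envelope reading of L(P) = (P_c / P)**alpha_P with alpha_P ≈ 0.076:
# multiplying the parameter count by k multiplies the predicted loss by k**(-alpha_P).
alpha_p = 0.076
for k in (2, 10, 100):
    print(f"{k:>4}x parameters -> loss x {k ** (-alpha_p):.3f}")
# 2x ≈ 0.949 (about 5% lower), 10x ≈ 0.839 (about 16%), 100x ≈ 0.705 (about 30%)
```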

Single-cell foundation models. The single-cell community has developed several large pretrained models. scVI (Lopez et al., 2018) introduced variational autoencoders for scRNA-seq, and subsequent work extended this framework to multi-omic data (Gayoso et al., 2022). More recently, transformer-based architectures have gained prominence: scGPT (Cui et al., 2024) adopts a generative pretraining approach with gene tokens; Geneformer (Theodoris et al., 2023) pretrains on rank-ordered gene expression using a BERT-style masked learning objective.

Reference

This content is AI-processed based on open access ArXiv data.
