Cell-JEPA: Latent Representation Learning for Single-Cell Transcriptomics


Single-cell foundation models learn by reconstructing masked gene expression, implicitly treating technical noise as signal. With dropout rates exceeding 90%, reconstruction objectives encourage models to encode measurement artifacts rather than stable cellular programs. We introduce Cell-JEPA, a joint-embedding predictive architecture that shifts learning from reconstructing sparse counts to prediction in latent space. The key insight is that cell identity is redundantly encoded across genes. We show that predicting cell-level embeddings from partial observations forces the model to learn dropout-robust features. On cell-type clustering, Cell-JEPA achieves 0.72 AvgBIO in zero-shot transfer versus 0.53 for scGPT, a 36% relative improvement. On perturbation prediction within a single cell line, Cell-JEPA improves absolute-state reconstruction but not effect-size estimation, suggesting that representation learning and perturbation modeling address complementary aspects of cellular prediction.


💡 Research Summary

Cell‑JEPA introduces a joint‑embedding predictive architecture for single‑cell transcriptomics that moves away from the conventional masked‑gene reconstruction objective used by foundation models such as scGPT. The core idea is to predict a stable cell‑level embedding in latent space rather than reconstruct raw count values, thereby reducing the model’s reliance on noisy, dropout‑prone measurements.

The model builds directly on scGPT’s tokenization, embedding, and bidirectional transformer backbone, but adds a student‑teacher pair of encoders. The teacher encoder receives the full, unmasked token sequence and produces a cell‑level representation via the special <cls> token. The student encoder processes a version of the same cell in which a subset of gene tokens is masked (their expression values replaced by a sentinel). The teacher’s parameters are updated as an exponential moving average (EMA) of the student’s parameters, creating a slowly evolving target that stabilizes the representation‑prediction task.
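The EMA update can be illustrated in a few lines. This is a minimal NumPy sketch, not the paper's implementation: the parameter lists and the decay rate `tau` are illustrative assumptions (JEPA-style setups typically use decay values near 0.99–0.999).

```python
import numpy as np

def ema_update(teacher_params, student_params, tau=0.996):
    """Update teacher parameters as an exponential moving average of the
    student parameters: theta_teacher <- tau*theta_teacher + (1-tau)*theta_student.
    Only the student receives gradient updates; the teacher just tracks it."""
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

# toy example: one "parameter tensor" per encoder
teacher = [np.array([1.0, 1.0])]
student = [np.array([0.0, 0.0])]
teacher = ema_update(teacher, student, tau=0.9)
print(teacher[0])  # -> [0.9 0.9]
```

Because `tau` is close to 1, the teacher changes slowly, which keeps the prediction target stable across training steps.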

Training optimizes two complementary losses. First, a gene‑level reconstruction loss (L_rec) identical to scGPT’s masked‑gene prediction, computed as mean‑squared error over the discretized expression bins of masked genes. Second, a JEPA loss (L_JEPA) that aligns the student’s predicted embedding (after a small predictor head) with the teacher’s embedding using a cosine‑distance objective. The teacher embedding is detached (stop‑gradient) so that only the student is updated toward the target. The total pre‑training objective is a weighted sum of L_rec and L_JEPA, allowing the model to retain fine‑grained expression information while enforcing robustness at the representation level.
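The combined objective described above can be sketched as follows. This is a simplified NumPy illustration of the loss arithmetic only (no gradients); the weighting `lam` is an assumption, and the stop-gradient on the teacher side is represented by simply treating the teacher embedding as a constant.

```python
import numpy as np

def cosine_distance(pred, target):
    """JEPA alignment term: 1 - cosine similarity between the student's
    predicted embedding and the (detached) teacher embedding."""
    pred = pred / np.linalg.norm(pred)
    target = target / np.linalg.norm(target)  # teacher side: constant, no gradient
    return 1.0 - float(pred @ target)

def total_loss(rec_loss, student_pred, teacher_emb, lam=1.0):
    """Weighted sum of the masked-gene reconstruction loss (L_rec)
    and the latent-space JEPA loss (L_JEPA)."""
    return rec_loss + lam * cosine_distance(student_pred, teacher_emb)

# identical embeddings -> JEPA term vanishes, total loss equals L_rec
emb = np.array([0.3, -1.2, 0.5])
print(total_loss(0.25, emb, emb.copy()))  # -> approximately 0.25
```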

Data preprocessing follows scGPT: each cell is represented as a sparse list of (gene_id, expression) pairs, expression values are quantile‑binned into 50 discrete levels per cell, and the number of expressed genes per cell is capped at 600 by uniform random subsampling. This stochastic subsampling mimics biological dropout across epochs, forcing the model to learn from ever‑changing partial views of the same cell.
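A sketch of this preprocessing step is shown below. Note the simplification: real scGPT-style binning computes quantiles over nonzero expression values, whereas this illustration uses a rank-based approximation; the function name and RNG handling are assumptions for the example.

```python
import numpy as np

def preprocess_cell(gene_ids, counts, n_bins=50, max_genes=600, rng=None):
    """Quantile-bin expression values per cell (rank-based approximation)
    and cap the expressed-gene list by uniform random subsampling,
    which mimics dropout with a fresh partial view each epoch."""
    rng = rng or np.random.default_rng()
    gene_ids = np.asarray(gene_ids)
    counts = np.asarray(counts, dtype=float)
    # rank-based quantile binning into n_bins discrete levels per cell
    ranks = counts.argsort().argsort()
    bins = (ranks * n_bins // len(counts)).astype(int)
    # cap the number of expressed genes via uniform random subsampling
    if len(gene_ids) > max_genes:
        keep = rng.choice(len(gene_ids), size=max_genes, replace=False)
        gene_ids, bins = gene_ids[keep], bins[keep]
    return gene_ids, bins

# a cell with 1000 expressed genes is subsampled down to 600 binned tokens
ids = np.arange(1000)
cnt = np.random.default_rng(0).poisson(5, size=1000) + 1
g, b = preprocess_cell(ids, cnt, rng=np.random.default_rng(1))
print(len(g), b.min(), b.max())  # 600 tokens, bins within [0, 49]
```

Because the subsample is redrawn each time, the model sees a different 600-gene view of the same cell across epochs.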

Evaluation is performed on three downstream tasks. (1) Supervised fine‑tuning for cell‑type clustering shows that Cell‑JEPA consistently outperforms scGPT in Adjusted Rand Index and Normalized Mutual Information, indicating more biologically meaningful embeddings. (2) Zero‑shot transfer, where the frozen pre‑trained embeddings are used directly with a K‑Nearest‑Neighbor classifier, yields an AvgBIO score of 0.72 versus 0.53 for scGPT, a 36% relative improvement, demonstrating that the latent space captures cell identity without any task‑specific adaptation. (3) Perturbation‑response prediction on a single cell line reveals that Cell‑JEPA improves absolute state reconstruction (lower RMSE) but does not surpass scGPT in estimating perturbation effect sizes, suggesting that representation learning and perturbation modeling address complementary aspects of the problem.
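The zero-shot setup in (2) can be sketched as a K-NN classifier over frozen embeddings. This is a minimal NumPy illustration under assumed choices (cosine similarity, majority vote, small k); the paper's exact classifier configuration is not specified here.

```python
import numpy as np

def knn_predict(train_emb, train_labels, query_emb, k=3):
    """Majority-vote K-NN over frozen cell embeddings using cosine
    similarity, as in a zero-shot transfer evaluation."""
    # normalize rows so that a dot product equals cosine similarity
    tr = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = q @ tr.T
    labels = np.array(train_labels)
    preds = []
    for row in sims:
        top = row.argsort()[-k:]              # indices of k most similar cells
        vals, counts = np.unique(labels[top], return_counts=True)
        preds.append(vals[counts.argmax()])   # majority vote
    return preds

# two well-separated cell-type clusters; the query sits in cluster "A"
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(knn_predict(train, ["A", "A", "B", "B"], np.array([[0.95, 0.05]])))  # ['A']
```

No parameters are updated here: the pre-trained encoder is frozen, so performance reflects the quality of the latent space itself.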

In summary, Cell‑JEPA shows that shifting the self‑supervised signal from raw count reconstruction to latent‑space prediction yields embeddings that are more robust to extreme sparsity and technical noise, while preserving the ability to recover detailed expression patterns. The approach provides a solid foundation for transfer learning across tissues, species, and experimental conditions, and points to future work where perturbation‑specific heads could be combined with the robust latent representations learned by Cell‑JEPA.

