OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Graph Language Foundation Modeling
With the rapid growth of large-scale single-cell omic datasets, omic foundation models (FMs) have emerged as powerful tools for advancing research in life sciences and precision medicine. However, most existing omic FMs rely primarily on numerical transcriptomic data, ordering genes into sequences, and lack explicit integration of the biomedical prior knowledge and signaling interactions that are critical for scientific discovery. Here, we introduce the Text-Omic Signaling Graph (TOSG), a novel data structure that unifies human-interpretable biomedical textual knowledge, quantitative omic data, and signaling network information. Using this framework, we construct OmniCellTOSG, a large-scale resource comprising approximately half a million meta-cell TOSGs derived from around 80 million single-cell and single-nucleus RNA-seq profiles spanning organs and diseases. We further develop CellTOSG-FM, a multimodal graph language FM, to jointly analyze textual, omic, and signaling network contexts. Across diverse downstream tasks, CellTOSG-FM outperforms existing omic FMs and provides interpretable insights into disease-associated targets and signaling pathways.
💡 Research Summary
OmniCellTOSG introduces a novel data paradigm that unifies textual biomedical knowledge, quantitative omic measurements, and signaling network topology into a single graph structure called the Text‑Omic Signaling Graph (TOSG). The authors first aggregate roughly 80 million single‑cell and single‑nucleus RNA‑seq profiles from public repositories such as CellxGene, the Brain Cell Atlas, GEO, the Single Cell Portal, and the Human Cell Atlas. After rigorous quality control, normalization, and harmonization of tissue, disease, age, and sex annotations, they employ SEACells to compress the raw cells into about 0.5 million meta‑cells, preserving biological diversity while reducing computational burden.
Each meta‑cell’s transcriptome is then aligned to the BioMedGraphica knowledge base. Transcripts become “transcript nodes” carrying measured expression values, while their downstream protein products become “protein nodes” initialized with zero vectors (since matched proteomic measurements are unavailable). Edges encode intra‑cellular relationships (transcript‑to‑protein) as well as protein‑protein interactions (PPIs). The resulting graph contains 533,458 entities and 16,637,405 relations, including 152,585 intra‑cell edges and over 16 million PPIs, thereby embedding both matched and virtual entities and capturing both nucleus‑level and intra‑cellular signaling.
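The TOSG layout described above can be sketched with plain Python containers. Everything below is illustrative: the class, field names, and toy PI3K‑AKT entities are assumptions for exposition, not the authors' actual schema.

```python
# Minimal sketch of a Text-Omic Signaling Graph meta-cell (hypothetical schema).
from dataclasses import dataclass, field

@dataclass
class TOSG:
    # entity_id -> feature vector: measured expression for transcript nodes,
    # zeros for protein nodes (no matched proteomic data, per the paper)
    nodes: dict = field(default_factory=dict)
    # (src, dst, relation) triples
    edges: list = field(default_factory=list)

    def add_transcript(self, name, expression):
        self.nodes[name] = [expression]

    def add_protein(self, name, dim=1):
        self.nodes[name] = [0.0] * dim  # zero-initialized protein node

    def link(self, src, dst, relation):
        self.edges.append((src, dst, relation))

# Toy meta-cell: two transcripts, their protein products, and one PPI.
g = TOSG()
g.add_transcript("AKT1_mRNA", 3.2)
g.add_transcript("PIK3CA_mRNA", 1.7)
g.add_protein("AKT1")
g.add_protein("PIK3CA")
g.link("AKT1_mRNA", "AKT1", "transcript->protein")    # intra-cell edge
g.link("PIK3CA_mRNA", "PIK3CA", "transcript->protein")
g.link("PIK3CA", "AKT1", "PPI")                       # signaling edge

print(len(g.nodes), len(g.edges))  # 4 3
```

The separation of transcript and protein nodes is what lets the graph carry "virtual" entities: proteins exist in the topology even though no measurement is attached to them.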
To make the resource usable, the authors release CellTOSG_Loader, a Python package that accepts user‑specified filters (cell type, tissue, disease, age, sex, etc.), performs stratified cohort balancing, and applies ComBat‑seq batch correction across platforms. This enables reproducible, bias‑controlled experiments; for example, an Alzheimer’s disease (AD) versus control study can be automatically matched on sex, age, and cell‑type composition.
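The stratified cohort balancing performed by CellTOSG_Loader can be illustrated with a small self-contained sketch. The function name, sample schema, and strata keys below are hypothetical; the released package's actual API may differ.

```python
# Sketch of stratified case/control balancing (hypothetical interface).
import random
from collections import defaultdict

def balance_cohorts(samples, strata_keys=("sex", "age_bin"), seed=0):
    """Downsample so each (sex, age_bin) stratum contributes equal numbers
    of case and control samples, removing composition bias."""
    rng = random.Random(seed)
    by_stratum = defaultdict(lambda: {"case": [], "control": []})
    for s in samples:
        key = tuple(s[k] for k in strata_keys)
        by_stratum[key][s["label"]].append(s)
    balanced = []
    for groups in by_stratum.values():
        n = min(len(groups["case"]), len(groups["control"]))
        for label in ("case", "control"):
            balanced.extend(rng.sample(groups[label], n))
    return balanced

# Toy AD-vs-control cohort with unequal strata.
samples = (
    [{"label": "case", "sex": "F", "age_bin": "70s"}] * 5
    + [{"label": "control", "sex": "F", "age_bin": "70s"}] * 3
    + [{"label": "case", "sex": "M", "age_bin": "80s"}] * 2
    + [{"label": "control", "sex": "M", "age_bin": "80s"}] * 4
)
matched = balance_cohorts(samples)
print(len(matched))  # 10: 3+3 from the first stratum, 2+2 from the second
```

Batch correction (ComBat-seq in the actual pipeline) would then run on the matched cohort, so platform effects are estimated from comparable groups.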
Building on this dataset, the authors develop CellTOSG‑FM, a multimodal graph‑language foundation model. The architecture comprises three encoders: (1) a Transformer‑based text encoder that embeds entity names, descriptions, and sequences; (2) an MLP‑based omic encoder that processes expression vectors; and (3) a graph neural network that performs message passing both within a meta‑cell (internal propagation) and across the global TOSG (global propagation). Cross‑modal attention fuses the three modalities into a unified representation.
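As a rough illustration of the fusion step, the following pure-Python sketch attends over one node's three modality views with a single scaled-dot-product step. The query construction (mean of the views) and all names are simplifying assumptions, not the paper's exact formulation.

```python
# Toy cross-modal attention over one node's text/omic/graph embeddings.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse(text_v, omic_v, graph_v):
    """Weight the three modality views by attention scores and sum them.
    Query = element-wise mean of the views (an assumption for this sketch)."""
    views = [text_v, omic_v, graph_v]
    d = len(text_v)
    q = [sum(v[i] for v in views) / 3 for i in range(d)]
    scores = [sum(q[i] * v[i] for i in range(d)) / math.sqrt(d) for v in views]
    w = softmax(scores)  # per-modality attention weights
    return [sum(w[k] * views[k][i] for k in range(3)) for i in range(d)]

# With symmetric toy inputs, all modalities score equally (weights = 1/3 each).
fused = fuse([1.0, 0.0], [0.0, 1.0], [0.5, 0.5])
print(fused)  # [0.5, 0.5]
```

In the real model the attention is learned and the graph view itself mixes internal (within-meta-cell) and global (cross-TOSG) message passing; this sketch only shows the shape of the fusion.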
Self‑supervised pre‑training employs three complementary objectives: (i) masked‑edge reconstruction, where random transcript‑protein or protein‑protein edges are hidden and the model must predict them; (ii) node‑degree regression, encouraging the model to infer topological importance; and (iii) global message‑propagation consistency, preserving contextual information across the entire graph. The authors demonstrate that masking edges—rather than nodes—yields superior performance on structure‑sensitive tasks such as link prediction and topology recovery, which is crucial for signaling analysis.
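Objective (i) can be sketched as follows: hide a fraction of edges and pair the held-out set with sampled non-edges as negatives for the model to discriminate. Function names and the toy graph are hypothetical; the actual pre-training pipeline is more involved.

```python
# Sketch of masked-edge reconstruction data preparation (hypothetical names).
import random

def mask_edges(edges, mask_ratio=0.3, seed=0):
    """Hide a random fraction of edges; the hidden set becomes the
    prediction target, the rest stays visible to the encoder."""
    rng = random.Random(seed)
    edges = list(edges)
    n_mask = max(1, int(len(edges) * mask_ratio))
    masked = rng.sample(edges, n_mask)
    hidden = set(masked)
    visible = [e for e in edges if e not in hidden]
    return visible, masked

def sample_negatives(nodes, edges, k, seed=0):
    """Sample k node pairs that are NOT edges, as negative examples."""
    rng = random.Random(seed)
    existing = set(edges)
    negatives = []
    while len(negatives) < k:
        u, v = rng.sample(nodes, 2)
        if (u, v) not in existing:
            negatives.append((u, v))
    return negatives

nodes = ["t1", "t2", "p1", "p2"]
edges = [("t1", "p1"), ("t2", "p2"), ("p1", "p2")]  # 2 transcript->protein, 1 PPI
visible, masked = mask_edges(edges)
negs = sample_negatives(nodes, edges, k=len(masked))
print(len(visible), len(masked))  # 2 1
```

The model would then score each masked pair against its negatives (e.g., by a dot product of node embeddings), which is why edge masking, unlike node masking, directly supervises the structure-sensitive behavior the authors report.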
CellTOSG‑FM is evaluated on seven downstream tasks: cell‑type annotation across 30 cell types, disease classification (including cancer and neurodegeneration), core‑signaling pathway inference (e.g., PI3K‑AKT, MAPK), drug‑response prediction, cell‑cell interaction prediction, protein‑embedding quality assessment, and interpretability via subgraph rationales. Across all benchmarks, CellTOSG‑FM outperforms existing single‑cell foundation models such as scGPT, GeneFormer, and scFoundation, achieving 4–12% higher accuracy or F1 scores. Notably, the model’s attention maps highlight biologically plausible subgraphs; for instance, in the AD case study, the model emphasizes the PI3K‑AKT pathway and its downstream effectors, aligning with known disease mechanisms and suggesting novel hypotheses.
All data and code are openly released (dataset on HuggingFace, code on GitHub), ensuring reproducibility and facilitating community extensions. The authors outline future directions: integrating true proteomic measurements, adding other omics layers (ATAC‑seq, methylation), and scaling to larger clinical cohorts for precision‑medicine applications.
In summary, OmniCellTOSG provides the first large‑scale, knowledge‑grounded, graph‑structured representation of single‑cell data, and CellTOSG‑FM demonstrates that a multimodal graph‑language foundation model can simultaneously achieve state‑of‑the‑art predictive performance and mechanistic interpretability, marking a significant step forward for AI‑driven discovery in life sciences.