The Cell Ontology in the age of single-cell omics

Single-cell omics technologies have transformed our understanding of cellular diversity by enabling high-resolution profiling of individual cells. However, the unprecedented scale and heterogeneity of these datasets demand robust frameworks for data integration and annotation. The Cell Ontology (CL) has emerged as a pivotal resource for meeting the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles: it provides standardized, species-agnostic terms for canonical cell types and forms a core component of a wide range of platforms and tools. In this paper, we describe the many uses of CL in these platforms and tools and detail ongoing work to improve and extend CL content, including the addition of transcriptomically defined types, in close collaboration with major atlasing efforts such as the Human Cell Atlas and the BRAIN Initiative Cell Atlas Network. We also cover the challenges of, and future plans for, harmonising classical and transcriptomic cell type definitions, integrating markers, and using Large Language Models (LLMs) to improve the content and efficiency of CL workflows.

💡 Research Summary

The rapid expansion of single‑cell omics technologies has generated datasets that contain millions of cells, each profiled at unprecedented resolution across transcriptomic, genomic, and epigenomic dimensions. While these data have revolutionized our understanding of cellular heterogeneity, they also present a formidable challenge: how to integrate, annotate, and reuse such massive, highly diverse collections in a manner that adheres to the FAIR (Findable, Accessible, Interoperable, Reusable) principles. In this context, the Cell Ontology (CL) emerges as a pivotal, species‑agnostic framework that provides a controlled vocabulary for canonical cell types and serves as a backbone for a wide array of bioinformatics platforms, repositories, and large‑scale atlasing initiatives.

The authors first map the current landscape of CL usage. They demonstrate that major analysis tools such as Scanpy, Seurat, and Cellxgene embed CL identifiers in their output metadata, enabling downstream users to query and compare results across studies. Public repositories including GEO, ArrayExpress, and other EMBL‑EBI archives have adopted CL terms to tag samples, thereby improving discoverability. The paper highlights two flagship collaborations: the Human Cell Atlas (HCA) and the BRAIN Initiative Cell Atlas Network (BICAN). In both projects, newly identified transcriptomic clusters are systematically mapped to existing CL terms, and where gaps exist, novel “transcriptomically defined cell types” (TDCTs) are introduced as extensions of the ontology. This bidirectional mapping allows researchers to align classical, morphology‑based cell definitions with data‑driven clusters, facilitating cross‑modal integration and meta‑analysis.
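
The cluster-to-term mapping described above can be illustrated with a minimal sketch. It assumes a small hand-written lookup table and an illustrative metadata field name (`cell_type_ontology_term_id`); neither the table nor the helper functions come from the paper, and real pipelines would normally resolve labels against the ontology itself rather than a hard-coded dictionary.

```python
# Hypothetical lookup from free-text cluster labels to Cell Ontology CURIEs.
# The four CL identifiers are real terms; the table itself is an assumption.
LABEL_TO_CL = {
    "t cell": "CL:0000084",
    "b cell": "CL:0000236",
    "monocyte": "CL:0000576",
    "neuron": "CL:0000540",
}


def annotate_clusters(cluster_labels):
    """Return one metadata record per cluster, with a CL ID where one is known."""
    records = []
    for cluster_id, label in cluster_labels.items():
        key = label.strip().lower()
        records.append(
            {
                "cluster": cluster_id,
                "label": label,
                # None marks a gap: a cluster with no matching CL term yet.
                "cell_type_ontology_term_id": LABEL_TO_CL.get(key),
            }
        )
    return records


def unmapped(records):
    """List the clusters that would need a new or extended ontology term."""
    return [r["cluster"] for r in records if r["cell_type_ontology_term_id"] is None]


# Made-up cluster labels, for illustration only.
clusters = {"0": "T cell", "1": "Monocyte", "2": "Cluster 7 (unresolved)"}
records = annotate_clusters(clusters)
gaps = unmapped(records)
```

Clusters that end up in `gaps` correspond to the cases where, according to the summary, a transcriptomically defined type might be proposed as an extension of CL.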

A central tension addressed in the manuscript is the discord between classical cell type definitions—rooted in morphology, function, and lineage markers—and the emergent, data‑driven definitions derived from high‑dimensional single‑cell profiles. To reconcile these, the authors propose a formal extension of CL that adds a TDCT class. Each TDCT is linked to its parent CL term via explicit “has‑derived‑from” relationships, and is annotated with a curated set of marker genes, provenance information (e.g., dataset, analysis pipeline), and confidence scores. This structure preserves the hierarchical nature of CL while accommodating the fluid, context‑specific nature of transcriptomic clusters.
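
One way to picture the TDCT record described above is as a small data structure. The sketch below is an assumption throughout: the class names, field names, identifier format, and the 0 to 1 confidence scale are hypothetical, and the marker genes and dataset names are placeholders rather than real annotations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Provenance:
    """Where a transcriptomic cluster came from (field names are hypothetical)."""
    dataset: str
    pipeline: str


@dataclass
class TranscriptomicCellType:
    """Illustrative TDCT record linked to a parent Cell Ontology term."""
    tdct_id: str
    label: str
    parent_cl_id: str  # target of the "has-derived-from" link
    marker_genes: List[str] = field(default_factory=list)
    provenance: Optional[Provenance] = None
    confidence: float = 0.0  # assumed to lie in [0, 1]

    def __post_init__(self):
        if not self.parent_cl_id.startswith("CL:"):
            raise ValueError("parent must be a Cell Ontology CURIE such as CL:0000540")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def group_by_parent(tdcts) -> Dict[str, list]:
    """Group TDCTs under their parent CL term, preserving the CL hierarchy."""
    grouped: Dict[str, list] = {}
    for tdct in tdcts:
        grouped.setdefault(tdct.parent_cl_id, []).append(tdct)
    return grouped


# Placeholder values only; GENE_A and GENE_B are not real marker assignments.
example = TranscriptomicCellType(
    tdct_id="TDCT:example_001",
    label="example neuron cluster",
    parent_cl_id="CL:0000540",  # neuron
    marker_genes=["GENE_A", "GENE_B"],
    provenance=Provenance(dataset="example_dataset", pipeline="example_pipeline"),
    confidence=0.8,
)
by_parent = group_by_parent([example])
```

Keeping the parent link as a plain CL identifier means the existing hierarchy is untouched, while provenance and confidence travel with each data-driven type.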

Recognizing that manual curation of markers and ontology updates is labor‑intensive, the paper introduces a workflow that leverages large language models (LLMs). The LLMs are given the full text of peer‑reviewed articles, supplementary tables, and public database records, from which they extract candidate cell‑type–marker associations. These candidates are then routed to expert curators through a web‑based interface that records decisions, rationales, and version changes. Benchmarking shows that the LLM‑assisted pipeline reduces the time required to incorporate new markers from months to weeks, while maintaining high precision (>90%) after expert review.
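
The human-in-the-loop step of this workflow can be sketched without any LLM call at all. The example below assumes candidates have already been extracted and models only the review log; the class, function names, and candidate values are hypothetical, and the acceptance rate is just one simple way to estimate extraction precision, not necessarily the paper's benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A candidate cell-type-marker association awaiting curation (hypothetical)."""
    cell_type_id: str
    marker: str
    source: str
    accepted: Optional[bool] = None  # None means not yet reviewed
    rationale: str = ""


def review(candidate, accepted, rationale):
    """Record a curator's decision and the reason for it."""
    candidate.accepted = accepted
    candidate.rationale = rationale
    return candidate


def acceptance_rate(candidates):
    """Fraction of reviewed candidates that curators accepted, or None if none reviewed."""
    reviewed = [c for c in candidates if c.accepted is not None]
    if not reviewed:
        return None
    return sum(c.accepted for c in reviewed) / len(reviewed)


# Placeholder candidates; markers and sources are invented for illustration.
queue = [
    Candidate("CL:0000084", "GENE_A", "example_article_1"),
    Candidate("CL:0000084", "GENE_B", "example_article_1"),
    Candidate("CL:0000236", "GENE_C", "example_article_2"),
    Candidate("CL:0000576", "GENE_D", "example_table_1"),
    Candidate("CL:0000540", "GENE_E", "example_table_2"),
]
review(queue[0], True, "supported by source text")
review(queue[1], True, "supported by source text")
review(queue[2], True, "supported by supplementary table")
review(queue[3], False, "marker not specific to this cell type")

rate = acceptance_rate(queue)  # 3 accepted out of 4 reviewed
pending = [c for c in queue if c.accepted is None]
```

Storing the rationale alongside each decision mirrors the summary's point that the interface records why a candidate was accepted or rejected, not only the outcome.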

Future directions outlined by the authors focus on sustainability, community engagement, and interoperability. They propose a community‑driven validation mechanism in which researchers can submit, comment on, and vote on new terms via a GitHub‑based issue tracker, ensuring transparent version control. Integration with other standards, such as MIAME for microarray experiments and MINSEQE for sequencing experiments, is planned through the development of a common JSON‑LD schema and RESTful APIs, enabling seamless exchange between CL and broader omics metadata ecosystems. Moreover, the authors address ethical considerations, emphasizing the need for privacy‑preserving annotations for human cell data and alignment with international regulations (e.g., GDPR, NIH data‑sharing policies).
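
Since the planned JSON‑LD schema is not specified in the summary, the following is only a rough sketch of what a CL-annotated sample record could look like. The OBO PURL pattern for CL terms is standard; the `ex:` vocabulary, the `example.org` URLs, and the `ex:cellType` property are assumptions made for illustration.

```python
import json

# Standard base IRI for OBO ontology terms, e.g. .../obo/CL_0000084.
OBO_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_iri(curie):
    """Expand an OBO CURIE such as 'CL:0000084' into its PURL."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_BASE}{prefix}_{local_id}"


def sample_to_jsonld(sample_id, cl_curie, label):
    """Build a JSON-LD document tagging one sample with a CL term.

    The 'ex' namespace and 'ex:cellType' property are hypothetical.
    """
    return {
        "@context": {
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "ex": "https://example.org/schema/",
        },
        "@id": f"https://example.org/sample/{sample_id}",
        "ex:cellType": {
            "@id": curie_to_iri(cl_curie),
            "rdfs:label": label,
        },
    }


doc = sample_to_jsonld("S1", "CL:0000084", "T cell")
serialized = json.dumps(doc, indent=2)
```

Referring to the cell type by its full IRI rather than a free-text label is what lets another system, such as a repository using a different metadata standard, resolve the annotation unambiguously.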

In summary, the paper presents a comprehensive roadmap for evolving the Cell Ontology into a dynamic, interoperable scaffold that can keep pace with the accelerating scale and complexity of single‑cell omics. By harmonizing classical and transcriptomic definitions, automating marker extraction with LLMs, and fostering an open, community‑centric governance model, CL is positioned to become the cornerstone of FAIR‑compliant cellular data integration, driving reproducible science across the biomedical research landscape.

📜 Original Paper Content