STProtein: predicting spatial protein expression from multi-omics data

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, refer to the original arXiv source.

The integration of spatial multi-omics data from single tissues is crucial for advancing biological research. However, a significant data imbalance impedes progress: while spatial transcriptomics data is relatively abundant, spatial proteomics data remains scarce due to technical limitations and high costs. To overcome this challenge, we propose STProtein, a novel framework that leverages graph neural networks with a multi-task learning strategy. STProtein is designed to accurately predict unknown spatial protein expression using more accessible spatial multi-omics data, such as spatial transcriptomics. We believe that STProtein can effectively address the scarcity of spatial proteomics, accelerating the integration of spatial multi-omics and potentially catalyzing transformative breakthroughs in life sciences. This tool enables scientists to accelerate discovery by identifying complex and previously hidden spatial patterns of proteins within tissues, uncovering novel relationships between different marker genes, and exploring the biological “Dark Matter”.

💡 Research Summary

The paper addresses a pressing bottleneck in spatial multi‑omics: while spatial transcriptomics datasets are proliferating, spatial proteomics remains scarce due to high cost and technical challenges. To bridge this gap, the authors introduce STProtein, a graph‑neural‑network‑based framework that predicts spatial protein expression from readily available spatial transcriptomics data using a multi‑task learning (MTL) strategy.

Data preprocessing: RNA counts are log‑transformed, library‑size normalized, and the top 4,000 highly variable genes are selected. Principal component analysis (PCA) reduces dimensionality, and the first k components (where k equals the number of measured proteins) are retained as model inputs. Protein counts undergo centered‑log‑ratio (CLR) normalization, and the same PCA components are used for the protein side.
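
The two normalization steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `target_sum` of 10,000, the pseudocount of 1, and the choice to apply CLR per spot across proteins are all hypothetical, and the sketch normalizes by library size before taking the log, which is the common convention. Highly variable gene selection and PCA are omitted.

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale each spot (row) to `target_sum` total counts, then apply log(1 + x)."""
    out = []
    for row in counts:
        total = sum(row)
        # Guard against empty spots (assumption: leave them as all zeros).
        scale = target_sum / total if total > 0 else 0.0
        out.append([math.log1p(v * scale) for v in row])
    return out

def clr(counts, pseudocount=1.0):
    """Centered log-ratio per spot: log(x + pseudocount) minus the row mean of those logs."""
    out = []
    for row in counts:
        logs = [math.log(v + pseudocount) for v in row]
        mean_log = sum(logs) / len(logs)
        out.append([value - mean_log for value in logs])
    return out

# Toy data: 2 spots x 3 features (hypothetical values).
rna = [[10, 0, 90], [5, 5, 0]]
protein = [[100, 10, 1], [3, 3, 3]]

rna_norm = normalize_log1p(rna)
protein_clr = clr(protein)
```

By construction, every CLR-transformed row sums to zero, so a spot with identical counts for all proteins maps to a row of zeros.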

Feature graph construction: Instead of a simple spatial‑neighbor graph, the authors adopt a K‑nearest‑neighbor (K‑NN) graph built on the PCA embeddings. Each spot is linked to its three nearest neighbors in the embedding space (default K = 3; this K is distinct from the number of retained PCA components), so edges can connect spots that are similar in expression but spatially distant, a biologically motivated choice supported by prior work (STAGATE, SpatialGlue).
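
A feature-space K-NN graph of this kind might be built as follows. The sketch assumes Euclidean distance and a brute-force search; `knn_graph` and the toy embeddings are hypothetical names and values chosen for illustration, and the summary does not state which distance metric the paper uses.

```python
import math

def knn_graph(embeddings, k=3):
    """Map each point index to the indices of its k nearest other points (Euclidean)."""
    neighbors = {}
    for i, xi in enumerate(embeddings):
        # Distance from point i to every other point, paired with that point's index.
        dists = [
            (math.dist(xi, xj), j)
            for j, xj in enumerate(embeddings)
            if j != i
        ]
        dists.sort()
        neighbors[i] = [j for _, j in dists[:k]]
    return neighbors

# Toy 2-D embeddings standing in for PCA coordinates (hypothetical values).
emb = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]]
graph = knn_graph(emb, k=2)
```

Because neighbors are chosen from the embedding rather than from tissue coordinates, two spots can be linked regardless of how far apart they sit on the slide. The resulting graph is directed: point 3 lists point 2 as a neighbor here, but point 2 does not list point 3.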

Graph attention autoencoder: The core model consists of a graph attention layer (GATv2) followed by an encoder‑decoder pair. The encoder stacks two GATv2 layers with ReLU activations, then a linear projection to produce a latent embedding interpreted as reconstructed protein expression. The decoder mirrors this architecture, converting the protein embedding back into reconstructed RNA expression. Parameter sharing (weights and attention scores) between encoder and decoder mitigates over‑fitting.
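
To make the attention step concrete, here is a single-head, bias-free sketch of a GATv2-style update for one node, written in plain Python. It follows the general GATv2 scoring form, in which the nonlinearity is applied before the attention vector; the two weight matrices `W_self` and `W_nbr` are equivalent to one matrix applied to the concatenated pair of feature vectors. All names and values are hypothetical, and the paper's layer sizes, head count, and parameter sharing are not modeled.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gatv2_node_update(i, neighbors, H, W_self, W_nbr, a):
    """One-head GATv2-style update for node i.

    score(i, j) = a . LeakyReLU(W_self h_i + W_nbr h_j)
    alpha       = softmax of the scores over j in `neighbors`
    output      = sum_j alpha_ij * (W_nbr h_j)
    """
    hi = matvec(W_self, H[i])
    msgs = [matvec(W_nbr, H[j]) for j in neighbors]
    scores = [
        sum(ak * leaky_relu(s + m) for ak, s, m in zip(a, hi, msg))
        for msg in msgs
    ]
    # Numerically stable softmax over the neighbor scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]
    out = [sum(al * msg[d] for al, msg in zip(alpha, msgs)) for d in range(len(a))]
    return out, alpha

# Toy example: 3 nodes with 2-D features, identity weights (hypothetical values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, alpha = gatv2_node_update(0, [1, 2], H, identity, identity, a=[1.0, 1.0])
```

In this toy case node 2 receives the larger attention weight because its score (3) exceeds that of node 1 (2), so the updated feature of node 0 is pulled toward node 2.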

Loss function and multi‑task learning: Two L2 reconstruction losses are defined—one for RNA (L_rna) and one for protein (L_protein). A weighted sum L_total = β₁·L_rna + β₂·L_protein is minimized, allowing the network to learn a shared latent space that simultaneously explains both modalities.
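
The weighted objective can be written out directly. This sketch assumes each L2 term is a mean squared error over all entries (the summary does not say whether the paper uses a mean or a sum), and the function and argument names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error over all entries of two equally shaped 2-D lists."""
    n = sum(len(row) for row in target)
    return sum(
        (p - t) ** 2
        for prow, trow in zip(pred, target)
        for p, t in zip(prow, trow)
    ) / n

def total_loss(rna_hat, rna, protein_hat, protein, beta_rna=1.0, beta_protein=1.0):
    """L_total = beta_rna * L_rna + beta_protein * L_protein."""
    return beta_rna * mse(rna_hat, rna) + beta_protein * mse(protein_hat, protein)

# Toy values: one RNA entry is off by 2, one protein entry is off by 1.
rna = [[1.0, 2.0], [3.0, 4.0]]
rna_hat = [[1.0, 2.0], [3.0, 2.0]]   # squared error 4 over 4 entries -> 1.0
protein = [[0.5], [1.5]]
protein_hat = [[1.5], [1.5]]         # squared error 1 over 2 entries -> 0.5

loss = total_loss(rna_hat, rna, protein_hat, protein, beta_rna=1.0, beta_protein=2.0)
```

Raising `beta_protein` relative to `beta_rna` makes protein reconstruction errors dominate the gradient, which is the trade-off the β ablation in the paper would be probing.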

Upstream and downstream tasks: After training on a dataset containing both RNA and protein measurements, the model can (i) predict protein levels for new samples that only have RNA data (upstream task) and (ii) use the learned protein embeddings for clustering or spatial domain discovery (downstream task), thereby revealing hidden protein spatial patterns—what the authors term “biological Dark Matter.”

Experimental validation: The authors compare STProtein against existing multi‑omics predictors such as totalVI and scArches, demonstrating that explicit modeling of spatial relationships via the K‑NN graph and GATv2 yields superior protein prediction accuracy and more coherent spatial domain segmentation. Ablation studies explore the impact of the K‑NN neighbor count, number of attention heads, and the β weighting scheme, providing practical guidance for hyper‑parameter selection.

Key contributions:

A unified graph‑based representation that integrates spatial transcriptomics and proteomics.
A multi‑task learning framework that jointly reconstructs RNA and protein, encouraging a biologically meaningful shared latent space.
A practical solution to the data‑imbalance problem, enabling cost‑effective generation of spatial proteomics‑like data from abundant transcriptomics.

Implications: By turning abundant spatial transcriptomics into a proxy for spatial proteomics, STProtein can accelerate tissue‑level studies, facilitate discovery of novel protein biomarkers, and support the exploration of previously uncharacterized spatial protein domains. The framework is extensible to other modalities (e.g., epigenomics) and can be integrated with emerging large‑scale spatial multi‑omics atlases, promising broad impact across biomedical research.
