LaCoGSEA: Unsupervised deep learning for pathway analysis via latent correlation

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, refer to the original arXiv source.

Motivation: Pathway enrichment analysis is widely used to interpret gene expression data. Standard approaches, such as GSEA, rely on predefined phenotypic labels and pairwise comparisons, which limits their applicability in unsupervised settings. Existing unsupervised extensions, including single-sample methods, provide pathway-level summaries but primarily capture linear relationships and do not explicitly model gene-pathway associations. More recently, deep learning models have been explored to capture non-linear transcriptomic structure. However, their interpretation has typically relied on generic explainable AI (XAI) techniques designed for feature-level attribution. As these methods are not designed for pathway-level interpretation in unsupervised transcriptomic analyses, their effectiveness in this setting remains limited.

Results: To bridge this gap, we introduce LaCoGSEA (Latent Correlation GSEA), an unsupervised framework that integrates deep representation learning with robust pathway statistics. LaCoGSEA employs an autoencoder to capture non-linear manifolds and proposes a global gene-latent correlation metric as a proxy for differential expression, generating dense gene rankings without prior labels. We demonstrate that LaCoGSEA offers three key advantages: (i) it achieves improved clustering performance in distinguishing cancer subtypes compared to existing unsupervised baselines; (ii) it recovers a broader range of biologically meaningful pathways at higher ranks compared with linear dimensionality reduction and gradient-based XAI methods; and (iii) it maintains high robustness and consistency across varying experimental protocols and dataset sizes. Overall, LaCoGSEA provides state-of-the-art performance in unsupervised pathway enrichment analysis.

Availability and implementation: https://github.com/willyzzz/LaCoGSEA


💡 Research Summary

LaCoGSEA (Latent Correlation Gene Set Enrichment Analysis) is an unsupervised framework that combines deep representation learning with classical pathway enrichment statistics, enabling pathway‑level interpretation of transcriptomic data without phenotype labels. The core pipeline consists of four steps. First, a deep autoencoder compresses the high‑dimensional gene expression matrix (G genes × N samples) into a low‑dimensional latent space of D dimensions (D ≪ G). The encoder‑decoder pair is trained with a reconstruction loss augmented by Elastic Net regularization (L1 + L2 penalties) to limit over‑fitting. Second, for each latent dimension k, the Pearson correlation ρ_{j,k} between the expression vector of every gene j and the activation vector of dimension k is computed across all samples, yielding a G × D correlation matrix P. These correlations serve as a label‑free proxy for differential expression: genes with strongly positive (or negative) ρ_{j,k} are treated as up‑ (or down‑) regulated with respect to the latent biological signal captured by dimension k. Third, each column of P is sorted to produce a pre‑ranked gene list L_k, which is passed directly to standard pre‑ranked GSEA against gene‑set collections such as KEGG, GO, and C6. Because pre‑ranked GSEA requires only a ranked list, no supervised test statistic is needed. Fourth, pathway activity scores for each sample are derived by multiplying the sample’s latent vector z_i (length D) with the normalized enrichment scores (NES) of each pathway across the D dimensions, producing an activity matrix A (N samples × M pathways). This matrix can be used for downstream tasks such as unsupervised clustering, visualization, or differential analysis.
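Steps two to four can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the function names, the toy data, and the NES values are hypothetical. It also assumes that the latent activations have already been produced by a trained autoencoder and that the activity score is a plain dot product with no further normalization.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def gene_latent_correlation(expr, latent):
    """Build the G x D matrix P with P[j][k] = corr(gene j, latent dimension k).

    expr   : G rows, each a list of N expression values (genes x samples)
    latent : N rows, each a list of D activations (samples x dimensions)
    """
    n_dims = len(latent[0])
    latent_cols = [[z[k] for z in latent] for k in range(n_dims)]
    return [[pearson(gene_row, col) for col in latent_cols] for gene_row in expr]


def ranked_gene_list(P, gene_names, k):
    """Pre-ranked list L_k: (gene, correlation) pairs, most positive first."""
    order = sorted(range(len(gene_names)), key=lambda j: P[j][k], reverse=True)
    return [(gene_names[j], P[j][k]) for j in order]


def pathway_activity(latent, nes):
    """Activity matrix A (N x M): A[i][m] = sum_k latent[i][k] * nes[m][k].

    nes : M rows, each holding one pathway's NES across the D dimensions.
    """
    return [[sum(z_k * s_k for z_k, s_k in zip(z, nes_row)) for nes_row in nes]
            for z in latent]


# Toy data (hypothetical): 3 genes x 4 samples, 2 latent dimensions.
gene_names = ["geneA", "geneB", "geneC"]
expr = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]
latent = [[0.1, 1.0], [0.2, 0.0], [0.3, 1.0], [0.4, 0.0]]

P = gene_latent_correlation(expr, latent)
L0 = ranked_gene_list(P, gene_names, k=0)

# NES values would come from running pre-ranked GSEA on each L_k; these are made up.
nes = [[2.0, 0.0], [-1.0, 1.5]]
A = pathway_activity(latent, nes)
```

In this toy example geneA rises with latent dimension 0 and geneB falls with it, so they land at opposite ends of L0. In practice each L_k would be handed to a pre-ranked GSEA tool, and a vectorized array library would replace the nested loops.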

The authors evaluated LaCoGSEA on nine public datasets spanning cancer (breast, lung, lymphoma), neuro‑degeneration, trauma, liver, and heart, covering both RNA‑seq and microarray platforms. A saturation analysis varying D from 4 to 128 showed that the autoencoder consistently identified more significant pathways than PCA, especially after D ≥ 4, with a stable plateau around D = 64. In contrast, PCA’s performance degraded at higher dimensions because its orthogonality constraint fragments overlapping biological programs. Negative control experiments with synthetic Gaussian noise produced zero significant pathways for both methods, indicating that the autoencoder’s sensitivity reflects genuine biological signal rather than noise.

For pathway recovery, LaCoGSEA’s global correlation ranking outperformed gradient‑based explainability methods (SHAP, DeepLIFT) and linear attribution (absolute PCA loadings). The authors defined a model‑level rank for each pathway as the best (numerically lowest) rank across all latent dimensions in which the pathway achieved FDR < 0.05. LaCoGSEA achieved the best model‑level ranks across the KEGG, GO, and C6 collections, indicating superior prioritization of biologically relevant pathways.
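The model‑level rank described above can be sketched in a few lines of Python. The input layout (one dictionary per latent dimension, mapping a pathway to its rank and FDR), the function name, and the pathway labels are assumptions made for illustration. The summary does not say how pathways that are never significant are handled, so this sketch simply omits them.

```python
def model_level_ranks(per_dim_results, fdr_cutoff=0.05):
    """Best (lowest) rank per pathway over the dimensions where it is significant.

    per_dim_results : list with one dict per latent dimension,
                      each mapping pathway -> (rank, fdr)
    Pathways that never reach fdr < fdr_cutoff are left out of the result.
    """
    best = {}
    for results in per_dim_results:
        for pathway, (rank, fdr) in results.items():
            if fdr < fdr_cutoff and rank < best.get(pathway, float("inf")):
                best[pathway] = rank
    return best


# Hypothetical enrichment results for two latent dimensions.
per_dim = [
    {"P53": (3, 0.01), "MYC": (1, 0.20), "WNT": (2, 0.04), "NOTCH": (4, 0.50)},
    {"P53": (1, 0.30), "MYC": (5, 0.03), "WNT": (7, 0.001), "NOTCH": (2, 0.90)},
]
ranks = model_level_ranks(per_dim)
```

Here P53 keeps rank 3 because its rank‑1 appearance in the second dimension is not significant, which shows why the FDR filter has to be applied before taking the minimum.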

At the sample level, pathway activity scores derived from LaCoGSEA enabled accurate unsupervised sub‑type discovery. In the SCAN‑B breast cancer cohort, K‑means clustering on the activity matrix yielded an Adjusted Rand Index (ARI) of 0.372 against PAM50 clinical labels, surpassing PCA (0.240), ssGSEA (0.185), and GSVA (0.126). t‑SNE visualizations displayed clear separation of known subtypes, confirming that the latent‑derived pathways capture meaningful heterogeneity. Similar improvements were observed in the TCGA‑NSCLC cohort.
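For readers unfamiliar with the metric, the Adjusted Rand Index can be computed from the contingency table of two labelings, as in the standard‑library sketch below. The function name and toy labels are illustrative and are not taken from the paper. The convention of returning 1.0 when the denominator is zero is an assumption for degenerate inputs.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as trivially identical

    # Contingency counts: joint cells, row marginals, column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Toy labels: reference subtypes versus cluster assignments (both hypothetical).
subtypes = ["LumA", "LumA", "Basal", "Basal"]
perfect = adjusted_rand_index(subtypes, [1, 1, 0, 0])   # same partition, renamed
crossed = adjusted_rand_index(subtypes, [0, 1, 0, 1])   # clusters cut across subtypes
```

The index equals 1 for identical partitions regardless of cluster naming, is near 0 for chance‑level agreement, and can be negative when agreement is worse than chance. An ARI of 0.372 therefore indicates partial agreement with the PAM50 labels, well above chance but far from perfect.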

Robustness analyses showed that LaCoGSEA maintains performance across sample sizes (including as few as 30 samples) and data types, supporting its applicability to both large consortium datasets and smaller experimental studies. The method was also resistant to technical noise: pathway detection remained consistent despite variations in sequencing depth and dropout rates.

In summary, LaCoGSEA offers three major advantages: (1) it enables fully unsupervised pathway enrichment by converting latent‑gene correlations into dense gene rankings; (2) it captures non‑linear biological structure that linear methods miss, leading to higher pathway recovery and more accurate subtype stratification; and (3) it provides a robust, label‑free alternative to gradient‑based XAI approaches, which are ill‑suited for the dense, redundant nature of biological pathways. The authors provide an open‑source implementation (GitHub) and suggest future extensions such as variational autoencoders or graph‑based embeddings to further improve performance on limited‑sample scenarios. LaCoGSEA thus represents a significant step toward integrating deep learning’s representation power with the statistical rigor of classical gene set enrichment in an unsupervised setting.

