TwinPurify: Purifying gene expression data to reveal tumor-intrinsic transcriptional programs via self-supervised learning

Advances in single-cell and spatial transcriptomic technologies have transformed tumor ecosystem profiling at cellular resolution. However, large-scale studies on patient cohorts continue to rely on bulk transcriptomic data, where variation in tumor purity obscures tumor-intrinsic transcriptional signals and constrains downstream discovery. Many deconvolution methods report strong performance on synthetic bulk mixtures but fail to generalize to real patient cohorts because of unmodeled biological and technical variation. Here, we introduce TwinPurify, a representation learning framework that adapts the Barlow Twins self-supervised objective and departs from the deconvolution paradigm. Rather than resolving the bulk mixture into discrete cell-type fractions, TwinPurify learns continuous, high-dimensional tumor embeddings by leveraging adjacent-normal profiles within the same cohort as “background” guidance, enabling the disentanglement of tumor-specific signals without relying on any external reference. Benchmarked on multiple large cancer cohorts across RNA-seq and microarray platforms, TwinPurify outperforms conventional representation learning baselines such as auto-encoders in recovering tumor-intrinsic and immune signals. Compared to raw bulk profiles, the purified embeddings improve molecular subtype and grade classification, enhance survival model concordance, and uncover biologically meaningful pathway activities. By providing a transferable framework for decontaminating bulk transcriptomics, TwinPurify extends the utility of existing clinical datasets for molecular discovery.

💡 Research Summary

TwinPurify corrects for tumor purity effects in bulk transcriptomic data through self‑supervised representation learning rather than traditional deconvolution. The method adapts the Barlow Twins objective, using adjacent normal tissue samples from the same cohort as structured perturbations. During training, each tumor expression vector is mixed with a synthetic normal reference created by combining five randomly selected normal profiles with equal weights. The tumor weight α in this mixture is tuned as a hyperparameter (final value ≈0.27 tumor, 0.73 normal), and two distinct “views” of each tumor sample are generated by sampling different sets of normal profiles. Both views pass through a shared encoder and projection head, producing embeddings z₁ and z₂ that are batch‑normalized. The cross‑correlation matrix C between the two embeddings is computed, and the loss L_TP = Σ_i (1 − C_ii)² + λ Σ_{i≠j} C_ij² pushes diagonal entries toward 1 and off‑diagonal entries toward 0. Because the two views share the same tumor profile and differ only in their normal background, the diagonal term rewards features that are invariant to that background (preserving tumor signal), while the off‑diagonal term decorrelates the latent dimensions. The weight λ (final value 54.9) balances the two terms.
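
The view construction and loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `make_view` and `barlow_twins_loss`, the random toy data, and the linear map `W` standing in for the shared encoder and projection head are all assumptions, and "uniformly combining" is read here as a plain mean of the sampled normal profiles.

```python
import numpy as np

def make_view(tumor, normals, alpha, rng, k=5):
    """Mix each tumor profile with the mean of k randomly chosen normal profiles."""
    views = np.empty(tumor.shape, dtype=float)
    for i in range(tumor.shape[0]):
        idx = rng.choice(normals.shape[0], size=k, replace=False)
        reference = normals[idx].mean(axis=0)  # equal-weight synthetic normal (assumed)
        views[i] = alpha * tumor[i] + (1.0 - alpha) * reference
    return views

def barlow_twins_loss(z1, z2, lam, eps=1e-8):
    """Sum_i (1 - C_ii)^2 + lam * Sum_{i != j} C_ij^2 on batch-standardized embeddings."""
    n = z1.shape[0]
    z1 = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + eps)
    z2 = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + eps)
    c = z1.T @ z2 / n                                  # cross-correlation matrix C
    on_diag = np.sum((1.0 - np.diag(c)) ** 2)          # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # redundancy term
    return on_diag + lam * off_diag

# Toy data: 32 tumor samples, 10 adjacent normals, 50 genes (all synthetic).
rng = np.random.default_rng(0)
tumor = rng.normal(size=(32, 50))
normals = rng.normal(size=(10, 50))
W = rng.normal(size=(50, 8))  # hypothetical stand-in for encoder + projection head

v1 = make_view(tumor, normals, alpha=0.27, rng=rng)
v2 = make_view(tumor, normals, alpha=0.27, rng=rng)
loss = barlow_twins_loss(v1 @ W, v2 @ W, lam=54.9)
```

In an actual training loop the loss would be minimized with respect to the encoder parameters using an automatic differentiation framework; the sketch only evaluates it once to show how the two views and the matrix C fit together.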

TwinPurify is benchmarked against three baselines: a standard auto‑encoder (AE), a variational auto‑encoder (VAE), and Principal Component Analysis (PCA). All models receive the same input (tumor + adjacent normal concatenation), but only TwinPurify incorporates the structured normal perturbations and the Barlow Twins loss; AE and VAE rely solely on reconstruction objectives.

Three large breast‑cancer cohorts are used: SCAN‑B and TCGA‑BRCA (RNA‑seq) and METABRIC (microarray). To assess robustness to purity variation, synthetic dilution series are generated for each cohort, ranging from 0% to 100% tumor content in 10% steps. At each dilution level, two downstream tasks are evaluated: (i) concordance of predicted PAM50 intrinsic subtypes with reference labels, and (ii) histological grade prediction (not evaluated on TCGA‑BRCA, which lacks grade labels). TwinPurify consistently outperforms the baselines, especially at low purity (<30% tumor), where AE, VAE, and PCA suffer steep drops in accuracy.
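
A dilution series of this kind can be sketched as follows. The summary does not state how the dilutions are constructed, so the linear mixing of each tumor profile with a single normal reference profile is an assumption, as are the function name `dilution_series` and the toy data.

```python
import numpy as np

def dilution_series(tumor, normal_reference, step=0.1):
    """Return {tumor_fraction: diluted profiles} for fractions 0.0, 0.1, ..., 1.0."""
    n_steps = int(round(1.0 / step))
    series = {}
    for k in range(n_steps + 1):
        frac = round(k * step, 10)  # avoid keys such as 0.30000000000000004
        # Assumed linear mixing model: frac * tumor + (1 - frac) * normal.
        series[frac] = frac * tumor + (1.0 - frac) * normal_reference
    return series

# Toy data: 4 tumor samples and one normal reference over 6 genes (synthetic).
rng = np.random.default_rng(1)
tumor = rng.normal(size=(4, 6))
normal_reference = rng.normal(size=6)
series = dilution_series(tumor, normal_reference)
```

Each entry of `series` can then be embedded by the model under test and passed to the subtype or grade classifier, so that accuracy is reported as a function of tumor fraction.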

Biological relevance of the learned embeddings is examined by correlating each latent dimension with all genes, ranking genes per dimension, and running GSEA on the GO Biological Process and MSigDB immunologic signature (C7) collections. TwinPurify yields a larger number of unique, statistically significant pathways across dimensions, and a higher “uniqueness” score, indicating less overlap of top genes between dimensions. In contrast, AE and VAE produce more redundant pathway enrichment, reflecting less orthogonal latent spaces.
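
The ranking step that precedes GSEA can be illustrated with a short NumPy sketch. Pearson correlation is assumed as the correlation measure, and `top_gene_uniqueness` is a hypothetical overlap score written for illustration only; the summary does not define the paper's "uniqueness" score, so it should not be read as the authors' metric.

```python
import numpy as np

def rank_genes_per_dimension(embeddings, expression):
    """Pearson correlation of every latent dimension with every gene, plus gene ranks."""
    z = (embeddings - embeddings.mean(axis=0)) / embeddings.std(axis=0)
    x = (expression - expression.mean(axis=0)) / expression.std(axis=0)
    corr = z.T @ x / embeddings.shape[0]  # shape: (n_dims, n_genes)
    order = np.argsort(-corr, axis=1)     # gene indices by decreasing correlation
    return corr, order

def top_gene_uniqueness(order, top_k):
    """Hypothetical score: fraction of top-k slots filled by a gene seen in one dimension only."""
    top = order[:, :top_k].ravel()
    _, counts = np.unique(top, return_counts=True)
    return float(np.sum(counts == 1)) / top.size

# Toy data: gene 0 copies latent dimension 0 and gene 3 copies dimension 1.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 2))
expr = rng.normal(size=(100, 6))
expr[:, 0] = emb[:, 0]
expr[:, 3] = emb[:, 1]
corr, order = rank_genes_per_dimension(emb, expr)
```

Each row of `corr` gives one ranked gene list, which is the kind of input a preranked enrichment analysis expects.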

Survival analysis uses the embeddings as covariates in Cox proportional hazards models. TwinPurify embeddings achieve higher concordance indices (C‑index) than raw expression and the baseline embeddings, demonstrating that purity‑corrected representations improve prognostic modeling.
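
For readers unfamiliar with the metric, the concordance index can be computed as in the sketch below. It implements a simplified Harrell's C for a given vector of risk scores (for example, the linear predictor of a fitted Cox model); fitting the Cox model itself is not shown, ties in survival time are ignored, and the function name and toy values are illustrative.

```python
def concordance_index(times, events, risk):
    """Harrell's C: share of comparable pairs in which the higher-risk patient fails first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a comparable pair needs an observed event at the earlier time
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: the second patient is censored (event flag 0).
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risk = [0.9, 0.1, 0.5, 0.7]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why a higher C‑index for the TwinPurify embeddings is read as better prognostic ranking.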

Key contributions of TwinPurify are: (1) leveraging cohort‑specific normal samples as realistic, structured noise; (2) employing a Barlow‑Twins cross‑correlation loss to enforce tumor‑specific, decorrelated latent features; (3) demonstrating cross‑platform (RNA‑seq, microarray) transferability; and (4) showing consistent gains in molecular subtyping, grade prediction, pathway discovery, and survival prediction. Limitations include dependence on sufficient normal samples, the need to retune hyper‑parameters (α, λ) for different cohorts, and validation only on breast cancer data. Future work should test the framework on other tumor types, explore synthetic normal generation for datasets lacking normals, and integrate the embeddings into downstream biomarker discovery pipelines.
