A Contrastive Variational AutoEncoder for NSCLC Survival Prediction with Missing Modalities
Predicting survival outcomes for non-small cell lung cancer (NSCLC) patients is challenging due to the heterogeneity of individual prognostic features. This task can benefit from the integration of whole-slide images, bulk transcriptomics, and DNA methylation, which offer complementary views of the patient’s condition at diagnosis. However, real-world clinical datasets are often incomplete, with entire modalities missing for a significant fraction of patients. State-of-the-art models rely on available data to create patient-level representations or use generative models to infer missing modalities, but they lack robustness in cases of severe missingness. We propose a Multimodal Contrastive Variational AutoEncoder (MCVAE) to address this issue: modality-specific variational encoders capture the uncertainty in each data source, and a fusion bottleneck with learned gating mechanisms normalizes the contributions of the present modalities. We propose a multi-task objective that combines a survival loss and a reconstruction loss to regularize patient representations, along with a cross-modal contrastive loss that enforces alignment across modalities in the latent space. During training, we apply stochastic modality masking to improve robustness to arbitrary missingness patterns. Extensive evaluations on the TCGA-LUAD (n=475) and TCGA-LUSC (n=446) datasets demonstrate the efficacy of our approach in predicting disease-specific survival (DSS) and its robustness to severe missingness scenarios compared to two state-of-the-art models. Finally, we shed light on multimodal integration by testing our model on all subsets of modalities, finding that integration is not always beneficial to the task.
💡 Research Summary
This paper tackles the clinically important problem of predicting disease‑specific survival (DSS) for non‑small cell lung cancer (NSCLC) patients when multimodal data are partially missing. The authors argue that whole‑slide images (WSI), bulk transcriptomics, and DNA methylation each provide complementary information, but real‑world cohorts often lack one or more of these modalities for a substantial fraction of patients. Existing multimodal approaches either ignore missing modalities (by training only on complete cases) or try to impute them, leading to biased representations and poor robustness under severe missingness.
To address these limitations, the authors propose a Multimodal Contrastive Variational AutoEncoder (MCVAE). The architecture consists of four modality‑specific variational encoders that map each available data source into a Gaussian latent distribution (mean µₖ and diagonal covariance σₖ²). The variance captures the uncertainty of the modality: larger σ indicates lower confidence. For missing modalities the latent vector is set to zero. A learned gating mechanism, parameterized by γₖ and passed through a sigmoid, assigns a continuous importance weight to each modality. The gated latent vectors are summed, normalized by the number of present modalities, and passed through a small fusion network h(·) to obtain a shared representation z_fused. This design ensures that the contribution of each modality is scaled both by its learned importance and by its actual availability, avoiding the equal‑weight assumption of simple concatenation.
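The gated fusion step described above can be sketched as follows. This is a minimal illustrative NumPy version, not the authors' implementation: the function and argument names (`gated_fusion`, `gamma`, `fuse`) are assumptions, and the small fusion network h(·) is replaced by an identity for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(z, present, gamma, fuse=lambda v: v):
    """Fuse per-modality latents with learned sigmoid gates.

    z:       (K, d) latent vectors, zeros for missing modalities
    present: (K,)   binary availability mask
    gamma:   (K,)   learnable gate logits (one per modality)
    fuse:    the small fusion network h(.); identity here for brevity
    """
    gates = sigmoid(gamma) * present          # gate each modality, zero out missing ones
    pooled = (gates[:, None] * z).sum(axis=0) # weighted sum over modalities
    count = max(present.sum(), 1.0)           # normalize by # of present modalities
    return fuse(pooled / count)
```

Because missing modalities enter as zero vectors with zero gates, a patient with one available modality and a patient with four produce latents on a comparable scale, which simple concatenation would not guarantee.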
Training is performed with a multi‑task loss that combines:
- Survival loss (Cox partial likelihood) – directly optimizes the log‑hazard prediction on the fused representation.
- Reconstruction loss – modality‑specific decoders reconstruct each available input from z_fused, encouraging the shared space to retain modality‑specific information. Missing modalities are not reconstructed, preventing spurious gradients.
- KL divergence loss – regularizes each variational posterior toward a standard normal prior, with learnable modality‑specific weights wₖ (softmax) that let the model down‑weight unreliable modalities. An annealing factor β(t) mitigates posterior collapse.
- Cross‑modal contrastive loss (InfoNCE) – pulls together latent vectors from different modalities of the same patient (positives) while pushing apart vectors from different patients (negatives), fostering a modality‑agnostic patient embedding.
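Two of the loss components above, the Cox partial likelihood and the cross-modal InfoNCE term, can be sketched concretely. These are standard textbook formulations (Breslow-style Cox likelihood without tie correction, cosine-similarity InfoNCE), offered as an illustration of what the summary describes rather than the paper's exact code:

```python
import numpy as np

def cox_ph_loss(log_hazard, time, event):
    """Negative Cox partial log-likelihood (Breslow, no tie correction).

    For each event i: h_i - log(sum_{j: t_j >= t_i} exp(h_j)).
    """
    order = np.argsort(-time)                 # sort patients by descending time
    h = log_hazard[order]
    e = event[order]
    log_risk = np.logaddexp.accumulate(h)     # log sum exp over each risk set
    return -np.sum((h - log_risk)[e == 1]) / max(e.sum(), 1)

def info_nce(za, zb, tau=0.1):
    """Cross-modal InfoNCE: row i of za and row i of zb are the same patient."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau                  # (N, N) cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))        # maximize the matched-pair probability
```

In the paper's setting, `info_nce` would be applied pairwise to the latent vectors of different modalities of the same patient, so that, e.g., the transcriptomics and methylation embeddings of one patient are pulled together while other patients in the batch serve as negatives.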
Dynamic uncertainty‑based weighting of the four components replaces fixed λ‑coefficients, allowing the network to balance tasks automatically. Crucially, during training the authors apply stochastic modality dropout: for each patient and each modality a Bernoulli mask is sampled, temporarily hiding the modality even if it is present. This “modality masking” forces the model to learn to operate under arbitrary missingness patterns, effectively blending the strengths of dropout‑based robustness and imputation‑based reconstruction without actually synthesizing missing data.
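The two training-time mechanisms in the paragraph above can be sketched as follows. The uncertainty-based weighting is written here in the common Kendall-style form (L = Σₖ exp(−sₖ)·Lₖ + sₖ), which is an assumption about the exact parameterization; the modality-masking helper and its safeguard of keeping at least one modality visible are likewise illustrative, not the authors' code:

```python
import numpy as np

def uncertainty_weighted_sum(losses, log_vars):
    """Dynamic task weighting: sum_k exp(-s_k) * L_k + s_k.

    log_vars (s_k) are learnable; large s_k down-weights a noisy task
    while the additive s_k term prevents all weights collapsing to zero.
    (Kendall-style form assumed; the paper may parameterize differently.)
    """
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))

def modality_masking(present, p_drop=0.3, rng=None):
    """Stochastic modality dropout for training.

    present: (N, K) binary matrix of truly available modalities.
    Each available modality is hidden with probability p_drop; at least
    one modality is kept visible per patient so the encoder has input.
    """
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(present.shape) >= p_drop
    masked = present * keep
    for i in np.where(masked.sum(axis=1) == 0)[0]:
        avail = np.where(present[i] == 1)[0]
        if avail.size:
            masked[i, rng.choice(avail)] = 1   # restore one random modality
    return masked
```

Because masking only ever hides modalities that truly exist, the reconstruction loss can still be computed on the hidden ones, which is what lets the model learn imputation-like behavior without generating synthetic data.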
The method is evaluated on two TCGA NSCLC cohorts: LUAD (n = 475) and LUSC (n = 446). Experiments systematically vary the proportion of missing modalities from 0 % to 80 % and compare against two state‑of‑the‑art baselines: MUSE (graph‑based multimodal aggregation) and SMIL (probabilistic imputation). Results show that MCVAE consistently achieves higher concordance indices (C‑index ≈ 0.71 for LUAD, 0.69 for LUSC) than the baselines (≈ 0.64–0.66). Under severe missingness (≥ 80 % of modalities hidden), the performance drop of MCVAE is less than 0.03 C‑index points, whereas MUSE and SMIL degrade by 0.12–0.18 points. Ablation studies reveal that the contrastive loss contributes ≈ 0.04 C‑index improvement, and the gating mechanism learns sensible importance scores (transcriptomics > methylation > clinical > WSI).
A comprehensive modality‑combination analysis uncovers that integrating all modalities does not always improve prediction; for some subsets (e.g., transcriptomics + methylation) the model performs worse than using transcriptomics alone, highlighting that multimodal fusion must be selective and data‑quality aware.
In discussion, the authors emphasize that their approach provides a principled way to quantify modality uncertainty, adaptively weight available sources, and maintain robustness without explicit imputation. Limitations include the relatively simple MLP encoders for high‑dimensional image data and the sensitivity to hyper‑parameters such as the temperature τ and KL annealing schedule. Future work could explore more expressive convolutional variational encoders, automated hyper‑parameter tuning, and extension to other cancer types or prospective clinical settings.
Overall, the paper makes a solid contribution to multimodal survival analysis by introducing a variational, contrastive, and masking‑aware framework that remains accurate even when large portions of the data are missing—a scenario that closely mirrors real‑world oncology practice.