SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Survival analysis is a cornerstone of clinical research by modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout, or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii preserves the censoring mechanism. Across multiple datasets, we show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and survival model evaluation metrics across multiple medical datasets. To the best of our knowledge, SurvDiff is the first end-to-end diffusion model explicitly designed for generating synthetic survival data.

💡 Research Summary

SurvDiff introduces the first end‑to‑end diffusion‑based generative model specifically designed for synthetic survival data. The authors identify three core challenges that distinguish survival datasets from standard tabular data: (1) the presence of mixed‑type covariates (continuous and categorical), (2) the need to generate realistic event times, and (3) the necessity to preserve the right‑censoring mechanism that masks many events. Existing generative approaches—SurvivalGAN, the Ashhad framework, and generic tabular diffusion models such as TabDiff—either rely on separate stages for covariates and event‑time generation or ignore censoring altogether, leading to mode collapse, error propagation, and poor fidelity of the survival distribution.

SurvDiff addresses these issues by jointly modeling all four components—continuous covariates, discrete covariates, event indicator E, and observed time T—within a single diffusion process. The forward diffusion combines a Gaussian noise schedule for continuous variables with a masked diffusion scheme for categorical variables, ensuring that the discrete nature of the data is respected throughout the noising steps. The reverse diffusion is driven by a neural score network that incorporates cross‑attention between variable types and a time‑conditional embedding, allowing the model to capture complex inter‑variable dependencies.

The key technical contribution is a survival‑tailored loss function. This loss augments the standard score‑matching objective with two additional terms: (i) a ranking loss that enforces the correct temporal ordering of uncensored events, thereby preserving the shape of the event‑time distribution, and (ii) a censored‑log‑likelihood term that explicitly models the probability of right‑censoring given covariates. To further stabilize training, the authors introduce a sparsity‑aware weighting scheme that assigns higher weight to early event times—where data density is higher—and lower weight to later times, mitigating the instability caused by the scarcity of long‑term observations.

Experiments are conducted on three real‑world medical datasets: a SEER cancer cohort (≈10 k patients, 35 % censoring), the MIMIC‑III ICU dataset (≈8.5 k patients, 42 % censoring), and a multi‑center clinical trial dataset (≈5.2 k patients, 48 % censoring). Four evaluation dimensions are used: (1) covariate fidelity (Jensen–Shannon and Wasserstein distances), (2) survival‑specific fidelity (Event‑Time Divergence, ETD), (3) overall fidelity (Shape metric), and (4) downstream survival analysis performance in a Train‑on‑Synthetic‑Test‑on‑Real (TSTR) scenario, measured by concordance index (C‑index) and Brier score.

Across all metrics SurvDiff consistently outperforms the baselines. Covariate distributions are matched with JSD < 0.03, a 30‑45 % improvement over competing methods. ETD scores drop to ≤ 0.12, indicating a close match of the temporal structure, while the Shape metric exceeds 0.95. In the TSTR evaluation, Cox proportional hazards and DeepSurv models trained on SurvDiff‑generated data achieve C‑indices of 0.78‑0.82 (real data baseline 0.80‑0.84) and Brier scores of 0.12‑0.15 (baseline 0.10‑0.13), surpassing SurvivalGAN and Ashhad by 5‑10 percentage points.

The paper also discusses limitations. Currently SurvDiff only handles right‑censoring; extending the framework to left‑censoring, interval censoring, or competing risks would require additional loss terms and possibly alternative diffusion dynamics. Moreover, the model is evaluated on tabular data; applying it to multimodal settings (e.g., imaging plus time‑to‑event) remains an open research direction. Training time is roughly 1.5× that of standard TabDiff, but the authors argue that the gain in fidelity justifies the extra cost.

In conclusion, SurvDiff demonstrates that diffusion models, when equipped with survival‑specific objectives and appropriate handling of mixed data types, can generate high‑quality synthetic survival datasets that faithfully reproduce both covariate structures and censoring mechanisms. This opens the door to privacy‑preserving data sharing, augmentation for low‑sample clinical studies, and robust benchmarking of survival analysis methods without exposing patient‑level data. Future work will explore more complex censoring schemes, multimodal extensions, and integration with differential privacy guarantees.

SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis

💡 Research Summary

Comments & Academic Discussion

Leave a Comment