Democratising Clinical AI through Dataset Condensation for Classical Clinical Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Dataset condensation (DC) learns a compact synthetic dataset that enables models to match the performance of full-data training, prioritising utility over distributional fidelity. While typically explored for computational efficiency, DC also holds promise for healthcare data democratisation, especially when paired with differential privacy, allowing synthetic data to serve as a safe alternative to real records. However, existing DC methods rely on differentiable neural networks, limiting their compatibility with widely used clinical models such as decision trees and Cox regression. We address this gap using a differentially private, zero-order optimisation framework that extends DC to non-differentiable models using only function evaluations. Empirical results across six datasets, including both classification and survival tasks, show that the proposed method produces condensed datasets that preserve model utility while providing effective differential privacy guarantees, enabling model-agnostic data sharing for clinical prediction tasks without exposing sensitive patient information.


💡 Research Summary

The paper tackles a pressing problem in clinical artificial intelligence: how to share useful patient data while respecting privacy regulations and the practical constraints of many healthcare institutions. Existing dataset condensation (DC) techniques create a small synthetic dataset that enables neural‑network models to achieve performance comparable to training on the full data. However, most clinical work relies on classical, non‑differentiable models such as decision trees, gradient‑boosted ensembles, and Cox proportional‑hazards regression. Because traditional DC methods require gradients, they cannot be directly applied to these models, creating a gap between the promise of DC and real‑world clinical practice.

To bridge this gap, the authors propose a differentially private, zero‑order optimisation framework that treats a reference model as a black box. First, a model (e.g., XGBoost for classification or a Cox model for survival) is trained on the real dataset. The synthetic dataset is then initialised randomly and iteratively refined. At each iteration, small perturbations are applied to synthetic inputs, the black‑box model’s predictions on the perturbed inputs are observed, and a finite‑difference estimate of the loss gradient with respect to the synthetic inputs is computed. This estimate guides the update of the synthetic samples. Crucially, Gaussian noise calibrated to a target privacy budget (ε, δ) is added to each update, providing formal (ε, δ)‑differential privacy guarantees for the original patient records. The loss combines a cross‑entropy term (matching synthetic labels to the model’s predictions) and a KL‑divergence term (aligning the predictive distribution of the synthetic set with that of the real data).
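The update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, hyperparameter values, and the single-noise-injection point are all assumptions; the paper's actual loss combines cross-entropy and KL terms, which are abstracted here into a generic `loss_fn`.

```python
import numpy as np

def zo_dp_update(X_syn, y_syn, predict, loss_fn, lr=0.1, n_dirs=10,
                 mu=1e-2, noise_std=0.5, rng=None):
    """One zero-order update of the synthetic inputs (illustrative sketch).

    predict : black-box reference model, X -> predictions (no gradients).
    loss_fn : (predictions, y_syn) -> scalar loss; in the paper this combines
              cross-entropy with a KL term against the real-data distribution.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    base = loss_fn(predict(X_syn), y_syn)
    grad_est = np.zeros_like(X_syn)
    for _ in range(n_dirs):
        u = rng.standard_normal(X_syn.shape)              # random direction
        bumped = loss_fn(predict(X_syn + mu * u), y_syn)  # query the black box
        grad_est += (bumped - base) / mu * u              # finite difference
    grad_est /= n_dirs
    # Gaussian noise calibrated to the (epsilon, delta) budget makes the
    # released update differentially private.
    grad_est += rng.normal(0.0, noise_std, X_syn.shape)
    return X_syn - lr * grad_est
```

Because only `predict` evaluations are needed, the reference model can be any non-differentiable learner (a tree ensemble, a Cox model) rather than a neural network.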

The method is evaluated on six real‑world clinical tasks: three COVID‑19 diagnosis datasets from NHS trusts (Portsmouth, Oxford, and Birmingham), a plasma‑proteomics dataset from the UK Biobank for predicting incident multiple myeloma, a breast‑cancer survival cohort from SEER, and a diabetes‑onset survival cohort from the UK Biobank. For each task the authors train a reference XGBoost (or Cox) model, generate synthetic datasets with varying “instances‑per‑class” (IPC) values of 50, 100, 500, and 1000, and then train fresh models on the synthetic data. Performance is measured by AUROC for classification and concordance index (C‑index) for survival.
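The evaluation protocol (train one model on the full data, another on the condensed set, and compare test-set AUROC) can be sketched as below. Here scikit-learn's `LogisticRegression` stands in for the XGBoost reference model, and the function name is hypothetical; this is a sketch of the comparison, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for XGBoost
from sklearn.metrics import roc_auc_score

def compare_utility(X_train, y_train, X_syn, y_syn, X_test, y_test):
    """Return (full-data AUROC, condensed-data AUROC) on the same test set."""
    full = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    cond = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    return (roc_auc_score(y_test, full.predict_proba(X_test)[:, 1]),
            roc_auc_score(y_test, cond.predict_proba(X_test)[:, 1]))
```

The condensed set would contain only IPC examples per class (e.g., 100), so the second model trains on a small fraction of the original sample size.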

Results show that even with IPC = 100 (roughly 1–2 % of the original sample size) the synthetic data enable models that achieve AUROC scores within 1–3 % of the full‑data baseline across all three COVID‑19 sites (e.g., 0.894 vs. 0.901 for Portsmouth). For the proteomics task, IPC = 500 yields an AUROC of 0.913, surpassing the full‑data score of 0.898. Survival models trained on synthetic data attain C‑indices of 0.78–0.80, essentially matching the full‑data results. Sensitivity and negative predictive value (NPV) remain high, indicating that the condensed data are especially suitable for screening scenarios where ruling out disease safely is critical.

Privacy analysis demonstrates that with a moderate privacy budget (ε ≈ 100, δ = 10⁻⁵) the synthetic datasets retain most of their utility. As ε decreases (stronger privacy), performance degrades gradually, confirming the expected trade‑off. The authors also provide t‑SNE visualisations and nearest‑neighbour distance histograms, showing that synthetic points do not collapse onto individual real records but instead occupy a broader region of the data manifold, reducing memorisation risk.
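For intuition about how the noise scale relates to (ε, δ), the classical Gaussian mechanism gives a closed form. Note this formula is only valid for ε ≤ 1; for large budgets such as ε ≈ 100, tighter accountants (e.g., Rényi DP / moments accounting) are typically used instead, so the function below is an illustration of the privacy-utility trade-off, not the paper's calibration.

```python
import math

def gaussian_mechanism_sigma(epsilon, delta, sensitivity):
    """Noise scale for the classical Gaussian mechanism (valid for epsilon <= 1).

    sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon yields
    (epsilon, delta)-DP for a query with the given L2 sensitivity.
    """
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon
```

The formula makes the trade-off explicit: halving ε doubles the required noise, which is why utility degrades gradually as the budget tightens.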

The paper’s contributions are threefold: (1) a model‑agnostic DC algorithm that works with non‑differentiable, widely‑used clinical models; (2) integration of differential privacy into the condensation process, delivering formal privacy guarantees; and (3) extensive empirical validation across heterogeneous clinical domains, demonstrating that a tiny synthetic dataset can replace the original data for downstream model development.

Limitations are acknowledged. The experiments focus on XGBoost and Cox models; extending the approach to other classical learners (e.g., random forests, support vector machines) remains an open question. Zero‑order optimisation can be sample‑inefficient in very high‑dimensional spaces (e.g., the 2932‑dimensional proteomics data), suggesting that future work could explore dimensionality reduction or smarter query strategies. Finally, the method’s utility under very strict privacy budgets (ε < 50) is limited, indicating that practical deployments will need careful calibration of ε based on the regulatory context and acceptable utility loss.

In summary, this work demonstrates that dataset condensation, when combined with zero‑order optimisation and differential privacy, can democratise access to clinical data for a broad range of traditional predictive models. By producing compact, privacy‑preserving synthetic datasets, the approach lowers barriers for institutions with limited data‑sharing capabilities, supports collaborative research across geographic and economic boundaries, and paves the way for more inclusive, equitable development of clinical AI.

