End to End Collaborative Synthetic Data Generation
The success of AI is based on the availability of data to train models. While in some cases a single data custodian may have sufficient data to enable AI, often multiple custodians need to collaborate to reach the cumulative size required for meaningful AI research. The latter is, for example, often the case for rare diseases, with each clinical site having data for only a small number of patients. Recent algorithms for federated synthetic data generation are an important step towards collaborative, privacy-preserving data sharing. Existing techniques, however, focus exclusively on synthesizer training, assuming that the training data is already preprocessed and that the desired synthetic data can be delivered in one shot, without any hyperparameter tuning. In this paper, we propose an end-to-end collaborative framework for publishing synthetic data that accounts for privacy-preserving preprocessing as well as evaluation. We instantiate this framework with Secure Multiparty Computation (MPC) protocols and evaluate it in a use case for privacy-preserving publishing of synthetic genomic data for leukemia.
💡 Research Summary
The paper addresses a critical gap in the current literature on federated synthetic data generation (SDG): while many recent works focus on training a synthesizer under privacy constraints, they assume that the data have already been pre‑processed, that a single set of hyper‑parameters will suffice, and that the synthetic output can be released in one shot. In practice, generating high‑quality synthetic data is a multi‑stage pipeline that includes privacy‑preserving preprocessing (e.g., quantile binning of continuous attributes), rigorous evaluation against the original data, and iterative hyper‑parameter tuning. Each of these stages consumes part of the privacy budget, especially when differential privacy (DP) guarantees are required for both input privacy (no raw data leakage among collaborators) and output privacy (the released synthetic data must not reveal individual records).
To close this gap, the authors propose an end‑to‑end collaborative framework that integrates all stages of the SDG pipeline while preserving both input and output privacy. The framework is built on Secure Multiparty Computation (MPC) as a service: each data custodian secret‑shares its raw records among a set of non‑colluding MPC servers. All subsequent computations—DP‑protected preprocessing, model training, cross‑validation evaluation, and synthetic data generation—are performed on secret‑shared data, ensuring that no party ever sees the underlying raw values.
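The secret sharing at the heart of this setup can be illustrated with a minimal additive-sharing sketch: each custodian splits a value into random shares that individually reveal nothing, and only the sum of all shares (mod a fixed ring size) recovers the original. This is a toy illustration, not the paper's actual MPC protocol; the ring size and the three-server setting are assumptions for the example.

```python
import secrets

MOD = 2**64  # ring size for the shares; an assumption for this sketch

def share(value: int, n_servers: int = 3) -> list[int]:
    """Split `value` into additive shares, one per MPC server.

    Each of the first n-1 shares is uniformly random, so any strict
    subset of shares is statistically independent of `value`.
    """
    shares = [secrets.randbelow(MOD) for _ in range(n_servers - 1)]
    last = (value - sum(shares)) % MOD
    return shares + [last]

def reconstruct(shares: list[int]) -> int:
    """Recover the original value by summing all shares mod MOD."""
    return sum(shares) % MOD
```

In a real deployment each custodian sends one share to each non-colluding server, and the servers compute on shares directly (e.g., additions are local; multiplications require an interactive protocol), so the raw records are never reassembled anywhere.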
A key technical contribution is the “privacy‑budget reset” mechanism. In each iteration of hyper‑parameter search, the same privacy budget ε is allocated for preprocessing, training, and evaluation, but because all intermediate artifacts (synthetic samples, model parameters, evaluation metrics) remain secret and are never published, the budget is not actually depleted. Consequently, the total privacy cost of the entire search process remains bounded by a single ε, the same cost as a naïve one‑shot pipeline. The framework uses k‑fold cross‑validation inside MPC to obtain robust utility estimates; the average metric is compared against a pre‑defined quality threshold. When the threshold is met, the selected hyper‑parameters are fixed, the model is retrained on the full combined dataset (still within the same ε budget), and the final synthetic dataset is released.
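The control flow of this search can be sketched as a simple loop in which everything except the final choice stays secret-shared. The sketch below stubs out the MPC computation with a caller-supplied `evaluate_fold` callback; the function names and the early-stopping-at-threshold policy are illustrative assumptions, not the paper's exact algorithm.

```python
def select_hyperparameters(candidates, evaluate_fold, k=5, threshold=0.8):
    """Hypothetical sketch of the budget-reset hyperparameter search.

    For each candidate setting, run k-fold cross-validation (in the real
    system, entirely inside MPC on secret-shared data) and compare the
    average utility to a quality threshold. Because no intermediate
    artifact is ever published, the DP budget epsilon is spent only once,
    on the finally released synthetic dataset.
    """
    for params in candidates:
        # evaluate_fold(params, fold) stands in for DP preprocessing,
        # training, and evaluation on one held-out fold inside MPC
        avg_utility = sum(evaluate_fold(params, fold) for fold in range(k)) / k
        if avg_utility >= threshold:
            # In the full pipeline: retrain on the complete combined
            # dataset with the same epsilon, then release the output.
            return params
    return None  # no candidate met the quality bar; nothing is released
```

The key point the sketch captures is that the loop's intermediate metrics never leave the secure computation, so iterating does not multiply the privacy cost the way a conventional (published-results) hyperparameter search would.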
The authors instantiate the framework with concrete DP‑in‑MPC protocols: (a) a quantile‑binning protocol that computes DP quantiles over the combined data without revealing raw values, (b) an evaluation protocol that runs logistic‑regression classifiers and computes workload‑error metrics across k folds, and (c) a DP‑protected Private‑PGM synthesizer that learns a graphical model from noisy low‑dimensional marginals. They evaluate the system on a realistic use case—synthetic genomic data for five leukemia sub‑types (ALL, AML, CLL, CML, “Other”) collected from multiple hospitals. Using naïve MPC implementations, they report runtime overheads and demonstrate that the synthetic data achieve utility comparable to a centralized DP baseline (e.g., similar classification accuracy and marginal fidelity) while guaranteeing input privacy.
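To give intuition for protocol (a), here is a minimal cleartext sketch of DP quantile binning via a Laplace-noised histogram: each record falls in exactly one cell (sensitivity 1), so adding Laplace(1/ε) noise to the cell counts yields ε-DP counts, from which interior cut points are read off. This is an illustrative stand-in, not the paper's MPC protocol; the grid resolution and noise mechanism are assumptions of this sketch.

```python
import math
import random

def _laplace(scale: float) -> float:
    """Draw one sample from a zero-mean Laplace distribution."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_quantile_cuts(values, n_bins, epsilon, lower, upper, grid=64):
    """Approximate DP quantile cut points from a noisy histogram.

    Builds a fixed `grid`-cell histogram over [lower, upper], perturbs
    each count with Laplace(1/epsilon) noise, and returns the n_bins - 1
    interior cut points where the noisy cumulative mass crosses each
    i/n_bins fraction of the noisy total.
    """
    width = (upper - lower) / grid
    counts = [0] * grid
    for v in values:
        idx = min(int((v - lower) / width), grid - 1)
        counts[idx] += 1
    # Sensitivity is 1: one record changes exactly one cell by one.
    noisy = [max(c + _laplace(1.0 / epsilon), 0.0) for c in counts]
    total = sum(noisy)
    cuts, cum, target = [], 0.0, 1
    for i, c in enumerate(noisy):
        cum += c
        while target < n_bins and cum >= total * target / n_bins:
            cuts.append(lower + (i + 1) * width)  # right edge of cell i
            target += 1
    return cuts
```

In the paper's setting, the histogram, noise addition, and threshold crossings would all be computed on secret-shared values, so no server ever observes the raw data or the exact counts.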
The paper’s contributions can be summarized as follows:
- First full‑pipeline framework that jointly handles privacy‑preserving preprocessing, hyper‑parameter tuning, evaluation, and synthetic data release across multiple data silos, with both input and output DP guarantees.
- Budget‑efficient design that keeps the total privacy loss equal to a single ε despite multiple iterative runs, by never exposing intermediate results.
- Modular MPC‑based architecture allowing plug‑and‑play of different DP‑in‑MPC primitives for various data types and SDG algorithms.
- Empirical validation on a challenging biomedical dataset, showing that the approach attains utility close to centralized methods while preserving strict privacy.
Overall, the work provides a practical, cryptographically sound solution for collaborative synthetic data generation in regulated domains such as healthcare and finance, where data cannot be pooled centrally and legal constraints prohibit reliance on trusted third parties. The modular nature of the framework suggests that it can be extended to other data modalities (e.g., time‑series, images) and more sophisticated hyper‑parameter optimization strategies (e.g., Bayesian optimization) while retaining the same strong privacy guarantees.