Risk-Equalized Differentially Private Synthetic Data: Protecting Outliers by Controlling Record-Level Influence

Risk-Equalized Differentially Private Synthetic Data: Protecting Outliers by Controlling Record-Level Influence
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

When synthetic data is released, some individuals are harder to protect than others. A patient with a rare disease combination or a transaction with unusual characteristics stands out from the crowd. Differential privacy provides worst-case guarantees, but empirical attacks – particularly membership inference – succeed far more often against such outliers, especially under moderate privacy budgets and with auxiliary information. This paper introduces risk-equalized DP synthesis, a framework that prioritizes protection for high-risk records by reducing their influence on the learned generator. The mechanism operates in two stages: first, a small privacy budget estimates each record’s “outlierness”; second, a DP learning procedure weights each record inversely to its risk score. Under Gaussian mechanisms, a record’s privacy loss is proportional to its influence on the output – so deliberately shrinking outliers’ contributions yields tighter per-instance privacy bounds for precisely those records that need them most. We prove end-to-end DP guarantees via composition and derive closed-form per-record bounds for the synthesis stage (the scoring stage adds a uniform per-record term). Experiments on simulated data with controlled outlier injection show that risk-weighting substantially reduces membership inference success against high-outlierness records; ablations confirm that targeting – not random downweighting – drives the improvement. On real-world benchmarks (Breast Cancer, Adult, German Credit), gains are dataset-dependent, highlighting the interplay between scorer quality and synthesis pipeline.


💡 Research Summary

The paper addresses a critical shortcoming of standard differential privacy (DP) when applied to synthetic data generation: the uniform privacy budget (ε, δ) treats all records alike, yet empirical attacks such as membership inference and linkage attacks disproportionately succeed against rare or atypical records (outliers). In domains like healthcare or finance, these outliers often correspond to vulnerable individuals whose privacy must be protected more rigorously.

To remedy this, the authors propose Risk‑Equalized Private Synthesis (REPS), a two‑stage framework that explicitly targets high‑risk records and provides tighter per‑instance privacy guarantees for them while preserving overall utility.

Stage 1 – Private outlier scoring.
A small privacy budget (ε_s, δ_s) is allocated to compute an “outlierness” score for each record. The paper demonstrates a concrete DP histogram‑based scorer: each feature is discretized, noisy bin counts are released using Gaussian noise calibrated to the ℓ₂‑sensitivity √d (d = number of features), and a log‑density score is derived from the noisy marginal probabilities. Alternative DP‑compatible scorers (DP‑k‑nearest‑neighbors, DP clustering) are mentioned, but the theory treats the scorer abstractly.

Stage 2 – Risk‑weighted DP learning.
The outlier scores b_s_i are transformed into weights w_i ∈ (0, 1] via a monotone decreasing mapping g(·). Two families are suggested: a capped linear form w_i = min{1, τ/(b_s_i+τ)} and a hinge‑exponential form w_i = exp(−γ(b_s_i−t)). These weights are then incorporated into the learning algorithm for the synthetic data generator. The authors instantiate two variants:

Variant A (statistical synthesis): Each record is mapped to a sufficient‑statistics vector ϕ(z_i). The weighted, clipped aggregate f_w(D) = (1/n) Σ_i w_i·clip(ϕ(z_i), C) is computed, Gaussian noise N(0, σ_t²I) is added, and a parametric model (e.g., an exponential family) is fitted to the noisy statistics.

Variant B (deep generative synthesis): DP‑SGD is modified so that per‑example gradients are scaled by w_i before clipping at norm C and adding Gaussian noise. This yields a differentially private generator whose parameters inherit the per‑record influence reduction.

Theoretical contribution.
The paper proves that under Gaussian mechanisms the privacy loss for record i satisfies ε_i ≤ w_i·α_i, where α_i is the bound on the record’s contribution after clipping. Consequently, by choosing w_i such that w_i·α_i ≤ τ_out for all high‑risk records (those with scores above a threshold τ), the mechanism guarantees a tighter per‑instance bound ε_out < ε for that subset. The two stages compose additively, giving overall (ε_s+ε_t, δ_s+δ_t)‑DP. Closed‑form expressions for ε_i are derived, and a constructive schedule for selecting τ, γ, and other hyper‑parameters is provided.

Empirical evaluation.
Experiments are conducted on (1) synthetic data with controlled outlier injection and (2) three real‑world tabular benchmarks: Wisconsin Breast Cancer, Adult Income, and German Credit. Membership inference attacks are implemented, including recent attacks tailored to marginal‑based generators (DOMIAS) and deep generators (MAMA‑MIA). Results show that REPS dramatically reduces attack success on high‑outlierness records (30–50 percentage‑point drops) while maintaining comparable utility (statistical distance, downstream classifier performance) to baseline DP synthesizers that do not weight records. Ablation studies confirm that the improvement stems from risk‑based weighting rather than random down‑weighting.

Comparison to related work.
The authors position REPS relative to (i) empirical privacy vulnerability literature, (ii) per‑instance DP analyses that are post‑hoc rather than mechanism‑driven, (iii) standard DP synthetic data methods that treat all records uniformly, and (iv) targeted high‑risk approaches such as ε‑PrivSMOTE, which replace outliers rather than shrink their influence. REPS uniquely integrates a DP‑compliant risk estimator with a weighted learning procedure, delivering formal per‑record privacy bounds and preserving the contribution of outliers to the data distribution.

Limitations and future directions.
The main constraints are the accuracy of the private outlier scores (limited by ε_s), the need to tune weight‑mapping hyper‑parameters for each domain, and the focus on tabular data. Extending REPS to high‑dimensional continuous data (images, time series) and automating the selection of τ, γ, and the clipping norm C are identified as promising research avenues.

Conclusion.
Risk‑Equalized Private Synthesis provides a principled, theoretically grounded method to equalize privacy risk across records in synthetic data releases. By explicitly down‑weighting high‑risk outliers during the learning phase, it achieves tighter per‑instance privacy guarantees where they are most needed, without sacrificing overall data utility. This work bridges the gap between empirical observations of outlier vulnerability and practical DP mechanism design, offering a valuable tool for privacy‑sensitive data sharing in healthcare, finance, and other high‑stakes domains.


Comments & Academic Discussion

Loading comments...

Leave a Comment