Privacy Amplification Persists under Unlimited Synthetic Data Release

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

We study privacy amplification by synthetic data release, a phenomenon in which differential privacy guarantees improve when only synthetic data, rather than the private generative model itself, is released. Recent work by Pierquin et al. (2025) established the first formal amplification guarantees for a linear generator, but these guarantees hold only in the asymptotic regime where the model dimension far exceeds the number of released synthetic records, limiting their practical relevance. In this work, we show a surprising result: under a bounded-parameter assumption, privacy amplification persists even when releasing an unbounded number of synthetic records, thereby improving upon the bounds of Pierquin et al. (2025). Our analysis provides structural insights that may guide the development of tighter privacy guarantees for more complex release mechanisms.


💡 Research Summary

The paper investigates privacy amplification that occurs when only synthetic data, rather than the private generative model itself, is released. Building on the recent work of Pierquin et al. (2025), which proved amplification for a linear generator only in the asymptotic regime where the model dimension d vastly exceeds the number of released synthetic records n_syn, the authors ask whether amplification can persist when n_syn is arbitrarily large. Their main contribution is a positive answer under a bounded‑parameter assumption: if the Frobenius norm of the generator’s weight matrix is bounded by a constant C, then releasing an unbounded number of synthetic records still yields a strictly tighter privacy guarantee than releasing the model parameters directly.

The analysis is carried out in the Rényi Differential Privacy (RDP) framework. The mechanism consists of two stages: (1) a differentially private algorithm outputs a noisy parameter matrix V = v + σN (and similarly W = w + σN′ for an adjacent dataset), where N has i.i.d. standard Gaussian entries; (2) an independent Gaussian latent matrix Z ∈ ℝ^{n_syn×d} is drawn and the synthetic dataset ZV is released. By post‑processing, the privacy loss of the synthetic data is bounded by the Rényi divergence D_α(ZV, ZW). The authors seek a factor η < 1 such that D_α(ZV, ZW) ≤ η·D_α(V, W).
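The two-stage mechanism can be sketched in a few lines of NumPy. This is an illustrative reconstruction under assumed dimensions, not the paper's code; the variable names (`v`, `V`, `Z`, `sigma`) follow the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d-dimensional latent space, k output features,
# n_syn released synthetic records (values are assumptions, not the paper's).
d, k, n_syn, sigma = 50, 5, 1000, 1.0

v = rng.normal(size=(d, k)) / np.sqrt(d)   # private parameter matrix

# Stage 1: differentially private release of noisy parameters V = v + sigma*N.
N = rng.normal(size=(d, k))                # i.i.d. standard Gaussian noise
V = v + sigma * N

# Stage 2: draw an independent Gaussian latent matrix Z and release ZV.
Z = rng.normal(size=(n_syn, d))
synthetic = Z @ V                          # the released synthetic dataset

print(synthetic.shape)  # (1000, 5)
```

Post-processing then says that any analysis of `synthetic` leaks no more than releasing `V` itself; the paper's point is that it can leak strictly less.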

Two technical tools are introduced. First, a local second‑order expansion (Proposition 3.1) links Rényi divergence to Fisher information: for a smooth parametric family {P_θ}, D_α(P_{θ+Δ}, P_θ) = (α/2)·I(θ)·Δ² + o(Δ²). Second, a global bound (Proposition 3.2) integrates this local relationship along a path in parameter space, yielding a non‑asymptotic upper bound on D_α in terms of an envelope U for the (2α‑1)‑order Rényi divergence and the supremum of Fisher information along the path.

The key insight is that, as n_syn → ∞, the divergence D_α(ZV, ZW) can be reduced to the divergence between the Gram matrices VᵀV and WᵀW (Propositions 4.1 and 4.2). Consequently, the privacy loss of releasing infinitely many synthetic records is equivalent to releasing the Gram matrix of the noisy parameters. Without additional assumptions, this yields no amplification (Proposition 4.3). However, imposing the bounded‑parameter condition (‖v‖_F, ‖w‖_F ≤ C) limits the spectral norm of VᵀV – WᵀW, which in turn bounds the Fisher information by C²·d/k. Plugging this bound into the global inequality gives the unified upper bound

 D_α(ZV, ZW) ≤ C²·(d/k) + C²·D_α(V, W),

which is independent of n_syn and, for sufficiently small C, strictly smaller than the naïve post‑processing bound D_α(V, W). The effective amplification factor η = (C²·(d/k) + C²·D_α(V, W)) / D_α(V, W) falls below one whenever C² < D_α(V, W) / (d/k + D_α(V, W)), establishing privacy amplification even when an unlimited amount of synthetic data is released.
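Plugging illustrative numbers into the unified bound makes the amplification concrete. All values below are assumptions chosen for demonstration, not figures from the paper.

```python
# Unified bound from the text:
#   D_alpha(ZV, ZW) <= C^2 * (d/k) + C^2 * D_alpha(V, W)
# Assumed values for illustration only.
C, d, k = 0.2, 100, 10
D_alpha_VW = 5.0                       # privacy loss of releasing V directly

bound = C**2 * (d / k) + C**2 * D_alpha_VW
eta = bound / D_alpha_VW               # effective amplification factor

print(bound, eta)  # approx. 0.6 and 0.12, so eta < 1: amplification holds
```

With these numbers the synthetic-data release costs only about 12% of the privacy budget consumed by releasing the noisy parameters directly.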

The authors complement the theory with numerical experiments. They compute empirical Rényi divergences for various dimensions (d = 50–200), output sizes (k = 5–20), norm bounds C, and sensitivities Δ, showing rapid convergence of D_α(ZV, ZW) to its infinite‑sample limit already for modest n_syn (≈10³–10⁴). The empirical values consistently lie below the theoretical bound, confirming its validity and suggesting it is reasonably tight in practice.

In summary, the paper makes three substantive advances: (1) it removes the restrictive d ≫ n_syn condition of prior work, proving amplification under a realistic bounded‑parameter assumption; (2) it introduces a novel Fisher‑information‑based technique for globally bounding Rényi divergences, which may be of independent interest to statisticians; (3) it demonstrates that, from a privacy perspective, releasing arbitrarily many synthetic records is essentially equivalent to releasing a single Gram matrix, a much smaller summary of the model. The results suggest that many practical synthetic‑data pipelines—especially those employing ℓ₂‑regularized linear models—can safely publish large synthetic datasets without sacrificing differential‑privacy guarantees. Future work could extend these ideas to non‑linear deep generative models, explore tighter bounds for specific architectures, and integrate the bounded‑parameter requirement into training algorithms via regularization or clipping.
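The bounded-parameter requirement mentioned above can be enforced during training by projecting the weight matrix onto a Frobenius-norm ball, in the spirit of gradient clipping. The helper below is a minimal sketch of such a projection; the function name and values are ours, not from the paper.

```python
import numpy as np

def clip_frobenius(w: np.ndarray, C: float) -> np.ndarray:
    """Project a weight matrix onto the Frobenius ball of radius C.

    A simple way to enforce the bounded-parameter assumption
    ||w||_F <= C during training (illustrative helper, not the
    paper's algorithm).
    """
    norm = np.linalg.norm(w)           # Frobenius norm by default
    return w if norm <= C else w * (C / norm)

w = np.full((4, 3), 2.0)               # Frobenius norm = sqrt(48), about 6.93
clipped = clip_frobenius(w, C=1.0)
print(np.linalg.norm(clipped))         # about 1.0
```

Applying this projection after each optimizer step keeps the trained generator inside the regime where the paper's amplification bound applies.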

