To Shuffle or not to Shuffle: Auditing DP-SGD with Shuffling
The Differentially Private Stochastic Gradient Descent (DP-SGD) algorithm supports the training of machine learning (ML) models with formal Differential Privacy (DP) guarantees. Traditionally, DP-SGD processes training data in batches using Poisson subsampling to select each batch at every iteration. More recently, shuffling has become a common alternative due to its better compatibility and lower computational overhead. However, computing tight theoretical DP guarantees under shuffling remains an open problem. As a result, models trained with shuffling are often evaluated as if Poisson subsampling were used, which might result in incorrect privacy guarantees. This raises a compelling research question: can we verify whether there are gaps between the theoretical DP guarantees reported by state-of-the-art models using shuffling and their actual leakage? To do so, we define novel DP-auditing procedures to analyze DP-SGD with shuffling and measure their ability to tightly estimate privacy leakage vis-à-vis batch sizes, privacy budgets, and threat models. Overall, we demonstrate that the privacy guarantees of DP models trained using this approach are considerably overestimated (by up to 4 times). However, we also find that the gap between the theoretical Poisson DP guarantees and the actual privacy leakage from shuffling is not uniform across all parameter settings and threat models. Finally, we study two common variations of the shuffling procedure that result in even further privacy leakage (up to 10 times). Ultimately, our work highlights the risk of using shuffling instead of Poisson subsampling in the absence of rigorous analysis methods.
💡 Research Summary
The paper investigates a critical gap between the theoretical differential privacy (DP) guarantees that are commonly reported for models trained with Differentially Private Stochastic Gradient Descent (DP‑SGD) using shuffling, and the actual privacy leakage observed in practice. Traditionally, DP‑SGD selects each mini‑batch via Poisson subsampling, a scheme for which tight privacy amplification theorems exist. Because Poisson subsampling requires random access and variable batch sizes, many modern implementations replace it with a simple shuffle: the dataset is randomly permuted once and then split into fixed‑size batches. While this greatly simplifies engineering and improves runtime, the privacy analysis for shuffling remains largely unresolved. Consequently, practitioners often report the Poisson‑based ε as if it applied to shuffled training, potentially overstating privacy protection.
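The contrast between the two batching schemes can be sketched in a few lines of Python (an illustrative sketch, not the paper's or any library's code; function names are ours):

```python
import random

def poisson_batches(n, sample_rate, steps, rng):
    """Poisson subsampling: each record is included in each batch
    independently with probability `sample_rate`, so batch sizes vary
    and a record may appear in several batches (or none)."""
    return [[i for i in range(n) if rng.random() < sample_rate]
            for _ in range(steps)]

def shuffled_batches(n, batch_size, rng):
    """Shuffling: permute the dataset once, then split it into
    fixed-size batches; every record appears exactly once per epoch."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, n, batch_size)]

rng = random.Random(0)
n, batch_size = 1000, 100
poisson = poisson_batches(n, batch_size / n, n // batch_size, rng)
shuffled = shuffled_batches(n, batch_size, rng)

# Poisson batch sizes fluctuate around the expected size of 100...
print(sorted({len(b) for b in poisson}))
# ...while shuffled batches are exactly batch_size and partition the data.
assert all(len(b) == batch_size for b in shuffled)
assert sorted(i for b in shuffled for i in b) == list(range(n))
```

The fixed batch sizes and single pass over the data are precisely what makes shuffling attractive for hardware-efficient pipelines, but the privacy amplification theorems used in accounting assume the Poisson behavior on the left.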
To address this, the authors develop the first systematic auditing methodology for DP‑SGD with shuffling (DP‑SGD(Shuffle)). Their approach builds on the concept of DP auditing, which treats privacy assessment as a distinguishing game between an adversary and a challenger. The adversary attempts to infer whether a target record was included in the training set based on the mechanism’s output (or intermediate states). By repeatedly playing this game and measuring the adversary’s success, one obtains an empirical privacy leakage estimate ε_emp that can be compared to the claimed theoretical ε.
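Concretely, the adversary's true- and false-positive rates in the distinguishing game translate into an empirical ε via the standard (ε, δ)-DP constraint TPR ≤ e^ε · FPR + δ. A minimal sketch of that conversion (our illustration; practical audits additionally place confidence intervals, e.g. Clopper-Pearson, around the measured rates before converting):

```python
import math

def empirical_epsilon(tpr, fpr, delta=0.0):
    """Lower bound on epsilon implied by an attack achieving the given
    true/false positive rates, from TPR <= e^eps * FPR + delta and the
    symmetric constraint on the complementary rates."""
    eps1 = math.log((tpr - delta) / fpr) if tpr > delta and fpr > 0 else 0.0
    eps2 = math.log((1 - fpr - delta) / (1 - tpr)) if (1 - fpr) > delta and tpr < 1 else 0.0
    return max(eps1, eps2, 0.0)

# An adversary that wins 90% of "member" rounds while false-alarming in
# only 10% of "non-member" rounds implies eps_emp >= log(9) ~= 2.197.
print(round(empirical_epsilon(0.9, 0.1), 3))

# A random-guessing adversary (TPR == FPR) implies no leakage.
assert empirical_epsilon(0.5, 0.5) == 0.0
```

If the ε_emp obtained this way exceeds the theoretical ε claimed for the mechanism, the claimed guarantee is refuted.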
A key technical contribution is the introduction of the Batched Gaussian Mechanism (BGM), a simplified, non‑adaptive variant of DP‑SGD that isolates the effect of shuffling on privacy. Using likelihood‑ratio based distinguishing functions, the authors can tightly audit BGM and then extend the methodology to full‑scale DP‑SGD(Shuffle). They evaluate a range of threat models: from weak adversaries that only see the final model output, to strong adversaries that also observe per‑iteration gradients or internal states.
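The setup can be simulated end to end with a toy mechanism: shuffle scalar "gradients", release noisy fixed-size batch sums, and let an adversary play the distinguishing game. This is our simplified illustration of the idea only; it is not the paper's exact BGM construction, and the crude thresholding distinguisher below stands in for the paper's likelihood-ratio tests:

```python
import random

def batched_gaussian_mechanism(records, batch_size, sigma, rng):
    """Toy non-adaptive mechanism: shuffle scalar 'gradients', split into
    fixed-size batches, and release each batch sum plus Gaussian noise."""
    data = list(records)
    rng.shuffle(data)
    return [sum(data[i:i + batch_size]) + rng.gauss(0.0, sigma)
            for i in range(0, len(data), batch_size)]

def play_game(include_target, rng, n=100, batch_size=10, sigma=0.1):
    """One round of the distinguishing game: the challenger runs the
    mechanism on a dataset with or without a target record whose
    'gradient' is 1.0 (all other records contribute 0.0)."""
    records = [0.0] * (n - 1) + ([1.0] if include_target else [0.0])
    outputs = batched_gaussian_mechanism(records, batch_size, sigma, rng)
    # Distinguisher: claim "target present" if any batch sum looks shifted.
    return max(outputs) > 0.5

rng = random.Random(0)
trials = 200
tpr = sum(play_game(True, rng) for _ in range(trials)) / trials
fpr = sum(play_game(False, rng) for _ in range(trials)) / trials
assert tpr > fpr  # the adversary beats random guessing
```

With this low noise level the game is nearly always won, which is the point of the abstraction: it isolates how much the shuffled batching itself hides (or fails to hide) the target, without the confounding dynamics of full model training.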
Empirical results span several benchmark datasets (MNLI, QNLI, SST‑2, Persona‑Chat, Places‑365, CIFAR‑10) and state‑of‑the‑art models (BERT‑based NLP classifiers, Vision Transformers). The findings are striking: the empirical estimate ε_emp can be up to four times larger than the theoretical ε derived under Poisson subsampling. For example, a model trained on MNLI reports a theoretical ε = 3, yet the audit yields ε_emp ≈ 12. The gap is not uniform; it widens with larger batch sizes and stronger adversaries, and narrows when batches are small or the adversary’s view is limited.
Beyond standard shuffling, the paper audits two common variants that appear in public code repositories: partial shuffling (only a subset of the data is permuted) and batch‑then‑shuffle (shuffle within each batch, then shuffle batches). Both variants exacerbate privacy loss, with partial shuffling inflating ε by roughly 2.6× and batch‑then‑shuffle by up to 10× for very low privacy budgets (theoretical ε = 0.1 leads to ε_emp = 0.29 and ε_emp = 1.00 respectively).
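The two variants, as described above, might be sketched as follows (our interpretation of those descriptions; identifiers are hypothetical):

```python
import random

def partial_shuffle_batches(n, batch_size, shuffle_frac, rng):
    """Partial shuffling: only a random fraction of the positions is
    permuted before splitting the data into fixed-size batches."""
    idx = list(range(n))
    pos = rng.sample(range(n), int(n * shuffle_frac))  # positions to permute
    vals = [idx[p] for p in pos]
    rng.shuffle(vals)
    for p, v in zip(pos, vals):
        idx[p] = v
    return [idx[i:i + batch_size] for i in range(0, n, batch_size)]

def batch_then_shuffle(n, batch_size, rng):
    """Batch-then-shuffle: split the data in its original order, shuffle
    within each batch, then shuffle the order of the batches."""
    batches = [list(range(i, min(i + batch_size, n)))
               for i in range(0, n, batch_size)]
    for b in batches:
        rng.shuffle(b)
    rng.shuffle(batches)
    return batches

# Under batch-then-shuffle, each record always ends up with the same
# batch co-members: every batch is a contiguous chunk of the dataset.
for b in batch_then_shuffle(100, 10, random.Random(0)):
    assert sorted(b) == list(range(min(b), min(b) + len(b)))
```

Intuitively, both variants inject less randomness than a full shuffle: partial shuffling leaves most records in predictable positions, and batch-then-shuffle fixes each record's batch composition entirely, which is consistent with the larger leakage the audits measure.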
The authors discuss the practical implications. Many released “private” models claim DP guarantees based on Poisson analysis while actually training with shuffling; such claims may be overly optimistic, potentially misleading users and regulators. Hyper‑parameter choices tuned for DP‑SGD(Shuffle) may not be optimal for the Poisson setting, affecting the privacy‑utility trade‑off. The work calls for either rigorous shuffling‑specific privacy amplification theorems or a return to Poisson subsampling with more efficient implementations.
In summary, this study provides (1) a novel, likelihood‑ratio based auditing framework for DP‑SGD with shuffling, (2) extensive empirical evidence that shuffling can dramatically weaken privacy guarantees compared to Poisson subsampling, (3) an analysis of how batch size, privacy budget, and adversary strength modulate the gap, and (4) a warning that common shuffling variants can cause even larger leaks. The findings underscore the necessity of accurate privacy auditing and the development of tighter theoretical analyses before shuffling can be safely adopted in privacy‑sensitive machine‑learning pipelines.