When Gradient Clipping Becomes a Control Mechanism for Differential Privacy in Deep Learning
Privacy-preserving training on sensitive data commonly relies on differentially private stochastic optimization with gradient clipping and Gaussian noise. The clipping threshold is a critical control knob: if set too small, systematic over-clipping induces optimization bias; if too large, injected noise dominates updates and degrades accuracy. Existing adaptive clipping methods often depend on per-example gradient norm statistics, adding computational overhead and introducing sensitivity to datasets and architectures. We propose a control-driven clipping strategy that adapts the threshold using a lightweight, weight-only spectral diagnostic computed from model parameters. At periodic probe steps, the method analyzes a designated weight matrix via spectral decomposition and estimates a heavy-tailed spectral indicator associated with training stability. This indicator is smoothed over time and fed into a bounded feedback controller that updates the clipping threshold multiplicatively in the log domain. Because the controller uses only parameters produced during privacy-preserving training, the resulting threshold updates are post-processing and do not increase privacy loss beyond that of the underlying DP optimizer under standard composition accounting.
💡 Research Summary
The paper tackles a central challenge in differentially private deep learning: the choice of the gradient‑clipping norm C in DP‑SGD. A too‑small C truncates useful gradient information (clipping bias), while a too‑large C inflates the Gaussian noise scale (σ C) and harms utility. Existing adaptive clipping schemes rely on per‑example gradient norm statistics, which add considerable computational overhead and are sensitive to dataset and architecture variations.
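To make the trade-off concrete, here is a minimal NumPy sketch of one standard DP-SGD step, showing how the clipping norm C enters twice: it bounds each per-example gradient and it scales the Gaussian noise. Function and argument names are illustrative, not from the paper.

```python
import numpy as np

def dp_sgd_update(per_example_grads, C, sigma, lr, params, rng):
    """One DP-SGD step (illustrative sketch).

    per_example_grads: shape (batch, dim), one flattened gradient per example.
    C: clipping norm; sigma: noise multiplier; lr: learning rate.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Clip: rescale each per-example gradient so its norm is at most C.
    # Too-small C distorts large gradients (clipping bias).
    clipped = per_example_grads * np.minimum(1.0, C / np.maximum(norms, 1e-12))
    mean_grad = clipped.mean(axis=0)
    # Noise std is proportional to C: too-large C drowns the signal.
    noise = rng.normal(0.0, sigma * C / per_example_grads.shape[0],
                       size=mean_grad.shape)
    return params - lr * (mean_grad + noise)
```

Note how setting `sigma` to zero isolates the clipping bias, while growing `C` with fixed `sigma` inflates the noise term linearly.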
The authors propose a fundamentally different approach: treat the clipping norm as a control variable and adjust it using a lightweight, weight‑only diagnostic derived from the model’s parameters. Specifically, at periodic probe steps they select a fixed weight matrix (e.g., the final fully‑connected layer), compute its singular value decomposition, and fit a power‑law tail to the largest eigenvalues of the layer’s correlation matrix (i.e., the squared singular values). The fitted tail exponent ζₜ serves as a scalar proxy for the “spectral health” of the network: heavy‑tailed spectra (ζ around 4) are empirically associated with well‑trained, stable models, whereas deviations indicate either excessive clipping (a flattened spectrum) or excessive noise (an over‑steepened tail).
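The spectral diagnostic above can be sketched as follows. This version uses a Hill-style estimator on the top-k eigenvalues; the choice of estimator and of k are assumptions for illustration, since the summary does not specify the paper's exact fitting procedure.

```python
import numpy as np

def tail_exponent(W, k=10):
    """Estimate a power-law tail exponent from the top-k eigenvalues
    of the correlation matrix W^T W (the squared singular values of W).

    Illustrative sketch: uses a Hill estimator on the k largest
    eigenvalues; the paper's exact fitting recipe may differ.
    """
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    lam = np.sort(s ** 2)[::-1]             # eigenvalues, descending
    top = lam[:k]
    # Hill estimator of the survival exponent, shifted by 1 to give
    # the density exponent zeta (density ~ lambda^(-zeta) in the tail).
    return 1.0 + (k - 1) / np.sum(np.log(top[:-1] / top[-1]))
```

Because this reads only the weight matrix, it adds one truncated SVD per probe step and never touches per-example gradients.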
The tail exponent is smoothed with an exponential moving average (EMA) and compared to a predefined “health zone” centered at ζ★ (default 4) with radius r (default 2). The centered error eₜ = ζ̂ₜ − ζ★ is normalized and passed through a saturation function to obtain a bounded signal ϕₜ ∈ [−1, 1], which drives a multiplicative update of the clipping threshold in the log domain, so C changes by at most a bounded factor at each probe step.
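One controller step of the smoothing-and-update loop might look like the sketch below. The use of tanh as the saturation function, the step size eta, and the sign convention (a flattened spectrum, ζ below ζ★, signals over-clipping and raises C) are assumptions for illustration.

```python
import numpy as np

def update_clip_threshold(C, zeta_ema, zeta_new,
                          zeta_star=4.0, r=2.0, beta=0.9, eta=0.1):
    """One controller step (illustrative sketch of the described scheme).

    zeta_star, r: center and radius of the spectral "health zone"
    (defaults from the summary); beta, eta: EMA decay and step size
    (assumed values).
    """
    zeta_ema = beta * zeta_ema + (1 - beta) * zeta_new  # EMA smoothing
    e = (zeta_ema - zeta_star) / r                      # normalized error
    phi = np.tanh(e)                                    # bounded signal in (-1, 1)
    # Sign convention (assumed): zeta below zeta_star = flattened
    # spectrum = over-clipping, so the threshold C is increased.
    C = C * np.exp(-eta * phi)                          # log-domain multiplicative step
    return C, zeta_ema
```

Since every quantity here is computed from released model parameters, the update is post-processing and consumes no additional privacy budget, matching the claim in the abstract.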