The Unseen Threat: Residual Knowledge in Machine Unlearning under Perturbed Samples
Machine unlearning offers a practical alternative to full model retraining by approximately removing the influence of specific user data. While existing methods certify unlearning via statistical indistinguishability from retrained models, these guarantees do not naturally extend to model outputs when inputs are adversarially perturbed. In particular, slight perturbations of forget samples may still be correctly recognized by the unlearned model, even when a retrained model fails to do so, revealing a novel privacy risk: information about the forget samples may persist in their local neighborhood. In this work, we formalize this vulnerability as residual knowledge and show that it is inevitable in high-dimensional settings. To mitigate this risk, we propose a fine-tuning strategy, named RURK, that penalizes the model’s ability to re-recognize perturbed forget samples. Experiments on vision benchmarks with deep neural networks demonstrate that residual knowledge is prevalent across existing unlearning methods and that our approach effectively suppresses it.
💡 Research Summary
Machine unlearning has emerged as a practical alternative to costly full‑model retraining when a user requests that their data be forgotten. Existing approaches typically certify that the “unlearned” model is statistically indistinguishable from a model retrained without the forget set, using notions such as (ε, δ)‑indistinguishability or Rényi‑unlearning. These guarantees, however, are limited to the original data points and do not address how the models behave on inputs that are close but not identical to the forget samples.
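The (ε, δ)-indistinguishability certificate mentioned above can be stated as a pair of bounds on event probabilities under the unlearned and retrained models. As a minimal sketch (the helper name and the numeric probabilities are illustrative, not from the paper), the check for a single measurable event looks like:

```python
import math

def indistinguishable(p, q, eps, delta):
    """Check (eps, delta)-indistinguishability for one event S:
    P[M_unlearned in S] <= e^eps * P[M_retrained in S] + delta,
    and the symmetric direction. p and q are the two event probabilities."""
    return (p <= math.exp(eps) * q + delta) and (q <= math.exp(eps) * p + delta)

# Toy example: two close event probabilities satisfy a (0.1, 0.01) certificate
print(indistinguishable(0.30, 0.28, eps=0.1, delta=0.01))  # True
```

A full certificate requires the bound to hold for every event simultaneously; the point of the paper is that even then, nothing is guaranteed about inputs *near* the forget samples.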
The paper uncovers a previously unstudied privacy risk: Residual Knowledge. Even when a model satisfies standard unlearning certificates, a slight adversarial perturbation of a forget sample may still be correctly classified by the unlearned model while a freshly retrained model fails. This indicates that traces of the forgotten data persist in the local neighborhood of the input space. Empirically, on CIFAR‑10 with an ℓ₂ perturbation of magnitude ≈0.03, more than 7 % of forget samples exhibit this behavior across a range of state‑of‑the‑art unlearning methods.
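The residual-knowledge rate described above can be estimated empirically: for each perturbed forget sample, count the cases where the unlearned model is still correct while the retrained model is not. A hedged sketch with synthetic predictions (the function name and the toy arrays are assumptions; the paper's exact estimator may differ):

```python
import numpy as np

def residual_knowledge_rate(preds_unlearned, preds_retrained, labels):
    """Fraction of perturbed forget samples the unlearned model still
    classifies correctly while the retrained model misclassifies them."""
    u_correct = preds_unlearned == labels
    r_wrong = preds_retrained != labels
    return float(np.mean(u_correct & r_wrong))

# Toy example with synthetic predictions for four perturbed forget samples
labels = np.array([0, 1, 2, 1])
pu = np.array([0, 1, 2, 0])   # unlearned model's predictions
pr = np.array([0, 2, 0, 0])   # retrained model's predictions
print(residual_knowledge_rate(pu, pr, labels))  # 0.5
```

On the paper's CIFAR-10 setup, this rate exceeds 7% under an ℓ₂ perturbation of magnitude ≈0.03.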
The authors formalize the phenomenon in two steps. First, they prove Proposition 1, showing that if two models are (ε, δ)‑indistinguishable, the distribution of adversarial examples generated against them can become up to a factor of 2δ/(1 − e^{−ε}) less indistinguishable; in other words, post‑processing with an adversarial attack degrades the original statistical guarantee. Second, using geometric probability on the unit sphere S^{d−1}, they show that disagreement between the unlearned and retrained models, captured by an indicator k(x) that equals 1 when the unlearned model classifies a perturbed forget sample x correctly while the retrained model does not, persists as the dimension d grows, which is why residual knowledge is unavoidable in high‑dimensional settings.
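The degradation factor in Proposition 1 is easy to evaluate numerically; a small sketch (written as summarized above, so the exact role of the factor should be checked against the paper's statement):

```python
import math

def degraded_delta(eps, delta):
    """Worst-case delta after adversarial post-processing, per the
    summary of Proposition 1: delta can grow to 2*delta / (1 - e^{-eps})."""
    return 2 * delta / (1 - math.exp(-eps))

# For a tight certificate (eps = 0.1, delta = 0.01), the bound inflates by ~21x
print(round(degraded_delta(0.1, 0.01), 4))  # 0.2102
```

Note that the factor blows up as ε → 0: the *stronger* the original indistinguishability certificate, the more room adversarial post-processing has to weaken it in relative terms.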