Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often fragile: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness primarily reformulate unlearning objectives around explicit assumptions about the sources of vulnerability. In this work, we take a different perspective by investigating the role of the optimizer, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the ‘grade’ of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (e.g., gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a hybrid optimizer that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.
💡 Research Summary
The paper investigates a previously overlooked factor influencing the robustness of large language model (LLM) unlearning: the choice of optimizer. While prior work has focused on redesigning the unlearning objective (e.g., min‑max formulations, SAM, meta‑learning) to defend against post‑unlearning perturbations such as weight quantization or relearning attacks, this study asks how the optimizer’s “grade” – the amount of information it exploits – affects the stability of the forgetting process. The authors define three grades: zeroth‑order (gradient‑free), first‑order (gradient‑based), and second‑order (Hessian‑based). They also consider compressed first‑order variants such as sign‑based optimizers, which effectively lower the grade within the same order.
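The distinction between grades can be made concrete with toy update rules. The sketch below (a minimal illustration on an assumed quadratic loss, not the paper's implementation; step sizes and the smoothing parameter `mu` are illustrative choices) contrasts a full-precision first-order step, a compressed sign-based step, and a zeroth-order step that estimates the gradient from two loss queries along a random direction:

```python
import numpy as np

def loss(w):
    # Toy quadratic stand-in for an unlearning objective.
    return 0.5 * np.sum(w ** 2)

def grad(w):
    # Analytic gradient of the toy loss.
    return w

def first_order_step(w, lr=0.1):
    # First-order (gradient-based): full-precision gradient descent.
    return w - lr * grad(w)

def sign_step(w, lr=0.1):
    # Compressed first-order: keep only the sign of each gradient
    # coordinate -- the "1-bit" idea behind sign-based optimizers.
    return w - lr * np.sign(grad(w))

def zeroth_order_step(w, lr=0.05, mu=1e-3, rng=np.random.default_rng(0)):
    # Zeroth-order (gradient-free): finite-difference estimate of the
    # gradient from two loss evaluations along a random direction u.
    u = rng.standard_normal(w.shape)
    g_est = (loss(w + mu * u) - loss(w - mu * u)) / (2 * mu) * u
    return w - lr * g_est
```

All three rules drive the toy loss down, but the sign-based and zeroth-order updates are visibly coarser: they discard gradient magnitude or estimate it through noisy queries, which is exactly the "downgrading" the paper studies.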
Through systematic experiments on two fine‑tuned LLMs (ICLM‑7B and LLaMA2‑7B) and the MUSE benchmark (which measures verbatim memorization and knowledge memorization on both forget and retain sets), the authors compare several optimizers: Sophia (second‑order), Adam (first‑order), 8‑bit Adam, and 1‑bit Adam (gradient‑sign). When the unlearned models are subsequently quantized to 4‑bit or subjected to a short relearning fine‑tune on a subset of the forgotten data, the lower‑grade optimizers (zeroth‑order methods and 1‑bit Adam) retain far stronger forgetting performance. In contrast, higher‑grade optimizers, while converging faster, lead to models that are more vulnerable: quantization quickly restores memorized content, and relearning attacks can undo a large portion of the forgetting.
The authors attribute this phenomenon to the geometry of the loss landscape. Low‑grade optimizers produce noisier, less precise updates, causing the optimization trajectory to settle in broader, flatter basins that are “hard‑to‑disturb.” Small perturbations to the weights therefore cause only minor changes in loss, making the model’s forgetting more resilient. Moreover, zeroth‑order optimization can be interpreted as a form of randomized smoothing: by estimating gradients through random finite‑difference queries, the method implicitly smooths the objective over a neighborhood in parameter space, endowing the model with inherent robustness to L2‑bounded weight changes.
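The randomized-smoothing view can be checked numerically: averaging a loss over Gaussian parameter perturbations yields a smoothed surrogate whose value changes far less under small weight shifts. The snippet below is a toy demonstration of that effect (the sharp one-dimensional basin and the noise scale `mu` are assumptions for illustration, not quantities from the paper):

```python
import numpy as np

def sharp_loss(w):
    # A narrow, sharp basin: small weight perturbations change the loss a lot.
    return 1.0 - np.exp(-50.0 * w ** 2)

def smoothed_loss(w, mu=0.2, n=10_000, seed=0):
    # Monte Carlo estimate of E_u[ sharp_loss(w + mu * u) ], u ~ N(0, 1).
    # This is the objective a zeroth-order method implicitly optimizes.
    u = np.random.default_rng(seed).standard_normal(n)
    return float(np.mean(sharp_loss(w + mu * u)))
```

Perturbing `w` by 0.1 raises the sharp loss substantially (from 0 to about 0.39), while the smoothed loss moves only slightly, mirroring the claim that zeroth-order updates confer resilience to bounded weight changes.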
Building on these insights, the paper proposes a hybrid optimizer that interleaves first‑order Adam steps with occasional zeroth‑order updates (implemented with 1‑bit sign gradients to keep computational cost low). This FO‑ZO hybrid retains the fast convergence of first‑order methods while injecting the smoothing effect of zeroth‑order steps. Empirically, the hybrid optimizer matches or slightly improves unlearning effectiveness (as measured by VerbMem and KnowMem) and dramatically improves robustness: after 4‑bit quantization the degradation in forgetting is minimal, and relearning attacks recover far less information compared to pure Adam or Sophia.
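The interleaving scheme can be sketched as a simple loop that mostly takes first-order steps and periodically injects a 1-bit zeroth-order step. This is a hedged illustration on a caller-supplied toy loss: the interleaving ratio, step sizes, and the plain gradient step (standing in for Adam) are assumptions, not the paper's exact schedule.

```python
import numpy as np

def hybrid_unlearn(w, loss, grad, steps=100, zo_every=5,
                   lr_fo=0.05, lr_zo=0.05, mu=1e-3, seed=0):
    """Sketch of an FO-ZO hybrid: first-order steps with a zeroth-order
    (finite-difference) step every `zo_every` iterations. The ZO step
    keeps only gradient signs, echoing the 1-bit implementation."""
    rng = np.random.default_rng(seed)
    for t in range(steps):
        if t % zo_every == 0:
            # Zeroth-order step: two loss queries along a random direction,
            # compressed to signs to keep the update cheap and coarse.
            u = rng.standard_normal(w.shape)
            g = (loss(w + mu * u) - loss(w - mu * u)) / (2 * mu) * u
            w = w - lr_zo * np.sign(g)
        else:
            # First-order step (Adam in the paper; plain gradient here).
            w = w - lr_fo * grad(w)
    return w
```

The first-order steps supply fast convergence, while the periodic zeroth-order sign steps act as the smoothing injections described above.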
Key contributions are: (1) the first systematic study linking optimizer grade to unlearning robustness; (2) a theoretical rationale connecting low‑grade optimizers to flat‑basin convergence and randomized smoothing; (3) a practical FO‑ZO hybrid algorithm that achieves both high unlearning quality and strong resistance to weight perturbations; and (4) extensive validation across multiple benchmarks (MUSE, WMDP, TOFU) and unlearning methods (GradDiff, NPO, DPO). The work suggests that, contrary to intuition, “downgrading” the optimizer can be an effective strategy for building more robust unlearning pipelines, and it opens avenues for further research on adaptive grade scheduling, broader model families, and real‑world deployment scenarios where quantization and continual fine‑tuning are commonplace.