Selective Fine-Tuning for Targeted and Robust Concept Unlearning
Text-guided diffusion models are used by millions of users but can easily be exploited to produce harmful content. Concept-unlearning methods aim to reduce a model's likelihood of generating harmful content. Traditionally, this has been tackled at the level of individual concepts, with only a handful of recent works considering more realistic concept combinations. However, state-of-the-art methods depend on full fine-tuning, which is computationally expensive. Concept-localisation methods can facilitate selective fine-tuning, but existing techniques are static, resulting in suboptimal utility. To tackle these challenges, we propose TRUST (Targeted Robust Selective fine-Tuning), a novel approach that dynamically estimates target-concept neurons and unlearns them through selective fine-tuning, empowered by a Hessian-based regularization. We show experimentally, against a number of state-of-the-art baselines, that TRUST is robust against adversarial prompts, preserves generation quality to a significant degree, and is also significantly faster than the state of the art. Our method achieves unlearning of not only individual concepts but also combinations of concepts and conditional concepts, without any concept-specific regularization.
💡 Research Summary
The paper addresses the pressing safety issue of text‑to‑image diffusion models that can be exploited to generate harmful content. Existing concept‑unlearning approaches either fine‑tune the entire model or rely on static neuron masks, which are computationally expensive and often degrade the quality of benign generations, especially when dealing with compositional or conditional concepts. To overcome these limitations, the authors propose TRUST (Targeted Robust Selective fine‑Tuning), a novel framework that dynamically identifies and edits the neurons responsible for encoding a target concept while preserving the rest of the model’s capabilities.
TRUST consists of three main components. First, a dynamic concept‑neuron identification step computes the gradient of an alignment loss between the input prompt and the generated image. Neurons in the cross‑attention layers with the largest gradient magnitudes are selected as the current “concept neurons.” This selection is performed at every fine‑tuning batch, reflecting the fact that neuron importance shifts as the model adapts. Second, the authors introduce a Concept Influence Penalty (CIP), a Hessian‑based regularization term that penalizes changes to the identified neurons in proportion to their second‑order influence on the overall loss. By encouraging sparsity in the set of active concept neurons, CIP allows aggressive removal of the target concept while limiting collateral damage to unrelated concepts. Third, a mask‑guided fine‑tuning procedure updates only the dynamically selected neurons, leaving the rest of the network untouched, which dramatically reduces compute and memory requirements.
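The three components above can be illustrated with a minimal NumPy sketch. All names, shapes, and hyperparameters here are hypothetical stand-ins (the paper operates on cross-attention layers of a diffusion model, not on a flat weight vector), and the diagonal Hessian is only a placeholder for whatever second-order estimate the authors actually use:

```python
import numpy as np

def select_concept_neurons(grads, k):
    """Dynamic concept-neuron identification (sketch): pick the k neurons
    with the largest magnitude of the alignment-loss gradient."""
    idx = np.argsort(np.abs(grads))[::-1][:k]
    mask = np.zeros_like(grads, dtype=bool)
    mask[idx] = True
    return mask

def cip_penalty(delta_w, hess_diag, mask, lam=0.1):
    """Concept Influence Penalty (sketch): penalize changes to the selected
    neurons in proportion to their diagonal second-order influence."""
    return lam * np.sum(hess_diag[mask] * delta_w[mask] ** 2)

def masked_update(weights, grads, mask, lr=0.01):
    """Mask-guided fine-tuning step: only the selected neurons move."""
    return weights - np.where(mask, lr * grads, 0.0)

# Toy values standing in for one layer's weights, gradient, and Hessian diag.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
g = rng.normal(size=8)          # gradient of the alignment loss (hypothetical)
h = np.abs(rng.normal(size=8))  # diagonal Hessian estimate (hypothetical)

mask = select_concept_neurons(g, k=3)   # re-selected every batch in TRUST
w_new = masked_update(w, g, mask)
penalty = cip_penalty(w_new - w, h, mask)

assert mask.sum() == 3
assert np.allclose(w_new[~mask], w[~mask])  # unselected neurons untouched
```

In the actual method the mask is recomputed at every fine-tuning batch, so the set of edited neurons tracks how the model's internal representation of the concept shifts during unlearning.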
The method is evaluated on Stable Diffusion 1.5 across three scenarios: (i) single harmful concepts (e.g., “gun”), (ii) harmful concept combinations (e.g., “child drinking beer”), and (iii) conditional harmful concepts (e.g., “person wearing mask at night”). Baselines include SalUn, CoGFD, and a stronger variant SalUn++ that recomputes masks each step. Metrics cover Attack Success Rate (ASR), Unlearning Accuracy (UA), CLIP‑Score, FID, and Retaining Accuracy (RA) for benign prompts.
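For concreteness, ASR and UA are commonly operationalized via a concept detector run on the unlearned model's outputs; the exact definitions in the paper may differ, so the following is an illustrative sketch only:

```python
def attack_success_rate(detections):
    """ASR: share of adversarial prompts for which the detector still
    finds the supposedly erased concept (lower is better)."""
    return sum(detections) / len(detections)

def unlearning_accuracy(detections):
    """UA: share of target-concept prompts where the concept is absent
    from the generation (higher is better)."""
    return 1 - sum(detections) / len(detections)

# Hypothetical detector outputs (True = concept present in the image).
adv_detections = [False, True, False, False]
tgt_detections = [False, False, False, True]

assert attack_success_rate(adv_detections) == 0.25
assert unlearning_accuracy(tgt_detections) == 0.75
```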
Results show that TRUST achieves state‑of‑the‑art effectiveness: ASR drops below 8 % and UA exceeds 95 % across all scenarios, outperforming baselines by a large margin, particularly on compositional concepts where prior methods struggle. Utility preservation is excellent, with ΔFID ≈ 0.02 and ΔCLIP‑Score < 0.01, indicating negligible loss in image fidelity. TRUST also converges in roughly half the number of fine‑tuning steps (≈ 12 vs. 25‑30 for baselines) and reduces memory consumption by about 40 % thanks to the selective mask.
Ablation studies confirm the importance of each component. Using a static mask leads to rapid performance degradation after a few steps, while removing the CIP term causes noticeable drops in RA, demonstrating that the Hessian‑based regularizer is crucial for preventing catastrophic interference. The authors acknowledge that exact Hessian computation can be memory‑intensive for very large models and suggest future work on cheaper second‑order approximations and extending the approach beyond cross‑attention layers.
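One common family of cheap second-order approximations the authors' future-work suggestion could draw on is the empirical Fisher, which estimates the Hessian diagonal from squared per-example gradients at O(n) memory instead of the O(n²) exact Hessian. This is not the paper's method, just one standard option:

```python
import numpy as np

def diag_hessian_estimate(per_example_grads):
    """Empirical-Fisher-style diagonal Hessian approximation: the mean of
    squared per-example gradients, one value per parameter."""
    g = np.asarray(per_example_grads)
    return (g ** 2).mean(axis=0)

# Two hypothetical per-example gradients over a 2-parameter layer.
grads = np.array([[0.5, -1.0],
                  [1.5,  0.0]])
h = diag_hessian_estimate(grads)

assert h.shape == (2,)                 # one entry per parameter, not n x n
assert np.allclose(h, [1.25, 0.5])
```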
In summary, TRUST introduces a dynamic, Hessian‑regularized selective fine‑tuning paradigm that efficiently erases targeted harmful concepts—including complex combinations and conditional variants—while maintaining high generation quality and computational efficiency. This work represents a significant step toward practical, real‑time safety mechanisms for large‑scale generative AI systems.