REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models

Yong Zou¹, Haoran Li², Fanxiao Li¹, Shenyang Wei¹, Yunyun Dong¹, Li Tang¹, Wei Zhou¹, and Renyang Liu*³
¹Yunnan University  ²Northeastern University  ³National University of Singapore

Abstract—Recent progress in image generation models (IGMs) enables high-fidelity content creation, but it also amplifies risks, including the reproduction of copyrighted material and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness of IGMU under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves the attack success rate while achieving stronger semantic alignment and higher efficiency than competing baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks. Our code is available at: https://github.com/Imfatnoily/REFORGE.

Index Terms—Red-teaming, Image generation model unlearning, AI safety, Stable Diffusion model, AIGC

I. INTRODUCTION

Image generation models (IGMs) have witnessed remarkable progress, revolutionizing applications in artistic creation, virtual reality, and medical imaging. Prominent models, such as DALL·E [1], Imagen [2], and Stable Diffusion [3], have facilitated the widespread adoption of text-to-image synthesis. However, these capabilities have introduced significant safety and compliance concerns [4], including harmful, misleading, or copyright-infringing generations that can cause tangible societal threats.

A key source of these risks is the reliance of modern IGMs on large-scale internet-scraped datasets [5], which inevitably contain copyrighted works, NSFW imagery, etc. Such undesirable information can be internalized during training and later re-emerge at deployment time, enabling misuse even when the service interface appears benign.

Although dataset filtering followed by retraining can mitigate these issues, it is often computationally prohibitive for large-scale diffusion models [6]. Consequently, prior work mainly follows two directions: (1) external filters that screen prompts or generated images [1], [7], [8], and (2) Machine Unlearning (MU), which removes specific concepts by directly modifying model parameters [9]–[15].

*Corresponding author (Email: ryliu@nus.edu.sg). This work is supported by the Yunnan Research Project (Grant Nos. 202503AG380006, 202401AT070474, 202501AU070059, 202403AP140021), the National Natural Science Foundation of China (Grant Nos. 62562061, 62502422, and 62462067), and the Yunnan Provincial Department of Education Science Research Project (Grant Nos. 2025J0006, 2024J0010, and 2025J0007).
Filtering-based defenses suffer from inherent trade-offs: pre-filtering can over-block benign prompts, whereas post-filtering increases inference latency and wastes computation on discarded generations. In contrast, Image Generation Model Unlearning (IGMU) integrates the removal objective into the model itself, offering a more direct and potentially efficient mitigation.

Researchers have developed diverse IGMU techniques, including inference-time constraints [16], [17], weight editing [9], [10], adversarial training [13], and structural pruning [15]. However, the robustness of unlearned models against adversarial inputs remains insufficiently understood. Recent studies show that erased concepts can be recovered via carefully optimized prompts. White-box attacks, such as P4D [18] and UnlearnDiffAtk [19], exploit access to model internals to construct effective adversarial prompts. In the black-box setting, existing red-teaming methods largely focus on manipulating text prompts [20]–[24], while the vulnerabilities introduced by image inputs are less explored. Although recent work [25] investigates image-modality red-teaming, it relies on white-box access. To our knowledge, black-box red-teaming via IGMU image inputs remains unstudied.

To bridge this gap, we study the robustness of unlearned IGMs under realistic text-to-image generation interfaces where attackers can provide both text and image inputs. We propose REFORGE, a novel black-box red-teaming framework that generates adversarial image prompts to bypass IGMU mechanisms. As illustrated in Fig. 1, REFORGE combines adversarial stroke-based image prompts with the original text prompt to induce the re-emergence of erased concepts while preserving overall semantic consistency. Crucially, REFORGE does not require access to target-model parameters or gradients, making it applicable to real-world, closed-source services.

Fig. 1. Given that an unlearned IGM has undergone a concept-unlearning procedure (e.g., removal of the Van Gogh style), our adversarial image prompt P_adv combined with the text prompt P_text can still bypass the unlearning mechanism, causing the erased style to re-emerge in the generated image I*.

We validate REFORGE through extensive experiments across three representative unlearning categories and multiple concept-erasure techniques. The experimental results demonstrate that REFORGE achieves superior performance in terms of attack success rate, semantic similarity, and attack efficiency compared to representative baselines. Our key contributions are as follows:

• We propose REFORGE, a black-box red-teaming framework that targets the image modality for IGMU and reveals the fragility of current unlearning mechanisms under realistic multi-modal attacks.
• We introduce a masking strategy that leverages cross-attention maps to allocate perturbations, balancing attack effectiveness with visual imperceptibility.
• We conduct extensive evaluations across unlearning tasks and methods, showing that REFORGE consistently outperforms prior baselines in effectiveness, semantic preservation, and efficiency.
II. RELATED WORK

A. Image Generation Model Unlearning

As IGMs improve in fidelity, they also amplify safety and compliance risks by enabling the synthesis of undesirable content. Image generation model unlearning (IGMU) aims to remove specific concepts from a pretrained generator while preserving general generation quality. Existing IGMU methods span inference-time suppression, parameter editing, adversarial optimization, and structural pruning. Specifically, SLD [16] imposes suppression constraints at inference time, whereas ESD [9] performs selective fine-tuning over model layers. UCE [10], MACE [12], and RECE [26] use closed-form updates for efficient weight modification: UCE targets cross-attention parameters, MACE integrates LoRA modules for multi-concept erasure, and RECE iteratively eliminates derived embeddings with regularization to preserve generation quality. FMN [11] achieves unlearning through attention redirection. AdvUnlearn [13] leverages adversarial examples to enhance forgetting robustness, and DoCo [14] adopts a GAN-like framework with adversarial optimization. ConceptPrune [15] removes concepts by pruning critical neurons in the FFN layers of the denoiser.

B. Red Teaming for Image Generation Model Unlearning

Despite progress in IGMU, recent studies have shown that erased concepts can be recovered under adversarial inputs. In white-box settings, P4D [18] uses an auxiliary diffusion model to optimize adversarial prompts, and UnlearnDiffAtk [19] improves efficiency with an additional reference image. Beyond gradient-based methods, SneakyPrompt [20] adopts reinforcement learning for prompt optimization, Ring-A-Bell [21] applies genetic algorithms to align prompts with concept vectors, and JPA [22] relaxes discrete tokens into continuous variables for efficient optimization. For black-box red-teaming, DiffZOO [23] performs zeroth-order optimization, and JailFuzzer [24] employs large language models as fuzzing agents.

While these efforts have substantially advanced red-teaming for unlearning, most existing frameworks operate primarily in the text modality and do not explicitly account for the image-input channel supported by many IGMs. Although RECALL [25] extends red-teaming to the image modality, it relies on white-box assumptions, leaving black-box evaluation via image inputs largely unexplored. To fill this gap, we propose REFORGE, a black-box robustness assessment framework for multi-modal scenarios, demonstrating that erased concepts can be recovered by combining unmodified textual prompts with adversarial stroke-based image prompts. REFORGE does not require access to the target model's parameters or gradients, making it applicable to real-world scenarios.

III. METHODOLOGY

A. Threat Model

We consider a black-box setting in which the adversary has no access to the target model's parameters or gradients. The adversary can query the unlearned model M_u through its standard text-image interface by providing an input image and a text prompt and observing the generated output. For optimization, the adversary uses a public IGM as a proxy to compute cross-attention maps and optimization gradients.

B. Overview

We propose REFORGE, a novel black-box multi-modal red-teaming framework for evaluating the robustness of image generation model unlearning (IGMU). REFORGE constructs an adversarial example P_adv by combining (i) a stroke-based initialization derived from a concept reference image P_ref and (ii) a text prompt P_text that specifies the erased concept. As shown in Fig. 2, REFORGE consists of four stages. Stage I (Initialization): convert P_ref into a stroke-based image P*_adv that preserves global composition while removing fine details. Stage II (Mask Construction): aggregate cross-attention maps from the proxy model conditioned on (P*_adv, P_text) to obtain a spatial mask M ∈ [0, 1] that emphasizes concept-relevant regions. Stage III (Latent-Alignment Optimization): optimize the adversarial latent z_adv in the proxy VAE space by aligning it to the reference latent z_ref, while applying the mask M to constrain the update. Stage IV (Red-Teaming Evaluation): query M_u with (P_adv, P_text) and assess whether the erased concept re-emerges in the output. The pseudo-code of the REFORGE pipeline is shown in Alg. 1.

Fig. 2. Overview of the REFORGE framework. Sensitive parts are covered.

C. Initialization of Adversarial Sample

Given a reference image P_ref, REFORGE initializes the adversarial image prompt by converting P_ref into a stroke-based image P*_adv. This initialization preserves global layout and coarse color cues, which helps maintain consistency with the textual prompt P_text while suppressing fine-grained details. Concretely, for a 512 × 512 input, we apply a large-kernel median filter (kernel size 47) to remove high-frequency details, perform color quantization with k = 6, and render region-based strokes to obtain P*_adv.
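To make Stage I concrete, below is a minimal sketch of the stroke-based initialization, assuming OpenCV for both the median filter and the k-means color quantization. The paper does not detail its region-based stroke renderer, so here the flat color regions produced by quantization stand in for rendered strokes; the function name stroke_initialize is hypothetical.

```python
import cv2
import numpy as np

def stroke_initialize(ref_path: str, kernel_size: int = 47, k: int = 6) -> np.ndarray:
    """Approximate Stage I: median filtering + color quantization.

    The paper's region-based stroke rendering is approximated here by the
    flat color regions that k-means quantization produces (an assumption).
    """
    img = cv2.imread(ref_path)                  # BGR, uint8
    img = cv2.resize(img, (512, 512))           # the paper uses 512x512 inputs

    # 1) Large-kernel median filter removes high-frequency detail.
    blurred = cv2.medianBlur(img, kernel_size)  # kernel size must be odd

    # 2) Color quantization with k = 6 via k-means over pixel colors.
    pixels = blurred.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 3,
                                    cv2.KMEANS_PP_CENTERS)
    quantized = centers[labels.flatten()].reshape(img.shape).astype(np.uint8)
    return quantized                            # stroke-based initialization P*_adv
```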
D. Mask Construction via Cross-Attention

Uniformly allocating perturbations over the entire spatial domain leads to an inherent trade-off between perceptibility and attack effectiveness. To focus the optimization on concept-relevant regions, REFORGE derives a spatial mask from cross-attention maps of the proxy diffusion model conditioned on (P*_adv, P_text). Cross-attention highlights spatial locations that are strongly associated with the concept tokens, and we use this signal to weight the update magnitude during optimization. We aggregate the cross-attention activations at denoising timestep t:

    \tilde{A} = \mathrm{Aggregate}(A_t),    (1)

where Aggregate(·) selects and aggregates attention layers. We then normalize \tilde{A} to obtain a mask M ∈ [0, 1]:

    M = \tilde{A} / \| \tilde{A} \|_\infty.    (2)

When M is derived as a spatial map, it is broadcast along the channel dimension to match the shape of the latent representation.
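As a minimal sketch of Eqs. (1)–(2), assume the per-layer cross-attention probabilities have already been captured (e.g., with forward hooks on the proxy UNet's cross-attention modules). The layer-selection policy inside Aggregate(·) is not specified, so a plain mean over the captured layers stands in for it; build_mask, attn_maps, and concept_token_ids are hypothetical names.

```python
import torch
import torch.nn.functional as F

def build_mask(attn_maps: list, concept_token_ids: list,
               latent_hw: tuple = (64, 64)) -> torch.Tensor:
    """Aggregate cross-attention into a spatial mask M in [0, 1] (Eqs. 1-2).

    Each element of `attn_maps` is assumed to hold one layer's attention
    probabilities with shape (heads, H*W, num_text_tokens).
    """
    per_layer = []
    for a in attn_maps:
        side = int(a.shape[1] ** 0.5)
        # Keep the columns of the concept tokens; average heads and tokens.
        m = a[:, :, concept_token_ids].mean(dim=(0, 2))      # (H*W,)
        m = m.reshape(1, 1, side, side)
        # Layers at different depths attend at different resolutions; unify.
        m = F.interpolate(m, size=latent_hw, mode="bilinear",
                          align_corners=False)
        per_layer.append(m)
    agg = torch.stack(per_layer).mean(dim=0)                 # Eq. (1)
    mask = agg / (agg.amax() + 1e-8)                         # Eq. (2)
    return mask        # (1, 1, 64, 64); broadcast over latent channels
```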
E. Latent-Alignment Optimization

We construct the adversarial example by iteratively optimizing in the latent space of the proxy diffusion model. Given a reference image P_ref that exhibits the erased concept and an initialized stroke-based image P*_adv, we align their latent representations so that the optimized adversarial latent moves closer to the concept-related features of P_ref. We obtain the latents of both images via the VAE encoder E_I of the auxiliary diffusion model:

    z_{ref} = \mathcal{E}_I(P_{ref}),    (3)
    z_{adv} = \mathcal{E}_I(P^*_{adv}),    (4)

where z_ref and z_adv are the reference latent and the initialized adversarial latent, respectively. We iteratively optimize the adversarial latent z_adv so that it approaches the reference latent z_ref, thereby transferring concept-related features from P_ref to the adversarial example. We define the alignment objective as the mean-squared error (MSE) between the two latents and minimize it via gradient descent over K iterations:

    \mathcal{L}_{align}(z_{adv}, z_{ref}) = \frac{1}{n} \| z_{ref} - z_{adv} \|_2^2,    (5)

    P_{adv}^{(k)} = P_{adv}^{(k-1)} - \eta \cdot \big( \nabla_{P_{adv}} \mathcal{L}_{align}(z_{adv}^{(k-1)}, z_{ref}) \odot M \big),    (6)

where k indexes the optimization iteration, η is the step size, and M is the cross-attention mask defined in Eq. (2). This masked update concentrates the perturbation budget on the concept-relevant regions indicated by M, while limiting unnecessary modifications to other regions. After K iterations, we obtain the adversarial example P_adv = P_adv^(K).

F. Red-Teaming Evaluation

With the adversarial example fully constructed, we evaluate the robustness of the unlearned diffusion model M_u by querying it with the multi-modal input (P_adv, P_text) through its standard generation process. The generated output is then examined to determine whether the erased concept re-emerges under the adversarial image prompt.

Algorithm 1: REFORGE
Input: Reference image P_ref, textual prompt P_text, auxiliary model SD, iterations K, step size η, unlearned model M_u
Output: Red-teaming generated image I*
 1  Initialize P*_adv ← Stroke-simulation(P_ref)
 2  Attention map A_t ← SD(P*_adv, P_text, t)
 3  Mask M ← Ψ(A_t)                        // aggregate and normalize mask
 4  P_adv ← P*_adv; z_ref ← E_I(P_ref)
 5  for k = 1 to K do
 6      z_adv ← E_I(P_adv)
 7      L_align ← (1/n) ||z_ref − z_adv||²₂  // alignment loss
 8      g ← ∇_{P_adv} L_align
 9      P_adv ← P_adv − η · (g ⊙ M)
10  I* ← M_u(P_adv, P_text)                // IGMU generation
11  return I*
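The loop below sketches lines 4–9 of Alg. 1 in PyTorch, assuming the proxy VAE is a diffusers AutoencoderKL and that images are (1, 3, 512, 512) tensors in [-1, 1]. The paper leaves the resampling of the latent-resolution mask to image resolution implicit, so the bilinear upsampling here is an assumption, as are all function names.

```python
import torch
import torch.nn.functional as F

def latent_alignment_attack(vae, p_ref, p_adv0, mask, K=100, eta=0.01):
    """Masked latent-alignment optimization (Eqs. 5-6; Alg. 1, lines 4-9)."""
    with torch.no_grad():
        z_ref = vae.encode(p_ref).latent_dist.mean           # Eq. (3)
    # Upsample the latent-space mask to image resolution (an assumption).
    img_mask = F.interpolate(mask, size=p_adv0.shape[-2:],
                             mode="bilinear", align_corners=False)

    p_adv = p_adv0.clone().requires_grad_(True)
    for _ in range(K):
        z_adv = vae.encode(p_adv).latent_dist.mean           # Eq. (4), differentiable
        loss = torch.mean((z_ref - z_adv) ** 2)              # MSE alignment, Eq. (5)
        grad, = torch.autograd.grad(loss, p_adv)
        with torch.no_grad():
            # Masked step concentrates updates on concept regions, Eq. (6).
            p_adv -= eta * grad * img_mask
    return p_adv.detach()                                    # final P_adv
```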
IV. EXPERIMENTS

To comprehensively evaluate the effectiveness and generalizability of REFORGE, we conduct experiments across three representative unlearning tasks, spanning local abstract concepts (Nudity), local object concepts (Parachute), and global abstract concepts (Van Gogh-Style).

A. Settings

Datasets. We adopt the prompt sets used in UnlearnDiffAtk [19] for the Object-Parachute and Van Gogh-Style concepts, and SneakyPrompt [20] for the Nudity concept. For each prompt, we generate a reference image using a third-party model (e.g., Flux-Uncensored-V2 [27] and Stable Diffusion v2.1 [28]) and automatically verify whether the target concept is present; prompts whose reference images do not exhibit the target concept are discarded. After filtering, we retain 150, 45, and 48 prompt-reference pairs for Nudity, Object-Parachute, and Van Gogh-Style, respectively.

IGMU Methods. We evaluate representative unlearning methods covering weight editing, adversarial optimization, and structural pruning¹: ESD [9], UCE [10], MACE [12], AdvUnlearn [13], DoCo [14], and ConceptPrune [15].

Baselines. To align with the black-box threat model, we compare against several representative red-teaming methods that operate without access to the target unlearned models: SneakyPrompt² [20], Ring-A-Bell [21], and MMA³ [29].

Evaluation Metrics. We evaluate the effectiveness of red-teaming attacks using the following metrics. Attack Success Rate (ASR): the fraction of adversarial queries that elicit the erased concept. For Nudity, we detect the target concept using NudeNet [30] with a confidence threshold of 0.45. For Object-Parachute, we use a ResNet-50 [31] classifier and adopt its Top-1 prediction. For Van Gogh-Style, we use the style classifier provided by EvalIGMU [32] and adopt its Top-1 prediction. CLIP Score: the cosine similarity between image and text embeddings from CLIP [33]. Attack Time: the average runtime per adversarial example.

Implementation Details. All experiments use 100 sampling steps based on Stable Diffusion v1.4 [34] with a fixed seed to ensure reproducibility. To reflect realistic attacker constraints, we limit each method to a query budget of 10 generation calls to the unlearned model. All experiments are conducted on a single NVIDIA RTX 4090 GPU using standard PyTorch.

¹The unlearned weights are primarily obtained from AdvUnlearn [13] and the official implementations of the respective methods, or reproduced using the authors' open-source code with default settings.
²We modify the original reinforcement learning objective to treat an attack as successful once the generated content contains the target concept, rather than using a negative reward.
³We only include the text-based variants that remain applicable in the black-box setting.
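For reference, the sketch below computes the CLIP Score as the image-text cosine similarity using the transformers CLIP implementation. The specific CLIP variant (ViT-B/32 here) and the ×100 scaling, which matches the magnitude of the scores reported in Table I, are assumptions, as the paper does not state them.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# The CLIP variant is not stated in the paper; ViT-B/32 is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (scaled x100)."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item() * 100.0
```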
B. Attack Performance

Table I summarizes the attack success rate (ASR) achieved by different methods across the three concepts. Overall, REFORGE attains the best average performance. We further highlight three observations: (1) Several IGMU methods remain vulnerable even to unoptimized text prompts. In particular, for Van Gogh-Style, the unlearned model exhibits high sensitivity to the raw prompt, yielding the second-highest ASR without any optimization. (2) REFORGE consistently outperforms strong baselines, including MMA [29] and Ring-A-Bell [21], supporting the effectiveness of focusing perturbations on semantically relevant image regions. (3) Adversarially enhanced unlearning methods (e.g., AdvUnlearn) reduce the absolute ASR of all attack strategies. Nevertheless, REFORGE retains a clear margin over competing methods under this stronger defense. Overall, these results suggest that current IGMU techniques remain vulnerable to multi-modal adversarial inputs.

TABLE I. Comparison of ASR (%) and CLIP Score across various unlearning tasks. Each cell reports ASR (↑) / CLIP Score (↑).

Nudity:
Method       | ESD         | UCE         | AdvUnlearn  | DoCo        | MACE        | ConceptPrune | Average
Text         | 32.00/24.62 | 54.66/25.22 |  4.66/19.89 | 76.00/26.06 | 24.00/19.34 |  94.66/26.16 | 47.66/23.55
SneakyPrompt | 21.33/21.90 | 32.66/22.68 |  1.33/21.30 | 52.66/23.86 | 13.33/18.82 |  76.00/23.59 | 32.88/22.02
MMA          | 32.66/24.39 | 65.33/23.49 |  1.33/19.40 | 77.33/24.67 | 20.00/17.90 |  96.66/25.11 | 48.88/22.49
Ring-A-Bell  | 78.66/18.86 | 65.33/18.96 |  2.33/10.50 | 93.33/19.12 | 11.33/12.14 | 100.00/19.58 | 58.55/16.52
REFORGE      | 65.33/25.83 | 69.33/26.15 | 62.66/22.33 | 89.33/24.46 | 14.66/17.95 |  98.00/26.46 | 66.55/24.19

Object-Parachute:
Method       | ESD         | UCE         | AdvUnlearn  | DoCo        | MACE        | ConceptPrune | Average
Text         | 15.55/24.12 |  6.66/24.71 |  4.44/26.66 | 46.66/26.27 |  6.66/22.19 |  95.55/27.67 | 29.25/25.27
SneakyPrompt |  0.00/ 0.00 |  4.44/22.41 |  0.00/ 0.00 | 24.44/23.68 |  6.66/19.89 |  68.88/24.73 | 17.40/15.12
MMA          | 44.44/24.28 | 13.33/24.20 |  6.66/21.89 | 64.44/26.00 |  6.66/23.96 | 100.00/27.27 | 39.25/24.60
Ring-A-Bell  | 26.66/21.08 | 20.00/21.53 |  2.22/17.84 | 64.44/25.60 | 17.77/18.87 | 100.00/24.34 | 38.51/21.54
REFORGE      | 93.33/26.93 | 71.11/25.93 | 57.77/24.16 | 91.11/27.75 | 11.11/20.45 |  97.77/27.33 | 70.36/25.43

Van Gogh-Style:
Method       | ESD         | UCE          | AdvUnlearn  | DoCo        | MACE        | ConceptPrune | Average
Text         | 58.33/26.91 | 100.00/30.35 | 14.58/19.66 | 70.83/28.08 | 83.33/28.12 | 100.00/28.84 | 71.17/26.99
SneakyPrompt | 14.58/18.12 |  62.50/25.61 |  8.33/20.72 | 27.08/24.54 | 37.50/23.17 |  64.58/24.42 | 35.76/22.76
MMA          | 62.50/26.18 | 100.00/29.34 | 12.50/20.61 | 66.66/27.12 | 81.25/28.45 | 100.00/27.50 | 70.17/26.53
Ring-A-Bell  | 56.25/22.34 | 100.00/25.39 | 10.41/19.73 | 27.08/24.17 | 81.25/24.60 | 100.00/24.65 | 62.49/23.48
REFORGE      | 64.58/27.21 |  97.91/28.67 | 20.83/23.44 | 83.33/26.86 | 83.33/28.04 | 100.00/28.29 | 74.99/27.08

TABLE II. Generated images under different attacks (qualitative examples on the Nudity and Object-Parachute tasks against ESD and UCE, comparing SneakyPrompt, MMA, Ring-A-Bell, and REFORGE; example prompts: "... perfect face, perfect and proportional body, black skin, sensual, in the style ..." and "Parachute opening in mid-air.").

C. Semantic Alignment

Table I also reports the semantic alignment between the generated images and their corresponding textual prompts across the three representative unlearning tasks, measured by CLIP Score. REFORGE achieves the highest CLIP Score, indicating improved text-image consistency. We attribute this to the stroke-based initialization, which helps preserve global composition and coarse tonal structure during optimization. Although Ring-A-Bell [21] attains a relatively high ASR, its CLIP Score is lower, suggesting degraded semantic alignment under text-only optimization. These results suggest that text-based attacks tend to compromise text-image consistency, whereas our image-modality-driven REFORGE better preserves semantic fidelity.

D. Attack Efficiency

We measure the average runtime required to generate a single complete adversarial example for each task. The experimental results show that existing black-box attacks incur substantial computational cost, with SneakyPrompt at ∼290 s, MMA at ∼1000 s, and Ring-A-Bell at ∼320 s. In comparison, REFORGE requires only ∼35 s while achieving comparable or better attack performance. We attribute the efficiency gains to the stroke-based initialization and the spatially weighted optimization, which reduce the optimization complexity and thus accelerate adversarial example generation.

E. Ablation Study

We conduct ablation studies to assess the generalizability of REFORGE and to analyze the impact of reference-image selection, cross-attention-guided masking across different layers and timesteps, and the choice of alignment loss.
1) Selection of Reference Images: To assess the sensitivity of REFORGE to the choice of P_ref, we use four randomly chosen reference images (R1–R4) and one prompt-aligned reference image (RP) for each task. As shown in Fig. 3a, the attack success rate remains high across different choices of P_ref, demonstrating that REFORGE can extract concept-relevant information from any reference image that contains the target content, without requiring a strict one-to-one correspondence between P_ref and P_text.

Fig. 3. Ablation of key parameters: ASR (%) vs. (a) reference images, (b) timesteps, (c) layers, and (d) losses.

2) Layer Selection for Cross-Attention: We study how the depth of cross-attention layers affects perturbation allocation and attack performance by evaluating five selection strategies. Stable Diffusion v1.4 [34] contains 16 cross-attention layers, which we group into four depth ranges: Shallow (0–3), Lower-Mid (4–7), Upper-Mid (8–11), and Deep (12–15), along with an "Optimal" selection identified through preliminary analysis; the grouping is sketched below. As shown in Fig. 3c, different depth ranges yield different attack success rates, indicating that cross-attention at different depths provides distinct semantic and spatial cues. Overall, the "Optimal" selection consistently outperforms the fixed-depth configurations.
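For intuition, the snippet below shows one way the 16 cross-attention layers of the Stable Diffusion v1.4 UNet could be enumerated and grouped into these depth ranges, assuming the diffusers implementation in which cross-attention is the attn2 module of each transformer block; mapping the enumeration order to the paper's layer indices is an assumption.

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet")

# In diffusers, cross-attention is the `attn2` module of each transformer
# block; the SD v1.4 UNet contains 16 such modules.
xattn = [name for name, _ in unet.named_modules() if name.endswith("attn2")]
assert len(xattn) == 16

depth_groups = {
    "Shallow":   xattn[0:4],     # layers 0-3
    "Lower-Mid": xattn[4:8],     # layers 4-7
    "Upper-Mid": xattn[8:12],    # layers 8-11
    "Deep":      xattn[12:16],   # layers 12-15
}
```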
3) Timestep Selection for Cross-Attention Mask: We examine how the sampling timestep used to extract cross-attention affects mask quality and attack performance by evaluating timesteps T ∈ {800, 400, 100}. As shown in Fig. 3b, the optimal timestep is task-dependent. For Nudity, late-stage attention at T = 100 achieves the highest ASR, consistent with capturing fine details. For Object-Parachute, early-stage attention at T = 800 yields the best performance. For Van Gogh-Style, mid-stage attention at T = 400 provides the strongest trade-off between semantic relevance and spatial specificity. These findings indicate that different semantic concepts are synthesized at distinct stages of the reverse diffusion process.

4) Loss Function Selection for Perturbation Optimization: We compare cosine loss, L2 loss, and MSE loss as objectives for perturbation optimization. As shown in Fig. 3d, MSE consistently yields the highest ASR across tasks, suggesting more stable optimization in our setting. In contrast, the cosine and L2 losses consistently underperform, indicating that MSE is the most effective objective among those considered.

V. CONCLUSION

In this paper, we propose REFORGE, a novel black-box red-teaming framework that evaluates the robustness of IGMU methods via the image modality. By combining stroke-based initialization with cross-attention-guided masking, REFORGE constructs adversarial image prompts that elicit erased concepts while preserving text-image semantic alignment. Extensive experiments on representative unlearning tasks and defenses demonstrate that REFORGE consistently outperforms existing baselines in recovering erased styles, objects, and sensitive concepts. These results reveal that current IGMU methods remain vulnerable to multi-modal adversarial inputs, indicating the urgent need for developing robustness-aware unlearning and safety alignment under black-box threat models.

REFERENCES

[1] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen, "Hierarchical text-conditional image generation with CLIP latents," arXiv, 2022.
[2] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, et al., "Photorealistic text-to-image diffusion models with deep language understanding," in NeurIPS, 2022.
[3] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, "High-resolution image synthesis with latent diffusion models," in CVPR, 2022, pp. 10674–10685.
[4] Momina Masood, Marriam Nawaz, Khalid Mahmood Malik, Ali Javed, Aun Irtaza, and Hafiz Malik, "Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward," Appl. Intell., 2023.
[5] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, et al., "LAION-5B: An open large-scale dataset for training next generation image-text models," in NeurIPS, 2022.
[6] Stability AI, "Stable Diffusion 2.0 Release," https://stability.ai/news/stable-diffusion-v2-release, 2022. Accessed: 2025-09-10.
[7] CompVis, "Stable Diffusion Safety Checker," https://huggingface.co/CompVis/stable-diffusion-safety-checker, 2023. Accessed: 2025-09-10.
[8] Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr, "Red-teaming the Stable Diffusion safety filter," arXiv, 2022.
[9] Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau, "Erasing concepts from diffusion models," in ICCV, 2023.
[10] Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzynska, and David Bau, "Unified concept editing in diffusion models," in WACV, 2024, pp. 5099–5108.
[11] Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi, "Forget-me-not: Learning to forget in text-to-image diffusion models," in CVPRW, 2024, pp. 1755–1764.
[12] Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong, "MACE: Mass concept erasure in diffusion models," in CVPR, 2024, pp. 6430–6440.
[13] Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, and Sijia Liu, "Defensive unlearning with adversarial training for robust concept erasure in diffusion models," in NeurIPS, 2024.
[14] Yongliang Wu, Shiji Zhou, Mingzhuo Yang, Lianzhe Wang, et al., "Unlearning concepts in diffusion model via concept domain correction and concept preserving gradient," in AAAI, 2025, pp. 8496–8504.
[15] Ruchika Chavhan, Da Li, and Timothy M. Hospedales, "ConceptPrune: Concept editing in diffusion models via skilled neuron pruning," in ICLR, 2025.
[16] Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting, "Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models," in CVPR, 2023, pp. 22522–22531.
[17] Renyang Liu, Kangjie Chen, Han Qiu, Jie Zhang, Kwok-Yan Lam, Tianwei Zhang, and See-Kiong Ng, "SafeRedir: Prompt embedding redirection for robust unlearning in image generation models," arXiv, 2026.
[18] Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, and Wei-Chen Chiu, "Prompting4Debugging: Red-teaming text-to-image diffusion models by finding problematic prompts," in ICML, 2024, pp. 8468–8486.
[19] Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, et al., "To generate or not? Safety-driven unlearned diffusion models are still easy to generate unsafe images ... for now," in ECCV, 2024, pp. 385–403.
[20] Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao, "SneakyPrompt: Jailbreaking text-to-image generative models," in SP, 2024, pp. 897–912.
[21] Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, et al., "Ring-A-Bell! How reliable are concept removal methods for diffusion models?," in ICLR, 2024.
[22] Jiachen Ma, Yijiang Li, Zhiqing Xiao, Anda Cao, et al., "Jailbreaking prompt attack: A controllable adversarial attack against diffusion models," in NAACL, 2025, pp. 3141–3157.
[23] Pucheng Dang, Xing Hu, Dong Li, Rui Zhang, Qi Guo, and Kaidi Xu, "DiffZOO: A purely query-based black-box attack for red-teaming text-to-image generative model via zeroth order optimization," in NAACL, 2025, pp. 17–31.
[24] Yingkai Dong, Xiangtao Meng, Ning Yu, Zheng Li, and Shanqing Guo, "Fuzz-testing meets LLM-based agents: An automated and efficient framework for jailbreaking text-to-image generation models," in SP, 2025, pp. 373–391.
[25] Renyang Liu, Guanlin Li, Tianwei Zhang, and See-Kiong Ng, "Image can bring your memory back: A novel multi-modal guided attack against image generation model unlearning," in ICLR, 2026.
[26] Chao Gong, Kai Chen, Zhipeng Wei, Jingjing Chen, and Yu-Gang Jiang, "Reliable and efficient concept erasure of text-to-image diffusion models," in ECCV, 2024, pp. 73–88.
[27] EnhanceAI, "Flux-Uncensored-V2," https://huggingface.co/enhanceaiteam/Flux-Uncensored-V2, 2024. Accessed: 2025-09-24.
[28] Stability AI, "Stable Diffusion v2.1," https://huggingface.co/stabilityai/stable-diffusion-2-1, 2022. Accessed: 2025-09-26.
[29] Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu, "MMA-Diffusion: Multimodal attack on diffusion models," in CVPR, 2024, pp. 7737–7746.
[30] Praneeth Bedapudi, "NudeNet: Lightweight nudity detection," https://github.com/notAI-tech/NudeNet, 2023. Accessed: 2025-09-18.
[31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[32] Renyang Liu, Wenjie Feng, Tianwei Zhang, Wei Zhou, Xueqi Cheng, and See-Kiong Ng, "Rethinking machine unlearning in image generation models," in CCS, 2025, pp. 993–1007.
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, "Learning transferable visual models from natural language supervision," in ICML, 2021, pp. 8748–8763.
[34] CompVis, "stable-diffusion-v1-4," https://huggingface.co/CompVis/stable-diffusion-v1-4, 2024. Accessed: 2025-09-11.
