IdentityGuard: Context-Aware Restriction and Provenance for Personalized Synthesis
Authors: Lingyun Zhang, Yu Xie, Ping Chen
IDENTITYGUARD: CONTEXT-AWARE RESTRICTION AND PROVENANCE FOR PERSONALIZED SYNTHESIS

Lingyun Zhang (1,2,*), Yu Xie (3,*), Ping Chen (2,3,†)
(1) School of Computer Science and Technology, Fudan University, Shanghai, China
(2) Institute of Big Data, Fudan University, Shanghai, China
(3) Purple Mountain Laboratories, Nanjing, China

ABSTRACT

The nature of personalized text-to-image models poses a unique safety challenge that generic, context-blind methods are ill-equipped to handle. Such global filters create a dilemma: to prevent misuse, they are forced to damage the model's broader utility by erasing concepts entirely, causing unacceptable collateral damage. Our work presents a more precisely targeted approach, built on the principle that security should be as context-aware as the threat itself, intrinsically bound to the personalized concept. We present IDENTITYGUARD, which realizes this principle through a conditional restriction that blocks harmful content only when combined with the personalized identity, and a concept-specific watermark for precise traceability. Experiments show our approach prevents misuse while preserving the model's utility and enabling robust traceability. By moving beyond blunt, global filters, our work demonstrates a more effective and responsible path toward AI safety.

Index Terms: text-to-image models, personalization, generation restriction, image provenance

1. INTRODUCTION

The very power of personalized text-to-image models [1, 2, 3, 4, 5, 6], namely their ability to generate content featuring a specific person or object, is also their primary vulnerability. This capability creates a highly targeted threat, enabling the misuse of models to generate harmful or deceptive images tied to real-world identities. This "personalized threat" exposes a fundamental flaw in the prevailing paradigm of generic, context-blind safety measures, which are ill-equipped to handle such a nuanced challenge.
Current safety approaches [7, 8, 9, 10, 11] force developers into an impossible dilemma. Global filters like Safe Latent Diffusion [7] act as blunt instruments, lacking the context to distinguish between malicious and benign prompts. Global erasure methods [12, 13, 9, 14], on the other hand, are a scorched-earth tactic: to prevent the misuse of a personalized concept, they are forced to completely remove any associated general concepts from the model. This approach inevitably causes unacceptable collateral damage, e.g., preventing the generation of a simple campfire and of all prison scenes in order to block misuse of the concepts "fire" and "jail". This forces a false choice between security and the model's fundamental utility. A similar dilemma exists for provenance: the aggressive fine-tuning process of personalization is notoriously brittle, often destroying post-hoc watermarks [15, 16, 17, 18].

* Equal contribution. † Corresponding author.

Fig. 1. The motivation for IDENTITYGUARD. Generic, context-blind security methods (middle column) force an unacceptable trade-off: they either destroy the user's identity when blocking a threat, or destroy the model's utility on benign prompts. Our method, by binding safeguards directly to the personalized concept, is the only one to succeed in both scenarios: it defends against misuse while preserving the model's performance on general prompts.
While some integrated methods exist [19, 17, 20], they apply watermarks indiscriminately, failing to provide a precise signature that links an image specifically to the personalized concept used to create it.

We argue that the solution to this dilemma is not a slightly better global filter, but a fundamentally different principle: security should be as context-aware as the threat itself, intrinsically bound to the personalized concepts it aims to protect. We introduce IDENTITYGUARD, the first framework to realize this principle. By binding safeguards directly to the user's identity, it implements a conditional restriction that blocks harmful content only when combined with the personalized identity, and embeds a concept-specific watermark for precise, robust traceability.

Fig. 2. The IDENTITYGUARD fine-tuning framework. Our method trains a single Denoising U-Net using two conditional paths. (Top path) For benign personalized prompts, our Concept-Bound Provenance is activated, embedding a watermark. (Bottom path) For malicious prompts, our novel Semantic Redirection Loss is activated, redirecting the output toward a safe, identity-preserving result by aligning the noise predictions of the malicious and benign prompts. Here, c* is the embedding for the personalized concept, c_p is the embedding for the prohibited concept, and sg(·) is the stop-gradient operator.

Our contributions are:
1. We define the "personalized threat" and argue for a paradigm shift from context-blind global filters to context-aware, concept-bound security.
2. We propose a conditional restriction mechanism that prevents misuse without the collateral damage inherent in previous methods.
3. We design a robust, concept-specific watermarking scheme, integrated so as to survive personalization and provide precise traceability where prior work fails.

2. PROPOSED METHOD

IDENTITYGUARD realizes our principle of Concept-Bound Security by integrating two conditional, context-aware mechanisms directly into the personalization fine-tuning loop, as illustrated in Figure 2. The model learns from two parallel training paths acting on the Denoising U-Net. The total objective augments the standard DreamBooth [1] reconstruction loss, L_DB, with our two novel security losses:

    L_total = L_DB + λ_r L_CIP + λ_w L_WM    (1)

where λ_r and λ_w are scalar weights. Unlike global penalties, our security losses are activated selectively, based on the meaning of the prompt.

2.1. Conditional Restriction via Semantic Redirection

Generic erasure methods are a heavy-handed approach that causes collateral damage. We instead teach the model a more nuanced, conditional behavior: "When you see the personalized concept c* combined with a prohibited term, ignore the prohibited term and generate only c*." We implement this behavior, which we call Semantic Redirection, through a novel training objective we term the Conditional Identity-Preserving (CIP) Loss. Let c* be the text embedding for the personalized concept and c_p be the embedding for a prohibited concept. The CIP loss is activated only for malicious prompts and is formulated as:

    L_CIP = E_{x_t, c_p} [ ‖ ε_θ(x_t, (c*, c_p)) − sg(ε_θ(x_t, c*)) ‖²₂ ]    (2)

where x_t is the noised image, ε_θ is the model's noise prediction, and sg(·) denotes the stop-gradient operator. The novelty of the CIP loss is defined by two key properties: it is both asymmetric and conditional.
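To make the objective concrete, the following is a minimal NumPy sketch of Eqs. (1) and (2). The function names are our own illustration, not the authors' released code; eps_malicious and eps_benign stand in for the U-Net noise predictions ε_θ(x_t, (c*, c_p)) and ε_θ(x_t, c*), and the default λ values match the weights reported in the experimental setup.

```python
import numpy as np

def cip_loss(eps_malicious, eps_benign):
    """CIP loss of Eq. (2): mean squared error between the noise
    prediction for the malicious prompt (c*, c_p) and the prediction
    for the benign prompt c* alone. In an autograd framework the
    benign target would be wrapped in sg(.), e.g. tensor.detach(),
    so gradients flow only through the malicious-prompt branch."""
    return float(np.mean((eps_malicious - eps_benign) ** 2))

def total_loss(l_db, l_cip, l_wm, lambda_r=0.2, lambda_w=0.1):
    """Total objective of Eq. (1)."""
    return l_db + lambda_r * l_cip + lambda_w * l_wm

# Toy noise predictions standing in for the U-Net outputs.
eps_mal = np.array([1.0, 2.0, -1.0])
eps_ben = np.array([1.0, 0.0, -1.0])
print(cip_loss(eps_mal, eps_ben))                    # mean([0, 4, 0]) = 1.333...
print(total_loss(0.5, cip_loss(eps_mal, eps_ben), 0.3))  # ~0.7967
```

Since sg(·) freezes the benign-prompt target, minimizing this term pulls the malicious-prompt prediction toward the identity-only output rather than the other way around, which is exactly the asymmetry the next paragraph describes.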
The asymmetric redirection, steering the prediction for the harmful combination towards the benign concept alone, is what faithfully preserves the user's identity. Meanwhile, its conditional application is precisely what prevents the collateral damage that plagues context-blind methods.

2.2. Concept-Bound Provenance

To ensure meaningful provenance in personalized contexts, the signal must be robust and specific, serving as a distinctive signature that verifies the use of the personalized concept c*. To achieve this, we bind the watermark embedding process directly and exclusively to the personalized concept. As shown in Figure 2, we use a pre-trained, frozen watermark decoder, D_w, to define a watermarking loss, L_WM. This loss is computed only when the training prompt contains the personalized concept c*:

    L_WM = BCE(D_w(x_gen), m)  if c* ∈ prompt;  0 otherwise    (3)

where x_gen is the generated image and m is the target k-bit message.

This design represents a fundamental shift: the watermark is no longer a generic fingerprint on all model outputs. Instead, the personalized concept itself becomes the trigger for the watermark. This tight binding ensures both robustness and specificity, providing an unambiguous link between an image and the identity used to create it.

3. EXPERIMENTS

We conducted experiments to validate our main hypothesis that a context-aware, Concept-Bound security paradigm is demonstrably superior to generic, context-blind filters. We first present a comprehensive comparison on core metrics, and then provide a deep dive into a real-world safety scenario.

Table 1. Unified analysis of security paradigms. The data reveals the failure of context-blind approaches: generic methods either offer no provenance or degrade model fidelity. Our context-aware framework, IDENTITYGUARD, is the only one to provide both state-of-the-art restriction and robust provenance without this collateral damage.
Fidelity is measured on benign prompts (FID ↓, CLIP ↑), restriction on malicious prompts (FID-Censored ↓, CLIP-Censored ↓), and provenance by Bit Accuracy ↑.

Paradigm              Method                   FID ↓   CLIP ↑   FID-Cens. ↓  CLIP-Cens. ↓  Bit Acc. ↑
Baseline              DreamBooth               55.81   0.3150   465.80       0.2378        N/A
Baseline              + Post-hoc WM (HiDDeN)   56.01   -        -            -             60%
Global Guidance       SLD                      60.97   0.3077   412.53       0.2357        N/A
Global Erasure        ESD (Erasing Concepts)   57.18   0.2986   372.38       0.2093        N/A
Concept-Bound (Ours)  Untarget                 54.72   0.3045   393.15       0.2140        97.1%
Concept-Bound (Ours)  Conditioning             57.71   0.3026   401.59       0.2132        97.1%
Concept-Bound (Ours)  Target                   57.04   0.3147   402.92       0.1919        97.1%

Table 2. Case study on a critical safety threat (nudity). Detections by NudeNet per 100 images generated with a malicious "naked" prompt. Even the strongest generic method (ESD) exhibits significant safety failures. Our method provides a near-total solution, demonstrating an order-of-magnitude improvement in a high-stakes scenario.

Method          Explicit  Suggestive  Other  Total Detections ↓
DreamBooth      135       59          148    342
+ SLD           67        52          127    246
+ ESD           1         1           44     46
IDENTITYGUARD   1         0           1      2

3.1. Experimental Setup

Implementation: Our method is integrated into the DreamBooth fine-tuning process of a Stable Diffusion v2.1 model, with loss weights set to λ_r = 0.2 and λ_w = 0.1.
Baselines: We evaluate against methods representing the context-blind paradigm. For restriction, we use Safe Latent Diffusion (SLD) [7] and Erasing Concepts (ESD) [13]. For provenance, we test HiDDeN [18].
Metrics: We measure image fidelity with Fréchet Inception Distance (FID) and CLIP Score on benign prompts. Restriction effectiveness is measured by FID-Censored and CLIP-Censored scores on malicious prompts. Watermark robustness is measured by Bit Accuracy. The term "Censored" refers to prompts that contain prohibited semantics.

3.2. Main Results: A Unified Analysis

Our main results, consolidated in Table 1, reveal the fundamental flaw of the generic security paradigm: it is forced to inflict collateral damage to provide security.
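For concreteness, here is a minimal NumPy sketch of the conditional watermark loss of Eq. (3) and of the Bit Accuracy metric reported in Table 1. The decoder logits, the "sks" trigger token, and all function names are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def watermark_loss(decoder_logits, message_bits, prompt, trigger="sks"):
    """Concept-bound watermark loss of Eq. (3): binary cross-entropy
    between the frozen decoder's output D_w(x_gen) and the target
    k-bit message m, applied only when the personalized concept
    token appears in the prompt."""
    if trigger not in prompt.split():
        return 0.0
    p = 1.0 / (1.0 + np.exp(-np.asarray(decoder_logits, dtype=float)))
    m = np.asarray(message_bits, dtype=float)
    eps = 1e-12
    return float(-np.mean(m * np.log(p + eps) + (1 - m) * np.log(1 - p + eps)))

def bit_accuracy(decoder_logits, message_bits):
    """Bit Accuracy metric: fraction of decoded bits (thresholded
    at logit 0) that match the embedded message."""
    pred = (np.asarray(decoder_logits) > 0).astype(int)
    return float(np.mean(pred == np.asarray(message_bits)))

logits = np.array([2.3, -1.1, 0.4, -3.0])   # stand-in for D_w(x_gen)
bits = np.array([1, 0, 0, 0])               # target message m
print(watermark_loss(logits, bits, "A photo of sks person"))  # > 0, watermark active
print(watermark_loss(logits, bits, "A photo of person"))      # 0.0, trigger absent
print(bit_accuracy(logits, bits))                             # 0.75
```

The conditional branch is the key design choice: outputs from generic prompts never incur the watermark penalty, so the signature becomes specific to the personalized concept rather than a global fingerprint.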
The unprotected DreamBooth model (FID 55.81) offers high fidelity but no protection. The generic safeguards attempt to fix this, but at great cost. Global Erasure (ESD), while providing strong restriction (CLIP-Censored 0.2093), suffers from precisely the collateral damage we hypothesized. As illustrated in our qualitative analysis (Figure 3), ESD is forced to globally erase concepts, destroying the model's ability to generate benign images like a simple campfire. This forced trade-off makes it an impractical solution. Similarly, post-hoc watermarking is fundamentally incompatible with personalization, with a recovery rate (60% Bit Accuracy) barely better than random chance.

In stark contrast, our Concept-Bound framework, IDENTITYGUARD, is the only approach that avoids this dilemma. The table also serves as an ablation of our core mechanism, confirming that our final Target strategy is superior. Specifically, Target redirects the malicious prompt towards the benign personalized concept, whereas Untarget guides it towards a null-text prompt, and Conditioning adds a preservation loss to ensure the blacklisted concept remains usable when not combined with the personalized identity. Our final Target method achieves a state-of-the-art restriction score (CLIP-Censored 0.1919) and near-perfect provenance (Bit Accuracy 97.1%), all while preserving the model's general utility and original fidelity.

3.3. Case Study: Efficacy on a Critical Safety Threat

To demonstrate the real-world impact of our method, we performed a case study on preventing the generation of nudity. As shown in Table 2, we used a dedicated nudity classifier (NudeNet) [21] to analyze 100 images generated from a malicious "naked" prompt. The results are striking. The unprotected model produced 342 instances of exposed content. Even the strongest generic competitor, ESD, still had 46 detections, representing a significant safety failure.
Our method, IDENTITYGUARD, provides a near-total solution, reducing the number of detections to just 2. This order-of-magnitude improvement in a high-stakes scenario underscores the practical necessity of context-aware paradigms for building genuinely safe AI.

Limitations and Future Work. Our experimental validation, while demonstrating the core principles of our paradigm, focuses on a curated set of concepts and baselines. We view this work as a strong proof of concept for context-aware security. Future work could explore scaling this approach to a broader range of generative architectures and to more complex, open-ended blacklist definitions.

Fig. 3. Qualitative analysis of our context-aware security. The key failure of generic methods like Erasing Concepts (ESD) is revealed in the bottom rows: to provide protection in the personalized context (top rows), they are forced to inflict catastrophic collateral damage, globally erasing the concept of "fire" and failing to generate a simple campfire. In contrast, IDENTITYGUARD's safeguard is intelligently bound only to the personalized identity, allowing it to preserve the concept for general use.

4. CONCLUSION

In this work, we argued that the targeted risks of personalized generative models demand a targeted security response.
We demonstrated that the prevailing paradigm of generic, context-blind filters is fundamentally flawed, forcing a false choice in which security comes at the cost of unacceptable collateral damage to a model's broader utility. As a solution, we proposed and validated a new paradigm: context-aware security, where safeguards are intrinsically bound to the concepts they protect. Our framework is a practical realization of this principle; it successfully avoids the collateral damage of older methods while providing both content restriction and robust, concept-specific traceability, demonstrating a more effective path toward building truly safe personalized AI.

5. ACKNOWLEDGMENT

This work was funded in part by the National Key R&D Program of China under Grant 2022YFB3104300, the Jiangsu Provincial Natural Science Foundation of China under Grant BK20240291, and the Key Research and Development Programme of Ningbo's "Science and Technology Innovation Yongjiang 2035" Plan under Grant 2025Z054.

6. REFERENCES

[1] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman, "DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22500–22510.
[2] Jie Zhang, Florian Kerschbaum, Tianwei Zhang, et al., "Backdooring textual inversion for concept censorship," arXiv preprint arXiv:2308.10718, 2023.
[3] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu, "Multi-concept customization of text-to-image diffusion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1931–1941.
[4] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao, "AnyDoor: Zero-shot object-level image customization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6593–6602.
[5] Rinon Gal, Moab Arar, Yuval Atzmon, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or, "Encoder-based domain tuning for fast personalization of text-to-image models," ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–13, 2023.
[6] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski, "Break-A-Scene: Extracting multiple concepts from a single image," in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–12.
[7] Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting, "Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22522–22531.
[8] Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, and Qiang Xu, "GuardT2I: Defending text-to-image models from adversarial prompts," Advances in Neural Information Processing Systems, vol. 37, pp. 76380–76403, 2024.
[9] Alvin Heng and Harold Soh, "Selective amnesia: A continual learning approach to forgetting in deep generative models," Advances in Neural Information Processing Systems, vol. 36, 2024.
[10] Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, and Suhang Wang, "Universal prompt optimizer for safe text-to-image generation," arXiv preprint arXiv:2402.10882, 2024.
[11] Lingyun Zhang, Yu Xie, Yanwei Fu, and Ping Chen, "Concept Replacer: Replacing sensitive concepts in diffusion models via precision localization," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 8172–8181.
[12] Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu, "Ablating concepts in text-to-image diffusion models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22691–22702.
[13] Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau, "Erasing concepts from diffusion models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2426–2436.
[14] Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, and David Bau, "Unified concept editing in diffusion models," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5111–5120.
[15] Mahbuba Begum and Mohammad Shorif Uddin, "Digital image watermarking techniques: a review," Information, vol. 11, no. 2, pp. 110, 2020.
[16] Wenbo Wan, Jun Wang, Yunming Zhang, Jing Li, Hui Yu, and Jiande Sun, "A comprehensive survey on robust image watermarking," Neurocomputing, vol. 488, pp. 226–247, 2022.
[17] Pierre Fernandez, Guillaume Couairon, Hervé Jégou, Matthijs Douze, and Teddy Furon, "The Stable Signature: Rooting watermarks in latent diffusion models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22466–22477.
[18] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei, "HiDDeN: Hiding data with deep networks," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 657–672.
[19] Yingqian Cui, Jie Ren, Yuping Lin, Han Xu, Pengfei He, Yue Xing, Wenqi Fan, Hui Liu, and Jiliang Tang, "FT-Shield: A watermark against unauthorized fine-tuning in text-to-image diffusion models," arXiv preprint arXiv:2310.02401, 2023.
[20] Weitao Feng, Jiyan He, Jie Zhang, Tianwei Zhang, Wenbo Zhou, Weiming Zhang, and Nenghai Yu, "Catch you everything everywhere: Guarding textual inversion via concept watermarking," arXiv preprint arXiv:2309.05940, 2023.
[21] P. Bedapudi, "NudeNet: Neural nets for nudity classification, detection and selective censoring," 2019.