Personalized Safety Alignment for Text-to-Image Diffusion Models
Text-to-image diffusion models have revolutionized visual content generation, yet their deployment is hindered by a fundamental limitation: safety mechanisms enforce rigid, uniform standards that fail to reflect diverse user preferences shaped by age, culture, or personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that transitions generative safety from static filtration to user-conditioned adaptation. We introduce Sage, a large-scale dataset capturing diverse safety boundaries across 1,000 simulated user profiles, covering complex risks often missed by traditional datasets. By integrating these profiles via a parameter-efficient cross-attention adapter, PSA dynamically modulates generation to align with individual sensitivities. Extensive experiments demonstrate that PSA achieves a calibrated safety-quality trade-off: under permissive profiles, it relaxes over-cautious constraints to enhance visual fidelity, while under restrictive profiles, it enforces state-of-the-art suppression, significantly outperforming static baselines. Furthermore, PSA exhibits superior instruction adherence compared to prompt-engineering methods, establishing personalization as a vital direction for creating adaptive, user-centered, and responsible generative AI. Our code, data, and models are publicly available at https://github.com/M-E-AGI-Lab/PSAlign.
💡 Research Summary
The paper tackles a fundamental limitation of current text‑to‑image diffusion models: safety mechanisms are implemented as a single, global filter that does not account for the wide variety of user‑specific safety expectations shaped by age, culture, religion, mental health, or personal beliefs. To move from a “one‑size‑fits‑all” approach to a truly user‑centric one, the authors introduce Personalized Safety Alignment (PSA), a framework that conditions diffusion generation on an explicit user profile.
The first contribution is the Sage dataset, the first large‑scale benchmark for personalized safety. The authors define ten safety‑critical categories, focusing on seven subjective ones (hate, harassment, violence, self‑harm, sexuality, shocking content, propaganda). They generate fine‑grained concept instances using the Qwen2.5‑7B language model and then simulate 1,000 distinct virtual users by sampling controlled attributes (age, gender, religion, mental and physical health status). For each virtual user, a GPT‑4.1‑mini model infers a set of banned concepts C_ban(u) and allowed concepts C_allow(u). Using an adversarial pipeline, each concept is paired with an unsafe prompt and a safe rewrite; the resulting image pairs are labeled as preferred or dispreferred according to the user’s stance, yielding 44,100 (x⁺, x⁻, prompt, user) preference tuples. Human validation shows high agreement (κ = 0.83).
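The pairing logic described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' pipeline code: the class and field names (`UserProfile`, `banned`, `allowed`) and the labeling rule are assumptions based on the description of C_ban(u), C_allow(u), and the (x⁺, x⁻, prompt, user) tuples.

```python
from dataclasses import dataclass

@dataclass
class UserProfile:
    user_id: int
    banned: set    # C_ban(u): concepts this user disallows
    allowed: set   # C_allow(u): concepts this user permits

def build_preference_tuple(user, concept, unsafe_prompt, safe_rewrite):
    """Pair an unsafe prompt with its safe rewrite and label which
    generated image should be preferred for this user."""
    if concept in user.banned:
        # User rejects the concept: the safe rewrite's image is preferred.
        preferred, dispreferred = safe_rewrite, unsafe_prompt
    else:
        # User tolerates the concept: the original generation stays preferred.
        preferred, dispreferred = unsafe_prompt, safe_rewrite
    return {"user": user.user_id, "concept": concept,
            "preferred": preferred, "dispreferred": dispreferred}

u = UserProfile(0, banned={"graphic violence"}, allowed={"medical anatomy"})
t = build_preference_tuple(u, "graphic violence",
                           "a brutal battlefield scene",
                           "a tense but bloodless battlefield scene")
print(t["preferred"])  # the safe rewrite, since the concept is banned
```

The key point is that the same (unsafe, safe) prompt pair can yield opposite preference labels for different users, which is what makes the dataset user-conditioned rather than globally filtered.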
The second contribution is a parameter‑efficient adapter that injects user information into the diffusion process. The authors freeze the pretrained U‑Net of a large diffusion model (e.g., Stable Diffusion v1.5 or SDXL) and add a lightweight “User‑Cross‑Attention” module to every transformer block. The module reuses the query projection from the original text‑attention but learns separate key and value projections for the user embedding. The resulting attention is a simple additive combination A_t + A_u, allowing the model to modulate its behavior early in the denoising steps based on the user profile while preserving the rich semantic knowledge of the original model. The adapter requires only 16 KB per user, leading to a total storage footprint of about 16 MB for 1,000 users, and increases inference latency by less than 6 %.
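A minimal NumPy sketch of this additive User‑Cross‑Attention: queries are shared with the text cross‑attention, while the user embedding gets its own key/value projections, and the two attention outputs are summed. Shapes, weight names, and the toy dimension are assumptions for illustration; the scaled‑dot‑product form itself is standard.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8                            # toy attention dimension
h = rng.normal(size=(4, d))      # image latent tokens
text = rng.normal(size=(5, d))   # text-encoder tokens
user = rng.normal(size=(1, d))   # user-profile embedding

W_q = rng.normal(size=(d, d))    # query projection, shared (frozen)
W_kt = rng.normal(size=(d, d))   # text key projection (frozen)
W_vt = rng.normal(size=(d, d))   # text value projection (frozen)
W_ku = rng.normal(size=(d, d))   # user key projection (new, trainable)
W_vu = rng.normal(size=(d, d))   # user value projection (new, trainable)

q = h @ W_q
A_t = attend(q, text @ W_kt, text @ W_vt)  # original text cross-attention
A_u = attend(q, user @ W_ku, user @ W_vu)  # user cross-attention (adapter)
out = A_t + A_u                            # simple additive combination
print(out.shape)  # (4, 8)
```

Because only W_ku and W_vu (plus the user embedding) are new parameters per adapter, the storage and latency overhead stays small, consistent with the figures quoted above.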
Training leverages a personalized extension of Diffusion‑DPO. The authors define a user‑conditioned instance loss ℓ_u and compute a difference Δ_u between the policy model ϵ_θ(u) and a frozen reference model ϵ_ref for each preference pair. The PSA loss L_PSA follows the same sigmoid‑based formulation as standard DPO but operates on Δ_u, directly optimizing the model to prefer the user‑designated safe images while disfavoring the unsafe ones.
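The objective above can be sketched numerically. This is a hedged toy illustration: scalar denoising errors stand in for the user‑conditioned instance loss ℓ_u, and the β coefficient follows the standard Diffusion‑DPO sigmoid form; the actual per‑timestep weighting in the paper may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def psa_loss(ell_theta_pos, ell_theta_neg, ell_ref_pos, ell_ref_neg, beta=0.1):
    """DPO-style loss on the policy-vs-reference difference Delta_u.

    ell_* are denoising losses of the policy (theta) and frozen
    reference (ref) on the preferred (pos) and dispreferred (neg) images.
    """
    # Delta_u: policy's improvement over the reference on the preferred
    # sample, minus the same quantity on the dispreferred sample.
    delta_u = (ell_theta_pos - ell_ref_pos) - (ell_theta_neg - ell_ref_neg)
    # Lower denoising loss means a better fit, so training pushes
    # delta_u negative; the sigmoid form is the standard DPO objective.
    return -np.log(sigmoid(-beta * delta_u))

# A policy that fits the preferred image better (0.2 < 0.5) and the
# dispreferred one worse (0.9 > 0.5) than the reference incurs a
# smaller loss than the reverse:
good = psa_loss(0.2, 0.9, 0.5, 0.5)
bad = psa_loss(0.9, 0.2, 0.5, 0.5)
print(good < bad)  # True
```

The only change relative to standard Diffusion‑DPO is that every ℓ term is computed with the user embedding in the conditioning, so the same objective yields different preferred outcomes for different users.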
Empirical results demonstrate that PSA can dynamically trade off safety and visual fidelity. For restrictive user profiles, PSA matches or exceeds state‑of‑the‑art safety filters, achieving a 12 % higher suppression rate on dangerous concepts while maintaining comparable FID scores. For permissive profiles, PSA relaxes over‑cautious filtering, improving FID by an average of 0.8 points and raising CLIP‑Score, indicating higher image quality. Compared to external prompt‑rewriting baselines (e.g., LLM‑based prompt editing), PSA improves adherence to user‑specific boundaries by roughly 18 % on average, avoiding semantic drift and indiscriminate censorship.
Ablation studies confirm that the cross‑attention adapter is the critical component for user‑conditional control, and that freezing the backbone prevents catastrophic forgetting of the model’s general generation capabilities. The authors also present case studies in domains such as medical education (where explicit anatomical depictions may be needed) and parental control (requiring stricter shielding), illustrating how PSA can be tuned to very different safety expectations without retraining the entire model.
In summary, the paper proposes a novel, scalable solution for personalized safety in text‑to‑image diffusion models. By combining a richly annotated, user‑centric dataset with a lightweight, plug‑and‑play adapter and a DPO‑style training objective, PSA achieves state‑of‑the‑art safety performance where needed while preserving or even enhancing image quality where users desire more freedom. The public release of code, data, and pretrained adapters invites further research into user‑conditioned alignment for generative AI across modalities.