DICE: Disentangling Artist Style from Content via Contrastive Subspace Decomposition in Diffusion Models
The recent proliferation of diffusion models has made style mimicry effortless, enabling users to imitate unique artistic styles without authorization. In deployed platforms, this raises copyright and intellectual-property risks and calls for reliable protection. However, existing countermeasures either require costly weight editing as new styles emerge or rely on an explicitly specified editing style, limiting their practicality for deployment-side safety. To address this challenge, we propose DICE (Disentanglement of artist Style from Content via Contrastive Subspace Decomposition), a training-free framework for on-the-fly artist style erasure. Unlike style editing, which requires an explicitly specified replacement style, DICE performs style purification, removing the artist’s characteristics while preserving the user-intended content. Our core insight is that a model cannot truly comprehend an artist’s style from a single text or image alone. Consequently, we abandon the traditional paradigm of identifying style from isolated samples. Instead, we construct contrastive triplets to compel the model to distinguish between style and non-style features in the latent space. By formalizing this disentanglement process as a solvable generalized eigenvalue problem, we achieve precise identification of the style subspace. Furthermore, we introduce an Adaptive Attention Decoupling Editing strategy that dynamically assesses the style concentration of each token and performs differential suppression and content enhancement on the QKV vectors. Extensive experiments demonstrate that DICE achieves a superior balance between the thoroughness of style erasure and the preservation of content integrity. DICE introduces an additional overhead of only 3 seconds to disentangle style, providing a practical and efficient technique for curbing style mimicry.
💡 Research Summary
The paper addresses the pressing problem of style mimicry in diffusion‑based image generation, where malicious users can replicate a specific artist’s visual language simply by appending “in the style of X” to a prompt. Existing defenses fall into three categories: fine‑tuning, weight‑editing, and inference‑time interventions. Fine‑tuning requires costly retraining for each new style; weight‑editing methods still treat style as a concrete set of parameters and often damage non‑target content; inference‑time approaches typically need an explicit replacement style or a neutral text, which leads to distortion or reduced diversity. All of these share a fundamental flaw: they assume the model can identify “style” from a single prompt or image, whereas style is an abstract, distributed concept spread across the model’s latent space.
DICE (Disentanglement of artist Style from Content via Contrastive Subspace Decomposition) proposes a training‑free, on‑the‑fly solution that erases a target artist’s style while preserving the user‑intended content. The core idea is to abandon single‑sample identification and instead construct contrastive triplets: (Anchor) an image containing both the target style and the target content, (Positive) an image with the same style but different content, and (Negative) an image with the same content but a different style. By aligning patches across these three samples, the method obtains two sets of latent vectors—one capturing style variation, the other capturing content variation.
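The summary says patches are aligned across the three samples but does not specify the matching procedure. As one simple, hypothetical reading, each anchor patch could be matched to its most similar patch in the other image by cosine similarity over patch features (the function name and signature below are illustrative, not from the paper):

```python
import numpy as np

def align_patches(anchor, other):
    """Reorder `other`'s patch features so that row i is the best
    cosine-similarity match for anchor patch i.

    anchor, other: (n_patches, d) patch feature matrices.
    Hypothetical sketch -- the paper's actual alignment rule is
    not described in this summary.
    """
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    b = other / np.linalg.norm(other, axis=1, keepdims=True)
    idx = np.argmax(a @ b.T, axis=1)   # best match per anchor patch
    return other[idx]                  # aligned to anchor ordering
```

Applying this to (Anchor, Positive) and (Anchor, Negative) yields the two row-aligned latent sets, one varying in content and one varying in style.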
These two sets are fed into a generalized eigenvalue problem derived from Canonical Correlation Analysis (CCA). Solving Σ_XY Σ_YY⁻¹ Σ_YX u = ρ² Σ_XX u yields eigenvectors that span the style subspace (directions that maximize similarity between Anchor and Positive while minimizing similarity between Anchor and Negative). The orthogonal complement forms the content subspace. Because the problem is closed‑form, the subspace can be computed in a few seconds without any model retraining.
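The generalized eigenvalue problem above is standard CCA and has a closed-form solution; a minimal NumPy/SciPy sketch is below. The function name, the ridge term `reg` (added for numerical stability), and the choice of `k` are assumptions, not details from the paper:

```python
import numpy as np
from scipy.linalg import eigh

def style_subspace(X, Y, k=8, reg=1e-4):
    """Solve S_xy S_yy^{-1} S_yx u = rho^2 S_xx u and return the
    top-k eigenvectors (columns), a sketch of the style subspace.

    X, Y: (n_samples, d) aligned latent-vector sets.
    `reg` is a small ridge added to the covariances so the
    generalized eigensolver is well-posed (our assumption).
    """
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n, d = X.shape
    Sxx = Xc.T @ Xc / n + reg * np.eye(d)
    Syy = Yc.T @ Yc / n + reg * np.eye(d)
    Sxy = Xc.T @ Yc / n
    # Left-hand operator S_xy S_yy^{-1} S_yx is symmetric PSD,
    # so the symmetric-definite solver applies directly.
    A = Sxy @ np.linalg.solve(Syy, Sxy.T)
    rho2, U = eigh(A, Sxx)               # ascending eigenvalues
    return U[:, ::-1][:, :k], rho2[::-1][:k]
```

The eigenvalues ρ² are squared canonical correlations in [0, 1]; the top-k eigenvectors span the style subspace, and its orthogonal complement serves as the content subspace. Because this is a single dense eigendecomposition, it runs in well under a second for typical feature dimensions, consistent with the few-second overhead reported.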
During generation, DICE manipulates the self‑attention mechanism of the U‑Net backbone. Query (Q) vectors primarily encode structural information, whereas Key (K) and Value (V) vectors carry texture and color cues associated with style. The method applies orthogonal suppression to K and V along the style subspace, while simultaneously enhancing Q along the content subspace. This “Attention Decoupling Editing” reduces stylistic influence without harming spatial layout.
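The Q/K/V edit described above can be sketched with orthogonal projectors onto the style subspace and its complement. The scaling factors `alpha` and `beta` below are hypothetical knobs (the paper's exact edit rule and coefficients are not given in this summary):

```python
import numpy as np

def decouple_attention(Q, K, V, U_style, alpha=1.0, beta=0.2):
    """Suppress K and V along the style subspace and mildly boost Q
    along the content subspace (hypothetical sketch).

    Q, K, V: (n_tokens, d) attention inputs.
    U_style: (d, k) orthonormal basis of the style subspace.
    """
    P_style = U_style @ U_style.T               # style projector
    P_content = np.eye(Q.shape[-1]) - P_style   # orthogonal complement
    K_edit = K - alpha * (K @ P_style)          # strip style from keys
    V_edit = V - alpha * (V @ P_style)          # ...and from values
    Q_edit = Q + beta * (Q @ P_content)         # amplify content in queries
    return Q_edit, K_edit, V_edit
```

With `alpha=1` the edited K and V have no component left along the style directions, while Q (structure) is only rescaled in the content subspace, matching the intuition that the edit should not disturb spatial layout.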
Style intensity is not uniform across an image. DICE therefore introduces an Adaptive Erasure Controller that measures, for each token (image patch), the projection magnitude onto the style subspace—i.e., a style concentration score. A soft‑thresholding function maps this score to a token‑wise erasure strength, allowing stronger suppression where the style is dominant and milder editing elsewhere, thus preserving fine‑grained details.
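A token-wise controller of this kind could score each token by the fraction of its energy lying in the style subspace, then pass the score through a sigmoid soft-threshold. The threshold `tau` and temperature `temp` below are illustrative assumptions:

```python
import numpy as np

def erasure_strength(tokens, U_style, tau=0.3, temp=0.1):
    """Per-token erasure strength in [0, 1] (hypothetical sketch).

    Score = fraction of each token's squared norm that lies in the
    style subspace; a sigmoid soft-threshold around `tau` maps the
    score to an erasure strength.

    tokens: (n_tokens, d); U_style: (d, k) orthonormal basis.
    """
    proj = tokens @ U_style                                  # (n, k)
    score = np.sum(proj**2, -1) / (np.sum(tokens**2, -1) + 1e-8)
    return 1.0 / (1.0 + np.exp(-(score - tau) / temp))       # soft threshold
```

Tokens dominated by style directions receive strength near 1 (strong suppression), while tokens orthogonal to the style subspace receive strength near 0, leaving fine-grained content details mostly untouched.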
Extensive experiments on Stable Diffusion models cover multiple well‑known artists (e.g., Van Gogh, Monet, Picasso). Baselines include concept‑erasure methods (EraseDiff, SuMA), style‑transfer approaches, and recent closed‑form weight‑editing techniques. Evaluation metrics comprise Style Cosine Similarity (to quantify style removal), PSNR/SSIM (content fidelity), and CLIP‑Score (overall semantic alignment). DICE consistently achieves the highest style reduction (≈70 % drop) while retaining the best content scores (≈90 % of the original). Qualitative inspection shows that structural layout, composition, and subject identity remain intact, with only stylistic brushwork and palette being neutralized. The entire pipeline adds roughly 3 seconds of overhead per image, making it viable for real‑time deployment.
Limitations are acknowledged: constructing triplets requires access to reference images for the target style and content, and extremely subtle style cues may not be fully eliminated. Future work aims to automate triplet generation, support simultaneous erasure of multiple styles, and extend the method to multimodal (text‑image) settings.
In summary, DICE introduces a novel paradigm for style protection in generative AI: by leveraging contrastive triplets and a mathematically grounded subspace decomposition, it isolates the abstract notion of “style” and applies targeted attention‑level edits. This enables efficient, training‑free, and content‑preserving style erasure, offering a practical tool for safeguarding artists’ intellectual property in diffusion‑model services.