Creative Image Generation with Diffusion Models
Creative image generation has emerged as a compelling area of research, driven by the need to produce novel, high-quality images that expand the boundaries of imagination. In this work, we propose a framework for creative generation using diffusion models, where creativity is associated with the inverse probability of an image’s existence in the CLIP embedding space. Unlike prior approaches that rely on manual blending of concepts or exclusion of subcategories, our method estimates the probability distribution of generated images and drives it towards low-probability regions to produce rare, imaginative, and visually captivating outputs. We also introduce pull-back mechanisms that achieve high creativity without sacrificing visual fidelity. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness and efficiency of our creative generation framework, showcasing its ability to produce unique, novel, and thought-provoking images. This work provides a new perspective on creativity in generative models, offering a principled method to foster innovation in visual content synthesis.
💡 Research Summary
The paper introduces a principled framework for fostering creativity in text‑to‑image diffusion models by explicitly targeting low‑probability regions of the CLIP embedding space. The authors argue that conventional generative models are biased toward high‑probability, “typical” outputs because both training objectives and evaluation metrics (e.g., FID, Inception Score) reward fidelity to the training distribution. Drawing on Berlyne’s arousal theory, they define creativity as the inverse probability (or surprise) of an image given a user’s exposure model, which they approximate with a large sample of embeddings generated by the diffusion prior.
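The surprise-based definition can be illustrated with a minimal one-dimensional sketch (the Gaussian "exposure model" and its parameters below are illustrative stand-ins, not the paper's fitted model): creativity is scored as −log p(x), so samples far from the mode of the exposure distribution score higher.

```python
import math

# Minimal 1-D illustration of "creativity as surprise": score a sample by
# -log p(x) under a Gaussian stand-in for the user's exposure model
# (mu and sigma are illustrative, not fitted to anything).
def surprise(x, mu=0.0, sigma=1.0):
    log_p = -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return -log_p

# A sample near the mode of the exposure model is unsurprising; one far in
# the tails carries high arousal potential.
typical, rare = surprise(0.1), surprise(4.0)
```

In this toy setting `rare` exceeds `typical` by construction, mirroring the paper's claim that low-probability images have higher arousal potential.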
The methodology proceeds in four stages. First, a baseline distribution of image embeddings is obtained by sampling thousands of embeddings from the diffusion prior (Kandinsky 2.1) conditioned on a positive prompt (P₍pos₎). These high‑dimensional embeddings are reduced via PCA, and a multivariate Gaussian Ĝ is fitted to the reduced data. Second, a “creative loss” L₍creative₎ = log Ĝ(ẽ) is minimized, pushing the embedding ẽ toward the tails of Ĝ, i.e., low‑probability regions. Third, to prevent the model from drifting into semantically meaningless space, two pull‑back mechanisms are introduced: (a) an anchor loss that maximizes cosine similarity between the generated embedding and the text embedding of P₍pos₎, and (b) a multimodal large language model (MLLM) checker that periodically asks “Is this still a {subject}? Yes/No.” If the answer is negative, optimization stops. These safeguards maintain high semantic fidelity while allowing exploration of novel visual concepts.
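The first three stages can be sketched as follows, under stated assumptions: synthetic stand-in embeddings and hypothetical dimensions and loss weights, with sklearn/scipy in place of the paper's actual pipeline. The MLLM checker is omitted since it is an external yes/no query rather than a differentiable loss.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import multivariate_normal

# Synthetic stand-ins for image embeddings sampled from the diffusion prior
# conditioned on P_pos (real ones would come from Kandinsky 2.1's prior).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2000, 64))   # hypothetical dim; CLIP uses more
text_emb = rng.normal(size=64)             # stand-in for the P_pos text embedding

# Stage 1: reduce dimensionality with PCA and fit a multivariate Gaussian G-hat.
pca = PCA(n_components=8).fit(embeddings)
reduced = pca.transform(embeddings)
g_hat = multivariate_normal(mean=reduced.mean(axis=0), cov=np.cov(reduced.T))

# Stage 2: creative loss = log G-hat(e~); minimizing it pushes the embedding
# toward the low-probability tails of the fitted distribution.
def creative_loss(e):
    return g_hat.logpdf(pca.transform(e[None])[0])

# Stage 3a: anchor loss = negative cosine similarity to the P_pos text
# embedding, pulling the result back toward the intended subject.
def anchor_loss(e):
    return -float(e @ text_emb / (np.linalg.norm(e) * np.linalg.norm(text_emb)))

# Total objective for one candidate embedding (the weight 1.0 is hypothetical).
e = embeddings[0]
total = creative_loss(e) + 1.0 * anchor_loss(e)
```

In practice the candidate embedding would be optimized by gradient descent on this objective before being decoded back into an image; here the losses are only evaluated to show their shape.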
Finally, the authors address the risk of converging on unappealing “negative” clusters by modeling such clusters with a separate Gaussian Ĝ₍neg₎ derived from embeddings that consistently produce undesirable outputs. A penalty term L₍neg₎ = α log Ĝ₍neg₎(ẽ) is added to the total loss; because this term grows as ẽ approaches the negative cluster, minimizing the total loss repels the optimization from known bad regions and provides directional control over the creative trajectory.
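The negative-cluster repulsion can be sketched in the same reduced space, using the sign convention under which the penalty is largest near the bad cluster, so that minimizing the total loss pushes away from it. The cluster location, sample data, and α below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Synthetic reduced-space embeddings standing in for ones that "consistently
# produced undesirable outputs" (cluster location and alpha are assumptions).
rng = np.random.default_rng(1)
bad = rng.normal(loc=3.0, size=(500, 8))
g_neg = multivariate_normal(mean=bad.mean(axis=0), cov=np.cov(bad.T))

# Penalty grows as the candidate embedding approaches the negative cluster,
# so adding it to the total loss steers the search away from bad regions.
def neg_penalty(e_reduced, alpha=0.5):
    return alpha * g_neg.logpdf(e_reduced)

inside = bad.mean(axis=0)   # deep inside the negative cluster: large penalty
outside = inside + 10.0     # far from the cluster: strongly negative penalty
```

Because `log Ĝ_neg` falls off quadratically with Mahalanobis distance, the repulsion is strong only near the modeled cluster and fades quickly elsewhere, leaving the rest of the embedding space free for exploration.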
Experiments on Kandinsky 2.1 demonstrate that the proposed approach yields images with substantially higher “Arousal Potential” (−log P) and human‑rated creativity scores, while only modestly degrading traditional quality metrics (FID increases by ~5‑10%). The anchor loss alone preserves semantic alignment in ~95% of cases, and the MLLM verifier further reduces out‑of‑domain failures. Incorporating negative‑cluster avoidance improves both aesthetic appeal and creativity scores, cutting the proportion of undesirable outputs from 40% to about 12% in a vehicle‑generation scenario.
Key contributions are: (1) a probabilistic definition of creativity and a loss that directly optimizes for low‑probability embeddings; (2) a dual pull‑back system (anchor loss + MLLM) that safeguards semantic fidelity; (3) a negative‑cluster based directionality mechanism that steers the search away from known bad regions. The work opens avenues for applying low‑probability exploration to other generative domains such as video, 3D assets, or multimodal storytelling, and suggests future research on personalized arousal models, more sophisticated multimodal feedback loops, and scaling the approach to larger diffusion backbones.