Color Bind: Exploring Color Perception in Text-to-Image Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Text-to-image generation has recently seen remarkable success, granting users the ability to create high-quality images from text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. Prior work, however, has mostly relied on coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or on human evaluations, which are challenging to conduct at larger scale. In this work, we perform a case study on colors, a fundamental attribute commonly associated with objects in text prompts, which offers a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes, far more so than with single-color prompts, and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique that mitigates the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance across a wide range of metrics on images generated by various text-to-image diffusion-based techniques.


💡 Research Summary

The paper investigates a specific failure mode of modern text‑to‑image diffusion models: the inability to faithfully render multiple distinct colors when prompts contain several objects, each with its own color attribute. While single‑object, single‑color prompts are handled well, multi‑object, multi‑color prompts often suffer from “color leakage” and incorrect attribute‑object binding. Existing evaluation metrics such as CLIP similarity or VQA scores are too coarse to capture these fine‑grained errors, prompting the authors to design a dedicated benchmark called CompColor.

CompColor is built from 35 well‑recognized HTML color names and a curated set of objects that do not have intrinsic dominant colors. For each color, three perceptually close and three distant partner colors are selected, forming paired prompts of the form “a {color1} {object1} and a {color2} {object2}”. The benchmark includes multiple random seeds and prompt variations to ensure statistical robustness.
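The paired-prompt construction described above can be sketched in a few lines; the color and object lists below are placeholders for illustration, not the benchmark's actual 35 HTML color names or curated object set, and the real benchmark additionally restricts color pairs to the selected close/distant partners:

```python
from itertools import product

# Placeholder lists (assumptions): the paper uses 35 HTML color names and a
# curated set of objects without intrinsic dominant colors.
COLORS = ["red", "navy", "olive", "teal"]
OBJECTS = ["backpack", "balloon", "chair", "mug"]

def build_prompts(colors, objects):
    """Enumerate 'a {color1} {object1} and a {color2} {object2}' prompts,
    skipping degenerate pairs that repeat a color or an object."""
    prompts = []
    for (c1, c2), (o1, o2) in product(product(colors, repeat=2),
                                      product(objects, repeat=2)):
        if c1 != c2 and o1 != o2:
            prompts.append(f"a {c1} {o1} and a {c2} {o2}")
    return prompts

prompts = build_prompts(COLORS, OBJECTS)
print(prompts[0])  # → a red backpack and a navy balloon
```

Each such prompt would then be rendered under multiple random seeds, as the summary notes, to make the accuracy statistics robust.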

To measure color fidelity, the authors employ the Segment Anything Model (SAM) to isolate each target object, then apply k‑means clustering to the pixels within the mask to extract the dominant color. The extracted color is compared to the ground‑truth color in CIELAB space. A binary accuracy is defined (CIE ΔE distance ≤ 10), together with average L2 distances in LAB and RGB space for finer‑grained analysis.
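The clustering-and-comparison stage of this pipeline can be reimplemented in pure Python as an illustrative sketch; this is not the authors' code, it assumes the masked pixels are already extracted (e.g. by SAM), and it uses the simple CIE76 ΔE, i.e. Euclidean distance in Lab:

```python
import math
import random

def srgb_to_lab(rgb):
    """Convert an sRGB triple in [0, 1] to CIELAB (D65 white point)."""
    def lin(c):  # undo sRGB gamma
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (lin(c) for c in rgb)
    # linear sRGB -> XYZ (D65)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    xn, yn, zn = 0.95047, 1.0, 1.08883
    def f(t):
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def kmeans(points, k=3, iters=20, seed=0):
    """Tiny Lloyd's k-means over 3-D points; returns (centroids, cluster sizes)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    sizes = [0] * k
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        centroids = [
            tuple(sum(c[d] for c in cl) / len(cl) for d in range(3)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        sizes = [len(cl) for cl in clusters]
    return centroids, sizes

def dominant_color(masked_pixels, k=3):
    """Dominant color = centroid of the largest k-means cluster."""
    centroids, sizes = kmeans(masked_pixels, k=k)
    return centroids[max(range(len(sizes)), key=sizes.__getitem__)]

def delta_e(rgb1, rgb2):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.dist(srgb_to_lab(rgb1), srgb_to_lab(rgb2))

# Toy 'mask': 80 red pixels and 20 blue ones -> red should dominate.
pixels = [(1.0, 0.0, 0.0)] * 80 + [(0.0, 0.0, 1.0)] * 20
correct = delta_e(dominant_color(pixels), (1.0, 0.0, 0.0)) <= 10  # binary accuracy
```

The ΔE ≤ 10 threshold and the Lab-space comparison mirror the metric described above; k = 3 and the number of k-means iterations are assumptions of this sketch.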

Baseline experiments cover several pretrained diffusion models (Stable Diffusion 1.4/1.5/2.1, FLUX‑dev) and a suite of inference‑time correction methods (Attend‑and‑Excite, Structured‑Diffusion, SynGen, RichText). Results show that all models achieve high accuracy on single‑object prompts (> 90 %) but drop dramatically to 45‑55 % on multi‑object prompts. The degradation is more severe for color pairs that are perceptually distant, indicating that the models tend to blend colors rather than keep them separate. Existing correction techniques improve performance modestly but often sacrifice single‑object quality or fail to resolve the binding problem completely.

To address the issue, the authors propose a training‑free editing framework that optimizes two complementary losses. (1) An attention loss encourages the cross‑attention maps of the full prompt to resemble those of a simplified, color‑less reference prompt (“a {object}”). This leverages the observation that mis‑binding of attributes manifests as irregularities in attention maps; aligning them restores proper object‑color association. (2) A color loss directly minimizes the CIELAB distance between the target color and the dominant color of the edited region. The optimization is performed on the latent representation and attention parameters of the already‑generated image, requiring only a few gradient steps and no retraining of the diffusion model.
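The shape of this two-term objective can be illustrated schematically. The function names, the MSE/squared-distance forms, and the weighting term `lam` are assumptions made for exposition; in the actual method these losses are backpropagated through the diffusion model's latents and attention maps rather than evaluated on flat lists:

```python
def attention_loss(attn_full, attn_ref):
    """MSE between (flattened) cross-attention maps of the full prompt
    and of the color-less reference prompt ('a {object}')."""
    n = len(attn_full)
    return sum((a - b) ** 2 for a, b in zip(attn_full, attn_ref)) / n

def color_loss(region_lab, target_lab):
    """Squared CIELAB distance between the edited region's dominant
    color and the target color."""
    return sum((a - b) ** 2 for a, b in zip(region_lab, target_lab))

def total_loss(attn_full, attn_ref, region_lab, target_lab, lam=0.1):
    # lam balances the two terms; its value here is an assumption.
    return attention_loss(attn_full, attn_ref) + lam * color_loss(region_lab, target_lab)
```

When the attention maps already match the reference and the region already shows the target color, the objective is zero, so the few gradient steps described above only perturb the latent where binding actually failed.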

Quantitative evaluation demonstrates that the proposed method consistently raises accuracy across all color pair types, with especially large gains for distant color pairs (up to 20 % absolute improvement). Average LAB distance is halved, and visual inspection confirms that colors are correctly bound to their intended objects while preserving overall composition and detail. Compared to other editing baselines, the method achieves superior color fidelity without degrading non‑target attributes.

The contribution is twofold: (1) a rigorous, perceptually grounded benchmark for assessing compositional color understanding in text‑to‑image models, and (2) a lightweight, model‑agnostic editing technique that fixes multi‑object color mis‑bindings post‑generation. The work highlights that current diffusion models still lack robust compositional reasoning for fine‑grained attributes, and it offers a practical pathway to improve user‑controlled generation without costly model retraining. Future directions include extending the framework to other attributes such as texture, material, or lighting, and refining attention‑based losses to handle more complex, multi‑attribute prompts.

