CoDi: Subject-Consistent and Pose-Diverse Text-to-Image Generation
Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse pose and layout. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual quality and stronger performance across all metrics. The code is available at https://github.com/NJU-PCALab/CoDi.
💡 Research Summary
CoDi tackles the long‑standing problem of Subject‑Consistent Generation (SCG) in text‑to‑image (T2I) synthesis: producing a set of images that share the same subject identity while each image follows a distinct pose, layout, and prompt. Existing training‑based approaches (e.g., DreamBooth, Textual Inversion) achieve high consistency but require costly fine‑tuning or test‑time optimization, limiting scalability. Training‑free methods such as ConsiStory and StoryDiffusion avoid parameter updates by sharing self‑attention keys and values across images, yet they sacrifice pose and layout diversity because the shared attention entangles spatial configurations.
CoDi introduces a training‑free, two‑stage framework that leverages the progressive nature of diffusion models. In diffusion, low‑frequency information (coarse structure, pose) emerges in the early denoising steps, while high‑frequency details (facial features, textures) are refined later. CoDi’s first stage, Identity Transport (IT), operates during the early steps (e.g., the first 10 of 50 timesteps). It extracts subject masks from the reference image and each target image using cross‑attention maps for subject‑related tokens (e.g., “fairy”) followed by Otsu thresholding. The masked features S_id (reference) and S_n (target) are then aligned via Optimal Transport (OT). The cost matrix C is defined as 1 − cosine similarity between feature vectors, and the probability masses a and b are derived from the attention‑based importance of each masked token (softmax‑normalized). Solving the OT problem with a network‑simplex algorithm yields a transport plan T_n that re‑maps reference features onto the target’s spatial layout, effectively “mosaicking” the subject while preserving the target pose. The transported subject features S_OT_n are recombined with the non‑subject background features and fed back into the diffusion pipeline, ensuring that the early latent already contains a pose‑aware, identity‑consistent subject.
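The transport step described above can be sketched in a few lines of numpy. The sketch below is illustrative, not the paper's implementation: it assumes the masked feature matrices `S_id` (reference) and `S_tgt` (target) and their attention-based importance weights are already extracted, and it approximates the exact network-simplex OT solver the summary mentions with a simple entropic Sinkhorn iteration so the example stays self-contained. The barycentric projection at the end plays the role of re-mapping reference identity features onto the target's spatial layout.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iters=200):
    """Entropic approximation of the OT plan (the paper reportedly
    uses an exact network-simplex solver instead)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # plan T, shape (N_ref, N_tgt)

def identity_transport(S_id, S_tgt, w_id, w_tgt):
    """Transport masked reference subject features onto the target layout.
    S_id: (N_ref, d), S_tgt: (N_tgt, d) masked features;
    w_id, w_tgt: attention-based importance scores (names are ours)."""
    # cost matrix: 1 - cosine similarity between feature vectors
    S_id_n = S_id / np.linalg.norm(S_id, axis=1, keepdims=True)
    S_tgt_n = S_tgt / np.linalg.norm(S_tgt, axis=1, keepdims=True)
    C = 1.0 - S_id_n @ S_tgt_n.T
    # probability masses: softmax-normalized token importance
    a = np.exp(w_id) / np.exp(w_id).sum()
    b = np.exp(w_tgt) / np.exp(w_tgt).sum()
    T = sinkhorn(a, b, C)
    # barycentric projection: each target token becomes a weighted
    # mixture of reference identity features, preserving target layout
    return (T / T.sum(axis=0, keepdims=True)).T @ S_id
```

The transported features (shape `(N_tgt, d)`) would then be recombined with the unmasked background features before continuing the denoising loop.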
The second stage, Identity Refinement (IR), runs in the later denoising steps (the remaining 40 timesteps). Here, high‑frequency details are refined. IR adopts a cross‑image attention mechanism similar to prior training‑free methods, but it restricts keys and values to the subject features already aligned by IT. This selective attention amplifies salient identity attributes (eyes, mouth, hair) without re‑mixing pose information, thereby strengthening consistency while leaving pose diversity untouched.
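The key idea of IR, restricting cross-image attention so that keys and values come only from subject tokens, can be illustrated with a minimal single-head attention sketch. Function and argument names here are ours, not the paper's API, and the boolean `subject_mask` stands in for the IT-aligned subject region.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def subject_restricted_attention(Q_tgt, K_ref, V_ref, subject_mask, scale=None):
    """Cross-image attention with keys/values limited to reference
    tokens inside the subject mask (illustrative sketch).
    Q_tgt: (N_tgt, d) target queries; K_ref, V_ref: (N_ref, d);
    subject_mask: (N_ref,) boolean over reference tokens."""
    K = K_ref[subject_mask]   # drop non-subject keys ...
    V = V_ref[subject_mask]   # ... and values, so pose/background
                              # information cannot be re-mixed in
    scale = scale or 1.0 / np.sqrt(Q_tgt.shape[-1])
    attn = softmax(Q_tgt @ K.T * scale, axis=-1)
    return attn @ V           # (N_tgt, d): identity-only refinement
```

Because the attention rows are convex combinations of subject values only, the target image can borrow high-frequency identity detail without importing the reference pose.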
Experiments are conducted on the ConsiStory+ benchmark, measuring three metrics: Subject Consistency (CLIP‑based image similarity), Pose Diversity (SMPL‑based pose embedding distance), and Prompt Fidelity (text‑image alignment score). CoDi outperforms ConsiStory and StoryDiffusion on all three fronts: it raises consistency scores by roughly 8‑12 %, improves pose diversity by 15‑20 %, and maintains prompt fidelity comparable to the vanilla SDXL baseline. Qualitative examples show the same fairy character rendered in a café, a magical grove, and a sun‑lit window, each with distinct body language yet unmistakably the same face and color palette.
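For intuition, the two embedding-based metrics above can be sketched generically: consistency as the mean pairwise cosine similarity of CLIP image embeddings, and pose diversity as the mean pairwise distance between SMPL pose embeddings. This is an assumption about the general shape of such metrics, not the benchmark's exact implementation.

```python
import numpy as np

def mean_pairwise_cosine(emb):
    """Mean cosine similarity over all distinct pairs of embeddings;
    higher = more consistent (e.g., CLIP image features)."""
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    S = X @ X.T
    iu = np.triu_indices(len(emb), k=1)  # upper triangle, no diagonal
    return S[iu].mean()

def mean_pairwise_distance(pose_emb):
    """Mean Euclidean distance over all distinct pairs of pose
    embeddings; higher = more diverse poses."""
    D = np.linalg.norm(pose_emb[:, None] - pose_emb[None, :], axis=-1)
    iu = np.triu_indices(len(pose_emb), k=1)
    return D[iu].mean()
```

Under this reading, a good SCG method should score high on both: identical images would maximize consistency (cosine 1.0) but collapse diversity to 0.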
The paper acknowledges limitations. Computing the OT plan scales quadratically with the number of masked tokens, which can become a bottleneck for high‑resolution images or large batches. Mask extraction relies on cross‑attention and may be noisy for thin or heavily occluded subjects. Moreover, the current implementation is tightly coupled with the SDXL diffusion architecture; extending to other models would require adaptation of the mask‑extraction and attention‑injection pipelines.
In summary, CoDi demonstrates that a principled two‑stage approach—early‑stage optimal‑transport‑based identity transport followed by late‑stage selective attention refinement—can achieve strong subject consistency without compromising pose and layout diversity, all without any model fine‑tuning. This work opens a promising direction for scalable, expressive visual storytelling and character‑centric content creation in diffusion‑based generative AI.