Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration


In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask whether samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra- and inter-image correspondence. We observe a clear scaling effect: larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a quantitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.


💡 Research Summary

Group Diffusion introduces a novel paradigm for diffusion‑based image synthesis: instead of generating each sample independently at inference time, multiple samples sharing the same conditioning are denoised jointly, allowing them to exchange information through a cross‑sample attention mechanism. The method builds on the Diffusion Transformer (DiT) architecture and requires only a simple reshaping of token tensors. Specifically, for a group of N images, the patch embeddings of shape (N, L, C) are reshaped to (1, N·L, C) before the multi‑head self‑attention (MHSA) block and reshaped back afterwards. This enables every patch to attend not only to other patches within its own image but also to patches from the other images in the group. To keep the model aware of image boundaries, a learnable “sample embedding” is added to all patches belonging to the same image.
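The reshaping described above is a small tensor operation. The sketch below is a minimal, hypothetical reconstruction in numpy: the function names and the per-image "sample embedding" handling are illustrative, not the paper's actual code, but the shape transformation (N, L, C) → (1, N·L, C) and back matches the description.

```python
import numpy as np

def group_attention_tokens(x, group_size):
    """Merge the token sequences of a group so a standard self-attention
    block attends across images, not just within one.

    x: (N, L, C) patch embeddings for N images with L tokens each.
    Returns (N // group_size, group_size * L, C); with N == group_size
    this is the (1, N*L, C) tensor fed to MHSA.
    """
    n, l, c = x.shape
    assert n % group_size == 0, "batch must be a multiple of the group size"
    return x.reshape(n // group_size, group_size * l, c)

def ungroup_attention_tokens(x, group_size, tokens_per_image):
    """Inverse reshape applied after the MHSA block."""
    b, gl, c = x.shape
    assert gl == group_size * tokens_per_image
    return x.reshape(b * group_size, tokens_per_image, c)

def add_sample_embedding(x, sample_emb):
    """Add one learnable vector per image slot to all of that image's
    tokens, so the model can tell group members apart.

    x: (N, L, C), sample_emb: (N, C)."""
    return x + sample_emb[:, None, :]
```

In a real DiT these reshapes would wrap the existing multi-head self-attention call, and `sample_emb` would be a trained parameter rather than a plain array.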

During training, groups are constructed by querying a large image database for samples that are semantically or visually similar to the target image. Similarity is measured with cosine similarity on pre‑trained CLIP or DINO embeddings, and a threshold τ_img (≈0.7 in the experiments) determines membership. Each image in the group receives its own diffusion timestep t, but the variance of timesteps within a group is constrained (σ_tv) to avoid excessive temporal mismatch. The loss is the standard diffusion denoising loss averaged over all group members (L_group), encouraging the network to predict the correct noise for each image while benefitting from the additional context provided by its peers.
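The group-construction step can be sketched as a cosine-similarity lookup. This is a simplified, hypothetical version: in the paper the embeddings come from pre-trained CLIP or DINO encoders and the database is large, whereas here the embeddings are plain vectors and `max_size` is an assumed cap on group size.

```python
import numpy as np

def build_group(query_emb, db_embs, tau_img=0.7, max_size=8):
    """Return indices of database images whose cosine similarity to the
    query embedding is at least tau_img, most similar first.

    query_emb: (C,) embedding of the target image.
    db_embs:   (M, C) embeddings of candidate images.
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = d @ q                       # cosine similarity to the query
    order = np.argsort(-sims)          # most similar first
    return [int(i) for i in order if sims[i] >= tau_img][:max_size]
```

The timestep-variance constraint and the averaged group loss L_group would sit on top of this: each selected image gets its own t, rejected or re-sampled if the within-group timestep variance exceeds σ_tv.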

Two inference variants are explored. GroupDiff‑f applies group attention to both the conditional and unconditional denoisers and then combines them with the usual classifier‑free guidance (CFG) formula. GroupDiff‑l, in contrast, applies group attention only to the unconditional denoiser; the conditional denoiser remains unchanged and operates on each image separately. GroupDiff‑l is computationally cheaper because only a small fraction (10%) of the training steps involve the large‑group unconditional model, while the remaining 90% follow the standard single‑image pipeline. Empirically, GroupDiff‑l achieves almost the same quality boost as GroupDiff‑f with far less overhead, making it the preferred practical choice.
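The combination step in GroupDiff‑l is the standard CFG formula; only where the unconditional prediction comes from differs. A minimal sketch, assuming the group-attention denoiser produces `eps_uncond_group` and the unchanged per-image model produces `eps_cond` (function and argument names are mine, not the paper's):

```python
import numpy as np

def cfg_groupdiff_l(eps_uncond_group, eps_cond, w):
    """Classifier-free guidance where the unconditional noise prediction
    comes from the group-attention denoiser and the conditional one from
    the standard single-image denoiser:

        eps = eps_uncond + w * (eps_cond - eps_uncond)

    w = 0 recovers the unconditional prediction, w = 1 the conditional one.
    """
    return eps_uncond_group + w * (eps_cond - eps_uncond_group)
```

GroupDiff‑f would use the same formula with both predictions taken from group-attention denoisers.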

Experiments are conducted on ImageNet‑256 using DiT‑XL/2 and SiT as baselines. The global batch size is kept at 256, and group sizes N = 1, 2, 4, 8 are evaluated. Across all metrics—FID, Inception Score, Precision‑Recall—the larger the group, the better the performance. For example, with N = 8 the FID drops from ~13.5 (baseline) to ~2.1, a relative improvement of up to 32.2% when fine‑tuning a pre‑trained checkpoint. The authors also introduce a quantitative “Cross‑Sample Attention Strength” (CSA) metric, computed as the average attention weight assigned to tokens from other images within a group. CSA correlates strongly (Pearson r ≈ ‑0.87) with FID, confirming that the quality gains stem directly from effective inter‑image information exchange.
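Given a softmaxed attention map over a group's concatenated tokens, the CSA metric as described reduces to masking out same-image entries and averaging. The following is a hypothetical reconstruction consistent with that description, not the authors' implementation:

```python
import numpy as np

def cross_sample_attention_strength(attn, group_size, tokens_per_image):
    """Average attention mass each query token assigns to tokens that
    belong to a *different* image in the group.

    attn: (H, T, T) softmaxed attention weights for one group, where
          T = group_size * tokens_per_image (rows sum to 1).
    """
    # owner[i] = index of the image that token i belongs to
    owner = np.repeat(np.arange(group_size), tokens_per_image)
    # cross_mask[i, j] is True where query i and key j are from different images
    cross_mask = owner[:, None] != owner[None, :]
    # sum cross-image mass per query token, then average over tokens and heads
    return float((attn * cross_mask).sum(axis=-1).mean())
```

With uniform attention over a group of two images, exactly half the mass lands on the other image, so CSA = 0.5; a model that ignores its peers would score near 0.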

A thorough ablation study examines (1) which models (conditional, unconditional, or both) should receive group attention, (2) the impact of different query methods (class label, CLIP‑L, CLIP‑B, DINOv2, SigLIP, etc.), and (3) the effect of varying the CFG scale. Results show that applying group attention only to the unconditional model (GroupDiff‑l) consistently yields the best trade‑off between quality and compute. Moreover, using CLIP‑L embeddings for group construction provides the highest CSA and lowest FID among the tested similarity measures.

Limitations are acknowledged. Cross‑sample attention increases memory consumption proportionally to the product of batch size and group size, which may become a bottleneck for very large models. Additionally, if the queried group contains dissimilar images, attention noise can degrade performance; thus, robust group selection or dynamic grouping strategies are needed for broader applicability. The paper focuses on class‑conditional generation, leaving extensions to more complex conditioning (e.g., textual prompts with style modifiers, multimodal inputs) for future work.

In summary, Group Diffusion demonstrates that a modest architectural tweak—sharing the attention matrix across samples—can unlock a powerful collaborative generation mechanism. By allowing images to help each other during denoising, the method achieves substantial improvements in sample quality without requiring additional training data or major changes to the diffusion backbone. This work opens a new research direction: exploiting batch‑level collaboration at inference time for diffusion models, with potential extensions to video synthesis, multi‑modal generation, and beyond.

