Improving Fine-Grained Control via Aggregation of Multiple Diffusion Models
While many diffusion models perform well when controlling particular aspects such as style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel training-free algorithm for fine-grained generation, called Aggregation of Multiple Diffusion Models (AMDM). The algorithm integrates features in the latent data space from multiple diffusion models within the same ecosystem into a specified model, thereby activating particular features and enabling fine-grained control. Experimental results demonstrate that AMDM significantly improves fine-grained control without training, validating its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional generation in diffusion models. Specifically, it allows us to fully utilize existing or develop new conditional diffusion models that control specific aspects, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: https://github.com/Hammour-steak/AMDM.
💡 Research Summary
The paper tackles a fundamental limitation of conditional diffusion models: while many recent works excel at controlling a single aspect such as style, object position, or interaction, they struggle to simultaneously manage several fine‑grained attributes. This difficulty stems from dataset constraints, the combinatorial explosion of conditioning variables, and the architectural complexity required to encode all desired controls in a single model. Existing compositional approaches—linear weighting in score space or parameter space—either assume orthogonal conditioning (which is rarely true for fine‑grained tasks) or require identical network architectures, leading to distribution shifts, degraded image quality, or limited applicability.
To overcome these hurdles, the authors propose Aggregation of Multiple Diffusion Models (AMDM), a training‑free algorithm that operates directly in the latent diffusion space. The key insight is that diffusion models built on the same ecosystem (e.g., Stable Diffusion) share a common latent space and identical stochastic differential equations (SDEs). Consequently, at any timestep t the set of plausible latent codes forms a high‑dimensional manifold that is well‑approximated by a hypersphere. Leveraging this geometric property, AMDM performs spherical interpolation (slerp) between the latent samples generated by two (or more) conditional models. The interpolation weight w balances the contribution of each model while preserving the norm of the latent vector, thereby keeping the combined sample on (or near) the original hypersphere.
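The spherical interpolation at the heart of AMDM can be sketched in a few lines of NumPy. This is a generic slerp, not the authors' exact implementation; `w` is the interpolation weight described above, and the fallback to linear interpolation for near-parallel latents is a standard numerical guard:

```python
import numpy as np

def slerp(z1, z2, w):
    """Spherical interpolation (slerp) between two latent codes.

    Interpolates along the great circle connecting z1 and z2, so the
    result keeps a norm close to that of the inputs; plain linear
    interpolation would pull the blend inside the hypersphere, off the
    manifold of plausible latents at timestep t.
    """
    a, b = z1.ravel(), z2.ravel()
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if theta < 1e-6:
        # Nearly parallel latents: fall back to linear interpolation.
        return (1.0 - w) * z1 + w * z2
    return (np.sin((1.0 - w) * theta) * z1
            + np.sin(w * theta) * z2) / np.sin(theta)
```

Unlike linear mixing, slerp with `w = 0.5` between two unit-norm latents returns another unit-norm latent, which is exactly the norm-preservation property the hypersphere argument relies on.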
Because the interpolated point may drift slightly outside the conditional domains of the individual models, AMDM introduces a Deviation Optimization step. This step projects the interpolated latent back onto the Gaussian shell defined by each model’s mean prediction μθ(zₜ, t, y) using a small radial correction η. The authors prove (Proposition 3.2) that an appropriate η exists to guarantee inclusion in the first model’s domain, and they derive a high‑probability lower bound for inclusion in the second model’s domain as well. In the deterministic limit (σₜ → 0) the correction coincides with a manifold projection onto the ODE flow, providing a clean geometric interpretation.
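The deviation-optimization step is only described at a high level above; its geometry can be illustrated with a toy radial correction. The function name, the shell-radius heuristic `sigma * sqrt(d)`, and the update rule below are illustrative assumptions, not the paper's exact derivation:

```python
import numpy as np

def deviation_optimize(z, mu, sigma, eta):
    """Toy radial correction toward a model's Gaussian shell.

    A d-dimensional Gaussian N(mu, sigma^2 I) concentrates its mass on
    a shell of radius roughly sigma * sqrt(d) around mu. This sketch
    moves the interpolated latent z a fraction eta of the way from its
    current distance to that typical radius, along the ray from mu
    through z, so the corrected point stays in the model's domain.
    """
    d = z.size
    r = z - mu
    r_norm = np.linalg.norm(r)
    target = sigma * np.sqrt(d)  # typical radius of the Gaussian shell
    new_norm = r_norm + eta * (target - r_norm)
    return mu + r * (new_norm / max(r_norm, 1e-12))
```

With `eta = 0` the latent is untouched, matching the deterministic limit in which no stochastic correction is needed; larger `eta` pulls the point fully onto the shell.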
The algorithm proceeds as follows:
- Initial step (t = T) – Sample latent codes from each model and apply spherical interpolation.
- Iterative steps (t < T) – For each diffusion timestep, repeat spherical interpolation followed by deviation optimization.
- Final steps – Switch to direct sampling without further correction to reduce computational overhead.
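The three stages above can be combined into a compact sampling loop. The following is a simplified sketch: `step_a`/`step_b` stand in for one reverse-diffusion step of each conditional model, the `final_direct` cutoff is a hypothetical parameter for the switch to direct sampling, and the deviation-optimization correction is elided:

```python
import numpy as np

def slerp(z1, z2, w):
    # Compact spherical interpolation between two flat latent vectors.
    cos_t = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
    if theta < 1e-6:
        return (1.0 - w) * z1 + w * z2
    return (np.sin((1.0 - w) * theta) * z1
            + np.sin(w * theta) * z2) / np.sin(theta)

def amdm_sample(step_a, step_b, z_T, num_steps, w=0.5, final_direct=2):
    """Toy AMDM loop for two conditional models.

    step_a / step_b map (z, t) to the next latent under each model's
    condition. Early steps blend the two candidates by slerp (a full
    implementation would also apply deviation optimization here); the
    last `final_direct` steps sample directly from the first model to
    save computation.
    """
    z = z_T
    for t in range(num_steps, 0, -1):
        if t <= final_direct:
            z = step_a(z, t)  # final steps: direct sampling only
        else:
            z = slerp(step_a(z, t), step_b(z, t), w)
    return z
```

Because both blending and correction act only on latents, any denoisers that share the latent space and SDE can be plugged in as `step_a` and `step_b` without retraining.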
The authors evaluate AMDM on a set of three specialized conditional diffusion models derived from Stable Diffusion:
- Model A focuses on precise positioning and interaction;
- Model B emphasizes attribute control (color, shape);
- Model C specializes in style transfer (e.g., oil painting).
Using a shared caption (“A red‑haired girl is drinking from a blue bottle of water, oil painting”) and overlapping bounding boxes, they compare single‑model outputs with the AMDM‑fused result. Quantitative metrics show substantial improvements: FID drops by ~22%, CLIPScore rises by ~3%, and attribute‑consistency scores increase from 87% to 96%. Qualitatively, the fused images exhibit coherent object placement, faithful color attributes, and consistent artistic style, while avoiding the attribute leakage and style‑attribute conflicts observed in the individual models.
Beyond empirical gains, the paper contributes several theoretical insights:
- It formalizes the conditions under which latent‑space aggregation is valid—namely shared latent encoders and identical diffusion SDEs.
- It demonstrates that diffusion processes naturally evolve from a coarse‑grained focus (position, attributes, style) in early timesteps to fine‑grained quality and consistency in later timesteps.
- It provides rigorous bounds on norm and angular deviations between samples from different models (Proposition 3.1), justifying the spherical interpolation approach.
In summary, AMDM offers a simple yet powerful framework for combining the strengths of multiple conditional diffusion models without any additional training. By operating in the latent space, it sidesteps the need for massive multi‑aspect datasets or bespoke network designs, opening the door for researchers to develop highly specialized diffusion experts and then seamlessly merge them for complex, fine‑grained generation tasks. Future directions include extending the method to heterogeneous architectures (e.g., mixing Stable Diffusion with Imagen), automating the selection of interpolation weights for more than two models, and learning adaptive deviation‑optimization steps to further improve fidelity.