DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation
Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent line of work integrates an image prompt adapter to enable zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: (1) the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF), (2) the difficulty of retaining fine-grained concept details from reference images, and (3) the limited scalability to multi-subject personalization. To tackle these challenges, we present the Dynamic Image Prompt Adapter (DynaIP), a plugin that enhances the fine-grained concept fidelity, CP-PF balance, and subject scalability of state-of-the-art T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibits a decoupled learning behavior when reference image features are injected into its dual branches via cross-attention. Based on this, we design a Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly improving the CP-PF balance and further bolstering the scalability of multi-subject composition. Moreover, we identify the visual encoder as a key factor affecting fine-grained CP and show that the hierarchical features of the commonly used CLIP encoder capture visual information at diverse granularity levels. We therefore introduce a Hierarchical Mixture-of-Experts Feature Fusion Module that fully leverages these hierarchical features, markedly elevating fine-grained concept fidelity while also providing flexible control over visual granularity. Extensive experiments across single- and multi-subject PT2I tasks verify that DynaIP outperforms existing approaches, marking a notable advancement in the field of PT2I generation.
💡 Research Summary
This paper introduces DynaIP (Dynamic Image Prompt Adapter), a novel plugin adapter designed to address fundamental limitations in zero-shot Personalized Text-to-Image (PT2I) generation. PT2I aims to create customized images based on user-provided reference images and a text prompt, without requiring test-time fine-tuning. Existing adapter-based methods struggle with three core challenges: 1) the elusive trade-off between preserving the concept from the reference image (Concept Preservation - CP) and faithfully following the text prompt (Prompt Following - PF), 2) the loss of fine-grained details from the reference image, and 3) poor scalability from single-subject to multi-subject personalization.
DynaIP tackles these issues through two key technical innovations built upon state-of-the-art Multimodal Diffusion Transformer (MM-DiT) architectures. The first innovation stems from a crucial observation about the MM-DiT’s inherent learning behavior. The authors discovered that when reference image features are injected into an MM-DiT block via cross-attention, a “decoupling learning behavior” emerges: the noisy image branch selectively learns concept-specific information (e.g., object identity, unique appearance, textures), while the text branch learns concept-agnostic information (e.g., pose, perspective, lighting). Leveraging this insight, they designed a Dynamic Decoupling Strategy (DDS). During training, reference features interact with both branches, allowing each to specialize. During inference, cross-attention is performed exclusively with the noisy image branch, dynamically stripping away concept-agnostic information. This strategy significantly improves the CP-PF balance and, by eliminating conflicting agnostic information (like incompatible poses), greatly enhances scalability for composing multiple subjects.
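The train/inference asymmetry of the Dynamic Decoupling Strategy can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`dds_inject`, `cross_attn`) and the single-head, projection-free attention are simplifying assumptions. The point is the control flow: during training, reference features attend into both branches; at inference, only the noisy-image branch is updated.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(queries, ref_feats):
    # Scaled dot-product cross-attention (single head, no learned
    # projections, for illustration): queries come from a DiT branch,
    # keys/values come from the reference image features.
    d = ref_feats.shape[-1]
    scores = queries @ ref_feats.T / np.sqrt(d)
    return softmax(scores) @ ref_feats

def dds_inject(img_tokens, txt_tokens, ref_feats, training):
    """Dynamic Decoupling Strategy (simplified sketch).
    Training: both branches receive reference features, so the image
    branch can specialize in concept-specific information and the text
    branch in concept-agnostic information (pose, lighting, etc.).
    Inference: only the image branch is updated, dynamically stripping
    the concept-agnostic pathway."""
    img_out = img_tokens + cross_attn(img_tokens, ref_feats)
    if training:
        txt_out = txt_tokens + cross_attn(txt_tokens, ref_feats)
    else:
        txt_out = txt_tokens  # text branch untouched at inference
    return img_out, txt_out
```

In a real MM-DiT block the injection would use the block's multi-head attention with learned key/value projections; the residual-add structure above is only a stand-in for that.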
The second innovation addresses fine-grained detail preservation. The paper identifies the visual encoder as a critical bottleneck and reveals that the hierarchical features of the commonly used CLIP encoder capture visual information at different granularity levels (e.g., fine textures in shallow layers, shapes in middle layers, semantics in deep layers). To fully exploit this, the authors propose a Hierarchical Mixture-of-Experts Feature Fusion Module (HMoE-FFM). This module processes shallow, middle, and deep CLIP features through separate expert networks and dynamically integrates them using a routing mechanism that adapts to each input reference image. This approach remarkably elevates the fidelity of fine-grained concept details. Furthermore, it allows flexible user control over the visual granularity of preservation by manually adjusting the fusion coefficients of the expert outputs.
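The fusion scheme described above can be sketched as follows. This is a hedged illustration, not the paper's architecture: the expert is reduced to a single ReLU layer, the router to one linear map over pooled features, and the names (`hmoe_fuse`, `manual_coeffs`) are invented for the example. It shows the two operating modes mentioned in the summary: routing weights predicted per reference image, or user-supplied coefficients for manual granularity control.

```python
import numpy as np

def expert(feats, weight):
    # One expert per CLIP feature level; a single ReLU layer stands in
    # for a full expert network.
    return np.maximum(feats @ weight, 0.0)

def hmoe_fuse(feats_by_level, expert_weights, router_weight,
              manual_coeffs=None):
    """Hierarchical MoE feature fusion (simplified sketch).
    feats_by_level: [shallow, middle, deep] CLIP token features,
    each of shape (num_tokens, dim). A router predicts per-image
    fusion coefficients from pooled features, unless the user
    overrides them with manual_coeffs."""
    outs = [expert(f, w) for f, w in zip(feats_by_level, expert_weights)]
    if manual_coeffs is None:
        # Router: mean-pool each level, concatenate, project, softmax.
        pooled = np.concatenate([f.mean(axis=0) for f in feats_by_level])
        logits = pooled @ router_weight
        e = np.exp(logits - logits.max())
        coeffs = e / e.sum()
    else:
        coeffs = np.asarray(manual_coeffs, dtype=float)
    return sum(c * o for c, o in zip(coeffs, outs))
```

Passing, say, `manual_coeffs=[0.7, 0.2, 0.1]` would bias fusion toward the shallow (fine-texture) expert, which is the kind of granularity control the summary describes.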
Extensive experiments demonstrate that DynaIP outperforms current state-of-the-art adapter methods in both single-subject and multi-subject PT2I tasks, achieving superior fine-grained concept fidelity and a better trade-off between CP and PF. Notably, DynaIP exhibits exceptional scalability: trained solely on single-subject data, it can be directly extended to multi-subject generation via mask-guided feature injection without any additional training on multi-subject data. The adapter is also natively compatible with popular base-model extensions such as ControlNet, LoRA, and Region Attention, unlocking diverse application scenarios. This work represents a significant advancement towards efficient, high-fidelity, and scalable zero-shot personalized image generation.
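The mask-guided feature injection used for multi-subject extension can be sketched in a few lines. Again this is an assumption-laden illustration, not the released code: `masked_inject` and the binary per-subject masks are hypothetical names, and attention is single-head with no learned projections. The idea is simply that each subject's reference features only update image tokens inside that subject's spatial mask, so subjects do not contaminate each other's regions.

```python
import numpy as np

def masked_inject(img_tokens, subject_feats, masks):
    """Mask-guided injection sketch for multi-subject generation.
    img_tokens:    (num_tokens, dim) noisy-image branch tokens.
    subject_feats: list of (k_i, dim) reference features, one per subject.
    masks:         list of (num_tokens,) binary masks, one per subject,
                   marking which image tokens each subject may update."""
    out = img_tokens.copy()
    d = img_tokens.shape[-1]
    for feats, mask in zip(subject_feats, masks):
        scores = img_tokens @ feats.T / np.sqrt(d)
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = e / e.sum(axis=-1, keepdims=True)
        # Zero out the update everywhere outside this subject's mask.
        out += mask[:, None] * (attn @ feats)
    return out
```

Because the update is gated by the mask, tokens outside every subject's region pass through unchanged, which is what makes a single-subject-trained adapter composable across subjects at inference.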