Borrowing from anything: A generalizable framework for reference-guided instance editing

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original paper on arXiv.

Reference-guided instance editing is fundamentally limited by semantic entanglement, where a reference’s intrinsic appearance is intertwined with its extrinsic attributes. The key challenge lies in disentangling what information should be borrowed from the reference, and determining how to apply it appropriately to the target. To tackle this challenge, we propose GENIE, a Generalizable Instance Editing framework capable of achieving explicit disentanglement. GENIE first corrects spatial misalignments with a Spatial Alignment Module (SAM). Then, an Adaptive Residual Scaling Module (ARSM) learns what to borrow by amplifying salient intrinsic cues while suppressing extrinsic attributes, and a Progressive Attention Fusion (PAF) mechanism learns how to render this appearance onto the target while preserving its structure. Extensive experiments on the challenging AnyInsertion dataset demonstrate that GENIE achieves state-of-the-art fidelity and robustness, setting a new standard for disentanglement-based instance editing.


💡 Research Summary

Reference‑guided instance editing aims to transplant the intrinsic appearance (texture, pattern, material) of a reference object onto a target image while preserving the target’s geometry and context. Existing diffusion‑based approaches struggle because the reference’s intrinsic appearance is entangled with extrinsic attributes such as pose, scale, illumination, and background. This entanglement leads to artifacts like identity leakage, pose residue, or unnatural blending.
The paper introduces GENIE (Generalizable Instance Editing), a three‑module architecture that explicitly disentangles “what” to borrow from the reference and “how” to apply it to the target. The overall system is built on a dual‑U‑Net latent diffusion backbone: a reference branch extracts and purifies appearance features, while a target branch performs iterative denoising conditioned on the purified features.

  1. Spatial Alignment Module (SAM) – A lightweight localization network predicts a 2‑D affine transformation for the reference feature map. The transformation is applied via a differentiable warping operation, normalizing pose, scale, and translation before any semantic processing. By aligning the reference in latent space, SAM reduces the burden on later modules and prevents geometric mis‑matches from contaminating the appearance signal.
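The differentiable warping step can be illustrated with a small NumPy sketch in the style of a spatial transformer: a predicted 2-D affine matrix is turned into a sampling grid, and the reference feature map is bilinearly resampled at those locations. The localization network that predicts the affine parameters is omitted, and all names here are illustrative rather than taken from the paper.

```python
import numpy as np

def affine_grid(theta, H, W):
    """Build an (H, W, 2) sampling grid from a 2x3 affine matrix theta,
    using normalized coordinates in [-1, 1] (spatial-transformer style)."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # homogeneous (x, y, 1)
    return coords @ theta.T  # per-pixel source (x, y) locations

def warp(feat, theta):
    """Bilinearly sample a (C, H, W) feature map at the grid given by theta."""
    C, H, W = feat.shape
    grid = affine_grid(theta, H, W)
    x = (grid[..., 0] + 1) * (W - 1) / 2   # back to pixel coordinates
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - np.floor(x), y - np.floor(y)
    # Weighted sum of the four neighboring feature values per output pixel.
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0] + (1 - wy) * wx * feat[:, y0, x1]
            + wy * (1 - wx) * feat[:, y1, x0] + wy * wx * feat[:, y1, x1])

# Sanity check: the identity transform leaves the reference feature map unchanged.
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
F_r = np.random.default_rng(0).standard_normal((4, 8, 8))
warped = warp(F_r, identity)
```

Because the grid construction and bilinear weights are smooth in theta, gradients can flow back into the localization network in the real, autograd-based implementation.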
  2. Adaptive Residual Scaling Module (ARSM) – The aligned reference feature F_r and the current target feature F_t are concatenated channel‑wise and fed into a small convolutional network f_scale, which outputs a spatial scaling map α ∈ (‑1, 1) after a tanh activation. The final purified reference feature is computed as F′_r = (1 + α) ⊙ F_r. Positive α values amplify intrinsic cues (texture, color), while negative values suppress extrinsic cues (pose, lighting). This bidirectional scaling provides a continuous control knob that can enhance or attenuate any spatial region of the reference feature, effectively “cleaning” the appearance before it reaches the target branch.
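The scaling rule F′_r = (1 + α) ⊙ F_r with α = tanh(f_scale([F_r; F_t])) is simple enough to sketch directly. In this minimal NumPy version, f_scale is reduced to a single 1×1 convolution (a weight vector over the concatenated channels) producing a one-channel spatial map; the paper's f_scale is a small convolutional network, so this is an assumption for illustration only.

```python
import numpy as np

def arsm(F_r, F_t, w):
    """Adaptive residual scaling: F'_r = (1 + alpha) * F_r,
    with alpha = tanh(f_scale(concat(F_r, F_t))).
    Here f_scale is a single 1x1 conv with weights w, shape (2C,)."""
    x = np.concatenate([F_r, F_t], axis=0)          # channel-wise concat, (2C, H, W)
    alpha = np.tanh(np.einsum("c,chw->hw", w, x))   # spatial scaling map in (-1, 1)
    return (1.0 + alpha) * F_r, alpha

rng = np.random.default_rng(0)
F_r, F_t = rng.standard_normal((2, 4, 8, 8))        # aligned reference / target features
F_r_pure, alpha = arsm(F_r, F_t, rng.standard_normal(8))
```

Note the bidirectional behavior: where α > 0 the reference feature is amplified (up to 2×), and where α < 0 it is attenuated (down to 0), matching the described enhance/suppress control knob.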
  3. Progressive Attention Fusion (PAF) – Fusion proceeds in three stages:
    • Structural Attention focuses solely on the target’s intermediate feature F_t, using self‑attention to reinforce spatial layout and provide a stable geometric foundation.
    • Synergistic Attention concatenates F_t and the purified reference F′_r, then applies self‑attention on the combined tensor to discover latent relationships between structure and appearance.
    • Appearance Attention performs cross‑attention where queries come from the structural stream and keys/values from the hybrid tensor, allowing the geometry to selectively retrieve the most relevant appearance textures.
      The three attention outputs are combined with learnable scalar weights (β, γ, λ) to produce the final fused feature F_out that is fed into the target U‑Net. This progressive strategy ensures that structure is first stabilized, then associations are explored, and finally appearance is rendered, yielding high‑fidelity, semantically consistent edits.
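The three-stage fusion above can be sketched over flattened token matrices. This is a minimal single-head NumPy version under simplifying assumptions: features are (N, d) token matrices, query/key/value projections are omitted, and β, γ, λ are passed as plain scalars; the real module operates inside the U-Net with learned projections.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    """Scaled dot-product attention over token matrices (N, d)."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def paf(F_t, F_r_pure, beta, gamma, lam):
    """Progressive attention fusion over flattened (N, d) features."""
    # Stage 1 - structural attention: self-attention on the target alone.
    A_struct = attend(F_t, F_t, F_t)
    # Stage 2 - synergistic attention: self-attention on the hybrid tensor.
    H = np.concatenate([F_t, F_r_pure], axis=0)
    A_syn = attend(H, H, H)[: len(F_t)]          # keep the target-token rows
    # Stage 3 - appearance attention: structural queries, hybrid keys/values.
    A_app = attend(A_struct, H, H)
    # Learnable scalar weights combine the three streams.
    return beta * A_struct + gamma * A_syn + lam * A_app

rng = np.random.default_rng(0)
F_t, F_r_pure = rng.standard_normal((2, 16, 32))
F_out = paf(F_t, F_r_pure, 1.0, 0.5, 0.5)
```

The ordering is the point: structure is computed first and then reused as the query stream, so appearance retrieval is always conditioned on a stabilized geometric layout.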
Training follows the standard diffusion noise‑prediction loss, with the target input composed of the noisy latent z_t, a mask embedding ϕ(M), and the unedited region’s latent. An IP‑Adapter injects pre‑trained CLIP embeddings, further aligning the generation with semantic cues.
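A heavily simplified sketch of one training step, under the assumption that the three conditioning signals are concatenated channel-wise before entering the target branch (the reference branch, IP-Adapter, and CLIP conditioning are omitted, and `denoiser` stands in for the target U-Net):

```python
import numpy as np

def diffusion_training_step(z_t, mask_emb, bg_latent, eps, denoiser):
    """One epsilon-prediction step: the target branch sees the noisy latent,
    a mask embedding, and the unedited region's latent, stacked channel-wise."""
    x = np.concatenate([z_t, mask_emb, bg_latent], axis=0)  # (4+1+4, H, W)
    eps_pred = denoiser(x)
    return np.mean((eps_pred - eps) ** 2)   # standard noise-prediction MSE

rng = np.random.default_rng(0)
z_t, eps = rng.standard_normal((2, 4, 8, 8))
mask_emb = rng.standard_normal((1, 8, 8))
bg_latent = rng.standard_normal((4, 8, 8))
oracle = lambda x: eps                      # toy denoiser that recovers the noise
loss = diffusion_training_step(z_t, mask_emb, bg_latent, eps, oracle)
```

The loss is zero only when the predicted noise matches the sampled noise exactly, which is what the denoising objective drives the target U-Net toward.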
Experimental Results – The authors evaluate on the AnyInsertion benchmark, covering three categories: Object, Garment, and Person. Compared to strong baselines (AnyDoor, Paint‑by‑Example, MimicBrush, InsertAnything, OOTDiffusion), GENIE achieves the highest scores across PSNR, SSIM, LPIPS, CLIP similarity, DINO similarity, DreamSim, and FID. Notably, PSNR improves by roughly 2 dB on Objects and Garments, while FID drops by 7–16 points, indicating both better reconstruction accuracy and perceptual realism. Qualitative examples show precise texture transfer, faithful clothing folds, and seamless integration into complex backgrounds without obvious seams.
Ablation Studies – Removing SAM dramatically worsens FID on the Person set (93 → 124), confirming the importance of spatial normalization. Excluding PAF reduces PSNR on Garments (23.80 → 22.21), highlighting the role of progressive attention in preserving detail. Omitting ARSM changes PSNR only modestly but causes a noticeable FID increase (68.93 → 70.12) and a drop in CLIP similarity, demonstrating its effect on semantic fidelity. Training-strategy experiments reveal that freezing the reference U‑Net and the IP‑Adapter while fine‑tuning only the target U‑Net yields the best performance, preserving the strong feature extraction of pre‑trained modules while allowing the target branch to specialize in fusion.
Conclusion – GENIE provides a principled solution to the semantic entanglement problem in reference‑guided editing. By decomposing the task into spatial alignment, adaptive residual scaling, and progressive attention fusion, it achieves explicit “what” and “how” disentanglement, leading to state‑of‑the‑art results on a challenging benchmark. The code and pretrained models are publicly released, offering a solid baseline for future research in generalized, high‑quality instance editing.
