FusionEdit: Semantic Fusion and Attention Modulation for Training-Free Image Editing
Text-guided image editing aims to modify specific regions according to the target prompt while preserving the identity of the source image. Recent methods exploit explicit binary masks to constrain editing, but hard mask boundaries introduce artifacts and reduce editability. To address these issues, we propose FusionEdit, a training-free image editing framework that achieves precise and controllable edits. First, editing and preserved regions are automatically identified by measuring semantic discrepancies between source and target prompts. To mitigate boundary artifacts, FusionEdit performs distance-aware latent fusion along region boundaries to yield a soft yet accurate mask, and employs a total variation loss to enforce smooth transitions, obtaining natural editing results. Second, FusionEdit leverages AdaIN-based modulation within DiT attention layers to perform statistical attention fusion in the editing region, enhancing editability while preserving global consistency with the source image. Extensive experiments demonstrate that FusionEdit significantly outperforms state-of-the-art methods. Code is available at \href{https://github.com/Yvan1001/FusionEdit}{https://github.com/Yvan1001/FusionEdit}.
💡 Research Summary
FusionEdit tackles two fundamental challenges in text‑guided image editing: (1) determining where to edit and (2) how to edit while preserving the rest of the image. Existing approaches typically rely on external binary masks (e.g., from blend words or segmentation models) that create hard boundaries, leading to visual artifacts and limited editability. FusionEdit eliminates the need for any external mask by automatically locating editing regions through a semantic‑discrepancy analysis of the source and target prompts.
The method first computes a semantic discrepancy map S by feeding the same noisy latent (at an early‑to‑mid denoising step T′≈0.89) into a pretrained rectified‑flow model (Flux) conditioned on the source and target texts separately. The L2 distance between the two resulting velocity fields yields a pixel‑wise map that highlights areas where the semantics differ. To obtain a spatially coherent region, the map is averaged over several runs, partitioned into non‑overlapping patches, and patches are merged iteratively based on similarity, producing a binary region mask M_R.
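The region-detection step above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the velocity fields are taken as given arrays, and the iterative similarity-based patch merging is simplified to thresholding each patch's mean discrepancy (the threshold heuristic is an assumption).

```python
import numpy as np

def discrepancy_map(v_src, v_tgt):
    """Pixel-wise L2 distance between the two velocity fields
    (arrays of shape C x H x W, conditioned on source/target text)."""
    return np.linalg.norm(v_src - v_tgt, axis=0)

def binary_region_mask(maps, patch=8, thresh=None):
    """Average discrepancy maps over several runs, pool into
    non-overlapping patches, and keep patches whose mean discrepancy
    exceeds a threshold -- a simplified stand-in for the paper's
    iterative patch merging. Returns a pixel-resolution 0/1 mask M_R."""
    S = np.mean(maps, axis=0)                       # average over runs
    H, W = S.shape
    ph, pw = H // patch, W // patch
    pooled = (S[:ph * patch, :pw * patch]
              .reshape(ph, patch, pw, patch)
              .mean(axis=(1, 3)))                   # patch means
    if thresh is None:
        thresh = pooled.mean()                      # heuristic cut-off (assumption)
    M_patch = (pooled > thresh).astype(np.float32)
    # Upsample patch-level decisions back to pixel resolution.
    return np.kron(M_patch, np.ones((patch, patch), dtype=np.float32))
```

In practice the velocity fields would come from two forward passes of the rectified-flow model on the same noisy latent, one per prompt.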
Because a binary mask still creates abrupt transitions, FusionEdit converts M_R into a soft mask M_S using a distance‑aware sigmoid function. For each pixel, the Euclidean distance D to the nearest binary boundary is computed; pixels within a predefined band (d_max) receive a smooth transition value, while pixels outside retain the original binary value. This soft mask enables a gradual blend between edited and preserved latents.
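A minimal sketch of the distance-aware softening, using SciPy's Euclidean distance transform; the sigmoid steepness `k` is a hypothetical parameter not specified in the summary:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def soft_mask(M_R, d_max=8.0, k=1.0):
    """Convert a binary region mask M_R into a soft mask M_S.
    Pixels within d_max of the boundary get a sigmoid transition
    over the signed distance; pixels farther away keep their
    original binary value."""
    inside = distance_transform_edt(M_R)        # distance to boundary, inside the region
    outside = distance_transform_edt(1 - M_R)   # distance to boundary, outside the region
    signed = inside - outside                   # >0 inside, <0 outside
    M_S = 1.0 / (1.0 + np.exp(-k * signed))     # smooth sigmoid transition
    # Outside the transition band, fall back to the binary value.
    return np.where(np.abs(signed) > d_max, M_R.astype(np.float64), M_S)
```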
At each diffusion timestep t, the intermediate latent X_mid^t (the result of moving from source toward target in latent space) is fused with the original source latent X_src using M_S: X_M^t = M_S ⊙ X_mid^t + (1 – M_S) ⊙ X_src. To further suppress boundary artifacts, a total variation (TV) loss is applied over the boundary region Ω_b, encouraging spatial smoothness while keeping the fused latent close to its initial value. The loss is L_TV = Σ_{Ω_b}‖∇X_M^t‖² + λ Σ_{Ω_b}‖X_M^t – X̂_M^t‖², where X̂_M^t is the initial fused latent and λ controls the smoothness–fidelity trade-off.
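The fusion and TV smoothing can be sketched as a few explicit gradient steps on the loss above. This is an illustrative NumPy version under stated assumptions: the Laplacian uses periodic (roll-based) boundary conditions, and the step size `lr` and step count are hypothetical.

```python
import numpy as np

def fuse_and_smooth(x_mid, x_src, M_S, boundary, lam=0.1, lr=0.05, steps=10):
    """Distance-aware latent fusion followed by gradient descent on the
    TV objective, restricted to the boundary band `boundary` (0/1 mask).
    Latents have shape C x H x W; masks have shape H x W."""
    # X_M^t = M_S ⊙ X_mid^t + (1 - M_S) ⊙ X_src
    x = M_S * x_mid + (1.0 - M_S) * x_src
    x0 = x.copy()  # anchor for the fidelity term ||X_M^t - X̂_M^t||²
    for _ in range(steps):
        # Discrete Laplacian; the gradient of Σ‖∇x‖² is -2·Δx.
        lap = (np.roll(x, 1, axis=-1) + np.roll(x, -1, axis=-1)
               + np.roll(x, 1, axis=-2) + np.roll(x, -1, axis=-2) - 4.0 * x)
        grad = boundary * (-2.0 * lap + 2.0 * lam * (x - x0))
        x = x - lr * grad  # update only inside the boundary band
    return x
```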
While the soft mask restricts editing to the intended area, it also blocks global style cues that are useful for fine-grained alignment. To restore this information, FusionEdit introduces Disparity-Aware Attention Modulation (DAM). In the DiT (Diffusion Transformer) architecture, two parallel streams are run: a masked editing stream producing the value tensor V_l and an unmasked reference stream producing V_r^l. An AdaIN operation transfers the channel-wise mean and variance of V_r^l into V_l, yielding AdaIN(V_l, V_r^l). The final value tensor is a weighted combination: V′_l = α·AdaIN(V_l, V_r^l) + (1 – α)·V_l. The weight α is dynamically computed as α = β·(1 – t)·…
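The AdaIN-based value modulation can be sketched as follows. This is a minimal NumPy illustration of the stated formula for value tensors shaped tokens × channels; the α schedule is left as a caller-supplied scalar since its full expression is truncated in the summary.

```python
import numpy as np

def adain(v_edit, v_ref, eps=1e-5):
    """Transfer the channel-wise mean and std of the reference value
    tensor V_r^l onto the editing value tensor V_l.
    Tensors have shape (tokens, channels)."""
    mu_e, sig_e = v_edit.mean(axis=0), v_edit.std(axis=0)
    mu_r, sig_r = v_ref.mean(axis=0), v_ref.std(axis=0)
    return (v_edit - mu_e) / (sig_e + eps) * sig_r + mu_r

def modulated_value(v_edit, v_ref, alpha):
    """V'_l = α · AdaIN(V_l, V_r^l) + (1 - α) · V_l."""
    return alpha * adain(v_edit, v_ref) + (1.0 - alpha) * v_edit
```

With α = 0 the editing stream is untouched; as α grows, the edited values inherit more of the reference stream's global statistics.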