SpotEdit: Selective Region Editing in Diffusion Transformers
📝 Abstract
Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.
📄 Content
Diffusion models have shown outstanding performance in image generation tasks [10,16,31]. By leveraging Diffusion Transformer (DiT) architectures [28], generation quality and flexibility have been further enhanced. Building on these advances, encoding conditional images and integrating them directly into transformer layers [36,37] has become a mainstream technique for image editing [15,39]. This strategy enables simple yet effective editing without relying on manually provided masks, greatly improving usability in practical applications.
However, in most image editing tasks, only a small region of the image requires modification, while the majority of the image remains unchanged. Yet existing approaches uniformly follow a full-image regeneration paradigm, indiscriminately denoising every region from random noise, including those that need no editing. Such uniform processing introduces two prominent drawbacks: first, redundant computation in non-edited regions may inadvertently produce subtle artifacts; second, significant computational resources are wasted on unmodified areas. These issues lead us to reconsider the current editing paradigm and pose a critical question: Is it truly necessary to regenerate every region of the image during an editing task? To address these problems, we begin by analyzing the temporal convergence patterns of latent representations during diffusion. Figure 2 reveals that, in partial editing tasks, non-edited regions stabilize quickly, converging at early diffusion timesteps. This observation motivates a more efficient editing strategy: edit only what needs to be edited.
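The convergence pattern described above can be illustrated with a toy simulation. Everything here is invented for the sketch (shapes, step count, and the trajectory model); it only demonstrates the qualitative behavior of stable versus edited tokens:

```python
import numpy as np

# Toy illustration: non-edited tokens settle at early diffusion steps,
# edited tokens keep changing through late steps.
rng = np.random.default_rng(0)
num_tokens, dim, steps = 8, 4, 10
latents = [rng.normal(size=(num_tokens, dim))]
for t in range(steps):
    step = 0.1 * rng.normal(size=(num_tokens, dim))
    if t > 2:
        step[:4] *= 0.01  # tokens 0..3 ("non-edited") have converged
    latents.append(latents[-1] + step)

def token_change(prev, curr):
    """Per-token L2 change between two consecutive latent states."""
    return np.linalg.norm(curr - prev, axis=-1)

late = token_change(latents[-2], latents[-1])
print(late[:4].mean() < late[4:].mean())  # stable tokens barely move late on
```

In a real run, this per-token change curve is what makes it possible to detect stable regions on the fly rather than relying on a user-provided mask.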
Guided by this principle, we propose SpotEdit, a mechanism designed to automatically detect stable, non-edited regions and reuse their corresponding condition-image latent features without recomputing them in the DiT, thereby avoiding redundant regeneration. Implementing this idea, however, raises two critical challenges: 1) How can non-edited regions be identified efficiently and accurately? 2) How can the model dynamically focus computation only on the regions requiring modification?
For challenge 1, we propose SpotSelector, an adaptive mechanism that dynamically identifies stable regions during diffusion iterations. Specifically, SpotSelector computes a perceptual similarity score for each latent token by measuring the perceptual distance between the reconstructed fully denoised latent and the corresponding condition image latent via VAE decoder layers. Regions whose perceptual distance is below a threshold are automatically classified as non-edited regions. This approach eliminates manual masking and directly leverages the diffusion dynamics observed in our analysis, ensuring that the identified regions align with the model’s generative process.
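A minimal sketch of the selection logic might look as follows. Plain L2 distance stands in for the paper's perceptual distance computed through VAE decoder layers, and the threshold `tau` is a hypothetical hyperparameter:

```python
import numpy as np

def spot_selector(denoised: np.ndarray, cond: np.ndarray, tau: float) -> np.ndarray:
    """Classify latent tokens as non-edited (True) or edited (False).

    denoised: (num_tokens, dim) fully denoised latent estimate at this step
    cond:     (num_tokens, dim) condition-image latent
    tau:      distance threshold (hypothetical hyperparameter)

    Plain L2 distance is used here as a stand-in for the perceptual
    distance the paper computes via VAE decoder layers.
    """
    dist = np.linalg.norm(denoised - cond, axis=-1)
    return dist < tau

cond = np.zeros((6, 4))
denoised = cond.copy()
denoised[2] += 5.0  # pretend token 2 was edited
mask = spot_selector(denoised, cond, tau=1.0)
print(mask.tolist())  # only token 2 is classified as edited
```

Tokens flagged `True` can skip the transformer entirely, which is where the computational savings come from.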
For challenge 2, we introduce SpotFusion, a context fusion mechanism that restores missing contextual information by adaptively blending features from the condition image. Leveraging the strong feature correspondence between non-edited regions and the corresponding condition-image regions across diffusion steps, SpotFusion dynamically modulates the contribution of the reference based on the current denoising timestep, relying more on the reference early in the process and gradually shifting to the current estimate as generation progresses. This design preserves temporal coherence in features while avoiding potential boundary artifacts, without requiring additional computation for unedited regions.
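The time-dependent blend could be sketched as below. The linear schedule `alpha = t / T` is an illustrative assumption, not the paper's actual modulation:

```python
import numpy as np

def spot_fusion(cond: np.ndarray, current: np.ndarray, t: int, T: int) -> np.ndarray:
    """Blend condition features with the current estimate for stable tokens.

    t counts down from T (start of denoising) to 0 (end), so early steps
    rely mostly on the reference and late steps mostly on the current
    estimate. The linear schedule alpha = t / T is an assumption made
    for this sketch.
    """
    alpha = t / T
    return alpha * cond + (1.0 - alpha) * current

cond = np.ones(4)      # condition-image features
current = np.zeros(4)  # current denoised estimate
early = spot_fusion(cond, current, t=9, T=10)  # reference-dominated
late = spot_fusion(cond, current, t=1, T=10)   # estimate-dominated
print(early[0], late[0])  # 0.9 vs 0.1
```

Any monotone schedule with the same endpoints would preserve the stated behavior; the key property is that the reference weight decays as denoising progresses.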
Experimental results demonstrate that SpotEdit achieves a speedup of 1.7× on the ImgEdit benchmark [43] and 1.9× on PIE-Bench++ [12] with the base model FLUX.1-Kontext [15], while maintaining quality comparable to the original model. Qualitative results (see Figure 1) further indicate that SpotEdit perfectly preserves non-edited regions and produces clean, localized edits.
Our primary contributions are summarized as follows:
(i) We propose SpotSelector, a perceptual-similarity-based method for dynamically distinguishing non-edited regions, removing the need for manual masks.
(ii) We introduce SpotFusion, an adaptive fusion mechanism ensuring temporal coherence and contextual consistency in partially edited diffusion processes.
(iii) We demonstrate that our combined framework, SpotEdit, enables selective diffusion-based editing, significantly accelerating inference while preserving the fidelity and quality of edits.
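Putting the two components together, one denoising step of the overall scheme could be sketched as follows. This is schematic only: `denoise_fn`, the linear blend schedule, and all shapes are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def spotedit_step(latent, cond, mask, denoise_fn, t, T, tau):
    """One schematic SpotEdit step over (num_tokens, dim) latents.

    Edited tokens (mask == False) pass through the expensive denoiser;
    stable tokens reuse condition features via a time-dependent blend;
    the stable-region mask is then refreshed for the next step.
    """
    out = latent.copy()
    out[~mask] = denoise_fn(latent[~mask])  # compute only edited tokens
    alpha = t / T
    out[mask] = alpha * cond[mask] + (1.0 - alpha) * latent[mask]
    new_mask = np.linalg.norm(out - cond, axis=-1) < tau  # refresh selection
    return out, new_mask

latent = np.full((6, 4), 0.5)
cond = np.zeros((6, 4))
mask = np.array([True, True, True, True, False, False])  # tokens 4,5 edited
out, new_mask = spotedit_step(latent, cond, mask,
                              lambda x: 0.5 * x, t=8, T=10, tau=1.0)
print(out.shape)
```

Note that the denoiser is only ever applied to the edited subset of tokens, so the per-step cost shrinks with the fraction of the image actually being edited.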
Image editing has long been a central need in workflows [29]. Following the remarkable breakthroughs of diffusion models [11,42] in image generation, a growing body of research has focused on adapting these models to image editing tasks. Early approaches, such as ControlNet [46], injected external control signals into a U-Net [32] to enable robust and controllable editing. With the advancement of diffusion models, inversion-based methods [8,9,22,25,38,41,48] have become the mainstream paradigm. These approaches operate by