CoreEditor: Correspondence-constrained Diffusion for Consistent 3D Editing

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.


💡 Research Summary

CoreEditor tackles the persistent problem of multi‑view inconsistency in text‑driven 3D editing by introducing two novel attention mechanisms that operate directly within a pre‑trained latent diffusion model. The pipeline begins by rendering a set of N views and their depth maps from a Gaussian‑Splatting (GS) representation of the scene. Each view is first edited independently using a standard DDIM‑inversion approach, during which intermediate diffusion features are saved. Users can then select a preferred edited view (Iᵣ); its feature map (Fᵣ) is injected into all views via Reference Attention (RA). In RA, the selected edit acts as an additional key‑value pair in the self‑attention computation, weighted by a coefficient λ, thereby aligning the global editing style across all views before any fine‑grained consistency enforcement.
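The Reference Attention step described above can be sketched in a few lines: the selected view's features contribute extra key-value pairs to each view's self-attention, with their logits scaled by λ. This is a minimal NumPy sketch, not the paper's implementation; the function name and the exact placement of λ (on the reference logits) are assumptions consistent with the description "an additional key-value pair weighted by a coefficient λ".

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention(q, k, v, k_ref, v_ref, lam=0.5):
    """Self-attention over a view's own tokens plus the reference view's
    tokens (F_r), with the reference logits scaled by lam (hypothetical
    formulation of the weighting described in the summary)."""
    d = q.shape[-1]
    logits_self = q @ k.T / np.sqrt(d)              # (n, n)
    logits_ref = lam * (q @ k_ref.T / np.sqrt(d))   # (n, m)
    logits = np.concatenate([logits_self, logits_ref], axis=-1)
    weights = softmax(logits, axis=-1)              # attend over both sets
    values = np.concatenate([v, v_ref], axis=0)     # (n + m, d)
    return weights @ values                         # (n, d)
```

With lam = 0 the reference tokens still receive uniform residual weight, so in practice a masking or gating variant may be used; the sketch only illustrates the key-value injection idea.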

The second, core contribution is Correspondence‑constrained Attention (CCA). Unlike conventional self‑attention that allows every token to attend to every other token, CCA restricts attention to tokens that correspond to the same 3D point across different camera viewpoints. Correspondences are built from two complementary sources: (1) Geometric correspondence derived from depth maps and camera intrinsics/extrinsics, with a reprojection error mask to discard occluded matches; (2) Semantic correspondence obtained from the diffusion model’s final‑layer feature maps. For pixels lacking reliable geometric matches, the algorithm searches for the pixel in another view that maximizes cosine similarity of diffusion features, accepting only matches whose similarity exceeds a high threshold (β≈0.9). The union of geometric and semantic matches forms a Geometric‑Semantic Co‑supported Correspondence set that is fed into the attention module. Consequently, each token can only attend to its true multi‑view counterparts, ensuring that the denoising trajectory for a given 3D point is identical across all views. This dramatically improves texture sharpness, edge fidelity, and overall 3D consistency, even under large viewpoint changes or heavy occlusions.
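The semantic half of the correspondence set, and the resulting attention restriction, can be illustrated with a small sketch. This is an assumption-laden toy version, not the paper's code: `semantic_matches` and `build_attention_mask` are hypothetical helper names, and the cosine-similarity threshold β ≈ 0.9 follows the value quoted above.

```python
import numpy as np

def semantic_matches(feat_a, feat_b, beta=0.9):
    """For each pixel feature in view A, find its best cosine-similarity
    match among view B's features; keep the pair only if the similarity
    exceeds the threshold beta (~0.9 in the summary)."""
    a = feat_a / np.linalg.norm(feat_a, axis=-1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=-1, keepdims=True)
    sim = a @ b.T                                   # (Na, Nb) cosine sims
    best = sim.argmax(axis=1)                       # best candidate per pixel
    keep = sim[np.arange(len(a)), best] > beta      # reject weak matches
    return {i: int(best[i]) for i in range(len(a)) if keep[i]}

def build_attention_mask(n_a, n_b, matches):
    """Boolean mask for correspondence-constrained attention: token i in
    view A may attend to token j in view B only if (i, j) is in the
    co-supported correspondence set."""
    mask = np.zeros((n_a, n_b), dtype=bool)
    for i, j in matches.items():
        mask[i, j] = True
    return mask
```

In the full method this mask would be the union of geometric matches (depth reprojection with an occlusion check) and these semantic matches, applied to the cross-view attention logits so non-corresponding tokens are excluded.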

CoreEditor operates in a zero‑shot fashion: the underlying diffusion network remains frozen, and only the attention flow is re‑wired. No additional trainable parameters or costly fine‑tuning are required, preserving the efficiency of existing diffusion‑based editors while delivering far superior multi‑view results. Extensive experiments on a variety of indoor and outdoor scenes demonstrate that CoreEditor outperforms recent baselines such as GaussCtrl, DGE, and InterGSEdit. Quantitatively, it achieves average gains of +1.2 dB in PSNR, +0.03 in SSIM, and –0.07 in LPIPS. Qualitatively, the rendered scenes exhibit markedly sharper textures and consistent details across viewpoints, and the selective editing interface gives users direct control over the final aesthetic. Limitations include reliance on Gaussian‑Splatting as the 3D representation and the additional memory overhead of extracting diffusion features for semantic correspondence, which may become significant for very high‑resolution inputs.

In summary, CoreEditor introduces a powerful combination of Reference Attention for global style alignment and Correspondence‑constrained Attention for precise local consistency, all within a frozen diffusion model. This design resolves the two main challenges of text‑driven 3D editing—maintaining multi‑view coherence and preserving fine details—without sacrificing speed or requiring extensive retraining. Future work will explore extending the approach to alternative 3D representations (e.g., meshes, voxels) and developing more lightweight semantic correspondence mechanisms to broaden applicability.

