Geometric Image Editing via Effects-Sensitive In-Context Inpainting with Diffusion Transformers


Recent advances in diffusion models have significantly improved image editing. However, challenges persist in handling geometric transformations such as translation, rotation, and scaling, particularly in complex scenes. Existing approaches suffer from two main limitations: (1) imprecise geometric edits when objects are translated, rotated, or scaled; and (2) inadequate modeling of intricate lighting and shadow effects, leading to unrealistic results. To address these issues, we propose GeoEdit, a framework that leverages in-context generation through a diffusion transformer module that integrates geometric transformations for precise object edits. We further introduce Effects-Sensitive Attention, which enhances the modeling of intricate lighting and shadow effects for improved realism. To support training, we construct RS-Objects, a large-scale geometric editing dataset of over 120,000 high-quality image pairs, enabling the model to learn precise geometric editing while generating realistic lighting and shadows. Extensive experiments on public benchmarks demonstrate that GeoEdit consistently outperforms state-of-the-art methods in visual quality, geometric accuracy, and realism.


💡 Research Summary

GeoEdit tackles two persistent shortcomings of diffusion‑based image editing: (1) the inability to perform precise geometric transformations (translation, rotation, scaling, and their combinations) on objects within complex scenes, and (2) inadequate modeling of lighting and shadow effects that leads to unrealistic results after transformation. The proposed framework consists of three tightly coupled components: a Geometric Transformation module, an Effects‑Sensitive Attention (ESA) mechanism, and a large‑scale training dataset called RS‑Objects.

The Geometric Transformation module lifts the target object into a textured 3D mesh using Hunyuan3D‑2.1. Translation is handled by copying the source mask to a new location, while rotation is performed by rotating the mesh to an arbitrary angle, orthographically projecting it onto a larger canvas, and then cropping and rescaling the result. Scaling is simulated by uniformly resizing the rendered object and mask, which implicitly encodes depth cues. By operating in this lifted 3D space, the module can apply parametric transformations with pixel‑perfect control, preserve texture fidelity, and generate accurate target masks that serve as in‑context guidance for the downstream diffusion model.
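As a simplified illustration of the translation and scaling cases, the 2D stand‑in below moves and uniformly resizes a binary object mask with plain NumPy. The actual pipeline operates on a textured mesh produced by Hunyuan3D‑2.1; `translate_mask` and `scale_mask` are hypothetical helper names used here for illustration, not the paper's code.

```python
import numpy as np

def translate_mask(mask, dx, dy):
    """Copy a binary object mask to a new location (the translation case)."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    ys2, xs2 = ys + dy, xs + dx
    # Drop pixels that would land outside the canvas.
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    out[ys2[keep], xs2[keep]] = mask[ys[keep], xs[keep]]
    return out

def scale_mask(mask, s):
    """Uniformly resize a mask about its centre (nearest-neighbour sampling);
    per the paper, uniform rescaling implicitly encodes depth cues."""
    h, w = mask.shape
    yy, xx = np.indices((h, w))
    cy, cx = (h - 1) / 2, (w - 1) / 2
    # Inverse-map each output pixel back to its source location.
    src_y = np.round((yy - cy) / s + cy).astype(int)
    src_x = np.round((xx - cx) / s + cx).astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.zeros_like(mask)
    out[valid] = mask[src_y[valid], src_x[valid]]
    return out
```

The transformed mask produced this way is what serves as in‑context guidance for the inpainting model; the rotation case additionally requires rendering the rotated mesh, which this 2D sketch cannot reproduce.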

The core novelty of the attention design is the Effects-Sensitive Attention (ESA) mechanism. Standard scaled‑dot‑product attention spreads focus across the whole image, which often dilutes the model's ability to concentrate on the edited region. A hard modulation that forces queries in the edit region to attend only to a subset of keys eliminates this dilution but also blocks the flow of lighting and shadow information from surrounding pixels, resulting in flat or missing shadows. ESA instead introduces a soft bias: for queries belonging to the edit region, a constant δ = α·std(S) is added to the raw attention logits S before the softmax, where α is a positive scalar controlling the strength of the bias. This adjustment raises the probability mass on keys that belong to the edited object while still allowing interaction with keys representing the surrounding environment, thereby preserving illumination cues. The authors provide a theoretical analysis (Theorem 3.1) showing that ESA reduces the KL‑divergence between the actual attention distribution and an ideal attention map A★, whereas hard modulation can drive this divergence to infinity.
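The soft bias can be sketched in a few lines of NumPy. The summary does not spell out exactly which keys receive the bias, so applying δ, for edit‑region queries, only to the keys of the edited object is an assumption (adding a constant to all keys of a query would cancel out under softmax); `esa_attention` and its mask arguments are illustrative names, not the paper's API.

```python
import numpy as np

def esa_attention(Q, K, V, edit_query_mask, object_key_mask, alpha=1.0):
    """Sketch of Effects-Sensitive Attention (assumed formulation).

    For queries in the edit region, a constant bias delta = alpha * std(S)
    is added to the logits of keys belonging to the edited object, softly
    raising their attention weight while keeping background keys (which
    carry lighting and shadow cues) reachable.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                      # raw attention logits
    delta = alpha * S.std()                        # bias scaled by logit spread
    bias = np.outer(edit_query_mask, object_key_mask) * delta
    S = S + bias
    # Numerically stable softmax over keys.
    S = S - S.max(axis=-1, keepdims=True)
    P = np.exp(S)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V
```

With α = 0 this reduces to standard attention; increasing α shifts probability mass toward the object keys for edit‑region queries only, while queries outside the edit region are untouched, which is the property the KL‑divergence analysis relies on.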

Training such a system requires a dataset that simultaneously offers precise geometric control and realistic lighting. RS‑Objects is built through a two‑stage rendering‑synthesis pipeline. In the rendering stage, Blender is used to create 24 diverse, object‑rich scenes containing 30 distinct objects. Multiple camera rings generate 20,000 image‑mask pairs with controlled translation, rotation, and scaling. In the synthesis stage, meshes from AnyInsertion‑V1 and Hunyuan3D‑2.1 are used to generate additional samples, which are then fed to a LoRA‑fine‑tuned diffusion model trained on the rendered data. This model produces over 800,000 synthetic samples; a human‑in‑the‑loop quality filter (20 annotators, three weeks) discards samples with spatial, feature, or illumination inconsistencies, leaving more than 100,000 high‑quality pairs for final training.

Extensive experiments compare GeoEdit against state‑of‑the‑art diffusion editing methods such as DreamBooth, SDEdit, Prompt‑guided Diffusion, and recent geometry‑aware approaches that manipulate latent features or attention maps. Quantitative metrics (FID, LPIPS, Geometric IoU, and a custom lighting consistency score) demonstrate that GeoEdit consistently outperforms baselines, especially on compound transformations where it achieves 15‑20 % lower FID and 10 % higher IoU. User studies corroborate these findings, with participants rating GeoEdit’s realism and geometric fidelity highest. Qualitative examples illustrate that GeoEdit preserves sharp object boundaries, generates shadows with correct direction and intensity, and maintains coherent illumination across the scene.

In summary, GeoEdit advances geometric image editing by (1) introducing a 3D‑based transformation pipeline that yields precise control over object pose and scale, (2) designing a soft, effects‑sensitive attention mechanism that balances focus on edited regions with preservation of surrounding lighting cues, and (3) curating a large, high‑quality dataset that enables supervised learning of both geometry and photorealistic effects. The work opens avenues for extending the approach to more diverse lighting conditions, video editing, and applying ESA to other diffusion‑based generative tasks.

