Shifting the Breaking Point of Flow Matching for Multi-Instance Editing
Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.
💡 Research Summary
The paper addresses a critical limitation of recent flow‑matching based image editing models: their inability to edit multiple objects or regions independently when given several textual instructions at once. Existing flow‑matching generators, such as those built on Multimodal Diffusion Transformers (MMDiT), concatenate text, latent, and auxiliary context tokens into a single sequence and apply joint self‑attention. While this design yields high global visual quality, it also allows every token to attend to every other token, causing “attribute leakage” – the semantic influence of one instruction bleeding into unrelated image regions. This problem becomes especially severe in multi‑instance scenarios, where N different edits must be applied simultaneously, and in domains like text‑dense infographics where small, precise changes are required.
To solve this, the authors propose Instance‑Disentangled Attention (IDA), an architectural modification that partitions the joint token stream into six logical groups: global prompt tokens (Tg), per‑instance local prompt tokens (Tₙ), background latent tokens (Lᵤ), per‑instance latent tokens (Lₙ), background context tokens (Cᵤ), and per‑instance context tokens (Cₙ). Two attention masks are defined:
- Disentanglement mask M₍dis₎ – allows only tokens belonging to the same instance (Tₙ, Lₙ, Cₙ) to attend to one another, while global prompts and background tokens may attend to everything. This strictly confines each edit's influence to its own spatial region.
- Harmonization mask M₍har₎ – relaxes this restriction in later transformer layers, permitting instance-specific latents and contexts to also attend to other instances' latents and contexts. This restores global coherence after the per-instance bindings have been established.
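The two masks can be pictured as boolean attention matrices over the concatenated token sequence. The sketch below is an illustration, not the paper's implementation; it assumes each token carries an instance id (with -1 marking global-prompt and background tokens) and that instance tokens may also attend to global/background keys, which the summary leaves implicit.

```python
import numpy as np

def build_disentanglement_mask(instance_ids):
    """M_dis: token i may attend to token j only if they share an instance,
    or if either side is a global/background token (id == -1)."""
    ids = np.asarray(instance_ids)
    same_instance = ids[:, None] == ids[None, :]
    global_query = (ids == -1)[:, None]  # global/background rows attend everywhere
    global_key = (ids == -1)[None, :]    # assumption: instance rows may read global keys
    return same_instance | global_query | global_key

def build_harmonization_mask(num_tokens):
    """M_har: fully relaxed attention, letting instances see each other
    so edited parts blend back into a coherent whole."""
    return np.ones((num_tokens, num_tokens), dtype=bool)
```

For example, with ids `[-1, 0, 0, 1]`, the M_dis row for an instance-0 token is blocked from the instance-1 column but open to the global column, while every M_har entry is open.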
The masks are applied in a layer‑wise schedule: early layers use M₍har₎ to capture coarse, global features; middle layers use M₍dis₎ to bind each textual instruction to its designated region; and final layers revert to M₍har₎ to blend the edited parts back into a harmonious whole. This schedule aligns with recent findings that transformer depth correlates with a progression from low‑level feature extraction to semantic binding and finally to global refinement.
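That depth-aware schedule reduces to a simple selector over layer index. The boundary fractions below are placeholders for illustration; the paper's summary does not specify where the disentanglement band begins or ends.

```python
def mask_for_layer(layer, num_layers, m_dis, m_har,
                   dis_start=0.25, dis_end=0.75):
    """Pick the attention mask for a given transformer layer:
    early layers -> M_har (coarse global features),
    middle layers -> M_dis (bind each instruction to its region),
    late layers -> M_har (re-harmonize the edited composite).
    dis_start/dis_end are assumed fractions, not values from the paper."""
    frac = layer / max(num_layers - 1, 1)
    return m_dis if dis_start <= frac <= dis_end else m_har
```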
In addition to the architectural changes, the paper tackles the inefficiency of multi‑prompt encoding. The naïve approach concatenates all instructions into a single long prompt, which lets the text encoder’s self‑attention mix unrelated concepts before they even reach the generator. Existing alternatives either mask after encoding (still prone to leakage) or encode each sub‑prompt separately (computationally linear in the number of instances). The authors’ solution encodes a global prompt (often a null prompt) and each instance’s sub‑prompt independently, then concatenates the resulting embeddings while keeping the total token length proportional to the overall semantic content rather than the number of instances. This yields near‑constant overhead even when dozens of edits are required, enabling true single‑pass multi‑instance editing.
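The encoding scheme amounts to running the text encoder once per prompt in isolation and concatenating the resulting embeddings, tagging each span with its instance id for the attention masks. A minimal sketch, with `encode` standing in for any text encoder that returns a `(num_tokens, dim)` array (the interface is assumed, not the paper's):

```python
import numpy as np

def encode_multi_prompt(encode, global_prompt, instance_prompts):
    """Encode the global prompt and each instance sub-prompt independently,
    so the encoder's self-attention never mixes unrelated instructions.
    Total token length tracks the semantic content of the prompts,
    not the instance count. Returns (embeddings, instance_ids),
    with id -1 marking global-prompt tokens."""
    g = encode(global_prompt)
    chunks, ids = [g], [np.full(len(g), -1)]
    for n, prompt in enumerate(instance_prompts):
        emb = encode(prompt)
        chunks.append(emb)
        ids.append(np.full(len(emb), n))
    return np.concatenate(chunks), np.concatenate(ids)
```

With a toy encoder that emits one token per word, a global prompt of two words plus sub-prompts of two and three words yields seven tokens with ids `[-1, -1, 0, 0, 1, 1, 1]`.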
Experiments are conducted on two fronts. First, standard natural‑image editing benchmarks are used to demonstrate that IDA preserves the high visual fidelity of flow‑matching generators while eliminating cross‑instance artifacts. Second, the authors introduce a new benchmark of text‑dense infographics. Each infographic image is paired with bounding‑box annotations for every text region and a “Change ‘SRC’ to ‘TGT’” instruction, where SRC is the original English text and TGT is a translation (e.g., Korean). This benchmark stresses the model because edited regions are tiny, densely packed, and any stray change can corrupt the overall meaning.
Quantitative results show that IDA‑augmented models achieve higher edit accuracy (percentage of correctly replaced text), better locality (lower L2 change in untouched regions), and improved image quality metrics (PSNR, SSIM) compared to baseline flow‑matching editors and recent diffusion‑based multi‑edit methods. Qualitative examples illustrate clean, isolated edits without the color bleeding or shape distortion seen in baselines. Importantly, the entire multi‑instance edit is performed in a single forward pass, eliminating the iterative mask‑refinement loops required by prior work and reducing inference time by up to 40 %.
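The locality and quality metrics above are standard and easy to state concretely. A sketch of a locality score (mean L2 pixel change restricted to untouched regions) and PSNR, assuming `uint8`-style images and a boolean mask marking the edited boxes; the paper's exact metric definitions may differ:

```python
import numpy as np

def locality_l2(before, after, edit_mask):
    """Mean per-pixel L2 change over pixels OUTSIDE the edited regions
    (edit_mask == True marks edited boxes). Lower is better: a local
    edit should leave untouched regions unchanged."""
    untouched = ~edit_mask
    diff = (after.astype(np.float64) - before.astype(np.float64)) ** 2
    return float(np.sqrt(diff[untouched].sum(axis=-1)).mean())

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to reference."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```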
The paper’s contributions are fourfold: (1) identification and formal analysis of attribute leakage in flow‑matching generators; (2) the design of Instance‑Disentangled Attention with dual masks and a depth‑aware application schedule; (3) an efficient multi‑prompt encoding scheme that scales sub‑linearly with the number of edits; and (4) the creation of a challenging infographic editing benchmark that will serve the community. Limitations include reliance on bounding‑box based token partitioning (overlapping boxes can cause tokens to belong to multiple partitions, leading to minor residual leakage) and the current focus on static images; extending the approach to video or to non‑rectangular masks would require additional temporal or shape‑aware mechanisms.
Overall, the work demonstrates that flow‑matching models, when equipped with carefully structured attention, can overcome their previous inability to handle multi‑instance editing, achieving a balance of edit disentanglement, locality, and global coherence while retaining the speed advantages of continuous‑time ODE generation. This opens the door to practical applications such as batch editing of design assets, rapid localization of UI mockups, and multilingual infographic adaptation.