High-Quality Image Inpainting via Pixel-Equivalent Latent Compositing
📝 Abstract
Latent inpainting in diffusion models still relies almost universally on linearly interpolating VAE latents under a downsampled mask. We propose a key principle for compositing image latents: Pixel-Equivalent Latent Compositing (PELC). An equivalent latent compositor, followed by decoding, should produce the same result as compositing in pixel space. This principle enables full-resolution mask control and true soft-edge alpha compositing, even though VAEs compress images 8x spatially. Modern VAEs capture global context beyond patch-aligned local structure, so linear latent blending cannot be pixel-equivalent: it produces large artifacts at mask seams, global degradation, and color shifts. We introduce DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and an off-manifold residual correction to realize mask-consistent latent fusion. DecFormer is trained so that decoding after fusion matches pixel-space alpha compositing, is plug-compatible with existing diffusion pipelines, requires no backbone finetuning, and adds only 0.07% of FLUX.1-Dev's parameters and 3.5% FLOP overhead. On the FLUX.1 family, DecFormer restores global color consistency, soft-mask support, sharp boundaries, and high-fidelity masking, reducing error metrics around edges by up to 53% over standard mask interpolation. Used as an inpainting prior, a lightweight LoRA on FLUX.1-Dev with DecFormer achieves fidelity comparable to FLUX.1-Fill, a fully finetuned inpainting model. While we focus on inpainting, PELC is a general recipe for pixel-equivalent latent editing, as we demonstrate on a complex color-correction task.
📄 Content
Latent diffusion models (LDMs) [20, 21] dominate modern image generation, yet a brittle operation is widespread for mixing image latents. In mask-conditioned generation tasks such as inpainting or editing, latents are interpolated via a mask the same way as pixels. However, this heuristic is a source of error that limits mask fidelity: VAE decoders are nonlinear and spatially entangled, so mixing latents does not mix pixels. The result is off-manifold seams, color shifts, and halos that diffusion then amplifies across denoising steps.
We propose a simple principle: latent compositing should be pixel-equivalent (PE). For a frozen encoder $E$, decoder $D$, and any pixel-space operator $F$, a latent operator $C_F$ should satisfy

$$D(C_F(z_1, z_2)) = F(D(z_1), D(z_2)) \quad\text{and}\quad C_F(E(x_1), E(x_2)) = E(F(x_1, x_2)).$$

That is, applying $F$ after decoding should match applying $C_F$ before decoding, and likewise for encoding. We call these two equalities decoder equivalence (DE) and encoder equivalence (EE). As a concrete case, inpainting uses the alpha-compositing operator

$$F(x_1, x_2, m) = m \odot x_1 + (1 - m) \odot x_2,$$

where $m \in [0, 1]^{H \times W}$ is a (possibly soft) mask. To satisfy DE and EE, a latent compositor $C_F$ must uphold

$$D(C_F(z_1, z_2, m)) = m \odot D(z_1) + (1 - m) \odot D(z_2)$$

and

$$C_F(E(x_1), E(x_2), m) = E(m \odot x_1 + (1 - m) \odot x_2).$$
Linear latent blending would satisfy decoder equivalence only if $D$ were locally linear and channel-separable, assumptions that we empirically show fail in modern VAEs (Table 2, Fig. 1).
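A toy example makes the failure mode concrete. The "decoder" below is just a pointwise nonlinearity, a stand-in for illustration rather than the FLUX VAE; even this minimal nonlinearity breaks the equivalence between blending before and after decoding:

```python
import numpy as np

# Stand-in "decoder": a pointwise nonlinearity (tanh). Real VAE decoders
# are far more spatially entangled, but even this simple nonlinearity
# breaks linear-blend decoder equivalence.
def decode(z):
    return np.tanh(z)

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=8), rng.normal(size=8)
alpha = 0.5

# Blend in "latent" space, then decode...
blend_then_decode = decode(alpha * z1 + (1 - alpha) * z2)
# ...versus decode, then blend in "pixel" space.
decode_then_blend = alpha * decode(z1) + (1 - alpha) * decode(z2)

# Nonzero gap: linear latent blending is not pixel-equivalent.
gap = np.abs(blend_then_decode - decode_then_blend).max()
```

With a linear decoder the gap would be exactly zero; with any curvature it is not, which is the core of the argument above.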
Modern VAE latents couple wide spatial context and heterogeneous channels; broadcasting a single downsampled mask and linearly mixing latents introduces boundary leakage and global color drift. Figure 1 shows the effect on a latent mixing task: heuristic blending yields visible halos and boundary mismatch, while a pixel-equivalent compositor restores sharp edges and global image quality even away from the edge seams.

Figure 1: Each quadrant compares ground-truth pixel composites, our DecFormer predictions, and heuristic latent interpolation. Across soft, binary, and structured masks, DecFormer restores sharp edges and high-frequency detail, whereas the heuristic exhibits smearing and artifacts on soft blends, halos and discoloration at boundaries, and blocky low-fidelity masks. Notably, in the bottom-right example, global background degradation occurs far from the masked region, reflecting how latent entanglement corrupts off-mask content; this effect is eliminated by DecFormer.
We introduce PELC (Pixel-Equivalent Latent Compositing), a model-agnostic methodology for learning latent operators $C_\theta$ that are decode-equivalent with target pixel operators, using only a frozen encoder-decoder and synthetic supervision from pixel composites. As an inpainting instantiation, we propose DecFormer, a lightweight 7.7M-parameter transformer that predicts per-channel blend weights together with a nonlinear residual correction, supports genuinely soft masks, and serves as a drop-in replacement for heuristic latent compositing. The same principle of pixel-equivalence applies to non-compositing operations as well, such as the color corrections demonstrated in this work.
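The fusion rule DecFormer realizes can be sketched as follows. The function and variable names here are hypothetical; in the paper, both the per-channel weights and the residual are predicted by the 7.7M-parameter transformer conditioned on the latents and the full-resolution mask:

```python
import numpy as np

def fuse_latents(z_fg, z_bg, w, r):
    """Hypothetical DecFormer-style fusion: per-channel blend weights `w`
    (same shape as the latents, values in [0, 1]) replace the single
    broadcast mask of heuristic blending, and `r` is an off-manifold
    residual correction. In the paper, both `w` and `r` are predicted by
    a small transformer from the two latents and the full-resolution mask."""
    return w * z_fg + (1.0 - w) * z_bg + r

# Toy shapes: a (channels, h, w) latent grid.
C, H, W = 16, 4, 4
rng = np.random.default_rng(1)
z_fg, z_bg = rng.normal(size=(C, H, W)), rng.normal(size=(C, H, W))
w = rng.uniform(size=(C, H, W))        # predicted per-channel weights
r = 0.01 * rng.normal(size=(C, H, W))  # predicted residual correction

z_fused = fuse_latents(z_fg, z_bg, w, r)
```

Setting `w` to a single broadcast mask channel and `r` to zero recovers the heuristic linear blend, so the heuristic is a strict special case of this parameterization.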
• We formalize pixel-equivalence as a general criterion for latent operators and present PELC, a simple training recipe for realizing latent compositing from pixel-space supervision.
• We formalize and demonstrate (Fig. 1, Table 2) that linear latent interpolation cannot meet pixel-equivalence in modern VAEs, which exhibit nonlinearity and wide effective receptive fields.
• We design DecFormer, a 7.7M-parameter compositor that restores mask fidelity and supports genuinely soft masks with negligible overhead (3.5% additional FLOPs). DecFormer consistently halves error metrics in key mask-edge areas.
• For inpainting, DecFormer improves all visual metrics as a drop-in replacement, and adding a lightweight LoRA achieves visual quality comparable to a dedicated inpainting model (FLUX.1-Fill).
• Beyond inpainting, our PE objective applies to any pixel-space operator $F$, providing a path to principled latent-space editing without repeated encoding and decoding at every step.
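The PELC training recipe named in the first contribution can be sketched as a decode-match objective: train the compositor so that decoding the fused latent reproduces the pixel-space alpha composite. The sketch below uses toy invertible pointwise maps as a stand-in frozen encoder/decoder pair (the real recipe uses the frozen FLUX VAE, with downsampling this toy omits):

```python
import numpy as np

# Stand-in frozen encoder/decoder pair (toy pointwise inverse maps);
# the actual recipe uses the frozen FLUX VAE.
def encode(x):
    return np.arctanh(np.clip(x, -0.99, 0.99))

def decode(z):
    return np.tanh(z)

def pelc_loss(compositor, x_fg, x_bg, mask):
    """Decoder-equivalence loss: decoding the composited latent should
    reproduce pixel-space alpha compositing (the DE condition above)."""
    target = mask * x_fg + (1 - mask) * x_bg           # pixel composite
    z_fused = compositor(encode(x_fg), encode(x_bg), mask)
    return np.mean((decode(z_fused) - target) ** 2)

# Heuristic linear blending as the baseline compositor.
linear_blend = lambda z1, z2, m: m * z1 + (1 - m) * z2

rng = np.random.default_rng(2)
x_fg = rng.uniform(-0.9, 0.9, size=64)
x_bg = rng.uniform(-0.9, 0.9, size=64)
mask = rng.uniform(size=64)                            # genuinely soft mask
baseline_loss = pelc_loss(linear_blend, x_fg, x_bg, mask)
```

A perfectly pixel-equivalent compositor drives this loss to zero; the linear-blend baseline cannot, which is exactly the residual a learned $C_\theta$ is trained to remove.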
Instead of denoising in pixel space, modern diffusion models operate in the latent space of a pretrained variational autoencoder (VAE). We investigate Flux's VAE [13], the latest and state-of-the-art in the line of autoencoders following [21]. Given an image $x \in \mathbb{R}^{H \times W \times 3}$, the VAE encoder $E$ produces a latent tensor

$$z = E(x) \in \mathbb{R}^{\frac{H}{f} \times \frac{W}{f} \times C},$$

where $f$ is the downsampling factor of the VAE and $C$ is the number of latent channels. The decoder $D$ then reconstructs pixels through $x \approx \hat{x} = D(z)$.
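Concretely, with the $f = 8$ downsampling factor quoted above, a 1024x1024 RGB image maps to a 128x128 latent grid. The channel count below (16, as in the FLUX.1 family) is an assumption of this sketch rather than something stated in the text:

```python
# Latent tensor shape for an H x W image under a VAE with downsampling
# factor f: z = E(x) has shape (H/f, W/f, C).
def latent_shape(H, W, f=8, C=16):
    # Assumption: C=16 latent channels, as in the FLUX.1 VAE family.
    assert H % f == 0 and W % f == 0, "image dims must be multiples of f"
    return (H // f, W // f, C)

shape_1k = latent_shape(1024, 1024)  # (128, 128, 16)
```

This 64x spatial compression (8x per axis) is why a naively downsampled mask loses the fine boundary detail that full-resolution compositing preserves.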
A notable feature of these latents is that they resemble images. Channel-wise visualizations show downsampled content aligned to the spatial $(h, w)$ grid. The convolution-based architecture creates an inductive bias for latents to be spatially consistent with the encoded image. Each latent voxel $z[i, j, :]$ is aggregated by strided convolutions over a receptive field centered approximately at pixel coordinates $(f \cdot i, f \cdot j)$.
The effective stride $S_E = \prod_{l=1}^{L} s_l$, accumulated over the $L$ encoder layers with respective strides $s_l$, is equal to the downsampling factor $f$. In Flux's autoencoder, $S_E = f = 8$. Because convolutions are translation-equivariant, and each latent position's receptive field over the pixel space is strided by $f$, shifting the input image by $f$ pixels shifts the latent grid by one position.
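Under the stated reading that the effective stride multiplies across layers, a quick check (the per-layer strides below are hypothetical, chosen only to multiply to $f = 8$):

```python
from math import prod

def effective_stride(strides):
    """Cumulative (multiplicative) stride of a strided-conv encoder:
    S_E = prod_l s_l, which equals the VAE downsampling factor f."""
    return prod(strides)

# Hypothetical per-layer strides of an f=8 encoder: three stride-2
# downsampling stages plus a stride-1 head.
s_e = effective_stride([2, 2, 2, 1])  # 8
```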