Reduce Information Loss in Transformers for Pluralistic Image Inpainting
Transformers have achieved great success in pluralistic image inpainting recently. However, we find that existing transformer-based solutions regard each pixel as a token and thus suffer from information loss in two respects: 1) They downsample the input image to much lower resolutions for efficiency, incurring information loss and extra misalignment at the boundaries of masked regions. 2) They quantize 256³ RGB values to a small number (such as 512) of quantized pixels. The indices of the quantized pixels are used as tokens for both the inputs and the prediction targets of the transformer. Although an extra CNN is used to upsample and refine the low-resolution results, it is difficult to recover the lost information. To preserve as much input information as possible, we propose a new transformer-based framework, "PUT". Specifically, to avoid input downsampling while maintaining computational efficiency, we design a patch-based auto-encoder, P-VQVAE, where the encoder converts the masked image into non-overlapped patch tokens and the decoder recovers the masked regions from the inpainted tokens while keeping the unmasked regions unchanged. To eliminate the information loss caused by quantization, an Un-Quantized Transformer (UQ-Transformer) is applied, which directly takes the features from the P-VQVAE encoder as input without quantization and regards the quantized tokens only as prediction targets. Extensive experiments show that PUT greatly outperforms state-of-the-art methods in image fidelity, especially for large masked regions and complex large-scale datasets.
💡 Research Summary
The paper identifies two fundamental sources of information loss in current transformer‑based pluralistic image inpainting: (1) down‑sampling the input image to reduce token count, which discards high‑frequency details and misaligns mask boundaries, and (2) quantizing the full RGB space (256³ colors) into a few hundred codebook entries, which compresses pixel information into coarse discrete tokens. Both steps hinder the ability of subsequent CNN up‑sampling and refinement stages to recover the original content.
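The quantization loss described above can be made concrete with a back-of-the-envelope calculation (this arithmetic is illustrative, not taken from the paper): mapping the full RGB space to a 512-entry codebook shrinks the per-pixel representation from 24 bits to 9 bits.

```python
import math

# Bits needed to represent one pixel exactly vs. after codebook quantization.
# 512 is the example codebook size cited in the summary.
full_rgb_colors = 256 ** 3                  # 16,777,216 possible RGB values
codebook_size = 512                         # quantized token vocabulary

bits_full = math.log2(full_rgb_colors)      # bits per exact RGB pixel
bits_quantized = math.log2(codebook_size)   # bits per quantized token

print(bits_full, bits_quantized)  # 24.0 9.0
```

Nearly two thirds of the per-pixel information is discarded before the transformer ever sees the image, which is exactly what UQ-Transformer's un-quantized inputs avoid.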
PUT (Pluralistic Un‑Quantized Transformer) eliminates these losses through two complementary components. First, a Patch‑based Vector‑Quantized Variational Auto‑Encoder (P‑VQVAE) processes the image in non‑overlapping 8×8 patches. Each patch is linearly projected into a high‑dimensional feature vector, preserving full resolution while keeping the token count low. A dual codebook separates latent vectors for masked and unmasked patches, allowing the encoder to learn distinct representations for each region. Second, the Un‑Quantized Transformer (UQ‑Transformer) receives the continuous feature vectors from the encoder as input, but predicts only the quantized token indices for masked patches. This asymmetric design prevents any information loss at the input stage while still leveraging a discrete prediction space.
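The patch tokenization and dual-codebook lookup can be sketched as follows. This is a minimal toy illustration under stated assumptions: the shapes (8×8 patches, 256-dim features), the random projection standing in for the learned encoder, and the toy codebook sizes are all placeholders, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 32
P = 8                                    # patch size
D = 256                                  # latent feature dimension
image = rng.random((H, W, 3))
mask = np.zeros((H // P, W // P), bool)  # True = masked patch
mask[0, 0] = True

# Non-overlapping patches -> one flat vector (token) per patch.
patches = image.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)             # (16, 192)

proj = rng.standard_normal((P * P * 3, D)) * 0.02    # stand-in for the encoder
features = patches @ proj                            # (16, 256) continuous tokens

# Dual codebook: separate entries for masked and unmasked patches.
codebook_unmasked = rng.standard_normal((512, D))
codebook_masked = rng.standard_normal((1, D))

def quantize(feat, book):
    """Index of the nearest codebook vector (L2 distance)."""
    d = ((feat[None, :] - book) ** 2).sum(axis=1)
    return int(d.argmin())

tokens = [quantize(f, codebook_masked if m else codebook_unmasked)
          for f, m in zip(features, mask.ravel())]
print(len(tokens))  # 16 tokens for a 32x32 image, vs. 1024 pixel tokens
```

Note the asymmetry: the continuous `features` (not the discrete `tokens`) would be fed to UQ-Transformer, while the token indices serve only as its prediction targets.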
The decoder, called Multi‑Scale Guided Decoder (MSG‑Dec), consists of a main branch that reconstructs the image from the retrieved codebook vectors and a reference branch that extracts multi‑scale features from the original masked image. A Mask Guided Addition (MGA) module fuses these branches, ensuring that unmasked regions remain unchanged and that mask boundaries are perfectly aligned.
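The core fusion idea of the MGA module can be sketched as a mask-weighted blend. Note this toy version operates on single-scale pixel arrays for clarity; the actual module fuses multi-scale feature maps inside the decoder, and the function name here is illustrative.

```python
import numpy as np

def mask_guided_add(decoded, reference, mask):
    """mask == 1 inside holes: use decoded content there, reference elsewhere."""
    return mask * decoded + (1.0 - mask) * reference

decoded = np.full((4, 4), 0.5)            # main-branch reconstruction
reference = np.zeros((4, 4))              # reference-branch features
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0                        # top-left quadrant is the hole

out = mask_guided_add(decoded, reference, mask)
# Outside the hole the output equals the reference exactly,
# so unmasked regions pass through unchanged.
print(np.array_equal(out[2:, :], reference[2:, :]))  # True
```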
Training proceeds in two stages: (i) self‑supervised learning of P‑VQVAE to obtain the encoder, dual codebooks, and MSG‑Dec, and (ii) training of UQ‑Transformer to predict token indices for masked patches. At inference, the transformer’s predicted tokens are looked up in the appropriate codebook, the corresponding vectors are fed to MSG‑Dec, and the final high‑resolution inpainted image is produced.
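The inference procedure above can be sketched end to end. All components here are stubs under stated assumptions: the random "transformer" stands in for the trained UQ-Transformer, and the sampling step is what yields multiple distinct completions (the pluralistic aspect).

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, D, K = 16, 256, 512
codebook = rng.standard_normal((K, D))                # learned codebook (stub)
features = rng.standard_normal((num_patches, D))      # P-VQVAE encoder output (stub)
masked = np.zeros(num_patches, bool)
masked[:4] = True                                     # first 4 patches are holes

def uq_transformer(feats):
    # Stub: the real model attends over continuous features, outputs token logits.
    return rng.standard_normal((len(feats), K))

logits = uq_transformer(features)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

vectors = features.copy()
for i in np.flatnonzero(masked):
    token = rng.choice(K, p=probs[i])    # sampling -> diverse completions
    vectors[i] = codebook[token]         # look up the predicted codebook vector

# `vectors` would now be passed to the MSG-Dec decoder to render the image.
print(vectors.shape)  # (16, 256)
```

Resampling the masked tokens with a different seed produces a different plausible completion from the same input, which is how one model yields pluralistic results.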
Extensive experiments on FFHQ, Places2, and ImageNet demonstrate that PUT substantially outperforms both CNN‑based and previous transformer‑based methods across PSNR, SSIM, LPIPS, and human preference scores. The advantage is especially pronounced for large masks (over 50 % of the image) and for complex, large‑scale datasets, where PUT preserves fine details and structural consistency that other methods lose.
Limitations include sensitivity to patch size and codebook dimensions, and increased memory consumption for large codebooks. Future work may explore adaptive patching, more memory‑efficient codebook learning, and fully continuous decoding to further boost quality. In summary, PUT offers a principled solution to the information‑loss problem in transformer‑based pluralistic inpainting, delivering higher fidelity and diversity without sacrificing computational efficiency.