DinoLizer: Automatic Localization via Overlapping Image Crops and DINOv2 Embeddings

Reading time: 6 minutes

📝 Abstract

We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer’s patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer’s superiority. The code will be publicly available upon acceptance of the paper.
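The core of the method described above is a single trainable 1×1 convolution over the ViT's patch embeddings, which for a 1×1 kernel reduces to a per-patch linear map. The following is a minimal numpy sketch under stated assumptions: the embedding dimension D, the random stand-in for DINOv2 embeddings, and the 0.5 threshold are illustrative, not the paper's exact values; only the shapes (504×504 crops, 14×14-pixel patches, hence a 36×36 patch grid) follow the text.

```python
import numpy as np

# Assumed shapes: a 504x504 crop with 14x14-pixel patches gives a 36x36
# grid of D-dimensional patch embeddings (D = 768 is an assumption here).
PATCH, CROP, D = 14, 504, 768
GRID = CROP // PATCH                         # 36 patches per side

rng = np.random.default_rng(0)
emb = rng.standard_normal((GRID, GRID, D))   # stand-in for ViT patch embeddings

# The head is a 1x1 convolution (D channels -> 1 logit), i.e. for each
# patch: logit = w . e + b.
w = rng.standard_normal(D) / np.sqrt(D)
b = 0.0
logit_map = emb @ w + b                      # shape (GRID, GRID)

# Thresholding the sigmoid of the logits yields a coarse binary mask at
# patch resolution; nearest-neighbour upsampling maps it back to pixels.
prob = 1.0 / (1.0 + np.exp(-logit_map))
mask = (prob > 0.5).astype(np.uint8)
pixel_mask = np.kron(mask, np.ones((PATCH, PATCH), dtype=np.uint8))
```

In practice the head would of course be trained with a pixel- or patch-level loss against ground-truth inpainting masks; the sketch only shows the shape bookkeeping from embeddings to a full-resolution mask.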

📄 Content

DinoLizer: Learning from the Best for Generative Inpainting Localization

Minh Thong DOI¹·², Jan BUTORA², Vincent ITIER¹·², Jérémie BOULANGER², Patrick BAS²
¹IMT Nord Europe, Institut Mines-Télécom, Centre for Digital Systems, F-59000 Lille, France
²Centre de Recherche en Informatique, Signal et Automatique de Lille, Avenue Henri Poincaré, 59655 Villeneuve d'Ascq, France
November 27, 2025 (arXiv:2511.20722v1 [cs.CV], 25 Nov 2025)

Figure 1: Principle of DinoLizer: the image is decomposed into a set of 504 × 504 overlapping crops, which are fed to the DINOv2 model to provide embeddings of dimension d for each patch, plus a class token that is not considered here. A trainable 1 × 1 convolutional layer is used to infer a logit map, which is then fused with the other overlapping maps in order to provide, after thresholding, a localization mask.

1 Introduction

Currently, Generative Artificial Intelligence (Gen-AI) poses serious threats to democracy, since it can be used to create or amplify disinformation. When applied to digital images, the realism of the generated content increases with each new generation of models, and with the latest generative methods, human observers find it increasingly difficult to distinguish between generated and authentic images. Generative inpainting is a byproduct of Gen-AI that enables subjects to be locally removed from or added to an image. With the latest generation of image-editing software, it is extremely easy to perform generative inpainting, and the detection of such manipulations is challenging because they blend authentic image parts with a generated background (for object removal) or with a generated object (for object insertion). This difference explains why there are more forensic methods in the literature designed to detect generated images than to localize generative inpainting. However, even if several detection methods [13, 24] are also designed to detect inpainted images, the goal of localizing the manipulated region can be seen as an additional step towards explainability, and it can be very helpful to security agencies in their forensic tasks.

| Dataset | Real Source | Inpainting Schemes | Orig Bg | Fake Images |
|---|---|---|---|---|
| BtB [1] | Flickr30k | Fooocus [40], SDXL | ✓ | 7.4k |
| B-Free [13] | MS-COCO [21] | SD2 Inpainting | ✓, × | 10k |
| COCOG [12] | MS-COCO | GLIDE [27] | ✓ | 512 |
| TGIF [24] | MS-COCO | SD2, SDXL, Firefly | ✓ | 4.1k |
| SAGI-SP [11] | MS-COCO, RAISE [7] | BrushNet, PowerPaint, HD-Painter, ControlNet | ✓ | 2k |
| FR [11] | OpenImages [17] | Inpaint-Anything | × | 1.3k |

Table 1: Overview of evaluation datasets with their real image sources and inpainting generators (SD = Stable Diffusion, BtB = Beyond the Brush, COCOG = CocoGlide). Whether the forged images keep the original background (Orig Bg: ✓) or an auto-encoded background (×) is also indicated.

Many existing inpainting methods are based on diffusion models such as Stable Diffusion [31], GLIDE [26], or the commercial model Adobe Firefly, but also on other approaches such as Fourier-convolution-based methods and residual networks like LaMa [35] for object removal. Among diffusion models, inpainting methods such as BrushNet [15], PowerPaint [42], or HD-Painter [23] come from models that are fine-tuned or adapted using extra input channels. They are used either to infer a masked part of the image during the backward diffusion process (for object insertion) or to replace an object with another. In order to o
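The sliding-window strategy from Figure 1 can be summarized as overlap-averaged fusion: each crop produces a logit map, the maps are pasted into a full-image canvas, and overlapping regions are averaged before thresholding. A minimal sketch, operating at patch-grid resolution; the window size, stride, and the dummy per-crop predictor are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def fuse_logits(grid_hw, win, stride, predict):
    """Average per-crop logit maps over a full-image patch grid."""
    H, W = grid_hw
    acc = np.zeros((H, W))   # summed logits
    cnt = np.zeros((H, W))   # number of crops covering each location
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            acc[y:y + win, x:x + win] += predict(y, x, win)
            cnt[y:y + win, x:x + win] += 1.0
    return acc / np.maximum(cnt, 1.0)   # mean logit per location

# Dummy predictor standing in for DINOv2 + linear head applied to one crop.
rng = np.random.default_rng(1)
dummy = lambda y, x, w: rng.standard_normal((w, w))

# A 72x72 patch grid (about a 1008x1008 image) covered by 36x36 windows
# (one 504x504 crop) with 50% overlap.
heatmap = fuse_logits((72, 72), win=36, stride=18, predict=dummy)
mask = heatmap > 0.0   # threshold the fused heatmap into a binary mask
```

The paper additionally post-processes the fused heatmaps to refine the binary masks; that refinement step is not shown here.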

This content is AI-processed based on ArXiv data.
