TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark
Hannes Mareen¹*, Dimitrios Karageorgiou², Paschalis Giakoumoglou², Peter Lambert¹, Symeon Papadopoulos², Glenn Van Wallendael¹

¹IDLab, Ghent University – imec, Gent, Belgium.
²Information Technologies Institute, CERTH, Thessaloniki, Greece.

*Corresponding author: hannes.mareen@ugent.be
Contributing authors: dkarageo@iti.gr; giakoupg@iti.gr; peter.lambert@ugent.be; papadop@iti.gr; glenn.vanwallendael@ugent.be

Abstract

Generative AI has made text-guided inpainting a powerful image editing tool, but at the same time a growing challenge for media forensics. Existing benchmarks, including our text-guided inpainting forgery (TGIF) dataset, show that image forgery localization (IFL) methods can localize manipulations in spliced images but struggle in fully regenerated (FR) images, while synthetic image detection (SID) methods can detect fully regenerated images but cannot perform localization. With new generative inpainting models emerging and the open problem of localization in FR images remaining, updated datasets and benchmarks are needed. We introduce TGIF2, an extended version of TGIF that captures recent advances in text-guided inpainting and enables a deeper analysis of forensic robustness. TGIF2 augments the original dataset with edits generated by FLUX.1 models, as well as with random non-semantic masks. Using the TGIF2 dataset, we conduct a forensic evaluation spanning IFL and SID, including fine-tuning IFL methods on FR images and generative super-resolution attacks. Our experiments show that both IFL and SID methods degrade on FLUX.1 manipulations, highlighting limited generalization. Additionally, while fine-tuning improves localization on FR images, evaluation with random non-semantic masks reveals object bias. Furthermore, generative super-resolution significantly weakens forensic traces, demonstrating that common image enhancement operations can undermine current forensic pipelines. In summary, TGIF2 provides an updated dataset and benchmark, which enables new insights into the challenges posed by modern inpainting and AI-based image enhancements. TGIF2 is available at https://github.com/IDLabMedia/tgif-dataset.

Keywords: Image Forensics, Forgery Detection, Forgery Localization, Synthetic Image Detection, AI-Generated Image Detection, Super Resolution

1 Introduction

Digital image manipulation has become increasingly accessible and efficient, producing more realistic forged images with recent advances in generative AI (GenAI) technology [1]. A few years ago, creating convincing image edits required expertise in software such as Adobe Photoshop and significant manual effort. More recently, deepfake technology [2] enabled realistic face swaps and modifications, but these were typically limited to faces and required technical skills. Today, modern open-source and commercial GenAI models, such as Stable Diffusion (SD) [1], Adobe Firefly [3], and FLUX [4], can perform high-resolution edits that are often indistinguishable from authentic images. Additionally, these GenAI models are accessible to users without technical expertise, as they work by simple text prompting [1, 5]. Such capabilities have positive applications in creative industries, image restoration, and education. However, they also pose serious risks, including fraud, disinformation, and the creation of fake evidence [6].
Among GenAI approaches, text-guided inpainting [1, 5] is particularly powerful. It allows users to select a region of an (authentic) image and provide a textual prompt describing the desired edit. Diffusion-based models then regenerate the image so that the edited region matches the prompt, while leaving the rest of the image visually intact. Depending on the workflow, the edited region may be spliced into the original image (i.e., only changing pixels in the selected region) or the model may regenerate the entire image (i.e., potentially changing all pixels in the image, yet only semantically altering the selected region). This is demonstrated in Fig. 1.

Fig. 1 Examples of the two ways of inpainting. An (a) authentic image can be inpainted in a (b) selected region as mask, with skis as prompt. In the GenAI-based inpainting process (here using SDXL), (c) the full image is regenerated during editing. To minimize artifacts, (d) only the region corresponding to the mask can be spliced into the authentic image. (e) and (f) provide zoomed-in versions to more clearly show the differences between the original or spliced pixels, on the one hand, and the regenerated pixels outside of the masked area, on the other. Subtle differences due to the regenerative process can be observed in the teeth, sunglasses, beard, etc.

In previous work [7], we showed that when regenerating the entire image during editing, the forensic traces that traditional image forgery localization (IFL) methods use are destroyed. Moreover, our previous work has found that synthetic image detection (SID) methods can detect fully regenerated images as synthetic, but they are not designed to localize manipulated regions.

The Text-Guided Inpainting Forgery (TGIF) dataset proposed in our previous work [7] provided high-resolution images manipulated with three text-guided inpainting models (SD2 [1], SDXL [8], and Adobe Firefly [3]), including both spliced and fully regenerated images. While TGIF provided a valuable dataset, benchmark, and insights, recent developments in generative models require an updated, more comprehensive dataset. First, new generative inpainting models have been released since. Most notably, FLUX.1 [4] is a new generation of diffusion model that replaces the traditional U-Net and diffusion pipeline of traditional diffusion models (such as SD) with a transformer and flow-matching framework. As such, FLUX.1 achieves higher-quality local and global generations [9]. Second, the previously acquired insight that generative edits significantly impact IFL models requires further exploration for potential solutions. This highlights the need for a more comprehensive benchmark and experiments that deepen our understanding of the impact of recent GenAI inpainting models on forensic methods.

To this end, we introduce TGIF2, an extended version of TGIF, with several new contributions:

• FLUX inpainting: TGIF was limited to using SD2, SDXL, and Adobe Firefly to inpaint images. TGIF2 extends the TGIF dataset by additionally including FLUX.1 models ([schnell], [dev] & Fill [dev]).
The FLUX.1 models are a new generation of diffusion models, and are among the most capable open-source text-guided inpainting models available today [9]. In contrast to traditional SD models, they utilize flow-matching instead of denoising diffusion and rely on large multimodal transformers rather than a U-Net for better contextual reasoning.

• Random non-semantic masks: In addition to semantic, object-centric masks, TGIF2 includes randomly positioned, non-semantic masks, allowing us to evaluate potential biases of IFL methods toward objects or other semantically meaningful regions.

• New experimental contributions: We provide new results for IFL methods fine-tuned on TGIF2's fully regenerated images, the relation of the IFL performance with generative quality, and the effect of super resolution (SR) [10] as an attack against IFL and SID methods.

These additions make TGIF2 not only a larger dataset but enable new insights into forensic challenges. First, the inclusion of FLUX.1-inpainted images allows evaluation of how current IFL and SID methods generalize to newer generative models. Second, the provided random, non-semantic masks may reveal potential biases of IFL methods toward objects or semantically meaningful regions, highlighting limitations in their robustness. For example, we demonstrate that IFL models that are fine-tuned on non-random, semantic FR data enable localization in FR images. However, our evaluation on images inpainted with random, non-semantic masks reveals moderate semantic biases. Lastly, the super-resolution attack exposes the risk that routine GenAI-based image enhancement can erase traces used for forensic analysis. In summary, TGIF2 provides a more comprehensive dataset and evaluation framework that exposes current limitations and guides the development of more effective forensic methods for the detection and localization of realistic, text-guided inpainting manipulations.

The remainder of this paper is organized as follows. Section 2 reviews related work in generative media forensics and existing datasets. Section 3 presents the TGIF2 dataset, including the authentic data sources, inpainting models, workflow, and resulting subsets. Section 4 reports our forensic benchmark, covering the performance of existing IFL and SID methods, fine-tuning experiments, the generative quality, and the impact of super-resolution attacks. Finally, Section 5 discusses our findings and concludes the paper.

2 Related Work

This section briefly discusses recent advances in IFL and SID methods, in Section 2.1 and Section 2.2, respectively. Then, Section 2.3 discusses related datasets.

2.1 Image Forgery Localization

Image forgery localization aims to identify the regions of an image that have been manipulated. Approaches in this area can be roughly grouped into two categories. Some methods focus on detecting subtle statistical inconsistencies, for example traces of compression [11] or sensor noise patterns [12]. Others exploit more visible cues, such as unnatural transitions along the boundary between pristine and altered content [13]. To improve robustness, several works combine multiple forensic signals within a unified framework [14–16].
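To make the first category more concrete, the sketch below illustrates the classic error-level-analysis idea behind compression-trace methods: recompress an image at a fixed JPEG quality and inspect where the residual deviates, since regions with a different compression history respond differently. This is a deliberately simplified illustration of the general principle, not the method of [11]; the input file name is hypothetical.

```python
import numpy as np
from PIL import Image
from io import BytesIO

def error_level_analysis(path, quality=90):
    """Recompress an image and return the per-pixel absolute residual.

    Regions with a different compression history than the rest of the
    image tend to stand out in this residual map.
    """
    original = Image.open(path).convert("RGB")

    # Re-encode at a fixed JPEG quality and decode again.
    buffer = BytesIO()
    original.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    recompressed = Image.open(buffer).convert("RGB")

    # Absolute difference between the image and its recompressed version.
    residual = np.abs(
        np.asarray(original, dtype=np.int16)
        - np.asarray(recompressed, dtype=np.int16)
    ).astype(np.uint8)
    return residual

residual = error_level_analysis("suspect.jpg")  # hypothetical input file
print("mean residual per channel:", residual.mean(axis=(0, 1)))
```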
Some existing research also considers forgeries created with inpainting techniques. Specialized IFL methods in this setting include detectors that leverage distortions in the Laplacian domain [17], or that learn discriminative features directly from data [18, 19]. These approaches, however, have largely been outperformed by more general methods designed to work across different manipulation types [20, 21]. Current state-of-the-art models [13, 20–25] are mostly based on deep neural networks and are trained on datasets covering a wide variety of manipulations, such as splicing, copy-move, or conventional inpainting. Related to inpainting, PSCC-Net [20] was partly trained on manipulated data using a conventional inpainting method [26], and MantraNet [24] was partly trained on images manipulated with OpenCV's inpainting method. More datasets that include inpainted images are discussed in Section 2.3. While training on a variety of manipulations makes models broadly applicable, these existing models have not been optimized for the recent case of text-guided image manipulation.

A complementary line of work has examined laundering attacks, where generative models erase forensic traces while leaving the semantic content unchanged [27]. These results demonstrate that generative compression, or diffusion-based regeneration in general, can serve as an effective laundering mechanism, substantially weakening IFL methods. The TGIF2 dataset proposed in Section 3 directly addresses these challenges by including fully regenerated manipulated content, thereby enabling systematic evaluation of IFL robustness in these challenging scenarios. Additionally, the impact of generative super resolution is evaluated in the TGIF2 benchmark, in Section 4.6.

2.2 Synthetic Image Detection

Synthetic image detection methods aim to distinguish between authentic and AI-generated visual content. The field initially developed alongside the rise of Generative Adversarial Network (GAN)-based image synthesis, and many early detectors were trained specifically on GAN outputs [28–34]. With the rapid adoption of (latent) diffusion models (LDMs or DMs), such as Stable Diffusion (SD) [1], more recent work has focused on this paradigm or explores detectors that can generalize across both GANs and DMs [35–40]. As these methods are trained and evaluated on academic datasets, their performance often decreases in practice when evaluated on images found in the wild. However, their performance can be significantly improved by leveraging in-the-wild content for training [41].

Most approaches rely on the idea that generative models leave behind subtle traces or fingerprints, which can be captured in various feature domains. The type of representation is critical for generalization. For instance, some detectors trained purely on GAN-generated data are still able to identify DM-generated images, highlighting the persistence of shared artifacts across generative families [30, 33, 34].

Despite this progress, most SID research assumes a binary scenario where an entire image is either authentic or synthetic. This setting overlooks more subtle manipulations, such as text-guided inpainting, where a forged region is blended with pristine content. In such cases, SID methods produce a global score but cannot localize the edited area. Moreover, authentic images that are post-processed using generative models without altering their semantics have been shown to trigger SID methods [27, 42].
The TGIF2 dataset introduces subsets where manipulated content is confined to local regions but fully regenerated through advanced diffusion-based editing. This allows us to examine not only the limits of SID methods, but also their vulnerability to laundering-style manipulations during generative editing. Moreover, the impact of generative super resolution on SID methods is evaluated in Section 4.6.

2.3 Text-Guided Inpainting Datasets

A wide range of datasets has been developed for image forensics, covering both IFL and SID, as summarized in several recent surveys [6, 43]. Many of these resources focus on specific manipulation types such as splicing, copy-move, or deepfakes, while others aim to provide more diverse manipulations. For inpainting, earlier datasets such as NIST16 [44] and DEFACTO [45] include localized edits, but these were created with non-AI-based inpainting tools and do not involve text guidance.

Early Text-Guided Inpainting Datasets

Recently, several datasets with text-guided inpainted images have been released. First, COCO-Glide [21] includes 512 images inpainted with GLIDE [5]. While valuable as a proof of concept, COCO-Glide is relatively small and limited to low-resolution crops (256 × 256 px) from MS-COCO [46]. Additionally, the AutoSplice dataset [47] contains approximately 3k inpainted spliced images created using DALL-E 2. Most importantly, both datasets do not account for differences between spliced and fully regenerated edits.

Fully Regenerated Inpainted Images

TGIF [7] presents approximately 75k images manipulated using three text-guided inpainting techniques (SD2, SDXL, and Adobe Firefly) at higher resolutions (up to 1024 × 1024 px), sourced from approximately 3k images from MS-COCO. TGIF also differentiates between spliced and fully regenerated images, which was one of the key contributions of the work. After the release of TGIF, several new datasets on text-guided inpainting were proposed. Most notably, SAGI [48] provides approximately 95k manipulated images sourced from 3 datasets (MS-COCO, RAISE [49], and OpenImages [50]), with resolutions up to 2048 × 2048 px. SAGI uses advanced pipelines (HD-Painter, BrushNet, PowerPaint, ControlNet [51], and Inpaint-Anything) to semantically alter images, based on SD2 (a DM) and LaMa [52] (a GAN-based approach). To the best of our knowledge, SAGI is the only other existing dataset (i.e., apart from TGIF) that differentiates between spliced and fully regenerated images.

Other AI-based Inpainting Datasets

COinCO [53] provides approximately 97k manipulated images from COCO, using SD2 and out-of-context prompts. Furthermore, the GRE dataset [54] presents approximately 228k images inpainted using six generative editing methods and pipelines (LaMa [52], MAT [55], SD2, ControlNet [51], PaintByExample, and Adobe Photoshop). The UniAIDet [56] benchmark includes a dataset of 10k inpainted images, only considering splicing. None of these inpainting datasets consider local manipulation using new generative models such as FLUX.1. To the best of our knowledge, local manipulation with FLUX.1 was only included in the OpenSDI [57] dataset, which additionally includes SD1.5, SD2.1, SDXL, and SD3, to comprise approximately 150k images that are either globally synthesized or locally manipulated. Note that they do not consider local manipulation with full regeneration.
Going Beyond Object Replacement To Limit Bias

All of the above datasets inpaint objects in images, which we hypothesize may introduce a potential bias when using them for training or evaluation. There are a few datasets that go beyond object replacement, namely Beyond the Brush (BtB) [58], GIM [59], and BR-Gen [60]. In BtB [58], images are inpainted using Fooocus, which internally uses SDXL. Objects are automatically segmented from images, and a vision-language model is used to generate appropriate inpainting prompts (spanning from replacing the object with a new object type, to removing it altogether). Next, GIM [59] is a million-scale dataset of inpainted images, created either by replacing objects using SD2 and GLIDE or by removing them using the Denoising Diffusion Null-Space Model (DDNM) [61]. Finally, BR-Gen [60] presents approximately 150k locally forged images, including not only global and main object inpaintings, but also secondary object and background inpaintings, therefore accounting for salient object bias. As generative editing methods, they use LaMa [52] and MAT [55] as GAN-based approaches and SDXL [8], BrushNet [62], and PowerPaint [63] as diffusion-based pipelines.

Bias is an important challenge in dataset design. Recent studies [40, 64] have shown that forensic models can exploit shortcuts or artifacts that are specific to a dataset rather than to the underlying manipulation process. This not only inflates reported performance but also limits cross-dataset generalization. Designing datasets that minimize such biases is therefore critical for robust training and evaluation. Hence, in our work, we carefully inspect potential semantic biases, originating from training on object-only replacement.

Summary of Text-Guided Inpainting Datasets

An overview of these datasets with generative inpainted images is given in Table 1, where we focus on spliced vs. fully regenerated images, the used generative models and pipelines, as well as the presence of a subset that goes beyond object replacement. In summary, no dataset exists that covers all three of the following aspects. First, only OpenSDI includes local manipulation with recent generative models such as FLUX [4], which has established itself as one of the strongest diffusion-based editors. Second, only TGIF and SAGI explicitly differentiate between spliced and fully regenerated images, whereas providing localization in the latter is still an open problem. Third, few datasets (i.e., only BtB, GIM, and BR-Gen) consider potential object-based biases and provide non-object-based generative inpainted images. The lack of a dataset that addresses all these gaps has motivated the creation and release of the TGIF2 dataset.

3 TGIF2 Dataset

The TGIF2 dataset is an extension of the original TGIF dataset [7]. In general, compared to existing datasets in Table 1, the TGIF2 dataset is the only dataset that has all three of the following properties: (1) it includes newer generative models (i.e., FLUX.1 models), (2) it explicitly differentiates between spliced and fully regenerated images, and (3) it goes beyond object replacement (by additionally using random, non-semantic masks). TGIF2 is available for download at https://github.com/IDLabMedia/tgif-dataset.
Table 1 Overview of datasets with generative inpainted images. SP = spliced, FR = fully regenerated.

| Name | SP | FR | # fake images | Gen. models & pipelines | Beyond obj. repl. |
|---|---|---|---|---|---|
| Coco-Glide [21] | ✓ | - | 512 | GLIDE | - |
| AutoSplice [47] | ✓ | - | 3k | DALL-E2 | - |
| TGIF [7] | ✓ | ✓ | 75k | SD2, SDXL, Adobe Firefly | - |
| SAGI [48] | ✓ | ✓ | 95k | SD2, LaMa, HD-Painter, BrushNet, PowerPaint, ControlNet, Inpaint-Anything | - |
| COinCO [53] | ✓ | - | 97k | SD2 | - |
| GRE [54] | ✓ | - | 228k | LaMa, MAT, SD2, ControlNet, PaintByExample, Photoshop | - |
| UniAIDet [56] | ✓ | - | 10k | SD2, SDXL, SD3, DreamShaper, PaintByExample | - |
| OpenSDI [57] | ✓ | - | 150k | SD1.5, SD2.1, SDXL, SD3, FLUX.1 | - |
| BtB [58] | ✓ | - | 22k | Fooocus (SDXL) | ✓ |
| GIM [59] | ✓ | - | 1.1M | SD, GLIDE, DDNM | ✓ |
| BR-Gen [60] | ✓ | - | 150k | LaMa, MAT, SDXL, BrushNet, PowerPaint | ✓ |
| TGIF2 (ours) | ✓ | ✓ | 271k | SD2, SDXL, Adobe Firefly, FLUX.1 schnell/dev/fill-dev | ✓ |

The remainder of this section is organized as follows. Section 3.1 describes the source of authentic images and their properties. Section 3.2 details the inpainting models used to generate manipulations. Section 3.3 explains the automated workflow for generating the variety of inpainted images, ending with an overview of the different subsets in Table 2. Finally, Section 3.4 quantifies the dataset size and splits.

3.1 Source of Authentic Images

TGIF2 uses the same source of authentic images as TGIF, namely the MS-COCO dataset [46] (val2017), which is licensed under a Creative Commons Attribution 4.0 License. COCO provides images with associated captions, segmentation masks, and object bounding boxes, covering 80 object categories. To comply with restrictions in Adobe Firefly, we excluded two categories (knife and, surprisingly, frisbee) that were not allowed as prompts in the app. From each object category, a maximum of 50 images were selected. Images were filtered to ensure both dimensions are at least 512 pixels, while the free Flickr service limited the dimensions to 1024 pixels each. We additionally required object bounding boxes to be smaller than 512 × 512 px and larger than 64² px in area. This ensures sufficient resolution for high-fidelity inpainting while avoiding excessively small objects that may produce ambiguous manipulations. These selection and filtering criteria ensure that TGIF2 builds on a diverse yet well-controlled set of authentic images suitable for high-quality and consistent inpainting experiments.
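These criteria map directly onto the COCO annotation format. The following minimal sketch shows one way to implement them with pycocotools; the annotation path, the upper-bound dimension check, and the helper names are illustrative assumptions on our side, not the dataset's released tooling.

```python
from pycocotools.coco import COCO

# Hypothetical path to the COCO val2017 instance annotations.
coco = COCO("annotations/instances_val2017.json")

def eligible_annotations(img, anns, min_dim=512, max_dim=1024,
                         max_box=512, min_area=64 ** 2):
    """Apply TGIF2-style image and bounding-box filters."""
    # Both image dimensions must be at least 512 px (and at most 1024 px
    # for the Flickr-sourced images, as assumed here).
    if not (min_dim <= img["width"] <= max_dim and
            min_dim <= img["height"] <= max_dim):
        return []
    keep = []
    for ann in anns:
        x, y, w, h = ann["bbox"]
        # Bounding box smaller than 512x512 px and larger than 64^2 px area.
        if w < max_box and h < max_box and w * h > min_area:
            keep.append(ann)
    return keep

selected = []
for img_id in coco.getImgIds():
    img = coco.loadImgs(img_id)[0]
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False))
    if eligible_annotations(img, anns):
        selected.append(img["file_name"])
print(f"{len(selected)} candidate images")
```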
3.2 Inpainting Models

TGIF2 includes several inpainting models to cover a wide range of generative capabilities. Most notably, the three FLUX.1 model variants are new in TGIF2, compared to TGIF [7]. Note that the images inpainted with PS, SD2, and SDXL are exactly equal in TGIF and TGIF2, with the exception of those with random non-semantic masks (as explained further in Section 3.3).

• Adobe Firefly (PS): commercial model accessed through Photoshop 25.4.0, in early 2024. This produces spliced inpainted regions with an automatically added gradient border.

• Stable Diffusion 2 (SD2): a first-generation open-source diffusion model. We used the inpainting fine-tuned model (stabilityai/stable-diffusion-2-inpainting).

• Stable Diffusion XL (SDXL): a second-generation open-source diffusion model that supports higher-resolution outputs (i.e., up to 1024 × 1024 px) and improved visual quality. We used the inpainting fine-tuned model (stabilityai/stable-diffusion-xl-1.0-inpainting-0.1).

• FLUX.1 models: a third-generation open-source inpainting architecture, offering improved attention mechanisms and context-aware generation. Specifically, three FLUX.1 variants are included:
  – FLUX.1 [schnell]: model capable of generating high-quality images in only 1 to 4 diffusion steps. It is trained using latent adversarial diffusion distillation (based on the closed-source FLUX.1 [pro]), hence the optimized speed comes at the cost of a decrease in image quality.
  – FLUX.1 [dev]: model optimized to balance speed with image quality. Trained using guidance distillation (based on the closed-source FLUX.1 [pro]), making FLUX.1 [dev] more computationally efficient yet producing slightly lower quality images.
  – FLUX.1 Fill [dev]: same architecture as FLUX.1 [dev], but fine-tuned for the inpainting task.

Note that FLUX.1 [schnell] and FLUX.1 [dev] were trained for full synthetic image generation, in contrast to FLUX.1 Fill [dev] and the SD models, which were fine-tuned for the inpainting use case. However, these full-image generation models can be plugged into the inpainting workflow without any adaptations, and perform (surprisingly) well for this use case. In fact, on average, they produce higher quality inpaintings than FLUX.1 Fill [dev] (see Section 4.4). Using the FLUX.1 [schnell] model is especially practical, as it requires much fewer diffusion steps than FLUX.1 Fill [dev] and hence has significantly lower computational complexity (as further discussed in Section 3.3).

It should additionally be noted that FLUX.1 [pro] is available as well, although only through a pay-per-use API. Closed-source models behind APIs do not allow easy reproducibility by the community, and they only provide limited control over their inputs, making them unsuitable for an in-depth forensic analysis. Therefore, we did not include this generative model in our dataset.

Fig. 2 shows a side-by-side comparison of an example authentic image, the bounding box used as mask, and six inpainted images generated by the six GenAI models, respectively. Note that, in this example, Fig. 2c and 2e (inpainted using SD2 and PS, respectively) are spliced, whereas the other inpainted images are fully regenerated. Additionally, a bounding box was used as mask. Together, these models span commercial and open-source approaches across three generations of diffusion architectures, enabling TGIF2 to capture a broad spectrum of inpainting characteristics.

Fig. 2 Side-by-side comparison of (a) a real image of a cat with (b) the mask used for inpainting, and (c)–(h) six inpainted versions using the six different inpainting models used in TGIF2, respectively. Note that other masks can be used as well (see Fig. 3).
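As an illustration of how such a model is invoked, the snippet below sketches a text-guided inpainting call to FLUX.1 Fill [dev] through the Hugging Face diffusers library, which exposes a FluxFillPipeline for this model. The file names, dtype, and example prompt are our own assumptions for illustration, not the exact TGIF2 generation script.

```python
import torch
from diffusers import FluxFillPipeline
from PIL import Image

# Load the inpainting fine-tuned FLUX.1 variant (a gated model on the Hub).
pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open("authentic.png").convert("RGB")  # hypothetical inputs
mask = Image.open("mask.png").convert("L")          # white = region to edit

# The pipeline regenerates the full frame, conditioned on image, mask, prompt.
result = pipe(
    prompt="a cat sitting on a wooden bench",  # illustrative prompt
    image=image,
    mask_image=mask,
    num_inference_steps=30,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
result.save("fully_regenerated.png")
```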
3.3 Inpainting Workflow

The inpainting workflow is exactly the same for all inpainting models, with the only difference being the generative model that is called. In particular, for each authentic image, inpainted variants are automatically generated following these steps:

• Mask Selection: For each selected image, we use either the object's semantic segmentation mask, the object's bounding box (bbox) mask (both provided by COCO), or a random non-semantic mask. The random, non-semantic masks are at a random location in the (cropped) image, and with a random rectangular size (min. 64 px and max. 60% of the image's width and height). Examples of these three mask types and corresponding inpainted versions are shown in Fig. 3.

• Prompt Generation: For SD2, SDXL, and FLUX.1, the prompt combines the object category and the image caption (provided by COCO). During initial experiments, we noticed that including the caption resulted in better results, which was also observed in related work [65, 66]. For Adobe Firefly, only the object category is used, as we noticed that this was sufficient in our initial experiments. For the random, non-semantic mask, an empty prompt is used.

• Generation Parameters: Inference steps, guidance scale, and seed were randomly selected per image. For each model and each mask, three variations were generated with three different seeds. The number of inference steps is between 10 and 50 for the SD and FLUX models, and between 3 and 6 for FLUX.1 [schnell]. The guidance scale is between 1 and 10, except for FLUX.1 [schnell] where it is set to 0 (as it does not support this parameter).

• Splicing vs. Full Regeneration: We save up to two variants of each inpainted image, namely a fully regenerated (FR) and a spliced (SP) version. The differences between these two are illustrated in Fig. 1 and explained below (a minimal code sketch of the mask generation and splicing steps follows after Table 2).
  – Full Regeneration (FR): When inpainting using generative models, the entire input image is used to condition the diffusion process. While only the masked area is semantically changed, the non-masked area is subtly altered as well (as visualized in Fig. 1).
  – Splicing (SP): after performing full regeneration, the inpainted region is inserted (or spliced) back into the original image using the provided mask. This approach preserves the rest of the original image and follows the strategy used in common inpainting pipelines, such as Adobe Firefly and HuggingFace diffusers [67]. This is also the most commonly used inpainting method in existing datasets (see Section 2.3).

• Generative Quality Metrics: Each image also includes aesthetic quality scores (NIMA [68], GIQA [69]) and image-text-matching (ITM) scores [70], measured on the object's inpainted area. Additionally, we include the preservation fidelity, calculated using the Peak Signal-to-Noise Ratio (PSNR), Structural SIMilarity index measure (SSIM), and the LPIPS perceptual metric [71] between the real and FR image, in the real but fully regenerated areas.

Fig. 3 The (a)–(c) three types of masks used and (d)–(f) three corresponding inpainted examples, respectively. All inpainted images used Fig. 2a as input image.

Table 2 Overview of TGIF2 dataset subsets grouped by inpainting model. Columns indicate whether spliced (SP), fully regenerated (FR), and Semantic or Random mask variants are included.

| Inpainting Model | Semantic SP | Semantic FR | Random SP | Random FR |
|---|---|---|---|---|
| SD2 | ✓* | ✓* | ✓ | ✓ |
| SDXL | - | ✓* | - | ✓ |
| Adobe Firefly (PS) | ✓* | - | - | - |
| FLUX.1 [schnell] | ✓ | ✓ | ✓ | ✓ |
| FLUX.1 [dev] | ✓ | ✓ | ✓ | ✓ |
| FLUX.1 Fill [dev] | ✓ | ✓ | ✓ | ✓ |

\* These subsets are already included in the original TGIF dataset.
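To make the mask-selection and splicing steps concrete, the sketch below generates a random non-semantic rectangle under the size constraints above and composites a fully regenerated output back into the original image. It is a simplified stand-in for the released generation scripts, with NumPy/PIL as assumed tooling and hypothetical file names.

```python
import random
import numpy as np
from PIL import Image

def random_rect_mask(width, height, min_size=64, max_frac=0.6, seed=None):
    """Random non-semantic rectangular mask: min. 64 px per side,
    max. 60% of the image's width and height, at a random location."""
    rng = random.Random(seed)
    w = rng.randint(min_size, int(width * max_frac))
    h = rng.randint(min_size, int(height * max_frac))
    x = rng.randint(0, width - w)
    y = rng.randint(0, height - h)
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[y:y + h, x:x + w] = 255  # white = region to inpaint
    return Image.fromarray(mask, mode="L")

def splice(original, fully_regenerated, mask):
    """SP variant: paste only the masked region of the FR output back
    into the authentic image, keeping all other pixels untouched."""
    return Image.composite(fully_regenerated, original, mask)

original = Image.open("authentic.png").convert("RGB")        # hypothetical
fr = Image.open("fully_regenerated.png").convert("RGB")      # same size
mask = random_rect_mask(*original.size, seed=0)
splice(original, fr, mask).save("spliced.png")
```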
Each of the steps above is applied for each inpainting model, with some exceptions and notes. First, SD2 is limited to processing images with resolution 512 × 512, which is the resolution of the fully regenerated images in the corresponding subsets. In contrast, for SDXL and the FLUX.1 variants, the resolution is up to 1024 px. This only affects the fully regenerated variant, and not the spliced variant. Second, Adobe Firefly (Photoshop, PS) only provides a spliced variant, but no full regeneration variant. Third, we do not provide the spliced variant for SDXL, as it discolors the entire image, which is extremely visible when splicing into the original (and a known issue [72]). This discoloration can also be observed when comparing Fig. 2a with 2d. Fourth, we do not generate random masks for Adobe Firefly / PS, since it only produces spliced edits, which is the least challenging use case (as discussed in Section 4.2). Table 2 gives an overview of the 19 resulting TGIF2 subsets.

3.4 Dataset Statistics and Splits

Starting from 3,124 authentic images, TGIF2 contains 18,744 manipulated images per SP subset or FR subset, and 9,372 manipulated images per Random (SP or FR) subset (as only one mask type, i.e., a random rectangular one, is used as opposed to two, i.e., a segmentation mask and bounding box). This amounts to a total of 271,788 manipulated images in TGIF2, spread over the 19 subsets presented in Table 2 (of which 4 subsets containing 74,976 images are already included in the original TGIF dataset). Furthermore, the dataset is split approximately by 80/10/10% into training, validation, and test sets, carefully avoiding category overlap and image leakage between splits. This makes TGIF2 a reliable resource for training and evaluating forensic models.

4 Forensic Benchmark

We provide a comprehensive benchmark of image forensic methods on the TGIF2 dataset. Compared to the original TGIF benchmark, TGIF2 enables a more diverse and challenging evaluation by introducing new subsets, new fine-tuned models, and new attack scenarios. Section 4.1 describes the evaluation setup employed in the experiments. Then, in Section 4.2, we evaluate existing IFL methods on the expanded TGIF2 subsets. Section 4.3 investigates whether fine-tuning IFL models on our fully regenerated training sets improves their ability to localize manipulations in FR inpainted images, while accounting for potential object biases. Section 4.4 analyzes the generative quality and whether it impacts IFL performance. Section 4.5 extends the benchmark of SID methods by including the new TGIF2 subsets and recent state-of-the-art SID methods. Section 4.6 evaluates the generative super-resolution attack scenario as a post-processing step.

4.1 Evaluation Setup

We perform evaluations on our TGIF2 test set, separately considering each of the subsets in Table 2. This setup allows us to systematically compare model behavior across different manipulation types and mask strategies.

For IFL methods, as performance metric, we report the mean pixel-level F1 score over manipulated images, using a threshold of 0.5 to decide whether pixels are real or fake. We additionally provide the mean Intersection over Union (IoU) results in Appendix A. This provides a calibrated measure of localization accuracy, which is important for practical deployment and human interpretability. We select several recent high-performing AI-based IFL methods, namely PSCC-Net [20], SPAN [22], ImageForensicsOSN [23], MVSS-Net++ [13], MantraNet [24], CAT-Net (v2) [25], TruFor [21], and MMFusion (MMF) [14]. During evaluation, we specifically distinguish among the semantic (Sem) and the random (Rand) subsets.
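The localization metrics can be stated compactly. The sketch below computes the per-image pixel-level F1 (at the fixed 0.5 threshold) and IoU from a soft prediction map and a binary ground-truth mask, consistent with the definitions above; the benchmark's exact evaluation code may differ.

```python
import numpy as np

def pixel_f1_iou(pred_map, gt_mask, threshold=0.5):
    """Pixel-level F1 and IoU for one image.

    pred_map: float array in [0, 1] (per-pixel forgery probability).
    gt_mask:  boolean array, True where the pixel was manipulated.
    """
    pred = pred_map >= threshold  # binarize at the fixed 0.5 threshold
    tp = np.logical_and(pred, gt_mask).sum()
    fp = np.logical_and(pred, ~gt_mask).sum()
    fn = np.logical_and(~pred, gt_mask).sum()
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return f1, iou

# Toy example: a 4x4 prediction against its ground truth.
pred = np.array([[0.9, 0.8, 0.1, 0.0]] * 4)
gt = np.array([[True, True, False, False]] * 4)
print(pixel_f1_iou(pred, gt))  # -> (1.0, 1.0)
```

The reported score is then the mean of these per-image values over all manipulated images of a subset.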
For SID methods, we report the area under the ROC curve (AUC) as performance metric. We opt for AUC rather than the threshold-specific F1, since we noticed that some methods tend to overpredict the fake class, artificially inflating F1, whereas the AUC better reflects the trade-off between false positives and false negatives across thresholds. We select several AI-based SID methods available in the SIDBench framework [73], namely CNNDetect [28], DIMD [35], Dire [36], FreqDetect [29], UnivFD [30], Fusing [74], GramNet [31], LGrad [32], NPR [33], RINE [37], DeFake [38], and PatchCraft [34]. Additionally, new in TGIF2, we incorporate the recent models SPAI [39], SPAI-ITW [41], RINE-ITW [41], and B-Free [40]. By combining a diverse set of recent SID approaches with a systematic evaluation protocol, we provide a comprehensive benchmark of current forensic capabilities on generative image inpainting.

4.2 Image Forgery Localization

We benchmark recent IFL methods on the TGIF2 dataset. Recall that, compared to the original TGIF benchmark, TGIF2 includes several new subsets (recent FLUX.1 models and random non-semantic masks). Table 3 and the top part of Table 4 report the F1 values for all evaluated IFL methods on the spliced subsets and on the fully regenerated subsets, respectively. We highlight F1 values above 0.7 in bold font, chosen empirically as a threshold to allow for quickly grasping which methods demonstrate good detection performance.

Table 3 Evaluation of image forgery localization methods on the spliced (SP) subsets of our TGIF2 dataset, for both the semantic (Sem) and random (Rand) versions (F1 score). F1 scores above 0.7 are highlighted in bold.

| IFL Method | SD2 Sem | SD2 Rand | PS Sem | FLUX.1 [schnell] Sem | FLUX.1 [schnell] Rand | FLUX.1 [dev] Sem | FLUX.1 [dev] Rand | FLUX.1 Fill [dev] Sem | FLUX.1 Fill [dev] Rand | Avg. Sem | Avg. Rand | Avg. All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSCC-Net [20] | 0.15 | 0.17 | 0.39 | 0.03 | 0.09 | 0.02 | 0.05 | 0.08 | 0.21 | 0.13 | 0.13 | 0.13 |
| SPAN [22] | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 |
| ImageForensicsOSN [23] | 0.23 | 0.09 | 0.36 | 0.31 | 0.11 | 0.23 | 0.08 | 0.12 | 0.10 | 0.25 | 0.09 | 0.18 |
| MVSS-Net++ [13] | 0.07 | 0.03 | 0.08 | 0.15 | 0.06 | 0.11 | 0.03 | 0.09 | 0.09 | 0.10 | 0.05 | 0.08 |
| MantraNet [24] | 0.15 | 0.15 | 0.56 | 0.06 | 0.02 | 0.04 | 0.02 | 0.06 | 0.04 | 0.17 | 0.06 | 0.12 |
| CAT-Net [25] | **0.89** | **0.92** | **0.86** | **0.88** | **0.93** | **0.87** | **0.92** | **0.86** | **0.93** | **0.87** | **0.92** | **0.89** |
| TruFor [21] | **0.83** | **0.87** | **0.79** | **0.79** | 0.69 | **0.73** | 0.58 | **0.74** | **0.72** | **0.77** | **0.72** | **0.75** |
| MMFusion [14] | **0.73** | **0.74** | **0.72** | 0.70 | 0.59 | 0.59 | 0.45 | 0.63 | 0.59 | 0.68 | 0.59 | 0.64 |

For the non-random Sem spliced subsets, some IFL methods (i.e., CAT-Net, TruFor, and MMFusion) show decent performance. This was already reported in the original TGIF work [7], and is in line with our expectations, as the splicing creates significant differences in forensic traces between forged and pristine regions. When evaluating the Sem FLUX.1 subsets, we notice that CAT-Net, TruFor, and MMFusion still perform well, although there is a performance drop on the latter.

We also consider inpainted images with random masks, for which the results can be observed in the Rand subsets of Table 3. Compared to the corresponding Sem subsets, we see a significant drop in performance for MMFusion and TruFor, especially on the FLUX.1 [schnell] and [dev] subsets. In contrast, CAT-Net remains highly effective in detecting the spliced region in the Rand subsets, as do TruFor and MMFusion on the SD2 subset. This suggests that MMFusion and TruFor likely suffer from a detection bias towards object boundaries or semantic cues.
The limitations exposed in this subsection should be taken into account when deploying these methods in real-world detection systems.

In contrast to spliced images, the FR setting is much more challenging, as was already observed in the original TGIF benchmark [7]. These insights are confirmed in the new FR FLUX.1 subsets, for which the average F1 scores are still low (i.e., well below 0.5). This is the case for both the Sem and Rand subsets. In FR images, the clear boundary between forensic traces in manipulated and authentic regions disappears due to the regenerative process of the entire image during manipulation.

In summary, while some recent state-of-the-art IFL methods perform reliably on spliced manipulations across generative editing models, they struggle on fully regenerated images. The random-mask experiments further reveal that some methods, such as MMFusion and TruFor, may rely on semantic or object-related biases rather than imperceptible forensic traces. Together, these findings highlight the need for developing IFL approaches that generalize beyond semantic object boundaries, and that remain effective in fully regenerated images. To further delve into this issue, Section 4.3 evaluates whether we can fine-tune IFL methods in order to detect FR images.

Table 4 Evaluation of image forgery localization methods on the fully regenerated (FR) subsets of our TGIF2 dataset, for both the semantic (Sem) and random (Rand) versions (F1 score). The last rows show the IFL methods fine-tuned on the FR subsets. F1 scores above 0.7 are highlighted in bold.

| IFL Method | SD2 Sem | SD2 Rand | SDXL Sem | SDXL Rand | FLUX.1 [schnell] Sem | FLUX.1 [schnell] Rand | FLUX.1 [dev] Sem | FLUX.1 [dev] Rand | FLUX.1 Fill [dev] Sem | FLUX.1 Fill [dev] Rand | Avg. Sem | Avg. Rand | Avg. All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSCC-Net [20] | 0.05 | 0.04 | 0.05 | 0.09 | 0.02 | 0.02 | 0.02 | 0.02 | 0.03 | 0.05 | 0.03 | 0.04 | 0.04 |
| SPAN [22] | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| ImageForensicsOSN [23] | 0.20 | 0.09 | 0.18 | 0.09 | 0.30 | 0.08 | 0.23 | 0.07 | 0.09 | 0.05 | 0.20 | 0.08 | 0.13 |
| MVSS-Net++ [13] | 0.06 | 0.03 | 0.09 | 0.05 | 0.15 | 0.03 | 0.12 | 0.03 | 0.07 | 0.04 | 0.10 | 0.04 | 0.07 |
| MantraNet [24] | 0.03 | 0.01 | 0.05 | 0.06 | 0.05 | 0.02 | 0.04 | 0.01 | 0.02 | 0.01 | 0.04 | 0.02 | 0.03 |
| CAT-Net [25] | 0.04 | 0.01 | 0.03 | 0.00 | 0.05 | 0.01 | 0.04 | 0.01 | 0.03 | 0.02 | 0.04 | 0.01 | 0.02 |
| TruFor [21] | 0.19 | 0.07 | 0.18 | 0.09 | 0.33 | 0.09 | 0.25 | 0.06 | 0.16 | 0.08 | 0.22 | 0.08 | 0.14 |
| MMFusion [14] | 0.15 | 0.05 | 0.19 | 0.08 | 0.34 | 0.08 | 0.20 | 0.04 | 0.12 | 0.06 | 0.20 | 0.06 | 0.13 |
| TruFor – fine-tuned | 0.49 | 0.39 | 0.68 | **0.83** | **0.87** | **0.94** | **0.82** | **0.88** | **0.76** | **0.85** | **0.72** | **0.78** | **0.75** |
| MMFusion – fine-tuned | 0.48 | 0.39 | 0.66 | **0.83** | **0.86** | **0.94** | **0.81** | **0.87** | **0.75** | **0.84** | **0.71** | **0.77** | **0.74** |
Table 5 Evaluation of individually fine-tuned image forgery localization methods on the fully regenerated (FR) subsets of our TGIF2 dataset, for both the semantic (Sem) and random (Rand) versions (F1 score). F1 scores above 0.7 are highlighted in bold. We notice that in-domain (i.e., same generative method) performance is relatively good, but generalization to out-of-domain subsets (i.e., a different generative method) is poor. Interestingly, when fine-tuning on only Sem masks, the evaluation on Rand subsets is lower than on the corresponding Sem subsets, suggesting a potentially introduced bias towards semantics.

| IFL Method | Fine-tune set | SD2 Sem | SD2 Rand | SDXL Sem | SDXL Rand | FLUX.1 [schnell] Sem | FLUX.1 [schnell] Rand | FLUX.1 [dev] Sem | FLUX.1 [dev] Rand | FLUX.1 Fill [dev] Sem | FLUX.1 Fill [dev] Rand | Avg. Sem | Avg. Rand | Avg. All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TruFor | All Sem+Rand | 0.49 | 0.39 | 0.68 | **0.83** | **0.87** | **0.94** | **0.82** | **0.88** | **0.76** | **0.85** | **0.72** | **0.78** | **0.75** |
| TruFor | All Sem | 0.50 | 0.23 | 0.61 | 0.62 | **0.80** | 0.62 | 0.69 | 0.46 | 0.65 | 0.59 | 0.55 | **0.76** | 0.65 |
| TruFor | All Rand | 0.31 | 0.40 | 0.50 | **0.80** | 0.69 | **0.92** | 0.62 | **0.85** | 0.63 | **0.82** | 0.55 | **0.76** | 0.58 |
| TruFor | SD2 Sem | 0.49 | 0.22 | 0.29 | 0.17 | 0.40 | 0.16 | 0.31 | 0.08 | 0.21 | 0.10 | 0.34 | 0.14 | 0.24 |
| TruFor | SD2 Rand | 0.31 | 0.40 | 0.17 | 0.22 | 0.12 | 0.10 | 0.09 | 0.06 | 0.12 | 0.12 | 0.16 | 0.18 | 0.17 |
| TruFor | SDXL Sem | 0.09 | 0.01 | 0.66 | **0.71** | 0.19 | 0.02 | 0.08 | 0.01 | 0.06 | 0.02 | 0.22 | 0.15 | 0.19 |
| TruFor | SDXL Rand | 0.01 | 0.00 | 0.50 | **0.82** | 0.02 | 0.00 | 0.01 | 0.01 | 0.03 | 0.03 | 0.11 | 0.17 | 0.14 |
| TruFor | Flux Sem | 0.10 | 0.02 | 0.20 | 0.05 | **0.90** | **0.90** | **0.86** | **0.81** | **0.78** | **0.79** | 0.57 | 0.51 | 0.54 |
| TruFor | Flux Rand | 0.02 | 0.02 | 0.05 | 0.06 | **0.73** | **0.95** | 0.68 | **0.91** | 0.68 | **0.84** | 0.43 | 0.56 | 0.49 |
| MMFusion | All Sem+Rand | 0.48 | 0.39 | 0.66 | **0.83** | **0.86** | **0.94** | **0.81** | **0.87** | **0.75** | **0.84** | **0.71** | **0.77** | **0.74** |
| MMFusion | All Sem | 0.47 | 0.17 | 0.58 | 0.61 | **0.80** | 0.58 | **0.72** | 0.50 | 0.69 | 0.67 | 0.65 | 0.50 | 0.63 |
| MMFusion | All Rand | 0.26 | 0.40 | 0.48 | **0.80** | 0.62 | **0.91** | 0.58 | **0.83** | 0.62 | **0.81** | 0.51 | **0.75** | 0.58 |
| MMFusion | SD2 Sem | 0.52 | 0.22 | 0.28 | 0.19 | 0.35 | 0.15 | 0.27 | 0.09 | 0.18 | 0.09 | 0.32 | 0.15 | 0.23 |
| MMFusion | SD2 Rand | 0.30 | 0.34 | 0.18 | 0.19 | 0.08 | 0.05 | 0.07 | 0.03 | 0.09 | 0.08 | 0.14 | 0.14 | 0.14 |
| MMFusion | SDXL Sem | 0.08 | 0.01 | 0.66 | 0.70 | 0.12 | 0.01 | 0.06 | 0.01 | 0.05 | 0.02 | 0.19 | 0.15 | 0.17 |
| MMFusion | SDXL Rand | 0.01 | 0.01 | 0.50 | **0.80** | 0.03 | 0.00 | 0.02 | 0.01 | 0.04 | 0.04 | 0.12 | 0.17 | 0.15 |
| MMFusion | Flux Sem | 0.10 | 0.02 | 0.20 | 0.05 | **0.89** | **0.87** | **0.85** | **0.76** | **0.78** | **0.80** | 0.56 | 0.50 | 0.53 |
| MMFusion | Flux Rand | 0.03 | 0.02 | 0.06 | 0.07 | **0.73** | **0.95** | 0.69 | **0.90** | 0.69 | **0.84** | 0.44 | 0.56 | 0.50 |

4.3 Fine-tuning IFL Approaches on Fully Regenerated Images

To address the poor performance of existing IFL methods on FR images (see Section 4.2 and Table 4), we fine-tune selected IFL models on the TGIF2 FR training subsets (i.e., SD2, SDXL, FLUX.1 [schnell], FLUX.1 [dev], and FLUX.1 Fill [dev], both with semantic and random masks). We do this for each subset individually (but combining the three FLUX.1 subsets into one main FLUX.1 subset), as well as for all subsets collectively. In the collective fine-tuning, we sample an equal number of images from SD2, SDXL, and FLUX.1 to avoid a bias to FLUX.1, since it has three times more images than SD2 or SDXL. As base models, we select TruFor and MMFusion, i.e., two of the IFL models with the best performance on spliced and fully regenerated images. CAT-Net was excluded from fine-tuning experiments due to its significantly more complex training process.

We followed the training setups specified in the original papers for both models. That is, both models follow a two-stage training procedure. In the first stage, the model is trained on the forgery localization task for 100 epochs, while in the second stage, part of the model is frozen and it is trained on the forgery detection task for another 100 epochs. Then, the checkpoint with the best validation loss is chosen as the final checkpoint. We maintain the hyperparameters and other training configurations specified in the original papers. This fine-tuning approach allows us to adapt high-performing IFL models to better localize manipulations in fully regenerated images, while remaining consistent with the original training protocols.
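A hedged PyTorch-style skeleton of this two-stage schedule is given below. Module names such as model.backbone and model.detection_head are placeholders standing in for the respective architectures, not the actual TruFor or MMFusion code.

```python
import torch

def run_epoch(model, loader, optimizer, loss_fn, device):
    """One optimization pass over a (images, targets) loader."""
    model.train()
    for images, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images.to(device)), targets.to(device))
        loss.backward()
        optimizer.step()

def two_stage_finetune(model, loc_loader, det_loader, val_fn,
                       device="cuda", epochs=100, lr=1e-4):
    """Stage 1: train the full model on localization. Stage 2: freeze the
    backbone, train the detection head, keep the best-validation checkpoint."""
    model.to(device)
    loc_loss = torch.nn.BCEWithLogitsLoss()  # per-pixel mask loss
    det_loss = torch.nn.BCEWithLogitsLoss()  # image-level real/fake loss

    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        run_epoch(model, loc_loader, opt, loc_loss, device)

    for p in model.backbone.parameters():     # placeholder module names
        p.requires_grad = False
    opt = torch.optim.Adam(model.detection_head.parameters(), lr=lr)

    best = float("inf")
    for _ in range(epochs):
        run_epoch(model, det_loader, opt, det_loss, device)
        val = val_fn(model)  # caller-supplied validation loss
        if val < best:       # keep the checkpoint with the best validation loss
            best = val
            torch.save(model.state_dict(), "best_checkpoint.pth")
```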
Table 5 reports the F1 scores for the fine-tuned models. In short, we observe that fine-tuning significantly improves the performance. However, performance gains are substantially larger when training and testing on semantically aligned masks than when evaluating on random-mask FR images. This is discussed more elaborately in the remainder of this section.

First, we consider the models fine-tuned on all semantic datasets. The performance on the Sem datasets is relatively high, but is significantly lower for the Random datasets (with the exception of SDXL). We demonstrate this in Fig. 4, which shows an image inpainted (and fully regenerated) using a semantic mask as well as with a random mask, along with the corresponding ground-truth masks and localization results from TruFor fine-tuned on the SD2-FR-Sem subset. The fine-tuned model is able to localize the semantic object (i.e., the tennis racket), but not the random mask in the background. Note that the inpainted random box, in fact, made notable changes to the sign in the background, and hence should be detectable by a better detection model. Additionally, note that the object categories do not overlap between the training and test set, i.e., no images of inpainted tennis rackets are seen in the fine-tuned training set. This indicates that training exclusively on semantic datasets can induce a bias toward salient or semantically meaningful regions.

Second, including the Rand subsets during fine-tuning mitigates the semantic bias observed in the previous setting. In fact, additionally including random masks during fine-tuning improves performance on the random subsets without degrading performance on the semantic subsets (and in several cases even improving it). We select the models fine-tuned on the All Sem+Rand subsets as the main ones, and also include these results in the bottom rows of Table 4.

Third, beyond mask-related effects, we also observe a discrepancy in IFL performance between SD2, SDXL, and the FLUX.1 family. That is, the IFL performance on FLUX.1 is significantly higher than on SD2. This is especially notable, since we notice contrasting behavior when evaluating the IFL performance on spliced images in Table 3.

Fig. 4 An example of a fully regenerated inpainted image with SD2, using a bounding box of the tennis racket as mask (a, SD2 Sem), or a random mask in the image (d, SD2 Rand), along with the corresponding ground-truth masks (b, e) and the fine-tuned TruFor detection results (c, f). The fine-tuned TruFor was trained on the SD2-FR-Sem training set. The inpainted tennis racket in the SD2-FR-Sem test image is detected relatively well by the fine-tuned model. However, the inpainted random box of the SD2-FR-Rand subset is not detected; instead, the tennis racket is wrongly detected, suggesting that the fine-tuned model is biased towards semantics or salient objects. Note that the inpainted random box, in fact, made notable changes to the sign in the background, and hence should be detectable by a better detection model.

Note that we do not only observe this when training on all subsets, but also when training on each subset individually. We hypothesized that this discrepancy could be explained by their difference in generative quality, although we find no supporting evidence of this in Section 4.4. It remains to be investigated, in future work, why FLUX.1 fully regenerated images are easier to detect than SD, after fine-tuning.
Fourth and finally, the models fine-tuned on individual subsets expose that performance on other, out-of-domain generative models is substantially lower than on the in-domain subset that they were trained on. Note that these methods were originally designed for splicing and copy-move detection. Hence, improved training strategies alone may not suffice. Instead, more robust localization mechanisms may be required for localization in fully regenerated images with better generalization.

In summary, this section demonstrated that inpainted areas can be detected in fully regenerated images, which is in line with similar observations in related work that trained or fine-tuned models perform well on FR image data [48, 75]. However, fine-tuning exclusively on semantic subsets can induce object-level bias, underscoring the importance of incorporating non-semantic FR data during both training and evaluation.

4.4 Generative Quality vs. IFL Performance

This subsection investigates whether generative quality is predictive of forensic localization performance. Intuitively, higher-quality or more realistic generations might be expected to suppress forensic traces and therefore reduce IFL performance. Conversely, systematic artifacts in lower-quality generations may facilitate detection. We therefore analyze whether aesthetic quality, image-text alignment, and preservation fidelity correlate with IFL performance across subsets and on a per-image level. We additionally examine whether such quality differences help explain the performance discrepancy between SD and FLUX in fine-tuned FR settings (as observed in Section 4.3).

As mentioned at the end of Section 3.3, we measure the generative quality metrics in three ways. First, we measure the aesthetic quality using the NIMA [68] and GIQA [69] metrics on the inpainted area. Second, we measure the image-text matching performance [70] on the inpainted area. Third, we measure the preservation fidelity in real but fully regenerated areas using PSNR, SSIM, and LPIPS. Finally, we relate these generative quality metrics to the IFL performance in two ways: (1) we analyze trends across the subsets, and (2) we calculate the Spearman correlation coefficients between per-image F1 scores and each generative quality metric. We have done this analysis for the TruFor and MMFusion IFL models, namely for the original variants on the SP subsets, and the fine-tuned variants (on All Sem+Rand) on the FR subsets. Note that we have not done this for the original IFL models evaluated on the FR subsets, as their detection capability is too low in this case (as discussed in Section 4.2).
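The per-image correlation analysis reduces to a rank correlation between two aligned vectors. A minimal sketch with SciPy is shown below, using synthetic stand-in data rather than the actual benchmark scores.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Stand-in data: per-image F1 scores and one generative quality metric
# (e.g., NIMA) for the same set of manipulated test images.
f1_scores = rng.uniform(0.0, 1.0, size=500)
nima_scores = 5.0 + 0.5 * rng.standard_normal(500)

# Spearman's rank correlation and its p-value.
rho, p_value = spearmanr(f1_scores, nima_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```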
Table 6 shows the average generative quality metrics for each semantic subset of TGIF2. For the reader's convenience, we again include the IFL performance of TruFor and MMFusion (as presented before in Table 3 and Table 4). Additionally, Table 7 shows the correlation values between the generative metrics and the IFL performance, calculated on all semantic subsets simultaneously. Recall, as also observed previously in Section 4.2 and Section 4.3, that SD2 has the best and FLUX.1 [dev] has the worst IFL (SP) performance. In contrast, the IFL (FR) performance is best for FLUX.1 [schnell], and significantly worse for SD2. Note that, for the preservation fidelity, it only makes sense to relate it to IFL (FR) but not to IFL (SP) performance. In the following, we inspect whether these performance discrepancies can be explained by any of the generative quality metrics.

Table 6 Average generative quality per (semantic) subset. In addition, the average detection performance is given as well (same as in Table 4). For each metric, we highlight the best performing subset in a bold font, and the worst performing subset in an italic font.

| Metric Class | Metric | SD2 | PS | SDXL | FLUX.1 [schnell] | FLUX.1 [dev] | FLUX.1 Fill [dev] |
|---|---|---|---|---|---|---|---|
| Aesthetic Quality | NIMA mean ↑ | 5.05 | 5.16 | *4.88* | 5.16 | **5.28** | 5.21 |
| | GIQA ↑ | -399 | -373 | -423 | -383 | **-371** | *-424* |
| Image-Text Matching | ITM ↑ | 0.28 | **0.32** | 0.26 | 0.31 | 0.30 | *0.25* |
| | cos ↑ | 0.37 | **0.38** | *0.34* | **0.38** | **0.38** | 0.36 |
| Preservation Fidelity (FR only) | PSNR ↑ | 27.1 | N/A | *18.1* | **32.5** | **32.5** | 28.5 |
| | SSIM ↑ | 0.79 | N/A | *0.69* | **0.92** | **0.92** | 0.86 |
| | LPIPS ↓ | 0.04 | N/A | *0.17* | 0.02 | **0.01** | 0.06 |
| IFL (SP) | TruFor – original, F1 score ↑ | **0.83** | 0.79 | N/A | 0.79 | *0.73* | 0.74 |
| | MMFusion – original, F1 score ↑ | **0.73** | 0.72 | N/A | 0.70 | *0.59* | 0.63 |
| IFL (FR) | TruFor – fine-tuned, F1 score ↑ | *0.49* | N/A | 0.68 | **0.87** | 0.82 | 0.76 |
| | MMFusion – fine-tuned, F1 score ↑ | *0.48* | N/A | 0.66 | **0.86** | 0.81 | 0.75 |

Table 7 Spearman's correlation and corresponding p-value between the IFL performance (F1 score), on the one hand, and the generative quality metrics, on the other. All correlation values are relatively low, with p-values equal to 0.000. NIMA and GIQA measure aesthetic quality; ITM and cos measure image-text matching; PSNR, SSIM, and LPIPS measure preservation fidelity.

| IFL Method | Version | Subsets | NIMA | GIQA | ITM | cos | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|---|---|---|
| TruFor | original | SP | 0.36 | 0.42 | -0.31 | 0.07 | N/A | N/A | N/A |
| MMFusion | original | SP | 0.30 | 0.39 | -0.29 | 0.11 | N/A | N/A | N/A |
| TruFor | fine-tuned | FR | 0.28 | 0.34 | -0.03 | 0.23 | 0.24 | 0.28 | -0.32 |
| MMFusion | fine-tuned | FR | 0.28 | 0.35 | -0.02 | 0.24 | 0.24 | 0.28 | -0.32 |

Concerning aesthetic quality, in Table 6, on a per-subset level, there appears to be no relation with IFL (SP) nor IFL (FR) performance. In contrast, on a per-image level, in Table 7, we notice there is a weak to moderate correlation (i.e., correlation values around 0.3 for NIMA and 0.4 for GIQA) between the aesthetic quality and IFL performance (both for SP and FR). Interestingly, higher aesthetic scores are weakly associated with better IFL performance at the per-image level, contradicting the intuition that visually appealing generations should be harder to detect. However, we do not observe this relation on a per-subset level, and hence it may not serve as an explanation as to why FLUX.1 performs significantly better than SD2 in IFL (FR).

For ITM, in Table 6, on a per-subset level, we do not observe a relation with the IFL (SP) nor IFL (FR) performance. On a per-image level, in Table 7, we see a negligible to weak negative correlation for ITM (i.e., approx. -0.3 for IFL (SP) and 0 for IFL (FR)), and a negligible to weak positive correlation for the ITM cosine similarity (i.e., approx. 0.1 for IFL (SP) and 0.2 for IFL (FR)). Hence, we conclude that there is no clear relationship between ITM and IFL performance.

For the preservation fidelity, in Table 6, on a per-subset level, we observe no correlation with IFL (FR). On a per-image level, in Table 7, we observe only a weak positive correlation (i.e., approx. 0.2 to 0.3) with IFL (FR). Hence, we conclude that there is no notable relationship between preservation fidelity and IFL performance.

Overall, generative quality metrics do not meaningfully predict forensic localization performance. The detectability differences between SD and FLUX in FR settings therefore cannot be explained by the evaluated quality measures.
How ever, note that w e were limited b y the current generative quality metrics. That is, on the one hand, the aesthetic quality metrics used could only be calculated on the inpainted area, but not on the real-but-regenerated area (as they do not allow masked calculation). On the other hand, it only mak es sense to calculate the preserv ation fidelit y metrics 19 on the real-but-regenerated area (and not on the inpain ted area). A more infor- mativ e direction for future work ma y b e to measure in tra-image quality con trast, i.e., discrepancies b et w een inpainted regions and real-but-regenerated regions. Such con trast-based metrics could b etter capture subtle inconsistencies exploited b y forensic lo calization mo dels, and hence could p otentially explain the p erformance discrepancy b et w een subsets in the fine-tuned FR setting. 4.5 Syn thetic Image Detection W e benchmark state-of-the-art SID methods on the TGIF2 FR dataset, which now includes FR subsets generated with FLUX.1 models (in addition to only SD2 and SD XL in the original TGIF dataset), as w ell as both semantic and random-mask v ariants. Note that w e do not report results on spliced subsets, since SID metho ds are not designed to detect lo cal manipulations and hence consistently fail to detect these cases. Instead, in this section, we fo cus only on the fully regenerated subsets. W e also expanded the b enc hmark with four recently introduced metho ds: SP AI [ 39 ], SP AI-ITW [ 41 ], RINE-ITW [ 41 ], and B-F ree [ 40 ]. T able 8 rep orts AUC v alues for all ev aluated metho ds, where w e highligh t scores ab o ve 0.8 in a b old font. This thresh- old w as c hosen empirically to allo w for quic kly grasping which metho ds demonstrate decen t detection p erformance. F rom the 17 ev aluated SID mo dels, 11 demonstrate relatively go o d p erformance (A UC > 0.8) on either SD2, SDXL, or b oth. Sp ecifically , the newly included metho ds (SP AI, SP AI-ITW, RINE-ITW, and B-F ree) all hav e relatively strong p erformance on these subsets, with B-F ree ev en scoring 1.00 AUC on both sets. The ev aluation of the new FLUX subsets provides a more challenging picture. Out of the 11 SID metho ds that p erform reliably well on SD2 and SDXL, only 4 maintain an A UC ab o ve 0.8 across all FLUX subsets. In general, all methods exp erience a p erformance drop on FLUX, suggesting that its generation pro cess pro duces fewer or less detectable forensic traces. F or example, B-F ree results in A UC v alues b et ween 0.85 and 0.88 for the FLUX subsets, while it resulted in an A UC v alue of 1.00 for the SD2 and SDXL subsets. This finding indicates that FLUX represents a significan tly harder detection setting, and highligh ts the imp ortance of dev eloping SID models that generalize across different families of generativ e mo dels. W e notice no significant difference in SID p erformance b et ween semantic and random-mask subsets. In summary , TGIF2 confirms that modern SID metho ds can successfully detect fully regenerated images pro duced by earlier diffusion mo dels such as SD2 and SDXL, but new generative mo dels such as FLUX in tro duce new c hallenges and degrade detec- tion p erformance. Ho wev er, it should b e stressed that binary classification in SID mo dels are not able to localize generativ e edits. 4.6 Generativ e Super-Resolution A ttac k This subsection studies the robustness of forensic metho ds against generative sup er resolution. 
4.6 Generative Super-Resolution Attack

This subsection studies the robustness of forensic methods against generative super-resolution. Specifically, we apply the open-source Real-ESRGAN [10], which is one of the most popular super-resolution methods to date (with over 34,000 stars on GitHub [72]). Additionally, we apply bicubic interpolation as a baseline comparison. For the experiments, we increase the resolution by a factor of either 2 or 4, followed by a decrease in resolution by the same factor, in order to obtain an attacked image with the same resolution as the original one. On average, the PSNR between the bicubic up-and-downsampled image and the original image is 19.2 dB (regardless of the factor), and the LPIPS is 0.21 and 0.22 for a factor of 2 and 4, respectively. For the super-resolution method, the PSNR is 18.8 dB (regardless of the factor), and the LPIPS [71] is 0.25 and 0.26 for a factor of 2 and 4, respectively.

These attacked images are evaluated using both IFL and SID methods, restricted to the models that demonstrated strong performance in Sections 4.2 and 4.5. Tables 9 and 10 present the results for IFL methods applied on all non-random SP subsets and SID methods applied on all non-random FR subsets, respectively. For each evaluated IFL or SID model, the tables show the average performance on all subsets, in addition to the performance and corresponding delta when applying the bicubic and super-resolution attacks. A negative delta denotes a decrease in performance, and a positive delta an increase. This allows us to systematically evaluate how super-resolution attacks impact the performance of strong forensic models.
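The attack itself is simple to reproduce. Below is a minimal sketch of the bicubic baseline, assuming Pillow and NumPy; in the generative variant, the upscaling step would be replaced by Real-ESRGAN inference (not shown here). The file path and factor are illustrative.

```python
# Minimal sketch of the resolution attack: upscale by a factor, then
# downscale by the same factor, so the attacked image keeps the original
# resolution. The generative attack swaps the upscaling step for
# Real-ESRGAN inference.
import numpy as np
from PIL import Image

def bicubic_attack(img: Image.Image, factor: int = 2) -> Image.Image:
    w, h = img.size
    up = img.resize((w * factor, h * factor), Image.BICUBIC)
    return up.resize((w, h), Image.BICUBIC)

def psnr(a: Image.Image, b: Image.Image) -> float:
    """PSNR in dB between two same-sized 8-bit images."""
    x = np.asarray(a, dtype=np.float64)
    y = np.asarray(b, dtype=np.float64)
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(255.0 ** 2 / mse)

original = Image.open("image.png").convert("RGB")  # illustrative path
attacked = bicubic_attack(original, factor=2)
print(f"PSNR vs. original: {psnr(original, attacked):.1f} dB")
```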
For IFL methods, Table 9 shows that super-resolution has a pronounced negative impact. While bicubic interpolation produces only a modest decrease in F1 scores (i.e., an average F1 score decrease between 0.02 and 0.11), applying Real-ESRGAN leads to a substantial drop across all evaluated subsets (i.e., an average F1 score decrease between 0.51 and 0.80). This indicates that generative SR effectively removes the fine-grained pixel-level traces that IFL methods rely on, making localization of manipulated regions far more difficult. This mirrors our observation on the impact of full regeneration during generative inpainting (Section 4.2). Moreover, a similar observation was made in related work, which demonstrated that the performance of IFL methods drops when applying generative compression using JPEG AI [27]. This highlights that generative image processing constitutes a significant challenge for current IFL methods, requiring new defenses or adaptations.

Table 9: Evaluation of the super-resolution attack on image forgery localization methods on spliced images of our TGIF2 dataset, alongside a bicubic interpolation attack as baseline (F1 score, along with the ΔF1 score).

IFL Method       Original   Bicubic x2     Bicubic x4     Real-ESRGAN x2   Real-ESRGAN x4
CAT-Net [25]     0.87       0.76 (-0.11)   0.73 (-0.14)   0.06 (-0.81)     0.07 (-0.80)
TruFor [21]      0.78       0.76 (-0.02)   0.76 (-0.02)   0.21 (-0.57)     0.22 (-0.56)
MMFusion [14]    0.69       0.65 (-0.04)   0.64 (-0.05)   0.18 (-0.51)     0.18 (-0.51)

Table 10 shows that the effect of generative super-resolution on SID methods is less pronounced than on IFL methods. For several models, Real-ESRGAN causes a clear but moderate reduction in AUC, suggesting that their detection cues are also weakened by the generative upscaling process. However, other SID models demonstrate much smaller performance drops. In some cases, the performance remains relatively stable (e.g., for UnivFD and RINE-ITW). Notably, for SPAI-ITW, the performance even increases when applying the Real-ESRGAN super-resolution attack. In comparison, applying the bicubic interpolation baseline barely affects detection performance (approximately 0.0 difference in AUC). This suggests that some SID approaches rely on more global or semantic inconsistencies that are less affected by generative SR, while others depend on low-level statistical traces that Real-ESRGAN can alter.

Table 10: Evaluation of the super-resolution attack on synthetic image detection methods on fully regenerated images in our TGIF2 dataset, alongside a bicubic interpolation attack as baseline (AUC score, along with the ΔAUC score).

SID Method        Trained on   Original   Bicubic x2     Bicubic x4     Real-ESRGAN x2   Real-ESRGAN x4
DIMD [35]         LDM          0.88       0.87 (-0.01)   0.87 (-0.01)   0.75 (-0.13)     0.80 (-0.08)
UnivFD [30]       ProGAN       0.70       0.70 (+0.00)   0.69 (-0.01)   0.68 (-0.02)     0.68 (-0.02)
GramNet [31]      GANs         0.76       0.76 (+0.00)   0.75 (-0.01)   0.50 (-0.26)     0.50 (-0.26)
LGrad [32]        ProGAN       0.85       0.85 (+0.00)   0.84 (-0.01)   0.49 (-0.36)     0.48 (-0.37)
RINE [37]         ProGAN       0.77       0.77 (+0.00)   0.76 (-0.01)   0.72 (-0.05)     0.72 (-0.05)
RINE [37]         LDM          0.81       0.83 (+0.02)   0.83 (+0.02)   0.70 (-0.11)     0.71 (-0.10)
PatchCraft [34]   ProGAN       0.95       0.98 (+0.03)   0.98 (+0.03)   0.53 (-0.42)     0.48 (-0.47)
SPAI [39]         LDM          0.86       0.84 (-0.02)   0.84 (-0.02)   0.73 (-0.13)     0.73 (-0.13)
SPAI-ITW [41]     ITW          0.65       0.65 (+0.00)   0.66 (+0.01)   0.78 (+0.13)     0.83 (+0.18)
RINE-ITW [41]     ITW          0.83       0.83 (+0.00)   0.83 (+0.00)   0.81 (-0.02)     0.83 (+0.00)
B-Free [40]       LDM          0.92       0.91 (-0.01)   0.91 (-0.01)   0.86 (-0.06)     0.84 (-0.08)
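For clarity, the deltas in Tables 9 and 10 are plain differences with respect to the unattacked performance. The sketch below illustrates this for localization, assuming the standard pixel-level F1 between binary masks; the masks are synthetic placeholders, not benchmark outputs.

```python
# Minimal sketch of a pixel-level F1 score between localization masks
# (1 = manipulated pixel), and of the delta reported for attacked images.
import numpy as np

def pixel_f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Standard pixel-level F1 = 2TP / (2TP + FP + FN)."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    return 2 * tp / max(2 * tp + fp + fn, 1)

# Hypothetical masks for one image, before and after the SR attack.
gt = np.zeros((256, 256), dtype=np.uint8)
gt[64:128, 64:128] = 1
pred_orig = gt.copy()              # near-perfect localization
pred_attacked = np.zeros_like(gt)  # traces erased by the attack

f1_orig = pixel_f1(pred_orig, gt)
f1_att = pixel_f1(pred_attacked, gt)
print(f"F1 original = {f1_orig:.2f}, attacked = {f1_att:.2f}, "
      f"delta = {f1_att - f1_orig:+.2f}")
```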
In summary, our experiments show that generative super-resolution represents a serious threat to current forensic methods. It can erase the traces needed for IFL, and in many cases also moderately harms SID performance, while bicubic interpolation alone does not have such strong effects. This highlights the importance of accounting for post-processing operations such as SR in the development of robust forensic methods.

5 Discussion & Conclusion

Text-guided inpainting forgeries represent a rapidly evolving and increasingly realistic form of AI-based image editing. With TGIF2, we extend our previous dataset (TGIF) by incorporating recent inpainting models, namely the FLUX.1 family, and by introducing random-mask subsets to probe biases in detection methods. Additionally, we extend our previous benchmark by performing fine-tuning to tackle the problem of localization in manipulated but fully regenerated images, as well as by evaluating the impact of generative super-resolution methods on forensic methods.

Our experiments demonstrate that existing IFL methods perform well on traditional splicing scenarios, yet some showcase a notable difference in performance between edits made with first-generation inpainting methods (i.e., SD and Adobe Firefly) and new-generation methods (i.e., FLUX.1), revealing limitations in their ability to generalize to new inpainting models that significantly increase visual fidelity. Additionally, we noticed that some IFL methods drop in performance when evaluated on the random, non-semantic subsets, which suggests that these existing methods exhibit a bias towards objects and semantics.

Moreover, existing IFL models are unable to localize manipulated areas in FR images. We demonstrated that fine-tuning on FR subsets improves their ability to localize these manipulations, but that it may also introduce biases, which became evident when evaluating on the random-mask subsets. This highlights the importance of training and evaluating methods beyond object-centric manipulations.

We also analyzed the generative quality of the inpainted images, demonstrating that FLUX images are of higher quality. However, we could not establish conclusive relations between generative quality and IFL performance.

For SID, our experiments show that several methods perform strongly on SD2 and SDXL, but performance decreases significantly on the FLUX subsets. Hence, FLUX generations are harder to detect for existing IFL and SID methods. This emphasizes the need to keep investing in generalizable SID methods, as well as to include new generative methods in training sets.

Furthermore, we show that generative super-resolution significantly weakens forensic traces and thereby reduces the effectiveness of forensic methods, similar to fully regenerated manipulations and generative compression [27]. This is especially disruptive in the case of IFL methods, but also moderately affects some SID methods. In fact, the results suggest that semantics-based SID methods show increased robustness to SR attacks.

It is interesting to contrast the increased SR-attack robustness of semantics-based SID methods (Section 4.6) with the semantic-bias vulnerability of (fine-tuned) IFL methods that was exposed in the random-mask evaluations (Sections 4.2 and 4.3).
This may seem conflicting, since we demonstrate that semantics is beneficial in certain use cases (e.g., robustness to SR attacks), while an over-reliance on semantic cues can introduce vulnerabilities (e.g., those that become apparent under random-mask evaluations). It should be noted that the use of random, non-semantic masks is not intended to replace object-centric evaluations, but to serve as a stress test that decouples semantic content from manipulation boundaries, enabling a clearer analysis of model bias and cue reliance. As such, TGIF2 aims to provide complementary evaluation conditions that expose different failure modes. We see semantic reasoning and low-level forensic cues as orthogonal and necessary components, and our benchmark is designed to analyze their trade-offs rather than promote one over the other.

Future work could focus on expanding the TGIF2 dataset with next-generation AI-based inpainting tools, e.g., those that do not require a user-defined mask, such as FLUX Kontext [76], FLUX.2 [77], and the GPT-4o (ChatGPT) Image Generator [78]. Additionally, a key open challenge is improving the robustness of IFL methods against manipulated images that undergo full regeneration (either during manipulation or afterwards, such as by generative super-resolution). Regeneration significantly weakens or destroys the forensic traces used by current IFL and SID methods. In this context, it is especially important to watch out for potential biases.

In summary, TGIF2 provides an updated benchmark that captures the latest advances in text-guided inpainting, highlights strengths and weaknesses of state-of-the-art forensic methods and their fine-tuned versions, and contributes new challenging subsets that will help the community track progress against this continuously evolving threat.

Acknowledgements: Not applicable.

Declarations

Funding: This work was funded by the Flemish government's Department of Culture, Youth & Media (under the COM-PRESS project), by IDLab (Ghent University – imec), by Flanders Innovation & Entrepreneurship (VLAIO), by Research Foundation – Flanders (FWO) (V419524N & G0A2523N), and by the European Union under the Horizon Europe projects AI4Trust (grant number 101070190) and AI-CODE (grant number 101135437).

Competing interests: The authors declare that they have no competing interests.

Ethics approval and consent to participate: Not applicable.

Consent for publication: The authors give consent for publication.

Data availability: The TGIF2 dataset is available at https://github.com/IDLabMedia/tgif-dataset.

Materials availability: The TGIF2 materials are available at https://github.com/IDLabMedia/tgif-dataset.

Code availability: The TGIF2 code is available at https://github.com/IDLabMedia/tgif-dataset.

Author contribution: HM: dataset preparation, initial experiments, interpretation of experiments, main manuscript text writing. DK: large-scale benchmark experiments, interpretation of experiments. PG: large-scale benchmark experiments, interpretation of experiments. SP: guidance, funding acquisition. PL: guidance, funding acquisition. GVW: guidance, funding acquisition. All authors revised and approved the manuscript.

Appendix A  Image Forgery Localization – IoU

The IoU results for IFL on the spliced subsets are given in Table A1 (i.e., corresponding to the F1 results in Table 3).
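For reference, the following is a minimal sketch of the IoU computation between a predicted and a ground-truth binary mask, assuming the standard intersection-over-union definition; the masks are synthetic placeholders.

```python
# Minimal sketch of the IoU metric reported in Tables A1-A3, computed
# between binary localization masks (1 = manipulated pixel).
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of the manipulated regions."""
    inter = np.logical_and(pred == 1, gt == 1).sum()
    union = np.logical_or(pred == 1, gt == 1).sum()
    return inter / max(union, 1)

# Hypothetical masks: the prediction overlaps half of the ground truth.
gt = np.zeros((256, 256), dtype=np.uint8)
gt[64:128, 64:128] = 1
pred = np.zeros_like(gt)
pred[96:160, 64:128] = 1
print(f"IoU = {iou(pred, gt):.2f}")  # 0.33: 32x64 overlap / 96x64 union
```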
The IoU results for the fully regenerated subsets are given in Table A2 (i.e., corresponding to the F1 results in Table 4) for the original IFL methods, and in Table A3 (i.e., corresponding to the F1 results in Table 5) for the fine-tuned IFL methods.

Table A1: Evaluation of image forgery localization methods on the spliced (SP) subsets of our TGIF2 dataset, for both the semantic (Sem) and random (Rand) versions (IoU). Just as in Table 3, some IFL methods (i.e., CAT-Net, TruFor & MMFusion) perform well on the SP subsets.

IFL Method                 SD2          PS     FLUX.1       FLUX.1       FLUX.1        Average
                                               [schnell]    [dev]        Fill [dev]
                           Sem   Rand   Sem    Sem   Rand   Sem   Rand   Sem   Rand    Sem   Rand   All
PSCC-Net [20]              0.12  0.13   0.31   0.02  0.06   0.01  0.03   0.06  0.17    0.10  0.10   0.10
SPAN [22]                  0.00  0.00   0.00   0.00  0.00   0.00  0.00   0.00  0.01    0.00  0.00   0.00
ImageForensicsOSN [23]     0.16  0.05   0.25   0.22  0.07   0.16  0.04   0.08  0.07    0.18  0.06   0.12
MVSS-Net++ [13]            0.05  0.02   0.05   0.11  0.04   0.08  0.02   0.06  0.07    0.07  0.04   0.06
ManTra-Net [24]            0.10  0.09   0.44   0.04  0.01   0.03  0.01   0.03  0.02    0.13  0.03   0.08
CAT-Net [25]               0.81  0.87   0.77   0.81  0.88   0.78  0.87   0.79  0.89    0.79  0.88   0.83
TruFor [21]                0.74  0.79   0.69   0.69  0.60   0.61  0.48   0.64  0.63    0.67  0.63   0.65
MMFusion [14]              0.62  0.63   0.62   0.60  0.50   0.49  0.37   0.54  0.50    0.57  0.50   0.54

Table A2: Evaluation of image forgery localization methods on the fully regenerated (FR) subsets of our TGIF2 dataset, for both the semantic (Sem) and random (Rand) versions (IoU). The last rows show the IFL methods fine-tuned on all FR Sem subsets. Just as in Table 4, performance is low for all original IFL methods on the FR subsets, while it is significantly higher for the fine-tuned variants.

IFL Method                 SD2          SDXL         FLUX.1       FLUX.1       FLUX.1        Average
                                                     [schnell]    [dev]        Fill [dev]
                           Sem   Rand   Sem   Rand   Sem   Rand   Sem   Rand   Sem   Rand    Sem   Rand   All
PSCC-Net [20]              0.03  0.02   0.03  0.05   0.01  0.01   0.01  0.01   0.02  0.03    0.02  0.03   0.02
SPAN [22]                  0.00  0.00   0.00  0.00   0.00  0.00   0.00  0.00   0.00  0.00    0.00  0.00   0.00
ImageForensicsOSN [23]     0.14  0.05   0.12  0.05   0.22  0.05   0.16  0.04   0.06  0.03    0.14  0.04   0.09
MVSS-Net++ [13]            0.04  0.02   0.06  0.03   0.11  0.02   0.09  0.02   0.05  0.03    0.07  0.02   0.05
ManTra-Net [24]            0.02  0.00   0.03  0.03   0.03  0.01   0.02  0.01   0.01  0.01    0.02  0.01   0.02
CAT-Net [25]               0.03  0.01   0.02  0.00   0.04  0.01   0.03  0.00   0.02  0.01    0.03  0.01   0.02
TruFor [21]                0.13  0.04   0.12  0.06   0.25  0.05   0.18  0.04   0.10  0.05    0.16  0.05   0.10
MMFusion [14]              0.11  0.03   0.13  0.06   0.26  0.05   0.15  0.02   0.08  0.04    0.15  0.04   0.09
TruFor – fine-tuned        0.37  0.28   0.58  0.75   0.80  0.90   0.74  0.82   0.67  0.78    0.63  0.71   0.67
MMFusion – fine-tuned      0.37  0.29   0.55  0.75   0.78  0.89   0.73  0.80   0.66  0.77    0.62  0.70   0.66

Table A3: Evaluation of individually fine-tuned image forgery localization methods on the fully regenerated (FR) subsets of our TGIF2 dataset, for both the semantic (Sem) and random (Rand) versions (IoU). Just as in Table 5, we notice good in-domain performance but bad out-of-domain generalization, as well as a potential bias towards semantics.
IFL Method   Fine-tune set    SD2          SDXL         FLUX.1       FLUX.1       FLUX.1        Average
                                                        [schnell]    [dev]        Fill [dev]
                              Sem   Rand   Sem   Rand   Sem   Rand   Sem   Rand   Sem   Rand    Sem   Rand   All
TruFor       All Sem+Rand     0.37  0.28   0.58  0.75   0.80  0.90   0.74  0.82   0.67  0.78    0.63  0.71   0.67
             All Sem          0.21  0.29   0.41  0.71   0.60  0.87   0.53  0.78   0.53  0.74    0.55  0.41   0.48
             All Rand         0.38  0.15   0.50  0.52   0.71  0.52   0.59  0.37   0.55  0.50    0.46  0.68   0.57
             SD2 Sem          0.38  0.14   0.20  0.11   0.29  0.11   0.22  0.05   0.15  0.06    0.25  0.09   0.17
             SD2 Rand         0.21  0.29   0.11  0.14   0.08  0.06   0.05  0.03   0.08  0.08    0.11  0.12   0.11
             SDXL Sem         0.07  0.01   0.55  0.61   0.14  0.01   0.06  0.01   0.04  0.01    0.17  0.13   0.15
             SDXL Rand        0.01  0.00   0.41  0.73   0.01  0.00   0.00  0.01   0.02  0.02    0.09  0.15   0.12
             FLUX Sem         0.07  0.01   0.14  0.03   0.83  0.85   0.79  0.74   0.70  0.72    0.51  0.47   0.49
             FLUX Rand        0.01  0.01   0.03  0.04   0.65  0.91   0.61  0.86   0.58  0.77    0.38  0.52   0.45
MMFusion     All Sem+Rand     0.37  0.29   0.55  0.75   0.78  0.89   0.73  0.80   0.66  0.77    0.62  0.70   0.66
             All Sem          0.36  0.11   0.47  0.50   0.70  0.48   0.62  0.41   0.59  0.57    0.55  0.42   0.48
             All Rand         0.18  0.30   0.39  0.71   0.54  0.85   0.49  0.75   0.52  0.72    0.42  0.67   0.55
             SD2 Sem          0.40  0.15   0.19  0.12   0.26  0.10   0.19  0.05   0.12  0.06    0.23  0.10   0.16
             SD2 Rand         0.20  0.24   0.11  0.12   0.05  0.03   0.04  0.02   0.06  0.05    0.09  0.09   0.09
             SDXL Sem         0.06  0.01   0.55  0.60   0.10  0.01   0.04  0.01   0.03  0.01    0.16  0.13   0.14
             SDXL Rand        0.01  0.01   0.41  0.71   0.02  0.00   0.01  0.01   0.03  0.03    0.10  0.15   0.12
             FLUX Sem         0.07  0.01   0.14  0.03   0.83  0.80   0.77  0.68   0.69  0.71    0.50  0.45   0.47
             FLUX Rand        0.02  0.01   0.04  0.05   0.65  0.90   0.61  0.85   0.60  0.77    0.38  0.52   0.45

References

[1] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn. (2022)
[2] Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2Face: Real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395 (2016)
[3] Adobe: Bringing Generative AI into Creative Cloud with Adobe Firefly. https://blog.adobe.com/en/publish/2023/03/21/bringing-gen-ai-to-creative-cloud-adobe-firefly. Accessed: 2025-09-26 (2023)
[4] Black Forest Labs: FLUX. https://github.com/black-forest-labs/flux. Accessed: 2025-09-19 (2024)
[5] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In: Proceedings of the 39th International Conference on Machine Learning, pp. 16784–16804. PMLR (2022)
[6] Verdoliva, L.: Media forensics and deepfakes: an overview. IEEE J. Selected Topics Signal Process. 14(5), 910–932 (2020)
[7] Mareen, H., Karageorgiou, D., Wallendael, G.V., Lambert, P., Papadopoulos, S.: TGIF: Text-guided inpainting forgery dataset. In: 2024 IEEE International Workshop on Information Forensics and Security (WIFS) (2024)
[8] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
[9] fal.ai: imgsys.org - a generative image model arena. https://imgsys.org/rankings. Accessed: 2025-09-19 (2025)
[10] Wang, X., Xie, L., Dong, C., Shan, Y.: Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In: International Conference on Computer Vision Workshops (ICCVW) (2021)
[11] Mareen, H., Vanden Bussche, D., Guillaro, F., Cozzolino, D., Van Wallendael, G., Lambert, P., Verdoliva, L.: Comprint: Image forgery detection and localization using compression fingerprints. In: Rousseau, J.-J., Kapralos, B. (eds.) Pattern Recognition, Computer Vision, and Image Processing. ICPR 2022 International Workshops and Challenges, pp. 281–299. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-37742-6_23
[12] Cozzolino, D., Verdoliva, L.: Noiseprint: A CNN-based camera model fingerprint. IEEE Trans. Inf. Forensics Security 15, 144–159 (2020). https://doi.org/10.1109/TIFS.2019.2916364
[13] Dong, C., Chen, X., Hu, R., Cao, J., Li, X.: MVSS-Net: Multi-view multi-scale supervised networks for image manipulation detection. IEEE Trans. Pattern Analysis and Machine Intel. 45(3), 3539–3553 (2022)
[14] Triaridis, K., Mezaris, V.: Exploring multi-modal fusion for image manipulation detection and localization. In: Int. Conf. Multimedia Model., pp. 198–211. Springer (2024)
[15] Karageorgiou, D., Kordopatis-Zilos, G., Papadopoulos, S.: Fusion transformer with object mask guidance for image forgery analysis. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn., pp. 4345–4355 (2024)
[16] Mareen, H., De Neve, L., Lambert, P., Van Wallendael, G.: Harmonizing image forgery detection & localization: Fusion of complementary approaches. J. Imaging 10(1), 4 (2023)
[17] Li, H., Luo, W., Huang, J.: Localization of diffusion-based inpainting in digital images. IEEE Trans. Inf. Forensics Security 12(12), 3050–3064 (2017)
[18] Wu, H., Zhou, J.: IID-Net: Image inpainting detection network via neural architecture search and attention. IEEE Trans. Circuits Systems Video Technol. 32(3), 1172–1185 (2021)
[19] Li, A., Ke, Q., Ma, X., Weng, H., Zong, Z., Xue, F., Zhang, R.: Noise doesn't lie: Towards universal detection of deep inpainting. In: Zhou, Z.-H. (ed.) Proc. Int. Conf. Artificial Intel. (IJCAI), pp. 786–792 (2021)
[20] Liu, X., Liu, Y., Chen, J., Liu, X.: PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Trans. Circuits Systems Video Technol. 32(11), 7505–7517 (2022)
[21] Guillaro, F., Cozzolino, D., Sud, A., Dufour, N., Verdoliva, L.: TruFor: Leveraging all-round clues for trustworthy image forgery detection and localization. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn. (CVPR), pp. 20606–20615 (2023)
[22] Hu, X., Zhang, Z., Jiang, Z., Chaudhuri, S., Yang, Z., Nevatia, R.: SPAN: Spatial pyramid attention network for image manipulation localization. In: Proc. Europ. Computer Vision Conf., pp. 312–328. Springer (2020)
[23] Wu, H., Zhou, J., Tian, J., Liu, J., Qiao, Y.: Robust image forgery detection against transmission over online social networks. IEEE Trans. Inf. Forensics Security 17, 443–456 (2022)
[24] Wu, Y., AbdAlmageed, W., Natarajan, P.: ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn., pp. 9543–9552 (2019)
[25] Kwon, M.-J., Nam, S.-H., Yu, I.-J., Lee, H.-K., Kim, C.: Learning JPEG compression artifacts for image manipulation detection and localization. Int. J. Computer Vision 130(8), 1875–1895 (2022)
[26] Li, J., Wang, N., Zhang, L., Du, B., Tao, D.: Recurrent feature reasoning for image inpainting. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn., pp. 7760–7768 (2020)
[27] Cannas, E.D., Mandelli, S., Popović, N., Alkhateeb, A., Gnutti, A., Bestagini, P., Tubaro, S.: Is JPEG AI going to change image forensics? arXiv preprint arXiv:2412.03261 (2024)
[28] Wang, S.-Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: CNN-generated images are surprisingly easy to spot... for now. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn. (CVPR) (2020)
[29] Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Leveraging frequency analysis for deep fake image recognition. In: Int. Conf. Machine Learning, pp. 3247–3258. PMLR (2020)
[30] Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize across generative models. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn., pp. 24480–24489 (2023)
[31] Liu, Z., Qi, X., Torr, P.H.: Global texture enhancement for fake face detection in the wild. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn., pp. 8060–8069 (2020)
[32] Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y.: Learning on gradients: Generalized artifacts representation for GAN-generated images detection. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn., pp. 12105–12114 (2023)
[33] Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn., pp. 28130–28139 (2024)
[34] Zhong, N., Xu, Y., Qian, Z., Zhang, X.: PatchCraft: Exploring texture patch for efficient AI-generated image detection. arXiv preprint arXiv:2311.12397 (2023)
[35] Corvi, R., Cozzolino, D., Zingarini, G., Poggi, G., Nagano, K., Verdoliva, L.: On the detection of synthetic images generated by diffusion models. In: Proc. IEEE Int. Conf. Acoustics, Speech Signal Process. (ICASSP), pp. 1–5 (2023)
[36] Wang, Z., Bao, J., Zhou, W., Wang, W., Hu, H., Chen, H., Li, H.: DIRE for diffusion-generated image detection. In: Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), pp. 22445–22455 (2023)
[37] Koutlis, C., Papadopoulos, S.: Leveraging representations from intermediate encoder-blocks for synthetic image detection. In: Proc. Europ. Computer Vision Conf. (ECCV) (2024)
[38] Sha, Z., Li, Z., Yu, N., Zhang, Y.: DE-FAKE: Detection and attribution of fake images generated by text-to-image generation models. In: Proc. ACM SIGSAC Conf. Computer Comm. Security, pp. 3418–3432 (2023)
[39] Karageorgiou, D., Papadopoulos, S., Kompatsiaris, I., Gavves, E.: Any-resolution AI-generated image detection by spectral learning. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 18706–18717 (2025)
[40] Guillaro, F., Zingarini, G., Usman, B., Sud, A., Cozzolino, D., Verdoliva, L.: A bias-free training paradigm for more general AI-generated image detection. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 18685–18694 (2025)
[41] Konstantinidou, D., Karageorgiou, D., Koutlis, C., Papadopoulou, O., Schinas, E., Papadopoulos, S.: Navigating the challenges of AI-generated image detection in the wild: What truly matters? arXiv preprint arXiv:2507.10236 (2025)
[42] Mandelli, S., Bestagini, P., Tubaro, S.: When synthetic traces hide real content: Analysis of stable diffusion image laundering. In: 2024 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE (2024)
[43] Lin, L., Gupta, N., Zhang, Y., Ren, H., Liu, C.-H., Ding, F., Wang, X., Li, X., Verdoliva, L., Hu, S.: Detecting multimedia generated by large AI models: A survey. arXiv preprint arXiv:2402.00045 (2024)
[44] Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A.N., Delgado, A., Zhou, D., Kheyrkhah, T., Smith, J., Fiscus, J.: MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In: Proc. IEEE Winter App. Computer Vision Workshops (WACVW), pp. 63–72. IEEE (2019)
[45] Mahfoudi, G., Tajini, B., Retraint, F., Morain-Nicolier, F., Dugelay, J.L., Marc, P.: DEFACTO: Image and face manipulation dataset. In: Europ. Signal Process. Conf. (EUSIPCO), pp. 1–5. IEEE (2019)
[46] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Proc. Europ. Computer Vision Conf., pp. 740–755. Springer (2014)
[47] Jia, S., Huang, M., Zhou, Z., Ju, Y., Cai, J., Lyu, S.: AutoSplice: A text-prompt manipulated image dataset for media forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 893–903 (2023)
[48] Giakoumoglou, P., Karageorgiou, D., Papadopoulos, S., Petrantonakis, P.C.: SAGI: Semantically aligned and uncertainty guided AI image inpainting. In: Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV) (2025)
[49] Dang-Nguyen, D.-T., Pasquini, C., Conotter, V., Boato, G.: RAISE: A raw images dataset for digital image forensics. In: Proceedings of the 6th ACM Multimedia Systems Conference, pp. 219–224 (2015)
[50] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., Duerig, T., Ferrari, V.: The Open Images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision 128(7), 1956–1981 (2020)
[51] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)
[52] Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K., Lempitsky, V.: Resolution-robust large mask inpainting with Fourier convolutions. arXiv preprint arXiv:2109.07161 (2021)
[53] Yang, T., Jordan, T., Liu, N., Sun, J.: Common inpainted objects in-n-out of context. arXiv preprint arXiv:2506.00721 (2025)
[54] Sun, Z., Fang, H., Cao, J., Zhao, X., Wang, D.: Rethinking image editing detection in the era of generative AI revolution. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3538–3547 (2024)
[55] Li, W., Lin, Z., Zhou, K., Qi, L., Wang, Y., Jia, J.: MAT: Mask-aware transformer for large hole image inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10758–10768 (2022)
[56] Zhang, H., Wan, X.: UniAIDet: A unified and universal benchmark for AI-generated image content detection and localization. arXiv preprint arXiv:2510.23023 (2025)
[57] Wang, Y., Huang, Z., Hong, X.: OpenSDI: Spotting diffusion-generated images in the open world. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 4291–4301 (2025)
[58] Bertazzini, G., Albisani, C., Baracchi, D., Shullani, D., Piva, A.: Beyond the brush: Fully-automated crafting of realistic inpainted images. In: 2024 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6. IEEE (2024)
[59] Chen, Y., Huang, X., Zhang, Q., Li, W., Zhu, M., Yan, Q., Li, S., Chen, H., Hu, H., Yang, J., Liu, W., Hu, J.: GIM: A million-scale benchmark for generative image manipulation detection and localization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 2311–2319 (2025)
[60] Cai, L., Wang, H., Ji, J., ZhouMen, Y., Ma, Y., Sun, X., Cao, L., Ji, R.: Zooming in on fakes: A novel dataset for localized AI-generated image detection with forgery amplification approach. arXiv preprint arXiv:2504.11922 (2025)
[61] Wang, Y., Yu, J., Zhang, J.: Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490 (2022)
[62] Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: BrushNet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: European Conference on Computer Vision, pp. 150–168. Springer (2024)
[63] Zhuang, J., Zeng, Y., Liu, W., Yuan, C., Chen, K.: A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In: European Conference on Computer Vision, pp. 195–211. Springer (2024)
[64] Smeu, S., Boldisor, D.-A., Oneata, D., Oneata, E.: Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18815–18825 (2025)
[65] Xie, S., Zhang, Z., Lin, Z., Hinz, T., Zhang, K.: SmartBrush: Text and shape guided object inpainting with diffusion model. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn., pp. 22428–22437 (2023)
[66] Wang, S., Saharia, C., Montgomery, C., Pont-Tuset, J., Noy, S., Pellegrini, S., Onoe, Y., Laszlo, S., Fleet, D.J., Soricut, R., Baldridge, J., Norouzi, M., Anderson, P., Chan, W.: Imagen Editor and EditBench: Advancing and evaluating text-guided image inpainting. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn., pp. 18359–18369 (2023)
[67] Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX. https://github.com/huggingface/diffusers. Accessed: 2024-07-05
[68] Talebi, H., Milanfar, P.: NIMA: Neural image assessment. IEEE Trans. Image Process. 27(8), 3998–4011 (2018)
[69] Gu, S., Bao, J., Chen, D., Wen, F.: GIQA: Generated image quality assessment. In: Proc. Europ. Computer Vision Conf. (ECCV), pp. 369–385. Springer (2020)
[70] Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Machine Learning, pp. 12888–12900. PMLR (2022)
[71] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proc. IEEE/CVF Conf. Computer Vision Pattern Recogn. (CVPR), pp. 586–595 (2018)
[72] Diffusers GitHub issue – Inpainting produces results that are uneven with input image. https://github.com/huggingface/diffusers/issues/5808. Accessed: 2024-07-05
[73] Schinas, M., Papadopoulos, S.: SIDBench: A Python framework for reliably assessing synthetic image detection methods. In: MAD '24: Proc. ACM Int. W. Multimedia AI Ag. Disinform., pp. 55–64 (2024)
[74] Ju, Y., Jia, S., Ke, L., Xue, H., Nagano, K., Lyu, S.: Fusing global and local features for generalized AI-synthesized image detection. In: Proc. IEEE Int. Conf. Image Process. (ICIP), pp. 3465–3469. IEEE (2022)
[75] Smeu, S., Oneata, E., Oneata, D.: DeCLIP: Decoding CLIP representations for deepfake localization. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 149–159. IEEE (2025)
[76] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space (2025)
[77] Labs, B.F.: FLUX.2: Frontier visual intelligence. https://bfl.ai/blog/flux-2. Accessed: 2025-02-02 (2025)
[78] Chen, S., Bai, J., Zhao, Z., Ye, T., Shi, Q., Zhou, D., Chai, W., Lin, X., Wu, J., Tang, C., Xu, S., Zhang, T., Yuan, H., Zhou, Y., Chow, W., Li, L., Li, X., Zhu, L., Qi, L.: An empirical study of GPT-4o image generation capabilities. arXiv preprint arXiv:2504.05979 (2025)