FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The rapid and unrestrained advancement of generative artificial intelligence (AI) presents a double-edged sword. While enabling unprecedented creativity, it also facilitates the generation of highly convincing content, undermining societal trust. As image generation techniques become increasingly sophisticated, detecting synthetic images is no longer just a binary task; it necessitates explainable methodologies to enhance trustworthiness and transparency. However, existing detection models primarily focus on classification, offering limited explanatory insights. To address these limitations, we propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics, which not only identifies synthetic images with high accuracy but also delivers rich query-contingent forensic insights. At the foundation of our approach is FakeChain, a large-scale dataset containing structured forensic reasoning based on visual trace evidence, constructed via a human-machine collaborative framework. We then develop FakeInstruct, the largest multimodal instruction tuning dataset to date, comprising two million visual instructions that instill nuanced forensic awareness into LMMs. Empowered by FakeInstruct, FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios. It can accurately distinguish synthetic images, provide coherent explanations, discuss fine-grained forgery artifacts, and suggest actionable enhancement strategies. Notably, despite being trained exclusively on qualitative hard labels, FakeScope demonstrates remarkable zero-shot quantitative detection capability via our proposed token-based probability estimation strategy. Furthermore, it shows robust generalization across unseen image generators and performs reliably under in-the-wild scenarios.


💡 Research Summary

FakeScope addresses the growing need for transparent AI‑generated image forensics by moving beyond binary classification toward explainable, query‑driven analysis. The authors first construct a large multimodal dataset, FakeChain, containing 47,594 images (an equal split of real and AI‑generated samples from 17 generators) together with human‑annotated categories of visual trace evidence (e.g., lighting inconsistency, texture anomalies) and detailed authenticity reasoning. To keep annotation costs low, they introduce the Anthropomorphic Chain‑of‑Thought Inference (ACoTI) pipeline: a three‑step “Steer → Demonstrate → Enlighten” process where a small set of expert annotations guides strong vision‑language models to generate chain‑of‑thought explanations at scale.

Building on FakeChain, they create FakeInstruct, a 2‑million‑instance visual instruction dataset. Each instruction prompts the model to perform one of four forensic tasks—binary authenticity judgment, evidence description, forensic analysis, or improvement suggestion—allowing the model to learn multi‑turn, multi‑task dialogue capabilities.
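To make the four-task structure concrete, here is a minimal sketch of what a single FakeInstruct-style instance might look like. The field names, file path, and answer text below are illustrative assumptions, not the released dataset's actual schema:

```python
# Hypothetical FakeInstruct-style instance; field names and values are
# illustrative only and may differ from the released dataset's schema.
instruction_example = {
    "image": "images/example_00123.png",  # assumed path, for illustration
    "task": "forensic_analysis",          # one of the four forensic task types
    "question": "Is this image AI-generated? Describe the visual evidence.",
    "answer": (
        "Fake. The hands show anatomically implausible finger joints, and "
        "background text is illegible, both common generation artifacts."
    ),
}

# The four forensic task types described in the summary above.
FORENSIC_TASKS = {
    "authenticity_judgment",   # binary real-vs-fake decision
    "evidence_description",    # naming visible trace evidence
    "forensic_analysis",       # structured reasoning about authenticity
    "improvement_suggestion",  # actionable enhancement advice
}

assert instruction_example["task"] in FORENSIC_TASKS
```

Pairing many such instances per image, across the four task types, is what allows the model to learn the multi-turn, multi-task dialogue behavior the summary describes.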

FakeScope itself is a fine‑tuned large multimodal model (based on the open‑source LLaVA‑v1.5 architecture) trained on FakeInstruct. A novel token‑based probability estimation method converts the log‑probabilities of the generated “Fake” and “Real” tokens into a calibrated authenticity score, enabling zero‑shot quantitative detection without adding a separate classifier head.
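The core idea of token-based probability estimation can be sketched with a two-way softmax over the log-probabilities the model assigns to its "Fake" and "Real" answer tokens. This is an illustrative reconstruction of the general technique, assuming a simple two-token normalization; the paper's exact token set and calibration may differ:

```python
import math

def authenticity_score(token_logprobs: dict) -> float:
    """Turn the log-probabilities of the candidate answer tokens
    ("Fake" and "Real") into a continuous fakeness score in [0, 1]
    via a softmax restricted to those two tokens.

    Illustrative sketch only; assumes the model exposes per-token
    log-probabilities for its first generated answer token.
    """
    lp_fake = token_logprobs["Fake"]
    lp_real = token_logprobs["Real"]
    # Two-way softmax: exponentiate and renormalize over the pair.
    p_fake = math.exp(lp_fake) / (math.exp(lp_fake) + math.exp(lp_real))
    return p_fake

# Example: the model favors "Fake" (higher log-probability).
score = authenticity_score({"Fake": -0.2, "Real": -1.8})
```

Because the score comes directly from the language-model head's token distribution, no separate classifier needs to be trained, which is what enables the zero-shot quantitative detection behavior described above.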

Extensive experiments demonstrate that FakeScope outperforms traditional CNN/ViT detectors and state‑of‑the‑art multimodal models (GPT‑4V, Gemini‑Pro) on multiple benchmark datasets, achieving AUROC > 0.94 and higher F1 scores across a wide variety of generators, including unseen models such as Stable Diffusion XL and DALL·E 3. Human evaluations show that the model's natural‑language explanations are highly persuasive (average 4.3/5) and correctly identify fine‑grained forensic cues in 78% of cases. Moreover, the system maintains strong generalization to "in‑the‑wild" images with compression or noise artifacts, and to generators not seen during training.

Limitations are acknowledged: the current focus on visual trace evidence leaves text‑image hybrid forgeries less well‑handled, and the token‑based confidence scores can be over‑confident for short prompts. Future work aims to incorporate audio/video forensics, improve probability calibration, and expand the instruction set to cover more complex manipulation scenarios.

In summary, FakeScope combines a cost‑effective human‑machine data pipeline with large‑scale multimodal instruction tuning to deliver a unified forensic expert capable of accurate detection, detailed explanation, and actionable advice, thereby advancing trustworthy AI‑generated content analysis.

