A Comprehensive Benchmark and Human-Perception-Aligned Metric for Evaluating Text-Driven Image Editing

Reading time: 6 minutes

📝 Abstract

Recent advances in text-driven image editing have been significant, yet the task of accurately evaluating these edited images continues to pose a considerable challenge. Different from the assessment of text-driven image generation, text-driven image editing is characterized by simultaneously conditioning on both text and a source image. The edited images often retain an intrinsic connection to the original image, which changes dynamically with the semantics of the text. However, previous methods tend to focus solely on text-image alignment or are not well aligned with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts, and the corresponding edited results from different editing methods, with nearly 4,000 samples and corresponding Mean Opinion Scores (MOS) provided by 15 human subjects. Furthermore, we introduce IE-Critic-R1, which, benefiting from Reinforcement Learning from Verifiable Rewards (RLVR), provides more comprehensive and explainable quality assessment for text-driven image editing that aligns with human perception. Extensive experiments demonstrate IE-Critic-R1's superior subjective alignment on the text-driven image editing task compared with previous metrics. Related data and code are publicly available.


📄 Content

Text-driven image editing [1,2,44,68,72] has attracted significant attention in recent years. However, there is currently no well-established metric for evaluating the results of image editing. Objective metrics such as the CLIP score [38], DINO score [34], LPIPS [69], and SSIM [57] evaluate image quality from a single perspective, such as text-image consistency or the correlation between the source and edited images. These metrics neither provide an overall evaluation nor align well with human perception. Previous studies [4,19,24,60] have shown that they can differ significantly from human judgment in practical applications.
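The single-perspective metrics above mostly reduce to a similarity between embedding vectors. Here is a minimal, self-contained sketch of that principle for a CLIP-style score; the real CLIP score uses embeddings produced by a trained vision-language model, and the function names and the `scale` constant here are illustrative, not from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_style_score(text_embedding, image_embedding, scale=100.0):
    """CLIP-score-style metric: scaled, clamped cosine similarity between
    a text embedding and an image embedding (both assumed precomputed)."""
    return scale * max(cosine_similarity(text_embedding, image_embedding), 0.0)
```

Note that such a score captures only text-image consistency: it says nothing about how well the edit preserves the source image, which is exactly the gap the paper points out.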

In recent years, metrics aligned with human perception, such as HPS scores [30,59,60], PickScore [19], and ImageReward [63], have made effective progress in evaluating text-to-image generation by collecting human visual feedback. However, these methods consider only individual images and text, which differs from the setting of image editing. Unlike text-driven image generation, text-driven image editing also takes a source image as input. The edited result is expected to differ from the source image, yet also to correspond to it to a certain degree. Modeling this relationship is crucial for evaluation: in some cases, the edited result should retain semantic information from the original image, and a metric that considers only the edited output would miss this entirely. Modeling this relationship is, however, challenging, because the connection between source and target images changes dynamically with the text context. For example, a stylistic instruction like "make it a claymation style" may drastically alter the structure, texture, and lines of the original image; a replacement instruction like "replace the cat with a dog" directly alters the semantic content, so a large difference between source and target is expected; whereas an instruction like "remove her earrings" should retain most of the identity information of the original character. A multi-modal method that can dynamically model the source-target relationship is therefore urgently needed.
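To make the argument concrete, the sketch below hand-codes what the paper argues a learned multi-modal metric must capture: how much source fidelity to expect depends on the kind of edit requested. Everything here, including the edit-type keys and weight values, is a hypothetical illustration, not the paper's method (IE-Critic-R1 learns this relationship rather than using fixed rules).

```python
# Hypothetical weights: how much of the score should come from
# source-target fidelity, depending on the edit type. These numbers
# are illustrative only.
EXPECTED_SOURCE_FIDELITY = {
    "style": 0.3,    # "make it a claymation style": large structural change allowed
    "replace": 0.5,  # "replace the cat with a dog": localized semantic change
    "remove": 0.8,   # "remove her earrings": identity should be preserved
}

def combined_score(text_alignment, source_fidelity, edit_type):
    """Blend text-image alignment and source-target fidelity (both in
    [0, 1]) with a weight that depends on the kind of edit requested."""
    w = EXPECTED_SOURCE_FIDELITY.get(edit_type, 0.5)
    return (1.0 - w) * text_alignment + w * source_fidelity
```

A perfectly instruction-following edit that discards the source entirely thus scores well for a style edit but poorly for a removal edit, matching the intuition in the examples above.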

[Figure: a worked example for the prompt "Make the airplane blue": a baseline critique judges the text alignment, fidelity, and quality of the edit and scores it 4.70, while IE-Critic-R1, trained with GRPO and RLVR, produces a more detailed assessment of the same aspects and scores it 3.73, close to the ground-truth overall MOS of 3.72.]

In this work, we propose the Text-driven Image Editing Benchmark (IE-Bench) to improve the alignment between evaluation metrics for text-driven image editing and human perception. We first introduce IE-Bench, a database containing various source-prompt-target cases and their corresponding Mean Opinion Scores (MOS). We collect diverse real-world, CG, AIGC, and art painting images from different sources. Following previous works [14], we manually design diverse editing instructions for each image, covering aspects such as structural changes (e.g., shape, size), style changes (e.g., texture, color), and semantic changes (e.g., pose, action, addition, replacement, deletion). We then apply multiple methods to generate diverse edited results. Finally, we recruit 15 human participants from various backgrounds to provide subjective ratings.
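Two quantities from this pipeline can be sketched directly: how an MOS is aggregated from the 15 raters, and how a verifiable reward for RLVR-style training could compare a critic's predicted score against that MOS. The MOS is the standard mean of ratings; the reward function is an illustrative linear-decay formulation, not necessarily the exact one used to train IE-Critic-R1.

```python
def mean_opinion_score(ratings):
    """Mean Opinion Score: the average of the subjects' ratings
    (typically on a 1-5 scale in MOS studies)."""
    return sum(ratings) / len(ratings)

def verifiable_reward(predicted, mos, score_range=4.0):
    """Illustrative verifiable reward for RLVR: 1.0 when the critic's
    predicted score matches the ground-truth MOS exactly, decaying
    linearly with the absolute error over the 1-5 score range."""
    return max(0.0, 1.0 - abs(predicted - mos) / score_range)
```

Under this formulation, the figure's example (predicted 3.73 against a ground-truth MOS of 3.72) would earn a reward close to 1.0, so the policy is pushed toward critiques whose final scores track human opinion.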

This content is AI-processed based on ArXiv data.
