RealDrag: The First Dragging Benchmark with Real Target Image
The evaluation of drag-based image editing models is unreliable due to a lack of standardized benchmarks and metrics. This ambiguity stems from inconsistent evaluation protocols and, critically, from the absence of datasets containing ground-truth target images, which makes objective comparison between competing methods difficult. To address this, we introduce **RealDrag**, the first comprehensive benchmark for point-based image editing that includes paired ground-truth target images. Our dataset contains over 400 human-annotated samples from diverse video sources, providing source/target images, handle/target points, editable-region masks, and descriptive captions for both the image and the editing action. We also propose four novel, task-specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS). These metrics quantify pixel-level matching fidelity, preservation of non-edited (out-of-mask) regions, and semantic alignment with the desired task. Using this benchmark, we conduct the first large-scale systematic analysis of the field, evaluating 17 state-of-the-art models. Our results reveal clear trade-offs among current approaches and establish a robust, reproducible baseline to guide future research. Our dataset and evaluation toolkit will be made publicly available.
💡 Research Summary
The paper “RealDrag: The First Dragging Benchmark with Real Target Image” addresses a critical problem in the field of drag-based image editing: the lack of reliable and standardized evaluation. Currently, assessing and comparing different models is highly subjective and inconsistent due to the absence of a common benchmark with ground truth data. The authors identify two main issues: inconsistent evaluation protocols and, most fundamentally, the lack of datasets containing paired real source and target images, which makes objective quantification of performance impossible.
To solve this, the authors introduce RealDrag, the first comprehensive benchmark specifically designed for point-based image editing that includes actual ground truth target images. The dataset consists of over 400 carefully curated and human-annotated samples sourced from diverse videos. Each sample provides a complete set of data: a source image, its corresponding real target image (extracted from a subsequent video frame), handle and target points, a mask defining the editable region, and descriptive captions for both the image content and the editing action performed. This rich annotation enables a direct, apples-to-apples comparison between a model’s output and the ideal result.
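The annotation schema described above can be pictured as a simple record type. The sketch below is illustrative only: the field names and file layout are assumptions, not the authors' released format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates

@dataclass
class DragSample:
    """One RealDrag-style annotation record (field names are hypothetical)."""
    source_image: str            # path to the source video frame
    target_image: str            # path to the real target frame (ground truth)
    handle_points: List[Point]   # points on the source to be dragged
    target_points: List[Point]   # where each handle point should end up
    editable_mask: str           # path to a binary mask of the editable region
    image_caption: str           # describes the image content
    action_caption: str          # describes the editing action

# A toy record showing how the pieces fit together.
sample = DragSample(
    source_image="frames/0001_src.png",
    target_image="frames/0001_tgt.png",
    handle_points=[(120, 200)],
    target_points=[(160, 200)],
    editable_mask="masks/0001.png",
    image_caption="a dog standing on grass",
    action_caption="move the dog's head to the right",
)
# Each handle point must be paired with exactly one target point.
assert len(sample.handle_points) == len(sample.target_points)
```

Pairing each handle point with a target point, plus the mask and captions, is what makes a model's output directly comparable against the real target frame.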
Furthermore, the paper proposes four novel, task-specific evaluation metrics designed to capture different aspects of drag editing quality:
- Semantical Distance (SeD): Measures pixel-level matching fidelity between the edited result and the ground truth target image.
- Outer Mask Preserving Score (OMPS): Evaluates how well the model preserves the content in regions outside the editable mask, ensuring background consistency.
- Inner Patch Preserving Score (IPPS): Assesses the preservation of internal details and texture within the edited object itself during deformation.
- Directional Similarity (DiS): Quantifies the semantic alignment of the edit direction (e.g., translation, rotation, scaling) with the user’s intent described in the action caption.
Utilizing this new benchmark and metrics, the authors conduct the first large-scale systematic analysis of the field, evaluating 17 state-of-the-art models spanning different categories (GAN-based, Diffusion-based, training-free, optimization-based, etc.). The evaluation reveals clear trade-offs among current approaches. For instance, GAN-based methods like DragGAN are fast but struggle with complex deformations, while Diffusion-based methods (e.g., DragDiffusion, DragonDiffusion) offer higher quality and flexibility at a greater computational cost. The analysis also maps the evolution of ideas, showing a trend from simple point tracking to more sophisticated region-based control, language-integrated guidance, and task-aware adaptive controllers.
In summary, RealDrag establishes a robust, reproducible baseline for the drag-based image editing community. By providing the crucial element of real target images and a suite of tailored metrics, it enables objective comparison and drives future research. The promised public release of the dataset and evaluation toolkit is poised to become an essential resource for standardizing evaluation and fostering meaningful progress in developing controllable generative image editing tools.