EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 on the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.


💡 Research Summary

The paper tackles a fundamental obstacle that has prevented the widespread adoption of reinforcement learning (RL) for instruction‑guided image editing: the lack of a high‑fidelity, efficient reward function that can be queried at scale. The authors introduce a two‑pronged solution: (1) EditReward‑Bench, a comprehensive benchmark designed to evaluate reward models on the quality of image edits, and (2) EditScore, a family of specialized reward models (ranging from 7 B to 72 B parameters) that achieve state‑of‑the‑art performance on this benchmark.

EditReward‑Bench covers 13 diverse editing subtasks grouped into four categories—Subject, Appearance, Scene, and Advanced—reflecting real‑world editing challenges. For each subtask, the benchmark provides edited outputs generated by 11 different editing systems (both open‑source and proprietary). Human experts rank five candidate outputs per input along three dimensions: Prompt Following (how well the edit matches the instruction), Consistency (preservation of unchanged regions), and Overall Quality (photorealism, artifact‑free rendering). A novel “Two‑Annotator Discussion Protocol” forces two experts to discuss each sample until consensus is reached, dramatically reducing annotation noise (by more than 12 % on the Consistency dimension). The resulting dataset contains 3,072 pairwise preference judgments (944 for Prompt Following, 890 for Consistency, 1,238 for Overall Quality).
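Evaluating a reward model against such pairwise human judgments reduces to counting how often the model's scalar scores agree with the human ordering. The sketch below illustrates this, assuming hypothetical dict inputs (candidate id → human rank, candidate id → model score); the data structures and function name are illustrative, not the paper's actual evaluation code.

```python
from itertools import combinations

def preference_accuracy(rankings, model_scores):
    """Fraction of human pairwise preferences the reward model reproduces.

    rankings: candidate id -> human rank (1 = best).
    model_scores: candidate id -> scalar reward.
    Both structures are hypothetical, for illustration only.
    """
    correct = total = 0
    for a, b in combinations(rankings, 2):
        if rankings[a] == rankings[b]:
            continue  # skip ties in the human ranking
        human_prefers_a = rankings[a] < rankings[b]   # lower rank = better
        model_prefers_a = model_scores[a] > model_scores[b]
        correct += human_prefers_a == model_prefers_a
        total += 1
    return correct / total if total else 0.0
```

A ranking of five candidates yields up to ten such pairs per input, which is how per-dimension counts like 944 or 1,238 pairwise judgments arise from the annotated rankings.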

Building on this data, EditScore is fine‑tuned from the Qwen2.5‑VL series using a conditional text‑generation objective. Given (Instruction, Input Image, Output Image), the model produces a chain‑of‑thought reasoning followed by two orthogonal scores: Semantic Consistency (SC) and Perceptual Quality (PQ). SC captures both adherence to the instruction and preservation of unedited content; PQ assesses photorealism and the presence of visual artifacts. The final reward is the geometric mean of SC and PQ, providing a balanced scalar that reflects both semantic correctness and visual fidelity.
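The score-combination step is simple enough to state directly. The sketch below assumes SC and PQ are non-negative scalars on a common scale (e.g. 0-10); the function name is illustrative.

```python
import math

def edit_reward(sc: float, pq: float) -> float:
    """Combine Semantic Consistency (SC) and Perceptual Quality (PQ)
    into one scalar via their geometric mean.

    Assumes both scores are non-negative and on the same scale."""
    assert sc >= 0 and pq >= 0, "scores must be non-negative"
    return math.sqrt(sc * pq)
```

Unlike an arithmetic mean, the geometric mean penalizes imbalance: a perfectly faithful but artifact-ridden edit (high SC, low PQ) scores well below an edit that is solid on both axes, e.g. `edit_reward(8, 2)` is 4.0 while `edit_reward(5, 5)` is 5.0.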

To further boost reliability, the authors propose an inference‑time ensembling strategy. For each triplet, the model is run K (typically 4–8) stochastic forward passes; the resulting scores are aggregated (average or majority vote) to reduce variance. This simple yet effective technique improves correlation with human judgments by 3–5 percentage points.
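For the score-averaging variant, the ensembling step can be sketched as below. `score_fn` is a hypothetical callable wrapping one stochastic forward pass of the reward model (sampling enabled, temperature > 0); it is not an actual EditScore API.

```python
import statistics

def ensemble_score(score_fn, instruction, src_img, out_img, k=4):
    """Self-ensemble: run the stochastic reward model k times on the same
    (instruction, input, output) triplet and average the scalar scores,
    reducing the variance introduced by sampling."""
    samples = [score_fn(instruction, src_img, out_img) for _ in range(k)]
    return statistics.mean(samples)
```

Averaging K independent samples shrinks the score's standard deviation by roughly a factor of sqrt(K), at the cost of K forward passes per query.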

Empirically, the 72 B self‑ensembled EditScore surpasses the proprietary GPT‑5 on the benchmark, achieving an average preference‑prediction accuracy of 92.3 % across all tasks—well above the previous best of 86.7 %. The smaller variants in the series also outperform existing open‑source reward models, demonstrating that scale combined with domain‑specific fine‑tuning yields substantial gains.

The paper then validates the practical utility of EditScore in two ways. First, a “Best‑of‑N” selection experiment shows that using EditScore to pick the highest‑scoring output from several state‑of‑the‑art editors consistently improves the final edit quality. Second, the authors integrate EditScore as the reward signal in an online RL loop (PPO) applied to OmniGen2, a strong base editing model. Training with the high‑fidelity reward leads to notable uplifts: Prompt Following improves by 8.4 % and Overall Quality by 6.9 % relative to the non‑RL baseline. In contrast, substituting the reward with a generic large VLM (Qwen2.5‑VL‑72B) results in unstable training and even performance degradation, underscoring that merely increasing model size does not guarantee a useful reward.
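The Best-of-N procedure amounts to sampling several candidate edits and keeping the one the reward model ranks highest. A minimal sketch, where `edit_fn` (a stochastic editor) and `reward_fn` (an EditScore-style scorer) are hypothetical callables standing in for the real models:

```python
def best_of_n(edit_fn, reward_fn, instruction, src_img, n=4):
    """Best-of-N selection: draw n candidate edits for the same instruction
    and return the candidate with the highest reward-model score.

    edit_fn(instruction, src_img) -> candidate image
    reward_fn(instruction, src_img, candidate) -> scalar reward
    Both callables are illustrative placeholders."""
    candidates = [edit_fn(instruction, src_img) for _ in range(n)]
    return max(candidates, key=lambda img: reward_fn(instruction, src_img, img))
```

The same reward callable can then be reused as the per-sample reward inside an online RL loop, which is exactly the dependency that makes reward fidelity so critical: a noisy scorer corrupts every policy update, not just the final selection.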

Finally, the authors commit to releasing the benchmark data, the entire suite of EditScore models, and the RL training code, enabling reproducibility and further research. By systematically linking benchmark design, reward‑model development, and RL optimization, the work provides the first end‑to‑end roadmap for unlocking online reinforcement learning in image editing. It demonstrates that a specialized, high‑fidelity reward model is the key to harnessing the full potential of RL for complex, instruction‑driven visual manipulation.

