GenEval 2: Addressing Benchmark Drift in
Text-to-Image Evaluation
Amita Kamath1,2,3,∗, Kai-Wei Chang3, Ranjay Krishna2,4, Luke Zettlemoyer1,2, Yushi Hu1,†,
Marjan Ghazvininejad1,†
1FAIR at Meta, 2University of Washington, 3University of California, Los Angeles, 4Allen Institute
for AI
∗Work done at Meta, †Joint last author
Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to
score correctness, and test prompts must be selected to be challenging for current T2I models but
not the judge. We argue that satisfying these constraints can lead to benchmark drift over time,
where the static benchmark judges fail to keep up with newer model capabilities. We show that
benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks.
Although GenEval was well-aligned with human judgment at the time of its release, it has drifted
far from human judgment over time—resulting in an absolute error of as much as 17.7% for current
models. This level of drift strongly suggests that GenEval has been saturated for some time, as
we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new
benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of
compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA,
an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is
better aligned with human judgment and, we argue, is less likely to drift from human alignment over
time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will
provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed, and our
work more generally highlights the importance of continual audits and improvements for T2I and
related automated model evaluation benchmarks.
Date: December 19, 2025
Correspondence: Amita Kamath at kamatha@cs.washington.edu, Yushi Hu at yushihu@meta.com,
Marjan Ghazvininejad at ghazvini@meta.com
Code: https://github.com/facebookresearch/GenEval2
1 Introduction
Text-to-Image (T2I) models are becoming increasingly capable (Deng et al., 2025; Wu et al., 2025a; Labs, 2024;
Comanici et al., 2025), with models trained on increasing amounts of natural and synthesized data. Their
rapid progress has been both driven and measured by T2I benchmarks, for everything from basic capabilities
like object colors and counts (Ghosh et al., 2023; Huang et al., 2023; Li et al., 2024) to advanced capabilities
like knowledge and reasoning (Niu et al., 2025; Chang et al., 2025; Sun et al., 2025; Chen et al., 2025b). These
benchmarks employ model-based evaluation: images generated by T2I models are evaluated using either a
combination of specialized models such as object detectors and image-text matching models (Ghosh et al.,
2023; Huang et al., 2023), or a single VQA model (Li et al., 2024; Niu et al., 2025; Chang et al., 2025).
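To make the two evaluation styles above concrete, the following is a minimal, hypothetical sketch; both judge functions and their returned probabilities are stand-ins for illustration only, not the actual GenEval, VQAScore, or Soft-TIFA implementations:

```python
# Hypothetical sketch: two styles of model-based T2I evaluation.
# Both judges below return fixed stand-in probabilities; a real pipeline
# would call a detector / image-text matcher or a VQA model.

def holistic_judge(image, prompt):
    """Score the whole prompt at once with a single judge (VQAScore-style)."""
    return 0.62  # stand-in probability

def primitive_judges(image, primitives):
    """Ask one yes/no question per visual primitive (TIFA-style)."""
    return {p: 0.9 for p in primitives}  # stand-in probabilities

def soft_score(image, primitives):
    """Combine per-primitive soft judgments by averaging."""
    probs = primitive_judges(image, primitives)
    return sum(probs.values()) / len(probs)

prompt = "a red cube to the left of a blue sphere"
primitives = ["a red cube", "a blue sphere", "cube left of sphere"]
image = None  # placeholder for a generated image
print(f"holistic: {holistic_judge(image, prompt):.2f}")            # prints "holistic: 0.62"
print(f"per-primitive average: {soft_score(image, primitives):.2f}")  # prints "per-primitive average: 0.90"
```

The key design difference is that the per-primitive score decomposes the prompt into independently checkable pieces, whereas the holistic judge produces a single opaque probability for the whole prompt.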
However, we raise a critical question: given how much T2I models’ capabilities have changed, are the evaluations
of longer-standing benchmarks still valid? To investigate, we study one of the most prominent benchmarks to
measure basic T2I capabilities: GenEval (Ghosh et al., 2023). This benchmark has been a primary evaluation
in many popular T2I papers over the past three years including (but not limited to) Stable Diffusion 3 (Esser
et al., 2024), Transfusion (Zhou et al., 2024), Emu3 (Wang et al., 2024b), Show-o (Xie et al., 2024), SEED-X
(Ge et al., 2024), MetaQueries (Pan et al., 2025), BAGEL (Deng et al., 2025), Janus (Wu et al., 2025b),
OmniGen (Xiao et al., 2025), BLIP3-o (Chen et al., 2025a), and Qwen-Image (Wu et al., 2025a).
arXiv:2512.16853v1 [cs.CV] 18 Dec 2025
[Figure 1: (a) GenEval relies on CLIP (Radford et al., 2021) and a detector trained on COCO (Cheng et al., 2022b), which are no longer reliable for evaluating recent T2I models; e.g., a dining table generated by Bagel (2025) goes undetected ("No dining table found") while one from Stable Diffusion 2.1 (2022) is marked "Correct". (b) The gap between human and automatic evaluation scores on GenEval increases (up to 17.7%) as T2I models become better, and eventually saturate the prompts.]
Figure 1 With the distribution shift of Text-to-Image (T2I) models' outputs over time, we reveal that the model-based evaluation of GenEval decreases in human-alignment, masking the fact that the benchmark is now saturated. We introduce GenEval 2, a more robust benchmark that is challenging for state-of-the-art T2I models, alongside an evaluation method, Soft-TIFA, that is less likely to suffer benchmark drift.
However, despite being so ubiquitously used in T2I research—usually with reports of gains of ∼2–3% over
previous state-of-the-art models—we find that GenEval results on recent models can diverge from human
judgment by a
…(Full text truncated)…