GenEval 2: Addressing Benchmark Drift in
Text-to-Image Evaluation
Amita Kamath1,2,3,∗, Kai-Wei Chang3, Ranjay Krishna2,4, Luke Zettlemoyer1,2, Yushi Hu1,†,
Marjan Ghazvininejad1,†
1FAIR at Meta, 2University of Washington, 3University of California, Los Angeles, 4Allen Institute
for AI
∗Work done at Meta, †Joint last author
Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to
score correctness, and test prompts must be selected to be challenging for current T2I models but
not the judge. We argue that satisfying these constraints can lead to benchmark drift over time,
where the static benchmark judges fail to keep up with newer model capabilities. We show that
benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks.
Although GenEval was well-aligned with human judgment at the time of its release, it has drifted
far from human judgment over time—resulting in an absolute error of as much as 17.7% for current
models. This level of drift strongly suggests that GenEval has been saturated for some time, as
we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new
benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of
compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA,
an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is
better aligned with human judgment and, we argue, is less likely to drift from human alignment over
time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will
provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed, and our
work more generally highlights the importance of continual audits and improvements for T2I and
related automated model evaluation benchmarks.
Date: December 19, 2025
Correspondence: Amita Kamath at kamatha@cs.washington.edu, Yushi Hu at yushihu@meta.com,
Marjan Ghazvininejad at ghazvini@meta.com
Code: https://github.com/facebookresearch/GenEval2
1 Introduction
Text-to-Image (T2I) models are becoming increasingly capable (Deng et al., 2025; Wu et al., 2025a; Labs, 2024;
Comanici et al., 2025), with models trained on increasing amounts of natural and synthesized data. Their
rapid progress has been both driven and measured by T2I benchmarks, for everything from basic capabilities
like object colors and counts (Ghosh et al., 2023; Huang et al., 2023; Li et al., 2024) to advanced capabilities
like knowledge and reasoning (Niu et al., 2025; Chang et al., 2025; Sun et al., 2025; Chen et al., 2025b). These
benchmarks employ model-based evaluation: images generated by T2I models are evaluated using either a
combination of specialized models such as object detectors and image-text matching models (Ghosh et al.,
2023; Huang et al., 2023), or a single VQA model (Li et al., 2024; Niu et al., 2025; Chang et al., 2025).
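To make the two evaluation styles above concrete, the following is a minimal, hypothetical sketch; both judge functions and their returned probabilities are stand-ins for illustration only, not the actual GenEval, VQAScore, or Soft-TIFA implementations:

```python
# Hypothetical sketch: two styles of model-based T2I evaluation.
# Both judges below return fixed stand-in probabilities; a real pipeline
# would call a detector / image-text matcher or a VQA model.

def holistic_judge(image, prompt):
    """Score the whole prompt at once with a single judge (VQAScore-style)."""
    return 0.62  # stand-in probability

def primitive_judges(image, primitives):
    """Ask one yes/no question per visual primitive (TIFA-style)."""
    return {p: 0.9 for p in primitives}  # stand-in probabilities

def soft_score(image, primitives):
    """Combine per-primitive soft judgments by averaging."""
    probs = primitive_judges(image, primitives)
    return sum(probs.values()) / len(probs)

prompt = "a red cube to the left of a blue sphere"
primitives = ["a red cube", "a blue sphere", "cube left of sphere"]
image = None  # placeholder for a generated image
print(f"holistic: {holistic_judge(image, prompt):.2f}")            # prints "holistic: 0.62"
print(f"per-primitive average: {soft_score(image, primitives):.2f}")  # prints "per-primitive average: 0.90"
```

The key design difference is that the per-primitive score decomposes the prompt into independently checkable pieces, whereas the holistic judge produces a single opaque probability for the whole prompt.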
However, we raise a critical question: given how much T2I models’ capabilities have changed, are the evaluations
of longer-standing benchmarks still valid? To investigate, we study one of the most prominent benchmarks to
measure basic T2I capabilities: GenEval (Ghosh et al., 2023). This benchmark has been a primary evaluation
in many popular T2I papers over the past three years including (but not limited to) Stable Diffusion 3 (Esser
et al., 2024), Transfusion (Zhou et al., 2024), Emu3 (Wang et al., 2024b), Show-o (Xie et al., 2024), SEED-X
(Ge et al., 2024), MetaQueries (Pan et al., 2025), BAGEL (Deng et al., 2025), Janus (Wu et al., 2025b),
OmniGen (Xiao et al., 2025), BLIP3-o (Chen et al., 2025a), and Qwen-Image (Wu et al., 2025a).
arXiv:2512.16853v1 [cs.CV] 18 Dec 2025
[Figure 1: (a) GenEval relies on CLIP (Radford et al., 2021) and a detector trained on COCO (Cheng et al., 2022b), which are no longer reliable for evaluating recent T2I models; e.g., a dining table generated by Bagel (2025) goes undetected ("No dining table found") while one from Stable Diffusion 2.1 (2022) is marked "Correct". (b) The gap between human and automatic evaluation scores on GenEval increases (up to 17.7%) as T2I models become better, and eventually saturate the prompts.]
Figure 1 With the distribution shift of Text-to-Image (T2I) models' outputs over time, we reveal that the model-based evaluation of GenEval decreases in human-alignment, masking the fact that the benchmark is now saturated. We introduce GenEval 2, a more robust benchmark that is challenging for state-of-the-art T2I models, alongside an evaluation method, Soft-TIFA, that is less likely to suffer benchmark drift.
However, despite being so ubiquitously used in T2I research—usually with reports of gains of ∼2–3% over
previous state-of-the-art models—we find that GenEval results on recent models can diverge from human
judgment by a
…(Full text truncated)…