GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time, resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is better aligned with human judgment and argue is less likely to drift from human alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed, and our work more generally highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.


💡 Research Summary

The paper tackles a fundamental problem in the automated evaluation of text‑to‑image (T2I) systems, which the authors term “benchmark drift.” Existing benchmarks rely on a static judge model that scores generated images and on a fixed set of prompts that are deliberately hard for current T2I models but easy for the judge. Over time, as T2I models improve, the static judge fails to keep pace, and the prompts no longer challenge the newest models. Consequently, the benchmark’s correlation with human judgment deteriorates.

The authors focus on GenEval, one of the most widely used T2I evaluation suites. When GenEval was first released, it aligned well with human ratings (Pearson r ≈ 0.81). However, a large‑scale human study on twelve state‑of‑the‑art models released between 2022 and 2024 shows that the absolute error between GenEval scores and human judgments can reach 17.7%, with an average error of about 12%. This indicates that GenEval has been saturated for some time: its prompts no longer probe the limits of modern models, and the judge remains fixed at an older capability level.
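The drift figures above are simple aggregate errors between benchmark scores and human scores. A minimal sketch of how such a comparison can be computed; the per-model numbers below are illustrative placeholders, not values from the paper:

```python
# Hypothetical benchmark vs. human scores per model (fractions in [0, 1]).
# These numbers are illustrative only, not taken from the paper.
benchmark_scores = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.90}
human_scores     = {"model_a": 0.65, "model_b": 0.70, "model_c": 0.76}

def absolute_errors(bench, human):
    """Per-model absolute error between benchmark and human scores."""
    return {m: abs(bench[m] - human[m]) for m in bench}

errors = absolute_errors(benchmark_scores, human_scores)
max_error = max(errors.values())                 # worst-case drift for any model
mean_error = sum(errors.values()) / len(errors)  # average drift across models
```

With these placeholder numbers, `max_error` plays the role of the paper's worst-case 17.7% figure and `mean_error` the role of the roughly 12% average.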

To remedy this, the paper introduces GenEval 2, built around two design principles. First, it expands coverage of “primitive visual concepts” – basic attributes such as color, texture, shape, size, spatial relation, lighting, etc. – and explicitly incorporates them into the prompt set. Second, it raises compositionality by combining three to five primitives in each prompt, forcing models to satisfy multiple constraints simultaneously. This contrasts with earlier benchmarks that largely assess global qualities like realism or fidelity.
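The idea of raising compositionality by combining several primitives into one prompt can be sketched as follows. The primitive taxonomy, attribute pools, and prompt template here are hypothetical stand-ins, not the paper's actual prompt-generation pipeline:

```python
import random

# Hypothetical primitive pools; the actual GenEval 2 taxonomy and
# prompt templates may differ.
PRIMITIVES = {
    "color":   ["red", "blue", "matte green"],
    "texture": ["furry", "metallic", "woven"],
    "shape":   ["spherical", "cube-shaped"],
    "spatial": ["to the left of a bench", "under a table"],
}
OBJECTS = ["apple", "lamp", "backpack"]

def compose_prompt(num_primitives, rng):
    """Combine several primitive constraints into a single prompt,
    so a model must satisfy all of them simultaneously."""
    kinds = rng.sample(list(PRIMITIVES), k=num_primitives)
    attrs = [rng.choice(PRIMITIVES[k]) for k in kinds if k != "spatial"]
    obj = rng.choice(OBJECTS)
    prompt = f"a {' '.join(attrs)} {obj}"
    if "spatial" in kinds:
        prompt += f" {rng.choice(PRIMITIVES['spatial'])}"
    return prompt, kinds

rng = random.Random(0)
prompt, kinds = compose_prompt(3, rng)
```

Each generated prompt records which primitive kinds it exercises, which is what later enables per-primitive scoring.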

Because GenEval 2 evaluates many primitives per image, the authors propose a new scoring method called Soft‑TIFA. For each primitive, a separate human judgment is collected and a dedicated sub‑judge is trained. Soft‑TIFA then aggregates these primitive‑level scores using a Bayesian weighted average that accounts for the difficulty of each primitive and the confidence of the human labels. The result is a single score that reflects fine‑grained visual correctness rather than a monolithic holistic rating.
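The aggregation step described above can be illustrated with a simple weighted mean over per-primitive scores. This is a sketch under assumptions: the score scale and weighting scheme below are illustrative, not the paper's exact formulation:

```python
def soft_tifa_score(primitive_scores, weights=None):
    """Aggregate per-primitive correctness scores (each in [0, 1]) into one
    image-level score via a weighted mean. Uniform weights by default;
    the paper's actual aggregation may differ."""
    keys = list(primitive_scores)
    if weights is None:
        weights = {k: 1.0 for k in keys}
    total_w = sum(weights[k] for k in keys)
    return sum(weights[k] * primitive_scores[k] for k in keys) / total_w

# One generated image judged on three primitives (illustrative values).
scores = {"color": 1.0, "spatial": 0.0, "texture": 0.5}
uniform = soft_tifa_score(scores)  # plain mean of the three scores
weighted = soft_tifa_score(scores, {"color": 1.0, "spatial": 2.0, "texture": 1.0})
```

The key property is that the final score is built from many fine-grained judgments, so a single drifting holistic judge cannot silently dominate the result.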

Empirical evaluation shows that Soft‑TIFA substantially improves alignment with human judgment. On the same set of modern T2I models, Soft‑TIFA achieves Pearson r = 0.84 and Spearman ρ = 0.81 against human scores, compared with r ≈ 0.61 for the original GenEval/VQAScore pipeline. The mean absolute error drops to under 5%, roughly half the previous error. Moreover, primitive‑level analysis reveals systematic weaknesses: while models excel at reproducing colors and textures, they still struggle with complex spatial relationships (e.g., “a blue apple on a red book”). This diagnostic capability is a valuable by‑product of the new benchmark.
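The Pearson and Spearman figures quoted above are standard correlation measures between two lists of per-model scores. A self-contained sketch of how they are computed (the score lists are hypothetical; tie handling in the rank function is omitted for brevity):

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson on the ranks (ties ignored)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0.0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical benchmark vs. human scores for five models.
bench = [0.40, 0.55, 0.60, 0.70, 0.85]
human = [0.35, 0.50, 0.65, 0.72, 0.80]
```

Spearman only cares whether the benchmark ranks models in the same order as humans do, which is often the property practitioners actually rely on when comparing models.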

Finally, the authors argue that any automated T2I benchmark must be treated as a living resource. Continuous audits, periodic prompt refreshes, and regular re‑training of judge models are essential to prevent future drift. GenEval 2 and Soft‑TIFA are positioned as a more robust foundation that is less prone to drift because its evaluation is grounded in many low‑level visual judgments rather than a single holistic metric. The work underscores the broader lesson that the community should invest in sustainable benchmarking practices to ensure that automated scores remain faithful proxies for human perception as generative models continue to evolve.