DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, do not represent downstream use-cases, and saturate early as models improve; (ii) ‘blindly-solvable’ questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize their fidelity and discriminability. We find that transformations such as converting MCQs to generative tasks reveal sharp capability drops of up to 35%. In addition, filtering blindly-solvable and mislabeled samples enhances the discriminative power of these evaluations, while simultaneously reducing their computational cost. We release DATBENCH-FULL, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DATBENCH, a discriminative subset that achieves a 13× average speedup (up to 50×) while closely matching the discriminative power of the original datasets. Our work provides a path towards evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
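The blindly-solvable filter described above can be sketched minimally: probe each question with the image withheld and drop samples that a blind probe still answers correctly. This is an illustrative sketch only; `answer_without_image` is a hypothetical stand-in for querying a text-only model, and the longest-option heuristic it uses here is merely a placeholder for a known multiple-choice bias, not the paper's actual procedure.

```python
# Hedged sketch of filtering "blindly-solvable" samples: questions that can be
# answered without the image. Not the paper's implementation.

def answer_without_image(question: str, choices: list[str]) -> str:
    # Hypothetical blind probe. As a placeholder bias, pick the longest
    # option, a heuristic blind models are known to exploit in some MCQs.
    return max(choices, key=len)

def filter_blindly_solvable(samples: list[dict]) -> list[dict]:
    """Keep only samples whose answer the blind probe fails to recover,
    i.e. samples that genuinely require looking at the image."""
    kept = []
    for s in samples:
        blind_guess = answer_without_image(s["question"], s["choices"])
        if blind_guess != s["answer"]:
            kept.append(s)
    return kept

samples = [
    # Requires the image: the blind heuristic cannot tell red from blue.
    {"question": "What color is the car?",
     "choices": ["red", "blue"], "answer": "red"},
    # Blindly solvable: the correct option gives itself away textually.
    {"question": "Which animal is shown?",
     "choices": ["a whale, which nurses its young", "a rock"],
     "answer": "a whale, which nurses its young"},
]
print(len(filter_blindly_solvable(samples)))  # → 1
```

In practice one would replace the heuristic with an actual text-only model call and aggregate over multiple blind probes before discarding a sample, since a single lucky or unlucky guess is a noisy signal.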

