📝 Original Info
- Title: Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
- ArXiv ID: 2512.16921
- Date: 2025-12-18
- Authors: Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu
📝 Abstract
Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
📄 Full Content
Differences That Matter:
Auditing Models for Capability Gap Discovery and Rectification
Qihao Liu1,2*, Chengzhi Mao1, Yaojie Liu1, Alan Yuille2, Wen-Sheng Chu1
1Google  2Johns Hopkins University
https://auditdm.github.io/
Abstract
Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
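To make the disagreement objective concrete, here is a minimal Python sketch of how a per-example auditor reward could be computed: the auditor proposes an (image, question) pair, each target model answers, and the reward grows with the fraction of models deviating from the majority answer. The model interface (`answer_fns` callables), the string normalization, and this particular majority-based metric are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter
from typing import Callable, List

def disagreement_reward(
    answer_fns: List[Callable[[bytes, str], str]],  # one callable per target MLLM (assumed interface)
    image: bytes,
    question: str,
) -> float:
    """Reward in [0, 1]: 0 when all target models agree, higher when answers diverge."""
    answers = [fn(image, question).strip().lower() for fn in answer_fns]
    majority_count = Counter(answers).most_common(1)[0][1]
    # Fraction of target models whose answer deviates from the majority answer.
    return 1.0 - majority_count / len(answers)

# Toy usage with three stand-in "models", echoing the Fig. 1 example:
fns = [lambda img, q: "ladder", lambda img, q: "air", lambda img, q: "ladder"]
print(disagreement_reward(fns, b"", "What connects the motorcycle to the aircraft?"))  # 0.333...
```

A scalar reward of this shape is what an RL fine-tuning loop (e.g., a policy-gradient update on the auditor) could maximize; the paper does not commit to this exact metric here.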
1. Introduction
The rapid evolution of Multimodal Large Language Models (MLLMs) has led to a surge in high-performing models [17, 21, 30, 75]. However, despite steady improvements on public benchmarks, selecting the proper model for real-world deployment remains challenging, as conventional evaluations often obscure how models truly differ. This is particularly evident when models are retrained on new data: although retraining may improve targeted knowledge, its impact on broader capabilities remains unclear. Similar concerns arise with task-specific fine-tuning and edge deployment, where practitioners must weigh trade-offs between accuracy, model size, and generalization. In practice, the critical question is not “who wins the leaderboard” but “what changes, and why”: identifying which inputs flip, what skills improve, and where brittle behaviors persist.
*This work was done during Qihao Liu’s internship at Google.
[Figure 1: an auditor model produces a report (“The target model is strong at …, but it still has major weaknesses, such as understanding object relationships, …. To address these weaknesses, you can finetune on these (~1.1M) examples.”) alongside three generated VQA pairs on which models diverge, e.g., Q: “What connects the motorcycle to the aircraft?” with answers “ladder” vs. “air”.]
Figure 1. Overview of AuditDM. We propose to train an auditor model to systematically discover capability gaps in an MLLM by generating failure-inducing question–image pairs. We show three automatically generated examples of weaknesses in object relationships. The proposed framework offers diagnostic insight and enables targeted rectification via auditor-guided feedback.
While common benchmarks [3, 29, 53, 81] are the de facto standard for model comparison, they fall short in two key respects for answering the above questions. First, closed-set evaluations are bounded by a fixed knowledge scope and inevitably leave blind spots. As a result, comparisons based on closed sets can be inherently selective and biased. Second, benchmarks tend to compress complex behavior into sparse scores, thus obscuring heterogeneous shifts across data slices, whereas the most significant capability gaps are often entangled and concentrated in the long tail. Prior work proposed human online testing [67], yet it is prohibitively expensive and time-consuming to scale. To bridge this gap, we introduce model auditing, an automatic evaluation paradigm designed to uncover and interpret, online, hidden divergence patterns missed by closed sets.
[Figure 2: two bar charts of average benchmark scores. (a) Improving PaliGemma2: 3B 80.04, 3B+Ours 84.56, 10B 84.16, 28B 83.65. (b) Improving Gemma3: 4B 58.83, 4B+Ours 64.47, 12B 63.96, 27B 66.69.]
Figure 2. Model improvement with AuditDM. We report average performance over all benchmarks per model (excluding MME due to its incompatible score scale). Once trained, AuditDM generates targeted, large-scale data points aligned with discovered weaknesses, training on which can produce consistent gains across diverse models and benchmarks.
An effective “auditor” must go beyond simple detection: it should systematically discover capability gaps, summarize interpretable weaknesses, and provide feedback to guide rectification and model improvement.
To this end, we propose to Audit the Differences that Matter, and introduce AuditDM, an annotation-free framework that exploits cross-model divergence to discover failure modes in a target MLLM (Fig. 1). Within the context of VQA, rather than relying on human inspection [67], AuditDM trains an MLLM auditor that generates questions …
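Since the discovered exemplars double as annotation-free training data, a natural companion sketch is the rectification data loop: sample pairs from the trained auditor, keep only high-disagreement cases, and attach a pseudo-label. Everything here beyond that outline is an assumption: `auditor.generate()`, the reuse of `disagreement_reward` from the sketch above, the threshold, and majority voting as the labeling rule (the paper states only that the data is annotation-free).

```python
from collections import Counter

def collect_rectification_data(auditor, answer_fns, num_examples, threshold=0.3):
    """Gather annotation-free fine-tuning examples from high-disagreement audits.

    `auditor.generate()` is a hypothetical interface returning an (image, question)
    pair; `disagreement_reward` is the sketch defined after the Abstract.
    """
    dataset = []
    while len(dataset) < num_examples:
        image, question = auditor.generate()
        if disagreement_reward(answer_fns, image, question) < threshold:
            continue  # keep only pairs on which the target models diverge
        answers = [fn(image, question).strip().lower() for fn in answer_fns]
        pseudo_label = Counter(answers).most_common(1)[0][0]  # majority vote (our assumption)
        dataset.append({"image": image, "question": question, "answer": pseudo_label})
    return dataset
```

One caveat of the majority-vote choice: on exactly the examples where models disagree most, the majority answer may itself be wrong, so a stronger reference model or human spot-checks could be substituted as the labeler.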