📝 Original Info
- Title: Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
- ArXiv ID: 2512.16921
- Date: 2025-12-18
- Authors: Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu
📝 Abstract
Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
📄 Full Content
Differences That Matter:
Auditing Models for Capability Gap Discovery and Rectification
Qihao Liu1,2*, Chengzhi Mao1, Yaojie Liu1, Alan Yuille2, Wen-Sheng Chu1
1Google  2Johns Hopkins University
https://auditdm.github.io/
Abstract
Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
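To make the disagreement objective concrete, here is a minimal Python sketch of how a per-example auditor reward could be computed: the auditor proposes an (image, question) pair, each target model answers, and the reward grows with the fraction of models deviating from the majority answer. The model interface (`answer_fns` callables), the string normalization, and this particular majority-based metric are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter
from typing import Callable, List

def disagreement_reward(
    answer_fns: List[Callable[[bytes, str], str]],  # one callable per target MLLM (assumed interface)
    image: bytes,
    question: str,
) -> float:
    """Reward in [0, 1]: 0 when all target models agree, higher when answers diverge."""
    answers = [fn(image, question).strip().lower() for fn in answer_fns]
    majority_count = Counter(answers).most_common(1)[0][1]
    # Fraction of target models whose answer deviates from the majority answer.
    return 1.0 - majority_count / len(answers)

# Toy usage with three stand-in "models", echoing the Fig. 1 example:
fns = [lambda img, q: "ladder", lambda img, q: "air", lambda img, q: "ladder"]
print(disagreement_reward(fns, b"", "What connects the motorcycle to the aircraft?"))  # 0.333...
```

A scalar reward of this shape is what an RL fine-tuning loop (e.g., a policy-gradient update on the auditor) could maximize; the paper does not commit to this exact metric here.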
1. Introduction
The rapid evolution of Multimodal Large Language Models (MLLMs) has led to a surge in high-performing models [17, 21, 30, 75]. However, despite steady improvements on public benchmarks, selecting the proper model for real-world deployment remains challenging, as conventional evaluations often obscure how models truly differ. This is particularly evident when models are retrained on new data: although retraining may improve targeted knowledge, its impact on broader capabilities remains unclear. Similar concerns arise with task-specific fine-tuning and edge deployment, where practitioners must weigh trade-offs between accuracy, model size, and generalization. In practice, the critical question is not “who wins the leaderboard” but “what changes, and why”: identifying which inputs flip, what skills improve, and where brittle behaviors persist.
*This work was done during Qihao Liu’s internship at Google.
[Figure 1: an auditor model produces a report (“The target model is strong at …, but it still has major weaknesses, such as understanding object relationships, …. To address these weaknesses, you can finetune on these (~1.1M) examples.”) alongside three generated VQA pairs on which models diverge, e.g., Q: “What connects the motorcycle to the aircraft?” with answers “ladder” vs. “air”.]
Figure 1. Overview of AuditDM. We propose to train an auditor model to systematically discover capability gaps in an MLLM by generating failure-inducing question–image pairs. We show three automatically generated examples of weaknesses in object relationships. The proposed framework offers diagnostic insight and enables targeted rectification via auditor-guided feedback.
While common benchmarks [3, 29, 53, 81] are the de facto standard for model comparison, they fall short in two key respects for answering the above questions. First, closed-set evaluations are bounded by a fixed knowledge scope and inevitably leave blind spots. As a result, comparisons based on closed sets can be inherently selective and biased. Second, benchmarks tend to compress complex behavior into sparse scores, thus obscuring heterogeneous shifts across data slices, whereas the most significant capability gaps are often entangled and concentrated in the long tail. Prior work proposed human online testing [67], yet it is prohibitively expensive and time-consuming to scale. To bridge this gap, we introduce model auditing, an automatic evaluation paradigm designed to uncover and interpret, online, hidden divergence patterns missed by closed sets.
[Figure 2: two bar charts of average benchmark scores. (a) Improving PaliGemma2: 3B 80.04, 3B+Ours 84.56, 10B 84.16, 28B 83.65. (b) Improving Gemma3: 4B 58.83, 4B+Ours 64.47, 12B 63.96, 27B 66.69.]
Figure 2. Model improvement with AuditDM. We report average performance over all benchmarks per model (excluding MME due to its incompatible score scale). Once trained, AuditDM generates targeted, large-scale data points aligned with discovered weaknesses, training on which can produce consistent gains across diverse models and benchmarks.
An effective “auditor” must go beyond simple detection: it should systematically discover capability gaps, summarize interpretable weaknesses, and provide feedback to guide rectification and model improvement.
To this end, we propose to Audit the Differences that Matter, and introduce AuditDM, an annotation-free framework that exploits cross-model divergence to discover failure modes in a target MLLM (Fig. 1). Within the context of VQA, rather than relying on human inspection [67], AuditDM trains an MLLM auditor that generates questions …
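Since the discovered exemplars double as annotation-free training data, a natural companion sketch is the rectification data loop: sample pairs from the trained auditor, keep only high-disagreement cases, and attach a pseudo-label. Everything here beyond that outline is an assumption: `auditor.generate()`, the reuse of `disagreement_reward` from the sketch above, the threshold, and majority voting as the labeling rule (the paper states only that the data is annotation-free).

```python
from collections import Counter

def collect_rectification_data(auditor, answer_fns, num_examples, threshold=0.3):
    """Gather annotation-free fine-tuning examples from high-disagreement audits.

    `auditor.generate()` is a hypothetical interface returning an (image, question)
    pair; `disagreement_reward` is the sketch defined after the Abstract.
    """
    dataset = []
    while len(dataset) < num_examples:
        image, question = auditor.generate()
        if disagreement_reward(answer_fns, image, question) < threshold:
            continue  # keep only pairs on which the target models diverge
        answers = [fn(image, question).strip().lower() for fn in answer_fns]
        pseudo_label = Counter(answers).most_common(1)[0][0]  # majority vote (our assumption)
        dataset.append({"image": image, "question": question, "answer": pseudo_label})
    return dataset
```

One caveat of the majority-vote choice: on exactly the examples where models disagree most, the majority answer may itself be wrong, so a stronger reference model or human spot-checks could be substituted as the labeler.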