DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
Recent advances in multimodal language models (MLLMs) have achieved remarkable progress in vision-language reasoning, especially with the emergence of “thinking with images,” which integrates explicit visual steps into the reasoning process. While this paradigm strengthens image-based reasoning, a significant challenge remains: models may arrive at correct answers by relying on irrelevant or spurious regions, driven by prior knowledge or dataset biases. Even when the answer is correct, flawed reasoning indicates that the model has not truly understood the image, highlighting the critical importance of reasoning fidelity in multimodal tasks. To address this issue, we propose DeFacto, a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning. A key component of our approach is the design of three complementary training paradigms: (i) positive, (ii) counterfactual, and (iii) random-masking. To enable these paradigms, we develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants, resulting in a dataset of about 100k images. Building on this framework, we train multimodal language models with GRPO-based reinforcement learning, where we design three complementary rewards to guide the model toward accurate answering and evidence-grounded reasoning. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness, establishing a stronger foundation for interpretable multimodal reasoning. The code is available on GitHub and the dataset is released on HuggingFace.
💡 Research Summary
The paper introduces DeFacto, a counterfactual reasoning framework designed to improve both answer correctness and evidence‑answer consistency in multimodal language models (MLLMs) that “think with images.” The authors observe that current “thinking with images” approaches, despite generating visual reasoning steps, often suffer from three failure modes: (i) mislocalized failure (wrong evidence, wrong answer), (ii) spurious correctness (wrong evidence, correct answer), and (iii) faithful incorrectness (correct evidence, wrong answer). These issues indicate that a model can arrive at a correct answer without truly grounding its reasoning in the visual content, which limits interpretability and reliability.
To address this, DeFacto proposes three complementary training paradigms: positive supervision, counterfactual abstention, and random masking. Positive samples retain the question‑relevant visual regions (R⁺) and reward the model for selecting those regions and producing the correct answer. Counterfactual samples mask R⁺, forcing the model to output a designated “unknown” token, thereby penalizing confident answers when the essential evidence is absent. Random‑masking samples hide irrelevant regions (R⁻) to prevent the model from exploiting mask patterns as shortcuts.
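The three variants differ only in which regions are occluded. A minimal sketch of how such triples could be constructed with NumPy is shown below; the function names and the choice of masking a random subset of R⁻ are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def mask_regions(image, boxes, fill=0):
    # Return a copy of `image` with each (x1, y1, x2, y2) box filled in.
    out = image.copy()
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] = fill
    return out

def make_variants(image, r_pos, r_neg, rng):
    """Build the three training variants described above (illustrative).

    positive:       evidence regions R+ left intact (original image)
    counterfactual: all evidence regions R+ masked -> target answer "unknown"
    random:         a random subset of irrelevant regions R- masked
    """
    if r_neg:
        k = int(rng.integers(1, len(r_neg) + 1))
        idx = rng.choice(len(r_neg), size=k, replace=False)
        sampled = [r_neg[i] for i in idx]
    else:
        sampled = []
    return {
        "positive": image,
        "counterfactual": mask_regions(image, r_pos),
        "random": mask_regions(image, sampled),
    }
```

Because the three versions share the same base image, any behavioral difference in the trained model is attributable solely to the presence or absence of the evidence.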
A key technical contribution is an automated pipeline that constructs these three sample types without human annotation. First, a large‑scale MLLM (Qwen2.5‑VL) parses the question and extracts textual descriptors (e.g., “red cup,” “text on shirt”). Candidate regions are generated by a Region Proposal Network (RPN) and OCR. Descriptors are matched to visual proposals using an open‑vocabulary detector (DINO‑X) for objects and OCR string matching for text, yielding the set of evidence regions R⁺. Remaining proposals become R⁻. Using this pipeline, the authors build DeFacto‑100K, a dataset of ~100k images each paired with positive, counterfactual, and random‑masking variants, ensuring that the only difference among the three versions is the presence or absence of the essential evidence.
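The matching step that partitions proposals into R⁺ and R⁻ can be sketched as follows. The detector and OCR outputs are stubbed as plain tuples, and the label/substring matching rules and score threshold are assumptions for illustration; the paper's pipeline uses DINO‑X and OCR string matching but does not specify these exact heuristics.

```python
def split_evidence(descriptors, proposals, ocr_items, score_thresh=0.3):
    """Partition candidate regions into evidence (R+) and the rest (R-).

    descriptors: textual descriptors parsed from the question, e.g. ["red cup"]
    proposals:   list of (box, label, score) from an open-vocabulary detector
    ocr_items:   list of (box, text) from OCR
    All formats and thresholds here are illustrative assumptions.
    """
    r_pos, r_neg = [], []
    wanted = {d.lower() for d in descriptors}
    # Object descriptors: match detector labels above a confidence threshold.
    for box, label, score in proposals:
        if label.lower() in wanted and score >= score_thresh:
            r_pos.append(box)
        else:
            r_neg.append(box)
    # Text descriptors: match OCR strings by substring overlap.
    for box, text in ocr_items:
        if any(d in text.lower() or text.lower() in d for d in wanted):
            r_pos.append(box)
        else:
            r_neg.append(box)
    return r_pos, r_neg
```

Everything matched to a descriptor lands in R⁺; every remaining proposal becomes R⁻ and is eligible for random masking.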
Training proceeds with GRPO‑based reinforcement learning (Group Relative Policy Optimization). The reward function combines three components: (1) answer correctness, (2) region selection fidelity (how well the predicted bounding boxes match R⁺), and (3) evidence‑answer consistency (penalizing mismatches between selected evidence and the final answer, and rewarding appropriate “unknown” outputs in counterfactual cases). This multi‑objective optimization forces the model to align its reasoning trace (the selected evidence regions) with its final answer.
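A simplified version of such a three-part reward can be written down directly. The specific weights, the exact-match answer check, and the IoU-based region score are assumptions chosen for clarity; the paper's reward terms may be shaped differently.

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def reward(pred_answer, gold_answer, pred_boxes, r_pos, is_counterfactual,
           w_ans=1.0, w_reg=0.5, w_cons=0.5):
    # (1) answer correctness; counterfactual samples expect "unknown"
    target = "unknown" if is_counterfactual else gold_answer
    r_ans = float(pred_answer.strip().lower() == target.strip().lower())
    # (2) region-selection fidelity: best-match IoU for each gold evidence box
    if is_counterfactual or not r_pos:
        r_reg = 0.0
    else:
        r_reg = sum(max((iou(g, p) for p in pred_boxes), default=0.0)
                    for g in r_pos) / len(r_pos)
    # (3) evidence-answer consistency: abstain iff the evidence is absent
    abstained = pred_answer.strip().lower() == "unknown"
    r_cons = float(abstained == is_counterfactual)
    return w_ans * r_ans + w_reg * r_reg + w_cons * r_cons
```

Under GRPO, a group of sampled responses per prompt would be scored with a reward of this shape and the group-normalized scores used as advantages; only the scalar reward is sketched here.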
Experiments span several VQA benchmarks (VQA, OK‑VQA, GQA) and a newly curated human‑annotated benchmark, DeFacto‑1.5K, which provides ground‑truth evidence annotations for 1,500 samples. Compared to strong baselines such as DeepEyes and GRIT, DeFacto improves overall answer accuracy by 2–4 percentage points and boosts evidence‑answer consistency metrics by 10–15 points. Crucially, the rates of spurious correctness and faithful incorrectness drop dramatically, demonstrating that the model is less likely to rely on spurious correlations or to ignore correct visual cues. Human evaluation confirms that the bounding boxes selected by DeFacto align well with annotators’ expectations, and that the model appropriately emits “unknown” when evidence is masked.
The paper’s contributions are fourfold: (1) a counterfactual “thinking with images” framework that jointly optimizes answer correctness and region‑level faithfulness, (2) an automated language‑guided pipeline for constructing a large‑scale counterfactual dataset, (3) a new human‑verified benchmark for systematic evaluation of evidence grounding, and (4) extensive empirical validation showing consistent gains across accuracy and faithfulness. The authors argue that this approach moves multimodal reasoning beyond mere performance toward trustworthy, evidence‑driven inference, and they suggest future extensions to richer visual modalities (segmentation masks, depth maps) and multi‑turn dialogue settings.