PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment
Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations: cases where models reach the right answer while misperceiving the visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that applies a hierarchical reward-fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.
💡 Research Summary
The paper “PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment” addresses a critical shortcoming in current reinforcement‑learning (RL) approaches for multimodal large language models (MLLMs). Existing reward designs focus almost exclusively on the correctness of the final answer, which allows models to arrive at the right answer while following a reasoning trajectory that is inconsistent with the visual evidence—a phenomenon the authors term “process hallucination.”
PaLMR proposes a two‑stage framework that aligns both the perception of visual inputs and the reasoning process itself.
Perception‑Aligned Data Layer (PaDLayer)
- Data collection: The authors start from the FineVision corpus, which spans multiple domains (geometry, charts, scientific VQA, etc.), uniformly sampling 1,500 instances per sub‑domain to obtain a balanced raw pool.
- Learnability‑based filtering: Each candidate is run through multiple stochastic rollouts using the same RL policy that will later be trained. Samples that consistently produce incorrect or unstable responses are discarded, as are overly easy items that yield near‑perfect accuracy. This filtering yields about 4.7 K high‑quality instances that exhibit clear visual‑semantic alignment.
- Pseudo‑ground‑truth generation: Using Gemini (a strong LLM), the authors generate detailed, structured captions for each image. The captions enumerate objects, colors, spatial relations, and other visual attributes, thereby providing a symbolic, verifiable representation of the visual scene.
- Reference chain‑of‑thought creation: A Best‑of‑N strategy iteratively refines sampled reasoning traces, selecting the most coherent and fact‑consistent trajectory as the reference. This reference serves both as a target for process alignment and as a source of visual facts for scoring.
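The filtering and reference-selection steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the rollout count, the accuracy thresholds, the `rollout` callable, and the `score` heuristic are all assumptions introduced here.

```python
def learnability_filter(samples, rollout, n_rollouts=8, low=0.125, high=0.875):
    """Keep samples whose rollout accuracy is neither near-zero nor near-perfect.

    `rollout(sample)` is assumed to return True iff one stochastic rollout of
    the current policy answers the sample correctly. Items the policy almost
    never or almost always solves carry little learning signal and are dropped.
    """
    kept = []
    for s in samples:
        acc = sum(rollout(s) for _ in range(n_rollouts)) / n_rollouts
        if low <= acc <= high:  # discard too-hard and too-easy items
            kept.append(s)
    return kept


def best_of_n_reference(traces, score):
    """Best-of-N selection: keep the highest-scoring reasoning trace.

    `score` stands in for the paper's coherence / fact-consistency check.
    """
    return max(traces, key=score)
```

In practice the `rollout` callable would wrap sampling from the same policy that is later trained with RL, so the difficulty estimate tracks the model that will actually consume the data.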
Process‑Aligned Optimization Layer (PaOLayer)
- Vision‑Guided Group‑Relative Policy Optimization (V‑GRPO): The classic GRPO algorithm computes a relative advantage for each response within a generated group. PaLMR augments the reward function with a perception‑aware score that evaluates the visual fidelity of the model’s chain‑of‑thought.
- Perception‑aware scoring: From the model’s … blocks, visual claims Z are extracted (e.g., “there are two cylinders”). These claims are compared against the structured pseudo‑GT, and each receives a binary consistency label (1 = consistent, 0 = inconsistent).
- Hierarchical reward fusion: The final reward is a weighted sum of three components: (i) the perception‑aware visual‑consistency score, (ii) the traditional final‑answer correctness score, and (iii) a format score that encourages proper use of CoT tags. By integrating these signals, the model is penalized heavily if any part of the reasoning deviates from the visual evidence, even when the final answer is correct.
- Stability benefits: The hierarchical fusion reduces gradient oscillations that arise when a single reward dominates, leading to more stable policy updates.
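The reward fusion and the group-relative advantage used by V‑GRPO can be sketched as below. The weights `w_vis`, `w_ans`, `w_fmt` and the function names are illustrative assumptions; the paper's exact formulation may differ.

```python
def fused_reward(visual_claims, answer_correct, well_formatted,
                 w_vis=0.5, w_ans=0.4, w_fmt=0.1):
    """Hierarchical reward: visual consistency + answer correctness + format.

    `visual_claims` is a list of binary consistency labels (1 = the extracted
    claim matches the structured pseudo-GT, 0 = it contradicts it). The weights
    here are illustrative, not values from the paper.
    """
    vis = sum(visual_claims) / len(visual_claims) if visual_claims else 1.0
    return w_vis * vis + w_ans * float(answer_correct) + w_fmt * float(well_formatted)


def group_relative_advantages(rewards):
    """GRPO-style advantage: standardise rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Note how a response that is correct but visually unfaithful (some labels 0) earns a strictly lower reward than one whose every claim matches the pseudo‑GT, which is the mechanism that penalizes "right answer, wrong reasoning."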
Experimental Evaluation
The authors fine‑tune Qwen2.5‑VL‑7B with PaLMR and evaluate on four benchmarks:
- HallusionBench (designed to measure reasoning hallucinations): PaLMR reduces hallucination rates by over 45 % relative to both the base model and a GRPO-only baseline, while maintaining comparable answer accuracy.
- MMMU, MathVista, and MathVerse (standard multimodal reasoning suites). PaLMR matches or slightly exceeds baseline performance (0.2–0.5 % improvement), demonstrating that enforcing visual fidelity does not sacrifice overall correctness.
Qualitative case studies show that PaLMR’s reasoning traces stay faithful to the visual evidence rather than merely arriving at the correct answer.
Limitations and Future Work
- The pseudo‑GT relies on an LLM (Gemini); errors in caption generation could propagate to the reward signal, especially for complex scenes with occlusions or 3D depth.
- The current perception‑aware score is binary, which may be too coarse to capture subtle inconsistencies; a continuous confidence‑based metric or human‑in‑the‑loop feedback could improve granularity.
- V‑GRPO introduces additional computational overhead due to pairwise visual consistency checks; scaling to massive datasets will require more efficient approximations.
Conclusion
PaLMR introduces a principled shift from outcome‑centric to process‑centric reinforcement learning for multimodal models. By constructing a perception‑aligned dataset and embedding visual consistency directly into the reward function, the framework dramatically curtails reasoning hallucinations while preserving (or modestly improving) standard accuracy metrics. This work paves the way for more trustworthy, interpretable multimodal AI systems that reason in a way that is faithful to what they actually see.