More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model’s self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision. Across diverse multimodal benchmarks, PeRL-VL improves average Pass@1 accuracy from 63.3% (base Qwen2.5-VL-7B) to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.


💡 Research Summary

The paper tackles two persistent failure modes of vision‑language models (VLMs) trained with reinforcement learning from verifiable rewards (RLVR): inaccurate visual extraction (hallucinations or missed details) and logically inconsistent chain‑of‑thought (CoT) despite a correct final answer. The authors argue that these problems stem from the outcome‑only nature of standard RLVR, which rewards only the final answer and thus permits “reward hacking” where the model arrives at the right answer for the wrong reasons.

To address this, they propose PeRL‑VL (Perception and Reasoning Learning for Vision‑Language Models), a decoupled framework that separates the VLM’s inference into a perception stage and a reasoning stage.

Perception stage: The model must generate a detailed image description inside a <description> tag. This description is evaluated by a separate VLM‑based reward model (e.g., GPT‑4o) that checks two criteria: (1) faithfulness to the visual content and (2) sufficiency for solving the downstream question. The resulting binary description reward (r_desc) is combined with the usual format reward (r_fmt) and answer reward (r_ans).
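The mapping from a judge verdict to the binary description reward can be sketched as follows. This is a minimal illustration, not the paper's implementation: the judge VLM call is stubbed out, and the JSON verdict schema (`faithful`/`sufficient` fields) is our own assumption about how the two criteria might be returned.

```python
import json

# Hypothetical judge prompt; the paper uses a VLM (e.g., GPT-4o) as the
# reward model but does not publish the exact prompt or output schema.
JUDGE_PROMPT = (
    "Given the image and the question, grade the candidate description.\n"
    'Reply with JSON: {"faithful": true|false, "sufficient": true|false}'
)

def description_reward(judge_verdict: str) -> float:
    """Map a judge VLM's JSON verdict to the binary reward r_desc.

    The reward is 1 only if the description is both faithful to the
    visual content and sufficient to answer the question; otherwise 0.
    """
    try:
        verdict = json.loads(judge_verdict)
    except json.JSONDecodeError:
        return 0.0  # an unparseable judge reply counts as failure
    return 1.0 if verdict.get("faithful") and verdict.get("sufficient") else 0.0

# A faithful but insufficient description earns no reward:
print(description_reward('{"faithful": true, "sufficient": false}'))  # 0.0
print(description_reward('{"faithful": true, "sufficient": true}'))   # 1.0
```

In practice the verdict string would come from an API call to the judge VLM; only the verdict-to-reward mapping is shown here.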

Reasoning stage: Independently of vision, the model is fine‑tuned on a high‑quality, logic‑rich text‑only CoT dataset (OpenThought, distilled from a strong reasoning LLM). This text‑only Reasoning SFT equips the model with a coherent logical backbone, improving the quality of the <think> segment regardless of visual input.

The authors explore two ways to combine the rewards:

  1. Aggregated rewards – a weighted sum of r_fmt, r_desc, and r_ans. This provides dense supervision but still allows a correct answer to be rewarded even when the description is wrong.

  2. Conditional rewards – a gated formulation where the answer reward is granted only if the description reward is also positive. Formally, r = α_fmt·r_fmt + α_ans·γ·r_ans + (1‑γ)·(r_ans·r_desc). Setting γ = 0 yields a hard gate (answer rewarded only when description is correct); γ = 0.5 gives a softer gate. Experiments show that the hard‑gate conditional design dramatically reduces “false‑positive” rollouts and yields the best overall performance.
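The two reward compositions above can be written as short functions. The formulas follow the text; the specific weight values (α_fmt, α_desc, α_ans) are placeholder assumptions, since the paper's summary does not state them.

```python
def aggregated_reward(r_fmt, r_desc, r_ans,
                      a_fmt=0.1, a_desc=0.3, a_ans=0.6):
    # Weighted sum: dense supervision, but note that r_ans is still
    # paid out even when the description is wrong (r_desc = 0).
    return a_fmt * r_fmt + a_desc * r_desc + a_ans * r_ans

def conditional_reward(r_fmt, r_desc, r_ans,
                       gamma=0.0, a_fmt=0.1, a_ans=0.9):
    # Gated formulation: r = a_fmt*r_fmt + a_ans*gamma*r_ans
    #                        + (1 - gamma)*(r_ans * r_desc).
    # With gamma = 0 (hard gate), the answer reward flows only when
    # the description reward is also positive.
    return (a_fmt * r_fmt
            + a_ans * gamma * r_ans
            + (1.0 - gamma) * (r_ans * r_desc))

# Correct answer but wrong description (r_desc = 0):
# the aggregated scheme still rewards it; the hard gate does not.
print(aggregated_reward(1, 0, 1))   # ~0.7
print(conditional_reward(1, 0, 1))  # 0.1 (format reward only)
```

The contrast in the example is the paper's core point: under the hard gate, a "right answer for the wrong reason" rollout earns essentially nothing beyond the format reward.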

Training proceeds sequentially: first the text‑only Reasoning SFT, then the perception‑focused RL using Group‑Relative Policy Optimization (GRPO) to stabilize gradients. The output format is explicitly structured as <description><think><answer>, enabling straightforward external evaluation of each component.
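The structured `<description><think><answer>` format makes each segment individually extractable, e.g. with a simple tag splitter. The tag names follow the paper; the regex-based parser and the example rollout text below are illustrative assumptions.

```python
import re

# Example rollout in the paper's structured output format
# (the content itself is made up for illustration).
ROLLOUT = (
    "<description>A bar chart with three bars labeled 2019-2021.</description>"
    "<think>The tallest bar corresponds to 2021, so that is the peak year."
    "</think>"
    "<answer>2021</answer>"
)

def split_rollout(text: str) -> dict:
    """Extract each tagged segment so it can be rewarded separately."""
    parts = {}
    for tag in ("description", "think", "answer"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        parts[tag] = m.group(1).strip() if m else None
    return parts

print(split_rollout(ROLLOUT)["answer"])  # prints 2021
```

Each extracted segment then feeds its own reward: the description goes to the judge VLM (r_desc), the answer to the verifiable checker (r_ans), and the presence of all three tags to the format reward (r_fmt).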

Empirical results: Using Qwen2.5‑VL‑7B as the base model, PeRL‑VL raises average Pass@1 from 63.3% (base model) to 68.8%, outperforming three strong baselines: (a) standard RLVR with only answer/format rewards, (b) text‑only reasoning SFT, and (c) naive multimodal distillation from GPT‑4o. Detailed analysis reveals a 12% improvement in visual description accuracy and a 9% boost in logical‑consistency metrics. The conditional reward design cuts the proportion of rollouts that reach the correct answer for the wrong reason from 35% down to 12%.

Strengths and contributions:

  • Introduces a description reward that directly supervises visual extraction, moving beyond proxy signals like CLIP scores.
  • Demonstrates that decoupling perception and reasoning allows targeted improvements without interference.
  • Provides a systematic study of reward composition, showing that conditional gating is crucial for mitigating reward hacking.
  • Shows that a modest amount of text‑only CoT data can substantially raise logical coherence, even for a multimodal model.

Limitations: The description reward is binary, which may ignore nuanced quality differences; the VLM used as the reward model is itself imperfect and can introduce noise. Scaling to larger VLMs and richer reward signals remains an open question.

Future directions: The authors suggest (1) moving to continuous confidence scores and incorporating human feedback, (2) expanding multimodal CoT datasets across domains, (3) replacing the reward VLM with larger, more reliable models, and (4) exploring lightweight gating mechanisms for real‑time inference.

In summary, PeRL‑VL offers a principled, modular solution to the twin challenges of visual fidelity and logical consistency in VLMs. By rewarding faithful descriptions and strengthening text‑only reasoning, it achieves a significant performance jump and sets a new baseline for future multimodal reinforcement‑learning research.

