SR‑MCR: A Step-wise Reasoning Alignment Framework Using Self-Referential Signals

Reading time: 5 minutes

📝 Abstract

Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues (semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency) are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.

📄 Content

Multimodal reasoning must deliver not only correct answers but also explanations that are coherent and visually grounded.

Yet recent MLLMs (e.g., LLaVA [32], Qwen-VL [2]) often drift in their intermediate steps: contradicting themselves, hallucinating evidence [28,43,62,70,73,79], or repeating trivial content.

Existing alignment pipelines, including instruction tuning [50,61,71] and preference fine-tuning (DPO [38,63], RLHF [36,44,72]), rely on costly human-labeled rewards or external evaluators [69,77]. While effective, they remain brittle under domain shift and largely outcome-centric, supervising answers rather than the reasoning process [7,29,67,68].

This outcome-centric focus leaves two issues unresolved: (i) Lack of intrinsic process reward: supervising only final answers leaves step coherence and visual grounding underconstrained; (ii) Cross-domain instability: single reward proxies (e.g., lexical overlap) miscalibrate signals across heterogeneous tasks, causing over-alignment to spurious patterns.

At the same time, a single forward pass of an MLLM already exposes multiple measurable signals that correlate with “good” reasoning: semantic alignment, lexical fidelity, non-redundancy, grounding to visual evidence, and local step consistency [60,65,66]. These signals are inexpensive to compute, task-agnostic, and complementary in nature. Rather than relying solely on external preference labels, we ask: can we turn these process-level signals into a practical, intrinsic self-reward for multimodal reasoning? We posit that, once normalized and reliability-weighted, such signals can serve as a unified self-reward for both training and diagnosis [57,64,74].

Based on this observation, we develop Self-Rewarded Multimodal Coherent Reasoning (SR-MCR), a unified framework that instantiates process-aware, label-free alignment for MLLMs. Given an image I, textual input x, and model outputs (ŷ_a, ŷ_t) (final answer and reasoning trace), SR-MCR computes a normalized self-reward

R(I, x, ŷ_a, ŷ_t) = Σ_{k ∈ {sem, lex, nr, vis, step}} λ_k s_k,   λ_k = exp(α Relia_k) / Σ_j exp(α Relia_j).

Here s_k ∈ [0, 1] are min-max normalized scores for semantic similarity, lexical overlap, non-redundancy, visual grounding, and step-wise coherence. The term Relia_k estimates each signal’s reliability (e.g., held-out correlation or inverse variance), producing adaptive weights λ_k that attenuate noisy or unstable signals and amplify consistent ones.
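A minimal sketch of how this reliability-weighted aggregation could be computed. The softmax over reliabilities and the weighted sum follow the formula above; the concrete score and reliability values are illustrative assumptions, not the paper's actual estimators:

```python
import math

def reliability_weights(relia, alpha=1.0):
    """Softmax over per-signal reliability estimates Relia_k -> weights lambda_k."""
    exps = {k: math.exp(alpha * v) for k, v in relia.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def self_reward(scores, relia, alpha=1.0):
    """Reliability-weighted sum of normalized process signals s_k in [0, 1]."""
    lam = reliability_weights(relia, alpha)
    return sum(lam[k] * scores[k] for k in scores)

# Hypothetical per-sample signal scores (already min-max normalized to [0, 1]).
scores = {"sem": 0.80, "lex": 0.60, "nr": 0.90, "vis": 0.70, "step": 0.75}
# Hypothetical reliability estimates, e.g. held-out correlation per signal.
relia = {"sem": 0.60, "lex": 0.20, "nr": 0.40, "vis": 0.50, "step": 0.30}

reward = self_reward(scores, relia)
assert 0.0 <= reward <= 1.0  # convex combination of scores stays in [0, 1]
```

Because the weights form a convex combination, the aggregate reward inherits the [0, 1] range of the individual signals, and a noisy signal with low estimated reliability contributes proportionally less.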

Optimization. Instead of constructing human preference pairs, we directly treat this self-reward as the optimization target and fine-tune the model with a self-rewarded GRPO [42] objective. The policy is updated to increase the likelihood of high-reward generations while constraining drift from the base model via a KL term [14,40,41,55] (Sec. 3, Eq. 9). To further stabilize training under noisy self-rewards, we introduce a dynamic cooling weight (Eq. 8) that down-weights trivial or overconfident samples based on their normalized negative log-likelihood [27,52]. Together, these components yield a simple, label-free, and process-aware alignment recipe that can be applied across diverse visual domains.

Contributions. This work is positioned as a practical, process-centric alignment framework rather than a new RL algorithm. Our main contributions are:

• We identify a key gap in multimodal alignment: the lack of an intrinsic, process-aware reward for coherent and visually grounded reasoning.

• We introduce SR-MCR, which fuses five self-signals into a reliability-weighted reward and optimizes MLLMs via GRPO with a simple cooling scheme.

• We provide a lightweight, reproducible recipe that improves accuracy and reasoning coherence over Qwen2.5-VL baselines, with ablations clarifying each component’s role.

Vision-Language Models. Large Vision-Language Models (VLMs) couple visual encoders (e.g., CLIP [3,12,13,37], ViT [11]) with LLMs for multimodal understanding. CLIP benefits from large-scale contrastive pretraining that yields strong zero-shot transfer and alignment between images and text, but it can struggle with compositional queries, and inherits biases present in its training data.

LLaVA demonstrated that visual instruction-following can be achieved with a lightweight projector and a two-stage pipeline. Subsequent models such as Qwen-VL [75] improved perception and multilinguality via larger data and refined architectures, while GPT-4V [35] advanced visual reasoning. InstructBLIP [10] further enhances cross-modal fusion through Q-Former [24]. These works establish the foundation of modern multimodal reasoning.

Alignment Methods for VLMs. Early alignment of vision-language models typically begins with Supervised Fine-Tuning (SFT) on curated multimodal datasets [25,26,31,49]. While SFT imparts task formats and basic instruction following, it struggles to model nuanced human preferences [9] and often overfits to annotator distributions.

RLHF extends SFT by training on preference data with a learned reward model.

This content is AI-processed based on ArXiv data.
