Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning


Authors: Feiding, Yongkang Zhang, Yuhao Liao, Zijian Zeng, Chunzheng Zhu, Yaozong Zheng, Yafei Liu, Yeling Peng, Youwei Wang, Sibo Wang, Huiming Yang, Linglin Liao, and Shunzhi Yang

Feiding, Yongkang Zhang, Yuhao Liao, Zijian Zeng, Chunzheng Zhu, Yaozong Zheng, Yafei Liu, Yeling Peng, Youwei Wang, Sibo Wang, Huiming Yang, Linglin Liao, and Shunzhi Yang

Shenzhen International Graduate School, Tsinghua University, China

Abstract. Vision–language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the linkage between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Difference Feedback, which automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotations, our method enables process-level visual alignment and can be seamlessly integrated into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low-cost solution for accurate vision–reasoning process alignment.

Keywords: vision–language models · reinforcement learning · process supervision

1 Introduction

Vision–Language Models (VLMs) have demonstrated impressive performance on visual question answering, cross-modal reasoning, and multi-turn dialogue by integrating visual encoders with large language model generation. To improve instruction following and alignment, [21,23] introduce reinforcement learning and preference optimization into VLM training, including GRPO [17], PPO [16], GSPO [24], and SAPO [8].
Despite impressive engineering progress, the training-signal design universally relies on outcome supervision: models receive a single global reward or preference judgment only after generation completes, without fine-grained feedback on intermediate reasoning steps or critical visual decisions.

Fig. 1. Difference Feedback (DF) provides fine-grained process supervision for VLM alignment. When the policy produces an incorrect trajectory, a small-edit repair is generated; the difference between the two outputs yields a token-level mask that gates gradient updates.

Outcome supervision creates significant credit-assignment difficulties in multimodal, multi-step reasoning. Different tokens, reasoning steps, and visual evidence choices contribute unequally to the final outcome; a single outcome signal amplifies policy-gradient variance and makes updates susceptible to sampling noise, manifesting as training instability, slow convergence, and wasteful gradient updates. In challenging samples involving fine-grained recognition, counting, spatial relationships, or cross-modal constraints, errors typically concentrate in a few critical fragments (e.g., a wrong attribute word, a misidentified region, or a logical jump).
Outcome signals cannot localize these critical steps, limiting stable improvement on long-tail hard cases. Concrete failure modes include inability to learn from hard examples and responses that ignore visual content in favor of language priors.

Process-level supervision is a natural remedy: training a Process Reward Model (PRM) to score intermediate steps, or combining test-time search with step scoring. However, explicit process supervision requires expensive step-level annotation that is difficult to scale, and test-time search incurs substantial computational overhead and is highly sensitive to evaluator quality. In multimodal tasks, process annotation further requires reliable visual-evidence verification, amplifying cost and noise. Thus, obtaining stable, effective process-level training signals without large-scale annotation or heavy search overhead remains a central challenge in VLM alignment.

We propose Difference Feedback (DF) to address this challenge (Fig. 1). Difference Feedback is not a specific optimizer, but rather a general mechanism for generating process-level supervision signals. When the policy model produces an incorrect trajectory, we train a multimodal repair model to generate a corrected answer while making relatively small edits to the original response. We then align the original answer with the repaired answer and automatically derive token-/step-level supervision masks from their differences, explicitly identifying which positions need to be corrected. This supervision signal can be plugged into a variety of alignment objectives, where the difference mask serves as token-level gradient gating to reduce ineffective updates. As a result, Difference Feedback preserves the scalability of training while providing more stable and interpretable local correction signals for long-horizon multimodal reasoning.
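The "repair → alignment → difference mask" idea above can be sketched in a few lines. The sketch below is a minimal illustration, not the paper's implementation: it assumes both trajectories are already tokenized into lists of strings, uses Python's difflib.SequenceMatcher as an LCS-style aligner in place of a full Levenshtein alignment, and the function names (difference_mask, gated_advantages) are hypothetical.

```python
from difflib import SequenceMatcher

def difference_mask(original_tokens, repaired_tokens):
    """Token-level supervision mask from an edit alignment of the original
    and repaired trajectories: positions of the original that survive in
    the repair (matching blocks) get 0; edited positions get 1."""
    mask = [1] * len(original_tokens)
    matcher = SequenceMatcher(a=original_tokens, b=repaired_tokens,
                              autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            mask[i] = 0  # token preserved by the repair: no correction needed
    return mask

def gated_advantages(advantages, mask, floor=0.0):
    """Gate per-token advantages with the difference mask so that updates
    concentrate on the positions marked for correction; `floor` optionally
    keeps a small residual weight on unmarked tokens."""
    return [a * (m if m else floor) for a, m in zip(advantages, mask)]

original = ["the", "cat", "is", "red", "and", "sleeping"]
repaired = ["the", "cat", "is", "black", "and", "sleeping"]
print(difference_mask(original, repaired))  # → [0, 0, 0, 1, 0, 0]
```

In a GRPO-style update, the resulting mask would multiply each token's advantage before the policy-gradient step, so that gradient mass is focused on the few fragments the repair actually changed rather than spread over the whole trajectory.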
We evaluate our approach on multiple multimodal reasoning and alignment benchmarks. Experimental results show that Difference Feedback consistently improves performance across different evaluation datasets. Overall, this work opens a new general pathway for VLM alignment: automatically constructing process-level supervision via output differences, leading to stronger multimodal alignment and reasoning capabilities.

Contributions. The main contributions of this work are summarized as follows:

– We introduce the Difference Feedback (DF) mechanism. Without modifying the neural architecture or the objective function, and without requiring large-scale step-by-step human annotations, DF provides a general method for generating process-level supervision. Through a "repair → alignment → difference mask" pipeline, the method automatically constructs token-/step-level correction signals and uses the difference mask to gate token-level updates, reducing the probability of gradients updating irrelevant tokens.

– Compatibility with diverse alignment objectives. As a plug-in module, Difference Feedback can be integrated into multiple optimization methods such as GRPO, DAPO, GSPO, and PPO, reducing ineffective updates and improving the training efficiency of multimodal models.

– Consistent improvements on multimodal reasoning and alignment tasks. Across multiple benchmarks, Difference Feedback yields consistent performance gains.

2 Related Work

Alignment and reinforcement learning (RL) fine-tuning typically rely on sequence-level outcome reward/preference signals, causing credit-assignment difficulties and high-variance updates in multi-step generation. Process supervision mitigates this but requires expensive step-level annotation and PRMs [12].
To avoid step-level labeling, one line of work constructs denser token-level signals from outcome feedback: [4] redistributes rewards using internal reward-model information; [11] decomposes contributions for reward redistribution; [3] allocates marginal contributions from a Shapley perspective. Another line learns token-wise signals from preference optimization, typified by DPO [14] and token-level variants [25,26,5,20]; [6] uses per-token generation probability as token-level reward. Unlike these methods that answer "how much credit should each token receive," our work asks "which tokens/steps are the critical bottlenecks causing errors." We localize erroneous positions by introducing lightly-edited repairs: rather than outputting token-level reward values, we generate token/step masks via edit alignment (Levenshtein/LCS) and inject them into GRPO/PPO/GSPO objectives, focusing negative advantage on segments requiring correction. Related work on minimal contrastive edits [15] demonstrates the value of minor edits for causal span localization. Self-refinement methods [13,10] focus on test-time iterative revision. Implicit process rewards [22] provide token-level reward allocation but face a language capability–reasoning trade-off that limits scalability. To our knowledge, we are the first to introduce process-level supervision into multimodal RL training, using repair differences to improve multimodal alignment under sparse outcome supervision.

3 Method

3.1 Notation and Problem Setup

We are given a multimodal input x = (I, q), where I denotes an image (or a sequence of video frames) and q denotes a textual instruction/question. The policy model (VLM) is denoted by π_θ, which autoregressively generates an output sequence y = (y_1, ..., y_T). We define the state at step t as s_t = (x, y
