Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models
Junxin Wang♠*1,2, Dai Guan♠1, Weijie Qiu3, Zhihang Li†1, Yongbo Gai1, Zhengyi Yang2, Mengyu Zhou1, Erchao Zhao1, Xiaoxi Jiang1, and Guanjun Jiang1

1 Qwen Large Model Application Team, Alibaba; 2 Institute of Automation, Chinese Academy of Sciences; 3 Beijing University of Posts and Telecommunications
♠ Equal contribution. * Work done during an internship at Alibaba. † Corresponding author.

Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling, yet they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier's own misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. Specifically, the policy is prompted to produce a step-wise visual checklist that makes its required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high, decoupling perceptual uncertainty from logical evaluation without per-step tool calls.
Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. The code has been open-sourced at https://github.com/Qwen-Applications/EVPV-PRM.

1. Introduction

Multimodal mathematical reasoning requires models to jointly solve two tightly coupled but failure-prone subproblems: visual perception (reading diagrams, extracting quantities from tables, OCR, and geometric relations) and symbolic reasoning (logical derivation and computation). While contemporary multimodal LLMs can produce fluent multi-step solutions, their correctness is frequently bottlenecked by grounding: a single perceptual mistake may redirect the entire derivation while keeping later steps locally coherent. This makes process-level verification and selection, not only final-answer checking, central to robust deployment, especially under test-time scaling regimes such as Best-of-N and search-based decoding (Zheng et al., 2025; Ma et al., 2023; Zhang et al., 2024a).

Process reward models (PRMs) operationalize process supervision by assigning step-wise scores to reasoning traces, and they are widely used for Best-of-N reranking, guided search, and post-training (Zheng et al., 2025; Ma et al., 2023; Zhang et al., 2024a). In the vision-language setting, dedicated PRMs and benchmarks such as VisualPRM and VisualProcessBench have shown that step-aware critics can improve multimodal reasoning under test-time scaling (Wang et al., 2025b), and data-efficient recipes further lower the cost of training such verifiers (Wang et al., 2025a). These advances have been instrumental in unlocking the latent capability of strong open multimodal policies (Zhu et al., 2025). Yet, when deployed in the wild, current vision-language PRMs still behave like black-box judges: a low score on a step is hard to interpret. Did the step fail logically, or did the verifier itself misperceive the image? Similar reliability concerns, e.g., overconfidence and uncertainty miscalibration in step-wise judgments, have also been noted for PRMs more broadly (Ye et al., 2025; Park et al., 2025).

Figure 1 | EVPV: premise-aware process reward modeling for reliable multimodal reasoning. (A) Motivating failure case. A standard VL-PRM (VisualPRM) can reward a locally fluent step that relies on a hallucinated visual premise (e.g., "a cylindrical hole"). EVPV prompts the policy to state an explicit visual checklist, verifies it against independently extracted structured visual constraints, and gates step rewards when the premise is unreliable. (B) Where step errors come from. On VisualProcessBench, most step errors stem from visual misinterpretation (left); these errors are dominated by structural misunderstandings and value misreadings (right), motivating explicit premise verification. (C) Step-level verification. EVPV-PRM achieves higher overall Macro-F1 on VisualProcessBench than prior multimodal PRMs. (D) Deployable test-time gains. Under Best-of-8 reranking for InternVL2.5 policies, EVPV-PRM yields consistent BoN@8 improvements Δ8 = BoN@8 − Pass@1 across model scales, indicating more reliable selection of grounded solutions under test-time scaling.
This ambiguity is not merely a diagnostic inconvenience; it is a systematic source of verification error. If the PRM's own visual grounding is unreliable, it can assign low scores to correct visual descriptions (false negatives) or high scores to hallucinated ones (false positives), undermining both reranking and error localization. Figure 1 illustrates this failure mode: VisualPRM rewards a locally fluent step that assumes a nonexistent "cylindrical hole," whereas EVPV makes the visual premise explicit, verifies it against structured visual constraints, and gates the step reward when the premise is not supported. The error breakdown in Figure 1 further shows that visual misinterpretation dominates step errors on VisualProcessBench (Wang et al., 2025b). More generally, recent audits have shown that PRM signals can be sensitive to semantic perturbations and may reward fluent but unsupported content under distribution shift (Cheng et al., 2025; Ye et al., 2025).

These observations motivate our core hypothesis: perceptual correctness is a prerequisite for meaningful logical evaluation. A step that is built on an incorrect visual premise is wrong regardless of how impeccable the subsequent algebra may be. Consequently, a verifier that directly predicts step correctness without explicitly validating the underlying visual premise is forced to entangle two error sources, perception and reasoning, and will remain brittle under early catastrophic misreads. Tool-integrated verification offers one principled path by independently querying the image to reduce confirmation bias (Kuang et al., 2025), but step-wise tool calls can be prohibitively expensive when scoring long traces at Best-of-N scale (Ma et al., 2023; Zhang et al., 2024a).

We therefore introduce Explicit Visual Premise Verification (EVPV) as a lightweight mechanism that makes a PRM "qualified" to judge reasoning steps. The policy is prompted to provide a visual checklist: the explicit visual premises that each step relies on. In parallel, we extract structured visual facts from the image into a constraint set (numeric readings, geometric relations, and compositional structure). EVPV first verifies whether the checklist is supported by these visual facts, producing a reliability signal; only when the visual premise is deemed reliable do we enforce strict logical scoring. Concretely, we calibrate step rewards by gating visually dependent steps with the estimated visual reliability, attenuating rewards toward neutrality when the premise is unreliable and preserving them when it is well supported. This decouples visual understanding from step judgment, reduces false positives/negatives caused by verifier-side misperception, and yields more stable reranking gains. As previewed in Figure 1(C), this premise-aware calibration improves step-level verification performance on VisualProcessBench.

We evaluate EVPV on VisualProcessBench and multiple multimodal reasoning benchmarks under Best-of-N reranking. Our method achieves higher step-level verification performance and more deployable reranking improvements than strong multimodal PRM baselines (Wang et al., 2025b,a), while avoiding the heavy cost of step-wise tool invocation (Kuang et al., 2025). Figure 1(D) further shows that these gains translate into consistent BoN@8 improvements across InternVL2.5 policy scales, indicating more reliable selection under test-time scaling. Moreover, controlled corruption of extracted constraints yields a monotonic performance degradation curve, supporting that the gains arise from improved visual premise verification rather than incidental prompt effects.

2. Related Work

Process reward models. Process reward models (PRMs) provide step-level supervision and have become a core mechanism for test-time scaling (e.g., Best-of-N reranking), guided decoding, and post-training of reasoning models (Zheng et al., 2025; Ma et al., 2023; Zhang et al., 2024a). Beyond standard discriminative PRMs that directly score steps, recent work has explored verifiers that think before judging: R-PRM generates explicit analyses to improve step discrimination and stability (She et al., 2025), and GenPRM treats verification as a generative reasoning procedure that can itself be scaled at inference time (Zhao et al., 2025). Related reasoning-centric reward modeling further encourages explicit deliberation, including reward models that generate long-form rationales before producing preferences (Guo et al., 2025) and process reward models that think via generative verification (Khalifa et al., 2025; Jia et al., 2025). Other lines improve PRM learning objectives and usage: DG-PRM introduces dynamic, multi-criteria reward allocation and multi-objective optimization (Yin et al., 2025), ER-PRM proposes entropy-regularized process-value estimation to obtain more robust process signals (Zhang et al., 2024b), and BiPRM leverages bidirectional evaluation to incorporate future context when scoring earlier steps (Zhang et al., 2025d). Complementary work revisits the formulation of process values, e.g., learning Q-value rankings over steps (Li & Li, 2024), and addresses training-time pathologies such as reward hacking via alternative credit assignment (Cheng et al., 2025). Data and supervision pipelines have also been studied extensively: ACTPRM reduces labeling costs via uncertainty-driven active learning (Duan et al., 2025); AURORA automates PRM training via ensemble prompting and reverse verification (Tan et al., 2025); VersaPRM extends PRMs beyond math by leveraging synthetic multi-domain reasoning traces (Zeng et al., 2025); and OpenPRM constructs open-domain process-based reward models from preference trees distilled from outcome-level supervision (Zhang et al., 2025c). PRMs have further been adapted to sequential decision-making agents, where step rewards capture promise and progress rather than logical correctness (Xi et al., 2025). Finally, richer supervision signals beyond binary correctness have been explored: PathFinder-PRM introduces error-aware hierarchical supervision via explicit error typing (Pala et al., 2025; Jia et al., 2026). Data and evaluation issues have also been highlighted: the Qwen lessons show that Monte-Carlo-derived supervision can be noisy and that Best-of-N evaluation can bias PRMs toward outcome-like behavior, motivating complementary step-level benchmarks (Zhang et al., 2025e), while PRMBench exposes fine-grained failure modes not captured by downstream reranking metrics alone (Song et al., 2025). Our work builds on this PRM literature but focuses on a specific, pervasive source of noise in multimodal settings: uncertainty in visual premises.

Visual perception verification. Modern MLLMs often fail to reliably perceive fine-grained visual facts (e.g., counting, geometry, structured reading) despite fluent outputs (Fu et al., 2024; Schulze Buschoff et al., 2025). This motivates stronger vision encoders (Jain et al., 2024), document-focused perception (Yu et al., 2024), and perception–language alignment training (Huang et al., 2023; Wu et al., 2024; Huang et al., 2025), as well as iterative perception schemes such as Chain-of-Visual-Perception (Tang et al., 2024) and Visual Perception Tokens (Yu et al., 2025).
These efforts support our premise that verification should condition on the reliability of visual evidence.

Multimodal process reward models. Specialized multimodal PRMs have recently emerged as effective critics for test-time scaling. VisualPRM introduces large-scale multimodal process supervision and the VisualProcessBench benchmark, enabling systematic evaluation of step-level verification in vision-language reasoning (Wang et al., 2025b). Subsequent work improves data efficiency: ATHENA demonstrates that strong/weak consistency filtering and ORM initialization can produce competitive multimodal PRMs with substantially fewer labeled trajectories (Wang et al., 2025a), and broader analyses of VL-PRM training highlight practical lessons for scaling and deployment (Ong et al., 2025). Complementary efforts build multimodal PRM training pipelines and process supervision signals at scale (Luo et al., 2025; Cao et al., 2025). Beyond discriminative scoring, VRPRM combines chain-of-thought style verification with reinforcement learning to enhance multimodal process judgment (Chen et al., 2025), while GM-PRM extends verifiers with generative diagnosis and correction to support refined Best-of-N (Zhang et al., 2025a). Tool-integrated verification provides another axis: TIM-PRM mitigates confirmation bias by independently querying visual evidence via tools, improving reliability but at a non-trivial inference cost (Kuang et al., 2025). Finally, broader evaluation efforts for vision-language reward modeling, including process- and critique-style settings, have been advanced by VLRMBench (Ruan et al., 2025).
Across these approaches, multimodal PRMs are increasingly capable, yet the handling of visual premise uncertainty remains largely implicit: step scores are typically produced as if the underlying visual facts were equally reliable for all trajectories and all steps. Prior work has advanced PRMs through stronger reasoning verifiers (She et al., 2025; Zhao et al., 2025; Khalifa et al., 2025; Guo et al., 2025), improved training objectives and data efficiency (Duan et al., 2025; Wang et al., 2025a; Zhang et al., 2025e, 2024b; Li & Li, 2024; Cheng et al., 2025), and tool-based evidence gathering for multimodal verification (Kuang et al., 2025). In contrast, our contribution targets a missing interface between perception and process supervision. We introduce Explicit Visual Premise Verification, which (i) makes visual premises explicit via a policy-produced checklist, (ii) extracts structured visual constraints as independent evidence, and (iii) converts checklist–evidence consistency into a reliability signal used to calibrate step rewards. This decouples "whether the verifier can see" from "whether the step is logically correct," reducing false positives/negatives under perceptual failures while remaining lightweight enough for large-scale Best-of-N reranking.

3. Methodology

3.1. Problem Setup

Each instance consists of an image $I$ and a question $q$. A multimodal policy produces a step-by-step solution $S = (s_1, \dots, s_T)$ and a final answer $a$. We aim to build a process reward model (PRM) that assigns a reward $R_t \in [-1, 1]$ to each step $s_t$, supporting Best-of-N reranking and step-level diagnosis. The core difficulty in multimodal math is that errors come from two different sources: visual grounding (e.g., misread OCR/table values, wrong geometric relations, incorrect diagram structure) and symbolic reasoning (e.g., invalid derivations or arithmetic mistakes).
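As a point of reference for the rest of this section, the Best-of-N usage pattern this setup targets can be sketched as follows. This is a minimal skeleton; `policy` and `score_solution` are hypothetical placeholders for the multimodal policy and the step-reward-aggregating PRM scorer described later, not names from the paper.

```python
def best_of_n(question, image, policy, score_solution, n=8):
    """Sample n candidate solutions and keep the one the PRM scores highest."""
    candidates = [policy(image, question) for _ in range(n)]
    scores = [score_solution(image, question, s) for s in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy stand-ins: candidates are strings, and the "PRM" simply prefers
# longer traces, just to exercise the selection loop.
toy_outputs = iter(["step A", "step A step B", "step C"])
picked, score = best_of_n(
    question="q", image=None,
    policy=lambda img, q: next(toy_outputs),
    score_solution=lambda img, q, s: len(s.split()),
    n=3,
)
print(picked, score)
```

The point of the paper is precisely how `score_solution` should behave when the candidate's visual premises are unreliable; the following subsections fill in that scorer.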
Existing VL-PRMs typically output step scores directly, implicitly assuming the visual premise is reliable. When the premise is wrong early, later steps can remain locally coherent but globally invalid, and the verifier is forced to make confident judgments under uncertain perception. Our goal is to separate these error sources: we first assess whether the visual premise of a step is trustworthy, and only then rely on strict step correctness scores.

3.2. Explicit Visual Premise Verification (EVPV)

EVPV makes a PRM "qualified" to judge: it explicitly represents what visual facts a step relies on, checks those facts against independent visual evidence, and uses the resulting reliability to calibrate step rewards. Figure 2 summarizes the pipeline.

Figure 2 | Overview of EVPV-PRM. Given an image $I$ and question $q$, the policy model generates a step-by-step solution and, for each step, declares whether it depends on visual evidence, forming a visual checklist of explicit claims. In parallel, a constraint extractor predicts a structured set of visual facts $C$ (numeric readings, geometric relations, and compositional structure). We compute a visual reliability score $r$ by matching checklist claims against $C$ to obtain support scores and aggregating them into a single confidence signal. A step verifier then produces base step rewards, which are calibrated by reliability gating: rewards for non-visual steps are kept unchanged, while rewards for visually dependent steps are down-weighted when $r$ is low and preserved when $r$ is high. The resulting reliability-gated step rewards are aggregated for Best-of-N reranking and process diagnosis.

3.2.1. Step-wise Visual Checklist

We ask the policy to accompany each step $s_t$ with a short visual premise declaration:

$$d_t \in \{\text{a natural-language visual assertion},\ \texttt{null}\}. \tag{1}$$

If $d_t \neq \texttt{null}$, the step claims dependence on a concrete visual fact (e.g., "the radius is 2", "$AB \perp CD$", "the left part is attached by a cylinder"). We mark visual dependency by

$$\nu_t = \mathbb{I}[d_t \neq \texttt{null}] \in \{0, 1\}. \tag{2}$$

Collecting all non-null declarations yields a visual checklist $V = \{v_j\}_{j=1}^{M}$. This checklist is the interface EVPV needs: it turns implicit visual assumptions into explicit claims that can be verified independently of the policy's later algebra.

3.2.2. Structured Visual Evidence (Constraints)

To verify the checklist, we extract structured visual evidence from the image once per instance using a constraint extractor $E_\phi$:

$$C = E_\phi(I, q) = \{c_k\}_{k=1}^{K}. \tag{3}$$

Each constraint follows a unified JSON schema (Appendix A) that covers (i) numeric readings (lengths, angles, table entries), (ii) relations (parallel/perpendicular/equality/incidence/containment), and (iii) compositional structure (part–whole, attachments, adjacency). Importantly, at test time EVPV relies only on the predicted $C$; no gold facts are used.

3.2.3. Consistency-to-Reliability

EVPV converts checklist–evidence consistency into a scalar visual reliability score. Let $m(\cdot)$ be a type-aware matching function that measures whether a checklist claim is supported by $C$:

$$p_j = m(v_j, C) \in [0, 1], \tag{4}$$

where $p_j$ is high when the claim is entailed by the extracted constraints (with numeric tolerance and entity/relation alignment; Appendix B). We then aggregate $\{p_j\}$ into a single reliability value

$$r = \mathrm{Agg}(p_1, \dots, p_M) \in [0, 1]. \tag{5}$$

Because a single catastrophic misread can invalidate the entire trace, $\mathrm{Agg}$ should be sensitive to strongly unsupported claims. We use a robust geometric aggregation:

$$r = \exp\!\left(\frac{1}{M} \sum_{j=1}^{M} \log(\epsilon + p_j)\right), \tag{6}$$

with a small $\epsilon$ for stability.
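The matching-and-aggregation step of Equations (4)–(6) can be sketched as follows. This is a simplified illustration with made-up, pre-parsed constraint and claim formats; the paper's actual type-aware matcher and JSON schema (Appendices A–B) are more elaborate, and the field names here are hypothetical.

```python
import math

# Hypothetical pre-parsed evidence; the real system matches free-text claims
# against a richer JSON constraint schema (numeric / relation / structure).
constraints = [
    {"type": "numeric", "entity": "radius", "value": 2.0},
    {"type": "relation", "args": ("AB", "CD"), "rel": "perpendicular"},
]

def match(claim, constraints, tol=0.05):
    """Support score p_j in [0, 1] for one checklist claim (Eq. 4)."""
    for c in constraints:
        if claim["type"] != c["type"]:
            continue
        if c["type"] == "numeric" and claim["entity"] == c["entity"]:
            # Numeric tolerance: supported if within a relative tolerance.
            ok = abs(claim["value"] - c["value"]) <= tol * max(1.0, abs(c["value"]))
            return 1.0 if ok else 0.0
        if (c["type"] == "relation" and claim["rel"] == c["rel"]
                and set(claim["args"]) == set(c["args"])):
            return 1.0
    return 0.0  # no matching evidence at all

def reliability(support_scores, eps=1e-3):
    """Robust geometric aggregation of Eq. (6): one hard miss pulls r down."""
    if not support_scores:
        return 1.0  # no visual claims, so nothing to distrust
    mean_log = sum(math.log(eps + p) for p in support_scores) / len(support_scores)
    return min(1.0, math.exp(mean_log))

claims = [
    {"type": "numeric", "entity": "radius", "value": 2.0},          # supported
    {"type": "relation", "args": ("AB", "CD"), "rel": "parallel"},  # contradicted
]
ps = [match(v, constraints) for v in claims]
print(ps, round(reliability(ps), 4))  # one unsupported claim collapses r
```

Note how the geometric mean behaves as the paper intends: a single strongly unsupported claim drives $r$ toward zero, whereas an arithmetic mean would only halve it.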
Under hallucinated structure or misread values, one or more $p_j$ drops sharply, pulling $r$ down; when the checklist is well supported, $r$ remains high.

3.3. Step Verification with Reliability-Gated Rewards

Base step verifier. We train a standard step verifier $V_\theta$ to predict whether step $s_t$ is correct given the multimodal context and prefix:

$$u_t = P_\theta(y_t = 1 \mid I, q, s_{\le t}) \in [0, 1], \tag{7}$$

where $y_t = 1$ indicates a correct step. We map this probability to a signed base reward:

$$R^{\text{base}}_t = 2u_t - 1 \in [-1, 1]. \tag{8}$$

Reliability gating (EVPV calibration). A base verifier score alone is ambiguous in multimodal settings: a low score may reflect a true logical error, or simply that the step rests on a misperceived visual premise (either by the policy or by the verifier). EVPV resolves this ambiguity by calibrating the rewards of visually dependent steps using $r$. We convert reliability into a smooth gating factor

$$\alpha(r) = \sigma\big(\beta (r - \tau)\big) \in (0, 1), \tag{9}$$

where $\tau$ is a reliability threshold, $\beta$ controls smoothness, and $\sigma$ is the logistic function. The final step reward is

$$R_t = \begin{cases} R^{\text{base}}_t, & \nu_t = 0, \\ \alpha(r)\, R^{\text{base}}_t, & \nu_t = 1. \end{cases} \tag{10}$$

This implements a simple principle: when the visual premise is unreliable, do not over-interpret step correctness. If $r \ll \tau$, then $\alpha(r) \approx 0$ and visually grounded steps are pushed toward a neutral reward, preventing early perceptual failures from producing overly confident negative (or positive) signals that destabilize reranking and diagnosis. If $r \gg \tau$, then $\alpha(r) \approx 1$ and the verifier behaves like a conventional PRM.

Trajectory scoring for Best-of-N. Given a candidate solution $S$, we compute $\{R_t\}_{t=1}^{T}$ and aggregate the rewards into a trajectory score. Unless stated otherwise, we use the fraction of positively rewarded steps:

$$\mathrm{Score}(S) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}[R_t > 0], \tag{11}$$

and select the candidate with the highest score. We report alternative aggregations in Appendix E.

3.4. Training

EVPV introduces two trainable modules: the constraint extractor $E_\phi$ and the step verifier $V_\theta$. The policy is not trained in this work; it is only prompted to output steps and checklist items at inference (Figure 3).

Figure 3 | Training pipeline for the constraint extractor and step verifier. We train the constraint extractor $E_\phi$ by distilling gold structured constraints $C^\star$ from a strong teacher on image–question inputs (here, 20K samples from VisualPRM400K with qwen3-vl-235b-a22b-instruct), using supervised fine-tuning with $\mathcal{L}_{\text{con}} = -\log P_\phi(C^\star \mid I, q)$. After SFT initialization, we construct preference pairs by letting $E_\phi$ generate candidate constraints and selecting hard cases where the teacher identifies large deviations from $C^\star$; we then apply DPO to improve constraint fidelity. In parallel, we train the step verifier $V_\theta$ with step-level correctness labels via binary cross-entropy. Gold constraints are used only during training; inference relies solely on predicted constraints and checklist consistency.

Training the constraint extractor. We distill structured constraints from a strong teacher model. For each training instance, the teacher provides a constraint set $C^\star$ (we use qwen3-vl-235b-a22b-instruct on 20K samples from VisualPRM400K). We fine-tune $E_\phi$ with:

$$\mathcal{L}_{\text{con}}(\phi) = -\log P_\phi(C^\star \mid I, q), \tag{12}$$

where $C^\star$ is serialized as JSON. To improve fidelity on hard cases, we further apply DPO. We sample candidates $\{C^{(i)}\}_{i=1}^{n} \sim P_\phi(\cdot \mid I, q)$ and form a preferred/rejected pair $(C^+, C^-)$ using a schema-aware distance to $C^\star$ (Appendices A, B). The DPO loss is:

$$\mathcal{L}_{\text{DPO}}(\phi) = -\log \sigma\!\Big(\beta_{\text{dpo}} \big[\log P_\phi(C^+ \mid I, q) - \log P_\phi(C^- \mid I, q)\big]\Big), \tag{13}$$

and the full extractor objective is

$$\mathcal{L}_E(\phi) = \mathcal{L}_{\text{con}}(\phi) + \lambda_{\text{dpo}}\, \mathcal{L}_{\text{DPO}}(\phi). \tag{14}$$

Training the step verifier.
We train $V_\theta$ with step-level correctness labels using binary cross-entropy:

$$\mathcal{L}_V(\theta) = -\sum_{t=1}^{T} \big[\, y_t \log u_t + (1 - y_t) \log(1 - u_t) \,\big], \tag{15}$$

where $u_t = P_\theta(y_t = 1 \mid I, q, s_{\le t})$. Reliability $r$ and gating (Equation (10)) are applied at inference time as a calibration layer, keeping verifier training simple and making EVPV easy to plug into existing PRMs.

Inference. For each candidate solution, we (i) obtain steps and the checklist from the policy, (ii) predict the constraints $C = E_\phi(I, q)$ once, (iii) compute the reliability $r$ by matching checklist items to $C$, (iv) compute gated step rewards via Equations (8) and (10), and (v) aggregate rewards to rerank candidates. This achieves premise-aware verification without step-wise tool calls.

4. Experiments

4.1. Benchmarks, Protocol, and Baselines

We evaluate EVPV from two angles: (i) step-level verification on annotated reasoning traces, and (ii) deployable test-time gains under Best-of-N reranking. For step-level evaluation we use VisualProcessBench (Wang et al., 2025b). For downstream evaluation we use six multimodal reasoning benchmarks: LogicVista (Xiao et al., 2024), MMMU (Yue et al., 2024), MathVerse-VO (Zhang et al., 2024c), MathVision (Wang et al., 2024), MathVista (Lu et al., 2023), and WeMath (Qiao et al., 2025).

Metrics. On VisualProcessBench we report step-level Macro-F1 (primary) and accuracy. On downstream benchmarks we report Pass@1 (policy accuracy without reranking), BoN@k (accuracy after reranking $k$ samples), and the practical gain $\Delta_k = \text{BoN@}k - \text{Pass@1}$. We also report Std Pass@k, the oracle upper bound of the candidate set, to separate candidate quality from selection quality.

Table 1 | VisualProcessBench Macro-F1 (%). Yes: the judge receives our structured constraints; No: original prompt; Δ = Yes − No (in points).

| Model | Cond. | DynaMath | MMMU | MathVerse | MathVision | WeMath | Overall |
|---|---|---|---|---|---|---|---|
| Proprietary models | | | | | | | |
| gpt-4o-mini | No | 56.57 | 54.08 | 52.53 | 51.42 | 56.74 | 53.57 |
| | Yes | 58.13 | 53.20 | 54.09 | 52.07 | 54.62 | 54.29 |
| | Δ | +1.56 | -0.88 | +1.56 | +0.65 | -2.12 | +0.72 |
| doubao-seed-1.6-vision | No | 66.19 | 59.47 | 63.12 | 61.07 | 62.74 | 62.77 |
| | Yes | 68.66 | 61.86 | 65.57 | 62.51 | 64.62 | 64.91 |
| | Δ | +2.47 | +2.39 | +2.45 | +1.44 | +1.88 | +2.14 |
| Gemini 2.5 Pro | No | 68.47 | 63.34 | 68.26 | 65.15 | 69.48 | 67.13 |
| | Yes | 71.32 | 64.42 | 69.78 | 65.26 | 72.43 | 68.64 |
| | Δ | +2.85 | +1.08 | +1.52 | +0.11 | +2.95 | +1.51 |
| Open-source models | | | | | | | |
| qwen2.5-vl-72b-instruct | No | 56.99 | 59.43 | 56.43 | 58.09 | 55.72 | 57.19 |
| | Yes | 61.43 | 60.25 | 59.85 | 59.12 | 59.72 | 59.99 |
| | Δ | +4.44 | +0.82 | +3.42 | +1.03 | +4.00 | +2.80 |
| Qwen3-VL-30B-A3B-instruct | No | 58.95 | 61.29 | 57.37 | 57.49 | 58.76 | 58.22 |
| | Yes | 62.27 | 59.00 | 59.68 | 56.49 | 59.50 | 59.26 |
| | Δ | +3.32 | -2.29 | +2.31 | -1.00 | +0.74 | +1.04 |
| Qwen3-VL-235B-A22B-instruct | No | 57.63 | 58.73 | 58.08 | 59.59 | 58.76 | 58.51 |
| | Yes | 68.43 | 61.54 | 65.97 | 64.54 | 64.16 | 65.45 |
| | Δ | +10.80 | +2.81 | +7.89 | +4.95 | +5.40 | +6.94 |
| Process reward models | | | | | | | |
| QWEN-VL-PRM-7B (Ong et al., 2025) | | 58.30 | 55.80 | 58.80 | 55.70 | 59.80 | 58.60 |
| TIM-PRM-8B (Kuang et al., 2025) | | 65.90 | 58.30 | 61.90 | 58.30 | 63.90 | 61.70 |
| VisualPRM-8B (Wang et al., 2025b) | | 62.70 | 58.50 | 61.00 | 62.10 | 61.80 | 62.00 |
| EVPV-PRM (ours) | | 69.57 | 68.86 | 67.09 | 65.27 | 69.11 | 67.46 |

Baselines.
We compare against multimodal PRMs including VisualPRM (Wang et al., 2025b), QWEN-VL-PRM-7B (Ong et al., 2025), and the tool-integrated verifier TIM-PRM (Kuang et al., 2025). We also evaluate several strong MLLMs as step judges under a standardized prompt, with two conditions: No (original prompt) and Yes (append our extracted structured constraints as evidence). Finally, we include component ablations of EVPV (checklist, constraints, matching, gating).

4.2. Exp-1: Step Verification on VisualProcessBench

We evaluate step-level verification directly on VisualProcessBench (Wang et al., 2025b). Table 1 compares our method with prior multimodal PRMs and a set of judge models. For judge models, Yes appends our extracted structured constraints, while No uses the original prompt. Two observations stand out in Table 1. First, our method achieves the best overall Macro-F1 among the compared PRMs, indicating stronger step discrimination under real visual uncertainty. Second, many judge models improve under Yes, suggesting that the constraint representation is broadly reusable as external evidence, even without retraining the judge, and that a non-trivial part of verification error comes from missing or unreliable grounding.

Table 2 | Downstream Best-of-8 reranking with InternVL2.5 policies. BoN@8 accuracy (%) after reranking with different PRMs; each Δ8 row gives BoN@8 − Pass@1 relative to the corresponding base policy.
| Model | MathVista | MathVision | MathVerse-VO | WeMath | LogicVista | MMMU | Overall |
|---|---|---|---|---|---|---|---|
| Proprietary models | | | | | | | |
| GPT-4o | 60.00 | 31.20 | 40.60 | 45.80 | 52.80 | 70.70 | 47.90 |
| Gemini-2.0-Flash | 70.40 | 43.60 | 47.80 | 47.40 | 52.30 | 69.90 | 53.40 |
| Claude-3.5-Sonnet | 65.30 | 35.60 | 46.30 | 44.00 | 60.40 | 66.40 | 50.50 |
| Open-source models | | | | | | | |
| InternVL2.5-8B (Pass@1) | 64.50 | 17.00 | 22.80 | 23.50 | 36.38 | 56.20 | 32.84 |
| +VisualPRM | 68.50 | 25.70 | 35.80 | 36.50 | 43.80 | 60.20 | 41.40 |
| Δ8 | +4.00 | +8.70 | +13.00 | +13.00 | +7.80 | +4.00 | +8.40 |
| +EVPV-PRM | 76.30 | 22.07 | 29.47 | 37.45 | 45.33 | 67.75 | 41.67 |
| Δ8 | +11.80 | +5.07 | +6.67 | +13.95 | +8.95 | +11.55 | +8.83 |
| InternVL2.5-26B (Pass@1) | 68.20 | 23.40 | 24.00 | 30.90 | 39.64 | 60.70 | 37.23 |
| +VisualPRM | 73.10 | 29.60 | 39.10 | 40.80 | 51.00 | 63.90 | 45.80 |
| Δ8 | +4.90 | +6.20 | +15.10 | +9.90 | +11.40 | +3.20 | +8.90 |
| +EVPV-PRM | 79.60 | 28.11 | 32.47 | 42.14 | 51.72 | 69.25 | 46.75 |
| Δ8 | +11.40 | +4.71 | +8.47 | +11.24 | +12.08 | +8.55 | +9.52 |
| InternVL2.5-38B (Pass@1) | 71.90 | 32.20 | 36.90 | 38.30 | 47.90 | 63.90 | 45.44 |
| +VisualPRM | 73.90 | 35.20 | 46.70 | 46.20 | 53.70 | 69.00 | 50.70 |
| Δ8 | +2.00 | +3.00 | +9.80 | +7.90 | +5.80 | +5.10 | +6.30 |
| +EVPV-PRM | 83.50 | 37.59 | 47.67 | 50.00 | 58.74 | 72.33 | 55.22 |
| Δ8 | +11.60 | +5.39 | +10.77 | +11.70 | +10.84 | +8.43 | +9.78 |

4.3. Exp-2: Best-of-N Reranking on Downstream Benchmarks

We next test whether premise-aware verification translates into deployable test-time gains. We rerank candidates generated by InternVL2.5 policy models at three scales (8B/26B/38B). For each question, the policy samples up to 8 candidate solutions; we rerank them using step rewards and report BoN@8. Table 2 summarizes the results. Across all three policy sizes, our PRM yields consistent gains over the base policy and improves upon VisualPRM (Wang et al., 2025b) in overall performance (e.g., +8.83, +9.52, and +9.78 points over Pass@1 for 8B/26B/38B, respectively).
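The reliability-gated scoring used in these reranking runs (Equations (8)–(11)) is straightforward to sketch. The threshold and smoothness values below are illustrative placeholders; the paper does not state its actual choices of $\tau$ and $\beta$ in this section.

```python
import math

def gate(r, tau=0.5, beta=10.0):
    """Smooth reliability gate alpha(r) of Eq. (9); tau, beta are illustrative."""
    return 1.0 / (1.0 + math.exp(-beta * (r - tau)))

def gated_rewards(step_probs, is_visual, r):
    """Eqs. (8) and (10): map verifier probabilities u_t to gated rewards R_t."""
    rewards = []
    for u, visual in zip(step_probs, is_visual):
        base = 2.0 * u - 1.0               # R_base in [-1, 1]
        rewards.append(gate(r) * base if visual else base)
    return rewards

def trajectory_score(rewards):
    """Eq. (11): fraction of positively rewarded steps."""
    return sum(1 for R in rewards if R > 0) / len(rewards)

# A visually dependent step under low reliability is pushed toward neutral
# rather than confidently rewarded or penalized; non-visual steps pass through.
u = [0.9, 0.8, 0.3]              # verifier probabilities per step
visual = [True, False, False]    # nu_t flags from the checklist
low = gated_rewards(u, visual, r=0.1)
high = gated_rewards(u, visual, r=0.9)
print([round(x, 3) for x in low], trajectory_score(low))
print([round(x, 3) for x in high], trajectory_score(high))
```

Candidates are then ranked by `trajectory_score`, which is why a hallucinated premise (low $r$) can no longer inflate a trace through confidently scored but ungrounded early steps.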
The improvements are especially pronounced on visually intensive benchmarks such as MathVista, WeMath, and LogicVista, which matches EVPV's intent: when early visual premises are the dominant failure mode, reliability-aware step scoring reduces selection errors without incurring the per-step tool overhead of TIM-PRM (Kuang et al., 2025).

4.4. Exp-3: Perception Evidence Quality and Its Causal Impact on Verification

EVPV is motivated by a single principle: reliable visual evidence is a prerequisite for meaningful process verification. We therefore examine this principle from two complementary angles, (i) intervention on the policy's perceived evidence and (ii) controlled degradation of the verifier's extracted constraints, to quantify both the sensitivity of multimodal reasoning to perception and the causal role of constraint fidelity in step verification.

(A) Perception interventions for the policy. To measure how strongly multimodal reasoning depends on perception quality, we evaluate the same questions under four controlled settings: (I) Normal (image + q), (II) Oracle perception (image + q plus an oracle structured description), (III) Noisy perception (image + q plus a corrupted description), and (IV) Text-only (remove the image). We run a fixed policy model for all settings and report answer accuracy and PRM trajectory scores.
Figure 4 | Constraint quality–performance causal curves under controlled noise. (Line plot: Macro-F1 (%) vs. flip ratio (%) for DynaMath, MMMU, MathVerse, MathVision, WeMath, and Overall; the Overall curve falls from 67.31% at 0% flips to 63.85% at 100%, a 3.46-point drop.)

Table 3 | Perception interventions. We evaluate the same questions under four perception conditions with InternVL2.5-8B. Top: policy accuracy (%). Bottom: average PRM trajectory score (higher is better).

| Condition | LogicVista | MathVerse-VO | MathVision |

Accuracy (%)
| (I) Normal | 38.26 | 24.54 | 15.86 |
| (II) Oracle perception | 48.10 | 38.48 | 23.75 |
| (III) Noisy perception | 40.49 | 32.46 | 18.82 |
| (IV) Text-only | 21.25 | 16.22 | 12.50 |

Score
| (I) Normal | 0.08 | 0.22 | 0.05 |
| (II) Oracle perception | 0.20 | 0.45 | 0.19 |
| (III) Noisy perception | 0.10 | 0.34 | 0.09 |
| (IV) Text-only | −0.31 | −0.11 | −0.01 |

Table 3 shows two consistent patterns: providing oracle perception substantially improves accuracy, while text-only performance drops sharply, indicating that perception is a dominant bottleneck; moreover, our PRM yields a monotonic ordering of trajectory scores aligned with perception quality, (II) > (III) > (I) > (IV), matching EVPV's intent that weakened visual evidence should not produce a strong "correct process" signal.

(B) Causal curve via constraint corruption. EVPV further attributes its gains to the fidelity of the extracted structured constraints used to validate checklist claims. To test this causally, we inject controlled noise into the constraint set by randomly flipping a fraction of constraint fields (flip ratio), while keeping the policy, verifier architecture, and scoring procedure fixed.
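The corruption can be sketched as follows; the per-category flipping rules are illustrative assumptions (the paper does not specify the exact perturbation), with field names following the constraint schema in Appendix A:

```python
import random
from typing import Dict, List

def corrupt_constraints(constraints: List[Dict], flip_ratio: float,
                        rng: random.Random) -> List[Dict]:
    """Randomly perturb a fraction of constraint fields, keeping the rest fixed."""
    corrupted = [dict(c) for c in constraints]          # shallow copies
    n_flip = int(round(flip_ratio * len(corrupted)))
    for i in rng.sample(range(len(corrupted)), n_flip):
        c = corrupted[i]
        if c.get("category") == "numeric":
            c["value"] = c["value"] * 2 + 1             # break the numeric fact
        elif c.get("category") == "relation":
            c["type"] = "parallel" if c["type"] != "parallel" else "perpendicular"
        else:
            c["parts"] = list(reversed(c.get("parts", [])))
    return corrupted
```

Sweeping `flip_ratio` from 0 to 1 while holding everything else fixed yields the causal curves of Figure 4.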
As shown in Figure 4, VisualProcessBench Macro-F1 decreases monotonically as the flip ratio increases across all evaluated judges, providing causal evidence that verification quality is driven by constraint fidelity and premise verification rather than incidental prompt length or formatting effects. The mild drop under low noise also indicates that the reliability gating is not overly brittle: small constraint errors do not immediately collapse step judgments.

4.5. Exp-4: Ablation Studies

We ablate core components of EVPV to identify which parts are responsible for the verification and reranking gains. Table 4 reports representative variants on VisualProcessBench (Macro-F1). The trends closely match the EVPV design.

Table 4 | Key ablations on VisualProcessBench (Macro-F1; higher is better). Δ is relative to the full method.

| Variant | DynaMath | MMMU | MathVerse | MathVision | WeMath | Overall | Δ |

Full method
| Full (EVPV + gating) | 69.57 | 68.86 | 67.09 | 65.27 | 69.11 | 67.46 | +0.00 |

Evidence / structure ablations
| w/o structured facts (caption-only) | 67.75 | 58.09 | 63.48 | 60.68 | 67.10 | 63.38 | −4.08 |
| w/o constraints (facts = ∅) | 66.66 | 55.80 | 62.61 | 59.13 | 65.81 | 62.11 | −5.35 |
| w/ shuffled facts (structure corrupted) | 62.86 | 52.57 | 59.81 | 58.52 | 64.77 | 59.82 | −7.64 |

Remove modalities / severe corruption
| w/o vision (text-only judge, keep JSON) | 58.44 | 49.44 | 53.59 | 54.07 | 61.02 | 54.93 | −12.53 |
| w/o vision & w/o JSON (text-only) | 54.49 | 43.93 | 42.78 | 50.84 | 53.78 | 48.23 | −19.23 |
| w/ drop-facts corruption | 34.90 | 34.40 | 36.29 | 36.14 | 35.96 | 35.77 | −31.69 |

First, premise verification requires usable structured evidence. Replacing structured constraints with caption-only descriptions reduces overall Macro-F1 by 4.08 points, and completely removing constraints (facts = ∅) further degrades performance (−5.35). This shows that simply having additional text context is insufficient; the verifier benefits from structured, matchable facts that can support checklist claims.

Second, structure and alignment matter. When we keep the same facts but shuffle them to corrupt the relational structure, Macro-F1 drops more sharply (−7.64). This indicates that EVPV is not merely exploiting the presence of extra tokens, but relies on faithful entity/relation alignment between checklist items and evidence to compute reliability and gate rewards appropriately.

Third, EVPV still depends on direct visual access. Making the judge text-only while keeping the JSON constraints causes a large drop (−12.53), and removing both vision and JSON drops further (−19.23). Thus, structured constraints are helpful but do not fully substitute for image-conditioned verification; both modalities contribute to reliable step supervision.

Finally, the drop-facts corruption collapses performance (−31.69), reflecting that when evidence becomes severely incomplete, the verifier is effectively ungrounded and reliability gating can no longer provide meaningful calibration.

5. Discussion

Why EVPV helps: turning a hidden assumption into a checked premise. Most process reward models score a step as if the underlying facts were settled, even though in multimodal problems the "facts" often come from fragile perception. This creates a systematic ambiguity: a low score may reflect wrong logic, or simply a misread diagram. EVPV reduces this ambiguity by making visual premises explicit (checklist) and verifying them against independent structured evidence (constraints) before trusting strict step judgments. This view aligns with findings that multimodal chain-of-thought reliability depends on faithful visual grounding (Zhang et al., 2025b) and with "generate-then-verify" interventions that explicitly validate claims to mitigate hallucinations (Wu et al., 2025). The controlled perception intervention in Table 3 supports this premise: as perception quality changes, answer accuracy and our trajectory scores shift coherently.

From verification to deployment: more reliable reranking under test-time scaling. The reranking results in Table 2 show that premise-aware scoring yields practical gains across InternVL2.5 policy sizes, with the largest improvements on benchmarks where early visual misreads dominate. This suggests EVPV mainly reduces selection errors: fluent but visually wrong traces being ranked above grounded ones. Compared with tool-integrated verification (e.g., TIM-PRM (Kuang et al., 2025)), EVPV is lightweight: it validates premises once per problem via extracted constraints, avoiding expensive per-step tool calls, while remaining compatible with verification-driven test-time reliability strategies (Wu et al., 2025).

Evidence quality matters, and the ablations isolate it. Our gains are driven by premise verification with usable structured evidence. Step-level improvements on VisualProcessBench (Table 1) and the monotonic degradation under constraint corruption (Figure 4) indicate a direct dependence on constraint fidelity, and the ablations (Table 4) show that removing structured facts or vision substantially harms performance. These results complement analyses showing that PRM robustness depends on controlling supervision noise (Zheng et al., 2025; Wang et al., 2025a) and are consistent with recent efforts to stabilize process-level signals via redesigned step-wise learning objectives (Fei et al., 2025).

6. Conclusion

We introduced Explicit Visual Premise Verification (EVPV) for multimodal process reward modeling. EVPV prompts the policy to state step-wise visual premises, verifies them against structured constraints extracted from the image, and uses the resulting reliability signal to calibrate step rewards. This decoupling makes process supervision more dependable under perceptual failures and improves Best-of-N selection in downstream multimodal reasoning.

EVPV has limitations. Its effectiveness depends on the coverage and accuracy of the extracted constraints: missing or spurious constraints can under- or over-gate visually grounded steps. It also relies on checklist quality; incomplete or overly vague premises reduce matchability, and instance-level reliability may be coarse for traces that mix local visual reads with pure algebra. Future work includes step-/claim-conditioned reliability (rather than a single global signal), uncertainty-aware constraint extraction and matching, and integrating premise-aware rewards into training-time process optimization to further improve robustness under distribution shift and long-horizon reasoning.

References

Qi Cao, Ruiyi Wang, Ruiyi Zhang, Sai Ashish Somayajula, and Pengtao Xie. Dreamprm: Domain-reweighted process reward model for multimodal reasoning. arXiv preprint, 2025.

Xinquan Chen, Bangwei Liu, Xuhong Wang, Yingchun Wang, and Chaochao Lu. Vrprm: Process reward modeling via visual reasoning. arXiv preprint, 2025.

Jie Cheng, Gang Xiong, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Yisheng Lv, and Fei-Yue Wang. Stop summation: Min-form credit assignment is all process reward model needs for reasoning. arXiv preprint arXiv:2504.15275, 2025.

Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, and Longxu Dou. Efficient process reward model training via active learning. arXiv preprint, 2025.
Wu Fei, Hao Kong, Shuxian Liang, Yang Lin, Yibo Yang, Jing Tang, Lei Chen, and Xiansheng Hua. Self-guided process reward optimization with redefined step-wise advantage for process reinforcement learning. arXiv preprint arXiv:2507.01551, 2025.

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pp. 148–166. Springer, 2024.

Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. Reward reasoning model. arXiv preprint, 2025.

Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu, Xiaoqin Zhang, Ling Shao, Shijian Lu, and Dacheng Tao. Visual instruction tuning towards general-purpose multimodal large language model: A survey. International Journal of Computer Vision, 133(11):8151–8189, 2025.

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, et al. Language is not all you need: Aligning perception with language models. Advances in Neural Information Processing Systems, 36:72096–72109, 2023.

Jitesh Jain, Jianwei Yang, and Humphrey Shi. Vcoder: Versatile vision encoders for multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 27992–28002, 2024.

Ruipeng Jia, Yunyi Yang, Yongbo Gai, Kai Luo, Shihao Huang, Jianhe Lin, Xiaoxi Jiang, and Guanjun Jiang. Writing-zero: Bridge the gap between non-verifiable tasks and verifiable rewards. arXiv preprint arXiv:2506.00103, 2025.

Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao, Mengyu Zhou, Jianhe Lin, Xiaoxi Jiang, and Guanjun Jiang.
Open rubric system: Scaling reinforcement learning with pairwise adaptive rubric. arXiv preprint arXiv:2602.14069, 2026.

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think. arXiv preprint, 2025.

Peng Kuang, Xiangxiang Wang, Wentao Liu, Jian Dong, and Kaidi Xu. Tim-prm: Verifying multimodal reasoning with tool-integrated prm. arXiv preprint, 2025.

Wendi Li and Yixuan Li. Process reward model with q-value rankings. arXiv preprint, 2024.

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint, 2023.

Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Lei Wang, Ruihang Chu, et al. Unlocking multimodal mathematical reasoning via process reward model. arXiv preprint arXiv:2501.04686, 2025.

Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. Let's reward step by step: Step-level reward model as the navigators for reasoning. arXiv preprint, 2023.

Brandon Ong, Tej Deep Pala, Vernon Toh, William Chandra Tjhi, and Soujanya Poria. Training vision-language process reward models for test-time scaling in multimodal reasoning: Key insights and lessons learned. arXiv preprint arXiv:2509.23250, 2025.

Tej Deep Pala, Panshul Sharma, Amir Zadeh, Chuan Li, and Soujanya Poria. Error typing for smarter rewards: Improving process reward models with error-aware hierarchical supervision. arXiv preprint, 2025.

Young-Jin Park, Kristjan Greenewald, Kaveh Alim, Hao Wang, and Navid Azizan. Know what you don't know: Uncertainty calibration of process reward models. arXiv preprint, 2025.
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 20023–20070, 2025.

Jiacheng Ruan, Wenzhen Yuan, Xian Gao, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Ting Liu, and Yuzhuo Fu. Vlrmbench: A comprehensive and challenging benchmark for vision-language reward models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3163–3173, 2025.

Luca M Schulze Buschoff, Elif Akata, Matthias Bethge, and Eric Schulz. Visual cognition in multimodal large language models. Nature Machine Intelligence, 7(1):96–106, 2025.

Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. R-prm: Reasoning-driven process reward modeling. arXiv preprint, 2025.

Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. Prmbench: A fine-grained and challenging benchmark for process-level reward models. arXiv preprint, 2025.

Xiaoyu Tan, Tianchu Yao, Chao Qu, Bin Li, Minghao Yang, Dakuan Lu, Haozhe Wang, Xihe Qiu, Wei Chu, Yinghui Xu, et al. Aurora: Automated training framework of universal process reward models via ensemble prompting and reverse verification. arXiv preprint, 2025.

Lv Tang, Peng-Tao Jiang, Zhi-Hao Shen, Hao Zhang, Jin-Wei Chen, and Bo Li. Chain of visual perception: Harnessing multimodal large language models for zero-shot camouflaged object detection. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 8805–8814, 2024.

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li.
Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024.

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, and Emad Barsoum. Athena: Enhancing multimodal reasoning with data-efficient process reward models. arXiv preprint, 2025a.

Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, et al. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025b.

Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks. Advances in Neural Information Processing Systems, 37:69925–69975, 2024.

Tsung-Han Wu, Heekyung Lee, Jiaxin Ge, Joseph E Gonzalez, Trevor Darrell, and David M Chan. Generate, but verify: Reducing hallucination in vision-language models with retrospective resampling. arXiv preprint arXiv:2504.13169, 2025.

Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, et al. Agentprm: Process reward models for llm agents via step-wise promise and progress. arXiv preprint, 2025.

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint, 2024.

Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, and Yarin Gal. Uncertainty-aware step-wise verification with generative reward models. arXiv preprint, 2025.

Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, and Xuan-Jing Huang. Dynamic and generalizable process reward modeling.
In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4203–4233, 2025.

Runpeng Yu, Xinyin Ma, and Xinchao Wang. Introducing visual perception token into multimodal large language model. arXiv preprint, 2025.

Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, and Wei Zeng. Texthawk: Exploring efficient fine-grained perception of multimodal large language models. arXiv preprint, 2024.

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9556–9567, 2024.

Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, et al. Versaprm: Multi-domain process reward model via synthetic reasoning data. arXiv preprint, 2025.

Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. Advances in Neural Information Processing Systems, 37:64735–64772, 2024a.

Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, and Tong Zhang. Entropy-regularized process reward model. arXiv preprint, 2024b.

Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, and Xuming Hu. Gm-prm: A generative multimodal process reward model for multimodal mathematical reasoning. arXiv preprint, 2025a.

Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv, Ruiqi Chen, Jing Yang, Yijia Fan, Xiaofei Sun, Jian Wang, et al.
Mm-cot: A benchmark for probing visual chain-of-thought reasoning in multimodal models. arXiv preprint arXiv:2512.08228, 2025b.

Kaiyan Zhang, Jiayuan Zhang, Haoxin Li, Xuekai Zhu, Ermo Hua, Xingtai Lv, Ning Ding, Biqing Qi, and Bowen Zhou. Openprm: Building open-domain process-based reward models with preference trees. In The Thirteenth International Conference on Learning Representations, 2025c.

Lingyin Zhang, Jun Gao, Xiaoxue Ren, and Ziqiang Cao. The bidirectional process reward model. arXiv preprint arXiv:2508.01682, 2025d.

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pp. 169–186. Springer, 2024c.

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025e.

Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, et al. Genprm: Scaling test-time compute of process reward models via generative reasoning. arXiv preprint arXiv:2504.00891, 2025.

Congming Zheng, Jiachen Zhu, Zhuoying Ou, Yuxiang Chen, Kangning Zhang, Rong Shan, Zeyu Zheng, Mengyue Yang, Jianghao Lin, Yong Yu, et al. A survey of process reward models: From outcome signals to process supervisions for large language models. arXiv preprint, 2025.

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint, 2025.

7. Appendix

A. Structured Visual Constraint Schema

The constraint extractor E_φ maps an image–question pair (I, q) to a structured set C = {c_k} (k = 1, ..., K). Each c_k belongs to one of three categories: numeric, relation, or structure. The schema is serialized as a JSON array and is the direct supervision target during SFT (Appendix C).

A.1 Complete Example

The following JSON shows a representative constraint set C for a geometry problem whose image depicts a combined cone-and-cylinder solid with labeled dimensions.

Example: structured visual constraint set C

[
  {
    "category": "numeric",
    "entity": "cylinder base radius",
    "attribute": "length",
    "value": 3,
    "unit": "cm",
    "confidence": 0.95
  },
  {
    "category": "numeric",
    "entity": "cylinder height",
    "attribute": "length",
    "value": 8,
    "unit": "cm",
    "confidence": 0.92
  },
  {
    "category": "numeric",
    "entity": "cone height",
    "attribute": "length",
    "value": 4,
    "unit": "cm",
    "confidence": 0.88
  },
  {
    "category": "relation",
    "type": "equal",
    "entities": ["cone base radius", "cylinder base radius"],
    "direction": null,
    "confidence": 0.97
  },
  {
    "category": "structure",
    "type": "composite",
    "parts": ["cylinder", "cone"],
    "attachment": ["cone placed on top of cylinder"],
    "adjacency": [],
    "confidence": 0.94
  }
]

At test time, E_φ predicts C from (I, q) directly; no gold constraints are used. During training (Appendix C), the teacher model provides C★ as supervision targets.

A.2 Schema Specification

Table 5 lists the top-level fields for each constraint category.

B. Checklist–Constraint Matching Function

We describe the type-aware matching function m(v_j, C) that maps a single checklist claim v_j to a support score p_j ∈ [0, 1].

B.1 Claim Parsing

Each checklist item v_j (produced by the policy's visualdependency field) is a natural-language assertion.
We classify it as one of three claim types (numeric, relational, or structural) using a lightweight classifier trained on the schema vocabulary. Unclassifiable claims receive a soft fallback score of 0.5 (indicating uncertainty rather than contradiction).

Table 5 | Top-level fields for each constraint category. *confidence is a model-estimated reliability weight in [0, 1] and is used during matching (Appendix B).

• numeric: entity, attribute, value, unit, confidence*. A measurable fact associated with a named visual entity. entity is a label or description of the object (e.g., "segment AB"); attribute names the quantity (e.g., "length", "angle", "count"); value is a numeric literal; unit is optional (e.g., "cm", "degrees").
• relation: type, entities, direction, confidence. A geometric or logical relationship between two or more entities. type encodes one of {parallel, perpendicular, equal, subset, incident, adjacent, greater, less}; entities is an ordered list of the participants; direction is optional (e.g., "AB → CD").
• structure: type, parts, attachment, adjacency, confidence. A compositional or topological description of a multi-part figure. type is one of {composite, graph, table, sequence}; parts lists sub-components; attachment and adjacency are optional relational lists specifying how parts connect.

B.2 Type-Specific Matching

Numeric matching. For a numeric claim asserting "entity e has attribute a equal to value x (unit u)", we search C for constraints c_k with matching entity ≈ e and attribute = a using token-overlap similarity (Jaccard ≥ 0.5). Among all matching constraints, we select the one with highest confidence and compute

p_j^num = 1[ |x − c_k.value| / max(|x|, 1) < δ ] · c_k.confidence,  (16)

with tolerance δ = 0.15. If no matching constraint exists, we set p_j^num = 0.

Relation matching. For a relational claim asserting a type t between entities {e_1, e_2, ...}, we search C for constraints with type = t and entity overlap. Entity overlap is measured by set intersection over union (Jaccard) of the entity token sets. We define

p_j^rel = max_{c_k ∈ C(t)} Jaccard({e_i}, c_k.entities) · c_k.confidence,  (17)

where C(t) is the subset of constraints with type = t. Synonym groups are used to handle equivalent relation labels (e.g., perpendicular ↔ orthogonal).

Structural matching. For a structural claim specifying a set of parts P = {p_1, ..., p_m}, we search for composite/graph-type constraints and compute part-list Jaccard similarity:

p_j^str = max_{c_k ∈ C_struct} ( |P ∩ c_k.parts| / |P ∪ c_k.parts| ) · c_k.confidence.  (18)

B.3 Score Aggregation

The per-claim score p_j is the type-specific score from the matched sub-routine. If no constraint can be matched (empty C or entirely disjoint entity vocabulary), we apply a soft fallback: p_j = 0.5, reflecting neutral evidence rather than active contradiction.

The per-sample visual reliability score r (Eq. (6) of the main paper) is the geometric mean of all {p_j}:

r = exp( (1/M) Σ_{j=1}^{M} log(ε + p_j) ),  ε = 10^−6.  (19)

The geometric mean is deliberately sensitive to catastrophic failures: if any p_j ≈ 0 (a clear contradiction between checklist and evidence), the product collapses and r is pulled sharply downward regardless of how well other claims are supported. This asymmetry is intentional: a single deeply misperceived premise can invalidate the entire trace, and EVPV's gating should reflect this.

C. Training Details

C.1 Dataset Construction

Constraint distillation.
We sample 20,000 image–question pairs from VisualPRM400K and annotate each with a gold constraint set C★ using qwen3-vl-235b-a22b-instruct as the teacher model. The teacher is prompted with the schema from Appendix A and instructed to output a JSON array of constraints; responses that fail schema validation are filtered. The resulting 20K pairs form the SFT corpus for the constraint extractor E_φ.

Step verifier labels. We use the process-level correctness annotations from VisualProcessBench (Wang et al., 2025b), which provide labels y_t ∈ {0, 1} for each step in each solution trace. These labels are the direct supervision targets for the step verifier V_θ.

C.2 Constraint Extractor E_φ

Architecture. E_φ is initialized from a pre-trained multimodal VLM backbone (InternVL2.5-8B) and fine-tuned to generate structured constraint JSON conditioned on (I, q).

SFT stage. We minimize the next-token prediction loss on the JSON serialization of C★:

L_con(φ) = − log P_φ(C★ | I, q).

Training uses AdamW with learning rate 2 × 10^−5, linear warmup over the first 3% of steps, cosine decay, batch size 16, and 3 epochs. Maximum sequence length is 4,096 tokens.

DPO stage. To improve constraint fidelity on hard cases, we apply DPO after SFT. For each training instance, we sample n = 4 candidates C^(1), ..., C^(4) ~ P_φ(· | I, q) and compute a schema-aware distance to C★. The distance combines (i) category-wise constraint recall (fraction of gold constraints recovered), (ii) numeric value deviation (Eq. (16)), and (iii) relation type precision. The sample closest to C★ becomes the preferred response C+; the most distant becomes the rejected response C−. We then apply the standard DPO objective:

L_DPO(φ) = − log σ( β_dpo [ log P_φ(C+ | I, q) − log P_φ(C− | I, q) ] ),

with β_dpo = 0.1 and preference-pair weight λ_dpo = 0.1.
The full extractor objective is L_E(φ) = L_con(φ) + λ_dpo · L_DPO(φ). DPO training runs for 1 epoch with learning rate 5 × 10^−6.

C.3 Step Verifier V_θ

V_θ is fine-tuned from the same InternVL2.5-8B backbone using binary cross-entropy on per-step correctness labels from VisualProcessBench:

L_V(θ) = − Σ_{t=1}^{T} [ y_t log u_t + (1 − y_t) log(1 − u_t) ],  where u_t = P_θ(y_t = 1 | I, q, s_{≤t}).

Training uses AdamW with learning rate 2 × 10^−5, batch size 8, 3 epochs, and maximum sequence length 8,192 tokens. Reliability gating is applied only at inference time as a calibration layer; the verifier is trained on raw step labels without gating.

C.4 Reliability Gating Hyperparameters

The gating factor α(r) = σ(β(r − τ)) (Eq. (9) of the main paper) is controlled by two hyperparameters.

• τ = 0.5: reliability threshold below which rewards are attenuated. A claim set where every claim is half-supported yields r ≈ 0.5, which maps to α ≈ 0.5 under our sigmoid.
• β = 10: sigmoid sharpness. At β = 10 the transition from near-zero attenuation (r > 0.7) to near-full attenuation (r < 0.3) spans roughly 0.4 units of r, providing a smooth but decisive gate.

Sensitivity analysis. Table 6 reports VisualProcessBench overall Macro-F1 under five choices of τ (with β = 10 fixed). Performance is relatively stable for τ ∈ [0.4, 0.6], confirming that the method is not strongly sensitive to this threshold.

Table 6 | VisualProcessBench overall Macro-F1 (%) under varying reliability threshold τ (β = 10 fixed).

| τ | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
| Macro-F1 | 66.91 | 67.23 | 67.46 | 67.18 | 66.74 |

D. Complete Prompt Templates

We provide the verbatim prompts used in each pipeline stage. Placeholders are shown in curly braces ({...}).
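Because these templates mix placeholder slots with literal JSON braces in their instructions, Python's str.format would misparse them; a simple name-based substitution, shown here as a hypothetical helper (not part of any released code), avoids that:

```python
def fill_template(template: str, **slots: str) -> str:
    """Substitute {name} placeholders by exact string replacement,
    leaving any literal JSON braces in the template untouched."""
    out = template
    for name, value in slots.items():
        out = out.replace("{" + name + "}", value)
    return out

# Abbreviated stand-in for the Stage 1 template (illustrative only).
stage1 = "Question text: {question_text}\nImage: [image token]"
```

For example, `fill_template(stage1, question_text=...)` fills only the named slot and leaves everything else, including any `{...}` JSON examples in the instructions, intact.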
D.1 Stage 1 — Structured Image Description Prompt

Used by both the constraint extractor $E_\phi$ (generating $\mathcal{C}$) and in the single-image Step 1 of the EVPV-PRM pipeline to produce a natural-language golden description of the image.

Stage 1: Structured image description (SYSTEM + USER)
[No system prompt — user turn only]

You are a top-tier image analyst and mathematics education expert. Your task is to create a clear, accurate, and solution-critical description of the image provided alongside a math question. Focus on analyzing the image and generating a structured natural-language description as if explaining the diagram to a student.

Your description must cover the following points in one coherent paragraph:
1. [What is it?] One sentence summarizing the image type and topic (e.g., "This is a geometric figure showing a combined cone and cylinder.").
2. [Key elements and data?] Identify the main mathematical objects (points, lines, shapes, graphs) and list all directly visible numbers, labels, and symbols.
3. [Important relationships?] Describe spatial layouts and geometric relationships that are critical for solving the problem.

Output format: a strict JSON code block. The root object must contain a single key image_description whose value is the complete description string. Do NOT add any explanatory text outside the JSON block.
--
Question text: {question_text}
Image: [image token]

D.2 Stage 2 — Visual Checklist Evaluation Prompt

Used to score the policy's visual-dependency checklist against the golden description, producing a p_score ∈ [0, 1].

Stage 2: Fair visual checklist audit (USER turn)

### Fair Visual Checklist Audit: Penalize Direct Contradictions

[Role and Task]
You are a fair, objective AI auditor.
Your sole task is to compare two description lists about the same image: a "Ground-Truth Checklist" (provably correct) and a "Candidate Checklist" (model-generated visual understanding). Evaluate the factual accuracy of the Candidate Checklist.

[Core Evaluation Principle]
"Incomplete" is NOT "incorrect." Only a direct contradiction counts as an error. Strictly ignore all omissions and missing details.

[Inputs]
1. [Ground-Truth Checklist]: Verified-correct factual statements about the image.
2. [Candidate Checklist]: Model-generated visual fact statements to be audited.

[Output Format]
Output a strict JSON code block with three keys:
• errors_and_hallucinations: array of objects, each representing a direct contradiction. Each object has fields: faulty_statement, correction_or_reason, severity ("High" | "Low"). Empty array [] if no contradictions found.
• omissions: string array. This field does NOT affect scoring and should almost always be [].
• p_score: float in [0.0, 1.0] representing the final reliability score.

[Scoring Rules]
• Start at 1.0.
• Each High severity contradiction: deduct 0.5.
• Each Low severity contradiction: deduct 0.2.
• Minimum score: 0.0.

[High vs Low Severity]
• High: contradicts a primary visual fact (object count, key spatial relationship, presence/absence of a main object).
• Low: contradicts a secondary detail (background color, non-critical attribute).
--
[Ground-Truth Checklist]: {golden_standard_text}
[Candidate Checklist]: {checklist_to_review_text}

D.3 Stage 3 — Step Reward Judgment Prompt

Used by the step verifier $V_\theta$ to judge each reasoning step.

Stage 3: Step reward judgment (USER turn + image)

You are a professional expert in mathematical reasoning. You will judge whether the CURRENT solution step is correct given the image and the problem. You MUST output ONLY a JSON integer: 1 or -1.
- 1 means the step is correct.
- -1 means the step is incorrect (contradicts the image/question/previous steps, or is invalid reasoning).

Important:
• Use the image + question + previous steps as context.
• Judge ONLY the CURRENT step relative to the full solution so far.
• Do NOT output any other text, keys, markdown, or explanation.

Problem: {question_text}
Image description: {image_description_text}
Previous steps: {history_steps_text}
CURRENT step to evaluate: {current_step_text}
Problem image: [image token]

D.4 Policy Inference Prompt

Used to elicit structured, step-by-step solutions with per-step visual-dependency annotations from the InternVL2.5 policy. A unique nonce and variant_id are injected per candidate to promote diversity across the $N = 8$ samples.

Policy inference prompt (USER turn + image)

You are a meticulous and precise AI assistant, an expert in visual mathematical reasoning. Your primary goal is to solve the user's query by providing a detailed, step-by-step thought process. You MUST provide your entire response in a single, valid JSON code block. Do not include any text, explanations, or markdown formatting outside of the JSON object.
--
### DIVERSITY REQUIREMENTS (VERY IMPORTANT)
• This is reasoning variant #{variant_id}. Your reasoning path should be meaningfully different from other variants.
• Try a different logical decomposition, use different intermediate variables, or vary the order of non-dependent steps.
• Use this nonce strictly as a randomness anchor for this specific generation: {nonce}
--
### JSON OUTPUT SPECIFICATION (CRITICAL)
Your entire output must conform to this JSON schema:

{
  "reasoningprocess": [
    {
      "steptext": "A single, clear step of reasoning...",
      "visualdependency": "A specific, observable fact from the image, or null."
    }
  ],
  "finalanswer": "The final answer."
}

Field-Specific Rules:
1. reasoningprocess (List of Objects):
   • steptext: Each step should represent a single calculation, observation, or deduction.
   • visualdependency (String or null): Include a description if the step directly reads a value/label from the image. Use null ONLY for purely abstract steps. CRITICAL: use the JSON literal null, NEVER an empty string "".
2. finalanswer (String): For multiple-choice, output the option letter only (e.g., "A"); for open-ended, output the numerical result only.
--
### USER QUERY
{user_query}

D.5 Step Error Attribution in VisualProcessBench

Step-level error-type attribution. VisualProcessBench already provides step-level correctness labels (+1 = correct, −1 = incorrect) for each solution trace. To understand why incorrect steps fail, and to support the error-distribution statistics reported in the main paper (e.g., the pie charts), we performed error-type classification on all steps marked incorrect (−1).

The taxonomy is two-level. Top-level categories: Visual Misinterpretation (misreading or misusing the image), Logical Error (invalid deduction or reasoning chain), Calculation Error (arithmetic or algebraic mistake), Knowledge Error (wrong formula or domain fact), and Incompleteness (the step is underspecified or missing a key detail). Visual Misinterpretation is further split into sub-types: Structural Misunderstanding (wrong spatial or geometric structure), Value Misreading (wrong number or measure from the figure), and Object Misidentification (wrong object, label, or correspondence).
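A small validator for the classifier output implied by this taxonomy can be written in a few lines. This is our own sketch; only the field names top_level and visual_subtype come from the prompt's output schema (given in the next subsection).

```python
# Validate one error-type label against the two-level taxonomy of D.5:
# a visual_subtype is required iff top_level is "Visual Misinterpretation".

TOP_LEVEL = {
    "Visual Misinterpretation", "Logical Error", "Calculation Error",
    "Knowledge Error", "Incompleteness",
}
VISUAL_SUBTYPES = {
    "Structural Misunderstanding", "Value Misreading", "Object Misidentification",
}

def validate_label(obj):
    if obj.get("top_level") not in TOP_LEVEL:
        return False
    subtype = obj.get("visual_subtype")
    if obj["top_level"] == "Visual Misinterpretation":
        return subtype in VISUAL_SUBTYPES  # sub-type required here
    return subtype is None                 # must be null otherwise
```

Responses failing this check can be re-queried before human review.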
We used a dedicated prompt (see below) with Gemini-2.5-Pro to assign, for each incorrect step, one top-level category and, when the model chose Visual Misinterpretation, one sub-type. The model was given the problem text, the image, the full solution, and the index of the incorrect step. Human annotators then reviewed a subset of these model-predicted error-type labels, correcting misclassifications (e.g., a step labeled as Calculation Error but actually due to Value Misreading). Disagreements were resolved by discussion or by a third annotator. The human-corrected subset was used to evaluate agreement and to refine the remaining labels where needed.

The statistics reported in the main paper (e.g., 74% Visual Misinterpretation, 19% Logical Error, 3% Calculation Error, 3% Knowledge Error, 1% Incompleteness; and, within Visual Misinterpretation, 56% Structural Misunderstanding, 29% Value Misreading, 15% Object Misidentification) are computed from this final, human-verified error-type distribution over all incorrect steps in VisualProcessBench.

The prompt used for the Gemini-2.5-Pro error-type classification pass is given below. The model outputs a JSON object with the chosen top-level category and, when applicable, the visual sub-type.

Error-type classification prompt (Gemini-2.5-Pro)

Task. You are an expert in mathematical reasoning and multimodal evaluation. You will be given a math problem, an image, a step-by-step solution, and the index of one step that is already known to be incorrect. Your job is to classify the type of error that best explains why this step is wrong.

Top-level error types (choose exactly one):
• Visual Misinterpretation — The step is wrong because it misreads or misuses information from the image (wrong shape, number, label, relation, or structure).
• Logical Error — The step is wrong due to invalid deduction, wrong implication, or a broken reasoning chain (not primarily a visual or calculation mistake).
• Calculation Error — The step applies correct reasoning but contains an arithmetic or algebraic mistake.
• Knowledge Error — The step uses a wrong formula, definition, or domain fact.
• Incompleteness — The step is underspecified, skips necessary detail, or does not fully justify the conclusion.

If you choose Visual Misinterpretation, also choose exactly one sub-type:
• Structural Misunderstanding — Wrong spatial, geometric, or compositional structure (e.g., misidentified layout or parts).
• Value Misreading — Wrong numeric value or measure read from the figure (e.g., misread length, angle, or count).
• Object Misidentification — Wrong object, label, or correspondence (e.g., confused two elements or misidentified what a symbol refers to).

Output format. Reply with a single JSON object in a code block:

{
  "top_level": "Visual Misinterpretation" | "Logical Error" | "Calculation Error" | "Knowledge Error" | "Incompleteness",
  "visual_subtype": "Structural Misunderstanding" | "Value Misreading" | "Object Misidentification" | null
}

Set visual_subtype to null if top_level is not "Visual Misinterpretation".

Input.
Problem: {question_text}
Image: [image]
Solution steps:
Step 1: ...
Step 2: ...
...
The following step is INCORRECT (index {step_index}): ...
Classify the error type for this step.

E. Alternative Score Aggregation Strategies

The main paper (Table 2) uses the Geometric Mean as the trajectory aggregation function for Best-of-$N$ reranking. Here we report results for all five aggregation strategies implemented in the evaluation pipeline.

1. Geometric Mean: maps $R_t \in \{1, -1\}$ to $\{1.0, 0.1\}$ and takes the geometric mean, making it sensitive to any single incorrect step.
2. Correctness Rate: $\text{Score}(S) = \frac{1}{T} \sum_t \mathbb{1}[R_t > 0]$.
3. Streak Score: rewards consecutive correct-step runs; the score is incremented by the current streak length on each correct step and decremented by 1 on each incorrect step, then normalized.
4. Weighted Correctness: later steps receive linearly higher weight. Let $w_t = t$; then $\text{Score}(S) = \frac{\sum_t w_t R_t - W_{\min}}{W_{\max} - W_{\min}}$, where $W_{\max}$ and $W_{\min}$ are the maximum and minimum achievable weighted sums.
5. First-Error Position: $\text{Score}(S) = i^\ast / T$, where $i^\ast$ is the index of the first step with $R_t = -1$; equals 1.0 if no error occurs.

Tables 7–9 report Pass@1 and BoN@8 accuracy (%) for each strategy across three InternVL2.5 policy scales. $\Delta_8$ = BoN@8 − Pass@1.

Table 7 | Best-of-8 reranking under five aggregation strategies, InternVL2.5-8B policy. Pass@1 is the same across strategies; BoN@8 and $\Delta_8$ vary.

                      MathVista         MathVision        MathVerse-VO      WeMath            LogicVista        Overall
Strategy              P@1  B@8    Δ8    P@1  B@8   Δ8     P@1  B@8   Δ8     P@1  B@8    Δ8    P@1  B@8   Δ8     P@1  B@8   Δ8
Geometric Mean        64.5 76.3 +11.8   17.0 22.1  +5.1   22.8 29.5  +6.7   23.5 37.5 +14.0   36.4 45.3  +8.9   32.8 41.7  +8.9
Correctness Rate      64.5 75.1 +10.6   17.0 21.4  +4.4   22.8 28.9  +6.1   23.5 36.8 +13.3   36.4 44.6  +8.2   32.8 41.0  +8.2
Streak Score          64.5 74.8 +10.3   17.0 21.9  +4.9   22.8 28.6  +5.8   23.5 36.5 +13.0   36.4 44.3  +7.9   32.8 40.7  +7.9
Weighted Correctness  64.5 73.2  +8.7   17.0 20.5  +3.5   22.8 27.4  +4.6   23.5 35.1 +11.6   36.4 43.1  +6.7   32.8 39.5  +6.7
First-Error Position  64.5 75.7 +11.2   17.0 22.0  +5.0   22.8 29.0  +6.2   23.5 37.1 +13.6   36.4 44.9  +8.5   32.8 41.3  +8.5

Table 8 | Best-of-8 reranking under five aggregation strategies, InternVL2.5-26B policy.
                      MathVista         MathVision        MathVerse-VO      WeMath            LogicVista        Overall
Strategy              P@1  B@8    Δ8    P@1  B@8   Δ8     P@1  B@8   Δ8     P@1  B@8    Δ8    P@1  B@8   Δ8     P@1  B@8   Δ8
Geometric Mean        68.2 79.6 +11.4   23.4 28.1  +4.7   24.0 32.5  +8.5   30.9 42.1 +11.2   39.6 51.7 +12.1   37.2 46.8  +9.6
Correctness Rate      68.2 78.4 +10.2   23.4 27.5  +4.1   24.0 31.8  +7.8   30.9 41.3 +10.4   39.6 50.9 +11.3   37.2 45.8  +8.6
Streak Score          68.2 78.0  +9.8   23.4 27.2  +3.8   24.0 31.4  +7.4   30.9 41.0 +10.1   39.6 50.5 +10.9   37.2 45.4  +8.2
Weighted Correctness  68.2 76.5  +8.3   23.4 26.0  +2.6   24.0 30.1  +6.1   30.9 39.6  +8.7   39.6 49.1  +9.5   37.2 44.0  +6.8
First-Error Position  68.2 79.0 +10.8   23.4 27.9  +4.5   24.0 32.2  +8.2   30.9 41.7 +10.8   39.6 51.2 +11.6   37.2 46.3  +9.1

Table 9 | Best-of-8 reranking under five aggregation strategies, InternVL2.5-38B policy.

                      MathVista         MathVision        MathVerse-VO      WeMath            LogicVista        Overall
Strategy              P@1  B@8    Δ8    P@1  B@8   Δ8     P@1  B@8   Δ8     P@1  B@8    Δ8    P@1  B@8   Δ8     P@1  B@8   Δ8
Geometric Mean        71.9 83.5 +11.6   32.2 37.6  +5.4   36.9 47.7 +10.8   38.3 50.0 +11.7   47.9 58.7 +10.8   45.4 55.2  +9.8
Correctness Rate      71.9 82.3 +10.4   32.2 36.8  +4.6   36.9 46.8  +9.9   38.3 49.1 +10.8   47.9 57.8  +9.9   45.4 54.3  +8.9
Streak Score          71.9 81.9 +10.0   32.2 36.4  +4.2   36.9 46.5  +9.6   38.3 48.8 +10.5   47.9 57.5  +9.6   45.4 54.0  +8.6
Weighted Correctness  71.9 80.4  +8.5   32.2 35.0  +2.8   36.9 45.1  +8.2   38.3 47.4  +9.1   47.9 56.1  +8.2   45.4 52.5  +7.1
First-Error Position  71.9 83.0 +11.1   32.2 37.3  +5.1   36.9 47.3 +10.4   38.3 49.6 +11.3   47.9 58.3 +10.4   45.4 54.7  +9.3

Geometric Mean achieves the best or near-best BoN@8 across all scales and benchmarks, while being the simplest to compute. Weighted Correctness is consistently the most conservative: it penalizes any single incorrect step heavily, which sometimes over-rejects good candidates with one minor error. Correctness Rate and First-Error Position closely track Geometric Mean, confirming that the reranking improvement is robust to the choice of aggregation function.

F. Complete Ablation Results

Table 10 extends Table 4 of the main paper to include all 27 ablation configurations executed in Exp4. Configurations are organized by the component being varied; the Full Method row (EVPV + reliability gating) is repeated at the top for reference. All scores are VisualProcessBench Macro-F1 (%); Δ is relative to the full method.

Several additional observations emerge from the full table. First, history length shows a consistent monotonic trend: longer history is better, but the marginal gain diminishes quickly beyond 4 steps, suggesting a memory-saturation effect. Second, vision sampling temperature has negligible impact on final accuracy (|Δ| < 0.5), indicating that constraint extraction is robust to moderate temperature variation. Third, the parse-failure policy matters modestly (|Δ| ≤ 1.78): defaulting to −1 (conservative) slightly outperforms defaulting to +1 or random, consistent with VisualProcessBench's skew toward incorrect steps at harder positions.

Table 10 | Complete ablation results on VisualProcessBench (Macro-F1, %). Δ = variant − Full Method. Best per group bolded.
Group           Variant                                   DynaMath  MMMU   MathVerse  MathVision  WeMath  Overall      Δ
Full Method     Full (EVPV + gating)                      69.57     68.86  67.09      65.27       69.11   67.46    +0.00
Evidence type   w/o structured facts (caption-only)       67.75     58.09  63.48      60.68       67.10   63.38    −4.08
                w/o constraints (facts = ∅)               66.66     55.80  62.61      59.13       65.81   62.11    −5.35
                w/ shuffled facts (structure corrupted)   62.86     52.57  59.81      58.52       64.77   59.82    −7.64
                w/ noise caption only                     64.41     56.22  61.05      59.80       65.33   61.18    −6.28
                Short vision prompt                       68.02     66.14  65.73      63.91       67.44   66.05    −1.41
                w/ drop-facts corruption                  34.90     34.40  36.29      36.14       35.96   35.77   −31.69
Modality        w/o vision (text-only judge, keep JSON)   58.44     49.44  53.59      54.07       61.02   54.93   −12.53
                w/o vision & w/o JSON (text-only)         54.49     43.93  42.78      50.84       53.78   48.23   −19.23
                w/o vision JSON (keep image)              65.83     62.19  63.72      62.44       66.07   64.14    −3.32
Judge prompt    Lenient judge prefix                      66.91     65.28  64.02      62.75       67.09   65.13    −2.33
                No-vision judge prefix                    57.22     48.71  52.84      53.30       60.14   54.21   −13.25
                Judge temperature 0.2                     68.44     67.50  66.11      64.38       68.22   66.58    −0.88
                Judge temperature 0.5                     67.83     66.97  65.44      63.76       67.81   66.02    −1.44
History length  History: none                             65.74     63.21  62.80      61.45       65.53   63.49    −3.97
                History: last 1 step                      66.88     65.42  64.55      63.02       66.91   65.22    −2.24
                History: last 2 steps                     67.51     66.09  65.18      63.74       67.60   65.90    −1.56
                History: last 4 steps                     68.31     67.44  65.93      64.56       68.40   66.73    −0.73
                History: last 8 steps                     68.94     68.21  66.58      64.97       68.82   67.14    −0.32
Vision temp.    Vision temperature 0.0                    68.75     67.91  66.43      64.81       68.51   67.01    −0.45
                Vision temperature 0.5                    69.02     68.27  66.76      65.01       68.79   67.18    −0.28
                Vision top-p 0.7                          68.83     68.44  66.91      65.10       68.93   67.25    −0.21
Parse-failure   Parse fail → +1                           67.44     66.31  65.02      63.19       67.25   65.68    −1.78
                Parse fail → random                       67.89     66.74  65.47      63.67       67.72   66.12    −1.34
                Parse fail → −1 (default)                 69.57     68.86  67.09      65.27       69.11   67.46    +0.00
Compound        No vision JSON + text-only judge          53.11     42.87  41.64      49.72       52.45   47.07   −20.39
                Caption-only + no image in judge          56.72     47.39  49.81      52.14       57.03   52.49   −14.97
                Shuffled facts + lenient judge            61.45     50.88  57.93      56.71       62.24   57.94    −9.52

G. Qualitative Case Studies

We present three cases from VisualProcessBench. In each, process_correctness denotes the ground-truth step-level labels (+1 = correct, −1 = incorrect). We show that EVPV-PRM's step-wise judgments align with these labels by verifying the policy's visual claims against the extracted constraints $\mathcal{C}$.

G.1 DynaMath: Misread kink position

Case G.1: Graph — continuous but not differentiable

Question (DynaMath): Determine for which values of $x = a$ the function is continuous but not differentiable at $x = a$. Gold answer: 1.

Extracted constraints $\mathcal{C}$ (by $E_\phi$):
numeric: {entity:"piecewise graph", attribute:"kink position", value:1, unit:"x"}
structure: {type:"graph", parts:["left branch","right branch"], attachment:["sharp corner at x = 1"]}
relation: {type:"continuous_at", entities:["function","x=1"], confidence:0.95}

Process-level verification. The policy claims a sharp corner at $x = -2$ (from step 3 onward); $\mathcal{C}$ gives the kink at $x = 1$. Steps 3–6 thus contain an unsupported visual premise. Matching yields low $p_j$ for those steps; reliability $r$ is attenuated and the step rewards are gated down.
Step  (abbreviated)                                        process_correctness  EVPV
1     Setup: find where continuous but not differentiable  +1                   +1
2     Definitions (continuous / differentiable)            +1                   +1
3     "Sharp corner at x = −2" (visual claim)              −1                   −1
4     "Therefore x = −2" (conclusion)                      −1                   −1
5     "The answer is x = −2"                               −1                   −1
6     Verification of x = −2                               −1                   −1

EVPV-PRM's step-wise output matches the ground-truth process_correctness: correct steps 1–2 are preserved; incorrect steps 3–6 are down-weighted because the visual premise contradicts $\mathcal{C}$.

G.2 MathVision: Unsupported geometric inference

Case G.2: Quadrilateral angle (MathVision)

Question: In quadrilateral ABCD, $AD = BC$, $\angle DAC = 50°$, $\angle DCA = 65°$, $\angle ACB = 70°$. How big is $\angle ABC$? Gold answer: B ($55°$).

Extracted constraints $\mathcal{C}$ (by $E_\phi$):
relation: {type:"equal", entities:["AD","BC"], confidence:0.96}
numeric: {entity:"angle DAC", value:50, unit:"deg"}, ...
structure: {type:"quadrilateral", parts:["A","B","C","D"]}

Process-level verification. Step 1 only restates the problem and figure; its checklist items match $\mathcal{C}$. Step 2 claims "triangle ABC is isosceles with $AB = AC$" from $AD = BC$; this claim is not supported by $\mathcal{C}$ (the equality is between $AD$ and $BC$, not $AB$ and $AC$). Steps 2–5 are therefore given low reliability and attenuated.

Step  (abbreviated)                                        process_correctness  EVPV
1     Task and given data ($AD = BC$, angles)              +1                   +1
2     △ACD: $\angle CAD = 65°$                             −1                   −1
3     "$AD = BC \Rightarrow AB = AC$", $\angle ABC = 70°$  −1                   −1
4     Verify angles at $C$                                 −1                   −1
5     Final answer D ($70°$)                               −1                   −1

Our method's step labels match the ground truth: step 1 is correct and supported by $\mathcal{C}$; steps 2–5 are incorrect and correctly flagged because the key geometric premise is unsupported.
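The low $p_j$ values in these case studies come from the Stage-2 audit (Appendix D.2), whose deduction rules reduce to a few lines. This sketch follows the prompt's scoring rules; the function name and input shape are our own illustration.

```python
# Stage-2 audit score (Appendix D.2): start at 1.0, deduct 0.5 per
# High-severity contradiction and 0.2 per Low-severity one, floor at 0.0.

def p_score(errors_and_hallucinations):
    score = 1.0
    for err in errors_and_hallucinations:
        score -= 0.5 if err["severity"] == "High" else 0.2
    return max(score, 0.0)
```

A hallucinated kink position (a primary visual fact, hence High severity) alone already halves the reliability score.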
G.3 WeMath: Mixed correct/incorrect steps, correct final answer

Case G.3: Paper folding (WeMath)

Question: When the paper is folded with $\angle 1 = \angle 2 = \angle 3$, then $\angle 1$ equals ( ). A. 90° B. 45° C. 60° D. 30° E. No correct answer. Gold answer: C ($60°$).

Process-level verification. The policy infers $60°$ via "angles form a triangle" and "equilateral" (steps 2–3); the figure does not support that the three angles are interior angles of one triangle. Steps 2–3 are incorrect; steps 4–6 (algebra and final answer) are correct. EVPV assigns low $p_j$ to the unsupported structural claims in steps 2–3 and preserves the reward for steps 4–6.

Step  (abbreviated)                                        process_correctness  EVPV
1     Key info: $\angle 1 = \angle 2 = \angle 3$           +1                   +1
2     "Angles form a triangle; sum 180°"                   −1                   −1
3     "Equilateral; each 180/3"                            −1                   −1
4     "Each angle 60"                                      +1                   +1
5     "Thus $\angle 1 = 60$"                               +1                   +1
6     Final answer C                                       +1                   +1

EVPV-PRM's step-wise judgment matches process_correctness exactly: incorrect intermediate reasoning (steps 2–3) is down-weighted; correct conclusion steps (4–6) are preserved, illustrating process-level rather than outcome-only evaluation.
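Finally, for reference, the five trajectory aggregation strategies of Appendix E can be sketched as below. These are minimal illustrations: the streak normalization and the 0-based first-error index are our own readings of the descriptions, not the authors' exact code.

```python
# Sketches of the Appendix E aggregation strategies over step rewards
# R_t in {+1, -1}.

def geometric_mean(rewards):
    # Map +1 -> 1.0 and -1 -> 0.1, then take the geometric mean.
    vals = [1.0 if r > 0 else 0.1 for r in rewards]
    prod = 1.0
    for v in vals:
        prod *= v
    return prod ** (1.0 / len(vals))

def correctness_rate(rewards):
    return sum(1 for r in rewards if r > 0) / len(rewards)

def streak_score(rewards):
    # Assumed normalization: divide by the all-correct streak sum T(T+1)/2.
    score, streak = 0, 0
    for r in rewards:
        if r > 0:
            streak += 1
            score += streak
        else:
            streak = 0
            score -= 1
    return score / (len(rewards) * (len(rewards) + 1) / 2)

def weighted_correctness(rewards):
    weights = range(1, len(rewards) + 1)   # w_t = t
    total = sum(w * r for w, r in zip(weights, rewards))
    w_max = sum(weights)                   # all steps correct; W_min = -w_max
    return (total + w_max) / (2 * w_max)   # (S - W_min) / (W_max - W_min)

def first_error_position(rewards):
    # 0-based index of the first error (one plausible reading of i*/T).
    for i, r in enumerate(rewards):
        if r < 0:
            return i / len(rewards)
    return 1.0
```

For a trace like case G.3 with one early error block, the strategies disagree most on how harshly that block is punished, which is exactly the spread visible in Tables 7–9.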