Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with Verifiable Rewards (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a scalable data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "question-proof-check" triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating an "LLM-as-a-RM-for-RM" approach and balanced token weighting to stabilize the RL process. Our experiments validate the model's scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability, and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.


💡 Research Summary

Proof‑RM tackles the fundamental challenge of automatically verifying full‑length mathematical proofs, a task where the usual “verification asymmetry” that powers Reinforcement Learning with Verifiable Rewards (RLVR) breaks down. The authors construct a large‑scale Question‑Proof‑Check (QPC) dataset with minimal human effort by combining three sources of diversity: (1) problem origins (OlympiadBench, USAMO, Putnam, and Chinese high‑school exams), (2) proof‑generation methods (rewriting, masked generation, step‑wise expansion), and (3) multiple large language models (GPT‑4, Claude, LLaMA, etc.). By prompting these models in varied ways they obtain proofs that differ in linguistic style, verbosity, logical flow, and error type, thereby covering a broad proof space.
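The three axes of diversity amount to a Cartesian product over problem sources, generation methods, and generator models. A minimal sketch of that grid, with placeholder names (the actual prompts, sampling settings, and exact model list are not specified at this granularity):

```python
from itertools import product

# Illustrative configuration grid; labels are placeholders, not the
# paper's exact identifiers.
SOURCES = ["OlympiadBench", "USAMO", "Putnam", "CN-high-school"]
METHODS = ["rewriting", "masked_generation", "stepwise_expansion"]
MODELS = ["model_a", "model_b", "model_c"]  # e.g. GPT-4, Claude, LLaMA

def generation_configs():
    """Enumerate every (source, method, model) combination used to
    generate candidate proofs."""
    return list(product(SOURCES, METHODS, MODELS))

# 4 sources x 3 methods x 3 models = 36 distinct generation settings,
# each producing proofs with different styles and error profiles.
configs = generation_configs()
```

Sweeping the full product rather than a handful of fixed settings is what gives the dataset its coverage of linguistic styles and error types.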

For labeling, each proof is judged five times by three independent LLMs; a label (True/False) is accepted only under unanimous agreement. To keep human labor low, the authors perform hierarchical, composition-level human checks on a sampled subset. If human judgments align strongly with the LLM consensus for a given data slice, that slice is retained as silver-standard; otherwise it is discarded. This strategy yields ~21k high-quality QPC triples in roughly 100 hours, cutting manual effort by over 90%.
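The two filters above can be sketched as follows. The unanimity rule and the slice-level agreement check come from the paper; the 0.95 retention threshold and function interfaces are assumptions for illustration:

```python
def consensus_label(judgments):
    """Accept a True/False label only under unanimous agreement.

    `judgments` is the pooled list of verdicts (5 runs x 3 judge LLMs
    = 15 booleans in the paper's setup). Returns the agreed label, or
    None when the judges disagree and the proof should be dropped.
    """
    return judgments[0] if len(set(judgments)) == 1 else None

def keep_slice(llm_labels, human_labels, threshold=0.95):
    """Hierarchical check: retain a data slice as silver-standard only if
    sampled human judgments agree with the LLM consensus at or above
    `threshold` (the exact cutoff is not stated in the summary)."""
    agree = sum(a == b for a, b in zip(llm_labels, human_labels))
    return agree / len(llm_labels) >= threshold
```

Because humans only audit a sample per slice rather than every triple, the cost scales with the number of slices, not the number of proofs.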

Training a proof‑reward model directly on binary T/F signals proved unstable: noisy reasoning steps could still lead to a correct verdict, rewarding poor generations and causing model collapse. The authors mitigate this with two innovations. First, an “LLM‑as‑a‑RM‑for‑RM” component supervises the fluency of the chain‑of‑thought, jointly optimizing proof‑checking accuracy and reasoning quality. Second, they introduce a balanced token‑weighting scheme that combines sequence‑level and token‑level weights to neutralize the harmful effects of large output‑length variance. With these fixes, RL training proceeds stably over 312 steps, processing more than 18 k samples and 112 M tokens.
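One way to realize the balanced weighting described above is to mix two normalizations of the per-token loss: a sequence-level one, where every sample contributes equally regardless of length, and a token-level one, where every token contributes equally. The mixing coefficient and exact formula here are assumptions; the summary only states that the two granularities are combined to neutralize length variance:

```python
def balanced_token_weights(lengths, alpha=0.5):
    """Per-token loss weights for a batch of sequences.

    seq_w = 1 / (N * L_i): each sequence gets equal total weight,
    tok_w = 1 / sum(L):    each token gets equal weight.
    `alpha` interpolates between the two; 0.5 is an illustrative choice.
    Returns one weight list per sequence; all weights sum to 1.0.
    """
    n = len(lengths)
    total = sum(lengths)
    weights = []
    for length in lengths:
        seq_w = 1.0 / (n * length)   # sequence-level share per token
        tok_w = 1.0 / total          # token-level share per token
        weights.append([alpha * seq_w + (1 - alpha) * tok_w] * length)
    return weights
```

Under pure token-level weighting, a pathologically long rollout can dominate the gradient; the sequence-level term caps any single sample's influence, which is what stabilizes training when output lengths vary widely.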

Empirical evaluation shows Proof‑RM outperforms prior baselines (simple LLM judges, IneqMath, DeepTheorem) by 8–12 percentage points in proof‑verification accuracy across multiple difficulty tiers and languages. Moreover, when used as a test‑time scorer, Proof‑RM improves top‑k selection performance, demonstrating its utility for math agents, data‑curation pipelines, and further RL‑based proof generation.
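Used as a test-time scorer, the RM simply ranks sampled candidates. A minimal best-of-k sketch, where `score_fn` stands in for a call to the trained Proof-RM (its interface here is assumed):

```python
def best_of_k(proofs, score_fn):
    """Test-time guidance: sample k candidate proofs from a generator,
    score each with the reward model, and keep the highest-scoring one.
    `score_fn` maps a proof string to a scalar reward."""
    return max(proofs, key=score_fn)
```

The same scoring hook supports the other downstream uses mentioned: filtering proofs in data-curation pipelines, or serving as the reward signal for further RL-based proof generation.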

In sum, the paper delivers a comprehensive, scalable pipeline for generating, labeling, and training a reward model that can reliably assess mathematical proofs. The combination of multi‑dimensional data diversity, hierarchical human validation, “LLM‑as‑RM” supervision, and balanced token weighting addresses both data scarcity and training instability, paving the way for more trustworthy and capable LLMs in advanced mathematical reasoning.
