PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv source.

While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms focus primarily on the consistency between the final result and the ground truth, often neglecting errors in the derivation process. This leads to positive rewards being assigned to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples selected through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 > 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.


💡 Research Summary

The paper introduces PRIME, a novel benchmark designed to evaluate verifiers on process‑outcome alignment rather than merely checking the final answer against a ground truth. Existing model‑based verifiers in Reinforcement Learning with Verifiable Rewards (RLVR) are predominantly outcome‑centric: they grant positive rewards whenever the final answer matches the expected result, regardless of whether the intermediate reasoning steps are correct. This leads to “lucky guesses,” where a model arrives at the right answer through flawed logic; such cases constitute roughly 17% of generated responses in the authors’ preliminary analysis.

To address this gap, the authors construct PRIME from a massive pool of over 7 million college‑level STEM problems (mathematics, physics, chemistry, engineering). They apply a two‑stage automated filtering pipeline: (1) a verifiability check using GPT‑OSS‑120B to discard ambiguous or open‑ended questions, and (2) a correctness validation where top‑tier models (Gemini‑3‑Pro, GPT‑5, Claude‑Sonnet‑4.5) cross‑validate the textbook answer. After filtering, the dataset is organized into 16 fine‑grained sub‑domains (e.g., topology, organic polymer, systems robotics) and balanced to avoid category skew.
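The two-stage pipeline above can be sketched as a simple filter. The sketch below is a hypothetical illustration, not the authors' code: the model calls are replaced by their pre-computed judgments, and the field names (`verifiability_judgment`, `model_answers`, `reference_answer`) are invented for this example.

```python
def is_verifiable(judgment: str) -> bool:
    # Stage 1: keep only problems the checker model (GPT-OSS-120B in the
    # paper) labeled as verifiable, discarding ambiguous/open-ended ones.
    return judgment == "verifiable"


def cross_validated(model_answers: list, reference) -> bool:
    # Stage 2: keep problems where every top-tier model independently
    # reproduces the textbook answer (answer comparison is simplified
    # to equality here; real answer matching is more involved).
    return all(a == reference for a in model_answers)


def filter_problems(problems: list[dict]) -> list[dict]:
    # Apply both stages in sequence; a problem survives only if it
    # passes the verifiability check AND the correctness validation.
    kept = []
    for p in problems:
        if not is_verifiable(p["verifiability_judgment"]):
            continue
        if not cross_validated(p["model_answers"], p["reference_answer"]):
            continue
        kept.append(p)
    return kept
```

Splitting verifiability from correctness keeps the cheap rejection (ill-posed questions) ahead of the expensive one (multi-model cross-validation), which matters at the scale of millions of candidate problems.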

For each retained problem, the authors generate a single reasoning trajectory from each of M distinct Large Reasoning Models (LRMs), encompassing both open‑source and commercial systems. To focus on discriminative samples, they employ a Verification Difficulty Filtering step: a proxy verifier (GPT‑OSS‑120B) evaluates each (question, answer, trajectory) pair eight times, producing a consensus score C over the eight runs.
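The consensus-based difficulty filtering can be sketched as follows. This is a hedged reconstruction: the paper specifies eight verifier runs per sample, but the retention thresholds used here (keeping samples on which the runs disagree substantially) are assumptions for illustration.

```python
def consensus_score(votes: list[int]) -> float:
    # votes: eight binary judgments from independent runs of the proxy
    # verifier (1 = the run accepts the trajectory, 0 = it rejects it).
    return sum(votes) / len(votes)


def is_discriminative(votes: list[int], low: float = 0.25, high: float = 0.75) -> bool:
    # Assumed thresholds: retain samples where the verifier runs disagree,
    # i.e. the consensus is neither a unanimous accept nor a unanimous
    # reject. Such samples are the hardest to verify and hence the most
    # informative for benchmarking verifiers.
    c = consensus_score(votes)
    return low <= c <= high
```

Samples with a unanimous consensus (C near 0 or 1) are easy for any verifier and carry little signal; keeping the contested middle band is what makes the benchmark discriminative.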

