ScRPO: From Errors to Insights
Reading time: 1 minute
📝 Original Info
- Title: ScRPO: From Errors to Insights
- ArXiv ID: 2511.06065
- Date: 2025-11-08
- Authors: Author information was not provided in the source data.
📝 Abstract
We introduce Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to empower large language models with advanced mathematical reasoning capabilities through iterative self-reflection and error correction. The ScRPO framework operates in two distinct phases: (1) Trial-and-error learning stage, where the model is trained via GRPO, and incorrect responses are collected to form an "error pool"; and (2) Self-correction learning stage, which guides the model to introspectively analyze and rectify the reasoning flaws behind its previous errors. Extensive evaluations across challenging mathematical benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, validate the efficacy of our approach. Using DeepSeek-R1-Distill-Qwen-1.5B and 7B as backbones, ScRPO achieves average accuracies of 64.8% and 77.8%, respectively. This represents a significant improvement of 6.0% and 3.2% over vanilla baselines, consistently outperforming strong post-training methods such as DAPO and GRPO. These findings establish ScRPO as a robust paradigm for enabling autonomous self-improvement in AI systems, particularly in tasks with limited external feedback.
💡 Deep Analysis
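The two-phase loop described in the abstract lends itself to a compact illustration. Below is a minimal Python sketch of that flow, assuming a binary verifier that scores candidate answers and a simplified group-relative update standing in for GRPO; all function names (`sample_responses`, `grpo_update`, `train_scrpo`), the error-pool representation, the group size, and the correction-prompt wording are hypothetical stand-ins, not the authors' implementation.

```python
import random

def sample_responses(model, prompt, n=8):
    """Stand-in for drawing a group of n candidate solutions from the policy."""
    return [model(prompt) for _ in range(n)]

def grpo_update(model, prompt, responses, rewards):
    """Stand-in for a GRPO-style step: advantages are rewards measured
    relative to the group mean (the 'relative' in Relative Policy Optimization)."""
    mean_reward = sum(rewards) / len(rewards)
    advantages = [r - mean_reward for r in rewards]
    # A real implementation would apply a policy-gradient update here,
    # weighting each response's log-probabilities by its advantage.
    return advantages

def train_scrpo(model, problems, verifier):
    error_pool = []  # incorrect responses harvested during phase 1

    # Phase 1: trial-and-error learning. Train with GRPO and collect
    # every response the verifier rejects into the error pool.
    for problem in problems:
        responses = sample_responses(model, problem)
        rewards = [1.0 if verifier(problem, r) else 0.0 for r in responses]
        grpo_update(model, problem, responses, rewards)
        error_pool += [(problem, r) for r, rw in zip(responses, rewards) if rw == 0.0]

    # Phase 2: self-correction learning. Re-prompt the model with its own
    # mistakes and reward it for diagnosing and fixing them.
    for problem, wrong_attempt in error_pool:
        prompt = (f"Problem: {problem}\n"
                  f"Previous incorrect attempt: {wrong_attempt}\n"
                  "Explain the flaw in the reasoning, then give a corrected solution.")
        corrections = sample_responses(model, prompt)
        rewards = [1.0 if verifier(problem, c) else 0.0 for c in corrections]
        grpo_update(model, prompt, corrections, rewards)

if __name__ == "__main__":
    # Toy demo: a "model" that guesses and a verifier that checks the answer.
    toy_model = lambda prompt: random.choice(["42", "41"])
    toy_verifier = lambda problem, answer: answer == "42"
    train_scrpo(toy_model, ["What is 6 * 7?"], toy_verifier)
```

The design point the abstract emphasizes is that phase 2 recycles phase 1's failures as fresh training prompts, so the model keeps improving even in tasks with limited external feedback.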
Reference
This content was AI-processed from open-access ArXiv data.