Can Large Reasoning Models Improve Accuracy on Mathematical Tasks Using Flawed Thinking?

Reading time: 6 minutes
...

📝 Abstract

Chain-of-thought (CoT) prompting has become central to mathematical reasoning in large language models, yet models remain brittle to early errors: a single arithmetic slip or unjustified inference typically propagates uncorrected to an incorrect final answer. We investigate whether training on intentionally flawed reasoning traces can teach models to detect and recover from such errors without degrading standard problem-solving ability. Using competition-level problems from MATH-lighteval, we generate CoT prefixes containing exactly one controlled error, either a calculation error (sign flips, dropped terms) or a reasoning error (misapplied rules, unjustified logical steps), and fine-tune Qwen3-4B with GRPO using a binary final-answer reward. Our Mixed-CoT-RL model matches standard RL on clean problems (41% vs. 41%) while substantially outperforming it on problems prefilled with flawed reasoning (24% vs. 19%). Notably, clean-only RL fine-tuning degrades robustness below the untuned baseline (19% vs. 20%), indicating that conventional training increases susceptibility to misleading prefills. Among error types, training on reasoning errors yields greater robustness gains than calculation errors alone, with mixed training performing best. These findings demonstrate that exposure to flawed traces during training can improve error-recovery behavior without sacrificing accuracy, suggesting a path toward more robust mathematical reasoning in LLMs.

📄 Content

Recently, large language models (LLMs) have achieved remarkably strong performance across a wide range of reasoning tasks. Performance is typically measured by quantitative metrics such as accuracy on a benchmark dataset, and typical reasoning tasks require logical thinking, complex analysis, and multi-step problem solving. The most common benchmarks are math- and coding-based tasks. Especially since ChatGPT was released in November 2022, LLMs have improved rapidly at solving difficult math and coding problems, and much of this progress can be attributed to post-training methods, which fine-tune a base model after pre-training by highlighting specific example responses it should emulate.

Well-known post-training methods include Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017), in which humans rate responses generated by the model, and Direct Preference Optimization (DPO) (Rafailov et al., 2023), which aligns model outputs with human preferences using a simpler objective function that eliminates the need for a separate reward model.

However, a major limitation of existing post-training methods is that they primarily optimize LLMs to produce a correct final answer rather than a correct intermediate chain-of-thought (CoT) used to derive it. As a result, even reasoning models can be brittle: if an early error is introduced, for example an arithmetic slip or an unjustified logical step, the model typically propagates the mistake rather than correcting it. This is especially problematic in mathematics, where a single local error can invalidate an entire derivation. Ensuring that LLMs can identify and recover from their own mistakes is vital for users who rely on step-by-step solutions for both correctness and understanding. Reliable step-by-step reasoning in LLMs also has important benefits beyond simply producing correct answers.
For mathematics researchers, LLMs can assist by verifying intermediate calculations, exploring alternative solution strategies, and creating derivations more quickly than manual methods. For learners, LLMs can function as interactive study tools: students can follow detailed problem-solving steps, receive immediate explanations for each step, and gain intuitive understanding of complex concepts. By improving the reasoning process, LLMs can improve mathematical comprehension and support more effective learning and research workflows.
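The contrast between RLHF's learned reward model and DPO's direct objective can be made concrete. Below is a minimal, illustrative sketch of the DPO loss for a single preference pair; the log-probability inputs and the β value are placeholders for exposition, not details from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy being trained and a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: the loss shrinks as the policy
    # ranks the chosen response further above the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the reward is implicit in the policy's own log-probabilities, no separate reward model has to be trained, which is the simplification the text refers to.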

In this paper, we introduce a simple classification of flawed reasoning types and study how different error families affect a model’s ability to recover from misleading CoTs. Specifically, for each math problem, we prefill the Chain-of-Thought with one of two error types (calculation errors or reasoning errors) and evaluate how the model generates a corrected reasoning sequence and final answer.

Calculation errors include sign flips, dropped terms, or incorrect simplifications. Reasoning errors include unjustified inferences, misapplied rules, or broken invariants. We then train models under mixtures of clean and flawed trajectories and evaluate how exposure to each error type influences robustness to adversarial or misleading prefills.
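As an illustration of how such flawed prefixes might be constructed, the sketch below injects exactly one controlled error into a clean CoT. The step strings and injection rules are simplified, hypothetical stand-ins, not the paper's actual generation pipeline.

```python
def inject_calculation_error(step: str) -> str:
    """Calculation-error family (illustrative): flip the first '+' to '-',
    a single sign-flip slip within an otherwise clean step."""
    return step.replace("+", "-", 1)

def build_flawed_prefix(clean_steps, error_type="calculation"):
    """Return a CoT prefix containing exactly one controlled error."""
    steps = list(clean_steps)
    if error_type == "calculation":
        # Corrupt the last step's arithmetic.
        steps[-1] = inject_calculation_error(steps[-1])
    else:
        # Reasoning-error family: append an unjustified inference.
        steps.append("Therefore the result must be even.")
    return "\n".join(steps)
```

At evaluation time, a prefix built this way would be prefilled into the model's context, and the model is scored on whether its continuation recovers the correct final answer.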

Our primary contributions are:

(1) We formulate a process-level training setup for mathematical reasoning that explicitly injects flawed CoTs and encourages the model to recover from them.

(2) We explore two major families of errors, calculation errors and reasoning errors, and measure how training on each type affects robustness.

(3) We outline a reinforcement learning pipeline based on a binary reward model and GRPO that can be run on modest hardware while still enabling fine-grained process-level supervision.
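To make the pipeline concrete, the sketch below shows the two ingredients named in contribution (3): a binary final-answer reward and the group-relative advantage normalization at the heart of GRPO. This is a schematic of the standard GRPO formulation, not the paper's exact implementation.

```python
def binary_reward(final_answer: str, gold: str) -> float:
    """Outcome-only reward: 1 if the final answer matches the gold
    answer exactly, else 0."""
    return 1.0 if final_answer.strip() == gold.strip() else 0.0

def grpo_advantages(rewards):
    """Group-relative advantages used by GRPO: normalize each sampled
    completion's reward by the mean and standard deviation of its group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0.0:
        # All completions scored alike: no relative learning signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Because advantages are computed within each group of sampled completions, GRPO needs no learned value network, which is what keeps the pipeline runnable on modest hardware.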

Post-training methods such as RLHF (Christiano et al., 2017) and DPO (Rafailov et al., 2023) have significantly improved LLM performance across reasoning tasks. These approaches optimize models to produce preferred outputs but typically supervise only the final answer, leaving intermediate reasoning steps unsupervised. As a result, LLMs can propagate early mistakes in multi-step problems, particularly in mathematics, where local errors can invalidate entire derivations (Uesato et al., 2022). Recent efforts, including Outcome Reward Models (ORMs) and Process Reward Models (PRMs), attempt to score reasoning steps sequentially, showing that step-level supervision improves faithfulness and interpretability (Lightman et al., 2024; Wang et al., 2025). However, these methods generally rely on clean reasoning traces and do not explore how exposing models to flawed or misleading steps affects learning.
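The difference between outcome-level and process-level supervision can be summarized schematically. The verifier callbacks below are hypothetical placeholders standing in for whatever checker an ORM or PRM learns.

```python
def orm_score(trace, answer_is_correct):
    """Outcome Reward Model (schematic): one scalar for the whole trace,
    judged only by whether the final step's answer checks out."""
    return 1.0 if answer_is_correct(trace[-1]) else 0.0

def prm_scores(trace, step_is_valid):
    """Process Reward Model (schematic): one score per step, so a flawed
    intermediate step is visible even when the final answer is right."""
    return [1.0 if step_is_valid(step) else 0.0 for step in trace]
```

A trace with a wrong intermediate step but a lucky final answer would earn full credit from the ORM yet be penalized step-by-step by the PRM, which is exactly the supervision gap the cited work targets.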

To counter this, Peng et al. (Peng et al., 2025) introduce the RECAP framework, which strengthens models’ ability to withstand missteps in reasoning. RECAP trains models to recover from flawed reasoning prefixes: short, intentionally incorrect CoTs that precede correct demonstrations. In safety-

This content is AI-processed based on ArXiv data.
