📝 Original Info
- Title: Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
- ArXiv ID: 2512.16917
- Date: 2025-12-18
- Authors: Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille
📝 Abstract
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
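To make the review schedule and reward coupling described above concrete, here is a minimal, hypothetical Python sketch of the two ingredients: grouping a reasoning chain into slices of comparable length at step boundaries, and mixing the sparse exact-match reward with dense per-slice verdicts. The function names, the `target_len` budget, and the mixing weight `alpha` are illustrative assumptions, not the paper's implementation (in particular, the paper requires slices to be logically complete, which a character budget alone does not guarantee).

```python
from typing import List


def partition_into_slices(chain: str, target_len: int = 400) -> List[str]:
    """Greedily group paragraph-level steps into slices of roughly comparable
    character length, closing a slice once it would exceed the length budget."""
    steps = [s.strip() for s in chain.split("\n\n") if s.strip()]
    slices, current = [], ""
    for step in steps:
        candidate = (current + "\n\n" + step) if current else step
        if current and len(candidate) > target_len:
            slices.append(current)  # close the current slice at a step boundary
            current = step
        else:
            current = candidate
    if current:
        slices.append(current)
    return slices


def shaped_reward(answer_correct: bool, slice_verdicts: List[float],
                  alpha: float = 0.5) -> float:
    """Mix the sparse exact-match signal with the mean of dense per-slice
    soundness verdicts (each assumed to lie in [0, 1])."""
    sparse = 1.0 if answer_correct else 0.0
    dense = sum(slice_verdicts) / max(len(slice_verdicts), 1)
    return (1.0 - alpha) * sparse + alpha * dense
```

For example, `shaped_reward(True, [1.0, 0.5, 1.0])` evaluates to about 0.92 under the default `alpha = 0.5`: the correct final answer contributes 0.5 and the mean verdict of 0.83 contributes the remaining 0.42.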
📄 Full Content
GENERATIVE ADVERSARIAL REASONER: ENHANCING LLM REASONING WITH ADVERSARIAL REINFORCEMENT LEARNING
Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille
Johns Hopkins University
qliu45@jhu.edu
ABSTRACT
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
1 INTRODUCTION
Large language models (LLMs) have demonstrated remarkable mathematical reasoning abilities, often achieving expert-level performance across diverse benchmarks (Achiam et al., 2023; Dubey et al., 2024; Shao et al., 2024; DeepSeek-AI, 2025). However, despite extensive training on large-scale datasets with sophisticated paradigms, these models still suffer from errors in reasoning, such as incorrect calculations, flawed logic, superficially plausible but invalid arguments, and repetitive or incoherent reasoning steps. To tackle these challenges, researchers have explored approaches such as model debate and collaboration, in which models debate against each other (Du et al., 2023; Liang et al., 2023) or with themselves (Kuba et al., 2025; Liu et al., 2025a), and Process Reward Models (Lightman et al., 2023; Wang et al., 2023), which aim to identify and mitigate process errors throughout the reasoning process. These methods provide finer-grained supervision and contribute to more robust and reliable LLM performance.
Among existing approaches, Process Reward Models (PRMs) have shown strong results on complex reasoning tasks, largely because they leverage detailed step-level annotations. However, PRMs face challenges related to annotation cost and data quality (Lightman et al., 2023): fine-grained labels are expensive and prone to subjective error, and the resulting reward models are sometimes susceptible to over- or under-rewarding (Wen et al., 2024; Lv et al., 2025). Alternatively, prompt-based methods employ LLMs as critics for stepwise judgments at a lower cost (Zhang et al., 2024; Gao et al., 2024; Xia et al., 2025), but their judgments can be noisy, inconsistent, and less discriminative.
[Figure 1: grouped bar chart of Pass@1 on GSM8K, MATH500, AMC23, AIME24, AIME25, OlympiadBench, and LiveMathBench-Hard for DS-R1-Distill-Qwen-7B and DS-R1-Distill-Llama-8B, each with and without GAR; per-benchmark scores omitted here.]
Figure 1: Pass@1 accuracy on seven mathematical reasoning benchmarks. Our Generative Adversarial Reasoner (GAR) consistently improves over strong baselines across both DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B. GAR achieves gains of +22.9% on AIME24 and +19.5% on AIME25 for the Llama backbone, as well as +35.3% on LiveMathBench-Hard for Qwen. These results demonstrate the robustness and generality of GAR in enhancing reasoning performance across diverse mathematical tasks (Tab. 1).
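As a quick check against the numbers in the abstract, the percentage gains quoted in the caption appear to be relative improvements over the baseline Pass@1 rather than absolute point differences. For DeepSeek-R1-Distill-Llama-8B on AIME24,

\[
\frac{53.7 - 43.7}{43.7} \approx 0.229 \;\Rightarrow\; +22.9\%,
\]

while the absolute improvement is the +10.0 points reported in the abstract.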
To bridge this gap, we retain a stepwise critic (referred to as the discriminator) but enable it to co-evolve with the LLM reasoner through joint training, generating effective step-level signals with lower annotation costs and increased robustness to label noise and reward mis-specification. Concretely, we optimize the LLM reasoner and an LLM-based discriminator together: the discriminator judges the logical soundness of each intermediate reasoning step and explains its judgment, while the reasoner learns to produce steps the discriminator consistently endorses for
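The joint optimization sketched in this (truncated) paragraph can be pictured with the following hypothetical training-round skeleton. It is a sketch only: the `generate`, `review`, and `update` interfaces, the `gold_answer` field, and the agreement-based discriminator reward are assumptions layered on the paper's description (reasoner rewarded for endorsed, answer-correct chains; discriminator rewarded for correctly judging traces), and it reuses the `partition_into_slices` and `shaped_reward` helpers from the earlier sketch.

```python
def adversarial_round(reasoner, discriminator, problems, alpha=0.5):
    """One hypothetical round of co-training (sketch, not the paper's code)."""
    reasoner_batch, discriminator_batch = [], []
    for problem in problems:
        chain, answer = reasoner.generate(problem)      # on-policy rollout
        slices = partition_into_slices(chain)           # review schedule
        verdicts, critiques = discriminator.review(problem, slices)
        correct = (answer == problem.gold_answer)

        # Reasoner reward: dense per-slice endorsements plus the sparse
        # exact-match signal, combined as in shaped_reward above.
        reasoner_batch.append(
            (problem, chain, shaped_reward(correct, verdicts, alpha))
        )

        # Discriminator reward (one plausible proxy): endorse traces that end
        # in a correct answer, flag traces that do not.
        mean_verdict = sum(verdicts) / max(len(verdicts), 1)
        r_disc = mean_verdict if correct else 1.0 - mean_verdict
        discriminator_batch.append((problem, slices, critiques, r_disc))

    # Both policies are then updated with a standard policy-gradient step
    # (e.g. a PPO/GRPO-style objective) on their respective reward batches.
    reasoner.update(reasoner_batch)
    discriminator.update(discriminator_batch)
```

The discriminator reward above uses outcome correctness only as a stand-in signal; per the abstract, the actual method rewards the discriminator for correctly detecting errors or distinguishing traces in the reasoning process, with concise structured justifications.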