Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

Reading time: 5 minutes
...

📝 Original Info

  • Title: Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
  • ArXiv ID: 2512.16917
  • Date: 2025-12-18
  • Authors: Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille

📝 Abstract

Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
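The reward coupling described in the abstract can be made concrete with a small sketch. The snippet below shows one plausible way to blend the sparse exact-match signal with dense per-slice verdicts from the discriminator; the `Slice` type, the 0.5 weighting, and the proxy reference flags are illustrative assumptions, not the paper's actual objective.

```python
# Minimal sketch of the reward coupling described above, assuming the
# reasoning chain has already been partitioned into slices and judged.
# The Slice type, the 0.5 weighting, and the proxy reference flags are
# illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Slice:
    text: str       # one logically complete portion of the reasoning chain
    endorsed: bool  # discriminator's verdict for this slice

def reasoner_reward(slices: list[Slice], answer_correct: bool,
                    step_weight: float = 0.5) -> float:
    """Dense step-level signal supplementing the sparse exact-match reward."""
    outcome = 1.0 if answer_correct else 0.0
    if not slices:
        return outcome
    step_score = sum(s.endorsed for s in slices) / len(slices)
    return (1.0 - step_weight) * outcome + step_weight * step_score

def discriminator_reward(slices: list[Slice], reference_flags: list[bool]) -> float:
    """Reward the discriminator for verdicts that match reference flags,
    e.g. labels derived from contrasting correct and incorrect traces."""
    if not slices:
        return 0.0
    hits = sum(s.endorsed == flag for s, flag in zip(slices, reference_flags))
    return hits / len(slices)

# Example: three slices, one flagged as unsound, final answer correct.
trace = [Slice("expand the square", True),
         Slice("drop the cross term", False),
         Slice("solve for x", True)]
print(reasoner_reward(trace, answer_correct=True))        # ~0.83
print(discriminator_reward(trace, [True, False, True]))   # 1.0
```

Under this kind of blending, a correct answer reached through flagged steps earns less than a correct answer with a fully endorsed chain, which is the sense in which the step-level signal improves credit assignment.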

💡 Deep Analysis

📄 Full Content

Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille — Johns Hopkins University (qliu45@jhu.edu)

1 INTRODUCTION

Large language models (LLMs) have demonstrated remarkable mathematical reasoning abilities, often achieving expert-level performance across diverse benchmarks (Achiam et al., 2023; Dubey et al., 2024; Shao et al., 2024; DeepSeek-AI, 2025). However, despite extensive training on large-scale datasets with sophisticated paradigms, these models still suffer from errors in reasoning, such as incorrect calculations, flawed logic, superficially plausible but invalid arguments, and repetitive or incoherent reasoning steps. To tackle these challenges, researchers have explored approaches such as model debate and collaboration, in which models debate against each other (Du et al., 2023; Liang et al., 2023) or with themselves (Kuba et al., 2025; Liu et al., 2025a), and Process Reward Models (Lightman et al., 2023; Wang et al., 2023), which aim to identify and mitigate process errors throughout the reasoning process. These methods provide finer-grained supervision and contribute to more robust and reliable LLM performance.

Among existing approaches, Process Reward Models (PRMs) have shown strong results on complex reasoning tasks, largely because they leverage detailed step-level annotations. However, PRMs face challenges related to annotation cost and data quality (Lightman et al., 2023): fine-grained labels are expensive and prone to subjective error, and are sometimes susceptible to over- or under-reward issues (Wen et al., 2024; Lv et al., 2025). Alternatively, prompt-based methods employ LLMs as critics for stepwise judgments at a lower cost (Zhang et al., 2024; Gao et al., 2024; Xia et al., 2025). However, their judgments can be noisy, inconsistent, and less discriminative.
Figure 1: Pass@1 accuracy on seven mathematical reasoning benchmarks (GSM8K, MATH500, AMC23, AIME24, AIME25, OlympiadBench, LiveMathBench-Hard). Our Generative Adversarial Reasoner (GAR) consistently improves over strong baselines across both DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B. GAR achieves gains of +22.9% on AIME24 and +19.5% on AIME25 for the Llama backbone, as well as +35.3% on LiveMathBench-Hard for Qwen. These results demonstrate the robustness and generality of GAR in enhancing reasoning performance across diverse mathematical tasks (Tab. 1).

To bridge this gap, we retain a stepwise critic (referred to as the discriminator) but enable it to co-evolve with the LLM reasoner through joint training, generating effective step-level signals with lower annotation costs and increased robustness to label noise and reward mis-specification. Concretely, we optimize the LLM reasoner and an LLM-based discriminator together: the discriminator judges the logical soundness of each intermediate reasoning step and explains its judgment, while the reasoner learns to produce steps the discriminator consistently endorses ...
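To show how these pieces could fit together, here is a rough sketch of one on-policy iteration of such a co-evolving loop. The `generate`, `judge`, and `policy_gradient_step` methods, the paragraph-based slicing heuristic, and the proxy rewards are assumptions for illustration; the paper's actual review schedule and objectives may differ.

```python
# Rough sketch of one iteration of the co-evolving (adversarial) loop.
# The reasoner/discriminator interfaces, the slicing heuristic, and the
# proxy rewards are hypothetical stand-ins for the paper's method.

def partition_into_slices(chain_of_thought: str, target_len: int = 400) -> list[str]:
    """Split a reasoning chain at paragraph boundaries into slices of
    roughly comparable length (a stand-in for the review schedule)."""
    slices, current = [], ""
    for para in chain_of_thought.split("\n\n"):
        current = (current + "\n\n" + para).strip()
        if len(current) >= target_len:
            slices.append(current)
            current = ""
    if current:
        slices.append(current)
    return slices

def train_step(reasoner, discriminator, problem: str, reference_answer: str) -> None:
    # 1. On-policy rollout: the reasoner produces a chain of thought and an answer.
    chain, answer = reasoner.generate(problem)

    # 2. Review schedule: partition the chain into logically complete slices.
    slices = partition_into_slices(chain)

    # 3. The discriminator judges each slice (True = logically sound).
    verdicts = [discriminator.judge(problem, s) for s in slices]

    # 4. Complementary rewards: dense step-level signal for the reasoner;
    #    the discriminator is rewarded when its flags are consistent with
    #    the outcome (a coarse proxy for error detection).
    answer_correct = (answer == reference_answer)
    slice_score = sum(verdicts) / max(len(verdicts), 1)
    r_reasoner = 0.5 * float(answer_correct) + 0.5 * slice_score
    flagged_any = not all(verdicts)
    r_discriminator = 1.0 if flagged_any != answer_correct else 0.0

    # 5. Alternate policy-gradient updates so the two models co-evolve.
    reasoner.policy_gradient_step(problem, chain, r_reasoner)
    discriminator.policy_gradient_step(problem, slices, r_discriminator)
```

The adversarial pressure comes from step 4: a discriminator that rubber-stamps every slice stops earning reward on flawed traces, while a reasoner whose steps get flagged loses part of its reward even when the final answer happens to be correct.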

