Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

Reading time: 2 minutes
...

📝 Original Info

  • Authors: ** 논문에 명시된 저자 정보가 제공되지 않았습니다. (예: 김민수, 이지은, 박성현 등) **

📝 Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a prominent paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, leading to premature convergence to suboptimal local minima and hindering further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To bridge this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our results identify three key factors that influence entropy: the clipping thresholds in the optimization objective, the number of off-policy updates, and the diversity of the training data. Furthermore, through both theoretical analysis and empirical validation, we demonstrate that tokens with positive advantages are the primary drivers of entropy collapse. Motivated by this insight, we propose Positive-Advantage Reweighting, a simple yet effective approach that regulates model entropy by adjusting the loss weights assigned to tokens with positive advantages during RLVR training, while maintaining competitive performance.
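The abstract describes Positive-Advantage Reweighting only at a high level: the loss weights assigned to positive-advantage tokens are adjusted inside the RLVR objective. As a rough illustration (not the paper's exact formulation), the sketch below applies a hypothetical down-weighting factor `pos_adv_weight` to positive-advantage tokens in a PPO/GRPO-style clipped token loss; `logp_new`, `logp_old`, and `advantages` are assumed to be per-token tensors.

```python
import torch


def positive_advantage_reweighted_loss(logp_new, logp_old, advantages,
                                        clip_eps=0.2, pos_adv_weight=0.5):
    """Clipped PPO/GRPO-style token loss with positive-advantage reweighting.

    Tokens with positive advantages, which the paper identifies as the main
    drivers of entropy collapse, receive a reduced loss weight
    (`pos_adv_weight` is a hypothetical hyperparameter for this sketch).
    """
    ratio = torch.exp(logp_new - logp_old)                      # per-token importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_loss = -torch.minimum(unclipped, clipped)         # standard clipped surrogate

    # Down-weight positive-advantage tokens; negative-advantage tokens keep weight 1.
    weights = torch.where(
        advantages > 0,
        torch.full_like(advantages, pos_adv_weight),
        torch.ones_like(advantages),
    )
    return (weights * per_token_loss).mean()
```

Down-weighting rather than discarding positive-advantage tokens is consistent with the abstract's claim that the method regulates entropy while maintaining competitive performance.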

💡 Deep Analysis

Figure 1

📄 Full Content

📸 Image Gallery

The gallery contains the paper's plot files, covering: entropy and accuracy across prompts (with correlation-coefficient plots), entropy change versus covariance approximations, entropy versus n-gram diversity and Self-BLEU on AIME24, entropy under different clipping settings, entropy versus training reward, validation accuracy at fixed entropy, and response accuracy probabilities, each for both the base model and Llama runs.

Reference

This content was AI-processed from open-access ArXiv data.
