Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration

Reading time: 2 minutes

📝 Original Info

  • Title: Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration
  • ArXiv ID: 2511.00794
  • Date: 2025-11-02
  • Authors: Not specified (author information was not provided in the source data)

📝 Abstract

Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models, yet training remains costly because many rollouts contribute little to optimization relative to the computation they require. This study investigates how leveraging intrinsic data properties, a nearly free signal during training, can improve data efficiency for RLVR. We propose PREPO with two complementary components. First, we adopt prompt perplexity as an indicator of model adaptability in learning, enabling the model to progress from well-understood contexts to more challenging ones. Second, we amplify the discrepancy among the rollouts by differentiating their relative entropy, and prioritize sequences that exhibit a higher degree of exploration. Together, these mechanisms reduce rollout demand while preserving competitive performance. On the Qwen and Llama models, PREPO achieves competitive results on mathematical reasoning benchmarks with up to three times fewer rollouts than the baselines. Beyond the empirical gains, we provide theoretical and in-depth analyses explaining how our method improves the data efficiency of RLVR.
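The two components described in the abstract, an easy-to-hard curriculum based on prompt perplexity and prioritization of higher-entropy rollouts, can be sketched in a few lines. This is a minimal illustration of the general idea, not the paper's actual implementation: the function names, the dictionary fields, and the use of mean token entropy as the exploration score are all assumptions, since the abstract does not specify PREPO's exact scoring rules.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a prompt from per-token log-probabilities:
    exp(-mean log p). Lower values indicate better-understood contexts."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def curriculum_order(prompts):
    """Order prompts from well-understood (low perplexity) to more
    challenging (high perplexity), a simple easy-to-hard curriculum.
    Each prompt is a dict with a 'logprobs' list (hypothetical schema)."""
    return sorted(prompts, key=lambda p: perplexity(p["logprobs"]))

def select_rollouts(rollouts, k):
    """Keep the k rollouts whose mean token entropy is highest,
    i.e., those exhibiting the most exploration. Each rollout is a
    dict with a 'token_entropies' list (hypothetical schema)."""
    def mean_entropy(r):
        return sum(r["token_entropies"]) / len(r["token_entropies"])
    return sorted(rollouts, key=mean_entropy, reverse=True)[:k]

# Example: a confidently predicted prompt sorts before an uncertain one,
# and the more exploratory rollout is retained.
easy = {"id": "a", "logprobs": [-0.1, -0.2]}
hard = {"id": "b", "logprobs": [-2.0, -1.5]}
ordered = curriculum_order([hard, easy])

rollouts = [
    {"id": 1, "token_entropies": [0.5, 0.5]},
    {"id": 2, "token_entropies": [1.5, 1.0]},
]
kept = select_rollouts(rollouts, k=1)
```

In an actual RLVR loop, the log-probabilities and token entropies would come from the policy model's forward pass, and the retained rollouts would feed the policy-gradient update; the sketch only shows the ranking logic that reduces rollout demand.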

💡 Deep Analysis

Figure 1 (image not included in the extracted text)



Reference

This content is AI-processed based on open access ArXiv data.
