On Entropy Control in LLM-RL Algorithms
Appropriate entropy control is crucial to the effectiveness of RL algorithms. A commonly used method for controlling policy entropy is entropy regularization, which is adopted in popular RL algorithms including PPO, SAC, and A3C. Although entropy regularization has conventionally proven effective in robotics and game RL, studies have found that it yields weak to no gains in LLM-RL training. In this work, we study the issues of the entropy bonus in the LLM-RL setting. Specifically, we first argue that conventional entropy regularization suffers from the LLM's extremely large response space and the sparsity of optimal outputs. As a remedy, we propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. The clamped entropy is evaluated with the re-normalized policy defined on a smaller token space, which encourages exploration within a more compact response set. In addition, the algorithm automatically adjusts the entropy coefficient according to the clamped entropy value, effectively controlling the entropy-induced bias while leveraging the entropy's benefits. AEnt is tested on math-reasoning tasks with different base models and datasets, and is observed to outperform the baselines consistently across multiple benchmarks.
💡 Research Summary
The paper investigates why entropy regularization, a staple in many reinforcement‑learning (RL) algorithms such as PPO, SAC, and A3C, fails to provide noticeable gains when applied to large‑language‑model (LLM) RL (LLM‑RL). The authors first present a theoretical analysis that highlights two fundamental issues specific to LLM‑RL: (1) the action space is the model’s entire vocabulary, often exceeding hundreds of thousands of tokens, and (2) optimal responses are extremely sparse within this space. Proposition 1 shows that policy entropy upper‑bounds the norm of the policy gradient, implying that low entropy signals a near‑stationary policy. However, when entropy collapses, the performance gap is bounded by a term inversely proportional to the probability of sampling any optimal token sequence (Cπθ). Proposition 2 extends this to entropy‑regularized objectives, revealing an additional bias term proportional to H·log|A|·log(1/|A*|), where |A| is the vocabulary size and |A*| is the number of optimal token sequences. Because |A| is huge and |A*| is tiny in LLM tasks, this bias dominates any variance‑reduction benefit, explaining the empirical observation that traditional entropy bonuses often yield no improvement or even degrade performance.
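For reference, the entropy-regularized objective the analysis targets can be written in standard notation (this is the textbook form, not necessarily the paper's exact statement; $\beta$ denotes the entropy coefficient):

```latex
J_{\beta}(\theta)
  \;=\;
  \mathbb{E}_{\tau \sim \pi_\theta}\!\bigl[ R(\tau) \bigr]
  \;+\;
  \beta \, \mathbb{E}_{s}\!\bigl[ \mathcal{H}\bigl(\pi_\theta(\cdot \mid s)\bigr) \bigr],
\qquad
\mathcal{H}(\pi) \;=\; -\sum_{a \in \mathcal{A}} \pi(a) \log \pi(a).
```

The bias term discussed above grows with $\log|\mathcal{A}|$, which is why a vocabulary-sized action space makes the standard bonus costly in LLM-RL.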
To address these problems, the authors propose AEnt (Adaptive Entropy with Token‑space Clamping). AEnt consists of two complementary mechanisms: (i) Token‑space clamping – at each decoding step the policy's probability distribution πθ is truncated to the top‑(1‑p) fraction of tokens, forming a reduced set A(s). The distribution is renormalized to obtain ˜πθ, and a "clamped entropy" ˜H(πθ) = –E_{a∼˜πθ}[log ˜πθ(a|s)] serves as the bonus, encouraging exploration within this more compact token set. (ii) Automatic coefficient adjustment – the entropy coefficient is adapted according to the measured clamped-entropy value, controlling the entropy-induced bias while preserving the exploration benefit of the bonus.
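The clamping step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes "top-(1-p) fraction of tokens" means keeping the ⌈(1-p)·|V|⌉ highest-probability tokens, then renormalizing and computing the entropy of the reduced distribution; the function name and the parameter `p` are illustrative.

```python
import numpy as np

def clamped_entropy(probs, p=0.5):
    """Sketch of a clamped entropy bonus (interpretation assumed):
    restrict the policy to the top-(1-p) fraction of tokens A(s),
    renormalize to get the clamped policy, and return its entropy."""
    probs = np.asarray(probs, dtype=float)
    # Size of the reduced token set A(s); keep at least one token.
    k = max(1, int(np.ceil((1.0 - p) * probs.size)))
    # Keep the k highest-probability tokens.
    top = np.sort(probs)[::-1][:k]
    # Renormalize to obtain the clamped policy on A(s).
    top = top / top.sum()
    # Entropy of the renormalized distribution (clamped entropy).
    return -np.sum(top * np.log(top + 1e-12))
```

With `p = 0` the reduced set is the full vocabulary and the clamped entropy reduces to the ordinary entropy; larger `p` restricts the bonus to a smaller, higher-probability token set, which is what keeps the bonus from pushing mass onto the vast space of useless tokens.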