One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry


Group-based reinforcement learning has evolved from the arithmetic mean of GRPO to the geometric mean of GMPO. While GMPO improves stability by optimizing a more conservative objective, it shares a fundamental limitation with GRPO: reliance on a fixed aggregation geometry that ignores the evolving and heterogeneous nature of each trajectory. In this work, we unify these approaches under Power-Mean Policy Optimization (PMPO), a generalized framework that parameterizes the aggregation geometry via a power-mean exponent p. Within this framework, GRPO and GMPO are recovered as special cases. Theoretically, we demonstrate that adjusting p modulates the concentration of gradient updates, effectively reweighting tokens by their advantage contribution. To set p adaptively, we introduce a clip-aware effective sample size (ESS) mechanism: a deterministic rule maps a trajectory's clipping fraction to a target ESS, and we then solve for the specific p that aligns the trajectory's induced ESS with this target. This allows PMPO to transition dynamically between the aggressive arithmetic mean for reliable trajectories and the conservative geometric mean for unstable ones. Experiments on multiple mathematical reasoning benchmarks demonstrate that PMPO outperforms strong baselines.


💡 Research Summary

The paper introduces Power‑Mean Policy Optimization (PMPO), a unified and adaptive framework for group‑based reinforcement learning (RL) in large language model (LLM) reasoning. Existing methods—Group Relative Policy Optimization (GRPO) and Geometric‑Mean Policy Optimization (GMPO)—represent two fixed aggregation geometries: GRPO uses an arithmetic mean of token‑level importance ratios, while GMPO replaces this with a geometric mean to improve stability. Both suffer from rigidity; the arithmetic mean is overly aggressive and sensitive to outlier ratios, whereas the geometric mean is overly conservative and cannot exploit reliable signals.

PMPO generalizes these approaches by aggregating token‑level ratios rₜ with the power‑mean (generalized mean) of order p:

  $$\hat r_p = \Bigl(\frac{1}{n}\sum_{t=1}^{n} \exp(p\,\Delta\ell_t)\Bigr)^{1/p}$$

where Δℓₜ is the log‑probability change for token t. When p = 1, the expression reduces to the arithmetic mean (GRPO); as p → 0, it converges to the geometric mean (GMPO). Thus, p parameterizes a continuum between aggressive and conservative updates.
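The two limiting cases can be checked numerically. Below is a minimal sketch of the power-mean aggregation under these definitions; the function name and the small-|p| cutoff for the p → 0 limit are illustrative choices, not from the paper.

```python
import math

def power_mean_ratio(delta_ells, p):
    """Power-mean aggregation of token ratios r_t = exp(Delta ell_t).

    p = 1 recovers the arithmetic mean (GRPO-style); as p -> 0 the
    expression converges to the geometric mean (GMPO-style).
    """
    n = len(delta_ells)
    if abs(p) < 1e-8:
        # p -> 0 limit: geometric mean of exp(d) is exp(mean(d))
        return math.exp(sum(delta_ells) / n)
    return (sum(math.exp(p * d) for d in delta_ells) / n) ** (1.0 / p)

deltas = [0.2, -0.1, 0.05]          # toy per-token log-prob changes
arith = power_mean_ratio(deltas, 1.0)   # arithmetic mean of ratios
geom = power_mean_ratio(deltas, 1e-9)   # ~geometric mean of ratios
```

By the AM-GM inequality, the arithmetic aggregate is always at least as large as the geometric one, which matches the paper's aggressive-versus-conservative framing.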

A key theoretical contribution is the gradient analysis: the derivative of (\hat r_p) with respect to each Δℓₜ is proportional to a softmax over the Δℓ values with temperature 1/p. Small p produces a near‑uniform softmax, spreading gradient mass evenly across tokens (robustness). Large p sharpens the softmax, concentrating updates on tokens with the highest Δℓ (efficiency). Consequently, p acts as an inverse temperature controlling token‑level attention in the policy gradient.
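The inverse-temperature effect can be sketched as follows; the toy values and function name are ours, not the paper's.

```python
import math

def token_weights(delta_ells, p):
    """Softmax over Delta ell values with inverse temperature p.

    Small p yields near-uniform weights (gradient mass spread across
    tokens); large p concentrates weight on the largest Delta ell.
    """
    exps = [math.exp(p * d) for d in delta_ells]
    z = sum(exps)
    return [e / z for e in exps]

deltas = [0.5, 0.0, -0.5]
w_small = token_weights(deltas, 0.1)   # nearly uniform
w_large = token_weights(deltas, 10.0)  # concentrated on the top token
```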

To adapt p per rollout, the authors link it to the PPO‑style clipping fraction. After applying log‑domain clipping (threshold c) to each Δℓₜ, they compute the fraction f of tokens that were clipped. This fraction serves as a proxy for signal reliability: high f indicates an unstable trajectory that should be updated conservatively. They define a target effective sample size (ESS) ratio

  $$\eta^* = \frac{1}{n} + f\Bigl(1 - \frac{1}{n}\Bigr)$$

and solve for the unique p that makes the normalized ESS of the induced token weights equal to η*. The token weights are

  $$w_t(p) = \text{softmax}_p(\tilde\Delta\ell_t)$$

and the ESS is

  $$\text{ESS}(p) = \frac{1}{n\sum_t w_t(p)^2}$$

Because the ESS is monotonic in p, a simple bisection search finds the appropriate p within a bounded interval.
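Putting the pieces together, the target rule and the bisection step can be sketched as below. The search bracket [0, 10], iteration count, and helper names are illustrative assumptions; the paper's actual interval bounds are not given in this summary.

```python
import math

def target_ess(n, f):
    """Clip-aware target rule eta* = 1/n + f*(1 - 1/n) (sketch).

    n: tokens in the trajectory; f: fraction of clipped tokens.
    f = 1 targets uniform weights (ESS = 1, fully conservative).
    """
    return 1.0 / n + f * (1.0 - 1.0 / n)

def normalized_ess(delta_ells, p):
    """Normalized ESS of the induced softmax weights: 1 / (n * sum w_t^2)."""
    exps = [math.exp(p * d) for d in delta_ells]
    z = sum(exps)
    w = [e / z for e in exps]
    return 1.0 / (len(w) * sum(x * x for x in w))

def solve_p(delta_ells, eta_target, p_lo=0.0, p_hi=10.0, iters=60):
    """Bisection for the p whose normalized ESS matches eta_target.

    ESS(p) decreases from 1 at p = 0 as the weights concentrate, so the
    target is bracketed on [p_lo, p_hi] when it lies in that ESS range.
    """
    for _ in range(iters):
        mid = 0.5 * (p_lo + p_hi)
        if normalized_ess(delta_ells, mid) > eta_target:
            p_lo = mid  # weights still too uniform: increase p
        else:
            p_hi = mid
    return 0.5 * (p_lo + p_hi)

# Toy trajectory: 4 tokens with a 25% clipping fraction
deltas = [0.8, 0.1, -0.3, -0.6]
eta = target_ess(n=4, f=0.25)
p_star = solve_p(deltas, eta)
```

A high clipping fraction drives the target ESS toward 1, which the solver matches with a small p (near-geometric, conservative); a low fraction permits a larger p and a more aggressive, near-arithmetic update.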

