Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate this trade-off through empirical studies, yielding two key observations. First, we discover a reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF), and verify that a high-variance subset of trajectories selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still incurs substantial computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging this efficiency, Pro-GRPO employs an "Expand-and-Prune" strategy: it first expands the size of the initial sampling group to maximize trajectory diversity, then applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.
💡 Research Summary
Group Relative Policy Optimization (GRPO) has become a popular tool for aligning generative models with human preferences because it can estimate advantages without a learned value function. However, GRPO’s performance hinges on using a large group of sampled trajectories (G) to obtain reliable advantage estimates, which creates a tension: a small group is cheap but yields noisy, unstable gradients, while a large group is computationally prohibitive.
The authors first conduct systematic empirical studies of GRPO’s sampling dynamics. They discover a “reward clustering” phenomenon: for a given prompt, most trajectories receive rewards that lie very close to the group mean μ_G, resulting in a small within‑group standard deviation σ_G. Because the normalized advantage A_i = (R_i − μ_G)/(σ_G + ε) shrinks with σ_G, many trajectories contribute almost zero gradient despite consuming the same compute. Random subsampling does not alleviate this issue because it merely discards trajectories without changing the distribution’s variance.
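The advantage computation above is standard GRPO and can be sketched in a few lines. The ε and reward values below are illustrative choices to make the clustering effect visible, not settings from the paper: when rewards hug the group mean so tightly that σ_G falls below ε, the normalized advantages shrink toward zero, whereas a spread-out group yields advantages of order one.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-4):
    """Group-relative advantage: A_i = (R_i - mu_G) / (sigma_G + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    mu = rewards.mean()
    sigma = rewards.std()
    return (rewards - mu) / (sigma + eps)

# Clustered rewards: deviations from the mean are smaller than eps,
# so the normalized advantages are nearly zero.
clustered = grpo_advantages([0.49999, 0.50001, 0.5, 0.5])

# Spread rewards with the same mean: advantages are O(1) and informative.
spread = grpo_advantages([0.1, 0.9, 0.3, 0.7])

print(np.abs(clustered).max())  # ~0.09 -- almost no gradient signal
print(np.abs(spread).max())     # ~1.26 -- strong signal at the extremes
```

Note that random subsampling of a clustered group leaves σ_G essentially unchanged, which is why it fails to restore signal.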
To address the lack of useful signal, the paper introduces Optimal Variance Filtering (OVF). OVF selects a fixed‑size subset (k < G) that maximizes the variance of rewards within the subset, effectively pulling in the extreme high‑ and low‑reward samples and spreading the reward distribution. Experiments show that OVF‑filtered subsets outperform the full unfiltered group on several alignment metrics (PickScore, HPSv2, etc.), confirming the “Less is More” hypothesis: a carefully chosen high‑variance subset can provide stronger learning signals than a larger, noisy set.
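The selection step can be made concrete. The paper does not spell out its exact search procedure, so the sketch below is an illustrative implementation: for scalar rewards, a maximum-variance subset of size k can be found by sweeping over prefix/suffix splits of the sorted rewards (some number j of the lowest rewards plus k − j of the highest), which pulls in the extremes without enumerating all C(G, k) subsets.

```python
import numpy as np

def ovf_select(rewards, k):
    """Return indices of a k-element subset with maximal reward variance.

    Sweeps j = 0..k, taking the j lowest and (k - j) highest rewards,
    and keeps the split with the largest within-subset variance.
    """
    rewards = np.asarray(rewards, dtype=float)
    order = np.argsort(rewards)           # indices, ascending reward
    g = len(rewards)
    best_var, best_idx = -1.0, None
    for j in range(k + 1):                # j lows + (k - j) highs
        idx = np.concatenate([order[:j], order[g - (k - j):]])
        var = rewards[idx].var()
        if var > best_var:
            best_var, best_idx = var, idx
    return sorted(best_idx.tolist())

# The two extreme rewards (0.10 and 0.90) are always pulled in.
print(ovf_select([0.50, 0.51, 0.49, 0.10, 0.90, 0.52], k=3))  # → [3, 4, 5]
```

This is exactly the "pulling in the extreme high- and low-reward samples" behavior described above: the clustered mid-range rewards are the ones discarded.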
Nevertheless, OVF is a post‑sampling filter; the model still has to fully generate all G trajectories before discarding the low‑variance ones, which wastes computation. The core contribution of the paper is Pro‑GRPO (Proactive GRPO), a dynamic framework that prunes trajectories during the sampling process itself. At predefined checkpoints (t_i) the latent state of each active trajectory is projected forward to the terminal time T using a single deterministic ODE step (derived from the diffusion or rectified‑flow drift). The projected latent is decoded and fed to the reward model, yielding a cheap proxy reward Ř_i. OVF is then applied to these proxy rewards, retaining only a high‑variance subset for further denoising while early‑terminating the rest. This “latent‑feature‑based pruning” reduces the number of SDE/ODE steps dramatically, scaling the actual compute cost to the size of the surviving set K rather than the original G.
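The pruning checkpoint can be sketched as follows. All model components (`velocity_fn`, `decode_fn`, `reward_fn`) are hypothetical stand-ins for the flow model's velocity field, VAE decoder, and reward model, and the pruning rule here is a simplified stand-in for OVF that keeps the trajectories whose proxy rewards sit furthest from the group mean (the variance-carrying extremes). For rectified flow on t ∈ [0, 1], the one-step projection is a single Euler step: x_1 ≈ x_t + (1 − t)·v(x_t, t).

```python
import numpy as np

def proxy_rewards(latents, t, velocity_fn, decode_fn, reward_fn):
    """Cheap proxy reward R̃_i: jump each active latent to the terminal
    time with one deterministic Euler step of the flow ODE, decode, score."""
    projected = [x + (1.0 - t) * velocity_fn(x, t) for x in latents]
    return np.array([reward_fn(decode_fn(x)) for x in projected])

def prune_by_proxy(latents, proxies, keep):
    """Simplified OVF stand-in: retain the `keep` trajectories whose
    proxy rewards deviate most from the group mean; terminate the rest."""
    dist = np.abs(proxies - proxies.mean())
    survivors = np.argsort(-dist)[:keep]
    return [latents[i] for i in survivors], sorted(survivors.tolist())
```

At each checkpoint t_i only the surviving latents continue denoising, so later steps cost O(K) rather than O(G) model evaluations.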
Pro‑GRPO also incorporates an “Expand‑and‑Prune” schedule. Instead of starting with a budget‑constrained group, the method temporarily expands the initial pool to a larger size G_max > K to increase coverage of the reward landscape. As sampling proceeds, successive OVF‑based pruning steps shrink the pool monotonically (G_max → K_2 → … → K), preserving exploration diversity while keeping final optimization cost low.
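The compute accounting behind this schedule is easy to make concrete. In the sketch below, the checkpoint positions and pool sizes are made-up illustrative numbers, not the paper's settings; the point is that total denoising cost tracks the shrinking pool rather than G_max:

```python
def denoise_steps(g_max, schedule, n_steps):
    """Total per-trajectory denoising steps under a pruning schedule.

    `schedule` maps a checkpoint step index to the pool size kept *after*
    pruning at that step; the pool starts at the expanded size g_max.
    """
    pool, total = g_max, 0
    for step in range(n_steps):
        total += pool                 # every active trajectory takes one step
        if step in schedule:
            pool = schedule[step]     # early-terminate the pruned trajectories
    return total

# Expand to 24, prune to 12 at step 10 and to 6 at step 20, over 40 steps:
print(denoise_steps(24, {10: 12, 20: 6}, 40))   # → 498 steps
print(denoise_steps(24, {}, 40))                # → 960 steps for the full group
```

Here the expanded-then-pruned schedule costs roughly half of running all 24 trajectories to completion, while still exploring the full G_max pool early on.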
The authors evaluate Pro‑GRPO on two representative generative families: the diffusion‑based Stable Diffusion v1.4 and the flow‑based Stable Diffusion 3.5‑M. Baselines include Flow‑GRPO and Dance‑GRPO, both state‑of‑the‑art RL fine‑tuning methods. All methods are trained on the same reward signals (HPS‑v2.1, CLIP Score, PickScore) and evaluated on DrawBench, HPSv2, ImageReward, GenEval, Aesthetic Score, and other metrics. Pro‑GRPO consistently outperforms the baselines, achieving 3–5 percentage‑point gains across metrics while reducing overall FLOPs by 40–60 % compared to the full‑group GRPO. The reward variance of the surviving trajectories remains substantially higher throughout training, confirming that the dynamic pruning effectively maintains a rich advantage signal.
In summary, the paper identifies a fundamental inefficiency in GRPO (reward clustering), validates that maximizing reward variance improves learning, and then builds a practical, compute‑efficient solution that integrates variance‑based filtering directly into the sampling loop. Pro‑GRPO’s “expand‑and‑prune” paradigm preserves the exploratory benefits of large groups without incurring their cost, making GRPO scalable to larger models and tighter compute budgets. Future directions include improving proxy‑reward estimation, adaptive checkpoint placement, and extending the framework to other generative modalities such as video or 3‑D synthesis.