Group Relative Policy Optimization Enhanced with Bayesian Prior-Guided Optimization


📝 Abstract

Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual-visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many-to-many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: intergroup Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.


📄 Content

Recent progress in text-to-video/image generation has been largely driven by powerful diffusion architectures Rombach et al. (2022); Ramesh et al. (2022); Saharia et al. (2022); Zhang et al. (2024a) and reinforcement learning (RL)-based post-training strategies Black et al. (2023); Fan et al. (2023); Xue et al. (2025); Liang et al. (2025) that align generative models with perceptual or preference feedback Ouyang et al. (2022); Xu et al. (2024). Among these, Group Relative Policy Optimization (GRPO) Shao et al. (2024); Guo et al. (2025) has emerged as a promising framework, providing stable optimization and noticeable gains in visual quality, motion smoothness, and temporal coherence Xue et al. (2025). However, despite these advances, the semantic alignment between text prompts and generated videos and images remains a persistent weakness: models often produce visually plausible yet semantically mismatched results.

This limitation stems from the inherently ambiguous nature of textual-visual correspondence. A single video can be described in multiple semantically valid ways depending on temporal granularity or linguistic focus. For instance, a short sequence showing a gymnast performing rotations could be described as "doing a gymnastics spin," "performing a turn," or "completing two rounds of rotation." Conversely, a single text prompt may correspond to diverse videos and images that differ in motion trajectory, style, or camera framing, yet still satisfy the same description. These one-to-many and many-to-one relationships, as shown in Figure 1a, make the alignment task inherently uncertain, leading the reward model to produce unreliable or noisy signals. Existing GRPO-based methods generally treat rewards as consistent scalar feedback, without accounting for such uncertainty in the textual-visual relation.
To address this challenge, we propose Bayesian Prior-Guided Optimization (BPGO), a principled framework that explicitly models reward uncertainty within a Bayesian formulation. Rather than trusting all prompt groups equally, BPGO dynamically reallocates learning trust based on the consistency between observed rewards and a semantic prior anchor that represents the model's expected performance, as shown in Figure 1b. Groups achieving above-prior rewards receive amplified update gains, while unreliable groups are softly down-weighted. Within each group, a reward renormalization mechanism further sharpens discriminability by stretching confident deviations around the prior and compressing ambiguous ones. Together, these mechanisms create a hierarchical, quality-aware optimization landscape that enhances both reward reliability and semantic alignment during GRPO training.
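The two-level modulation described above can be sketched as follows. This is only an illustrative reading of the mechanism, not the paper's implementation: the function name, the sigmoid trust function, the temperature `temp`, and the scalar prior anchor `prior_mu` are all our own assumptions.

```python
import numpy as np

def bpgo_weights(group_rewards, prior_mu, temp=1.0, eps=1e-8):
    """Illustrative sketch of BPGO's two mechanisms (names and exact
    functional forms are hypothetical, not taken from the paper):
    1) inter-group Bayesian trust allocation: a group whose mean reward
       exceeds the semantic prior anchor earns higher trust;
    2) intra-group prior-anchored renormalization: per-sample deviations
       are anchored at the prior and scaled by the group's trust, so
       confident groups' distinctions are expanded and ambiguous ones
       compressed."""
    trusts, advs = [], []
    for rewards in group_rewards:
        r = np.asarray(rewards, dtype=np.float64)
        # Inter-group: soft trust from consistency with the prior anchor.
        trust = 1.0 / (1.0 + np.exp(-(r.mean() - prior_mu) / temp))
        # Intra-group: deviation from the prior, normalized by group spread,
        # then modulated by the group's trust weight.
        dev = (r - prior_mu) / (r.std() + eps)
        trusts.append(trust)
        advs.append(trust * dev)
    return trusts, advs
```

Under this sketch, a group scoring above the prior both receives a larger trust weight and has its internal sample ranking stretched, which matches the hierarchical, quality-aware behavior the text describes.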

In summary, our contributions are threefold:

• We identify the intrinsic ambiguity of textual-visual alignment, and introduce priors to reduce the uncertainty for better post-training.

• We propose BPGO, a Bayesian prior-guided optimization framework that adaptively reweights and renormalizes reward signals according to their uncertainty, introducing both group-level trust allocation and sample-level discriminability enhancement.

• We demonstrate that BPGO significantly improves textual-visual alignment and overall generation quality across multiple benchmarks, outperforming recent GRPO-based approaches including DanceGRPO while maintaining computational efficiency.

GRPO and Variants Group Relative Policy Optimization (GRPO) was introduced by DeepSeekMath as a memory-efficient alternative to Proximal Policy Optimization (PPO) Shao et al. (2024).

Unlike PPO Schulman et al. (2017), which requires a separate value network to estimate advantages, GRPO generates multiple responses per prompt and uses normalized group rewards as baselines, achieving approximately 50% reduction in memory usage. This design aligns naturally with the comparative nature of reward models typically trained on pairwise preference datasets Ouyang et al. (2022).
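The group-baseline idea above reduces, in its simplest form, to standardizing each sample's reward against its own group. The sketch below is a minimal illustration of that computation, assuming the common mean/std normalization; it is not the authors' code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage estimate for one prompt group: each
    response's reward is baselined against the group mean and scaled
    by the group standard deviation, so no separate value network is
    needed (illustrative sketch)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four responses sampled for one prompt, scored by a reward model:
adv = group_relative_advantages([0.9, 0.7, 0.4, 0.2])
# Above-average responses get positive advantages, below-average negative.
```

Because the baseline comes from sibling responses to the same prompt, the comparison is naturally compatible with reward models trained on pairwise preferences, as the text notes.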

DeepSeek-R1 demonstrated GRPO’s effectiveness when combined iteratively with supervised finetuning for developing advanced reasoning capabilities Guo et al. (2025). The algorithm’s group-based advantage estimation provides significantly lower variance compared to traditional policy gradient methods like REINFORCE Williams (1992), while avoiding the computational complexity of trust region methods Schulman et al. (2015). Beyond mathematical reasoning, GRPO has been successfully applied to various language tasks including instruction following Ouyang et al. (2022) and text summarization Stiennon et al. (2020), demonstrating its versatility as a general-purpose alignment technique. Recent theoretical analyses have examined GRPO’s convergence properties under different feedback models, establishing favorable sample complexity bounds that highlight its efficiency advantages in large-scale model training Casper et al. (2023).

Visual Generation with Reinforcement Learning Early RL approaches for visual generation

