Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation
Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, departing from current approaches, and to formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying reduction strategies, and filtering out low-quality samples. With Mask-GRPO, we improve the base model Show-o, achieving substantial gains on standard T2I benchmarks and in preference alignment, outperforming existing state-of-the-art approaches. The code is available at https://github.com/xingzhejun/Mask-GRPO.
💡 Research Summary
This paper introduces Mask‑GRPO, the first reinforcement‑learning (RL) framework that integrates Group Relative Policy Optimization (GRPO) with masked generative models (MGMs) for text‑to‑image (T2I) synthesis. While most recent RL‑enhanced T2I work focuses on diffusion or autoregressive (AR) architectures, MGMs—models that predict all masked tokens in parallel and progressively “unmask” the most confident ones—have been largely ignored due to the difficulty of defining a suitable transition probability for RL.
The authors first reformulate the unmasking process of MGMs as a multi‑step Markov decision process (MDP). The state at step t is the current token sequence together with the text prompt, and the action is the next token sequence after one unmasking iteration. The reward is derived from a CLIP‑based similarity score computed on the final generated image.
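To make the MDP view concrete, the unmasking step described above can be sketched as follows. This is an illustrative greedy sketch, not the paper's code: the `MASK` sentinel, the `(L, V)` probability layout, and the per-step budget `k` are assumptions standing in for the model's actual tokenizer and schedule.

```python
import numpy as np

MASK = -1  # hypothetical id for a masked token

def unmask_step(tokens, probs, k):
    """One unmasking iteration of a masked generative model (greedy sketch).

    tokens: (L,) int array; MASK marks still-masked positions (the state s_t).
    probs:  (L, V) per-position token probabilities from the model.
    k:      number of tokens to reveal at this step.
    Returns the next token sequence (the action, and the next state s_{t+1}).
    """
    masked = np.where(tokens == MASK)[0]
    cand = probs[masked].argmax(axis=1)   # most likely token per masked slot
    conf = probs[masked, cand]            # the model's confidence in each
    order = np.argsort(conf)[-k:]         # keep only the k most confident
    nxt = tokens.copy()
    nxt[masked[order]] = cand[order]
    return nxt
```

Iterating this step until no `MASK` remains traces out one trajectory; the CLIP reward on the decoded image then scores the whole trajectory.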
A central contribution is the proposal of two principled definitions of the transition probability pθ(sₜ₊₁|sₜ,aₜ), which is required for the importance‑sampling ratio in GRPO. The naïve AR‑style definition (product of confidence scores over all currently masked tokens) performs poorly because MGMs’ dynamics are driven primarily by the newly unmasked tokens. The first definition (pθ₁) multiplies the confidence scores of these newly unmasked tokens and additionally accounts for the probability that all remaining masked tokens have lower confidence than the minimum among the newly unmasked ones. The second, simplified definition (pθ₂) uses only the product of confidence scores of the newly unmasked tokens. Both definitions lead to substantial performance gains; pθ₂ is computationally cheaper while still effective.
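The simpler definition pθ₂ can be sketched in a few lines: sum the policy's log-confidences over exactly the tokens revealed at this step, and take ratios of these quantities under the current and old policies for GRPO's importance weight. The mask convention and array layout here are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def logp_theta2(log_probs, prev_tokens, next_tokens, mask_id=-1):
    """Simplified transition log-probability log p_theta2(s_{t+1} | s_t):
    the sum of log-confidences over tokens newly unmasked at this step.

    log_probs:   (L, V) log-probabilities from the policy at state s_t.
    prev_tokens: (L,) tokens before the step (mask_id marks masked slots).
    next_tokens: (L,) tokens after the step.
    """
    newly = (prev_tokens == mask_id) & (next_tokens != mask_id)
    idx = np.where(newly)[0]
    return log_probs[idx, next_tokens[idx]].sum()

# GRPO importance ratio for one step, current policy vs. the old policy:
#   ratio = exp(logp_theta2(new_lp, s_t, s_next) - logp_theta2(old_lp, s_t, s_next))
```

pθ₁ would additionally multiply in the probability that every still-masked token scores below the least confident newly unmasked one; dropping that factor is what makes pθ₂ cheaper.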
Beyond the core formulation, the paper explores three practical enhancements:
- Removing the KL-regularization term. For the relatively small 1.3B-parameter Show-o base model, the KL constraint hinders exploration, so setting β = 0 improves results. This contrasts with findings on larger diffusion models, where KL regularization is beneficial.
- Reduction strategies. To alleviate the heavy computational burden of revisiting every unmasking step during RL training, two strategies are introduced: (a) a computational reduction that computes the GRPO objective on only a subset of iterations (e.g., the first or last 25 of 50 steps), and (b) an unmasking reduction that shortens the number of unmasking iterations during training (e.g., from 50 to 20) while keeping the full schedule at inference time.
- Low-quality sample filtering. The authors observe a "Vanishing Samples" problem in which high-quality trajectories become scarce during training. By discarding samples whose CLIP reward falls below a threshold, the policy avoids being corrupted by noisy gradients.
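The filtering step can be combined with GRPO's group-relative advantage in a short sketch. The threshold name `reward_floor` and the exact normalization details are assumptions; the paper specifies only that below-threshold samples are discarded before the policy update.

```python
import numpy as np

def grpo_advantages(rewards, reward_floor):
    """Drop low-reward samples, then compute group-relative advantages
    (reward minus group mean, divided by group std), GRPO-style.

    rewards: (G,) CLIP rewards for one prompt's group of G samples.
    Returns (kept_indices, advantages) over the surviving samples.
    """
    keep = np.where(rewards >= reward_floor)[0]
    r = rewards[keep]
    if len(r) < 2:                       # too few survivors to normalize
        return keep, np.zeros_like(r)
    adv = (r - r.mean()) / (r.std() + 1e-8)
    return keep, adv
```

Each surviving sample's advantage then scales its per-step importance ratios in the clipped GRPO objective.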
Extensive experiments on standard T2I benchmarks (MS‑COCO, Flickr30k) show that Mask‑GRPO consistently outperforms prior RL‑enhanced diffusion and AR baselines. It achieves lower Fréchet Inception Distance (FID), higher Inception Score (IS), and better CLIPScore. Human preference studies also indicate a clear advantage over state‑of‑the‑art methods. Ablation studies confirm that each component—transition‑probability design, KL removal, reduction strategies, and sample filtering—contributes meaningfully to the overall gain.
In summary, the paper makes three key contributions: (1) it pioneers the application of RL to masked generative models by casting unmasking as a multi‑step decision problem and providing theoretically grounded transition probabilities; (2) it demonstrates that KL regularization can be safely omitted for smaller models, and introduces efficient training tricks that dramatically reduce computational cost; (3) it delivers a method that not only improves objective metrics but also aligns better with human aesthetic preferences, thereby expanding the practical utility of RL in modern T2I pipelines. Future work may explore scaling Mask‑GRPO to larger MGM backbones and incorporating richer multimodal reward signals for even finer preference alignment.