IRIS: Intrinsic Reward Image Synthesis
Despite the success of Reinforcement Learning from Human Feedback (RLHF) in language reasoning, its application to autoregressive Text-to-Image (T2I) generation is often constrained by the limited availability of human preference data. This paper explores how an autoregressive T2I model can learn from internal signals without relying on external rewards or labeled data. Contrary to recent findings in math and code reasoning, we show that minimizing self-certainty, rather than maximizing it, improves image generation. We observe that autoregressive T2I models with higher certainty tend to generate simple, uniform images that are less aligned with human preferences, whereas models with lower certainty tend to generate vivid images rich in detail. Based on this observation, we propose IRIS (Intrinsic Reward Image Synthesis), the first framework to improve autoregressive T2I models with reinforcement learning using only an intrinsic reward. Empirical results demonstrate that applying IRIS to autoregressive T2I models yields performance superior to training with individual external rewards and comparable to training with an ensemble of external rewards. IRIS also incentivizes the emergence of nuanced CoT reasoning for high-quality image generation.
💡 Research Summary
The paper introduces IRIS (Intrinsic Reward Image Synthesis), a novel reinforcement‑learning framework that improves autoregressive text‑to‑image (T2I) models using only an intrinsic reward signal, eliminating the need for human preference data or domain‑specific evaluators. The authors begin by investigating the behavior of self‑certainty (SC), a metric originally defined as the KL divergence between a model’s output distribution and a uniform distribution. While prior work on mathematics and code reasoning found that maximizing SC improves performance, the authors discover the opposite trend for T2I: higher SC correlates with overly simple, uniform images that are less appealing to humans, whereas lower SC is associated with vivid, detail‑rich images. Based on this observation, they define Negative Self‑Certainty (NSC) as the negative of SC and use NSC as the reward to be maximized during RL fine‑tuning.

The reward is computed via the forward KL divergence KL(U‖π), encouraging mode‑covering behavior that spreads probability mass over many plausible tokens rather than concentrating on a single high‑probability token. Training employs Group Relative Policy Optimization (GRPO), sampling multiple text candidates per prompt and a single image per text, then estimating advantages from the average NSC of each candidate. Importantly, NSC is applied to both text and image token streams, prompting the model to generate richer descriptive text before image synthesis, which in turn leads to more expressive visual outputs.

Experiments are conducted on the Janus‑Pro 1B multimodal model using 553 GenEval prompts, with hyper‑parameters such as a learning rate of 1e‑6, KL regularization β = 0.01, batch size 8, and a maximum of 1024 generation tokens. The model is evaluated on three diverse benchmarks: T2I‑CompBench (testing compositional and spatial reasoning), WISE (world‑knowledge and common‑sense reasoning), and TIIF‑Bench (short and long instruction following).
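As a concrete illustration, the forward‑KL self‑certainty and the NSC reward described above can be sketched in PyTorch. The function names and tensor shapes here are illustrative assumptions, not the authors' released code:

```python
import math

import torch
import torch.nn.functional as F


def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average forward KL(U || pi) over a token sequence.

    logits: (seq_len, vocab_size) raw model outputs for the sampled tokens.
    Returns a scalar; higher values mean a more peaked (more certain) model.
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)  # (seq_len, vocab_size)
    # KL(U || pi) = sum_v (1/V) * [log(1/V) - log pi_v]
    #             = -log V - mean_v log pi_v
    per_token_kl = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return per_token_kl.mean()


def nsc_reward(logits: torch.Tensor) -> torch.Tensor:
    """Negative Self-Certainty: the intrinsic reward maximized by IRIS."""
    return -self_certainty(logits)
```

Note that KL(U‖π) is zero exactly when π is uniform and grows as the distribution becomes more peaked, so maximizing NSC pushes the model toward spreading probability mass, matching the mode‑covering behavior the summary describes.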
After 800 k training steps, IRIS achieves performance comparable to or surpassing external‑reward baselines (T2I‑R1) and matches ensembles of external reward models, showing improvements of 13.3 % on T2I‑CompBench, 28.8 % on WISE, and 10.7 %/4.2 % on TIIF‑Bench short/long respectively. The authors argue that external rewards tend to over‑constrain models to narrow domains, whereas IRIS leverages the model’s own prior knowledge, yielding better generalization across tasks. Additional ablations demonstrate that forward KL outperforms backward KL and that maximizing NSC for text tokens, contrary to the trend in math reasoning, is beneficial for the exploratory text generation required in T2I pipelines. Moreover, IRIS encourages the emergence of nuanced chain‑of‑thought (CoT) reasoning, enabling the model to handle complex, multi‑step prompts more effectively. In summary, IRIS provides the first successful demonstration that intrinsic reward signals can replace costly human‑labeled feedback in text‑to‑image generation, delivering superior image quality, broader generalization, and richer reasoning capabilities.
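Under GRPO as described in the summary, each sampled candidate's advantage is estimated relative to its group rather than via a learned value function. A minimal sketch of that group‑normalized advantage, with function and argument names chosen for illustration only:

```python
import torch


def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt.

    group_rewards: (G,) one scalar reward per sampled candidate,
    e.g. the average NSC of that candidate's token stream.
    Each reward is z-scored against the group mean and std, so candidates
    are rewarded for being better than their siblings, not in absolute terms.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```

Because the advantages are centered within each group, they sum to (approximately) zero, and the candidate with the highest NSC receives the largest positive advantage.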