ProxT2I: An Efficient Text-to-Image Diffusion Model Using Proximal Operators

Reading time: 5 minutes

📝 Original Info

  • Title: ProxT2I: An Efficient Text-to-Image Diffusion Model Using Proximal Operators
  • ArXiv ID: 2511.18742
  • Date: 2025-12-01
  • Authors: Fang et al., 2025

📝 Abstract

Diffusion models have emerged as a dominant paradigm for generative modeling across a wide range of domains, including prompt-conditional generation. The vast majority of samplers, however, rely on forward discretization of the reverse diffusion process and use score functions that are learned from data. Such forward and explicit discretizations can be slow and unstable, requiring a large number of sampling steps to produce good-quality samples. In this work we develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We further leverage recent advances in reinforcement learning and policy optimization to optimize our samplers for task-specific rewards. Additionally, we develop a new large-scale and open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation. Our approach consistently enhances sampling efficiency and human-preference alignment compared to score-based baselines, and achieves results on par with existing state-of-the-art and open-source text-to-image models while requiring lower compute and smaller model size, offering a lightweight yet performant solution for human text-to-image generation.

💡 Deep Analysis

📄 Full Content

Generative artificial intelligence has rapidly transformed the landscape of content creation, enabling the synthesis of realistic text, images, audio, and video. In particular, diffusion models, generative models based on discretizations of stochastic differential equations (SDEs) and ordinary differential equations (ODEs), have emerged as the workhorse behind recent advances in image and text-to-image (T2I) generation (Ho et al., 2020; Song et al., 2021b). By simulating carefully designed continuous-time dynamics that transform noise into data, these diffusion-based methods achieve state-of-the-art image quality and flexible semantic controllability, underpinning many widely used systems such as Stable Diffusion (Rombach et al., 2022), Imagen (Saharia et al., 2022), and DALL•E 2 (Ramesh et al., 2022).
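To make the forward (explicit) discretization concrete, below is a minimal, self-contained Python sketch of an Euler–Maruyama step for a reverse-time VP-SDE. The learned score network is stood in for by a toy analytic score of a standard Gaussian so the snippet actually runs; `beta`, `n_steps`, and the flat noise schedule are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def toy_score(x, t):
    # Analytic stand-in for a learned score network s_theta(x, t).
    # For data ~ N(0, I) under a VP diffusion, p_t stays N(0, I),
    # so the score is simply -x. A real model replaces this function.
    return -x

def em_reverse_sde_sample(score, n_steps=50, dim=2, beta=1.0, seed=0):
    """Explicit (forward-discretized) Euler-Maruyama sampler for the
    reverse-time VP-SDE. Sample quality degrades as n_steps shrinks,
    which is the inefficiency attributed to explicit discretizations."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = rng.standard_normal(dim)  # start from pure noise at t = 1
    for k in range(n_steps, 0, -1):
        t = k * dt
        drift = 0.5 * beta * x + beta * score(x, t)  # reverse-time drift
        x = x + drift * dt + np.sqrt(beta * dt) * rng.standard_normal(dim)
    return x

print(em_reverse_sde_sample(toy_score))
```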

Despite the success of diffusion models, high-quality generation with very few sampling steps remains an open challenge. Conventional score-based samplers, based on forward, explicit discretization of continuous-time processes, often deteriorate significantly in sample quality when the number of steps is reduced, or require carefully tuned and specialized solvers to improve sampling speed (Salimans and Ho, 2022; Lu et al., 2022, 2025). This leads to inefficiencies in practical deployments where fast inference is essential. Efficient sampling is particularly challenging in human image generation, where image quality, particularly for faces and hands, remains less satisfactory, with subtle but very perceptible artifacts or unrealistic features still common (Liao et al., 2024).

[Figure: human-preference score (Wu et al., 2023) vs. number of sampling steps at 256² resolution. Our ProxT2I achieves more efficient and human-preference-aligned T2I generation than competing methods.]

An alternative approach to designing diffusion models, recently proposed by Fang et al. (2025), relies on applying a backward discretization to the reverse diffusion process. Unlike traditional forward-discretized solvers that approximate the reverse-time SDE via explicit updates based on scores, the proximal diffusion models of Fang et al. (2025) employ backward, implicit updates that leverage proximal operators of the log-density instead of its gradients. These proximal-based solvers achieve improved theoretical convergence rates and superior empirical sampling efficiency in generating high-quality samples. So far, however, proximal diffusion models have been limited to unconditional generation and showcased on low-dimensional data, leaving their potential for conditional generation of high-resolution images unexplored.
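For intuition, here is a hedged sketch of what an implicit (backward) step looks like: each iteration applies a proximal operator of f_t = -log p_t to a noised iterate, rather than taking an explicit gradient (score) step. The learned conditional proximal network of Fang et al. (2025) is replaced by the analytic prox of a standard Gaussian negative log-density so the code runs; `eta`, the step count, and the flat schedule are illustrative assumptions, not the paper's actual sampler.

```python
import numpy as np

def toy_prox(v, lam):
    # Analytic stand-in for a learned conditional proximal network.
    # For f(x) = -log p(x) with p = N(0, I), i.e. f(x) = ||x||^2 / 2,
    # the prox argmin_x f(x) + ||x - v||^2 / (2*lam) is v / (1 + lam).
    return v / (1.0 + lam)

def proximal_reverse_sample(prox, n_steps=10, dim=2, eta=0.2, seed=0):
    """Backward (implicit) discretization: each step solves an implicit
    equation via the proximal operator of -log p_t, rather than stepping
    explicitly along the score. This is only the rough shape of the
    update in proximal diffusion models (Fang et al., 2025); the real
    method's noise schedule and text conditioning are more elaborate."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)  # noise initialization
    for _ in range(n_steps):
        v = x + np.sqrt(2.0 * eta) * rng.standard_normal(dim)  # noisy input
        x = prox(v, eta)  # implicit / backward step
    return x

print(proximal_reverse_sample(toy_prox))
```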

Moreover, diffusion models trained simply with denoising score-matching objectives often fall short on downstream objectives such as text-image alignment, aesthetics, safety requirements, or human preference. In this context, reinforcement learning (RL) has become a crucial post-training paradigm to optimize diffusion models for such goals (Black et al., 2024; Fan et al., 2023; Clark et al., 2023; Uehara et al., 2024a,b). Unfortunately, policy-gradient methods in RL typically rely on stochastic sampling paths and are therefore not directly applicable to deterministic (and faster) ODE-based samplers, necessitating ad hoc modifications (Liu et al., 2025; Xue et al., 2025). In contrast, proximal diffusion provides an efficient yet stochastic SDE-based sampler with speed comparable to ODE methods (Fang et al., 2025), making it naturally amenable to RL. Yet, the distinct non-Gaussian transition kernel in proximal-based samplers introduces challenges for applying RL objectives, which are defined using transition densities. This motivates the main question in this work: Can we combine the improved speed of proximal diffusion models with the flexibility of reinforcement learning to optimize task-specific rewards in T2I generation?
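As a rough illustration of the GRPO-style post-training referenced in contribution 2 below, the following snippet computes group-relative advantages from per-sample rewards. The reward values and group size are made up for illustration, and the loss sketched in the comments is hypothetical: the actual objective also needs the log-densities of each sampler transition, which is precisely where the non-Gaussian proximal kernel complicates matters.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group Relative Policy Optimization (GRPO) normalizes rewards within
    a group of samples generated from the same prompt: each sample's
    advantage is its reward standardized by the group mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, a group of 4 sampled images scored by some reward model
# (values here are invented for illustration).
adv = grpo_advantages([0.62, 0.48, 0.71, 0.55])

# Hypothetical per-sample policy-gradient objective:
#   L_i = -adv[i] * sum_t log pi_theta(x_{t-1} | x_t, prompt)
# For proximal samplers the transition kernel is non-Gaussian, so these
# log-probabilities are not the usual Gaussian score-based ones.
print(adv)
```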

We answer this question positively by introducing ProxT2I, a text-conditional proximal diffusion model augmented with reinforcement learning for T2I generation. More specifically, our contributions are:

  1. We leverage the proximal diffusion framework (Fang et al., 2025) to develop a new conditional generative model for efficient, high-quality human text-to-image synthesis.

  2. We integrate reinforcement learning via Group Relative Policy Optimization (GRPO) (Shao et al., 2024) into proximal diffusion, improving perceptual quality and text-image alignment while preserving the fast sampling advantages of proximal-based samplers.

  3. We curate and open-source LAION-Face-T2I-15M, a new large-scale dataset of 15M high-quality human images with fine-grained captions and a 3M hand-focused subset, establishing a new foundation for developing human T2I models.

  4. As will be demonstrated, ProxT2I offers a lightweight and efficient solution for fine-grained, text-conditional human image generation with state-of-the-art performance.

The rest of the paper is organized as follows. Section 2 introduces related work on diffusion models,

Reference

This content is AI-processed based on open-access arXiv data.
