Toward Early Quality Assessment of Text-to-Image Diffusion Models
Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a “generate-then-select” mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive, since each candidate requires tens to hundreds of denoising steps and evaluation metrics such as CLIPScore and ImageReward are applied only post hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure (object layout and spatial arrangement) that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.
💡 Research Summary
This paper tackles the inefficiency inherent in the common “generate‑then‑select” workflow used by modern text‑to‑image diffusion and flow‑matching models. In practice, many seeds are sampled per prompt, and only a few are kept after ranking with external evaluators such as CLIPScore, ImageReward, or PickScore. Because each seed requires tens to hundreds of denoising steps, a large amount of compute is wasted on low‑quality candidates that are discarded only after full generation. The authors introduce Probe‑Select, a lightweight plug‑in that enables early quality assessment (EQA) by exploiting structural signals present in intermediate denoiser activations.
Key Observation
Even at early timesteps (≈ 20 % of the total diffusion trajectory), certain hidden layers of the denoiser (a U‑Net or diffusion transformer, depending on the backbone) already encode a stable coarse layout: object positions, spatial composition, and semantic groupings. These cues evolve slowly, making them reliable predictors of the final image’s fidelity.
Probe‑Select Architecture
A probe is attached to a selected block of the denoiser. It consists of: (1) a feature tap that extracts the activation hₜ; (2) a tiny vision encoder g_ϕ that pools hₜ together with a timestep embedding to produce a global vector uₜ; and (3) a small MLP p_ϕ that maps uₜ to a scalar quality estimate ŷₜ. The probe adds negligible overhead and does not modify the original generator, sampler, or schedule, allowing it to be used with any diffusion or flow‑matching backbone.
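The probe described above can be sketched as a small PyTorch module. This is a minimal illustration, not the paper’s exact architecture: the layer sizes, the use of global average pooling as the “feature tap” aggregation, and the class name `QualityProbe` are all assumptions.

```python
import torch
import torch.nn as nn

class QualityProbe(nn.Module):
    """Hypothetical sketch of a Probe-Select head.

    Pools a tapped denoiser activation h_t together with a timestep
    embedding into a global vector u_t (the g_phi role), then maps it
    to a scalar quality estimate y_hat (the p_phi role). All sizes
    are illustrative.
    """
    def __init__(self, feat_dim=320, t_dim=64, hidden=256):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Linear(1, t_dim), nn.SiLU())
        self.encoder = nn.Sequential(            # tiny vision encoder g_phi
            nn.Linear(feat_dim + t_dim, hidden), nn.SiLU())
        self.head = nn.Sequential(               # small MLP p_phi
            nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, h_t, t):
        # h_t: (B, C, H, W) activation tapped from one denoiser block
        u = h_t.mean(dim=(2, 3))                 # global average pool -> (B, C)
        te = self.t_embed(t.view(-1, 1))         # timestep embedding  -> (B, t_dim)
        u_t = self.encoder(torch.cat([u, te], dim=-1))  # global vector u_t
        return self.head(u_t).squeeze(-1), u_t   # scalar y_hat_t, embedding u_t
```

Because the probe only reads activations and never writes back into the denoiser, the generator, sampler, and schedule are untouched, which is what makes it backbone-agnostic.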
Training Objectives
Two complementary losses align the probe with external quality metrics and the textual prompt:
• Listwise ranking loss (L_list) encourages the relative ordering of early predictions to match the ordering of ground‑truth evaluator scores across a batch, focusing the probe on discriminative structural cues.
• Contrastive text‑alignment loss (L_align) uses an InfoNCE objective to pull the probe embedding uₜ toward the frozen CLIP text embedding of the prompt, ensuring that the early estimate remains prompt‑aware.
The total loss is L = L_list + λ·L_align.
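The summary does not pin down the exact form of the listwise objective, so the sketch below uses a ListMLE/Plackett-Luce-style loss as one plausible instantiation of L_list, and a standard in-batch InfoNCE loss for L_align. Function names, the temperature value, and the assumption that uₜ and the CLIP text embedding share a dimension are all illustrative.

```python
import torch
import torch.nn.functional as F

def listwise_loss(pred, target):
    """ListMLE-style listwise ranking loss (one plausible choice for
    L_list): penalize the probe when its scores order candidates
    differently from the ground-truth evaluator scores."""
    order = torch.argsort(target, descending=True)
    s = pred[order]
    # negative log-likelihood of the observed ordering under Plackett-Luce
    return (torch.logcumsumexp(s.flip(0), dim=0).flip(0) - s).sum()

def align_loss(u_t, text_emb, tau=0.07):
    """InfoNCE text-alignment loss (L_align): pull each probe embedding
    u_t toward the frozen CLIP text embedding of its own prompt, using
    the other prompts in the batch as negatives."""
    u = F.normalize(u_t, dim=-1)
    c = F.normalize(text_emb, dim=-1)
    logits = u @ c.t() / tau
    labels = torch.arange(len(u))
    return F.cross_entropy(logits, labels)

# total objective, as in the paper: L = L_list + lambda * L_align
```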
Selective Sampling Procedure
During inference, the model runs for a fraction η of the total steps (e.g., η = 0.2). The probe scores each seed; only the top K ≪ N seeds are continued to completion. The expected computation cost becomes Cost ≈ η + (1 − η)·K/N, yielding a >60 % reduction when η = 0.2, K = 1, and N = 5.
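The pruning step and its cost model are simple enough to write out directly. The helper names `expected_cost` and `select_top_k` below are hypothetical, but the arithmetic matches the formula above.

```python
def expected_cost(eta, k, n):
    """Expected denoising cost relative to generating all n seeds fully:
    every seed runs the first eta fraction of the trajectory, and only
    the top-k survivors run the remaining (1 - eta) fraction."""
    return eta + (1 - eta) * k / n

def select_top_k(scores, k):
    """Indices of the k seeds with the highest early probe scores."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# e.g. eta = 0.2, k = 1, n = 5 gives 0.2 + 0.8 * 0.2 = 0.36,
# i.e. roughly a 64% reduction in denoising compute
```

This matches the ~0.36 relative cost reported in the experiments for top-1-of-5 pruning.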
Experimental Evaluation
The method is evaluated on four backbones: Stable Diffusion 2, Stable Diffusion 3.5 (Medium and Large), and FLUX.1‑dev, using MS‑COCO captions. Across eight external evaluators, early predictions at 20 % of the trajectory achieve median Spearman correlations of ≥ 0.7, and up to 0.98‑0.99 for BLIP‑ITM and ImageReward. When pruning to the top‑1 of five candidates, the expected denoising cost drops to ~0.36 of the full cost while the average ImageReward score rises dramatically (e.g., from 0.49 to 1.59 on SD2). Similar gains are observed for HPSv2.1 and other metrics, demonstrating that early structural cues reliably guide selective generation.
Related Work and Distinction
Prior efficiency work reduces the number of diffusion steps or the per‑step cost, while methods like HEaD use task‑specific attention maps to detect hallucinations. Probe‑Select differs by providing a general‑purpose, model‑agnostic early quality estimator that works for any downstream evaluator and does not require retraining the generator.
Conclusion and Outlook
Probe‑Select reframes quality evaluation from a post‑hoc operation to an online, compute‑aware process. By leveraging early latent features, it achieves substantial speed‑ups without sacrificing—and often improving—final image quality. Its plug‑in nature enables straightforward integration into existing pipelines and opens avenues for adaptive stopping, dynamic guidance scaling, and user‑personalized quality criteria in large‑scale text‑to‑image generation.