Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation

Reading time: 5 minutes

📝 Original Info

  • Title: Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation
  • ArXiv ID: 2512.03996
  • Date: 2025-12-03
  • Authors: Hang Xu¹, Linjiang Huang², Feng Zhao¹* (¹ MoE Key Lab of BIPC, University of Science and Technology of China (USTC); ² Beihang University; * corresponding author)

📝 Abstract

Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, while the impact of the stochastic characteristics of noise in T2I diffusion models on these methods' performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new form of randomness for TTS: text embedding perturbation, which couples with existing randomness such as SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these forms of randomness and their impact on generation, and find that the two exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embeddings demonstrate varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation; (2) adapting the perturbation intensity selectively based on frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at [https://github.com/xuhang07/TEP-Diffusion](https://github.com/xuhang07/TEP-Diffusion).
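The abstract's first key design, step-based text embedding perturbation, can be illustrated with a minimal sketch. The paper's actual frequency-guided schedule is not specified here; the linear ramp below is a hypothetical stand-in that captures the stated intuition that later denoising steps shape high-frequency detail and therefore receive stronger perturbation:

```python
import numpy as np

def perturb_text_embedding(text_emb, step, num_steps, sigma_max=0.1, rng=None):
    """Illustrative step-based text embedding perturbation.

    The schedule here is a hypothetical linear ramp (not the paper's exact
    frequency-guided schedule): strength is 0 at the first denoising step and
    grows to sigma_max at the last step, reflecting the finding that text
    embedding perturbation matters most for late, high-frequency steps.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Linear ramp: sigma = 0 at step 0, sigma_max at step num_steps - 1.
    sigma = sigma_max * (step / max(num_steps - 1, 1))
    return text_emb + sigma * rng.standard_normal(text_emb.shape)
```

In a real pipeline this would be applied to the prompt embedding fed to the denoiser at each step, alongside the usual spatial (latent) noise.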

💡 Deep Analysis

Figure 1. Top: Comparison of text embedding perturbation with previous randomness. Bottom: The corresponding generated images of the Top. Our method is plug-and-play.

📄 Full Content

Code is available at https://github.com/xuhang07/TEP-Diffusion.

1. Introduction

Diffusion models start from random noise and have demonstrated impressive generative capabilities in text-to-image (T2I) generation. However, due to the inconsistency in their training-inference paradigm, where multiple noise-to-data mappings are learned during training but only a single noise is used during inference, the full potential of diffusion models in generation remains untapped. Therefore, inspired by the test-time scaling (TTS) techniques in LLMs [11, 15], many researchers aim to enhance the generation quality of diffusion models by scaling inference computation [13, 23]. Specifically, these TTS methods rely on the sampling randomness of diffusion models (like initial noise) to generate multiple candidate samples, evaluate them using reward models, and then employ search strategies to select and further refine the candidates. The core components of TTS methods are therefore randomness, search strategies, and reward models.

Figure 1. Top: Comparison of text embedding perturbation with previous randomness. Bottom: The corresponding generated images of the Top. Our method is plug-and-play.

Research on search strategies and reward models has dominated TTS methods for T2I diffusion models, while randomness and its impact on these methods remain unexplored. Notably, randomness directly affects the size of the search space in TTS methods [32]. However, most existing works rely solely on spatial random noise introduced in latent space (i.e., SDE), which may not provide a sufficiently large search space. A constrained search space means repeated sampling tends to converge on similar and redundant candidates, leading to ineffective use of computational resources [18]. It is therefore meaningful to explore a new form of randomness that can both enhance generative diversity and complement existing spatial noise randomness.

In this paper, we explore a new form of randomness, text embedding perturbation, for TTS methods in T2I diffusion models. While recent studies have utilized text embedding perturbation to generate more diverse images [20], they struggle to maintain visual quality and text faithfulness, making them unsuitable for TTS methods (see Fig. 2). Our experimental analysis attributes this limitation to two key factors: (1) poor complementarity between t
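The introduction's description of the TTS loop, sampling N candidates from independent randomness, scoring each with a reward model, and keeping the best, can be sketched as a best-of-N search. The `generate` and `reward` callables below are placeholders for a T2I diffusion pipeline and a reward model; the optional `perturb` hook is where a text embedding perturbation like the paper's would plug in:

```python
import numpy as np

def best_of_n_sample(generate, reward, prompt_emb, n=4, perturb=None, rng=None):
    """Minimal best-of-N test-time scaling loop (illustrative).

    Draws n candidates from independent initial noise, optionally perturbing
    the text embedding per sample to enlarge the search space, scores each
    candidate with a reward model, and returns the highest-scoring one.
    """
    rng = np.random.default_rng() if rng is None else rng
    best_img, best_score = None, -np.inf
    for _ in range(n):
        # Optionally vary the conditioning itself, not just the spatial noise.
        emb = perturb(prompt_emb, rng) if perturb is not None else prompt_emb
        noise = rng.standard_normal((4, 64, 64))  # latent-shaped initial noise (illustrative)
        img = generate(emb, noise)
        score = reward(img)
        if score > best_score:
            best_img, best_score = img, score
    return best_img, best_score
```

More elaborate TTS methods replace this flat argmax with tree or beam search over intermediate denoising states, but the role of randomness as the source of the candidate pool is the same.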


Reference

This content is AI-processed based on open access ArXiv data.
