Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation

Reading time: 5 minutes

📝 Original Info

  • Title: Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation
  • ArXiv ID: 2512.03996
  • Date: 2025-12-03
  • Authors: Hang Xu¹, Linjiang Huang², Feng Zhao¹* (¹ MoE Key Lab of BIPC, University of Science and Technology of China (USTC); ² Beihang University; * corresponding author)

📝 Abstract

Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, while the impact of the stochastic characteristics of noise in T2I diffusion models on these methods' performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new form of randomness for TTS: text embedding perturbation, which couples with existing randomness such as SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these forms of randomness and their impact on generation, and find that the two exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embeddings demonstrate varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation; (2) adapting the perturbation intensity selectively based on frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at [https://github.com/xuhang07/TEP-Diffusion](https://github.com/xuhang07/TEP-Diffusion).
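The abstract's first key design, step-based text embedding perturbation, can be illustrated with a minimal sketch. The paper's actual frequency-guided schedule is not specified here; the linear ramp below is a hypothetical stand-in that captures the stated intuition that later denoising steps shape high-frequency detail and therefore receive stronger perturbation:

```python
import numpy as np

def perturb_text_embedding(text_emb, step, num_steps, sigma_max=0.1, rng=None):
    """Illustrative step-based text embedding perturbation.

    The schedule here is a hypothetical linear ramp (not the paper's exact
    frequency-guided schedule): strength is 0 at the first denoising step and
    grows to sigma_max at the last step, reflecting the finding that text
    embedding perturbation matters most for late, high-frequency steps.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Linear ramp: sigma = 0 at step 0, sigma_max at step num_steps - 1.
    sigma = sigma_max * (step / max(num_steps - 1, 1))
    return text_emb + sigma * rng.standard_normal(text_emb.shape)
```

In a real pipeline this would be applied to the prompt embedding fed to the denoiser at each step, alongside the usual spatial (latent) noise.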

💡 Deep Analysis

Figure 1. Top: Comparison of text embedding perturbation with previous randomness. Bottom: The corresponding generated images of the Top. Our method is plug-and-play.

📄 Full Content

Code is available at https://github.com/xuhang07/TEP-Diffusion.

1. Introduction

Diffusion models start from random noise and have demonstrated impressive generative capabilities in text-to-image (T2I) generation. However, due to the inconsistency in their training-inference paradigm, where multiple noise-to-data mappings are learned during training but only a single noise is used during inference, the full potential of diffusion models in generation remains untapped. Therefore, inspired by the test-time scaling (TTS) techniques in LLMs [11, 15], many researchers aim to enhance the generation quality of diffusion models by scaling inference computation [13, 23]. Specifically, these TTS methods rely on the sampling randomness of diffusion models (like initial noise) to generate multiple candidate samples, evaluate them using reward models, and then employ search strategies to select and further refine the candidates. The core components of TTS methods are therefore randomness, search strategies, and reward models.

Figure 1. Top: Comparison of text embedding perturbation with previous randomness. Bottom: The corresponding generated images of the Top. Our method is plug-and-play.

Research on search strategies and reward models has dominated TTS methods for T2I diffusion models, while randomness and its impact on these methods remain unexplored. Notably, randomness directly affects the size of the search space in TTS methods [32]. However, most existing works rely solely on spatial random noise introduced in latent space (i.e., SDE), which may not provide a sufficiently large search space. A constrained search space means repeated sampling tends to converge on similar and redundant candidates, leading to ineffective use of computational resources [18]. It is therefore meaningful to explore a new form of randomness that can both enhance generative diversity and complement existing spatial noise randomness.

In this paper, we explore a new form of randomness, text embedding perturbation, for TTS methods in T2I diffusion models. While recent studies have utilized text embedding perturbation to generate more diverse images [20], they struggle to maintain visual quality and text faithfulness, making them unsuitable for TTS methods (see Fig. 2). Our experimental analysis attributes this limitation to two key factors: (1) poor complementarity between t
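The introduction's description of the TTS loop, sampling N candidates from independent randomness, scoring each with a reward model, and keeping the best, can be sketched as a best-of-N search. The `generate` and `reward` callables below are placeholders for a T2I diffusion pipeline and a reward model; the optional `perturb` hook is where a text embedding perturbation like the paper's would plug in:

```python
import numpy as np

def best_of_n_sample(generate, reward, prompt_emb, n=4, perturb=None, rng=None):
    """Minimal best-of-N test-time scaling loop (illustrative).

    Draws n candidates from independent initial noise, optionally perturbing
    the text embedding per sample to enlarge the search space, scores each
    candidate with a reward model, and returns the highest-scoring one.
    """
    rng = np.random.default_rng() if rng is None else rng
    best_img, best_score = None, -np.inf
    for _ in range(n):
        # Optionally vary the conditioning itself, not just the spatial noise.
        emb = perturb(prompt_emb, rng) if perturb is not None else prompt_emb
        noise = rng.standard_normal((4, 64, 64))  # latent-shaped initial noise (illustrative)
        img = generate(emb, noise)
        score = reward(img)
        if score > best_score:
            best_img, best_score = img, score
    return best_img, best_score
```

More elaborate TTS methods replace this flat argmax with tree or beam search over intermediate denoising states, but the role of randomness as the source of the candidate pool is the same.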


Reference

This content is AI-processed based on open access ArXiv data.
