MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

Reading time: 5 minutes
...

📝 Original Info

  • Title: MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture
  • ArXiv ID: 2512.19311
  • Date: 2025-12-22
  • Authors: Hui Li¹, Jiayue Lyu¹, Fu‑Yun Wang², Kaihui Cheng¹, Siyu Zhu¹⁴⁵, Jingdong Wang³ (¹Fudan University, ²The Chinese University of Hong Kong, ³Baidu, ⁴Shanghai Innovation Institute, ⁵Shanghai Academy of AI for Science)

📝 Abstract

This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving diffusion models. During training, the input of the prediction network at a training timestep is the corresponding ground-truth noisy data, i.e., an interpolation of the noise and the data; during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named the slowed interpolation mixture, for post-training the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Our approach MixFlow over the RAE models achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 × 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 × 512.
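The ground-truth noisy data in the abstract is a linear interpolation under the convention stated in Figure 1 (noise at timestep 0, data at timestep 1). The snippet below is a minimal, illustrative sketch, not the paper's code: it contrasts the training-time input (a ground-truth interpolation) with a testing-time input produced by an imperfect one-step sampler. The toy `velocity_estimate` is an assumption used only to make the training-testing gap visible.

```python
import numpy as np

def interpolate(noise, data, t):
    """Ground-truth noisy data at timestep t: x_t = (1 - t) * noise + t * data
    (timestep 0 = pure noise, timestep 1 = data, per Figure 1's convention)."""
    return (1.0 - t) * noise + t * data

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, size=4)   # stand-in for one training example
noise = rng.normal(size=4)

t = 0.7
# Training-time input: the ground-truth interpolation at timestep t.
x_train = interpolate(noise, data, t)

# Testing-time input: whatever the sampler has generated by timestep t.
# Here a single Euler step with an imperfect velocity estimate (the exact
# velocity of the linear interpolation would be data - noise); the gap
# between x_test and x_train is the training-testing discrepancy.
velocity_estimate = (data - noise) + 0.3 * rng.normal(size=4)
x_test = noise + t * velocity_estimate

print("training input:", x_train)
print("testing input :", x_test)
print("gap (L2 norm) :", np.linalg.norm(x_test - x_train))
```

During sampling this gap is incurred at every step rather than once, which is why the paper ties it to error accumulation and sampling drift.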

💡 Deep Analysis

Figure 1 (Slow Flow phenomenon and MixFlow training; the full caption is reproduced in the Full Content below).

📄 Full Content

MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

Hui Li¹, Jiayue Lyu¹, Fu-Yun Wang², Kaihui Cheng¹, Siyu Zhu¹,⁴,⁵, Jingdong Wang³
¹Fudan University, ²The Chinese University of Hong Kong, ³Baidu, ⁴Shanghai Innovation Institute, ⁵Shanghai Academy of AI for Science
https://mixflowgen.github.io/

Abstract

This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving diffusion models. During training, the input of the prediction network at a training timestep is the corresponding ground-truth noisy data, i.e., an interpolation of the noise and the data; during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named the slowed interpolation mixture, for post-training the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Our approach MixFlow over the RAE models achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 × 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 × 512.

1. Introduction

We study the training-testing discrepancy problem [32, 36], also known as exposure bias [11, 20, 22, 27, 28], for diffusion and flow matching models. During training, diffusion models learn a prediction network, where the input to the prediction network at each training timestep is the corresponding ground-truth noisy data, i.e., an interpolation of the noise and the data. During testing, the input to the prediction network is the generated noisy data. The difference between the inputs to the prediction network for training and testing, i.e., the training-testing discrepancy, is one of the reasons leading to the prediction discrepancy and accordingly the problems of error accumulation and sampling drift.

Figure 1. Illustrating (1) the Slow Flow phenomenon during the sampling process: the timestep (y-axis), corresponding to the ground-truth noisy data that is nearest to the generated noisy data at the sampling timestep t (x-axis), is slower (with higher noise), i.e., the shaded area is under the line x = y; and (2) the effectiveness of MixFlow training: the range of slowed timesteps for (b) MixFlow training is smaller and closer to the sampling timesteps than for (a) standard training, indicating that MixFlow training effectively alleviates the training-testing discrepancy. The boundary of the shaded area in (b) is plotted as blue lines in (a). Note: x-axis - the sampling timestep at which the noisy data is generated; y-axis - the slowed timestep corresponding to the ground-truth noisy data that is nearest to the generated noisy data; shaded area - the range (the vertical line) of slowed timesteps at each sampling step; noise corresponds to timestep 0, and data corresponds to timestep 1. The slowed timestep ranges are obtained from 20,000 training images in ImageNet [1], 50 sampling steps, and SiT-B [26]. Details on how to plot the figures are provided in Appendix A.

There are two main lines of solutions to alleviating the discrepancy problem. One line is to modify the training procedure [11, 27]. For example, Input Perturbation [27] conducts an input perturbation on the ground-truth noisy data, and self-forcing [11] uses the generated noisy data as the input. The other line is to modify the sampling process [20, 28, 33, 53]. For example, Epsilon Scaling [28] scales the predicted noise during sampling, and Time-Shift Sampler [20] shifts the sampling timestep for the next sampling iteration. In this paper, we are interested in the former line and present a novel training procedure.

Our approach is motivated by the Slow Flow phenomenon about the generated noisy data and its nearest ground-truth noisy data. Figure 1 illustrates the Slow Flow phenomenon. The nearest ground-truth noisy data to the generated noisy data at the sampling timestep t is observed to correspond to a higher-noise timestep, called the slowed timestep m_t. This intuitively means that the generated noisy data is slower than the ground-truth noisy data, i.e., the higher-noise timestep is slower than the sampling timestep (m_t ≤ t). In addition, the range of the timestep difference is larger for a greater sampling timestep t, meaning that the slowed timestep at a greater timestep t is possibly more different from the sampling timestep t. In light of the
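The slowed timestep plotted in Figure 1 is, per the text above, the timestep of the ground-truth interpolation nearest to the generated noisy data. The paper's exact measurement protocol is given in its Appendix A and is not reproduced here; the sketch below is only one plausible reading, with a toy biased Euler sampler standing in for a trained SiT model, to show how an m_t ≤ t could be computed per sampling step.

```python
import numpy as np

def interpolate(noise, data, s):
    # Ground-truth noisy data at timestep s (0 = noise, 1 = data).
    return (1.0 - s) * noise + s * data

def slowed_timestep(x_generated, noise, data, num_candidates=1001):
    """Timestep of the nearest ground-truth interpolation:
    argmin over s of || x_generated - x_s ||."""
    candidates = np.linspace(0.0, 1.0, num_candidates)
    dists = [np.linalg.norm(x_generated - interpolate(noise, data, s))
             for s in candidates]
    return candidates[int(np.argmin(dists))]

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, size=16)
noise = rng.normal(size=16)

# Toy sampler: 50 Euler steps with a slightly biased velocity estimate,
# standing in for a learned prediction network; the accumulated error
# makes the generated sample lag behind the ground-truth flow.
t_grid = np.linspace(0.0, 1.0, 51)
x = noise.copy()
for t_prev, t_next in zip(t_grid[:-1], t_grid[1:]):
    v = 0.9 * (data - noise) + 0.05 * rng.normal(size=16)  # imperfect v(x, t)
    x = x + (t_next - t_prev) * v
    m_t = slowed_timestep(x, noise, data)
    # Typically m_t <= t_next here: the generated sample is "slower"
    # (noisier) than the sampling timestep -- the Slow Flow phenomenon.

print(f"final sampling timestep t = {t_grid[-1]:.2f}, slowed timestep m_t = {m_t:.2f}")
```

Under these toy assumptions the printed m_t falls below the sampling timestep, matching the qualitative picture in Figure 1(a); MixFlow's slowed interpolation mixture then reuses interpolations at such slowed timesteps for post-training.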

Reference

This content is AI-processed based on open access ArXiv data.
