MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture
Hui Li¹, Jiayue Lyu¹, Fu-Yun Wang², Kaihui Cheng¹, Siyu Zhu¹ ⁴ ⁵, Jingdong Wang³
¹Fudan University  ²The Chinese University of Hong Kong  ³Baidu  ⁴Shanghai Innovation Institute  ⁵Shanghai Academy of AI for Science
https://mixflowgen.github.io/
Abstract
This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving diffusion models. During training, the input to the prediction network at a given training timestep is the corresponding ground-truth noisy data, i.e., an interpolation of the noise and the data; during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named the slowed interpolation mixture, for post-training the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. MixFlow applied to the RAE models achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 × 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 × 512.
1. Introduction
We study the training-testing discrepancy problem [32, 36], also known as exposure bias [11, 20, 22, 27, 28], for diffusion and flow matching models. During training, diffusion models learn a prediction network, where the input to the prediction network at each training timestep is the corresponding ground-truth noisy data, i.e., an interpolation of the noise and the data. During testing, the input to the prediction network is the generated noisy data. This difference between the training and testing inputs to the prediction network, i.e., the training-testing discrepancy, is one of the causes of the prediction discrepancy and, in turn, of error accumulation and sampling drift.
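To make the discrepancy concrete, the sketch below contrasts the two inputs under the linear interpolant implied by Figure 1 (noise at timestep 0, data at timestep 1). The velocity parameterization, the `model` callable, and the Euler sampler are illustrative assumptions for this sketch, not the paper's exact setup.

```python
# Minimal sketch of the training/testing input discrepancy.
# Assumption: linear interpolant x_t = t * x + (1 - t) * eps, noise at t = 0,
# data at t = 1, and a velocity-prediction network (not the paper's exact API).
import torch

def training_input(x, t):
    """Training-time input: ground-truth interpolation of data x and fresh noise."""
    eps = torch.randn_like(x)
    return t * x + (1.0 - t) * eps

@torch.no_grad()
def euler_sample(model, x0, num_steps=50):
    """Testing-time input: each step consumes the previously generated noisy data,
    not a ground-truth interpolation -- the source of exposure bias."""
    x = x0  # pure noise at t = 0
    ts = torch.linspace(0.0, 1.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = model(x, t_cur)            # prediction on generated (not ground-truth) data
        x = x + (t_next - t_cur) * v   # Euler update toward the data at t = 1
    return x
```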
Figure 1. Illustrating (1) the Slow Flow phenomenon during the sampling process: the timestep (y-axis) corresponding to the ground-truth noisy data that is nearest to the generated noisy data at the sampling timestep t (x-axis) is slower (with higher noise), i.e., the shaded area lies under the line x = y; and (2) the effectiveness of MixFlow training: the range of slowed timesteps for (b) MixFlow training is smaller and closer to the sampling timesteps than for (a) standard training, indicating that MixFlow training effectively alleviates the training-testing discrepancy. The boundary of the shaded area in (b) is plotted as blue lines in (a). Note: x-axis - the sampling timestep at which the noisy data is generated; y-axis - the slowed timestep corresponding to the ground-truth noisy data that is nearest to the generated noisy data; shaded area - the range (the vertical line) of slowed timesteps at each sampling step; noise corresponds to timestep 0, and data corresponds to timestep 1. The slowed timestep ranges are obtained from 20,000 training images in ImageNet [1], 50 sampling steps, and SiT-B [26]. Details on how to plot the figures are provided in Appendix A.
There are two main lines of solutions for alleviating the discrepancy problem. One line modifies the training procedure [11, 27]. For example, Input Perturbation [27] perturbs the ground-truth noisy data at the input, and Self-Forcing [11] uses the generated noisy data as the input. The other line modifies the sampling process [20, 28, 33, 53]. For example, Epsilon Scaling [28] scales the predicted noise during sampling, and Time-Shift Sampler [20] shifts the sampling timestep for the next sampling iteration. In this paper, we are interested in the former line and present a novel training procedure.
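As a rough illustration of the first line of work, the sketch below perturbs the ground-truth interpolation fed to the network while leaving the regression target unchanged. The perturbation scale `gamma` and the velocity target are assumptions for this example, not the exact formulation of [27].

```python
# Hedged sketch of the input-perturbation idea: feed the network a slightly
# corrupted version of the ground-truth interpolation so that training inputs
# look more like imperfect sampled states. `gamma` is a hypothetical scale.
import torch

def perturbed_training_input(x, t, gamma=0.1):
    eps = torch.randn_like(x)
    x_t = t * x + (1.0 - t) * eps                 # standard ground-truth interpolation
    x_t_perturbed = x_t + gamma * torch.randn_like(x)  # extra noise on the input only
    target = x - eps                              # velocity target (linear interpolant)
    return x_t_perturbed, target                  # target is left unchanged
```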
Our approach is motivated by the Slow Flow phenomenon concerning the generated noisy data and its nearest ground-truth noisy data. Figure 1 illustrates the Slow Flow phenomenon. The nearest ground-truth noisy data to the generated noisy data at the sampling timestep t is observed to correspond to a higher-noise timestep, called the slowed timestep mt. This intuitively means that the generated noisy data is slower than the ground-truth noisy data, i.e., the higher-noise timestep is slower than the sampling timestep (mt ≤ t). In addition, the range of the timestep difference grows with the sampling timestep t, meaning that the slowed timestep at a greater timestep t can deviate more from the sampling timestep t.
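A minimal sketch of how a slowed timestep could be measured for a single sample is given below: scan a grid of candidate timesteps and pick the ground-truth interpolation nearest (in L2 distance) to the generated noisy data. The grid search, the single data-noise pair, and the function names are assumptions for illustration; the paper's actual measurement procedure is described in Appendix A.

```python
# Sketch of measuring the slowed timestep m_t for one generated sample,
# assuming the same linear interpolant as above. Names are illustrative.
import torch

def slowed_timestep(x_gen, x_data, eps, num_candidates=1000):
    """Return the candidate timestep whose ground-truth interpolation of
    (x_data, eps) is nearest to the generated noisy data x_gen."""
    ts = torch.linspace(0.0, 1.0, num_candidates)
    dists = torch.stack([
        torch.norm(x_gen - (s * x_data + (1.0 - s) * eps))  # L2 distance at timestep s
        for s in ts
    ])
    return ts[dists.argmin()]  # empirically lies at or below the sampling timestep t
```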
In light of the