MeanFlow promises high-quality generative modeling in just a few sampling steps by jointly learning instantaneous and average velocity fields. Yet the underlying training dynamics remain poorly understood. We analyze the interaction between the two velocities and find: (i) a well-established instantaneous velocity is a prerequisite for learning the average velocity; (ii) learning of the instantaneous velocity benefits from the average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that first accelerates the formation of the instantaneous velocity and then shifts emphasis from short- to long-interval average velocities. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: with the same DiT-XL backbone, our method reaches an FID of 2.87 on 1-NFE ImageNet 256×256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5× shorter training time, or with a smaller DiT-L backbone.
Diffusion models [28,61,64] and Flow Matching [2,39,40] have achieved state-of-the-art results across image [4,48,75], video [1,17,70], and 3D generation [21,22]. However, a persistent weakness remains: denoising relies on many small iteration steps, making sampling computationally expensive [25]. Higher-order samplers [13,32,43,44,54,62,80] partially alleviate this, though achieving high fidelity with fewer than 10 steps remains a challenge. Consequently, recent work has focused on models that enable inference in a few steps, or even a single step. Early approaches distill few-step generative models from pretrained multi-step diffusion models, using direct [45,56,81], adversarial [59,60,76], or score-based supervision [47,77,86]. This two-stage design increases complexity, requires two distinct training processes, and often depends on large-scale synthetic data generation [40,45] or on propagation through teacher-student cascades [49,56].
Consistency models [66] represent a step towards one-stage, end-to-end training by enforcing consistent outputs for all samples drawn along the same denoising trajectory. Despite various improvements [19,26,34,42,63,71,72], a substantial performance gap remains between few-step consistency models and multi-step diffusion models. More recent research [6,7,16,34,55,74,85] has proposed to characterize diffusion/flow quantities along two distinct time indices. Among these attempts, MeanFlow [18] stands out as a stable end-to-end training scheme that markedly narrows the gap between one-step and multi-step generation.
The key to MeanFlow’s success is the idea of exploiting the intrinsic relationship between the instantaneous velocity (defined at a single time point) and the average velocity (integrated over a time interval), such that a single network learns both simultaneously. However, MeanFlow training is computationally expensive and has so far been analyzed only superficially. In particular, it remains poorly understood how the two coupled velocity fields interact during learning, and how their interplay can be coordinated to achieve high-quality one-step generation.
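For concreteness, MeanFlow [18] defines the average velocity u over an interval [r, t] as the time average of the instantaneous velocity v, and obtains its training target by differentiating this definition with respect to t. The following restates those two equations (our paraphrase of the cited paper, with z_t denoting the flow state at time t):

```latex
% Average velocity over [r, t], as defined in MeanFlow [18]:
\[
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\,\mathrm{d}\tau
\]
% Differentiating w.r.t. t yields the MeanFlow identity used as the training target:
\[
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\,u(z_t, r, t),
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\,u \;=\; v(z_t, t)\,\partial_{z} u \;+\; \partial_t u .
\]
```

The quantity t − r is the “temporal gap” referred to throughout this paper; the identity couples the two fields, since the target for u depends on v (and vice versa as the gap shrinks, where u approaches v).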
Here, we investigate these learning dynamics and develop a training strategy that greatly improves both generation quality and efficiency. Through controlled experiments, we determine that: (i) the instantaneous velocity must be established early in training, because it provides the foundation for learning the average velocity: if instantaneous velocities are poorly formed or corrupted, learning average velocities fails altogether; (ii) conversely, the time interval over which average velocities are computed (the “temporal gap”) critically determines how they impact the learning of instantaneous velocities: small gaps facilitate instantaneous velocity formation and refinement, while large gaps destabilize it; (iii) task-affinity analysis reveals that training should initially focus on small-gap average velocities, which lay the foundation for learning the large-gap average velocities required for one-step generation.
Standard MeanFlow training ignores these subtle but impactful dynamics. Throughout training, it applies the same fixed loss function and sampling scheme, disregarding the complex dependencies between the two velocity fields. This naive objective interferes with the early formation of reasonable instantaneous velocities, which in turn delays the learning of average velocities. Ultimately, the current training practice significantly degrades overall performance relative to what is achievable with a given model and dataset, and also slows down training.
To remedy these issues, we propose a simple yet effective extension of MeanFlow training. To quickly establish reasonable instantaneous velocities, we adopt acceleration techniques from diffusion training [10,20,25,35,73]. To support the learning of correct average velocities, we design a progressive weighting scheme. In early training stages, the weighting prioritizes small gaps, which reinforces instantaneous velocity formation and prepares the ground for large-gap learning. As training progresses, the weighting gradually transitions to equal weighting across all gap lengths, ensuring accurate average velocities over large gaps, which are the vital ingredient for few-step inference.
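To make the schedule concrete, the sketch below shows one way such a progressive gap weighting could be implemented. The function name, the exponential form of the early small-gap preference, and the linear annealing towards uniform weights are our illustrative assumptions; the paper specifies only that the weighting shifts from small-gap emphasis to equal weighting over all gaps.

```python
import torch

def gap_weights(r, t, step, total_steps, tau0=4.0):
    """Per-sample loss weights as a function of the temporal gap (t - r).

    Early in training (step ~ 0), small gaps receive most of the weight;
    the emphasis decays towards uniform weighting as training proceeds.
    `tau0` (sharpness of the early small-gap preference) and the linear
    annealing schedule are illustrative choices, not the authors' exact scheme.
    """
    gap = (t - r).abs()                  # temporal gap, assumed in [0, 1]
    progress = step / total_steps        # training progress in [0, 1]
    tau = tau0 * (1.0 - progress)        # anneal sharpness to 0 (uniform)
    w = torch.exp(-tau * gap)            # small gaps favored while tau > 0
    return w / w.mean()                  # normalize to keep the loss scale stable

# Schematic usage inside a MeanFlow-style training step
# (sample_time_pair and per_sample_meanflow_loss are hypothetical helpers):
# r, t = sample_time_pair(batch_size)
# w = gap_weights(r, t, step, total_steps)
# loss = (w * per_sample_meanflow_loss(model, z, r, t)).mean()
```

Normalizing the weights to unit mean keeps the overall loss magnitude roughly constant as the schedule anneals, so the reweighting changes only the relative emphasis across gaps, not the effective learning rate.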
Empirically, the enhanced training protocol substantially improves generation results and also accelerates convergence. On the standard 1-NFE ImageNet [11] 256×256 benchmark, we improve the FID of MeanFlow-XL from 3.43 to 2.87 (see Fig. 1). To reach the performance of conventional training, our improved scheme needs 2.5× fewer iterations. Remarkably, it can even match that performance with a smaller DiT-L backbone.
In a broader context, our work shows that there is still substantial untapped potential to accelerate recent few-step generative models. With a better understanding of the underlying training dynamics, further gains in both quality and efficiency appear within reach.