Stable Velocity: A Variance Perspective on Flow Matching
While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthens auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet $256\times256$ and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than $2\times$ faster sampling within the low-variance regime without degrading sample quality. Our code is available at https://github.com/linYDTHU/StableVelocity.
💡 Research Summary
Flow matching and stochastic interpolants provide a unified continuous‑time framework for transforming a simple prior distribution into a complex data distribution. Conditional Flow Matching (CFM) trains a neural network to predict the conditional velocity field vₜ(xₜ|x₀), which is an unbiased Monte‑Carlo estimator of the true marginal velocity vₜ(xₜ). In practice, however, CFM uses a single data sample x₀ to compute the target, and this single‑sample estimate can have very high variance, especially near the prior, where the posterior pₜ(x₀|xₜ) spreads over many data points. The authors first quantify this variance by the average trace of the conditional velocity covariance, V_CFM(t). Empirical measurements on Gaussian mixture models, CIFAR‑10, and ImageNet‑256 latent codes reveal a clear two‑regime structure: a low‑variance regime near the data distribution (0 ≤ t < ξ) where V_CFM(t) ≈ 0, and a high‑variance regime near the prior (ξ ≤ t ≤ 1) where the variance grows rapidly. The split point ξ shifts toward 1 as dimensionality increases, enlarging the low‑variance window.
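The two‑regime structure is easy to reproduce on a toy problem. The sketch below is not the paper's exact setup: it assumes the linear interpolation path xₜ = (1−t)x₀ + tε with ε ~ N(0, I) (so vₜ(xₜ|x₀) = (xₜ − x₀)/t) and a small synthetic Gaussian‑mixture dataset, and estimates V_CFM(t) as the posterior trace‑covariance of the conditional velocity averaged over probe points xₜ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: a 2-component Gaussian mixture in 2D (illustrative stand-in
# for the paper's GMM experiments).
data = np.concatenate([
    rng.normal(loc=-2.0, scale=0.3, size=(500, 2)),
    rng.normal(loc=+2.0, scale=0.3, size=(500, 2)),
])

def v_cfm_variance(t, n_probe=200):
    """Monte-Carlo estimate of V_CFM(t): the posterior trace-covariance of the
    conditional velocity v_t(x_t|x0) = (x_t - x0)/t under the linear path
    x_t = (1 - t) * x0 + t * eps, averaged over probe points x_t ~ p_t."""
    # Draw probe points x_t from the marginal p_t.
    x0 = data[rng.integers(len(data), size=n_probe)]
    eps = rng.standard_normal((n_probe, 2))
    xt = (1.0 - t) * x0 + t * eps
    # Posterior weights p_t(x0^i | x_t) ∝ N(x_t; (1-t) x0^i, t^2 I), computed
    # over the whole dataset in the log domain for numerical stability.
    diff = xt[:, None, :] - (1.0 - t) * data[None, :, :]   # (probe, data, dim)
    logw = -0.5 * (diff ** 2).sum(-1) / t ** 2
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Conditional velocities for every candidate x0^i, then their posterior
    # mean and trace-covariance at each probe point.
    v = (xt[:, None, :] - data[None, :, :]) / t            # (probe, data, dim)
    mean_v = (w[..., None] * v).sum(axis=1)
    second = (w[..., None] * v ** 2).sum(axis=1)
    return float((second - mean_v ** 2).sum(axis=1).mean())

for t in (0.1, 0.3, 0.6, 0.9):
    print(f"t = {t:.1f}  V_CFM ≈ {v_cfm_variance(t):.3f}")
```

On this toy mixture the estimate is small near the data end (small t) and grows toward the prior end (large t), matching the two‑regime picture described above.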
To address the high‑variance regime, the paper introduces Stable Velocity Matching (StableVM). Instead of a single reference sample, a batch of n reference samples {x₀ⁱ}₁ⁿ is drawn from the data distribution. A composite conditional path is defined as a uniform mixture of the individual conditional distributions: p_GMMₜ(xₜ|{x₀ⁱ}) = (1/n)∑ₖpₜ(xₜ|x₀ᵏ). The training target becomes the self‑normalized importance‑weighted average of the conditional velocities:
bᵥ^StableVM(xₜ) = (∑ₖ pₜ(xₜ|x₀ᵏ) vₜ(xₜ|x₀ᵏ)) / (∑ⱼ pₜ(xₜ|x₀ʲ)).
The authors prove (Theorem 3.1) that this target is unbiased (its expectation equals vₜ(xₜ)) and that the global minimizer of the StableVM loss coincides with the true marginal velocity field. Moreover, Theorem 3.2 shows that the variance of the StableVM target is always less than or equal to that of CFM, and Theorem 3.3 establishes an O(1/n) decay rate under mild boundedness assumptions. Consequently, by increasing n the high‑variance regime can be tamed without altering the ultimate objective.
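The StableVM target above can be computed in a few lines. The sketch below assumes the linear path xₜ = (1−t)x₀ + tε with ε ~ N(0, I), so that pₜ(xₜ|x₀) = N(xₜ; (1−t)x₀, t²I) and vₜ(xₜ|x₀) = (xₜ − x₀)/t; the paper's general stochastic‑interpolant setting would swap in its own conditional density and velocity:

```python
import numpy as np

def stablevm_target(xt, x0_batch, t):
    """Self-normalized importance-weighted StableVM target (sketch).
    x0_batch holds the n reference samples {x0^k}; assumes the linear path,
    so p_t(x_t|x0) = N(x_t; (1-t) x0, t^2 I) and v_t(x_t|x0) = (x_t - x0)/t."""
    diff = xt - (1.0 - t) * x0_batch                  # (n, dim)
    logw = -0.5 * (diff ** 2).sum(axis=-1) / t ** 2   # log p_t(x_t|x0^k) + const
    w = np.exp(logw - logw.max())                     # shared constant cancels
    w /= w.sum()                                      # self-normalization
    v = (xt - x0_batch) / t                           # conditional velocities
    return (w[:, None] * v).sum(axis=0)               # weighted average

# With n = 1 the target reduces to the plain CFM target (x_t - x0)/t.
x0 = np.array([1.0, -1.0])
t = 0.6
xt = (1.0 - t) * x0 + t * np.array([0.5, 0.5])
cfm_equiv = stablevm_target(xt, x0[None, :], t)
# With n > 1 the target is a convex combination of the n conditional velocities.
mixed = stablevm_target(xt, np.stack([x0, -x0]), t)
```

Because the weights are normalized to sum to one, the target always lies in the convex hull of the n conditional velocities, which is the geometric intuition behind the variance reduction of Theorem 3.2.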
The second contribution is Variance‑Aware Representation Alignment (VA‑REPA). Prior work on representation alignment (REPA) adds an auxiliary semantic loss to accelerate diffusion training, but the authors observe that REPA is effective only when the diffusion time lies in the low‑variance regime. In the high‑variance regime the noisy state xₜ carries little semantic information, making the alignment ill‑posed. VA‑REPA therefore introduces a time‑dependent weighting function w(t) that adaptively strengthens the alignment loss in the low‑variance regime and attenuates it in the high‑variance regime, so that auxiliary supervision is applied only where xₜ still carries semantic structure.
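The summary does not give the exact form of w(t). A plausible sketch, assuming a smooth sigmoid gate around the split point ξ (both the functional form and the default values here are hypothetical, not the paper's), is:

```python
import math

def va_repa_weight(t, xi=0.7, sharpness=20.0):
    """Hypothetical VA-REPA weight: close to 1 in the low-variance regime
    (t < xi) and smoothly decaying toward 0 in the high-variance regime
    (t > xi). The paper's actual w(t) may differ; xi and sharpness are
    illustrative placeholders."""
    return 1.0 / (1.0 + math.exp(sharpness * (t - xi)))

# Illustrative use inside a training step: the alignment loss is gated by w(t),
#   loss = loss_flow + lam * va_repa_weight(t) * loss_repa
```

Any monotone gate with the same behavior (full alignment weight near the data, vanishing weight near the prior) would implement the stated idea.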