Optimal and Diffusion Transports in Machine Learning

Reading time: 5 minutes

📝 Original Info

  • Title: Optimal and Diffusion Transports in Machine Learning
  • ArXiv ID: 2512.06797
  • Date: 2025-12-07
  • Authors: Gabriel Peyré

📝 Abstract

Several problems in machine learning are naturally expressed as the design and analysis of time-evolving probability distributions. This includes sampling via diffusion methods, optimizing the weights of neural networks, and analyzing the evolution of token distributions across layers of large language models. While the targeted applications differ (samples, weights, tokens), their mathematical descriptions share a common structure. A key idea is to switch from the Eulerian representation of densities to their Lagrangian counterpart through vector fields that advect particles. This dual view introduces challenges, notably the non-uniqueness of Lagrangian vector fields, but also opportunities to craft density evolutions and flows with favorable properties in terms of regularity, stability, and computational tractability. This survey presents an overview of these methods, with emphasis on two complementary approaches: diffusion methods, which rely on stochastic interpolation processes and underpin modern generative AI, and optimal transport, which defines interpolation by minimizing displacement cost. We illustrate how both approaches appear in applications ranging from sampling, neural network optimization, to modeling the dynamics of transformers for large language models.


📄 Full Content

1 Introduction.

The goal of this survey is to expose the connection between optimal transport and evolutions over the space of probability measures, with emphasis on partial differential equations and generative models based on diffusion. The exposition is informal, favoring intuition over detailed proofs. We work over ℝ^d and highlight links with machine learning applications: the training dynamics of multilayer perceptrons, the modeling of transformers where tokens are transported, and flow matching approaches for generative models. This paper is organized around four core examples that illustrate how dynamical systems of probability measures connect with modern machine learning:

• Diffusion models for generative AI (§3). Generative models construct flows (α_t)_t between simple reference distributions and complex data distributions. Diffusion and flow matching models achieve state-of-the-art results, but raise fundamental questions on sample complexity, discretization errors, and geometric interpretation.

• Optimal transport and ML (§4). Rooted in Monge and Kantorovich, optimal transport provides a geometric framework for comparing and transforming measures. It has become a central tool in statistics and machine learning, with scalable entropic algorithms and dynamic formulations (Benamou–Brenier) enabling practical use in generative modeling and beyond (see the Sinkhorn sketch after this list).

• Optimization over measures (§5 and §6). Wasserstein gradient flows extend OT to optimization, interpreting nonlinear PDEs as steepest descents in Wasserstein space. This theory informs both PDE analysis (porous medium, crowd motion) and machine learning (training shallow neural networks as mean-field limits with convergence guarantees).

• Very deep transformers (§7). Transformer architectures process data as sets of tokens coupled through self-attention. Their depth can be modeled by continuous-time flows or PDEs over token distributions, linking attention dynamics to Vlasov- or aggregation-type equations. This measure-theoretic view sheds light on expressivity, clustering, and universality.
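
The entropic algorithms mentioned in the optimal transport item above can be summarized by the Sinkhorn iterations. The following is a minimal NumPy sketch, not code from the paper: the cost matrix, the regularization strength eps, the iteration count, and the helper sinkhorn are illustrative assumptions.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.5, n_iter=500):
    """Entropy-regularized OT between histograms a, b for a cost matrix C."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                  # rescale rows to match the marginal a
        v = b / (K.T @ u)                # rescale columns to match the marginal b
    P = u[:, None] * K * v[None, :]      # entropic transport plan
    return P, np.sum(P * C)              # plan and its transport cost

# Toy usage: two point clouds in R^2 with squared Euclidean cost.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 2))
y = rng.standard_normal((60, 2)) + 2.0
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
a, b = np.full(50, 1 / 50), np.full(60, 1 / 60)
P, cost = sinkhorn(a, b, C)
print(P.sum(), cost)                     # total mass ≈ 1, approximate OT cost
```

Smaller eps brings the result closer to the unregularized optimal transport cost, at the price of more iterations (and, in practice, log-domain stabilization).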

Taken together, these examples reveal a common structure: learning systems can often be described as evolutions of probability measures under vector fields, sometimes with a variational or gradient-flow interpretation, sometimes not. What is at stake is a unified mathematical understanding of how such flows encode computation, optimization, and generalization in machine learning.

All the problems outlined in the introduction can be formulated by studying an evolution t → α_t ∈ P(ℝ^d), where α_t is a probability measure on ℝ^d with unit mass. Such an evolution can be described in a “Lagrangian” way as the advection of particles along a (time-dependent) vector field v_t(x) in ℝ^d. At the particle level, this advection is governed by

(1)  dx(t)/dt = v_t(x(t)),

such that x(0) is mapped to x(t) by a “transport” mapping T_t : x(0) → x(t). The fact that α_t is the density of advected particles implies α_t = (T_t)_♯ α_0. Here T_♯ is the “push-forward” operator between measures [75], which for discrete measures acts by transporting the support of the Dirac masses, i.e.

T_♯ ( (1/n) Σ_i δ_{x_i} ) = (1/n) Σ_i δ_{T(x_i)},

and this definition extends to arbitrary measures (for instance measures with densities). This means that, for the flows we consider, if α_0 = (1/n) Σ_i δ_{x_i(0)} represents a discrete point cloud, then α_t = (1/n) Σ_{i=1}^n δ_{x_i(t)}, where each x_i(t) solves (1). Another way to describe this evolution is to track the measure α_t itself. For instance, one may consider its density ρ_t = dα_t/dx with respect to the Lebesgue measure, provided it exists. In this case, ρ_t(x) represents the local concentration of particles around x. In this “Eulerian” representation, the ODE for particle trajectories becomes the PDE [4]

(2)  ∂_t α_t + div(α_t v_t) = 0,

where div(α_t v_t) is understood in the weak sense, or in terms of the density ρ_t whenever it exists. This PDE is commonly called the advection equation, the continuity equation, or Liouville’s equation when defined on a phase space. Finally, if α_0 = (1/n) Σ_i δ_{x_i(0)} is a discrete particle distribution, then α_t = (1/n) Σ_i δ_{x_i(t)} = (T_t)_♯ α_0 is again a sum of moving particles with velocities ẋ_i(t) = v_t(x_i(t)), and it satisfies the PDE only in the weak sense.
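
As a concrete illustration of the Lagrangian view, here is a minimal NumPy sketch (not code from the paper) that advects a discrete point cloud α_0 = (1/n) Σ_i δ_{x_i(0)} with an explicit Euler discretization of (1); the velocity field v and the step sizes are illustrative choices.

```python
import numpy as np

def v(t, x, m=np.array([2.0, 0.0])):
    """Time-dependent velocity field v_t(x); here a simple drift toward m."""
    return m - x

def advect(x0, t_max=1.0, n_steps=100):
    """Euler integration of dx/dt = v_t(x); returns the particles at time t_max."""
    x = x0.copy()
    dt = t_max / n_steps
    for k in range(n_steps):
        x = x + dt * v(k * dt, x)      # x(t + dt) ≈ x(t) + dt * v_t(x(t))
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((1000, 2))    # samples of α_0 = N(0, I_2)
xt = advect(x0)                        # samples of α_t = (T_t)_♯ α_0
print(xt.mean(axis=0))                 # empirical mean has drifted toward m
```

The same particles, viewed through their empirical measure (1/n) Σ_i δ_{x_i(t)}, solve the continuity equation (2) in the weak sense.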

A crucial point, central to both the practical and theoretical aspects of this review, is that for a given evolution α_t, there are infinitely many possible choices of vector fields v_t satisfying (2). This is because adding to v_t any field from the linear space H_α := {v : div(αv) = 0} does not affect the evolution of the measure. In other words, H_α consists of vector fields that leave α invariant. For example, if α is an isotropic measure such as a Gaussian N(0, Id_d), then all rotations T_t (and in fact a much richer class of transformations) leave α_t = (T_t)_♯ α_0 = α_0 unchanged. More generally, H_α is non-trivial since it corresponds to the kernel of a weighted divergence operator. In the Gaussian case, H_α includes the vector fields generating rotations, i.e. linear fields v(x) = Ax with A skew-symmetric.
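
A small numerical check of this non-uniqueness in the Gaussian case (an illustrative sketch, not code from the paper): the flow of the skew-symmetric linear field w(x) = Ax is the rotation exp(tA), and rotating samples of N(0, I_2) leaves their distribution unchanged, so adding w to any v_t does not alter the evolution of α_t.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((5000, 2))              # samples of α = N(0, I_2)

t = 1.3                                         # arbitrary time
A = np.array([[0.0, -1.0], [1.0, 0.0]])         # skew-symmetric generator of w(x) = A x
R = np.array([[np.cos(t), -np.sin(t)],          # exp(t A): rotation by angle t
              [np.sin(t),  np.cos(t)]])
y = x @ R.T                                     # particles advected by w up to time t

# Empirical moments are unchanged up to sampling error: (T_t)_♯ α = α.
print(np.round(x.mean(axis=0), 2), np.round(y.mean(axis=0), 2))
print(np.round(np.cov(x.T), 2))
print(np.round(np.cov(y.T), 2))
```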


Reference

This content is AI-processed based on open access ArXiv data.
