We consider a class of optimization problems on the space of probability measures motivated by the mean-field approach to studying neural networks. Such problems can be solved by constructing continuous-time gradient flows that converge to the minimizer of the energy function under consideration, and then implementing discrete-time algorithms that approximate the flow. In this work, we focus on the Fisher-Rao gradient flow and construct an interacting particle system that approximates the flow as its mean-field limit. We discuss the connection between the energy function, the gradient flow and the particle system, and explain different approaches to smoothing out the energy function with an appropriate kernel in a way that allows the particle system to be well-defined. We provide a rigorous proof of the existence and uniqueness of the kernelized flows obtained in this way, as well as a propagation of chaos result that provides a theoretical justification for using the corresponding kernelized particle systems as approximation algorithms in entropic mean-field optimization.
We consider the following optimization problem on the space of probability measures P(X) on X ⊂ R^d:

$$\inf_{m \in \mathcal{P}(\mathcal{X})} V^{\sigma}(m), \qquad V^{\sigma}(m) := F(m) + \sigma \, \mathrm{KL}(m \,|\, \pi), \tag{1}$$

where F : P(X) → R is a (possibly non-convex) functional bounded from below, σ > 0 is a regularization parameter, π ∈ P(X) is a fixed reference measure and KL denotes the relative entropy (the KL-divergence). While some general results in Section 2 will be stated for domains X ⊂ R^d which do not have to be compact, for the crucial examples studied in Section 3 we will additionally require X to be compact. In recent years, there has been considerable interest in problems of this type, motivated by the mean-field approach to the problem of training neural networks (see Mei et al. (2018) or (Hu et al., 2021, Section 3) and the references therein), as well as in the context of reinforcement learning, in policy optimization for entropy-regularized Markov Decision Processes with neural network approximation in the mean-field regime (Leahy et al., 2022; Lascu and Majka, 2025).
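For orientation, a standard fact in this literature (see, e.g., Hu et al. (2021)), stated here only as background and not as a contribution of the paper: the first-order optimality condition for problems of the form (1) characterizes a minimizer $m^{*,\sigma}$ as a Gibbs-type fixed point of the flat derivative of F,

```latex
% First-order condition for (1): any minimizer m^{*,\sigma} satisfies
% a Gibbs fixed-point equation with respect to the reference measure \pi.
\begin{equation*}
  m^{*,\sigma}(\mathrm{d}x)
  \;=\;
  \frac{1}{Z}\,
  \exp\!\Big( -\tfrac{1}{\sigma}\, \tfrac{\delta F}{\delta m}\big(m^{*,\sigma}, x\big) \Big)\,
  \pi(\mathrm{d}x),
  \qquad
  Z \;=\; \int_{\mathcal{X}}
  \exp\!\Big( -\tfrac{1}{\sigma}\, \tfrac{\delta F}{\delta m}\big(m^{*,\sigma}, y\big) \Big)\,
  \pi(\mathrm{d}y).
\end{equation*}
```

In particular, the entropic term forces the minimizer to be absolutely continuous with respect to π.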
In order to solve (1), one typically aims to construct a gradient flow (μ_t)_{t≥0} on P(X) that converges to the minimizer m^{*,σ} of (1) as t → ∞. The most commonly used example is the Wasserstein gradient flow

$$\partial_t \mu_t = \nabla \cdot \Big( \mu_t \, \nabla \frac{\delta V^{\sigma}}{\delta \mu}(\mu_t, \cdot) \Big), \tag{2}$$

defined via the flat derivative (first variation) δV^σ/δμ of the energy function V^σ (see Definition D.1). The conditions guaranteeing the convergence of (2) to m^{*,σ} have been studied by numerous authors in various settings, including Ambrosio et al. (2008); Hu et al. (2021); Nitanda et al. (2022); Chizat (2022); Leahy et al. (2022) and many others.
However, from the point of view of applications, an equally important question is how to approximate gradient flows such as (2) by a practically implementable algorithm. One possible approach proceeds via the Jordan-Kinderlehrer-Otto (JKO) schemes (see Jordan et al. (1998) for the original paper or Salim et al. (2020); Lascu et al. (2024) for more recent expositions). Another, which we are going to focus on in the present paper, utilizes an interpretation of (2) as the mean-field limit of an interacting particle system. In the latter approach, one typically aims to prove a propagation of chaos result, i.e., a result showing that, as the number of particles approaches infinity, the particles become asymptotically independent and each follows the same mean-field dynamics. This can then be used as a theoretical justification that an appropriately constructed interacting particle system may be used to produce (after a discretisation) an algorithm that approximates the minimizer of (1) when the number of particles and the number of iterations are both sufficiently large.
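To make the particle-approximation idea concrete, the following toy sketch (purely illustrative, not taken from the paper) discretizes the Wasserstein flow (2) for the simple linear energy F(μ) = ∫ f dμ with f(x) = x²/2, taking π formally as Lebesgue measure; in that case (2) reduces to overdamped Langevin dynamics whose stationary law is the Gibbs measure with density proportional to exp(-f/σ). All names and parameter values here are hypothetical choices for the demonstration.

```python
import numpy as np

# Illustrative sketch: an Euler-Maruyama particle discretization of the
# Wasserstein gradient flow (2) for the toy linear energy F(mu) = \int f dmu
# with f(x) = x^2 / 2.  Together with the entropic term, each particle
# follows  X <- X - eta * f'(X) + sqrt(2 * sigma * eta) * xi,
# whose stationary law is the Gibbs measure prop. to exp(-f / sigma).

rng = np.random.default_rng(0)

sigma = 0.5        # entropic regularization parameter
eta = 0.05         # step size of the discretization
n_particles = 5000
n_steps = 2000

grad_f = lambda x: x               # f(x) = x^2 / 2, so f'(x) = x

X = rng.normal(size=n_particles)   # initial particle cloud
for _ in range(n_steps):
    noise = rng.normal(size=n_particles)
    X = X - eta * grad_f(X) + np.sqrt(2.0 * sigma * eta) * noise

# For this quadratic f, the Gibbs measure is N(0, sigma): the empirical
# mean should be close to 0 and the empirical variance close to sigma.
print(np.mean(X), np.var(X))
```

For a genuinely interacting energy F, the gradient grad_f would depend on the empirical measure of the whole particle cloud, which is exactly the setting in which propagation of chaos results are needed.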
Propagation of chaos for the Wasserstein gradient flow has been studied in detail in various settings (utilizing the interpretation of (2) as the Fokker-Planck equation for the mean-field Langevin SDE). A (far from complete) list of references includes Chen et al. (2025); Monmarché et al. (2024); Durmus et al. (2020); Carrillo et al. (2003); Malrieu (2001); Delarue and Tse (2025); Lacker and Le Flem (2023); Suzuki et al. (2023); Nitanda (2024); Nitanda et al. (2025); Gu and Kim (2025). A related active strand of research concerns propagation of chaos for kinetic models, see Monmarché (2017); Guillin and Monmarché (2021); Schuh (2024); Chen et al. (2024).
In the present work, we focus on a different gradient flow, the so-called Fisher-Rao gradient flow, given by

$$\partial_t \mu_t = -\mu_t \, \frac{\delta V^{\sigma}}{\delta \mu}(\mu_t, \cdot). \tag{3}$$
The interest in studying this flow in the context of the optimization problem (1) is motivated by the fact that, in some settings, its convergence to the minimizer can be easier to verify than for the Wasserstein flow, see Liu et al. (2023); Kerimkulov et al. (2025); Lascu et al. (2024). There is a considerable body of literature studying fundamental properties of Fisher-Rao gradient flows, such as well-posedness in various settings, see e.g. Carrillo et al. (2024); Zhu and Mielke (2024) and the references therein, also in combination with the Wasserstein flow as the Wasserstein-Fisher-Rao (also known as Hellinger-Kantorovich) gradient flow, see Liero et al. (2018); Gallouët and Monsaingeon (2017); Lu et al. (2019); Rotskoff et al. (2019).
However, unlike for the Wasserstein gradient flow (2), in the context of the Fisher-Rao flow (3) much less is known about particle approximations (with a few exceptions that will be discussed in detail in Section 2.4).
In particular, the main goal of the present paper is to fill a gap in the literature by providing a rigorous proof for a particle approximation for the Fisher-Rao gradient flow (3) corresponding to a class of minimization problems of the form (1).
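To give a feel for why particle approximations of (3) differ from those of (2), the following sketch (purely illustrative, and not the kernelized construction developed in this paper) discretizes the Fisher-Rao flow for a linear energy V(μ) = ∫ f dμ, whose flat derivative is simply f; with the usual mean-zero normalization of the flat derivative the flow preserves total mass, leaves particle locations fixed and evolves only their weights, with exact solution μ_t ∝ μ_0 exp(-t f). The objective f and all parameter values are hypothetical choices for the demonstration.

```python
import numpy as np

# Illustrative sketch: a weighted-particle (multiplicative-weights /
# replicator) discretization of the Fisher-Rao flow (3) for the linear
# energy V(mu) = \int f dmu.  The flow does not transport mass in space;
# it reweights it, so we keep a fixed grid of locations and update weights:
#   w_i <- w_i * exp(-eta * f(x_i)),  then renormalize.
# This reproduces the exact solution mu_t prop. to mu_0 * exp(-t * f).

f = lambda x: (x - 0.3) ** 2        # toy objective with minimizer at x = 0.3

x = np.linspace(-1.0, 1.0, 201)     # fixed particle locations (a grid)
w = np.full(x.shape, 1.0 / x.size)  # uniform initial weights (mu_0)

eta, n_steps = 0.1, 500
for _ in range(n_steps):
    w = w * np.exp(-eta * f(x))     # Fisher-Rao / replicator update
    w = w / w.sum()                 # renormalize to a probability vector

# The weight vector concentrates on the grid point minimizing f.
best = x[np.argmax(w)]
print(best)
```

Note that because the locations never move, the mass can only concentrate on points already charged by the initial measure; this support restriction is one of the reasons why constructing well-defined particle systems for (3) with general energies requires the kernel smoothing discussed in this paper.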
We summarize the main contributions of our paper:
• In Section 2, we propose a general framework for studying particle approximations of Fisher-Rao gradient flows. This part extends the results from Cavil et al. (2017); Lu et al. (2019); Rotskoff et al. (2019); Domingo-Enrich et al. (2020) (see Remark 2.8 and Section 2.4 for details).
• In Section 3,