On propagation of chaos for the Fisher-Rao gradient flow in entropic mean-field optimization


Authors: Petra Lazić, Linshan Liu, Mateusz B. Majka

Petra Lazić (University of Ljubljana and University of Zagreb), Linshan Liu (Heriot-Watt University and Maxwell Institute for Mathematical Sciences), Mateusz B. Majka (Heriot-Watt University and Maxwell Institute for Mathematical Sciences)

Abstract

We consider a class of optimization problems on the space of probability measures motivated by the mean-field approach to studying neural networks. Such problems can be solved by constructing continuous-time gradient flows that converge to the minimizer of the energy function under consideration, and then implementing discrete-time algorithms that approximate the flow. In this work, we focus on the Fisher-Rao gradient flow and we construct an interacting particle system that approximates the flow as its mean-field limit. We discuss the connection between the energy function, the gradient flow and the particle system and explain different approaches to smoothing out the energy function with an appropriate kernel in a way that allows for the particle system to be well-defined. We provide a rigorous proof of the existence and uniqueness of the thus obtained kernelized flows, as well as a propagation of chaos result that provides a theoretical justification for using the corresponding kernelized particle systems as approximation algorithms in entropic mean-field optimization.

1 INTRODUCTION

We consider the following optimization problem on the space of probability measures P(X) on X ⊂ R^d:

min_{m ∈ P(X)} V^σ(m),   V^σ(m) := F(m) + σ KL(m | π),   (1)

Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300. Copyright 2026 by the author(s).
where F : P(X) → R is a (possibly non-convex) functional bounded from below, σ > 0 is a regularization parameter, π ∈ P(X) is a fixed reference measure and KL denotes the relative entropy (the KL-divergence). While some general results in Section 2 will be stated for domains X ⊂ R^d which do not have to be compact, for the crucial examples studied in Section 3 we will additionally require X to be compact.

In recent years, there has been considerable interest in problems of this type, motivated by the mean-field approach to the problem of training neural networks (see Mei et al. (2018) or (Hu et al., 2021, Section 3) and the references therein), as well as, in the context of reinforcement learning, by policy optimization for entropy-regularized Markov Decision Processes with neural network approximation in the mean-field regime (Leahy et al., 2022; Lascu and Majka, 2025).

In order to solve (1), one typically aims to construct a gradient flow (µ_t)_{t≥0} on P(X) that converges to the minimizer m^{*,σ} of (1) as t → ∞. The most commonly used example is the Wasserstein gradient flow

∂_t µ_t = ∇ · ( µ_t ∇ (δV^σ/δµ)(µ_t, ·) ),   (2)

defined via the flat derivative (first variation) δV^σ/δµ of the energy function V^σ (see Definition D.1). The conditions guaranteeing the convergence of (2) to m^{*,σ} have been studied by numerous authors in various settings, including Ambrosio et al. (2008); Hu et al. (2021); Nitanda et al. (2022); Chizat (2022); Leahy et al. (2022) and many others.

However, from the point of view of applications, an equally important question is how to approximate gradient flows such as (2) by a practically implementable algorithm. One possible approach leads through the Jordan-Kinderlehrer-Otto (JKO) schemes (see Jordan et al. (1998) for the original paper or Salim et al. (2020); Lascu et al. (2024) for more recent expositions).
Another, which we are going to focus on in the present paper, utilizes an interpretation of (2) as the mean-field limit of an interacting particle system. In the latter approach, one typically aims to prove a propagation of chaos result, i.e., a result showing that as the number of particles approaches infinity, the particles become independent and they all follow the same mean-field dynamics. This can then be used as a theoretical justification that an appropriately constructed interacting particle system may be used to produce (after a discretisation) an algorithm that approximates the minimizer of (1) when the number of particles and the number of iterations are both sufficiently large.

Propagation of chaos for the Wasserstein gradient flow has been studied in detail in various settings (utilizing the interpretation of (2) as the Fokker-Planck equation for the mean-field Langevin SDE). A (far from complete) list of references includes Chen et al. (2025); Monmarché et al. (2024); Durmus et al. (2020); Carrillo et al. (2003); Malrieu (2001); Delarue and Tse (2025); Lacker and Le Flem (2023); Suzuki et al. (2023); Nitanda (2024); Nitanda et al. (2025); Gu and Kim (2025). A related active strand of research involves propagation of chaos for kinetic models: Monmarché (2017); Guillin and Monmarché (2021); Schuh (2024); Chen et al. (2024).

In the present work, we focus on a different gradient flow, the so-called Fisher-Rao gradient flow given by

∂_t µ_t = −µ_t (δV^σ/δµ)(µ_t, ·).   (3)

The interest in studying this flow in the context of optimization problems (1) is motivated by the fact that in some settings, its convergence to the minimizer can be easier to verify than for the Wasserstein flow (Liu et al., 2023; Kerimkulov et al., 2025; Lascu et al., 2024).
There has been a considerable literature studying fundamental properties of Fisher-Rao gradient flows, such as well-posedness in various settings, see e.g. Carrillo et al. (2024); Zhu and Mielke (2024) and the references therein, also in combination with the Wasserstein flow as the Wasserstein-Fisher-Rao gradient flow (also known as the Hellinger-Kantorovich flow), see Liero et al. (2018); Gallouët and Monsaingeon (2017); Lu et al. (2019); Rotskoff et al. (2019). However, unlike for the Wasserstein gradient flow (2), in the context of the Fisher-Rao flow (3) much less is known about particle approximations (with a few exceptions that will be discussed in detail in Section 2.4). In particular, the main goal of the present paper is to fill a gap in the literature by providing a rigorous proof of a particle approximation for the Fisher-Rao gradient flow (3) corresponding to a class of minimization problems of the form (1).

1.1 Contributions

We summarize the main contributions of our paper:

• In Section 2, we propose a general framework for studying particle approximations of Fisher-Rao gradient flows. This part extends the results from Cavil et al. (2017); Lu et al. (2019); Rotskoff et al. (2019); Domingo-Enrich et al. (2020) (see Remark 2.8 and Section 2.4 for details).

• In Section 3, we discuss different approaches to smoothing out Fisher-Rao flows (which is crucial for practical implementation) and we show that all the discussed methods fall within our framework from Section 2. This part expands on Lu et al. (2019); Pampel et al. (2023); Lu et al. (2023); Carrillo et al. (2019).

• In Section 4, we propose a practical algorithm for approximating Fisher-Rao flows, utilizing a method from Section 3.
2 MAIN RESULTS

In this section, we work on a general (not necessarily compact) space X ⊂ R^d and we focus on the flow

∂_t µ_t = −µ_t a(µ_t, ·),   (4)

where the flat derivative δV^σ/δµ on the right hand side of (3) has been replaced with a function a : P(X) × X → R. In Section 3, we will explain the rationale behind approximating the flat derivative δV^σ/δµ of the energy function V^σ in (1) with its kernelized counterpart. Different choices of kernelizations will lead to different choices of a, and hence in this section we keep our notation general, in order to produce a broadly applicable theoretical framework. For convenience, we will refer to flows (4) as Fisher-Rao flows, although it should be understood that they are "true" Fisher-Rao flows only for some choices of a (see Section 3.2 for more details).

We remark that we always choose a in a way that ensures that µ_t remains a probability distribution. This is achieved by requiring that for all µ ∈ P(X),

∫_X a(µ, x) µ(dx) = 0,   (5)

which then immediately implies that ∂_t ∫_X µ_t(dx) = −∫_X a(µ_t, x) µ_t(dx) = 0, and hence the flow preserves the mass of the initial measure. It is easy to check that all examples of a studied in Section 3 satisfy property (5).

2.1 Preliminaries

Before we proceed, let us introduce some necessary notation. For p ∈ [1, ∞), and for any normed vector space (E, ∥·∥), we define the set of probability measures with finite p-th moment P_p(E) as

P_p(E) := { µ ∈ P(E) : ∫_E ∥x∥^p µ(dx) < ∞ }.

We equip P_p(E) with the p-Wasserstein distance

W_p(m, m′) := inf_{π ∈ Π(m, m′)} ( ∫_E ∥x − y∥^p π(dx, dy) )^{1/p},

where Π(m, m′) ⊂ P(E × E) is the set of all couplings of m and m′. For any T > 0, we consider the path space C([0, T]; E) with the supremum norm ∥x − y∥_T := sup_{0 ≤ s ≤ T} ∥x_s − y_s∥, and write P_p(C([0, T]; E)) for the corresponding p-moment space.
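The mass-preservation mechanism behind condition (5) can be illustrated numerically. The sketch below (ours, not from the paper) discretizes the flow (4) on a one-dimensional grid with the hypothetical choice a(µ, x) = f(x) − ∫ f dµ, which satisfies (5) by construction; the total mass of µ_t then stays constant along explicit Euler steps, up to floating-point error.

```python
import numpy as np

# Hypothetical illustration: discretize the flow (4) on a 1-D grid with
# a(mu, x) = f(x) - \int f d(mu), which satisfies the centering condition (5).
xs = np.linspace(-1.0, 1.0, 201)
dx = xs[1] - xs[0]
mu = np.exp(-xs**2)          # unnormalized initial density
mu /= mu.sum() * dx          # normalize to a probability density on the grid

f = xs**2                    # any bounded test functional building block

dt, n_steps = 1e-3, 2000
for _ in range(n_steps):
    a = f - (f * mu).sum() * dx        # a(mu_t, x), mean-zero under mu_t
    mu = mu * (1.0 - dt * a)           # Euler step of d/dt mu = -mu a(mu, .)

mass = mu.sum() * dx
print(abs(mass - 1.0) < 1e-8)          # mass is preserved
```

Note that for this centered a, each Euler step preserves the discrete mass exactly, since the subtracted constant makes ∫ a µ vanish at every step.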
The induced p-Wasserstein distance on path measures is

W_{p,T}(m, m′) := inf_{π ∈ Π(m, m′)} ( ∫ ∥x − y∥_T^p π(dx, dy) )^{1/p}.

We require a to satisfy the following assumption:

Assumption 1. The function a is bounded and Lipschitz, i.e., there exist constants M_a, L_a > 0 such that for all x, y ∈ X and m, m′ ∈ P_2(X),

|a(m, x)| ≤ M_a,   (6)

and

|a(m, x) − a(m′, y)| ≤ L_a ( |x − y| + W_2(m, m′) ).   (7)

Following the ideas from Liero et al. (2018) (see also Domingo-Enrich et al. (2020)), we work with the notions of lifts and projections of measures. This will allow us to use measures defined on the extended space X × R_+, with the second component representing a local weight.

Definition 2.1 (Lifted measure and projection). Let µ ∈ P_1(X) and ν ∈ P_1(X × R_+). We say that ν is a lifted measure of µ, and conversely that µ is the projection of ν, if for all φ ∈ C_c^∞(X),

∫_X φ(x) dµ(x) = ∫_{X × R_+} w φ(x) dν(x, w),   (8)

where C_c^∞(X) denotes the space of smooth, compactly supported functions on X. In this case we define the projection operator h : P_1(X × R_+) → P_1(X), so that µ = hν whenever (8) holds.

Note that the requirement for the measures in Definition 2.1 to have finite first moments ensures that the integral on the right hand side of (8) is finite.

2.2 Existence of mean-field dynamics corresponding to the Fisher-Rao flow

Building on the notion of lifted and projected measures defined above, we now establish a rigorous connection between the Fisher-Rao gradient flow and a corresponding mean-field (single particle) dynamics. The key idea is to lift the flow from the probability space P_1(X) to the extended space P_1(X × R_+), where the additional coordinate w represents a local mass weight.
This lifted formulation reveals that the original Fisher-Rao flow can be interpreted as the projection of an evolution equation on the extended space. Based on this representation, we construct a mean-field dynamic whose law evolves according to the lifted flow, and show that, under suitable regularity assumptions, this dynamic admits a unique solution on any finite time horizon. We begin by stating the precise correspondence between the lifted and projected flows.

Definition 2.2. Let ν_0 ∈ P_1(X × R_+). We say that (ν_t)_{t≥0} ⊂ P_1(X × R_+) is a weak solution to the lifted flow

∂_t ν_t(x, w) = (∂/∂w)( ν_t(x, w) w a(hν_t, x) ),   (9)

with initial condition ν_0, if for any ψ ∈ C_c^∞(X × R_+) and any t > 0 we have

(d/dt) ∫_{X × R_+} ψ(x, w) dν_t(x, w) = −∫_{X × R_+} w a(hν_t, x) (∂/∂w) ψ(x, w) dν_t(x, w).   (10)

Definition 2.3. Let µ_0 ∈ P_1(X). We say that (µ_t)_{t≥0} ⊂ P_1(X) is a weak solution to the Fisher-Rao gradient flow

∂_t µ_t = −µ_t a(µ_t, ·),   (11)

with initial condition µ_0, if for any ψ ∈ C_c^∞(X) and any t > 0 we have

(d/dt) ∫_X ψ(x) dµ_t(x) = −∫_X ψ(x) a(µ_t, x) dµ_t(x).   (12)

Remark 2.4. Note that by a standard approximation argument, if (ν_t)_{t≥0} is a weak solution to the lifted flow in the sense of Definition 2.2, then (10) holds also for functions of the form ψ(x, w) = w φ(x), φ ∈ C_c^∞(X). This will be important for arguments where we switch between the lifted and projected flows, especially in the proof of the following lemma.

Lemma 2.5. Let (ν_t)_{t≥0} ⊂ P_1(X × R_+) be a weak solution to the lifted flow (9) in the sense of Definition 2.2, with initial condition ν_0 ∈ P_1(X × R_+). Then the projected flow (µ_t)_{t≥0} := (hν_t)_{t≥0} ⊂ P_1(X) is a weak solution to the Fisher-Rao flow (11) in the sense of Definition 2.3, with initial condition µ_0 := hν_0.
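The content of Lemma 2.5 can be checked numerically on an empirical lifted measure. In the sketch below (ours; the choice of a is an illustrative one satisfying (5)), the lifted measure ν_t = (1/N) Σ_i δ_{(x_i, w_i(t))} evolves along the characteristics dw_i/dt = −w_i a(hν_t, x_i), its projection is the weighted empirical measure hν_t = (1/N) Σ_i w_i δ_{x_i}, and a finite-difference quotient of ∫ ψ d(hν_t) reproduces the right hand side of the weak formulation (12).

```python
import numpy as np

# Illustrative check of Lemma 2.5 (not from the paper): evolve an empirical
# lifted measure along the weight characteristics, project via h, and
# verify the weak form (12):
#   d/dt \int psi d(mu_t) == -\int psi a(mu_t, .) d(mu_t).
rng = np.random.default_rng(1)
N = 2000
xs = rng.normal(size=N)                       # fixed positions x_i
ws = np.ones(N)                               # initial weights, sum = N

f = lambda x: np.tanh(x)                      # building block for a
a = lambda ws, xs, x: f(x) - np.mean(ws * f(xs))   # satisfies (5)
psi = lambda x: np.sin(x)                     # test function

dt = 1e-4
I0 = np.mean(ws * psi(xs))                    # \int psi d(mu_t)
rhs = -np.mean(ws * psi(xs) * a(ws, xs, xs))  # right hand side of (12)
ws = ws * (1.0 - dt * a(ws, xs, xs))          # one Euler step of the weights
I1 = np.mean(ws * psi(xs))                    # \int psi d(mu_{t+dt})

lhs = (I1 - I0) / dt                          # finite-difference d/dt
print(abs(lhs - rhs) < 1e-8)
```

For the Euler scheme the identity is in fact exact in one step, which is why the tolerance can be taken at floating-point level rather than O(dt).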
With this relation at hand, we can construct a mean-field particle dynamic whose law evolves according to the lifted flow. The precise statement is given in the following theorem.

Theorem 2.6. Let ν_0 ∈ P_1(X × R_+). Consider the mean-field system

dX_t = 0,   dw_t = −w_t a(hν_t, X_t) dt,   (X_0, w_0) ∼ ν_0,   (13)

where ν_t := Law(X_t, w_t) denotes the joint law of (X_t, w_t). If system (13) admits a solution, then the curve (ν_t)_{t≥0} is a weak solution to the lifted Fisher-Rao flow (9) in the sense of Definition 2.2, and hence (µ_t)_{t≥0} := (hν_t)_{t≥0} is a weak solution to the Fisher-Rao flow (11) in the sense of Definition 2.3.

Remark 2.7. System (13) can be interpreted as a particle X_t with a corresponding weight w_t. Note that the spatial position X_t of the particle is sampled once from the initial probability distribution and remains fixed over time, while only the associated weight w_t evolves according to the dynamic in (13). This reflects the geometry of the Fisher-Rao metric, which governs mass change without spatial transport.

Remark 2.8. Consider the setting corresponding to our primary object of interest, i.e., when the function a in (4) is given by the flat derivative of V^σ defined in (1). Then, for measures µ such that V^σ has a flat derivative at µ,

a(µ, ·) = (δF/δµ)(µ, ·) + σ log(µ/π)   (14)

(up to an additive constant; the details will be discussed in Section 3). If we assume the necessary differentiability, we can consider the Wasserstein-Fisher-Rao gradient flow given by

∂_t µ_t(x) = ∇ · ( µ_t ∇a(µ_t, x) ) − µ_t(x) a(µ_t, x).   (15)

Then, due to the interpretation of the Wasserstein gradient flow as the mean-field Langevin SDE (Hu et al., 2021), we would expect to obtain the following corresponding mean-field (single particle) system

dX_t = −∇( (δF/δµ)(hν_t, X_t) − σ log π(X_t) ) dt + √(2σ) dW_t,
dw_t = −w_t a(hν_t, X_t) dt,   ν_t := Law(X_t, w_t),   (16)

where both the location of the particle and the weight evolve in time. There are, however, several technical challenges with studying system (16). If (16) is considered on a compact space X ⊂ R^d, this creates an issue with the Wasserstein part of the flow (due to the diffusive behaviour of X_t) and one needs to make sure that the particle stays on the right space. On the other hand, if X = R^d, then the Fisher-Rao part becomes problematic since the function a defined in (14) is unbounded (due to the flat derivative of the relative entropy being unbounded). While working on the present paper, we were unable to overcome these technical challenges, which is why we focus on the system (13) corresponding to the "pure" Fisher-Rao gradient flow (and in Section 3, when we discuss specific choices of a related to (14), we work on a compact X to ensure that a stays bounded). This is consistent with a recent paper (Domingo-Enrich et al., 2020), which studied (in the context of two-player game theory) Wasserstein-Fisher-Rao flows corresponding to energy functions (1) without entropy-regularization (i.e., with σ = 0), which leads to a system of the form (16) where X_t moves only according to a deterministic gradient descent; as well as Wasserstein flows corresponding to (1) with entropy regularization but without the Fisher-Rao part. Similarly to Domingo-Enrich et al. (2020), we are unable to cover Wasserstein-Fisher-Rao (WFR) flows corresponding to (1) with entropy-regularization, but unlike Domingo-Enrich et al.
(2020), we study Fisher-Rao flows for (1) with entropy-regularization, which were not covered there. We will attempt to treat the challenging case of WFR flows in a future work.

We now address the well-posedness of the mean-field particle dynamics.

Theorem 2.9. Suppose that a satisfies Assumption 1. Let the initial law ν_0 ∈ P_2(X × R_+). Then, for any T > 0, there exists a unique random dynamic (X_t, w_t)_{t∈[0,T]} that solves system (13). In particular,

Law( (X_t, w_t)_{t∈[0,T]} ) ∈ P_2( C([0, T]; X × R_+) ).

Moreover, the solution is pathwise unique: if two solutions share the same initial condition almost surely, then they coincide for all t ∈ [0, T] almost surely.

2.3 Interacting Particle Systems and Propagation of Chaos

We investigate the approximation of the Fisher-Rao gradient flow by interacting particle systems. Our objective is to establish propagation of chaos: as the number of particles N → ∞, the empirical measure of the interacting system converges to the law of i.i.d. copies of the mean-field solution, with convergence measured in the 2-Wasserstein distance.

Before we discuss the interacting particle system that is our main object of interest, in order to make the notation precise we first introduce a system (X^{i,N}, w^{i,N}_t) of N i.i.d. copies of the mean-field dynamic (13), which will be used as an auxiliary tool in our approximation estimates. For each i ∈ ⟦1, N⟧, we define the dynamic of (X^{i,N}, w^{i,N}_t) by

X^{i,N} ∼ µ^{i,N}_0,   w^{i,N}_0 ≥ 0, with Σ_{i=1}^N w^{i,N}_0 = N,
dw^{i,N}_t = −w^{i,N}_t a(hν^i_t, X^{i,N}) dt,   (17)

where ν^i_t := Law(X^{i,N}, w^{i,N}_t) is the marginal law of the i-th particle. We denote the empirical measure of the thus obtained non-interacting system by

ν^N_t := (1/N) Σ_{i=1}^N δ_{(X^{i,N}, w^{i,N}_t)}.   (18)

We will use this notation throughout our proofs in Appendix B.

Interacting particle system.
We are now ready to introduce the interacting particle system, where the interaction is governed by weighted empirical distributions. For each i ∈ ⟦1, N⟧, we define (X̃^{i,N}, w̃^{i,N}_t) by

X̃^{i,N} ∼ µ̃^{i,N}_0,   w̃^{i,N}_0 ≥ 0, with Σ_{i=1}^N w̃^{i,N}_0 = N,
dw̃^{i,N}_t = −w̃^{i,N}_t a( µ̃^N_t, X̃^{i,N} ) dt,   (19)

where µ̃^N_t := (1/N) Σ_{i=1}^N w̃^{i,N}_t δ_{X̃^{i,N}} is the weighted empirical distribution. Note that the empirical measure on the extended space is

ν̃^N_t := (1/N) Σ_{i=1}^N δ_{(X̃^{i,N}, w̃^{i,N}_t)},   so that   µ̃^N_t = h ν̃^N_t.   (20)

We now state the main result on the propagation of chaos.

Theorem 2.10. Suppose that the function a satisfies Assumption 1. Fix a finite time horizon T > 0 and let ν ∈ P_2( C([0, T]; X × R_+) ) be the law of the single particle (X, w_t) = (X, w_t)_{t∈[0,T]} defined by the mean-field dynamic (13). Choose particles (X^{i,N}, w^{i,N}_t) as i.i.d. copies of (X, w_t) as defined in (17) and let (X̃^{i,N}, w̃^{i,N}_t) be the interacting particle system defined by (19). Suppose further that the initial conditions satisfy

lim_{N→∞} (1/N) Σ_{i=1}^N E[ |X^{i,N} − X̃^{i,N}|^2 + |w^{i,N}_0 − w̃^{i,N}_0|^2 ] = 0.   (21)

Let ν̃^N ∈ P_2( C([0, T]; X × R_+) ) be the empirical measure of the interacting particle system defined by (20). Then,

lim_{N→∞} E[ W_{2,T}( ν̃^N, ν ) ] = 0.

We remark that condition (21) is automatically satisfied when the initializations for both systems considered in Theorem 2.10 are identical.

Corollary 2.11. Under the assumptions of Theorem 2.10, the projected empirical distribution µ̃^N_t := h ν̃^N_t converges in the 2-Wasserstein distance, uniformly on [0, T], to the mean-field law µ_t := hν_t. That is, for any finite time horizon T > 0, we have

lim_{N→∞} sup_{t∈[0,T]} E[ W_2( µ_t, µ̃^N_t ) ] = 0.
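A minimal numerical sketch of the interacting system (19) (ours, with an illustrative a satisfying (5)): positions are sampled once and never move, only the weights evolve under an Euler discretization, and the centering property (5) keeps the total weight Σ_i w̃^{i,N}_t equal to N along the trajectory.

```python
import numpy as np

# Minimal sketch (illustrative) of the interacting particle system (19):
# fixed positions, weights driven by a evaluated at the weighted empirical
# measure mu_tilde = (1/N) sum w_i delta_{x_i}.
rng = np.random.default_rng(2)
N = 1000
xs = rng.normal(size=N)       # X_tilde^{i,N}, sampled once, never moved
ws = np.ones(N)               # w_tilde^{i,N}_0, sum = N

f = lambda x: x**2

def a(ws, xs, x):
    # a(mu, x) = f(x) - \int f d(mu); the subtraction enforces (5)
    return f(x) - np.mean(ws * f(xs))

dt, n_steps = 1e-2, 500
for _ in range(n_steps):
    ws = ws * (1.0 - dt * a(ws, xs, xs))   # Euler step of (19)

print(abs(ws.sum() - N) < 1e-6)            # total weight is preserved
print(np.all(ws > 0))                      # weights stay positive for small dt
```

Since this particular a penalizes large values of f, mass flows multiplicatively towards particles with small f; no particle is ever killed or moved, which is the Fisher-Rao behaviour described in Remark 2.7.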
We remark that Theorem 2.10 and Corollary 2.11 are non-quantitative, and obtaining convergence rates would require further work, which is beyond the scope of the present paper.

2.4 Discussion of other propagation of chaos results in the literature

The framework in Cavil et al. (2017) establishes propagation of chaos results for particle approximations of the following class of non-conservative nonlinear PDEs:

∂_t v = Σ_{i,j=1}^d ∂^2_{ij}( (ΦΦ^⊤)_{i,j}(t, x, v) v ) − ∇ · ( g(t, x, v) v ) + Λ(t, x, v) v,   v(0, dx) = v_0(dx).   (22)

In this equation, the first two terms correspond to the Wasserstein component of WFR flows, while the last term plays the role of a Fisher-Rao-type reaction. However, unlike the WFR flow (15), where both the transport and reaction components are derived from a single energy functional, here the different terms are specified independently and may correspond to unrelated energies.

More importantly, the reaction term Λ(t, x, v)v in Cavil et al. (2017) depends locally on the solution v: that is, Λ is evaluated using the pointwise value v(t, x), without reference to the global structure of the distribution. This contrasts with our setting, where, as we will explain in Section 3, the reaction term involves a nonlocal dependence on the entire probability measure. Hence, PDE (22) does not in general conserve total mass: the solution v(t, ·) is not necessarily a probability density for all t. In contrast, the nonlocal normalization we include ensures mass preservation at all times. Moreover, the regularity assumptions also differ, cf. Assumption 1 in Cavil et al. (2017) to our Assumption 1.

Another key distinction is that we work on a compact state space X ⊂ R^d, whereas the equations in Cavil et al. (2017) are defined on the whole space R^d. Extending the results from Cavil et al.
(2017) to compact domains would require working with appropriately defined boundary conditions for the PDE (22).

As we discussed in Remark 2.8, another related paper (Domingo-Enrich et al., 2020) studied propagation of chaos for Wasserstein-Fisher-Rao flows without entropy regularization, and for Wasserstein flows corresponding to the entropy-regularized energy (1) (but without the Fisher-Rao part), and hence their results are not applicable to our setting.

Finally, there has been some work on propagation of chaos for interacting particle systems defined with a killing-replication mechanism with exponential clocks (rather than with evolving weights; cf. also the discussion in Section 4) in Rotskoff et al. (2019); Lu et al. (2019). However, we were unable to make the proofs from those works fully rigorous in our setting (1), due to the unboundedness of the flat derivative of the relative entropy.

3 ENTROPIC MEAN-FIELD PROBLEMS

In this section, we assume X to be a compact subset of R^d, and we focus on the energy function

V^σ(m) = F(m) + σ KL(m | π),   for all m ∈ P(X).   (23)

Note that the choice of a compact X removes the technical problem with flat differentiability of the KL-divergence, which on unbounded domains would have to be rigorously justified (Liu et al., 2023; Aubin-Frankowski et al., 2022). Note also that we do not require convexity of X, since the pure Fisher-Rao flow does not change the support of the initial measure, i.e., it evolves only the mass without any transport in the state space.

The key difficulty in dealing with Fisher-Rao flows (3) corresponding to the energy function (23) lies in the construction of the corresponding interacting particle system. In our setting, one can show that the flat derivative of V^σ is given by

(δV^σ/δµ)(µ_t, x) = (δF/δµ)(µ_t, x) + σ log( µ_t(x)/π(x) ) − σ KL(µ_t | π).
(24)

Note that the term −σ KL(µ_t | π) is necessary to ensure that the equation (3) is conservative, i.e., that all measures µ_t are indeed probability measures. However, this makes (24) well-defined only if the measures µ_t are absolutely continuous with respect to π, which creates problems with defining a particle system approximating (3). Indeed, one cannot evaluate the right hand side of (3) at the empirical measure of a corresponding particle system, which makes it necessary to define an auxiliary flow, where the problematic terms are replaced by their counterparts involving convolutions with appropriately defined kernels. This is the rationale behind the introduction of the so-called "kernelized" flows, where instead of the expression in (24), we work with its kernelized version.

3.1 Setting for entropic mean-field optimization

Throughout Section 3, we impose the following standing assumptions on the energy functional F, the mollifier kernel K_ε, and the reference measure π.

Assumption 2. We assume the energy functional F : P(X) → R satisfies the following properties:

(i) F is lower semi-continuous with respect to weak convergence in P_2(X).

(ii) F is bounded from below: there exists F_min ∈ R such that F(m) ≥ F_min for all m ∈ P(X).

(iii) The flat derivative is bounded: there exists a constant C_F > 0 such that for all µ ∈ P(X) and x ∈ X,

| (δF/δµ)(µ, x) | ≤ C_F.

(iv) The flat derivative is Lipschitz: there exists a constant L_F > 0 such that for all µ, ν ∈ P(X) and x, y ∈ X,

| (δF/δµ)(µ, x) − (δF/δµ)(ν, y) | ≤ L_F ( W_2(µ, ν) + |x − y| ).

Note that these assumptions on F are standard in the mean-field optimization literature and, in particular, they are satisfied in mean-field models of neural networks studied in papers such as Hu et al. (2021); Chen et al.
(2023), as well as in mean-field models in policy optimization in reinforcement learning (Leahy et al., 2022; Lascu and Majka, 2025).

Assumption 3. Let ξ : R^d → R_+ be a smooth, Lipschitz (with constant L_ξ), radial probability density function with full support R^d and finite second moment. For ε > 0, define the rescaled function on the compact space X by

ξ_ε(x) := (1/C_{ε,d}) ε^{−d} ξ(x/ε),

where C_{ε,d} is the normalization constant ensuring that ∫_X ξ_ε(x) dx = 1. In particular, ξ_ε is a probability density function on X. The mollifier kernel K_ε on X is then defined as

K_ε(x) := (ξ_ε ∗ ξ_ε)(x).

Assumption 4. We assume the reference density π(x) satisfies the following properties:

(i) π(x) = e^{−U(x)} for some continuous potential function U : X → R.

(ii) π is globally Lipschitz on X: there exists L_π > 0 such that |π(x) − π(y)| ≤ L_π |x − y| for all x, y ∈ X.

Note that under Assumption 4, since π ∝ e^{−U} and X is compact, there exist constants 0 < π_min < π_max < ∞ such that

π_min ≤ π(x) ≤ π_max,   for all x ∈ X.

3.2 Kernelization Strategies

We summarize four kernelization strategies for approximating Fisher-Rao flows corresponding to energy functions (1), based on related strategies that have appeared in the literature in recent years. We will then show in Proposition 3.1 that for each of these approaches, the resulting function a satisfies our Assumption 1 and therefore all our results from Section 2 are applicable. Recall our notation for the flow ∂_t µ_t = −µ_t a(µ_t, ·) and consider the following choices of a that approximate the flat derivative δV^σ/δµ given in (24).

Smoothing only the evolving measure (Lu et al., 2019). Here the kernel is applied to µ_t both inside the logarithm and in the KL divergence. The resulting dynamics read

∂_t µ_t = −µ_t ( (δF/δµ)(µ_t, ·) + σ log( (K_ε ∗ µ_t)/π ) − σ KL(K_ε ∗ µ_t | π) ).
(25)

Smoothing both the evolving and the target measures (Pampel et al., 2023). In this variant, both µ_t and π are mollified by K_ε, which yields

∂_t µ_t = −µ_t ( (δF/δµ)(µ_t, ·) + σ log( (K_ε ∗ µ_t)/(K_ε ∗ π) ) − σ KL(K_ε ∗ µ_t | K_ε ∗ π) ).   (26)

Kernelizing the energy via Lu et al. (2023). Here the kernel is applied already at the level of the energy function, rather than just at the level of the flow; one studies the "true" Fisher-Rao gradient flow corresponding to the modified energy function

V^σ_ε(m) = F(m) + σ ∫ log( (K_ε ∗ m)/π ) m(dx).

The induced dynamics takes the form

∂_t µ_t = −µ_t σ ( (1/σ)(δF/δµ)(µ_t, ·) + log( (K_ε ∗ µ_t)/π ) + K_ε ∗ ( µ_t/(K_ε ∗ µ_t) ) − ∫ log( (K_ε ∗ µ_t)/π ) µ_t(dx) − 1 ).   (27)

Kernelizing the energy via Carrillo et al. (2019). Another choice of a kernelized energy function replaces the KL term by KL(K_ε ∗ m | π), leading to the energy

V^σ_ε(m) = F(m) + σ KL(K_ε ∗ m | π),

whose Fisher-Rao gradient flow is

∂_t µ_t = −µ_t ( (δF/δµ)(µ_t, ·) + σ K_ε ∗ log( (K_ε ∗ µ_t)/π ) − σ KL(K_ε ∗ µ_t | π) ).   (28)

Both (27) and (28) preserve the structure of Fisher-Rao flows, i.e., they are genuine Fisher-Rao gradient flows of the corresponding kernelized energy functions. A notable drawback of (28), however, is the presence of nested convolutions and integrals, e.g., the term K_ε ∗ log( (K_ε ∗ µ_t)/π ), which prevents a full discretization into finite particle sums even when µ_t is an empirical measure, and which in practice entails higher computational cost (see the discussion in (Carrillo et al., 2019, Section 1)).

For all kernelizations defined above, we have the following result.

Proposition 3.1. Suppose Assumptions 2, 3 and 4 hold. Consider the gradient flow ∂_t µ_t = −µ_t a(µ_t, ·) defined via (25), (26), (27) or (28). Then a satisfies Assumption 1.

The proof can be found in Appendix E.
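For a concrete instance of Assumption 3 (an illustration of ours, on the whole line R and ignoring the compact-domain normalization constant C_{ε,d} for simplicity): take ξ to be the standard Gaussian density, so that ξ_ε has standard deviation ε and the mollifier K_ε = ξ_ε ∗ ξ_ε is again Gaussian, with variance 2ε². The sketch below verifies this closed form against a numerical convolution on a grid.

```python
import numpy as np

# Illustration of Assumption 3 with a Gaussian xi, on R (ignoring the
# compact-domain normalization C_{eps,d}): the mollifier
# K_eps = xi_eps * xi_eps is a Gaussian with variance 2 eps^2.
eps = 0.3
xs = np.linspace(-3.0, 3.0, 1201)
dx = xs[1] - xs[0]

def gaussian(x, s):
    return np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

xi_eps = gaussian(xs, eps)

# numerical convolution (xi_eps * xi_eps)(x) on the grid
K_eps = np.convolve(xi_eps, xi_eps, mode="same") * dx

# compare with the closed form: Gaussian with std sqrt(2)*eps
K_exact = gaussian(xs, np.sqrt(2) * eps)
print(np.max(np.abs(K_eps - K_exact)) < 1e-3)  # agrees up to grid error
```

The symmetric form K_ε = ξ_ε ∗ ξ_ε is what makes K_ε a positive semi-definite kernel, a property that is convenient in the analysis of the kernelized flows.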
3.3 Regularization by χ²-divergence

Even though the relative entropy is the most popular choice of regularizer in mean-field optimization problems, other choices are possible, and our framework is indeed applicable to energy functions more general than (1). To illustrate this, we briefly discuss energy functions regularized by the χ²-divergence (with an appropriate kernelization analogous to kernelization (28) for the entropy-regularized energy). Recall that given a fixed reference measure π ∈ P(X), for any measure m ∈ P(X) absolutely continuous with respect to π, the χ²-divergence is defined by χ²(m | π) = ∫ ( dm/dπ − 1 )² dπ. We consider the functional V^σ_ε(m) := F(m) + σ χ²(K_ε ∗ m | π). Its Fisher-Rao gradient flow is ∂_t µ_t = −µ_t a(µ_t, ·), where

a(m, x) := (δF/δµ)(m, x) + 2σ K_ε ∗ ( (K_ε ∗ m)/π )(x) − 2σ ∫ K_ε ∗ ( (K_ε ∗ m)/π )(z) m(dz).   (29)

We have the following result.

Proposition 3.2. Suppose Assumptions 2, 3 and 4 hold. Then the function a defined by (29) satisfies Assumption 1.

The proof can be found in Appendix E.5.

3.4 Weak-* Convergence of Minimizers under Kernelization

In the context of mean-field optimization problems (1), an important question related to kernelizations of the corresponding gradient flows is whether those kernelized gradient flows are associated with energy functions whose minimizers are close to the minimizers of the original energy function in (1). In this subsection, we partially answer this question for the kernelized Fisher-Rao gradient flow associated with the kernelization in (27), namely the case where the flow corresponds to the energy function

V^σ_ε(m) := F(m) + σ ∫_X log( (K_ε ∗ m)/π ) dm(x).

We study the limiting behaviour of minimizers as ε → 0, and prove that any sequence of minimizers of V^σ_ε admits a subsequence converging in the weak-* topology to a minimizer of the original problem (1).
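As with the entropic kernelizations, the last term of (29) is exactly what enforces the centering condition (5). The sketch below (ours; F = 0 and a Gaussian mollifier on a 1-D grid are placeholder choices) evaluates (29) numerically and confirms that ∫ a(m, x) m(dx) vanishes.

```python
import numpy as np

# Sketch (ours): evaluate the chi^2-kernelized reaction term (29) on a 1-D
# grid with F = 0 and a Gaussian mollifier, and check the centering
# condition (5): \int a(m, x) m(dx) = 0 by construction of the last term.
xs = np.linspace(-3.0, 3.0, 601)
dx = xs[1] - xs[0]
sigma, eps = 0.5, 0.25

def gaussian(x, s):
    return np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def K_conv(f):
    # (K_eps * f)(x) on the grid; K_eps is Gaussian with std sqrt(2)*eps
    K = gaussian(xs, np.sqrt(2) * eps)
    return np.convolve(f, K, mode="same") * dx

m = gaussian(xs, 1.0)
m /= m.sum() * dx                         # normalize on the grid
pi = gaussian(xs, 0.8)                    # reference density, pi > 0

inner = K_conv(m) / pi                    # (K_eps * m) / pi
nonlocal_term = 2 * sigma * K_conv(inner) # 2 sigma K_eps * ((K_eps * m)/pi)
a = nonlocal_term - (nonlocal_term * m).sum() * dx   # subtract \int ... dm

print(abs((a * m).sum() * dx) < 1e-10)    # property (5) holds
```

Note the double convolution: one K_ε produces the smoothed density ratio, the second one (the adjoint of the first in the flat derivative of χ²(K_ε ∗ m | π)) maps it back to a function of x.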
Theorem 3.3 (Weak-* convergence of minimizers). Suppose Assumptions 2, 3 and 4 hold. For each $\varepsilon > 0$, let $\mu_\varepsilon \in \mathcal{P}_2(\mathcal{X})$ be a minimizer of $V^\sigma_\varepsilon$. Then there exists a subsequence $(\varepsilon_k)_{k \geq 1}$ such that $\varepsilon_k \to 0$ as $k \to \infty$, and a measure $\mu \in \mathcal{P}_2(\mathcal{X})$, such that $\mu_{\varepsilon_k} \overset{*}{\rightharpoonup} \mu$ as $k \to \infty$, and $\mu$ is a minimizer of $V^\sigma$ given in (1).

We would like to remark that a version of Theorem 3.3 can also be obtained, under some additional assumptions, on $\mathcal{X} = \mathbb{R}^d$. For the sake of completeness, we discuss the details in Appendix C, even though in order to apply the remaining results in this paper to the energy function (23) we require compactness of $\mathcal{X}$.

4 ALGORITHM CONSTRUCTION

The practical usage of the discussed gradient flows depends on developing implementable algorithms for approximating the corresponding interacting particle systems (19). In our algorithm, we draw the positions of the particles $\tilde{X}^{i,N}$ at the beginning from the initial distributions $\tilde{\mu}^{i,N}_0$, and afterwards they remain constant. Then we apply a time discretization of the equations for the weights $\tilde{w}^{i,N}$ by an Euler scheme, which gives a numerical scheme that updates the weights of the particles at each step. In the algorithm discussed in this section, the dynamics of the weights follows equation (25), for a smooth kernel $K_\varepsilon$.

Combining the theoretical results of the present paper with results guaranteeing convergence of the Fisher–Rao gradient flow to the minimizer $m^{\sigma,*}$ of (1) as $t \to \infty$ (see, e.g., Liu et al. (2023)) suggests that, for a large number of particles, a large number of iterations and a small $\varepsilon$, the output of our algorithm should provide a reasonable approximation of $m^{\sigma,*}$. However, a full quantitative analysis of the resulting approximation error remains a challenging open problem for future research.
Note that the full analysis would need to take into account the following four errors: the error between the continuous flow and the target (due to running the algorithm for a finite time), the error between the "correct" flow and the kernelized flow (due to the use of the kernel $K_\varepsilon$), the error between the continuous particle system and the continuous mean-field flow (due to using a finite number of particles), and the discretization error for the particle system (due to a positive time step).

Algorithm: Fisher–Rao gradient descent
Input: particles $X^i \sim \mu^i_0$ with uniform weights $w^i_0 = 1$, for $i = 1, \ldots, N$; time step $\Delta t$; number of iterations $J$.
Steps (weight updates):
for $j = 1 : J$ do
  $\tilde{\mu}^N_{j-1} = \frac{1}{N} \sum_{l=1}^N w^l_{j-1} \delta_{X^l}$
  for $i = 1 : N$ do
    $\tilde{V}^i_j := \frac{\delta F}{\delta \mu}(\tilde{\mu}^N_{j-1}, X^i) + \sigma \log\Big( \frac{1}{N} \sum_{l=1}^N w^l_{j-1} K_\varepsilon(X^i - X^l) \Big) - \sigma \log \pi(X^i) - \sigma \frac{1}{N} \sum_{k=1}^N w^k_{j-1} \Big[ \log\Big( \frac{1}{N} \sum_{l=1}^N w^l_{j-1} K_\varepsilon(X^k - X^l) \Big) - \log \pi(X^k) \Big]$
    $\hat{w}^i_j = w^i_{j-1} \exp\big( -\tilde{V}^i_j \Delta t \big)$
  end for
  $w^i_j = N \hat{w}^i_j / \sum_{l=1}^N \hat{w}^l_j$, for $i = 1, \ldots, N$
end for
Output: the weighted empirical distribution of the particles approximates $m^{\sigma,*}$.

Note that we need the coefficient $N$ in the update for $w^i_j$ since the weights are supposed to add up to $N$ (cf. (19)).

We remark that in our setting it is also possible to construct an algorithm corresponding to the Wasserstein–Fisher–Rao gradient flow (16). This algorithm would have an additional step within the outer loop, corresponding to the diffusive movement of the particles due to the Wasserstein part of the flow (i.e., this step would correspond to the discretization of the SDE in (16)).
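The weight-update scheme above can be implemented in a few lines. The following is a sketch under simplifying assumptions that are ours, not the paper's: one-dimensional particles, a Gaussian kernel $K_\varepsilon$, and a linear energy $F(m) = \int f\, dm$, so that $\frac{\delta F}{\delta \mu}(m, x) = f(x)$.

```python
import numpy as np

def fisher_rao_descent(X, f, log_pi, eps, dt, J, sigma=1.0):
    """Euler scheme for the weights of the kernelized Fisher-Rao flow (25):
    the particles X stay fixed, and at each iteration the weights are updated
    and then renormalised so that they add up to N (cf. (19))."""
    N = len(X)
    w = np.ones(N)
    # precompute the Gaussian kernel matrix K_eps(X_i - X_l)
    K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * eps**2)) \
        / np.sqrt(2 * np.pi * eps**2)
    for _ in range(J):
        # log((1/N) sum_l w_l K_eps(X_i - X_l)) - log pi(X_i), for every i
        log_ratio = np.log(K @ w / N) - log_pi(X)
        # drift V_i, with the entropic part centred under the weighted measure
        V = f(X) + sigma * (log_ratio - np.mean(w * log_ratio))
        w = w * np.exp(-V * dt)
        w = N * w / w.sum()   # the weights are supposed to add up to N
    return w
```

For instance, with $f \equiv 0$, a standard Gaussian target $\pi$ and particles on a uniform grid, the weights should concentrate around the mode of $\pi$, while their sum stays equal to $N$ by construction.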
From a practical point of view, such an algorithm is expected to perform better than the algorithm corresponding to the "pure" Fisher–Rao flow, due to the additional exploration of the state space provided by the diffusive movement (the main drawback of the pure Fisher–Rao flow is that it does not expand the support of the initial distribution of the particles; see also Remark 4.1). However, since our theoretical results cover only the case of the pure Fisher–Rao flow (cf. Remark 2.8), we formulate the algorithm without the diffusion part.

Finally, note that due to the Feynman–Kac formula, the Fisher–Rao flow (4) can be interpreted as a Kolmogorov equation for a stochastic process with killing (see (Applebaum, 2009, Section 6.7.2) or (Karatzas and Shreve, 1991, Theorem 5.7.6 and Exercise 5.7.10)). This leads to an alternative idea for constructing a particle system approximating (4), which, instead of rebalancing weights, uses a killing and replication mechanism with appropriately defined exponential clocks. Such algorithms were used in Lu et al. (2019); Pampel et al. (2023); Rotskoff et al. (2019); however, we were unable to provide a rigorous proof of propagation of chaos for such particle systems in our framework (1), which is why in this paper we work with system (19).

Remark 4.1. As a final remark on the practical applicability of the Fisher–Rao algorithm presented in this section, we would like to stress that the main issue with the pure Fisher–Rao flow is that it is very sensitive to initialization. This is reflected in the theoretical results in papers such as Lu et al. (2019, 2023); Liu et al. (2023), which state that, in order for the FR flow to converge (exponentially) to the target $\mu^*$, the initialization $\mu_0$ has to satisfy a "warm start" condition, i.e., there has to exist a constant $C > 0$ such that, for any $x \in \mathcal{X}$,
\[
\frac{d\mu_0}{d\mu^*}(x) \geq C.
\]
In other words, $\mu_0$ and $\mu^*$ have to be sufficiently similar to each other and, in particular, they need to have matching supports. Since in practice it may be difficult to choose an initialization that guarantees this "warm start" condition, especially in very high-dimensional settings, this may limit the applicability of the pure FR flow.

However, a natural idea for constructing a practically feasible algorithm that utilizes the pure FR flow would be to initially run a different flow to provide an initial exploration of the state space, and then to switch to the pure FR flow. For instance, one could first run a particle system approximating the pure Wasserstein flow (from an arbitrary initialization) for a certain amount of time $t_0 > 0$ that guarantees that there exists $C > 0$ such that, for any $x \in \mathcal{X}$,
\[
\frac{d\mu_{t_0}}{d\mu^*}(x) \geq C,
\]
and then "switch off" the Wasserstein flow and run a particle approximation of the pure FR flow. According to the theory from the papers cited above, a flow like this would converge exponentially to the target (and the corresponding algorithm would be cheaper to run than an algorithm that uses both the Wasserstein and the FR flow at all times). A fully rigorous analysis of the corresponding algorithm would require propagation of chaos results both for pure Wasserstein flows (which have been covered extensively in the literature, cf. the discussion in the introduction) and for pure FR flows (which we provide in the present paper). A full study of such algorithms (and in particular of the question of how to optimally choose $t_0$) will be the topic of our future work.

Acknowledgements

PL acknowledges funding from the Slovenian Research and Innovation Agency (ARIS) under programme No. P1-0448 and Croatian Science Foundation grant no. 2277.

References

Ambrosio, L., Gigli, N., and Savaré, G. (2008). Gradient Flows in Metric Spaces and in the Space of Probability Measures.
Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, 2nd edition.

Applebaum, D. (2009). Lévy processes and stochastic calculus, volume 116 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2nd edition.

Aubin-Frankowski, P.-C., Korba, A., and Léger, F. (2022). Mirror descent with relative smoothness in measure spaces, with application to Sinkhorn and EM. In Advances in Neural Information Processing Systems, volume 35, pages 17263–17275. Curran Associates, Inc.

Carrillo, J. A., Chen, Y., Zhengyu Huang, D., Huang, J., and Wei, D. (2024). Fisher-Rao Gradient Flow: Geodesic Convexity and Functional Inequalities. arXiv e-prints.

Carrillo, J. A., Craig, K., and Patacchini, F. S. (2019). A blob method for diffusion. Calculus of Variations and Partial Differential Equations, 58(2):53.

Carrillo, J. A., McCann, R. J., and Villani, C. (2003). Kinetic equilibration rates for granular media and related equations: entropy dissipation and mass transportation estimates. Rev. Mat. Iberoam., 19(3):971–1018.

Carrillo, J. A., Patacchini, F. S., Sternberg, P., and Wolansky, G. (2016). Convergence of a particle method for diffusive gradient flows in one dimension. SIAM Journal on Mathematical Analysis, 48(6):3708–3741.

Cavil, A. L., Oudjane, N., and Russo, F. (2017). Particle system algorithm and chaos propagation related to non-conservative McKean type stochastic differential equations. Stochastics and Partial Differential Equations: Analysis and Computations, 5(1):1–37.

Chen, F., Lin, Y., Ren, Z., and Wang, S. (2024). Uniform-in-time propagation of chaos for kinetic mean field Langevin dynamics. Electronic Journal of Probability, 29.

Chen, F., Ren, Z., and Wang, S. (2023). Entropic fictitious play for mean field optimization problem. J. Mach. Learn. Res., 24:Paper No. [211], 36.

Chen, F., Ren, Z., and Wang, S. (2025).
Uniform-in-time propagation of chaos for mean field Langevin dynamics. Ann. Inst. Henri Poincaré Probab. Stat., 61(4):2357–2404.

Chizat, L. (2022). Mean-field Langevin dynamics: Exponential convergence and annealing. Transactions on Machine Learning Research.

Delarue, F. and Tse, A. (2025). Uniform in time weak propagation of chaos on the torus. Ann. Inst. Henri Poincaré Probab. Stat., 61(2):1021–1074.

Domingo-Enrich, C., Jelassi, S., Mensch, A., Rotskoff, G., and Bruna, J. (2020). A mean-field analysis of two-player zero-sum games. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 20215–20226. Curran Associates, Inc.

Durmus, A., Eberle, A., Guillin, A., and Zimmer, R. (2020). An elementary approach to uniform in time propagation of chaos. Proc. Am. Math. Soc., 148(12):5387–5398.

Feinberg, E. A., Kasyanov, P. O., and Zadoianchuk, N. V. (2014). Fatou's lemma for weakly converging probabilities. Theory of Probability & Its Applications, 58(4):683–689.

Gallouët, T. O. and Monsaingeon, L. (2017). A JKO splitting scheme for Kantorovich–Fisher–Rao gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1100–1130.

Gu, A. and Kim, J. (2025). Mirror Mean-Field Langevin Dynamics. arXiv e-prints.

Guillin, A. and Monmarché, P. (2021). Uniform long-time and propagation of chaos estimates for mean field kinetic particles in non-convex landscapes. J. Stat. Phys., 185(2):20. Id/No 15.

Hu, K., Ren, Z., Šiška, D., and Szpruch, L. (2021). Mean-field Langevin dynamics and energy landscape of neural networks. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 57(4):2043–2065.

Jordan, R., Kinderlehrer, D., and Otto, F. (1998). The variational formulation of the Fokker-Planck equation. SIAM J. Math. Anal., 29(1):1–17.

Karatzas, I. and Shreve, S. E.
(1991). Brownian motion and stochastic calculus, volume 113 of Graduate Texts in Mathematics. Springer-Verlag, New York, 2nd edition.

Kerimkulov, B., Leahy, J., Šiška, D., Szpruch, L., and Zhang, Y. (2025). A Fisher–Rao gradient flow for entropy-regularised Markov decision processes in Polish spaces. Foundations of Computational Mathematics.

Lacker, D. (2018). Mean field games and interacting particle systems.

Lacker, D. and Le Flem, L. (2023). Sharp uniform-in-time propagation of chaos. Probability Theory and Related Fields, pages 1–38.

Lascu, R.-A. and Majka, M. B. (2025). Non-convex entropic mean-field optimization via Best Response flow. Advances in Neural Information Processing Systems (NeurIPS 2025).

Lascu, R.-A., Majka, M. B., and Szpruch, L. (2024). A Fisher-Rao gradient flow for entropic mean-field min-max games. Transactions on Machine Learning Research.

Lascu, R.-A., Majka, M. B., Šiška, D., and Szpruch, L. (2024). Linear convergence of proximal descent schemes on the Wasserstein space. arXiv e-prints.

Leahy, J.-M., Kerimkulov, B., Šiška, D., and Szpruch, L. (2022). Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 12222–12252. PMLR.

Liero, M., Mielke, A., and Savaré, G. (2018). Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. Invent. Math., 211(3):969–1117.

Liu, L., Majka, M. B., and Szpruch, L. (2023). Polyak–Łojasiewicz inequality on the space of measures and convergence of mean-field birth-death processes. Appl. Math. Optim., 87, 48.

Lu, Y., Lu, J., and Nolen, J. (2019). Accelerating Langevin sampling with birth-death. arXiv e-prints.
Lu, Y., Slepčev, D., and Wang, L. (2023). Birth–death dynamics for sampling: global convergence, approximations and their asymptotics. Nonlinearity, 36(11):5731.

Malrieu, F. (2001). Logarithmic Sobolev inequalities for some nonlinear PDE's. Stochastic Processes Appl., 95(1):109–132.

Mei, S., Montanari, A., and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33).

Monmarché, P. (2017). Long-time behaviour and propagation of chaos for mean field kinetic particles. Stochastic Processes Appl., 127(6):1721–1737.

Monmarché, P., Ren, Z., and Wang, S. (2024). Time-uniform log-Sobolev inequalities and applications to propagation of chaos. Electron. J. Probab., 29:Paper No. 154, 38.

Nitanda, A. (2024). Improved Particle Approximation Error for Mean Field Neural Networks. Advances in Neural Information Processing Systems (NeurIPS 2024).

Nitanda, A., Lee, A., Tan Xing Kai, D., Sakaguchi, M., and Suzuki, T. (2025). Propagation of Chaos for Mean-Field Langevin Dynamics and its Application to Model Ensemble. Forty-second International Conference on Machine Learning (ICML 2025).

Nitanda, A., Wu, D., and Suzuki, T. (2022). Convex analysis of the mean field Langevin dynamics. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I., editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 9741–9757. PMLR.

Pampel, B., Holbach, S., Hartung, L., and Valsson, O. (2023). Sampling rare event energy landscapes via birth-death augmented dynamics. Physical Review E, 107(2).

Rotskoff, G., Jelassi, S., Bruna, J., and Vanden-Eijnden, E. (2019). Neuron birth-death dynamics accelerates gradient descent and converges asymptotically. In Chaudhuri, K.
and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5508–5517. PMLR.

Salim, A., Korba, A., and Luise, G. (2020). The Wasserstein proximal gradient algorithm. In Advances in Neural Information Processing Systems, volume 33, pages 12356–12366. Curran Associates, Inc.

Schuh, K. (2024). Global contractivity for Langevin dynamics with distribution-dependent forces and uniform in time propagation of chaos. Ann. Inst. Henri Poincaré Probab. Stat., 60(2):753–789.

Suzuki, T., Nitanda, A., and Wu, D. (2023). Uniform-in-time propagation of chaos for the mean field gradient Langevin dynamics. The Eleventh International Conference on Learning Representations (ICLR 2023).

Zhu, J.-J. and Mielke, A. (2024). Kernel Approximation of Fisher-Rao Gradient Flows. arXiv e-prints.

Checklist

1. For all models and algorithms presented, check if you include:
(a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes]
(b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Not Applicable]
(c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Not Applicable]
2. For any theoretical claim, check if you include:
(a) Statements of the full set of assumptions of all theoretical results. [Yes]
(b) Complete proofs of all theoretical results. [Yes]
(c) Clear explanations of any assumptions. [Yes]
3. For all figures and tables that present empirical results, check if you include:
(a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Not Applicable]
(b) All the training details (e.g., data splits, hyperparameters, how they were chosen).
[Not Applicable]
(c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Not Applicable]
(d) A description of the computing infrastructure used (e.g., type of GPUs, internal cluster, or cloud provider). [Not Applicable]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:
(a) Citations of the creator if your work uses existing assets. [Not Applicable]
(b) The license information of the assets, if applicable. [Not Applicable]
(c) New assets either in the supplemental material or as a URL, if applicable. [Not Applicable]
(d) Information about consent from data providers/curators. [Not Applicable]
(e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]
5. If you used crowdsourcing or conducted research with human subjects, check if you include:
(a) The full text of instructions given to participants and screenshots. [Not Applicable]
(b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]
(c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]

Table of Contents

1 INTRODUCTION
 1.1 Contributions
2 MAIN RESULTS
 2.1 Preliminaries
 2.2 Existence of mean-field dynamics corresponding to the Fisher-Rao flow
 2.3 Interacting Particle Systems and Propagation of Chaos
 2.4 Discussion of other propagation of chaos results in the literature
3 ENTROPIC MEAN-FIELD PROBLEMS
 3.1 Setting for entropic mean-field optimization
 3.2 Kernelization Strategies
 3.3 Regularization by χ²-divergence
 3.4 Weak-* Convergence of Minimizers under Kernelization
4 ALGORITHM CONSTRUCTION
A WELL-POSEDNESS OF FISHER–RAO FLOWS
B PROPAGATION OF CHAOS FOR THE INTERACTING PARTICLE SYSTEM
C WEAK-* CONVERGENCE OF KERNELIZED MINIMIZERS
 C.1 Existence of minimizers - Compact case
 C.2 Existence of minimizers - Non-compact case
 C.3 Convergence of minimizers (for both cases)
D SUPPLEMENTARY DEFINITIONS AND TECHNICAL RESULTS
 D.1 Flat Derivative
 D.2 Technical Estimates
 D.3 Construction of the Kernel on a compact space
E VERIFICATION OF THE BOUNDEDNESS AND LIPSCHITZ CONDITION
 E.1 For Flow (25)
 E.2 For Flow (26)
 E.3 For Flow (27)
 E.4 For Flow (28)
 E.5 Boundedness and Lipschitz Condition for the Kernelized Fisher-Rao Gradient Flow Regularized by Chi-Square Divergence

Appendix: On Propagation of Chaos for the Fisher-Rao Gradient Flow in Entropic Mean-Field Optimization

A WELL-POSEDNESS OF FISHER–RAO FLOWS

Proof of Lemma 2.5. We aim to show that if $\nu_t$ solves (9), then the projected density
\[
\mu_t(x) := \int_0^\infty w\, \nu_t(x, w)\, dw
\]
satisfies (11) in the weak sense, namely that for all test functions $\varphi \in C_c^\infty(\mathcal{X})$,
\[
\partial_t \int_{\mathcal{X}} \varphi(x)\, \mu_t(x)\, dx = -\int_{\mathcal{X}} \varphi(x)\, a(\mu_t, x)\, \mu_t(x)\, dx.
\]
Since we assumed that $\nu_t$ solves (9) weakly, we know that for all $\psi \in C_c^\infty(\mathcal{X} \times \mathbb{R}_+)$ we have
\[
\partial_t \int_{\mathcal{X} \times \mathbb{R}_+} \psi(x, w)\, \nu_t(x, w)\, dx\, dw = -\int_{\mathcal{X} \times \mathbb{R}_+} \frac{\partial \psi}{\partial w}(x, w)\, \nu_t(x, w)\, w\, a(\mu_t, x)\, dx\, dw.
\]
Thanks to Remark 2.4, we may choose $\psi$ of the form $\psi(x, w) = w\, \varphi(x)$ for $\varphi \in C_c^\infty(\mathcal{X})$. Hence the above equation becomes
\[
\partial_t \int_{\mathcal{X} \times \mathbb{R}_+} w\, \varphi(x)\, \nu_t(x, w)\, dx\, dw = -\int_{\mathcal{X} \times \mathbb{R}_+} \varphi(x)\, \nu_t(x, w)\, w\, a(\mu_t, x)\, dx\, dw.
\]
The left-hand side can be rewritten as
\[
\mathrm{LHS} = \partial_t \int_{\mathcal{X}} \varphi(x) \int_0^\infty w\, \nu_t(x, w)\, dw\, dx = \partial_t \int_{\mathcal{X}} \varphi(x)\, \mu_t(x)\, dx,
\]
and the right-hand side as
\[
\mathrm{RHS} = -\int_{\mathcal{X}} \varphi(x)\, a(\mu_t, x) \left( \int_0^\infty w\, \nu_t(x, w)\, dw \right) dx = -\int_{\mathcal{X}} \varphi(x)\, a(\mu_t, x)\, \mu_t(x)\, dx.
\]
Putting these together, we have shown that
\[
\partial_t \int_{\mathcal{X}} \varphi(x)\, \mu_t(x)\, dx = -\int_{\mathcal{X}} \varphi(x)\, a(\mu_t, x)\, \mu_t(x)\, dx,
\]
which completes the proof.

Proof of Theorem 2.6. Let us take a test function $f \in C_c^\infty(\mathcal{X} \times \mathbb{R}_+)$.
By the chain rule, we have
\[
f(X, w_t) = f(X, w_0) + \int_0^t \frac{\partial f}{\partial w}(X, w_s)\, dw_s = f(X, w_0) - \int_0^t \frac{\partial f}{\partial w}(X, w_s)\, w_s\, a\big( h\mathrm{Law}(X, w_s), X \big)\, ds.
\]
Now, taking expectations on both sides, we get
\[
\int_{\mathcal{X} \times \mathbb{R}_+} f(x, w)\, \mathrm{Law}(X, w_t)(x, w)\, dx\, dw = \int_{\mathcal{X} \times \mathbb{R}_+} f(x, w)\, \mathrm{Law}(X, w_0)(x, w)\, dx\, dw - \int_0^t \int_{\mathcal{X} \times \mathbb{R}_+} \frac{\partial f}{\partial w}(x, w)\, w\, a\big( h\mathrm{Law}(X, w_s), x \big)\, \mathrm{Law}(X, w_s)(x, w)\, dx\, dw\, ds.
\]
For the second term, integrating by parts, we obtain
\[
\int_{\mathcal{X} \times \mathbb{R}_+} f(x, w)\, \mathrm{Law}(X, w_t)(x, w)\, dx\, dw = \int_{\mathcal{X} \times \mathbb{R}_+} f(x, w)\, \mathrm{Law}(X, w_0)(x, w)\, dx\, dw + \int_0^t \int_{\mathcal{X} \times \mathbb{R}_+} f(x, w)\, \frac{\partial}{\partial w}\Big( \mathrm{Law}(X, w_s)(x, w)\, w\, a\big( h\mathrm{Law}(X, w_s), x \big) \Big)\, dx\, dw\, ds,
\]
which shows that $\mathrm{Law}(X, w_t)$ is a weak solution in the sense
\[
\begin{cases} \partial_t \nu_t = \dfrac{\partial}{\partial w}\big( \nu_t(x, w)\, w\, a(h\nu_t, x) \big), \\ \nu_t \rightharpoonup \nu_0 \text{ as } t \to 0. \end{cases} \tag{30}
\]

Proof of Theorem 2.9. We abbreviate $\mathcal{P}_2(C([0, T]; \mathcal{X} \times \mathbb{R}_+))$ by $\mathcal{P}^T_2$ and define the mapping
\[
\Psi : \mathcal{P}^T_2 \to \mathcal{P}^T_2, \qquad \nu \mapsto \mathrm{Law}(X^\nu, w^\nu),
\]
where, for a fixed $\nu \in \mathcal{P}^T_2$,
\[
(X^\nu, w^\nu_0) \sim \nu_0, \qquad dw^\nu_t = -w^\nu_t\, a\big( h\nu_t, X^\nu \big)\, dt, \quad t \leq T. \tag{31}
\]
Here $\nu$ is treated as an external input, independent of the process itself, so (31) is not a McKean–Vlasov equation but an ODE in $(X^\nu, w^\nu_t)$. By the classical existence and uniqueness theory for ODEs with Lipschitz drift, the system admits a unique strong solution whenever the coefficients are globally Lipschitz in $(x, w)$. In our setting, it suffices to check that $(x, w) \mapsto w\, a(h\nu_t, x)$ is globally Lipschitz. Indeed, for any $(x, w), (x', w') \in \mathcal{X} \times \mathbb{R}_+$,
\[
|w\, a(h\nu_t, x) - w'\, a(h\nu_t, x')| \leq |w - w'|\, |a(h\nu_t, x)| + |w'|\, |a(h\nu_t, x) - a(h\nu_t, x')| \leq M_a |w - w'| + e^{M_a T} L_a |x - x'|,
\]
where $M_a$ and $L_a$ denote the uniform bound and the Lipschitz constant of $a$, respectively.
Moreover, the boundedness of $a$ implies, via Proposition D.2, that $w^\nu_t$ is uniformly bounded in $t$, which ensures that the drift in (31) is indeed globally Lipschitz. Furthermore, $\mathbb{E}\big[ \|(X^\nu, w^\nu)\|_T^2 \big] < \infty$, so for all $\nu \in \mathcal{P}^T_2$ the mapping $\Psi(\nu)$ remains in $\mathcal{P}^T_2$.

Now let $\nu, \nu' \in \mathcal{P}^T_2$ be such that $\nu_0 = \nu'_0$. We associate to them the processes $(X^\nu, w^\nu)$ and $(X^{\nu'}, w^{\nu'})$, which share identical initial conditions almost surely: $(X^\nu_0, w^\nu_0) = (X^{\nu'}_0, w^{\nu'}_0)$ a.s. Let $\Psi(\nu)$ and $\Psi(\nu')$ denote the probability measures generated by (31). We then have
\[
\begin{aligned}
\mathbb{E}\big[ \|w^\nu - w^{\nu'}\|_T^2 \big] &\leq T\, \mathbb{E}\left[ \int_0^T \big| w^\nu_s\, a(h\nu_s, X^\nu) - w^{\nu'}_s\, a(h\nu'_s, X^{\nu'}) \big|^2\, ds \right] \\
&\leq 2T\, \mathbb{E}\left[ \int_0^T \big| w^\nu_s\, a(h\nu_s, X^\nu) - w^\nu_s\, a(h\nu'_s, X^{\nu'}) \big|^2\, ds \right] + 2T\, \mathbb{E}\left[ \int_0^T \big| w^\nu_s\, a(h\nu'_s, X^{\nu'}) - w^{\nu'}_s\, a(h\nu'_s, X^{\nu'}) \big|^2\, ds \right] \\
&\leq 2T L_a^2 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_2^2(h\nu_s, h\nu'_s) + \big| X^\nu - X^{\nu'} \big|^2\, ds \right] + 2T M_a^2\, \mathbb{E}\left[ \int_0^T \big\| w^\nu - w^{\nu'} \big\|_s^2\, ds \right].
\end{aligned}
\]
Here the first inequality follows from the Cauchy–Schwarz inequality, the second from the elementary bound $(a+b)^2 \leq 2a^2 + 2b^2$, and the third from the Lipschitz continuity of $a$ together with the boundedness of $w^\nu$ (Proposition D.2) and the uniform bound $|a| \leq M_a$; in the last term we also used the fact that $|f(s)| \leq \|f\|_s$ for any $s \leq T$. Note that we have $\mathbb{E}\big[ |X^\nu_0 - X^{\nu'}_0|^2 \big] = 0$. Combining these together, we have
\[
\mathbb{E}\big[ \|X^\nu - X^{\nu'}\|_T^2 + \|w^\nu - w^{\nu'}\|_T^2 \big] = \mathbb{E}\big[ \|w^\nu - w^{\nu'}\|_T^2 \big] \leq 2T L_a^2 e^{2M_a T} \int_0^T W_2^2(h\nu_s, h\nu'_s)\, ds + 2T M_a^2\, \mathbb{E}\left[ \int_0^T \big\| w^\nu - w^{\nu'} \big\|_s^2\, ds \right].
\]
By applying Grönwall's lemma together with Proposition D.2, and then using $W_2(\nu_s, \nu'_s) \leq W_{2,s}(\nu, \nu')$ for all $s \in [0, T]$, we obtain
\[
\mathbb{E}\big[ \|w^\nu - w^{\nu'}\|_T^2 \big] \leq 2T L_a^2 e^{4M_a T + 2T^2 M_a^2} \int_0^T W_2^2(\nu_s, \nu'_s)\, ds \leq 2T L_a^2 e^{4M_a T + 2T^2 M_a^2} \int_0^T W_{2,s}^2(\nu, \nu')\, ds.
\]
Denoting $C := 2T L_a^2 e^{4M_a T + 2T^2 M_a^2}$, we have
\[
\mathbb{E}\big[ \|X^\nu - X^{\nu'}\|_T^2 + \|w^\nu - w^{\nu'}\|_T^2 \big] \leq C \int_0^T W_{2,s}^2(\nu, \nu')\, ds. \tag{32}
\]
Moreover, the joint law of $\big( (X^\nu, w^\nu), (X^{\nu'}, w^{\nu'}) \big)$ is a coupling between $\Psi(\nu)$ and $\Psi(\nu')$, i.e., it belongs to the set $\Pi(\Psi(\nu), \Psi(\nu'))$. Hence
\[
W_{2,T}^2\big( \Psi(\nu), \Psi(\nu') \big) = \inf_{\pi \in \Pi(\Psi(\nu), \Psi(\nu'))} \mathbb{E}_{((X, w), (X', w')) \sim \pi}\big[ \|X - X'\|_T^2 + \|w - w'\|_T^2 \big] \leq \mathbb{E}\big[ \|X^\nu - X^{\nu'}\|_T^2 + \|w^\nu - w^{\nu'}\|_T^2 \big] \leq C \int_0^T W_{2,s}^2(\nu, \nu')\, ds.
\]
We conclude that
\[
W_{2,T}^2\big( \Psi(\nu), \Psi(\nu') \big) \leq C \int_0^T W_{2,s}^2(\nu, \nu')\, ds. \tag{33}
\]
Hence, iterating (33), for all $\nu \in \mathcal{P}^T_2$ we have
\[
W_{2,T}^2\big( \Psi^{k+1}(\nu), \Psi^k(\nu) \big) \leq C \int_0^T W_{2,s}^2\big( \Psi^k(\nu), \Psi^{k-1}(\nu) \big)\, ds \leq C^k \int_0^T \int_0^s \int_0^{s_1} \cdots \int_0^{s_{k-2}} W_{2,s_{k-1}}^2\big( \Psi(\nu), \nu \big)\, ds_{k-1} \cdots ds_1\, ds \leq \frac{C^k T^k}{k!}\, W_{2,T}^2\big( \Psi(\nu), \nu \big).
\]
As $(\mathcal{P}^T_2, W_{2,T})$ is complete and $(\Psi^k(\nu))_{k \geq 1}$ is a Cauchy sequence, the sequence $\Psi^k(\nu)$ converges to the unique fixed point of $\Psi$.

B PROPAGATION OF CHAOS FOR THE INTERACTING PARTICLE SYSTEM

Remark B.1. For the particle system under consideration, the total weight $\sum_{i=1}^N w^{i,N}_t$ is conserved in time. Indeed, using the property of $a$ that $\int a(m, x)\, m(dx) = 0$ for any probability measure $m$, we compute
\[
\partial_t \sum_{i=1}^N w^{i,N}_t = -\sum_{i=1}^N w^{i,N}_t\, a\big( h\nu^N_t, X^{i,N}_t \big) = -\int a\big( h\nu^N_t, x \big) \sum_{i=1}^N w^{i,N}_t\, \delta_{X^{i,N}_t}(dx) = -\left( \sum_{i=1}^N w^{i,N}_t \right) \int a\big( h\nu^N_t, x \big)\, h\nu^N_t(dx) = 0,
\]
which implies $\sum_{i=1}^N w^{i,N}_t = \sum_{i=1}^N w^{i,N}_0$ for all $t \geq 0$. The same property holds for the tilde system $\tilde{\nu}^N_t$.
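The conservation argument in Remark B.1 is easy to check numerically: for any drift of the centred form $a(m, x) = f(x) - \int f\, dm$ (so that $\int a(m, \cdot)\, dm = 0$ by construction), even an explicit Euler discretization of $dw^{i}_t = -w^{i}_t\, a(h\nu^N_t, X^{i}_t)\, dt$ preserves $\sum_i w^{i}_t = N$, up to floating-point error. This is our toy illustration with $f = \tanh$, not one of the flows from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=N)   # fixed particle positions
w = np.ones(N)           # initial weights, summing to N

def drift(w, X):
    # a(m, x) = f(x) - int f dm with m = (1/N) sum_i w_i delta_{X_i},
    # so that int a(m, x) m(dx) = 0 whenever sum_i w_i = N
    f = np.tanh(X)
    return f - np.mean(w * f)

dt, steps = 0.01, 500
for _ in range(steps):
    a = drift(w, X)
    w = w * (1.0 - dt * a)   # Euler step for dw_i = -w_i a_i dt

total_weight_drift = abs(w.sum() - N)
```

Since $\sum_i w_i\, a_i = \sum_i w_i f_i - \big( \sum_i w_i \big) \frac{1}{N} \sum_j w_j f_j$ vanishes whenever $\sum_i w_i = N$, the Euler step leaves the total weight invariant at every iteration, so `total_weight_drift` stays at floating-point roundoff level.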
In this work, we choose $\sum_{i=1}^N w^{i,N}_0 = N$. This convention ensures that, with the empirical measure
\[
\nu^N_t = \frac{1}{N} \sum_{i=1}^N \delta_{(w^{i,N}_t, X^{i,N}_t)},
\]
its projection
\[
h\nu^N_t = \frac{1}{N} \sum_{i=1}^N w^{i,N}_t\, \delta_{X^{i,N}_t}
\]
is a probability measure. The same reasoning applies to $\tilde{\nu}^N_t$ and $h\tilde{\nu}^N_t$.

We now state an important remark on the convergence result.

Remark B.2. Let $\mathcal{X}$ be a separable metric space and $(X^i)_{i \geq 1}$ be i.i.d. $\mathcal{X}$-valued random variables with law $\mu$. Let $\mu^N := \frac{1}{N} \sum_{i=1}^N \delta_{X^i}$ be the empirical measure. If $p \geq 1$ and $\mu \in \mathcal{P}_p(\mathcal{X})$, then $W_p(\mu^N, \mu) \to 0$ almost surely as $N \to \infty$, and moreover $\mathbb{E}\big[ W_p^p(\mu^N, \mu) \big] \to 0$. A proof of this result can be found in (Lacker, 2018, Corollary 2.14).

Proof of Theorem 2.10. Let us rewrite the particle systems (17) and (19) in integral form. For the independent system we have
\[
X^i_t = X^i_0, \qquad w^i_t = w^i_0 - \int_0^t w^i_s\, a\big( h\nu_s, X^i \big)\, ds,
\]
and for the interacting system
\[
\tilde{X}^{i,N}_t = \tilde{X}^{i,N}_0, \qquad \tilde{w}^{i,N}_t = \tilde{w}^{i,N}_0 - \int_0^t \tilde{w}^{i,N}_s\, a\big( h\tilde{\nu}^N_s, \tilde{X}^{i,N} \big)\, ds.
\]
We have
\[
\begin{aligned}
\mathbb{E}\big[ \|w^i - \tilde{w}^{i,N}\|_T^2 \big] &\leq \mathbb{E}\big[ |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] + T\, \mathbb{E}\left[ \int_0^T \big| w^i_s\, a(h\nu_s, X^i) - \tilde{w}^{i,N}_s\, a(h\tilde{\nu}^N_s, \tilde{X}^{i,N}) \big|^2\, ds \right] \\
&\leq \mathbb{E}\big[ |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] + 2T\, \mathbb{E}\left[ \int_0^T \big| w^i_s\, a(h\nu_s, X^i) - w^i_s\, a(h\tilde{\nu}^N_s, \tilde{X}^{i,N}) \big|^2 + \big| w^i_s\, a(h\tilde{\nu}^N_s, \tilde{X}^{i,N}) - \tilde{w}^{i,N}_s\, a(h\tilde{\nu}^N_s, \tilde{X}^{i,N}) \big|^2\, ds \right] \\
&\leq \mathbb{E}\big[ |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] + 2T L_a^2 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_2^2\big( h\nu_s, h\tilde{\nu}^N_s \big) + \big| X^i - \tilde{X}^{i,N} \big|^2\, ds \right] + 2T M_a^2\, \mathbb{E}\left[ \int_0^T \big| w^i_s - \tilde{w}^{i,N}_s \big|^2\, ds \right] \\
&\leq \mathbb{E}\big[ |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] + 2T L_a^2 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_2^2\big( h\nu_s, h\tilde{\nu}^N_s \big)\, ds \right] + \big( 2T M_a^2 + 2T L_a^2 e^{2M_a T} \big)\, \mathbb{E}\left[ \int_0^T \big\| X^i - \tilde{X}^{i,N} \big\|_s^2 + \big\| w^i - \tilde{w}^{i,N} \big\|_s^2\, ds \right],
\end{aligned}
\]
where the first inequality follows from integrating the square of the drift term over $[0, T]$ and applying the Cauchy–Schwarz inequality, the second uses the standard bound $(a+b)^2 \leq 2a^2 + 2b^2$, the third follows from the boundedness of $a$ by $M_a$, its Lipschitz continuity with constant $L_a$, and the uniform bound $|w^i_s| \leq e^{M_a T}$ from Proposition D.2, and the last is obtained by combining like terms and applying the bound $|f(s)| \leq \|f\|_s$. Moreover, since $X^i$ and $\tilde{X}^{i,N}$ are constant in time, their sup-norm difference reduces to the pointwise difference, i.e.,
\[
\mathbb{E}\big[ \|X^i - \tilde{X}^{i,N}\|_T^2 \big] = \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 \big].
\]
Combining these together, we have
\[
\mathbb{E}\big[ \|X^i - \tilde{X}^{i,N}\|_T^2 + \|w^i - \tilde{w}^{i,N}\|_T^2 \big] \leq 2T L_a^2 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_2^2\big( h\nu_s, h\tilde{\nu}^N_s \big)\, ds \right] + \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] + \big( 2T M_a^2 + 2T L_a^2 e^{2M_a T} \big)\, \mathbb{E}\left[ \int_0^T \big\| X^i - \tilde{X}^{i,N} \big\|_s^2 + \big\| w^i - \tilde{w}^{i,N} \big\|_s^2\, ds \right].
\]
Let us denote $C_1 = 2T L_a^2 e^{2M_a T}$ and $C_2 = 2T M_a^2 + 2T L_a^2 e^{2M_a T}$.
Remark that both $C_1$ and $C_2$ are independent of the number of particles $N$. By applying Grönwall's lemma and using the stability estimates from Propositions D.2 and D.3, we obtain
\[
\begin{aligned}
\mathbb{E}\big[ \|X^i - \tilde{X}^{i,N}\|_T^2 + \|w^i - \tilde{w}^{i,N}\|_T^2 \big] &\leq e^{C_2 T} \left( C_1\, \mathbb{E}\left[ \int_0^T W_2^2\big( h\nu_s, h\tilde{\nu}^N_s \big)\, ds \right] + \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] \right) \\
&\leq e^{C_2 T} \left( C_1 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_2^2\big( \nu_s, \tilde{\nu}^N_s \big)\, ds \right] + \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] \right) \\
&\leq e^{C_2 T} \left( C_1 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_{2,s}^2\big( \nu, \tilde{\nu}^N \big)\, ds \right] + \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] \right). \tag{34}
\end{aligned}
\]
We also have
\[
\mathbb{E}\big[ W_{2,T}^2\big( \nu^N, \tilde{\nu}^N \big) \big] \leq \frac{1}{N}\, \mathbb{E}\left[ \sum_{i=1}^N \|X^i - \tilde{X}^{i,N}\|_T^2 + \|w^i - \tilde{w}^{i,N}\|_T^2 \right], \tag{35}
\]
which follows by choosing the specific coupling $\frac{1}{N} \sum_{i=1}^N \delta_{((w^i, X^i), (\tilde{w}^{i,N}, \tilde{X}^{i,N}))}$. Using the triangle inequality, (35) and (34), in that order, we obtain
\[
\begin{aligned}
\mathbb{E}\big[ W_{2,T}^2\big( \nu, \tilde{\nu}^N \big) \big] &\leq 2\, \mathbb{E}\big[ W_{2,T}^2\big( \nu, \nu^N \big) \big] + 2\, \mathbb{E}\big[ W_{2,T}^2\big( \nu^N, \tilde{\nu}^N \big) \big] \\
&\leq 2\, \mathbb{E}\big[ W_{2,T}^2\big( \nu, \nu^N \big) \big] + \frac{2}{N}\, \mathbb{E}\left[ \sum_{i=1}^N \|X^i - \tilde{X}^{i,N}\|_T^2 + \|w^i - \tilde{w}^{i,N}\|_T^2 \right] \\
&\leq 2\, \mathbb{E}\big[ W_{2,T}^2\big( \nu, \nu^N \big) \big] + 2 e^{C_2 T} C_1 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_{2,s}^2\big( \nu, \tilde{\nu}^N \big)\, ds \right] + \frac{2 e^{C_2 T}}{N} \sum_{i=1}^N \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big].
\end{aligned}
\]
By applying Grönwall's lemma again, we have
\[
\mathbb{E}\big[ W_{2,T}^2\big( \nu, \tilde{\nu}^N \big) \big] \leq 2\, e^{2 e^{C_2 T} C_1 e^{2M_a T} T} \left( \mathbb{E}\big[ W_{2,T}^2\big( \nu, \nu^N \big) \big] + \frac{e^{C_2 T}}{N} \sum_{i=1}^N \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] \right),
\]
which yields the result thanks to Remark B.2 and our assumption (21) on the initial condition.

Proof of Corollary 2.11. By Propositions D.2 and D.3, we have, for all $t \in [0, T]$,
\[
W_2^2\big( \mu_t, \tilde{\mu}^N_t \big) \leq e^{2M_a}\, W_2^2\big( \nu_t, \tilde{\nu}^N_t \big) \leq e^{2M_a}\, W_{2,T}^2\big( \nu, \tilde{\nu}^N \big).
\]
Since the right-hand side does not depend on $t$, taking the supremum over $t \in [0,T]$ yields
\[
\sup_{t\in[0,T]} W_2^2\big(\mu_t, \tilde\mu^N_t\big) \le e^{2M_aT}\,W_{2,T}^2\big(\nu, \tilde\nu^N\big).
\]
Applying Theorem 2.10 then gives the desired result.

C WEAK-* CONVERGENCE OF KERNELIZED MINIMIZERS

In this section, we investigate the variational properties of the kernelized energy
\[
V^\sigma_\varepsilon(m) = F(m) + \sigma \int \log\Big(\frac{K_\varepsilon * m}{\pi}\Big)\,dm,
\]
with particular emphasis on the existence of minimizers and their behaviour as the regularization parameter $\varepsilon \to 0$. We will prove Theorem 3.3 under Assumptions 2, 3 and 4 for compact $\mathcal{X} \subset \mathbb{R}^d$, and also formulate an additional result for $\mathcal{X} = \mathbb{R}^d$. More precisely, for $\mathcal{X} = \mathbb{R}^d$ we need the following growth condition on the potential $U$ (recall that $\pi \propto e^{-U}$).

Assumption 5. We assume that there exist constants $C_0, A_0 \in \mathbb{R}$ and $C_1, A_1 > 0$ such that
\[
C_0 + C_1|x|^2 \le U(x) \le A_0 + A_1|x|^2, \qquad \forall x \in \mathbb{R}^d.
\]

Then we have the following additional result.

Theorem C.1 (Weak-* convergence of minimizers - Non-compact case). Suppose $\mathcal{X} = \mathbb{R}^d$, let Assumptions 2, 3, 4 and 5 hold, and suppose the kernel $\xi$ is Gaussian. Then there exists a minimizer of $V^\sigma_\varepsilon$ in $\mathcal{P}_2(\mathbb{R}^d)$ for all $\varepsilon > 0$. Moreover, if for each $\varepsilon > 0$, $\mu_\varepsilon \in \mathcal{P}_2(\mathbb{R}^d)$ is a minimizer of $V^\sigma_\varepsilon$, then there exists a subsequence $(\varepsilon_k)_{k\ge1}$ with $\varepsilon_k \to 0$ as $k \to \infty$ such that $\mu_{\varepsilon_k} \stackrel{*}{\rightharpoonup} \mu$ as $k \to \infty$, where $\mu$ is a minimizer of $V^\sigma$.

We will split the proof into the compact case (Theorem 3.3) and the non-compact case (Theorem C.1), following Carrillo et al. (2019).

C.1 Existence of minimizers - Compact case

Proposition C.2. Suppose Assumptions 2, 3 and 4 hold. Then for all $\varepsilon > 0$, the energy functional
\[
V^\sigma_\varepsilon(m) := F(m) + \sigma \int_{\mathcal{X}} \log\Big(\frac{K_\varepsilon * m}{\pi}\Big)\,dm
\]
is lower semi-continuous with respect to weak convergence in $\mathcal{P}(\mathcal{X})$.

Proof. Let $(m_n)_{n\in\mathbb{N}} \subset \mathcal{P}(\mathcal{X})$ be a sequence of probability measures converging weakly to $m \in \mathcal{P}(\mathcal{X})$.
We aim to show
\[
\liminf_{n\to\infty}\Big( F(m_n) + \sigma \int \log\Big(\frac{K_\varepsilon * m_n}{\pi}\Big)\,dm_n \Big) \ge F(m) + \sigma \int \log\Big(\frac{K_\varepsilon * m}{\pi}\Big)\,dm.
\]
Since $F$ is lower semi-continuous, and $\pi$ is continuous and bounded from below by assumption, the map $m \mapsto F(m) - \sigma \int \log \pi\,dm$ is lower semi-continuous. Hence it suffices to establish
\[
\liminf_{n\to\infty} \int \log(K_\varepsilon * m_n)\,dm_n \ge \int \log(K_\varepsilon * m)\,dm.
\]
Note that for fixed $\varepsilon > 0$, the function $x \mapsto \log(K_\varepsilon * m_n(x))$ is continuous and bounded from below by $\log K_{\min,\varepsilon}$, where $K_{\min,\varepsilon} := \inf_{x\in\mathcal{X}} K_\varepsilon(x)$. Therefore, we can apply a version of the generalized Fatou lemma (see, e.g., Feinberg et al. (2014)), which yields
\[
\liminf_{n\to\infty} \int \log\Big(\frac{K_\varepsilon * m_n(x)}{K_{\min,\varepsilon}}\Big)\,dm_n(x) \ge \int \liminf_{n\to\infty,\,x'\to x} \log\Big(\frac{K_\varepsilon * m_n(x')}{K_{\min,\varepsilon}}\Big)\,dm(x). \quad (36)
\]
Moreover, for each $x \in \mathcal{X}$, the convolution $K_\varepsilon * m_n(x')$ converges to $K_\varepsilon * m(x)$ as $n \to \infty$ and $x' \to x$, due to the weak convergence and the continuity of the kernel. Therefore,
\[
\liminf_{n\to\infty,\,x'\to x} \log\Big(\frac{K_\varepsilon * m_n(x')}{K_{\min,\varepsilon}}\Big) = \log\Big(\frac{K_\varepsilon * m(x)}{K_{\min,\varepsilon}}\Big). \quad (37)
\]
Combining (36) and (37), we conclude that
\[
\liminf_{n\to\infty} \int \log(K_\varepsilon * m_n)\,dm_n \ge \int \log(K_\varepsilon * m)\,dm,
\]
which proves the lower semi-continuity of $V^\sigma_\varepsilon$.

Proof of Theorem 3.3 (Step 1: Existence of minimizers). For a compact domain $\mathcal{X}$, the argument is straightforward. The energy $V^\sigma_\varepsilon$ is bounded from below. Moreover, any sequence of probability measures (in particular, any minimizing sequence) is automatically tight as a direct consequence of compactness. Indeed, for any $p \ge 1$,
\[
\int_{\mathcal{X}} |x|^p\,dm(x) \le \sup_{x\in\mathcal{X}} |x|^p.
\]
Together with the lower semi-continuity established in Proposition C.2, this implies the existence of a weakly-* convergent subsequence whose limit is a minimizer of $V^\sigma_\varepsilon$.

C.2 Existence of minimizers - Non-compact case

Proposition C.3. Suppose Assumptions 2, 3, 4 and 5 hold and the kernel $\xi$ is Gaussian.
Then for all $\varepsilon > 0$, the energy functional
\[
V^\sigma_\varepsilon(m) := F(m) + \sigma \int_{\mathbb{R}^d} \log\Big(\frac{K_\varepsilon * m}{\pi}\Big)\,dm
\]
is lower semi-continuous with respect to weak convergence in $\mathcal{P}_2(\mathbb{R}^d)$.

Proof. Let $(m_n)_{n\in\mathbb{N}} \subset \mathcal{P}_2(\mathbb{R}^d)$ be a sequence of probability measures converging weakly to some $m \in \mathcal{P}_2(\mathbb{R}^d)$. We aim to show
\[
\liminf_{n\to\infty}\Big( F(m_n) + \sigma \int \log\Big(\frac{K_\varepsilon * m_n}{\pi}\Big)\,dm_n \Big) \ge F(m) + \sigma \int \log\Big(\frac{K_\varepsilon * m}{\pi}\Big)\,dm.
\]
Since $F$ is lower semi-continuous and $\log \pi$ is integrable with respect to any $m \in \mathcal{P}_2(\mathbb{R}^d)$ due to the quadratic growth of $U$, it suffices to establish
\[
\liminf_{n\to\infty} \int \log(K_\varepsilon * m_n)\,dm_n \ge \int \log(K_\varepsilon * m)\,dm.
\]
We would like to apply the generalized Fatou lemma (which holds for non-negative functions). However, unlike in the compact case, when we work on $\mathbb{R}^d$ we do not have a lower bound for the kernel, so we need a different approach to obtain a non-negative function. From (Carrillo et al., 2019, Proposition 3.9, equation (54)) we know that if $K$ is Gaussian, then there exist $x_0 \in \mathbb{R}^d$ and $C_0, C_1 \in \mathbb{R}$ such that for $n$ sufficiently large we have
\[
\log(K_\varepsilon * m_n)(x) \ge C_0|x - x_0|^2 + C_1.
\]
Denoting the left-hand side by $f_n(x)$ and the right-hand side by $q(x)$, we can apply the generalized Fatou lemma to $f_n - q \ge 0$ and obtain
\[
\liminf_{n\to\infty} \int \big(f_n(x) - q(x)\big)\,dm_n(x) \ge \int \liminf_{n\to\infty,\,x'\to x} \big(f_n(x') - q(x')\big)\,dm(x). \quad (38)
\]
Since
\[
\lim_{n\to\infty} \int_{\mathbb{R}^d} \big(-q(x)\big)\,dm_n(x) = \int_{\mathbb{R}^d} \big(-q(x)\big)\,dm(x) = \int_{\mathbb{R}^d} \liminf_{n\to\infty,\,x'\to x} \big(-q(x')\big)\,dm(x),
\]
we can cancel the $q$ terms on both sides and conclude as in the compact case that
\[
\liminf_{n\to\infty} \int \log(K_\varepsilon * m_n)\,dm_n \ge \int \log(K_\varepsilon * m)\,dm,
\]
which proves the lower semi-continuity of $V^\sigma_\varepsilon$.

Lemma C.4 (Carrillo et al., 2016, Lemma 4.1). Suppose $\rho$ is a probability density on $\mathbb{R}^d$ with finite second moment $M_2(\rho) := \int_{\mathbb{R}^d} |x|^2 \rho(x)\,dx$. Then for all $\delta > 0$ we have
\[
\int_{\mathbb{R}^d} \log \rho(x)\,\rho(x)\,dx \ge -\Big(\frac{2\pi}{\delta}\Big)^{d/2} - \delta M_2(\rho).
\]

Proof.
We split the integration domain:
\[
\begin{aligned}
\int_{\mathbb{R}^d} \log \rho(x)\,\rho(x)\,dx
&= \int_{\{\log\rho \le 0\}} \log \rho(x)\,\rho(x)\,dx + \int_{\{\log\rho > 0\}} \log \rho(x)\,\rho(x)\,dx \\
&\ge \int_{\{\log\rho \le 0\}} \log \rho(x)\,\rho(x)\,dx
= -\int_{\{\rho \le 1\}} |\log \rho(x)|\,\rho(x)\,dx \\
&= -\int_{\{\rho \le e^{-\delta|x|^2}\}} |\log \rho(x)|\,\rho(x)\,dx - \int_{\{e^{-\delta|x|^2} \le \rho \le 1\}} |\log \rho(x)|\,\rho(x)\,dx \\
&\ge -\int_{\{\rho \le e^{-\delta|x|^2}\}} \sqrt{\rho(x)}\,dx - \delta \int_{\{e^{-\delta|x|^2} \le \rho \le 1\}} |x|^2\,\rho(x)\,dx,
\end{aligned}
\]
where the last inequality uses the fact that $x|\log x| \le \sqrt{x}$ for $x \in (0,1]$ on the first set, and $|\log\rho(x)| \le \delta|x|^2$ on the second set. For the first term, since $\rho(x) \le e^{-\delta|x|^2}$ there, we estimate
\[
\int_{\{\rho \le e^{-\delta|x|^2}\}} \sqrt{\rho(x)}\,dx \le \int_{\mathbb{R}^d} e^{-\delta|x|^2/2}\,dx = \Big(\frac{2\pi}{\delta}\Big)^{d/2}.
\]
For the second term, note that
\[
\int_{\{e^{-\delta|x|^2} \le \rho \le 1\}} |x|^2\,\rho(x)\,dx \le M_2(\rho).
\]
Combining the estimates,
\[
\int_{\mathbb{R}^d} \log \rho(x)\,\rho(x)\,dx \ge -\Big(\frac{2\pi}{\delta}\Big)^{d/2} - \delta M_2(\rho).
\]

Proposition C.5. Suppose Assumptions 2, 3, 4 and 5 hold. Then for all $\delta > 0$ we have
\[
\frac{1}{\sigma} V^\sigma_\varepsilon(m) \ge \frac{1}{\sigma} F_{\min} + C_0 + C_1 M_2(m) - \Big(\frac{2\pi}{\delta}\Big)^{d/2} - 2\delta\big(M_2(m) + \varepsilon^2 M_2(\xi)\big). \quad (39)
\]

Proof. We begin by rewriting the regularized energy as the sum of three terms:
\[
V^\sigma_\varepsilon(m) = F(m) + \sigma \int \log K_\varepsilon * m(x)\,dm(x) - \sigma \int \log \pi(x)\,dm(x),
\]
and we estimate each term separately. First, thanks to Assumption 2(ii), $F(m) \ge F_{\min}$. Next, for the third term, note that since $\pi(x) = e^{-U(x)}$, we have
\[
-\int \log \pi(x)\,dm(x) = \int U(x)\,dm(x) \ge \int \big(C_0 + C_1|x|^2\big)\,dm(x) = C_0 + C_1 M_2(m).
\]
For the second term, we have for all $\delta > 0$
\[
\begin{aligned}
\int \log(K_\varepsilon * m)(x)\,dm(x)
&= \int \log\big(\xi_\varepsilon * (\xi_\varepsilon * m)\big)(x)\,dm(x)
= \int \log\Big(\int \xi_\varepsilon(y)\,(\xi_\varepsilon * m)(x-y)\,dy\Big)\,dm(x) \\
&\ge \int \Big(\int \xi_\varepsilon(y) \log\big(\xi_\varepsilon * m(x-y)\big)\,dy\Big)\,dm(x)
= \int \xi_\varepsilon * \log(\xi_\varepsilon * m)(x)\,dm(x) \\
&= \int \log(\xi_\varepsilon * m)(x)\,(\xi_\varepsilon * m)(x)\,dx
\ge -\Big(\frac{2\pi}{\delta}\Big)^{d/2} - \delta M_2(\xi_\varepsilon * m),
\end{aligned}
\]
where the first inequality is Jensen's inequality and the last inequality is Lemma C.4.
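The entropy lower bound of Lemma C.4, invoked in the last inequality above, can be checked numerically in dimension one; an illustrative sketch with the standard Gaussian density (so that $M_2(\rho) = 1$ and the exact value of the integral is $-\tfrac{1}{2}\log(2\pi e)$), the integrals approximated by Riemann sums:

```python
import math

def rho(x):
    # standard Gaussian density on R, so d = 1 and M2(rho) = 1
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# Riemann-sum approximations of int rho*log(rho) dx and M2(rho) over [-10, 10]
h = 1e-3
xs = [-10.0 + i * h for i in range(int(20.0 / h) + 1)]
entropy = sum(rho(x) * math.log(rho(x)) for x in xs) * h   # = -0.5*log(2*pi*e) for a Gaussian
m2 = sum(x * x * rho(x) for x in xs) * h

# Lemma C.4 in d = 1: int rho*log(rho) dx >= -(2*pi/delta)^(1/2) - delta*M2(rho)
for delta in (0.5, 1.0, 2.0):
    assert entropy >= -math.sqrt(2.0 * math.pi / delta) - delta * m2
```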
To control the second moment of the convolution, observe that
\[
M_2(\xi_\varepsilon * m) = \int |x|^2\,(\xi_\varepsilon * m)(x)\,dx = \int |x|^2 \int \xi_\varepsilon(x-y)\,dm(y)\,dx = \int\!\!\int |y+z|^2\,\xi_\varepsilon(z)\,dz\,dm(y),
\]
where in the last step we applied the change of variables $z = x - y$. Then, using the inequality $|x+y|^2 \le 2|x|^2 + 2|y|^2$, we have
\[
M_2(\xi_\varepsilon * m) \le 2\int\!\!\int |z|^2\,\xi_\varepsilon(z)\,dz\,dm(y) + 2\int\!\!\int |y|^2\,\xi_\varepsilon(z)\,dz\,dm(y) = 2\int |z|^2\,\xi_\varepsilon(z)\,dz + 2\int |y|^2\,dm(y) \le 2\varepsilon^2 M_2(\xi) + 2M_2(m),
\]
where we used Remark D.5 in the last step. Combining all the above estimates yields (39); in particular, $V^\sigma_\varepsilon(m)$ is bounded from below by a constant depending on $\sigma$, $F_{\min}$, $\delta$, $M_2(\xi)$, and $M_2(m)$.

Proof of Theorem C.1 (Step 1: Existence of minimizers). For the non-compact domain $\mathbb{R}^d$, Proposition C.5 plays a twofold role. First, it ensures that the energy $V^\sigma_\varepsilon$ is bounded from below for all $m \in \mathcal{P}_2(\mathbb{R}^d)$, so that minimizing sequences are well defined. Second, choosing $\delta = C_1/4$ makes the coefficient of $M_2(m)$ positive, yielding the estimate
\[
\frac{1}{\sigma} V^\sigma_\varepsilon(m) \ge \frac{1}{\sigma} F_{\min} + C_0 + \frac{C_1}{2} M_2(m) - \Big(\frac{8\pi}{C_1}\Big)^{d/2} - \frac{C_1}{2}\,\varepsilon^2 M_2(\xi).
\]
As a consequence, any minimizing sequence $(m_n)_n$ has uniformly bounded second moments. By Prokhorov's theorem and the lower semi-continuity of $V^\sigma_\varepsilon$, there exists a weakly-* convergent subsequence whose limit is a minimizer.

C.3 Convergence of minimizers (for both cases)

Lemma C.6. Let $\mathcal{X} \subset \mathbb{R}^d$ be compact and $\xi$ satisfy Assumption 3, or let $\mathcal{X} = \mathbb{R}^d$ and $\xi$ be a Gaussian kernel. Let $(\mu_\varepsilon)_{\varepsilon>0}$ be a sequence in $\mathcal{P}(\mathcal{X})$ such that $\mu_\varepsilon \stackrel{*}{\rightharpoonup} \mu$ as $\varepsilon \to 0$ for some $\mu \in \mathcal{P}(\mathcal{X})$. Then
\[
\xi_\varepsilon * \mu_\varepsilon \stackrel{*}{\rightharpoonup} \mu,
\]
where the convergence $\stackrel{*}{\rightharpoonup}$ is understood in the bounded-Lipschitz sense, i.e. tested against all bounded Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$.

Proof. Let $f$ be a bounded Lipschitz function.
We have
\[
\Big| \int f\,d(\xi_\varepsilon * \mu_\varepsilon) - \int f\,d\mu \Big| \le \Big| \int f\,d(\xi_\varepsilon * \mu_\varepsilon) - \int f\,d\mu_\varepsilon \Big| + \Big| \int f\,d\mu_\varepsilon - \int f\,d\mu \Big|.
\]
The second term converges to 0, and for the first term we have
\[
\begin{aligned}
\Big| \int f\,d(\xi_\varepsilon * \mu_\varepsilon) - \int f\,d\mu_\varepsilon \Big|
&= \Big| \int\!\!\int \big(f(x) - f(y)\big)\,\xi_\varepsilon(x-y)\,dy\,d\mu_\varepsilon(x) \Big| \\
&\le \|\nabla f\|_{L^\infty} \int\!\!\int |x-y|\,\xi_\varepsilon(x-y)\,dy\,d\mu_\varepsilon(x)
= \|\nabla f\|_{L^\infty} \int\!\!\int |z|\,\xi_\varepsilon(z)\,dz\,d\mu_\varepsilon(x)
= \|\nabla f\|_{L^\infty}\,M_1(\xi_\varepsilon).
\end{aligned}
\]
In the last equality, the term $M_1(\xi_\varepsilon)$ can be bounded by $(\varepsilon/C_{\varepsilon,d,L})\,M_1(\xi)$ or by $\varepsilon M_1(\xi)$, depending on the space $\mathcal{X}$ that we work with. Thus the first term also converges to 0.

Theorem C.7. Suppose $F$ is lower semi-continuous. The energy $V^\sigma_\varepsilon$ $\Gamma$-converges to $V^\sigma$ in the following sense: for $(\mu_\varepsilon)_\varepsilon \subset \mathcal{P}(\mathcal{X})$ and $\mu \in \mathcal{P}(\mathcal{X})$,

• if $\mu_\varepsilon \stackrel{*}{\rightharpoonup} \mu$, we have $\liminf_{\varepsilon\to 0} V^\sigma_\varepsilon(\mu_\varepsilon) \ge V^\sigma(\mu)$;

• $\limsup_{\varepsilon\to 0} V^\sigma_\varepsilon(\mu) \le V^\sigma(\mu)$.

Proof. Since for all $\mu$
\[
\int \log(K_\varepsilon * \mu)\,d\mu \ge \int \xi_\varepsilon * \log(\xi_\varepsilon * \mu)\,d\mu = \int \log(\xi_\varepsilon * \mu)\,d(\xi_\varepsilon * \mu),
\]
we have
\[
\liminf_{\varepsilon\to 0} V^\sigma_\varepsilon(\mu_\varepsilon) \ge \liminf_{\varepsilon\to 0}\big( F(\mu_\varepsilon) + \sigma\,\mathrm{KL}(\xi_\varepsilon * \mu_\varepsilon \,|\, \pi)\big) \ge V^\sigma(\mu),
\]
where the last inequality uses the lower semi-continuity of $F$ and of $\mathrm{KL}$, as well as the weak convergence of $\xi_\varepsilon * \mu_\varepsilon$ obtained in Lemma C.6. Hence the first item is proven. The second item holds because
\[
V^\sigma(\mu) - V^\sigma_\varepsilon(\mu) = \sigma\,\mathrm{KL}(\mu \,|\, K_\varepsilon * \mu) \ge 0.
\]

Proof of Theorems 3.3 and C.1 (Step 2: Convergence). For any $\varepsilon > 0$, since $\mu_\varepsilon$ is a minimizer of $V^\sigma_\varepsilon$, we have $V^\sigma_\varepsilon(\mu_\varepsilon) \le V^\sigma_\varepsilon(\nu)$ for all $\nu$. Consequently,
\[
\liminf_{\varepsilon\to 0} V^\sigma_\varepsilon(\mu_\varepsilon) \le \limsup_{\varepsilon\to 0} V^\sigma_\varepsilon(\nu) \le V^\sigma(\nu),
\]
where the last inequality follows from Theorem C.7. Since there exists $\nu$ such that $V^\sigma(\nu) < \infty$, the sequence $(V^\sigma_\varepsilon(\mu_\varepsilon))_{\varepsilon>0}$ is uniformly bounded from above.
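The quantitative mechanism behind Lemma C.6 above (the mollification error is of order $\|\nabla f\|_{L^\infty} M_1(\xi_\varepsilon)$) can be illustrated numerically. A minimal sketch in dimension one, taking $\mu = \delta_1$, $f = \cos$, and, as a simplifying assumption, an untruncated Gaussian mollifier $\xi_\varepsilon$, for which $M_1(\xi_\varepsilon) = \varepsilon\sqrt{2/\pi}$:

```python
import math

eps = 0.3   # mollification parameter

def xi_eps(z):
    # untruncated Gaussian mollifier (simplifying assumption): N(0, eps^2) density
    return math.exp(-z * z / (2.0 * eps * eps)) / (eps * math.sqrt(2.0 * math.pi))

# mu = delta_1 and f = cos (Lipschitz constant 1); xi_eps * mu is the N(1, eps^2) density
a, h = 1.0, 1e-3
xs = [a - 6.0 * eps + i * h for i in range(int(12.0 * eps / h) + 1)]
smoothed = sum(math.cos(x) * xi_eps(x - a) for x in xs) * h   # int f d(xi_eps * mu)
error = abs(smoothed - math.cos(a))                           # |int f d(xi_eps*mu) - int f dmu|

m1 = eps * math.sqrt(2.0 / math.pi)   # M1(xi_eps) for the Gaussian mollifier
assert error <= 1.0 * m1              # Lemma C.6 bound with ||grad f||_inf = 1
```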
Moreover, by the quadratic growth estimate established above (Proposition C.5 with $\delta = C_1/4$), there exist constants $C_0 \in \mathbb{R}$ and $C_1 > 0$, independent of $\varepsilon$, such that
\[
V^\sigma_\varepsilon(\mu_\varepsilon) \ge F_{\min} + \sigma\Big( C_0 + \frac{C_1}{2} M_2(\mu_\varepsilon) - \Big(\frac{8\pi}{C_1}\Big)^{d/2} - \frac{C_1}{2}\,\varepsilon^2 M_2(\xi) \Big).
\]
Since the left-hand side is uniformly bounded from above and the last term is uniformly bounded as $\varepsilon \to 0$, it follows that $(M_2(\mu_\varepsilon))_{\varepsilon>0}$ is uniformly bounded. If $\mathcal{X}$ is compact, tightness of $(\mu_\varepsilon)_{\varepsilon>0}$ is immediate; if we work on $\mathbb{R}^d$, the uniform second-moment bound implies tightness. Therefore, up to a subsequence, $\mu_\varepsilon \rightharpoonup \mu$ weakly for some $\mu \in \mathcal{P}_2(\mathcal{X})$. By the $\Gamma$-convergence, we conclude that for all $\nu$
\[
V^\sigma(\mu) \le \liminf_{\varepsilon\to 0} V^\sigma_\varepsilon(\mu_\varepsilon) \le V^\sigma(\nu),
\]
which shows that $\mu$ is a minimizer of $V^\sigma$.

D SUPPLEMENTARY DEFINITIONS AND TECHNICAL RESULTS

D.1 Flat Derivative

Definition D.1. Fix $q \ge 0$ and let $\mathcal{P}_q(\mathcal{X})$ be the space of probability measures on $\mathcal{X}$ with finite $q$-th moments. A functional $F : \mathcal{P}_q(\mathcal{X}) \to \mathbb{R}$ is said to admit a first-order linear derivative (or a flat derivative) if there exists a functional $\frac{\delta F}{\delta m} : \mathcal{P}_q(\mathcal{X}) \times \mathcal{X} \to \mathbb{R}$ such that

1. for all $a \in \mathcal{X}$, the map $\mathcal{P}_q(\mathcal{X}) \ni m \mapsto \frac{\delta F}{\delta m}(m, a)$ is continuous;

2. for any $\nu \in \mathcal{P}_q(\mathcal{X})$, there exists $C > 0$ such that for all $a \in \mathcal{X}$ we have
\[
\Big| \frac{\delta F}{\delta m}(\nu, a) \Big| \le C\big(1 + |a|^q\big);
\]

3. for all $m, m' \in \mathcal{P}_q(\mathcal{X})$,
\[
F(m') - F(m) = \int_0^1 \int_{\mathcal{X}} \frac{\delta F}{\delta m}\big(m + \lambda(m' - m), a\big)\,(m' - m)(da)\,d\lambda. \quad (40)
\]

The functional $\frac{\delta F}{\delta m}$ is then called the linear (functional) derivative of $F$ on $\mathcal{P}_q(\mathcal{X})$.

D.2 Technical Estimates

Proposition D.2 (Uniform Bounds and Wasserstein Stability of the Projection). Suppose Assumption 6 holds, and let $T > 0$ be a finite time horizon. Then the following statements hold for all $t \in [0,T]$:

(i) Let $w_t$ be a solution to the particle dynamics (13). Then $w_t$ remains uniformly bounded:
\[
e^{-M_a T} \le w_t \le e^{M_a T}.
\]
(41)

(ii) Let $M > 0$ and let $\nu, \nu' \in \mathcal{P}_2(\mathcal{C}([0,T]; \mathcal{X} \times [0,M]))$ be two probability measures on the path space. Then their projected measures satisfy the following Wasserstein stability estimate:
\[
W_2^2(h\nu_t, h\nu'_t) \le M^2\,W_2^2(\nu_t, \nu'_t), \qquad \text{for all } t \in [0,T].
\]
In particular, if $\nu$ and $\nu'$ are the laws of solutions to (13), then due to (41) we obtain
\[
W_2^2(h\nu_t, h\nu'_t) \le e^{2M_a T}\,W_2^2(\nu_t, \nu'_t), \qquad \text{for all } t \in [0,T]. \quad (42)
\]

Proof. The first item follows directly from the boundedness assumption: we have $w_t = e^{-\int_0^t a(h\nu_s, X)\,ds}$, hence
\[
e^{-M_a T} = e^{\int_0^T (-M_a)\,dt} \le w_t \le e^{\int_0^T M_a\,dt} = e^{M_a T}.
\]
For the second item, let $\pi_t$ be any coupling of $\nu_t$ and $\nu'_t$, and let $\widetilde{h}\pi_t$ denote the associated coupling of $h\nu_t$ and $h\nu'_t$. It is easy to check that $\widetilde{h}\pi_t$ is the projection of $\pi_t$ (in contrast to (8), this projection is for the product space), in the sense that for any test function $f$ we have
\[
\int_{\mathcal{X}\times\mathcal{X}} f(x_t, x'_t)\,\widetilde{h}\pi_t(dx_t, dx'_t) = \int_{(\mathcal{X}\times[0,M])\times(\mathcal{X}\times[0,M])} w_t w'_t\,f(x_t, x'_t)\,\pi_t(dx_t, dw_t, dx'_t, dw'_t).
\]
Hence we have
\[
\begin{aligned}
\int_{\mathcal{X}\times\mathcal{X}} |x_t - x'_t|^2\,\widetilde{h}\pi_t(dx_t, dx'_t)
&= \int_{(\mathcal{X}\times[0,M])\times(\mathcal{X}\times[0,M])} w_t w'_t\,|x_t - x'_t|^2\,\pi_t(dx_t, dw_t, dx'_t, dw'_t) \\
&\le M^2 \int_{(\mathcal{X}\times[0,M])\times(\mathcal{X}\times[0,M])} |x_t - x'_t|^2\,\pi_t(dx_t, dw_t, dx'_t, dw'_t) \\
&\le M^2 \int_{(\mathcal{X}\times[0,M])\times(\mathcal{X}\times[0,M])} \big(|x_t - x'_t|^2 + |w_t - w'_t|^2\big)\,\pi_t(dx_t, dw_t, dx'_t, dw'_t),
\end{aligned}
\]
which yields the result upon taking the infimum over couplings $\pi_t$ on both sides.

Proposition D.3. Let $\mu, \mu' \in \mathcal{P}_2(\mathcal{C}([0,T]; \mathcal{X}))$. Then, for any $t \in [0,T]$,
\[
W_2(\mu_t, \mu'_t) \le W_{2,t}(\mu, \mu').
\]

Proof. Let $\Pi$ be a coupling of $\mu$ and $\mu'$, and for a fixed $t \in [0,T]$ consider the projection map $\Gamma_t : \mathcal{C}([0,T]; \mathcal{X}) \to \mathcal{X}$ given by $\Gamma_t(\gamma) := \gamma_t$. Let $\pi_t$ be the push-forward of $\Pi$ under $\Gamma_t$.
Then
\[
\int_{\mathcal{X}\times\mathcal{X}} |x - y|^2\,\pi_t(dx, dy) = \int_{\mathcal{C}([0,T];\mathcal{X})\times\mathcal{C}([0,T];\mathcal{X})} |\gamma_t - \gamma'_t|^2\,\Pi(d\gamma, d\gamma').
\]
Note that $\pi_t$ constructed this way is a coupling of $\mu_t = \Gamma_t(\mu)$ and $\mu'_t = \Gamma_t(\mu')$ and hence, using $|\gamma_t - \gamma'_t| \le \sup_{s\in[0,t]} |\gamma_s - \gamma'_s|$ and then taking the infimum over $\Pi$ on the right-hand side, we get
\[
W_2^2(\mu_t, \mu'_t) \le \int_{\mathcal{X}\times\mathcal{X}} |x - y|^2\,\pi_t(dx, dy) \le W_{2,t}^2(\mu, \mu').
\]

D.3 Construction of the Kernel on a Compact Space

Our example is based on the standard Gaussian kernel, truncated and renormalized on a compact space. Let $\xi$ denote the standard Gaussian density on $\mathbb{R}^d$:
\[
\xi(z) := \frac{1}{(2\pi)^{d/2}} \exp\Big(-\frac{|z|^2}{2}\Big).
\]
We then define the mapping $\xi_\varepsilon : \mathcal{X} \to \mathbb{R}_+$,
\[
\xi_\varepsilon(x) := \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \xi\Big(\frac{x}{\varepsilon}\Big),
\]
where $C_{\varepsilon,d,L}$ is the normalization constant ensuring $\int_{\mathcal{X}} \xi_\varepsilon(x)\,dx = 1$. We define
\[
C_{\varepsilon,d,L} := \int_{\mathcal{X}} \frac{1}{\varepsilon^d}\,\xi\Big(\frac{x}{\varepsilon}\Big)\,dx
\quad\text{and we have}\quad
C_{\varepsilon,d,L} = \Big(2\Phi\Big(\frac{L}{\varepsilon}\Big) - 1\Big)^d,
\]
where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. As $\varepsilon \to 0$, we observe that $C_{\varepsilon,d,L} \nearrow 1$. Thus, if we restrict $\varepsilon$ to a bounded interval $(0, \varepsilon_{\max}]$, then
\[
C_{\varepsilon,d,L} \in \bigg[\Big(2\Phi\Big(\frac{L}{\varepsilon_{\max}}\Big) - 1\Big)^d,\; 1\bigg). \quad (43)
\]
We now examine the properties of $K_\varepsilon$.

1. Normalization. By construction, $\xi_\varepsilon$ is a probability density function on $\mathcal{X}$. Consequently, its convolution $K_\varepsilon = \xi_\varepsilon * \xi_\varepsilon$ is also a probability density function on $\mathcal{X}$.

2. Upper and lower bounds. Since $\xi$ is smooth and strictly positive, so is $\xi_\varepsilon$. Moreover, since
\[
\xi_\varepsilon(x) = \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \frac{1}{(2\pi)^{d/2}} \exp\Big(-\frac{|x|^2}{2\varepsilon^2}\Big),
\]
we obtain the following pointwise bounds for $x \in \mathcal{X}$:
\[
\sup_{x\in\mathcal{X}} \xi_\varepsilon(x) \le \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \frac{1}{(2\pi)^{d/2}},
\qquad
\inf_{x\in\mathcal{X}} \xi_\varepsilon(x) \ge \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \frac{1}{(2\pi)^{d/2}} \cdot \exp\Big(-\frac{dL^2}{2\varepsilon^2}\Big).
\]
Since $K_\varepsilon = \xi_\varepsilon * \xi_\varepsilon$, bounds of the same type hold for $K_\varepsilon$. In particular, we may introduce the constants
\[
K_{\max,\varepsilon} := \sup_{x\in\mathcal{X}} K_\varepsilon(x), \qquad K_{\min,\varepsilon} := \inf_{x\in\mathcal{X}} K_\varepsilon(x),
\]
which are strictly positive and finite.
Note that $K_{\max,\varepsilon}$ and $K_{\min,\varepsilon}$ depend on $(\varepsilon, d, L)$, but for simplicity of notation we suppress part of this dependence.

3. Lipschitz continuity. Since $\xi$ is smooth, we have
\[
|\nabla \xi_\varepsilon(x)| = \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \frac{1}{(2\pi)^{d/2}} \Big| \nabla \exp\Big(-\frac{|x|^2}{2\varepsilon^2}\Big) \Big|
= \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \frac{1}{(2\pi)^{d/2}} \cdot \frac{|x|}{\varepsilon^2} \exp\Big(-\frac{|x|^2}{2\varepsilon^2}\Big)
\le \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^{d+1}} \cdot \frac{1}{(2\pi)^{d/2}}\,e^{-1/2},
\]
where the last inequality follows from the fact that $r e^{-r^2/2} \le e^{-1/2}$ for all $r \ge 0$. Therefore, $\xi_\varepsilon$ is globally Lipschitz with a constant of order $O(\varepsilon^{-(d+1)})$. Moreover, since $\nabla K_\varepsilon = (\nabla \xi_\varepsilon) * \xi_\varepsilon$, we obtain
\[
\|\nabla K_\varepsilon\|_\infty \le \|\nabla \xi_\varepsilon\|_\infty\,\|\xi_\varepsilon\|_{L^1} = \|\nabla \xi_\varepsilon\|_\infty,
\]
which shows that $K_\varepsilon$ inherits the same Lipschitz constant. Hence $K_\varepsilon$ is Lipschitz continuous on $\mathcal{X}$ with a constant of order $O(\varepsilon^{-(d+1)})$.

Remark D.4. The kernel $K_\varepsilon$ defined above is a smooth function on $\mathcal{X}$, bounded from below and above, and globally Lipschitz. More precisely, there exist positive constants $0 < K_{\min,\varepsilon} \le K_{\max,\varepsilon} < \infty$ and $L_{K_\varepsilon} < \infty$ such that
\[
K_{\min,\varepsilon} \le K_\varepsilon(x) \le K_{\max,\varepsilon} \quad (44)
\]
and
\[
|K_\varepsilon(x) - K_\varepsilon(y)| \le L_{K_\varepsilon} |x - y|, \qquad \forall x, y \in \mathcal{X}.
\]
The values of $K_{\min,\varepsilon}$, $K_{\max,\varepsilon}$, and $L_{K_\varepsilon}$ depend on the dimension $d$, the diameter of $\mathcal{X}$, the parameter $\varepsilon$, and the choice of the base function $\xi$.

Remark D.5. Let $p \ge 0$ and let $\xi : \mathbb{R}^d \to [0,\infty)$ be an integrable kernel with $M_p(\xi) := \int_{\mathbb{R}^d} |u|^p \xi(u)\,du < \infty$. Let $\xi_\varepsilon$ be as in Assumption 3. Then the $p$-th moment of $\xi_\varepsilon$ satisfies
\[
M_p(\xi_\varepsilon) := \int_{\mathcal{X}} |x|^p\,\xi_\varepsilon(x)\,dx = \frac{\varepsilon^p}{C_{\varepsilon,d,L}} \int_{[-L/\varepsilon, L/\varepsilon]^d} |u|^p\,\xi(u)\,du \le \frac{\varepsilon^p}{C_{\varepsilon,d,L}}\,M_p(\xi).
\]

E VERIFICATION OF THE BOUNDEDNESS AND LIPSCHITZ CONDITIONS

The main results above, namely existence and uniqueness (Theorem 2.9) and propagation of chaos (Theorem 2.10), require that the drift $a(m, x)$ be uniformly bounded and Lipschitz continuous in both the spatial variable $x$ and the measure variable $m$. We prove that these properties hold for a broad class of kernel-based drifts $a$.
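The explicit constants from the construction in Appendix D.3 above can be checked numerically in dimension $d = 1$; an illustrative sketch comparing the closed form $C_{\varepsilon,d,L} = (2\Phi(L/\varepsilon)-1)^d$ with a direct Riemann sum, verifying the monotone convergence $C_{\varepsilon,d,L} \nearrow 1$, and checking the elementary bound $r e^{-r^2/2} \le e^{-1/2}$ behind the Lipschitz constant:

```python
import math

L, eps, d = 1.0, 0.5, 1   # X = [-L, L]^d with d = 1 (placeholder values)

def Phi(t):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# closed form of the normalization constant C_{eps,d,L}
C_closed = (2.0 * Phi(L / eps) - 1.0) ** d

# direct Riemann sum of int_X eps^{-d} xi(x/eps) dx
h = 1e-4
xs = [-L + i * h for i in range(int(2.0 * L / h) + 1)]
C_num = sum(math.exp(-(x / eps) ** 2 / 2.0) / (eps * math.sqrt(2.0 * math.pi)) for x in xs) * h
assert abs(C_closed - C_num) < 1e-3

# C_{eps,d,L} increases towards 1 as eps decreases
vals = [(2.0 * Phi(L / e) - 1.0) ** d for e in (0.5, 0.3, 0.2)]
assert vals[0] < vals[1] < vals[2] < 1.0

# elementary bound behind the Lipschitz estimate: r*exp(-r^2/2) <= exp(-1/2)
grid = [0.001 * k for k in range(10_000)]
assert max(r * math.exp(-r * r / 2.0) for r in grid) <= math.exp(-0.5) + 1e-12
```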
The resulting bounds are explicit and depend only on the smoothing kernel, the reference density, and the diameter of $\mathcal{X}$.

Proposition E.1. Let $a(m, x)$ be given by any of the kernelized forms (25)–(28). Suppose Assumptions 2, 3, and 4 hold. Then $a(m, x)$ satisfies the boundedness and Lipschitz conditions (6) and (7).

The Lipschitz continuity and boundedness conditions can also be extended to a Fisher–Rao gradient flow for an energy functional with the kernelized chi-square divergence as a regularizer.

Proposition E.2. Consider the energy functional
\[
F(m) + \sigma \int \Big(\frac{K_\varepsilon * m}{\pi} - 1\Big)^2 \pi(x)\,dx.
\]
Its corresponding Fisher–Rao gradient flow takes the form (4), where
\[
a(m, x) = \frac{\delta F}{\delta\mu}(m, x) + 2\sigma\,K_\varepsilon * \Big(\frac{K_\varepsilon * m}{\pi}\Big)(x) - 2\sigma \int K_\varepsilon * \Big(\frac{K_\varepsilon * m}{\pi}\Big)(y)\,m(dy).
\]
Suppose that Assumptions 2, 3, and 4 hold. Then $a(m, x)$ satisfies both the boundedness condition (6) and the Lipschitz continuity condition (7).

Proposition E.3. For the boundedness condition, we have the following estimates (uniform in $\mu \in \mathcal{P}(\mathcal{X})$ and $x \in \mathcal{X}$):

(1) $\big| \frac{\delta F}{\delta\mu}(\mu, x) \big| \le C_F$.

(2) $\big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} \big| \le \max\big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \big)$.

(3) $\mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) \le \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}$.

(4) $\big| \log \frac{K_\varepsilon * \mu(x)}{K_\varepsilon * \pi(x)} \big| \le \max\big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \big)$.

(5) $\mathrm{KL}(K_\varepsilon * \mu \,|\, K_\varepsilon * \pi) \le \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}$.

(6) $K_\varepsilon * \big( \frac{\mu}{K_\varepsilon * \mu} \big)(x) \le \frac{K_{\max,\varepsilon}}{K_{\min,\varepsilon}}$.

(7) $\frac{1}{(K_\varepsilon * \mu)(x)} \le \frac{1}{K_{\min,\varepsilon}}$.

Proof. E.3(1) follows from Assumption 2. For E.3(2), E.3(3), E.3(4), and E.3(5), we use
\[
\frac{K_{\min,\varepsilon}}{\pi_{\max}} \le \frac{K_\varepsilon * \mu(x)}{\pi(x)} \le \frac{K_{\max,\varepsilon}}{\pi_{\min}}, \qquad \forall x \in \mathcal{X},
\]
which follows from Assumptions 3 and 4. For E.3(6), we compute
\[
K_\varepsilon * \Big(\frac{\mu}{K_\varepsilon * \mu}\Big)(x) = \int K_\varepsilon(x - y)\,\Big(\frac{\mu}{K_\varepsilon * \mu}\Big)(y)\,dy = \int \frac{K_\varepsilon(x - y)}{K_\varepsilon * \mu(y)}\,\mu(y)\,dy \le \frac{K_{\max,\varepsilon}}{K_{\min,\varepsilon}}.
\]
Finally, E.3(7) is immediate from $(K_\varepsilon * \mu)(x) \ge K_{\min,\varepsilon}$.

Proposition E.4.
For all $\mu, \nu \in \mathcal{P}_2(\mathcal{X})$ and $x, y \in \mathcal{X}$, the following quantitative bounds hold:

(1) $\big| \frac{\delta F}{\delta\mu}(\mu, x) - \frac{\delta F}{\delta\mu}(\nu, y) \big| \le L_F\big( |x - y| + W_2(\mu, \nu) \big)$.

(2) $| \log K_\varepsilon * \mu(x) - \log K_\varepsilon * \nu(x) | \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu)$.

(3) $| \log K_\varepsilon * \nu(x) - \log K_\varepsilon * \nu(y) | \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,|x - y|$.

(4) $| \log K_\varepsilon * \pi(x) - \log K_\varepsilon * \pi(y) | \le \frac{L_\pi}{\pi_{\min}}\,|x - y|$.

(5) $\big| \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \big| \le \big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \big) |x - y|$.

(6) $\big| K_\varepsilon * \big( \log \frac{K_\varepsilon * \nu}{\pi} \big)(x) - K_\varepsilon * \big( \log \frac{K_\varepsilon * \nu}{\pi} \big)(y) \big| \le \big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \big) |x - y|$.

(7) $\big| K_\varepsilon * \big( \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \big)(x) - K_\varepsilon * \big( \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \big)(y) \big| \le \big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \big) |x - y|$.

(8) $| \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, \pi) | \le \big( \frac{2L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \big)\,W_1(\mu, \nu)$.

(9) $\big| \frac{1}{(K_\varepsilon * \mu)(x)} - \frac{1}{(K_\varepsilon * \mu)(y)} \big| \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}^2}\,|x - y|$.

Proof. For (E.4(1)), we use Assumption 2. For (E.4(2)),
\[
| \log(K_\varepsilon * \mu)(x) - \log(K_\varepsilon * \nu)(x) | \le \frac{1}{K_{\min,\varepsilon}} | (K_\varepsilon * \mu)(x) - (K_\varepsilon * \nu)(x) | = \frac{1}{K_{\min,\varepsilon}} \Big| \int K_\varepsilon(x - z)\,(\mu - \nu)(dz) \Big| \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu),
\]
where the last step uses the Kantorovich–Rubinstein duality together with the fact that $z \mapsto K_\varepsilon(x - z)$ is $L_{K_\varepsilon}$-Lipschitz. For (E.4(3)), use that $\log$ is $1/K_{\min,\varepsilon}$-Lipschitz on $[K_{\min,\varepsilon}, \infty)$ and that $x \mapsto (K_\varepsilon * \nu)(x)$ is $L_{K_\varepsilon}$-Lipschitz. Hence
\[
| \log K_\varepsilon * \nu(x) - \log K_\varepsilon * \nu(y) | \le \frac{1}{K_{\min,\varepsilon}} | K_\varepsilon * \nu(x) - K_\varepsilon * \nu(y) | = \frac{1}{K_{\min,\varepsilon}} \Big| \int \big( K_\varepsilon(z - x) - K_\varepsilon(z - y) \big)\,\nu(dz) \Big| \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} |x - y|.
\]
For (E.4(4)), again we use that $\log$ is $1/\pi_{\min}$-Lipschitz on $[\pi_{\min}, \infty)$ and that $\pi$ is $L_\pi$-Lipschitz. Hence
\[
| \log K_\varepsilon * \pi(x) - \log K_\varepsilon * \pi(y) | \le \frac{1}{\pi_{\min}} | K_\varepsilon * \pi(x) - K_\varepsilon * \pi(y) | \le \frac{1}{\pi_{\min}} \int K_\varepsilon(z)\,| \pi(x - z) - \pi(y - z) |\,dz \le \frac{L_\pi}{\pi_{\min}} |x - y|.
\]
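Bounds such as (E.4(2)) can be sanity-checked numerically for point masses, for which $W_1(\delta_a, \delta_b) = |a - b|$ and $(K_\varepsilon * \delta_a)(x) = K_\varepsilon(x - a)$. An illustrative sketch in dimension one, using (as a simplifying assumption) an untruncated Gaussian in place of $K_\varepsilon$ and estimating $K_{\min,\varepsilon}$ and $L_{K_\varepsilon}$ on a grid:

```python
import math

eps = 0.5

def K(z):
    # untruncated Gaussian standing in for K_eps (simplifying assumption)
    return math.exp(-z * z / (2.0 * eps * eps)) / (eps * math.sqrt(2.0 * math.pi))

# grid estimates of K_min and of the Lipschitz constant L_K on the relevant range
step = 1e-3
zs = [-2.0 + i * step for i in range(int(4.0 / step) + 1)]
K_min = min(K(z) for z in zs)
L_K = max(abs(K(zs[i + 1]) - K(zs[i])) / step for i in range(len(zs) - 1))

# E.4(2) with mu = delta_a, nu = delta_b, so W1(mu, nu) = |a - b| and
# (K_eps * delta_a)(x) = K(x - a); check the bound on a grid of x in [-1, 1]
a, b = -0.4, 0.7
for j in range(201):
    x = -1.0 + j * 0.01
    lhs = abs(math.log(K(x - a)) - math.log(K(x - b)))
    assert lhs <= (L_K / K_min) * abs(a - b)
```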
For (E.4(5)), again we use the Lipschitz property of $\log$ and obtain
\[
\begin{aligned}
\Big| \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \Big|
&\le | \log K_\varepsilon * \nu(x) - \log K_\varepsilon * \nu(y) | + | \log \pi(x) - \log \pi(y) | \\
&\le \frac{1}{K_{\min,\varepsilon}} | K_\varepsilon * \nu(x) - K_\varepsilon * \nu(y) | + \frac{1}{\pi_{\min}} | \pi(x) - \pi(y) | \\
&= \frac{1}{K_{\min,\varepsilon}} \Big| \int \big( K_\varepsilon(z - x) - K_\varepsilon(z - y) \big)\,\nu(dz) \Big| + \frac{1}{\pi_{\min}} | \pi(x) - \pi(y) |
\le \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big) |x - y|. \quad (45)
\end{aligned}
\]
For (E.4(6)), we use (E.4(5)) and the fact that $\mathrm{Lip}(K_\varepsilon * f) \le \mathrm{Lip}(f)$. For (E.4(7)), we use (E.4(3)), (E.4(4)), and $\mathrm{Lip}(K_\varepsilon * f) \le \mathrm{Lip}(f)$. For (E.4(8)), we have
\[
\begin{aligned}
| \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, \pi) |
&\le \Big| \int \Big( \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \Big)\,(K_\varepsilon * \mu)(x)\,dx \Big|
+ \Big| \int \big( (K_\varepsilon * \mu)(x) - (K_\varepsilon * \nu)(x) \big) \log \frac{K_\varepsilon * \nu(x)}{\pi(x)}\,dx \Big| \\
&\le \int | \log K_\varepsilon * \mu(x) - \log K_\varepsilon * \nu(x) |\,(K_\varepsilon * \mu)(x)\,dx
+ \Big| \int \big( \mu(x) - \nu(x) \big)\,K_\varepsilon * \Big( \log \frac{K_\varepsilon * \nu}{\pi} \Big)(x)\,dx \Big| \\
&\le \Big( \frac{2L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big)\,W_1(\mu, \nu),
\end{aligned}
\]
where in the last line we used (E.4(2)) and (E.4(6)). For (E.4(9)),
\[
\Big| \frac{1}{(K_\varepsilon * \mu)(x)} - \frac{1}{(K_\varepsilon * \mu)(y)} \Big| = \Big| \frac{(K_\varepsilon * \mu)(x) - (K_\varepsilon * \mu)(y)}{(K_\varepsilon * \mu)(x)\,(K_\varepsilon * \mu)(y)} \Big| \le \frac{1}{K_{\min,\varepsilon}^2} | (K_\varepsilon * \mu)(x) - (K_\varepsilon * \mu)(y) | \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} |x - y|.
\]

E.1 For Flow (25)

Define the coefficient associated with the kernelized flow (25) by
\[
a(\mu, x) := \frac{\delta F}{\delta\mu}(\mu, x) + \sigma \Big( \log \frac{K_\varepsilon * \mu}{\pi}(x) - \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) \Big).
\]
We verify that $a$ satisfies both a boundedness estimate and a Lipschitz estimate, which establishes Proposition E.1 for this flow. In particular, under Assumptions 2, 3, and 4, there exist constants $C^1_\varepsilon > 0$ and $L^1_\varepsilon > 0$ (given explicitly below) such that, for all $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $x, y \in \mathcal{X}$,
\[
|a(\mu, x)| \le C^1_\varepsilon, \qquad |a(\mu, x) - a(\nu, y)| \le L^1_\varepsilon\big( W_2(\mu, \nu) + |x - y| \big).
\]

Proof.
Thanks to (E.3(1)) and (E.3(2)) we have
\[
\Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} \Big| \le \max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big)
\quad\text{and}\quad
\mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) \le \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}. \quad (46)
\]
Therefore, the desired boundedness condition holds with the constant
\[
C^1_\varepsilon := C_F + \sigma \max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big) + \sigma \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}.
\]
We now turn to the proof of the Lipschitz condition. By the triangle inequality,
\[
|a(\mu, x) - a(\nu, y)| \le (i) + \sigma\,(ii) + \sigma\,(iii),
\]
where
\[
(i) := \Big| \frac{\delta F}{\delta\mu}(\mu, x) - \frac{\delta F}{\delta\mu}(\nu, y) \Big|, \qquad
(ii) := \Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \Big|, \qquad
(iii) := | \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, \pi) |.
\]
Thanks to (E.4(1)), we have
\[
(i) \le L_F\big( W_2(\mu, \nu) + |x - y| \big). \quad (47)
\]
For the term (ii), by the triangle inequality we have
\[
(ii) \le \Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \Big| + \Big| \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \Big|.
\]
We bound these two terms separately by (E.4(2)) and (E.4(5)), and we get
\[
(ii) \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu) + \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big) |x - y|. \quad (48)
\]
Finally, for the term (iii), thanks to (E.4(8)) we have
\[
(iii) \le \Big( \frac{2L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big)\,W_1(\mu, \nu). \quad (49)
\]
Combining the bounds (47), (48), and (49), and using the fact that $W_1(\mu, \nu) \le W_2(\mu, \nu)$, we obtain the desired Lipschitz estimate with the constant
\[
L^1_\varepsilon := L_F + \sigma \Big( \frac{3L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{2L_\pi}{\pi_{\min}} \Big).
\]

E.2 For Flow (26)

Define the coefficient associated with the kernelized flow (26) by
\[
a(\mu, x) := \frac{\delta F}{\delta\mu}(\mu, x) + \sigma \Big( \Big( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{K_\varepsilon * \pi} \Big)(x) - \mathrm{KL}(K_\varepsilon * \mu \,|\, K_\varepsilon * \pi) \Big).
\]
We verify that $a$ satisfies both a boundedness estimate and a Lipschitz estimate: under Assumptions 2, 3, and 4, there exist constants $C^2_\varepsilon > 0$ and $L^2_\varepsilon > 0$ such that, for all $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $x, y \in \mathcal{X}$,
\[
|a(\mu, x)| \le C^2_\varepsilon, \qquad |a(\mu, x) - a(\nu, y)| \le L^2_\varepsilon\big( W_2(\mu, \nu) + |x - y| \big).
\]
Proof. Thanks to (E.3(4)) and (E.3(5)) we have
\[
\Big| \log \frac{K_\varepsilon * \mu(x)}{K_\varepsilon * \pi(x)} \Big| \le \max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big)
\quad\text{and}\quad
\mathrm{KL}(K_\varepsilon * \mu \,|\, K_\varepsilon * \pi) \le \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}, \quad (50)
\]
and, since convolution with the probability density $K_\varepsilon$ preserves sup-norm bounds, the desired boundedness condition holds with the constant
\[
C^2_\varepsilon := C_F + \sigma \max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big) + \sigma \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}.
\]
We now turn to the proof of the Lipschitz condition. By the triangle inequality,
\[
|a(\mu, x) - a(\nu, y)| \le (i) + \sigma\,(ii) + \sigma\,(iii),
\]
where
\[
(i) := \Big| \frac{\delta F}{\delta\mu}(\mu, x) - \frac{\delta F}{\delta\mu}(\nu, y) \Big|, \qquad
(ii) := \Big| \Big( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{K_\varepsilon * \pi} \Big)(x) - \Big( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(y) \Big|, \qquad
(iii) := | \mathrm{KL}(K_\varepsilon * \mu \,|\, K_\varepsilon * \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, K_\varepsilon * \pi) |.
\]
We show separately that each part is Lipschitz. Thanks to (E.4(1)), we have
\[
(i) \le L_F\big( W_2(\mu, \nu) + |x - y| \big). \quad (51)
\]
For the term (ii), by the triangle inequality and the bound $\|K_\varepsilon * g\|_\infty \le \|g\|_\infty$, we have
\[
(ii) \le \sup_x \Big| \log \frac{K_\varepsilon * \mu(x)}{K_\varepsilon * \pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{K_\varepsilon * \pi(x)} \Big| + \Big| \Big( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(x) - \Big( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(y) \Big|.
\]
For the first term on the right-hand side, thanks to (E.4(2)) we have, uniformly in $x$,
\[
\Big| \log \frac{K_\varepsilon * \mu(x)}{K_\varepsilon * \pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{K_\varepsilon * \pi(x)} \Big| = | \log K_\varepsilon * \mu(x) - \log K_\varepsilon * \nu(x) | \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu).
\]
For the second term on the right-hand side, (E.4(7)) gives
\[
\Big| \Big( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(x) - \Big( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(y) \Big| \le \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big) |x - y|.
\]
Hence
\[
(ii) \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu) + \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big) |x - y|.
\]
(52)

Regarding the term (iii), we have
\[
\begin{aligned}
(iii) &= | \mathrm{KL}(K_\varepsilon * \mu \,|\, K_\varepsilon * \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, K_\varepsilon * \pi) | \\
&\le \Big| \int \Big( \log \frac{K_\varepsilon * \mu(x)}{K_\varepsilon * \pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{K_\varepsilon * \pi(x)} \Big)\,K_\varepsilon * \mu(x)\,dx \Big|
+ \Big| \int \big( K_\varepsilon * \mu(x) - K_\varepsilon * \nu(x) \big) \log \frac{K_\varepsilon * \nu(x)}{K_\varepsilon * \pi(x)}\,dx \Big| \\
&\le \int | \log K_\varepsilon * \mu(x) - \log K_\varepsilon * \nu(x) |\,K_\varepsilon * \mu(x)\,dx
+ \Big| \int \big( \mu(x) - \nu(x) \big)\,K_\varepsilon * \Big( \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(x)\,dx \Big| \\
&\le \Big( \frac{2L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big)\,W_1(\mu, \nu), \quad (53)
\end{aligned}
\]
where in the last inequality we used (E.4(2)) and (E.4(7)). Combining (51), (52), and (53) with the fact that $W_1(\mu, \nu) \le W_2(\mu, \nu)$, we conclude that we may take $L^2_\varepsilon = L^1_\varepsilon$, which completes the proof.

E.3 For Flow (27)

Define the coefficient associated with the kernelized flow (27) by
\[
a(\mu, x) := \frac{\delta F}{\delta\mu}(\mu, x) + \sigma \bigg( \log \frac{(K_\varepsilon * \mu)(x)}{\pi(x)} + K_\varepsilon * \Big( \frac{\mu}{K_\varepsilon * \mu} \Big)(x) - \int_{\mathcal{X}} \log \frac{(K_\varepsilon * \mu)(z)}{\pi(z)}\,\mu(dz) - 1 \bigg).
\]
Under Assumptions 2, 3, and 4, there exist constants $C^3_\varepsilon > 0$ and $L^3_\varepsilon > 0$ such that, for all $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $x, y \in \mathcal{X}$,
\[
|a(\mu, x)| \le C^3_\varepsilon, \qquad |a(\mu, x) - a(\nu, y)| \le L^3_\varepsilon\big( W_2(\mu, \nu) + |x - y| \big).
\]

Proof. For the boundedness condition, we have already established uniform bounds for $\frac{\delta F}{\delta\mu}(\mu, x)$ and $\log \frac{K_\varepsilon * \mu(x)}{\pi(x)}$ in (E.3(1)) and (E.3(2)). In particular, we recall that
\[
\Big| \frac{\delta F}{\delta\mu}(\mu, x) \Big| + \Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} \Big| + \Big| \int \log \frac{K_\varepsilon * \mu(x)}{\pi(x)}\,\mu(dx) \Big| \le C_F + 2\max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big).
\]
Moreover, thanks to (E.3(6)), for the positive term $K_\varepsilon * \big( \frac{\mu}{K_\varepsilon * \mu} \big)$ we have
\[
K_\varepsilon * \Big( \frac{\mu}{K_\varepsilon * \mu} \Big)(x) \le \frac{K_{\max,\varepsilon}}{K_{\min,\varepsilon}}.
\]
Hence we conclude that
\[
C^3_\varepsilon = C_F + 2\sigma \max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big) + \sigma \Big( \frac{K_{\max,\varepsilon}}{K_{\min,\varepsilon}} + 1 \Big).
\]
We now turn to the proof of the Lipschitz condition.
By the triangle inequality,
\[
|a(\mu, x) - a(\nu, y)| \le (i) + \sigma\,(ii) + \sigma\,(iii) + \sigma\,(iv),
\]
where
\[
(i) := \Big| \frac{\delta F}{\delta\mu}(\mu, x) - \frac{\delta F}{\delta\mu}(\nu, y) \Big|, \qquad
(ii) := \Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \Big|,
\]
\[
(iii) := \Big| K_\varepsilon * \Big( \frac{\mu}{K_\varepsilon * \mu} \Big)(x) - K_\varepsilon * \Big( \frac{\nu}{K_\varepsilon * \nu} \Big)(y) \Big|, \qquad
(iv) := \Big| \int \log \frac{K_\varepsilon * \mu(z)}{\pi(z)}\,\mu(dz) - \int \log \frac{K_\varepsilon * \nu(z)}{\pi(z)}\,\nu(dz) \Big|.
\]
Thanks to (E.4(1)), we have
\[
(i) \le L_F\big( W_2(\mu, \nu) + |x - y| \big). \quad (54)
\]
For the term (ii), by the triangle inequality we have
\[
(ii) \le \Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \Big| + \Big| \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \Big|.
\]
We bound these two terms separately by (E.4(2)) and (E.4(5)), and we get
\[
(ii) \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu) + \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big) |x - y|. \quad (55)
\]
Now let us consider the term (iii). We write
\[
h_\mu(x) = \frac{1}{(K_\varepsilon * \mu)(x)}, \qquad h_\nu(x) = \frac{1}{(K_\varepsilon * \nu)(x)},
\]
and first treat the dependence on the measure at a fixed point $y$, decomposing
\[
K_\varepsilon * \Big( \frac{\mu}{K_\varepsilon * \mu} \Big)(y) - K_\varepsilon * \Big( \frac{\nu}{K_\varepsilon * \nu} \Big)(y)
= \int K_\varepsilon(y - x)\,h_\mu(x)\,d(\mu - \nu)(x) + \int K_\varepsilon(y - x)\big( h_\mu(x) - h_\nu(x) \big)\,d\nu(x) =: T_1 + T_2.
\]
For $T_1$, set $f_\mu(x) = K_\varepsilon(y - x)\,h_\mu(x)$. Using (E.3(7)) and (E.4(9)) and the product rule for Lipschitz functions,
\[
\mathrm{Lip}(f_\mu) \le \mathrm{Lip}(K_\varepsilon)\,\|h_\mu\|_\infty + \|K_\varepsilon\|_\infty\,\mathrm{Lip}(h_\mu) \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{K_{\max,\varepsilon} L_{K_\varepsilon}}{K_{\min,\varepsilon}^2}.
\]
By the Kantorovich–Rubinstein theorem,
\[
|T_1| = \Big| \int f_\mu\,d(\mu - \nu) \Big| \le \mathrm{Lip}(f_\mu)\,W_1(\mu, \nu) \le \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{K_{\max,\varepsilon} L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} \Big)\,W_1(\mu, \nu).
\]
For $T_2$, first note that
\[
| h_\mu(x) - h_\nu(x) | = \frac{| (K_\varepsilon * (\nu - \mu))(x) |}{(K_\varepsilon * \mu)(x)\,(K_\varepsilon * \nu)(x)} \le \frac{1}{K_{\min,\varepsilon}^2} \sup_z \Big| \int K_\varepsilon(z - w)\,d(\mu - \nu)(w) \Big|.
\]
For each fixed $z$, the function $w \mapsto K_\varepsilon(z - w)$ is $L_{K_\varepsilon}$-Lipschitz. Applying the Kantorovich–Rubinstein theorem gives
\[
\sup_z \Big| \int K_\varepsilon(z - w)\,d(\mu - \nu)(w) \Big| \le L_{K_\varepsilon}\,W_1(\mu, \nu).
\]
Hence $\| h_\mu - h_\nu \|_\infty \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} W_1(\mu, \nu)$, and therefore
\[
|T_2| \le \| K_\varepsilon \|_\infty \, \| h_\mu - h_\nu \|_\infty \le \frac{K_{\max,\varepsilon} L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} W_1(\mu, \nu).
\]
Combining the bounds for $T_1$ and $T_2$ yields
\[
(iii) \le \left( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{2 K_{\max,\varepsilon} L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} \right) W_1(\mu, \nu). \tag{56}
\]
Now let us check item $(iv)$. We have
\[
\begin{aligned}
(iv) &\le \left| \int \left( \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \right) \mu(x) \, dx \right| + \left| \int \big( \mu(x) - \nu(x) \big) \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \, dx \right| \\
&\le \int \big| \log K_\varepsilon * \mu(x) - \log K_\varepsilon * \nu(x) \big| \, \mu(x) \, dx + \left| \int \big( \mu(x) - \nu(x) \big) \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \, dx \right| \\
&\le \left( \frac{2 L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \right) W_1(\mu, \nu),
\end{aligned}
\tag{57}
\]
where in the last inequality we used (E.4(2)) and (E.4(5)). Combining (54), (55), (56), and (57), we conclude that $a(\mu, x)$ is Lipschitz with constant
\[
L_\varepsilon^3 = L_F + \sigma \left( \frac{4 L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{2 L_\pi}{\pi_{\min}} + \frac{2 K_{\max,\varepsilon} L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} \right).
\]

E.4 For Flow (28)

Define the coefficient associated with the kernelized flow (28) by
\[
a(\mu, x) := \frac{\delta F}{\delta \mu}(\mu, x) + \sigma \left( \left( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{\pi} \right)(x) - \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) \right).
\]
By Proposition E.1, $a$ satisfies both a boundedness estimate and a Lipschitz estimate. In particular, under Assumptions 2, 3, and 4, there exist constants $C_\varepsilon^4 > 0$ and $L_\varepsilon^4 > 0$ such that, for all $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $x, y \in \mathcal{X}$,
\[
|a(\mu, x)| \le C_\varepsilon^4, \qquad |a(\mu, x) - a(\nu, y)| \le L_\varepsilon^4 \big( W_2(\mu, \nu) + |x - y| \big).
\]
Proof. Boundedness. By (E.3(2)) and $\int_{\mathcal{X}} K_\varepsilon = 1$,
\[
\left| \left( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{\pi} \right)(x) \right| = \left| \int K_\varepsilon(x - y) \log \frac{(K_\varepsilon * \mu)(y)}{\pi(y)} \, dy \right| \le \int K_\varepsilon(x - y) \left| \log \frac{(K_\varepsilon * \mu)(y)}{\pi(y)} \right| dy \le \max \left( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \right).
\]
By the same reasoning, $\mathrm{KL}(K_\varepsilon * \mu \,|\, \pi)$ is bounded by the same constant. Since $a$ contains both terms multiplied by $\sigma$, we may take
\[
C_\varepsilon^4 := C_F + 2 \sigma \max \left( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \right).
\]
We now turn to the proof of the Lipschitz condition.
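The two boundedness claims for the flow-(28) coefficient can likewise be checked numerically: both the smoothed log-ratio $K_\varepsilon * \log(K_\varepsilon*\mu/\pi)$ and $\mathrm{KL}(K_\varepsilon*\mu\,|\,\pi)$ should stay below $\max(\log(\pi_{\max}/K_{\min,\varepsilon}),\ \log(K_{\max,\varepsilon}/\pi_{\min}))$. The sketch below uses a wrapped Gaussian kernel on a discretized 1-D torus; the kernel, $\varepsilon$, $\pi$, and $\mu$ are illustrative assumptions.

```python
import numpy as np

n = 512
grid = np.linspace(0.0, 1.0, n, endpoint=False)
dx = 1.0 / n

# Wrapped Gaussian kernel, normalized so that the integral of K_eps is 1.
eps = 0.1
d = np.minimum(grid, 1.0 - grid)          # periodic distance to the origin
K = np.exp(-d**2 / (2.0 * eps**2))
K /= K.sum() * dx

def conv(k, g):
    # Exact circular convolution (k * g)(x) = integral of k(x - z) g(z) dz via FFT.
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(g))) * dx

pi = 1.0 + 0.5 * np.cos(2.0 * np.pi * grid)
pi /= pi.sum() * dx
mu = 1.0 + 0.5 * np.sin(2.0 * np.pi * grid)
mu /= mu.sum() * dx

smoothed_mu = conv(K, mu)                 # K_eps * mu (mass is preserved exactly)
log_ratio = np.log(smoothed_mu / pi)

bound = max(np.log(pi.max() / K.min()), np.log(K.max() / pi.min()))
# conv(K, |log_ratio|) dominates |K_eps * log(K_eps*mu / pi)| pointwise.
smoothed_log = conv(K, np.abs(log_ratio))
kl = np.sum(smoothed_mu * log_ratio) * dx  # KL(K_eps*mu | pi), nonnegative
print(kl, smoothed_log.max(), bound)
```

Both quantities sit well below the (fairly crude) constant, and the KL term is nonnegative as expected, since the discrete convolution preserves total mass exactly.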
By the triangle inequality,
\[
|a(\mu, x) - a(\nu, y)| \le (i) + \sigma (ii) + \sigma (iii),
\]
where
\[
\begin{aligned}
(i) &:= \left| \frac{\delta F}{\delta \mu}(\mu, x) - \frac{\delta F}{\delta \mu}(\nu, y) \right|, \\
(ii) &:= \left| \left( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{\pi} \right)(x) - \left( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{\pi} \right)(y) \right|, \\
(iii) &:= \big| \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, \pi) \big|.
\end{aligned}
\]
For $(i)$, Assumption 2 yields
\[
(i) \le L_F \big( W_2(\mu, \nu) + |x - y| \big). \tag{58}
\]
For $(ii)$, we split it into a measure part and a space part:
\[
\begin{aligned}
(ii) &\le \left| \left( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{\pi} \right)(x) - \left( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{\pi} \right)(x) \right| + \left| \left( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{\pi} \right)(x) - \left( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{\pi} \right)(y) \right| \\
&\le \int K_\varepsilon(x - z) \left| \log \frac{K_\varepsilon * \mu}{\pi}(z) - \log \frac{K_\varepsilon * \nu}{\pi}(z) \right| dz + \int K_\varepsilon(z) \left| \log \frac{K_\varepsilon * \nu}{\pi}(x - z) - \log \frac{K_\varepsilon * \nu}{\pi}(y - z) \right| dz \\
&\le \left( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \right) \big( W_1(\mu, \nu) + |x - y| \big),
\end{aligned}
\tag{59}
\]
where the last inequality follows from (E.4(2)) and (E.4(5)). For $(iii)$, (E.4(8)) gives
\[
(iii) \le \left( \frac{2 L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \right) W_1(\mu, \nu). \tag{60}
\]
Combining (58), (59), and (60), we may choose
\[
L_\varepsilon^4 = L_F + \sigma \left( \frac{3 L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{2 L_\pi}{\pi_{\min}} \right).
\]

E.5 Boundedness and Lipschitz Condition for the Kernelized Fisher–Rao Gradient Flow Regularized by the Chi-Square Divergence

Proof. Using the uniform bound on $\frac{\delta F}{\delta \mu}$ and $\pi \ge \pi_{\min}$,
\[
\left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right)(x) \right| = \left| \int K_\varepsilon(x - z) \frac{(K_\varepsilon * \mu)(z)}{\pi(z)} \, dz \right| \le \frac{K_{\max,\varepsilon}}{\pi_{\min}},
\]
and the same bound holds for $\int K_\varepsilon * \big( \frac{K_\varepsilon * \mu}{\pi} \big) \, d\mu$. Hence
\[
|a(\mu, x)| \le C_F + \frac{4 \sigma K_{\max,\varepsilon}}{\pi_{\min}} =: C_\varepsilon^5.
\]
We now analyze the Lipschitz condition. By the triangle inequality,
\[
|a(\mu, x) - a(\nu, y)| \le (i) + 2 \sigma (ii) + 2 \sigma (iii),
\]
where
\[
\begin{aligned}
(i) &:= \left| \frac{\delta F}{\delta \mu}(\mu, x) - \frac{\delta F}{\delta \mu}(\nu, y) \right|, \\
(ii) &:= \left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right)(x) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(y) \right|, \\
(iii) &:= \left| \int K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right) d\mu - \int K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right) d\nu \right|.
\end{aligned}
\]
For $(i)$,
\[
(i) \le L_F \big( W_2(\mu, \nu) + |x - y| \big).
\]
For $(ii)$, split
\[
\left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right)(x) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(x) \right| = \left| \int \phi_x(z) \, d(\mu - \nu)(z) \right|, \quad \text{with} \quad \phi_x(z) := \int \frac{K_\varepsilon(x - w)}{\pi(w)} K_\varepsilon(z - w) \, dw.
\]
Since $z \mapsto K_\varepsilon(z - w)$ is $L_{K_\varepsilon}$-Lipschitz and $\int K_\varepsilon(x - w)/\pi(w) \, dw \le 1/\pi_{\min}$, we have $\mathrm{Lip}(\phi_x) \le \frac{L_{K_\varepsilon}}{\pi_{\min}}$, hence by the Kantorovich–Rubinstein theorem,
\[
\left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right)(x) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(x) \right| \le \frac{L_{K_\varepsilon}}{\pi_{\min}} W_1(\mu, \nu).
\]
Moreover,
\[
\left| K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(x) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(y) \right| \le \frac{L_{K_\varepsilon}}{\pi_{\min}} |x - y|.
\]
Thus
\[
(ii) \le \frac{L_{K_\varepsilon}}{\pi_{\min}} \big( W_1(\mu, \nu) + |x - y| \big).
\]
For $(iii)$, write
\[
(iii) \le \left| \int K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right) d(\mu - \nu) \right| + \int \left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right) \right| d\nu.
\]
Since $\mathrm{Lip}\big( K_\varepsilon * \big( \frac{K_\varepsilon * \mu}{\pi} \big) \big) \le \frac{L_{K_\varepsilon}}{\pi_{\min}}$, the first term is bounded by $\frac{L_{K_\varepsilon}}{\pi_{\min}} W_1(\mu, \nu)$. Furthermore,
\[
\left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right)(x) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(x) \right| \le \left\| \frac{K_\varepsilon * (\mu - \nu)}{\pi} \right\|_\infty \le \frac{L_{K_\varepsilon}}{\pi_{\min}} W_1(\mu, \nu),
\]
hence
\[
(iii) \le \frac{2 L_{K_\varepsilon}}{\pi_{\min}} W_1(\mu, \nu).
\]
Using $W_1(\mu, \nu) \le W_2(\mu, \nu)$ and collecting the bounds gives
\[
|a(\mu, x) - a(\nu, y)| \le \left( L_F + \frac{6 \sigma L_{K_\varepsilon}}{\pi_{\min}} \right) \big( W_2(\mu, \nu) + |x - y| \big).
\]
Therefore the Lipschitz estimate holds with $L_\varepsilon^5 := L_F + \frac{6 \sigma L_{K_\varepsilon}}{\pi_{\min}}$.
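The chi-square boundedness step above, $K_\varepsilon * (K_\varepsilon*\mu/\pi) \le K_{\max,\varepsilon}/\pi_{\min}$, together with the same bound for its integral against $\mu$, can be verified on a discretized example. As before, the wrapped Gaussian kernel on the 1-D torus and the concrete $\pi$ and $\mu$ below are illustrative assumptions only.

```python
import numpy as np

n = 512
grid = np.linspace(0.0, 1.0, n, endpoint=False)
dx = 1.0 / n

# Wrapped Gaussian kernel, normalized so that the integral of K_eps is 1.
eps = 0.05
d = np.minimum(grid, 1.0 - grid)           # periodic distance to the origin
K = np.exp(-d**2 / (2.0 * eps**2))
K /= K.sum() * dx

def conv(k, g):
    # Exact circular convolution (k * g)(x) = integral of k(x - z) g(z) dz via FFT.
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(g))) * dx

pi = 1.0 + 0.5 * np.cos(2.0 * np.pi * grid)
pi /= pi.sum() * dx                        # reference measure, pi >= pi_min > 0
mu = np.exp(np.sin(2.0 * np.pi * grid))
mu /= mu.sum() * dx                        # probability density

# The double-smoothed chi-square term K_eps * (K_eps*mu / pi).
smoothed = conv(K, conv(K, mu) / pi)
inner = np.sum(smoothed * mu) * dx         # its integral against mu
bound = K.max() / pi.min()                 # the constant from the proof
print(smoothed.max(), inner, bound)
```

Since convolution with the normalized kernel preserves total mass, both the pointwise maximum and the integral against $\mu$ respect the bound exactly, up to floating-point error.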
