On propagation of chaos for the Fisher-Rao gradient flow in entropic mean-field optimization


Authors: Petra Lazić, Linshan Liu, Mateusz B. Majka

Petra Lazić (University of Ljubljana and University of Zagreb), Linshan Liu (Heriot-Watt University and Maxwell Institute for Mathematical Sciences), Mateusz B. Majka (Heriot-Watt University and Maxwell Institute for Mathematical Sciences)

Abstract

We consider a class of optimization problems on the space of probability measures motivated by the mean-field approach to studying neural networks. Such problems can be solved by constructing continuous-time gradient flows that converge to the minimizer of the energy function under consideration, and then implementing discrete-time algorithms that approximate the flow. In this work, we focus on the Fisher-Rao gradient flow and we construct an interacting particle system that approximates the flow as its mean-field limit. We discuss the connection between the energy function, the gradient flow and the particle system and explain different approaches to smoothing out the energy function with an appropriate kernel in a way that allows for the particle system to be well-defined. We provide a rigorous proof of the existence and uniqueness of the thus obtained kernelized flows, as well as a propagation of chaos result that provides a theoretical justification for using the corresponding kernelized particle systems as approximation algorithms in entropic mean-field optimization.

1 INTRODUCTION

We consider the following optimization problem on the space of probability measures P(X) on X ⊂ R^d:

min_{m ∈ P(X)} V^σ(m),   V^σ(m) := F(m) + σ KL(m | π),   (1)

Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300. Copyright 2026 by the author(s).
where F : P(X) → R is a (possibly non-convex) functional bounded from below, σ > 0 is a regularization parameter, π ∈ P(X) is a fixed reference measure and KL denotes the relative entropy (the KL-divergence). While some general results in Section 2 will be stated for domains X ⊂ R^d which do not have to be compact, for the crucial examples studied in Section 3 we will additionally require X to be compact.

In recent years, there has been considerable interest in problems of this type, motivated by the mean-field approach to the problem of training neural networks (see Mei et al. (2018) or (Hu et al., 2021, Section 3) and the references therein), as well as, in the context of reinforcement learning, by policy optimization for entropy-regularized Markov Decision Processes with neural network approximation in the mean-field regime (Leahy et al., 2022; Lascu and Majka, 2025).

In order to solve (1), one typically aims to construct a gradient flow (µ_t)_{t≥0} on P(X) that converges to the minimizer m^{*,σ} of (1) as t → ∞. The most commonly used example is the Wasserstein gradient flow

∂_t µ_t = ∇ · ( µ_t ∇ (δV^σ/δµ)(µ_t, ·) ),   (2)

defined via the flat derivative (first variation) δV^σ/δµ of the energy function V^σ (see Definition D.1). The conditions guaranteeing the convergence of (2) to m^{*,σ} have been studied by numerous authors in various settings, including Ambrosio et al. (2008); Hu et al. (2021); Nitanda et al. (2022); Chizat (2022); Leahy et al. (2022) and many others.

However, from the point of view of applications, an equally important question is how to approximate gradient flows such as (2) by a practically implementable algorithm. One possible approach leads through the Jordan-Kinderlehrer-Otto (JKO) schemes (see Jordan et al. (1998) for the original paper or Salim et al. (2020); Lascu et al. (2024) for more recent expositions).
Another, which we are going to focus on in the present paper, utilizes an interpretation of (2) as the mean-field limit of an interacting particle system. In the latter approach, one typically aims to prove a propagation of chaos result, i.e., a result showing that as the number of particles approaches infinity, the particles become independent and they all follow the same mean-field dynamics. This can then be used as a theoretical justification that an appropriately constructed interacting particle system may be used to produce (after a discretisation) an algorithm that approximates the minimizer of (1) when the number of particles and the number of iterations are both sufficiently large.

Propagation of chaos for the Wasserstein gradient flow has been studied in detail in various settings (utilizing the interpretation of (2) as the Fokker-Planck equation for the mean-field Langevin SDE). A (far from complete) list of references includes Chen et al. (2025); Monmarché et al. (2024); Durmus et al. (2020); Carrillo et al. (2003); Malrieu (2001); Delarue and Tse (2025); Lacker and Le Flem (2023); Suzuki et al. (2023); Nitanda (2024); Nitanda et al. (2025); Gu and Kim (2025). A related active strand of research involves propagation of chaos for kinetic models: Monmarché (2017); Guillin and Monmarché (2021); Schuh (2024); Chen et al. (2024).

In the present work, we focus on a different gradient flow, the so-called Fisher-Rao gradient flow given by

∂_t µ_t = −µ_t (δV^σ/δµ)(µ_t, ·).   (3)

The interest in studying this flow in the context of optimization problems (1) is motivated by the fact that in some settings, its convergence to the minimizer can be easier to verify than for the Wasserstein flow (Liu et al., 2023; Kerimkulov et al., 2025; Lascu et al., 2024).
There has been a considerable literature studying fundamental properties of Fisher-Rao gradient flows, such as well-posedness in various settings, see e.g. Carrillo et al. (2024); Zhu and Mielke (2024) and the references therein, also in combination with the Wasserstein flow as the Wasserstein-Fisher-Rao gradient flow (also known as the Hellinger-Kantorovich flow), see Liero et al. (2018); Gallouët and Monsaingeon (2017); Lu et al. (2019); Rotskoff et al. (2019). However, unlike for the Wasserstein gradient flow (2), in the context of the Fisher-Rao flow (3) much less is known about particle approximations (with a few exceptions that will be discussed in detail in Section 2.4). In particular, the main goal of the present paper is to fill a gap in the literature by providing a rigorous proof of a particle approximation for the Fisher-Rao gradient flow (3) corresponding to a class of minimization problems of the form (1).

1.1 Contributions

We summarize the main contributions of our paper:

• In Section 2, we propose a general framework for studying particle approximations of Fisher-Rao gradient flows. This part extends the results from Cavil et al. (2017); Lu et al. (2019); Rotskoff et al. (2019); Domingo-Enrich et al. (2020) (see Remark 2.8 and Section 2.4 for details).

• In Section 3, we discuss different approaches to smoothing out Fisher-Rao flows (which is crucial for practical implementation) and we show that all the discussed methods fall within our framework from Section 2. This part expands on Lu et al. (2019); Pampel et al. (2023); Lu et al. (2023); Carrillo et al. (2019).

• In Section 4, we propose a practical algorithm for approximating Fisher-Rao flows, utilizing a method from Section 3.
2 MAIN RESULTS

In this section, we work on a general (not necessarily compact) space X ⊂ R^d and we focus on the flow

∂_t µ_t = −µ_t a(µ_t, ·),   (4)

where the flat derivative δV^σ/δµ on the right hand side of (3) has been replaced with a function a : P(X) × X → R. In Section 3, we will explain the rationale behind approximating the flat derivative δV^σ/δµ of the energy function V^σ in (1) with its kernelized counterpart. Different choices of kernelizations will lead to different choices of a, and hence in this section we keep our notation general, in order to produce a broadly applicable theoretical framework. For convenience, we will refer to flows (4) as Fisher-Rao flows, although it should be understood that they are "true" Fisher-Rao flows only for some choices of a (see Section 3.2 for more details).

We remark that we always choose a in a way that ensures that µ_t remains a probability distribution. This is achieved by requiring that for all µ ∈ P(X),

∫_X a(µ, x) µ(dx) = 0,   (5)

which then immediately implies that ∂_t ∫_X µ_t(dx) = −∫_X a(µ_t, x) µ_t(dx) = 0, and hence the flow preserves the mass of the initial measure. It is easy to check that all examples of a studied in Section 3 satisfy property (5).

2.1 Preliminaries

Before we proceed, let us introduce some necessary notation. For p ∈ [1, ∞), and for any normed vector space (E, ∥·∥), we define the set of probability measures with finite p-th moment P_p(E) as

P_p(E) := { µ ∈ P(E) : ∫_E ∥x∥^p µ(dx) < ∞ }.

We equip P_p(E) with the p-Wasserstein distance

W_p(m, m′) := inf_{π ∈ Π(m, m′)} ( ∫_E ∥x − y∥^p π(dx, dy) )^{1/p},

where Π(m, m′) ⊂ P(E × E) is the set of all couplings of m and m′. For any T > 0, we consider the path space C([0, T]; E) with the supremum norm ∥x − y∥_T := sup_{0 ≤ s ≤ T} ∥x_s − y_s∥, and write P_p(C([0, T]; E)) for the corresponding p-moment space.
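The mass-preservation mechanism behind condition (5) can be illustrated numerically. The sketch below (ours, not from the paper) discretizes the flow (4) on a one-dimensional grid with the hypothetical choice a(µ, x) = f(x) − ∫ f dµ, which satisfies (5) by construction; the total mass of µ_t then stays constant along explicit Euler steps, up to floating-point error.

```python
import numpy as np

# Hypothetical illustration: discretize the flow (4) on a 1-D grid with
# a(mu, x) = f(x) - \int f d(mu), which satisfies the centering condition (5).
xs = np.linspace(-1.0, 1.0, 201)
dx = xs[1] - xs[0]
mu = np.exp(-xs**2)          # unnormalized initial density
mu /= mu.sum() * dx          # normalize to a probability density on the grid

f = xs**2                    # any bounded test functional building block

dt, n_steps = 1e-3, 2000
for _ in range(n_steps):
    a = f - (f * mu).sum() * dx        # a(mu_t, x), mean-zero under mu_t
    mu = mu * (1.0 - dt * a)           # Euler step of d/dt mu = -mu a(mu, .)

mass = mu.sum() * dx
print(abs(mass - 1.0) < 1e-8)          # mass is preserved
```

Note that for this centered a, each Euler step preserves the discrete mass exactly, since the subtracted constant makes ∫ a µ vanish at every step.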
The induced p-Wasserstein distance on path measures is

W_{p,T}(m, m′) := inf_{π ∈ Π(m, m′)} ( ∫ ∥x − y∥_T^p π(dx, dy) )^{1/p}.

We require a to satisfy the following assumption:

Assumption 1. The function a is bounded and Lipschitz, i.e., there exist constants M_a, L_a > 0 such that for all x, y ∈ X and m, m′ ∈ P_2(X),

|a(m, x)| ≤ M_a,   (6)

and

|a(m, x) − a(m′, y)| ≤ L_a ( |x − y| + W_2(m, m′) ).   (7)

Following the ideas from Liero et al. (2018) (see also Domingo-Enrich et al. (2020)), we work with the notions of lifts and projections of measures. This will allow us to use measures defined on the extended space X × R_+, with the second component representing a local weight.

Definition 2.1 (Lifted measure and projection). Let µ ∈ P_1(X) and ν ∈ P_1(X × R_+). We say that ν is a lifted measure of µ, and conversely that µ is the projection of ν, if for all φ ∈ C_c^∞(X),

∫_X φ(x) dµ(x) = ∫_{X × R_+} w φ(x) dν(x, w),   (8)

where C_c^∞(X) denotes the space of smooth, compactly supported functions on X. In this case we define the projection operator h : P_1(X × R_+) → P_1(X), so that µ = hν whenever (8) holds.

Note that the requirement for the measures in Definition 2.1 to have finite first moments ensures that the integral on the right hand side of (8) is finite.

2.2 Existence of mean-field dynamics corresponding to the Fisher-Rao flow

Building on the notion of lifted and projected measures defined above, we now establish a rigorous connection between the Fisher-Rao gradient flow and a corresponding mean-field (single particle) dynamics. The key idea is to lift the flow from the probability space P_1(X) to the extended space P_1(X × R_+), where the additional coordinate w represents a local mass weight.
This lifted formulation reveals that the original Fisher-Rao flow can be interpreted as the projection of an evolution equation on the extended space. Based on this representation, we construct a mean-field dynamic whose law evolves according to the lifted flow, and show that, under suitable regularity assumptions, this dynamic admits a unique solution on any finite time horizon. We begin by stating the precise correspondence between the lifted and projected flows.

Definition 2.2. Let ν_0 ∈ P_1(X × R_+). We say that (ν_t)_{t≥0} ⊂ P_1(X × R_+) is a weak solution to the lifted flow

∂_t ν_t(x, w) = (∂/∂w)( ν_t(x, w) w a(hν_t, x) ),   (9)

with initial condition ν_0, if for any ψ ∈ C_c^∞(X × R_+) and any t > 0 we have

(d/dt) ∫_{X × R_+} ψ(x, w) dν_t(x, w) = −∫_{X × R_+} w a(hν_t, x) (∂/∂w) ψ(x, w) dν_t(x, w).   (10)

Definition 2.3. Let µ_0 ∈ P_1(X). We say that (µ_t)_{t≥0} ⊂ P_1(X) is a weak solution to the Fisher-Rao gradient flow

∂_t µ_t = −µ_t a(µ_t, ·),   (11)

with initial condition µ_0, if for any ψ ∈ C_c^∞(X) and any t > 0 we have

(d/dt) ∫_X ψ(x) dµ_t(x) = −∫_X ψ(x) a(µ_t, x) dµ_t(x).   (12)

Remark 2.4. Note that by a standard approximation argument, if (ν_t)_{t≥0} is a weak solution to the lifted flow in the sense of Definition 2.2, then (10) holds also for functions of the form ψ(x, w) = w φ(x), φ ∈ C_c^∞(X). This will be important for arguments where we switch between the lifted and projected flows, especially in the proof of the following lemma.

Lemma 2.5. Let (ν_t)_{t≥0} ⊂ P_1(X × R_+) be a weak solution to the lifted flow (9) in the sense of Definition 2.2, with initial condition ν_0 ∈ P_1(X × R_+). Then the projected flow (µ_t)_{t≥0} := (hν_t)_{t≥0} ⊂ P_1(X) is a weak solution to the Fisher-Rao flow (11) in the sense of Definition 2.3, with initial condition µ_0 := hν_0.
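The content of Lemma 2.5 can be checked numerically on an empirical lifted measure. In the sketch below (ours; the choice of a is an illustrative one satisfying (5)), the lifted measure ν_t = (1/N) Σ_i δ_{(x_i, w_i(t))} evolves along the characteristics dw_i/dt = −w_i a(hν_t, x_i), its projection is the weighted empirical measure hν_t = (1/N) Σ_i w_i δ_{x_i}, and a finite-difference quotient of ∫ ψ d(hν_t) reproduces the right hand side of the weak formulation (12).

```python
import numpy as np

# Illustrative check of Lemma 2.5 (not from the paper): evolve an empirical
# lifted measure along the weight characteristics, project via h, and
# verify the weak form (12):
#   d/dt \int psi d(mu_t) == -\int psi a(mu_t, .) d(mu_t).
rng = np.random.default_rng(1)
N = 2000
xs = rng.normal(size=N)                       # fixed positions x_i
ws = np.ones(N)                               # initial weights, sum = N

f = lambda x: np.tanh(x)                      # building block for a
a = lambda ws, xs, x: f(x) - np.mean(ws * f(xs))   # satisfies (5)
psi = lambda x: np.sin(x)                     # test function

dt = 1e-4
I0 = np.mean(ws * psi(xs))                    # \int psi d(mu_t)
rhs = -np.mean(ws * psi(xs) * a(ws, xs, xs))  # right hand side of (12)
ws = ws * (1.0 - dt * a(ws, xs, xs))          # one Euler step of the weights
I1 = np.mean(ws * psi(xs))                    # \int psi d(mu_{t+dt})

lhs = (I1 - I0) / dt                          # finite-difference d/dt
print(abs(lhs - rhs) < 1e-8)
```

For the Euler scheme the identity is in fact exact in one step, which is why the tolerance can be taken at floating-point level rather than O(dt).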
With this relation at hand, we can construct a mean-field particle dynamic whose law evolves according to the lifted flow. The precise statement is given in the following theorem.

Theorem 2.6. Let ν_0 ∈ P_1(X × R_+). Consider the mean-field system

dX_t = 0,   dw_t = −w_t a(hν_t, X_t) dt,   (X_0, w_0) ∼ ν_0,   (13)

where ν_t := Law(X_t, w_t) denotes the joint law of (X_t, w_t). If system (13) admits a solution, then the curve (ν_t)_{t≥0} is a weak solution to the lifted Fisher-Rao flow (9) in the sense of Definition 2.2, and hence (µ_t)_{t≥0} := (hν_t)_{t≥0} is a weak solution to the Fisher-Rao flow (11) in the sense of Definition 2.3.

Remark 2.7. System (13) can be interpreted as a particle X_t with a corresponding weight w_t. Note that the spatial position X_t of the particle is sampled once from the initial probability distribution and remains fixed over time, while only the associated weight w_t evolves according to the dynamic in (13). This reflects the geometry of the Fisher-Rao metric, which governs mass change without spatial transport.

Remark 2.8. Consider the setting corresponding to our primary object of interest, i.e., when the function a in (4) is given by the flat derivative of V^σ defined in (1). Then, for measures µ such that V^σ has a flat derivative at µ,

a(µ, ·) = (δF/δµ)(µ, ·) + σ log(µ/π)   (14)

(up to an additive constant; the details will be discussed in Section 3). If we assume the necessary differentiability, we can consider the Wasserstein-Fisher-Rao gradient flow given by

∂_t µ_t(x) = ∇ · ( µ_t ∇a(µ_t, x) ) − µ_t(x) a(µ_t, x).   (15)

Then, due to the interpretation of the Wasserstein gradient flow as the mean-field Langevin SDE (Hu et al., 2021), we would expect to obtain the following corresponding mean-field (single particle) system

dX_t = −∇( (δF/δµ)(hν_t, X_t) − σ log π(X_t) ) dt + √(2σ) dW_t,
dw_t = −w_t a(hν_t, X_t) dt,   ν_t := Law(X_t, w_t),   (16)

where both the location of the particle and the weight evolve in time. There are, however, several technical challenges with studying system (16). If (16) is considered on a compact space X ⊂ R^d, this creates an issue with the Wasserstein part of the flow (due to the diffusive behaviour of X_t) and one needs to make sure that the particle stays on the right space. On the other hand, if X = R^d, then the Fisher-Rao part becomes problematic since the function a defined in (14) is unbounded (due to the flat derivative of the relative entropy being unbounded). While working on the present paper, we were unable to overcome these technical challenges, which is why we focus on the system (13) corresponding to the "pure" Fisher-Rao gradient flow (and in Section 3, when we discuss specific choices of a related to (14), we work on a compact X to ensure that a stays bounded). This is consistent with a recent paper (Domingo-Enrich et al., 2020), which studied (in the context of two-player game theory) Wasserstein-Fisher-Rao flows corresponding to energy functions (1) without entropy-regularization (i.e., with σ = 0), which leads to a system of the form (16) where X_t moves only according to a deterministic gradient descent; as well as Wasserstein flows corresponding to (1) with entropy regularization but without the Fisher-Rao part. Similarly to Domingo-Enrich et al. (2020), we are unable to cover Wasserstein-Fisher-Rao (WFR) flows corresponding to (1) with entropy-regularization, but unlike Domingo-Enrich et al.
(2020), we study Fisher-Rao flows for (1) with entropy-regularization, which were not covered there. We will attempt to treat the challenging case of WFR flows in a future work.

We now address the well-posedness of the mean-field particle dynamics.

Theorem 2.9. Suppose that a satisfies Assumption 1. Let the initial law ν_0 ∈ P_2(X × R_+). Then, for any T > 0, there exists a unique random dynamic (X_t, w_t)_{t∈[0,T]} that solves system (13). In particular,

Law( (X_t, w_t)_{t∈[0,T]} ) ∈ P_2( C([0, T]; X × R_+) ).

Moreover, the solution is pathwise unique: if two solutions share the same initial condition almost surely, then they coincide for all t ∈ [0, T] almost surely.

2.3 Interacting Particle Systems and Propagation of Chaos

We investigate the approximation of the Fisher-Rao gradient flow by interacting particle systems. Our objective is to establish propagation of chaos: as the number of particles N → ∞, the empirical measure of the interacting system converges to the law of i.i.d. copies of the mean-field solution, with convergence measured in the 2-Wasserstein distance.

Before we discuss the interacting particle system that is our main object of interest, in order to make the notation precise we first introduce a system (X^{i,N}, w^{i,N}_t) of N i.i.d. copies of the mean-field dynamic (13), which will be used as an auxiliary tool in our approximation estimates. For each i ∈ ⟦1, N⟧, we define the dynamic of (X^{i,N}, w^{i,N}_t) by

X^{i,N} ∼ µ^{i,N}_0,   w^{i,N}_0 ≥ 0, with Σ_{i=1}^N w^{i,N}_0 = N,
dw^{i,N}_t = −w^{i,N}_t a(hν^i_t, X^{i,N}) dt,   (17)

where ν^i_t := Law(X^{i,N}, w^{i,N}_t) is the marginal law of the i-th particle. We denote the empirical measure of the thus obtained non-interacting system by

ν^N_t := (1/N) Σ_{i=1}^N δ_{(X^{i,N}, w^{i,N}_t)}.   (18)

We will use this notation throughout our proofs in Appendix B.

Interacting particle system.
We are now ready to introduce the interacting particle system, where the interaction is governed by weighted empirical distributions. For each i ∈ ⟦1, N⟧, we define (X̃^{i,N}, w̃^{i,N}_t) by

X̃^{i,N} ∼ µ̃^{i,N}_0,   w̃^{i,N}_0 ≥ 0, with Σ_{i=1}^N w̃^{i,N}_0 = N,
dw̃^{i,N}_t = −w̃^{i,N}_t a( µ̃^N_t, X̃^{i,N} ) dt,   (19)

where µ̃^N_t := (1/N) Σ_{i=1}^N w̃^{i,N}_t δ_{X̃^{i,N}} is the weighted empirical distribution. Note that the empirical measure on the extended space is

ν̃^N_t := (1/N) Σ_{i=1}^N δ_{(X̃^{i,N}, w̃^{i,N}_t)},   so that   µ̃^N_t = h ν̃^N_t.   (20)

We now state the main result on the propagation of chaos.

Theorem 2.10. Suppose that the function a satisfies Assumption 1. Fix a finite time horizon T > 0 and let ν ∈ P_2( C([0, T]; X × R_+) ) be the law of the single particle (X, w_t) = (X, w_t)_{t∈[0,T]} defined by the mean-field dynamic (13). Choose particles (X^{i,N}, w^{i,N}_t) as i.i.d. copies of (X, w_t) as defined in (17) and let (X̃^{i,N}, w̃^{i,N}_t) be the interacting particle system defined by (19). Suppose further that the initial conditions satisfy

lim_{N→∞} (1/N) Σ_{i=1}^N E[ |X^{i,N} − X̃^{i,N}|^2 + |w^{i,N}_0 − w̃^{i,N}_0|^2 ] = 0.   (21)

Let ν̃^N ∈ P_2( C([0, T]; X × R_+) ) be the empirical measure of the interacting particle system defined by (20). Then,

lim_{N→∞} E[ W_{2,T}( ν̃^N, ν ) ] = 0.

We remark that condition (21) is automatically satisfied when the initializations for both systems considered in Theorem 2.10 are identical.

Corollary 2.11. Under the assumptions of Theorem 2.10, the projected empirical distribution µ̃^N_t := h ν̃^N_t converges in the 2-Wasserstein distance, uniformly on [0, T], to the mean-field law µ_t := hν_t. That is, for any finite time horizon T > 0, we have

lim_{N→∞} sup_{t∈[0,T]} E[ W_2( µ_t, µ̃^N_t ) ] = 0.
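A minimal numerical sketch of the interacting system (19) (ours, with an illustrative a satisfying (5)): positions are sampled once and never move, only the weights evolve under an Euler discretization, and the centering property (5) keeps the total weight Σ_i w̃^{i,N}_t equal to N along the trajectory.

```python
import numpy as np

# Minimal sketch (illustrative) of the interacting particle system (19):
# fixed positions, weights driven by a evaluated at the weighted empirical
# measure mu_tilde = (1/N) sum w_i delta_{x_i}.
rng = np.random.default_rng(2)
N = 1000
xs = rng.normal(size=N)       # X_tilde^{i,N}, sampled once, never moved
ws = np.ones(N)               # w_tilde^{i,N}_0, sum = N

f = lambda x: x**2

def a(ws, xs, x):
    # a(mu, x) = f(x) - \int f d(mu); the subtraction enforces (5)
    return f(x) - np.mean(ws * f(xs))

dt, n_steps = 1e-2, 500
for _ in range(n_steps):
    ws = ws * (1.0 - dt * a(ws, xs, xs))   # Euler step of (19)

print(abs(ws.sum() - N) < 1e-6)            # total weight is preserved
print(np.all(ws > 0))                      # weights stay positive for small dt
```

Since this particular a penalizes large values of f, mass flows multiplicatively towards particles with small f; no particle is ever killed or moved, which is the Fisher-Rao behaviour described in Remark 2.7.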
We remark that Theorem 2.10 and Corollary 2.11 are non-quantitative, and obtaining convergence rates would require further work, which is beyond the scope of the present paper.

2.4 Discussion of other propagation of chaos results in the literature

The framework in Cavil et al. (2017) establishes propagation of chaos results for particle approximations of the following class of non-conservative nonlinear PDEs:

∂_t v = Σ_{i,j=1}^d ∂^2_{ij}( (ΦΦ^⊤)_{i,j}(t, x, v) v ) − ∇ · ( g(t, x, v) v ) + Λ(t, x, v) v,   v(0, dx) = v_0(dx).   (22)

In this equation, the first two terms correspond to the Wasserstein component of WFR flows, while the last term plays the role of a Fisher-Rao-type reaction. However, unlike the WFR flow (15), where both the transport and reaction components are derived from a single energy functional, here the different terms are specified independently and may correspond to unrelated energies.

More importantly, the reaction term Λ(t, x, v)v in Cavil et al. (2017) depends locally on the solution v: that is, Λ is evaluated using the pointwise value v(t, x), without reference to the global structure of the distribution. This contrasts with our setting, where, as we will explain in Section 3, the reaction term involves a nonlocal dependence on the entire probability measure. Hence, PDE (22) does not in general conserve total mass: the solution v(t, ·) is not necessarily a probability density for all t. In contrast, the nonlocal normalization we include ensures mass preservation at all times. Moreover, the regularity assumptions also differ, cf. Assumption 1 in Cavil et al. (2017) to our Assumption 1.

Another key distinction is that we work on a compact state space X ⊂ R^d, whereas the equations in Cavil et al. (2017) are defined on the whole space R^d. Extending the results from Cavil et al.
(2017) to compact domains would require working with appropriately defined boundary conditions for the PDE (22).

As we discussed in Remark 2.8, another related paper (Domingo-Enrich et al., 2020) studied propagation of chaos for Wasserstein-Fisher-Rao flows without entropy regularization, and for Wasserstein flows corresponding to the entropy-regularized energy (1) (but without the Fisher-Rao part), and hence their results are not applicable to our setting.

Finally, there has been some work on propagation of chaos for interacting particle systems defined with a killing-replication mechanism with exponential clocks (rather than with evolving weights; cf. also the discussion in Section 4) in Rotskoff et al. (2019); Lu et al. (2019). However, we were unable to make the proofs from those works fully rigorous in our setting (1), due to the unboundedness of the flat derivative of the relative entropy.

3 ENTROPIC MEAN-FIELD PROBLEMS

In this section, we assume X to be a compact subset of R^d, and we focus on the energy function

V^σ(m) = F(m) + σ KL(m | π),   for all m ∈ P(X).   (23)

Note that the choice of a compact X removes the technical problem with flat differentiability of the KL-divergence, which on unbounded domains would have to be rigorously justified (Liu et al., 2023; Aubin-Frankowski et al., 2022). Note also that we do not require convexity of X, since the pure Fisher-Rao flow does not change the support of the initial measure, i.e., it evolves only the mass without any transport in the state space.

The key difficulty in dealing with Fisher-Rao flows (3) corresponding to the energy function (23) lies in the construction of the corresponding interacting particle system. In our setting, one can show that the flat derivative of V^σ is given by

(δV^σ/δµ)(µ_t, x) = (δF/δµ)(µ_t, x) + σ log( µ_t(x)/π(x) ) − σ KL(µ_t | π).
(24)

Note that the term −σ KL(µ_t | π) is necessary to ensure that the equation (3) is conservative, i.e., that all measures µ_t are indeed probability measures. However, this makes (24) well-defined only if the measures µ_t are absolutely continuous with respect to π, which creates problems with defining a particle system approximating (3). Indeed, one cannot evaluate the right hand side of (3) at the empirical measure of a corresponding particle system, which makes it necessary to define an auxiliary flow, where the problematic terms are replaced by their counterparts involving convolutions with appropriately defined kernels. This is the rationale behind the introduction of the so-called "kernelized" flows, where instead of the expression in (24), we work with its kernelized version.

3.1 Setting for entropic mean-field optimization

Throughout Section 3, we impose the following standing assumptions on the energy functional F, the mollifier kernel K_ε, and the reference measure π.

Assumption 2. We assume the energy functional F : P(X) → R satisfies the following properties:

(i) F is lower semi-continuous with respect to weak convergence in P_2(X).

(ii) F is bounded from below: there exists F_min ∈ R such that F(m) ≥ F_min for all m ∈ P(X).

(iii) The flat derivative is bounded: there exists a constant C_F > 0 such that for all µ ∈ P(X) and x ∈ X,

| (δF/δµ)(µ, x) | ≤ C_F.

(iv) The flat derivative is Lipschitz: there exists a constant L_F > 0 such that for all µ, ν ∈ P(X) and x, y ∈ X,

| (δF/δµ)(µ, x) − (δF/δµ)(ν, y) | ≤ L_F ( W_2(µ, ν) + |x − y| ).

Note that these assumptions on F are standard in the mean-field optimization literature and, in particular, they are satisfied in mean-field models of neural networks studied in papers such as Hu et al. (2021); Chen et al.
(2023), as well as in mean-field models in policy optimization in reinforcement learning (Leahy et al., 2022; Lascu and Majka, 2025).

Assumption 3. Let ξ : R^d → R_+ be a smooth, Lipschitz (with constant L_ξ), radial probability density function with full support R^d and finite second moment. For ε > 0, define the rescaled function on the compact space X by

ξ_ε(x) := (1/C_{ε,d}) ε^{−d} ξ(x/ε),

where C_{ε,d} is the normalization constant ensuring that ∫_X ξ_ε(x) dx = 1. In particular, ξ_ε is a probability density function on X. The mollifier kernel K_ε on X is then defined as

K_ε(x) := (ξ_ε ∗ ξ_ε)(x).

Assumption 4. We assume the reference density π(x) satisfies the following properties:

(i) π(x) = e^{−U(x)} for some continuous potential function U : X → R.

(ii) π is globally Lipschitz on X: there exists L_π > 0 such that |π(x) − π(y)| ≤ L_π |x − y| for all x, y ∈ X.

Note that under Assumption 4, since π ∝ e^{−U} and X is compact, there exist constants 0 < π_min < π_max < ∞ such that

π_min ≤ π(x) ≤ π_max,   for all x ∈ X.

3.2 Kernelization Strategies

We summarize four kernelization strategies for approximating Fisher-Rao flows corresponding to energy functions (1), based on related strategies that have appeared in the literature in recent years. We will then show in Proposition 3.1 that for each of these approaches, the resulting function a satisfies our Assumption 1 and therefore all our results from Section 2 are applicable. Recall our notation for the flow ∂_t µ_t = −µ_t a(µ_t, ·) and consider the following choices of a that approximate the flat derivative δV^σ/δµ given in (24).

Smoothing only the evolving measure (Lu et al., 2019). Here the kernel is applied to µ_t both inside the logarithm and in the KL divergence. The resulting dynamics read

∂_t µ_t = −µ_t ( (δF/δµ)(µ_t, ·) + σ log( (K_ε ∗ µ_t)/π ) − σ KL(K_ε ∗ µ_t | π) ).
(25)

Smoothing both the evolving and the target measures (Pampel et al., 2023). In this variant, both µ_t and π are mollified by K_ε, which yields

∂_t µ_t = −µ_t ( (δF/δµ)(µ_t, ·) + σ log( (K_ε ∗ µ_t)/(K_ε ∗ π) ) − σ KL(K_ε ∗ µ_t | K_ε ∗ π) ).   (26)

Kernelizing the energy via Lu et al. (2023). Here the kernel is applied already at the level of the energy function, rather than just at the level of the flow; one studies the "true" Fisher-Rao gradient flow corresponding to the modified energy function

V^σ_ε(m) = F(m) + σ ∫ log( (K_ε ∗ m)/π ) m(dx).

The induced dynamics takes the form

∂_t µ_t = −µ_t σ ( (1/σ)(δF/δµ)(µ_t, ·) + log( (K_ε ∗ µ_t)/π ) + K_ε ∗ ( µ_t/(K_ε ∗ µ_t) ) − ∫ log( (K_ε ∗ µ_t)/π ) µ_t(dx) − 1 ).   (27)

Kernelizing the energy via Carrillo et al. (2019). Another choice of a kernelized energy function replaces the KL term by KL(K_ε ∗ m | π), leading to the energy

V^σ_ε(m) = F(m) + σ KL(K_ε ∗ m | π),

whose Fisher-Rao gradient flow is

∂_t µ_t = −µ_t ( (δF/δµ)(µ_t, ·) + σ K_ε ∗ log( (K_ε ∗ µ_t)/π ) − σ KL(K_ε ∗ µ_t | π) ).   (28)

Both (27) and (28) preserve the structure of Fisher-Rao flows, i.e., they are genuine Fisher-Rao gradient flows of the corresponding kernelized energy functions. A notable drawback of (28), however, is the presence of nested convolutions and integrals, e.g., the term K_ε ∗ log( (K_ε ∗ µ_t)/π ), which prevents a full discretization into finite particle sums even when µ_t is an empirical measure, and which in practice entails higher computational cost (see the discussion in (Carrillo et al., 2019, Section 1)).

For all kernelizations defined above, we have the following result.

Proposition 3.1. Suppose Assumptions 2, 3 and 4 hold. Consider the gradient flow ∂_t µ_t = −µ_t a(µ_t, ·) defined via (25), (26), (27) or (28). Then a satisfies Assumption 1.

The proof can be found in Appendix E.
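For a concrete instance of Assumption 3 (an illustration of ours, on the whole line R and ignoring the compact-domain normalization constant C_{ε,d} for simplicity): take ξ to be the standard Gaussian density, so that ξ_ε has standard deviation ε and the mollifier K_ε = ξ_ε ∗ ξ_ε is again Gaussian, with variance 2ε². The sketch below verifies this closed form against a numerical convolution on a grid.

```python
import numpy as np

# Illustration of Assumption 3 with a Gaussian xi, on R (ignoring the
# compact-domain normalization C_{eps,d}): the mollifier
# K_eps = xi_eps * xi_eps is a Gaussian with variance 2 eps^2.
eps = 0.3
xs = np.linspace(-3.0, 3.0, 1201)
dx = xs[1] - xs[0]

def gaussian(x, s):
    return np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

xi_eps = gaussian(xs, eps)

# numerical convolution (xi_eps * xi_eps)(x) on the grid
K_eps = np.convolve(xi_eps, xi_eps, mode="same") * dx

# compare with the closed form: Gaussian with std sqrt(2)*eps
K_exact = gaussian(xs, np.sqrt(2) * eps)
print(np.max(np.abs(K_eps - K_exact)) < 1e-3)  # agrees up to grid error
```

The symmetric form K_ε = ξ_ε ∗ ξ_ε is what makes K_ε a positive semi-definite kernel, a property that is convenient in the analysis of the kernelized flows.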
3.3 Regularization by χ²-divergence

Even though the relative entropy is the most popular choice of regularizer in mean-field optimization problems, other choices are possible, and our framework is indeed applicable to energy functions more general than (1). To illustrate this, we briefly discuss energy functions regularized by the χ²-divergence (with an appropriate kernelization analogous to kernelization (28) for the entropy-regularized energy). Recall that given a fixed reference measure π ∈ P(X), for any measure m ∈ P(X) absolutely continuous with respect to π, the χ²-divergence is defined by χ²(m | π) = ∫ ( dm/dπ − 1 )² dπ. We consider the functional V^σ_ε(m) := F(m) + σ χ²(K_ε ∗ m | π). Its Fisher-Rao gradient flow is ∂_t µ_t = −µ_t a(µ_t, ·), where

a(m, x) := (δF/δµ)(m, x) + 2σ K_ε ∗ ( (K_ε ∗ m)/π )(x) − 2σ ∫ K_ε ∗ ( (K_ε ∗ m)/π )(z) m(dz).   (29)

We have the following result.

Proposition 3.2. Suppose Assumptions 2, 3 and 4 hold. Then the function a defined by (29) satisfies Assumption 1.

The proof can be found in Appendix E.5.

3.4 Weak-* Convergence of Minimizers under Kernelization

In the context of mean-field optimization problems (1), an important question related to kernelizations of the corresponding gradient flows is whether those kernelized gradient flows are associated with energy functions whose minimizers are close to the minimizers of the original energy function in (1). In this subsection, we partially answer this question for the kernelized Fisher-Rao gradient flow associated with the kernelization in (27), namely the case where the flow corresponds to the energy function

V^σ_ε(m) := F(m) + σ ∫_X log( (K_ε ∗ m)/π ) dm(x).

We study the limiting behaviour of minimizers as ε → 0, and prove that any sequence of minimizers of V^σ_ε admits a subsequence converging in the weak-* topology to a minimizer of the original problem (1).
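As with the entropic kernelizations, the last term of (29) is exactly what enforces the centering condition (5). The sketch below (ours; F = 0 and a Gaussian mollifier on a 1-D grid are placeholder choices) evaluates (29) numerically and confirms that ∫ a(m, x) m(dx) vanishes.

```python
import numpy as np

# Sketch (ours): evaluate the chi^2-kernelized reaction term (29) on a 1-D
# grid with F = 0 and a Gaussian mollifier, and check the centering
# condition (5): \int a(m, x) m(dx) = 0 by construction of the last term.
xs = np.linspace(-3.0, 3.0, 601)
dx = xs[1] - xs[0]
sigma, eps = 0.5, 0.25

def gaussian(x, s):
    return np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def K_conv(f):
    # (K_eps * f)(x) on the grid; K_eps is Gaussian with std sqrt(2)*eps
    K = gaussian(xs, np.sqrt(2) * eps)
    return np.convolve(f, K, mode="same") * dx

m = gaussian(xs, 1.0)
m /= m.sum() * dx                         # normalize on the grid
pi = gaussian(xs, 0.8)                    # reference density, pi > 0

inner = K_conv(m) / pi                    # (K_eps * m) / pi
nonlocal_term = 2 * sigma * K_conv(inner) # 2 sigma K_eps * ((K_eps * m)/pi)
a = nonlocal_term - (nonlocal_term * m).sum() * dx   # subtract \int ... dm

print(abs((a * m).sum() * dx) < 1e-10)    # property (5) holds
```

Note the double convolution: one K_ε produces the smoothed density ratio, the second one (the adjoint of the first in the flat derivative of χ²(K_ε ∗ m | π)) maps it back to a function of x.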
Theorem 3.3 (Weak-* convergence of minimizers). Suppose Assumptions 2, 3 and 4 hold. For each $\varepsilon > 0$, let $\mu_\varepsilon \in \mathcal{P}_2(\mathcal{X})$ be a minimizer of $V^\sigma_\varepsilon$. Then there exists a subsequence $(\varepsilon_k)_{k \geq 1}$ such that $\varepsilon_k \to 0$ as $k \to \infty$, and a measure $\mu \in \mathcal{P}_2(\mathcal{X})$, such that $\mu_{\varepsilon_k} \overset{*}{\rightharpoonup} \mu$ as $k \to \infty$, and $\mu$ is a minimizer of $V^\sigma$ given in (1).

We would like to remark that a version of Theorem 3.3 can also be obtained, under some additional assumptions, on $\mathcal{X} = \mathbb{R}^d$. For the sake of completeness, we discuss the details in Appendix C, even though in order to apply the remaining results in this paper to the energy function (23) we require compactness of $\mathcal{X}$.

4 ALGORITHM CONSTRUCTION

The practical usage of the discussed gradient flows depends on developing implementable algorithms for approximating the corresponding interacting particle systems (19). In our algorithm, we draw the positions of the particles $\tilde{X}^{i,N}$ at the beginning from the initial distributions $\tilde{\mu}^{i,N}_0$, and afterwards they remain constant. Then we apply a time discretization of the equations for the weights $\tilde{w}^{i,N}$ by an Euler scheme, which gives a numerical scheme that updates the weights of the particles at each step. In the algorithm discussed in this section, the dynamics of the weights follows equation (25), for a smooth kernel $K_\varepsilon$.

Combining the theoretical results of the present paper with results guaranteeing convergence of the Fisher–Rao gradient flow to the minimizer $m^{\sigma,*}$ of (1) as $t \to \infty$ (see, e.g., Liu et al. (2023)) suggests that, for a large number of particles, a large number of iterations and a small $\varepsilon$, the output of our algorithm should provide a reasonable approximation of $m^{\sigma,*}$. However, a full quantitative analysis of the resulting approximation error remains a challenging open problem for future research.
Note that the full analysis would need to take into account the following four errors: the error between the continuous flow and the target (due to running the algorithm for a finite time), the error between the "correct" flow and the kernelized flow (due to the use of the kernel $K_\varepsilon$), the error between the continuous particle system and the continuous mean-field flow (due to using a finite number of particles), and the discretization error for the particle system (due to a positive time step).

Algorithm: Fisher–Rao gradient descent
Input: particles $X^i \sim \mu^i_0$ with uniform weights $w^i_0 = 1$, for $i = 1, \ldots, N$; time step $\Delta t$; number of iterations $J$.
Steps (weight updates):
for $j = 1 : J$ do
  $\tilde{\mu}^N_{j-1} = \frac{1}{N} \sum_{l=1}^N w^l_{j-1} \delta_{X^l}$
  for $i = 1 : N$ do
    $\tilde{V}^i_j := \frac{\delta F}{\delta \mu}(\tilde{\mu}^N_{j-1}, X^i) + \sigma \log\Big( \frac{1}{N} \sum_{l=1}^N w^l_{j-1} K_\varepsilon(X^i - X^l) \Big) - \sigma \log \pi(X^i) - \sigma \frac{1}{N} \sum_{k=1}^N w^k_{j-1} \Big[ \log\Big( \frac{1}{N} \sum_{l=1}^N w^l_{j-1} K_\varepsilon(X^k - X^l) \Big) - \log \pi(X^k) \Big]$
    $\hat{w}^i_j = w^i_{j-1} \exp\big( -\tilde{V}^i_j \Delta t \big)$
  end for
  $w^i_j = N \hat{w}^i_j / \sum_{l=1}^N \hat{w}^l_j$, for $i = 1, \ldots, N$
end for
Output: the weighted empirical distribution of the particles approximates $m^{\sigma,*}$.

Note that we need the coefficient $N$ in the update for $w^i_j$ since the weights are supposed to add up to $N$ (cf. (19)).

We remark that in our setting it is also possible to construct an algorithm corresponding to the Wasserstein–Fisher–Rao gradient flow (16). This algorithm would have an additional step within the outer loop, corresponding to the diffusive movement of the particles due to the Wasserstein part of the flow (i.e., this step would correspond to the discretization of the SDE in (16)).
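The weight-update scheme above can be implemented in a few lines. The following is a sketch under simplifying assumptions that are ours, not the paper's: one-dimensional particles, a Gaussian kernel $K_\varepsilon$, and a linear energy $F(m) = \int f\, dm$, so that $\frac{\delta F}{\delta \mu}(m, x) = f(x)$.

```python
import numpy as np

def fisher_rao_descent(X, f, log_pi, eps, dt, J, sigma=1.0):
    """Euler scheme for the weights of the kernelized Fisher-Rao flow (25):
    the particles X stay fixed, and at each iteration the weights are updated
    and then renormalised so that they add up to N (cf. (19))."""
    N = len(X)
    w = np.ones(N)
    # precompute the Gaussian kernel matrix K_eps(X_i - X_l)
    K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * eps**2)) \
        / np.sqrt(2 * np.pi * eps**2)
    for _ in range(J):
        # log((1/N) sum_l w_l K_eps(X_i - X_l)) - log pi(X_i), for every i
        log_ratio = np.log(K @ w / N) - log_pi(X)
        # drift V_i, with the entropic part centred under the weighted measure
        V = f(X) + sigma * (log_ratio - np.mean(w * log_ratio))
        w = w * np.exp(-V * dt)
        w = N * w / w.sum()   # the weights are supposed to add up to N
    return w
```

For instance, with $f \equiv 0$, a standard Gaussian target $\pi$ and particles on a uniform grid, the weights should concentrate around the mode of $\pi$, while their sum stays equal to $N$ by construction.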
From a practical point of view, such an algorithm is expected to perform better than the algorithm corresponding to the "pure" Fisher–Rao flow, due to the additional exploration of the state space provided by the diffusive movement (the main drawback of the pure Fisher–Rao flow is that it does not expand the support of the initial distribution of the particles; see also Remark 4.1). However, since our theoretical results cover only the case of the pure Fisher–Rao flow (cf. Remark 2.8), we formulate the algorithm without the diffusion part.

Finally, note that due to the Feynman–Kac formula, the Fisher–Rao flow (4) can be interpreted as a Kolmogorov equation for a stochastic process with killing (see (Applebaum, 2009, Section 6.7.2) or (Karatzas and Shreve, 1991, Theorem 5.7.6 and Exercise 5.7.10)). This leads to an alternative idea for constructing a particle system approximating (4), which, instead of rebalancing weights, uses a killing and replication mechanism with appropriately defined exponential clocks. Such algorithms were used in Lu et al. (2019); Pampel et al. (2023); Rotskoff et al. (2019); however, we were unable to provide a rigorous proof of propagation of chaos for such particle systems in our framework (1), which is why in this paper we work with system (19).

Remark 4.1. As a final remark on the practical applicability of the Fisher–Rao algorithm presented in this section, we would like to stress that the main issue with the pure Fisher–Rao flow is that it is very sensitive to initialization. This is reflected in the theoretical results in papers such as Lu et al. (2019, 2023); Liu et al. (2023), which state that, in order for the FR flow to converge (exponentially) to the target $\mu^*$, the initialization $\mu_0$ has to satisfy a "warm start" condition, i.e., there has to exist a constant $C > 0$ such that, for any $x \in \mathcal{X}$,
\[
\frac{d\mu_0}{d\mu^*}(x) \geq C.
\]
In other words, $\mu_0$ and $\mu^*$ have to be sufficiently similar to each other and, in particular, they need to have matching supports. Since in practice it may be difficult to choose an initialization that guarantees this "warm start" condition, especially in very high-dimensional settings, this may limit the applicability of the pure FR flow.

However, a natural idea for constructing a practically feasible algorithm that utilizes the pure FR flow would be to initially run a different flow to provide an initial exploration of the state space, and then to switch to the pure FR flow. For instance, one could first run a particle system approximating the pure Wasserstein flow (from an arbitrary initialization) for a certain amount of time $t_0 > 0$ that guarantees that there exists $C > 0$ such that, for any $x \in \mathcal{X}$,
\[
\frac{d\mu_{t_0}}{d\mu^*}(x) \geq C,
\]
and then "switch off" the Wasserstein flow and run a particle approximation of the pure FR flow. According to the theory from the papers cited above, a flow like this would converge exponentially to the target (and the corresponding algorithm would be cheaper to run than an algorithm that uses both the Wasserstein and the FR flow at all times). A fully rigorous analysis of the corresponding algorithm would require propagation of chaos results both for pure Wasserstein flows (which have been covered extensively in the literature, cf. the discussion in the introduction) and for pure FR flows (which we provide in the present paper). A full study of such algorithms (and in particular of the question of how to optimally choose $t_0$) will be the topic of our future work.

Acknowledgements

PL acknowledges funding from the Slovenian Research and Innovation Agency (ARIS) under programme No. P1-0448 and Croatian Science Foundation grant no. 2277.

References

Ambrosio, L., Gigli, N., and Savaré, G. (2008). Gradient Flows in Metric Spaces and in the Space of Probability Measures.
Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, 2nd edition.

Applebaum, D. (2009). Lévy processes and stochastic calculus, volume 116 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2nd edition.

Aubin-Frankowski, P.-C., Korba, A., and Léger, F. (2022). Mirror descent with relative smoothness in measure spaces, with application to Sinkhorn and EM. In Advances in Neural Information Processing Systems, volume 35, pages 17263–17275. Curran Associates, Inc.

Carrillo, J. A., Chen, Y., Zhengyu Huang, D., Huang, J., and Wei, D. (2024). Fisher-Rao Gradient Flow: Geodesic Convexity and Functional Inequalities. arXiv e-prints.

Carrillo, J. A., Craig, K., and Patacchini, F. S. (2019). A blob method for diffusion. Calculus of Variations and Partial Differential Equations, 58(2):53.

Carrillo, J. A., McCann, R. J., and Villani, C. (2003). Kinetic equilibration rates for granular media and related equations: entropy dissipation and mass transportation estimates. Rev. Mat. Iberoam., 19(3):971–1018.

Carrillo, J. A., Patacchini, F. S., Sternberg, P., and Wolansky, G. (2016). Convergence of a particle method for diffusive gradient flows in one dimension. SIAM Journal on Mathematical Analysis, 48(6):3708–3741.

Cavil, A. L., Oudjane, N., and Russo, F. (2017). Particle system algorithm and chaos propagation related to non-conservative McKean type stochastic differential equations. Stochastics and Partial Differential Equations: Analysis and Computations, 5(1):1–37.

Chen, F., Lin, Y., Ren, Z., and Wang, S. (2024). Uniform-in-time propagation of chaos for kinetic mean field Langevin dynamics. Electronic Journal of Probability, 29.

Chen, F., Ren, Z., and Wang, S. (2023). Entropic fictitious play for mean field optimization problem. J. Mach. Learn. Res., 24:Paper No. [211], 36.

Chen, F., Ren, Z., and Wang, S. (2025).
Uniform-in-time propagation of chaos for mean field Langevin dynamics. Ann. Inst. Henri Poincaré Probab. Stat., 61(4):2357–2404.

Chizat, L. (2022). Mean-field Langevin dynamics: Exponential convergence and annealing. Transactions on Machine Learning Research.

Delarue, F. and Tse, A. (2025). Uniform in time weak propagation of chaos on the torus. Ann. Inst. Henri Poincaré Probab. Stat., 61(2):1021–1074.

Domingo-Enrich, C., Jelassi, S., Mensch, A., Rotskoff, G., and Bruna, J. (2020). A mean-field analysis of two-player zero-sum games. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 20215–20226. Curran Associates, Inc.

Durmus, A., Eberle, A., Guillin, A., and Zimmer, R. (2020). An elementary approach to uniform in time propagation of chaos. Proc. Am. Math. Soc., 148(12):5387–5398.

Feinberg, E. A., Kasyanov, P. O., and Zadoianchuk, N. V. (2014). Fatou's lemma for weakly converging probabilities. Theory of Probability & Its Applications, 58(4):683–689.

Gallouët, T. O. and Monsaingeon, L. (2017). A JKO splitting scheme for Kantorovich–Fisher–Rao gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1100–1130.

Gu, A. and Kim, J. (2025). Mirror Mean-Field Langevin Dynamics. arXiv e-prints.

Guillin, A. and Monmarché, P. (2021). Uniform long-time and propagation of chaos estimates for mean field kinetic particles in non-convex landscapes. J. Stat. Phys., 185(2):20. Id/No 15.

Hu, K., Ren, Z., Šiška, D., and Szpruch, L. (2021). Mean-field Langevin dynamics and energy landscape of neural networks. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 57(4):2043–2065.

Jordan, R., Kinderlehrer, D., and Otto, F. (1998). The variational formulation of the Fokker-Planck equation. SIAM J. Math. Anal., 29(1):1–17.

Karatzas, I. and Shreve, S. E.
(1991). Brownian motion and stochastic calculus, volume 113 of Graduate Texts in Mathematics. Springer-Verlag, New York, 2nd edition.

Kerimkulov, B., Leahy, J., Šiška, D., Szpruch, L., and Zhang, Y. (2025). A Fisher–Rao gradient flow for entropy-regularised Markov decision processes in Polish spaces. Foundations of Computational Mathematics.

Lacker, D. (2018). Mean field games and interacting particle systems.

Lacker, D. and Le Flem, L. (2023). Sharp uniform-in-time propagation of chaos. Probability Theory and Related Fields, pages 1–38.

Lascu, R.-A. and Majka, M. B. (2025). Non-convex entropic mean-field optimization via Best Response flow. Advances in Neural Information Processing Systems (NeurIPS 2025).

Lascu, R.-A., Majka, M. B., and Szpruch, L. (2024). A Fisher-Rao gradient flow for entropic mean-field min-max games. Transactions on Machine Learning Research.

Lascu, R.-A., Majka, M. B., Šiška, D., and Szpruch, L. (2024). Linear convergence of proximal descent schemes on the Wasserstein space. arXiv e-prints.

Leahy, J.-M., Kerimkulov, B., Šiška, D., and Szpruch, L. (2022). Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 12222–12252. PMLR.

Liero, M., Mielke, A., and Savaré, G. (2018). Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. Invent. Math., 211(3):969–1117.

Liu, L., Majka, M. B., and Szpruch, L. (2023). Polyak–Łojasiewicz inequality on the space of measures and convergence of mean-field birth-death processes. Appl. Math. Optim., 87, 48.

Lu, Y., Lu, J., and Nolen, J. (2019). Accelerating Langevin sampling with birth-death. arXiv e-prints.
Lu, Y., Slepčev, D., and Wang, L. (2023). Birth–death dynamics for sampling: global convergence, approximations and their asymptotics. Nonlinearity, 36(11):5731.

Malrieu, F. (2001). Logarithmic Sobolev inequalities for some nonlinear PDE's. Stochastic Processes Appl., 95(1):109–132.

Mei, S., Montanari, A., and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33).

Monmarché, P. (2017). Long-time behaviour and propagation of chaos for mean field kinetic particles. Stochastic Processes Appl., 127(6):1721–1737.

Monmarché, P., Ren, Z., and Wang, S. (2024). Time-uniform log-Sobolev inequalities and applications to propagation of chaos. Electron. J. Probab., 29:Paper No. 154, 38.

Nitanda, A. (2024). Improved Particle Approximation Error for Mean Field Neural Networks. Advances in Neural Information Processing Systems (NeurIPS 2024).

Nitanda, A., Lee, A., Tan Xing Kai, D., Sakaguchi, M., and Suzuki, T. (2025). Propagation of Chaos for Mean-Field Langevin Dynamics and its Application to Model Ensemble. Forty-second International Conference on Machine Learning (ICML 2025).

Nitanda, A., Wu, D., and Suzuki, T. (2022). Convex analysis of the mean field Langevin dynamics. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I., editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 9741–9757. PMLR.

Pampel, B., Holbach, S., Hartung, L., and Valsson, O. (2023). Sampling rare event energy landscapes via birth-death augmented dynamics. Physical Review E, 107(2).

Rotskoff, G., Jelassi, S., Bruna, J., and Vanden-Eijnden, E. (2019). Neuron birth-death dynamics accelerates gradient descent and converges asymptotically. In Chaudhuri, K.
and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5508–5517. PMLR.

Salim, A., Korba, A., and Luise, G. (2020). The Wasserstein proximal gradient algorithm. In Advances in Neural Information Processing Systems, volume 33, pages 12356–12366. Curran Associates, Inc.

Schuh, K. (2024). Global contractivity for Langevin dynamics with distribution-dependent forces and uniform in time propagation of chaos. Ann. Inst. Henri Poincaré Probab. Stat., 60(2):753–789.

Suzuki, T., Nitanda, A., and Wu, D. (2023). Uniform-in-time propagation of chaos for the mean field gradient Langevin dynamics. The Eleventh International Conference on Learning Representations (ICLR 2023).

Zhu, J.-J. and Mielke, A. (2024). Kernel Approximation of Fisher-Rao Gradient Flows. arXiv e-prints.

Checklist

1. For all models and algorithms presented, check if you include:
(a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes]
(b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Not Applicable]
(c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Not Applicable]
2. For any theoretical claim, check if you include:
(a) Statements of the full set of assumptions of all theoretical results. [Yes]
(b) Complete proofs of all theoretical results. [Yes]
(c) Clear explanations of any assumptions. [Yes]
3. For all figures and tables that present empirical results, check if you include:
(a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Not Applicable]
(b) All the training details (e.g., data splits, hyperparameters, how they were chosen).
[Not Applicable]
(c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Not Applicable]
(d) A description of the computing infrastructure used (e.g., type of GPUs, internal cluster, or cloud provider). [Not Applicable]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:
(a) Citations of the creator if your work uses existing assets. [Not Applicable]
(b) The license information of the assets, if applicable. [Not Applicable]
(c) New assets either in the supplemental material or as a URL, if applicable. [Not Applicable]
(d) Information about consent from data providers/curators. [Not Applicable]
(e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]
5. If you used crowdsourcing or conducted research with human subjects, check if you include:
(a) The full text of instructions given to participants and screenshots. [Not Applicable]
(b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]
(c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]

Table of Contents

1 INTRODUCTION
 1.1 Contributions
2 MAIN RESULTS
 2.1 Preliminaries
 2.2 Existence of mean-field dynamics corresponding to the Fisher-Rao flow
 2.3 Interacting Particle Systems and Propagation of Chaos
 2.4 Discussion of other propagation of chaos results in the literature
3 ENTROPIC MEAN-FIELD PROBLEMS
 3.1 Setting for entropic mean-field optimization
 3.2 Kernelization Strategies
 3.3 Regularization by χ²-divergence
 3.4 Weak-* Convergence of Minimizers under Kernelization
4 ALGORITHM CONSTRUCTION
A WELL-POSEDNESS OF FISHER–RAO FLOWS
B PROPAGATION OF CHAOS FOR THE INTERACTING PARTICLE SYSTEM
C WEAK-* CONVERGENCE OF KERNELIZED MINIMIZERS
 C.1 Existence of minimizers - Compact case
 C.2 Existence of minimizers - Non-compact case
 C.3 Convergence of minimizers (for both cases)
D SUPPLEMENTARY DEFINITIONS AND TECHNICAL RESULTS
 D.1 Flat Derivative
 D.2 Technical Estimates
 D.3 Construction of the Kernel on a compact space
E VERIFICATION OF THE BOUNDEDNESS AND LIPSCHITZ CONDITION
 E.1 For Flow (25)
 E.2 For Flow (26)
 E.3 For Flow (27)
 E.4 For Flow (28)
 E.5 Boundedness and Lipschitz Condition for the Kernelized Fisher-Rao Gradient Flow Regularized by Chi-Square Divergence

Appendix: On Propagation of Chaos for the Fisher-Rao Gradient Flow in Entropic Mean-Field Optimization

A WELL-POSEDNESS OF FISHER–RAO FLOWS

Proof of Lemma 2.5. We aim to show that if $\nu_t$ solves (9), then the projected density
\[
\mu_t(x) := \int_0^\infty w\, \nu_t(x, w)\, dw
\]
satisfies (11) in the weak sense, namely that for all test functions $\varphi \in C_c^\infty(\mathcal{X})$,
\[
\partial_t \int_{\mathcal{X}} \varphi(x)\, \mu_t(x)\, dx = -\int_{\mathcal{X}} \varphi(x)\, a(\mu_t, x)\, \mu_t(x)\, dx.
\]
Since we assumed that $\nu_t$ solves (9) weakly, we know that for all $\psi \in C_c^\infty(\mathcal{X} \times \mathbb{R}_+)$ we have
\[
\partial_t \int_{\mathcal{X} \times \mathbb{R}_+} \psi(x, w)\, \nu_t(x, w)\, dx\, dw = -\int_{\mathcal{X} \times \mathbb{R}_+} \frac{\partial \psi}{\partial w}(x, w)\, \nu_t(x, w)\, w\, a(\mu_t, x)\, dx\, dw.
\]
Thanks to Remark 2.4, we may choose $\psi$ of the form $\psi(x, w) = w\, \varphi(x)$ for $\varphi \in C_c^\infty(\mathcal{X})$. Hence the above equation becomes
\[
\partial_t \int_{\mathcal{X} \times \mathbb{R}_+} w\, \varphi(x)\, \nu_t(x, w)\, dx\, dw = -\int_{\mathcal{X} \times \mathbb{R}_+} \varphi(x)\, \nu_t(x, w)\, w\, a(\mu_t, x)\, dx\, dw.
\]
The left-hand side can be rewritten as
\[
\mathrm{LHS} = \partial_t \int_{\mathcal{X}} \varphi(x) \int_0^\infty w\, \nu_t(x, w)\, dw\, dx = \partial_t \int_{\mathcal{X}} \varphi(x)\, \mu_t(x)\, dx,
\]
and the right-hand side as
\[
\mathrm{RHS} = -\int_{\mathcal{X}} \varphi(x)\, a(\mu_t, x) \left( \int_0^\infty w\, \nu_t(x, w)\, dw \right) dx = -\int_{\mathcal{X}} \varphi(x)\, a(\mu_t, x)\, \mu_t(x)\, dx.
\]
Putting these together, we have shown that
\[
\partial_t \int_{\mathcal{X}} \varphi(x)\, \mu_t(x)\, dx = -\int_{\mathcal{X}} \varphi(x)\, a(\mu_t, x)\, \mu_t(x)\, dx,
\]
which completes the proof.

Proof of Theorem 2.6. Let us take a test function $f \in C_c^\infty(\mathcal{X} \times \mathbb{R}_+)$.
By the chain rule, we have
\[
f(X, w_t) = f(X, w_0) + \int_0^t \frac{\partial f}{\partial w}(X, w_s)\, dw_s = f(X, w_0) - \int_0^t \frac{\partial f}{\partial w}(X, w_s)\, w_s\, a\big( h\mathrm{Law}(X, w_s), X \big)\, ds.
\]
Now, taking expectations on both sides, we get
\[
\int_{\mathcal{X} \times \mathbb{R}_+} f(x, w)\, \mathrm{Law}(X, w_t)(x, w)\, dx\, dw = \int_{\mathcal{X} \times \mathbb{R}_+} f(x, w)\, \mathrm{Law}(X, w_0)(x, w)\, dx\, dw - \int_0^t \int_{\mathcal{X} \times \mathbb{R}_+} \frac{\partial f}{\partial w}(x, w)\, w\, a\big( h\mathrm{Law}(X, w_s), x \big)\, \mathrm{Law}(X, w_s)(x, w)\, dx\, dw\, ds.
\]
For the second term, integrating by parts, we obtain
\[
\int_{\mathcal{X} \times \mathbb{R}_+} f(x, w)\, \mathrm{Law}(X, w_t)(x, w)\, dx\, dw = \int_{\mathcal{X} \times \mathbb{R}_+} f(x, w)\, \mathrm{Law}(X, w_0)(x, w)\, dx\, dw + \int_0^t \int_{\mathcal{X} \times \mathbb{R}_+} f(x, w)\, \frac{\partial}{\partial w}\Big( \mathrm{Law}(X, w_s)(x, w)\, w\, a\big( h\mathrm{Law}(X, w_s), x \big) \Big)\, dx\, dw\, ds,
\]
which shows that $\mathrm{Law}(X, w_t)$ is a weak solution in the sense
\[
\begin{cases} \partial_t \nu_t = \dfrac{\partial}{\partial w}\big( \nu_t(x, w)\, w\, a(h\nu_t, x) \big), \\ \nu_t \rightharpoonup \nu_0 \text{ as } t \to 0. \end{cases} \tag{30}
\]

Proof of Theorem 2.9. We abbreviate $\mathcal{P}_2(C([0, T]; \mathcal{X} \times \mathbb{R}_+))$ by $\mathcal{P}^T_2$ and define the mapping
\[
\Psi : \mathcal{P}^T_2 \to \mathcal{P}^T_2, \qquad \nu \mapsto \mathrm{Law}(X^\nu, w^\nu),
\]
where, for a fixed $\nu \in \mathcal{P}^T_2$,
\[
(X^\nu, w^\nu_0) \sim \nu_0, \qquad dw^\nu_t = -w^\nu_t\, a\big( h\nu_t, X^\nu \big)\, dt, \quad t \leq T. \tag{31}
\]
Here $\nu$ is treated as an external input, independent of the process itself, so (31) is not a McKean–Vlasov equation but an ODE in $(X^\nu, w^\nu_t)$. By the classical existence and uniqueness theory for ODEs with Lipschitz drift, the system admits a unique strong solution whenever the coefficients are globally Lipschitz in $(x, w)$. In our setting, it suffices to check that $(x, w) \mapsto w\, a(h\nu_t, x)$ is globally Lipschitz. Indeed, for any $(x, w), (x', w') \in \mathcal{X} \times \mathbb{R}_+$,
\[
|w\, a(h\nu_t, x) - w'\, a(h\nu_t, x')| \leq |w - w'|\, |a(h\nu_t, x)| + |w'|\, |a(h\nu_t, x) - a(h\nu_t, x')| \leq M_a |w - w'| + e^{M_a T} L_a |x - x'|,
\]
where $M_a$ and $L_a$ denote the uniform bound and the Lipschitz constant of $a$, respectively.
Moreover, the boundedness of $a$ implies, via Proposition D.2, that $w^\nu_t$ is uniformly bounded in $t$, which ensures that the drift in (31) is indeed globally Lipschitz. Furthermore, $\mathbb{E}\big[ \|(X^\nu, w^\nu)\|_T^2 \big] < \infty$, so for all $\nu \in \mathcal{P}^T_2$ the mapping $\Psi(\nu)$ remains in $\mathcal{P}^T_2$.

Now let $\nu, \nu' \in \mathcal{P}^T_2$ be such that $\nu_0 = \nu'_0$. We associate to them the processes $(X^\nu, w^\nu)$ and $(X^{\nu'}, w^{\nu'})$, which share identical initial conditions almost surely: $(X^\nu_0, w^\nu_0) = (X^{\nu'}_0, w^{\nu'}_0)$ a.s. Let $\Psi(\nu)$ and $\Psi(\nu')$ denote the probability measures generated by (31). We then have
\[
\begin{aligned}
\mathbb{E}\big[ \|w^\nu - w^{\nu'}\|_T^2 \big] &\leq T\, \mathbb{E}\left[ \int_0^T \big| w^\nu_s\, a(h\nu_s, X^\nu) - w^{\nu'}_s\, a(h\nu'_s, X^{\nu'}) \big|^2\, ds \right] \\
&\leq 2T\, \mathbb{E}\left[ \int_0^T \big| w^\nu_s\, a(h\nu_s, X^\nu) - w^\nu_s\, a(h\nu'_s, X^{\nu'}) \big|^2\, ds \right] + 2T\, \mathbb{E}\left[ \int_0^T \big| w^\nu_s\, a(h\nu'_s, X^{\nu'}) - w^{\nu'}_s\, a(h\nu'_s, X^{\nu'}) \big|^2\, ds \right] \\
&\leq 2T L_a^2 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_2^2(h\nu_s, h\nu'_s) + \big| X^\nu - X^{\nu'} \big|^2\, ds \right] + 2T M_a^2\, \mathbb{E}\left[ \int_0^T \big\| w^\nu - w^{\nu'} \big\|_s^2\, ds \right].
\end{aligned}
\]
Here the first inequality follows from the Cauchy–Schwarz inequality, the second from the elementary bound $(a+b)^2 \leq 2a^2 + 2b^2$, and the third from the Lipschitz continuity of $a$ together with the boundedness of $w^\nu$ (Proposition D.2) and the uniform bound $|a| \leq M_a$; in the last term we also used the fact that $|f(s)| \leq \|f\|_s$ for any $s \leq T$. Note that we have $\mathbb{E}\big[ |X^\nu_0 - X^{\nu'}_0|^2 \big] = 0$. Combining these together, we have
\[
\mathbb{E}\big[ \|X^\nu - X^{\nu'}\|_T^2 + \|w^\nu - w^{\nu'}\|_T^2 \big] = \mathbb{E}\big[ \|w^\nu - w^{\nu'}\|_T^2 \big] \leq 2T L_a^2 e^{2M_a T} \int_0^T W_2^2(h\nu_s, h\nu'_s)\, ds + 2T M_a^2\, \mathbb{E}\left[ \int_0^T \big\| w^\nu - w^{\nu'} \big\|_s^2\, ds \right].
\]
By applying Grönwall's lemma together with Proposition D.2, and then using $W_2(\nu_s, \nu'_s) \leq W_{2,s}(\nu, \nu')$ for all $s \in [0, T]$, we obtain
\[
\mathbb{E}\big[ \|w^\nu - w^{\nu'}\|_T^2 \big] \leq 2T L_a^2 e^{4M_a T + 2T^2 M_a^2} \int_0^T W_2^2(\nu_s, \nu'_s)\, ds \leq 2T L_a^2 e^{4M_a T + 2T^2 M_a^2} \int_0^T W_{2,s}^2(\nu, \nu')\, ds.
\]
Denoting $C := 2T L_a^2 e^{4M_a T + 2T^2 M_a^2}$, we have
\[
\mathbb{E}\big[ \|X^\nu - X^{\nu'}\|_T^2 + \|w^\nu - w^{\nu'}\|_T^2 \big] \leq C \int_0^T W_{2,s}^2(\nu, \nu')\, ds. \tag{32}
\]
Moreover, the joint law of $\big( (X^\nu, w^\nu), (X^{\nu'}, w^{\nu'}) \big)$ is a coupling between $\Psi(\nu)$ and $\Psi(\nu')$, i.e., it belongs to the set $\Pi(\Psi(\nu), \Psi(\nu'))$. Hence
\[
W_{2,T}^2\big( \Psi(\nu), \Psi(\nu') \big) = \inf_{\pi \in \Pi(\Psi(\nu), \Psi(\nu'))} \mathbb{E}_{((X, w), (X', w')) \sim \pi}\big[ \|X - X'\|_T^2 + \|w - w'\|_T^2 \big] \leq \mathbb{E}\big[ \|X^\nu - X^{\nu'}\|_T^2 + \|w^\nu - w^{\nu'}\|_T^2 \big] \leq C \int_0^T W_{2,s}^2(\nu, \nu')\, ds.
\]
We conclude that
\[
W_{2,T}^2\big( \Psi(\nu), \Psi(\nu') \big) \leq C \int_0^T W_{2,s}^2(\nu, \nu')\, ds. \tag{33}
\]
Hence, iterating (33), for all $\nu \in \mathcal{P}^T_2$ we have
\[
W_{2,T}^2\big( \Psi^{k+1}(\nu), \Psi^k(\nu) \big) \leq C \int_0^T W_{2,s}^2\big( \Psi^k(\nu), \Psi^{k-1}(\nu) \big)\, ds \leq C^k \int_0^T \int_0^s \int_0^{s_1} \cdots \int_0^{s_{k-2}} W_{2,s_{k-1}}^2\big( \Psi(\nu), \nu \big)\, ds_{k-1} \cdots ds_1\, ds \leq \frac{C^k T^k}{k!}\, W_{2,T}^2\big( \Psi(\nu), \nu \big).
\]
As $(\mathcal{P}^T_2, W_{2,T})$ is complete and $(\Psi^k(\nu))_{k \geq 1}$ is a Cauchy sequence, the sequence $\Psi^k(\nu)$ converges to the unique fixed point of $\Psi$.

B PROPAGATION OF CHAOS FOR THE INTERACTING PARTICLE SYSTEM

Remark B.1. For the particle system under consideration, the total weight $\sum_{i=1}^N w^{i,N}_t$ is conserved in time. Indeed, using the property of $a$ that $\int a(m, x)\, m(dx) = 0$ for any probability measure $m$, we compute
\[
\partial_t \sum_{i=1}^N w^{i,N}_t = -\sum_{i=1}^N w^{i,N}_t\, a\big( h\nu^N_t, X^{i,N}_t \big) = -\int a\big( h\nu^N_t, x \big) \sum_{i=1}^N w^{i,N}_t\, \delta_{X^{i,N}_t}(dx) = -\left( \sum_{i=1}^N w^{i,N}_t \right) \int a\big( h\nu^N_t, x \big)\, h\nu^N_t(dx) = 0,
\]
which implies $\sum_{i=1}^N w^{i,N}_t = \sum_{i=1}^N w^{i,N}_0$ for all $t \geq 0$. The same property holds for the tilde system $\tilde{\nu}^N_t$.
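The conservation argument in Remark B.1 is easy to check numerically: for any drift of the centred form $a(m, x) = f(x) - \int f\, dm$ (so that $\int a(m, \cdot)\, dm = 0$ by construction), even an explicit Euler discretization of $dw^{i}_t = -w^{i}_t\, a(h\nu^N_t, X^{i}_t)\, dt$ preserves $\sum_i w^{i}_t = N$, up to floating-point error. This is our toy illustration with $f = \tanh$, not one of the flows from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=N)   # fixed particle positions
w = np.ones(N)           # initial weights, summing to N

def drift(w, X):
    # a(m, x) = f(x) - int f dm with m = (1/N) sum_i w_i delta_{X_i},
    # so that int a(m, x) m(dx) = 0 whenever sum_i w_i = N
    f = np.tanh(X)
    return f - np.mean(w * f)

dt, steps = 0.01, 500
for _ in range(steps):
    a = drift(w, X)
    w = w * (1.0 - dt * a)   # Euler step for dw_i = -w_i a_i dt

total_weight_drift = abs(w.sum() - N)
```

Since $\sum_i w_i\, a_i = \sum_i w_i f_i - \big( \sum_i w_i \big) \frac{1}{N} \sum_j w_j f_j$ vanishes whenever $\sum_i w_i = N$, the Euler step leaves the total weight invariant at every iteration, so `total_weight_drift` stays at floating-point roundoff level.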
In this work, we choose $\sum_{i=1}^N w^{i,N}_0 = N$. This convention ensures that, with the empirical measure
\[
\nu^N_t = \frac{1}{N} \sum_{i=1}^N \delta_{(w^{i,N}_t, X^{i,N}_t)},
\]
its projection
\[
h\nu^N_t = \frac{1}{N} \sum_{i=1}^N w^{i,N}_t\, \delta_{X^{i,N}_t}
\]
is a probability measure. The same reasoning applies to $\tilde{\nu}^N_t$ and $h\tilde{\nu}^N_t$.

We now state an important remark on the convergence result.

Remark B.2. Let $\mathcal{X}$ be a separable metric space and $(X^i)_{i \geq 1}$ be i.i.d. $\mathcal{X}$-valued random variables with law $\mu$. Let $\mu^N := \frac{1}{N} \sum_{i=1}^N \delta_{X^i}$ be the empirical measure. If $p \geq 1$ and $\mu \in \mathcal{P}_p(\mathcal{X})$, then $W_p(\mu^N, \mu) \to 0$ almost surely as $N \to \infty$, and moreover $\mathbb{E}\big[ W_p^p(\mu^N, \mu) \big] \to 0$. A proof of this result can be found in (Lacker, 2018, Corollary 2.14).

Proof of Theorem 2.10. Let us rewrite the particle systems (17) and (19) in integral form. For the independent system we have
\[
X^i_t = X^i_0, \qquad w^i_t = w^i_0 - \int_0^t w^i_s\, a\big( h\nu_s, X^i \big)\, ds,
\]
and for the interacting system
\[
\tilde{X}^{i,N}_t = \tilde{X}^{i,N}_0, \qquad \tilde{w}^{i,N}_t = \tilde{w}^{i,N}_0 - \int_0^t \tilde{w}^{i,N}_s\, a\big( h\tilde{\nu}^N_s, \tilde{X}^{i,N} \big)\, ds.
\]
We have
\[
\begin{aligned}
\mathbb{E}\big[ \|w^i - \tilde{w}^{i,N}\|_T^2 \big] &\leq \mathbb{E}\big[ |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] + T\, \mathbb{E}\left[ \int_0^T \big| w^i_s\, a(h\nu_s, X^i) - \tilde{w}^{i,N}_s\, a(h\tilde{\nu}^N_s, \tilde{X}^{i,N}) \big|^2\, ds \right] \\
&\leq \mathbb{E}\big[ |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] + 2T\, \mathbb{E}\left[ \int_0^T \big| w^i_s\, a(h\nu_s, X^i) - w^i_s\, a(h\tilde{\nu}^N_s, \tilde{X}^{i,N}) \big|^2 + \big| w^i_s\, a(h\tilde{\nu}^N_s, \tilde{X}^{i,N}) - \tilde{w}^{i,N}_s\, a(h\tilde{\nu}^N_s, \tilde{X}^{i,N}) \big|^2\, ds \right] \\
&\leq \mathbb{E}\big[ |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] + 2T L_a^2 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_2^2\big( h\nu_s, h\tilde{\nu}^N_s \big) + \big| X^i - \tilde{X}^{i,N} \big|^2\, ds \right] + 2T M_a^2\, \mathbb{E}\left[ \int_0^T \big| w^i_s - \tilde{w}^{i,N}_s \big|^2\, ds \right] \\
&\leq \mathbb{E}\big[ |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] + 2T L_a^2 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_2^2\big( h\nu_s, h\tilde{\nu}^N_s \big)\, ds \right] + \big( 2T M_a^2 + 2T L_a^2 e^{2M_a T} \big)\, \mathbb{E}\left[ \int_0^T \big\| X^i - \tilde{X}^{i,N} \big\|_s^2 + \big\| w^i - \tilde{w}^{i,N} \big\|_s^2\, ds \right],
\end{aligned}
\]
where the first inequality follows from integrating the square of the drift term over $[0, T]$ and applying the Cauchy–Schwarz inequality, the second uses the standard bound $(a+b)^2 \leq 2a^2 + 2b^2$, the third follows from the boundedness of $a$ by $M_a$, its Lipschitz continuity with constant $L_a$, and the uniform bound $|w^i_s| \leq e^{M_a T}$ from Proposition D.2, and the last is obtained by combining like terms and applying the bound $|f(s)| \leq \|f\|_s$. Moreover, since $X^i$ and $\tilde{X}^{i,N}$ are constant in time, their sup-norm difference reduces to the pointwise difference, i.e.,
\[
\mathbb{E}\big[ \|X^i - \tilde{X}^{i,N}\|_T^2 \big] = \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 \big].
\]
Combining these together, we have
\[
\mathbb{E}\big[ \|X^i - \tilde{X}^{i,N}\|_T^2 + \|w^i - \tilde{w}^{i,N}\|_T^2 \big] \leq 2T L_a^2 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_2^2\big( h\nu_s, h\tilde{\nu}^N_s \big)\, ds \right] + \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] + \big( 2T M_a^2 + 2T L_a^2 e^{2M_a T} \big)\, \mathbb{E}\left[ \int_0^T \big\| X^i - \tilde{X}^{i,N} \big\|_s^2 + \big\| w^i - \tilde{w}^{i,N} \big\|_s^2\, ds \right].
\]
Let us denote $C_1 = 2T L_a^2 e^{2M_a T}$ and $C_2 = 2T M_a^2 + 2T L_a^2 e^{2M_a T}$.
Remark that both $C_1$ and $C_2$ are independent of the number of particles $N$. By applying Grönwall's lemma and using the stability estimates from Propositions D.2 and D.3, we obtain
\[
\begin{aligned}
\mathbb{E}\big[ \|X^i - \tilde{X}^{i,N}\|_T^2 + \|w^i - \tilde{w}^{i,N}\|_T^2 \big] &\leq e^{C_2 T} \left( C_1\, \mathbb{E}\left[ \int_0^T W_2^2\big( h\nu_s, h\tilde{\nu}^N_s \big)\, ds \right] + \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] \right) \\
&\leq e^{C_2 T} \left( C_1 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_2^2\big( \nu_s, \tilde{\nu}^N_s \big)\, ds \right] + \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] \right) \\
&\leq e^{C_2 T} \left( C_1 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_{2,s}^2\big( \nu, \tilde{\nu}^N \big)\, ds \right] + \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] \right). \tag{34}
\end{aligned}
\]
We also have
\[
\mathbb{E}\big[ W_{2,T}^2\big( \nu^N, \tilde{\nu}^N \big) \big] \leq \frac{1}{N}\, \mathbb{E}\left[ \sum_{i=1}^N \|X^i - \tilde{X}^{i,N}\|_T^2 + \|w^i - \tilde{w}^{i,N}\|_T^2 \right], \tag{35}
\]
which follows by choosing the specific coupling $\frac{1}{N} \sum_{i=1}^N \delta_{((w^i, X^i), (\tilde{w}^{i,N}, \tilde{X}^{i,N}))}$. Using the triangle inequality, (35) and (34), in that order, we obtain
\[
\begin{aligned}
\mathbb{E}\big[ W_{2,T}^2\big( \nu, \tilde{\nu}^N \big) \big] &\leq 2\, \mathbb{E}\big[ W_{2,T}^2\big( \nu, \nu^N \big) \big] + 2\, \mathbb{E}\big[ W_{2,T}^2\big( \nu^N, \tilde{\nu}^N \big) \big] \\
&\leq 2\, \mathbb{E}\big[ W_{2,T}^2\big( \nu, \nu^N \big) \big] + \frac{2}{N}\, \mathbb{E}\left[ \sum_{i=1}^N \|X^i - \tilde{X}^{i,N}\|_T^2 + \|w^i - \tilde{w}^{i,N}\|_T^2 \right] \\
&\leq 2\, \mathbb{E}\big[ W_{2,T}^2\big( \nu, \nu^N \big) \big] + 2 e^{C_2 T} C_1 e^{2M_a T}\, \mathbb{E}\left[ \int_0^T W_{2,s}^2\big( \nu, \tilde{\nu}^N \big)\, ds \right] + \frac{2 e^{C_2 T}}{N} \sum_{i=1}^N \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big].
\end{aligned}
\]
By applying Grönwall's lemma again, we have
\[
\mathbb{E}\big[ W_{2,T}^2\big( \nu, \tilde{\nu}^N \big) \big] \leq 2\, e^{2 e^{C_2 T} C_1 e^{2M_a T} T} \left( \mathbb{E}\big[ W_{2,T}^2\big( \nu, \nu^N \big) \big] + \frac{e^{C_2 T}}{N} \sum_{i=1}^N \mathbb{E}\big[ |X^i - \tilde{X}^{i,N}|^2 + |w^i_0 - \tilde{w}^{i,N}_0|^2 \big] \right),
\]
which yields the result thanks to Remark B.2 and our assumption (21) on the initial condition.

Proof of Corollary 2.11. By Propositions D.2 and D.3, we have, for all $t \in [0, T]$,
\[
W_2^2\big( \mu_t, \tilde{\mu}^N_t \big) \leq e^{2M_a}\, W_2^2\big( \nu_t, \tilde{\nu}^N_t \big) \leq e^{2M_a}\, W_{2,T}^2\big( \nu, \tilde{\nu}^N \big).
\]
Since the right-hand side does not depend on $t$, taking the supremum over $t \in [0,T]$ yields
\[
\sup_{t\in[0,T]} W_2^2\big(\mu_t, \tilde\mu^N_t\big) \le e^{2M_aT}\,W_{2,T}^2\big(\nu, \tilde\nu^N\big).
\]
Applying Theorem 2.10 then gives the desired result.

C WEAK-* CONVERGENCE OF KERNELIZED MINIMIZERS

In this section, we investigate the variational properties of the kernelized energy
\[
V^\sigma_\varepsilon(m) = F(m) + \sigma \int \log\Big(\frac{K_\varepsilon * m}{\pi}\Big)\,dm,
\]
with particular emphasis on the existence of minimizers and their behaviour as the regularization parameter $\varepsilon \to 0$. We will prove Theorem 3.3 under Assumptions 2, 3 and 4 for compact $\mathcal{X} \subset \mathbb{R}^d$, and also formulate an additional result for $\mathcal{X} = \mathbb{R}^d$. More precisely, for $\mathcal{X} = \mathbb{R}^d$ we need the following growth condition on the potential $U$ (recall that $\pi \propto e^{-U}$).

Assumption 5. We assume that there exist constants $C_0, A_0 \in \mathbb{R}$ and $C_1, A_1 > 0$ such that
\[
C_0 + C_1|x|^2 \le U(x) \le A_0 + A_1|x|^2, \qquad \forall x \in \mathbb{R}^d.
\]

Then we have the following additional result.

Theorem C.1 (Weak-* convergence of minimizers - Non-compact case). Suppose $\mathcal{X} = \mathbb{R}^d$, let Assumptions 2, 3, 4 and 5 hold, and suppose the kernel $\xi$ is Gaussian. Then there exists a minimizer of $V^\sigma_\varepsilon$ in $\mathcal{P}_2(\mathbb{R}^d)$ for all $\varepsilon > 0$. Moreover, if for each $\varepsilon > 0$, $\mu_\varepsilon \in \mathcal{P}_2(\mathbb{R}^d)$ is a minimizer of $V^\sigma_\varepsilon$, then there exists a subsequence $(\varepsilon_k)_{k\ge1}$ with $\varepsilon_k \to 0$ as $k \to \infty$ such that $\mu_{\varepsilon_k} \stackrel{*}{\rightharpoonup} \mu$ as $k \to \infty$, where $\mu$ is a minimizer of $V^\sigma$.

We will split the proof into the compact case (Theorem 3.3) and the non-compact case (Theorem C.1), following Carrillo et al. (2019).

C.1 Existence of minimizers - Compact case

Proposition C.2. Suppose Assumptions 2, 3 and 4 hold. Then for all $\varepsilon > 0$, the energy functional
\[
V^\sigma_\varepsilon(m) := F(m) + \sigma \int_{\mathcal{X}} \log\Big(\frac{K_\varepsilon * m}{\pi}\Big)\,dm
\]
is lower semi-continuous with respect to weak convergence in $\mathcal{P}(\mathcal{X})$.

Proof. Let $(m_n)_{n\in\mathbb{N}} \subset \mathcal{P}(\mathcal{X})$ be a sequence of probability measures converging weakly to $m \in \mathcal{P}(\mathcal{X})$.
We aim to show
\[
\liminf_{n\to\infty}\Big( F(m_n) + \sigma \int \log\Big(\frac{K_\varepsilon * m_n}{\pi}\Big)\,dm_n \Big) \ge F(m) + \sigma \int \log\Big(\frac{K_\varepsilon * m}{\pi}\Big)\,dm.
\]
Since $F$ is lower semi-continuous, and $\pi$ is continuous and bounded from below by assumption, the map $m \mapsto F(m) - \sigma \int \log \pi\,dm$ is lower semi-continuous. Hence it suffices to establish
\[
\liminf_{n\to\infty} \int \log(K_\varepsilon * m_n)\,dm_n \ge \int \log(K_\varepsilon * m)\,dm.
\]
Note that for fixed $\varepsilon > 0$, the function $x \mapsto \log(K_\varepsilon * m_n(x))$ is continuous and bounded from below by $\log K_{\min,\varepsilon}$, where $K_{\min,\varepsilon} := \inf_{x\in\mathcal{X}} K_\varepsilon(x)$. Therefore, we can apply a version of the generalized Fatou lemma (see, e.g., Feinberg et al. (2014)), which yields
\[
\liminf_{n\to\infty} \int \log\Big(\frac{K_\varepsilon * m_n(x)}{K_{\min,\varepsilon}}\Big)\,dm_n(x) \ge \int \liminf_{n\to\infty,\,x'\to x} \log\Big(\frac{K_\varepsilon * m_n(x')}{K_{\min,\varepsilon}}\Big)\,dm(x). \quad (36)
\]
Moreover, for each $x \in \mathcal{X}$, the convolution $K_\varepsilon * m_n(x')$ converges to $K_\varepsilon * m(x)$ as $n \to \infty$ and $x' \to x$, due to the weak convergence and the continuity of the kernel. Therefore,
\[
\liminf_{n\to\infty,\,x'\to x} \log\Big(\frac{K_\varepsilon * m_n(x')}{K_{\min,\varepsilon}}\Big) = \log\Big(\frac{K_\varepsilon * m(x)}{K_{\min,\varepsilon}}\Big). \quad (37)
\]
Combining (36) and (37), we conclude that
\[
\liminf_{n\to\infty} \int \log(K_\varepsilon * m_n)\,dm_n \ge \int \log(K_\varepsilon * m)\,dm,
\]
which proves the lower semi-continuity of $V^\sigma_\varepsilon$.

Proof of Theorem 3.3 (Step 1: Existence of minimizers). For a compact domain $\mathcal{X}$, the argument is straightforward. The energy $V^\sigma_\varepsilon$ is bounded from below. Moreover, any sequence of probability measures (in particular, any minimizing sequence) is automatically tight as a direct consequence of compactness. Indeed, for any $p \ge 1$,
\[
\int_{\mathcal{X}} |x|^p\,dm(x) \le \sup_{x\in\mathcal{X}} |x|^p.
\]
Together with the lower semi-continuity established in Proposition C.2, this implies the existence of a weakly-* convergent subsequence whose limit is a minimizer of $V^\sigma_\varepsilon$.

C.2 Existence of minimizers - Non-compact case

Proposition C.3. Suppose Assumptions 2, 3, 4 and 5 hold and the kernel $\xi$ is Gaussian.
Then for all $\varepsilon > 0$, the energy functional
\[
V^\sigma_\varepsilon(m) := F(m) + \sigma \int_{\mathbb{R}^d} \log\Big(\frac{K_\varepsilon * m}{\pi}\Big)\,dm
\]
is lower semi-continuous with respect to weak convergence in $\mathcal{P}_2(\mathbb{R}^d)$.

Proof. Let $(m_n)_{n\in\mathbb{N}} \subset \mathcal{P}_2(\mathbb{R}^d)$ be a sequence of probability measures converging weakly to some $m \in \mathcal{P}_2(\mathbb{R}^d)$. We aim to show
\[
\liminf_{n\to\infty}\Big( F(m_n) + \sigma \int \log\Big(\frac{K_\varepsilon * m_n}{\pi}\Big)\,dm_n \Big) \ge F(m) + \sigma \int \log\Big(\frac{K_\varepsilon * m}{\pi}\Big)\,dm.
\]
Since $F$ is lower semi-continuous and $\log \pi$ is integrable with respect to any $m \in \mathcal{P}_2(\mathbb{R}^d)$ due to the quadratic growth of $U$, it suffices to establish
\[
\liminf_{n\to\infty} \int \log(K_\varepsilon * m_n)\,dm_n \ge \int \log(K_\varepsilon * m)\,dm.
\]
We would like to apply the generalized Fatou lemma (which holds for non-negative functions). However, unlike in the compact case, when we work on $\mathbb{R}^d$ we do not have a lower bound for the kernel, so we need a different approach to obtain a non-negative function. From (Carrillo et al., 2019, Proposition 3.9, equation (54)) we know that if $K$ is Gaussian, then there exist $x_0 \in \mathbb{R}^d$ and $C_0, C_1 \in \mathbb{R}$ such that for $n$ sufficiently large we have
\[
\log(K_\varepsilon * m_n)(x) \ge C_0|x - x_0|^2 + C_1.
\]
Denoting the left-hand side by $f_n(x)$ and the right-hand side by $q(x)$, we can apply the generalized Fatou lemma to $f_n - q \ge 0$ and obtain
\[
\liminf_{n\to\infty} \int \big(f_n(x) - q(x)\big)\,dm_n(x) \ge \int \liminf_{n\to\infty,\,x'\to x} \big(f_n(x') - q(x')\big)\,dm(x). \quad (38)
\]
Since
\[
\lim_{n\to\infty} \int_{\mathbb{R}^d} \big(-q(x)\big)\,dm_n(x) = \int_{\mathbb{R}^d} \big(-q(x)\big)\,dm(x) = \int_{\mathbb{R}^d} \liminf_{n\to\infty,\,x'\to x} \big(-q(x')\big)\,dm(x),
\]
we can cancel the $q$ terms on both sides and conclude as in the compact case that
\[
\liminf_{n\to\infty} \int \log(K_\varepsilon * m_n)\,dm_n \ge \int \log(K_\varepsilon * m)\,dm,
\]
which proves the lower semi-continuity of $V^\sigma_\varepsilon$.

Lemma C.4 (Carrillo et al., 2016, Lemma 4.1). Suppose $\rho$ is a probability density on $\mathbb{R}^d$ with finite second moment $M_2(\rho) := \int_{\mathbb{R}^d} |x|^2 \rho(x)\,dx$. Then for all $\delta > 0$ we have
\[
\int_{\mathbb{R}^d} \log \rho(x)\,\rho(x)\,dx \ge -\Big(\frac{2\pi}{\delta}\Big)^{d/2} - \delta M_2(\rho).
\]

Proof.
We split the integration domain:
\[
\begin{aligned}
\int_{\mathbb{R}^d} \log \rho(x)\,\rho(x)\,dx
&= \int_{\{\log\rho \le 0\}} \log \rho(x)\,\rho(x)\,dx + \int_{\{\log\rho > 0\}} \log \rho(x)\,\rho(x)\,dx \\
&\ge \int_{\{\log\rho \le 0\}} \log \rho(x)\,\rho(x)\,dx
= -\int_{\{\rho \le 1\}} |\log \rho(x)|\,\rho(x)\,dx \\
&= -\int_{\{\rho \le e^{-\delta|x|^2}\}} |\log \rho(x)|\,\rho(x)\,dx - \int_{\{e^{-\delta|x|^2} \le \rho \le 1\}} |\log \rho(x)|\,\rho(x)\,dx \\
&\ge -\int_{\{\rho \le e^{-\delta|x|^2}\}} \sqrt{\rho(x)}\,dx - \delta \int_{\{e^{-\delta|x|^2} \le \rho \le 1\}} |x|^2\,\rho(x)\,dx,
\end{aligned}
\]
where the last inequality uses the fact that $x|\log x| \le \sqrt{x}$ for $x \in (0,1]$ on the first set, and $|\log\rho(x)| \le \delta|x|^2$ on the second set. For the first term, since $\rho(x) \le e^{-\delta|x|^2}$ there, we estimate
\[
\int_{\{\rho \le e^{-\delta|x|^2}\}} \sqrt{\rho(x)}\,dx \le \int_{\mathbb{R}^d} e^{-\delta|x|^2/2}\,dx = \Big(\frac{2\pi}{\delta}\Big)^{d/2}.
\]
For the second term, note that
\[
\int_{\{e^{-\delta|x|^2} \le \rho \le 1\}} |x|^2\,\rho(x)\,dx \le M_2(\rho).
\]
Combining the estimates,
\[
\int_{\mathbb{R}^d} \log \rho(x)\,\rho(x)\,dx \ge -\Big(\frac{2\pi}{\delta}\Big)^{d/2} - \delta M_2(\rho).
\]

Proposition C.5. Suppose Assumptions 2, 3, 4 and 5 hold. Then for all $\delta > 0$ we have
\[
\frac{1}{\sigma} V^\sigma_\varepsilon(m) \ge \frac{1}{\sigma} F_{\min} + C_0 + C_1 M_2(m) - \Big(\frac{2\pi}{\delta}\Big)^{d/2} - 2\delta\big(M_2(m) + \varepsilon^2 M_2(\xi)\big). \quad (39)
\]

Proof. We begin by rewriting the regularized energy as the sum of three terms:
\[
V^\sigma_\varepsilon(m) = F(m) + \sigma \int \log K_\varepsilon * m(x)\,dm(x) - \sigma \int \log \pi(x)\,dm(x),
\]
and we estimate each term separately. First, thanks to Assumption 2(ii), $F(m) \ge F_{\min}$. Next, for the third term, note that since $\pi(x) = e^{-U(x)}$, we have
\[
-\int \log \pi(x)\,dm(x) = \int U(x)\,dm(x) \ge \int \big(C_0 + C_1|x|^2\big)\,dm(x) = C_0 + C_1 M_2(m).
\]
For the second term, we have for all $\delta > 0$
\[
\begin{aligned}
\int \log(K_\varepsilon * m)(x)\,dm(x)
&= \int \log\big(\xi_\varepsilon * (\xi_\varepsilon * m)\big)(x)\,dm(x)
= \int \log\Big(\int \xi_\varepsilon(y)\,(\xi_\varepsilon * m)(x-y)\,dy\Big)\,dm(x) \\
&\ge \int \Big(\int \xi_\varepsilon(y) \log\big(\xi_\varepsilon * m(x-y)\big)\,dy\Big)\,dm(x)
= \int \xi_\varepsilon * \log(\xi_\varepsilon * m)(x)\,dm(x) \\
&= \int \log(\xi_\varepsilon * m)(x)\,(\xi_\varepsilon * m)(x)\,dx
\ge -\Big(\frac{2\pi}{\delta}\Big)^{d/2} - \delta M_2(\xi_\varepsilon * m),
\end{aligned}
\]
where the first inequality is Jensen's inequality and the last inequality is Lemma C.4.
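The entropy lower bound of Lemma C.4, invoked in the last inequality above, can be checked numerically in dimension one; an illustrative sketch with the standard Gaussian density (so that $M_2(\rho) = 1$ and the exact value of the integral is $-\tfrac{1}{2}\log(2\pi e)$), the integrals approximated by Riemann sums:

```python
import math

def rho(x):
    # standard Gaussian density on R, so d = 1 and M2(rho) = 1
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# Riemann-sum approximations of int rho*log(rho) dx and M2(rho) over [-10, 10]
h = 1e-3
xs = [-10.0 + i * h for i in range(int(20.0 / h) + 1)]
entropy = sum(rho(x) * math.log(rho(x)) for x in xs) * h   # = -0.5*log(2*pi*e) for a Gaussian
m2 = sum(x * x * rho(x) for x in xs) * h

# Lemma C.4 in d = 1: int rho*log(rho) dx >= -(2*pi/delta)^(1/2) - delta*M2(rho)
for delta in (0.5, 1.0, 2.0):
    assert entropy >= -math.sqrt(2.0 * math.pi / delta) - delta * m2
```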
To control the second moment of the convolution, observe that
\[
M_2(\xi_\varepsilon * m) = \int |x|^2\,(\xi_\varepsilon * m)(x)\,dx = \int |x|^2 \int \xi_\varepsilon(x-y)\,dm(y)\,dx = \int\!\!\int |y+z|^2\,\xi_\varepsilon(z)\,dz\,dm(y),
\]
where in the last step we applied the change of variables $z = x - y$. Then, using the inequality $|x+y|^2 \le 2|x|^2 + 2|y|^2$, we have
\[
M_2(\xi_\varepsilon * m) \le 2\int\!\!\int |z|^2\,\xi_\varepsilon(z)\,dz\,dm(y) + 2\int\!\!\int |y|^2\,\xi_\varepsilon(z)\,dz\,dm(y) = 2\int |z|^2\,\xi_\varepsilon(z)\,dz + 2\int |y|^2\,dm(y) \le 2\varepsilon^2 M_2(\xi) + 2M_2(m),
\]
where we used Remark D.5 in the last step. Combining all the above estimates yields (39); in particular, $V^\sigma_\varepsilon(m)$ is bounded from below by a constant depending on $\sigma$, $F_{\min}$, $\delta$, $M_2(\xi)$, and $M_2(m)$.

Proof of Theorem C.1 (Step 1: Existence of minimizers). For the non-compact domain $\mathbb{R}^d$, Proposition C.5 plays a twofold role. First, it ensures that the energy $V^\sigma_\varepsilon$ is bounded from below for all $m \in \mathcal{P}_2(\mathbb{R}^d)$, so that minimizing sequences are well defined. Second, choosing $\delta = C_1/4$ makes the coefficient of $M_2(m)$ positive, yielding the estimate
\[
\frac{1}{\sigma} V^\sigma_\varepsilon(m) \ge \frac{1}{\sigma} F_{\min} + C_0 + \frac{C_1}{2} M_2(m) - \Big(\frac{8\pi}{C_1}\Big)^{d/2} - \frac{C_1}{2}\,\varepsilon^2 M_2(\xi).
\]
As a consequence, any minimizing sequence $(m_n)_n$ has uniformly bounded second moments. By Prokhorov's theorem and the lower semi-continuity of $V^\sigma_\varepsilon$, there exists a weakly-* convergent subsequence whose limit is a minimizer.

C.3 Convergence of minimizers (for both cases)

Lemma C.6. Let $\mathcal{X} \subset \mathbb{R}^d$ be compact and $\xi$ satisfy Assumption 3, or let $\mathcal{X} = \mathbb{R}^d$ and $\xi$ be a Gaussian kernel. Let $(\mu_\varepsilon)_{\varepsilon>0}$ be a sequence in $\mathcal{P}(\mathcal{X})$ such that $\mu_\varepsilon \stackrel{*}{\rightharpoonup} \mu$ as $\varepsilon \to 0$ for some $\mu \in \mathcal{P}(\mathcal{X})$. Then
\[
\xi_\varepsilon * \mu_\varepsilon \stackrel{*}{\rightharpoonup} \mu,
\]
where the convergence $\stackrel{*}{\rightharpoonup}$ is understood in the bounded-Lipschitz sense, i.e. tested against all bounded Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$.

Proof. Let $f$ be a bounded Lipschitz function.
We have
\[
\Big| \int f\,d(\xi_\varepsilon * \mu_\varepsilon) - \int f\,d\mu \Big| \le \Big| \int f\,d(\xi_\varepsilon * \mu_\varepsilon) - \int f\,d\mu_\varepsilon \Big| + \Big| \int f\,d\mu_\varepsilon - \int f\,d\mu \Big|.
\]
The second term converges to 0, and for the first term we have
\[
\begin{aligned}
\Big| \int f\,d(\xi_\varepsilon * \mu_\varepsilon) - \int f\,d\mu_\varepsilon \Big|
&= \Big| \int\!\!\int \big(f(x) - f(y)\big)\,\xi_\varepsilon(x-y)\,dy\,d\mu_\varepsilon(x) \Big| \\
&\le \|\nabla f\|_{L^\infty} \int\!\!\int |x-y|\,\xi_\varepsilon(x-y)\,dy\,d\mu_\varepsilon(x)
= \|\nabla f\|_{L^\infty} \int\!\!\int |z|\,\xi_\varepsilon(z)\,dz\,d\mu_\varepsilon(x)
= \|\nabla f\|_{L^\infty}\,M_1(\xi_\varepsilon).
\end{aligned}
\]
In the last equality, the term $M_1(\xi_\varepsilon)$ can be bounded by $(\varepsilon/C_{\varepsilon,d,L})\,M_1(\xi)$ or by $\varepsilon M_1(\xi)$, depending on the space $\mathcal{X}$ that we work with. Thus the first term also converges to 0.

Theorem C.7. Suppose $F$ is lower semi-continuous. The energy $V^\sigma_\varepsilon$ $\Gamma$-converges to $V^\sigma$ in the following sense: for $(\mu_\varepsilon)_\varepsilon \subset \mathcal{P}(\mathcal{X})$ and $\mu \in \mathcal{P}(\mathcal{X})$,

• if $\mu_\varepsilon \stackrel{*}{\rightharpoonup} \mu$, we have $\liminf_{\varepsilon\to 0} V^\sigma_\varepsilon(\mu_\varepsilon) \ge V^\sigma(\mu)$;

• $\limsup_{\varepsilon\to 0} V^\sigma_\varepsilon(\mu) \le V^\sigma(\mu)$.

Proof. Since for all $\mu$
\[
\int \log(K_\varepsilon * \mu)\,d\mu \ge \int \xi_\varepsilon * \log(\xi_\varepsilon * \mu)\,d\mu = \int \log(\xi_\varepsilon * \mu)\,d(\xi_\varepsilon * \mu),
\]
we have
\[
\liminf_{\varepsilon\to 0} V^\sigma_\varepsilon(\mu_\varepsilon) \ge \liminf_{\varepsilon\to 0}\big( F(\mu_\varepsilon) + \sigma\,\mathrm{KL}(\xi_\varepsilon * \mu_\varepsilon \,|\, \pi)\big) \ge V^\sigma(\mu),
\]
where the last inequality uses the lower semi-continuity of $F$ and of $\mathrm{KL}$, as well as the weak convergence of $\xi_\varepsilon * \mu_\varepsilon$ obtained in Lemma C.6. Hence the first item is proven. The second item holds because
\[
V^\sigma(\mu) - V^\sigma_\varepsilon(\mu) = \sigma\,\mathrm{KL}(\mu \,|\, K_\varepsilon * \mu) \ge 0.
\]

Proof of Theorems 3.3 and C.1 (Step 2: Convergence). For any $\varepsilon > 0$, since $\mu_\varepsilon$ is a minimizer of $V^\sigma_\varepsilon$, we have $V^\sigma_\varepsilon(\mu_\varepsilon) \le V^\sigma_\varepsilon(\nu)$ for all $\nu$. Consequently,
\[
\liminf_{\varepsilon\to 0} V^\sigma_\varepsilon(\mu_\varepsilon) \le \limsup_{\varepsilon\to 0} V^\sigma_\varepsilon(\nu) \le V^\sigma(\nu),
\]
where the last inequality follows from Theorem C.7. Since there exists $\nu$ such that $V^\sigma(\nu) < \infty$, the sequence $(V^\sigma_\varepsilon(\mu_\varepsilon))_{\varepsilon>0}$ is uniformly bounded from above.
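The quantitative mechanism behind Lemma C.6 above (the mollification error is of order $\|\nabla f\|_{L^\infty} M_1(\xi_\varepsilon)$) can be illustrated numerically. A minimal sketch in dimension one, taking $\mu = \delta_1$, $f = \cos$, and, as a simplifying assumption, an untruncated Gaussian mollifier $\xi_\varepsilon$, for which $M_1(\xi_\varepsilon) = \varepsilon\sqrt{2/\pi}$:

```python
import math

eps = 0.3   # mollification parameter

def xi_eps(z):
    # untruncated Gaussian mollifier (simplifying assumption): N(0, eps^2) density
    return math.exp(-z * z / (2.0 * eps * eps)) / (eps * math.sqrt(2.0 * math.pi))

# mu = delta_1 and f = cos (Lipschitz constant 1); xi_eps * mu is the N(1, eps^2) density
a, h = 1.0, 1e-3
xs = [a - 6.0 * eps + i * h for i in range(int(12.0 * eps / h) + 1)]
smoothed = sum(math.cos(x) * xi_eps(x - a) for x in xs) * h   # int f d(xi_eps * mu)
error = abs(smoothed - math.cos(a))                           # |int f d(xi_eps*mu) - int f dmu|

m1 = eps * math.sqrt(2.0 / math.pi)   # M1(xi_eps) for the Gaussian mollifier
assert error <= 1.0 * m1              # Lemma C.6 bound with ||grad f||_inf = 1
```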
Moreover, by the quadratic growth estimate established above (Proposition C.5 with $\delta = C_1/4$), there exist constants $C_0 \in \mathbb{R}$ and $C_1 > 0$, independent of $\varepsilon$, such that
\[
V^\sigma_\varepsilon(\mu_\varepsilon) \ge F_{\min} + \sigma\Big( C_0 + \frac{C_1}{2} M_2(\mu_\varepsilon) - \Big(\frac{8\pi}{C_1}\Big)^{d/2} - \frac{C_1}{2}\,\varepsilon^2 M_2(\xi) \Big).
\]
Since the left-hand side is uniformly bounded from above and the last term is uniformly bounded as $\varepsilon \to 0$, it follows that $(M_2(\mu_\varepsilon))_{\varepsilon>0}$ is uniformly bounded. If $\mathcal{X}$ is compact, tightness of $(\mu_\varepsilon)_{\varepsilon>0}$ is immediate; if we work on $\mathbb{R}^d$, the uniform second-moment bound implies tightness. Therefore, up to a subsequence, $\mu_\varepsilon \rightharpoonup \mu$ weakly for some $\mu \in \mathcal{P}_2(\mathcal{X})$. By the $\Gamma$-convergence, we conclude that for all $\nu$
\[
V^\sigma(\mu) \le \liminf_{\varepsilon\to 0} V^\sigma_\varepsilon(\mu_\varepsilon) \le V^\sigma(\nu),
\]
which shows that $\mu$ is a minimizer of $V^\sigma$.

D SUPPLEMENTARY DEFINITIONS AND TECHNICAL RESULTS

D.1 Flat Derivative

Definition D.1. Fix $q \ge 0$ and let $\mathcal{P}_q(\mathcal{X})$ be the space of probability measures on $\mathcal{X}$ with finite $q$-th moments. A functional $F : \mathcal{P}_q(\mathcal{X}) \to \mathbb{R}$ is said to admit a first-order linear derivative (or a flat derivative) if there exists a functional $\frac{\delta F}{\delta m} : \mathcal{P}_q(\mathcal{X}) \times \mathcal{X} \to \mathbb{R}$ such that

1. for all $a \in \mathcal{X}$, the map $\mathcal{P}_q(\mathcal{X}) \ni m \mapsto \frac{\delta F}{\delta m}(m, a)$ is continuous;

2. for any $\nu \in \mathcal{P}_q(\mathcal{X})$, there exists $C > 0$ such that for all $a \in \mathcal{X}$ we have
\[
\Big| \frac{\delta F}{\delta m}(\nu, a) \Big| \le C\big(1 + |a|^q\big);
\]

3. for all $m, m' \in \mathcal{P}_q(\mathcal{X})$,
\[
F(m') - F(m) = \int_0^1 \int_{\mathcal{X}} \frac{\delta F}{\delta m}\big(m + \lambda(m' - m), a\big)\,(m' - m)(da)\,d\lambda. \quad (40)
\]

The functional $\frac{\delta F}{\delta m}$ is then called the linear (functional) derivative of $F$ on $\mathcal{P}_q(\mathcal{X})$.

D.2 Technical Estimates

Proposition D.2 (Uniform Bounds and Wasserstein Stability of the Projection). Suppose Assumption 6 holds, and let $T > 0$ be a finite time horizon. Then the following statements hold for all $t \in [0,T]$:

(i) Let $w_t$ be a solution to the particle dynamics (13). Then $w_t$ remains uniformly bounded:
\[
e^{-M_a T} \le w_t \le e^{M_a T}.
\]
(41)

(ii) Let $M > 0$ and let $\nu, \nu' \in \mathcal{P}_2(\mathcal{C}([0,T]; \mathcal{X} \times [0,M]))$ be two probability measures on the path space. Then their projected measures satisfy the following Wasserstein stability estimate:
\[
W_2^2(h\nu_t, h\nu'_t) \le M^2\,W_2^2(\nu_t, \nu'_t), \qquad \text{for all } t \in [0,T].
\]
In particular, if $\nu$ and $\nu'$ are the laws of solutions to (13), then due to (41) we obtain
\[
W_2^2(h\nu_t, h\nu'_t) \le e^{2M_a T}\,W_2^2(\nu_t, \nu'_t), \qquad \text{for all } t \in [0,T]. \quad (42)
\]

Proof. The first item follows directly from the boundedness assumption: we have $w_t = e^{-\int_0^t a(h\nu_s, X)\,ds}$, hence
\[
e^{-M_a T} = e^{\int_0^T (-M_a)\,dt} \le w_t \le e^{\int_0^T M_a\,dt} = e^{M_a T}.
\]
For the second item, let $\pi_t$ be any coupling of $\nu_t$ and $\nu'_t$, and let $\widetilde{h}\pi_t$ denote the associated coupling of $h\nu_t$ and $h\nu'_t$. It is easy to check that $\widetilde{h}\pi_t$ is the projection of $\pi_t$ (in contrast to (8), this projection is for the product space), in the sense that for any test function $f$ we have
\[
\int_{\mathcal{X}\times\mathcal{X}} f(x_t, x'_t)\,\widetilde{h}\pi_t(dx_t, dx'_t) = \int_{(\mathcal{X}\times[0,M])\times(\mathcal{X}\times[0,M])} w_t w'_t\,f(x_t, x'_t)\,\pi_t(dx_t, dw_t, dx'_t, dw'_t).
\]
Hence we have
\[
\begin{aligned}
\int_{\mathcal{X}\times\mathcal{X}} |x_t - x'_t|^2\,\widetilde{h}\pi_t(dx_t, dx'_t)
&= \int_{(\mathcal{X}\times[0,M])\times(\mathcal{X}\times[0,M])} w_t w'_t\,|x_t - x'_t|^2\,\pi_t(dx_t, dw_t, dx'_t, dw'_t) \\
&\le M^2 \int_{(\mathcal{X}\times[0,M])\times(\mathcal{X}\times[0,M])} |x_t - x'_t|^2\,\pi_t(dx_t, dw_t, dx'_t, dw'_t) \\
&\le M^2 \int_{(\mathcal{X}\times[0,M])\times(\mathcal{X}\times[0,M])} \big(|x_t - x'_t|^2 + |w_t - w'_t|^2\big)\,\pi_t(dx_t, dw_t, dx'_t, dw'_t),
\end{aligned}
\]
which yields the result upon taking the infimum over couplings $\pi_t$ on both sides.

Proposition D.3. Let $\mu, \mu' \in \mathcal{P}_2(\mathcal{C}([0,T]; \mathcal{X}))$. Then, for any $t \in [0,T]$,
\[
W_2(\mu_t, \mu'_t) \le W_{2,t}(\mu, \mu').
\]

Proof. Let $\Pi$ be a coupling of $\mu$ and $\mu'$, and for a fixed $t \in [0,T]$ consider the projection map $\Gamma_t : \mathcal{C}([0,T]; \mathcal{X}) \to \mathcal{X}$ given by $\Gamma_t(\gamma) := \gamma_t$. Let $\pi_t$ be the push-forward of $\Pi$ under $\Gamma_t$.
Then
\[
\int_{\mathcal{X}\times\mathcal{X}} |x - y|^2\,\pi_t(dx, dy) = \int_{\mathcal{C}([0,T];\mathcal{X})\times\mathcal{C}([0,T];\mathcal{X})} |\gamma_t - \gamma'_t|^2\,\Pi(d\gamma, d\gamma').
\]
Note that $\pi_t$ constructed this way is a coupling of $\mu_t = \Gamma_t(\mu)$ and $\mu'_t = \Gamma_t(\mu')$ and hence, using $|\gamma_t - \gamma'_t| \le \sup_{s\in[0,t]} |\gamma_s - \gamma'_s|$ and then taking the infimum over $\Pi$ on the right-hand side, we get
\[
W_2^2(\mu_t, \mu'_t) \le \int_{\mathcal{X}\times\mathcal{X}} |x - y|^2\,\pi_t(dx, dy) \le W_{2,t}^2(\mu, \mu').
\]

D.3 Construction of the Kernel on a Compact Space

Our example is based on the standard Gaussian kernel, truncated and renormalized on a compact space. Let $\xi$ denote the standard Gaussian density on $\mathbb{R}^d$:
\[
\xi(z) := \frac{1}{(2\pi)^{d/2}} \exp\Big(-\frac{|z|^2}{2}\Big).
\]
We then define the mapping $\xi_\varepsilon : \mathcal{X} \to \mathbb{R}_+$,
\[
\xi_\varepsilon(x) := \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \xi\Big(\frac{x}{\varepsilon}\Big),
\]
where $C_{\varepsilon,d,L}$ is the normalization constant ensuring $\int_{\mathcal{X}} \xi_\varepsilon(x)\,dx = 1$. We define
\[
C_{\varepsilon,d,L} := \int_{\mathcal{X}} \frac{1}{\varepsilon^d}\,\xi\Big(\frac{x}{\varepsilon}\Big)\,dx
\quad\text{and we have}\quad
C_{\varepsilon,d,L} = \Big(2\Phi\Big(\frac{L}{\varepsilon}\Big) - 1\Big)^d,
\]
where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. As $\varepsilon \to 0$, we observe that $C_{\varepsilon,d,L} \nearrow 1$. Thus, if we restrict $\varepsilon$ to a bounded interval $(0, \varepsilon_{\max}]$, then
\[
C_{\varepsilon,d,L} \in \bigg[\Big(2\Phi\Big(\frac{L}{\varepsilon_{\max}}\Big) - 1\Big)^d,\; 1\bigg). \quad (43)
\]
We now examine the properties of $K_\varepsilon$.

1. Normalization. By construction, $\xi_\varepsilon$ is a probability density function on $\mathcal{X}$. Consequently, its convolution $K_\varepsilon = \xi_\varepsilon * \xi_\varepsilon$ is also a probability density function on $\mathcal{X}$.

2. Upper and lower bounds. Since $\xi$ is smooth and strictly positive, so is $\xi_\varepsilon$. Moreover, since
\[
\xi_\varepsilon(x) = \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \frac{1}{(2\pi)^{d/2}} \exp\Big(-\frac{|x|^2}{2\varepsilon^2}\Big),
\]
we obtain the following pointwise bounds for $x \in \mathcal{X}$:
\[
\sup_{x\in\mathcal{X}} \xi_\varepsilon(x) \le \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \frac{1}{(2\pi)^{d/2}},
\qquad
\inf_{x\in\mathcal{X}} \xi_\varepsilon(x) \ge \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \frac{1}{(2\pi)^{d/2}} \cdot \exp\Big(-\frac{dL^2}{2\varepsilon^2}\Big).
\]
Since $K_\varepsilon = \xi_\varepsilon * \xi_\varepsilon$, bounds of the same type hold for $K_\varepsilon$. In particular, we may introduce the constants
\[
K_{\max,\varepsilon} := \sup_{x\in\mathcal{X}} K_\varepsilon(x), \qquad K_{\min,\varepsilon} := \inf_{x\in\mathcal{X}} K_\varepsilon(x),
\]
which are strictly positive and finite.
Note that $K_{\max,\varepsilon}$ and $K_{\min,\varepsilon}$ depend on $(\varepsilon, d, L)$, but for simplicity of notation we suppress part of this dependence.

3. Lipschitz continuity. Since $\xi$ is smooth, we have
\[
|\nabla \xi_\varepsilon(x)| = \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \frac{1}{(2\pi)^{d/2}} \Big| \nabla \exp\Big(-\frac{|x|^2}{2\varepsilon^2}\Big) \Big|
= \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^d} \cdot \frac{1}{(2\pi)^{d/2}} \cdot \frac{|x|}{\varepsilon^2} \exp\Big(-\frac{|x|^2}{2\varepsilon^2}\Big)
\le \frac{1}{C_{\varepsilon,d,L}} \cdot \frac{1}{\varepsilon^{d+1}} \cdot \frac{1}{(2\pi)^{d/2}}\,e^{-1/2},
\]
where the last inequality follows from the fact that $r e^{-r^2/2} \le e^{-1/2}$ for all $r \ge 0$. Therefore, $\xi_\varepsilon$ is globally Lipschitz with a constant of order $O(\varepsilon^{-(d+1)})$. Moreover, since $\nabla K_\varepsilon = (\nabla \xi_\varepsilon) * \xi_\varepsilon$, we obtain
\[
\|\nabla K_\varepsilon\|_\infty \le \|\nabla \xi_\varepsilon\|_\infty\,\|\xi_\varepsilon\|_{L^1} = \|\nabla \xi_\varepsilon\|_\infty,
\]
which shows that $K_\varepsilon$ inherits the same Lipschitz constant. Hence $K_\varepsilon$ is Lipschitz continuous on $\mathcal{X}$ with a constant of order $O(\varepsilon^{-(d+1)})$.

Remark D.4. The kernel $K_\varepsilon$ defined above is a smooth function on $\mathcal{X}$, bounded from below and above, and globally Lipschitz. More precisely, there exist positive constants $0 < K_{\min,\varepsilon} \le K_{\max,\varepsilon} < \infty$ and $L_{K_\varepsilon} < \infty$ such that
\[
K_{\min,\varepsilon} \le K_\varepsilon(x) \le K_{\max,\varepsilon} \quad (44)
\]
and
\[
|K_\varepsilon(x) - K_\varepsilon(y)| \le L_{K_\varepsilon} |x - y|, \qquad \forall x, y \in \mathcal{X}.
\]
The values of $K_{\min,\varepsilon}$, $K_{\max,\varepsilon}$, and $L_{K_\varepsilon}$ depend on the dimension $d$, the diameter of $\mathcal{X}$, the parameter $\varepsilon$, and the choice of the base function $\xi$.

Remark D.5. Let $p \ge 0$ and let $\xi : \mathbb{R}^d \to [0,\infty)$ be an integrable kernel with $M_p(\xi) := \int_{\mathbb{R}^d} |u|^p \xi(u)\,du < \infty$. Let $\xi_\varepsilon$ be as in Assumption 3. Then the $p$-th moment of $\xi_\varepsilon$ satisfies
\[
M_p(\xi_\varepsilon) := \int_{\mathcal{X}} |x|^p\,\xi_\varepsilon(x)\,dx = \frac{\varepsilon^p}{C_{\varepsilon,d,L}} \int_{[-L/\varepsilon, L/\varepsilon]^d} |u|^p\,\xi(u)\,du \le \frac{\varepsilon^p}{C_{\varepsilon,d,L}}\,M_p(\xi).
\]

E VERIFICATION OF THE BOUNDEDNESS AND LIPSCHITZ CONDITIONS

The main results above, namely existence and uniqueness (Theorem 2.9) and propagation of chaos (Theorem 2.10), require that the drift $a(m, x)$ be uniformly bounded and Lipschitz continuous in both the spatial variable $x$ and the measure variable $m$. We prove that these properties hold for a broad class of kernel-based drifts $a$.
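The explicit constants from the construction in Appendix D.3 above can be checked numerically in dimension $d = 1$; an illustrative sketch comparing the closed form $C_{\varepsilon,d,L} = (2\Phi(L/\varepsilon)-1)^d$ with a direct Riemann sum, verifying the monotone convergence $C_{\varepsilon,d,L} \nearrow 1$, and checking the elementary bound $r e^{-r^2/2} \le e^{-1/2}$ behind the Lipschitz constant:

```python
import math

L, eps, d = 1.0, 0.5, 1   # X = [-L, L]^d with d = 1 (placeholder values)

def Phi(t):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# closed form of the normalization constant C_{eps,d,L}
C_closed = (2.0 * Phi(L / eps) - 1.0) ** d

# direct Riemann sum of int_X eps^{-d} xi(x/eps) dx
h = 1e-4
xs = [-L + i * h for i in range(int(2.0 * L / h) + 1)]
C_num = sum(math.exp(-(x / eps) ** 2 / 2.0) / (eps * math.sqrt(2.0 * math.pi)) for x in xs) * h
assert abs(C_closed - C_num) < 1e-3

# C_{eps,d,L} increases towards 1 as eps decreases
vals = [(2.0 * Phi(L / e) - 1.0) ** d for e in (0.5, 0.3, 0.2)]
assert vals[0] < vals[1] < vals[2] < 1.0

# elementary bound behind the Lipschitz estimate: r*exp(-r^2/2) <= exp(-1/2)
grid = [0.001 * k for k in range(10_000)]
assert max(r * math.exp(-r * r / 2.0) for r in grid) <= math.exp(-0.5) + 1e-12
```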
The resulting bounds are explicit and depend only on the smoothing kernel, the reference density, and the diameter of $\mathcal{X}$.

Proposition E.1. Let $a(m, x)$ be given by any of the kernelized forms (25)–(28). Suppose Assumptions 2, 3, and 4 hold. Then $a(m, x)$ satisfies the boundedness and Lipschitz conditions (6) and (7).

The Lipschitz continuity and boundedness conditions can also be extended to a Fisher–Rao gradient flow for an energy functional with the kernelized chi-square divergence as a regularizer.

Proposition E.2. Consider the energy functional
\[
F(m) + \sigma \int \Big(\frac{K_\varepsilon * m}{\pi} - 1\Big)^2 \pi(x)\,dx.
\]
Its corresponding Fisher–Rao gradient flow takes the form (4), where
\[
a(m, x) = \frac{\delta F}{\delta\mu}(m, x) + 2\sigma\,K_\varepsilon * \Big(\frac{K_\varepsilon * m}{\pi}\Big)(x) - 2\sigma \int K_\varepsilon * \Big(\frac{K_\varepsilon * m}{\pi}\Big)(y)\,m(dy).
\]
Suppose that Assumptions 2, 3, and 4 hold. Then $a(m, x)$ satisfies both the boundedness condition (6) and the Lipschitz continuity condition (7).

Proposition E.3. For the boundedness condition, we have the following estimates (uniform in $\mu \in \mathcal{P}(\mathcal{X})$ and $x \in \mathcal{X}$):

(1) $\big| \frac{\delta F}{\delta\mu}(\mu, x) \big| \le C_F$.

(2) $\big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} \big| \le \max\big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \big)$.

(3) $\mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) \le \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}$.

(4) $\big| \log \frac{K_\varepsilon * \mu(x)}{K_\varepsilon * \pi(x)} \big| \le \max\big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \big)$.

(5) $\mathrm{KL}(K_\varepsilon * \mu \,|\, K_\varepsilon * \pi) \le \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}$.

(6) $K_\varepsilon * \big( \frac{\mu}{K_\varepsilon * \mu} \big)(x) \le \frac{K_{\max,\varepsilon}}{K_{\min,\varepsilon}}$.

(7) $\frac{1}{(K_\varepsilon * \mu)(x)} \le \frac{1}{K_{\min,\varepsilon}}$.

Proof. E.3(1) follows from Assumption 2. For E.3(2), E.3(3), E.3(4), and E.3(5), we use
\[
\frac{K_{\min,\varepsilon}}{\pi_{\max}} \le \frac{K_\varepsilon * \mu(x)}{\pi(x)} \le \frac{K_{\max,\varepsilon}}{\pi_{\min}}, \qquad \forall x \in \mathcal{X},
\]
which follows from Assumptions 3 and 4. For E.3(6), we compute
\[
K_\varepsilon * \Big(\frac{\mu}{K_\varepsilon * \mu}\Big)(x) = \int K_\varepsilon(x - y)\,\Big(\frac{\mu}{K_\varepsilon * \mu}\Big)(y)\,dy = \int \frac{K_\varepsilon(x - y)}{K_\varepsilon * \mu(y)}\,\mu(y)\,dy \le \frac{K_{\max,\varepsilon}}{K_{\min,\varepsilon}}.
\]
Finally, E.3(7) is immediate from $(K_\varepsilon * \mu)(x) \ge K_{\min,\varepsilon}$.

Proposition E.4.
For all $\mu, \nu \in \mathcal{P}_2(\mathcal{X})$ and $x, y \in \mathcal{X}$, the following quantitative bounds hold:

(1) $\big| \frac{\delta F}{\delta\mu}(\mu, x) - \frac{\delta F}{\delta\mu}(\nu, y) \big| \le L_F\big( |x - y| + W_2(\mu, \nu) \big)$.

(2) $| \log K_\varepsilon * \mu(x) - \log K_\varepsilon * \nu(x) | \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu)$.

(3) $| \log K_\varepsilon * \nu(x) - \log K_\varepsilon * \nu(y) | \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,|x - y|$.

(4) $| \log K_\varepsilon * \pi(x) - \log K_\varepsilon * \pi(y) | \le \frac{L_\pi}{\pi_{\min}}\,|x - y|$.

(5) $\big| \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \big| \le \big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \big) |x - y|$.

(6) $\big| K_\varepsilon * \big( \log \frac{K_\varepsilon * \nu}{\pi} \big)(x) - K_\varepsilon * \big( \log \frac{K_\varepsilon * \nu}{\pi} \big)(y) \big| \le \big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \big) |x - y|$.

(7) $\big| K_\varepsilon * \big( \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \big)(x) - K_\varepsilon * \big( \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \big)(y) \big| \le \big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \big) |x - y|$.

(8) $| \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, \pi) | \le \big( \frac{2L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \big)\,W_1(\mu, \nu)$.

(9) $\big| \frac{1}{(K_\varepsilon * \mu)(x)} - \frac{1}{(K_\varepsilon * \mu)(y)} \big| \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}^2}\,|x - y|$.

Proof. For (E.4(1)), we use Assumption 2. For (E.4(2)),
\[
| \log(K_\varepsilon * \mu)(x) - \log(K_\varepsilon * \nu)(x) | \le \frac{1}{K_{\min,\varepsilon}} | (K_\varepsilon * \mu)(x) - (K_\varepsilon * \nu)(x) | = \frac{1}{K_{\min,\varepsilon}} \Big| \int K_\varepsilon(x - z)\,(\mu - \nu)(dz) \Big| \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu),
\]
where the last step uses the Kantorovich–Rubinstein duality together with the fact that $z \mapsto K_\varepsilon(x - z)$ is $L_{K_\varepsilon}$-Lipschitz. For (E.4(3)), use that $\log$ is $1/K_{\min,\varepsilon}$-Lipschitz on $[K_{\min,\varepsilon}, \infty)$ and that $x \mapsto (K_\varepsilon * \nu)(x)$ is $L_{K_\varepsilon}$-Lipschitz. Hence
\[
| \log K_\varepsilon * \nu(x) - \log K_\varepsilon * \nu(y) | \le \frac{1}{K_{\min,\varepsilon}} | K_\varepsilon * \nu(x) - K_\varepsilon * \nu(y) | = \frac{1}{K_{\min,\varepsilon}} \Big| \int \big( K_\varepsilon(z - x) - K_\varepsilon(z - y) \big)\,\nu(dz) \Big| \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} |x - y|.
\]
For (E.4(4)), again we use that $\log$ is $1/\pi_{\min}$-Lipschitz on $[\pi_{\min}, \infty)$ and that $\pi$ is $L_\pi$-Lipschitz. Hence
\[
| \log K_\varepsilon * \pi(x) - \log K_\varepsilon * \pi(y) | \le \frac{1}{\pi_{\min}} | K_\varepsilon * \pi(x) - K_\varepsilon * \pi(y) | \le \frac{1}{\pi_{\min}} \int K_\varepsilon(z)\,| \pi(x - z) - \pi(y - z) |\,dz \le \frac{L_\pi}{\pi_{\min}} |x - y|.
\]
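Bounds such as (E.4(2)) can be sanity-checked numerically for point masses, for which $W_1(\delta_a, \delta_b) = |a - b|$ and $(K_\varepsilon * \delta_a)(x) = K_\varepsilon(x - a)$. An illustrative sketch in dimension one, using (as a simplifying assumption) an untruncated Gaussian in place of $K_\varepsilon$ and estimating $K_{\min,\varepsilon}$ and $L_{K_\varepsilon}$ on a grid:

```python
import math

eps = 0.5

def K(z):
    # untruncated Gaussian standing in for K_eps (simplifying assumption)
    return math.exp(-z * z / (2.0 * eps * eps)) / (eps * math.sqrt(2.0 * math.pi))

# grid estimates of K_min and of the Lipschitz constant L_K on the relevant range
step = 1e-3
zs = [-2.0 + i * step for i in range(int(4.0 / step) + 1)]
K_min = min(K(z) for z in zs)
L_K = max(abs(K(zs[i + 1]) - K(zs[i])) / step for i in range(len(zs) - 1))

# E.4(2) with mu = delta_a, nu = delta_b, so W1(mu, nu) = |a - b| and
# (K_eps * delta_a)(x) = K(x - a); check the bound on a grid of x in [-1, 1]
a, b = -0.4, 0.7
for j in range(201):
    x = -1.0 + j * 0.01
    lhs = abs(math.log(K(x - a)) - math.log(K(x - b)))
    assert lhs <= (L_K / K_min) * abs(a - b)
```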
For (E.4(5)), again we use the Lipschitz property of $\log$ and obtain
\[
\begin{aligned}
\Big| \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \Big|
&\le | \log K_\varepsilon * \nu(x) - \log K_\varepsilon * \nu(y) | + | \log \pi(x) - \log \pi(y) | \\
&\le \frac{1}{K_{\min,\varepsilon}} | K_\varepsilon * \nu(x) - K_\varepsilon * \nu(y) | + \frac{1}{\pi_{\min}} | \pi(x) - \pi(y) | \\
&= \frac{1}{K_{\min,\varepsilon}} \Big| \int \big( K_\varepsilon(z - x) - K_\varepsilon(z - y) \big)\,\nu(dz) \Big| + \frac{1}{\pi_{\min}} | \pi(x) - \pi(y) |
\le \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big) |x - y|. \quad (45)
\end{aligned}
\]
For (E.4(6)), we use (E.4(5)) and the fact that $\mathrm{Lip}(K_\varepsilon * f) \le \mathrm{Lip}(f)$. For (E.4(7)), we use (E.4(3)), (E.4(4)), and $\mathrm{Lip}(K_\varepsilon * f) \le \mathrm{Lip}(f)$. For (E.4(8)), we have
\[
\begin{aligned}
| \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, \pi) |
&\le \Big| \int \Big( \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \Big)\,(K_\varepsilon * \mu)(x)\,dx \Big|
+ \Big| \int \big( (K_\varepsilon * \mu)(x) - (K_\varepsilon * \nu)(x) \big) \log \frac{K_\varepsilon * \nu(x)}{\pi(x)}\,dx \Big| \\
&\le \int | \log K_\varepsilon * \mu(x) - \log K_\varepsilon * \nu(x) |\,(K_\varepsilon * \mu)(x)\,dx
+ \Big| \int \big( \mu(x) - \nu(x) \big)\,K_\varepsilon * \Big( \log \frac{K_\varepsilon * \nu}{\pi} \Big)(x)\,dx \Big| \\
&\le \Big( \frac{2L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big)\,W_1(\mu, \nu),
\end{aligned}
\]
where in the last line we used (E.4(2)) and (E.4(6)). For (E.4(9)),
\[
\Big| \frac{1}{(K_\varepsilon * \mu)(x)} - \frac{1}{(K_\varepsilon * \mu)(y)} \Big| = \Big| \frac{(K_\varepsilon * \mu)(x) - (K_\varepsilon * \mu)(y)}{(K_\varepsilon * \mu)(x)\,(K_\varepsilon * \mu)(y)} \Big| \le \frac{1}{K_{\min,\varepsilon}^2} | (K_\varepsilon * \mu)(x) - (K_\varepsilon * \mu)(y) | \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} |x - y|.
\]

E.1 For Flow (25)

Define the coefficient associated with the kernelized flow (25) by
\[
a(\mu, x) := \frac{\delta F}{\delta\mu}(\mu, x) + \sigma \Big( \log \frac{K_\varepsilon * \mu}{\pi}(x) - \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) \Big).
\]
We verify that $a$ satisfies both a boundedness estimate and a Lipschitz estimate, which establishes Proposition E.1 for this flow. In particular, under Assumptions 2, 3, and 4, there exist constants $C^1_\varepsilon > 0$ and $L^1_\varepsilon > 0$ (given explicitly below) such that, for all $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $x, y \in \mathcal{X}$,
\[
|a(\mu, x)| \le C^1_\varepsilon, \qquad |a(\mu, x) - a(\nu, y)| \le L^1_\varepsilon\big( W_2(\mu, \nu) + |x - y| \big).
\]

Proof.
Thanks to (E.3(1)) and (E.3(2)) we have
\[
\Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} \Big| \le \max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big)
\quad\text{and}\quad
\mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) \le \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}. \quad (46)
\]
Therefore, the desired boundedness condition holds with the constant
\[
C^1_\varepsilon := C_F + \sigma \max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big) + \sigma \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}.
\]
We now turn to the proof of the Lipschitz condition. By the triangle inequality,
\[
|a(\mu, x) - a(\nu, y)| \le (i) + \sigma\,(ii) + \sigma\,(iii),
\]
where
\[
(i) := \Big| \frac{\delta F}{\delta\mu}(\mu, x) - \frac{\delta F}{\delta\mu}(\nu, y) \Big|, \qquad
(ii) := \Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \Big|, \qquad
(iii) := | \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, \pi) |.
\]
Thanks to (E.4(1)), we have
\[
(i) \le L_F\big( W_2(\mu, \nu) + |x - y| \big). \quad (47)
\]
For the term (ii), by the triangle inequality we have
\[
(ii) \le \Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \Big| + \Big| \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \Big|.
\]
We bound these two terms separately by (E.4(2)) and (E.4(5)), and we get
\[
(ii) \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu) + \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big) |x - y|. \quad (48)
\]
Finally, for the term (iii), thanks to (E.4(8)) we have
\[
(iii) \le \Big( \frac{2L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big)\,W_1(\mu, \nu). \quad (49)
\]
Combining the bounds (47), (48), and (49), and using the fact that $W_1(\mu, \nu) \le W_2(\mu, \nu)$, we obtain the desired Lipschitz estimate with the constant
\[
L^1_\varepsilon := L_F + \sigma \Big( \frac{3L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{2L_\pi}{\pi_{\min}} \Big).
\]

E.2 For Flow (26)

Define the coefficient associated with the kernelized flow (26) by
\[
a(\mu, x) := \frac{\delta F}{\delta\mu}(\mu, x) + \sigma \Big( \Big( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{K_\varepsilon * \pi} \Big)(x) - \mathrm{KL}(K_\varepsilon * \mu \,|\, K_\varepsilon * \pi) \Big).
\]
We verify that $a$ satisfies both a boundedness estimate and a Lipschitz estimate: under Assumptions 2, 3, and 4, there exist constants $C^2_\varepsilon > 0$ and $L^2_\varepsilon > 0$ such that, for all $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $x, y \in \mathcal{X}$,
\[
|a(\mu, x)| \le C^2_\varepsilon, \qquad |a(\mu, x) - a(\nu, y)| \le L^2_\varepsilon\big( W_2(\mu, \nu) + |x - y| \big).
\]
Proof. Thanks to (E.3(4)) and (E.3(5)) we have
\[
\Big| \log \frac{K_\varepsilon * \mu(x)}{K_\varepsilon * \pi(x)} \Big| \le \max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big)
\quad\text{and}\quad
\mathrm{KL}(K_\varepsilon * \mu \,|\, K_\varepsilon * \pi) \le \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}, \quad (50)
\]
and, since convolution with the probability density $K_\varepsilon$ preserves sup-norm bounds, the desired boundedness condition holds with the constant
\[
C^2_\varepsilon := C_F + \sigma \max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big) + \sigma \log \frac{K_{\max,\varepsilon}}{\pi_{\min}}.
\]
We now turn to the proof of the Lipschitz condition. By the triangle inequality,
\[
|a(\mu, x) - a(\nu, y)| \le (i) + \sigma\,(ii) + \sigma\,(iii),
\]
where
\[
(i) := \Big| \frac{\delta F}{\delta\mu}(\mu, x) - \frac{\delta F}{\delta\mu}(\nu, y) \Big|, \qquad
(ii) := \Big| \Big( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{K_\varepsilon * \pi} \Big)(x) - \Big( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(y) \Big|, \qquad
(iii) := | \mathrm{KL}(K_\varepsilon * \mu \,|\, K_\varepsilon * \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, K_\varepsilon * \pi) |.
\]
We show separately that each part is Lipschitz. Thanks to (E.4(1)), we have
\[
(i) \le L_F\big( W_2(\mu, \nu) + |x - y| \big). \quad (51)
\]
For the term (ii), by the triangle inequality and the bound $\|K_\varepsilon * g\|_\infty \le \|g\|_\infty$, we have
\[
(ii) \le \sup_x \Big| \log \frac{K_\varepsilon * \mu(x)}{K_\varepsilon * \pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{K_\varepsilon * \pi(x)} \Big| + \Big| \Big( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(x) - \Big( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(y) \Big|.
\]
For the first term on the right-hand side, thanks to (E.4(2)) we have, uniformly in $x$,
\[
\Big| \log \frac{K_\varepsilon * \mu(x)}{K_\varepsilon * \pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{K_\varepsilon * \pi(x)} \Big| = | \log K_\varepsilon * \mu(x) - \log K_\varepsilon * \nu(x) | \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu).
\]
For the second term on the right-hand side, (E.4(7)) gives
\[
\Big| \Big( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(x) - \Big( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(y) \Big| \le \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big) |x - y|.
\]
Hence
\[
(ii) \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu) + \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big) |x - y|.
\]
(52)

Regarding the term (iii), we have
\[
\begin{aligned}
(iii) &= | \mathrm{KL}(K_\varepsilon * \mu \,|\, K_\varepsilon * \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, K_\varepsilon * \pi) | \\
&\le \Big| \int \Big( \log \frac{K_\varepsilon * \mu(x)}{K_\varepsilon * \pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{K_\varepsilon * \pi(x)} \Big)\,K_\varepsilon * \mu(x)\,dx \Big|
+ \Big| \int \big( K_\varepsilon * \mu(x) - K_\varepsilon * \nu(x) \big) \log \frac{K_\varepsilon * \nu(x)}{K_\varepsilon * \pi(x)}\,dx \Big| \\
&\le \int | \log K_\varepsilon * \mu(x) - \log K_\varepsilon * \nu(x) |\,K_\varepsilon * \mu(x)\,dx
+ \Big| \int \big( \mu(x) - \nu(x) \big)\,K_\varepsilon * \Big( \log \frac{K_\varepsilon * \nu}{K_\varepsilon * \pi} \Big)(x)\,dx \Big| \\
&\le \Big( \frac{2L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big)\,W_1(\mu, \nu), \quad (53)
\end{aligned}
\]
where in the last inequality we used (E.4(2)) and (E.4(7)). Combining (51), (52), and (53) with the fact that $W_1(\mu, \nu) \le W_2(\mu, \nu)$, we conclude that we may take $L^2_\varepsilon = L^1_\varepsilon$, which completes the proof.

E.3 For Flow (27)

Define the coefficient associated with the kernelized flow (27) by
\[
a(\mu, x) := \frac{\delta F}{\delta\mu}(\mu, x) + \sigma \bigg( \log \frac{(K_\varepsilon * \mu)(x)}{\pi(x)} + K_\varepsilon * \Big( \frac{\mu}{K_\varepsilon * \mu} \Big)(x) - \int_{\mathcal{X}} \log \frac{(K_\varepsilon * \mu)(z)}{\pi(z)}\,\mu(dz) - 1 \bigg).
\]
Under Assumptions 2, 3, and 4, there exist constants $C^3_\varepsilon > 0$ and $L^3_\varepsilon > 0$ such that, for all $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $x, y \in \mathcal{X}$,
\[
|a(\mu, x)| \le C^3_\varepsilon, \qquad |a(\mu, x) - a(\nu, y)| \le L^3_\varepsilon\big( W_2(\mu, \nu) + |x - y| \big).
\]

Proof. For the boundedness condition, we have already established uniform bounds for $\frac{\delta F}{\delta\mu}(\mu, x)$ and $\log \frac{K_\varepsilon * \mu(x)}{\pi(x)}$ in (E.3(1)) and (E.3(2)). In particular, we recall that
\[
\Big| \frac{\delta F}{\delta\mu}(\mu, x) \Big| + \Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} \Big| + \Big| \int \log \frac{K_\varepsilon * \mu(x)}{\pi(x)}\,\mu(dx) \Big| \le C_F + 2\max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big).
\]
Moreover, thanks to (E.3(6)), for the positive term $K_\varepsilon * \big( \frac{\mu}{K_\varepsilon * \mu} \big)$ we have
\[
K_\varepsilon * \Big( \frac{\mu}{K_\varepsilon * \mu} \Big)(x) \le \frac{K_{\max,\varepsilon}}{K_{\min,\varepsilon}}.
\]
Hence we conclude that
\[
C^3_\varepsilon = C_F + 2\sigma \max\Big( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \Big) + \sigma \Big( \frac{K_{\max,\varepsilon}}{K_{\min,\varepsilon}} + 1 \Big).
\]
We now turn to the proof of the Lipschitz condition.
By the triangle inequality,
\[
|a(\mu, x) - a(\nu, y)| \le (i) + \sigma\,(ii) + \sigma\,(iii) + \sigma\,(iv),
\]
where
\[
(i) := \Big| \frac{\delta F}{\delta\mu}(\mu, x) - \frac{\delta F}{\delta\mu}(\nu, y) \Big|, \qquad
(ii) := \Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \Big|,
\]
\[
(iii) := \Big| K_\varepsilon * \Big( \frac{\mu}{K_\varepsilon * \mu} \Big)(x) - K_\varepsilon * \Big( \frac{\nu}{K_\varepsilon * \nu} \Big)(y) \Big|, \qquad
(iv) := \Big| \int \log \frac{K_\varepsilon * \mu(z)}{\pi(z)}\,\mu(dz) - \int \log \frac{K_\varepsilon * \nu(z)}{\pi(z)}\,\nu(dz) \Big|.
\]
Thanks to (E.4(1)), we have
\[
(i) \le L_F\big( W_2(\mu, \nu) + |x - y| \big). \quad (54)
\]
For the term (ii), by the triangle inequality we have
\[
(ii) \le \Big| \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \Big| + \Big| \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(y)}{\pi(y)} \Big|.
\]
We bound these two terms separately by (E.4(2)) and (E.4(5)), and we get
\[
(ii) \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}}\,W_1(\mu, \nu) + \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \Big) |x - y|. \quad (55)
\]
Now let us consider the term (iii). We write
\[
h_\mu(x) = \frac{1}{(K_\varepsilon * \mu)(x)}, \qquad h_\nu(x) = \frac{1}{(K_\varepsilon * \nu)(x)},
\]
and first treat the dependence on the measure at a fixed point $y$, decomposing
\[
K_\varepsilon * \Big( \frac{\mu}{K_\varepsilon * \mu} \Big)(y) - K_\varepsilon * \Big( \frac{\nu}{K_\varepsilon * \nu} \Big)(y)
= \int K_\varepsilon(y - x)\,h_\mu(x)\,d(\mu - \nu)(x) + \int K_\varepsilon(y - x)\big( h_\mu(x) - h_\nu(x) \big)\,d\nu(x) =: T_1 + T_2.
\]
For $T_1$, set $f_\mu(x) = K_\varepsilon(y - x)\,h_\mu(x)$. Using (E.3(7)) and (E.4(9)) and the product rule for Lipschitz functions,
\[
\mathrm{Lip}(f_\mu) \le \mathrm{Lip}(K_\varepsilon)\,\|h_\mu\|_\infty + \|K_\varepsilon\|_\infty\,\mathrm{Lip}(h_\mu) \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{K_{\max,\varepsilon} L_{K_\varepsilon}}{K_{\min,\varepsilon}^2}.
\]
By the Kantorovich–Rubinstein theorem,
\[
|T_1| = \Big| \int f_\mu\,d(\mu - \nu) \Big| \le \mathrm{Lip}(f_\mu)\,W_1(\mu, \nu) \le \Big( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{K_{\max,\varepsilon} L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} \Big)\,W_1(\mu, \nu).
\]
For $T_2$, first note that
\[
| h_\mu(x) - h_\nu(x) | = \frac{| (K_\varepsilon * (\nu - \mu))(x) |}{(K_\varepsilon * \mu)(x)\,(K_\varepsilon * \nu)(x)} \le \frac{1}{K_{\min,\varepsilon}^2} \sup_z \Big| \int K_\varepsilon(z - w)\,d(\mu - \nu)(w) \Big|.
\]
For each fixed $z$, the function $w \mapsto K_\varepsilon(z - w)$ is $L_{K_\varepsilon}$-Lipschitz. Applying the Kantorovich–Rubinstein theorem gives
\[
\sup_z \Big| \int K_\varepsilon(z - w)\,d(\mu - \nu)(w) \Big| \le L_{K_\varepsilon}\,W_1(\mu, \nu).
\]
Hence $\| h_\mu - h_\nu \|_\infty \le \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} W_1(\mu, \nu)$, and therefore
\[
|T_2| \le \| K_\varepsilon \|_\infty \, \| h_\mu - h_\nu \|_\infty \le \frac{K_{\max,\varepsilon} L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} W_1(\mu, \nu).
\]
Combining the bounds for $T_1$ and $T_2$ yields
\[
(iii) \le \left( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{2 K_{\max,\varepsilon} L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} \right) W_1(\mu, \nu). \tag{56}
\]
Now let us check item $(iv)$. We have
\[
\begin{aligned}
(iv) &\le \left| \int \left( \log \frac{K_\varepsilon * \mu(x)}{\pi(x)} - \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \right) \mu(x) \, dx \right| + \left| \int \big( \mu(x) - \nu(x) \big) \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \, dx \right| \\
&\le \int \big| \log K_\varepsilon * \mu(x) - \log K_\varepsilon * \nu(x) \big| \, \mu(x) \, dx + \left| \int \big( \mu(x) - \nu(x) \big) \log \frac{K_\varepsilon * \nu(x)}{\pi(x)} \, dx \right| \\
&\le \left( \frac{2 L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \right) W_1(\mu, \nu),
\end{aligned}
\tag{57}
\]
where in the last inequality we used (E.4(2)) and (E.4(5)). Combining (54), (55), (56), and (57), we conclude that $a(\mu, x)$ is Lipschitz with constant
\[
L_\varepsilon^3 = L_F + \sigma \left( \frac{4 L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{2 L_\pi}{\pi_{\min}} + \frac{2 K_{\max,\varepsilon} L_{K_\varepsilon}}{K_{\min,\varepsilon}^2} \right).
\]

E.4 For Flow (28)

Define the coefficient associated with the kernelized flow (28) by
\[
a(\mu, x) := \frac{\delta F}{\delta \mu}(\mu, x) + \sigma \left( \left( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{\pi} \right)(x) - \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) \right).
\]
By Proposition E.1, $a$ satisfies both a boundedness estimate and a Lipschitz estimate. In particular, under Assumptions 2, 3, and 4, there exist constants $C_\varepsilon^4 > 0$ and $L_\varepsilon^4 > 0$ such that, for all $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $x, y \in \mathcal{X}$,
\[
|a(\mu, x)| \le C_\varepsilon^4, \qquad |a(\mu, x) - a(\nu, y)| \le L_\varepsilon^4 \big( W_2(\mu, \nu) + |x - y| \big).
\]
Proof. Boundedness. By (E.3(2)) and $\int_{\mathcal{X}} K_\varepsilon = 1$,
\[
\left| \left( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{\pi} \right)(x) \right| = \left| \int K_\varepsilon(x - y) \log \frac{(K_\varepsilon * \mu)(y)}{\pi(y)} \, dy \right| \le \int K_\varepsilon(x - y) \left| \log \frac{(K_\varepsilon * \mu)(y)}{\pi(y)} \right| dy \le \max \left( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \right).
\]
By the same reasoning, $\mathrm{KL}(K_\varepsilon * \mu \,|\, \pi)$ is bounded by the same constant. Since $a$ contains both terms multiplied by $\sigma$, we may take
\[
C_\varepsilon^4 := C_F + 2 \sigma \max \left( \log \frac{\pi_{\max}}{K_{\min,\varepsilon}}, \, \log \frac{K_{\max,\varepsilon}}{\pi_{\min}} \right).
\]
We now turn to the proof of the Lipschitz condition.
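The two boundedness claims for the flow-(28) coefficient can likewise be checked numerically: both the smoothed log-ratio $K_\varepsilon * \log(K_\varepsilon*\mu/\pi)$ and $\mathrm{KL}(K_\varepsilon*\mu\,|\,\pi)$ should stay below $\max(\log(\pi_{\max}/K_{\min,\varepsilon}),\ \log(K_{\max,\varepsilon}/\pi_{\min}))$. The sketch below uses a wrapped Gaussian kernel on a discretized 1-D torus; the kernel, $\varepsilon$, $\pi$, and $\mu$ are illustrative assumptions.

```python
import numpy as np

n = 512
grid = np.linspace(0.0, 1.0, n, endpoint=False)
dx = 1.0 / n

# Wrapped Gaussian kernel, normalized so that the integral of K_eps is 1.
eps = 0.1
d = np.minimum(grid, 1.0 - grid)          # periodic distance to the origin
K = np.exp(-d**2 / (2.0 * eps**2))
K /= K.sum() * dx

def conv(k, g):
    # Exact circular convolution (k * g)(x) = integral of k(x - z) g(z) dz via FFT.
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(g))) * dx

pi = 1.0 + 0.5 * np.cos(2.0 * np.pi * grid)
pi /= pi.sum() * dx
mu = 1.0 + 0.5 * np.sin(2.0 * np.pi * grid)
mu /= mu.sum() * dx

smoothed_mu = conv(K, mu)                 # K_eps * mu (mass is preserved exactly)
log_ratio = np.log(smoothed_mu / pi)

bound = max(np.log(pi.max() / K.min()), np.log(K.max() / pi.min()))
# conv(K, |log_ratio|) dominates |K_eps * log(K_eps*mu / pi)| pointwise.
smoothed_log = conv(K, np.abs(log_ratio))
kl = np.sum(smoothed_mu * log_ratio) * dx  # KL(K_eps*mu | pi), nonnegative
print(kl, smoothed_log.max(), bound)
```

Both quantities sit well below the (fairly crude) constant, and the KL term is nonnegative as expected, since the discrete convolution preserves total mass exactly.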
By the triangle inequality,
\[
|a(\mu, x) - a(\nu, y)| \le (i) + \sigma (ii) + \sigma (iii),
\]
where
\[
\begin{aligned}
(i) &:= \left| \frac{\delta F}{\delta \mu}(\mu, x) - \frac{\delta F}{\delta \mu}(\nu, y) \right|, \\
(ii) &:= \left| \left( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{\pi} \right)(x) - \left( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{\pi} \right)(y) \right|, \\
(iii) &:= \big| \mathrm{KL}(K_\varepsilon * \mu \,|\, \pi) - \mathrm{KL}(K_\varepsilon * \nu \,|\, \pi) \big|.
\end{aligned}
\]
For $(i)$, Assumption 2 yields
\[
(i) \le L_F \big( W_2(\mu, \nu) + |x - y| \big). \tag{58}
\]
For $(ii)$, we split it into a measure part and a space part:
\[
\begin{aligned}
(ii) &\le \left| \left( K_\varepsilon * \log \frac{K_\varepsilon * \mu}{\pi} \right)(x) - \left( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{\pi} \right)(x) \right| + \left| \left( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{\pi} \right)(x) - \left( K_\varepsilon * \log \frac{K_\varepsilon * \nu}{\pi} \right)(y) \right| \\
&\le \int K_\varepsilon(x - z) \left| \log \frac{K_\varepsilon * \mu}{\pi}(z) - \log \frac{K_\varepsilon * \nu}{\pi}(z) \right| dz + \int K_\varepsilon(z) \left| \log \frac{K_\varepsilon * \nu}{\pi}(x - z) - \log \frac{K_\varepsilon * \nu}{\pi}(y - z) \right| dz \\
&\le \left( \frac{L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \right) \big( W_1(\mu, \nu) + |x - y| \big),
\end{aligned}
\tag{59}
\]
where the last inequality follows from (E.4(2)) and (E.4(5)). For $(iii)$, (E.4(8)) gives
\[
(iii) \le \left( \frac{2 L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{L_\pi}{\pi_{\min}} \right) W_1(\mu, \nu). \tag{60}
\]
Combining (58), (59), and (60), we may choose
\[
L_\varepsilon^4 = L_F + \sigma \left( \frac{3 L_{K_\varepsilon}}{K_{\min,\varepsilon}} + \frac{2 L_\pi}{\pi_{\min}} \right).
\]

E.5 Boundedness and Lipschitz Condition for the Kernelized Fisher–Rao Gradient Flow Regularized by the Chi-Square Divergence

Proof. Using the uniform bound on $\frac{\delta F}{\delta \mu}$ and $\pi \ge \pi_{\min}$,
\[
\left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right)(x) \right| = \left| \int K_\varepsilon(x - z) \frac{(K_\varepsilon * \mu)(z)}{\pi(z)} \, dz \right| \le \frac{K_{\max,\varepsilon}}{\pi_{\min}},
\]
and the same bound holds for $\int K_\varepsilon * \big( \frac{K_\varepsilon * \mu}{\pi} \big) \, d\mu$. Hence
\[
|a(\mu, x)| \le C_F + \frac{4 \sigma K_{\max,\varepsilon}}{\pi_{\min}} =: C_\varepsilon^5.
\]
We now analyze the Lipschitz condition. By the triangle inequality,
\[
|a(\mu, x) - a(\nu, y)| \le (i) + 2 \sigma (ii) + 2 \sigma (iii),
\]
where
\[
\begin{aligned}
(i) &:= \left| \frac{\delta F}{\delta \mu}(\mu, x) - \frac{\delta F}{\delta \mu}(\nu, y) \right|, \\
(ii) &:= \left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right)(x) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(y) \right|, \\
(iii) &:= \left| \int K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right) d\mu - \int K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right) d\nu \right|.
\end{aligned}
\]
For $(i)$,
\[
(i) \le L_F \big( W_2(\mu, \nu) + |x - y| \big).
\]
For $(ii)$, split
\[
\left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right)(x) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(x) \right| = \left| \int \phi_x(z) \, d(\mu - \nu)(z) \right|, \quad \text{with} \quad \phi_x(z) := \int \frac{K_\varepsilon(x - w)}{\pi(w)} K_\varepsilon(z - w) \, dw.
\]
Since $z \mapsto K_\varepsilon(z - w)$ is $L_{K_\varepsilon}$-Lipschitz and $\int K_\varepsilon(x - w)/\pi(w) \, dw \le 1/\pi_{\min}$, we have $\mathrm{Lip}(\phi_x) \le \frac{L_{K_\varepsilon}}{\pi_{\min}}$, hence by the Kantorovich–Rubinstein theorem,
\[
\left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right)(x) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(x) \right| \le \frac{L_{K_\varepsilon}}{\pi_{\min}} W_1(\mu, \nu).
\]
Moreover,
\[
\left| K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(x) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(y) \right| \le \frac{L_{K_\varepsilon}}{\pi_{\min}} |x - y|.
\]
Thus
\[
(ii) \le \frac{L_{K_\varepsilon}}{\pi_{\min}} \big( W_1(\mu, \nu) + |x - y| \big).
\]
For $(iii)$, write
\[
(iii) \le \left| \int K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right) d(\mu - \nu) \right| + \int \left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right) \right| d\nu.
\]
Since $\mathrm{Lip}\big( K_\varepsilon * \big( \frac{K_\varepsilon * \mu}{\pi} \big) \big) \le \frac{L_{K_\varepsilon}}{\pi_{\min}}$, the first term is bounded by $\frac{L_{K_\varepsilon}}{\pi_{\min}} W_1(\mu, \nu)$. Furthermore,
\[
\left| K_\varepsilon * \left( \frac{K_\varepsilon * \mu}{\pi} \right)(x) - K_\varepsilon * \left( \frac{K_\varepsilon * \nu}{\pi} \right)(x) \right| \le \left\| \frac{K_\varepsilon * (\mu - \nu)}{\pi} \right\|_\infty \le \frac{L_{K_\varepsilon}}{\pi_{\min}} W_1(\mu, \nu),
\]
hence
\[
(iii) \le \frac{2 L_{K_\varepsilon}}{\pi_{\min}} W_1(\mu, \nu).
\]
Using $W_1(\mu, \nu) \le W_2(\mu, \nu)$ and collecting the bounds gives
\[
|a(\mu, x) - a(\nu, y)| \le \left( L_F + \frac{6 \sigma L_{K_\varepsilon}}{\pi_{\min}} \right) \big( W_2(\mu, \nu) + |x - y| \big).
\]
Therefore the Lipschitz estimate holds with $L_\varepsilon^5 := L_F + \frac{6 \sigma L_{K_\varepsilon}}{\pi_{\min}}$.
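The chi-square boundedness step above, $K_\varepsilon * (K_\varepsilon*\mu/\pi) \le K_{\max,\varepsilon}/\pi_{\min}$, together with the same bound for its integral against $\mu$, can be verified on a discretized example. As before, the wrapped Gaussian kernel on the 1-D torus and the concrete $\pi$ and $\mu$ below are illustrative assumptions only.

```python
import numpy as np

n = 512
grid = np.linspace(0.0, 1.0, n, endpoint=False)
dx = 1.0 / n

# Wrapped Gaussian kernel, normalized so that the integral of K_eps is 1.
eps = 0.05
d = np.minimum(grid, 1.0 - grid)           # periodic distance to the origin
K = np.exp(-d**2 / (2.0 * eps**2))
K /= K.sum() * dx

def conv(k, g):
    # Exact circular convolution (k * g)(x) = integral of k(x - z) g(z) dz via FFT.
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(g))) * dx

pi = 1.0 + 0.5 * np.cos(2.0 * np.pi * grid)
pi /= pi.sum() * dx                        # reference measure, pi >= pi_min > 0
mu = np.exp(np.sin(2.0 * np.pi * grid))
mu /= mu.sum() * dx                        # probability density

# The double-smoothed chi-square term K_eps * (K_eps*mu / pi).
smoothed = conv(K, conv(K, mu) / pi)
inner = np.sum(smoothed * mu) * dx         # its integral against mu
bound = K.max() / pi.min()                 # the constant from the proof
print(smoothed.max(), inner, bound)
```

Since convolution with the normalized kernel preserves total mass, both the pointwise maximum and the integral against $\mu$ respect the bound exactly, up to floating-point error.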
