A variational framework for modal estimation

Tâm Le Minh (1,2), Julyan Arbel (2), Florence Forbes (2), and Hien Duy Nguyen (3,4)

1 School of Medicine and Dentistry, Griffith University, Nathan, QLD 4101, Australia
2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
3 School of Computing, Engineering and Mathematical Sciences, La Trobe University, Bundoora, VIC 3086, Australia
4 Institute of Mathematics for Industry, Kyushu Univ., Nishi Ward, Fukuoka 819-0395, Japan

Abstract. We approach multivariate mode estimation through Gibbs distributions and introduce GERVE (Gibbs-measure Entropy-Regularised Variational Estimation), a likelihood-free framework that approximates Gibbs measures directly from samples by maximising an entropy-regularised variational objective with natural-gradient updates. GERVE brings together kernel density estimation, mean-shift, variational inference, and annealing in a single platform for mode estimation. It fits Gaussian mixtures that concentrate on high-density regions and yields cluster assignments from responsibilities, with reduced sensitivity to the chosen number of components. We provide theory in two regimes: as the Gibbs temperature approaches zero, mixture components converge to population modes; at fixed temperature, maximisers of the empirical objective exist, are consistent, and are asymptotically normal. We also propose a bootstrap procedure for per-mode confidence ellipses and stability scores. Simulation and real-data studies show accurate mode recovery and emergent clustering, robust to mixture overspecification. GERVE is a practical likelihood-free approach when the number of modes or groups is unknown and full density estimation is impractical.

Keywords. Mode estimation; Modal clustering; Variational methods; Gibbs measures; Annealing; Bootstrap.

1 Introduction

Understanding the structure of an unknown probability distribution p on $\mathbb{R}^d$, observed through samples, is a central task in statistics. Modes, the local maxima of p, reveal the high-density regions of the data. Detecting multiple modes exposes heterogeneity, supports clustering and segmentation, and offers alternative explanations in inverse problems (Carreira-Perpiñán, 2006; Chacón, 2015). Recently, it has also emerged as a preliminary requirement for new efficient sampling methods for multimodal high-dimensional distributions (Noble et al., 2025).

Classical approaches to mode estimation are commonly grouped into three main families. The first comprises geometric methods that detect high-density regions from the sample (Chernoff, 1964; Dalenius, 1965; Venter, 1967; Sager, 1978). These methods are simple and robust but tend to degrade in high dimensions. The second fits a smooth density and then locates its modes (Parzen, 1962; Loftsgaarden and Quesenberry, 1965; Kronmal and Tarter, 1968; Samworth, 2018; Liu and Ghosh, 2020). This route is principled but suffers from smoothing bias and high computational cost. The third consists of mean-shift algorithms that ascend the gradient of a kernel density estimate (KDE) without fitting a global model (Fukunaga and Hostetler, 1975; Comaniciu and Meer, 2002; Carreira-Perpiñán, 2007; Arias-Castro et al., 2016). These methods are efficient but bandwidth-sensitive and can miss well-separated modes.
To our knowledge, no existing method simultaneously targets multiple density modes, provides valid uncertainty quantification, and scales with dimension via simple algorithmic updates, without requiring a full density estimate (Genovese et al., 2016; Ameijeiras-Alonso et al., 2021). We propose a different perspective: we recast the task of mode estimation into that of approximating a Gibbs measure from samples. Let $f : S \to \mathbb{R}$ be a measurable function and fix a temperature parameter $\omega > 0$. Define
\[
g_\omega(\mathrm{d}x) = \frac{\exp(f(x)/\omega)}{Z_\omega(f)}\,\mathrm{d}x, \qquad \text{with } Z_\omega(f) := \int_S \exp(f(y)/\omega)\,\mathrm{d}y,
\]
and assume $Z_\omega(f) < \infty$. We assume that $S \subset \mathbb{R}^d$ has finite Lebesgue measure. The Gibbs measure $g_\omega$ preserves the modes of f and is the maximiser, among the unconstrained family of distributions, of the entropy-regularised variational objective (Donsker and Varadhan, 1976):
\[
\mathcal{L}_\omega(q) = \mathbb{E}_q[f(X)] + \omega H_S(q), \qquad \text{with } H_S(q) = -\int_S q(y)\log q(y)\,\mathrm{d}y. \tag{1}
\]
We introduce GERVE (Gibbs-measure Entropy-Regularised Variational Estimation), a likelihood-free variational framework that approximates the Gibbs measure associated with a density $f := p$, in a Gaussian-mixture family, from i.i.d. samples $X_{1:N} \sim p$, where p is nonnegative and supported on a subset of S. Since p is a density but is not analytically available, we leverage the identity $\mathbb{E}_q[p(X)] = \mathbb{E}_p[q(X)]$ to target, with natural-gradient updates, the maximisation of an empirical version of $\mathcal{L}_\omega$ that averages the variational density q over the observed samples:
\[
\hat{\mathcal{L}}_{\omega,N}(q) = \frac{1}{N}\sum_{i=1}^N q(X_i) + \omega H_S(q). \tag{2}
\]
This estimation problem requires fundamentally different treatment from optimisation settings where p (or f) can be evaluated: we develop asymptotic theory for maximisers, establish consistency and asymptotic normality, and introduce bootstrap-based inference for per-mode uncertainty quantification, extending recent computational techniques from variational optimisation (Khan and Rue, 2023; Le Minh et al., 2025) to a statistical estimation methodology.
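To make the empirical objective concrete, the following minimal Python sketch evaluates $\hat{\mathcal{L}}_{\omega,N}(q)$ for a Gaussian mixture, with the entropy term estimated by Monte Carlo from q itself; the function names and the Monte Carlo entropy estimator are illustrative choices, not the reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, weights, means, covs):
    """Density of a Gaussian mixture at points x of shape (n, d)."""
    return sum(w * multivariate_normal(m, C).pdf(x)
               for w, m, C in zip(weights, means, covs))

def empirical_objective(X, weights, means, covs, omega, rng, n_mc=5000):
    """Evaluate (1/N) sum_i q(X_i) + omega * H_S(q), where the entropy
    H_S(q) = -E_q[log q] is estimated by Monte Carlo from samples of q
    (assuming q places negligible mass outside S, as in Suppl. S1.2)."""
    fit_term = mixture_pdf(X, weights, means, covs).mean()
    # Sample from the mixture: pick components, then draw from each Gaussian.
    ks = rng.choice(len(weights), size=n_mc, p=weights)
    Y = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in ks])
    entropy = -np.log(mixture_pdf(Y, weights, means, covs)).mean()
    return fit_term + omega * entropy
```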
By fitting Gaussian mixtures to the Gibbs density, GERVE unifies ideas from mean-shift, kernel density estimation, entropy annealing, and variational inference within one algorithmic paradigm. The variational formulation is well suited to multimodal and multivariate settings, where geometric or purely nonparametric mode estimators often struggle. The entropy term induces repulsion between mixture components, enabling exploration of the modal landscape, while annealing (ω → 0) concentrates mass at the global modes. This allows overcomplete mixtures to automatically adapt to the effective number of modes without prior specification.

Our main contributions are:

1. Variational formulation for mode estimation. We cast mode estimation as Gibbs-measure approximation from samples, enabling mode finding and uncertainty quantification when the number of modes is unknown and density estimation is impractical.
2. Asymptotic theory for multimodal recovery. We establish two complementary theoretical regimes: as ω → 0, the variational mixture concentrates on the population modes; at fixed ω > 0, empirical maximisers are consistent and asymptotically normal.
3. Bootstrap uncertainty for modes. We provide per-mode confidence ellipses and stability scores via a matched-mode bootstrap with theoretical guarantees.
4. Algorithmic framework. Natural-gradient updates with annealing recover multiple modes, estimate local curvature, and yield cluster assignments without pre-specifying the number of groups. A fixed-covariance special case recovers the mean-shift algorithm, while learned covariances adapt to anisotropy and reduce bandwidth sensitivity.
5. Evidence on simulations and data. We show accurate mode recovery and competitive clustering, robustness to mild overcompleteness, and a hotspot analysis with uncertainty on UK road-collision data.

Our approach is related to modal clustering (Menardi, 2016; Chacón, 2020), which defines clusters as basins of attraction of density modes. GERVE uses a variational approximation, yielding clusters that locally agree with basins near modes under mild conditions (Suppl. S1.4), while providing computational efficiency and natural uncertainty quantification.

Outline. The remainder is organised as follows. Section 2 recalls the variational foundations of Gibbs measures and shows that Gaussian mixture approximations can be used for mode seeking by specifying their behaviour under annealing (Thm. 2.1). Section 3 shows that this mechanism extends to mode estimation when the density to maximise is only available through samples; the main principles of GERVE are introduced and the asymptotic behaviour of the empirical maximisers is established (Thm. 3.1). Section 4 presents algorithmic details (Alg. 1). Section 5 gives the fixed-temperature bootstrap procedure (Alg. 2) with guarantees (Thms. 5.1 and 5.3). Sections 6 and 7 report simulations and a case study on UK road-collision hotspots. Section 8 discusses limitations and future directions. All proofs and additional details are in the Supplementary Material. A Python implementation of GERVE can be found at github.com/tam-leminh/gerve.

Notation. We consider that p (or f) has a finite number I of global modes $\{x^\star_i\}_{i=1}^I$. When $f : S \to \mathbb{R}$ and there is no ambiguity on the domain, "f is $C^k$" means $f \in C^k(S)$, that is, f is continuously differentiable on S up to order k. For a distribution q on $A \subseteq \mathbb{R}^d$ and a set $B \subset A$, we write $q(B) := \int_B q(x)\,\mathrm{d}x$ and $q^B := q\mathbf{1}_B/q(B)$ for its truncation (or conditional law) on B. The set $\{q_\vartheta, \vartheta \in \Theta\}$ denotes a variational family of densities parameterised by $\vartheta$, where $\vartheta = \lambda$ for single Gaussians $q_\lambda = \mathcal{N}(\mu, \Sigma)$, or $\vartheta = \Lambda$ for Gaussian mixtures $q_\Lambda = \sum_{k=1}^K \pi_k \mathcal{N}(\mu_k, \Sigma_k)$. When there is no ambiguity, we write $\mathcal{L}_\omega(\vartheta) := \mathcal{L}_\omega(q_\vartheta)$ for the objective. Unless explicitly specified, $\|\cdot\|$ denotes the Euclidean norm for vectors and the corresponding operator norm for matrices.

2 Mode seeking with annealed Gaussian mixtures

For a measurable function $f : S \to \mathbb{R}$, the associated Gibbs measure $g_\omega$ preserves its modes. The goal of finding the modes of f can then be turned into that of approximating $g_\omega$, which in turn can be achieved through a variational approximation principle. The classical variational identity
\[
\mathcal{L}_\omega(q) = \omega \log Z_\omega(f) - \omega\,\mathrm{KL}(q \,\|\, g_\omega), \qquad \text{with } \mathrm{KL}(q \,\|\, g_\omega) := \int_S q(x)\log\big(q(x)/g_\omega(x)\big)\,\mathrm{d}x,
\]
implies that $g_\omega$ uniquely maximises the objective $\mathcal{L}_\omega$ when $Z_\omega(f) < \infty$ (Donsker and Varadhan, 1976). Restricting the search for q to a tractable variational family gives an approximation of $g_\omega$ by minimisation of $\mathrm{KL}(q \,\|\, g_\omega)$.
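For completeness, the identity follows in one line by substituting $\log g_\omega(x) = f(x)/\omega - \log Z_\omega(f)$ into the KL divergence:
\[
\omega\,\mathrm{KL}(q \,\|\, g_\omega) = \omega \int_S q \log q \;+\; \omega \int_S q \Big( \log Z_\omega(f) - \frac{f}{\omega} \Big) = -\,\omega H_S(q) - \mathbb{E}_q[f(X)] + \omega \log Z_\omega(f),
\]
so that $\mathcal{L}_\omega(q) = \omega \log Z_\omega(f) - \omega\,\mathrm{KL}(q \,\|\, g_\omega)$; since the KL term is nonnegative and vanishes only at $q = g_\omega$, the Gibbs measure is the unique maximiser.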
As a natural choice for approximating multimodal landscapes, we study how Gaussian mixtures approximate $g_\omega$ when ω decreases to 0. Since $g_\omega$ is defined only on S, admissible variational distributions are required to have support in S. We therefore work with mixtures of truncated Gaussians. The next result, proved in Supplementary Material S2.2, describes how optimal mixtures concentrate on the global modes when the number of components K is at least the number of global modes I.

Theorem 2.1 (Gaussian mixture concentration on global modes). Assume $f \in C^3(S)$ and bounded on S, and, for each mode in $\{x^\star_i\}_{i=1}^I$, $x^\star_i \in \mathrm{int}(S)$ and $H_i := -\nabla^2 f(x^\star_i) \succ 0$. Consider $\mathcal{Q}$ to be the variational family of Gaussian mixtures truncated on S, with a fixed number of components $K \geq I$ and upper-bounded covariances. Let $q^\star_\omega$ denote any maximiser of $\mathcal{L}_\omega$ in $\mathcal{Q}$. Choose disjoint neighbourhoods $U_1, \ldots, U_I$ of the modes and let $U_0 := S \setminus \bigcup_{i=1}^I U_i$. Then, as $\omega \to 0$:

(i) We have $\mathrm{KL}(q^\star_\omega \,\|\, g_\omega) \to 0$. In particular, $q^\star_\omega(U_0) \to 0$ and, for each neighbourhood $U_i$, $i = 1, \ldots, I$,
\[
q^\star_\omega(U_i) \to c_i, \qquad \text{where } c_i \propto (\det H_i)^{-1/2}, \quad \sum_{i=1}^I c_i = 1.
\]

(ii) For all $i = 1, \ldots, I$, the conditional laws of $q^\star_\omega$ and $g_\omega$ on $U_i$ satisfy $\mathrm{KL}\big(q^{\star,U_i}_\omega \,\|\, g^{U_i}_\omega\big) \to 0$. In particular, the mean and covariance of $q^{\star,U_i}_\omega$ satisfy
\[
\mathbb{E}_{q^{\star,U_i}_\omega}[X] = x^\star_i + o(1), \qquad \mathrm{Cov}_{q^{\star,U_i}_\omega}(X) = o(1).
\]

This result shows that Gaussian mixtures provide accurate approximations of the Gibbs measure and place their mass on all global modes of $g_\omega$, thus of f, under annealing whenever $K \geq I$. The local shape of $g_\omega$ at a particular mode $x^\star_i$ has been studied in Proposition 3 of Le Minh et al. (2025). There, the optimal Gaussian approximation in a neighbourhood of $x^\star_i$ is of the form $\mathcal{N}(\mu_\omega, \Sigma_\omega)$ satisfying
\[
\mu_\omega = x^\star_i + o(\sqrt{\omega}), \qquad \Sigma_\omega = \omega H_i^{-1} + o(\omega).
\]
These expansions clarify how Gibbs measures can be used for mode seeking. As ω decreases, mixture components sharpen and their means drift toward the global modes of f. The temperature parameter governs the trade-off between exploration (through smooth and diffuse mixtures at large ω) and exploitation (concentrated components at small ω). A decreasing ω schedule allows starting in an exploration regime and gradually refining mode localisation.
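A worked special case makes these expansions exact. If f is the quadratic $f(x) = -\tfrac{1}{2}(x - x^\star)^\top H (x - x^\star)$ with $H \succ 0$, taking $S = \mathbb{R}^d$ for the sake of illustration (and ignoring the finite-measure restriction), then
\[
g_\omega(x) \propto \exp\Big(-\tfrac{1}{2\omega}(x - x^\star)^\top H (x - x^\star)\Big),
\]
i.e. $g_\omega = \mathcal{N}(x^\star, \omega H^{-1})$ exactly, so $\mu_\omega = x^\star$ and $\Sigma_\omega = \omega H^{-1}$ with vanishing remainders; a general smooth f behaves this way to leading order near each nondegenerate mode, by Taylor expansion.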
We next introduce a statistical formulation in which f is unobserved and only accessible through samples drawn from it.

3 Sample-based mode estimation and modal clustering

We observe i.i.d. samples $X_{1:N} \sim p$ supported on a bounded domain $\mathrm{supp}(p) := \{x \in \mathbb{R}^d : p(x) > 0\}$, and we choose S with finite Lebesgue measure such that $\mathrm{supp}(p) \subset S$. Replacing f with p, we distinguish the population and empirical objectives, respectively referring to $\mathcal{L}_\omega(q)$ and $\hat{\mathcal{L}}_{\omega,N}(q)$ defined by (1), with $f := p$, and (2). Instead of performing optimisation over truncated distributions on S, GERVE maximises $\hat{\mathcal{L}}_{\omega,N}$ over untruncated Gaussian mixture families on $\mathbb{R}^d$, while all integrals in the objective are kept over S. Supplementary Material S1.2 shows that if S is taken large enough, then the objective gap between the truncated and untruncated distributions is uniformly small, and the truncated and untruncated maximisers coincide asymptotically under M-estimation theory (van der Vaart, 1998). This simplifies optimisation, as it removes parameter-dependent normalisers in the natural-gradient updates (details in Section 4.1).

For a single Gaussian $\mathcal{N}(\mu, \Sigma)$, we consider the compact parameter set
\[
\Theta := \Big\{ \lambda = \big(\Sigma^{-1}\mu,\, -\tfrac{1}{2}\Sigma^{-1}\big) : \|\mu\|_\infty \leq \mu_{\max},\ \sigma^2_{\min} I \preceq \Sigma \preceq \sigma^2_{\max} I \Big\}. \tag{3}
\]
Proposition S1.1 shows that if the outside mass $\varepsilon(\lambda) = 1 - \int_S q_\lambda(x)\,\mathrm{d}x$ is uniformly small on Θ, then the objective gap is uniformly small. The same reasoning applies component-wise to mixtures on the parameter set
\[
\Theta_K := \Big\{ \Lambda = (v_1, \ldots, v_{K-1}, \lambda_1, \ldots, \lambda_K) : |v_k| \leq v_{\max},\ \lambda_k \in \Theta \Big\}, \tag{4}
\]
with bounded logits $v_k := \log(\pi_k/\pi_K)$, which prevent degeneracies while preserving flexibility (Suppl. S1.2, eq. (S3)). In fact, throughout, $\Theta_K$ shall be understood as its restriction to the closure $\overline{\Theta^{\mathrm{lex}}_K}$ of the lexicographically ordered parameter space $\Theta^{\mathrm{lex}}_K$ (defined in Suppl. S1.3), which enforces a canonical labelling and thus provides sufficient identifiability for our technical results.

In the multimodal optimisation setting, the main source of bias is the entropy-induced repulsion in mixtures. We deliberately exploit this repulsion early in training to mitigate mode collapse, a known issue in variational Gaussian mixtures (Wu et al., 2019; Jerfel et al., 2021; Soletskyi et al., 2025). The induced bias progressively diminishes under annealing. The following theorem establishes that GERVE remains statistically consistent at fixed $\omega = \omega_0 > 0$, as $N \to \infty$. An asymptotic normality result is also derived. These results are stated for Gaussian mixtures in $\Theta_K$ (eq. (4)), but apply to single Gaussians in Θ (eq. (3)) as a special case. The proof is provided in Supplementary Material S2.3.

Theorem 3.1 (Asymptotics of empirical maximisers). At fixed $\omega_0 > 0$, if p is bounded and continuous on S, then a population maximiser $\Lambda^\star_{\omega_0}$ of $\mathcal{L}_{\omega_0}$ exists on $\Theta_K$. If it is unique, then any empirical maximiser $\hat{\Lambda}_{\omega_0,N} \in \arg\max \hat{\mathcal{L}}_{\omega_0,N}$ satisfies, as $N \to \infty$:
\[
\hat{\Lambda}_{\omega_0,N} \xrightarrow{P} \Lambda^\star_{\omega_0}.
\]
Furthermore, if $H^\star_{\omega_0} := -\nabla^2_\Lambda \mathcal{L}_{\omega_0}(\Lambda^\star_{\omega_0}) \succ 0$, then
\[
\sqrt{N}\big(\hat{\Lambda}_{\omega_0,N} - \Lambda^\star_{\omega_0}\big) \xrightarrow{D} \mathcal{N}(0, W_{\omega_0}), \qquad \text{where } W_{\omega_0} := \big(H^\star_{\omega_0}\big)^{-1} V_{\omega_0} \big(H^\star_{\omega_0}\big)^{-1}, \quad V_{\omega_0} := \mathrm{Var}\big(\nabla_\Lambda q_\Lambda(X)\big)\big|_{\Lambda = \Lambda^\star_{\omega_0}}.
\]

Remark. Maximiser uniqueness is subtle in mixture models. In particular, unless the parameter space is restricted so as to rule out degeneracies, a component with vanishing weight can have its associated mean and covariance drift without affecting the mixture density, and if several components share the same (µ, Σ), then redistributing weights within that tied group also leaves the density unchanged. Consequently, convergence at the parameter level cannot be guaranteed in general, even when the induced densities converge to a maximiser. To formalise consistency without uniqueness, one can replace the uniqueness assumption in Theorem 3.1 by the weaker requirement that $D_{\omega_0} = \arg\max_{\Lambda \in \Theta_K} \mathcal{L}_{\omega_0}(\Lambda)$ is nonempty and compact. The set-valued argmax theorem (van der Vaart, 1998, Cor. 5.58) then yields
\[
\mathrm{dist}\big(\hat{\Lambda}_{\omega_0,N}, D_{\omega_0}\big) \xrightarrow{P} 0, \qquad \mathrm{dist}(x, A) = \inf_{a \in A} \|x - a\|.
\]
The asymptotic normality statement, however, holds only under local identifiability, for example $H^\star_{\omega_0} \succ 0$, which rules out duplicated components at the maximiser.
In practice, non-uniqueness can be helpful: redundant components may be absorbed as some weights shrink or as component means coalesce near prominent modes, so that an overcomplete mixture adapts to the effective number of modes without the user pre-specifying K.

As demonstrated by Theorem 2.1, optimal variational mixtures concentrate on the modes of p as ω → 0. With the consistency guarantees of Theorem 3.1, GERVE therefore defines a mode estimator: Gaussian mixtures $q_\Lambda = \sum_{k=1}^K \pi_k q_{\lambda_k}$ simultaneously capture multiple modes under annealing.

Beyond mode estimation, GERVE naturally induces a clustering structure. With possibly overcomplete mixtures, components are attracted toward distinct modes as entropy decreases. Redundant components either vanish or collapse, so the effective number of clusters adapts automatically. Unlike likelihood-based Gaussian mixture clustering, GERVE targets density modes under annealing.

In practice, once fitted, cluster assignments are straightforward via responsibilities (posterior component probabilities). This connects to modal clustering (Menardi, 2016; Chacón, 2020), where clusters are attraction basins of density modes. Classical approaches rely on nonparametric density estimates and gradient flows, while GERVE uses a parametric variational approximation. The resulting clusters need not coincide exactly with attraction basins, but they agree locally near each mode under mild conditions; see Supplementary Material S1.4 for a statement and a counterexample. This yields adaptive modal clustering without tuning K. A caveat is that annealing emphasises dominant modes: secondary modes may disappear as ω → 0, so retaining them requires stopping at a positive temperature or imposing weight lower bounds.

4 GERVE algorithm and variants

4.1 Variational optimisation with natural gradients

The empirical objective $\hat{\mathcal{L}}_{\omega,N}$ (eq. (2)) can be optimised over Θ or $\Theta_K$ using natural gradients $\tilde{\nabla}_\vartheta \hat{\mathcal{L}}_{\omega,N}(\vartheta) := F(\vartheta)^{-1} \nabla_\vartheta \hat{\mathcal{L}}_{\omega,N}(\vartheta)$, which incorporate the geometry of the variational parameter space through the Fisher information matrix $F(\vartheta)$ (Amari, 1998). When $q_\vartheta$ belongs to an exponential family with sufficient statistic T(X), the natural gradient equals the gradient in the expectation parameters $M = \mathbb{E}_{q_\vartheta}[T(X)]$:
\[
\tilde{\nabla}_\vartheta \hat{\mathcal{L}}_{\omega,N}(\vartheta) = \nabla_M \hat{\mathcal{L}}_{\omega,N}(\vartheta). \tag{5}
\]
This identity allows computation of natural gradients without explicit estimation and inversion of the Fisher matrix. It also holds for untruncated Gaussian mixtures via the Minimal Conditional Exponential Family representation (MCEF; Lin et al., 2019).

Algorithm 1 describes natural-gradient ascent on $\hat{\mathcal{L}}_{\omega,N}(\vartheta)$ with a temperature schedule $\omega_t$ over iterations $t = 1, \ldots, T$. Convergence guarantees for stochastic natural-gradient ascent are stated in Supplementary Material S1.5: at fixed ω, updates converge to the stationary set under standard Robbins–Monro conditions and bounded iterates (Thm. S1.11); with $\omega_t \to 0$, stationary sets can be tracked with slow schedules, i.e. $|\omega_{t+1} - \omega_t| = o(\rho_t)$, where $\rho_t$ is a gradient stepsize (Cor. S1.12). In GERVE, a projection step can be added to ensure the updates remain in the compact space Θ or $\Theta_K$.

Algorithm 1: GERVE (general form)
1. Input: samples $X_{1:N}$.
2. Set: initial $\vartheta_1$, stepsizes $(\rho_t)_t$, schedule $(\omega_t)_t$, (optional) batch size B.
3. for t = 1 : T do
4.   (Optional) Sample a mini-batch $\{X^{(t)}_i\}_{i=1}^B$ from $X_{1:N}$ (or use the full data).
5.   Compute an (unbiased) natural-gradient estimate $\tilde{\nabla}\hat{\mathcal{L}}_{\omega_t,N}(\vartheta_t)$.
6.   Update $\vartheta_{t+1} \leftarrow \vartheta_t + \rho_t \tilde{\nabla}\hat{\mathcal{L}}_{\omega_t,N}(\vartheta_t)$.
7.   (Optional) Project to a compact set.
8. end
9. Return: $\vartheta_{T+1}$.
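To fix ideas, here is a minimal Python sketch of Algorithm 1 in its simplest regime: a single Gaussian with fixed isotropic covariance, where the natural-gradient step on the mean reduces to update (6) derived in Section 4.2 below and the entropy gradient is approximately zero. The stepsize schedule and iteration count are illustrative choices.

```python
import numpy as np

def gerve_fixed_cov(X, mu0, sigma2, T=500):
    """Minimal GERVE loop (Algorithm 1) for a single Gaussian with fixed
    isotropic covariance sigma2 * I: the natural-gradient step on the mean
    reduces to update (6) and no annealing is needed in this special case."""
    N, d = X.shape
    mu = mu0.copy()
    for t in range(1, T + 1):
        rho_t = 1.0 / t                        # Robbins-Monro stepsize
        diff = X - mu                          # (N, d)
        q = np.exp(-0.5 * np.sum(diff**2, axis=1) / sigma2)
        q /= (2 * np.pi * sigma2) ** (d / 2)   # q_lambda(X_i), fixed covariance
        mu = mu + (rho_t / N) * diff.T @ q     # update (6)
    return mu
```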
4.2 Variant A: Equivalence to Gaussian mean-shift

Consider the family of Gaussian distributions with a fixed isotropic covariance $\sigma^2 I$, $\{q_\mu = \mathcal{N}(\mu, \sigma^2 I), \mu \in \mathbb{R}^d\}$. For a large enough S, we have $\nabla_\mu H_S(q_\mu) \approx 0$, so the natural-gradient step simplifies to
\[
\mu_{t+1} = \mu_t + \frac{\rho_t}{N}\sum_{i=1}^N (X_i - \mu_t)\, q_{\lambda_t}(X_i). \tag{6}
\]
Using a batch $\{X^{(t)}_i\}_{i=1}^B$ gives an unbiased estimator of the right-hand side (see Suppl. S3.2 for the derivation). Denoting by $p_h(\mu) = \frac{1}{N}\sum_{i=1}^N \phi_h(X_i - \mu)$ the KDE with a Gaussian kernel $\phi_h$ with bandwidth h, the following result states that GERVE recovers the mean-shift algorithm when $h = \sigma^2$.

Proposition 4.1 (Equivalence to Gaussian mean-shift). Update rule (6) performs gradient ascent on $p_h$, and normalising the step $\rho_t = p_h(\mu)^{-1}$ yields the classical mean-shift fixed-point update:
\[
\mu_{t+1} = \frac{\sum_{i=1}^N \phi_h(X_i - \mu_t)\, X_i}{\sum_{j=1}^N \phi_h(X_j - \mu_t)}.
\]

Viewing fixed-covariance GERVE as mean-shift puts us under classical KDE theory. Operationally, $\mu_t$ is a particle ascending the smoothed density $p_h$, and the (fixed) covariance $hI$ is the bandwidth: large h leads to smooth and global but biased updates, while small h yields sharp but noisy updates, sensitive to sampling. This regime is thus a well-understood nonparametric baseline that grounds the richer GERVE variants. It can typically be used to set the covariance bounds in learned-covariance variants of GERVE. For example, if p is $C^3$, the gradient of the Gaussian KDE has bias $O(h^2)$ and fluctuation $O_p\big((Nh^{d+2})^{-1/2}\big)$, yielding the optimal plug-in rate $h \asymp N^{-1/(d+6)}$ (Parzen, 1962; Romano, 1988; Tsybakov, 1990; Genovese et al., 2016).
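The fixed-point form of Proposition 4.1 is a one-liner; the following sketch (with an illustrative stopping rule) iterates it from a given start, so the equivalence with the loop above can be checked numerically.

```python
import numpy as np

def mean_shift(X, mu0, h, n_iter=200, tol=1e-8):
    """Classical Gaussian mean-shift fixed-point iteration, i.e. update (6)
    with the normalised step rho_t = p_h(mu)^{-1} of Proposition 4.1.
    The kernel phi_h only enters through weights, so its constants cancel."""
    mu = mu0.copy()
    for _ in range(n_iter):
        w = np.exp(-0.5 * np.sum((X - mu) ** 2, axis=1) / h)  # phi_h up to constants
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu
```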
4.3 Variant B: Gaussian mixture GERVE

When considering Gaussian mixtures represented by $\Theta_K$, it is convenient to adopt the following parametrisation. Let $q_\Lambda = \sum_{k=1}^K \pi_k \mathcal{N}(\mu_k, S_k^{-1})$, where $S_k := \Sigma_k^{-1}$, with logits $v_k = \log(\pi_k/\pi_K)$ and $\Lambda = (v_{1:K-1}, \lambda_{1:K}) \in \Theta_K$. Natural-gradient steps lead to component-wise updates coupled via the entropy term (full derivation in Suppl. S3.3):
\[
S_{k,t+1} = S_{k,t}\Big(1 - \frac{\rho_t}{N}\sum_{i=1}^N \big[(X_i - \mu_{k,t})(X_i - \mu_{k,t})^\top S_{k,t} - I\big]\, q_{\lambda_{k,t}}(X_i)\Big) - \rho_t \omega_t \nabla_{S_k^{-1}} H_S(q_\Lambda)\big|_{\Lambda = \Lambda_t},
\]
\[
\mu_{k,t+1} = \mu_{k,t} + \frac{\rho_t}{N} S_{k,t+1}^{-1} S_{k,t} \sum_{i=1}^N (X_i - \mu_{k,t})\, q_{\lambda_{k,t}}(X_i) + \rho_t \omega_t \nabla_{\mu_k} H_S(q_\Lambda)\big|_{\Lambda = \Lambda_t},
\]
\[
v_{k,t+1} = v_{k,t} + \frac{\rho_t}{N}\sum_{i=1}^N \big[q_{\lambda_{k,t}}(X_i) - q_{\lambda_{K,t}}(X_i)\big] + \rho_t \omega_t \nabla_{\pi_k} H_S(q_\Lambda)\big|_{\Lambda = \Lambda_t}.
\]
Learned covariances adapt to anisotropy and converge to positive limits at fixed temperature. As ω → 0, covariances shrink faster. Furthermore, the entropy term induces repulsion between components at large ω that vanishes as ω → 0. Therefore, a slower-decreasing annealing schedule allows for longer exploration by delaying mode collapse and improving mode capture in rugged and multimodal landscapes. Overcomplete mixtures (when K is larger than the number of modes) naturally share components per mode (see Thm. 2.1). Gradient derivations and Monte Carlo estimators for the entropy terms are in Supplementary Material S3.3. Diagonal, isotropic, or fixed-covariance variants are obtained by modifying $\Theta_K$, and the derivations follow the same pattern.

4.4 Complexity and practical guidelines

With T iterations, batch size B, and a Gaussian mixture with K components, the computational cost is $O(d^3 BKT)$. The $d^3$ factor reflects the matrix multiplications and inversions that arise when updating full precision matrices. Restricting the variational family reduces this cost to $O(dBKT)$ for diagonal or fixed covariances. These restricted families often preserve the qualitative behaviour of the algorithm and improve scalability, which brings the cost in line with classical mode-seeking and clustering procedures. Further details appear in Supplementary Material S3.4.

We now summarise practical guidance for hyperparameter selection, collected in an illustrative configuration sketch below. For the temperature schedule, the initial value of ω should be large so that mixture components repel and cover S. The temperature should then decrease to a target value $\omega^\dagger$ as slowly as possible, so that the iterates continue to track the relevant stationary sets, with $\omega^\dagger = 0$ for mode estimation and $\omega^\dagger = \omega_0$ with $\omega_0 > 0$ for clustering, as formalised in Corollary S1.12. For stepsizes, a safe choice satisfies the Robbins–Monro conditions $\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$. In practice, when the temperature $\omega_t$ decreases over time, we find that mean updates are stabilised by the scaling $\rho_t \propto \omega_t^{-\beta}$ with $\beta \in (0, 1)$. This choice often accelerates convergence, although it can violate the Robbins–Monro conditions. Compactness constraints require explicit bounds for the covariance parameters in Θ and for the mixture parameters in $\Theta_K$. The upper bound $\sigma^2_{\max}$ should be large enough that a single component can cover the entire dataset. For mode estimation, a suitable rule is to take the lower bound $\sigma^2_{\min}$ of the order of magnitude of $N^{-1/(d+6)}$, which follows classical mean-shift bandwidth theory. For clustering, $\sigma^2_{\min}$ should be comparable to intra-class variances. For the logits that define the weights in $\Theta_K$ (eq. (4)), setting the upper bound $v_{\max} \geq 6$ allows any subset of weights to be smaller than $10^{-5}$, which makes it possible to identify vanishing components while maintaining compactness (see Suppl. S1.2, eq. (S3)).
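The following snippet collects these guidelines into default hyperparameters; the specific functional forms (a geometric annealing schedule, the exponent β = 0.7 echoed in Section 6.1, and the small positive stand-in for $\omega^\dagger = 0$) are illustrative choices, not prescriptions.

```python
import numpy as np

def default_hyperparameters(X, T=2000, beta=0.7, clustering=False):
    """Illustrative defaults following the guidelines of Section 4.4."""
    N, d = X.shape
    omega_init = 1.0                                   # large enough for repulsion
    omega_target = 1e-1 if clustering else 1e-3        # omega^dagger (stand-in for 0)
    # Slowly (geometrically) decreasing temperature schedule.
    omegas = omega_init * (omega_target / omega_init) ** (np.arange(T) / (T - 1))
    rhos = 1e-2 * omegas ** (-beta)                    # rho_t proportional to omega_t^{-beta}
    return dict(
        omegas=omegas,
        rhos=rhos,
        sigma2_min=N ** (-1.0 / (d + 6)),              # mean-shift bandwidth rate
        sigma2_max=np.var(X, axis=0).sum(),            # one component covers the data
        v_max=6.0,                                     # lets weights drop below 1e-5
    )
```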
5 Bootstrap uncertainty quantification for mode estimation

Mode-level uncertainty can be provided at a fixed stopping temperature $\omega_0 > 0$, using a bootstrap principle. A GERVE baseline fit is performed on the initial sample (constant $\omega_t = \omega_0$). GERVE is then refit L times on resampled data. Each set of resulting modes is matched to the baseline modes, providing an empirical spread in the form of confidence ellipses and a stability score.

5.1 Bootstrap procedure

The bootstrap procedure is summarised in Algorithm 2 for Gaussian mixture GERVE. Each fit, $\ell = 0, \ldots, L$, uses an overcomplete K-component mixture followed by pruning (for some threshold $\epsilon \ll 1$) and merging of near-duplicates. The first fit identifies $\hat{K}$ "baseline" modes. Each bootstrap fit $\ell = 1, \ldots, L$ produces its own set of $\hat{K}^{(\ell)}$ modes, which are matched to the baseline modes via minimum-cost assignment (Hungarian algorithm; Kuhn, 1955). Then, we form confidence intervals from the empirical distribution of matched modes, and monitor mode recovery with the stability score
\[
s_k = \frac{1}{L}\sum_{\ell=1}^L \mathbf{1}\{\text{mode } k \text{ is matched in replicate } \ell\}. \tag{7}
\]

Algorithm 2: Bootstrap uncertainty quantification for mode estimation
1. Input: samples $X_{1:N}$, temperature $\omega_0$, number of replicates L.
2. Fit GERVE at $\omega_0$ on $X_{1:N}$, prune/merge to obtain baseline modes $\{\hat{\mu}^{(0)}_k\}_{k=1}^{\hat{K}}$.
3. for $\ell = 1 : L$ do
4.   Sample $J^{(\ell)}_{1:N}$ i.i.d. $\sim$ Unif([N]), set $X^{(\ell)}_i \leftarrow X_{J^{(\ell)}_i}$ for $i = 1 : N$.
5.   Fit GERVE at $\omega_0$ on $X^{(\ell)}_{1:N}$, prune/merge to obtain $\{\hat{\mu}^{(\ell)}_j\}_{j=1}^{\hat{K}^{(\ell)}}$.
6.   Match $\{\hat{\mu}^{(\ell)}_j\}_{j=1}^{\hat{K}^{(\ell)}}$ to $\{\hat{\mu}^{(0)}_k\}_{k=1}^{\hat{K}}$ (Hungarian).
7.   Record, for each k, the matched $\hat{\mu}^{(\ell)}_k$.
8. end
9. Return: per-mode confidence ellipses from $(\hat{\mu}^{(\ell)}_k)_{\ell=1}^L$ and stability scores $(s_k)_{k=1}^{\hat{K}}$.
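A compact sketch of the resampling, matching, and summary steps is given below, using scipy's linear_sum_assignment for the Hungarian step; gerve_fit stands in for any routine returning pruned/merged mode centres and is a hypothetical placeholder, not part of the reference implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bootstrap_modes(X, gerve_fit, L=500, rng=None):
    """Algorithm 2 sketch: resample, refit, Hungarian-match to baseline,
    and summarise per-mode spread and stability. The adaptive distance
    gates used in the case study (Suppl. S6) are omitted for brevity."""
    rng = rng or np.random.default_rng()
    base = gerve_fit(X)                               # baseline modes, shape (K_hat, d)
    K, N = len(base), len(X)
    matched = [[] for _ in range(K)]
    for _ in range(L):
        Xb = X[rng.integers(0, N, size=N)]            # bootstrap resample
        modes = gerve_fit(Xb)                         # shape (K_hat_ell, d)
        cost = np.linalg.norm(base[:, None] - modes[None], axis=-1)
        rows, cols = linear_sum_assignment(cost)      # Hungarian matching
        for k, j in zip(rows, cols):
            matched[k].append(modes[j])
    stability = np.array([len(m) / L for m in matched])       # eq. (7)
    covariances = [np.cov(np.array(m).T) if len(m) > 1 else None
                   for m in matched]                  # basis for 95% ellipses
    return stability, covariances
```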
5.2 Statistical validity

Let $\hat{\mathcal{L}}_{\omega_0,N}$ be the empirical objective at fixed $\omega_0 > 0$ on a compact parameter set $\Theta_K$ (eq. (4)), and let $\hat{\Lambda}_{\omega_0,N} \in \arg\max_{\Lambda \in \Theta_K} \hat{\mathcal{L}}_{\omega_0,N}(\Lambda)$ with a unique population maximiser $\Lambda^\star_{\omega_0}$. Write $P^*$ for the conditional probability given the observed sample $X_{1:N}$. Bootstrap limits are understood conditionally on $X_{1:N}$, in P-probability. Define the bootstrap criterion and maximiser
\[
\hat{\mathcal{L}}^*_{\omega_0,N}(\Lambda) = \mathbb{E}_{P^*_N}[q_\Lambda(X)] + \omega_0 H_S(q_\Lambda), \qquad \hat{\Lambda}^*_{\omega_0,N} \in \arg\max_{\Lambda \in \Theta_K} \hat{\mathcal{L}}^*_{\omega_0,N}(\Lambda).
\]
In addition to uniqueness, we assume that the population maximiser satisfies $H^\star_{\omega_0} = -\nabla^2_\Lambda \mathcal{L}_{\omega_0}(\Lambda^\star_{\omega_0}) \succ 0$, so that Theorem 3.1 fully applies. The following statements guarantee the validity of the bootstrap for parameters (Thm. 5.1) and for matched modes (Thm. 5.3), and their robustness to inexact maximisation (Prop. 5.4). All results are proved in Supplementary Material S2.4. In Supplementary Material S1.6, we also investigate the consistency of stability scores and propose strategies to handle nonconvex objectives.

Theorem 5.1 (Parameter-level bootstrap validity). We have, as $N \to \infty$:
\[
\sqrt{N}\big(\hat{\Lambda}^*_{\omega_0,N} - \hat{\Lambda}_{\omega_0,N}\big) \xrightarrow{D} \mathcal{N}(0, W_{\omega_0}) \quad \text{conditionally on } X_{1:N},
\]
where $W_{\omega_0} := \big(H^\star_{\omega_0}\big)^{-1} V_{\omega_0} \big(H^\star_{\omega_0}\big)^{-1}$ and $V_{\omega_0} := \mathrm{Var}\big(\nabla_\Lambda q_\Lambda(X)\big)\big|_{\Lambda = \Lambda^\star_{\omega_0}}$.

For the theoretical analysis, we fix deterministic target locations $u_{1:K_0}$, which act as reference anchors for defining the matching map. In practice, these targets correspond to the subset of baseline mode estimates we intend to track, so $K_0$ is taken no larger than the number $\hat{K}$ of resolved baseline modes returned by the initial fit. To move from parameters to matched modes, we then define the matching map $\mathcal{M}$ that stacks the $K_0$ component means closest to their respective targets $u_{1:K_0}$ (see Suppl. S1.6.1 for more details). The following guarantees only require a local separation margin.

Assumption 5.2 (Local separation for matching). There exists $\delta > 0$ such that, near $\Lambda^\star_{\omega_0}$, each target $u_j$ has a unique nearest component mean and the nearest index is locally constant (see Suppl. S1.6.1 for a formal definition).

When separation fails, bootstrap ellipses inflate and the stability score $s_k$ drops, which is informative about ambiguous modes (see Suppl. S1.6 for a consistency result for $s_k$).

Theorem 5.3 (Matched-mode bootstrap validity and confidence ellipses). Under Assumption 5.2, $\mathcal{M}$ is $C^1$ at $\Lambda^\star_{\omega_0}$ with Jacobian J, and as $N \to \infty$:
\[
\sqrt{N}\big(\mathcal{M}(\hat{\Lambda}_{\omega_0,N}) - \mathcal{M}(\Lambda^\star_{\omega_0})\big) \xrightarrow{D} \mathcal{N}(0, C_{\mathcal{M}}), \qquad \sqrt{N}\big(\mathcal{M}(\hat{\Lambda}^*_{\omega_0,N}) - \mathcal{M}(\hat{\Lambda}_{\omega_0,N})\big) \xrightarrow{D} \mathcal{N}(0, C_{\mathcal{M}}),
\]
where $C_{\mathcal{M}} = J W_{\omega_0} J^\top$. For each matched mode j, the marginal covariance is $C_j$, so percentile (and, with a consistent $\hat{C}_j$, studentised) bootstrap ellipses are asymptotically valid.

Finally, we allow inexact baseline and bootstrap fits. Let $\tilde{\Lambda}_{\omega_0,N}$ and $\tilde{\Lambda}^*_{\omega_0,N}$ denote any approximate solutions that are first-order stationary up to $o_P(N^{-1/2})$ and $o_{P^*}(N^{-1/2})$, respectively, that is, $\|\nabla_\Lambda \hat{\mathcal{L}}_{\omega_0,N}(\tilde{\Lambda}_{\omega_0,N})\| = o_P(N^{-1/2})$ and $\|\nabla_\Lambda \hat{\mathcal{L}}^*_{\omega_0,N}(\tilde{\Lambda}^*_{\omega_0,N})\| = o_{P^*}(N^{-1/2})$.

Proposition 5.4 (Robustness to inexact maximisation). The conclusions of Theorems 5.1 and 5.3 hold with $\tilde{\Lambda}_{\omega_0,N}$ and $\tilde{\Lambda}^*_{\omega_0,N}$ in place of $\hat{\Lambda}_{\omega_0,N}$ and $\hat{\Lambda}^*_{\omega_0,N}$.

6 Simulation studies

We show that GERVE recovers modes and yields useful clusterings without tuning K. Furthermore, on simulated data, we illustrate (i) how clustering emerges as a byproduct of variational mode estimation and (ii) the consistency of mode recovery.

Figure 1: Overspecified clustering (K = 7) of a triangle Gaussian mixture sample. Data points are colored according to the component with the highest posterior responsibility in the fitted mixture. Ellipses represent the component covariances. Left: GERVE returns 3 effective clusters (the component with the dashed trajectory has vanishing weight; the remaining six components form three clusters by grouping equivalent components). Right: GMM-EM returns a partition into 7 clusters.

6.1 Clustering example

To illustrate the clustering properties of GERVE, we simulate N = 6000 points from a 2D, 3-component Gaussian mixture with means at the nodes of an equilateral triangle and isotropic covariance $\sigma^2 I$, $\sigma^2 = 0.25$. The data exhibit J = 3 high-density regions that we aim to recover. Details on the setup and a density plot can be found in Supplementary Material S4.1. We run GERVE with Gaussian mixtures (full covariances) under two regimes: matched K = 3 and overcomplete K = 7. To showcase GERVE's robustness to component-mean initialisation, we initialise means in $[-2, 0]^2$. We start with broad covariances and equal weights. We use a continuously decreasing annealing schedule and compensate with step size $\rho_t \propto \omega_t^{-0.7}$ (full hyperparameters and schedules in Suppl. S4.1). After training, we assign points by highest responsibility and prune weights below $10^{-3}$, as sketched below. Baselines are GMM-EM (Gaussian mixture model fit by the EM algorithm) with full covariances and k-means, for $K \in \{3, 7\}$.

Figure 1 shows the obtained partitions when K = 7. With this overcomplete mixture, GERVE still yields 3 effective clusters: redundant components coalesce at modes or are pruned, whereas GMM-EM returns 7 clusters, as does k-means (Suppl. S4.1, Fig. S3). Because GERVE targets modal structure rather than a clustering likelihood, it is more robust to over-specification. With K = 3, all methods recover 3 clusters (Suppl. S4.1, Fig. S3). In Supplementary Material S5, we assess robustness against classical methods with a benchmark on standard UCI datasets. We also provide an ablation study and report runtimes.
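The assignment step is standard; a minimal sketch (function name illustrative) computes responsibilities and prunes low-weight components before assigning:

```python
import numpy as np
from scipy.stats import multivariate_normal

def assign_clusters(X, weights, means, covs, prune_tol=1e-3):
    """Posterior responsibilities r_ik proportional to pi_k N(x_i | mu_k, Sigma_k);
    components with weight below prune_tol are dropped before assignment."""
    keep = np.asarray(weights) > prune_tol
    ws = np.asarray(weights)[keep]
    ws = ws / ws.sum()
    comps = [(m, C) for m, C, k in zip(means, covs, keep) if k]
    dens = np.stack([w * multivariate_normal(m, C).pdf(X)
                     for w, (m, C) in zip(ws, comps)], axis=1)  # (N, K_kept)
    resp = dens / dens.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), resp
```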
6.2 Mode-finding performance

We use the same triangle setting as previously, with sharper components ($\sigma^2 = 0.1$), so that the true global modes practically coincide with the means. Over $n_{\mathrm{rep}} = 100$ replicates, each method outputs K mode estimates. We report three metrics (see Suppl. S4.2): (i) Mode recovery ($MR_\epsilon$), the number of true modes recovered within a tolerance $\epsilon$; we set $\epsilon = 10^{-2}$; (ii) Hungarian matching sum ($HM$), the minimum-assignment cost between true and estimated modes; (iii) Nearest-neighbour sum ($NN$), the aggregate distance from each estimate to the closest mode.

We run GERVE (Gaussian mixture with full covariances) with $K \in \{3, 7\}$ and a vanishing annealing schedule. Baselines are Gaussian mean-shift with fixed-point updates, whose found modes are aggregated over K initialisations, and the Feature Significance method via the feature R package, which detects significant curvature points and then estimates modes as the centroids of k-means. For each sample size $N \in \{2^{10}, 2^{12}, \ldots, 2^{20}\}$ and value of K, we perform a light hyperparameter grid search (Suppl. S4.2). We average $MR_\epsilon$ and $NN$, and report $HM$ medians with 95% confidence intervals.

Figure 2 summarises performance with respect to sample size N for $K \in \{3, 7\}$. GERVE's $MR_\epsilon$ increases towards the number of true modes and $HM$ decreases as N grows, illustrating the consistency of mode recovery (this matches the conclusions of Thm. 3.1). With K = 3, GERVE's $NN$ is comparable to that of mean-shift and Feature Significance, showcasing its mode-location accuracy. With K = 7, GERVE improves $MR_\epsilon$ and $HM$ while keeping $NN$ competitive. This exhibits GERVE's robustness to overspecification, whereas Feature Significance is degraded. A broader sweep over $K \in \{3, \ldots, 7\}$ appears in Supplementary Material S4.2.

Figure 2: Mode estimation vs. sample size N for the triangle mixture (I = 3) example, comparing GERVE, mean-shift, and Feature Significance at $K \in \{3, 7\}$ on the $MR_\epsilon$, $HM$, and $NN$ metrics. Curves: means ($MR_\epsilon$, $NN$) or medians ($HM$) over $n_{\mathrm{rep}} = 100$; bands: 95% confidence intervals.

To summarise the key aspects highlighted in this simulation study: (i) clustering emerges from mode estimation, with robustness to overspecification; (ii) for mode estimation, GERVE is statistically consistent and competitive in location accuracy, and mild overspecification helps GERVE where it hinders the other baselines.

7 Application: UK road collision hotspot detection

Identifying collision hotspots (spatial intensity modes) and quantifying their uncertainty for intervention planning is crucial in road safety studies. Count-based metrics and KDE are common (Xie and Yan, 2008; Okabe et al., 2009). Spatial scans add significance testing and Bayesian models add uncertainty, but at the cost of parametric assumptions and heavier computation (Kulldorff, 1997; Aguero-Valverde and Jovanis, 2006). GERVE offers a likelihood-free alternative that estimates mode locations with calibrated localisation uncertainty.
We analyse UK STATS19 collisions from the Department for Transport Road Safety Data portal (www.gov.uk/government/statistics/road-safety-data), restricting to A-roads in Greater London (longitudes $[-0.54, 0.33]$, latitudes $[51.28, 51.70]$) with severity labelled "fatal" or "serious", from 2020 to 2024. We fit an overcomplete mixture, targeting 10 hotspots and using K = 20 components to enable overcompleteness, and resolve modes by pruning and merging (details in Suppl. S6). The stopping temperature $\omega^\dagger$ is chosen at the elbow, the largest ω where the resolved-mode count peaks (Suppl. S6, Fig. S10). At $\omega^\dagger$, we compute L = 500 bootstrap resamples, match bootstrap modes to the baseline by assignment with adaptive gates (Suppl. S6), and project to a metric system (EPSG:27700). For each mode k, we report a stability score $s_k$ and a 95% confidence ellipse from the empirical covariance of matched centres. Under the fixed-temperature theory in Section 5, the ellipses have nominal coverage and the stability score $s_k$ (eq. (7)) estimates the recovery probability.

Figure 3 maps the 2020–2024 hotspots with 95% ellipses and $s_k$ in the zoomed-in window $[-0.15, 0.15] \times [-0.075, 0.075]$. High-stability locations ($s_k \geq 0.7$, an arbitrary threshold) such as Shoreditch (ID: 12) and Elephant & Castle (ID: 1) have ellipses of order 100–150 m, which supports intersection-level action. Medium-stability locations ($0.4 \leq s_k < 0.7$) such as Aldgate (ID: 8) and Clapham HS (ID: 16) show wider ellipses that span multiple junctions, which suggests area-wide measures. Table S5 in Supplementary Material S6 lists the entries, ranked by stability score.

Figure 3: Baseline collision hotspots identified by GERVE for the Greater London area between 2020 and 2024, with stability scores. The normalised coordinates are bounded within the $[-0.7, 0.7] \times [-0.35, 0.35]$ box (normalised window), but all hotspots lie in the $[-0.15, 0.15] \times [-0.075, 0.075]$ box (zoomed-in window); see Supplementary Material S6, Figure S12, for the full study area. Left: hotspot IDs, for reference to Table S5, Supplementary Material S6. Right: 95% confidence ellipses, in normalised coordinates.

To compare like-for-like, we use mean-shift, operating on the same object as GERVE. With a flat kernel and a bandwidth sweep around a data-driven reference $h_0$ (see Suppl. S6), mean-shift enumerates many local maxima at small bandwidths and merges aggressively at larger ones. In the zoomed-in window, mean-shift at $0.05\,h_0$ yields 122 centres, while GERVE reports 17. The contrast reflects design: entropy-regularised mixtures impose global competition for support, so weaker peaks are absorbed as the temperature falls. This yields a coarser and more stable partition that aligns with major junctions. A visual comparison appears in Supplementary Material S6, Figure S14.

For temporal stability, we fit GERVE on 2014–2019 and 2020–2024 and declare a hotspot persistent if its 95% ellipses overlap across periods.
Eight of the 17 hotspots persist, which indicates structural risk that is stable across years. Non-persistent hotspots (Piccadilly Circus, Gunter Grove, Clapham HS) may reflect temporally varying factors such as pandemic-related traffic shifts, local construction, or policy changes. Supplementary Material S6, Figure S16, overlays the two periods.

Stability scores and ellipses can be used to target actions. A high $s_k$ with a tight ellipse supports redesign at a named junction, whereas a moderate $s_k$ with a wide ellipse reveals corridor-level risks, which are more suitable for monitoring.

8 Conclusion

GERVE is a likelihood-free variational approach to mode estimation and modal clustering. It optimises an entropy-regularised objective with natural gradients under annealing, fitting mixtures that concentrate on high-density regions and induce clusters without density estimation. As the temperature goes to zero, components align with population modes and inherit local curvature. At fixed temperature, empirical maximisers exist, are consistent, and satisfy a central limit theorem. A matched-mode bootstrap procedure yields per-mode confidence sets. Empirically, GERVE recovers modal structure and adapts the effective number of groups as temperature decreases, with redundant components merging or vanishing.

Diagonal-covariance variants preserve most of the behaviour at lower cost, and an ablation study shows that annealing, moderate component overspecification, and short burn-in phases are useful levers. GERVE is most suitable when uncertainty in mode locations is important, when the cluster count is misspecified, or when modal clustering is preferable to partition-based grouping. For problems where the number of clusters is known and computational speed is paramount, established task-specific clustering algorithms remain preferable. Three limitations merit attention. First, modal clustering may misalign with the group structure that matters in an application. Modes define clusters by attraction basins, but scientific groupings may depend on other criteria, so agreement is not guaranteed. Second, the optimisation procedure can converge to local rather than global maxima. Careful initialisation, temperature schedules, and overcomplete mixtures reduce this risk but do not eliminate it completely. Third, computation can be costly with full covariance matrices in high dimensions. In our experiments, using diagonal covariances allows one to control the cost while preserving mode separation.

The theory and experimental results position GERVE as a principled statistical framework for likelihood-free modal inference. By unifying modal clustering, mean-shift, variational inference, and annealing, GERVE provides both algorithmic tools and theoretical guarantees within one platform. Looking ahead, three directions are natural. First, develop data-driven annealing schedules with principled stopping. Second, establish pathwise guarantees for empirical maximisers and for the validity of the bootstrap procedure under annealing. Third, characterise mode-component correspondence and label stability as temperature varies. Pursuing these directions will link the annealing procedure to stronger statistical guarantees and broaden the method's practical reach.

Acknowledgements

This work is supported by the French ANR and JST-CREST Bayes-Duality grant (ANR-21-JSTM-0001).
References

Aguero-Valverde, J. and Jovanis, P. P. (2006). Spatial analysis of fatal and injury crashes in Pennsylvania. Accident Analysis & Prevention, 38(3):618–625.

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.

Ameijeiras-Alonso, J., Crujeiras, R. M., and Rodriguez-Casal, A. (2021). Multimode: an R package for mode assessment. Journal of Statistical Software, 97:1–32.

Arias-Castro, E., Mason, D., and Pelletier, B. (2016). On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. Journal of Machine Learning Research, 17:1–28.

Borkar, V. S. (2008). Stochastic Approximation: a Dynamical Systems Viewpoint. Springer.

Carreira-Perpiñán, M. Á. (2006). Acceleration strategies for Gaussian mean-shift image segmentation. Conference on Computer Vision and Pattern Recognition, 1:1160–1167.

Carreira-Perpiñán, M. Á. (2007). Gaussian mean-shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767–776.

Chacón, J. E. (2015). A population background for nonparametric density-based clustering. Statistical Science, 30(4):518–532.

Chacón, J. E. (2020). The modal age of statistics. International Statistical Review, 88(1):122–141.

Chernoff, H. (1964). Estimation of the mode. Annals of the Institute of Statistical Mathematics, 16(1):31–41.

Comaniciu, D. and Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2009). Introduction to Algorithms. MIT Press, 3rd edition.

Dalenius, T. (1965). The mode–A neglected statistical parameter. Journal of the Royal Statistical Society Series A: General, 128(1):110–117.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.

Donsker, M. D. and Varadhan, S. R. S. (1976). Asymptotic evaluation of certain Markov process expectations for large time—III. Communications on Pure and Applied Mathematics, 29(4):389–461.

Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press.

Fukunaga, K. and Hostetler, L. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40.

Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L. (2016). Nonparametric inference for density modes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(1):99–126.

Hwang, C.-R. (1980). Laplace's method revisited: weak convergence of probability measures. The Annals of Probability, 8(6):1177–1182.

Jerfel, G., Wang, S., Wong-Fannjiang, C., Heller, K. A., Ma, Y., and Jordan, M. I. (2021). Variational refinement for importance sampling using the forward Kullback–Leibler divergence. Uncertainty in Artificial Intelligence, pages 1819–1829.

Khan, M. E. and Rue, H. (2023). The Bayesian learning rule. Journal of Machine Learning Research, 24(281):1–46.

Kronmal, R. and Tarter, M. (1968). The estimation of probability densities and cumulatives by Fourier series methods. Journal of the American Statistical Association, 63(323):925–952.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics–Theory and Methods, 26(6):1481–1496.

Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer.

Le Minh, T., Arbel, J., Möllenhoff, T., Khan, M. E., and Forbes, F. (2025). Natural variational annealing for multimodal optimization. arXiv preprint arXiv:2501.04667.

Lin, W., Khan, M. E., and Schmidt, M. (2019). Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. International Conference on Machine Learning.

Liu, B. and Ghosh, S. K. (2020). On empirical estimation of mode based on weakly dependent samples. Computational Statistics & Data Analysis, 152:107046.

Loftsgaarden, D. O. and Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate density function. The Annals of Mathematical Statistics, 36(3):1049–1051.

McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. John Wiley & Sons.

Menardi, G. (2016). A review on modal clustering. International Statistical Review, 84(3):413–433.

Noble, M., Grenioux, L., Gabrié, M., and Durmus, A. O. (2025). Learned reference-based diffusion sampler for multi-modal distributions. International Conference on Learning Representations.

Okabe, A., Satoh, T., and Sugihara, K. (2009). A kernel density estimation method for networks, its computational method and a GIS-based tool. International Journal of Geographical Information Science, 23(1):7–32.

Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076.

Romano, J. P. (1988). On weak convergence and optimality of kernel density estimates of the mode. The Annals of Statistics, 16(2):629–647.

Sager, T. W. (1978). Estimation of a multivariate mode. The Annals of Statistics, 6(4):802–812.

Samworth, R. J. (2018). Recent progress in log-concave density estimation. Statistical Science, 33(4):493–509.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons.

Soletskyi, R., Gabrié, M., and Loureiro, B. (2025). A theoretical perspective on mode collapse in variational inference. Machine Learning: Science and Technology, 6(2):025056.

Tsybakov, A. B. (1990). Recursive estimation of the mode of a multivariate distribution. Problems of Information Transmission, 26(1):38–45.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: with Applications to Statistics. Springer.

Venter, J. H. (1967). On estimation of the mode. The Annals of Mathematical Statistics, 38(5):1446–1455.

Wu, D., Wang, L., and Zhang, P. (2019). Solving statistical mechanics using variational autoregressive networks. Physical Review Letters, 122(8):080602.

Xie, Z. and Yan, J. (2008). Kernel density estimation of traffic accidents in a network space. Computers, Environment and Urban Systems, 32(5):396–406.

Zeidler, E. (1995). Applied Functional Analysis: Applications to Mathematical Physics, volume 108. Springer.
Supplementary material

S1 Theory details
  S1.1 Notation and setup
  S1.2 Truncation equivalence and compact sets
  S1.3 Lexicographic order on Θ
  S1.4 Modal basins–responsibility cells agreement
  S1.5 Natural-gradient convergence
  S1.6 Uncertainty quantification for mode estimation
    S1.6.1 Notations and separation assumption
    S1.6.2 Consistency of stability scores
    S1.6.3 Handling local optima
S2 Proofs
  S2.1 Preliminaries: Empirical process properties
  S2.2 Proof of Theorem 2.1
  S2.3 Proof of Theorem 3.1
    S2.3.1 Existence of a population maximiser
    S2.3.2 Consistency of the empirical maximisers
    S2.3.3 Asymptotic normality of the empirical maximisers
  S2.4 Proofs for Sections 5 and S1.6
    S2.4.1 Preliminaries: Bounded-Lipschitz metric
    S2.4.2 Technical lemmas
    S2.4.3 Proofs of lemmas
    S2.4.4 Proof of Theorem 5.1
    S2.4.5 Proof of Proposition 5.4
    S2.4.6 Proof of Theorem 5.3
    S2.4.7 Proof of Proposition S1.13
  S2.5 Proofs for Section S1.2
    S2.5.1 Proof of Proposition S1.1
    S2.5.2 Proof of Theorem S1.2
S3 Algorithms and variants
  S3.1 Natural-gradient update rule
  S3.2 Fixed-covariance Gaussian case
  S3.3 Gaussian mixture case
  S3.4 Computational considerations
S4 Simulations and benchmark
  S4.1 Details on the clustering example of Section 6.1
  S4.2 Mode estimation
S5 Clustering performance on UCI datasets
  S5.1 Benchmark
  S5.2 Ablation study
  S5.3 Runtime
S6 Details on the UK collision case study

S1 Theory details

S1.1 Notation and setup

We work on a domain $S \subset \mathbb{R}^d$ with finite Lebesgue measure. Bold symbols denote vectors/matrices (e.g., x, µ, Σ).
Given a measurable $f : S \to \mathbb{R}$ and a temperature $\omega > 0$, define the normalised Gibbs measure
\[
g_\omega(\mathrm{d}x) = \frac{\exp(f(x)/\omega)}{Z_\omega(f)}\,\mathrm{d}x, \qquad \text{with partition function } Z_\omega(f) = \int_S \exp(f(y)/\omega)\,\mathrm{d}y,
\]
assumed finite when invoked. We approximate $g_\omega$ on S with a tractable q (restricted to S): (i) a single Gaussian $q_\lambda = \mathcal{N}(\mu, \Sigma)$; (ii) a K-component Gaussian mixture
\[
q_\Lambda(x) = \sum_{k=1}^K \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^K \pi_k = 1, \quad \pi_k > 0.
\]
Mixture weights are parameterised by logits $v_k = \log(\pi_k/\pi_K)$. Unless stated otherwise, covariance eigenvalues are constrained to $[\sigma^2_{\min}, \sigma^2_{\max}]$.

For any density q, $\mathbb{E}_q[\cdot]$ denotes the expectation under q and $\mathrm{KL}(q \,\|\, p)$ the Kullback–Leibler divergence. For q on S, define the entropy on S,
\[
H_S(q) = -\int_S q(x)\log q(x)\,\mathrm{d}x,
\]
and the variational objective
\[
\mathcal{L}_\omega(q) = \mathbb{E}_q[f(X)] + \omega H_S(q).
\]
By the classical variational identity,
\[
\mathcal{L}_\omega(q) = \omega \log Z_\omega(f) - \omega\,\mathrm{KL}(q \,\|\, g_\omega),
\]
so $g_\omega$ uniquely maximises $\mathcal{L}_\omega$ whenever $Z_\omega(f) < \infty$.

For variational parameters $\vartheta$, let $F(\vartheta)$ be the Fisher information matrix. The natural gradient is
\[
\tilde{\nabla}_\vartheta \mathcal{L}_\omega(\vartheta) = F(\vartheta)^{-1}\nabla_\vartheta \mathcal{L}_\omega(\vartheta).
\]
If $q_\vartheta$ is in an exponential family with sufficient statistic T(X) (or in an MCEF for mixtures; Lin et al., 2019), the natural gradient equals the standard gradient with respect to the expectation parameters $M = \mathbb{E}_{q_\vartheta}[T(X)]$:
\[
\tilde{\nabla}_\vartheta \mathcal{L}_\omega(\vartheta) = \nabla_M \mathcal{L}_\omega(\vartheta).
\]

Let $\{x^\star_i\}_{i=1}^I$ denote the set of global maximisers ("modes") of f (or of p when f = p) on S. At a nondegenerate mode $x^\star_i$ (i.e. $\nabla^2 f(x^\star_i) \prec 0$), define the positive-definite curvature
\[
H_i := -\nabla^2 f(x^\star_i) \succ 0.
\]
For a symmetric matrix A, we write $\mathrm{eig}_{\min}(A)$, $\mathrm{eig}_{\max}(A)$ for its extremal eigenvalues. Norms are Euclidean for vectors and operator norms for matrices unless noted. For a set $A \subset \mathbb{R}^d$, $\mathrm{int}(A)$ denotes its interior, $\partial A$ its boundary, and $A^c := \mathbb{R}^d \setminus A$ its complement. For an element $x \in \mathbb{R}^d$, we write $\mathrm{dist}(x, A) := \inf_{a \in A} \|x - a\|_2$. For $r > 0$, let $B(x, r)$ denote the ball centred at x with radius r.

S1.2 Truncation equivalence and compact sets

In the optimisation setting (Le Minh et al., 2025), the target f is known or directly evaluable, so $\mathbb{E}_{q_\vartheta}[f(X)]$ is computable. In GERVE, f is replaced by an unknown density p, and we observe only i.i.d. samples $X_{1:N} \sim p$. We suppose that p is smooth, bounded, and supported on a bounded domain $\mathrm{supp}(p) \subset S$. Using symmetry,
\[
\mathcal{L}_\omega(q_\vartheta) = \int_S p(x)\,q_\vartheta(x)\,\mathrm{d}x + \omega H_S(q_\vartheta) = \mathbb{E}_p[q_\vartheta(X)] + \omega H_S(q_\vartheta),
\]
so the first term is an expectation under p. Let $P_N := \frac{1}{N}\sum_{i=1}^N \delta_{X_i}$ be the empirical measure. A natural plug-in estimator replaces $\mathbb{E}_p[q_\vartheta(X)]$ by $\mathbb{E}_{P_N}[q_\vartheta(X)]$, yielding the empirical variational objective
\[
\hat{\mathcal{L}}_{\omega,N}(\vartheta) = \mathbb{E}_{P_N}[q_\vartheta(X)] + \omega H_S(q_\vartheta) = \frac{1}{N}\sum_{i=1}^N q_\vartheta(X_i) + \omega H_S(q_\vartheta). \tag{S1}
\]
This is the learning objective optimised by GERVE.

Optimisation over untruncated Gaussians. Since p is supported on S rather than on $\mathbb{R}^d$, the corresponding Gibbs measure is defined only on S, and one might a priori restrict q to S. Here $q_\lambda$ denotes the untruncated Gaussian on $\mathbb{R}^d$ with natural parameter λ. Define
\[
\psi_S(\lambda) := \int_S q_\lambda(x)\,\mathrm{d}x.
\]
The S-truncated version is $q^S_\lambda(x) := q_\lambda(x)\mathbf{1}_S(x)/\psi_S(\lambda)$. Optimising directly over $q^S_\lambda$ is cumbersome because $\psi_S(\lambda)$ varies with the parameter.
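For intuition about the size of the outside mass $\varepsilon(\lambda) = 1 - \psi_S(\lambda)$ introduced below, the following sketch computes it in closed form in the special case of a diagonal covariance and an axis-aligned box $S = \prod_i [a_i, b_i]$, where $\psi_S$ factorises across coordinates; this is an illustrative special case, not the general construction used in the proofs.

```python
import numpy as np
from scipy.stats import norm

def outside_mass(mu, sigmas, a, b):
    """epsilon(lambda) = 1 - psi_S(lambda) for a Gaussian with diagonal
    covariance diag(sigmas**2) and a box S = prod_i [a_i, b_i]:
    psi_S factorises as a product of one-dimensional Gaussian masses."""
    per_coord = norm.cdf((b - mu) / sigmas) - norm.cdf((a - mu) / sigmas)
    return 1.0 - np.prod(per_coord)

# A mean several standard deviations inside the box leaves a tiny mass outside.
print(outside_mass(mu=np.zeros(2), sigmas=np.ones(2),
                   a=-6 * np.ones(2), b=6 * np.ones(2)))  # ~ 4e-9
```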
The next result (proven in Section S2.5.1) shows that, on a large $\mathcal{S}$, optimising over the untruncated family while integrating over $\mathcal{S}$ incurs only a controlled error on a compact parameter set. To define such a compact parameter space for Gaussians, fix a mean bound $\mu_{\max} < \infty$ and eigenvalue bounds $0 < \sigma^2_{\min} < \sigma^2_{\max} < \infty$, and consider the compact set of natural parameters
$$ \Theta := \Big\{ \lambda = \big(\Sigma^{-1}\mu,\ -\tfrac12 \Sigma^{-1}\big) : \mu \in \mathbb{R}^d,\ \sigma^2_{\min} I \preceq \Sigma \preceq \sigma^2_{\max} I,\ \|\mu\|_\infty \le \mu_{\max} \Big\}. $$
For $\lambda \in \Theta$, let $\mu(\lambda)$ and $\Sigma(\lambda)$ be the mean and covariance matrix encoded by $\lambda$. For notational convenience, let
$$ F_\omega(\lambda) := \mathcal{L}_\omega(q_\lambda), \qquad G_\omega(\lambda) := \mathcal{L}_\omega(q^{\mathcal{S}}_\lambda), \qquad \lambda \in \Theta. $$
Let $\varepsilon(\lambda) := 1 - \psi_{\mathcal{S}}(\lambda)$, $\varepsilon_{\max} := \sup_{\lambda \in \Theta} \varepsilon(\lambda) < 1$, and $p_{\max} := \sup_{x \in \mathcal{S}} p(x)$. In the following proposition, we bound $|G_\omega(\lambda) - F_\omega(\lambda)|$ uniformly on $\Theta$.

Proposition S1.1 (Truncation gap). Let $C_q = \sup_{\lambda \in \Theta} \sup_{x \in \mathcal{S}} |\log q_\lambda(x)| < \infty$. Define, for $t \in [0, 1)$,
$$ \delta_\omega(t) := p_{\max}\, t + \omega\, |\log(1 - t)| + \omega\, C_q\, t. $$
Then for all $\lambda \in \Theta$,
$$ |G_\omega(\lambda) - F_\omega(\lambda)| \le \delta_\omega\big(\varepsilon(\lambda)\big), \tag{S2} $$
and $\sup_{\lambda \in \Theta} |G_\omega(\lambda) - F_\omega(\lambda)| \le \delta_\omega(\varepsilon_{\max})$.

This bound implies that if $\mu$ stays away from $\partial\mathcal{S}$, the error from using untruncated Gaussians is negligible. We next derive a tail bound which, combined with an argmax-stability argument, yields the following theorem, proven in Section S2.5.2.

Theorem S1.2 (Truncation equivalence on a bounded support). Let $\delta_\omega(\cdot)$ be as in Proposition S1.1. For any fixed $\omega > 0$, if there is a margin $M_0 > 0$ such that $\mathrm{dist}(\mu(\lambda), \partial\mathcal{S}) \ge M_0 \sqrt{\mathrm{eig}_{\max}(\Sigma(\lambda))}$ for all $\lambda \in \Theta$, then for some $C > 0$,
$$ \varepsilon_{\max} \le C e^{-M_0^2/2}, \qquad \sup_{\lambda \in \Theta} |G_\omega(\lambda) - F_\omega(\lambda)| \le \delta_\omega(\varepsilon_{\max}). $$
Moreover, let $m_\omega(\rho) := F_\omega(\lambda^\star_\omega) - \sup_{\|\lambda - \lambda^\star_\omega\| \ge \rho} F_\omega(\lambda)$ be the value gap. Consequently, if $m_\omega(\rho) > 0$ for some $\rho > 0$, then every maximiser of $G_\omega$ lies in $B(\lambda^\star_\omega, \rho)$ whenever $\delta_\omega(\varepsilon_{\max}) < m_\omega(\rho)/3$. In particular, if the maximiser of $F_\omega$ is unique and in $\mathrm{int}(\Theta)$, then under the same conditions, the maximiser of $G_\omega$ is also unique and $\arg\max_{\lambda \in \Theta} G_\omega = \lambda^\star_\omega$.

Case of mixtures. Theorem S1.2 is stated for single Gaussians. For mixtures, we apply Proposition S1.1 componentwise under the same spectral bounds and control the entropy term via per-component bounds. Uniform control of the mixture log-density $\log q_\Lambda$ additionally requires avoiding degenerate behaviours (e.g., vanishing weights with unbounded parameters), which can be done by staying on a compact set of "natural" parameters. Indeed, although mixtures are not an exponential family, they lie in the minimal conditional exponential family (MCEF; Lin et al., 2019), which admits an analogous natural representation. We consider the compact set given by equation (4), which we recall here:
$$ \Theta_K := \Big\{ \Lambda = (v_1, \ldots, v_{K-1}, \lambda_1, \ldots, \lambda_K) : |v_k| \le v_{\max},\ \lambda_k \in \Theta \Big\}, $$
where $v_k = \log(\pi_k/\pi_K)$. To preserve the compactness induced by $\Theta$, we impose additional box constraints $|v_k| \le v_{\max}$ on the log-ratios, which implies a uniform floor on the weights:
$$ \pi_k \ge \frac{1}{1 + (K-1)\, e^{2 v_{\max}}} > 0 \quad \text{for all } k \in [K]. $$
However, choosing
$$ v_{\max} \ge \tfrac12 \log\Big( \tfrac1\varepsilon - (K-1) \Big) \tag{S3} $$
allows a designated subset of components to have weights as small as $\varepsilon$ simultaneously while respecting the box constraints. As $v_{\max} \to \infty$, the set of feasible weights approaches the simplex, so the constraint becomes non-restrictive in practice. Typically, setting $v_{\max} = 6$ already suffices to allow weights as small as $10^{-5}$.
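The arithmetic behind these box constraints is easy to check directly; the lines below verify the weight floor and the bound (S3) for illustrative values ($K = 10$ components, $v_{\max} = 6$).

```python
import numpy as np

# Weight floor implied by |v_k| <= v_max: pi_k >= 1 / (1 + (K-1) e^{2 v_max}).
K, v_max = 10, 6.0
floor = 1.0 / (1.0 + (K - 1) * np.exp(2 * v_max))
print(floor)                              # ~6.8e-07, so weights below 1e-5 are feasible

# Bound (S3): v_max >= 0.5 * log(1/eps - (K-1)) makes a weight of eps feasible.
eps = 1e-5
print(0.5 * np.log(1.0 / eps - (K - 1)))  # ~5.76 <= 6, consistent with v_max = 6
```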
In brief, for mixtures, we follow the same convention as above: we optimise over untruncated mixtures while integrating over $\mathcal{S}$. Under the margin condition $\varepsilon(\mu_k, \Sigma_k) = o(1)$ (e.g., $\mathrm{dist}(\mu_k, \partial\mathcal{S}) \ge M_0 \sqrt{\mathrm{eig}_{\max}(\Sigma_k)}$ with $M_0$ large), replacing truncated by untruncated components perturbs $\widehat{\mathcal{L}}_{\omega,N}$ and $\mathcal{L}_\omega$ by $o(1)$ uniformly, and so leaves maximisers unchanged asymptotically.

S1.3 Lexicographic order on Θ

Fix a one-to-one coordinate map $\tau : \Theta \to \mathbb{R}^{d_\lambda}$. For $a, b \in \mathbb{R}^{d_\lambda}$ we write $a \prec_{\mathrm{lex}} b$ if there exists $j \in \{1, \ldots, d_\lambda\}$ such that $a_i = b_i$ for all $i < j$ and $a_j < b_j$. For $\lambda, \lambda' \in \Theta$ we write $\lambda \prec_{\mathrm{lex}} \lambda'$ if $\tau(\lambda) \prec_{\mathrm{lex}} \tau(\lambda')$. We define the lexicographically ordered parameter space for $K$-component mixtures by
$$ \Theta^{\mathrm{lex}}_K = \Big\{ \Lambda = (v_1, \ldots, v_{K-1}, \lambda_1, \ldots, \lambda_K) \in \Theta_K : \lambda_1 \prec_{\mathrm{lex}} \lambda_2 \prec_{\mathrm{lex}} \cdots \prec_{\mathrm{lex}} \lambda_K \Big\}. $$
Let $\mathfrak{S}_K$ denote the permutation group on $\{1, \ldots, K\}$. The mixture model is invariant under relabelling of components, in the sense that for every $\sigma \in \mathfrak{S}_K$ and every $\Lambda = (v, \lambda_1, \ldots, \lambda_K) \in \Theta_K$ there exists a relabelled parameter
$$ \sigma \cdot \Lambda := \big( v^\sigma, \lambda_{\sigma(1)}, \ldots, \lambda_{\sigma(K)} \big) \in \Theta_K $$
such that the mixing weights are permuted accordingly, $\pi(v^\sigma) = (\pi_{\sigma(1)}(v), \ldots, \pi_{\sigma(K)}(v))$, and $f(\cdot\,; \sigma \cdot \Lambda) = f(\cdot\,; \Lambda)$.

Assume additionally that the component parameters are pairwise distinct, that is, $\lambda_i \ne \lambda_j$ for $i \ne j$. Then, for every $\Lambda \in \Theta_K$, there exists a unique permutation $\sigma_\Lambda \in \mathfrak{S}_K$ such that
$$ \Lambda_{\mathrm{ord}} := \sigma_\Lambda \cdot \Lambda \in \Theta^{\mathrm{lex}}_K, \qquad f(\cdot\,; \Lambda_{\mathrm{ord}}) = f(\cdot\,; \Lambda). $$
In particular, whenever a maximiser (or stationary point) exists over $\Theta_K$ for a permutation-invariant criterion, there is an equivalent maximiser (or stationary point) lying in $\Theta^{\mathrm{lex}}_K$. Thus, without loss of generality, one may restrict uniqueness arguments to $\Theta^{\mathrm{lex}}_K$. Furthermore, suppose that the model is identifiable up to label switching on $\Theta_K$, namely,
$$ f(\cdot\,; \Lambda) = f(\cdot\,; \Lambda') \implies \exists\, \sigma \in \mathfrak{S}_K : \Lambda' = \sigma \cdot \Lambda. $$
Then the parametrisation is identifiable on $\Theta^{\mathrm{lex}}_K$, in the sense that $\Lambda, \Lambda' \in \Theta^{\mathrm{lex}}_K$ and $f(\cdot\,; \Lambda) = f(\cdot\,; \Lambda')$ imply $\Lambda = \Lambda'$.

To justify the existence and uniqueness of $\sigma_\Lambda$, observe first that because $\tau$ is one-to-one and $\prec_{\mathrm{lex}}$ is a strict total order on $\mathbb{R}^{d_\lambda}$, the induced relation $\prec_{\mathrm{lex}}$ on $\Theta$ is also a strict total order. In particular, for any $\lambda \ne \lambda'$ in $\Theta$, exactly one of $\lambda \prec_{\mathrm{lex}} \lambda'$ or $\lambda' \prec_{\mathrm{lex}} \lambda$ holds. Fix $\Lambda = (v, \lambda_1, \ldots, \lambda_K) \in \Theta_K$ and assume that $\lambda_i \ne \lambda_j$ for $i \ne j$. Define the rank of component $i$ by
$$ r(i) := 1 + \big| \{ j \in \{1, \ldots, K\} : \lambda_j \prec_{\mathrm{lex}} \lambda_i \} \big|. $$
Since the $\lambda_i$ are pairwise distinct and the order is total, the ranks $\{r(i)\}_{i=1}^K$ are exactly $\{1, \ldots, K\}$. Hence there exists a unique permutation $\sigma_\Lambda \in \mathfrak{S}_K$ such that $r(\sigma_\Lambda(k)) = k$ for $k = 1, \ldots, K$, or equivalently,
$$ \lambda_{\sigma_\Lambda(1)} \prec_{\mathrm{lex}} \lambda_{\sigma_\Lambda(2)} \prec_{\mathrm{lex}} \cdots \prec_{\mathrm{lex}} \lambda_{\sigma_\Lambda(K)}. $$
Define the ordered representative $\Lambda_{\mathrm{ord}} = \sigma_\Lambda \cdot \Lambda = (v^{\sigma_\Lambda}, \lambda_{\sigma_\Lambda(1)}, \ldots, \lambda_{\sigma_\Lambda(K)})$. By construction, $\Lambda_{\mathrm{ord}} \in \Theta^{\mathrm{lex}}_K$. Uniqueness of $\sigma_\Lambda$ follows because the sorting permutation of a list of distinct elements under a strict total order is unique (see, e.g., Cormen et al., 2009, Sec. 8.1).
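Concretely, $\Lambda_{\mathrm{ord}}$ is obtained by a plain lexicographic sort of the component parameter vectors. A minimal sketch, assuming for illustration that $\tau$ simply flattens each $\lambda_k$ into a coordinate vector, permuting the weights alongside:

```python
import numpy as np

def order_components(weights, lams):
    """Return the ordered representative: components sorted lexicographically
    by their flattened parameter vectors tau(lambda_k); for pairwise-distinct
    lams the sorting permutation sigma_Lambda is unique."""
    idx = sorted(range(len(lams)), key=lambda k: tuple(lams[k]))  # tuple = lex order
    return [weights[k] for k in idx], [lams[k] for k in idx]

w = [0.2, 0.5, 0.3]
lams = [np.array([1.0, -0.5]), np.array([-2.0, 3.0]), np.array([1.0, -0.9])]
print(order_components(w, lams))
# components come out as lambda_2 < lambda_3 < lambda_1; the mixture density is unchanged
```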
By the assumed relabelling invariance, for every $\sigma \in \mathfrak{S}_K$ we have $f(\cdot\,; \sigma \cdot \Lambda) = f(\cdot\,; \Lambda)$, and in particular $f(\cdot\,; \Lambda_{\mathrm{ord}}) = f(\cdot\,; \Lambda)$. Consequently, any permutation-invariant criterion $Q$ (for instance, $Q(\Lambda) = \int \psi(x, f(x; \Lambda))\, dx$ or $Q(\Lambda) = \sum_{i=1}^n \log f(X_i; \Lambda)$, or any other function of $f(\cdot\,; \Lambda)$) satisfies $Q(\Lambda_{\mathrm{ord}}) = Q(\Lambda)$. Hence, if $\Lambda^\star$ is a maximiser of $Q$ over $\Theta_K$, then $(\Lambda^\star)_{\mathrm{ord}}$ is also a maximiser and lies in $\Theta^{\mathrm{lex}}_K$.

Finally, to prove identifiability on $\Theta^{\mathrm{lex}}_K$, let $\Lambda, \Lambda' \in \Theta^{\mathrm{lex}}_K$ satisfy $f(\cdot\,; \Lambda) = f(\cdot\,; \Lambda')$. By identifiability up to label switching, there exists $\sigma \in \mathfrak{S}_K$ such that $\Lambda' = \sigma \cdot \Lambda$, and therefore $(\lambda'_1, \ldots, \lambda'_K) = (\lambda_{\sigma(1)}, \ldots, \lambda_{\sigma(K)})$. But both $(\lambda_1, \ldots, \lambda_K)$ and $(\lambda'_1, \ldots, \lambda'_K)$ are strictly increasing under $\prec_{\mathrm{lex}}$. Since the $\lambda_k$ are distinct, the set $\{\lambda_1, \ldots, \lambda_K\}$ admits a unique strictly increasing ordering. Therefore the only permutation $\sigma$ for which $(\lambda_{\sigma(1)}, \ldots, \lambda_{\sigma(K)})$ is strictly increasing is $\sigma = \mathrm{id}$. It follows that $\lambda'_k = \lambda_k$ for all $k$, and hence $\Lambda' = \Lambda$. This shows that the parametrisation is identifiable on $\Theta^{\mathrm{lex}}_K$.

S1.4 Modal basins–responsibility cells agreement

To investigate the link between GERVE's induced clustering and modal clustering, we compare two partitions near each mode $x^\star_i$ of a $C^2$ density: the responsibility cells $\mathcal{A}_k(\theta)$ of an isotropic Gaussian mixture $q_\theta(x) = \sum_{k=1}^K \pi_k q_{\mu_k, \sigma^2}(x)$, with $\theta = (\mu_1, \ldots, \mu_K)$, and the modal basins $\{\mathcal{B}_i\}$ of $p$. We give simple, local conditions under which they agree on inner neighborhoods of the modes.

Let $\{x^\star_i\}_{i=1}^I$ be isolated modes of $p$ in $\mathrm{supp}(p) \subset \mathcal{S}$. For $r > 0$, write $B_i(r) := B(x^\star_i, r)$ for the ball centred at $x^\star_i$ with radius $r$, and $\tilde r := r/2$. Weights satisfy $\pi_k \in [\pi_{\min}, \pi_{\max}]$ with $\pi_{\min} > 0$ and $\sum_k \pi_k = 1$. For all $k \in [K]$, define the posterior component responsibilities
$$ \mathrm{resp}_k(x; \theta) = \frac{\pi_k\, q_{\mu_k, \sigma^2 I}(x)}{\sum_j \pi_j\, q_{\mu_j, \sigma^2 I}(x)} $$
and the cells $\mathcal{A}_k(\theta) = \{ x : \mathrm{resp}_k(x; \theta) \ge \mathrm{resp}_j(x; \theta)\ \forall j \in [K] \}$. For $i \in [I]$, the modal basin $\mathcal{B}_i$ is the set of all initial points $x$ such that the solution $\tilde x(t)$ of the gradient-ascent flow $\dot x = \nabla p(x)$ with $\tilde x(0) = x$ satisfies $\tilde x(t) \to x^\star_i$ as $t \to \infty$.

Assumption S1.3 (Local strong log-concavity and separation). For each $i$ there exist $r_i > 0$ and $\kappa_i > 0$ such that $-\nabla^2 \log p(x) \succeq \kappa_i I$ for all $x \in B_i(r_i)$, and the closed balls $\bar B_i(r_i) \subset \mathrm{supp}(p)$ are pairwise disjoint.

Lemma S1.4 (Basin containment). Under Assumption S1.3, $B_i(r_i) \subset \mathcal{B}_i$ for each $i$.

Proof of Lemma S1.4. Let $\tilde x(t)$ be a trajectory of $\dot x = \nabla p(x)$ with $\tilde x(0) \in B_i(r_i)$. Choose the Lyapunov function $V(x) = \|x - x^\star_i\|^2 / 2$. On $\mathrm{supp}(p)$, using $\dot x = \nabla p(x) = p(x) \nabla \log p(x)$, we have
$$ \dot V(x) = \langle \dot x,\ x - x^\star_i \rangle = p(x)\, \langle \nabla \log p(x),\ x - x^\star_i \rangle. $$
Since $\nabla \log p(x^\star_i) = \nabla p(x^\star_i) = 0$, strong log-concavity (Assumption S1.3) implies, for $x \in B_i(r_i) \setminus \{x^\star_i\}$,
$$ \langle \nabla \log p(x),\ x - x^\star_i \rangle \le -\kappa_i \|x - x^\star_i\|^2 < 0, $$
thus, with $m_i = \inf_{x \in B_i(r_i)} p(x) > 0$,
$$ \dot V(x) \le -2\kappa_i\, p(x)\, V(x) \le -2\kappa_i m_i V(x). $$
By Grönwall's inequality,
$$ V(\tilde x(t)) \le V(\tilde x(0)) \exp(-2\kappa_i m_i t), \tag{S4} $$
as long as $\tilde x(t)$ remains in $B_i(r_i)$. Finally, we prove by contradiction that $\tilde x(t) \in B_i(r_i)$ for all $t \ge 0$.
Suppose that $T_{\mathrm{out}} = \inf_{t > 0}\{ t : \tilde x(t) \notin B_i(r_i) \} < \infty$. In this case, by continuity of $\tilde x$, we have $\|\tilde x(T_{\mathrm{out}}) - x^\star_i\| = r_i$. But since $V(\tilde x(0)) < r_i^2/2$, (S4) gives, for $t \in [0, T_{\mathrm{out}})$,
$$ V(\tilde x(t)) < \frac{r_i^2}{2} \exp(-2\kappa_i m_i t) < \frac{r_i^2}{2}. $$
This implies $\|\tilde x(t) - x^\star_i\| < r_i$ for all $t \in [0, T_{\mathrm{out}})$, which contradicts $\|\tilde x(T_{\mathrm{out}}) - x^\star_i\| = r_i$. Thus $\tilde x(t) \in B_i(r_i)$ for all $t \ge 0$, and (S4) ensures $V(\tilde x(t)) \to 0$ as $t \to \infty$; hence $\tilde x(t) \to x^\star_i$ and we conclude that $\tilde x(0) \in \mathcal{B}_i$. □

Assumption S1.5 (One-per-mode configuration). There is a configuration $\theta = (\mu_1, \ldots, \mu_I)$ with exactly one centre $\mu_i \in B_i(\tilde r_i)$ for each $i$, and no other centres in $\bigcup_{i=1}^I B_i(r_i)$. Therefore, by convention, we relabel components so that centre $i$ lies in $B_i(\tilde r_i)$ and write $\mathcal{A}_i(\theta)$ for the $i$th cell.

Assumption S1.6 (Local margin). There exists $\Delta > 0$ such that for all $i$ and all $x \in B_i(\tilde r_i)$,
$$ \|x - \mu_j\| \ge \|x - \mu_i\| + \Delta \qquad \forall j \ne i. $$

Remark 1. If $\mu_i \in B_i(\tilde r_i)$ and $\mathrm{dist}\big( B_i(\tilde r_i), \{\mu_j\}_{j \ne i} \big) > 0$, then $\Delta := \inf_{x \in B_i(\tilde r_i)} \min_{j \ne i} (\|x - \mu_j\| - \|x - \mu_i\|) > 0$.

Assumption S1.7 (Scale separation). There exists $\varepsilon \in (0, 1)$ such that
$$ \sigma \le \frac{\Delta}{\sqrt{2 \log \Psi(\varepsilon)}}, \qquad \text{where } \Psi(\varepsilon) := \frac{(I-1)\, \pi_{\max}}{\varepsilon\, \pi_{\min}} > 1. $$

Theorem S1.8 (Responsibility on balls). Under Assumptions S1.5–S1.7, for each $i$,
$$ B_i(\tilde r_i) \subset \mathcal{A}_i(\theta), \qquad \mathrm{resp}_i(x; \theta) \ge (1 + \varepsilon)^{-1} \quad \forall x \in B_i(\tilde r_i). $$

Proof of Theorem S1.8. Fix $i$ and let $x \in B_i(\tilde r_i)$. By Assumption S1.6, $\|x - \mu_j\| - \|x - \mu_i\| \ge \Delta$ and $\|x - \mu_j\| + \|x - \mu_i\| \ge 2\|x - \mu_i\| + \Delta$. Combining with Assumption S1.7, this gives
$$ \frac{q_{\mu_j, \sigma^2 I}(x)}{q_{\mu_i, \sigma^2 I}(x)} = \exp\Big( -\frac{(\|x - \mu_j\| - \|x - \mu_i\|)(\|x - \mu_j\| + \|x - \mu_i\|)}{2\sigma^2} \Big) \le \exp\Big( -\frac{\Delta^2}{2\sigma^2} \Big) \le \frac{1}{\Psi(\varepsilon)}. $$
Summing with weights gives
$$ \frac{\sum_{j \ne i} \pi_j\, q_{\mu_j}}{\pi_i\, q_{\mu_i}} \le \frac{(I-1)\, \pi_{\max}}{\pi_{\min}\, \Psi(\varepsilon)} = \varepsilon, $$
hence $\mathrm{resp}_i(x; \theta) \ge (1 + \varepsilon)^{-1} > 1/2$, which means $x \in \mathcal{A}_i(\theta)$. □

Remark 2. Assumption S1.7 is in fact a sufficient condition for $\mathrm{resp}_i(x; \theta) \ge (1 + \varepsilon)^{-1}$, which means that competing components contribute a total probability mass $\le \varepsilon$ relative to the dominant component $i$.

Corollary S1.9 (Local partition agreement). Under Assumptions S1.3–S1.7, for each $i$,
$$ \mathcal{A}_i(\theta) \cap B_i(\tilde r_i) = \mathcal{B}_i \cap B_i(\tilde r_i) = B_i(\tilde r_i). $$

Proof of Corollary S1.9. Theorem S1.8 gives $B_i(\tilde r_i) \subset \mathcal{A}_i(\theta)$, and Lemma S1.4 gives $B_i(\tilde r_i) \subset \mathcal{B}_i$. □

Agreement is guaranteed only on local neighborhoods of the modes, $\bigcup_{i=1}^I B_i(\tilde r_i)$. In general, the responsibility cells $\{\mathcal{A}_k\}$ and the modal basins $\{\mathcal{B}_i\}$ disagree globally. Figure S1 showcases a counterexample, highlighting the differences between modal clustering and high-responsibility assignments in a Gaussian mixture model. Remarkably, a mode may not even lie in its corresponding component's responsibility region, notably when components overlap significantly.

[Figure S1: Responsibility regions (colored blue, pink, and yellow) and approximate modal basin boundaries (lines), with modes and component means marked, for a Gaussian mixture with 3 components: means $\mu_1 = (-\cos(\pi/6), -0.5)$, $\mu_2 = (0, 1)$, $\mu_3 = (\cos(\pi/6), 0)$; mixture weights $\pi = (0.16, 0.80, 0.04)$; isotropic covariances $\Sigma_k = \sigma_k^2 I$ with variances $(\sigma_1^2, \sigma_2^2, \sigma_3^2) = (0.30, 0.95, 0.10)$. Covariances are represented as circles of radii $\sigma_k$.]
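For concreteness, the responsibility cells of Figure S1 follow directly from the definitions: each $\mathrm{resp}_k$ only involves weighted Gaussian densities, and the cell assignment is the argmax over $k$. A minimal sketch evaluating them at a point, using the mixture parameters of the caption:

```python
import numpy as np

# Figure S1 mixture (d = 2, isotropic covariances sigma_k^2 * I).
means = np.array([[-np.cos(np.pi / 6), -0.5], [0.0, 1.0], [np.cos(np.pi / 6), 0.0]])
weights = np.array([0.16, 0.80, 0.04])
vars_ = np.array([0.30, 0.95, 0.10])

def responsibilities(x):
    """resp_k(x; theta) = pi_k N(x | mu_k, sigma_k^2 I), normalised over k."""
    d2 = ((x[None, :] - means) ** 2).sum(axis=1)
    w = weights * np.exp(-0.5 * d2 / vars_) / (2 * np.pi * vars_)  # density for d = 2
    return w / w.sum()

x = np.array([0.0, 0.0])
r = responsibilities(x)
print(r, r.argmax())  # cell assignment A_k = argmax responsibility
```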
S1.5 Natural-gradient convergence

Under the Robbins–Monro and bounded-iterate conditions, we can show that stochastic natural-gradient ascent (SNGA) is consistent. Convergence is obtained via a direct application of the ODE method for stochastic approximation (Borkar, 2008, Chp. 2.1).

Assumption S1.10. Consider the update
$$ \vartheta_{t+1} = \vartheta_t + \rho_t\, \psi_t, \tag{S5} $$
where $\psi_t$ is an unbiased stochastic estimate of the natural gradient $\widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}(\vartheta_t)$ (equivalently, the gradient in the expectation-parameter space for exponential families or mixtures thereof). Let $\mathcal{F}_t := \sigma(\vartheta_1, \psi_1, \ldots, \psi_t)$. Assume:

(A1) (Robbins–Monro) $\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$.
(A2) (Mean field) $\widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}$ is locally Lipschitz and of at most linear growth.
(A3) (Noise) $\mathbb{E}[\psi_t \mid \mathcal{F}_{t-1}] = \widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}(\vartheta_{t-1})$ and $\mathbb{E}\|\psi_t\|^2 \le C(1 + \|\vartheta_{t-1}\|^2)$.
(A4) (Stability) The iterates $(\vartheta_t)_{t \ge 1}$ are a.s. bounded.

Theorem S1.11 (Convergence of SNGA at fixed $\omega$). Let $S_\omega = \{\vartheta : \widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}(\vartheta) = 0\}$. Under Assumption S1.10, we have, as $t \to \infty$, $\mathrm{dist}(\vartheta_t, S_\omega) \to 0$ a.s. Equivalently, every limit point of $(\vartheta_t)$ is almost surely stationary.

Remark 3 (Natural-gradient regularity). For exponential families and mixtures thereof (treated as MCEF models), the "natural gradient = expectation gradient" identity (eq. (5)) ensures that $\widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}$ inherits Lipschitz continuity from $\widehat{\mathcal{L}}_{\omega,N}$ on compact $\Theta$, without explicit inversion of the Fisher information matrix.

Under annealing, conditions on the schedule suffice to guarantee convergence. The following corollary stems from standard two-timescale results for stochastic approximation (Kushner and Yin, 2003, Chp. 8.6; Borkar, 2008, Chp. 6.1).

Corollary S1.12 (Convergence of SNGA under annealing). Suppose that Assumption S1.10 holds for each fixed $\omega > 0$, and that $(\vartheta, \omega) \mapsto \widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}(\vartheta)$ is jointly continuous on bounded sets.

1. Stagewise schedule. Let $\omega_1 > \omega_2 > \cdots \to 0$. If at stage $j$ the run with fixed $\omega_j$ is long enough that $\mathrm{dist}(\vartheta, S_{\omega_j}) \le \delta_j$ with $\delta_j \to 0$ (and the next stage is warm-started at that iterate), then $\mathrm{dist}(\vartheta, S_{\omega_j}) \to 0$ as $j \to \infty$.
2. Continuous schedule. If $\omega_t \to 0$ satisfies $|\omega_{t+1} - \omega_t| = o(\rho_t)$, then $\mathrm{dist}(\vartheta_t, S_{\omega_t}) \to 0$ a.s.

Remark 4 (GERVE's convergence). GERVE performs optimisation in a compact parameter space ($\Theta$ for single Gaussians, $\Theta_K$ for Gaussian mixtures), so the iterates must be projected to ensure that they remain in this space. In that case, the above convergence results are not directly applicable. In practice, however, projection is not always needed, typically when the compact space contains the solution of the unconstrained optimisation problem.
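For concreteness, the update (S5) with a Robbins–Monro schedule and a stagewise warm-started schedule can be written out in a toy one-dimensional, fixed-covariance single-Gaussian case, where the entropy term does not depend on the mean and the natural-gradient step reduces to a rescaled Euclidean gradient (cf. Section S3.2). This is only a minimal sketch, not the GERVE algorithm; in particular, the component scale is annealed here as a stand-in for a full temperature schedule.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 0.3, 500), rng.normal(2, 0.3, 500)])  # bimodal toy sample

def grad_mu(mu, batch, s2):
    # Unbiased minibatch estimate of d/dmu E_PN[q_mu(X)]; for fixed covariance
    # the entropy term does not depend on mu, so it drops out of the gradient.
    q = np.exp(-0.5 * (batch - mu) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
    return np.mean(q * (batch - mu) / s2)

mu = 0.1                                   # single-Gaussian variational mean
for s2 in [2.0, 1.0, 0.5]:                 # stagewise schedule, warm starts (Cor. S1.12)
    for t in range(1, 2001):
        rho = 1.0 / (10 + t)               # Robbins-Monro: sum rho = inf, sum rho^2 < inf
        batch = rng.choice(X, size=32)
        mu += rho * s2 * grad_mu(mu, batch, s2)  # s2 * grad: rescaled (natural-like) step
print(mu)  # drifts toward the mode whose basin contains the initialisation
```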
S1.6 Uncertainty quantification for mode estimation

Here, we develop the theoretical guarantees for the mode-level uncertainty quantification (UQ) procedure proposed in Section 5. We translate mathematically all the notions (including Assumption 5.2) defined in Section 5.2; these will be used throughout this section and in the proofs of Section S2.4. We then investigate the consistency of the stability scores (Prop. S1.13) and propose strategies to handle nonconvex objectives. All the results of this section and of Section 5 of the main part are proven in Section S2.4.

S1.6.1 Notations and separation assumption

Data, measures, and convergence. $X_{1:N}$ are i.i.d. from the population law $P$ on the sample space $\mathcal{S}$. The empirical and bootstrap empirical measures are
$$ P_N = \frac1N \sum_{i=1}^N \delta_{X_i}, \qquad P^*_N = \frac1N \sum_{i=1}^N \delta_{X^*_i}, \qquad X^*_1, \ldots, X^*_N \overset{\mathrm{i.i.d.}}{\sim} P_N \text{ given } X_{1:N}. $$
We write $\mathbb{P}^*(\cdot) = \mathbb{P}(\cdot \mid X_{1:N})$ for conditional (bootstrap) probability. All bootstrap limits are understood conditionally on $X_{1:N}$, in $P$-probability.

Parameter space and objective. Fix $\omega_0 > 0$ and work on the compact parameter set $\Theta_K$ (eq. (4)). For a finite signed measure $G$ on $\mathcal{S}$, define the criterion
$$ \mathcal{L}_{\omega_0}(\Lambda; G) := \int q_\Lambda(x)\, dG(x) + \omega_0 H_{\mathcal{S}}(q_\Lambda), $$
and the population/empirical versions $\mathcal{L}_{\omega_0}(\Lambda) := \mathcal{L}_{\omega_0}(\Lambda; P)$ and $\widehat{\mathcal{L}}_{\omega_0,N}(\Lambda) := \mathcal{L}_{\omega_0}(\Lambda; P_N)$.

Estimators and bootstrap refits. Let
$$ \widehat\Lambda_{\omega_0,N} \in \arg\max_{\Lambda \in \Theta_K} \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda), \qquad \Lambda^\star_{\omega_0} \in \arg\max_{\Lambda \in \Theta_K} \mathcal{L}_{\omega_0}(\Lambda), $$
and define the bootstrap criterion and maximiser
$$ \widehat{\mathcal{L}}^*_{\omega_0,N}(\Lambda) := \mathcal{L}_{\omega_0}(\Lambda; P^*_N), \qquad \widehat\Lambda^*_{\omega_0,N} \in \arg\max_{\Lambda \in \Theta_K} \widehat{\mathcal{L}}^*_{\omega_0,N}(\Lambda). $$
We also allow a (possibly inexact) estimator $\widetilde\Lambda_{\omega_0,N}$ of $\widehat\Lambda_{\omega_0,N}$ (e.g., the output of a finite-iteration GERVE fit).

Quadratic expansion matrices. Write
$$ H^\star_{\omega_0} := -\nabla^2_\Lambda \mathcal{L}_{\omega_0}(\Lambda^\star_{\omega_0}), \qquad V_{\omega_0} := \mathrm{Var}\big[ \nabla_\Lambda q_\Lambda(X) \big]\big|_{\Lambda = \Lambda^\star_{\omega_0}}, \qquad W_{\omega_0} := (H^\star_{\omega_0})^{-1} V_{\omega_0} (H^\star_{\omega_0})^{-1}. $$

Mode means, matching, and projections. Here we consider fixed targets $u_1, \ldots, u_{K_0} \in \mathbb{R}^d$ ($K_0 \le K$). For $\Lambda \in \Theta_K$, let $\mu_k(\Lambda) \in \mathbb{R}^d$ be the $k$-th component mean and $m(\Lambda) := (\mu_1(\Lambda)^T, \ldots, \mu_K(\Lambda)^T)^T \in \mathbb{R}^{Kd}$. Define the matching map $M(\Lambda) \in \mathbb{R}^{dK_0}$ by selecting an injection $\pi : \{1, \ldots, K_0\} \hookrightarrow \{1, \ldots, K\}$ minimising $\sum_{j=1}^{K_0} \|\mu_{\pi(j)}(\Lambda) - u_j\|$, with a fixed lexicographic tie-break:
$$ M(\Lambda) = \big( \mu_{\pi(1)}(\Lambda)^T, \ldots, \mu_{\pi(K_0)}(\Lambda)^T \big)^T. $$
Let $E_j \in \mathbb{R}^{d \times dK_0}$ extract the $j$-th $d$-block and set $M_j(\Lambda) := E_j M(\Lambda) \in \mathbb{R}^d$.

Local separation and linearisation. For $j \in [K_0]$ and $\Lambda \in \Theta_K$, let $a(j, \Lambda) \in \arg\min_k \|\mu_k(\Lambda) - u_j\|$ (lexicographic tie-break) and define
$$ \mathrm{gap}_j(\Lambda) := \min_{k \ne a(j,\Lambda)} \big( \|\mu_k(\Lambda) - u_j\| - \|\mu_{a(j,\Lambda)}(\Lambda) - u_j\| \big), \qquad \Delta(\Lambda) := \min_{1 \le j \le K_0} \mathrm{gap}_j(\Lambda). $$
For $\delta > 0$, set $\mathcal{G}_\delta := \{\Lambda \in \Theta_K : \Delta(\Lambda) \ge \delta\}$. Local separation (Assumption 5.2) holds at $\Lambda^\star_{\omega_0}$ if there exists $\delta > 0$ such that $\Delta(\Lambda^\star_{\omega_0}) \ge \delta$. In this case, $M$ is $C^1$ at $\Lambda^\star_{\omega_0}$ (Prop. S2.17) with Jacobian $J := \nabla_\Lambda M(\Lambda^\star_{\omega_0})$, and we define $C_M := J W_{\omega_0} J^T$ and $C_j := E_j C_M E_j^T$.

S1.6.2 Consistency of stability scores

To investigate the properties of the stability scores, we write out their mathematical definition. Let $\Phi(G)$ be a deterministic measurable selection from the (possibly set-valued) argmax
$$ \Phi(G) \in \arg\max_{\Lambda \in \Theta_K} \Big\{ \int q_\Lambda\, dG + \omega_0 H_{\mathcal{S}}(q_\Lambda) \Big\}. $$
Fix disjoint neighborhoods $U_1, \ldots, U_{K_0}$ of $u_1, \ldots, u_{K_0}$. Define $F_j(G) = 1\{E_j M(\Phi(G)) \in U_j\}$ and the targets $\pi_{j,N} := \mathbb{P}(F_j(P_N) = 1)$, $\pi_j := \lim_{N \to \infty} \pi_{j,N}$. The bootstrap stability score defined in Section 5 can be reformulated as $s_j = \frac1L \sum_{\ell=1}^L F_j(P^{*(\ell)}_N)$; in other words,
$$ s_j = \frac1L \sum_{\ell=1}^L 1\{ \exists \text{ a matched component in } U_j \text{ for the } \ell\text{-th bootstrap fit} \}. $$
Let $\mathbb{E}^*$ and $\mathrm{Var}^*$ denote the conditional expectation and variance given $X_{1:N}$.
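To make the mechanics concrete, the sketch below computes the scores $s_j$ on a toy one-dimensional example, with a deliberately crude stand-in for the bootstrap GERVE refit $\Phi(P^{*(\ell)}_N)$ (splitting the sample at zero). Only the resampling, nearest-mean matching, and indicator-averaging steps are meant to be illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(-2, 0.4, 300), rng.normal(2, 0.4, 300)])

def fit_means(data):
    # Stand-in for a bootstrap GERVE refit: returns candidate component means.
    return np.array([data[data < 0].mean(), data[data >= 0].mean()])

targets = np.array([-2.0, 2.0])   # u_1, u_2: targets matched from the original fit
radius = 0.5                      # neighbourhoods U_j = B(u_j, r_j)
L = 200
hits = np.zeros(len(targets))
for _ in range(L):
    Xb = rng.choice(X, size=X.size, replace=True)   # P*_N: resample with replacement
    mus = fit_means(Xb)
    for j, u in enumerate(targets):
        k = np.argmin(np.abs(mus - u))              # nearest-mean matching a(j, Lambda)
        hits[j] += abs(mus[k] - u) < radius         # F_j = 1{matched mean in U_j}
print(hits / L)                                     # stability scores s_j, near 1 if stable
```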
The following proposition, proven in Section S2.4.7, establishes statistical guarantees for the stability scores.

Proposition S1.13 (Consistency of stability scores). Index the stability scores $s_{j,L}$ by $L$. Under Assumptions S2.10 and 5.2, and with $U_j$ chosen to have a positive margin at $\Lambda^\star_{\omega_0}$ (Lemma S2.15), define $\tau_{j,N} := \mathbb{E}^*[F_j(P^*_N)] = \mathbb{E}^*[s_{j,L}]$ (for all $L$). Then:

(i) (Conditional LLN). For each $N$, conditionally on the data, $s_{j,L} \overset{\mathrm{a.s.}}{\longrightarrow} \tau_{j,N}$ as $L \to \infty$, and $\mathrm{Var}^*(s_{j,L}) = \mathrm{Var}^*(F_j(P^*_N))/L \to 0$. In particular, for any $\epsilon > 0$,
$$ \mathbb{P}^*\big( |s_{j,L} - \tau_{j,N}| > \epsilon \big) \le 2 \exp(-2L\epsilon^2). $$
(ii) (Large-$N$ limit). As $N \to \infty$,
$$ \tau_{j,N} \overset{P}{\longrightarrow} \pi_j, \qquad \pi_{j,N} := \mathbb{P}(F_j(P_N) = 1) \to \pi_j, $$
where $\pi_j := F_j(P) \in \{0, 1\}$ (equivalently, $\pi_j = \mathbb{P}(F_j(P) = 1)$). Consequently, for any sequence $L_N \to \infty$ as $N \to \infty$, we have $s_{j,L_N} \overset{P}{\longrightarrow} \pi_j$.

S1.6.3 Handling local optima

Although $\mathcal{L}_{\omega_0}$ is nonconvex in general, in practice we use the following guardrails to mitigate the issue:

1. Annealing + overcompleteness. We use an overcomplete model and start from a higher smoothing level, continuing down to $\omega_0$. Small components are pruned. This reliably enlarges the global basin explored by GERVE.
2. Stability scores. Per-mode stability scores $s_j$ are monitored, with values near 1 indicating robust recovery. As a diagnostic, we can flag $s_j < 0.5$ as potentially problematic and $s_j < 0.3$ as likely unstable.
3. Multiple initialisations. We can run several random starts at $\omega_0$ and keep the best solution (by objective value).

S2 Proofs

S2.1 Preliminaries: Empirical process properties

In this section, we first prove that the classes of Gaussian densities and of GMMs with bounded covariances are Glivenko–Cantelli (GC) and admit integrable envelopes. We recall (van der Vaart and Wellner, 1996) that a class of functions $\mathcal{F}$ on $\mathbb{R}^d$ is Glivenko–Cantelli (GC) with respect to a probability measure $P$ if, for i.i.d. samples $X_1, \ldots, X_N$ of a random variable $X \sim P$, the empirical average of any $f \in \mathcal{F}$, $P_N f := N^{-1} \sum_{i=1}^N f(X_i)$, converges uniformly to its expectation $Pf = \mathbb{E}_P[f(X)]$, in probability:
$$ \sup_{f \in \mathcal{F}} |P_N f - Pf| \overset{P}{\longrightarrow} 0 \quad \text{as } N \to \infty. $$
We say that $\mathcal{F}$ admits an integrable envelope $F$ if $F(x) \ge |f(x)|$ for all $f \in \mathcal{F}$, with $\mathbb{E}_P[F(X)] < \infty$.

Lemma S2.1 (GC for fixed-covariance Gaussian densities). Let $\phi_{\Sigma_0}(x) = \mathcal{N}(x; 0, \Sigma_0)$ with a given $\Sigma_0 \succ 0$, and define the class $\mathcal{F}_0 = \{ f_\mu(x) = \phi_{\Sigma_0}(x - \mu),\ \mu \in \mathbb{R}^d \}$. If $P$ is a probability measure with continuous and bounded density, then $\mathcal{F}_0$ is Glivenko–Cantelli and has an integrable envelope with respect to $P$.

Proof of Lemma S2.1. Standard results on empirical processes guarantee that $\mathcal{F}_0$ is GC if it admits an envelope and is totally bounded in $L^1(P)$ (see van der Vaart and Wellner, 1996, Thm. 2.4.3). A straightforward envelope is the constant $F(x) := (2\pi)^{-d/2} |\Sigma_0|^{-1/2} \ge f_\mu(x) \ge 0$, and $\mathbb{E}_P[F(X)] = F < \infty$. It only remains to prove total boundedness in $L^1(P)$.

1. Denote by $p$ the density of $P$. Since $\|f_\mu\|_{L^1(P)} = \int f_\mu(x)\, p(x)\, dx = \int \phi_{\Sigma_0}(x - \mu)\, p(x)\, dx = (p * \phi_{\Sigma_0})(\mu)$, we have $\|f_\mu\|_{L^1(P)} \to 0$ as $\|\mu\| \to \infty$. It follows that for any $\varepsilon > 0$, there exists some $M_\varepsilon$ such that $\sup_{\|\mu\| > M_\varepsilon} \|f_\mu\|_{L^1(P)} < \varepsilon/2$.
2. On the compact ball $B(0, R) = \{x : \|x\| \le R\}$, $(x, \mu) \mapsto f_\mu(x)$ is jointly $C^1$.
Hence, there exists $L_R$ with $\sup_{\|x\| \le R} |f_\mu(x) - f_\nu(x)| \le L_R \|\mu - \nu\|$. Choose $R$ so that $2 P(B(0,R)^c) \|F\|_\infty < \varepsilon/2$.

3. Cover $\{\mu \in \mathbb{R}^d : \|\mu\| \le M_\varepsilon\}$ with a finite $\delta$-net $\{\mu_j,\ j \in [J_\delta]\}$, where $\delta = \varepsilon/(2 L_R)$. (A $\delta$-net for a set $A$ in a metric space $(X, d)$ is a finite subset $N_\delta \subset X$ such that for every $x \in A$ there exists $y \in N_\delta$ with $d(x, y) \le \delta$; the relevant metric for total boundedness here is the $L^1(P)$ distance between densities.) That means for all $\mu$ with $\|\mu\| \le M_\varepsilon$, there exists $j \in [J_\delta]$ such that $\|\mu - \mu_j\| \le \delta$. Compute
$$ \|f_\mu - f_{\mu_j}\|_{L^1(P)} = \int_{B(0,R)} |f_\mu(x) - f_{\mu_j}(x)|\, p(x)\, dx + \int_{B(0,R)^c} |f_\mu(x) - f_{\mu_j}(x)|\, p(x)\, dx. $$
On $B(0,R)$, $|f_\mu(x) - f_{\mu_j}(x)| \le L_R \|\mu - \mu_j\| \le \varepsilon/2$, so the first integral is smaller than $\varepsilon/2$. On $B(0,R)^c$, the integral is smaller than $2 \|F\|_\infty P(B(0,R)^c) < \varepsilon/2$. Thus $\|f_\mu - f_{\mu_j}\|_{L^1(P)} \le \varepsilon$. Finally, if $\|\mu\| > M_\varepsilon$ then, by step 1, $\|f_\mu - 0\|_{L^1(P)} = \|f_\mu\|_{L^1(P)} < \varepsilon/2$, which concludes the proof that $\mathcal{F}_0$ is totally bounded in $L^1(P)$. □

Lemma S2.2 (GC for bounded location-scale Gaussian densities). Let $\mathcal{F} = \{ f_{\mu,\Sigma}(x) = \mathcal{N}(x; \mu, \Sigma) : \mu \in \mathbb{R}^d,\ \sigma^2_{\min} I \preceq \Sigma \preceq \sigma^2_{\max} I \}$. If $P$ is a probability measure with continuous and bounded density, then $\mathcal{F}$ is GC and has an integrable envelope with respect to $P$.

Proof of Lemma S2.2. The proof is analogous to that of Lemma S2.1. An admissible envelope is $F(x) := (2\pi\sigma^2_{\min})^{-d/2} \ge \sup_{\mu,\Sigma} f_{\mu,\Sigma}(x)$, with $\mathbb{E}_P[F(X)] < \infty$. Analogous arguments prove total boundedness in $L^1(P)$:

1. For large $\|\mu\|$, $\|f_{\mu,\Sigma}\|_{L^1(P)} = (p * \phi_\Sigma)(\mu) \to 0$ uniformly over $\Sigma$ in the spectral band $[\sigma^2_{\min}, \sigma^2_{\max}]$, as the tails of $\phi_\Sigma$ are controlled by $\sigma_{\max}$. Define $M_\varepsilon$ such that on $\{\mu : \|\mu\| > M_\varepsilon\}$, the functions are $\varepsilon/2$-close to $0$ in $L^1(P)$.
2. On the compact ball $B(0,R) = \{x : \|x\| \le R\}$, $(x, \mu, \Sigma) \mapsto f_{\mu,\Sigma}(x)$ is jointly $C^1$. Because $\Sigma^{-1}$ has eigenvalues in $[\sigma^{-2}_{\max}, \sigma^{-2}_{\min}]$, the derivatives are uniformly bounded; hence $f$ is Lipschitz in $(\mu, \Sigma)$, i.e., there exists $L_R > 0$ such that $\sup_{x \in B(0,R)} |f_{\mu,\Sigma}(x) - f_{\mu',\Sigma'}(x)| \le L_R \|(\mu, \Sigma) - (\mu', \Sigma')\|$.
3. Cover $\{\mu : \|\mu\| \le M_\varepsilon\} \times \{\Sigma : \sigma^2_{\min} I \preceq \Sigma \preceq \sigma^2_{\max} I\}$ with a $\delta$-net for $\delta = \varepsilon/(2 L_R)$. The $L^1(P)$-distance is small on $B(0,R)$, with the tail region $\{x : \|x\| > R\}$ controlled by choosing $R$ such that $2 P(B(0,R)^c) \|F\|_\infty < \varepsilon/2$.

As before, we conclude that $\mathcal{F}$ is totally bounded, hence GC. □

Lemma S2.3 (GC for finite mixtures). Fix $K \in \mathbb{N}$. Let $\mathcal{F}$ be totally bounded with respect to $P$ and with envelope $F \in L^1(P)$. Define the class of $K$-component mixtures with components in $\mathcal{F}$,
$$ \mathcal{F}^{(K)} = \Big\{ f(x) = \sum_{k=1}^K \pi_k f_k(x) : \pi \in \Delta_K,\ f_k \in \mathcal{F} \Big\}, $$
where $\Delta_K$ is the $(K-1)$-dimensional simplex. Then $\mathcal{F}^{(K)}$ is GC with envelope $KF \in L^1(P)$.

Proof of Lemma S2.3. We need to prove that $KF$ is an integrable envelope and that $\mathcal{F}^{(K)}$ is totally bounded in $L^1(P)$. First,
$$ |f(x)| = \Big| \sum_{k=1}^K \pi_k f_k(x) \Big| \le \sum_{k=1}^K \pi_k |f_k(x)| \le \sum_{k=1}^K |f_k(x)| \le K F(x), $$
hence $\mathbb{E}_P[KF] < \infty$, and $KF$ is an envelope. The total boundedness of $\mathcal{F}^{(K)}$ follows from that of $\mathcal{F}$:

1. Let $\varepsilon > 0$. Since $\mathcal{F}$ is totally bounded, there exists a finite $\varepsilon/2$-net $\{g^{(1)}, \ldots, g^{(J_\varepsilon)}\}$ in $L^1(P)$.
For each $f_k$, denote by $g^{(j_k)}$ an $\varepsilon/2$-approximation. For each $g^{(j_k)}$,
$$ \|g^{(j_k)}\|_{L^1(P)} \le \|g^{(j_k)} - f_k\|_{L^1(P)} + \|f_k\|_{L^1(P)} \le \varepsilon/2 + \|F\|_{L^1(P)} =: F_\varepsilon. $$
2. Approximate $\pi \in \Delta_K$ using a finite grid $\Pi_\eta$ of mesh $\eta = \varepsilon/(2 K F_\varepsilon)$ and denote by $\tilde\pi_k$ the closest element of $\Pi_\eta$ to $\pi_k$.
3. Compute
$$ \Big\| \sum_{k=1}^K \pi_k f_k - \sum_{k=1}^K \tilde\pi_k g^{(j_k)} \Big\|_{L^1(P)} \le \sum_{k=1}^K \pi_k \|f_k - g^{(j_k)}\|_{L^1(P)} + \sum_{k=1}^K F_\varepsilon |\pi_k - \tilde\pi_k|. $$
The two terms are each bounded by $\varepsilon/2$, so the sum is bounded by $\varepsilon$. Thus $\mathcal{F}^{(K)}$ is totally bounded, hence GC. □

Corollary S2.4 (GC for bounded-covariance Gaussian mixtures). Let $\mathcal{G}$ be the class of Gaussian densities with covariance eigenvalues in $[\sigma^2_{\min}, \sigma^2_{\max}]$ and unconstrained means. For fixed $K$, the class of $K$-component Gaussian mixtures with those covariance bounds is GC under $P$ and has an envelope in $L^1(P)$.

Proof of Corollary S2.4. In the proof of Lemma S2.2, we established the conditions needed to apply Lemma S2.3 with $\mathcal{F} = \mathcal{G}$, which in turn proves the result above. □
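The uniform convergence asserted by these GC results can be visualised numerically. For $p = \mathcal{N}(0, 1)$, the population mean of a Gaussian density is available in closed form, $\mathbb{E}_p[q_{\mu,\sigma}(X)] = \mathcal{N}(\mu; 0, 1 + \sigma^2)$, so the deviation can be evaluated exactly. A minimal sketch, where a finite parameter grid is a crude proxy for the supremum over the whole class:

```python
import numpy as np

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

rng = np.random.default_rng(0)
mus = np.linspace(-3, 3, 61)
sig2 = np.linspace(0.2, 2.0, 19)        # spectral band [sigma2_min, sigma2_max]
for N in [100, 1000, 10000]:
    X = rng.normal(0, 1, N)
    sup_dev = max(
        # E_p[q_{m,v}(X)] = N(m; 0, 1 + v) by the Gaussian convolution identity
        abs(gauss(X, m, v).mean() - gauss(m, 0.0, 1.0 + v))
        for m in mus for v in sig2
    )
    print(N, sup_dev)                   # uniform deviation shrinks as N grows
```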
Finally, we present the $P$-Donsker property. It upgrades GC to a uniform CLT, which yields rates for empirical averages over $\mathcal{F}$ and underpins our asymptotic normality results. Consider a probability measure $P$ and i.i.d. samples $X_1, \ldots, X_N$ of a random variable $X \sim P$. A class of functions $\mathcal{F}$ on $\mathbb{R}^d$ is $P$-Donsker if the empirical process $\mathbb{G}_N$, defined as $\mathbb{G}_N f := \sqrt N (P_N - P) f$ for $f \in \mathcal{F}$ and viewed as a random element in $\ell^\infty(\mathcal{F})$, converges in distribution to a tight mean-zero Gaussian process $\mathbb{G}_P$ with covariance
$$ \mathrm{Cov}(\mathbb{G}_P f, \mathbb{G}_P g) = P[(f - Pf)(g - Pg)], \qquad f, g \in \mathcal{F}. $$
Equivalently, $\mathcal{F}$ is $P$-Donsker if a functional CLT holds uniformly over $f \in \mathcal{F}$. A sufficient condition is that $\mathcal{F} \subset L^2(P)$ admits an $L^2(P)$ envelope and has a finite bracketing entropy integral, for example
$$ \int_0^\delta \sqrt{ \log N_{[\,]}\big( \varepsilon, \mathcal{F}, L^2(P) \big) }\, d\varepsilon < \infty, $$
where $\log N_{[\,]}(\varepsilon, \mathcal{F}, L^2(P))$ denotes the $\varepsilon$-bracketing entropy of $\mathcal{F}$ in $L^2(P)$; see van der Vaart and Wellner (1996), Chapter 2.5.

Lemma S2.5 (GC and Donsker properties for Gaussian mixtures and their gradients). For fixed $K$ and compact $\Theta_K$, the classes $\mathcal{Q} = \{q_\Lambda : \Lambda \in \Theta_K\}$, $\mathcal{Q}^{(u)}_\nabla = \{[\nabla_\Lambda q_\Lambda]_u : \Lambda \in \Theta_K\}$ for $u = 1, \ldots, \dim(\Theta_K)$, and $\mathcal{Q}^{(u,v)}_{\nabla^2} = \{[\nabla^2_\Lambda q_\Lambda]_{uv} : \Lambda \in \Theta_K\}$ for $u, v = 1, \ldots, \dim(\Theta_K)$, are $P$-Donsker (hence Glivenko–Cantelli) with envelopes in $L^2(P)$.

Proof of Lemma S2.5. First, we prove that $\mathcal{Q}$, $\mathcal{Q}_\nabla$ and $\mathcal{Q}_{\nabla^2}$ admit envelopes in $L^2(P)$; then we use the fact that they are Lipschitz in the parameter to establish the Donsker property.

Step 1. $L^2$ envelopes. By continuity of Gaussian densities and compactness of $\Theta_K$, there exist constants $C_0, C_1, C_2 < \infty$ (depending only on $\Theta_K$) such that, for all $x \in \mathbb{R}^d$, $\Lambda \in \Theta_K$ and $u, v = 1, \ldots, \dim(\Theta_K)$,
$$ 0 \le q_\Lambda(x) \le C_0, \quad |[\nabla_\Lambda q_\Lambda(x)]_u| \le C_1(1 + \|x\| + \|x\|^2), \quad |[\nabla^2_\Lambda q_\Lambda(x)]_{uv}| \le C_2(1 + \|x\| + \|x\|^2 + \|x\|^3 + \|x\|^4). $$
Hence an envelope for $\mathcal{Q}$ is $Q(x) \equiv C_0 \in L^2(P)$. Since $P$ is supported on a bounded set $\mathcal{S} \subset \mathbb{R}^d$, there is $R < \infty$ with $\|X\| \le R$ a.s., so an envelope for $\mathcal{Q}^{(u)}_\nabla$ is $Q_\nabla(X) \equiv C_1(1 + R + R^2)$ a.s.; in particular $Q_\nabla \in L^2(P)$, and each $\mathcal{Q}^{(u)}_\nabla$ admits an $L^2(P)$ envelope. Similarly, for $u, v = 1, \ldots, \dim(\Theta_K)$, an envelope for $\mathcal{Q}^{(u,v)}_{\nabla^2}$ is $Q_{\nabla^2}(X) \equiv C_2(1 + R + R^2 + R^3 + R^4)$ a.s., and each $\mathcal{Q}^{(u,v)}_{\nabla^2}$ also admits an $L^2(P)$ envelope.

Step 2. Lipschitz into $L^2(P)$. Fix $\Lambda, \tilde\Lambda \in \Theta_K$ and set $\Lambda_t = \tilde\Lambda + t(\Lambda - \tilde\Lambda)$, $t \in [0, 1]$. By the mean value theorem in Banach spaces,
$$ q_\Lambda - q_{\tilde\Lambda} = \int_0^1 \langle \nabla_\Lambda q_{\Lambda_t},\ \Lambda - \tilde\Lambda \rangle\, dt, $$
so
$$ \|q_\Lambda - q_{\tilde\Lambda}\|_{L^2(P)} \le \|\Lambda - \tilde\Lambda\| \sup_{\Lambda' \in \Theta_K} \|\nabla_\Lambda q_{\Lambda'}\|_{L^2(P)} \le \|Q_\nabla\|_{L^2(P)}\, \|\Lambda - \tilde\Lambda\|. $$
By the same argument applied to $\nabla_\Lambda q_\Lambda$ and $\nabla^2_\Lambda q_\Lambda$, and using that all second derivatives $\nabla^2_\Lambda q_\Lambda(x)$ and third derivatives $\nabla^3_\Lambda q_\Lambda(x)$ are uniformly bounded on $\mathcal{S} \times \Theta_K$ (Gaussian mixture densities have continuous second and third derivatives, and $\Theta_K$ ensures bounded means and eigenvalues bounded away from $0$ and $\infty$), there exist $C'_1, C'_2 < \infty$ with
$$ \|[\nabla_\Lambda q_\Lambda]_u - [\nabla_\Lambda q_{\tilde\Lambda}]_u\|_{L^2(P)} \le C'_1 \|\Lambda - \tilde\Lambda\|, \qquad \|[\nabla^2_\Lambda q_\Lambda]_{uv} - [\nabla^2_\Lambda q_{\tilde\Lambda}]_{uv}\|_{L^2(P)} \le C'_2 \|\Lambda - \tilde\Lambda\|, $$
for all $u, v = 1, \ldots, \dim(\Theta_K)$. Thus all maps $\Lambda \mapsto q_\Lambda$, $\Lambda \mapsto [\nabla_\Lambda q_\Lambda]_u$ and $\Lambda \mapsto [\nabla^2_\Lambda q_\Lambda]_{uv}$ are Lipschitz into $L^2(P)$ with $L^2$ envelopes.

Step 3. Donsker. By the Euclidean parametric-Lipschitz bracketing bound (van der Vaart and Wellner, 1996, Thm. 2.7.11), the $L^2(P)$ bracketing numbers of $\mathcal{Q}$, $\mathcal{Q}^{(u)}_\nabla$ and $\mathcal{Q}^{(u,v)}_{\nabla^2}$ grow at most polynomially on the compact $\Theta_K$. Therefore the bracketing integral is finite, and the bracketing Donsker theorem (van der Vaart and Wellner, 1996, Chp. 2.5) yields that $\mathcal{Q}$, $\mathcal{Q}^{(u)}_\nabla$ and $\mathcal{Q}^{(u,v)}_{\nabla^2}$ are $P$-Donsker, for all $u, v = 1, \ldots, \dim(\Theta_K)$. □

S2.2 Proof of Theorem 2.1

We prove Theorem 2.1, establishing the consistency, as $\omega \to 0$, of the optimal variational solution $q^\star_\omega := \arg\max_{\mathcal{Q}} \mathcal{L}_\omega(q)$, where $\mathcal{Q}$ is the family of truncated Gaussian mixtures with $K \ge I$ components and upper-bounded covariances. We proceed in three steps. First, we identify the asymptotics of the Gibbs measure $g_\omega$ via a Laplace expansion near each mode of $f$ (Prop. S2.6 and S2.7). Then, we construct a truncated mixture $\tilde q^{\mathcal{S}}_\omega$ on $\mathcal{S}$ such that $\mathrm{KL}(\tilde q^{\mathcal{S}}_\omega \| g_\omega) \to 0$ (Prop. S2.8). This implies $\mathrm{KL}(q^\star_\omega \| g_\omega) \to 0$, so finally, combining with the Laplace expansions of $g_\omega$, we deduce the claims of Theorem 2.1.

Setup. Let $f \in C^3(\mathcal{S})$ be bounded with finitely many nondegenerate global modes $\{x^\star_i\}_{i=1}^I$, and let $f^\star := \sup_{x \in \mathcal{S}} f(x) = f(x^\star_i)$. Nondegeneracy means $H_i = -\nabla^2 f(x^\star_i) \succ 0$ for all $i \in [I]$. These modes are isolated and in the interior of $\mathcal{S}$, so there exist disjoint open, bounded neighborhoods $U_1, \ldots, U_I \subset \mathcal{S}$ of the $x^\star_i$ such that each $U_i$ contains no other critical point of $f$. Without loss of generality, assume that $f$ is strictly positive on $\bigcup_{i=1}^I U_i$. Define $U_0 := \mathcal{S} \setminus \bigcup_{i=1}^I U_i$. The restricted objective over a family $\mathcal{Q}$ is
$$ \mathcal{L}_\omega(q) = \mathbb{E}_q[f(X)] + \omega H_{\mathcal{S}}(q), \qquad q \in \mathcal{Q}. $$
Here, $\mathcal{Q}$ is the $\mathcal{S}$-truncated Gaussian mixture family with $K \ge I$ components and all covariance eigenvalues in $(0, \sigma^2_{\max}]$. Recall that for any Gaussian mixture $q_\Lambda$ on $\mathbb{R}^d$, we write its $\mathcal{S}$-truncation
$$ q^{\mathcal{S}}_\Lambda(x) := \frac{q_\Lambda(x)\, 1_{\mathcal{S}}(x)}{\int_{\mathcal{S}} q_\Lambda(y)\, dy}. $$
Let $q^\star_\omega$ denote any choice of maximiser of $\mathcal{L}_\omega(q)$ in $\mathcal{Q}$, for fixed $\omega > 0$. To simplify notation, we denote the conditional distributions of $q^\star_\omega$ and $g_\omega$ on the regions $U_i$, $i = 0, \ldots, I$, respectively by $\phi_{i,\omega}$ and $\psi_{i,\omega}$, defined as
$$ \phi_{i,\omega}(x) := q^{\star, U_i}_\omega(x) = \frac{q^\star_\omega(x)\, 1_{U_i}(x)}{\alpha_{i,\omega}}, \qquad \psi_{i,\omega}(x) := g^{U_i}_\omega(x) = \frac{g_\omega(x)\, 1_{U_i}(x)}{\gamma_{i,\omega}}, \tag{S6} $$
where
$$ \alpha_{i,\omega} := \int_{U_i} q^\star_\omega(x)\, dx, \qquad \gamma_{i,\omega} := \int_{U_i} g_\omega(x)\, dx. \tag{S7} $$
Step 1: Laplace expansion and local Gibbs measure properties

We start by applying Laplace's method around each isolated mode $x^\star_i$ of $f$ to the unrestricted Gibbs optimiser $g_\omega$. The two following propositions characterise the shape of the Gibbs measure under annealing.

Proposition S2.6 (Laplace expansion near a mode). Fix $i \in [I]$ and consider an open neighborhood $U_i$ of the mode $x^\star_i$. Then, as $\omega \to 0$,
$$ \int_{U_i} \exp\{f(x)/\omega\}\, dx = e^{f^\star/\omega} (2\pi\omega)^{d/2} (\det H_i)^{-1/2} \big( 1 + o(1) \big), $$
and for any bounded function $h$ of class $C^2$,
$$ \int_{U_i} h(x)\, g_\omega(dx) = h(x^\star_i)\, c_i(\omega) + \frac{\omega}{2}\, \mathrm{Tr}\big( H_i^{-1} \nabla^2 h(x^\star_i) \big)\, c_i(\omega) + o(\omega), $$
where $c_i(\omega) = c_i + o(1)$ and
$$ c_i := \frac{(\det H_i)^{-1/2}}{\sum_{j=1}^I (\det H_j)^{-1/2}}. $$
Moreover, $g_\omega \to \sum_{i=1}^I c_i \delta_{x^\star_i}$ weakly, and there exist $C, \eta > 0$ such that $g_\omega(U_0) \le C e^{-\eta/\omega}$.

Proof of Proposition S2.6. The proof uses the classical Laplace method (Hwang, 1980; see also Le Minh et al., 2025, Proof of Thm. 2): in $U_i$, write the Taylor expansion of $f(x)$ at $x^\star_i$,
$$ f(x) = f(x^\star_i) - \tfrac12 (x - x^\star_i)^T H_i (x - x^\star_i) + R_i(x), \qquad |R_i(x)| \le L_i \|x - x^\star_i\|^3, $$
where $L_i > 0$ depends on $U_i$. Then
$$ \int_{U_i} \exp(f(x)/\omega)\, dx = \exp(f(x^\star_i)/\omega) \int_{U_i} \exp\Big\{ -\tfrac1{2\omega} (x - x^\star_i)^T H_i (x - x^\star_i) + \tfrac1\omega R_i(x) \Big\}\, dx, $$
and standard dominated convergence in Gaussian coordinates yields the first display. Summing over $i$ and normalising gives the weights $c_i(\omega) \to c_i$ and the weak limit. The second display follows from the same expansion applied to $h$ (up to second order) and Gaussian moment calculations. The last statement holds because the modes are isolated: there exists $\eta > 0$ such that $\sup_{x \in U_0} f(x) \le f^\star - \eta$, so there exists $C > 0$ such that $g_\omega(U_0) \le C e^{-\eta/\omega}$. □
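The first display of Proposition S2.6 is easy to check numerically in one dimension. Taking $f(x) = \sin x$ on $U = (0, \pi)$, the mode is $x^\star = \pi/2$ with $f^\star = 1$ and $H = -f''(x^\star) = 1$, so the integral should behave like $e^{1/\omega}\sqrt{2\pi\omega}$. A minimal sketch:

```python
import numpy as np
from scipy.integrate import quad

# 1-d check of the Laplace expansion: f(x) = sin(x), mode at pi/2, f* = 1, H = 1.
for omega in [0.5, 0.1, 0.02]:
    val, _ = quad(lambda x: np.exp(np.sin(x) / omega), 0.0, np.pi)
    laplace = np.exp(1.0 / omega) * np.sqrt(2 * np.pi * omega)
    print(omega, val / laplace)   # ratio tends to 1 as omega -> 0
```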
Proposition S2.7. For $i \in [I]$ and $\psi_{i,\omega}$ defined in equation (S6), there exists $\eta > 0$ such that
$$ \mathbb{E}_{\psi_{i,\omega}}[X] = x^\star_i + O(e^{-\eta/\omega}), \qquad \mathrm{Cov}_{\psi_{i,\omega}}(X) = \omega H_i^{-1} + O(e^{-2\eta/\omega}). $$

Proof of Proposition S2.7. (i) Expectation. Consider the coordinate functions $h_j(x) = x_j$, for $j \in [d]$. These functions are $C^2$ with $h_j(x^\star_i) = x^\star_{i,j}$ and $\nabla^2 h_j(x^\star_i) \equiv 0$. Applying Proposition S2.6 to $h = h_j$,
$$ \int_{U_i} x_j\, g_\omega(x)\, dx = x^\star_{i,j}\, c_i(\omega) + o(\omega). $$
Next, Proposition S2.6 also implies $\gamma_{i,\omega} = g_\omega(U_i) = c_i(\omega) + o(e^{-\eta/\omega})$. So
$$ \mathbb{E}_{\psi_{i,\omega}}[X_j] = \frac{\int_{U_i} x_j\, g_\omega(x)\, dx}{\gamma_{i,\omega}} = \frac{x^\star_{i,j}\, c_i(\omega) + o(\omega)}{c_i(\omega) + o(e^{-\eta/\omega})}. $$
Since $c_i(\omega) \to c_i > 0$, the denominator is bounded away from $0$ and we can divide: for all $j \in [d]$, $\mathbb{E}_{\psi_{i,\omega}}[X_j] = x^\star_{i,j} + O(e^{-\eta/\omega})$. Therefore $\mathbb{E}_{\psi_{i,\omega}}[X] = x^\star_i + O(e^{-\eta/\omega})$.

(ii) Covariance. Next, consider the centred second-moment functions $h_{jk}(x) := (x_j - x^\star_{i,j})(x_k - x^\star_{i,k})$, for $(j,k) \in [d]^2$. These functions are $C^2$ with $h_{jk}(x^\star_i) = 0$ and
$$ \big[ \nabla^2 h_{jk}(x^\star_i) \big]_{\ell m} = \frac{\partial^2}{\partial x_\ell\, \partial x_m} \big[ (x_j - x^\star_{i,j})(x_k - x^\star_{i,k}) \big] \Big|_{x = x^\star_i} = \delta_{j\ell}\delta_{km} + \delta_{jm}\delta_{k\ell}, $$
where $\delta_{\cdot\cdot}$ is the Kronecker symbol; i.e., $\nabla^2 h_{jk}(x^\star_i)$ is the matrix with entries $(j,k)$ and $(k,j)$ set to one and all other entries set to zero. Hence, by symmetry of $H_i^{-1}$,
$$ \mathrm{Tr}\big( H_i^{-1} \nabla^2 h_{jk}(x^\star_i) \big) = [H_i^{-1}]_{jk} + [H_i^{-1}]_{kj} = 2 [H_i^{-1}]_{jk}. $$
Applying Proposition S2.6 to $h_{jk}$,
$$ \int_{U_i} h_{jk}(x)\, g_\omega(x)\, dx = \omega [H_i^{-1}]_{jk}\, c_i(\omega) + o(\omega). $$
Again, we can divide by $\gamma_{i,\omega} = c_i(\omega) + o(e^{-\eta/\omega})$:
$$ \mathbb{E}_{\psi_{i,\omega}}\big[ (X_j - x^\star_{i,j})(X_k - x^\star_{i,k}) \big] = \frac{\int_{U_i} h_{jk}(x)\, g_\omega(x)\, dx}{\gamma_{i,\omega}} = \omega [H_i^{-1}]_{jk} + O(e^{-\eta/\omega}). $$
Therefore $\mathbb{E}_{\psi_{i,\omega}}[(X - x^\star_i)(X - x^\star_i)^T] = \omega H_i^{-1} + O(e^{-\eta/\omega})$. Finally,
$$ \mathrm{Cov}_{\psi_{i,\omega}}(X) = \mathbb{E}_{\psi_{i,\omega}}[(X - x^\star_i)(X - x^\star_i)^T] - \big( \mathbb{E}_{\psi_{i,\omega}}[X] - x^\star_i \big)\big( \mathbb{E}_{\psi_{i,\omega}}[X] - x^\star_i \big)^T, $$
where we have already shown that $\mathbb{E}_{\psi_{i,\omega}}[X] - x^\star_i = O(e^{-\eta/\omega})$, so $\mathrm{Cov}_{\psi_{i,\omega}}(X) = \omega H_i^{-1} + O(e^{-2\eta/\omega})$. □

Step 2: Competitor mixture and KL convergence

For $\omega > 0$ and $i \in [I]$, define the Gaussian densities $\tilde q_{i,\omega}(x) = \mathcal{N}(x; x^\star_i, \omega H_i^{-1})$. These Gaussian densities locally approximate the Gibbs measure at each mode. We define the associated mixture with Laplace weights,
$$ \tilde q_\omega(x) = \sum_{i=1}^I c_i\, \tilde q_{i,\omega}(x), $$
where $c_i \propto (\det H_i)^{-1/2}$ and $\sum_i c_i = 1$. We use the truncated mixture $\tilde q^{\mathcal{S}}_\omega := \tilde q_\omega 1_{\mathcal{S}} / \int_{\mathcal{S}} \tilde q_\omega$ as a competitor for $q^\star_\omega$ in $\mathcal{Q}$. In this step, we prove that $\tilde q^{\mathcal{S}}_\omega$ converges in KL to $g_\omega$.

Proposition S2.8. Let $\tilde q^{\mathcal{S}}_\omega := \tilde q_\omega 1_{\mathcal{S}} / \int_{\mathcal{S}} \tilde q_\omega$ be the truncation of $\tilde q_\omega$ on $\mathcal{S}$. Then $\tilde q^{\mathcal{S}}_\omega \in \mathcal{Q}$ and, as $\omega \to 0$, $\mathrm{KL}(\tilde q^{\mathcal{S}}_\omega \| g_\omega) \to 0$.

To prove Proposition S2.8, we derive an intermediate lemma.

Lemma S2.9. Let $\mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) := \int_{\mathcal{S}} \tilde q_\omega \log(\tilde q_\omega / g_\omega)$. As $\omega \to 0$, $\mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) \to 0$.

Proof of Lemma S2.9. We first compute the KL divergence for a single Gaussian $\tilde q_{i,\omega}$ and show that it converges to a finite constant. We then use the disjoint-mode structure to prove that the mixture $\tilde q_\omega$ is asymptotically indistinguishable from $g_\omega$ in KL.

Step 1. Log-ratio for a single component. Fix $i$. On $U_i$ we write
$$ f(x) = f(x^\star_i) - \tfrac12 (x - x^\star_i)^T H_i (x - x^\star_i) + \Delta_i(x), \qquad |\Delta_i(x)| \le C \|x - x^\star_i\|^3. $$
By definition,
$$ g_\omega(x) = \frac{\exp\{f(x)/\omega\}}{Z_\omega}, \qquad \tilde q_{i,\omega}(x) = (2\pi\omega)^{-d/2} (\det H_i)^{1/2} \exp\Big\{ -\tfrac1{2\omega} (x - x^\star_i)^T H_i (x - x^\star_i) \Big\}. $$
From Proposition S2.6,
$$ Z_\omega = e^{f(x^\star_i)/\omega} (2\pi\omega)^{d/2} \sum_{j=1}^I (\det H_j)^{-1/2} \big( 1 + o(1) \big), $$
so, for $x \in U_i$, the terms $f(x^\star_i)/\omega$ cancel and
$$ \log g_\omega(x) = \frac{f(x)}{\omega} - \log Z_\omega = -\frac1{2\omega}(x - x^\star_i)^T H_i (x - x^\star_i) + \frac{\Delta_i(x)}{\omega} - \frac d2 \log(2\pi\omega) - \log\Big( \sum_{j=1}^I (\det H_j)^{-1/2} \Big) + o(1). $$
On the other hand,
$$ \log \tilde q_{i,\omega}(x) = -\frac1{2\omega}(x - x^\star_i)^T H_i (x - x^\star_i) - \frac d2 \log(2\pi\omega) + \frac12 \log\det H_i. $$
Subtracting gives
$$ \log\frac{\tilde q_{i,\omega}(x)}{g_\omega(x)} = -\frac{\Delta_i(x)}{\omega} + \frac12 \log\det H_i + \log\Big( \sum_{j=1}^I (\det H_j)^{-1/2} \Big) + o(1) = -\frac{\Delta_i(x)}{\omega} - \log c_i + o(1), $$
where $c_i = (\det H_i)^{-1/2} / \sum_{j=1}^I (\det H_j)^{-1/2}$. For $X \sim \tilde q_{i,\omega}$, write $X = x^\star_i + \sqrt\omega\, Y$ with $Y \sim \mathcal{N}(0, H_i^{-1})$. The remainder bound implies $|\Delta_i(X)| \le C \|X - x^\star_i\|^3 = C \omega^{3/2} \|Y\|^3$, so $\Delta_i(X)/\omega = O(\sqrt\omega\, \|Y\|^3)$. Since $Y$ has finite moments of all orders, $\mathbb{E}[\Delta_i(X)/\omega] \to 0$ as $\omega \to 0$. Dominated convergence on $U_i$ yields, as $\omega \to 0$,
$$ \mathrm{KL}_{\mathcal{S}}(\tilde q_{i,\omega} \| g_\omega) = \int_{\mathcal{S}} \tilde q_{i,\omega}(x) \log\frac{\tilde q_{i,\omega}(x)}{g_\omega(x)}\, dx \to -\log c_i. $$
Step 2. KL for the mixture $\tilde q_\omega$ near the modes. Recall $\tilde q_\omega(x) = \sum_{i=1}^I c_i\, \tilde q_{i,\omega}(x)$. We show that the limiting constant $-\log c_i$ cancels mode by mode inside the mixture, so that $\mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) \to 0$. Fix $R > 0$ and for each $i$ define the shrinking ball $B_i(R, \omega) := \{x : \|x - x^\star_i\| \le R\sqrt\omega\} \subset U_i$ for all small $\omega$. Two modes $x^\star_i \ne x^\star_j$ are separated, so there exists $\delta > 0$ such that $\|x^\star_i - x^\star_j\| \ge \delta$. For $x \in B_i(R, \omega)$, the triangle inequality gives
$$ \|x - x^\star_j\| \ge \delta - R\sqrt\omega > \delta/2 $$
for sufficiently small $\omega$. Thus, since all covariances are $O(\omega)$, there exists a constant $a > 0$ such that for $j \ne i$ and $x \in B_i(R, \omega)$,
$$ \frac{\tilde q_{j,\omega}(x)}{\tilde q_{i,\omega}(x)} \le e^{-a/\omega}. $$
Therefore, on $B_i(R, \omega)$, $\tilde q_\omega(x) = c_i\, \tilde q_{i,\omega}(x) \big( 1 + O(e^{-a/\omega}) \big)$, so
$$ \log\frac{\tilde q_\omega(x)}{g_\omega(x)} = \log c_i + \log\frac{\tilde q_{i,\omega}(x)}{g_\omega(x)} + o(1), $$
with an $o(1)$ term uniform in $x \in B_i(R, \omega)$ as $\omega \to 0$. Substituting the expansion from Step 1,
$$ \log\frac{\tilde q_{i,\omega}(x)}{g_\omega(x)} = -\frac{\Delta_i(x)}{\omega} - \log c_i + o(1), $$
we obtain, on $B_i(R, \omega)$,
$$ \log\frac{\tilde q_\omega(x)}{g_\omega(x)} = -\frac{\Delta_i(x)}{\omega} + o(1). \tag{S8} $$
Let $X \sim \tilde q_{i,\omega}$. As above, $X = x^\star_i + \sqrt\omega\, Y$ with $Y \sim \mathcal{N}(0, H_i^{-1})$, so $|\Delta_i(X)/\omega| \le C\sqrt\omega\, \|Y\|^3$ and $\mathbb{E}|\Delta_i(X)/\omega| \to 0$. Using (S8) and dominated convergence on $B_i(R, \omega)$ yields
$$ \int_{B_i(R,\omega)} \tilde q_{i,\omega}(x) \log\frac{\tilde q_\omega(x)}{g_\omega(x)}\, dx \to 0 \quad \text{as } \omega \to 0, $$
for each fixed $R$.

Step 3. KL for the mixture $\tilde q_\omega$ in the tails. By Gaussian tail bounds, $\tilde q_{i,\omega}(B_i(R, \omega)^c) \le C_1 e^{-d_1 R^2}$ for some constants $C_1, d_1 > 0$, uniformly in $\omega$. On $B_i(R, \omega)^c \cap U_i$, both $\tilde q_\omega$ and $g_\omega$ are exponentially small in $R^2$ and $|\log(\tilde q_\omega / g_\omega)|$ grows at most polynomially in $\|x - x^\star_i\|$, so
$$ \int_{B_i(R,\omega)^c \cap U_i} \tilde q_{i,\omega}(x) \Big| \log\frac{\tilde q_\omega(x)}{g_\omega(x)} \Big|\, dx \le C_2 e^{-d_2 R^2} $$
for suitable constants $C_2, d_2 > 0$. On $U_0$, Proposition S2.6 gives $g_\omega(U_0) \le C e^{-\eta/\omega}$, and the same Gaussian tail arguments yield $\tilde q_{i,\omega}(U_0) \le C'_1 e^{-a_i/\omega}$, so the contribution of $U_0$ to the KL integral is bounded by $C'_2 e^{-a'_i/\omega}$.

Step 4. Conclusion. Putting everything together and summing over $i$ with weights $c_i$,
$$ \mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) = \sum_{i=1}^I c_i \left[ \int_{B_i(R,\omega)} \tilde q_{i,\omega}(x) \log\frac{\tilde q_\omega(x)}{g_\omega(x)}\, dx + \int_{B_i(R,\omega)^c \cup U_0} \tilde q_{i,\omega}(x) \log\frac{\tilde q_\omega(x)}{g_\omega(x)}\, dx \right]. $$
For each fixed $R$, the first term in brackets tends to $0$ as $\omega \to 0$, by the local analysis above. The second term is bounded in absolute value by $C e^{-cR^2} + C' e^{-a/\omega}$, independently of $\omega$. Given $\varepsilon > 0$, pick $R$ large enough that the tail bound is $< \varepsilon/2$, then let $\omega \to 0$ to make the local terms $< \varepsilon/2$. Thus $\mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) \to 0$. □

Proof of Proposition S2.8. Bounding the Gaussian tails of $\tilde q_\omega$ gives $\beta_\omega := \int_{\mathbb{R}^d \setminus \mathcal{S}} \tilde q_\omega(x)\, dx = O(e^{-c/\omega})$ for some $c > 0$. From Lemma S2.9, $\mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) = o(1)$. Since $\tilde q^{\mathcal{S}}_\omega = \tilde q_\omega / (1 - \beta_\omega)$ on $\mathcal{S}$, we have
$$ \mathrm{KL}(\tilde q^{\mathcal{S}}_\omega \| g_\omega) - \mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) = \Big( \frac1{1 - \beta_\omega} - 1 \Big) \mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) - \log(1 - \beta_\omega) = O(\beta_\omega)\, \mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) + O(\beta_\omega) = O(e^{-c/\omega}). $$
Therefore $|\mathrm{KL}(\tilde q^{\mathcal{S}}_\omega \| g_\omega) - \mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega)| \to 0$, which proves the proposition. □

Step 3: Conclusion, proof of Theorem 2.1

We now prove all the statements of Theorem 2.1.
Recall the conditionals of $g_\omega$ and $q^\star_\omega$ on $U_i$, $i = 0, \ldots, I$:
$$ \phi_{i,\omega}(x) = \frac{q^\star_\omega(x)\, 1_{U_i}(x)}{\alpha_{i,\omega}}, \qquad \psi_{i,\omega}(x) = \frac{g_\omega(x)\, 1_{U_i}(x)}{\gamma_{i,\omega}}, $$
where $\alpha_{i,\omega} = \int_{U_i} q^\star_\omega(x)\, dx$ and $\gamma_{i,\omega} = \int_{U_i} g_\omega(x)\, dx$, and the competitor mixture $\tilde q^{\mathcal{S}}_\omega = \tilde q_\omega 1_{\mathcal{S}} / \int_{\mathcal{S}} \tilde q_\omega(x)\, dx$, with $\tilde q_\omega(x) = \sum_{i=1}^I c_i\, \mathcal{N}(x; x^\star_i, \omega H_i^{-1})$.

Proof of Theorem 2.1. The principle of the proof is as follows: because the competitor $\tilde q^{\mathcal{S}}_\omega$ of Proposition S2.8 converges in KL to $g_\omega$, the same holds for $q^\star_\omega$, and therefore $q^\star_\omega$ inherits the asymptotics of the Gibbs measure $g_\omega$. Steps 1–2 prove Theorem 2.1 (i), and Step 3 establishes Theorem 2.1 (ii).

Step 1. Convergence of $q^\star_\omega$ in KL. For any density $q$ on $\mathcal{S}$, we have the variational identity $\mathcal{L}_\omega(q) = \omega \log Z_\omega - \omega\, \mathrm{KL}(q \| g_\omega)$. The competitor mixture satisfies $\tilde q^{\mathcal{S}}_\omega \in \mathcal{Q}$. Since $q^\star_\omega$ maximises $\mathcal{L}_\omega$ over $\mathcal{Q}$, we have, as $\omega \to 0$,
$$ \mathrm{KL}(q^\star_\omega \| g_\omega) \le \mathrm{KL}(\tilde q^{\mathcal{S}}_\omega \| g_\omega) \to 0, $$
where the limit follows from Proposition S2.8. Thus the optimal truncated mixture satisfies
$$ \mathrm{KL}(q^\star_\omega \| g_\omega) \to 0. \tag{S9} $$

Step 2. Asymptotic mass in mode neighborhoods. For $i = 0, \ldots, I$, consider $\alpha_{i,\omega}$ and $\gamma_{i,\omega}$ of (S7). Let $T : \mathcal{S} \to \{0, 1, \ldots, I\}$ send $x \in U_j$ to $j$, and let $\alpha_\omega$ and $\gamma_\omega$ be the induced discrete laws. Theorem 4.1 of Kullback and Leibler (1951) gives
$$ \mathrm{KL}(\alpha_\omega \| \gamma_\omega) \le \mathrm{KL}(q^\star_\omega \| g_\omega) \to 0. $$
From Proposition S2.6, $\gamma_{i,\omega} \to c_i$ for $i \in [I]$, and $\gamma_{0,\omega} \to 0$. Hence (S9) implies $\alpha_{i,\omega} \to c_i$ for $i \in [I]$ and $\alpha_{0,\omega} \to 0$.

Step 3. Local conditional densities. For $i \in [I]$, consider the conditional densities $\phi_{i,\omega}$ and $\psi_{i,\omega}$ from (S6). The chain rule for the KL divergence over the partition $\{U_0, U_1, \ldots, U_I\}$ gives
$$ \mathrm{KL}(q^\star_\omega \| g_\omega) = \mathrm{KL}(\alpha_\omega \| \gamma_\omega) + \sum_{i=0}^I \alpha_{i,\omega}\, \mathrm{KL}(\phi_{i,\omega} \| \psi_{i,\omega}). $$
Using (S9) and $\alpha_{i,\omega} \to c_i > 0$, we obtain, for each $i \in [I]$, $\mathrm{KL}(\phi_{i,\omega} \| \psi_{i,\omega}) \to 0$. Furthermore, since all densities live on the bounded set $U_i$, Pinsker's inequality implies convergence in total variation. From Proposition S2.7, the conditional law $\psi_{i,\omega}$ has mean $x^\star_i + o(1)$ and covariance $o(1)$. Hence the KL convergence forces the same limits for the mixture's local conditional moments:
$$ \mathbb{E}_{\phi_{i,\omega}}[X] = x^\star_i + o(1), \qquad \mathrm{Cov}_{\phi_{i,\omega}}(X) = o(1). \qquad \square $$

S2.3 Proof of Theorem 3.1

In this section, we prove Theorem 3.1, establishing fixed-temperature asymptotics for the empirical maximisers as the sample size grows to infinity. For convenience, we recall the assumptions of Theorem 3.1.

Assumption S2.10. Define the following set of assumptions:

(B1) (Data) $X_1, \ldots, X_N \overset{\mathrm{i.i.d.}}{\sim} p$, supported on a bounded $\mathcal{S} \subset \mathbb{R}^d$, with $p$ bounded and continuous on $\mathcal{S}$.
(B2) (Unicity of the population maximiser) $\mathcal{L}_{\omega_0}$ admits a unique maximiser $\Lambda^\star_{\omega_0} \in \mathrm{int}(\Theta_K)$.
(B3) (Nondegeneracy of the population maximiser) At the maximiser $\Lambda^\star_{\omega_0}$, $H^\star_{\omega_0} := H_{\omega_0}(\Lambda^\star_{\omega_0})$ is positive definite.

Each of the following subsections proves one of the three statements of Theorem 3.1:

(i) Under (B1), the population objective $\mathcal{L}_{\omega_0}$ attains its maximum on $\Theta_K$.
(ii) Under (B1)–(B2), any empirical maximiser $\widehat\Lambda_{\omega_0,N} \in \arg\max \widehat{\mathcal{L}}_{\omega_0,N}$ satisfies $\widehat\Lambda_{\omega_0,N} \overset{P}{\to} \Lambda^\star_{\omega_0}$.
(iii) Under (B1)–(B3), any empirical maximiser $\widehat\Lambda_{\omega_0,N} \in \arg\max \widehat{\mathcal{L}}_{\omega_0,N}$ satisfies
$$ \sqrt N \big( \widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0} \big) \overset{D}{\longrightarrow} \mathcal{N}(0, W_{\omega_0}), \qquad W_{\omega_0} := (H^\star_{\omega_0})^{-1} V_{\omega_0} (H^\star_{\omega_0})^{-1}, \quad V_{\omega_0} := \mathrm{Var}\big[ \nabla_\Lambda q_\Lambda(X) \big]\big|_{\Lambda = \Lambda^\star_{\omega_0}}. $$
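In practice, the limiting covariance $W_{\omega_0}$ is estimated by a plug-in "sandwich" formula: $V_{\omega_0}$ is replaced by the sample covariance of the per-observation scores $\nabla_\Lambda q_\Lambda(X_i)$ at the fitted parameter, and $H^\star_{\omega_0}$ by an estimate of the negative Hessian of the objective at the fit. A minimal sketch with hypothetical inputs (the score matrix and Hessian estimate are assumed to come from the fitting routine):

```python
import numpy as np

def sandwich(scores, H_hat):
    """Plug-in sandwich estimate of W = H^{-1} V H^{-1}.
    scores: (N, p) array of per-observation gradients at the fitted Lambda;
    H_hat:  (p, p) positive-definite estimate of H*_{omega_0}."""
    V_hat = np.cov(scores, rowvar=False)
    H_inv = np.linalg.inv(H_hat)
    return H_inv @ V_hat @ H_inv   # estimates N * Cov(Lambda_hat)

# Toy usage with made-up inputs of the right shapes:
rng = np.random.default_rng(0)
print(sandwich(rng.normal(size=(500, 3)), np.eye(3)))
```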
S2.3.1 Existence of a population maximiser

Proof of (i). On the compact $\Theta_K$, $(\Lambda, x) \mapsto q_\Lambda(x)$ is continuous and uniformly bounded on the bounded set $\mathcal{S}$. For each $x \in \mathcal{S}$, $\Lambda \mapsto q_\Lambda(x)$ is continuous, so by dominated convergence (bounded envelope on $\mathcal{S}$), $\Lambda \mapsto \mathbb{E}[q_\Lambda(X)]$ is continuous. The entropy term $\Lambda \mapsto H_{\mathcal{S}}(q_\Lambda)$ is continuous on $\Theta_K$ as well. Thus, by Weierstrass's theorem, $\mathcal{L}_{\omega_0}$ is continuous on the compact $\Theta_K$ and attains its maximum. □

S2.3.2 Consistency of the empirical maximisers

The proof consists of two steps: a uniform law of large numbers, due to the Glivenko–Cantelli (GC) property of the Gaussian mixture family on $\Theta_K$ (see App. S2.1, Cor. S2.4), and an application of the argmax theorem. For more details on the GC property, we refer to Section S2.1.

Proof of (ii). The class $\{q_\Lambda : \Lambda \in \Theta_K\}$ is Glivenko–Cantelli with an integrable envelope on $\mathcal{S}$, so as $N \to \infty$,
$$ \sup_{\Lambda \in \Theta_K} \Big| \frac1N \sum_{i=1}^N q_\Lambda(X_i) - \mathbb{E}_p[q_\Lambda(X)] \Big| \overset{P}{\longrightarrow} 0. $$
Adding the deterministic term $\omega_0 H_{\mathcal{S}}(q_\Lambda)$ yields $\sup_{\Lambda \in \Theta_K} |\widehat{\mathcal{L}}_{\omega_0,N}(\Lambda) - \mathcal{L}_{\omega_0}(\Lambda)| \overset{P}{\to} 0$. Since $\mathcal{L}_{\omega_0}$ has a unique maximiser $\Lambda^\star_{\omega_0}$ (Assumption (B2)), the argmax theorem (van der Vaart, 1998, Thm. 5.7) yields $\widehat\Lambda_{\omega_0,N} \overset{P}{\to} \Lambda^\star_{\omega_0}$ as $N \to \infty$. □

S2.3.3 Asymptotic normality of the empirical maximisers

The asymptotic normality result is derived from a Taylor expansion and the Donsker property of the Gaussian mixture family and of its gradients; Lemma S2.5 provides these properties.

Proof of (iii). By Lemma S2.5, $\mathcal{Q}$ is GC, so $\sup_{\Lambda \in \Theta_K} |\widehat{\mathcal{L}}_{\omega_0,N}(\Lambda) - \mathcal{L}_{\omega_0}(\Lambda)| \overset{P}{\to} 0$; hence $\widehat\Lambda_{\omega_0,N} \overset{P}{\to} \Lambda^\star_{\omega_0}$ by the standard M-estimator consistency theorem (van der Vaart, 1998, Thm. 5.7). Write
$$ \nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda) = \frac1N \sum_{i=1}^N \nabla_\Lambda q_\Lambda(X_i) + \omega_0 \nabla_\Lambda H_{\mathcal{S}}(q_\Lambda), \qquad \nabla_\Lambda \mathcal{L}_{\omega_0}(\Lambda) = \mathbb{E}[\nabla_\Lambda q_\Lambda(X)] + \omega_0 \nabla_\Lambda H_{\mathcal{S}}(q_\Lambda). $$
Thus $\nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda^\star_{\omega_0}) - \nabla_\Lambda \mathcal{L}_{\omega_0}(\Lambda^\star_{\omega_0}) = (P_N - P)\, \nabla_\Lambda q_\Lambda(X) \big|_{\Lambda = \Lambda^\star_{\omega_0}}$ has mean zero and variance $V_{\omega_0}/N$. Furthermore, a Taylor expansion around $\Lambda^\star_{\omega_0}$ gives
$$ 0 = \nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\widehat\Lambda_{\omega_0,N}) = \nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda^\star_{\omega_0}) - H^\star_{\omega_0}\big( \widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0} \big) + r_N, $$
where, by the integral form of Taylor's theorem,
$$ r_N = \tfrac12 \big( \widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0} \big)^T \Big[ \nabla^3_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}\big( \Lambda^\star_{\omega_0} + \tau (\widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0}) \big) \Big] \big( \widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0} \big) $$
for some $\tau \in (0, 1)$. Since $\Theta_K$ is compact and $\mathcal{S}$ (the support of $P$) is bounded, the parametric map $\Lambda \mapsto q_\Lambda(x)$ is $C^\infty$ and all third derivatives are uniformly bounded on $\mathcal{S} \times \Theta_K$. Hence $\nabla^3_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}$ is uniformly bounded, giving $|r_N| \le C \|\widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0}\|^2 = o_P(\|\widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0}\|)$. By the CLT for the $P$-Donsker gradient class (Lemma S2.5), $\sqrt N\, \nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda^\star_{\omega_0}) \overset{D}{\to} \mathcal{N}(0, V_{\omega_0})$. Since $H^\star_{\omega_0}$ is positive definite, rearranging yields the usual Z-estimator linearisation (van der Vaart, 1998, Thm. 5.41):
$$ \sqrt N \big( \widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0} \big) = (H^\star_{\omega_0})^{-1} \sqrt N\, \nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda^\star_{\omega_0}) + o_P(1), $$
and therefore $\sqrt N (\widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0}) \overset{D}{\to} \mathcal{N}\big( 0, (H^\star_{\omega_0})^{-1} V_{\omega_0} (H^\star_{\omega_0})^{-1} \big)$ as $N \to \infty$. □

S2.4 Proofs for Sections 5 and S1.6

S2.4.1 Preliminaries: Bounded-Lipschitz metric

Let $(\mathcal{S}, d)$ be a metric space. For $f : \mathcal{S} \to \mathbb{R}$, define
$$ \|f\|_\infty = \sup_{x \in \mathcal{S}} |f(x)|, \qquad \mathrm{Lip}(f) = \sup_{x \ne y} \frac{|f(x) - f(y)|}{d(x, y)}. $$
For two measures $\nu, \nu'$ on $\mathcal{S}$, the bounded-Lipschitz (BL) metric is
$$ d_{\mathrm{BL}}(\nu, \nu') := \sup_{\|f\|_{\mathrm{BL}} \le 1} \Big| \int f\, d\nu - \int f\, d\nu' \Big|, \qquad \|f\|_{\mathrm{BL}} := \|f\|_\infty + \mathrm{Lip}(f). $$
Convergence in $d_{\mathrm{BL}}$ implies weak convergence (Dudley, 2002), denoted by $\overset{w}{\to}$. We use it to state bootstrap results compactly.

S2.4.2 Technical lemmas

We state the following lemmas, with proofs in Section S2.4.3.

Lemma S2.11 (Regularity for Gaussian mixtures). For a bounded space $\mathcal{S}$ and on the compact $\Theta_K$, there exists a neighborhood $\mathcal{V} \subset \Theta_K$ of $\Lambda^\star_{\omega_0}$ such that (i) $\Lambda \mapsto q_\Lambda(x)$ is $C^2$ on $\mathcal{V}$ for every $x \in \mathcal{S}$; (ii) $\sup_{\Lambda \in \mathcal{V}} \|\nabla_\Lambda q_\Lambda(x)\| \le F_1(x)$ for a bounded-Lipschitz function $F_1$ on $\mathcal{S}$; (iii) $\sup_{\Lambda \in \mathcal{V}} \|\nabla^2_\Lambda q_\Lambda(x)\| \le F_2(x)$ for a bounded-Lipschitz function $F_2$ on $\mathcal{S}$.

Lemma S2.12 (Empirical-process CLT and bootstrap). Let $\mathcal{Q}^{(u)}_\nabla := \{ x \mapsto [\nabla_\Lambda q_\Lambda(x)]_u : \Lambda \in \Theta_K \}$ and $\mathcal{Q}^{(u,v)}_{\nabla^2} := \{ x \mapsto [\nabla^2_\Lambda q_\Lambda(x)]_{uv} : \Lambda \in \Theta_K \}$ for $u, v = 1, \ldots, \dim(\Theta_K)$, and define the function class
$$ \mathcal{F} = \bigcup_{u=1}^{\dim(\Theta_K)} \mathcal{Q}^{(u)}_\nabla \ \cup \bigcup_{u,v=1}^{\dim(\Theta_K)} \mathcal{Q}^{(u,v)}_{\nabla^2}. $$
We have, as $N \to \infty$,
$$ \sqrt N (P_N - P) \overset{w}{\longrightarrow} \mathbb{G}_P \ \text{in } \ell^\infty(\mathcal{F}), \qquad \sqrt N (P^*_N - P_N) \overset{w}{\longrightarrow} \mathbb{G}_P \ \text{in } \ell^\infty(\mathcal{F}), $$
the latter conditionally on $X_{1:N}$, in $P$-probability, where $\mathbb{G}_P$ denotes the $P$-Brownian bridge indexed by $\mathcal{F}$.

Lemma S2.13 (Hadamard differentiability of the argmax map). Let $\mathcal{V} \subset \Theta_K$ be the set on which Lemma S2.11 holds and $\mathcal{F}$ the function class of Lemma S2.12. Let $\ell^\infty_{\mathrm{lin}}(\mathcal{F}) := \{ h \in \ell^\infty(\mathcal{F}) : h \text{ is linear} \}$. For $h \in \ell^\infty_{\mathrm{lin}}(\mathcal{F})$, write $h(\nabla_\Lambda q_\Lambda) := (h([\nabla_\Lambda q_\Lambda]_u))_{1 \le u \le \dim(\Theta_K)}$ and $h(\nabla^2_\Lambda q_\Lambda) := (h([\nabla^2_\Lambda q_\Lambda]_{uv}))_{1 \le u, v \le \dim(\Theta_K)}$. Define $S : \ell^\infty_{\mathrm{lin}}(\mathcal{F}) \times \mathcal{V} \to \mathbb{R}^{\dim(\Theta_K)}$ by
$$ S(G, \Lambda) = G(\nabla_\Lambda q_\Lambda) + \omega_0 \nabla_\Lambda H_{\mathcal{S}}(q_\Lambda). $$
Under Assumption S2.10, we have the following:

(i) There exists a neighborhood $\mathcal{U}$ of $P$ such that for every $G \in \mathcal{U}$, the equation in $\Lambda$, $S(G, \Lambda) = 0$, admits a unique solution $\Lambda(G) \in \mathcal{V}$.
(ii) The map $\Phi : G \mapsto \Lambda(G)$ is Hadamard differentiable at $G = P$, tangentially to $\ell^\infty_{\mathrm{lin}}(\mathcal{F})$, with $D\Phi_P(h) = (H^\star_{\omega_0})^{-1} h(\nabla_\Lambda q_{\Lambda^\star_{\omega_0}})$.

Lemma S2.14 (Measurability of the matching functional). $\Lambda \mapsto M(\Lambda)$ is Borel-measurable on $\Theta_K$. In fact, $M$ is continuous on each piece of a finite Borel partition of $\Theta_K$.

Lemma S2.15 (Continuity of the stability indicator $F_j$). Fix $j \in [K_0]$ and let $U_j = B(u_j, r_j)$ be an open ball. Assume (by shrinking $r_j$ if needed) that $E_j M(\Lambda^\star_{\omega_0})$ lies at positive distance from $\partial U_j$. Then, under Assumption S2.10, for any deterministic measurable selector $\Phi$ from the argmax, $F_j(G) = 1\{E_j M(\Phi(G)) \in U_j\}$ is continuous at $G = P$ with respect to weak convergence: if $G_n \overset{w}{\to} P$, then $F_j(G_n) \to F_j(P)$.

S2.4.3 Proofs of lemmas

Sketch of proof of Lemma S2.11. Gaussian mixture densities are $C^\infty$ in the parameters $\Lambda$ as long as the covariances remain positive definite within a compact subset, which is guaranteed in $\Theta_K$. This gives (i). The first- and second-order derivatives of $q_\Lambda(x)$ with respect to $\Lambda$ are products of polynomials in $x$ and Gaussian densities, which are smooth in $x$, hence Lipschitz on the bounded set $\mathcal{S}$. Since $\mathcal{S}$ and $\Theta_K$ are compact, they are also uniformly bounded. Typically, there exist $C_1, C_2 > 0$ such that
$$ F_1(x) = C_1(1 + \|x\| + \|x\|^2), \qquad F_2(x) = C_2(1 + \|x\| + \|x\|^2 + \|x\|^3 + \|x\|^4), $$
which are bounded-Lipschitz on $\mathcal{S}$, so (ii) and (iii) hold. □
Proof of Lemma S2.12. By Lemma S2.5, each class $\mathcal{Q}^{(u)}_\nabla$ and $\mathcal{Q}^{(u,v)}_{\nabla^2}$ is $P$-Donsker and admits a measurable envelope in $L^2(P)$. Since $\mathcal{F}$ is a finite union of $P$-Donsker classes, it is also $P$-Donsker (van der Vaart and Wellner, 1996, Example 2.10.7). Hence it is also Glivenko–Cantelli and
$$ \sqrt N (P_N - P) \overset{w}{\longrightarrow} \mathbb{G}_P \quad \text{in } \ell^\infty(\mathcal{F}), $$
where $\mathbb{G}_P$ is a $P$-Brownian bridge indexed by $\mathcal{F}$. Furthermore, the nonparametric bootstrap empirical process satisfies the conditional functional CLT (van der Vaart and Wellner, 1996, Thm. 3.6.3): $\sqrt N (P^*_N - P_N) \overset{w}{\to} \mathbb{G}_P$ in $\ell^\infty(\mathcal{F})$, conditionally on $X_{1:N}$, in $P$-probability. □

Proof of Lemma S2.13. (i) By Lemma S2.11, for all $x \in \mathcal{S}$, the map $\Lambda \mapsto \nabla_\Lambda q_\Lambda(x)$ is $C^1$ on $\mathcal{V}$, with $\sup_{\Lambda \in \mathcal{V}} \|\nabla_\Lambda q_\Lambda(x)\| \le F_1(x)$ and $\sup_{\Lambda \in \mathcal{V}} \|\nabla^2_\Lambda q_\Lambda(x)\| \le F_2(x)$ for bounded functions $F_1$ and $F_2$. Therefore, for any $G \in \ell^\infty(\mathcal{F})$, the maps $\Lambda \mapsto G([\nabla_\Lambda q_\Lambda]_u)$, $u = 1, \ldots, \dim(\Theta_K)$, have derivatives $(G([\nabla^2_\Lambda q_\Lambda]_{u1}), \ldots, G([\nabla^2_\Lambda q_\Lambda]_{u \dim(\Theta_K)}))$ and are $C^1$ on $\mathcal{V}$. This implies that $S(G, \Lambda)$ is $C^1$ in $\Lambda$, and
$$ S(P, \Lambda^\star_{\omega_0}) = 0, \qquad \partial_\Lambda S(P, \Lambda^\star_{\omega_0}) = \nabla^2_\Lambda \mathcal{L}_{\omega_0}(\Lambda^\star_{\omega_0}) = A \prec 0, $$
where $A := -H^\star_{\omega_0}$ is invertible by Assumption S2.10. Furthermore, since $\partial_\Lambda S(P, \Lambda)$ is continuous in $\Lambda$, there is $r > 0$ such that on the closed ball $\bar B(\Lambda^\star_{\omega_0}, r) \subset \mathcal{V}$,
$$ \sup_{\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)} \|\partial_\Lambda S(P, \Lambda) - A\| \le \frac{1}{3 \|A^{-1}\|}. \tag{S10} $$
Now define, for $G \in \ell^\infty(\mathcal{F})$ and $\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)$,
$$ T_G(\Lambda) := \Lambda - A^{-1} S(G, \Lambda). $$
To prove (i), we apply the Banach fixed-point theorem (Zeidler, 1995, Th. 1.A) to $T_G$, for all $G$ in some neighborhood of $P$, on the set $\bar B(\Lambda^\star_{\omega_0}, r)$. For this, we need to verify a self-mapping property and a contraction property, which hold on a sufficiently small neighborhood of $P$: we consider the neighborhood $\mathcal{T}$ formed by the $G$ such that
$$ \|G - P\|_{\mathcal{F}} \le \min\Big( \frac{1}{3 \|A^{-1}\|},\ \frac{r}{3 \|A^{-1}\|} \Big). $$

Contraction property. $S(G, \Lambda)$ is $C^1$ on $\mathcal{V}$, so for any $\Lambda_1$ and $\Lambda_2$ in $\bar B(\Lambda^\star_{\omega_0}, r)$, the mean-value theorem gives $S(G, \Lambda_1) - S(G, \Lambda_2) = \partial_\Lambda S(G, \tilde\Lambda)(\Lambda_1 - \Lambda_2)$ for some $\tilde\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)$ on the segment between $\Lambda_1$ and $\Lambda_2$. Therefore
$$ T_G(\Lambda_1) - T_G(\Lambda_2) = \big( I - A^{-1} \partial_\Lambda S(G, \tilde\Lambda) \big)(\Lambda_1 - \Lambda_2), $$
which means
$$ \|T_G(\Lambda_1) - T_G(\Lambda_2)\| \le \sup_{\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)} \|I - A^{-1} \partial_\Lambda S(G, \Lambda)\| \times \|\Lambda_1 - \Lambda_2\|. \tag{S11} $$
Furthermore,
$$ \|I - A^{-1} \partial_\Lambda S(G, \Lambda)\| \le \|A^{-1}\| \big( \|A - \partial_\Lambda S(P, \Lambda)\| + \|\partial_\Lambda S(G, \Lambda) - \partial_\Lambda S(P, \Lambda)\| \big). $$
Then, by definition of $\|\cdot\|_{\mathcal{F}}$, for any $\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)$,
$$ \|\partial_\Lambda S(G, \Lambda) - \partial_\Lambda S(P, \Lambda)\| = \|(G - P)(\nabla^2_\Lambda q_\Lambda)\| \le \|G - P\|_{\mathcal{F}}. $$
For all $G \in \mathcal{T}$, we have $\|G - P\|_{\mathcal{F}} \le 1/(3\|A^{-1}\|)$, so $\|A^{-1}\| \|\partial_\Lambda S(G, \Lambda) - \partial_\Lambda S(P, \Lambda)\| \le 1/3$. Combining this with (S10), we have $\sup_{\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)} \|I - A^{-1} \partial_\Lambda S(G, \Lambda)\| \le 2/3$, so in (S11),
$$ \|T_G(\Lambda_1) - T_G(\Lambda_2)\| \le \tfrac23 \|\Lambda_1 - \Lambda_2\|, \tag{S12} $$
which proves contraction for $G \in \mathcal{T}$.

Self-mapping property. Since $S(P, \Lambda^\star_{\omega_0}) = 0$,
$$ T_G(\Lambda^\star_{\omega_0}) - \Lambda^\star_{\omega_0} = -A^{-1} S(G, \Lambda^\star_{\omega_0}) = -A^{-1}(G - P)(\nabla_\Lambda q_{\Lambda^\star_{\omega_0}}). $$
By definition of $\|\cdot\|_{\mathcal{F}}$, $\|(G - P)(\nabla_\Lambda q_{\Lambda^\star_{\omega_0}})\| \le \|G - P\|_{\mathcal{F}}$.
Hence, for G ∈ T,
    ∥T_G(Λ⋆_ω0) − Λ⋆_ω0∥ ≤ ∥A⁻¹∥ ∥G − P∥_F ≤ r/3.
Thus, using the contraction property (S12), for any Λ ∈ B(Λ⋆_ω0, r),
    ∥T_G(Λ) − Λ⋆_ω0∥ ≤ ∥T_G(Λ) − T_G(Λ⋆_ω0)∥ + ∥T_G(Λ⋆_ω0) − Λ⋆_ω0∥ ≤ 2r/3 + r/3 ≤ r,
and this ensures that T_G(B(Λ⋆_ω0, r)) ⊂ B(Λ⋆_ω0, r), the self-mapping property for G ∈ T.

Conclusion. Finally, by the Banach fixed-point theorem applied to T_G, for every G in the neighborhood T of P there exists a unique fixed point Λ(G) ∈ B(Λ⋆_ω0, r) ⊂ V; since Λ(G) = T_G(Λ(G)) is equivalent to S(G, Λ(G)) = 0, this gives (i).

(ii) Let t_n → 0 and h_n → h in ℓ∞_lin(F) under ∥·∥_F. Set G_n := P + t_n h_n and Λ_n := Φ(G_n) = Λ(G_n). Our objective is to show that
    DΦ_P(h) := lim_{n→∞} (Φ(G_n) − Φ(P))/t_n = (H⋆_ω0)⁻¹ h(∇_Λ q_{Λ⋆_ω0}).
First, the sequence (h_n) is bounded in ∥·∥_F, so ∥G_n − P∥_F = |t_n| ∥h_n∥_F → 0; hence G_n ∈ T and Λ_n ∈ V, and by (i) we have 0 = S(G_n, Λ_n). Furthermore, since we also have S(P, Λ⋆_ω0) = 0,
    0 = [S(G_n, Λ_n) − S(G_n, Λ⋆_ω0)] + [S(G_n, Λ⋆_ω0) − S(P, Λ⋆_ω0)].
By linearity, the second difference is S(G_n, Λ⋆_ω0) − S(P, Λ⋆_ω0) = t_n h_n(∇_Λ q_{Λ⋆_ω0}). For the first term, since S(G, Λ) is C¹ in Λ, the mean-value theorem gives some Λ̃_n on the segment between Λ⋆_ω0 and Λ_n such that S(G_n, Λ_n) − S(G_n, Λ⋆_ω0) = ∂_Λ S(G_n, Λ̃_n)(Λ_n − Λ⋆_ω0). Thus we have the relation
    0 = ∂_Λ S(G_n, Λ̃_n)(Λ_n − Λ⋆_ω0) + t_n h_n(∇_Λ q_{Λ⋆_ω0}).    (S13)
On the other hand, the fixed-point and contraction properties of T_G for all G ∈ T, proven in (i), imply that Λ_n → Λ⋆_ω0. Indeed,
    ∥Λ_n − Λ⋆_ω0∥ = ∥T_{G_n}(Λ_n) − T_P(Λ⋆_ω0)∥
                  ≤ ∥T_{G_n}(Λ_n) − T_{G_n}(Λ⋆_ω0)∥ + ∥T_{G_n}(Λ⋆_ω0) − T_P(Λ⋆_ω0)∥
                  ≤ (2/3) ∥Λ_n − Λ⋆_ω0∥ + ∥T_{G_n}(Λ⋆_ω0) − T_P(Λ⋆_ω0)∥,
hence
    ∥Λ_n − Λ⋆_ω0∥ ≤ 3 ∥T_{G_n}(Λ⋆_ω0) − T_P(Λ⋆_ω0)∥ ≤ 3 ∥A⁻¹∥ ∥(G_n − P)(∇_Λ q_{Λ⋆_ω0})∥ ≤ 3 ∥A⁻¹∥ ∥G_n − P∥_F → 0.
This also implies Λ̃_n → Λ⋆_ω0. Furthermore, we have
    ∥∂_Λ S(G_n, Λ̃_n) − A∥ ≤ ∥∂_Λ S(G_n, Λ̃_n) − ∂_Λ S(P, Λ̃_n)∥ + ∥∂_Λ S(P, Λ̃_n) − A∥ → 0,
because the first term is ∥(G_n − P)(∇²_Λ q_{Λ̃_n})∥ ≤ sup_{Λ∈V} ∥(G_n − P)(∇²_Λ q_Λ)∥ ≤ ∥G_n − P∥_F → 0, and the second term goes to 0 by continuity of ∂_Λ S(P, Λ) in Λ at Λ⋆_ω0. This means
    ∂_Λ S(G_n, Λ̃_n) → A = −H⋆_ω0,    (S14)
so there is some N such that n ≥ N implies ∂_Λ S(G_n, Λ̃_n) is invertible, and (S13) yields
    (Λ_n − Λ⋆_ω0)/t_n = −(∂_Λ S(G_n, Λ̃_n))⁻¹ h_n(∇_Λ q_{Λ⋆_ω0}).    (S15)
Finally, as n → ∞, h_n → h in ∥·∥_F, so by the definition of ∥·∥_F,
    h_n(∇_Λ q_{Λ⋆_ω0}) → h(∇_Λ q_{Λ⋆_ω0}).    (S16)
Combining with (S14) and (S15), we obtain
    (Φ(G_n) − Φ(P))/t_n = (Λ_n − Λ⋆_ω0)/t_n → DΦ_P(h) = (H⋆_ω0)⁻¹ h(∇_Λ q_{Λ⋆_ω0}),
which proves (ii).

Proof of Lemma S2.14. Let Π be the finite set of injections π : {1, ..., K_0} → {1, ..., K} and define
    cost_π(Λ) := Σ_{j=1}^{K_0} ∥µ_{π(j)}(Λ) − u_j∥.
Each cost_π is continuous in Λ (a composition of continuous maps). For π ∈ Π, set the open "uniqueness" region
    U_π := {Λ ∈ Θ_K : cost_π(Λ) < cost_ρ(Λ) ∀ρ ≠ π}.
On U_π the minimiser is uniquely π, hence
    M(Λ) = (µ_{π(1)}(Λ)ᵀ, ..., µ_{π(K_0)}(Λ)ᵀ)ᵀ,
which is continuous there because each µ_k is continuous. Equip Π with the lexicographic order ≺_lex: for π, ρ ∈ Π, write π ≺_lex ρ if there exists j⋆ with π(j) = ρ(j) for all j < j⋆ and π(j⋆) < ρ(j⋆). For each π ∈ Π, define the tie region
    T_π := {Λ ∈ Θ_K : cost_π(Λ) ≤ cost_η(Λ) ∀η ∈ Π and cost_π(Λ) < cost_ρ(Λ) ∀ρ ∈ Π with ρ ≺_lex π}.
Here, each set {cost_π ≤ cost_η} is closed and each {cost_π < cost_ρ} is open, so T_π is Borel. On T_π, the tie-break selects π, hence the same formula for M holds and M is continuous there. Finally,
    Θ_K = (∪_{π∈Π} U_π) ∪ (∪_{π∈Π} T_π)
is a finite Borel partition on whose pieces M is continuous. Therefore M is Borel-measurable on Θ_K.

Proof of Lemma S2.15. By Lemma S2.13, Φ is continuous at P, so G_n →_w P implies Φ(G_n) → Φ(P) = Λ⋆_ω0. By Proposition S2.17, M is C¹ (hence continuous) at Λ⋆_ω0. Furthermore, E_j is linear and the ball-membership map is continuous away from the boundary, e.g. via ψ_j(y) := r_j − ∥y − u_j∥ (continuous), with 1{y ∈ U_j} = 1{ψ_j(y) > 0}. Therefore
    ψ_j(E_j M(Φ(G_n))) → ψ_j(E_j M(Λ⋆_ω0)).
By the positive-margin assumption, the limit point is not 0, so by continuity of the indicator map x ↦ 1{x > 0} away from 0, we have F_j(G_n) → F_j(P).

S2.4.4 Proof of Theorem 5.1

Theorem 5.1 is a direct consequence of Theorem 3.1 (iii) and the following result (Thm. S2.16), which we prove below.

Theorem S2.16 (Bootstrap validity for parameters). Under Assumption S2.10,
    d_BL(Law*(√N (Λ̂*_ω0,N − Λ̂_ω0,N)), Law(√N (Λ̂_ω0,N − Λ⋆_ω0))) →_P 0,
where Law(·) denotes the law under P and Law*(·) the law under P*.

Proof of Theorem S2.16. Consider the argmax functional
    Φ(G) := argmax_{Λ∈Θ_K} {G(q_Λ) + ω_0 H_S(q_Λ)}.
By Lemma S2.13, Φ is Hadamard differentiable at G = P tangentially to ℓ∞_lin(F), with derivative DΦ_P (see the definitions in the lemma statements). By Lemma S2.12, √N (P*_N − P_N) →_w G_P in ℓ∞(F) in P-probability, where F is defined in the lemma statement and G_P ∈ ℓ∞_lin(F). Therefore, by Theorem 3.1 (iii) and the bootstrap functional delta method (van der Vaart and Wellner, 1996, Thm. 3.9.11),
    √N (Λ̂*_ω0,N − Λ̂_ω0,N) = DΦ_P(√N (P*_N − P_N)) + o_{P*}(1) →_D N(0, W⋆_ω0),
conditionally on the data, in P-probability. This implies
    d_BL(Law*(√N (Λ̂*_ω0,N − Λ̂_ω0,N)), Law(√N (Λ̂_ω0,N − Λ⋆_ω0))) →_P 0.

S2.4.5 Proof of Proposition 5.4

Proof of Proposition 5.4. By a mean-value expansion of the score,
    ∇_Λ L̂_ω0,N(Λ̃_ω0,N) − ∇_Λ L̂_ω0,N(Λ̂_ω0,N) = Ĥ_ω0,N(Λ̄)(Λ̃_ω0,N − Λ̂_ω0,N),
with ∇_Λ L̂_ω0,N(Λ̂_ω0,N) = 0, and there exists c > 0 such that eig_min(Ĥ_ω0,N(Λ̄)) ≥ c > 0 with probability tending to 1, by Assumption S2.10 (B3). Hence
    ∥Λ̃_ω0,N − Λ̂_ω0,N∥ ≤ c⁻¹ ∥∇_Λ L̂_ω0,N(Λ̃_ω0,N)∥ = o_P(N^{−1/2}).
Therefore √N (Λ̃_ω0,N − Λ̂_ω0,N) = o_P(1), and Slutsky's theorem yields the claimed Gaussian limit. For the bootstrap, the same argument holds conditionally with starred (*) quantities, giving the stated conclusions by the bootstrap delta method.

S2.4.6 Proof of Theorem 5.3

Theorem 5.3 is obtained by combining Proposition S2.17, Theorem S2.18, and Corollary S2.20, proven below.
Mode-matching map smoothness.

Proposition S2.17 (Mode-matching map smoothness). Under Assumption 5.2, there exists a neighborhood V ⊂ Θ_K ∩ G_{δ/2} of Λ⋆_ω0 such that M is C¹ on V. Its Jacobian at Λ⋆_ω0 is J = ∇_Λ M(Λ⋆_ω0).

Proof of Proposition S2.17. By Assumption 5.2, there exists δ > 0 such that the nearest-component indices are unique at Λ⋆_ω0. By continuity of Λ ↦ ∥µ_k(Λ) − u_j∥, these indices are locally constant, so there exists a neighborhood V ⊂ G_{δ/2} of Λ⋆_ω0 on which the matching is fixed. Write π⋆(j) for the index matched to u_j at Λ⋆_ω0. We have, for all Λ ∈ V,
    M(Λ) = (µ_{π⋆(1)}(Λ)ᵀ, ..., µ_{π⋆(K_0)}(Λ)ᵀ)ᵀ.
Since m(Λ) = (µ_1(Λ)ᵀ, ..., µ_K(Λ)ᵀ)ᵀ is C¹ in a neighborhood of Λ⋆_ω0, the composition above shows that M is C¹ on V. In particular, at Λ⋆_ω0, its Jacobian is the stack of matched component-mean gradients,
    J = ∇_Λ M(Λ⋆_ω0) = [∇_Λ µ_{π⋆(1)}(Λ⋆_ω0); ... ; ∇_Λ µ_{π⋆(K_0)}(Λ⋆_ω0)].

CLT for matched modes.

Theorem S2.18 (CLT and bootstrap for M). Under Assumptions S2.10 and 5.2, as N → ∞:
    √N (M(Λ̂_ω0,N) − M(Λ⋆_ω0)) →_D N(0, C_M),
    √N (M(Λ̂*_ω0,N) − M(Λ̂_ω0,N)) →_D N(0, C_M), conditionally on X_{1:N}, in P-probability,
with C_M = J W_ω0 Jᵀ.

Proof of Theorem S2.18. By Theorem 3.1 (iii), √N (Λ̂_ω0,N − Λ⋆_ω0) →_D N(0, W_ω0). Furthermore, by Proposition S2.17, M is C¹ at Λ⋆_ω0 with Jacobian J, so the functional delta method (van der Vaart, 1998, Thm. 20.8) yields
    √N (M(Λ̂_ω0,N) − M(Λ⋆_ω0)) →_D N(0, J W_ω0 Jᵀ),
which proves the first claim. Finally, by the bootstrap functional delta method (van der Vaart and Wellner, 1996, Thm. 3.9.11),
    √N (M(Λ̂*_ω0,N) − M(Λ̂_ω0,N)) →_D N(0, J W_ω0 Jᵀ)
conditionally on X_{1:N}, in P-probability, which gives the second claim.

Corollary S2.19 (Setwise convergence). As a consequence of Theorem S2.18, for any Borel set B ⊂ R^{dK_0} whose boundary has zero probability under N(0, C_M), as N → ∞:
    P*(M(Λ̂*_ω0,N) ∈ B) − P(M(Λ̂_ω0,N) ∈ B) →_P 0.
In particular, percentile (or studentised) bootstrap confidence regions for each matched mode coordinate are asymptotically valid.

Proof of Corollary S2.19. This is directly implied by Theorem S2.18 and the Portmanteau theorem (van der Vaart, 1998, Lem. 2.2).

Per-mode projections.

Corollary S2.20 (Per-mode projections). As a consequence of Theorem S2.18, for the j-th matched mode M_j := E_j M, with E_j extracting the j-th block, as N → ∞:
    √N (M_j(Λ̂_ω0,N) − M_j(Λ⋆_ω0)) →_D N(0, C_j),
    √N (M_j(Λ̂*_ω0,N) − M_j(Λ̂_ω0,N)) →_D N(0, C_j),
and percentile bootstrap ellipses for M_j are asymptotically valid.

Proof of Corollary S2.20. The asymptotic normality results follow directly from Theorem S2.18 and the linearity of the extractor E_j. The ellipses are then asymptotically valid as a consequence of the setwise convergence given by Corollary S2.19.

S2.4.7 Proof of Proposition S1.13

Proof of Proposition S1.13. By Lemma S2.15, F_j is continuous at P with respect to weak convergence.

(i) Conditional LLN. Given the data, {F_j(P*^(ℓ)_N)}_{ℓ=1}^L are i.i.d. Bernoulli with mean E*[F_j(P*_N)] = τ_j,N and finite variance.
Hence, by the strong law, s_j → τ_j,N almost surely under the bootstrap (a.s.*), and Var*(s_j) = Var*(F_j(P*_N))/L → 0. Hoeffding's inequality yields the tail bound.

(ii) Large-N limits. Since P_N →_w P and F_j is continuous at P, we have F_j(P_N) → F_j(P) in probability, so π_j,N → π_j := F_j(P). For the bootstrap mean, standard nonparametric bootstrap consistency at continuity points gives F_j(P*_N) →_P F_j(P) in P-probability. Because F_j ∈ [0, 1] is bounded, conditional bounded convergence implies E*[F_j(P*_N)] →_P F_j(P) = π_j. Finally, for any L = L_N → ∞ as N → ∞,
    |s_j − π_j| ≤ |s_j − τ_j,N| + |τ_j,N − π_j| →_P 0.
By (i) and (ii), we obtain the last claim, s_j →_P π_j.

S2.5 Proofs for Section S1.2

S2.5.1 Proof of Proposition S1.1

Proof of Proposition S1.1. Write q_S = q/ψ and ε = 1 − ψ. On the one hand, we have
    |E_{q_S}[p] − E_q[p]| = |1/ψ − 1| ∫_S p q ≤ M ε.
On the other hand, we have H_S(q_S) = (1/ψ) H_S(q) + log ψ. Hence,
    H_S(q_S) − H_S(q) = (1/ψ − 1) H_S(q) + log ψ = (ε/ψ) H_S(q) + log(1 − ε).
On a compact parameter set and bounded S, C_q = sup_λ sup_{x∈S} |log q_λ(x)| < ∞ is well defined and |H_S(q)| ≤ C_q ∫_S q = C_q ψ. Therefore,
    |H_S(q_S) − H_S(q)| ≤ C_q ε + |log(1 − ε)|.
Combining the two inequalities, we obtain the pointwise bound. The uniform bound follows by taking suprema over Θ.

S2.5.2 Proof of Theorem S1.2

Before proving Theorem S1.2, we derive the following supporting lemmas.

Lemma S2.21 (Argmax stability). Let Θ be compact and let F, G : Θ → R be continuous functions. Assume F has a maximiser λ⋆ and that for some ρ > 0 the value gap
    m(ρ) := F(λ⋆) − sup_{∥λ−λ⋆∥≥ρ} F(λ) > 0.
If ∥G − F∥_∞ := sup_{λ∈Θ} |G(λ) − F(λ)| ≤ m(ρ)/3, then every maximiser of G lies in B(λ⋆, ρ). If, in addition, λ⋆ is the unique interior maximiser of F, then there exists ρ_0 > 0 such that, whenever the same bound holds for every 0 < ρ ≤ ρ_0, we have argmax_{λ∈Θ} G(λ) = {λ⋆}.

Proof of Lemma S2.21. Fix ρ > 0 with m(ρ) > 0 and set δ := ∥G − F∥_∞. For any λ with ∥λ − λ⋆∥ ≥ ρ,
    G(λ) ≤ F(λ) + δ ≤ F(λ⋆) − m(ρ) + δ,
while G(λ⋆) ≥ F(λ⋆) − δ. If δ < m(ρ)/2 (and in particular under δ ≤ m(ρ)/3), then G(λ⋆) > G(λ) for all ∥λ − λ⋆∥ ≥ ρ, so every maximiser of G lies in B(λ⋆, ρ). If λ⋆ is the unique interior maximiser of F, choose ρ_0 > 0 so that B(λ⋆, ρ_0) contains no other maximiser of F. For any 0 < ρ ≤ ρ_0, the above argument ensures that every maximiser of G lies in B(λ⋆, ρ). Since λ⋆ is the only maximiser of F in this ball, the strict inequality above then implies argmax_{λ∈Θ} G = {λ⋆}.

Lemma S2.22 (Uniform tail). Fix ω > 0. Assume the uniform margin
    dist(µ(λ), ∂S) ≥ M_0 √(eig_max(Σ(λ))),    ∀λ ∈ Θ.
Then the uniform tail bound
    ε_max := sup_{λ∈Θ} (1 − ∫_S q_λ(x) dx) ≤ C e^{−M_0²/2}
holds for some constant C > 0.

Proof of Lemma S2.22. Let λ ∈ Θ, X ~ N(µ, Σ), and r := M_0 √(eig_max(Σ)). The margin condition dist(µ(λ), ∂S) ≥ r implies B(µ, r) ⊂ S, so ε(µ, Σ) = P(X ∉ S) ≤ P(∥X − µ∥ ≥ r). Since Σ ⪯ eig_max(Σ) I, for Z ~ N(0, I) there exists C > 0 such that
    P(∥X − µ∥ ≥ r) ≤ P(∥Z∥ ≥ r/√(eig_max(Σ))) = P(∥Z∥ ≥ M_0) ≤ C e^{−M_0²/2},
by chi-squared tail bounds for ∥Z∥². Thus, ε_max = sup_{λ∈Θ} ε(λ) ≤ C e^{−M_0²/2}.
Proof of Theorem S1.2. By Lemma S2.22, ε_max ≤ C e^{−M_0²/2}. Proposition S1.1 then yields the uniform bound
    sup_{λ∈Θ} |G_ω(λ) − F_ω(λ)| ≤ δ_ω(ε_max).
Fix ρ > 0 and set m(ρ) := F_ω(λ⋆_ω) − sup_{∥λ−λ⋆_ω∥≥ρ} F_ω(λ) > 0. Choose M_0 large enough (hence ε_max small enough) that δ_ω(ε_max) < m(ρ)/3. By Lemma S2.21, every maximiser of G_ω lies in B(λ⋆_ω, ρ). If λ⋆_ω is a unique interior maximiser of F_ω, the argmaxes coincide.

S3 Algorithms and variants

This section details the GERVE update rules for different variational families and provides pseudo-code for implementation. For each case, we:
• Recall the parameterisation and its link to natural gradients.
• Present the specific update rules and their mini-batch version.
• Give pseudo-code for practical use.
• Provide derivation notes for readers interested in the intermediate steps.

S3.1 Natural-gradient update rule

For distributions in an exponential family or MCEF, we can use (5) (Lin et al., 2019; Khan and Rue, 2023), so that the natural-gradient update rule writes:
    λ_{t+1} = λ_t + ρ_t ∇_M L̂_{ω_t,N}(λ)|_{λ=λ_t},
which will be the starting point for all the following derivations.

S3.2 Fixed-covariance Gaussian case

Parameterisation. We consider q_λ(x) = N(x; µ, s⁻¹I) with fixed precision s > 0.
• Natural parameter: λ = sµ;
• Sufficient statistic: x;
• Expectation parameter: µ = E_{q_λ}[X].
Only µ is optimised.

GERVE update rule. Since ∇_µ H_S(q_λ) ≈ ∇_µ H(q_λ) = 0, the fixed-covariance Gaussian GERVE update rule is:
    µ_{t+1} = µ_t + (ρ_t/N) Σ_{i=1}^N (X_i − µ_t) q_{λ_t}(X_i).
This is equivalent to gradient ascent on the kernel density estimate
    p_h(µ) = (1/N) Σ_{i=1}^N φ_h(X_i − µ),
where φ_h is the Gaussian kernel with bandwidth h = s⁻¹. This reduces to the classic Gaussian mean-shift update when the step sizes are chosen adaptively, ρ_t = p_h(µ_t)⁻¹.

Mini-batch implementation. For a batch X^(t)_{1:B} sampled uniformly with replacement from the dataset, i.e. J^(t)_1, ..., J^(t)_B ~iid Unif([N]) and X^(t)_{1:B} := (X_{J^(t)_1}, ..., X_{J^(t)_B}), define:
    ĝ^(t)_B = (s/B) Σ_{i=1}^B (X^(t)_i − µ_t) q_{λ_t}(X^(t)_i).
Conditioned on X_{1:N}, ĝ^(t)_B is an unbiased estimator of ∇_µ L̂_{ω_t,N}(λ)|_{λ=λ_t}. The mini-batch version of the update rule is:
    µ_{t+1} = µ_t + ρ_t s⁻¹ ĝ^(t)_B.

Algorithm. Algorithm 3 presents the pseudo-code corresponding to this implementation of GERVE.

Algorithm 3: Fixed-covariance Gaussian GERVE (with mini-batches)
1  Given samples X_{1:N}.
2  Set T, B, s, λ_1, ρ_{1:T}.
3  for t = 1 : T do
4    Sample J^(t)_{1:B} ~iid Unif([N]); set X^(t)_i ← X_{J^(t)_i} for i = 1 : B.
5    Compute ĝ^(t)_B from X^(t)_{1:B}, λ_t.
6    Update µ_{t+1} = µ_t + ρ_t s⁻¹ ĝ^(t)_B.
7  end
8  return λ_{T+1} (= µ_{T+1}).

Derivation details. From the natural-gradient rule for exponential families with λ = sµ, we have:
    s µ_{t+1} = s µ_t + ρ_t ∇_µ L̂_{ω_t,N}(λ)|_{λ=λ_t}.
The gradient of the empirical objective is:
    ∇_µ L̂_{ω_t,N}(λ) = (s/N) Σ_{i=1}^N (X_i − µ) q_λ(X_i) + ω ∇_µ H_S(q_λ),
which yields the update rule above.
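For concreteness, here is a minimal Python sketch of Algorithm 3 (the function name and default hyperparameters are ours, not from the paper; the entropy gradient is dropped since it is approximately zero in this case):

    import numpy as np

    def gerve_fixed_cov(X, T=2000, B=256, s=10.0, rho=1e-2, seed=0):
        """Sketch of mini-batch fixed-covariance GERVE (Algorithm 3)."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        mu = X[rng.integers(N)].astype(float)   # initialise at a random sample
        norm = (s / (2 * np.pi)) ** (d / 2)     # N(x; mu, s^{-1} I) normalising constant
        for t in range(T):
            batch = X[rng.integers(N, size=B)]  # uniform sampling with replacement
            resid = batch - mu
            q = norm * np.exp(-0.5 * s * np.sum(resid**2, axis=1))  # q_{lambda_t}(X_i)
            g = (s / B) * resid.T @ q           # unbiased gradient estimate g_hat
            mu = mu + rho * g / s               # update: mu + rho * s^{-1} * g_hat
        return mu

With the adaptive step size ρ_t = p_h(µ_t)⁻¹ mentioned above, each step becomes exactly a Gaussian mean-shift iteration.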
S3.3 Gaussian mixture case

Parameterisation. For K components:
    q_Λ(x) = Σ_{k=1}^K π_k q_{λ_k}(x),
where the weights π_1, ..., π_K sum to 1, and each q_{λ_k}(x) is a Gaussian density with natural parameters λ_k.
• MCEF natural parameters: Λ = (v_1, ..., v_{K−1}, λ_1, ..., λ_K), where v_k := log(π_k/π_K);
• MCEF expectation parameters: M = (π_1, ..., π_{K−1}, M_1, ..., M_K), where M_k := (m^(1)_k, m^(2)_k) = (π_k µ_k, π_k (S_k⁻¹ + µ_k µ_kᵀ)).

GERVE update rules. The Gaussian mixture GERVE update rules are:
    S_{k,t+1} = S_{k,t} − (ρ_t/N) S_{k,t} Σ_{i=1}^N ((X_i − µ_{k,t})(X_i − µ_{k,t})ᵀ S_{k,t} − I) q_{λ_{k,t}}(X_i) − ρ_t ω_t ∇_{S_k⁻¹} H_S(q_Λ)|_{Λ=Λ_t},
    µ_{k,t+1} = µ_{k,t} + (ρ_t/N) S_{k,t+1}⁻¹ S_{k,t} Σ_{i=1}^N (X_i − µ_{k,t}) q_{λ_{k,t}}(X_i) + ρ_t ω_t ∇_{µ_k} H_S(q_Λ)|_{Λ=Λ_t},
    v_{k,t+1} = v_{k,t} + (ρ_t/N) Σ_{i=1}^N (q_{λ_{k,t}}(X_i) − q_{λ_{K,t}}(X_i)) + ρ_t ω_t ∇_{π_k} H_S(q_Λ)|_{Λ=Λ_t}.

Mini-batch and Monte Carlo implementation. For a batch X^(t)_{1:B} sampled uniformly with replacement from the dataset, i.e. J^(t)_1, ..., J^(t)_B ~iid Unif([N]) and X^(t)_{1:B} := (X_{J^(t)_1}, ..., X_{J^(t)_B}), define:
    f̂^(t)_{k,B} = (1/B) Σ_{i=1}^B (q_{λ_{k,t}}(X^(t)_i) − q_{λ_{K,t}}(X^(t)_i)),
    ĝ^(t)_{k,B} = (π_{k,t}/B) S_{k,t} Σ_{i=1}^B (X^(t)_i − µ_{k,t}) q_{λ_{k,t}}(X^(t)_i),
    Ĥ^(t)_{k,B} = (π_{k,t}/(2B)) S_{k,t} Σ_{i=1}^B ((X^(t)_i − µ_{k,t})(X^(t)_i − µ_{k,t})ᵀ S_{k,t} − I) q_{λ_{k,t}}(X^(t)_i).
Conditioned on X_{1:N}, ĝ^(t)_{k,B}, Ĥ^(t)_{k,B}, and f̂^(t)_{k,B} are unbiased estimators of their full-data counterparts.

Entropy derivatives must also be estimated. For some size B_e, draw additional Monte Carlo samples Z^(k,t)_{1:B_e} ~ q_{λ_{k,t}} to compute η̂^(t)_{k,B_e}, γ̂^(t)_{k,B_e}, and φ̂^(t)_{k,B_e} (see details below). We obtain the practical updates:
    S_{k,t+1} = S_{k,t} − (2ρ_t/π_{k,t}) (Ĥ^(t)_{k,B} + ω_t η̂^(t)_{k,B_e}),
    µ_{k,t+1} = µ_{k,t} + ρ_t S_{k,t+1}⁻¹ (ĝ^(t)_{k,B} + ω_t γ̂^(t)_{k,B_e}),
    v_{k,t+1} = v_{k,t} + ρ_t (f̂^(t)_{k,B} + ω_t φ̂^(t)_{k,B_e}).

Algorithm. Algorithm 4 depicts our proposed implementation of Gaussian mixture GERVE.

Algorithm 4: Gaussian mixture GERVE (with mini-batches)
1   Given samples X_{1:N}.
2   Set T, B, B_e, K, Λ_1, ρ_{1:T}, ω_{1:T}.
3   for t = 1 : T do
4     Sample J^(t)_{1:B} ~iid Unif([N]); set X^(t)_i ← X_{J^(t)_i} for i = 1 : B.
5     for k = 1 : K do
6       Compute ĝ^(t)_{k,B}, Ĥ^(t)_{k,B} from X^(t)_{1:B}, Λ_t.
7       Sample Z^(k,t)_{1:B_e} ~iid N(µ_{k,t}, S_{k,t}⁻¹).
8       Compute γ̂^(t)_{k,B_e}, η̂^(t)_{k,B_e} from Z^(k,t)_{1:B_e}, Λ_t.
9       Update S_{k,t+1}, µ_{k,t+1}.
10    end
11    for k = 1 : (K − 1) do
12      Compute f̂^(t)_{k,B}, φ̂^(t)_{k,B_e} from X^(t)_{1:B}, Z^(k,t)_{1:B_e}, Z^(K,t)_{1:B_e}, Λ_t.
13      Update v_{k,t+1}.
14    end
15  end
16  return Λ_{T+1}.
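A minimal Python sketch of the mini-batch estimators f̂, ĝ, Ĥ for a single component (our helper, written directly from the definitions above; S_k denotes the precision matrix):

    import numpy as np
    from scipy.stats import multivariate_normal

    def minibatch_estimates(Xb, pi_k, mu_k, S_k, mu_K=None, S_K=None):
        """Sketch of f_hat, g_hat, H_hat for component k on a mini-batch Xb (B, d)."""
        B, d = Xb.shape
        q_k = multivariate_normal.pdf(Xb, mean=mu_k, cov=np.linalg.inv(S_k))
        resid = Xb - mu_k
        g_hat = (pi_k / B) * S_k @ (resid.T @ q_k)         # gradient estimate w.r.t. mu_k
        # H_hat: (pi_k / 2B) S_k * sum_i ((x_i - mu)(x_i - mu)^T S_k - I) q_k(x_i)
        M = np.einsum("bi,bj,b->ij", resid, resid, q_k)    # sum_i q_i (x_i-mu)(x_i-mu)^T
        H_hat = (pi_k / (2 * B)) * S_k @ (M @ S_k - np.sum(q_k) * np.eye(d))
        f_hat = None
        if mu_K is not None:                               # f_hat needs the K-th component
            q_K = multivariate_normal.pdf(Xb, mean=mu_K, cov=np.linalg.inv(S_K))
            f_hat = float(np.mean(q_k - q_K))
        return f_hat, g_hat, H_hat

The entropy estimators η̂, γ̂, φ̂ follow the same pattern, with the mini-batch replaced by Monte Carlo draws Z ~ q_{λ_k} and the summand weighted by log q_Λ(Z) 1_S(Z), as detailed in the derivation notes below.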
Derivation details. The natural-gradient update rules are:
    v_{k,t+1} = v_{k,t} + ρ_t ∇_{π_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t},
    S_{k,t+1} µ_{k,t+1} = S_{k,t} µ_{k,t} + ρ_t ∇_{m^(1)_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t},
    −(1/2) S_{k,t+1} = −(1/2) S_{k,t} + ρ_t ∇_{m^(2)_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t}.
We rearrange them to obtain:
    µ_{k,t+1} = µ_{k,t} + ρ_t S_{k,t+1}⁻¹ (∇_{m^(1)_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t} + 2 (∇_{m^(2)_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t}) µ_{k,t}),
    S_{k,t+1} = S_{k,t} − 2ρ_t ∇_{m^(2)_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t},
    v_{k,t+1} = v_{k,t} + ρ_t ∇_{π_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t}.
We apply the chain rule to express the gradients w.r.t. the expectation parameters as gradients w.r.t. µ_k and S_k⁻¹:
    ∇_{m^(1)_k} L̂_{ω_t,N}(Λ) = (1/π_k) (∇_{µ_k} L̂_{ω_t,N}(Λ) − 2 (∇_{S_k⁻¹} L̂_{ω_t,N}(Λ)) µ_k),
    ∇_{m^(2)_k} L̂_{ω_t,N}(Λ) = (1/π_k) ∇_{S_k⁻¹} L̂_{ω_t,N}(Λ).
These gradients are
    ∇_{µ_k} L̂_{ω_t,N}(Λ) = ∇_{µ_k} ((1/N) Σ_{i=1}^N q_Λ(X_i)) + ω_t ∇_{µ_k} H_S(q_Λ)
                        = (π_k/N) S_k Σ_{i=1}^N (X_i − µ_k) q_{λ_k}(X_i) + ω_t ∇_{µ_k} H_S(q_Λ),
and
    ∇_{S_k⁻¹} L̂_{ω_t,N}(Λ) = ∇_{S_k⁻¹} ((1/N) Σ_{i=1}^N q_Λ(X_i)) + ω_t ∇_{S_k⁻¹} H_S(q_Λ)
                          = (π_k/(2N)) S_k Σ_{i=1}^N ((X_i − µ_k)(X_i − µ_k)ᵀ S_k − I) q_{λ_k}(X_i) + ω_t ∇_{S_k⁻¹} H_S(q_Λ).
Given that π_K = 1 − Σ_{k=1}^{K−1} π_k, we also have
    ∇_{π_k} L̂_{ω_t,N}(Λ) = ∇_{π_k} ((1/N) Σ_{i=1}^N q_Λ(X_i)) + ω_t ∇_{π_k} H_S(q_Λ)
                        = (1/N) Σ_{i=1}^N (q_{λ_k}(X_i) − q_{λ_K}(X_i)) + ω_t ∇_{π_k} H_S(q_Λ).
Combining these expressions, we find the Gaussian mixture GERVE update rules.

Entropy derivatives (for Monte Carlo estimation). The entropy derivatives do not admit a closed form, but they admit an integral representation. For the derivatives w.r.t. µ_k and S_k⁻¹, the following expressions can be used:
    ∇_{S_k⁻¹} H_S(q_Λ) = (π_k/2) S_k E_{q_{λ_k}}[((X − µ_k)(X − µ_k)ᵀ S_k − I) log q_Λ(X) 1_S(X)]
and
    ∇_{µ_k} H_S(q_Λ) = π_k S_k E_{q_{λ_k}}[(X − µ_k) log q_Λ(X) 1_S(X)].
Finally, for the derivative w.r.t. π_k, we use:
    ∇_{π_k} H_S(q_Λ) = E_{q_{λ_k}}[log q_Λ(X) 1_S(X)] − E_{q_{λ_K}}[log q_Λ(X) 1_S(X)].

S3.4 Computational considerations

Recall GERVE's complexity for Gaussian mixtures with K components (Sec. 4.4):
• Full covariances: O(d³BKT).
• Diagonal, isotropic, or fixed covariances: O(dBKT).
Simplified families often retain most of the algorithm's qualitative behaviour while improving scalability. This is comparable to related methods:
• Mean-shift mode-seeking. A single run of classical mean-shift with fixed spherical bandwidths and mini-batches costs O(dBT). Running from K initialisations (to target K modes) yields O(dBKT), comparable to fixed-covariance GERVE. However, unlike GERVE, these independent trajectories lack a repulsion mechanism and may fail to explore the full modal structure.
• Mean-shift clustering. Since mean-shift clustering launches one trajectory per data point, the cost is O(dNBT) for spherical bandwidths. This can be substantially more expensive than GERVE, which only launches K trajectories to find the cluster centroids.
• EM for Gaussian mixtures (GMM-EM). The EM algorithm (Dempster et al., 1977; McLachlan and Peel, 2000) with full covariances has cost O((d²B + d³)KT) per run, due to matrix inversions and determinant computations. For diagonal covariances, it reduces to O(dBKT), which is comparable to fixed-covariance GERVE. In contrast, GERVE uses unified natural-gradient steps rather than alternating E- and M-steps, and does not require likelihood maximisation, making it applicable in likelihood-free settings.

S4 Simulations and benchmark

S4.1 Details on the clustering example of Section 6.1

Sample. The dataset consists of N = 6000 points from a 2D Gaussian mixture with three equally weighted components, with means at the nodes of an equilateral triangle, (0, 1), (cos(π/6), −0.5), (−cos(π/6), −0.5), and isotropic covariance σ²I with σ² = 0.25. Figure S2 shows a sample and a kernel density estimate (KDE), revealing J = 3 regions of high density. A minimal sampling sketch is given below.

GERVE hyperparameters. The mixture parameters are initialised with µ_{1,1}, ..., µ_{K,1} ~ Uniform([−2, 0]²), and for all k ∈ [K], Σ_{k,1} = 2I and π_{k,1} = 1/K. Table S1 gives the other hyperparameters used for GERVE in this clustering task.
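The simulation sample described above can be generated as follows (a minimal sketch, ours rather than the authors' script):

    import numpy as np

    # N = 6000 points from three equally weighted Gaussians at the nodes
    # of an equilateral triangle, isotropic covariance sigma^2 I.
    rng = np.random.default_rng(0)
    N, sigma2 = 6000, 0.25
    means = np.array([[0.0, 1.0],
                      [np.cos(np.pi / 6), -0.5],
                      [-np.cos(np.pi / 6), -0.5]])
    labels = rng.integers(3, size=N)                    # equal weights 1/3
    X = means[labels] + np.sqrt(sigma2) * rng.standard_normal((N, 2))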
Additional figures. Figure S3 shows results for K = 3, including an additional comparison with k-means for K = 7.

[Figure S2: Sample of N = 6000 points from a three-component Gaussian mixture whose means are located at the nodes of an equilateral triangle. Background shading: Gaussian KDE with bandwidth h selected by Scott's rule (Scott, 1992), revealing three high-density regions.]

Table S1: Hyperparameters used for GERVE in the clustering simulations.
    Hyperparameter   Value
    T                40000
    B                1000
    K                {3, 7}
    ω_t              50/t^1.1 + 0.004
    ρ_t              10⁻⁴ (50/ω_t)^0.7

[Figure S3: Top row: clustering with K = 3 for GERVE (left) and GMM-EM (right); ellipses represent the covariance matrices of the Gaussian components. Bottom row: clustering with k-means for K = 7 (left) and K = 3 (right).]

S4.2 Mode estimation

Model. We use the same triangle configuration with means at (0, 1), (cos(π/6), −0.5), (−cos(π/6), −0.5) and isotropic covariance σ²I, but with smaller covariance σ² = 0.1. The three means are well separated, so the global modes of the density approximately coincide with the means: u_1 ≈ (0, 1), u_2 ≈ (cos(π/6), −0.5), u_3 ≈ (−cos(π/6), −0.5). For each experiment replicate, we sample N points.

Metrics. Each method is replicated n_rep = 100 times and provides K estimated modes û = (û_1, ..., û_K). We assess mode-estimation performance using the following metrics (a minimal computation sketch is given after the hyperparameter details below):
• Mode recovery (MR_ϵ): number of true modes {u_i, i ∈ [I]} recovered within a strict tolerance ϵ:
    MR_ϵ(û) = Σ_{i=1}^I 1{min_{k∈[K]} ∥û_k − u_i∥₂ < ϵ}.
We set ϵ = 10⁻², a small value relative to σ = √0.1.
• Hungarian matching sum (HM): minimum linear-assignment cost between true and estimated modes,
    HM(û) = min_τ Σ_{i=1}^I ∥û_{τ(i)} − u_i∥₂,
i.e., the minimum total distance over injective maps (K ≥ I) τ: {1, ..., I} → {1, ..., K} (one distinct estimate per true mode). This can be computed via the Hungarian algorithm (Kuhn, 1955).
• Nearest-neighbor sum (NN): aggregate distance from each estimate to the closest true mode,
    NN(û) = Σ_{k=1}^K min_{i∈[I]} ∥û_k − u_i∥₂.

Hyperparameters and grid search. For GERVE, the initial components have equal weights 1/K and covariances σ₁²I, σ₁ > 0. The annealing schedule is ω_t = ω₁/t^β and the step sizes are ρ_t = ρ₁ (ω₁/ω_t)^γ. We used T = 4000 iterations and mini-batches of size B = 1000. The mean-shift algorithm is run with a mini-batch size B = 1000 and uses a bandwidth h_m for its Gaussian kernel. The feature significance method uses a bandwidth h_f to identify the points of significant curvature or gradient. For each method, we performed a grid or line search to optimise hyperparameters with respect to each metric and sample size. Table S2 gives the values used in the grid-search procedure. For Gaussian mean-shift, we use a mini-batch fixed-covariance Gaussian GERVE (updates equivalent to Gaussian mean-shift with bandwidth h) with adaptive step size ρ_t = B (Σ_{b=1}^B q_{λ_t}(X^(t)_b))⁻¹, where X^(t)_{1:B} is the mini-batch sample.
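A minimal sketch of the three metrics (our helper; the Hungarian step uses SciPy's linear_sum_assignment):

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def mode_metrics(u_hat, u_true, eps=1e-2):
        """u_hat: (K, d) estimated modes; u_true: (I, d) true modes, K >= I."""
        D = cdist(u_true, u_hat)                 # (I, K) pairwise distances
        mr = int(np.sum(D.min(axis=1) < eps))    # MR_eps: true modes recovered
        rows, cols = linear_sum_assignment(D)    # Hungarian algorithm (Kuhn, 1955)
        hm = D[rows, cols].sum()                 # HM: optimal injective matching cost
        nn = D.min(axis=0).sum()                 # NN: each estimate to closest truth
        return mr, hm, nn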
Table S2: Grid of hyperparameter values used for GERVE, mean-shift, and the feature significance method in the mode-finding simulations.
    Hyperparameter   Values
    σ₁²              {0.1, 0.2}
    ω₁               {10, 50}
    β                {1.1, 1.3, 1.5}
    ρ₁               {0.01, 0.03, 0.1}
    γ                {0.2, 0.3, 0.4}
    h_m              {0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2}
    h_f              {0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3}

Additional simulation results. Figure S4 shows performance metrics for GERVE and the two baseline methods over a sweep of K ∈ {3, ..., 7}.

S5 Clustering performance on UCI datasets

We assess robustness to misspecification on real data and compare to classic clustering methods. We also provide an ablation study and report run times.

S5.1 Benchmark

Datasets. The three datasets can be found in the UCI Machine Learning Repository (archive.ics.uci.edu).
• Iris: N = 150, d = 4 (sepal length/width, petal length/width), J = 3 species (setosa, versicolor, virginica).
• Wine: N = 178, d = 13 chemical attributes from J = 3 cultivars.
• Pendigits: N ≈ 11000, d = 16 (eight pen-tip (x, y) pairs, normalised), J = 10 classes (digits 0 to 9).

Clustering metrics. We report the Adjusted Rand Index (ARI, pairwise agreement), the Adjusted Mutual Information (AMI, mutual information between estimated and true classes), Purity (per-cluster majority proportion), and Macro-F1 (classwise F1, averaged).

GERVE default setup. GERVE uses diagonal-covariance mixtures with weights fixed to k-means proportions. Means are initialised at k-means centroids, covariances at σ₁²I. We include a brief fixed-covariance burn-in of L_B iterations within a total budget of T before learning covariances. We use overcompletion K = J + 5 to enable component merging. Other hyperparameters are listed in Table S3.

[Figure S4: Mode-estimation performance of GERVE (first row), Feature Significance (second row), and Gaussian mean-shift (third row) with respect to sample size, for the triangle mixture (I = 3 true modes). Curves show, for varying K ∈ {3, 4, 5, 6, 7}, means (MR_ϵ, NN) or medians (HM) over n_rep = 100 replicates. Bands are 95% confidence intervals: means use t-intervals, medians use bootstrap percentiles (5000 samples).]

Table S3: Dataset characteristics and hyperparameter values used for the GERVE default setup.
    Dataset                 Iris      Wine      Pendigits
    Dataset characteristics
    N                       150       178       10992
    d                       4         13        16
    J                       3         3         10
    GERVE default setup
    T                       2000      2000      2000
    K                       8         8         15
    L_B                     100       100       100
    σ₁²                     1         1         1
    ω₁                      0.1       10⁻⁴      10⁻³
    ω_t                     ω₁/t²     ω₁/t²     ω₁/t²
    ρ₁                      10⁶       10¹²      10¹¹
    ρ_t                     ρ₁/t²     ρ₁/t²     ρ₁/t²
    σ²_min                  0.01      0.1       0.01
    σ²_max                  1         1         1
    D_µ                     0.1       0.1       0.2
    D_Σ                     1         1         1

We enforce stability by clipping parameter updates (∥µ_{k,t+1} − µ_{k,t}∥₂ ≤ D_µ, ∥Σ_{k,t+1} − Σ_{k,t}∥_F ≤ D_Σ) and constraining the parameter space via projections (σ²_min I ⪯ Σ_k ⪯ σ²_max I). These constraints have little practical influence on results but improve stability in high dimensions.

Baseline methods. We compare GERVE to mean-shift (with a flat kernel), GMM-EM, and k-means. Our main label-free baselines are implemented using Python's scikit-learn package (a minimal sketch of this setup follows the results discussion below):
• Flat mean-shift (MS-Scott): flat kernel, bandwidth from Scott's rule (Scott, 1992).
We set the maximum number of iterations to 300 (scikit-learn MeanShift's default).
• GMM-BIC: Gaussian mixture model with diagonal covariances, with the number of components selected by BIC at each run. Initial mixture parameters are set using k-means++. We stop early when the average lower-bound gain is below 10⁻³ and set the maximum number of iterations to 100 (scikit-learn GaussianMixture's default).
• k-means (KM-Elbow): number of centroids selected by the elbow method. Initial centroid locations are set using greedy k-means++. We stop early when the average lower-bound gain is below 10⁻⁴ and set the maximum number of iterations to 300 (scikit-learn KMeans's default).

[Figure S5: Clustering metrics (means ± 95% CI) and number of effective clusters (median and [Q1-Q3] quantiles when applicable, or constant across all replicates) for Iris, Wine, and Pendigits over n_rep = 10 runs. Label-free methods are ordered by average rank within each dataset.]

For reference, we also report oracle variants that use ground truth and are not available in practice: GERVE-K (K = J), MS-K (bandwidth chosen by line search so that the number of clusters is closest to J), MS-ARI (bandwidth maximising ARI), GMM-K (K = J), GMM-ARI (K maximising ARI), KM-K (K = J), KM-ARI (K maximising ARI). GMM-BIC selects K by BIC at each run. The -ARI oracle variants are chosen once per dataset by maximising mean ARI across seeds and then held fixed across runs.

Results. Figure S5 shows means and 95% confidence intervals over n_rep = 10 runs for the four metrics, together with the median numbers of effective clusters and their [Q1-Q3] quantiles. GERVE is competitive with label-free baselines while adapting the number of clusters through merging. For reference, a comparison against oracle baselines is provided in Figure S6.

Iris: GERVE matches baselines on ARI/AMI/Macro-F1. It merges two close species, yielding two effective clusters versus J = 3, with a small drop in Purity and a cleaner modal partition.

Wine: GERVE attains strong scores across metrics. k-means edges out the top values on this near-spherical dataset, while GERVE remains robust without tuning K via BIC.

Pendigits: No method reaches J = 10 in this overlapping, high-dimensional setting. GERVE produces the effective cluster count closest to J and leads on all four metrics, indicating consistent recovery of global structure even with class splits.

We emphasise ARI, AMI, and Macro-F1, since Purity can inflate with larger K. In practice, GERVE returns a smaller effective K by merging redundant components. Compared to mean-shift, it scales better in dimension and avoids bandwidth tuning through adaptive covariance learning.
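A minimal scikit-learn sketch of the label-free baselines (the exact initialisation and selection rules here are assumptions; in particular, the Scott-type bandwidth n^(−1/(d+4)) times the mean feature standard deviation, and the elbow heuristic, are our simplifications):

    import numpy as np
    from sklearn.cluster import MeanShift, KMeans
    from sklearn.mixture import GaussianMixture

    def run_baselines(X, max_K=15, seed=0):
        n, d = X.shape
        bw = n ** (-1.0 / (d + 4)) * X.std(axis=0).mean()   # Scott-type bandwidth
        ms_labels = MeanShift(bandwidth=bw, max_iter=300).fit_predict(X)
        # GMM-BIC: diagonal covariances, K chosen by BIC
        gmms = [GaussianMixture(n_components=k, covariance_type="diag",
                                tol=1e-3, max_iter=100, random_state=seed).fit(X)
                for k in range(1, max_K + 1)]
        gmm_labels = gmms[int(np.argmin([g.bic(X) for g in gmms]))].predict(X)
        # KM-Elbow: pick K at the largest relative drop in inertia (crude elbow)
        inertias = [KMeans(n_clusters=k, tol=1e-4, max_iter=300,
                           random_state=seed).fit(X).inertia_
                    for k in range(1, max_K + 1)]
        drops = -np.diff(inertias)
        k_elbow = int(np.argmax(drops[:-1] / np.maximum(drops[1:], 1e-12))) + 2
        km_labels = KMeans(n_clusters=k_elbow, random_state=seed).fit_predict(X)
        return ms_labels, gmm_labels, km_labels

ARI, AMI, and Macro-F1 can then be computed from these labels with sklearn.metrics (adjusted_rand_score, adjusted_mutual_info_score, f1_score with average="macro" after majority-vote relabelling).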
[Figure S6: Clustering metrics (means ± 95% CI) and effective clusters (median and [Q1-Q3] quantiles when applicable, or constant across all replicates) for Iris, Wine, and Pendigits over n_rep = 10 runs. Label-free methods (GERVE, MS-Scott, GMM-BIC, KM-Elbow) are ordered by average rank within each dataset. Oracle variants are marked with the † symbol and placed directly below the corresponding label-free method.]

S5.2 Ablation study

We now turn to an ablation study examining how different configuration choices (weights, overcompleteness, annealing, covariance structure, burn-in) affect the behaviour of GERVE. We start from the same default, dataset-specific GERVE configuration. For each dataset, we create new configurations by varying one factor at a time relative to the default: (i) Weights: equal vs. proportional to k-means group sizes; (ii) Overcompleteness: K ∈ {J, J+5}; (iii) Annealing: ω_t ∈ {0, ω₁/t^β, β ∈ {0.5, 1, 2, 3}}; (iv) Covariance: fixed, diagonal, full; (v) Burn-in: L_B ∈ {0, 50, 100, 200, 500}.

Figures S7-S8 report, for each configuration, the mean of each metric (with 95% confidence intervals) and the effective number of clusters (median [Q1-Q3]). The main takeaways for each varying factor are: (i) weights proportional to k-means help on Wine (higher ARI/AMI/Purity/Macro-F1), without significant effect elsewhere; (ii) starting overcomplete slightly increases the effective number of clusters but helps on Pendigits, where class overlap is substantial; (iii) faster annealing improves end-state consolidation under fixed iteration budgets, while ω_t ≡ 0 underperforms on overlapping classes, illustrating the importance of the exploration induced by the entropy term; (iv) diagonal covariances are a good speed vs. robustness compromise, as full covariances add cost and variance without systematic gains; (v) a short fixed-covariance burn-in (L_B ≈ 100) can help, but longer burn-ins steal iterations from covariance learning.

S5.3 Run time

To illustrate how GERVE's computational cost depends on data dimension, the number of mixture components, and the covariance parameterisation, we report mean run times across four datasets (Triangle mixture, Iris, Wine, Pendigits) under a common configuration. This is not intended as a scalability benchmark, since runtime is implementation-dependent and sensitive to engineering choices.

Table S4 shows average wall-clock runtimes over ten replicates. As expected, cost increases with dimensionality and with the flexibility of the covariance model. Fixed covariances are the fastest to compute but also the least flexible, while full covariances are substantially slower, especially in higher dimensions.
The diagonal setting offers a practical compromise: it is only slightly slower than the fixed-covariance setting, yet avoids the instability and high cost of full-covariance updates. Overall, run times remain modest on medium-sized datasets, indicating that GERVE is practically usable in realistic clustering scenarios even when covariance learning is included. We also plot execution time as a function of K on the Pendigits dataset (Figure S9). The scaling is approximately linear in K, with diagonal covariances remaining a practical compromise between fixed and full parameterisations.

[Figure S7: Ablation – Weights and overcompleteness (top) and Annealing (bottom). For each dataset (rows), the first four columns show ARI/AMI/Purity/Macro-F1 (means ± 95% t-CIs over n_rep = 10 runs); the rightmost column shows the effective number of clusters (median [Q1-Q3], or constant across all runs). Bold marks the default GERVE configuration. The dashed reference line in each metric panel marks the default's mean for that metric. Dataset-specific J values are listed in Table S3. Top: prefixes T- for K = J (true groups) and O- for K = J+5 (overcomplete); suffixes -KW for k-means-proportional weights and -EW for equal weights. Bottom: schedules Constant (ω_t ≡ ω₁), Power-x (ω_t = ω₁ t^{−x}), No entropy (ω_t ≡ 0).]

[Figure S8: Ablation – Covariance (top) and Burn-in phase (bottom). For each dataset (rows), the first four columns show ARI/AMI/Purity/Macro-F1 (means ± 95% t-CIs over n_rep = 10 runs); the rightmost column shows the effective number of clusters (median [Q1-Q3], or constant across all runs). Bold marks the default GERVE configuration. The dashed reference line in each metric panel marks the default's mean for that metric. Top: covariance parameterisations Fixed min. (Σ_k ≡ σ²_min I), Fixed max.
(Σ_k ≡ σ²_max I), Diagonal (learned diagonal Σ_k with bounds σ²_min I ⪯ Σ_k ⪯ σ²_max I), and Full (learned full Σ_k under the same bounds). Dataset-specific values for σ²_min, σ²_max and the initialisation σ²_0 (for learned covariances) are listed in Table S3. Bottom: burn-in variants No burn-in (L = 0) and Burn-x (fixed-covariance phase of L = x iterations). The total iteration budget T is the same across variants and is listed in Table S3.]

Table S4: Mean wall-clock run time (in seconds) over 10 replicates for overcomplete GERVE (K = J + 5) with weight updates, T = 500, B = 150, L = 0, using identical hyperparameters across all datasets. Values are mean ± standard deviation. All runs on the same machine. This study illustrates cost trends by covariance parameterisation, number of components, and dimensionality.
    Dataset                        Fixed         Diagonal      Full
    Triangle (d = 2, J = 3)        3.5 ± 0.6     3.9 ± 0.8     4.2 ± 0.8
    Iris (d = 4, J = 3)            3.5 ± 0.7     3.8 ± 0.7     4.6 ± 0.9
    Wine (d = 13, J = 3)           4.2 ± 0.8     4.8 ± 0.9     7.8 ± 1.2
    Pendigits (d = 16, J = 10)     13.4 ± 2.1    15.2 ± 2.6    24.6 ± 3.3

[Figure S9: GERVE run time with respect to the number of components K on the Pendigits dataset, for diagonal, fixed, and full covariance matrices. Curves show mean run time over 10 replicates; shaded bands indicate ± 1 standard deviation.]

S6 Details on the UK collision case study

Data normalisation and processing. We apply GERVE on a normalised space: the Greater London window [−0.54, 0.33] × [51.28, 51.70] (longitude-latitude) is mapped to a centred rectangle [−0.7, 0.7] × [−0.35, 0.35], preserving the aspect ratio and with area ≈ 1. This transformation is applied to the coordinates of the data points. To smoothen the underlying distribution p, we generate artificial data points uniformly in the region between the normalised rectangle and the [−2, 2] × [−2, 2] square, with density 1000 times smaller than the mean density of the data in the normalised rectangle.

GERVE hyperparameters. We target the top H = 10 hotspots. We fit an overcomplete mixture with K = 2H = 20.
• Temperature. We fix the stopping temperature at ω† = 1. This choice follows the elbow of the resolved-mode curve represented in Figure S10 and described in Sec. 7. [Figure S10: Diagnostic for choosing ω: resolved modes after successive pruning (dashed) and merging (solid) for values of ω (in log-scale). The elbow at ω† = 1 maximises resolved modes before decline. The same ω† is used for the bootstrap refits.]
• Step sizes. The step-size schedule is ρ_t = 0.2 t^{−0.1}.
• Initialisation. Component means use k-means++ centroids on the working coordinates. Initial covariances are Σ_{k,1} = σ₁²I with σ₁² = 5 × 10⁻³.
• Covariance bounds. We constrain σ²_min = 1 × 10⁻⁵ and σ²_max = 1 × 10⁻².
• Domain. Optimisation is performed on S = [−2.05, 2.05] × [−2.05, 2.05]. Mapping to the British National Grid is used only for reporting distances and ellipses.
• Entropy term. We use B = 100 Monte Carlo samples per iteration to estimate the entropy contributions.
• Early stopping. We cap iterations at T = 10,000.
We stop early if the following hold at three consecutive checks, evaluated every 10 iterations:
    ∥µ_{k,t+1} − µ_{k,t}∥₂ < 10⁻²  and  ∥S_{k,t+1} − S_{k,t}∥_F / ∥S_{k,t}∥_F < 10⁻¹  for all k.
• Pruning and merging. A component is pruned if eig_max(Σ_k) > σ²_min + β (σ₁² − σ²_min), with β ≈ 0.018, so that the threshold equals 10σ²_min. Component means within 0.005 (≈ 200 meters in the British National Grid, EPSG:27700) are merged.
• Bootstrap settings. We use L = 500 resamples at ω†. Bootstrap modes are matched to the baseline with a Hungarian assignment and adaptive gates around each baseline mode, as described below.

Mode matching and per-mode ellipses. We z-score coordinates using the baseline mode centres. For mode k, we compute nn_k as the minimum distance to any other baseline mode. The gate is τ_k = clip(η nn_k, τ_min, τ_max), with defaults η = 0.35, τ_min = 0.08, τ_max = 0.22. Since η < 0.5, gates stay inside midpoints between modes, which prevents cross-assignments. The clips avoid degenerate gates when modes are extremely close or very isolated. After Hungarian assignment, pairs with distance larger than τ_k are rejected. Accepted matches form the bootstrap cloud for mode k, from which we compute the stability s_k and the 95% chi-square confidence ellipse.

To obtain confidence regions in meters, we project coordinates to the British National Grid (EPSG:27700). For each resolved mode k, let {µ̂^(ℓ)_k}_{ℓ=1}^L denote the matched bootstrap centres. We report 95% confidence ellipses from the empirical covariance V̂_k = Var*(µ̂^(ℓ)_k): if eig₁ ≥ eig₂ are the eigenvalues of V̂_k and e₁ is the principal eigenvector, then the semi-axes are a = √(χ²_{2;1−α} eig₁) and b = √(χ²_{2;1−α} eig₂), with angle atan2(e_{1,x}, e_{1,y}) (in degrees) relative to Easting; a minimal computation sketch is given below. Figure S11 shows, for each baseline mode, the cloud of matched bootstrap centres, the adaptive gate used for matching, and the resulting confidence ellipse, illustrating mode stability and localisation uncertainty.

2020-2024 hotspots. Table S5 reports the 17 hotspots found by GERVE, and Figure S12 displays them in the complete search space.

Mean-shift details. We use scikit-learn's MeanShift with a flat (uniform) kernel on the same projected coordinates. The reference bandwidth h₀ is estimated as the 0.3-quantile of pairwise distances. We then sweep h on a grid from 0.02h₀ to 0.2h₀ and report the resulting centres. Figure S13 shows the centres across the sweep. Smaller h recovers many additional peripheral modes, while larger h merges central structure. In Figure S14, we overlay the 0.05h₀ and 0.1h₀ centres in the zoomed-in window and compare them with GERVE's centres and 95% ellipses. At 0.1h₀, visual inspection against the road network shows that some mean-shift centres fall slightly off carriageways, so we select a smaller bandwidth (0.05h₀) as the baseline comparator for GERVE.

2014-2019 hotspots and comparison to 2020-2024. We fit GERVE to collision data from 2014 to 2019 with identical ω† and K. Confidence ellipses are obtained after L = 500 bootstrap fits. Table S6 reports the 16 modes found by GERVE and Figure S15 displays them with their 95% confidence intervals. Figure S16 overlays the 2014-2019 hotspots on a map of the 2020-2024 ones in the zoomed-in window.
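The per-mode ellipse computation described above can be sketched as follows (our helper; it follows the formulas in the text, including its atan2(e_{1,x}, e_{1,y}) angle convention):

    import numpy as np
    from scipy.stats import chi2

    def confidence_ellipse(centres, alpha=0.05):
        """centres: (L, 2) matched bootstrap centres for one mode,
        already projected to the British National Grid (meters)."""
        V = np.cov(centres, rowvar=False)             # empirical covariance V_hat_k
        eigvals, eigvecs = np.linalg.eigh(V)          # ascending eigenvalues
        eig1, eig2 = eigvals[1], eigvals[0]           # eig1 >= eig2
        e1 = eigvecs[:, 1]                            # principal eigenvector
        c = chi2.ppf(1 - alpha, df=2)                 # chi-square quantile, 2 dof
        a, b = np.sqrt(c * eig1), np.sqrt(c * eig2)   # semi-axes in meters
        angle = np.degrees(np.arctan2(e1[0], e1[1]))  # angle relative to Easting
        return a, b, angle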
[Figure S11: Distribution of the matched modes across the L = 500 bootstrap fits for each baseline hotspot from the 2020-2024 data (one panel per mode, annotated with its stability s and number of matches n), with the estimated confidence ellipses and means.]

[Figure S12: Baseline collision hotspots identified by GERVE for the Greater London area between 2020 and 2024, with stability scores and 95% confidence ellipses, in normalised coordinates. The dashed rectangle is the zoomed-in window of Figure 3.]
[Figure S13: Mean-shift modes across different bandwidths (h = 0.02h₀: 3205 modes; 0.05h₀: 1317 modes; 0.1h₀: 487 modes; 0.2h₀: 158 modes) in the normalised window. The dashed rectangle is the zoomed-in window.]

[Figure S14: Comparison of mean-shift centres (h = 0.05h₀: 122 centres; h = 0.1h₀: 35 centres) and GERVE modes (17 in view), with confidence ellipses for GERVE, in the zoomed-in window.]

[Figure S15: Baseline collision hotspots identified by GERVE for the Greater London area between 2014 and 2019, with stability scores and 95% confidence ellipses, in normalised coordinates. Top: the normalised window containing all data points; the dashed rectangle is the zoomed-in window. Bottom: zoomed-in version for comparison with Figure 3.]

Table S5: Reported hotspots for the 2020-2024 period, with (un-normalised) coordinates, confidence ellipse dimensions (a, b), angles, stability, bootstrap matches, and local counts (radius 200 m). Hotspots are listed in decreasing stability order.
    ID  Name               Longitude (°)  Latitude (°)  a (m)  b (m)  Angle (°)  Stability  Count
    12  Shoreditch         -0.078909      51.524226     134    121    -54        0.95       32
    1   Elephant & Circus  -0.099874      51.495003     120    86     80         0.93       27
    9   Piccadilly Circus  -0.132294      51.510060     162    150    89         0.78       27
    5   Oxford Circus      -0.142322      51.515222     265    66     84         0.68       31
    13  Gunter Grove       -0.185143      51.482279     134    83     72         0.68       22
    10  London Bridge      -0.088544      51.506068     156    117    60         0.56       21
    8   Aldgate            -0.072412      51.515338     357    207    27         0.48       14
    16  Clapham HS         -0.131914      51.464362     409    255    37         0.45       23
    15  Edgware Road       -0.170483      51.519900     230    144    -41        0.44       22
    3   Oval               -0.111630      51.481528     120    105    81         0.42       19
    6   Brixton            -0.114597      51.460859     301    115    -49        0.40       38
    4   Vauxhall           -0.123762      51.485494     177    82     -68        0.37       27
    2   Holborn            -0.120090      51.517891     87     78     54         0.33       25
    11  Mile End           -0.034823      51.524982     149    59     88         0.31       23
    14  Herne Hill         -0.102411      51.452610     305    91     -24        0.20       10
    7   Westminster        -0.125293      51.501799     177    76     79         0.14       17
    0   Victoria           -0.145081      51.496576     96     68     84         0.09       20

Table S6: Reported hotspots for the 2014-2019 period, with (un-normalised) coordinates, confidence ellipse dimensions (a, b), angles, stability, bootstrap matches, and local counts (radius 200 m). Hotspots are listed in decreasing stability order.
    ID  Name               Longitude (°)  Latitude (°)  a (m)  b (m)  Angle (°)  Stability  Count
    14  Shoreditch         -0.078437      51.524026     159    134    -78        0.96       35
    5   Elephant & Circus  -0.100192      51.494709     123    112    -86        0.96       27
    0   Ludgate Circus     -0.104340      51.513547     177    110    -80        0.85       20
    4   Charing Cross      -0.126966      51.507739     168    114    -4         0.75       24
    1   Vauxhall           -0.123471      51.486071     111    102    -66        0.74       23
    15  Shepherd's Bush    -0.223191      51.503715     179    133    -3         0.70       25
    9   Oxford Circus      -0.141671      51.515031     197    96     56         0.62       26
    11  Oval               -0.111782      51.481310     98     72     -79        0.57       23
    2   Aldgate            -0.071210      51.515511     279    100    14         0.54       15
    12  Aldwych            -0.118401      51.511326     145    108    15         0.53       23
    6   St Giles Circus    -0.130695      51.516011     151    109    -29        0.47       15
    3   King's Cross       -0.122322      51.530607     252    155    24         0.44       22
    7   Camden Town        -0.140953      51.540307     197    90     59         0.44       19
    8   Edgware            -0.168222      51.518691     138    137    73         0.25       22
    13  London Bridge      -0.088592      51.506237     318    85     70         0.23       14
    10  Olympia            -0.206178      51.496948     156    72     7          0.07       14

[Figure S16: Comparison of 2014-2019 hotspots and 2020-2024 hotspots in the zoomed-in window.]
