A variational framework for modal estimation

Tâm Le Minh (1,2), Julyan Arbel (2), Florence Forbes (2), and Hien Duy Nguyen (3,4)

1 School of Medicine and Dentistry, Griffith University, Nathan, QLD 4101, Australia
2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
3 School of Computing, Engineering and Mathematical Sciences, La Trobe University, Bundoora, VIC 3086, Australia
4 Institute of Mathematics for Industry, Kyushu Univ., Nishi Ward, Fukuoka 819-0395, Japan

Abstract. We approach multivariate mode estimation through Gibbs distributions and introduce GERVE (Gibbs-measure Entropy-Regularised Variational Estimation), a likelihood-free framework that approximates Gibbs measures directly from samples by maximising an entropy-regularised variational objective with natural-gradient updates. GERVE brings together kernel density estimation, mean-shift, variational inference, and annealing in a single platform for mode estimation. It fits Gaussian mixtures that concentrate on high-density regions and yields cluster assignments from responsibilities, with reduced sensitivity to the chosen number of components. We provide theory in two regimes: as the Gibbs temperature approaches zero, mixture components converge to population modes; at fixed temperature, maximisers of the empirical objective exist, are consistent, and are asymptotically normal. We also propose a bootstrap procedure for per-mode confidence ellipses and stability scores. Simulation and real-data studies show accurate mode recovery and emergent clustering, robust to mixture overspecification. GERVE is a practical likelihood-free approach when the number of modes or groups is unknown and full density estimation is impractical.

Keywords. Mode estimation; Modal clustering; Variational methods; Gibbs measures; Annealing; Bootstrap.

1 Introduction

Understanding the structure of an unknown probability distribution p on $\mathbb{R}^d$, observed through samples, is a central task in statistics. Modes, the local maxima of p, reveal the high-density regions of the data. Detecting multiple modes exposes heterogeneity, supports clustering and segmentation, and offers alternative explanations in inverse problems (Carreira-Perpiñán, 2006; Chacón, 2015). Recently, it has also emerged as a preliminary requirement for new efficient sampling methods for multimodal high-dimensional distributions (Noble et al., 2025).

Classical approaches to mode estimation are commonly grouped into three main families. The first comprises geometric methods that detect high-density regions from the sample (Chernoff, 1964; Dalenius, 1965; Venter, 1967; Sager, 1978). These methods are simple and robust but tend to degrade in high dimensions. The second fits a smooth density and then locates its modes (Parzen, 1962; Loftsgaarden and Quesenberry, 1965; Kronmal and Tarter, 1968; Samworth, 2018; Liu and Ghosh, 2020). This route is principled but suffers from smoothing bias and high computational cost. The third consists of mean-shift algorithms that ascend the gradient of a kernel density estimate (KDE) without fitting a global model (Fukunaga and Hostetler, 1975; Comaniciu and Meer, 2002; Carreira-Perpiñán, 2007; Arias-Castro et al., 2016). These methods are efficient but bandwidth-sensitive and can miss well-separated modes.
To our knowledge, no existing method simultaneously targets multiple density modes, provides valid uncertainty quantification, and scales with dimension via simple algorithmic updates, without requiring a full density estimate (Genovese et al., 2016; Ameijeiras-Alonso et al., 2021). We propose a different perspective: we recast the task of mode estimation into that of approximating a Gibbs measure from samples. Let $f : S \to \mathbb{R}$ be a measurable function and fix a temperature parameter $\omega > 0$. Define
\[
g_\omega(\mathrm{d}x) = \frac{\exp(f(x)/\omega)}{Z_\omega(f)}\,\mathrm{d}x, \qquad \text{with } Z_\omega(f) := \int_S \exp(f(y)/\omega)\,\mathrm{d}y,
\]
and assume $Z_\omega(f) < \infty$. We assume that $S \subset \mathbb{R}^d$ has finite Lebesgue measure. The Gibbs measure $g_\omega$ preserves the modes of f and is the maximiser, among the unconstrained family of distributions, of the entropy-regularised variational objective (Donsker and Varadhan, 1976):
\[
\mathcal{L}_\omega(q) = \mathbb{E}_q[f(X)] + \omega H_S(q), \qquad \text{with } H_S(q) = -\int_S q(y)\log q(y)\,\mathrm{d}y. \tag{1}
\]
We introduce GERVE (Gibbs-measure Entropy-Regularised Variational Estimation), a likelihood-free variational framework that approximates the Gibbs measure associated with a density $f := p$, in a Gaussian-mixture family, from i.i.d. samples $X_{1:N} \sim p$, where p is nonnegative and supported on a subset of S. Since p is a density but is not analytically available, we leverage the identity $\mathbb{E}_q[p(X)] = \mathbb{E}_p[q(X)]$ to target, with natural-gradient updates, the maximisation of an empirical version of $\mathcal{L}_\omega$ that averages the variational density q over the observed samples:
\[
\hat{\mathcal{L}}_{\omega,N}(q) = \frac{1}{N}\sum_{i=1}^N q(X_i) + \omega H_S(q). \tag{2}
\]
This estimation problem requires fundamentally different treatment from optimisation settings where p (or f) can be evaluated: we develop asymptotic theory for maximisers, establish consistency and asymptotic normality, and introduce bootstrap-based inference for per-mode uncertainty quantification, extending recent computational techniques from variational optimisation (Khan and Rue, 2023; Le Minh et al., 2025) to a statistical estimation methodology.
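To make the empirical objective concrete, the following minimal Python sketch evaluates $\hat{\mathcal{L}}_{\omega,N}(q)$ for a Gaussian mixture, with the entropy term estimated by Monte Carlo from q itself; the function names and the Monte Carlo entropy estimator are illustrative choices, not the reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, weights, means, covs):
    """Density of a Gaussian mixture at points x of shape (n, d)."""
    return sum(w * multivariate_normal(m, C).pdf(x)
               for w, m, C in zip(weights, means, covs))

def empirical_objective(X, weights, means, covs, omega, rng, n_mc=5000):
    """Evaluate (1/N) sum_i q(X_i) + omega * H_S(q), where the entropy
    H_S(q) = -E_q[log q] is estimated by Monte Carlo from samples of q
    (assuming q places negligible mass outside S, as in Suppl. S1.2)."""
    fit_term = mixture_pdf(X, weights, means, covs).mean()
    # Sample from the mixture: pick components, then draw from each Gaussian.
    ks = rng.choice(len(weights), size=n_mc, p=weights)
    Y = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in ks])
    entropy = -np.log(mixture_pdf(Y, weights, means, covs)).mean()
    return fit_term + omega * entropy
```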
By fitting Gaussian mixtures to the Gibbs density, GERVE unifies ideas from mean-shift, kernel density estimation, entropy annealing, and variational inference within one algorithmic paradigm. The variational formulation is well suited to multimodal and multivariate settings, where geometric or purely nonparametric mode estimators often struggle. The entropy term induces repulsion between mixture components, enabling exploration of the modal landscape, while annealing (ω → 0) concentrates mass at the global modes. This allows overcomplete mixtures to automatically adapt to the effective number of modes without prior specification.

Our main contributions are:

1. Variational formulation for mode estimation. We cast mode estimation as Gibbs-measure approximation from samples, enabling mode finding and uncertainty quantification when the number of modes is unknown and density estimation is impractical.
2. Asymptotic theory for multimodal recovery. We establish two complementary theoretical regimes: as ω → 0, the variational mixture concentrates on the population modes; at fixed ω > 0, empirical maximisers are consistent and asymptotically normal.
3. Bootstrap uncertainty for modes. We provide per-mode confidence ellipses and stability scores via a matched-mode bootstrap with theoretical guarantees.
4. Algorithmic framework. Natural-gradient updates with annealing recover multiple modes, estimate local curvature, and yield cluster assignments without pre-specifying the number of groups. A fixed-covariance special case recovers the mean-shift algorithm, while learned covariances adapt to anisotropy and reduce bandwidth sensitivity.
5. Evidence on simulations and data. We show accurate mode recovery and competitive clustering, robustness to mild overcompleteness, and a hotspot analysis with uncertainty on UK road-collision data.

Our approach is related to modal clustering (Menardi, 2016; Chacón, 2020), which defines clusters as basins of attraction of density modes. GERVE uses a variational approximation, yielding clusters that locally agree with basins near modes under mild conditions (Suppl. S1.4), while providing computational efficiency and natural uncertainty quantification.

Outline. The remainder is organised as follows. Section 2 recalls the variational foundations of Gibbs measures and shows that Gaussian mixture approximations can be used for mode seeking by specifying their behaviour under annealing (Thm. 2.1). Section 3 shows that this mechanism extends to mode estimation when the density to maximise is only available through samples; the main principles of GERVE are introduced and the asymptotic behaviour of the empirical maximisers is established (Thm. 3.1). Section 4 presents algorithmic details (Alg. 1). Section 5 gives the fixed-temperature bootstrap procedure (Alg. 2) with guarantees (Thms. 5.1 and 5.3). Sections 6 and 7 report simulations and a case study on UK road-collision hotspots. Section 8 discusses limitations and future directions. All proofs and additional details are in the Supplementary Material. A Python implementation of GERVE can be found at github.com/tam-leminh/gerve.

Notation. We consider that p (or f) has a finite number I of global modes $\{x^\star_i\}_{i=1}^I$. When $f : S \to \mathbb{R}$ and there is no ambiguity on the domain, "f is $C^k$" means $f \in C^k(S)$, that is, f is continuously differentiable on S up to order k. For a distribution q on $A \subseteq \mathbb{R}^d$ and a set $B \subset A$, we write $q(B) := \int_B q(x)\,\mathrm{d}x$ and $q^B := q\mathbf{1}_B/q(B)$ for its truncation (or conditional law) on B. The set $\{q_\vartheta, \vartheta \in \Theta\}$ denotes a variational family of densities parameterised by $\vartheta$, where $\vartheta = \lambda$ for single Gaussians $q_\lambda = \mathcal{N}(\mu, \Sigma)$, or $\vartheta = \Lambda$ for Gaussian mixtures $q_\Lambda = \sum_{k=1}^K \pi_k \mathcal{N}(\mu_k, \Sigma_k)$. When there is no ambiguity, we write $\mathcal{L}_\omega(\vartheta) := \mathcal{L}_\omega(q_\vartheta)$ for the objective. Unless explicitly specified, $\|\cdot\|$ denotes the Euclidean norm for vectors and the corresponding operator norm for matrices.

2 Mode seeking with annealed Gaussian mixtures

For a measurable function $f : S \to \mathbb{R}$, the associated Gibbs measure $g_\omega$ preserves its modes. The goal of finding the modes of f can then be turned into that of approximating $g_\omega$, which in turn can be achieved through a variational approximation principle. The classical variational identity
\[
\mathcal{L}_\omega(q) = \omega \log Z_\omega(f) - \omega\,\mathrm{KL}(q \,\|\, g_\omega), \qquad \text{with } \mathrm{KL}(q \,\|\, g_\omega) := \int_S q(x)\log\big(q(x)/g_\omega(x)\big)\,\mathrm{d}x,
\]
implies that $g_\omega$ uniquely maximises the objective $\mathcal{L}_\omega$ when $Z_\omega(f) < \infty$ (Donsker and Varadhan, 1976). Restricting the search for q to a tractable variational family gives an approximation of $g_\omega$ by minimisation of $\mathrm{KL}(q \,\|\, g_\omega)$.
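For completeness, the identity follows in one line by substituting $\log g_\omega(x) = f(x)/\omega - \log Z_\omega(f)$ into the KL divergence:
\[
\omega\,\mathrm{KL}(q \,\|\, g_\omega) = \omega \int_S q \log q \;+\; \omega \int_S q \Big( \log Z_\omega(f) - \frac{f}{\omega} \Big) = -\,\omega H_S(q) - \mathbb{E}_q[f(X)] + \omega \log Z_\omega(f),
\]
so that $\mathcal{L}_\omega(q) = \omega \log Z_\omega(f) - \omega\,\mathrm{KL}(q \,\|\, g_\omega)$; since the KL term is nonnegative and vanishes only at $q = g_\omega$, the Gibbs measure is the unique maximiser.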
As a natural choice for approximating multimodal landscapes, we study how Gaussian mixtures approximate $g_\omega$ when ω decreases to 0. Since $g_\omega$ is defined only on S, admissible variational distributions are required to have support in S. We therefore work with mixtures of truncated Gaussians. The next result, proved in Supplementary Material S2.2, describes how optimal mixtures concentrate on the global modes when the number of components K is at least the number of global modes I.

Theorem 2.1 (Gaussian mixture concentration on global modes). Assume $f \in C^3(S)$ and bounded on S, and, for each mode in $\{x^\star_i\}_{i=1}^I$, $x^\star_i \in \mathrm{int}(S)$ and $H_i := -\nabla^2 f(x^\star_i) \succ 0$. Consider $\mathcal{Q}$ to be the variational family of Gaussian mixtures truncated on S, with a fixed number of components $K \geq I$ and upper-bounded covariances. Let $q^\star_\omega$ denote any maximiser of $\mathcal{L}_\omega$ in $\mathcal{Q}$. Choose disjoint neighbourhoods $U_1, \ldots, U_I$ of the modes and let $U_0 := S \setminus \bigcup_{i=1}^I U_i$. Then, as $\omega \to 0$:

(i) We have $\mathrm{KL}(q^\star_\omega \,\|\, g_\omega) \to 0$. In particular, $q^\star_\omega(U_0) \to 0$ and, for each neighbourhood $U_i$, $i = 1, \ldots, I$,
\[
q^\star_\omega(U_i) \to c_i, \qquad \text{where } c_i \propto (\det H_i)^{-1/2}, \quad \sum_{i=1}^I c_i = 1.
\]

(ii) For all $i = 1, \ldots, I$, the conditional laws of $q^\star_\omega$ and $g_\omega$ on $U_i$ satisfy $\mathrm{KL}\big(q^{\star,U_i}_\omega \,\|\, g^{U_i}_\omega\big) \to 0$. In particular, the mean and covariance of $q^{\star,U_i}_\omega$ satisfy
\[
\mathbb{E}_{q^{\star,U_i}_\omega}[X] = x^\star_i + o(1), \qquad \mathrm{Cov}_{q^{\star,U_i}_\omega}(X) = o(1).
\]

This result shows that Gaussian mixtures provide accurate approximations of the Gibbs measure and place their mass on all global modes of $g_\omega$, thus of f, under annealing whenever $K \geq I$. The local shape of $g_\omega$ at a particular mode $x^\star_i$ has been studied in Proposition 3 of Le Minh et al. (2025). There, the optimal Gaussian approximation in a neighbourhood of $x^\star_i$ is of the form $\mathcal{N}(\mu_\omega, \Sigma_\omega)$ satisfying
\[
\mu_\omega = x^\star_i + o(\sqrt{\omega}), \qquad \Sigma_\omega = \omega H_i^{-1} + o(\omega).
\]
These expansions clarify how Gibbs measures can be used for mode seeking. As ω decreases, mixture components sharpen and their means drift toward the global modes of f. The temperature parameter governs the trade-off between exploration (through smooth and diffuse mixtures at large ω) and exploitation (concentrated components at small ω). A decreasing ω schedule allows starting in an exploration regime and gradually refining mode localisation.
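A worked special case makes these expansions exact. If f is the quadratic $f(x) = -\tfrac{1}{2}(x - x^\star)^\top H (x - x^\star)$ with $H \succ 0$, taking $S = \mathbb{R}^d$ for the sake of illustration (and ignoring the finite-measure restriction), then
\[
g_\omega(x) \propto \exp\Big(-\tfrac{1}{2\omega}(x - x^\star)^\top H (x - x^\star)\Big),
\]
i.e. $g_\omega = \mathcal{N}(x^\star, \omega H^{-1})$ exactly, so $\mu_\omega = x^\star$ and $\Sigma_\omega = \omega H^{-1}$ with vanishing remainders; a general smooth f behaves this way to leading order near each nondegenerate mode, by Taylor expansion.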
We next introduce a statistical formulation in which f is unobserved and only accessible through samples drawn from it.

3 Sample-based mode estimation and modal clustering

We observe i.i.d. samples $X_{1:N} \sim p$ supported on a bounded domain $\mathrm{supp}(p) := \{x \in \mathbb{R}^d : p(x) > 0\}$, and we choose S with finite Lebesgue measure such that $\mathrm{supp}(p) \subset S$. Replacing f with p, we distinguish the population and empirical objectives, respectively referring to $\mathcal{L}_\omega(q)$ and $\hat{\mathcal{L}}_{\omega,N}(q)$ defined by (1), with $f := p$, and (2). Instead of performing optimisation over truncated distributions on S, GERVE maximises $\hat{\mathcal{L}}_{\omega,N}$ over untruncated Gaussian mixture families on $\mathbb{R}^d$, while all integrals in the objective are kept over S. Supplementary Material S1.2 shows that if S is taken large enough, then the objective gap between the truncated and untruncated distributions is uniformly small, and the truncated and untruncated maximisers coincide asymptotically under M-estimation theory (van der Vaart, 1998). This simplifies optimisation, as it removes parameter-dependent normalisers in the natural-gradient updates (details in Section 4.1).

For a single Gaussian $\mathcal{N}(\mu, \Sigma)$, we consider the compact parameter set
\[
\Theta := \Big\{ \lambda = \big(\Sigma^{-1}\mu,\, -\tfrac{1}{2}\Sigma^{-1}\big) : \|\mu\|_\infty \leq \mu_{\max},\ \sigma^2_{\min} I \preceq \Sigma \preceq \sigma^2_{\max} I \Big\}. \tag{3}
\]
Proposition S1.1 shows that if the outside mass $\varepsilon(\lambda) = 1 - \int_S q_\lambda(x)\,\mathrm{d}x$ is uniformly small on Θ, then the objective gap is uniformly small. The same reasoning applies component-wise to mixtures on the parameter set
\[
\Theta_K := \Big\{ \Lambda = (v_1, \ldots, v_{K-1}, \lambda_1, \ldots, \lambda_K) : |v_k| \leq v_{\max},\ \lambda_k \in \Theta \Big\}, \tag{4}
\]
with bounded logits $v_k := \log(\pi_k/\pi_K)$, which prevent degeneracies while preserving flexibility (Suppl. S1.2, eq. (S3)). In fact, throughout, $\Theta_K$ shall be understood as its restriction to the closure $\overline{\Theta^{\mathrm{lex}}_K}$ of the lexicographically ordered parameter space $\Theta^{\mathrm{lex}}_K$ (defined in Suppl. S1.3), which enforces a canonical labelling and thus provides sufficient identifiability for our technical results.

In the multimodal optimisation setting, the main source of bias is the entropy-induced repulsion in mixtures. We deliberately exploit this repulsion early in training to mitigate mode collapse, a known issue in variational Gaussian mixtures (Wu et al., 2019; Jerfel et al., 2021; Soletskyi et al., 2025). The induced bias progressively diminishes under annealing. The following theorem establishes that GERVE remains statistically consistent at fixed $\omega = \omega_0 > 0$, as $N \to \infty$. An asymptotic normality result is also derived. These results are stated for Gaussian mixtures in $\Theta_K$ (eq. (4)), but apply to single Gaussians in Θ (eq. (3)) as a special case. The proof is provided in Supplementary Material S2.3.

Theorem 3.1 (Asymptotics of empirical maximisers). At fixed $\omega_0 > 0$, if p is bounded and continuous on S, then a population maximiser $\Lambda^\star_{\omega_0}$ of $\mathcal{L}_{\omega_0}$ exists on $\Theta_K$. If it is unique, then any empirical maximiser $\hat{\Lambda}_{\omega_0,N} \in \arg\max \hat{\mathcal{L}}_{\omega_0,N}$ satisfies, as $N \to \infty$:
\[
\hat{\Lambda}_{\omega_0,N} \xrightarrow{P} \Lambda^\star_{\omega_0}.
\]
Furthermore, if $H^\star_{\omega_0} := -\nabla^2_\Lambda \mathcal{L}_{\omega_0}(\Lambda^\star_{\omega_0}) \succ 0$, then
\[
\sqrt{N}\big(\hat{\Lambda}_{\omega_0,N} - \Lambda^\star_{\omega_0}\big) \xrightarrow{D} \mathcal{N}(0, W_{\omega_0}), \qquad \text{where } W_{\omega_0} := \big(H^\star_{\omega_0}\big)^{-1} V_{\omega_0} \big(H^\star_{\omega_0}\big)^{-1}, \quad V_{\omega_0} := \mathrm{Var}\big(\nabla_\Lambda q_\Lambda(X)\big)\big|_{\Lambda = \Lambda^\star_{\omega_0}}.
\]

Remark. Maximiser uniqueness is subtle in mixture models. In particular, unless the parameter space is restricted so as to rule out degeneracies, a component with vanishing weight can have its associated mean and covariance drift without affecting the mixture density, and if several components share the same (µ, Σ), then redistributing weights within that tied group also leaves the density unchanged. Consequently, convergence at the parameter level cannot be guaranteed in general, even when the induced densities converge to a maximiser. To formalise consistency without uniqueness, one can replace the uniqueness assumption in Theorem 3.1 by the weaker requirement that $D_{\omega_0} = \arg\max_{\Lambda \in \Theta_K} \mathcal{L}_{\omega_0}(\Lambda)$ is nonempty and compact. The set-valued argmax theorem (van der Vaart, 1998, Cor. 5.58) then yields
\[
\mathrm{dist}\big(\hat{\Lambda}_{\omega_0,N}, D_{\omega_0}\big) \xrightarrow{P} 0, \qquad \mathrm{dist}(x, A) = \inf_{a \in A} \|x - a\|.
\]
The asymptotic normality statement, however, holds only under local identifiability, for example $H^\star_{\omega_0} \succ 0$, which rules out duplicated components at the maximiser.
In practice, non-uniqueness can be helpful: redundant components may be absorbed as some weights shrink or as component means coalesce near prominent modes, so that an overcomplete mixture adapts to the effective number of modes without the user pre-specifying K.

As demonstrated by Theorem 2.1, optimal variational mixtures concentrate on the modes of p as ω → 0. With the consistency guarantees of Theorem 3.1, GERVE therefore defines a mode estimator: Gaussian mixtures $q_\Lambda = \sum_{k=1}^K \pi_k q_{\lambda_k}$ simultaneously capture multiple modes under annealing.

Beyond mode estimation, GERVE naturally induces a clustering structure. With possibly overcomplete mixtures, components are attracted toward distinct modes as entropy decreases. Redundant components either vanish or collapse, so the effective number of clusters adapts automatically. Unlike likelihood-based Gaussian mixture clustering, GERVE targets density modes under annealing.

In practice, once fitted, cluster assignments are straightforward via responsibilities (posterior component probabilities). This connects to modal clustering (Menardi, 2016; Chacón, 2020), where clusters are attraction basins of density modes. Classical approaches rely on nonparametric density estimates and gradient flows, while GERVE uses a parametric variational approximation. The resulting clusters need not coincide exactly with attraction basins, but they agree locally near each mode under mild conditions; see Supplementary Material S1.4 for a statement and a counterexample. This yields adaptive modal clustering without tuning K. A caveat is that annealing emphasises dominant modes: secondary modes may disappear as ω → 0, so retaining them requires stopping at a positive temperature or imposing weight lower bounds.

4 GERVE algorithm and variants

4.1 Variational optimisation with natural gradients

The empirical objective $\hat{\mathcal{L}}_{\omega,N}$ (eq. (2)) can be optimised over Θ or $\Theta_K$ using natural gradients $\tilde{\nabla}_\vartheta \hat{\mathcal{L}}_{\omega,N}(\vartheta) := F(\vartheta)^{-1} \nabla_\vartheta \hat{\mathcal{L}}_{\omega,N}(\vartheta)$, which incorporate the geometry of the variational parameter space through the Fisher information matrix $F(\vartheta)$ (Amari, 1998). When $q_\vartheta$ belongs to an exponential family with sufficient statistic T(X), the natural gradient equals the gradient in the expectation parameters $M = \mathbb{E}_{q_\vartheta}[T(X)]$:
\[
\tilde{\nabla}_\vartheta \hat{\mathcal{L}}_{\omega,N}(\vartheta) = \nabla_M \hat{\mathcal{L}}_{\omega,N}(\vartheta). \tag{5}
\]
This identity allows computation of natural gradients without explicit estimation and inversion of the Fisher matrix. It also holds for untruncated Gaussian mixtures via the Minimal Conditional Exponential Family representation (MCEF; Lin et al., 2019).

Algorithm 1 describes natural-gradient ascent on $\hat{\mathcal{L}}_{\omega,N}(\vartheta)$ with a temperature schedule $\omega_t$ over iterations $t = 1, \ldots, T$. Convergence guarantees for stochastic natural-gradient ascent are stated in Supplementary Material S1.5: at fixed ω, updates converge to the stationary set under standard Robbins–Monro conditions and bounded iterates (Thm. S1.11); with $\omega_t \to 0$, stationary sets can be tracked with slow schedules, i.e. $|\omega_{t+1} - \omega_t| = o(\rho_t)$, where $\rho_t$ is a gradient stepsize (Cor. S1.12). In GERVE, a projection step can be added to ensure the updates remain in the compact space Θ or $\Theta_K$.

Algorithm 1: GERVE (general form)
1. Input: samples $X_{1:N}$.
2. Set: initial $\vartheta_1$, stepsizes $(\rho_t)_t$, schedule $(\omega_t)_t$, (optional) batch size B.
3. for t = 1 : T do
4.   (Optional) Sample a mini-batch $\{X^{(t)}_i\}_{i=1}^B$ from $X_{1:N}$ (or use the full data).
5.   Compute an (unbiased) natural-gradient estimate $\tilde{\nabla}\hat{\mathcal{L}}_{\omega_t,N}(\vartheta_t)$.
6.   Update $\vartheta_{t+1} \leftarrow \vartheta_t + \rho_t \tilde{\nabla}\hat{\mathcal{L}}_{\omega_t,N}(\vartheta_t)$.
7.   (Optional) Project to a compact set.
8. end
9. Return: $\vartheta_{T+1}$.
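To fix ideas, here is a minimal Python sketch of Algorithm 1 in its simplest regime: a single Gaussian with fixed isotropic covariance, where the natural-gradient step on the mean reduces to update (6) derived in Section 4.2 below and the entropy gradient is approximately zero. The stepsize schedule and iteration count are illustrative choices.

```python
import numpy as np

def gerve_fixed_cov(X, mu0, sigma2, T=500):
    """Minimal GERVE loop (Algorithm 1) for a single Gaussian with fixed
    isotropic covariance sigma2 * I: the natural-gradient step on the mean
    reduces to update (6) and no annealing is needed in this special case."""
    N, d = X.shape
    mu = mu0.copy()
    for t in range(1, T + 1):
        rho_t = 1.0 / t                        # Robbins-Monro stepsize
        diff = X - mu                          # (N, d)
        q = np.exp(-0.5 * np.sum(diff**2, axis=1) / sigma2)
        q /= (2 * np.pi * sigma2) ** (d / 2)   # q_lambda(X_i), fixed covariance
        mu = mu + (rho_t / N) * diff.T @ q     # update (6)
    return mu
```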
4.2 Variant A: Equivalence to Gaussian mean-shift

Consider the family of Gaussian distributions with a fixed isotropic covariance $\sigma^2 I$, $\{q_\mu = \mathcal{N}(\mu, \sigma^2 I), \mu \in \mathbb{R}^d\}$. For a large enough S, we have $\nabla_\mu H_S(q_\mu) \approx 0$, so the natural-gradient step simplifies to
\[
\mu_{t+1} = \mu_t + \frac{\rho_t}{N}\sum_{i=1}^N (X_i - \mu_t)\, q_{\lambda_t}(X_i). \tag{6}
\]
Using a batch $\{X^{(t)}_i\}_{i=1}^B$ gives an unbiased estimator of the right-hand side (see Suppl. S3.2 for the derivation). Denoting by $p_h(\mu) = \frac{1}{N}\sum_{i=1}^N \phi_h(X_i - \mu)$ the KDE with a Gaussian kernel $\phi_h$ with bandwidth h, the following result states that GERVE recovers the mean-shift algorithm when $h = \sigma^2$.

Proposition 4.1 (Equivalence to Gaussian mean-shift). Update rule (6) performs gradient ascent on $p_h$, and normalising the step $\rho_t = p_h(\mu)^{-1}$ yields the classical mean-shift fixed-point update:
\[
\mu_{t+1} = \frac{\sum_{i=1}^N \phi_h(X_i - \mu_t)\, X_i}{\sum_{j=1}^N \phi_h(X_j - \mu_t)}.
\]

Viewing fixed-covariance GERVE as mean-shift puts us under classical KDE theory. Operationally, $\mu_t$ is a particle ascending the smoothed density $p_h$, and the (fixed) covariance $hI$ is the bandwidth: large h leads to smooth and global but biased updates, while small h yields sharp but noisy updates, sensitive to sampling. This regime is thus a well-understood nonparametric baseline that grounds the richer GERVE variants. It can typically be used to set the covariance bounds in learned-covariance variants of GERVE. For example, if p is $C^3$, the gradient of the Gaussian KDE has bias $O(h^2)$ and fluctuation $O_p\big((Nh^{d+2})^{-1/2}\big)$, yielding the optimal plug-in rate $h \asymp N^{-1/(d+6)}$ (Parzen, 1962; Romano, 1988; Tsybakov, 1990; Genovese et al., 2016).
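The fixed-point form of Proposition 4.1 is a one-liner; the following sketch (with an illustrative stopping rule) iterates it from a given start, so the equivalence with the loop above can be checked numerically.

```python
import numpy as np

def mean_shift(X, mu0, h, n_iter=200, tol=1e-8):
    """Classical Gaussian mean-shift fixed-point iteration, i.e. update (6)
    with the normalised step rho_t = p_h(mu)^{-1} of Proposition 4.1.
    The kernel phi_h only enters through weights, so its constants cancel."""
    mu = mu0.copy()
    for _ in range(n_iter):
        w = np.exp(-0.5 * np.sum((X - mu) ** 2, axis=1) / h)  # phi_h up to constants
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu
```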
4.3 Variant B: Gaussian mixture GERVE

When considering Gaussian mixtures represented by $\Theta_K$, it is convenient to adopt the following parametrisation. Let $q_\Lambda = \sum_{k=1}^K \pi_k \mathcal{N}(\mu_k, S_k^{-1})$, where $S_k := \Sigma_k^{-1}$, with logits $v_k = \log(\pi_k/\pi_K)$ and $\Lambda = (v_{1:K-1}, \lambda_{1:K}) \in \Theta_K$. Natural-gradient steps lead to component-wise updates coupled via the entropy term (full derivation in Suppl. S3.3):
\[
S_{k,t+1} = S_{k,t}\Big(1 - \frac{\rho_t}{N}\sum_{i=1}^N \big[(X_i - \mu_{k,t})(X_i - \mu_{k,t})^\top S_{k,t} - I\big]\, q_{\lambda_{k,t}}(X_i)\Big) - \rho_t \omega_t \nabla_{S_k^{-1}} H_S(q_\Lambda)\big|_{\Lambda = \Lambda_t},
\]
\[
\mu_{k,t+1} = \mu_{k,t} + \frac{\rho_t}{N} S_{k,t+1}^{-1} S_{k,t} \sum_{i=1}^N (X_i - \mu_{k,t})\, q_{\lambda_{k,t}}(X_i) + \rho_t \omega_t \nabla_{\mu_k} H_S(q_\Lambda)\big|_{\Lambda = \Lambda_t},
\]
\[
v_{k,t+1} = v_{k,t} + \frac{\rho_t}{N}\sum_{i=1}^N \big[q_{\lambda_{k,t}}(X_i) - q_{\lambda_{K,t}}(X_i)\big] + \rho_t \omega_t \nabla_{\pi_k} H_S(q_\Lambda)\big|_{\Lambda = \Lambda_t}.
\]
Learned covariances adapt to anisotropy and converge to positive limits at fixed temperature. As ω → 0, covariances shrink faster. Furthermore, the entropy term induces repulsion between components at large ω that vanishes as ω → 0. Therefore, a slower-decreasing annealing schedule allows for longer exploration by delaying mode collapse and improving mode capture in rugged and multimodal landscapes. Overcomplete mixtures (when K is larger than the number of modes) naturally share components per mode (see Thm. 2.1). Gradient derivations and Monte Carlo estimators for the entropy terms are in Supplementary Material S3.3. Diagonal, isotropic, or fixed-covariance variants are obtained by modifying $\Theta_K$, and the derivations follow the same pattern.

4.4 Complexity and practical guidelines

With T iterations, batch size B, and a Gaussian mixture with K components, the computational cost is $O(d^3 BKT)$. The $d^3$ factor reflects the matrix multiplications and inversions that arise when updating full precision matrices. Restricting the variational family reduces this cost to $O(dBKT)$ for diagonal or fixed covariances. These restricted families often preserve the qualitative behaviour of the algorithm and improve scalability, which brings the cost in line with classical mode-seeking and clustering procedures. Further details appear in Supplementary Material S3.4.

We now summarise practical guidance for hyperparameter selection, collected in an illustrative configuration sketch below. For the temperature schedule, the initial value of ω should be large so that mixture components repel and cover S. The temperature should then decrease to a target value $\omega^\dagger$ as slowly as possible, so that the iterates continue to track the relevant stationary sets, with $\omega^\dagger = 0$ for mode estimation and $\omega^\dagger = \omega_0$ with $\omega_0 > 0$ for clustering, as formalised in Corollary S1.12. For stepsizes, a safe choice satisfies the Robbins–Monro conditions $\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$. In practice, when the temperature $\omega_t$ decreases over time, we find that mean updates are stabilised by the scaling $\rho_t \propto \omega_t^{-\beta}$ with $\beta \in (0, 1)$. This choice often accelerates convergence, although it can violate the Robbins–Monro conditions. Compactness constraints require explicit bounds for the covariance parameters in Θ and for the mixture parameters in $\Theta_K$. The upper bound $\sigma^2_{\max}$ should be large enough that a single component can cover the entire dataset. For mode estimation, a suitable rule is to take the lower bound $\sigma^2_{\min}$ of the order of magnitude of $N^{-1/(d+6)}$, which follows classical mean-shift bandwidth theory. For clustering, $\sigma^2_{\min}$ should be comparable to intra-class variances. For the logits that define the weights in $\Theta_K$ (eq. (4)), setting the upper bound $v_{\max} \geq 6$ allows any subset of weights to be smaller than $10^{-5}$, which makes it possible to identify vanishing components while maintaining compactness (see Suppl. S1.2, eq. (S3)).
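The following snippet collects these guidelines into default hyperparameters; the specific functional forms (a geometric annealing schedule, the exponent β = 0.7 echoed in Section 6.1, and the small positive stand-in for $\omega^\dagger = 0$) are illustrative choices, not prescriptions.

```python
import numpy as np

def default_hyperparameters(X, T=2000, beta=0.7, clustering=False):
    """Illustrative defaults following the guidelines of Section 4.4."""
    N, d = X.shape
    omega_init = 1.0                                   # large enough for repulsion
    omega_target = 1e-1 if clustering else 1e-3        # omega^dagger (stand-in for 0)
    # Slowly (geometrically) decreasing temperature schedule.
    omegas = omega_init * (omega_target / omega_init) ** (np.arange(T) / (T - 1))
    rhos = 1e-2 * omegas ** (-beta)                    # rho_t proportional to omega_t^{-beta}
    return dict(
        omegas=omegas,
        rhos=rhos,
        sigma2_min=N ** (-1.0 / (d + 6)),              # mean-shift bandwidth rate
        sigma2_max=np.var(X, axis=0).sum(),            # one component covers the data
        v_max=6.0,                                     # lets weights drop below 1e-5
    )
```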
5 Bootstrap uncertainty quantification for mode estimation

Mode-level uncertainty can be provided at a fixed stopping temperature $\omega_0 > 0$, using a bootstrap principle. A GERVE baseline fit is performed on the initial sample (constant $\omega_t = \omega_0$). GERVE is then refit L times on resampled data. Each set of resulting modes is matched to the baseline modes, providing an empirical spread in the form of confidence ellipses and a stability score.

5.1 Bootstrap procedure

The bootstrap procedure is summarised in Algorithm 2 for Gaussian mixture GERVE. Each fit, $\ell = 0, \ldots, L$, uses an overcomplete K-component mixture followed by pruning (for some threshold $\epsilon \ll 1$) and merging of near-duplicates. The first fit identifies $\hat{K}$ "baseline" modes. Each bootstrap fit $\ell = 1, \ldots, L$ produces its own set of $\hat{K}^{(\ell)}$ modes, which are matched to the baseline modes via minimum-cost assignment (Hungarian algorithm; Kuhn, 1955). Then, we form confidence intervals from the empirical distribution of matched modes, and monitor mode recovery with the stability score
\[
s_k = \frac{1}{L}\sum_{\ell=1}^L \mathbf{1}\{\text{mode } k \text{ is matched in replicate } \ell\}. \tag{7}
\]

Algorithm 2: Bootstrap uncertainty quantification for mode estimation
1. Input: samples $X_{1:N}$, temperature $\omega_0$, number of replicates L.
2. Fit GERVE at $\omega_0$ on $X_{1:N}$, prune/merge to obtain baseline modes $\{\hat{\mu}^{(0)}_k\}_{k=1}^{\hat{K}}$.
3. for $\ell = 1 : L$ do
4.   Sample $J^{(\ell)}_{1:N}$ i.i.d. $\sim$ Unif([N]), set $X^{(\ell)}_i \leftarrow X_{J^{(\ell)}_i}$ for $i = 1 : N$.
5.   Fit GERVE at $\omega_0$ on $X^{(\ell)}_{1:N}$, prune/merge to obtain $\{\hat{\mu}^{(\ell)}_j\}_{j=1}^{\hat{K}^{(\ell)}}$.
6.   Match $\{\hat{\mu}^{(\ell)}_j\}_{j=1}^{\hat{K}^{(\ell)}}$ to $\{\hat{\mu}^{(0)}_k\}_{k=1}^{\hat{K}}$ (Hungarian).
7.   Record, for each k, the matched $\hat{\mu}^{(\ell)}_k$.
8. end
9. Return: per-mode confidence ellipses from $(\hat{\mu}^{(\ell)}_k)_{\ell=1}^L$ and stability scores $(s_k)_{k=1}^{\hat{K}}$.
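A compact sketch of the resampling, matching, and summary steps is given below, using scipy's linear_sum_assignment for the Hungarian step; gerve_fit stands in for any routine returning pruned/merged mode centres and is a hypothetical placeholder, not part of the reference implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bootstrap_modes(X, gerve_fit, L=500, rng=None):
    """Algorithm 2 sketch: resample, refit, Hungarian-match to baseline,
    and summarise per-mode spread and stability. The adaptive distance
    gates used in the case study (Suppl. S6) are omitted for brevity."""
    rng = rng or np.random.default_rng()
    base = gerve_fit(X)                               # baseline modes, shape (K_hat, d)
    K, N = len(base), len(X)
    matched = [[] for _ in range(K)]
    for _ in range(L):
        Xb = X[rng.integers(0, N, size=N)]            # bootstrap resample
        modes = gerve_fit(Xb)                         # shape (K_hat_ell, d)
        cost = np.linalg.norm(base[:, None] - modes[None], axis=-1)
        rows, cols = linear_sum_assignment(cost)      # Hungarian matching
        for k, j in zip(rows, cols):
            matched[k].append(modes[j])
    stability = np.array([len(m) / L for m in matched])       # eq. (7)
    covariances = [np.cov(np.array(m).T) if len(m) > 1 else None
                   for m in matched]                  # basis for 95% ellipses
    return stability, covariances
```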
5.2 Statistical validity

Let $\hat{\mathcal{L}}_{\omega_0,N}$ be the empirical objective at fixed $\omega_0 > 0$ on a compact parameter set $\Theta_K$ (eq. (4)), and let $\hat{\Lambda}_{\omega_0,N} \in \arg\max_{\Lambda \in \Theta_K} \hat{\mathcal{L}}_{\omega_0,N}(\Lambda)$ with a unique population maximiser $\Lambda^\star_{\omega_0}$. Write $P^*$ for the conditional probability given the observed sample $X_{1:N}$. Bootstrap limits are understood conditionally on $X_{1:N}$, in P-probability. Define the bootstrap criterion and maximiser
\[
\hat{\mathcal{L}}^*_{\omega_0,N}(\Lambda) = \mathbb{E}_{P^*_N}[q_\Lambda(X)] + \omega_0 H_S(q_\Lambda), \qquad \hat{\Lambda}^*_{\omega_0,N} \in \arg\max_{\Lambda \in \Theta_K} \hat{\mathcal{L}}^*_{\omega_0,N}(\Lambda).
\]
In addition to uniqueness, we assume that the population maximiser satisfies $H^\star_{\omega_0} = -\nabla^2_\Lambda \mathcal{L}_{\omega_0}(\Lambda^\star_{\omega_0}) \succ 0$, so that Theorem 3.1 fully applies. The following statements guarantee the validity of the bootstrap for parameters (Thm. 5.1) and for matched modes (Thm. 5.3), and their robustness to inexact maximisation (Prop. 5.4). All results are proved in Supplementary Material S2.4. In Supplementary Material S1.6, we also investigate the consistency of stability scores and propose strategies to handle nonconvex objectives.

Theorem 5.1 (Parameter-level bootstrap validity). We have, as $N \to \infty$:
\[
\sqrt{N}\big(\hat{\Lambda}^*_{\omega_0,N} - \hat{\Lambda}_{\omega_0,N}\big) \xrightarrow{D} \mathcal{N}(0, W_{\omega_0}) \quad \text{conditionally on } X_{1:N},
\]
where $W_{\omega_0} := \big(H^\star_{\omega_0}\big)^{-1} V_{\omega_0} \big(H^\star_{\omega_0}\big)^{-1}$ and $V_{\omega_0} := \mathrm{Var}\big(\nabla_\Lambda q_\Lambda(X)\big)\big|_{\Lambda = \Lambda^\star_{\omega_0}}$.

For the theoretical analysis, we fix deterministic target locations $u_{1:K_0}$, which act as reference anchors for defining the matching map. In practice, these targets correspond to the subset of baseline mode estimates we intend to track, so $K_0$ is taken no larger than the number $\hat{K}$ of resolved baseline modes returned by the initial fit. To move from parameters to matched modes, we then define the matching map $\mathcal{M}$ that stacks the $K_0$ component means closest to their respective targets $u_{1:K_0}$ (see Suppl. S1.6.1 for more details). The following guarantees only require a local separation margin.

Assumption 5.2 (Local separation for matching). There exists $\delta > 0$ such that, near $\Lambda^\star_{\omega_0}$, each target $u_j$ has a unique nearest component mean and the nearest index is locally constant (see Suppl. S1.6.1 for a formal definition).

When separation fails, bootstrap ellipses inflate and the stability score $s_k$ drops, which is informative about ambiguous modes (see Suppl. S1.6 for a consistency result for $s_k$).

Theorem 5.3 (Matched-mode bootstrap validity and confidence ellipses). Under Assumption 5.2, $\mathcal{M}$ is $C^1$ at $\Lambda^\star_{\omega_0}$ with Jacobian J, and as $N \to \infty$:
\[
\sqrt{N}\big(\mathcal{M}(\hat{\Lambda}_{\omega_0,N}) - \mathcal{M}(\Lambda^\star_{\omega_0})\big) \xrightarrow{D} \mathcal{N}(0, C_{\mathcal{M}}), \qquad \sqrt{N}\big(\mathcal{M}(\hat{\Lambda}^*_{\omega_0,N}) - \mathcal{M}(\hat{\Lambda}_{\omega_0,N})\big) \xrightarrow{D} \mathcal{N}(0, C_{\mathcal{M}}),
\]
where $C_{\mathcal{M}} = J W_{\omega_0} J^\top$. For each matched mode j, the marginal covariance is $C_j$, so percentile (and, with a consistent $\hat{C}_j$, studentised) bootstrap ellipses are asymptotically valid.

Finally, we allow inexact baseline and bootstrap fits. Let $\tilde{\Lambda}_{\omega_0,N}$ and $\tilde{\Lambda}^*_{\omega_0,N}$ denote any approximate solutions that are first-order stationary up to $o_P(N^{-1/2})$ and $o_{P^*}(N^{-1/2})$, respectively, that is, $\|\nabla_\Lambda \hat{\mathcal{L}}_{\omega_0,N}(\tilde{\Lambda}_{\omega_0,N})\| = o_P(N^{-1/2})$ and $\|\nabla_\Lambda \hat{\mathcal{L}}^*_{\omega_0,N}(\tilde{\Lambda}^*_{\omega_0,N})\| = o_{P^*}(N^{-1/2})$.

Proposition 5.4 (Robustness to inexact maximisation). The conclusions of Theorems 5.1 and 5.3 hold with $\tilde{\Lambda}_{\omega_0,N}$ and $\tilde{\Lambda}^*_{\omega_0,N}$ in place of $\hat{\Lambda}_{\omega_0,N}$ and $\hat{\Lambda}^*_{\omega_0,N}$.

6 Simulation studies

We show that GERVE recovers modes and yields useful clusterings without tuning K. Furthermore, on simulated data, we illustrate (i) how clustering emerges as a byproduct of variational mode estimation and (ii) the consistency of mode recovery.

Figure 1: Overspecified clustering (K = 7) of a triangle Gaussian mixture sample. Data points are colored according to the component with the highest posterior responsibility in the fitted mixture. Ellipses represent the component covariances. Left: GERVE returns 3 effective clusters (the component with the dashed trajectory has vanishing weight; the remaining six components form three clusters by grouping equivalent components). Right: GMM-EM returns a partition into 7 clusters.

6.1 Clustering example

To illustrate the clustering properties of GERVE, we simulate N = 6000 points from a 2D, 3-component Gaussian mixture with means at the nodes of an equilateral triangle and isotropic covariance $\sigma^2 I$, $\sigma^2 = 0.25$. The data exhibit J = 3 high-density regions that we aim to recover. Details on the setup and a density plot can be found in Supplementary Material S4.1. We run GERVE with Gaussian mixtures (full covariances) under two regimes: matched K = 3 and overcomplete K = 7. To showcase GERVE's robustness to component-mean initialisation, we initialise means in $[-2, 0]^2$. We start with broad covariances and equal weights. We use a continuously decreasing annealing schedule and compensate with step size $\rho_t \propto \omega_t^{-0.7}$ (full hyperparameters and schedules in Suppl. S4.1). After training, we assign points by highest responsibility and prune weights below $10^{-3}$, as sketched below. Baselines are GMM-EM (Gaussian mixture model fit by the EM algorithm) with full covariances and k-means, for $K \in \{3, 7\}$.

Figure 1 shows the obtained partitions when K = 7. With this overcomplete mixture, GERVE still yields 3 effective clusters: redundant components coalesce at modes or are pruned, whereas GMM-EM returns 7 clusters, as does k-means (Suppl. S4.1, Fig. S3). Because GERVE targets modal structure rather than a clustering likelihood, it is more robust to over-specification. With K = 3, all methods recover 3 clusters (Suppl. S4.1, Fig. S3). In Supplementary Material S5, we assess robustness against classical methods with a benchmark on standard UCI datasets. We also provide an ablation study and report runtimes.
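The assignment step is standard; a minimal sketch (function name illustrative) computes responsibilities and prunes low-weight components before assigning:

```python
import numpy as np
from scipy.stats import multivariate_normal

def assign_clusters(X, weights, means, covs, prune_tol=1e-3):
    """Posterior responsibilities r_ik proportional to pi_k N(x_i | mu_k, Sigma_k);
    components with weight below prune_tol are dropped before assignment."""
    keep = np.asarray(weights) > prune_tol
    ws = np.asarray(weights)[keep]
    ws = ws / ws.sum()
    comps = [(m, C) for m, C, k in zip(means, covs, keep) if k]
    dens = np.stack([w * multivariate_normal(m, C).pdf(X)
                     for w, (m, C) in zip(ws, comps)], axis=1)  # (N, K_kept)
    resp = dens / dens.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), resp
```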
6.2 Mode-finding performance

We use the same triangle setting as previously, with sharper components ($\sigma^2 = 0.1$), so that the true global modes practically coincide with the means. Over $n_{\mathrm{rep}} = 100$ replicates, each method outputs K mode estimates. We report three metrics (see Suppl. S4.2): (i) Mode recovery ($MR_\epsilon$), the number of true modes recovered within a tolerance $\epsilon$; we set $\epsilon = 10^{-2}$; (ii) Hungarian matching sum ($HM$), the minimum-assignment cost between true and estimated modes; (iii) Nearest-neighbour sum ($NN$), the aggregate distance from each estimate to the closest mode.

We run GERVE (Gaussian mixture with full covariances) with $K \in \{3, 7\}$ and a vanishing annealing schedule. Baselines are Gaussian mean-shift with fixed-point updates, whose found modes are aggregated over K initialisations, and the Feature Significance method via the feature R package, which detects significant curvature points and then estimates modes as the centroids of k-means. For each sample size $N \in \{2^{10}, 2^{12}, \ldots, 2^{20}\}$ and value of K, we perform a light hyperparameter grid search (Suppl. S4.2). We average $MR_\epsilon$ and $NN$, and report $HM$ medians with 95% confidence intervals.

Figure 2 summarises performance with respect to sample size N for $K \in \{3, 7\}$. GERVE's $MR_\epsilon$ increases towards the number of true modes and $HM$ decreases as N grows, illustrating the consistency of mode recovery (this matches the conclusions of Thm. 3.1). With K = 3, GERVE's $NN$ is comparable to that of mean-shift and Feature Significance, showcasing its mode-location accuracy. With K = 7, GERVE improves $MR_\epsilon$ and $HM$ while keeping $NN$ competitive. This exhibits GERVE's robustness to overspecification, whereas Feature Significance is degraded. A broader sweep over $K \in \{3, \ldots, 7\}$ appears in Supplementary Material S4.2.

Figure 2: Mode estimation vs. sample size N for the triangle mixture (I = 3) example, comparing GERVE, mean-shift, and Feature Significance at $K \in \{3, 7\}$ on the $MR_\epsilon$, $HM$, and $NN$ metrics. Curves: means ($MR_\epsilon$, $NN$) or medians ($HM$) over $n_{\mathrm{rep}} = 100$; bands: 95% confidence intervals.

To summarise the key aspects highlighted in this simulation study: (i) clustering emerges from mode estimation, with robustness to overspecification; (ii) for mode estimation, GERVE is statistically consistent and competitive in location accuracy, and mild overspecification helps GERVE where it hinders the other baselines.

7 Application: UK road collision hotspot detection

Identifying collision hotspots (spatial intensity modes) and quantifying their uncertainty for intervention planning is crucial in road safety studies. Count-based metrics and KDE are common (Xie and Yan, 2008; Okabe et al., 2009). Spatial scans add significance testing and Bayesian models add uncertainty, but at the cost of parametric assumptions and heavier computation (Kulldorff, 1997; Aguero-Valverde and Jovanis, 2006). GERVE offers a likelihood-free alternative that estimates mode locations with calibrated localisation uncertainty.
We analyse UK STATS19 collisions from the Department for Transport Road Safety Data portal (www.gov.uk/government/statistics/road-safety-data), restricting to A-roads in Greater London (longitudes $[-0.54, 0.33]$, latitudes $[51.28, 51.70]$) with severity labelled "fatal" or "serious", from 2020 to 2024. We fit an overcomplete mixture, targeting 10 hotspots and using K = 20 components to enable overcompleteness, and resolve modes by pruning and merging (details in Suppl. S6). The stopping temperature $\omega^\dagger$ is chosen at the elbow, the largest ω where the resolved-mode count peaks (Suppl. S6, Fig. S10). At $\omega^\dagger$, we compute L = 500 bootstrap resamples, match bootstrap modes to the baseline by assignment with adaptive gates (Suppl. S6), and project to a metric system (EPSG:27700). For each mode k, we report a stability score $s_k$ and a 95% confidence ellipse from the empirical covariance of matched centres. Under the fixed-temperature theory in Section 5, the ellipses have nominal coverage and the stability score $s_k$ (eq. (7)) estimates the recovery probability.

Figure 3 maps the 2020–2024 hotspots with 95% ellipses and $s_k$ in the zoomed-in window $[-0.15, 0.15] \times [-0.075, 0.075]$. High-stability locations ($s_k \geq 0.7$, an arbitrary threshold) such as Shoreditch (ID: 12) and Elephant & Castle (ID: 1) have ellipses of order 100–150 m, which supports intersection-level action. Medium-stability locations ($0.4 \leq s_k < 0.7$) such as Aldgate (ID: 8) and Clapham HS (ID: 16) show wider ellipses that span multiple junctions, which suggests area-wide measures. Table S5 in Supplementary Material S6 lists the entries, ranked by stability score.

Figure 3: Baseline collision hotspots identified by GERVE for the Greater London area between 2020 and 2024, with stability scores. The normalised coordinates are bounded within the $[-0.7, 0.7] \times [-0.35, 0.35]$ box (normalised window), but all hotspots lie in the $[-0.15, 0.15] \times [-0.075, 0.075]$ box (zoomed-in window); see Supplementary Material S6, Figure S12, for the full study area. Left: hotspot IDs, for reference to Table S5, Supplementary Material S6. Right: 95% confidence ellipses, in normalised coordinates.

To compare like-for-like, we use mean-shift, operating on the same object as GERVE. With a flat kernel and a bandwidth sweep around a data-driven reference $h_0$ (see Suppl. S6), mean-shift enumerates many local maxima at small bandwidths and merges aggressively at larger ones. In the zoomed-in window, mean-shift at $0.05\,h_0$ yields 122 centres, while GERVE reports 17. The contrast reflects design: entropy-regularised mixtures impose global competition for support, so weaker peaks are absorbed as the temperature falls. This yields a coarser and more stable partition that aligns with major junctions. A visual comparison appears in Supplementary Material S6, Figure S14.

For temporal stability, we fit GERVE on 2014–2019 and 2020–2024 and declare a hotspot persistent if its 95% ellipses overlap across periods.
Eight of the 17 hotspots persist, which indicates structural risk that is stable across years. Non-persistent hotspots (Piccadilly Circus, Gunter Grove, Clapham HS) may reflect temporally varying factors such as pandemic-related traffic shifts, local construction, or policy changes. Supplementary Material S6, Figure S16, overlays the two periods.

Stability scores and ellipses can be used to target actions. A high $s_k$ with a tight ellipse supports redesign at a named junction, whereas a moderate $s_k$ with a wide ellipse reveals corridor-level risks, which are more suitable for monitoring.

8 Conclusion

GERVE is a likelihood-free variational approach to mode estimation and modal clustering. It optimises an entropy-regularised objective with natural gradients under annealing, fitting mixtures that concentrate on high-density regions and induce clusters without density estimation. As the temperature goes to zero, components align with population modes and inherit local curvature. At fixed temperature, empirical maximisers exist, are consistent, and satisfy a central limit theorem. A matched-mode bootstrap procedure yields per-mode confidence sets. Empirically, GERVE recovers modal structure and adapts the effective number of groups as temperature decreases, with redundant components merging or vanishing.

Diagonal-covariance variants preserve most of the behaviour at lower cost, and an ablation study shows that annealing, moderate component overspecification, and short burn-in phases are useful levers. GERVE is most suitable when uncertainty in mode locations is important, when the cluster count is misspecified, or when modal clustering is preferable to partition-based grouping. For problems where the number of clusters is known and computational speed is paramount, established task-specific clustering algorithms remain preferable. Three limitations merit attention. First, modal clustering may misalign with the group structure that matters in an application. Modes define clusters by attraction basins, but scientific groupings may depend on other criteria, so agreement is not guaranteed. Second, the optimisation procedure can converge to local rather than global maxima. Careful initialisation, temperature schedules, and overcomplete mixtures reduce this risk but do not eliminate it completely. Third, computation can be costly with full covariance matrices in high dimensions. In our experiments, using diagonal covariances allows one to control the cost while preserving mode separation.

The theory and experimental results position GERVE as a principled statistical framework for likelihood-free modal inference. By unifying modal clustering, mean-shift, variational inference, and annealing, GERVE provides both algorithmic tools and theoretical guarantees within one platform. Looking ahead, three directions are natural. First, develop data-driven annealing schedules with principled stopping. Second, establish pathwise guarantees for empirical maximisers and for the validity of the bootstrap procedure under annealing. Third, characterise mode-component correspondence and label stability as temperature varies. Pursuing these directions will link the annealing procedure to stronger statistical guarantees and broaden the method's practical reach.

Acknowledgements

This work is supported by the French ANR and JST-CREST Bayes-Duality grant (ANR-21-JSTM-0001).
References

Aguero-Valverde, J. and Jovanis, P. P. (2006). Spatial analysis of fatal and injury crashes in Pennsylvania. Accident Analysis & Prevention, 38(3):618–625.

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.

Ameijeiras-Alonso, J., Crujeiras, R. M., and Rodriguez-Casal, A. (2021). Multimode: an R package for mode assessment. Journal of Statistical Software, 97:1–32.

Arias-Castro, E., Mason, D., and Pelletier, B. (2016). On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. Journal of Machine Learning Research, 17:1–28.

Borkar, V. S. (2008). Stochastic Approximation: a Dynamical Systems Viewpoint. Springer.

Carreira-Perpiñán, M. Á. (2006). Acceleration strategies for Gaussian mean-shift image segmentation. Conference on Computer Vision and Pattern Recognition, 1:1160–1167.

Carreira-Perpiñán, M. Á. (2007). Gaussian mean-shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767–776.

Chacón, J. E. (2015). A population background for nonparametric density-based clustering. Statistical Science, 30(4):518–532.

Chacón, J. E. (2020). The modal age of statistics. International Statistical Review, 88(1):122–141.

Chernoff, H. (1964). Estimation of the mode. Annals of the Institute of Statistical Mathematics, 16(1):31–41.

Comaniciu, D. and Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2009). Introduction to Algorithms. MIT Press, 3rd edition.

Dalenius, T. (1965). The mode–A neglected statistical parameter. Journal of the Royal Statistical Society Series A: General, 128(1):110–117.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.

Donsker, M. D. and Varadhan, S. R. S. (1976). Asymptotic evaluation of certain Markov process expectations for large time—III. Communications on Pure and Applied Mathematics, 29(4):389–461.

Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press.

Fukunaga, K. and Hostetler, L. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40.

Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L. (2016). Nonparametric inference for density modes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(1):99–126.

Hwang, C.-R. (1980). Laplace's method revisited: weak convergence of probability measures. The Annals of Probability, 8(6):1177–1182.

Jerfel, G., Wang, S., Wong-Fannjiang, C., Heller, K. A., Ma, Y., and Jordan, M. I. (2021). Variational refinement for importance sampling using the forward Kullback–Leibler divergence. Uncertainty in Artificial Intelligence, pages 1819–1829.

Khan, M. E. and Rue, H. (2023). The Bayesian learning rule. Journal of Machine Learning Research, 24(281):1–46.

Kronmal, R. and Tarter, M. (1968). The estimation of probability densities and cumulatives by Fourier series methods. Journal of the American Statistical Association, 63(323):925–952.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics–Theory and Methods, 26(6):1481–1496.

Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer.

Le Minh, T., Arbel, J., Möllenhoff, T., Khan, M. E., and Forbes, F. (2025). Natural variational annealing for multimodal optimization. arXiv preprint arXiv:2501.04667.

Lin, W., Khan, M. E., and Schmidt, M. (2019). Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. International Conference on Machine Learning.

Liu, B. and Ghosh, S. K. (2020). On empirical estimation of mode based on weakly dependent samples. Computational Statistics & Data Analysis, 152:107046.

Loftsgaarden, D. O. and Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate density function. The Annals of Mathematical Statistics, 36(3):1049–1051.

McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. John Wiley & Sons.

Menardi, G. (2016). A review on modal clustering. International Statistical Review, 84(3):413–433.

Noble, M., Grenioux, L., Gabrié, M., and Durmus, A. O. (2025). Learned reference-based diffusion sampler for multi-modal distributions. International Conference on Learning Representations.

Okabe, A., Satoh, T., and Sugihara, K. (2009). A kernel density estimation method for networks, its computational method and a GIS-based tool. International Journal of Geographical Information Science, 23(1):7–32.

Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076.

Romano, J. P. (1988). On weak convergence and optimality of kernel density estimates of the mode. The Annals of Statistics, 16(2):629–647.

Sager, T. W. (1978). Estimation of a multivariate mode. The Annals of Statistics, 6(4):802–812.

Samworth, R. J. (2018). Recent progress in log-concave density estimation. Statistical Science, 33(4):493–509.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons.

Soletskyi, R., Gabrié, M., and Loureiro, B. (2025). A theoretical perspective on mode collapse in variational inference. Machine Learning: Science and Technology, 6(2):025056.

Tsybakov, A. B. (1990). Recursive estimation of the mode of a multivariate distribution. Problems of Information Transmission, 26(1):38–45.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: with Applications to Statistics. Springer.

Venter, J. H. (1967). On estimation of the mode. The Annals of Mathematical Statistics, 38(5):1446–1455.

Wu, D., Wang, L., and Zhang, P. (2019). Solving statistical mechanics using variational autoregressive networks. Physical Review Letters, 122(8):080602.

Xie, Z. and Yan, J. (2008). Kernel density estimation of traffic accidents in a network space. Computers, Environment and Urban Systems, 32(5):396–406.

Zeidler, E. (1995). Applied Functional Analysis: Applications to Mathematical Physics, volume 108. Springer.
Supplementary material

S1 Theory details
  S1.1 Notation and setup
  S1.2 Truncation equivalence and compact sets
  S1.3 Lexicographic order on Θ
  S1.4 Modal basins–responsibility cells agreement
  S1.5 Natural-gradient convergence
  S1.6 Uncertainty quantification for mode estimation
    S1.6.1 Notations and separation assumption
    S1.6.2 Consistency of stability scores
    S1.6.3 Handling local optima
S2 Proofs
  S2.1 Preliminaries: Empirical process properties
  S2.2 Proof of Theorem 2.1
  S2.3 Proof of Theorem 3.1
    S2.3.1 Existence of a population maximiser
    S2.3.2 Consistency of the empirical maximisers
    S2.3.3 Asymptotic normality of the empirical maximisers
  S2.4 Proofs for Sections 5 and S1.6
    S2.4.1 Preliminaries: Bounded-Lipschitz metric
    S2.4.2 Technical lemmas
    S2.4.3 Proofs of lemmas
    S2.4.4 Proof of Theorem 5.1
    S2.4.5 Proof of Proposition 5.4
    S2.4.6 Proof of Theorem 5.3
    S2.4.7 Proof of Proposition S1.13
  S2.5 Proofs for Section S1.2
    S2.5.1 Proof of Proposition S1.1
    S2.5.2 Proof of Theorem S1.2
S3 Algorithms and variants
  S3.1 Natural-gradient update rule
  S3.2 Fixed-covariance Gaussian case
  S3.3 Gaussian mixture case
  S3.4 Computational considerations
S4 Simulations and benchmark
  S4.1 Details on the clustering example of Section 6.1
  S4.2 Mode estimation
S5 Clustering performance on UCI datasets
  S5.1 Benchmark
  S5.2 Ablation study
  S5.3 Runtime
S6 Details on the UK collision case study

S1 Theory details

S1.1 Notation and setup

We work on a domain $S \subset \mathbb{R}^d$ with finite Lebesgue measure. Bold symbols denote vectors/matrices (e.g., x, µ, Σ).
Given a measurable $f : S \to \mathbb{R}$ and a temperature $\omega > 0$, define the normalised Gibbs measure
\[
g_\omega(\mathrm{d}x) = \frac{\exp(f(x)/\omega)}{Z_\omega(f)}\,\mathrm{d}x, \qquad \text{with partition function } Z_\omega(f) = \int_S \exp(f(y)/\omega)\,\mathrm{d}y,
\]
assumed finite when invoked. We approximate $g_\omega$ on S with a tractable q (restricted to S): (i) a single Gaussian $q_\lambda = \mathcal{N}(\mu, \Sigma)$; (ii) a K-component Gaussian mixture
\[
q_\Lambda(x) = \sum_{k=1}^K \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^K \pi_k = 1, \quad \pi_k > 0.
\]
Mixture weights are parameterised by logits $v_k = \log(\pi_k/\pi_K)$. Unless stated otherwise, covariance eigenvalues are constrained to $[\sigma^2_{\min}, \sigma^2_{\max}]$.

For any density q, $\mathbb{E}_q[\cdot]$ denotes the expectation under q and $\mathrm{KL}(q \,\|\, p)$ the Kullback–Leibler divergence. For q on S, define the entropy on S,
\[
H_S(q) = -\int_S q(x)\log q(x)\,\mathrm{d}x,
\]
and the variational objective
\[
\mathcal{L}_\omega(q) = \mathbb{E}_q[f(X)] + \omega H_S(q).
\]
By the classical variational identity,
\[
\mathcal{L}_\omega(q) = \omega \log Z_\omega(f) - \omega\,\mathrm{KL}(q \,\|\, g_\omega),
\]
so $g_\omega$ uniquely maximises $\mathcal{L}_\omega$ whenever $Z_\omega(f) < \infty$.

For variational parameters $\vartheta$, let $F(\vartheta)$ be the Fisher information matrix. The natural gradient is
\[
\tilde{\nabla}_\vartheta \mathcal{L}_\omega(\vartheta) = F(\vartheta)^{-1}\nabla_\vartheta \mathcal{L}_\omega(\vartheta).
\]
If $q_\vartheta$ is in an exponential family with sufficient statistic T(X) (or in an MCEF for mixtures; Lin et al., 2019), the natural gradient equals the standard gradient with respect to the expectation parameters $M = \mathbb{E}_{q_\vartheta}[T(X)]$:
\[
\tilde{\nabla}_\vartheta \mathcal{L}_\omega(\vartheta) = \nabla_M \mathcal{L}_\omega(\vartheta).
\]

Let $\{x^\star_i\}_{i=1}^I$ denote the set of global maximisers ("modes") of f (or of p when f = p) on S. At a nondegenerate mode $x^\star_i$ (i.e. $\nabla^2 f(x^\star_i) \prec 0$), define the positive-definite curvature
\[
H_i := -\nabla^2 f(x^\star_i) \succ 0.
\]
For a symmetric matrix A, we write $\mathrm{eig}_{\min}(A)$, $\mathrm{eig}_{\max}(A)$ for its extremal eigenvalues. Norms are Euclidean for vectors and operator norms for matrices unless noted. For a set $A \subset \mathbb{R}^d$, $\mathrm{int}(A)$ denotes its interior, $\partial A$ its boundary, and $A^c := \mathbb{R}^d \setminus A$ its complement. For an element $x \in \mathbb{R}^d$, we write $\mathrm{dist}(x, A) := \inf_{a \in A} \|x - a\|_2$. For $r > 0$, let $B(x, r)$ denote the ball centred at x with radius r.

S1.2 Truncation equivalence and compact sets

In the optimisation setting (Le Minh et al., 2025), the target f is known or directly evaluable, so $\mathbb{E}_{q_\vartheta}[f(X)]$ is computable. In GERVE, f is replaced by an unknown density p, and we observe only i.i.d. samples $X_{1:N} \sim p$. We suppose that p is smooth, bounded, and supported on a bounded domain $\mathrm{supp}(p) \subset S$. Using symmetry,
\[
\mathcal{L}_\omega(q_\vartheta) = \int_S p(x)\,q_\vartheta(x)\,\mathrm{d}x + \omega H_S(q_\vartheta) = \mathbb{E}_p[q_\vartheta(X)] + \omega H_S(q_\vartheta),
\]
so the first term is an expectation under p. Let $P_N := \frac{1}{N}\sum_{i=1}^N \delta_{X_i}$ be the empirical measure. A natural plug-in estimator replaces $\mathbb{E}_p[q_\vartheta(X)]$ by $\mathbb{E}_{P_N}[q_\vartheta(X)]$, yielding the empirical variational objective
\[
\hat{\mathcal{L}}_{\omega,N}(\vartheta) = \mathbb{E}_{P_N}[q_\vartheta(X)] + \omega H_S(q_\vartheta) = \frac{1}{N}\sum_{i=1}^N q_\vartheta(X_i) + \omega H_S(q_\vartheta). \tag{S1}
\]
This is the learning objective optimised by GERVE.

Optimisation over untruncated Gaussians. Since p is supported on S rather than on $\mathbb{R}^d$, the corresponding Gibbs measure is defined only on S, and one might a priori restrict q to S. Here $q_\lambda$ denotes the untruncated Gaussian on $\mathbb{R}^d$ with natural parameter λ. Define
\[
\psi_S(\lambda) := \int_S q_\lambda(x)\,\mathrm{d}x.
\]
The S-truncated version is $q^S_\lambda(x) := q_\lambda(x)\mathbf{1}_S(x)/\psi_S(\lambda)$. Optimising directly over $q^S_\lambda$ is cumbersome because $\psi_S(\lambda)$ varies with the parameter.
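For intuition about the size of the outside mass $\varepsilon(\lambda) = 1 - \psi_S(\lambda)$ introduced below, the following sketch computes it in closed form in the special case of a diagonal covariance and an axis-aligned box $S = \prod_i [a_i, b_i]$, where $\psi_S$ factorises across coordinates; this is an illustrative special case, not the general construction used in the proofs.

```python
import numpy as np
from scipy.stats import norm

def outside_mass(mu, sigmas, a, b):
    """epsilon(lambda) = 1 - psi_S(lambda) for a Gaussian with diagonal
    covariance diag(sigmas**2) and a box S = prod_i [a_i, b_i]:
    psi_S factorises as a product of one-dimensional Gaussian masses."""
    per_coord = norm.cdf((b - mu) / sigmas) - norm.cdf((a - mu) / sigmas)
    return 1.0 - np.prod(per_coord)

# A mean several standard deviations inside the box leaves a tiny mass outside.
print(outside_mass(mu=np.zeros(2), sigmas=np.ones(2),
                   a=-6 * np.ones(2), b=6 * np.ones(2)))  # ~ 4e-9
```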
The next result (proven in Section S2.5.1) shows that, on a large $\mathcal{S}$, optimising over the untruncated family while integrating over $\mathcal{S}$ incurs only a controlled error on a compact parameter set. To define such a compact parameter space for Gaussians, fix a mean bound $\mu_{\max} < \infty$ and eigenvalue bounds $0 < \sigma^2_{\min} < \sigma^2_{\max} < \infty$, and consider the compact set of natural parameters
$$ \Theta := \Big\{ \lambda = \big(\Sigma^{-1}\mu,\ -\tfrac12 \Sigma^{-1}\big) : \mu \in \mathbb{R}^d,\ \sigma^2_{\min} I \preceq \Sigma \preceq \sigma^2_{\max} I,\ \|\mu\|_\infty \le \mu_{\max} \Big\}. $$
For $\lambda \in \Theta$, let $\mu(\lambda)$ and $\Sigma(\lambda)$ be the mean and covariance matrix encoded by $\lambda$. For notational convenience, let
$$ F_\omega(\lambda) := \mathcal{L}_\omega(q_\lambda), \qquad G_\omega(\lambda) := \mathcal{L}_\omega(q^{\mathcal{S}}_\lambda), \qquad \lambda \in \Theta. $$
Let $\varepsilon(\lambda) := 1 - \psi_{\mathcal{S}}(\lambda)$, $\varepsilon_{\max} := \sup_{\lambda \in \Theta} \varepsilon(\lambda) < 1$, and $p_{\max} := \sup_{x \in \mathcal{S}} p(x)$. In the following proposition, we bound $|G_\omega(\lambda) - F_\omega(\lambda)|$ uniformly on $\Theta$.

Proposition S1.1 (Truncation gap). Let $C_q = \sup_{\lambda \in \Theta} \sup_{x \in \mathcal{S}} |\log q_\lambda(x)| < \infty$. Define, for $t \in [0, 1)$,
$$ \delta_\omega(t) := p_{\max}\, t + \omega\, |\log(1 - t)| + \omega\, C_q\, t. $$
Then for all $\lambda \in \Theta$,
$$ |G_\omega(\lambda) - F_\omega(\lambda)| \le \delta_\omega\big(\varepsilon(\lambda)\big), \tag{S2} $$
and $\sup_{\lambda \in \Theta} |G_\omega(\lambda) - F_\omega(\lambda)| \le \delta_\omega(\varepsilon_{\max})$.

This bound implies that if $\mu$ stays away from $\partial\mathcal{S}$, the error from using untruncated Gaussians is negligible. We next derive a tail bound which, combined with an argmax-stability argument, yields the following theorem, proven in Section S2.5.2.

Theorem S1.2 (Truncation equivalence on a bounded support). Let $\delta_\omega(\cdot)$ be as in Proposition S1.1. For any fixed $\omega > 0$, if there is a margin $M_0 > 0$ such that $\mathrm{dist}(\mu(\lambda), \partial\mathcal{S}) \ge M_0 \sqrt{\mathrm{eig}_{\max}(\Sigma(\lambda))}$ for all $\lambda \in \Theta$, then for some $C > 0$,
$$ \varepsilon_{\max} \le C e^{-M_0^2/2}, \qquad \sup_{\lambda \in \Theta} |G_\omega(\lambda) - F_\omega(\lambda)| \le \delta_\omega(\varepsilon_{\max}). $$
Moreover, let $m_\omega(\rho) := F_\omega(\lambda^\star_\omega) - \sup_{\|\lambda - \lambda^\star_\omega\| \ge \rho} F_\omega(\lambda)$ be the value gap. Consequently, if $m_\omega(\rho) > 0$ for some $\rho > 0$, then every maximiser of $G_\omega$ lies in $B(\lambda^\star_\omega, \rho)$ whenever $\delta_\omega(\varepsilon_{\max}) < m_\omega(\rho)/3$. In particular, if the maximiser of $F_\omega$ is unique and in $\mathrm{int}(\Theta)$, then under the same conditions, the maximiser of $G_\omega$ is also unique and $\arg\max_{\lambda \in \Theta} G_\omega = \lambda^\star_\omega$.

Case of mixtures. Theorem S1.2 is stated for single Gaussians. For mixtures, we apply Proposition S1.1 componentwise under the same spectral bounds and control the entropy term via per-component bounds. Uniform control of the mixture log-density $\log q_\Lambda$ additionally requires avoiding degenerate behaviours (e.g., vanishing weights with unbounded parameters), which can be done by staying on a compact set of "natural" parameters. Indeed, although mixtures are not an exponential family, they lie in the minimal conditional exponential family (MCEF; Lin et al., 2019), which admits an analogous natural representation. We consider the compact set given by equation (4), which we recall here:
$$ \Theta_K := \Big\{ \Lambda = (v_1, \ldots, v_{K-1}, \lambda_1, \ldots, \lambda_K) : |v_k| \le v_{\max},\ \lambda_k \in \Theta \Big\}, $$
where $v_k = \log(\pi_k/\pi_K)$. To preserve the compactness induced by $\Theta$, we impose additional box constraints $|v_k| \le v_{\max}$ on the log-ratios, which implies a uniform floor on the weights:
$$ \pi_k \ge \frac{1}{1 + (K-1)\, e^{2 v_{\max}}} > 0 \quad \text{for all } k \in [K]. $$
However, choosing
$$ v_{\max} \ge \tfrac12 \log\Big( \tfrac1\varepsilon - (K-1) \Big) \tag{S3} $$
allows a designated subset of components to have weights as small as $\varepsilon$ simultaneously while respecting the box constraints. As $v_{\max} \to \infty$, the set of feasible weights approaches the simplex, so the constraint becomes non-restrictive in practice. Typically, setting $v_{\max} = 6$ already suffices to allow weights as small as $10^{-5}$.
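The arithmetic behind these box constraints is easy to check directly; the lines below verify the weight floor and the bound (S3) for illustrative values ($K = 10$ components, $v_{\max} = 6$).

```python
import numpy as np

# Weight floor implied by |v_k| <= v_max: pi_k >= 1 / (1 + (K-1) e^{2 v_max}).
K, v_max = 10, 6.0
floor = 1.0 / (1.0 + (K - 1) * np.exp(2 * v_max))
print(floor)                              # ~6.8e-07, so weights below 1e-5 are feasible

# Bound (S3): v_max >= 0.5 * log(1/eps - (K-1)) makes a weight of eps feasible.
eps = 1e-5
print(0.5 * np.log(1.0 / eps - (K - 1)))  # ~5.76 <= 6, consistent with v_max = 6
```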
In brief, for mixtures, we follow the same convention as above: we optimise over untruncated mixtures while integrating over $\mathcal{S}$. Under the margin condition $\varepsilon(\mu_k, \Sigma_k) = o(1)$ (e.g., $\mathrm{dist}(\mu_k, \partial\mathcal{S}) \ge M_0 \sqrt{\mathrm{eig}_{\max}(\Sigma_k)}$ with $M_0$ large), replacing truncated by untruncated components perturbs $\widehat{\mathcal{L}}_{\omega,N}$ and $\mathcal{L}_\omega$ by $o(1)$ uniformly, and so leaves maximisers unchanged asymptotically.

S1.3 Lexicographic order on Θ

Fix a one-to-one coordinate map $\tau : \Theta \to \mathbb{R}^{d_\lambda}$. For $a, b \in \mathbb{R}^{d_\lambda}$ we write $a \prec_{\mathrm{lex}} b$ if there exists $j \in \{1, \ldots, d_\lambda\}$ such that $a_i = b_i$ for all $i < j$ and $a_j < b_j$. For $\lambda, \lambda' \in \Theta$ we write $\lambda \prec_{\mathrm{lex}} \lambda'$ if $\tau(\lambda) \prec_{\mathrm{lex}} \tau(\lambda')$. We define the lexicographically ordered parameter space for $K$-component mixtures by
$$ \Theta^{\mathrm{lex}}_K = \Big\{ \Lambda = (v_1, \ldots, v_{K-1}, \lambda_1, \ldots, \lambda_K) \in \Theta_K : \lambda_1 \prec_{\mathrm{lex}} \lambda_2 \prec_{\mathrm{lex}} \cdots \prec_{\mathrm{lex}} \lambda_K \Big\}. $$
Let $\mathfrak{S}_K$ denote the permutation group on $\{1, \ldots, K\}$. The mixture model is invariant under relabelling of components, in the sense that for every $\sigma \in \mathfrak{S}_K$ and every $\Lambda = (v, \lambda_1, \ldots, \lambda_K) \in \Theta_K$ there exists a relabelled parameter
$$ \sigma \cdot \Lambda := \big( v^\sigma, \lambda_{\sigma(1)}, \ldots, \lambda_{\sigma(K)} \big) \in \Theta_K $$
such that the mixing weights are permuted accordingly, $\pi(v^\sigma) = (\pi_{\sigma(1)}(v), \ldots, \pi_{\sigma(K)}(v))$, and $f(\cdot\,; \sigma \cdot \Lambda) = f(\cdot\,; \Lambda)$.

Assume additionally that the component parameters are pairwise distinct, that is, $\lambda_i \ne \lambda_j$ for $i \ne j$. Then, for every $\Lambda \in \Theta_K$, there exists a unique permutation $\sigma_\Lambda \in \mathfrak{S}_K$ such that
$$ \Lambda_{\mathrm{ord}} := \sigma_\Lambda \cdot \Lambda \in \Theta^{\mathrm{lex}}_K, \qquad f(\cdot\,; \Lambda_{\mathrm{ord}}) = f(\cdot\,; \Lambda). $$
In particular, whenever a maximiser (or stationary point) exists over $\Theta_K$ for a permutation-invariant criterion, there is an equivalent maximiser (or stationary point) lying in $\Theta^{\mathrm{lex}}_K$. Thus, without loss of generality, one may restrict uniqueness arguments to $\Theta^{\mathrm{lex}}_K$. Furthermore, suppose that the model is identifiable up to label switching on $\Theta_K$, namely,
$$ f(\cdot\,; \Lambda) = f(\cdot\,; \Lambda') \implies \exists\, \sigma \in \mathfrak{S}_K : \Lambda' = \sigma \cdot \Lambda. $$
Then the parametrisation is identifiable on $\Theta^{\mathrm{lex}}_K$, in the sense that $\Lambda, \Lambda' \in \Theta^{\mathrm{lex}}_K$ and $f(\cdot\,; \Lambda) = f(\cdot\,; \Lambda')$ imply $\Lambda = \Lambda'$.

To justify the existence and uniqueness of $\sigma_\Lambda$, observe first that because $\tau$ is one-to-one and $\prec_{\mathrm{lex}}$ is a strict total order on $\mathbb{R}^{d_\lambda}$, the induced relation $\prec_{\mathrm{lex}}$ on $\Theta$ is also a strict total order. In particular, for any $\lambda \ne \lambda'$ in $\Theta$, exactly one of $\lambda \prec_{\mathrm{lex}} \lambda'$ or $\lambda' \prec_{\mathrm{lex}} \lambda$ holds. Fix $\Lambda = (v, \lambda_1, \ldots, \lambda_K) \in \Theta_K$ and assume that $\lambda_i \ne \lambda_j$ for $i \ne j$. Define the rank of component $i$ by
$$ r(i) := 1 + \big| \{ j \in \{1, \ldots, K\} : \lambda_j \prec_{\mathrm{lex}} \lambda_i \} \big|. $$
Since the $\lambda_i$ are pairwise distinct and the order is total, the ranks $\{r(i)\}_{i=1}^K$ are exactly $\{1, \ldots, K\}$. Hence there exists a unique permutation $\sigma_\Lambda \in \mathfrak{S}_K$ such that $r(\sigma_\Lambda(k)) = k$ for $k = 1, \ldots, K$, or equivalently,
$$ \lambda_{\sigma_\Lambda(1)} \prec_{\mathrm{lex}} \lambda_{\sigma_\Lambda(2)} \prec_{\mathrm{lex}} \cdots \prec_{\mathrm{lex}} \lambda_{\sigma_\Lambda(K)}. $$
Define the ordered representative $\Lambda_{\mathrm{ord}} = \sigma_\Lambda \cdot \Lambda = (v^{\sigma_\Lambda}, \lambda_{\sigma_\Lambda(1)}, \ldots, \lambda_{\sigma_\Lambda(K)})$. By construction, $\Lambda_{\mathrm{ord}} \in \Theta^{\mathrm{lex}}_K$. Uniqueness of $\sigma_\Lambda$ follows because the sorting permutation of a list of distinct elements under a strict total order is unique (see, e.g., Cormen et al., 2009, Sec. 8.1).
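Concretely, $\Lambda_{\mathrm{ord}}$ is obtained by a plain lexicographic sort of the component parameter vectors. A minimal sketch, assuming for illustration that $\tau$ simply flattens each $\lambda_k$ into a coordinate vector, permuting the weights alongside:

```python
import numpy as np

def order_components(weights, lams):
    """Return the ordered representative: components sorted lexicographically
    by their flattened parameter vectors tau(lambda_k); for pairwise-distinct
    lams the sorting permutation sigma_Lambda is unique."""
    idx = sorted(range(len(lams)), key=lambda k: tuple(lams[k]))  # tuple = lex order
    return [weights[k] for k in idx], [lams[k] for k in idx]

w = [0.2, 0.5, 0.3]
lams = [np.array([1.0, -0.5]), np.array([-2.0, 3.0]), np.array([1.0, -0.9])]
print(order_components(w, lams))
# components come out as lambda_2 < lambda_3 < lambda_1; the mixture density is unchanged
```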
By the assumed relabelling invariance, for every $\sigma \in \mathfrak{S}_K$ we have $f(\cdot\,; \sigma \cdot \Lambda) = f(\cdot\,; \Lambda)$, and in particular $f(\cdot\,; \Lambda_{\mathrm{ord}}) = f(\cdot\,; \Lambda)$. Consequently, any permutation-invariant criterion $Q$ (for instance, $Q(\Lambda) = \int \psi(x, f(x; \Lambda))\, dx$ or $Q(\Lambda) = \sum_{i=1}^n \log f(X_i; \Lambda)$, or any other function of $f(\cdot\,; \Lambda)$) satisfies $Q(\Lambda_{\mathrm{ord}}) = Q(\Lambda)$. Hence, if $\Lambda^\star$ is a maximiser of $Q$ over $\Theta_K$, then $(\Lambda^\star)_{\mathrm{ord}}$ is also a maximiser and lies in $\Theta^{\mathrm{lex}}_K$.

Finally, to prove identifiability on $\Theta^{\mathrm{lex}}_K$, let $\Lambda, \Lambda' \in \Theta^{\mathrm{lex}}_K$ satisfy $f(\cdot\,; \Lambda) = f(\cdot\,; \Lambda')$. By identifiability up to label switching, there exists $\sigma \in \mathfrak{S}_K$ such that $\Lambda' = \sigma \cdot \Lambda$, and therefore $(\lambda'_1, \ldots, \lambda'_K) = (\lambda_{\sigma(1)}, \ldots, \lambda_{\sigma(K)})$. But both $(\lambda_1, \ldots, \lambda_K)$ and $(\lambda'_1, \ldots, \lambda'_K)$ are strictly increasing under $\prec_{\mathrm{lex}}$. Since the $\lambda_k$ are distinct, the set $\{\lambda_1, \ldots, \lambda_K\}$ admits a unique strictly increasing ordering. Therefore the only permutation $\sigma$ for which $(\lambda_{\sigma(1)}, \ldots, \lambda_{\sigma(K)})$ is strictly increasing is $\sigma = \mathrm{id}$. It follows that $\lambda'_k = \lambda_k$ for all $k$, and hence $\Lambda' = \Lambda$. This shows that the parametrisation is identifiable on $\Theta^{\mathrm{lex}}_K$.

S1.4 Modal basins–responsibility cells agreement

To investigate the link between GERVE's induced clustering and modal clustering, we compare two partitions near each mode $x^\star_i$ of a $C^2$ density: the responsibility cells $\mathcal{A}_k(\theta)$ of an isotropic Gaussian mixture $q_\theta(x) = \sum_{k=1}^K \pi_k q_{\mu_k, \sigma^2}(x)$, with $\theta = (\mu_1, \ldots, \mu_K)$, and the modal basins $\{\mathcal{B}_i\}$ of $p$. We give simple, local conditions under which they agree on inner neighborhoods of the modes.

Let $\{x^\star_i\}_{i=1}^I$ be isolated modes of $p$ in $\mathrm{supp}(p) \subset \mathcal{S}$. For $r > 0$, write $B_i(r) := B(x^\star_i, r)$ for the ball centred at $x^\star_i$ with radius $r$, and $\tilde r := r/2$. Weights satisfy $\pi_k \in [\pi_{\min}, \pi_{\max}]$ with $\pi_{\min} > 0$ and $\sum_k \pi_k = 1$. For all $k \in [K]$, define the posterior component responsibilities
$$ \mathrm{resp}_k(x; \theta) = \frac{\pi_k\, q_{\mu_k, \sigma^2 I}(x)}{\sum_j \pi_j\, q_{\mu_j, \sigma^2 I}(x)} $$
and the cells $\mathcal{A}_k(\theta) = \{ x : \mathrm{resp}_k(x; \theta) \ge \mathrm{resp}_j(x; \theta)\ \forall j \in [K] \}$. For $i \in [I]$, the modal basin $\mathcal{B}_i$ is the set of all initial points $x$ such that the solution $\tilde x(t)$ of the gradient-ascent flow $\dot x = \nabla p(x)$ with $\tilde x(0) = x$ satisfies $\tilde x(t) \to x^\star_i$ as $t \to \infty$.

Assumption S1.3 (Local strong log-concavity and separation). For each $i$ there exist $r_i > 0$ and $\kappa_i > 0$ such that $-\nabla^2 \log p(x) \succeq \kappa_i I$ for all $x \in B_i(r_i)$, and the closed balls $\bar B_i(r_i) \subset \mathrm{supp}(p)$ are pairwise disjoint.

Lemma S1.4 (Basin containment). Under Assumption S1.3, $B_i(r_i) \subset \mathcal{B}_i$ for each $i$.

Proof of Lemma S1.4. Let $\tilde x(t)$ be a trajectory of $\dot x = \nabla p(x)$ with $\tilde x(0) \in B_i(r_i)$. Choose the Lyapunov function $V(x) = \|x - x^\star_i\|^2 / 2$. On $\mathrm{supp}(p)$, using $\dot x = \nabla p(x) = p(x) \nabla \log p(x)$, we have
$$ \dot V(x) = \langle \dot x,\ x - x^\star_i \rangle = p(x)\, \langle \nabla \log p(x),\ x - x^\star_i \rangle. $$
Since $\nabla \log p(x^\star_i) = \nabla p(x^\star_i) = 0$, strong log-concavity (Assumption S1.3) implies, for $x \in B_i(r_i) \setminus \{x^\star_i\}$,
$$ \langle \nabla \log p(x),\ x - x^\star_i \rangle \le -\kappa_i \|x - x^\star_i\|^2 < 0, $$
thus, with $m_i = \inf_{x \in B_i(r_i)} p(x) > 0$,
$$ \dot V(x) \le -2\kappa_i\, p(x)\, V(x) \le -2\kappa_i m_i V(x). $$
By Grönwall's inequality,
$$ V(\tilde x(t)) \le V(\tilde x(0)) \exp(-2\kappa_i m_i t), \tag{S4} $$
as long as $\tilde x(t)$ remains in $B_i(r_i)$. Finally, we prove by contradiction that $\tilde x(t) \in B_i(r_i)$ for all $t \ge 0$.
Suppose that $T_{\mathrm{out}} = \inf_{t > 0}\{ t : \tilde x(t) \notin B_i(r_i) \} < \infty$. In this case, by continuity of $\tilde x$, we have $\|\tilde x(T_{\mathrm{out}}) - x^\star_i\| = r_i$. But since $V(\tilde x(0)) < r_i^2/2$, (S4) gives, for $t \in [0, T_{\mathrm{out}})$,
$$ V(\tilde x(t)) < \frac{r_i^2}{2} \exp(-2\kappa_i m_i t) < \frac{r_i^2}{2}. $$
This implies $\|\tilde x(t) - x^\star_i\| < r_i$ for all $t \in [0, T_{\mathrm{out}})$, which contradicts $\|\tilde x(T_{\mathrm{out}}) - x^\star_i\| = r_i$. Thus $\tilde x(t) \in B_i(r_i)$ for all $t \ge 0$, and (S4) ensures $V(\tilde x(t)) \to 0$ as $t \to \infty$; hence $\tilde x(t) \to x^\star_i$ and we conclude that $\tilde x(0) \in \mathcal{B}_i$. □

Assumption S1.5 (One-per-mode configuration). There is a configuration $\theta = (\mu_1, \ldots, \mu_I)$ with exactly one centre $\mu_i \in B_i(\tilde r_i)$ for each $i$, and no other centres in $\bigcup_{i=1}^I B_i(r_i)$. Therefore, by convention, we relabel components so that centre $i$ lies in $B_i(\tilde r_i)$ and write $\mathcal{A}_i(\theta)$ for the $i$th cell.

Assumption S1.6 (Local margin). There exists $\Delta > 0$ such that for all $i$ and all $x \in B_i(\tilde r_i)$,
$$ \|x - \mu_j\| \ge \|x - \mu_i\| + \Delta \qquad \forall j \ne i. $$

Remark 1. If $\mu_i \in B_i(\tilde r_i)$ and $\mathrm{dist}\big( B_i(\tilde r_i), \{\mu_j\}_{j \ne i} \big) > 0$, then $\Delta := \inf_{x \in B_i(\tilde r_i)} \min_{j \ne i} (\|x - \mu_j\| - \|x - \mu_i\|) > 0$.

Assumption S1.7 (Scale separation). There exists $\varepsilon \in (0, 1)$ such that
$$ \sigma \le \frac{\Delta}{\sqrt{2 \log \Psi(\varepsilon)}}, \qquad \text{where } \Psi(\varepsilon) := \frac{(I-1)\, \pi_{\max}}{\varepsilon\, \pi_{\min}} > 1. $$

Theorem S1.8 (Responsibility on balls). Under Assumptions S1.5–S1.7, for each $i$,
$$ B_i(\tilde r_i) \subset \mathcal{A}_i(\theta), \qquad \mathrm{resp}_i(x; \theta) \ge (1 + \varepsilon)^{-1} \quad \forall x \in B_i(\tilde r_i). $$

Proof of Theorem S1.8. Fix $i$ and let $x \in B_i(\tilde r_i)$. By Assumption S1.6, $\|x - \mu_j\| - \|x - \mu_i\| \ge \Delta$ and $\|x - \mu_j\| + \|x - \mu_i\| \ge 2\|x - \mu_i\| + \Delta$. Combining with Assumption S1.7, this gives
$$ \frac{q_{\mu_j, \sigma^2 I}(x)}{q_{\mu_i, \sigma^2 I}(x)} = \exp\Big( -\frac{(\|x - \mu_j\| - \|x - \mu_i\|)(\|x - \mu_j\| + \|x - \mu_i\|)}{2\sigma^2} \Big) \le \exp\Big( -\frac{\Delta^2}{2\sigma^2} \Big) \le \frac{1}{\Psi(\varepsilon)}. $$
Summing with weights gives
$$ \frac{\sum_{j \ne i} \pi_j\, q_{\mu_j}}{\pi_i\, q_{\mu_i}} \le \frac{(I-1)\, \pi_{\max}}{\pi_{\min}\, \Psi(\varepsilon)} = \varepsilon, $$
hence $\mathrm{resp}_i(x; \theta) \ge (1 + \varepsilon)^{-1} > 1/2$, which means $x \in \mathcal{A}_i(\theta)$. □

Remark 2. Assumption S1.7 is in fact a sufficient condition for $\mathrm{resp}_i(x; \theta) \ge (1 + \varepsilon)^{-1}$, which means that competing components contribute a total probability mass $\le \varepsilon$ relative to the dominant component $i$.

Corollary S1.9 (Local partition agreement). Under Assumptions S1.3–S1.7, for each $i$,
$$ \mathcal{A}_i(\theta) \cap B_i(\tilde r_i) = \mathcal{B}_i \cap B_i(\tilde r_i) = B_i(\tilde r_i). $$

Proof of Corollary S1.9. Theorem S1.8 gives $B_i(\tilde r_i) \subset \mathcal{A}_i(\theta)$, and Lemma S1.4 gives $B_i(\tilde r_i) \subset \mathcal{B}_i$. □

Agreement is guaranteed only on local neighborhoods of the modes, $\bigcup_{i=1}^I B_i(\tilde r_i)$. In general, the responsibility cells $\{\mathcal{A}_k\}$ and the modal basins $\{\mathcal{B}_i\}$ disagree globally. Figure S1 showcases a counterexample, highlighting the differences between modal clustering and high-responsibility assignments in a Gaussian mixture model. Remarkably, a mode may not even lie in its corresponding component's responsibility region, notably when components overlap significantly.

[Figure S1: Responsibility regions (colored blue, pink, and yellow) and approximate modal basin boundaries (lines), with modes and component means marked, for a Gaussian mixture with 3 components: means $\mu_1 = (-\cos(\pi/6), -0.5)$, $\mu_2 = (0, 1)$, $\mu_3 = (\cos(\pi/6), 0)$; mixture weights $\pi = (0.16, 0.80, 0.04)$; isotropic covariances $\Sigma_k = \sigma_k^2 I$ with variances $(\sigma_1^2, \sigma_2^2, \sigma_3^2) = (0.30, 0.95, 0.10)$. Covariances are represented as circles of radii $\sigma_k$.]
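For concreteness, the responsibility cells of Figure S1 follow directly from the definitions: each $\mathrm{resp}_k$ only involves weighted Gaussian densities, and the cell assignment is the argmax over $k$. A minimal sketch evaluating them at a point, using the mixture parameters of the caption:

```python
import numpy as np

# Figure S1 mixture (d = 2, isotropic covariances sigma_k^2 * I).
means = np.array([[-np.cos(np.pi / 6), -0.5], [0.0, 1.0], [np.cos(np.pi / 6), 0.0]])
weights = np.array([0.16, 0.80, 0.04])
vars_ = np.array([0.30, 0.95, 0.10])

def responsibilities(x):
    """resp_k(x; theta) = pi_k N(x | mu_k, sigma_k^2 I), normalised over k."""
    d2 = ((x[None, :] - means) ** 2).sum(axis=1)
    w = weights * np.exp(-0.5 * d2 / vars_) / (2 * np.pi * vars_)  # density for d = 2
    return w / w.sum()

x = np.array([0.0, 0.0])
r = responsibilities(x)
print(r, r.argmax())  # cell assignment A_k = argmax responsibility
```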
S1.5 Natural-gradient convergence

Under the Robbins–Monro and bounded-iterate conditions, we can show that stochastic natural-gradient ascent (SNGA) is consistent. Convergence is obtained via a direct application of the ODE method for stochastic approximation (Borkar, 2008, Chp. 2.1).

Assumption S1.10. Consider the update
$$ \vartheta_{t+1} = \vartheta_t + \rho_t\, \psi_t, \tag{S5} $$
where $\psi_t$ is an unbiased stochastic estimate of the natural gradient $\widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}(\vartheta_t)$ (equivalently, the gradient in the expectation-parameter space for exponential families or mixtures thereof). Let $\mathcal{F}_t := \sigma(\vartheta_1, \psi_1, \ldots, \psi_t)$. Assume:

(A1) (Robbins–Monro) $\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$.
(A2) (Mean field) $\widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}$ is locally Lipschitz and of at most linear growth.
(A3) (Noise) $\mathbb{E}[\psi_t \mid \mathcal{F}_{t-1}] = \widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}(\vartheta_{t-1})$ and $\mathbb{E}\|\psi_t\|^2 \le C(1 + \|\vartheta_{t-1}\|^2)$.
(A4) (Stability) The iterates $(\vartheta_t)_{t \ge 1}$ are a.s. bounded.

Theorem S1.11 (Convergence of SNGA at fixed $\omega$). Let $S_\omega = \{\vartheta : \widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}(\vartheta) = 0\}$. Under Assumption S1.10, we have, as $t \to \infty$, $\mathrm{dist}(\vartheta_t, S_\omega) \to 0$ a.s. Equivalently, every limit point of $(\vartheta_t)$ is almost surely stationary.

Remark 3 (Natural-gradient regularity). For exponential families and mixtures thereof (treated as MCEF models), the "natural gradient = expectation gradient" identity (eq. (5)) ensures that $\widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}$ inherits Lipschitz continuity from $\widehat{\mathcal{L}}_{\omega,N}$ on compact $\Theta$, without explicit inversion of the Fisher information matrix.

Under annealing, conditions on the schedule suffice to guarantee convergence. The following corollary stems from standard two-timescale results for stochastic approximation (Kushner and Yin, 2003, Chp. 8.6; Borkar, 2008, Chp. 6.1).

Corollary S1.12 (Convergence of SNGA under annealing). Suppose that Assumption S1.10 holds for each fixed $\omega > 0$, and that $(\vartheta, \omega) \mapsto \widetilde\nabla \widehat{\mathcal{L}}_{\omega,N}(\vartheta)$ is jointly continuous on bounded sets.

1. Stagewise schedule. Let $\omega_1 > \omega_2 > \cdots \to 0$. If at stage $j$ the run with fixed $\omega_j$ is long enough that $\mathrm{dist}(\vartheta, S_{\omega_j}) \le \delta_j$ with $\delta_j \to 0$ (and the next stage is warm-started at that iterate), then $\mathrm{dist}(\vartheta, S_{\omega_j}) \to 0$ as $j \to \infty$.
2. Continuous schedule. If $\omega_t \to 0$ satisfies $|\omega_{t+1} - \omega_t| = o(\rho_t)$, then $\mathrm{dist}(\vartheta_t, S_{\omega_t}) \to 0$ a.s.

Remark 4 (GERVE's convergence). GERVE performs optimisation in a compact parameter space ($\Theta$ for single Gaussians, $\Theta_K$ for Gaussian mixtures), so the iterates must be projected to ensure that they remain in this space. In that case, the above convergence results are not directly applicable. In practice, however, projection is not always needed, typically when the compact space contains the solution of the unconstrained optimisation problem.
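For concreteness, the update (S5) with a Robbins–Monro schedule and a stagewise warm-started schedule can be written out in a toy one-dimensional, fixed-covariance single-Gaussian case, where the entropy term does not depend on the mean and the natural-gradient step reduces to a rescaled Euclidean gradient (cf. Section S3.2). This is only a minimal sketch, not the GERVE algorithm; in particular, the component scale is annealed here as a stand-in for a full temperature schedule.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 0.3, 500), rng.normal(2, 0.3, 500)])  # bimodal toy sample

def grad_mu(mu, batch, s2):
    # Unbiased minibatch estimate of d/dmu E_PN[q_mu(X)]; for fixed covariance
    # the entropy term does not depend on mu, so it drops out of the gradient.
    q = np.exp(-0.5 * (batch - mu) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
    return np.mean(q * (batch - mu) / s2)

mu = 0.1                                   # single-Gaussian variational mean
for s2 in [2.0, 1.0, 0.5]:                 # stagewise schedule, warm starts (Cor. S1.12)
    for t in range(1, 2001):
        rho = 1.0 / (10 + t)               # Robbins-Monro: sum rho = inf, sum rho^2 < inf
        batch = rng.choice(X, size=32)
        mu += rho * s2 * grad_mu(mu, batch, s2)  # s2 * grad: rescaled (natural-like) step
print(mu)  # drifts toward the mode whose basin contains the initialisation
```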
S1.6 Uncertainty quantification for mode estimation

Here, we develop the theoretical guarantees for the mode-level uncertainty quantification (UQ) procedure proposed in Section 5. We translate mathematically all the notions (including Assumption 5.2) defined in Section 5.2; these will be used throughout this section and in the proofs of Section S2.4. We then investigate the consistency of the stability scores (Prop. S1.13) and propose strategies to handle nonconvex objectives. All the results of this section and of Section 5 of the main part are proven in Section S2.4.

S1.6.1 Notations and separation assumption

Data, measures, and convergence. $X_{1:N}$ are i.i.d. from the population law $P$ on the sample space $\mathcal{S}$. The empirical and bootstrap empirical measures are
$$ P_N = \frac1N \sum_{i=1}^N \delta_{X_i}, \qquad P^*_N = \frac1N \sum_{i=1}^N \delta_{X^*_i}, \qquad X^*_1, \ldots, X^*_N \overset{\mathrm{i.i.d.}}{\sim} P_N \text{ given } X_{1:N}. $$
We write $\mathbb{P}^*(\cdot) = \mathbb{P}(\cdot \mid X_{1:N})$ for conditional (bootstrap) probability. All bootstrap limits are understood conditionally on $X_{1:N}$, in $P$-probability.

Parameter space and objective. Fix $\omega_0 > 0$ and work on the compact parameter set $\Theta_K$ (eq. (4)). For a finite signed measure $G$ on $\mathcal{S}$, define the criterion
$$ \mathcal{L}_{\omega_0}(\Lambda; G) := \int q_\Lambda(x)\, dG(x) + \omega_0 H_{\mathcal{S}}(q_\Lambda), $$
and the population/empirical versions $\mathcal{L}_{\omega_0}(\Lambda) := \mathcal{L}_{\omega_0}(\Lambda; P)$ and $\widehat{\mathcal{L}}_{\omega_0,N}(\Lambda) := \mathcal{L}_{\omega_0}(\Lambda; P_N)$.

Estimators and bootstrap refits. Let
$$ \widehat\Lambda_{\omega_0,N} \in \arg\max_{\Lambda \in \Theta_K} \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda), \qquad \Lambda^\star_{\omega_0} \in \arg\max_{\Lambda \in \Theta_K} \mathcal{L}_{\omega_0}(\Lambda), $$
and define the bootstrap criterion and maximiser
$$ \widehat{\mathcal{L}}^*_{\omega_0,N}(\Lambda) := \mathcal{L}_{\omega_0}(\Lambda; P^*_N), \qquad \widehat\Lambda^*_{\omega_0,N} \in \arg\max_{\Lambda \in \Theta_K} \widehat{\mathcal{L}}^*_{\omega_0,N}(\Lambda). $$
We also allow a (possibly inexact) estimator $\widetilde\Lambda_{\omega_0,N}$ of $\widehat\Lambda_{\omega_0,N}$ (e.g., the output of a finite-iteration GERVE fit).

Quadratic expansion matrices. Write
$$ H^\star_{\omega_0} := -\nabla^2_\Lambda \mathcal{L}_{\omega_0}(\Lambda^\star_{\omega_0}), \qquad V_{\omega_0} := \mathrm{Var}\big[ \nabla_\Lambda q_\Lambda(X) \big]\big|_{\Lambda = \Lambda^\star_{\omega_0}}, \qquad W_{\omega_0} := (H^\star_{\omega_0})^{-1} V_{\omega_0} (H^\star_{\omega_0})^{-1}. $$

Mode means, matching, and projections. Here we consider fixed targets $u_1, \ldots, u_{K_0} \in \mathbb{R}^d$ ($K_0 \le K$). For $\Lambda \in \Theta_K$, let $\mu_k(\Lambda) \in \mathbb{R}^d$ be the $k$-th component mean and $m(\Lambda) := (\mu_1(\Lambda)^T, \ldots, \mu_K(\Lambda)^T)^T \in \mathbb{R}^{Kd}$. Define the matching map $M(\Lambda) \in \mathbb{R}^{dK_0}$ by selecting an injection $\pi : \{1, \ldots, K_0\} \hookrightarrow \{1, \ldots, K\}$ minimising $\sum_{j=1}^{K_0} \|\mu_{\pi(j)}(\Lambda) - u_j\|$, with a fixed lexicographic tie-break:
$$ M(\Lambda) = \big( \mu_{\pi(1)}(\Lambda)^T, \ldots, \mu_{\pi(K_0)}(\Lambda)^T \big)^T. $$
Let $E_j \in \mathbb{R}^{d \times dK_0}$ extract the $j$-th $d$-block and set $M_j(\Lambda) := E_j M(\Lambda) \in \mathbb{R}^d$.

Local separation and linearisation. For $j \in [K_0]$ and $\Lambda \in \Theta_K$, let $a(j, \Lambda) \in \arg\min_k \|\mu_k(\Lambda) - u_j\|$ (lexicographic tie-break) and define
$$ \mathrm{gap}_j(\Lambda) := \min_{k \ne a(j,\Lambda)} \big( \|\mu_k(\Lambda) - u_j\| - \|\mu_{a(j,\Lambda)}(\Lambda) - u_j\| \big), \qquad \Delta(\Lambda) := \min_{1 \le j \le K_0} \mathrm{gap}_j(\Lambda). $$
For $\delta > 0$, set $\mathcal{G}_\delta := \{\Lambda \in \Theta_K : \Delta(\Lambda) \ge \delta\}$. Local separation (Assumption 5.2) holds at $\Lambda^\star_{\omega_0}$ if there exists $\delta > 0$ such that $\Delta(\Lambda^\star_{\omega_0}) \ge \delta$. In this case, $M$ is $C^1$ at $\Lambda^\star_{\omega_0}$ (Prop. S2.17) with Jacobian $J := \nabla_\Lambda M(\Lambda^\star_{\omega_0})$, and we define $C_M := J W_{\omega_0} J^T$ and $C_j := E_j C_M E_j^T$.

S1.6.2 Consistency of stability scores

To investigate the properties of the stability scores, we write out their mathematical definition. Let $\Phi(G)$ be a deterministic measurable selection from the (possibly set-valued) argmax
$$ \Phi(G) \in \arg\max_{\Lambda \in \Theta_K} \Big\{ \int q_\Lambda\, dG + \omega_0 H_{\mathcal{S}}(q_\Lambda) \Big\}. $$
Fix disjoint neighborhoods $U_1, \ldots, U_{K_0}$ of $u_1, \ldots, u_{K_0}$. Define $F_j(G) = 1\{E_j M(\Phi(G)) \in U_j\}$ and the targets $\pi_{j,N} := \mathbb{P}(F_j(P_N) = 1)$, $\pi_j := \lim_{N \to \infty} \pi_{j,N}$. The bootstrap stability score defined in Section 5 can be reformulated as $s_j = \frac1L \sum_{\ell=1}^L F_j(P^{*(\ell)}_N)$; in other words,
$$ s_j = \frac1L \sum_{\ell=1}^L 1\{ \exists \text{ a matched component in } U_j \text{ for the } \ell\text{-th bootstrap fit} \}. $$
Let $\mathbb{E}^*$ and $\mathrm{Var}^*$ denote the conditional expectation and variance given $X_{1:N}$.
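To make the mechanics concrete, the sketch below computes the scores $s_j$ on a toy one-dimensional example, with a deliberately crude stand-in for the bootstrap GERVE refit $\Phi(P^{*(\ell)}_N)$ (splitting the sample at zero). Only the resampling, nearest-mean matching, and indicator-averaging steps are meant to be illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(-2, 0.4, 300), rng.normal(2, 0.4, 300)])

def fit_means(data):
    # Stand-in for a bootstrap GERVE refit: returns candidate component means.
    return np.array([data[data < 0].mean(), data[data >= 0].mean()])

targets = np.array([-2.0, 2.0])   # u_1, u_2: targets matched from the original fit
radius = 0.5                      # neighbourhoods U_j = B(u_j, r_j)
L = 200
hits = np.zeros(len(targets))
for _ in range(L):
    Xb = rng.choice(X, size=X.size, replace=True)   # P*_N: resample with replacement
    mus = fit_means(Xb)
    for j, u in enumerate(targets):
        k = np.argmin(np.abs(mus - u))              # nearest-mean matching a(j, Lambda)
        hits[j] += abs(mus[k] - u) < radius         # F_j = 1{matched mean in U_j}
print(hits / L)                                     # stability scores s_j, near 1 if stable
```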
The following proposition, proven in Section S2.4.7, establishes statistical guarantees for the stability scores.

Proposition S1.13 (Consistency of stability scores). Index the stability scores $s_{j,L}$ by $L$. Under Assumptions S2.10 and 5.2, and with $U_j$ chosen to have a positive margin at $\Lambda^\star_{\omega_0}$ (Lemma S2.15), define $\tau_{j,N} := \mathbb{E}^*[F_j(P^*_N)] = \mathbb{E}^*[s_{j,L}]$ (for all $L$). Then:

(i) (Conditional LLN). For each $N$, conditionally on the data, $s_{j,L} \overset{\mathrm{a.s.}}{\longrightarrow} \tau_{j,N}$ as $L \to \infty$, and $\mathrm{Var}^*(s_{j,L}) = \mathrm{Var}^*(F_j(P^*_N))/L \to 0$. In particular, for any $\epsilon > 0$,
$$ \mathbb{P}^*\big( |s_{j,L} - \tau_{j,N}| > \epsilon \big) \le 2 \exp(-2L\epsilon^2). $$
(ii) (Large-$N$ limit). As $N \to \infty$,
$$ \tau_{j,N} \overset{P}{\longrightarrow} \pi_j, \qquad \pi_{j,N} := \mathbb{P}(F_j(P_N) = 1) \to \pi_j, $$
where $\pi_j := F_j(P) \in \{0, 1\}$ (equivalently, $\pi_j = \mathbb{P}(F_j(P) = 1)$). Consequently, for any sequence $L_N \to \infty$ as $N \to \infty$, we have $s_{j,L_N} \overset{P}{\longrightarrow} \pi_j$.

S1.6.3 Handling local optima

Although $\mathcal{L}_{\omega_0}$ is nonconvex in general, in practice we use the following guardrails to mitigate the issue:

1. Annealing + overcompleteness. We use an overcomplete model and start from a higher smoothing level, continuing down to $\omega_0$. Small components are pruned. This reliably enlarges the global basin explored by GERVE.
2. Stability scores. Per-mode stability scores $s_j$ are monitored, with values near 1 indicating robust recovery. As a diagnostic, we can flag $s_j < 0.5$ as potentially problematic and $s_j < 0.3$ as likely unstable.
3. Multiple initialisations. We can run several random starts at $\omega_0$ and keep the best solution (by objective value).

S2 Proofs

S2.1 Preliminaries: Empirical process properties

In this section, we first prove that the classes of Gaussian densities and of GMMs with bounded covariances are Glivenko–Cantelli (GC) and admit integrable envelopes. We recall (van der Vaart and Wellner, 1996) that a class of functions $\mathcal{F}$ on $\mathbb{R}^d$ is Glivenko–Cantelli (GC) with respect to a probability measure $P$ if, for i.i.d. samples $X_1, \ldots, X_N$ of a random variable $X \sim P$, the empirical average of any $f \in \mathcal{F}$, $P_N f := N^{-1} \sum_{i=1}^N f(X_i)$, converges uniformly to its expectation $Pf = \mathbb{E}_P[f(X)]$, in probability:
$$ \sup_{f \in \mathcal{F}} |P_N f - Pf| \overset{P}{\longrightarrow} 0 \quad \text{as } N \to \infty. $$
We say that $\mathcal{F}$ admits an integrable envelope $F$ if $F(x) \ge |f(x)|$ for all $f \in \mathcal{F}$, with $\mathbb{E}_P[F(X)] < \infty$.

Lemma S2.1 (GC for fixed-covariance Gaussian densities). Let $\phi_{\Sigma_0}(x) = \mathcal{N}(x; 0, \Sigma_0)$ with a given $\Sigma_0 \succ 0$, and define the class $\mathcal{F}_0 = \{ f_\mu(x) = \phi_{\Sigma_0}(x - \mu),\ \mu \in \mathbb{R}^d \}$. If $P$ is a probability measure with continuous and bounded density, then $\mathcal{F}_0$ is Glivenko–Cantelli and has an integrable envelope with respect to $P$.

Proof of Lemma S2.1. Standard results on empirical processes guarantee that $\mathcal{F}_0$ is GC if it admits an envelope and is totally bounded in $L^1(P)$ (see van der Vaart and Wellner, 1996, Thm. 2.4.3). A straightforward envelope is the constant $F(x) := (2\pi)^{-d/2} |\Sigma_0|^{-1/2} \ge f_\mu(x) \ge 0$, and $\mathbb{E}_P[F(X)] = F < \infty$. It only remains to prove total boundedness in $L^1(P)$.

1. Denote by $p$ the density of $P$. Since $\|f_\mu\|_{L^1(P)} = \int f_\mu(x)\, p(x)\, dx = \int \phi_{\Sigma_0}(x - \mu)\, p(x)\, dx = (p * \phi_{\Sigma_0})(\mu)$, we have $\|f_\mu\|_{L^1(P)} \to 0$ as $\|\mu\| \to \infty$. It follows that for any $\varepsilon > 0$, there exists some $M_\varepsilon$ such that $\sup_{\|\mu\| > M_\varepsilon} \|f_\mu\|_{L^1(P)} < \varepsilon/2$.
2. On the compact ball $B(0, R) = \{x : \|x\| \le R\}$, $(x, \mu) \mapsto f_\mu(x)$ is jointly $C^1$.
Hence, there exists $L_R$ with $\sup_{\|x\| \le R} |f_\mu(x) - f_\nu(x)| \le L_R \|\mu - \nu\|$. Choose $R$ so that $2 P(B(0,R)^c) \|F\|_\infty < \varepsilon/2$.

3. Cover $\{\mu \in \mathbb{R}^d : \|\mu\| \le M_\varepsilon\}$ with a finite $\delta$-net $\{\mu_j,\ j \in [J_\delta]\}$, where $\delta = \varepsilon/(2 L_R)$. (A $\delta$-net for a set $A$ in a metric space $(X, d)$ is a finite subset $N_\delta \subset X$ such that for every $x \in A$ there exists $y \in N_\delta$ with $d(x, y) \le \delta$; the relevant metric for total boundedness here is the $L^1(P)$ distance between densities.) That means for all $\mu$ with $\|\mu\| \le M_\varepsilon$, there exists $j \in [J_\delta]$ such that $\|\mu - \mu_j\| \le \delta$. Compute
$$ \|f_\mu - f_{\mu_j}\|_{L^1(P)} = \int_{B(0,R)} |f_\mu(x) - f_{\mu_j}(x)|\, p(x)\, dx + \int_{B(0,R)^c} |f_\mu(x) - f_{\mu_j}(x)|\, p(x)\, dx. $$
On $B(0,R)$, $|f_\mu(x) - f_{\mu_j}(x)| \le L_R \|\mu - \mu_j\| \le \varepsilon/2$, so the first integral is smaller than $\varepsilon/2$. On $B(0,R)^c$, the integral is smaller than $2 \|F\|_\infty P(B(0,R)^c) < \varepsilon/2$. Thus $\|f_\mu - f_{\mu_j}\|_{L^1(P)} \le \varepsilon$. Finally, if $\|\mu\| > M_\varepsilon$ then, by step 1, $\|f_\mu - 0\|_{L^1(P)} = \|f_\mu\|_{L^1(P)} < \varepsilon/2$, which concludes the proof that $\mathcal{F}_0$ is totally bounded in $L^1(P)$. □

Lemma S2.2 (GC for bounded location-scale Gaussian densities). Let $\mathcal{F} = \{ f_{\mu,\Sigma}(x) = \mathcal{N}(x; \mu, \Sigma) : \mu \in \mathbb{R}^d,\ \sigma^2_{\min} I \preceq \Sigma \preceq \sigma^2_{\max} I \}$. If $P$ is a probability measure with continuous and bounded density, then $\mathcal{F}$ is GC and has an integrable envelope with respect to $P$.

Proof of Lemma S2.2. The proof is analogous to that of Lemma S2.1. An admissible envelope is $F(x) := (2\pi\sigma^2_{\min})^{-d/2} \ge \sup_{\mu,\Sigma} f_{\mu,\Sigma}(x)$, with $\mathbb{E}_P[F(X)] < \infty$. Analogous arguments prove total boundedness in $L^1(P)$:

1. For large $\|\mu\|$, $\|f_{\mu,\Sigma}\|_{L^1(P)} = (p * \phi_\Sigma)(\mu) \to 0$ uniformly over $\Sigma$ in the spectral band $[\sigma^2_{\min}, \sigma^2_{\max}]$, as the tails of $\phi_\Sigma$ are controlled by $\sigma_{\max}$. Define $M_\varepsilon$ such that on $\{\mu : \|\mu\| > M_\varepsilon\}$, the functions are $\varepsilon/2$-close to $0$ in $L^1(P)$.
2. On the compact ball $B(0,R) = \{x : \|x\| \le R\}$, $(x, \mu, \Sigma) \mapsto f_{\mu,\Sigma}(x)$ is jointly $C^1$. Because $\Sigma^{-1}$ has eigenvalues in $[\sigma^{-2}_{\max}, \sigma^{-2}_{\min}]$, the derivatives are uniformly bounded; hence $f$ is Lipschitz in $(\mu, \Sigma)$, i.e., there exists $L_R > 0$ such that $\sup_{x \in B(0,R)} |f_{\mu,\Sigma}(x) - f_{\mu',\Sigma'}(x)| \le L_R \|(\mu, \Sigma) - (\mu', \Sigma')\|$.
3. Cover $\{\mu : \|\mu\| \le M_\varepsilon\} \times \{\Sigma : \sigma^2_{\min} I \preceq \Sigma \preceq \sigma^2_{\max} I\}$ with a $\delta$-net for $\delta = \varepsilon/(2 L_R)$. The $L^1(P)$-distance is small on $B(0,R)$, with the tail region $\{x : \|x\| > R\}$ controlled by choosing $R$ such that $2 P(B(0,R)^c) \|F\|_\infty < \varepsilon/2$.

As before, we conclude that $\mathcal{F}$ is totally bounded, hence GC. □

Lemma S2.3 (GC for finite mixtures). Fix $K \in \mathbb{N}$. Let $\mathcal{F}$ be totally bounded with respect to $P$ and with envelope $F \in L^1(P)$. Define the class of $K$-component mixtures with components in $\mathcal{F}$,
$$ \mathcal{F}^{(K)} = \Big\{ f(x) = \sum_{k=1}^K \pi_k f_k(x) : \pi \in \Delta_K,\ f_k \in \mathcal{F} \Big\}, $$
where $\Delta_K$ is the $(K-1)$-dimensional simplex. Then $\mathcal{F}^{(K)}$ is GC with envelope $KF \in L^1(P)$.

Proof of Lemma S2.3. We need to prove that $KF$ is an integrable envelope and that $\mathcal{F}^{(K)}$ is totally bounded in $L^1(P)$. First,
$$ |f(x)| = \Big| \sum_{k=1}^K \pi_k f_k(x) \Big| \le \sum_{k=1}^K \pi_k |f_k(x)| \le \sum_{k=1}^K |f_k(x)| \le K F(x), $$
hence $\mathbb{E}_P[KF] < \infty$, and $KF$ is an envelope. The total boundedness of $\mathcal{F}^{(K)}$ follows from that of $\mathcal{F}$:

1. Let $\varepsilon > 0$. Since $\mathcal{F}$ is totally bounded, there exists a finite $\varepsilon/2$-net $\{g^{(1)}, \ldots, g^{(J_\varepsilon)}\}$ in $L^1(P)$.
For each $f_k$, denote by $g^{(j_k)}$ an $\varepsilon/2$-approximation. For each $g^{(j_k)}$,
$$ \|g^{(j_k)}\|_{L^1(P)} \le \|g^{(j_k)} - f_k\|_{L^1(P)} + \|f_k\|_{L^1(P)} \le \varepsilon/2 + \|F\|_{L^1(P)} =: F_\varepsilon. $$
2. Approximate $\pi \in \Delta_K$ using a finite grid $\Pi_\eta$ of mesh $\eta = \varepsilon/(2 K F_\varepsilon)$ and denote by $\tilde\pi_k$ the closest element of $\Pi_\eta$ to $\pi_k$.
3. Compute
$$ \Big\| \sum_{k=1}^K \pi_k f_k - \sum_{k=1}^K \tilde\pi_k g^{(j_k)} \Big\|_{L^1(P)} \le \sum_{k=1}^K \pi_k \|f_k - g^{(j_k)}\|_{L^1(P)} + \sum_{k=1}^K F_\varepsilon |\pi_k - \tilde\pi_k|. $$
The two terms are each bounded by $\varepsilon/2$, so the sum is bounded by $\varepsilon$. Thus $\mathcal{F}^{(K)}$ is totally bounded, hence GC. □

Corollary S2.4 (GC for bounded-covariance Gaussian mixtures). Let $\mathcal{G}$ be the class of Gaussian densities with covariance eigenvalues in $[\sigma^2_{\min}, \sigma^2_{\max}]$ and unconstrained means. For fixed $K$, the class of $K$-component Gaussian mixtures with those covariance bounds is GC under $P$ and has an envelope in $L^1(P)$.

Proof of Corollary S2.4. In the proof of Lemma S2.2, we established the conditions needed to apply Lemma S2.3 with $\mathcal{F} = \mathcal{G}$, which in turn proves the result above. □
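The uniform convergence asserted by these GC results can be visualised numerically. For $p = \mathcal{N}(0, 1)$, the population mean of a Gaussian density is available in closed form, $\mathbb{E}_p[q_{\mu,\sigma}(X)] = \mathcal{N}(\mu; 0, 1 + \sigma^2)$, so the deviation can be evaluated exactly. A minimal sketch, where a finite parameter grid is a crude proxy for the supremum over the whole class:

```python
import numpy as np

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

rng = np.random.default_rng(0)
mus = np.linspace(-3, 3, 61)
sig2 = np.linspace(0.2, 2.0, 19)        # spectral band [sigma2_min, sigma2_max]
for N in [100, 1000, 10000]:
    X = rng.normal(0, 1, N)
    sup_dev = max(
        # E_p[q_{m,v}(X)] = N(m; 0, 1 + v) by the Gaussian convolution identity
        abs(gauss(X, m, v).mean() - gauss(m, 0.0, 1.0 + v))
        for m in mus for v in sig2
    )
    print(N, sup_dev)                   # uniform deviation shrinks as N grows
```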
Finally, we present the $P$-Donsker property. It upgrades GC to a uniform CLT, which yields rates for empirical averages over $\mathcal{F}$ and underpins our asymptotic normality results. Consider a probability measure $P$ and i.i.d. samples $X_1, \ldots, X_N$ of a random variable $X \sim P$. A class of functions $\mathcal{F}$ on $\mathbb{R}^d$ is $P$-Donsker if the empirical process $\mathbb{G}_N$, defined as $\mathbb{G}_N f := \sqrt N (P_N - P) f$ for $f \in \mathcal{F}$ and viewed as a random element in $\ell^\infty(\mathcal{F})$, converges in distribution to a tight mean-zero Gaussian process $\mathbb{G}_P$ with covariance
$$ \mathrm{Cov}(\mathbb{G}_P f, \mathbb{G}_P g) = P[(f - Pf)(g - Pg)], \qquad f, g \in \mathcal{F}. $$
Equivalently, $\mathcal{F}$ is $P$-Donsker if a functional CLT holds uniformly over $f \in \mathcal{F}$. A sufficient condition is that $\mathcal{F} \subset L^2(P)$ admits an $L^2(P)$ envelope and has a finite bracketing entropy integral, for example
$$ \int_0^\delta \sqrt{ \log N_{[\,]}\big( \varepsilon, \mathcal{F}, L^2(P) \big) }\, d\varepsilon < \infty, $$
where $\log N_{[\,]}(\varepsilon, \mathcal{F}, L^2(P))$ denotes the $\varepsilon$-bracketing entropy of $\mathcal{F}$ in $L^2(P)$; see van der Vaart and Wellner (1996), Chapter 2.5.

Lemma S2.5 (GC and Donsker properties for Gaussian mixtures and their gradients). For fixed $K$ and compact $\Theta_K$, the classes $\mathcal{Q} = \{q_\Lambda : \Lambda \in \Theta_K\}$, $\mathcal{Q}^{(u)}_\nabla = \{[\nabla_\Lambda q_\Lambda]_u : \Lambda \in \Theta_K\}$ for $u = 1, \ldots, \dim(\Theta_K)$, and $\mathcal{Q}^{(u,v)}_{\nabla^2} = \{[\nabla^2_\Lambda q_\Lambda]_{uv} : \Lambda \in \Theta_K\}$ for $u, v = 1, \ldots, \dim(\Theta_K)$, are $P$-Donsker (hence Glivenko–Cantelli) with envelopes in $L^2(P)$.

Proof of Lemma S2.5. First, we prove that $\mathcal{Q}$, $\mathcal{Q}_\nabla$ and $\mathcal{Q}_{\nabla^2}$ admit envelopes in $L^2(P)$; then we use the fact that they are Lipschitz in the parameter to establish the Donsker property.

Step 1. $L^2$ envelopes. By continuity of Gaussian densities and compactness of $\Theta_K$, there exist constants $C_0, C_1, C_2 < \infty$ (depending only on $\Theta_K$) such that, for all $x \in \mathbb{R}^d$, $\Lambda \in \Theta_K$ and $u, v = 1, \ldots, \dim(\Theta_K)$,
$$ 0 \le q_\Lambda(x) \le C_0, \quad |[\nabla_\Lambda q_\Lambda(x)]_u| \le C_1(1 + \|x\| + \|x\|^2), \quad |[\nabla^2_\Lambda q_\Lambda(x)]_{uv}| \le C_2(1 + \|x\| + \|x\|^2 + \|x\|^3 + \|x\|^4). $$
Hence an envelope for $\mathcal{Q}$ is $Q(x) \equiv C_0 \in L^2(P)$. Since $P$ is supported on a bounded set $\mathcal{S} \subset \mathbb{R}^d$, there is $R < \infty$ with $\|X\| \le R$ a.s., so an envelope for $\mathcal{Q}^{(u)}_\nabla$ is $Q_\nabla(X) \equiv C_1(1 + R + R^2)$ a.s.; in particular $Q_\nabla \in L^2(P)$, and each $\mathcal{Q}^{(u)}_\nabla$ admits an $L^2(P)$ envelope. Similarly, for $u, v = 1, \ldots, \dim(\Theta_K)$, an envelope for $\mathcal{Q}^{(u,v)}_{\nabla^2}$ is $Q_{\nabla^2}(X) \equiv C_2(1 + R + R^2 + R^3 + R^4)$ a.s., and each $\mathcal{Q}^{(u,v)}_{\nabla^2}$ also admits an $L^2(P)$ envelope.

Step 2. Lipschitz into $L^2(P)$. Fix $\Lambda, \tilde\Lambda \in \Theta_K$ and set $\Lambda_t = \tilde\Lambda + t(\Lambda - \tilde\Lambda)$, $t \in [0, 1]$. By the mean value theorem in Banach spaces,
$$ q_\Lambda - q_{\tilde\Lambda} = \int_0^1 \langle \nabla_\Lambda q_{\Lambda_t},\ \Lambda - \tilde\Lambda \rangle\, dt, $$
so
$$ \|q_\Lambda - q_{\tilde\Lambda}\|_{L^2(P)} \le \|\Lambda - \tilde\Lambda\| \sup_{\Lambda' \in \Theta_K} \|\nabla_\Lambda q_{\Lambda'}\|_{L^2(P)} \le \|Q_\nabla\|_{L^2(P)}\, \|\Lambda - \tilde\Lambda\|. $$
By the same argument applied to $\nabla_\Lambda q_\Lambda$ and $\nabla^2_\Lambda q_\Lambda$, and using that all second derivatives $\nabla^2_\Lambda q_\Lambda(x)$ and third derivatives $\nabla^3_\Lambda q_\Lambda(x)$ are uniformly bounded on $\mathcal{S} \times \Theta_K$ (Gaussian mixture densities have continuous second and third derivatives, and $\Theta_K$ ensures bounded means and eigenvalues bounded away from $0$ and $\infty$), there exist $C'_1, C'_2 < \infty$ with
$$ \|[\nabla_\Lambda q_\Lambda]_u - [\nabla_\Lambda q_{\tilde\Lambda}]_u\|_{L^2(P)} \le C'_1 \|\Lambda - \tilde\Lambda\|, \qquad \|[\nabla^2_\Lambda q_\Lambda]_{uv} - [\nabla^2_\Lambda q_{\tilde\Lambda}]_{uv}\|_{L^2(P)} \le C'_2 \|\Lambda - \tilde\Lambda\|, $$
for all $u, v = 1, \ldots, \dim(\Theta_K)$. Thus all maps $\Lambda \mapsto q_\Lambda$, $\Lambda \mapsto [\nabla_\Lambda q_\Lambda]_u$ and $\Lambda \mapsto [\nabla^2_\Lambda q_\Lambda]_{uv}$ are Lipschitz into $L^2(P)$ with $L^2$ envelopes.

Step 3. Donsker. By the Euclidean parametric-Lipschitz bracketing bound (van der Vaart and Wellner, 1996, Thm. 2.7.11), the $L^2(P)$ bracketing numbers of $\mathcal{Q}$, $\mathcal{Q}^{(u)}_\nabla$ and $\mathcal{Q}^{(u,v)}_{\nabla^2}$ grow at most polynomially on the compact $\Theta_K$. Therefore the bracketing integral is finite, and the bracketing Donsker theorem (van der Vaart and Wellner, 1996, Chp. 2.5) yields that $\mathcal{Q}$, $\mathcal{Q}^{(u)}_\nabla$ and $\mathcal{Q}^{(u,v)}_{\nabla^2}$ are $P$-Donsker, for all $u, v = 1, \ldots, \dim(\Theta_K)$. □

S2.2 Proof of Theorem 2.1

We prove Theorem 2.1, establishing the consistency, as $\omega \to 0$, of the optimal variational solution $q^\star_\omega := \arg\max_{\mathcal{Q}} \mathcal{L}_\omega(q)$, where $\mathcal{Q}$ is the family of truncated Gaussian mixtures with $K \ge I$ components and upper-bounded covariances. We proceed in three steps. First, we identify the asymptotics of the Gibbs measure $g_\omega$ via a Laplace expansion near each mode of $f$ (Prop. S2.6 and S2.7). Then, we construct a truncated mixture $\tilde q^{\mathcal{S}}_\omega$ on $\mathcal{S}$ such that $\mathrm{KL}(\tilde q^{\mathcal{S}}_\omega \| g_\omega) \to 0$ (Prop. S2.8). This implies $\mathrm{KL}(q^\star_\omega \| g_\omega) \to 0$, so finally, combining with the Laplace expansions of $g_\omega$, we deduce the claims of Theorem 2.1.

Setup. Let $f \in C^3(\mathcal{S})$ be bounded with finitely many nondegenerate global modes $\{x^\star_i\}_{i=1}^I$, and let $f^\star := \sup_{x \in \mathcal{S}} f(x) = f(x^\star_i)$. Nondegeneracy means $H_i = -\nabla^2 f(x^\star_i) \succ 0$ for all $i \in [I]$. These modes are isolated and in the interior of $\mathcal{S}$, so there exist disjoint open, bounded neighborhoods $U_1, \ldots, U_I \subset \mathcal{S}$ of the $x^\star_i$ such that each $U_i$ contains no other critical point of $f$. Without loss of generality, assume that $f$ is strictly positive on $\bigcup_{i=1}^I U_i$. Define $U_0 := \mathcal{S} \setminus \bigcup_{i=1}^I U_i$. The restricted objective over a family $\mathcal{Q}$ is
$$ \mathcal{L}_\omega(q) = \mathbb{E}_q[f(X)] + \omega H_{\mathcal{S}}(q), \qquad q \in \mathcal{Q}. $$
Here, $\mathcal{Q}$ is the $\mathcal{S}$-truncated Gaussian mixture family with $K \ge I$ components and all covariance eigenvalues in $(0, \sigma^2_{\max}]$. Recall that for any Gaussian mixture $q_\Lambda$ on $\mathbb{R}^d$, we write its $\mathcal{S}$-truncation
$$ q^{\mathcal{S}}_\Lambda(x) := \frac{q_\Lambda(x)\, 1_{\mathcal{S}}(x)}{\int_{\mathcal{S}} q_\Lambda(y)\, dy}. $$
Let $q^\star_\omega$ denote any choice of maximiser of $\mathcal{L}_\omega(q)$ in $\mathcal{Q}$, for fixed $\omega > 0$. To simplify notation, we denote the conditional distributions of $q^\star_\omega$ and $g_\omega$ on the regions $U_i$, $i = 0, \ldots, I$, respectively by $\phi_{i,\omega}$ and $\psi_{i,\omega}$, defined as
$$ \phi_{i,\omega}(x) := q^{\star, U_i}_\omega(x) = \frac{q^\star_\omega(x)\, 1_{U_i}(x)}{\alpha_{i,\omega}}, \qquad \psi_{i,\omega}(x) := g^{U_i}_\omega(x) = \frac{g_\omega(x)\, 1_{U_i}(x)}{\gamma_{i,\omega}}, \tag{S6} $$
where
$$ \alpha_{i,\omega} := \int_{U_i} q^\star_\omega(x)\, dx, \qquad \gamma_{i,\omega} := \int_{U_i} g_\omega(x)\, dx. \tag{S7} $$
Step 1: Laplace expansion and local Gibbs measure properties

We start by applying Laplace's method around each isolated mode $x^\star_i$ of $f$ to the unrestricted Gibbs optimiser $g_\omega$. The two following propositions characterise the shape of the Gibbs measure under annealing.

Proposition S2.6 (Laplace expansion near a mode). Fix $i \in [I]$ and consider an open neighborhood $U_i$ of the mode $x^\star_i$. Then, as $\omega \to 0$,
$$ \int_{U_i} \exp\{f(x)/\omega\}\, dx = e^{f^\star/\omega} (2\pi\omega)^{d/2} (\det H_i)^{-1/2} \big( 1 + o(1) \big), $$
and for any bounded function $h$ of class $C^2$,
$$ \int_{U_i} h(x)\, g_\omega(dx) = h(x^\star_i)\, c_i(\omega) + \frac{\omega}{2}\, \mathrm{Tr}\big( H_i^{-1} \nabla^2 h(x^\star_i) \big)\, c_i(\omega) + o(\omega), $$
where $c_i(\omega) = c_i + o(1)$ and
$$ c_i := \frac{(\det H_i)^{-1/2}}{\sum_{j=1}^I (\det H_j)^{-1/2}}. $$
Moreover, $g_\omega \to \sum_{i=1}^I c_i \delta_{x^\star_i}$ weakly, and there exist $C, \eta > 0$ such that $g_\omega(U_0) \le C e^{-\eta/\omega}$.

Proof of Proposition S2.6. The proof uses the classical Laplace method (Hwang, 1980; see also Le Minh et al., 2025, Proof of Thm. 2): in $U_i$, write the Taylor expansion of $f(x)$ at $x^\star_i$,
$$ f(x) = f(x^\star_i) - \tfrac12 (x - x^\star_i)^T H_i (x - x^\star_i) + R_i(x), \qquad |R_i(x)| \le L_i \|x - x^\star_i\|^3, $$
where $L_i > 0$ depends on $U_i$. Then
$$ \int_{U_i} \exp(f(x)/\omega)\, dx = \exp(f(x^\star_i)/\omega) \int_{U_i} \exp\Big\{ -\tfrac1{2\omega} (x - x^\star_i)^T H_i (x - x^\star_i) + \tfrac1\omega R_i(x) \Big\}\, dx, $$
and standard dominated convergence in Gaussian coordinates yields the first display. Summing over $i$ and normalising gives the weights $c_i(\omega) \to c_i$ and the weak limit. The second display follows from the same expansion applied to $h$ (up to second order) and Gaussian moment calculations. The last statement holds because the modes are isolated: there exists $\eta > 0$ such that $\sup_{x \in U_0} f(x) \le f^\star - \eta$, so there exists $C > 0$ such that $g_\omega(U_0) \le C e^{-\eta/\omega}$. □
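The first display of Proposition S2.6 is easy to check numerically in one dimension. Taking $f(x) = \sin x$ on $U = (0, \pi)$, the mode is $x^\star = \pi/2$ with $f^\star = 1$ and $H = -f''(x^\star) = 1$, so the integral should behave like $e^{1/\omega}\sqrt{2\pi\omega}$. A minimal sketch:

```python
import numpy as np
from scipy.integrate import quad

# 1-d check of the Laplace expansion: f(x) = sin(x), mode at pi/2, f* = 1, H = 1.
for omega in [0.5, 0.1, 0.02]:
    val, _ = quad(lambda x: np.exp(np.sin(x) / omega), 0.0, np.pi)
    laplace = np.exp(1.0 / omega) * np.sqrt(2 * np.pi * omega)
    print(omega, val / laplace)   # ratio tends to 1 as omega -> 0
```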
Proposition S2.7. For $i \in [I]$ and $\psi_{i,\omega}$ defined in equation (S6), there exists $\eta > 0$ such that
$$ \mathbb{E}_{\psi_{i,\omega}}[X] = x^\star_i + O(e^{-\eta/\omega}), \qquad \mathrm{Cov}_{\psi_{i,\omega}}(X) = \omega H_i^{-1} + O(e^{-2\eta/\omega}). $$

Proof of Proposition S2.7. (i) Expectation. Consider the coordinate functions $h_j(x) = x_j$, for $j \in [d]$. These functions are $C^2$ with $h_j(x^\star_i) = x^\star_{i,j}$ and $\nabla^2 h_j(x^\star_i) \equiv 0$. Applying Proposition S2.6 to $h = h_j$,
$$ \int_{U_i} x_j\, g_\omega(x)\, dx = x^\star_{i,j}\, c_i(\omega) + o(\omega). $$
Next, Proposition S2.6 also implies $\gamma_{i,\omega} = g_\omega(U_i) = c_i(\omega) + o(e^{-\eta/\omega})$. So
$$ \mathbb{E}_{\psi_{i,\omega}}[X_j] = \frac{\int_{U_i} x_j\, g_\omega(x)\, dx}{\gamma_{i,\omega}} = \frac{x^\star_{i,j}\, c_i(\omega) + o(\omega)}{c_i(\omega) + o(e^{-\eta/\omega})}. $$
Since $c_i(\omega) \to c_i > 0$, the denominator is bounded away from $0$ and we can divide: for all $j \in [d]$, $\mathbb{E}_{\psi_{i,\omega}}[X_j] = x^\star_{i,j} + O(e^{-\eta/\omega})$. Therefore $\mathbb{E}_{\psi_{i,\omega}}[X] = x^\star_i + O(e^{-\eta/\omega})$.

(ii) Covariance. Next, consider the centred second-moment functions $h_{jk}(x) := (x_j - x^\star_{i,j})(x_k - x^\star_{i,k})$, for $(j,k) \in [d]^2$. These functions are $C^2$ with $h_{jk}(x^\star_i) = 0$ and
$$ \big[ \nabla^2 h_{jk}(x^\star_i) \big]_{\ell m} = \frac{\partial^2}{\partial x_\ell\, \partial x_m} \big[ (x_j - x^\star_{i,j})(x_k - x^\star_{i,k}) \big] \Big|_{x = x^\star_i} = \delta_{j\ell}\delta_{km} + \delta_{jm}\delta_{k\ell}, $$
where $\delta_{\cdot\cdot}$ is the Kronecker symbol; i.e., $\nabla^2 h_{jk}(x^\star_i)$ is the matrix with entries $(j,k)$ and $(k,j)$ set to one and all other entries set to zero. Hence, by symmetry of $H_i^{-1}$,
$$ \mathrm{Tr}\big( H_i^{-1} \nabla^2 h_{jk}(x^\star_i) \big) = [H_i^{-1}]_{jk} + [H_i^{-1}]_{kj} = 2 [H_i^{-1}]_{jk}. $$
Applying Proposition S2.6 to $h_{jk}$,
$$ \int_{U_i} h_{jk}(x)\, g_\omega(x)\, dx = \omega [H_i^{-1}]_{jk}\, c_i(\omega) + o(\omega). $$
Again, we can divide by $\gamma_{i,\omega} = c_i(\omega) + o(e^{-\eta/\omega})$:
$$ \mathbb{E}_{\psi_{i,\omega}}\big[ (X_j - x^\star_{i,j})(X_k - x^\star_{i,k}) \big] = \frac{\int_{U_i} h_{jk}(x)\, g_\omega(x)\, dx}{\gamma_{i,\omega}} = \omega [H_i^{-1}]_{jk} + O(e^{-\eta/\omega}). $$
Therefore $\mathbb{E}_{\psi_{i,\omega}}[(X - x^\star_i)(X - x^\star_i)^T] = \omega H_i^{-1} + O(e^{-\eta/\omega})$. Finally,
$$ \mathrm{Cov}_{\psi_{i,\omega}}(X) = \mathbb{E}_{\psi_{i,\omega}}[(X - x^\star_i)(X - x^\star_i)^T] - \big( \mathbb{E}_{\psi_{i,\omega}}[X] - x^\star_i \big)\big( \mathbb{E}_{\psi_{i,\omega}}[X] - x^\star_i \big)^T, $$
where we have already shown that $\mathbb{E}_{\psi_{i,\omega}}[X] - x^\star_i = O(e^{-\eta/\omega})$, so $\mathrm{Cov}_{\psi_{i,\omega}}(X) = \omega H_i^{-1} + O(e^{-2\eta/\omega})$. □

Step 2: Competitor mixture and KL convergence

For $\omega > 0$ and $i \in [I]$, define the Gaussian densities $\tilde q_{i,\omega}(x) = \mathcal{N}(x; x^\star_i, \omega H_i^{-1})$. These Gaussian densities locally approximate the Gibbs measure at each mode. We define the associated mixture with Laplace weights,
$$ \tilde q_\omega(x) = \sum_{i=1}^I c_i\, \tilde q_{i,\omega}(x), $$
where $c_i \propto (\det H_i)^{-1/2}$ and $\sum_i c_i = 1$. We use the truncated mixture $\tilde q^{\mathcal{S}}_\omega := \tilde q_\omega 1_{\mathcal{S}} / \int_{\mathcal{S}} \tilde q_\omega$ as a competitor for $q^\star_\omega$ in $\mathcal{Q}$. In this step, we prove that $\tilde q^{\mathcal{S}}_\omega$ converges in KL to $g_\omega$.

Proposition S2.8. Let $\tilde q^{\mathcal{S}}_\omega := \tilde q_\omega 1_{\mathcal{S}} / \int_{\mathcal{S}} \tilde q_\omega$ be the truncation of $\tilde q_\omega$ on $\mathcal{S}$. Then $\tilde q^{\mathcal{S}}_\omega \in \mathcal{Q}$ and, as $\omega \to 0$, $\mathrm{KL}(\tilde q^{\mathcal{S}}_\omega \| g_\omega) \to 0$.

To prove Proposition S2.8, we derive an intermediate lemma.

Lemma S2.9. Let $\mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) := \int_{\mathcal{S}} \tilde q_\omega \log(\tilde q_\omega / g_\omega)$. As $\omega \to 0$, $\mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) \to 0$.

Proof of Lemma S2.9. We first compute the KL divergence for a single Gaussian $\tilde q_{i,\omega}$ and show that it converges to a finite constant. We then use the disjoint-mode structure to prove that the mixture $\tilde q_\omega$ is asymptotically indistinguishable from $g_\omega$ in KL.

Step 1. Log-ratio for a single component. Fix $i$. On $U_i$ we write
$$ f(x) = f(x^\star_i) - \tfrac12 (x - x^\star_i)^T H_i (x - x^\star_i) + \Delta_i(x), \qquad |\Delta_i(x)| \le C \|x - x^\star_i\|^3. $$
By definition,
$$ g_\omega(x) = \frac{\exp\{f(x)/\omega\}}{Z_\omega}, \qquad \tilde q_{i,\omega}(x) = (2\pi\omega)^{-d/2} (\det H_i)^{1/2} \exp\Big\{ -\tfrac1{2\omega} (x - x^\star_i)^T H_i (x - x^\star_i) \Big\}. $$
From Proposition S2.6,
$$ Z_\omega = e^{f(x^\star_i)/\omega} (2\pi\omega)^{d/2} \sum_{j=1}^I (\det H_j)^{-1/2} \big( 1 + o(1) \big), $$
so, for $x \in U_i$, the terms $f(x^\star_i)/\omega$ cancel and
$$ \log g_\omega(x) = \frac{f(x)}{\omega} - \log Z_\omega = -\frac1{2\omega}(x - x^\star_i)^T H_i (x - x^\star_i) + \frac{\Delta_i(x)}{\omega} - \frac d2 \log(2\pi\omega) - \log\Big( \sum_{j=1}^I (\det H_j)^{-1/2} \Big) + o(1). $$
On the other hand,
$$ \log \tilde q_{i,\omega}(x) = -\frac1{2\omega}(x - x^\star_i)^T H_i (x - x^\star_i) - \frac d2 \log(2\pi\omega) + \frac12 \log\det H_i. $$
Subtracting gives
$$ \log\frac{\tilde q_{i,\omega}(x)}{g_\omega(x)} = -\frac{\Delta_i(x)}{\omega} + \frac12 \log\det H_i + \log\Big( \sum_{j=1}^I (\det H_j)^{-1/2} \Big) + o(1) = -\frac{\Delta_i(x)}{\omega} - \log c_i + o(1), $$
where $c_i = (\det H_i)^{-1/2} / \sum_{j=1}^I (\det H_j)^{-1/2}$. For $X \sim \tilde q_{i,\omega}$, write $X = x^\star_i + \sqrt\omega\, Y$ with $Y \sim \mathcal{N}(0, H_i^{-1})$. The remainder bound implies $|\Delta_i(X)| \le C \|X - x^\star_i\|^3 = C \omega^{3/2} \|Y\|^3$, so $\Delta_i(X)/\omega = O(\sqrt\omega\, \|Y\|^3)$. Since $Y$ has finite moments of all orders, $\mathbb{E}[\Delta_i(X)/\omega] \to 0$ as $\omega \to 0$. Dominated convergence on $U_i$ yields, as $\omega \to 0$,
$$ \mathrm{KL}_{\mathcal{S}}(\tilde q_{i,\omega} \| g_\omega) = \int_{\mathcal{S}} \tilde q_{i,\omega}(x) \log\frac{\tilde q_{i,\omega}(x)}{g_\omega(x)}\, dx \to -\log c_i. $$
Step 2. KL for the mixture $\tilde q_\omega$ near the modes. Recall $\tilde q_\omega(x) = \sum_{i=1}^I c_i\, \tilde q_{i,\omega}(x)$. We show that the limiting constant $-\log c_i$ cancels mode by mode inside the mixture, so that $\mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) \to 0$. Fix $R > 0$ and for each $i$ define the shrinking ball $B_i(R, \omega) := \{x : \|x - x^\star_i\| \le R\sqrt\omega\} \subset U_i$ for all small $\omega$. Two modes $x^\star_i \ne x^\star_j$ are separated, so there exists $\delta > 0$ such that $\|x^\star_i - x^\star_j\| \ge \delta$. For $x \in B_i(R, \omega)$, the triangle inequality gives
$$ \|x - x^\star_j\| \ge \delta - R\sqrt\omega > \delta/2 $$
for sufficiently small $\omega$. Thus, since all covariances are $O(\omega)$, there exists a constant $a > 0$ such that for $j \ne i$ and $x \in B_i(R, \omega)$,
$$ \frac{\tilde q_{j,\omega}(x)}{\tilde q_{i,\omega}(x)} \le e^{-a/\omega}. $$
Therefore, on $B_i(R, \omega)$, $\tilde q_\omega(x) = c_i\, \tilde q_{i,\omega}(x) \big( 1 + O(e^{-a/\omega}) \big)$, so
$$ \log\frac{\tilde q_\omega(x)}{g_\omega(x)} = \log c_i + \log\frac{\tilde q_{i,\omega}(x)}{g_\omega(x)} + o(1), $$
with an $o(1)$ term uniform in $x \in B_i(R, \omega)$ as $\omega \to 0$. Substituting the expansion from Step 1,
$$ \log\frac{\tilde q_{i,\omega}(x)}{g_\omega(x)} = -\frac{\Delta_i(x)}{\omega} - \log c_i + o(1), $$
we obtain, on $B_i(R, \omega)$,
$$ \log\frac{\tilde q_\omega(x)}{g_\omega(x)} = -\frac{\Delta_i(x)}{\omega} + o(1). \tag{S8} $$
Let $X \sim \tilde q_{i,\omega}$. As above, $X = x^\star_i + \sqrt\omega\, Y$ with $Y \sim \mathcal{N}(0, H_i^{-1})$, so $|\Delta_i(X)/\omega| \le C\sqrt\omega\, \|Y\|^3$ and $\mathbb{E}|\Delta_i(X)/\omega| \to 0$. Using (S8) and dominated convergence on $B_i(R, \omega)$ yields
$$ \int_{B_i(R,\omega)} \tilde q_{i,\omega}(x) \log\frac{\tilde q_\omega(x)}{g_\omega(x)}\, dx \to 0 \quad \text{as } \omega \to 0, $$
for each fixed $R$.

Step 3. KL for the mixture $\tilde q_\omega$ in the tails. By Gaussian tail bounds, $\tilde q_{i,\omega}(B_i(R, \omega)^c) \le C_1 e^{-d_1 R^2}$ for some constants $C_1, d_1 > 0$, uniformly in $\omega$. On $B_i(R, \omega)^c \cap U_i$, both $\tilde q_\omega$ and $g_\omega$ are exponentially small in $R^2$ and $|\log(\tilde q_\omega / g_\omega)|$ grows at most polynomially in $\|x - x^\star_i\|$, so
$$ \int_{B_i(R,\omega)^c \cap U_i} \tilde q_{i,\omega}(x) \Big| \log\frac{\tilde q_\omega(x)}{g_\omega(x)} \Big|\, dx \le C_2 e^{-d_2 R^2} $$
for suitable constants $C_2, d_2 > 0$. On $U_0$, Proposition S2.6 gives $g_\omega(U_0) \le C e^{-\eta/\omega}$, and the same Gaussian tail arguments yield $\tilde q_{i,\omega}(U_0) \le C'_1 e^{-a_i/\omega}$, so the contribution of $U_0$ to the KL integral is bounded by $C'_2 e^{-a'_i/\omega}$.

Step 4. Conclusion. Putting everything together and summing over $i$ with weights $c_i$,
$$ \mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) = \sum_{i=1}^I c_i \left[ \int_{B_i(R,\omega)} \tilde q_{i,\omega}(x) \log\frac{\tilde q_\omega(x)}{g_\omega(x)}\, dx + \int_{B_i(R,\omega)^c \cup U_0} \tilde q_{i,\omega}(x) \log\frac{\tilde q_\omega(x)}{g_\omega(x)}\, dx \right]. $$
For each fixed $R$, the first term in brackets tends to $0$ as $\omega \to 0$, by the local analysis above. The second term is bounded in absolute value by $C e^{-cR^2} + C' e^{-a/\omega}$, independently of $\omega$. Given $\varepsilon > 0$, pick $R$ large enough that the tail bound is $< \varepsilon/2$, then let $\omega \to 0$ to make the local terms $< \varepsilon/2$. Thus $\mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) \to 0$. □

Proof of Proposition S2.8. Bounding the Gaussian tails of $\tilde q_\omega$ gives $\beta_\omega := \int_{\mathbb{R}^d \setminus \mathcal{S}} \tilde q_\omega(x)\, dx = O(e^{-c/\omega})$ for some $c > 0$. From Lemma S2.9, $\mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) = o(1)$. Since $\tilde q^{\mathcal{S}}_\omega = \tilde q_\omega / (1 - \beta_\omega)$ on $\mathcal{S}$, we have
$$ \mathrm{KL}(\tilde q^{\mathcal{S}}_\omega \| g_\omega) - \mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) = \Big( \frac1{1 - \beta_\omega} - 1 \Big) \mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) - \log(1 - \beta_\omega) = O(\beta_\omega)\, \mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega) + O(\beta_\omega) = O(e^{-c/\omega}). $$
Therefore $|\mathrm{KL}(\tilde q^{\mathcal{S}}_\omega \| g_\omega) - \mathrm{KL}_{\mathcal{S}}(\tilde q_\omega \| g_\omega)| \to 0$, which proves the proposition. □

Step 3: Conclusion, proof of Theorem 2.1

We now prove all the statements of Theorem 2.1.
Recall the conditionals of $g_\omega$ and $q^\star_\omega$ on $U_i$, $i = 0, \ldots, I$:
$$ \phi_{i,\omega}(x) = \frac{q^\star_\omega(x)\, 1_{U_i}(x)}{\alpha_{i,\omega}}, \qquad \psi_{i,\omega}(x) = \frac{g_\omega(x)\, 1_{U_i}(x)}{\gamma_{i,\omega}}, $$
where $\alpha_{i,\omega} = \int_{U_i} q^\star_\omega(x)\, dx$ and $\gamma_{i,\omega} = \int_{U_i} g_\omega(x)\, dx$, and the competitor mixture $\tilde q^{\mathcal{S}}_\omega = \tilde q_\omega 1_{\mathcal{S}} / \int_{\mathcal{S}} \tilde q_\omega(x)\, dx$, with $\tilde q_\omega(x) = \sum_{i=1}^I c_i\, \mathcal{N}(x; x^\star_i, \omega H_i^{-1})$.

Proof of Theorem 2.1. The principle of the proof is as follows: because the competitor $\tilde q^{\mathcal{S}}_\omega$ of Proposition S2.8 converges in KL to $g_\omega$, the same holds for $q^\star_\omega$, and therefore $q^\star_\omega$ inherits the asymptotics of the Gibbs measure $g_\omega$. Steps 1–2 prove Theorem 2.1 (i), and Step 3 establishes Theorem 2.1 (ii).

Step 1. Convergence of $q^\star_\omega$ in KL. For any density $q$ on $\mathcal{S}$, we have the variational identity $\mathcal{L}_\omega(q) = \omega \log Z_\omega - \omega\, \mathrm{KL}(q \| g_\omega)$. The competitor mixture satisfies $\tilde q^{\mathcal{S}}_\omega \in \mathcal{Q}$. Since $q^\star_\omega$ maximises $\mathcal{L}_\omega$ over $\mathcal{Q}$, we have, as $\omega \to 0$,
$$ \mathrm{KL}(q^\star_\omega \| g_\omega) \le \mathrm{KL}(\tilde q^{\mathcal{S}}_\omega \| g_\omega) \to 0, $$
where the limit follows from Proposition S2.8. Thus the optimal truncated mixture satisfies
$$ \mathrm{KL}(q^\star_\omega \| g_\omega) \to 0. \tag{S9} $$

Step 2. Asymptotic mass in mode neighborhoods. For $i = 0, \ldots, I$, consider $\alpha_{i,\omega}$ and $\gamma_{i,\omega}$ of (S7). Let $T : \mathcal{S} \to \{0, 1, \ldots, I\}$ send $x \in U_j$ to $j$, and let $\alpha_\omega$ and $\gamma_\omega$ be the induced discrete laws. Theorem 4.1 of Kullback and Leibler (1951) gives
$$ \mathrm{KL}(\alpha_\omega \| \gamma_\omega) \le \mathrm{KL}(q^\star_\omega \| g_\omega) \to 0. $$
From Proposition S2.6, $\gamma_{i,\omega} \to c_i$ for $i \in [I]$, and $\gamma_{0,\omega} \to 0$. Hence (S9) implies $\alpha_{i,\omega} \to c_i$ for $i \in [I]$ and $\alpha_{0,\omega} \to 0$.

Step 3. Local conditional densities. For $i \in [I]$, consider the conditional densities $\phi_{i,\omega}$ and $\psi_{i,\omega}$ from (S6). The chain rule for the KL divergence over the partition $\{U_0, U_1, \ldots, U_I\}$ gives
$$ \mathrm{KL}(q^\star_\omega \| g_\omega) = \mathrm{KL}(\alpha_\omega \| \gamma_\omega) + \sum_{i=0}^I \alpha_{i,\omega}\, \mathrm{KL}(\phi_{i,\omega} \| \psi_{i,\omega}). $$
Using (S9) and $\alpha_{i,\omega} \to c_i > 0$, we obtain, for each $i \in [I]$, $\mathrm{KL}(\phi_{i,\omega} \| \psi_{i,\omega}) \to 0$. Furthermore, since all densities live on the bounded set $U_i$, Pinsker's inequality implies convergence in total variation. From Proposition S2.7, the conditional law $\psi_{i,\omega}$ has mean $x^\star_i + o(1)$ and covariance $o(1)$. Hence the KL convergence forces the same limits for the mixture's local conditional moments:
$$ \mathbb{E}_{\phi_{i,\omega}}[X] = x^\star_i + o(1), \qquad \mathrm{Cov}_{\phi_{i,\omega}}(X) = o(1). \qquad \square $$

S2.3 Proof of Theorem 3.1

In this section, we prove Theorem 3.1, establishing fixed-temperature asymptotics for the empirical maximisers as the sample size grows to infinity. For convenience, we recall the assumptions of Theorem 3.1.

Assumption S2.10. Define the following set of assumptions:

(B1) (Data) $X_1, \ldots, X_N \overset{\mathrm{i.i.d.}}{\sim} p$, supported on a bounded $\mathcal{S} \subset \mathbb{R}^d$, with $p$ bounded and continuous on $\mathcal{S}$.
(B2) (Unicity of the population maximiser) $\mathcal{L}_{\omega_0}$ admits a unique maximiser $\Lambda^\star_{\omega_0} \in \mathrm{int}(\Theta_K)$.
(B3) (Nondegeneracy of the population maximiser) At the maximiser $\Lambda^\star_{\omega_0}$, $H^\star_{\omega_0} := H_{\omega_0}(\Lambda^\star_{\omega_0})$ is positive definite.

Each of the following subsections proves one of the three statements of Theorem 3.1:

(i) Under (B1), the population objective $\mathcal{L}_{\omega_0}$ attains its maximum on $\Theta_K$.
(ii) Under (B1)–(B2), any empirical maximiser $\widehat\Lambda_{\omega_0,N} \in \arg\max \widehat{\mathcal{L}}_{\omega_0,N}$ satisfies $\widehat\Lambda_{\omega_0,N} \overset{P}{\to} \Lambda^\star_{\omega_0}$.
(iii) Under (B1)–(B3), any empirical maximiser $\widehat\Lambda_{\omega_0,N} \in \arg\max \widehat{\mathcal{L}}_{\omega_0,N}$ satisfies
$$ \sqrt N \big( \widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0} \big) \overset{D}{\longrightarrow} \mathcal{N}(0, W_{\omega_0}), \qquad W_{\omega_0} := (H^\star_{\omega_0})^{-1} V_{\omega_0} (H^\star_{\omega_0})^{-1}, \quad V_{\omega_0} := \mathrm{Var}\big[ \nabla_\Lambda q_\Lambda(X) \big]\big|_{\Lambda = \Lambda^\star_{\omega_0}}. $$
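In practice, the limiting covariance $W_{\omega_0}$ is estimated by a plug-in "sandwich" formula: $V_{\omega_0}$ is replaced by the sample covariance of the per-observation scores $\nabla_\Lambda q_\Lambda(X_i)$ at the fitted parameter, and $H^\star_{\omega_0}$ by an estimate of the negative Hessian of the objective at the fit. A minimal sketch with hypothetical inputs (the score matrix and Hessian estimate are assumed to come from the fitting routine):

```python
import numpy as np

def sandwich(scores, H_hat):
    """Plug-in sandwich estimate of W = H^{-1} V H^{-1}.
    scores: (N, p) array of per-observation gradients at the fitted Lambda;
    H_hat:  (p, p) positive-definite estimate of H*_{omega_0}."""
    V_hat = np.cov(scores, rowvar=False)
    H_inv = np.linalg.inv(H_hat)
    return H_inv @ V_hat @ H_inv   # estimates N * Cov(Lambda_hat)

# Toy usage with made-up inputs of the right shapes:
rng = np.random.default_rng(0)
print(sandwich(rng.normal(size=(500, 3)), np.eye(3)))
```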
S2.3.1 Existence of a population maximiser

Proof of (i). On the compact $\Theta_K$, $(\Lambda, x) \mapsto q_\Lambda(x)$ is continuous and uniformly bounded on the bounded set $\mathcal{S}$. For each $x \in \mathcal{S}$, $\Lambda \mapsto q_\Lambda(x)$ is continuous, so by dominated convergence (bounded envelope on $\mathcal{S}$), $\Lambda \mapsto \mathbb{E}[q_\Lambda(X)]$ is continuous. The entropy term $\Lambda \mapsto H_{\mathcal{S}}(q_\Lambda)$ is continuous on $\Theta_K$ as well. Thus, by Weierstrass's theorem, $\mathcal{L}_{\omega_0}$ is continuous on the compact $\Theta_K$ and attains its maximum. □

S2.3.2 Consistency of the empirical maximisers

The proof consists of two steps: a uniform law of large numbers, due to the Glivenko–Cantelli (GC) property of the Gaussian mixture family on $\Theta_K$ (see App. S2.1, Cor. S2.4), and an application of the argmax theorem. For more details on the GC property, we refer to Section S2.1.

Proof of (ii). The class $\{q_\Lambda : \Lambda \in \Theta_K\}$ is Glivenko–Cantelli with an integrable envelope on $\mathcal{S}$, so as $N \to \infty$,
$$ \sup_{\Lambda \in \Theta_K} \Big| \frac1N \sum_{i=1}^N q_\Lambda(X_i) - \mathbb{E}_p[q_\Lambda(X)] \Big| \overset{P}{\longrightarrow} 0. $$
Adding the deterministic term $\omega_0 H_{\mathcal{S}}(q_\Lambda)$ yields $\sup_{\Lambda \in \Theta_K} |\widehat{\mathcal{L}}_{\omega_0,N}(\Lambda) - \mathcal{L}_{\omega_0}(\Lambda)| \overset{P}{\to} 0$. Since $\mathcal{L}_{\omega_0}$ has a unique maximiser $\Lambda^\star_{\omega_0}$ (Assumption (B2)), the argmax theorem (van der Vaart, 1998, Thm. 5.7) yields $\widehat\Lambda_{\omega_0,N} \overset{P}{\to} \Lambda^\star_{\omega_0}$ as $N \to \infty$. □

S2.3.3 Asymptotic normality of the empirical maximisers

The asymptotic normality result is derived from a Taylor expansion and the Donsker property of the Gaussian mixture family and of its gradients; Lemma S2.5 provides these properties.

Proof of (iii). By Lemma S2.5, $\mathcal{Q}$ is GC, so $\sup_{\Lambda \in \Theta_K} |\widehat{\mathcal{L}}_{\omega_0,N}(\Lambda) - \mathcal{L}_{\omega_0}(\Lambda)| \overset{P}{\to} 0$; hence $\widehat\Lambda_{\omega_0,N} \overset{P}{\to} \Lambda^\star_{\omega_0}$ by the standard M-estimator consistency theorem (van der Vaart, 1998, Thm. 5.7). Write
$$ \nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda) = \frac1N \sum_{i=1}^N \nabla_\Lambda q_\Lambda(X_i) + \omega_0 \nabla_\Lambda H_{\mathcal{S}}(q_\Lambda), \qquad \nabla_\Lambda \mathcal{L}_{\omega_0}(\Lambda) = \mathbb{E}[\nabla_\Lambda q_\Lambda(X)] + \omega_0 \nabla_\Lambda H_{\mathcal{S}}(q_\Lambda). $$
Thus $\nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda^\star_{\omega_0}) - \nabla_\Lambda \mathcal{L}_{\omega_0}(\Lambda^\star_{\omega_0}) = (P_N - P)\, \nabla_\Lambda q_\Lambda(X) \big|_{\Lambda = \Lambda^\star_{\omega_0}}$ has mean zero and variance $V_{\omega_0}/N$. Furthermore, a Taylor expansion around $\Lambda^\star_{\omega_0}$ gives
$$ 0 = \nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\widehat\Lambda_{\omega_0,N}) = \nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda^\star_{\omega_0}) - H^\star_{\omega_0}\big( \widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0} \big) + r_N, $$
where, by the integral form of Taylor's theorem,
$$ r_N = \tfrac12 \big( \widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0} \big)^T \Big[ \nabla^3_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}\big( \Lambda^\star_{\omega_0} + \tau (\widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0}) \big) \Big] \big( \widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0} \big) $$
for some $\tau \in (0, 1)$. Since $\Theta_K$ is compact and $\mathcal{S}$ (the support of $P$) is bounded, the parametric map $\Lambda \mapsto q_\Lambda(x)$ is $C^\infty$ and all third derivatives are uniformly bounded on $\mathcal{S} \times \Theta_K$. Hence $\nabla^3_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}$ is uniformly bounded, giving $|r_N| \le C \|\widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0}\|^2 = o_P(\|\widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0}\|)$. By the CLT for the $P$-Donsker gradient class (Lemma S2.5), $\sqrt N\, \nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda^\star_{\omega_0}) \overset{D}{\to} \mathcal{N}(0, V_{\omega_0})$. Since $H^\star_{\omega_0}$ is positive definite, rearranging yields the usual Z-estimator linearisation (van der Vaart, 1998, Thm. 5.41):
$$ \sqrt N \big( \widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0} \big) = (H^\star_{\omega_0})^{-1} \sqrt N\, \nabla_\Lambda \widehat{\mathcal{L}}_{\omega_0,N}(\Lambda^\star_{\omega_0}) + o_P(1), $$
and therefore $\sqrt N (\widehat\Lambda_{\omega_0,N} - \Lambda^\star_{\omega_0}) \overset{D}{\to} \mathcal{N}\big( 0, (H^\star_{\omega_0})^{-1} V_{\omega_0} (H^\star_{\omega_0})^{-1} \big)$ as $N \to \infty$. □

S2.4 Proofs for Sections 5 and S1.6

S2.4.1 Preliminaries: Bounded-Lipschitz metric

Let $(\mathcal{S}, d)$ be a metric space. For $f : \mathcal{S} \to \mathbb{R}$, define
$$ \|f\|_\infty = \sup_{x \in \mathcal{S}} |f(x)|, \qquad \mathrm{Lip}(f) = \sup_{x \ne y} \frac{|f(x) - f(y)|}{d(x, y)}. $$
For two measures $\nu, \nu'$ on $\mathcal{S}$, the bounded-Lipschitz (BL) metric is
$$ d_{\mathrm{BL}}(\nu, \nu') := \sup_{\|f\|_{\mathrm{BL}} \le 1} \Big| \int f\, d\nu - \int f\, d\nu' \Big|, \qquad \|f\|_{\mathrm{BL}} := \|f\|_\infty + \mathrm{Lip}(f). $$
Convergence in $d_{\mathrm{BL}}$ implies weak convergence (Dudley, 2002), denoted by $\overset{w}{\to}$. We use it to state bootstrap results compactly.

S2.4.2 Technical lemmas

We state the following lemmas, with proofs in Section S2.4.3.

Lemma S2.11 (Regularity for Gaussian mixtures). For a bounded space $\mathcal{S}$ and on the compact $\Theta_K$, there exists a neighborhood $\mathcal{V} \subset \Theta_K$ of $\Lambda^\star_{\omega_0}$ such that (i) $\Lambda \mapsto q_\Lambda(x)$ is $C^2$ on $\mathcal{V}$ for every $x \in \mathcal{S}$; (ii) $\sup_{\Lambda \in \mathcal{V}} \|\nabla_\Lambda q_\Lambda(x)\| \le F_1(x)$ for a bounded-Lipschitz function $F_1$ on $\mathcal{S}$; (iii) $\sup_{\Lambda \in \mathcal{V}} \|\nabla^2_\Lambda q_\Lambda(x)\| \le F_2(x)$ for a bounded-Lipschitz function $F_2$ on $\mathcal{S}$.

Lemma S2.12 (Empirical-process CLT and bootstrap). Let $\mathcal{Q}^{(u)}_\nabla := \{ x \mapsto [\nabla_\Lambda q_\Lambda(x)]_u : \Lambda \in \Theta_K \}$ and $\mathcal{Q}^{(u,v)}_{\nabla^2} := \{ x \mapsto [\nabla^2_\Lambda q_\Lambda(x)]_{uv} : \Lambda \in \Theta_K \}$ for $u, v = 1, \ldots, \dim(\Theta_K)$, and define the function class
$$ \mathcal{F} = \bigcup_{u=1}^{\dim(\Theta_K)} \mathcal{Q}^{(u)}_\nabla \ \cup \bigcup_{u,v=1}^{\dim(\Theta_K)} \mathcal{Q}^{(u,v)}_{\nabla^2}. $$
We have, as $N \to \infty$,
$$ \sqrt N (P_N - P) \overset{w}{\longrightarrow} \mathbb{G}_P \ \text{in } \ell^\infty(\mathcal{F}), \qquad \sqrt N (P^*_N - P_N) \overset{w}{\longrightarrow} \mathbb{G}_P \ \text{in } \ell^\infty(\mathcal{F}), $$
the latter conditionally on $X_{1:N}$, in $P$-probability, where $\mathbb{G}_P$ denotes the $P$-Brownian bridge indexed by $\mathcal{F}$.

Lemma S2.13 (Hadamard differentiability of the argmax map). Let $\mathcal{V} \subset \Theta_K$ be the set on which Lemma S2.11 holds and $\mathcal{F}$ the function class of Lemma S2.12. Let $\ell^\infty_{\mathrm{lin}}(\mathcal{F}) := \{ h \in \ell^\infty(\mathcal{F}) : h \text{ is linear} \}$. For $h \in \ell^\infty_{\mathrm{lin}}(\mathcal{F})$, write $h(\nabla_\Lambda q_\Lambda) := (h([\nabla_\Lambda q_\Lambda]_u))_{1 \le u \le \dim(\Theta_K)}$ and $h(\nabla^2_\Lambda q_\Lambda) := (h([\nabla^2_\Lambda q_\Lambda]_{uv}))_{1 \le u, v \le \dim(\Theta_K)}$. Define $S : \ell^\infty_{\mathrm{lin}}(\mathcal{F}) \times \mathcal{V} \to \mathbb{R}^{\dim(\Theta_K)}$ by
$$ S(G, \Lambda) = G(\nabla_\Lambda q_\Lambda) + \omega_0 \nabla_\Lambda H_{\mathcal{S}}(q_\Lambda). $$
Under Assumption S2.10, we have the following:

(i) There exists a neighborhood $\mathcal{U}$ of $P$ such that for every $G \in \mathcal{U}$, the equation in $\Lambda$, $S(G, \Lambda) = 0$, admits a unique solution $\Lambda(G) \in \mathcal{V}$.
(ii) The map $\Phi : G \mapsto \Lambda(G)$ is Hadamard differentiable at $G = P$, tangentially to $\ell^\infty_{\mathrm{lin}}(\mathcal{F})$, with $D\Phi_P(h) = (H^\star_{\omega_0})^{-1} h(\nabla_\Lambda q_{\Lambda^\star_{\omega_0}})$.

Lemma S2.14 (Measurability of the matching functional). $\Lambda \mapsto M(\Lambda)$ is Borel-measurable on $\Theta_K$. In fact, $M$ is continuous on each piece of a finite Borel partition of $\Theta_K$.

Lemma S2.15 (Continuity of the stability indicator $F_j$). Fix $j \in [K_0]$ and let $U_j = B(u_j, r_j)$ be an open ball. Assume (by shrinking $r_j$ if needed) that $E_j M(\Lambda^\star_{\omega_0})$ lies at positive distance from $\partial U_j$. Then, under Assumption S2.10, for any deterministic measurable selector $\Phi$ from the argmax, $F_j(G) = 1\{E_j M(\Phi(G)) \in U_j\}$ is continuous at $G = P$ with respect to weak convergence: if $G_n \overset{w}{\to} P$, then $F_j(G_n) \to F_j(P)$.

S2.4.3 Proofs of lemmas

Sketch of proof of Lemma S2.11. Gaussian mixture densities are $C^\infty$ in the parameters $\Lambda$ as long as the covariances remain positive definite within a compact subset, which is guaranteed in $\Theta_K$. This gives (i). The first- and second-order derivatives of $q_\Lambda(x)$ with respect to $\Lambda$ are products of polynomials in $x$ and Gaussian densities, which are smooth in $x$, hence Lipschitz on the bounded set $\mathcal{S}$. Since $\mathcal{S}$ and $\Theta_K$ are compact, they are also uniformly bounded. Typically, there exist $C_1, C_2 > 0$ such that
$$ F_1(x) = C_1(1 + \|x\| + \|x\|^2), \qquad F_2(x) = C_2(1 + \|x\| + \|x\|^2 + \|x\|^3 + \|x\|^4), $$
which are bounded-Lipschitz on $\mathcal{S}$, so (ii) and (iii) hold. □
Proof of Lemma S2.12. By Lemma S2.5, each class $\mathcal{Q}^{(u)}_\nabla$ and $\mathcal{Q}^{(u,v)}_{\nabla^2}$ is $P$-Donsker and admits a measurable envelope in $L^2(P)$. Since $\mathcal{F}$ is a finite union of $P$-Donsker classes, it is also $P$-Donsker (van der Vaart and Wellner, 1996, Example 2.10.7). Hence it is also Glivenko–Cantelli and
$$ \sqrt N (P_N - P) \overset{w}{\longrightarrow} \mathbb{G}_P \quad \text{in } \ell^\infty(\mathcal{F}), $$
where $\mathbb{G}_P$ is a $P$-Brownian bridge indexed by $\mathcal{F}$. Furthermore, the nonparametric bootstrap empirical process satisfies the conditional functional CLT (van der Vaart and Wellner, 1996, Thm. 3.6.3): $\sqrt N (P^*_N - P_N) \overset{w}{\to} \mathbb{G}_P$ in $\ell^\infty(\mathcal{F})$, conditionally on $X_{1:N}$, in $P$-probability. □

Proof of Lemma S2.13. (i) By Lemma S2.11, for all $x \in \mathcal{S}$, the map $\Lambda \mapsto \nabla_\Lambda q_\Lambda(x)$ is $C^1$ on $\mathcal{V}$, with $\sup_{\Lambda \in \mathcal{V}} \|\nabla_\Lambda q_\Lambda(x)\| \le F_1(x)$ and $\sup_{\Lambda \in \mathcal{V}} \|\nabla^2_\Lambda q_\Lambda(x)\| \le F_2(x)$ for bounded functions $F_1$ and $F_2$. Therefore, for any $G \in \ell^\infty(\mathcal{F})$, the maps $\Lambda \mapsto G([\nabla_\Lambda q_\Lambda]_u)$, $u = 1, \ldots, \dim(\Theta_K)$, have derivatives $(G([\nabla^2_\Lambda q_\Lambda]_{u1}), \ldots, G([\nabla^2_\Lambda q_\Lambda]_{u \dim(\Theta_K)}))$ and are $C^1$ on $\mathcal{V}$. This implies that $S(G, \Lambda)$ is $C^1$ in $\Lambda$, and
$$ S(P, \Lambda^\star_{\omega_0}) = 0, \qquad \partial_\Lambda S(P, \Lambda^\star_{\omega_0}) = \nabla^2_\Lambda \mathcal{L}_{\omega_0}(\Lambda^\star_{\omega_0}) = A \prec 0, $$
where $A := -H^\star_{\omega_0}$ is invertible by Assumption S2.10. Furthermore, since $\partial_\Lambda S(P, \Lambda)$ is continuous in $\Lambda$, there is $r > 0$ such that on the closed ball $\bar B(\Lambda^\star_{\omega_0}, r) \subset \mathcal{V}$,
$$ \sup_{\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)} \|\partial_\Lambda S(P, \Lambda) - A\| \le \frac{1}{3 \|A^{-1}\|}. \tag{S10} $$
Now define, for $G \in \ell^\infty(\mathcal{F})$ and $\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)$,
$$ T_G(\Lambda) := \Lambda - A^{-1} S(G, \Lambda). $$
To prove (i), we apply the Banach fixed-point theorem (Zeidler, 1995, Th. 1.A) to $T_G$, for all $G$ in some neighborhood of $P$, on the set $\bar B(\Lambda^\star_{\omega_0}, r)$. For this, we need to verify a self-mapping property and a contraction property, which hold on a sufficiently small neighborhood of $P$: we consider the neighborhood $\mathcal{T}$ formed by the $G$ such that
$$ \|G - P\|_{\mathcal{F}} \le \min\Big( \frac{1}{3 \|A^{-1}\|},\ \frac{r}{3 \|A^{-1}\|} \Big). $$

Contraction property. $S(G, \Lambda)$ is $C^1$ on $\mathcal{V}$, so for any $\Lambda_1$ and $\Lambda_2$ in $\bar B(\Lambda^\star_{\omega_0}, r)$, the mean-value theorem gives $S(G, \Lambda_1) - S(G, \Lambda_2) = \partial_\Lambda S(G, \tilde\Lambda)(\Lambda_1 - \Lambda_2)$ for some $\tilde\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)$ on the segment between $\Lambda_1$ and $\Lambda_2$. Therefore
$$ T_G(\Lambda_1) - T_G(\Lambda_2) = \big( I - A^{-1} \partial_\Lambda S(G, \tilde\Lambda) \big)(\Lambda_1 - \Lambda_2), $$
which means
$$ \|T_G(\Lambda_1) - T_G(\Lambda_2)\| \le \sup_{\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)} \|I - A^{-1} \partial_\Lambda S(G, \Lambda)\| \times \|\Lambda_1 - \Lambda_2\|. \tag{S11} $$
Furthermore,
$$ \|I - A^{-1} \partial_\Lambda S(G, \Lambda)\| \le \|A^{-1}\| \big( \|A - \partial_\Lambda S(P, \Lambda)\| + \|\partial_\Lambda S(G, \Lambda) - \partial_\Lambda S(P, \Lambda)\| \big). $$
Then, by definition of $\|\cdot\|_{\mathcal{F}}$, for any $\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)$,
$$ \|\partial_\Lambda S(G, \Lambda) - \partial_\Lambda S(P, \Lambda)\| = \|(G - P)(\nabla^2_\Lambda q_\Lambda)\| \le \|G - P\|_{\mathcal{F}}. $$
For all $G \in \mathcal{T}$, we have $\|G - P\|_{\mathcal{F}} \le 1/(3\|A^{-1}\|)$, so $\|A^{-1}\| \|\partial_\Lambda S(G, \Lambda) - \partial_\Lambda S(P, \Lambda)\| \le 1/3$. Combining this with (S10), we have $\sup_{\Lambda \in \bar B(\Lambda^\star_{\omega_0}, r)} \|I - A^{-1} \partial_\Lambda S(G, \Lambda)\| \le 2/3$, so in (S11),
$$ \|T_G(\Lambda_1) - T_G(\Lambda_2)\| \le \tfrac23 \|\Lambda_1 - \Lambda_2\|, \tag{S12} $$
which proves contraction for $G \in \mathcal{T}$.

Self-mapping property. Since $S(P, \Lambda^\star_{\omega_0}) = 0$,
$$ T_G(\Lambda^\star_{\omega_0}) - \Lambda^\star_{\omega_0} = -A^{-1} S(G, \Lambda^\star_{\omega_0}) = -A^{-1}(G - P)(\nabla_\Lambda q_{\Lambda^\star_{\omega_0}}). $$
By definition of $\|\cdot\|_{\mathcal{F}}$, $\|(G - P)(\nabla_\Lambda q_{\Lambda^\star_{\omega_0}})\| \le \|G - P\|_{\mathcal{F}}$.
Hence, for G ∈ T,
    ∥T_G(Λ⋆_ω0) − Λ⋆_ω0∥ ≤ ∥A⁻¹∥ ∥G − P∥_F ≤ r/3.
Thus, using the contraction property (S12), for any Λ ∈ B(Λ⋆_ω0, r),
    ∥T_G(Λ) − Λ⋆_ω0∥ ≤ ∥T_G(Λ) − T_G(Λ⋆_ω0)∥ + ∥T_G(Λ⋆_ω0) − Λ⋆_ω0∥ ≤ 2r/3 + r/3 ≤ r,
and this ensures that T_G(B(Λ⋆_ω0, r)) ⊂ B(Λ⋆_ω0, r), the self-mapping property for G ∈ T.

Conclusion. Finally, by the Banach fixed-point theorem applied to T_G, for every G in the neighborhood T of P there exists a unique fixed point Λ(G) ∈ B(Λ⋆_ω0, r) ⊂ V; since Λ(G) = T_G(Λ(G)) is equivalent to S(G, Λ(G)) = 0, this gives (i).

(ii) Let t_n → 0 and h_n → h in ℓ∞_lin(F) under ∥·∥_F. Set G_n := P + t_n h_n and Λ_n := Φ(G_n) = Λ(G_n). Our objective is to show that
    DΦ_P(h) := lim_{n→∞} (Φ(G_n) − Φ(P))/t_n = (H⋆_ω0)⁻¹ h(∇_Λ q_{Λ⋆_ω0}).
First, the sequence (h_n) is bounded in ∥·∥_F, so ∥G_n − P∥_F = |t_n| ∥h_n∥_F → 0; hence G_n ∈ T and Λ_n ∈ V, and by (i) we have 0 = S(G_n, Λ_n). Furthermore, since we also have S(P, Λ⋆_ω0) = 0,
    0 = [S(G_n, Λ_n) − S(G_n, Λ⋆_ω0)] + [S(G_n, Λ⋆_ω0) − S(P, Λ⋆_ω0)].
By linearity, the second difference is S(G_n, Λ⋆_ω0) − S(P, Λ⋆_ω0) = t_n h_n(∇_Λ q_{Λ⋆_ω0}). For the first term, since S(G, Λ) is C¹ in Λ, the mean-value theorem gives some Λ̃_n on the segment between Λ⋆_ω0 and Λ_n such that S(G_n, Λ_n) − S(G_n, Λ⋆_ω0) = ∂_Λ S(G_n, Λ̃_n)(Λ_n − Λ⋆_ω0). Thus we have the relation
    0 = ∂_Λ S(G_n, Λ̃_n)(Λ_n − Λ⋆_ω0) + t_n h_n(∇_Λ q_{Λ⋆_ω0}).    (S13)
On the other hand, the fixed-point and contraction properties of T_G for all G ∈ T, proven in (i), imply that Λ_n → Λ⋆_ω0. Indeed,
    ∥Λ_n − Λ⋆_ω0∥ = ∥T_{G_n}(Λ_n) − T_P(Λ⋆_ω0)∥
                  ≤ ∥T_{G_n}(Λ_n) − T_{G_n}(Λ⋆_ω0)∥ + ∥T_{G_n}(Λ⋆_ω0) − T_P(Λ⋆_ω0)∥
                  ≤ (2/3) ∥Λ_n − Λ⋆_ω0∥ + ∥T_{G_n}(Λ⋆_ω0) − T_P(Λ⋆_ω0)∥,
hence
    ∥Λ_n − Λ⋆_ω0∥ ≤ 3 ∥T_{G_n}(Λ⋆_ω0) − T_P(Λ⋆_ω0)∥ ≤ 3 ∥A⁻¹∥ ∥(G_n − P)(∇_Λ q_{Λ⋆_ω0})∥ ≤ 3 ∥A⁻¹∥ ∥G_n − P∥_F → 0.
This also implies Λ̃_n → Λ⋆_ω0. Furthermore, we have
    ∥∂_Λ S(G_n, Λ̃_n) − A∥ ≤ ∥∂_Λ S(G_n, Λ̃_n) − ∂_Λ S(P, Λ̃_n)∥ + ∥∂_Λ S(P, Λ̃_n) − A∥ → 0,
because the first term is ∥(G_n − P)(∇²_Λ q_{Λ̃_n})∥ ≤ sup_{Λ∈V} ∥(G_n − P)(∇²_Λ q_Λ)∥ ≤ ∥G_n − P∥_F → 0, and the second term goes to 0 by continuity of ∂_Λ S(P, Λ) in Λ at Λ⋆_ω0. This means
    ∂_Λ S(G_n, Λ̃_n) → A = −H⋆_ω0,    (S14)
so there is some N such that n ≥ N implies ∂_Λ S(G_n, Λ̃_n) is invertible, and (S13) yields
    (Λ_n − Λ⋆_ω0)/t_n = −(∂_Λ S(G_n, Λ̃_n))⁻¹ h_n(∇_Λ q_{Λ⋆_ω0}).    (S15)
Finally, as n → ∞, h_n → h in ∥·∥_F, so by the definition of ∥·∥_F,
    h_n(∇_Λ q_{Λ⋆_ω0}) → h(∇_Λ q_{Λ⋆_ω0}).    (S16)
Combining with (S14) and (S15), we obtain
    (Φ(G_n) − Φ(P))/t_n = (Λ_n − Λ⋆_ω0)/t_n → DΦ_P(h) = (H⋆_ω0)⁻¹ h(∇_Λ q_{Λ⋆_ω0}),
which proves (ii).

Proof of Lemma S2.14. Let Π be the finite set of injections π : {1, ..., K_0} → {1, ..., K} and define
    cost_π(Λ) := Σ_{j=1}^{K_0} ∥µ_{π(j)}(Λ) − u_j∥.
Each cost_π is continuous in Λ (a composition of continuous maps). For π ∈ Π, set the open "uniqueness" region
    U_π := {Λ ∈ Θ_K : cost_π(Λ) < cost_ρ(Λ) ∀ρ ≠ π}.
On U_π the minimiser is uniquely π, hence
    M(Λ) = (µ_{π(1)}(Λ)ᵀ, ..., µ_{π(K_0)}(Λ)ᵀ)ᵀ,
which is continuous there because each µ_k is continuous. Equip Π with the lexicographic order ≺_lex: for π, ρ ∈ Π, write π ≺_lex ρ if there exists j⋆ with π(j) = ρ(j) for all j < j⋆ and π(j⋆) < ρ(j⋆). For each π ∈ Π, define the tie region
    T_π := {Λ ∈ Θ_K : cost_π(Λ) ≤ cost_η(Λ) ∀η ∈ Π and cost_π(Λ) < cost_ρ(Λ) ∀ρ ∈ Π with ρ ≺_lex π}.
Here, each set {cost_π ≤ cost_η} is closed and each {cost_π < cost_ρ} is open, so T_π is Borel. On T_π, the tie-break selects π, hence the same formula for M holds and M is continuous there. Finally,
    Θ_K = (∪_{π∈Π} U_π) ∪ (∪_{π∈Π} T_π)
is a finite Borel partition on whose pieces M is continuous. Therefore M is Borel-measurable on Θ_K.

Proof of Lemma S2.15. By Lemma S2.13, Φ is continuous at P, so G_n →_w P implies Φ(G_n) → Φ(P) = Λ⋆_ω0. By Proposition S2.17, M is C¹ (hence continuous) at Λ⋆_ω0. Furthermore, E_j is linear and the ball-membership map is continuous away from the boundary, e.g. via ψ_j(y) := r_j − ∥y − u_j∥ (continuous), with 1{y ∈ U_j} = 1{ψ_j(y) > 0}. Therefore
    ψ_j(E_j M(Φ(G_n))) → ψ_j(E_j M(Λ⋆_ω0)).
By the positive-margin assumption, the limit point is not 0, so by continuity of the indicator map x ↦ 1{x > 0} away from 0, we have F_j(G_n) → F_j(P).

S2.4.4 Proof of Theorem 5.1

Theorem 5.1 is a direct consequence of Theorem 3.1 (iii) and the following result (Thm. S2.16), which we prove below.

Theorem S2.16 (Bootstrap validity for parameters). Under Assumption S2.10,
    d_BL(Law*(√N (Λ̂*_ω0,N − Λ̂_ω0,N)), Law(√N (Λ̂_ω0,N − Λ⋆_ω0))) →_P 0,
where Law(·) denotes the law under P and Law*(·) the law under P*.

Proof of Theorem S2.16. Consider the argmax functional
    Φ(G) := argmax_{Λ∈Θ_K} {G(q_Λ) + ω_0 H_S(q_Λ)}.
By Lemma S2.13, Φ is Hadamard differentiable at G = P tangentially to ℓ∞_lin(F), with derivative DΦ_P (see the definitions in the lemma statements). By Lemma S2.12, √N (P*_N − P_N) →_w G_P in ℓ∞(F) in P-probability, where F is defined in the lemma statement and G_P ∈ ℓ∞_lin(F). Therefore, by Theorem 3.1 (iii) and the bootstrap functional delta method (van der Vaart and Wellner, 1996, Thm. 3.9.11),
    √N (Λ̂*_ω0,N − Λ̂_ω0,N) = DΦ_P(√N (P*_N − P_N)) + o_{P*}(1) →_D N(0, W⋆_ω0),
conditionally on the data, in P-probability. This implies
    d_BL(Law*(√N (Λ̂*_ω0,N − Λ̂_ω0,N)), Law(√N (Λ̂_ω0,N − Λ⋆_ω0))) →_P 0.

S2.4.5 Proof of Proposition 5.4

Proof of Proposition 5.4. By a mean-value expansion of the score,
    ∇_Λ L̂_ω0,N(Λ̃_ω0,N) − ∇_Λ L̂_ω0,N(Λ̂_ω0,N) = Ĥ_ω0,N(Λ̄)(Λ̃_ω0,N − Λ̂_ω0,N),
with ∇_Λ L̂_ω0,N(Λ̂_ω0,N) = 0, and there exists c > 0 such that eig_min(Ĥ_ω0,N(Λ̄)) ≥ c > 0 with probability tending to 1, by Assumption S2.10 (B3). Hence
    ∥Λ̃_ω0,N − Λ̂_ω0,N∥ ≤ c⁻¹ ∥∇_Λ L̂_ω0,N(Λ̃_ω0,N)∥ = o_P(N^{−1/2}).
Therefore √N (Λ̃_ω0,N − Λ̂_ω0,N) = o_P(1), and Slutsky's theorem yields the claimed Gaussian limit. For the bootstrap, the same argument holds conditionally with starred (*) quantities, giving the stated conclusions by the bootstrap delta method.

S2.4.6 Proof of Theorem 5.3

Theorem 5.3 is obtained by combining Proposition S2.17, Theorem S2.18, and Corollary S2.20, proven below.
Mode-matching map smoothness.

Proposition S2.17 (Mode-matching map smoothness). Under Assumption 5.2, there exists a neighborhood V ⊂ Θ_K ∩ G_{δ/2} of Λ⋆_ω0 such that M is C¹ on V. Its Jacobian at Λ⋆_ω0 is J = ∇_Λ M(Λ⋆_ω0).

Proof of Proposition S2.17. By Assumption 5.2, there exists δ > 0 such that the nearest-component indices are unique at Λ⋆_ω0. By continuity of Λ ↦ ∥µ_k(Λ) − u_j∥, these indices are locally constant, so there exists a neighborhood V ⊂ G_{δ/2} of Λ⋆_ω0 on which the matching is fixed. Write π⋆(j) for the index matched to u_j at Λ⋆_ω0. We have, for all Λ ∈ V,
    M(Λ) = (µ_{π⋆(1)}(Λ)ᵀ, ..., µ_{π⋆(K_0)}(Λ)ᵀ)ᵀ.
Since m(Λ) = (µ_1(Λ)ᵀ, ..., µ_K(Λ)ᵀ)ᵀ is C¹ in a neighborhood of Λ⋆_ω0, the composition above shows that M is C¹ on V. In particular, at Λ⋆_ω0, its Jacobian is the stack of matched component-mean gradients,
    J = ∇_Λ M(Λ⋆_ω0) = [∇_Λ µ_{π⋆(1)}(Λ⋆_ω0); ... ; ∇_Λ µ_{π⋆(K_0)}(Λ⋆_ω0)].

CLT for matched modes.

Theorem S2.18 (CLT and bootstrap for M). Under Assumptions S2.10 and 5.2, as N → ∞:
    √N (M(Λ̂_ω0,N) − M(Λ⋆_ω0)) →_D N(0, C_M),
    √N (M(Λ̂*_ω0,N) − M(Λ̂_ω0,N)) →_D N(0, C_M), conditionally on X_{1:N}, in P-probability,
with C_M = J W_ω0 Jᵀ.

Proof of Theorem S2.18. By Theorem 3.1 (iii), √N (Λ̂_ω0,N − Λ⋆_ω0) →_D N(0, W_ω0). Furthermore, by Proposition S2.17, M is C¹ at Λ⋆_ω0 with Jacobian J, so the functional delta method (van der Vaart, 1998, Thm. 20.8) yields
    √N (M(Λ̂_ω0,N) − M(Λ⋆_ω0)) →_D N(0, J W_ω0 Jᵀ),
which proves the first claim. Finally, by the bootstrap functional delta method (van der Vaart and Wellner, 1996, Thm. 3.9.11),
    √N (M(Λ̂*_ω0,N) − M(Λ̂_ω0,N)) →_D N(0, J W_ω0 Jᵀ)
conditionally on X_{1:N}, in P-probability, which gives the second claim.

Corollary S2.19 (Setwise convergence). As a consequence of Theorem S2.18, for any Borel set B ⊂ R^{dK_0} whose boundary has zero probability under N(0, C_M), as N → ∞:
    P*(M(Λ̂*_ω0,N) ∈ B) − P(M(Λ̂_ω0,N) ∈ B) →_P 0.
In particular, percentile (or studentised) bootstrap confidence regions for each matched mode coordinate are asymptotically valid.

Proof of Corollary S2.19. This is directly implied by Theorem S2.18 and the Portmanteau theorem (van der Vaart, 1998, Lem. 2.2).

Per-mode projections.

Corollary S2.20 (Per-mode projections). As a consequence of Theorem S2.18, for the j-th matched mode M_j := E_j M, with E_j extracting the j-th block, as N → ∞:
    √N (M_j(Λ̂_ω0,N) − M_j(Λ⋆_ω0)) →_D N(0, C_j),
    √N (M_j(Λ̂*_ω0,N) − M_j(Λ̂_ω0,N)) →_D N(0, C_j),
and percentile bootstrap ellipses for M_j are asymptotically valid.

Proof of Corollary S2.20. The asymptotic normality results follow directly from Theorem S2.18 and the linearity of the extractor E_j. The ellipses are then asymptotically valid as a consequence of the setwise convergence given by Corollary S2.19.

S2.4.7 Proof of Proposition S1.13

Proof of Proposition S1.13. By Lemma S2.15, F_j is continuous at P with respect to weak convergence.

(i) Conditional LLN. Given the data, {F_j(P*^(ℓ)_N)}_{ℓ=1}^L are i.i.d. Bernoulli with mean E*[F_j(P*_N)] = τ_j,N and finite variance.
Hence, by the strong law, s_j → τ_j,N almost surely under the bootstrap (a.s.*), and Var*(s_j) = Var*(F_j(P*_N))/L → 0. Hoeffding's inequality yields the tail bound.

(ii) Large-N limits. Since P_N →_w P and F_j is continuous at P, we have F_j(P_N) → F_j(P) in probability, so π_j,N → π_j := F_j(P). For the bootstrap mean, standard nonparametric bootstrap consistency at continuity points gives F_j(P*_N) →_P F_j(P) in P-probability. Because F_j ∈ [0, 1] is bounded, conditional bounded convergence implies E*[F_j(P*_N)] →_P F_j(P) = π_j. Finally, for any L = L_N → ∞ as N → ∞,
    |s_j − π_j| ≤ |s_j − τ_j,N| + |τ_j,N − π_j| →_P 0.
By (i) and (ii), we obtain the last claim, s_j →_P π_j.

S2.5 Proofs for Section S1.2

S2.5.1 Proof of Proposition S1.1

Proof of Proposition S1.1. Write q_S = q/ψ and ε = 1 − ψ. On the one hand, we have
    |E_{q_S}[p] − E_q[p]| = |1/ψ − 1| ∫_S p q ≤ M ε.
On the other hand, we have H_S(q_S) = (1/ψ) H_S(q) + log ψ. Hence,
    H_S(q_S) − H_S(q) = (1/ψ − 1) H_S(q) + log ψ = (ε/ψ) H_S(q) + log(1 − ε).
On a compact parameter set and bounded S, C_q = sup_λ sup_{x∈S} |log q_λ(x)| < ∞ is well defined and |H_S(q)| ≤ C_q ∫_S q = C_q ψ. Therefore,
    |H_S(q_S) − H_S(q)| ≤ C_q ε + |log(1 − ε)|.
Combining the two inequalities, we obtain the pointwise bound. The uniform bound follows by taking suprema over Θ.

S2.5.2 Proof of Theorem S1.2

Before proving Theorem S1.2, we derive the following supporting lemmas.

Lemma S2.21 (Argmax stability). Let Θ be compact and let F, G : Θ → R be continuous functions. Assume F has a maximiser λ⋆ and that for some ρ > 0 the value gap
    m(ρ) := F(λ⋆) − sup_{∥λ−λ⋆∥≥ρ} F(λ) > 0.
If ∥G − F∥_∞ := sup_{λ∈Θ} |G(λ) − F(λ)| ≤ m(ρ)/3, then every maximiser of G lies in B(λ⋆, ρ). If, in addition, λ⋆ is the unique interior maximiser of F, then there exists ρ_0 > 0 such that, whenever the same bound holds for every 0 < ρ ≤ ρ_0, we have argmax_{λ∈Θ} G(λ) = {λ⋆}.

Proof of Lemma S2.21. Fix ρ > 0 with m(ρ) > 0 and set δ := ∥G − F∥_∞. For any λ with ∥λ − λ⋆∥ ≥ ρ,
    G(λ) ≤ F(λ) + δ ≤ F(λ⋆) − m(ρ) + δ,
while G(λ⋆) ≥ F(λ⋆) − δ. If δ < m(ρ)/2 (and in particular under δ ≤ m(ρ)/3), then G(λ⋆) > G(λ) for all ∥λ − λ⋆∥ ≥ ρ, so every maximiser of G lies in B(λ⋆, ρ). If λ⋆ is the unique interior maximiser of F, choose ρ_0 > 0 so that B(λ⋆, ρ_0) contains no other maximiser of F. For any 0 < ρ ≤ ρ_0, the above argument ensures that every maximiser of G lies in B(λ⋆, ρ). Since λ⋆ is the only maximiser of F in this ball, the strict inequality above then implies argmax_{λ∈Θ} G = {λ⋆}.

Lemma S2.22 (Uniform tail). Fix ω > 0. Assume the uniform margin
    dist(µ(λ), ∂S) ≥ M_0 √(eig_max(Σ(λ))),    ∀λ ∈ Θ.
Then the uniform tail bound
    ε_max := sup_{λ∈Θ} (1 − ∫_S q_λ(x) dx) ≤ C e^{−M_0²/2}
holds for some constant C > 0.

Proof of Lemma S2.22. Let λ ∈ Θ, X ~ N(µ, Σ), and r := M_0 √(eig_max(Σ)). The margin condition dist(µ(λ), ∂S) ≥ r implies B(µ, r) ⊂ S, so ε(µ, Σ) = P(X ∉ S) ≤ P(∥X − µ∥ ≥ r). Since Σ ⪯ eig_max(Σ) I, for Z ~ N(0, I) there exists C > 0 such that
    P(∥X − µ∥ ≥ r) ≤ P(∥Z∥ ≥ r/√(eig_max(Σ))) = P(∥Z∥ ≥ M_0) ≤ C e^{−M_0²/2},
by chi-squared tail bounds for ∥Z∥². Thus, ε_max = sup_{λ∈Θ} ε(λ) ≤ C e^{−M_0²/2}.
Proof of Theorem S1.2. By Lemma S2.22, ε_max ≤ C e^{−M_0²/2}. Proposition S1.1 then yields the uniform bound
    sup_{λ∈Θ} |G_ω(λ) − F_ω(λ)| ≤ δ_ω(ε_max).
Fix ρ > 0 and set m(ρ) := F_ω(λ⋆_ω) − sup_{∥λ−λ⋆_ω∥≥ρ} F_ω(λ) > 0. Choose M_0 large enough (hence ε_max small enough) that δ_ω(ε_max) < m(ρ)/3. By Lemma S2.21, every maximiser of G_ω lies in B(λ⋆_ω, ρ). If λ⋆_ω is a unique interior maximiser of F_ω, the argmaxes coincide.

S3 Algorithms and variants

This section details the GERVE update rules for different variational families and provides pseudo-code for implementation. For each case, we:
• Recall the parameterisation and its link to natural gradients.
• Present the specific update rules and their mini-batch version.
• Give pseudo-code for practical use.
• Provide derivation notes for readers interested in the intermediate steps.

S3.1 Natural-gradient update rule

For distributions in an exponential family or MCEF, we can use (5) (Lin et al., 2019; Khan and Rue, 2023), so that the natural-gradient update rule writes:
    λ_{t+1} = λ_t + ρ_t ∇_M L̂_{ω_t,N}(λ)|_{λ=λ_t},
which will be the starting point for all the following derivations.

S3.2 Fixed-covariance Gaussian case

Parameterisation. We consider q_λ(x) = N(x; µ, s⁻¹I) with fixed precision s > 0.
• Natural parameter: λ = sµ;
• Sufficient statistic: x;
• Expectation parameter: µ = E_{q_λ}[X].
Only µ is optimised.

GERVE update rule. Since ∇_µ H_S(q_λ) ≈ ∇_µ H(q_λ) = 0, the fixed-covariance Gaussian GERVE update rule is:
    µ_{t+1} = µ_t + (ρ_t/N) Σ_{i=1}^N (X_i − µ_t) q_{λ_t}(X_i).
This is equivalent to gradient ascent on the kernel density estimate
    p_h(µ) = (1/N) Σ_{i=1}^N φ_h(X_i − µ),
where φ_h is the Gaussian kernel with bandwidth h = s⁻¹. This reduces to the classic Gaussian mean-shift update when the step sizes are chosen adaptively, ρ_t = p_h(µ_t)⁻¹.

Mini-batch implementation. For a batch X^(t)_{1:B} sampled uniformly with replacement from the dataset, i.e. J^(t)_1, ..., J^(t)_B ~iid Unif([N]) and X^(t)_{1:B} := (X_{J^(t)_1}, ..., X_{J^(t)_B}), define:
    ĝ^(t)_B = (s/B) Σ_{i=1}^B (X^(t)_i − µ_t) q_{λ_t}(X^(t)_i).
Conditioned on X_{1:N}, ĝ^(t)_B is an unbiased estimator of ∇_µ L̂_{ω_t,N}(λ)|_{λ=λ_t}. The mini-batch version of the update rule is:
    µ_{t+1} = µ_t + ρ_t s⁻¹ ĝ^(t)_B.

Algorithm. Algorithm 3 presents the pseudo-code corresponding to this implementation of GERVE.

Algorithm 3: Fixed-covariance Gaussian GERVE (with mini-batches)
1  Given samples X_{1:N}.
2  Set T, B, s, λ_1, ρ_{1:T}.
3  for t = 1 : T do
4    Sample J^(t)_{1:B} ~iid Unif([N]); set X^(t)_i ← X_{J^(t)_i} for i = 1 : B.
5    Compute ĝ^(t)_B from X^(t)_{1:B}, λ_t.
6    Update µ_{t+1} = µ_t + ρ_t s⁻¹ ĝ^(t)_B.
7  end
8  return λ_{T+1} (= µ_{T+1}).

Derivation details. From the natural-gradient rule for exponential families with λ = sµ, we have:
    s µ_{t+1} = s µ_t + ρ_t ∇_µ L̂_{ω_t,N}(λ)|_{λ=λ_t}.
The gradient of the empirical objective is:
    ∇_µ L̂_{ω_t,N}(λ) = (s/N) Σ_{i=1}^N (X_i − µ) q_λ(X_i) + ω ∇_µ H_S(q_λ),
which yields the update rule above.
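For concreteness, here is a minimal Python sketch of Algorithm 3 (the function name and default hyperparameters are ours, not from the paper; the entropy gradient is dropped since it is approximately zero in this case):

    import numpy as np

    def gerve_fixed_cov(X, T=2000, B=256, s=10.0, rho=1e-2, seed=0):
        """Sketch of mini-batch fixed-covariance GERVE (Algorithm 3)."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        mu = X[rng.integers(N)].astype(float)   # initialise at a random sample
        norm = (s / (2 * np.pi)) ** (d / 2)     # N(x; mu, s^{-1} I) normalising constant
        for t in range(T):
            batch = X[rng.integers(N, size=B)]  # uniform sampling with replacement
            resid = batch - mu
            q = norm * np.exp(-0.5 * s * np.sum(resid**2, axis=1))  # q_{lambda_t}(X_i)
            g = (s / B) * resid.T @ q           # unbiased gradient estimate g_hat
            mu = mu + rho * g / s               # update: mu + rho * s^{-1} * g_hat
        return mu

With the adaptive step size ρ_t = p_h(µ_t)⁻¹ mentioned above, each step becomes exactly a Gaussian mean-shift iteration.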
S3.3 Gaussian mixture case

Parameterisation. For K components:
    q_Λ(x) = Σ_{k=1}^K π_k q_{λ_k}(x),
where the weights π_1, ..., π_K sum to 1, and each q_{λ_k}(x) is a Gaussian density with natural parameters λ_k.
• MCEF natural parameters: Λ = (v_1, ..., v_{K−1}, λ_1, ..., λ_K), where v_k := log(π_k/π_K);
• MCEF expectation parameters: M = (π_1, ..., π_{K−1}, M_1, ..., M_K), where M_k := (m^(1)_k, m^(2)_k) = (π_k µ_k, π_k (S_k⁻¹ + µ_k µ_kᵀ)).

GERVE update rules. The Gaussian mixture GERVE update rules are:
    S_{k,t+1} = S_{k,t} − (ρ_t/N) S_{k,t} Σ_{i=1}^N ((X_i − µ_{k,t})(X_i − µ_{k,t})ᵀ S_{k,t} − I) q_{λ_{k,t}}(X_i) − ρ_t ω_t ∇_{S_k⁻¹} H_S(q_Λ)|_{Λ=Λ_t},
    µ_{k,t+1} = µ_{k,t} + (ρ_t/N) S_{k,t+1}⁻¹ S_{k,t} Σ_{i=1}^N (X_i − µ_{k,t}) q_{λ_{k,t}}(X_i) + ρ_t ω_t ∇_{µ_k} H_S(q_Λ)|_{Λ=Λ_t},
    v_{k,t+1} = v_{k,t} + (ρ_t/N) Σ_{i=1}^N (q_{λ_{k,t}}(X_i) − q_{λ_{K,t}}(X_i)) + ρ_t ω_t ∇_{π_k} H_S(q_Λ)|_{Λ=Λ_t}.

Mini-batch and Monte Carlo implementation. For a batch X^(t)_{1:B} sampled uniformly with replacement from the dataset, i.e. J^(t)_1, ..., J^(t)_B ~iid Unif([N]) and X^(t)_{1:B} := (X_{J^(t)_1}, ..., X_{J^(t)_B}), define:
    f̂^(t)_{k,B} = (1/B) Σ_{i=1}^B (q_{λ_{k,t}}(X^(t)_i) − q_{λ_{K,t}}(X^(t)_i)),
    ĝ^(t)_{k,B} = (π_{k,t}/B) S_{k,t} Σ_{i=1}^B (X^(t)_i − µ_{k,t}) q_{λ_{k,t}}(X^(t)_i),
    Ĥ^(t)_{k,B} = (π_{k,t}/(2B)) S_{k,t} Σ_{i=1}^B ((X^(t)_i − µ_{k,t})(X^(t)_i − µ_{k,t})ᵀ S_{k,t} − I) q_{λ_{k,t}}(X^(t)_i).
Conditioned on X_{1:N}, ĝ^(t)_{k,B}, Ĥ^(t)_{k,B}, and f̂^(t)_{k,B} are unbiased estimators of their full-data counterparts.

Entropy derivatives must also be estimated. For some size B_e, draw additional Monte Carlo samples Z^(k,t)_{1:B_e} ~ q_{λ_{k,t}} to compute η̂^(t)_{k,B_e}, γ̂^(t)_{k,B_e}, and φ̂^(t)_{k,B_e} (see details below). We obtain the practical updates:
    S_{k,t+1} = S_{k,t} − (2ρ_t/π_{k,t}) (Ĥ^(t)_{k,B} + ω_t η̂^(t)_{k,B_e}),
    µ_{k,t+1} = µ_{k,t} + ρ_t S_{k,t+1}⁻¹ (ĝ^(t)_{k,B} + ω_t γ̂^(t)_{k,B_e}),
    v_{k,t+1} = v_{k,t} + ρ_t (f̂^(t)_{k,B} + ω_t φ̂^(t)_{k,B_e}).

Algorithm. Algorithm 4 depicts our proposed implementation of Gaussian mixture GERVE.

Algorithm 4: Gaussian mixture GERVE (with mini-batches)
1   Given samples X_{1:N}.
2   Set T, B, B_e, K, Λ_1, ρ_{1:T}, ω_{1:T}.
3   for t = 1 : T do
4     Sample J^(t)_{1:B} ~iid Unif([N]); set X^(t)_i ← X_{J^(t)_i} for i = 1 : B.
5     for k = 1 : K do
6       Compute ĝ^(t)_{k,B}, Ĥ^(t)_{k,B} from X^(t)_{1:B}, Λ_t.
7       Sample Z^(k,t)_{1:B_e} ~iid N(µ_{k,t}, S_{k,t}⁻¹).
8       Compute γ̂^(t)_{k,B_e}, η̂^(t)_{k,B_e} from Z^(k,t)_{1:B_e}, Λ_t.
9       Update S_{k,t+1}, µ_{k,t+1}.
10    end
11    for k = 1 : (K − 1) do
12      Compute f̂^(t)_{k,B}, φ̂^(t)_{k,B_e} from X^(t)_{1:B}, Z^(k,t)_{1:B_e}, Z^(K,t)_{1:B_e}, Λ_t.
13      Update v_{k,t+1}.
14    end
15  end
16  return Λ_{T+1}.
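A minimal Python sketch of the mini-batch estimators f̂, ĝ, Ĥ for a single component (our helper, written directly from the definitions above; S_k denotes the precision matrix):

    import numpy as np
    from scipy.stats import multivariate_normal

    def minibatch_estimates(Xb, pi_k, mu_k, S_k, mu_K=None, S_K=None):
        """Sketch of f_hat, g_hat, H_hat for component k on a mini-batch Xb (B, d)."""
        B, d = Xb.shape
        q_k = multivariate_normal.pdf(Xb, mean=mu_k, cov=np.linalg.inv(S_k))
        resid = Xb - mu_k
        g_hat = (pi_k / B) * S_k @ (resid.T @ q_k)         # gradient estimate w.r.t. mu_k
        # H_hat: (pi_k / 2B) S_k * sum_i ((x_i - mu)(x_i - mu)^T S_k - I) q_k(x_i)
        M = np.einsum("bi,bj,b->ij", resid, resid, q_k)    # sum_i q_i (x_i-mu)(x_i-mu)^T
        H_hat = (pi_k / (2 * B)) * S_k @ (M @ S_k - np.sum(q_k) * np.eye(d))
        f_hat = None
        if mu_K is not None:                               # f_hat needs the K-th component
            q_K = multivariate_normal.pdf(Xb, mean=mu_K, cov=np.linalg.inv(S_K))
            f_hat = float(np.mean(q_k - q_K))
        return f_hat, g_hat, H_hat

The entropy estimators η̂, γ̂, φ̂ follow the same pattern, with the mini-batch replaced by Monte Carlo draws Z ~ q_{λ_k} and the summand weighted by log q_Λ(Z) 1_S(Z), as detailed in the derivation notes below.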
Derivation details. The natural-gradient update rules are:
    v_{k,t+1} = v_{k,t} + ρ_t ∇_{π_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t},
    S_{k,t+1} µ_{k,t+1} = S_{k,t} µ_{k,t} + ρ_t ∇_{m^(1)_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t},
    −(1/2) S_{k,t+1} = −(1/2) S_{k,t} + ρ_t ∇_{m^(2)_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t}.
We rearrange them to obtain:
    µ_{k,t+1} = µ_{k,t} + ρ_t S_{k,t+1}⁻¹ (∇_{m^(1)_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t} + 2 (∇_{m^(2)_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t}) µ_{k,t}),
    S_{k,t+1} = S_{k,t} − 2ρ_t ∇_{m^(2)_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t},
    v_{k,t+1} = v_{k,t} + ρ_t ∇_{π_k} L̂_{ω_t,N}(Λ)|_{Λ=Λ_t}.
We apply the chain rule to express the gradients w.r.t. the expectation parameters as gradients w.r.t. µ_k and S_k⁻¹:
    ∇_{m^(1)_k} L̂_{ω_t,N}(Λ) = (1/π_k) (∇_{µ_k} L̂_{ω_t,N}(Λ) − 2 (∇_{S_k⁻¹} L̂_{ω_t,N}(Λ)) µ_k),
    ∇_{m^(2)_k} L̂_{ω_t,N}(Λ) = (1/π_k) ∇_{S_k⁻¹} L̂_{ω_t,N}(Λ).
These gradients are
    ∇_{µ_k} L̂_{ω_t,N}(Λ) = ∇_{µ_k} ((1/N) Σ_{i=1}^N q_Λ(X_i)) + ω_t ∇_{µ_k} H_S(q_Λ)
                        = (π_k/N) S_k Σ_{i=1}^N (X_i − µ_k) q_{λ_k}(X_i) + ω_t ∇_{µ_k} H_S(q_Λ),
and
    ∇_{S_k⁻¹} L̂_{ω_t,N}(Λ) = ∇_{S_k⁻¹} ((1/N) Σ_{i=1}^N q_Λ(X_i)) + ω_t ∇_{S_k⁻¹} H_S(q_Λ)
                          = (π_k/(2N)) S_k Σ_{i=1}^N ((X_i − µ_k)(X_i − µ_k)ᵀ S_k − I) q_{λ_k}(X_i) + ω_t ∇_{S_k⁻¹} H_S(q_Λ).
Given that π_K = 1 − Σ_{k=1}^{K−1} π_k, we also have
    ∇_{π_k} L̂_{ω_t,N}(Λ) = ∇_{π_k} ((1/N) Σ_{i=1}^N q_Λ(X_i)) + ω_t ∇_{π_k} H_S(q_Λ)
                        = (1/N) Σ_{i=1}^N (q_{λ_k}(X_i) − q_{λ_K}(X_i)) + ω_t ∇_{π_k} H_S(q_Λ).
Combining these expressions, we find the Gaussian mixture GERVE update rules.

Entropy derivatives (for Monte Carlo estimation). The entropy derivatives do not admit a closed form, but they admit an integral representation. For the derivatives w.r.t. µ_k and S_k⁻¹, the following expressions can be used:
    ∇_{S_k⁻¹} H_S(q_Λ) = (π_k/2) S_k E_{q_{λ_k}}[((X − µ_k)(X − µ_k)ᵀ S_k − I) log q_Λ(X) 1_S(X)]
and
    ∇_{µ_k} H_S(q_Λ) = π_k S_k E_{q_{λ_k}}[(X − µ_k) log q_Λ(X) 1_S(X)].
Finally, for the derivative w.r.t. π_k, we use:
    ∇_{π_k} H_S(q_Λ) = E_{q_{λ_k}}[log q_Λ(X) 1_S(X)] − E_{q_{λ_K}}[log q_Λ(X) 1_S(X)].

S3.4 Computational considerations

Recall GERVE's complexity for Gaussian mixtures with K components (Sec. 4.4):
• Full covariances: O(d³BKT).
• Diagonal, isotropic, or fixed covariances: O(dBKT).
Simplified families often retain most of the algorithm's qualitative behaviour while improving scalability. This is comparable to related methods:
• Mean-shift mode-seeking. A single run of classical mean-shift with fixed spherical bandwidths and mini-batches costs O(dBT). Running from K initialisations (to target K modes) yields O(dBKT), comparable to fixed-covariance GERVE. However, unlike GERVE, these independent trajectories lack a repulsion mechanism and may fail to explore the full modal structure.
• Mean-shift clustering. Since mean-shift clustering launches one trajectory per data point, the cost is O(dNBT) for spherical bandwidths. This can be substantially more expensive than GERVE, which only launches K trajectories to find the cluster centroids.
• EM for Gaussian mixtures (GMM-EM). The EM algorithm (Dempster et al., 1977; McLachlan and Peel, 2000) with full covariances has cost O((d²B + d³)KT) per run, due to matrix inversions and determinant computations. For diagonal covariances, it reduces to O(dBKT), which is comparable to fixed-covariance GERVE. In contrast, GERVE uses unified natural-gradient steps rather than alternating E- and M-steps, and does not require likelihood maximisation, making it applicable in likelihood-free settings.

S4 Simulations and benchmark

S4.1 Details on the clustering example of Section 6.1

Sample. The dataset consists of N = 6000 points from a 2D Gaussian mixture with three equally weighted components, with means at the nodes of an equilateral triangle, (0, 1), (cos(π/6), −0.5), (−cos(π/6), −0.5), and isotropic covariance σ²I with σ² = 0.25. Figure S2 shows a sample and a kernel density estimate (KDE), revealing J = 3 regions of high density. A minimal sampling sketch is given below.

GERVE hyperparameters. The mixture parameters are initialised with µ_{1,1}, ..., µ_{K,1} ~ Uniform([−2, 0]²), and for all k ∈ [K], Σ_{k,1} = 2I and π_{k,1} = 1/K. Table S1 gives the other hyperparameters used for GERVE in this clustering task.
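The simulation sample described above can be generated as follows (a minimal sketch, ours rather than the authors' script):

    import numpy as np

    # N = 6000 points from three equally weighted Gaussians at the nodes
    # of an equilateral triangle, isotropic covariance sigma^2 I.
    rng = np.random.default_rng(0)
    N, sigma2 = 6000, 0.25
    means = np.array([[0.0, 1.0],
                      [np.cos(np.pi / 6), -0.5],
                      [-np.cos(np.pi / 6), -0.5]])
    labels = rng.integers(3, size=N)                    # equal weights 1/3
    X = means[labels] + np.sqrt(sigma2) * rng.standard_normal((N, 2))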
Additional figures. Figure S3 shows results for K = 3, including an additional comparison with k-means for K = 7.

[Figure S2: Sample of N = 6000 points from a three-component Gaussian mixture whose means are located at the nodes of an equilateral triangle. Background shading: Gaussian KDE with bandwidth h selected by Scott's rule (Scott, 1992), revealing three high-density regions.]

Table S1: Hyperparameters used for GERVE in the clustering simulations.
    Hyperparameter   Value
    T                40000
    B                1000
    K                {3, 7}
    ω_t              50/t^1.1 + 0.004
    ρ_t              10⁻⁴ (50/ω_t)^0.7

[Figure S3: Top row: clustering with K = 3 for GERVE (left) and GMM-EM (right); ellipses represent the covariance matrices of the Gaussian components. Bottom row: clustering with k-means for K = 7 (left) and K = 3 (right).]

S4.2 Mode estimation

Model. We use the same triangle configuration with means at (0, 1), (cos(π/6), −0.5), (−cos(π/6), −0.5) and isotropic covariance σ²I, but with smaller covariance σ² = 0.1. The three means are well separated, so the global modes of the density approximately coincide with the means: u_1 ≈ (0, 1), u_2 ≈ (cos(π/6), −0.5), u_3 ≈ (−cos(π/6), −0.5). For each experiment replicate, we sample N points.

Metrics. Each method is replicated n_rep = 100 times and provides K estimated modes û = (û_1, ..., û_K). We assess mode-estimation performance using the following metrics (a minimal computation sketch is given after the hyperparameter details below):
• Mode recovery (MR_ϵ): number of true modes {u_i, i ∈ [I]} recovered within a strict tolerance ϵ:
    MR_ϵ(û) = Σ_{i=1}^I 1{min_{k∈[K]} ∥û_k − u_i∥₂ < ϵ}.
We set ϵ = 10⁻², a small value relative to σ = √0.1.
• Hungarian matching sum (HM): minimum linear-assignment cost between true and estimated modes,
    HM(û) = min_τ Σ_{i=1}^I ∥û_{τ(i)} − u_i∥₂,
i.e., the minimum total distance over injective maps (K ≥ I) τ: {1, ..., I} → {1, ..., K} (one distinct estimate per true mode). This can be computed via the Hungarian algorithm (Kuhn, 1955).
• Nearest-neighbor sum (NN): aggregate distance from each estimate to the closest true mode,
    NN(û) = Σ_{k=1}^K min_{i∈[I]} ∥û_k − u_i∥₂.

Hyperparameters and grid search. For GERVE, the initial components have equal weights 1/K and covariances σ₁²I, σ₁ > 0. The annealing schedule is ω_t = ω₁/t^β and the step sizes are ρ_t = ρ₁ (ω₁/ω_t)^γ. We used T = 4000 iterations and mini-batches of size B = 1000. The mean-shift algorithm is run with a mini-batch size B = 1000 and uses a bandwidth h_m for its Gaussian kernel. The feature significance method uses a bandwidth h_f to identify the points of significant curvature or gradient. For each method, we performed a grid or line search to optimise hyperparameters with respect to each metric and sample size. Table S2 gives the values used in the grid-search procedure. For Gaussian mean-shift, we use a mini-batch fixed-covariance Gaussian GERVE (updates equivalent to Gaussian mean-shift with bandwidth h) with adaptive step size ρ_t = B (Σ_{b=1}^B q_{λ_t}(X^(t)_b))⁻¹, where X^(t)_{1:B} is the mini-batch sample.
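A minimal sketch of the three metrics (our helper; the Hungarian step uses SciPy's linear_sum_assignment):

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def mode_metrics(u_hat, u_true, eps=1e-2):
        """u_hat: (K, d) estimated modes; u_true: (I, d) true modes, K >= I."""
        D = cdist(u_true, u_hat)                 # (I, K) pairwise distances
        mr = int(np.sum(D.min(axis=1) < eps))    # MR_eps: true modes recovered
        rows, cols = linear_sum_assignment(D)    # Hungarian algorithm (Kuhn, 1955)
        hm = D[rows, cols].sum()                 # HM: optimal injective matching cost
        nn = D.min(axis=0).sum()                 # NN: each estimate to closest truth
        return mr, hm, nn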
Table S2: Grid of hyperparameter values used for GERVE, mean-shift, and the feature significance method in the mode-finding simulations.
    Hyperparameter   Values
    σ₁²              {0.1, 0.2}
    ω₁               {10, 50}
    β                {1.1, 1.3, 1.5}
    ρ₁               {0.01, 0.03, 0.1}
    γ                {0.2, 0.3, 0.4}
    h_m              {0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2}
    h_f              {0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3}

Additional simulation results. Figure S4 shows performance metrics for GERVE and the two baseline methods over a sweep of K ∈ {3, ..., 7}.

S5 Clustering performance on UCI datasets

We assess robustness to misspecification on real data and compare to classic clustering methods. We also provide an ablation study and report run times.

S5.1 Benchmark

Datasets. The three datasets can be found in the UCI Machine Learning Repository (archive.ics.uci.edu).
• Iris: N = 150, d = 4 (sepal length/width, petal length/width), J = 3 species (setosa, versicolor, virginica).
• Wine: N = 178, d = 13 chemical attributes from J = 3 cultivars.
• Pendigits: N ≈ 11000, d = 16 (eight pen-tip (x, y) pairs, normalised), J = 10 classes (digits 0 to 9).

Clustering metrics. We report the Adjusted Rand Index (ARI, pairwise agreement), the Adjusted Mutual Information (AMI, mutual information between estimated and true classes), Purity (per-cluster majority proportion), and Macro-F1 (classwise F1, averaged).

GERVE default setup. GERVE uses diagonal-covariance mixtures with weights fixed to k-means proportions. Means are initialised at k-means centroids, covariances at σ₁²I. We include a brief fixed-covariance burn-in of L_B iterations within a total budget of T before learning covariances. We use overcompletion K = J + 5 to enable component merging. Other hyperparameters are listed in Table S3.

[Figure S4: Mode-estimation performance of GERVE (first row), Feature Significance (second row), and Gaussian mean-shift (third row) with respect to sample size, for the triangle mixture (I = 3 true modes). Curves show, for varying K ∈ {3, 4, 5, 6, 7}, means (MR_ϵ, NN) or medians (HM) over n_rep = 100 replicates. Bands are 95% confidence intervals: means use t-intervals, medians use bootstrap percentiles (5000 samples).]

Table S3: Dataset characteristics and hyperparameter values used for the GERVE default setup.
    Dataset                 Iris      Wine      Pendigits
    Dataset characteristics
    N                       150       178       10992
    d                       4         13        16
    J                       3         3         10
    GERVE default setup
    T                       2000      2000      2000
    K                       8         8         15
    L_B                     100       100       100
    σ₁²                     1         1         1
    ω₁                      0.1       10⁻⁴      10⁻³
    ω_t                     ω₁/t²     ω₁/t²     ω₁/t²
    ρ₁                      10⁶       10¹²      10¹¹
    ρ_t                     ρ₁/t²     ρ₁/t²     ρ₁/t²
    σ²_min                  0.01      0.1       0.01
    σ²_max                  1         1         1
    D_µ                     0.1       0.1       0.2
    D_Σ                     1         1         1

We enforce stability by clipping parameter updates (∥µ_{k,t+1} − µ_{k,t}∥₂ ≤ D_µ, ∥Σ_{k,t+1} − Σ_{k,t}∥_F ≤ D_Σ) and constraining the parameter space via projections (σ²_min I ⪯ Σ_k ⪯ σ²_max I). These constraints have little practical influence on results but improve stability in high dimensions.

Baseline methods. We compare GERVE to mean-shift (with a flat kernel), GMM-EM, and k-means. Our main label-free baselines are implemented using Python's scikit-learn package (a minimal sketch of this setup follows the results discussion below):
• Flat mean-shift (MS-Scott): flat kernel, bandwidth from Scott's rule (Scott, 1992).
We set the maximum number of iterations to 300 (scikit-learn MeanShift's default).
• GMM-BIC: Gaussian mixture model with diagonal covariances, with the number of components selected by BIC at each run. Initial mixture parameters are set using k-means++. We stop early when the average lower-bound gain is below 10⁻³ and set the maximum number of iterations to 100 (scikit-learn GaussianMixture's default).
• k-means (KM-Elbow): number of centroids selected by the elbow method. Initial centroid locations are set using greedy k-means++. We stop early when the average lower-bound gain is below 10⁻⁴ and set the maximum number of iterations to 300 (scikit-learn KMeans's default).

[Figure S5: Clustering metrics (means ± 95% CI) and number of effective clusters (median and [Q1-Q3] quantiles when applicable, or constant across all replicates) for Iris, Wine, and Pendigits over n_rep = 10 runs. Label-free methods are ordered by average rank within each dataset.]

For reference, we also report oracle variants that use ground truth and are not available in practice: GERVE-K (K = J), MS-K (bandwidth chosen by line search so that the number of clusters is closest to J), MS-ARI (bandwidth maximising ARI), GMM-K (K = J), GMM-ARI (K maximising ARI), KM-K (K = J), KM-ARI (K maximising ARI). GMM-BIC selects K by BIC at each run. The -ARI oracle variants are chosen once per dataset by maximising mean ARI across seeds and then held fixed across runs.

Results. Figure S5 shows means and 95% confidence intervals over n_rep = 10 runs for the four metrics, together with the median numbers of effective clusters and their [Q1-Q3] quantiles. GERVE is competitive with label-free baselines while adapting the number of clusters through merging. For reference, a comparison against oracle baselines is provided in Figure S6.

Iris: GERVE matches baselines on ARI/AMI/Macro-F1. It merges two close species, yielding two effective clusters versus J = 3, with a small drop in Purity and a cleaner modal partition.

Wine: GERVE attains strong scores across metrics. k-means edges out the top values on this near-spherical dataset, while GERVE remains robust without tuning K via BIC.

Pendigits: No method reaches J = 10 in this overlapping, high-dimensional setting. GERVE produces the effective cluster count closest to J and leads on all four metrics, indicating consistent recovery of global structure even with class splits.

We emphasise ARI, AMI, and Macro-F1, since Purity can inflate with larger K. In practice, GERVE returns a smaller effective K by merging redundant components. Compared to mean-shift, it scales better in dimension and avoids bandwidth tuning through adaptive covariance learning.
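A minimal scikit-learn sketch of the label-free baselines (the exact initialisation and selection rules here are assumptions; in particular, the Scott-type bandwidth n^(−1/(d+4)) times the mean feature standard deviation, and the elbow heuristic, are our simplifications):

    import numpy as np
    from sklearn.cluster import MeanShift, KMeans
    from sklearn.mixture import GaussianMixture

    def run_baselines(X, max_K=15, seed=0):
        n, d = X.shape
        bw = n ** (-1.0 / (d + 4)) * X.std(axis=0).mean()   # Scott-type bandwidth
        ms_labels = MeanShift(bandwidth=bw, max_iter=300).fit_predict(X)
        # GMM-BIC: diagonal covariances, K chosen by BIC
        gmms = [GaussianMixture(n_components=k, covariance_type="diag",
                                tol=1e-3, max_iter=100, random_state=seed).fit(X)
                for k in range(1, max_K + 1)]
        gmm_labels = gmms[int(np.argmin([g.bic(X) for g in gmms]))].predict(X)
        # KM-Elbow: pick K at the largest relative drop in inertia (crude elbow)
        inertias = [KMeans(n_clusters=k, tol=1e-4, max_iter=300,
                           random_state=seed).fit(X).inertia_
                    for k in range(1, max_K + 1)]
        drops = -np.diff(inertias)
        k_elbow = int(np.argmax(drops[:-1] / np.maximum(drops[1:], 1e-12))) + 2
        km_labels = KMeans(n_clusters=k_elbow, random_state=seed).fit_predict(X)
        return ms_labels, gmm_labels, km_labels

ARI, AMI, and Macro-F1 can then be computed from these labels with sklearn.metrics (adjusted_rand_score, adjusted_mutual_info_score, f1_score with average="macro" after majority-vote relabelling).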
[Figure S6: Clustering metrics (means ± 95% CI) and effective clusters (median and [Q1-Q3] quantiles when applicable, or constant across all replicates) for Iris, Wine, and Pendigits over n_rep = 10 runs. Label-free methods (GERVE, MS-Scott, GMM-BIC, KM-Elbow) are ordered by average rank within each dataset. Oracle variants are marked with the † symbol and placed directly below the corresponding label-free method.]

S5.2 Ablation study

We now turn to an ablation study examining how different configuration choices (weights, overcompleteness, annealing, covariance structure, burn-in) affect the behaviour of GERVE. We start from the same default, dataset-specific GERVE configuration. For each dataset, we create new configurations by varying one factor at a time relative to the default: (i) Weights: equal vs. proportional to k-means group sizes; (ii) Overcompleteness: K ∈ {J, J+5}; (iii) Annealing: ω_t ∈ {0, ω₁/t^β, β ∈ {0.5, 1, 2, 3}}; (iv) Covariance: fixed, diagonal, full; (v) Burn-in: L_B ∈ {0, 50, 100, 200, 500}.

Figures S7-S8 report, for each configuration, the mean of each metric (with 95% confidence intervals) and the effective number of clusters (median [Q1-Q3]). The main takeaways for each varying factor are: (i) weights proportional to k-means help on Wine (higher ARI/AMI/Purity/Macro-F1), without significant effect elsewhere; (ii) starting overcomplete slightly increases the effective number of clusters but helps on Pendigits, where class overlap is substantial; (iii) faster annealing improves end-state consolidation under fixed iteration budgets, while ω_t ≡ 0 underperforms on overlapping classes, illustrating the importance of the exploration induced by the entropy term; (iv) diagonal covariances are a good speed vs. robustness compromise, as full covariances add cost and variance without systematic gains; (v) a short fixed-covariance burn-in (L_B ≈ 100) can help, but longer burn-ins steal iterations from covariance learning.

S5.3 Run time

To illustrate how GERVE's computational cost depends on data dimension, the number of mixture components, and the covariance parameterisation, we report mean run times across four datasets (Triangle mixture, Iris, Wine, Pendigits) under a common configuration. This is not intended as a scalability benchmark, since runtime is implementation-dependent and sensitive to engineering choices.

Table S4 shows average wall-clock runtimes over ten replicates. As expected, cost increases with dimensionality and with the flexibility of the covariance model. Fixed covariances are the fastest to compute but also the least flexible, while full covariances are substantially slower, especially in higher dimensions.
The diagonal setting offers a practical compromise: it is only slightly slower than the fixed-covariance setting, yet avoids the instability and high cost of full-covariance updates. Overall, run times remain modest on medium-sized datasets, indicating that GERVE is practically usable in realistic clustering scenarios even when covariance learning is included. We also plot execution time as a function of K on the Pendigits dataset (Figure S9). The scaling is approximately linear in K, with diagonal covariances remaining a practical compromise between fixed and full parameterisations.

[Figure S7: Ablation – Weights and overcompleteness (top) and Annealing (bottom). For each dataset (rows), the first four columns show ARI/AMI/Purity/Macro-F1 (means ± 95% t-CIs over n_rep = 10 runs); the rightmost column shows the effective number of clusters (median [Q1-Q3], or constant across all runs). Bold marks the default GERVE configuration. The dashed reference line in each metric panel marks the default's mean for that metric. Dataset-specific J values are listed in Table S3. Top: prefixes T- for K = J (true groups) and O- for K = J+5 (overcomplete); suffixes -KW for k-means-proportional weights and -EW for equal weights. Bottom: schedules Constant (ω_t ≡ ω₁), Power-x (ω_t = ω₁ t^{−x}), No entropy (ω_t ≡ 0).]

[Figure S8: Ablation – Covariance (top) and Burn-in phase (bottom). For each dataset (rows), the first four columns show ARI/AMI/Purity/Macro-F1 (means ± 95% t-CIs over n_rep = 10 runs); the rightmost column shows the effective number of clusters (median [Q1-Q3], or constant across all runs). Bold marks the default GERVE configuration. The dashed reference line in each metric panel marks the default's mean for that metric. Top: covariance parameterisations Fixed min. (Σ_k ≡ σ²_min I), Fixed max.
(Σ_k ≡ σ²_max I), Diagonal (learned diagonal Σ_k with bounds σ²_min I ⪯ Σ_k ⪯ σ²_max I), and Full (learned full Σ_k under the same bounds). Dataset-specific values for σ²_min, σ²_max and the initialisation σ²_0 (for learned covariances) are listed in Table S3. Bottom: burn-in variants No burn-in (L = 0) and Burn-x (fixed-covariance phase of L = x iterations). The total iteration budget T is the same across variants and is listed in Table S3.]

Table S4: Mean wall-clock run time (in seconds) over 10 replicates for overcomplete GERVE (K = J + 5) with weight updates, T = 500, B = 150, L = 0, using identical hyperparameters across all datasets. Values are mean ± standard deviation. All runs on the same machine. This study illustrates cost trends by covariance parameterisation, number of components, and dimensionality.
    Dataset                        Fixed         Diagonal      Full
    Triangle (d = 2, J = 3)        3.5 ± 0.6     3.9 ± 0.8     4.2 ± 0.8
    Iris (d = 4, J = 3)            3.5 ± 0.7     3.8 ± 0.7     4.6 ± 0.9
    Wine (d = 13, J = 3)           4.2 ± 0.8     4.8 ± 0.9     7.8 ± 1.2
    Pendigits (d = 16, J = 10)     13.4 ± 2.1    15.2 ± 2.6    24.6 ± 3.3

[Figure S9: GERVE run time with respect to the number of components K on the Pendigits dataset, for diagonal, fixed, and full covariance matrices. Curves show mean run time over 10 replicates; shaded bands indicate ± 1 standard deviation.]

S6 Details on the UK collision case study

Data normalisation and processing. We apply GERVE on a normalised space: the Greater London window [−0.54, 0.33] × [51.28, 51.70] (longitude-latitude) is mapped to a centred rectangle [−0.7, 0.7] × [−0.35, 0.35], preserving the aspect ratio and with area ≈ 1. This transformation is applied to the coordinates of the data points. To smoothen the underlying distribution p, we generate artificial data points uniformly in the region between the normalised rectangle and the [−2, 2] × [−2, 2] square, with density 1000 times smaller than the mean density of the data in the normalised rectangle.

GERVE hyperparameters. We target the top H = 10 hotspots. We fit an overcomplete mixture with K = 2H = 20.
• Temperature. We fix the stopping temperature at ω† = 1. This choice follows the elbow of the resolved-mode curve represented in Figure S10 and described in Sec. 7. [Figure S10: Diagnostic for choosing ω: resolved modes after successive pruning (dashed) and merging (solid) for values of ω (in log-scale). The elbow at ω† = 1 maximises resolved modes before decline. The same ω† is used for the bootstrap refits.]
• Step sizes. The step-size schedule is ρ_t = 0.2 t^{−0.1}.
• Initialisation. Component means use k-means++ centroids on the working coordinates. Initial covariances are Σ_{k,1} = σ₁²I with σ₁² = 5 × 10⁻³.
• Covariance bounds. We constrain σ²_min = 1 × 10⁻⁵ and σ²_max = 1 × 10⁻².
• Domain. Optimisation is performed on S = [−2.05, 2.05] × [−2.05, 2.05]. Mapping to the British National Grid is used only for reporting distances and ellipses.
• Entropy term. We use B = 100 Monte Carlo samples per iteration to estimate the entropy contributions.
• Early stopping. We cap iterations at T = 10,000.
We stop early if the following hold at three consecutive checks, evaluated every 10 iterations:
    ∥µ_{k,t+1} − µ_{k,t}∥₂ < 10⁻²  and  ∥S_{k,t+1} − S_{k,t}∥_F / ∥S_{k,t}∥_F < 10⁻¹  for all k.
• Pruning and merging. A component is pruned if eig_max(Σ_k) > σ²_min + β (σ₁² − σ²_min), with β ≈ 0.018, so that the threshold equals 10σ²_min. Component means within 0.005 (≈ 200 meters in the British National Grid, EPSG:27700) are merged.
• Bootstrap settings. We use L = 500 resamples at ω†. Bootstrap modes are matched to the baseline with a Hungarian assignment and adaptive gates around each baseline mode, as described below.

Mode matching and per-mode ellipses. We z-score coordinates using the baseline mode centres. For mode k, we compute nn_k as the minimum distance to any other baseline mode. The gate is τ_k = clip(η nn_k, τ_min, τ_max), with defaults η = 0.35, τ_min = 0.08, τ_max = 0.22. Since η < 0.5, gates stay inside midpoints between modes, which prevents cross-assignments. The clips avoid degenerate gates when modes are extremely close or very isolated. After Hungarian assignment, pairs with distance larger than τ_k are rejected. Accepted matches form the bootstrap cloud for mode k, from which we compute the stability s_k and the 95% chi-square confidence ellipse.

To obtain confidence regions in meters, we project coordinates to the British National Grid (EPSG:27700). For each resolved mode k, let {µ̂^(ℓ)_k}_{ℓ=1}^L denote the matched bootstrap centres. We report 95% confidence ellipses from the empirical covariance V̂_k = Var*(µ̂^(ℓ)_k): if eig₁ ≥ eig₂ are the eigenvalues of V̂_k and e₁ is the principal eigenvector, then the semi-axes are a = √(χ²_{2;1−α} eig₁) and b = √(χ²_{2;1−α} eig₂), with angle atan2(e_{1,x}, e_{1,y}) (in degrees) relative to Easting; a minimal computation sketch is given below. Figure S11 shows, for each baseline mode, the cloud of matched bootstrap centres, the adaptive gate used for matching, and the resulting confidence ellipse, illustrating mode stability and localisation uncertainty.

2020-2024 hotspots. Table S5 reports the 17 hotspots found by GERVE, and Figure S12 displays them in the complete search space.

Mean-shift details. We use scikit-learn's MeanShift with a flat (uniform) kernel on the same projected coordinates. The reference bandwidth h₀ is estimated as the 0.3-quantile of pairwise distances. We then sweep h on a grid from 0.02h₀ to 0.2h₀ and report the resulting centres. Figure S13 shows the centres across the sweep. Smaller h recovers many additional peripheral modes, while larger h merges central structure. In Figure S14, we overlay the 0.05h₀ and 0.1h₀ centres in the zoomed-in window and compare them with GERVE's centres and 95% ellipses. At 0.1h₀, visual inspection against the road network shows that some mean-shift centres fall slightly off carriageways, so we select a smaller bandwidth (0.05h₀) as the baseline comparator for GERVE.

2014-2019 hotspots and comparison to 2020-2024. We fit GERVE to collision data from 2014 to 2019 with identical ω† and K. Confidence ellipses are obtained after L = 500 bootstrap fits. Table S6 reports the 16 modes found by GERVE and Figure S15 displays them with their 95% confidence intervals. Figure S16 overlays the 2014-2019 hotspots on a map of the 2020-2024 ones in the zoomed-in window.
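The per-mode ellipse computation described above can be sketched as follows (our helper; it follows the formulas in the text, including its atan2(e_{1,x}, e_{1,y}) angle convention):

    import numpy as np
    from scipy.stats import chi2

    def confidence_ellipse(centres, alpha=0.05):
        """centres: (L, 2) matched bootstrap centres for one mode,
        already projected to the British National Grid (meters)."""
        V = np.cov(centres, rowvar=False)             # empirical covariance V_hat_k
        eigvals, eigvecs = np.linalg.eigh(V)          # ascending eigenvalues
        eig1, eig2 = eigvals[1], eigvals[0]           # eig1 >= eig2
        e1 = eigvecs[:, 1]                            # principal eigenvector
        c = chi2.ppf(1 - alpha, df=2)                 # chi-square quantile, 2 dof
        a, b = np.sqrt(c * eig1), np.sqrt(c * eig2)   # semi-axes in meters
        angle = np.degrees(np.arctan2(e1[0], e1[1]))  # angle relative to Easting
        return a, b, angle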
[Figure S11: Distribution of the matched modes across the L = 500 bootstrap fits for each baseline hotspot from the 2020-2024 data (one panel per mode, annotated with its stability s and number of matches n), with the estimated confidence ellipses and means.]

[Figure S12: Baseline collision hotspots identified by GERVE for the Greater London area between 2020 and 2024, with stability scores and 95% confidence ellipses, in normalised coordinates. The dashed rectangle is the zoomed-in window of Figure 3.]
[Figure S13: Mean-shift modes across different bandwidths (h = 0.02h₀: 3205 modes; 0.05h₀: 1317 modes; 0.1h₀: 487 modes; 0.2h₀: 158 modes) in the normalised window. The dashed rectangle is the zoomed-in window.]

[Figure S14: Comparison of mean-shift centres (h = 0.05h₀: 122 centres; h = 0.1h₀: 35 centres) and GERVE modes (17 in view), with confidence ellipses for GERVE, in the zoomed-in window.]

[Figure S15: Baseline collision hotspots identified by GERVE for the Greater London area between 2014 and 2019, with stability scores and 95% confidence ellipses, in normalised coordinates. Top: the normalised window containing all data points; the dashed rectangle is the zoomed-in window. Bottom: zoomed-in version for comparison with Figure 3.]

Table S5: Reported hotspots for the 2020-2024 period, with (un-normalised) coordinates, confidence ellipse dimensions (a, b), angles, stability, bootstrap matches, and local counts (radius 200 m). Hotspots are listed in decreasing stability order.
    ID  Name               Longitude (°)  Latitude (°)  a (m)  b (m)  Angle (°)  Stability  Count
    12  Shoreditch         -0.078909      51.524226     134    121    -54        0.95       32
    1   Elephant & Circus  -0.099874      51.495003     120    86     80         0.93       27
    9   Piccadilly Circus  -0.132294      51.510060     162    150    89         0.78       27
    5   Oxford Circus      -0.142322      51.515222     265    66     84         0.68       31
    13  Gunter Grove       -0.185143      51.482279     134    83     72         0.68       22
    10  London Bridge      -0.088544      51.506068     156    117    60         0.56       21
    8   Aldgate            -0.072412      51.515338     357    207    27         0.48       14
    16  Clapham HS         -0.131914      51.464362     409    255    37         0.45       23
    15  Edgware Road       -0.170483      51.519900     230    144    -41        0.44       22
    3   Oval               -0.111630      51.481528     120    105    81         0.42       19
    6   Brixton            -0.114597      51.460859     301    115    -49        0.40       38
    4   Vauxhall           -0.123762      51.485494     177    82     -68        0.37       27
    2   Holborn            -0.120090      51.517891     87     78     54         0.33       25
    11  Mile End           -0.034823      51.524982     149    59     88         0.31       23
    14  Herne Hill         -0.102411      51.452610     305    91     -24        0.20       10
    7   Westminster        -0.125293      51.501799     177    76     79         0.14       17
    0   Victoria           -0.145081      51.496576     96     68     84         0.09       20

Table S6: Reported hotspots for the 2014-2019 period, with (un-normalised) coordinates, confidence ellipse dimensions (a, b), angles, stability, bootstrap matches, and local counts (radius 200 m). Hotspots are listed in decreasing stability order.
    ID  Name               Longitude (°)  Latitude (°)  a (m)  b (m)  Angle (°)  Stability  Count
    14  Shoreditch         -0.078437      51.524026     159    134    -78        0.96       35
    5   Elephant & Circus  -0.100192      51.494709     123    112    -86        0.96       27
    0   Ludgate Circus     -0.104340      51.513547     177    110    -80        0.85       20
    4   Charing Cross      -0.126966      51.507739     168    114    -4         0.75       24
    1   Vauxhall           -0.123471      51.486071     111    102    -66        0.74       23
    15  Shepherd's Bush    -0.223191      51.503715     179    133    -3         0.70       25
    9   Oxford Circus      -0.141671      51.515031     197    96     56         0.62       26
    11  Oval               -0.111782      51.481310     98     72     -79        0.57       23
    2   Aldgate            -0.071210      51.515511     279    100    14         0.54       15
    12  Aldwych            -0.118401      51.511326     145    108    15         0.53       23
    6   St Giles Circus    -0.130695      51.516011     151    109    -29        0.47       15
    3   King's Cross       -0.122322      51.530607     252    155    24         0.44       22
    7   Camden Town        -0.140953      51.540307     197    90     59         0.44       19
    8   Edgware            -0.168222      51.518691     138    137    73         0.25       22
    13  London Bridge      -0.088592      51.506237     318    85     70         0.23       14
    10  Olympia            -0.206178      51.496948     156    72     7          0.07       14

[Figure S16: Comparison of 2014-2019 hotspots and 2020-2024 hotspots in the zoomed-in window.]
