Sharp Inequalities between Total Variation and Hellinger Distances for Gaussian Mixtures


Authors: Joonhyuk Jung, Chao Gao

Department of Statistics, University of Chicago

Abstract

We study the relation between the total variation (TV) and Hellinger distances between two Gaussian location mixtures. Our first result establishes a general upper bound: for any two mixing distributions supported on a compact set, the Hellinger distance between the two mixtures is controlled by the TV distance raised to a power $1 - o(1)$, where the $o(1)$ term is of order $1/\log\log(1/\mathrm{TV})$. We also construct two sequences of mixing distributions that demonstrate the sharpness of this bound. Taken together, our results resolve an open problem raised in Jia et al. (2023) and thus lead to an entropic characterization of learning Gaussian mixtures in total variation. Our inequality also yields optimal robust estimation of Gaussian mixtures in Hellinger distance, which has a direct implication for bounding the minimax regret of empirical Bayes under Huber contamination.

1 Introduction

The Gaussian location mixture is one of the most fundamental models used in nonparametric density estimation, Bayesian inference, and clustering (Lindsay, 1995; Dasgupta, 1999). Given a probability measure $\pi$ supported on $\mathbb{R}^d$, the induced Gaussian mixture is defined by
$$f_\pi(x) = \int_{\mathbb{R}^d} \phi_d(x - \theta) \, d\pi(\theta),$$
where $\phi_d(x) = (2\pi)^{-d/2} \exp(-\|x\|_2^2/2)$ is the density function of the $d$-dimensional standard Gaussian distribution. In this paper, we study the relation between the total variation distance $\mathrm{TV}(p, q) := \frac{1}{2} \int |p - q|$ and the Hellinger distance $H(p, q) := \sqrt{\frac{1}{2} \int (\sqrt{p} - \sqrt{q})^2}$ of two Gaussian mixture densities. Without any restriction on the distributions, it is well known that
$$H^2(p, q) \le \mathrm{TV}(p, q) \le \sqrt{2}\, H(p, q).$$
(1)

The Hellinger distance is a commonly used loss function for density estimation (Wong and Shen, 1995). It is especially useful in the setting of Gaussian location mixture estimation given its direct consequence for bounding the regret of an empirical Bayes estimator using a plug-in estimator of the prior (Jiang and Zhang, 2009). When the data set contains a small subset of arbitrary outliers, the density estimation problem can be regarded as misspecified under total variation. Therefore, sharp inequalities are necessary for deriving optimal error rates for robust density estimation of Gaussian location mixtures, and the inequalities in (1) are too loose for this purpose.

(*The research of CG is supported in part by NSF Grants ECCS-2216912 and DMS-2310769, and an Alfred Sloan fellowship.)

Relations between $f$-divergences of Gaussian location mixtures have been studied in the literature. In particular, for distributions $\pi$ and $\eta$ supported on a bounded Euclidean ball $\{\theta \in \mathbb{R}^d : \|\theta\|_2 \le M\}$, it was proved by Jia et al. (2023) that the induced Gaussian mixtures $f_\pi$ and $f_\eta$ satisfy
$$H^2(f_\pi, f_\eta) \asymp \mathrm{KL}(f_\pi \| f_\eta), \quad (2)$$
up to constant factors depending on $M$ and $d$. Here, $\mathrm{KL}(p \| q) := \int p \log \frac{p}{q}$ denotes the Kullback–Leibler divergence. The result in (2) implies an entropic characterization of the minimax rate of estimating the Gaussian location mixture. The paper Jia et al. (2023) also investigated the relation between the total variation distance $\mathrm{TV}(f_\pi, f_\eta)$ and the $L^2$ distance $\|f_\pi - f_\eta\|_2$. However, whether the relation $\mathrm{TV}(f_\pi, f_\eta) \asymp H(f_\pi, f_\eta)$ holds was explicitly listed as an open question in the paper. In this paper, we resolve this open problem by proving that
$$H(f_\pi, f_\eta) \le \mathrm{TV}^{1 - o(1)}(f_\pi, f_\eta),$$
where the $o(1)$ term in the exponent is of order $\Omega\left( \frac{1}{\log\log(1/\mathrm{TV}(f_\pi, f_\eta))} \right).$
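Before proceeding, the basic quantities above are easy to check numerically. The following Python sketch (our own illustration, not from the paper; function names are ours) builds two one-dimensional Gaussian location mixtures on a grid and verifies the classical two-sided bound (1) between TV and Hellinger.

```python
import numpy as np

def mixture_pdf(x, locs, weights):
    """Density f_pi(x) = sum_j w_j * phi(x - theta_j) of a 1-d Gaussian location mixture."""
    z = np.subtract.outer(np.asarray(x, float), np.asarray(locs, float))
    return (np.asarray(weights) * np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)).sum(axis=-1)

# Two mixtures whose mixing distributions are supported on a compact set.
grid = np.linspace(-12.0, 12.0, 240001)
dx = grid[1] - grid[0]
p = mixture_pdf(grid, locs=[-1.0, 1.0], weights=[0.5, 0.5])
q = mixture_pdf(grid, locs=[-0.8, 1.2], weights=[0.4, 0.6])

tv = 0.5 * np.sum(np.abs(p - q)) * dx                                # TV(p, q)
hell = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2) * dx)      # H(p, q)

# The classical two-sided bound (1): H^2 <= TV <= sqrt(2) * H.
assert hell**2 <= tv <= np.sqrt(2) * hell
```

The paper's point is that, for such mixtures, the left inequality can be tightened all the way to $H \le \mathrm{TV}^{1 - o(1)}$.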
We also construct sequences of distributions $\pi_n$ and $\eta_n$ showing that this $1 - o(1)$ factor is indeed necessary, thereby disproving the linear comparability $\mathrm{TV}(f_\pi, f_\eta) \asymp H(f_\pi, f_\eta)$ for Gaussian location mixtures. Our proof is based on an expansion of the ratio $(f_\pi - f_\eta)/\phi_d$ using Hermite polynomials. Key steps in the analysis involve the derivation of the multivariate Nikolskii-type inequality (see Proposition A.6) and the restricted-range inequality (see Proposition A.7).

As a direct application, for density estimation of $f_\pi$ with data generated from the Huber contamination model $(1 - \epsilon) P_{f_\pi} + \epsilon Q$, where $Q$ is arbitrary, we show that the minimax rate under the Hellinger distance is given by
$$\epsilon^{1 - \Theta\left( \frac{1}{\log\log(1/\epsilon)} \right)},$$
provided that the sample size satisfies $n \ge \mathrm{poly}(1/\epsilon)$.

1.1 Paper Organization

The remainder of this paper is organized as follows. Our main result is presented in Section 2, followed by the sharpness construction in Section 3. Two applications of the main results, the entropic characterization of Gaussian location mixture estimation in total variation and robust density estimation, are discussed in Section 4. In Section 5, we briefly discuss a few open directions. Due to page limits, most technical proofs are deferred to the appendices.

1.2 Notation

Let $\mathbb{N}_0$ be the set of nonnegative integers and $\mathbb{R}$ the set of real numbers. We use boldface notation, e.g., $\mathbf{k}$ and $\mathbf{l}$, for multi-indices. For $\mathbf{k} = (k_1, \ldots, k_d) \in \mathbb{N}_0^d$, we write $|\mathbf{k}| := k_1 + \cdots + k_d$. We denote by $\|\theta\|_2$ and $\|\theta\|_\infty$ the Euclidean norm and $\infty$-norm of $\theta \in \mathbb{R}^d$, respectively. For a real matrix $A \in \mathbb{R}^{m \times n}$, $\|A\|_\infty := \max\{\|Ax\|_\infty : \|x\|_\infty = 1\}$ is the operator norm induced by the $\infty$-norm of vectors. Recall that $\phi_d$ denotes the $d$-dimensional standard Gaussian density. We may use $\phi = \phi_1$ when we only discuss one-dimensional results.
For $p \in \{1, 2\}$, a measurable set $A \subseteq \mathbb{R}^d$, and a measurable function $g : \mathbb{R}^d \to \mathbb{R}$, we write
$$\|g\|_{L^p(A, \phi_d)} := \left( \int_A |g(x)|^p \phi_d(x) \, dx \right)^{1/p} = \left( \int_A |g|^p \phi_d \right)^{1/p},$$
whenever the integral exists. We often abbreviate $L^p(\mathbb{R}^d, \phi_d)$ as $L^p(\phi_d)$ when no confusion arises. Let $\Pi_n^d$ be the set of real polynomials of total degree at most $n$ in $d$ variables. We also write $\Pi_n = \Pi_n^1$ when $d = 1$. For $k \in \mathbb{N}_0$, we define the one-dimensional (normalized) Hermite polynomial $h_k \in \Pi_k$ by
$$h_k(x) := \frac{(-1)^k}{\sqrt{k!}\, \phi(x)} \frac{d^k}{dx^k} \phi(x). \quad (3)$$
For arbitrary dimensions, we define the Hermite polynomial $h_{\mathbf{k}} \in \Pi_{|\mathbf{k}|}^d$ by tensor products of one-dimensional Hermite polynomials:
$$h_{\mathbf{k}}(x) := \prod_{j=1}^{d} h_{k_j}(x_j).$$
Note that $\deg h_{\mathbf{k}} = |\mathbf{k}|$ and the collection $\{h_{\mathbf{k}} : |\mathbf{k}| \le n\}$ forms an orthonormal basis of $\Pi_n^d$ with respect to the $L^2(\phi_d)$-norm. The dimension of $\Pi_n^d$ is given by $\binom{n+d}{n}$.

For integer or real values, we write $a \vee b := \max\{a, b\}$ and $a \wedge b := \min\{a, b\}$. For a positive integer $N \in \mathbb{N}$, we write $[N] := \{1, \ldots, N\}$. For a real number $x$, $\lceil x \rceil$ is the smallest integer no smaller than $x$ and $\lfloor x \rfloor$ is the largest integer no larger than $x$. For $a, b : \mathcal{G} \to [0, \infty)$, we write $a \lesssim b$ or $a = O(b)$ if there exists some constant $C > 0$ independent of $g$ such that $a(g) \le C b(g)$ holds for all $g \in \mathcal{G}$. We write $a \gtrsim b$ or $a = \Omega(b)$ if $b \lesssim a$. We write $a \asymp b$ or $a = \Theta(b)$ if $a \lesssim b$ and $b \lesssim a$.

2 Main Results

In this section, we present our main results. The first result bounds the $\chi^2$-divergence $\chi^2(p \| q) := \int \frac{(p - q)^2}{q}$ by the total variation.

Theorem 2.1 (Inequality between TV distance and $\chi^2$-divergence). Let $\pi$ and $\eta$ be probability measures supported on the $d$-dimensional cube $[-M, M]^d$. Let $\delta > 0$.
Then, there exists $C_0 = C_0(\delta, M, d) > 0$, not depending on $\pi$ or $\eta$, such that
$$\sqrt{\chi^2(f_\pi \| f_\eta)} \le \left( C_0 \vee \mathrm{TV}^{-\alpha(\mathrm{TV}(f_\pi, f_\eta))}(f_\pi, f_\eta) \right) \mathrm{TV}(f_\pi, f_\eta),$$
where we define
$$\alpha(t) := \frac{2 + \delta}{\log(\log(1/t) \vee e)}, \quad (4)$$
for $t > 0$.

Remark 2.2. Note that $\alpha(t)$ is increasing in $t$ and that $\alpha(t) \to 0$ as $t \downarrow 0$. However, $t^{-\alpha(t)}$ is decreasing in $t$ and $t^{-\alpha(t)} \to +\infty$ as $t \downarrow 0$.

Remark 2.3. We note that the exponent $\alpha(t)$ does not depend on $M$ or $d$. The dependence on $M$ and $d$ appears solely in the constant $C_0$. We will discuss the dependence of $C_0$ on $M$ and $d$ later in Appendix A.3.

Corollary 2.4 (Inequality between TV and Hellinger distances). Let $\pi$ and $\eta$ be probability measures supported on the $d$-dimensional cube $[-M, M]^d$. Let $\delta > 0$. Then, there exists $C_0 = C_0(\delta, M, d) > 0$, not depending on $\pi$ or $\eta$, such that
$$H(f_\pi, f_\eta) \le \left( C_0 \vee \mathrm{TV}^{-\alpha(\mathrm{TV}(f_\pi, f_\eta))}(f_\pi, f_\eta) \right) \mathrm{TV}(f_\pi, f_\eta),$$
where $\alpha(\cdot)$ is defined as in (4).

Proof. This is a direct consequence of Theorem 2.1, noting that $H^2(p, q) \le \chi^2(p \| q)$ holds in general.

A key step in deriving the results of Theorem 2.1 and Corollary 2.4 is to relate the $L^1(\phi_d)$ and $L^2(\phi_d)$ norms of the ratio $\frac{f_\pi - f_\eta}{\phi_d}$. We note that the $L^1(\phi_d)$-norm of $\frac{f_\pi - f_\eta}{\phi_d}$ is exactly twice the total variation. On the other hand, the squared Hellinger distance and the $\chi^2$-divergence behave like the squared $L^2(\phi_d)$-norm.

Theorem 2.5 (Inequality between $L^1(\phi_d)$ and $L^2(\phi_d)$ norms). Let $\pi$ and $\eta$ be probability measures supported on the $d$-dimensional cube $[-2M, 2M]^d$. Define $g := \frac{f_\pi - f_\eta}{\phi_d}$ and suppose $\delta > 0$. Then, there exists $C_0 = C_0(\delta, M, d) > 0$, not depending on $\pi$ or $\eta$, such that
$$\|g\|_{L^2(\phi_d)} \le \left( C_0 \vee \mathrm{TV}^{-\alpha(\mathrm{TV}(f_\pi, f_\eta))}(f_\pi, f_\eta) \right) \mathrm{TV}(f_\pi, f_\eta),$$
where $\alpha(\cdot)$ is defined as in (4).

Proof.
We provide the proof of the result in general dimension later in Appendix A.2, built upon the inequalities in Appendix A.1. Here, we give a sketch of the proof for the one-dimensional setting with $d = 1$. Recall the definition (3) of the (one-dimensional) Hermite polynomials, and consider the Hermite polynomial expansion (see Lemma A.1) of $g$ as follows:
$$g(x) = \int_{\mathbb{R}} \frac{\phi_1(x - \theta)}{\phi_1(x)} \, d(\pi - \eta)(\theta) = \int_{\mathbb{R}} \sum_{k=0}^{\infty} \frac{\theta^k}{\sqrt{k!}} h_k(x) \, d(\pi - \eta)(\theta) = \sum_{k=0}^{\infty} \frac{\Delta_k}{\sqrt{k!}} h_k(x),$$
where the second equality is by Lemma A.1 and $\Delta_k := \int_{\mathbb{R}} \theta^k \, d(\pi - \eta)(\theta)$. We decompose $g = q + r$, where
$$q = \sum_{k=0}^{n} \frac{\Delta_k}{\sqrt{k!}} h_k, \qquad r = \sum_{k=n+1}^{\infty} \frac{\Delta_k}{\sqrt{k!}} h_k,$$
and $n$ is an integer that will be determined later. To handle the $L^1(\phi_1)$-norm of $q \in \Pi_n$, we define
$$c_n := \inf\left\{ \|P\|_{L^1(\phi_1)} : P \in \Pi_n, \ \|P\|_{L^2(\phi_1)} = 1 \right\}. \quad (5)$$
Note first that $c_n \le 1$ by the Cauchy–Schwarz inequality. For $P \in \Pi_n$, the Nikolskii-type inequality (Nevai and Totik, 1987) states that
$$\sup_{x \in \mathbb{R}} \left| P(x) \phi_1^{1/2}(x) \right| \lesssim n^{1/4} \|P\|_{L^2(\phi_1)}. \quad (6)$$
Thus, the following argument implies that $c_n \ge c n^{-1/4} e^{-n}$ holds for some universal constant $c > 0$:
$$\|P\|_{L^2(\phi_1)}^2 = \int_{-\infty}^{\infty} P^2 \phi_1 \le 2 \int_{-2\sqrt{n+1}}^{2\sqrt{n+1}} P^2 \phi_1 \le 2 \sup_{|x| \le 2\sqrt{n+1}} \left| \phi_1^{-1/2}(x) \right| \sup_{x \in \mathbb{R}} \left| P(x) \phi_1^{1/2}(x) \right| \int_{-\infty}^{\infty} |P \phi_1| \lesssim e^n \cdot n^{1/4} \|P\|_{L^2(\phi_1)} \cdot \|P\|_{L^1(\phi_1)},$$
where the first inequality is the restricted-range inequality and the last step is by (6). The restricted-range inequality above is due to Theorem 6.2(b) of Lubinsky (2007) with $W = \phi_1^{1/2}$ being the Freud-type weight function. In addition to $c_n$, another technical ingredient is to control the tail norm $\|r\|_{L^2(\phi_1)}$. Knowing that $|\Delta_k| \le 2(2M)^k$, we have
$$\|r\|_{L^2(\phi_1)} \le \left( \sum_{k=n+1}^{\infty} \frac{4(4M^2)^k}{k!} \right)^{1/2} \le \left( \frac{C}{n+1} \right)^{(n+1)/2},$$
where $C$ is a positive constant depending solely on $M$.
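The Hermite expansion at the start of this sketch can be checked numerically for discrete mixing distributions. The sketch below is our own illustration, not from the paper; it uses numpy's probabilists' Hermite module `numpy.polynomial.hermite_e`, whose polynomials $He_k$ satisfy $h_k = He_k/\sqrt{k!}$, and compares a truncated series $\sum_{k<K} \Delta_k h_k/\sqrt{k!}$ with the exact ratio $g = (f_\pi - f_\eta)/\phi_1$.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt

phi1 = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian density

def h(k, x):
    """Normalized Hermite polynomial h_k = He_k / sqrt(k!), orthonormal in L^2(phi_1)."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    return He.hermeval(x, coef) / sqrt(factorial(k))

# Discrete mixing distributions pi and eta supported on a compact set.
locs_pi, w_pi = np.array([-0.5, 0.3]), np.array([0.5, 0.5])
locs_eta, w_eta = np.array([-0.2, 0.4]), np.array([0.6, 0.4])

x = np.linspace(-3.0, 3.0, 13)
f_pi = (w_pi * phi1(x[:, None] - locs_pi)).sum(axis=1)
f_eta = (w_eta * phi1(x[:, None] - locs_eta)).sum(axis=1)
g = (f_pi - f_eta) / phi1(x)

# Moment differences Delta_k = int theta^k d(pi - eta)(theta); truncate the series at K.
K = 40
Delta = [(w_pi * locs_pi**k).sum() - (w_eta * locs_eta**k).sum() for k in range(K)]
series = sum(Delta[k] / sqrt(factorial(k)) * h(k, x) for k in range(K))

# Since |Delta_k| <= 2 * (1/2)^k here, the tail beyond K = 40 is negligible.
assert np.allclose(g, series, atol=1e-9)
```

Note $\Delta_0 = 0$ automatically, since $\pi$ and $\eta$ are both probability measures.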
Now we are ready to lower bound $\|g\|_{L^1(\phi_1)}$:
$$\|g\|_{L^1(\phi_1)} \ge \|q\|_{L^1(\phi_1)} - \|r\|_{L^1(\phi_1)} \ge c_n \|q\|_{L^2(\phi_1)} - \|r\|_{L^2(\phi_1)} \ge c_n \|g\|_{L^2(\phi_1)} - 2 \|r\|_{L^2(\phi_1)},$$
where the second inequality uses (5) and the Cauchy–Schwarz inequality, and the last inequality holds since $c_n \le 1$. Together with the lower bound for $c_n$ and the upper bound for $\|r\|_{L^2(\phi_1)}$, we have
$$2t \ge \sup_{n \ge 1} \left\{ c n^{-1/4} e^{-n} \|g\|_{L^2(\phi_1)} - 2 \left( \frac{C}{n+1} \right)^{(n+1)/2} \right\},$$
where $t = \frac{1}{2} \|g\|_{L^1(\phi_1)} = \mathrm{TV}(f_\pi, f_\eta)$. Finally, we choose $n \approx \frac{2 \log(1/t)}{\log\log(1/t)}$ to conclude the proof.

The full proof in Appendix A is self-contained, and the main challenge is to generalize the Nikolskii-type inequality and the restricted-range inequality to arbitrary dimension. See Propositions A.6 and A.7, respectively.

Proof of Theorem 2.1. Here we show that Theorem 2.1 follows directly from Theorem 2.5 and that the constants $C_0$ in the two theorems coincide. Fix $\theta \in [-M, M]^d$. Consider the translation map $\tau_\theta(x) = x - \theta$ and define the following push-forward measures:
$$\pi_\theta := (\tau_\theta)_\sharp \pi, \qquad \eta_\theta := (\tau_\theta)_\sharp \eta.$$
Note that these are simply translations of the original measures and are supported on $[-2M, 2M]^d$. Define $g_\theta := \frac{f_{\pi_\theta} - f_{\eta_\theta}}{\phi_d}$. Then,
$$\|g_\theta\|_{L^2(\phi_d)}^2 = \int_{\mathbb{R}^d} \frac{(f_\pi(x + \theta) - f_\eta(x + \theta))^2}{\phi_d(x)} \, dx = \int_{\mathbb{R}^d} \frac{(f_\pi(x) - f_\eta(x))^2}{\phi_d(x - \theta)} \, dx,$$
$$\|g_\theta\|_{L^1(\phi_d)} = \int_{\mathbb{R}^d} |f_\pi(x + \theta) - f_\eta(x + \theta)| \, dx = \int_{\mathbb{R}^d} |f_\pi(x) - f_\eta(x)| \, dx = 2\, \mathrm{TV}(f_\pi, f_\eta).$$
Since $g_\theta$ obeys the inequality in Theorem 2.5, there exists $C_0 = C_0(\delta, M, d) > 0$, not depending on $\pi$, $\eta$, or $\theta$, such that
$$\left( \int \frac{(f_\pi(x) - f_\eta(x))^2}{\phi_d(x - \theta)} \, dx \right)^{1/2} \le \left( C_0 \vee \mathrm{TV}^{-\alpha(\mathrm{TV}(f_\pi, f_\eta))}(f_\pi, f_\eta) \right) \mathrm{TV}(f_\pi, f_\eta).$$
Meanwhile, we can apply Jensen's inequality pointwise in $x$ to get
$$\frac{(f_\pi(x) - f_\eta(x))^2}{f_\eta(x)} \le \int \frac{(f_\pi(x) - f_\eta(x))^2}{\phi_d(x - \theta)} \, d\eta(\theta).$$
Integrate both sides in $x$.
Then, use Fubini–Tonelli (nonnegativity) and the fact that a mixture integral is upper bounded by the supremum of its integrand to show that
$$\chi^2(f_\pi \| f_\eta) \le \sup_{\theta \in [-M, M]^d} \int \frac{(f_\pi(x) - f_\eta(x))^2}{\phi_d(x - \theta)} \, dx,$$
thus concluding the proof.

3 Sharpness

In this section, we establish the sharpness of the inequalities by showing that the presence of the exponent $\alpha(\cdot)$ is necessary up to a constant. Since our construction of sharp examples is in one dimension, we will write $\phi = \phi_1$ for simplicity of notation. Note that Theorem 4.6 and its proof demonstrate that the minimax lower bound for density estimation in arbitrary dimensions can be established using only the one-dimensional sharpness result. We first show the sharpness of Corollary 2.4, and then the sharpness of Theorem 2.1 follows immediately by $H^2(p, q) \le \chi^2(p \| q)$.

Theorem 3.1 (Sharpness of Corollary 2.4). There exist two sequences of probability measures $\{\pi_n\}$ and $\{\eta_n\}$ supported on $[-M, M]$ such that, if we define $\mathrm{TV}_n := \mathrm{TV}(f_{\pi_n}, f_{\eta_n})$ and $H_n := H(f_{\pi_n}, f_{\eta_n})$, then $\mathrm{TV}_n \downarrow 0$ as $n \to \infty$, and moreover it holds for all $n$ that $\mathrm{TV}_n < e^{-e}$ and that
$$H_n \ge \mathrm{TV}_n^{1 - \alpha_*(\mathrm{TV}_n)},$$
where we define
$$\alpha_*(t) := \frac{0.33}{\log\log(1/t)}, \quad t > 0.$$

In the sequel, we will construct three pairs of sequences of mixing distributions, namely, $(\pi_n, \eta_n)$, $(\pi_n^{(1)}, \eta_n^{(1)})$, and $(\pi_n^{(2)}, \eta_n^{(2)})$. We begin with Lemma 3.2, providing a sharp example $(\pi_n, \eta_n)$ for Theorem 2.5. Corollary 3.3 then modifies this example to $(\pi_n^{(1)}, \eta_n^{(1)})$, showing the sharpness of Theorem 2.1. Finally, Corollary 3.4 constructs $(\pi_n^{(2)}, \eta_n^{(2)})$ from $(\pi_n^{(1)}, \eta_n^{(1)})$ to show the sharpness of Corollary 2.4.
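For intuition about the gap between the upper and lower bounds, the following small computation (our own illustration, not from the paper) tabulates the upper-bound correction $\alpha$ from (4) (with $\delta = 0.01$, our choice) against the sharpness correction $\alpha_*$ above: $\alpha_*(t) < \alpha(t)$ for every $t$, so the two results bracket the truth, and both corrections vanish as $t \downarrow 0$ while the multiplicative penalty $t^{-\alpha(t)}$ of Remark 2.2 still diverges.

```python
from math import e, log

def alpha(t, delta=0.01):
    """Upper-bound correction (4): alpha(t) = (2 + delta) / log(log(1/t) v e)."""
    return (2 + delta) / log(max(log(1 / t), e))

def alpha_star(t):
    """Lower-bound correction from Theorem 3.1: alpha_*(t) = 0.33 / log log(1/t)."""
    return 0.33 / log(log(1 / t))

ts = [1e-3, 1e-6, 1e-12, 1e-24, 1e-48]
alphas = [alpha(t) for t in ts]

# alpha_* sits strictly below alpha, both corrections decrease toward 0 as t -> 0,
# yet the penalty t^{-alpha(t)} still blows up along the way.
assert all(alpha_star(t) < alpha(t) for t in ts)
assert all(a1 > a2 > 0 for a1, a2 in zip(alphas, alphas[1:]))
assert all(t1 ** (-alpha(t1)) < t2 ** (-alpha(t2)) for t1, t2 in zip(ts, ts[1:]))
```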
Before we proceed to construct a sharp example for Theorem 2.5, we recall the essential ingredients of the proof of the theorem: 1) the quantity $c_n$, defined in (5), can be bounded from below by $e^{-O(n)}$; 2) we can control the tail norm $\|r\|_{L^2(\phi_1)}$ by $e^{-\Omega(n \log n)}$ using the differences between higher-order moments of the mixing distributions. We note that the sequence of monomials $(x^n)_n$ is a sharp instance for $c_n$, since the norm ratio $\|x^n\|_{L^1(\phi_1)} / \|x^n\|_{L^2(\phi_1)}$ decays exponentially in $n$. Knowing this fact, given $n$, we construct an example such that the $L^2(\phi_1)$ projection of $(f_{\pi_n} - f_{\eta_n})/\phi_1$ onto $\Pi_n$ is proportional to $x^n$. To this end, we will first find $n + 1$ points in $[-M, M]$ as the support of the two mixing distributions, denoted by $\theta_0, \ldots, \theta_n$, and then we match the lower-order moments $\Delta_0, \ldots, \Delta_n$ so that
$$\sum_{k=0}^{n} \frac{\Delta_k}{\sqrt{k!}} h_k \propto x^n.$$
Given the values of $\theta_0, \ldots, \theta_n$, the differences of the lower-order moments $\Delta_0, \ldots, \Delta_n$ can be solved from a linear equation that involves inverting a Vandermonde matrix (see Lemma B.2 for the definition). We choose $\theta_0, \ldots, \theta_n$ to be the zeros of the $(n+1)$-th Chebyshev polynomial of the first kind, i.e., Chebyshev nodes, since the corresponding inverse Vandermonde matrix is stable (Gautschi, 1974). Here $T_{n+1} \in \Pi_{n+1}$, the $(n+1)$-th Chebyshev polynomial of the first kind, is defined by
$$T_{n+1}(\cos(\theta)) = \cos((n+1)\theta). \quad (7)$$
In addition to the stability of the inverse Vandermonde matrix, another advantage of using the Chebyshev nodes is that we can recursively bound the differences of the higher-order moments given the lower-order moments. The properties of the construction are summarized in the following Lemma 3.2, whose proof will be given in Appendix B.

Lemma 3.2 (Sharp example for Theorem 2.5). Let $n \ge 11$ be an odd number and $\theta_j = \cos\left( \frac{2j+1}{2n+2} \pi \right)$, $j = 0, \ldots$
$, n$, be the zeros of the Chebyshev polynomial of the first kind, $T_{n+1}(x)$. Given $M > 0$, define $a = 1 \wedge M$ and
$$\Delta_k = \begin{cases} \dfrac{\left( a(\sqrt{2} - 1) \right)^{n+1}}{(n-k)!!}, & k \text{ is odd}, \\ 0, & k \text{ is even}, \end{cases} \quad (8)$$
for $k = 0, 1, \ldots, n$, where $(n-k)!!$ is the double factorial. Define $(w_0, \ldots, w_n) \in \mathbb{R}^{n+1}$ to be the unique vector solving
$$\Delta_k = \sum_{j=0}^{n} w_j (a\theta_j)^k, \quad k = 0, 1, \ldots, n. \quad (9)$$
Accordingly, define two discrete probability measures
$$\pi_n := \sum_{j=0}^{n} \left( \frac{1}{n+1} + w_j \right) \delta_{a\theta_j}, \qquad \eta_n := \sum_{j=0}^{n} \frac{1}{n+1} \delta_{a\theta_j}, \quad (10)$$
where $\delta_{a\theta_j}$ denotes the point mass at $a\theta_j$. Then,

1. $w_j$ is well-defined and $|w_j| \le \frac{1}{n+1}$ for all $j$.

2. $\pi_n$ and $\eta_n$ are valid discrete probability measures supported on $[-M, M]$.

3. For $0 \le k \le n$, $\Delta_k = \int \theta^k \, d(\pi_n - \eta_n)(\theta)$ satisfies
$$|\Delta_k| \le \left\{ a(\sqrt{2} - 1) \right\}^{n+1} \exp\left( \frac{n}{5.54} \right) b^{k - n}, \quad (11)$$
where $b := a \sqrt{\frac{n}{2.77}}$.

4. If we further define $\Delta_k := \int \theta^k \, d(\pi_n - \eta_n)(\theta)$ for $k > n$, then (11) is also true.

5. If we write $q_n(x) = \sum_{k=0}^{n} \frac{\Delta_k}{\sqrt{k!}} h_k(x)$ and $r_n(x) = \sum_{k=n+1}^{\infty} \frac{\Delta_k}{\sqrt{k!}} h_k(x)$, then
$$q_n(x) = \left\{ a(\sqrt{2} - 1) \right\}^{n+1} \frac{x^n}{n!}. \quad (12)$$
In addition, there exists a universal $N_0 \in \mathbb{N}$ such that it holds for all $n \ge N_0$ that
$$\|r_n\|_{L^2(\phi)} \le \frac{1}{32} \exp\left( \frac{n}{5.53} \right) \|q_n\|_{L^1(\phi)} \quad (13)$$
$$\le \frac{1}{16} \exp\left( -\left( \frac{\log 2}{2} - \frac{1}{5.53} \right) n \right) \|q_n\|_{L^2(\phi)}. \quad (14)$$

6. $g_n = q_n + r_n$ satisfies
$$\lim_{n \to \infty} \frac{1}{n \log n} \log\left( \frac{1}{\|g_n\|_{L^1(\phi)}} \right) = \lim_{n \to \infty} \frac{1}{n \log n} \log\left( \frac{1}{\|g_n\|_{L^2(\phi)}} \right) = \frac{1}{2}. \quad (15)$$

Proof. We will give the full proof in Appendix B.2. The key argument, which is to derive the bound (11) for $k > n$, is sketched below. Write the Chebyshev polynomial as $T_{n+1}(x) = 2^n \left( x^{n+1} - \sigma_2 x^{n-1} + \sigma_4 x^{n-3} - \cdots + (-1)^{(n+1)/2} \sigma_{n+1} \right)$. The choice of the support $\{a\theta_0, \ldots$
$, a\theta_n\}$ implies that $T_{n+1}(\theta_j) = 0$ for all $j = 0, \ldots, n$, and thus
$$(a\theta_j)^{K+1} = \sigma_2 a^2 (a\theta_j)^{K-1} - \sigma_4 a^4 (a\theta_j)^{K-3} + \cdots + (-1)^{(n-1)/2} \sigma_{n+1} a^{n+1} (a\theta_j)^{K-n}.$$
This implies
$$|\Delta_{K+1}| = \left| \sum_{j=0}^{n} w_j (a\theta_j)^{K+1} \right| \le \sigma_2 a^2 |\Delta_{K-1}| + \sigma_4 a^4 |\Delta_{K-3}| + \cdots + \sigma_{n+1} a^{n+1} |\Delta_{K-n}|,$$
from which we can bound all $|\Delta_k|$ for $k > n$ via mathematical induction.

Corollary 3.3 (Sharp example for Theorem 2.1). Recall the definition (10) of $\pi_n$ and $\eta_n$ from the above. Let
$$R_n = \sqrt{8n + 4}, \qquad \lambda_n = \exp(-R_n), \quad (16)$$
and accordingly define
$$\pi_n^{(1)} := (1 - \lambda_n) \delta_0 + \lambda_n \pi_n, \qquad \eta_n^{(1)} := (1 - \lambda_n) \delta_0 + \lambda_n \eta_n, \quad (17)$$
where $\delta_0$ denotes the point mass at zero. Then, there exists a universal $N_0 \in \mathbb{N}$ such that it holds for all $n \ge N_0$ that
$$\mathrm{TV}\left( f_{\pi_n^{(1)}}, f_{\eta_n^{(1)}} \right) = \frac{\lambda_n}{2} \|g_n\|_{L^1(\phi)}, \qquad \sqrt{\chi^2\left( f_{\pi_n^{(1)}} \| f_{\eta_n^{(1)}} \right)} \ge \frac{\lambda_n}{4} \|q_n\|_{L^2(\phi)}. \quad (18)$$

Proof. See Appendix B.2.

Corollary 3.4 (Sharp example for Corollary 2.4). Recall the definition (17) of $\pi_n^{(1)}$ and $\eta_n^{(1)}$ from the above. Let
$$\pi_n^{(2)} := \frac{1}{4} \pi_n^{(1)} + \frac{3}{4} \eta_n^{(1)}, \qquad \eta_n^{(2)} := \eta_n^{(1)}. \quad (19)$$
Then, there exists a universal $N_0 \in \mathbb{N}$ such that it holds for all $n \ge N_0$ that
$$\mathrm{TV}\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right) = \frac{\lambda_n}{8} \|g_n\|_{L^1(\phi)}, \qquad H\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right) \ge \frac{\lambda_n}{64} \|q_n\|_{L^2(\phi)}. \quad (20)$$

Proof. The equality for the TV distance is straightforward. Now, observe for all $x \in \mathbb{R}$ that
$$u(x) := \frac{f_{\pi_n^{(1)}}(x)}{f_{\eta_n^{(1)}}(x)} - 1 = \frac{(1 - \lambda_n) \phi(x) + \sum_{j=0}^{n} \left( \frac{\lambda_n}{n+1} + \lambda_n w_j \right) \phi(x - a\theta_j)}{(1 - \lambda_n) \phi(x) + \sum_{j=0}^{n} \frac{\lambda_n}{n+1} \phi(x - a\theta_j)} - 1 \le \max_{0 \le j \le n} \frac{\frac{\lambda_n}{n+1} + \lambda_n w_j}{\frac{\lambda_n}{n+1}} - 1 \le 1,$$
where the last inequality uses $|w_j| \le \frac{1}{n+1}$, and hence that $\|u\|_\infty \le 1$. Write
$$H^2\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right) = H^2\left( \frac{1}{4} f_{\pi_n^{(1)}} + \frac{3}{4} f_{\eta_n^{(1)}}, f_{\eta_n^{(1)}} \right) = \int F\left( 1 + \frac{u}{4} \right) f_{\eta_n^{(1)}},$$
where $F(t) := \frac{1}{2} (\sqrt{t} - 1)^2.$
A Taylor expansion of $F$ gives, for some $|v| \le |u|$,
$$F\left( 1 + \frac{u}{4} \right) = \frac{u^2}{128} - \frac{u^3}{32 (4 + v)^{5/2}} \ge \frac{u^2}{128} - \frac{u^2}{288\sqrt{3}} \ge \frac{u^2}{256},$$
where the first inequality uses $\|u\|_\infty \le 1$. Integrating against $f_{\eta_n^{(1)}}$ yields
$$H^2\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right) \ge \frac{1}{256} \chi^2\left( f_{\pi_n^{(1)}} \| f_{\eta_n^{(1)}} \right),$$
concluding the proof.

Now we are ready to prove Theorem 3.1 (Sharpness of Corollary 2.4) with the above $(\pi_n^{(2)}, \eta_n^{(2)})$.

Proof of Theorem 3.1. Let $\mathrm{TV}_n := \mathrm{TV}\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right)$ and $H_n := H\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right)$. Then, (15), (16), and (20) imply that
$$\lim_{n \to \infty} \frac{1}{n \log n} \log\left( \frac{1}{\mathrm{TV}_n} \right) = \frac{1}{2}.$$
Thus, it holds for large enough $n$ that
$$8 \|g_n\|_{L^1(\phi)} \le 8 \|q_n\|_{L^1(\phi)} + 8 \|r_n\|_{L^2(\phi)} \le \frac{1}{2} \exp\left( \frac{n}{5.53} \right) \|q_n\|_{L^1(\phi)} \le \exp\left( -\left( \log 2 - \frac{2}{5.53} \right) \frac{n}{2} \right) \|q_n\|_{L^2(\phi)} \le \exp\left( -\frac{0.33 \log(1/\mathrm{TV}_n)}{\log\log(1/\mathrm{TV}_n)} \right) \|q_n\|_{L^2(\phi)},$$
where the second inequality is by (13) and the third is by (14). Multiply both sides by $\frac{\lambda_n}{64}$ to conclude that
$$\mathrm{TV}_n = \frac{\lambda_n}{8} \|g_n\|_{L^1(\phi)} \le \mathrm{TV}_n^{\alpha_*(\mathrm{TV}_n)} \frac{\lambda_n}{64} \|q_n\|_{L^2(\phi)} \le \mathrm{TV}_n^{\alpha_*(\mathrm{TV}_n)} H_n,$$
where the equality is by (20), the first inequality is by the definition of $\alpha_*(\cdot)$, and the last inequality is again by (20).

Remark 3.5. A careful reader can verify that the constant $0.33$ in $\alpha_*(\cdot)$ can be replaced by any positive real strictly less than $\log(2) - \frac{1}{4 \log(2)} \approx 0.332$.

4 Applications

In this section, we provide a few consequences of our results. The notation "$\lesssim$, $\gtrsim$, $\asymp$" in this section will hide constants depending on $M$ or $d$.

4.1 Entropic Characterization of Learning in TV

The characterization of minimax rates of estimation via metric entropy has been investigated by LeCam (1973); Birgé (1983, 1986); Yatracos (1985); Haussler and Opper (1997); Yang and Barron (1999). While minimax upper and lower bounds do not necessarily match in general, recent work by Jia et al.
(2023) showed that estimating Gaussian mixture densities with bounded support under the Hellinger distance admits an exact entropic characterization of the minimax rate, owing to the fact that $H^2(f_\pi, f_\eta) \asymp \mathrm{KL}(f_\pi \| f_\eta)$. Similarly, our result of Corollary 2.4 that relates the total variation and Hellinger distances also implies such a characterization for the same problem under total variation, up to a $1 - o(1)$ exponent in the rate. We first define the metric entropy of Gaussian location mixtures, and then state a result of Jia et al. (2023).

Definition 4.1. Let $\mathcal{P}_{M,d}$ be the collection of $d$-dimensional Gaussian mixtures whose mixing distributions are supported on the $d$-dimensional cube $[-M, M]^d$. For a distribution class $\mathcal{P} \subseteq \mathcal{P}_{M,d}$, its (global) Hellinger covering number is defined by
$$N_H(\mathcal{P}, \epsilon) := \min\left\{ N : \exists P_1, \ldots, P_N, \ \sup_{R \in \mathcal{P}} \inf_{1 \le i \le N} H(R, P_i) \le \epsilon \right\}.$$
The local Hellinger covering number of $\mathcal{P}$ is
$$N_{H, \mathrm{loc}}(\mathcal{P}, \epsilon) := \sup_{P \in \mathcal{P}, \, \eta \ge \epsilon} N_H(B_H(P, \eta), \eta/2),$$
where $B_H(P, \eta) = \{ R \in \mathcal{P} : H(P, R) \le \eta \}$. We define the global/local total variation covering numbers in the same manner.

Proposition 4.2 (Corollary 11 of Jia et al. (2023)). Suppose $\mathcal{P}$ is a compact subset (in Hellinger) of $\mathcal{P}_{M,d}$. Let $\hat{P} = \hat{P}(X_1, \ldots, X_n)$ denote an estimator based on $X_1, \ldots, X_n$ drawn i.i.d. from $P \in \mathcal{P}$. Then,
$$\inf_{\hat{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[ H^2(P, \hat{P}) \right] \asymp \inf_{\hat{P} \in \mathcal{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[ H^2(P, \hat{P}) \right] \asymp \epsilon_n^2,$$
where
$$\epsilon_n^2 \asymp \inf_{\epsilon > 0} \left\{ \epsilon^2 + \frac{1}{n} \log N_{H, \mathrm{loc}}(\mathcal{P}, \epsilon) \right\}. \quad (21)$$

Unlike the Hellinger distance, there only exists an entropic characterization of the minimax upper bound in total variation (Yatracos, 1985). An entropic lower bound is not available in the literature, to the best of our knowledge. By Corollary 2.4, the rate $\epsilon_n$ determined by the Hellinger entropy (21) also characterizes the minimax rate of estimation under total variation as follows.
Theorem 4.3 (Learning Gaussian mixtures in total variation). Under the same conditions as in Proposition 4.2, for any $\delta > 0$, we have
$$\epsilon_n^{2\left( 1 + \frac{2 + \delta}{\log(\log(1/\epsilon_n) \vee e)} \right)} \lesssim \inf_{\hat{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[ \mathrm{TV}^2(P, \hat{P}) \right] \asymp \inf_{\hat{P} \in \mathcal{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[ \mathrm{TV}^2(P, \hat{P}) \right] \lesssim \epsilon_n^2,$$
where $\epsilon_n$ is defined as in (21).

4.2 Robust Density Estimation

In this section, we consider the problem of estimating a Gaussian mixture with contaminated data,
$$X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P := (1 - \epsilon) P_{f_\pi} + \epsilon Q, \quad (22)$$
where the distribution $P_{f_\pi} \in \mathcal{P}_{M,d}$ has density function $f_\pi$ and $Q$ is an arbitrary distribution of contamination. The data generating process in (22) is recognized as Huber's contamination model (Huber, 1964). Robust density estimation with Huber contamination has been previously studied by Liu and Gao (2019); Zhang and Ren (2023); Humbert et al. (2022), where kernel density estimators are shown to estimate Hölder smooth density functions at optimal rates.

Our main goal is to estimate the Gaussian mixture $f_\pi$ under the Hellinger distance, since the Hellinger error of density estimation directly implies a regret bound for empirical Bayes learning in the Gaussian sequence model (Jiang and Zhang, 2009; Saha and Guntuboyina, 2020). To this end, we will first introduce a robust estimator that has a statistical guarantee under the total variation distance. This step is standard by the construction of Yatracos (1985), since the Huber contamination (22) is a special case of model misspecification under total variation. Details of Yatracos' estimator will be given in Appendix C.1. Its statistical guarantee is given by the following proposition.

Proposition 4.4 (Robust density estimation in TV). Consider the data generating process in (22).
Then, Yatracos' estimator $\hat{f}$ satisfies
$$\sup_{\pi, Q} \mathbb{E}\left[ \mathrm{TV}^2(f_\pi, \hat{f}) \right] \lesssim \epsilon^2 + \frac{\log^{d+1}(n)}{n},$$
where the expectation is under (22) and the supremum is taken over all $Q$ and $\pi$ such that $\mathrm{supp}(\pi) \subseteq [-M, M]^d$.

Note that Yatracos' estimator is a proper estimator in the sense that $\hat{f}$ itself is also a Gaussian location mixture with mixing distribution supported on $[-M, M]^d$. Thus, our Corollary 2.4 directly implies a minimax upper bound in Hellinger distance as follows.

Theorem 4.5 (Robust density estimation in Hellinger). Consider the data generating process in (22). Suppose $\delta > 0$. Then, Yatracos' estimator $\hat{f}$ satisfies
$$\sup_{\pi, Q} \mathbb{E}\left[ H^2(f_\pi, \hat{f}) \right] \lesssim \mathcal{E}^2(\epsilon, n), \quad (23)$$
where we define
$$\mathcal{E}^2(\epsilon, n) := \epsilon^{2\left( 1 - \frac{2 + \delta}{\log(\log(1/\epsilon) \vee e)} \right)} + \left( \frac{1}{n} \right)^{1 - o_d(1)}, \quad (24)$$
the expectation is under (22), the supremum is taken over all $Q$ and $\pi$ such that $\mathrm{supp}(\pi) \subseteq [-M, M]^d$, and $o_d(1)$ is a positive real-valued function of $n$ and $d$ which converges to zero as $n \to \infty$.

We note that estimating $f_\pi$ in Hellinger distance has previously been studied by Kim and Guntuboyina (2022); Saha and Guntuboyina (2020); Soloff et al. (2025) in the special case of (22) with $\epsilon = 0$. Compared with these results, it is likely that the second term $n^{-(1 - o_d(1))}$ in (24) can still be slightly improved. However, this would require techniques very different from our Corollary 2.4, and we leave it as future work. On the other hand, the first term $\epsilon^{2\left( 1 - \frac{2 + \delta}{\log(\log(1/\epsilon) \vee e)} \right)}$ in (24) can be shown to be optimal. The following result is obtained by applying the two-point argument in Chen et al. (2018) to the sharpness example used in Theorem 3.1.

Theorem 4.6 (Minimax lower bound on robust density estimation in Hellinger). Consider the data generating process in (22). Then, we have
$$\inf_{\hat{f}} \sup_{\pi, Q} \mathbb{E}\left[ H^2(f_\pi, \hat{f}) \right] \gtrsim \epsilon^{2\left( 1 - \frac{0.33}{\log(\log(1/\epsilon) \vee e)} \right)}, \quad (25)$$
where the expectation is under (22) and the supremum is taken over all $Q$ and $\pi$ such that $\mathrm{supp}(\pi) \subseteq [-M, M]^d$.

The Hellinger bound in Theorem 4.5 can be applied to empirical Bayes learning of Gaussian means with outliers. To motivate this application, consider the following Gaussian location model with prior $\pi$:
$$X \mid \theta \sim N(\theta, I_d), \qquad \theta \sim \pi.$$
With the knowledge of the prior, the (oracle) Bayes estimator with respect to the squared error loss is given by the posterior mean,
$$\hat{\theta}_\star(X) = X + \frac{\nabla f_\pi(X)}{f_\pi(X)}. \quad (26)$$
This formula is known as Tweedie's formula (Efron, 2011). Without the knowledge of $\pi$, an empirical Bayes estimator replaces the $f_\pi$ in (26) by its estimator,
$$\hat{\theta}(X) := X + \frac{\nabla \hat{f}(X)}{\hat{f}(X)}.$$
The regret (Saha and Guntuboyina, 2020; Soloff et al., 2025) of $\hat{\theta}(X)$ is quantified by
$$\mathbb{E}_{X \sim f_\pi} \left\| \hat{\theta}(X) - \hat{\theta}_\star(X) \right\|^2,$$
which is actually the Fisher divergence between $f_\pi$ and $\hat{f}$. In a typical empirical Bayes setting, one has i.i.d. observations generated by $f_\pi$. Here, we consider the more general data generating process in (22) that allows the presence of arbitrary outliers. This requires the estimator $\hat{f}$ to be robust, and thus Yatracos' estimator, satisfying the risk bound in Theorem 4.5, is adopted here. We note that the clean-data setting of the problem with $\epsilon = 0$ has been well studied in the literature (James et al., 1961; Efron and Morris, 1972, 1973; Johnstone, 2002; Ignatiadis and Sen, 2025), and the nonparametric maximum likelihood estimator (NPMLE) and sieve MLE are shown to achieve the parametric rate up to some logarithmic factor (Wong and Shen, 1995; Genovese and Wasserman, 2000; Ghosal and Van Der Vaart, 2001; Jiang and Zhang, 2009; Saha and Guntuboyina, 2020; Soloff et al., 2025). However, when $\epsilon > 0$, it is unclear whether the NPMLE still works in the presence of arbitrary outliers.
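As a concrete one-dimensional illustration of Tweedie's formula (26), the sketch below (our own, not from the paper; names are ours) verifies that $X + f_\pi'(X)/f_\pi(X)$ coincides with the posterior mean computed directly from Bayes' rule for a discrete prior, and includes the floor parameter `rho` used to regularize the denominator as discussed next.

```python
import numpy as np

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

locs, wts = np.array([-1.0, 0.0, 2.0]), np.array([0.3, 0.4, 0.3])  # discrete prior pi

def f_pi(x):
    """Marginal density of X under X | theta ~ N(theta, 1), theta ~ pi."""
    return (wts * phi(x - locs)).sum()

def grad_f_pi(x):
    """Derivative of the marginal density f_pi."""
    return (wts * (locs - x) * phi(x - locs)).sum()

def tweedie(x, rho=0.0):
    """Posterior mean via Tweedie's formula (26); rho > 0 floors the denominator."""
    return x + grad_f_pi(x) / max(f_pi(x), rho)

def posterior_mean(x):
    """Direct Bayes computation of E[theta | X = x]."""
    post = wts * phi(x - locs)
    return (locs * post).sum() / post.sum()

# Tweedie's formula reproduces the posterior mean exactly (up to rounding).
for x in (-2.0, 0.5, 3.0):
    assert abs(tweedie(x) - posterior_mean(x)) < 1e-12
```

Flooring the denominator by a large `rho` shrinks the correction term toward zero, which is the price paid for numerical stability where $\hat{f}$ is small.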
We suspect that the error rate of the NPMLE has a highly sub-optimal dependence on $\epsilon$. In terms of the technique for analyzing the regret bound, results in Jiang and Zhang (2009); Saha and Guntuboyina (2020); Soloff et al. (2025) and related papers crucially rely on the Hellinger control of the Fisher divergence. See Theorem E.1 of Saha and Guntuboyina (2020) for instance. Note that these works employed a regularized version of $\hat{\theta}(X)$ in the following form to avoid numerical instability when the denominator becomes close to zero:
$$\hat{\theta}_\rho(X) := X + \frac{\nabla \hat{f}(X)}{\hat{f}(X) \vee \rho}. \quad (27)$$
Following the same strategy, the result of Theorem 4.7 is an immediate consequence of Theorem 4.5.

Theorem 4.7 (Robust regret bound). Consider the data generating process in (22). Suppose $\hat{\theta}_\star(\cdot)$ is as in (26). Then, there exists $\rho = \rho(\epsilon, n) > 0$ such that $\hat{\theta}_\rho(\cdot)$ in (27), with $\hat{f}$ being Yatracos' estimator, satisfies
$$\sup_{\pi, Q} \mathbb{E}\left[ \mathbb{E}_{X \sim f_\pi} \left\| \hat{\theta}_\rho(X) - \hat{\theta}_\star(X) \right\|^2 \right] \lesssim \mathcal{E}^2(\epsilon, n), \quad (28)$$
where the outer expectation is under (22), the supremum is taken over all $Q$ and $\pi$ such that $\mathrm{supp}(\pi) \subseteq [-M, M]^d$, and the error function $\mathcal{E}^2(\epsilon, n)$ is defined as in (24).

See Appendix C.2 for detailed proofs of Theorem 4.3, Proposition 4.4, and Theorems 4.5, 4.6, and 4.7.

5 Discussion

We establish a sharp relation between the total variation and Hellinger distances in this paper. Our results are derived for $d$-dimensional isotropic Gaussian mixture models with a fixed covariance $I_d$. While we discuss implications for empirical Bayes methods, these procedures often involve a joint prior on location and covariance. Extending our results to heteroscedastic Gaussian mixtures is an interesting direction for future work. Another open problem closely related to our paper is the sharp relation between the total variation and the $L^2$ distances.
Resolving this question will have direct implications for nonparametric density estimation under the $L_2$ loss. Finally, establishing a sharp connection between the total variation distance and the Fisher divergence will further deepen the understanding of empirical Bayes procedures in the robust setting.

Acknowledgements

We thank Nikolaos Ignatiadis for fruitful discussions on the implications of our results for empirical Bayes.

References

Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l'estimation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 65(2):181–237.

Birgé, L. (1986). On estimating a density using Hellinger distance and some other strange facts. Probability Theory and Related Fields, 71(2):271–291.

Chen, M., Gao, C., and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber's contamination model. The Annals of Statistics, 46(5):1932–1960.

Dasgupta, S. (1999). Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 634–644. IEEE.

Efron, B. (2011). Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614.

Efron, B. and Morris, C. (1972). Empirical Bayes on vector observations: An extension of Stein's method. Biometrika, 59(2):335–347.

Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors—an empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130.

Gautschi, W. (1974). Norm estimates for inverses of Vandermonde matrices. Numerische Mathematik, 23(4):337–347.

Genovese, C. R. and Wasserman, L. (2000). Rates of convergence for the Gaussian mixture sieve. The Annals of Statistics, 28(4):1105–1127.

Ghosal, S. and van der Vaart, A. W. (2001).
Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics, pages 1233–1263.

Guillemin, V. and Sternberg, S. (2013). Semi-Classical Analysis. International Press, Boston, MA.

Haussler, D. and Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics, 25(6):2451–2492.

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101.

Humbert, P., Le Bars, B., and Minvielle, L. (2022). Robust kernel density estimation with median-of-means principle. In International Conference on Machine Learning, pages 9444–9465. PMLR.

Ignatiadis, N. and Sen, B. (2025). Empirical partially Bayes multiple testing and compound $\chi^2$ decisions. The Annals of Statistics, 53(1):1–36.

James, W., Stein, C., et al. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379. University of California Press.

Jia, Z., Polyanskiy, Y., and Wu, Y. (2023). Entropic characterization of optimal rates for learning Gaussian mixtures. In The Thirty Sixth Annual Conference on Learning Theory, pages 4296–4335. PMLR.

Jiang, W. and Zhang, C.-H. (2009). General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics, pages 1647–1684.

Johnstone, I. M. (2002). Function estimation and Gaussian sequence models. Unpublished manuscript, 2(5.3):2.

Kim, A. K. and Guntuboyina, A. (2022). Minimax bounds for estimating multivariate Gaussian location mixtures. Electronic Journal of Statistics, 16(1):1461–1484.

Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. The Annals of Statistics, pages 38–53.

Lindsay, B. G. (1995). Mixture models: Theory, geometry and applications.
In NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–163. JSTOR.

Liu, H. and Gao, C. (2019). Density estimation with contamination: minimax rates and theory of adaptation. Electronic Journal of Statistics, 13:3613–3653.

Lubinsky, D. S. (2007). A survey of weighted approximation for exponential weights. arXiv preprint math/0701099.

Ma, Y., Wu, Y., and Yang, P. (2025). On the best approximation by finite Gaussian mixtures. IEEE Transactions on Information Theory.

Maizlish, O. and Prymak, A. (2015). Convex polynomial approximation in $\mathbb{R}^d$ with Freud weights. Journal of Approximation Theory, 192:60–68.

Nevai, P. and Totik, V. (1987). Sharp Nikolskii inequalities with exponential weights. Analysis Mathematica, 13(4):261–267.

Polyanskiy, Y. and Wu, Y. (2025). Information Theory: From Coding to Learning. Cambridge University Press.

Saha, S. and Guntuboyina, A. (2020). On the nonparametric maximum likelihood estimator for Gaussian location mixture densities with application to Gaussian denoising. The Annals of Statistics, 48(2):738–762.

Soloff, J. A., Guntuboyina, A., and Sen, B. (2025). Multivariate, heteroscedastic empirical Bayes via nonparametric maximum likelihood. Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(1):1–32.

Szegő, G. (1939). Orthogonal Polynomials, volume 23. American Mathematical Society.

Watson, G. N. (1933). Notes on generating functions of polynomials: (2) Hermite polynomials. Journal of the London Mathematical Society, 1(3):194–199.

Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. The Annals of Statistics, pages 339–362.

Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, pages 1564–1599.

Yatracos, Y. G. (1985).
Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13(2):768–774.

Zhang, P. and Ren, Z. (2023). Adaptive minimax density estimation on $\mathbb{R}^d$ for Huber's contamination model. Information and Inference: A Journal of the IMA, 12(4):3042–3066.

A Proof of the Main Results

A.1 Preliminaries: Hermite Polynomials and Inequalities

This section has two main goals. The first is to develop an understanding of the Hilbert space $L^2(\mathbb{R}^d, \phi_d)$ using the Christoffel–Darboux kernel (Proposition A.2), which paves the way for the proofs of the Nikolskii-type inequality (Proposition A.6) and the restricted-range inequality (Proposition A.7). The second is to prove Proposition A.8, which is a key ingredient in the proof of our main result, Theorem 2.5.

The results in this section have important implications in quantum mechanics. However, we postpone their physical interpretation for the moment. We first proceed to prove Proposition A.8 and Theorem 2.5 without relying on any physics, and then return to discuss the physical meaning at the end.

The study of orthogonal polynomials has a long and rich history, encompassing works from Szegő (1939) to Lubinsky (2007), among many others. Results on multivariate polynomials are relatively limited and dispersed throughout diverse literatures, including theoretical mathematics and quantum physics, making a unified overview challenging. For the sake of keeping the present paper self-contained, we summarize the essential results in this section. We refer to the notations defined in Section 1.2 and fix $d \geq 1$ throughout this section.

Lemma A.1 (Hermite polynomial expansion). For $\theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d$ and $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, we have
\[
\frac{\phi_d(x - \theta)}{\phi_d(x)} = \sum_{k \in \mathbb{N}_0^d} \frac{\theta^k}{\sqrt{k!}} h_k(x),
\]
where we define
\[
\theta^k := \prod_{j=1}^{d} \theta_j^{k_j}, \qquad k! := \prod_{j=1}^{d} k_j!.
\]

Proof.
The one-dimensional version of this result is classical and easy to show; see, for example, Equation (5.5.7) of Szegő (1939). We can generalize to arbitrary dimensions as follows:
\[
\frac{\phi_d(x - \theta)}{\phi_d(x)} = \exp\Bigl( \langle \theta, x \rangle - \tfrac{1}{2} \|\theta\|_2^2 \Bigr) = \prod_{j=1}^{d} \exp\Bigl( \theta_j x_j - \tfrac{1}{2} \theta_j^2 \Bigr) = \prod_{j=1}^{d} \sum_{k_j = 0}^{\infty} \frac{\theta_j^{k_j}}{\sqrt{k_j!}} h_{k_j}(x_j).
\]
Expand the product to conclude the proof.

Proposition A.2 (Christoffel–Darboux kernel). For $n \in \mathbb{N}_0$, define the $n$-th Christoffel–Darboux kernel $K_n$ as
\[
K_n(x, y) := \sum_{|k| \leq n} h_k(x) h_k(y). \tag{29}
\]
Then, given $x \in \mathbb{R}^d$,

1. $K_n(x, \cdot) \in \Pi_n^d$.
2. $\langle f, K_n(x, \cdot) \rangle_{L^2(\mathbb{R}^d, \phi_d)} = f(x)$ holds for all $f \in \Pi_n^d$.

Proof. The first statement is obvious. By linearity, it suffices to prove the second statement when $f = h_k$ for some $|k| \leq n$, which is straightforward.

Proposition A.3 (Christoffel–Darboux function). Given $x \in \mathbb{R}^d$,
\[
\inf\Bigl\{ \|P\|_{L^2(\mathbb{R}^d, \phi_d)}^2 : P \in \Pi_n^d,\ P(x) = 1 \Bigr\} = \frac{1}{K_n(x, x)}. \tag{30}
\]

Proof. For $P \in \Pi_n^d$ such that $P(x) = 1$, write $P = \sum_{|k| \leq n} c_k h_k$, so that, by the Cauchy–Schwarz inequality,
\[
1 = \Bigl( \sum_{|k| \leq n} c_k h_k(x) \Bigr)^2 \leq \Bigl( \sum_{|k| \leq n} c_k^2 \Bigr) \Bigl( \sum_{|k| \leq n} h_k^2(x) \Bigr) = \|P\|_{L^2(\mathbb{R}^d, \phi_d)}^2 \, K_n(x, x),
\]
demonstrating the lower bound. The lower bound is attained by the polynomial $\frac{K_n(x, \cdot)}{K_n(x, x)} \in \Pi_n^d$.

In view of Proposition A.3, it is important to study an upper bound on the diagonal entries $K_n(x, x)$ of the C–D kernel. To achieve this, we first introduce a useful lemma.

Lemma A.4 (Mehler's formula). For $k \in \mathbb{N}_0^d$, define $E_k := 2|k| + d$. For $x, y \in \mathbb{R}^d$ and $t > 0$, define the Mehler kernel by
\[
M(x, y; t) := \sum_{k \in \mathbb{N}_0^d} e^{-t E_k} h_k(x) h_k(y) \, \phi_d^{1/2}(x) \phi_d^{1/2}(y). \tag{31}
\]
Then, we have the following closed-form formula:
\[
M(x, y; t) = (4\pi \sinh(2t))^{-d/2} \exp\biggl( -\frac{\|x\|_2^2 + \|y\|_2^2}{4 \tanh(2t)} + \frac{\langle x, y \rangle}{2 \sinh(2t)} \biggr). \tag{32}
\]
If $y = x$, in particular, then
\[
M(x, x; t) = (4\pi \sinh(2t))^{-d/2} \exp\Bigl( -\frac{\tanh(t)}{2} \|x\|_2^2 \Bigr). \tag{33}
\]

Proof. The right-hand side of (31) factorizes as
\[
\prod_{j=1}^{d} \sum_{k_j \in \mathbb{N}_0} e^{-t(2 k_j + 1)} h_{k_j}(x_j) h_{k_j}(y_j) \, \phi_1^{1/2}(x_j) \phi_1^{1/2}(y_j).
\]
Since the closed-form formula (32) can also be factorized in the same manner, it suffices to show (32) only for $d = 1$. There are many known proofs of the one-dimensional Mehler's formula. One such proof dates back (at least) to Watson (1933). Since it is quite short, we include it below. Recall the Fourier transform of $\phi_1$:
\[
\phi_1(x) = \frac{1}{2\pi} \int \exp\Bigl( -\frac{\xi^2}{2} + i x \xi \Bigr) d\xi.
\]
Hence, from the definition of $h_k$,
\[
h_k(x) \phi_1^{1/2}(x) = \frac{(-1)^k}{\sqrt{k!}} \phi_1^{-1/2}(x) \frac{d^k}{dx^k} \phi_1(x) = \frac{1}{2\pi \sqrt{k!}} \phi_1^{-1/2}(x) \int (-i\xi)^k \exp\Bigl( -\frac{\xi^2}{2} + i x \xi \Bigr) d\xi.
\]
In conclusion,
\begin{align*}
\sum_{k=0}^{\infty} e^{-t(2k+1)} h_k(x) h_k(y) \phi_1^{1/2}(x) \phi_1^{1/2}(y)
&= (2\pi)^{-3/2} \exp\Bigl( -t + \frac{x^2 + y^2}{4} \Bigr) \iint \exp\Bigl( -\frac{\xi^2 + \zeta^2}{2} + i x \xi + i y \zeta \Bigr) \sum_{k=0}^{\infty} \frac{(-e^{-2t} \xi \zeta)^k}{k!} \, d\xi \, d\zeta \\
&= (2\pi)^{-3/2} \exp\Bigl( -t + \frac{x^2 + y^2}{4} \Bigr) \iint \exp\Bigl( -\frac{\xi^2 + \zeta^2}{2} - e^{-2t} \xi \zeta + i x \xi + i y \zeta \Bigr) d\xi \, d\zeta \\
&= \bigl( 2\pi (1 - e^{-4t}) \bigr)^{-1/2} \exp\Bigl( -t + \frac{x^2 + y^2}{4} - \frac{x^2 + y^2 - 2 e^{-2t} x y}{2 (1 - e^{-4t})} \Bigr) \\
&= (4\pi \sinh(2t))^{-1/2} \exp\Bigl( -\frac{x^2 + y^2}{4 \tanh(2t)} + \frac{x y}{2 \sinh(2t)} \Bigr).
\end{align*}

We have derived the explicit form of Mehler's formula, which implies the following corollary.

Corollary A.5 (Upper bounds of the C–D kernel). Recall the definition (29) of the Christoffel–Darboux kernel $K_n(x, x)$. For $n \in \mathbb{N}_0$, define
\[
E_{n,d} := 2n + d, \qquad C_{n,d} := \biggl( \frac{(n+d)^{n+d}}{n^n d^d} \biggr)^{1/2}. \tag{34}
\]
Then, we have
\[
\sup_{x \in \mathbb{R}^d} K_n(x, x) \phi_d(x) \leq (2\pi)^{-d/2} C_{n,d}, \tag{35}
\]
\[
C_{n,d} \leq \Bigl( \frac{e (n+d)}{d} \Bigr)^{d/2} = O(n^{d/2}). \tag{36}
\]
Furthermore, for $\kappa > 1$,
\[
\int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} K_n(x, x) \phi_d(x) \leq \Bigl( \frac{e}{2d} \sqrt{\frac{\kappa}{\kappa - 1}} \Bigr)^{d/2} E_{n,d}^{d/2} \exp\bigl( -c(\kappa) E_{n,d} \bigr), \tag{37}
\]
where we define
\[
c(\kappa) := \sqrt{\kappa (\kappa - 1)} - \log\bigl( \sqrt{\kappa} + \sqrt{\kappa - 1} \bigr) > 0.
\]

Proof. The inequality (36) is straightforward. The other inequalities (35) and (37) can be derived from the Chernoff bound using Mehler's formula (Lemma A.4) as follows. For all $x \in \mathbb{R}^d$ and $t > 0$,
\begin{align*}
K_n(x, x) \phi_d(x) &= \sum_{|k| \leq n} h_k^2(x) \phi_d(x) && \text{(by (29))} \\
&\leq e^{t E_{n,d}} \sum_{|k| \leq n} e^{-t E_k} h_k^2(x) \phi_d(x) && (E_k \leq E_{n,d}) \\
&\leq e^{t E_{n,d}} M(x, x; t) && \text{(by (31))} \\
&= e^{t E_{n,d}} (4\pi \sinh(2t))^{-d/2} \exp\Bigl( -\frac{\tanh(t)}{2} \|x\|_2^2 \Bigr). && \text{(by (33))}
\end{align*}
Therefore,
\[
\sup_{x \in \mathbb{R}^d} K_n(x, x) \phi_d(x) \leq \inf_{t > 0} e^{t E_{n,d}} (4\pi \sinh(2t))^{-d/2} = (2\pi)^{-d/2} C_{n,d},
\]
where the infimum is attained at $t = \frac{1}{4} \log\bigl( 1 + \frac{d}{n} \bigr)$. Similarly, for all $t > 0$ and $0 < s < \tanh(t)$,
\begin{align*}
\int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} K_n(x, x) \phi_d(x)
&\leq e^{t E_{n,d}} (4\pi \sinh(2t))^{-d/2} \int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} \exp\Bigl( -\frac{\tanh(t)}{2} \|x\|_2^2 \Bigr) \\
&\leq \exp\bigl( (t - \kappa s) E_{n,d} \bigr) (4\pi \sinh(2t))^{-d/2} \int_{\mathbb{R}^d} \exp\Bigl( -\frac{\tanh(t) - s}{2} \|x\|_2^2 \Bigr) \\
&= \exp\bigl( (t - \kappa s) E_{n,d} \bigr) \bigl( 2 \sinh(2t) (\tanh(t) - s) \bigr)^{-d/2}.
\end{align*}
Now fix $t = \log\bigl( \sqrt{\kappa} + \sqrt{\kappa - 1} \bigr) > 0$, so that $\cosh(t) = \sqrt{\kappa}$ and $\sinh(t) = \sqrt{\kappa - 1}$. Let $s = \tanh(t) - \frac{d}{2 \kappa E_{n,d}}$ accordingly to deduce
\[
\int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} K_n(x, x) \phi_d(x) \leq \exp\Bigl( \frac{d}{2} - c(\kappa) E_{n,d} \Bigr) \biggl( \frac{2d \sqrt{\kappa (\kappa - 1)}}{\kappa E_{n,d}} \biggr)^{-d/2},
\]
which is the desired result. Note that the choice of $(t, s)$ is asymptotically optimal as $E_{n,d} \to \infty$.

We have derived upper bounds on the diagonal entries $K_n(x, x)$ of the C–D kernel. Using these bounds, we now introduce three norm inequalities in $\Pi_n^d$, stated as Propositions A.6, A.7, and A.8. The first is the Nikolskii-type inequality. In the case $d = 1$, the Nikolskii-type inequality has been extensively studied.
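As a quick numerical sanity check (ours, not part of the paper's argument), the diagonal bound (35) can be spot-checked in the case $d = 1$: with $h_k = \mathrm{He}_k / \sqrt{k!}$ for the probabilists' Hermite polynomials $\mathrm{He}_k$, the maximum of $K_n(x, x) \phi_1(x)$ over a grid should stay below $(2\pi)^{-1/2} C_{n,1}$. The degree $n = 8$ and the grid are arbitrary choices.

```python
import numpy as np
from math import factorial, sqrt, pi

def he_values(x, n):
    # Probabilists' Hermite polynomials He_0..He_n at x (vectorized),
    # via the recurrence He_{k+1}(x) = x * He_k(x) - k * He_{k-1}(x).
    vals = [np.ones_like(x), np.asarray(x, dtype=float)]
    for k in range(1, n):
        vals.append(x * vals[k] - k * vals[k - 1])
    return vals[: n + 1]

n = 8
xs = np.linspace(-10.0, 10.0, 4001)
H = he_values(xs, n)

# K_n(x, x) = sum_{k <= n} h_k(x)^2 with h_k = He_k / sqrt(k!)  (d = 1).
K = sum(H[k] ** 2 / factorial(k) for k in range(n + 1))
lhs = float((K * np.exp(-xs**2 / 2) / sqrt(2 * pi)).max())  # sup_x K_n(x,x) phi_1(x) on the grid

C_n1 = sqrt((n + 1) ** (n + 1) / n**n)  # C_{n,1} from (34) with d = 1
rhs = C_n1 / sqrt(2 * pi)               # the bound (35)
assert lhs <= rhs
```

The Chernoff-type bound (35) is not tight for fixed $n$, so the observed maximum sits comfortably below the bound.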
For instance, the paper by Nevai and Totik (1987) focuses on the one-dimensional setting and establishes the sharpness of Nikolskii-type inequalities (with more general weight functions). Note that the Mhaskar–Rakhmanov–Saff (MRS) number $a_n$ discussed in that paper is linearly comparable to $\sqrt{2 E_{n,d}}$, the threshold. The second is the restricted-range inequality. Likewise, in the one-dimensional setting, the restricted-range inequality has been studied in great depth; see Chapter 6 of the survey Lubinsky (2007). For higher dimensions, a few results are known as well; for example, see Lemma 5 of Maizlish and Prymak (2015). The third, to the best of our knowledge, does not have a standard name, but it can be derived as a combination of the preceding two and will play an essential role in our main result.

Proposition A.6 (Nikolskii-type inequality). Recall the definition (34) of $C_{n,d}$. For all $P \in \Pi_n^d$, we have
\[
\sup_{x \in \mathbb{R}^d} \bigl| P(x) \phi_d^{1/2}(x) \bigr| \leq (2\pi)^{-d/4} C_{n,d}^{1/2} \|P\|_{L^2(\mathbb{R}^d, \phi_d)}.
\]

Proof. According to Proposition A.3 and Corollary A.5, it holds for all $x \in \mathbb{R}^d$ that
\begin{align*}
P^2(x) \phi_d(x) &\leq (2\pi)^{-d/2} C_{n,d} \frac{P^2(x)}{K_n(x, x)} && \text{(by (35))} \\
&\leq (2\pi)^{-d/2} C_{n,d} \|P\|_{L^2(\mathbb{R}^d, \phi_d)}^2. && \text{(by (30))}
\end{align*}
Take square roots of both sides to conclude the proof.

Proposition A.7 (Restricted-range inequality). Recall the definition (34) of $E_{n,d}$. Suppose $\kappa > 1$. Then, there exists a constant $A = A(\kappa)$, depending only on $\kappa$, such that, if $E_{n,d} \geq A d$, then, for all $P \in \Pi_n^d$, we have
\[
\int_{\mathbb{R}^d} P^2 \phi_d \leq 2 \int_{\|x\|_2 \leq \sqrt{2 \kappa E_{n,d}}} P^2(x) \phi_d(x).
\]

Proof. Suppose
\[
\frac{E_{n,d}}{d} \geq \frac{1}{c(\kappa)} \log\biggl( \frac{e}{c(\kappa)} \sqrt{\frac{\kappa}{\kappa - 1}} \vee e \biggr) =: A(\kappa), \tag{38}
\]
where we define $c(\kappa)$ as in Corollary A.5. For $P \in \Pi_n^d$, write $P = \sum_{|k| \leq n} c_k h_k$, so that $\int P^2 \phi_d = \sum_{|k| \leq n} c_k^2$.
We have
\[
\int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} P^2(x) \phi_d(x) = \sum_{|k| \leq n} \sum_{|l| \leq n} c_k M_{kl} c_l,
\]
where we define
\[
M_{kl} := \int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} h_k(x) h_l(x) \phi_d(x).
\]
Here, $M = (M_{kl})$ is a $(\dim \Pi_n^d) \times (\dim \Pi_n^d)$ positive semi-definite matrix. Thus, every eigenvalue of $M$ is at most its trace. That is,
\[
\int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} P^2(x) \phi_d(x) \leq \Bigl( \int_{\mathbb{R}^d} P^2 \phi_d \Bigr) \operatorname{trace}(M).
\]
It suffices to show that the trace is at most $\frac{1}{2}$. By the definition (29) of the Christoffel–Darboux kernel,
\[
\operatorname{trace}(M) = \sum_{|k| \leq n} M_{kk} = \int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} K_n(x, x) \phi_d(x). \tag{39}
\]
Note that $z \geq 2 \log(a \vee e)$ implies $a z \leq e^z$. Thus, the assumption (38) implies
\[
\frac{e}{c(\kappa)} \sqrt{\frac{\kappa}{\kappa - 1}} \cdot \frac{2 c(\kappa)}{d} E_{n,d} \leq \exp\Bigl( \frac{2 c(\kappa)}{d} E_{n,d} \Bigr). \tag{40}
\]
In conclusion,
\begin{align*}
\operatorname{trace}(M) &\leq \Bigl( \frac{e}{2d} \sqrt{\frac{\kappa}{\kappa - 1}} E_{n,d} \Bigr)^{d/2} \exp\bigl( -c(\kappa) E_{n,d} \bigr) && \text{(by (37) and (39))} \\
&\leq 2^{-d}. && \text{(by (40))}
\end{align*}

The following Proposition A.8 is simply a combination of the two preceding Propositions A.6 and A.7, and it plays a central role in the proof of our main result.

Proposition A.8 (Asymptotic lower bound of the $L^1(\mathbb{R}^d, \phi_d)$-norm in $\Pi_n^d$). Recall the definition (34) of $E_{n,d}$ and $C_{n,d}$. Define
\[
c_{n,d} := \inf\Bigl\{ \|P\|_{L^1(\mathbb{R}^d, \phi_d)} : P \in \Pi_n^d,\ \|P\|_{L^2(\mathbb{R}^d, \phi_d)} = 1 \Bigr\}. \tag{41}
\]
If the assumption (38) of the previous Proposition A.7 holds, then
\[
c_{n,d} \geq \frac{1}{2} C_{n,d}^{-1/2} e^{-\kappa E_{n,d} / 2}. \tag{42}
\]

Proof. For $P \in \Pi_n^d$,
\begin{align*}
\|P\|_{L^2(\mathbb{R}^d, \phi_d)}^2 &\leq 2 \int_{\|x\|_2 \leq \sqrt{2 \kappa E_{n,d}}} P^2(x) \phi_d(x) && \text{(by Proposition A.7)} \\
&\leq 2 \sup_{\|x\|_2 \leq \sqrt{2 \kappa E_{n,d}}} \bigl| \phi_d^{-1/2}(x) \bigr| \, \sup_{x \in \mathbb{R}^d} \bigl| P(x) \phi_d^{1/2}(x) \bigr| \int_{\mathbb{R}^d} |P \phi_d| \\
&\leq 2 \bigl( (2\pi)^{d/4} e^{\kappa E_{n,d} / 2} \bigr) \bigl( (2\pi)^{-d/4} C_{n,d}^{1/2} \|P\|_{L^2(\mathbb{R}^d, \phi_d)} \bigr) \|P\|_{L^1(\mathbb{R}^d, \phi_d)}. && \text{(by Proposition A.6)}
\end{align*}
Cancel out $\|P\|_{L^2(\mathbb{R}^d, \phi_d)}$ from both sides to prove the inequality (42).

Corollary A.9. Recall the definition (41) of $c_{n,d}$. Suppose $\kappa_1 > 1$.
Then, there exists a constant $A_1 = A_1(\kappa_1)$, depending only on $\kappa_1$, such that, if $n \geq A_1 d$, then we have $c_{n,d} \geq 3 e^{-\kappa_1 n}$.

Proof. Suppose
\[
\frac{n}{d} \geq \inf_{\kappa} \biggl\{ 1 \vee \frac{A(\kappa)}{2} \vee \frac{1}{2 (\kappa_1 - \kappa)} \log\Bigl( \frac{3}{8} e^{\frac{1 + 2\kappa}{2 (\kappa_1 - \kappa)}} \vee e \Bigr) \biggr\} =: A_1(\kappa_1), \tag{43}
\]
where we define $A(\kappa)$ as in (38), and the infimum is taken with respect to $\kappa$ such that $1 < \kappa < \kappa_1$. Recall that $z \geq 2 \log(a \vee e)$ implies $a z \leq e^z$. Thus, the assumption (43) implies
\[
\frac{3}{8} e^{\frac{1 + 2\kappa}{2 (\kappa_1 - \kappa)}} \cdot \frac{4 (\kappa_1 - \kappa) n}{d} \leq \exp\Bigl( \frac{4 (\kappa_1 - \kappa) n}{d} \Bigr). \tag{44}
\]
In conclusion,
\begin{align*}
c_{n,d} &\geq \frac{1}{2} C_{n,d}^{-1/2} e^{-\kappa E_{n,d} / 2} && \text{(by (42))} \\
&\geq \frac{1}{2} \Bigl( \frac{e^{1 + 2\kappa} (n + d)}{d} \Bigr)^{-d/4} \exp(-\kappa n) && \text{(by (36))} \\
&\geq \frac{1}{3} \Bigl( \frac{2 e^{1 + 2\kappa} n}{d} \Bigr)^{-d/4} \exp(-\kappa n) && (\because n \geq d) \\
&\geq 3 \cdot 2^{d-1} \exp(-\kappa_1 n). && \text{(by (44))}
\end{align*}
Since $E_{n,d} = 2n + d \geq 2n$, the assumption (43) also implies the assumption (38) of Proposition A.7.

We have derived all the preliminary results required for the proof of our main theorem. Lastly, we introduce one technical lemma to conclude this section.

Lemma A.10 (Lambert $W$ function). Given $\kappa_2 > 1$, $B_0 \geq 1$, and $t \in (0, 1)$, define
\[
w_0 := 1 \vee \frac{2}{\kappa_2 - 1} \log\Bigl( \frac{B_0}{\kappa_2 - 1} \vee e \Bigr), \tag{45}
\]
\[
n_0 := \biggl\lceil 2 B_0 e^{w_0} \vee \frac{2 \kappa_2 \log(1/t)}{\log(\log(1/t) \vee e)} \biggr\rceil.
\]
Then, it holds for all $n \geq n_0$ that
\[
\Bigl( \frac{2 B_0}{n + 1} \Bigr)^{(n+1)/2} \leq t. \tag{46}
\]

Proof. Let $w > 0$ be the unique positive real number such that $\log(1/t) = B_0 w e^w$. Then,
\[
\Bigl( \frac{2 B_0}{2 B_0 e^w} \Bigr)^{B_0 e^w} = t.
\]
Since the function $z \mapsto (2 B_0 / z)^{z/2}$ is decreasing for $z > 2 B_0 / e$, it suffices to show $n + 1 \geq 2 B_0 e^w$ to prove the inequality (46). We divide the argument into two cases, (a) $w < w_0$ and (b) $w \geq w_0$. In case (a) $w < w_0$, it is obvious that $n + 1 \geq n_0 + 1 \geq 2 B_0 e^{w_0} \geq 2 B_0 e^w$. Hence, we now suppose (b) $w \geq w_0$. Recall that $z \geq 2 \log(a \vee e)$ implies $a z \leq e^z$. Thus, (45) implies
\[
\frac{B_0}{\kappa_2 - 1} \cdot (\kappa_2 - 1) w \leq \exp\bigl( (\kappa_2 - 1) w \bigr). \tag{47}
\]
Furthermore, since $B_0 \geq 1$ and $w_0 \geq 1$, we have $\log(1/t) = B_0 w e^w \geq e$ and
\[
n + 1 \geq \frac{2 \kappa_2 \log(1/t)}{\log(\log(1/t) \vee e)} = \frac{2 \kappa_2 B_0 w e^w}{\log(B_0 w e^w)} \geq 2 B_0 e^w,
\]
where the last inequality is equivalent to (47).

A.2 Proof of the Main Theorem

We have already shown in the main text that Theorem 2.5 implies Theorem 2.1. Therefore, we proceed to prove Theorem 2.5 here.

Proof of Theorem 2.5. Let $\kappa_1 > 1$ and $\kappa_2 > 1$ satisfy $2 \kappa_1 \kappa_2 = 2 + \delta$. First, in view of Corollary A.9, there exists a positive integer $A_1 = A_1(\kappa_1)$, depending only on $\kappa_1$, such that
\[
n \geq A_1 d \implies c_{n,d} \geq 3 e^{-\kappa_1 n}. \tag{48}
\]
Let $t := \frac{1}{2} \|g\|_{L^1(\phi_d)} = \mathrm{TV}(f_{\pi}, f_{\eta}) \in (0, 1)$. In view of Lemma A.10, define
\[
n := A_1 d \vee B \vee \biggl\lceil \frac{2 \kappa_2 \log(1/t)}{\log(\log(1/t) \vee e)} \biggr\rceil \in \mathbb{N}_0, \tag{49}
\]
where
\[
B_0 = B_0(\kappa_1, M^2 d) := \bigl( 1 \vee 2 e M^2 d \bigr) e^{2 \kappa_1}, \tag{50}
\]
\[
B = B(\kappa_1, \kappa_2, M^2 d) := \biggl\lceil 2 B_0 \exp\Bigl( 1 \vee \frac{2}{\kappa_2 - 1} \log\Bigl( \frac{B_0}{\kappa_2 - 1} \vee e \Bigr) \Bigr) \biggr\rceil. \tag{51}
\]
Observe from Lemma A.1 that
\[
g = \sum_{k \in \mathbb{N}_0^d} \frac{\Delta_k}{\sqrt{k!}} h_k, \qquad \Delta_k = \int_{\mathbb{R}^d} \theta^k \, d(\pi - \eta)(\theta).
\]
We decompose $g = q + r$, where
\[
q = \sum_{|k| \leq n} \frac{\Delta_k}{\sqrt{k!}} h_k \in \Pi_n^d, \qquad r = \sum_{|k| > n} \frac{\Delta_k}{\sqrt{k!}} h_k.
\]
From the compactness of the support, $|\Delta_k| \leq 2 (2M)^{|k|}$. Thus, by the multinomial theorem and Stirling's formula,
\[
\sum_{|k| = m} \frac{\Delta_k^2}{k!} \leq \sum_{|k| = m} \frac{4 (4 M^2)^m}{k!} = \frac{4 (4 M^2 d)^m}{m!} \leq \frac{4}{\sqrt{2 \pi m}} \Bigl( \frac{4 e M^2 d}{m} \Bigr)^m. \tag{52}
\]
It follows from the definition (49) that
\[
n + 1 \geq 2 B_0 e \geq 2 \bigl( 1 \vee 2 e M^2 d \bigr) e^{1 + 2 \kappa_1} \geq 16 \vee 8 e M^2 d.
\]
Thus,
\begin{align*}
\|r\|_{L^2(\phi_d)}^2 = \sum_{|k| > n} \frac{\Delta_k^2}{k!}
&\leq \sum_{m = n+1}^{\infty} \frac{4}{\sqrt{2 \pi (n + 1)}} \Bigl( \frac{4 e M^2 d}{n + 1} \Bigr)^m && \text{(by (52))} \\
&\leq \sum_{m = n+1}^{\infty} \frac{1}{2^{m - n - 1} \sqrt{2 \pi}} \Bigl( \frac{4 e M^2 d}{n + 1} \Bigr)^{n+1} && (\because n + 1 \geq 16 \vee 8 e M^2 d) \\
&\leq \Bigl( \frac{4 e M^2 d}{n + 1} \Bigr)^{n+1}. && (\because 2 \leq \sqrt{2 \pi})
\end{align*}
It follows from the definition (50) of $B_0$ that $4 e M^2 d \leq 2 B_0 e^{-2 \kappa_1}$. Hence, by Lemma A.10,
\[
\|r\|_{L^2(\phi_d)} \leq \Bigl( \frac{2 B_0 e^{-2 \kappa_1}}{n + 1} \Bigr)^{(n+1)/2} \leq e^{-\kappa_1 n} t \leq \frac{1}{2} e^{-\kappa_1 n} \|g\|_{L^2(\phi_d)}. \tag{53}
\]
The last inequality follows from Hölder's inequality $\|g\|_{L^1(\phi_d)} \leq \|g\|_{L^2(\phi_d)}$. We define $c_0 = c_0(\kappa_1, \kappa_2, M, d) := e^{-\kappa_1 (A_1 d \vee B)}$ and conclude that
\begin{align*}
2t = \|g\|_{L^1(\phi_d)} &\geq \|q\|_{L^1(\phi_d)} - \|r\|_{L^1(\phi_d)} && (\because g = q + r) \\
&\geq c_{n,d} \|q\|_{L^2(\phi_d)} - \|r\|_{L^2(\phi_d)} && \text{(by (41))} \\
&\geq c_{n,d} \|g\|_{L^2(\phi_d)} - 2 \|r\|_{L^2(\phi_d)} && (\because c_{n,d} \leq 1) \\
&\geq 3 e^{-\kappa_1 n} \|g\|_{L^2(\phi_d)} - e^{-\kappa_1 n} \|g\|_{L^2(\phi_d)} && \text{(by (48) and (53))} \\
&\geq 2 \exp\biggl( -\kappa_1 \Bigl( A_1 d \vee B \vee \frac{2 \kappa_2 \log(1/t)}{\log(\log(1/t) \vee e)} \Bigr) \biggr) \|g\|_{L^2(\phi_d)} \\
&\geq 2 \bigl( c_0 \wedge t^{\alpha(t)} \bigr) \|g\|_{L^2(\phi_d)},
\end{align*}
where
\[
\alpha(t) = \frac{2 \kappa_1 \kappa_2}{\log(\log(1/t) \vee e)}.
\]
Letting $C_0 := c_0^{-1}$ gives the desired result $\|g\|_{L^2(\phi_d)} \leq \bigl( C_0 \vee t^{-\alpha(t)} \bigr) t$.

A.3 Dependency of the Constant

In this section, we discuss how the constant $C_0$ in the main Theorems 2.1 and 2.5 depends on the radius $M$ and the dimension $d$. In short, $\log(C_0)$ has a polynomial order in $M^2 d$, and it is "nearly" linear in the regime where $\delta \to \infty$.

Proposition A.11 (Dependency of $C_0$ on $M$ and $d$). The constants $C_0 = C_0(\delta, M, d)$ in Theorems 2.1 and 2.5 coincide. Moreover, if we define $A_1 = A_1(\kappa_1)$ and $B = B(\kappa_1, \kappa_2, M^2 d)$ as in (43) and (51), respectively, then we can specify the constant as
\[
\log(C_0) := \inf_{2 \kappa_1 \kappa_2 = 2 + \delta} \kappa_1 \bigl( A_1 d \vee B \bigr),
\]
where the infimum is taken with respect to $\kappa_1, \kappa_2 > 1$ such that $2 \kappa_1 \kappa_2 = 2 + \delta$.

Proof. The definition (43) of $A_1 = A_1(\kappa_1)$ reflects the assumption of Corollary A.9, which is required to meet the conditions of Propositions A.7 and A.8 and to guarantee that $c_{n,d}$ defined in (41) is not less than $3 e^{-\kappa_1 n}$, as demonstrated in Corollary A.9. On the other hand, the definitions (50) and (51) of $B_0$ and $B$ reflect Lemma A.10, which is essential to control the tail norm $\|r\|_{L^2(\phi_d)}$ of $g = \frac{f_{\pi} - f_{\eta}}{\phi_d}$. We give a more detailed discussion below.
The first observation is that once $\kappa_1 > 1$ is fixed, $A_1$ is merely a universal constant. This shows that $\log(C_0)$ must depend on the dimension $d$ at least linearly. In contrast, the behavior of $B_0$ and $B$ described in (50) and (51) is more intricate. It suffices to consider the regime where $2 e M^2 d > 1$, because if the radius $M$ of the support is too small, we can simply embed the support into a larger cube. Therefore, once $\kappa_1$ is fixed, we have $B_0 \asymp M^2 d$. If in (51) we were allowed to take $\kappa_2$ sufficiently large, then we would obtain $\log(C_0) \asymp B_0 \asymp M^2 d$. However, this cannot be achieved in the regime where $\delta > 0$ is fixed and $M^2 d$ is large. In such a situation, we have the following polynomial rate:
\[
\log(C_0) \asymp (M^2 d)^{\frac{\kappa_2 + 1}{\kappa_2 - 1}}.
\]
If $\delta > 0$ is taken sufficiently large, the polynomial order in $M^2 d$ may recover the limit $\frac{\kappa_2 + 1}{\kappa_2 - 1} \to 1$.

A.4 Physical Interpretation: Quantum Harmonic Oscillator

In this section, we provide a physical interpretation of the restricted-range inequality, Proposition A.7. A classical Hamiltonian of a particle in $\mathbb{R}^d$ is given by
\[
H_{\mathrm{cl}} = \frac{1}{2} \|\xi\|_2^2 + V(x),
\]
where $\xi$ and $x$ are the momentum and position of the particle, respectively. The classical harmonic oscillator is defined by the potential energy $V(x) := \frac{1}{2} \|x\|_2^2$. The quantum-mechanical analog of the Hamiltonian is given by the following differential operator:
\[
H = -\frac{\hbar^2}{2} \nabla^2 + V : \psi \mapsto -\frac{\hbar^2}{2} \Bigl( \frac{\partial^2}{\partial x_1^2} + \cdots + \frac{\partial^2}{\partial x_d^2} \Bigr) \psi + \frac{1}{2} \bigl( x_1^2 + \cdots + x_d^2 \bigr) \psi.
\]
Here $\psi : \mathbb{R}^d \to \mathbb{R}$ is a wave function and $\hbar > 0$ is a constant closely related to the Planck constant, while we assume natural (mathematical) length and energy scales.

Proposition A.12 (Isotropic quantum harmonic oscillator). For $k \in \mathbb{N}_0^d$, define the Hermite function as
\[
\psi_k(x) := \Bigl( \frac{2}{\hbar} \Bigr)^{d/4} h_k\biggl( \sqrt{\frac{2}{\hbar}}\, x \biggr) \phi_d^{1/2}\biggl( \sqrt{\frac{2}{\hbar}}\, x \biggr).
\]
Then,

1. $H$ is a self-adjoint operator.
2. (normalization) $\|\psi_k\|_{L^2(\mathbb{R}^d)} = 1$.
3.
(Schrödinger equation) $H \psi_k = E_k \psi_k$, where the eigenvalue is $E_k = \frac{\hbar}{2} (2|k| + d)$.
4. $\{\psi_k\}$ consists entirely of eigenfunctions of $H$. Moreover, if we define the Mehler kernel $M(x, y; t) := \sum_{k \in \mathbb{N}_0^d} e^{-t E_k} \psi_k(x) \psi_k(y)$ for $t > 0$, then
\[
M(x, y; t) = \bigl( 2 \pi \hbar \sinh(\hbar t) \bigr)^{-d/2} \exp\biggl( -\frac{\|x\|_2^2 + \|y\|_2^2}{2 \hbar \tanh(\hbar t)} + \frac{\langle x, y \rangle}{\hbar \sinh(\hbar t)} \biggr).
\]

Proof. See Lemma A.4.

Remark A.13. The eigenvalue $E_k$ is the energy level of the state $k$. A complex-analytical analog of the Mehler kernel is the Feynman propagator, where $t > 0$ represents inverse temperature.

For the sake of the preceding proofs, we are only interested in the special case $\hbar = 2$, in which $\psi_k = h_k \phi_d^{1/2}$ and $E_k = 2|k| + d$. Recall that Corollary A.5 describes upper bounds of the quantity $K_n(x, x) \phi_d(x)$ involving the diagonal entries of the Christoffel–Darboux kernel (29). The quantity can be rewritten as
\[
K_n(x, x) \phi_d(x) = \sum_{E_k \leq E_{n,d}} \psi_k^2(x), \tag{54}
\]
where $E_{n,d} = 2n + d$ as in (34). Thus, (54) represents the diagonal entries of the low-energy spectral projector kernel and explains the spatial density of states (DOS). As such, the local Weyl law states that, given $x \in \mathbb{R}^d$, in the classical regime where $E_{n,d} \to \infty$, we have
\[
\sum_{E_k \leq E_{n,d}} \psi_k^2(x) \to (4\pi)^{-d} \int_{H_{\mathrm{cl}} \leq E_{n,d}} d\xi = (4\pi)^{-d} \omega_d \bigl( 2 E_{n,d} - \|x\|_2^2 \bigr)^{d/2},
\]
where $\omega_d$ is the volume of the $d$-dimensional unit (Euclidean) ball. Therefore, in the classically forbidden region where $\|x\|_2 > \sqrt{2 E_{n,d}}$, i.e., where the potential energy exceeds the mechanical energy, we expect the quantity (54) to converge to zero as $n \to \infty$. The tail bound (37) is the mathematically rigorous version of this intuition. Refer to Guillemin and Sternberg (2013) for further details.

B Proof of the Sharpness

This section completes the proof of our sharpness result by proving Lemma 3.2 and Corollary 3.3.
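Before the formal preliminaries, we note that the Chebyshev-polynomial identities used in Lemma B.2 below are easy to spot-check numerically. For instance, the following sketch (ours; the values $n = 11$ and $t = 0.7$ are arbitrary test choices) verifies the closed form of $|T_{n+1}(t\sqrt{-1})|$:

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

n, t = 11, 0.7
m = n + 1
s = np.sqrt(t**2 + 1)

# T_m evaluated at the purely imaginary point t*sqrt(-1),
# using the Chebyshev coefficient vector (0, ..., 0, 1) of length m + 1.
Tm = cheb.chebval(1j * t, [0] * m + [1])

# Claimed closed form: |T_m(it)| = {(t + s)^m + (t - s)^m} / 2, with s = sqrt(t^2 + 1).
closed_form = ((t + s) ** m + (t - s) ** m) / 2
assert abs(abs(Tm) - abs(closed_form)) < 1e-9 * abs(closed_form)
```

Since $(t + s)(s - t) = 1$, the second summand is $(\pm 1)^m (t + s)^{-m}$, so the closed form grows like $(t + s)^{m}/2$; this growth rate is what drives the Vandermonde bound in statement 3.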
B.1 Preliminaries: Chebyshev Polynomials and Lemmas

Lemma B.1. Suppose $|\Delta_k| \leq 2 b^k$ holds for all $k \in \mathbb{N}$. Then, there exists $N \in \mathbb{N}$ such that
\[
n \geq N \vee (2.77) b^2 \implies \sum_{k = n+1}^{\infty} \frac{\Delta_k^2}{k!} \leq \Bigl( \frac{e b^2}{n + 1} \Bigr)^{n+1}.
\]

Proof. According to Stirling's formula, there exists $N \in \mathbb{N}$, not depending on $b$, such that, if $n \geq N$,
\[
\frac{\Delta_{n+\ell}^2}{(n+\ell)!} \leq \frac{4 b^{2(n+\ell)}}{(n+\ell)!} \leq \Bigl( 1 - \frac{e}{2.77} \Bigr) \Bigl( \frac{e b^2}{n + \ell} \Bigr)^{n+\ell}
\]
holds for $\ell \geq 1$. If we assume further that $n \geq (2.77) b^2$, then
\[
\sum_{\ell = 1}^{\infty} \Bigl( 1 - \frac{e}{2.77} \Bigr) \Bigl( \frac{e b^2}{n + \ell} \Bigr)^{n+\ell} \leq \sum_{\ell = 1}^{\infty} \Bigl( 1 - \frac{e}{2.77} \Bigr) \Bigl( \frac{e}{2.77} \Bigr)^{\ell - 1} \Bigl( \frac{e b^2}{n + 1} \Bigr)^{n+1} = \Bigl( \frac{e b^2}{n + 1} \Bigr)^{n+1}.
\]

Lemma B.2 (Chebyshev polynomials of the first kind). Let $n \geq 11$ and let $\theta_j = \cos\bigl( \frac{2j + 1}{2n + 2} \pi \bigr)$, $j = 0, \ldots, n$, be the zeros of the Chebyshev polynomial of the first kind, $T_{n+1}(x)$, of degree $n + 1$. Then,

1. $|T_{n+1}(t \sqrt{-1})| = \bigl\{ (t + \sqrt{t^2 + 1})^{n+1} + (t - \sqrt{t^2 + 1})^{n+1} \bigr\} / 2$ holds for $t > 0$.
2. $z_n = \sqrt{-n/2.77}$ satisfies $\frac{1}{2^n |z_n|^{n+1}} |T_{n+1}(z_n)| < 2$.
3. $\|V_{n+1}^{-1}\|_{\infty} \leq \frac{(1 + \sqrt{2})^{n+1}}{n + 1}$, where
\[
V_{n+1} = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ \theta_0^n & \cdots & \theta_n^n \end{pmatrix}
\]
is the $(n+1) \times (n+1)$ Vandermonde matrix involving $\theta_0, \ldots, \theta_n$.

Proof. First, applying de Moivre's formula to the definition (7) gives
\[
T_{n+1}(x) = \frac{1}{2} \bigl( \zeta^{n+1} + \zeta^{-(n+1)} \bigr), \qquad \text{where } x \in \mathbb{C} \text{ and } \zeta = x \pm \sqrt{x^2 - 1}.
\]
(No matter which branch is chosen for the square root, the two summands are reciprocal to each other.) Second, if $z_n = \sqrt{-n/2.77}$, then
\[
\frac{1}{2^n |z_n|^{n+1}} |T_{n+1}(z_n)| = \biggl( \frac{1 + \sqrt{1 + 2.77/n}}{2} \biggr)^{n+1} + \biggl( \frac{1 - \sqrt{1 + 2.77/n}}{2} \biggr)^{n+1} \to \exp\Bigl( \frac{2.77}{4} \Bigr) < 2,
\]
as $n \to \infty$. (A more careful computation shows that $n \geq 11$ is sufficient.) Finally, according to Example 6.2 of Gautschi (1974), we have
\[
\|V_{n+1}^{-1}\|_{\infty} \leq \frac{3^{3/4}}{2 (n + 1)} |T_{n+1}(\sqrt{-1})| \leq \frac{(1 + \sqrt{2})^{n+1}}{n + 1}.
\]

Lemma B.3. Let $n$ be a positive odd number. Then,
\[
\max\Bigl\{ \frac{(n/2.77)^{\ell}}{(2\ell)!!} : \ell = 0, \ldots
, \frac{n - 1}{2} \Bigr\} \leq \exp\Bigl( \frac{n}{5.54} \Bigr),
\]
where $(2\ell)!!$ denotes a double factorial.

Proof. For $\ell \geq 1$, we have $(2\ell)!! = 2^{\ell} \ell!$ and
\[
\frac{(n/2.77)^{\ell}}{2^{\ell} \ell!} \leq \Bigl( \frac{e n}{5.54 \, \ell} \Bigr)^{\ell} \leq \exp\Bigl( \frac{n}{5.54} \Bigr).
\]
The first inequality holds by Stirling's formula, and the second one is obtained by optimizing with respect to $\ell$ over the positive reals; the optimal value is attained at $\ell = n / 5.54$.

B.2 Proofs

We now proceed to prove Lemma 3.2 and Corollary 3.3.

Proof of Lemma 3.2. We solve the following linear system:
\[
\begin{pmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & a^n \end{pmatrix}
\begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ \theta_0^n & \cdots & \theta_n^n \end{pmatrix}
\begin{pmatrix} w_0 \\ \vdots \\ w_n \end{pmatrix}
=
\begin{pmatrix} \Delta_0 \\ \vdots \\ \Delta_n \end{pmatrix}.
\]
By the third statement of Lemma B.2, we have $|w_j| \leq \|V_{n+1}^{-1}\|_{\infty} a^{-n} \Delta_n \leq \frac{1}{n+1}$ for all $j$. Indeed, $\pi_n$ and $\eta_n$ are valid probability measures supported on $[-M, M]$, since $\sum_{j=0}^{n} w_j = \Delta_0 = 0$. We also have
\[
\Delta_k = \sum_{j=0}^{n} w_j (a \theta_j)^k = \int \theta^k \, d(\pi_n - \eta_n)(\theta),
\]
for $k = 0, 1, \ldots, n$. Lemma B.3 verifies that (11) holds for all $0 \leq k \leq n$. We will now use mathematical induction to show that, in fact, (11) holds for all $k \geq 0$. Let $K \geq n$ and assume the induction hypothesis (11) to be true for all $k \leq K$. Recall that
\[
T_{n+1}(x) = 2^n \bigl( x^{n+1} - \sigma_2 x^{n-1} + \sigma_4 x^{n-3} - \cdots + (-1)^{(n+1)/2} \sigma_{n+1} \bigr),
\]
where $\sigma_m$ denotes the $m$-th elementary symmetric function of the zeros $\theta_0, \ldots, \theta_n$. Since $T_{n+1}(\theta_j) = 0$,
\[
(a \theta_j)^{K+1} = \sigma_2 a^2 (a \theta_j)^{K-1} - \sigma_4 a^4 (a \theta_j)^{K-3} + \cdots + (-1)^{(n-1)/2} \sigma_{n+1} a^{n+1} (a \theta_j)^{K-n},
\]
\begin{align*}
|\Delta_{K+1}| = \biggl| \sum_{j=0}^{n} w_j (a \theta_j)^{K+1} \biggr|
&\leq \sigma_2 a^2 |\Delta_{K-1}| + \sigma_4 a^4 |\Delta_{K-3}| + \cdots + \sigma_{n+1} a^{n+1} |\Delta_{K-n}| \\
&\leq \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) b^{K+1-n} \bigl( \sigma_2 (a/b)^2 + \sigma_4 (a/b)^4 + \cdots \bigr) \\
&= \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) b^{K+1-n} \biggl( \frac{a^{n+1}}{2^n b^{n+1}} \Bigl| T_{n+1}\Bigl( \frac{b}{a} \sqrt{-1} \Bigr) \Bigr| - 1 \biggr) \\
&\leq \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) b^{K+1-n}.
\end{align*}
The last inequality follows from the second statement of Lemma B.2. We have shown that the induction hypothesis (11) is also true for $k = K + 1$. Thus, (11) is true for all $k \geq 0$.

Now, we proceed to prove the very last statement. In view of Lemma B.1, there exists $N \in \mathbb{N}$, not depending on $a$ or $b$, such that if $n \geq N$, then
\begin{align*}
\|r_n\|_{L^2(\phi)} &\leq \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) b^{-n} \Bigl( \frac{e b^2}{n + 1} \Bigr)^{(n+1)/2} \\
&\leq \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) \sqrt{\frac{n}{2.77}} \Bigl( \frac{e}{n + 1} \Bigr)^{(n+1)/2} \\
&\leq \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) \Bigl( \frac{e}{n} \Bigr)^{n/2}.
\end{align*}
Lastly, observing that
\[
q_n(x) = \sum_{\ell = 0}^{(n-1)/2} \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \frac{h_{n - 2\ell}(x)}{(2\ell)!! \sqrt{(n - 2\ell)!}} = \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \frac{x^n}{n!}
\]
gives the following explicit formulas for the $L^1(\phi)$ and $L^2(\phi)$ norms of $q_n$:
\begin{align*}
\|q_n\|_{L^1(\phi)} &= \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \frac{2^{n/2} \pi^{-1/2} \Gamma\bigl( \frac{n+1}{2} \bigr)}{n!} = \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} (\pi n)^{-1/2} \Bigl( \frac{e}{n} \Bigr)^{n/2} \Bigl( 1 + O\Bigl( \frac{1}{n} \Bigr) \Bigr), \\
\|q_n\|_{L^2(\phi)} &= \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \frac{2^{n/2} \pi^{-1/4} \Gamma^{1/2}\bigl( n + \frac{1}{2} \bigr)}{n!} = \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} (\pi n)^{-1/2} \Bigl( \frac{e}{n} \Bigr)^{n/2} 2^{\frac{n}{2} - \frac{1}{4}} \Bigl( 1 + O\Bigl( \frac{1}{n} \Bigr) \Bigr).
\end{align*}
Comparing these asymptotics, based on Stirling's formula, shows (13). In particular, both $\|q_n\|_{L^1(\phi)}$ and $\|q_n\|_{L^2(\phi)}$ decay at a hyper-exponential rate of $\exp(-n \log n / 2)$, and the tail norm $\|r_n\|_{L^2(\phi)}$ cannot deviate from $\|q_n\|_{L^1(\phi)}$ or $\|q_n\|_{L^2(\phi)}$ faster than an exponential rate in $n$. We have (15) in conclusion.

Proof of Corollary 3.3. The equality for the TV distance is straightforward. In view of the above Lemma 3.2, let $n$ be a large enough odd number. By construction, we have
\begin{align*}
f_{\pi_n^{(1)}}(x) &= (1 - \lambda_n) \phi(x) + \sum_{j=0}^{n} \Bigl( \frac{\lambda_n}{n + 1} + \lambda_n w_j \Bigr) \phi(x - a \theta_j), \\
f_{\eta_n^{(1)}}(x) &= (1 - \lambda_n) \phi(x) + \sum_{j=0}^{n} \frac{\lambda_n}{n + 1} \phi(x - a \theta_j).
\end{align*}
Recall from the lemma that $|\theta_j| \leq 1$ for all $j$ and that $0 < a \leq 1$. Also, recall the definition (16) of $R_n$ and $\lambda_n$.
Observe for all $x \in [-R_n, R_n]$ and all $j$ that
\[
\frac{\phi(x - a\theta_j)}{\phi(x)} = \exp\Bigl(a\theta_j x - \frac{1}{2}a^2\theta_j^2\Bigr) \le \exp(|a\theta_j| R_n) \le \exp(R_n),
\]
and hence that
\[
f_{\eta_n^{(1)}}(x) \le \bigl(1 - \lambda_n + \lambda_n \exp(R_n)\bigr)\phi(x) \le 2\phi(x).
\]
Lastly, recall the definition (34) of $E_{n,d}$. Note that $E_{n,1} = 2n+1$ and that $R_n = \sqrt{8n+4} = \sqrt{2\kappa E_{n,1}}$ holds for $\kappa = 2$. Therefore, we have
\begin{align*}
2\lambda_n^{-2}\,\chi^2\bigl(f_{\pi_n^{(1)}} \,\|\, f_{\eta_n^{(1)}}\bigr)
&\ge 2\lambda_n^{-2}\int_{-R_n}^{R_n} \frac{\bigl(f_{\pi_n^{(1)}} - f_{\eta_n^{(1)}}\bigr)^2}{f_{\eta_n^{(1)}}}
\ge \int_{-R_n}^{R_n} \frac{(f_{\pi_n} - f_{\eta_n})^2}{\phi} \\
&= \|q_n + r_n\|_{L^2([-R_n,R_n],\phi)}^2 \\
&\ge \frac{1}{2}\|q_n\|_{L^2([-R_n,R_n],\phi)}^2 - \|r_n\|_{L^2(\phi)}^2 && (\because\ 2(a^2+b^2) \ge (a-b)^2) \\
&\ge \frac{1}{4}\|q_n\|_{L^2(\phi)}^2 - \|r_n\|_{L^2(\phi)}^2 && \text{(by Proposition A.7 with } \kappa = 2\text{)} \\
&\ge \frac{1}{8}\|q_n\|_{L^2(\phi)}^2, && \text{(by inequality (14))}
\end{align*}
provided that $n$ is large enough.

C Proofs of the Applications

In this section, we prove Theorem 4.3, Proposition 4.4, and Theorems 4.5, 4.6, and 4.7.

C.1 Preliminaries: Yatracos' Construction and Lemmas

We first recall Yatracos' scheme (Yatracos, 1985) for robust density estimation in total variation. Consider an $\eta$-covering $\{Q_1, \ldots, Q_N\}$ of $\mathcal{P}_{M,d}$ in total variation. We define the Yatracos class $\mathcal{A}$ by
\[
\mathcal{A} := \{A_{ij} : i \neq j \in [N]\}, \qquad A_{ij} := \Bigl\{x : \frac{dQ_i}{d(Q_i + Q_j)}(x) \ge \frac{dQ_j}{d(Q_i + Q_j)}(x)\Bigr\},
\]
so that $|\mathcal{A}| \le N^2$. Given the class $\mathcal{A}$, we define a pseudo-distance $\mathrm{dist}$ as follows:
\[
\mathrm{dist}(P_1, P_2) := \sup_{A \in \mathcal{A}} |P_1(A) - P_2(A)|.
\]
Then $\mathrm{dist}$ satisfies the triangle inequality. Moreover, it approximates the total variation on $\mathcal{P}_{M,d}$, in the sense that
\[
\mathrm{dist}(Q_i, Q_j) = \mathrm{TV}(Q_i, Q_j), \qquad \mathrm{dist}(P_1, P_2) \le \mathrm{TV}(P_1, P_2) \le \mathrm{dist}(P_1, P_2) + 4\eta, \quad \forall\, P_1, P_2 \in \mathcal{P}_{M,d}.
\]
Given i.i.d. observations $X_1, \ldots, X_n$ as in (22), we define the Yatracos estimator $\hat{P}$ by
\[
\hat{P} := \operatorname*{argmin}_{P' \in \mathcal{P}_{M,d}} \mathrm{dist}\bigl(P', \hat{P}_n\bigr), \tag{55}
\]
where $\hat{P}_n := \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}$ is the empirical distribution.
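The pseudo-distance $\mathrm{dist}$ is easy to experiment with when the covering and the candidate distributions live on a common finite grid. A minimal sketch (the covering and its numbers are hypothetical, and `yatracos_distance` is our own illustration, not code from the paper):

```python
import itertools
import numpy as np

def yatracos_distance(P1, P2, covering):
    """dist(P1, P2) = sup over the Yatracos class of |P1(A) - P2(A)|, for
    distributions given as probability vectors on a common finite grid."""
    best = 0.0
    for Qi, Qj in itertools.permutations(covering, 2):
        A = Qi >= Qj  # the set A_ij = {x : dQi/d(Qi+Qj) >= dQj/d(Qi+Qj)}
        best = max(best, abs(P1[A].sum() - P2[A].sum()))
    return best

# Toy covering of three distributions on five points (hypothetical numbers).
covering = [np.array([.5, .2, .1, .1, .1]),
            np.array([.1, .1, .1, .2, .5]),
            np.array([.2, .2, .2, .2, .2])]
Q1, Q2 = covering[0], covering[1]
tv = 0.5 * np.abs(Q1 - Q2).sum()
# For members of the covering, dist recovers the TV distance exactly,
# since A_12 is precisely the set achieving the supremum in TV.
assert abs(yatracos_distance(Q1, Q2, covering) - tv) < 1e-12
```

In general $\mathrm{dist} \le \mathrm{TV}$ always holds, and equality on the covering is what makes the scheme work.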
Note that Yatracos' scheme works even if $P = (1-\epsilon)P_{f_\pi} + \epsilon Q$ lies outside $\mathcal{P}_{M,d}$. In particular, if we denote by $\hat{f}$ the density of $\hat{P}$, we have
\[
\mathrm{TV}\bigl(f_\pi, \hat{f}\bigr) \le 3\eta + 2\,\mathrm{dist}\bigl(P, \hat{P}_n\bigr) + 3\inf_{P' \in \mathcal{P}_{M,d}} \mathrm{TV}(P, P'). \tag{56}
\]
See Section 32.3 of Polyanskiy and Wu (2025) for a recent review of the Yatracos estimator. As a consequence, we can derive the minimax upper bound in Proposition 4.4, noting that $\log N \lesssim \log^{d+1}(1/\eta)$ by Lemma C.1; it only remains to choose an appropriate $\eta$ for (56). See Appendix C.2 for the details.

Lemma C.1 (TV entropy bound in $d$ dimensions). Recall the definition of covering number from Definition 4.1. We have
\[
\log N_{\mathrm{TV}}(\mathcal{P}_{M,d}, \epsilon) \lesssim \log^{d+1}\Bigl(\frac{1}{\epsilon}\Bigr).
\]

Proof. For the one-dimensional case ($d = 1$), the entropy bound is due to Ghosal and Van Der Vaart (2001); recent works extended this result to arbitrary dimensions (Saha and Guntuboyina, 2020; Ma et al., 2025). Let $\mathcal{P}_m$ be the collection of $m$-atomic Gaussian mixtures in $\mathcal{P}_{M,d}$ and define
\[
m_\star := \inf\Bigl\{m \in \mathbb{N} : \sup_{P \in \mathcal{P}_{M,d}} \inf_{P_m \in \mathcal{P}_m} \mathrm{TV}(P, P_m) \le \frac{\epsilon}{2}\Bigr\}.
\]
Then Proposition 5 of Ma et al. (2025) shows $m_\star \lesssim \log^d(1/\epsilon)$. On the other hand, the parametric entropy bound for finite mixtures shows
\[
\log N_{\mathrm{TV}}\Bigl(\mathcal{P}_{m_\star}, \frac{\epsilon}{2}\Bigr) \lesssim m_\star\, d \log\Bigl(\frac{1}{\epsilon}\Bigr).
\]
Combining these results with the triangle inequality concludes the proof.

Lemma C.2 (Chen et al. (2018)). Suppose $P_1$ and $P_2$ are probability measures such that $\mathrm{TV}(P_1, P_2) \le \frac{\epsilon}{1-\epsilon}$. Then there exist two probability measures $Q_1$ and $Q_2$ such that
\[
(1-\epsilon)P_1 + \epsilon Q_1 = (1-\epsilon)P_2 + \epsilon Q_2.
\]

C.2 Proofs

We proceed to prove Theorem 4.3, Proposition 4.4, and Theorems 4.5, 4.6, and 4.7 in this section.

Proof of Theorem 4.3. First, given an estimator $\hat{P}$, let $\tilde{P}$ be the projection of $\hat{P}$ onto $\mathcal{P}$ under the TV distance. Then, for every $P \in \mathcal{P}$, we have
\[
\mathrm{TV}\bigl(P, \tilde{P}\bigr) \le \mathrm{TV}\bigl(P, \hat{P}\bigr) + \mathrm{TV}\bigl(\hat{P}, \tilde{P}\bigr) \le 2\,\mathrm{TV}\bigl(P, \hat{P}\bigr).
\]
This allows $\hat{P}$ to be restricted to $\mathcal{P}$ up to universal constants. Second, the upper bound follows immediately from inequality (1) and Proposition 4.2. Third, applying Corollary 2.4 gives
\[
\mathbb{P}\Bigl[\mathrm{TV}\bigl(P, \hat{P}\bigr) \ge J^{-1}\Bigl(\frac{\epsilon_n}{4}\Bigr)\Bigr] \ge \mathbb{P}\Bigl[H\bigl(P, \hat{P}\bigr) \ge \frac{\epsilon_n}{4}\Bigr] \ge \frac{1}{2}, \tag{57}
\]
where we define $\alpha(t)$ as in (4) and $J(t)$ as
\[
J(t) := C_0\, t \vee t^{1-\alpha(t)}, \tag{58}
\]
for $t > 0$. Note that the inverse $J^{-1}$ is well defined in the regime where $n \to \infty$, as $J$ is strictly increasing on $(0, t_0)$ for some $t_0 > 0$. The last inequality in (57) is due to Fano's inequality, used in the proof of Corollary 11 in Jia et al. (2023). We conclude that
\[
\inf_{\hat{P} \in \mathcal{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Bigl[\mathrm{TV}^2\bigl(P, \hat{P}\bigr)\Bigr] \gtrsim \Bigl(J^{-1}\Bigl(\frac{\epsilon_n}{4}\Bigr)\Bigr)^2 \gtrsim \epsilon_n^{2\left(1 + \frac{2+\delta}{\log(\log(1/\epsilon_n) \vee e)}\right)}.
\]

Proof of Proposition 4.4. Observe that
\[
\inf_{P' \in \mathcal{P}_{M,d}} \mathrm{TV}(P, P') \le \mathrm{TV}(P, P_{f_\pi}) \le \epsilon.
\]
Hence, by (56), the standard Yatracos construction (55) leads to a proper estimator $\hat{f}$ satisfying
\[
\mathrm{TV}\bigl(f_\pi, \hat{f}\bigr) \le 3\epsilon + 3\eta + 2\,\mathrm{dist}\bigl(P, \hat{P}_n\bigr). \tag{59}
\]
Applying the Hoeffding bound and a union bound, we have
\[
\mathbb{P}\Bigl[\mathrm{dist}\bigl(P, \hat{P}_n\bigr) \ge s\Bigr] \le 1 \wedge 2|\mathcal{A}| \exp\Bigl(-\frac{ns^2}{2}\Bigr), \tag{60}
\]
\[
\mathbb{E}_P\Bigl[\mathrm{dist}^2\bigl(P, \hat{P}_n\bigr)\Bigr] \le \frac{2\bigl(1 + \log(2|\mathcal{A}|)\bigr)}{n}.
\]
Lemma C.1 implies $\log|\mathcal{A}| \le 2\log N_{\mathrm{TV}}(\mathcal{P}_{M,d}, \eta) \lesssim \log^{d+1}(1/\eta)$. Accordingly, we choose the optimal $\eta \asymp \log^{d/2}(n)/\sqrt{n}$ to conclude the proof.

Proof of Theorem 4.5. Let $\hat{f}$ be the proper estimator from the proof of Proposition 4.4, and define $J(\cdot)$ as in (58). Observe that $J(\cdot)$ is subadditive, i.e., $J(s+t) \le J(s) + J(t)$ holds for all $s, t > 0$, provided that $C_0$ is not too small, depending only on $\delta > 0$. Thus, applying Corollary 2.4 to (59) gives
\[
H\bigl(f_\pi, \hat{f}\bigr) \le 3J(\epsilon) + 3J(\eta) + 2J\bigl(\mathrm{dist}\bigl(P, \hat{P}_n\bigr)\bigr).
\]
Hence, the choice of $\eta$ in the proof of Proposition 4.4 proves the desired bound (23).

Proof of Theorem 4.6. The minimax lower bound in $\epsilon$ can be obtained from the standard two-point method.
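The monotonicity and subadditivity of $J$ near zero, used in the proof of Theorem 4.5 above, can be sanity-checked numerically. Since the exact $\alpha$ from (4) is not reproduced here, the sketch below uses the surrogate $\alpha(t) = 1/\log(\log(1/t) \vee e)$, matching its stated order, and the value $C_0 = 10$ is an arbitrary choice:

```python
import math

# Numerical sanity check that J(t) = C0*t v t^(1-alpha(t)) is increasing and
# subadditive near zero, with a surrogate alpha (an assumption, not the paper's (4)).
C0 = 10.0

def alpha(t):
    return 1.0 / math.log(max(math.log(1.0 / t), math.e))

def J(t):
    return max(C0 * t, t ** (1.0 - alpha(t)))

# Logarithmic grid from 1e-2 down to about 1e-20.
grid = [10 ** (-k / 4) for k in range(8, 80)]

# Increasing: grid is sorted decreasing, so each s is smaller than the next t.
assert all(J(s) <= J(t) for s, t in zip(grid[1:], grid))
# Subadditive on the range where t^(1-alpha(t))/t is decreasing (t below e^{-e}).
assert all(J(s + t) <= J(s) + J(t) + 1e-12
           for s in grid for t in grid if s + t < 0.01)
```

The subadditivity follows here because $t^{1-\alpha(t)}/t$ is decreasing for small $t$, making $t \mapsto t^{1-\alpha(t)}$ star-shaped; the maximum of two subadditive increasing functions is again subadditive.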
Our sharpness result, Theorem 3.1, shows that there exist two "one-dimensional" probability measures $\pi_\star$ and $\eta_\star$, supported on the bounded interval $[-M, M]$, such that $\mathrm{TV}(f_{\pi_\star}, f_{\eta_\star}) \le \epsilon \le \frac{\epsilon}{1-\epsilon}$ and
\[
H(f_{\pi_\star}, f_{\eta_\star}) \gtrsim \epsilon^{1 - \frac{0.33}{\log(\log(1/\epsilon) \vee e)}}.
\]
Note that we can also construct $d$-dimensional probability measures $\pi$ and $\eta$ with the same property, since $\mathrm{TV}(f_\pi, f_\eta) = \mathrm{TV}(f_{\pi_\star}, f_{\eta_\star})$ and $H(f_\pi, f_\eta) = H(f_{\pi_\star}, f_{\eta_\star})$ for
\[
\pi = \pi_\star \otimes \delta_0^{\otimes(d-1)} = \pi_\star \otimes \delta_0 \otimes \cdots \otimes \delta_0, \qquad \eta = \eta_\star \otimes \delta_0^{\otimes(d-1)} = \eta_\star \otimes \delta_0 \otimes \cdots \otimes \delta_0,
\]
where $\delta_0$ denotes the point mass at zero and $\otimes$ the product measure. Thus, it follows from Lemma C.2 and the same two-point argument as in Chen et al. (2018) that
\[
\inf_{\hat{f}} \sup_{\pi, Q} \mathbb{E}\Bigl[H^2\bigl(f_\pi, \hat{f}\bigr)\Bigr] \gtrsim \epsilon^{2\left(1 - \frac{0.33}{\log(\log(1/\epsilon) \vee e)}\right)}.
\]

Proof of Theorem 4.7. This proof crucially relies on the proof of Theorem 3.5 in Saha and Guntuboyina (2020). Our proof, however, differs from theirs in the choice of $\rho$: they take $\rho = (2\pi)^{-d/2} n^{-1}$, whereas we use
\[
\rho = (2\pi)^{-d/2}\bigl(\mathcal{E}^2(\epsilon, n) \wedge e^{-2}\bigr),
\]
where we define $\mathcal{E}(\epsilon, n)$ as in (24). Recall that the oracle Bayes estimator $\hat{\theta}_\star(\cdot)$ is given by (26), and consider the following decomposition:
\[
\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_\rho(X) - \hat{\theta}_\star(X)\bigr\|^2 \le 2\,\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_\rho(X) - \hat{\theta}_{\star\rho}(X)\bigr\|^2 + 2\,\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_{\star\rho}(X) - \hat{\theta}_\star(X)\bigr\|^2, \tag{61}
\]
where we define
\[
\hat{\theta}_{\star\rho}(X) := X + \frac{\nabla f_\pi(X)}{f_\pi(X) \vee \rho}.
\]
The first term above is bounded from above as follows, using Theorem E.1 in Saha and Guntuboyina (2020), which is a generalization of Theorem 3 in Jiang and Zhang (2009):
\begin{align*}
\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_\rho(X) - \hat{\theta}_{\star\rho}(X)\bigr\|^2
&= \int \Bigl\|\frac{\nabla \hat{f}(x)}{\hat{f}(x) \vee \rho} - \frac{\nabla f_\pi(x)}{f_\pi(x) \vee \rho}\Bigr\|^2 f_\pi(x)\,dx \\
&\lesssim H^2\bigl(f_\pi, \hat{f}\bigr)\Bigl(\log\frac{1}{H\bigl(f_\pi, \hat{f}\bigr)} \vee \log^3\Bigl(\frac{1}{\mathcal{E}(\epsilon, n)} \vee e\Bigr)\Bigr).
\end{align*}
For the second term in (61), we have
\begin{align*}
\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_{\star\rho}(X) - \hat{\theta}_\star(X)\bigr\|^2
&= \int \Bigl\|\frac{\nabla f_\pi(x)}{f_\pi(x) \vee \rho} - \frac{\nabla f_\pi(x)}{f_\pi(x)}\Bigr\|^2 f_\pi(x)\,dx \\
&= \int \Bigl(1 - \frac{f_\pi(x)}{f_\pi(x) \vee \rho}\Bigr)^2 \frac{\|\nabla f_\pi(x)\|^2}{f_\pi(x)}\,dx \\
&\lesssim \mathcal{E}^2(\epsilon, n)\,\log^d\Bigl(\frac{1}{\mathcal{E}(\epsilon, n)} \vee e\Bigr).
\end{align*}
The last inequality is due to Lemma 4.3 in Saha and Guntuboyina (2020). Recall from our Theorem 4.5 that
\[
\mathbb{E}\Bigl[H^2\bigl(f_\pi, \hat{f}\bigr)\Bigr] \lesssim \mathcal{E}^2(\epsilon, n).
\]
For brevity, write $H := H\bigl(f_\pi, \hat{f}\bigr)$ and $\mathcal{E} := \mathcal{E}(\epsilon, n)$ for the remainder of the proof. Then,
\begin{align*}
\mathbb{E}\Bigl[H^2 \log\frac{1}{H}\Bigr]
&= \mathbb{E}\Bigl[H^2 \log\frac{1}{H}\,\mathbb{1}\{H \le \mathcal{E} \le e^{-1}\}\Bigr] + \mathbb{E}\Bigl[H^2 \log\frac{1}{H}\,\mathbb{1}\{H \le \mathcal{E}\}\,\mathbb{1}\{\mathcal{E} > e^{-1}\}\Bigr] + \mathbb{E}\Bigl[H^2 \log\frac{1}{H}\,\mathbb{1}\{H > \mathcal{E}\}\Bigr] \\
&\le \mathcal{E}^2 \log\Bigl(\frac{1}{\mathcal{E}} \vee e\Bigr) + \mathbb{E}\Bigl[H^2 \log\frac{1}{H}\,\mathbb{1}\{\mathcal{E} > e^{-1}\}\Bigr] + \mathbb{E}\bigl[H^2\bigr]\,\mathbb{P}\bigl(H^2 > \mathcal{E}^2\bigr)\log\Bigl(\frac{1}{\mathcal{E}} \vee e\Bigr) \\
&\lesssim \mathcal{E}^2 \log\Bigl(\frac{1}{\mathcal{E}} \vee e\Bigr). \qquad \text{(by Markov's inequality)}
\end{align*}
Taking everything into account, we conclude that
\[
\mathbb{E}\Bigl[\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_\rho(X) - \hat{\theta}_\star(X)\bigr\|^2\Bigr] \lesssim \mathcal{E}^2 \log^{3 \vee d}\Bigl(\frac{1}{\mathcal{E}} \vee e\Bigr) \lesssim \epsilon^{2\left(1 - \frac{2+2\delta}{\log(\log(1/\epsilon) \vee e)}\right)} + n^{-(1 - o_d(1))}.
\]
Note that the extra logarithmic factors are absorbed into the slack parameter $\delta > 0$ and the $o_d(1)$ term, respectively. Since the choice of $\delta > 0$ is arbitrary, replacing $\delta$ with $\delta/2$ proves the bound (28).
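The truncated rule analyzed above is simple to write down in one dimension for a known discrete mixing measure, i.e., the population version $\hat{\theta}_{\star\rho}(x) = x + f'(x)/(f(x) \vee \rho)$. A sketch (the function name, prior, and $\rho$ values below are ours, for illustration only); with $\rho = 0$ it reduces to the oracle Bayes rule, which for the two-point prior $\frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$ is $\tanh(x)$:

```python
import math

def tweedie_truncated(x, atoms, weights, rho):
    """theta_rho(x) = x + f'(x) / (f(x) v rho) for the mixing measure
    sum_j w_j * delta_{theta_j}; rho = 0 gives the oracle Bayes rule."""
    phi = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
    f = sum(w * phi(x - a) for a, w in zip(atoms, weights))
    fprime = sum(w * (a - x) * phi(x - a) for a, w in zip(atoms, weights))
    return x + fprime / max(f, rho)

# Two-point prior (1/2) delta_{-1} + (1/2) delta_{1}: the Bayes rule is tanh(x).
for x in (-2.0, -0.5, 0.0, 0.7, 3.0):
    assert abs(tweedie_truncated(x, [-1, 1], [.5, .5], rho=0.0) - math.tanh(x)) < 1e-12
# For moderate x the density sits far above a small rho, so truncation is inactive.
assert tweedie_truncated(0.7, [-1, 1], [.5, .5], rho=1e-6) == \
       tweedie_truncated(0.7, [-1, 1], [.5, .5], rho=0.0)
```

The truncation $f(x) \vee \rho$ only bites where the density is below $\rho$, which is exactly the region controlled by the second term of (61).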
