Sharp Inequalities between Total Variation and Hellinger Distances for Gaussian Mixtures


Authors: Joonhyuk Jung, Chao Gao

Department of Statistics, University of Chicago

Abstract

We study the relation between the total variation (TV) and Hellinger distances between two Gaussian location mixtures. Our first result establishes a general upper bound: for any two mixing distributions supported on a compact set, the Hellinger distance between the two mixtures is controlled by the TV distance raised to a power $1 - o(1)$, where the $o(1)$ term is of order $1/\log\log(1/\mathrm{TV})$. We also construct two sequences of mixing distributions that demonstrate the sharpness of this bound. Taken together, our results resolve an open problem raised in Jia et al. (2023) and thus lead to an entropic characterization of learning Gaussian mixtures in total variation. Our inequality also yields optimal robust estimation of Gaussian mixtures in Hellinger distance, which has a direct implication for bounding the minimax regret of empirical Bayes under Huber contamination.

1 Introduction

The Gaussian location mixture is one of the most fundamental models used in nonparametric density estimation, Bayesian inference, and clustering (Lindsay, 1995; Dasgupta, 1999). Given a probability measure $\pi$ supported on $\mathbb{R}^d$, the induced Gaussian mixture is defined by
$$f_\pi(x) = \int_{\mathbb{R}^d} \phi_d(x - \theta) \, d\pi(\theta),$$
where $\phi_d(x) = (2\pi)^{-d/2} \exp(-\|x\|_2^2/2)$ is the density function of the $d$-dimensional standard Gaussian distribution. In this paper, we study the relation between the total variation distance $\mathrm{TV}(p, q) := \frac{1}{2} \int |p - q|$ and the Hellinger distance $H(p, q) := \sqrt{\frac{1}{2} \int (\sqrt{p} - \sqrt{q})^2}$ of two Gaussian mixture densities. Without any restriction on the distributions, it is well known that
$$H^2(p, q) \le \mathrm{TV}(p, q) \le \sqrt{2}\, H(p, q).$$
(1)

The Hellinger distance is a commonly used loss function for density estimation (Wong and Shen, 1995). It is especially useful in the setting of Gaussian location mixture estimation given its direct consequence for bounding the regret of an empirical Bayes estimator using a plug-in estimator of the prior (Jiang and Zhang, 2009). When the data set contains a small subset of arbitrary outliers, the density estimation problem can be regarded as misspecified under total variation. Therefore, sharp inequalities are necessary for deriving optimal error rates for robust density estimation of Gaussian location mixtures, and the inequalities in (1) are too loose for this purpose.

(*The research of CG is supported in part by NSF Grants ECCS-2216912 and DMS-2310769, and an Alfred Sloan fellowship.)

Relations between $f$-divergences of Gaussian location mixtures have been studied in the literature. In particular, for distributions $\pi$ and $\eta$ supported on a bounded Euclidean ball $\{\theta \in \mathbb{R}^d : \|\theta\|_2 \le M\}$, it was proved by Jia et al. (2023) that the induced Gaussian mixtures $f_\pi$ and $f_\eta$ satisfy
$$H^2(f_\pi, f_\eta) \asymp \mathrm{KL}(f_\pi \| f_\eta), \quad (2)$$
up to constant factors depending on $M$ and $d$. Here, $\mathrm{KL}(p \| q) := \int p \log \frac{p}{q}$ denotes the Kullback–Leibler divergence. The result in (2) implies an entropic characterization of the minimax rate of estimating the Gaussian location mixture. The paper Jia et al. (2023) also investigated the relation between the total variation distance $\mathrm{TV}(f_\pi, f_\eta)$ and the $L^2$ distance $\|f_\pi - f_\eta\|_2$. However, whether the relation $\mathrm{TV}(f_\pi, f_\eta) \asymp H(f_\pi, f_\eta)$ holds was explicitly listed as an open question in the paper. In this paper, we resolve this open problem by proving that
$$H(f_\pi, f_\eta) \le \mathrm{TV}^{1 - o(1)}(f_\pi, f_\eta),$$
where the $o(1)$ term in the exponent is of order $\Omega\left( \frac{1}{\log\log(1/\mathrm{TV}(f_\pi, f_\eta))} \right).$
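Before proceeding, the basic quantities above are easy to check numerically. The following Python sketch (our own illustration, not from the paper; function names are ours) builds two one-dimensional Gaussian location mixtures on a grid and verifies the classical two-sided bound (1) between TV and Hellinger.

```python
import numpy as np

def mixture_pdf(x, locs, weights):
    """Density f_pi(x) = sum_j w_j * phi(x - theta_j) of a 1-d Gaussian location mixture."""
    z = np.subtract.outer(np.asarray(x, float), np.asarray(locs, float))
    return (np.asarray(weights) * np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)).sum(axis=-1)

# Two mixtures whose mixing distributions are supported on a compact set.
grid = np.linspace(-12.0, 12.0, 240001)
dx = grid[1] - grid[0]
p = mixture_pdf(grid, locs=[-1.0, 1.0], weights=[0.5, 0.5])
q = mixture_pdf(grid, locs=[-0.8, 1.2], weights=[0.4, 0.6])

tv = 0.5 * np.sum(np.abs(p - q)) * dx                                # TV(p, q)
hell = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2) * dx)      # H(p, q)

# The classical two-sided bound (1): H^2 <= TV <= sqrt(2) * H.
assert hell**2 <= tv <= np.sqrt(2) * hell
```

The paper's point is that, for such mixtures, the left inequality can be tightened all the way to $H \le \mathrm{TV}^{1 - o(1)}$.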
We also construct sequences of distributions $\pi_n$ and $\eta_n$ showing that this $1 - o(1)$ factor is indeed necessary, thereby disproving the linear comparability $\mathrm{TV}(f_\pi, f_\eta) \asymp H(f_\pi, f_\eta)$ for Gaussian location mixtures. Our proof is based on an expansion of the ratio $(f_\pi - f_\eta)/\phi_d$ using Hermite polynomials. Key steps in the analysis involve the derivation of the multivariate Nikolskii-type inequality (see Proposition A.6) and the restricted-range inequality (see Proposition A.7).

As a direct application, for density estimation of $f_\pi$ with data generated from the Huber contamination model $(1 - \epsilon) P_{f_\pi} + \epsilon Q$, where $Q$ is arbitrary, we show that the minimax rate under the Hellinger distance is given by
$$\epsilon^{1 - \Theta\left( \frac{1}{\log\log(1/\epsilon)} \right)},$$
provided that the sample size satisfies $n \ge \mathrm{poly}(1/\epsilon)$.

1.1 Paper Organization

The remainder of this paper is organized as follows. Our main result is presented in Section 2, followed by the sharpness construction in Section 3. Two applications of the main results, the entropic characterization of Gaussian location mixture estimation in total variation and robust density estimation, are discussed in Section 4. In Section 5, we briefly discuss a few open directions. Due to page limits, most technical proofs are deferred to the appendices.

1.2 Notation

Let $\mathbb{N}_0$ be the set of nonnegative integers and $\mathbb{R}$ the set of real numbers. We use boldface notation, e.g., $\mathbf{k}$ and $\mathbf{l}$, for multi-indices. For $\mathbf{k} = (k_1, \ldots, k_d) \in \mathbb{N}_0^d$, we write $|\mathbf{k}| := k_1 + \cdots + k_d$. We denote by $\|\theta\|_2$ and $\|\theta\|_\infty$ the Euclidean norm and $\infty$-norm of $\theta \in \mathbb{R}^d$, respectively. For a real matrix $A \in \mathbb{R}^{m \times n}$, $\|A\|_\infty := \max\{\|Ax\|_\infty : \|x\|_\infty = 1\}$ is the operator norm induced by the $\infty$-norm of vectors. Recall that $\phi_d$ denotes the $d$-dimensional standard Gaussian density. We may use $\phi = \phi_1$ when we only discuss one-dimensional results.
For $p \in \{1, 2\}$, a measurable set $A \subseteq \mathbb{R}^d$, and a measurable function $g : \mathbb{R}^d \to \mathbb{R}$, we write
$$\|g\|_{L^p(A, \phi_d)} := \left( \int_A |g(x)|^p \phi_d(x) \, dx \right)^{1/p} = \left( \int_A |g|^p \phi_d \right)^{1/p},$$
whenever the integral exists. We often abbreviate $L^p(\mathbb{R}^d, \phi_d)$ as $L^p(\phi_d)$ when no confusion arises. Let $\Pi_n^d$ be the set of real polynomials of total degree at most $n$ in $d$ variables. We also write $\Pi_n = \Pi_n^1$ when $d = 1$. For $k \in \mathbb{N}_0$, we define the one-dimensional (normalized) Hermite polynomial $h_k \in \Pi_k$ by
$$h_k(x) := \frac{(-1)^k}{\sqrt{k!}\, \phi(x)} \frac{d^k}{dx^k} \phi(x). \quad (3)$$
For arbitrary dimensions, we define the Hermite polynomial $h_{\mathbf{k}} \in \Pi_{|\mathbf{k}|}^d$ by tensor products of one-dimensional Hermite polynomials:
$$h_{\mathbf{k}}(x) := \prod_{j=1}^{d} h_{k_j}(x_j).$$
Note that $\deg h_{\mathbf{k}} = |\mathbf{k}|$ and the collection $\{h_{\mathbf{k}} : |\mathbf{k}| \le n\}$ forms an orthonormal basis of $\Pi_n^d$ with respect to the $L^2(\phi_d)$-norm. The dimension of $\Pi_n^d$ is given by $\binom{n+d}{n}$.

For integer or real values, we write $a \vee b := \max\{a, b\}$ and $a \wedge b := \min\{a, b\}$. For a positive integer $N \in \mathbb{N}$, we write $[N] := \{1, \ldots, N\}$. For a real number $x$, $\lceil x \rceil$ is the smallest integer no smaller than $x$ and $\lfloor x \rfloor$ is the largest integer no larger than $x$. For $a, b : \mathcal{G} \to [0, \infty)$, we write $a \lesssim b$ or $a = O(b)$ if there exists some constant $C > 0$ independent of $g$ such that $a(g) \le C b(g)$ holds for all $g \in \mathcal{G}$. We write $a \gtrsim b$ or $a = \Omega(b)$ if $b \lesssim a$. We write $a \asymp b$ or $a = \Theta(b)$ if $a \lesssim b$ and $b \lesssim a$.

2 Main Results

In this section, we present our main results. The first result bounds the $\chi^2$-divergence $\chi^2(p \| q) := \int \frac{(p - q)^2}{q}$ by the total variation.

Theorem 2.1 (Inequality between TV distance and $\chi^2$-divergence). Let $\pi$ and $\eta$ be probability measures supported on the $d$-dimensional cube $[-M, M]^d$. Let $\delta > 0$.
Then, there exists $C_0 = C_0(\delta, M, d) > 0$, not depending on $\pi$ or $\eta$, such that
$$\sqrt{\chi^2(f_\pi \| f_\eta)} \le \left( C_0 \vee \mathrm{TV}^{-\alpha(\mathrm{TV}(f_\pi, f_\eta))}(f_\pi, f_\eta) \right) \mathrm{TV}(f_\pi, f_\eta),$$
where we define
$$\alpha(t) := \frac{2 + \delta}{\log(\log(1/t) \vee e)}, \quad (4)$$
for $t > 0$.

Remark 2.2. Note that $\alpha(t)$ is increasing in $t$ and that $\alpha(t) \to 0$ as $t \downarrow 0$. However, $t^{-\alpha(t)}$ is decreasing in $t$ and $t^{-\alpha(t)} \to +\infty$ as $t \downarrow 0$.

Remark 2.3. We note that the exponent $\alpha(t)$ does not depend on $M$ or $d$. The dependence on $M$ and $d$ appears solely in the constant $C_0$. We will discuss the dependence of $C_0$ on $M$ and $d$ later in Appendix A.3.

Corollary 2.4 (Inequality between TV and Hellinger distances). Let $\pi$ and $\eta$ be probability measures supported on the $d$-dimensional cube $[-M, M]^d$. Let $\delta > 0$. Then, there exists $C_0 = C_0(\delta, M, d) > 0$, not depending on $\pi$ or $\eta$, such that
$$H(f_\pi, f_\eta) \le \left( C_0 \vee \mathrm{TV}^{-\alpha(\mathrm{TV}(f_\pi, f_\eta))}(f_\pi, f_\eta) \right) \mathrm{TV}(f_\pi, f_\eta),$$
where $\alpha(\cdot)$ is defined as in (4).

Proof. This is a direct consequence of Theorem 2.1, noting that $H^2(p, q) \le \chi^2(p \| q)$ holds in general.

A key step in deriving the results of Theorem 2.1 and Corollary 2.4 is to relate the $L^1(\phi_d)$ and $L^2(\phi_d)$ norms of the ratio $\frac{f_\pi - f_\eta}{\phi_d}$. We note that the $L^1(\phi_d)$-norm of $\frac{f_\pi - f_\eta}{\phi_d}$ is exactly twice the total variation. On the other hand, the squared Hellinger distance and the $\chi^2$-divergence behave like the squared $L^2(\phi_d)$-norm.

Theorem 2.5 (Inequality between $L^1(\phi_d)$ and $L^2(\phi_d)$ norms). Let $\pi$ and $\eta$ be probability measures supported on the $d$-dimensional cube $[-2M, 2M]^d$. Define $g := \frac{f_\pi - f_\eta}{\phi_d}$ and suppose $\delta > 0$. Then, there exists $C_0 = C_0(\delta, M, d) > 0$, not depending on $\pi$ or $\eta$, such that
$$\|g\|_{L^2(\phi_d)} \le \left( C_0 \vee \mathrm{TV}^{-\alpha(\mathrm{TV}(f_\pi, f_\eta))}(f_\pi, f_\eta) \right) \mathrm{TV}(f_\pi, f_\eta),$$
where $\alpha(\cdot)$ is defined as in (4).

Proof.
We provide the proof of the result in general dimension later in Appendix A.2, built upon the inequalities in Appendix A.1. Here, we give a sketch of the proof for the one-dimensional setting with $d = 1$. Recall the definition (3) of the (one-dimensional) Hermite polynomials, and consider the Hermite polynomial expansion (see Lemma A.1) of $g$ as follows:
$$g(x) = \int_{\mathbb{R}} \frac{\phi_1(x - \theta)}{\phi_1(x)} \, d(\pi - \eta)(\theta) = \int_{\mathbb{R}} \sum_{k=0}^{\infty} \frac{\theta^k}{\sqrt{k!}} h_k(x) \, d(\pi - \eta)(\theta) = \sum_{k=0}^{\infty} \frac{\Delta_k}{\sqrt{k!}} h_k(x),$$
where the second equality is by Lemma A.1 and $\Delta_k := \int_{\mathbb{R}} \theta^k \, d(\pi - \eta)(\theta)$. We decompose $g = q + r$, where
$$q = \sum_{k=0}^{n} \frac{\Delta_k}{\sqrt{k!}} h_k, \qquad r = \sum_{k=n+1}^{\infty} \frac{\Delta_k}{\sqrt{k!}} h_k,$$
and $n$ is an integer that will be determined later. To handle the $L^1(\phi_1)$-norm of $q \in \Pi_n$, we define
$$c_n := \inf\left\{ \|P\|_{L^1(\phi_1)} : P \in \Pi_n, \ \|P\|_{L^2(\phi_1)} = 1 \right\}. \quad (5)$$
Note first that $c_n \le 1$ by the Cauchy–Schwarz inequality. For $P \in \Pi_n$, the Nikolskii-type inequality (Nevai and Totik, 1987) states that
$$\sup_{x \in \mathbb{R}} \left| P(x) \phi_1^{1/2}(x) \right| \lesssim n^{1/4} \|P\|_{L^2(\phi_1)}. \quad (6)$$
Thus, the following argument implies that $c_n \ge c n^{-1/4} e^{-n}$ holds for some universal constant $c > 0$:
$$\|P\|_{L^2(\phi_1)}^2 = \int_{-\infty}^{\infty} P^2 \phi_1 \le 2 \int_{-2\sqrt{n+1}}^{2\sqrt{n+1}} P^2 \phi_1 \le 2 \sup_{|x| \le 2\sqrt{n+1}} \left| \phi_1^{-1/2}(x) \right| \sup_{x \in \mathbb{R}} \left| P(x) \phi_1^{1/2}(x) \right| \int_{-\infty}^{\infty} |P \phi_1| \lesssim e^n \cdot n^{1/4} \|P\|_{L^2(\phi_1)} \cdot \|P\|_{L^1(\phi_1)},$$
where the first inequality is the restricted-range inequality and the last step is by (6). The restricted-range inequality above is due to Theorem 6.2(b) of Lubinsky (2007) with $W = \phi_1^{1/2}$ being the Freud-type weight function. In addition to $c_n$, another technical ingredient is to control the tail norm $\|r\|_{L^2(\phi_1)}$. Knowing that $|\Delta_k| \le 2(2M)^k$, we have
$$\|r\|_{L^2(\phi_1)} \le \left( \sum_{k=n+1}^{\infty} \frac{4(4M^2)^k}{k!} \right)^{1/2} \le \left( \frac{C}{n+1} \right)^{(n+1)/2},$$
where $C$ is a positive constant depending solely on $M$.
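The Hermite expansion at the start of this sketch can be checked numerically for discrete mixing distributions. The sketch below is our own illustration, not from the paper; it uses numpy's probabilists' Hermite module `numpy.polynomial.hermite_e`, whose polynomials $He_k$ satisfy $h_k = He_k/\sqrt{k!}$, and compares a truncated series $\sum_{k<K} \Delta_k h_k/\sqrt{k!}$ with the exact ratio $g = (f_\pi - f_\eta)/\phi_1$.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt

phi1 = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian density

def h(k, x):
    """Normalized Hermite polynomial h_k = He_k / sqrt(k!), orthonormal in L^2(phi_1)."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    return He.hermeval(x, coef) / sqrt(factorial(k))

# Discrete mixing distributions pi and eta supported on a compact set.
locs_pi, w_pi = np.array([-0.5, 0.3]), np.array([0.5, 0.5])
locs_eta, w_eta = np.array([-0.2, 0.4]), np.array([0.6, 0.4])

x = np.linspace(-3.0, 3.0, 13)
f_pi = (w_pi * phi1(x[:, None] - locs_pi)).sum(axis=1)
f_eta = (w_eta * phi1(x[:, None] - locs_eta)).sum(axis=1)
g = (f_pi - f_eta) / phi1(x)

# Moment differences Delta_k = int theta^k d(pi - eta)(theta); truncate the series at K.
K = 40
Delta = [(w_pi * locs_pi**k).sum() - (w_eta * locs_eta**k).sum() for k in range(K)]
series = sum(Delta[k] / sqrt(factorial(k)) * h(k, x) for k in range(K))

# Since |Delta_k| <= 2 * (1/2)^k here, the tail beyond K = 40 is negligible.
assert np.allclose(g, series, atol=1e-9)
```

Note $\Delta_0 = 0$ automatically, since $\pi$ and $\eta$ are both probability measures.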
Now we are ready to lower bound $\|g\|_{L^1(\phi_1)}$:
$$\|g\|_{L^1(\phi_1)} \ge \|q\|_{L^1(\phi_1)} - \|r\|_{L^1(\phi_1)} \ge c_n \|q\|_{L^2(\phi_1)} - \|r\|_{L^2(\phi_1)} \ge c_n \|g\|_{L^2(\phi_1)} - 2 \|r\|_{L^2(\phi_1)},$$
where the second inequality uses (5) and the Cauchy–Schwarz inequality, and the last inequality holds since $c_n \le 1$. Together with the lower bound for $c_n$ and the upper bound for $\|r\|_{L^2(\phi_1)}$, we have
$$2t \ge \sup_{n \ge 1} \left\{ c n^{-1/4} e^{-n} \|g\|_{L^2(\phi_1)} - 2 \left( \frac{C}{n+1} \right)^{(n+1)/2} \right\},$$
where $t = \frac{1}{2} \|g\|_{L^1(\phi_1)} = \mathrm{TV}(f_\pi, f_\eta)$. Finally, we choose $n \approx \frac{2 \log(1/t)}{\log\log(1/t)}$ to conclude the proof.

The full proof in Appendix A is self-contained, and the main challenge is to generalize the Nikolskii-type inequality and the restricted-range inequality to arbitrary dimension. See Propositions A.6 and A.7, respectively.

Proof of Theorem 2.1. Here we show that Theorem 2.1 follows directly from Theorem 2.5 and that the constants $C_0$ in the two theorems coincide. Fix $\theta \in [-M, M]^d$. Consider the translation map $\tau_\theta(x) = x - \theta$ and define the following push-forward measures:
$$\pi_\theta := (\tau_\theta)_\sharp \pi, \qquad \eta_\theta := (\tau_\theta)_\sharp \eta.$$
Note that these are simply translations of the original measures and are supported on $[-2M, 2M]^d$. Define $g_\theta := \frac{f_{\pi_\theta} - f_{\eta_\theta}}{\phi_d}$. Then,
$$\|g_\theta\|_{L^2(\phi_d)}^2 = \int_{\mathbb{R}^d} \frac{(f_\pi(x + \theta) - f_\eta(x + \theta))^2}{\phi_d(x)} \, dx = \int_{\mathbb{R}^d} \frac{(f_\pi(x) - f_\eta(x))^2}{\phi_d(x - \theta)} \, dx,$$
$$\|g_\theta\|_{L^1(\phi_d)} = \int_{\mathbb{R}^d} |f_\pi(x + \theta) - f_\eta(x + \theta)| \, dx = \int_{\mathbb{R}^d} |f_\pi(x) - f_\eta(x)| \, dx = 2\, \mathrm{TV}(f_\pi, f_\eta).$$
Since $g_\theta$ obeys the inequality in Theorem 2.5, there exists $C_0 = C_0(\delta, M, d) > 0$, not depending on $\pi$, $\eta$, or $\theta$, such that
$$\left( \int \frac{(f_\pi(x) - f_\eta(x))^2}{\phi_d(x - \theta)} \, dx \right)^{1/2} \le \left( C_0 \vee \mathrm{TV}^{-\alpha(\mathrm{TV}(f_\pi, f_\eta))}(f_\pi, f_\eta) \right) \mathrm{TV}(f_\pi, f_\eta).$$
Meanwhile, we can apply Jensen's inequality pointwise in $x$ to get
$$\frac{(f_\pi(x) - f_\eta(x))^2}{f_\eta(x)} \le \int \frac{(f_\pi(x) - f_\eta(x))^2}{\phi_d(x - \theta)} \, d\eta(\theta).$$
Integrate both sides in $x$.
Then, use Fubini–Tonelli (nonnegativity) and the fact that a mixture integral is upper bounded by the supremum of its integrand to show that
$$\chi^2(f_\pi \| f_\eta) \le \sup_{\theta \in [-M, M]^d} \int \frac{(f_\pi(x) - f_\eta(x))^2}{\phi_d(x - \theta)} \, dx,$$
thus concluding the proof.

3 Sharpness

In this section, we establish the sharpness of the inequalities by showing that the presence of the exponent $\alpha(\cdot)$ is necessary up to a constant. Since our construction of sharp examples is in one dimension, we will write $\phi = \phi_1$ for simplicity of notation. Note that Theorem 4.6 and its proof demonstrate that the minimax lower bound for density estimation in arbitrary dimensions can be established using only the one-dimensional sharpness result. We first show the sharpness of Corollary 2.4, and then the sharpness of Theorem 2.1 follows immediately by $H^2(p, q) \le \chi^2(p \| q)$.

Theorem 3.1 (Sharpness of Corollary 2.4). There exist two sequences of probability measures $\{\pi_n\}$ and $\{\eta_n\}$ supported on $[-M, M]$ such that, if we define $\mathrm{TV}_n := \mathrm{TV}(f_{\pi_n}, f_{\eta_n})$ and $H_n := H(f_{\pi_n}, f_{\eta_n})$, then $\mathrm{TV}_n \downarrow 0$ as $n \to \infty$, and moreover it holds for all $n$ that $\mathrm{TV}_n < e^{-e}$ and that
$$H_n \ge \mathrm{TV}_n^{1 - \alpha_*(\mathrm{TV}_n)},$$
where we define
$$\alpha_*(t) := \frac{0.33}{\log\log(1/t)}, \quad t > 0.$$

In the sequel, we will construct three pairs of sequences of mixing distributions, namely, $(\pi_n, \eta_n)$, $(\pi_n^{(1)}, \eta_n^{(1)})$, and $(\pi_n^{(2)}, \eta_n^{(2)})$. We begin with Lemma 3.2, providing a sharp example $(\pi_n, \eta_n)$ for Theorem 2.5. Corollary 3.3 then modifies this example to $(\pi_n^{(1)}, \eta_n^{(1)})$, showing the sharpness of Theorem 2.1. Finally, Corollary 3.4 constructs $(\pi_n^{(2)}, \eta_n^{(2)})$ from $(\pi_n^{(1)}, \eta_n^{(1)})$ to show the sharpness of Corollary 2.4.
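For intuition about the gap between the upper and lower bounds, the following small computation (our own illustration, not from the paper) tabulates the upper-bound correction $\alpha$ from (4) (with $\delta = 0.01$, our choice) against the sharpness correction $\alpha_*$ above: $\alpha_*(t) < \alpha(t)$ for every $t$, so the two results bracket the truth, and both corrections vanish as $t \downarrow 0$ while the multiplicative penalty $t^{-\alpha(t)}$ of Remark 2.2 still diverges.

```python
from math import e, log

def alpha(t, delta=0.01):
    """Upper-bound correction (4): alpha(t) = (2 + delta) / log(log(1/t) v e)."""
    return (2 + delta) / log(max(log(1 / t), e))

def alpha_star(t):
    """Lower-bound correction from Theorem 3.1: alpha_*(t) = 0.33 / log log(1/t)."""
    return 0.33 / log(log(1 / t))

ts = [1e-3, 1e-6, 1e-12, 1e-24, 1e-48]
alphas = [alpha(t) for t in ts]

# alpha_* sits strictly below alpha, both corrections decrease toward 0 as t -> 0,
# yet the penalty t^{-alpha(t)} still blows up along the way.
assert all(alpha_star(t) < alpha(t) for t in ts)
assert all(a1 > a2 > 0 for a1, a2 in zip(alphas, alphas[1:]))
assert all(t1 ** (-alpha(t1)) < t2 ** (-alpha(t2)) for t1, t2 in zip(ts, ts[1:]))
```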
Before we proceed to construct a sharp example for Theorem 2.5, we recall the essential ingredients of the proof of the theorem: 1) the quantity $c_n$, defined in (5), can be bounded from below by $e^{-O(n)}$; 2) we can control the tail norm $\|r\|_{L^2(\phi_1)}$ by $e^{-\Omega(n \log n)}$ using the differences between higher-order moments of the mixing distributions. We note that the sequence of monomials $(x^n)_n$ is a sharp instance for $c_n$, since the norm ratio $\|x^n\|_{L^1(\phi_1)} / \|x^n\|_{L^2(\phi_1)}$ decays exponentially in $n$. Knowing this fact, given $n$, we construct an example such that the $L^2(\phi_1)$ projection of $(f_{\pi_n} - f_{\eta_n})/\phi_1$ onto $\Pi_n$ is proportional to $x^n$. To this end, we will first find $n + 1$ points in $[-M, M]$ as the support of the two mixing distributions, denoted by $\theta_0, \ldots, \theta_n$, and then we match the lower-order moments $\Delta_0, \ldots, \Delta_n$ so that
$$\sum_{k=0}^{n} \frac{\Delta_k}{\sqrt{k!}} h_k \propto x^n.$$
Given the values of $\theta_0, \ldots, \theta_n$, the differences of the lower-order moments $\Delta_0, \ldots, \Delta_n$ can be solved from a linear equation that involves inverting a Vandermonde matrix (see Lemma B.2 for the definition). We choose $\theta_0, \ldots, \theta_n$ to be the zeros of the $(n+1)$-th Chebyshev polynomial of the first kind, i.e., Chebyshev nodes, since the corresponding inverse Vandermonde matrix is stable (Gautschi, 1974). Here $T_{n+1} \in \Pi_{n+1}$, the $(n+1)$-th Chebyshev polynomial of the first kind, is defined by
$$T_{n+1}(\cos(\theta)) = \cos((n+1)\theta). \quad (7)$$
In addition to the stability of the inverse Vandermonde matrix, another advantage of using the Chebyshev nodes is that we can recursively bound the differences of the higher-order moments given the lower-order moments. The properties of the construction are summarized in the following Lemma 3.2, whose proof will be given in Appendix B.

Lemma 3.2 (Sharp example for Theorem 2.5). Let $n \ge 11$ be an odd number and $\theta_j = \cos\left( \frac{2j+1}{2n+2} \pi \right)$, $j = 0, \ldots$
$, n$, be the zeros of the Chebyshev polynomial of the first kind, $T_{n+1}(x)$. Given $M > 0$, define $a = 1 \wedge M$ and
$$\Delta_k = \begin{cases} \dfrac{\left( a(\sqrt{2} - 1) \right)^{n+1}}{(n-k)!!}, & k \text{ is odd}, \\ 0, & k \text{ is even}, \end{cases} \quad (8)$$
for $k = 0, 1, \ldots, n$, where $(n-k)!!$ is the double factorial. Define $(w_0, \ldots, w_n) \in \mathbb{R}^{n+1}$ to be the unique vector solving
$$\Delta_k = \sum_{j=0}^{n} w_j (a\theta_j)^k, \quad k = 0, 1, \ldots, n. \quad (9)$$
Accordingly, define two discrete probability measures
$$\pi_n := \sum_{j=0}^{n} \left( \frac{1}{n+1} + w_j \right) \delta_{a\theta_j}, \qquad \eta_n := \sum_{j=0}^{n} \frac{1}{n+1} \delta_{a\theta_j}, \quad (10)$$
where $\delta_{a\theta_j}$ denotes the point mass at $a\theta_j$. Then,

1. $w_j$ is well-defined and $|w_j| \le \frac{1}{n+1}$ for all $j$.

2. $\pi_n$ and $\eta_n$ are valid discrete probability measures supported on $[-M, M]$.

3. For $0 \le k \le n$, $\Delta_k = \int \theta^k \, d(\pi_n - \eta_n)(\theta)$ satisfies
$$|\Delta_k| \le \left\{ a(\sqrt{2} - 1) \right\}^{n+1} \exp\left( \frac{n}{5.54} \right) b^{k - n}, \quad (11)$$
where $b := a \sqrt{\frac{n}{2.77}}$.

4. If we further define $\Delta_k := \int \theta^k \, d(\pi_n - \eta_n)(\theta)$ for $k > n$, then (11) is also true.

5. If we write $q_n(x) = \sum_{k=0}^{n} \frac{\Delta_k}{\sqrt{k!}} h_k(x)$ and $r_n(x) = \sum_{k=n+1}^{\infty} \frac{\Delta_k}{\sqrt{k!}} h_k(x)$, then
$$q_n(x) = \left\{ a(\sqrt{2} - 1) \right\}^{n+1} \frac{x^n}{n!}. \quad (12)$$
In addition, there exists a universal $N_0 \in \mathbb{N}$ such that it holds for all $n \ge N_0$ that
$$\|r_n\|_{L^2(\phi)} \le \frac{1}{32} \exp\left( \frac{n}{5.53} \right) \|q_n\|_{L^1(\phi)} \quad (13)$$
$$\le \frac{1}{16} \exp\left( -\left( \frac{\log 2}{2} - \frac{1}{5.53} \right) n \right) \|q_n\|_{L^2(\phi)}. \quad (14)$$

6. $g_n = q_n + r_n$ satisfies
$$\lim_{n \to \infty} \frac{1}{n \log n} \log\left( \frac{1}{\|g_n\|_{L^1(\phi)}} \right) = \lim_{n \to \infty} \frac{1}{n \log n} \log\left( \frac{1}{\|g_n\|_{L^2(\phi)}} \right) = \frac{1}{2}. \quad (15)$$

Proof. We will give the full proof in Appendix B.2. The key argument, which is to derive the bound (11) for $k > n$, is sketched below. Write the Chebyshev polynomial as $T_{n+1}(x) = 2^n \left( x^{n+1} - \sigma_2 x^{n-1} + \sigma_4 x^{n-3} - \cdots + (-1)^{(n+1)/2} \sigma_{n+1} \right)$. The choice of the support $\{a\theta_0, \ldots$
$, a\theta_n\}$ implies that $T_{n+1}(\theta_j) = 0$ for all $j = 0, \ldots, n$, and thus
$$(a\theta_j)^{K+1} = \sigma_2 a^2 (a\theta_j)^{K-1} - \sigma_4 a^4 (a\theta_j)^{K-3} + \cdots + (-1)^{(n-1)/2} \sigma_{n+1} a^{n+1} (a\theta_j)^{K-n}.$$
This implies
$$|\Delta_{K+1}| = \left| \sum_{j=0}^{n} w_j (a\theta_j)^{K+1} \right| \le \sigma_2 a^2 |\Delta_{K-1}| + \sigma_4 a^4 |\Delta_{K-3}| + \cdots + \sigma_{n+1} a^{n+1} |\Delta_{K-n}|,$$
from which we can bound all $|\Delta_k|$ for $k > n$ via mathematical induction.

Corollary 3.3 (Sharp example for Theorem 2.1). Recall the definition (10) of $\pi_n$ and $\eta_n$ from the above. Let
$$R_n = \sqrt{8n + 4}, \qquad \lambda_n = \exp(-R_n), \quad (16)$$
and accordingly define
$$\pi_n^{(1)} := (1 - \lambda_n) \delta_0 + \lambda_n \pi_n, \qquad \eta_n^{(1)} := (1 - \lambda_n) \delta_0 + \lambda_n \eta_n, \quad (17)$$
where $\delta_0$ denotes the point mass at zero. Then, there exists a universal $N_0 \in \mathbb{N}$ such that it holds for all $n \ge N_0$ that
$$\mathrm{TV}\left( f_{\pi_n^{(1)}}, f_{\eta_n^{(1)}} \right) = \frac{\lambda_n}{2} \|g_n\|_{L^1(\phi)}, \qquad \sqrt{\chi^2\left( f_{\pi_n^{(1)}} \| f_{\eta_n^{(1)}} \right)} \ge \frac{\lambda_n}{4} \|q_n\|_{L^2(\phi)}. \quad (18)$$

Proof. See Appendix B.2.

Corollary 3.4 (Sharp example for Corollary 2.4). Recall the definition (17) of $\pi_n^{(1)}$ and $\eta_n^{(1)}$ from the above. Let
$$\pi_n^{(2)} := \frac{1}{4} \pi_n^{(1)} + \frac{3}{4} \eta_n^{(1)}, \qquad \eta_n^{(2)} := \eta_n^{(1)}. \quad (19)$$
Then, there exists a universal $N_0 \in \mathbb{N}$ such that it holds for all $n \ge N_0$ that
$$\mathrm{TV}\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right) = \frac{\lambda_n}{8} \|g_n\|_{L^1(\phi)}, \qquad H\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right) \ge \frac{\lambda_n}{64} \|q_n\|_{L^2(\phi)}. \quad (20)$$

Proof. The equality for the TV distance is straightforward. Now, observe for all $x \in \mathbb{R}$ that
$$u(x) := \frac{f_{\pi_n^{(1)}}(x)}{f_{\eta_n^{(1)}}(x)} - 1 = \frac{(1 - \lambda_n) \phi(x) + \sum_{j=0}^{n} \left( \frac{\lambda_n}{n+1} + \lambda_n w_j \right) \phi(x - a\theta_j)}{(1 - \lambda_n) \phi(x) + \sum_{j=0}^{n} \frac{\lambda_n}{n+1} \phi(x - a\theta_j)} - 1 \le \max_{0 \le j \le n} \frac{\frac{\lambda_n}{n+1} + \lambda_n w_j}{\frac{\lambda_n}{n+1}} - 1 \le 1,$$
where the last inequality uses $|w_j| \le \frac{1}{n+1}$, and hence that $\|u\|_\infty \le 1$. Write
$$H^2\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right) = H^2\left( \frac{1}{4} f_{\pi_n^{(1)}} + \frac{3}{4} f_{\eta_n^{(1)}}, f_{\eta_n^{(1)}} \right) = \int F\left( 1 + \frac{u}{4} \right) f_{\eta_n^{(1)}},$$
where $F(t) := \frac{1}{2} (\sqrt{t} - 1)^2.$
A Taylor expansion of $F$ gives, for some $|v| \le |u|$,
$$F\left( 1 + \frac{u}{4} \right) = \frac{u^2}{128} - \frac{u^3}{32 (4 + v)^{5/2}} \ge \frac{u^2}{128} - \frac{u^2}{288\sqrt{3}} \ge \frac{u^2}{256},$$
where the first inequality uses $\|u\|_\infty \le 1$. Integrating against $f_{\eta_n^{(1)}}$ yields
$$H^2\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right) \ge \frac{1}{256} \chi^2\left( f_{\pi_n^{(1)}} \| f_{\eta_n^{(1)}} \right),$$
concluding the proof.

Now we are ready to prove Theorem 3.1 (Sharpness of Corollary 2.4) with the above $(\pi_n^{(2)}, \eta_n^{(2)})$.

Proof of Theorem 3.1. Let $\mathrm{TV}_n := \mathrm{TV}\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right)$ and $H_n := H\left( f_{\pi_n^{(2)}}, f_{\eta_n^{(2)}} \right)$. Then, (15), (16), and (20) imply that
$$\lim_{n \to \infty} \frac{1}{n \log n} \log\left( \frac{1}{\mathrm{TV}_n} \right) = \frac{1}{2}.$$
Thus, it holds for large enough $n$ that
$$8 \|g_n\|_{L^1(\phi)} \le 8 \|q_n\|_{L^1(\phi)} + 8 \|r_n\|_{L^2(\phi)} \le \frac{1}{2} \exp\left( \frac{n}{5.53} \right) \|q_n\|_{L^1(\phi)} \le \exp\left( -\left( \log 2 - \frac{2}{5.53} \right) \frac{n}{2} \right) \|q_n\|_{L^2(\phi)} \le \exp\left( -\frac{0.33 \log(1/\mathrm{TV}_n)}{\log\log(1/\mathrm{TV}_n)} \right) \|q_n\|_{L^2(\phi)},$$
where the second inequality is by (13) and the third is by (14). Multiply both sides by $\frac{\lambda_n}{64}$ to conclude that
$$\mathrm{TV}_n = \frac{\lambda_n}{8} \|g_n\|_{L^1(\phi)} \le \mathrm{TV}_n^{\alpha_*(\mathrm{TV}_n)} \frac{\lambda_n}{64} \|q_n\|_{L^2(\phi)} \le \mathrm{TV}_n^{\alpha_*(\mathrm{TV}_n)} H_n,$$
where the equality is by (20), the first inequality is by the definition of $\alpha_*(\cdot)$, and the last inequality is again by (20).

Remark 3.5. A careful reader can verify that the constant $0.33$ in $\alpha_*(\cdot)$ can be replaced by any positive real strictly less than $\log(2) - \frac{1}{4 \log(2)} \approx 0.332$.

4 Applications

In this section, we provide a few consequences of our results. The notation "$\lesssim$, $\gtrsim$, $\asymp$" in this section will hide constants depending on $M$ or $d$.

4.1 Entropic Characterization of Learning in TV

The characterization of minimax rates of estimation via metric entropy has been investigated by LeCam (1973); Birgé (1983, 1986); Yatracos (1985); Haussler and Opper (1997); Yang and Barron (1999). While minimax upper and lower bounds do not necessarily match in general, recent work by Jia et al.
(2023) showed that estimating Gaussian mixture densities with bounded support under the Hellinger distance admits an exact entropic characterization of the minimax rate, owing to the fact that $H^2(f_\pi, f_\eta) \asymp \mathrm{KL}(f_\pi \| f_\eta)$. Similarly, our result of Corollary 2.4 that relates the total variation and Hellinger distances also implies such a characterization for the same problem under total variation, up to a $1 - o(1)$ exponent in the rate. We first define the metric entropy of Gaussian location mixtures, and then state a result of Jia et al. (2023).

Definition 4.1. Let $\mathcal{P}_{M,d}$ be the collection of $d$-dimensional Gaussian mixtures whose mixing distributions are supported on the $d$-dimensional cube $[-M, M]^d$. For a distribution class $\mathcal{P} \subseteq \mathcal{P}_{M,d}$, its (global) Hellinger covering number is defined by
$$N_H(\mathcal{P}, \epsilon) := \min\left\{ N : \exists P_1, \ldots, P_N, \ \sup_{R \in \mathcal{P}} \inf_{1 \le i \le N} H(R, P_i) \le \epsilon \right\}.$$
The local Hellinger covering number of $\mathcal{P}$ is
$$N_{H, \mathrm{loc}}(\mathcal{P}, \epsilon) := \sup_{P \in \mathcal{P}, \, \eta \ge \epsilon} N_H(B_H(P, \eta), \eta/2),$$
where $B_H(P, \eta) = \{ R \in \mathcal{P} : H(P, R) \le \eta \}$. We define the global/local total variation covering numbers in the same manner.

Proposition 4.2 (Corollary 11 of Jia et al. (2023)). Suppose $\mathcal{P}$ is a compact subset (in Hellinger) of $\mathcal{P}_{M,d}$. Let $\hat{P} = \hat{P}(X_1, \ldots, X_n)$ denote an estimator based on $X_1, \ldots, X_n$ drawn i.i.d. from $P \in \mathcal{P}$. Then,
$$\inf_{\hat{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[ H^2(P, \hat{P}) \right] \asymp \inf_{\hat{P} \in \mathcal{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[ H^2(P, \hat{P}) \right] \asymp \epsilon_n^2,$$
where
$$\epsilon_n^2 \asymp \inf_{\epsilon > 0} \left\{ \epsilon^2 + \frac{1}{n} \log N_{H, \mathrm{loc}}(\mathcal{P}, \epsilon) \right\}. \quad (21)$$

Unlike the Hellinger distance, there only exists an entropic characterization of the minimax upper bound in total variation (Yatracos, 1985). An entropic lower bound is not available in the literature, to the best of our knowledge. By Corollary 2.4, the rate $\epsilon_n$ determined by the Hellinger entropy (21) also characterizes the minimax rate of estimation under total variation as follows.
Theorem 4.3 (Learning Gaussian mixtures in total variation). Under the same conditions as in Proposition 4.2, for any $\delta > 0$, we have
$$\epsilon_n^{2\left( 1 + \frac{2 + \delta}{\log(\log(1/\epsilon_n) \vee e)} \right)} \lesssim \inf_{\hat{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[ \mathrm{TV}^2(P, \hat{P}) \right] \asymp \inf_{\hat{P} \in \mathcal{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[ \mathrm{TV}^2(P, \hat{P}) \right] \lesssim \epsilon_n^2,$$
where $\epsilon_n$ is defined as in (21).

4.2 Robust Density Estimation

In this section, we consider the problem of estimating a Gaussian mixture with contaminated data,
$$X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P := (1 - \epsilon) P_{f_\pi} + \epsilon Q, \quad (22)$$
where the distribution $P_{f_\pi} \in \mathcal{P}_{M,d}$ has density function $f_\pi$ and $Q$ is an arbitrary distribution of contamination. The data generating process in (22) is recognized as Huber's contamination model (Huber, 1964). Robust density estimation with Huber contamination has been previously studied by Liu and Gao (2019); Zhang and Ren (2023); Humbert et al. (2022), where kernel density estimators are shown to estimate Hölder smooth density functions at optimal rates.

Our main goal is to estimate the Gaussian mixture $f_\pi$ under the Hellinger distance, since the Hellinger error of density estimation directly implies a regret bound for empirical Bayes learning in the Gaussian sequence model (Jiang and Zhang, 2009; Saha and Guntuboyina, 2020). To this end, we will first introduce a robust estimator that has a statistical guarantee under the total variation distance. This step is standard by the construction of Yatracos (1985), since the Huber contamination (22) is a special case of model misspecification under total variation. Details of Yatracos' estimator will be given in Appendix C.1. Its statistical guarantee is given by the following proposition.

Proposition 4.4 (Robust density estimation in TV). Consider the data generating process in (22).
Then, Yatracos' estimator $\hat{f}$ satisfies
$$\sup_{\pi, Q} \mathbb{E}\left[ \mathrm{TV}^2(f_\pi, \hat{f}) \right] \lesssim \epsilon^2 + \frac{\log^{d+1}(n)}{n},$$
where the expectation is under (22) and the supremum is taken over all $Q$ and $\pi$ such that $\mathrm{supp}(\pi) \subseteq [-M, M]^d$.

Note that Yatracos' estimator is a proper estimator in the sense that $\hat{f}$ itself is also a Gaussian location mixture with mixing distribution supported on $[-M, M]^d$. Thus, our Corollary 2.4 directly implies a minimax upper bound in Hellinger distance as follows.

Theorem 4.5 (Robust density estimation in Hellinger). Consider the data generating process in (22). Suppose $\delta > 0$. Then, Yatracos' estimator $\hat{f}$ satisfies
$$\sup_{\pi, Q} \mathbb{E}\left[ H^2(f_\pi, \hat{f}) \right] \lesssim \mathcal{E}^2(\epsilon, n), \quad (23)$$
where we define
$$\mathcal{E}^2(\epsilon, n) := \epsilon^{2\left( 1 - \frac{2 + \delta}{\log(\log(1/\epsilon) \vee e)} \right)} + \left( \frac{1}{n} \right)^{1 - o_d(1)}, \quad (24)$$
the expectation is under (22), the supremum is taken over all $Q$ and $\pi$ such that $\mathrm{supp}(\pi) \subseteq [-M, M]^d$, and $o_d(1)$ is a positive real-valued function of $n$ and $d$ which converges to zero as $n \to \infty$.

We note that estimating $f_\pi$ in Hellinger distance has previously been studied by Kim and Guntuboyina (2022); Saha and Guntuboyina (2020); Soloff et al. (2025) in the special case of (22) with $\epsilon = 0$. Compared with these results, it is likely that the second term $n^{-(1 - o_d(1))}$ in (24) can still be slightly improved. However, this would require techniques very different from our Corollary 2.4, and we leave it as future work. On the other hand, the first term $\epsilon^{2\left( 1 - \frac{2 + \delta}{\log(\log(1/\epsilon) \vee e)} \right)}$ in (24) can be shown to be optimal. The following result is obtained by applying the two-point argument in Chen et al. (2018) to the sharpness example used in Theorem 3.1.

Theorem 4.6 (Minimax lower bound on robust density estimation in Hellinger). Consider the data generating process in (22). Then, we have
$$\inf_{\hat{f}} \sup_{\pi, Q} \mathbb{E}\left[ H^2(f_\pi, \hat{f}) \right] \gtrsim \epsilon^{2\left( 1 - \frac{0.33}{\log(\log(1/\epsilon) \vee e)} \right)}, \quad (25)$$
where the expectation is under (22) and the supremum is taken over all $Q$ and $\pi$ such that $\mathrm{supp}(\pi) \subseteq [-M, M]^d$.

The Hellinger bound in Theorem 4.5 can be applied to empirical Bayes learning of Gaussian means with outliers. To motivate this application, consider the following Gaussian location model with prior $\pi$:
$$X \mid \theta \sim N(\theta, I_d), \qquad \theta \sim \pi.$$
With the knowledge of the prior, the (oracle) Bayes estimator with respect to the squared error loss is given by the posterior mean,
$$\hat{\theta}_\star(X) = X + \frac{\nabla f_\pi(X)}{f_\pi(X)}. \quad (26)$$
This formula is known as Tweedie's formula (Efron, 2011). Without the knowledge of $\pi$, an empirical Bayes estimator replaces the $f_\pi$ in (26) by its estimator,
$$\hat{\theta}(X) := X + \frac{\nabla \hat{f}(X)}{\hat{f}(X)}.$$
The regret (Saha and Guntuboyina, 2020; Soloff et al., 2025) of $\hat{\theta}(X)$ is quantified by
$$\mathbb{E}_{X \sim f_\pi} \left\| \hat{\theta}(X) - \hat{\theta}_\star(X) \right\|^2,$$
which is actually the Fisher divergence between $f_\pi$ and $\hat{f}$. In a typical empirical Bayes setting, one has i.i.d. observations generated by $f_\pi$. Here, we consider the more general data generating process in (22) that allows the presence of arbitrary outliers. This requires the estimator $\hat{f}$ to be robust, and thus Yatracos' estimator, satisfying the risk bound in Theorem 4.5, is adopted here. We note that the clean-data setting of the problem with $\epsilon = 0$ has been well studied in the literature (James et al., 1961; Efron and Morris, 1972, 1973; Johnstone, 2002; Ignatiadis and Sen, 2025), and the nonparametric maximum likelihood estimator (NPMLE) and sieve MLE are shown to achieve the parametric rate up to some logarithmic factor (Wong and Shen, 1995; Genovese and Wasserman, 2000; Ghosal and Van Der Vaart, 2001; Jiang and Zhang, 2009; Saha and Guntuboyina, 2020; Soloff et al., 2025). However, when $\epsilon > 0$, it is unclear whether the NPMLE still works in the presence of arbitrary outliers.
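As a concrete one-dimensional illustration of Tweedie's formula (26), the sketch below (our own, not from the paper; names are ours) verifies that $X + f_\pi'(X)/f_\pi(X)$ coincides with the posterior mean computed directly from Bayes' rule for a discrete prior, and includes the floor parameter `rho` used to regularize the denominator as discussed next.

```python
import numpy as np

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

locs, wts = np.array([-1.0, 0.0, 2.0]), np.array([0.3, 0.4, 0.3])  # discrete prior pi

def f_pi(x):
    """Marginal density of X under X | theta ~ N(theta, 1), theta ~ pi."""
    return (wts * phi(x - locs)).sum()

def grad_f_pi(x):
    """Derivative of the marginal density f_pi."""
    return (wts * (locs - x) * phi(x - locs)).sum()

def tweedie(x, rho=0.0):
    """Posterior mean via Tweedie's formula (26); rho > 0 floors the denominator."""
    return x + grad_f_pi(x) / max(f_pi(x), rho)

def posterior_mean(x):
    """Direct Bayes computation of E[theta | X = x]."""
    post = wts * phi(x - locs)
    return (locs * post).sum() / post.sum()

# Tweedie's formula reproduces the posterior mean exactly (up to rounding).
for x in (-2.0, 0.5, 3.0):
    assert abs(tweedie(x) - posterior_mean(x)) < 1e-12
```

Flooring the denominator by a large `rho` shrinks the correction term toward zero, which is the price paid for numerical stability where $\hat{f}$ is small.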
We suspect that the error rate of the NPMLE has a highly sub-optimal dependence on $\epsilon$. In terms of the technique for analyzing the regret bound, results in Jiang and Zhang (2009); Saha and Guntuboyina (2020); Soloff et al. (2025) and related papers crucially rely on the Hellinger control of the Fisher divergence. See Theorem E.1 of Saha and Guntuboyina (2020) for instance. Note that these works employed a regularized version of $\hat{\theta}(X)$ in the following form to avoid numerical instability when the denominator becomes close to zero:
$$\hat{\theta}_\rho(X) := X + \frac{\nabla \hat{f}(X)}{\hat{f}(X) \vee \rho}. \quad (27)$$
Following the same strategy, the result of Theorem 4.7 is an immediate consequence of Theorem 4.5.

Theorem 4.7 (Robust regret bound). Consider the data generating process in (22). Suppose $\hat{\theta}_\star(\cdot)$ is as in (26). Then, there exists $\rho = \rho(\epsilon, n) > 0$ such that $\hat{\theta}_\rho(\cdot)$ in (27), with $\hat{f}$ being Yatracos' estimator, satisfies
$$\sup_{\pi, Q} \mathbb{E}\left[ \mathbb{E}_{X \sim f_\pi} \left\| \hat{\theta}_\rho(X) - \hat{\theta}_\star(X) \right\|^2 \right] \lesssim \mathcal{E}^2(\epsilon, n), \quad (28)$$
where the outer expectation is under (22), the supremum is taken over all $Q$ and $\pi$ such that $\mathrm{supp}(\pi) \subseteq [-M, M]^d$, and the error function $\mathcal{E}^2(\epsilon, n)$ is defined as in (24).

See Appendix C.2 for detailed proofs of Theorem 4.3, Proposition 4.4, and Theorems 4.5, 4.6, and 4.7.

5 Discussion

We establish a sharp relation between the total variation and Hellinger distances in this paper. Our results are derived for $d$-dimensional isotropic Gaussian mixture models with a fixed covariance $I_d$. While we discuss implications for empirical Bayes methods, these procedures often involve a joint prior on location and covariance. Extending our results to heteroscedastic Gaussian mixtures is an interesting direction for future work. Another open problem closely related to our paper is the sharp relation between the total variation and the $L^2$ distances.
Resolving this question will have direct implications for nonparametric density estimation under the $L_2$ loss. Finally, establishing a sharp connection between the total variation distance and the Fisher divergence will further deepen the understanding of empirical Bayes procedures in the robust setting.

Acknowledgements

We thank Nikolaos Ignatiadis for fruitful discussions on the implications of our results for empirical Bayes.

References

Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l'estimation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 65(2):181–237.

Birgé, L. (1986). On estimating a density using Hellinger distance and some other strange facts. Probability Theory and Related Fields, 71(2):271–291.

Chen, M., Gao, C., and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber's contamination model. The Annals of Statistics, 46(5):1932–1960.

Dasgupta, S. (1999). Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 634–644. IEEE.

Efron, B. (2011). Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614.

Efron, B. and Morris, C. (1972). Empirical Bayes on vector observations: An extension of Stein's method. Biometrika, 59(2):335–347.

Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors—an empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130.

Gautschi, W. (1974). Norm estimates for inverses of Vandermonde matrices. Numerische Mathematik, 23(4):337–347.

Genovese, C. R. and Wasserman, L. (2000). Rates of convergence for the Gaussian mixture sieve. The Annals of Statistics, 28(4):1105–1127.

Ghosal, S. and van der Vaart, A. W. (2001).
Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics, pages 1233–1263.

Guillemin, V. and Sternberg, S. (2013). Semi-Classical Analysis. International Press, Boston, MA.

Haussler, D. and Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics, 25(6):2451–2492.

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101.

Humbert, P., Le Bars, B., and Minvielle, L. (2022). Robust kernel density estimation with median-of-means principle. In International Conference on Machine Learning, pages 9444–9465. PMLR.

Ignatiadis, N. and Sen, B. (2025). Empirical partially Bayes multiple testing and compound $\chi^2$ decisions. The Annals of Statistics, 53(1):1–36.

James, W., Stein, C., et al. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379. University of California Press.

Jia, Z., Polyanskiy, Y., and Wu, Y. (2023). Entropic characterization of optimal rates for learning Gaussian mixtures. In The Thirty Sixth Annual Conference on Learning Theory, pages 4296–4335. PMLR.

Jiang, W. and Zhang, C.-H. (2009). General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics, pages 1647–1684.

Johnstone, I. M. (2002). Function estimation and Gaussian sequence models. Unpublished manuscript, 2(5.3):2.

Kim, A. K. and Guntuboyina, A. (2022). Minimax bounds for estimating multivariate Gaussian location mixtures. Electronic Journal of Statistics, 16(1):1461–1484.

Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. The Annals of Statistics, pages 38–53.

Lindsay, B. G. (1995). Mixture models: Theory, geometry and applications.
In NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–163. JSTOR.

Liu, H. and Gao, C. (2019). Density estimation with contamination: minimax rates and theory of adaptation. Electronic Journal of Statistics, 13:3613–3653.

Lubinsky, D. S. (2007). A survey of weighted approximation for exponential weights. arXiv preprint math/0701099.

Ma, Y., Wu, Y., and Yang, P. (2025). On the best approximation by finite Gaussian mixtures. IEEE Transactions on Information Theory.

Maizlish, O. and Prymak, A. (2015). Convex polynomial approximation in $\mathbb{R}^d$ with Freud weights. Journal of Approximation Theory, 192:60–68.

Nevai, P. and Totik, V. (1987). Sharp Nikolskii inequalities with exponential weights. Analysis Mathematica, 13(4):261–267.

Polyanskiy, Y. and Wu, Y. (2025). Information Theory: From Coding to Learning. Cambridge University Press.

Saha, S. and Guntuboyina, A. (2020). On the nonparametric maximum likelihood estimator for Gaussian location mixture densities with application to Gaussian denoising. The Annals of Statistics, 48(2):738–762.

Soloff, J. A., Guntuboyina, A., and Sen, B. (2025). Multivariate, heteroscedastic empirical Bayes via nonparametric maximum likelihood. Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(1):1–32.

Szegő, G. (1939). Orthogonal Polynomials, volume 23. American Mathematical Society.

Watson, G. N. (1933). Notes on generating functions of polynomials: (2) Hermite polynomials. Journal of the London Mathematical Society, 1(3):194–199.

Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. The Annals of Statistics, pages 339–362.

Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, pages 1564–1599.

Yatracos, Y. G. (1985).
Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13(2):768–774.

Zhang, P. and Ren, Z. (2023). Adaptive minimax density estimation on $\mathbb{R}^d$ for Huber's contamination model. Information and Inference: A Journal of the IMA, 12(4):3042–3066.

A Proof of the Main Results

A.1 Preliminaries: Hermite Polynomials and Inequalities

This section has two main goals. The first is to develop an understanding of the Hilbert space $L^2(\mathbb{R}^d, \phi_d)$ using the Christoffel–Darboux kernel (Proposition A.2), which paves the way for the proofs of the Nikolskii-type inequality (Proposition A.6) and the restricted-range inequality (Proposition A.7). The second is to prove Proposition A.8, which is a key ingredient in the proof of our main result, Theorem 2.5.

The results in this section have important implications in quantum mechanics. However, we postpone their physical interpretation for the moment. We first proceed to prove Proposition A.8 and Theorem 2.5 without relying on any physics, and then return to discuss the physical meaning at the end.

The study of orthogonal polynomials has a long and rich history, encompassing works from Szegő (1939) to Lubinsky (2007), among many others. Results on multivariate polynomials are relatively limited and dispersed throughout diverse literatures, including theoretical mathematics and quantum physics, making a unified overview challenging. For the sake of keeping the present paper self-contained, we summarize the essential results in this section. We refer to the notations defined in Section 1.2 and fix $d \geq 1$ throughout this section.

Lemma A.1 (Hermite polynomial expansion). For $\theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d$ and $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, we have
\[
\frac{\phi_d(x - \theta)}{\phi_d(x)} = \sum_{k \in \mathbb{N}_0^d} \frac{\theta^k}{\sqrt{k!}} h_k(x),
\]
where we define
\[
\theta^k := \prod_{j=1}^{d} \theta_j^{k_j}, \qquad k! := \prod_{j=1}^{d} k_j!.
\]

Proof.
The one-dimensional version of this result is classical and easy to show; see, for example, Equation (5.5.7) of Szegő (1939). We can generalize to arbitrary dimensions as follows:
\[
\frac{\phi_d(x - \theta)}{\phi_d(x)} = \exp\Bigl( \langle \theta, x \rangle - \tfrac{1}{2} \|\theta\|_2^2 \Bigr) = \prod_{j=1}^{d} \exp\Bigl( \theta_j x_j - \tfrac{1}{2} \theta_j^2 \Bigr) = \prod_{j=1}^{d} \sum_{k_j = 0}^{\infty} \frac{\theta_j^{k_j}}{\sqrt{k_j!}} h_{k_j}(x_j).
\]
Expand the product to conclude the proof.

Proposition A.2 (Christoffel–Darboux kernel). For $n \in \mathbb{N}_0$, define the $n$-th Christoffel–Darboux kernel $K_n$ as
\[
K_n(x, y) := \sum_{|k| \leq n} h_k(x) h_k(y). \tag{29}
\]
Then, given $x \in \mathbb{R}^d$,

1. $K_n(x, \cdot) \in \Pi_n^d$.
2. $\langle f, K_n(x, \cdot) \rangle_{L^2(\mathbb{R}^d, \phi_d)} = f(x)$ holds for all $f \in \Pi_n^d$.

Proof. The first statement is obvious. By linearity, it suffices to prove the second statement when $f = h_k$ for some $|k| \leq n$, which is straightforward.

Proposition A.3 (Christoffel–Darboux function). Given $x \in \mathbb{R}^d$,
\[
\inf\Bigl\{ \|P\|_{L^2(\mathbb{R}^d, \phi_d)}^2 : P \in \Pi_n^d,\ P(x) = 1 \Bigr\} = \frac{1}{K_n(x, x)}. \tag{30}
\]

Proof. For $P \in \Pi_n^d$ such that $P(x) = 1$, write $P = \sum_{|k| \leq n} c_k h_k$, so that, by the Cauchy–Schwarz inequality,
\[
1 = \Bigl( \sum_{|k| \leq n} c_k h_k(x) \Bigr)^2 \leq \Bigl( \sum_{|k| \leq n} c_k^2 \Bigr) \Bigl( \sum_{|k| \leq n} h_k^2(x) \Bigr) = \|P\|_{L^2(\mathbb{R}^d, \phi_d)}^2 \, K_n(x, x),
\]
demonstrating the lower bound. The lower bound is attained by the polynomial $\frac{K_n(x, \cdot)}{K_n(x, x)} \in \Pi_n^d$.

In view of Proposition A.3, it is important to study an upper bound on the diagonal entries $K_n(x, x)$ of the C–D kernel. To achieve this, we first introduce a useful lemma.

Lemma A.4 (Mehler's formula). For $k \in \mathbb{N}_0^d$, define $E_k := 2|k| + d$. For $x, y \in \mathbb{R}^d$ and $t > 0$, define the Mehler kernel by
\[
M(x, y; t) := \sum_{k \in \mathbb{N}_0^d} e^{-t E_k} h_k(x) h_k(y) \, \phi_d^{1/2}(x) \phi_d^{1/2}(y). \tag{31}
\]
Then, we have the following closed-form formula:
\[
M(x, y; t) = (4\pi \sinh(2t))^{-d/2} \exp\biggl( -\frac{\|x\|_2^2 + \|y\|_2^2}{4 \tanh(2t)} + \frac{\langle x, y \rangle}{2 \sinh(2t)} \biggr). \tag{32}
\]
If $y = x$, in particular, then
\[
M(x, x; t) = (4\pi \sinh(2t))^{-d/2} \exp\Bigl( -\frac{\tanh(t)}{2} \|x\|_2^2 \Bigr). \tag{33}
\]

Proof. The right-hand side of (31) factorizes as
\[
\prod_{j=1}^{d} \sum_{k_j \in \mathbb{N}_0} e^{-t(2 k_j + 1)} h_{k_j}(x_j) h_{k_j}(y_j) \, \phi_1^{1/2}(x_j) \phi_1^{1/2}(y_j).
\]
Since the closed-form formula (32) can also be factorized in the same manner, it suffices to show (32) only for $d = 1$. There are many known proofs of the one-dimensional Mehler's formula. One such proof dates back (at least) to Watson (1933). Since it is quite short, we include it below. Recall the Fourier transform of $\phi_1$:
\[
\phi_1(x) = \frac{1}{2\pi} \int \exp\Bigl( -\frac{\xi^2}{2} + i x \xi \Bigr) d\xi.
\]
Hence, from the definition of $h_k$,
\[
h_k(x) \phi_1^{1/2}(x) = \frac{(-1)^k}{\sqrt{k!}} \phi_1^{-1/2}(x) \frac{d^k}{dx^k} \phi_1(x) = \frac{1}{2\pi \sqrt{k!}} \phi_1^{-1/2}(x) \int (-i\xi)^k \exp\Bigl( -\frac{\xi^2}{2} + i x \xi \Bigr) d\xi.
\]
In conclusion,
\begin{align*}
\sum_{k=0}^{\infty} e^{-t(2k+1)} h_k(x) h_k(y) \phi_1^{1/2}(x) \phi_1^{1/2}(y)
&= (2\pi)^{-3/2} \exp\Bigl( -t + \frac{x^2 + y^2}{4} \Bigr) \iint \exp\Bigl( -\frac{\xi^2 + \zeta^2}{2} + i x \xi + i y \zeta \Bigr) \sum_{k=0}^{\infty} \frac{(-e^{-2t} \xi \zeta)^k}{k!} \, d\xi \, d\zeta \\
&= (2\pi)^{-3/2} \exp\Bigl( -t + \frac{x^2 + y^2}{4} \Bigr) \iint \exp\Bigl( -\frac{\xi^2 + \zeta^2}{2} - e^{-2t} \xi \zeta + i x \xi + i y \zeta \Bigr) d\xi \, d\zeta \\
&= \bigl( 2\pi (1 - e^{-4t}) \bigr)^{-1/2} \exp\Bigl( -t + \frac{x^2 + y^2}{4} - \frac{x^2 + y^2 - 2 e^{-2t} x y}{2 (1 - e^{-4t})} \Bigr) \\
&= (4\pi \sinh(2t))^{-1/2} \exp\Bigl( -\frac{x^2 + y^2}{4 \tanh(2t)} + \frac{x y}{2 \sinh(2t)} \Bigr).
\end{align*}

We have derived the explicit form of Mehler's formula, which implies the following corollary.

Corollary A.5 (Upper bounds of the C–D kernel). Recall the definition (29) of the Christoffel–Darboux kernel $K_n(x, x)$. For $n \in \mathbb{N}_0$, define
\[
E_{n,d} := 2n + d, \qquad C_{n,d} := \biggl( \frac{(n+d)^{n+d}}{n^n d^d} \biggr)^{1/2}. \tag{34}
\]
Then, we have
\[
\sup_{x \in \mathbb{R}^d} K_n(x, x) \phi_d(x) \leq (2\pi)^{-d/2} C_{n,d}, \tag{35}
\]
\[
C_{n,d} \leq \Bigl( \frac{e (n+d)}{d} \Bigr)^{d/2} = O(n^{d/2}). \tag{36}
\]
Furthermore, for $\kappa > 1$,
\[
\int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} K_n(x, x) \phi_d(x) \leq \Bigl( \frac{e}{2d} \sqrt{\frac{\kappa}{\kappa - 1}} \Bigr)^{d/2} E_{n,d}^{d/2} \exp\bigl( -c(\kappa) E_{n,d} \bigr), \tag{37}
\]
where we define
\[
c(\kappa) := \sqrt{\kappa (\kappa - 1)} - \log\bigl( \sqrt{\kappa} + \sqrt{\kappa - 1} \bigr) > 0.
\]

Proof. The inequality (36) is straightforward. The other inequalities (35) and (37) can be derived from the Chernoff bound using Mehler's formula (Lemma A.4) as follows. For all $x \in \mathbb{R}^d$ and $t > 0$,
\begin{align*}
K_n(x, x) \phi_d(x) &= \sum_{|k| \leq n} h_k^2(x) \phi_d(x) && \text{(by (29))} \\
&\leq e^{t E_{n,d}} \sum_{|k| \leq n} e^{-t E_k} h_k^2(x) \phi_d(x) && (E_k \leq E_{n,d}) \\
&\leq e^{t E_{n,d}} M(x, x; t) && \text{(by (31))} \\
&= e^{t E_{n,d}} (4\pi \sinh(2t))^{-d/2} \exp\Bigl( -\frac{\tanh(t)}{2} \|x\|_2^2 \Bigr). && \text{(by (33))}
\end{align*}
Therefore,
\[
\sup_{x \in \mathbb{R}^d} K_n(x, x) \phi_d(x) \leq \inf_{t > 0} e^{t E_{n,d}} (4\pi \sinh(2t))^{-d/2} = (2\pi)^{-d/2} C_{n,d},
\]
where the infimum is attained at $t = \frac{1}{4} \log\bigl( 1 + \frac{d}{n} \bigr)$. Similarly, for all $t > 0$ and $0 < s < \tanh(t)$,
\begin{align*}
\int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} K_n(x, x) \phi_d(x)
&\leq e^{t E_{n,d}} (4\pi \sinh(2t))^{-d/2} \int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} \exp\Bigl( -\frac{\tanh(t)}{2} \|x\|_2^2 \Bigr) \\
&\leq \exp\bigl( (t - \kappa s) E_{n,d} \bigr) (4\pi \sinh(2t))^{-d/2} \int_{\mathbb{R}^d} \exp\Bigl( -\frac{\tanh(t) - s}{2} \|x\|_2^2 \Bigr) \\
&= \exp\bigl( (t - \kappa s) E_{n,d} \bigr) \bigl( 2 \sinh(2t) (\tanh(t) - s) \bigr)^{-d/2}.
\end{align*}
Now fix $t = \log\bigl( \sqrt{\kappa} + \sqrt{\kappa - 1} \bigr) > 0$, so that $\cosh(t) = \sqrt{\kappa}$ and $\sinh(t) = \sqrt{\kappa - 1}$. Let $s = \tanh(t) - \frac{d}{2 \kappa E_{n,d}}$ accordingly to deduce
\[
\int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} K_n(x, x) \phi_d(x) \leq \exp\Bigl( \frac{d}{2} - c(\kappa) E_{n,d} \Bigr) \biggl( \frac{2d \sqrt{\kappa (\kappa - 1)}}{\kappa E_{n,d}} \biggr)^{-d/2},
\]
which is the desired result. Note that the choice of $(t, s)$ is asymptotically optimal as $E_{n,d} \to \infty$.

We have derived upper bounds on the diagonal entries $K_n(x, x)$ of the C–D kernel. Using these bounds, we now introduce three norm inequalities in $\Pi_n^d$, stated as Propositions A.6, A.7, and A.8. The first is the Nikolskii-type inequality. In the case $d = 1$, the Nikolskii-type inequality has been extensively studied.
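As a quick numerical sanity check (ours, not part of the paper's argument), the diagonal bound (35) can be spot-checked in the case $d = 1$: with $h_k = \mathrm{He}_k / \sqrt{k!}$ for the probabilists' Hermite polynomials $\mathrm{He}_k$, the maximum of $K_n(x, x) \phi_1(x)$ over a grid should stay below $(2\pi)^{-1/2} C_{n,1}$. The degree $n = 8$ and the grid are arbitrary choices.

```python
import numpy as np
from math import factorial, sqrt, pi

def he_values(x, n):
    # Probabilists' Hermite polynomials He_0..He_n at x (vectorized),
    # via the recurrence He_{k+1}(x) = x * He_k(x) - k * He_{k-1}(x).
    vals = [np.ones_like(x), np.asarray(x, dtype=float)]
    for k in range(1, n):
        vals.append(x * vals[k] - k * vals[k - 1])
    return vals[: n + 1]

n = 8
xs = np.linspace(-10.0, 10.0, 4001)
H = he_values(xs, n)

# K_n(x, x) = sum_{k <= n} h_k(x)^2 with h_k = He_k / sqrt(k!)  (d = 1).
K = sum(H[k] ** 2 / factorial(k) for k in range(n + 1))
lhs = float((K * np.exp(-xs**2 / 2) / sqrt(2 * pi)).max())  # sup_x K_n(x,x) phi_1(x) on the grid

C_n1 = sqrt((n + 1) ** (n + 1) / n**n)  # C_{n,1} from (34) with d = 1
rhs = C_n1 / sqrt(2 * pi)               # the bound (35)
assert lhs <= rhs
```

The Chernoff-type bound (35) is not tight for fixed $n$, so the observed maximum sits comfortably below the bound.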
For instance, the paper by Nevai and Totik (1987) focuses on the one-dimensional setting and establishes the sharpness of Nikolskii-type inequalities (with more general weight functions). Note that the Mhaskar–Rakhmanov–Saff (MRS) number $a_n$ discussed in that paper is linearly comparable to $\sqrt{2 E_{n,d}}$, the threshold. The second is the restricted-range inequality. Likewise, in the one-dimensional setting, the restricted-range inequality has been studied in great depth; see Chapter 6 of the survey Lubinsky (2007). For higher dimensions, a few results are known as well; for example, see Lemma 5 of Maizlish and Prymak (2015). The third, to the best of our knowledge, does not have a standard name, but it can be derived as a combination of the preceding two and will play an essential role in our main result.

Proposition A.6 (Nikolskii-type inequality). Recall the definition (34) of $C_{n,d}$. For all $P \in \Pi_n^d$, we have
\[
\sup_{x \in \mathbb{R}^d} \bigl| P(x) \phi_d^{1/2}(x) \bigr| \leq (2\pi)^{-d/4} C_{n,d}^{1/2} \|P\|_{L^2(\mathbb{R}^d, \phi_d)}.
\]

Proof. According to Proposition A.3 and Corollary A.5, it holds for all $x \in \mathbb{R}^d$ that
\begin{align*}
P^2(x) \phi_d(x) &\leq (2\pi)^{-d/2} C_{n,d} \frac{P^2(x)}{K_n(x, x)} && \text{(by (35))} \\
&\leq (2\pi)^{-d/2} C_{n,d} \|P\|_{L^2(\mathbb{R}^d, \phi_d)}^2. && \text{(by (30))}
\end{align*}
Take square roots of both sides to conclude the proof.

Proposition A.7 (Restricted-range inequality). Recall the definition (34) of $E_{n,d}$. Suppose $\kappa > 1$. Then, there exists a constant $A = A(\kappa)$, depending only on $\kappa$, such that, if $E_{n,d} \geq A d$, then, for all $P \in \Pi_n^d$, we have
\[
\int_{\mathbb{R}^d} P^2 \phi_d \leq 2 \int_{\|x\|_2 \leq \sqrt{2 \kappa E_{n,d}}} P^2(x) \phi_d(x).
\]

Proof. Suppose
\[
\frac{E_{n,d}}{d} \geq \frac{1}{c(\kappa)} \log\biggl( \frac{e}{c(\kappa)} \sqrt{\frac{\kappa}{\kappa - 1}} \vee e \biggr) =: A(\kappa), \tag{38}
\]
where we define $c(\kappa)$ as in Corollary A.5. For $P \in \Pi_n^d$, write $P = \sum_{|k| \leq n} c_k h_k$, so that $\int P^2 \phi_d = \sum_{|k| \leq n} c_k^2$.
We have
\[
\int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} P^2(x) \phi_d(x) = \sum_{|k| \leq n} \sum_{|l| \leq n} c_k M_{kl} c_l,
\]
where we define
\[
M_{kl} := \int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} h_k(x) h_l(x) \phi_d(x).
\]
Here, $M = (M_{kl})$ is a $(\dim \Pi_n^d) \times (\dim \Pi_n^d)$ positive semi-definite matrix. Thus, every eigenvalue of $M$ is at most its trace. That is,
\[
\int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} P^2(x) \phi_d(x) \leq \Bigl( \int_{\mathbb{R}^d} P^2 \phi_d \Bigr) \operatorname{trace}(M).
\]
It suffices to show that the trace is at most $\frac{1}{2}$. By the definition (29) of the Christoffel–Darboux kernel,
\[
\operatorname{trace}(M) = \sum_{|k| \leq n} M_{kk} = \int_{\|x\|_2 > \sqrt{2 \kappa E_{n,d}}} K_n(x, x) \phi_d(x). \tag{39}
\]
Note that $z \geq 2 \log(a \vee e)$ implies $a z \leq e^z$. Thus, the assumption (38) implies
\[
\frac{e}{c(\kappa)} \sqrt{\frac{\kappa}{\kappa - 1}} \cdot \frac{2 c(\kappa)}{d} E_{n,d} \leq \exp\Bigl( \frac{2 c(\kappa)}{d} E_{n,d} \Bigr). \tag{40}
\]
In conclusion,
\begin{align*}
\operatorname{trace}(M) &\leq \Bigl( \frac{e}{2d} \sqrt{\frac{\kappa}{\kappa - 1}} E_{n,d} \Bigr)^{d/2} \exp\bigl( -c(\kappa) E_{n,d} \bigr) && \text{(by (37) and (39))} \\
&\leq 2^{-d}. && \text{(by (40))}
\end{align*}

The following Proposition A.8 is simply a combination of the two preceding Propositions A.6 and A.7, and it plays a central role in the proof of our main result.

Proposition A.8 (Asymptotic lower bound of the $L^1(\mathbb{R}^d, \phi_d)$-norm in $\Pi_n^d$). Recall the definition (34) of $E_{n,d}$ and $C_{n,d}$. Define
\[
c_{n,d} := \inf\Bigl\{ \|P\|_{L^1(\mathbb{R}^d, \phi_d)} : P \in \Pi_n^d,\ \|P\|_{L^2(\mathbb{R}^d, \phi_d)} = 1 \Bigr\}. \tag{41}
\]
If the assumption (38) of the previous Proposition A.7 holds, then
\[
c_{n,d} \geq \frac{1}{2} C_{n,d}^{-1/2} e^{-\kappa E_{n,d} / 2}. \tag{42}
\]

Proof. For $P \in \Pi_n^d$,
\begin{align*}
\|P\|_{L^2(\mathbb{R}^d, \phi_d)}^2 &\leq 2 \int_{\|x\|_2 \leq \sqrt{2 \kappa E_{n,d}}} P^2(x) \phi_d(x) && \text{(by Proposition A.7)} \\
&\leq 2 \sup_{\|x\|_2 \leq \sqrt{2 \kappa E_{n,d}}} \bigl| \phi_d^{-1/2}(x) \bigr| \, \sup_{x \in \mathbb{R}^d} \bigl| P(x) \phi_d^{1/2}(x) \bigr| \int_{\mathbb{R}^d} |P \phi_d| \\
&\leq 2 \bigl( (2\pi)^{d/4} e^{\kappa E_{n,d} / 2} \bigr) \bigl( (2\pi)^{-d/4} C_{n,d}^{1/2} \|P\|_{L^2(\mathbb{R}^d, \phi_d)} \bigr) \|P\|_{L^1(\mathbb{R}^d, \phi_d)}. && \text{(by Proposition A.6)}
\end{align*}
Cancel out $\|P\|_{L^2(\mathbb{R}^d, \phi_d)}$ from both sides to prove the inequality (42).

Corollary A.9. Recall the definition (41) of $c_{n,d}$. Suppose $\kappa_1 > 1$.
Then, there exists a constant $A_1 = A_1(\kappa_1)$, depending only on $\kappa_1$, such that, if $n \geq A_1 d$, then we have $c_{n,d} \geq 3 e^{-\kappa_1 n}$.

Proof. Suppose
\[
\frac{n}{d} \geq \inf_{\kappa} \biggl\{ 1 \vee \frac{A(\kappa)}{2} \vee \frac{1}{2 (\kappa_1 - \kappa)} \log\Bigl( \frac{3}{8} e^{\frac{1 + 2\kappa}{2 (\kappa_1 - \kappa)}} \vee e \Bigr) \biggr\} =: A_1(\kappa_1), \tag{43}
\]
where we define $A(\kappa)$ as in (38), and the infimum is taken with respect to $\kappa$ such that $1 < \kappa < \kappa_1$. Recall that $z \geq 2 \log(a \vee e)$ implies $a z \leq e^z$. Thus, the assumption (43) implies
\[
\frac{3}{8} e^{\frac{1 + 2\kappa}{2 (\kappa_1 - \kappa)}} \cdot \frac{4 (\kappa_1 - \kappa) n}{d} \leq \exp\Bigl( \frac{4 (\kappa_1 - \kappa) n}{d} \Bigr). \tag{44}
\]
In conclusion,
\begin{align*}
c_{n,d} &\geq \frac{1}{2} C_{n,d}^{-1/2} e^{-\kappa E_{n,d} / 2} && \text{(by (42))} \\
&\geq \frac{1}{2} \Bigl( \frac{e^{1 + 2\kappa} (n + d)}{d} \Bigr)^{-d/4} \exp(-\kappa n) && \text{(by (36))} \\
&\geq \frac{1}{3} \Bigl( \frac{2 e^{1 + 2\kappa} n}{d} \Bigr)^{-d/4} \exp(-\kappa n) && (\because n \geq d) \\
&\geq 3 \cdot 2^{d-1} \exp(-\kappa_1 n). && \text{(by (44))}
\end{align*}
Since $E_{n,d} = 2n + d \geq 2n$, the assumption (43) also implies the assumption (38) of Proposition A.7.

We have derived all the preliminary results required for the proof of our main theorem. Lastly, we introduce one technical lemma to conclude this section.

Lemma A.10 (Lambert $W$ function). Given $\kappa_2 > 1$, $B_0 \geq 1$, and $t \in (0, 1)$, define
\[
w_0 := 1 \vee \frac{2}{\kappa_2 - 1} \log\Bigl( \frac{B_0}{\kappa_2 - 1} \vee e \Bigr), \tag{45}
\]
\[
n_0 := \biggl\lceil 2 B_0 e^{w_0} \vee \frac{2 \kappa_2 \log(1/t)}{\log(\log(1/t) \vee e)} \biggr\rceil.
\]
Then, it holds for all $n \geq n_0$ that
\[
\Bigl( \frac{2 B_0}{n + 1} \Bigr)^{(n+1)/2} \leq t. \tag{46}
\]

Proof. Let $w > 0$ be the unique positive real number such that $\log(1/t) = B_0 w e^w$. Then,
\[
\Bigl( \frac{2 B_0}{2 B_0 e^w} \Bigr)^{B_0 e^w} = t.
\]
Since the function $z \mapsto (2 B_0 / z)^{z/2}$ is decreasing for $z > 2 B_0 / e$, it suffices to show $n + 1 \geq 2 B_0 e^w$ to prove the inequality (46). We divide the argument into two cases, (a) $w < w_0$ and (b) $w \geq w_0$. In case (a) $w < w_0$, it is obvious that $n + 1 \geq n_0 + 1 \geq 2 B_0 e^{w_0} \geq 2 B_0 e^w$. Hence, we now suppose (b) $w \geq w_0$. Recall that $z \geq 2 \log(a \vee e)$ implies $a z \leq e^z$. Thus, (45) implies
\[
\frac{B_0}{\kappa_2 - 1} \cdot (\kappa_2 - 1) w \leq \exp\bigl( (\kappa_2 - 1) w \bigr). \tag{47}
\]
Furthermore, since $B_0 \geq 1$ and $w_0 \geq 1$, we have $\log(1/t) = B_0 w e^w \geq e$ and
\[
n + 1 \geq \frac{2 \kappa_2 \log(1/t)}{\log(\log(1/t) \vee e)} = \frac{2 \kappa_2 B_0 w e^w}{\log(B_0 w e^w)} \geq 2 B_0 e^w,
\]
where the last inequality is equivalent to (47).

A.2 Proof of the Main Theorem

We have already shown in the main text that Theorem 2.5 implies Theorem 2.1. Therefore, we proceed to prove Theorem 2.5 here.

Proof of Theorem 2.5. Let $\kappa_1 > 1$ and $\kappa_2 > 1$ satisfy $2 \kappa_1 \kappa_2 = 2 + \delta$. First, in view of Corollary A.9, there exists a positive integer $A_1 = A_1(\kappa_1)$, depending only on $\kappa_1$, such that
\[
n \geq A_1 d \implies c_{n,d} \geq 3 e^{-\kappa_1 n}. \tag{48}
\]
Let $t := \frac{1}{2} \|g\|_{L^1(\phi_d)} = \mathrm{TV}(f_{\pi}, f_{\eta}) \in (0, 1)$. In view of Lemma A.10, define
\[
n := A_1 d \vee B \vee \biggl\lceil \frac{2 \kappa_2 \log(1/t)}{\log(\log(1/t) \vee e)} \biggr\rceil \in \mathbb{N}_0, \tag{49}
\]
where
\[
B_0 = B_0(\kappa_1, M^2 d) := \bigl( 1 \vee 2 e M^2 d \bigr) e^{2 \kappa_1}, \tag{50}
\]
\[
B = B(\kappa_1, \kappa_2, M^2 d) := \biggl\lceil 2 B_0 \exp\Bigl( 1 \vee \frac{2}{\kappa_2 - 1} \log\Bigl( \frac{B_0}{\kappa_2 - 1} \vee e \Bigr) \Bigr) \biggr\rceil. \tag{51}
\]
Observe from Lemma A.1 that
\[
g = \sum_{k \in \mathbb{N}_0^d} \frac{\Delta_k}{\sqrt{k!}} h_k, \qquad \Delta_k = \int_{\mathbb{R}^d} \theta^k \, d(\pi - \eta)(\theta).
\]
We decompose $g = q + r$, where
\[
q = \sum_{|k| \leq n} \frac{\Delta_k}{\sqrt{k!}} h_k \in \Pi_n^d, \qquad r = \sum_{|k| > n} \frac{\Delta_k}{\sqrt{k!}} h_k.
\]
From the compactness of the support, $|\Delta_k| \leq 2 (2M)^{|k|}$. Thus, by the multinomial theorem and Stirling's formula,
\[
\sum_{|k| = m} \frac{\Delta_k^2}{k!} \leq \sum_{|k| = m} \frac{4 (4 M^2)^m}{k!} = \frac{4 (4 M^2 d)^m}{m!} \leq \frac{4}{\sqrt{2 \pi m}} \Bigl( \frac{4 e M^2 d}{m} \Bigr)^m. \tag{52}
\]
It follows from the definition (49) that
\[
n + 1 \geq 2 B_0 e \geq 2 \bigl( 1 \vee 2 e M^2 d \bigr) e^{1 + 2 \kappa_1} \geq 16 \vee 8 e M^2 d.
\]
Thus,
\begin{align*}
\|r\|_{L^2(\phi_d)}^2 = \sum_{|k| > n} \frac{\Delta_k^2}{k!}
&\leq \sum_{m = n+1}^{\infty} \frac{4}{\sqrt{2 \pi (n + 1)}} \Bigl( \frac{4 e M^2 d}{n + 1} \Bigr)^m && \text{(by (52))} \\
&\leq \sum_{m = n+1}^{\infty} \frac{1}{2^{m - n - 1} \sqrt{2 \pi}} \Bigl( \frac{4 e M^2 d}{n + 1} \Bigr)^{n+1} && (\because n + 1 \geq 16 \vee 8 e M^2 d) \\
&\leq \Bigl( \frac{4 e M^2 d}{n + 1} \Bigr)^{n+1}. && (\because 2 \leq \sqrt{2 \pi})
\end{align*}
It follows from the definition (50) of $B_0$ that $4 e M^2 d \leq 2 B_0 e^{-2 \kappa_1}$. Hence, by Lemma A.10,
\[
\|r\|_{L^2(\phi_d)} \leq \Bigl( \frac{2 B_0 e^{-2 \kappa_1}}{n + 1} \Bigr)^{(n+1)/2} \leq e^{-\kappa_1 n} t \leq \frac{1}{2} e^{-\kappa_1 n} \|g\|_{L^2(\phi_d)}. \tag{53}
\]
The last inequality follows from Hölder's inequality $\|g\|_{L^1(\phi_d)} \leq \|g\|_{L^2(\phi_d)}$. We define $c_0 = c_0(\kappa_1, \kappa_2, M, d) := e^{-\kappa_1 (A_1 d \vee B)}$ and conclude that
\begin{align*}
2t = \|g\|_{L^1(\phi_d)} &\geq \|q\|_{L^1(\phi_d)} - \|r\|_{L^1(\phi_d)} && (\because g = q + r) \\
&\geq c_{n,d} \|q\|_{L^2(\phi_d)} - \|r\|_{L^2(\phi_d)} && \text{(by (41))} \\
&\geq c_{n,d} \|g\|_{L^2(\phi_d)} - 2 \|r\|_{L^2(\phi_d)} && (\because c_{n,d} \leq 1) \\
&\geq 3 e^{-\kappa_1 n} \|g\|_{L^2(\phi_d)} - e^{-\kappa_1 n} \|g\|_{L^2(\phi_d)} && \text{(by (48) and (53))} \\
&\geq 2 \exp\biggl( -\kappa_1 \Bigl( A_1 d \vee B \vee \frac{2 \kappa_2 \log(1/t)}{\log(\log(1/t) \vee e)} \Bigr) \biggr) \|g\|_{L^2(\phi_d)} \\
&\geq 2 \bigl( c_0 \wedge t^{\alpha(t)} \bigr) \|g\|_{L^2(\phi_d)},
\end{align*}
where
\[
\alpha(t) = \frac{2 \kappa_1 \kappa_2}{\log(\log(1/t) \vee e)}.
\]
Letting $C_0 := c_0^{-1}$ gives the desired result $\|g\|_{L^2(\phi_d)} \leq \bigl( C_0 \vee t^{-\alpha(t)} \bigr) t$.

A.3 Dependency of the Constant

In this section, we discuss how the constant $C_0$ in the main Theorems 2.1 and 2.5 depends on the radius $M$ and the dimension $d$. In short, $\log(C_0)$ has a polynomial order in $M^2 d$, and it is "nearly" linear in the regime where $\delta \to \infty$.

Proposition A.11 (Dependency of $C_0$ on $M$ and $d$). The constants $C_0 = C_0(\delta, M, d)$ in Theorems 2.1 and 2.5 coincide. Moreover, if we define $A_1 = A_1(\kappa_1)$ and $B = B(\kappa_1, \kappa_2, M^2 d)$ as in (43) and (51), respectively, then we can specify the constant as
\[
\log(C_0) := \inf_{2 \kappa_1 \kappa_2 = 2 + \delta} \kappa_1 \bigl( A_1 d \vee B \bigr),
\]
where the infimum is taken with respect to $\kappa_1, \kappa_2 > 1$ such that $2 \kappa_1 \kappa_2 = 2 + \delta$.

Proof. The definition (43) of $A_1 = A_1(\kappa_1)$ reflects the assumption of Corollary A.9, which is required to meet the conditions of Propositions A.7 and A.8 and to guarantee that $c_{n,d}$ defined in (41) is not less than $3 e^{-\kappa_1 n}$, as demonstrated in Corollary A.9. On the other hand, the definitions (50) and (51) of $B_0$ and $B$ reflect Lemma A.10, which is essential to control the tail norm $\|r\|_{L^2(\phi_d)}$ of $g = \frac{f_{\pi} - f_{\eta}}{\phi_d}$. We give a more detailed discussion below.
The first observation is that once $\kappa_1 > 1$ is fixed, $A_1$ is merely a universal constant. This shows that $\log(C_0)$ must depend on the dimension $d$ at least linearly. In contrast, the behavior of $B_0$ and $B$ described in (50) and (51) is more intricate. It suffices to consider the regime where $2 e M^2 d > 1$, because if the radius $M$ of the support is too small, we can simply embed the support into a larger cube. Therefore, once $\kappa_1$ is fixed, we have $B_0 \asymp M^2 d$. If in (51) we were allowed to take $\kappa_2$ sufficiently large, then we would obtain $\log(C_0) \asymp B_0 \asymp M^2 d$. However, this cannot be achieved in the regime where $\delta > 0$ is fixed and $M^2 d$ is large. In such a situation, we have the following polynomial rate:
\[
\log(C_0) \asymp (M^2 d)^{\frac{\kappa_2 + 1}{\kappa_2 - 1}}.
\]
If $\delta > 0$ is taken sufficiently large, the polynomial order in $M^2 d$ may recover the limit $\frac{\kappa_2 + 1}{\kappa_2 - 1} \to 1$.

A.4 Physical Interpretation: Quantum Harmonic Oscillator

In this section, we provide a physical interpretation of the restricted-range inequality, Proposition A.7. A classical Hamiltonian of a particle in $\mathbb{R}^d$ is given by
\[
H_{\mathrm{cl}} = \frac{1}{2} \|\xi\|_2^2 + V(x),
\]
where $\xi$ and $x$ are the momentum and position of the particle, respectively. The classical harmonic oscillator is defined by the potential energy $V(x) := \frac{1}{2} \|x\|_2^2$. The quantum-mechanical analog of the Hamiltonian is given by the following differential operator:
\[
H = -\frac{\hbar^2}{2} \nabla^2 + V : \psi \mapsto -\frac{\hbar^2}{2} \Bigl( \frac{\partial^2}{\partial x_1^2} + \cdots + \frac{\partial^2}{\partial x_d^2} \Bigr) \psi + \frac{1}{2} \bigl( x_1^2 + \cdots + x_d^2 \bigr) \psi.
\]
Here $\psi : \mathbb{R}^d \to \mathbb{R}$ is a wave function and $\hbar > 0$ is a constant closely related to the Planck constant, while we assume natural (mathematical) length and energy scales.

Proposition A.12 (Isotropic quantum harmonic oscillator). For $k \in \mathbb{N}_0^d$, define the Hermite function as
\[
\psi_k(x) := \Bigl( \frac{2}{\hbar} \Bigr)^{d/4} h_k\biggl( \sqrt{\frac{2}{\hbar}}\, x \biggr) \phi_d^{1/2}\biggl( \sqrt{\frac{2}{\hbar}}\, x \biggr).
\]
Then,

1. $H$ is a self-adjoint operator.
2. (normalization) $\|\psi_k\|_{L^2(\mathbb{R}^d)} = 1$.
3.
(Schrödinger equation) $H \psi_k = E_k \psi_k$, where the eigenvalue is $E_k = \frac{\hbar}{2} (2|k| + d)$.
4. $\{\psi_k\}$ consists entirely of eigenfunctions of $H$. Moreover, if we define the Mehler kernel $M(x, y; t) := \sum_{k \in \mathbb{N}_0^d} e^{-t E_k} \psi_k(x) \psi_k(y)$ for $t > 0$, then
\[
M(x, y; t) = \bigl( 2 \pi \hbar \sinh(\hbar t) \bigr)^{-d/2} \exp\biggl( -\frac{\|x\|_2^2 + \|y\|_2^2}{2 \hbar \tanh(\hbar t)} + \frac{\langle x, y \rangle}{\hbar \sinh(\hbar t)} \biggr).
\]

Proof. See Lemma A.4.

Remark A.13. The eigenvalue $E_k$ is the energy level of the state $k$. A complex-analytical analog of the Mehler kernel is the Feynman propagator, where $t > 0$ represents inverse temperature.

For the sake of the preceding proofs, we are only interested in the special case $\hbar = 2$, in which $\psi_k = h_k \phi_d^{1/2}$ and $E_k = 2|k| + d$. Recall that Corollary A.5 describes upper bounds of the quantity $K_n(x, x) \phi_d(x)$ involving the diagonal entries of the Christoffel–Darboux kernel (29). The quantity can be rewritten as
\[
K_n(x, x) \phi_d(x) = \sum_{E_k \leq E_{n,d}} \psi_k^2(x), \tag{54}
\]
where $E_{n,d} = 2n + d$ as in (34). Thus, (54) represents the diagonal entries of the low-energy spectral projector kernel and explains the spatial density of states (DOS). As such, the local Weyl law states that, given $x \in \mathbb{R}^d$, in the classical regime where $E_{n,d} \to \infty$, we have
\[
\sum_{E_k \leq E_{n,d}} \psi_k^2(x) \to (4\pi)^{-d} \int_{H_{\mathrm{cl}} \leq E_{n,d}} d\xi = (4\pi)^{-d} \omega_d \bigl( 2 E_{n,d} - \|x\|_2^2 \bigr)^{d/2},
\]
where $\omega_d$ is the volume of the $d$-dimensional unit (Euclidean) ball. Therefore, in the classically forbidden region where $\|x\|_2 > \sqrt{2 E_{n,d}}$, i.e., where the potential energy exceeds the mechanical energy, we expect the quantity (54) to converge to zero as $n \to \infty$. The tail bound (37) is the mathematically rigorous version of this intuition. Refer to Guillemin and Sternberg (2013) for further details.

B Proof of the Sharpness

This section completes the proof of our sharpness result by proving Lemma 3.2 and Corollary 3.3.
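Before the formal preliminaries, we note that the Chebyshev-polynomial identities used in Lemma B.2 below are easy to spot-check numerically. For instance, the following sketch (ours; the values $n = 11$ and $t = 0.7$ are arbitrary test choices) verifies the closed form of $|T_{n+1}(t\sqrt{-1})|$:

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

n, t = 11, 0.7
m = n + 1
s = np.sqrt(t**2 + 1)

# T_m evaluated at the purely imaginary point t*sqrt(-1),
# using the Chebyshev coefficient vector (0, ..., 0, 1) of length m + 1.
Tm = cheb.chebval(1j * t, [0] * m + [1])

# Claimed closed form: |T_m(it)| = {(t + s)^m + (t - s)^m} / 2, with s = sqrt(t^2 + 1).
closed_form = ((t + s) ** m + (t - s) ** m) / 2
assert abs(abs(Tm) - abs(closed_form)) < 1e-9 * abs(closed_form)
```

Since $(t + s)(s - t) = 1$, the second summand is $(\pm 1)^m (t + s)^{-m}$, so the closed form grows like $(t + s)^{m}/2$; this growth rate is what drives the Vandermonde bound in statement 3.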
B.1 Preliminaries: Chebyshev Polynomials and Lemmas

Lemma B.1. Suppose $|\Delta_k| \leq 2 b^k$ holds for all $k \in \mathbb{N}$. Then, there exists $N \in \mathbb{N}$ such that
\[
n \geq N \vee (2.77) b^2 \implies \sum_{k = n+1}^{\infty} \frac{\Delta_k^2}{k!} \leq \Bigl( \frac{e b^2}{n + 1} \Bigr)^{n+1}.
\]

Proof. According to Stirling's formula, there exists $N \in \mathbb{N}$, not depending on $b$, such that, if $n \geq N$,
\[
\frac{\Delta_{n+\ell}^2}{(n+\ell)!} \leq \frac{4 b^{2(n+\ell)}}{(n+\ell)!} \leq \Bigl( 1 - \frac{e}{2.77} \Bigr) \Bigl( \frac{e b^2}{n + \ell} \Bigr)^{n+\ell}
\]
holds for $\ell \geq 1$. If we assume further that $n \geq (2.77) b^2$, then
\[
\sum_{\ell = 1}^{\infty} \Bigl( 1 - \frac{e}{2.77} \Bigr) \Bigl( \frac{e b^2}{n + \ell} \Bigr)^{n+\ell} \leq \sum_{\ell = 1}^{\infty} \Bigl( 1 - \frac{e}{2.77} \Bigr) \Bigl( \frac{e}{2.77} \Bigr)^{\ell - 1} \Bigl( \frac{e b^2}{n + 1} \Bigr)^{n+1} = \Bigl( \frac{e b^2}{n + 1} \Bigr)^{n+1}.
\]

Lemma B.2 (Chebyshev polynomials of the first kind). Let $n \geq 11$ and let $\theta_j = \cos\bigl( \frac{2j + 1}{2n + 2} \pi \bigr)$, $j = 0, \ldots, n$, be the zeros of the Chebyshev polynomial of the first kind, $T_{n+1}(x)$, of degree $n + 1$. Then,

1. $|T_{n+1}(t \sqrt{-1})| = \bigl\{ (t + \sqrt{t^2 + 1})^{n+1} + (t - \sqrt{t^2 + 1})^{n+1} \bigr\} / 2$ holds for $t > 0$.
2. $z_n = \sqrt{-n/2.77}$ satisfies $\frac{1}{2^n |z_n|^{n+1}} |T_{n+1}(z_n)| < 2$.
3. $\|V_{n+1}^{-1}\|_{\infty} \leq \frac{(1 + \sqrt{2})^{n+1}}{n + 1}$, where
\[
V_{n+1} = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ \theta_0^n & \cdots & \theta_n^n \end{pmatrix}
\]
is the $(n+1) \times (n+1)$ Vandermonde matrix involving $\theta_0, \ldots, \theta_n$.

Proof. First, applying de Moivre's formula to the definition (7) gives
\[
T_{n+1}(x) = \frac{1}{2} \bigl( \zeta^{n+1} + \zeta^{-(n+1)} \bigr), \qquad \text{where } x \in \mathbb{C} \text{ and } \zeta = x \pm \sqrt{x^2 - 1}.
\]
(No matter which branch is chosen for the square root, the two summands are reciprocal to each other.) Second, if $z_n = \sqrt{-n/2.77}$, then
\[
\frac{1}{2^n |z_n|^{n+1}} |T_{n+1}(z_n)| = \biggl( \frac{1 + \sqrt{1 + 2.77/n}}{2} \biggr)^{n+1} + \biggl( \frac{1 - \sqrt{1 + 2.77/n}}{2} \biggr)^{n+1} \to \exp\Bigl( \frac{2.77}{4} \Bigr) < 2,
\]
as $n \to \infty$. (A more careful computation shows that $n \geq 11$ is sufficient.) Finally, according to Example 6.2 of Gautschi (1974), we have
\[
\|V_{n+1}^{-1}\|_{\infty} \leq \frac{3^{3/4}}{2 (n + 1)} |T_{n+1}(\sqrt{-1})| \leq \frac{(1 + \sqrt{2})^{n+1}}{n + 1}.
\]

Lemma B.3. Let $n$ be a positive odd number. Then,
\[
\max\Bigl\{ \frac{(n/2.77)^{\ell}}{(2\ell)!!} : \ell = 0, \ldots
, \frac{n - 1}{2} \Bigr\} \leq \exp\Bigl( \frac{n}{5.54} \Bigr),
\]
where $(2\ell)!!$ denotes a double factorial.

Proof. For $\ell \geq 1$, we have $(2\ell)!! = 2^{\ell} \ell!$ and
\[
\frac{(n/2.77)^{\ell}}{2^{\ell} \ell!} \leq \Bigl( \frac{e n}{5.54 \, \ell} \Bigr)^{\ell} \leq \exp\Bigl( \frac{n}{5.54} \Bigr).
\]
The first inequality holds by Stirling's formula, and the second one is obtained by optimizing with respect to $\ell$ over the positive reals; the optimal value is attained at $\ell = n / 5.54$.

B.2 Proofs

We now proceed to prove Lemma 3.2 and Corollary 3.3.

Proof of Lemma 3.2. We solve the following linear system:
\[
\begin{pmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & a^n \end{pmatrix}
\begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ \theta_0^n & \cdots & \theta_n^n \end{pmatrix}
\begin{pmatrix} w_0 \\ \vdots \\ w_n \end{pmatrix}
=
\begin{pmatrix} \Delta_0 \\ \vdots \\ \Delta_n \end{pmatrix}.
\]
By the third statement of Lemma B.2, we have $|w_j| \leq \|V_{n+1}^{-1}\|_{\infty} a^{-n} \Delta_n \leq \frac{1}{n+1}$ for all $j$. Indeed, $\pi_n$ and $\eta_n$ are valid probability measures supported on $[-M, M]$, since $\sum_{j=0}^{n} w_j = \Delta_0 = 0$. We also have
\[
\Delta_k = \sum_{j=0}^{n} w_j (a \theta_j)^k = \int \theta^k \, d(\pi_n - \eta_n)(\theta),
\]
for $k = 0, 1, \ldots, n$. Lemma B.3 verifies that (11) holds for all $0 \leq k \leq n$. We will now use mathematical induction to show that, in fact, (11) holds for all $k \geq 0$. Let $K \geq n$ and assume the induction hypothesis (11) to be true for all $k \leq K$. Recall that
\[
T_{n+1}(x) = 2^n \bigl( x^{n+1} - \sigma_2 x^{n-1} + \sigma_4 x^{n-3} - \cdots + (-1)^{(n+1)/2} \sigma_{n+1} \bigr),
\]
where $\sigma_m$ denotes the $m$-th elementary symmetric function of the zeros $\theta_0, \ldots, \theta_n$. Since $T_{n+1}(\theta_j) = 0$,
\[
(a \theta_j)^{K+1} = \sigma_2 a^2 (a \theta_j)^{K-1} - \sigma_4 a^4 (a \theta_j)^{K-3} + \cdots + (-1)^{(n-1)/2} \sigma_{n+1} a^{n+1} (a \theta_j)^{K-n},
\]
\begin{align*}
|\Delta_{K+1}| = \biggl| \sum_{j=0}^{n} w_j (a \theta_j)^{K+1} \biggr|
&\leq \sigma_2 a^2 |\Delta_{K-1}| + \sigma_4 a^4 |\Delta_{K-3}| + \cdots + \sigma_{n+1} a^{n+1} |\Delta_{K-n}| \\
&\leq \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) b^{K+1-n} \bigl( \sigma_2 (a/b)^2 + \sigma_4 (a/b)^4 + \cdots \bigr) \\
&= \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) b^{K+1-n} \biggl( \frac{a^{n+1}}{2^n b^{n+1}} \Bigl| T_{n+1}\Bigl( \frac{b}{a} \sqrt{-1} \Bigr) \Bigr| - 1 \biggr) \\
&\leq \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) b^{K+1-n}.
\end{align*}
The last inequality follows from the second statement of Lemma B.2. We have shown that the induction hypothesis (11) is also true for $k = K + 1$. Thus, (11) is true for all $k \geq 0$.

Now, we proceed to prove the very last statement. In view of Lemma B.1, there exists $N \in \mathbb{N}$, not depending on $a$ or $b$, such that if $n \geq N$, then
\begin{align*}
\|r_n\|_{L^2(\phi)} &\leq \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) b^{-n} \Bigl( \frac{e b^2}{n + 1} \Bigr)^{(n+1)/2} \\
&\leq \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) \sqrt{\frac{n}{2.77}} \Bigl( \frac{e}{n + 1} \Bigr)^{(n+1)/2} \\
&\leq \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \exp\Bigl( \frac{n}{5.54} \Bigr) \Bigl( \frac{e}{n} \Bigr)^{n/2}.
\end{align*}
Lastly, observing that
\[
q_n(x) = \sum_{\ell = 0}^{(n-1)/2} \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \frac{h_{n - 2\ell}(x)}{(2\ell)!! \sqrt{(n - 2\ell)!}} = \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \frac{x^n}{n!}
\]
gives the following explicit formulas for the $L^1(\phi)$ and $L^2(\phi)$ norms of $q_n$:
\begin{align*}
\|q_n\|_{L^1(\phi)} &= \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \frac{2^{n/2} \pi^{-1/2} \Gamma\bigl( \frac{n+1}{2} \bigr)}{n!} = \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} (\pi n)^{-1/2} \Bigl( \frac{e}{n} \Bigr)^{n/2} \Bigl( 1 + O\Bigl( \frac{1}{n} \Bigr) \Bigr), \\
\|q_n\|_{L^2(\phi)} &= \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} \frac{2^{n/2} \pi^{-1/4} \Gamma^{1/2}\bigl( n + \frac{1}{2} \bigr)}{n!} = \bigl\{ a (\sqrt{2} - 1) \bigr\}^{n+1} (\pi n)^{-1/2} \Bigl( \frac{e}{n} \Bigr)^{n/2} 2^{\frac{n}{2} - \frac{1}{4}} \Bigl( 1 + O\Bigl( \frac{1}{n} \Bigr) \Bigr).
\end{align*}
Comparing these asymptotics, based on Stirling's formula, shows (13). In particular, both $\|q_n\|_{L^1(\phi)}$ and $\|q_n\|_{L^2(\phi)}$ decay at a hyper-exponential rate of $\exp(-n \log n / 2)$, and the tail norm $\|r_n\|_{L^2(\phi)}$ cannot deviate from $\|q_n\|_{L^1(\phi)}$ or $\|q_n\|_{L^2(\phi)}$ faster than an exponential rate in $n$. We have (15) in conclusion.

Proof of Corollary 3.3. The equality for the TV distance is straightforward. In view of the above Lemma 3.2, let $n$ be a large enough odd number. By construction, we have
\begin{align*}
f_{\pi_n^{(1)}}(x) &= (1 - \lambda_n) \phi(x) + \sum_{j=0}^{n} \Bigl( \frac{\lambda_n}{n + 1} + \lambda_n w_j \Bigr) \phi(x - a \theta_j), \\
f_{\eta_n^{(1)}}(x) &= (1 - \lambda_n) \phi(x) + \sum_{j=0}^{n} \frac{\lambda_n}{n + 1} \phi(x - a \theta_j).
\end{align*}
Recall from the lemma that $|\theta_j| \leq 1$ for all $j$ and that $0 < a \leq 1$. Also, recall the definition (16) of $R_n$ and $\lambda_n$.
Observe for all $x \in [-R_n, R_n]$ and all $j$ that
\[
\frac{\phi(x - a\theta_j)}{\phi(x)} = \exp\Bigl(a\theta_j x - \frac{1}{2}a^2\theta_j^2\Bigr) \le \exp(|a\theta_j| R_n) \le \exp(R_n),
\]
and hence that
\[
f_{\eta_n^{(1)}}(x) \le \bigl(1 - \lambda_n + \lambda_n \exp(R_n)\bigr)\phi(x) \le 2\phi(x).
\]
Lastly, recall the definition (34) of $E_{n,d}$. Note that $E_{n,1} = 2n+1$ and that $R_n = \sqrt{8n+4} = \sqrt{2\kappa E_{n,1}}$ holds for $\kappa = 2$. Therefore, we have
\begin{align*}
2\lambda_n^{-2}\,\chi^2\bigl(f_{\pi_n^{(1)}} \,\|\, f_{\eta_n^{(1)}}\bigr)
&\ge 2\lambda_n^{-2}\int_{-R_n}^{R_n} \frac{\bigl(f_{\pi_n^{(1)}} - f_{\eta_n^{(1)}}\bigr)^2}{f_{\eta_n^{(1)}}}
\ge \int_{-R_n}^{R_n} \frac{(f_{\pi_n} - f_{\eta_n})^2}{\phi} \\
&= \|q_n + r_n\|_{L^2([-R_n,R_n],\phi)}^2 \\
&\ge \frac{1}{2}\|q_n\|_{L^2([-R_n,R_n],\phi)}^2 - \|r_n\|_{L^2(\phi)}^2 && (\because\ 2(a^2+b^2) \ge (a-b)^2) \\
&\ge \frac{1}{4}\|q_n\|_{L^2(\phi)}^2 - \|r_n\|_{L^2(\phi)}^2 && \text{(by Proposition A.7 with } \kappa = 2\text{)} \\
&\ge \frac{1}{8}\|q_n\|_{L^2(\phi)}^2, && \text{(by inequality (14))}
\end{align*}
provided that $n$ is large enough.

C Proofs of the Applications

In this section, we prove Theorem 4.3, Proposition 4.4, and Theorems 4.5, 4.6, and 4.7.

C.1 Preliminaries: Yatracos' Construction and Lemmas

We first recall Yatracos' scheme (Yatracos, 1985) for robust density estimation in total variation. Consider an $\eta$-covering $\{Q_1, \ldots, Q_N\}$ of $\mathcal{P}_{M,d}$ in total variation. We define the Yatracos class $\mathcal{A}$ by
\[
\mathcal{A} := \{A_{ij} : i \neq j \in [N]\}, \qquad A_{ij} := \Bigl\{x : \frac{dQ_i}{d(Q_i + Q_j)}(x) \ge \frac{dQ_j}{d(Q_i + Q_j)}(x)\Bigr\},
\]
so that $|\mathcal{A}| \le N^2$. Given the class $\mathcal{A}$, we define a pseudo-distance $\mathrm{dist}$ as follows:
\[
\mathrm{dist}(P_1, P_2) := \sup_{A \in \mathcal{A}} |P_1(A) - P_2(A)|.
\]
Then $\mathrm{dist}$ satisfies the triangle inequality. Moreover, it approximates the total variation on $\mathcal{P}_{M,d}$, in the sense that
\[
\mathrm{dist}(Q_i, Q_j) = \mathrm{TV}(Q_i, Q_j), \qquad \mathrm{dist}(P_1, P_2) \le \mathrm{TV}(P_1, P_2) \le \mathrm{dist}(P_1, P_2) + 4\eta, \quad \forall\, P_1, P_2 \in \mathcal{P}_{M,d}.
\]
Given i.i.d. observations $X_1, \ldots, X_n$ as in (22), we define the Yatracos estimator $\hat{P}$ by
\[
\hat{P} := \operatorname*{argmin}_{P' \in \mathcal{P}_{M,d}} \mathrm{dist}\bigl(P', \hat{P}_n\bigr), \tag{55}
\]
where $\hat{P}_n := \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}$ is the empirical distribution.
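The pseudo-distance $\mathrm{dist}$ is easy to experiment with when the covering and the candidate distributions live on a common finite grid. A minimal sketch (the covering and its numbers are hypothetical, and `yatracos_distance` is our own illustration, not code from the paper):

```python
import itertools
import numpy as np

def yatracos_distance(P1, P2, covering):
    """dist(P1, P2) = sup over the Yatracos class of |P1(A) - P2(A)|, for
    distributions given as probability vectors on a common finite grid."""
    best = 0.0
    for Qi, Qj in itertools.permutations(covering, 2):
        A = Qi >= Qj  # the set A_ij = {x : dQi/d(Qi+Qj) >= dQj/d(Qi+Qj)}
        best = max(best, abs(P1[A].sum() - P2[A].sum()))
    return best

# Toy covering of three distributions on five points (hypothetical numbers).
covering = [np.array([.5, .2, .1, .1, .1]),
            np.array([.1, .1, .1, .2, .5]),
            np.array([.2, .2, .2, .2, .2])]
Q1, Q2 = covering[0], covering[1]
tv = 0.5 * np.abs(Q1 - Q2).sum()
# For members of the covering, dist recovers the TV distance exactly,
# since A_12 is precisely the set achieving the supremum in TV.
assert abs(yatracos_distance(Q1, Q2, covering) - tv) < 1e-12
```

In general $\mathrm{dist} \le \mathrm{TV}$ always holds, and equality on the covering is what makes the scheme work.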
Note that Yatracos' scheme works even if $P = (1-\epsilon)P_{f_\pi} + \epsilon Q$ lies outside $\mathcal{P}_{M,d}$. In particular, if we denote by $\hat{f}$ the density of $\hat{P}$, we have
\[
\mathrm{TV}\bigl(f_\pi, \hat{f}\bigr) \le 3\eta + 2\,\mathrm{dist}\bigl(P, \hat{P}_n\bigr) + 3\inf_{P' \in \mathcal{P}_{M,d}} \mathrm{TV}(P, P'). \tag{56}
\]
See Section 32.3 of Polyanskiy and Wu (2025) for a recent review of the Yatracos estimator. As a consequence, we can derive the minimax upper bound in Proposition 4.4, noting that $\log N \lesssim \log^{d+1}(1/\eta)$ by Lemma C.1; it only remains to choose an appropriate $\eta$ for (56). See Appendix C.2 for the details.

Lemma C.1 (TV entropy bound in $d$ dimensions). Recall the definition of covering number from Definition 4.1. We have
\[
\log N_{\mathrm{TV}}(\mathcal{P}_{M,d}, \epsilon) \lesssim \log^{d+1}\Bigl(\frac{1}{\epsilon}\Bigr).
\]

Proof. For the one-dimensional case ($d = 1$), the entropy bound is due to Ghosal and Van Der Vaart (2001); recent works extended this result to arbitrary dimensions (Saha and Guntuboyina, 2020; Ma et al., 2025). Let $\mathcal{P}_m$ be the collection of $m$-atomic Gaussian mixtures in $\mathcal{P}_{M,d}$ and define
\[
m_\star := \inf\Bigl\{m \in \mathbb{N} : \sup_{P \in \mathcal{P}_{M,d}} \inf_{P_m \in \mathcal{P}_m} \mathrm{TV}(P, P_m) \le \frac{\epsilon}{2}\Bigr\}.
\]
Then Proposition 5 of Ma et al. (2025) shows $m_\star \lesssim \log^d(1/\epsilon)$. On the other hand, the parametric entropy bound for finite mixtures shows
\[
\log N_{\mathrm{TV}}\Bigl(\mathcal{P}_{m_\star}, \frac{\epsilon}{2}\Bigr) \lesssim m_\star\, d \log\Bigl(\frac{1}{\epsilon}\Bigr).
\]
Combining these results with the triangle inequality concludes the proof.

Lemma C.2 (Chen et al. (2018)). Suppose $P_1$ and $P_2$ are probability measures such that $\mathrm{TV}(P_1, P_2) \le \frac{\epsilon}{1-\epsilon}$. Then there exist two probability measures $Q_1$ and $Q_2$ such that
\[
(1-\epsilon)P_1 + \epsilon Q_1 = (1-\epsilon)P_2 + \epsilon Q_2.
\]

C.2 Proofs

We proceed to prove Theorem 4.3, Proposition 4.4, and Theorems 4.5, 4.6, and 4.7 in this section.

Proof of Theorem 4.3. First, given an estimator $\hat{P}$, let $\tilde{P}$ be the projection of $\hat{P}$ onto $\mathcal{P}$ under the TV distance. Then, for every $P \in \mathcal{P}$, we have
\[
\mathrm{TV}\bigl(P, \tilde{P}\bigr) \le \mathrm{TV}\bigl(P, \hat{P}\bigr) + \mathrm{TV}\bigl(\hat{P}, \tilde{P}\bigr) \le 2\,\mathrm{TV}\bigl(P, \hat{P}\bigr).
\]
This allows $\hat{P}$ to be restricted to $\mathcal{P}$ up to universal constants. Second, the upper bound follows immediately from inequality (1) and Proposition 4.2. Third, applying Corollary 2.4 gives
\[
\mathbb{P}\Bigl[\mathrm{TV}\bigl(P, \hat{P}\bigr) \ge J^{-1}\Bigl(\frac{\epsilon_n}{4}\Bigr)\Bigr] \ge \mathbb{P}\Bigl[H\bigl(P, \hat{P}\bigr) \ge \frac{\epsilon_n}{4}\Bigr] \ge \frac{1}{2}, \tag{57}
\]
where we define $\alpha(t)$ as in (4) and $J(t)$ as
\[
J(t) := C_0\, t \vee t^{1-\alpha(t)}, \tag{58}
\]
for $t > 0$. Note that the inverse $J^{-1}$ is well defined in the regime where $n \to \infty$, as $J$ is strictly increasing on $(0, t_0)$ for some $t_0 > 0$. The last inequality in (57) is due to Fano's inequality, used in the proof of Corollary 11 in Jia et al. (2023). We conclude that
\[
\inf_{\hat{P} \in \mathcal{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Bigl[\mathrm{TV}^2\bigl(P, \hat{P}\bigr)\Bigr] \gtrsim \Bigl(J^{-1}\Bigl(\frac{\epsilon_n}{4}\Bigr)\Bigr)^2 \gtrsim \epsilon_n^{2\left(1 + \frac{2+\delta}{\log(\log(1/\epsilon_n) \vee e)}\right)}.
\]

Proof of Proposition 4.4. Observe that
\[
\inf_{P' \in \mathcal{P}_{M,d}} \mathrm{TV}(P, P') \le \mathrm{TV}(P, P_{f_\pi}) \le \epsilon.
\]
Hence, by (56), the standard Yatracos construction (55) leads to a proper estimator $\hat{f}$ satisfying
\[
\mathrm{TV}\bigl(f_\pi, \hat{f}\bigr) \le 3\epsilon + 3\eta + 2\,\mathrm{dist}\bigl(P, \hat{P}_n\bigr). \tag{59}
\]
Applying the Hoeffding bound and a union bound, we have
\[
\mathbb{P}\Bigl[\mathrm{dist}\bigl(P, \hat{P}_n\bigr) \ge s\Bigr] \le 1 \wedge 2|\mathcal{A}| \exp\Bigl(-\frac{ns^2}{2}\Bigr), \tag{60}
\]
\[
\mathbb{E}_P\Bigl[\mathrm{dist}^2\bigl(P, \hat{P}_n\bigr)\Bigr] \le \frac{2\bigl(1 + \log(2|\mathcal{A}|)\bigr)}{n}.
\]
Lemma C.1 implies $\log|\mathcal{A}| \le 2\log N_{\mathrm{TV}}(\mathcal{P}_{M,d}, \eta) \lesssim \log^{d+1}(1/\eta)$. Accordingly, we choose the optimal $\eta \asymp \log^{d/2}(n)/\sqrt{n}$ to conclude the proof.

Proof of Theorem 4.5. Let $\hat{f}$ be the proper estimator from the proof of Proposition 4.4, and define $J(\cdot)$ as in (58). Observe that $J(\cdot)$ is subadditive, i.e., $J(s+t) \le J(s) + J(t)$ holds for all $s, t > 0$, provided that $C_0$ is not too small, depending only on $\delta > 0$. Thus, applying Corollary 2.4 to (59) gives
\[
H\bigl(f_\pi, \hat{f}\bigr) \le 3J(\epsilon) + 3J(\eta) + 2J\bigl(\mathrm{dist}\bigl(P, \hat{P}_n\bigr)\bigr).
\]
Hence, the choice of $\eta$ in the proof of Proposition 4.4 proves the desired bound (23).

Proof of Theorem 4.6. The minimax lower bound in $\epsilon$ can be obtained from the standard two-point method.
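The monotonicity and subadditivity of $J$ near zero, used in the proof of Theorem 4.5 above, can be sanity-checked numerically. Since the exact $\alpha$ from (4) is not reproduced here, the sketch below uses the surrogate $\alpha(t) = 1/\log(\log(1/t) \vee e)$, matching its stated order, and the value $C_0 = 10$ is an arbitrary choice:

```python
import math

# Numerical sanity check that J(t) = C0*t v t^(1-alpha(t)) is increasing and
# subadditive near zero, with a surrogate alpha (an assumption, not the paper's (4)).
C0 = 10.0

def alpha(t):
    return 1.0 / math.log(max(math.log(1.0 / t), math.e))

def J(t):
    return max(C0 * t, t ** (1.0 - alpha(t)))

# Logarithmic grid from 1e-2 down to about 1e-20.
grid = [10 ** (-k / 4) for k in range(8, 80)]

# Increasing: grid is sorted decreasing, so each s is smaller than the next t.
assert all(J(s) <= J(t) for s, t in zip(grid[1:], grid))
# Subadditive on the range where t^(1-alpha(t))/t is decreasing (t below e^{-e}).
assert all(J(s + t) <= J(s) + J(t) + 1e-12
           for s in grid for t in grid if s + t < 0.01)
```

The subadditivity follows here because $t^{1-\alpha(t)}/t$ is decreasing for small $t$, making $t \mapsto t^{1-\alpha(t)}$ star-shaped; the maximum of two subadditive increasing functions is again subadditive.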
Our sharpness result, Theorem 3.1, shows that there exist two "one-dimensional" probability measures $\pi_\star$ and $\eta_\star$, supported on the bounded interval $[-M, M]$, such that $\mathrm{TV}(f_{\pi_\star}, f_{\eta_\star}) \le \epsilon \le \frac{\epsilon}{1-\epsilon}$ and
\[
H(f_{\pi_\star}, f_{\eta_\star}) \gtrsim \epsilon^{1 - \frac{0.33}{\log(\log(1/\epsilon) \vee e)}}.
\]
Note that we can also construct $d$-dimensional probability measures $\pi$ and $\eta$ with the same property, since $\mathrm{TV}(f_\pi, f_\eta) = \mathrm{TV}(f_{\pi_\star}, f_{\eta_\star})$ and $H(f_\pi, f_\eta) = H(f_{\pi_\star}, f_{\eta_\star})$ for
\[
\pi = \pi_\star \otimes \delta_0^{\otimes(d-1)} = \pi_\star \otimes \delta_0 \otimes \cdots \otimes \delta_0, \qquad \eta = \eta_\star \otimes \delta_0^{\otimes(d-1)} = \eta_\star \otimes \delta_0 \otimes \cdots \otimes \delta_0,
\]
where $\delta_0$ denotes the point mass at zero and $\otimes$ the product measure. Thus, it follows from Lemma C.2 and the same two-point argument as in Chen et al. (2018) that
\[
\inf_{\hat{f}} \sup_{\pi, Q} \mathbb{E}\Bigl[H^2\bigl(f_\pi, \hat{f}\bigr)\Bigr] \gtrsim \epsilon^{2\left(1 - \frac{0.33}{\log(\log(1/\epsilon) \vee e)}\right)}.
\]

Proof of Theorem 4.7. This proof crucially relies on the proof of Theorem 3.5 in Saha and Guntuboyina (2020). Our proof, however, differs from theirs in the choice of $\rho$: they take $\rho = (2\pi)^{-d/2} n^{-1}$, whereas we use
\[
\rho = (2\pi)^{-d/2}\bigl(\mathcal{E}^2(\epsilon, n) \wedge e^{-2}\bigr),
\]
where we define $\mathcal{E}(\epsilon, n)$ as in (24). Recall that the oracle Bayes estimator $\hat{\theta}_\star(\cdot)$ is given by (26), and consider the following decomposition:
\[
\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_\rho(X) - \hat{\theta}_\star(X)\bigr\|^2 \le 2\,\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_\rho(X) - \hat{\theta}_{\star\rho}(X)\bigr\|^2 + 2\,\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_{\star\rho}(X) - \hat{\theta}_\star(X)\bigr\|^2, \tag{61}
\]
where we define
\[
\hat{\theta}_{\star\rho}(X) := X + \frac{\nabla f_\pi(X)}{f_\pi(X) \vee \rho}.
\]
The first term above is bounded from above as follows, using Theorem E.1 in Saha and Guntuboyina (2020), which is a generalization of Theorem 3 in Jiang and Zhang (2009):
\begin{align*}
\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_\rho(X) - \hat{\theta}_{\star\rho}(X)\bigr\|^2
&= \int \Bigl\|\frac{\nabla \hat{f}(x)}{\hat{f}(x) \vee \rho} - \frac{\nabla f_\pi(x)}{f_\pi(x) \vee \rho}\Bigr\|^2 f_\pi(x)\,dx \\
&\lesssim H^2\bigl(f_\pi, \hat{f}\bigr)\Bigl(\log\frac{1}{H\bigl(f_\pi, \hat{f}\bigr)} \vee \log^3\Bigl(\frac{1}{\mathcal{E}(\epsilon, n)} \vee e\Bigr)\Bigr).
\end{align*}
For the second term in (61), we have
\begin{align*}
\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_{\star\rho}(X) - \hat{\theta}_\star(X)\bigr\|^2
&= \int \Bigl\|\frac{\nabla f_\pi(x)}{f_\pi(x) \vee \rho} - \frac{\nabla f_\pi(x)}{f_\pi(x)}\Bigr\|^2 f_\pi(x)\,dx \\
&= \int \Bigl(1 - \frac{f_\pi(x)}{f_\pi(x) \vee \rho}\Bigr)^2 \frac{\|\nabla f_\pi(x)\|^2}{f_\pi(x)}\,dx \\
&\lesssim \mathcal{E}^2(\epsilon, n)\,\log^d\Bigl(\frac{1}{\mathcal{E}(\epsilon, n)} \vee e\Bigr).
\end{align*}
The last inequality is due to Lemma 4.3 in Saha and Guntuboyina (2020). Recall from our Theorem 4.5 that
\[
\mathbb{E}\Bigl[H^2\bigl(f_\pi, \hat{f}\bigr)\Bigr] \lesssim \mathcal{E}^2(\epsilon, n).
\]
For brevity, write $H := H\bigl(f_\pi, \hat{f}\bigr)$ and $\mathcal{E} := \mathcal{E}(\epsilon, n)$ for the remainder of the proof. Then,
\begin{align*}
\mathbb{E}\Bigl[H^2 \log\frac{1}{H}\Bigr]
&= \mathbb{E}\Bigl[H^2 \log\frac{1}{H}\,\mathbb{1}\{H \le \mathcal{E} \le e^{-1}\}\Bigr] + \mathbb{E}\Bigl[H^2 \log\frac{1}{H}\,\mathbb{1}\{H \le \mathcal{E}\}\,\mathbb{1}\{\mathcal{E} > e^{-1}\}\Bigr] + \mathbb{E}\Bigl[H^2 \log\frac{1}{H}\,\mathbb{1}\{H > \mathcal{E}\}\Bigr] \\
&\le \mathcal{E}^2 \log\Bigl(\frac{1}{\mathcal{E}} \vee e\Bigr) + \mathbb{E}\Bigl[H^2 \log\frac{1}{H}\,\mathbb{1}\{\mathcal{E} > e^{-1}\}\Bigr] + \mathbb{E}\bigl[H^2\bigr]\,\mathbb{P}\bigl(H^2 > \mathcal{E}^2\bigr)\log\Bigl(\frac{1}{\mathcal{E}} \vee e\Bigr) \\
&\lesssim \mathcal{E}^2 \log\Bigl(\frac{1}{\mathcal{E}} \vee e\Bigr). \qquad \text{(by Markov's inequality)}
\end{align*}
Taking everything into account, we conclude that
\[
\mathbb{E}\Bigl[\mathbb{E}_{X \sim f_\pi}\bigl\|\hat{\theta}_\rho(X) - \hat{\theta}_\star(X)\bigr\|^2\Bigr] \lesssim \mathcal{E}^2 \log^{3 \vee d}\Bigl(\frac{1}{\mathcal{E}} \vee e\Bigr) \lesssim \epsilon^{2\left(1 - \frac{2+2\delta}{\log(\log(1/\epsilon) \vee e)}\right)} + n^{-(1 - o_d(1))}.
\]
Note that the extra logarithmic factors are absorbed into the slack parameter $\delta > 0$ and the $o_d(1)$ term, respectively. Since the choice of $\delta > 0$ is arbitrary, replacing $\delta$ with $\delta/2$ proves the bound (28).
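The truncated rule analyzed above is simple to write down in one dimension for a known discrete mixing measure, i.e., the population version $\hat{\theta}_{\star\rho}(x) = x + f'(x)/(f(x) \vee \rho)$. A sketch (the function name, prior, and $\rho$ values below are ours, for illustration only); with $\rho = 0$ it reduces to the oracle Bayes rule, which for the two-point prior $\frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$ is $\tanh(x)$:

```python
import math

def tweedie_truncated(x, atoms, weights, rho):
    """theta_rho(x) = x + f'(x) / (f(x) v rho) for the mixing measure
    sum_j w_j * delta_{theta_j}; rho = 0 gives the oracle Bayes rule."""
    phi = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
    f = sum(w * phi(x - a) for a, w in zip(atoms, weights))
    fprime = sum(w * (a - x) * phi(x - a) for a, w in zip(atoms, weights))
    return x + fprime / max(f, rho)

# Two-point prior (1/2) delta_{-1} + (1/2) delta_{1}: the Bayes rule is tanh(x).
for x in (-2.0, -0.5, 0.0, 0.7, 3.0):
    assert abs(tweedie_truncated(x, [-1, 1], [.5, .5], rho=0.0) - math.tanh(x)) < 1e-12
# For moderate x the density sits far above a small rho, so truncation is inactive.
assert tweedie_truncated(0.7, [-1, 1], [.5, .5], rho=1e-6) == \
       tweedie_truncated(0.7, [-1, 1], [.5, .5], rho=0.0)
```

The truncation $f(x) \vee \rho$ only bites where the density is below $\rho$, which is exactly the region controlled by the second term of (61).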
