Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations

Kirthevasan Kandasamy (kandasamy@cs.cmu.edu), Akshay Krishnamurthy (akshaykr@cs.cmu.edu), Barnabás Póczos (bapoczos@cs.cmu.edu), School of Computer Science, Carnegie Mellon University
Larry Wasserman (larry@stat.cmu.edu), Department of Statistics, Carnegie Mellon University
James M. Robins (robins@hsph.harvard.edu), Department of Biostatistics, Harvard University

Abstract

We propose and analyze estimators for statistical functionals of one or more distributions under nonparametric assumptions. Our estimators are based on the theory of influence functions, which appear in the semiparametric statistics literature. We show that estimators based either on data-splitting or a leave-one-out technique enjoy fast rates of convergence and other favorable theoretical properties. We apply this framework to derive estimators for several popular information theoretic quantities, and via empirical evaluation, show the advantage of this approach over existing estimators.

1 Introduction

Entropies, divergences, and mutual informations are classical information-theoretic quantities that play fundamental roles in statistics, machine learning, and across the mathematical sciences. In addition to their use as analytical tools, they arise in a variety of applications including hypothesis testing, parameter estimation, feature selection, and optimal experimental design. In many of these applications, it is important to estimate these functionals from data so that they can be used in downstream algorithmic or scientific tasks. In this paper, we develop a recipe for estimating statistical functionals of one or more nonparametric distributions based on the notion of influence functions.

Entropy estimators are used in applications ranging from independent components analysis [Learned-Miller and John, 2003] and intrinsic dimension estimation [Carter et al., 2010] to several signal processing applications [Hero et al., 2002]. Divergence estimators are useful in statistical tasks such as two-sample testing. Recently they have also gained popularity as they are used to measure (dis)-similarity between objects that are modeled as distributions, in what is known as the "machine learning on distributions" framework [Dhillon et al., 2003; Póczos et al., 2011]. Mutual information estimators have been used in learning tree-structured Markov random fields [Liu et al., 2012], feature selection [Peng et al., 2005], clustering [Lewi et al., 2006] and neuron classification [Schneidman et al., 2002]. In the parametric setting, conditional divergence and conditional mutual information estimators are used for conditional two-sample testing or as building blocks for structure learning in graphical models. Nonparametric estimators for these quantities could potentially allow us to generalise several of these algorithms to the nonparametric domain. Our approach gives sample-efficient estimators for all these quantities (and many others), which often outperform the existing estimators both theoretically and empirically.

Our approach to estimating these functionals is based on post-hoc correction of a preliminary estimator using the Von Mises expansion [van der Vaart, 1998; Fernholz, 1983]. This idea has been used before in the semiparametric statistics literature [Birgé and Massart, 1995; Robins et al., 2009].
However, hitherto most studies are restricted to functionals of one distribution and have focused on a "data-split" approach which splits the samples for density estimation and functional estimation. While the data-split (DS) estimator is known to achieve the parametric convergence rate for sufficiently smooth densities [Birgé and Massart, 1995; Laurent, 1996], in practical settings splitting the data results in poor empirical performance.

In this paper we introduce the calculus of influence functions to the machine learning community and considerably expand existing results by proposing a "leave-one-out" (LOO) estimator which makes efficient use of the data and has better empirical performance than the DS technique. We also extend the framework of influence functions to functionals of multiple distributions and develop both DS and LOO estimators. The main contributions of this paper are:

1. We propose a LOO technique to estimate functionals of a single distribution with the same convergence rates as the DS estimator. However, the LOO estimator performs better empirically.
2. We extend the framework to functionals of multiple distributions and analyse their convergence. Under sufficient smoothness both DS and LOO estimators achieve the parametric rate, and the DS estimator has a limiting normal distribution.
3. We prove a lower bound for estimating functionals of multiple distributions. We use this to establish minimax optimality of the DS and LOO estimators under sufficient smoothness.
4. We use the approach to construct and implement estimators for various entropy, divergence and mutual information quantities and their conditional versions. A subset of these functionals is listed in Table 1. For several functionals, these are the only known estimators. Our software is publicly available at github.com/kirthevasank/if-estimators.
5. We compare our estimators against several other approaches in simulation. Despite the generality of our approach, our estimators are competitive with, and in many cases superior to, existing specialized approaches for specific functionals. We also demonstrate how our estimators can be used in machine learning applications via an image clustering task.

Our focus on information theoretic quantities is due to their relevance in machine learning applications, rather than a limitation of our approach. Indeed, our techniques apply to any smooth functional.

History: We provide a brief history of the post-hoc correction technique and influence functions, and defer a detailed discussion of other approaches to estimating functionals to Section 5. To our knowledge, the first paper using a post-hoc correction estimator was that of Bickel and Ritov [1988]. The line of work following this paper analyzed integral functionals of a single one dimensional density of the form $\int \nu(p)$ [Bickel and Ritov, 1988; Birgé and Massart, 1995; Laurent, 1996; Kerkyacharian and Picard, 1996]. A recent paper by Krishnamurthy et al. [2014] also extends this line to functionals of multiple densities, but only considers polynomial functionals of the form $\int p^\alpha q^\beta$ for densities $p$ and $q$. Moreover, all these works use data splitting. Our work builds on these by extending to a more general class of functionals, and we propose the superior LOO estimator.
A fundamental quantity in the design of our estimators is the influence function, which appears both in robust and semiparametric statistics. Indeed, our work is inspired by that of Robins et al. [2009] and Emery et al. [1998], who propose a (data-split) influence-function based estimator for functionals of a single distribution. Their analysis for nonparametric problems relies on ideas from semiparametric statistics: they define influence functions for parametric models and then analyze estimators by looking at all parametric submodels through the true parameter.

2 Preliminaries

Let $\mathcal{X}$ be a compact metric space equipped with a measure $\mu$, e.g. the Lebesgue measure. Let $P$ and $Q$ be measures over $\mathcal{X}$ that are absolutely continuous w.r.t. $\mu$. Let $p, q \in L_2(\mathcal{X})$ be their Radon-Nikodym derivatives with respect to $\mu$. We focus on estimating functionals of the form:

$$T(P) = T(p) = \phi\left(\int \nu(p)\, d\mu\right) \quad \text{or} \quad T(P, Q) = T(p, q) = \phi\left(\int \nu(p, q)\, d\mu\right), \qquad (1)$$

where $\phi, \nu$ are real valued Lipschitz functions that are twice differentiable. Our framework permits more general functionals (e.g. functionals based on the conditional densities), but we will focus on this form for ease of exposition.

To facilitate presentation of the main definitions, it is easiest to work with functionals of one distribution, $T(P)$. Define $\mathcal{M}$ to be the set of all measures that are absolutely continuous w.r.t. $\mu$ whose Radon-Nikodym derivatives belong to $L_2(\mathcal{X})$. Central to our development is the Von Mises expansion (VME), which is the distributional analog of the Taylor expansion. For this we introduce the Gâteaux derivative, which imposes a notion of differentiability in topological spaces. We then introduce the influence function.

Definition 1. The map $T': \mathcal{M} \to \mathbb{R}$, where $T'(H; P) = \frac{\partial T(P + tH)}{\partial t}\big|_{t=0}$, is called the Gâteaux derivative at $P$ if the derivative exists and is linear and continuous in $H$. We say $T$ is Gâteaux differentiable at $P$ if $T'$ exists at $P$.

Definition 2. Let $T$ be Gâteaux differentiable at $P$. A function $\psi(\cdot\,; P): \mathcal{X} \to \mathbb{R}$ which satisfies $T'(Q - P; P) = \int \psi(x; P)\, dQ(x)$ is the influence function of $T$ w.r.t. the distribution $P$.

The existence and uniqueness of the influence function is guaranteed by the Riesz representation theorem, since the domain of $T$ is a bijection of $L_2(\mathcal{X})$ and consequently a Hilbert space. The classical work of Fernholz [1983] defines the influence function in terms of the Gâteaux derivative by

$$\psi(x; P) = T'(\delta_x - P; P) = \frac{\partial T((1-t)P + t\delta_x)}{\partial t}\bigg|_{t=0}, \qquad (2)$$

where $\delta_x$ is the Dirac delta function at $x$. While our functionals are defined only on non-atomic distributions, we can still use (2) to compute the influence function; the function computed this way can be shown to satisfy Definition 2. Based on the above, the first order VME is

$$T(Q) = T(P) + T'(Q - P; P) + R_2(P, Q) = T(P) + \int \psi(x; P)\, dQ(x) + R_2(P, Q), \qquad (3)$$

where $R_2$ is the second order remainder. Gâteaux differentiability alone will not be sufficient for our purposes. In what follows, we will assign $Q \to F$ and $P \to \hat F$, where $F, \hat F$ are the true and estimated distributions. We would like to bound the remainder in terms of a distance between $F$ and $\hat F$. By taking the domain of $T$ to be only measures with continuous densities, we can control $R_2$ using the $L_2$ metric of the densities.
This essentially means that our functionals satisfy a stronger form of differentiability, called Fréchet differentiability [van der Vaart, 1998; Fernholz, 1983], in the $L_2$ metric. Consequently, we can write all derivatives in terms of the densities, and the VME reduces to a functional Taylor expansion on the densities (Lemmas 9, 10 in Appendix A):

$$T(q) = T(p) + \phi'\left(\int \nu(p)\right)\int (q - p)\,\nu'(p) + R_2(p, q) = T(p) + \int \psi(x; p)\, q(x)\, dx + O(\|p - q\|_2^2). \qquad (4)$$

This expansion will be the basis for our estimators.

These ideas generalize to functionals of multiple distributions and to settings where the functional involves quantities other than the density. A functional $T(P, Q)$ of two distributions has two Gâteaux derivatives, $T'_i(\cdot\,; P, Q)$ for $i = 1, 2$, formed by perturbing the $i$th argument with the other fixed. The influence functions $\psi_1, \psi_2$ satisfy, for all $P_1, P_2 \in \mathcal{M}$,

$$T'_1(Q_1 - P_1; P_1, P_2) = \frac{\partial T(P_1 + t(Q_1 - P_1), P_2)}{\partial t}\bigg|_{t=0} = \int \psi_1(u; P_1, P_2)\, dQ_1(u), \qquad (5)$$
$$T'_2(Q_2 - P_2; P_1, P_2) = \frac{\partial T(P_1, P_2 + t(Q_2 - P_2))}{\partial t}\bigg|_{t=0} = \int \psi_2(u; P_1, P_2)\, dQ_2(u).$$

The VME can be written as

$$T(q_1, q_2) = T(p_1, p_2) + \int \psi_1(x; p_1, p_2)\, q_1(x)\, dx + \int \psi_2(x; p_1, p_2)\, q_2(x)\, dx + O(\|p_1 - q_1\|_2^2) + O(\|p_2 - q_2\|_2^2). \qquad (6)$$

3 Estimating Functionals

First consider estimating a functional of a single distribution, $T(f) = \phi(\int \nu(f)\, d\mu)$, from samples $X_1^n \sim f$. Using the VME (4), Emery et al. [1998] and Robins et al. [2009] suggested a natural estimator. If we use half of the data $X_1^{n/2}$ to construct a density estimate $\hat f^{(1)} = \hat f^{(1)}(X_1^{n/2})$, then by (4):

$$T(f) - T(\hat f^{(1)}) = \int \psi(x; \hat f^{(1)})\, f(x)\, d\mu + O(\|f - \hat f^{(1)}\|_2^2).$$

Since the influence function does not depend on the unknown distribution $F$, the first term on the right hand side is simply the expectation of $\psi(X; \hat f^{(1)})$ under $F$. We use the second half of the data to estimate this expectation with its sample mean. This leads to the following preliminary estimator:

$$\hat T^{(1)}_{DS} = T(\hat f^{(1)}) + \frac{1}{n/2} \sum_{i = n/2 + 1}^{n} \psi(X_i; \hat f^{(1)}). \qquad (7)$$

We can similarly construct an estimator $\hat T^{(2)}_{DS}$ by using $X_{n/2+1}^{n}$ for density estimation and $X_1^{n/2}$ for averaging. Our final estimator is obtained via the average $\hat T_{DS} = (\hat T^{(1)}_{DS} + \hat T^{(2)}_{DS})/2$. In what follows, we shall refer to this estimator as the Data-Split (DS) estimator. The rate of convergence of this estimator is determined by the error in the VME, $O(\|f - \hat f^{(1)}\|_2^2)$, and the $n^{-1/2}$ rate for estimating an expectation. Lower bounds from the literature [Laurent, 1996; Birgé and Massart, 1995] confirm minimax optimality of the DS estimator when $f$ is sufficiently smooth. The data splitting trick is commonly used in several other works [Birgé and Massart, 1995; Laurent, 1996; Krishnamurthy et al., 2014]. The analysis of DS estimators is straightforward, as the rate directly follows from the Cauchy-Schwarz inequality. While in theory DS estimators enjoy good rates of convergence, from a practical standpoint the data splitting is unsatisfying, since using only half the data each for estimation and averaging invariably decreases the accuracy.
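As a concrete illustration of this recipe, consider the Shannon entropy $T(p) = -\int p \log p$ (so $\phi(t) = t$ and $\nu(u) = -u \log u$). Differentiating along the path $(1-t)p + t\delta_x$ as in (2) gives

$$\psi(x; p) = -\log p(x) - T(p),$$

so the correction term in (7) recenters the plug-in estimate: since $\frac{1}{n/2}\sum_{i} \psi(X_i; \hat f^{(1)}) = -\frac{1}{n/2}\sum_{i} \log \hat f^{(1)}(X_i) - T(\hat f^{(1)})$, the preliminary estimator collapses to

$$\hat T^{(1)}_{DS} = -\frac{1}{n/2} \sum_{i = n/2 + 1}^{n} \log \hat f^{(1)}(X_i),$$

i.e. the average negative log-density of the held-out half, which matches the Shannon entropy row of Table 1 below.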
As an alternative, we propose a Leave-One-Out (LOO) version of the above estimator which makes more efficient use of the data:

$$\hat T_{LOO} = \frac{1}{n} \sum_{i=1}^{n} \left[ T(\hat f_{-i}) + \psi(X_i; \hat f_{-i}) \right], \qquad (8)$$

where $\hat f_{-i}$ is the kernel density estimate using all the samples $X_1^n$ except for $X_i$. Theoretically, we prove that the LOO estimator achieves the same rate of convergence as the DS estimator, but empirically it performs much better.

We can extend this method to functionals of two distributions. Akin to the one distribution case, we propose the following DS and LOO versions:

$$\hat T^{(1)}_{DS} = T(\hat f^{(1)}, \hat g^{(1)}) + \frac{1}{n/2} \sum_{i = n/2 + 1}^{n} \psi_f(X_i; \hat f^{(1)}, \hat g^{(1)}) + \frac{1}{m/2} \sum_{j = m/2 + 1}^{m} \psi_g(Y_j; \hat f^{(1)}, \hat g^{(1)}), \qquad (9)$$

$$\hat T_{LOO} = \frac{1}{\max(n, m)} \sum_{i=1}^{\max(n, m)} \left[ T(\hat f_{-i}, \hat g_{-i}) + \psi_f(X_i; \hat f_{-i}, \hat g_{-i}) + \psi_g(Y_i; \hat f_{-i}, \hat g_{-i}) \right]. \qquad (10)$$

Here $\hat g^{(1)}, \hat g_{-i}$ are defined similarly to $\hat f^{(1)}, \hat f_{-i}$. For the DS estimator we swap the samples to compute $\hat T^{(2)}_{DS}$ and then average. For the LOO estimator, if $n > m$ we cycle through the points $Y_1^m$ until we have summed over all of $X_1^n$, or vice versa. Note that $\hat T_{LOO}$ is asymmetric when $n \neq m$. A seemingly natural alternative would be to sum over all $nm$ pairings of the $X_i$'s and $Y_j$'s. However, the latter approach is more computationally burdensome. Moreover, a straightforward modification of our analysis in Appendix D.2 shows that both estimators have the same rate of convergence if $n$ and $m$ are of the same order.

Examples: We demonstrate the generality of our framework by presenting estimators for several entropies, divergences and mutual informations, and their conditional versions, in Table 1. For several functionals in the table, these are the first estimators proposed. We hope that this table will serve as a good reference for practitioners. For several functionals (e.g. conditional and unconditional Rényi-α divergence, conditional Tsallis-α mutual information and more) the estimators are not listed only because the expressions are too long to fit into the table. Our software implements a total of 17 functionals, which include all the estimators in the table. In Appendix F we illustrate how to apply our framework to derive an estimator for any functional via an example.

As will be discussed in Section 5, when compared to other alternatives our technique has several favourable properties. Computationally, the complexity of our method is $O(n^2)$ when compared to $O(n^3)$ for some other methods, and for several functionals we do not require numeric integration. Additionally, unlike most other methods, we do not require any tuning of hyperparameters.
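For instance, the Shannon entropy row of Table 1 below translates into a few lines of code. The sketch here is our own illustration of (8), using scipy's Gaussian KDE for simplicity rather than the Legendre-polynomial kernels and cross-validated bandwidths used in our experiments; the function name is ours.

```python
import numpy as np
from scipy.stats import gaussian_kde

def shannon_entropy_loo(X):
    """LOO influence-function estimator (8) for Shannon entropy.

    For T(p) = -int p log p, each LOO term T(f_-i) + psi(X_i; f_-i)
    reduces to -log f_-i(X_i) (the Shannon entropy row of Table 1).
    X: (n, d) array of samples.
    """
    X = np.atleast_2d(X)
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        # Kernel density estimate built from all samples except X_i.
        f_minus_i = gaussian_kde(np.delete(X, i, axis=0).T)
        total += -np.log(f_minus_i(X[i])[0])
    return total / n

# Example: for N(0,1) the true entropy is 0.5*log(2*pi*e) ~ 1.4189 nats.
rng = np.random.default_rng(0)
print(shannon_entropy_loo(rng.normal(size=(400, 1))))
```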
Table 1: Definitions of functionals and the corresponding LOO estimators. Here $p_{X|Z}$, $p_{XZ}$ etc. are conditional and joint densities. For the conditional divergences we take $V_i = (X_i, Z^1_i)$, $W_j = (Y_j, Z^2_j)$ to be the samples from $p_{XZ}$, $p_{YZ}$ respectively. For the mutual informations we have samples $(X_i, Y_i) \sim p_{XY}$, and for the conditional versions we have $(X_i, Y_i, Z_i) \sim p_{XYZ}$.

Tsallis-α entropy, $\frac{1}{\alpha - 1}\left(1 - \int p^\alpha\right)$:
$$\frac{1}{\alpha - 1} + \frac{1}{n}\sum_i \left[\int \hat p_{-i}^{\alpha} - \frac{\alpha}{\alpha - 1}\, \hat p_{-i}^{\alpha - 1}(X_i)\right]$$

Rényi-α entropy, $\frac{-1}{\alpha - 1}\log \int p^\alpha$:
$$\frac{\alpha}{\alpha - 1} + \frac{1}{n}\sum_i \left[\frac{-1}{\alpha - 1}\log \int \hat p_{-i}^{\alpha} - \frac{\alpha}{\alpha - 1}\,\frac{\hat p_{-i}^{\alpha - 1}(X_i)}{\int \hat p_{-i}^{\alpha}}\right]$$

Shannon entropy, $-\int p \log p$:
$$-\frac{1}{n}\sum_i \log \hat p_{-i}(X_i)$$

$L_2^2$ divergence, $\int (p_X - p_Y)^2$:
$$\frac{1}{n}\sum_i \left[2\,\hat p_{X,-i}(X_i) - 2\,\hat p_Y(X_i) - \int (\hat p_{X,-i} - \hat p_Y)^2\right] + \frac{2}{m}\sum_j \left[\hat p_{Y,-j}(Y_j) - \hat p_X(Y_j)\right]$$

Hellinger divergence, $2 - 2\int p_X^{1/2}\, p_Y^{1/2}$:
$$2 - \frac{1}{n}\sum_i \hat p_{X,-i}^{-1/2}(X_i)\, \hat p_Y^{1/2}(X_i) - \frac{1}{m}\sum_j \hat p_X^{1/2}(Y_j)\, \hat p_{Y,-j}^{-1/2}(Y_j)$$

Chi-squared divergence, $\int \frac{(p_X - p_Y)^2}{p_X}$:
$$-1 - \frac{1}{n}\sum_i \frac{\hat p_Y^2(X_i)}{\hat p_{X,-i}^2(X_i)} + \frac{2}{m}\sum_j \frac{\hat p_{Y,-j}(Y_j)}{\hat p_X(Y_j)}$$

$f$-divergence, $\int \phi\!\left(\frac{p_X}{p_Y}\right) p_Y$:
$$\frac{1}{n}\sum_i \phi'\!\left(\frac{\hat p_{X,-i}(X_i)}{\hat p_Y(X_i)}\right) + \frac{1}{m}\sum_j \left[\phi\!\left(\frac{\hat p_X(Y_j)}{\hat p_{Y,-j}(Y_j)}\right) - \frac{\hat p_X(Y_j)}{\hat p_{Y,-j}(Y_j)}\, \phi'\!\left(\frac{\hat p_X(Y_j)}{\hat p_{Y,-j}(Y_j)}\right)\right]$$

Tsallis-α divergence, $\frac{1}{\alpha - 1}\left(\int p_X^{\alpha}\, p_Y^{1-\alpha} - 1\right)$:
$$\frac{1}{1 - \alpha} + \frac{\alpha}{\alpha - 1}\,\frac{1}{n}\sum_i \left(\frac{\hat p_{X,-i}(X_i)}{\hat p_Y(X_i)}\right)^{\alpha - 1} - \frac{1}{m}\sum_j \left(\frac{\hat p_X(Y_j)}{\hat p_{Y,-j}(Y_j)}\right)^{\alpha}$$

KL divergence, $\int p_X \log \frac{p_X}{p_Y}$:
$$1 + \frac{1}{n}\sum_i \log \frac{\hat p_{X,-i}(X_i)}{\hat p_Y(X_i)} - \frac{1}{m}\sum_j \frac{\hat p_X(Y_j)}{\hat p_{Y,-j}(Y_j)}$$

Conditional Tsallis-α divergence, $\int p_Z\, \frac{1}{\alpha - 1}\left(\int p_{X|Z}^{\alpha}\, p_{Y|Z}^{1-\alpha} - 1\right)$:
$$\frac{1}{1 - \alpha} + \frac{\alpha}{\alpha - 1}\,\frac{1}{n}\sum_i \left(\frac{\hat p_{XZ,-i}(V_i)}{\hat p_{YZ}(V_i)}\right)^{\alpha - 1} - \frac{1}{m}\sum_j \left(\frac{\hat p_{XZ}(W_j)}{\hat p_{YZ,-j}(W_j)}\right)^{\alpha}$$

Conditional KL divergence, $\int p_Z \int p_{X|Z} \log \frac{p_{X|Z}}{p_{Y|Z}}$:
$$1 + \frac{1}{n}\sum_i \log \frac{\hat p_{XZ,-i}(V_i)}{\hat p_{YZ}(V_i)} - \frac{1}{m}\sum_j \frac{\hat p_{XZ}(W_j)}{\hat p_{YZ,-j}(W_j)}$$

Shannon mutual information, $\int p_{XY} \log \frac{p_{XY}}{p_X p_Y}$:
$$\frac{1}{n}\sum_i \left[\log \hat p_{XY,-i}(X_i, Y_i) - \log \hat p_{X,-i}(X_i) - \log \hat p_{Y,-i}(Y_i)\right]$$

Conditional Tsallis-α mutual information, $\int p_Z\, \frac{1}{\alpha - 1}\left(\int p_{X,Y|Z}^{\alpha}\, p_{X|Z}^{1-\alpha}\, p_{Y|Z}^{1-\alpha} - 1\right)$:
$$\frac{1}{1 - \alpha} + \frac{1}{\alpha - 1}\,\frac{1}{n}\sum_i \Bigg[\alpha \left(\frac{\hat p_{XYZ,-i}(X_i, Y_i, Z_i)\, \hat p_Z(Z_i)}{\hat p_{XZ,-i}(X_i, Z_i)\, \hat p_{YZ,-i}(Y_i, Z_i)}\right)^{\alpha - 1} - (1-\alpha)\, \hat p_Z^{\alpha - 2}(Z_i) \int \hat p_{XYZ,-i}^{\alpha}(\cdot, \cdot, Z_i)\, \hat p_{XZ,-i}^{1-\alpha}(\cdot, Z_i)\, \hat p_{YZ,-i}^{1-\alpha}(\cdot, Z_i)$$
$$+ (1-\alpha)\, \hat p_{XZ,-i}^{-\alpha}(X_i, Z_i)\, \hat p_Z^{\alpha - 1}(Z_i) \int \hat p_{XYZ,-i}^{\alpha}(X_i, \cdot, Z_i)\, \hat p_{YZ,-i}^{1-\alpha}(\cdot, Z_i) + (1-\alpha)\, \hat p_{YZ,-i}^{-\alpha}(Y_i, Z_i)\, \hat p_Z^{\alpha - 1}(Z_i) \int \hat p_{XYZ,-i}^{\alpha}(\cdot, Y_i, Z_i)\, \hat p_{XZ,-i}^{1-\alpha}(\cdot, Z_i)\Bigg]$$

4 Analysis

Some smoothness assumptions on the densities are warranted to make estimation tractable. We use the Hölder class, which is now standard in the nonparametrics literature.

Definition 3 (Hölder Class). Let $\mathcal{X} \subset \mathbb{R}^d$ be a compact space. For any $r = (r_1, \ldots, r_d)$, $r_i \in \mathbb{N}$, define $|r| = \sum_i r_i$ and $D^r = \frac{\partial^{|r|}}{\partial x_1^{r_1} \cdots \partial x_d^{r_d}}$. The Hölder class $\Sigma(s, L)$ is the set of functions on $L_2(\mathcal{X})$ satisfying

$$|D^r f(x) - D^r f(y)| \leq L \|x - y\|^{s - |r|},$$

for all $r$ such that $|r| \leq \lfloor s \rfloor$ and for all $x, y \in \mathcal{X}$. Moreover, define the bounded Hölder class $\Sigma(s, L, B_0, B)$ to be $\{f \in \Sigma(s, L) : B_0 < f < B\}$. Note that large $s$ implies higher smoothness.
Given $n$ samples $X_1^n$ from a $d$-dimensional density $f$, the kernel density estimator (KDE) with bandwidth $h$ is

$$\hat f(t) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{t - X_i}{h}\right),$$

where $K: \mathbb{R}^d \to \mathbb{R}$ is a smoothing kernel [Tsybakov, 2008]. When $f \in \Sigma(s, L)$, by selecting $h \in \Theta(n^{-\frac{1}{2s+d}})$ the KDE achieves the minimax rate of $O_P(n^{-\frac{2s}{2s+d}})$ in mean squared error. Further, if $f$ is in the bounded Hölder class $\Sigma(s, L, B_0, B)$, one can truncate the KDE from below at $B_0$ and from above at $B$ and achieve the same convergence rate [Birgé and Massart, 1995]. In our analysis, the density estimators $\hat f^{(1)}, \hat f_{-i}, \hat g^{(1)}, \hat g_{-i}$ are formed by either a KDE or a truncated KDE, and we will make use of these results.

We will also need the following regularity condition on the influence function. It is satisfied for smooth functionals, including those in Table 1; we demonstrate this in our example in Appendix F.

Assumption 4. For a functional $T(f)$, the influence function $\psi$ satisfies

$$\mathbb{E}\left[(\psi(X; f') - \psi(X; f))^2\right] \in O(\|f - f'\|_2) \quad \text{as } \|f - f'\|_2 \to 0.$$

For a functional $T(f, g)$ of two distributions, the influence functions $\psi_f, \psi_g$ satisfy

$$\mathbb{E}_f\left[(\psi_f(X; f', g') - \psi_f(X; f, g))^2\right] \in O(\|f - f'\|_2 + \|g - g'\|_2) \quad \text{as } \|f - f'\|_2, \|g - g'\|_2 \to 0,$$
$$\mathbb{E}_g\left[(\psi_g(Y; f', g') - \psi_g(Y; f, g))^2\right] \in O(\|f - f'\|_2 + \|g - g'\|_2) \quad \text{as } \|f - f'\|_2, \|g - g'\|_2 \to 0.$$

Under the above assumptions, it is known [Emery et al., 1998; Robins et al., 2009] that the DS estimator on a single distribution achieves the mean squared error (MSE) $\mathbb{E}[(\hat T_{DS} - T(f))^2] \in O(n^{-\frac{4s}{2s+d}} + n^{-1})$ and, further, is asymptotically normal when $s > d/2$. We review these results along with a proof in Appendix B. Note that Robins et al. [2009] analyse $\hat T_{DS}$ in the semiparametric setting; we present a simpler, self-contained analysis that directly uses the VME and has more interpretable assumptions. Bounding the bias and variance of the DS estimator to establish the convergence rate follows via a straightforward conditioning argument and Cauchy-Schwarz. However, an attractive property is that the analysis is agnostic to the density estimator used, provided it achieves the correct rates.

For the LOO estimator proposed in (8), we establish the following result.

Theorem 5 (Convergence of LOO Estimator for $T(f)$). Let $f \in \Sigma(s, L, B_0, B)$ and let $\psi$ satisfy Assumption 4. Then $\mathbb{E}[(\hat T_{LOO} - T(f))^2]$ is $O(n^{-\frac{4s}{2s+d}})$ when $s < d/2$ and $O(n^{-1})$ when $s \geq d/2$.

The key technical challenge in analysing the LOO estimator (when compared to the DS estimator) is in bounding the variance, with several correlated terms in the summation. The bounded difference inequality is a popular trick in such settings, but it requires a supremum on the influence functions, which leads to significantly worse rates. Instead we use the Efron-Stein inequality, which provides an integrated version of bounded differences that can recover the correct rate when coupled with Assumption 4. Our proof is contingent on the use of the KDE as the density estimator. While our empirical studies indicate that $\hat T_{LOO}$'s limiting distribution is normal (Fig 2(c)), the proof seems challenging due to the correlation between terms in the summation. We conjecture that $\hat T_{LOO}$ is indeed asymptotically normal, but for now leave it as an open problem.
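The density estimator the analysis assumes is easy to state in code. The sketch below is our own illustration: a KDE with the rate-optimal bandwidth scaling $h \asymp n^{-1/(2s+d)}$ and truncation to $[B_0, B]$. The Gaussian kernel, the bandwidth constant $C$ and the function name are our choices; a Gaussian kernel is only rate-optimal for $s \leq 2$, and our experiments instead use Legendre-polynomial kernels.

```python
import numpy as np

def truncated_kde(X, s, B0=None, B=None, C=1.0):
    """KDE with bandwidth h = C * n^(-1/(2s+d)), truncated to [B0, B]
    when bounds are supplied (cf. Birge and Massart, 1995).

    X: (n, d) samples; s: assumed Holder smoothness.
    Returns a callable evaluating the density estimate at query points.
    """
    X = np.atleast_2d(X)
    n, d = X.shape
    h = C * n ** (-1.0 / (2 * s + d))

    def f_hat(t):
        t = np.atleast_2d(t)                          # (m, d) query points
        z = (t[:, None, :] - X[None, :, :]) / h       # (m, n, d) scaled diffs
        K = np.exp(-0.5 * (z ** 2).sum(axis=2)) / (2 * np.pi) ** (d / 2)
        vals = K.sum(axis=1) / (n * h ** d)           # (1/(n h^d)) sum K(.)
        if B0 is not None or B is not None:
            vals = np.clip(vals, B0, B)               # truncate to [B0, B]
        return vals

    return f_hat
```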
We reiterate that while the convergence rates are the same for both DS and LOO estimators, the data splitting degrades the empirical performance of $\hat T_{DS}$.

Now we turn our attention to functionals of two distributions. When analysing asymptotics we will assume that as $n, m \to \infty$, $n/(n+m) \to \zeta \in (0, 1)$. Denote $N = n + m$. For the DS estimator (9) we generalise our analysis for one distribution to establish the theorem below.

Theorem 6 (Convergence/Asymptotic Normality of DS Estimator for $T(f,g)$). Let $f, g \in \Sigma(s, L, B_0, B)$ and let $\psi_f, \psi_g$ satisfy Assumption 4. Then $\mathbb{E}[(\hat T_{DS} - T(f,g))^2]$ is $O(n^{-\frac{4s}{2s+d}} + m^{-\frac{4s}{2s+d}})$ when $s < d/2$ and $O(n^{-1} + m^{-1})$ when $s \geq d/2$. Further, when $s > d/2$ and when $\psi_f, \psi_g \neq 0$, $\hat T_{DS}$ is asymptotically normal:

$$\sqrt{N}\,(\hat T_{DS} - T(f,g)) \xrightarrow{D} \mathcal{N}\left(0,\ \frac{1}{\zeta}\,\mathbb{V}_f[\psi_f(X; f, g)] + \frac{1}{1-\zeta}\,\mathbb{V}_g[\psi_g(Y; f, g)]\right). \qquad (11)$$

The asymptotic normality result is useful as it allows us to construct asymptotic confidence intervals for a functional. Even though the asymptotic variance of the influence function is not known, by Slutzky's theorem any consistent estimate of the variance gives a valid asymptotic confidence interval (see the sketch at the end of this section). In fact, we can use an influence function based estimator for the asymptotic variance, since it is also a differentiable functional of the densities. We demonstrate this in our example in Appendix F.

The condition $\psi_f, \psi_g \neq 0$ is somewhat technical. When both $\psi_f$ and $\psi_g$ are zero, the first order terms vanish and the estimator converges very fast (at rate $1/n^2$). However, the asymptotic behavior of the estimator is then unclear. While this degeneracy occurs only on a meagre set, it does arise for important choices. One example is the null hypothesis $f = g$ in two-sample testing problems.

Finally, for the LOO estimator (10) on two distributions we have the following result.

Theorem 7 (Convergence of LOO Estimator for $T(f,g)$). Let $f, g \in \Sigma(s, L, B_0, B)$ and let $\psi_f, \psi_g$ satisfy Assumption 4. Then $\mathbb{E}[(\hat T_{LOO} - T(f,g))^2]$ is $O(n^{-\frac{4s}{2s+d}} + m^{-\frac{4s}{2s+d}})$ when $s < d/2$ and $O(n^{-1} + m^{-1})$ when $s \geq d/2$.

For many functionals, a Hölderian assumption ($\Sigma(s, L)$) alone is sufficient to guarantee the rates in Theorems 5, 6 and 7. However, for some functionals (such as the α-divergences) we require $\hat f, \hat g, f, g$ to be bounded above and below. Existing results [Krishnamurthy et al., 2014; Birgé and Massart, 1995] demonstrate that estimating such quantities is difficult without this assumption.

Now we turn our attention to the question of statistical difficulty. Via the lower bounds given by Birgé and Massart [1995] and Laurent [1996] we know that the DS and LOO estimators are minimax optimal when $s > d/2$ for functionals of one distribution. In the following theorem, we present a lower bound for estimating functionals of two distributions.

Theorem 8 (Lower Bound for $T(f,g)$). Let $f, g \in \Sigma(s, L)$ and let $\hat T$ be any estimator for $T(f,g)$. Define $\tau = \min\{8s/(4s+d), 1\}$. Then there exists a strictly positive constant $c$ such that

$$\liminf_{n \to \infty}\ \inf_{\hat T}\ \sup_{f, g \in \Sigma(s, L)} \mathbb{E}\left[(\hat T - T(f,g))^2\right] \geq c\,\left(n^{-\tau} + m^{-\tau}\right).$$

Our proof, given in Appendix E, is based on LeCam's method [Tsybakov, 2008] and generalises the analysis of Birgé and Massart [1995] for functionals of one distribution. This establishes minimax optimality of the DS/LOO estimators for functionals of two distributions when $s \geq d/2$.
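The confidence-interval recipe promised after Theorem 6 is short. The sketch below is ours; for simplicity it estimates the asymptotic variance by the naive sample variances of the estimated influence values, a stand-in for the influence-function based variance estimator used in Appendix F, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def asymptotic_ci(T_hat, psi_f_vals, psi_g_vals, delta=0.05):
    """Two-sided (1 - delta) confidence interval for T(f, g) via Eq. (11).

    psi_f_vals: psi_f(X_i; f_hat, g_hat) at the n X-samples.
    psi_g_vals: psi_g(Y_j; f_hat, g_hat) at the m Y-samples.
    By Slutzky's theorem, plugging in consistent variance estimates
    preserves asymptotic coverage.
    """
    n, m = len(psi_f_vals), len(psi_g_vals)
    N = n + m
    zeta = n / N
    # Estimated asymptotic variance from (11).
    S2 = np.var(psi_f_vals) / zeta + np.var(psi_g_vals) / (1.0 - zeta)
    w = norm.ppf(1.0 - delta / 2.0) * np.sqrt(S2 / N)
    return T_hat - w, T_hat + w
```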
However, when $s < d/2$ there is a gap between our technique and the lower bound, and it is natural to ask if it is possible to improve on our rates in this regime. A series of works [Birgé and Massart, 1995; Laurent, 1996; Kerkyacharian and Picard, 1996] shows that, for integral functionals of one distribution, one can achieve the $n^{-1}$ rate when $s > d/4$ by estimating the second order term in the functional Taylor expansion. This second order correction was also done for polynomial functionals of two distributions, with similar statistical gains [Krishnamurthy et al., 2014]. While we believe this is possible here, these estimators are conceptually complicated and computationally expensive, requiring $O(n^3 + m^3)$ effort compared to the $O(n^2 + m^2)$ effort for our estimator. The first order estimator has a favorable balance between statistical and computational efficiency. Further, not much is known about the limiting distribution of second order estimators.

5 Comparison with Other Approaches

Estimation of statistical functionals under nonparametric assumptions has received considerable attention over the last few decades. A large body of work has focused on estimating the Shannon entropy; Beirlant et al. [1997] give a nice review of results and techniques. More recent work in the single-distribution setting includes estimation of Rényi and Tsallis entropies [Leonenko and Seleznjev, 2010; Pál et al., 2010]. There are also several papers extending some of these techniques to divergence estimation [Krishnamurthy et al., 2014; Póczos and Schneider, 2011; Wang et al., 2009; Källberg and Seleznjev, 2012; Pérez-Cruz, 2008].

Many of the existing methods can be categorized as plug-in methods: they are based on estimating the densities, either via a KDE or using $k$-Nearest Neighbors ($k$-NN), and evaluating the functional on these estimates. Plug-in methods are conceptually simple but unfortunately suffer from several drawbacks. First, they typically have worse convergence rates than our approach, achieving the parametric rate only when $s \geq d$ as opposed to $s \geq d/2$ [Liu et al., 2012; Singh and Poczos, 2014]. Second, using either the KDE or $k$-NN, obtaining the best rates for plug-in methods requires undersmoothing the density estimate, and we are not aware of principled approaches for hyperparameter tuning here. In contrast, the bandwidth used in our estimators is the optimal bandwidth for density estimation, so a number of approaches such as cross validation are available. This is convenient for a practitioner, as our method does not require tuning hyperparameters. Third, plug-in methods based on the KDE always require computationally burdensome numeric integration. In our approach, numeric integration can be avoided for many functionals of interest (see Table 1).

There is also another line of work on estimating $f$-divergences. Nguyen et al. [2010] estimate $f$-divergences by solving a convex program and analyse the technique when the likelihood ratio of the densities is in an RKHS. Comparing the theoretical results is not straightforward, since it is not clear how to port their assumptions to our setting. Further, the size of the convex program increases with the sample size, which is problematic for large samples.
Moon and Hero [2014] use a weighted ensemble estimator for $f$-divergences. They establish asymptotic normality and the parametric convergence rate only when $s \geq d$, which is a stronger smoothness assumption than is required by our technique. Both these works only consider $f$-divergences. Our method has wider applicability and includes $f$-divergences as a special case.

6 Experiments

6.1 Simulations

First, we compare the estimators derived using our methods on a series of synthetic examples in 1-4 dimensions. For the DS/LOO estimators, we estimate the density via a KDE with smoothing kernels constructed using Legendre polynomials [Tsybakov, 2008]. In both cases, and for the plug-in estimator, we choose the bandwidth by performing 5-fold cross validation. The integration for the plug-in estimator is approximated numerically. The specifics of the data generating distributions and the methods compared against are given below. The results are shown in Figures 1 and 2. We make the following observations.

In most cases the LOO estimator performs best. The DS estimator approaches the LOO estimator when there are many samples, but is generally inferior to the LOO estimator with few samples. This, as we have explained before, is because data splitting does not make efficient use of the data. The $k$-NN estimator for divergences [Póczos et al., 2011] requires choosing a $k$; for this estimator, we used the default setting for $k$ given in the software. As performance is sensitive to the choice of $k$, it performs well in some cases but poorly in others. We reiterate that our estimators do not require any setting of hyperparameters.

Next, we present some results on asymptotic normality. We test the DS and LOO estimators on a 4-dimensional Hellinger divergence estimation problem, using 4000 samples for estimation. We repeat this experiment 200 times and compare the empirical asymptotic distribution (i.e. the $\sqrt{4000}\,(\hat T - T(f,g))/\hat S$ values, where $\hat S^2$ is the estimated asymptotic variance) to a $\mathcal{N}(0,1)$ distribution on a QQ plot. The results in Figure 2 suggest that both estimators are asymptotically normal.

[Figure 1: six log-log panels (Shannon entropy in 1D and 2D; KL, Rényi-0.75, Hellinger and Tsallis-0.75 divergences) plotting error against sample size for the Plug-in, DS, LOO, kNN, KDP, Vasicek-KDE and Voronoi estimators.]

Figure 1: Comparison of DS/LOO estimators against alternatives on different functionals. The $y$-axis is the error $|\hat T - T(f,g)|$ and the $x$-axis is the number of samples. All curves were produced by averaging over 50 experiments. Some curves are slightly wiggly, probably due to discretisation in hyperparameter selection.

Details: In our simulations, for the first panel of Fig 1 (Shannon entropy in one dimension) we generated data from the density

$$f_1(t) = \frac{1}{2} + \frac{1}{2}\left(10\, t^9\right), \quad t \in (0, 1).$$

To sample from $f_1$, with probability $1/2$ we sample from the uniform distribution $U(0,1)$, and otherwise we sample 10 points from $U(0,1)$ and pick the maximum.
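A sketch of this sampling scheme, written out for illustration (the function name is ours):

```python
import numpy as np

def sample_f1(n, rng=None):
    """Draw n samples from f1(t) = 1/2 + (1/2) * 10 t^9 on (0, 1).

    With probability 1/2 draw from U(0,1); otherwise draw 10 points
    from U(0,1) and keep the maximum, whose density is 10 t^9.
    """
    rng = np.random.default_rng(rng)
    pick_max = rng.random(n) < 0.5
    u = rng.random(n)                        # uniform component
    m = rng.random((n, 10)).max(axis=1)      # max of 10 uniforms
    return np.where(pick_max, m, u)
```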
For the third panel of Fig 1 (KL divergence) we generated data from the one dimensional density

$$f_2(t) = \frac{1}{2} + \frac{1}{2}\, \frac{t^{19}(1-t)^{19}}{B(20, 20)},$$

where $B(\cdot, \cdot)$ is the Beta function. For this, with probability $1/2$ we sample from $U(0,1)$ and otherwise sample from a Beta(20, 20) distribution. For the second and fourth panels of Fig 1 we sampled from a 2 dimensional density whose first dimension was $f_1$ and whose second was $U(0,1)$. The fifth and sixth panels were from a 2 dimensional density whose first dimension was $f_2$ and whose second was $U(0,1)$. In all panels of Fig 2, the first distribution was a 4-dimensional density in which every dimension is $f_2$; the second was $U(0,1)^4$.

Methods compared: In addition to the plug-in, DS and LOO estimators, we perform comparisons with several other estimators. For the Shannon entropy we compare our method to the $k$-NN estimator of Goria et al. [2005], the method of Stowell and Plumbley [2009] which uses $k$-d partitioning, the method of Noughabi and Noughabi [2013] based on Vasicek's spacing method, and that of Learned-Miller and John [2003] based on Voronoi tessellation. For the KL divergence we compare against the $k$-NN method of Pérez-Cruz [2008] and that of Ramírez et al. [2009] based on the power spectral density representation using Szegő's theorem. For Rényi-α, Tsallis-α and Hellinger divergences we compared against the $k$-NN method of Póczos et al. [2011]. Software for these estimators was obtained either directly from the papers or from Szabó [2014].

[Figure 2: panel (a) plots error against sample size for the DS and LOO estimators on the conditional Tsallis-0.75 divergence; panels (b) and (c) are QQ plots against N(0,1).]

Figure 2: Fig (a): Comparison of the LOO vs DS estimator on estimating the conditional Tsallis divergence in 4 dimensions. Note that the plug-in estimator is intractable here due to numerical integration, and there are no other known estimators for the conditional Tsallis divergence. Figs (b), (c): QQ plots obtained using 4000 samples for Hellinger divergence estimation in 4 dimensions, using the DS and LOO estimators respectively.

6.2 Image Clustering Task

Here we demonstrate a simple image clustering task using a nonparametric divergence estimator. For this we use images from the ETH-80 dataset. The objective here is not to champion our approach against all methods for image clustering; rather, we wish to demonstrate that our estimators can be easily and intuitively applied to many machine learning problems. We use the three categories Apples, Cows and Cups and randomly select 50 images from each category. Some sample images are shown in Fig 3(a). We convert the images to greyscale and extract the SIFT features from each image. The SIFT features are 128-dimensional, but we project them to 4 dimensions via PCA; this is necessary because nonparametric methods work best in low dimensions. Now we can treat each image as a collection of features, and hence as a sample from a 4 dimensional distribution. We estimate the Hellinger divergence between these "distributions", and then construct an affinity matrix $A$ whose similarity between the $i$th and $j$th images is $A_{ij} = \exp(-\hat H^2(X_i, X_j))$, where $X_i$ and $X_j$ denote the projected SIFT samples from images $i$ and $j$ and $\hat H^2(X_i, X_j)$ is the estimated Hellinger divergence between the underlying distributions. Finally, we run a spectral clustering algorithm on the matrix $A$.
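The pipeline is a few lines in code. The sketch below is our illustration: it assumes OpenCV's SIFT, scikit-learn's PCA and SpectralClustering, and a hellinger_loo(X, Y) function implementing the Hellinger row of Table 1, which we do not reproduce here.

```python
import numpy as np
import cv2
from sklearn.decomposition import PCA
from sklearn.cluster import SpectralClustering

def cluster_images(images, hellinger_loo, n_clusters=3):
    """images: list of greyscale uint8 arrays. Returns cluster labels.

    hellinger_loo(X, Y) is assumed to return the estimated Hellinger
    divergence between the sample sets X and Y (Table 1); it is not
    defined in this sketch.
    """
    sift = cv2.SIFT_create()
    feats = []
    for img in images:
        _, desc = sift.detectAndCompute(img, None)   # (k, 128) descriptors
        feats.append(desc)
    pca = PCA(n_components=4).fit(np.vstack(feats))  # shared 4-dim projection
    samples = [pca.transform(d) for d in feats]

    n = len(samples)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Affinity A_ij = exp(-H^2(X_i, X_j)) as in the text.
            A[i, j] = A[j, i] = np.exp(-hellinger_loo(samples[i], samples[j]))
    np.fill_diagonal(A, 1.0)
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(A)
```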
Figure 3(b) depicts the affinity matrix $A$ when the images are ordered according to their class label. The affinity matrix exhibits block-diagonal structure, which indicates that our Hellinger divergence estimator can in fact identify patterns in the images. Our approach achieved a clustering accuracy of 92.47%. When we used the $k$-NN based estimator of Póczos et al. [2011] we achieved an accuracy of 90.04%. When we instead applied spectral clustering naively, with $A_{ij} = \exp(-L_2(P_i, P_j)^2)$ where $L_2(P_i, P_j)$ is the $L_2$ distance between the pixel intensities, we achieved an accuracy of 70.18%. We also tried $A_{ij} = \exp(-\alpha \hat H^2(X_i, X_j))$ as the affinity for different choices of $\alpha$ and found that our estimator still performed best. We also experimented with the Rényi-α and Tsallis-α divergences and obtained similar results. On the same note, one can imagine that these divergence estimators could also be used for a classification task: for instance, we could treat $\exp(-\hat H^2(X_i, X_j))$ as a similarity between images and use it in a classifier such as an SVM.

7 Conclusion

We generalise existing results in Von Mises estimation by proposing an empirically superior LOO technique for estimating functionals and by extending the framework to functionals of two distributions. We also prove a lower bound for the latter setting. We demonstrate the practical utility of our technique via comparisons against other alternatives and an image clustering application. An open problem arising out of our work is to derive the limiting distribution of the LOO estimator.

Figure 3: (a) Some sample images from the three categories apples, cows and cups. (b) The affinity matrix used in clustering.

Acknowledgements

This work is supported in part by NSF Big Data grant IIS-1247658 and DOE grant DESC0011114.

References

Jan Beirlant, Edward J. Dudewicz, László Györfi, and Edward C. van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 1997.
Peter J. Bickel and Ya'acov Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, 1988.
Lucien Birgé and Pascal Massart. Estimation of integral functionals of a density. Annals of Statistics, 1995.
Kevin M. Carter, Raviv Raich, and Alfred O. Hero. On local intrinsic dimension estimation and its applications. IEEE Transactions on Signal Processing, 2010.
Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar. A divisive information theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 2003.
M. Emery, A. Nemirovski, and D. Voiculescu. Lectures on Probability Theory and Statistics. Springer, 1998.
Luisa Fernholz. Von Mises Calculus for Statistical Functionals. Lecture Notes in Statistics. Springer, 1983.
Mohammed Nawaz Goria, Nikolai N. Leonenko, Victor V. Mergel, and Pier Luigi Novi Inverardi. A new class of random vector entropy estimators and its applications. Nonparametric Statistics, 2005.
Alfred O. Hero, Bing Ma, O. J. J. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19, 2002.
David Källberg and Oleg Seleznjev. Estimation of entropy-type integral functionals. arXiv, 2012.
Gérard Kerkyacharian and Dominique Picard. Estimating nonquadratic functionals of a density using Haar wavelets. Annals of Statistics, 1996.
Akshay Krishnamurthy, Kirthevasan Kandasamy, Barnabas Poczos, and Larry Wasserman. Nonparametric estimation of Rényi divergence and friends. In ICML, 2014.
Béatrice Laurent. Efficient estimation of integral functionals of a density. Annals of Statistics, 1996.
Erik Learned-Miller and Fisher John. ICA using spacings estimates of entropy. Journal of Machine Learning Research, 2003.
Nikolai Leonenko and Oleg Seleznjev. Statistical inference for the epsilon-entropy and the quadratic Rényi entropy. Journal of Multivariate Analysis, 2010.
Jeremy Lewi, Robert Butera, and Liam Paninski. Real-time adaptive information-theoretic optimization of neurophysiology experiments. In NIPS, 2006.
Han Liu, Larry Wasserman, and John D. Lafferty. Exponential concentration for mutual information estimation with application to forests. In NIPS, 2012.
Erik G. Miller. A new class of entropy estimators for multi-dimensional densities. In ICASSP, 2003.
Kevin Moon and Alfred Hero. Multivariate f-divergence estimation with confidence. In NIPS, 2014.
XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 2010.
Havva Alizadeh Noughabi and Reza Alizadeh Noughabi. On the entropy estimators. Journal of Statistical Computation and Simulation, 2013.
Dávid Pál, Barnabás Póczos, and Csaba Szepesvári. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In NIPS, 2010.
Hanchuan Peng, Fulmi Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE PAMI, 2005.
Fernando Pérez-Cruz. KL divergence estimation of continuous distributions. In IEEE ISIT, 2008.
Barnabás Póczos and Jeff Schneider. On the estimation of alpha-divergences. In AISTATS, 2011.
Barnabás Póczos, Liang Xiong, and Jeff G. Schneider. Nonparametric divergence estimation with applications to machine learning on distributions. In UAI, 2011.
David Ramírez, Javier Vía, Ignacio Santamaría, and Pedro Crespo. Entropy and Kullback-Leibler divergence estimation based on Szegő's theorem. In EUSIPCO, 2009.
James Robins, Lingling Li, Eric Tchetgen, and Aad W. van der Vaart. Quadratic semiparametric von Mises calculus. Metrika, 2009.
Elad Schneidman, William Bialek, and Michael J. Berry II. An information theoretic approach to the functional classification of neurons. In NIPS, 2002.
Shashank Singh and Barnabas Poczos. Exponential concentration of a density functional estimator. In NIPS, 2014.
Dan Stowell and Mark D. Plumbley. Fast multidimensional entropy estimation by k-d partitioning. IEEE Signal Processing Letters, 2009.
Zoltán Szabó. Information theoretical estimators toolbox. Journal of Machine Learning Research, 2014.
Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.
Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
Qing Wang, Sanjeev R. Kulkarni, and Sergio Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 2009.

Appendix A: Auxiliary Results

Lemma 9 (VME and Functional Taylor Expansion). Let $P, Q$ have densities $p, q$ and let $T(P) = \phi(\int \nu(p))$. Then the first order VME of $T(Q)$ around $P$ reduces to a functional Taylor expansion around $p$:

$$T(Q) = T(P) + T'(Q - P; P) + R_2 = T(p) + \phi'\left(\int \nu(p)\right)\int \nu'(p)(q - p) + R_2. \qquad (12)$$

Proof. It is sufficient to show that the first order terms are equal:

$$T'(Q - P; P) = \frac{\partial T((1-t)P + tQ)}{\partial t}\bigg|_{t=0} = \frac{\partial}{\partial t}\,\phi\left(\int \nu((1-t)p + tq)\right)\bigg|_{t=0} = \phi'\left(\int \nu((1-t)p + tq)\right)\int \nu'((1-t)p + tq)(q - p)\bigg|_{t=0} = \phi'\left(\int \nu(p)\right)\int \nu'(p)(q - p).$$

Lemma 10 (VME and Functional Taylor Expansion, Two Distributions). Let $P_1, P_2, Q_1, Q_2$ be distributions with densities $p_1, p_2, q_1, q_2$, and let $T(P_1, P_2) = \phi(\int \nu(p_1, p_2))$. Then,

$$T(Q_1, Q_2) = T(P_1, P_2) + T'_1(Q_1 - P_1; P_1, P_2) + T'_2(Q_2 - P_2; P_1, P_2) + R_2 \qquad (13)$$
$$= T(P_1, P_2) + \phi'\left(\int \nu(p_1, p_2)\right)\left[\int \frac{\partial \nu(p_1(x), p_2(x))}{\partial p_1(x)}(q_1 - p_1)\, dx + \int \frac{\partial \nu(p_1(x), p_2(x))}{\partial p_2(x)}(q_2 - p_2)\, dx\right] + R_2.$$

Proof. Similar to Lemma 9.

Lemma 11. Let $f, g$ be two densities bounded above and below on a compact space. Then for all $a, b$, $\|f^a - g^a\|_b \in O(\|f - g\|_b)$.

Proof. Follows from the expansion

$$\int |f^a - g^a|^b = \int |g^a(x) + a(f(x) - g(x))\, g_*^{a-1}(x) - g^a(x)|^b \leq a^b \sup_x |g_*^{b(a-1)}(x)| \int |f - g|^b.$$

Here $g_*(x)$ takes an intermediate value between $f(x)$ and $g(x)$. In the first step we used the mean value theorem, and in the second step we used the boundedness of $f, g$ to bound $g_*$.

Finally, we will make use of the Efron-Stein inequality, stated below, in our analysis.

Lemma 12 (Efron-Stein Inequality). Let $X_1, \ldots, X_n, X'_1, \ldots, X'_n$ be independent random variables where $X_i, X'_i \in \mathcal{X}_i$. Let $Z = f(X_1, \ldots, X_n)$ and $Z^{(i)} = f(X_1, \ldots, X'_i, \ldots, X_n)$, where $f: \mathcal{X}_1 \times \cdots \times \mathcal{X}_n \to \mathbb{R}$. Then

$$\mathbb{V}(Z) \leq \frac{1}{2}\,\mathbb{E}\left[\sum_{i=1}^{n} (Z - Z^{(i)})^2\right].$$

Appendix B: Review: DS Estimator on a Single Distribution

This section is intended as a review of the data-split estimator used in Robins et al. [2009]. The estimator was originally analysed in the semiparametric setting; in order to be self contained, we provide an analysis that directly uses the Von Mises expansion. We state our main result below.

Theorem 13. Suppose $f \in \Sigma(s, L, B_0, B)$ and $\psi$ satisfies Assumption 4. Then $\mathbb{E}[(\hat T_{DS} - T(f))^2]$ is $O(n^{-\frac{4s}{2s+d}})$ when $s < d/2$ and $O(n^{-1})$ when $s > d/2$. Further, when $s > d/2$ and $\psi \neq 0$, $\hat T_{DS}$ is asymptotically normal:

$$\sqrt{n}\,(\hat T_{DS} - T(f)) \xrightarrow{D} \mathcal{N}\left(0,\ \mathbb{V}_f[\psi(X; f)]\right). \qquad (14)$$

We begin the proof with a series of technical lemmas.

Lemma 14. The influence function has zero mean, i.e. $\mathbb{E}_P[\psi(X; P)] = 0$.

Proof. $0 = T'(P - P; P) = \int \psi(x; P)\, dP(x)$.

Now we prove the following lemma on the preliminary estimator $\hat T^{(1)}_{DS}$.

Lemma 15 (Conditional Bias and Variance). Let $\hat f^{(1)}$ be a consistent estimator for $f$ in the $L_2$ metric. Let $T$ have bounded second derivatives, and let $\sup_x \psi(x; f)$ and $\mathbb{V}_{X \sim f}\,\psi(X; g)$ be bounded for all $g \in \mathcal{M}$. Then the bias of the preliminary estimator $\hat T^{(1)}_{DS}$ (7) conditioned on $X_1^{n/2}$ is $O(\|f - \hat f^{(1)}\|_2^2)$. The conditional variance is $O(1/n)$.

Proof.
First consider the conditional bias, E X n n/ 2+1 h b T (1) DS − T ( f ) | X n/ 2 1 i = E X n n/ 2+1   T ( ˆ f (1) ) + 2 n n X i = n/ 2+1 ψ ( X i ; ˆ f (1) ) − T ( f ) | X n/ 2 1   = T ( ˆ f (1) ) + E f h ψ ( X ; ˆ f (1) ) i − T ( f ) ∈ O ( k ˆ f (1) − f k 2 2 ) . (15) The last step follo ws from the b oundedness of the second deriv ative from which the first order functional T aylor expansion ( 4 ) holds. The conditional v ariance is, V X n n/ 2+1 h b T (1) DS | X n/ 2 1 i = V X n n/ 2+1   2 n n X i = n/ 2+1 ψ ( X ; ˆ f (1) )    X n/ 2 1   = 2 n V f h ψ ( X ; ˆ f (1) ) i ∈ O ( n − 1 ) . (16) Lemma 16 (Asymptotic Normalit y) . Supp ose in addition to the c onditions in the lemma ab ove we also have Assumption 4 and k ˆ f (1) − f k 2 ∈ o P ( n − 1 / 4 ) and ψ 6 = 0 . Then, √ n ( b T DS − T ( f )) D − → N (0 , V f ψ ( X ; f )) . Pr o of. W e b egin with the following expansion around ˆ f (1) , T ( f ) = T ( ˆ f (1) ) + Z ψ ( u ; ˆ f (1) ) f ( u )d µ ( u ) + O ( k ˆ f (1) − f k 2 ) . (17) First consider b T (1) DS . W e can write r n 2  b T (1) DS − T ( f )  = r n 2   T ( ˆ f (1) ) + 2 n n X i = n/ 2+1 ψ ( X i ; ˆ f (1) ) − T ( f )   (18) = r 2 n n X i = n/ 2+1  ψ ( X i ; ˆ f (1) ) − ψ ( X i ; f ) −  Z ψ ( u ; ˆ f (1) ) f ( u )d u − Z ψ ( u ; f ) f ( u )d u  + r 2 n n X i = n/ 2+1 ψ ( X i ; f ) + √ n O  k ˆ f (1) − f k 2  . In the second step w e used the VME in ( 17 ). In the third step, w e added and subtracted P i ψ ( X i ; f ) and also added E ψ ( · ; f ) = 0. Ab o ve, the third term is o P (1) as k ˆ f (1) − f k 2 ∈ o P ( n − 1 / 4 ). The first term which 14 w e shall denote by Q n can also b e shown to be o P (1) via Cheb yshev’s inequality . It is sufficient to show P ( | Q n | >  | X n/ 2 1 ) → 0. First note that, V [ Q n | X n/ 2 1 ] = V   r 2 n n X i = n/ 2+1  ψ ( X i ; ˆ f (1) ) − ψ ( X i ; f ) −  Z ψ ( u ; ˆ f (1) ) f ( u )d u − Z ψ ( u ; f ) f ( u )d u     X n/ 2 1   = V  ψ ( X ; ˆ f (1) ) − ψ ( X ; f ) −  Z ψ ( u ; ˆ f (1) ) f ( u )d u − Z ψ ( u ; f ) f ( u )d u     X n/ 2 1  ≤ E   ψ ( X ; ˆ f (1) ) − ψ ( X ; f )  2  ∈ O ( k ˆ f (1) − f k 2 ) → 0 , (19) where the last step follows from Assumption 4 . Now, P ( | Q n | >  | X n 1 ) ≤ V ( Q n | X n/ 2 1 ) / → 0. Hence we hav e r n 2 ( b T (1) DS − T ( f )) = r 2 n n X i = n/ 2+1 ψ ( X i ; f ) + o P (1) W e can similarly show r n 2 ( b T (2) DS − T ( f )) = r 2 n n X i = n/ 2+1 ψ ( X i ; f ) + o P (1) Therefore, b y the CL T and Slutzky’s theorem, √ n ( b T DS − T ( f )) = 1 √ 2  r n 2 ( b T (1) DS − T ( f )) + r n 2 ( b T (2) DS − T ( f ))  = n − 1 / 2 n X i =1 ψ ( X i ; f ) + o P (1) D − → N (0 , V f ψ ( X ; f ) W e are now ready to pro ve Theorem 13 . Note that the brunt of the work for the DS estimator was in analysing the preliminary estimator b T DS . Pr o of of The or em 13 . W e first note that in a H¨ older class, with n samples the KDE achiev es the rate E k p − ˆ p k 2 ∈ O ( n − 2 s 2 s + d ). Then the bias of b T DS is, E X n/ 2 1 E X n n/ 2+1 h b T (1) DS − T ( f ) | X n/ 2 1 i = E X n/ 2 1 h O  k f − ˆ f (1) k 2 i ∈ O  n − 2 s 2 s + d  It immediately follo ws that E h b T DS − T ( f ) i ∈ O  n − 2 s 2 s + d  . F or the v ariance, we use Theorem 15 and the Law of total v ariance for b T (1) DS , V X n 1 h b T (1) DS i = 1 n E X n/ 2 1 V f h ψ ( X ; ˆ f (1) , ˆ g ) i + + V X n/ 2 1 h E X n n/ 2+1 h b T DS − T ( f ) | X n/ 2 1 ii ∈ O  1 n  + E X n/ 2 1 h O  k f − ˆ f (1) k 4 i ∈ O  n − 1 + n − 4 s 2 s + d  In the second step we used the fact that V Z ≤ E Z 2 . 
F urther, E X n/ 2 1 V f h ψ ( X ; ˆ f (1) ) i is b ounded since ψ is b ounded. The v ariance of b T DS can b e b ounded using the Cauch y Sch w arz Inequality , V h b T DS i = V " b T (1) DS + b T (2) DS 2 # = 1 4  V b T (1) DS + V b T (2) DS + 2 C ov( b T (1) DS , b T (2) DS )  ≤ max  V b T (1) DS , V b T (2) DS  ∈ O  n − 1 + n − 4 s 2 s + d  Finally for asymptotic normalit y , when s > d/ 2, E k ˆ f (1) − f k 2 ∈ O ( n − s 2 s + d ) ∈ o ( n − 1 / 4 ). 15 C Analysis of LOO Estimator Pr o of of The or em 5 . First note that we can b ound the mean squared error via the bias and v ariance terms. E [( b T LOO − T ( f )) 2 ] ≤ | E b T LOO − T ( f ) | 2 + E [( b T LOO − E b T LOO ) 2 ] The bias is b ounded via a straightforw ard conditioning argumen t. E | b T LOO − T ( f ) | = E [ T ( ˆ f − i ) + ψ ( X i ; ˆ f − i ) − T ( f )] = E X − i h E X i [ T ( ˆ f − i ) + ψ ( X i ; ˆ f − i ) − T ( f )] i = E X − i h O ( k ˆ f − i − f k 2 ) i ≤ C 1 n − 2 s 2 s + d (20) for some constant C 1 . The last step follows by observing that the KDE achiev es the rate n − 2 s 2 s + d in integrated squared error. T o b ound the v ariance w e use the Efron-Stein inequality . F or this consider tw o sets of samples X n 1 = { X 1 , X 2 , . . . , X n } and X n 1 0 = { X 0 1 , X 2 , . . . , X n } whic h are the same except for the first p oin t. Denote the estimators obtained using X n 1 and X n 1 0 b y b T LOO and b T 0 LOO resp ectiv ely . T o apply Efron-Stein we shall b ound E [( b T LOO − b T 0 LOO ) 2 ]. Note that, | b T LOO − b T 0 LOO | ≤ 1 n | ψ ( X 1 ; ˆ f − 1 ) − ψ ( X 0 1 ; ˆ f − 1 ) | + 1 n X i 6 =1 | T ( ˆ f − i ) − T ( ˆ f 0 − i ) | + 1 n X i 6 =1 | ψ ( X i ; ˆ f − i ) − ψ ( X i ; ˆ f 0 − i ) | (21) The first term can b e b ounded by 2 k ψ k ∞ /n using the b oundedness of ψ . Each term inside the summation in the second term in ( 21 ) can b e b ounded via, | T ( ˆ f − i ) − T ( ˆ f 0 − i ) | ≤ L φ Z | ν ( ˆ f − i ) − ν ( ˆ f 0 − i ) | ≤ L ν L ν Z | ˆ f − i − ˆ f 0 − i | (22) ≤ L φ L ν Z 1 nh d    K  X 1 − u h  − K  X 0 1 − u h     d u ≤ k K k ∞ L φ L ν n . The substitution ( X 1 − u ) /h = z for integration eliminates the 1 /h d . Here L φ , L ν are the Lipschitz constan ts of φ, ν . T o apply Efron-Stein we need to b ound the exp ectation of the LHS ov er X 1 , X 0 1 , X 2 , . . . , X n . Since the first t wo terms in ( 21 ) are b ounded p oin t wise b y O (1 /n 2 ) they are also b ounded in expectation. By Jensen’s inequalit y we can write, | b T LOO − b T 0 LOO | 2 ≤ 12 k ψ k 2 ∞ n 2 + 3 k K k 2 ∞ L 2 φ L 2 ν n 2 + 3 n 2   X i 6 =1 | ψ ( X i ; ˆ f − i ) − ψ ( X i ; ˆ f 0 − i ) |   2 (23) F or the third, such a p oint wise b ound do es not hold so we will directly b ound the exp ectation. X 1 6 = i,j E h | ψ ( X i ; ˆ f − i ) − ψ ( X i ; ˆ f 0 − i ) || ψ ( X j ; ˆ f − j ) − ψ ( X j ; ˆ f 0 − j ) | i (24) W e then hav e, E  ( ψ ( X i ; ˆ f − i ) − ψ ( X i ; ˆ f 0 − i )) 2  ≤ E X 1 ,X 0 1  C Z | ˆ f − i − ˆ f 0 − i | 2  ≤ C B 2 Z 1 n 2 h 2 d  K  x 1 − u h  − K  x 0 1 − u h  2 d x 1 d x 0 1 u ≤ 2 C B 2 k K k 2 ∞ n 2 16 I the first step we hav e used Assumption 4 and in the last step the substitutions ( x 1 − x i ) /h = u and ( x 1 − x j ) /h = v remo ves the 1 /h d t wice. Then, by applying Cauch y Sc hw arz each term inside the summation in ( 24 ) is O (1 /n 2 ). Since each term inside equation ( 24 ) is O (1 /n 2 ) and there are ( n − 1) 2 terms it is O (1). 
Com bining all these results with equation ( 23 ) w e get, E [( b T LOO − b T 0 LOO ) 2 ] ∈ O  1 n 2  No w, b y applying the Efron-Stein inequality we get V ( b T LOO ) ≤ C 2 n . Therefore the mean squared error E [( T − b T LOO ) 2 ] ∈ O ( n − 4 s 2 s + d + n − 1 ) whic h completes the pro of. D Pro ofs of Results on F unctionals of Tw o Distributions D.1 DS Estimator W e generalise the results in App endix B to analyse the DS estimator for tw o distributions. As b efore we b egin with a series of lemmas. Lemma 17. The influenc e functions have zer o me an. I.e. E P 1 [ ψ 1 ( X ; P 1 ; P 2 )] = 0 ∀ P 2 ∈ M E P 2 [ ψ 2 ( Y ; P 1 ; P 2 )] = 0 ∀ P 1 ∈ M (25) Pr o of. 0 = T 0 i ( P i − P i ; P 1 ; P 2 ) = R ψ i ( u ; P 1 , P 2 )d P i ( u ) for i = 1 , 2. Lemma 18 (Bias & V ariance of ( 9 )) . L et ˆ f (1) , ˆ g (1) b e c onsistent estimators for f , g in the L 2 sense. L et T have b ounde d se c ond derivatives and let sup x ψ f ( x ; f , g ) , sup x ψ g ( x ; f , g ) , V f ψ ( X ; f 0 , g 0 ) , V g ψ g ( X ; f 0 , g 0 ) b e b ounde d for al l f , f 0 , g , g 0 ∈ M . Then the bias of b T (1) DS c onditione d on X n/ 2 1 , Y m/ 2 1 is | T − E [ b T (1) DS | X n/ 2 1 , Y m/ 2 1 ] ∈ O ( k f − ˆ f (1) k 2 + k g − ˆ g (1) k 2 ) . The c onditional varianc e is V [ b T (1) DS | X n/ 2 1 , Y m/ 2 1 ] ∈ O ( n − 1 + m − 1 ) . Pr o of. First consider the bias conditioned on X n/ 2 1 , Y m/ 2 1 , E h b T (1) DS − T ( f , g ) | X n/ 2 1 , Y m/ 2 1 i = E   T ( ˆ f (1) , ˆ g (1) ) + 2 n n X i = n/ 2+1 ψ f ( X i ; ˆ f (1) , ˆ g (1) ) + 2 m m X j = m/ 2+1 ψ g ( Y j ; ˆ f (1) , ˆ g (1) ) − T ( f , g )      X n/ 2 1 , Y m/ 2 1   = T ( ˆ f (1) , ˆ g (1) ) + Z ψ f ( x ; ˆ f (1) , ˆ g (1) ) f ( x )d µ ( x ) + Z ψ g ( x ; ˆ f (1) , ˆ g (1) ) g ( x )d µ ( x ) − T ( f , g ) = O  k f − ˆ f (1) k 2 + k g − ˆ g (1) k 2  The last step follows from the b oundedness of the second deriv atives from whic h the first order functional T aylor expansion ( 6 ) holds. The conditional v ariance is, V h b T (1) DS | X n/ 2 1 , Y m/ 2 1 i = V " 1 n 2 n X i = n +1 ψ f ( X i ; ˆ f (1) , ˆ g (1) )    X n/ 2 1 # + V   1 m 2 m X j = m +1 ψ g ( Y j ; ˆ f (1) , ˆ g (1) )    Y m/ 2 1   = 1 n V f h ψ f ( X ; ˆ f (1) , ˆ g (1) ) i + 1 m V g h ψ g ( Y ; ˆ f (1) , ˆ g (1) ) i ∈ O  1 n + 1 m  The last step follo ws from the b oundedness of the v ariance of the influence functions. 17 The follo wing lemma characterises conditions for asymptotic normality . Lemma 19 (Asymptotic Normality) . Supp ose, in addition to the c onditions in The or em 18 ab ove and the r e gularity assumption 4 we also have k ˆ f − f k ∈ o P ( n − 1 / 4 ) , k ˆ g − g k ∈ o P ( m − 1 / 4 ) and ψ f , ψ g 6 = 0 . Then we have asymptotic Normality for b T DS , √ N ( b T DS − T ( f , g )) D − → N  0 , 1 ζ V f [ ψ f ( X ; f , g )] + 1 1 − ζ V g [ ψ g ( Y ; f , g )]  (26) Pr o of. W e b egin with the following expansions around ( ˆ f (1) , ˆ g (1) ), T ( f , g ) = T ( ˆ f (1) , ˆ g (1) ) + Z ψ f ( u ; ˆ f (1) , ˆ g (1) ) f ( u )d u + Z ψ g ( u ; ˆ f (1) , ˆ g (1) ) g ( u )d u + O  k f − ˆ f (1) k 2 + k g − ˆ g (1) k 2  Consider b T (1) DS . W e can write r N 2 ( b T (1) DS − T ( f )) (27) = r N 2   T ( ˆ f (1) , ˆ g (1) ) + 2 n n X i = n/ 2+1 ψ f ( X i ; f , g ) + 2 m m X j = m/ 2+1 ψ g ( Y j ; f , g ) − T ( f , g )   = r N 2 2 n n X i = n/ 2+1 ψ ( X i ; ˆ f (1) , ˆ g (1) ) + 2 m m X j = m/ 2+1 ψ ( X j ; ˆ f (1) , ˆ g (1) ) − E f h ψ ( X ; ˆ f (1) , ˆ g (1) ) i − E g h ψ ( X ; ˆ f (1) , ˆ g (1) ) i ! 
$$\quad + \sqrt{N}\,O\big(\|f - \hat f^{(1)}\|^2 + \|g - \hat g^{(1)}\|^2\big)$$
$$= \sqrt{\tfrac{2N}{n}}\,n^{-1/2}\sum_{i=n/2+1}^{n}\Big(\psi_f(X_i; \hat f^{(1)}, \hat g^{(1)}) - \psi_f(X_i; f, g) - \big(\mathbb{E}_f\psi_f(X; \hat f^{(1)}, \hat g^{(1)}) - \mathbb{E}_f\psi_f(X; f, g)\big)\Big)$$
$$\quad + \sqrt{\tfrac{2N}{m}}\,m^{-1/2}\sum_{j=m/2+1}^{m}\Big(\psi_g(Y_j; \hat f^{(1)}, \hat g^{(1)}) - \psi_g(Y_j; f, g) - \big(\mathbb{E}_g\psi_g(Y; \hat f^{(1)}, \hat g^{(1)}) - \mathbb{E}_g\psi_g(Y; f, g)\big)\Big)$$
$$\quad + \sqrt{\tfrac{2N}{n}}\,n^{-1/2}\sum_{i=n/2+1}^{n}\psi_f(X_i; f, g) + \sqrt{\tfrac{2N}{m}}\,m^{-1/2}\sum_{j=m/2+1}^{m}\psi_g(Y_j; f, g) + \sqrt{N}\,O\big(\|f - \hat f^{(1)}\|^2 + \|g - \hat g^{(1)}\|^2\big)$$
The fifth term is $o_P(1)$ by the assumptions. The first and second terms are also $o_P(1)$. To see this, denote the first term by $Q_n$. Then,
$$\mathbb{V}\big[Q_n\mid X_1^{n/2}, Y_1^{m/2}\big] = \frac{N}{n}\,\mathbb{V}_f\Big[\psi_f(X; \hat f^{(1)}, \hat g^{(1)}) - \psi_f(X; f, g) - \big(\mathbb{E}_f\psi_f(X; \hat f^{(1)}, \hat g^{(1)}) - \mathbb{E}_f\psi_f(X; f, g)\big)\Big] \leq \frac{N}{n}\,\mathbb{E}_f\Big[\big(\psi_f(X; \hat f^{(1)}, \hat g^{(1)}) - \psi_f(X; f, g)\big)^2\Big] \to 0$$
where we have used the regularity Assumption 4. Further, $\mathbb{P}\big(|Q_n| > \epsilon\mid X_1^{n/2}, Y_1^{m/2}\big) \leq \mathbb{V}[Q_n\mid X_1^{n/2}, Y_1^{m/2}]/\epsilon^2 \to 0$, hence the first term is $o_P(1)$. The proof for the second term is similar. Therefore we have,
$$\sqrt{\tfrac{N}{2}}\big(\hat T^{(1)}_{\mathrm{DS}} - T(f, g)\big) = \sqrt{\tfrac{2N}{n}}\,n^{-1/2}\sum_{i=n/2+1}^{n}\psi_f(X_i; f, g) + \sqrt{\tfrac{2N}{m}}\,m^{-1/2}\sum_{j=m/2+1}^{m}\psi_g(Y_j; f, g) + o_P(1)$$
Using a similar argument on $\hat T^{(2)}_{\mathrm{DS}}$ we get,
$$\sqrt{\tfrac{N}{2}}\big(\hat T^{(2)}_{\mathrm{DS}} - T(f, g)\big) = \sqrt{\tfrac{2N}{n}}\,n^{-1/2}\sum_{i=1}^{n/2}\psi_f(X_i; f, g) + \sqrt{\tfrac{2N}{m}}\,m^{-1/2}\sum_{j=1}^{m/2}\psi_g(Y_j; f, g) + o_P(1)$$
Therefore,
$$\sqrt{N}\big(\hat T_{\mathrm{DS}} - T(f, g)\big) = \frac{1}{\sqrt 2}\left(\sqrt{\tfrac{2N}{n}}\,n^{-1/2}\sum_{i=1}^{n}\psi_f(X_i; f, g) + \sqrt{\tfrac{2N}{m}}\,m^{-1/2}\sum_{j=1}^{m}\psi_g(Y_j; f, g)\right) + o_P(1) = \sqrt{\tfrac{N}{n}}\,n^{-1/2}\sum_{i=1}^{n}\psi_f(X_i; f, g) + \sqrt{\tfrac{N}{m}}\,m^{-1/2}\sum_{j=1}^{m}\psi_g(Y_j; f, g) + o_P(1)$$
By the CLT and Slutsky's theorem this converges weakly to the RHS of (26).

We are now ready to prove the rates of convergence for the DS estimator in the Hölder class.

Proof of Theorem 13. We first note that in a Hölder class, with $n$ samples the KDE achieves the rate $\mathbb{E}\|p - \hat p\|^2 \in O\big(n^{-\frac{2s}{2s+d}}\big)$. Then the bias of the preliminary estimator $\hat T^{(1)}_{\mathrm{DS}}$ is,
$$\mathbb{E}\big[\hat T^{(1)}_{\mathrm{DS}} - T(f, g)\big] = \mathbb{E}_{X_1^{n/2}, Y_1^{m/2}}\Big[O\big(\|f - \hat f^{(1)}\|^2 + \|g - \hat g^{(1)}\|^2\big)\Big] \in O\Big(n^{-\frac{2s}{2s+d}} + m^{-\frac{2s}{2s+d}}\Big)$$
The same holds for $\hat T^{(2)}_{\mathrm{DS}}$. It therefore follows that
$$\mathbb{E}\big[\hat T_{\mathrm{DS}} - T\big] = \mathbb{E}\left[\frac{1}{2}\big(\hat T^{(1)}_{\mathrm{DS}} - T(f, g)\big) + \frac{1}{2}\big(\hat T^{(2)}_{\mathrm{DS}} - T(f, g)\big)\right] \in O\Big(n^{-\frac{2s}{2s+d}} + m^{-\frac{2s}{2s+d}}\Big)$$
For the variance, we use Lemma 18 and the law of total variance to first control $\mathbb{V}\hat T^{(1)}_{\mathrm{DS}}$,
$$\mathbb{V}\big[\hat T^{(1)}_{\mathrm{DS}}\big] = \frac{2}{n}\mathbb{E}\Big[\mathbb{V}_f\big[\psi_f(X; \hat f^{(1)}, \hat g^{(1)})\mid X_1^{n/2}\big]\Big] + \frac{2}{m}\mathbb{E}\Big[\mathbb{V}_g\big[\psi_g(Y; \hat f^{(1)}, \hat g^{(1)})\mid Y_1^{m/2}\big]\Big] + \mathbb{V}\Big[\mathbb{E}\big[\hat T^{(1)}_{\mathrm{DS}} - T(f, g)\mid X_1^{n/2}, Y_1^{m/2}\big]\Big]$$
$$\in O\left(\frac{1}{n} + \frac{1}{m}\right) + \mathbb{E}\Big[O\big(\|f - \hat f^{(1)}\|^4 + \|g - \hat g^{(1)}\|^4\big)\Big] \in O\Big(n^{-1} + m^{-1} + n^{-\frac{4s}{2s+d}} + m^{-\frac{4s}{2s+d}}\Big)$$
In the second step we used the fact that $\mathbb{V}Z \leq \mathbb{E}Z^2$. Further, $\mathbb{E}_{X_1^{n/2}}\mathbb{V}_f\big[\psi_f(X; \hat f^{(1)}, \hat g^{(1)})\big]$ and $\mathbb{E}_{Y_1^{m/2}}\mathbb{V}_g\big[\psi_g(Y; \hat f^{(1)}, \hat g^{(1)})\big]$ are bounded since $\psi_f, \psi_g$ are bounded. Then, by applying the Cauchy-Schwarz inequality as before, we get $\mathbb{V}\hat T_{\mathrm{DS}} \in O\big(n^{-1} + m^{-1} + n^{-\frac{4s}{2s+d}} + m^{-\frac{4s}{2s+d}}\big)$. Finally, when $s > d/2$ we have the required $o_P(n^{-1/4}), o_P(m^{-1/4})$ rates on $\|\hat f - f\|$ and $\|\hat g - g\|$, which gives us asymptotic normality.
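To illustrate the two-sample DS construction just analysed, the sketch below instantiates it for the quadratic divergence $T(f, g) = \int(f - g)^2$, which is not one of the worked examples above but whose influence functions follow from a short calculation: $\psi_f(x) = 2(f - g)(x) - 2\int(f - g)f$ and $\psi_g(y) = -2(f - g)(y) + 2\int(f - g)g$. It is a 1-dimensional illustration only; the Gaussian KDE, the bandwidth and the grid-based integration are placeholder choices, not the constructions used in the theory.

```python
import numpy as np

def kde(t, S, h):
    """Gaussian KDE fitted on the 1-d sample S, evaluated on the array t."""
    z = (t[:, None] - S[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (np.sqrt(2 * np.pi) * h)

def ds_half(X, Y, h, grid):
    """One half of the DS estimate of T(f,g) = ∫(f-g)^2: densities are fit on
    the first halves, influence-function means taken on the second halves."""
    X1, X2 = X[:len(X)//2], X[len(X)//2:]
    Y1, Y2 = Y[:len(Y)//2], Y[len(Y)//2:]
    dg = grid[1] - grid[0]
    f_grid, g_grid = kde(grid, X1, h), kde(grid, Y1, h)
    diff = f_grid - g_grid
    T_plugin = np.sum(diff**2) * dg
    # psi_f(x) = 2(f-g)(x) - 2∫(f-g)f ;  psi_g(y) = -2(f-g)(y) + 2∫(f-g)g
    psi_f = 2 * (kde(X2, X1, h) - kde(X2, Y1, h)) - 2 * np.sum(diff * f_grid) * dg
    psi_g = -2 * (kde(Y2, X1, h) - kde(Y2, Y1, h)) + 2 * np.sum(diff * g_grid) * dg
    return T_plugin + psi_f.mean() + psi_g.mean()

rng = np.random.default_rng(1)
X, Y = rng.normal(0.0, 1.0, 2000), rng.normal(0.5, 1.0, 2000)
grid = np.linspace(-6.0, 7.0, 2001)
# Symmetrize: reversing the arrays swaps the roles of the two halves.
est = 0.5 * (ds_half(X, Y, 0.3, grid) + ds_half(X[::-1], Y[::-1], 0.3, grid))
print(est)   # truth ≈ 0.034 for N(0,1) vs N(0.5,1)
```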
D.2 LOO Estimator

Proof of Theorem 7. Assume w.l.o.g. that $n > m$. As before, the bias follows via conditioning,
$$\big|\mathbb{E}\hat T_{\mathrm{LOO}} - T(f, g)\big| = \Big|\mathbb{E}\big[T(\hat f_{-i}, \hat g_{-i}) + \psi_f(X_i; \hat f_{-i}, \hat g_{-i}) + \psi_g(Y_i; \hat f_{-i}, \hat g_{-i}) - T(f, g)\big]\Big| \leq \mathbb{E}\Big[O\big(\|\hat f_{-i} - f\|^2 + \|\hat g_{-i} - g\|^2\big)\Big] \leq C_1\Big(n^{-\frac{2s}{2s+d}} + m^{-\frac{2s}{2s+d}}\Big)$$
for some constant $C_1$.

To bound the variance we use the Efron-Stein inequality. Consider the samples $\{X_1, \ldots, X_n, Y_1, \ldots, Y_m\}$ and $\{X_1', X_2, \ldots, X_n, Y_1, \ldots, Y_m\}$, and denote the estimates obtained by $\hat T_{\mathrm{LOO}}$ and $\hat T'_{\mathrm{LOO}}$ respectively. Recall that we need to bound $\mathbb{E}[(\hat T_{\mathrm{LOO}} - \hat T'_{\mathrm{LOO}})^2]$. Note that,
$$|\hat T_{\mathrm{LOO}} - \hat T'_{\mathrm{LOO}}| \leq \frac{1}{n}\big|\psi_f(X_1; \hat f_{-1}, \hat g_{-1}) - \psi_f(X_1'; \hat f_{-1}, \hat g_{-1})\big| + \frac{1}{n}\sum_{i\neq 1}\Big(\big|T(\hat f_{-i}, \hat g_{-i}) - T(\hat f'_{-i}, \hat g_{-i})\big| + \big|\psi_f(X_i; \hat f_{-i}, \hat g_{-i}) - \psi_f(X_i; \hat f'_{-i}, \hat g_{-i})\big| + \big|\psi_g(Y_i; \hat f_{-i}, \hat g_{-i}) - \psi_g(Y_i; \hat f'_{-i}, \hat g_{-i})\big|\Big)$$
The first term can be bounded by $2\|\psi_f\|_\infty/n$ using the boundedness of the influence function on bounded densities. By an argument similar to Equation (22) in the one-distribution case, we can also bound each term inside the summation in the second term via,
$$\big|T(\hat f_{-i}, \hat g_{-i}) - T(\hat f'_{-i}, \hat g_{-i})\big| \leq \frac{\|K\|_\infty L_\phi L_\nu}{n}$$
Then, by Jensen's inequality we have,
$$|\hat T_{\mathrm{LOO}} - \hat T'_{\mathrm{LOO}}|^2 \leq \frac{8\|\psi_f\|_\infty^2}{n^2} + \frac{4\|K\|_\infty^2 L_\phi^2 L_\nu^2}{n^2} + \frac{4}{n^2}\Bigg(\sum_{i\neq 1}\big|\psi_f(X_i; \hat f_{-i}, \hat g_{-i}) - \psi_f(X_i; \hat f'_{-i}, \hat g_{-i})\big|\Bigg)^2 + \frac{4}{n^2}\Bigg(\sum_{i\neq 1}\big|\psi_g(Y_i; \hat f_{-i}, \hat g_{-i}) - \psi_g(Y_i; \hat f'_{-i}, \hat g_{-i})\big|\Bigg)^2$$
The third and fourth terms can be bounded in expectation using the same technique used for the third term of (23). Precisely, by using Assumption 4 and Cauchy-Schwarz we get,
$$\mathbb{E}\Big[\big|\psi_f(X_i; \hat f_{-i}, \hat g_{-i}) - \psi_f(X_i; \hat f'_{-i}, \hat g_{-i})\big|\,\big|\psi_f(X_j; \hat f_{-j}, \hat g_{-j}) - \psi_f(X_j; \hat f'_{-j}, \hat g_{-j})\big|\Big] \leq \frac{2CB^2\|K\|_\infty^2}{n^2}$$
$$\mathbb{E}\Big[\big|\psi_g(Y_i; \hat f_{-i}, \hat g_{-i}) - \psi_g(Y_i; \hat f'_{-i}, \hat g_{-i})\big|\,\big|\psi_g(Y_j; \hat f_{-j}, \hat g_{-j}) - \psi_g(Y_j; \hat f'_{-j}, \hat g_{-j})\big|\Big] \leq \frac{2CB^2\|K\|_\infty^2}{n^2}$$
This leads to an $O(1/n^2)$ bound on $\mathbb{E}[(\hat T_{\mathrm{LOO}} - \hat T'_{\mathrm{LOO}})^2]$,
$$\mathbb{E}\big[(\hat T_{\mathrm{LOO}} - \hat T'_{\mathrm{LOO}})^2\big] \leq \frac{8\|\psi_f\|_\infty^2 + 4\|K\|_\infty^2 L_\phi^2 L_\nu^2 + 16CB^2\|K\|_\infty^2}{n^2}$$
Now consider the samples $\{X_1, \ldots, X_n, Y_1, \ldots, Y_m\}$ and $\{X_1, \ldots, X_n, Y_1', Y_2, \ldots, Y_m\}$, and denote the estimates obtained by $\hat T_{\mathrm{LOO}}$ and $\hat T'_{\mathrm{LOO}}$ respectively. Note that some of the $Y$ instances are repeated, but each point occurs at most $\lceil n/m\rceil$ times. The remaining argument is exactly the same, except that we need to account for this repetition. We have,
$$|\hat T_{\mathrm{LOO}} - \hat T'_{\mathrm{LOO}}| \leq \frac{n}{m}\cdot\frac{1}{n}\big|\psi_g(Y_1; \hat f_{-1}, \hat g_{-1}) - \psi_g(Y_1'; \hat f_{-1}, \hat g_{-1})\big| + \frac{n}{m}\cdot\frac{1}{n}\sum_{i\neq 1}\Big(\big|T(\hat f_{-i}, \hat g_{-i}) - T(\hat f_{-i}, \hat g'_{-i})\big| + \big|\psi_f(X_i; \hat f_{-i}, \hat g_{-i}) - \psi_f(X_i; \hat f_{-i}, \hat g'_{-i})\big| + \big|\psi_g(Y_i; \hat f_{-i}, \hat g_{-i}) - \psi_g(Y_i; \hat f_{-i}, \hat g'_{-i})\big|\Big) \qquad (28)$$
And hence,
$$\mathbb{E}\big[(\hat T_{\mathrm{LOO}} - \hat T'_{\mathrm{LOO}})^2\big] \leq \frac{\|\psi_g\|_\infty^2}{m^2} + \frac{n^2}{m^4}\,4\|K\|_\infty^2 L_\phi^2 L_\nu^2 + O\left(\frac{n^4}{m^6}\right)$$
where the last two terms of (28) are bounded by $O(n^4/m^6)$ after squaring and taking the expectation. We have been a little sloppy in bounding the number of repetitions by $n/m$ rather than $\lceil n/m\rceil$, but it is clear that this does not affect the rate.

Finally, by the Efron-Stein inequality we have
$$\mathbb{V}(\hat T_{\mathrm{LOO}}) \in O\left(\frac{1}{n} + \frac{n^4}{m^5}\right),$$
which is $O(1/n + 1/m)$ if $n$ and $m$ are of the same order. This is the case if, for instance, there exist $\zeta_l, \zeta_u \in (0, 1)$ such that $\zeta_l \leq n/N \leq \zeta_u$, where $N = n + m$. Therefore the mean squared error is $\mathbb{E}[(T - \hat T_{\mathrm{LOO}})^2] \in O\big(n^{-\frac{4s}{2s+d}} + m^{-\frac{4s}{2s+d}} + n^{-1} + m^{-1}\big)$, which completes the proof.
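For completeness, here is the matching sketch of the two-sample LOO estimator, again for the illustrative functional $T(f, g) = \int(f - g)^2$ used in the DS sketch above, including the index recycling used in the proof when $n > m$ (iteration $i$ leaves out $X_i$ and $Y_{i \bmod m}$, so each $Y_j$ is reused at most $\lceil n/m\rceil$ times). The KDE, bandwidth and grid integration are again illustrative assumptions.

```python
import numpy as np

def kde(t, S, h):
    """Gaussian KDE fitted on the 1-d sample S, evaluated at t (scalar or array)."""
    z = (np.atleast_1d(t)[:, None] - S[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (np.sqrt(2 * np.pi) * h)

def loo_estimate(X, Y, h, grid):
    """LOO estimate of T(f,g) = ∫(f-g)^2; Y-points are recycled when n > m."""
    n, m = len(X), len(Y)
    dg = grid[1] - grid[0]
    vals = np.empty(n)
    for i in range(n):
        Xmi, Ymj = np.delete(X, i), np.delete(Y, i % m)
        f_grid, g_grid = kde(grid, Xmi, h), kde(grid, Ymj, h)
        diff = f_grid - g_grid
        T_plugin = np.sum(diff**2) * dg
        diff_at_x = kde(X[i], Xmi, h)[0] - kde(X[i], Ymj, h)[0]
        diff_at_y = kde(Y[i % m], Xmi, h)[0] - kde(Y[i % m], Ymj, h)[0]
        psi_f = 2 * diff_at_x - 2 * np.sum(diff * f_grid) * dg
        psi_g = -2 * diff_at_y + 2 * np.sum(diff * g_grid) * dg
        vals[i] = T_plugin + psi_f + psi_g
    return vals.mean()

rng = np.random.default_rng(2)
X, Y = rng.normal(0.0, 1.0, 300), rng.normal(0.5, 1.0, 200)
grid = np.linspace(-6.0, 7.0, 1500)
print(loo_estimate(X, Y, h=0.35, grid=grid))   # truth ≈ 0.034
```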
E Proof of Lower Bound (Theorem 8)

We will prove the lower bound in the bounded Hölder class $\Sigma(s, L, B, B')$, noting that the lower bound then also applies to $\Sigma(s, L)$. Our main tool is LeCam's method, whereby we reduce the estimation problem to a testing problem. In the testing problem we construct a set of alternatives satisfying certain separation properties from the null. For this we use some technical results from Birgé and Massart [1995] and Krishnamurthy et al. [2014]. First we state LeCam's method, adapted to our setting. We define the squared Hellinger divergence between two distributions $P, Q$ with densities $p, q$ to be
$$H^2(P, Q) = \int\Big(\sqrt{p(x)} - \sqrt{q(x)}\Big)^2\,\mathrm{d}x = 2 - 2\int\sqrt{p(x)q(x)}\,\mathrm{d}x$$

Theorem 20. Let $T : \mathcal{M}\times\mathcal{M}\to\mathbb{R}$. Consider a parameter space $\Theta\subset\mathcal{M}\times\mathcal{M}$ such that $(f, g)\in\Theta$ and $(p_\lambda, q_\lambda)\in\Theta$ for all $\lambda$ in some index set $\Lambda$. Denote the distributions of $f, g, p_\lambda, q_\lambda$ by $F, G, P_\lambda, Q_\lambda$ respectively, and define $\mathbb{P}\times\mathbb{Q} = \frac{1}{|\Lambda|}\sum_{\lambda\in\Lambda}P_\lambda^n\times Q_\lambda^m$. If there exist $(f, g)\in\Theta$, $\gamma < 2$ and $\beta > 0$ such that the following two conditions are satisfied,
$$H^2\big(F^n\times G^m,\, \mathbb{P}\times\mathbb{Q}\big) \leq \gamma, \qquad T(p_\lambda, q_\lambda) \geq T(f, g) + 2\beta \;\;\forall\lambda\in\Lambda,$$
then,
$$\inf_{\hat T}\sup_{(f, g)\in\Theta}\mathbb{P}\Big(\big|\hat T - T(f, g)\big| > \beta\Big) \geq \frac{1}{2}\Big(1 - \sqrt{\gamma(1 - \gamma/4)}\Big) > 0.$$

Proof. The proof is a straightforward modification of Theorem 2.2 of Tsybakov [2008], which we provide here for completeness. Let $\Theta_0 = \{(p, q)\in\Theta \mid T(p, q) \leq T(f, g)\}$ and $\Theta_1 = \{(p, q)\in\Theta \mid T(p, q) \geq T(f, g) + 2\beta\}$. Hence $(f, g)\in\Theta_0$ and $(p_\lambda, q_\lambda)\in\Theta_1$ for all $\lambda\in\Lambda$. Given $n$ samples from $p_0$ and $m$ samples from $q_0$, consider the simple-vs-simple hypothesis testing problem $H_0: (p_0, q_0)\in\Theta_0$ vs $H_1: (p_0, q_0)\in\Theta_1$. The probability of error $p_e$ of any test $\Psi$ is lower bounded by
$$p_e \geq \frac{1}{2}\left(1 - \sqrt{H^2\big(F^n\times G^m, \mathbb{P}\times\mathbb{Q}\big)\Big(1 - H^2\big(F^n\times G^m, \mathbb{P}\times\mathbb{Q}\big)/4\Big)}\right);$$
see Lemma 2.1, Lemma 2.3 and Theorem 2.2 of Tsybakov [2008]. Therefore,
$$\inf_{\Psi}\; p_e \geq \frac{1}{2}\Big(1 - \sqrt{\gamma(1 - \gamma/4)}\Big).$$
Since an error in the testing problem incurs an estimation error of at least $\beta$, the theorem follows.

Consider the set $\Gamma = \{-1, 1\}^\ell$ and a set of densities $p_\gamma = f\big(1 + \sum_{j=1}^{\ell}\gamma_j v_j\big)$ indexed by $\gamma\in\Gamma$. Here $f$ is itself a density and the $v_j$'s are perturbations of $f$. We will use the following result from Birgé and Massart [1995], which bounds the Hellinger divergence between the product distribution $F^n$ and the mixture of product distributions $\mathbb{P}^n = \frac{1}{|\Gamma|}\sum_{\gamma\in\Gamma}P_\gamma^n$.

Proposition 21. Let $\{R_1, \ldots, R_\ell\}$ be a partition of $[0, 1]^d$. Let each $\rho_j$ be zero except on $R_j$ and satisfy $\|\rho_j\|_\infty \leq 1$, $\int\rho_j f = 0$ and $\int\rho_j^2 f = \alpha_j$. Further, denote $\alpha = \sum_j\|\rho_j\|_\infty$, $s' = n\alpha^2\sup_j P(R_j)$ and $c = n\sup_j\alpha_j$. Then,
$$H^2(F^n, \mathbb{P}^n) \leq \frac{n^2}{3}\sum_{j=1}^{\ell}\alpha_j^2.$$

We also use the following technical result from Krishnamurthy et al. [2014], adapted to our setting.
Proposition 22 (adapted from Krishnamurthy et al. [2014]). Let $R_1, \ldots, R_\ell$ be a partition of $[0, 1]^d$ into cells of side length $\ell^{-1/d}$. There exist functions $u_1, \ldots, u_\ell$ such that
$$\mathrm{supp}(u_j) \subset \{x \mid B(x, \epsilon)\subset R_j\}, \qquad \int u_j^2 \in \Theta(\ell^{-1}), \qquad \int u_j = 0,$$
$$\int\psi_f(x; f, g)\,u_j(x) = \int\psi_g(x; f, g)\,u_j(x) = 0, \qquad \|D^r u_j\|_\infty \leq \ell^{|r|/d} \;\;\forall r \text{ s.t. } |r| = \textstyle\sum_i r_i \leq s + 1,$$
where $B(x, \epsilon)$ denotes an $L_2$ ball around $x$ with radius $\epsilon$; here $\epsilon$ is any number between $0$ and $1$.

Proof. We use an orthonormal system of $q\,(> 4)$ functions on $(0, 1)^d$ satisfying $\phi_1 = 1$, $\mathrm{supp}(\phi_j)\subset[\epsilon, 1 - \epsilon]^d$ for $j > 1$ and any $\epsilon > 0$, and $\|D^r\phi_j\|_\infty \leq J$ for some $J < \infty$. Now, for any given functions $\eta_1, \eta_2$ we can find a function $\upsilon$ such that $\upsilon\in\mathrm{span}(\{\phi_j\})$ and $\int\upsilon\phi_1 = \int\upsilon\eta_1 = \int\upsilon\eta_2 = 0$. Write $\upsilon = \sum_j c_j\phi_j$. Then $D^r\upsilon = \sum_j c_j D^r\phi_j$, which implies $\|D^r\upsilon\|_\infty \leq J\sqrt q$. Let $\nu(\cdot) = \frac{1}{J\sqrt q}\upsilon(\cdot)$. Clearly, $\int\nu^2$ is upper and lower bounded and $\|D^r\nu\|_\infty \leq 1$. To construct the functions $u_j$, we map $(0, 1)^d$ to $R_j$ by appropriately scaling it: $u_j(x) = \nu\big(\ell^{1/d}(x - x_j)\big)$, where $x_j$ is the point corresponding to $0$ after the mapping. Moreover, let $\eta_1$ be $\psi_f(\cdot; f, g)$ restricted to $R_j$ (and scaled back to fit $(0, 1)^d$), and let $\eta_2$ be the same with $\psi_g$. Now, $\int_{R_j}u_j^2 = \frac{1}{\ell}\int\nu^2 \in \Theta(\ell^{-1})$. Also, clearly $\|D^r u_j\|_\infty \leq \ell^{|r|/d}$. All five conditions above are satisfied.

We now have all the necessary ingredients to prove the lower bound.

Proof of Theorem 8. To apply Theorem 20 we need to construct the set of alternatives $\Lambda$, which contains tuples $(p_\lambda, q_\lambda)$ satisfying the conditions of Theorem 20. First apply Proposition 22 with $\ell = \ell_1$ to obtain the index set $\tilde\Gamma = \{-1, 1\}^{\ell_1}$ and the functions $u_1, \ldots, u_{\ell_1}$. Apply it again with $\ell = \ell_2$ to obtain the index set $\tilde\Delta = \{-1, 1\}^{\ell_2}$ and the functions $v_1, \ldots, v_{\ell_2}$. Define $\Gamma, \Delta$ to be the following sets of densities, perturbed around $f$ and $g$ respectively,
$$\Gamma = \Big\{p_\gamma = f + K_1\sum_{j=1}^{\ell_1}\gamma_j u_j \;\Big|\; \gamma\in\tilde\Gamma\Big\}, \qquad \Delta = \Big\{q_\delta = g + K_2\sum_{j=1}^{\ell_2}\delta_j v_j \;\Big|\; \delta\in\tilde\Delta\Big\}$$
Since the perturbations of Proposition 22 are concentrated on the small regions $R_j$, without rescaling they would violate the Hölder condition. The scalings $K_1$ and $K_2$ are necessary to shrink the perturbations and ensure that $p_\gamma, q_\delta\in\Sigma(s, L)$. By following an essentially identical argument to Krishnamurthy et al. [2014] (Section E.2), we have $p_\gamma\in\Sigma(s, L)$ if $K_1 \asymp \ell_1^{-s/d}$ and $q_\delta\in\Sigma(s, L)$ if $K_2 \asymp \ell_2^{-s/d}$. We will set $\ell_1$ and $\ell_2$ later to obtain the required rates. For future reference, denote $\mathbb{P}^n = \frac{1}{|\Gamma|}\sum_{\gamma\in\Gamma}P_\gamma^n$ and $\mathbb{Q}^m = \frac{1}{|\Delta|}\sum_{\delta\in\Delta}Q_\delta^m$.

Now our set of alternatives is formed by the product of $\Gamma$ and $\Delta$,
$$\Lambda = \Gamma\times\Delta = \{(p_\gamma, q_\delta) \mid p_\gamma\in\Gamma,\, q_\delta\in\Delta\}$$
First note that for any $(p_\lambda, q_\lambda) = (p_\gamma, q_\delta)\in\Lambda$, by the second order functional Taylor expansion we have,
$$T(p_\lambda, q_\lambda) = T(f, g) + \int\psi_f(x; f, g)\,p_\lambda + \int\psi_g(x; f, g)\,q_\lambda + R_2$$
By Lemma 17 and the construction, the first order terms vanish since,
$$\int\psi_f(x; f, g)\Big(f + K_1\sum_j\gamma_j u_j\Big) = K_1\sum_j\gamma_j\int\psi_f(x; f, g)\,u_j = 0.$$
The same is true for the $\int\psi_g(x; f, g)\,q_\lambda$ term.
The second order term can be bounded as follows,
$$R_2 = \phi''\Big(\int\nu(f_*, g_*)\Big)\left(\int\frac{\partial^2\nu(f_*(x), g_*(x))}{\partial f(x)^2}(p_\lambda - f)^2 + \int\frac{\partial^2\nu(f_*(x), g_*(x))}{\partial g(x)^2}(q_\lambda - g)^2 + 2\int\frac{\partial^2\nu(f_*(x), g_*(x))}{\partial f(x)\,\partial g(x)}(p_\lambda - f)(q_\lambda - g)\right)$$
$$\geq \sigma_{\min}\big(\|p_\lambda - f\|^2 + \|q_\lambda - g\|^2\big) \gtrsim \sigma_{\min}\big(K_1^2 + K_2^2\big)$$
For the second step, note that $(f_*, g_*)$ lies on the line segment between $(p_\lambda, q_\lambda)$ and $(f, g)$ and is therefore both upper and lower bounded; the Hessian evaluated at $(f_*, g_*)$ is thus strictly positive definite with some minimum eigenvalue $\sigma_{\min}$. For the third step we have used that $(p_\lambda - f, q_\lambda - g) = \big(K_1\sum_{j=1}^{\ell_1}\gamma_j u_j,\, K_2\sum_{j=1}^{\ell_2}\delta_j v_j\big)$, that the $u_j$'s have disjoint supports, and that $\sum_j\int u_j^2 \in \Theta(1)$ (similarly for the $v_j$'s). This establishes the $2\beta$ separation between the null and the alternatives required by Theorem 20, with $\beta = \sigma_{\min}(K_1^2 + K_2^2)/2$. Precisely,
$$T(p_\lambda, q_\lambda) \geq T(f, g) + \Omega\big(\ell_1^{-2s/d} + \ell_2^{-2s/d}\big)$$
Now we need to bound the Hellinger separation between $F^n\times G^m$ and $\mathbb{P}\times\mathbb{Q}$. First note that, by our construction,
$$\mathbb{P}\times\mathbb{Q} = \frac{1}{|\Lambda|}\sum_{\lambda\in\Lambda}P_\lambda^n\times Q_\lambda^m = \Big(\frac{1}{|\Gamma|}\sum_{\gamma\in\Gamma}P_\gamma^n\Big)\times\Big(\frac{1}{|\Delta|}\sum_{\delta\in\Delta}Q_\delta^m\Big) = \mathbb{P}^n\times\mathbb{Q}^m$$
By the tensorization property of the Hellinger affinity we have,
$$H^2\big(F^n\times G^m, \mathbb{P}\times\mathbb{Q}\big) = 2\left(1 - \Big(1 - \frac{H^2(F^n, \mathbb{P}^n)}{2}\Big)\Big(1 - \frac{H^2(G^m, \mathbb{Q}^m)}{2}\Big)\right) \leq H^2(F^n, \mathbb{P}^n) + H^2(G^m, \mathbb{Q}^m)$$
We now apply Proposition 21 to bound each Hellinger divergence. If we set $\rho_j(\cdot) = K_1 u_j(\cdot)/f(\cdot)$, then the $\rho_j$'s satisfy the conditions of the proposition, and further $p_\gamma = f\big(1 + \sum_j\gamma_j\rho_j\big)$, allowing us to use the bound. Accordingly, $\alpha_j = \int\rho_j^2 f \leq CK_1^2/\ell_1$ for some constant $C$. Hence,
$$H^2(F^n, \mathbb{P}^n) \leq \frac{n^2}{3}\sum_{j=1}^{\ell_1}\alpha_j^2 \leq \frac{C^2 n^2 K_1^4}{3\,\ell_1} \in O\Big(n^2\,\ell_1^{-\frac{4s+d}{d}}\Big).$$
A similar argument yields $H^2(G^m, \mathbb{Q}^m) \in O\big(m^2\,\ell_2^{-\frac{4s+d}{d}}\big)$. If we pick $\ell_1 = n^{\frac{2d}{4s+d}}$ and $\ell_2 = m^{\frac{2d}{4s+d}}$, and hence $K_1 \asymp n^{-\frac{2s}{4s+d}}$ and $K_2 \asymp m^{-\frac{2s}{4s+d}}$, then the Hellinger separation is bounded by a constant,
$$H^2\big(F^n\times G^m, \mathbb{P}\times\mathbb{Q}\big) \leq H^2(F^n, \mathbb{P}^n) + H^2(G^m, \mathbb{Q}^m) \in O(1)$$
Further, the separation satisfies $\beta \asymp K_1^2 + K_2^2 \asymp n^{-\frac{4s}{4s+d}} + m^{-\frac{4s}{4s+d}}$. The first part of the lower bound, for $\tau = 8s/(4s+d)$, is concluded by Markov's inequality,
$$\mathbb{E}\big[(\hat T - T(f, g))^2\big] \geq \big(n^{-\tau/2} + m^{-\tau/2}\big)^2\;\mathbb{P}\Big(\big|\hat T - T(f, g)\big| > n^{-\tau/2} + m^{-\tau/2}\Big) \geq c\,\big(n^{-\tau/2} + m^{-\tau/2}\big)^2,$$
where we note that $(n^{-\tau/2} + m^{-\tau/2})^2 \asymp n^{-\tau} + m^{-\tau}$. The $n^{-1} + m^{-1}$ lower bound is straightforward, as we cannot do better than the parametric rate [Bickel and Ritov, 1988]. See Krishnamurthy et al. [2014] for a proof that uses a contradiction argument in the setting $n = m$.
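As a quick sanity check on the parameter choice above, using only the two facts already derived ($K_1 \asymp \ell_1^{-s/d}$ and $H^2(F^n, \mathbb{P}^n) \lesssim n^2 K_1^4/\ell_1$), the balancing arithmetic works out as follows:
$$\ell_1 = n^{\frac{2d}{4s+d}} \;\Longrightarrow\; H^2(F^n, \mathbb{P}^n) \lesssim n^2\,\ell_1^{-\frac{4s+d}{d}} = n^2\cdot n^{-2} = O(1), \qquad \beta \asymp K_1^2 = \ell_1^{-\frac{2s}{d}} = n^{-\frac{4s}{4s+d}} = n^{-\tau/2} \;\text{ for }\; \tau = \frac{8s}{4s+d},$$
so the constant Hellinger separation and the claimed $n^{-\tau}$ rate are achieved simultaneously.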
F An Illustrative Example: The Conditional Tsallis Divergence

In this section we present a step-by-step guide to applying our framework to estimate a desired functional. We choose the conditional Tsallis divergence because, pedagogically, it is a good example from Table 1 with which to illustrate the technique. By following a similar procedure, one may derive an estimator for any desired functional. The estimators are derived in Section F.1, and in Section F.2 we discuss conditions for the theoretical guarantees and asymptotic normality.

The conditional Tsallis-$\alpha$ divergence ($\alpha \neq 0, 1$) between $X$ and $Y$ conditioned on $Z$ can be written in terms of the joint densities $p_{XZ}, p_{YZ}$,
$$CT_\alpha\big(p_{X|Z}\,\|\,p_{Y|Z};\, p_Z\big) = CT_\alpha(p_{XZ}, p_{YZ}) = \int p_Z(z)\,\frac{1}{\alpha-1}\left(\int p_{X|Z}^\alpha(u|z)\,p_{Y|Z}^{1-\alpha}(u|z)\,\mathrm{d}u - 1\right)\mathrm{d}z = \frac{1}{1-\alpha} + \frac{1}{\alpha-1}\int p_{XZ}^\alpha(u, z)\,p_{YZ}^\beta(u, z)\,\mathrm{d}u\,\mathrm{d}z$$
where we have set $\beta = 1 - \alpha$. We have samples
$$V_i = (X_i, Z^1_i) \sim p_{XZ},\;\; i = 1, \ldots, n \qquad\text{and}\qquad W_j = (Y_j, Z^2_j) \sim p_{YZ},\;\; j = 1, \ldots, m.$$
We will assume $p_{XZ}, p_{YZ} \in \Sigma(s, L, B, B')$. For brevity, we write $p = (p_{XZ}, p_{YZ})$ and $\hat p = (\hat p_{XZ}, \hat p_{YZ})$.

F.1 The Estimators

We first compute the influence functions of $CT_\alpha$ and then use them to derive the DS/LOO estimators.

Proposition 23 (Influence Functions of $CT_\alpha$). The influence functions of $CT_\alpha$ w.r.t. $p_{XZ}, p_{YZ}$ are
$$\psi_{XZ}(X, Z^1; p_{XZ}, p_{YZ}) = \frac{\alpha}{\alpha-1}\left(p_{XZ}^{\alpha-1}(X, Z^1)\,p_{YZ}^\beta(X, Z^1) - \int p_{XZ}^\alpha\,p_{YZ}^\beta\right) \qquad (29)$$
$$\psi_{YZ}(Y, Z^2; p_{XZ}, p_{YZ}) = -\left(p_{XZ}^\alpha(Y, Z^2)\,p_{YZ}^{\beta-1}(Y, Z^2) - \int p_{XZ}^\alpha\,p_{YZ}^\beta\right)$$

Proof. Recall that we can derive the influence functions via $\psi_{XZ}(X, Z^1; p) = CT'_{\alpha, XZ}(\delta_{X,Z^1} - p_{XZ};\, p)$ and $\psi_{YZ}(Y, Z^2; p) = CT'_{\alpha, YZ}(\delta_{Y,Z^2} - p_{YZ};\, p)$, where $CT'_{\alpha, XZ}, CT'_{\alpha, YZ}$ are the Gâteaux derivatives of $CT_\alpha$ w.r.t. $p_{XZ}, p_{YZ}$ respectively. Hence,
$$\psi_{XZ}(X, Z^1) = \frac{1}{\alpha-1}\,\frac{\partial}{\partial t}\int\big((1-t)p_{XZ} + t\delta_{X,Z^1}\big)^\alpha\,p_{YZ}^\beta\,\Bigg|_{t=0} = \frac{\alpha}{\alpha-1}\int p_{XZ}^{\alpha-1}\,p_{YZ}^\beta\,\big(\delta_{X,Z^1} - p_{XZ}\big)$$
from which the result follows. Deriving $\psi_{YZ}$ is similar. Alternatively, one can directly verify that $\psi_{XZ}, \psi_{YZ}$ in Equation (29) satisfy Definition 2.

DS estimator: Use $V_1^{n/2}, W_1^{m/2}$ to construct density estimates $\hat p^{(1)}_{XZ}, \hat p^{(1)}_{YZ}$ of $p_{XZ}, p_{YZ}$. Then use $V_{n/2+1}^{n}, W_{m/2+1}^{m}$ to add the sample means of the influence functions given in Proposition 23. This yields the preliminary estimator,
$$\widehat{CT}^{(1)}_\alpha = \frac{1}{1-\alpha} + \frac{\alpha}{\alpha-1}\cdot\frac{2}{n}\sum_{i=n/2+1}^{n}\left(\frac{\hat p^{(1)}_{XZ}(X_i, Z^1_i)}{\hat p^{(1)}_{YZ}(X_i, Z^1_i)}\right)^{\alpha-1} - \frac{2}{m}\sum_{j=m/2+1}^{m}\left(\frac{\hat p^{(1)}_{XZ}(Y_j, Z^2_j)}{\hat p^{(1)}_{YZ}(Y_j, Z^2_j)}\right)^{\alpha} \qquad (30)$$
The final estimate is $\widehat{CT}_{\alpha,\mathrm{DS}} = \big(\widehat{CT}^{(1)}_\alpha + \widehat{CT}^{(2)}_\alpha\big)/2$, where $\widehat{CT}^{(2)}_\alpha$ is obtained by swapping the two half-samples.

LOO estimator: Denote the density estimates of $p_{XZ}, p_{YZ}$ without the $i$-th sample by $\hat p_{XZ,-i}$ and $\hat p_{YZ,-i}$. Then the LOO estimator is,
$$\widehat{CT}_{\alpha,\mathrm{LOO}} = \frac{1}{1-\alpha} + \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\alpha}{\alpha-1}\left(\frac{\hat p_{XZ,-i}(X_i, Z^1_i)}{\hat p_{YZ}(X_i, Z^1_i)}\right)^{\alpha-1} - \left(\frac{\hat p_{XZ}(Y_i, Z^2_i)}{\hat p_{YZ,-i}(Y_i, Z^2_i)}\right)^{\alpha}\right] \qquad (31)$$
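As a concrete illustration of (30), the sketch below implements one half of the DS estimate with a 2-d product-Gaussian KDE, where each row of `V` is a pair $(X_i, Z^1_i)$ and each row of `W` a pair $(Y_j, Z^2_j)$. The bandwidth and the data-generating process in the demo are arbitrary placeholder choices; in particular, no attempt is made to enforce the boundedness assumptions required by the theory.

```python
import numpy as np

def kde2(points, S, h):
    """Product-Gaussian KDE for the 2-d sample S (n, 2), evaluated at points (k, 2)."""
    z = (points[:, None, :] - S[None, :, :]) / h
    return np.exp(-0.5 * (z**2).sum(axis=-1)).mean(axis=1) / (2 * np.pi * h**2)

def ct_ds_half(V, W, alpha, h):
    """One half of (30): fit p_XZ, p_YZ on the first halves of V, W, then
    average powers of the density ratio over the second halves."""
    V1, V2 = V[:len(V)//2], V[len(V)//2:]
    W1, W2 = W[:len(W)//2], W[len(W)//2:]
    r_V = kde2(V2, V1, h) / kde2(V2, W1, h)   # p_XZ / p_YZ at the held-out V_i's
    r_W = kde2(W2, V1, h) / kde2(W2, W1, h)   # p_XZ / p_YZ at the held-out W_j's
    return (1 / (1 - alpha)
            + alpha / (alpha - 1) * np.mean(r_V**(alpha - 1))
            - np.mean(r_W**alpha))

rng = np.random.default_rng(3)
z1 = rng.normal(size=1000)
V = np.column_stack([z1 + rng.normal(size=1000), z1])          # (X, Z^1) pairs
z2 = rng.normal(size=1000)
W = np.column_stack([z2 + 0.5 + rng.normal(size=1000), z2])    # (Y, Z^2) pairs
alpha, h = 0.7, 0.3
# Symmetrize: reversing the rows swaps the roles of the two halves.
est = 0.5 * (ct_ds_half(V, W, alpha, h) + ct_ds_half(V[::-1], W[::-1], alpha, h))
print(est)
```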
F.2 Analysis and Asymptotic Confidence Intervals

We begin with a functional Taylor expansion of $CT_\alpha(f, g)$ around $(f_0, g_0)$. Since $\alpha, \beta \neq 0, 1$, we can bound the second order terms by $O\big(\|f - f_0\|^2 + \|g - g_0\|^2\big)$,
$$CT_\alpha(f, g) = CT_\alpha(f_0, g_0) + \frac{\alpha}{\alpha-1}\int f_0^{\alpha-1}g_0^\beta\,(f - f_0) - \int f_0^\alpha g_0^{\beta-1}\,(g - g_0) + O\big(\|f - f_0\|^2 + \|g - g_0\|^2\big) \qquad (32)$$
Precisely, the second order remainder is,
$$\frac{\alpha^2}{\alpha-1}\int f_*^{\alpha-2}g_*^\beta\,(f - f_0)^2 - \beta\int f_*^\alpha g_*^{\beta-2}\,(g - g_0)^2 + \frac{\alpha\beta}{\alpha-1}\int f_*^{\alpha-1}g_*^{\beta-1}\,(f - f_0)(g - g_0)$$
where $(f_*, g_*)$ lies on the line segment between $(f, g)$ and $(f_0, g_0)$. If $f, g, f_0, g_0$ are bounded above and below, then so are $f_*, g_*$ and $f_*^a g_*^b$, where $a, b$ are exponents depending on $\alpha$. The first two terms are respectively $O(\|f - f_0\|^2)$ and $O(\|g - g_0\|^2)$. The cross term can be bounded via $\big|\int(f - f_0)(g - g_0)\big| \leq \int\max\{|f - f_0|^2, |g - g_0|^2\} \in O\big(\|f - f_0\|^2 + \|g - g_0\|^2\big)$.

As mentioned earlier, the boundedness of the densities gives us the rates of Theorems 7 and 13 for both estimators. For the DS estimator, to show asymptotic normality we need to verify the conditions of Lemma 19. We state the result formally below, but prove it at the end of this section.

Corollary 24. Let $p_{XZ}, p_{YZ} \in \Sigma(s, L, B, B')$. Then $\widehat{CT}_{\alpha,\mathrm{DS}}$ is asymptotically normal whenever $p_{XZ} \neq p_{YZ}$ and $s > d/2$.

Finally, to construct a confidence interval we need a consistent estimate of the asymptotic variance
$$\frac{1}{\zeta}\mathbb{V}_{XZ}\big[\psi_{XZ}(V; p)\big] + \frac{1}{1-\zeta}\mathbb{V}_{YZ}\big[\psi_{YZ}(W; p)\big]$$
where,
$$\mathbb{V}_{XZ}\big[\psi_{XZ}(X, Z^1; p_{XZ}, p_{YZ})\big] = \left(\frac{\alpha}{\alpha-1}\right)^2\left(\int p_{XZ}^{2\alpha-1}\,p_{YZ}^{2\beta} - \Big(\int p_{XZ}^\alpha\,p_{YZ}^\beta\Big)^2\right)$$
$$\mathbb{V}_{YZ}\big[\psi_{YZ}(Y, Z^2; p_{XZ}, p_{YZ})\big] = \int p_{XZ}^{2\alpha}\,p_{YZ}^{2\beta-1} - \Big(\int p_{XZ}^\alpha\,p_{YZ}^\beta\Big)^2$$
From our analysis above, we know that any functional of the form $S(a, b) = \int p_{XZ}^a\,p_{YZ}^b$ with $a + b = 1$, $a, b \neq 0, 1$, can be estimated via the LOO estimate
$$\hat S(a, b) = \frac{1}{n}\sum_{i=1}^{n}\left[a\,\frac{\hat p_{YZ,-i}^{\,b}(V_i)}{\hat p_{XZ,-i}^{\,b}(V_i)} + b\,\frac{\hat p_{XZ,-i}^{\,a}(W_i)}{\hat p_{YZ,-i}^{\,a}(W_i)}\right]$$
where $\hat p_{XZ,-i}, \hat p_{YZ,-i}$ are the density estimates obtained from $V_{-i}, W_{-i}$ respectively. Moreover, $n/N$ is a consistent estimator for $\zeta$. This gives the following estimator for the asymptotic variance,
$$\frac{N}{n}\,\frac{\alpha^2}{(\alpha-1)^2}\,\hat S(2\alpha - 1, 2\beta) + \frac{N}{m}\,\hat S(2\alpha, 2\beta - 1) - \frac{N\big(m\alpha^2 + n(\alpha-1)^2\big)}{nm(\alpha-1)^2}\,\hat S^2(\alpha, \beta).$$
The consistency of this estimator follows from the consistency of $\hat S(a, b)$ for $S(a, b)$, Slutsky's theorem and the continuous mapping theorem.

Proof of Corollary 24. We now prove that the DS estimator satisfies the necessary conditions for asymptotic normality. We begin by showing that the influence functions of $CT_\alpha$ satisfy the regularity Assumption 4. We will show this for $\psi_{XZ}$; the proof for $\psi_{YZ}$ is similar. Consider two pairs of densities $(f, g)$, $(f', g')$ on the $(XZ, YZ)$ spaces,
$$\int\big(\psi_{XZ}(u; f, g) - \psi_{XZ}(u; f', g')\big)^2 f = \frac{\alpha^2}{(1-\alpha)^2}\int\left(f^{\alpha-1}g^\beta - \int f^\alpha g^\beta - f'^{\alpha-1}g'^\beta + \int f'^\alpha g'^\beta\right)^2 f$$
$$\leq \frac{2\alpha^2}{(1-\alpha)^2}\left[\int\big(f^{\alpha-1}g^\beta - f'^{\alpha-1}g'^\beta\big)^2 f + \left(\int f^\alpha g^\beta - \int f'^\alpha g'^\beta\right)^2\right]$$
$$\leq \frac{2\alpha^2}{(1-\alpha)^2}\left[\int\big(f^{\alpha-1}g^\beta - f'^{\alpha-1}g'^\beta\big)^2 f + \int\big(f^\alpha g^\beta - f'^\alpha g'^\beta\big)^2\right]$$
$$\leq \frac{4\alpha^2}{(1-\alpha)^2}\left[\|g^\beta\|_\infty^2\int\big(f^{\alpha-1} - f'^{\alpha-1}\big)^2 + \|f'^{\alpha-1}\|_\infty^2\int\big(g^\beta - g'^\beta\big)^2 + \|g^\beta\|_\infty^2\int\big(f^\alpha - f'^\alpha\big)^2 + \|f'^\alpha\|_\infty^2\int\big(g^\beta - g'^\beta\big)^2\right]$$
$$\in O\big(\|f - f'\|^2\big) + O\big(\|g - g'\|^2\big)$$
where in the second and fourth steps we have used Jensen's inequality. The last step follows from the boundedness of all our densities and estimates, and from Lemma 11. The bounded variance condition on the influence functions also follows from the boundedness of the densities,
$$\mathbb{V}_{p_{XZ}}\psi_{XZ}(V; p_{XZ}, p_{YZ}) \leq \frac{\alpha^2}{(\alpha-1)^2}\,\mathbb{E}_{p_{XZ}}\big[p_{XZ}^{2\alpha-2}(X, Z^1)\,p_{YZ}^{2\beta}(X, Z^1)\big] = \frac{\alpha^2}{(\alpha-1)^2}\int p_{XZ}^{2\alpha-1}\,p_{YZ}^{2\beta} < \infty$$
We can bound $\mathbb{V}_{p_{YZ}}\psi_{YZ}$ similarly. For the final condition, note that when $p_{XZ} = p_{YZ}$,
$$\psi_{XZ}(X, Z^1; p_{XZ}, p_{XZ}) = \frac{\alpha}{\alpha-1}\left(p_{XZ}^{\alpha+\beta-1}(X, Z^1) - \int p_{XZ}\right) = 0,$$
and similarly $\psi_{YZ} = 0$. Otherwise, $\psi_{XZ}$ depends explicitly on $(X, Z^1)$ and is nonzero. Therefore we have asymptotic normality away from $p_{XZ} = p_{YZ}$.
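To close the example, here is a sketch of the resulting Wald-type confidence interval: $\hat S(a, b)$ is computed with the LOO formula above (reusing the `kde2` helper from the previous sketch and recycling the smaller sample as in the proof of Theorem 7), plugged into the variance estimator, and combined with a point estimate such as $\widehat{CT}_{\alpha,\mathrm{DS}}$. The bandwidth and the hard-coded 95% normal quantile are illustrative assumptions.

```python
import numpy as np

def kde2(points, S, h):
    """Product-Gaussian KDE for the 2-d sample S (n, 2), evaluated at points (k, 2)."""
    z = (points[:, None, :] - S[None, :, :]) / h
    return np.exp(-0.5 * (z**2).sum(axis=-1)).mean(axis=1) / (2 * np.pi * h**2)

def S_hat(V, W, a, b, h):
    """LOO estimate of S(a, b) = ∫ p_XZ^a p_YZ^b  (requires a + b = 1)."""
    n, m = len(V), len(W)
    out = np.empty(n)
    for i in range(n):
        j = i % m                                   # recycle W when n > m
        Vmi, Wmj = np.delete(V, i, axis=0), np.delete(W, j, axis=0)
        r_V = kde2(V[i][None], Wmj, h)[0] / kde2(V[i][None], Vmi, h)[0]
        r_W = kde2(W[j][None], Vmi, h)[0] / kde2(W[j][None], Wmj, h)[0]
        out[i] = a * r_V**b + b * r_W**a
    return out.mean()

def ct_confidence_interval(ct_estimate, V, W, alpha, h, z=1.96):
    """Wald interval from the plug-in estimate of the asymptotic variance."""
    n, m = len(V), len(W)
    N = n + m
    beta = 1 - alpha
    var_hat = (N / n * alpha**2 / (alpha - 1)**2 * S_hat(V, W, 2*alpha - 1, 2*beta, h)
               + N / m * S_hat(V, W, 2*alpha, 2*beta - 1, h)
               - N * (m * alpha**2 + n * (alpha - 1)**2) / (n * m * (alpha - 1)**2)
                 * S_hat(V, W, alpha, beta, h)**2)
    half = z * np.sqrt(max(var_hat, 0.0) / N)       # sqrt(N)-normalization from (26)
    return ct_estimate - half, ct_estimate + half
```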
