High dimensional gaussian classification
High dimensional data analysis is known to be a challenging problem. In this article, we give a theoretical analysis of the high dimensional classification of Gaussian data which relies on a geometrical analysis of the error measure. It links a proble…
Authors: Robin Girard
$\|F_{10}\|^2_{L^2(\gamma_C)} \geq r$. From the preceding proposition, uniformly over all the possible values of $\mu_1$ and $\mu_0$, the learning error and the excess risk can converge to zero only if $p/n$ tends to $0$. Recall that if no a priori assumption is made on $m_{10}$, $\bar{m}_{10}$ is the best estimator (according to the mean square error) of $m_{10}$. Hence, as in the estimation of a high dimensional vector (such as the problems described in [9]), one should make a more restrictive hypothesis on $m_{10}$. We will suppose, in Section 5, that if $(a_k)_{k \geq 0}$ are the coefficients of $C^{-1/2} m_{10}$ in a well chosen basis, then $\sum_{k \geq 0} |a_k|^q \leq R^q$ for $0 < q < 2$.

Proof. As in the preceding proposition, we will use inequality (13). It is sufficient to show that
\[ E[|\alpha|] \geq \arccos\Big( \frac{1}{\sqrt{p-3}} \big( \sqrt{n}\,\|F_{10}\|_{L^2(\gamma_C)} + 1 \big) \Big), \]
where $\alpha$ is defined by (5). Because the function $\arccos$ is decreasing and concave on $[0,1]$, it suffices to obtain
\[ E\left[ \frac{|\langle F_{10}, \hat{F}_{10} \rangle_{L^2(\gamma_C)}|}{\|F_{10}\|_{L^2(\gamma_C)} \|\hat{F}_{10}\|_{L^2(\gamma_C)}} \right] \leq \frac{1}{\sqrt{p-3}} \big( \sqrt{n}\,\|F_{10}\|_{L^2(\gamma_C)} + 1 \big). \tag{17} \]
On the other hand,
\[ E\left[ \frac{|\langle F_{10}, \hat{F}_{10} \rangle|}{\|F_{10}\| \|\hat{F}_{10}\|} \right] \leq E\left[ \frac{\|F_{10}\|}{\|\hat{F}_{10}\|} \right] + E\left[ \frac{|\langle F_{10}, \hat{F}_{10} - F_{10} \rangle|}{\|F_{10}\| \|\hat{F}_{10}\|} \right] \leq E\left[ \frac{\|F_{10}\|^2}{\|\hat{F}_{10}\|^2} \right]^{1/2} \left( 1 + E\left[ \frac{\langle F_{10}, \hat{F}_{10} - F_{10} \rangle^2}{\|F_{10}\|^2} \right]^{1/2} \right) \]
(all norms and inner products taken in $L^2(\gamma_C)$), where the last inequality results from Cauchy-Schwarz. Recall that $\hat{F}_{10} = F_{10} + \frac{1}{\sqrt{n}} C^{-1/2} \xi$, where $\xi$ is a standardised gaussian random vector of $\mathbb{R}^p$. Hence, we easily obtain
\[ E\left[ \frac{\langle F_{10}, \hat{F}_{10} - F_{10} \rangle^2_{L^2(\gamma_C)}}{\|F_{10}\|^2_{L^2(\gamma_C)}} \right]^{1/2} = \frac{1}{\sqrt{n}}, \qquad \text{and} \qquad \frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{\|\hat{F}_{10}\|^2_{L^2(\gamma_C)}} = \frac{\|\sqrt{n}\, C^{1/2} F_{10}\|^2_{\mathbb{R}^p}}{\|\sqrt{n}\, C^{1/2} F_{10} + \xi\|^2_{\mathbb{R}^p}}.
\]
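The remaining expectation is controlled by the $\chi^2$ fact stated just below, $E[1/\|X\|^2_{\mathbb{R}^p}] \leq 1/(p-3)$ for $X \sim N(\beta, I_p)$. As a quick numerical sanity check (plain Python Monte Carlo; the dimension, mean vector and sample size are hypothetical choices):

```python
import random

random.seed(0)
p, n_mc = 50, 20_000
beta = [0.5] * p                      # any fixed mean vector beta in R^p

acc = 0.0
for _ in range(n_mc):
    # X ~ N(beta, I_p); accumulate 1 / ||X||^2
    acc += 1.0 / sum((b + random.gauss(0.0, 1.0)) ** 2 for b in beta)
est = acc / n_mc                      # Monte Carlo estimate of E[1/||X||^2]

assert est <= 1.0 / (p - 3)           # the bound 1/(p-3) of the proof
```

The non-centrality only decreases the expectation, so the bound is loosest at $\beta = 0$, where the exact value is $1/(p-2)$.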
The rest of the proof follows from the following simple fact, which is a consequence of the Cochran theorem and a classical calculation on $\chi^2$ random variables: let $\beta \in \mathbb{R}^p$ and let $X$ be a gaussian random vector of $\mathbb{R}^p$ with mean $\beta$ and covariance $I_p$. Then
\[ E\left[ \frac{1}{\|X\|^2_{\mathbb{R}^p}} \right] \leq \frac{1}{p-3}. \]

imsart-generic ver. 2007/12/10, file: article-final1.tex, date: October 24, 2018. R. Girard / High dimensional gaussian classification.

2.3. Case where $\|F_{10}\|_{L^2(\gamma_C)}$ diverges: well separated data.

We shall now briefly consider the case when the data are well separated, that is, the case where $\|F_{10}\|_{L^2(\gamma_C)}$ diverges. In the next theorem, we assume that $p$ tends to infinity.

Theorem 2.2. Suppose that $0 < \alpha < \pi/2$ ($\alpha$ is defined by (5)), and that $\cos(\alpha)\, \|F_{10}\|_{L^2(\gamma_C)} \to \infty$ when $p$ tends to infinity. We then have, when $p \to \infty$,
\[ R \to 0 \ \text{ if } \ \liminf_{p \to \infty} \frac{2 |d_0|}{|\langle F_{10}, \hat{F}_{10} \rangle_{L^2(\gamma_C)}|} < 1, \qquad R \geq \frac{1}{8} \ \text{ if } \ \limsup_{p \to \infty} \frac{2 |d_0|}{|\langle F_{10}, \hat{F}_{10} \rangle_{L^2(\gamma_C)}|} > 1. \]

This theorem is proved in Section 7. In the case of well separated data it is obvious that the optimal rule will perform perfectly. Theorem 2.2 shows that, for a given estimator $\hat{F}_{10}$, one should check that the probability to have $\liminf_{p \to \infty} 2|d_0| / |\langle F_{10}, \hat{F}_{10}\rangle_{L^2(\gamma_C)}| > 1$ is small enough.

3. Quadratic perturbation of quadratic rules

3.1. Main results and remarks about the infinite dimensional setting

In the case where $C_1 \neq C_0$, $L_{10}(x) = L^Q_{10}(x)$ is a polynomial function of degree two on $\mathbb{R}^p$:
\[ L^Q_{10}(x) = -\frac{1}{2} \langle A_{10}(x - s_{10}), x - s_{10} \rangle_{\mathbb{R}^p} + \langle G_{10}, x - s_{10} \rangle_{\mathbb{R}^p} - c, \tag{18} \]
where
\[ A_{10} = C_1^{-1} - C_0^{-1}, \qquad G_{10} = S m_{10}, \tag{19} \]
\[ S = \frac{C_0^{-1} + C_1^{-1}}{2}, \qquad c = \frac{1}{8} \langle A_{10}\, m_{10}, m_{10} \rangle_{\mathbb{R}^p} - \frac{1}{2} \log |\det(C_0^{-1} C_1)|, \]
and $m_{10}$ and $s_{10}$ are defined by (3).

Remark 3.1.
The equation (19) giving $L^Q_{10}(x)$ can be modified using the fact that
\[ A_{10} = \frac{1}{2} \left( C_1^{-1/2} W_{10} C_1^{-1/2} - C_0^{-1/2} W_{01} C_0^{-1/2} \right), \qquad \text{where } W_{ij} = I - C_i^{1/2} C_j^{-1} C_i^{1/2}. \tag{20} \]
This modification has two advantages. It involves the $W_{ij}$, which play an important role in the infinite dimensional framework (see Remark 3.2). In addition, it involves $W_{10}$ as much as $W_{01}$, which can lead in practice (while estimating $A_{10}$) to a symmetric procedure that does not give more importance to either group.

In the classification problem, a polynomial of degree two $\hat{L}^Q_{10}(x)$ is used as a substitute for $L_{10}$. We decide that $X$ comes from class one if it belongs to
\[ \hat{V} = \left\{ x \in \mathbb{R}^p \ \text{such that} \ \hat{L}^Q_{10}(x) \geq 0 \right\}. \tag{21} \]
The following theorem gives our solution to Problem 1.

Theorem 3.1. Let $\gamma$ be a gaussian measure on $\mathbb{R}^p$. Suppose that $L^Q_{10}$ is a polynomial of degree two on $\mathbb{R}^p$ and that $\|L^Q_{10}\|_{L^2(\gamma)} \geq r$ for some $r > 0$. Then, for all $q \in (0,1)$, there exists $c_1(r,q) > 0$ such that
\[ R(1_{\hat{V}}) \leq c_1(r,q)\, \| L^Q_{10} - \hat{L}^Q_{10} \|^{q/3}_{L^2(\gamma)}, \tag{22} \]
where $\hat{V}$ is given by (21) and $R$ by (8). We emphasise that $c_1(r,q)$ depends only on $r$ and $q$. In particular, it does not depend on the dimension $p$ of the problem.

The proof of this theorem is given in Section 8. It is implicitly infinite dimensional, and the preceding theorem could have been stated in an infinite dimensional framework. We do not want to introduce this complicated framework here, and we refer to [8] for an introduction to the subject. The infinite dimensional framework highlights a particular aspect of the problem that is contained in the following remark.

Remark 3.2.
[Infinite dimensional framework] When $X$ is a separable Hilbert space (it can also be a separable Banach space in the case of LDA), two gaussian measures $\gamma_{C_1,\mu_1}$ and $\gamma_{C_0,\mu_0}$ that are not equivalent are orthogonal. If these measures are orthogonal, then the observed data from the two classes are perfectly separated and $C(g^*) = 0$. In this case one can hope to obtain $C(g) = 0$ for a reasonable classification rule $g$ (even if this is not trivial; see Theorem 2.2 in the linear case). A necessary and sufficient condition for these measures to be equivalent is that
\[ m_{10} = \mu_1 - \mu_0 \in H(\gamma_{C_1,\mu_1}) = H(\gamma_{C_0,\mu_0}), \tag{23} \]
and
\[ W_{10} = I - C_1^{1/2} C_0^{-1} C_1^{1/2} \in HS(X), \tag{24} \]
where $H(\gamma)$ is the reproducing kernel Hilbert space associated with a gaussian measure $\gamma$ and $HS(X)$ is the space of Hilbert-Schmidt operators with values in $X$ (see the corollaries on p. 293 in [8]). In particular, the eigenvalues of $W_{10}$ are in $l^2$. In the case where the measures are equivalent, one can define $L_{10}$ as a limit (almost surely and in $L^2$) of its finite dimensional counterpart. It can also be understood as a measurable and square integrable (with respect to $\gamma_{C_1,\mu_1}$) polynomial of degree two in $X$ (see Chapter 5.10 in [8]).

3.2. Comment and corollary.

Suppose $\hat{L}^Q_{10}(x)$ is defined by substituting $\hat{G}_{10}$, $\hat{s}_{10}$, $\hat{A}_{10}$ and $\hat{c}$ for $G_{10}$, $s_{10}$, $A_{10}$ and $c$ in (18).
If we note
\[ \delta_0 = \hat{c} - c + \left\langle \hat{G}_{10} + (\hat{A}^*_{10} + \hat{A}_{10})(\hat{s}_{10} - s_{10}),\ \hat{s}_{10} - s_{10} \right\rangle_{\mathbb{R}^p} \tag{25} \]
($A^*$ denotes the transpose of a matrix $A$),
\[ \delta_L = \hat{G}_{10} - G_{10} + (\hat{A}^*_{10} + \hat{A}_{10})(\hat{s}_{10} - s_{10}), \tag{26} \]
and
\[ \delta_Q = \hat{A}_{10} - A_{10}, \tag{27} \]
we then get, by a straightforward calculation, for all $x \in \mathbb{R}^p$:
\[ \hat{L}^Q_{10}(x) = L^Q_{10}(x) + \delta_0 + \langle \delta_L, x - s_{10} \rangle_{\mathbb{R}^p} - \frac{1}{2} \langle \delta_Q (x - s_{10}), x - s_{10} \rangle_{\mathbb{R}^p}. \tag{28} \]
Our result is hence about quadratic perturbations of quadratic rules. The following corollary of Theorem 3.1 is easier to use.

Corollary 3.1. Let $X = \mathbb{R}^p$ and let $C$ be a symmetric positive definite matrix on $\mathbb{R}^p$. Suppose that there exists $r > 0$ such that $\|L_{10}\|^2_{L^2(\gamma_{C,s_{10}})} > r$. Then, for $1_{\hat{V}}$ given by (21) and for all $0 < q < 1$, there exists $c_1(r,q) > 0$ such that
\[ R(1_{\hat{V}}) \leq c_1(r,q) \left( \frac{1}{2} \| C (A_{10} - \hat{A}_{10}) \|^2_{HS(\mathbb{R}^p)} + \| C^{1/2} \delta_L \|^2_{\mathbb{R}^p} + 2 \delta_0^2 + \frac{1}{2} \operatorname{trace}^2 \big( C (A_{10} - \hat{A}_{10}) \big) \right)^{q/3}, \]
where $\delta_L$ is given by (26) and $\delta_0$ by (25).

Proof. Let us recall that $\delta_Q$ is given by (27). Writing $q_M(x) = \langle M x, x \rangle_{\mathbb{R}^p}$, we have
\[ \| L_{10} - \hat{L}_{10} \|^2_{L^2(\gamma_{C,s_{10}})} = \left\| \frac{1}{2} \big( q_{\delta_Q}(x) - E_{\gamma_C}[q_{\delta_Q}(X)] \big) - \langle \delta_L, x \rangle_{\mathbb{R}^p} - \Big( \delta_0 - \frac{1}{2} E_{\gamma_C}[q_{\delta_Q}(X)] \Big) \right\|^2_{L^2(\gamma_C)} \]
\[ \leq \frac{1}{4} \operatorname{Var}\big( q_{C^{1/2} \delta_Q C^{1/2}}(\xi) \big) + \operatorname{Var}\big( \langle C^{1/2} \delta_L, \xi \rangle_{\mathbb{R}^p} \big) + 2 \delta_0^2 + \frac{1}{2}\, E^2_{\gamma_C}\big[ q_{C^{1/2} \delta_Q C^{1/2}}(\xi) \big] \]
($\xi \sim \gamma_{I_p,0}$; the inequality comes only from bounding the constant term)
\[ = \frac{1}{2} \| C^{1/2} \delta_Q C^{1/2} \|^2_{HS(\mathbb{R}^p)} + \| C^{1/2} \delta_L \|^2_{\mathbb{R}^p} + 2 \delta_0^2 + \frac{1}{2} \operatorname{trace}^2 \big( C^{1/2} \delta_Q C^{1/2} \big). \]

3.3. Comparison of this result with those obtained for LDA.

The preceding theorem and its corollary are less powerful than those obtained for the LDA procedure, and some conjectures might be made in parallel with Theorem 2.1.
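Returning to the decomposition (28): it is pure algebra and can be sanity-checked numerically. In the sketch below (NumPy; all parameter values are hypothetical), the perturbation terms are recomputed directly from the expansion of $\hat{L}^Q_{10}$ around $s_{10}$, so the signs and $\tfrac{1}{2}$ factors are the ones forced by that calculation.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4

def rand_spd(p):
    # random symmetric positive definite matrix
    M = rng.normal(size=(p, p))
    return M @ M.T + p * np.eye(p)

def quad_rule(x, A, G, s, c):
    # L(x) = -1/2 <A(x-s), x-s> + <G, x-s> - c
    u = x - s
    return -0.5 * u @ (A @ u) + G @ u - c

# "true" and "estimated" parameters of the quadratic frontier
A, G, s, c = rand_spd(p), rng.normal(size=p), rng.normal(size=p), 0.3
Ah = A + 0.1 * rand_spd(p)
Gh, sh, ch = G + 0.1 * rng.normal(size=p), s + 0.1 * rng.normal(size=p), 0.25

# perturbation terms, as produced by expanding L_hat around s
d = sh - s
delta_Q = Ah - A
delta_L = Gh - G + 0.5 * (Ah + Ah.T) @ d
delta_0 = c - ch - Gh @ d - 0.5 * d @ (Ah @ d)

x = rng.normal(size=p)
u = x - s
lhs = quad_rule(x, Ah, Gh, sh, ch)
rhs = quad_rule(x, A, G, s, c) + delta_0 + delta_L @ u - 0.5 * u @ (delta_Q @ u)
assert abs(lhs - rhs) < 1e-8          # identity (28) holds exactly
```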
In this theorem and in Theorem 2.2, both concerning linear rules, we explained and quantified how parameter estimation errors become less important when $\|F_{10}\|_{L^2(\gamma_C)}$ is large. This observation was based on the presence of a term decreasing exponentially with $\|F_{10}\|_{L^2(\gamma_C)}$ in the quantities which determine the upper bound on the learning error (and, as a consequence, on the excess risk). In Theorem 3.1, concerning the QDA procedure, we did not obtain that type of term. Nevertheless, Remark 3.2 (more precisely, the relation that leads to equivalence of the measures) allows us to conjecture that such a term exists.

We also have to clarify the hypothesis under which the norm of $L^Q_{10}$ is lower bounded. Let us recall that this hypothesis guarantees that the constant $c_1$ in equation (22) is independent of the parameters of the problem. In parallel with the results obtained for the LDA procedure, the lower bound required for the norm of $L^Q_{10}$ corresponds to the assumption that the two groups considered can always be distinguished. We believe that even if this hypothesis is natural, it is deeply linked with the error measure used in our proof: the learning error. Indeed, the excess risk is obviously small when the data cannot be distinguished (see Section 6 for a fuller discussion), but our result does not reflect this fact.

We do not discuss the estimation of $G_{10}$, which leads to the same analysis as that of $F_{10}$ in the case of a linear rule. Let us now discuss the estimation of $W_{10}$ (and $W_{01}$).

3.4. Thresholding estimation of an operator and linearisation of a procedure.

Recall that $W_{10}$ is a symmetric matrix. Suppose we know an orthonormal basis in which it is diagonal, and let $\lambda_{10} = (\lambda_{10i})_{i=1,\dots,p}$ be the vector of its eigenvalues. To build the estimator $\hat{W}_{10}$ of $W_{10}$, we have to estimate its eigenvalues.
It remains to measure the learning error, and hence the estimation error of the eigenvalue vector in $l^2$ norm. Suppose that $p$ tends to infinity. We will recall later that if the measures of classes 0 and 1 tend to equivalent gaussian measures in a separable Hilbert space, then $W_{10}$ tends to be Hilbert-Schmidt. This means that $\lambda_{10}$ stays in $l^2(\mathbb{N})$. Once again, if $\lambda_{10}$ has coefficients decreasing sufficiently fast, thresholding estimation should be used. This thresholding estimation is no longer a reduction of the dimension of the space in which the rule acts, but becomes a linearisation of the classification rule (it can still be interpreted as a reduction of the dimension of the space in which the rule used lives). Indeed, let $\hat{W}_{10} = \sum_{i=1}^{l} \hat{\lambda}_{10i}\, e_i \otimes e_i$ for $l \leq p$ and $(e_i)_{i=1,\dots,p}$ an orthonormal basis of $\mathbb{R}^p$; we have
\[ \hat{L}^Q_{10} = \sum_{i=1}^{l} \hat{\lambda}_{10i} \langle e_i, x - \hat{s}_{10} \rangle^2_{\mathbb{R}^p} + g(x), \]
where $g(x)$ is affine and defined on $\mathbb{R}^p$. In this case, the plug-in rule is affine in a subspace of dimension $p - l$ and quadratic in the subspace of dimension $l$ spanned by $(e_i)_{i=1,\dots,l}$.

[Figure 1: Separation of the data in a direction where the variances are different. The two groups can be identified with their ellipsoids of concentration: a horizontal ellipsoid and a vertical ellipsoid. The two groups have the same mean but different covariances, which makes the data quite well separated. One can take advantage of this separation only if a quadratic rule is used.]
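This linearisation can be sketched numerically (NumPy; the basis, the kept eigenvalues and the affine part are all hypothetical): keeping only $l$ thresholded eigenvalues of $\hat{W}_{10}$ yields a frontier that is quadratic in $l$ directions and affine in the $p - l$ others, which a second difference detects.

```python
import numpy as np

rng = np.random.default_rng(0)
p, l = 6, 2
E = np.linalg.qr(rng.normal(size=(p, p)))[0]    # orthonormal basis (e_i) of R^p
lam = np.array([1.5, -0.8] + [0.0] * (p - l))   # eigenvalues kept after thresholding
s_hat = rng.normal(size=p)
g_lin, c0 = rng.normal(size=p), 0.2             # affine part g(x) = <g_lin, x> + c0

def L_hat(x):
    # quadratic in the l kept directions, affine elsewhere
    u = x - s_hat
    return sum(lam[i] * (E[:, i] @ u) ** 2 for i in range(p)) + g_lin @ x + c0

x = rng.normal(size=p)

def second_diff(direction, t=0.5):
    # vanishes iff L_hat is affine along `direction`
    return L_hat(x + t * direction) - 2 * L_hat(x) + L_hat(x - t * direction)

assert abs(second_diff(E[:, p - 1])) < 1e-8     # zeroed eigenvalue: rule is affine there
assert abs(second_diff(E[:, 0])) > 1e-3         # kept eigenvalue: rule is quadratic there
```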
Let us note that, because $W_{10} = I - C_1^{1/2} C_0^{-1} C_1^{1/2}$, setting the eigenvalues of $\hat{W}_{ij}$ to zero in a subspace of $\mathbb{R}^p$ is equivalent to choosing a subspace in which the covariance matrices $C_1$ and $C_0$ are "close enough". In this subspace, one can suppose that $C_1$ equals $C_0$, and the classification rule, restricted to this subspace, is linear. Figure 1 illustrates a case where the eigenvalues of $W_{10}$ are big enough and shows why a quadratic rule is better in that case.

4. Classification procedure in high dimension: a way to solve Problem 2

4.1. Introduction.

In this section, we give a practical method of classification for gaussian data in high dimension and hence present our contribution to Problem 2. Note that although we only treat the binary classification problem, it is easy to extend our procedure to the case of $K$ classes, as we have done in [15]. Recall that we are given $n_1$ observations from $P_1$ and $n_0$ observations from $P_0$; we will note $n = n_1 + n_0$. We suppose that each of the $n_k$ vectors of group $k$ is composed of the $p$ first wavelet coefficients (see [20]) of a random curve from $X = L^2[0,1]$, which is a realisation of a gaussian random variable $P_k = \gamma_{C_k,\mu_k}$ of unknown mean and covariance.

Recall that a learning rule can be defined by a partition of $\mathbb{R}^p$. We construct this partition $\hat{V}$, $\mathbb{R}^p \setminus \hat{V}$ of $\mathbb{R}^p$ with the use of a frontier function $\hat{L}_{10}$:
\[ \hat{V} = \left\{ x \in \mathbb{R}^p : \hat{L}_{10}(x) \geq 0 \right\}, \tag{29} \]
which will be given in the sequel. We divide the presentation into two parts. In the first part, we give a theoretical result in the case where the covariance matrices are supposed to be known. In the second part, we give the method used when the covariances are unknown. We keep the notation of the preceding sections.
In the case of the LDA procedure,
\[ m_{10} = \mu_1 - \mu_0, \qquad F_{10} = C^{-1} m_{10}, \qquad s_{10} = \frac{\mu_1 + \mu_0}{2}, \]
and in the case of the QDA procedure,
\[ G_{10} = \frac{1}{2} \big( C_1^{-1} + C_0^{-1} \big) m_{10}, \qquad A_{10} = C_1^{-1} - C_0^{-1}. \]

4.2. Case of known and equal covariances: procedure and theoretical result.

Notation and assumptions. Let $\bar{\mu}_k$ be the empirical mean of the learning data $(X_{ik})_{i=1,\dots,n_k}$ of class $k$. We suppose here that the covariances of groups 0 and 1 both equal $C$, and that $s_{10}$ is known. The separation frontier between the two groups is affine, and $F_{10}$ is the only unknown parameter. We suppose that the learning set is made of $n_1 = n_0 = n(p)/2$ $p$-dimensional vectors. We give a method to construct an estimator of $F_{10}$, and give theoretical results when $n(p)$ tends to infinity much more slowly than $p$.

For $q > 0$, the ball $l^q_p(R)$ is composed of the vectors $\theta \in \mathbb{R}^p$ such that $\sum_{i=1}^p |\theta_i|^q \leq R^q$. We will note
\[ \Omega_p(\Theta(R), r) = \left\{ (x,y,C) \in \mathbb{R}^p \times \mathbb{R}^p \times \mathcal{C}_p \ \text{such that} \ C^{-1/2}(x-y) \in \Theta(R) \ \text{and} \ \| C^{-1/2}(x-y) \|_{\mathbb{R}^p} \geq r \right\}, \tag{30} \]
where $\mathcal{C}_p$ is the set of symmetric positive definite matrices on $\mathbb{R}^p$. If $(\mu_0, \mu_1, C) \in \Omega_p(\Theta(R), r)$, we will note
\[ D(\hat{L}_{10}) = C(1_{\hat{V}}) - C(1_V), \tag{31} \]
where $\hat{V}$ is given by (29) and $V$ by (2).

The procedure. The plug-in rule assigns the observation $X$ to class 1 if it belongs to $\hat{V}$ defined by (29) with $\hat{L}_{10} = \langle \hat{F}_{10}, X - s_{10} \rangle_{\mathbb{R}^p}$.
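As a toy illustration (NumPy plus the standard library; the dimension, covariance, means and $b_p$ are all hypothetical choices), the plug-in rule can be combined with the FDR-thresholded estimate of $F_{10}$ described next: coordinates of $C^{-1/2}(\bar{\mu}_1 - \bar{\mu}_0)$ are kept only when they survive a Benjamini-Hochberg style threshold, here with $z(\alpha)$ taken as the upper-$\alpha$ standard normal quantile and a deliberately conservative $b_p$.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
p, n_half = 200, 100                        # n(p) = 200 learning vectors in total
C = np.eye(p)                               # known common covariance (identity for simplicity)
m10 = np.zeros(p); m10[:10] = 1.5           # sparse mean difference mu_1 - mu_0
mu1, mu0 = m10 / 2, -m10 / 2
s10 = (mu1 + mu0) / 2                       # known in this setting

X1 = rng.multivariate_normal(mu1, C, size=n_half)
X0 = rng.multivariate_normal(mu0, C, size=n_half)

# FDR thresholding of y = C^{-1/2}(mean_1 - mean_0); here C = I so y is the raw difference
y = X1.mean(axis=0) - X0.mean(axis=0)
b_p = 0.001                                 # hypothetical, conservative choice
z = NormalDist().inv_cdf
sd0 = np.sqrt(2.0 / n_half)                 # null standard deviation of each y_l
ys = np.sort(np.abs(y))[::-1]               # |y_(1)| >= ... >= |y_(p)|
ks = [k for k in range(1, p + 1) if ys[k - 1] >= sd0 * z(1 - b_p * k / (2 * p))]
lam = ys[max(ks) - 1] if ks else np.inf
F_hat = np.where(np.abs(y) >= lam, y, 0.0)  # thresholded plug-in for F10 = C^{-1} m10

# plug-in rule: class 1 iff <F_hat, X - s10> >= 0, evaluated on fresh test points
Xt1 = rng.multivariate_normal(mu1, C, size=500)
Xt0 = rng.multivariate_normal(mu0, C, size=500)
acc = 0.5 * ((Xt1 - s10) @ F_hat >= 0).mean() + 0.5 * ((Xt0 - s10) @ F_hat < 0).mean()

assert acc > 0.8                            # far above chance on well separated groups
assert 10 <= np.count_nonzero(F_hat) <= 15  # essentially only the sparse support is kept
```

The sketch also illustrates the dimension reduction aspect: the rule acts only in the (few) directions that survive the threshold.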
We estimate $F_{10} = C^{-1} m_{10}$ by $\hat{F}_{10} = C^{-1} \hat{m}_{10}$, where the coefficients of $C^{-1/2} \hat{m}_{10}$ are given by
\[ \big( y_{10l}\, 1_{|y_{10l}| > \lambda^{FDR}_{10}} \big)_{l=1,\dots,p}, \qquad \text{where} \quad y_{10l} = \big( C^{-1/2}(\bar{\mu}_1 - \bar{\mu}_0) \big)_l, \quad l = 1, \dots, p, \]
and $\lambda^{FDR}_{10}$ is chosen by the Benjamini and Hochberg procedure [4] for the control of the false discovery rate (FDR) of the following multiple hypotheses:
\[ \forall l = 1, \dots, p \qquad H_{0l}: E[y_{10l}] = 0 \quad \text{versus} \quad H_{1l}: E[y_{10l}] \neq 0. \tag{32} \]
We recall that this procedure is the following. The $(|y_{10l}|)_l$ are ordered in decreasing order, $|y_{10(1)}| \geq \dots \geq |y_{10(p)}|$, and $\lambda^{FDR}_{10} = |y_{10(k^{FDR}_{10})}|$, where
\[ k^{FDR}_{10} = \max \left\{ k \in \{1, \dots, p\} : |y_{10(k)}| \geq \sqrt{\frac{1}{n(p)}}\ z\!\left( \frac{b_p k}{2p} \right) \right\}, \]
$z(\alpha)$ is the upper quantile of order $\alpha$ of a standardized gaussian random variable, and $b_p \in [0, 1/2)$ is lower bounded by $c_0 / \log p$, where $c_0$ is a positive constant (which does not depend on $p$).

Theoretical result.

Theorem 4.1. Let $R > 0$ and $q \in (0,2)$. Let $\hat{V}$ be defined by (29) and $\eta_p = p^{-1/q} R \sqrt{n(p)}$. Suppose that $p$ tends to infinity. If $\eta_p^q \in [\log^5(p)/p,\ p^{-\delta}]$ for some $\delta > 0$, then, for $r > 0$, we have
\[ \sup_{(\mu_0,\mu_1,C) \in \Omega_p(l_q(R),r)} E_{P^{\otimes n}} \big[ D_p(\hat{L}_{10}) \big] \leq \frac{1 + o_p(1)}{r}\ \sqrt{2}\, \log^{1/2}\!\left( \frac{p}{R^q\, n(p)^{q/2}} \right) \big( R\, n^{-1/2}(p) \big)^{\frac{2-q}{2}}, \]
where $D_p$ is the excess risk defined by (31), and $P^{\otimes n}$ is the law of the learning set.

Proof. The covariance matrix of the vector $C^{-1/2}(\bar{\mu}_1 - \bar{\mu}_0)$ equals $\frac{1}{n(p)} I_p$. We then use successively Theorem 2.1 (of this article), Theorem 1.1 of Abramovich et al. [1], and Theorem 5, point 3b,
of Donoho and Johnstone [11] to be able to write, for all $r > 0$:
\[ \sup_{(\mu_0,\mu_1,C) \in \Omega_p(l_q(R),r)} E_{P^{\otimes n}} \big[ D^2_p(\hat{L}_{10}) \big] \leq \frac{1 + o_p(1)}{r^2}\ \sqrt{2}\, \log^{1/2}\!\left( \frac{p}{R^q\, n(p)^{q/2}} \right) \big( R\, n^{-1/2}(p) \big)^{2-q}. \]
This inequality leads to the result by the use of Jensen's inequality:
\[ E_{P^{\otimes n}}\big[ D_p(\hat{L}_{10}) \big] \leq E_{P^{\otimes n}}\big[ D^2_p(\hat{L}_{10}) \big]^{1/2}. \]

Comments. Let us make a few remarks on this result.

1. The rate of convergence is faster when $q$ is close to 0 and slower when it is close to 2. This leads us to consider the sparsity of $C^{-1/2}(\mu_0 - \mu_1)$, and makes the use of the wavelet basis attractive: on the one hand, it transforms a wide class of curves into sparse vectors, and on the other hand, it almost diagonalises a wide class of covariance operators.

2. We could obtain the same speed with a universal threshold (i.e. with the threshold $\lambda_U = \sqrt{2 \log(p) / n(p)}$). In this case, the constant $\frac{1 + o_p(1)}{r^2}$ would not be as good (cf. [1]).

3. We are not aware of any result concerning the convergence of any classification procedure in this framework (the high dimensional gaussian framework with the set of possible parameters determined by $\Omega_p$). Indeed, we do not make any strong assumption on $C$. Bickel and Levina [6], as well as Fan and Fan [12], suppose in their work that the ratio between the highest and the lowest eigenvalue is lower and upper bounded. Even if our theorem does not treat the case where $C$ is unknown, the hypotheses we use seem more natural. Let us recall that if $Y$ is a gaussian random variable with values in a Hilbert space, then the covariance operator is necessarily nuclear.
Moreover, the assumption used by the above mentioned authors does not allow one to consider gaussian measures with support in a Hilbert space.

4. Finding the significant components of the normal vector $F_{10}$ defining the optimal separating hyperplane is equivalent to finding the significant contrasts in a multivariate ANOVA. Hence, controlling the expected false discovery rate in this ANOVA is sufficient to get a good classification rule.

4.3. The case of different unknown covariances.

For the rest of this section, for $k \in \{0,1\}$, $\bar{\mu}_k$ will be the empirical mean of the learning data of class $k$. We are going to use a diagonal estimator $\hat{C}_k$ of the covariance matrix $C_k$. The diagonal elements of $\hat{C}_k$ will be $(\hat{\sigma}^2_{kq})_{q=1,\dots,p}$: for $q \in \{1,\dots,p\}$ and $k \in \{0,1\}$, $\hat{\sigma}^2_{kq}$ will be the unbiased version of the empirical variance of feature $q$ of the observations $(X_{ikq})_{i=1,\dots,n_k}$ of class $k$. We will note $\hat{s}_{10} = (\bar{\mu}_1 + \bar{\mu}_0)/2$. The classification rule decides that $X \in \mathbb{R}^p$ comes from class 1 if $X$ belongs to $\hat{V}$ given by (29) with
\[ \hat{L}_{10} = -\frac{1}{2} \langle \hat{A}_{10}(x - \hat{s}_{10}), x - \hat{s}_{10} \rangle_{\mathbb{R}^p} + \langle \hat{G}_{10}, x - \hat{s}_{10} \rangle_{\mathbb{R}^p} - \hat{c}_{10}, \]
where the quantities of this equation are given in what follows (in the multiclass extension, the construction below is carried out for every couple of classes $(i,j) \in \{1,\dots,K\}^2$, $i \neq j$): $\hat{G}_{10}$ in equation (33), $\hat{A}_{10}$ in equation (34), and $\hat{c}_{10}$ in equation (35). We estimate $G_{10} = \frac{1}{2}(C_1^{-1} + C_0^{-1}) m_{10}$ by
\[ \hat{G}_{10} = \left( \frac{1}{\sqrt{2}} \left( \frac{1}{\hat{\sigma}^2_{1q}} + \frac{1}{\hat{\sigma}^2_{0q}} \right)^{1/2} y_{10q}\, 1_{|y_{10q}| > \lambda^{FDR}_{10}} \right)_{q=1,\dots,p} \tag{33} \]
where
\[ y_{10q} = \frac{1}{\sqrt{2}} \left( \frac{1}{\hat{\sigma}^2_{1q}} + \frac{1}{\hat{\sigma}^2_{0q}} \right)^{1/2} \big( \bar{\mu}_{1q} - \bar{\mu}_{0q} \big), \]
and $\lambda^{FDR}_{10}$ is chosen by the Benjamini and Hochberg procedure, as follows. Let $\operatorname{Var}_0(y_{10q})$ be the variance of $y_{10q}$ calculated under the hypothesis that $\mu_{1q} = \mu_{0q}$.
The term
\[ \frac{1 + \hat{\sigma}^2_{1q} / \hat{\sigma}^2_{0q}}{2 n_1} + \frac{1 + \hat{\sigma}^2_{0q} / \hat{\sigma}^2_{1q}}{2 n_0} \]
is an estimation of this variance when the $\sigma^2_{kq}$ ($k = 0,1$) are known and equal to $\hat{\sigma}^2_{kq}$; in practice, we substitute this term for $\operatorname{Var}_0(y_{10q})$. The reals $\big( |y_{10q}| / \sqrt{\operatorname{Var}_0(y_{10q})} \big)_{q=1,\dots,p}$ are ordered in decreasing order,
\[ |y_{10(1)}| / \sqrt{\operatorname{Var}_0(y_{10(1)})} \geq \dots \geq |y_{10(p)}| / \sqrt{\operatorname{Var}_0(y_{10(p)})}, \]
and $\lambda^{FDR}_{10} = |y_{10(k^{FDR}_{10})}|$ where
\[ k^{FDR}_{10} = \max \left\{ k : |y_{10(k)}| \geq \sqrt{ \frac{1 + \hat{\sigma}^2_{1(k)} / \hat{\sigma}^2_{0(k)}}{2 n_1} + \frac{1 + \hat{\sigma}^2_{0(k)} / \hat{\sigma}^2_{1(k)}}{2 n_0} }\ z\!\left( \frac{b_p k}{2p} \right) \right\}, \]
$z(\alpha)$ is the upper quantile of order $\alpha$ of a standardized gaussian random variable, and $b_p \in [0,1)$ is as in the preceding algorithm. In practice, we choose $b_p = 0.01$, but one could keep a part of the learning set to learn the best value of $b_p$. Note that in the application we have in mind, the learning set is too small to be divided. In addition, in view of Theorem 4.1, the choice of $b_p$ does not determine the performance of the algorithm: in practice, the difference in classification error between the choices $b_p = 0.01$ and $b_p = 0.05$, for example, is not important.

This first part of the method constitutes a dimension reduction. Indeed, the only coordinates of $(\hat{G}_{10q})_{q=1,\dots,p}$ that are kept non null are those for which $|y_{10q}| \geq \lambda^{FDR}_{10}$; the linear application associated with $(\hat{G}_{10q})_{q=1,\dots,p}$ only acts in $k^{FDR}_{10}$ directions. Let us also note that if we extend our procedure to a multiclass procedure, for two couples of classes $(i,j) \neq (l,m)$, the corresponding estimations $\hat{G}_{ij}$ and $\hat{G}_{lm}$ might be based on different dimension reductions.

Remark 4.1.
The testing procedure used can be analysed as a "vertical" ANOVA that reveals the interesting directions:
1. in which classification should be done (with the thresholding estimation of $G_{10}$);
2. in which classification should be quadratic (with the thresholding estimation of $A_{10}$).

The matrix $A_{10}$ is estimated by a diagonal matrix with diagonal elements given by
\[ \hat{a}_{10q} = \left( \frac{1}{\hat{\sigma}^2_{1q}} - \frac{1}{\hat{\sigma}^2_{0q}} \right) 1_{|w_{10q}| \geq \eta^{FDR}_{10}}, \qquad \text{where} \quad w_{10q} = \hat{\sigma}^2_{1q} - \hat{\sigma}^2_{0q}, \quad q = 1, \dots, p, \tag{34} \]
and the threshold $\eta^{FDR}_{10}$ is chosen with the same type of procedure as the one used to find $\lambda^{FDR}_{10}$. Let $\operatorname{Var}_0(w_{10q})$ be the variance of $w_{10q}$ under the hypothesis that $\sigma_{1q} = \sigma_{0q}$. The term
\[ \frac{2 \hat{\sigma}^4_{1q}}{n_1 - 1} + \frac{2 \hat{\sigma}^4_{0q}}{n_0 - 1} \]
is an estimation of it that we use in practice. The real numbers $\big( |w_{10q}| / \sqrt{\operatorname{Var}_0(w_{10q})} \big)_q$ are ordered in decreasing order,
\[ |w_{10(1)}| / \sqrt{\operatorname{Var}_0(w_{10(1)})} \geq \dots \geq |w_{10(p)}| / \sqrt{\operatorname{Var}_0(w_{10(p)})}, \]
and $\eta^{FDR}_{10} = |w_{10(k^{FDR}_{10})}|$ where
\[ k^{FDR}_{10} = \max \left\{ k : |w_{10(k)}| \geq \sqrt{ \frac{2 \hat{\sigma}^4_{1(k)}}{n_1 - 1} + \frac{2 \hat{\sigma}^4_{0(k)}}{n_0 - 1} }\ z\!\left( \frac{b_p k}{2p} \right) \right\}. \]
This part of the method constitutes a linearisation of the rule. Indeed, the directions $q \in \{1,\dots,p\}$ in which $\hat{a}_{10q}$ is 0 are the directions in which the classification rule between groups 1 and 0 is linear; in the other directions, the rule is quadratic. The use of this method is still motivated by Theorem 4.1 and the theorems used in its proof, but it needs additional theoretical justification. We will finally note:
\[ \hat{c}_{10} = \sum_{q=1}^{p} 1_{|w_{10q}| \geq \eta^{FDR}_{10}} \left( \frac{1}{8}\, \hat{a}_{10q} \big( \bar{\mu}_{1q} - \bar{\mu}_{0q} \big)^2 + \frac{1}{2} \log \left| \frac{\hat{\sigma}_{1q}}{\hat{\sigma}_{0q}} \right| \right). \tag{35} \]

5. Application to medical data and the TIMIT database

We are going to study the performance of the given procedure.
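Before turning to the experiments, the variance screening (34) can be sketched as follows (NumPy plus the standard library; group sizes, variances and $b_p$ are hypothetical, with $z(\alpha)$ again the upper-$\alpha$ normal quantile): coordinates where the two group variances genuinely differ are the ones declared "quadratic".

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
p, n1, n0 = 40, 300, 300
sd1 = np.where(np.arange(p) < 5, 2.0, 1.0)      # group 1 variance differs on 5 coordinates
X1 = rng.normal(scale=sd1, size=(n1, p))
X0 = rng.normal(size=(n0, p))

s2_1 = X1.var(axis=0, ddof=1)                   # unbiased empirical variances
s2_0 = X0.var(axis=0, ddof=1)
w = s2_1 - s2_0
var0 = 2 * s2_1**2 / (n1 - 1) + 2 * s2_0**2 / (n0 - 1)   # estimate of Var_0(w_q)

b_p = 0.001                                     # hypothetical, conservative choice
z = NormalDist().inv_cdf
order = np.argsort(-np.abs(w) / np.sqrt(var0))  # decreasing standardized |w_q|
ks = [k for k in range(1, p + 1)
      if np.abs(w[order[k - 1]]) >= np.sqrt(var0[order[k - 1]]) * z(1 - b_p * k / (2 * p))]
eta = np.abs(w[order[max(ks) - 1]]) if ks else np.inf

a_hat = (1 / s2_1 - 1 / s2_0) * (np.abs(w) >= eta)   # diagonal elements of A_hat, eq. (34)

assert np.all(a_hat[:5] != 0)                   # the 5 quadratic directions are detected
assert np.count_nonzero(a_hat[5:]) <= 2         # almost no spurious quadratic direction
```

In the remaining directions the rule stays linear, which is exactly the "linearisation" described above.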
With that aim, we compare our method with the one given by Rossi and Villa [22] on the TIMIT database. We then test our procedure on medical data.

5.1. Comparison of our method with that of Rossi and Villa in the case of two-class classification.

Rossi and Villa use a support vector machine (SVM) with different types of kernels. Recall that the SVM procedure constructs an affine frontier function $f$ given by $f(x) = \langle w, x \rangle_{\mathbb{R}^p} + b$, where $w$ and $b$ are solutions of an optimization problem of the following type:
\[ \min_{w, b, \xi} \ \|w\|^2_{\mathbb{R}^p} + C \sum_{i=1}^{n} \xi_i \qquad \text{under} \quad y_i \big( \langle w, x_i \rangle_{\mathbb{R}^p} + b \big) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, n, \]
where $(x_i, y_i)_{i=1,\dots,n}$ are the couples (observation, label) of the learning set.

The TIMIT database has notably been studied by Hastie et al. [18]. This database includes the phonemes "aa" and "ao" pronounced by many different persons. The corresponding records are curves observed at a fine enough sampling frequency; more precisely, one curve is a $p$-dimensional vector with $p = 256$. The learning set is composed of 519 "aa" and 759 "ao", and the test set is composed of 176 "aa" and 263 "ao". The curves $(x_i)_{i=1,\dots,519}$ are those which correspond to the pronunciation of the phoneme "aa", and the label $y_i = 0$ is associated with them; the label "1" is associated with the other curves, which correspond to the pronunciation of the phoneme "ao". The method of Rossi and Villa gives almost the same results as ours: 20% of classification mistakes.

5.2. Application to medical data.

The medical problem is the following. In magnetic resonance imagery, one can obtain spectra characterizing tissues localized in some area of the brain. The spectra obtained can be used to characterize tumors.
Unfortunately, even for a specialist, it is hard to define a good rule to associate the name of a tumor with a given spectrum. Some spectra have been obtained on identified tumors, and we have been given these spectra. In order to have enough spectra in our learning set, we retained five groups of spectra (some of them regrouping many tumors): the glioblastomas of the first type (1), the glioblastomas of the second type, the meningiomas, the metastases and the healthy tissues. The database provided by the specialists contains 21 glioblastomas of first type, 9 glioblastomas of second type, 16 meningiomas, 18 metastases and 9 healthy tissues, that is, 75 spectra sampled at 1024 points. We give the plot of the spectra considered in Figure 2.

In order to test our procedure, we used a strategy of "leave one out" type. Figure 4 gives an experimental confirmation that, in the case of two-class classification, the chosen dimension is a good one. We tested different configurations, summarized in the table of Figure 3. The classification error rate remains significant, but the dimension reduction procedure provides a reduction of the error rate (recall that in the case of 4 groups having equal a priori probability, a rule that would guess the type of tumor randomly would have an error rate of 75%).

There are two reasons for these moderate performances. Roughly, theoretical physics predicts that a spectrum associated with a given tumor, for example a glioblastoma, is a random variable $y = (y_q)_{q=1,\dots,p}$ that has quite small variability. Hence, we should be able to separate easily spectra associated with different groups. Unfortunately, in practice, the instrumentation leads to a measurement of spectra $z = (z_q)_{q=1,\dots,p}$ having complex values and for which there exists a sequence of angles $(\psi_q)_{q=1,\dots,p}$ such that, for all $q \in \{1, \dots
, p\}$, $y_q = \Re(e^{i \psi_q} z_q)$. This sequence of angles is unknown. The theoretical physics of the instrumentation shows that there are two reals $(a, b)$ such that, for all $q \in \{1, \dots, p\}$, $\psi_q = a q + b$. Methods to obtain $a$ and $b$ are not sufficiently efficient, but this represents an active field of research. We chose to ask the physicians to change the phase manually in order to have a homogeneous real part of the spectra within each group, and we kept the real part of the spectra. The change of phase made by the physicians is not optimal, and the residual variation of the phase creates a certain disparity of the observed spectra inside each group. This disparity can be seen in Figure 2. The incorporation of the phase into a classification algorithm, and the use of the complex nature of the data, will be the object of further studies.

(1) The group of glioblastomas has too large a variability; hence, we chose to divide it into two groups, first type and second type. These two types correspond to the presence of certain chemical substances.

[Figure 2: Spectra of the learning set. (a) 21 glioblastomas A; (b) 9 glioblastomas B; (c) 16 meningiomas; (d) 18 metastases; (e) 9 healthy tissues.]

Figure 3. Considered groups and error rate in each case:

Groups considered                            | Error rate
all                                          | 43%
all except glioblastomas of the first type   | 30%
metastases and meningiomas                   | 5%

[Figure 4: Classification error rate (in a two-group problem: meningiomas versus glioblastomas of the first type) as a function of the selected dimension. The dimension selected by our algorithm is marked by a black point in the figure.]
We note, however, that these phase problems in the Fourier domain can be translated, interestingly, into the temporal domain. Finally, the learning set is still too small; we hope to see its size increase in the forthcoming years.

6. A more geometric alternative measure of error: the learning error

6.1. Definition and main result

We have already defined the learning error to be
$$\mathcal{R}(g)=P\!\left(g(X)\neq Y\ \text{and}\ g^*(X)=Y\right),$$
which, when $Y\sim\mathcal{U}(\{0,1\})$, equals
$$\mathcal{R}(g)=\frac12\left(P_1\!\left(g(X)\neq1\ \text{and}\ g^*(X)=1\right)+P_0\!\left(g(X)\neq0\ \text{and}\ g^*(X)=0\right)\right).$$
In other words, the learning error is the probability of misclassifying $X$ with $g$ while classifying it correctly with $g^*$. Two points motivate the use of this error:

1. it leads to a simple geometric interpretation (mostly used in the two following sections), and hence it is used in all the further theoretical developments we will give;
2. it is not sensitive to the possible indistinguishability of the distributions $P_0$ and $P_1$, and it leads to lower bounds as in Section 2 (see the remark below).

It follows easily from
$$C(g)-C(g^*)=P\!\left(g(X)\neq Y\ \text{and}\ g^*(X)=Y\right)-P\!\left(g(X)=Y\ \text{and}\ g^*(X)\neq Y\right)$$
that any classification rule $g$ satisfies
$$C(g)-C(g^*)\ \le\ \mathcal{R}(g).\qquad(36)$$
In the Gaussian case studied in this article, we prove the following theorem, which gives a reverse inequality to (36).

Theorem 6.1. Let $g^*$ be the optimal rule in the binary classification problem (as presented in Section 1).

1.
If $P_0$ and $P_1$ have the same covariance $C$ and respective means $\mu_1$ and $\mu_0$, then, for all measurable functions $g:\mathbb{R}^p\to\{0,1\}$, we have:
$$C(g)-C(g^*)\ \ge\ \min\left\{\frac{\sqrt{2\pi}}{2\cdot16^2}\,\|C^{-1/2}m_{10}\|_{\mathbb{R}^p}\,e^{\frac{\|C^{-1/2}m_{10}\|^2_{\mathbb{R}^p}}{8}}\,\mathcal{R}(g)^2,\ \frac{\mathcal{R}(g)}{8}\right\},$$
where $m_{10}=\mu_1-\mu_0$.

2. Let $c_1>0$ and let $\mathcal{P}(c_1)$ be the set of couples $(P,Q)$ of Gaussian measures on $\mathbb{R}^p$ such that $d_1(P,Q)>c_1$. If $(P_1,P_0)\in\mathcal{P}(c_1)$, then there exists a constant $c(c_1)>0$ (that only depends on $c_1$) such that
$$C(g)-C(g^*)\ \ge\ \min\left\{c(c_1)\,\mathcal{R}(g)^8,\ \frac{\mathcal{R}(g)}{8}\right\}.$$

Before we prove this result, let us comment on it.

Comments. Let us note that $C(g)-C(g^*)\le\frac12 d_1(P_1,P_0)$. Hence, in the case where $d_1(P_1,P_0)$ tends to 0, the excess risk does not measure the difference between $g$ and $g^*$ but the proximity of $P_1$ and $P_0$. The learning error is not sensitive to this scale phenomenon, as the following example witnesses.

Example 6.1. Let $\mu\ge0$, $P_1=\mathcal{N}(\mu,1)$ and $P_0=\mathcal{N}(-\mu,1)$. In this case, for all $a\in\mathbb{R}$,
$$\mathcal{R}\!\left(\mathbf{1}_{[a,\infty)}\right)=\frac12\left(P(0<\xi+\mu<a)+P(a<\xi-\mu<0)\right),$$
where $\xi\sim\mathcal{N}(0,1)$; and $d_1(P_1,P_0)\to0$ if and only if $\mu\to0$, in which case
$$\mathcal{R}\!\left(\mathbf{1}_{[a,\infty)}\right)\to\frac12\,P\!\left(\xi\in[0,|a|]\right).$$
Under these conditions, the learning error associated with $\mathbf{1}_{[a,\infty)}$ tends to 0 only if $a$ tends to 0. In other words, when $\mu\to0$, the learning error makes a difference between the rules $\mathbf{1}_{[100,\infty)}$ and $g^*=\mathbf{1}_{[0,\infty)}$:
$$\inf_{\mu<50}\mathcal{R}\!\left(\mathbf{1}_{[100,\infty)}\right)\ \ge\ \frac12\,P\!\left(\xi\in[0,50]\right)\ \approx\ \frac14,$$
while we have
$$C\!\left(\mathbf{1}_{[100,\infty)}\right)-C(g^*)\ \le\ \frac12\,d_1(P_1,P_0)\ \le\ \mu\sqrt{\frac{2}{\pi}}.$$

Remark 6.1. By definition, the excess risk is the quantity of interest. The problem with it is that it can give credit to any given procedure when $d_1(P_1,P_0)$ is sufficiently small.
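The scale phenomenon of Example 6.1 is easy to check numerically. The sketch below (an illustration, not part of the paper; the choices $a=1$ and $\mu=10^{-3}$ are arbitrary) evaluates the learning error and the excess risk of $\mathbf{1}_{[a,\infty)}$ in closed form through the standard normal CDF:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def learning_error(a, mu):
    """R(1_[a,oo)) for P1 = N(mu, 1), P0 = N(-mu, 1) and a > 0:
    only the term P(0 < xi + mu < a) survives in Example 6.1."""
    return 0.5 * (Phi(a - mu) - Phi(-mu))

def excess_risk(a, mu):
    """C(1_[a,oo)) - C(g*), with g* the threshold at 0, computed as
    (1/2) * integral over [0, a] of |dP1 - dP0|."""
    return 0.5 * ((Phi(a - mu) - Phi(-mu)) - (Phi(a + mu) - Phi(mu)))

mu, a = 1e-3, 1.0
R = learning_error(a, mu)   # stays close to (1/2) P(xi in [0, 1]) ~ 0.17
E = excess_risk(a, mu)      # of order mu: vanishes although the rule is bad
```

For small $\mu$ the excess risk is of order $\mu$ while the learning error stays bounded away from zero, in line with inequality (36) and the discussion above.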
Hence, one can never argue, according to the excess risk alone, that a rule is bad. In the preceding example, the procedure $g(x)=\mathbf{1}_{[100,\infty)}(x)$ is uniformly (on, say, $|\mu|\le50$) inconsistent according to the learning error, but not according to the excess risk.

The main consequence of this theorem has already been used in Section 2.2. From equation (36), if $(g_n)_{n\ge0}$ is a sequence of classification rules such that $\mathcal{R}(g_n)$ tends to zero, then $C(g_n)-C(g^*)$ tends to zero. Theorem 6.1 implies the converse result.

6.2. Proof of Theorem 6.1

Proof. Let us take
$$K_1=\{x\in\mathbb{R}^p:\ g(x)\neq1\ \text{and}\ g^*(x)=1\}\qquad\text{and}\qquad K_0=\{x\in\mathbb{R}^p:\ g(x)\neq0\ \text{and}\ g^*(x)=0\}.$$
Then $\mathcal{R}(g)=\frac12(P_1(K_1)+P_0(K_0))$, and at least one of the following two inequalities is satisfied (from the pigeonhole principle):
$$P_1(K_1)\ge\mathcal{R}(g),\qquad P_0(K_0)\ge\mathcal{R}(g).$$
Without loss of generality, we will suppose that $P_1(K_1)\ge\mathcal{R}(g)$, which implies $P_1(K_1)+P_0(K_1)\ge\mathcal{R}(g)$. Note that we have
$$C(g)-C(g^*)=P(g\neq Y)-P(g^*\neq Y)=\frac12\left(P_1(K_1)-P_1(K_0)\right)+\frac12\left(P_0(K_0)-P_0(K_1)\right)$$
(by conditioning with respect to $Y$)
$$=\frac12\left((P_1-P_0)(K_1)+(P_0-P_1)(K_0)\right),$$
and, because $g^*(X)=1$ if and only if $dP_1\ge dP_0$ (by definition of $g^*$ and from the fact that $Y\sim\mathcal{U}(\{0,1\})$), we get
$$C(g)-C(g^*)=\frac12\int\mathbf{1}_{K_1\cup K_0}\,|dP_1-dP_0|\ \ge\ \frac12\int\mathbf{1}_{K_1}\,|dP_1-dP_0|.$$
(37)

A straightforward calculation (see for example [15], Proposition 1.4.2, Chapter 1, Part I) leads to
$$\int_{\mathcal{X}}m(x)\,(dP_1-dP_0)=2\,\mathbb{E}_P\!\left[m(X)\,e^{f_{10}(P,X)}\sinh\!\left(\tfrac12L_{10}(X)\right)\right]$$
for all measurable $m$, where $P$ is any probability measure that dominates $P_1$ and $P_0$, $f_{10}(P,x)=\frac12\log\!\left(\frac{dP_1}{dP}\frac{dP_0}{dP}\right)$ and $L_{10}(x)=\log\!\left(\frac{dP_1}{dP_0}(x)\right)$. In particular,
$$d_1(P_1,P_0)=2\,\mathbb{E}_P\!\left[e^{f_{10}(P,X)}\left|\sinh\!\left(\tfrac12L_{10}(X)\right)\right|\right].$$
Also note that whenever $K\subset\{x\in\mathbb{R}^p:\ L_{10}(x)\ge0\}$, we have
$$P_1(K)-P_0(K)=2\,\mathbb{E}_P\!\left[\mathbf{1}_K\,e^{f_{10}(P,X)}\sinh(L_{10}(X)/2)\right],$$
and as a consequence, (37) can be rewritten
$$C(g)-C(g^*)\ \ge\ \mathbb{E}_P\!\left[\mathbf{1}_{K_1}(X)\,e^{f_{10}(P,X)}\sinh(L_{10}(X)/2)\right].\qquad(38)$$
It can also be shown that
$$P_1(K)+P_0(K)=2\,\mathbb{E}_P\!\left[\mathbf{1}_K\,e^{f_{10}(P,X)}\cosh(L_{10}(X)/2)\right],$$
and consequently, $P_1(K_1)+P_0(K_1)\ge\mathcal{R}(g)$ is rewritten
$$2\,\mathbb{E}_P\!\left[\mathbf{1}_{K_1}(X)\,e^{f_{10}(P,X)}\cosh(L_{10}(X)/2)\right]\ \ge\ \mathcal{R}(g).\qquad(39)$$
On the other hand, $d_1(P_1,P_0)\ge c_1$ leads to:
$$2\,\mathbb{E}_P\!\left[e^{f_{10}(P,X)}\left|\sinh(L_{10}(X)/2)\right|\right]\ \ge\ c_1.\qquad(40)$$
In the rest of the proof, we shall combine (39) and (40) in order to lower bound the right member of (38). We remark that the left member of (39) and the right member of (38) only differ by a factor two and by the replacement of a sinh by a cosh; for our purpose, these two functions only differ fundamentally near zero. We are going to decompose $K_1$ into two disjoint sets:
$$K_1^+=\{x\in K_1:\ L_{10}(x)\ge2\}\qquad\text{and}\qquad K_1^-=\{x\in K_1:\ L_{10}(x)<2\}.$$
Let us also define $A$ and $B$ by:
$$\int_{K_1}e^{f_{10}(P,x)}\sinh(L_{10}(x)/2)\,P(dx)=\underbrace{\int_{K_1^+}e^{f_{10}(P,x)}\sinh(L_{10}(x)/2)\,P(dx)}_{A}+\underbrace{\int_{K_1^-}e^{f_{10}(P,x)}\sinh(L_{10}(x)/2)\,P(dx)}_{B}.$$
From (39) (and the pigeonhole principle), two cases can occur. In the first case,
$$\mathbb{E}_P\!\left[\mathbf{1}_{K_1^+}(X)\,e^{f_{10}(P,X)}\cosh(L_{10}(X)/2)\right]\ \ge\ \mathcal{R}(g)/4,$$
and in the second,
$$\mathbb{E}_P\!\left[\mathbf{1}_{K_1^-}(X)\,e^{f_{10}(P,X)}\cosh(L_{10}(X)/2)\right]\ \ge\ \mathcal{R}(g)/4.\qquad(41)$$
In the first case, because $X\in K_1^+$ implies
$$\sinh(L_{10}(X)/2)\ \ge\ \tfrac12\cosh(L_{10}(X)/2)\qquad(\text{since }\ln3\le2),$$
we have $A\ge\mathcal{R}(g)/8$, and hence the desired result (it suffices to remark that $L_{10}(x)\ge0$ if $x\in K_1$, which implies $B\ge0$). We shall now consider the case where (41) is satisfied. In this case, because $\cosh(x)\le2$ for all $|x|\le1$, we have
$$\int_{K_1^-}e^{f_{10}(P,x)}\,P(dx)\ \ge\ \mathcal{R}(g)/8.$$
Also, the definition
$$d\nu=\frac{e^{f_{10}(P,x)}\,dP}{\int e^{f_{10}(P,x)}\,dP}$$
makes $\nu$ a probability measure on $\mathbb{R}^p$, and
$$\nu(K_1^-)\ \ge\ \mathcal{R}(g)/8.\qquad(42)$$
On the other hand (see the definition of $f_{10}$),
$$\int e^{f_{10}(P,x)}\,dP=\int\sqrt{dP_1\,dP_0}=A_2(P_1,P_0)$$
($A_2(P_1,P_0)$ is the Hellinger affinity between $P_1$ and $P_0$), which leads to
$$B=A_2(P_1,P_0)\int_0^\infty\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}(X)/2)|\ge t\right)dt.\qquad(43)$$
We have
$$\nu(X\in K_1^-)=\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}/2)|\le t\right)+\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}/2)|>t\right).$$
Let $g$ be the application which associates to $t>0$ the real
$$g(t)=\sup_{(P_1,P_0)\in\mathcal{P}(c_1)}\nu\!\left(|\sinh(L_{10}(X)/2)|\le t\right).\qquad(44)$$
For every $t>0$, we have
$$\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}/2)|\ge t\right)\ \ge\ \nu(X\in K_1^-)-\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}/2)|\le t\right).$$
We then deduce from this inequality and from (43) that for all $\epsilon\ge0$,
$$B\ \ge\ A_2(P_1,P_0)\int_0^\epsilon\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}(X)/2)|\ge t\right)dt$$
$$\ \ge\ \epsilon\,A_2(P_1,P_0)\,\nu(X\in K_1^-)-A_2(P_1,P_0)\int_0^\epsilon\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}/2)|\le t\right)dt$$
$$\ \ge\ \epsilon\,\mathcal{R}(g)/8-A_2(P_1,P_0)\int_0^\epsilon\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}/2)|\le t\right)dt,$$
where this last inequality results from (42) (indeed $A_2(P_1,P_0)\,\nu(K_1^-)=\int_{K_1^-}e^{f_{10}(P,x)}\,dP\ge\mathcal{R}(g)/8$). The rest of the proof relies on the following lemma.

Lemma 6.1.
1. The application $g$ defined by (44) satisfies
$$g(t)\ \le\ \frac{c(c_1)}{A_2(P_1,P_0)}\,t^{1/7}$$
($c(c_1)$ is a positive constant that only depends on $c_1$).
2. In the case where $C_1=C_0=C$, we have
$$\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}/2)|\le t\right)\ \le\ \frac{4t}{\sqrt{2\pi}\,\|C^{-1/2}m_{10}\|_{\mathbb{R}^p}}.$$

We prove this lemma at the end of the current proof. Let us note that it is equation (40) that plays a crucial role in its proof. In the case where $C_1\neq C_0$,
$$A_2(P_1,P_0)\int_0^\epsilon\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}/2)|\le t\right)dt\ \le\ \tilde c(c_1)\,\epsilon^{1+1/7},$$
and the choice $\epsilon=\left(\frac{\mathcal{R}(g)}{16\,\tilde c(c_1)}\right)^7$ leads to the desired result. In the case where $C_1=C_0$,
$$\int_0^\epsilon\nu\!\left(X\in K_1^-\ \text{and}\ |\sinh(L_{10}/2)|\le t\right)dt\ \le\ \frac{2\epsilon^2}{\sqrt{2\pi}\,\|C^{-1/2}m_{10}\|_{\mathbb{R}^p}},$$
and the choice $\epsilon=\frac{\sqrt{2\pi}\,\|C^{-1/2}m_{10}\|_{\mathbb{R}^p}\,\mathcal{R}(g)}{32\,A_2(P_1,P_0)}$ leads to the desired result. Indeed, in the case where $C_1=C_0$, a classical calculation leads to
$$A_2(P_1,P_0)=\int e^{f_{10}(P,x)}\,dP=e^{-\frac{\|C^{-1/2}(\mu_1-\mu_0)\|^2_{\mathbb{R}^p}}{8}}.$$

Let us now prove Lemma 6.1.

Proof. Let us begin with point 2.
It is sufficient to notice that if $P_{1|0}$ is the Gaussian measure with covariance $C$ and mean $s_{10}$, and if $X$ is a random variable drawn from $P_{1|0}$, then
$$e^{f_{10}(P_{1|0},X)}=e^{-\frac{\|C^{-1/2}(\mu_1-\mu_0)\|^2_{\mathbb{R}^p}}{8}}\qquad\text{and, in distribution,}\qquad L_{10}(X)\sim\mathcal{N}(0,\sigma^2),$$
where $\sigma^2=\|C^{-1/2}(\mu_1-\mu_0)\|^2_{\mathbb{R}^p}$. Hence, we get
$$\nu\!\left(|\sinh(L_{10}(X)/2)|\le t\right)=P\!\left(|\mathcal{N}(0,\sigma^2)|\le2\operatorname{arcsinh}(t)\right)\ \le\ \frac{4\operatorname{arcsinh}(t)}{\sqrt{2\pi}\,\sigma}\ \le\ \frac{4t}{\sqrt{2\pi}\,\sigma}.$$
Let us now prove point 1 of the lemma. We have
$$\nu\!\left(|\sinh(L_{10}(X)/2)|\le t\right)\ \le\ \frac{1}{A_2(P_1,P_0)}\int\mathbf{1}_{|\sinh(L_{10}(x)/2)|\le t}\left(\frac{dP_1}{dP_0}\right)^{1/2}dP_0\ \le\ \frac{P_0^{1/2}\!\left(|L_{10}(X)/2|\le t\right)}{A_2(P_1,P_0)}$$
(from the Cauchy-Schwarz inequality and $\operatorname{arcsinh}(t)\le t$). Finally, we conclude from point 2 of Theorem 8.4, given in Section 8, whose hypothesis is satisfied since
$$c_1\ \le\ d_1(P_1,P_0)\ \le\ 2\sqrt{K(P_0,P_1)}\quad(\text{from Pinsker's inequality, see [24]})\ \le\ 2\,\|L_{10}\|^{1/2}_{L^2(P_0)}\quad(\text{from the Cauchy-Schwarz inequality}).$$

7. A geometrical analysis of LDA to solve Problem 1

7.1. Introduction and first result

Let $\mathcal{X}$ be a separable Banach space, endowed with its Borel $\sigma$-field and a Gaussian measure $\gamma$. Throughout what follows, we will associate to any measurable $f$ the set
$$V_f=\{x\in\mathcal{X}:\ f(x)\ge0\}.\qquad(45)$$
In this section, $\mathcal{X}=\mathbb{R}^p$. Recall that $\alpha$ (defined by (5)) is the angle, according to the geometry of $L^2(\gamma_C)$, between $F_{10}$ and $\hat F_{10}$. This quantity will play a very important role in the whole section. In order to shorten notation, we will replace $\mathcal{R}(\mathbf{1}_{\hat V})$ by $\mathcal{R}$ in this section and those that follow. Recall that
$$F_{10}=C^{-1}m_{10},\qquad m_{10}=\mu_1-\mu_0,\qquad s_{10}=\frac{\mu_1+\mu_0}{2},$$
where $\mu_1$ (resp. $\mu_0$) and $C$ are the mean and (common) covariance of the distribution $P_1=\gamma_{C,\mu_1}$ (resp. $P_0=\gamma_{C,\mu_0}$) of data from group 1 (resp. 0).
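To make the objects just recalled concrete, here is a small simulation (a sketch under assumed parameters: identity covariance, hypothetical means and sample sizes; this is not the paper's experiment) that estimates the learning error of the plug-in LDA rule by Monte Carlo:

```python
import random

random.seed(0)
p, n = 20, 50                     # dimension and per-group sample size (hypothetical)
mu1 = [0.5] * p                   # assumed means; C is the identity, so
mu0 = [-0.5] * p                  # F10 = C^{-1} m10 = m10 here

m10 = [mu1[i] - mu0[i] for i in range(p)]
s10 = [(mu1[i] + mu0[i]) / 2.0 for i in range(p)]

def draw(mu):
    """One observation from N(mu, I_p)."""
    return [random.gauss(mu[i], 1.0) for i in range(p)]

# Plug-in estimates from n observations of each group
hat_mu1 = [sum(xs) / n for xs in zip(*[draw(mu1) for _ in range(n)])]
hat_mu0 = [sum(xs) / n for xs in zip(*[draw(mu0) for _ in range(n)])]
hat_m10 = [hat_mu1[i] - hat_mu0[i] for i in range(p)]
hat_s10 = [(hat_mu1[i] + hat_mu0[i]) / 2.0 for i in range(p)]

def rule(x, f, s):
    """Classify as 1 iff <f, x - s> >= 0 (LDA rule with C = I)."""
    return sum(f[i] * (x[i] - s[i]) for i in range(p)) >= 0

# Learning error: the optimal rule is correct but the plug-in rule is wrong
N, bad = 20000, 0
for _ in range(N):
    y = random.random() < 0.5
    x = draw(mu1 if y else mu0)
    if rule(x, m10, s10) == y and rule(x, hat_m10, hat_s10) != y:
        bad += 1
R_hat = bad / N
```

With these well-separated means, both the Bayes error and the estimated learning error are small; shrinking $n$ relative to $p$ inflates $\mathcal{R}$, in line with the $n/p$ discussion of the earlier sections.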
With the above-defined notation (45), the optimal rule and the plug-in rule can be rewritten with
$$V=V_{\langle F_{10},\,x-s_{10}\rangle_{\mathbb{R}^p}}\qquad\text{and}\qquad\hat V=V_{\langle\hat F_{10},\,x-\hat s_{10}\rangle_{\mathbb{R}^p}}.$$
For the purpose of this section, let us note that the learning error studied in the preceding section, and introduced by equation (8), is (in the case of LDA)
$$\mathcal{R}=\frac12\left(\gamma_{C,\mu_0}\!\left(X\in\hat V\setminus V\right)+\gamma_{C,\mu_1}\!\left(X\in V\setminus\hat V\right)\right),$$
which implies
$$\mathcal{R}=\frac12\left(\gamma_{C,s_{10}}\!\left(X\in\hat V\setminus V-\frac{m_{10}}{2}\right)+\gamma_{C,s_{10}}\!\left(X\in V\setminus\hat V+\frac{m_{10}}{2}\right)\right).\qquad(46)$$
The problem now becomes that of measuring two areas of $\mathbb{R}^p$ with $\gamma_{C,s_{10}}$. Standard properties of Gaussian measures now lead to
$$\mathcal{R}=\frac12\,\gamma_p\!\left(V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\setminus V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}-\frac{G_p}{2}\right)\qquad(47)$$
$$+\frac12\,\gamma_p\!\left(V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}+\frac{G_p}{2}\right),$$
where
$$d_0=\langle\hat F_{10},\,\hat s_{10}-s_{10}\rangle_{\mathbb{R}^p},\qquad G_p=C^{1/2}F_{10}=C^{-1/2}m_{10},\qquad\hat G_p=C^{1/2}\hat F_{10},\qquad e_p=C^{1/2}(\hat F_{10}-F_{10}).\qquad(48)$$
One may note that the change of geometry implies
$$\|G_p\|_{\mathbb{R}^p}=\|F_{10}\|_{L^2(\gamma_C)},\qquad\|\hat G_p\|_{\mathbb{R}^p}=\|\hat F_{10}\|_{L^2(\gamma_C)},\qquad\|e_p\|_{\mathbb{R}^p}=\|F_{10}-\hat F_{10}\|_{L^2(\gamma_C)},\qquad(49)$$
and $\alpha$ (defined by equation (5)) is the angle, in the geometry of $\mathbb{R}^p$, between $G_p$ and $\hat G_p$. The following theorem gives lower and upper bounds on the learning error $\mathcal{R}$ as functions of (among others) $\alpha$. Its proof relies on the fact that $\mathcal{R}$ is the measure by $\gamma_2$ of two "simple" areas of $\mathbb{R}^2$ (see Figure 5), and on four elementary properties of Gaussian measures to be given later (see Figure 6).

Theorem 7.1. Let $d_0=\langle\hat F_{10},\hat s_{10}-s_{10}\rangle_{\mathbb{R}^p}$. The learning error $\mathcal{R}$, as a function of $\alpha$, satisfies
$$\forall\alpha\in[-\pi,\pi],\qquad\mathcal{R}(\alpha)=\mathcal{R}(-\alpha).$$
The learning error also satisfies the following inequalities. If $\alpha\ge\frac{\pi}{2}$, then $\mathcal{R}\ge\frac12$. If $0\le\alpha<\frac{\pi}{2}$, then $\mathcal{R}\le\frac12$, and we distinguish between four cases.

1.
If $|d_0|\le\frac14\left|\langle F_{10},\hat F_{10}\rangle_{L^2(\gamma_C)}\right|$, we have:
$$e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{8}}\,\frac14\left(\frac{\alpha}{2\pi}+\frac12\,\gamma_1\!\left(\left[0;\frac{|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\right]\right)\right)\ \le\ \mathcal{R},\qquad(50)$$
and
$$\mathcal{R}\ \le\ e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}\cos^2(\alpha)}{32}}\left(\frac{\alpha}{2\pi}+\gamma_1\!\left(\left[0;\frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\right]\right)\right).\qquad(51)$$
2. If $\frac14\left|\langle F_{10},\hat F_{10}\rangle_{L^2(\gamma_C)}\right|<|d_0|\le\frac12\left|\langle F_{10},\hat F_{10}\rangle_{L^2(\gamma_C)}\right|$, we have:
$$e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{2}}\,\frac14\left(\frac12\,\gamma_1\!\left(\left[0;\frac{\|F_{10}\|_{L^2(\gamma_C)}}{4}\right]\right)+\frac{\alpha}{2\pi}\right)\ \le\ \mathcal{R},\qquad(52)$$
$$\mathcal{R}\ \le\ \frac{\alpha}{2\pi}+\gamma_1\!\left(\left[0;\frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\right]\right).\qquad(53)$$
3. If $\frac12\left|\langle F_{10},\hat F_{10}\rangle_{L^2(\gamma_C)}\right|<|d_0|$, we have:
$$\frac{\alpha}{4\pi}+\frac14\,\gamma_1\!\left(\left[0;\frac{\|F_{10}\|_{L^2(\gamma_C)}}{2}\right]\right)\ \le\ \mathcal{R},\qquad(54)$$
$$\mathcal{R}\ \le\ \frac{\alpha}{2\pi}+\gamma_1\!\left(\left[0;\frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\right]\right).$$
4. If $|d_0|=0$, then we have
$$e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{8}}\,\frac{\alpha}{2\pi}\ \le\ \mathcal{R}.\qquad(55)$$

Proof. Step 1: the problem is two-dimensional. We shall prove the equality
$$\mathcal{R}=\frac12\,\gamma_2\!\left(Q_a^--y^+\right)+\frac12\,\gamma_2\!\left(Q_b^--y^-\right),\qquad(56)$$
where $Q_a^-$, $Q_b^-$, $y^+$ and $y^-$ will be defined below: $Q_a^-$ and $Q_b^-$ are two areas of $\mathbb{R}^2$, $y^+$ and $y^-$ are two vectors of $\mathbb{R}^2$, and all these quantities are illustrated in Figure 5. In the following we shall use the notation $\tilde e_p=\Pi_{G_p^\perp}e_p$ for the orthogonal projection of $e_p$ onto the orthogonal complement of $G_p$ in $\mathbb{R}^p$. We will suppose that $\|\tilde e_p\|_{\mathbb{R}^p}\neq0$, since the part of the result concerning $\|\tilde e_p\|_{\mathbb{R}^p}=0$ is straightforward. The calculation of $\mathcal{R}$ is intrinsically a calculation in the two-dimensional space $M_p$ spanned by $G_p$ and $\tilde e_p$.
To make this fact clear, note that for all $z_1\in M_p$ and $z_2\in M_p^\perp$ we have:
$$V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\setminus V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}+z_1+z_2\ =\ V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\setminus V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}+z_1$$
and
$$V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}+z_1+z_2\ =\ V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}+z_1$$
(here $M_p^\perp$ is the orthogonal complement of $M_p$ in $\mathbb{R}^p$). By the tensorization property of $\gamma_p$ and equation (47), we finally get
$$\mathcal{R}=\frac12\,\gamma_2\!\left(M_p\cap\left(V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\setminus V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\right)-\frac{G_p}{2}\right)\qquad(57)$$
$$+\frac12\,\gamma_2\!\left(M_p\cap\left(V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\right)+\frac{G_p}{2}\right).\qquad(58)$$
Hence, in the sequel we will identify $M_p$ with $\mathbb{R}^2$; $D$ and $\hat D$ will be the straight lines of $M_p$ with equations $\langle\cdot,G_p\rangle_{\mathbb{R}^p}=0$ and $\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0=0$. It can easily be shown that these lines intersect at a point $a_p$ given by
$$a_p=-\,\frac{d_0\,\tilde e_p}{\|\tilde e_p\|^2_{\mathbb{R}^p}}.\qquad(59)$$
Also,
$$V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}=V_{\langle\cdot-a_p,G_p\rangle_{\mathbb{R}^p}}\qquad\text{and}\qquad V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}=V_{\langle\cdot-a_p,G_p+e_p\rangle_{\mathbb{R}^p}},$$
and, with the same calculation that was used to obtain (47), equation (57) becomes:
$$\mathcal{R}=\frac12\,\gamma_2\!\left(M_p\cap\left(V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\right)-\frac{G_p}{2}+a_p\right)\qquad(60)$$
$$+\frac12\,\gamma_2\!\left(M_p\cap\left(V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}}\right)+\frac{G_p}{2}+a_p\right).\qquad(61)$$
Notice that, for reasons of symmetry, we can assume that $d_0\ge0$ without loss of generality. In the sequel, we shall use the notation
$$y^+=\frac{G_p}{2}-a_p\qquad\text{and}\qquad y^-=-\frac{G_p}{2}-a_p;\qquad(62)$$
the coordinates of $y^+$ in the orthonormal coordinate system obtained from the orthogonal coordinate system $(0,\tilde e_p,G_p)$ will be denoted $(y_h,y_v)$ and are equal to $\left(\frac{d_0}{\|\tilde e_p\|_{\mathbb{R}^p}},\frac{\|G_p\|_{\mathbb{R}^p}}{2}\right)$. We shall also write
$$Q_a^-=M_p\cap\left(V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\right)\qquad\text{and}\qquad Q_b^-=M_p\cap\left(V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}}\right).\qquad(63)$$

Figure 5.
Figure giving the definitions of $Q_a^-$, $Q_b^-$, $Q^+$, and $Q_\epsilon$ for Lemma 7.1.

We finally derive equation (56). From Figure 5, we notice that replacing $\alpha$ by $-\alpha$ does not change $\mathcal{R}$; that if $0<\alpha\le\pi/2$ then $\mathcal{R}\le\frac12$; and that if $\pi\ge\alpha\ge\pi/2$ then $\mathcal{R}\ge\frac12$. Hence, we will from now on suppose that $\alpha\in[0,\pi/2]$.

Step 2. The rest of the proof relies on the following lemma.

Lemma 7.1. Let $Q^+$ and $Q_\epsilon$ be defined by Figure 5, forming, together with $Q_a^-$ and $Q_b^-$, a partition of $\mathbb{R}^2$. Let $u=\tan(\alpha)\,y_h$. We then have:

• If $y^-\in Q^-$, then
$$\frac12\,\gamma_1([0;|y_v|])+\frac{\alpha}{2\pi}+\gamma_1\!\left(\left[0,\frac{y_v}{2}\right]\right)\gamma_1\!\left(\left[0;\frac{y_v\cos(\alpha)}{2\sin(\alpha)}\right]\right)\ \le\ \gamma_2(Q_b^--y^-),$$
$$\gamma_2(Q_b^--y^-)\ \le\ \frac{\alpha}{2\pi}+\gamma_1\!\left([0;|u|(1+\tan(\alpha))]\right).\qquad(64)$$
• If $y^-\in Q^+$, then
$$e^{-\frac{y_v^2}{2}}\,\frac12\left(\frac12\,\gamma_1([0;|u|])+\frac{\alpha}{2\pi}\right)\ \le\ \gamma_2(Q_b^--y^-),$$
$$\gamma_2(Q_b^--y^-)\ \le\ e^{-\frac{\epsilon^2y_v^2\cos^2(\alpha)}{2(1+\epsilon)^2}}\left(\gamma_1\!\left([0;(1+\tan(\alpha))|u|]\right)+\frac{\alpha}{2\pi}\right).\qquad(65)$$
• If $y^-\in Q_\epsilon$, then
$$e^{-\frac{(1+\epsilon)^2|u|^2}{2}}\,\frac12\left(\frac12\,\gamma_1([0;|u|])+\frac{\alpha}{2\pi}\right)\ \le\ \gamma_2(Q_b^--y^-),$$
$$\gamma_2(Q_b^--y^-)\ \le\ \gamma_1\!\left([0;(1+\tan(\alpha))|u|]\right)+\frac{\alpha}{2\pi}.\qquad(66)$$
• Concerning $\gamma_2(Q_a^--y^+)$, we have:
$$\gamma_2(Q_a^--y^+)\ \le\ \gamma_2(Q_b^--y^-).\qquad(67)$$
• Finally, if $y_h=0$, we have
$$e^{-\frac{y_v^2}{2}}\,\frac{\alpha}{2\pi}\ \le\ \gamma_2(Q_a^--y^+)=\gamma_2(Q_b^--y^-).\qquad(68)$$

This lemma will be proven in Subsection 7.3; let us first see how it implies Theorem 7.1. Fix $\epsilon=1$ for the rest of the proof (other values of $\epsilon$ will help us in the proof of Theorem 2.2). Equation (67) of the lemma implies that
$$\frac12\,\gamma_2(Q_b^--y^-)\ \le\ \mathcal{R}\ \le\ \gamma_2(Q_b^--y^-).$$
Recall that $(y_h,y_v)$ has been defined, following equation (62), as the coordinates of $y^+$, and that $u=\tan(\alpha)\,y_h$. A simple calculation leads to
$$u=\frac{|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\qquad\text{and}\qquad y_v^2=\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{4}.$$
If $\frac12|\langle G_p,\hat G_p\rangle_{\mathbb{R}^p}|<|d_0|$, we have, in the preceding lemma, $y^-\in Q^-$, and:
$$\frac14\,\gamma_1\!\left(\left[0;\frac{\tan(\alpha)\,\|F_{10}\|_{L^2(\gamma_C)}}{2}\right]\right)+\frac{\alpha}{4\pi}\ \le\ \mathcal{R},$$
$$\mathcal{R}\ \le\ \frac{\alpha}{2\pi}+\gamma_1\!\left(\left[0;\frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\right]\right).$$
The case where $|d_0|<\frac14|\langle G_p,\hat G_p\rangle_{\mathbb{R}^p}|$ (which means that $2|u|<|y_v|$) is the case where $y^-\in Q^+$, and we then have:
$$e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{8}}\,\frac14\left(\frac{\alpha}{2\pi}+\frac12\,\gamma_1\!\left(\left[0;\frac{|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\right]\right)\right)\ \le\ \mathcal{R},$$
and
$$\mathcal{R}\ \le\ e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}\cos^2(\alpha)}{32}}\left(\frac{\alpha}{2\pi}+\gamma_1\!\left(\left[0;\frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\right]\right)\right).$$
If $\frac14|\langle G_p,\hat G_p\rangle_{\mathbb{R}^p}|<|d_0|<\frac12|\langle G_p,\hat G_p\rangle_{\mathbb{R}^p}|$ (which means that $2|u|>|y_v|>|u|$), we have, in the preceding lemma, $y^-\in Q_\epsilon$ ($\epsilon=1$), and since in this case $|y_v|>|u|>|y_v|/2$, we get:
$$e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{2}}\,\frac14\left(\frac12\,\gamma_1\!\left(\left[0;\frac{\|F_{10}\|_{L^2(\gamma_C)}}{4}\right]\right)+\frac{\alpha}{2\pi}\right)\ \le\ \mathcal{R}$$
and
$$\mathcal{R}\ \le\ \frac{\alpha}{2\pi}+\gamma_1\!\left(\left[0;\frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\right]\right).$$
This ends the proof of Theorem 7.1.

7.2. Proof of Theorem 2.2

Theorem 2.2 is also a consequence of the preceding lemma; we will use it while tuning the value of $\epsilon$. We use, without restating them, the definitions given before the preceding lemma. Let us assume that $\frac{2|d_0|}{|\langle F_{10},\hat F_{10}\rangle_{L^2(\gamma_C)}|}$ has a limit inferior $a<1$. Then there exists $\epsilon>0$ such that $y^+$ and $y^-$ (defined by (62)) belong to $Q^+$ (for $\|F_{10}\|_{L^2}\cos(\alpha)$ large enough), and equation (65) implies that
$$\mathcal{R}\ \le\ e^{-\frac{\epsilon^2\|F_{10}\|^2_{L^2}\cos^2(\alpha)}{2(1+\epsilon)^2}}\left(1+\frac{|\alpha|}{2\pi}\right),$$
so that $\mathcal{R}$ tends to 0 when $\|F_{10}\|^2_{L^2}\cos^2(\alpha)$ tends to infinity.
If now $\frac{2|d_0|}{|\langle F_{10},\hat F_{10}\rangle_{L^2(\gamma_C)}|}$ tends to $a>1$, then $y^+$ or $y^-$ (given by (62)) belongs to $Q^-$ (for $\|F_{10}\|_{L^2}\cos(\alpha)$ large enough). Since in this case equation (64) leads to
$$\mathcal{R}\ \ge\ \frac14\left(\frac12\,\gamma_1\!\left([0;\|F_{10}\|_{L^2}/2]\right)+\gamma_1\!\left(\left[0;\frac{\|F_{10}\|_{L^2}\cos(\alpha)}{4\sin(\alpha)}\right]\right)\gamma_1\!\left([0;\|F_{10}\|_{L^2}/4]\right)+\frac{\alpha}{2\pi}\right),\qquad(69)$$
we obtain the desired result by letting $\|F_{10}\|_{L^2}$ tend to infinity. One has to observe that $\alpha$ depends on $\|F_{10}\|_{L^2}$, and that the limit values $\alpha=\pi/2$ and $\alpha=0$ require the use of different terms of inequality (69). This ends the proof of Theorem 2.2.

7.3. Proof of Lemma 7.1

This proof is the central part of this section. It is mostly geometrical, and requires only the four following properties (illustrated in Figure 6):

Figure 6. The four properties used in the proof.

• Property 1. If $A\subset\mathbb{R}^2$ lies between two half-lines $(0,u)$ and $(0,v)$ such that $\mathrm{Angle}(u,v)=\alpha$, then $\gamma_2(A)=\frac{\alpha}{2\pi}$. This result follows directly from the rotational invariance of the Gaussian measure. Such an area will be called an angular portion of size $\alpha$ and centre 0.

• Properties 2 and 3. Let $y\in\mathbb{R}^2$, $D$ a straight line of $\mathbb{R}^2$, $b$ the orthogonal projection of $y$ on $D$, and $h$ the distance from $y$ to $D$. If $A\subset\mathbb{R}^2$ is included in the half-plane delimited by $D$ that does not contain $y$, then $\gamma_2(A-y)\le e^{-h^2/2}\gamma_2(A-b)$; this is Property 2. If $A\subset\mathbb{R}^2$ is included in the half-plane delimited by $D$ that contains $y$, then $\gamma_2(A-y)\ge e^{-h^2/2}\gamma_2(A-b)$; this is Property 3.

• Property 4. If $A=[0;d]\times[0;\infty)$ (see Figure 6), then $\gamma_2(A)=\frac12\gamma_1([0;d])$. Such a rectangle will be called an infinite rectangle of origin 0 and height $d$.

We will denote by $q$ and $\hat q$ the orthogonal projections of $y^-$ on $D$ and $\hat D$.
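Properties 1 and 4 can be checked by a quick Monte Carlo experiment; the following sketch (an illustration with an arbitrary seed and arbitrary values of $\alpha$ and $d$, not part of the proof) estimates the $\gamma_2$-measure of an angular portion and of an infinite rectangle:

```python
import math
import random

random.seed(1)
N = 200000

def sample():
    """One draw from the standard Gaussian measure gamma_2 on R^2."""
    return random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)

# Property 1: an angular portion of size alpha and centre 0 has
# gamma_2-measure alpha / (2*pi), by rotational invariance.
alpha = 0.8
hits = 0
for _ in range(N):
    x, y = sample()
    theta = math.atan2(y, x)          # angle of (x, y) in (-pi, pi]
    if 0.0 <= theta < alpha:
        hits += 1
p1 = hits / N                          # ~ alpha / (2*pi)

# Property 4: A = [0, d] x [0, oo) has measure gamma_1([0, d]) / 2.
d = 1.0
in_rect = 0
for _ in range(N):
    x, y = sample()
    if 0.0 <= x <= d and y >= 0.0:
        in_rect += 1
p4 = in_rect / N                       # ~ gamma_1([0, d]) / 2
gamma1_0_d = 0.5 * math.erf(d / math.sqrt(2.0))   # gamma_1([0, d])
```

Both estimates agree with the stated closed forms up to the Monte Carlo error, which makes the geometric decompositions used below easy to sanity-check numerically.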
Properties 2 and 3 are well known, but for the sake of completeness we recall their proof. It suffices to note that
$$\gamma_2(A-y)=\int_{x\in A}\frac{1}{2\pi}\,e^{-\frac{\|x-y\|^2_{\mathbb{R}^2}}{2}}\,dx=e^{-\frac{h^2}{2}}\int_{x\in A}\frac{1}{2\pi}\,e^{-\frac{\|x-b\|^2_{\mathbb{R}^2}}{2}}\,e^{\langle x-b,\,y-b\rangle_{\mathbb{R}^2}}\,dx,$$
and that $x\in A$ implies $\langle x-b,y-b\rangle_{\mathbb{R}^2}\le0$ for Property 2 and $\langle x-b,y-b\rangle_{\mathbb{R}^2}\ge0$ for Property 3.

Figure 7. Figure to visualize the proof.

We are now going to distinguish between a number of cases and, in each of them, use the announced properties. First note that the inequality concerning $y^+$ is trivial. Figures 7 and 5 will be useful in the following.

Case $y^-\in Q_b^-$. In this case $|y_v|\le|u|$. One can include in $Q_b^-$ the disjoint union of an infinite rectangle of origin $y^-$ and height $|y_v|$; an angular portion of size $\alpha$ and centre $y^-$; and a rectangle with vertex $y^-$, height $|y_v|/2$ and length $\frac{y_v\cos(\alpha)}{2\sin(\alpha)}$. Using Properties 4 and 1, we then get:
$$\frac12\,\gamma_1([0;|y_v|])+\frac{\alpha}{2\pi}+\gamma_1\!\left(\left[0,\frac{y_v}{2}\right]\right)\gamma_1\!\left(\left[0;\frac{y_v\cos(\alpha)}{2\sin(\alpha)}\right]\right)\ \le\ \gamma_2(Q_b^--y^-).\qquad(70)$$
On the other hand, $Q_b^-$ can be included in the disjoint union of an angular portion with centre $y^-$, of two infinite rectangles with height less than or equal to $|u|\tan(\alpha)$, and of two infinite rectangles of height less than or equal to $|u|$. Hence, Properties 1 and 4 imply:
$$\gamma_2(Q_b^--y^-)\ \le\ \frac{\alpha}{2\pi}+\gamma_1\!\left([0;|u|(1+\tan(\alpha))]\right).\qquad(71)$$
Case $y^-\in Q^+$. In this case $|y_v|>(1+\epsilon)|u|$; $y^-$ is at distance $|y_v|$ from $D$ and at distance $(|y_v|-|u|)\cos(\alpha)\ge\frac{\epsilon}{1+\epsilon}|y_v|\cos(\alpha)$ from $\hat D$. Properties 2 and 3 imply:
$$e^{-\frac{y_v^2}{2}}\,\gamma_2(Q_b^--q)\ \le\ \gamma_2(Q_b^--y^-)\ \le\ e^{-\frac{\epsilon^2y_v^2\cos^2(\alpha)}{2(1+\epsilon)^2}}\,\gamma_2(Q_b^--\hat q).\qquad(72)$$
One can include in $Q_b^-$ an angular portion of size $\alpha$ with centre $q$, or an infinite rectangle of origin $q$ and height $|u|$. Hence, Properties 1 and 4, together with (72) and the fact that $\max(a,b)\ge\frac{a+b}{2}$, imply the inequality:
$$\frac12\left(\frac12\,\gamma_1([0;|u|])+\frac{\alpha}{2\pi}\right)\ \le\ \gamma_2(Q_b^--q).$$
The set $Q_b^-$ can be included in the union of an angular portion of size $\alpha$ centred at $\hat q$ and of two infinite rectangles of origin $\hat q$ and height $|u|(1+\tan(\alpha))$. Hence, Properties 1 and 4, together with (72) and $\max(a,b)\ge\frac{a+b}{2}$, imply the following inequalities:
$$e^{-\frac{y_v^2}{2}}\,\frac12\left(\frac12\,\gamma_1([0;|u|])+\frac{\alpha}{2\pi}\right)\ \le\ \gamma_2(Q_b^--y^-),\qquad(73)$$
$$\gamma_2(Q_b^--y^-)\ \le\ e^{-\frac{\epsilon^2y_v^2\cos^2(\alpha)}{2(1+\epsilon)^2}}\left(\gamma_1\!\left([0;|u|(1+\tan(\alpha))]\right)+\frac{\alpha}{2\pi}\right).$$
Case $y^-\in Q_\epsilon$. In this case $(1+\epsilon)|u|>|y_v|>|u|$; $y^-$ is at distance $|y_v|\le(1+\epsilon)|u|$ from $D$ and at distance $(|y_v|-|u|)\cos(\alpha)\ge0$ from $\hat D$. Properties 2 and 3 imply
$$e^{-\frac{(1+\epsilon)^2|u|^2}{2}}\,\gamma_2(Q_b^--q)\ \le\ \gamma_2(Q_b^--y^-)\ \le\ \gamma_2(Q_b^--\hat q),\qquad(74)$$
from which we deduce the following inequalities, in the same way as in the preceding paragraph:
$$e^{-\frac{(1+\epsilon)^2|u|^2}{2}}\,\frac12\left(\frac12\,\gamma_1([0;|u|])+\frac{\alpha}{2\pi}\right)\ \le\ \gamma_2(Q_b^--y^-),\qquad(75)$$
$$\gamma_2(Q_b^--y^-)\ \le\ \gamma_1\!\left([0;|u|(1+\tan(\alpha))]\right)+\frac{\alpha}{2\pi}.$$
This ends the proof of the lemma.

Remark 7.1 (On log-concave measures). It is natural to ask which types of probability measures satisfy the four properties we used. Concerning Property 2, it is possible to consider measures that are not Gaussian.
Suppose that $\mu$ is a probability measure on $\mathbb{R}^p$ with positive density $ae^{-\phi}$ with respect to the Lebesgue measure, where $\phi$ is strictly convex in the sense that there exists $c>0$ such that for all $x,y\in\mathbb{R}^p$,
$$\phi(x)+\phi(y)-2\phi\!\left(\frac{x+y}{2}\right)\ \ge\ \frac{c}{2}\,\|x-y\|^2_{\mathbb{R}^p},\qquad(76)$$
$\phi(0)=0=\arg\inf\phi$, $a$ is a positive constant, and $\phi$ is radial: there exists a function $\psi$ from $\mathbb{R}$ to $\mathbb{R}$ such that $\phi(x)=\psi(\|x\|)$. Let $y\in\mathbb{R}^p$, let $D$ be a hyperplane of $\mathbb{R}^p$, $b$ the orthogonal projection of $y$ on $D$, $h$ the distance from $y$ to $D$, and let $A\subset\mathbb{R}^p$ be included in the half-space delimited by $D$ which does not contain $y$. One can show (see Proposition 3.3.1, p. 126 in [15]) that
$$\mu(A-y)\ \le\ e^{-\frac{ch^2}{2}}\,\mu(A-b).$$

7.4. Proof of Theorem 2.1

Proof. The second equation of the theorem results directly from equation (51) of Theorem 7.1. To show the first equation of the theorem, we distinguish four cases. Case number 4 is the important one, relying on the use of Theorem 7.1; the other cases rely on verifying that the right member of the first equation of the theorem is not too small.

1. Case where $\langle\hat F_{10},F_{10}\rangle_{L^2(\gamma_C)}<0$. Let us note that, because $\mathcal{R}$ is a probability, we have $\mathcal{R}\le1$. In addition,
$$E\ \ge\ \|F_{10}-\hat F_{10}\|_{L^2(\gamma_C)}\ \ge\ \|F_{10}\|_{L^2(\gamma_C)},$$
which implies that $\mathcal{R}\le\frac{E}{\|F_{10}\|_{L^2(\gamma_C)}}$.

2. Case where $\langle\hat F_{10},F_{10}\rangle_{L^2(\gamma_C)}>0$ and $\|\hat F_{10}\|_{L^2(\gamma_C)}\le\frac12\|F_{10}\|_{L^2(\gamma_C)}$. Recall that $\mathcal{R}$ is upper bounded by $\frac12$ when $\langle\hat F_{10},F_{10}\rangle_{L^2(\gamma_C)}>0$ (see Theorem 7.1; it is the case where $\alpha$, defined by (5), satisfies $-\pi/2\le\alpha\le\pi/2$). In addition, the inequality $\|\hat F_{10}\|_{L^2(\gamma_C)}\le\frac12\|F_{10}\|_{L^2(\gamma_C)}$ implies $E\ge\frac12\|F_{10}\|_{L^2(\gamma_C)}$, and as a consequence $\mathcal{R}\le\frac12$ implies that $\mathcal{R}\le\frac{E}{\|F_{10}\|_{L^2(\gamma_C)}}$.

3.
Case where $\langle\hat F_{10},F_{10}\rangle_{L^2(\gamma_C)}>0$, $\|\hat F_{10}\|_{L^2(\gamma_C)}\ge\frac12\|F_{10}\|_{L^2(\gamma_C)}$ and $\frac{\pi}{2}>\alpha>\frac{\pi}{4}$ (recall that $\alpha$ has been defined by (5)). Since $\frac{\pi}{2}>\alpha>\frac{\pi}{4}$, we have $\cos(\alpha)\le\frac{\sqrt2}{2}$, and as a consequence, with the help of (5):
$$\langle\hat F_{10},F_{10}\rangle_{L^2(\gamma_C)}\ \le\ \frac{\sqrt2}{2}\,\|\hat F_{10}\|_{L^2(\gamma_C)}\,\|F_{10}\|_{L^2(\gamma_C)}.$$
Under this last constraint, we have
$$\min_{\hat F_{10}}\|F_{10}-\hat F_{10}\|^2_{L^2(\gamma_C)}\ =\ \min_a\left((1-a)^2+a^2\right)\|F_{10}\|^2_{L^2(\gamma_C)}\ =\ \frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{2},$$
which again implies $\mathcal{R}\le\frac{E}{\|F_{10}\|_{L^2(\gamma_C)}}$.

4. Case where $\langle\hat F_{10},F_{10}\rangle_{L^2(\gamma_C)}>0$, $\|\hat F_{10}\|_{L^2(\gamma_C)}\ge\frac12\|F_{10}\|_{L^2(\gamma_C)}$ and $\alpha\le\frac{\pi}{4}$. Since $\alpha\in[0,\frac{\pi}{4}]$, the concavity of the sine function gives
$$\frac{\alpha}{\pi}\ \le\ \frac{\sin(\alpha)}{2\sqrt2}.$$
In addition, the relation $\|\hat F_{10}\|_{L^2(\gamma_C)}\ge\frac12\|F_{10}\|_{L^2(\gamma_C)}$ implies that
$$\sin(\alpha)\ =\ \frac{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}{\|\hat F_{10}\|_{L^2(\gamma_C)}}\ \le\ \frac{2\,\|F_{10}-\hat F_{10}\|_{L^2(\gamma_C)}}{\|F_{10}\|_{L^2(\gamma_C)}}$$
(the first equality is a trigonometric formula). Finally, we obtain:
$$\frac{\alpha}{\pi}\ \le\ \frac{\|F_{10}-\hat F_{10}\|_{L^2(\gamma_C)}}{\sqrt2\,\|F_{10}\|_{L^2(\gamma_C)}}.\qquad(77)$$
Recall that $d_0=\langle\hat F_{10},\hat s_{10}-s_{10}\rangle_{\mathbb{R}^p}$. The equality (5) defining $\alpha$ and the fact that $\cos(\alpha)\ge\frac{\sqrt2}{2}$ now imply:
$$\frac{|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\ \le\ \frac{\sqrt2\,|d_0|\sin(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\quad\left(\text{since }\cos(\alpha)\ge\tfrac{\sqrt2}{2}\right)\ =\ \frac{\sqrt2\,|d_0|}{\|\hat F_{10}\|_{L^2(\gamma_C)}}\quad(\text{from a trigonometric formula}).$$
Also, noticing that $\gamma_1([0;u])\le\frac{u}{\sqrt{2\pi}}$ and that $\tan(\alpha)\le1$, we get:
$$\gamma_1\!\left(\left[0;\frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\right]\right)\ \le\ \gamma_1\!\left(\left[0;\frac{2\sqrt2\,|d_0|}{\|\hat F_{10}\|_{L^2(\gamma_C)}}\right]\right)\ \le\ \frac{2\,|d_0|}{\sqrt\pi\,\|\hat F_{10}\|_{L^2(\gamma_C)}}.\qquad(78)$$
In cases 1, 2 and 3 of Theorem 7.1, because $\tan(\alpha)\le1$ ($\alpha\le\frac{\pi}{4}$), equations (77), (78), (51) and (54) imply:
$$\mathcal{R}\ \le\ \frac{E}{\|F_{10}\|_{L^2(\gamma_C)}}.$$
This ends the proof of Theorem 2.1.

8. A general scheme to solve Problem 1

8.1. Introduction and main result

Presentation of the main ideas. In this section, we will prove results concerning the QDA procedure. Recall that the learning error $\mathcal{R}$ (the probability of misclassifying data with a given rule when the optimal rule gives a correct classification) satisfies:
$$\mathcal{R}\ \le\ \frac12\left(P_1\!\left(X\in V_{\hat L_{10}^Q}\,\triangle\,V_{L_{10}^Q}\right)+P_0\!\left(X\in V_{\hat L_{10}^Q}\,\triangle\,V_{L_{10}^Q}\right)\right)\qquad(79)$$
(if $f:\mathcal{X}\to\mathbb{R}$, $V_f$ is defined by (45) at the beginning of the preceding section). Indeed, the event $X\in V_{\hat L_{10}^Q}\triangle V_{L_{10}^Q}$ corresponds to the case where the decisions (good or erroneous) taken by the optimal rule and by the plug-in rule differ.

Remark 8.1. In the case of the LDA procedure, we had
$$\mathcal{R}=\frac12\left(\gamma_{C,s_{10}}\!\left(X\in\hat V\setminus V-\frac{m_{10}}{2}\right)+\gamma_{C,s_{10}}\!\left(X\in V\setminus\hat V+\frac{m_{10}}{2}\right)\right).$$
From this equation, one can easily deduce that
$$2\mathcal{R}=\frac12\left(\gamma_{C,s_{10}}\!\left(X\in\hat V\,\triangle\,V-\frac{m_{10}}{2}\right)+\gamma_{C,s_{10}}\!\left(X\in V\,\triangle\,\hat V+\frac{m_{10}}{2}\right)\right),$$
and as a consequence:
$$2\mathcal{R}=\frac12\left(P_1\!\left(X\in V_{\hat L_{10}^A}\,\triangle\,V_{L_{10}^A}\right)+P_0\!\left(X\in V_{\hat L_{10}^A}\,\triangle\,V_{L_{10}^A}\right)\right).\qquad(80)$$
It seems less obvious that this type of relation holds in the "quadratic" case.

In Subsection 8.2, we will present a technique to upper bound probabilities of the form $P(V_f\,\triangle\,V_{f+\delta})$. In this type of quantity, we shall call the measurable function $\delta$ (which can be thought of as a small function) the perturbation function, and the measurable function $f$ from $\mathcal{X}$ to $\mathbb{R}$ the optimal frontier function. In the case of QDA, the results obtained are consequences of Theorem 8.1, given in the next paragraph, with frontier function $f=L_{10}^Q$ and perturbation function $\delta=\hat L_{10}^Q-L_{10}^Q$.

A general result concerning quadratic perturbations of a quadratic rule.
In the sequel we need to introduce some quantities related to Gaussian measures on separable Banach spaces; $X$ is a separable Banach space. We refer to [8] and its section on measurable polynomials for a rigorous treatment of the subject. The Hilbert space of measurable affine functions from $X$ to $\mathbb{R}$ with finite $L^2(\gamma_{C,m})$ norm and null integral with respect to $\gamma_{C,m}$ will be denoted by $X^*_{\gamma_{C,m}}$. The Hilbert space of measurable quadratic forms in $L^2(\gamma_{C,m})$ with null integral with respect to $\gamma_{C,m}$ will be denoted by $E_2(\gamma_{C,m})$. The space of measurable quadratic forms in $L^2(\gamma_{C,m})$ will be denoted by $X^{*2}_{\gamma}$, and we have the classical Gaussian chaos decomposition in $L^2(\gamma_{C,m})$:
\[ X^{*2}_{\gamma} = \{C^{te}\} \oplus X^*_{\gamma_{C,m}} \oplus E_2(\gamma_{C,m}). \]
In infinite dimension, $H(\gamma_{C,m})$ is the reproducing kernel Hilbert space associated with $\gamma_{C,m}$; in finite dimension ($X = \mathbb{R}^p$) we have, if $C$ is of full rank, $H(\gamma_{C,m}) = \mathbb{R}^p$. Recall that to each Hilbert–Schmidt operator $A$ on $H(\gamma_{C,m})$ one can associate a measurable element $q^{\gamma_{C,m}}_A$ of $E_2(\gamma_{C,m})$, and that each element of $E_2(\gamma_{C,m})$ is associated with a unique Hilbert–Schmidt operator on $H(\gamma_{C,m})$. In finite dimension, if $C$ is of full rank:
\[ q^{\gamma_{C,m}}_A(x) = q_{C^{-1/2}AC^{-1/2}}(x - m) - \int_X q_{C^{-1/2}AC^{-1/2}}(x - m)\,\gamma_{C,m}(dx) \quad (\text{recall that } q_A(x) = \langle Ax, x\rangle_{\mathbb{R}^p}) \]
\[ = \langle A C^{-1/2}(x - m),\, C^{-1/2}(x - m)\rangle_{\mathbb{R}^p} - \sum_{i=1}^p \lambda_i, \]
where $(\lambda_i)_{i=1,\dots,p}$ is the vector of the eigenvalues of $A$.

Theorem 8.1. Let $X$ be a separable Banach space and $\gamma_{C,m}$ be a Gaussian measure on $X$ with mean $m$ and covariance $C$. Let $A$ and $D$ be two symmetric Hilbert–Schmidt operators on $H(\gamma_{C,m})$, $F, d \in X^*_{\gamma_{C,m}}$, and $c, d_0 \in \mathbb{R}$.
Let $f(x) = c + F(x) + q^{\gamma_{C,m}}_A(x)$ and $\delta(x) = d_0 + d(x) + q^{\gamma_{C,m}}_D(x)$ be the functions defining $V_f$ and $V_{f+\delta}$ (if $g : X \to \mathbb{R}$, $V_g$ is defined by equation (45)). Finally, let $r, R \in \mathbb{R}$ be such that $R > r > 0$.

1. Assume that $r \leq \|f\|_{L^2(\gamma_{C,m})}$. Then, for all $q \in ]0,1[$, there exists $c_1(r,q) > 0$ (that only depends on $r$ and $R$) such that
\[ \gamma_{C,m}(V_f \triangle V_{f+\delta}) \leq c_1(r,q)\,\|\delta\|^{q/3}_{L^2(\gamma_{C,m})}. \] (81)
2. If $|E_{L^2(\gamma_{C,m})}[f]| > r$ and $\|f\|_{L^2(\gamma_{C,m})} \leq R$, then, for all $q \in ]0,1[$, there exists $c_2(r,q) > 0$ (that only depends on $r$ and $R$) such that
\[ \gamma_{C,m}(V_f \triangle V_{f+\delta}) \leq c_2(r,q)\,\|\delta\|^{2q/7}_{L^2(\gamma_{C,m})}. \] (82)

The two following subsections are devoted to the proof of this theorem. Subsection 8.2 presents a general methodology to obtain this type of result, and in Section 8.4 we apply this methodology to obtain Theorem 8.1.

8.2. Decomposition of the domain

We will give an upper bound on the probability that $X \in V_f \triangle V_{f+\delta}$. In the cases we have in mind, this set is essentially composed of elements for which $\delta$ takes large values or $f$ is near zero. Hence, we shall bound the measure of areas on which
1. the perturbation is large (with a large deviation inequality),
2. $|f|$ is small (with an inequality of the form $P(|f(X)| \leq \epsilon) \leq g(\epsilon)$).

Lemma 8.1 below is based on the two following assumptions.
1. Assumption A1. There exist $c_0, c_1 > 0$ and $h_\delta : \mathbb{R}^+ \to \mathbb{R}^+$ non-decreasing with $h_\delta(0) = 0$ and $\lim_{s \to \infty} h_\delta(s) = \infty$, such that
\[ \forall s > 0, \quad P\big(|\delta(X) - E[\delta(X)]| \geq c_0 h_\delta(s)\big) \leq c_1 e^{-\frac{s^2}{2}}. \] (83)
2. Assumption A2. There exist $\beta > 0$ and $c_2 > 0$ such that
\[ \forall \epsilon > 0, \quad P(|f(X)| \leq \epsilon) \leq c_2\,\epsilon^{\beta}. \] (84)

Remark 8.2.
The function $h_\delta$ of Assumption A1 will help us in measuring the effect of a perturbation $\delta$.

Lemma 8.1. Under Assumptions A1 (83) and A2 (84), for all $q \in ]0;1[$ we have:
\[ P(X \in V_f \triangle V_{f+\delta}) \leq c_1^{1-q} c_2^{q}\, |E_P[\delta(X)]|^{q\beta} + \sqrt{\frac{2\pi}{1-q}}\;\frac{c_2^{q} c_1^{1-q}}{2}\; E\Big[\Big(c_0\, h_\delta\Big(\frac{|\xi|}{\sqrt{1-q}} + 1\Big) + |E_P[\delta(X)]|\Big)^{q\beta}\Big], \]
where $\xi$ is a centred real Gaussian random variable with variance 1.

Proof. Recall that $V_f = \{x : f(x) \geq 0\}$. We have
\[ P(X \in V_f \triangle V_{f+\delta}) = P\big( -(\delta(X) - E[\delta(X)]) - E[\delta(X)] \leq f(X) < 0 \ \text{ or } \ 0 \leq f(X) < -(\delta(X) - E[\delta(X)]) - E[\delta(X)] \big), \]
hence
\[ P(X \in V_f \triangle V_{f+\delta}) \leq P(U), \quad \text{where } U = \big\{ |f(X)| \leq |\delta(X) - E[\delta(X)]| + |E[\delta(X)]| \big\}. \]
Define $B_j = \{ c_0 h_\delta(j) \leq |\delta(X) - E[\delta(X)]| < c_0 h_\delta(j+1) \}$ for $j \in \mathbb{N}$. This family of events covers all possible cases. We observe that
\[ P(U) = \sum_{j \geq 0} P(U \cap B_j), \]
and then, using the Hölder inequality ($p + q = 1$), we get:
\[ P(U) \leq \sum_{j \geq 0} P(U \cap B_j)^{q}\, P(B_j)^{p}. \]
It follows that
\[ P(X \in V_f \triangle V_{f+\delta}) \leq \sum_j P\big(|f(X)| \leq |E[\delta(X)]| + c_0 h_\delta(j+1)\big)^{q}\, P\big(|\delta(X) - E[\delta(X)]| \geq c_0 h_\delta(j)\big)^{1-q} \]
\[ \leq c_2^{q} c_1^{1-q} \sum_{j \geq 0} \big(|E[\delta(X)]| + c_0 h_\delta(j+1)\big)^{q\beta}\, e^{-\frac{(1-q)j^2}{2}} \quad (\text{from Assumptions A1 and A2}) \]
\[ \leq c_2^{q} c_1^{1-q} \left( |E[\delta(X)]|^{q\beta} + \sqrt{\frac{2\pi}{1-q}} \int_0^{\infty} \big(c_0 h_\delta(x+1) + |E[\delta(X)]|\big)^{q\beta}\, \sqrt{\frac{1-q}{2\pi}}\, e^{-\frac{(1-q)x^2}{2}}\, dx \right), \]
which implies the desired result.

Lemma 8.2. Let $\delta_1, \dots, \delta_k$ be $k$ perturbations satisfying Assumption A1, defined by equation (83), with the error functions $h_{\delta_1}, \dots, h_{\delta_k}$. Then, if $h_\delta = \sum_{i=1}^k h_{\delta_i}$ and $\delta = \sum_{i=1}^k \delta_i$, there exist $c_0(k), c_1(k) > 0$ such that for all $s > 0$
\[ P\big(|\delta - E[\delta]| \geq c_0 h_\delta(s)\big) \leq c_1 e^{-\frac{s^2}{2}}. \] (85)

Proof. Recall that for all $i$, $h_{\delta_i} \geq 0$.
Let us fix $s > 0$. The proof relies on the pigeonhole principle. Indeed, if $\sum_{i=1}^k |\delta_i - E[\delta_i]| \geq k \sum_{i=1}^k c_{0i} h_{\delta_i}(s)$, then there exists $i_0 \in \{1, \dots, k\}$ such that $|\delta_{i_0} - E[\delta_{i_0}]| \geq \sum_{i=1}^k c_{0i} h_{\delta_i}(s)$. If we fix $c_0 = k \max_i c_{0i}$, we then have
\[ P\Big( \Big|\sum_{i=1}^k \delta_i - E[\delta_i]\Big| \geq c_0 \sum_{i=1}^k h_{\delta_i}(s) \Big) \leq P\Big( \sum_{i=1}^k |\delta_i - E[\delta_i]| \geq k \sum_{i=1}^k c_{0i} h_{\delta_i}(s) \Big) \]
(from the triangle inequality and the fact that $c_0 \sum_{i=1}^k h_{\delta_i}(s) \geq k \sum_{i=1}^k c_{0i} h_{\delta_i}(s)$)
\[ \leq P\Big( \exists\, i_0 \in \{1, \dots, k\} : |\delta_{i_0} - E[\delta_{i_0}]| \geq \sum_{i=1}^k c_{0i} h_{\delta_i}(s) \Big) \quad (\text{pigeonhole principle}) \]
\[ \leq \sum_{i=1}^k P\big( |\delta_i - E[\delta_i]| \geq c_{0i} h_{\delta_i}(s) \big) \quad (\text{subadditivity of probability}) \]
\[ \leq \sum_{i=1}^k c_{1i}\, e^{-\frac{s^2}{2}} \quad (h_{\delta_i} \text{ satisfies Assumption A1}), \]
which ends the proof.

The results that allow us to verify Assumption A2 are presented in Section 8.5. We now recall some standard large deviation results that allow us to verify Assumption A1.

8.3. Large deviations

In the case where $\delta$ is linear or Lipschitz, the following classical result (see for example [8], p. 174) allows us to check Assumption A1.

Theorem 8.2. Let $\gamma = \gamma_C$ be a Gaussian measure of covariance $C$ on $X$ a separable Banach space, $H = H(\gamma)$ be the associated reproducing kernel Hilbert space, and $\delta : X \to \mathbb{R}$ a function such that there exists $N(\delta) > 0$ with
\[ |\delta(x + h) - \delta(x)| \leq N(\delta)\,|h|_{H(\gamma)} \quad \forall h \in H(\gamma),\ \gamma\text{-a.s.} \] (86)
Then for all $s > 0$
\[ \gamma\Big( x \in X : \Big|\delta(x) - \int \delta(x)\, d\gamma\Big| > s \Big) \leq 2\, e^{-\frac{s^2}{2N(\delta)^2}}. \] (87)

In the case where $\delta$ is quadratic, the following result from Laurent and Massart [19] (Lemma 1, p. 1325) will help us to check Assumption A1.

Theorem 8.3. If $D = \mathrm{Diag}(d_1, \dots, d_p)$ and $q_D(x) = \langle Dx, x\rangle_{\mathbb{R}^p}$, then
\[ \gamma_p\Big( x \in \mathbb{R}^p : q_D(x) - \int_{\mathbb{R}^p} q_D(x)\,\gamma_p(dx) \geq s\sqrt{2}\,\|q_D\|_{L^2(\gamma_p)} + s^2 \sup_i |d_i| \Big) \leq e^{-\frac{s^2}{2}} \] (88)
\[ \gamma_p\Big( x \in \mathbb{R}^p : q_D(x) - \int_{\mathbb{R}^p} q_D(x)\,\gamma_p(dx) \leq -\, s\sqrt{2}\,\|q_D\|_{L^2(\gamma_p)} \Big) \leq e^{-\frac{s^2}{2}} \] (89)

As a consequence, Assumption A1 is satisfied with $h_\delta(s) = s\sqrt{2}\,\|q_D\|_{L^2(\gamma_p)} + s^2 \sup_i |d_i| \leq \|q_D\|_{L^2(\gamma_p)}\,(s\sqrt{2} + s^2)$. The use we will make of these results is entirely contained in the following corollary.

Corollary 8.1. Let $X$ be a separable Banach space, $\gamma$ a Gaussian measure on $X$ and $\delta \in X^{*2}_{\gamma}$. Then $\delta$ satisfies Assumption A1 with $h_\delta(s) = \|\delta - E_\gamma[\delta]\|_{L^2(\gamma)}\,(s + s^2)$.

Proof. It suffices to check the result for $X = \mathbb{R}^p$ and to use a standard approximation argument. Recall that in $L^2(\gamma)$ we have $X^{*2}_{\gamma} = \{cte\} \oplus X^*_{\gamma} \oplus E_2(\gamma)$. Hence there exists a unique triplet $\delta_0 = E_\gamma[\delta] \in \{cte\}$, $\delta_1 \in X^*_{\gamma}$ and $\delta_2 \in E_2(\gamma)$ such that $\delta = \delta_0 + \delta_1 + \delta_2$. From Theorem 8.3, Assumption A1 is satisfied for the perturbation $\delta_2$, the measure $P = \gamma$ and $h_{\delta_2}(s) = \|\delta_2\|_{L^2(\gamma)}(s + s^2)$. Because $\delta_1 \in X^*_{\gamma}$, $\delta_1$ is affine. Hence, by Theorem 8.2, Assumption A1 is satisfied for the perturbation $\delta_1$ with $h_{\delta_1}(s) = s\,\|\delta_1\|_{L^2(\gamma)}$. We can then conclude using Lemma 8.2 and the fact that
\[ \|\delta_2\|_{L^2(\gamma)}(s + s^2) + s\,\|\delta_1\|_{L^2(\gamma)} \leq \big(\|\delta_1\|_{L^2(\gamma)} + \|\delta_2\|_{L^2(\gamma)}\big)(s + s^2) \leq \sqrt{2}\,(s + s^2)\,\|\delta - \delta_0\|_{L^2(\gamma)}. \]

We now have all the elements to prove Theorem 8.1.

8.4. Proof of Theorem 8.1

As announced, we shall apply Lemma 8.1. From Theorem 8.4, Assumption A2 is satisfied with $\beta = 1/3$ in case 1 of our theorem and with $\beta = 2/7$ in case 2. In both cases the constant $c_2$ depends on $r$ only.
In both cases, from the preceding corollary, Assumption A1 is satisfied with the function $h_\delta(s) = (s + s^2)\,\|\delta - \delta_0\|_{L^2(\gamma)}$. Hence, if we apply Lemma 8.1, for all $q \in ]0,1[$ there exists a constant $C(r,q) > 0$ such that
\[ \gamma(V_f \triangle V_{f+\delta}) \leq C(r,q)\,\big( |E_\gamma[\delta]| + \|\delta - E[\delta]\|_{L^2(\gamma)} \big)^{q\beta}, \]
and a constant $C'(r,q) > 0$ such that
\[ \gamma(V_f \triangle V_{f+\delta}) \leq C'(r,q)\,\|\delta\|^{q\beta}_{L^2(\gamma)}. \]
This ends the proof of the theorem.

8.5. Small crown probability

In this subsection, $X^*_2$ is the set of real random variables that can be written $c + \sum_{i \geq 1} \beta_i(\xi_i^2 - 1) + \alpha_i \xi_i$, with $c \in \mathbb{R}$, $\beta = (\beta_i)_i \in l^2(\mathbb{N})$, $\alpha = (\alpha_i)_i \in l^2(\mathbb{N})$, and $(\xi_i)_{i \in \mathbb{N}}$ a sequence of independent identically distributed Gaussian random variables with mean 0 and variance 1. For $q \in X^*_2$ given by
\[ q = c + \sum_{i \geq 0} \alpha_i \xi_i + \sum_i \beta_i(\xi_i^2 - 1), \]
we will write
\[ n_1(q) = \max_i |\alpha_i|, \quad n_2(q) = \max_i |\beta_i|, \quad \sigma(q) = \Big( \sum_{i \geq 0} 2\beta_i^2 + \alpha_i^2 \Big)^{1/2}. \] (90)

Theorem 8.4.
1. There exists $C(c_0) > 0$ such that $\sup\{ P(|q| \leq \epsilon) : q \in X^*_2,\ |E[q]| \geq c_0 \} \leq C(c_0)\,\epsilon^{2/7}$.
2. There exists $C'(c_0) > 0$ such that $\sup\{ P(|q| \leq \epsilon) : q \in X^*_2,\ E[q^2] \geq c_0 \} \leq C'(c_0)\,\epsilon^{1/3}$.
3. Let $q \in X^*_2$; for all $\epsilon \geq 0$,
\[ P(|q| \leq \epsilon) \leq \sqrt{\frac{1}{\pi}\,\frac{\epsilon}{n_2(q)}}. \]

Remark 8.3. This result may seem surprising, and we did not show it is optimal. If $n_2(q) = \max_i |\beta_i| > c_0$, the bound of point 3 is optimal in the sense that if $\beta = (1, 0, \dots)$, $c = 1$ and $\alpha = 0$, we get $P(|q| \leq \epsilon) = P(\xi^2 \leq \epsilon) \sim C\,\epsilon^{1/2}$ (for a constant $C$ which can be calculated explicitly). In addition, when $\|\beta\|_{l^2} \to 0$, the behaviour of $P(|q| \leq \epsilon)$ tends to be the same as that of $P\big(\big|\,\|\alpha\|_{l^2}\, N(0,1) - c\,\big| \leq \epsilon\big) \sim C'(c_0)\,\epsilon$.
Also, it may be conjectured that points 1 and 2 of the theorem can be improved (in order to obtain an exponent $1/2$ instead of $2/7$ and $1/3$), but we believe this is unlikely. The difficult cases to study (and Step 3 of the following proof demonstrates this) are those with $\|\beta\|_\infty \to 0$ while $\|\beta\|_{l^2}$ does not tend to zero.

Proof. We shall proceed in four steps.

Step 1. We claim that if $|E[q]| > \epsilon$, then
\[ P(|q| \leq \epsilon) \leq \frac{\sigma^2(q)}{(|E[q]| - \epsilon)^2}. \] (91)
Notice that $|q - E[q]| \geq \big|\,|q| - |E[q]|\,\big|$, and if $|q| < \epsilon < |E[q]|$ then $\big|\,|q| - |E[q]|\,\big| = |E[q]| - |q|$ and $|q| \geq |E[q]| - |q - E[q]|$. Hence
\[ P(|q| \leq \epsilon) \leq P\big( |E[q]| - |q - E[q]| \leq \epsilon \big) = P\Big( 1 \leq \frac{|q - E[q]|}{|E[q]| - \epsilon} \Big), \]
which implies (91) by the Markov inequality.

Step 2. We claim that
\[ P(|q| \leq \epsilon) \leq \sqrt{\frac{1}{\pi}\,\frac{\epsilon}{n_2(q)}}. \] (92)
We will assume, without loss of generality, that $\alpha_i \geq 0$ for all $i \in \mathbb{N}$. In the following, $\alpha_{i_0} = \max_i \alpha_i$, $j_0 \in \arg\max_j |\beta_j|$, and $\mathrm{sign}(x)$ is the function that returns the sign of the real $x$. Let
\[ Z = \sum_{i \neq j_0} \alpha_i \xi_i + \beta_i(\xi_i^2 - 1). \]
To obtain the desired inequality, note that for all $\alpha_{j_0} \geq 0$, $\beta_{j_0} \neq 0$:
\[ P\big( |Z + \alpha_{j_0}\xi + \beta_{j_0}(\xi^2 - 1)| \leq \epsilon \big) = P\big( |\mathrm{sign}(\beta_{j_0})\, Z + \alpha_{j_0}\xi + |\beta_{j_0}|(\xi^2 - 1)| \leq \epsilon \big) \]
\[ = P\Big( \Big| \frac{\mathrm{sign}(\beta_{j_0})\, Z}{|\beta_{j_0}|} + \Big(\xi + \frac{\alpha_{j_0}}{2|\beta_{j_0}|}\Big)^2 - 1 - \frac{\alpha_{j_0}^2}{4\beta_{j_0}^2} \Big| \leq \frac{\epsilon}{|\beta_{j_0}|} \Big) \]
\[ = P\Big( \xi \in \Big[ f_{\alpha_{j_0},\beta_{j_0}}(-\epsilon) - \frac{\alpha_{j_0}}{2|\beta_{j_0}|}\,;\ f_{\alpha_{j_0},\beta_{j_0}}(\epsilon) - \frac{\alpha_{j_0}}{2|\beta_{j_0}|} \Big] \Big), \]
where
\[ f_{\alpha,\beta}(\epsilon) = \sqrt{\Big( 1 + \frac{\alpha^2}{4\beta^2} - \frac{\mathrm{sign}(\beta)\, Z - \epsilon}{|\beta|} \Big)_+} \quad \text{and} \quad (x)_+ = x\,\mathbf{1}_{x \geq 0}. \]
The inequality (92) results from the choice $\alpha = \alpha_{j_0}$ and $\beta = \beta_{j_0}$, and from the fact that if $u \in \mathbb{R}$,
\[ \sqrt{\Big(u + \frac{\epsilon}{|\beta_{j_0}|}\Big)_+} - \sqrt{\Big(u - \frac{\epsilon}{|\beta_{j_0}|}\Big)_+} \leq \sqrt{\frac{2\epsilon}{n_2(q)}}. \]
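The small-ball bound (92) just established can be illustrated by Monte Carlo. The sketch below is a numerical check, not part of the argument; the coefficients $\alpha$, $\beta$, $c$ are arbitrary illustrative choices:

```python
import numpy as np

# Monte Carlo illustration of the small-ball bound (92):
#   P(|q| <= eps) <= sqrt(eps / (pi * n2(q)))
# for a Gaussian chaos q = c + sum_i alpha_i xi_i + beta_i (xi_i^2 - 1).
rng = np.random.default_rng(0)
alpha = np.array([0.3, 0.1, 0.0, 0.2])   # illustrative coefficients
beta = np.array([0.5, -0.4, 0.3, 0.1])
c = 0.2
n2 = np.max(np.abs(beta))                # n2(q) = max_i |beta_i|

N = 1_000_000
xi = rng.normal(size=(N, 4))
q = c + xi @ alpha + (xi**2 - 1) @ beta

results = []
for eps in (0.01, 0.05, 0.2):
    p_emp = np.mean(np.abs(q) <= eps)      # Monte Carlo estimate of P(|q| <= eps)
    bound = np.sqrt(eps / (np.pi * n2))    # right-hand side of (92)
    results.append((eps, p_emp, bound))

assert all(p <= b for _, p, b in results)
```

For this smooth chaos the empirical probabilities sit well below the bound; the bound is tighter (of order $\sqrt{\epsilon}$) when a single $\beta_i$ dominates, as in the example of Remark 8.3.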
Step 3. We claim that
\[ P(|q| \leq \epsilon) \leq \frac{208\, n_2(q)}{\sigma(q)} + \frac{2\epsilon}{\sigma(q)}\, e^{-\frac{(|E[q]| - \epsilon)^2}{2\sigma^2(q)}}. \] (93)
We prove the following lemma (which is a central limit theorem) at the end of the proof.

Lemma 8.3. Let $X_i = \beta_i(\xi_i^2 - 1) + \alpha_i \xi_i$, let $\xi$ be a Gaussian centred random variable with variance 1, and let $\sigma(q)$ be given by (90). We have:
\[ \sup_{\epsilon \geq 0} \Big| P\Big( \Big| E_\gamma[q] + \sum_{i \geq 0} X_i \Big| \leq \epsilon \Big) - P\Big( \Big| \xi + \frac{E_\gamma[q]}{\sigma(q)} \Big| \leq \frac{\epsilon}{\sigma(q)} \Big) \Big| \leq \frac{104 \max_i |\beta_i|}{\sigma(q)}. \]
Also, because $|E[q]| > \epsilon$,
\[ P\Big( \Big| \xi + \frac{E[q]}{\sigma(q)} \Big| \leq \frac{\epsilon}{\sigma(q)} \Big) \leq \frac{2\epsilon}{\sigma(q)}\, e^{-\frac{(|E[q]| - \epsilon)^2}{2\sigma^2(q)}}, \]
and we obtain inequality (93).

Step 4. As announced, we will distinguish several disjoint cases to prove points 1 and 2 of the theorem. We begin with point 1.
1. In the case where $\sigma(q) < \epsilon^{1/7}$, it is the inequality from Step 1 (91) that leads to the desired conclusion.
2. In the case where $n_2(q) \geq \epsilon^{3/7}$, it is the inequality from Step 2 (92) that leads to the desired conclusion.
3. In the case where $n_2(q) < \epsilon^{3/7}$ and $\sigma(q) \geq \epsilon^{1/7}$, it is the inequality from Step 3 (93) that leads to the desired conclusion.
We conclude with point 2.
1. In the case where $n_2(q) \geq \epsilon^{1/3}$, it is the inequality from Step 2 (92) that leads to the desired conclusion.
2. In the case where $n_2(q) < \epsilon^{1/3}$, it is the inequality from Step 3 (93) that leads to the desired conclusion.

We now give the proof of Lemma 8.3.

Proof. This proof is decomposed into two steps.
In the first step, we calculate
\[ \forall \alpha, \beta \in \mathbb{R}, \quad \phi_{\alpha,\beta}(t) = E\big[ e^{it(\xi\alpha + \beta(\xi^2 - 1))} \big], \] (94)
and in the second one we deduce that for all $|t| < \frac{\sigma}{6\max_j|\beta_j|} = a$,
\[ \Big| \prod_{j \geq 0} \phi_{\alpha_j,\beta_j}(t/\sigma) - e^{-t^2/2} \Big| \leq \frac{4\max_j|\beta_j|}{\sigma}\,\frac{|t|^3}{2}\, e^{-t^2/6}, \] (95)
which implies the desired result from the Esseen inequality (see for example [23], p. 358):
\[ \sup_{u \in \mathbb{R}} \Big| P\Big( \frac{1}{\sigma} \sum_{j \geq 0} \alpha_j \xi_j + \beta_j(\xi_j^2 - 1) \geq u \Big) - \Phi(u) \Big| \leq \int_{-a}^{a} \frac{\big| \prod_{j \geq 0} \phi_{\alpha_j,\beta_j}(t/\sigma) - e^{-t^2/2} \big|}{|t|}\, dt + \frac{24}{a\sqrt{2\pi}} \]
\[ \leq \frac{4\max_j|\beta_j|}{\sigma} \int_{\mathbb{R}} \frac{t^2}{2}\, e^{-\frac{t^2}{6}}\, dt + \frac{72\sqrt{2}\,\max_j|\beta_j|}{\sigma\sqrt{\pi}} \leq \frac{\max_j|\beta_j|}{\sigma}\Big( 72\sqrt{\frac{2}{\pi}} + 32 \Big) \leq \frac{104\max_j|\beta_j|}{\sigma}, \]
where $\Phi$ is the cumulative distribution function of a standardised Gaussian real random variable.

Step 1. Let $\Omega_\beta = \{ z \in \mathbb{C} : 2\beta\,\Im(z) > -1 \}$ and let $\psi_{\alpha,\beta}(z)$ be given by
\[ \forall \alpha, \beta \in \mathbb{R},\ z \in \Omega_\beta, \quad \psi_{\alpha,\beta}(z) = \frac{e^{-i\beta z}}{(1 - 2i\beta z)^{1/2}}\, e^{-\frac{\alpha^2 z^2}{2(1 - 2i\beta z)}}. \]
The function $\psi_{\alpha,\beta}$ is analytic on $\Omega_\beta$. The function $\phi_{\alpha,\beta}(t)$ defined by (94) can be continued into an analytic function on the domain $\Omega_\beta$, and because
\[ \frac{x^2}{2} + y\big( \alpha x + \beta(x^2 - 1) \big) = \frac{1}{2}(1 + 2\beta y)\Big( x + \frac{\alpha y}{1 + 2\beta y} \Big)^2 - \frac{\alpha^2 y^2}{2(1 + 2\beta y)} - \beta y, \]
we observe that
\[ \forall y > -\frac{1}{2\beta}, \quad \psi_{\alpha,\beta}(iy) = \phi_{\alpha,\beta}(iy). \]
Hence we can deduce that $\phi_{\alpha,\beta}(z)$ and $\psi_{\alpha,\beta}(z)$ are equal on $\Omega_\beta$, and in particular on $\mathbb{R}$, which gives
\[ \forall \alpha, \beta \in \mathbb{R},\ t \in \mathbb{R}, \quad \phi_{\alpha,\beta}(t) = \frac{e^{-i\beta t}}{(1 - 2i\beta t)^{1/2}}\, e^{-\frac{\alpha^2 t^2}{2(1 - 2i\beta t)}}. \]

Step 2. Proof of (95). The preceding equation gives
\[ \Big| \prod_{j \geq 0} \phi_{\alpha_j,\beta_j}(t/\sigma) - e^{-t^2/2} \Big| = e^{-\frac{t^2}{2}}\,|e^z - 1| \leq e^{-\frac{t^2}{2}}\,|z|\,e^{|z|}, \]
where $u = t/\sigma$ and
\[ z = \frac{t^2}{2} + \sum_{j \geq 0} \Big( -\frac{\alpha_j^2 u^2}{2(1 - 2i\beta_j u)} + \frac{1}{2}\big( -2i\beta_j u - \log(1 - 2i\beta_j u) \big) \Big), \]
and hence
\[ z = \sum_{j \geq 0} \Big\{ \Big( \frac{u^2\alpha_j^2}{2} - \frac{\alpha_j^2 u^2}{2(1 - 2i\beta_j u)} \Big) + \Big( u^2\beta_j^2 - \frac{1}{2}\big( 2i\beta_j u + \log(1 - 2i\beta_j u) \big) \Big) \Big\}. \] (96)
In addition, if $|t| < \frac{\sigma}{6\max_i|\beta_i|}$, then for all $j \in \mathbb{N}$, $|2u\beta_j| < \frac{1}{3}$, and we have (cf. the Taylor expansion (1), p. 352 in [23]):
\[ \big| \log(1 - 2i\beta_j u) + 2i\beta_j u - 2\beta_j^2 u^2 \big| \leq \frac{8|u\beta_j|^3}{3}\,\frac{1}{1 - |2u\beta_j|} \leq 4|u|^3\beta_j^2 \max_j|\beta_j|. \]
We also have
\[ \Big| \frac{u^2\alpha_j^2}{2} - \frac{\alpha_j^2 u^2}{2(1 - 2i\beta_j u)} \Big| \leq \frac{\alpha_j^2|u|^3\, 2|\beta_j|}{2\sqrt{1 + 4\beta_j^2 u^2}} \leq \alpha_j^2|u|^3 \max_j|\beta_j|. \]
As a consequence, if $|t| < \frac{\sigma}{6\max_i|\beta_i|}$, then (96) implies:
\[ |z| \leq 2\sigma^2|u|^3 \max_j|\beta_j| = \frac{2\max_j|\beta_j|}{\sigma}\,|t|^3, \]
and
\[ e^{-\frac{t^2}{2} + |z|} \leq e^{-\frac{t^2}{2}\left(1 - \frac{2}{3}\right)} = e^{-\frac{t^2}{6}}. \]

Acknowledgements

This work has been done with support from La Région Rhône-Alpes.

References

[1] F. Abramovich, Y. Benjamini, D. Donoho, and I. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. Annals of Statistics, 34, 2006.
[2] T. Anderson and R. Bahadur. Classification into two multivariate normal distributions with different covariance matrices. Annals of Mathematical Statistics, 33(2):420–431, 1962.
[3] J.-Y. Audibert and A. Tsybakov. Fast learning rates for plug-in classifiers under the margin condition. Annals of Statistics, 2006.
[4] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57:289–300, 1995.
[5] A. Berlinet, G. Biau, and L. Rouvière. Functional classification with wavelets. 2005.
[6] P. Bickel and E. Levina.
Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.
[7] P. Bickel and E. Levina. Regularized estimation of large covariance matrices. Annals of Statistics, 2007.
[8] V. I. Bogachev. Gaussian Measures. AMS, 1998.
[9] E. Candès. Modern statistical estimation via oracle inequalities. Acta Numerica, pages 1–69, 2006.
[10] D. Donoho. High-dimensional data analysis: the curses and blessings of dimensionality. Available at http://www-stat.stanford.edu/donoho/Lectures, 2000.
[11] D. L. Donoho and I. Johnstone. Minimax risk over lp-balls for lq-error. Probability Theory and Related Fields, (99):277–303, 1994.
[12] J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Technical report, Princeton University, 2007.
[13] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.
[14] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer, 2001.
[15] R. Girard. Réduction de dimension en statistique et application à la segmentation d'images hyperspectrales. PhD thesis, Université Joseph Fourier, 2008.
[16] V. Girardin and R. Senoussi. Semigroup stationary processes and spectral representation. Bernoulli, 9(5):857–876, 2003.
[17] U. Grenander. Stochastic processes and statistical inference. Arkiv för Matematik, 1:195–277, 1950.
[18] T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. Annals of Statistics, 23:73–102, 1995.
[19] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
[20] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.
[21] S. Mallat, G. Papanicolaou, and Z. Zhang. Adaptive covariance estimation of locally stationary processes. The Annals of Statistics, 26(1):1–47, 1998.
[22] F. Rossi and N. Villa. Support vector machine for functional data classification. Neurocomputing, 69:730–742, 2006.
[23] G. Shorack. Probability for Statisticians. Springer, 2000.
[24] A. Tsybakov. Introduction à l'estimation non-paramétrique. Springer, 2004.
[25] B. Yazici. Stochastic deconvolution over groups. IEEE Trans. on Information Theory, 50(3), 2004.