High dimensional gaussian classification

Robin Girard
LJK, Grenoble, France

arXiv: math.ST/0806.0729

Abstract: High dimensional data analysis is known to be a challenging problem (see [10]). In this article, we give a theoretical analysis of high dimensional classification of Gaussian data which relies on a geometrical analysis of the error measure. It links a problem of classification with a problem of nonparametric regression. We give an algorithm designed for high dimensional data which appears straightforward in the light of our theoretical work, together with the thresholding estimation theory. We finally attempt to give a general treatment of the problem that can be extended to frameworks other than gaussian.

AMS 2000 subject classifications: Primary 62C20.

Keywords and phrases: Classification, High dimension, Gaussian measure, thresholding estimator, dimension reduction, Linear Discriminant Analysis, Quadratic Discriminant Analysis.

Contents

1 Introduction
2 Affine perturbation of affine rules
3 Quadratic perturbation of quadratic rules
4 Classification procedure in high dimension: a way to solve Problem 2
5 Application to medical data and the TIMIT database
6 A more geometric alternative measure of error: the learning error
7 A geometrical analysis of LDA to solve Problem 1
8 A general scheme to solve Problem 1
Acknowledgements
References

1. Introduction

Let $\mathcal{X}$ be a vector space, typically $\mathcal{X} = \mathbb{R}^p$, but $\mathcal{X}$ can also be an infinite dimensional Polish space (i.e. a separable complete metric space). In Section 8, $\mathcal{X}$ is a separable Banach space. In the binary classification problem, the aim is to recover the unknown class $y \in \{0,1\}$ associated with an observation $x \in \mathcal{X}$. In other words, we seek a classification rule (also called a classifier), i.e. a measurable $g : \mathcal{X} \to \{0,1\}$. This rule gives an incorrect classification for the observation $x$ if $g(x) \neq y$. The underlying probabilistic model, which makes a performance measure of $g$ possible, is set by distributions $P_k$ ($k = 0,1$) on $\mathcal{X}$. For $k = 0,1$, the distribution $P_k$ is the distribution of the data having label equal to $k$. In this framework, the weighted sum of the probabilities of misclassification is defined by
$$ C(\pi, g) = \pi P_1(g(X) \neq 1) + (1-\pi) P_0(g(X) \neq 0). \tag{1} $$
In a Bayesian framework, the weight $\pi$ reflects the marginal distribution of the label $Y$. In our approach, we do not want this marginal distribution to set the importance of the different errors. In the many applications we have in mind, such as tumour detection from an MRI signal, the class that appears most frequently is not necessarily the one for which a classification error has the most important medical consequences. This is the reason why we search for a procedure $g$ that minimises $C(\pi, g)$ and not its Bayesian counterpart $P(g(X) \neq Y)$.
Here, we do not want to study the influence of the weight $\pi$ in the problem. The main reason is that our results, to be given later, are simpler to formulate and to understand when $\pi = 1/2$, and that the problem we are interested in is the one that arises from the high dimension of the space $\mathcal{X}$, and not the one related to the use of $\pi$. Therefore, in the rest of the present paper we will make the assumption that $\pi = 1/2$. In the sequel, we will set $C(g) = C(1/2, g)$. This is a usual assumption (see for example Bickel and Levina [6]).

In the case where $\pi = 1/2$ it is known that, if $P_0$ and $P_1$ are equivalent, then the rule that minimises $C(g)$ is given by
$$ g^*(x) = \mathbf{1}_V(x), \quad V = \{x \in \mathcal{X} : L_{10}(x) \geq 0\}, \quad \text{where } L_{10} = \log\left(\frac{dP_1}{dP_0}\right) \tag{2} $$
is the logarithm of the likelihood ratio between $P_1$ and $P_0$ (i.e. of the Radon-Nikodym derivative). In real life problems, $L_{10}$ is unknown, and the only thing we have is a substitute $\widehat{L}_{10}$ of it. Also, it is natural to plug it into (2) and to use the classifier $g(x) = \mathbf{1}_{\widehat{V}}(x)$ with $\widehat{V} = \{x \in \mathcal{X} : \widehat{L}_{10}(x) \geq 0\}$. The natural question that we will investigate in this article is the following:

Problem 1. Is there a simple way to relate the excess risk $C(g) - C(g^*)$ to a measure of the log-likelihood "perturbation" $\widehat{L}_{10} - L_{10}$? In other words, we seek an upper bound and a lower bound of $C(g) - C(g^*)$ by a simple-to-study real valued function of $\widehat{L}_{10} - L_{10}$.

In this article we focus on the gaussian case and, unless the contrary is explicitly stated, $P_1$ and $P_0$ will be equivalent gaussian probabilities on $\mathcal{X}$. We investigate Problem 1 and the answer we obtain in the general case leads to the bound
$$ C(g) - C(g^*) \leq c(r)\, \|\widehat{L}_{10} - L_{10}\|_{L^2(\gamma)}^{1/6} \quad \text{while } \|L_{10}\|_{L^2(\gamma)} \geq r > 0, $$
for a gaussian measure $\gamma$, where $c(r)$ is a constant only depending on $r$. In some particular cases (when $\widehat{L}_{10} - L_{10}$ and $L_{10}$ are affine) we are able to give an explicit constant $c(L_{10})$ and an exponent higher than $1/6$ (exponent 1).

If we suppose that $P_0$ and $P_1$ have equal covariances, then it is known that $L_{10}$ is affine and it is natural to take an affine $\widehat{L}_{10}$. The corresponding procedure is usually called Linear Discriminant Analysis (LDA) (even if the underlying procedure is affine). If we suppose that $P_0$ and $P_1$ have different covariances, then $L_{10}$ is quadratic and it is natural to take a quadratic $\widehat{L}_{10}$. The corresponding classification procedure will be called Quadratic Discriminant Analysis (QDA). These procedures are also known as plug-in procedures: $\widehat{L}_{10}$ is plugged into (2) in order to obtain $g$. Plug-in procedures have been studied in a different context (see for example [3] and the references therein), but our approach differs from those. The interest of Problem 1 in the gaussian setting is understood by addressing the problem of finding a good substitute $\widehat{L}_{10}$ for $L_{10}$. For example, in many applications, we are given a learning set consisting of $n$ random variables drawn independently from $P_1$ and $n'$ drawn from $P_0$. The problem of finding a good substitute $\widehat{L}_{10}$ of $L_{10}$ then becomes an estimation problem whose error measure is given in the answer to Problem 1.
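To make the plug-in idea concrete, here is a minimal sketch (our own illustration, not part of the paper; function names are ours) of the LDA plug-in rule in the classical regime where the pooled empirical covariance is invertible ($n_1 + n_0 - 2 \geq p$), precisely the regime whose breakdown in high dimension motivates the rest of the article.

```python
import numpy as np

def lda_plug_in(X1, X0, x):
    """Plug-in rule (LDA): build an affine substitute of L_10 from the
    learning set and classify x into class 1 iff it is >= 0.
    Assumes equal covariances and n1 + n0 - 2 >= p so that the pooled
    empirical covariance is invertible (the classical low dimensional regime)."""
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)
    n1, n0 = X1.shape[0], X0.shape[0]
    S = ((X1 - mu1).T @ (X1 - mu1) + (X0 - mu0).T @ (X0 - mu0)) / (n1 + n0 - 2)
    F_hat = np.linalg.solve(S, mu1 - mu0)   # substitute of F_10 = C^{-1} m_10
    s_hat = 0.5 * (mu1 + mu0)               # substitute of s_10
    return int(F_hat @ (x - s_hat) >= 0)    # affine substitute of L_10
```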
Also, our answer to Problem 1 given below gives rise to a natural way to estimate $L_{10}$ in high dimension, which is the answer to what we call Problem 2:

Problem 2. Given a learning set, construct $\widehat{L}_{10}$ in order to get a satisfactory classification procedure in high dimension: a procedure that can be justified theoretically and with numerical experiments.

Classical methods of classification break down when the dimensionality is extremely large. For example, Bickel and Levina [6] have studied the poor performance of Fisher discriminant analysis. The number of parameters to learn in order to build a classification rule seems to be responsible for this poor performance. In the sequel we shall give theoretical non-asymptotic results that emphasise this poor performance. To overcome it, Bickel and Levina [6] propose to use a rule which relies on feature independence, while Fan and Fan [12] propose to select the interesting features with a multiple testing procedure. Bickel and Levina give a theoretical study of a particular LDA procedure (i.e. an LDA procedure based on a particular estimator $\widehat{L}_{10}$); they do not study the QDA procedure.

The selection of interesting features constitutes a reduction of the dimension of the space on which the classification rule acts. Feature selection is widely used in high dimensional classification, and the procedures used for the selection of interesting features are often motivated by theoretical results (see [12]). Unfortunately, these theoretical results are based on the following two postulates. On the one hand, features can be a priori divided into two parts, an interesting one and a non-interesting one. On the other hand, selecting the interesting features is necessary and sufficient to get a good classification rule. If we accept that these postulates reflect nothing but a relatively clear intuition, we would like to give an analysis of the classification risk in order to justify a feature selection method based on multiple hypothesis testing.

Thresholding techniques are widely used in the nonparametric regression framework (see [9] for an introduction to thresholding techniques), and as we shall see, these techniques can be used to give an answer to Problem 2. Also, we believe that our answer to Problem 1 will shed light on the simple link that exists between the nonparametric regression and the classification problem.

Functional data analysis is the study of data that lives in an infinite dimensional functional space. Hence curve classification is one of the problems it deals with. Since [17], functional data analysis has undergone further developments, especially in the context of classification (see for example [5] and the references therein). In the gaussian setting, it is rather natural to expect results that are dimensionless and that can be applied to any abstract Polish space. Hence, our answer to Problem 1 will be given in terms of $L^2(\gamma)$ norms, with $\gamma$ a gaussian measure, and since the constants involved in our theoretical results do not depend on the dimension, the extension from $\mathcal{X} = \mathbb{R}^p$ to more abstract spaces is straightforward.

Let us introduce some notation.
In the whole article, $\gamma_{C,\mu}$ is a gaussian measure on $\mathcal{X}$ with mean $\mu$ and covariance $C$, $\gamma_C$ is the zero mean gaussian measure with covariance $C$, and $\gamma_p$ is the gaussian measure on $\mathbb{R}^p$ with mean zero and covariance $Id_{\mathbb{R}^p}$; $\Phi(x)$ is the cumulative distribution function of a real gaussian random variable with mean zero and variance one. If $\gamma$ is a probability measure on $\mathbb{R}^p$, $\|\Pi_{x^\perp} e\|_{L^2(\gamma)}$ will be the norm of the orthogonal projection in $L^2(\gamma)$ of the vector $e \in L^2(\gamma)$ on the hyperplane orthogonal to $x \in L^2(\gamma)$; if $F \in \mathbb{R}^p$, $\|F\|_{L^2(\gamma)}$ will be the norm of the linear application $x \in \mathbb{R}^p \mapsto \langle F, x \rangle_{\mathbb{R}^p}$. We shall use both the fact that if $F \in \mathbb{R}^p$ and $\gamma$ is a gaussian measure with mean zero and covariance $C$, then $\|F\|_{L^2(\gamma)} = \|C^{1/2} F\|_{\mathbb{R}^p}$, and the fact that $\|F\|_{L^2(\gamma)}$ is a natural measure that can be extended to an infinite dimensional framework. The symmetric difference between two subsets $A$ and $B$ of $\mathcal{X}$ is denoted by $A \Delta B$; it is the set of all elements that are in $A \setminus B$ or in $B \setminus A$. If $A$ is a matrix of $\mathbb{R}^p$, $\|A\|_{HS}$ will be the Hilbert-Schmidt norm of the matrix $A$, $\mathrm{trace}(A)$ the trace of $A$, and $q_A(x)$ will be given by $\langle Ax, x \rangle_{\mathbb{R}^p}$ for all $x \in \mathbb{R}^p$.

This article is organized as follows. We give the main theoretical results - leading to a solution to Problem 1 - for the LDA procedure in Section 2, and for the QDA procedure in Section 3. In Section 4 we give our algorithm for high dimensional data classification and the theoretical result related to it. This leads to our contribution to Problem 2 in the light of our solution to Problem 1. In Section 5 we apply this algorithm to curve classification. In Section 6 we introduce a geometric measure of error and derive its link with the excess risk. Section 7 is devoted to the proof of results given in Section 2, and Section 8 to the proof of results given in Section 3 and possible generalisations.

2. Affine perturbation of affine rules

2.1. A solution to Problem 1

2.1.1. Main result

In this section, $\mathcal{X} = \mathbb{R}^p$, $C$ is a symmetric positive definite matrix and $P_1 = \gamma_{\mu_1, C}$, $P_0 = \gamma_{\mu_0, C}$. Under these hypotheses $L_{10}(x) = L^A_{10}(x)$ is affine on $\mathbb{R}^p$:
$$ L^A_{10}(x) = \langle F_{10}, x - s_{10} \rangle_{\mathbb{R}^p} \quad \text{where } s_{10} = \frac{\mu_1 + \mu_0}{2}, \quad F_{10} = C^{-1} m_{10} \tag{3} $$
and $m_{10} = \mu_1 - \mu_0$. In this section, we restrict ourselves to an affine substitute $\widehat{L}^A_{10}(x)$, and we write $\widehat{F}_{10}$ and $\widehat{s}_{10}$ for the corresponding substitutes of $F_{10}$ and $s_{10}$. We then decide that $X$ comes from $P_1$ if it is in
$$ \widehat{V} = \left\{ x \in \mathbb{R}^p : \widehat{L}^A_{10}(x) \geq 0 \right\}. \tag{4} $$
One can define the angle $\alpha$ in $L^2(\gamma_C)$ between $F_{10}$ and $\widehat{F}_{10}$ by
$$ \alpha = \arctan\left( \frac{ \|\Pi_{F_{10}^\perp} \widehat{F}_{10}\|_{L^2(\gamma_C)}\, \|F_{10}\|_{L^2(\gamma_C)} }{ \langle \widehat{F}_{10}, F_{10} \rangle_{L^2(\gamma_C)} } \right). \tag{5} $$
This angle will play a very important role in the sequel. We obtained the following solution to Problem 1.

Theorem 2.1. Let $\widehat{F}_{10}$ and $\widehat{s}_{10}$ be two $\mathbb{R}^p$ vectors and $\widehat{L}^A_{10}(x)$ be defined by substituting $\widehat{F}_{10}$ and $\widehat{s}_{10}$ for $F_{10}$ and $s_{10}$ in (3). Let $P_1$ and $P_0$ be two gaussian measures on $\mathcal{X} = \mathbb{R}^p$ with the same covariance $C$ and with means respectively $\mu_1$ and $\mu_0$. If $\widehat{V}$ is the $\mathbb{R}^p$ subset defined by (4), we have:
$$ C(\mathbf{1}_{\widehat{V}}) - C(\mathbf{1}_V) \leq \frac{E}{\|F_{10}\|_{L^2(\gamma_C)}} $$
where
$$ E = \frac{4 \|F_{10}\|_{L^2(\gamma_C)}}{\sqrt{\pi}} \left( \frac{ |\langle \widehat{F}_{10}, \widehat{s}_{10} - s_{10} \rangle_{\mathbb{R}^p}| }{ \|\widehat{F}_{10}\|_{L^2(\gamma_C)} } + \|F_{10} - \widehat{F}_{10}\|_{L^2(\gamma_C)} \right). \tag{6} $$
If $|\langle \widehat{F}_{10}, \widehat{s}_{10} - s_{10} \rangle_{\mathbb{R}^p}| \leq \frac{1}{4} |\langle \widehat{F}_{10}, F_{10} \rangle_{L^2(\gamma_C)}|$ and $\alpha \leq \pi/4$ ($\alpha$ is defined by (5)), then
$$ C(\mathbf{1}_{\widehat{V}}) - C(\mathbf{1}_V) \leq e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{32}}\, \frac{E}{\|F_{10}\|_{L^2(\gamma_C)}}. \tag{7} $$

The proof of this theorem is given in Section 7, at Subsection 7.4. It is a consequence of Theorem 7.1, obtained by simple geometric methods emphasizing the fact that $P_0(X \in V \setminus \widehat{V})$ is the measure of an area between two hyperplanes obtained by a rotation of angle $\alpha$. The proof also uses the inequality
$$ C(\mathbf{1}_{\widehat{V}}) - C(\mathbf{1}_V) \leq \frac{1}{2}\left( P_1(X \in V \setminus \widehat{V}) + P_0(X \in \widehat{V} \setminus V) \right) = R(\mathbf{1}_{\widehat{V}}), \tag{8} $$
which defines $R(\mathbf{1}_{\widehat{V}})$. We call $R(\mathbf{1}_{\widehat{V}})$ the learning error; it is the probability of making a wrong classification with $g(x) = \mathbf{1}_{\widehat{V}}(x)$ and a good classification with the optimal rule $g^* = \mathbf{1}_V$. We will use and motivate this measure of error more deeply in Section 6. Let us now give comments on Theorem 2.1.

2.1.2. General comments

If we write
$$ \delta = \widehat{F}_{10} - F_{10} \quad \text{and} \quad d_0 = \langle \widehat{F}_{10}, s_{10} - \widehat{s}_{10} \rangle_{\mathbb{R}^p}, \tag{9} $$
we have $\widehat{L}_{10}(x) = L_{10}(x) + \langle \delta, x - s_{10} \rangle_{\mathbb{R}^p} + d_0$. Also, in the sequel we will talk about affine perturbations of the optimal rule. The preceding theorem results from the study of affine perturbations of affine rules. The case where $d_0 = 0$ will be studied later, but we can already note that in this case, Theorem 2.1 yields
$$ C(\mathbf{1}_{\widehat{V}}) - C(\mathbf{1}_V) \leq \frac{ \|L_{10} - \widehat{L}_{10}\|_{L^2(\gamma_{C,s_{10}})} }{ \|L_{10}\|_{L^2(\gamma_{C,s_{10}})} }, $$
which is a nice answer to Problem 1. In the sequel (see Section 7, Theorem 7.1), we shall see that it is optimal whenever $\|L_{10}\|_{L^2(\gamma_{C,s_{10}})}$ does not become too large.

The quantity $r = \|F_{10}\|_{L^2(\gamma_C)}$ measures the theoretical separation of the data. Indeed it is the $L^1$ distance between $P_1$ and $P_0$, defined by $d_1(P_1, P_0) = \int |dP_1 - dP_0|$, that measures this separation: it is known that $d_1(P_1, P_0) = 1 - 2 C(\mathbf{1}_V)$, which implies
$$ d_1(P_1, P_0) = \Phi\left(\tfrac{1}{2} r\right) - \Phi\left(-\tfrac{1}{2} r\right). $$
Also, $d_1(P_1, P_0) \sim r$ when $r \to 0$, and then the data cannot be distinguished by any rule. The data tend to be perfectly separated when $d_1(P_1, P_0) \to 1$. In this case, $r \to \infty$ and
$$ d_1(P_1, P_0) \sim 1 - \frac{2 e^{-r^2/8}}{r\sqrt{2\pi}}. $$
Also note that in the infinite dimensional setting two gaussian measures $P_0$ and $P_1$ are either orthogonal (there exists a Borelian set $A$ such that $P_1(A) = P_0(\mathcal{X} \setminus A) = 0$) or equivalent (i.e. mutually absolutely continuous), and the latter case appears if and only if $r$ is finite. Hence, while $E$ measures the estimation error, the terms
$$ \frac{1}{\|F_{10}\|_{L^2(\gamma_C)}} \quad \text{and} \quad e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{32}} \tag{10} $$
in the upper bounds (6) and (7) are linked with the proximity of the measures $P_0$ and $P_1$. When $\|F_{10}\|^2_{L^2(\gamma_C)}$ is large, data are well separated and the terms in (10) measure the impact of this separation on the excess risk. We believe that when $\|F_{10}\|^2_{L^2(\gamma_C)}$ tends to 0, $\frac{1}{\|F_{10}\|_{L^2(\gamma_C)}}$ is linked to the error measure $R(\mathbf{1}_V)$ used in the proof (defined by (8)).
Indeed, it is not correct to think that the classification problem is harder (in the sense of the excess risk) when data are not well separated: a straightforward computation leads to
$$ \forall \tilde{V} \subset \mathbb{R}^p, \quad C(\mathbf{1}_{\tilde{V}}) - C(g^*) \leq \frac{1}{2} d_1(P_1, P_0). $$
As we shall see in the sequel (see Theorem 6.1), $R(\mathbf{1}_V)$ behaves almost like the excess risk if and only if $d_1(P_0, P_1)$ does not tend to 0.

The learning set has to be used to elaborate estimators $\widehat{F}_{10}$ and $\widehat{s}_{10}$ of $F_{10}$ and $s_{10}$. The preceding theorem allows us to quantify what intuition clearly indicates: a good estimation of the parameters $F_{10}$ and $s_{10}$ (or, more indirectly, $\mu_1$, $\mu_0$ and $C$) leads to a good classification rule. These estimators must lead to a small excess risk, and by the preceding theorem
$$ E_{P^{\otimes n}}\left[ C(\mathbf{1}_{\widehat{V}}) - C(\mathbf{1}_V) \right] \leq \frac{E_{P^{\otimes n}}[E]}{\|F_{10}\|_{L^2(\gamma_C)}}, \tag{11} $$
where $P^{\otimes n}$ is the learning set distribution.

It seems that little is known on the theoretical behaviour of the LDA procedure (a plug-in procedure) with respect to the optimal rule (the Bayes rule). The result that is classically used (see for example Anderson and Bahadur [2]) to show the consistency of an LDA rule using estimators $\widehat{F}_{10} = \widehat{C^{-1}} \widehat{m}_{10} = \widehat{C^{-1}} (\widehat{\mu}_1 - \widehat{\mu}_0)$ and $\widehat{s}_{10} = (\widehat{\mu}_1 + \widehat{\mu}_0)/2$ is that the probability of observing $X \leadsto \gamma_{C,\mu_0}$ (in which case $X$ comes from class 0) falling into $\widehat{V}$ (and of affecting it to class 1) is
$$ P\left( \langle \widehat{F}_{10}, C^{1/2} \xi \rangle_{\mathbb{R}^p} \geq \langle \widehat{s}_{10} - \mu_0, \widehat{F}_{10} \rangle_{\mathbb{R}^p} \,\middle|\, \mathcal{A} \right) = 1 - \Phi\left( \frac{ \langle \widehat{s}_{10} - \mu_0, \widehat{F}_{10} \rangle_{\mathbb{R}^p} }{ \|\widehat{F}_{10}\|_{L^2(\mathbb{R}^p, \gamma_C)} } \right), \tag{12} $$
where $\mathcal{A}$ is the $\sigma$-field generated by the learning set, and $\xi$ is a centered gaussian random vector of $\mathbb{R}^p$ with covariance $Id_{\mathbb{R}^p}$. Note that the proof of (12) follows from a straightforward calculation. We believe that a direct analysis of this error term misses the geometrical aspect of the problem. In addition, this error has to be compared with the lowest possible error $C(g^*)$. Note that for the LDA procedure in a high dimensional framework, an analysis of the worst case excess risk has been done with (12) by Bickel and Levina [6] for a particular choice of $\widehat{F}_{10}$ and $\widehat{s}_{10}$. Our theorem, because it is intrinsic to the classification procedure, is singularly different from the type of result that they obtain. In particular, it will allow us to establish a revealing link between dimensionality reduction and thresholding estimation.

2.1.3. The constant part of the perturbation

The error due to the constant part of the perturbation ($d_0$ in equation (9)) is measured by
$$ \frac{4}{\sqrt{\pi}} \left| \left\langle \frac{\widehat{F}_{10}}{\|\widehat{F}_{10}\|_{L^2(\gamma)}},\, \widehat{s}_{10} - s_{10} \right\rangle_{\mathbb{R}^p} \right|. $$
In order to give a first simple analysis of this term, we are going to suppose that $\widehat{F}_{10}$ and $\widehat{s}_{10}$ are independent. This independence can be obtained by keeping a part of the learning set for the estimation of $F_{10}$ and a part for the estimation of $s_{10}$. In this case, if $n'$ observations of the learning set were used to construct $\widehat{s}_{10}$, and if $\widehat{s}_{10} = (\bar{\mu}_1 + \bar{\mu}_0)/2$ ($\bar{\mu}_i$ is the empirical mean of the observations of group $i$), then a straightforward calculation leads to
$$ E_{P^{\otimes n}}\left[ \frac{4}{\sqrt{\pi}\, \|\widehat{F}_{10}\|_{L^2(\gamma)}} \left|\langle \widehat{F}_{10}, \widehat{s}_{10} - s_{10} \rangle_{\mathbb{R}^p}\right| \right] \leq \frac{8}{\sqrt{2 n' \pi}}. $$
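As a concrete companion to Theorem 2.1 and the discussion above, the following sketch (our own illustration; the function name is ours) evaluates the bound $E/\|F_{10}\|_{L^2(\gamma_C)}$ of (6) together with the separation $r$ and the distance $d_1(P_1, P_0)$ for given parameters.

```python
import numpy as np
from scipy.stats import norm

def theorem21_bound(F, F_hat, s, s_hat, C):
    """Evaluate the excess-risk bound E / ||F_10|| of (6), together with the
    separation r = ||F_10||_{L2(gamma_C)} and d_1(P_1,P_0) = Phi(r/2) - Phi(-r/2).
    ||v||_{L2(gamma_C)} = ||C^{1/2} v|| is computed via a Cholesky factor of C."""
    Lc = np.linalg.cholesky(C)                     # C = Lc @ Lc.T
    norm_gC = lambda v: np.linalg.norm(Lc.T @ v)   # sqrt(v^T C v)
    r = norm_gC(F)
    E = (4 * r / np.sqrt(np.pi)) * (
        abs(F_hat @ (s_hat - s)) / norm_gC(F_hat) + norm_gC(F - F_hat))
    d1 = norm.cdf(r / 2) - norm.cdf(-r / 2)
    return E / r, r, d1
```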
Ultimately, the difficulty of the problem does not come from the constant part of the perturbation, but from the linear part. The conditions under which the second inequality (7) of the theorem is given shall easily be satisfied. The second condition is that $\alpha \leq \pi/4$. It is not difficult to satisfy if $\widehat{F}_{10}$ and $F_{10}$ are close enough to each other. The first one is verified if the second is and if we have:
$$ \left| \left\langle \frac{\widehat{F}_{10}}{\|\widehat{F}_{10}\|_{L^2(\gamma_C)}},\, s_{10} - \widehat{s}_{10} \right\rangle_{\mathbb{R}^p} \right| \leq \frac{\sqrt{2}}{8} \|F_{10}\|_{L^2(\gamma_C)}. $$
If for example $\widehat{s}_{10} = (\bar{\mu}_1 + \bar{\mu}_0)/2$ and the learning set is composed of $n'$ observations uniquely used for the estimation of $s_{10}$, then, given the rest of the learning set,
$$ \left\langle \frac{\widehat{F}_{10}}{\|\widehat{F}_{10}\|_{L^2(\gamma_C)}},\, s_{10} - \widehat{s}_{10} \right\rangle_{\mathbb{R}^p} \leadsto \gamma_{\frac{1}{n'}} $$
and the preceding condition is satisfied with probability
$$ 1 - 2\Phi\left( -\frac{\sqrt{2}}{8} \|F_{10}\|_{L^2(\gamma_C)} \sqrt{n'} \right). $$

2.1.4. The linear part of the perturbation

As we shall explain in the proof of Theorem 2.1, the angle $\alpha$ defined by (5) measures quite well the error due to the linear part of the perturbation. Also, the upper bound given in the preceding theorem is not sharp everywhere. Indeed, if $\beta \in \mathbb{R}$ and $\widehat{F}_{10} = \beta F_{10}$, the error $R(\mathbf{1}_{\widehat{V}})$ is null while the bound (6) can be arbitrarily large. We believe that the study of methods designed to estimate a direction (a parameter on the sphere $S^{p-1}$) in a high dimensional setting is required. We only want to give the link between the problem of estimating $F_{10}$ as a vector of $\mathbb{R}^p$ and the problem of estimating $F_{10}$ in order to get a small $C(\mathbf{1}_{\widehat{V}})$. In addition, this invariance of the error under dilatation only exists in the direction $F_{10}$, which is unknown, and it seems to be quite tricky to make a direct use of it. Let us give a simple example to illustrate the interest of the link between estimation and learning.

Example 2.1. Let $\sigma > 0$, suppose $X \leadsto \gamma_{\frac{1}{n} I_p, F_{10}}$, $C = I_p$ and that $s_{10}$ is known. In the estimation problem of $F_{10}$ for classification we wish to recover $F_{10}$ from the observation $X$ and the error is measured by
$$ R(\mathbf{1}_{\widehat{V}}) \leq \frac{ \|F_{10} - \widehat{F}_{10}\|_{L^2(\gamma_C)} }{ \|F_{10}\|_{L^2(\gamma_C)} } = \frac{ \|\widehat{F}_{10} - F_{10}\|_{\mathbb{R}^p} }{ \|F_{10}\|_{\mathbb{R}^p} }. $$

In Example 2.1 the problem is exactly the one we encounter in the regression framework, while estimating $F_{10}$ from $p$ noisy observations of $(F_{10}[i])_{i=1,\ldots,p}$ with an error measured in the $l_2$ norm. Suppose now that we want to let $p$ grow to infinity. If the coefficients of $F_{10}$ decrease sufficiently fast, for example if $F_{10} \in l_q(\mathbb{R})$ with $q < 2$, then (see for example [9]) it is possible to obtain a good statistical estimation of $F_{10}$ by setting to zero the coefficients that are, in absolute value, under a threshold. This is a thresholding estimation and we shall use this type of procedure in Section 4. In the case where we observe $X$ from the distribution $\gamma_{C/n, m_{10}}$ (or, equivalently, $X_i$, $i = 0,1$, from the distribution $\gamma_{2C/n, \mu_i}$) and if $C \neq I_p$ is known, the problem can be reduced to the preceding particular case thanks to the transformation $x \to C^{-1/2} x$. When $C$ is unknown, the parallel with the estimation framework is more delicate because the error $E$ depends on $C$.
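The following sketch (ours; the universal-threshold choice and the sparse example vector are assumptions, see [9] for the underlying theory) shows the hard-thresholding estimator alluded to in Example 2.1: $F_{10}$ is recovered from one noisy observation $X \leadsto \gamma_{\frac{1}{n} I_p, F_{10}}$ by killing small coefficients.

```python
import numpy as np

def hard_threshold(y, t):
    """Hard thresholding: coefficients of y below t in absolute value are set
    to zero (the estimation procedure alluded to in Example 2.1; see [9])."""
    return np.where(np.abs(y) > t, y, 0.0)

# Usage sketch under the assumptions of Example 2.1: X ~ gamma_{(1/n)I_p, F_10},
# with F_10 sparse; sqrt(2 log(p) / n) is the standard universal threshold.
rng = np.random.default_rng(0)
p, n = 10_000, 50
F10 = np.zeros(p); F10[:20] = 2.0                      # assumed sparse direction
X = F10 + rng.standard_normal(p) / np.sqrt(n)
F10_hat = hard_threshold(X, np.sqrt(2 * np.log(p) / n))
```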
Remark 2.1. Replacing coefficients by zero in the regression framework of Example 2.1 is equivalent to reducing the dimension of the space on which the chosen classification rule acts. Selecting the significant coefficients of $F_{10}$ is equivalent to finding the directions $e_i \in \mathbb{R}^p$ for which $|\langle C^{-1/2}(\mu_1 - \mu_0), e_i \rangle_{\mathbb{R}^p}|^2$ is large. This is almost equivalent to finding the directions in which a theoretical version of the ratio between inter-variance and intra-variance is big. This type of heuristic with empirical quantities has been used by Fisher [13], whose strategy is to maximize the Rayleigh quotient (see for example [14]). The point is that the use of empirical quantities in high dimension can be catastrophic (see the next subsection).

2.2. Procedures to avoid in high dimension

We are going to give two results that will lead to the following precepts in the problem of estimating $L_{10}$. While giving a solution to Problem 2,

1. one should not try to estimate the full covariance matrix $C$ from the data,
2. one should restrict the possible values of $m_{10}$ to a (sufficiently small) subset of $\mathbb{R}^p$.

These precepts have been known for some time, but we give precise non-asymptotic results emphasising them. The first one is a consequence of Proposition 2.1 below, while the second one results from Proposition 2.2. These two propositions arise from the use of a more geometric error measure, the learning error $R$, which has already been defined by (8) and which shall be studied in more detail in Section 6. In fact it is an easy geometric exercise, for one who knows a little about gaussian measures, to obtain the following lower bound:
$$ R(\mathbf{1}_{\widehat{V}}) \geq \frac{|\alpha|}{2\pi} e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{8}} \tag{13} $$
(which is the last point of Theorem 7.1 in Section 7), where $\alpha$, the angle in $L^2(\gamma_C)$ between $F_{10}$ and $\widehat{F}_{10}$, is defined by (5). On the other hand, Theorem 6.1 from Section 6 leads to
$$ C(g) - C(g^*) \geq \min\left\{ \frac{\sqrt{2\pi}}{2 \cdot 16^2} \|C^{-1/2} m_{10}\|_{\mathbb{R}^p}\, e^{\frac{\|C^{-1/2} m_{10}\|^2_{\mathbb{R}^p}}{8}} R(g)^2,\; \frac{R(g)}{8} \right\}, $$
for all measurable $g : \mathcal{X} \to \{0,1\}$. Also, it suffices to get a lower bound on the learning error $R(\mathbf{1}_{\widehat{V}})$ by the use of (13) to get a (good) lower bound on the excess risk when $d_1(P_0, P_1)$ cannot be as close as desired to zero. This is what we shall do. For the case where the distributions $P_1$ and $P_0$ are almost indistinguishable ($d_1(P_1, P_0) \to 0$) we refer to the discussion in Section 6.

2.2.1. One should not try to identify the correlation structure

Let us recall that if $C$ is a symmetric positive semi-definite matrix, one can define its generalised inverse, also called the Moore-Penrose pseudo-inverse, $C^-$. This generalised inverse $C^-$ arises from the decomposition $\mathbb{R}^p = Ker(C) \oplus Ker(C)^\perp$. On $Ker(C)$, $C^-$ is null, and on $Ker(C)^\perp$, $C^-$ equals the inverse of $\tilde{C} = C|_{Ker(C)^\perp}$ (i.e. $\tilde{C}$ is the restriction of $C$ to $Ker(C)^\perp$).

Proposition 2.1. Suppose we are given $X_1, \ldots, X_n$ drawn independently from a gaussian probability distribution $P$ with mean zero and covariance $C$ on $\mathbb{R}^p$. Let $\widehat{C}$ be the empirical covariance and $\widehat{C}^-$ its generalised inverse.
If $\widehat{F}_{10} = \widehat{C}^- m_{10}$ and $\widehat{s}_{10} = s_{10}$, the classification rule $\mathbf{1}_{\widehat{V}}$ defined by (4) leads to
$$ E_{P^{\otimes n}}\left[ R(\mathbf{1}_{\widehat{V}}) \right] \geq \frac{\arccos\left(\sqrt{\frac{n}{p}}\right)}{2\pi}\, e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{8}}. $$
Before we prove this proposition, let us comment on it in a few words.

Comment. As a particular application of this proposition, we see that the Fisher rule performs badly when $p \gg n$, which was already shown in [6], but in a different form (asymptotic, and not in a direct comparison of the risk with the Bayes risk). Many alternatives to the estimation of the correlation structure can be used, based for example on the approximation theory of covariance operators, together with a model selection procedure or a more sophisticated aggregation procedure. Much work has already been done in this direction; see for example [7] and the references therein. The approximation procedure has to be linked with a statistical hypothesis, as is the case when stationarity assumptions are made that lead to a Toeplitz covariance matrix $C$ (i.e. $C_{ij} = c(i-j)$ with $c : \mathbb{Z} \to \mathbb{R}$ a $p$-periodic sequence). These matrices are circular convolution operators and are diagonal in the discrete Fourier basis $(g_m)_{0 \leq m < p}$.
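A small Monte Carlo sketch (ours) of the phenomenon behind Proposition 2.1, with $C = I_p$ and $\widehat{F}_{10} = \widehat{C}^- m_{10}$: the proposition's lower bound suggests a mean cosine between $\widehat{F}_{10}$ and $F_{10}$ of order at most $\sqrt{n/p}$ when $p \gg n$, i.e. the estimated direction is nearly orthogonal to the true one.

```python
import numpy as np

def mean_cos_angle(n, p, m10, trials=100):
    """Monte Carlo sketch of Proposition 2.1 with C = I_p: F_hat = C_hat^- m10,
    where C_hat is the empirical covariance of n standard gaussian vectors.
    Returns the average cosine of the angle between F_hat and m10, to be
    compared with sqrt(n/p)."""
    rng = np.random.default_rng(0)
    cos_vals = []
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        C_hat = X.T @ X / n                    # empirical covariance (mean known = 0)
        F_hat = np.linalg.pinv(C_hat, hermitian=True) @ m10
        cos_vals.append(abs(F_hat @ m10)
                        / (np.linalg.norm(F_hat) * np.linalg.norm(m10)))
    return np.mean(cos_vals), np.sqrt(n / p)
```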

$\|F_{10}\|^2_{L^2(\gamma_C)} \geq r$. From the preceding proposition, uniformly over all the possible values of $\mu_1$ and $\mu_0$, the learning error and the excess risk can converge to zero only if $\frac{n}{p}$ tends to 0. Recall that if no a priori assumption is made on $m_{10}$, $\bar{m}_{10}$ is the best estimator (according to the mean square error) of $m_{10}$. Also, as in the problem of estimating a high dimensional vector (such as those described in [9]), one should make a more restrictive hypothesis on $m_{10}$. We will suppose, in Section 5, that if $(a_k)_{k \geq 0}$ are the coefficients of $C^{-1/2} m_{10}$ in a well chosen basis, then $\sum_{k \geq 0} a_k^q \leq R^q$ for $0 < q < 2$.

Proof. As in the preceding proposition, we will use inequality (13). Also it is sufficient to show the following:
$$ E[|\alpha|] \geq \arccos\left( \frac{1}{\sqrt{p-3}} \left( \sqrt{n}\, \|F_{10}\|_{L^2(\gamma_C)} + 1 \right) \right), $$
where $\alpha$ is defined by (5). Because the function $\arccos$ is decreasing and concave on $[0,1]$, it suffices to obtain
$$ E\left[ \frac{ |\langle F_{10}, \widehat{F}_{10} \rangle_{L^2(\gamma_C)}| }{ \|F_{10}\|_{L^2(\gamma_C)} \|\widehat{F}_{10}\|_{L^2(\gamma_C)} } \right] \leq \frac{1}{\sqrt{p-3}} \left( \sqrt{n}\, \|F_{10}\|_{L^2(\gamma_C)} + 1 \right). \tag{17} $$
On the other hand,
$$ E\left[ \frac{ |\langle F_{10}, \widehat{F}_{10} \rangle_{L^2(\gamma_C)}| }{ \|F_{10}\|_{L^2(\gamma_C)} \|\widehat{F}_{10}\|_{L^2(\gamma_C)} } \right] \leq E\left[ \frac{ \|F_{10}\|_{L^2(\gamma_C)} }{ \|\widehat{F}_{10}\|_{L^2(\gamma_C)} } \right] + E\left[ \frac{ |\langle F_{10}, \widehat{F}_{10} - F_{10} \rangle_{L^2(\gamma_C)}| }{ \|F_{10}\|_{L^2(\gamma_C)} \|\widehat{F}_{10}\|_{L^2(\gamma_C)} } \right] \leq E\left[ \frac{ \|F_{10}\|^2_{L^2(\gamma_C)} }{ \|\widehat{F}_{10}\|^2_{L^2(\gamma_C)} } \right]^{1/2} \left( 1 + E\left[ \frac{ \langle F_{10}, \widehat{F}_{10} - F_{10} \rangle^2_{L^2(\gamma_C)} }{ \|F_{10}\|^2_{L^2(\gamma_C)} } \right]^{1/2} \right), $$
where this last inequality results from Cauchy-Schwarz. Recall that $\widehat{F}_{10} = F_{10} + \frac{C^{-1/2}}{\sqrt{n}} \xi$, where $\xi$ is a standardised gaussian random vector of $\mathbb{R}^p$. Also, we easily obtain
$$ E\left[ \frac{ \langle F_{10}, \widehat{F}_{10} - F_{10} \rangle^2_{L^2(\gamma_C)} }{ \|F_{10}\|^2_{L^2(\gamma_C)} } \right]^{1/2} = \frac{1}{\sqrt{n}}, \quad \text{and} \quad \frac{ \|F_{10}\|^2_{L^2(\gamma_C)} }{ \|\widehat{F}_{10}\|^2_{L^2(\gamma_C)} } = \frac{ \|\sqrt{n}\, C^{1/2} F_{10}\|^2_{\mathbb{R}^p} }{ \|\sqrt{n}\, C^{1/2} F_{10} + \xi\|^2_{\mathbb{R}^p} }. $$
The rest of the proof follows from the following simple fact, which is a consequence of the Cochran theorem and a classical calculation on $\chi^2$ random variables: let $\sigma > 0$, $\beta \in \mathbb{R}^p$, and $X$ a gaussian random vector of $\mathbb{R}^p$ with mean $\beta$ and covariance $I_p$. Then
$$ E\left[ \frac{1}{\|X\|^2_{\mathbb{R}^p}} \right] \leq \frac{1}{p-3}. $$

2.3. Case where $\|F_{10}\|_{L^2(\gamma_C)}$ diverges: well separated data

We shall now rapidly consider the case where the data are well separated: the case where $\|F_{10}\|_{L^2(\gamma_C)}$ diverges. In the next theorem, we assume that $p$ tends to infinity.

Theorem 2.2. Suppose that $0 < \alpha < \pi/2$ ($\alpha$ is defined by (5)), and that $\cos(\alpha) \|F_{10}\|_{L^2(\gamma_C)} \to \infty$ when $p$ tends to infinity. We then have, when $p \to \infty$,
$$ R \to \begin{cases} 0 & \text{if } \liminf_{p \to \infty} \frac{2|d_0|}{|\langle F_{10}, \widehat{F}_{10} \rangle_{L^2(\gamma_C)}|} < 1, \\[4pt] b \geq \frac{1}{8} & \text{if } \limsup_{p \to \infty} \frac{2|d_0|}{|\langle F_{10}, \widehat{F}_{10} \rangle_{L^2(\gamma_C)}|} > 1. \end{cases} $$
This theorem is proved in Section 7. In the case of well separated data it is obvious that the optimal rule will perform perfectly. Theorem 2.2 shows that for a given estimator $\widehat{F}_{10}$ one should check that the probability of having $\liminf_{p \to \infty} \frac{2|d_0|}{|\langle F_{10}, \widehat{F}_{10} \rangle_{L^2(\gamma_C)}|} > 1$ is small enough.

3. Quadratic perturbation of quadratic rules
3.1. Main results and remarks about the infinite dimensional setting

In the case where $C_1 \neq C_0$, $L_{10}(x) = L^Q_{10}(x)$ is a polynomial function of degree two on $\mathbb{R}^p$:
$$ L^Q_{10}(x) = -\frac{1}{2} \langle A_{10}(x - s_{10}), x - s_{10} \rangle_{\mathbb{R}^p} + \langle G_{10}, x - s_{10} \rangle_{\mathbb{R}^p} - c, \tag{18} $$
where
$$ A_{10} = C_1^{-1} - C_0^{-1}, \quad G_{10} = S m_{10}, \quad S = \frac{C_0^{-1} + C_1^{-1}}{2}, \quad c = \frac{1}{8} \langle A_{10} m_{10}, m_{10} \rangle_{\mathbb{R}^p} - \frac{1}{2} \log\left|\det(C_0^{-1} C_1)\right|, \tag{19} $$
and $m_{10}$ and $s_{10}$ are defined by (3).

Remark 3.1. The equation (19) giving $L^Q_{10}(x)$ can be modified using the fact that
$$ A_{10} = \frac{1}{2} \left( C_1^{-1/2} W_{10} C_1^{-1/2} - C_0^{-1/2} W_{01} C_0^{-1/2} \right) \quad \text{where } W_{ij} = I - C_i^{1/2} C_j^{-1} C_i^{1/2}. \tag{20} $$
This modification has two advantages. It involves $W_{ij}$, which plays an important role in the infinite dimensional framework (see Remark 3.2). On the other hand, it involves $W_{10}$ as much as $W_{01}$, which can lead in practice (while estimating $A_{10}$) to a symmetric procedure that does not give more importance to either group.

In the classification problem, a polynomial of degree two $\widehat{L}^Q_{10}(x)$ is used as a substitute for $L_{10}$. We decide that $X$ comes from class one if it belongs to
$$ \widehat{V} = \left\{ x \in \mathbb{R}^p : \widehat{L}^Q_{10}(x) \geq 0 \right\}. \tag{21} $$
The following theorem gives our solution to Problem 1.

Theorem 3.1. Let $\gamma$ be a gaussian measure on $\mathbb{R}^p$. Suppose that $L^Q_{10}$ is a polynomial of degree two on $\mathbb{R}^p$ and that we have $\|L^Q_{10}\|_{L^2(\gamma)} \geq r$ for $r > 0$. Then, for all $q \in ]0,1[$, there exists $c_1(r, q) > 0$ such that
$$ R(\mathbf{1}_{\widehat{V}}) \leq c_1(r, q)\, \|L^Q_{10} - \widehat{L}^Q_{10}\|_{L^2(\gamma)}^{q/3}, \tag{22} $$
where $\widehat{V}$ is given by (21) and $R$ by (8).

We emphasise the fact that $c_1(r, q)$ depends only on $r$ and $q$. In particular it does not depend on the dimension $p$ of the problem. The proof of this theorem is given in Section 8. It is implicitly infinite dimensional, and the preceding theorem could have been stated in an infinite dimensional framework. We do not want to introduce this complicated framework and we refer to [8] for an introduction to the subject. The infinite dimensional framework highlights a particular aspect of the problem that is contained in the following remark.

Remark 3.2 (infinite dimensional framework). When $\mathcal{X}$ is a separable Hilbert space (it can also be a separable Banach space in the case of LDA), two gaussian measures $\gamma_{C_1,\mu_1}$ and $\gamma_{C_0,\mu_0}$ that are not equivalent are orthogonal. If these measures are orthogonal then the observed data from the two classes are perfectly separated and $C(g^*) = 0$. In this case one can hope to obtain $C(g) = 0$ for a reasonable classification rule $g$ (even if it is not trivial; see Theorem 2.2 in the linear case). A necessary and sufficient condition for these measures to be equivalent is that
$$ m_{10} = \mu_1 - \mu_0 \in H(\gamma_{C_1,\mu_1}) = H(\gamma_{C_0,\mu_0}) \tag{23} $$
and
$$ W_{10} = I - C_1^{1/2} C_0^{-1} C_1^{1/2} \in HS(\mathcal{X}), \tag{24} $$
where $H(\gamma)$ is the reproducing kernel Hilbert space associated with a gaussian measure $\gamma$ and $HS(\mathcal{X})$ is the space of Hilbert-Schmidt operators with values in $\mathcal{X}$ (see the corollaries on p. 293 in [8]). In particular, the eigenvalues of $W_{10}$ are in $l_2$. In the case where the measures are equivalent, one can define $L_{10}$ as a limit (almost surely and in $L^2$) of its finite dimensional counterpart.
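For concreteness, here is a sketch of ours evaluating $L^Q_{10}$ directly from the two gaussian densities; it avoids committing to the sign convention of the constant $c$ in (19), since expanding the log-ratio around $s_{10}$ recovers the quadratic form (18).

```python
import numpy as np

def qda_log_likelihood_ratio(x, mu1, mu0, C1, C0):
    """L_10(x) = log(dP_1/dP_0)(x) for two gaussian measures on R^p, computed
    directly from the densities (the (2*pi)^{p/2} factor cancels in the ratio);
    expanding around s_10 = (mu1+mu0)/2 recovers the degree-two polynomial (18)
    with A_10 = C1^{-1} - C0^{-1} and G_10 = (C1^{-1}+C0^{-1})/2 (mu1-mu0)."""
    def log_density(x, mu, C):
        d = x - mu
        _, logdet = np.linalg.slogdet(C)
        return -0.5 * d @ np.linalg.solve(C, d) - 0.5 * logdet
    return log_density(x, mu1, C1) - log_density(x, mu0, C0)
```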
This can also be understood in terms of measurable and square integrable (with respect to $\gamma_{C_1,\mu_1}$) polynomials of degree two on $\mathcal{X}$ (see Chapter 5.10 in [8]).

3.2. Comment and corollary

Suppose $\widehat{L}^Q_{10}(x)$ is defined by substituting $\widehat{G}_{10}$, $\widehat{s}_{10}$, $\widehat{A}_{10}$ and $\widehat{c}$ for $G_{10}$, $s_{10}$, $A_{10}$ and $c$ in (18). If we write
$$ \delta_0 = \widehat{c} - c + \left\langle \widehat{G}_{10} + (\widehat{A}^*_{10} + \widehat{A}_{10})(\widehat{s}_{10} - s_{10}),\, \widehat{s}_{10} - s_{10} \right\rangle_{\mathbb{R}^p} \tag{25} $$
($A^*$ is the transpose of a matrix $A$),
$$ \delta_L = \widehat{G}_{10} - G_{10} + (\widehat{A}^*_{10} + \widehat{A}_{10})(\widehat{s}_{10} - s_{10}) \tag{26} $$
and
$$ \delta_Q = \widehat{A}_{10} - A_{10}, \tag{27} $$
we then get, by straightforward calculation:
$$ \forall x \in \mathbb{R}^p, \quad \widehat{L}^Q_{10}(x) = L^Q_{10}(x) + \delta_0 + \langle \delta_L, x - s_{10} \rangle_{\mathbb{R}^p} - \frac{1}{2} \langle \delta_Q (x - s_{10}), x - s_{10} \rangle_{\mathbb{R}^p}. \tag{28} $$
Also, our results are about quadratic perturbations of quadratic rules. The following corollary of Theorem 3.1 is easier to use.

Corollary 3.1. Let $\mathcal{X} = \mathbb{R}^p$ and $C$ be a symmetric positive definite matrix on $\mathbb{R}^p$. Suppose that there exists $r > 0$ such that $\|L_{10}\|^2_{L^2(\gamma_{C,s_{10}})} > r$. Then, for $\mathbf{1}_{\widehat{V}}$ given by (21) and for all $0 < q < 1$ there exists $c_1(r, q) > 0$ such that:
$$ R(\mathbf{1}_{\widehat{V}}) \leq c_1(r, q) \left( \frac{1}{2} \|C(A_{10} - \widehat{A}_{10})\|^2_{HS(\mathbb{R}^p)} + \|C^{1/2} \delta_L\|^2_{\mathbb{R}^p} + 2\delta_0^2 + \frac{1}{2} \mathrm{trace}^2(C(A_{10} - \widehat{A}_{10})) \right)^{q/3}, $$
where $\delta_L$ is given by (26) and $\delta_0$ by (25).

Proof. Let us recall that $\delta_Q$ is given by (27). We have
$$ \|L_{10} - \widehat{L}_{10}\|^2_{L^2(\gamma_{C,s_{10}})} = \left\| \frac{1}{2}\left( q_{\delta_Q}(x) - E_{\gamma_C}[q_{\delta_Q}(X)] \right) - \langle \delta_L, x \rangle_{\mathbb{R}^p} - \left( \delta_0 - \frac{1}{2} E_{\gamma_C}[q_{\delta_Q}(X)] \right) \right\|^2_{L^2(\gamma_C)} $$
$$ \leq \frac{1}{4} Var\left(q_{C^{1/2} \delta_Q C^{1/2}}(\xi)\right) + Var\left(\langle C^{1/2} \delta_L, \xi \rangle_{\mathbb{R}^p}\right) + 2\delta_0^2 + \frac{1}{2} E^2_{\gamma_C}\left[q_{C^{1/2} \delta_Q C^{1/2}}(\xi)\right] $$
($\xi \leadsto \gamma_{I_p, 0}$; note that there is equality here)
$$ = \frac{1}{2} \|C^{1/2} \delta_Q C^{1/2}\|^2_{HS(\mathbb{R}^p)} + \|C^{1/2} \delta_L\|^2_{\mathbb{R}^p} + 2\delta_0^2 + \frac{1}{2} \mathrm{trace}^2\left(C^{1/2} \delta_Q C^{1/2}\right). $$

3.3. Comparison of this result with those obtained for LDA

The preceding theorem and its corollary are less powerful than those obtained for the LDA procedure, and some conjectures might be made in parallel with Theorem 2.1. In this theorem and in Theorem 2.2, both concerning linear rules, we explained and quantified how parameter estimation errors are less important when $\|F_{10}\|_{L^2(\gamma_C)}$ is large. This observation was based on the presence of a term exponentially decreasing with $\|F_{10}\|_{L^2(\gamma_C)}$ in the quantities which determine the upper bound on the learning error (and, as a consequence, on the excess risk). In Theorem 3.1, concerning the QDA procedure, we did not obtain that type of term. Nevertheless, Remark 3.2 (more precisely, the relation that leads to the equivalence of the measures) allows us to conjecture that such a term exists. We also have to clarify the hypothesis under which the norm of $L^Q_{10}$ is lower bounded. Let us recall that this hypothesis guarantees that the constant $c_1$ in equation (22) is independent of the parameters of the problem. In parallel with the results obtained for the LDA procedure, the lower bound that is required for the norm of $L^Q_{10}$ corresponds to the assumption that the two groups considered can always be distinguished.
We believe that even if this hypothesis is natural, it is deeply linked with the error measure that is used in our proof: the learning error. Indeed, it is obvious that the excess risk is small when the data cannot be distinguished (see Section 6 for a fuller discussion), but our result does not reflect this fact. We do not discuss the estimation of $G_{10}$, which leads to the same analysis as that for $F_{10}$ in the case of a linear rule. Let us now discuss the estimation of $W_{10}$ (and $W_{01}$).

3.4. Thresholding estimation of an operator and linearisation of a procedure

Recall that $W_{10}$ is a symmetric matrix. Suppose we know an orthonormal basis in which it is diagonal. Let $\lambda^{10} = (\lambda^{10}_i)_{i=1,\ldots,p}$ be the vector of its eigenvalues. To build the estimator $\widehat{W}_{10}$ of $W_{10}$, we have to estimate its eigenvalues. It then remains to measure the learning error, and hence the estimation error of the eigenvalue vector in the $l_2$ norm. Suppose that $p$ tends to infinity. We will recall later that if the measures of classes 0 and 1 tend to equivalent gaussian measures in a separable Hilbert space, then $W_{10}$ tends to be Hilbert-Schmidt. This means that $\lambda^{10}$ stays in $l_2(\mathbb{N})$. Once again, if $\lambda^{10}$ has coefficients decreasing sufficiently fast, thresholding estimation should be used. This thresholding estimation is no longer a reduction of the dimension of the space in which the rule acts, but becomes a linearisation of the classification rule - it can be interpreted as a reduction of the dimension of the space in which the used rule lives. Indeed, let $\widehat{W}_{10} = \sum_{i=1}^{l} \widehat{\lambda}^{10}_i e_i \otimes e_i$ for $l \leq p$ and $(e_i)_{i=1,\ldots,p}$ an orthonormal basis of $\mathbb{R}^p$; we have:
$$ \widehat{L}^Q_{10} = \sum_{i=1}^{l} \widehat{\lambda}^{10}_i \langle e_i, x - \widehat{s}_{10} \rangle^2_{\mathbb{R}^p} + g(x), $$
where $g(x)$ is affine and defined on $\mathbb{R}^p$. In this case, the plug-in rule is affine on a subspace of dimension $p - l$ and quadratic on the subspace of dimension $l$ spanned by $(e_i)_{i=1,\ldots,l}$. Let us note that because $W_{10} = I - C_1^{-1/2} C_0 C_1^{-1/2}$, setting the eigenvalues of $\widehat{W}_{ij}$ to zero on a subspace of $\mathbb{R}^p$ is equivalent to choosing a subspace in which the covariance matrices $C_1$ and $C_0$ are "close enough". On this subspace, one can suppose that $C_1$ equals $C_0$. The classification rule, on this subspace, is linear. Figure 1 illustrates the case where the eigenvalues of $W_{10}$ are big enough and why a quadratic rule is better in that case.

Figure 1. Separation of the data in a direction where the variances are different. The two groups can be identified with their ellipsoids of concentration: a horizontal ellipsoid and a vertical ellipsoid. The two groups have the same mean, but different covariances, which makes the data quite well separated. One can take advantage of this separation only if a quadratic rule is used.

4. Classification procedure in high dimension: a way to solve Problem 2

4.1. Introduction

In this section, we give a practical method of classification for gaussian data in high dimension and hence present our contribution to Problem 2.
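Before turning to the procedure, here is a sketch of ours of the eigenvalue thresholding just described, using the form $W_{10} = I - C_1^{-1/2} C_0 C_1^{-1/2}$ quoted in the preceding subsection; the threshold is left as a free parameter and the function name is an assumption.

```python
import numpy as np

def threshold_W10(C1_hat, C0_hat, thresh):
    """Eigenvalue thresholding of W_10 = I - C1^{-1/2} C0 C1^{-1/2}.
    Eigenvalues below thresh in absolute value are set to zero; the resulting
    plug-in rule is quadratic only on the span of the kept eigenvectors and
    affine on the orthogonal complement. Assumes C1_hat is symmetric
    positive definite."""
    w, U = np.linalg.eigh(C1_hat)
    C1_mhalf = U @ np.diag(1.0 / np.sqrt(w)) @ U.T   # C1_hat^{-1/2}
    W = np.eye(len(w)) - C1_mhalf @ C0_hat @ C1_mhalf
    lam, V = np.linalg.eigh(W)
    lam = np.where(np.abs(lam) >= thresh, lam, 0.0)  # kill small eigenvalues
    return V @ np.diag(lam) @ V.T
```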
Note that while we only treat the binary classification problem here, it is easy to extend our procedure to the case of $K$ classes, as we have done in [15]. Recall that we are given $n_1$ observations from $P_1$ and $n_0$ observations from $P_0$. We will write $n = n_1 + n_0$. We suppose that each of the $n_k$ vectors of group $k$ is composed of the $p$ first wavelet coefficients (see [20]) of a random curve from $\mathcal{X} = L^2[0,1]$ which is a realisation of a gaussian random variable $P_k = \gamma_{C_k, \mu_k}$ of unknown mean and covariance. Recall that a learning rule can be defined by a partition of $\mathbb{R}^p$. We construct this partition $\widehat{V}$, $\mathbb{R}^p \setminus \widehat{V}$ of $\mathbb{R}^p$ with the use of a frontier function $\widehat{L}_{10}$:
$$ \widehat{V} = \left\{ x \in \mathbb{R}^p : \widehat{L}_{10}(x) \geq 0 \right\}, \tag{29} $$
which shall be given in the sequel. We divide the presentation into two parts. In the first part, we give a theoretical result in the case where the covariance matrices are supposed to be known. In the second part, we give the method that is used when the covariances are unknown. We keep the notation of the preceding sections. In the case of the LDA procedure,
$$ m_{10} = \mu_1 - \mu_0, \quad F_{10} = C^{-1} m_{10}, \quad s_{10} = \frac{\mu_1 + \mu_0}{2}, $$
and in the case of the QDA procedure,
$$ G_{10} = \frac{1}{2}(C_1^{-1} + C_0^{-1}) m_{10}, \quad A_{10} = C_1^{-1} - C_0^{-1}. $$

4.2. Case of known and equal covariances: procedure and theoretical result

Notation and assumptions. Let $\bar{\mu}_k$ be the empirical mean of the learning data $(X_{ik})_{i=1,\ldots,n_k}$ of class $k$. We suppose here that the covariances of groups 0 and 1 equal $C$, and that $s_{10}$ is known. The separation frontier between the two groups is affine and $F_{10}$ is the only unknown parameter. We suppose that the learning set is made of $n_1 = n_0 = n(p)/2$ $p$-dimensional vectors. We give a method to construct an estimator of $F_{10}$ and give theoretical results when $n(p)$ tends to infinity much more slowly than $p$. For $q > 0$, the ball $l^q_p(R)$ is composed of the vectors $\theta \in \mathbb{R}^p$ such that
$$ \sum_{i=1}^{p} |\theta_i|^q \leq R^q. $$
We will write
$$ \Omega_p(\Theta(R), r) = \left\{ (x, y, C) \in \mathbb{R}^p \times \mathbb{R}^p \times \mathcal{C}_p : C^{-1/2}(x - y) \in \Theta(R) \text{ and } \|C^{-1/2}(x - y)\|_{\mathbb{R}^p} \geq r \right\}, \tag{30} $$
where $\mathcal{C}_p$ is the set of symmetric positive definite matrices on $\mathbb{R}^p$. If $(\mu_0, \mu_1, C) \in \Omega_p(\Theta(R), r)$, we will write
$$ D(\widehat{L}_{10}) = C(\mathbf{1}_{\widehat{V}}) - C(\mathbf{1}_V), \tag{31} $$
where $\widehat{V}$ is given by (29) and $V$ is given by (2).

The procedure. The plug-in rule affects the observation $X$ to class 1 if it belongs to $\widehat{V}$ defined by (29) with $\widehat{L}_{10} = \langle \widehat{F}_{10}, X - s_{10} \rangle_{\mathbb{R}^p}$. We estimate $F_{10} = C^{-1} m_{10}$ by $\widehat{F}_{10} = C^{-1} \widehat{m}_{10}$, where the coefficients of $C^{-1/2} \widehat{m}_{10}$ are given by
$$ \left( y^{10}_l \mathbf{1}_{|y^{10}_l| > \lambda^{FDR}_{10}} \right)_{l=1,\ldots,p}, \quad \text{where} \quad (y^{10}_l)_{l=1,\ldots,p} = C^{-1/2}(\bar{\mu}_1 - \bar{\mu}_0), $$
and $\lambda^{FDR}_{10}$ is chosen by the Benjamini and Hochberg procedure [4] for the control of the false discovery rate (FDR) of the following multiple hypotheses:
$$ \forall l = 1, \ldots, p, \quad H_{0l} : E[y^{10}_l] = 0 \quad \text{versus} \quad H_{1l} : E[y^{10}_l] \neq 0. \tag{32} $$
We recall that this procedure is the following. The $(|y^{10}_l|)_l$ are ordered in decreasing order: $|y^{10}_{(1)}| \geq \cdots \geq |y^{10}_{(p)}|$ and $\lambda^{FDR}_{10} = |y^{10}_{(k^{FDR}_{10})}|$ where
$$ k^{FDR}_{10} = \max\left\{ k \in \{1, \ldots, p\} : |y^{10}_{(k)}| \geq \sqrt{\frac{1}{n(p)}}\; z\left( \frac{b_p k}{2p} \right) \right\}, $$
where $z(\alpha)$ is the quantile of order $\alpha$ of a standardized gaussian random variable and $b_p \in [0, 1/2[$ is lower bounded by $\frac{c_0}{\log p}$, where $c_0$ is a positive constant (which does not depend on $p$).

Theoretical result.

Theorem 4.1. Let $R > 0$ and $q \in ]0,2[$. Let $\widehat{V}$ be defined by (29) and $\eta_p = p^{-1/q} R \sqrt{n(p)}$. Suppose that $p$ tends to infinity. If $\eta_p^q \in \left[\frac{\log^5(p)}{p},\, p^{-\delta}\right]$ for some $\delta > 0$, then, for $r > 0$, we have
$$ \sup_{(\mu_0, \mu_1, C) \in \Omega_p(l_q(R), r)} E_{P^{\otimes n}}\left[ D_p(\widehat{L}_{10}) \right] \leq \frac{1 + o_p(1)}{r} \left( \frac{\sqrt{2}\, R\, \log^{1/2}\left( \frac{p}{R^q\, n(p)^{q/2}} \right)}{n^{1/2}(p)} \right)^{\frac{2-q}{2}}, $$
where $D_p$ is the excess risk as defined by (31), and $P^{\otimes n}$ is the law of the learning set.

Proof. The covariance matrix of the vector $C^{-1/2}(\bar{\mu}_1 - \bar{\mu}_0)$ equals $\frac{1}{n(p)} I_p$. We then use successively Theorem 2.1 (of this article), Theorem 1.1 of Abramovich et al. [1], and Theorem 5, point 3b, of Donoho and Johnstone [11] to be able to write, for all $r > 0$:
$$ \sup_{(\mu_0, \mu_1, C) \in \Omega_p(l_q(R), r)} E_{P^{\otimes n}}\left[ D^2_p(\widehat{L}_{10}) \right] \leq \frac{1 + o_p(1)}{r^2} \left( \frac{\sqrt{2}\, R\, \log^{1/2}\left( \frac{p}{R^q\, n(p)^{q/2}} \right)}{n^{1/2}(p)} \right)^{2-q}. $$
This inequality leads to the result by the use of the Jensen inequality:
$$ E_{P^{\otimes n}}\left[ D_p(\widehat{L}_{10}) \right] \leq E_{P^{\otimes n}}\left[ D^2_p(\widehat{L}_{10}) \right]^{1/2}. $$

Comments. Let us make a few remarks on this result.

1. The rate of convergence is faster when $q$ is close to 0, and slower when it is close to 2. This leads us to consider the sparsity of $C^{-1/2}(\mu_0 - \mu_1)$, and makes the use of the wavelet basis attractive. On the one hand, it transforms a wide class of curves into sparse vectors and, on the other hand, it almost diagonalises a wide class of covariance operators.
2. We could obtain the same speed with a universal threshold (i.e. with the threshold $\lambda_U = \sqrt{\frac{2\log(p)}{n(p)}}$). In this case, the constant $\frac{1+o_p(1)}{r^2}$ would not be that good (cf. [1]).
3. We are not aware of any results concerning the convergence of any classification procedure in this framework (the high dimensional gaussian framework with the set of possible parameters determined by $\Omega_p$). Indeed, we do not make any strong assumption on $C$. Bickel and Levina [6], as well as Fan and Fan [12], suppose in their work that the ratio between the highest and the lowest eigenvalue is lower and upper bounded. Even if our theorem does not treat the case where $C$ is unknown, the hypotheses we use seem more natural. Let us recall that if $Y$ is a gaussian random variable with values in a Hilbert space, then its covariance operator is necessarily nuclear. Also, the assumption used by the above mentioned authors does not allow us to consider gaussian measures with support in a Hilbert space.
4. Finding the significant components of the normal vector $F_{10}$ defining the optimal separating hyperplane is equivalent to finding the significant contrasts in a multivariate ANOVA. Hence, controlling the expected false discovery rate in this ANOVA is sufficient to get a good classification rule.
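The following is a sketch of ours of the threshold selection above (we interpret $z(\alpha)$ as the upper $\alpha$-quantile, as is usual in FDR thresholding; the function name and the `None` convention are assumptions).

```python
import numpy as np
from scipy.stats import norm

def fdr_threshold(y, n, b_p=0.01):
    """Benjamini-Hochberg selection of lambda_FDR for the coefficients
    y = C^{-1/2}(mu1_bar - mu0_bar), each with variance 1/n under (32).
    z(alpha) is taken to be the upper alpha-quantile. Returns None if no
    coefficient passes the boundary."""
    p = len(y)
    a = np.sort(np.abs(y))[::-1]                    # |y_(1)| >= ... >= |y_(p)|
    k = np.arange(1, p + 1)
    bound = norm.isf(b_p * k / (2 * p)) / np.sqrt(n)
    passing = np.nonzero(a >= bound)[0]
    if passing.size == 0:
        return None
    return a[passing.max()]                         # lambda_FDR = |y_(k_FDR)|
```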
4.3. The case of different unknown covariances

For the rest of this section, if $k \in \{0,1\}$, $\bar{\mu}_k$ will be the empirical mean of the learning data of class $k$. We are going to use a diagonal estimator $\widehat{C}_k$ of the covariance matrix $C_k$. The diagonal elements of $\widehat{C}_k$ will be $(\widehat{\sigma}^2_{kq})_{q=1,\ldots,p}$. For $q \in \{1, \ldots, p\}$ and $k \in \{0,1\}$, $\widehat{\sigma}^2_{kq}$ will be the unbiased version of the empirical variance of feature $q$ of the observations $(X_{ikq})_{i=1,\ldots,n_k}$ of class $k$. We will write $\widehat{s}_{10} = (\bar{\mu}_1 + \bar{\mu}_0)/2$. The classification rule used decides that $X \in \mathbb{R}^p$ comes from class 1 if $X$ belongs to $\widehat{V}$ given by (29) with
$$ \widehat{L}_{10} = -\frac{1}{2} \langle \widehat{A}_{10}(x - \widehat{s}_{10}), x - \widehat{s}_{10} \rangle_{\mathbb{R}^p} + \langle \widehat{G}_{10}, x - \widehat{s}_{10} \rangle_{\mathbb{R}^p} - \widehat{c}_{10}, $$
where the quantities of this equation are given in what follows: $\widehat{G}_{10}$ (equation (33)), $\widehat{A}_{10}$ (equation (34)), and $\widehat{c}_{10}$ (equation (35)).

We estimate $G_{10} = \frac{1}{2}(C_1^{-1} + C_0^{-1}) m_{10}$ by
$$ \widehat{G}_{10} = \left( \frac{1}{\sqrt{2}} \left( \frac{1}{\widehat{\sigma}^2_{1q}} + \frac{1}{\widehat{\sigma}^2_{0q}} \right)^{1/2} y^{10}_q\, \mathbf{1}_{|y^{10}_q| > \lambda^{FDR}_{10}} \right)_{q=1,\ldots,p} \tag{33} $$
where
$$ y^{10}_q = \frac{1}{\sqrt{2}} \left( \frac{1}{\widehat{\sigma}^2_{1q}} + \frac{1}{\widehat{\sigma}^2_{0q}} \right)^{1/2} (\widehat{\mu}_{1q} - \widehat{\mu}_{0q}), $$
and $\lambda^{FDR}_{10}$ is chosen by the Benjamini and Hochberg procedure. This procedure is the following. Let $Var_0(y^{10}_q)$ be the variance of $y^{10}_q$ calculated under the hypothesis that $\mu_{1q} = \mu_{0q}$. The term
$$ \frac{1 + \widehat{\sigma}^2_{1q}/\widehat{\sigma}^2_{0q}}{2 n_1} + \frac{1 + \widehat{\sigma}^2_{0q}/\widehat{\sigma}^2_{1q}}{2 n_0} $$
is an estimation of this variance when the $\sigma^2_{kq}$ ($k = 0,1$) are known and equal to $\widehat{\sigma}^2_{kq}$. In practice, we substitute these terms for $Var_0(y^{10}_q)$. The real numbers $\left(|y^{10}_q| / \sqrt{Var_0(y^{10}_q)}\right)_{q=1,\ldots,p}$ are ordered in decreasing order:
$$ |y^{10}_{(1)}| / \sqrt{Var_0(y^{10}_{(1)})} \geq \cdots \geq |y^{10}_{(p)}| / \sqrt{Var_0(y^{10}_{(p)})} $$
and $\lambda^{FDR}_{10} = |y^{10}_{(k^{FDR}_{10})}|$ where
$$ k^{FDR}_{10} = \max\left\{ k : |y^{10}_{(k)}| \geq \sqrt{ \frac{1 + \widehat{\sigma}^2_{1(k)}/\widehat{\sigma}^2_{0(k)}}{2 n_1} + \frac{1 + \widehat{\sigma}^2_{0(k)}/\widehat{\sigma}^2_{1(k)}}{2 n_0} }\; z\left( \frac{b_p k}{2p} \right) \right\}, $$
$z(\alpha)$ is the quantile of order $\alpha$ of a standardized gaussian random variable and $b_p \in [0,1[$ is as in the preceding algorithm. In practice, we choose $b_p = 0.01$, but one could keep a part of the learning set to learn the best value of $b_p$. Note that in the application we have in mind, the learning set is too small to be divided. In addition, the choice of $b_p$, in view of Theorem 4.1, does not determine the performance of the algorithm. In practice, the difference in classification error between the choices $b_p = 0.01$ and $b_p = 0.05$, for example, is not important.

This first part of the method constitutes a dimension reduction. Indeed, the only coordinates of $(\widehat{G}_{10q})_{q=1,\ldots,p}$ that are kept non null are those for which $|y^{10}_q| \geq \lambda^{FDR}_{ij}$. The linear application associated with $(\widehat{G}_{10q})_{q=1,\ldots,p}$ only acts in $k^{FDR}_{10}$ directions. Let us also note that if we extend our procedure to a multiclass procedure, for two couples of classes $(i,j) \neq (l,m)$, the corresponding estimations $G_{ij}$ and $G_{lm}$ might be based on different dimension reductions.

Remark 4.1. The testing procedure used can be analysed as a "vertical" ANOVA that reveals the interesting directions

1. in which classification should be done (with thresholding estimation of $G_{10}$),
2. in which classification should be quadratic (with thresholding estimation of $A_{10}$).
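A sketch of ours of the estimator (33) (the function name is an assumption; the threshold is passed in, e.g. as selected by the variance-normalized FDR procedure described above):

```python
import numpy as np

def estimate_G10(X1, X0, lam):
    """Thresholded estimator (33) of G_10 under diagonal covariance estimates.
    y_q = (1/sqrt(2)) (1/s1_q + 1/s0_q)^{1/2} (mu1_q - mu0_q); coordinate q of
    G_hat is kept only if |y_q| exceeds the threshold lam (lambda_FDR)."""
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)
    s1 = X1.var(axis=0, ddof=1)        # unbiased empirical variances, class 1
    s0 = X0.var(axis=0, ddof=1)        # unbiased empirical variances, class 0
    w = np.sqrt((1.0 / s1 + 1.0 / s0) / 2.0)
    y = w * (mu1 - mu0)
    return np.where(np.abs(y) > lam, w * y, 0.0)
```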
The matrix $A_{10}$ is estimated by a diagonal matrix with diagonal elements given by
$$ \widehat{a}_{10q} = \left( \frac{1}{\widehat{\sigma}^2_{1q}} - \frac{1}{\widehat{\sigma}^2_{0q}} \right) \mathbf{1}_{|w^{10}_q| \geq \eta^{FDR}_{10}}, \quad \text{where} \quad w^{10}_q = \widehat{\sigma}^2_{1q} - \widehat{\sigma}^2_{0q}, \quad q = 1, \ldots, p, \tag{34} $$
and the threshold $\eta^{FDR}_{10}$ is chosen with the same type of procedure as the one used to find $\lambda^{FDR}_{10}$. Let $Var_0(w^{10}_q)$ be the variance of $w^{10}_q$ under the hypothesis that $\sigma_{1q} = \sigma_{0q}$. The term
$$ \frac{2 \widehat{\sigma}^4_{1q}}{n_1 - 1} + \frac{2 \widehat{\sigma}^4_{0q}}{n_0 - 1} $$
is an estimation of it that we use in practice. The real numbers $\left(|w^{10}_q / \sqrt{Var_0(w^{10}_q)}|\right)_q$ are ordered in decreasing order:
$$ |w^{10}_{(1)} / \sqrt{Var_0(w^{10}_{(1)})}| \geq \cdots \geq |w^{10}_{(p)} / \sqrt{Var_0(w^{10}_{(p)})}| $$
and $\eta^{FDR}_{10} = |w^{10}_{(k^{FDR}_{10})}|$ where
$$ k^{FDR}_{10} = \max\left\{ k : |w^{10}_{(k)}| \geq \sqrt{ \frac{2 \widehat{\sigma}^4_{1(k)}}{n_1 - 1} + \frac{2 \widehat{\sigma}^4_{0(k)}}{n_0 - 1} }\; z\left( \frac{b_p k}{2p} \right) \right\}. $$
This part of the method constitutes a linearisation of the rule. Indeed, the directions $q \in \{1, \ldots, p\}$ in which $\widehat{a}_{10q}$ is 0 are the directions in which the classification rule between groups 1 and 0 is linear. In the other directions, the rule is quadratic. The use of this method is still motivated by Theorem 4.1 and the theorems used in its proof, but it needs additional theoretical justification. We will finally write:
$$ \widehat{c}_{10} = \sum_{q=1}^{p} \mathbf{1}_{|w^{10}_q| \geq \eta^{FDR}_{10}} \left( \frac{1}{8} \widehat{a}_{10q} (\bar{\mu}_{1q} - \bar{\mu}_{0q})^2 + \frac{1}{2} \log\left| \widehat{\sigma}^{-1}_{0q} \widehat{\sigma}_{1q} \right| \right). \tag{35} $$
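A companion sketch of ours for (34) and (35) (the function name is an assumption; the threshold $\eta$ is passed in, e.g. as selected by the variance-normalized FDR procedure just described):

```python
import numpy as np

def estimate_A10_and_c(X1, X0, eta):
    """Diagonal thresholded estimator (34) of A_10 and the constant (35).
    Direction q stays quadratic only if |s1_q - s0_q| >= eta (eta_FDR);
    elsewhere a_q = 0 and the rule is linear in that direction."""
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)
    s1, s0 = X1.var(axis=0, ddof=1), X0.var(axis=0, ddof=1)
    keep = np.abs(s1 - s0) >= eta
    a = np.where(keep, 1.0 / s1 - 1.0 / s0, 0.0)
    # constant term (35), summed over the kept (quadratic) directions;
    # |sigma0^{-1} sigma1| = sqrt(s1)/sqrt(s0) coordinatewise.
    c = np.sum(np.where(keep,
                        a * (mu1 - mu0) ** 2 / 8.0
                        + 0.5 * np.log(np.sqrt(s1) / np.sqrt(s0)),
                        0.0))
    return a, c
```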
5. Application to medical data and the TIMIT database

We are going to study the performance of the given procedure. To that aim, we compare our method with the one given by Rossi and Villa [22] on the TIMIT database; we then test our procedure on medical data.

5.1. Comparison of our method with that of Rossi and Villa in the case of two-class classification

Rossi and Villa use a support vector machine (SVM) with different types of kernels. Recall that the SVM procedure constructs an affine frontier function $f$ given by $f(x) = \langle w, x\rangle_{\mathbb{R}^p} + b$, where $w$ and $b$ are solutions of an optimization problem of the following type:

$$\min_{w,b,\xi}\; \|w\|^2_{\mathbb{R}^p} + C\sum_{i=1}^{n}\xi_i \quad \text{under} \quad y_i\big(\langle w, x_i\rangle_{\mathbb{R}^p} + b\big) \geq 1 - \xi_i,\ \xi_i \geq 0,\ i = 1,\dots,n,$$

where $(x_i, y_i)_{i=1,\dots,n}$ are the couples (observation, label) of the learning set. The TIMIT database has notably been studied by Hastie et al. [18]. This database includes the phonemes "aa" and "ao" pronounced by many different persons. The corresponding records are curves observed at a fine enough sampling frequency; more precisely, one curve is a $p$-dimensional vector with $p = 256$. The learning set is composed of 519 "aa" and 759 "ao", and the test set is composed of 176 "aa" and 263 "ao". The curves $(x_i)_{i=1,\dots,519}$ are those which correspond to the pronunciation of the phoneme "aa", and the label $y_i = 0$ is associated to them; the label "1" is associated to the other curves, which correspond to the pronunciation of the phoneme "ao". The method of Rossi and Villa gives almost the same results as ours: 20% of classification mistakes.

5.2. Application to medical data

The medical problem is the following. In magnetic resonance imagery, one can obtain spectra characterizing tissues localized in some area of the brain, and these spectra can be used to characterize tumors. Unfortunately, even for a specialist, it is hard to define a good rule associating the name of a tumor with a given spectrum. Some spectra have been obtained on identified tumors, and we have been given these spectra. In order to have enough spectra in our learning set, we retained five groups of spectra (some of them regrouping several tumor types): the glioblastomas of the first type¹, the glioblastomas of the second type, the meningiomas, the metastases and the healthy tissues. The database provided by the specialists contains 21 glioblastomas of first type, 9 glioblastomas of second type, 16 meningiomas, 18 metastases and 9 healthy tissues, that is, 75 spectra sampled at 1024 points. We give the plot of the spectra considered in Figure 2. In order to test our procedure, we used a strategy of "leave-one-out" type. Figure 4 gives an experimental confirmation that, in the case of two-class classification, the chosen dimension is a good one. We tested different configurations, summarized in the table of Figure 3. The classification error rate is still significant, but the dimension reduction procedure provides a reduction of the error rate (recall that, in the case of 4 groups having equal a priori probability, a rule that would guess the type of tumor at random would have an error rate of 75%).

There are two reasons for this moderate performance. Roughly, theoretical physics predicts that a spectrum associated with a given tumor, for example a glioblastoma, is a random variable $y = (y_q)_{q=1,\dots,p}$ with quite small variability; hence we should be able to separate easily spectra associated with different groups. Unfortunately, in practice, the instrumentation leads to a measurement of spectra $z = (z_q)_{q=1,\dots,p}$ having complex values, for which there exists a sequence of angles $(\psi_q)_{q=1,\dots,p}$ such that

$$\forall q \in \{1,\dots,p\}, \quad y_q = \Re\big(e^{i\psi_q} z_q\big).$$

This sequence of angles is unknown. The theoretical physics of instrumentation shows that there are two reals $(a, b)$ such that

$$\forall q \in \{1,\dots,p\}, \quad \psi_q = aq + b.$$

Methods to obtain $a$ and $b$ are not sufficiently efficient, but this represents an active field of research. We chose to ask the physicians to change the phase manually in order to have a homogeneous real part of the spectra within a particular group, and we kept the real part of the spectra. The change of phase made by the physicians is not optimal, and the residual variation of the phase creates a certain disparity of observed spectra inside each group. This disparity can be seen in Figure 2.

¹ The group of glioblastomas has a too large variability; hence, we chose to divide it into two groups, first type and second type. These two types correspond to the presence of certain chemical substances.
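The linear phase model $\psi_q = aq + b$ is simple to apply once $(a, b)$ are known. The Python sketch below is a purely illustrative addition: the helper names and the grid-search criterion (minimising the energy left in the imaginary part) are our own placeholders, since, as noted above, reliable estimation of $(a, b)$ remains an open problem.

```python
import numpy as np

def dephase(z, a, b):
    """Recover y_q = Re(exp(i*psi_q) * z_q) under the linear phase
    model psi_q = a*q + b of the text."""
    q = np.arange(1, len(z) + 1)
    return np.real(np.exp(1j * (a * q + b)) * z)

def naive_phase_fit(z, n_grid=64):
    """A naive grid search for (a, b): pick the linear phase that
    leaves the least energy in the imaginary part of exp(i*psi)*z.
    Purely illustrative -- the criterion is our own placeholder."""
    q = np.arange(1, len(z) + 1)
    best = (0.0, 0.0, np.inf)
    for a in np.linspace(-np.pi / len(z), np.pi / len(z), n_grid):
        for b in np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False):
            resid = np.sum(np.imag(np.exp(1j * (a * q + b)) * z) ** 2)
            if resid < best[2]:
                best = (a, b, resid)
    return best[0], best[1]
```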
The incorporation of the phase into a classification algorithm, and the use of the complex nature of the data, will be the object of further studies. We note, however, that these phase problems in the Fourier domain can be translated interestingly into the temporal domain. Finally, the learning set is still too small; we hope to see its size increase in the forthcoming years.

[Figure 2: Spectra of the learning set — (a) 21 glioblastomas A; (b) 9 glioblastomas B; (c) 16 meningiomas; (d) 18 metastases; (e) 9 healthy tissues.]

Figure 3. Considered groups and error rate in each case.

    Groups considered                          error rate
    all                                        43%
    all except glioblastomas of first type     30%
    metastases and meningiomas                 5%

[Figure 4: Classification error rate (in a two-group problem: meningiomas versus glioblastomas of first type) as a function of the selected dimension; the dimension selected by our algorithm is marked by a black point.]

6. A more geometric alternative measure of error: the learning error

6.1. Definition and main result

We have already defined the learning error to be

$$R(g) = P\big(g(X) \neq Y \text{ and } g^*(X) = Y\big),$$

which, when $Y \sim \mathcal{U}(\{0,1\})$, equals

$$R(g) = \frac{1}{2}\Big( P_1\big(g(X) \neq 1 \text{ and } g^*(X) = 1\big) + P_0\big(g(X) \neq 0 \text{ and } g^*(X) = 0\big) \Big).$$

In other words, the learning error is the probability of misclassifying $X$ with $g$ while classifying it correctly with $g^*$. The point that motivates the use of this error is that

1. it leads to a simple geometric interpretation (mostly used in the two following sections) and hence it is used in all the further theoretical developments we give;
2. it is not sensitive to the possible indistinguishability of the distributions $P_0$ and $P_1$ and it leads to lower bounds as in Section 2 (see the remark below).

It follows easily from

$$C(g) - C(g^*) = P\big(g(X) \neq Y \text{ and } g^*(X) = Y\big) - P\big(g(X) = Y \text{ and } g^*(X) \neq Y\big)$$

that a classification rule $g$ satisfies

$$C(g) - C(g^*) \leq R(g). \qquad (36)$$

In the gaussian case studied in this article, we proved the following theorem, which gives a reverse inequality to (36).

Theorem 6.1. Let $g^*$ be the optimal rule in the binary classification problem (as presented in Section 1).

1. If $P_0$ and $P_1$ have the same covariance $C$ and respective means $\mu_1$ and $\mu_0$, then, for all measurable functions $g : \mathbb{R}^p \to \{0,1\}$, we have

$$C(g) - C(g^*) \geq \min\left( \frac{\sqrt{2\pi}}{2\cdot 16^2}\, \|C^{-1/2}m_{10}\|_{\mathbb{R}^p}\, e^{\frac{\|C^{-1/2}m_{10}\|^2_{\mathbb{R}^p}}{8}}\, R(g)^2,\; \frac{R(g)}{8} \right),$$

where $m_{10} = \mu_1 - \mu_0$.

2. Let $c_1 > 0$ and let $\mathcal{P}(c_1)$ be the set of couples $(P, Q)$ of gaussian measures on $\mathbb{R}^p$ such that $d_1(P,Q) > c_1$. If $(P_1, P_0) \in \mathcal{P}(c_1)$, then there exists a constant $c(c_1) > 0$ (that only depends on $c_1$) such that

$$C(g) - C(g^*) \geq \min\left( c(c_1)\, R(g)^8,\; \frac{R(g)}{8} \right).$$

Before we prove this result, let us comment on it.

Comments. Let us note that $C(g) - C(g^*) \leq \frac{1}{2}d_1(P_1, P_0)$. Hence, when $d_1(P_1,P_0)$ tends to $0$, the excess risk does not measure the difference between $g$ and $g^*$ but the proximity of $P_1$ and $P_0$. The learning error is not sensitive to this scale phenomenon, as witnessed by the following example.

Example 6.1. Let $\mu \geq 0$, $P_1 = \mathcal{N}(\mu, 1)$ and $P_0 = \mathcal{N}(-\mu, 1)$.
In this case, for all $a \in \mathbb{R}$,

$$R\big(1_{[a,\infty)}\big) = \frac{1}{2}\Big( P(0 < \xi + \mu < a) + P(a < \xi - \mu < 0) \Big),$$

where $\xi \sim \mathcal{N}(0,1)$; and $d_1(P_1,P_0) \to 0$ if and only if $\mu \to 0$, in which case

$$R\big(1_{[a,\infty)}\big) \to \frac{1}{2}P\big(\xi \in [0, |a|]\big).$$

Under these conditions, the learning error associated with $1_{[a,\infty)}$ tends to $0$ only if $a$ tends to $0$. In other words, when $\mu \to 0$, the learning error makes a difference between the rules $1_{[100,\infty)}$ and $g^* = 1_{[0,\infty)}$:

$$\inf_{\mu < 50} R\big(1_{[100,\infty)}\big) \geq \frac{1}{2}P\big(\xi \in [0, 50]\big) \approx \frac{1}{4},$$

while we have

$$C\big(1_{[100,\infty)}\big) - C(g^*) \leq \frac{1}{2}d_1(P_1, P_0) \leq \mu\sqrt{\frac{2}{\pi}}.$$

Remark 6.1. By definition, the excess risk $C(g) - C(g^*)$ is the quantity of interest. The problem with it is that it can give credit to any given procedure when $d_1(P_1,P_0)$ is sufficiently small; hence one cannot argue that a rule is never good according to the excess risk. In the preceding example, the procedure $g(x) = 1_{[100,\infty)}(x)$ is uniformly (on, say, $|\mu| \leq 50$) inconsistent according to the learning error, but not according to the excess risk.

The main consequence of this theorem has already been used in Section 2.2. From equation (36), if $(g_n)_{n\geq 0}$ is a sequence of classification rules such that $R(g_n)$ tends to zero, then $C(g_n) - C(g^*)$ tends to zero; Theorem 6.1 implies the converse result.
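The phenomenon described in Example 6.1 and Remark 6.1 can be reproduced with a few lines of simulation. The following Monte Carlo sketch is our own illustration: it estimates both the excess risk and the learning error of the rule $g = 1_{[a,\infty)}$ as $\mu$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

def risks(mu, a, n=200_000):
    """Monte Carlo estimates of the excess risk C(g) - C(g*) and of the
    learning error R(g) for Example 6.1, with P1 = N(mu,1),
    P0 = N(-mu,1), g = 1_{[a,inf)}, g* = 1_{[0,inf)} and equal weights."""
    x1 = rng.normal(mu, 1.0, n)          # draws from P1 (label 1)
    x0 = rng.normal(-mu, 1.0, n)         # draws from P0 (label 0)
    C_g = 0.5 * np.mean(x1 < a) + 0.5 * np.mean(x0 >= a)
    C_star = 0.5 * np.mean(x1 < 0.0) + 0.5 * np.mean(x0 >= 0.0)
    R_g = 0.5 * (np.mean((x1 < a) & (x1 >= 0.0))
                 + np.mean((x0 >= a) & (x0 < 0.0)))
    return C_g - C_star, R_g

# As mu -> 0, the excess risk of the bad rule a = 100 vanishes while
# its learning error stays near 1/4, exactly as the example predicts.
for mu in (1.0, 0.1, 0.01):
    print(mu, risks(mu, 100.0))
```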
6.2. Proof of Theorem 6.1

Proof. Let us take

$$K_1 = \{x \in \mathbb{R}^p : g(x) \neq 1 \text{ and } g^*(x) = 1\} \quad \text{and} \quad K_0 = \{x \in \mathbb{R}^p : g(x) \neq 0 \text{ and } g^*(x) = 0\}.$$

Thus $R(g) = \frac{1}{2}(P_1(K_1) + P_0(K_0))$, and at least one of the following two inequalities is satisfied (from the pigeonhole principle): $P_1(K_1) \geq R(g)$, $P_0(K_0) \geq R(g)$. Without loss of generality we suppose that $P_1(K_1) \geq R(g)$, which implies $P_1(K_1) + P_0(K_1) \geq R(g)$. Note that we have

$$C(g) - C(g^*) = P(g \neq Y) - P(g^* \neq Y) = \frac{1}{2}\big(P_1(K_1) - P_1(K_0)\big) + \frac{1}{2}\big(P_0(K_0) - P_0(K_1)\big)$$

(by conditioning with respect to $Y$)

$$= \frac{1}{2}\big((P_1 - P_0)(K_1) + (P_0 - P_1)(K_0)\big),$$

and, because $g^*(X) = 1$ if and only if $dP_1 \geq dP_0$ (by definition of $g^*$ and from the fact that $Y \sim \mathcal{U}(\{0,1\})$), we get

$$C(g) - C(g^*) = \frac{1}{2}\int 1_{K_1\cup K_0}\,|dP_1 - dP_0| \geq \frac{1}{2}\int 1_{K_1}\,|dP_1 - dP_0|. \qquad (37)$$

A straightforward calculation (see for example [15], Proposition 1.4.2, Chapter 1, Part I) leads to

$$\int_{\mathcal X} m(x)\,(dP_1 - dP_0) = 2\,E_P\Big[ m(X)\, e^{f_{10}(P,X)}\, \sinh\big(\tfrac{1}{2}L_{10}(X)\big) \Big]$$

for all measurable $m$, where $P$ is any probability measure that dominates $P_1$ and $P_0$, $f_{10}(P,x) = \frac{1}{2}\log\big(\frac{dP_1}{dP}\frac{dP_0}{dP}\big)$ and $L_{10}(x) = \log\big(\frac{dP_1}{dP_0}(x)\big)$. In particular,

$$d_1(P_1,P_0) = 2\,E_P\Big[ e^{f_{10}(P,X)}\, \big|\sinh\big(\tfrac{1}{2}L_{10}(X)\big)\big| \Big].$$

Also note that whenever $K \subset \{x \in \mathbb{R}^p : L_{10}(x) \geq 0\}$ we have

$$P_1(K) - P_0(K) = 2\,E_P\big[1_K\, e^{f_{10}(P,X)}\sinh(L_{10}(X)/2)\big],$$

and, as a consequence, (37) can be rewritten

$$C(g) - C(g^*) \geq E_P\big[1_{K_1}(X)\, e^{f_{10}(P,X)}\sinh(L_{10}(X)/2)\big]. \qquad (38)$$

It can also be shown that

$$P_1(K) + P_0(K) = 2\,E_P\big[1_K\, e^{f_{10}(P,X)}\cosh(L_{10}(X)/2)\big],$$

and consequently $P_1(K_1) + P_0(K_1) \geq R(g)$ is rewritten

$$2\,E_P\big[1_{K_1}(X)\, e^{f_{10}(P,X)}\cosh(L_{10}(X)/2)\big] \geq R(g). \qquad (39)$$

On the other hand, $d_1(P_1,P_0) \geq c_1$ leads to

$$2\,E_P\big[ e^{f_{10}(P,X)}|\sinh(L_{10}(X)/2)| \big] \geq c_1. \qquad (40)$$

In the rest of the proof, we shall combine (39) and (40) in order to lower bound the right member of (38). We remark that the left member of (39) and the right member of (38) only differ by a factor two and the replacement of a $\sinh$ by a $\cosh$; for our purpose, these two functions only differ fundamentally near zero. We decompose $K_1$ into two disjoint sets,

$$K_1^+ = \{x \in K_1 : L_{10}(x) \geq 2\} \quad \text{and} \quad K_1^- = \{x \in K_1 : L_{10}(x) \leq 2\},$$

and we define $A$ and $B$ by

$$\int_{K_1} e^{f_{10}(P,x)}\sinh(L_{10}(x)/2)\,P(dx) = \underbrace{\int_{K_1^+} e^{f_{10}(P,x)}\sinh(L_{10}(x)/2)\,P(dx)}_{A} + \underbrace{\int_{K_1^-} e^{f_{10}(P,x)}\sinh(L_{10}(x)/2)\,P(dx)}_{B}.$$

From (39) (and the pigeonhole principle), two cases can occur. In the first case

$$E_P\big[1_{K_1^+}(X)\, e^{f_{10}(P,X)}\cosh(L_{10}(X)/2)\big] \geq R(g)/4,$$

and in the second

$$E_P\big[1_{K_1^-}(X)\, e^{f_{10}(P,X)}\cosh(L_{10}(X)/2)\big] \geq R(g)/4. \qquad (41)$$

In the first case, because $X \in K_1^+$ implies $\sinh(L_{10}(X)/2) \geq \frac{1}{2}\cosh(L_{10}(X)/2)$ (since $\ln 6 \leq 2$), we have $A \geq R(g)/8$ and hence the desired result (it suffices to remark that $L_{10}(x) \geq 0$ if $x \in K_1$, which implies $B \geq 0$).

We shall now consider the case where (41) is satisfied. In this case, because $\cosh(x) \leq 2$ for all $|x| \leq 1$, we have

$$\int_{K_1^-} e^{f_{10}(P,x)}\,P(dx) \geq R(g)/8.$$

Hence the definition

$$d\nu = \frac{e^{f_{10}(P,x)}\,dP}{\int e^{f_{10}(P,x)}\,dP}$$

makes $\nu$ a probability measure on $\mathbb{R}^p$, and

$$\nu(K_1^-) \geq R(g)/8. \qquad (42)$$

On the other hand (see the definition of $f_{10}$),

$$\int e^{f_{10}(P,x)}\,dP = \int \sqrt{dP_1\,dP_0} = A_2(P_1,P_0)$$

($A_2(P_1,P_0)$ is the Hellinger affinity between $P_1$ and $P_0$), which leads to

$$B = A_2(P_1,P_0)\int_0^\infty \nu\Big( X \in K_1^- \text{ and } |\sinh(L_{10}(X)/2)| \geq t \Big)\,dt. \qquad (43)$$

We have

$$\nu(X \in K_1^-) = \nu\big(X \in K_1^- \text{ and } |\sinh(L_{10}/2)| \leq t\big) + \nu\big(X \in K_1^- \text{ and } |\sinh(L_{10}/2)| \geq t\big).$$

Let $g$ be the application which associates to $t > 0$ the real

$$g(t) = \sup_{(P_1,P_0)\in\mathcal{P}(c_1)} \nu\big(|\sinh(L_{10}(X)/2)| \leq t\big). \qquad (44)$$

For every $t > 0$ we have

$$\nu\big(X \in K_1^- \text{ and } |\sinh(L_{10}/2)| \geq t\big) = \nu(X \in K_1^-) - \nu\big(X \in K_1^- \text{ and } |\sinh(L_{10}/2)| \leq t\big).$$

We then deduce from this equality and from (43) that, for all $\epsilon \geq 0$,

$$B \geq A_2(P_1,P_0)\int_0^\epsilon \nu\big(X \in K_1^- \text{ and } |\sinh(L_{10}(X)/2)| \geq t\big)\,dt$$
$$\geq \epsilon\, R(g)/8 - A_2(P_1,P_0)\int_0^\epsilon \nu\big(X \in K_1^- \text{ and } |\sinh(L_{10}/2)| \leq t\big)\,dt,$$

where this last inequality results from (42). The rest of the proof relies on the following lemma.
Lemma 6.1.

1. The application $g$ defined by (44) satisfies $g(t) \leq \frac{c(c_1)}{A_2(P_1,P_0)}\,t^{1/7}$, where $c(c_1)$ is a positive constant that only depends on $c_1$.
2. In the case where $C_1 = C_0 = C$, we have

$$\nu\big(X \in K_1^- \text{ and } |\sinh(L_{10}/2)| \leq t\big) \leq \frac{4t}{\sqrt{2\pi}\,\|C^{-1/2}m_{10}\|_{\mathbb{R}^p}}.$$

We prove this result at the end of the current proof. Let us note that it is equation (40) that plays a crucial role in the proof. In the case where $C_1 \neq C_0$,

$$A_2(P_1,P_0)\int_0^\epsilon \nu\big(X \in K_1^- \text{ and } |\sinh(L_{10}/2)| \leq t\big)\,dt \leq \tilde c(c_1)\,\epsilon^{1+1/7},$$

and the choice $\epsilon = \big(\frac{R(g)}{16\,\tilde c(c_1)}\big)^7$ leads to the desired result. In the case where $C_1 = C_0$,

$$\int_0^\epsilon \nu\big(X \in K_1^- \text{ and } |\sinh(L_{10}/2)| \leq t\big)\,dt \leq \frac{2\epsilon^2}{\sqrt{2\pi}\,\|C^{-1/2}m_{10}\|_{\mathbb{R}^p}},$$

and the choice $\epsilon = \frac{\sqrt{2\pi}\,\|C^{-1/2}m_{10}\|_{\mathbb{R}^p}\,R(g)}{32\,A_2(P_1,P_0)}$ leads to the desired result. Indeed, in the case where $C_1 = C_0$, a classical calculation gives

$$A_2(P_1,P_0) = \int e^{f_{10}(P,x)}\,dP = e^{-\frac{\|C^{-1/2}(\mu_1-\mu_0)\|^2_{\mathbb{R}^p}}{8}}.$$

Let us now prove Lemma 6.1.

Proof. Let us begin with point 2. It is sufficient to notice that if $P_{1|0}$ is a gaussian measure with covariance $C$ and mean $s_{10}$, and if $X$ is a random variable drawn from $P_{1|0}$, then

$$e^{f_{10}(P_{1|0},X)} = e^{-\frac{\|C^{-1/2}(\mu_1-\mu_0)\|^2_{\mathbb{R}^p}}{8}} \quad \text{and, in distribution,} \quad L_{10}(X) \sim \mathcal{N}(0, \sigma^2), \quad \text{where } \sigma^2 = \|C^{-1/2}(\mu_1-\mu_0)\|^2_{\mathbb{R}^p}.$$

Hence we get

$$\nu\big(|\sinh(L_{10}(X)/2)| \leq t\big) = P\big(|\mathcal{N}(0,\sigma^2)| \leq 2\,\mathrm{arcsinh}(t)\big) \leq \frac{4\,\mathrm{arcsinh}(t)}{\sqrt{2\pi}\,\sigma} \leq \frac{4t}{\sqrt{2\pi}\,\sigma}.$$

Let us now prove point 1 of the lemma. We have

$$\nu\big(|\sinh(L_{10}(X)/2)| \leq t\big) \leq \frac{1}{A_2(P_1,P_0)}\int 1_{|\sinh(L_{10}(x)/2)|\leq t}\Big(\frac{dP_1}{dP_0}\Big)^{1/2}\,dP_0 \leq \frac{P_0^{1/2}\big(|L_{10}(X)/2| \leq t\big)}{A_2(P_1,P_0)}$$

(from the Cauchy–Schwarz inequality and $\sinh(y) \geq y$ for $y \geq 0$). Finally, we conclude from point 2 of Theorem 8.4, given in Section 8, whose hypothesis is satisfied since

$$c_1 \leq d_1(P_1,P_0) \leq 2\sqrt{K(P_0,P_1)} \quad \text{(from Pinsker's inequality, see [24])} \quad \leq 2\,\|L_{10}\|^{1/2}_{L^2(P_0)} \quad \text{(from the Cauchy–Schwarz inequality).}$$

7. A geometrical analysis of LDA to solve Problem 1

7.1. Introduction and first result

Let $\mathcal X$ be a separable Banach space, endowed with its Borel $\sigma$-field and a gaussian measure $\gamma$. Throughout the next sections, we will associate to any measurable $f$ the set

$$V_f = \{x \in \mathcal X : f(x) \geq 0\}. \qquad (45)$$

In this section $\mathcal X = \mathbb{R}^p$. Recall that $\alpha$ (defined by (5)) is the angle, according to the geometry of $L^2(\gamma_C)$, between $F_{10}$ and $\hat F_{10}$; this quantity will play a very important role in the whole section. In order to shorten notation, we will replace $R(1_{\hat V})$ by $R$ in this section and those that follow. Recall that

$$F_{10} = C^{-1}m_{10}, \quad m_{10} = \mu_1 - \mu_0, \quad s_{10} = \frac{\mu_1 + \mu_0}{2},$$

where $\mu_1$ (resp. $\mu_0$) and $C$ are the mean and (common) covariance of the distribution $P_1 = \gamma_{C,\mu_1}$ (resp. $P_0 = \gamma_{C,\mu_0}$) of the data from group 1 (resp. 0). With the notation (45), the optimal rule and the plug-in rule can be rewritten with

$$V = V_{\langle F_{10},\, x - s_{10}\rangle_{\mathbb{R}^p}} \quad \text{and} \quad \hat V = V_{\langle \hat F_{10},\, x - \hat s_{10}\rangle_{\mathbb{R}^p}}.$$

For the purpose of this section, let us note that the learning error studied in the preceding section and introduced by equation (8) is, in the case of LDA,

$$R = \frac{1}{2}\Big( \gamma_{C,\mu_0}\big(X \in \hat V\setminus V\big) + \gamma_{C,\mu_1}\big(X \in V\setminus\hat V\big) \Big),$$
which implies

$$R = \frac{1}{2}\left( \gamma_{C,s_{10}}\Big(X \in \big(\hat V\setminus V\big) - \frac{m_{10}}{2}\Big) + \gamma_{C,s_{10}}\Big(X \in \big(V\setminus\hat V\big) + \frac{m_{10}}{2}\Big) \right). \qquad (46)$$

The problem now becomes that of measuring two areas of $\mathbb{R}^p$ with $\gamma_{C,s_{10}}$. Standard properties of the gaussian measure now lead to

$$R = \frac{1}{2}\gamma_p\Big( \big(V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\big) - \frac{G_p}{2} \Big) + \frac{1}{2}\gamma_p\Big( \big(V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\setminus V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\big) + \frac{G_p}{2} \Big), \qquad (47)$$

where

$$d_0 = \langle \hat F_{10},\, \hat s_{10} - s_{10}\rangle_{\mathbb{R}^p}, \quad G_p = C^{1/2}F_{10} = C^{-1/2}m_{10}, \quad \hat G_p = C^{1/2}\hat F_{10} \quad \text{and} \quad e_p = C^{1/2}(\hat F_{10} - F_{10}). \qquad (48)$$

One may note that the change of geometry implies

$$\|G_p\|_{\mathbb{R}^p} = \|F_{10}\|_{L^2(\gamma_C)}, \quad \|\hat G_p\|_{\mathbb{R}^p} = \|\hat F_{10}\|_{L^2(\gamma_C)}, \quad \|e_p\|_{\mathbb{R}^p} = \|F_{10} - \hat F_{10}\|_{L^2(\gamma_C)}, \qquad (49)$$

and $\alpha$ (defined by equation (5)) is the angle, in the geometry of $\mathbb{R}^p$, between $G_p$ and $\hat G_p$. The following theorem gives lower and upper bounds on the learning error $R$ as functions of (among others) $\alpha$. Its proof relies on the fact that $R$ is the measure by $\gamma_2$ of two "simple" areas of $\mathbb{R}^2$ (see Figure 5), and on four elementary properties of the gaussian measure to be given later (see Figure 6).

Theorem 7.1. Let $d_0 = \langle \hat F_{10},\, \hat s_{10} - s_{10}\rangle_{\mathbb{R}^p}$. The learning error $R$, as a function of $\alpha$, satisfies

$$\forall\alpha \in [-\pi,\pi], \quad R(\alpha) = R(-\alpha).$$

It also satisfies the following inequalities. If $\alpha \geq \pi/2$, then $R \geq \frac{1}{2}$. If $0 \leq \alpha < \pi/2$, then $R \leq \frac{1}{2}$, and we distinguish between four cases.

1. If $|d_0| \leq \frac{1}{4}|\langle F_{10}, \hat F_{10}\rangle_{L^2(\gamma_C)}|$, we have

$$e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{8}}\cdot\frac{1}{4}\left(\frac{\alpha}{2\pi} + \frac{1}{2}\gamma_1\Big(\Big[0;\; \frac{|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\Big]\Big)\right) \leq R, \qquad (50)$$

and

$$R \leq e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}\cos^2(\alpha)}{32}}\left(\frac{\alpha}{2\pi} + \gamma_1\Big(\Big[0;\; \frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\Big]\Big)\right). \qquad (51)$$

2. If $\frac{1}{4}|\langle F_{10}, \hat F_{10}\rangle_{L^2(\gamma_C)}| < |d_0| \leq \frac{1}{2}|\langle F_{10}, \hat F_{10}\rangle_{L^2(\gamma_C)}|$, we have

$$e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{2}}\cdot\frac{1}{4}\left(\frac{1}{2}\gamma_1\Big(\Big[0;\; \frac{\|F_{10}\|_{L^2(\gamma_C)}}{4}\Big]\Big) + \frac{\alpha}{2\pi}\right) \leq R, \qquad (52)$$

$$R \leq \frac{\alpha}{2\pi} + \gamma_1\Big(\Big[0;\; \frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\Big]\Big). \qquad (53)$$

3. If $\frac{1}{2}|\langle F_{10}, \hat F_{10}\rangle_{L^2(\gamma_C)}| < |d_0|$, we have

$$\frac{\alpha}{4\pi} + \frac{1}{4}\gamma_1\Big(\Big[0;\; \frac{\|F_{10}\|_{L^2(\gamma_C)}}{2}\Big]\Big) \leq R, \qquad (54)$$

$$R \leq \frac{\alpha}{2\pi} + \gamma_1\Big(\Big[0;\; \frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\Big]\Big).$$

4. If $|d_0| = 0$, then we have

$$e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{8}}\,\frac{\alpha}{2\pi} \leq R. \qquad (55)$$
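Before turning to the proof, the simplest case of the theorem ($d_0 = 0$, case 4) can be sanity-checked numerically. The sketch below is our own illustration: it works in $\mathbb{R}^2$ with identity covariance (so that $\|F_{10}\|_{L^2(\gamma_C)} = \|m_{10}\|_{\mathbb{R}^2}$) and compares a Monte Carlo estimate of $R$ with the lower bound (55).

```python
import numpy as np

rng = np.random.default_rng(1)

def check_case4(norm_F, alpha, n=400_000):
    """Monte Carlo check of case 4 of Theorem 7.1 (d_0 = 0) in R^2 with
    C = I: classes N(+m/2, I) and N(-m/2, I) with ||m|| = norm_F, the
    plug-in direction making an angle alpha with the optimal one."""
    m = np.array([norm_F, 0.0])
    w_hat = np.array([np.cos(alpha), np.sin(alpha)])
    x1 = rng.normal(size=(n, 2)) + m / 2.0           # draws from P1
    x0 = rng.normal(size=(n, 2)) - m / 2.0           # draws from P0
    star1, star0 = x1[:, 0] >= 0.0, x0[:, 0] >= 0.0  # optimal rule g*
    hat1, hat0 = x1 @ w_hat >= 0.0, x0 @ w_hat >= 0.0
    R = 0.5 * (np.mean(~hat1 & star1) + np.mean(hat0 & ~star0))
    lower = np.exp(-norm_F ** 2 / 8.0) * alpha / (2.0 * np.pi)  # bound (55)
    return R, lower

print(check_case4(2.0, 0.3))   # the estimate of R should dominate the bound
```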
Proof. Step 1: the problem is two-dimensional. We shall prove the equality

$$R = \frac{1}{2}\gamma_2\big(Q_a^- - y^+\big) + \frac{1}{2}\gamma_2\big(Q_b^- - y^-\big), \qquad (56)$$

where $Q_a^-$, $Q_b^-$, $y^+$ and $y^-$ will be defined below: $Q_a^-$ and $Q_b^-$ are two areas of $\mathbb{R}^2$, $y^+$ and $y^-$ are two vectors of $\mathbb{R}^2$, and all these quantities are illustrated in Figure 5. In the following, we shall use the notation $\tilde e_p = \Pi_{G_p^\perp}e_p$ for the orthogonal projection of $e_p$ on the orthogonal complement of $G_p$ in $\mathbb{R}^p$. We will suppose that $\|\tilde e_p\|_{\mathbb{R}^p} \neq 0$, since the part of the result concerning $\|\tilde e_p\|_{\mathbb{R}^p} = 0$ is straightforward. The calculation of $R$ is intrinsically a calculation in the two-dimensional space $M_p$ spanned by $G_p$ and $\tilde e_p$. In order to make this fact clear, note that for all $z_1 \in M_p$, $z_2 \in M_p^\perp$ we have

$$\big(V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\setminus V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\big) + z_1 + z_2 = \big(V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\setminus V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\big) + z_1$$

and

$$\big(V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\big) + z_1 + z_2 = \big(V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}}\setminus V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\big) + z_1$$

(here $M_p^\perp$ is the orthogonal complement of $M_p$ in $\mathbb{R}^p$). By the tensorial property of $\gamma_p$ and equation (47), we finally get

$$R = \frac{1}{2}\gamma_2\Big( M_p\cap\big(V_{\langle\cdot,G_p+e_p\rangle+d_0}\setminus V_{\langle\cdot,G_p\rangle}\big) - \frac{G_p}{2} \Big) \qquad (57)$$
$$\quad + \frac{1}{2}\gamma_2\Big( M_p\cap\big(V_{\langle\cdot,G_p\rangle}\setminus V_{\langle\cdot,G_p+e_p\rangle+d_0}\big) + \frac{G_p}{2} \Big). \qquad (58)$$

In the sequel we identify $M_p$ with $\mathbb{R}^2$; $D$ and $\hat D$ will be the straight lines of $M_p$ with respective equations $\langle\cdot, G_p\rangle_{\mathbb{R}^p} = 0$ and $\langle\cdot, G_p+e_p\rangle_{\mathbb{R}^p} + d_0 = 0$. It can easily be shown that these lines intersect at $a_p$ given by

$$a_p = -\,\frac{d_0\,\tilde e_p}{\|\tilde e_p\|^2_{\mathbb{R}^p}}. \qquad (59)$$

Hence

$$V_{\langle\cdot,G_p\rangle_{\mathbb{R}^p}} = V_{\langle\cdot - a_p,\,G_p\rangle_{\mathbb{R}^p}} \quad \text{and} \quad V_{\langle\cdot,G_p+e_p\rangle_{\mathbb{R}^p}+d_0} = V_{\langle\cdot - a_p,\,G_p+e_p\rangle_{\mathbb{R}^p}},$$

and, with the same calculation used to obtain (47), equation (57) becomes

$$R = \frac{1}{2}\gamma_2\Big( M_p\cap\big(V_{\langle\cdot,G_p+e_p\rangle}\setminus V_{\langle\cdot,G_p\rangle}\big) - \frac{G_p}{2} + a_p \Big) \qquad (60)$$
$$\quad + \frac{1}{2}\gamma_2\Big( M_p\cap\big(V_{\langle\cdot,G_p\rangle}\setminus V_{\langle\cdot,G_p+e_p\rangle}\big) + \frac{G_p}{2} + a_p \Big). \qquad (61)$$

Notice that, for reasons of symmetry, we can assume $d_0 \geq 0$ without loss of generality. In the sequel, we shall use the notation

$$y^+ = \frac{G_p}{2} - a_p \quad \text{and} \quad y^- = -\frac{G_p}{2} - a_p; \qquad (62)$$

the coordinates of $y^+$ in the orthonormal coordinate system obtained from the orthogonal coordinate system $(0, \tilde e_p, G_p)$ will be denoted $(y_h, y_v)$ and are equal to $\big(\frac{d_0}{\|\tilde e_p\|_{\mathbb{R}^p}}, \frac{\|G_p\|_{\mathbb{R}^p}}{2}\big)$. We shall also write

$$Q_a^- = M_p\cap\big(V_{\langle\cdot,G_p+e_p\rangle}\setminus V_{\langle\cdot,G_p\rangle}\big) \quad \text{and} \quad Q_b^- = M_p\cap\big(V_{\langle\cdot,G_p\rangle}\setminus V_{\langle\cdot,G_p+e_p\rangle}\big). \qquad (63)$$

[Figure 5: definition of $Q_a^-$, $Q_b^-$, $Q^+$ and $Q^\epsilon$ for Lemma 7.1.]

We finally derive equation (56). From Figure 5, we notice that replacing $\alpha$ by $-\alpha$ does not change $R$; that if $0 < \alpha \leq \pi/2$ then $R \leq \frac{1}{2}$; and that if $\pi \geq \alpha \geq \pi/2$ then $R \geq \frac{1}{2}$. Hence we now suppose that $\alpha \in [0, \pi/2]$.

Step 2. The rest of the proof relies on the following lemma.

Lemma 7.1. Let $Q^+$ and $Q^\epsilon$ be defined by Figure 5, forming, with $Q_a^-$ and $Q_b^-$, a partition of $\mathbb{R}^2$. Let $u = \tan(\alpha)\,y_h$. We then have:

• If $y^- \in Q^-$, then

$$\frac{1}{2}\gamma_1([0;|y_v|]) + \frac{\alpha}{2\pi} + \gamma_1\Big(\Big[0, \frac{y_v}{2}\Big]\Big)\,\gamma_1\Big(\Big[0;\; \Big|\frac{y_v}{2}\,\frac{\cos(\alpha)}{\sin(\alpha)}\Big|\Big]\Big) \leq \gamma_2(Q_b^- - y^-),$$
$$\gamma_2(Q_b^- - y^-) \leq \frac{\alpha}{2\pi} + \gamma_1\big([0;\, |u|(1+\tan(\alpha))]\big). \qquad (64)$$

• If $y^- \in Q^+$, then

$$e^{-\frac{y_v^2}{2}}\cdot\frac{1}{2}\Big(\frac{1}{2}\gamma_1([0;|u|]) + \frac{\alpha}{2\pi}\Big) \leq \gamma_2(Q_b^- - y^-),$$
$$\gamma_2(Q_b^- - y^-) \leq e^{-\frac{\epsilon^2 y_v^2\cos^2(\alpha)}{2(1+\epsilon)^2}}\Big(\gamma_1\big([0;\,(1+\tan(\alpha))|u|]\big) + \frac{\alpha}{2\pi}\Big). \qquad (65)$$

• If $y^- \in Q^\epsilon$, then

$$e^{-\frac{(1+\epsilon)^2|u|^2}{2}}\cdot\frac{1}{2}\Big(\frac{1}{2}\gamma_1([0;|u|]) + \frac{\alpha}{2\pi}\Big) \leq \gamma_2(Q_b^- - y^-),$$
$$\gamma_2(Q_b^- - y^-) \leq \gamma_1\big([0;\,(1+\tan(\alpha))|u|]\big) + \frac{\alpha}{2\pi}. \qquad (66)$$

• Concerning $\gamma_2(Q_a^- - y^+)$, we have

$$\gamma_2(Q_a^- - y^+) \leq \gamma_2(Q_b^- - y^-). \qquad (67)$$

• Finally, if $y_h = 0$, we have

$$e^{-\frac{y_v^2}{2}}\,\frac{\alpha}{2\pi} \leq \gamma_2(Q_a^- - y^+) = \gamma_2(Q_b^- - y^-). \qquad (68)$$

This lemma will be proven in Subsection 7.3; let us first see how it implies Theorem 7.1.
Fix $\epsilon = 1$ for the rest of the proof (other values of $\epsilon$ will help us in the proof of Theorem 2.2). Equation (67) of the lemma implies that

$$\frac{1}{2}\gamma_2(Q_b^- - y^-) \leq R \leq \gamma_2(Q_b^- - y^-).$$

Recall that $(y_h, y_v)$ has been defined, following equation (62), as the coordinates of $y^+$, and that $u = \tan(\alpha)\,y_h$. A simple calculation leads to

$$u = \frac{|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}} \quad \text{and} \quad y_v^2 = \frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{4}.$$

If $\frac{1}{2}|\langle G_p, \hat G_p\rangle_{\mathbb{R}^p}| < |d_0|$, we have, in the preceding lemma, $y^- \in Q^-$, and

$$\frac{1}{4}\gamma_1\Big(\Big[0;\; \frac{\tan(\alpha)\,\|F_{10}\|_{L^2(\gamma_C)}}{2}\Big]\Big) + \frac{\alpha}{4\pi} \leq R,$$
$$R \leq \frac{\alpha}{2\pi} + \gamma_1\Big(\Big[0;\; \frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\Big]\Big).$$

The case where $|d_0| < \frac{1}{4}|\langle G_p, \hat G_p\rangle_{\mathbb{R}^p}|$ (which means that $2|u| < |y_v|$) is the case where $y^- \in Q^+$, and we then have

$$e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{8}}\cdot\frac{1}{4}\left(\frac{\alpha}{2\pi} + \frac{1}{2}\gamma_1\Big(\Big[0;\; \frac{|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\Big]\Big)\right) \leq R,$$

and

$$R \leq e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}\cos^2(\alpha)}{32}}\left(\frac{\alpha}{2\pi} + \gamma_1\Big(\Big[0;\; \frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\Big]\Big)\right).$$

If $\frac{1}{4}|\langle G_p, \hat G_p\rangle_{\mathbb{R}^p}| < |d_0| < \frac{1}{2}|\langle G_p, \hat G_p\rangle_{\mathbb{R}^p}|$ (which means that $2|u| > |y_v| > |u|$), we have, in the preceding lemma, $y^- \in Q^\epsilon$ ($\epsilon = 1$), and since in this case $|y_v| > |u| > |y_v|/2$, we get

$$e^{-\frac{\|F_{10}\|^2_{L^2(\gamma_C)}}{2}}\cdot\frac{1}{4}\left(\frac{1}{2}\gamma_1\Big(\Big[0;\; \frac{\|F_{10}\|_{L^2(\gamma_C)}}{4}\Big]\Big) + \frac{\alpha}{2\pi}\right) \leq R$$

and

$$R \leq \frac{\alpha}{2\pi} + \gamma_1\Big(\Big[0;\; \frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\Big]\Big).$$

This ends the proof of Theorem 7.1.

7.2. Proof of Theorem 2.2

Theorem 2.2 is also a consequence of the preceding lemma, which we will now use while tuning the value of $\epsilon$; we use without restating them the definitions given before the lemma. Let us assume that $\frac{2|d_0|}{|\langle F_{10},\hat F_{10}\rangle_{L^2(\gamma_C)}|}$ has a limit inferior $a < 1$. Then there exists $\epsilon > 0$ such that $y^+$ and $y^-$ (defined by (62)) belong to $Q^+$ (for $\|F_{10}\|_{L^2}\cos(\alpha)$ large enough), and equation (65) then implies that

$$R \leq e^{-\frac{\epsilon^2\|F_{10}\|^2_{L^2}\cos^2(\alpha)}{2(1+\epsilon)^2}}\Big(1 + \frac{|\alpha|}{2\pi}\Big),$$

so that $R$ tends to $0$ when $\|F_{10}\|^2_{L^2}\cos^2(\alpha)$ tends to infinity. If now $\frac{2|d_0|}{|\langle F_{10},\hat F_{10}\rangle_{L^2(\gamma_C)}|}$ tends to $a > 1$, then $y^+$ or $y^-$ (given by (62)) belongs to $Q^-$ (for $\|F_{10}\|_{L^2}\cos(\alpha)$ large enough). And since in this case equation (64) leads to

$$R \geq \frac{1}{4}\left( \frac{1}{2}\gamma_1\big([0;\, \|F_{10}\|_{L^2}/2]\big) + \gamma_1\Big(\Big[0;\; \frac{\|F_{10}\|_{L^2}\cos(\alpha)}{4\sin(\alpha)}\Big]\Big)\,\gamma_1\big([0;\, \|F_{10}\|_{L^2}/4]\big) + \frac{\alpha}{2\pi} \right), \qquad (69)$$

we obtain the desired result by letting $\|F_{10}\|_{L^2}$ tend to infinity. One has to observe that $\alpha$ depends on $\|F_{10}\|_{L^2}$, and that the limit values $\alpha = \pi/2$ and $\alpha = 0$ require the use of different terms in inequality (69). This ends the proof of Theorem 2.2.

7.3. Proof of Lemma 7.1

This proof is the central part of this section. It is mostly geometrical, and requires only the following four properties (illustrated by Figure 6):

• Property 1. If $A \subset \mathbb{R}^2$ lies between two half-lines $(0,u)$ and $(0,v)$ such that $\mathrm{Angle}(u,v) = \alpha$, then $\gamma_2(A) = \frac{\alpha}{2\pi}$. This follows directly from the rotational invariance of the gaussian measure.
Such an area will be called an angular portion of size $\alpha$ and centre $0$.

• Properties 2 and 3. Let $y \in \mathbb{R}^2$, $D$ a straight line of $\mathbb{R}^2$, $b$ the orthogonal projection of $y$ on $D$ and $h$ the distance from $y$ to $D$. If $A \subset \mathbb{R}^2$ is included in the half-plane delimited by $D$ that does not contain $y$, then $\gamma_2(A - y) \leq e^{-h^2/2}\gamma_2(A - b)$: this is Property 2. If $A \subset \mathbb{R}^2$ is included in the half-plane delimited by $D$ that contains $y$, then $\gamma_2(A - y) \geq e^{-h^2/2}\gamma_2(A - b)$: this is Property 3.

• Property 4. If $A = [0; d]\times[0; \infty[$ (see Figure 6), then $\gamma_2(A) = \frac{1}{2}\gamma_1([0; d])$. Such a rectangle will be called an infinite rectangle of origin $0$ and height $d$.

We will denote by $q$ and $\hat q$ the orthogonal projections of $y^-$ on $D$ and $\hat D$. Properties 2 and 3 are well known, but for the sake of completeness we recall their proof. It suffices to note that

$$\gamma_2(A - y) = \int_{x\in A}\frac{1}{2\pi}e^{-\frac{\|x-y\|^2_{\mathbb{R}^2}}{2}}\,dx = e^{-\frac{h^2}{2}}\int_{x\in A}\frac{1}{2\pi}e^{-\frac{\|x-b\|^2_{\mathbb{R}^2}}{2}}\,e^{\langle x-b,\,y-b\rangle_{\mathbb{R}^2}}\,dx,$$

and that $x \in A$ implies $\langle x-b, y-b\rangle_{\mathbb{R}^2} \leq 0$ for Property 2 and $\langle x-b, y-b\rangle_{\mathbb{R}^2} \geq 0$ for Property 3.

[Figure 7: illustration of the proof.]
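Properties 1 and 4 are straightforward to confirm by simulation; the snippet below is an added numerical check of both, not part of the original argument.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(size=(1_000_000, 2))

# Property 1: an angular portion of size alpha has gamma_2-measure alpha/(2*pi).
alpha = 0.7
theta = np.arctan2(x[:, 1], x[:, 0])
print(np.mean((theta >= 0.0) & (theta < alpha)), alpha / (2.0 * np.pi))

# Property 4: the infinite rectangle [0, d] x [0, inf) has gamma_2-measure
# (1/2) * gamma_1([0, d]).
d = 1.3
print(np.mean((x[:, 0] >= 0.0) & (x[:, 0] <= d) & (x[:, 1] >= 0.0)),
      0.5 * (norm.cdf(d) - 0.5))
```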
We are now going to distinguish between a number of cases and, in each of them, use the announced properties. First note that the inequality concerning $y^+$ is trivial. Figures 7 and 5 will be useful in what follows.

Case $y^- \in Q_b^-$. In this case $|y_v| \leq |u|$. One can include in $Q_b^-$ the disjoint union of an infinite rectangle of origin $y^-$ and height $|y_v|$; an angular portion of size $\alpha$ and centre $y^-$; and a rectangle with vertex $y^-$, height $|y_v|/2$ and length $\big|\frac{y_v}{2}\frac{\cos(\alpha)}{\sin(\alpha)}\big|$. Using Properties 4 and 1, we then get

$$\frac{1}{2}\gamma_1([0;|y_v|]) + \frac{\alpha}{2\pi} + \gamma_1\Big(\Big[0, \frac{y_v}{2}\Big]\Big)\,\gamma_1\Big(\Big[0;\; \Big|\frac{y_v}{2}\,\frac{\cos(\alpha)}{\sin(\alpha)}\Big|\Big]\Big) \leq \gamma_2(Q_b^- - y^-). \qquad (70)$$

On the other hand, $Q_b^-$ can be included in the disjoint union of an angular portion with centre $y^-$, of two infinite rectangles with height at most $|u|\tan(\alpha)$, and of two infinite rectangles with height at most $|u|$. Hence Properties 1 and 4 imply

$$\gamma_2(Q_b^- - y^-) \leq \frac{\alpha}{2\pi} + \gamma_1\big([0;\, |u|(1+\tan(\alpha))]\big). \qquad (71)$$

Case $y^- \in Q^+$. In this case $|y_v| > (1+\epsilon)|u|$; $y^-$ is at distance $|y_v|$ from $D$ and at distance $(|y_v| - |u|)\cos(\alpha) \geq \frac{\epsilon}{1+\epsilon}|y_v|\cos(\alpha)$ from $\hat D$. Properties 2 and 3 imply

$$e^{-\frac{y_v^2}{2}}\gamma_2(Q_b^- - q) \leq \gamma_2(Q_b^- - y^-) \leq e^{-\frac{\epsilon^2 y_v^2\cos^2(\alpha)}{2(1+\epsilon)^2}}\gamma_2(Q_b^- - \hat q). \qquad (72)$$

One can include in $Q_b^-$ an angular portion of size $\alpha$ with centre $q$, or an infinite rectangle of origin $q$ and height $|u|$. Hence Properties 1 and 4 imply, with (72) and the fact that $\max(a,b) \geq \frac{a+b}{2}$,

$$\frac{1}{2}\Big(\frac{1}{2}\gamma_1([0;|u|]) + \frac{\alpha}{2\pi}\Big) \leq \gamma_2(Q_b^- - q).$$

The set $Q_b^-$ can be included in the union of an angular portion of size $\alpha$ centred at $\hat q$ and of two infinite rectangles of origin $\hat q$ and height $|u|(1+\tan(\alpha))$. Hence Properties 1 and 4, together with (72) and $\max(a,b) \geq \frac{a+b}{2}$, imply

$$e^{-\frac{y_v^2}{2}}\cdot\frac{1}{2}\Big(\frac{1}{2}\gamma_1([0;|u|]) + \frac{\alpha}{2\pi}\Big) \leq \gamma_2(Q_b^- - y^-), \qquad (73)$$
$$\gamma_2(Q_b^- - y^-) \leq e^{-\frac{\epsilon^2 y_v^2\cos^2(\alpha)}{2(1+\epsilon)^2}}\Big(\gamma_1\big([0;\, |u|(1+\tan(\alpha))]\big) + \frac{\alpha}{2\pi}\Big).$$

Case $y^- \in Q^\epsilon$. In this case $(1+\epsilon)|u| > |y_v| > |u|$; $y^-$ is at distance $|y_v| \leq (1+\epsilon)|u|$ from $D$ and at distance $(|y_v| - |u|)\cos(\alpha) \geq 0$ from $\hat D$. Properties 2 and 3 imply

$$e^{-\frac{(1+\epsilon)^2|u|^2}{2}}\gamma_2(Q_b^- - q) \leq \gamma_2(Q_b^- - y^-) \leq \gamma_2(Q_b^- - \hat q), \qquad (74)$$

from which we deduce, in the same way as in the preceding paragraph,

$$e^{-\frac{(1+\epsilon)^2|u|^2}{2}}\cdot\frac{1}{2}\Big(\frac{1}{2}\gamma_1([0;|u|]) + \frac{\alpha}{2\pi}\Big) \leq \gamma_2(Q_b^- - y^-), \qquad (75)$$
$$\gamma_2(Q_b^- - y^-) \leq \gamma_1\big([0;\, |u|(1+\tan(\alpha))]\big) + \frac{\alpha}{2\pi}.$$

This ends the proof of the lemma.

Remark 7.1 (On log-concave measures). It is natural to ask which type of probability measure satisfies the four properties used. Concerning Property 2, it is possible to consider measures that are not gaussian. Suppose that $\mu$ is a probability measure on $\mathbb{R}^p$ with positive density $ae^{-\phi}$ with respect to the Lebesgue measure, where $\phi$ is strictly convex in the sense that there exists $c > 0$ such that, for all $x, y \in \mathbb{R}^p$,

$$\phi(x) + \phi(y) - 2\phi\Big(\frac{x+y}{2}\Big) \geq \frac{c}{2}\|x - y\|^2_{\mathbb{R}^p}, \qquad (76)$$

$\phi(0) = 0 = \mathrm{Arg}\inf\phi$, $a$ is a positive constant and $\phi$ is radial: there exists a function $\psi$ from $\mathbb{R}$ to $\mathbb{R}$ such that $\phi(x) = \psi(\|x\|)$. Let $y \in \mathbb{R}^p$, $D$ a hyperplane of $\mathbb{R}^p$, $b$ the orthogonal projection of $y$ on $D$, $h$ the distance from $y$ to $D$, and $A \subset \mathbb{R}^p$ included in the half-space delimited by $D$ which does not contain $y$. One can show (see Proposition 3.3.1, p. 126 in [15]) that

$$\mu(A - y) \leq e^{-\frac{ch^2}{2}}\mu(A - b).$$

7.4. Proof of Theorem 2.1

Proof. The second equation of the theorem results directly from equation (51) in Theorem 7.1. To show the first equation of the theorem, we distinguish four cases. Case number 4 is the important one, relying on Theorem 7.1; the other cases consist of verifying that the right member of the first equation of the theorem is not too small.

1. Case where $\langle \hat F_{10}, F_{10}\rangle_{L^2(\gamma_C)} < 0$. Let us note that, because $R$ is a probability, we have $R \leq 1$. In addition,

$$E \geq \|F_{10} - \hat F_{10}\|_{L^2(\gamma_C)} \geq \|F_{10}\|_{L^2(\gamma_C)},$$

which implies that $R \leq E/\|F_{10}\|_{L^2(\gamma_C)}$.

2. Case where $\langle \hat F_{10}, F_{10}\rangle_{L^2(\gamma_C)} > 0$ and $\|\hat F_{10}\|_{L^2(\gamma_C)} \leq \frac{1}{2}\|F_{10}\|_{L^2(\gamma_C)}$. Recall that $R$ is upper bounded by $\frac{1}{2}$ when $\langle \hat F_{10}, F_{10}\rangle_{L^2(\gamma_C)} > 0$ (see Theorem 7.1; it is the case where $\alpha$ defined by (5) satisfies $-\pi/2 \leq \alpha \leq \pi/2$). In addition, the inequality $\|\hat F_{10}\|_{L^2(\gamma_C)} \leq \frac{1}{2}\|F_{10}\|_{L^2(\gamma_C)}$ implies $E \geq \frac{1}{2}\|F_{10}\|_{L^2(\gamma_C)}$, and as a consequence $R \leq \frac{1}{2}$ implies that $R \leq E/\|F_{10}\|_{L^2(\gamma_C)}$.

3. Case where $\langle \hat F_{10}, F_{10}\rangle_{L^2(\gamma_C)} > 0$, $\|\hat F_{10}\|_{L^2(\gamma_C)} \geq \frac{1}{2}\|F_{10}\|_{L^2(\gamma_C)}$ and $\pi/2 > \alpha > \pi/4$ (recall that $\alpha$ has been defined by (5)). Since $\pi/2 > \alpha > \pi/4$, we have $\cos(\alpha) \leq \frac{\sqrt 2}{2}$ and, as a consequence and with the help of (5),

$$\langle \hat F_{10}, F_{10}\rangle_{L^2(\gamma_C)} \leq \frac{\sqrt 2}{2}\,\|\hat F_{10}\|_{L^2(\gamma_C)}\,\|F_{10}\|_{L^2(\gamma_C)}.$$
Under this last constraint, we have

$$\min_{\hat F_{10}} \|F_{10} - \hat F_{10}\|^2_{L^2(\gamma_C)} = \min_{a}\big((1-a)^2 + a^2\big)\,\|F_{10}\|^2_{L^2(\gamma_C)} = \frac{1}{2}\|F_{10}\|^2_{L^2(\gamma_C)},$$

which again implies $R \leq E/\|F_{10}\|_{L^2(\gamma_C)}$.

4. Case where $\langle \hat F_{10}, F_{10}\rangle_{L^2(\gamma_C)} > 0$, $\|\hat F_{10}\|_{L^2(\gamma_C)} \geq \frac{1}{2}\|F_{10}\|_{L^2(\gamma_C)}$ and $\alpha < \pi/4$. Since $\alpha \in [0, \pi/4]$, the concavity of the sine function gives $\frac{\alpha}{\pi} \leq \frac{\sin(\alpha)}{2\sqrt 2}$. In addition, the relation $\|\hat F_{10}\|_{L^2(\gamma_C)} \geq \frac{1}{2}\|F_{10}\|_{L^2(\gamma_C)}$ implies that

$$\sin(\alpha) = \frac{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}{\|\hat F_{10}\|_{L^2(\gamma_C)}} \leq \frac{2\,\|F_{10} - \hat F_{10}\|_{L^2(\gamma_C)}}{\|F_{10}\|_{L^2(\gamma_C)}}$$

(the first equality is a trigonometric formula). Finally, we obtain

$$\frac{\alpha}{\pi} \leq \frac{\|F_{10} - \hat F_{10}\|_{L^2(\gamma_C)}}{\sqrt 2\,\|F_{10}\|_{L^2(\gamma_C)}}. \qquad (77)$$

Recall that $d_0 = \langle \hat F_{10}, \hat s_{10} - s_{10}\rangle_{\mathbb{R}^p}$. The equality defining $\alpha$ (5) and the fact that $\cos(\alpha) \geq \frac{\sqrt 2}{2}$ now imply

$$\frac{|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}} \leq \frac{\sqrt 2\,|d_0|\sin(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}} \quad \Big(\text{since } \cos(\alpha) \geq \tfrac{\sqrt 2}{2}\Big) \quad = \frac{\sqrt 2\,|d_0|}{\|\hat F_{10}\|_{L^2(\gamma_C)}} \quad \text{(from a trigonometric formula).}$$

Also, noticing that $\gamma_1([0; u]) \leq \frac{u}{\sqrt{2\pi}}$ and that $\tan(\alpha) \leq 1$, we get

$$\gamma_1\Big(\Big[0;\; \frac{(1+\tan(\alpha))\,|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^\perp}\hat F_{10}\|_{L^2(\gamma_C)}}\Big]\Big) \leq \gamma_1\Big(\Big[0;\; \frac{2\sqrt 2\,|d_0|}{\|\hat F_{10}\|_{L^2(\gamma_C)}}\Big]\Big) \leq \frac{2\,|d_0|}{\sqrt\pi\,\|\hat F_{10}\|_{L^2(\gamma_C)}}. \qquad (78)$$

In the cases 1, 2 and 3 of Theorem 7.1, because $\tan(\alpha) \leq 1$ ($\alpha \leq \pi/4$), equations (77), (78), (51) and (53) imply

$$R \leq \frac{E}{\|F_{10}\|_{L^2(\gamma_C)}}.$$

This ends the proof of Theorem 2.1.

8. A general scheme to solve Problem 1

8.1. Introduction and main result

Presentation of the main ideas. In this section, we will prove results concerning the QDA procedure. Recall that the learning error $R$ (the probability of misclassifying the data with a given rule when the optimal rule gives a correct classification) satisfies

$$R \leq \frac{1}{2}\Big( P_1\big(X \in V_{\hat L^Q_{10}}\triangle V_{L^Q_{10}}\big) + P_0\big(X \in V_{\hat L^Q_{10}}\triangle V_{L^Q_{10}}\big) \Big) \qquad (79)$$

(if $f : \mathcal X \to \mathbb{R}$, $V_f$ is defined by (45) at the beginning of the preceding section). Indeed, the event $X \in V_{\hat L^Q_{10}}\triangle V_{L^Q_{10}}$ corresponds to the case where the decisions (good or erroneous) taken by the optimal rule and the plug-in rule are different.

Remark 8.1. In the case of the LDA procedure, we had

$$R = \frac{1}{2}\Big( \gamma_{C,s_{10}}\big(X \in \hat V\setminus V - \tfrac{m_{10}}{2}\big) + \gamma_{C,s_{10}}\big(X \in V\setminus\hat V + \tfrac{m_{10}}{2}\big) \Big).$$

From this equation, one can easily deduce that

$$2R = \frac{1}{2}\Big( \gamma_{C,s_{10}}\big(X \in \hat V\triangle V - \tfrac{m_{10}}{2}\big) + \gamma_{C,s_{10}}\big(X \in V\triangle\hat V + \tfrac{m_{10}}{2}\big) \Big),$$

and, as a consequence,

$$2R = \frac{1}{2}\Big( P_1\big(X \in V_{\hat L^A_{10}}\triangle V_{L^A_{10}}\big) + P_0\big(X \in V_{\hat L^A_{10}}\triangle V_{L^A_{10}}\big) \Big). \qquad (80)$$

It seems less obvious that this type of relation holds in the quadratic case.

In Subsection 8.2 we will present a technique to put an upper bound on probabilities like $P(V_f\triangle V_{f+\delta})$. In this type of quantity, we shall call the measurable function $\delta$ (which can be thought of as a small function) the perturbation function, and the measurable function $f$ from $\mathcal X$ to $\mathbb{R}$ the optimal frontier function. In the case of QDA, the results obtained are consequences of Theorem 8.1, given in the next paragraph, with frontier function $f = L^Q_{10}$ and perturbation function $\delta = \hat L^Q_{10} - L^Q_{10}$.
A general result concerning quadratic perturbations of a quadratic rule. In the sequel we need to introduce some quantities related to gaussian measures on separable Banach spaces, and $\mathcal X$ is a separable Banach space; we refer to [8] and its section on measurable polynomials for a rigorous treatment of the subject. The Hilbert space of measurable affine functions from $\mathcal X$ to $\mathbb{R}$ with finite $L^2(\gamma_{C,m})$ norm and null integral with respect to $\gamma_{C,m}$ will be denoted by $\mathcal X^*_{\gamma_{C,m}}$. The Hilbert space of measurable quadratic forms in $L^2(\gamma_{C,m})$ with null integral with respect to $\gamma_{C,m}$ will be denoted $E_2(\gamma_{C,m})$. The space of measurable quadratic forms in $L^2(\gamma_{C,m})$ will be denoted by $\mathcal X^*_{2,\gamma}$, and we have the classical gaussian chaos decomposition in $L^2(\gamma_{C,m})$:

$$\mathcal X^*_{2,\gamma} = \{C^{te}\} \oplus \mathcal X^*_{\gamma_{C,m}} \oplus E_2(\gamma_{C,m}).$$

In infinite dimension, $H(\gamma_{C,m})$ is the reproducing kernel Hilbert space associated to $\gamma_{C,m}$; in finite dimension ($\mathcal X = \mathbb{R}^p$), we have (if $C$ is of full rank) $H(\gamma_{C,m}) = \mathbb{R}^p$. Recall that to each Hilbert–Schmidt operator $A$ on $H(\gamma_{C,m})$ one can associate a measurable element of $E_2(\gamma_{C,m})$, and that each element of $E_2(\gamma_{C,m})$ is associated to a unique Hilbert–Schmidt operator on $H(\gamma_{C,m})$. In finite dimension, if $C$ is of full rank,

$$q^A_{\gamma_{C,m}}(x) = q_{C^{-1/2}AC^{-1/2}}(x - m) - \int_{\mathcal X} q_{C^{-1/2}AC^{-1/2}}(x - m)\,\gamma_{C,m}(dx)$$

(recall that $q_A(x) = \langle Ax, x\rangle_{\mathbb{R}^p}$)

$$= \big\langle AC^{-1/2}(x-m),\; C^{-1/2}(x-m)\big\rangle_{\mathbb{R}^p} - \sum_{i=1}^p \lambda_i,$$

where $(\lambda_i)_{i=1,\dots,p}$ is the vector of eigenvalues of $A$.

Theorem 8.1. Let $\mathcal X$ be a separable Banach space and $\gamma_{C,m}$ a gaussian measure on $\mathcal X$ with mean $m$ and covariance $C$. Let $A$ and $D$ be two symmetric Hilbert–Schmidt operators on $H(\gamma_{C,m})$, $F, d \in \mathcal X^*_{\gamma_{C,m}}$, and $c, d_0 \in \mathbb{R}$. Let

$$f(x) = c + F(x) + q^A_{\gamma_{C,m}}(x) \quad \text{and} \quad \delta(x) = d_0 + d(x) + q^D_{\gamma_{C,m}}(x)$$

be the functions defining $V_f$ and $V_{f+\delta}$ (if $g : \mathcal X \to \mathbb{R}$, $V_g$ is defined by equation (45)). Finally, let $r, R \in \mathbb{R}$ be such that $R > r > 0$.

1. Assume that $r \leq \|f\|_{L^2(\gamma_{C,m})} \leq R$. Then, for all $q \in\, ]0,1[$, there exists $c_1(r, R, q) > 0$ (that only depends on $r$, $R$ and $q$) such that

$$\gamma_{C,m}\big(V_f\triangle V_{f+\delta}\big) \leq c_1(r, R, q)\,\|\delta\|^{q/3}_{L^2(\gamma_{C,m})}. \qquad (81)$$

2. If $|E_{\gamma_{C,m}}[f]| > r$ and $\|f\|_{L^2(\gamma_{C,m})} \leq R$, then, for all $q \in\, ]0,1[$, there exists $c_2(r, R, q) > 0$ (that only depends on $r$, $R$ and $q$) such that

$$\gamma_{C,m}\big(V_f\triangle V_{f+\delta}\big) \leq c_2(r, R, q)\,\|\delta\|^{2q/7}_{L^2(\gamma_{C,m})}. \qquad (82)$$

The two following subsections are devoted to the proof of this theorem. Subsection 8.2 presents a general methodology for obtaining this type of result, and in Subsection 8.4 we apply this methodology to obtain Theorem 8.1.

8.2. Decomposition of the domain

We will give an upper bound on the probability that $X \in V_f\triangle V_{f+\delta}$. In the cases we have in mind, this set is essentially composed of elements for which $\delta$ takes large values or $f$ is near zero. Hence, we shall bound the measure of the areas on which

1. the perturbation is large (with a large deviation inequality),
2. $|f|$ is small (with an inequality of the form $P(|f(X)| \leq \epsilon) \leq g(\epsilon)$).

Lemma 8.1 below is based on the two following assumptions.
1. Assumption A1. There exist $c_0, c_1 > 0$ and a non-decreasing $h_\delta : \mathbb{R}^+ \to \mathbb{R}^+$ with $h_\delta(0) = 0$ and $\lim_{s\to\infty}h_\delta(s) = \infty$, such that

$$\forall s > 0, \quad P\big(|\delta(X) - E[\delta(X)]| \geq c_0 h_\delta(s)\big) \leq c_1 e^{-\frac{s^2}{2}}. \qquad (83)$$

2. Assumption A2. There exist $\beta > 0$ and $c_2 > 0$ such that

$$\forall \epsilon > 0, \quad P\big(|f(X)| \leq \epsilon\big) \leq c_2\,\epsilon^\beta. \qquad (84)$$

Remark 8.2. The function $h_\delta$ of Assumption A1 will help us in measuring the effect of a perturbation $\delta$.

Lemma 8.1. Under Assumptions A1 (83) and A2 (84), for all $q \in\, ]0;1[$ we have

$$P\big(X \in V_f\triangle V_{f+\delta}\big) \leq c_1^{1-q}c_2\,\big|E_P[\delta(X)]\big|^{q\beta} + \sqrt{\frac{2\pi}{1-q}}\;c_2\,c_1^{1-q}\,\frac{1}{2}\,E\left[\Big(c_0\, h_\delta\Big(\frac{|\xi|}{\sqrt{1-q}} + 1\Big) + \big|E_P[\delta(X)]\big|\Big)^{q\beta}\right],$$

where $\xi$ is a centred real gaussian random variable with variance $1$.

Proof. Recall that $V_f = \{x : f(x) \geq 0\}$. We have

$$P\big(X \in V_f\triangle V_{f+\delta}\big) = P\Big( -\big(\delta(X) - E[\delta(X)]\big) - E[\delta(X)] \leq f(X) \leq 0 \ \text{ or } \ 0 \leq f(X) \leq -\big(\delta(X) - E[\delta(X)]\big) - E[\delta(X)] \Big),$$

hence $P(X \in V_f\triangle V_{f+\delta}) \leq P(U)$, where $U = \{|f(X)| \leq |\delta(X) - E[\delta(X)]| + |E[\delta(X)]|\}$. Define

$$B_j = \big\{c_0 h_\delta(j) \leq |\delta(X) - E[\delta(X)]| < c_0 h_\delta(j+1)\big\} \quad \text{for } j \in \mathbb{N};$$

this family of events covers all possibilities. We observe that $P(U) = \sum_{j\geq 0} P(U\cap B_j)$, and then, using Hölder's inequality (with $p + q = 1$), we get

$$P(U) \leq \sum_{j\geq 0} P(U\cap B_j)^q\, P(B_j)^p.$$

It follows that

$$P\big(X \in V_f\triangle V_{f+\delta}\big) \leq \sum_j P\big(|f(X)| \leq |E[\delta(X)]| + c_0h_\delta(j+1)\big)^q\, P\big(|\delta(X) - E[\delta(X)]| \geq c_0h_\delta(j)\big)^{1-q}$$
$$\leq c_2\,c_1^{1-q}\sum_{j\geq 0}\big(|E[\delta(X)]| + c_0h_\delta(j+1)\big)^{q\beta}\, e^{-\frac{(1-q)j^2}{2}} \quad \text{(from Assumptions A1 and A2)}$$
$$\leq c_2\,c_1^{1-q}\left( |E[\delta(X)]|^{q\beta} + \sqrt{\frac{2\pi}{1-q}}\int_0^\infty \big(c_0 h_\delta(x+1) + |E[\delta(X)]|\big)^{q\beta}\sqrt{\frac{1-q}{2\pi}}\,e^{-\frac{(1-q)x^2}{2}}\,dx \right),$$

which implies the desired result.

Lemma 8.2. Let $\delta_1,\dots,\delta_k$ be $k$ perturbations satisfying Assumption A1 (83), with error functions $h_{\delta_1},\dots,h_{\delta_k}$ and constants $c_{0i}, c_{1i}$. Then, if $\delta = \sum_{i=1}^k \delta_i$ and $h_\delta = \sum_{i=1}^k h_{\delta_i}$, there exist $c_0(k), c_1(k) > 0$ such that

$$\forall s > 0, \quad P\big(|\delta - E[\delta]| \geq c_0 h_\delta(s)\big) \leq c_1 e^{-\frac{s^2}{2}}. \qquad (85)$$

Proof. Recall that $h_{\delta_i} \geq 0$ for all $i$, and fix $s > 0$. The proof relies on the pigeonhole principle: if $\sum_{i=1}^k|\delta_i - E[\delta_i]| \geq k\sum_{i=1}^k c_{0i}h_{\delta_i}(s)$, then there exists $i_0 \in \{1,\dots,k\}$ such that $|\delta_{i_0} - E[\delta_{i_0}]| \geq \sum_{i=1}^k c_{0i}h_{\delta_i}(s)$. If we fix $c_0 = k\max_i c_{0i}$, we then have

$$P\left( \Big|\sum_{i=1}^k \delta_i - E[\delta_i]\Big| \geq c_0\sum_{i=1}^k h_{\delta_i}(s) \right) \leq P\left( \sum_{i=1}^k |\delta_i - E[\delta_i]| \geq k\sum_{i=1}^k c_{0i}h_{\delta_i}(s) \right)$$

(from the triangle inequality and the fact that $c_0\sum_i h_{\delta_i}(s) \geq k\sum_i c_{0i}h_{\delta_i}(s)$)

$$\leq P\left( \exists i_0 \in \{1,\dots,k\} : |\delta_{i_0} - E[\delta_{i_0}]| \geq \sum_{i=1}^k c_{0i}h_{\delta_i}(s) \right) \quad \text{(pigeonhole principle)}$$
$$\leq \sum_{i=1}^k P\big(|\delta_i - E[\delta_i]| \geq c_{0i}h_{\delta_i}(s)\big) \quad \text{(subadditivity of probability)}$$
$$\leq \sum_{i=1}^k c_{1i}\, e^{-\frac{s^2}{2}} \quad \text{(each } h_{\delta_i}\text{ satisfies Assumption A1),}$$

which ends the proof.

The results that allow us to verify Assumption A2 are presented in Subsection 8.5. We now recall some standard large deviation results that allow us to verify Assumption A1.
8.3. Large deviations

In the case where $\delta$ is linear or Lipschitz, the following classical result (see for example [8], p. 174) allows us to check Assumption A1.

Theorem 8.2. Let $\gamma = \gamma_C$ be a gaussian measure of covariance $C$ on a separable Banach space $\mathcal X$, $H = H(\gamma)$ the associated reproducing kernel Hilbert space, and $\delta : \mathcal X \to \mathbb{R}$ a function such that there exists $N(\delta) > 0$ with

$$|\delta(x+h) - \delta(x)| \leq N(\delta)\,|h|_{H(\gamma)} \quad \forall h \in H(\gamma),\ \gamma\text{-a.s.} \qquad (86)$$

Then, for all $s > 0$,

$$\gamma\Big( x \in \mathcal X : \Big|\delta(x) - \int\delta\,d\gamma\Big| > s \Big) \leq 2\,e^{-\frac{s^2}{2N(\delta)^2}}. \qquad (87)$$

In the case where $\delta$ is quadratic, the following result of Laurent and Massart [19] (Lemma 1, p. 1325) will help us to check Assumption A1.

Theorem 8.3. If $D = \mathrm{Diag}(d_1,\dots,d_p)$ and $q_D(x) = \langle Dx, x\rangle_{\mathbb{R}^p}$, then

$$\gamma_p\Big( x \in \mathbb{R}^p : q_D(x) - \int_{\mathbb{R}^p}q_D\,d\gamma_p \geq s\sqrt 2\,\|q_D\|_{L^2(\gamma_p)} + \sup_i|d_i|\,s^2 \Big) \leq e^{-\frac{s^2}{2}}, \qquad (88)$$

$$\gamma_p\Big( x \in \mathbb{R}^p : q_D(x) - \int_{\mathbb{R}^p}q_D\,d\gamma_p \leq -s\sqrt 2\,\|q_D\|_{L^2(\gamma_p)} \Big) \leq e^{-\frac{s^2}{2}}. \qquad (89)$$

As a consequence, Assumption A1 is satisfied with $h_\delta(s) = s\sqrt 2\,\|q_D\|_{L^2(\gamma_p)} + s^2\sup_i|d_i| \leq \|q_D\|_{L^2(\gamma_p)}\,(s\sqrt 2 + s^2)$.

The use we will make of these results is entirely contained in the following corollary.

Corollary 8.1. Let $\mathcal X$ be a separable Banach space, $\gamma$ a gaussian measure on $\mathcal X$ and $\delta \in \mathcal X^*_{2,\gamma}$. Then $\delta$ satisfies Assumption A1 with

$$h_\delta(s) = \|\delta - E_\gamma[\delta]\|_{L^2(\gamma)}\,(s + s^2).$$

Proof. It suffices to check the result for $\mathcal X = \mathbb{R}^p$ and to use a standard approximation argument. Recall that in $L^2(\gamma)$ we have $\mathcal X^*_{2,\gamma} = \{cte\}\oplus\mathcal X^*_\gamma\oplus E_2(\gamma)$; hence there exists a unique triplet $\delta_0 = E_\gamma[\delta] \in \{cte\}$, $\delta_1 \in \mathcal X^*_\gamma$ and $\delta_2 \in E_2(\gamma)$ such that $\delta = \delta_0 + \delta_1 + \delta_2$. From the preceding theorem, Assumption A1 is satisfied for the perturbation $\delta_2$, the measure $P = \gamma$ and $h_{\delta_2}(s) = \|\delta_2\|_{L^2(\gamma)}(s + s^2)$. Because $\delta_1 \in \mathcal X^*_\gamma$, $\delta_1$ is affine; hence, by Theorem 8.2, Assumption A1 is satisfied for the perturbation $\delta_1$ with $h_{\delta_1}(s) = s\,\|\delta_1\|_{L^2(\gamma)}$. We can then conclude using Lemma 8.2 and the fact that

$$\|\delta_2\|_{L^2(\gamma)}(s+s^2) + s\,\|\delta_1\|_{L^2(\gamma)} \leq \big(\|\delta_1\|_{L^2(\gamma)} + \|\delta_2\|_{L^2(\gamma)}\big)(s+s^2) \leq \sqrt 2\,(s+s^2)\,\|\delta - \delta_0\|_{L^2(\gamma)}.$$
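As an aside, the deviation bound of Theorem 8.3 is easy to probe empirically. The following simulation is our own addition (it assumes the reconstruction of (88) given above): it draws a small diagonal quadratic form and compares the empirical frequency of large deviations with $e^{-s^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical check of the upper deviation bound (88) for q_D(X) = sum d_i X_i^2:
# P(q_D - E q_D >= s*sqrt(2)*||q_D|| + s^2*max|d_i|) <= exp(-s^2/2).
d = np.array([2.0, -1.0, 0.5, 0.25])
xi = rng.normal(size=(2_000_000, d.size))
qD = (xi ** 2) @ d
dev = qD - d.sum()                        # E q_D(X) = sum_i d_i
norm_qD = np.sqrt(np.mean(qD ** 2))       # Monte Carlo L2(gamma_p) norm
for s in (1.0, 2.0, 3.0):
    thr = s * np.sqrt(2.0) * norm_qD + s ** 2 * np.abs(d).max()
    print(s, np.mean(dev >= thr), np.exp(-s ** 2 / 2.0))
```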
We now have all the elements to prove Theorem 8.1.

8.4. Proof of Theorem 8.1

As announced, we shall apply Lemma 8.1. From Theorem 8.4, Assumption A2 is satisfied with $\beta = 1/3$ in case 1 of our theorem and with $\beta = 2/7$ in case 2; in both cases the constant $c_2$ depends on $r$ only. In both cases, from the preceding corollary, Assumption A1 is satisfied with the function $h_\delta(s) = (s+s^2)\|\delta - \delta_0\|_{L^2(\gamma)}$. Hence, if we apply Lemma 8.1, for all $q \in\, ]0,1[$ there exists a constant $C(r,q) > 0$ such that

$$\gamma\big(V_f\triangle V_{f+\delta}\big) \leq C(r,q)\,\big( |E_\gamma(\delta)| + \|\delta - E[\delta]\|_{L^2(\gamma)} \big)^{q\beta},$$

and a constant $C'(r,q) > 0$ such that

$$\gamma\big(V_f\triangle V_{f+\delta}\big) \leq C'(r,q)\,\|\delta\|^{q\beta}_{L^2(\gamma)}.$$

This ends the proof of the theorem.

8.5. Small crown probability

In this subsection, $\mathcal X^*_2$ is the set of real random variables that can be written

$$c + \sum_{i\geq 0}\beta_i(\xi_i^2 - 1) + \alpha_i\xi_i,$$

with $c \in \mathbb{R}$, $\beta = (\beta_i)_i \in l^2(\mathbb{N})$, $\alpha = (\alpha_i)_i \in l^2(\mathbb{N})$, and $(\xi_i)_{i\in\mathbb{N}}$ a sequence of independent identically distributed gaussian random variables with mean $0$ and variance $1$. Let $q \in \mathcal X^*_2$ be given by

$$q = c + \sum_{i\geq 0}\alpha_i\xi_i + \sum_i \beta_i(\xi_i^2 - 1).$$

We will write

$$n_1(q) = \max_i|\alpha_i|, \quad n_2(q) = \max_i|\beta_i|, \quad \sigma(q) = \Big( \sum_{i\geq 0} 2\beta_i^2 + \alpha_i^2 \Big)^{1/2}. \qquad (90)$$

Theorem 8.4.

1. There exists $C(c_0) > 0$ such that $\sup\{P(|q| \leq \epsilon) : q \in \mathcal X^*_2,\ |E[q]| \geq c_0\} \leq C(c_0)\,\epsilon^{2/7}$.
2. There exists $C'(c_0) > 0$ such that $\sup\{P(|q| \leq \epsilon) : q \in \mathcal X^*_2,\ E[q^2] \geq c_0\} \leq C'(c_0)\,\epsilon^{1/3}$.
3. Let $q \in \mathcal X^*_2$; for all $\epsilon \geq 0$,

$$P(|q| \leq \epsilon) \leq \sqrt{\frac{1}{\pi}\,\frac{\epsilon}{n_2(q)}}.$$

Remark 8.3. This result may seem surprising, and we did not show that it is optimal. If $n_2(q) = \max_i|\beta_i| > c_0$, the bound of point 3 is optimal in the sense that, if $\beta = (1, 0, \dots)$, $c = 1$ and $\alpha = 0$, we get $P(|q| \leq \epsilon) = P(\xi^2 \leq \epsilon) \sim C\epsilon^{1/2}$ (for a constant $C$ which can be calculated explicitly). In addition, when $\|\beta\|_{l^2} \to 0$, the behaviour of $P(|q| \leq \epsilon)$ tends to be the same as $P(\big|\|\alpha\|_{l^2}\mathcal{N}(0,1) - c\big| \leq \epsilon) \sim C'(c_0)\,\epsilon$. Hence one may conjecture that points 1 and 2 of the theorem can be improved (in order to obtain an exponent $1/2$ instead of $2/7$ and $1/3$), but we believe this is unlikely. The difficult cases to study (and point 3 of the following proof demonstrates this) are those where $\|\beta\|_\infty \to 0$ while $\|\beta\|_{l^2}$ does not tend to zero.
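Point 3 of Theorem 8.4 can likewise be probed by simulation. The sketch below is our own addition, with an arbitrary low-dimensional choice of $q$: it compares the empirical small-ball probability with the bound $\sqrt{\epsilon/(\pi n_2(q))}$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Empirical check of point 3 of Theorem 8.4 for a low-dimensional
# q = c + sum_i beta_i*(xi_i^2 - 1) + alpha_i*xi_i (arbitrary choices).
beta = np.array([0.8, -0.3, 0.1])
alpha = np.array([0.5, 0.2, 0.0])
c = 0.4
xi = rng.normal(size=(2_000_000, 3))
q = c + ((xi ** 2 - 1.0) @ beta) + (xi @ alpha)
n2 = np.abs(beta).max()                   # n_2(q) = max_i |beta_i|
for eps in (0.05, 0.2, 0.5):
    print(eps, np.mean(np.abs(q) <= eps), np.sqrt(eps / (np.pi * n2)))
```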
Proof. We shall proceed in four steps.

Step 1. We claim that, if $|E[q]| > \epsilon$, then

$$P(|q| \leq \epsilon) \leq \frac{\sigma^2(q)}{\big(|E[q]| - \epsilon\big)^2}. \qquad (91)$$

Notice that $|q - E[q]| \geq \big||q| - |E[q]|\big|$ and that, if $|q| < \epsilon < |E[q]|$, then $\big||q| - |E[q]|\big| = |E[q]| - |q|$ and $|q| \geq |E[q]| - |q - E[q]|$. Hence

$$P(|q| \leq \epsilon) \leq P\big(|E[q]| - |q - E[q]| \leq \epsilon\big) = P\Big(1 \leq \frac{|q - E[q]|}{|E[q]| - \epsilon}\Big),$$

which implies (91) by the Markov inequality (applied to $|q - E[q]|^2$, whose expectation is $\sigma^2(q)$).

Step 2. We claim that

$$P(|q| \leq \epsilon) \leq \sqrt{\frac{1}{\pi}\,\frac{\epsilon}{n_2(q)}}. \qquad (92)$$

We will assume without loss of generality that $\alpha_i \geq 0$ for all $i \in \mathbb{N}$. In the following, $\alpha_{i_0} = \max_i\alpha_i$, $j_0 \in \arg\max_j|\beta_j|$, and $\mathrm{sign}(x)$ is the function that returns the sign of the real $x$. Let

$$Z = \sum_{i\neq j_0} \alpha_i\xi_i + \beta_i(\xi_i^2 - 1).$$

To obtain the desired inequality, note that, for all $\alpha_{j_0} \geq 0$ and $\beta_{j_0} \neq 0$,

$$P\big(|Z + \alpha_{j_0}\xi + \beta_{j_0}(\xi^2-1)| \leq \epsilon\big) = P\big(|\mathrm{sign}(\beta_{j_0})Z + \alpha_{j_0}\xi + |\beta_{j_0}|(\xi^2-1)| \leq \epsilon\big)$$
$$= P\left(\Big|\frac{\mathrm{sign}(\beta_{j_0})Z}{|\beta_{j_0}|} + \Big(\xi + \frac{\alpha_{j_0}}{2|\beta_{j_0}|}\Big)^2 - 1 - \frac{\alpha_{j_0}^2}{4\beta_{j_0}^2}\Big| \leq \frac{\epsilon}{|\beta_{j_0}|}\right)$$
$$= P\left(\xi \in \Big[f_{\alpha_{j_0},\beta_{j_0}}(-\epsilon) - \frac{\alpha_{j_0}}{2|\beta_{j_0}|};\; f_{\alpha_{j_0},\beta_{j_0}}(\epsilon) - \frac{\alpha_{j_0}}{2|\beta_{j_0}|}\Big]\right),$$

where

$$f_{\alpha,\beta}(\epsilon) = \sqrt{\Big(1 + \frac{\alpha^2}{4\beta^2} - \frac{\mathrm{sign}(\beta)Z - \epsilon}{|\beta|}\Big)_+}, \qquad (x)_+ = x\,1_{x\geq 0}.$$

Inequality (92) results from the choice $\alpha = \alpha_{j_0}$, $\beta = \beta_{j_0}$ and from the fact that, for all $u \in \mathbb{R}$,

$$\sqrt{\Big(u + \frac{\epsilon}{|\beta_{j_0}|}\Big)_+} - \sqrt{\Big(u - \frac{\epsilon}{|\beta_{j_0}|}\Big)_+} \leq \sqrt{\frac{2\epsilon}{n_2(q)}}.$$

Step 3. We claim that

$$P(|q| \leq \epsilon) \leq \frac{208\, n_2(q)}{\sigma(q)} + \frac{2\epsilon}{\sigma(q)}\,e^{-\frac{(|E[q]|-\epsilon)^2}{\sigma^2(q)}}. \qquad (93)$$

We prove the following lemma (which is a central limit theorem) at the end of the proof.

Lemma 8.3. Let $X_i = \beta_i(\xi_i^2 - 1) + \alpha_i\xi_i$, let $\xi$ be a gaussian centred random variable with variance $1$, and let $\sigma(q)$ be given by (90). We have

$$\sup_{\epsilon\geq 0}\left| P\Big(\big|E_\gamma[q] + \sum_{i\geq 0}X_i\big| \leq \epsilon\Big) - P\Big(\big|\xi + \frac{E_\gamma[q]}{\sigma(q)}\big| \leq \frac{\epsilon}{\sigma(q)}\Big) \right| \leq \frac{104\,\max_i|\beta_i|}{\sigma(q)}.$$

Also, because $|E[q]| > \epsilon$,

$$P\Big(\big|\xi + \frac{E[q]}{\sigma(q)}\big| \leq \frac{\epsilon}{\sigma(q)}\Big) \leq \frac{2\epsilon}{\sigma(q)}\,e^{-\frac{(|E[q]|-\epsilon)^2}{\sigma^2(q)}},$$

and we obtain inequality (93).

Step 4. As announced, we distinguish several disjoint cases to prove points 1 and 2 of the theorem. We begin with point 1.

1. In the case where $\sigma(q) < \epsilon^{1/7}$, the inequality (91) from Step 1 leads to the desired conclusion.
2. In the case where $n_2(q) \geq \epsilon^{3/7}$, the inequality (92) from Step 2 leads to the desired conclusion.
3. In the case where $n_2(q) < \epsilon^{3/7}$ and $\sigma(q) > \epsilon^{1/7}$, the inequality (93) from Step 3 leads to the desired conclusion.

We conclude with point 2.

1. In the case where $n_2(q) \geq \epsilon^{1/3}$, the inequality (92) from Step 2 leads to the desired conclusion.
2. In the case where $n_2(q) < \epsilon^{1/3}$, the inequality (93) from Step 3 leads to the desired conclusion.

We now give the proof of Lemma 8.3.

Proof. This proof is decomposed into two steps. In the first step we calculate

$$\forall\alpha,\beta \in \mathbb{R}, \quad \varphi_{\alpha,\beta}(t) = E\big[ e^{it(\xi\alpha + \beta(\xi^2-1))} \big], \qquad (94)$$

and in the second one we deduce that, for all $|t| < \frac{\sigma}{6\max_j|\beta_j|} = a$,

$$\Big|\prod_{j\geq 0}\varphi_{\alpha_j,\beta_j}(t/\sigma) - e^{-t^2/2}\Big| \leq \frac{4\max_j|\beta_j|}{\sigma}\,\frac{|t|^3}{2}\,e^{-t^2/6}, \qquad (95)$$

which implies the desired result from the Esseen inequality (see for example [23], p. 358):

$$\sup_{u\in\mathbb{R}}\left| P\Big(\frac{1}{\sigma}\sum_{j\geq 0}\alpha_j\xi_j + \beta_j(\xi_j^2-1) \leq u\Big) - \Phi(u) \right| \leq \int_{-a}^{a}\left|\frac{\prod_{i\geq 0}\varphi_{\alpha_i,\beta_i}(t/\sigma) - e^{-t^2/2}}{t}\right|dt + \frac{24}{a\sqrt{2\pi}}$$
$$\leq \frac{4\max_j|\beta_j|}{\sigma}\int_{\mathbb{R}}\frac{t^2}{2}e^{-\frac{t^2}{6}}\,dt + \frac{72\sqrt 2\,\max_j|\beta_j|}{\sigma\sqrt\pi} = \frac{\max_j|\beta_j|}{\sigma}\Big(72\sqrt{\frac{2}{\pi}} + 32\Big) \leq \frac{104\,\max_j|\beta_j|}{\sigma},$$

where $\Phi$ is the cumulative distribution function of a standardized gaussian real random variable.

Step 1. Let $\Omega_\beta = \{z \in \mathbb{C} : 2\Im(z)\beta > -1\}$ and let $\psi_{\alpha,\beta}(z)$ be given by

$$\forall\alpha,\beta \in \mathbb{R},\ z \in \Omega_\beta, \quad \psi_{\alpha,\beta}(z) = \frac{e^{-\beta iz}}{(1-2\beta iz)^{1/2}}\, e^{-\frac{1}{2}\frac{\alpha^2z^2}{1-2\beta iz}}.$$

The function $\psi_{\alpha,\beta}$ is analytic on $\Omega_\beta$. The function $\varphi_{\alpha,\beta}(t)$ defined by (94) can be continued into an analytic function on the domain $\Omega_\beta$ and, because

$$\frac{x^2}{2} + y\big(\alpha x + \beta(x^2-1)\big) = \frac{1}{2}(1+2\beta y)\Big(x + \frac{\alpha y}{1+2\beta y}\Big)^2 - \frac{\alpha^2y^2}{2(1+2\beta y)} - \beta y,$$

we observe that $\forall y > -\frac{1}{2\beta}$, $\psi_{\alpha,\beta}(iy) = \varphi_{\alpha,\beta}(iy)$. Hence $\varphi_{\alpha,\beta}$ and $\psi_{\alpha,\beta}$ are equal on $\Omega_\beta$, and in particular on $\mathbb{R}$, which gives

$$\forall\alpha,\beta \in \mathbb{R},\ t \in \mathbb{R}, \quad \varphi_{\alpha,\beta}(t) = \frac{e^{-\beta it}}{(1-2\beta it)^{1/2}}\, e^{-\frac{1}{2}\frac{\alpha^2t^2}{1-2\beta it}}.$$

Step 2: proof of (95). The preceding equation gives

$$\Big|\prod_{j\geq 0}\varphi_{\alpha_j,\beta_j}(t/\sigma) - e^{-t^2/2}\Big| = e^{-\frac{t^2}{2}}\,|e^{z} - 1| \leq e^{-\frac{t^2}{2}}\,|z|\,e^{|z|},$$

where $u = t/\sigma$ and

$$z = \frac{t^2}{2} + \sum_{j\geq 0}\left( -\frac{1}{2}\,\frac{\alpha_j^2 u^2}{1-2\beta_j iu} + \frac{1}{2}\big(-2\beta_j ui - \log(1-2\beta_j ui)\big) \right),$$
and hence

$$z = \sum_{j\geq 0}\left\{ \left(\frac{u^2\alpha_j^2}{2} - \frac{1}{2}\,\frac{\alpha_j^2 u^2}{1-2\beta_j iu}\right) + \left(u^2\beta_j^2 - \frac{1}{2}\big(2\beta_j ui + \log(1-2\beta_j ui)\big)\right) \right\}. \qquad (96)$$

In addition, if $|t| < \frac{\sigma}{6\max_i|\beta_i|}$, then for all $j \in \mathbb{N}$ we have $|2u\beta_j| < \frac{1}{3}$, and (cf. the Taylor expansion (1), p. 352 in [23])

$$\big|\log(1-2\beta_j ui) + 2\beta_j ui - 2\beta_j^2 u^2\big| \leq \frac{8|u\beta_j|^3}{3}\cdot\frac{1}{1-|2u\beta_j|} \leq 4\,|u\beta_j|^2\,|u|\max_j|\beta_j|.$$

We also have

$$\left|\frac{u^2\alpha_j^2}{2} - \frac{1}{2}\,\frac{\alpha_j^2 u^2}{1-2\beta_j iu}\right| \leq \frac{\alpha_j^2|u|^2}{2}\cdot\frac{2|u\beta_j|}{\sqrt{1+4\beta_j^2u^2}} \leq \alpha_j^2\,|u|^3\max_j|\beta_j|.$$

As a consequence, if $|t| < \frac{\sigma}{6\max_i|\beta_i|}$, then (96) implies

$$|z| \leq 2\sigma^2|u|^3\max_j|\beta_j| = \frac{2\max_j|\beta_j|}{\sigma}\,|t|^3,$$

and

$$e^{-\left(\frac{t^2}{2} - |z|\right)} \leq e^{-\frac{t^2}{2}\left(1-\frac{2}{3}\right)} = e^{-\frac{t^2}{6}}.$$

This ends the proof.

Acknowledgements

This work was done with support from La Région Rhône-Alpes.

References

[1] F. Abramovich, Y. Benjamini, D. Donoho, and I. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. Annals of Statistics, 34, 2006.
[2] T. Anderson and R. Bahadur. Classification into two multivariate normal distributions with different covariance matrices. Annals of Mathematical Statistics, 33(2):420–431, 1962.
[3] J.-Y. Audibert and A. Tsybakov. Fast learning rates for plug-in classifiers under the margin condition. Annals of Statistics, 2006.
[4] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57:289–300, 1995.
[5] A. Berlinet, G. Biau, and L. Rouvière. Functional classification with wavelets. 2005.
[6] P. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.
[7] P. Bickel and E. Levina. Regularized estimation of large covariance matrices. Annals of Statistics, 2007.
[8] V. I. Bogachev. Gaussian Measures. AMS, 1998.
[9] E. Candès. Modern statistical estimation via oracle inequalities. Acta Numerica, pages 1–69, 2006.
[10] D. Donoho. High-dimensional data analysis: the curses and blessings of dimensionality. Available at http://www-stat.stanford.edu/donoho/Lectures, 2000.
[11] D. L. Donoho and I. Johnstone. Minimax risk over lp-balls for lq-error. Probability Theory and Related Fields, (99):277–303, 1994.
[12] J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Technical report, Princeton University, 2007.
[13] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.
[14] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer, 2001.
[15] R. Girard. Réduction de dimension en statistique et application à la segmentation d'images hyperspectrales. PhD thesis, Université Joseph Fourier, 2008.
[16] V. Girardin and R. Senoussi. Semigroup stationary processes and spectral representation. Bernoulli, 9(5):857–876, 2003.
[17] U. Grenander. Stochastic processes and statistical inference. Arkiv för Matematik, 1:195–277, 1950.
[18] T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. Annals of Statistics, 23:73–102, 1995.
[19] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
[20] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.
[21] S. Mallat, G. Papanicolaou, and Z. Zhang. Adaptive covariance estimation of locally stationary processes. The Annals of Statistics, 26(1):1–47, 1998.
[22] F. Rossi and N. Villa. Support vector machine for functional data classification. Neurocomputing, 69:730–742, 2006.
[23] G. Shorack. Probability for Statisticians. Springer, 2000.
[24] A. Tsybakov. Introduction à l'estimation non-paramétrique. Springer, 2004.
[25] B. Yazici. Stochastic deconvolution over groups. IEEE Transactions on Information Theory, 50(3), 2004.