Hyper-sparse optimal aggregation
Stéphane Gaïffas (1,3) and Guillaume Lecué (2,3)

(1) Université Pierre et Marie Curie - Paris 6, Laboratoire de Statistique Théorique et Appliquée. email: stephane.gaiffas@upmc.fr
(2) CNRS, Laboratoire d'Analyse et Mathématiques appliquées, Université Paris-Est - Marne-la-vallée. email: guillaume.lecue@univ-mlv.fr
(3) This work is supported by the French Agence Nationale de la Recherche (ANR), grant "Prognostic" ANR-09-JCJC-0101-01 (http://www.lsta.upmc.fr/prognostic/index.php).

March 28, 2022

Abstract

In this paper, we consider the problem of hyper-sparse aggregation. Namely, given a dictionary $F = \{f_1, \ldots, f_M\}$ of functions, we look for an optimal aggregation algorithm that can be written as $\tilde f = \sum_{j=1}^M \theta_j f_j$ with as many zero coefficients $\theta_j$ as possible. This problem is of particular interest when $F$ contains many irrelevant functions that should not appear in $\tilde f$. We provide an exact oracle inequality for $\tilde f$, where only two coefficients are non-zero, which entails that $\tilde f$ is an optimal aggregation algorithm. Since selectors are suboptimal aggregation procedures, this proves that 2 is the minimal number of elements of $F$ required for the construction of an optimal aggregation procedure in every situation. A simulated example of this algorithm is proposed on a dictionary obtained using LARS, for the problem of selection of the regularization parameter of the LASSO. We also give an example of use of aggregation to achieve minimax adaptation over anisotropic Besov spaces, which was not previously known in minimax theory (in regression on a random design).

Keywords. Aggregation; Exact oracle inequality; Empirical risk minimization; Empirical process theory; Sparsity; Minimax adaptation

1 Introduction

1.1 Motivations

In this paper, we consider the problem of sparse aggregation. Namely, given a dictionary $F = \{f_1, \ldots, f_M\}$ of functions, we look for an optimal aggregation algorithm that can be written as $\hat f = \sum_{j=1}^M \theta_j f_j$ with as many zero coefficients $\theta_j$ as possible. This question appears when one wants to use aggregation procedures to construct adaptive procedures. Indeed, in practice many elements of the dictionary turn out to be irrelevant. We would like to remove these irrelevant elements of the dictionary completely from the final aggregate, while keeping the optimality of the procedure ("optimality" is used in reference to the definition of "optimal aggregation procedure" provided in Tsybakov (2003b) and Lecué and Mendelson (2009a)). Moreover, one could imagine large dictionaries containing many different types of estimators (kernel estimators, projection estimators, etc.) with many different parameters (smoothing parameters, groups of variables, etc.). Some of the estimators are likely to be more adapted than the others, depending on the kind of model that fits the data well. So, we would like to construct procedures that can adapt to different models by combining only the estimators, contained in the dictionary, that are the most adapted to each model. Up to now, optimal procedures have been based on exponential weights (cf. Juditsky et al. (2008), Dalalyan and Tsybakov (2007)), providing aggregation procedures with no zero coefficients, even for the worst elements in the dictionary. An improvement going in the direction of sparse aggregation has been made using a preselection step in Lecué and Mendelson (2009a).
This preselection step allows one to remove all the estimators in $F$ that perform badly on a learning subsample. In the present work, we prove that optimal aggregation algorithms with only two non-zero coefficients exist, see Section 2, Theorem 1. This means that the aggregate can be written as a convex combination of only two elements of $F$. Then, we propose an original proof of an already known result, involving an explicit geometrical setup, of the fact that selecting a single element of $F$ using empirical risk minimization is a suboptimal aggregation procedure, see Theorem 2. Finally, we use our "hyper-sparse" aggregate on a dictionary consisting of penalized empirical risk minimizers (PERM). The aim is to construct, as an application of the previous analysis, an adaptive estimator over anisotropic Besov balls, namely an estimator that adapts to the unknown anisotropic smoothness of the regression function, in the sense that it achieves the optimal minimax rate without a priori knowledge of the anisotropic smoothness parameters. This result was not, as far as we know, previously available in minimax theory. To do so, we use recent results by Mendelson and Neeman (2009) on regularized learning, together with our oracle inequality for the hyper-sparse aggregate.

1.2 The model

Let $\Omega$ be a measurable space endowed with a probability measure $\mu$ and let $\nu$ be a probability measure on $\Omega \times \mathbb{R}$ such that $\mu$ is its marginal on $\Omega$. Assume $(X, Y)$ and $D_n := (X_i, Y_i)_{i=1}^n$ to be $n + 1$ independent random variables distributed according to $\nu$. We work under the following assumption.

Assumption 1. We can write
$$Y = f_0(X) + \varepsilon, \qquad (1)$$
where $\varepsilon$ is such that $\mathbb{E}(\varepsilon | X) = 0$ and $\mathbb{E}(\varepsilon^2 | X) \le \sigma_\varepsilon^2$ a.s. for some constant $\sigma_\varepsilon > 0$.

We will assume further that either $Y$ is bounded, $\|Y\|_\infty < +\infty$, or that $\varepsilon$ is subgaussian, $\|\varepsilon\|_{\psi_2} := \inf\{c > 0 : \mathbb{E}[\exp((\varepsilon/c)^2)] \le 2\} < +\infty$; see below. We want to estimate the regression function $f_0$ using the observations $D_n$. If $f$ is a function, its error of prediction is given by the risk $R(f) = \mathbb{E}(f(X) - Y)^2$, and if $\hat f$ is a random function depending on the data $D_n$, the error of prediction is the conditional expectation
$$R(\hat f) = \mathbb{E}[(\hat f(X) - Y)^2 | D_n].$$
Given a set of functions $F$, a natural way to approximate $f_0$ is to consider the empirical risk minimizer (ERM), which minimizes the functional
$$R_n(f) := \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2$$
over $F$. This very basic principle is at the core of the procedures proposed in this paper. We will also commonly use the following notation. If $f_F \in \operatorname{argmin}_{f \in F} R(f)$, we will consider the excess loss
$$\mathcal{L}_f = \mathcal{L}_F(f)(X, Y) := (Y - f(X))^2 - (Y - f_F(X))^2,$$
and use the notation
$$P \mathcal{L}_f := \mathbb{E} \mathcal{L}_f(X, Y), \qquad P_n \mathcal{L}_f := \frac{1}{n} \sum_{i=1}^n \mathcal{L}_f(X_i, Y_i).$$
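Since the ERM over a finite dictionary is the basic building block used throughout the paper, the following minimal numpy sketch may help fix ideas. It is our own illustration, not code from the paper; the function names and the toy dictionary are ours.

```python
import numpy as np

def empirical_risk(f, X, Y):
    """R_n(f) = (1/n) * sum_i (Y_i - f(X_i))^2 for a function f."""
    return np.mean((Y - f(X)) ** 2)

def erm(dictionary, X, Y):
    """Return the empirical risk minimizer over a finite dictionary F."""
    risks = [empirical_risk(f, X, Y) for f in dictionary]
    return dictionary[int(np.argmin(risks))], risks

# Toy example: regression Y = f_0(X) + eps with a small dictionary.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0.0, 1.0, size=n)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(n)
F = [np.sin, np.cos, lambda x: 2 * x - 1]   # the dictionary F
f_hat, risks = erm(F, X, Y)
print(risks)   # the ERM should pick np.sin here
```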
2 Hyper-sparse aggregation

2.1 The aggregation problem

Assume that we are given a finite set $F = \{f_1, \ldots, f_M\}$ of functions (usually called a dictionary). The aggregation problem is to construct procedures $\tilde f$ (usually called aggregates) satisfying inequalities of the form
$$R(\tilde f) \le c \min_{f \in F} R(f) + r(F, n), \qquad (2)$$
where the result holds with high probability or in expectation. Inequalities of the form (2) are called oracle inequalities and $r(F, n)$ is called the residue. We want the residue to be as small as possible. A classical result (cf. Juditsky et al. (2008)) says that aggregates with values in $F$ cannot mimic the oracle exactly (that is, with $c = 1$) faster than $r(F, n) \sim ((\log M)/n)^{1/2}$. Nevertheless, it is possible to mimic the oracle up to the residue $(\log M)/n$ (see Juditsky et al. (2008) and Lecué and Mendelson (2009a), among others). An aggregate typically writes as a convex combination of the elements of $F$, namely
$$\hat f := \sum_{j=1}^M \theta_j f_j,$$
where $\theta := (\theta_j(D_n, F))_{j=1}^M$ is a data-dependent element of the simplex
$$\Theta := \Big\{ \lambda \in (\mathbb{R}^+)^M : \sum_{j=1}^M \lambda_j = 1 \Big\}.$$
Popular examples of aggregation algorithms are the aggregate with cumulated exponential weights (ACEW), see Catoni (2001); Leung and Barron (2006); Juditsky et al. (2008, 2005); Audibert (2009), where the weights are given by
$$\theta_j^{(\mathrm{ACEW})} := \frac{1}{n} \sum_{k=1}^n \frac{\exp\big( -\sum_{i=1}^k (Y_i - f_j(X_i))^2 / T \big)}{\sum_{l=1}^M \exp\big( -\sum_{i=1}^k (Y_i - f_l(X_i))^2 / T \big)},$$
where $T$ is the so-called temperature parameter, and the aggregate with exponential weights (AEW), see Dalalyan and Tsybakov (2007) among others, where
$$\theta_j^{(\mathrm{AEW})} := \frac{\exp\big( -\sum_{i=1}^n (Y_i - f_j(X_i))^2 / T \big)}{\sum_{l=1}^M \exp\big( -\sum_{i=1}^n (Y_i - f_l(X_i))^2 / T \big)}.$$
(A numerical sketch of these weights is given just before Definition 1 below.) The ACEW satisfies (2) with $c = 1$ and $r(F, n) \sim (\log M)/n$, see the references above, hence it is optimal in the sense of Tsybakov (2003b). In these aggregates, no coefficient equals zero, although they can be very small, depending on the values of $R_n(f_j)$ and $T$ (this makes, in particular, the choice of $T$ important). In this paper, we look for an aggregation algorithm that shares the same property of optimality, but with as few non-zero coefficients $\theta_j$ as possible, hence the name hyper-sparse aggregate. We ask the following question:

Question 1. What is the minimal number of non-zero coefficients $\theta_j$ such that an aggregation procedure $\sum_{j=1}^M \theta_j f_j$ is optimal?

It turns out that the answer to this question is two. Indeed, if every coefficient is zero except for one, the aggregate coincides with an element of $F$, and we know that such a procedure can only achieve the rate $((\log M)/n)^{1/2}$ (see Juditsky et al. (2008), and Theorem 2 below, where, in the particular case of the ERM, the suboptimality of this kind of procedure can be understood from a geometrical point of view; this differs from the statistical point of view of Juditsky et al. (2008), which involves a "min-max" type theorem). In Definition 1, we construct three procedures, two of which (see (7) and (8)) only have two non-zero coefficients $\theta_j$, and we prove in Theorem 1 below that these procedures are optimal. We shall assume one of the following.

Assumption 2. One of the following holds.
• There is a constant $b > 0$ such that:
$$\max\big( \|Y\|_\infty, \sup_{f \in F} \|f\|_\infty \big) \le b. \qquad (3)$$
• There is a constant $b > 0$ such that:
$$\max\big( \|\varepsilon\|_{\psi_2}, \big\| \sup_{f \in F} |f(X) - f_0(X)| \big\|_{\psi_2} \big) \le b. \qquad (4)$$

Note that assumption (4) allows an unbounded dictionary $F$. The results given below differ a bit depending on the considered assumption (there is an extra $\log n$ term in the subgaussian case given by (4)).
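As a complement to the two formulas above, here is a small numerical sketch of the AEW and ACEW weights (our own illustration; the loss-matrix layout and function names are assumptions, and the temperature T is left free):

```python
import numpy as np

def aew_weights(losses, T):
    """AEW weights from the matrix of squared losses.

    losses[i, j] = (Y_i - f_j(X_i))^2, shape (n, M); returns theta of shape (M,).
    """
    s = -losses.sum(axis=0) / T        # -sum_i (Y_i - f_j(X_i))^2 / T
    s -= s.max()                       # stabilize the exponentials
    w = np.exp(s)
    return w / w.sum()

def acew_weights(losses, T):
    """ACEW weights: average of AEW-type weights over cumulated losses k = 1..n."""
    n, M = losses.shape
    cum = np.cumsum(losses, axis=0)    # sum_{i<=k} (Y_i - f_j(X_i))^2
    theta = np.zeros(M)
    for k in range(n):
        s = -cum[k] / T
        s -= s.max()
        w = np.exp(s)
        theta += w / w.sum()
    return theta / n
```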
To simplify the notation, we assume from now on that we have $2n$ observations from a sample $D_{2n} = (X_i, Y_i)_{i=1}^{2n}$. Let us define our aggregation procedures.

Definition 1 (Aggregation procedures). Follow the following steps:

(0. Initialization) Choose a confidence level $x > 0$. If (3) holds, define
$$\phi = \phi_{n,M}(x) = b \sqrt{\frac{\log M + x}{n}}.$$
If (4) holds, define
$$\phi = \phi_{n,M}(x) = (\sigma_\varepsilon + b) \sqrt{\frac{(\log M + x) \log n}{n}}.$$

(1. Splitting) Split the sample $D_{2n}$ into $D_{n,1} = (X_i, Y_i)_{i=1}^n$ and $D_{n,2} = (X_i, Y_i)_{i=n+1}^{2n}$.

(2. Preselection) Use $D_{n,1}$ to define a random subset of $F$:
$$\hat F_1 = \Big\{ f \in F : R_{n,1}(f) \le R_{n,1}(\hat f_{n,1}) + c \max\big( \phi \|\hat f_{n,1} - f\|_{n,1}, \phi^2 \big) \Big\}, \qquad (5)$$
where $\|f\|_{n,1}^2 = n^{-1} \sum_{i=1}^n f(X_i)^2$, $R_{n,1}(f) = n^{-1} \sum_{i=1}^n (f(X_i) - Y_i)^2$ and $\hat f_{n,1} \in \operatorname{argmin}_{f \in F} R_{n,1}(f)$.

(3. Aggregation) Choose $\hat F$ as one of the following sets:
$$\hat F = \operatorname{conv}(\hat F_1) = \text{the convex hull of } \hat F_1, \qquad (6)$$
$$\hat F = \operatorname{seg}(\hat F_1) = \text{the segments between the functions in } \hat F_1, \qquad (7)$$
$$\hat F = \operatorname{star}(\hat f_{n,1}, \hat F_1) = \text{the segments between } \hat f_{n,1} \text{ and the elements of } \hat F_1, \qquad (8)$$
and return the ERM relative to $D_{n,2}$:
$$\tilde f \in \operatorname{argmin}_{g \in \hat F} R_{n,2}(g), \quad \text{where} \quad R_{n,2}(f) = n^{-1} \sum_{i=n+1}^{2n} (f(X_i) - Y_i)^2.$$

These algorithms are illustrated in Figures 1 and 2. In Figure 1, we summarize the aggregation step in the three cases. In Figure 2, we give a simulated illustration of the preselection step, and we show the values of the weights of the AEW for comparison. As mentioned above, when $\hat F$ is given by (7) or (8), Step 3 of the algorithm returns a function which is a convex combination of only two functions in $F$, among the ones remaining after the preselection step. The preselection step was introduced in Lecué and Mendelson (2009a), with the use of (6) in the aggregation step. Each of the three procedures proposed in Definition 1 is optimal in view of Theorem 1 below. From the computational point of view, procedure (8) is the most appealing: an ERM in $\operatorname{star}(\hat f_{n,1}, \hat F_1)$ can be computed in a fast and explicit way, see Algorithm 1 below. The next theorem proves that each of these aggregation procedures is optimal.

Theorem 1. Let $x > 0$ be a confidence level, $F$ be a dictionary with cardinality $M$ and $\tilde f$ be one of the aggregation procedures given in Definition 1. If
$$\max\big( \|Y\|_\infty, \sup_{f \in F} \|f\|_\infty \big) \le b,$$
we have, with $\nu^{2n}$-probability at least $1 - 2 e^{-x}$:
$$R(\tilde f) \le \min_{f \in F} R(f) + c_b \frac{(1 + x) \log M}{n},$$
where $c_b$ is a constant depending on $b$, and where we recall that $R(\tilde f) = \mathbb{E}[(Y - \tilde f(X))^2 | (X_i, Y_i)_{i=1}^{2n}]$. If
$$\max\big( \|\varepsilon\|_{\psi_2}, \big\| \sup_{f \in F} |f(X) - f_0(X)| \big\|_{\psi_2} \big) \le b,$$
we have, with $\nu^{2n}$-probability at least $1 - 4 e^{-x}$:
$$R(\tilde f) \le \min_{f \in F} R(f) + c_{\sigma_\varepsilon, b} \frac{(1 + x) \log M \log n}{n}.$$

Figure 1: Aggregation algorithms: ERM over $\operatorname{conv}(\hat F_1)$, $\operatorname{seg}(\hat F_1)$, or $\operatorname{star}(\hat f_{n,1}, \hat F_1)$.

Remark 1. Note that the definition of the set $\hat F_1$, and thus of $\tilde f$, depends on the confidence level $x$ through the factor $\phi_{n,M}(x)$.

Remark 2. To simplify the proofs, we do not give the explicit values of the constants. However, when (3) holds, one can choose $c = 4(1 + 9b)$ in (5), and $c = c_1(1 + b)$ when (4) holds (where $c_1$ is the absolute constant appearing in Theorem 5). Of course, this is not likely to be the optimal choice.
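The preselection step (5) is straightforward to implement. The sketch below is our own minimal version for the bounded case (3), with the dictionary stored as an (n, M) matrix of predictions on $D_{n,1}$; the function name preselect and this data layout are our choices, not the authors'.

```python
import numpy as np

def preselect(pred1, Y1, x, b, c=2.0):
    """Preselection step (5): keep every f_j whose R_{n,1}(f_j) is below the threshold.

    pred1[i, j] = f_j(X_i) on the first half-sample D_{n,1}; Y1 are the labels.
    Returns the indices of F_hat_1 and the index j* of the ERM f_hat_{n,1}.
    """
    n, M = pred1.shape
    phi = b * np.sqrt((np.log(M) + x) / n)                 # bounded case (3)
    risks = np.mean((Y1[:, None] - pred1) ** 2, axis=0)    # R_{n,1}(f_j)
    j_star = int(np.argmin(risks))                         # ERM on D_{n,1}
    dist = np.sqrt(np.mean((pred1 - pred1[:, [j_star]]) ** 2, axis=0))
    threshold = risks[j_star] + c * np.maximum(phi * dist, phi ** 2)
    kept = np.flatnonzero(risks <= threshold)
    return kept, j_star
```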
2.2 The star-shaped aggregate

In this section, we give details for the computation of the star-shaped aggregate, namely the aggregate $\tilde f$ given by Definition 1 when $\hat F$ is (8). Indeed, if $\lambda \in [0, 1]$, we have
$$R_{n,2}(\lambda f + (1 - \lambda) g) = \lambda R_{n,2}(f) + (1 - \lambda) R_{n,2}(g) - \lambda (1 - \lambda) \|f - g\|_{n,2}^2,$$
so the minimum of $\lambda \mapsto R_{n,2}(\lambda f + (1 - \lambda) g)$ is achieved at
$$\lambda_{n,2}(f, g) = 0 \vee \Big( \frac{1}{2} \Big( \frac{R_{n,2}(g) - R_{n,2}(f)}{\|f - g\|_{n,2}^2} + 1 \Big) \wedge 1 \Big),$$
where $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$, and $\min_{\lambda \in [0,1]} R_{n,2}(\lambda f + (1 - \lambda) g) = R_{n,2}(\lambda_{n,2}(f, g) f + (1 - \lambda_{n,2}(f, g)) g)$ is thus given by
$$\begin{cases} R_{n,2}(f) & \text{if } R_{n,2}(g) - R_{n,2}(f) \ge \|f - g\|_{n,2}^2, \\ \dfrac{R_{n,2}(f) + R_{n,2}(g)}{2} - \dfrac{(R_{n,2}(f) - R_{n,2}(g))^2}{4 \|f - g\|_{n,2}^2} - \dfrac{\|f - g\|_{n,2}^2}{4} & \text{if } |R_{n,2}(f) - R_{n,2}(g)| \le \|f - g\|_{n,2}^2, \\ R_{n,2}(g) & \text{otherwise.} \end{cases}$$

Figure 2: Empirical risk $R_{n,1}(f)$, value of the threshold $R_{n,1}(\hat f_{n,1}) + 2 \max(\phi \|\hat f_{n,1} - f\|_{n,1}, \phi^2)$ and weights of the AEW (rescaled for illustration purposes) for $f \in F$, where $F$ is a dictionary obtained using LARS, see Section 4 below. Only the elements of $F$ with an empirical risk smaller than the threshold are kept in the dictionary, see Definition 1. The first and third examples correspond to a case where an aggregate with preselection step improves upon the AEW, while in the second example, both procedures behave similarly. (Three panels; each shows the sum of squares over the dictionary, the threshold, and the exponential weights.)

This leads to the following algorithm for the computation of $\tilde f$.

Algorithm 1: Computation of the star-shaped aggregate.
Input: dictionary $F$, data $(X_i, Y_i)_{i=1}^{2n}$, and a confidence level $x > 0$.
Output: star-shaped aggregate $\tilde f$.
1. Split $D_{2n}$ into two samples $D_{n,1}$ and $D_{n,2}$.
2. For each $j \in \{1, \ldots, M\}$: compute $R_{n,1}(f_j)$ and $R_{n,2}(f_j)$, and use this loop to find $\hat f_{n,1} \in \operatorname{argmin}_{f \in F} R_{n,1}(f)$.
3. For each $j \in \{1, \ldots, M\}$: compute $\|f_j - \hat f_{n,1}\|_{n,1}$ and $\|f_j - \hat f_{n,1}\|_{n,2}$.
4. Construct the set of preselected elements
$$\hat F_1 = \Big\{ f \in F : R_{n,1}(f) \le R_{n,1}(\hat f_{n,1}) + c \max\big( \phi \|\hat f_{n,1} - f\|_{n,1}, \phi^2 \big) \Big\},$$
where $\phi$ is given in Definition 1.
5. For each $f \in \hat F_1$: compute $R_{n,2}(\lambda_{n,2}(\hat f_{n,1}, f) \hat f_{n,1} + (1 - \lambda_{n,2}(\hat f_{n,1}, f)) f)$ and keep the element $f_\flat \in \hat F_1$ that minimizes this quantity.
6. Return $\tilde f = \lambda_{n,2}(\hat f_{n,1}, f_\flat) \hat f_{n,1} + (1 - \lambda_{n,2}(\hat f_{n,1}, f_\flat)) f_\flat$.
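Algorithm 1 translates into a few lines of numpy. The sketch below is our own implementation, not the authors' code; it reuses the hypothetical preselect helper from the previous sketch, uses the bounded-case $\phi$, and represents each $f_j$ by its prediction vectors on the two half-samples.

```python
import numpy as np

def lambda_star(r_f, r_g, d2):
    """Optimal mixing weight lambda_{n,2}(f, g); d2 = ||f - g||_{n,2}^2."""
    if d2 == 0.0:
        return 1.0 if r_f <= r_g else 0.0
    return min(1.0, max(0.0, 0.5 * ((r_g - r_f) / d2 + 1.0)))

def star_aggregate(pred1, pred2, Y1, Y2, x, b, c=2.0):
    """Star-shaped aggregate (Algorithm 1); pred1/pred2 are (n, M) predictions."""
    kept, j_star = preselect(pred1, Y1, x, b, c)            # step (5) on D_{n,1}
    r2 = np.mean((Y2[:, None] - pred2) ** 2, axis=0)        # R_{n,2}(f_j)
    best_val, best_j, best_lam = np.inf, j_star, 1.0
    for j in kept:                                          # ERM over the star
        d2 = np.mean((pred2[:, j] - pred2[:, j_star]) ** 2)
        lam = lambda_star(r2[j_star], r2[j], d2)
        mix = lam * pred2[:, j_star] + (1 - lam) * pred2[:, j]
        val = np.mean((Y2 - mix) ** 2)
        if val < best_val:
            best_val, best_j, best_lam = val, j, lam
    # f_tilde = lam * f_{j*} + (1 - lam) * f_{j_flat}: only two non-zero weights.
    return best_lam, j_star, best_j
```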
2.3 Suboptimality of penalized ERM

In this section, we prove that minimizing the empirical risk $R_n(\cdot)$ (or a penalized version of it, called PERM from now on) over $F(\Lambda)$ is a suboptimal aggregation procedure, both in expectation and in deviation. According to Tsybakov (2003b), the optimal rate of aggregation in the Gaussian regression model is $(\log M)/n$. This means that it is the minimum price one has to pay in order to mimic the best function among a class of $M$ functions with $n$ observations. This rate is achieved by the aggregate with cumulated exponential weights, see Catoni (2001), Yang (2000) and Juditsky et al. (2008). In Theorem 2 below, we prove that the usual PERM procedure cannot achieve this rate and thus that it is suboptimal compared to the aggregation methods with exponential weights. The lower bounds for aggregation methods appearing in the literature (see Tsybakov (2003b); Juditsky et al. (2008); Lecué (2006)) are usually based on minimax theory arguments. In particular, in Tsybakov (2003b), it is proved that a selector (that is, an aggregation procedure taking its values in the dictionary itself) cannot mimic the oracle faster than $\sqrt{(\log M)/n}$. This result implies the one that we have here, but it does not provide an explicit setup for which a given selector performs poorly. The result in Juditsky et al. (2008) says that, whatever the selector is, there exists a probability measure and a dictionary for which it cannot mimic the oracle faster than $\sqrt{(\log M)/n}$. The proof of this result does not tell explicitly which probabilistic setup is bad for this selector. In the present result, we are interested in a particular type of selector: the PERM, for some penalty. We can provide an explicit framework (dictionary and probabilistic setup) because the argument considered here is based on geometric considerations (in the same spirit as the lower bounds obtained in Lee et al. (1996) and Mendelson (2008)). The explicit example that makes the PERM fail is the following Gaussian regression model with uniform design:

Assumption 3 (G). Assume that $\varepsilon$ is standard Gaussian and that $X$ is univariate and uniformly distributed on $[0, 1]$.

The dictionary is constructed as follows. For the regression function, we take
$$f_0(x) = \begin{cases} 2h & \text{if } x(M) = 1, \\ h & \text{if } x(M) = 0, \end{cases} \qquad (9)$$
where $x$ has the dyadic decomposition $x = \sum_{k \ge 1} x(k) 2^{-k}$ with $x(k) \in \{0, 1\}$, and
$$h = \frac{C}{4} \sqrt{\frac{\log M}{n}}.$$
We consider the dictionary of functions $F_M = \{f_1, \ldots, f_M\}$ with
$$f_j(x) = 2 x(j) - 1, \quad \forall j \in \{1, \ldots, M\}, \qquad (10)$$
where again $(x(j) : j \ge 1)$ is the dyadic decomposition of $x \in [0, 1]$.

Figure 3: Example of a setup in which the ERM performs badly. The set $F(\Lambda) = \{f_1, \ldots, f_M\}$ is the dictionary from which we want to mimic the best element and $f_0$ is the regression function.

Theorem 2. There exists an absolute constant $c_0 > 0$ such that the following holds. Let $M \ge 2$ be an integer and assume that (G) holds. We can find a regression function $f_0$ and a family $F(\Lambda)$ of cardinality $M$ such that, if one considers a penalization satisfying
$$|\operatorname{pen}(f)| \le C \sqrt{(\log M)/n}, \quad \forall f \in F(\Lambda),$$
with $0 \le C < \sigma (24 \sqrt{2} c^*)^{-1}$ ($c^*$ is an absolute constant from the Sudakov minoration, see Theorem 7 in Appendix A.2), the PERM procedure defined by
$$\tilde f_n \in \operatorname{argmin}_{f \in F(\Lambda)} \big( R_n(f) + \operatorname{pen}(f) \big)$$
satisfies, with probability greater than $c_0$,
$$\|\tilde f_n - f_0\|^2 \ge \min_{f \in F(\Lambda)} \|f - f_0\|^2 + C_3 \sqrt{\frac{\log M}{n}}$$
for any integer $n \ge 1$ and $M \ge M_0(\sigma)$ such that $n^{-1} \log[(M - 1)(M - 2)] \le 1/4$, where $C_3$ is an absolute constant.

This result tells us that, in some particular cases, the PERM cannot mimic the best element in a class of cardinality $M$ faster than $((\log M)/n)^{1/2}$. This rate is very far from the optimal one, $(\log M)/n$. Of course, one can say that the PERM fails to achieve the optimal rate only in the very particular framework that we have constructed here. Nevertheless, this approach can be generalized (we refer the reader to Lecué and Mendelson (2009b) for instance). Finally, note that classical penalty functions are of the order of the complexity of the class divided by $n$, which in our aggregation setup is of the order of $(\log M)/n$. Thus, the restriction that we put on the penalty function covers the classical cases that one can meet in the literature on penalization methods.
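The construction (9)-(10) is easy to simulate. The following sketch is our own (the value of C and all names are arbitrary choices, and sigma is fixed to 1); it draws one sample from the bad design and returns the excess risk of the plain ERM (pen = 0), which is either 0 or h depending on whether the ERM picks $f_M$:

```python
import numpy as np

def dyadic_bits(x, M):
    """First M dyadic digits of each x in [0, 1): x = sum_k x(k) 2^{-k}."""
    bits = np.empty((x.size, M), dtype=int)
    frac = x.copy()
    for k in range(M):
        frac = frac * 2.0
        bits[:, k] = (frac >= 1.0).astype(int)
        frac -= bits[:, k]
    return bits

def erm_excess_risk(n, M, C=1.0, rng=None):
    """One draw of the excess L2 risk of the ERM on the dictionary (10)."""
    if rng is None:
        rng = np.random.default_rng()
    h = (C / 4.0) * np.sqrt(np.log(M) / n)
    X = rng.uniform(size=n)
    bits = dyadic_bits(X, M)
    f0 = np.where(bits[:, M - 1] == 1, 2 * h, h)           # regression (9)
    Y = f0 + rng.standard_normal(n)                        # sigma = 1
    preds = 2 * bits - 1                                   # f_j(X_i), shape (n, M)
    j_hat = int(np.argmin(np.mean((Y[:, None] - preds) ** 2, axis=0)))
    # Population squared L2 distances: ||f_j - f_0||^2 = 5h^2/2 + 1 for j < M
    # and 5h^2/2 - h + 1 for j = M (cf. the proof of Theorem 2 below).
    dist2 = np.full(M, 2.5 * h ** 2 + 1.0)
    dist2[M - 1] -= h
    return dist2[j_hat] - dist2.min()                      # excess risk: 0 or h
```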
Let $F(\Lambda)$ be the set that we consider in the proof of Theorem 2 (see Section 5 below), and take $\operatorname{pen}(f) = 0$. Using Monte-Carlo (we do 5000 loops), we compute the excess risk
$$\mathbb{E} \|\tilde f_n - f_0\|^2 - \min_{f \in F(\Lambda)} \|f - f_0\|^2$$
of the ERM. In Figure 4 below, we compare the excess risk and the bound $((\log M)/n)^{1/2}$ for several values of $M$ and $n$. It turns out that, for this set $F(\Lambda)$, the lower bound $((\log M)/n)^{1/2}$ is indeed accurate for the excess risk. Actually, by using the classical symmetrization argument and Dudley's entropy integral (or Pisier's inequality), it is easy to obtain an upper bound for the excess risk of the ERM of the order of $((\log M)/n)^{1/2}$ for any class $F(\Lambda)$ of cardinality $M$.

Figure 4: The excess risk of the ERM compared to $((\log M)/n)^{1/2}$ for several values of $M$ and $n$ ($x$-axis). (Three panels: $M = 50$, $M = 100$, $M = 200$.)

As an application of the aggregation Algorithm 1, we consider the problem of adaptation to the regularization parameter of a penalized empirical risk minimization procedure, denoted for short PERM in what follows.

3 An example of dictionary: penalized ERM

3.1 Definition and tools

Let us fix a function space $\mathcal{F}$, endowed with a seminorm $|\cdot|_{\mathcal{F}}$. The set $\mathcal{F}$ is a space of functions, such as a Sobolev, Besov or Reproducing Kernel Hilbert Space (RKHS), the latter being a common example in regularized learning, see Cucker and Smale (2002). A simple example (in the one-dimensional case) is the Sobolev space $W_2^s$ of functions such that $|f|_{\mathcal{F}}^2 = \int f^{(s)}(t)^2 \, dt < +\infty$, which corresponds to the so-called smoothing splines estimator, see Wahba (1990). A PERM (which stands for penalized empirical risk minimizer) minimizes the functional
$$\frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2 + \operatorname{pen}(f) \qquad (11)$$
over $\mathcal{F}$, where $\operatorname{pen}(f)$ is a quantity measuring the smoothness (or "roughness") of $f \in \mathcal{F}$. Typically, the penalization term writes $\operatorname{pen}(f) = h^2 |f|_{\mathcal{F}}^2$ (see van de Geer (2000) and Györfi et al. (2002) among others), where $h > 0$ is a regularization parameter. In Mendelson and Neeman (2009), sharp error bounds for the PERM are established, in the general context of a so-called ordered and parametrized hierarchy $\{\mathcal{F}_r : r > 0\}$. An example of such an ordered and parametrized hierarchy is $\mathcal{F}_r = r \mathcal{F}_1$, where $\mathcal{F}_1 = \{f \in \mathcal{F} : |f|_{\mathcal{F}} \le 1\}$. In the latter paper, a very sharp analysis is conducted when $\mathcal{F}$ is an RKHS, allowing for penalizations less than quadratic in the RKHS norm.
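As a concrete instance of (11) with the quadratic penalty $\operatorname{pen}(f) = h^2 |f|_{\mathcal{F}}^2$, here is a minimal kernel ridge regression sketch (our own illustration; the Gaussian kernel and its bandwidth are arbitrary assumptions, not choices made in the paper). By the representer theorem, the PERM over an RKHS takes the form $f = \sum_i \alpha_i K(X_i, \cdot)$ with $\alpha = (K + n h^2 I)^{-1} Y$:

```python
import numpy as np

def gaussian_kernel(A, B, bw=0.2):
    """Gram matrix K[i, j] = exp(-|A_i - B_j|^2 / (2 bw^2)) for 1-d inputs."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2.0 * bw ** 2))

def perm_rkhs(X, Y, h, bw=0.2):
    """PERM (11) with pen(f) = h^2 |f|_F^2 in an RKHS: kernel ridge regression."""
    K = gaussian_kernel(X, X, bw)
    n = X.size
    alpha = np.linalg.solve(K + n * h ** 2 * np.eye(n), Y)
    return lambda Xnew: gaussian_kernel(Xnew, X, bw) @ alpha

# Each value of the regularization parameter h yields one element of a
# dictionary F of PERM, which can then be aggregated as in Section 2.
```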
In this section, we use the tools proposed in Mendelson and Neeman (2009) to derive an error bound for the PERM using the standard penalty $\operatorname{pen}(f) = h^2 |f|_{\mathcal{F}}^2$, but when $\mathcal{F}$ is a Besov space. In the nonparametric estimation literature, Besov spaces are of particular interest since they include functions with inhomogeneous smoothness, for instance functions with rapid oscillations or bumps. Moreover, since the design random variable $X$ is eventually multivariate, the question of anisotropic smoothness naturally arises. Anisotropy means that the smoothness of the regression function $f_0$ differs in each direction. As far as we know, adaptive estimation of a multivariate curve with anisotropic smoothness was previously considered only in the Gaussian white noise or density models, see Kerkyacharian et al. (2001), Hoffmann and Lepski (2002), Kerkyacharian et al. (2007), Neumann (2000). There is no result concerning the adaptive estimation of the regression function with anisotropic smoothness on a general random design $X$. In order to simplify the definition of the anisotropic Besov space, we shall assume from now on that $\Omega = \mathbb{R}^d$.

Let us consider the following compactness assumption on the unit ball $\mathcal{F}_1$. It uses metric entropy, which is a standard measure of compactness in learning theory, see Cucker and Smale (2002) for instance. Recall that $\|f\|_\infty = \sup_{x \in \mathbb{R}^d} |f(x)|$, and denote by $C(\mathbb{R}^d)$ the set of continuous functions on $\mathbb{R}^d$, endowed with the $L^\infty$-norm. If $\mathcal{F}_1 \subset C(\mathbb{R}^d)$, we introduce $H_\infty(\mathcal{F}_1, \delta) = \log N_\infty(\mathcal{F}_1, \delta)$, where $N_\infty(\mathcal{F}_1, \delta)$ is the minimal number of $L^\infty$-balls with radius $\delta$ needed to cover $\mathcal{F}_1$.

Assumption 4 ($C_\beta$). Assume that $\mathcal{F}$ embeds continuously in $C(\mathbb{R}^d)$, and that there is a number $\beta \in (0, 2)$ such that for any $\delta > 0$, the unit ball of $\mathcal{F}$ satisfies:
$$H_\infty(\delta, \mathcal{F}_1) \le c \delta^{-\beta}, \qquad (12)$$
where $c > 0$ is independent of $\delta$.

This assumption entails $H_\infty(\delta, \mathcal{F}_r) \le c (r/\delta)^\beta$ for any $r > 0$. Moreover, the continuous embedding gives that $\|f\|_\infty \le c |f|_{\mathcal{F}}$ for any $f \in \mathcal{F}$. Assumption 4 is satisfied by nearly all the smoothness spaces considered in the nonparametric literature (at least when the smoothness of the space is large enough compared to the dimension, see below). Let us give an example. Let $B_{p,q}^s$ be the anisotropic Besov space with smoothness $s = (s_1, \ldots, s_d)$. This space is precisely defined in Appendix B. Each $s_i$ corresponds to the smoothness in the $i$-th coordinate. The computation of the entropy of $\mathcal{F}_1$ when $\mathcal{F} = B_{p,q}^s$ is done in Theorem 5.30 of Triebel (2006). Namely, if $\bar s$ is the harmonic mean of $s$, given by
$$\frac{1}{\bar s} := \frac{1}{d} \sum_{i=1}^d \frac{1}{s_i}, \qquad (13)$$
then the unit ball $\mathcal{F}_1$ of $\mathcal{F} = B_{p,q}^s$ satisfies Assumption 4 with $\beta = d/\bar s$, provided that $\bar s > d/p$, which is the usual condition for the embedding in $C(\mathbb{R}^d)$.

3.2 Local complexity using entropy

We now have in mind to use Theorem 2.5 of Mendelson and Neeman (2009) in order to derive a risk bound for the PERM. For this, we need a control on the local complexity of $\mathcal{F}_r$, for any $r > 0$. The complexity is measured in this paper by the expectation $\mathbb{E} \|P - P_n\|_{V_{r,\lambda}}$, where, for $r, \lambda > 0$, $V_{r,\lambda}$ is the class of excess losses
$$V_{r,\lambda} := \{ \alpha \mathcal{L}_{r,f} : 0 \le \alpha \le 1, f \in \mathcal{F}_r, \mathbb{E}(\alpha \mathcal{L}_{r,f}) \le \lambda \},$$
where $\mathcal{L}_{r,f} := (Y - f(X))^2 - (Y - f_r^*(X))^2$ and $f_r^* \in \operatorname{argmin}_{f \in \mathcal{F}_r} \mathbb{E}(Y - f(X))^2$. The next lemma (a proof is given in Appendix A.3) gives a bound on this measure of complexity under Assumption 4.

Lemma 1. Assume that $\|Y\|_\infty < +\infty$ and grant Assumption 4. One has, for any $r, \lambda > 0$:
$$\mathbb{E} \|P - P_n\|_{V_{r,\lambda}} \le c \max\Big[ r^2 n^{-1/(1 + \beta/2)}, \frac{r^{1 + \beta/2} \lambda^{(1 - \beta/2)/2}}{\sqrt{n}} \Big],$$
where $c = c_{\beta, \|Y\|_\infty}$.

This lemma, although probably not optimal, is sufficient to provide a satisfactory risk bound for the PERM with a penalization of the form $\operatorname{pen}(f) = h^2 |f|_{\mathcal{F}}^2$. It is close in spirit to a bound proposed in Loustau (2009) (see Theorem 1 therein) for the classification framework using a Besov penalization, with an extra assumption on the inputs $X_i$, since the proof involves a decomposition on a wavelet basis. Here we use only the entropy condition, together with some basic tools from empirical process theory, see the proof in Appendix A.3.
3.3 A risk bound for the PERM using entropy

Now, we can derive a risk bound for the PERM using Lemma 1 and the results of Mendelson and Neeman (2009). First, note that
$$\lambda/8 \ge c \max\Big[ r^2 n^{-1/(1 + \beta/2)}, \frac{r^{1 + \beta/2} \lambda^{(1 - \beta/2)/2}}{\sqrt{n}} \Big]$$
if and only if $\lambda \ge c r^2 n^{-1/(1 + \beta/2)}$. So, any $\lambda \ge c r^2 n^{-1/(1 + \beta/2)}$ satisfies, using Lemma 1, $\lambda/8 \ge \mathbb{E} \|P - P_n\|_{V_{r,\lambda}}$, and consequently, using the "isomorphic coordinate projection" [see Theorem 2.2 in Mendelson and Neeman (2009)], we have, for any $f \in \mathcal{F}_r$, that the following holds with probability larger than $1 - 2 e^{-x}$:
$$\frac{1}{2} P_n \mathcal{L}_{r,f} - \rho_n(r, x) \le P \mathcal{L}_{r,f} \le 2 P_n \mathcal{L}_{r,f} + \rho_n(r, x),$$
where
$$\rho_n(r, x) := c \Big( r^2 n^{-1/(1 + \beta/2)} + (1 + r^2) \frac{x}{n} \Big). \qquad (14)$$
This explains the shape of the usual quadratic penalization $\operatorname{pen}(f) = h^2 |f|_{\mathcal{F}}^2$, where $h = c n^{-1/(2 + \beta)}$ (up to the other term, which is of smaller order $1/n$), and this entails the following.

Theorem 3. Assume that $\|Y\|_\infty \le b$ and grant Assumption 4. Let $\rho_n(r, x)$ be given by (14) and define, for $r, y > 0$:
$$\theta(r, y) = y + \log(\pi^2/6) + 2 \log(1 + c n + \log r),$$
where $c = c_{\beta, b}$. Then, for any $x > 0$, with probability at least $1 - 2 \exp(-x)$, any $\bar f \in \mathcal{F}$ that minimizes the functional
$$P_n \ell_f + c_1 \rho_n(2 |f|_{\mathcal{F}}, \theta(|f|_{\mathcal{F}}, x))$$
over $\mathcal{F}$ (where $\ell_f(x, y) := (y - f(x))^2$ denotes the squared loss) also satisfies
$$P \ell_{\bar f} \le \inf_{f \in \mathcal{F}} \big( P \ell_f + c_2 \rho_n(2 |f|_{\mathcal{F}}, \theta(|f|_{\mathcal{F}}, x)) \big).$$

Proof. The conditions of Theorem 2.5 in Mendelson and Neeman (2009) are satisfied with $\rho_n(r, x)$ given by (14). The statement of the theorem easily follows from it, using the same arguments as in the proof of Theorem 3.7 therein.

Let us rewrite the result of Theorem 3. For any $x > 0$, if $\bar f$ is the PERM at level $x$, one has, with $\nu^n$-probability larger than $1 - 2 e^{-x}$:
$$P \ell_{\bar f} \le \inf_{r > 0} \Big\{ P \ell_{f_r} + c_1 r^2 n^{-\frac{1}{1 + \beta/2}} + c_2 \frac{1 + r^2}{n} \Big( x + \log\big(\tfrac{\pi^2}{6}\big) + \log(1 + c_3 n + \log r) \Big) \Big\},$$
where we recall that $P \ell_{\bar f} = \mathbb{E}[(Y - \bar f(X))^2 | X_1, \ldots, X_n]$ and $f_r \in \operatorname{argmin}_{f \in \mathcal{F}_r} R(f)$. This inequality proves that $\bar f$ adapts to the radius $|f|_{\mathcal{F}}$ of $f$ in $\mathcal{F}$. The leading term in the right-hand side of this inequality is $r^2 n^{-2/(2 + \beta)}$. If $\mathcal{F} = B_{p,\infty}^s$, it becomes $r^2 n^{-2\bar s/(2\bar s + d)}$, which is the minimax optimal rate of convergence over an anisotropic Besov space, see Kerkyacharian et al. (2007) for instance.
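For completeness, here is the one-line computation behind the last sentence (our own expansion of the exponent, using $\beta = d/\bar s$ from Section 3.1):
$$\frac{1}{1 + \beta/2} = \frac{1}{1 + d/(2\bar s)} = \frac{2\bar s}{2\bar s + d}, \qquad \text{so that} \qquad r^2 \, n^{-1/(1 + \beta/2)} = r^2 \, n^{-2\bar s/(2\bar s + d)}.$$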
3.4 Adaptive estimation over anisotropic Besov spaces

What we have in mind now is the application of Theorems 1 and 3 to the problem of adaptive estimation over a collection of anisotropic Besov spaces. Consider two vectors $s^{\min}$ and $s^{\max}$ in $\mathbb{R}_+^d$ with positive coordinates and harmonic means $\bar s^{\min}$ and $\bar s^{\max}$ respectively, satisfying $s^{\min} \le s^{\max}$ (that is, $s_i^{\min} \le s_i^{\max}$ for any $i \in \{1, \ldots, d\}$) and $\bar s^{\min} > d/\min(p, 2)$. Consider the collection of anisotropic Besov spaces
$$\big( B_{p,\infty}^s : s \in S \big), \quad \text{where} \quad S := \prod_{i=1}^d [s_i^{\min}, s_i^{\max}]. \qquad (15)$$
The strategy is to aggregate a dictionary of PERM, corresponding to a discretization of $S$, in order to adapt to the anisotropic smoothness of $f_0$. The steps are the following. To simplify, we shall assume that we have $2n$ observations.

Definition 2 (Adaptive estimator).
1. Split (at random) the whole sample $(X_i, Y_i)_{i=1}^{2n}$ into a training sample $(X_i, Y_i)_{i=1}^n$ and a learning sample $(X_i, Y_i)_{i=n+1}^{2n}$. Fix a confidence level $x > 0$.
2. Compute the uniform discretization of $S$ with step $(\log n)^{-1}$:
$$S_n := \prod_{i=1}^d \big\{ s_i^{\min} + k (\log n)^{-1} : 1 \le k \le [(s_i^{\max} - s_i^{\min}) \log n] \big\}. \qquad (16)$$
Then, for each $s \in S_n$, take $\bar f_s$ as a minimizer of the functional
$$\frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2 + \operatorname{pen}_s(f, x),$$
where
$$\operatorname{pen}_s(f, x) = c_1 n^{-2\bar s/(2\bar s + d)} |f|_{B_{p,\infty}^s}^2 + c_2 \frac{1 + |f|_{B_{p,\infty}^s}^2}{n} \Big( x + \log\big(\tfrac{\pi^2}{6}\big) + \log(1 + c_3 n + \log |f|_{B_{p,\infty}^s}) \Big).$$
If $b$ is such that $\|Y\|_\infty \le b$, consider the dictionary of truncated PERM
$$F_{\mathrm{PERM}} = \{ -b \vee \bar f_s \wedge b : s \in S_n \}.$$
3. Using the learning sample $(X_i, Y_i)_{i=n+1}^{2n}$, compute one of the aggregates $\tilde f$ given in Definition 1 using the dictionary $F_{\mathrm{PERM}}$.

The next theorem, which is an immediate consequence of Theorems 1 and 3, proves that the aggregate $\tilde f$ is minimax adaptive over the collection of anisotropic Besov spaces (15).

Theorem 4. Let $\tilde f$ be the aggregated estimator given in Definition 2. Assume that $\max(\|Y\|_\infty, \|f_0\|_\infty) \le b$ and that $f_0 \in B_{p,\infty}^{s_0}$ for some $s_0 \in S$, where $(B_{p,\infty}^s : s \in S)$ is the collection given by (15) and satisfies $\bar s^{\min} > d/p$. Then, with $\nu^{2n}$-probability larger than $1 - 4 e^{-x}$, we have:
$$\|\tilde f - f_0\|_{L^2(\mu)}^2 \le c_1 r_0^2 n^{-\frac{2 \bar s_0}{2 \bar s_0 + d}} + c_2 \frac{1 + r_0^2 + \log \log n}{n} \Big( x + \log(\pi^2/6) + c \log(1 + c_3 n + \log r_0) \Big),$$
where $r_0 = |f_0|_{B_{p,\infty}^{s_0}}$ and $\frac{1}{\bar s_0} = \frac{1}{d} \sum_{i=1}^d \frac{1}{s_{0,i}}$.

The dominating term in the right-hand side is of order $n^{-2\bar s_0/(2\bar s_0 + d)}$, which is the minimax optimal rate of convergence over an anisotropic Besov space (a minimax lower bound over $B_{p,q}^s$ can be easily obtained using standard arguments, such as the ones from Tsybakov (2003a), together with Bernstein estimates over $B_{p,\infty}^s$, which can be found in Triebel (2006) for instance). Note that there is no regular or sparse zone here, since the error of estimation is measured in the $L^2(\mu)$ norm. The result obtained here is stronger than the ones usually obtained in minimax theory, where one only gives an upper bound for $\mathbb{E} \|\tilde f - f_0\|_{L^2(\mu)}^2$, while here we give a concentration inequality for $\|\tilde f - f_0\|_{L^2(\mu)}^2$.

4 Simulation study

In this section, we propose a simulation study for the problem of selection of the smoothing parameter of the LASSO, see Tibshirani (1996); Efron et al. (2004). We simulate i.i.d. data $Y_i = \beta_0^\top X_i + \varepsilon_i$, where $\beta_0$ is a vector of size $p = 91$ given by
$$\beta_0 = (3, 1.5, 0_{30}, 2, -6, 4, 0_{25}, -4, 0_{15}, 2.5, 3, 0_{10}, 3, 1, -2),$$
where $0_n$ is the vector in $\mathbb{R}^n$ with each coordinate set to zero. The noise $\varepsilon_i$ is centered Gaussian with variance $\sigma^2$. The vector $X = (X^1, \ldots, X^p)$ is a centered Gaussian vector such that the correlation between $X^i$ and $X^j$ is $2^{-|i - j|}$ (following the examples from Tibshirani (1996)). Using the lars routine from R (www.r-project.org), we construct a dictionary $F$ made of the entire sequence of LASSO-type estimators, for the various regularization parameters coming out of the LARS algorithm. Then we compare the prediction error $|X(\hat\beta - \beta_0)|^2$ and the estimation error $|\hat\beta - \beta_0|^2$, where $\hat\beta$ is:
• $\hat\beta(C_p)$ = the LASSO with regularization parameter selected using Mallows' $C_p$ selection rule, see Efron et al. (2004);
• $\hat\beta(\mathrm{AEW})$ = the aggregate with exponential weights computed on $F$ with temperature parameter $4\sigma^2$, see for instance Dalalyan and Tsybakov (2007);
• $\hat\beta(\mathrm{star})$ = the star-shaped aggregate, see Algorithm 1, with constant $c = 2$. (A sketch of this experiment is given after this list.)
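The paper runs this comparison in R through the lars routine. The sketch below is our own rough Python translation using scikit-learn's lars_path; for simplicity it builds the dictionary on the training half only, whereas the paper refits the dictionary on the whole sample once the aggregation weights are computed.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Simulation design of Section 4 (sigma and n are free parameters here).
rng = np.random.default_rng(0)
n, sigma = 100, 2.0
beta0 = np.concatenate([[3, 1.5], np.zeros(30), [2, -6, 4], np.zeros(25),
                        [-4], np.zeros(15), [2.5, 3], np.zeros(10), [3, 1, -2]])
p = beta0.size                                    # p = 91
cov = 2.0 ** (-np.abs(np.subtract.outer(np.arange(p), np.arange(p))))
X = rng.multivariate_normal(np.zeros(p), cov, size=2 * n)
Y = X @ beta0 + sigma * rng.standard_normal(2 * n)

# Dictionary F: the whole LASSO path returned by LARS (one f_j per knot).
_, _, coefs = lars_path(X[:n], Y[:n], method="lasso")   # fit on the training half
pred1 = X[:n] @ coefs                             # f_j(X_i) on D_{n,1}
pred2 = X[n:] @ coefs                             # f_j(X_i) on D_{n,2}
# ...then feed (pred1, pred2, Y[:n], Y[n:]) to star_aggregate from Section 2.2.
```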
We compute the errors $|X(\hat\beta - \beta_0)|^2$ and $|\hat\beta - \beta_0|^2$ using 100 simulations, for several values of $n$ and $\sigma^2$. The splits are chosen at random, with size $n/2$ for training and $n/2$ for learning, for both the AEW and the star-shaped aggregate (we do not split the learning sample further). For both aggregates we do some jackknifing: instead of using a single aggregate, we compute a mean of 10 aggregates obtained with several splits chosen at random. This makes the final aggregates less dependent on the split. In order to make the oracle and Mallows-$C_p$ errors comparable to the errors of the aggregates (which need to split the data, while Mallows-$C_p$ does not), we compute the weights of aggregation using splitting, and then compute the aggregate using a dictionary $F$ computed on the whole sample. The conclusion is that, for this example, the star-shaped aggregate does a better job than both the AEW and the $C_p$ in most cases. When the noise level is not too high ($\sigma = 2$, which corresponds to an RSNR of 5), see the errors given in Figure 5, the star-shaped aggregate is always the best. When the noise level is high ($\sigma = 5$, RSNR = 2) and $n$ is small, see Figure 6, the story is different: the AEW is better than the star-shaped aggregate. In such an extreme situation, the AEW takes advantage of the averaging (recall that no coefficient is zero in the AEW). However, when $n$ becomes larger than $p$, the star-shaped aggregate again improves upon the AEW.

Figure 5: Errors $|\hat\beta - \beta_0|^2$ (first row) and $|X(\hat\beta - \beta_0)|^2$ (second row) for $\sigma = 2$ and $n = 70, 100, 150$ (boxplots for the oracle, the star-shaped aggregate, the AEW and Mallows-$C_p$).

Figure 6: Errors $|\hat\beta - \beta_0|^2$ (first row) and $|X(\hat\beta - \beta_0)|^2$ (second row) for $\sigma = 5$ and $n = 70, 100, 150$ (boxplots for the oracle, the star-shaped aggregate, the AEW and Mallows-$C_p$).

5 Proofs of the main results

5.1 Proof of Theorem 1

Proof of Theorem 1. Let us prove the result in the $\psi_2$ case; the other case is similar. Fix $x > 0$ and let $\hat F$ be either (6), (7) or (8). Set $d := \operatorname{diam}(\hat F_1, L^2(\mu))$. Consider the second half of the sample, $D_{n,2} = (X_i, Y_i)_{i=n+1}^{2n}$. By Corollary 1 (see Appendix A below), with probability at least $1 - 4 \exp(-x)$ (relative to $D_{n,2}$), we have, for every $f \in \hat F$:
$$\frac{1}{n} \sum_{i=n+1}^{2n} \mathcal{L}_{\hat F}(f)(X_i, Y_i) - \mathbb{E}\big[ \mathcal{L}_{\hat F}(f)(X, Y) | D_{n,1} \big] \le c (\sigma_\varepsilon + b) \max(d\phi, b\phi^2),$$
where $\mathcal{L}_{\hat F}(f)(X, Y) := (f(X) - Y)^2 - (f_{\hat F}(X) - Y)^2$ is the excess loss function relative to $\hat F$, $f_{\hat F} \in \operatorname{argmin}_{f \in \hat F} R(f)$, and where $\phi = \sqrt{((\log M + x) \log n)/n}$. By definition
of $\tilde f$, we have $\frac{1}{n} \sum_{i=n+1}^{2n} \mathcal{L}_{\hat F}(\tilde f)(X_i, Y_i) \le 0$, so, on this event (relative to $D_{n,2}$),
$$R(\tilde f) \le R(f_{\hat F}) + \mathbb{E}[\mathcal{L}_{\hat F}(\tilde f) | D_{n,1}] - \frac{1}{n} \sum_{i=n+1}^{2n} \mathcal{L}_{\hat F}(\tilde f)(X_i, Y_i) \qquad (17)$$
$$\le R(f_{\hat F}) + c (\sigma_\varepsilon + b) \max(d\phi, b\phi^2) = R(f_F) + c (\sigma_\varepsilon + b) \max(d\phi, b\phi^2) - \big( R(f_F) - R(f_{\hat F}) \big) =: R(f_F) + \beta,$$
and it remains to show that
$$\beta \le c_{b, \sigma_\varepsilon} (1 + x) \frac{\log M \log n}{n}.$$
When $\hat F$ is given by (6) or (7), the geometrical configuration is the same as in Lecué and Mendelson (2009a), so we skip the proof. Let us turn to the situation where $\hat F$ is given by (8). Recall that $\hat f_{n,1}$ is the ERM computed using $D_{n,1}$. Consider $f_1$ such that $\|\hat f_{n,1} - f_1\|_{L^2(\mu)} = \max_{f \in \hat F_1} \|\hat f_{n,1} - f\|_{L^2(\mu)}$, and note that $\|\hat f_{n,1} - f_1\|_{L^2(\mu)} \le d \le 2 \|\hat f_{n,1} - f_1\|_{L^2(\mu)}$. The mid-point $f_2 := (\hat f_{n,1} + f_1)/2$ belongs to $\operatorname{star}(\hat f_{n,1}, \hat F_1)$. Using the parallelogram identity, we have for any $u, v \in L^2(\nu)$:
$$\mathbb{E}_\nu \Big( \frac{u + v}{2} \Big)^2 \le \frac{\mathbb{E}_\nu(u^2) + \mathbb{E}_\nu(v^2)}{2} - \frac{\|u - v\|_{L^2(\nu)}^2}{4},$$
where for every $h \in L^2(\nu)$, $\mathbb{E}_\nu(h) = \mathbb{E} h(X, Y)$. In particular, for $u(X, Y) = \hat f_{n,1}(X) - Y$ and $v(X, Y) = f_1(X) - Y$, the mid-point is $(u(X, Y) + v(X, Y))/2 = f_2(X) - Y$. Hence,
$$R(f_2) = \mathbb{E}(f_2(X) - Y)^2 = \mathbb{E}\Big( \frac{\hat f_{n,1}(X) + f_1(X)}{2} - Y \Big)^2 \le \frac{1}{2} \mathbb{E}(\hat f_{n,1}(X) - Y)^2 + \frac{1}{2} \mathbb{E}(f_1(X) - Y)^2 - \frac{1}{4} \|\hat f_{n,1} - f_1\|_{L^2(\mu)}^2 \le \frac{1}{2} R(\hat f_{n,1}) + \frac{1}{2} R(f_1) - \frac{d^2}{16},$$
where the expectations are taken conditionally on $D_{n,1}$. By Lemma 4 (see Appendix A below), since $\hat f_{n,1}, f_1 \in \hat F_1$, we have
$$\frac{1}{2} R(\hat f_{n,1}) + \frac{1}{2} R(f_1) \le R(f_F) + c (\sigma_\varepsilon + b) \max(\phi d, b\phi^2),$$
and thus, since $f_2 \in \hat F$,
$$R(f_{\hat F}) \le R(f_2) \le R(f_F) + c (\sigma_\varepsilon + b) \max(\phi d, b\phi^2) - c d^2.$$
Therefore,
$$\beta = c (\sigma_\varepsilon + b) \max(d\phi, b\phi^2) - \big( R(f_F) - R(f_{\hat F}) \big) \le c (\sigma_\varepsilon + b) \max(\phi d, b\phi^2) - c d^2.$$
Finally, if $d \ge c_{\sigma_\varepsilon, b} \phi$ then $\beta \le 0$; otherwise $\beta \le c_{\sigma_\varepsilon, b} \phi^2$.

Proof of Theorem 2. The dictionary $F_M$ is chosen so that we have, for any $j \in \{1, \ldots, M - 1\}$,
$$\|f_j - f_0\|_{L^2([0,1])}^2 = \frac{5 h^2}{2} + 1 \quad \text{and} \quad \|f_M - f_0\|_{L^2([0,1])}^2 = \frac{5 h^2}{2} - h + 1.$$
Thus, we have
$$\min_{j = 1, \ldots, M} \|f_j - f_0\|_{L^2([0,1])}^2 = \|f_M - f_0\|_{L^2([0,1])}^2 = \frac{5 h^2}{2} - h + 1.$$
This geometrical setup for $F(\Lambda)$, which is an unfavourable setup for the ERM, is represented in Figure 3. For
$$\hat f_n := \tilde f_n^{\mathrm{PERM}} \in \operatorname{argmin}_{f \in F_M} \big( R_n(f) + \operatorname{pen}(f) \big),$$
where we take $R_n(f) = \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2 = \|Y - f\|_n^2$, we have
$$\mathbb{E} \|\hat f_n - f_0\|_{L^2([0,1])}^2 = \min_{j = 1, \ldots, M} \|f_j - f_0\|_{L^2([0,1])}^2 + h \, \mathbb{P}[\hat f_n \ne f_M]. \qquad (18)$$
Now, we upper bound $\mathbb{P}[\hat f_n = f_M]$. We consider the dyadic decomposition of the design variable $X$:
$$X = \sum_{k=1}^{+\infty} X(k) 2^{-k}, \qquad (19)$$
where $(X(k) : k \ge 1)$ is a sequence of i.i.d. Bernoulli random variables with parameter $1/2$ (because $X$ is uniformly distributed on $[0, 1]$). If we define
$$N_j := \frac{1}{\sqrt{n}} \sum_{i=1}^n \zeta_i^{(j)} \varepsilon_i \quad \text{and} \quad \zeta_i^{(j)} := 2 X_i(j) - 1,$$
we have, by the definition of $h$ and since $\zeta_i^{(j)} \in \{-1, 1\}$:
$$\frac{\sqrt{n}}{2\sigma} \big( \|Y - f_M\|_n^2 - \|Y - f_j\|_n^2 \big) = N_j - N_M + \frac{h}{2\sigma\sqrt{n}} \sum_{i=1}^n \big( \zeta_i^{(j)} \zeta_i^{(M)} + 3(\zeta_i^{(j)} - \zeta_i^{(M)}) - 1 \big) \ge N_j - N_M - \frac{4C}{\sigma} \sqrt{\log M}.$$
This entails, with $\bar N_{M-1} := \max_{1 \le j \le M-1} N_j$, that
$$\mathbb{P}[\hat f_n = f_M] = \mathbb{P}\Big[ \bigcap_{j=1}^{M-1} \big\{ \|Y - f_M\|_n^2 - \|Y - f_j\|_n^2 \le \operatorname{pen}(f_j) - \operatorname{pen}(f_M) \big\} \Big] \le \mathbb{P}\Big[ N_M \ge \bar N_{M-1} - \frac{6C}{\sigma} \sqrt{\log M} \Big].$$
It is easy to check that $N_1, \ldots, N_M$ are $M$ standard Gaussian random variables which are uncorrelated (but dependent). We denote by $\zeta$ the family of Rademacher variables $(\zeta_i^{(j)} : i = 1, \ldots, n; j = 1, \ldots, M)$. We have, for any $6C/\sigma < \gamma < (2\sqrt{2} c^*)^{-1}$ ($c^*$ is the "Sudakov constant", see Theorem 7),
$$\mathbb{P}[\hat f_n = f_M] \le \mathbb{E}\Big[ \mathbb{P}\Big( N_M \ge \bar N_{M-1} - \frac{6C}{\sigma} \sqrt{\log M} \,\Big|\, \zeta \Big) \Big] \le \mathbb{P}\Big[ N_M \ge -\gamma \sqrt{\log M} + \mathbb{E}(\bar N_{M-1} | \zeta) \Big] + \mathbb{E}\Big[ \mathbb{P}\Big( \mathbb{E}(\bar N_{M-1} | \zeta) - \bar N_{M-1} \ge \big( \gamma - \tfrac{6C}{\sigma} \big) \sqrt{\log M} \,\Big|\, \zeta \Big) \Big]. \qquad (20)$$
Conditionally on $\zeta$, the vector $(N_1, \ldots, N_{M-1})$ is a linear transform of the Gaussian vector $(\varepsilon_1, \ldots, \varepsilon_n)$. Hence, conditionally on $\zeta$, $(N_1, \ldots, N_{M-1})$ is a Gaussian vector. Thus, we can use a standard deviation result for the supremum of Gaussian random vectors (see for instance Massart (2007), Chapter 3.2.4), which leads to the following inequality for the second term of the right-hand side of (20):
$$\mathbb{P}\Big( \mathbb{E}(\bar N_{M-1} | \zeta) - \bar N_{M-1} \ge \big( \gamma - \tfrac{6C}{\sigma} \big) \sqrt{\log M} \,\Big|\, \zeta \Big) \le \exp\big( -(3C/\sigma - \gamma/2)^2 \log M \big).$$
Note that we used $\mathbb{E}[N_j^2 | \zeta] = 1$ for any $j = 1, \ldots, M - 1$. For the first term in the right-hand side of (20), we have
$$\mathbb{P}\big[ N_M \ge -\gamma \sqrt{\log M} + \mathbb{E}(\bar N_{M-1} | \zeta) \big] \le \mathbb{P}\big[ N_M \ge -2\gamma \sqrt{\log M} + \mathbb{E}(\bar N_{M-1}) \big] + \mathbb{P}\big[ -\gamma \sqrt{\log M} + \mathbb{E}(\bar N_{M-1}) \ge \mathbb{E}(\bar N_{M-1} | \zeta) \big]. \qquad (21)$$
Next, we use Sudakov's theorem (cf. Theorem 7 in Appendix A.2) to lower bound $\mathbb{E}(\bar N_{M-1})$. Since $(N_1, \ldots, N_{M-1})$ is, conditionally on $\zeta$, a Gaussian vector, and since for any $1 \le j \ne k \le M$ we have
$$\mathbb{E}[(N_k - N_j)^2 | \zeta] = \frac{1}{n} \sum_{i=1}^n (\zeta_i^{(k)} - \zeta_i^{(j)})^2,$$
then, according to Sudakov's minoration (cf. Theorem 7 in the Appendix), there exists an absolute constant $c^* > 0$ such that
$$c^* \mathbb{E}[\bar N_{M-1} | \zeta] \ge \min_{1 \le j \ne k \le M-1} \Big( \frac{1}{n} \sum_{i=1}^n (\zeta_i^{(k)} - \zeta_i^{(j)})^2 \Big)^{1/2} \sqrt{\log M}.$$
Thus, we have
$$c^* \mathbb{E}[\bar N_{M-1}] \ge \mathbb{E}\Big[ \min_{j \ne k} \Big( \frac{1}{n} \sum_{i=1}^n (\zeta_i^{(k)} - \zeta_i^{(j)})^2 \Big)^{1/2} \Big] \sqrt{\log M} \ge \sqrt{2} \Big( 1 - \mathbb{E}\Big[ \max_{j \ne k} \frac{1}{n} \sum_{i=1}^n \zeta_i^{(k)} \zeta_i^{(j)} \Big] \Big) \sqrt{\log M},$$
where we used the fact that $\sqrt{x} \ge x/\sqrt{2}$ for all $x \in [0, 2]$. Besides, using Hoeffding's inequality, we have $\mathbb{E}[\exp(s \xi^{(j,k)})] \le \exp(s^2/(2n))$ for any $s > 0$, where $\xi^{(j,k)} := n^{-1} \sum_{i=1}^n \zeta_i^{(k)} \zeta_i^{(j)}$. Then, using a maximal inequality (cf. Theorem 8 in Appendix A.2), and since $n^{-1} \log[(M-1)(M-2)] \le 1/4$, we have
$$\mathbb{E}\Big[ \max_{j \ne k} \frac{1}{n} \sum_{i=1}^n \zeta_i^{(k)} \zeta_i^{(j)} \Big] \le \Big( \frac{1}{n} \log[(M-1)(M-2)] \Big)^{1/2} \le \frac{1}{2}. \qquad (22)$$
This entails
$$c^* \mathbb{E}[\bar N_{M-1}] \ge \Big( \frac{\log M}{2} \Big)^{1/2}.$$
Thus, using this inequality in the first term of the right-hand side of (21), and the usual inequality on the tail of a Gaussian random variable ($N_M$ is standard Gaussian), we obtain:
$$\mathbb{P}\big[ N_M \ge -2\gamma \sqrt{\log M} + \mathbb{E}(\bar N_{M-1}) \big] \le \mathbb{P}\big[ N_M \ge ((c^* \sqrt{2})^{-1} - 2\gamma) \sqrt{\log M} \big] \le \exp\big( -((c^* \sqrt{2})^{-1} - 2\gamma)^2 (\log M)/2 \big). \qquad (23)$$
Note that we used $2\sqrt{2} c^* \gamma < 1$. For the second term in (21), we apply the concentration inequality of Theorem 6 to the non-negative random variable $\mathbb{E}[\bar N_{M-1} | \zeta]$. We first have to control the second moment of this variable. We know that, conditionally on $\zeta$, $N_j | \zeta \sim \mathcal{N}(0, 1)$, thus $N_j | \zeta \in L_{\psi_2}$ (for more details on Orlicz norms, we refer the reader to van der Vaart and Wellner (1996)). Thus, $\|\max_{1 \le j \le M-1} N_j | \zeta\|_{\psi_2} \le K \psi_2^{-1}(M) \max_{1 \le j \le M-1} \|N_j | \zeta\|_{\psi_2}$ (cf. Lemma 2.2.2 in van der Vaart and Wellner (1996)). Since $\|N_j | \zeta\|_{\psi_2}$ is an absolute constant, we have $\|\max_{1 \le j \le M-1} N_j | \zeta\|_{\psi_2} \le K \sqrt{\log M}$.
In particular, we have $\mathbb{E}[\max_{1 \le j \le M-1} N_j^2 | \zeta] \le K \log M$, and so $\mathbb{E}\big[ \mathbb{E}[\bar N_{M-1} | \zeta]^2 \big] \le K \log M$. Then, Theorem 6 provides
$$\mathbb{P}\big[ -\gamma \sqrt{\log M} + \mathbb{E}[\bar N_{M-1}] \ge \mathbb{E}[\bar N_{M-1} | \zeta] \big] \le \exp(-\gamma^2/c_0), \qquad (24)$$
where $c_0$ is an absolute constant. Finally, combining (21), (23) and (24) in the initial inequality (20), we obtain
$$\mathbb{P}[\hat f_n = f_M] \le \exp\big( -(3C/\sigma - \gamma/2)^2 \log M \big) + \exp\big( -((c^* \sqrt{2})^{-1} - 2\gamma)^2 (\log M)/2 \big) + \exp(-\gamma^2/c_0).$$
Take $\gamma = (12\sqrt{2} c^*)^{-1}$. It is easy to find an integer $M_0(\sigma)$, depending only on $\sigma$, such that for any $M \ge M_0$, we have $\mathbb{P}[\hat f_n = f_M] \le c_1 < 1$, where $c_1$ is an absolute constant. We complete the proof by using this last result in (18).

Proof of Theorem 4. Recall that we use the sample $D_{n,1}$ to compute the family $F = \{-b \vee \bar f_s \wedge b : s \in S_n\}$ of PERM, which has cardinality $c (\log n)^d$, and the sample $D_{n,2}$ to compute the weights of the aggregate $\tilde f$, see Definition 2. Recall also that there is $s_0 = (s_{0,1}, \ldots, s_{0,d}) \in S$ such that $f_0 \in B_{p,\infty}^{s_0}$, and denote $r_0 = |f_0|_{B_{p,\infty}^{s_0}}$. Take $s_* = (s_{*,1}, \ldots, s_{*,d}) \in S_n$ such that $s_{*,j} \le s_{0,j} \le s_{*,j} + (\log n)^{-1}$ for all $j = 1, \ldots, d$. Note that, for this choice, one has $B_{p,\infty}^{s_0} \subset B_{p,\infty}^{s_*}$ and $n^{-2\bar s_*/(2\bar s_* + d)} \le e^{d/2} n^{-2\bar s_0/(2\bar s_0 + d)}$. By subtracting $P\ell_{f_0}$ on both sides of the oracle inequality stated in Theorem 1, and since $\|f_0\|_\infty \le b$, we can find an event $A_{x,n,2}$ satisfying $\nu^{2n}(A_{x,n,2}) \ge 1 - 2 e^{-x}$ on which:
$$\|\tilde f - f_0\|_{L^2(\mu)}^2 \le \mathbb{E}(\ell_{\bar f_{s_*}} - \ell_{f_0} | D_{n,1}) + c (1 + x) \frac{\log \log n}{n}.$$
Now, using Theorem 3, we can find an event $A_{x,n,1}$ satisfying $\nu^{2n}(A_{x,n,1}) \ge 1 - 2 e^{-x}$ on which:
$$\mathbb{E}(\ell_{\bar f_{s_*}} - \ell_{f_0} | D_{n,1}) \le \inf_{f : |f|_{B_{p,\infty}^{s_*}} \le r_0} \mathbb{E}(\ell_f - \ell_{f_0}) + c_1 r_0^2 n^{-\frac{2\bar s_*}{2\bar s_* + d}} + c_2 \frac{1 + r_0^2}{n} \Big( x + \log\big(\tfrac{\pi^2}{6}\big) + \log(1 + c_3 n + \log r_0) \Big) \le c_1' r_0^2 n^{-\frac{2\bar s_0}{2\bar s_0 + d}} + c_2 \frac{1 + r_0^2}{n} \Big( x + \log\big(\tfrac{\pi^2}{6}\big) + \log(1 + c_3 n + \log r_0) \Big),$$
where we used the fact that $|f_0|_{B_{p,\infty}^{s_*}} \le |f_0|_{B_{p,\infty}^{s_0}} \le r_0$. This concludes the proof of Theorem 4, since $\nu^{2n}(A_{x,n,1} \cap A_{x,n,2}) \ge 1 - 4 e^{-x}$.

A Tools from empirical process theory

A.1 Useful results from the literature

The following theorem is a Talagrand-type concentration inequality (see Talagrand (1996)) for a class of unbounded functions.

Theorem 5 (Theorem 4 in Adamczak (2008)). Assume that $X, X_1, \ldots, X_n$ are independent random variables and $\mathcal{F}$ is a countable set of functions such that $\mathbb{E} f(X) = 0$ for all $f \in \mathcal{F}$ and, for some $\alpha \in (0, 1]$, $\|\sup_{f \in \mathcal{F}} f(X)\|_{\psi_\alpha} < +\infty$. Define
$$Z := \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n f(X_i), \qquad \sigma^2 = \sup_{f \in \mathcal{F}} \mathbb{E} f(X)^2 \qquad \text{and} \qquad b := \frac{\big\| \max_{i=1,\ldots,n} \sup_{f \in \mathcal{F}} |f(X_i)| \big\|_{\psi_\alpha}}{n^{1 - 1/\alpha}}.$$
Then, for any $\eta \in (0, 1)$ and $\delta > 0$, there is $c = c_{\alpha, \eta, \delta}$ such that for any $x > 0$:
$$\mathbb{P}\Big[ Z \ge (1 + \eta) \mathbb{E} Z + \sigma \sqrt{\frac{2(1 + \delta) x}{n}} + c b \Big( \frac{x}{n} \Big)^{1/\alpha} \Big] \le 4 e^{-x},$$
$$\mathbb{P}\Big[ Z \le (1 - \eta) \mathbb{E} Z - \sigma \sqrt{\frac{2(1 + \delta) x}{n}} - c b \Big( \frac{x}{n} \Big)^{1/\alpha} \Big] \le 4 e^{-x}.$$

A.2 Some probabilistic tools

For the first theorem, we refer to Einmahl and Mason (1996). The two following theorems can be found, for instance, in Massart (2007); van der Vaart and Wellner (1996); Ledoux and Talagrand (1991).

Theorem 6 (Einmahl and Mason (1996)). Let $Z_1, \ldots, Z_n$ be $n$ independent non-negative random variables such that $\mathbb{E}[Z_i^2] \le \sigma^2$ for all $i = 1, \ldots, n$. Then, for any $\delta > 0$,
$$\mathbb{P}\Big[ \sum_{i=1}^n (Z_i - \mathbb{E}[Z_i]) \le -n\delta \Big] \le \exp\Big( -\frac{n\delta^2}{2\sigma^2} \Big).$$
Theorem 7 (Sudakov). There exists an absolute constant $c^* > 0$ such that for any integer $M$ and any centered Gaussian vector $X = (X_1, \ldots, X_M)$ in $\mathbb{R}^M$, we have
$$c^* \mathbb{E}\big[ \max_{1 \le j \le M} X_j \big] \ge \epsilon \sqrt{\log M},$$
where $\epsilon := \min\big\{ \sqrt{\mathbb{E}[(X_i - X_j)^2]} : i \ne j \in \{1, \ldots, M\} \big\}$.

Theorem 8 (Maximal inequality). Let $Y_1, \ldots, Y_M$ be $M$ random variables satisfying $\mathbb{E}[\exp(s Y_j)] \le \exp((s^2 \sigma^2)/2)$ for any integer $j$ and any $s > 0$. Then, we have
$$\mathbb{E}\big[ \max_{1 \le j \le M} Y_j \big] \le \sigma \sqrt{\log M}.$$
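As a quick numerical sanity check (ours, not from the paper), one can simulate the expected maximum of M i.i.d. standard Gaussian variables and watch it grow at the $\sqrt{\log M}$ rate appearing in Theorems 7 and 8; we compare it with $\sqrt{2 \log M}$, the constant-explicit form of the maximal inequality for i.i.d. standard Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)
for M in (10, 100, 1000, 10000):
    # Monte-Carlo estimate of E[max_{j<=M} X_j] for i.i.d. N(0, 1) variables.
    emax = rng.standard_normal((500, M)).max(axis=1).mean()
    print(M, round(emax, 3), round(np.sqrt(2 * np.log(M)), 3))
# Both columns grow like sqrt(log M): the upper bound is tight up to an
# absolute constant, which is the content of Sudakov's lower bound.
```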
A.3 Some technical lemmas

In this section, we state some technical lemmas used in the proof of Theorem 1.

Notation. Given a sample $(Z_i)_{i=1}^n$, we set the random empirical measure $P_n := n^{-1} \sum_{i=1}^n \delta_{Z_i}$. For any function $f$, define $(P - P_n)(f) := \mathbb{E} f(Z) - n^{-1} \sum_{i=1}^n f(Z_i)$ and, for a class of functions $\mathcal{F}$, define $\|P - P_n\|_{\mathcal{F}} := \sup_{f \in \mathcal{F}} |(P - P_n)(f)|$. In all that follows, we denote by $c$ an absolute positive constant that can vary from place to place; its dependence on the parameters of the setting is specified in each statement.

Proof of Lemma 1. We first use Lemma 4.6 of Mendelson and Neeman (2009) to obtain
$$\mathbb{E} \|P - P_n\|_{V_{r,\lambda}} \le 2 \sum_{i=0}^\infty 2^{-i} \mathbb{E} \|P - P_n\|_{L_{r, 2^{i+1}\lambda}}, \qquad (25)$$
where $L_{r, 2^{i+1}\lambda} := \{ \mathcal{L}_{r,f} : f \in \mathcal{F}_r, \mathbb{E} \mathcal{L}_{r,f} \le 2^{i+1}\lambda \}$. Let $i \in \mathbb{N}$. Using the Giné-Zinn symmetrization argument, see Giné and Zinn (1984), we have
$$\mathbb{E} \|P - P_n\|_{L_{r, 2^{i+1}\lambda}} \le \frac{2}{n} \mathbb{E}_{(X,Y)} \mathbb{E}_\epsilon \Big[ \sup_{\mathcal{L} \in L_{r, 2^{i+1}\lambda}} \sum_{i=1}^n \epsilon_i \mathcal{L}(X_i, Y_i) \Big],$$
where $(\epsilon_i)$ is a sequence of i.i.d. Rademacher variables. Recall that there is (see Ledoux and Talagrand (1991)) an absolute constant $c_g$ such that for any $T \subset \mathbb{R}^n$, we have
$$\mathbb{E}_\epsilon \Big[ \sup_{t \in T} \sum_{i=1}^n \epsilon_i t_i \Big] \le c_g \mathbb{E}_g \Big[ \sup_{t \in T} \sum_{i=1}^n g_i t_i \Big], \qquad (26)$$
where $(g_i)$ are i.i.d. standard normal. So, we have
$$\mathbb{E} \|P - P_n\|_{L_{r, 2^{i+1}\lambda}} \le \frac{2 c_g}{n} \mathbb{E}_{(X,Y)} \mathbb{E}_g \Big[ \sup_{\mathcal{L} \in L_{r, 2^{i+1}\lambda}} \sum_{i=1}^n g_i \mathcal{L}(X_i, Y_i) \Big].$$
Consider the Gaussian process $f \mapsto Z_f := \sum_{i=1}^n g_i \mathcal{L}_f(X_i, Y_i)$ indexed by $\mathcal{F}_{r, 2^{i+1}\lambda} := \{ f \in \mathcal{F}_r : \mathbb{E} \mathcal{L}_{r,f} \le 2^{i+1}\lambda \}$. For every $f, f' \in \mathcal{F}_{r, 2^{i+1}\lambda}$, we have (conditionally on the observations)
$$\mathbb{E}_g |Z_f - Z_{f'}|^2 \le 4 (\|Y\|_\infty + r)^2 \, \mathbb{E}_g |Z_f' - Z_{f'}'|^2,$$
where $Z_f' := \sum_{i=1}^n g_i (f(X_i) - f_r^*(X_i))$ and $f_r^* \in \operatorname{argmin}_{f \in \mathcal{F}_r} R(f)$. Using the convexity of $\mathcal{F}_r$, it is easy to get $\mathbb{E}[\mathcal{L}_{r,f}] \ge \|f - f_r^*\|^2$ for all $f \in \mathcal{F}_r$. Define $B_{r, 2^{i+1}\lambda} := \{ f - f_r^* : f \in \mathcal{F}_r, \|f - f_r^*\| \le \sqrt{2^{i+1}\lambda} \}$. Using Slepian's lemma (see, e.g., Ledoux and Talagrand (1991); Dudley (1999)), we have
$$\mathbb{E} \|P - P_n\|_{L_{r, 2^{i+1}\lambda}} \le c_Y (r + 1) E, \quad \text{where we put} \quad E := \frac{1}{n} \mathbb{E}_X \mathbb{E}_g \Big[ \sup_{f \in B_{r, 2^{i+1}\lambda}} \sum_{i=1}^n g_i f(X_i) \Big].$$
Moreover, using Dudley's entropy integral argument (again, see Ledoux and Talagrand (1991); Dudley (1999); Massart (2007)) and Assumption 4, we have
$$E \le \frac{12}{\sqrt{n}} \mathbb{E}_X \int_0^\Delta \sqrt{\log N(B_{r, 2^{i+1}\lambda}, \|\cdot\|_n, t)} \, dt \le \frac{12\sqrt{c}}{\sqrt{n}} \mathbb{E}_X \int_0^\Delta \Big( \frac{r}{t} \Big)^{\beta/2} dt = \frac{c_\beta}{\sqrt{n}} r^{\beta/2} \mathbb{E}_X[\Delta^{1 - \beta/2}] \le \frac{c_\beta}{\sqrt{n}} r^{\beta/2} (\mathbb{E}_X[\Delta^2])^{(1 - \beta/2)/2},$$
where $c_\beta = 12\sqrt{c}/(1 - \beta/2)$ and $\Delta := \operatorname{diam}(B_{r, 2^{i+1}\lambda}, \|\cdot\|_n)$. But, if $B^2_{r, 2^{i+1}\lambda} := \{ f^2 : f \in B_{r, 2^{i+1}\lambda} \}$, one has, using a contraction argument (see Ledoux and Talagrand (1991), Chapter 4) together with another Giné-Zinn symmetrization,
$$\mathbb{E}_X[\Delta^2] \le \mathbb{E}_X \|P - P_n\|_{B^2_{r, 2^{i+1}\lambda}} + 2^{i+1}\lambda \le 4 c_g (r + 1) E + 2^{i+1}\lambda.$$
Hence, $E$ satisfies
$$E \le \frac{c}{\sqrt{n}} r^{\beta/2} \big( (r + 1) E + 2^{i+1}\lambda \big)^{(1 - \beta/2)/2},$$
thus
$$\mathbb{E} \|P - P_n\|_{L_{r, 2^{i+1}\lambda}} \le \max\Big[ \frac{r^2}{n^{2/(2 + \beta)}}, \frac{r^{1 + \beta/2} (2^{i+1}\lambda)^{1/2 - \beta/4}}{\sqrt{n}} \Big].$$
Plugging this last result into the sum of (25) entails the result.

Lemma 2. Define $d(F) := \operatorname{diam}(F, L^2(\mu))$, $\sigma^2(F) = \sup_{f \in F} \mathbb{E}[f(X)^2]$, $C = \operatorname{conv}(F)$ and
$$\mathcal{L}_C(C) = \{ (Y - f(X))^2 - (Y - f_C(X))^2 : f \in C \},$$
where $f_C \in \operatorname{argmin}_{g \in C} R(g)$. If $\max(\|Y\|_\infty, \sup_{f \in F} \|f\|_\infty) \le b$, we have
$$\mathbb{E}\Big[ \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n f^2(X_i) \Big] \le c \max\Big( \sigma^2(F), \frac{b^2 \log M}{n} \Big)$$
and
$$\mathbb{E} \|P_n - P\|_{\mathcal{L}_C(C)} \le c b \sqrt{\frac{\log M}{n}} \max\Big( b \sqrt{\frac{\log M}{n}}, d(F) \Big).$$
If $\max(\|\varepsilon\|_{\psi_2}, \|\sup_{f \in F} |f(X) - f_0(X)|\|_{\psi_2}) \le b$, we have
$$\mathbb{E}\Big[ \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n f^2(X_i) \Big] \le c \max\Big( \sigma^2(F), \frac{b^2 \log M \log n}{n} \Big)$$
and
$$\mathbb{E} \|P_n - P\|_{\mathcal{L}_C(C)} \le c b \sqrt{\frac{\log M \log n}{n}} \max\Big( b \sqrt{\frac{\log M \log n}{n}}, d(F) \Big).$$

Proof. First, consider the case where $\|\sup_{f \in F} |f(X) - f_0(X)|\|_{\psi_2} \le b$. Define
$$r^2 = \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n f(X_i)^2,$$
and note that $\mathbb{E}_X(r^2) \le \mathbb{E}_X \|P - P_n\|_{F^2} + \sigma(F)^2$, where $F^2 := \{ f^2 : f \in F \}$. Using the same argument as at the beginning of the proof of Lemma 1 above, we have
$$\mathbb{E}_X \|P - P_n\|_{F^2} \le \frac{c}{n} \mathbb{E}_X \mathbb{E}_g \Big[ \sup_{f \in F} \sum_{i=1}^n g_i f^2(X_i) \Big].$$
The process $f \mapsto Z_{2,f} = \sum_{i=1}^n g_i f^2(X_i)$ is Gaussian, with intrinsic distance
$$\mathbb{E}_g |Z_{2,f} - Z_{2,f'}|^2 = \sum_{i=1}^n (f(X_i)^2 - f'(X_i)^2)^2 \le d_{n,\infty}(f, f')^2 \times 4 n r^2,$$
where $d_{n,\infty}(f, f') = \max_{i=1,\ldots,n} |f(X_i) - f'(X_i)|$. So, using Dudley's entropy integral, we have
$$\mathbb{E}_g \|P - P_n\|_{F^2} \le \frac{c}{\sqrt{n}} \int_0^{\Delta_{n,\infty}(F)} \sqrt{\log N(F, d_{n,\infty}, t)} \, dt \le c r \Delta_{n,\infty}(F) \sqrt{\frac{\log M}{n}},$$
where $\Delta_{n,\infty}$ is the $d_{n,\infty}$-diameter of $F$. So, we get
$$\mathbb{E}_X \|P - P_n\|_{F^2} \le c \sqrt{\frac{\log M}{n}} \mathbb{E}_X[\Delta_{n,\infty}(F)\, r] \le c \sqrt{\frac{\log M}{n}} \sqrt{\mathbb{E}_X[\Delta_{n,\infty}^2(F)]} \sqrt{\mathbb{E}_X[r^2]},$$
which entails
$$\mathbb{E}_X(r^2) \le c \frac{\log M}{n} \mathbb{E}_X[\Delta_{n,\infty}^2(F)] + 4\sigma(F)^2.$$
Since $\mathbb{E}[Z^2] \le 2 \|Z\|_{\psi_2}^2$ for a subgaussian variable $Z$, we have, by using Pisier's inequality,
$$\mathbb{E}_X[\Delta_{n,\infty}^2(F)] \le 4 \big\| \max_{i=1,\ldots,n} \sup_{f \in F} |f(X_i) - f_0(X_i)| \big\|_{\psi_2}^2 \le 4 \log(n+1) \big\| \sup_{f \in F} |f(X) - f_0(X)| \big\|_{\psi_2}^2 \le 4 b^2 \log(n+1),$$
so we have proved that
$$\mathbb{E}_X(r^2) \le c \max\Big( \frac{b^2 \log M \log n}{n}, \sigma(F)^2 \Big).$$
When $\sup_{f \in F} \|f\|_\infty \le b$, the proof is easier, since we can use the contraction principle for the Rademacher process after the symmetrization argument:
$$\mathbb{E}_X \|P - P_n\|_{F^2} \le \frac{2}{n} \mathbb{E}_X \mathbb{E}_\epsilon \Big[ \sup_{f \in F} \sum_{i=1}^n \epsilon_i f^2(X_i) \Big] \le \frac{8b}{n} \mathbb{E}_X \mathbb{E}_\epsilon \Big[ \sup_{f \in F} \sum_{i=1}^n \epsilon_i f(X_i) \Big],$$
and one obtains from this, as previously, that $\mathbb{E}_X(r^2) \le c \max(b^2 (\log M)/n, \sigma(F)^2)$.

Let us turn to the part of the lemma concerning $\mathbb{E} \|P - P_n\|_{\mathcal{L}_C(C)}$. Recall that $C = \operatorname{conv}(F)$ and write for short $\mathcal{L}_f(X, Y) = \mathcal{L}_C(f)(X, Y) = (Y - f(X))^2 - (Y - f_C(X))^2$ for each $f \in C$, where we recall that $f_C \in \operatorname{argmin}_{g \in C} R(g)$. Using the same argument as before, we have
$$\mathbb{E} \|P - P_n\|_{\mathcal{L}_C(C)} \le \frac{c}{n} \mathbb{E}_{(X,Y)} \mathbb{E}_g \Big[ \sup_{f \in C} \sum_{i=1}^n g_i \mathcal{L}_f(X_i, Y_i) \Big].$$
Consider the Gaussian process $f \mapsto Z_f := \sum_{i=1}^n g_i \mathcal{L}_f(X_i, Y_i)$ indexed by $C$. For every $f, f' \in C$, the intrinsic distance of $Z_f$ satisfies
$$\mathbb{E}_g |Z_f - Z_{f'}|^2 = \sum_{i=1}^n (\mathcal{L}_f(X_i, Y_i) - \mathcal{L}_{f'}(X_i, Y_i))^2 \le \max_{i=1,\ldots,n} |2Y_i - f(X_i) - f'(X_i)|^2 \times \sum_{i=1}^n (f(X_i) - f'(X_i))^2 = \max_{i=1,\ldots,n} |2Y_i - f(X_i) - f'(X_i)|^2 \times \mathbb{E}_g |Z_f' - Z_{f'}'|^2,$$
where $Z_f' := \sum_{i=1}^n g_i (f(X_i) - f_C(X_i))$.
Therefore, by Slepian's lemma, we have for every $(X_i, Y_i)_{i=1}^n$:
$$\mathbb{E}_g \Big[ \sup_{f \in C} Z_f \Big] \le \max_{i=1,\ldots,n} \sup_{f, f' \in C} |2Y_i - f(X_i) - f'(X_i)| \times \mathbb{E}_g \Big[ \sup_{f \in C} Z_f' \Big],$$
and since for every $f = \sum_{j=1}^M \alpha_j f_j \in C$, where $\alpha_j \ge 0$ for all $j = 1, \ldots, M$ and $\sum_j \alpha_j = 1$, we have $Z_f' = \sum_{j=1}^M \alpha_j Z_{f_j}'$, it follows that
$$\mathbb{E}_g \Big[ \sup_{f \in C} Z_f' \Big] \le \mathbb{E}_g \Big[ \sup_{f \in F} Z_f' \Big].$$
Moreover, using Dudley's entropy integral argument,
$$\frac{1}{n} \mathbb{E}_g \Big[ \sup_{f \in F} Z_f' \Big] \le \frac{c}{\sqrt{n}} \int_0^{\Delta_n(F')} \sqrt{\log N(F', \|\cdot\|_n, t)} \, dt \le c \sqrt{\frac{\log M}{n}} \, r',$$
where $F' := \{ f - f_C : f \in F \}$, $\Delta_n(F') := \operatorname{diam}(F', \|\cdot\|_n)$ and
$$r'^2 := \sup_{f \in F'} \frac{1}{n} \sum_{i=1}^n f(X_i)^2.$$
On the other hand, we can prove, using Pisier's inequality for $\psi_1$ random variables and the fact that $\|U^2\|_{\psi_1} = \|U\|_{\psi_2}^2$ for every random variable $U$, that
$$\sqrt{\mathbb{E}\Big[ \max_{i=1,\ldots,n} \sup_{f,f' \in C} |2Y_i - f(X_i) - f'(X_i)|^2 \Big]} \le 2 \sqrt{2 \log(n+1)} \Big( \|\varepsilon\|_{\psi_2} + \big\| \sup_{f \in F} |f(X) - f_0(X)| \big\|_{\psi_2} \Big). \qquad (27)$$
So, we finally obtain
$$\mathbb{E} \|P - P_n\|_{\mathcal{L}_C(C)} \le c \sqrt{\frac{\log n \log M}{n}} \sqrt{\mathbb{E}(r'^2)},$$
and the conclusion follows from the first part of the lemma, since $\sigma(F') \le d(F)$. The case $\max(\|Y\|_\infty, \sup_{f \in F} \|f\|_\infty) \le b$ is easier, and follows from the fact that the left-hand side of (27) is smaller than $4b$.

Lemma 2, combined with Theorem 5, leads to the following corollary.

Corollary 1. Let $d(F) = \operatorname{diam}(F, L^2(P_X))$, $C := \operatorname{conv}(F)$ and $\mathcal{L}_f(X, Y) = (Y - f(X))^2 - (Y - f_C(X))^2$. If $\max(\|\varepsilon\|_{\psi_2}, \|\sup_{f \in F} |f(X) - f_0(X)|\|_{\psi_2}) \le b$, we have, with probability larger than $1 - 4 e^{-x}$, that for every $f \in C$:
$$\frac{1}{n} \sum_{i=1}^n \mathcal{L}_f(X_i, Y_i) - \mathbb{E} \mathcal{L}_f(X, Y) \le c (\sigma_\varepsilon + b) \sqrt{\frac{(\log M + x) \log n}{n}} \max\Big( b \sqrt{\frac{(\log M + x) \log n}{n}}, d(F) \Big).$$
If $\max(\|Y\|_\infty, \sup_{f \in F} \|f\|_\infty) \le b$, we have, with probability larger than $1 - 4 e^{-x}$, that for every $f \in C$:
$$\frac{1}{n} \sum_{i=1}^n \mathcal{L}_f(X_i, Y_i) - \mathbb{E} \mathcal{L}_f(X, Y) \le c b \sqrt{\frac{\log M + x}{n}} \max\Big( b \sqrt{\frac{\log M + x}{n}}, d(F) \Big).$$

Proof. We apply Theorem 5 to the process
$$Z := \sup_{f \in C} \Big( \frac{1}{n} \sum_{i=1}^n \mathcal{L}_f(X_i, Y_i) - \mathbb{E} \mathcal{L}_f(X, Y) \Big)$$
to obtain that, with probability larger than $1 - 4 e^{-x}$:
$$Z \le c \Big( \mathbb{E} Z + \sigma(C) \sqrt{\frac{x}{n}} + b_n(C) \frac{x}{n} \Big),$$
where $\sigma(C)^2 = \sup_{f \in C} \mathbb{E}[\mathcal{L}_f(X, Y)^2]$ and $b_n(C) = \| \max_{i=1,\ldots,n} \sup_{f \in C} |\mathcal{L}_f(X_i, Y_i) - \mathbb{E}[\mathcal{L}_f(X, Y)]| \|_{\psi_1}$. Since $\mathcal{L}_f(X, Y) = 2\varepsilon (f_C(X) - f(X)) + (f_C(X) - f(X))(2 f_0(X) - f(X) - f_C(X))$, we have, using Assumption 1:
$$\mathbb{E}[\mathcal{L}_f(X, Y)^2] \le 4 \sigma_\varepsilon^2 \|f - f_C\|_{L^2(P_X)}^2 + 4 \sqrt{\mathbb{E}[(f_C(X) - f(X))^4]} \sqrt{\mathbb{E}[(2 f_0(X) - f(X) - f_C(X))^4]}.$$
If $U_f := (f_C(X) - f(X))^2$, we have $\|U_f\|_{\psi_1} = \|f_C - f\|_{\psi_2}^2 \le (2b)^2$ for any $f \in C$, so using the $\psi_1$ version of Bernstein's inequality (see van der Vaart and Wellner (1996)), we have that $\mathbb{P}(|U_f - \mathbb{E}(U_f)| \ge m \|U_f\|_{\psi_1}) \le 2 \exp(-c \min(m, m^2)) = 2 \exp(-cm)$ for any $m \in \mathbb{N} \setminus \{0\}$. But for such a random variable, one has $\mathbb{E}(U_f^p)^{1/p} \le c_p \mathbb{E}(U_f)$ for any $p > 1$ (cf. Mendelson (2004)). So, in particular for $p = 2$, we derive
$$\sqrt{\mathbb{E}[(f_C(X) - f(X))^4]} \le c \|f - f_C\|_{L^2(P_X)}^2.$$
Moreover, since $\mathbb{E}(Z^4) \le 16 \|Z\|_{\psi_2}^4$, we have $\sqrt{\mathbb{E}[(2 f_0(X) - f(X) - f_C(X))^4]} \le 8 b^2$. So, we can conclude that
$$\sigma(C)^2 \le (4\sigma_\varepsilon^2 + 8 c b^2) \, d(F)^2.$$
Since $\mathbb{E}(Z) \le \|Z\|_{\psi_1}$, we have $b_n(C) \le 2 \log(n+1) \| \sup_{f \in C} |\mathcal{L}_f(X, Y)| \|_{\psi_1}$.
Lemma 2 combined with Theorem 5 leads to the following corollary.

Corollary 1. Let $d(F) := \operatorname{diam}(F, L_2(P_X))$, $C := \operatorname{conv}(F)$ and $L_f(X, Y) := (Y - f(X))^2 - (Y - f_C(X))^2$. If $\max(\|\varepsilon\|_{\psi_2}, \|\sup_{f \in F}|f(X) - f_0(X)|\|_{\psi_2}) \le b$, we have, with probability larger than $1 - 4 e^{-x}$, that for every $f \in C$:
$$\frac{1}{n} \sum_{i=1}^n L_f(X_i, Y_i) - \mathbb{E} L_f(X, Y) \le c (\sigma_\varepsilon + b) \sqrt{\frac{(\log M + x) \log n}{n}} \max\Big( b \sqrt{\frac{(\log M + x) \log n}{n}},\; d(F) \Big).$$
If $\max(\|Y\|_\infty, \sup_{f \in F}\|f\|_\infty) \le b$, we have, with probability larger than $1 - 4 e^{-x}$, that for every $f \in C$:
$$\frac{1}{n} \sum_{i=1}^n L_f(X_i, Y_i) - \mathbb{E} L_f(X, Y) \le c b \sqrt{\frac{\log M + x}{n}} \max\Big( b \sqrt{\frac{\log M + x}{n}},\; d(F) \Big).$$

Proof. We apply Theorem 5 to the process
$$Z := \sup_{f \in C} \Big( \frac{1}{n} \sum_{i=1}^n L_f(X_i, Y_i) - \mathbb{E} L_f(X, Y) \Big)$$
to obtain that, with probability larger than $1 - 4 e^{-x}$,
$$Z \le c \Big( \mathbb{E} Z + \sigma(C) \sqrt{\frac{x}{n}} + b_n(C)\, \frac{x}{n} \Big),$$
where $\sigma(C)^2 := \sup_{f \in C} \mathbb{E}[L_f(X, Y)^2]$ and
$$b_n(C) := \Big\| \max_{i=1,\dots,n} \sup_{f \in C} \big| L_f(X_i, Y_i) - \mathbb{E}[L_f(X, Y)] \big| \Big\|_{\psi_1}.$$
Since $L_f(X, Y) = 2 \varepsilon (f_C(X) - f(X)) + (f_C(X) - f(X)) (2 f_0(X) - f(X) - f_C(X))$, we have, using Assumption 1,
$$\mathbb{E}[L_f(X, Y)^2] \le 4 \sigma_\varepsilon^2 \|f - f_C\|_{L_2(P_X)}^2 + 4 \sqrt{\mathbb{E}\big[(f_C(X) - f(X))^4\big]}\, \sqrt{\mathbb{E}\big[(2 f_0(X) - f(X) - f_C(X))^4\big]}.$$
If $U_f := (f_C(X) - f(X))^2$, we have $\|U_f\|_{\psi_1} = \|f_C - f\|_{\psi_2}^2 \le (2b)^2$ for any $f \in C$, so, using the $\psi_1$ version of Bernstein's inequality (see van der Vaart and Wellner (1996)), we have that
$$\mathbb{P}\big( |U_f - \mathbb{E}(U_f)| \ge m \|U_f\|_{\psi_1} \big) \le 2 \exp(-c \min(m, m^2)) = 2 \exp(-c m)$$
for any $m \in \mathbb{N} \setminus \{0\}$. But for such a random variable, one has $\mathbb{E}(U_f^p)^{1/p} \le c_p\, \mathbb{E}(U_f)$ for any $p > 1$ (cf. Mendelson (2004)). So, in particular for $p = 2$, we derive
$$\sqrt{\mathbb{E}\big[(f_C(X) - f(X))^4\big]} \le c \|f - f_C\|_{L_2(P_X)}^2.$$
Moreover, since $\mathbb{E}(Z^4) \le 16 \|Z\|_{\psi_2}^4$, we have $\sqrt{\mathbb{E}[(2 f_0(X) - f(X) - f_C(X))^4]} \le 8 b^2$. So we can conclude that
$$\sigma(C)^2 \le (4 \sigma_\varepsilon^2 + 8 c b^2)\, d(F)^2.$$
Since $\mathbb{E}(Z) \le \|Z\|_{\psi_1}$, we have
$$b_n(C) \le 2 \log(n+1)\, \Big\| \sup_{f \in C} |L_f(X, Y)| \Big\|_{\psi_1}.$$
Moreover, a straightforward calculation gives
$$L_f(X, Y) \le \varepsilon^2 + (f_C(X) - f_0(X))^2 + 3 (f(X) - f_0(X))^2,$$
so $b_n(C) \le 10 \log(n+1)\, b^2$. Putting all this together, and using Lemma 2, we arrive at
$$Z \le c (\sigma_\varepsilon + b) \sqrt{\frac{(\log M + x) \log n}{n}} \max\Big( b \sqrt{\frac{(\log M + x) \log n}{n}},\; d(F) \Big)$$
with probability larger than $1 - 4 e^{-x}$, for any $x > 0$. In the bounded case where $\max(\|Y\|_\infty, \sup_{f \in F}\|f\|_\infty) \le b$, the proof is easier, and one can use Talagrand's original concentration inequality.

Lemma 3. Let $L_f(X, Y) := (Y - f(X))^2 - (Y - f_F(X))^2$. If $\max(\|\varepsilon\|_{\psi_2}, \|\sup_{f \in F}|f(X) - f_0(X)|\|_{\psi_2}) \le b$, we have, with probability larger than $1 - 4 e^{-x}$, that for every $f \in F$:
$$\frac{1}{n} \sum_{i=1}^n L_f(X_i, Y_i) - \mathbb{E} L_f(X, Y) \le c (\sigma_\varepsilon + b) \sqrt{\frac{(\log M + x) \log n}{n}} \max\Big( b \sqrt{\frac{(\log M + x) \log n}{n}},\; \|f - f_F\| \Big).$$
Also, with probability at least $1 - 4 e^{-x}$, we have for every $f, g \in F$:
$$\Big| \|f - g\|_n^2 - \|f - g\|^2 \Big| \le c b \sqrt{\frac{(\log M + x) \log n}{n}} \max\Big( b \sqrt{\frac{(\log M + x) \log n}{n}},\; \|f - g\| \Big).$$
When $\max(\|Y\|_\infty, \sup_{f \in F}\|f\|_\infty) \le b$, we have, with probability larger than $1 - 2 e^{-x}$, that for every $f \in F$:
$$\frac{1}{n} \sum_{i=1}^n L_f(X_i, Y_i) - \mathbb{E} L_f(X, Y) \le c b \sqrt{\frac{\log M + x}{n}} \max\Big( b \sqrt{\frac{\log M + x}{n}},\; \|f - f_F\| \Big),$$
and, with probability at least $1 - 2 e^{-x}$, that for every $f, g \in F$:
$$\Big| \|f - g\|_n^2 - \|f - g\|^2 \Big| \le c b \sqrt{\frac{\log M + x}{n}} \max\Big( b \sqrt{\frac{\log M + x}{n}},\; \|f - g\| \Big).$$

Proof of Lemma 3. The proof uses exactly the same arguments as those of Lemma 2 and Corollary 1, and is thus omitted.

Lemma 4. Let $\hat F_1$ be given by (5), recall that $f_F \in \operatorname{argmin}_{f \in F} R(f)$, and let $d(\hat F_1) := \operatorname{diam}(\hat F_1, L_2(P_X))$. Assume that
$$\max\big( \|\varepsilon\|_{\psi_2},\; \|\sup_{f \in F} |f(X) - f_0(X)|\|_{\psi_2} \big) \le b.$$
Then, with probability at least $1 - 4 \exp(-x)$, we have $f_F \in \hat F_1$, and any function $f \in \hat F_1$ satisfies
$$R(f) \le R(f_F) + c (\sigma_\varepsilon + b) \sqrt{\frac{(\log M + x) \log n}{n}} \max\Big( b \sqrt{\frac{(\log M + x) \log n}{n}},\; d(\hat F_1) \Big).$$
If $\max(\|Y\|_\infty, \sup_{f \in F}\|f\|_\infty) \le b$, we have, with probability at least $1 - 2 \exp(-x)$, that $f_F \in \hat F_1$, and any function $f \in \hat F_1$ satisfies
$$R(f) \le R(f_F) + c b \sqrt{\frac{\log M + x}{n}} \max\Big( b \sqrt{\frac{\log M + x}{n}},\; d(\hat F_1) \Big).$$

Proof. The proof follows the lines of the proof of Lemma 4.4 in Lecué and Mendelson (2009a), together with Lemma 3, so we do not reproduce it here.

B Function spaces

In this section we give precise definitions of the spaces of functions considered in the paper, together with useful related results. The definitions and results presented here can be found in Triebel (2006), in particular in Chapter 5, which deals with anisotropic spaces, anisotropic multiresolutions, and entropy numbers of the embeddings of such spaces (see Section 5.3.3); we use the latter in particular to derive condition (C_β) for the anisotropic Besov space, see Section 3.

Let $\{e_1, \dots, e_d\}$ be the canonical basis of $\mathbb{R}^d$ and let $s = (s_1, \dots, s_d)$, with $s_i > 0$, be a vector of directional smoothness, where $s_i$ corresponds to the smoothness in direction $e_i$. Let us fix $1 \le p, q \le \infty$. If $f : \mathbb{R}^d \to \mathbb{R}$, we define $\Delta_h^k f$, the difference of order $k \ge 1$ and step $h \in \mathbb{R}^d$, by $\Delta_h^1 f(x) = f(x + h) - f(x)$ and $\Delta_h^k f(x) = \Delta_h^1(\Delta_h^{k-1} f)(x)$ for any $x \in \mathbb{R}^d$.
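As a toy illustration of these differences (our own example, not from the original text): in dimension $d = 1$, for $f(x) = x^2$ and step $h$,
$$\Delta_h^1 f(x) = (x + h)^2 - x^2 = 2xh + h^2, \qquad \Delta_h^2 f(x) = \Delta_h^1(\Delta_h^1 f)(x) = 2h^2.$$
Thus $\Delta_h^k$ annihilates polynomials of degree smaller than $k$, and $\|\Delta_{t e_i}^{k_i} f\|_p$ measures smoothness of order up to $k_i$ in direction $e_i$; this is why the seminorm in the next definition requires $k_i > s_i$.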
Definition 3. We say that $f \in L_p(\mathbb{R}^d)$ belongs to the anisotropic Besov space $B^s_{p,q}(\mathbb{R}^d)$ if the semi-norm
$$|f|_{B^s_{p,q}(\mathbb{R}^d)} := \sum_{i=1}^d \Big( \int_0^1 \big( t^{-s_i} \|\Delta_{t e_i}^{k_i} f\|_p \big)^q\, \frac{dt}{t} \Big)^{1/q}$$
is finite (with the usual modifications when $p = \infty$ or $q = \infty$).

We know that the norms $\|f\|_{B^s_{p,q}} := \|f\|_p + |f|_{B^s_{p,q}}$ are equivalent for any choice of $k_i > s_i$. An equivalent definition of the seminorm can be given using directional differences and the anisotropic distance; see Theorem 5.8 in Triebel (2006). Several explicit particular cases of the space $B^s_{p,q}$ are of interest. If $s = (s, \dots, s)$ for some $s > 0$, then $B^s_{p,q}$ is the standard isotropic Besov space. When $p = q = 2$ and $s = (s_1, \dots, s_d)$ has integer coordinates, $B^s_{2,2}$ is the anisotropic Sobolev space
$$B^s_{2,2} = W^s_2 = \Big\{ f \in L_2 : \sum_{i=1}^d \Big\| \frac{\partial^{s_i} f}{\partial x_i^{s_i}} \Big\|_2 < \infty \Big\}.$$
If $s$ has non-integer coordinates, then $B^s_{2,2}$ is the anisotropic Bessel-potential space
$$H^s = \Big\{ f \in L_2 : \sum_{i=1}^d \big\| (1 + |\xi_i|^2)^{s_i/2} \hat f(\xi) \big\|_2 < \infty \Big\}.$$
As we mentioned above, Assumption 4 is satisfied for nearly all the smoothness spaces considered in the nonparametric literature. In particular, if $F = B^s_{p,q}$ is the anisotropic Besov space defined above, then (C_β) is satisfied: this is a consequence of a more general theorem (see Theorem 5.30 in Triebel (2006)) concerning the entropy numbers of embeddings (see Definition 1.87 in Triebel (2006)). Here, we only give a simplified version of this theorem, which is sufficient to derive (C_β) for $B^s_{p,q}$. Indeed, taking $s^0 = s$, $p_0 = p$, $q_0 = q$ and $s^1 = 0$, $p_1 = \infty$, $q_1 = \infty$ in Theorem 5.30 of Triebel (2006), we obtain the following.

Theorem 9. Let $1 \le p, q \le \infty$ and $s = (s_1, \dots, s_d)$ with $s_i > 0$, and let $\bar s$ be the harmonic mean of $s$ (see (13)). Whenever $\bar s > d/p$, we have $B^s_{p,q} \subset C(\mathbb{R}^d)$, where $C(\mathbb{R}^d)$ is the set of continuous functions on $\mathbb{R}^d$, and for any $\delta > 0$, the sup-norm entropy of the unit ball of the anisotropic Besov space, namely the set
$$U^s_{p,q} := \{ f \in B^s_{p,q} : |f|_{B^s_{p,q}} \le 1 \},$$
satisfies
$$H_\infty(\delta, U^s_{p,q}) \le D\, \delta^{-d/\bar s}, \qquad (28)$$
where $D > 0$ is a constant independent of $\delta$.

For the isotropic Sobolev space, Theorem 9 was obtained in the key paper Birman and Solomjak (1967) (see Theorem 5.2 therein); for the isotropic Besov space, it can be found, among others, in Birgé and Massart (2000) and Kerkyacharian and Picard (2003).

Remark 3. A more constructive computation of the entropy of anisotropic Besov spaces can be done using the replicant coding approach, which is carried out for Besov bodies in Kerkyacharian and Picard (2003). Using this approach together with an anisotropic multiresolution analysis based on compactly supported wavelets or atoms (see Section 5.2 in Triebel (2006)), one can obtain a direct computation of the entropy. The idea is to quantize the wavelet coefficients, and then to code them using a replication of their binary representation, with 01 as a separator (so that the coding is injective). A lower bound on the entropy can be obtained as an elegant consequence of Hoeffding's deviation inequality for sums of i.i.d. variables and a combinatorial lemma.
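To make the exponent in (28) concrete, here is a small worked instance; we use the standard definition of the harmonic mean, $\bar s = d \big( \sum_{i=1}^d 1/s_i \big)^{-1}$, which we assume agrees with (13). Take $d = 2$, $s = (1, 3)$ and $p = q = \infty$. Then
$$\bar s = \frac{2}{1/1 + 1/3} = \frac{3}{2} > \frac{d}{p} = 0,$$
so Theorem 9 applies and $H_\infty(\delta, U^s_{p,q}) \le D\, \delta^{-d/\bar s} = D\, \delta^{-4/3}$. The entropy thus behaves as that of an isotropic space of smoothness $3/2$: the exponent interpolates between the smooth direction $s_2 = 3$ and the rough direction $s_1 = 1$, rather than being dictated by the worst direction alone.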
References

Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electron. J. Probab., 13, no. 34, 1000–1034.

Audibert, J.-Y. (2009). Fast learning rates in statistical inference through aggregation. Ann. Statist., 37 1591–1646. doi:10.1214/08-AOS623.

Birgé, L. and Massart, P. (2000). An adaptive compression algorithm in Besov spaces. Constr. Approx., 16 1–36.

Birman, M. Š. and Solomjak, M. Z. (1967). Piecewise polynomial approximations of functions of classes $W^\alpha_p$. Mat. Sb. (N.S.), 73 (115) 331–355.

Catoni, O. (2001). Statistical Learning Theory and Stochastic Optimization. École d'été de Probabilités de Saint-Flour 2001, Lecture Notes in Mathematics, Springer, N.Y.

Cucker, F. and Smale, S. (2002). On the mathematical foundations of learning. Bull. Amer. Math. Soc. (N.S.), 39 1–49 (electronic).

Dalalyan, A. S. and Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In COLT. 97–111.

Dudley, R. M. (1999). Uniform Central Limit Theorems, vol. 63 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist., 32 407–499. With discussion, and a rejoinder by the authors.

Einmahl, U. and Mason, D. M. (1996). Some universal results on the behavior of increments of partial sums. Ann. Probab., 24 1388–1407.

Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab., 12 929–998.

Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics, Springer-Verlag, New York.

Hoffmann, M. and Lepski, O. V. (2002). Random rates in anisotropic regression. Ann. Statist., 30 325–396.

Juditsky, A., Rigollet, P. and Tsybakov, A. B. (2008). Learning by mirror averaging. Ann. Statist., 36 2183–2206. URL http://dx.doi.org/10.1214/07-AOS546.

Juditsky, A. B., Nazin, A. V., Tsybakov, A. B. and Vayatis, N. (2005). Recursive aggregation of estimators by the mirror descent method with averaging. Problemy Peredachi Informatsii, 41 78–96.

Kerkyacharian, G., Lepski, O. and Picard, D. (2001). Nonlinear estimation in anisotropic multi-index denoising. Probab. Theory Related Fields, 121 137–170.

Kerkyacharian, G., Lepski, O. and Picard, D. (2007). Nonlinear estimation in anisotropic multi-index denoising. Sparse case. Teor. Veroyatn. Primen., 52 150–171.

Kerkyacharian, G. and Picard, D. (2003). Replicant compression coding in Besov spaces. ESAIM Probab. Stat., 7 239–250 (electronic).

Lecué, G. (2006). Lower bounds and aggregation in density estimation. J. Mach. Learn. Res., 7 971–981.

Lecué, G. and Mendelson, S. (2009a). Aggregation via empirical risk minimization. Probab. Theory Related Fields, 145 591–613. URL http://dx.doi.org/10.1007/s00440-008-0180-8.

Lecué, G. and Mendelson, S. (2009b). Sharper lower bounds on the performance of the empirical risk minimization algorithm. To appear in Bernoulli.

Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces, vol. 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3). Springer-Verlag, Berlin. Isoperimetry and processes.

Lee, W. S., Bartlett, P. L. and Williamson, R. C. (1996). The importance of convexity in learning with squared loss. In Proceedings of the Ninth Annual Conference on Computational Learning Theory. ACM Press, 140–146.

Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory, 52 3396–3410.
Loustau, S. (2009). Penalized empirical risk minimization over Besov spaces. Electronic Journal of Statistics, 3 824–850.

Massart, P. (2007). Concentration Inequalities and Model Selection, vol. 1896 of Lecture Notes in Mathematics. Springer, Berlin. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003. With a foreword by Jean Picard.

Mendelson, S. (2004). On the performance of kernel classes. J. Mach. Learn. Res., 4 759–771. URL http://dx.doi.org/10.1162/1532443041424337.

Mendelson, S. (2008). Lower bounds for the empirical minimization algorithm. IEEE Trans. Inform. Theory, 54 3797–3803.

Mendelson, S. and Neeman, J. (2009). Regularization in kernel learning. Tech. rep. To appear in Annals of Statistics, available at http://www.imstat.org/aos/.

Neumann, M. H. (2000). Multivariate wavelet thresholding in anisotropic function spaces. Statist. Sinica, 10 399–431.

Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math., 126 505–563.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58 267–288.

Triebel, H. (2006). Theory of Function Spaces. III, vol. 100 of Monographs in Mathematics. Birkhäuser Verlag, Basel.

Tsybakov, A. (2003a). Introduction à l'estimation non-paramétrique. Springer.

Tsybakov, A. B. (2003b). Optimal rates of aggregation. In Computational Learning Theory and Kernel Machines (B. Schölkopf and M. Warmuth, eds.), Lecture Notes in Artificial Intelligence, 2777 303–313. Springer, Heidelberg.

van de Geer, S. A. (2000). Applications of Empirical Process Theory, vol. 6 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer Series in Statistics, Springer-Verlag, New York. With applications to statistics.

Wahba, G. (1990). Spline Models for Observational Data, vol. 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.

Yang, Y. (2000). Mixing strategies for density estimation. Ann. Statist., 28 75–87. URL http://dx.doi.org/10.1214/aos/1016120365.