A survey of cross-validation procedures for model selection
October 22, 2018

Sylvain Arlot, CNRS; Willow Project-Team, Laboratoire d'Informatique de l'École Normale Supérieure (CNRS/ENS/INRIA UMR 8548), 45 rue d'Ulm, 75230 Paris, France. Sylvain.Arlot@ens.fr

Alain Celisse, Laboratoire Paul Painlevé, UMR CNRS 8524, Université des Sciences et Technologies de Lille 1, F-59655 Villeneuve d'Ascq Cedex, France. Alain.Celisse@math.univ-lille1.fr

Abstract

Used to estimate the risk of an estimator or to perform model selection, cross-validation is a widespread strategy because of its simplicity and its apparent universality. Many results exist on the model selection performances of cross-validation procedures. This survey intends to relate these results to the most recent advances of model selection theory, with a particular emphasis on distinguishing empirical statements from rigorous theoretical results. As a conclusion, guidelines are provided for choosing the best cross-validation procedure according to the particular features of the problem in hand.

Contents

1 Introduction
  1.1 Statistical framework
  1.2 Examples
  1.3 Statistical algorithms
2 Model selection
  2.1 The model selection paradigm
  2.2 Model selection for estimation
  2.3 Model selection for identification
  2.4 Estimation vs. identification
3 Overview of some model selection procedures
  3.1 The unbiased risk estimation principle
  3.2 Biased estimation of the risk
  3.3 Procedures built for identification
  3.4 Structural risk minimization
  3.5 Ad hoc penalization
  3.6 Where are cross-validation procedures in this picture?
4 Cross-validation procedures
  4.1 Cross-validation philosophy
  4.2 From validation to cross-validation
    4.2.1 Hold-out
    4.2.2 General definition of cross-validation
  4.3 Classical examples
    4.3.1 Exhaustive data splitting
    4.3.2 Partial data splitting
    4.3.3 Other cross-validation-like risk estimators
  4.4 Historical remarks
5 Statistical properties of cross-validation estimators of the risk
  5.1 Bias
    5.1.1 Theoretical assessment of the bias
    5.1.2 Correction of the bias
  5.2 Variance
    5.2.1 Variability factors
    5.2.2 Theoretical assessment of the variance
    5.2.3 Estimation of the variance
6 Cross-validation for efficient model selection
  6.1 Relationship between risk estimation and model selection
  6.2 The global picture
  6.3 Results in various frameworks
7 Cross-validation for identification
  7.1 General conditions towards model consistency
  7.2 Refined analysis for the algorithm selection problem
8 Specificities of some frameworks
  8.1 Density estimation
  8.2 Robustness to outliers
  8.3 Time series and dependent observations
  8.4 Large number of models
9 Closed-form formulas and fast computation
10 Conclusion: which cross-validation method for which problem?
  10.1 The general picture
  10.2 How should the splits be chosen?
  10.3 V-fold cross-validation
  10.4 Future research

1 Introduction

Many statistical algorithms, such as likelihood maximization, least squares and empirical contrast minimization, rely on the preliminary choice of a model, that is, of a set of parameters from which an estimate will be returned. When several candidate models (thus algorithms) are available, choosing one of them is called the model selection problem.

Cross-validation (CV) is a popular strategy for model selection, and more generally algorithm selection. The main idea behind CV is to split the data (once or several times) for estimating the risk of each algorithm: part of the data (the training sample) is used for training each algorithm, and the remaining part (the validation sample) is used for estimating the risk of the algorithm. Then, CV selects the algorithm with the smallest estimated risk. Compared to the resubstitution error, CV avoids overfitting because the training sample is independent from the validation sample (at least when data are i.i.d.).

The popularity of CV mostly comes from the generality of the data splitting heuristics, which only assumes that data are i.i.d. Nevertheless, theoretical and empirical studies of CV procedures do not entirely confirm this universality. Some CV procedures have been proved to fail for some model selection problems, depending on the goal of model selection: estimation or identification (see Section 2). Furthermore, many theoretical questions about CV remain widely open.

The aim of the present survey is to provide a clear picture of what is known about CV, from both theoretical and empirical points of view. More precisely, the aim is to answer the following questions: What is CV doing? When does CV work for model selection, keeping in mind that model selection can target different goals? Which CV procedure should be used for each model selection problem?

The paper is organized as follows. First, the rest of Section 1 presents the statistical framework. Although non-exhaustive, the present setting has been chosen general enough for sketching the complexity of CV for model selection. The model selection problem is introduced in Section 2.
A brief overview of some model selection procedures that are important to keep in mind for understanding CV is given in Section 3. The most classical CV procedures are defined in Section 4. Since they are the keystone of the behaviour of CV for model selection, the main properties of CV estimators of the risk for a fixed model are detailed in Section 5. Then, the general performances of CV for model selection are described, when the goal is either estimation (Section 6) or identification (Section 7). Specific properties of CV in some particular frameworks are discussed in Section 8. Finally, Section 9 focuses on the algorithmic complexity of CV procedures, and Section 10 concludes the survey by tackling several practical questions about CV.

1.1 Statistical framework

Assume that some data $\xi_1, \ldots, \xi_n \in \Xi$ with common distribution $P$ are observed. Throughout the paper, except in Section 8.3, the $\xi_i$ are assumed to be independent. The purpose of statistical inference is to estimate from the data $(\xi_i)_{1 \le i \le n}$ some target feature $s$ of the unknown distribution $P$, such as the mean or the variance of $P$. Let $\mathcal{S}$ denote the set of possible values for $s$.

The quality of $t \in \mathcal{S}$, as an approximation of $s$, is measured by its loss $\mathcal{L}(t)$, where $\mathcal{L} : \mathcal{S} \to \mathbb{R}$ is called the loss function, and is assumed to be minimal for $t = s$. Many loss functions can be chosen for a given statistical problem. Several classical loss functions are defined by

$$\mathcal{L}(t) = \mathcal{L}_P(t) := \mathbb{E}_{\xi \sim P}\left[\gamma(t; \xi)\right], \qquad (1)$$

where $\gamma : \mathcal{S} \times \Xi \to [0, \infty)$ is called a contrast function. Basically, for $t \in \mathcal{S}$ and $\xi \in \Xi$, $\gamma(t; \xi)$ measures how well $t$ is in accordance with the observation $\xi$, so that the loss of $t$, defined by (1), measures the average accordance between $t$ and new observations $\xi$ with distribution $P$. Therefore, several frameworks such as transductive learning do not fit definition (1). Nevertheless, as detailed in Section 1.2, definition (1) includes most classical statistical frameworks.

Another useful quantity is the excess loss

$$\ell(s, t) := \mathcal{L}_P(t) - \mathcal{L}_P(s) \ge 0,$$

which is related to the risk of an estimator $\widehat{s}$ of the target $s$ by $\mathcal{R}(\widehat{s}) = \mathbb{E}_{\xi_1, \ldots, \xi_n \sim P}[\ell(s, \widehat{s})]$.
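To fix ideas, here is a minimal sketch, not taken from the survey: the toy distribution $P$, the candidates, and all function names are our own illustrative assumptions. It approximates the loss $\mathcal{L}_P(t) = \mathbb{E}_{\xi \sim P}[\gamma(t; \xi)]$ of a fixed candidate $t$ by a Monte Carlo average of the least-squares contrast (introduced in Section 1.2 below) over fresh draws $\xi \sim P$.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_ls(t, xi):
    """Least-squares contrast: gamma(t; (x, y)) = (t(x) - y)^2."""
    x, y = xi
    return (t(x) - y) ** 2

def monte_carlo_loss(t, draws, gamma):
    """Approximate L_P(t) = E[gamma(t; xi)] by an average over draws xi ~ P."""
    return float(np.mean([gamma(t, xi) for xi in draws]))

# Toy distribution P (an assumption): X ~ U[0, 1], Y = sin(2*pi*X) + noise.
def draw(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
    return list(zip(x, y))

t_good = lambda x: np.sin(2 * np.pi * x)   # the target s itself
t_bad = lambda x: 0.0                      # a constant candidate
for t in (t_good, t_bad):
    print(monte_carlo_loss(t, draw(100_000), gamma_ls))
# L_P(s) ~ 0.09 (the noise variance); the excess loss of t_bad is ~ 0.5.
```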
1.2 Examples

The purpose of this subsection is to show that the framework of Section 1.1 includes several important statistical frameworks. This list of examples does not pretend to be exhaustive.

Density estimation aims at estimating the density $s$ of $P$ with respect to some given measure $\mu$ on $\Xi$. Then, $\mathcal{S}$ is the set of densities on $\Xi$ with respect to $\mu$. For instance, taking $\gamma(t; x) = -\ln(t(x))$ in (1), the loss is minimal when $t = s$, and the excess loss

$$\ell(s, t) = \mathcal{L}_P(t) - \mathcal{L}_P(s) = \mathbb{E}_{\xi \sim P}\left[\ln\frac{s(\xi)}{t(\xi)}\right] = \int s \ln\left(\frac{s}{t}\right) d\mu$$

is the Kullback-Leibler divergence between the distributions $t\mu$ and $s\mu$.

Prediction aims at predicting a quantity of interest $Y \in \mathcal{Y}$ given an explanatory variable $X \in \mathcal{X}$ and a sample of observations $(X_1, Y_1), \ldots, (X_n, Y_n)$. In other words, $\Xi = \mathcal{X} \times \mathcal{Y}$, $\mathcal{S}$ is the set of measurable mappings $\mathcal{X} \to \mathcal{Y}$, and the contrast $\gamma(t; (x, y))$ measures the discrepancy between the observed $y$ and its predicted value $t(x)$. Two classical prediction frameworks are regression and classification, which are detailed below.

Regression corresponds to continuous $\mathcal{Y}$, that is, $\mathcal{Y} \subset \mathbb{R}$ (or $\mathbb{R}^k$ for multivariate regression), the feature space $\mathcal{X}$ being typically a subset of $\mathbb{R}^\ell$. Let $s$ denote the regression function, that is, $s(x) = \mathbb{E}_{(X,Y) \sim P}[Y \mid X = x]$, so that

$$\forall i, \quad Y_i = s(X_i) + \epsilon_i \quad \text{with} \quad \mathbb{E}[\epsilon_i \mid X_i] = 0.$$

A popular contrast in regression is the least-squares contrast $\gamma(t; (x, y)) = (t(x) - y)^2$, which is minimal over $\mathcal{S}$ for $t = s$, and the excess loss is

$$\ell(s, t) = \mathbb{E}_{(X,Y) \sim P}\left[(s(X) - t(X))^2\right].$$

Note that the excess loss of $t$ is the square of the $L^2$ distance between $t$ and $s$, so that prediction and estimation are equivalent goals.

Classification corresponds to finite (at least discrete) $\mathcal{Y}$. In particular, when $\mathcal{Y} = \{0, 1\}$, the prediction problem is called binary (supervised) classification. With the 0-1 contrast function $\gamma(t; (x, y)) = \mathbb{1}_{t(x) \ne y}$, the minimizer of the loss is the so-called Bayes classifier $s$ defined by $s(x) = \mathbb{1}_{\eta(x) \ge 1/2}$, where $\eta$ denotes the regression function $\eta(x) = \mathbb{P}_{(X,Y) \sim P}(Y = 1 \mid X = x)$.

Remark that a slightly different framework is often considered in binary classification. Instead of looking only for a classifier, the goal is to estimate also the confidence in the classification made at each point: $\mathcal{S}$ is the set of measurable mappings $\mathcal{X} \to \mathbb{R}$, the classifier $x \mapsto \mathbb{1}_{t(x) \ge 0}$ being associated with any $t \in \mathcal{S}$. Basically, the larger $|t(x)|$, the more confident we are in the classification made from $t(x)$. A classical family of losses associated with this problem is defined by (1) with the contrast

$$\gamma_\phi(t; (x, y)) = \phi\left(-(2y - 1)t(x)\right),$$

where $\phi : \mathbb{R} \to [0, \infty)$ is some function. The 0-1 contrast corresponds to $\phi(u) = \mathbb{1}_{u \ge 0}$. The convex loss functions correspond to the case where $\phi$ is convex, nondecreasing, with $\lim_{-\infty} \phi = 0$ and $\phi(0) = 1$. Classical examples are $\phi(u) = \max\{1 + u, 0\}$ (hinge), $\phi(u) = \exp(u)$ (exponential), and $\phi(u) = \log_2(1 + \exp(u))$ (logit). The corresponding losses are used as objective functions by several classical learning algorithms, such as support vector machines (hinge) and boosting (exponential and logit). Many references on classification theory, including model selection, can be found in the survey by Boucheron et al. (2005).
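These surrogate losses are easy to tabulate. The sketch below is illustrative only (the function names and test points are our own); it evaluates the contrast $\gamma_\phi(t;(x,y)) = \phi(-(2y-1)t(x))$ for the 0-1, hinge, exponential and logit choices of $\phi$, on one well-classified and one misclassified point.

```python
import numpy as np

def phi_01(u):    return (u >= 0).astype(float)   # 0-1 contrast: phi(u) = 1{u >= 0}
def phi_hinge(u): return np.maximum(1 + u, 0.0)   # hinge (support vector machines)
def phi_exp(u):   return np.exp(u)                # exponential (boosting)
def phi_logit(u): return np.log2(1 + np.exp(u))   # logit

def contrast_phi(phi, t_x, y):
    """gamma_phi(t; (x, y)) = phi(-(2y - 1) t(x)), with y in {0, 1}."""
    return phi(-(2 * y - 1) * t_x)

# The margin (2y - 1) t(x) is positive iff the classifier 1{t(x) >= 0} is correct.
t_x = np.array([1.5, -0.2])   # a confident correct answer, then a mistake (y = 1)
y = np.array([1, 1])
for phi in (phi_01, phi_hinge, phi_exp, phi_logit):
    print(phi.__name__, np.round(contrast_phi(phi, t_x, y), 3))
# All four satisfy phi(0) = 1, and the convex ones upper-bound the 0-1 contrast.
```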
1.3 Statistical algorithms

In this survey, a statistical algorithm $\mathcal{A}$ is any (measurable) mapping $\mathcal{A} : \bigcup_{n \in \mathbb{N}} \Xi^n \to \mathcal{S}$. The idea is that data $D_n = (\xi_i)_{1 \le i \le n} \in \Xi^n$ will be used as an input of $\mathcal{A}$, and that the output of $\mathcal{A}$, $\mathcal{A}(D_n) = \widehat{s}_{\mathcal{A}}(D_n) \in \mathcal{S}$, is an estimator of $s$. The quality of $\mathcal{A}$ is then measured by $\mathcal{L}_P(\widehat{s}_{\mathcal{A}}(D_n))$, which should be as small as possible. In the sequel, the algorithm $\mathcal{A}$ and the estimator $\widehat{s}_{\mathcal{A}}(D_n)$ are often identified when no confusion is possible.

Minimum contrast estimators form a classical family of statistical algorithms, defined as follows. Given some subset $S$ of $\mathcal{S}$ that we call a model, a minimum contrast estimator of $s$ is any minimizer over $S$ of the empirical contrast

$$t \mapsto \mathcal{L}_{P_n}(t) = \frac{1}{n}\sum_{i=1}^n \gamma(t; \xi_i), \quad \text{where} \quad P_n = \frac{1}{n}\sum_{i=1}^n \delta_{\xi_i}.$$

The idea is that the empirical contrast $\mathcal{L}_{P_n}(t)$ has an expectation $\mathcal{L}_P(t)$ which is minimal over $\mathcal{S}$ at $s$. Hence, minimizing $\mathcal{L}_{P_n}(t)$ over a set $S$ of candidate values for $s$ hopefully leads to a good estimator of $s$. Let us now give three popular examples of empirical contrast minimizers:

• Maximum likelihood estimators: take $\gamma(t; x) = -\ln(t(x))$ in the density estimation setting. A classical choice for $S$ is the set of piecewise constant functions on a regular partition of $\Xi$ with $K$ pieces.

• Least-squares estimators: take the least-squares contrast $\gamma(t; (x, y)) = (t(x) - y)^2$ in the regression setting. For instance, $S$ can be the set of piecewise constant functions on some fixed partition of $\mathcal{X}$ (leading to regressograms; a toy implementation is sketched at the end of this subsection), or a vector space spanned by the first vectors of a wavelet or Fourier basis, among many others. Note that regularized least-squares algorithms such as the Lasso, ridge regression and spline smoothing also are least-squares estimators, the model $S$ being some ball of a (data-dependent) radius for the $\ell^1$ (resp. $\ell^2$) norm in some high-dimensional space. Hence, tuning the regularization parameter of the Lasso or of an SVM, for instance, amounts to performing model selection from a collection of models.

• Empirical risk minimizers, following the terminology of Vapnik (1982): take any contrast function $\gamma$ in the prediction setting. When $\gamma$ is the 0-1 contrast, popular choices for $S$ lead to linear classifiers, partitioning rules, and neural networks. Boosting and support vector machine classifiers also are empirical contrast minimizers over some data-dependent model $S$, with contrast $\gamma = \gamma_\phi$ for some convex function $\phi$.

Let us finally mention that many other classical statistical algorithms can be considered with CV, for instance local average estimators in the prediction framework, such as $k$-nearest neighbours and Nadaraya-Watson kernel estimators. The focus will mainly be kept on minimum contrast estimators to keep the length of the survey reasonable.
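As a toy instance of minimum contrast estimation, here is a sketch of a least-squares regressogram; it is illustrative, not taken from the survey, and the partition of $\mathcal{X} = [0,1]$ and all names are assumptions. On each piece of a regular partition, the empirical contrast is minimized by the local mean of the $Y_i$.

```python
import numpy as np

def regressogram(x, y, n_bins):
    """Minimum contrast estimator for the least-squares contrast, over the model
    S_m of piecewise-constant functions on a regular partition of [0, 1]
    into n_bins pieces (dimension D_m = n_bins)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([y[idx == k].mean() if np.any(idx == k) else 0.0
                      for k in range(n_bins)])

    def s_hat(x_new):
        j = np.clip(np.searchsorted(edges, x_new, side="right") - 1, 0, n_bins - 1)
        return means[j]

    return s_hat

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=200)
s_hat = regressogram(x, y, n_bins=8)
print(s_hat(np.array([0.1, 0.5, 0.9])))   # a piecewise-constant fit
```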
2 Model selection

Usually, several statistical algorithms can be used for solving a given statistical problem. Let $(\widehat{s}_\lambda)_{\lambda \in \Lambda}$ denote such a family of candidate statistical algorithms. The algorithm selection problem aims at choosing from data one of these algorithms, that is, choosing some $\widehat{\lambda}(D_n) \in \Lambda$. Then, the final estimator of $s$ is given by $\widehat{s}_{\widehat{\lambda}(D_n)}(D_n)$. The main difficulty is that the same data are used for training the algorithms, that is, for computing $(\widehat{s}_\lambda(D_n))_{\lambda \in \Lambda}$, and for choosing $\widehat{\lambda}(D_n)$.

2.1 The model selection paradigm

Following Section 1.3, let us focus on the model selection problem, where candidate algorithms are minimum contrast estimators and the goal is to choose a model $S$. Let $(S_m)_{m \in \mathcal{M}_n}$ be a family of models, that is, $S_m \subset \mathcal{S}$. Let $\gamma$ be a fixed contrast function, and for every $m \in \mathcal{M}_n$, let $\widehat{s}_m$ be a minimum contrast estimator over the model $S_m$ with contrast $\gamma$. The goal is to choose $\widehat{m}(D_n) \in \mathcal{M}_n$ from data only.

The choice of a model $S_m$ has to be done carefully. Indeed, when $S_m$ is a small model, $\widehat{s}_m$ is a poor statistical algorithm except when $s$ is very close to $S_m$, since

$$\ell(s, \widehat{s}_m) \ge \inf_{t \in S_m} \{\ell(s, t)\} =: \ell(s, S_m).$$

The lower bound $\ell(s, S_m)$ is called the bias of the model $S_m$, or approximation error. The bias is a nonincreasing function of $S_m$. On the contrary, when $S_m$ is huge, its bias $\ell(s, S_m)$ is small for most targets $s$, but $\widehat{s}_m$ clearly overfits. Think for instance of $S_m$ as the set of all continuous functions on $[0, 1]$ in the regression framework. More generally, if $S_m$ is a vector space of dimension $D_m$, in several classical frameworks,

$$\mathbb{E}\left[\ell(s, \widehat{s}_m(D_n))\right] \approx \ell(s, S_m) + \lambda D_m, \qquad (2)$$

where $\lambda > 0$ does not depend on $m$. For instance, $\lambda = 1/(2n)$ in density estimation using the likelihood contrast, and $\lambda = \sigma^2/n$ in regression using the least-squares contrast and assuming $\mathrm{var}(Y \mid X) = \sigma^2$ does not depend on $X$.

The meaning of (2) is that a good model choice should balance the bias term $\ell(s, S_m)$ and the variance term $\lambda D_m$, that is, solve the so-called bias-variance trade-off. By extension, the variance term, also called estimation error, can be defined by

$$\mathbb{E}\left[\ell(s, \widehat{s}_m(D_n))\right] - \ell(s, S_m) = \mathbb{E}\left[\mathcal{L}_P(\widehat{s}_m)\right] - \inf_{t \in S_m} \mathcal{L}_P(t),$$

even when (2) does not hold.
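To see the trade-off in (2) numerically, here is a small sketch under our own illustrative assumptions: least-squares regression with $\lambda = \sigma^2/n$, and a polynomial decay $\ell(s, S_m) = D_m^{-2}$ of the approximation error (typical for a smooth target). It locates the dimension at which the two terms of (2) balance.

```python
import numpy as np

n, sigma2 = 500, 1.0
lam = sigma2 / n                 # variance coefficient in (2) for least squares
D = np.arange(1, 201)            # model dimensions D_m
bias = D ** (-2.0)               # assumed approximation errors l(s, S_m)

risk = bias + lam * D            # right-hand side of (2)
D_star = D[np.argmin(risk)]
print(D_star, risk.min())        # small D: bias dominates; large D: overfitting
# Here the balance is at D_star = 10, where -d(bias)/dD ~ lam.
```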
The interested reader can find a much deeper insight into model selection in the Saint-Flour lecture notes by Massart (2007). Before giving examples of classical model selection procedures, let us mention the two main different goals that model selection can target: estimation and identification.

2.2 Model selection for estimation

On the one hand, the goal of model selection is estimation when $\widehat{s}_{\widehat{m}(D_n)}(D_n)$ is used as an approximation of the target $s$, and the goal is to minimize its loss. For instance, the AIC and Mallows' $C_p$ model selection procedures are built for estimation (see Section 3.1).

The quality of a model selection procedure $D_n \mapsto \widehat{m}(D_n)$, designed for estimation, is measured by the excess loss of $\widehat{s}_{\widehat{m}(D_n)}(D_n)$. Hence, the best possible model choice for estimation is the so-called oracle model $S_{m^\star}$, defined by

$$m^\star = m^\star(D_n) \in \arg\min_{m \in \mathcal{M}_n} \left\{\ell(s, \widehat{s}_m(D_n))\right\}. \qquad (3)$$

Since $m^\star(D_n)$ depends on the unknown distribution $P$ of the data, one cannot expect to select $\widehat{m}(D_n) = m^\star(D_n)$ almost surely. Nevertheless, we can hope to select $\widehat{m}(D_n)$ such that $\widehat{s}_{\widehat{m}(D_n)}$ is almost as close to $s$ as $\widehat{s}_{m^\star(D_n)}$. Note that there is no requirement for $s$ to belong to $\bigcup_{m \in \mathcal{M}_n} S_m$.

Depending on the framework, the optimality of a model selection procedure for estimation is assessed in at least two different ways. First, in the asymptotic framework, a model selection procedure $\widehat{m}$ is called efficient (or asymptotically optimal) when it leads to $\widehat{m}$ such that

$$\frac{\ell\left(s, \widehat{s}_{\widehat{m}(D_n)}(D_n)\right)}{\inf_{m \in \mathcal{M}_n} \left\{\ell(s, \widehat{s}_m(D_n))\right\}} \xrightarrow[n \to \infty]{\text{a.s.}} 1.$$

Sometimes, a weaker result is proved, the convergence holding only in probability.

Second, in the non-asymptotic framework, a model selection procedure satisfies an oracle inequality with constant $C_n \ge 1$ and remainder term $R_n \ge 0$ when

$$\ell\left(s, \widehat{s}_{\widehat{m}(D_n)}(D_n)\right) \le C_n \inf_{m \in \mathcal{M}_n} \left\{\ell(s, \widehat{s}_m(D_n))\right\} + R_n \qquad (4)$$

holds either in expectation or with large probability (that is, a probability larger than $1 - C'/n^2$ for some positive constant $C'$). Note that if (4) holds on a large probability event with $C_n$ tending to 1 as $n$ tends to infinity and $R_n \ll \ell(s, \widehat{s}_{m^\star}(D_n))$, then the model selection procedure $\widehat{m}$ is efficient.

In the estimation setting, model selection is often used for building adaptive estimators, assuming that $s$ belongs to some function space $\mathcal{T}_\alpha$ (Barron et al., 1999). Then, a model selection procedure $\widehat{m}$ is optimal when it leads to an estimator $\widehat{s}_{\widehat{m}(D_n)}(D_n)$ that is (approximately) minimax with respect to $\mathcal{T}_\alpha$ without knowing $\alpha$, provided the family $(S_m)_{m \in \mathcal{M}_n}$ has been well chosen.

2.3 Model selection for identification

On the other hand, model selection can aim at identifying the true model $S_{m_0}$, defined as the smallest model among $(S_m)_{m \in \mathcal{M}_n}$ to which $s$ belongs. In particular, $s \in \bigcup_{m \in \mathcal{M}_n} S_m$ is assumed in this setting. A typical example of a model selection procedure built for identification is BIC (see Section 3.3).

The quality of a model selection procedure designed for identification is measured by its probability of recovering the true model $m_0$. Then, a model selection procedure is called (model) consistent when

$$\mathbb{P}\left(\widehat{m}(D_n) = m_0\right) \xrightarrow[n \to \infty]{} 1.$$

Note that identification can naturally be extended to the general algorithm selection problem, the true model being replaced by the statistical algorithm whose risk converges at the fastest rate (see for instance Yang, 2007).

2.4 Estimation vs. identification

When a true model exists, model consistency is clearly a stronger property than the efficiency defined in Section 2.2. However, in many frameworks, no true model exists, so that efficiency is the only well-defined property. Could a model selection procedure be model consistent in the former case (like BIC) and efficient in the latter case (like AIC)? The general answer to this question, often called the AIC-BIC dilemma, is negative: Yang (2005) proved in the regression framework that no model selection procedure can be simultaneously model consistent and minimax rate optimal. Nevertheless, the strengths of AIC and BIC can sometimes be shared; see for instance the introduction of a paper by Yang (2005) and a recent paper by van Erven et al. (2008).

3 Overview of some model selection procedures

Several approaches can be used for model selection. Let us briefly sketch here some of them, which are particularly helpful for understanding how CV works. Like CV, all the procedures considered in this section select

$$\widehat{m}(D_n) \in \arg\min_{m \in \mathcal{M}_n} \left\{\mathrm{crit}(m; D_n)\right\}, \qquad (5)$$

where $\mathrm{crit}(m; D_n) = \mathrm{crit}(m) \in \mathbb{R}$ is some data-dependent criterion, defined for every $m \in \mathcal{M}_n$. A particular case of (5) is penalization, which consists in choosing the model minimizing the sum of the empirical contrast and some measure of complexity of the model (called a penalty), which can depend on the data; that is,

$$\widehat{m}(D_n) \in \arg\min_{m \in \mathcal{M}_n} \left\{\mathcal{L}_{P_n}(\widehat{s}_m) + \mathrm{pen}(m; D_n)\right\}. \qquad (6)$$

This section does not pretend to be exhaustive. Completely different approaches exist for model selection, such as the Minimum Description Length (MDL) (Rissanen, 1983) and the Bayesian approaches. The interested reader will find more details and references on model selection procedures in the books by Burnham and Anderson (2002) or Massart (2007), for instance. Let us focus here on five main categories of model selection procedures, the first three coming from a classification made by Shao (1997) in the linear regression framework.

3.1 The unbiased risk estimation principle

When the goal of model selection is estimation, many model selection procedures are of the form (5) where $\mathrm{crit}(m; D_n)$ unbiasedly estimates (at least asymptotically) the loss $\mathcal{L}_P(\widehat{s}_m)$. This general idea is often called the unbiased risk estimation principle, or Mallows' or Akaike's heuristics.

In order to explain why this strategy can perform well, let us write the starting point of most theoretical analyses of procedures defined by (5): by definition (5), for every $m \in \mathcal{M}_n$,

$$\ell(s, \widehat{s}_{\widehat{m}}) + \mathrm{crit}(\widehat{m}) - \mathcal{L}_P(\widehat{s}_{\widehat{m}}) \le \ell(s, \widehat{s}_m) + \mathrm{crit}(m) - \mathcal{L}_P(\widehat{s}_m). \qquad (7)$$

If $\mathbb{E}[\mathrm{crit}(m) - \mathcal{L}_P(\widehat{s}_m)] = 0$ for every $m \in \mathcal{M}_n$, then concentration inequalities are likely to prove that there exist $\epsilon_n^-, \epsilon_n^+ > 0$ such that, with high probability,

$$\forall m \in \mathcal{M}_n, \quad \epsilon_n^+ \ge \frac{\mathrm{crit}(m) - \mathcal{L}_P(\widehat{s}_m)}{\ell(s, \widehat{s}_m)} \ge -\epsilon_n^- > -1,$$

at least when $\mathrm{Card}(\mathcal{M}_n) \le C n^\alpha$ for some $C, \alpha \ge 0$. Then, (7) directly implies an oracle inequality like (4) with $C_n = (1 + \epsilon_n^+)/(1 - \epsilon_n^-)$. If $\epsilon_n^+, \epsilon_n^- \to 0$ as $n \to \infty$, this proves that the procedure defined by (5) is efficient.
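As an illustration of the selection rules (5)-(6), here is a sketch under our own assumptions (the model family and its empirical risks are hypothetical numbers): Mallows' $C_p$ in least-squares regression with a known noise level, whose penalty $2\sigma^2 D_m/n$ is twice the variance term of (2).

```python
import numpy as np

def cp_criterion(emp_risk, dim, n, sigma2):
    """Penalized criterion (6) with Mallows' C_p penalty pen(m) = 2 sigma^2 D_m / n."""
    return emp_risk + 2 * sigma2 * dim / n

# Hypothetical family M_n: larger models fit better but pay a larger penalty.
# models: m -> (empirical contrast of s_hat_m, dimension D_m)
models = {1: (0.90, 1), 2: (0.55, 2), 5: (0.30, 5), 20: (0.26, 20), 50: (0.25, 50)}
n, sigma2 = 100, 1.0
m_hat = min(models, key=lambda m: cp_criterion(*models[m], n, sigma2))
print(m_hat)   # selection rule (5): the argmin of crit(m); here m_hat = 5
```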
Examples of model selection procedures following the unbiased risk estimation principle are FPE (Final Prediction Error; Akaike, 1970), several cross-validation procedures including the leave-one-out (see Section 4), and GCV (Generalized Cross-Validation; Craven and Wahba, 1979; see Section 4.3.3). With the penalization approach (6), the unbiased risk estimation principle states that $\mathbb{E}[\mathrm{pen}(m)]$ should be close to the ideal penalty

$$\mathrm{pen}_{\mathrm{id}}(m) := \mathcal{L}_P(\widehat{s}_m) - \mathcal{L}_{P_n}(\widehat{s}_m).$$

Several classical penalization procedures follow this principle, for instance:

• With the log-likelihood contrast, AIC (Akaike Information Criterion; Akaike, 1973) and its corrected versions (Sugiura, 1978; Hurvich and Tsai, 1989).

• With the least-squares contrast, Mallows' $C_p$ (Mallows, 1973) and several refined versions of $C_p$ (see for instance Baraud, 2002).

• With a general contrast, covariance penalties (Efron, 2004).

AIC, Mallows' $C_p$ and related procedures have been proved to be optimal for estimation in several frameworks, provided $\mathrm{Card}(\mathcal{M}_n) \le C n^\alpha$ for some constants $C, \alpha \ge 0$ (see the paper by Birgé and Massart, 2007, and references therein).

The main drawback of penalties such as AIC or Mallows' $C_p$ is their dependence on some assumptions on the distribution of the data. For instance, Mallows' $C_p$ assumes that the variance of $Y$ does not depend on $X$; otherwise, it has a suboptimal performance (Arlot, 2008b). Several resampling-based penalties have been proposed to overcome this problem, at the price of a larger computational complexity and a possibly slightly worse performance in simpler frameworks; see the paper by Efron (1983) for the bootstrap, and the paper by Arlot (2008a) and references therein for the generalization to exchangeable weights.

Finally, note that all these penalties depend on multiplying factors which are not always known (for instance, the noise level for Mallows' $C_p$). Birgé and Massart (2007) proposed a general data-driven procedure for estimating such multiplying factors, which satisfies an oracle inequality with $C_n \to 1$ in regression (see also Arlot and Massart, 2009).

3.2 Biased estimation of the risk

Several model selection procedures are of the form (5) where $\mathrm{crit}(m)$ does not unbiasedly estimate the loss $\mathcal{L}_P(\widehat{s}_m)$: the weight of the variance term compared to the bias in $\mathbb{E}[\mathrm{crit}(m)]$ is slightly larger than in the decomposition (2) of $\mathcal{L}_P(\widehat{s}_m)$. From the penalization point of view, such procedures are overpenalizing.

Examples of such procedures are FPE$_\alpha$ (Bhansali and Downham, 1977) and GIC$_\lambda$ (Generalized Information Criterion; Nishii, 1984; Shao, 1997) with $\alpha, \lambda > 2$, which are closely related. Some cross-validation procedures, such as leave-$p$-out with $p/n \in (0, 1)$ fixed, also belong to this category (see Section 4.3.1). Note that FPE$_\alpha$ with $\alpha = 2$ is FPE, and GIC$_\lambda$ with $\lambda = 2$ is close to FPE and Mallows' $C_p$.

When the goal is estimation, there are two main reasons for using biased model selection procedures. First, experimental evidence shows that overpenalizing often yields better performance when the signal-to-noise ratio is small (see for instance Arlot, 2007, Chapter 11). Second, when the number of models $\mathrm{Card}(\mathcal{M}_n)$ grows faster than any power of $n$, as in the complete variable selection problem with $n$ variables, the unbiased risk estimation principle fails.
From the penalization point of view, Birgé and Massart (2007) proved that when $\mathrm{Card}(\mathcal{M}_n) = e^{\kappa n}$ for some $\kappa > 0$, the minimal amount of penalty required for an oracle inequality to hold with $C_n = O(1)$ is much larger than $\mathrm{pen}_{\mathrm{id}}(m)$. In addition to FPE$_\alpha$ and GIC$_\lambda$ with suitably chosen $\alpha, \lambda$, several penalization procedures have been proposed for taking into account the size of $\mathcal{M}_n$ (Barron et al., 1999; Baraud, 2002; Birgé and Massart, 2001; Sauvé, 2009). In the same papers, these procedures are proved to satisfy oracle inequalities with $C_n$ as small as possible, typically of order $\ln(n)$ when $\mathrm{Card}(\mathcal{M}_n) = e^{\kappa n}$.

3.3 Procedures built for identification

Some specific model selection procedures are used for identification. A typical example is BIC (Bayesian Information Criterion; Schwarz, 1978). More generally, Shao (1997) showed that several procedures consistently identify the correct model in the linear regression framework as soon as they overpenalize within a factor tending to infinity with $n$: for instance, GIC$_{\lambda_n}$ with $\lambda_n \to +\infty$, FPE$_{\alpha_n}$ with $\alpha_n \to +\infty$ (Shibata, 1984), and several CV procedures such as leave-$p$-out with $p = p_n \sim n$. BIC is also part of this picture, since it coincides with GIC$_{\ln(n)}$.

In another paper, Shao (1996) showed that $m_n$-out-of-$n$ bootstrap penalization is also model consistent as soon as $m_n \sim n$. Compared to Efron's bootstrap penalties, the idea is to estimate $\mathrm{pen}_{\mathrm{id}}$ with the $m_n$-out-of-$n$ bootstrap instead of the usual bootstrap, which results in overpenalization within a factor tending to infinity with $n$ (Arlot, 2008a). Most MDL-based procedures can also be put into this category of model selection procedures (see Grünwald, 2007). Let us finally mention the Lasso (Tibshirani, 1996) and other $\ell^1$ penalization procedures, which have recently attracted much attention (see for instance Hesterberg et al., 2008). They are a computationally efficient way of identifying the true model in the context of variable selection with many variables.

3.4 Structural risk minimization

In the context of statistical learning, Vapnik and Chervonenkis (1974) proposed the structural risk minimization approach (see also Vapnik, 1982, 1998). Roughly, the idea is to penalize the empirical contrast with a penalty (over)estimating

$$\mathrm{pen}_{\mathrm{id,g}}(m) := \sup_{t \in S_m} \left\{\mathcal{L}_P(t) - \mathcal{L}_{P_n}(t)\right\} \ge \mathrm{pen}_{\mathrm{id}}(m).$$

Such penalties have been built using the Vapnik-Chervonenkis dimension, the combinatorial entropy, (global) Rademacher complexities (Koltchinskii, 2001; Bartlett et al., 2002), (global) bootstrap penalties (Fromont, 2007), Gaussian complexities or the maximal discrepancy (Bartlett and Mendelson, 2002). These penalties are often called global because $\mathrm{pen}_{\mathrm{id,g}}(m)$ is a supremum over $S_m$.

The localization approach (see Boucheron et al., 2005) has been introduced in order to obtain penalties closer to $\mathrm{pen}_{\mathrm{id}}$ (such as local Rademacher complexities), hence smaller prediction errors when possible (Bartlett et al., 2005; Koltchinskii, 2006). Nevertheless, these penalties are still larger than $\mathrm{pen}_{\mathrm{id}}(m)$ and can be difficult to compute in practice because of several unknown constants. A non-asymptotic analysis of several global and local penalties can be found in the book by Massart (2007), for instance; see also Koltchinskii (2006) for recent results on local penalties.
3.5 Ad hoc penalization

Let us finally mention that penalties can also be built according to particular features of the problem. For instance, penalties can be proportional to the $\ell^p$ norm of $\widehat{s}_m$ (similarly to $\ell^p$-regularized learning algorithms) when having an estimator with a controlled $\ell^p$ norm seems better. The penalty can also be proportional to the squared norm of $\widehat{s}_m$ in some reproducing kernel Hilbert space (similarly to kernel ridge regression or spline smoothing), with a kernel adapted to the specific framework. More generally, any penalty can be used, as soon as $\mathrm{pen}(m)$ is larger than the estimation error (to avoid overfitting); the best model for the final user is then not the oracle $m^\star$, but rather

$$\arg\min_{m \in \mathcal{M}_n} \left\{\ell(s, S_m) + \kappa\,\mathrm{pen}(m)\right\}$$

for some $\kappa > 0$.

3.6 Where are cross-validation procedures in this picture?

The family of CV procedures, which will be described and deeply investigated in the next sections, contains procedures in the first three categories. CV procedures are all of the form (5), where $\mathrm{crit}(m)$ either estimates (almost) unbiasedly the loss $\mathcal{L}_P(\widehat{s}_m)$, or overestimates the variance term (see Section 2.1). In the latter case, CV procedures belong either to the second or to the third category, depending on the overestimation level.

This fact has two major implications. First, CV itself does not take into account prior information for selecting a model. To do so, one can either add to the CV estimate of the risk a penalty term (such as $\|\widehat{s}_m\|_p$), or use prior information to pre-select a subset of models $\widetilde{\mathcal{M}}(D_n) \subset \mathcal{M}_n$ before letting CV select a model among $(S_m)_{m \in \widetilde{\mathcal{M}}(D_n)}$.

Second, in statistical learning, CV and resampling-based procedures are the most widely used model selection procedures. Structural risk minimization is often too pessimistic, and other alternatives rely on unrealistic assumptions. But if CV and resampling-based procedures are the most likely to yield good prediction performances, their theoretical grounds are not that firm, and too few CV users are careful enough when choosing a CV procedure to perform model selection. Among the aims of this survey is to point out both positive and negative results about the model selection performance of CV.

4 Cross-validation procedures

The purpose of this section is to describe the rationale behind CV and to define the different CV procedures. Since all CV procedures are of the form (5), defining a CV procedure amounts to defining the corresponding CV estimator of the risk of an algorithm $\mathcal{A}$, which will play the role of $\mathrm{crit}(\cdot)$ in (5).

4.1 Cross-validation philosophy

As noticed as early as the 1930s by Larson (1931), training an algorithm and evaluating its statistical performance on the same data yields an overoptimistic result. CV was raised to fix this issue (Mosteller and Tukey, 1968; Stone, 1974; Geisser, 1975), starting from the remark that testing the output of the algorithm on new data would yield a good estimate of its performance (Breiman, 1998).

In most real applications, only a limited amount of data is available, which led to the idea of splitting the data: part of the data (the training sample) is used for training the algorithm, and the remaining data (the validation sample) are used for evaluating its performance. The validation sample can play the role of new data as soon as the data are i.i.d.
Data splitting yields the validation estimate of the risk, and averaging over several splits yields a cross-validation estimate of the risk. As will be shown in Sections 4.2 and 4.3, various splitting strategies lead to various CV estimates of the risk.

The major interest of CV lies in the universality of the data splitting heuristics, which only assumes that the data are identically distributed and that the training and validation samples are independent, two assumptions which can even be relaxed (see Section 8.3). Therefore, CV can be applied to (almost) any algorithm in (almost) any framework, for instance regression (Stone, 1974; Geisser, 1975), density estimation (Rudemo, 1982; Stone, 1984) and classification (Devroye and Wagner, 1979; Bartlett et al., 2002), among many others. On the contrary, most other model selection procedures (see Section 3) are specific to a framework: for instance, $C_p$ (Mallows, 1973) is specific to least-squares regression.

4.2 From validation to cross-validation

In this section, the hold-out (or validation) estimator of the risk is defined, leading to a general definition of CV.

4.2.1 Hold-out

The hold-out (Devroye and Wagner, 1979), or (simple) validation, relies on a single split of the data. Formally, let $I^{(t)}$ be a non-empty proper subset of $\{1, \ldots, n\}$, that is, such that both $I^{(t)}$ and its complement $I^{(v)} = (I^{(t)})^c = \{1, \ldots, n\} \setminus I^{(t)}$ are non-empty. The hold-out estimator of the risk of $\mathcal{A}(D_n)$ with training set $I^{(t)}$ is defined by

$$\widehat{L}^{\mathrm{HO}}\left(\mathcal{A}; D_n; I^{(t)}\right) := \frac{1}{n_v} \sum_{i \in I^{(v)}} \gamma\left(\mathcal{A}(D_n^{(t)}); \xi_i\right), \qquad (8)$$

where $D_n^{(t)} := (\xi_i)_{i \in I^{(t)}}$ is the training sample, of size $n_t = \mathrm{Card}(I^{(t)})$, and $D_n^{(v)} := (\xi_i)_{i \in I^{(v)}}$ is the validation sample, of size $n_v = n - n_t$; $I^{(v)}$ is called the validation set. The question of choosing $n_t$, and $I^{(t)}$ given its cardinality $n_t$, is discussed in the rest of this survey.

4.2.2 General definition of cross-validation

A general description of the CV strategy has been given by Geisser (1975): in brief, CV consists in averaging several hold-out estimators of the risk corresponding to different splits of the data. Formally, let $B \ge 1$ be an integer and $I_1^{(t)}, \ldots, I_B^{(t)}$ be a sequence of non-empty proper subsets of $\{1, \ldots, n\}$. The CV estimator of the risk of $\mathcal{A}(D_n)$ with training sets $(I_j^{(t)})_{1 \le j \le B}$ is defined by

$$\widehat{L}^{\mathrm{CV}}\left(\mathcal{A}; D_n; (I_j^{(t)})_{1 \le j \le B}\right) := \frac{1}{B} \sum_{j=1}^B \widehat{L}^{\mathrm{HO}}\left(\mathcal{A}; D_n; I_j^{(t)}\right). \qquad (9)$$

All existing CV estimators of the risk are of the form (9), each one being uniquely determined by the way the sequence $(I_j^{(t)})_{1 \le j \le B}$ is chosen, that is, by the choice of the splitting scheme.

Note that when CV is used in model selection for identification, an alternative definition of CV was proposed by Yang (2006, 2007) and called CV with voting (CV-v). When two algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$ are compared, $\mathcal{A}_1$ is selected by CV-v if and only if $\widehat{L}^{\mathrm{HO}}(\mathcal{A}_1; D_n; I_j^{(t)}) < \widehat{L}^{\mathrm{HO}}(\mathcal{A}_2; D_n; I_j^{(t)})$ for a majority of the splits $j = 1, \ldots, B$. By contrast, CV procedures of the form (9) can be called CV with averaging (CV-a), since the estimates of the risk of the algorithms are averaged before their comparison.
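Definitions (8) and (9) translate directly into code. The sketch below is a generic illustration under our own conventions (all names are ours, not the paper's notation): it computes the hold-out estimator for one training set and averages it over a list of training sets, for any algorithm $\mathcal{A}$ and contrast $\gamma$.

```python
import numpy as np

def holdout_risk(algorithm, data, train_idx, gamma):
    """Hold-out estimator (8): train A on D_n^(t), then average the contrast
    gamma(A(D_n^(t)); xi_i) over the validation sample i in I^(v)."""
    train = set(train_idx)
    s_hat = algorithm([data[i] for i in train_idx])
    val = [data[i] for i in range(len(data)) if i not in train]
    return float(np.mean([gamma(s_hat, xi) for xi in val]))

def cv_risk(algorithm, data, train_sets, gamma):
    """CV estimator (9): average of the B hold-out estimators."""
    return float(np.mean([holdout_risk(algorithm, data, I, gamma)
                          for I in train_sets]))

# Toy check: A = empirical mean, gamma(t; xi) = (t - xi)^2, one 80/20 split.
rng = np.random.default_rng(0)
data = list(rng.normal(size=50))
algo = lambda sample: float(np.mean(sample))
gamma = lambda t, xi: (t - xi) ** 2
print(cv_risk(algo, data, [list(range(40))], gamma))
```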
4.3 Classical examples

Most classical CV estimators split the data with a fixed size $n_t$ of the training set, that is, $\mathrm{Card}(I_j^{(t)}) \approx n_t$ for every $j$. The question of choosing $n_t$ is discussed extensively in the rest of this survey. In this subsection, several CV estimators are defined. Given $n_t$, two main categories of splitting schemes can be distinguished: exhaustive data splitting, that is, considering all training sets $I^{(t)}$ of size $n_t$, and partial data splitting.

4.3.1 Exhaustive data splitting

Leave-one-out (LOO; Stone, 1974; Allen, 1974; Geisser, 1975) is the most classical exhaustive CV procedure, corresponding to the choice $n_t = n - 1$: each data point is successively left out from the sample and used for validation. Formally, LOO is defined by (9) with $B = n$ and $I_j^{(t)} = \{j\}^c$ for $j = 1, \ldots, n$:

$$\widehat{L}^{\mathrm{LOO}}(\mathcal{A}; D_n) = \frac{1}{n} \sum_{j=1}^n \gamma\left(\mathcal{A}(D_n^{(-j)}); \xi_j\right), \qquad (10)$$

where $D_n^{(-j)} = (\xi_i)_{i \ne j}$. The name LOO can be traced back to papers by Picard and Cook (1984) and by Breiman and Spector (1992), but LOO has several other names in the literature, such as delete-one CV (see Li, 1987), ordinary CV (Stone, 1974; Burman, 1989), or even only CV (Efron, 1983; Li, 1987).

Leave-$p$-out (LPO; Shao, 1993) with $p \in \{1, \ldots, n\}$ is the exhaustive CV with $n_t = n - p$: every possible set of $p$ data points is successively left out from the sample and used for validation. Therefore, LPO is defined by (9) with $B = \binom{n}{p}$, the $(I_j^{(t)})_{1 \le j \le B}$ being the complements of all the subsets of $\{1, \ldots, n\}$ of size $p$. LPO is also called delete-$p$ CV or delete-$p$ multifold CV (Zhang, 1993). Note that LPO with $p = 1$ is LOO.

4.3.2 Partial data splitting

Considering $\binom{n}{p}$ training sets can be computationally intractable, even for small $p$, so that partial data splitting methods have been proposed.

$V$-fold CV (VFCV) with $V \in \{1, \ldots, n\}$ was introduced by Geisser (1975) as an alternative to the computationally expensive LOO (see also Breiman et al., 1984, for instance). VFCV relies on a preliminary partitioning of the data into $V$ subsamples of approximately equal cardinality $n/V$; each of these subsamples successively plays the role of validation sample. Formally, let $A_1, \ldots, A_V$ be some partition of $\{1, \ldots, n\}$ with $\mathrm{Card}(A_j) \approx n/V$. Then, the VFCV estimator of the risk of $\mathcal{A}$ is defined by (9) with $B = V$ and $I_j^{(t)} = A_j^c$ for $j = 1, \ldots, B$, that is,

$$\widehat{L}^{\mathrm{VF}}\left(\mathcal{A}; D_n; (A_j)_{1 \le j \le V}\right) = \frac{1}{V} \sum_{j=1}^V \frac{1}{\mathrm{Card}(A_j)} \sum_{i \in A_j} \gamma\left(\mathcal{A}(D_n^{(-A_j)}); \xi_i\right), \qquad (11)$$

where $D_n^{(-A_j)} = (\xi_i)_{i \in A_j^c}$. By construction, the algorithmic complexity of VFCV is only $V$ times that of training $\mathcal{A}$ with $n - n/V$ data points, which is much less than that of LOO or LPO if $V \ll n$. Note that VFCV with $V = n$ is LOO.
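The classical schemes differ only in how the training sets $(I_j^{(t)})_{1 \le j \le B}$ are generated. The sketch below is illustrative (it feeds the hypothetical `cv_risk` function of the previous sketch, and all names are ours); it produces the LOO, LPO, $V$-fold and Monte-Carlo families of training sets.

```python
import itertools
import numpy as np

def loo_splits(n):
    """Leave-one-out: B = n training sets {j}^c."""
    return [[i for i in range(n) if i != j] for j in range(n)]

def lpo_splits(n, p):
    """Leave-p-out: the C(n, p) training sets, complements of every validation
    set of size p (exhaustive, hence intractable unless p is tiny)."""
    idx = set(range(n))
    return [sorted(idx - set(v)) for v in itertools.combinations(range(n), p)]

def vfold_splits(n, V, rng):
    """V-fold: partition {0, ..., n-1} into V blocks A_j of size ~ n/V;
    the j-th training set is the complement of A_j."""
    folds = np.array_split(rng.permutation(n), V)
    return [sorted(np.concatenate([f for k, f in enumerate(folds) if k != j]))
            for j in range(V)]

def mccv_splits(n, n_t, B, rng):
    """Monte-Carlo CV: B independent uniform draws of training sets of size n_t."""
    return [sorted(rng.choice(n, size=n_t, replace=False)) for _ in range(B)]

rng = np.random.default_rng(0)
print(len(loo_splits(10)), len(lpo_splits(10, 2)), len(vfold_splits(10, 5, rng)))
```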
Balanced Incomplete CV (BICV; Shao, 1993) can be seen as an alternative to VFCV well suited for small training sample sizes $n_t$. Indeed, BICV is defined by (9) with training sets $(A^c)_{A \in \mathcal{T}}$, where $\mathcal{T}$ is a balanced incomplete block design (BIBD; John, 1971), that is, a collection of $B > 0$ subsets of $\{1, \ldots, n\}$ of size $n_v = n - n_t$ such that:

1. $\mathrm{Card}\{A \in \mathcal{T} \text{ s.t. } k \in A\}$ does not depend on $k \in \{1, \ldots, n\}$;

2. $\mathrm{Card}\{A \in \mathcal{T} \text{ s.t. } k, \ell \in A\}$ does not depend on $k \ne \ell \in \{1, \ldots, n\}$.

The idea of BICV is to give each data point (and each pair of data points) the same role in the training and validation tasks. Note that VFCV relies on a similar idea, since the set of training sample indices used by VFCV satisfies the first property and almost the second one: pairs $(k, \ell)$ belonging to the same $A_j$ appear in one validation set more than other pairs.

Repeated learning-testing (RLT) was introduced by Breiman et al. (1984) and further studied by Burman (1989) and by Zhang (1993), for instance. The RLT estimator of the risk of $\mathcal{A}$ is defined by (9) with any $B > 0$, the $(I_j^{(t)})_{1 \le j \le B}$ being $B$ different subsets of $\{1, \ldots, n\}$, chosen randomly and independently from the data. RLT can be seen as an approximation to LPO with $p = n - n_t$, with which it coincides when $B = \binom{n}{p}$.

Monte-Carlo CV (MCCV; Picard and Cook, 1984) is very close to RLT: $B$ independent subsets of $\{1, \ldots, n\}$ are randomly drawn, with uniform distribution among the subsets of size $n_t$. The only difference with RLT is that MCCV allows the same split to be chosen several times.

4.3.3 Other cross-validation-like risk estimators

Several procedures have been introduced which are close to, or based upon, CV. Most of them aim at fixing an observed drawback of CV.

Bias-corrected versions of the VFCV and RLT risk estimators have been proposed by Burman (1989, 1990), and a closely related penalization procedure called $V$-fold penalization has been defined by Arlot (2008); see Section 5.1.2 for details.

Generalized CV (GCV; Craven and Wahba, 1979) was introduced as a rotation-invariant version of LOO in least-squares regression, for estimating the risk of a linear estimator $\widehat{s} = MY$, where $Y = (Y_i)_{1 \le i \le n} \in \mathbb{R}^n$ and $M$ is an $n \times n$ matrix independent of $Y$:

$$\mathrm{crit}_{\mathrm{GCV}}(M, Y) := \frac{n^{-1}\left\|Y - MY\right\|^2}{\left(1 - n^{-1}\operatorname{tr}(M)\right)^2}, \quad \text{where} \quad \forall t \in \mathbb{R}^n, \ \|t\|^2 = \sum_{i=1}^n t_i^2.$$

GCV is actually closer to $C_L$ (Mallows, 1973) than to CV, since GCV can be seen as an approximation to $C_L$ with a particular estimator of the variance (Efron, 1986). The efficiency of GCV has been proved in various frameworks, in particular by Li (1985, 1987) and by Cao and Golubev (2006).
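For concreteness, here is a sketch of the GCV score applied to choosing the ridge parameter of a linear smoother; the design, the grid of $\lambda$ values, and all names are our own illustrative assumptions, not an example from the survey.

```python
import numpy as np

def gcv_score(M, Y):
    """crit_GCV(M, Y) = (n^{-1} ||Y - M Y||^2) / (1 - n^{-1} tr(M))^2."""
    n = len(Y)
    res = Y - M @ Y
    return (res @ res / n) / (1 - np.trace(M) / n) ** 2

rng = np.random.default_rng(0)
n, d = 60, 8
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

# Ridge smoother matrix M_lambda = X (X'X + lambda I)^{-1} X'; pick lambda by GCV.
scores = {lam: gcv_score(X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T), Y)
          for lam in (1e-3, 1e-1, 1.0, 10.0, 100.0)}
print(min(scores, key=scores.get), scores)
```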
Analytic Approximation. When CV is used for selecting among linear models, Shao (1993) proposed an analytic approximation to LPO with $p \sim n$, which is called APCV.

LOO bootstrap and .632 bootstrap. The bootstrap is often used for stabilizing an estimator or an algorithm, replacing $\mathcal{A}(D_n)$ by the average of $\mathcal{A}(D_n^\star)$ over several bootstrap resamples $D_n^\star$. This idea was applied by Efron (1983) to the LOO estimator of the risk, leading to the LOO bootstrap. Noting that the LOO bootstrap was biased, Efron (1983) gave a heuristic argument leading to the .632 bootstrap estimator of the risk, later modified into the .632+ bootstrap by Efron and Tibshirani (1997). The main drawback of these procedures is the weakness of their theoretical justifications. Only empirical studies have supported the good behaviour of the .632+ bootstrap (Efron and Tibshirani, 1997; Molinaro et al., 2005).

4.4 Historical remarks

Simple validation, or hold-out, was the first CV-like procedure. It was introduced in the psychology area (Larson, 1931) from the need for a reliable alternative to the resubstitution error, as illustrated by Anderson et al. (1972). The hold-out was used by Herzberg (1969) for assessing the quality of predictors. The problem of choosing the training set was first considered by Stone (1974), where controllable and uncontrollable data splits were distinguished; an instance of uncontrollable division can be found in the book by Simon (1971).

A primitive LOO procedure was used by Hills (1966) and by Lachenbruch and Mickey (1968) for evaluating the error rate of a prediction rule, and a primitive formulation of LOO can be found in a paper by Mosteller and Tukey (1968). Nevertheless, LOO was actually introduced independently by Stone (1974), by Allen (1974) and by Geisser (1975). The relationship between LOO and the jackknife (Quenouille, 1949), which both rely on the idea of removing one observation from the sample, has been discussed by Stone (1974), for instance.

The hold-out and CV were originally used only for estimating the risk of an algorithm. The idea of using CV for model selection arose in the discussion of a paper by Efron and Morris (1973) and in a paper by Geisser (1974). The first author to study LOO as a model selection procedure was Stone (1974), who proposed to use LOO again for estimating the risk of the selected model.

5 Statistical properties of cross-validation estimators of the risk

Understanding the behaviour of CV for model selection, which is the purpose of this survey, first requires analyzing the performance of CV as an estimator of the risk of a single algorithm. Two main properties of CV estimators of the risk are of particular interest: their bias and their variance.

5.1 Bias

Dealing with the bias incurred by CV estimates can be done with two strategies: evaluating the amount of bias in order to choose the least biased CV procedure, or correcting for this bias.

5.1.1 Theoretical assessment of the bias

The independence of the training and validation samples implies that for every algorithm $\mathcal{A}$ and any $I^{(t)} \subset \{1, \ldots, n\}$ with cardinality $n_t$,

$$\mathbb{E}\left[\widehat{L}^{\mathrm{HO}}\left(\mathcal{A}; D_n; I^{(t)}\right)\right] = \mathbb{E}\left[\gamma\left(\mathcal{A}(D_n^{(t)}); \xi\right)\right] = \mathbb{E}\left[\mathcal{L}_P(\mathcal{A}(D_{n_t}))\right].$$

Therefore, assuming that $\mathrm{Card}(I_j^{(t)}) = n_t$ for $j = 1, \ldots, B$, the expectation of the CV estimator of the risk only depends on $n_t$:

$$\mathbb{E}\left[\widehat{L}^{\mathrm{CV}}\left(\mathcal{A}; D_n; (I_j^{(t)})_{1 \le j \le B}\right)\right] = \mathbb{E}\left[\mathcal{L}_P(\mathcal{A}(D_{n_t}))\right]. \qquad (12)$$

In particular, (12) shows that the bias of the CV estimator of the risk of $\mathcal{A}$ is the difference between the risks of $\mathcal{A}$ computed with $n_t$ and with $n$ data points, respectively. Since $n_t < n$, the bias of CV is usually nonnegative, which can be proved rigorously when the risk of $\mathcal{A}$ is a decreasing function of $n$, that is, when $\mathcal{A}$ is a smart rule; note however that a classical algorithm such as the 1-nearest-neighbour classifier is not smart (Devroye et al., 1996, Section 6.8). Similarly, the bias of CV tends to decrease with $n_t$, which is rigorously true if $\mathcal{A}$ is smart.
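Equation (12) is easy to check by simulation. The toy sketch below is an illustration under our own assumptions: $\mathcal{A}$ is the empirical mean of an $\mathcal{N}(0,1)$ sample and $\gamma(t;\xi) = (t-\xi)^2$, so that $\mathbb{E}[\mathcal{L}_P(\mathcal{A}(D_m))] = 1 + 1/m$. The average of hold-out estimates with $n_t = n/2$ matches the risk at $n_t$ rather than at $n$, exhibiting the nonnegative bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_t, reps = 100, 50, 20000

ho = np.empty(reps)
for r in range(reps):
    data = rng.normal(size=n)               # xi_i ~ N(0, 1), target s = 0
    t = data[:n_t].mean()                   # A(D_n^(t)): mean of the training half
    ho[r] = np.mean((t - data[n_t:]) ** 2)  # hold-out estimate (8)

# E[hold-out] = E[L_P(A(D_{n_t}))] = 1 + 1/n_t, larger than 1 + 1/n = E[L_P(A(D_n))].
print(ho.mean(), 1 + 1 / n_t, 1 + 1 / n)    # ~1.020 vs 1.020 vs 1.010
```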
More precisely, (12) has led to several results on the bias of CV, which can be split into three main categories: asymptotic results ($\mathcal{A}$ is fixed and the sample size $n$ tends to infinity), non-asymptotic results (where $\mathcal{A}$ is allowed to make use of a number of parameters growing with $n$, say $n^{1/2}$, as is frequent in model selection), and empirical results. They are listed below by statistical framework.

Regression. The general behaviour of the bias of CV (positive, decreasing with $n_t$) is confirmed by several papers and for several CV estimators. For LPO, non-asymptotic expressions of its bias were proved by Celisse (2008b) for projection estimators, and by Arlot and Celisse (2009) for regressogram and kernel estimators when the design is fixed. For VFCV and RLT, an asymptotic expansion of their bias was given by Burman (1989) for least-squares estimators in linear regression, and extended to spline smoothing (Burman, 1990). Note finally that Efron (1986) proved non-asymptotic analytic expressions of the expectations of the LOO and GCV estimators of the risk in regression with binary data (see also Efron, 1983, for some explicit calculations).

Density estimation shows a similar picture. Non-asymptotic expressions of the bias of LPO estimators for kernel and projection estimators with the quadratic risk were proved by Celisse and Robin (2008) and by Celisse (2008a). Asymptotic expansions of the bias of the LOO estimator for histograms and kernel estimators were previously proved by Rudemo (1982); see Bowman (1984) for simulations. Hall (1987) derived similar results with the log-likelihood contrast for kernel estimators, and related the performance of LOO to the interaction between the kernel and the tails of the target density $s$.

Classification. For the simple problem of discriminating between two populations with shifted distributions, Davison and Hall (1992) compared the asymptotic bias of LOO and of the bootstrap, showing the superiority of LOO when the shift size is $n^{-1/2}$: as $n$ tends to infinity, the bias of LOO stays of order $n^{-1}$, whereas that of the bootstrap worsens to the order $n^{-1/2}$. On realistic synthetic and real biological data, Molinaro et al. (2005) compared the bias of LOO, VFCV and the .632+ bootstrap: the bias decreases with $n_t$, and is generally minimal for LOO. Nevertheless, the 10-fold CV bias is nearly minimal uniformly over their experiments. In the same experiments, the .632+ bootstrap exhibits the smallest bias for moderate sample sizes and small signal-to-noise ratios, but a much larger bias otherwise.

CV-calibrated algorithms. When a family of algorithms $(\mathcal{A}_\lambda)_{\lambda \in \Lambda}$ is given, and $\widehat{\lambda}$ is chosen by minimizing $\widehat{L}^{\mathrm{CV}}(\mathcal{A}_\lambda; D_n)$ over $\lambda$, $\widehat{L}^{\mathrm{CV}}(\mathcal{A}_{\widehat{\lambda}}; D_n)$ is biased for estimating the risk of $\mathcal{A}_{\widehat{\lambda}}(D_n)$, as reported from simulation experiments by Stone (1974) for the LOO, and by Jonathan et al. (2000) for VFCV in the variable selection setting. This bias is of a different nature compared to the previous frameworks. Indeed, $\widehat{L}^{\mathrm{CV}}(\mathcal{A}_{\widehat{\lambda}}; D_n)$ is biased simply because $\widehat{\lambda}$ was chosen using the same data as $\widehat{L}^{\mathrm{CV}}(\mathcal{A}_\lambda; D_n)$. This phenomenon is similar to the optimism of $\mathcal{L}_{P_n}(\widehat{s}(D_n))$ as an estimator of the loss of $\widehat{s}(D_n)$. The correct way of estimating the risk of $\mathcal{A}_{\widehat{\lambda}}(D_n)$ with CV is to consider the full algorithm $\mathcal{A}' : D_n \mapsto \mathcal{A}_{\widehat{\lambda}(D_n)}(D_n)$, and then to compute $\widehat{L}^{\mathrm{CV}}(\mathcal{A}'; D_n)$. The resulting procedure is called double cross by Stone (1974).

5.1.2 Correction of the bias

An alternative to choosing the CV estimator with the smallest bias is to correct for the bias of the CV estimator of the risk. Burman (1989, 1990) proposed a corrected VFCV estimator, defined by

$$\widehat{L}^{\mathrm{corrVF}}(\mathcal{A}; D_n) = \widehat{L}^{\mathrm{VF}}(\mathcal{A}; D_n) + \mathcal{L}_{P_n}\left(\mathcal{A}(D_n)\right) - \frac{1}{V}\sum_{j=1}^V \mathcal{L}_{P_n}\left(\mathcal{A}(D_n^{(-A_j)})\right),$$

and a corrected RLT estimator was defined similarly. Both estimators have been proved to be asymptotically unbiased for least-squares estimators in linear regression.
When the $A_j$'s have exactly the same size $n/V$, the corrected VFCV criterion is equal to the sum of the empirical risk and the $V$-fold penalty (Arlot, 2008), defined by

$$\mathrm{pen}_{\mathrm{VF}}(\mathcal{A}; D_n) = \frac{V - 1}{V}\sum_{j=1}^V \left[\mathcal{L}_{P_n}\left(\mathcal{A}(D_n^{(-A_j)})\right) - \mathcal{L}_{P_n^{(-A_j)}}\left(\mathcal{A}(D_n^{(-A_j)})\right)\right].$$

The $V$-fold penalized criterion was proved to be (almost) unbiased in the non-asymptotic framework for regressogram estimators.
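Here is a sketch of Burman's correction under the same generic conventions as the earlier CV sketches (the names are ours, and the toy usage is an assumption): the fold estimators are reused to form both the VFCV term of (11) and the correction term.

```python
import numpy as np

def corrected_vfold(algorithm, data, folds, gamma):
    """Corrected V-fold CV: the VF estimate (11), plus the full-sample empirical
    risk L_{P_n} of A(D_n), minus the average full-sample empirical risk of the
    fold estimators A(D_n^(-A_j))."""
    n, V = len(data), len(folds)
    emp_risk = lambda t: float(np.mean([gamma(t, xi) for xi in data]))  # L_{P_n}(t)
    vf = corr = 0.0
    for A_j in folds:
        block = set(A_j)
        t_j = algorithm([data[i] for i in range(n) if i not in block])
        vf += np.mean([gamma(t_j, data[i]) for i in A_j]) / V
        corr += emp_risk(t_j) / V
    return vf + emp_risk(algorithm(data)) - corr

# Toy usage with the empirical-mean algorithm and the squared contrast.
rng = np.random.default_rng(0)
data = list(rng.normal(size=60))
folds = [list(range(j, j + 20)) for j in (0, 20, 40)]
algo = lambda s: float(np.mean(s))
gamma = lambda t, xi: (t - xi) ** 2
print(corrected_vfold(algo, data, folds, gamma))
```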
5.2 Variance

CV estimators of the risk using training sets of the same size $n_t$ have the same bias, but they still behave quite differently; their variance $\mathrm{var}(\widehat{L}^{\mathrm{CV}}(\mathcal{A}; D_n; (I_j^{(t)})_{1 \le j \le B}))$ captures most of the information needed to explain these differences.

5.2.1 Variability factors

Assume that $\mathrm{Card}(I_j^{(t)}) = n_t$ for every $j$. The variance of CV results from the combination of several factors, in particular $(n_t, n_v)$ and $B$.

Influence of $(n_t, n_v)$. Let us consider the hold-out estimator of the risk. Following in particular Nadeau and Bengio (2003),

$$\mathrm{var}\left[\widehat{L}^{\mathrm{HO}}\left(\mathcal{A}; D_n; I^{(t)}\right)\right] = \mathbb{E}\left[\mathrm{var}\left(\mathcal{L}_{P_n^{(v)}}\left(\mathcal{A}(D_n^{(t)})\right) \,\Big|\, D_n^{(t)}\right)\right] + \mathrm{var}\left[\mathcal{L}_P\left(\mathcal{A}(D_{n_t})\right)\right] = \frac{1}{n_v}\mathbb{E}\left[\mathrm{var}\left(\gamma(\widehat{s}, \xi) \,\Big|\, \widehat{s} = \mathcal{A}(D_n^{(t)})\right)\right] + \mathrm{var}\left[\mathcal{L}_P\left(\mathcal{A}(D_{n_t})\right)\right]. \qquad (13)$$

The first term, proportional to $1/n_v$, shows that more data for validation decreases the variance of $\widehat{L}^{\mathrm{HO}}$, because it yields a better estimator of $\mathcal{L}_P(\mathcal{A}(D_n^{(t)}))$. The second term shows that the variance of $\widehat{L}^{\mathrm{HO}}$ also depends on the distribution of $\mathcal{L}_P(\mathcal{A}(D_n^{(t)}))$ around its expectation; in particular, it strongly depends on the stability of $\mathcal{A}$.

Stability and variance. When $\mathcal{A}$ is unstable, $\widehat{L}^{\mathrm{LOO}}(\mathcal{A})$ has often been pointed out as a variable estimator (Hastie et al., 2001, Section 7.10; Breiman, 1996). Conversely, this trend disappears when $\mathcal{A}$ is stable, as noticed by Molinaro et al. (2005) from a simulation experiment. The relation between the stability of $\mathcal{A}$ and the variance of $\widehat{L}^{\mathrm{CV}}(\mathcal{A})$ was pointed out by Devroye and Wagner (1979) in classification, through upper bounds on the variance of $\widehat{L}^{\mathrm{LOO}}(\mathcal{A})$. Bousquet and Elisseeff (2002) extended these results to the regression setting, and proved upper bounds on the maximal upward deviation of $\widehat{L}^{\mathrm{LOO}}(\mathcal{A})$. Note finally that several approaches based on the bootstrap have been proposed for reducing the variance of $\widehat{L}^{\mathrm{LOO}}(\mathcal{A})$, such as the LOO bootstrap, the .632 bootstrap and the .632+ bootstrap (Efron, 1983); see also Section 4.3.3.

Partial splitting and variance. When $(n_t, n_v)$ is fixed, the variability of CV tends to be larger for partial data splitting methods than for LPO. Indeed, having to choose $B < \binom{n}{n_t}$ subsets $(I_j^{(t)})_{1 \le j \le B}$ of $\{1, \ldots, n\}$, usually randomly, induces an additional variability compared to $\widehat{L}^{\mathrm{LPO}}$ with $p = n - n_t$. In the case of MCCV, this variability decreases like $B^{-1}$, since the $I_j^{(t)}$ are chosen independently. The dependence on $B$ is slightly different for other CV estimators such as RLT or VFCV, because the $I_j^{(t)}$ are not independent. In particular, the additional variability is maximal for the hold-out, and minimal (null) for LOO (if $n_t = n - 1$) and LPO (with $p = n - n_t$). Note that the dependence on $V$ for VFCV is more complex to evaluate, since $B$, $n_t$ and $n_v$ simultaneously vary with $V$. Nevertheless, a non-asymptotic theoretical quantification of this additional variability of VFCV has been obtained by Celisse and Robin (2008) in the density estimation framework (see also empirical considerations by Jonathan et al., 2000).

5.2.2 Theoretical assessment of the variance

Understanding precisely how $\mathrm{var}(\widehat{L}^{\mathrm{CV}}(\mathcal{A}))$ depends on the splitting scheme is complex in general, since $n_t$ and $n_v$ have a fixed sum $n$, and the number of splits $B$ is generally linked with $n_t$ (for instance, for LPO and VFCV). Furthermore, the variance of CV behaves quite differently in different frameworks, depending in particular on the stability of $\mathcal{A}$. The consequence is that contradictory results have been obtained in different frameworks, in particular on the value of $V$ for which the VFCV estimator of the risk has minimal variance (Burman, 1989; Hastie et al., 2001, Section 7.10). Despite the difficulty of the problem, the variance of several CV estimators of the risk has been assessed in several frameworks, as detailed below.

Regression. In the linear regression setting, Burman (1989) gave asymptotic expansions of the variance of the VFCV and RLT estimators of the risk with homoscedastic data. The variance of RLT decreases with $B$, and in the case of VFCV, in a particular setting,

$$\mathrm{var}\left(\widehat{L}^{\mathrm{VF}}(\mathcal{A})\right) = \frac{2\sigma^2}{n} + \frac{4\sigma^4}{n^2}\left(4 + \frac{4}{V-1} + \frac{2}{(V-1)^2} + \frac{1}{(V-1)^3}\right) + o\left(n^{-2}\right).$$

The asymptotic variance of the VFCV estimator of the risk decreases with $V$, implying that LOO asymptotically has the minimal variance.
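The $V$-dependent part of this expansion can be tabulated directly. The sketch below uses illustrative values of $n$ and $\sigma^2$ (our own assumptions); it shows the second-order term shrinking as $V$ grows, consistent with the LOO ($V = n$) having the smallest asymptotic variance.

```python
import numpy as np

def vf_variance_expansion(n, V, sigma2):
    """Burman's asymptotic expansion of var(L_VF) above (up to o(n^-2) terms)."""
    w = V - 1.0
    return (2 * sigma2 / n
            + (4 * sigma2 ** 2 / n ** 2) * (4 + 4 / w + 2 / w ** 2 + 1 / w ** 3))

for V in (2, 5, 10, 100, 1000):
    print(V, vf_variance_expansion(n=1000, V=V, sigma2=1.0))
```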
Furthermore, in the framework of density estimation with histograms, Celisse and Robin (2008) proposed an estimator of the variance of the LPO risk estimator; its accuracy is assessed through a concentration inequality. These results have recently been extended to projection estimators by Celisse (2008a).

6 Cross-validation for efficient model selection

This section tackles the properties of CV procedures for model selection when the goal is estimation (see Section 2.2).

6.1 Relationship between risk estimation and model selection

As shown in Section 3.1, minimizing an unbiased estimator of the risk leads to an efficient model selection procedure. One could conclude from this that the best CV procedure for estimation is the one with the smallest bias and variance (at least asymptotically), for instance, LOO in the least-squares regression framework (Burman, 1989).

Nevertheless, the best CV estimator of the risk is not necessarily the best model selection procedure. For instance, Breiman and Spector (1992) observed that, uniformly over the models, the best risk estimator is LOO, whereas 10-fold CV is more accurate for model selection. Three main reasons for such a difference can be invoked. First, the asymptotic framework ($\mathcal{A}$ fixed, $n \to \infty$) may not apply to models close to the oracle, which typically has a dimension growing with $n$ when $s$ does not belong to any model. Second, as explained in Section 3.2, estimating the risk of each model with some bias can be beneficial and compensate the effect of a large variance, in particular when the signal-to-noise ratio is small. Third, for model selection, what matters is not that every estimate of the risk has small bias and variance, but rather that

$$\operatorname{sign}\left( \mathrm{crit}(m_1) - \mathrm{crit}(m_2) \right) = \operatorname{sign}\left( \mathcal{L}_P(\widehat{s}_{m_1}) - \mathcal{L}_P(\widehat{s}_{m_2}) \right)$$

holds with the largest possible probability for models $m_1, m_2$ near the oracle. Therefore, specific studies are required to evaluate the performance of the various CV procedures in terms of model selection efficiency. In most frameworks, the model selection performance follows directly from the properties of CV as an estimator of the risk, but not always.

6.2 The global picture

Let us start with the classification of model selection procedures made by Shao (1997) in the linear regression framework, since it gives a good idea of the performance of CV procedures for model selection in general. Typically, the efficiency of CV only depends on the asymptotics of $n_t/n$:

• When $n_t \sim n$, CV is asymptotically equivalent to Mallows' $C_p$, hence asymptotically optimal.
• When $n_t \sim \lambda n$ with $\lambda \in (0, 1)$, CV is asymptotically equivalent to $\mathrm{GIC}_\kappa$ with $\kappa = 1 + \lambda^{-1}$, which is defined as AIC with a penalty multiplied by $\kappa/2$. Hence, such CV procedures overpenalize by a factor $(1 + \lambda)/(2\lambda) > 1$.

The above results were proved by Shao (1997) for LPO (see also Li, 1987, for the LOO); they also hold for RLT when $B \gg n^2$, since RLT is then equivalent to LPO (Zhang, 1993). In a general statistical framework, the model selection performance of MCCV, VFCV, LOO, the LOO bootstrap, and the .632 bootstrap for selection among minimum-contrast estimators was studied in a series of papers (van der Laan and Dudoit, 2003; van der Laan et al., 2004, 2006; van der Vaart et al., 2006); these results apply in particular to least-squares regression and density estimation. It turns out that under mild conditions, an oracle-type inequality is proved, showing that up to a multiplicative factor $C_n \to 1$, the risk of CV is smaller than the minimum of the risks of the models with a sample size $n_t$. In particular, in most frameworks, this implies the asymptotic optimality of CV as soon as $n_t \sim n$. When $n_t \sim \lambda n$ with $\lambda \in (0, 1)$, this naturally generalizes Shao's results.
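Before turning to specific frameworks, the generic use of CV for model selection discussed in Sections 6.1 and 6.2 can be summarized in a few lines of code. The sketch below (illustrative only; the model family, a polynomial degree, and all names are our choices) selects the model minimizing the V-fold CV estimate of the risk.

```python
import numpy as np

rng = np.random.default_rng(2)
n, V = 200, 5
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)

# Use the same V folds for every candidate model, as VFCV prescribes.
folds = np.array_split(rng.permutation(n), V)

def vfcv_risk(degree):
    """V-fold CV estimate of the quadratic risk of a polynomial fit."""
    scores = []
    for val in folds:
        train = np.setdiff1d(np.arange(n), val)
        coefs = np.polyfit(x[train], y[train], degree)
        scores.append(np.mean((np.polyval(coefs, x[val]) - y[val]) ** 2))
    return np.mean(scores)

crit = {d: vfcv_risk(d) for d in range(1, 11)}
best = min(crit, key=crit.get)  # the selected model minimizes the CV criterion
print(best, crit[best])
```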
6.3 Results in various frameworks

This section gathers results about the model selection performance of CV when the goal is estimation, in various frameworks. Note that model selection is understood here in a broad sense, including in particular bandwidth choice for kernel estimators.

Regression. First, the results of Section 6.2 suggest that CV is suboptimal when $n_t$ is not asymptotically equivalent to $n$. This fact has been proved rigorously for VFCV when $V = O(1)$ with regressograms (Arlot, 2008): with large probability, the risk of the model selected by VFCV is larger than $1 + \kappa(V)$ times the risk of the oracle, with $\kappa(V) > 0$ for every fixed $V$. Note however that the best $V$ for VFCV is not the largest one in every regression framework, as shown empirically in linear regression (Breiman and Spector, 1992; Herzberg and Tsukanov, 1986); Breiman (1996) proposed to explain this phenomenon by relating the stability of the candidate algorithms to the model selection performance of LOO in various regression frameworks.

Second, the universality of CV has been confirmed by showing that it naturally adapts to heteroscedasticity of the data when selecting among regressograms. Despite its suboptimality, VFCV with $V = O(1)$ satisfies a non-asymptotic oracle inequality with constant $C > 1$ (Arlot, 2008). Furthermore, $V$-fold penalization (which often coincides with corrected VFCV, see Section 5.1.2) satisfies a non-asymptotic oracle inequality with $C_n \to 1$ as $n \to +\infty$, both when $V = O(1)$ (Arlot, 2008) and when $V = n$ (Arlot, 2008a). Since $n$-fold penalization is very close to LOO, this suggests that LOO is also asymptotically optimal with heteroscedastic data. Simulation experiments in the context of change-point detection confirmed that CV adapts well to heteroscedasticity, contrary to the usual model selection procedures in this framework (Arlot and Celisse, 2009).

The performance of CV has also been assessed for other kinds of estimators in regression. For choosing the number of knots in spline smoothing, Burman (1990) proved that corrected versions of VFCV and RLT are asymptotically optimal provided $n/(B n_v) = O(1)$. Furthermore, in kernel regression, several CV methods have been compared to GCV by Härdle et al. (1988) and by Girard (1998); the conclusion is that GCV and related criteria are computationally more efficient than MCCV or RLT, for a similar statistical performance. Finally, note that asymptotic results about CV in regression have been proved by Györfi et al. (2002), and an oracle inequality with constant $C > 1$ has been proved by Wegkamp (2003) for the hold-out, with least-squares estimators.

Density estimation. CV performs similarly to the regression case when selecting among least-squares estimators (van der Laan et al., 2004): it yields a risk smaller than the minimum of the risks with a sample size $n_t$. In particular, non-asymptotic oracle inequalities with constant $C > 1$ have been proved by Celisse (2008b) for the LPO when $p/n \in [a, b]$, for some $0 < a < b < 1$.
The performance of CV for selecting the bandwidth of kernel density estimators has been studied in several papers. With the least-squares contrast, the efficiency of LOO was proved by Hall (1983) and generalized to the multivariate framework by Stone (1984); an oracle inequality asymptotically leading to efficiency was recently proved by Dalelane (2005). With the Kullback-Leibler divergence, CV can run into trouble when performing model selection (see also Schuster and Gregory, 1981; Chow et al., 1987). The influence of the tails of the target $s$ was studied by Hall (1987), who gave conditions under which CV is efficient and the chosen bandwidth is first-order optimal.

Classification. In the framework of binary classification by intervals (that is, with $\mathcal{X} = [0, 1]$ and piecewise constant classifiers), Kearns et al. (1997) proved an oracle inequality for the hold-out. Furthermore, empirical experiments show that CV (almost) always yields the best performance, compared to deterministic penalties (Kearns et al., 1997). On the contrary, simulation experiments by Bartlett et al. (2002) in the same setting showed that random penalties such as the Rademacher complexity and the maximal discrepancy usually perform much better than the hold-out, which turns out to be more variable.

Nevertheless, the hold-out still enjoys quite good theoretical properties: it was proved to adapt to the margin condition by Blanchard and Massart (2006), a property nearly unachievable with the usual model selection procedures (see also Massart, 2007, Section 8.5). This suggests that CV procedures are naturally adaptive to several unknown properties of the data in the statistical learning framework. The performance of the LOO in binary classification was related to the stability of the candidate algorithms by Kearns and Ron (1999); they proved oracle-type inequalities called sanity-check bounds, describing the worst-case performance of LOO (see also Bousquet and Elisseeff, 2002). An experimental comparison of several CV methods and bootstrap-based CV (in particular the .632+ bootstrap) in classification can also be found in the papers by Efron (1986) and Efron and Tibshirani (1997).

7 Cross-validation for identification

Let us now focus on model selection when the goal is to identify the true model $S_{m_0}$, as described in Section 2.3. In this framework, asymptotic optimality is replaced by (model) consistency, that is,

$$P\left( \widehat{m}(D_n) = m_0 \right) \xrightarrow[n \to \infty]{} 1.$$

Classical model selection procedures built for identification, such as BIC, are described in Section 3.3.

7.1 General conditions towards model consistency

At first sight, it may seem strange to use CV for identification: LOO, the pioneering CV procedure, is actually closely related to the unbiased risk estimation principle, which is efficient only when the goal is estimation. Furthermore, estimation and identification are somewhat contradictory goals, as explained in Section 2.4.

This intuition about the inconsistency of some CV procedures is confirmed by several theoretical results. Shao (1993) proved that several CV methods are inconsistent for variable selection in linear regression: LOO, LPO, and BICV when $\liminf_{n \to \infty} (n_t/n) > 0$. Even if these CV methods asymptotically select all the true variables with probability 1, the probability that they select too many variables does not tend to zero.
More generally, Shao (1997) proved that CV procedures behave asymptotically like $\mathrm{GIC}_{\lambda_n}$ with $\lambda_n = 1 + n/n_t$, which leads to inconsistency as soon as $n/n_t = O(1)$.

In the context of ordered variable selection in linear regression, Zhang (1993) computed the asymptotic value of the probability of selecting the true model for several CV procedures, and numerically compared the values of this probability for the same CV procedures in a specific example. For LPO with $p/n \to \lambda \in (0, 1)$ as $n$ tends to $+\infty$, $P(\widehat{m} = m_0)$ increases with $\lambda$. The result is slightly different for VFCV: $P(\widehat{m} = m_0)$ increases with $V$ (hence, it is maximal for the LOO, which is the worst case of LPO). The variability induced by the number $V$ of splits seems to be more important here than the bias of VFCV. Nevertheless, $P(\widehat{m} = m_0)$ is almost constant between $V = 10$ and $V = n$, so that taking $V > 10$ is not advised, for computational reasons.

These results suggest that if the training sample size $n_t$ is negligible compared with $n$, then model consistency could be obtained. This has been confirmed theoretically by Shao (1993, 1997) for the variable selection problem in linear regression: CV is consistent when $n \gg n_t \to \infty$, in particular RLT, BICV (defined in Section 4.3.2), and LPO with $p = p_n \sim n$ and $n - p_n \to \infty$. Therefore, when the goal is to identify the true model, a larger proportion of the data should be put in the validation set in order to improve the performance. This phenomenon is somewhat related to the cross-validation paradox (Yang, 2006).

7.2 Refined analysis for the algorithm selection problem

The behaviour of CV for identification is better understood by considering a more general framework, where the goal is to select, among statistical algorithms, the one with the fastest convergence rate. Yang (2006, 2007) considered this problem for two candidate algorithms (or, more generally, any finite number of algorithms). Let us mention here that Stone (1977) considered a few specific examples of this problem, and showed that LOO can be inconsistent for choosing the best among two good estimators.

The conclusion of Yang's papers is that the sufficient condition on $n_t$ for the consistency in selection of CV strongly depends on the convergence rates $(r_{n,i})_{i=1,2}$ of the candidate algorithms. Let us assume that $r_{n,1}$ and $r_{n,2}$ differ at least by a multiplicative constant $C > 1$. Then, in the regression framework, if the risk of $\widehat{s}_i$ is measured by $\mathbb{E} \lVert \widehat{s}_i - s \rVert^2$, Yang (2007) proved that the hold-out, VFCV, RLT, and LPO with voting (CV-v, see Section 4.2.2) are consistent in selection if

$$n_v, n_t \to \infty \quad \text{and} \quad \sqrt{n_v}\, \max_i r_{n_t,i} \to \infty, \qquad (14)$$

under some conditions on $\lVert \widehat{s}_i - s \rVert_p$ for $p = 2, 4, \infty$. In the classification framework, if the risk of $\widehat{s}_i$ is measured by $P(\widehat{s}_i \neq s)$, Yang (2006) proved the same consistency result for CV-v under the condition

$$n_v, n_t \to \infty \quad \text{and} \quad \frac{n_v \max_i r_{n_t,i}^2}{s_{n_t}} \to \infty, \qquad (15)$$

where $s_n$ is the convergence rate of $P(\widehat{s}_1(D_n) \neq \widehat{s}_2(D_n))$. Intuitively, consistency holds as soon as the uncertainty of each estimate of the risk (roughly proportional to $n_v^{-1/2}$) is negligible compared with the risk gap $\lvert r_{n_t,1} - r_{n_t,2} \rvert$ (which is of the same order as $\max_i r_{n_t,i}$). This condition holds either when at least one of the algorithms converges at a non-parametric rate, or when $n_t \ll n$, which artificially widens the risk gap.
Empirical results in the same direction as Yang's were obtained by Dietterich (1998) and by Alpaydin (1999), leading to the advice that $V = 2$ is the best choice when VFCV is used for comparing two learning procedures. See also the results of Nadeau and Bengio (2003) on CV considered as a testing procedure for comparing two candidate algorithms.

The sufficient conditions (14) and (15) can be simplified depending on $\max_i r_{n,i}$, so that the ability of CV to distinguish between two algorithms depends on their convergence rates. On the one hand, if $\max_i r_{n,i} \propto n^{-1/2}$, then (14) or (15) only holds when $n_v \gg n_t \to \infty$ (under some conditions on $s_n$ in classification). Therefore, the cross-validation paradox holds for comparing algorithms converging at the parametric rate (model selection when a true model exists being only a particular case). Note that possibly stronger conditions can be required in classification, where algorithms can converge at fast rates, between $n^{-1}$ and $n^{-1/2}$. On the other hand, (14) and (15) are milder conditions when $\max_i r_{n,i} \gg n^{-1/2}$: they are implied by $n_t/n_v = O(1)$, and they even allow $n_t \sim n$ (under some conditions on $s_n$ in classification). Therefore, non-parametric algorithms can be compared by more usual CV procedures ($n_t > n/2$), even if LOO is still excluded by conditions (14) and (15). Note that according to simulation experiments, CV with averaging (that is, CV as usual) and CV with voting are equivalent at first order but not at second order, so that they can differ when $n$ is small (Yang, 2007).

8 Specificities of some frameworks

Originally, the CV principle was proposed for i.i.d. observations and usual contrasts such as least-squares and log-likelihood. Therefore, CV procedures may have to be modified in other specific frameworks, such as estimation in the presence of outliers or with dependent data.

8.1 Density estimation

In the density estimation framework, some specific modifications of CV have been proposed. First, Hall et al. (1992) defined the smoothed CV, which consists in pre-smoothing the data before using CV, an idea related to the smoothed bootstrap. They proved that smoothed CV yields an excellent asymptotic model selection performance under various smoothness conditions on the density. Second, when the goal is to estimate the density at one point (and not globally), Hall and Schucany (1989) proposed a local version of CV and proved its asymptotic optimality.

8.2 Robustness to outliers

In the presence of outliers in regression, Leung (2005) studied how CV must be modified to obtain both asymptotic efficiency and a consistent bandwidth estimator (see also Leung et al., 1993). Two changes are possible to achieve robustness: choosing a robust regressor, or choosing a robust loss function. In the presence of outliers, classical CV with a non-robust loss function was shown to fail by Härdle (1984). Leung (2005) described a CV procedure based on robust losses such as the $L_1$ and Huber (Huber, 1964) losses. The same strategy remains applicable to other setups, such as the linear models of Ronchetti et al. (1997).

8.3 Time series and dependent observations

As explained in Section 4.1, CV is built upon the heuristics that part of the sample (the validation set) can play the role of new data with respect to the rest of the sample (the training set), where "new" means that the validation set is independent from the training set, with the same distribution.
Therefore, when the data $\xi_1, \ldots, \xi_n$ are not independent, CV must be modified, like other model selection procedures (for non-parametric regression with dependent data, see the review by Opsomer et al., 2001).

Let us first consider the statistical framework of Section 1 with $\xi_1, \ldots, \xi_n$ identically distributed but not independent. Then, when for instance the data are positively correlated, Hart and Wehrly (1986) proved that CV overfits when choosing the bandwidth of a kernel estimator in regression (see also Chu and Marron, 1991; Opsomer et al., 2001).

The main approach used in the literature for solving this issue is to choose $I^{(t)}$ and $I^{(v)}$ such that $\min_{i \in I^{(t)},\, j \in I^{(v)}} \lvert i - j \rvert > h > 0$, where $h$ controls the distance beyond which observations $\xi_i$ and $\xi_j$ can be treated as independent. For instance, the LOO can be changed into: $I^{(v)} = \{J\}$, where $J$ is chosen uniformly in $\{1, \ldots, n\}$, and $I^{(t)} = \{1, \ldots, J - h - 1, J + h + 1, \ldots, n\}$, a method called modified CV by Chu and Marron (1991) in the context of bandwidth selection (a minimal sketch of this splitting scheme is given at the end of this subsection). Then, for short-range dependences, $\xi_i$ is almost independent from $\xi_j$ when $\lvert i - j \rvert > h$ is large enough, so that $(\xi_j)_{j \in I^{(t)}}$ is almost independent from $(\xi_j)_{j \in I^{(v)}}$. Several asymptotic optimality results have been proved for modified CV, for instance by Hart and Vieu (1990) for bandwidth choice in kernel density estimation when the data are $\alpha$-mixing (hence, with a short-range dependence structure) and $h = h_n \to \infty$ not too fast. Note that modified CV also enjoys some asymptotic optimality results with long-range dependences, as proved by Hall et al. (1995), even if an alternative block bootstrap method seems more appropriate in such a framework.

Several alternatives to modified CV have also been proposed. The $h$-block CV (Burman et al., 1994) is modified CV plus a corrective term, similarly to the bias-corrected CV of Burman (1989) (see Section 5.1). Simulation experiments in several (short-range) dependent frameworks show that this corrective term matters when $h/n$ is not small, in particular when $n$ is small. The partitioned CV was proposed by Chu and Marron (1991) for bandwidth selection: an integer $g > 0$ is chosen, a bandwidth $\widehat{\lambda}_k$ is chosen by CV based upon the subsample $(\xi_{k + gj})_{j \geq 0}$ for each $k = 1, \ldots, g$, and the selected bandwidth is a combination of the $(\widehat{\lambda}_k)$. When a parametric model is available for the dependence structure, Hart (1994) proposed the time series CV.

An important framework where the data often are dependent is time-series analysis, in particular when the goal is to predict the next observation $\xi_{n+1}$ from the past $\xi_1, \ldots, \xi_n$. When the data are stationary, $h$-block CV and similar approaches can be used to deal with (short-range) dependences. Nevertheless, Burman and Nolan (1992) proved, in a specific framework, that unaltered CV is asymptotically optimal when $\xi_1, \ldots, \xi_n$ is a stationary Markov process. On the contrary, using CV for non-stationary time series is quite a difficult problem. The only reasonable approach in general is the hold-out, that is, $I^{(t)} = \{1, \ldots, m\}$ and $I^{(v)} = \{m + 1, \ldots, n\}$ for some deterministic $m$. Each model is first trained with $(\xi_j)_{j \in I^{(t)}}$. Then, it is used for predicting successively $\xi_{m+1}$ from $(\xi_j)_{j \leq m}$, $\xi_{m+2}$ from $(\xi_j)_{j \leq m+1}$, and so on. The model with the smallest average error for predicting $(\xi_j)_{j \in I^{(v)}}$ from the past is chosen.
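As referenced above, here is a minimal sketch of the splitting scheme behind modified CV (ours, not Chu and Marron's exact implementation): each validation point is held out together with a gap of $h$ temporal neighbours, which are removed from the training set.

```python
import numpy as np

def modified_cv_splits(n, h):
    """Yield (train, val) index sets with a gap h around each validation point."""
    for j in range(n):
        val = np.array([j])
        keep = np.abs(np.arange(n) - j) > h   # enforce |i - j| > h
        yield np.flatnonzero(keep), val

# Example usage: loop over the splits, fitting on `train`, scoring on `val`.
for train, val in modified_cv_splits(n=10, h=2):
    pass  # fit the estimator on `train`, evaluate it on `val`
```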
8.4 Large number of models

As mentioned in Section 3, model selection procedures that rely on unbiased estimation of the risk of each model fail when, in particular, the number of models grows exponentially with $n$ (Birgé and Massart, 2007). Therefore, CV cannot be used directly in this situation, except maybe with $n_t \ll n$, provided $n_t$ is well chosen (see Section 6 and Celisse, 2008b, Chapter 6).

For least-squares regression with homoscedastic data, Wegkamp (2003) proposed to add to the hold-out estimator of the risk a penalty term depending on the number of models. This method is proved to satisfy a non-asymptotic oracle inequality with leading constant $C > 1$.

Another general approach was proposed by Arlot and Celisse (2009) in the context of multiple change-point detection. The idea is to perform model selection in two steps. First, gather the models $(S_m)_{m \in \mathcal{M}_n}$ into meta-models $(\widetilde{S}_D)_{D \in \mathcal{D}_n}$, where $\mathcal{D}_n$ denotes a set of indices such that $\mathrm{Card}(\mathcal{D}_n)$ grows at most polynomially with $n$; inside each meta-model $\widetilde{S}_D = \bigcup_{m \in \mathcal{M}_n(D)} S_m$, an estimator $\widehat{s}_D$ is chosen from the data by optimizing a given criterion, for instance the empirical contrast $\mathcal{L}_{P_n}(t)$, but other criteria can be used. Second, use CV for choosing among $(\widehat{s}_D)_{D \in \mathcal{D}_n}$. Simulation experiments show that this simple trick automatically takes into account the cardinality of $\mathcal{M}_n$, even when the data are heteroscedastic, contrary to other model selection procedures built for exponential collections of models, which all assume homoscedasticity of the data.

9 Closed-form formulas and fast computation

Resampling strategies like CV are known to be time-consuming. A naive implementation of CV has a computational complexity of $B$ times the complexity of training the algorithm $\mathcal{A}$ once, which is usually intractable for LPO, even with $p = 1$; the computational cost of VFCV or RLT can also be quite high when $B > 10$ in many practical problems. Nevertheless, closed-form formulas for CV estimators of the risk can be obtained in several frameworks, which greatly decreases the computational cost of CV.

In density estimation, closed-form formulas were originally derived by Rudemo (1982) and by Bowman (1984) for the LOO risk estimator of histogram and kernel estimators. These results have recently been extended by Celisse and Robin (2008) to the LPO risk estimator with the quadratic loss. Similar results are more generally available for projection estimators, as settled by Celisse (2008a). Intuitively, such formulas can be obtained provided the number $N$ of distinct values taken by the $B = \binom{n}{n_v}$ hold-out estimators of the risk, corresponding to the different data splittings, is at most polynomial in the sample size.

For least-squares estimators in linear regression, Zhang (1993) proved a closed-form formula for the LOO estimator of the risk. Similar results have been obtained by Wahba (1975, 1977) and by Craven and Wahba (1979) in the spline smoothing context as well. These papers led in particular to the definition of GCV (see Section 4.3.3) and related procedures, which are often used instead of a naive implementation of CV because of their small computational cost, as emphasized by Girard (1998). Closed-form formulas for the LPO estimator of the risk were also obtained by Celisse (2008b) in regression for kernel and projection estimators, in particular for regressograms.
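As a concrete instance, for a regular histogram with bin width $h$ and bin counts $N_j$, the LOO least-squares CV criterion admits the closed form $2/((n-1)h) - (n+1)\sum_j N_j^2 / (n^2(n-1)h)$, a standard identity consistent with the histogram results of Rudemo (1982) and Bowman (1984). The sketch below (ours, illustrative) evaluates the whole LOO criterion from the bin counts alone, with no refitting.

```python
import numpy as np

def histogram_loo_lscv(data, n_bins, lo, hi):
    """Exact LOO least-squares CV criterion for a regular histogram.

    Observations outside [lo, hi] are ignored here for simplicity.
    """
    n = len(data)
    h = (hi - lo) / n_bins                       # bin width
    counts, _ = np.histogram(data, bins=n_bins, range=(lo, hi))
    s = np.sum(counts.astype(float) ** 2)        # sum_j N_j^2
    return 2.0 / ((n - 1) * h) - (n + 1) * s / (n**2 * (n - 1) * h)

rng = np.random.default_rng(4)
data = rng.normal(size=500)
scores = {k: histogram_loo_lscv(data, k, -4.0, 4.0) for k in range(5, 60, 5)}
print(min(scores, key=scores.get))               # selected number of bins
```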
An important property of the closed-form formulas above is their additivity: for a regressogram associated with a partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$, the LPO estimator of the risk can be written as a sum over $\lambda \in \Lambda_m$ of terms which only depend on the observations $(X_i, Y_i)$ such that $X_i \in I_\lambda$. Therefore, dynamic programming (Bellman and Dreyfus, 1962) can be used for minimizing the LPO estimator of the risk over the set of partitions of $\mathcal{X}$ into $D$ pieces. As an illustration, Arlot and Celisse (2009) successfully applied this strategy in the change-point detection framework. Note that the same idea can be used with VFCV or RLT, but at a larger computational cost, since no closed-form formulas are available for these CV methods.

Finally, in frameworks where no closed-form formula can be proved, efficient algorithms exist that avoid recomputing $\widehat{L}_{\mathrm{HO}}(\mathcal{A}; D_n; I_j^{(t)})$ from scratch for each data splitting $I_j^{(t)}$. These algorithms rely on updating formulas, such as the ones by Ripley (1996) for LOO in linear and quadratic discriminant analysis; this approach makes LOO as expensive to compute as the empirical risk, and no more. Very similar formulas are also available for LOO and the $k$-nearest-neighbours estimator in classification (Daudin and Mary-Huard, 2008).

10 Conclusion: which cross-validation method for which problem?

This conclusion collects a few guidelines aimed at helping CV users, first, to interpret the results of CV and, second, to use CV appropriately in each specific problem.

10.1 The general picture

Drawing a general conclusion on CV methods is an impossible task, because of the variety of frameworks where CV can be used, which induces a variety of behaviours of CV. Nevertheless, we can still point out the three main criteria to take into account when choosing a CV method for a particular model selection problem:

• Bias: CV roughly estimates the risk of a model with a sample size $n_t < n$ (see Section 5.1). Usually, this implies that CV overestimates the variance term compared to the bias term in the bias-variance decomposition (2) with sample size $n$. When the goal is estimation and the signal-to-noise ratio (SNR) is large, the smaller the bias, usually the better, which is obtained by taking $n_t \sim n$; otherwise, CV can be asymptotically suboptimal. Nevertheless, when the goal is estimation and the SNR is small, keeping a small upward bias on the variance term often improves the performance, which is obtained by taking $n_t \sim \kappa n$ with $\kappa \in (0, 1)$; see Section 6. When the goal is identification, a large bias is often needed, which is obtained by taking $n_t \ll n$; depending on the framework, larger values of $n_t$ can also lead to model consistency, see Section 7.

• Variability: The variance of the CV estimator of the risk is usually a decreasing function of the number $B$ of splits, for a fixed training size. When the number of splits is fixed, the variability of CV also depends on the training sample size $n_t$; usually, CV is more variable when $n_t$ is closer to $n$. However, when $B$ is linked with $n_t$ (as for VFCV or LPO), the variability of CV must be quantified precisely, which has been done in few frameworks only. The only general conclusion on this point is that the CV method with minimal variability seems strongly framework-dependent; see Section 5.2 for details.
• Computational complexity: Unless closed-form formulas or analytic approximations are available (see Section 9), the complexity of CV is roughly proportional to the number of data splits: 1 for the hold-out, $V$ for VFCV, $B$ for RLT or MCCV, $n$ for LOO, and $\binom{n}{p}$ for LPO.

The optimal trade-off between these three factors can be different for each problem, depending for instance on the computational complexity of each estimator, on specificities of the framework considered, and on the final user's trade-off between statistical performance and computational cost. Therefore, no optimal CV method can be pointed out before the final user's preferences have been taken into account. Nevertheless, in density estimation, closed-form expressions of the LPO estimator have been derived by Celisse and Robin (2008) for histogram and kernel estimators, and by Celisse (2008a) for projection estimators. These expressions make it possible to perform LPO without additional computational cost, which reduces the aforementioned three-way trade-off to the easier bias-variability trade-off. In particular, Celisse and Robin (2008) proposed to choose $p$ for LPO by minimizing a criterion defined as the sum of a squared bias term and a variance term (see also Politis et al., 1999, Chapter 9).

10.2 How should the splits be chosen?

For the hold-out, VFCV, and RLT, an important question is how to choose the sequence of data splits. First, should this step be random and independent from $D_n$, or should it take into account some features of the problem or of the data?

It is often recommended to take the structure of the data into account when choosing the splits. If the data are stratified, the proportions of the different strata should be (approximately) the same in each training and validation sample as in the whole sample. Besides, the training samples should be chosen so that $\widehat{s}_m(D_n^{(t)})$ is well defined for every training set; in the regressogram case, this led Arlot (2008) and Arlot and Celisse (2009) to choose the splitting scheme carefully. In supervised classification, practitioners usually choose the splits so that the proportion of each class in every validation sample matches the proportion in the whole sample (a minimal sketch of such a stratified splitting scheme is given at the end of this subsection). Nevertheless, Breiman and Spector (1992) made simulation experiments in regression comparing several splitting strategies, and no significant improvement was reported from taking the stratification of the data into account when choosing the splits.

Another question related to the choice of $(I_j^{(t)})_{1 \leq j \leq B}$ is whether the $I_j^{(t)}$ should be independent (like MCCV), slightly dependent (like RLT), or strongly dependent (like VFCV). It seems intuitive that giving similar roles to all data points in the $B$ training and validation tasks should yield more reliable results than other methods. This intuition may explain why VFCV is much more used than RLT or MCCV. Similarly, Shao (1993) proposed a CV method called BICV, where every point and every pair of points appear in the same number of splits; see Section 4.3.2. Nevertheless, most recent theoretical results on the various CV procedures are not accurate enough to distinguish which one may be the best splitting strategy: this remains a widely open theoretical question.

Note finally that the additional variability due to the choice of a sequence of data splits was quantified empirically by Jonathan et al. (2000) and theoretically by Celisse and Robin (2008) for VFCV.
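As referenced above, the class-stratified splitting used by practitioners can be implemented in a few lines. The following sketch (ours, illustrative) builds $V$ validation sets whose class proportions approximately match those of the whole sample.

```python
import numpy as np

def stratified_folds(labels, V, rng):
    """Return a list of V validation index arrays, stratified by class.

    Note: np.array_split puts the larger chunks first within each class; a
    more careful implementation would rotate the remainder across folds.
    """
    folds = [[] for _ in range(V)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for j, chunk in enumerate(np.array_split(idx, V)):
            folds[j].extend(chunk)
    return [np.sort(np.array(f)) for f in folds]

rng = np.random.default_rng(5)
labels = rng.choice([0, 1], size=100, p=[0.8, 0.2])
for val in stratified_folds(labels, V=5, rng=rng):
    print(len(val), np.mean(labels[val]).round(2))  # about 0.2 ones per fold
```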
10.3 V-fold cross-validation

VFCV is certainly the most popular CV procedure, in particular because of its mild computational cost. Nevertheless, the question of choosing $V$ remains widely open, even if some indications towards an appropriate choice can be given.

A specific feature of VFCV (shared by the exhaustive strategies) is that choosing $V$ uniquely determines the size of the training set, $n_t = n(V-1)/V$, and the number of splits, $B = V$, hence the computational cost. Contradictory phenomena then occur. On the one hand, the bias of VFCV decreases with $V$, since $n_t = n(1 - 1/V)$ observations are used in the training set. On the other hand, the variance of VFCV decreases with $V$ for small values of $V$, whereas the LOO ($V = n$) is known to suffer from a high variance in several frameworks, such as classification or density estimation. Note however that the variance of VFCV is minimal for $V = n$ in some frameworks, such as linear regression (see Section 5.2). Furthermore, estimating the variance of VFCV from the data is a difficult problem in general; see Section 5.2.3.

When the goal of model selection is estimation, it is often reported in the literature that the optimal $V$ is between 5 and 10, because the statistical performance does not increase much for larger values of $V$, and averaging over 5 or 10 splits remains computationally feasible (Hastie et al., 2001, Section 7.10). Even if this claim is clearly true for many problems, the conclusion of this survey is that better statistical performance can sometimes be obtained with other values of $V$, for instance depending on the SNR value. When the SNR is large, the asymptotic comparison of CV procedures recalled in Section 6.2 can be trusted: LOO performs (nearly) unbiased risk estimation, hence is asymptotically optimal, whereas VFCV with $V = O(1)$ is suboptimal. On the contrary, when the SNR is small, overpenalization can improve the performance; VFCV with $V < n$ can then yield a smaller risk than LOO, thanks to its bias and despite its variance when $V$ is small (see the simulation experiments by Arlot, 2008). Furthermore, other CV procedures like RLT can be interesting alternatives to VFCV, since they make it possible to choose the bias (through $n_t$) independently from $B$, which mainly governs the variance. Another possible alternative is $V$-fold penalization, which is related to corrected VFCV (see Section 4.3.3).

When the goal of model selection is identification, the main drawback of VFCV is that $n_t \ll n$ is often required for consistently choosing the true model (see Section 7), whereas VFCV does not allow $n_t < n/2$. Depending on the framework, different (empirical) recommendations for choosing $V$ can be found in the literature. In ordered variable selection, the largest $V$ seems to be the best, with $V = 10$ providing results close to the optimal ones (Zhang, 1993). On the contrary, Dietterich (1998) and Alpaydin (1999) recommend $V = 2$ for choosing the best learning procedure among two candidates.

10.4 Future research

Perhaps the most important direction for future research would be to provide, in each specific framework, precise quantitative measures of the variance of CV estimators of the risk, depending on $n_t$, the number of splits, and how the splits are chosen. Up to now, only a few precise results have been obtained in this direction, for some specific CV methods in linear regression or density estimation (see Section 5.2).
Proving similar results in other frameworks and for more general CV methods would greatly help in choosing a CV method for any given model selection problem. More generally, most theoretical results are not precise enough to make any distinction between the hold-out and CV methods having the same training sample size $n_t$, because they are equivalent at first order. Second-order terms do matter for realistic values of $n$, which shows the dramatic need for a theory taking into account the variance of CV when comparing CV methods such as VFCV and RLT with $n_t = n(V-1)/V$ but $B \neq V$.

References

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math., 22:203-217.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages 267-281. Akadémiai Kiadó, Budapest.
Allen, D. M. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125-127.
Alpaydin, E. (1999). Combined 5 x 2 cv F test for comparing supervised classification learning algorithms. Neur. Comp., 11(8):1885-1892.
Anderson, R. L., Allen, D. M., and Cady, F. B. (1972). Selection of predictor variables in linear multiple regression. In Bancroft, T. A., editor, Statistical Papers in Honor of George W. Snedecor. Iowa State University Press, Ames, IA.
Arlot, S. (2007). Resampling and Model Selection. PhD thesis, University Paris-Sud 11. oai:tel.archives-ouvertes.fr:tel-00198803_v1.
Arlot, S. (2008a). Model selection by resampling penalization. Electronic Journal of Statistics. To appear. oai:hal.archives-ouvertes.fr:hal-00262478_v1.
Arlot, S. (2008b). Suboptimality of penalties proportional to the dimension for model selection in heteroscedastic regression.
Arlot, S. (2008). V-fold cross-validation improved: V-fold penalization.
Arlot, S. and Celisse, A. (2009). Segmentation in the mean of heteroscedastic data via cross-validation.
Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245-279 (electronic).
Baraud, Y. (2002). Model selection for regression on a random design. ESAIM Probab. Statist., 6:127-146 (electronic).
Barron, A., Birgé, L., and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields, 113(3):301-413.
Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48:85-113.
Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist., 33(4):1497-1537.
Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res., 3(Spec. Issue Comput. Learn. Theory):463-482.
Bellman, R. E. and Dreyfus, S. E. (1962). Applied Dynamic Programming. Princeton.
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of K-fold cross-validation. J. Mach. Learn. Res., 5:1089-1105 (electronic).
Bhansali, R. J. and Downham, D. Y. (1977). Some properties of the order of an autoregressive model selected by a generalization of Akaike's FPE criterion. Biometrika, 64(3):547-551.
Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS), 3(3):203-268.
Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33-73.
Blanchard, G. and Massart, P. (2006). Discussion: Local Rademacher complexities and oracle inequalities in risk minimization [Ann. Statist. 34 (2006), no. 6, 2593-2656] by V. Koltchinskii. Ann. Statist., 34(6):2664-2671.
Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM Probab. Stat., 9:323-375 (electronic).
Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Machine Learning Research, 2:499-526.
Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2):353-360.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Ann. Statist., 24(6):2350-2383.
Breiman, L. (1998). Arcing classifiers. Ann. Statist., 26(3):801-849. With discussion and a rejoinder by the author.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth Statistics/Probability Series. Wadsworth Advanced Books and Software, Belmont, CA.
Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression. The X-random case. International Statistical Review, 60(3):291-319.
Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503-514.
Burman, P. (1990). Estimation of optimal transformations using v-fold cross validation and repeated learning-testing methods. Sankhya Ser. A, 52(3):314-345.
Burman, P., Chow, E., and Nolan, D. (1994). A cross-validatory method for dependent data. Biometrika, 81(2):351-358.
Burman, P. and Nolan, D. (1992). Data-dependent estimation of prediction functions. J. Time Ser. Anal., 13(3):189-207.
Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference. Springer-Verlag, New York, second edition. A practical information-theoretic approach.
Cao, Y. and Golubev, Y. (2006). On oracle inequalities related to smoothing splines. Math. Methods Statist., 15(4):398-414.
Celisse, A. (2008a). Density estimation via cross-validation: Model selection point of view. Technical report, arXiv:0811.0802.
Celisse, A. (2008b). Model Selection Via Cross-Validation in Density Estimation, Regression and Change-Points Detection. PhD thesis, University Paris-Sud 11. http://tel.archives-ouvertes.fr/tel-00346320/en/.
Celisse, A. and Robin, S. (2008). Nonparametric density estimation by exact leave-p-out cross-validation. Computational Statistics and Data Analysis, 52(5):2350-2368.
Chow, Y. S., Geman, S., and Wu, L. D. (1987). Consistent cross-validated density estimation. Ann. Statist., 11:25-38.
Chu, C.-K. and Marron, J. S. (1991). Comparison of two bandwidth selectors with dependent errors. Ann. Statist., 19(4):1906-1918.
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31(4):377-403.
Dalelane, C. (2005). Exact oracle inequality for sharp adaptive kernel density estimator. Technical report, arXiv.
Daudin, J.-J. and Mary-Huard, T. (2008). Estimation of the conditional risk in classification: The swapping method. Comput. Stat. Data Anal., 52(6):3220-3232.
Davison, A. C. and Hall, P. (1992). On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems. Biometrika, 79(2):279-284.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics (New York). Springer-Verlag, New York.
Devroye, L. and Wagner, T. J. (1979). Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601-604.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neur. Comp., 10(7):1895-1924.
Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc., 78(382):316-331.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc., 81(394):461-470.
Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. J. Amer. Statist. Assoc., 99(467):619-642. With comments and a rejoinder by the author.
Efron, B. and Morris, C. (1973). Combining possibly related estimation problems (with discussion). J. R. Statist. Soc. B, 35:379.
Efron, B. and Tibshirani, R. (1997). Improvements on cross-validation: the .632+ bootstrap method. J. Amer. Statist. Assoc., 92(438):548-560.
Fromont, M. (2007). Model selection by bootstrap penalization for classification. Mach. Learn., 66(2-3):165-207.
Geisser, S. (1974). A predictive approach to the random effect model. Biometrika, 61(1):101-107.
Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70:320-328.
Girard, D. A. (1998). Asymptotic comparison of (partial) cross-validation, GCV and randomized GCV in nonparametric regression. Ann. Statist., 26(1):315-334.
Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press, Cambridge, MA, USA.
Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer-Verlag, New York.
Hall, P. (1983). Large sample optimality of least squares cross-validation in density estimation. Ann. Statist., 11(4):1156-1174.
Hall, P. (1987). On Kullback-Leibler loss and density estimation. The Annals of Statistics, 15(4):1491-1519.
Hall, P., Lahiri, S. N., and Polzehl, J. (1995). On bandwidth choice in nonparametric regression with both short- and long-range dependent errors. Ann. Statist., 23(6):1921-1936.
Hall, P., Marron, J. S., and Park, B. U. (1992). Smoothed cross-validation. Probab. Theory Related Fields, 92(1):1-20.
Hall, P. and Schucany, W. R. (1989). A local cross-validation algorithm. Statist. Probab. Lett., 8(2):109-117.
Härdle, W. (1984). How to determine the bandwidth of some nonlinear smoothers in practice. In Robust and Nonlinear Time Series Analysis (Heidelberg, 1983), volume 26 of Lecture Notes in Statist., pages 163-184. Springer, New York.
Härdle, W., Hall, P., and Marron, J. S. (1988). How far are automatically chosen regression smoothing parameters from their optimum? J. Amer. Statist. Assoc., 83(401):86-101. With comments by David W. Scott and Iain Johnstone and a reply by the authors.
Hart, J. D. (1994). Automated kernel smoothing of dependent data by using time series cross-validation. J. Roy. Statist. Soc. Ser. B, 56(3):529-542.
Hart, J. D. and Vieu, P. (1990). Data-driven bandwidth choice for density estimation based on dependent data. Ann. Statist., 18(2):873-890.
Hart, J. D. and Wehrly, T. E. (1986). Kernel regression estimation using repeated measurements data. J. Amer. Statist. Assoc., 81(396):1080-1088.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag, New York. Data mining, inference, and prediction.
Herzberg, A. M. and Tsukanov, A. V. (1986). A note on modifications of the jackknife criterion for model selection. Utilitas Math., 29:209-216.
Herzberg, P. A. (1969). The parameters of cross-validation. Psychometrika, 34:Monograph Supplement.
Hesterberg, T. C., Choi, N. H., Meier, L., and Fraley, C. (2008). Least angle and l1 penalized regression: A review. Statistics Surveys, 2:61-93 (electronic).
Hills, M. (1966). Allocation rules and their error rates. J. Royal Statist. Soc. Series B, 28(1):1-31.
Huber, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist., 35:73-101.
Hurvich, C. M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2):297-307.
John, P. W. M. (1971). Statistical Design and Analysis of Experiments. The Macmillan Co., New York.
Jonathan, P., Krzanowski, W. J., and McCarthy, W. V. (2000). On the use of cross-validation to assess performance in multivariate prediction. Stat. and Comput., 10:209-229.
Kearns, M., Mansour, Y., Ng, A. Y., and Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7-50.
Kearns, M. and Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11:1427-1453.
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, 47(5):1902-1914.
Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593-2656.
Lachenbruch, P. A. and Mickey, M. R. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10(1):1-11.
Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. J. Educ. Psychol., 22:45-55.
Leung, D., Marriott, F., and Wu, E. (1993). Bandwidth selection in robust smoothing. J. Nonparametr. Statist., 2:333-339.
Leung, D. H.-Y. (2005). Cross-validation in nonparametric regression with outliers. Ann. Statist., 33(5):2291-2310.
Li, K.-C. (1985). From Stein's unbiased risk estimates to the method of generalized cross validation. Ann. Statist., 13(4):1352-1377.
Li, K.-C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15(3):958-975.
Mallows, C. L. (1973). Some comments on C_p. Technometrics, 15:661-675.
Markatou, M., Tian, H., Biswas, S., and Hripcsak, G. (2005). Analysis of variance of cross-validation estimators of the generalization error. J. Mach. Learn. Res., 6:1127-1168 (electronic).
Massart, P. (2007). Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6-23, 2003. With a foreword by Jean Picard.
Molinaro, A. M., Simon, R., and Pfeiffer, R. M. (2005). Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21(15):3301-3307.
Mosteller, F. and Tukey, J. W. (1968). Data analysis, including statistics. In Lindzey, G. and Aronson, E., editors, Handbook of Social Psychology, Vol. 2. Addison-Wesley.
Nadeau, C. and Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52:239-281.
Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist., 12(2):758-765.
Opsomer, J., Wang, Y., and Yang, Y. (2001). Nonparametric regression with correlated errors. Statist. Sci., 16(2):134-153.
Picard, R. R. and Cook, R. D. (1984). Cross-validation of regression models. J. Amer. Statist. Assoc., 79(387):575-583.
Politis, D. N., Romano, J. P., and Wolf, M. (1999). Subsampling. Springer Series in Statistics. Springer-Verlag, New York.
Quenouille, M. H. (1949). Approximate tests of correlation in time-series. J. Roy. Statist. Soc. Ser. B, 11:68-84.
Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Univ. Press.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416-431.
Ronchetti, E., Field, C., and Blanchard, W. (1997). Robust linear model selection by cross-validation. J. Amer. Statist. Assoc., 92:1017-1023.
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9:65-78.
Sauvé, M. (2009). Histogram selection in non-Gaussian regression. ESAIM: Probability and Statistics, 13:70-86.
Schuster, E. F. and Gregory, G. G. (1981). On the consistency of maximum likelihood nonparametric density estimators. In Eddy, W. F., editor, Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, pages 295-298. Springer-Verlag, New York.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6(2):461-464.
Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc., 88(422):486-494.
Shao, J. (1996). Bootstrap model selection. J. Amer. Statist. Assoc., 91(434):655-665.
Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sinica, 7(2):221-264. With comments and a rejoinder by the author.
Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression variables. Biometrika, 71(1):43-49.
Simon, F. (1971). Prediction Methods in Criminology, volume 7.
Stone, C. (1984). An asymptotically optimal window selection rule for kernel density estimates. The Annals of Statistics, 12(4):1285-1297.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111-147. With discussion by G. A. Barnard, A. C. Atkinson, L. K. Chan, A. P. Dawid, F. Downton, J. Dickey, A. G. Baker, O. Barndorff-Nielsen, D. R. Cox, S. Geisser, D. Hinkley, R. R. Hocking, and A. S. Young, and with a reply by the authors.
Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika, 64(1):29-35.
Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Comm. Statist. A - Theory Methods, 7(1):13-26.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. Series B, 58(1):267-288.
van der Laan, M. J. and Dudoit, S. (2003). Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. Working Paper Series, Working Paper 130, U.C. Berkeley Division of Biostatistics. Available at http://www.bepress.com/ucbbiostat/paper130.
van der Laan, M. J., Dudoit, S., and Keles, S. (2004). Asymptotic optimality of likelihood-based cross-validation. Stat. Appl. Genet. Mol. Biol., 3:Art. 4, 27 pp. (electronic).
van der Laan, M. J., Dudoit, S., and van der Vaart, A. W. (2006). The cross-validated adaptive epsilon-net estimator. Statist. Decisions, 24(3):373-395.
van der Vaart, A. W., Dudoit, S., and van der Laan, M. J. (2006). Oracle inequalities for multi-fold cross validation. Statist. Decisions, 24(3):351-371.
van Erven, T., Grünwald, P. D., and de Rooij, S. (2008). Catching up faster by switching sooner: A prequential solution to the AIC-BIC dilemma.
Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, New York. Translated from the Russian by Samuel Kotz.
Vapnik, V. N. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons Inc., New York. A Wiley-Interscience Publication.
Vapnik, V. N. and Chervonenkis, A. Y. (1974). Teoriya Raspoznavaniya Obrazov. Statisticheskie Problemy Obucheniya. Izdat. Nauka, Moscow. Theory of Pattern Recognition (in Russian).
Wahba, G. (1975). Periodic splines for spectral density estimation: The use of cross validation for determining the degree of smoothing. Communications in Statistics, 4:125-142.
Wahba, G. (1977). Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis, 14(4):651-667.
Wegkamp, M. (2003). Model selection in nonparametric regression. The Annals of Statistics, 31(1):252-273.
Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4):937-950.
Yang, Y. (2006). Comparing learning methods for classification. Statist. Sinica, 16(2):635-657.
Yang, Y. (2007). Consistency of cross validation for comparing regression procedures. Ann. Statist., 35(6):2450-2473.
Zhang, P. (1993). Model selection via multifold cross validation. Ann. Statist., 21(1):299-313.