A survey of cross-validation procedures for model selection
October 22, 2018

Sylvain Arlot, CNRS; Willow Project-Team, Laboratoire d'Informatique de l'École Normale Supérieure (CNRS/ENS/INRIA UMR 8548), 45 rue d'Ulm, 75230 Paris, France. Sylvain.Arlot@ens.fr

Alain Celisse, Laboratoire Paul Painlevé, UMR CNRS 8524, Université des Sciences et Technologies de Lille 1, F-59655 Villeneuve d'Ascq Cedex, France. Alain.Celisse@math.univ-lille1.fr

Abstract

Used to estimate the risk of an estimator or to perform model selection, cross-validation is a widespread strategy because of its simplicity and its apparent universality. Many results exist on the model selection performances of cross-validation procedures. This survey intends to relate these results to the most recent advances of model selection theory, with a particular emphasis on distinguishing empirical statements from rigorous theoretical results. As a conclusion, guidelines are provided for choosing the best cross-validation procedure according to the particular features of the problem in hand.

Contents

1 Introduction
  1.1 Statistical framework
  1.2 Examples
  1.3 Statistical algorithms
2 Model selection
  2.1 The model selection paradigm
  2.2 Model selection for estimation
  2.3 Model selection for identification
  2.4 Estimation vs. identification
3 Overview of some model selection procedures
  3.1 The unbiased risk estimation principle
  3.2 Biased estimation of the risk
  3.3 Procedures built for identification
  3.4 Structural risk minimization
  3.5 Ad hoc penalization
  3.6 Where are cross-validation procedures in this picture?
4 Cross-validation procedures
  4.1 Cross-validation philosophy
  4.2 From validation to cross-validation
    4.2.1 Hold-out
    4.2.2 General definition of cross-validation
  4.3 Classical examples
    4.3.1 Exhaustive data splitting
    4.3.2 Partial data splitting
    4.3.3 Other cross-validation-like risk estimators
  4.4 Historical remarks
5 Statistical properties of cross-validation estimators of the risk
  5.1 Bias
    5.1.1 Theoretical assessment of the bias
    5.1.2 Correction of the bias
  5.2 Variance
    5.2.1 Variability factors
    5.2.2 Theoretical assessment of the variance
    5.2.3 Estimation of the variance
6 Cross-validation for efficient model selection
  6.1 Relationship between risk estimation and model selection
  6.2 The global picture
  6.3 Results in various frameworks
7 Cross-validation for identification
  7.1 General conditions towards model consistency
  7.2 Refined analysis for the algorithm selection problem
8 Specificities of some frameworks
  8.1 Density estimation
  8.2 Robustness to outliers
  8.3 Time series and dependent observations
  8.4 Large number of models
9 Closed-form formulas and fast computation
10 Conclusion: which cross-validation method for which problem?
  10.1 The general picture
  10.2 How should the splits be chosen?
  10.3 V-fold cross-validation
  10.4 Future research

1 Introduction

Many statistical algorithms, such as likelihood maximization, least squares and empirical contrast minimization, rely on the preliminary choice of a model, that is, of a set of parameters from which an estimate will be returned. When several candidate models (thus algorithms) are available, choosing one of them is called the model selection problem.

Cross-validation (CV) is a popular strategy for model selection, and more generally algorithm selection. The main idea behind CV is to split the data (once or several times) for estimating the risk of each algorithm: part of the data (the training sample) is used for training each algorithm, and the remaining part (the validation sample) is used for estimating the risk of the algorithm. Then, CV selects the algorithm with the smallest estimated risk. Compared to the resubstitution error, CV avoids overfitting because the training sample is independent from the validation sample (at least when data are i.i.d.).

The popularity of CV mostly comes from the generality of the data splitting heuristics, which only assumes that data are i.i.d. Nevertheless, theoretical and empirical studies of CV procedures do not entirely confirm this universality. Some CV procedures have been proved to fail for some model selection problems, depending on the goal of model selection: estimation or identification (see Section 2). Furthermore, many theoretical questions about CV remain widely open.

The aim of the present survey is to provide a clear picture of what is known about CV, from both theoretical and empirical points of view. More precisely, the aim is to answer the following questions: What is CV doing? When does CV work for model selection, keeping in mind that model selection can target different goals? Which CV procedure should be used for each model selection problem?

The paper is organized as follows. First, the rest of Section 1 presents the statistical framework. Although non-exhaustive, the present setting has been chosen general enough for sketching the complexity of CV for model selection. The model selection problem is introduced in Section 2.
A brief overview of some model selection procedures that are important to keep in mind for understanding CV is given in Section 3. The most classical CV procedures are defined in Section 4. Since they are the keystone of the behaviour of CV for model selection, the main properties of CV estimators of the risk for a fixed model are detailed in Section 5. Then, the general performances of CV for model selection are described, when the goal is either estimation (Section 6) or identification (Section 7). Specific properties of CV in some particular frameworks are discussed in Section 8. Finally, Section 9 focuses on the algorithmic complexity of CV procedures, and Section 10 concludes the survey by tackling several practical questions about CV.

1.1 Statistical framework

Assume that some data $\xi_1, \ldots, \xi_n \in \Xi$ with common distribution $P$ are observed. Throughout the paper, except in Section 8.3, the $\xi_i$ are assumed to be independent. The purpose of statistical inference is to estimate from the data $(\xi_i)_{1 \le i \le n}$ some target feature $s$ of the unknown distribution $P$, such as the mean or the variance of $P$. Let $\mathcal{S}$ denote the set of possible values for $s$.

The quality of $t \in \mathcal{S}$, as an approximation of $s$, is measured by its loss $\mathcal{L}(t)$, where $\mathcal{L} : \mathcal{S} \to \mathbb{R}$ is called the loss function, and is assumed to be minimal for $t = s$. Many loss functions can be chosen for a given statistical problem. Several classical loss functions are defined by

$$\mathcal{L}(t) = \mathcal{L}_P(t) := \mathbb{E}_{\xi \sim P}\left[\gamma(t; \xi)\right], \qquad (1)$$

where $\gamma : \mathcal{S} \times \Xi \to [0, \infty)$ is called a contrast function. Basically, for $t \in \mathcal{S}$ and $\xi \in \Xi$, $\gamma(t; \xi)$ measures how well $t$ is in accordance with the observation $\xi$, so that the loss of $t$, defined by (1), measures the average accordance between $t$ and new observations $\xi$ with distribution $P$. Therefore, several frameworks such as transductive learning do not fit definition (1). Nevertheless, as detailed in Section 1.2, definition (1) includes most classical statistical frameworks.

Another useful quantity is the excess loss

$$\ell(s, t) := \mathcal{L}_P(t) - \mathcal{L}_P(s) \ge 0,$$

which is related to the risk of an estimator $\widehat{s}$ of the target $s$ by $\mathcal{R}(\widehat{s}) = \mathbb{E}_{\xi_1, \ldots, \xi_n \sim P}[\ell(s, \widehat{s})]$.
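To fix ideas, here is a minimal sketch, not taken from the survey: the toy distribution $P$, the candidates, and all function names are our own illustrative assumptions. It approximates the loss $\mathcal{L}_P(t) = \mathbb{E}_{\xi \sim P}[\gamma(t; \xi)]$ of a fixed candidate $t$ by a Monte Carlo average of the least-squares contrast (introduced in Section 1.2 below) over fresh draws $\xi \sim P$.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_ls(t, xi):
    """Least-squares contrast: gamma(t; (x, y)) = (t(x) - y)^2."""
    x, y = xi
    return (t(x) - y) ** 2

def monte_carlo_loss(t, draws, gamma):
    """Approximate L_P(t) = E[gamma(t; xi)] by an average over draws xi ~ P."""
    return float(np.mean([gamma(t, xi) for xi in draws]))

# Toy distribution P (an assumption): X ~ U[0, 1], Y = sin(2*pi*X) + noise.
def draw(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
    return list(zip(x, y))

t_good = lambda x: np.sin(2 * np.pi * x)   # the target s itself
t_bad = lambda x: 0.0                      # a constant candidate
for t in (t_good, t_bad):
    print(monte_carlo_loss(t, draw(100_000), gamma_ls))
# L_P(s) ~ 0.09 (the noise variance); the excess loss of t_bad is ~ 0.5.
```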
1.2 Examples

The purpose of this subsection is to show that the framework of Section 1.1 includes several important statistical frameworks. This list of examples does not pretend to be exhaustive.

Density estimation aims at estimating the density $s$ of $P$ with respect to some given measure $\mu$ on $\Xi$. Then, $\mathcal{S}$ is the set of densities on $\Xi$ with respect to $\mu$. For instance, taking $\gamma(t; x) = -\ln(t(x))$ in (1), the loss is minimal when $t = s$, and the excess loss

$$\ell(s, t) = \mathcal{L}_P(t) - \mathcal{L}_P(s) = \mathbb{E}_{\xi \sim P}\left[\ln\frac{s(\xi)}{t(\xi)}\right] = \int s \ln\left(\frac{s}{t}\right) d\mu$$

is the Kullback-Leibler divergence between the distributions $t\mu$ and $s\mu$.

Prediction aims at predicting a quantity of interest $Y \in \mathcal{Y}$ given an explanatory variable $X \in \mathcal{X}$ and a sample of observations $(X_1, Y_1), \ldots, (X_n, Y_n)$. In other words, $\Xi = \mathcal{X} \times \mathcal{Y}$, $\mathcal{S}$ is the set of measurable mappings $\mathcal{X} \to \mathcal{Y}$, and the contrast $\gamma(t; (x, y))$ measures the discrepancy between the observed $y$ and its predicted value $t(x)$. Two classical prediction frameworks are regression and classification, which are detailed below.

Regression corresponds to continuous $\mathcal{Y}$, that is, $\mathcal{Y} \subset \mathbb{R}$ (or $\mathbb{R}^k$ for multivariate regression), the feature space $\mathcal{X}$ being typically a subset of $\mathbb{R}^\ell$. Let $s$ denote the regression function, that is, $s(x) = \mathbb{E}_{(X,Y) \sim P}[Y \mid X = x]$, so that

$$\forall i, \quad Y_i = s(X_i) + \epsilon_i \quad \text{with} \quad \mathbb{E}[\epsilon_i \mid X_i] = 0.$$

A popular contrast in regression is the least-squares contrast $\gamma(t; (x, y)) = (t(x) - y)^2$, which is minimal over $\mathcal{S}$ for $t = s$, and the excess loss is

$$\ell(s, t) = \mathbb{E}_{(X,Y) \sim P}\left[(s(X) - t(X))^2\right].$$

Note that the excess loss of $t$ is the square of the $L^2$ distance between $t$ and $s$, so that prediction and estimation are equivalent goals.

Classification corresponds to finite (at least discrete) $\mathcal{Y}$. In particular, when $\mathcal{Y} = \{0, 1\}$, the prediction problem is called binary (supervised) classification. With the 0-1 contrast function $\gamma(t; (x, y)) = \mathbb{1}_{t(x) \ne y}$, the minimizer of the loss is the so-called Bayes classifier $s$ defined by $s(x) = \mathbb{1}_{\eta(x) \ge 1/2}$, where $\eta$ denotes the regression function $\eta(x) = \mathbb{P}_{(X,Y) \sim P}(Y = 1 \mid X = x)$.

Remark that a slightly different framework is often considered in binary classification. Instead of looking only for a classifier, the goal is to estimate also the confidence in the classification made at each point: $\mathcal{S}$ is the set of measurable mappings $\mathcal{X} \to \mathbb{R}$, the classifier $x \mapsto \mathbb{1}_{t(x) \ge 0}$ being associated with any $t \in \mathcal{S}$. Basically, the larger $|t(x)|$, the more confident we are in the classification made from $t(x)$. A classical family of losses associated with this problem is defined by (1) with the contrast

$$\gamma_\phi(t; (x, y)) = \phi\left(-(2y - 1)t(x)\right),$$

where $\phi : \mathbb{R} \to [0, \infty)$ is some function. The 0-1 contrast corresponds to $\phi(u) = \mathbb{1}_{u \ge 0}$. The convex loss functions correspond to the case where $\phi$ is convex, nondecreasing, with $\lim_{-\infty} \phi = 0$ and $\phi(0) = 1$. Classical examples are $\phi(u) = \max\{1 + u, 0\}$ (hinge), $\phi(u) = \exp(u)$ (exponential), and $\phi(u) = \log_2(1 + \exp(u))$ (logit). The corresponding losses are used as objective functions by several classical learning algorithms, such as support vector machines (hinge) and boosting (exponential and logit). Many references on classification theory, including model selection, can be found in the survey by Boucheron et al. (2005).
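These surrogate losses are easy to tabulate. The sketch below is illustrative only (the function names and test points are our own); it evaluates the contrast $\gamma_\phi(t;(x,y)) = \phi(-(2y-1)t(x))$ for the 0-1, hinge, exponential and logit choices of $\phi$, on one well-classified and one misclassified point.

```python
import numpy as np

def phi_01(u):    return (u >= 0).astype(float)   # 0-1 contrast: phi(u) = 1{u >= 0}
def phi_hinge(u): return np.maximum(1 + u, 0.0)   # hinge (support vector machines)
def phi_exp(u):   return np.exp(u)                # exponential (boosting)
def phi_logit(u): return np.log2(1 + np.exp(u))   # logit

def contrast_phi(phi, t_x, y):
    """gamma_phi(t; (x, y)) = phi(-(2y - 1) t(x)), with y in {0, 1}."""
    return phi(-(2 * y - 1) * t_x)

# The margin (2y - 1) t(x) is positive iff the classifier 1{t(x) >= 0} is correct.
t_x = np.array([1.5, -0.2])   # a confident correct answer, then a mistake (y = 1)
y = np.array([1, 1])
for phi in (phi_01, phi_hinge, phi_exp, phi_logit):
    print(phi.__name__, np.round(contrast_phi(phi, t_x, y), 3))
# All four satisfy phi(0) = 1, and the convex ones upper-bound the 0-1 contrast.
```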
1.3 Statistical algorithms

In this survey, a statistical algorithm $\mathcal{A}$ is any (measurable) mapping $\mathcal{A} : \bigcup_{n \in \mathbb{N}} \Xi^n \to \mathcal{S}$. The idea is that data $D_n = (\xi_i)_{1 \le i \le n} \in \Xi^n$ will be used as an input of $\mathcal{A}$, and that the output of $\mathcal{A}$, $\mathcal{A}(D_n) = \widehat{s}_{\mathcal{A}}(D_n) \in \mathcal{S}$, is an estimator of $s$. The quality of $\mathcal{A}$ is then measured by $\mathcal{L}_P(\widehat{s}_{\mathcal{A}}(D_n))$, which should be as small as possible. In the sequel, the algorithm $\mathcal{A}$ and the estimator $\widehat{s}_{\mathcal{A}}(D_n)$ are often identified when no confusion is possible.

Minimum contrast estimators form a classical family of statistical algorithms, defined as follows. Given some subset $S$ of $\mathcal{S}$ that we call a model, a minimum contrast estimator of $s$ is any minimizer over $S$ of the empirical contrast

$$t \mapsto \mathcal{L}_{P_n}(t) = \frac{1}{n}\sum_{i=1}^n \gamma(t; \xi_i), \quad \text{where} \quad P_n = \frac{1}{n}\sum_{i=1}^n \delta_{\xi_i}.$$

The idea is that the empirical contrast $\mathcal{L}_{P_n}(t)$ has an expectation $\mathcal{L}_P(t)$ which is minimal over $\mathcal{S}$ at $s$. Hence, minimizing $\mathcal{L}_{P_n}(t)$ over a set $S$ of candidate values for $s$ hopefully leads to a good estimator of $s$. Let us now give three popular examples of empirical contrast minimizers:

• Maximum likelihood estimators: take $\gamma(t; x) = -\ln(t(x))$ in the density estimation setting. A classical choice for $S$ is the set of piecewise constant functions on a regular partition of $\Xi$ with $K$ pieces.

• Least-squares estimators: take the least-squares contrast $\gamma(t; (x, y)) = (t(x) - y)^2$ in the regression setting. For instance, $S$ can be the set of piecewise constant functions on some fixed partition of $\mathcal{X}$ (leading to regressograms; a toy implementation is sketched at the end of this subsection), or a vector space spanned by the first vectors of a wavelet or Fourier basis, among many others. Note that regularized least-squares algorithms such as the Lasso, ridge regression and spline smoothing also are least-squares estimators, the model $S$ being some ball of a (data-dependent) radius for the $\ell^1$ (resp. $\ell^2$) norm in some high-dimensional space. Hence, tuning the regularization parameter of the Lasso or of an SVM, for instance, amounts to performing model selection from a collection of models.

• Empirical risk minimizers, following the terminology of Vapnik (1982): take any contrast function $\gamma$ in the prediction setting. When $\gamma$ is the 0-1 contrast, popular choices for $S$ lead to linear classifiers, partitioning rules, and neural networks. Boosting and support vector machine classifiers also are empirical contrast minimizers over some data-dependent model $S$, with contrast $\gamma = \gamma_\phi$ for some convex function $\phi$.

Let us finally mention that many other classical statistical algorithms can be considered with CV, for instance local average estimators in the prediction framework, such as $k$-nearest neighbours and Nadaraya-Watson kernel estimators. The focus will mainly be kept on minimum contrast estimators to keep the length of the survey reasonable.
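As a toy instance of minimum contrast estimation, here is a sketch of a least-squares regressogram; it is illustrative, not taken from the survey, and the partition of $\mathcal{X} = [0,1]$ and all names are assumptions. On each piece of a regular partition, the empirical contrast is minimized by the local mean of the $Y_i$.

```python
import numpy as np

def regressogram(x, y, n_bins):
    """Minimum contrast estimator for the least-squares contrast, over the model
    S_m of piecewise-constant functions on a regular partition of [0, 1]
    into n_bins pieces (dimension D_m = n_bins)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([y[idx == k].mean() if np.any(idx == k) else 0.0
                      for k in range(n_bins)])

    def s_hat(x_new):
        j = np.clip(np.searchsorted(edges, x_new, side="right") - 1, 0, n_bins - 1)
        return means[j]

    return s_hat

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=200)
s_hat = regressogram(x, y, n_bins=8)
print(s_hat(np.array([0.1, 0.5, 0.9])))   # a piecewise-constant fit
```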
2 Model selection

Usually, several statistical algorithms can be used for solving a given statistical problem. Let $(\widehat{s}_\lambda)_{\lambda \in \Lambda}$ denote such a family of candidate statistical algorithms. The algorithm selection problem aims at choosing from data one of these algorithms, that is, choosing some $\widehat{\lambda}(D_n) \in \Lambda$. Then, the final estimator of $s$ is given by $\widehat{s}_{\widehat{\lambda}(D_n)}(D_n)$. The main difficulty is that the same data are used for training the algorithms, that is, for computing $(\widehat{s}_\lambda(D_n))_{\lambda \in \Lambda}$, and for choosing $\widehat{\lambda}(D_n)$.

2.1 The model selection paradigm

Following Section 1.3, let us focus on the model selection problem, where candidate algorithms are minimum contrast estimators and the goal is to choose a model $S$. Let $(S_m)_{m \in \mathcal{M}_n}$ be a family of models, that is, $S_m \subset \mathcal{S}$. Let $\gamma$ be a fixed contrast function, and for every $m \in \mathcal{M}_n$, let $\widehat{s}_m$ be a minimum contrast estimator over the model $S_m$ with contrast $\gamma$. The goal is to choose $\widehat{m}(D_n) \in \mathcal{M}_n$ from data only.

The choice of a model $S_m$ has to be done carefully. Indeed, when $S_m$ is a small model, $\widehat{s}_m$ is a poor statistical algorithm except when $s$ is very close to $S_m$, since

$$\ell(s, \widehat{s}_m) \ge \inf_{t \in S_m} \{\ell(s, t)\} =: \ell(s, S_m).$$

The lower bound $\ell(s, S_m)$ is called the bias of the model $S_m$, or approximation error. The bias is a nonincreasing function of $S_m$. On the contrary, when $S_m$ is huge, its bias $\ell(s, S_m)$ is small for most targets $s$, but $\widehat{s}_m$ clearly overfits. Think for instance of $S_m$ as the set of all continuous functions on $[0, 1]$ in the regression framework. More generally, if $S_m$ is a vector space of dimension $D_m$, in several classical frameworks,

$$\mathbb{E}\left[\ell(s, \widehat{s}_m(D_n))\right] \approx \ell(s, S_m) + \lambda D_m, \qquad (2)$$

where $\lambda > 0$ does not depend on $m$. For instance, $\lambda = 1/(2n)$ in density estimation using the likelihood contrast, and $\lambda = \sigma^2/n$ in regression using the least-squares contrast and assuming $\mathrm{var}(Y \mid X) = \sigma^2$ does not depend on $X$.

The meaning of (2) is that a good model choice should balance the bias term $\ell(s, S_m)$ and the variance term $\lambda D_m$, that is, solve the so-called bias-variance trade-off. By extension, the variance term, also called estimation error, can be defined by

$$\mathbb{E}\left[\ell(s, \widehat{s}_m(D_n))\right] - \ell(s, S_m) = \mathbb{E}\left[\mathcal{L}_P(\widehat{s}_m)\right] - \inf_{t \in S_m} \mathcal{L}_P(t),$$

even when (2) does not hold.
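To see the trade-off in (2) numerically, here is a small sketch under our own illustrative assumptions: least-squares regression with $\lambda = \sigma^2/n$, and a polynomial decay $\ell(s, S_m) = D_m^{-2}$ of the approximation error (typical for a smooth target). It locates the dimension at which the two terms of (2) balance.

```python
import numpy as np

n, sigma2 = 500, 1.0
lam = sigma2 / n                 # variance coefficient in (2) for least squares
D = np.arange(1, 201)            # model dimensions D_m
bias = D ** (-2.0)               # assumed approximation errors l(s, S_m)

risk = bias + lam * D            # right-hand side of (2)
D_star = D[np.argmin(risk)]
print(D_star, risk.min())        # small D: bias dominates; large D: overfitting
# Here the balance is at D_star = 10, where -d(bias)/dD ~ lam.
```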
The interested reader can find a much deeper insight into model selection in the Saint-Flour lecture notes by Massart (2007). Before giving examples of classical model selection procedures, let us mention the two main different goals that model selection can target: estimation and identification.

2.2 Model selection for estimation

On the one hand, the goal of model selection is estimation when $\widehat{s}_{\widehat{m}(D_n)}(D_n)$ is used as an approximation of the target $s$, and the goal is to minimize its loss. For instance, the AIC and Mallows' $C_p$ model selection procedures are built for estimation (see Section 3.1).

The quality of a model selection procedure $D_n \mapsto \widehat{m}(D_n)$, designed for estimation, is measured by the excess loss of $\widehat{s}_{\widehat{m}(D_n)}(D_n)$. Hence, the best possible model choice for estimation is the so-called oracle model $S_{m^\star}$, defined by

$$m^\star = m^\star(D_n) \in \arg\min_{m \in \mathcal{M}_n} \left\{\ell(s, \widehat{s}_m(D_n))\right\}. \qquad (3)$$

Since $m^\star(D_n)$ depends on the unknown distribution $P$ of the data, one cannot expect to select $\widehat{m}(D_n) = m^\star(D_n)$ almost surely. Nevertheless, we can hope to select $\widehat{m}(D_n)$ such that $\widehat{s}_{\widehat{m}(D_n)}$ is almost as close to $s$ as $\widehat{s}_{m^\star(D_n)}$. Note that there is no requirement for $s$ to belong to $\bigcup_{m \in \mathcal{M}_n} S_m$.

Depending on the framework, the optimality of a model selection procedure for estimation is assessed in at least two different ways. First, in the asymptotic framework, a model selection procedure $\widehat{m}$ is called efficient (or asymptotically optimal) when it leads to $\widehat{m}$ such that

$$\frac{\ell\left(s, \widehat{s}_{\widehat{m}(D_n)}(D_n)\right)}{\inf_{m \in \mathcal{M}_n} \left\{\ell(s, \widehat{s}_m(D_n))\right\}} \xrightarrow[n \to \infty]{\text{a.s.}} 1.$$

Sometimes, a weaker result is proved, the convergence holding only in probability.

Second, in the non-asymptotic framework, a model selection procedure satisfies an oracle inequality with constant $C_n \ge 1$ and remainder term $R_n \ge 0$ when

$$\ell\left(s, \widehat{s}_{\widehat{m}(D_n)}(D_n)\right) \le C_n \inf_{m \in \mathcal{M}_n} \left\{\ell(s, \widehat{s}_m(D_n))\right\} + R_n \qquad (4)$$

holds either in expectation or with large probability (that is, a probability larger than $1 - C'/n^2$ for some positive constant $C'$). Note that if (4) holds on a large probability event with $C_n$ tending to 1 as $n$ tends to infinity and $R_n \ll \ell(s, \widehat{s}_{m^\star}(D_n))$, then the model selection procedure $\widehat{m}$ is efficient.

In the estimation setting, model selection is often used for building adaptive estimators, assuming that $s$ belongs to some function space $\mathcal{T}_\alpha$ (Barron et al., 1999). Then, a model selection procedure $\widehat{m}$ is optimal when it leads to an estimator $\widehat{s}_{\widehat{m}(D_n)}(D_n)$ that is (approximately) minimax with respect to $\mathcal{T}_\alpha$ without knowing $\alpha$, provided the family $(S_m)_{m \in \mathcal{M}_n}$ has been well chosen.

2.3 Model selection for identification

On the other hand, model selection can aim at identifying the true model $S_{m_0}$, defined as the smallest model among $(S_m)_{m \in \mathcal{M}_n}$ to which $s$ belongs. In particular, $s \in \bigcup_{m \in \mathcal{M}_n} S_m$ is assumed in this setting. A typical example of a model selection procedure built for identification is BIC (see Section 3.3).

The quality of a model selection procedure designed for identification is measured by its probability of recovering the true model $m_0$. Then, a model selection procedure is called (model) consistent when

$$\mathbb{P}\left(\widehat{m}(D_n) = m_0\right) \xrightarrow[n \to \infty]{} 1.$$

Note that identification can naturally be extended to the general algorithm selection problem, the true model being replaced by the statistical algorithm whose risk converges at the fastest rate (see for instance Yang, 2007).

2.4 Estimation vs. identification

When a true model exists, model consistency is clearly a stronger property than the efficiency defined in Section 2.2. However, in many frameworks, no true model exists, so that efficiency is the only well-defined property. Could a model selection procedure be model consistent in the former case (like BIC) and efficient in the latter case (like AIC)? The general answer to this question, often called the AIC-BIC dilemma, is negative: Yang (2005) proved in the regression framework that no model selection procedure can be simultaneously model consistent and minimax rate optimal. Nevertheless, the strengths of AIC and BIC can sometimes be shared; see for instance the introduction of a paper by Yang (2005) and a recent paper by van Erven et al. (2008).

3 Overview of some model selection procedures

Several approaches can be used for model selection. Let us briefly sketch here some of them, which are particularly helpful for understanding how CV works. Like CV, all the procedures considered in this section select

$$\widehat{m}(D_n) \in \arg\min_{m \in \mathcal{M}_n} \left\{\mathrm{crit}(m; D_n)\right\}, \qquad (5)$$

where $\mathrm{crit}(m; D_n) = \mathrm{crit}(m) \in \mathbb{R}$ is some data-dependent criterion, defined for every $m \in \mathcal{M}_n$. A particular case of (5) is penalization, which consists in choosing the model minimizing the sum of the empirical contrast and some measure of complexity of the model (called a penalty), which can depend on the data; that is,

$$\widehat{m}(D_n) \in \arg\min_{m \in \mathcal{M}_n} \left\{\mathcal{L}_{P_n}(\widehat{s}_m) + \mathrm{pen}(m; D_n)\right\}. \qquad (6)$$

This section does not pretend to be exhaustive. Completely different approaches exist for model selection, such as the Minimum Description Length (MDL) (Rissanen, 1983) and the Bayesian approaches. The interested reader will find more details and references on model selection procedures in the books by Burnham and Anderson (2002) or Massart (2007), for instance. Let us focus here on five main categories of model selection procedures, the first three coming from a classification made by Shao (1997) in the linear regression framework.

3.1 The unbiased risk estimation principle

When the goal of model selection is estimation, many model selection procedures are of the form (5) where $\mathrm{crit}(m; D_n)$ unbiasedly estimates (at least asymptotically) the loss $\mathcal{L}_P(\widehat{s}_m)$. This general idea is often called the unbiased risk estimation principle, or Mallows' or Akaike's heuristics.

In order to explain why this strategy can perform well, let us write the starting point of most theoretical analyses of procedures defined by (5): by definition (5), for every $m \in \mathcal{M}_n$,

$$\ell(s, \widehat{s}_{\widehat{m}}) + \mathrm{crit}(\widehat{m}) - \mathcal{L}_P(\widehat{s}_{\widehat{m}}) \le \ell(s, \widehat{s}_m) + \mathrm{crit}(m) - \mathcal{L}_P(\widehat{s}_m). \qquad (7)$$

If $\mathbb{E}[\mathrm{crit}(m) - \mathcal{L}_P(\widehat{s}_m)] = 0$ for every $m \in \mathcal{M}_n$, then concentration inequalities are likely to prove that there exist $\epsilon_n^-, \epsilon_n^+ > 0$ such that, with high probability,

$$\forall m \in \mathcal{M}_n, \quad \epsilon_n^+ \ge \frac{\mathrm{crit}(m) - \mathcal{L}_P(\widehat{s}_m)}{\ell(s, \widehat{s}_m)} \ge -\epsilon_n^- > -1,$$

at least when $\mathrm{Card}(\mathcal{M}_n) \le C n^\alpha$ for some $C, \alpha \ge 0$. Then, (7) directly implies an oracle inequality like (4) with $C_n = (1 + \epsilon_n^+)/(1 - \epsilon_n^-)$. If $\epsilon_n^+, \epsilon_n^- \to 0$ as $n \to \infty$, this proves that the procedure defined by (5) is efficient.
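As an illustration of the selection rules (5)-(6), here is a sketch under our own assumptions (the model family and its empirical risks are hypothetical numbers): Mallows' $C_p$ in least-squares regression with a known noise level, whose penalty $2\sigma^2 D_m/n$ is twice the variance term of (2).

```python
import numpy as np

def cp_criterion(emp_risk, dim, n, sigma2):
    """Penalized criterion (6) with Mallows' C_p penalty pen(m) = 2 sigma^2 D_m / n."""
    return emp_risk + 2 * sigma2 * dim / n

# Hypothetical family M_n: larger models fit better but pay a larger penalty.
# models: m -> (empirical contrast of s_hat_m, dimension D_m)
models = {1: (0.90, 1), 2: (0.55, 2), 5: (0.30, 5), 20: (0.26, 20), 50: (0.25, 50)}
n, sigma2 = 100, 1.0
m_hat = min(models, key=lambda m: cp_criterion(*models[m], n, sigma2))
print(m_hat)   # selection rule (5): the argmin of crit(m); here m_hat = 5
```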
Examples of model selection procedures following the unbiased risk estimation principle are FPE (Final Prediction Error; Akaike, 1970), several cross-validation procedures including the leave-one-out (see Section 4), and GCV (Generalized Cross-Validation; Craven and Wahba, 1979; see Section 4.3.3). With the penalization approach (6), the unbiased risk estimation principle states that $\mathbb{E}[\mathrm{pen}(m)]$ should be close to the ideal penalty

$$\mathrm{pen}_{\mathrm{id}}(m) := \mathcal{L}_P(\widehat{s}_m) - \mathcal{L}_{P_n}(\widehat{s}_m).$$

Several classical penalization procedures follow this principle, for instance:

• With the log-likelihood contrast, AIC (Akaike Information Criterion; Akaike, 1973) and its corrected versions (Sugiura, 1978; Hurvich and Tsai, 1989).

• With the least-squares contrast, Mallows' $C_p$ (Mallows, 1973) and several refined versions of $C_p$ (see for instance Baraud, 2002).

• With a general contrast, covariance penalties (Efron, 2004).

AIC, Mallows' $C_p$ and related procedures have been proved to be optimal for estimation in several frameworks, provided $\mathrm{Card}(\mathcal{M}_n) \le C n^\alpha$ for some constants $C, \alpha \ge 0$ (see the paper by Birgé and Massart, 2007, and references therein).

The main drawback of penalties such as AIC or Mallows' $C_p$ is their dependence on some assumptions on the distribution of the data. For instance, Mallows' $C_p$ assumes that the variance of $Y$ does not depend on $X$; otherwise, it has a suboptimal performance (Arlot, 2008b). Several resampling-based penalties have been proposed to overcome this problem, at the price of a larger computational complexity and a possibly slightly worse performance in simpler frameworks; see the paper by Efron (1983) for the bootstrap, and the paper by Arlot (2008a) and references therein for the generalization to exchangeable weights.

Finally, note that all these penalties depend on multiplying factors which are not always known (for instance, the noise level for Mallows' $C_p$). Birgé and Massart (2007) proposed a general data-driven procedure for estimating such multiplying factors, which satisfies an oracle inequality with $C_n \to 1$ in regression (see also Arlot and Massart, 2009).

3.2 Biased estimation of the risk

Several model selection procedures are of the form (5) where $\mathrm{crit}(m)$ does not unbiasedly estimate the loss $\mathcal{L}_P(\widehat{s}_m)$: the weight of the variance term compared to the bias in $\mathbb{E}[\mathrm{crit}(m)]$ is slightly larger than in the decomposition (2) of $\mathcal{L}_P(\widehat{s}_m)$. From the penalization point of view, such procedures are overpenalizing.

Examples of such procedures are FPE$_\alpha$ (Bhansali and Downham, 1977) and GIC$_\lambda$ (Generalized Information Criterion; Nishii, 1984; Shao, 1997) with $\alpha, \lambda > 2$, which are closely related. Some cross-validation procedures, such as leave-$p$-out with $p/n \in (0, 1)$ fixed, also belong to this category (see Section 4.3.1). Note that FPE$_\alpha$ with $\alpha = 2$ is FPE, and GIC$_\lambda$ with $\lambda = 2$ is close to FPE and Mallows' $C_p$.

When the goal is estimation, there are two main reasons for using biased model selection procedures. First, experimental evidence shows that overpenalizing often yields better performance when the signal-to-noise ratio is small (see for instance Arlot, 2007, Chapter 11). Second, when the number of models $\mathrm{Card}(\mathcal{M}_n)$ grows faster than any power of $n$, as in the complete variable selection problem with $n$ variables, the unbiased risk estimation principle fails.
From the penalization point of view, Birgé and Massart (2007) proved that when $\mathrm{Card}(\mathcal{M}_n) = e^{\kappa n}$ for some $\kappa > 0$, the minimal amount of penalty required for an oracle inequality to hold with $C_n = O(1)$ is much larger than $\mathrm{pen}_{\mathrm{id}}(m)$. In addition to FPE$_\alpha$ and GIC$_\lambda$ with suitably chosen $\alpha, \lambda$, several penalization procedures have been proposed for taking into account the size of $\mathcal{M}_n$ (Barron et al., 1999; Baraud, 2002; Birgé and Massart, 2001; Sauvé, 2009). In the same papers, these procedures are proved to satisfy oracle inequalities with $C_n$ as small as possible, typically of order $\ln(n)$ when $\mathrm{Card}(\mathcal{M}_n) = e^{\kappa n}$.

3.3 Procedures built for identification

Some specific model selection procedures are used for identification. A typical example is BIC (Bayesian Information Criterion; Schwarz, 1978). More generally, Shao (1997) showed that several procedures consistently identify the correct model in the linear regression framework as soon as they overpenalize within a factor tending to infinity with $n$: for instance, GIC$_{\lambda_n}$ with $\lambda_n \to +\infty$, FPE$_{\alpha_n}$ with $\alpha_n \to +\infty$ (Shibata, 1984), and several CV procedures such as leave-$p$-out with $p = p_n \sim n$. BIC is also part of this picture, since it coincides with GIC$_{\ln(n)}$.

In another paper, Shao (1996) showed that $m_n$-out-of-$n$ bootstrap penalization is also model consistent as soon as $m_n \sim n$. Compared to Efron's bootstrap penalties, the idea is to estimate $\mathrm{pen}_{\mathrm{id}}$ with the $m_n$-out-of-$n$ bootstrap instead of the usual bootstrap, which results in overpenalization within a factor tending to infinity with $n$ (Arlot, 2008a). Most MDL-based procedures can also be put into this category of model selection procedures (see Grünwald, 2007). Let us finally mention the Lasso (Tibshirani, 1996) and other $\ell^1$ penalization procedures, which have recently attracted much attention (see for instance Hesterberg et al., 2008). They are a computationally efficient way of identifying the true model in the context of variable selection with many variables.

3.4 Structural risk minimization

In the context of statistical learning, Vapnik and Chervonenkis (1974) proposed the structural risk minimization approach (see also Vapnik, 1982, 1998). Roughly, the idea is to penalize the empirical contrast with a penalty (over)estimating

$$\mathrm{pen}_{\mathrm{id,g}}(m) := \sup_{t \in S_m} \left\{\mathcal{L}_P(t) - \mathcal{L}_{P_n}(t)\right\} \ge \mathrm{pen}_{\mathrm{id}}(m).$$

Such penalties have been built using the Vapnik-Chervonenkis dimension, the combinatorial entropy, (global) Rademacher complexities (Koltchinskii, 2001; Bartlett et al., 2002), (global) bootstrap penalties (Fromont, 2007), Gaussian complexities or the maximal discrepancy (Bartlett and Mendelson, 2002). These penalties are often called global because $\mathrm{pen}_{\mathrm{id,g}}(m)$ is a supremum over $S_m$.

The localization approach (see Boucheron et al., 2005) has been introduced in order to obtain penalties closer to $\mathrm{pen}_{\mathrm{id}}$ (such as local Rademacher complexities), hence smaller prediction errors when possible (Bartlett et al., 2005; Koltchinskii, 2006). Nevertheless, these penalties are still larger than $\mathrm{pen}_{\mathrm{id}}(m)$ and can be difficult to compute in practice because of several unknown constants. A non-asymptotic analysis of several global and local penalties can be found in the book by Massart (2007), for instance; see also Koltchinskii (2006) for recent results on local penalties.
3.5 Ad hoc penalization

Let us finally mention that penalties can also be built according to particular features of the problem. For instance, penalties can be proportional to the $\ell^p$ norm of $\widehat{s}_m$ (similarly to $\ell^p$-regularized learning algorithms) when having an estimator with a controlled $\ell^p$ norm seems better. The penalty can also be proportional to the squared norm of $\widehat{s}_m$ in some reproducing kernel Hilbert space (similarly to kernel ridge regression or spline smoothing), with a kernel adapted to the specific framework. More generally, any penalty can be used, as soon as $\mathrm{pen}(m)$ is larger than the estimation error (to avoid overfitting); the best model for the final user is then not the oracle $m^\star$, but rather

$$\arg\min_{m \in \mathcal{M}_n} \left\{\ell(s, S_m) + \kappa\,\mathrm{pen}(m)\right\}$$

for some $\kappa > 0$.

3.6 Where are cross-validation procedures in this picture?

The family of CV procedures, which will be described and deeply investigated in the next sections, contains procedures in the first three categories. CV procedures are all of the form (5), where $\mathrm{crit}(m)$ either estimates (almost) unbiasedly the loss $\mathcal{L}_P(\widehat{s}_m)$, or overestimates the variance term (see Section 2.1). In the latter case, CV procedures belong either to the second or to the third category, depending on the overestimation level.

This fact has two major implications. First, CV itself does not take into account prior information for selecting a model. To do so, one can either add to the CV estimate of the risk a penalty term (such as $\|\widehat{s}_m\|_p$), or use prior information to pre-select a subset of models $\widetilde{\mathcal{M}}(D_n) \subset \mathcal{M}_n$ before letting CV select a model among $(S_m)_{m \in \widetilde{\mathcal{M}}(D_n)}$.

Second, in statistical learning, CV and resampling-based procedures are the most widely used model selection procedures. Structural risk minimization is often too pessimistic, and other alternatives rely on unrealistic assumptions. But if CV and resampling-based procedures are the most likely to yield good prediction performances, their theoretical grounds are not that firm, and too few CV users are careful enough when choosing a CV procedure to perform model selection. Among the aims of this survey is to point out both positive and negative results about the model selection performance of CV.

4 Cross-validation procedures

The purpose of this section is to describe the rationale behind CV and to define the different CV procedures. Since all CV procedures are of the form (5), defining a CV procedure amounts to defining the corresponding CV estimator of the risk of an algorithm $\mathcal{A}$, which will play the role of $\mathrm{crit}(\cdot)$ in (5).

4.1 Cross-validation philosophy

As noticed as early as the 1930s by Larson (1931), training an algorithm and evaluating its statistical performance on the same data yields an overoptimistic result. CV was raised to fix this issue (Mosteller and Tukey, 1968; Stone, 1974; Geisser, 1975), starting from the remark that testing the output of the algorithm on new data would yield a good estimate of its performance (Breiman, 1998).

In most real applications, only a limited amount of data is available, which led to the idea of splitting the data: part of the data (the training sample) is used for training the algorithm, and the remaining data (the validation sample) are used for evaluating its performance. The validation sample can play the role of new data as soon as the data are i.i.d.
Data splitting yields the validation estimate of the risk, and averaging over several splits yields a cross-validation estimate of the risk. As will be shown in Sections 4.2 and 4.3, various splitting strategies lead to various CV estimates of the risk.

The major interest of CV lies in the universality of the data splitting heuristics, which only assumes that the data are identically distributed and that the training and validation samples are independent, two assumptions which can even be relaxed (see Section 8.3). Therefore, CV can be applied to (almost) any algorithm in (almost) any framework, for instance regression (Stone, 1974; Geisser, 1975), density estimation (Rudemo, 1982; Stone, 1984) and classification (Devroye and Wagner, 1979; Bartlett et al., 2002), among many others. On the contrary, most other model selection procedures (see Section 3) are specific to a framework: for instance, $C_p$ (Mallows, 1973) is specific to least-squares regression.

4.2 From validation to cross-validation

In this section, the hold-out (or validation) estimator of the risk is defined, leading to a general definition of CV.

4.2.1 Hold-out

The hold-out (Devroye and Wagner, 1979), or (simple) validation, relies on a single split of the data. Formally, let $I^{(t)}$ be a non-empty proper subset of $\{1, \ldots, n\}$, that is, such that both $I^{(t)}$ and its complement $I^{(v)} = (I^{(t)})^c = \{1, \ldots, n\} \setminus I^{(t)}$ are non-empty. The hold-out estimator of the risk of $\mathcal{A}(D_n)$ with training set $I^{(t)}$ is defined by

$$\widehat{L}^{\mathrm{HO}}\left(\mathcal{A}; D_n; I^{(t)}\right) := \frac{1}{n_v} \sum_{i \in I^{(v)}} \gamma\left(\mathcal{A}(D_n^{(t)}); \xi_i\right), \qquad (8)$$

where $D_n^{(t)} := (\xi_i)_{i \in I^{(t)}}$ is the training sample, of size $n_t = \mathrm{Card}(I^{(t)})$, and $D_n^{(v)} := (\xi_i)_{i \in I^{(v)}}$ is the validation sample, of size $n_v = n - n_t$; $I^{(v)}$ is called the validation set. The question of choosing $n_t$, and $I^{(t)}$ given its cardinality $n_t$, is discussed in the rest of this survey.

4.2.2 General definition of cross-validation

A general description of the CV strategy has been given by Geisser (1975): in brief, CV consists in averaging several hold-out estimators of the risk corresponding to different splits of the data. Formally, let $B \ge 1$ be an integer and $I_1^{(t)}, \ldots, I_B^{(t)}$ be a sequence of non-empty proper subsets of $\{1, \ldots, n\}$. The CV estimator of the risk of $\mathcal{A}(D_n)$ with training sets $(I_j^{(t)})_{1 \le j \le B}$ is defined by

$$\widehat{L}^{\mathrm{CV}}\left(\mathcal{A}; D_n; (I_j^{(t)})_{1 \le j \le B}\right) := \frac{1}{B} \sum_{j=1}^B \widehat{L}^{\mathrm{HO}}\left(\mathcal{A}; D_n; I_j^{(t)}\right). \qquad (9)$$

All existing CV estimators of the risk are of the form (9), each one being uniquely determined by the way the sequence $(I_j^{(t)})_{1 \le j \le B}$ is chosen, that is, by the choice of the splitting scheme.

Note that when CV is used in model selection for identification, an alternative definition of CV was proposed by Yang (2006, 2007) and called CV with voting (CV-v). When two algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$ are compared, $\mathcal{A}_1$ is selected by CV-v if and only if $\widehat{L}^{\mathrm{HO}}(\mathcal{A}_1; D_n; I_j^{(t)}) < \widehat{L}^{\mathrm{HO}}(\mathcal{A}_2; D_n; I_j^{(t)})$ for a majority of the splits $j = 1, \ldots, B$. By contrast, CV procedures of the form (9) can be called CV with averaging (CV-a), since the estimates of the risk of the algorithms are averaged before their comparison.
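Definitions (8) and (9) translate directly into code. The sketch below is a generic illustration under our own conventions (all names are ours, not the paper's notation): it computes the hold-out estimator for one training set and averages it over a list of training sets, for any algorithm $\mathcal{A}$ and contrast $\gamma$.

```python
import numpy as np

def holdout_risk(algorithm, data, train_idx, gamma):
    """Hold-out estimator (8): train A on D_n^(t), then average the contrast
    gamma(A(D_n^(t)); xi_i) over the validation sample i in I^(v)."""
    train = set(train_idx)
    s_hat = algorithm([data[i] for i in train_idx])
    val = [data[i] for i in range(len(data)) if i not in train]
    return float(np.mean([gamma(s_hat, xi) for xi in val]))

def cv_risk(algorithm, data, train_sets, gamma):
    """CV estimator (9): average of the B hold-out estimators."""
    return float(np.mean([holdout_risk(algorithm, data, I, gamma)
                          for I in train_sets]))

# Toy check: A = empirical mean, gamma(t; xi) = (t - xi)^2, one 80/20 split.
rng = np.random.default_rng(0)
data = list(rng.normal(size=50))
algo = lambda sample: float(np.mean(sample))
gamma = lambda t, xi: (t - xi) ** 2
print(cv_risk(algo, data, [list(range(40))], gamma))
```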
4.3 Classical examples

Most classical CV estimators split the data with a fixed size $n_t$ of the training set, that is, $\mathrm{Card}(I_j^{(t)}) \approx n_t$ for every $j$. The question of choosing $n_t$ is discussed extensively in the rest of this survey. In this subsection, several CV estimators are defined. Given $n_t$, two main categories of splitting schemes can be distinguished: exhaustive data splitting, that is, considering all training sets $I^{(t)}$ of size $n_t$, and partial data splitting.

4.3.1 Exhaustive data splitting

Leave-one-out (LOO; Stone, 1974; Allen, 1974; Geisser, 1975) is the most classical exhaustive CV procedure, corresponding to the choice $n_t = n - 1$: each data point is successively left out from the sample and used for validation. Formally, LOO is defined by (9) with $B = n$ and $I_j^{(t)} = \{j\}^c$ for $j = 1, \ldots, n$:

$$\widehat{L}^{\mathrm{LOO}}(\mathcal{A}; D_n) = \frac{1}{n} \sum_{j=1}^n \gamma\left(\mathcal{A}(D_n^{(-j)}); \xi_j\right), \qquad (10)$$

where $D_n^{(-j)} = (\xi_i)_{i \ne j}$. The name LOO can be traced back to papers by Picard and Cook (1984) and by Breiman and Spector (1992), but LOO has several other names in the literature, such as delete-one CV (see Li, 1987), ordinary CV (Stone, 1974; Burman, 1989), or even only CV (Efron, 1983; Li, 1987).

Leave-$p$-out (LPO; Shao, 1993) with $p \in \{1, \ldots, n\}$ is the exhaustive CV with $n_t = n - p$: every possible set of $p$ data points is successively left out from the sample and used for validation. Therefore, LPO is defined by (9) with $B = \binom{n}{p}$, the $(I_j^{(t)})_{1 \le j \le B}$ being the complements of all the subsets of $\{1, \ldots, n\}$ of size $p$. LPO is also called delete-$p$ CV or delete-$p$ multifold CV (Zhang, 1993). Note that LPO with $p = 1$ is LOO.

4.3.2 Partial data splitting

Considering $\binom{n}{p}$ training sets can be computationally intractable, even for small $p$, so that partial data splitting methods have been proposed.

$V$-fold CV (VFCV) with $V \in \{1, \ldots, n\}$ was introduced by Geisser (1975) as an alternative to the computationally expensive LOO (see also Breiman et al., 1984, for instance). VFCV relies on a preliminary partitioning of the data into $V$ subsamples of approximately equal cardinality $n/V$; each of these subsamples successively plays the role of validation sample. Formally, let $A_1, \ldots, A_V$ be some partition of $\{1, \ldots, n\}$ with $\mathrm{Card}(A_j) \approx n/V$. Then, the VFCV estimator of the risk of $\mathcal{A}$ is defined by (9) with $B = V$ and $I_j^{(t)} = A_j^c$ for $j = 1, \ldots, B$, that is,

$$\widehat{L}^{\mathrm{VF}}\left(\mathcal{A}; D_n; (A_j)_{1 \le j \le V}\right) = \frac{1}{V} \sum_{j=1}^V \frac{1}{\mathrm{Card}(A_j)} \sum_{i \in A_j} \gamma\left(\mathcal{A}(D_n^{(-A_j)}); \xi_i\right), \qquad (11)$$

where $D_n^{(-A_j)} = (\xi_i)_{i \in A_j^c}$. By construction, the algorithmic complexity of VFCV is only $V$ times that of training $\mathcal{A}$ with $n - n/V$ data points, which is much less than that of LOO or LPO if $V \ll n$. Note that VFCV with $V = n$ is LOO.
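The classical schemes differ only in how the training sets $(I_j^{(t)})_{1 \le j \le B}$ are generated. The sketch below is illustrative (it feeds the hypothetical `cv_risk` function of the previous sketch, and all names are ours); it produces the LOO, LPO, $V$-fold and Monte-Carlo families of training sets.

```python
import itertools
import numpy as np

def loo_splits(n):
    """Leave-one-out: B = n training sets {j}^c."""
    return [[i for i in range(n) if i != j] for j in range(n)]

def lpo_splits(n, p):
    """Leave-p-out: the C(n, p) training sets, complements of every validation
    set of size p (exhaustive, hence intractable unless p is tiny)."""
    idx = set(range(n))
    return [sorted(idx - set(v)) for v in itertools.combinations(range(n), p)]

def vfold_splits(n, V, rng):
    """V-fold: partition {0, ..., n-1} into V blocks A_j of size ~ n/V;
    the j-th training set is the complement of A_j."""
    folds = np.array_split(rng.permutation(n), V)
    return [sorted(np.concatenate([f for k, f in enumerate(folds) if k != j]))
            for j in range(V)]

def mccv_splits(n, n_t, B, rng):
    """Monte-Carlo CV: B independent uniform draws of training sets of size n_t."""
    return [sorted(rng.choice(n, size=n_t, replace=False)) for _ in range(B)]

rng = np.random.default_rng(0)
print(len(loo_splits(10)), len(lpo_splits(10, 2)), len(vfold_splits(10, 5, rng)))
```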
Balanced Incomplete CV (BICV; Shao, 1993) can be seen as an alternative to VFCV well suited for small training sample sizes $n_t$. Indeed, BICV is defined by (9) with training sets $(A^c)_{A \in \mathcal{T}}$, where $\mathcal{T}$ is a balanced incomplete block design (BIBD; John, 1971), that is, a collection of $B > 0$ subsets of $\{1, \ldots, n\}$ of size $n_v = n - n_t$ such that:

1. $\mathrm{Card}\{A \in \mathcal{T} \text{ s.t. } k \in A\}$ does not depend on $k \in \{1, \ldots, n\}$;

2. $\mathrm{Card}\{A \in \mathcal{T} \text{ s.t. } k, \ell \in A\}$ does not depend on $k \ne \ell \in \{1, \ldots, n\}$.

The idea of BICV is to give each data point (and each pair of data points) the same role in the training and validation tasks. Note that VFCV relies on a similar idea, since the set of training sample indices used by VFCV satisfies the first property and almost the second one: pairs $(k, \ell)$ belonging to the same $A_j$ appear in one validation set more than other pairs.

Repeated learning-testing (RLT) was introduced by Breiman et al. (1984) and further studied by Burman (1989) and by Zhang (1993), for instance. The RLT estimator of the risk of $\mathcal{A}$ is defined by (9) with any $B > 0$, the $(I_j^{(t)})_{1 \le j \le B}$ being $B$ different subsets of $\{1, \ldots, n\}$, chosen randomly and independently from the data. RLT can be seen as an approximation to LPO with $p = n - n_t$, with which it coincides when $B = \binom{n}{p}$.

Monte-Carlo CV (MCCV; Picard and Cook, 1984) is very close to RLT: $B$ independent subsets of $\{1, \ldots, n\}$ are randomly drawn, with uniform distribution among the subsets of size $n_t$. The only difference with RLT is that MCCV allows the same split to be chosen several times.

4.3.3 Other cross-validation-like risk estimators

Several procedures have been introduced which are close to, or based upon, CV. Most of them aim at fixing an observed drawback of CV.

Bias-corrected versions of the VFCV and RLT risk estimators have been proposed by Burman (1989, 1990), and a closely related penalization procedure called $V$-fold penalization has been defined by Arlot (2008); see Section 5.1.2 for details.

Generalized CV (GCV; Craven and Wahba, 1979) was introduced as a rotation-invariant version of LOO in least-squares regression, for estimating the risk of a linear estimator $\widehat{s} = MY$, where $Y = (Y_i)_{1 \le i \le n} \in \mathbb{R}^n$ and $M$ is an $n \times n$ matrix independent of $Y$:

$$\mathrm{crit}_{\mathrm{GCV}}(M, Y) := \frac{n^{-1}\left\|Y - MY\right\|^2}{\left(1 - n^{-1}\operatorname{tr}(M)\right)^2}, \quad \text{where} \quad \forall t \in \mathbb{R}^n, \ \|t\|^2 = \sum_{i=1}^n t_i^2.$$

GCV is actually closer to $C_L$ (Mallows, 1973) than to CV, since GCV can be seen as an approximation to $C_L$ with a particular estimator of the variance (Efron, 1986). The efficiency of GCV has been proved in various frameworks, in particular by Li (1985, 1987) and by Cao and Golubev (2006).
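For concreteness, here is a sketch of the GCV score applied to choosing the ridge parameter of a linear smoother; the design, the grid of $\lambda$ values, and all names are our own illustrative assumptions, not an example from the survey.

```python
import numpy as np

def gcv_score(M, Y):
    """crit_GCV(M, Y) = (n^{-1} ||Y - M Y||^2) / (1 - n^{-1} tr(M))^2."""
    n = len(Y)
    res = Y - M @ Y
    return (res @ res / n) / (1 - np.trace(M) / n) ** 2

rng = np.random.default_rng(0)
n, d = 60, 8
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

# Ridge smoother matrix M_lambda = X (X'X + lambda I)^{-1} X'; pick lambda by GCV.
scores = {lam: gcv_score(X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T), Y)
          for lam in (1e-3, 1e-1, 1.0, 10.0, 100.0)}
print(min(scores, key=scores.get), scores)
```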
Analytic Approximation. When CV is used for selecting among linear models, Shao (1993) proposed an analytic approximation to LPO with $p \sim n$, which is called APCV.

LOO bootstrap and .632 bootstrap. The bootstrap is often used for stabilizing an estimator or an algorithm, replacing $\mathcal{A}(D_n)$ by the average of $\mathcal{A}(D_n^\star)$ over several bootstrap resamples $D_n^\star$. This idea was applied by Efron (1983) to the LOO estimator of the risk, leading to the LOO bootstrap. Noting that the LOO bootstrap was biased, Efron (1983) gave a heuristic argument leading to the .632 bootstrap estimator of the risk, later modified into the .632+ bootstrap by Efron and Tibshirani (1997). The main drawback of these procedures is the weakness of their theoretical justifications. Only empirical studies have supported the good behaviour of the .632+ bootstrap (Efron and Tibshirani, 1997; Molinaro et al., 2005).

4.4 Historical remarks

Simple validation, or hold-out, was the first CV-like procedure. It was introduced in the psychology area (Larson, 1931) from the need for a reliable alternative to the resubstitution error, as illustrated by Anderson et al. (1972). The hold-out was used by Herzberg (1969) for assessing the quality of predictors. The problem of choosing the training set was first considered by Stone (1974), where controllable and uncontrollable data splits were distinguished; an instance of uncontrollable division can be found in the book by Simon (1971).

A primitive LOO procedure was used by Hills (1966) and by Lachenbruch and Mickey (1968) for evaluating the error rate of a prediction rule, and a primitive formulation of LOO can be found in a paper by Mosteller and Tukey (1968). Nevertheless, LOO was actually introduced independently by Stone (1974), by Allen (1974) and by Geisser (1975). The relationship between LOO and the jackknife (Quenouille, 1949), which both rely on the idea of removing one observation from the sample, has been discussed by Stone (1974), for instance.

The hold-out and CV were originally used only for estimating the risk of an algorithm. The idea of using CV for model selection arose in the discussion of a paper by Efron and Morris (1973) and in a paper by Geisser (1974). The first author to study LOO as a model selection procedure was Stone (1974), who proposed to use LOO again for estimating the risk of the selected model.

5 Statistical properties of cross-validation estimators of the risk

Understanding the behaviour of CV for model selection, which is the purpose of this survey, first requires analyzing the performance of CV as an estimator of the risk of a single algorithm. Two main properties of CV estimators of the risk are of particular interest: their bias and their variance.

5.1 Bias

Dealing with the bias incurred by CV estimates can be done with two strategies: evaluating the amount of bias in order to choose the least biased CV procedure, or correcting for this bias.

5.1.1 Theoretical assessment of the bias

The independence of the training and validation samples implies that for every algorithm $\mathcal{A}$ and any $I^{(t)} \subset \{1, \ldots, n\}$ with cardinality $n_t$,

$$\mathbb{E}\left[\widehat{L}^{\mathrm{HO}}\left(\mathcal{A}; D_n; I^{(t)}\right)\right] = \mathbb{E}\left[\gamma\left(\mathcal{A}(D_n^{(t)}); \xi\right)\right] = \mathbb{E}\left[\mathcal{L}_P(\mathcal{A}(D_{n_t}))\right].$$

Therefore, assuming that $\mathrm{Card}(I_j^{(t)}) = n_t$ for $j = 1, \ldots, B$, the expectation of the CV estimator of the risk only depends on $n_t$:

$$\mathbb{E}\left[\widehat{L}^{\mathrm{CV}}\left(\mathcal{A}; D_n; (I_j^{(t)})_{1 \le j \le B}\right)\right] = \mathbb{E}\left[\mathcal{L}_P(\mathcal{A}(D_{n_t}))\right]. \qquad (12)$$

In particular, (12) shows that the bias of the CV estimator of the risk of $\mathcal{A}$ is the difference between the risks of $\mathcal{A}$ computed with $n_t$ and with $n$ data points, respectively. Since $n_t < n$, the bias of CV is usually nonnegative, which can be proved rigorously when the risk of $\mathcal{A}$ is a decreasing function of $n$, that is, when $\mathcal{A}$ is a smart rule; note however that a classical algorithm such as the 1-nearest-neighbour classifier is not smart (Devroye et al., 1996, Section 6.8). Similarly, the bias of CV tends to decrease with $n_t$, which is rigorously true if $\mathcal{A}$ is smart.
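Equation (12) is easy to check by simulation. The toy sketch below is an illustration under our own assumptions: $\mathcal{A}$ is the empirical mean of an $\mathcal{N}(0,1)$ sample and $\gamma(t;\xi) = (t-\xi)^2$, so that $\mathbb{E}[\mathcal{L}_P(\mathcal{A}(D_m))] = 1 + 1/m$. The average of hold-out estimates with $n_t = n/2$ matches the risk at $n_t$ rather than at $n$, exhibiting the nonnegative bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_t, reps = 100, 50, 20000

ho = np.empty(reps)
for r in range(reps):
    data = rng.normal(size=n)               # xi_i ~ N(0, 1), target s = 0
    t = data[:n_t].mean()                   # A(D_n^(t)): mean of the training half
    ho[r] = np.mean((t - data[n_t:]) ** 2)  # hold-out estimate (8)

# E[hold-out] = E[L_P(A(D_{n_t}))] = 1 + 1/n_t, larger than 1 + 1/n = E[L_P(A(D_n))].
print(ho.mean(), 1 + 1 / n_t, 1 + 1 / n)    # ~1.020 vs 1.020 vs 1.010
```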
More precisely, (12) has led to several results on the bias of CV, which can be split into three main categories: asymptotic results ($\mathcal{A}$ is fixed and the sample size $n$ tends to infinity), non-asymptotic results (where $\mathcal{A}$ is allowed to make use of a number of parameters growing with $n$, say $n^{1/2}$, as is frequent in model selection), and empirical results. They are listed below by statistical framework.

Regression. The general behaviour of the bias of CV (positive, decreasing with $n_t$) is confirmed by several papers and for several CV estimators. For LPO, non-asymptotic expressions of its bias were proved by Celisse (2008b) for projection estimators, and by Arlot and Celisse (2009) for regressogram and kernel estimators when the design is fixed. For VFCV and RLT, an asymptotic expansion of their bias was given by Burman (1989) for least-squares estimators in linear regression, and extended to spline smoothing (Burman, 1990). Note finally that Efron (1986) proved non-asymptotic analytic expressions of the expectations of the LOO and GCV estimators of the risk in regression with binary data (see also Efron, 1983, for some explicit calculations).

Density estimation shows a similar picture. Non-asymptotic expressions of the bias of LPO estimators for kernel and projection estimators with the quadratic risk were proved by Celisse and Robin (2008) and by Celisse (2008a). Asymptotic expansions of the bias of the LOO estimator for histograms and kernel estimators were previously proved by Rudemo (1982); see Bowman (1984) for simulations. Hall (1987) derived similar results with the log-likelihood contrast for kernel estimators, and related the performance of LOO to the interaction between the kernel and the tails of the target density $s$.

Classification. For the simple problem of discriminating between two populations with shifted distributions, Davison and Hall (1992) compared the asymptotic bias of LOO and of the bootstrap, showing the superiority of LOO when the shift size is $n^{-1/2}$: as $n$ tends to infinity, the bias of LOO stays of order $n^{-1}$, whereas that of the bootstrap worsens to the order $n^{-1/2}$. On realistic synthetic and real biological data, Molinaro et al. (2005) compared the bias of LOO, VFCV and the .632+ bootstrap: the bias decreases with $n_t$, and is generally minimal for LOO. Nevertheless, the 10-fold CV bias is nearly minimal uniformly over their experiments. In the same experiments, the .632+ bootstrap exhibits the smallest bias for moderate sample sizes and small signal-to-noise ratios, but a much larger bias otherwise.

CV-calibrated algorithms. When a family of algorithms $(\mathcal{A}_\lambda)_{\lambda \in \Lambda}$ is given, and $\widehat{\lambda}$ is chosen by minimizing $\widehat{L}^{\mathrm{CV}}(\mathcal{A}_\lambda; D_n)$ over $\lambda$, $\widehat{L}^{\mathrm{CV}}(\mathcal{A}_{\widehat{\lambda}}; D_n)$ is biased for estimating the risk of $\mathcal{A}_{\widehat{\lambda}}(D_n)$, as reported from simulation experiments by Stone (1974) for the LOO, and by Jonathan et al. (2000) for VFCV in the variable selection setting. This bias is of a different nature compared to the previous frameworks. Indeed, $\widehat{L}^{\mathrm{CV}}(\mathcal{A}_{\widehat{\lambda}}; D_n)$ is biased simply because $\widehat{\lambda}$ was chosen using the same data as $\widehat{L}^{\mathrm{CV}}(\mathcal{A}_\lambda; D_n)$. This phenomenon is similar to the optimism of $\mathcal{L}_{P_n}(\widehat{s}(D_n))$ as an estimator of the loss of $\widehat{s}(D_n)$. The correct way of estimating the risk of $\mathcal{A}_{\widehat{\lambda}}(D_n)$ with CV is to consider the full algorithm $\mathcal{A}' : D_n \mapsto \mathcal{A}_{\widehat{\lambda}(D_n)}(D_n)$, and then to compute $\widehat{L}^{\mathrm{CV}}(\mathcal{A}'; D_n)$. The resulting procedure is called double cross by Stone (1974).

5.1.2 Correction of the bias

An alternative to choosing the CV estimator with the smallest bias is to correct for the bias of the CV estimator of the risk. Burman (1989, 1990) proposed a corrected VFCV estimator, defined by

$$\widehat{L}^{\mathrm{corrVF}}(\mathcal{A}; D_n) = \widehat{L}^{\mathrm{VF}}(\mathcal{A}; D_n) + \mathcal{L}_{P_n}\left(\mathcal{A}(D_n)\right) - \frac{1}{V}\sum_{j=1}^V \mathcal{L}_{P_n}\left(\mathcal{A}(D_n^{(-A_j)})\right),$$

and a corrected RLT estimator was defined similarly. Both estimators have been proved to be asymptotically unbiased for least-squares estimators in linear regression.
When the $A_j$'s have exactly the same size $n/V$, the corrected VFCV criterion is equal to the sum of the empirical risk and the $V$-fold penalty (Arlot, 2008), defined by

$$\mathrm{pen}_{\mathrm{VF}}(\mathcal{A}; D_n) = \frac{V - 1}{V}\sum_{j=1}^V \left[\mathcal{L}_{P_n}\left(\mathcal{A}(D_n^{(-A_j)})\right) - \mathcal{L}_{P_n^{(-A_j)}}\left(\mathcal{A}(D_n^{(-A_j)})\right)\right].$$

The $V$-fold penalized criterion was proved to be (almost) unbiased in the non-asymptotic framework for regressogram estimators.
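Here is a sketch of Burman's correction under the same generic conventions as the earlier CV sketches (the names are ours, and the toy usage is an assumption): the fold estimators are reused to form both the VFCV term of (11) and the correction term.

```python
import numpy as np

def corrected_vfold(algorithm, data, folds, gamma):
    """Corrected V-fold CV: the VF estimate (11), plus the full-sample empirical
    risk L_{P_n} of A(D_n), minus the average full-sample empirical risk of the
    fold estimators A(D_n^(-A_j))."""
    n, V = len(data), len(folds)
    emp_risk = lambda t: float(np.mean([gamma(t, xi) for xi in data]))  # L_{P_n}(t)
    vf = corr = 0.0
    for A_j in folds:
        block = set(A_j)
        t_j = algorithm([data[i] for i in range(n) if i not in block])
        vf += np.mean([gamma(t_j, data[i]) for i in A_j]) / V
        corr += emp_risk(t_j) / V
    return vf + emp_risk(algorithm(data)) - corr

# Toy usage with the empirical-mean algorithm and the squared contrast.
rng = np.random.default_rng(0)
data = list(rng.normal(size=60))
folds = [list(range(j, j + 20)) for j in (0, 20, 40)]
algo = lambda s: float(np.mean(s))
gamma = lambda t, xi: (t - xi) ** 2
print(corrected_vfold(algo, data, folds, gamma))
```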
5.2 Variance

CV estimators of the risk using training sets of the same size $n_t$ have the same bias, but they still behave quite differently; their variance $\mathrm{var}(\widehat{L}^{\mathrm{CV}}(\mathcal{A}; D_n; (I_j^{(t)})_{1 \le j \le B}))$ captures most of the information needed to explain these differences.

5.2.1 Variability factors

Assume that $\mathrm{Card}(I_j^{(t)}) = n_t$ for every $j$. The variance of CV results from the combination of several factors, in particular $(n_t, n_v)$ and $B$.

Influence of $(n_t, n_v)$. Let us consider the hold-out estimator of the risk. Following in particular Nadeau and Bengio (2003),

$$\mathrm{var}\left[\widehat{L}^{\mathrm{HO}}\left(\mathcal{A}; D_n; I^{(t)}\right)\right] = \mathbb{E}\left[\mathrm{var}\left(\mathcal{L}_{P_n^{(v)}}\left(\mathcal{A}(D_n^{(t)})\right) \,\Big|\, D_n^{(t)}\right)\right] + \mathrm{var}\left[\mathcal{L}_P\left(\mathcal{A}(D_{n_t})\right)\right] = \frac{1}{n_v}\mathbb{E}\left[\mathrm{var}\left(\gamma(\widehat{s}, \xi) \,\Big|\, \widehat{s} = \mathcal{A}(D_n^{(t)})\right)\right] + \mathrm{var}\left[\mathcal{L}_P\left(\mathcal{A}(D_{n_t})\right)\right]. \qquad (13)$$

The first term, proportional to $1/n_v$, shows that more data for validation decreases the variance of $\widehat{L}^{\mathrm{HO}}$, because it yields a better estimator of $\mathcal{L}_P(\mathcal{A}(D_n^{(t)}))$. The second term shows that the variance of $\widehat{L}^{\mathrm{HO}}$ also depends on the distribution of $\mathcal{L}_P(\mathcal{A}(D_n^{(t)}))$ around its expectation; in particular, it strongly depends on the stability of $\mathcal{A}$.

Stability and variance. When $\mathcal{A}$ is unstable, $\widehat{L}^{\mathrm{LOO}}(\mathcal{A})$ has often been pointed out as a variable estimator (Hastie et al., 2001, Section 7.10; Breiman, 1996). Conversely, this trend disappears when $\mathcal{A}$ is stable, as noticed by Molinaro et al. (2005) from a simulation experiment. The relation between the stability of $\mathcal{A}$ and the variance of $\widehat{L}^{\mathrm{CV}}(\mathcal{A})$ was pointed out by Devroye and Wagner (1979) in classification, through upper bounds on the variance of $\widehat{L}^{\mathrm{LOO}}(\mathcal{A})$. Bousquet and Elisseeff (2002) extended these results to the regression setting, and proved upper bounds on the maximal upward deviation of $\widehat{L}^{\mathrm{LOO}}(\mathcal{A})$. Note finally that several approaches based on the bootstrap have been proposed for reducing the variance of $\widehat{L}^{\mathrm{LOO}}(\mathcal{A})$, such as the LOO bootstrap, the .632 bootstrap and the .632+ bootstrap (Efron, 1983); see also Section 4.3.3.

Partial splitting and variance. When $(n_t, n_v)$ is fixed, the variability of CV tends to be larger for partial data splitting methods than for LPO. Indeed, having to choose $B < \binom{n}{n_t}$ subsets $(I_j^{(t)})_{1 \le j \le B}$ of $\{1, \ldots, n\}$, usually randomly, induces an additional variability compared to $\widehat{L}^{\mathrm{LPO}}$ with $p = n - n_t$. In the case of MCCV, this variability decreases like $B^{-1}$, since the $I_j^{(t)}$ are chosen independently. The dependence on $B$ is slightly different for other CV estimators such as RLT or VFCV, because the $I_j^{(t)}$ are not independent. In particular, the additional variability is maximal for the hold-out, and minimal (null) for LOO (if $n_t = n - 1$) and LPO (with $p = n - n_t$). Note that the dependence on $V$ for VFCV is more complex to evaluate, since $B$, $n_t$ and $n_v$ simultaneously vary with $V$. Nevertheless, a non-asymptotic theoretical quantification of this additional variability of VFCV has been obtained by Celisse and Robin (2008) in the density estimation framework (see also empirical considerations by Jonathan et al., 2000).

5.2.2 Theoretical assessment of the variance

Understanding precisely how $\mathrm{var}(\widehat{L}^{\mathrm{CV}}(\mathcal{A}))$ depends on the splitting scheme is complex in general, since $n_t$ and $n_v$ have a fixed sum $n$, and the number of splits $B$ is generally linked with $n_t$ (for instance, for LPO and VFCV). Furthermore, the variance of CV behaves quite differently in different frameworks, depending in particular on the stability of $\mathcal{A}$. The consequence is that contradictory results have been obtained in different frameworks, in particular on the value of $V$ for which the VFCV estimator of the risk has minimal variance (Burman, 1989; Hastie et al., 2001, Section 7.10). Despite the difficulty of the problem, the variance of several CV estimators of the risk has been assessed in several frameworks, as detailed below.

Regression. In the linear regression setting, Burman (1989) gave asymptotic expansions of the variance of the VFCV and RLT estimators of the risk with homoscedastic data. The variance of RLT decreases with $B$, and in the case of VFCV, in a particular setting,

$$\mathrm{var}\left(\widehat{L}^{\mathrm{VF}}(\mathcal{A})\right) = \frac{2\sigma^2}{n} + \frac{4\sigma^4}{n^2}\left(4 + \frac{4}{V-1} + \frac{2}{(V-1)^2} + \frac{1}{(V-1)^3}\right) + o\left(n^{-2}\right).$$

The asymptotic variance of the VFCV estimator of the risk decreases with $V$, implying that LOO asymptotically has the minimal variance.
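The $V$-dependent part of this expansion can be tabulated directly. The sketch below uses illustrative values of $n$ and $\sigma^2$ (our own assumptions); it shows the second-order term shrinking as $V$ grows, consistent with the LOO ($V = n$) having the smallest asymptotic variance.

```python
import numpy as np

def vf_variance_expansion(n, V, sigma2):
    """Burman's asymptotic expansion of var(L_VF) above (up to o(n^-2) terms)."""
    w = V - 1.0
    return (2 * sigma2 / n
            + (4 * sigma2 ** 2 / n ** 2) * (4 + 4 / w + 2 / w ** 2 + 1 / w ** 3))

for V in (2, 5, 10, 100, 1000):
    print(V, vf_variance_expansion(n=1000, V=V, sigma2=1.0))
```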
Furthermore, in the framework of density estimation with histograms, Celisse and Robin (2008) proposed an estimator of the variance of the LPO risk estimator; its accuracy is assessed through a concentration inequality. These results have recently been extended to projection estimators by Celisse (2008a).

6 Cross-validation for efficient model selection

This section tackles the properties of CV procedures for model selection when the goal is estimation (see Section 2.2).

6.1 Relationship between risk estimation and model selection

As shown in Section 3.1, minimizing an unbiased estimator of the risk leads to an efficient model selection procedure. One could conclude from this that the best CV procedure for estimation is the one with the smallest bias and variance (at least asymptotically), for instance, LOO in the least-squares regression framework (Burman, 1989).

Nevertheless, the best CV estimator of the risk is not necessarily the best model selection procedure. For instance, Breiman and Spector (1992) observed that, uniformly over the models, the best risk estimator is LOO, whereas 10-fold CV is more accurate for model selection. Three main reasons for such a difference can be invoked. First, the asymptotic framework ($\mathcal{A}$ fixed, $n \to \infty$) may not apply to models close to the oracle, which typically has a dimension growing with $n$ when $s$ does not belong to any model. Second, as explained in Section 3.2, estimating the risk of each model with some bias can be beneficial and compensate the effect of a large variance, in particular when the signal-to-noise ratio is small. Third, for model selection, what matters is not that every estimate of the risk has small bias and variance, but rather that

$$\operatorname{sign}\left( \mathrm{crit}(m_1) - \mathrm{crit}(m_2) \right) = \operatorname{sign}\left( \mathcal{L}_P(\widehat{s}_{m_1}) - \mathcal{L}_P(\widehat{s}_{m_2}) \right)$$

holds with the largest possible probability for models $m_1, m_2$ near the oracle. Therefore, specific studies are required to evaluate the performance of the various CV procedures in terms of model selection efficiency. In most frameworks, the model selection performance follows directly from the properties of CV as an estimator of the risk, but not always.

6.2 The global picture

Let us start with the classification of model selection procedures made by Shao (1997) in the linear regression framework, since it gives a good idea of the performance of CV procedures for model selection in general. Typically, the efficiency of CV only depends on the asymptotics of $n_t/n$:

• When $n_t \sim n$, CV is asymptotically equivalent to Mallows' $C_p$, hence asymptotically optimal.
• When $n_t \sim \lambda n$ with $\lambda \in (0, 1)$, CV is asymptotically equivalent to $\mathrm{GIC}_\kappa$ with $\kappa = 1 + \lambda^{-1}$, which is defined as AIC with a penalty multiplied by $\kappa/2$. Hence, such CV procedures overpenalize by a factor $(1 + \lambda)/(2\lambda) > 1$.

The above results were proved by Shao (1997) for LPO (see also Li, 1987, for the LOO); they also hold for RLT when $B \gg n^2$, since RLT is then equivalent to LPO (Zhang, 1993). In a general statistical framework, the model selection performance of MCCV, VFCV, LOO, the LOO bootstrap, and the .632 bootstrap for selection among minimum-contrast estimators was studied in a series of papers (van der Laan and Dudoit, 2003; van der Laan et al., 2004, 2006; van der Vaart et al., 2006); these results apply in particular to least-squares regression and density estimation. It turns out that under mild conditions, an oracle-type inequality is proved, showing that up to a multiplicative factor $C_n \to 1$, the risk of CV is smaller than the minimum of the risks of the models with a sample size $n_t$. In particular, in most frameworks, this implies the asymptotic optimality of CV as soon as $n_t \sim n$. When $n_t \sim \lambda n$ with $\lambda \in (0, 1)$, this naturally generalizes Shao's results.
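Before turning to specific frameworks, the generic use of CV for model selection discussed in Sections 6.1 and 6.2 can be summarized in a few lines of code. The sketch below (illustrative only; the model family, a polynomial degree, and all names are our choices) selects the model minimizing the V-fold CV estimate of the risk.

```python
import numpy as np

rng = np.random.default_rng(2)
n, V = 200, 5
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)

# Use the same V folds for every candidate model, as VFCV prescribes.
folds = np.array_split(rng.permutation(n), V)

def vfcv_risk(degree):
    """V-fold CV estimate of the quadratic risk of a polynomial fit."""
    scores = []
    for val in folds:
        train = np.setdiff1d(np.arange(n), val)
        coefs = np.polyfit(x[train], y[train], degree)
        scores.append(np.mean((np.polyval(coefs, x[val]) - y[val]) ** 2))
    return np.mean(scores)

crit = {d: vfcv_risk(d) for d in range(1, 11)}
best = min(crit, key=crit.get)  # the selected model minimizes the CV criterion
print(best, crit[best])
```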
6.3 Results in various frameworks

This section gathers results about the model selection performance of CV when the goal is estimation, in various frameworks. Note that model selection is understood here in a broad sense, including in particular bandwidth choice for kernel estimators.

Regression. First, the results of Section 6.2 suggest that CV is suboptimal when $n_t$ is not asymptotically equivalent to $n$. This fact has been proved rigorously for VFCV when $V = O(1)$ with regressograms (Arlot, 2008): with large probability, the risk of the model selected by VFCV is larger than $1 + \kappa(V)$ times the risk of the oracle, with $\kappa(V) > 0$ for every fixed $V$. Note however that the best $V$ for VFCV is not the largest one in every regression framework, as shown empirically in linear regression (Breiman and Spector, 1992; Herzberg and Tsukanov, 1986); Breiman (1996) proposed to explain this phenomenon by relating the stability of the candidate algorithms to the model selection performance of LOO in various regression frameworks.

Second, the universality of CV has been confirmed by showing that it naturally adapts to heteroscedasticity of the data when selecting among regressograms. Despite its suboptimality, VFCV with $V = O(1)$ satisfies a non-asymptotic oracle inequality with constant $C > 1$ (Arlot, 2008). Furthermore, $V$-fold penalization (which often coincides with corrected VFCV, see Section 5.1.2) satisfies a non-asymptotic oracle inequality with $C_n \to 1$ as $n \to +\infty$, both when $V = O(1)$ (Arlot, 2008) and when $V = n$ (Arlot, 2008a). Since $n$-fold penalization is very close to LOO, this suggests that LOO is also asymptotically optimal with heteroscedastic data. Simulation experiments in the context of change-point detection confirmed that CV adapts well to heteroscedasticity, contrary to the usual model selection procedures in this framework (Arlot and Celisse, 2009).

The performance of CV has also been assessed for other kinds of estimators in regression. For choosing the number of knots in spline smoothing, Burman (1990) proved that corrected versions of VFCV and RLT are asymptotically optimal provided $n/(B n_v) = O(1)$. Furthermore, in kernel regression, several CV methods have been compared to GCV by Härdle et al. (1988) and by Girard (1998); the conclusion is that GCV and related criteria are computationally more efficient than MCCV or RLT, for a similar statistical performance. Finally, note that asymptotic results about CV in regression have been proved by Györfi et al. (2002), and an oracle inequality with constant $C > 1$ has been proved by Wegkamp (2003) for the hold-out, with least-squares estimators.

Density estimation. CV performs similarly to the regression case when selecting among least-squares estimators (van der Laan et al., 2004): it yields a risk smaller than the minimum of the risks with a sample size $n_t$. In particular, non-asymptotic oracle inequalities with constant $C > 1$ have been proved by Celisse (2008b) for the LPO when $p/n \in [a, b]$, for some $0 < a < b < 1$.
The performance of CV for selecting the bandwidth of kernel density estimators has been studied in several papers. With the least-squares contrast, the efficiency of LOO was proved by Hall (1983) and generalized to the multivariate framework by Stone (1984); an oracle inequality asymptotically leading to efficiency was recently proved by Dalelane (2005). With the Kullback-Leibler divergence, CV can run into trouble when performing model selection (see also Schuster and Gregory, 1981; Chow et al., 1987). The influence of the tails of the target $s$ was studied by Hall (1987), who gave conditions under which CV is efficient and the chosen bandwidth is first-order optimal.

Classification. In the framework of binary classification by intervals (that is, with $\mathcal{X} = [0, 1]$ and piecewise constant classifiers), Kearns et al. (1997) proved an oracle inequality for the hold-out. Furthermore, empirical experiments show that CV (almost) always yields the best performance, compared to deterministic penalties (Kearns et al., 1997). On the contrary, simulation experiments by Bartlett et al. (2002) in the same setting showed that random penalties such as the Rademacher complexity and the maximal discrepancy usually perform much better than the hold-out, which turns out to be more variable.

Nevertheless, the hold-out still enjoys quite good theoretical properties: it was proved to adapt to the margin condition by Blanchard and Massart (2006), a property nearly unachievable with the usual model selection procedures (see also Massart, 2007, Section 8.5). This suggests that CV procedures are naturally adaptive to several unknown properties of the data in the statistical learning framework. The performance of the LOO in binary classification was related to the stability of the candidate algorithms by Kearns and Ron (1999); they proved oracle-type inequalities called sanity-check bounds, describing the worst-case performance of LOO (see also Bousquet and Elisseeff, 2002). An experimental comparison of several CV methods and bootstrap-based CV (in particular the .632+ bootstrap) in classification can also be found in the papers by Efron (1986) and Efron and Tibshirani (1997).

7 Cross-validation for identification

Let us now focus on model selection when the goal is to identify the true model $S_{m_0}$, as described in Section 2.3. In this framework, asymptotic optimality is replaced by (model) consistency, that is,

$$P\left( \widehat{m}(D_n) = m_0 \right) \xrightarrow[n \to \infty]{} 1.$$

Classical model selection procedures built for identification, such as BIC, are described in Section 3.3.

7.1 General conditions towards model consistency

At first sight, it may seem strange to use CV for identification: LOO, the pioneering CV procedure, is actually closely related to the unbiased risk estimation principle, which is efficient only when the goal is estimation. Furthermore, estimation and identification are somewhat contradictory goals, as explained in Section 2.4.

This intuition about the inconsistency of some CV procedures is confirmed by several theoretical results. Shao (1993) proved that several CV methods are inconsistent for variable selection in linear regression: LOO, LPO, and BICV when $\liminf_{n \to \infty} (n_t/n) > 0$. Even if these CV methods asymptotically select all the true variables with probability 1, the probability that they select too many variables does not tend to zero.
More generally, Shao (1997) proved that CV procedures behave asymptotically like $\mathrm{GIC}_{\lambda_n}$ with $\lambda_n = 1 + n/n_t$, which leads to inconsistency as soon as $n/n_t = O(1)$.

In the context of ordered variable selection in linear regression, Zhang (1993) computed the asymptotic value of the probability of selecting the true model for several CV procedures, and numerically compared the values of this probability for the same CV procedures in a specific example. For LPO with $p/n \to \lambda \in (0, 1)$ as $n$ tends to $+\infty$, $P(\widehat{m} = m_0)$ increases with $\lambda$. The result is slightly different for VFCV: $P(\widehat{m} = m_0)$ increases with $V$ (hence, it is maximal for the LOO, which is the worst case of LPO). The variability induced by the number $V$ of splits seems to be more important here than the bias of VFCV. Nevertheless, $P(\widehat{m} = m_0)$ is almost constant between $V = 10$ and $V = n$, so that taking $V > 10$ is not advised, for computational reasons.

These results suggest that if the training sample size $n_t$ is negligible compared with $n$, then model consistency could be obtained. This has been confirmed theoretically by Shao (1993, 1997) for the variable selection problem in linear regression: CV is consistent when $n \gg n_t \to \infty$, in particular RLT, BICV (defined in Section 4.3.2), and LPO with $p = p_n \sim n$ and $n - p_n \to \infty$. Therefore, when the goal is to identify the true model, a larger proportion of the data should be put in the validation set in order to improve the performance. This phenomenon is somewhat related to the cross-validation paradox (Yang, 2006).

7.2 Refined analysis for the algorithm selection problem

The behaviour of CV for identification is better understood by considering a more general framework, where the goal is to select, among statistical algorithms, the one with the fastest convergence rate. Yang (2006, 2007) considered this problem for two candidate algorithms (or, more generally, any finite number of algorithms). Let us mention here that Stone (1977) considered a few specific examples of this problem, and showed that LOO can be inconsistent for choosing the best among two good estimators.

The conclusion of Yang's papers is that the sufficient condition on $n_t$ for the consistency in selection of CV strongly depends on the convergence rates $(r_{n,i})_{i=1,2}$ of the candidate algorithms. Let us assume that $r_{n,1}$ and $r_{n,2}$ differ at least by a multiplicative constant $C > 1$. Then, in the regression framework, if the risk of $\widehat{s}_i$ is measured by $\mathbb{E} \lVert \widehat{s}_i - s \rVert^2$, Yang (2007) proved that the hold-out, VFCV, RLT, and LPO with voting (CV-v, see Section 4.2.2) are consistent in selection if

$$n_v, n_t \to \infty \quad \text{and} \quad \sqrt{n_v}\, \max_i r_{n_t,i} \to \infty, \qquad (14)$$

under some conditions on $\lVert \widehat{s}_i - s \rVert_p$ for $p = 2, 4, \infty$. In the classification framework, if the risk of $\widehat{s}_i$ is measured by $P(\widehat{s}_i \neq s)$, Yang (2006) proved the same consistency result for CV-v under the condition

$$n_v, n_t \to \infty \quad \text{and} \quad \frac{n_v \max_i r_{n_t,i}^2}{s_{n_t}} \to \infty, \qquad (15)$$

where $s_n$ is the convergence rate of $P(\widehat{s}_1(D_n) \neq \widehat{s}_2(D_n))$. Intuitively, consistency holds as soon as the uncertainty of each estimate of the risk (roughly proportional to $n_v^{-1/2}$) is negligible compared with the risk gap $\lvert r_{n_t,1} - r_{n_t,2} \rvert$ (which is of the same order as $\max_i r_{n_t,i}$). This condition holds either when at least one of the algorithms converges at a non-parametric rate, or when $n_t \ll n$, which artificially widens the risk gap.
Empirical results in the same direction as Yang's were obtained by Dietterich (1998) and by Alpaydin (1999), leading to the advice that $V = 2$ is the best choice when VFCV is used for comparing two learning procedures. See also the results of Nadeau and Bengio (2003) on CV considered as a testing procedure for comparing two candidate algorithms.

The sufficient conditions (14) and (15) can be simplified depending on $\max_i r_{n,i}$, so that the ability of CV to distinguish between two algorithms depends on their convergence rates. On the one hand, if $\max_i r_{n,i} \propto n^{-1/2}$, then (14) or (15) only holds when $n_v \gg n_t \to \infty$ (under some conditions on $s_n$ in classification). Therefore, the cross-validation paradox holds for comparing algorithms converging at the parametric rate (model selection when a true model exists being only a particular case). Note that possibly stronger conditions can be required in classification, where algorithms can converge at fast rates, between $n^{-1}$ and $n^{-1/2}$. On the other hand, (14) and (15) are milder conditions when $\max_i r_{n,i} \gg n^{-1/2}$: they are implied by $n_t/n_v = O(1)$, and they even allow $n_t \sim n$ (under some conditions on $s_n$ in classification). Therefore, non-parametric algorithms can be compared by more usual CV procedures ($n_t > n/2$), even if LOO is still excluded by conditions (14) and (15). Note that according to simulation experiments, CV with averaging (that is, CV as usual) and CV with voting are equivalent at first order but not at second order, so that they can differ when $n$ is small (Yang, 2007).

8 Specificities of some frameworks

Originally, the CV principle was proposed for i.i.d. observations and usual contrasts such as least-squares and log-likelihood. Therefore, CV procedures may have to be modified in other specific frameworks, such as estimation in the presence of outliers or with dependent data.

8.1 Density estimation

In the density estimation framework, some specific modifications of CV have been proposed. First, Hall et al. (1992) defined the smoothed CV, which consists in pre-smoothing the data before using CV, an idea related to the smoothed bootstrap. They proved that smoothed CV yields an excellent asymptotic model selection performance under various smoothness conditions on the density. Second, when the goal is to estimate the density at one point (and not globally), Hall and Schucany (1989) proposed a local version of CV and proved its asymptotic optimality.

8.2 Robustness to outliers

In the presence of outliers in regression, Leung (2005) studied how CV must be modified to obtain both asymptotic efficiency and a consistent bandwidth estimator (see also Leung et al., 1993). Two changes are possible to achieve robustness: choosing a robust regressor, or choosing a robust loss function. In the presence of outliers, classical CV with a non-robust loss function was shown to fail by Härdle (1984). Leung (2005) described a CV procedure based on robust losses such as the $L_1$ and Huber (Huber, 1964) losses. The same strategy remains applicable to other setups, such as the linear models of Ronchetti et al. (1997).

8.3 Time series and dependent observations

As explained in Section 4.1, CV is built upon the heuristics that part of the sample (the validation set) can play the role of new data with respect to the rest of the sample (the training set), where "new" means that the validation set is independent from the training set, with the same distribution.
Therefore, when the data $\xi_1, \ldots, \xi_n$ are not independent, CV must be modified, like other model selection procedures (for non-parametric regression with dependent data, see the review by Opsomer et al., 2001).

Let us first consider the statistical framework of Section 1 with $\xi_1, \ldots, \xi_n$ identically distributed but not independent. Then, when for instance the data are positively correlated, Hart and Wehrly (1986) proved that CV overfits when choosing the bandwidth of a kernel estimator in regression (see also Chu and Marron, 1991; Opsomer et al., 2001).

The main approach used in the literature for solving this issue is to choose $I^{(t)}$ and $I^{(v)}$ such that $\min_{i \in I^{(t)},\, j \in I^{(v)}} \lvert i - j \rvert > h > 0$, where $h$ controls the distance beyond which observations $\xi_i$ and $\xi_j$ can be treated as independent. For instance, the LOO can be changed into: $I^{(v)} = \{J\}$, where $J$ is chosen uniformly in $\{1, \ldots, n\}$, and $I^{(t)} = \{1, \ldots, J - h - 1, J + h + 1, \ldots, n\}$, a method called modified CV by Chu and Marron (1991) in the context of bandwidth selection (a minimal sketch of this splitting scheme is given at the end of this subsection). Then, for short-range dependences, $\xi_i$ is almost independent from $\xi_j$ when $\lvert i - j \rvert > h$ is large enough, so that $(\xi_j)_{j \in I^{(t)}}$ is almost independent from $(\xi_j)_{j \in I^{(v)}}$. Several asymptotic optimality results have been proved for modified CV, for instance by Hart and Vieu (1990) for bandwidth choice in kernel density estimation when the data are $\alpha$-mixing (hence, with a short-range dependence structure) and $h = h_n \to \infty$ not too fast. Note that modified CV also enjoys some asymptotic optimality results with long-range dependences, as proved by Hall et al. (1995), even if an alternative block bootstrap method seems more appropriate in such a framework.

Several alternatives to modified CV have also been proposed. The $h$-block CV (Burman et al., 1994) is modified CV plus a corrective term, similarly to the bias-corrected CV of Burman (1989) (see Section 5.1). Simulation experiments in several (short-range) dependent frameworks show that this corrective term matters when $h/n$ is not small, in particular when $n$ is small. The partitioned CV was proposed by Chu and Marron (1991) for bandwidth selection: an integer $g > 0$ is chosen, a bandwidth $\widehat{\lambda}_k$ is chosen by CV based upon the subsample $(\xi_{k + gj})_{j \geq 0}$ for each $k = 1, \ldots, g$, and the selected bandwidth is a combination of the $(\widehat{\lambda}_k)$. When a parametric model is available for the dependence structure, Hart (1994) proposed the time series CV.

An important framework where the data often are dependent is time-series analysis, in particular when the goal is to predict the next observation $\xi_{n+1}$ from the past $\xi_1, \ldots, \xi_n$. When the data are stationary, $h$-block CV and similar approaches can be used to deal with (short-range) dependences. Nevertheless, Burman and Nolan (1992) proved, in a specific framework, that unaltered CV is asymptotically optimal when $\xi_1, \ldots, \xi_n$ is a stationary Markov process. On the contrary, using CV for non-stationary time series is quite a difficult problem. The only reasonable approach in general is the hold-out, that is, $I^{(t)} = \{1, \ldots, m\}$ and $I^{(v)} = \{m + 1, \ldots, n\}$ for some deterministic $m$. Each model is first trained with $(\xi_j)_{j \in I^{(t)}}$. Then, it is used for predicting successively $\xi_{m+1}$ from $(\xi_j)_{j \leq m}$, $\xi_{m+2}$ from $(\xi_j)_{j \leq m+1}$, and so on. The model with the smallest average error for predicting $(\xi_j)_{j \in I^{(v)}}$ from the past is chosen.
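As referenced above, here is a minimal sketch of the splitting scheme behind modified CV (ours, not Chu and Marron's exact implementation): each validation point is held out together with a gap of $h$ temporal neighbours, which are removed from the training set.

```python
import numpy as np

def modified_cv_splits(n, h):
    """Yield (train, val) index sets with a gap h around each validation point."""
    for j in range(n):
        val = np.array([j])
        keep = np.abs(np.arange(n) - j) > h   # enforce |i - j| > h
        yield np.flatnonzero(keep), val

# Example usage: loop over the splits, fitting on `train`, scoring on `val`.
for train, val in modified_cv_splits(n=10, h=2):
    pass  # fit the estimator on `train`, evaluate it on `val`
```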
8.4 Large number of models

As mentioned in Section 3, model selection procedures that rely on unbiased estimation of the risk of each model fail when, in particular, the number of models grows exponentially with $n$ (Birgé and Massart, 2007). Therefore, CV cannot be used directly in this situation, except maybe with $n_t \ll n$, provided $n_t$ is well chosen (see Section 6 and Celisse, 2008b, Chapter 6).

For least-squares regression with homoscedastic data, Wegkamp (2003) proposed to add to the hold-out estimator of the risk a penalty term depending on the number of models. This method is proved to satisfy a non-asymptotic oracle inequality with leading constant $C > 1$.

Another general approach was proposed by Arlot and Celisse (2009) in the context of multiple change-point detection. The idea is to perform model selection in two steps. First, gather the models $(S_m)_{m \in \mathcal{M}_n}$ into meta-models $(\widetilde{S}_D)_{D \in \mathcal{D}_n}$, where $\mathcal{D}_n$ denotes a set of indices such that $\mathrm{Card}(\mathcal{D}_n)$ grows at most polynomially with $n$; inside each meta-model $\widetilde{S}_D = \bigcup_{m \in \mathcal{M}_n(D)} S_m$, an estimator $\widehat{s}_D$ is chosen from the data by optimizing a given criterion, for instance the empirical contrast $\mathcal{L}_{P_n}(t)$, but other criteria can be used. Second, use CV for choosing among $(\widehat{s}_D)_{D \in \mathcal{D}_n}$. Simulation experiments show that this simple trick automatically takes into account the cardinality of $\mathcal{M}_n$, even when the data are heteroscedastic, contrary to other model selection procedures built for exponential collections of models, which all assume homoscedasticity of the data.

9 Closed-form formulas and fast computation

Resampling strategies like CV are known to be time-consuming. A naive implementation of CV has a computational complexity of $B$ times the complexity of training the algorithm $\mathcal{A}$ once, which is usually intractable for LPO, even with $p = 1$; the computational cost of VFCV or RLT can also be quite high when $B > 10$ in many practical problems. Nevertheless, closed-form formulas for CV estimators of the risk can be obtained in several frameworks, which greatly decreases the computational cost of CV.

In density estimation, closed-form formulas were originally derived by Rudemo (1982) and by Bowman (1984) for the LOO risk estimator of histogram and kernel estimators. These results have recently been extended by Celisse and Robin (2008) to the LPO risk estimator with the quadratic loss. Similar results are more generally available for projection estimators, as settled by Celisse (2008a). Intuitively, such formulas can be obtained provided the number $N$ of distinct values taken by the $B = \binom{n}{n_v}$ hold-out estimators of the risk, corresponding to the different data splittings, is at most polynomial in the sample size.

For least-squares estimators in linear regression, Zhang (1993) proved a closed-form formula for the LOO estimator of the risk. Similar results have been obtained by Wahba (1975, 1977) and by Craven and Wahba (1979) in the spline smoothing context as well. These papers led in particular to the definition of GCV (see Section 4.3.3) and related procedures, which are often used instead of a naive implementation of CV because of their small computational cost, as emphasized by Girard (1998). Closed-form formulas for the LPO estimator of the risk were also obtained by Celisse (2008b) in regression for kernel and projection estimators, in particular for regressograms.
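As a concrete instance, for a regular histogram with bin width $h$ and bin counts $N_j$, the LOO least-squares CV criterion admits the closed form $2/((n-1)h) - (n+1)\sum_j N_j^2 / (n^2(n-1)h)$, a standard identity consistent with the histogram results of Rudemo (1982) and Bowman (1984). The sketch below (ours, illustrative) evaluates the whole LOO criterion from the bin counts alone, with no refitting.

```python
import numpy as np

def histogram_loo_lscv(data, n_bins, lo, hi):
    """Exact LOO least-squares CV criterion for a regular histogram.

    Observations outside [lo, hi] are ignored here for simplicity.
    """
    n = len(data)
    h = (hi - lo) / n_bins                       # bin width
    counts, _ = np.histogram(data, bins=n_bins, range=(lo, hi))
    s = np.sum(counts.astype(float) ** 2)        # sum_j N_j^2
    return 2.0 / ((n - 1) * h) - (n + 1) * s / (n**2 * (n - 1) * h)

rng = np.random.default_rng(4)
data = rng.normal(size=500)
scores = {k: histogram_loo_lscv(data, k, -4.0, 4.0) for k in range(5, 60, 5)}
print(min(scores, key=scores.get))               # selected number of bins
```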
An important property of the closed-form formulas above is their additivity: for a regressogram associated with a partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$, the LPO estimator of the risk can be written as a sum over $\lambda \in \Lambda_m$ of terms which only depend on the observations $(X_i, Y_i)$ such that $X_i \in I_\lambda$. Therefore, dynamic programming (Bellman and Dreyfus, 1962) can be used for minimizing the LPO estimator of the risk over the set of partitions of $\mathcal{X}$ into $D$ pieces. As an illustration, Arlot and Celisse (2009) successfully applied this strategy in the change-point detection framework. Note that the same idea can be used with VFCV or RLT, but at a larger computational cost, since no closed-form formulas are available for these CV methods.

Finally, in frameworks where no closed-form formula can be proved, efficient algorithms exist that avoid recomputing $\widehat{L}_{\mathrm{HO}}(\mathcal{A}; D_n; I_j^{(t)})$ from scratch for each data splitting $I_j^{(t)}$. These algorithms rely on updating formulas, such as the ones by Ripley (1996) for LOO in linear and quadratic discriminant analysis; this approach makes LOO as expensive to compute as the empirical risk, and no more. Very similar formulas are also available for LOO and the $k$-nearest-neighbours estimator in classification (Daudin and Mary-Huard, 2008).

10 Conclusion: which cross-validation method for which problem?

This conclusion collects a few guidelines aimed at helping CV users, first, to interpret the results of CV and, second, to use CV appropriately in each specific problem.

10.1 The general picture

Drawing a general conclusion on CV methods is an impossible task, because of the variety of frameworks where CV can be used, which induces a variety of behaviours of CV. Nevertheless, we can still point out the three main criteria to take into account when choosing a CV method for a particular model selection problem:

• Bias: CV roughly estimates the risk of a model with a sample size $n_t < n$ (see Section 5.1). Usually, this implies that CV overestimates the variance term compared to the bias term in the bias-variance decomposition (2) with sample size $n$. When the goal is estimation and the signal-to-noise ratio (SNR) is large, the smaller the bias, usually the better, which is obtained by taking $n_t \sim n$; otherwise, CV can be asymptotically suboptimal. Nevertheless, when the goal is estimation and the SNR is small, keeping a small upward bias on the variance term often improves the performance, which is obtained by taking $n_t \sim \kappa n$ with $\kappa \in (0, 1)$; see Section 6. When the goal is identification, a large bias is often needed, which is obtained by taking $n_t \ll n$; depending on the framework, larger values of $n_t$ can also lead to model consistency, see Section 7.

• Variability: The variance of the CV estimator of the risk is usually a decreasing function of the number $B$ of splits, for a fixed training size. When the number of splits is fixed, the variability of CV also depends on the training sample size $n_t$; usually, CV is more variable when $n_t$ is closer to $n$. However, when $B$ is linked with $n_t$ (as for VFCV or LPO), the variability of CV must be quantified precisely, which has been done in few frameworks only. The only general conclusion on this point is that the CV method with minimal variability seems strongly framework-dependent; see Section 5.2 for details.
• Computational complexity: Unless closed-form formulas or analytic approximations are available (see Section 9), the complexity of CV is roughly proportional to the number of data splits: 1 for the hold-out, $V$ for VFCV, $B$ for RLT or MCCV, $n$ for LOO, and $\binom{n}{p}$ for LPO.

The optimal trade-off between these three factors can be different for each problem, depending for instance on the computational complexity of each estimator, on specificities of the framework considered, and on the final user's trade-off between statistical performance and computational cost. Therefore, no optimal CV method can be pointed out before the final user's preferences have been taken into account. Nevertheless, in density estimation, closed-form expressions of the LPO estimator have been derived by Celisse and Robin (2008) for histogram and kernel estimators, and by Celisse (2008a) for projection estimators. These expressions make it possible to perform LPO without additional computational cost, which reduces the aforementioned three-way trade-off to the easier bias-variability trade-off. In particular, Celisse and Robin (2008) proposed to choose $p$ for LPO by minimizing a criterion defined as the sum of a squared bias term and a variance term (see also Politis et al., 1999, Chapter 9).

10.2 How should the splits be chosen?

For the hold-out, VFCV, and RLT, an important question is how to choose the sequence of data splits. First, should this step be random and independent from $D_n$, or should it take into account some features of the problem or of the data?

It is often recommended to take the structure of the data into account when choosing the splits. If the data are stratified, the proportions of the different strata should be (approximately) the same in each training and validation sample as in the whole sample. Besides, the training samples should be chosen so that $\widehat{s}_m(D_n^{(t)})$ is well defined for every training set; in the regressogram case, this led Arlot (2008) and Arlot and Celisse (2009) to choose the splitting scheme carefully. In supervised classification, practitioners usually choose the splits so that the proportion of each class in every validation sample matches the proportion in the whole sample (a minimal sketch of such a stratified splitting scheme is given at the end of this subsection). Nevertheless, Breiman and Spector (1992) made simulation experiments in regression comparing several splitting strategies, and no significant improvement was reported from taking the stratification of the data into account when choosing the splits.

Another question related to the choice of $(I_j^{(t)})_{1 \leq j \leq B}$ is whether the $I_j^{(t)}$ should be independent (like MCCV), slightly dependent (like RLT), or strongly dependent (like VFCV). It seems intuitive that giving similar roles to all data points in the $B$ training and validation tasks should yield more reliable results than other methods. This intuition may explain why VFCV is much more used than RLT or MCCV. Similarly, Shao (1993) proposed a CV method called BICV, where every point and every pair of points appear in the same number of splits; see Section 4.3.2. Nevertheless, most recent theoretical results on the various CV procedures are not accurate enough to distinguish which one may be the best splitting strategy: this remains a widely open theoretical question.

Note finally that the additional variability due to the choice of a sequence of data splits was quantified empirically by Jonathan et al. (2000) and theoretically by Celisse and Robin (2008) for VFCV.
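As referenced above, the class-stratified splitting used by practitioners can be implemented in a few lines. The following sketch (ours, illustrative) builds $V$ validation sets whose class proportions approximately match those of the whole sample.

```python
import numpy as np

def stratified_folds(labels, V, rng):
    """Return a list of V validation index arrays, stratified by class.

    Note: np.array_split puts the larger chunks first within each class; a
    more careful implementation would rotate the remainder across folds.
    """
    folds = [[] for _ in range(V)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for j, chunk in enumerate(np.array_split(idx, V)):
            folds[j].extend(chunk)
    return [np.sort(np.array(f)) for f in folds]

rng = np.random.default_rng(5)
labels = rng.choice([0, 1], size=100, p=[0.8, 0.2])
for val in stratified_folds(labels, V=5, rng=rng):
    print(len(val), np.mean(labels[val]).round(2))  # about 0.2 ones per fold
```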
10.3 V-fold cross-validation

VFCV is certainly the most popular CV procedure, in particular because of its mild computational cost. Nevertheless, the question of choosing $V$ remains widely open, even if some indications towards an appropriate choice can be given.

A specific feature of VFCV (shared by the exhaustive strategies) is that choosing $V$ uniquely determines the size of the training set, $n_t = n(V-1)/V$, and the number of splits, $B = V$, hence the computational cost. Contradictory phenomena then occur. On the one hand, the bias of VFCV decreases with $V$, since $n_t = n(1 - 1/V)$ observations are used in the training set. On the other hand, the variance of VFCV decreases with $V$ for small values of $V$, whereas the LOO ($V = n$) is known to suffer from a high variance in several frameworks, such as classification or density estimation. Note however that the variance of VFCV is minimal for $V = n$ in some frameworks, such as linear regression (see Section 5.2). Furthermore, estimating the variance of VFCV from the data is a difficult problem in general; see Section 5.2.3.

When the goal of model selection is estimation, it is often reported in the literature that the optimal $V$ is between 5 and 10, because the statistical performance does not increase much for larger values of $V$, and averaging over 5 or 10 splits remains computationally feasible (Hastie et al., 2001, Section 7.10). Even if this claim is clearly true for many problems, the conclusion of this survey is that better statistical performance can sometimes be obtained with other values of $V$, for instance depending on the SNR value. When the SNR is large, the asymptotic comparison of CV procedures recalled in Section 6.2 can be trusted: LOO performs (nearly) unbiased risk estimation, hence is asymptotically optimal, whereas VFCV with $V = O(1)$ is suboptimal. On the contrary, when the SNR is small, overpenalization can improve the performance; VFCV with $V < n$ can then yield a smaller risk than LOO, thanks to its bias and despite its variance when $V$ is small (see the simulation experiments by Arlot, 2008). Furthermore, other CV procedures like RLT can be interesting alternatives to VFCV, since they make it possible to choose the bias (through $n_t$) independently from $B$, which mainly governs the variance. Another possible alternative is $V$-fold penalization, which is related to corrected VFCV (see Section 4.3.3).

When the goal of model selection is identification, the main drawback of VFCV is that $n_t \ll n$ is often required for consistently choosing the true model (see Section 7), whereas VFCV does not allow $n_t < n/2$. Depending on the framework, different (empirical) recommendations for choosing $V$ can be found in the literature. In ordered variable selection, the largest $V$ seems to be the best, with $V = 10$ providing results close to the optimal ones (Zhang, 1993). On the contrary, Dietterich (1998) and Alpaydin (1999) recommend $V = 2$ for choosing the best learning procedure among two candidates.

10.4 Future research

Perhaps the most important direction for future research would be to provide, in each specific framework, precise quantitative measures of the variance of CV estimators of the risk, depending on $n_t$, the number of splits, and how the splits are chosen. Up to now, only a few precise results have been obtained in this direction, for some specific CV methods in linear regression or density estimation (see Section 5.2).
Proving similar results in other frameworks and for more general CV methods would greatly help in choosing a CV method for any given model selection problem. More generally, most theoretical results are not precise enough to make any distinction between the hold-out and CV methods having the same training sample size $n_t$, because they are equivalent at first order. Second-order terms do matter for realistic values of $n$, which shows the dramatic need for a theory taking into account the variance of CV when comparing CV methods such as VFCV and RLT with $n_t = n(V-1)/V$ but $B \neq V$.

References

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math., 22:203-217.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages 267-281. Akadémiai Kiadó, Budapest.
Allen, D. M. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125-127.
Alpaydin, E. (1999). Combined 5 x 2 cv F test for comparing supervised classification learning algorithms. Neur. Comp., 11(8):1885-1892.
Anderson, R. L., Allen, D. M., and Cady, F. B. (1972). Selection of predictor variables in linear multiple regression. In Bancroft, T. A., editor, Statistical Papers in Honor of George W. Snedecor. Iowa State University Press, Ames, IA.
Arlot, S. (2007). Resampling and Model Selection. PhD thesis, University Paris-Sud 11. oai:tel.archives-ouvertes.fr:tel-00198803_v1.
Arlot, S. (2008a). Model selection by resampling penalization. Electronic Journal of Statistics. To appear. oai:hal.archives-ouvertes.fr:hal-00262478_v1.
Arlot, S. (2008b). Suboptimality of penalties proportional to the dimension for model selection in heteroscedastic regression.
Arlot, S. (2008). V-fold cross-validation improved: V-fold penalization.
Arlot, S. and Celisse, A. (2009). Segmentation in the mean of heteroscedastic data via cross-validation.
Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245-279 (electronic).
Baraud, Y. (2002). Model selection for regression on a random design. ESAIM Probab. Statist., 6:127-146 (electronic).
Barron, A., Birgé, L., and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields, 113(3):301-413.
Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48:85-113.
Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist., 33(4):1497-1537.
Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res., 3(Spec. Issue Comput. Learn. Theory):463-482.
Bellman, R. E. and Dreyfus, S. E. (1962). Applied Dynamic Programming. Princeton.
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of K-fold cross-validation. J. Mach. Learn. Res., 5:1089-1105 (electronic).
Bhansali, R. J. and Downham, D. Y. (1977). Some properties of the order of an autoregressive model selected by a generalization of Akaike's FPE criterion. Biometrika, 64(3):547-551.
Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS), 3(3):203-268.
Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33-73.
Blanchard, G. and Massart, P. (2006). Discussion: Local Rademacher complexities and oracle inequalities in risk minimization [Ann. Statist. 34 (2006), no. 6, 2593-2656] by V. Koltchinskii. Ann. Statist., 34(6):2664-2671.
Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM Probab. Stat., 9:323-375 (electronic).
Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Machine Learning Research, 2:499-526.
Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2):353-360.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Ann. Statist., 24(6):2350-2383.
Breiman, L. (1998). Arcing classifiers. Ann. Statist., 26(3):801-849. With discussion and a rejoinder by the author.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth Statistics/Probability Series. Wadsworth Advanced Books and Software, Belmont, CA.
Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression. The X-random case. International Statistical Review, 60(3):291-319.
Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503-514.
Burman, P. (1990). Estimation of optimal transformations using v-fold cross validation and repeated learning-testing methods. Sankhya Ser. A, 52(3):314-345.
Burman, P., Chow, E., and Nolan, D. (1994). A cross-validatory method for dependent data. Biometrika, 81(2):351-358.
Burman, P. and Nolan, D. (1992). Data-dependent estimation of prediction functions. J. Time Ser. Anal., 13(3):189-207.
Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference. Springer-Verlag, New York, second edition. A practical information-theoretic approach.
Cao, Y. and Golubev, Y. (2006). On oracle inequalities related to smoothing splines. Math. Methods Statist., 15(4):398-414.
Celisse, A. (2008a). Density estimation via cross-validation: Model selection point of view. Technical report, arXiv:0811.0802.
Celisse, A. (2008b). Model Selection Via Cross-Validation in Density Estimation, Regression and Change-Points Detection. PhD thesis, University Paris-Sud 11. http://tel.archives-ouvertes.fr/tel-00346320/en/.
Celisse, A. and Robin, S. (2008). Nonparametric density estimation by exact leave-p-out cross-validation. Computational Statistics and Data Analysis, 52(5):2350-2368.
Chow, Y. S., Geman, S., and Wu, L. D. (1987). Consistent cross-validated density estimation. Ann. Statist., 11:25-38.
Chu, C.-K. and Marron, J. S. (1991). Comparison of two bandwidth selectors with dependent errors. Ann. Statist., 19(4):1906-1918.
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31(4):377-403.
Dalelane, C. (2005). Exact oracle inequality for sharp adaptive kernel density estimator. Technical report, arXiv.
Daudin, J.-J. and Mary-Huard, T. (2008). Estimation of the conditional risk in classification: The swapping method. Comput. Stat. Data Anal., 52(6):3220-3232.
Davison, A. C. and Hall, P. (1992). On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems. Biometrika, 79(2):279-284.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics (New York). Springer-Verlag, New York.
Devroye, L. and Wagner, T. J. (1979). Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601-604.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neur. Comp., 10(7):1895-1924.
Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc., 78(382):316-331.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc., 81(394):461-470.
Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. J. Amer. Statist. Assoc., 99(467):619-642. With comments and a rejoinder by the author.
Efron, B. and Morris, C. (1973). Combining possibly related estimation problems (with discussion). J. R. Statist. Soc. B, 35:379.
Efron, B. and Tibshirani, R. (1997). Improvements on cross-validation: the .632+ bootstrap method. J. Amer. Statist. Assoc., 92(438):548-560.
Fromont, M. (2007). Model selection by bootstrap penalization for classification. Mach. Learn., 66(2-3):165-207.
Geisser, S. (1974). A predictive approach to the random effect model. Biometrika, 61(1):101-107.
Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70:320-328.
Girard, D. A. (1998). Asymptotic comparison of (partial) cross-validation, GCV and randomized GCV in nonparametric regression. Ann. Statist., 26(1):315-334.
Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press, Cambridge, MA, USA.
Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer-Verlag, New York.
Hall, P. (1983). Large sample optimality of least squares cross-validation in density estimation. Ann. Statist., 11(4):1156-1174.
Hall, P. (1987). On Kullback-Leibler loss and density estimation. The Annals of Statistics, 15(4):1491-1519.
Hall, P., Lahiri, S. N., and Polzehl, J. (1995). On bandwidth choice in nonparametric regression with both short- and long-range dependent errors. Ann. Statist., 23(6):1921-1936.
Hall, P., Marron, J. S., and Park, B. U. (1992). Smoothed cross-validation. Probab. Theory Related Fields, 92(1):1-20.
Hall, P. and Schucany, W. R. (1989). A local cross-validation algorithm. Statist. Probab. Lett., 8(2):109-117.
Härdle, W. (1984). How to determine the bandwidth of some nonlinear smoothers in practice. In Robust and Nonlinear Time Series Analysis (Heidelberg, 1983), volume 26 of Lecture Notes in Statist., pages 163-184. Springer, New York.
Härdle, W., Hall, P., and Marron, J. S. (1988). How far are automatically chosen regression smoothing parameters from their optimum? J. Amer. Statist. Assoc., 83(401):86-101. With comments by David W. Scott and Iain Johnstone and a reply by the authors.
Hart, J. D. (1994). Automated kernel smoothing of dependent data by using time series cross-validation. J. Roy. Statist. Soc. Ser. B, 56(3):529-542.
Hart, J. D. and Vieu, P. (1990). Data-driven bandwidth choice for density estimation based on dependent data. Ann. Statist., 18(2):873-890.
Hart, J. D. and Wehrly, T. E. (1986). Kernel regression estimation using repeated measurements data. J. Amer. Statist. Assoc., 81(396):1080-1088.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag, New York. Data mining, inference, and prediction.
Herzberg, A. M. and Tsukanov, A. V. (1986). A note on modifications of the jackknife criterion for model selection. Utilitas Math., 29:209-216.
Herzberg, P. A. (1969). The parameters of cross-validation. Psychometrika, 34:Monograph Supplement.
Hesterberg, T. C., Choi, N. H., Meier, L., and Fraley, C. (2008). Least angle and l1 penalized regression: A review. Statistics Surveys, 2:61-93 (electronic).
Hills, M. (1966). Allocation rules and their error rates. J. Royal Statist. Soc. Series B, 28(1):1-31.
Huber, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist., 35:73-101.
Hurvich, C. M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2):297-307.
John, P. W. M. (1971). Statistical Design and Analysis of Experiments. The Macmillan Co., New York.
Jonathan, P., Krzanowski, W. J., and McCarthy, W. V. (2000). On the use of cross-validation to assess performance in multivariate prediction. Stat. and Comput., 10:209-229.
Kearns, M., Mansour, Y., Ng, A. Y., and Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7-50.
Kearns, M. and Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11:1427-1453.
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, 47(5):1902-1914.
Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593-2656.
Lachenbruch, P. A. and Mickey, M. R. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10(1):1-11.
Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. J. Educ. Psychol., 22:45-55.
Leung, D., Marriott, F., and Wu, E. (1993). Bandwidth selection in robust smoothing. J. Nonparametr. Statist., 2:333-339.
Leung, D. H.-Y. (2005). Cross-validation in nonparametric regression with outliers. Ann. Statist., 33(5):2291-2310.
Li, K.-C. (1985). From Stein's unbiased risk estimates to the method of generalized cross validation. Ann. Statist., 13(4):1352-1377.
Li, K.-C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15(3):958-975.
Mallows, C. L. (1973). Some comments on C_p. Technometrics, 15:661-675.
Markatou, M., Tian, H., Biswas, S., and Hripcsak, G. (2005). Analysis of variance of cross-validation estimators of the generalization error. J. Mach. Learn. Res., 6:1127-1168 (electronic).
Massart, P. (2007). Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6-23, 2003. With a foreword by Jean Picard.
Molinaro, A. M., Simon, R., and Pfeiffer, R. M. (2005). Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21(15):3301-3307.
Mosteller, F. and Tukey, J. W. (1968). Data analysis, including statistics. In Lindzey, G. and Aronson, E., editors, Handbook of Social Psychology, Vol. 2. Addison-Wesley.
Nadeau, C. and Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52:239-281.
Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist., 12(2):758-765.
Opsomer, J., Wang, Y., and Yang, Y. (2001). Nonparametric regression with correlated errors. Statist. Sci., 16(2):134-153.
Picard, R. R. and Cook, R. D. (1984). Cross-validation of regression models. J. Amer. Statist. Assoc., 79(387):575-583.
Politis, D. N., Romano, J. P., and Wolf, M. (1999). Subsampling. Springer Series in Statistics. Springer-Verlag, New York.
Quenouille, M. H. (1949). Approximate tests of correlation in time-series. J. Roy. Statist. Soc. Ser. B, 11:68-84.
Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Univ. Press.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416-431.
Ronchetti, E., Field, C., and Blanchard, W. (1997). Robust linear model selection by cross-validation. J. Amer. Statist. Assoc., 92:1017-1023.
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9:65-78.
Sauvé, M. (2009). Histogram selection in non-Gaussian regression. ESAIM: Probability and Statistics, 13:70-86.
Schuster, E. F. and Gregory, G. G. (1981). On the consistency of maximum likelihood nonparametric density estimators. In Eddy, W. F., editor, Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, pages 295-298. Springer-Verlag, New York.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6(2):461-464.
Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc., 88(422):486-494.
Shao, J. (1996). Bootstrap model selection. J. Amer. Statist. Assoc., 91(434):655-665.
Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sinica, 7(2):221-264. With comments and a rejoinder by the author.
Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression variables. Biometrika, 71(1):43-49.
Simon, F. (1971). Prediction Methods in Criminology, volume 7.
Stone, C. (1984). An asymptotically optimal window selection rule for kernel density estimates. The Annals of Statistics, 12(4):1285-1297.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111-147. With discussion by G. A. Barnard, A. C. Atkinson, L. K. Chan, A. P. Dawid, F. Downton, J. Dickey, A. G. Baker, O. Barndorff-Nielsen, D. R. Cox, S. Geisser, D. Hinkley, R. R. Hocking, and A. S. Young, and with a reply by the authors.
Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika, 64(1):29-35.
Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Comm. Statist. A - Theory Methods, 7(1):13-26.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. Series B, 58(1):267-288.
van der Laan, M. J. and Dudoit, S. (2003). Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. Working Paper Series, Working Paper 130, U.C. Berkeley Division of Biostatistics. Available at http://www.bepress.com/ucbbiostat/paper130.
van der Laan, M. J., Dudoit, S., and Keles, S. (2004). Asymptotic optimality of likelihood-based cross-validation. Stat. Appl. Genet. Mol. Biol., 3:Art. 4, 27 pp. (electronic).
van der Laan, M. J., Dudoit, S., and van der Vaart, A. W. (2006). The cross-validated adaptive epsilon-net estimator. Statist. Decisions, 24(3):373-395.
van der Vaart, A. W., Dudoit, S., and van der Laan, M. J. (2006). Oracle inequalities for multi-fold cross validation. Statist. Decisions, 24(3):351-371.
van Erven, T., Grünwald, P. D., and de Rooij, S. (2008). Catching up faster by switching sooner: A prequential solution to the AIC-BIC dilemma.
Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, New York. Translated from the Russian by Samuel Kotz.
Vapnik, V. N. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons Inc., New York. A Wiley-Interscience Publication.
Vapnik, V. N. and Chervonenkis, A. Y. (1974). Teoriya Raspoznavaniya Obrazov. Statisticheskie Problemy Obucheniya. Izdat. Nauka, Moscow. Theory of Pattern Recognition (in Russian).
Wahba, G. (1975). Periodic splines for spectral density estimation: The use of cross validation for determining the degree of smoothing. Communications in Statistics, 4:125-142.
Wahba, G. (1977). Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis, 14(4):651-667.
Wegkamp, M. (2003). Model selection in nonparametric regression. The Annals of Statistics, 31(1):252-273.
Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4):937-950.
Yang, Y. (2006). Comparing learning methods for classification. Statist. Sinica, 16(2):635-657.
Yang, Y. (2007). Consistency of cross validation for comparing regression procedures. Ann. Statist., 35(6):2450-2473.
Zhang, P. (1993). Model selection via multifold cross validation. Ann. Statist., 21(1):299-313.