Random Forests: some methodological insights
Authors: Robin Genuer (LM-Orsay), Jean-Michel Poggi (LM-Orsay), Christine Tuleau (JAD)
INRIA Research Report n° 6729, November 2008, 32 pages
Thème COG, Équipes-Projets Select
Centre de recherche INRIA Saclay - Île-de-France

Random Forests: some methodological insights

Robin Genuer*, Jean-Michel Poggi†*, Christine Tuleau‡

Abstract: This paper examines random forests from an experimental perspective. Random forests is the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. The paper first aims at confirming known but scattered advice for using random forests and at proposing some complementary remarks, both for standard problems and for high dimensional ones, in which the number of variables hugely exceeds the sample size. But the main contribution of this paper is twofold: to provide some insights about the behavior of the variable importance index based on random forests and, in addition, to investigate two classical issues of variable selection. The first one is to find important variables for interpretation, and the second one is more restrictive and tries to design a good prediction model. The strategy involves a ranking of explanatory variables using the random forests score of importance and a stepwise ascending variable introduction strategy.

Keywords: Random Forests, Regression, Classification, Variable Importance, Variable Selection.

* Université Paris-Sud, Mathématique, Bât.
425, 91405 Orsay, France
† Université Paris Descartes, France
‡ Université Nice Sophia-Antipolis, France

1 Introduction

Random forests (RF henceforth) is a popular and very efficient algorithm, based on model aggregation ideas, for both classification and regression problems, introduced by Breiman (2001) [8].
It belongs to the family of ensemble methods, which appeared in machine learning at the end of the nineties (see for example Dietterich (1999) [15] and (2000) [16]). Let us briefly recall the statistical framework by considering a learning set L = {(X_1, Y_1), ..., (X_n, Y_n)} made of n i.i.d. observations of a random vector (X, Y). The vector X = (X^1, ..., X^p) contains predictors or explanatory variables, say X ∈ R^p, and Y ∈ 𝒴, where 𝒴 is either a class label or a numerical response. For classification problems, a classifier t is a mapping t : R^p → 𝒴, while for regression problems we suppose that Y = s(X) + ε, where s is the so-called regression function. For more background on statistical learning, see Hastie et al. (2001) [24]. Random forests is a model building strategy providing estimators of either the Bayes classifier or the regression function.

The principle of random forests is to combine many binary decision trees built using several bootstrap samples coming from the learning sample L, choosing randomly at each node a subset of explanatory variables. More precisely, with respect to the well-known CART model building strategy (see Breiman et al. (1984) [6]), which performs a growing step followed by a pruning one, two differences can be noted. First, at each node, a given number (denoted by mtry) of input variables are randomly chosen and the best split is calculated only within this subset. Second, no pruning step is performed, so all the trees are maximal trees. In addition to CART, another well-known related tree-based method must be mentioned: bagging (see Breiman (1996) [7]). Indeed, random forests with mtry = p reduce simply to unpruned bagging. The associated R (see http://www.r-project.org/) packages are respectively randomForest (intensively used in the sequel of the paper), rpart and ipred for CART and bagging (cited here for the sake of completeness).
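To fix ideas, the two sources of randomness just described, bootstrap resampling of L and the random choice of mtry candidate variables at each node, can be sketched in a few lines. The sketch below is an illustration in Python (the paper itself relies on the R randomForest package); the function names are ours.

```python
import random

def bootstrap_and_oob(n, rng):
    """Draw a bootstrap sample of size n from {0, ..., n-1}; the
    observations never drawn form the out-of-bag (OOB) sample."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

def node_candidates(p, mtry, rng):
    """At each node, only mtry randomly chosen input variables
    compete for the best split."""
    return rng.sample(range(p), mtry)

rng = random.Random(42)
in_bag, oob = bootstrap_and_oob(100, rng)
# On average, a fraction (1 - 1/n)^n ≈ 36.8% of the observations is out of bag.
```

Each tree of the forest gets its own bootstrap sample and hence its own OOB sample; the latter is reused below for error estimation and variable importance.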
The RF algorithm is becoming more and more popular and appears to be very powerful in a lot of different applications (see for example Díaz-Uriarte and Alvarez de Andrés (2006) [14] for gene expression data analysis), even if it is not clearly elucidated from a mathematical point of view (see the recent paper by Biau et al. (2008) [5], and Bühlmann, Yu (2002) [11] for bagging). Nevertheless, Breiman (2001) [8] sketches an explanation of the good performance of random forests related to the good quality of each tree (at least from the bias point of view) together with the small correlation among the trees of the forest, where the correlation between trees is defined as the ordinary correlation of predictions on so-called out-of-bag (OOB henceforth) samples. The OOB sample, which is the set of observations not used for building the current tree, is used to estimate the prediction error and then to evaluate variable importance.

Tuning method parameters

It is now classical to distinguish two typical situations depending on n, the number of observations, and p, the number of variables: standard (for n >> p) and high dimensional (when n << p). The first question for anyone trying to use random forests in practice is to get information about sensible values for the two main parameters of the method. Essentially, the studies carried out in the two papers [8] and [14] give interesting insights, but Breiman focuses on standard problems while Díaz-Uriarte and Alvarez de Andrés concentrate on high dimensional classification ones.

So the first objective of this paper is to give compact information about selected benchmark datasets and to examine again the choice of the method parameters, addressing the different situations more closely.
RF variable importance

The quantification of variable importance (VI henceforth) is an important issue in many applied problems, complementing variable selection by interpretation issues. In the linear regression framework it is examined for example by Grömping (2007) [22], who distinguishes between various variance decomposition based indicators: "dispersion importance", "level importance" or "theoretical importance", quantifying explained variance or changes in the response for a given change of each regressor. Various ways to define and compute such indicators using R are available (see Grömping (2006) [23]).

In the random forests framework, the most widely used score of importance of a given variable is the increase in mean error of a tree (MSE for regression and misclassification rate for classification) in the forest when the observed values of this variable are randomly permuted in the OOB samples. Such random forests VI is often called a permutation importance index, as opposed to the total decrease of node impurity measures already introduced in the seminal book about CART by Breiman et al. (1984) [6].

Even if only little investigation is available about RF variable importance, some interesting facts have been collected for classification problems. This index can be based on the average loss of another criterion, like the Gini entropy used for growing classification trees. Let us cite two remarks. The first one is that the RF Gini importance is unfairly biased in favor of predictor variables with many categories, while the RF permutation importance is a more reliable indicator (see Strobl et al. (2007) [36]). So we restrict our attention to the latter.
The second one is that permutation importance seems to overestimate the variable importance of highly correlated variables, for which Strobl et al. (2008) [37] propose a conditional variant. Let us mention that, in this paper, we do not notice such a phenomenon. For classification problems, Ben Ishak, Ghattas (2008) [4] and Díaz-Uriarte, Alvarez de Andrés (2006) [14], for example, use RF variable importance and note that it is stable for correlated predictors, scale invariant and stable with respect to small perturbations of the learning sample. But these preliminary remarks need to be extended, and the recent paper by Archer et al. (2008) [3], focusing more specifically on the VI topic, does not answer some crucial questions about variable importance behavior, such as the importance of a group of variables or its behavior in the presence of highly correlated variables. This is the second goal of this paper.

Variable selection

Many variable selection procedures are based on the cooperation of variable importance for ranking and model estimation to evaluate and compare a family of models. Three types of variable selection methods are distinguished (see Kohavi et al. (1997) [27] and Guyon et al. (2003) [20]): "filter", for which the score of variable importance does not depend on a given model design method; "wrapper", which includes the prediction performance in the score calculation; and finally "embedded", which combines variable selection and model estimation more intricately. For non-parametric models, only a small number of methods are available, especially for the classification case. Let us briefly mention some of them, which are potentially competing tools.
Of course we must first mention the wrapper methods based on VI coming from CART, see Breiman et al. (1984) [6], and, of course, random forests, see Breiman (2001) [8]. Then some examples of embedded methods: Poggi, Tuleau (2006) [30] propose a method based on CART scores, using a stepwise ascending procedure with an elimination step; Guyon et al. (2002) [19] (and Rakotomamonjy (2003) [32]) propose SVM-RFE, a method based on SVM scores using descending elimination. More recently, Ben Ishak et al. (2008) [4] propose a stepwise variant, while Park et al. (2007) [29] propose a "LARS"-type strategy (see Efron et al. (2004) [17]) for classification problems.

Let us recall that two distinct objectives of variable selection can be identified: (1) to find important variables highly related to the response variable for interpretation purposes; (2) to find a small number of variables sufficient for a good prediction of the response variable. The key tool for task 1 is thresholding variable importance, while the crucial point for task 2 is to combine variable ranking and stepwise introduction of variables into a prediction model building procedure. This introduction could be ascending, in order to avoid selecting redundant variables, or, for the case n << p, first descending to reach a classical situation n ∼ p, and then ascending using the first strategy, see Fan, Lv (2008) [18]. We propose in this paper a two-step procedure, whose first step is common while the second one depends on the objective: interpretation or prediction.

The paper is organized as follows. After this introduction, Section 2 focuses on random forests parameters. Section 3 proposes to study the behavior of the RF variable importance index. Section 4 investigates the two classical issues of variable selection using the random forests based score of importance.
Section 5 finally opens a discussion about future work.

2 Selecting method parameters

2.1 Experimental framework

2.1.1 RF procedure

The R package about random forests is based on the seminal contribution of Breiman and Cutler [10] and is described in Liaw, Wiener (2002) [28]. In this paper, we focus on the randomForest procedure. The two main parameters are mtry, the number of input variables randomly chosen at each split, and ntree, the number of trees in the forest (in all the paper, mtry = m with m ∈ R stands for mtry = ⌊m⌋). A third parameter, denoted by nodesize, allows one to specify the minimum number of observations in a node. We retain the default value of this parameter (1 for classification and 5 for regression) for all of our experiments, since it is close to the maximal tree choice.

2.1.2 OOB error

In this section, we concentrate on the prediction performance of RF, focusing on the out-of-bag (OOB) error (see [8]). We use this kind of prediction error estimate for three reasons: the main one is that we are mainly interested in comparing results instead of assessing models; the second is that it gives a fair estimate compared to the usual alternative test set error, even if it is considered a little bit optimistic; and the last one, but not the least, is that it is a default output of the procedure. To avoid insignificant sampling effects, each OOB error is actually the mean of the OOB error over 10 runs.

2.1.3 Datasets

We have collected information about the datasets considered in this paper: the name, the name of the corresponding data structure (when different), n, p, the number of classes c in the multiclass case, a reference, a website or a package. The two next tables contain synthetic information while details are postponed to the Appendix.
We distinguish standard and high dimensional situations and, in addition, the three problems: regression, 2-class classification and multiclass classification. Table 1 displays some information about the standard problem datasets: classification at the top and regression at the bottom.

Name            Observations   Variables   Classes
Ionosphere           351           34         2
Diabetes             768            8         2
Sonar                208           60         2
Votes                435           16         2
Ringnorm             200           20         2
Threenorm            200           20         2
Twonorm              200           20         2
Glass                214            9         6
Letters            20000           16        26
Sat-images          6435           36         6
Vehicle              846           18         4
Vowel                990           10        11
Waveform             200           21         3
BostonHousing        506           13
Ozone                366           12
Servo                167            4
Friedman1            300           10
Friedman2            300            4
Friedman3            300            4

Table 1: Standard problems: datasets for classification at the top, and for regression at the bottom

Table 2 displays the high dimensional problem datasets: classification at the top and regression at the bottom.

Name            Observations   Variables     Classes
Adenocarcinoma        76          9868          2
Colon                 62          2000          2
Leukemia              38          3051          2
Prostate             102          6033          2
Brain                 42          5597          5
Breast                96          4869          3
Lymphoma              62          4026          3
Nci                   61          6033          8
Srbct                 63          2308          4
toys data            100    100 to 1000         2
PAC                  209           467
Friedman1            100    100 to 1000
Friedman2            100    100 to 1000
Friedman3            100    100 to 1000

Table 2: High dimensional problems: datasets for classification at the top, and for regression at the bottom

2.2 Regression

About regression problems, even if it seems at first inspection that the seminal paper by Breiman [8] closes the debate about good advice, it remains that the experimental results concern a variant which is not implemented in the universally used R package.
Moreover, except for this reference, to our knowledge no such general paper is available, so we develop Breiman's study again, both for real and simulated data corresponding to the case n >> p, and we provide some additional study of data corresponding to the case n << p (such examples typically come from chemometrics). We observe that the default value of mtry proposed by the R package is not optimal, and that there is no improvement in using random forests with respect to unpruned bagging (obtained for mtry = p).

2.2.1 Standard problems

Let us briefly examine standard (n >> p) regression datasets: in Figure 1 for real ones and in Figure 2 for simulated ones. Each plot gives, for mtry = 1 to p, the OOB error for three different values of ntree = 100, 500 and 1000. The vertical solid line indicates the value mtry = p/3, the default value proposed by the R package for regression problems, the vertical dashed line being the value mtry = √p.

Three remarks can be formulated. First, the OOB error is maximal for mtry = 1 and then decreases quickly (except for the Ozone dataset, for reasons not clearly elucidated); then, as soon as mtry > √p, the error remains the same. Second, the choice mtry = √p always gives a lower OOB error than mtry = p/3, and the gain can be important. So the default value proposed by the R package often seems not to be optimal, especially when ⌊p/3⌋ = 1. Lastly, the default value ntree = 500 is convenient, but a much smaller one, ntree = 100, leads to comparable results.
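The mtry/ntree/OOB machinery underlying these experiments can be made concrete with a deliberately simplified forest of regression stumps in pure Python. This is not the randomForest algorithm (real RF trees are grown maximal, and the mtry candidates are redrawn at every node, whereas here they are drawn once per stump); the function names are ours, and continuous predictors are assumed.

```python
import random
import statistics

def best_stump(X, y, feat_idx):
    """Exhaustive search for the split (feature, threshold) minimising the
    sum of squared errors, among the candidate features in feat_idx only."""
    best, best_sse = None, float("inf")
    for j in feat_idx:
        values = sorted(set(row[j] for row in X))
        for k in range(len(values) - 1):
            thr = 0.5 * (values[k] + values[k + 1])
            left = [y[i] for i, row in enumerate(X) if row[j] <= thr]
            right = [y[i] for i, row in enumerate(X) if row[j] > thr]
            ml, mr = statistics.fmean(left), statistics.fmean(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if sse < best_sse:
                best_sse, best = sse, (j, thr, ml, mr)
    return best

def fit_forest(X, y, ntree=50, mtry=2, rng=None):
    """Each 'tree' is a single stump fitted on a bootstrap sample, with mtry
    candidate variables (drawn once here, per node in the real algorithm)."""
    rng = rng or random.Random(0)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(ntree):
        boot = [rng.randrange(n) for _ in range(n)]
        oob = frozenset(range(n)) - frozenset(boot)
        Xb, yb = [X[i] for i in boot], [y[i] for i in boot]
        forest.append((best_stump(Xb, yb, rng.sample(range(p), mtry)), oob))
    return forest

def oob_error(forest, X, y):
    """Mean squared OOB error: each observation is predicted only by the
    trees for which it was out of bag."""
    errs = []
    for i, row in enumerate(X):
        preds = [ml if row[j] <= thr else mr
                 for (j, thr, ml, mr), oob in forest if i in oob]
        if preds:
            errs.append((y[i] - statistics.fmean(preds)) ** 2)
    return statistics.fmean(errs)
```

Sweeping mtry from 1 to p and recording oob_error gives curves of the same kind as those discussed above, up to the stump simplification.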
Figure 1: Standard regression: 3 real datasets (BostonHousing, Ozone, Servo)

Figure 2: Standard regression: 3 simulated datasets (Friedman1, Friedman2, Friedman3)

So, for standard (n >> p) regression problems, it seems that there is no improvement in using random forests with respect to unpruned bagging (obtained for mtry = p).

2.2.2 High dimensional problems

Let us start with a simulated dataset for the high dimensional case n << p. This example is built by adding extra noisy variables (independent and uniformly distributed on [0, 1]) to the Friedman1 model, defined by:

Y = 10 sin(π X1 X2) + 20 (X3 − 0.5)^2 + 10 X4 + 5 X5 + ε

where X1, ..., X5 are independent and uniformly distributed on [0, 1] and ε ∼ N(0, 1). So we have 5 variables related to the response Y, the others being noise. We set n = 100 and let p vary.

Figure 3: High dimensional regression simulated dataset: Friedman1. The x-axis is in log scale

Figure 3 contains four plots corresponding to 4 values of p (100, 200, 500 and 1000), increasing the nuisance space dimension. Each plot gives, for ten values of mtry (1, √p/2, √p, 2√p, 4√p, p/4, p/3, p/2, 3p/4, p), the OOB error for three different values of ntree = 100, 500 and 1000.
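The Friedman1 design just described is easy to reproduce. The sketch below (pure Python, our own function name) draws n observations with p predictors, only the first 5 of which enter the response, the remaining p − 5 columns playing the role of the added noise variables.

```python
import math
import random

def friedman1(n, p, rng):
    """Friedman1 data with p - 5 pure-noise variables appended
    (all predictors i.i.d. uniform on [0, 1], noise eps ~ N(0, 1))."""
    X, y = [], []
    for _ in range(n):
        row = [rng.random() for _ in range(p)]
        eps = rng.gauss(0, 1)
        y.append(10 * math.sin(math.pi * row[0] * row[1])
                 + 20 * (row[2] - 0.5) ** 2
                 + 10 * row[3] + 5 * row[4] + eps)
        X.append(row)
    return X, y
```

Calling friedman1(100, p, rng) for p in {100, 200, 500, 1000} reproduces the experimental setting of Figure 3.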
The x-axis is in log scale and the vertical solid line indicates mtry = p/3, the default value proposed by the R package for regression, the vertical dashed line being the value mtry = √p.

Let us give four comments. All curves have the same shape: the OOB error decreases while mtry increases. As p increases, both the OOB error of unpruned bagging (obtained with mtry = p) and that of random forests with the default value of mtry increase, but unpruned bagging performs better than RF (about 25% of improvement). The choice mtry = √p always gives worse results than those obtained for mtry = p/3. Finally, the default choice ntree = 500 is convenient, but a much smaller one, ntree = 100, leads to comparable results.

Figures 4 and 5 show the results of the same study for the Friedman2 and Friedman3 models. The previous comments remain valid. Let us just note that the difference between unpruned bagging and random forests with the default mtry value is even more pronounced for these two problems.

Figure 4: High dimensional regression simulated dataset: Friedman2. The x-axis is in log scale

Figure 5: High dimensional regression simulated dataset: Friedman3. The x-axis is in log scale

To end, let us now examine the high dimensional real dataset PAC. Figure 6 gives, for the same ten values of mtry, the OOB error for four different values of ntree = 100, 500, 1000 and 5000 (the x-axis is in log scale). The general behavior is similar except for the shape: as soon as mtry > √p, the error remains the same instead of still decreasing. The difference in the shape of the curves between simulated and real datasets can be explained by the fact that, in the simulated datasets we considered, the number of true variables is very small compared to the total number of variables. One may expect that in real datasets the proportion of true variables is larger.

So, for high dimensional (n << p) regression problems, unpruned bagging seems to perform better than random forests, and the difference can be large.

Figure 6: High dimensional regression: PAC data. The x-axis is in log scale

2.3 Classification

About standard classification problems, we check that Breiman's conclusions remain valid for the considered variant and that the mtry default value proposed in the R package is good. However, for high dimensional classification problems, we observe that larger values of mtry sometimes give much better results.

2.3.1 Standard problems

For classification problems for which n >> p, again the paper by Breiman is interesting and we just quickly check the conclusions. Let us first examine in Figure 7 standard (n >> p) classification real datasets. Each plot gives, for mtry = 1 to p, the OOB error for three different values of ntree = 100, 500 and 1000. The vertical solid line indicates the value mtry = √p, the default value proposed by the R package for classification.

Three remarks can be formulated. The default value mtry = √p is convenient for all the examples.
The default value ntree = 500 is sufficient, whereas a much smaller one, ntree = 100, is not convenient and can lead to significantly larger errors. The general shape is the following: the errors for mtry = 1 and for mtry = p (corresponding to unpruned bagging) are of the same "large" order of magnitude, and the minimum is reached for the value √p. The gain can be about 30 to 50%. So, for these 9 examples, the default value proposed by the R package is quite optimal.

Figure 7: Standard classification: 9 real datasets

Let us now examine in Figure 8 standard (n >> p) classification simulated datasets. As can be seen, ntree = 500 is sufficient and, except for Ringnorm, already pointed out as a somewhat special dataset (see Cutler, Zhao (2001) [13]), the value mtry = √p is good. Here, the general shape of the error curve is quite different compared to the real datasets: the error increases with mtry. So for these four examples, the smaller mtry, the better.

Figure 8: Standard classification: 4 simulated datasets

2.3.2 High dimensional problems

Let us now consider the case n << p, for which Díaz-Uriarte and Alvarez de Andrés (2006) [14] give numerous pieces of advice. We complete the study by trying larger values of mtry, which give interesting results. One can find in Figure 9 the OOB errors for nine high dimensional real datasets. Each plot gives, for nine values of mtry (1, √p/2, √p, 2√p, 4√p, p/4, p/2, 3p/4, p), the OOB error for four different values of ntree = 100, 500, 1000 and 5000. The x-axis is in log scale. The vertical solid line indicates the default value proposed by the R package, mtry = √p. Again the default value ntree = 500 is sufficient and, on the contrary, the value ntree = 100 can lead to significantly larger errors.
The general shape is the following: the error decreases in general, and the minimum value is obtained by, or is close to the one reached using, mtry = p (corresponding to unpruned bagging).

Figure 9: High dimensional classification: 9 real datasets. The x-axis is in log scale

The difference with standard problems is notable; the reason is that when p is large, mtry must be sufficiently large in order to have a high probability of capturing important variables (that is, variables highly related to the response) for defining the splits of the RF. In addition, let us mention that the default value mtry = √p is still reasonable from the OOB error viewpoint and, of course, since √p is small with respect to p, it is a very attractive value from a computational perspective (notice that the trees are not too deep since n is not too large).

Let us examine a simulated dataset for the case n << p, introduced by Weston et al. (2003) [39], called "toys data" in the sequel. It is an equiprobable two-class problem, Y ∈ {−1, 1}, with 6 true variables, the others being noise. This example is interesting since it constructs two nearly independent groups of 3 significant variables (highly, moderately and weakly correlated with the response Y) and an additional group of noise variables, uncorrelated with Y.
A forward reference to the plots on the left side of Figure 11 allows one to see the variable importance picture and to note that the importance of variables 1 to 3 is much higher than that of variables 4 to 6. More precisely, the model is defined through the conditional distribution of the X^i given Y = y:

- for 70% of the data, X^i ∼ y·N(i, 1) for i = 1, 2, 3 and X^i ∼ y·N(0, 1) for i = 4, 5, 6;
- for the remaining 30%, X^i ∼ y·N(0, 1) for i = 1, 2, 3 and X^i ∼ y·N(i − 3, 1) for i = 4, 5, 6;
- the other variables are noise, X^i ∼ N(0, 1) for i = 7, ..., p.

After simulation, the obtained variables are standardized. Let us fix n = 100. The plots of Figure 10 are organized as previously; four values of p are considered, 100, 200, 500 and 1000, corresponding to increasing nuisance space dimension.

Figure 10: High dimensional classification simulated dataset: toys data for 4 values of p. The x-axis is in log scale

For p = 100 and p = 200, the error decreases hugely until mtry reaches √p and then remains constant, so the default values work well and perform as well as unpruned bagging, even if the true dimension p̃ = 6 << p. For larger values of p (p ≥ 500), the shape of the curve is close to the one for the high dimensional real datasets (the error decreases and the minimum is reached when mtry = p). Hence, the error reached by using random forests with the default mtry is about 70% to 150% larger than the error reached by unpruned bagging, which is close to 3% for all the considered values of p.
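The toys data generator can be sketched as follows (pure Python, our own function name). Two reading assumptions are made explicit here: the 70/30 split is drawn as an independent Bernoulli(0.7) per observation (one reading of "for 70% of the data"), and X^i ∼ y·N(m, 1) is taken to mean multiplying a N(m, 1) draw by the label y.

```python
import random
import statistics

def toys_data(n, p, rng):
    """'toys data' (Weston et al., 2003): 6 informative variables,
    p - 6 pure-noise variables, then empirical standardization."""
    X, y = [], []
    for _ in range(n):
        lab = rng.choice([-1, 1])           # equiprobable two-class label
        row = [0.0] * p
        if rng.random() < 0.7:              # Bernoulli(0.7) group membership
            for i in range(3):              # variables 1-3: means 1, 2, 3
                row[i] = lab * rng.gauss(i + 1, 1)
            for i in range(3, 6):           # variables 4-6: mean 0
                row[i] = lab * rng.gauss(0, 1)
        else:
            for i in range(3):              # variables 1-3: mean 0
                row[i] = lab * rng.gauss(0, 1)
            for i in range(3, 6):           # variables 4-6: means 1, 2, 3
                row[i] = lab * rng.gauss(i - 2, 1)
        for i in range(6, p):               # noise variables
            row[i] = rng.gauss(0, 1)
        X.append(row)
        y.append(lab)
    for j in range(p):                      # standardize each variable
        col = [row[j] for row in X]
        mu, sd = statistics.fmean(col), statistics.pstdev(col)
        for row in X:
            row[j] = (row[j] - mu) / sd
    return X, y
```

With n = 100 and p ranging over 100 to 1000, this matches the experimental setting used for Figure 10.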
Finally, for high dimensional classification problems, our conclusion is that it may be worthwhile to choose mtry larger than the default value √p.

After this section focusing on prediction performance, let us now turn to the second attractive feature of RF: the variable importance index.

3 Variable importance

The quantification of variable importance (abbreviated VI) is a crucial issue, not only for ranking the variables before a stepwise estimation model but also for interpreting data and understanding underlying phenomena in many applied problems. In this section, we examine the RF variable importance behavior according to three different issues. The first one deals with the sensitivity to the sample size n and the number of variables p. The second examines the sensitivity to the method parameters mtry and ntree. The last one deals with the variable importance of a group of variables, highly correlated or poorly correlated, together with the problem of correct identification of irrelevant variables.

As a result, a good choice of the parameters of RF can help to better discriminate between important and useless variables. In addition, it can increase the stability of VI scores.

To illustrate this discussion, let us consider the toys data introduced in Section 2.3.2 and compute the variable importance. Recall that only the first 6 variables are of interest and the others are noise.

Remark 3.1. Let us mention that variable importance is computed conditionally on a given realization, even for simulated datasets. This choice, which is criticizable if the objective is to reach a good estimation of an underlying constant, is consistent with the idea of staying as close as possible to the experimental situation of dealing with a given dataset.
In addition, the number of permutations of the observed values in the OOB sample, used to compute the score of importance, is set to the default value 1.

3.1 Sensitivity to n and p

Figure 11 illustrates the behavior of variable importance for several values of n and p. Parameters ntree and mtry are set to their default values. Boxplots are based on 50 runs of the RF algorithm and, for visibility, we plot the variable importance only for a few variables.

Figure 11: Variable importance sensitivity to n and p (toys data); one panel per (n, p) pair, with n ∈ {500, 100} and p ∈ {6, 200, 500}.

On each row, the first plot is the reference one, for which we observe a convenient picture of the relative importance of the initial variables. Then, when p increases tremendously, we try to check whether: (1) the situation between the two groups remains readable; (2) the situation within each group is stable; (3) the importance of the additional dummy variables is close to 0. The situation n = 500 (graphs at the top of the figure) corresponds to an "easy" case, where a lot of data are available, and n = 100 (graphs at the bottom) to a harder one. For each value of n, three values of p are considered: 6, 200 and 500. When p = 6, only the 6 true variables are present. Then two very difficult situations are considered: p = 200, with a lot of noisy variables, and p = 500, which is even harder. Graphs are truncated after the 16th variable for readability (the importance of the remaining noisy variables is of the same order of magnitude as the last plotted).
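As a reminder, the score of importance used throughout this section is the permutation importance: permute the values of a variable in the OOB sample and measure the induced increase of error. Here is a minimal model-agnostic sketch; the single permutation matches the default mentioned above, while `predict` and the data are placeholders (in the RF setting, `predict` would be a single tree and (X, Y) its OOB sample, the score being averaged over trees).

```python
import random

def permutation_importance(predict, X, Y, var, seed=0):
    """Increase in error rate when the values of variable `var` are permuted.

    `predict` maps a list of rows to a list of labels.
    """
    rng = random.Random(seed)
    base_err = sum(p != y for p, y in zip(predict(X), Y)) / len(Y)
    shuffled = [row[:] for row in X]
    perm = [row[var] for row in X]
    rng.shuffle(perm)                 # one permutation, the default value
    for row, v in zip(shuffled, perm):
        row[var] = v
    perm_err = sum(p != y for p, y in zip(predict(shuffled), Y)) / len(Y)
    return perm_err - base_err        # averaged over trees in a forest
```

A variable the predictor ignores gets an importance of exactly zero, since permuting it cannot change any prediction.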
Let us comment on the graphs of the first row (n = 500). When p = 6, we obtain concentrated boxplots and the order is clear, variables 2 and 6 having nearly the same importance. When p increases, the order of magnitude of importance decreases. The order within the two groups of variables (1, 2, 3 and 4, 5, 6) remains the same, while the overall order is modified (variable 6 is now less important than variable 2). In addition, variable importance is more unstable for huge values of p. But what is remarkable is that all noisy variables have a zero VI, so one can easily recover the variables of interest.

In the second row (n = 100), we note a greater instability, since the number of observations is only moderate, but the variable ranking remains roughly the same. What differs is that, in the difficult situations (p = 200, 500), the importance of some noisy variables increases, and for example variable 4 cannot be distinguished from noise (nor even variable 5 in the bottom right graph). This is due to the decreasing behavior of VI as p grows, coming from the fact that when p = 500 the algorithm randomly chooses only 22 variables at each split (with the default mtry value). The probability of choosing one of the 6 true variables is really small, and the less a variable is chosen, the less it can be considered important. In addition, let us remark that the variability of VI is large for true variables with respect to useless ones. This remark can be used to build some kind of test for VI (see Strobl et al. (2007) [36]), but of course ranking is better suited for variable selection.

We now study how this VI index behaves when changing the values of the main method parameters.
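The combinatorial point above can be checked directly. With mtry = 22 candidate variables drawn without replacement out of p = 500, the chance that a given split even sees one of the 6 true variables stays below a quarter (an illustrative side computation, not from the paper):

```python
from math import comb

def prob_true_variable_seen(p=500, n_true=6, mtry=22):
    """P(at least one of the n_true informative variables is among the
    mtry candidate variables drawn without replacement at a split)."""
    return 1 - comb(p - n_true, mtry) / comb(p, mtry)
```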
In Figure 12 we plot the variable importance obtained using three values of mtry (14, the default, 100 and 200) and two values of ntree (500, the default, and 2000).

Figure 12: Variable importance sensitivity to mtry and ntree (toys data); one panel per (ntree, mtry) pair, with ntree ∈ {500, 2000} and mtry ∈ {14, 100, 200}.

The effect of taking a larger value for mtry is obvious. Indeed, the magnitude of VI is more than doubled going from mtry = 14 to mtry = 100, and it increases again with mtry = 200. The effect of ntree is less visible, but taking ntree = 2000 leads to better stability. What is interesting in the bottom right graph is that we get the same order for all true variables in every run of the procedure. In the top left situation, the mean OOB error rate is about 5%, and in the bottom right one it is 3%. The gain in error may not be considered large, but what we gain in VI is interesting.

3.3 Sensitivity to highly correlated predictors

Let us address an important issue: how does variable importance behave in the presence of several highly correlated variables? We take as basic framework the previous context with n = 100, p = 200, ntree = 2000 and mtry = 100. Then we add to the dataset highly correlated replications of some of the 6 true variables. The replicates are inserted between the true variables and the useless ones.
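The paper does not spell out its recipe for generating such replicates. A standard recipe, assuming the original variable is standardized, mixes it with fresh noise so that the result has (approximately) the target correlation ρ; this sketch is ours, not the authors' code:

```python
import random

def correlated_replicate(x, rho=0.9, seed=0):
    """Return a copy of the standardized variable x with correlation ~rho.

    If x ~ N(0, 1) and eps ~ N(0, 1) independently, then
    rho * x + sqrt(1 - rho**2) * eps has unit variance and correlation
    rho with x.
    """
    rng = random.Random(seed)
    w = (1 - rho ** 2) ** 0.5
    return [rho * v + w * rng.gauss(0, 1) for v in x]
```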
Figure 13: Variable importance of a group of correlated variables (augmented toys data).

The first graph of Figure 13 is the reference one: the situation is the same as previously. Then, for the three other cases, we simulate 1, 10 and 20 variables with a correlation of 0.9 with variable 3 (the most important one). These replications are plotted between the two vertical lines.

The magnitude of importance of the group 1, 2, 3 steadily decreases when adding more replications of variable 3. On the other hand, the importance of the group 4, 5, 6 is unchanged. Notice that the importance is not divided by the number of replications. Indeed, in our example, even with 20 replications the maximum importance of the group containing variable 3 (that is, variables 1, 2, 3 and all replications of variable 3) is only three times lower than the initial importance of variable 3. Finally, note that even if some variables in this group have low importance, they cannot be confused with noise.

Let us briefly comment on similar experiments (see Figure 14), perturbing the basic situation by introducing highly correlated versions not only of the third variable but also of the sixth, that is, replicating the most important variable of each group.
Figure 14: Variable importance of two groups of correlated variables (augmented toys data).

Again, the first graph is the reference one. Then we simulate 1, 5 and 10 variables with correlation about 0.9 with variable 3, and the same with variable 6. Replications of variable 3 are plotted between the first vertical line and the dashed line, and replications of variable 6 between the dashed line and the second vertical line. The magnitude of importance of each group (1, 2, 3 and 4, 5, 6 respectively) steadily decreases when adding more replications. The relative importance between the two groups is preserved. And the relative importance between the two groups of replications is of the same order as the one between the two initial groups.

3.4 Prostate data variable importance

To end this section, we illustrate the behavior of variable importance on a high dimensional real dataset: the microarray data called Prostate. The global picture is the following: two hugely important variables, about twenty moderately important variables, and the others of small importance. More precisely, Figure 15 compares the VI obtained with parameters set to their default values (graphs of the left column) and those obtained with ntree = 2000 and mtry = p/3 (graphs of the right column).

Let us comment on Figure 15. For the two most important variables (first row), the magnitude of importance obtained with ntree = 2000 and mtry = p/3 is much larger than the one obtained with default values.
In the second row, the increase of magnitude is still noticeable from the 3rd to the 9th most important variables, while from the 10th to the 20th most important variables, VI is roughly the same for the two parameter choices. In the third row, we get VI closer to zero with ntree = 2000 and mtry = p/3 than with the default values. In addition, note that for the least important variables, the boxplots are larger for the default values, especially for unimportant variables (from the 200th to the 250th).

Figure 15: Variable importance for Prostate data (using ntree = 2000 and mtry = p/3 on the right, and default values on the left).

4 Variable selection

4.1 Procedure

4.1.1 Principle

We distinguish two variable selection objectives:

1. to find important variables highly related to the response variable, for interpretation purposes;
2. to find a small number of variables sufficient for a good prediction of the response variable.

The first is to magnify all the important variables, even highly redundant ones, for interpretation purposes, and the second is to find a sufficient parsimonious set of important variables for prediction.

Two earlier works must be cited: Díaz-Uriarte and Alvarez de Andrés (2006) [14] and Ben Ishak and Ghattas (2008) [4]. Díaz-Uriarte and Alvarez de Andrés propose a strategy based on recursive elimination of variables. More precisely, they first compute the RF variable importance.
Then, at each step, they eliminate the 20% of variables having the smallest importance and build a new forest with the remaining variables. They finally select the set of variables leading to the smallest OOB error rate. The proportion of variables to eliminate is an arbitrary parameter of their method and does not depend on the data.

Ben Ishak and Ghattas choose an ascending strategy based on a sequential introduction of variables. First, they compute some SVM-based variable importance. Then, they build a sequence of SVM models involving at the beginning the k most important variables, by steps of 1. When k becomes too large, the additional variables are introduced by packets. They finally select the set of variables leading to the model of smallest error rate. The way variables are introduced is not data-driven, since it is fixed before running the procedure. They also compare their procedure with a similar one using RF instead of SVM.

We propose the following two-step procedure, the first step being common while the second depends on the objective:

1. Preliminary elimination and ranking: compute the RF scores of importance and cancel the variables of small importance; order the m remaining variables in decreasing order of importance.
2. Variable selection:
   - For interpretation: construct the nested collection of RF models involving the k first variables, for k = 1 to m, and select the variables involved in the model leading to the smallest OOB error.
   - For prediction: starting from the ordered variables retained for interpretation, construct an ascending sequence of RF models by introducing and testing the variables stepwise. The variables of the last model are selected.

Of course, this is only a sketch of the procedure, and more details are needed to make it effective.
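The two selection steps can be rendered schematically as follows. This is a sketch, not the authors' implementation: `oob_error(vars)` stands for fitting a random forest on the listed variables and returning its OOB error, `ranked_vars` for the importance-ordered variables, and `threshold` for the data-driven error-gain level discussed later in the text.

```python
def select_interpretation(ranked_vars, oob_error):
    """Nested models over the ranked variables; keep the smallest-error one."""
    errors = [oob_error(ranked_vars[:k]) for k in range(1, len(ranked_vars) + 1)]
    best_k = min(range(len(errors)), key=lambda k: errors[k]) + 1
    return ranked_vars[:best_k]

def select_prediction(interp_vars, oob_error, threshold):
    """Stepwise ascending introduction: a variable is kept only if the
    OOB error gain it brings exceeds the threshold."""
    selected = [interp_vars[0]]
    current_err = oob_error(selected)
    for v in interp_vars[1:]:
        err = oob_error(selected + [v])
        if current_err - err > threshold:
            selected.append(v)
            current_err = err
    return selected
```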
The next paragraph addresses this point, but we emphasize that we propose a heuristic strategy which is not supported by specific model hypotheses, but is based on data-driven thresholds for taking decisions.

Remark 4.1 Since we want to treat all situations in a unified way, we will use for finding prediction variables the somewhat crude strategy previously defined. Nevertheless, starting from the set of variables selected for interpretation (say of size K), a better strategy could be to examine all, or at least a large part, of the 2^K possible models and to select the variables of the model minimizing the OOB error. But this strategy quickly becomes unrealistic for high dimensional problems, so we prefer to experiment with a strategy designed for small n and large K, which is not conservative and may even lead to selecting fewer variables.

4.1.2 Starting example

To both illustrate and give more details about this procedure, we apply it to a simulated learning set of size n = 100 from the classification toys data model (see Section 2.3.2) with p = 200. The results are summarized in Figure 16. The true variables (1 to 6) are respectively represented by (✄, △, ◦, ⋆, ✁, ). We compute, from the learning set, 50 forests with ntree = 2000 and mtry = 100, which are values of the main parameters previously considered as well adapted for VI calculations.
Let us detail the main stages of the procedure together with the results obtained on the toys data:

Figure 16: Variable selection procedures for interpretation and prediction for toys data (panels: mean of importance, standard deviation of importance, OOB error of nested models, OOB error of predictive models).

- First, we rank the variables by sorting the VI in descending order. The result is drawn on the top left graph for the 50 most important variables (the other noisy variables also have an importance very close to zero). Note that the true variables are significantly more important than the noisy ones.
- We keep this order in mind and plot the corresponding standard deviations of VI. We use this graph to estimate a threshold for importance, and we keep only the variables whose importance exceeds this level. More precisely, we select the threshold as the minimum prediction value given by a CART model fitting this curve. This rule is in general conservative and leads to retaining more variables than necessary, in order to make a careful choice later. The standard deviations of VI can be found in the top right graph. We can see that the standard deviation of the true variables is large compared to that of the noisy variables, which is close to zero. The threshold leads to retaining 33 variables.
- Then, we compute the OOB error rates of random forests (using default parameters) for the nested models, starting from the one with only the most important variable and ending with the one involving all the important variables kept previously. The variables of the model leading to the smallest OOB error are selected.
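The CART-based threshold above can be illustrated with a deliberately simplified stand-in: a greedy piecewise-constant fit of the curve of VI standard deviations (full CART with pruning would refine the splits, so this is only a sketch of the idea, with names of our own choosing).

```python
def fit_piecewise_constant(y, min_size=2, tol=1e-8):
    """Recursively split a 1-D curve, CART-style, into constant pieces
    chosen to reduce squared error; return the fitted values."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)

    def fit(seg):
        best = None
        for cut in range(min_size, len(seg) - min_size + 1):
            gain = sse(seg) - sse(seg[:cut]) - sse(seg[cut:])
            if gain > tol and (best is None or gain > best[0]):
                best = (gain, cut)
        if best is None:                      # no worthwhile split: a leaf
            m = sum(seg) / len(seg)
            return [m] * len(seg)
        cut = best[1]
        return fit(seg[:cut]) + fit(seg[cut:])

    return fit(y)

def importance_threshold(vi_std):
    """Minimum predicted value of the fit to the VI standard deviations."""
    return min(fit_piecewise_constant(vi_std))
```

On a curve that drops from the level of the true variables to the level of the noisy ones, the threshold lands at the mean level of the flat noisy tail.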
Note that, in the bottom left graph, the error decreases quickly and reaches its minimum when the first 4 true variables are included in the model; then it remains constant. We select the model containing 4 of the 6 true variables. More precisely, we select the variables involved in the model almost leading to the smallest OOB error, i.e. the first model almost reaching the minimum (the actual minimum is reached with 24 variables). The expected behavior of the error is non-decreasing as soon as all the "true" variables have been selected, and it is then difficult to treat in a unified way nearly constant or slightly increasing error curves. In fact, we propose to use a heuristic rule similar to the 1 SE rule of Breiman et al. (1984) [6] used for selection in the cost-complexity pruning procedure.

We perform a sequential variable introduction with testing: a variable is added only if the error gain exceeds a threshold. The idea is that the error decrease must be significantly greater than the average variation obtained by adding noisy variables. The bottom right graph shows the result of this step: the final model for prediction purposes involves only variables 3, 6 and 5. The threshold is set to the mean of the absolute values of the first-order differenced errors between the model with 5 variables (the first model after the one we selected for interpretation, see the bottom left graph) and the last one.

It should be noted that if one wants to estimate the prediction error, since ranking and selection are made on the same set of observations, an error evaluation on a test set or using a cross-validation scheme should of course be preferred. This is taken into account in the next section, when our results are compared to others.
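The prediction threshold just described can be written compactly. In this sketch, `errors[k]` is assumed to hold the OOB error of the nested model with k + 1 variables, and `k_interp` the size of the interpretation set:

```python
def prediction_threshold(errors, k_interp):
    """Mean absolute first-order difference of the nested-model OOB errors,
    taken from the model just after the interpretation one up to the last:
    an estimate of the error variation caused by adding noisy variables.
    """
    tail = errors[k_interp:]  # starts at the model with k_interp + 1 variables
    diffs = [abs(b - a) for a, b in zip(tail, tail[1:])]
    return sum(diffs) / len(diffs)
```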
To evaluate the different prediction errors fairly, we prefer here to simulate a test set of the same size as the learning set. The test error rate with all (200) variables is about 6%, while the one with the 4 variables selected for interpretation is about 4.5%, a little bit smaller. The model with prediction variables 3, 6 and 5 reaches an error of 1%. Repeating the global procedure 10 times on the same data always gave the same interpretation set of variables and the same prediction set, in the same order.

4.1.3 Highly correlated variables

Let us now apply the procedure on toys data with replicated variables: a first group of variables highly correlated with variable 3 and a second one replicated from variable 6 (the most important variable of each group). The situations of interest are the same as those considered to produce Figure 14.

number of replications | interpretation set                     | prediction set
1                      | 3 7_3 2 6 5                            | 3 6 5
5                      | 3 2 7_3 10_3 6 11_3 5 12_6             | 3 6 5 10_3
10                     | 3 14_3 8_3 2 15_3 6 5 10_3 13_3 20_6   | 3 6 5 10_3

Table 3: Variable selection procedure in the presence of highly correlated variables (augmented toys data)

Let us comment on Table 3, where the expression i_j means that variable i is a replication of variable j. Interpretation sets do not contain all the variables of interest. In particular, we hardly keep replications of variable 6. The reason is that, even before adding noisy variables to the model, the error rate of the nested models does increase (or remain constant): when several highly correlated variables are added, the bias remains the same while the variance increases. However, the prediction sets are satisfactory: we always highlight variables 3 and 6, and at most one variable correlated with each of them.
Even if all the variables of interest do not appear in the interpretation set, they always appear in the first positions of our ranking according to importance. More precisely, the 16 most important variables in the case of 5 replications are: (3 2 7_3 10_3 6 11_3 5 12_6 8_3 13_6 16_6 1 15_6 14_6 9_3 4), and the 26 most important variables in the case of 10 replications are: (3 14_3 8_3 2 15_3 6 5 10_3 13_3 20_6 21_6 11_3 12_3 18_6 1 24_6 7_3 26_6 23_6 16_3 25_6 22_6 17_6 19_6 4 9_3). Note that the order of the true variables (3 2 6 5 1 4) remains the same in all situations.

4.2 Classification

4.2.1 Prostate data

We apply the variable selection procedure on Prostate data. The graphs of Figure 17 are obtained like those of Figure 16, except that for the RF procedure we use ntree = 2000 and mtry = p/3, and that for the bottom left graph we only plot the 100 most important variables for visibility. The procedure leads to the same picture as previously, except for the OOB rate along the nested models, which is less regular. The key point is that it selects 9 variables for interpretation and 6 variables for prediction. The number of selected variables is then very much smaller than p = 6033.

Figure 17: Variable selection procedures for interpretation and prediction for Prostate data.

In addition, to examine the variability of the interpretation and prediction sets, the global procedure is repeated five times on the entire Prostate dataset. The five prediction sets are very close to each other.
The number of prediction variables fluctuates between 6 and 10, and 5 variables appear in all sets. Among the five interpretation sets, 2 are identical and made of 9 variables, and the 3 others are made of 25 variables. The 9 variables of the smallest sets are present in all sets, and the biggest sets (of size 25) have 23 variables in common. So, although the sets of variables are not identical for each run of the procedure, they are not completely different. And in addition, the most important variables are included in all sets of variables.

4.2.2 High dimensional classification

We apply the global variable selection procedure to the high dimensional real datasets studied in Section 2.3.2, and we want to get an estimation of the prediction error rates. Since these datasets are of small size, we use 5-fold cross-validation to estimate the error rate. So we split the sample into 5 stratified parts; each part is successively used as a test set, and the remainder of the data is used as a learning set. Note that the set of selected variables varies from one fold to another. So, we give in Table 4 the misclassification error rate, given by the 5-fold cross-validation, for interpretation and prediction sets of variables respectively. The number in brackets is the average number of selected variables. In addition, one can find the original error, which stands for the misclassification rate given by the 5-fold cross-validation achieved with random forests using all variables. This error is calculated using the same partition into 5 parts, and again we use ntree = 2000 and mtry = p/3 for all datasets.

Dataset   | interpretation | prediction | original
Colon     | 0.16 (35)      | 0.20 (8)   | 0.14
Leukemia  | 0 (1)          | 0 (1)      | 0.02
Lymphoma  | 0.08 (77)      | 0.09 (12)  | 0.10
Prostate  | 0.085 (33)     | 0.075 (8)  | 0.07

Table 4: Variable selection procedure for four high dimensional real datasets.
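The stratified split described above can be sketched as follows (an illustrative helper of our own, not the authors' code; each fold approximately preserves the class proportions of the whole sample):

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Return a list of k folds (lists of indices), each approximately
    preserving the class proportions of `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):  # deal indices round-robin per class
            folds[pos % k].append(idx)
    return folds
```

Each fold then serves once as the test set, the remaining folds forming the learning set.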
In Table 4, the CV error rate is given together with, in brackets, the average number of selected variables. The number of interpretation variables is hugely smaller than p: at most tens, to be compared to thousands. The number of prediction variables is very small (always smaller than 12), and the reduction can be very important with respect to the interpretation set size. The errors for the two variable selection procedures are of the same order of magnitude as the original error (but a little bit larger).

We compare these results with those obtained by Ben Ishak and Ghattas (2008) (see tables 9 and 11 in [4]), who compared their method with 5 competitors (mentioned in the introduction) for classification problems on these four datasets. Error rates are comparable. With the prediction procedure, as already noted in the introductory remark, we always select fewer variables than their procedures (except for their method GLMpath, which selects fewer than 3 variables for all datasets).

4.3 Regression

4.3.1 A simulated dataset

We now apply the procedure to a simulated regression problem. Starting from the Friedman1 model and adding noisy variables as in Section 2.2.2, we construct a learning set of size n = 100 with p = 200 variables. Figure 18 displays the results of the procedure. The true variables of the model (1 to 5) are respectively represented by (✄, △, ◦, ⋆, ✁).

Figure 18: Variable selection procedures for interpretation and prediction for Friedman1 data.

The graphs are of the same kind as in the classification problems.
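For reference, the Friedman1 regression function (as implemented by mlbench.friedman1) is y = 10 sin(π x1 x2) + 20 (x3 − 0.5)² + 10 x4 + 5 x5 + ε, with the x_i uniform on [0, 1]. A quick simulation, sketched here in Python for illustration, shows why variable 3, although influential, is nearly uncorrelated with the response: its contribution is symmetric around x3 = 0.5.

```python
import math
import random

def friedman1(n=2000, seed=0):
    """Simulate the Friedman1 benchmark: only x1..x5 enter the response."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(5)] for _ in range(n)]
    Y = [10 * math.sin(math.pi * x[0] * x[1]) + 20 * (x[2] - 0.5) ** 2
         + 10 * x[3] + 5 * x[4] + rng.gauss(0, 1) for x in X]
    return X, Y

def corr(a, b):
    """Sample linear correlation between two sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = (sum((x - ma) ** 2 for x in a) / n) ** 0.5
    sb = (sum((y - mb) ** 2 for y in b) / n) ** 0.5
    return cov / (sa * sb)

X, Y = friedman1()
# the quadratic term 20 * (x3 - 0.5)**2 is symmetric around 0.5, so the
# linear correlation of x3 with y is close to 0, unlike that of x4
```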
Note that variable 3 is confused with noise and is not selected by the procedure. This is explained by the fact that it is hardly correlated with the response variable. The interpretation procedure selects the true variables except variable 3, together with two noisy variables, and the prediction set contains only the true variables (except variable 3). Again, the whole procedure is stable in the sense that several runs give the same set of selected variables. In addition, we simulate a test set of the same size as the learning set to estimate the prediction error. The test mean squared error with all variables is about 19.2, the one with the 6 variables selected for interpretation is 12.6, and the one with the 4 variables selected for prediction is 9.8.

4.3.2 Ozone data

Before ending the paper, let us apply the entire procedure to the ozone dataset. It consists of n = 366 observations of the daily maximum one-hour-average ozone, together with p = 12 meteorological explanatory variables. Let us first examine, in Figure 19, the VI obtained with the RF procedure using mtry = p/3 = 4 and ntree = 2000. From left to right, the 12 explanatory variables are: 1-Month, 2-Day of month, 3-Day of week, 5-Pressure height, 6-Wind speed, 7-Humidity, 8-Temperature (Sandburg), 9-Temperature (El Monte), 10-Inversion base height, 11-Pressure gradient, 12-Inversion base temperature, 13-Visibility.

Figure 19: Variable importance for Ozone data.

Three very sensible groups of variables appear, from the most to the least important. First, the two temperatures (8 and 9) and the inversion base temperature (12), known to be the best ozone predictors, together with the month (1), which is an important predictor since ozone concentration exhibits a heavy seasonal component.
Second, a group of clearly less important meteorological variables: pressure height (5), humidity (7), inversion base height (10), pressure gradient (11) and visibility (13). Finally, three unimportant variables: day of month (2), day of week (3), of course, and, more surprisingly, wind speed (6). This last fact is classical: wind enters the model only when ozone pollution arises; otherwise, wind and pollution are uncorrelated (see for example Chèze et al. (2003) [12], highlighting this phenomenon using partial estimators).

Let us now examine the results of the selection procedures.

Figure 20: Variable selection procedures for interpretation and prediction for Ozone data.

After the first elimination step, the 2 variables of negative importance are canceled, as expected. We therefore keep 10 variables for the interpretation step; the model with 7 variables is then selected, and it contains all the most important variables: (9 8 12 1 11 7 5). For the prediction procedure, the model is the same except that one more variable is eliminated: humidity (7). In addition, when different values of mtry are considered, the 4 most important variables (9 8 12 1) highlighted by the VI index are always selected and appear in the same order. Variable 5 also always appears, but another variable can appear before or after it.

5 Discussion

Of course, one of the main open issues about random forests is to elucidate from a mathematical point of view its exceptionally attractive performance.
In fact, only a small number of references deal with this very difficult challenge and, apart from the theoretical examination of bagging by Bühlmann and Yu (2002) [11], only purely random trees, a simple version of random forests, have been considered. Purely random trees were introduced by Cutler and Zhao (2001) [13] for classification problems and then studied by Breiman (2004) [9], but the results are somewhat preliminary. More recently, Biau et al. (2008) [5] obtained the first well-stated consistency-type results.

From a practical perspective, surprisingly, this simplified and essentially not data-driven strategy seems to perform well, at least for prediction purposes (see Cutler and Zhao 2001 [13]), and, of course, can be handled theoretically in an easier way. Nevertheless, it would be interesting to check that the same conclusions hold for the variable importance and variable selection tasks.

In addition, it could be interesting to examine some variants of random forests which, on the contrary, try to take into account more information. Let us give two ideas as examples. The first is about pruning: why is pruning not used for the individual trees? Of course, from the computational point of view the answer is obvious, and for prediction performance, averaging eliminates the negative effects of individual overfitting. But for the two other previously mentioned statistical problems, prediction and variable selection, the question remains unclear. The second remark is about the random feature selection step. The most widely used version of RF selects mtry input variables at random according to the discrete uniform distribution.
Two variants can be suggested: the first is to select random inputs according to a distribution coming from a preliminary ranking given by a pilot estimator; the second is to adaptively update this distribution, taking advantage of the ranking based on the current forest, which is then more and more accurate. These different future directions, both theoretical and practical, will be addressed in the next step of the work.

6 Appendix

In the sequel, information about datasets retrieved from the R package mlbench can be found in the corresponding description file.

Standard problems, n >> p:

Binary classification
– Real data sets (3)
  * Ionosphere (n = 351, p = 34)
  * Diabetes, PimaIndiansDiabetes2 (n = 768, p = 8)
  * Sonar (n = 208, p = 60)
  * Votes, HouseVotes84 (n = 435, p = 16)
– Simulated data sets (3)
  * Ringnorm, mlbench.ringnorm (n = 200, p = 20)
  * Threenorm, mlbench.threenorm (n = 200, p = 20)
  * Twonorm, mlbench.twonorm (n = 200, p = 20)

Multiclass classification
– Real data sets (3)
  * Glass (n = 214, p = 9, c = 6)
  * Letters, LetterRecognition (n = 20000, p = 16, c = 26)
  * Sat-images, Satellite (n = 6435, p = 36, c = 6)
  * Vehicle (n = 846, p = 18, c = 4)
  * Vowel (n = 990, p = 10, c = 11)
– Simulated data sets (3)
  * Waveform, mlbench.waveform (n = 200, p = 21, c = 3)

Regression
– Real data sets (3)
  * BostonHousing (n = 506, p = 13)
  * Ozone (n = 366, p = 12)
  * Servo (n = 167, p = 4)
– Simulated data sets (3)
  * Friedman1, mlbench.friedman1 (n = 300, p = 10)
  * Friedman2, mlbench.friedman2 (n = 300, p = 4)
  * Friedman3, mlbench.friedman3 (n = 300, p = 4)

High dimensional problems, n << p:

Binary classification
– Real data sets (4)
  * Adenocarcinoma (n = 76, p = 9868), see Ramaswamy et al. (2003) [33]
  * Colon (n = 62, p = 2000), see Alon et al. (1999) [1]
  * Leukemia (n = 38, p = 3051), see Golub et al.
(1999) [21]
  * Prostate (n = 102, p = 6033), see Singh et al. (2002) [35]
– Simulated data sets (5)
  * toys data (n = 100, 100 ≤ p ≤ 1000), see Weston et al. (2003) [39]

Multiclass classification
– Real data sets (4)
  * Brain (n = 42, p = 5597, c = 5), see Pomeroy et al. (2002) [31]
  * Breast, breast.3.class (n = 96, p = 4869, c = 3), see van't Veer et al. (2002) [38]
  * Lymphoma (n = 62, p = 4026, c = 3), see Alizadeh (2000) [2]
  * Nci (n = 61, p = 6033, c = 8), see Ross et al. (2000) [34]
  * Srbct (n = 63, p = 2308, c = 4), see Khan et al. (2001) [26]

Regression
– Real data sets (6)
  * PAC (n = 209, p = 467)
– Simulated data sets (3)
  * Friedman1, mlbench.friedman1 (n = 100, 100 ≤ p ≤ 1000)

(3) from the R package mlbench
(4) see http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html
(5) see description in section 2.3.2
(6) from the R package chemometrics

References

[1] Alon U., Barkai N., Notterman D.A., Gish K., Ybarra S., Mack D., and Levine A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA, Cell Biology, 96(12):6745-6750
[2] Alizadeh A.A. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503-511
[3] Archer K.J. and Kimes R.V. (2008) Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52:2249-2260
[4] Ben Ishak A. and Ghattas B. (2008) Sélection de variables en classification binaire : comparaisons et application aux données de biopuces. To appear, Revue SFDS-RSA
[5] Biau G., Devroye L., and Lugosi G. (2008) Consistency of random forests and other averaging classifiers.
Journal of Machine Learning Research, 9:2039-2057
[6] Breiman L., Friedman J.H., Olshen R.A., Stone C.J. (1984) Classification And Regression Trees. Chapman & Hall
[7] Breiman L. (1996) Bagging predictors. Machine Learning, 26(2):123-140
[8] Breiman L. (2001) Random Forests. Machine Learning, 45:5-32
[9] Breiman L. (2004) Consistency for a simple model of Random Forests. Technical Report 670, Berkeley
[10] Breiman L. and Cutler A. (2005) Random Forests. Berkeley, http://www.stat.berkeley.edu/users/breiman/RandomForests/
[11] Bühlmann P. and Yu B. (2002) Analyzing Bagging. The Annals of Statistics, 30(4):927-961
[12] Cheze N., Poggi J.M. and Portier B. (2003) Partial and Recombined Estimators for Nonlinear Additive Models. Statistical Inference for Stochastic Processes, Vol. 6, 2, 155-197
[13] Cutler A. and Zhao G. (2001) PERT - Perfect random tree ensembles. Computing Science and Statistics, 33:490-497
[14] Díaz-Uriarte R. and Alvarez de Andrés S. (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7:3, 1-13
[15] Dietterich T. (1999) An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, Boosting and randomization. Machine Learning, 1-22
[16] Dietterich T. (2000) Ensemble Methods in Machine Learning. Lecture Notes in Computer Science, 1857:1-15
[17] Efron B., Hastie T., Johnstone I., and Tibshirani R. (2004) Least angle regression. Annals of Statistics, 32(2):407-499
[18] Fan J. and Lv J. (2008) Sure independence screening for ultra-high dimensional feature space. J. Roy. Statist. Soc. Ser. B, 70:849-911
[19] Guyon I., Weston J., Barnhill S., and Vapnik V.N.
(2002) Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422
[20] Guyon I. and Elisseff A. (2003) An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182
[21] Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., and Lander E.S. (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537
[22] Grömping U. (2007) Estimators of Relative Importance in Linear Regression Based on Variance Decomposition. The American Statistician, 61:139-147
[23] Grömping U. (2006) Relative Importance for Linear Regression in R: The Package relaimpo. Journal of Statistical Software, 17, Issue 1
[24] Hastie T., Tibshirani R., Friedman J. (2001) The Elements of Statistical Learning. Springer
[25] Ho T.K. (1998) The random subspace method for constructing decision forests. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(8):832-844
[26] Khan J., Wei J.S., Ringner M., Saal L.H., Ladanyi M., Westermann F., Berthold F., Schwab M., Antonescu C.R., Peterson C., Meltzer P.S. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med, 7:673-679
[27] Kohavi R. and John G.H. (1997) Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2):273-324
[28] Liaw A. and Wiener M. (2002) Classification and Regression by randomForest. R News, 2(3):18-22
[29] Park M.Y. and Hastie T. (2007) An L1 regularization-path algorithm for generalized linear models. J. Roy. Statist. Soc. Ser. B, 69:659-677
[30] Poggi J.M. and Tuleau C.
(2006) Classification supervisée en grande dimension. Application à l'agrément de conduite automobile. Revue de Statistique Appliquée, LIV(4):39-58
[31] Pomeroy S.L., Tamayo P., Gaasenbeek M., Sturla L.M., Angelo M., McLaughlin M.E., Kim J.Y., Goumnerova L.C., Black P.M., Lau C., Allen J.C., Zagzag D., Olson J.M., Curran T., Wetmore C., Biegel J.A., Poggio T., Mukherjee S., Rifkin R., Califano A., Stolovitzky G., Louis D.N., Mesirov J.P., Lander E.S., Golub T.R. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415:436-442
[32] Rakotomamonjy A. (2003) Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3:1357-1370
[33] Ramaswamy S., Ross K.N., Lander E.S., Golub T.R. (2003) A molecular signature of metastasis in primary solid tumors. Nature Genetics, 33:49-54
[34] Ross D.T., Scherf U., Eisen M.B., Perou C.M., Rees C., Spellman P., Iyer V., Jeffrey S.S., de Rijn M.V., Waltham M., Pergamenschikov A., Lee J.C., Lashkari D., Shalon D., Myers T.G., Weinstein J.N., Botstein D., Brown P.O. (2000) Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24(3):227-235
[35] Singh D., Febbo P.G., Ross K., Jackson D.G., Manola J., Ladd C., Tamayo P., Renshaw A.A., D'Amico A.V., Richie J.P., Lander E.S., Loda M., Kantoff P.W., Golub T.R., and Sellers W.R. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1:203-209
[36] Strobl C., Boulesteix A.-L., Zeileis A. and Hothorn T. (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics, 8:25
[37] Strobl C., Boulesteix A.-L., Kneib T., Augustin T. and Zeileis A. (2008) Conditional variable importance for Random Forests. BMC Bioinformatics, 9:307
al [38] v an’t V eer L.J., Dai H., v an de Vijv er M.J., He Y.D., Hart A.A.M., Mao M., Peterse H.L., v a n der Ko oy K., Marton M.J ., Witteveen A.T., Schreiber G.J., Kerkhov en R.M., Rob erts C., Linsley P .S., Berna rds R., F riend S.H. (2002) Gene expr ession pr ofiling pr e dicts clinic al outc ome of br e ast c anc er . Nature, 41 5:530- 5 36 [39] W esto n J., Elisseff A., Sc ho elkopf B., and Tipping M. (200 3) Use of the zer o norm with line ar mo dels and kernel metho ds . J o urnal of Ma chine Learning Research, 3:1 439-1 4 61 INRIA Centre de recherche INRIA Saclay – Île-de-Fran ce Parc Orsay Uni versité - ZA C des V ignes 4, rue Jacques Monod - 9189 3 Orsay Cedex (France) Centre de recherc he INRIA Bordeaux – Sud Ouest : Domaine Uni versit aire - 351, cours de la Libération - 33405 T alenc e Cedex Centre de recherc he INRIA Grenoble – Rhône-Alpes : 655, a venue de l’Europe - 38334 Montbonnot Saint-Ismie r Centre de recherc he INRIA Lille – Nord Europe : Pa rc Scientifique de la H aute Borne - 40, a venue Halle y - 59650 V illene uve d’Ascq Centre de recherc he INRIA Nancy – Grand Est : LORIA, T echnopôl e de Nancy-Brab ois - Campus scientifique 615, rue du Jardin Botani que - BP 101 - 54602 V illers-lè s-Nancy Cedex Centre de recherc he INRIA Paris – Rocquenc ourt : Domaine de V olucea u - Rocquenco urt - BP 105 - 78153 Le Chesnay Cedex Centre de recherc he INRIA Rennes – Bretagne Atlantique : IRISA, Campus uni versi taire de Beaulieu - 35042 Rennes Cedex Centre de recherc he INRIA Sophia Antipolis – Méditerran ée : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipoli s Cedex Éditeur INRIA - Domaine de V olucea u - Rocquenc ourt, BP 105 - 78153 Le Chesnay Cedex (France) http://www.inria.fr ISSN 0249 -6399 10 0 10 1 10 2 150 200 250 300 350 400 450 500 550 mtry OOB Error PAC ntree=100 500 1000 5000 2 4 6 8 10 12 10 11 12 13 14 15 16 17 18 19 mtry OOB Error BostonHousing ntree=100 500 1000 2 4 6 8 10 20 21 22 23 24 25 26 Ozone 1 1.5 2 2.5 3 3.5 4 20 25 