Random Forests: some methodological insights
Authors: Robin Genuer (LM-Orsay), Jean-Michel Poggi (LM-Orsay), Christine Tuleau (JAD)
INRIA Research Report n° 6729, November 2008, 32 pages
Thème COG, Équipes-Projets Select
Centre de recherche INRIA Saclay - Île-de-France

Random Forests: some methodological insights

Robin Genuer*, Jean-Michel Poggi†*, Christine Tuleau‡

Abstract: This paper examines random forests from an experimental perspective. Random forests is the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. The paper first aims at confirming known but scattered advice for using random forests and at proposing some complementary remarks, both for standard problems and for high dimensional ones, in which the number of variables hugely exceeds the sample size. But the main contribution of this paper is twofold: to provide some insights about the behavior of the variable importance index based on random forests and, in addition, to investigate two classical issues of variable selection. The first one is to find important variables for interpretation, and the second one is more restrictive and tries to design a good prediction model. The strategy involves a ranking of explanatory variables using the random forests score of importance and a stepwise ascending variable introduction strategy.

Keywords: Random Forests, Regression, Classification, Variable Importance, Variable Selection.

* Université Paris-Sud, Mathématique, Bât.
425, 91405 Orsay, France
† Université Paris Descartes, France
‡ Université Nice Sophia-Antipolis, France

1 Introduction

Random forests (RF henceforth) is a popular and very efficient algorithm, based on model aggregation ideas, for both classification and regression problems, introduced by Breiman (2001) [8].
It belongs to the family of ensemble methods, which appeared in machine learning at the end of the nineties (see for example Dietterich (1999) [15] and (2000) [16]). Let us briefly recall the statistical framework by considering a learning set L = {(X_1, Y_1), ..., (X_n, Y_n)} made of n i.i.d. observations of a random vector (X, Y). The vector X = (X^1, ..., X^p) contains predictors or explanatory variables, say X ∈ R^p, and Y ∈ 𝒴, where 𝒴 is either a class label or a numerical response. For classification problems, a classifier t is a mapping t : R^p → 𝒴, while for regression problems we suppose that Y = s(X) + ε, where s is the so-called regression function. For more background on statistical learning, see Hastie et al. (2001) [24]. Random forests is a model building strategy providing estimators of either the Bayes classifier or the regression function.

The principle of random forests is to combine many binary decision trees built using several bootstrap samples coming from the learning sample L, choosing randomly at each node a subset of explanatory variables. More precisely, with respect to the well-known CART model building strategy (see Breiman et al. (1984) [6]), which performs a growing step followed by a pruning one, two differences can be noted. First, at each node, a given number (denoted by mtry) of input variables are randomly chosen and the best split is calculated only within this subset. Second, no pruning step is performed, so all the trees are maximal trees. In addition to CART, another well-known related tree-based method must be mentioned: bagging (see Breiman (1996) [7]). Indeed, random forests with mtry = p reduce simply to unpruned bagging. The associated R (see http://www.r-project.org/) packages are respectively randomForest (intensively used in the sequel of the paper), rpart and ipred for CART and bagging (cited here for the sake of completeness).
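To fix ideas, the two sources of randomness just described, bootstrap resampling of L and the random choice of mtry candidate variables at each node, can be sketched in a few lines. The sketch below is an illustration in Python (the paper itself relies on the R randomForest package); the function names are ours.

```python
import random

def bootstrap_and_oob(n, rng):
    """Draw a bootstrap sample of size n from {0, ..., n-1}; the
    observations never drawn form the out-of-bag (OOB) sample."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

def node_candidates(p, mtry, rng):
    """At each node, only mtry randomly chosen input variables
    compete for the best split."""
    return rng.sample(range(p), mtry)

rng = random.Random(42)
in_bag, oob = bootstrap_and_oob(100, rng)
# On average, a fraction (1 - 1/n)^n ≈ 36.8% of the observations is out of bag.
```

Each tree of the forest gets its own bootstrap sample and hence its own OOB sample; the latter is reused below for error estimation and variable importance.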
The RF algorithm is becoming more and more popular and appears to be very powerful in a lot of different applications (see for example Díaz-Uriarte and Alvarez de Andrés (2006) [14] for gene expression data analysis), even if it is not clearly elucidated from a mathematical point of view (see the recent paper by Biau et al. (2008) [5], and Bühlmann, Yu (2002) [11] for bagging). Nevertheless, Breiman (2001) [8] sketches an explanation of the good performance of random forests related to the good quality of each tree (at least from the bias point of view) together with the small correlation among the trees of the forest, where the correlation between trees is defined as the ordinary correlation of predictions on so-called out-of-bag (OOB henceforth) samples. The OOB sample, which is the set of observations not used for building the current tree, is used to estimate the prediction error and then to evaluate variable importance.

Tuning method parameters

It is now classical to distinguish two typical situations depending on n, the number of observations, and p, the number of variables: standard (for n >> p) and high dimensional (when n << p). The first question for anyone trying to use random forests in practice is to get information about sensible values for the two main parameters of the method. Essentially, the studies carried out in the two papers [8] and [14] give interesting insights, but Breiman focuses on standard problems while Díaz-Uriarte and Alvarez de Andrés concentrate on high dimensional classification ones.

So the first objective of this paper is to give compact information about selected benchmark datasets and to examine again the choice of the method parameters, addressing the different situations more closely.
RF variable importance

The quantification of variable importance (VI henceforth) is an important issue in many applied problems, complementing variable selection by interpretation issues. In the linear regression framework it is examined for example by Grömping (2007) [22], who distinguishes between various variance decomposition based indicators: "dispersion importance", "level importance" or "theoretical importance", quantifying explained variance or changes in the response for a given change of each regressor. Various ways to define and compute such indicators using R are available (see Grömping (2006) [23]).

In the random forests framework, the most widely used score of importance of a given variable is the increase in mean error of a tree (MSE for regression and misclassification rate for classification) in the forest when the observed values of this variable are randomly permuted in the OOB samples. Such random forests VI is often called a permutation importance index, as opposed to the total decrease of node impurity measures already introduced in the seminal book about CART by Breiman et al. (1984) [6].

Even if only little investigation is available about RF variable importance, some interesting facts have been collected for classification problems. This index can be based on the average loss of another criterion, like the Gini entropy used for growing classification trees. Let us cite two remarks. The first one is that the RF Gini importance is unfairly biased in favor of predictor variables with many categories, while the RF permutation importance is a more reliable indicator (see Strobl et al. (2007) [36]). So we restrict our attention to the latter.
The second one is that permutation importance seems to overestimate the variable importance of highly correlated variables, for which Strobl et al. (2008) [37] propose a conditional variant. Let us mention that, in this paper, we do not notice such a phenomenon. For classification problems, Ben Ishak, Ghattas (2008) [4] and Díaz-Uriarte, Alvarez de Andrés (2006) [14], for example, use RF variable importance and note that it is stable for correlated predictors, scale invariant and stable with respect to small perturbations of the learning sample. But these preliminary remarks need to be extended, and the recent paper by Archer et al. (2008) [3], focusing more specifically on the VI topic, does not answer some crucial questions about variable importance behavior, such as the importance of a group of variables or its behavior in the presence of highly correlated variables. This is the second goal of this paper.

Variable selection

Many variable selection procedures are based on the cooperation of variable importance for ranking and model estimation to evaluate and compare a family of models. Three types of variable selection methods are distinguished (see Kohavi et al. (1997) [27] and Guyon et al. (2003) [20]): "filter", for which the score of variable importance does not depend on a given model design method; "wrapper", which includes the prediction performance in the score calculation; and finally "embedded", which combines variable selection and model estimation more intricately. For non-parametric models, only a small number of methods are available, especially for the classification case. Let us briefly mention some of them, which are potentially competing tools.
Of course we must first mention the wrapper methods based on VI coming from CART, see Breiman et al. (1984) [6], and, of course, random forests, see Breiman (2001) [8]. Then some examples of embedded methods: Poggi, Tuleau (2006) [30] propose a method based on CART scores, using a stepwise ascending procedure with an elimination step; Guyon et al. (2002) [19] (and Rakotomamonjy (2003) [32]) propose SVM-RFE, a method based on SVM scores using descending elimination. More recently, Ben Ishak et al. (2008) [4] propose a stepwise variant, while Park et al. (2007) [29] propose a "LARS"-type strategy (see Efron et al. (2004) [17]) for classification problems.

Let us recall that two distinct objectives of variable selection can be identified: (1) to find important variables highly related to the response variable for interpretation purposes; (2) to find a small number of variables sufficient for a good prediction of the response variable. The key tool for task 1 is thresholding variable importance, while the crucial point for task 2 is to combine variable ranking and stepwise introduction of variables into a prediction model building procedure. This introduction could be ascending, in order to avoid selecting redundant variables, or, for the case n << p, first descending to reach a classical situation n ∼ p, and then ascending using the first strategy, see Fan, Lv (2008) [18]. We propose in this paper a two-step procedure, whose first step is common while the second one depends on the objective: interpretation or prediction.

The paper is organized as follows. After this introduction, Section 2 focuses on random forests parameters. Section 3 proposes to study the behavior of the RF variable importance index. Section 4 investigates the two classical issues of variable selection using the random forests based score of importance.
Section 5 finally opens a discussion about future work.

2 Selecting method parameters

2.1 Experimental framework

2.1.1 RF procedure

The R package about random forests is based on the seminal contribution of Breiman and Cutler [10] and is described in Liaw, Wiener (2002) [28]. In this paper, we focus on the randomForest procedure. The two main parameters are mtry, the number of input variables randomly chosen at each split, and ntree, the number of trees in the forest (in all the paper, mtry = m with m ∈ R stands for mtry = ⌊m⌋). A third parameter, denoted by nodesize, allows one to specify the minimum number of observations in a node. We retain the default value of this parameter (1 for classification and 5 for regression) for all of our experiments, since it is close to the maximal tree choice.

2.1.2 OOB error

In this section, we concentrate on the prediction performance of RF, focusing on the out-of-bag (OOB) error (see [8]). We use this kind of prediction error estimate for three reasons: the main one is that we are mainly interested in comparing results instead of assessing models; the second is that it gives a fair estimate compared to the usual alternative test set error, even if it is considered a little bit optimistic; and the last one, but not the least, is that it is a default output of the procedure. To avoid insignificant sampling effects, each OOB error is actually the mean of the OOB error over 10 runs.

2.1.3 Datasets

We have collected information about the datasets considered in this paper: the name, the name of the corresponding data structure (when different), n, p, the number of classes c in the multiclass case, a reference, a website or a package. The two next tables contain synthetic information while details are postponed to the Appendix.
We distinguish standard and high dimensional situations and, in addition, the three problems: regression, 2-class classification and multiclass classification. Table 1 displays some information about the standard problem datasets: classification at the top and regression at the bottom.

Name            Observations   Variables   Classes
Ionosphere           351           34         2
Diabetes             768            8         2
Sonar                208           60         2
Votes                435           16         2
Ringnorm             200           20         2
Threenorm            200           20         2
Twonorm              200           20         2
Glass                214            9         6
Letters            20000           16        26
Sat-images          6435           36         6
Vehicle              846           18         4
Vowel                990           10        11
Waveform             200           21         3
BostonHousing        506           13
Ozone                366           12
Servo                167            4
Friedman1            300           10
Friedman2            300            4
Friedman3            300            4

Table 1: Standard problems: datasets for classification at the top, and for regression at the bottom

Table 2 displays the high dimensional problem datasets: classification at the top and regression at the bottom.

Name            Observations   Variables     Classes
Adenocarcinoma        76          9868          2
Colon                 62          2000          2
Leukemia              38          3051          2
Prostate             102          6033          2
Brain                 42          5597          5
Breast                96          4869          3
Lymphoma              62          4026          3
Nci                   61          6033          8
Srbct                 63          2308          4
toys data            100    100 to 1000         2
PAC                  209           467
Friedman1            100    100 to 1000
Friedman2            100    100 to 1000
Friedman3            100    100 to 1000

Table 2: High dimensional problems: datasets for classification at the top, and for regression at the bottom

2.2 Regression

About regression problems, even if it seems at first inspection that the seminal paper by Breiman [8] closes the debate about good advice, it remains that the experimental results concern a variant which is not implemented in the universally used R package.
Moreover, except for this reference, to our knowledge no such general paper is available, so we develop Breiman's study again, both for real and simulated data corresponding to the case n >> p, and we provide some additional study of data corresponding to the case n << p (such examples typically come from chemometrics). We observe that the default value of mtry proposed by the R package is not optimal, and that there is no improvement in using random forests with respect to unpruned bagging (obtained for mtry = p).

2.2.1 Standard problems

Let us briefly examine standard (n >> p) regression datasets: in Figure 1 for real ones and in Figure 2 for simulated ones. Each plot gives, for mtry = 1 to p, the OOB error for three different values of ntree = 100, 500 and 1000. The vertical solid line indicates the value mtry = p/3, the default value proposed by the R package for regression problems, the vertical dashed line being the value mtry = √p.

Three remarks can be formulated. First, the OOB error is maximal for mtry = 1 and then decreases quickly (except for the Ozone dataset, for reasons not clearly elucidated); then, as soon as mtry > √p, the error remains the same. Second, the choice mtry = √p always gives a lower OOB error than mtry = p/3, and the gain can be important. So the default value proposed by the R package often seems not to be optimal, especially when ⌊p/3⌋ = 1. Lastly, the default value ntree = 500 is convenient, but a much smaller one, ntree = 100, leads to comparable results.
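The mtry/ntree/OOB machinery underlying these experiments can be made concrete with a deliberately simplified forest of regression stumps in pure Python. This is not the randomForest algorithm (real RF trees are grown maximal, and the mtry candidates are redrawn at every node, whereas here they are drawn once per stump); the function names are ours, and continuous predictors are assumed.

```python
import random
import statistics

def best_stump(X, y, feat_idx):
    """Exhaustive search for the split (feature, threshold) minimising the
    sum of squared errors, among the candidate features in feat_idx only."""
    best, best_sse = None, float("inf")
    for j in feat_idx:
        values = sorted(set(row[j] for row in X))
        for k in range(len(values) - 1):
            thr = 0.5 * (values[k] + values[k + 1])
            left = [y[i] for i, row in enumerate(X) if row[j] <= thr]
            right = [y[i] for i, row in enumerate(X) if row[j] > thr]
            ml, mr = statistics.fmean(left), statistics.fmean(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if sse < best_sse:
                best_sse, best = sse, (j, thr, ml, mr)
    return best

def fit_forest(X, y, ntree=50, mtry=2, rng=None):
    """Each 'tree' is a single stump fitted on a bootstrap sample, with mtry
    candidate variables (drawn once here, per node in the real algorithm)."""
    rng = rng or random.Random(0)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(ntree):
        boot = [rng.randrange(n) for _ in range(n)]
        oob = frozenset(range(n)) - frozenset(boot)
        Xb, yb = [X[i] for i in boot], [y[i] for i in boot]
        forest.append((best_stump(Xb, yb, rng.sample(range(p), mtry)), oob))
    return forest

def oob_error(forest, X, y):
    """Mean squared OOB error: each observation is predicted only by the
    trees for which it was out of bag."""
    errs = []
    for i, row in enumerate(X):
        preds = [ml if row[j] <= thr else mr
                 for (j, thr, ml, mr), oob in forest if i in oob]
        if preds:
            errs.append((y[i] - statistics.fmean(preds)) ** 2)
    return statistics.fmean(errs)
```

Sweeping mtry from 1 to p and recording oob_error gives curves of the same kind as those discussed above, up to the stump simplification.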
Figure 1: Standard regression: 3 real datasets (BostonHousing, Ozone, Servo)

Figure 2: Standard regression: 3 simulated datasets (Friedman1, Friedman2, Friedman3)

So, for standard (n >> p) regression problems, it seems that there is no improvement in using random forests with respect to unpruned bagging (obtained for mtry = p).

2.2.2 High dimensional problems

Let us start with a simulated dataset for the high dimensional case n << p. This example is built by adding extra noisy variables (independent and uniformly distributed on [0, 1]) to the Friedman1 model, defined by:

Y = 10 sin(π X1 X2) + 20 (X3 − 0.5)^2 + 10 X4 + 5 X5 + ε

where X1, ..., X5 are independent and uniformly distributed on [0, 1] and ε ∼ N(0, 1). So we have 5 variables related to the response Y, the others being noise. We set n = 100 and let p vary.

Figure 3: High dimensional regression simulated dataset: Friedman1. The x-axis is in log scale

Figure 3 contains four plots corresponding to 4 values of p (100, 200, 500 and 1000), increasing the nuisance space dimension. Each plot gives, for ten values of mtry (1, √p/2, √p, 2√p, 4√p, p/4, p/3, p/2, 3p/4, p), the OOB error for three different values of ntree = 100, 500 and 1000.
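The Friedman1 design just described is easy to reproduce. The sketch below (pure Python, our own function name) draws n observations with p predictors, only the first 5 of which enter the response, the remaining p − 5 columns playing the role of the added noise variables.

```python
import math
import random

def friedman1(n, p, rng):
    """Friedman1 data with p - 5 pure-noise variables appended
    (all predictors i.i.d. uniform on [0, 1], noise eps ~ N(0, 1))."""
    X, y = [], []
    for _ in range(n):
        row = [rng.random() for _ in range(p)]
        eps = rng.gauss(0, 1)
        y.append(10 * math.sin(math.pi * row[0] * row[1])
                 + 20 * (row[2] - 0.5) ** 2
                 + 10 * row[3] + 5 * row[4] + eps)
        X.append(row)
    return X, y
```

Calling friedman1(100, p, rng) for p in {100, 200, 500, 1000} reproduces the experimental setting of Figure 3.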
The x-axis is in log scale and the vertical solid line indicates mtry = p/3, the default value proposed by the R package for regression, the vertical dashed line being the value mtry = √p.

Let us give four comments. All curves have the same shape: the OOB error decreases while mtry increases. As p increases, both the OOB error of unpruned bagging (obtained with mtry = p) and that of random forests with the default value of mtry increase, but unpruned bagging performs better than RF (about 25% of improvement). The choice mtry = √p always gives worse results than those obtained for mtry = p/3. Finally, the default choice ntree = 500 is convenient, but a much smaller one, ntree = 100, leads to comparable results.

Figures 4 and 5 show the results of the same study for the Friedman2 and Friedman3 models. The previous comments remain valid. Let us just note that the difference between unpruned bagging and random forests with the default mtry value is even more pronounced for these two problems.

Figure 4: High dimensional regression simulated dataset: Friedman2. The x-axis is in log scale

Figure 5: High dimensional regression simulated dataset: Friedman3. The x-axis is in log scale

To end, let us now examine the high dimensional real dataset PAC. Figure 6 gives, for the same ten values of mtry, the OOB error for four different values of ntree = 100, 500, 1000 and 5000 (the x-axis is in log scale). The general behavior is similar except for the shape: as soon as mtry > √p, the error remains the same instead of still decreasing. The difference in the shape of the curves between simulated and real datasets can be explained by the fact that, in the simulated datasets we considered, the number of true variables is very small compared to the total number of variables. One may expect that in real datasets the proportion of true variables is larger.

So, for high dimensional (n << p) regression problems, unpruned bagging seems to perform better than random forests, and the difference can be large.

Figure 6: High dimensional regression: PAC data. The x-axis is in log scale

2.3 Classification

About standard classification problems, we check that Breiman's conclusions remain valid for the considered variant and that the mtry default value proposed in the R package is good. However, for high dimensional classification problems, we observe that larger values of mtry sometimes give much better results.

2.3.1 Standard problems

For classification problems for which n >> p, again the paper by Breiman is interesting and we just quickly check the conclusions. Let us first examine in Figure 7 standard (n >> p) classification real datasets. Each plot gives, for mtry = 1 to p, the OOB error for three different values of ntree = 100, 500 and 1000. The vertical solid line indicates the value mtry = √p, the default value proposed by the R package for classification.

Three remarks can be formulated. The default value mtry = √p is convenient for all the examples.
The default value ntree = 500 is sufficient, whereas a much smaller one, ntree = 100, is not convenient and can lead to significantly larger errors. The general shape is the following: the errors for mtry = 1 and for mtry = p (corresponding to unpruned bagging) are of the same "large" order of magnitude, and the minimum is reached for the value √p. The gain can be about 30 to 50%. So, for these 9 examples, the default value proposed by the R package is quite optimal.

Figure 7: Standard classification: 9 real datasets

Let us now examine in Figure 8 standard (n >> p) classification simulated datasets. As can be seen, ntree = 500 is sufficient and, except for Ringnorm, already pointed out as a somewhat special dataset (see Cutler, Zhao (2001) [13]), the value mtry = √p is good. Here, the general shape of the error curve is quite different compared to the real datasets: the error increases with mtry. So for these four examples, the smaller mtry, the better.

Figure 8: Standard classification: 4 simulated datasets

2.3.2 High dimensional problems

Let us now consider the case n << p, for which Díaz-Uriarte and Alvarez de Andrés (2006) [14] give numerous pieces of advice. We complete the study by trying larger values of mtry, which give interesting results. One can find in Figure 9 the OOB errors for nine high dimensional real datasets. Each plot gives, for nine values of mtry (1, √p/2, √p, 2√p, 4√p, p/4, p/2, 3p/4, p), the OOB error for four different values of ntree = 100, 500, 1000 and 5000. The x-axis is in log scale. The vertical solid line indicates the default value proposed by the R package, mtry = √p. Again the default value ntree = 500 is sufficient and, on the contrary, the value ntree = 100 can lead to significantly larger errors.
The general shape is the following: the error decreases in general, and the minimum value is obtained by, or is close to the one reached using, mtry = p (corresponding to unpruned bagging).

Figure 9: High dimensional classification: 9 real datasets. The x-axis is in log scale

The difference with standard problems is notable; the reason is that when p is large, mtry must be sufficiently large in order to have a high probability of capturing important variables (that is, variables highly related to the response) for defining the splits of the RF. In addition, let us mention that the default value mtry = √p is still reasonable from the OOB error viewpoint and, of course, since √p is small with respect to p, it is a very attractive value from a computational perspective (notice that the trees are not too deep since n is not too large).

Let us examine a simulated dataset for the case n << p, introduced by Weston et al. (2003) [39], called "toys data" in the sequel. It is an equiprobable two-class problem, Y ∈ {−1, 1}, with 6 true variables, the others being noise. This example is interesting since it constructs two nearly independent groups of 3 significant variables (highly, moderately and weakly correlated with the response Y) and an additional group of noise variables, uncorrelated with Y.
A forward reference to the plots on the left side of Figure 11 allows one to see the variable importance picture and to note that the importance of variables 1 to 3 is much higher than that of variables 4 to 6. More precisely, the model is defined through the conditional distribution of the X^i given Y = y:

- for 70% of the data, X^i ∼ y·N(i, 1) for i = 1, 2, 3 and X^i ∼ y·N(0, 1) for i = 4, 5, 6;
- for the remaining 30%, X^i ∼ y·N(0, 1) for i = 1, 2, 3 and X^i ∼ y·N(i − 3, 1) for i = 4, 5, 6;
- the other variables are noise, X^i ∼ N(0, 1) for i = 7, ..., p.

After simulation, the obtained variables are standardized. Let us fix n = 100. The plots of Figure 10 are organized as previously; four values of p are considered, 100, 200, 500 and 1000, corresponding to increasing nuisance space dimension.

Figure 10: High dimensional classification simulated dataset: toys data for 4 values of p. The x-axis is in log scale

For p = 100 and p = 200, the error decreases hugely until mtry reaches √p and then remains constant, so the default values work well and perform as well as unpruned bagging, even if the true dimension p̃ = 6 << p. For larger values of p (p ≥ 500), the shape of the curve is close to the one for the high dimensional real datasets (the error decreases and the minimum is reached when mtry = p). Hence, the error reached by using random forests with the default mtry is about 70% to 150% larger than the error reached by unpruned bagging, which is close to 3% for all the considered values of p.
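The toys data generator can be sketched as follows (pure Python, our own function name). Two reading assumptions are made explicit here: the 70/30 split is drawn as an independent Bernoulli(0.7) per observation (one reading of "for 70% of the data"), and X^i ∼ y·N(m, 1) is taken to mean multiplying a N(m, 1) draw by the label y.

```python
import random
import statistics

def toys_data(n, p, rng):
    """'toys data' (Weston et al., 2003): 6 informative variables,
    p - 6 pure-noise variables, then empirical standardization."""
    X, y = [], []
    for _ in range(n):
        lab = rng.choice([-1, 1])           # equiprobable two-class label
        row = [0.0] * p
        if rng.random() < 0.7:              # Bernoulli(0.7) group membership
            for i in range(3):              # variables 1-3: means 1, 2, 3
                row[i] = lab * rng.gauss(i + 1, 1)
            for i in range(3, 6):           # variables 4-6: mean 0
                row[i] = lab * rng.gauss(0, 1)
        else:
            for i in range(3):              # variables 1-3: mean 0
                row[i] = lab * rng.gauss(0, 1)
            for i in range(3, 6):           # variables 4-6: means 1, 2, 3
                row[i] = lab * rng.gauss(i - 2, 1)
        for i in range(6, p):               # noise variables
            row[i] = rng.gauss(0, 1)
        X.append(row)
        y.append(lab)
    for j in range(p):                      # standardize each variable
        col = [row[j] for row in X]
        mu, sd = statistics.fmean(col), statistics.pstdev(col)
        for row in X:
            row[j] = (row[j] - mu) / sd
    return X, y
```

With n = 100 and p ranging over 100 to 1000, this matches the experimental setting used for Figure 10.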
Finally, for high dimensional classification problems, our conclusion is that it may be worthwhile to choose mtry larger than the default value √p.

After this section focusing on prediction performance, let us now turn to the second attractive feature of RF: the variable importance index.

3 Variable importance

The quantification of variable importance (abbreviated VI) is a crucial issue, not only for ranking the variables before a stepwise estimation model but also for interpreting data and understanding underlying phenomena in many applied problems. In this section, we examine the RF variable importance behavior according to three different issues. The first one deals with the sensitivity to the sample size n and the number of variables p. The second examines the sensitivity to the method parameters mtry and ntree. The last one deals with the variable importance of a group of variables, highly correlated or poorly correlated, together with the problem of correct identification of irrelevant variables.

As a result, a good choice of the parameters of RF can help to better discriminate between important and useless variables. In addition, it can increase the stability of VI scores.

To illustrate this discussion, let us consider the toys data introduced in Section 2.3.2 and compute the variable importance. Recall that only the first 6 variables are of interest and the others are noise.

Remark 3.1. Let us mention that variable importance is computed conditionally on a given realization, even for simulated datasets. This choice, which is criticizable if the objective is to reach a good estimation of an underlying constant, is consistent with the idea of staying as close as possible to the experimental situation of dealing with a given dataset.
In addition, the number of permutations of the observed values in the OOB sample, used to compute the score of importance, is set to the default value 1.

3.1 Sensitivity to n and p

Figure 11 illustrates the behavior of variable importance for several values of n and p. Parameters ntree and mtry are set to their default values. Boxplots are based on 50 runs of the RF algorithm and, for visibility, we plot the variable importance only for a few variables.

Figure 11: Variable importance sensitivity to n and p (toys data); one panel per (n, p) pair, with n ∈ {500, 100} and p ∈ {6, 200, 500}.

On each row, the first plot is the reference one, for which we observe a convenient picture of the relative importance of the initial variables. Then, when p increases tremendously, we try to check whether: (1) the situation between the two groups remains readable; (2) the situation within each group is stable; (3) the importance of the additional dummy variables is close to 0. The situation n = 500 (graphs at the top of the figure) corresponds to an "easy" case, where a lot of data are available, and n = 100 (graphs at the bottom) to a harder one. For each value of n, three values of p are considered: 6, 200 and 500. When p = 6, only the 6 true variables are present. Then two very difficult situations are considered: p = 200, with a lot of noisy variables, and p = 500, which is even harder. Graphs are truncated after the 16th variable for readability (the importance of the remaining noisy variables is of the same order of magnitude as the last plotted).
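As a reminder, the score of importance used throughout this section is the permutation importance: permute the values of a variable in the OOB sample and measure the induced increase of error. Here is a minimal model-agnostic sketch; the single permutation matches the default mentioned above, while `predict` and the data are placeholders (in the RF setting, `predict` would be a single tree and (X, Y) its OOB sample, the score being averaged over trees).

```python
import random

def permutation_importance(predict, X, Y, var, seed=0):
    """Increase in error rate when the values of variable `var` are permuted.

    `predict` maps a list of rows to a list of labels.
    """
    rng = random.Random(seed)
    base_err = sum(p != y for p, y in zip(predict(X), Y)) / len(Y)
    shuffled = [row[:] for row in X]
    perm = [row[var] for row in X]
    rng.shuffle(perm)                 # one permutation, the default value
    for row, v in zip(shuffled, perm):
        row[var] = v
    perm_err = sum(p != y for p, y in zip(predict(shuffled), Y)) / len(Y)
    return perm_err - base_err        # averaged over trees in a forest
```

A variable the predictor ignores gets an importance of exactly zero, since permuting it cannot change any prediction.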
Let us comment on the graphs of the first row (n = 500). When p = 6, we obtain concentrated boxplots and the order is clear, variables 2 and 6 having nearly the same importance. When p increases, the order of magnitude of importance decreases. The order within the two groups of variables (1, 2, 3 and 4, 5, 6) remains the same, while the overall order is modified (variable 6 is now less important than variable 2). In addition, variable importance is more unstable for huge values of p. But what is remarkable is that all noisy variables have a zero VI, so one can easily recover the variables of interest.

In the second row (n = 100), we note a greater instability, since the number of observations is only moderate, but the variable ranking remains roughly the same. What differs is that, in the difficult situations (p = 200, 500), the importance of some noisy variables increases, and for example variable 4 cannot be distinguished from noise (nor even variable 5 in the bottom right graph). This is due to the decreasing behavior of VI as p grows, coming from the fact that when p = 500 the algorithm randomly chooses only 22 variables at each split (with the default mtry value). The probability of choosing one of the 6 true variables is really small, and the less a variable is chosen, the less it can be considered important. In addition, let us remark that the variability of VI is large for true variables with respect to useless ones. This remark can be used to build some kind of test for VI (see Strobl et al. (2007) [36]), but of course ranking is better suited for variable selection.

We now study how this VI index behaves when changing the values of the main method parameters.
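The combinatorial point above can be checked directly. With mtry = 22 candidate variables drawn without replacement out of p = 500, the chance that a given split even sees one of the 6 true variables stays below a quarter (an illustrative side computation, not from the paper):

```python
from math import comb

def prob_true_variable_seen(p=500, n_true=6, mtry=22):
    """P(at least one of the n_true informative variables is among the
    mtry candidate variables drawn without replacement at a split)."""
    return 1 - comb(p - n_true, mtry) / comb(p, mtry)
```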
In Figure 12 we plot the variable importance obtained using three values of mtry (14, the default, 100 and 200) and two values of ntree (500, the default, and 2000).

Figure 12: Variable importance sensitivity to mtry and ntree (toys data); one panel per (ntree, mtry) pair, with ntree ∈ {500, 2000} and mtry ∈ {14, 100, 200}.

The effect of taking a larger value for mtry is obvious. Indeed, the magnitude of VI is more than doubled going from mtry = 14 to mtry = 100, and it increases again with mtry = 200. The effect of ntree is less visible, but taking ntree = 2000 leads to better stability. What is interesting in the bottom right graph is that we get the same order for all true variables in every run of the procedure. In the top left situation, the mean OOB error rate is about 5%, and in the bottom right one it is 3%. The gain in error may not be considered large, but what we gain in VI is interesting.

3.3 Sensitivity to highly correlated predictors

Let us address an important issue: how does variable importance behave in the presence of several highly correlated variables? We take as basic framework the previous context with n = 100, p = 200, ntree = 2000 and mtry = 100. Then we add to the dataset highly correlated replications of some of the 6 true variables. The replicates are inserted between the true variables and the useless ones.
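The paper does not spell out its recipe for generating such replicates. A standard recipe, assuming the original variable is standardized, mixes it with fresh noise so that the result has (approximately) the target correlation ρ; this sketch is ours, not the authors' code:

```python
import random

def correlated_replicate(x, rho=0.9, seed=0):
    """Return a copy of the standardized variable x with correlation ~rho.

    If x ~ N(0, 1) and eps ~ N(0, 1) independently, then
    rho * x + sqrt(1 - rho**2) * eps has unit variance and correlation
    rho with x.
    """
    rng = random.Random(seed)
    w = (1 - rho ** 2) ** 0.5
    return [rho * v + w * rng.gauss(0, 1) for v in x]
```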
Figure 13: Variable importance of a group of correlated variables (augmented toys data).

The first graph of Figure 13 is the reference one: the situation is the same as previously. Then, for the three other cases, we simulate 1, 10 and 20 variables with a correlation of 0.9 with variable 3 (the most important one). These replications are plotted between the two vertical lines.

The magnitude of importance of the group 1, 2, 3 steadily decreases when adding more replications of variable 3. On the other hand, the importance of the group 4, 5, 6 is unchanged. Notice that the importance is not divided by the number of replications. Indeed, in our example, even with 20 replications the maximum importance of the group containing variable 3 (that is, variables 1, 2, 3 and all replications of variable 3) is only three times lower than the initial importance of variable 3. Finally, note that even if some variables in this group have low importance, they cannot be confused with noise.

Let us briefly comment on similar experiments (see Figure 14), perturbing the basic situation by introducing highly correlated versions not only of the third variable but also of the sixth, that is, replicating the most important variable of each group.
Figure 14: Variable importance of two groups of correlated variables (augmented toys data).

Again, the first graph is the reference one. Then we simulate 1, 5 and 10 variables with correlation about 0.9 with variable 3, and the same with variable 6. Replications of variable 3 are plotted between the first vertical line and the dashed line, and replications of variable 6 between the dashed line and the second vertical line. The magnitude of importance of each group (1, 2, 3 and 4, 5, 6 respectively) steadily decreases when adding more replications. The relative importance between the two groups is preserved. And the relative importance between the two groups of replications is of the same order as the one between the two initial groups.

3.4 Prostate data variable importance

To end this section, we illustrate the behavior of variable importance on a high dimensional real dataset: the microarray data called Prostate. The global picture is the following: two hugely important variables, about twenty moderately important variables, and the others of small importance. More precisely, Figure 15 compares the VI obtained with parameters set to their default values (graphs of the left column) and those obtained with ntree = 2000 and mtry = p/3 (graphs of the right column).

Let us comment on Figure 15. For the two most important variables (first row), the magnitude of importance obtained with ntree = 2000 and mtry = p/3 is much larger than the one obtained with default values.
In the second row, the increase of magnitude is still noticeable from the 3rd to the 9th most important variables, while from the 10th to the 20th most important variables, VI is roughly the same for the two parameter choices. In the third row, we get VI closer to zero with ntree = 2000 and mtry = p/3 than with the default values. In addition, note that for the least important variables, the boxplots are larger for the default values, especially for unimportant variables (from the 200th to the 250th).

Figure 15: Variable importance for Prostate data (using ntree = 2000 and mtry = p/3 on the right, and default values on the left).

4 Variable selection

4.1 Procedure

4.1.1 Principle

We distinguish two variable selection objectives:

1. to find important variables highly related to the response variable, for interpretation purposes;
2. to find a small number of variables sufficient for a good prediction of the response variable.

The first is to magnify all the important variables, even highly redundant ones, for interpretation purposes, and the second is to find a sufficient parsimonious set of important variables for prediction.

Two earlier works must be cited: Díaz-Uriarte and Alvarez de Andrés (2006) [14] and Ben Ishak and Ghattas (2008) [4]. Díaz-Uriarte and Alvarez de Andrés propose a strategy based on recursive elimination of variables. More precisely, they first compute the RF variable importance.
Then, at each step, they eliminate the 20% of variables having the smallest importance and build a new forest with the remaining variables. They finally select the set of variables leading to the smallest OOB error rate. The proportion of variables to eliminate is an arbitrary parameter of their method and does not depend on the data.

Ben Ishak and Ghattas choose an ascending strategy based on a sequential introduction of variables. First, they compute some SVM-based variable importance. Then, they build a sequence of SVM models involving at the beginning the k most important variables, by steps of 1. When k becomes too large, the additional variables are introduced by packets. They finally select the set of variables leading to the model of smallest error rate. The way variables are introduced is not data-driven, since it is fixed before running the procedure. They also compare their procedure with a similar one using RF instead of SVM.

We propose the following two-step procedure, the first step being common while the second depends on the objective:

1. Preliminary elimination and ranking: compute the RF scores of importance and cancel the variables of small importance; order the m remaining variables in decreasing order of importance.
2. Variable selection:
   - For interpretation: construct the nested collection of RF models involving the k first variables, for k = 1 to m, and select the variables involved in the model leading to the smallest OOB error.
   - For prediction: starting from the ordered variables retained for interpretation, construct an ascending sequence of RF models by introducing and testing the variables stepwise. The variables of the last model are selected.

Of course, this is only a sketch of the procedure, and more details are needed to make it effective.
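The two selection steps can be rendered schematically as follows. This is a sketch, not the authors' implementation: `oob_error(vars)` stands for fitting a random forest on the listed variables and returning its OOB error, `ranked_vars` for the importance-ordered variables, and `threshold` for the data-driven error-gain level discussed later in the text.

```python
def select_interpretation(ranked_vars, oob_error):
    """Nested models over the ranked variables; keep the smallest-error one."""
    errors = [oob_error(ranked_vars[:k]) for k in range(1, len(ranked_vars) + 1)]
    best_k = min(range(len(errors)), key=lambda k: errors[k]) + 1
    return ranked_vars[:best_k]

def select_prediction(interp_vars, oob_error, threshold):
    """Stepwise ascending introduction: a variable is kept only if the
    OOB error gain it brings exceeds the threshold."""
    selected = [interp_vars[0]]
    current_err = oob_error(selected)
    for v in interp_vars[1:]:
        err = oob_error(selected + [v])
        if current_err - err > threshold:
            selected.append(v)
            current_err = err
    return selected
```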
The next paragraph addresses this point, but we emphasize that we propose a heuristic strategy which is not supported by specific model hypotheses, but is based on data-driven thresholds for taking decisions.

Remark 4.1 Since we want to treat all situations in a unified way, we will use for finding prediction variables the somewhat crude strategy previously defined. Nevertheless, starting from the set of variables selected for interpretation (say of size K), a better strategy could be to examine all, or at least a large part, of the 2^K possible models and to select the variables of the model minimizing the OOB error. But this strategy quickly becomes unrealistic for high dimensional problems, so we prefer to experiment with a strategy designed for small n and large K, which is not conservative and may even lead to selecting fewer variables.

4.1.2 Starting example

To both illustrate and give more details about this procedure, we apply it to a simulated learning set of size n = 100 from the classification toys data model (see Section 2.3.2) with p = 200. The results are summarized in Figure 16. The true variables (1 to 6) are respectively represented by (✄, △, ◦, ⋆, ✁, ). We compute, from the learning set, 50 forests with ntree = 2000 and mtry = 100, which are values of the main parameters previously considered as well adapted for VI calculations.
Let us detail the main stages of the procedure together with the results obtained on the toys data:

Figure 16: Variable selection procedures for interpretation and prediction for toys data (panels: mean of importance, standard deviation of importance, OOB error of nested models, OOB error of predictive models).

- First, we rank the variables by sorting the VI in descending order. The result is drawn on the top left graph for the 50 most important variables (the other noisy variables also have an importance very close to zero). Note that the true variables are significantly more important than the noisy ones.
- We keep this order in mind and plot the corresponding standard deviations of VI. We use this graph to estimate a threshold for importance, and we keep only the variables whose importance exceeds this level. More precisely, we select the threshold as the minimum prediction value given by a CART model fitting this curve. This rule is in general conservative and leads to retaining more variables than necessary, in order to make a careful choice later. The standard deviations of VI can be found in the top right graph. We can see that the standard deviation of the true variables is large compared to that of the noisy variables, which is close to zero. The threshold leads to retaining 33 variables.
- Then, we compute the OOB error rates of random forests (using default parameters) for the nested models, starting from the one with only the most important variable and ending with the one involving all the important variables kept previously. The variables of the model leading to the smallest OOB error are selected.
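The CART-based threshold above can be illustrated with a deliberately simplified stand-in: a greedy piecewise-constant fit of the curve of VI standard deviations (full CART with pruning would refine the splits, so this is only a sketch of the idea, with names of our own choosing).

```python
def fit_piecewise_constant(y, min_size=2, tol=1e-8):
    """Recursively split a 1-D curve, CART-style, into constant pieces
    chosen to reduce squared error; return the fitted values."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)

    def fit(seg):
        best = None
        for cut in range(min_size, len(seg) - min_size + 1):
            gain = sse(seg) - sse(seg[:cut]) - sse(seg[cut:])
            if gain > tol and (best is None or gain > best[0]):
                best = (gain, cut)
        if best is None:                      # no worthwhile split: a leaf
            m = sum(seg) / len(seg)
            return [m] * len(seg)
        cut = best[1]
        return fit(seg[:cut]) + fit(seg[cut:])

    return fit(y)

def importance_threshold(vi_std):
    """Minimum predicted value of the fit to the VI standard deviations."""
    return min(fit_piecewise_constant(vi_std))
```

On a curve that drops from the level of the true variables to the level of the noisy ones, the threshold lands at the mean level of the flat noisy tail.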
Note that, in the bottom left graph, the error decreases quickly and reaches its minimum when the first 4 true variables are included in the model; then it remains constant. We select the model containing 4 of the 6 true variables. More precisely, we select the variables involved in the model almost leading to the smallest OOB error, i.e. the first model almost reaching the minimum (the actual minimum is reached with 24 variables). The expected behavior of the error is non-decreasing as soon as all the "true" variables have been selected, and it is then difficult to treat in a unified way nearly constant or slightly increasing error curves. In fact, we propose to use a heuristic rule similar to the 1 SE rule of Breiman et al. (1984) [6] used for selection in the cost-complexity pruning procedure.

We perform a sequential variable introduction with testing: a variable is added only if the error gain exceeds a threshold. The idea is that the error decrease must be significantly greater than the average variation obtained by adding noisy variables. The bottom right graph shows the result of this step: the final model for prediction purposes involves only variables 3, 6 and 5. The threshold is set to the mean of the absolute values of the first-order differenced errors between the model with 5 variables (the first model after the one we selected for interpretation, see the bottom left graph) and the last one.

It should be noted that if one wants to estimate the prediction error, since ranking and selection are made on the same set of observations, an error evaluation on a test set or using a cross-validation scheme should of course be preferred. This is taken into account in the next section, when our results are compared to others.
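The prediction threshold just described can be written compactly. In this sketch, `errors[k]` is assumed to hold the OOB error of the nested model with k + 1 variables, and `k_interp` the size of the interpretation set:

```python
def prediction_threshold(errors, k_interp):
    """Mean absolute first-order difference of the nested-model OOB errors,
    taken from the model just after the interpretation one up to the last:
    an estimate of the error variation caused by adding noisy variables.
    """
    tail = errors[k_interp:]  # starts at the model with k_interp + 1 variables
    diffs = [abs(b - a) for a, b in zip(tail, tail[1:])]
    return sum(diffs) / len(diffs)
```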
To evaluate the different prediction errors fairly, we prefer here to simulate a test set of the same size as the learning set. The test error rate with all (200) variables is about 6%, while the one with the 4 variables selected for interpretation is about 4.5%, a little bit smaller. The model with prediction variables 3, 6 and 5 reaches an error of 1%. Repeating the global procedure 10 times on the same data always gave the same interpretation set of variables and the same prediction set, in the same order.

4.1.3 Highly correlated variables

Let us now apply the procedure on toys data with replicated variables: a first group of variables highly correlated with variable 3 and a second one replicated from variable 6 (the most important variable of each group). The situations of interest are the same as those considered to produce Figure 14.

number of replications | interpretation set                     | prediction set
1                      | 3 7_3 2 6 5                            | 3 6 5
5                      | 3 2 7_3 10_3 6 11_3 5 12_6             | 3 6 5 10_3
10                     | 3 14_3 8_3 2 15_3 6 5 10_3 13_3 20_6   | 3 6 5 10_3

Table 3: Variable selection procedure in the presence of highly correlated variables (augmented toys data)

Let us comment on Table 3, where the expression i_j means that variable i is a replication of variable j. Interpretation sets do not contain all the variables of interest. In particular, we hardly keep replications of variable 6. The reason is that, even before adding noisy variables to the model, the error rate of the nested models does increase (or remain constant): when several highly correlated variables are added, the bias remains the same while the variance increases. However, the prediction sets are satisfactory: we always highlight variables 3 and 6, and at most one variable correlated with each of them.
Even if all the variables of interest do not appear in the interpretation set, they always appear in the first positions of our ranking according to importance. More precisely, the 16 most important variables in the case of 5 replications are: (3 2 7_3 10_3 6 11_3 5 12_6 8_3 13_6 16_6 1 15_6 14_6 9_3 4), and the 26 most important variables in the case of 10 replications are: (3 14_3 8_3 2 15_3 6 5 10_3 13_3 20_6 21_6 11_3 12_3 18_6 1 24_6 7_3 26_6 23_6 16_3 25_6 22_6 17_6 19_6 4 9_3). Note that the order of the true variables (3 2 6 5 1 4) remains the same in all situations.

4.2 Classification

4.2.1 Prostate data

We apply the variable selection procedure on Prostate data. The graphs of Figure 17 are obtained like those of Figure 16, except that for the RF procedure we use ntree = 2000 and mtry = p/3, and that for the bottom left graph we only plot the 100 most important variables for visibility. The procedure leads to the same picture as previously, except for the OOB rate along the nested models, which is less regular. The key point is that it selects 9 variables for interpretation and 6 variables for prediction. The number of selected variables is then very much smaller than p = 6033.

Figure 17: Variable selection procedures for interpretation and prediction for Prostate data.

In addition, to examine the variability of the interpretation and prediction sets, the global procedure is repeated five times on the entire Prostate dataset. The five prediction sets are very close to each other.
The number of prediction variables fluctuates between 6 and 10, and 5 variables appear in all sets. Among the five interpretation sets, 2 are identical and made of 9 variables, and the 3 others are made of 25 variables. The 9 variables of the smallest sets are present in all sets, and the biggest sets (of size 25) have 23 variables in common. So, although the sets of variables are not identical for each run of the procedure, they are not completely different. And in addition, the most important variables are included in all sets of variables.

4.2.2 High dimensional classification

We apply the global variable selection procedure to the high dimensional real datasets studied in Section 2.3.2, and we want to get an estimation of the prediction error rates. Since these datasets are of small size, we use 5-fold cross-validation to estimate the error rate. So we split the sample into 5 stratified parts; each part is successively used as a test set, and the remainder of the data is used as a learning set. Note that the set of selected variables varies from one fold to another. So, we give in Table 4 the misclassification error rate, given by the 5-fold cross-validation, for interpretation and prediction sets of variables respectively. The number in brackets is the average number of selected variables. In addition, one can find the original error, which stands for the misclassification rate given by the 5-fold cross-validation achieved with random forests using all variables. This error is calculated using the same partition into 5 parts, and again we use ntree = 2000 and mtry = p/3 for all datasets.

Dataset   | interpretation | prediction | original
Colon     | 0.16 (35)      | 0.20 (8)   | 0.14
Leukemia  | 0 (1)          | 0 (1)      | 0.02
Lymphoma  | 0.08 (77)      | 0.09 (12)  | 0.10
Prostate  | 0.085 (33)     | 0.075 (8)  | 0.07

Table 4: Variable selection procedure for four high dimensional real datasets.
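The stratified split described above can be sketched as follows (an illustrative helper of our own, not the authors' code; each fold approximately preserves the class proportions of the whole sample):

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Return a list of k folds (lists of indices), each approximately
    preserving the class proportions of `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):  # deal indices round-robin per class
            folds[pos % k].append(idx)
    return folds
```

Each fold then serves once as the test set, the remaining folds forming the learning set.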
In Table 4, the CV error rate is given together with, in brackets, the average number of selected variables. The number of interpretation variables is hugely smaller than p: at most tens, to be compared to thousands. The number of prediction variables is very small (always smaller than 12), and the reduction can be very important with respect to the interpretation set size. The errors for the two variable selection procedures are of the same order of magnitude as the original error (but a little bit larger).

We compare these results with those obtained by Ben Ishak and Ghattas (2008) (see tables 9 and 11 in [4]), who compared their method with 5 competitors (mentioned in the introduction) for classification problems on these four datasets. Error rates are comparable. With the prediction procedure, as already noted in the introductory remark, we always select fewer variables than their procedures (except for their method GLMpath, which selects fewer than 3 variables for all datasets).

4.3 Regression

4.3.1 A simulated dataset

We now apply the procedure to a simulated regression problem. Starting from the Friedman1 model and adding noisy variables as in Section 2.2.2, we construct a learning set of size n = 100 with p = 200 variables. Figure 18 displays the results of the procedure. The true variables of the model (1 to 5) are respectively represented by (✄, △, ◦, ⋆, ✁).

Figure 18: Variable selection procedures for interpretation and prediction for Friedman1 data.

The graphs are of the same kind as in the classification problems.
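For reference, the Friedman1 regression function (as implemented by mlbench.friedman1) is y = 10 sin(π x1 x2) + 20 (x3 − 0.5)² + 10 x4 + 5 x5 + ε, with the x_i uniform on [0, 1]. A quick simulation, sketched here in Python for illustration, shows why variable 3, although influential, is nearly uncorrelated with the response: its contribution is symmetric around x3 = 0.5.

```python
import math
import random

def friedman1(n=2000, seed=0):
    """Simulate the Friedman1 benchmark: only x1..x5 enter the response."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(5)] for _ in range(n)]
    Y = [10 * math.sin(math.pi * x[0] * x[1]) + 20 * (x[2] - 0.5) ** 2
         + 10 * x[3] + 5 * x[4] + rng.gauss(0, 1) for x in X]
    return X, Y

def corr(a, b):
    """Sample linear correlation between two sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = (sum((x - ma) ** 2 for x in a) / n) ** 0.5
    sb = (sum((y - mb) ** 2 for y in b) / n) ** 0.5
    return cov / (sa * sb)

X, Y = friedman1()
# the quadratic term 20 * (x3 - 0.5)**2 is symmetric around 0.5, so the
# linear correlation of x3 with y is close to 0, unlike that of x4
```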
Note that variable 3 is confused with noise and is not selected by the procedure. This is explained by the fact that it is hardly correlated with the response variable. The interpretation procedure selects the true variables except variable 3, together with two noisy variables, and the prediction set contains only the true variables (except variable 3). Again, the whole procedure is stable in the sense that several runs give the same set of selected variables. In addition, we simulate a test set of the same size as the learning set to estimate the prediction error. The test mean squared error with all variables is about 19.2, the one with the 6 variables selected for interpretation is 12.6, and the one with the 4 variables selected for prediction is 9.8.

4.3.2 Ozone data

Before ending the paper, let us apply the entire procedure to the ozone dataset. It consists of n = 366 observations of the daily maximum one-hour-average ozone, together with p = 12 meteorological explanatory variables. Let us first examine, in Figure 19, the VI obtained with the RF procedure using mtry = p/3 = 4 and ntree = 2000. From left to right, the 12 explanatory variables are: 1-Month, 2-Day of month, 3-Day of week, 5-Pressure height, 6-Wind speed, 7-Humidity, 8-Temperature (Sandburg), 9-Temperature (El Monte), 10-Inversion base height, 11-Pressure gradient, 12-Inversion base temperature, 13-Visibility.

Figure 19: Variable importance for Ozone data.

Three very sensible groups of variables appear, from the most to the least important. First, the two temperatures (8 and 9) and the inversion base temperature (12), known to be the best ozone predictors, together with the month (1), which is an important predictor since ozone concentration exhibits a heavy seasonal component.
Second, a group of clearly less important meteorological variables: pressure height (5), humidity (7), inversion base height (10), pressure gradient (11) and visibility (13). Finally, three unimportant variables: day of month (2), day of week (3), of course, and, more surprisingly, wind speed (6). This last fact is classical: wind enters the model only when ozone pollution arises; otherwise, wind and pollution are uncorrelated (see for example Chèze et al. (2003) [12], highlighting this phenomenon using partial estimators).

Let us now examine the results of the selection procedures.

Figure 20: Variable selection procedures for interpretation and prediction for Ozone data.

After the first elimination step, the 2 variables of negative importance are canceled, as expected. We therefore keep 10 variables for the interpretation step; the model with 7 variables is then selected, and it contains all the most important variables: (9 8 12 1 11 7 5). For the prediction procedure, the model is the same except that one more variable is eliminated: humidity (7). In addition, when different values of mtry are considered, the 4 most important variables (9 8 12 1) highlighted by the VI index are always selected and appear in the same order. Variable 5 also always appears, but another variable can appear before or after it.

5 Discussion

Of course, one of the main open issues about random forests is to elucidate from a mathematical point of view its exceptionally attractive performance.
In fact, only a small number of references deal with this very difficult challenge and, apart from the theoretical examination of bagging by Bühlmann and Yu (2002) [11], only purely random trees, a simple version of random forests, have been considered. Purely random trees were introduced by Cutler and Zhao (2001) [13] for classification problems and then studied by Breiman (2004) [9], but the results are somewhat preliminary. More recently, Biau et al. (2008) [5] obtained the first well-stated consistency-type results.

From a practical perspective, surprisingly, this simplified and essentially not data-driven strategy seems to perform well, at least for prediction purposes (see Cutler and Zhao 2001 [13]), and, of course, can be handled theoretically in an easier way. Nevertheless, it would be interesting to check that the same conclusions hold for the variable importance and variable selection tasks.

In addition, it could be interesting to examine some variants of random forests which, on the contrary, try to take into account more information. Let us give two ideas as examples. The first is about pruning: why is pruning not used for the individual trees? Of course, from the computational point of view the answer is obvious, and for prediction performance, averaging eliminates the negative effects of individual overfitting. But for the two other previously mentioned statistical problems, prediction and variable selection, the question remains unclear. The second remark is about the random feature selection step. The most widely used version of RF selects mtry input variables at random according to the discrete uniform distribution.
Two variants can be suggested: the first is to select random inputs according to a distribution coming from a preliminary ranking given by a pilot estimator; the second is to adaptively update this distribution, taking advantage of the ranking based on the current forest, which is then more and more accurate. These different future directions, both theoretical and practical, will be addressed in the next step of the work.

6 Appendix

In the sequel, information about datasets retrieved from the R package mlbench can be found in the corresponding description file.

Standard problems, n >> p:

Binary classification
– Real data sets (3)
  * Ionosphere (n = 351, p = 34)
  * Diabetes, PimaIndiansDiabetes2 (n = 768, p = 8)
  * Sonar (n = 208, p = 60)
  * Votes, HouseVotes84 (n = 435, p = 16)
– Simulated data sets (3)
  * Ringnorm, mlbench.ringnorm (n = 200, p = 20)
  * Threenorm, mlbench.threenorm (n = 200, p = 20)
  * Twonorm, mlbench.twonorm (n = 200, p = 20)

Multiclass classification
– Real data sets (3)
  * Glass (n = 214, p = 9, c = 6)
  * Letters, LetterRecognition (n = 20000, p = 16, c = 26)
  * Sat-images, Satellite (n = 6435, p = 36, c = 6)
  * Vehicle (n = 846, p = 18, c = 4)
  * Vowel (n = 990, p = 10, c = 11)
– Simulated data sets (3)
  * Waveform, mlbench.waveform (n = 200, p = 21, c = 3)

Regression
– Real data sets (3)
  * BostonHousing (n = 506, p = 13)
  * Ozone (n = 366, p = 12)
  * Servo (n = 167, p = 4)
– Simulated data sets (3)
  * Friedman1, mlbench.friedman1 (n = 300, p = 10)
  * Friedman2, mlbench.friedman2 (n = 300, p = 4)
  * Friedman3, mlbench.friedman3 (n = 300, p = 4)

High dimensional problems, n << p:

Binary classification
– Real data sets (4)
  * Adenocarcinoma (n = 76, p = 9868), see Ramaswamy et al. (2003) [33]
  * Colon (n = 62, p = 2000), see Alon et al. (1999) [1]
  * Leukemia (n = 38, p = 3051), see Golub et al.
(1999) [21]
  * Prostate (n = 102, p = 6033), see Singh et al. (2002) [35]
– Simulated data sets (5)
  * toys data (n = 100, 100 ≤ p ≤ 1000), see Weston et al. (2003) [39]

Multiclass classification
– Real data sets (4)
  * Brain (n = 42, p = 5597, c = 5), see Pomeroy et al. (2002) [31]
  * Breast, breast.3.class (n = 96, p = 4869, c = 3), see van't Veer et al. (2002) [38]
  * Lymphoma (n = 62, p = 4026, c = 3), see Alizadeh (2000) [2]
  * Nci (n = 61, p = 6033, c = 8), see Ross et al. (2000) [34]
  * Srbct (n = 63, p = 2308, c = 4), see Khan et al. (2001) [26]

Regression
– Real data sets (6)
  * PAC (n = 209, p = 467)
– Simulated data sets (3)
  * Friedman1, mlbench.friedman1 (n = 100, 100 ≤ p ≤ 1000)

(3) from the R package mlbench
(4) see http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html
(5) see description in section 2.3.2
(6) from the R package chemometrics

References

[1] Alon U., Barkai N., Notterman D.A., Gish K., Ybarra S., Mack D., and Levine A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA, Cell Biology, 96(12):6745-6750
[2] Alizadeh A.A. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503-511
[3] Archer K.J. and Kimes R.V. (2008) Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52:2249-2260
[4] Ben Ishak A. and Ghattas B. (2008) Sélection de variables en classification binaire : comparaisons et application aux données de biopuces. To appear, Revue SFDS-RSA
[5] Biau G., Devroye L., and Lugosi G. (2008) Consistency of random forests and other averaging classifiers.
Journal of Machine Learning Research, 9:2039-2057
[6] Breiman L., Friedman J.H., Olshen R.A., Stone C.J. (1984) Classification And Regression Trees. Chapman & Hall
[7] Breiman L. (1996) Bagging predictors. Machine Learning, 26(2):123-140
[8] Breiman L. (2001) Random Forests. Machine Learning, 45:5-32
[9] Breiman L. (2004) Consistency for a simple model of Random Forests. Technical Report 670, Berkeley
[10] Breiman L. and Cutler A. (2005) Random Forests. Berkeley, http://www.stat.berkeley.edu/users/breiman/RandomForests/
[11] Bühlmann P. and Yu B. (2002) Analyzing Bagging. The Annals of Statistics, 30(4):927-961
[12] Cheze N., Poggi J.M. and Portier B. (2003) Partial and Recombined Estimators for Nonlinear Additive Models. Statistical Inference for Stochastic Processes, Vol. 6, 2, 155-197
[13] Cutler A. and Zhao G. (2001) PERT - Perfect random tree ensembles. Computing Science and Statistics, 33:490-497
[14] Díaz-Uriarte R. and Alvarez de Andrés S. (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7:3, 1-13
[15] Dietterich T. (1999) An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, Boosting and randomization. Machine Learning, 1-22
[16] Dietterich T. (2000) Ensemble Methods in Machine Learning. Lecture Notes in Computer Science, 1857:1-15
[17] Efron B., Hastie T., Johnstone I., and Tibshirani R. (2004) Least angle regression. Annals of Statistics, 32(2):407-499
[18] Fan J. and Lv J. (2008) Sure independence screening for ultra-high dimensional feature space. J. Roy. Statist. Soc. Ser. B, 70:849-911
[19] Guyon I., Weston J., Barnhill S., and Vapnik V.N.
(2002) Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422
[20] Guyon I. and Elisseff A. (2003) An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182
[21] Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., and Lander E.S. (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537
[22] Grömping U. (2007) Estimators of Relative Importance in Linear Regression Based on Variance Decomposition. The American Statistician, 61:139-147
[23] Grömping U. (2006) Relative Importance for Linear Regression in R: The Package relaimpo. Journal of Statistical Software, 17, Issue 1
[24] Hastie T., Tibshirani R., Friedman J. (2001) The Elements of Statistical Learning. Springer
[25] Ho T.K. (1998) The random subspace method for constructing decision forests. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(8):832-844
[26] Khan J., Wei J.S., Ringner M., Saal L.H., Ladanyi M., Westermann F., Berthold F., Schwab M., Antonescu C.R., Peterson C., Meltzer P.S. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med, 7:673-679
[27] Kohavi R. and John G.H. (1997) Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2):273-324
[28] Liaw A. and Wiener M. (2002) Classification and Regression by randomForest. R News, 2(3):18-22
[29] Park M.Y. and Hastie T. (2007) An L1 regularization-path algorithm for generalized linear models. J. Roy. Statist. Soc. Ser. B, 69:659-677
[30] Poggi J.M. and Tuleau C.
(2006) Classification supervisée en grande dimension. Application à l'agrément de conduite automobile. Revue de Statistique Appliquée, LIV(4):39-58
[31] Pomeroy S.L., Tamayo P., Gaasenbeek M., Sturla L.M., Angelo M., McLaughlin M.E., Kim J.Y., Goumnerova L.C., Black P.M., Lau C., Allen J.C., Zagzag D., Olson J.M., Curran T., Wetmore C., Biegel J.A., Poggio T., Mukherjee S., Rifkin R., Califano A., Stolovitzky G., Louis D.N., Mesirov J.P., Lander E.S., Golub T.R. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415:436-442
[32] Rakotomamonjy A. (2003) Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3:1357-1370
[33] Ramaswamy S., Ross K.N., Lander E.S., Golub T.R. (2003) A molecular signature of metastasis in primary solid tumors. Nature Genetics, 33:49-54
[34] Ross D.T., Scherf U., Eisen M.B., Perou C.M., Rees C., Spellman P., Iyer V., Jeffrey S.S., de Rijn M.V., Waltham M., Pergamenschikov A., Lee J.C., Lashkari D., Shalon D., Myers T.G., Weinstein J.N., Botstein D., Brown P.O. (2000) Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24(3):227-235
[35] Singh D., Febbo P.G., Ross K., Jackson D.G., Manola J., Ladd C., Tamayo P., Renshaw A.A., D'Amico A.V., Richie J.P., Lander E.S., Loda M., Kantoff P.W., Golub T.R., and Sellers W.R. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1:203-209
[36] Strobl C., Boulesteix A.-L., Zeileis A. and Hothorn T. (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics, 8:25
[37] Strobl C., Boulesteix A.-L., Kneib T., Augustin T. and Zeileis A. (2008) Conditional variable importance for Random Forests. BMC Bioinformatics, 9:307
al [38] v an’t V eer L.J., Dai H., v an de Vijv er M.J., He Y.D., Hart A.A.M., Mao M., Peterse H.L., v a n der Ko oy K., Marton M.J ., Witteveen A.T., Schreiber G.J., Kerkhov en R.M., Rob erts C., Linsley P .S., Berna rds R., F riend S.H. (2002) Gene expr ession pr ofiling pr e dicts clinic al outc ome of br e ast c anc er . Nature, 41 5:530- 5 36 [39] W esto n J., Elisseff A., Sc ho elkopf B., and Tipping M. (200 3) Use of the zer o norm with line ar mo dels and kernel metho ds . J o urnal of Ma chine Learning Research, 3:1 439-1 4 61 INRIA Centre de recherche INRIA Saclay – Île-de-Fran ce Parc Orsay Uni versité - ZA C des V ignes 4, rue Jacques Monod - 9189 3 Orsay Cedex (France) Centre de recherc he INRIA Bordeaux – Sud Ouest : Domaine Uni versit aire - 351, cours de la Libération - 33405 T alenc e Cedex Centre de recherc he INRIA Grenoble – Rhône-Alpes : 655, a venue de l’Europe - 38334 Montbonnot Saint-Ismie r Centre de recherc he INRIA Lille – Nord Europe : Pa rc Scientifique de la H aute Borne - 40, a venue Halle y - 59650 V illene uve d’Ascq Centre de recherc he INRIA Nancy – Grand Est : LORIA, T echnopôl e de Nancy-Brab ois - Campus scientifique 615, rue du Jardin Botani que - BP 101 - 54602 V illers-lè s-Nancy Cedex Centre de recherc he INRIA Paris – Rocquenc ourt : Domaine de V olucea u - Rocquenco urt - BP 105 - 78153 Le Chesnay Cedex Centre de recherc he INRIA Rennes – Bretagne Atlantique : IRISA, Campus uni versi taire de Beaulieu - 35042 Rennes Cedex Centre de recherc he INRIA Sophia Antipolis – Méditerran ée : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipoli s Cedex Éditeur INRIA - Domaine de V olucea u - Rocquenc ourt, BP 105 - 78153 Le Chesnay Cedex (France) http://www.inria.fr ISSN 0249 -6399 10 0 10 1 10 2 150 200 250 300 350 400 450 500 550 mtry OOB Error PAC ntree=100 500 1000 5000 2 4 6 8 10 12 10 11 12 13 14 15 16 17 18 19 mtry OOB Error BostonHousing ntree=100 500 1000 2 4 6 8 10 20 21 22 23 24 25 26 Ozone 1 1.5 2 2.5 3 3.5 4 20 25 