Random survival forests

We introduce random survival forests, a random forests method for the analysis of right-censored survival data. New survival splitting rules for growing survival trees are introduced, as is a new missing data algorithm for imputing missing data. A co…

Authors: Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone

Random survival forests
The Annals of Applie d Statistics 2008, V ol. 2, No. 3, 841–86 0 DOI: 10.1214 /08-A OAS169 In the Public Domain RANDOM SUR VIV AL F ORESTS 1 By Hemant Ishw aran, Ud a y a B. K ogal ur, Eugene H. Blackstone and Michael S. Lauer Cleveland Clinic, Columbia University, Cleveland Clinic and National He art, Lung, and Blo o d Institute W e introduce random surv iv al forests, a random forests metho d for the analysis of righ t-censored surviv al data. New surviv al splitting rules f or gro wing surviv al trees are introduced, as is a n ew m issing data algorithm for imputing mis sing data. A conserv ation-of-events principle for surviv al f orests is in trodu ced and used to define e nsem- ble mortalit y , a simple interpretable measure of mortalit y th at can be used a s a predicted outcome. Several illustrativ e examples are given, including a case study of the prog nostic implicatio ns of b od y mass for individuals with coronary artery diseas e. Computations for all exam- ples were implemen ted using the freely a v ailable R- soft ware pack age, randomSurv ivalForest . 1. In tro du ction. In this article we introdu ce r andom survival for ests , an ensem b le tree metho d for analysis of right-c ensored surviv al data. As is wel l kno w n, constructing ensem b les from b ase learners, suc h as trees, can sub - stan tially imp ro ve prediction p erf orm ance. Recen tly it has b een sh own b y Breiman ( 200 1 ) that ensem ble learning can b e impro v ed further by inject- ing rand omizatio n in to the b ase learning pro cess, an approac h calle d r andom for ests . Random s u rviv al forests (RSF) metho dology extends Breiman’s ran- dom forests (RF) metho d. In RF, randomizatio n is in tro duced in t wo forms. First, a randomly dr a wn b o otstrap sample of the data is u sed to gro w a tree. Second, at eac h no de of the tree, a randomly selected subset of v ari- ables (co v ariates) is c hosen as candidate v ariables f or splitting. Av eraging o ver trees, in com bin ation with the randomization us ed in gro w ing a tree, enables RF to appro ximate ric h classes of f unctions w hile maint aining lo w generaliza tion error. Considerable empirical evidence h as shown RF to b e Received January 200 8; revised Marc h 2008. 1 Supp orted in part by National In stitutes of Health R O 1 Grant HL-072771. Key wor ds and phr ases. Conserv ation of even ts, cumulativ e hazard function, ensemble, out-of-bag, prediction error, surviv al tree. This is an electronic r e print of the or iginal article published by the Institute of Mathematical Statistics in The Annals of Appli e d St atistics , 2008, V ol. 2, No. 3, 84 1–860 . This reprint differs from the orig inal in pa gination and typogr aphic detail. 1 2 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER highly accurate, comparable to state-of-the-a rt metho ds suc h as bagging [ Breiman ( 1 996 )], b o osting [Sc hapire et al. ( 19 98 )], and supp ort v ector ma- c hin es [ Cortes and V apn ik ( 1995 )]. Un til now, applications of RF hav e f o cused p rimarily on classificatio n and regression problems. Ev en the p opu lar R-soft w are pac k age ran domForest [Lia w and Wi ener ( 2002 , 2007 )] considers only regression and m ulticlass d ata settings, not surviv al analysis. Extending r andom forests to righ t-censored surviv al data is of great v alue. Surviv al data are commonly analyzed using metho ds that rely on restrictiv e assumptions su c h as prop ortional hazards. F urth er, b ecause these metho ds are often parametric, nonlinear effect s of v ariables must b e mo deled by transformations or expandin g t he design ma- trix to include sp ecialized basis functions. Often ad ho c approac hes, su c h as step wise regression, are used to determine if nonlinear effects exist. Identi- fying in teractions, esp ecially those inv olving m ultiple v ariables, is also prob- lematic. Th is m ust b e done b y br ute force (examining all t wo- w a y and three- w a y int eractions, e.g.) , or m u s t rely on sub jectiv e knowledg e to n arr o w the searc h. In con trast, these difficulties are handled automaticall y using forests. W e illustrate the ease with whic h R S F can un co v er complex data structur es through an in-depth case study of the pr ognostic implications of b eing un - derw eigh t, o verw eigh t, or ob ese and havi ng sev ere, but stable coronary artery disease. Although m uc h has b een written ab out public health ramifications of the ob esit y epidemic [Olshansky et al. ( 2005 )], considerable con tro v ersy exists r egarding the pr ecise asso ciation of b o dy mass w ith prognosis. I nv esti- gators ha ve noted co mplex patterns surr ounding p ossible rev erse causation in und erw eigh t individu als, in teracti ons with smoking, and an unclear in- flection p oint at which p oin t increasing b o dy mass confers increased risk [Adams et al. ( 2006 ), Flega l ( 2005 , 2007 ), F on taine et al. ( 20 03 )]. Some ha ve iden tified a p ossible ob esit y parado x among patien ts with established heart disease in whic h increased b o d y mass predicts b etter surviv al [Urtesky et al. ( 2007 )]. T o clarify these issues, w e analyzed a la rge cohort of patien ts with coronary artery disease u n dergoing isolated coronary artery bypass surgery . Using RSF, w e identified a co mplex relati onship b et wee n long-te rm surviv al, b o dy mass, renal (kidney) fun ction, smoking, and num b er o f in ternal coro- nary artery by ass grafts. W e b eliev e our no v el findings help e xplain some of the apparen t co n tradictions previously rep orted. 1.1. Other for est appr o aches. RSF strictly adheres to the prescription laid out by Breiman ( 200 3 ) and in th is wa y differs from other forest ap- proac hes to surviv al data. Breiman’s prescription r equires that all asp ects of gro wing a rand om forest take in to accoun t the outcome. In righ t-censored surviv al settings, this compr ises surviv al time and censoring status. Thus, the splitting crit erion used in gro wing a tree must explicit y inv olv e surviv al RANDOM SUR VIV AL F OR ESTS 3 time and censoring informatio n. T ree no de impurit y , measuring effectiv eness of a sp lit in separating data, m ust measure separatio n by surviv al d ifference. F urth er, the predicted v alue for a terminal nod e in a tree, the resulting en- sem b le predicted v alue fr om the f orest, and the measure of predictio n accu- racy m ust al l prop erly in corp orate s u rviv al information. This differs f r om other forest app roac hes to surviv al analysis that tend to b e “off-the-shelf.” Th ese do not strictly imp lemen t Breiman’s forest metho d [ Breiman ( 2003 )], but instead recast the sur viv al setting in to one th at can b e treated using existing forest methodology . F or example, surviv al analysis is p ossible b y w orking within the Classification and Regression T ree (CAR T) paradigm [Breima n et al. ( 1984 )] b y reform u lating a surviv al tree in terms of a classifica tion tree. One such example is Ish w aran et al. ( 2004 ), where un- der a prop ortional hazards assump tion, CAR T methodology is us ed to pr o- duce a rela tiv e r isk forest b y exploit ing an equiv alence to P oisson tree li k e- liho o ds [ LeBlanc and Crowle y ( 1992 )]. Another interesting app roac h, con- sidered by Hothorn et al. ( 2006 ), analyzes right -censored su r viv al data by using log-transformed sur viv al time as the outcome in a w eigh ted RF regres- sion analysis. Observ ations in the regressio n anal ysis are w eigh ted b y , what are referred to as, in v erse probabilit y of censoring (IPC) w eigh ts. See also Molinaro, D udoit and v an der Laan ( 2004 ). Early exp erimenta l work b y Breiman ( 2002 ) on sur v iv al forests is also relev ant. I n this approac h a surviv al tree is gro wn us in g a h ybrid splitting metho d in wh ic h no des are split b oth on time and co v ariates. This yields a nonparametric estimate for the surviv al f unction that can then b e used to trace the effects o f v ariables on su rviv al as a fu nction of time . 1.2. Obje ctives and outline. The pur p ose of this article is to giv e a de- tailed d escription of RSF (Sections 2 and 3 ) and to illustrate sev eral of its imp ortan t features. A core idea und erlying the approac h is a conserv ation- of-ev ent s prin ciple for surviv al trees (Section 4 ). This principle is used to d e- fine ensem ble mortali t y , a new t yp e of pr ed icted outco me for su rviv al data. Ensem ble mortalit y has a natural in terpretation in terms of the exp ected total num b er of deaths and is d eriv ed from the ensem ble cu mulativ e hazard function (CHF), the forest pr edicted v alue for the CHF. Predict ion error is defined in Sectio n 5 , and pr ediction accuracy of the ensem ble is compared with sev eral comp eting method s in a large exp erimen t comprising sev eral real and sim ulated datasets (Section 6 ). Our results are quite promising and add to the gro w in g list of su ccessful app licatio ns of RF. In Section 7 w e in vestig ate the us e of v ariable imp ortance for v ariable selectio n. In Section 8 w e in tro du ce a no v el missing data algorithm for forests. Th e algorithm can b e used for b oth training and test data and applies to missing v alues f or b oth co v ariates and surviv al data. The pap er ends w ith our case study of b o dy mass of patien ts with coronary artery disease (Sectio n 9 ). 4 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER 2. Random surviv al forests algo rithm. W e b egin with a high-lev el de- scription of the al gorithm. Sp ecific details follo w: 1. Dra w B b o otstrap samples from the original data. Note that eac h b o ot- strap sample excludes on a v erage 37% of the d ata, called out-of-bag data (OOB data ). 2. Gro w a surviv al tree for eac h b o otstrap sample. A t eac h no de of the tree, randomly select p candidate v ariables. Th e no de is sp lit u sing the candidate v ariable that maximizes surviv al difference b et w een daughter no des. 3. Gro w the tree to full s ize under the constrain t that a terminal n o de should ha ve no less than d 0 > 0 un ique deaths. 4. Calculate a C HF for eac h tree. Av erage to obtain the ensemble CHF. 5. Using O OB data, calc ulate prediction error for the ensem ble CHF. 3. Ensem ble cum u lativ e hazard. Cen tral elemen ts of the RSF algorithm are gro wing a sur viv al tree and constructing the ensem b le CHF. Here we pro vide deta ils necessa ry to un derstand these. 3.1. Binary survival tr e e. Similar to CAR T, surviv al trees are binary trees gro wn b y r ecur s iv e splitting of tree no des. A tree is gro wn starting at the ro ot no de, wh ic h is the top of the tree comprising all the data. Usin g a predetermined surviv al criterion, the ro ot node is split in to t wo daugh ter no des: a left and righ t daugh ter no de. In turn, eac h daugh ter no de is s p lit with eac h sp lit giving rise to left and right daughte rs. The pro cess is rep eated in a recursiv e fashion for ea c h su bsequen t no de. A go o d split for a no de maximizes surviv al difference b etw een daught ers. The b est sp lit for a no d e is found by sea rc h ing o ve r all p ossible x v ariables and split v alues c , and choosing that x ∗ and c ∗ that maximizes surviv al dif- ference. By maximizing surviv al difference, the tree pushes dissimilar ca ses apart. Even tually , as the n u m b er of no d es increase, and dissimilar cases b ecome separate d, eac h no de in the tree b ecomes homogeneous and is p op- ulated b y ca ses with similar surviv al. 3.2. T erminal no de pr e diction. Even tually the surviv al tree reac hes a sat- uration p oin t when no new daugh ters can b e formed b ecause of t he criterion that eac h nod e m ust con tain a minimum of d 0 > 0 u n ique deaths. Th e most extreme no des in a saturated tree are called terminal no d es. Denote these b y T . Let ( T 1 ,h , δ 1 ,h ) , . . . , ( T n ( h ) ,h , δ n ( h ) ,h ) b e the surviv al times and the 0–1 censoring information for individuals (cases) in a terminal no de h ∈ T . An individual i is said to b e right- censored at time T i,h if δ i,h = 0; otherwise, if δ i,h = 1, the individual is said to ha v e died (exp erienced an ev en t) at T i,h . Let t 1 ,h < t 2 ,h < · · · < t N ( h ) ,h b e the N ( h ) distinct ev ent times. Define d l,h RANDOM SUR VIV AL F OR ESTS 5 and Y l,h to b e the num b er of deaths and individuals at r isk at time t l,h . Th e CHF esti mate for h is the Nelson–Aalen estimato r ˆ H h ( t ) = X t l,h ≤ t d l,h Y l,h . All cases within h ha v e the same CHF. Eac h case i has a d -dimensional co v ariate x i . The notation x ab ov e r efers to on e co ordinate of x i . Let H ( t | x i ) b e the CHF for i . T o determine this v alue, drop x i do w n th e tree. Beca use of the binary nature of a surviv al tree, x i will fall into a unique terminal no de h ∈ T . The CHF for i is the Nelson–Aale n estimator for x i ’s terminal no de: H ( t | x i ) = ˆ H h ( t ) , if x i ∈ h. (3.1) Iden tit y ( 3.1 ) defines the CHF for all cases and defines the CHF for the tree. 3.3. The b o otstr ap and OOB ensemble CHF. T h e CHF ( 3.1 ) is deriv ed from a single tr ee. T o compute an ensemble CHF, w e a v erage o ver B sur viv al trees. W e describe b oth an OOB and b o otstrap esti mate. Recall that eac h tree in the forest is gro wn u sing an ind ep endent b o otstrap sample. Define I i,b = 1 if i is an OOB case for b ; otherwise, set I i,b = 0. Let H ∗ b ( t | x ) denote the CHF ( 3.1 ) for a tree gro wn fr om the b th b o otstrap sample. The OOB ensem ble CHF for i is H ∗∗ e ( t | x i ) = P B b =1 I i,b H ∗ b ( t | x i ) P B b =1 I i,b . (3.2) Observe that ( 3.2 ) is an av erage o v er b o otstrap s amp les in which i is O O B. Equiv alen tly , H ∗∗ e ( t | x i ) ca n be ca lculated as follo ws. Drop OO B data do wn a surviv al tree gro wn fr om in-bag (b o otstrap) data. Keep trac k of i ’s terminal no de and its CHF. T ak e the a v erage of these CHFs. T his yields ( 3.2 ). In co n trast to ( 3.2 ), the b o otstrap ensem ble CHF for i is H ∗ e ( t | x i ) = 1 B B X b =1 H ∗ b ( t | x i ) . (3.3) Observe that ( 3.3 ) uses all surviv al trees and not ju st those where i is O O B. 4. Conserv ation of ev en ts. W e use ( 3.2 ) and ( 3.3 ) to define a predicte d outcome. Our a pproac h r ests on a conserv ation-o f-ev ents principle [ Naftel, Blac ks tone and T urn er ( 1985 )]. Un der fairly general conditions, con- serv ation of ev ent s asserts that the sum of the esti mated CHF o ver observ ed time (b oth censored and uncensored) equals the total n umber of deaths. This applies to a wide collection of estimators, including the Nel son–Aalen estimator. F or a terminal n o de h ∈ T in a give n tree, conserv ation of ev en ts can be stated as the follo win g lemma: 6 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER Lemma 1. P n ( h ) i =1 ˆ H h ( T i,h ) = P n ( h ) i =1 δ i,h for e ach terminal no de h ∈ T . In other wor ds, the total numb er of de aths is c onserve d within h . Lemma 1 sh o ws that sum m ing the CHF o ver obs erved surviv al times equals the total n u mb er of d eaths within a terminal no d e: a t yp e of con- serv ation of ev ents within the end s of a tree. An immediate corollary is the stronger assertion that the total n umb er of deaths in a tree is also conserved. Let ( T 1 , δ 1 ) , . . . , ( T n , δ n ) denote surviv al times and censoring v alues for the non b o ostrapp ed data. Corollar y 1. F or e ach tr e e gr own fr om the original nonb o otstr app e d data, n X i =1 H ( T i | x i ) = X h ∈ T n ( h ) X i =1 ˆ H h ( T i,h ) = X h ∈ T n ( h ) X i =1 δ i,h = n X i =1 δ i . Note that the right-hand side e quals total numb er of de aths. 4.1. Ensemble mortalit y. Corollary 1 is used to motiv ate a predicted outcome measurin g mortalit y . Corollary 1 sho ws that the total num b er of deaths equals the sum of the CHF ov er ( T i , x i ). Mortalit y , in con trast, is defined as the exp ected v alue for the CHF summed o ver time T j , conditioned on a sp ecific x i . It measures the num b er of deaths exp ected under a null h yp othesis of similar surviv al b eha vior. Sp ecifically , the mortalit y for i is defined as M i = E i n X j =1 H ( T j | x i ) ! , where E i is the exp ectation under the null hyp othesis th at all j are similar to i . Mortalit y can b e estimated natur ally within a su rviv al tree paradigm. The structure of a surviv al tree enforces a null hyp othesis of similar surviv al within its terminal n o des; individuals in a terminal no de share a common estimated haza rd function. The nature of a surviv al tree and its forest there- fore suggests an estimate f or mortalit y . W e refer to th is estimate as ensem ble mortalit y . The ensemble mortalit y for i is defin ed as ˆ M ∗ e,i = n X j =1 H ∗ e ( T j | x i ) . Similarly , OOB ensem ble mortalit y is defin ed as ˆ M ∗∗ e,i = P n j =1 H ∗∗ e ( T j | x i ). RANDOM SUR VIV AL F OR ESTS 7 5. Predictio n error. T o estimate prediction error, w e use Harrell’s con- cordance index [Harrell et al. ( 1982 )]. The C-index (concordance index) is related to the area u nder the R OC curve [ Heagert y and Zheng ( 2005 )]. It estimates the probabilit y that, in a randomly selected pair of cases, the case that fails fir s t had a worst predicted outcome. T h e in terpretation of the C- index as a miscla ssification p robabilit y is attractiv e, and is one reason w e use it f or prediction err or. Another attrac tiv e feature is that, unlik e other mea- sures of surviv al p erformance, the C-index do es not dep end on a single fixed time f or ev aluation. The C-index also sp ecifically ac coun ts for censoring. 5.1. C-index c alculation. The C-ind ex is calc ulated us in g the f ollo wing steps: 1. F orm all possib le pairs of ca ses o v er the d ata. 2. Omit those pairs whose shorter surviv al time is censored. Om it pairs i and j if T i = T j unless at least one is a death. Let Permissible d enote the total n umb er of perm iss ib le pairs. 3. F or eac h p ermissible pair where T i 6 = T j , coun t 1 if the sh orter surviv al time has worse predicted outcome; count 0.5 if p redicted outcomes are tied. F or eac h p ermissible pair, where T i = T j and b oth are d eaths, coun t 1 if p redicted outcomes are tied; otherwise, coun t 0.5. F or eac h perm iss ib le pair where T i = T j , but not b oth are d eaths, count 1 if th e death has w orse predicted outcome; otherwise, coun t 0.5. Let Concordance denote the sum o ver a ll perm issible pairs. 4. The C-ind ex, C , is defined by C = Concordance / P ermissible . 5.2. OOB pr e diction err or. Calculating C requir es a pred icted ou tcome. W e use the OOB ens emble CHF to define a predicted outcome similar to ensem b le mortalit y describ ed in Section 4.1 . Because this v alue is d eriv ed from OOB data, it can b e u sed to obtain an OOB estimate for C , and, consequen tly , an OOB error rate. Let t o 1 , . . . , t o m denote pr e-chose n un ique time p oints (w e use the u nique ev ent times, t 1 , . . . , t N ). T o rank t w o cases i and j , we sa y i has a worse pr e dicte d outc ome than j if m X l =1 H ∗∗ e ( t o l | x i ) > m X l =1 H ∗∗ e ( t o l | x j ) . Using this rule, compute C as outlined ab o ve. Denote the OOB estimate b y C ∗∗ . Th e OOB prediction error, PE ∗∗ , is defined as 1 − C ∗∗ . Note that 0 ≤ PE ∗∗ ≤ 1 and that a v alue PE ∗∗ = 0 . 5 in d icates prediction no b etter than random guessing. 8 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER 6. Empirical comparisons. Here we rep ort the results of an exp erimen t designed to study pr ediction ac curacy of RSF. Prediction performance w as calculate d usin g the C-index, C , of Section 5 . Elev en datasets were used in the exp erimen t; eigh t were distinct. One of these, nod e-p ositiv e breast cancer data studied in Hothorn et al. ( 2006 ), w as used to create three additional datasets. T o stud y stabilit y in h igh- dimensional settings, 10, 50 and 100 uncorrelated v ariables repr esen ting noise w ere drawn from a u niform distribution and w ere added to the data. W e refer to these d ata as breast10, breast50 and breast100, and th e original dataset as breast. Of the remaining s even d atasets, four are pro vided in the randomSu rvivalFor est R-pac k age. These are as follo ws: v eteran’s admin- istration lung cancer data from Kalbfleisc h and Pren tice ( 1980 ) (v eteran); primary biliary cirrh osis data from Fleming and Harrin gton ( 1991 ) (pb c); burn patien t data fr om Kalbfleisc h and P r en tice ( 198 0 ) (burn ); and recidi- vism data from Rossi, Berk and Lenih an ( 1980 ) (recid). The r emaining three datasets w ere a pr ostate dataset describ ed in Kattan ( 2003 ) (prostate), a dataset comprising patien ts listed for heart transplan t at Clev eland Clinic (transplan t), and early stage esophageal cancer d ata considered in Ishw aran et al. ( 2004 ) (esophagus). Computations w ere implemen ted using r andomSurv ivalFore st soft wa re under its default settings [Ishw aran and Kogalur ( 2007 , 2008 )]. In eac h in- stance 1000 trees were gro w n. Eac h of the four sp litting ru les a v ailable in the soft wa re pac k age w ere used. These were as f ollo ws [for more details see Ish w aran and Kogalur ( 2008 )]: 1. A log-rank splitting rule (logrank) that sp lits nod es b y maximizat ion of the lo g-rank test statistic [ Segal ( 1 988 ), LeBla nc and Cr owley ( 1993 )]. 2. A conserv ation-of-e v ents splitting ru le (conserv e) that splits no des b y finding daugh ters close st to the conserv ation-of- ev ents principle. 3. A log-rank score ru le (logrankscore) that splits no des u sing a standardized log-rank stat istic [ Hothorn and Lausen ( 2003 )]. 4. A rand om log-rank splitting rule (log rankrandom). A random split is selected for eac h of the p c andidate v ariables in a no de, and the v ariable with maxim um log-rank statistic (at its rand om split p oin t) is u sed to split the no d e. As a b enc h m ark, w e used Co x regression. The RF approac h of Hothorn et al. ( 2006 ) w as a lso included for comparison. I n this latter metho d, 1000 r e- gression trees w ere grown. Eac h tree w as deriv ed fr om b o otstrap data where eac h obs erv ation wa s sampled with probabilit y equal to its IPC weigh t (cen- sored ob s erv ations had IPC w eigh ts of zero). IPC we igh ts w ere calculated from the Kaplan–Meier estimate for the censoring distribution as recom- mended in Hothorn et al. ( 2006 ). RF regression with log-transformed time RANDOM SUR VIV AL F OR ESTS 9 as the outcome w as used to gro w eac h tree. Predicted v alues were tree- w eigh ted a ve raged v alues of log-transformed surviv al time using w eigh ts as describ ed in Step 4 of Secti on 3.1 in Hothorn et al. ( 2006 ). T o estimate p rediction error, 100 ind ep endent b o otstrap samples of eac h data set w ere used. Eac h method w as fit on the b o otstrap d ata and predic- tion e rror w as estimate d using the corresp onding O O B data. F or RSF, this metho d of estimation differs from the O OB method discussed in S ection 5 . Although OOB error estimation could ha ve b een emplo ye d, n ot doing s o b y-passes a p otent ial p roblem with the RF censored r egression approac h. Sometimes a case will receiv e a v ery large IPC w eigh t and app ear in all or nearly ev ery b o otstrap sample, th u s p recluding accurate OOB pr ediction for that case [Hothorn et al. ( 2006 )]. When compu ting the C-index, eac h metho d used a different predicted outcome. F or RS F, ensemble mortalit y wa s used. F or RF regression and Co x regression, p r edicted sur v iv al time and the Co x linear predictor we re used, respectiv ely . Results from our exp erimen t are sho wn in Figure 1 . Our fin dings are summarized a s follo ws: 1. In nearly all examples, RS F using logrank and logrankscore sp litting had the lo we st p rediction err or. Conserv ation of ev en ts splitting w as also ve ry go o d. 2. In terestingly , RSF using the logrankrandom splitting r ule w as go o d in all examples. Because this sp litting rule is significan tly faster, its p er- formance suggests it might b e the preferred metho d in settings where computational s p eed is essen tial. 3. Prediction errors fr om breast10, breast50 and breast100 demonstrated all forest metho ds w ere stable in the presence of noise v ariables. Co x regression, in contrast , b ecame pr ogressiv ely worse as the n umb er of noise v ariables increased. 4. P erformance of RF regression dep ended strongly on the censoring r ate. F or the transplant and v eteran data, where almost all cases we re deaths, p erformance was go o d. F or the prostate and esophagus data, where cen- soring w as higher, p erformance was p o or. 7. V ariable imp ortance (VIMP). V ariables can b e selected by filtering on the basis of their v ariable imp ortance (VIMP). T o calculate VIMP for a v ariable x , dr op O OB cases down their in-bag sur viv al tree. Whenev er a split for x is encounte red, assign a d augh ter node randomly . The C HF from eac h su c h tree is calculated and a verag ed. Th e VIMP for x is the prediction error for the orig inal ensem ble subtracted from the predicti on error for the new ensemble obtained usin g randomizing x assignmen ts [ Breiman ( 2001 ), Ish w aran ( 2007 )]. 10 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER Fig. 1. Boxplots of est imate d pr e di ction err or ( c alculate d using by C-index of Se ction 5 ) fr om 100 indep endent b o otstr ap r epli c ates. Pr e di ction err or est imate d on out-of-b ag data. Dots in b oxplots indic ate me an values; horizontal lines ar e me dians. Metho ds c omp ar e d wer e as fol lows: Cox (Cox-r e gr ession); RF ( RF for c ensor e d dat a [ Hothorn et al. ( 2006 )] ; RSF1 thr ough RSF4 (RSF using lo g-r ank, c onservation-of-ev ents, l o g-r ank sc or e and r an- dom lo g-r ank splitting). Al l for est analyses c omprise d 1000 tr e es. Datasets ar e i ndic ate d ab ove e ach b oxplot in b old. RANDOM SUR VIV AL F OR ESTS 11 Large imp ortance v alues indicate v ariables w ith predictiv e abilit y , w h ereas zero or negativ e v alues iden tify nonpr ed ictiv e v ariables to b e filtered. Und er the C-ind ex, one can in terpr et VIMP in terms of misclassification. Reca ll the C-index estimates the pr obabilit y of correctly classifying tw o cases. Thus, VIMP for x measures the in crease (or drop!) in misclassification error on the te st data if x w ere not a v ailable. One should b e careful int erpreting VIMP . Although tempting, it is incor- rect to think VIMP estimates c hange in prediction error f or a forest gro wn with and without a v ariable. F or example, if t w o v ariables are highly cor- related and b oth pr ed ictiv e, eac h can ha ve large VIMP v alues. Remo ving one v ariable and regro wing the forest ma y affect the VIMP for the other v ariable (its v alue migh t get larger), but prediction error will likel y remain unc hanged. VIMP measures the c hange in pr ediction error on a fresh test case if x w ere not a v ailable, giv en that the original forest was gro w n using x . Although, in practice, th is often equals change in pr edictio n error for a f orest gro wn with and without x , conceptually the t w o quan tities are different. T o examine the empir ical prop erties of VIMP , we re-analyzed the breast cancer datasets used in Section 6 . F or eac h dataset, a forest of 1000 trees w as g ro w n using lo g-rank splitting. VIMP for eac h v ariable in eac h dataset w as recorded. The analysis was rep eated 100 times indep end en tly and VIMP a vera ged o v er the runs (T able 1 ). T able 1 shows VIMP to b e reasonably stable across d atasets. In some instances, VIMP increases or decreases sligh tly f or a v ariable as n umb er of noise v ariables increases, b ut the ranking and relativ e size of VIMP is rea- sonably consisten t. Also apparen t from the table is the abilit y of VIMP to iden tify noise v ariables. The a verag e absolute VIMP for noise v ariables in breast10, br east50 and br east100 is 0.001, and num b er of noise v ariables with VIMP exceeding 0.002 is 0.21 on a v erage for breast100. Th is latter observ a- tion can b e used as a means for thresholding v ariables, b ecause an y v ariable with a VIMP exceeding 0.002 is unlik ely to b e noise. Hence, w e can con- clude p ositiv e no des, age, and progesterone are highly predictiv e, and hor- mone therap y and estrogen are m o derately p redictiv e. Ho wev er, menopause, tumor s ize and tu m or grade are unlik ely to b e predictiv e. 8. Missing data. On e of the earliest algorithms for treating missing data in CAR T was based on the idea of a sur rogate split [Chapter 5.3, Breiman et al. ( 1984 )]. If s is the b est split for a v ariable x , the surrogate split for s is the split s ∗ using some other v ariable x ∗ suc h that s ∗ and s are closest to one another in terms of predictiv e asso ciation [Breiman et al. ( 1984 )]. T o assign a case h aving a missing v alue for the v ariable used to sp lit a no de, the CAR T algorithm uses the b est surrogate split among those v ariables not missing for the case. Surrogate sp litting ensures ev ery case can b e classified, whether the case has missing v alues or n ot. 12 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER T able 1 V ariable imp ortanc e (VIMP) for br e ast c anc er datasets. L ast two r ows ar e me an of absolute VIMP f or noise variables and numb er of noise variables with VIM P exc e e ding 0.002 (applies to br e ast10, br e ast50 and br e ast100). Al l r ep orte d values aver age d over 100 indep endent runs. Each run b ase d on 1000 tr e es under lo g-r ank spli tting breast breast10 breast50 breast100 Hormone therap y 0.005 0.003 0.002 0.001 Age 0.021 0.018 0.016 0.017 Menopause − 0.002 0.000 0.000 0.000 T umor size − 0.002 − 0.002 0.00 0 0.001 T umor grade 0.003 0.001 0.001 0.001 P ositiv e no des 0.037 0.040 0.041 0.042 Progesterone 0.009 0.015 0.018 0.019 Estrogen 0.003 0.003 0.004 0.005 Noise [mean] — 0.002 0.001 0.001 Noise [misclass ] — 0.090 0.190 0.210 Although sur rogate splitting w orks well for trees, the method ma y not b e w ell suited for f orests. Sp eed is one issue. Finding a surrogate split is computationally intensiv e and ma y b ecome in f easible when g ro w in g a large n um b er of trees, esp ecially for fu lly saturated trees used b y forests. F urther, surrogate splits ma y not ev en b e meaningful in a forest paradigm. RF ran- domly selects v ariables when sp litting a no d e and , as su c h, v ariables within a no de ma y b e u ncorrelated, and a reasonable sur rogate split may not ex- ist. Another concern is that surrogate splitting alters the int erpretation of a v ariable, whic h affects measures suc h as VIMP . F or these reasons, a differen t strategy is required for RF. The current metho d f or imputing data uses a pr oximit y approac h [ Breiman ( 2003 ), Lia w and Wie ner ( 2002 )]. This w orks as foll o ws. First, the data are roughly imputed: missing v alues for con tin uous v ariables are r ep laced with the me- dian of nonmissing v alues, or if the v ariable is categorical , data are imputed using the m ost frequent o ccurring nonmissing v alue. The roughly imp uted data is analyzed using RF. Missing data are imputed using the resulting pro ximit y matrix, an n × n symmetric matrix whose ( i, j ) en try reco rds the frequency that case i and j o ccur within the same terminal no de. F or con- tin u ous v ariables, impu ted d ata are the proxi mit y w eigh ted av erage of the nonmissing data. F or inte ger v ariables, imp uted data are the integ er v alue ha vin g the largest a v erage p ro ximity o ve r nonmissing data. The up d ated data are then u sed as an input in RF, and the cycle is rep eated. Typi- cally , only a few iterations are needed to ac hiev e a stable so lution [ Breiman ( 2003 )]. An ob vious adv antag e of s uc h an approac h is that it works off-t he-shelf without sp ecial mo dification of t he RF algorithm. Also, an imp ortan t prop- RANDOM SUR VIV AL F OR ESTS 13 ert y of R F is its abilit y to cluster the d ata. Imp utation via pr o ximit y exploits this. Ho wev er, although th e metho d is reasonable, there are some d isadv an- tages: 1. OOB est imates for prediction error are biased, ge nerally on the order of 10–20 % [ Breiman ( 2003 )]. F urther, b ecause pr ediction err or is biased, so are o ther measures based on it, suc h as VIMP . 2. The forest cannot b e used to predict on te st data with missing v alues. 8.1. A new missing data algorithm: adaptive tr e e imputation. T o addr ess these issues, we in tro duce a new missing d ata algorithm for forests. The idea is to adaptiv ely impute missing data as a tree is gro wn. Imputation w orks b y d r a wing randomly from the set of nonmissing in -bag d ata within the w orking no de. Because only in-bag data are used for imputing, O OB d ata are not to uc hed, and the OO B prediction error is not optimistically biased. The ke y steps inv olv ed are ske tc hed b elo w. F or simplicit y , w e initially f o cus on the case where only x v ariables ha v e missing data: 1. F or eac h no de h , impu te missing d ata prior to splitting. Let X ∗ k ,h b e the set of nonmissing v alues for the k th co ordinate of those x -v ariables in-bag in h . Let P ∗ k ,h b e the empirical distribu tion function for X ∗ k ,h . F or eac h in-bag case in h with missin g v alue for the k th coordinate, impute by dra wing a random v alue from P ∗ k ,h . Rep eat for eac h k . Splitting p ro ceeds as usual once the data are imputed. Note carefully that only in-bag data is used as the basis for imputation and splitting. 2. The OO B data pla ys a p assiv e role. It is imp uted by drawing from P ∗ k ,h . 3. Daugh ter no des con tain no missing data b ecause the paren t n o de is im- puted prior to splitting. Reset the imputed data in the daugh ters to miss- ing. Pro ceed as in Step 1 ab o ve, con tinuing until the tree can no longer b e split. The final summary imp uted v alue for a missing case uses in-bag impu ted v alues from the case’s terminal no d es across the forest. If a case has a missing v alue for a conti n uous v ariable, the final su mmary v alue is the a v erage of its imputed in-bag v alues. If a case h as a missing v alue for an integ er v ariable, the summary v alue is its in-bag imp uted v alue o ccurring most frequen tly (in case of ties, a random tie -breaking rule is used). 8.2. Imputatio n for test data. The missing data algorithm can b e used for test d ata. The test d ata are dropp ed down eac h tree of the forest and missing v alues are imputed d ynamically as in Step 1 using P ∗ k ,h . On ce ter- minal no de m emb ership across the forest is established, miss in g data are summary imputed. If a test case has missing v alues for a conti n uous v ari- able, the final summary v alue is its a v erage imputed v alue. If a test case has missing v alues for an intege r v ariable, the summ ary v alue is its imputed v alue o ccurring m ost f r equen tly . 14 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER 8.3. Imputatio n for missing outc omes. Imputation f or missing v alues of censoring indicat ors and surviv al times pro ceeds in the same m anner a s for x v ariables. Within the wo rking no de h , missing outcome data are drawn randomly from the empir ical distribution fun ction of the n onmissing in-bag outcome data. O n ce missing outcome data and missing v alues for x v ariables are imputed, splitting proceeds using impu ted in-bag data. Summary imputation for outcome s is similar to that for x v ariables. If surviv al time for a case is m issing, the su mmary v alue is the a verag e of its imputed in-b ag v alues. If the censoring indicator for a case is missing, the su mmary v alue is its in-bag imputed censoring indicator o ccurr ing most frequen tly . OO B pr ediction error is calculated usin g imputed outcomes. If a case has a missing outcome, its OOB summary impu ted v alue is u sed when calculati ng the predictio n error. T est data with m issing outcomes are treated as follo ws. T est d ata are dropp ed do wn eac h tree and missing data are imputed within a no de by dra wing a random v alue from the empirical distribution function of the nonmissing in-bag training outcome d ata. Final sum mary imputation uses imputed v alues of test ca ses from terminal no des. 8.4. Iter ating the missing data algorithm. With increasing amoun ts of missing data, accuracy of imputed v alues ma y degrade. I n suc h c ases, accu- racy can b e impro v ed su bstan tially by iterating the missing d ata algorithm. Iterativ e imputation w orks as follo w s. After the initial cycle o f gro wing the forest, m issing data are imp u ted using OOB summary v alues. A new forest is then gro wn usin g the imp u ted data. F or eac h case originally missing a v alue, draw a random v alue fr om the nonm issing in-bag cases w ithin its ter- minal no de. A v alue is dra wn for eac h tree in the new forest. Impute case s with missing d ata using fu ll (in-bag plus OOB d ata) s u mmary i mputation. Use the reimputed data and gro w another forest. Rep eat iterativ ely . 8.5. Empiric al p erforman c e. T o illustrate the missing data algorithm, w e use the pb c data considered in Section 6 . The data is from a clinical trial in volving primary b iliary cirrhosis (an auto imm une disease of the liv er) a nd comprise 312 ind ividuals who participated in a randomized trial to study the effectiv eness of the dru g D-p enicillamine. T he dataset cont ains 17 v ariables, as w ell as censoring information and time unt il death for eac h individu al [ Fleming and Harrington ( 1991 )]. A pr eliminary analysis of the data usin g RSF (100 0 trees un der log-rank splitting) identified sev eral v ariables that ha ve large VIMP v alues. VIMP for th e top predictiv e v ariable, serum bilirub in, wa s esp eciall y large, b eing almost t w ice the size of the second m ost predictiv e v ariable, age. This finding is consisten t with biology . In normal patien ts, a small amount of bilirubin, a w aste pro duced by the breakdo wn of old red b lo o d ce lls and h emoglo bin, RANDOM SUR VIV AL F OR ESTS 15 circulates in the bloo d, but el ev ated amoun ts are found in pati en ts with live r disease. Serum b ilirubin lev el is considered an accurate test of liv er function. Therefore, it is not surp rising th at this v ariable is predictiv e of d isease. Giv en the imp ortance of s eru m bilirubin, we w ere curious to see ho w w ell we could imp ute this v alue and what effects imputation might h a ve on prediction error. Because the v ariable had no missing data (i n fact, few data w ere missing), we randomly assigned missing v alues. In d oing so, we first p ut aside a 20% sub set of the data for testing. Over the remaining data, missing v alues w ere randomly assigned to v ariables. Th e missing data algorithm w as applied to these data, and the ro ot-mean-square error (RMSE) b etw een the imputed and tru e v alue for serum bilirubin w as calculated. The OOB error rate fr om the training data and the test s et error were also calc ulated. F or comparison, w e iterated the missing d ata algo rithm. W e also used pro ximit y imputation [ Breiman ( 2003 ), Lia w and Wiener ( 2002 )]. A five- iteratio n cycle was used in b oth cases. Results for all three metho ds are rep orted in T able 2 . V alues giv en are a v eraged o ver 100 random datasets generated under differing amoun ts of missingness. Sev eral trends ca n b e observe d from T able 2 : 1. Imputed v alues w ere accurate and OOB error matc hed test set error for the missing data algorithm u n der mo derate amoun ts of missing data (5– 10%). As missing data increased (25– 50%), OOB error o v erestimated test set err or and RMSE perform an ce degraded. 2. The iterativ e missing data algorithm was consisten tly go o d. RMSE w as consisten tly lo w and OO B error clo sely matc hed test set error in all set- tings. T able 2 Imputation for pb c data under differing amounts of m issing values. R ep orte d for e ach metho d ar e the f ol l owing: ( i) r o ot-me an-squar e err or (RMSE) b etwe en impute d and actual values for the c ovariate serum bilirubin (for c om p arison note that serum bilirubin had standar d deviation 4.41); (ii) O O B pr e diction; (iii) test set pr e diction err or using a 20% hold out test set ( TEST ) . V alues ar e av er age d over 100 r andom datasets. F or ests c omprise d 1000 tr e es under lo g-r ank splitting Missing data Pro ximit y % missing data Missing data algorithm imputation algorithm (5 iterations) (5 iterat ions) RMSE OOB TEST RMSE OOB TEST RMSE OOB TEST 5 3.181 0.170 0.174 3.033 0.165 0.175 3.068 0.171 0.174 10 3.537 0.173 0.168 3.323 0.163 0.169 3.381 0.173 0.166 25 3.766 0.185 0.165 3.357 0.159 0.169 3.420 0.173 0.165 50 3.912 0.204 0.155 3.416 0.156 0.161 3.464 0.164 0.157 16 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER 3. Pro ximit y imputation w as also consisten tly go o d. I mputed v alues w ere accurate, and O OB error rates matc hed test set errors in all settings. W e did not notice bias in OOB error rates rep orted elsewhere [ Breiman ( 2003 )]. These find ings suggest the missing data algorithm can b e u sed confidentl y with low to mo derate missing data. Its adv an tages include the abilit y to pre- dict on test data with m issing v alues and that it can b e used when outcomes are missing. With increasing amounts of m issing data, the algorithm can still b e u sed, b ut should b e iterated a few times to improv e accuracy of impu ted v alues. In terestingly , iterating the algorithm did not bias OO B error rates in our sim ulations. Inform al exp erimen ts with ot her data sho wed this pattern to b e consisten t, suggesting that the use of OOB summary imputation on the fi r st iteration and full summary impu tation on fur ther iterati ons (whic h mixes in-bag and OO B data) is helping to m itigat e bias. W e b eliev e this same effect is also at pla y for pr o ximit y imputation. Proxi mit y s ummary imputation uses b oth in-bag and OOB d ata, and this must b e helping to reduce bias. The results for pro ximit y imputation w ere b etter than expected and sugge st it could also b e us ed for missing data. 9. Bo dy mass index and long-term su rviv al among patien ts undergoing coronary artery b yp ass grafting (CABG) surgery . Over the p ast t w en ty y ears, pub lic health in v estigators hav e d o cumen ted a dramatic increase in the p r op ortion of the US p opulation that is o ve rwe igh t or ob ese [Mokdad et al. ( 2003 )]. A common metho d of assessing ob esit y is the calculation of b o dy mass index, whic h is we igh t (in kilograms) divided by h eigh t (in meters) squared (kg / m 2 ). Bo dy mass ind ex v alues of < 18 . 5, 18.5–25 , 25– 30, 30–3 5 and > 35 corresp ond to underweig h t, normal w eigh t, o verw eigh t, grade I ob esit y and grade I I ob esit y [Fle gal ( 2005 , 2007 )]. The a sso ciation of b o d y mass with mortalit y is c omplex. P eople who are underweig h t are at increased risk [ Adams et al. ( 2006 ), Flegal et al. ( 2005 )], but it is un clear whether this is due to rev ers e causalit y fr om c hronic con- ditions su c h as r enal dysf unction or undiagnosed cancer. S moking decreases b o dy mass bu t increases deat h risk fr om coronary artery disease, lun g can- cer, obstructive lung disease, and other d iseases. High b o dy mass index pre- dicts increased r isk, but it is not clea r whether risk starts to increase wh en sub jects b ecome simply o ve rwe igh t or fr an kly ob ese [ Adams et al. ( 2006 ), Flegal et al. ( 2005 )]. Ob esit y increases the risk of dev eloping and dying fr om coronary artery disease [Flegal et al. ( 2007 )], at least in part by predisp osing p eople to dia- b etes an d hyp ertension. Although it is wid ely accepted that ob esity increases the risk of devel oping disease, there is con tro versy o ve r the prognostic im- plications of ob esit y among patien ts with established disease; some suggest RANDOM SUR VIV AL F OR ESTS 17 there ma y ev en b e a parado xical protectiv e effect of ob esit y [Urtesky et al. ( 2007 )]. In order to explore these complex patte rns relating ob esit y to mortalit y in patien ts with stable, d o cumen ted, sev ere coronary artery d isease, w e fo cus on patien ts w h o u nderwe n t primary isolated CABG surgery . CABG wa s deve l- op ed in the late 1960s for treating refractory angina and has b een sh own to impro v e life exp ectancy in patien ts with sev ere obstructiv e coronary artery disease. Early randomized trials d emonstrated th at compared with medi- cal therap y , CABG resulted in b etter surviv al [ Y u suf, Zuc k er and P eduzzi ( 1994 )]. S in ce its inception, the pro cedure has ev olv ed subs tanti ally , with greater us e of arterial rev ascularization and in corp oration of min imally in- v asiv e tec hniqu es [Pusk as et al. ( 2004 )]. A num b er of mo dels hav e b een con- structed to p redict long-te rm surviv al after CABG, b ut n one has included b o dy mass index as a predictor [ Eagle et al. ( 2001 )]. The data analyze d comprised 15,586 p atien ts accrued b et wee n the ye ars 1990 and 2003 at Clev eland Clinic. The outcome u s ed for the analysis was all cause m ortalit y . Mean f ollo w-up time wa s 6.5 ± 3.8 ye ars, the m edian w as 6.4 yea rs. In total there were 36 v ariables analyzed, includin g: b o dy mass index; age in y ears; creatinine clea rance, a con tinuous mea sure of renal function; s m oking history; and num b er of inte rnal thoracic artery grafts used for CABG. These latter fiv e v ariables w ere found to b e among the most predictiv e using a RSF analysis. Fiv e-y ear pr edicted surviv al (estimated from the ensemble ) app ears in Figure 2 plotted against b o dy mass index. Su rviv al has b een conditio ned on smoking history and creatinine clearance lev el. Solid lines app earing in the figure are low ess sm o othed estimates of s u rviv al stratified by n um b er of in tern al thoracic artery grafts used. One can see a distinctiv e “ho c k ey stic k” pattern in Figure 2 . At lo w b o dy mass index, surviv al is lo w, after wh ic h surviv al increases with increasing b o dy mass un til reac hin g an infl ection p oin t of roughly 25 kg / m 2 , where it th en b egins to decrease. Int erestingly , this pattern is highly dep endent on creatinine clearance lev els. F or creatinine clearance v alues larger than 90 ml / min, signifying health y renal function, the ho c key stic k pattern is m uch straigh ter (coplots on the extreme right -hand side). These results add strengt h to the h yp othesis that there is a r ev erse cau- sation effect for und erw eight individuals. The p o or surviv al seen in p atien ts with lo w b o dy m ass ind ex could b e construed as an effect of b eing un der- w eigh t, but this ma y b e incorrect . As seen, the association of lo w b o dy mass with surviv al is rela ted to renal function, and this effect dissipates as r enal function imp ro ves. Th us, p o or surviv al in p atien ts with lo w b o d y m ass index ma y not b e due to b eing underwei gh t, bu t rather due to the systemic effect s of renal disease. 18 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER Fig. 2. Pr e dicte d five-ye ar survival vers us b o dy mass index (kg / m 2 ), c onditione d on cr e- atinine cle ar anc e (ml / min) and smoking history. Curves ar e smo othe d using lowess and str atifie d by numb er of internal thor acic artery gr afts (black = 0; r e d = 1; gr e en = 2; blue = c ombine d). Analysis of CABG dat a b ase d on RSF using 1000 tr e es with lo g-r ank splitting. The effect of smoking on surviv al is also interesting. Smoking decreases b o dy mass, but at the same time leads to increased risk and p o orer s urviv al. This effect can b e seen in Figure 2 . F or patien ts w ith no in ternal th oracic artery grafts (blac k curv es) and normal renal fu nction (creatinine clearance > 90 ml / min), smoking leads to a m ore pronounced h o c key stic k pattern (top r igh t coplot v ersus bottom righ t coplot). Note that for the same range of creatinine clearance, there is no effect when patien ts hav e at least one in tern al thoracic artery graft (green and red cur v es). Th ese p atien ts ha ve additional coronary artery protection, and the protection that this confers RANDOM SUR VIV AL F OR ESTS 19 masks the lesser effect of smoking. The effect of smoking on surviv al is also less apparen t in patien ts with abnormal creatinine clearance leve ls. Again, this is because the effec t is b eing mask ed, in this case by renal dysfunction. 10. Discussion. In this pap er w e ha v e in tro duced RS F, a new ext ension of Breiman’s forests metho d [ Breiman ( 2001 )], to right -censored surviv al data. A random surviv al forest consists of random surviv al trees. Using in- dep endent b o otstrap samples, eac h tree is gro wn by randomly s electing a subset of v ariables at eac h no de and then splitting the no de using a surviv al criterion inv olving su rviv al time a nd censoring sta tus information. Th e tree is considered fu lly gro wn when eac h terminal no d e has no few er than d 0 > 0 unique deaths. The estimated CHF for a case is the Nelson–Aalen estimator for the case’s terminal node. The ensem ble is the a verag e of these CHFs. Be- cause trees are grown from in -bag data, an OO B ensem ble can b e calc ulated b y dropp ing OOB ca ses down their in-bag surviv al trees and a v eraging. The predicted v alue for a case using the OO B ensem ble do es not use surviv al information for that case, and, therefore, it can b e used for nearly unbi- ased estimation of prediction err or. F rom this, other useful measur es can b e deriv ed , suc h as VIMP v alues for filtering and selecting v ariables. RSF incorp orates many of th e useful ideas p romoted b y Breiman ( 2001 ). A t the same time, w e hav e prop osed new wa y s to extend the method ology . A no v el missing data algorithm was in tro d uced that can b e used for b oth training and testing data and that pro vides nearly unbiase d estimates for error r ates eve n with large amoun ts of missing d ata. A larg e exp erimen t w as used to assess prediction accuracy of RS F. Ov er a wide range of r eal as well as sim ulated datasets, we found RSF to b e con- sisten tly b etter than, or at least as go o d as, co mp eting metho ds . Since th e in tr o duction of RF to the mac h ine learning comm un it y , there has b een a tremendous effort to do cument its empirical p erformance. Our results con- firm w h at has generally been found: RF pro duces highly accurate ensem ble predictors. W e hav e also illustrated the ease with which RSF can b e applied to r eal data settings to u nco ver highly complex in terrelationships b etw een v ariables. Our case study inv olving coronary a rtery disease found imp ortan t relation- ships among renal function, b o dy mass index, and long-term surviv al that help explain muc h of the confusion rep orted in the literature on this con tro- v ersial topic. Complex relationships lik e this are found with ease using to ols suc h a s VIMP in com bination with the h ighly adaptive n ature of forests. In con trast, con v en tional metho ds are muc h less automatic and require consid- erable sub jectiv e inpu t from the user in data settings wh ere v ariables are highly in terrelated. 20 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER Ac kn o wledgment s. The authors thank Mic hael Stei n, the Asso ciate Ed - itor, and t wo anonymous referees who all help ed to substant ially imp r o ve the pap er. REFERENCES Adams, K. F., Scha tz kin, A. , Ha rris, T . B . e t al. (200 6). Overw eight, ob esity , and mortalit y in a large prospective cohort of persons 50 to 71 years old. N. Engl. J. Me d. 355 763–778. Breiman, L. (1996). Bag ging predictors. Machine L e arning 26 123–140. Breiman, L. (2001). Random forests . Machine L e arning 45 5–32. Breiman, L. (2002). S oftw are for the masses. Slides p resen ted at the W ald Lectures, Meeting of the Institute of Mathematical Statistics, Banff, Canada. Av ailable at http://www .stat.berk eley.edu/users/breiman . Breiman, L. (200 3). Manual —setting up, using and understanding random forests V4.0. Av ailable at ftp://ftp.stat. b erkele y .edu/pub/users/breiman/Using random forests v4.0.pdf . Breiman, L., Friedman, J. H., Ol shen, R. A. and Stone, C. J . (1984 ). Classific ation and R e gr ession T r e es . W adsw orth, Belmont, Cali fornia. MR0726392 Cor tes, C. and V apnik, V. N. (1995). Support-vector netw orks. Machine L e arning 20 273–29 7. Ea gle, K. A., Guyton, R. A. , Da v idoff, R. et al. (2004). ACC/AHA 2004 guideline up date for coronary artery bypass graft surgery: Summary article. A rep ort of the American Co llege of Ca rdiology/American Heart A sso ciation T ask F orce on Practice Guidelines (Committee to Up d ate the 1999 Guidelines for Coronary Artery Byp ass Graft Surgery). J. Am. Col l . Car diol. 44 213–310. Flegal, K. M., Gra ubard, B. I., Williamson, D. F. and Gail, M. H. (2005). Ex cess deaths asso ciated with underw eigh t, ov erw eigh t and obesity . J. Amer. Me d. Asso c. 293 1861–1 867. Flegal, K. M., G raub ard, B. I., Williamson, D. F. and Gail, M. H. (200 7). Ca use- specific excess deaths as sociated with underweig ht, o verw eight and ob esit y . J. Amer . Me d. Asso c. 298 2028–2037. F ont aine, K. R., Re dden, D. T., W ang, C., Westf all, A. O. and Allison, D. B. (2003). Y ears of life los t due to ob esit y . J. Amer . Me d. Asso c. 289 187–193. Fleming, T. and Harrington, D. (1991). Counting Pr o c esses and Survival Ana lysis . Wiley , New Y ork. MR11 00924 Harrell, F., Califf, R ., Pr yor, D., Lee, K. and Rosa ti, R. (1 982). Ev aluating the yield of medical tests. J. Amer. Me d. Asso c. 247 2543–2546 . Heager ty, P. J. and Zheng, Y. (2005). Surv iv al model predictive accurracy and ROC curves. Biom etrics 61 92–105. MR213584 9 Hothorn, T. and Lausen, B. (20 03). O n t h e exact distribution of maximal ly selected rank statistics . Comput. Statist. Data Anal. 43 121– 137. MR198533 2 Hothorn, T., Buhlmann, P., Dudoit, S., M olinar o, A. a nd v an der Laa n, M. J. (2006). Surviv al ensem b les. Biostat. 7 355–37 3. Ishw ara n, H. (2007). V ariable imp ortance in binary regression trees and forests. El e ctr on. J. Statist. 1 519–537. MR235771 6 Ishw ara n, H ., Blacksto ne, E. H ., Pothier, C. and La uer, M. S. (200 4). Relativ e risk forests for exercise heart rate recov ery as a predictor of mortalit y . J. Amer. Statist. Asso c. 99 591–600. MR2086385 RANDOM SUR VIV AL F OR ESTS 21 Ishw ara n, H. and Kogalur, U. B. (2007 ). R andom surviv al forests for R. Rne ws 7 25–31. Ishw ara n, H. and Ko galur, U. B. (2008). RandomSu rviv alF orest 3.2.2. R pack age. Av ailable at http://cra n.r- project.org . Ishw ara n, H ., Black stone, E. H ., Appe rson, C. A. and Rice, T. W . A no vel data- driven approac h to stage grouping of esophageal cancer. Cleveland Clinic tec hnical rep ort. Kalbfleisch, J. and Prentice, R. (1980). The Statistic al Analysis of F ail ur e Time Data . Wiley , New Y ork. MR057011 4 Ka tt a n, M. (2003). Co mparison of Co x regress ion with other metho ds for determining prediction models and nomograms. J. Ur ol. 170 S6–S10. LeBlanc, M. and Cro wley, J. (1992). Relativ e risk trees for censored surviv al data. Biometrics 48 411–425 . LeBlanc, M. and Cro wley, J. (1993). Surviv al trees by goodn ess of split. J. A mer. Statist. Asso c. 88 457–467. MR122437 0 Lia w, A. and Wiener, M. (2002). Cl assification and regression by randomF orest. Rnews 2/3 18–22. Lia w, A. and Wiener, M. (2007). R andomF orest 4.5-18. R pac k age. Ava ilable at http://cra n.r- project.org . Molinaro , A . M. , Dudoit, S. and v an d er Laan, M. J. (2004). T ree-b ased multiv ariate regressio n and den sity estimation with righ t-censored data. J. Multivariate Anal. 90 154–17 7. MR206494 0 Mokd ad, A. H, Ford, E. S. , Bo wman, B . A. et al. (2003). Prev alence of obesity , diabetes and ob esity-related health ris k factors. J. Amer. Me d. Asso c. 289 76–79. Naftel, D. , Blacksto ne, E. H. a nd Tu rner, M. (1985). Conserv ation of even ts. Un- published notes. Olshansky, S. J., P assaro, D. J . , Hershow , R. C. et al. (2005). A p otentia l decline in lif e expectancy in the United S tates in the 21st century . N. Engl. J. Me d. 352 1138 – 1145. Puskas, J. D., Williams, W. H. , Mahoney, E. M. et al. (2004 ). Off-pump vs con- ven tional coronary artery bypass grafting: Early and 1-year graft patency , cost and qualit y-of-life outcomes: A randomized trial . J. Amer . Me d. Asso c. 291 1841–1849. Ro ssi, P. H., Berk, R. A. and Lenihan, K. J. (1980). Money, Work and Crime: Some Exp erimental R esults . Academic Press, New Y ork. Schapire, R., Freund, Y., B ar tlett, P. and Lee, W. (1998). Boosting the margin: A n ew explanation for the effectiv eness of voting metho ds. Ann. Statist. 26 1651–168 6. MR167327 3 Segal, M. R. (1 988). Regression trees fo r censored data. Biometrics 44 35–47. Uretsky, S., Messerli, F. H. , Bangalore, S. et al. (200 7). Ob esity parado x in pa- tients with hyp ertension and coronary artery disease. A m. J. Me d. 120 863–870. Yusuf, S., Zucker, D. , Peduz zi, P. et al. (1994). Effect of coronary artery b ypass graft surgery on surviv al: Overview of 10-y ear results from randomised trials by the Coronary Artery Bypass Graf t Surgery T rialists Collaboration. L anc et 344 563–570. H. Ishw aran Dep ar tmen t of Quantit a tive Heal th Sciences Cleveland Clinic 9500 Euclid A ven ue Cleveland, Ohio 44195 USA E-mail: heman t.i shw aran@gmail.com U. B. Kogalur Dep ar tmen t of St a tistics Columbia University 1255 Amsterdam A v enue New York, New York 10027 USA E-mail: ubk2101@co lumbia.edu 22 H. ISHW AR AN, U. B. KOGALUR, E. H. BLACKSTONE AN D M. S. L AUER E. H. Blacksto ne Dep ar tmen t of Thora cic and Cardiov ascular Surger y Cleveland Clinic 9500 Euclid A ven ue Cleveland, Ohio 44195 USA E-mail: blac kse@ccf.org M. S. Lauer Division of Prevention and Popula tion Science Na tional Hear t, Lung, and Blood Institute R ockledge Center II 6701 Ro ckledge Drive Besthesd a, Mar yland 20892 USA E-mail: lauerm@nhlbi.nih.gov

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment