An ensemble approach to improved prediction from multitype data

IMS Collections
Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh
Vol. 3 (2008) 302–317
© Institute of Mathematical Statistics, 2008
DOI: 10.1214/074921708000000219

Jennifer Clarke¹ and David Seo²
University of Miami School of Medicine

Supported by NIH Grant 5K25CA111636-03.
¹ Department of Epidemiology and Public Health, University of Miami Leonard M. Miller School of Medicine, Miami, FL, USA, e-mail: JClarke@med.miami.edu
² Department of Medicine, Division of Cardiology, University of Miami Leonard M. Miller School of Medicine, Miami, FL, USA, e-mail: DSeo@med.miami.edu
AMS 2000 subject classifications: Primary 62M20, 62H30; secondary 62P10.
Keywords and phrases: model ensembles, prediction, single nucleotide polymorphism (SNP), support vector machines, variable selection.

Abstract: We have developed a strategy for the analysis of newly available binary data to improve outcome predictions based on existing data (binary or non-binary). Our strategy involves two modeling approaches for the newly available data, one combining binary covariate selection via LASSO with logistic regression and one based on logic trees. The results of these models are then compared to the results of a model based on existing data with the objective of combining model results to achieve the most accurate predictions. The combination of model predictions is aided by the use of support vector machines to identify subspaces of the covariate space in which specific models lead to successful predictions. We demonstrate our approach in the analysis of single nucleotide polymorphism (SNP) data and traditional clinical risk factors for the prediction of coronary heart disease.

Contents
1 Introduction
2 Model types
  2.1 Logic regression
  2.2 Variable selection
3 Comparing and combining model predictions
  3.1 Support vector machines
  3.2 Analysis strategy recap
4 Example: The CATHGEN study
  4.1 Data
  4.2 Model building
  4.3 Results
5 Discussion
Acknowledgments
References

1. Introduction
In applied research contexts the statistician is often faced with newly available data which may provide information relevant to a recently completed analysis. This scenario is occurring more and more frequently in medical research as genomic data becomes available which may provide information relevant to the determination of disease risk, a determination that has been traditionally based on existing clinical data. There is a need for statistical approaches to variable selection and modeling which attempt to provide improved outcome predictions in such contexts by combining information from new and existing data which may be of multiple types.
We have developed one such strategy for utilizing newly available binary data to improve binary outcome predictions from an existing model based on both continuous and binary data. There are many approaches to regression and classification in the machine learning and statistical literature that would be appropriate for modeling binary data, including CART [4], MARS [15], treed models [6], and logic regression [30], to name only a few. Since our interest is specifically in single nucleotide polymorphism (SNP) data, we have chosen to model binary data with logistic regression as well as logic regression models. The logic regression models recognize the often complex interactions that exist among SNPs and attempt to model such interactions in analyzing the relationships between SNPs and outcome status. Logic regression models also perform variable selection and model construction when the number of observations, $n$, is less than the number of covariates, $p$, which is a context of particular interest to us.

Our goal is to combine all available information in generating the best outcome predictions possible. In doing so we consider several approaches which borrow ideas from the multimodel ensemble modeling literature [11]. One approach is to take a weighted average of the predictions from the existing model and the binary data model, in the spirit of Bayesian model averaging [7]. A second approach is to build a model from all available covariates and not utilize the existing model, which was built before the newer binary covariates were available. Our final approach is a two-stage approach: determine subspaces of the covariate space on which the predictions from the existing model are accurate, and utilize the predictions from a model of the newer binary covariates on the remaining subspaces.
This would yield a more accurate set of predictions overall in situations where neither data type is globally informative, for example, where the data have been collected from a heterogeneous population. To avoid a subspace definition which requires knowledge of the outcome of interest for observations to be predicted, we differentiate these various subspaces via support vector machines (SVM) [3, 8, 39]. As a result our technique yields “honest” predictions for new observations.

Initially we discuss the model classes and variable selection for binary data. We then discuss how the predictions from such models can be used to improve the predictions from models based on existing data via support vector machines. Our approach is demonstrated in the context of prediction of coronary heart disease from traditional clinical risk factors and genetic (SNP) data.

2. Model types
We assume a continuous response variable $Y$ and a $p$-dimensional vector of binary covariates $X$ (the “newly available” data). In the case of SNP data each covariate $X_j$, $j = 1, \ldots, p$, is binary. Since the relationship between the covariates and the response is unknown we consider two model types, logistic regression and logic regression [30]. As logistic regression is a well known modeling technique we will not discuss it in detail. However we will discuss variable selection prior to logistic regression modeling when $n < p$ in Section 2.2. Logic regression is discussed in more detail below.

2.1. Logic regression
Logic regression [20, 30] is an adaptive regression methodology for finding Boolean combinations of binary covariates that are associated with an outcome variable.
This methodology was developed to address situations where the interaction of many predictors is responsible for differences in the response, which is often the case when all predictors are binary. As described in [30] logic regression models take the form

(2.1)  $g(E[Y]) = \beta_0 + \sum_{i=1}^{t} \beta_i L_i$,

where $L_i$ is a Boolean expression of the covariates $X_j$. A score function relates fitted values to the response. This framework includes linear regression ($g(E[Y]) = E[Y]$ with score function RSS), logistic regression ($g(E[Y]) = \log(E[Y]/(1 - E[Y]))$ with score function binomial deviance), as well as classification ($\hat{Y} = I(L = 1)$ where $I(\cdot)$ is the indicator function and the score function is $P(Y \neq \hat{Y})$). Logic regression models can be conveniently represented in tree form. For example, the tree in Figure 1 represents the logic expression

(2.2)  $(((X^c_{79}) \vee ((X^c_{48}) \wedge (X^c_{64}))) \wedge (((X^c_{28}) \vee (X^c_{9})) \vee ((X^c_{43}) \wedge X_{63})))$,

where $X_j$ indicates $X_j = 1$ and $X^c_j$ indicates the conjugate ($X_j = 0$).

Fig 1. A logic tree representing the Boolean expression in Equation (2.2). White text on a black background denotes the conjugate of a variable.

Note that each $L_i$ in Equation (2.1) may be represented as a tree, and hence logic regression allows for multiple tree models.

The space of possible logic trees is enormous, especially in situations where $n < p$. To search this space efficiently without sacrificing the desire for optimality, either a greedy search or a search via simulated annealing can be employed. These search techniques estimate the $L_i$ and $\beta_i$ simultaneously (see Equation (2.1)) and use simple “moves” to search for “good” logic models (i.e., models which minimize the scoring function).
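To make the classification case concrete, the following minimal Python sketch evaluates the Boolean expression in Equation (2.2) for a single observation and computes the misclassification score $P(Y \neq \hat{Y})$ on a sample. The function names, the dictionary layout for the covariates, and the toy observations are assumptions made for illustration only; the models in this paper are fit with the R package LogicReg, not with this code.

```python
def logic_expression_2_2(x):
    """Evaluate Equation (2.2) for one observation.

    x maps a covariate index j to its binary value X_j (0 or 1).
    Returns 1 if the expression is true and 0 otherwise.
    """
    def on(j):   # X_j, i.e. X_j = 1
        return x[j] == 1

    def off(j):  # X_j^c, the conjugate, i.e. X_j = 0
        return x[j] == 0

    left = off(79) or (off(48) and off(64))
    right = (off(28) or off(9)) or (off(43) and on(63))
    return int(left and right)


def misclassification_score(y, y_hat):
    """Empirical P(Y != Y_hat): the classification score to be minimized."""
    return sum(a != b for a, b in zip(y, y_hat)) / len(y)


# Toy observations (illustrative values only, not study data).
indices = (9, 28, 43, 48, 63, 64, 79)
all_zero = {j: 0 for j in indices}
all_one = {j: 1 for j in indices}
predictions = [logic_expression_2_2(all_zero), logic_expression_2_2(all_one)]
score = misclassification_score([1, 1], predictions)
```

A search procedure would propose changes to the expression and keep those that lower `score`; the sketch only shows how one candidate expression is scored.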
Using terminology similar to that of CART [4] these “moves” include growing, pruning, splitting, and deleting. As greedy searches often lead to models which overfit the data or are suboptimal (as when the search gets “stuck” in a local minimum) [35] we prefer the use of simulated annealing to search for logic trees. Note that each “move” mentioned above has a matching “countermove” (e.g., growing as opposed to pruning) which is important in the Markov chain theory which underlies simulated annealing [38].

We use randomization both to test the null model of no signal in the data and to determine the optimal model size (if the null model is rejected). For testing the null model we randomly permute the response values and find the best fitting model. If there is no signal, the score of this model should be comparable to the score of the best model fit to the original data. By repeating the above procedure multiple times we can use the proportion of runs with model scores better than the score of the best model fit to the original data as a p-value for our test.

The method for finding the optimal model size is based on a series of randomization tests. The null hypothesis for each test is that the optimal model size is $k$ and larger models with better scores are due to noise. Assume that the null hypothesis holds and that the best model of size $k$ has score $s_k$. The fitted values from this model fall into two classes; we now permute the response values within each class and find the best model of any size on the permuted data. If this model has score $s^*_k$ then under the null hypothesis $s_k$ comes from the same distribution as $s^*_k$. This distribution can be approximated by repeated permutations. We perform the above process for $k \in \{0, \ldots, K\}$, yielding a series of histograms of randomization scores $s^*_k$ for each value of $k$.
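The null-model randomization test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `best_literal_score` is a hypothetical, deliberately tiny stand-in for the model search (it considers only single covariates and their conjugates, whereas the paper searches over logic trees by simulated annealing), and the toy data are randomly generated rather than taken from the study.

```python
import random


def misclassification(y, y_hat):
    """Empirical P(Y != Y_hat); lower is better."""
    return sum(a != b for a, b in zip(y, y_hat)) / len(y)


def best_literal_score(X, y):
    """Hypothetical stand-in for the model search: the best score
    over the single literals X_j and X_j^c."""
    best = 1.0
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        conjugate = [1 - v for v in column]
        best = min(best,
                   misclassification(y, column),
                   misclassification(y, conjugate))
    return best


def null_model_p_value(X, y, n_perm=200, seed=0):
    """Proportion of permuted-response runs whose best score is
    strictly better than the best score on the original data."""
    rng = random.Random(seed)
    observed = best_literal_score(X, y)
    permuted = list(y)
    better = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # breaks any link between X and y
        if best_literal_score(X, permuted) < observed:
            better += 1
    return better / n_perm


# Toy data: the response is exactly the first covariate (pure signal).
data_rng = random.Random(1)
X = [[data_rng.randint(0, 1) for _ in range(3)] for _ in range(40)]
y_signal = [row[0] for row in X]
p_signal = null_model_p_value(X, y_signal)
```

The comparison is strict to follow the wording "better than"; counting ties as well (`<=`) would give a more conservative p-value.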
The optimal model size is determined by comparing these histograms; for example, one may choose the model size for which only a small proportion of scores $s^*_k$ are better than $s_k$.

To further avoid overfitting, the data set of $n$ observations on $p$ covariates $X_j$, $j = 1, \ldots, p$, is split into a training set of size $n_1$ and a test set of size $n_2$ ($n = n_1 + n_2$). The logic regression models are fit to the training set and the accuracy of their predictions is evaluated on the test set. The fitting and evaluation of models can be performed in the R package LogicReg as described in [29].

2.2. Variable selection
Unlike logic trees, logistic regression models require that $n > p$. In cases where $n \le p$ we perform variable selection via the least absolute shrinkage and selection operator (LASSO) [36] prior to regression modeling. LASSO retains the beneficial features of both subset selection and ridge regression by minimizing the residual sum of squares subject to the constraint that the sum of the absolute values of the coefficients on the covariates is less than a constant (i.e., a constraint on the $L_1$ norm of the coefficient vector). This tends to shrink some coefficients and set others to zero, leading to models with improved interpretability and stability. LASSO can be applied to generalized regression models such as logistic regression models; see [36] for details.

Osborne et al. [25] developed an efficient algorithm for computing LASSO estimates which is applicable in the $n < p$ case. We use this algorithm as implemented in the R package lasso2 [21] in an iterative fashion to perform variable selection, removing those covariates whose coefficients have been set to zero at each iteration.
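Written out, the constrained least-squares problem that defines the LASSO takes the following form. The symbols $y_i$, $x_{ij}$ and the bound $t$ are notation introduced here for illustration and do not appear in the text above.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\, \beta}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
\quad \text{subject to} \quad
  \sum_{j=1}^{p} |\beta_j| \le t, \qquad t \ge 0 .
```

Smaller values of $t$ force more coefficients to exactly zero, which is what the iterative selection step relies on. For logistic regression the residual sum of squares would presumably be replaced by the negative binomial log-likelihood, in line with the generalized setting referred to in [36].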
If the iterative LASSO technique yields $p^* \ge n$ variables with non-zero coefficients we remove variables one at a time between LASSO iterations, starting with those variables with the smallest coefficients, until $p^* < n$. The remaining variables are used in developing a logistic regression model of our response variable via stepwise selection.

It is important to mention that variable selection techniques exist specifically for SNP data. For example, Genomic Control (GC) [9, 10] is an analytic method for SNP selection which controls the false positive rate by separating causal from confounding factors. There are also methods for selecting which SNPs to genotype when presented with a large number of arbitrary SNPs (see, for example, [37, 41]). However, we deemed such methods inappropriate for our context of interest in which we were presented with only the partial results of such methods, i.e., a modest number of SNPs not in linkage disequilibrium (LD) and without haplotype information which had been selected based upon the application of methods similar to those mentioned above (see Section 4 for more details on our context of interest).

3. Comparing and combining model predictions
Our goal is to determine whether the information from new binary covariates $X_j$, $j = 1, \ldots, p$, can be used to improve predictions of a response $Y$ from a model built on existing covariates $Z_l$, $l = 1, \ldots, p'$. Let $M_1$ and $M_2$ represent the logic regression and logistic regression models fit to $X_j$, $j = 1, \ldots, p$, respectively, and let $M_e$ represent the existing model fit to $Z_l$, $l = 1, \ldots, p'$. Suppose we are given a data set of size $n'$ consisting of covariates $X$ and $Z$ for which we would like to generate predicted values of the outcome $Y$.
Let $\hat{Y}_1$ be the predictions for this data set from $M_1$, $\hat{Y}_2$ be the predictions from $M_2$, and $\hat{Y}_e$ be the predictions from $M_e$. Possible strategies for generating optimal predictions include the following:

• Weighted Average of Predictions $\bar{\hat{Y}}$. Determine whether a weighted average of the predictions from either $\hat{Y}_1$ or $\hat{Y}_2$ and $\hat{Y}_e$ yields better results than $\hat{Y}_e$ alone. A weighted average prediction $\bar{\hat{Y}}$ is defined as

(3.1)  $\bar{\hat{Y}} = \alpha \hat{Y}_e + (1 - \alpha) \hat{Y}_m, \quad m = 1, 2$,

where $0 \le \alpha \le 1$. The weight $\alpha$ is determined by repeated training/test set evaluation.

• Predictions from Composite Model $\hat{Y}_c$. We consider whether fitting a model directly to $\{X, Z\}$ will lead to improved predictions. The modeling procedures described in Section 2 are repeated with $Z$ as well as $X$ considered as possible covariates. This leads to models $M_{c1}$ (logic regression) and $M_{c2}$ (logistic regression) whose predictions $\hat{Y}_{c1}$ and $\hat{Y}_{c2}$ can be compared to $\hat{Y}_e$.

• Two-Stage Predictions $\hat{Y}_s$. Assume a two-class classification problem, i.e., $Y \in \{-1, 1\}$. In Stage 1 we determine for which observations the predictions $\hat{Y}_e$ are correct ($n_c \subseteq \{1, \ldots, n'\}$) or incorrect ($n_{\bar{c}} = \{1, \ldots, n'\} \setminus n_c$). In Stage 2, for observations in $n_{\bar{c}}$ we replace the predictions from $\hat{Y}_e$ with the predictions from either $M_1$ or $M_2$. In other words, we create a two-stage model $M_s$ for $Y_i$, $i = 1, \ldots, n'$, with predictions defined as

(3.2)  $\hat{Y}_{si} = \begin{cases} \hat{Y}_{ei}, & \text{if } \hat{Y}_{ei} = Y_i, \\ \hat{Y}_{mi}, & \text{if } \hat{Y}_{ei} \neq Y_i, \end{cases}$

where $m = 1, 2$. This predictive scheme may be particularly useful for data from a heterogeneous population, where it is possible that the accuracy of the predictions from a given model may vary across different subgroups of the population.

Unfortunately $M_s$, and hence $\hat{Y}_s$, depends on the true response $Y$.
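The weighted average of Equation (3.1) and the two-stage rule of Equation (3.2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names and toy numbers are hypothetical, outcomes are coded 0/1 with predicted probabilities cut at 0.5 (the rule used in Section 4.3), and `choose_alpha` scores a grid of weights on a single held-out set, which simplifies the repeated training/test evaluation described above.

```python
def weighted_average(y_e, y_m, alpha):
    """Equation (3.1): alpha * y_e + (1 - alpha) * y_m, elementwise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * e + (1.0 - alpha) * m for e, m in zip(y_e, y_m)]


def accuracy(y, probabilities, threshold=0.5):
    """Share of observations whose thresholded prediction equals y."""
    classes = [1 if p >= threshold else 0 for p in probabilities]
    return sum(c == t for c, t in zip(classes, y)) / len(y)


def choose_alpha(y, y_e, y_m, grid):
    """Pick the grid value of alpha with the highest held-out accuracy
    (a single evaluation; an assumption made for brevity)."""
    return max(grid, key=lambda a: accuracy(y, weighted_average(y_e, y_m, a)))


def oracle_two_stage(y, y_e_class, y_m_class):
    """Equation (3.2): keep the existing model's class when it is
    correct, otherwise use the new model's class. Needs the true y."""
    return [e if e == t else m for t, e, m in zip(y, y_e_class, y_m_class)]


# Toy held-out data (illustrative values only, not study data).
y = [1, 0, 1, 0]
y_e = [0.9, 0.8, 0.4, 0.1]   # existing-model probabilities
y_m = [0.2, 0.1, 0.9, 0.3]   # new-model probabilities
alpha = choose_alpha(y, y_e, y_m, [0.0, 0.25, 0.5, 0.75, 1.0])
y_s = oracle_two_stage(y, [1, 1, 0, 0], [0, 0, 1, 0])
```

Note that `oracle_two_stage` takes the true outcome `y` as an argument, so it cannot be applied to observations whose outcome is unknown.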
As an alternative we propose the use of a support vector machine (SVM) to discriminate those subspaces of the covariate space on which the results of $M_e$ are correct from those on which the results are incorrect, based on the training data.

3.1. Support vector machines
Support vector machines (SVMs) [3, 8, 39] are a group of related supervised learning methods for classification or regression. In the case of two-class classification we consider a set of data points $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ where each $y_i \in \{-1, 1\}$ denotes the class to which $x_i$ belongs. The objective of an SVM is to produce a hyperplane which can separate the two classes using only $x_i$, $i = 1, \ldots, n$, in a way which minimizes the empirical classification error and maximizes the geometric margin between the classes. More specifically, the (soft margin) support vector machine is the solution to the following optimization problem:

$\min_{w, b, \xi} \; \frac{1}{2} w^t w + C \sum_{i=1}^{n} \xi_i, \quad C > 0,$
subject to $y_i (w^t \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$

Note that the vectors $x_i$, $i = 1, \ldots, n$, are mapped to a higher dimensional space by the function $\phi$, and the SVM finds a linear separating hyperplane in this higher dimensional space. This SVM has a “soft margin” in the sense that it allows for misclassified samples; if no hyperplane exists which can separate the two classes, this method will choose the hyperplane which splits the classes as cleanly as possible while still maximizing the geometric margin. The slack variables $\xi_i$ measure the degree of misclassification of the datum $x_i$. $K(x_i, x_j) = \phi(x_i)^t \phi(x_j)$ is called the kernel function of the SVM. The kernel function typically falls into one of four classes: linear, polynomial, radial basis function, and sigmoid. For more information on SVMs and their implementation we refer the reader to [5, 31].
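For reference, the four kernel classes named above and the slack of a single observation can be written out in Python. This is a sketch under assumptions: the parameter names `gamma`, `coef0` and `degree` and the exact parameterizations follow a widely used convention and are not taken from this paper, and `slack` uses the fact that at the optimum $\xi_i = \max(0, 1 - y_i f(x_i))$, where $f(x_i) = w^t \phi(x_i) + b$ is the decision value.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def slack(y_i, decision_value):
    """Slack xi_i = max(0, 1 - y_i * f(x_i)) for a label y_i in {-1, 1}.

    Zero when the point lies on the correct side of the margin,
    and greater than 1 when the point is misclassified.
    """
    return max(0.0, 1.0 - y_i * decision_value)


u, v = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(u, v)
k_poly = polynomial_kernel(u, v, gamma=1.0, coef0=1.0, degree=2)
k_rbf_same = rbf_kernel(u, u)
```

The radial basis function kernel depends only on the distance between two points, so it equals 1 when the points coincide and decays toward 0 as they move apart.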
As stated previously, our use of SVMs is to discriminate those subspaces of the covariate space on which the results of the existing model $M_e$ are correct from those on which the results are incorrect, based on the training data. Let $\mathcal{X}$ be the covariate space and consider an SVM which divides $\mathcal{X}$ into subspaces $\mathcal{X}_c$ and $\mathcal{X}_{\bar{c}}$ where the model results are correct and incorrect, respectively. We now redefine the two-stage model $M_s$ (and $\hat{Y}_{si}$), originally defined in Equation (3.2), independently of $Y$ using the results of the SVMs:

(3.3)  $\hat{Y}_{si} = \begin{cases} \hat{Y}_{ei}, & \text{if } X_i \in \mathcal{X}_c, \\ \hat{Y}_{mi}, & \text{otherwise}, \end{cases} \quad m = 1, 2.$

3.2. Analysis strategy recap
Before we move to Section 4 we briefly summarize our analysis strategy. We want a statistical model which can accurately predict the outcome status of an observation given a set of existing predictors (both continuous and binary) and a set of new binary predictors. Our key modeling approach is a two-stage model. In the first stage we build a model from the existing predictors only (a logistic regression model). Given the predictions from this model we design an SVM which can identify the subspaces in which observations are correctly or incorrectly predicted. In the second stage, in those subspaces where observations are incorrectly predicted, we use a model based only on the new binary predictors (a logistic regression or logic tree model) to generate accurate predictions. In this approach information from the new binary predictors is only utilized where needed, i.e., in subspaces where the existing predictors do not provide enough information to generate accurate outcome predictions.

4. Example: The CATHGEN study
We demonstrate the use of our method in the analysis of data from a cardiology study.
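As a compact reference for the example that follows, the two-stage rule of Equation (3.3) can be sketched in Python. The sketch assumes outcomes coded in $\{-1, 1\}$; the function names are hypothetical, and `toy_gate` is a stand-in for the decision rule of a trained SVM, which is not implemented here.

```python
def gate_training_labels(y_train, y_e_train):
    """Labels for training the gate: +1 where the existing model was
    correct on the training data, -1 where it was incorrect."""
    return [1 if e == t else -1 for t, e in zip(y_train, y_e_train)]


def two_stage_predict(X_new, y_e_new, y_m_new, in_correct_region):
    """Equation (3.3): keep the existing model's prediction when the
    gate places X_i in the 'correct' subspace, else use the new model.

    in_correct_region is any callable X_i -> bool; in the paper this
    role is played by an SVM. The true outcome is never consulted.
    """
    return [e if in_correct_region(x) else m
            for x, e, m in zip(X_new, y_e_new, y_m_new)]


def toy_gate(x):
    """Hypothetical gate standing in for a trained SVM decision rule."""
    return x[0] == 0


labels = gate_training_labels([1, -1, 1, -1], [1, 1, -1, -1])
y_s = two_stage_predict([(0, 1), (1, 1), (1, 0)],   # new covariates
                        [1, -1, 1],                 # existing-model classes
                        [-1, 1, -1],                # new-model classes
                        toy_gate)
```

The true outcome enters only through `gate_training_labels`, that is, when the gate is trained; prediction for new observations uses the covariates alone.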
A substantial problem in clinical cardiology is the gap in the ability to detect asymptomatic individuals at high risk for coronary heart disease (CHD) for preventive and therapeutic interventions [17, 26]. Up to 75% of such individuals are designated as low to intermediate risk by standard CHD risk assessment models; however, a substantial number of such individuals who are actually at increased risk may not be identified. One analysis from the Framingham Heart Study found that for individuals that manifested a new CHD event, the initial presentation in over 50% of the cases was myocardial infarction, silent myocardial infarction or sudden cardiac death [1]. Over 50% of individuals with sudden cardiac death have no prior symptoms of CHD [40]. Therefore, it is likely that the traditional risk factors do not account fully for CHD risk [16, 22, 24, 28]. Furthermore, current CHD risk assessment models do not provide one’s individual risk. Rather, the calculated assessment is for a population of individuals who share the same demographics and panel of risk factors.

A group of researchers at Duke University Medical Center (DUMC) has pursued an avenue of study evaluating the role of genes and gene variants in the development of atherosclerosis (the AGENDA study) [18, 19, 33]. As a result of their efforts they have compiled a list of candidate genes with a strong statistical correlation with vascular atherosclerosis. Through subsequent analysis for SNPs in these candidate genes, they analyzed 1300+ SNPs for association with significant CHD (stenosis ≥ 75% in at least one coronary artery) in a cohort of 1500 subjects who had undergone cardiac catheterization (CATHGEN). These SNPs were then ranked by their marginal association with the presence of CHD in a cardiac catheterization population.
We conduct an analysis of a subset of the CATHGEN data to test the hypothesis that genetic information in the form of SNPs will improve the ability of risk assessment models that use only traditional risk factors to classify individuals as having high risk for CHD. We developed prediction models for likelihood of significant CHD based on traditional risk factors such as cholesterol, blood pressure, diabetes and smoking, using a group of CATHGEN subjects who underwent cardiac catheterization. A separate set of CATHGEN subject data was used in selecting from the candidate SNP pool those SNPs with the highest marginal association with significant CHD; 81 such SNPs were available for analysis. We then assessed whether including genetic information improved our ability to classify individuals as having significant CHD. The research was performed under an approved protocol from the Institutional Review Board of DUMC.

4.1. Data
Two data sets were constructed from the CATHGEN data, one for SNP selection and model building (build set) and one for the evaluation of model predictions (evaluation or eval set). The evaluation set consisted of white individuals (self-reported race) with complete data for all 81 SNPs and all clinical variables (see Section 4.2). The build set consisted of white individuals with complete data for all 81 SNPs but incomplete clinical data (clinical data was assumed to be missing at random). Within each set individuals were separated into three cohorts: a) controls, ≥ 65 years of age without significant CHD, b) older cases (OC), ≥ 65 years of age with significant CHD, and c) younger cases (YC), ≤ 50 years of age with significant CHD.
For each cohort a group of samples was selected for model validation only: those 50–55 years of age with either minimal or significant CHD as defined by coronary angiography (for validation of models of cohorts a) and c)) and those 56–65 years of age with either minimal or significant CHD as defined by coronary angiography (for validation of models of cohorts a) and b)). Each cohort was further split by gender. A table of the study cohorts and number of subjects is shown in Table 1.

We used the 81 SNPs from the AGENDA study with the strongest statistical association with the presence of significant CHD. The strength of association was determined by 1) the p-value of SNP status in a logistic regression model of CHD, including both age and gender as covariates, and 2) the p-value of SNP status from a Cochran-Armitage Test for Trend [2]. We should note that the designation of the top SNPs was performed using a large group of subjects (1500) that included the data used for this study.

Typically with any given single base-pair difference, or single nucleotide polymorphism (SNP), only two out of the four possible nucleotides occur. Since each cell contains a pair of every autosome, we can think of a SNP as a three-level variable $X$ taking the values 0, 1, or 2 (e.g., for nucleotide pairs A/A, A/G, and G/G, respectively). Each SNP can be recoded as a binary variable using either dominant coding ($X_d = 1$ if $X \ge 1$ and $X_d = 0$ otherwise) or recessive coding ($X_r = 1$ if $X = 2$ and $X_r = 0$ otherwise). With the CATHGEN data we chose dominant coding for the SNPs, where $X = 0$ indicates no copy of the minor (less frequently occurring) allele.

Table 1
CATHGEN data

                            Build set              Evaluation set
                       Training  Validation     Training  Validation
Male    Young Cases       44        80             69        103
        Controls          34        14             32         18
        Older Cases       79        13             47         11
Female  Young Cases       11        21             11         18
        Controls          59        12             42         18
        Older Cases       15         3             15          4

4.2. Model building
Models were constructed on either male or female subjects. Within gender, these models compared either controls with young cases or controls with older cases. We describe the modeling approaches used for each gender/comparison combination. All computations were performed in R [27].

Predictive model using clinical variables ($M_e$).
For the clinical variables we used standard CHD risk factors as denoted by assessment tools such as the Framingham Heart Study risk algorithm [40]. We included presence of diabetes, current smoking status, total cholesterol level, HDL cholesterol level, systolic blood pressure and diastolic blood pressure. Clinical variables were collected at the time of cardiac catheterization. We used these variables to train both weighted and unweighted logistic regression models in the evaluation set, as the build set has incomplete clinical data. The weights were chosen to balance the importance of case and control samples. The trained model was then used to classify the validation subjects in the evaluation set as having minimal or significant CHD.

Predictive model using genetic variables ($M_1$, $M_2$).
Our SNP data consisted of the 81 SNPs from the AGENDA study typed in our CATHGEN samples, as described in Section 4.1. We constructed two models using only the CATHGEN samples from the build set: 1) LASSO for SNP selection followed by logistic regression (weighted and unweighted) using backwards selection, and 2) logic regression based on all 81 SNPs. Logic models were fit for both classification and logistic regression. These models were then used to classify the subjects in the evaluation set as having minimal or significant CHD.
LASSO and logic regression were performed using the R packages lasso2 and LogicReg, respectively.

Predictive model using combined clinical and genetic variables ($M_c$).
First, logistic regression models (weighted and unweighted) were built using genetic variables, as described above. The SNPs which appear in each model and the clinical variables were combined to train logistic regression models in the evaluation set. These models were then used to classify the validation subjects in the evaluation set as having minimal or significant CHD. A similar procedure was performed for each logic regression model.

Two-stage predictive model using the clinical and genetic models ($M_s$).
First, the trained clinical model was used to classify the subjects in the evaluation set as having minimal or significant CHD. Next an SVM was constructed which could discriminate the subspace of correctly classified samples from the subspace of incorrectly classified samples. For those samples in the subspace of incorrectly classified samples, the trained genetic models were applied to reclassify the subjects into the minimal and significant CHD groups. This resulted in a set of two-stage predictions, as described in Section 3.1 and Equation (3.3). SVMs were based on a radial basis function kernel and computed using the R package e1071 [12, 23].

4.3. Results
Our interest is in classifying individuals as having non-significant CHD ($Y = 0$) or significant CHD ($Y = 1$). A fitted or predicted probability of significant CHD $\hat{Y}_i$ from a logistic regression model for a given individual was considered an “accurate” classification if $Y_i = 1$ and $\hat{Y}_i \ge 0.5$, or $Y_i = 0$ and $\hat{Y}_i < 0.5$, and considered “inaccurate” otherwise. A fitted or predicted outcome $\hat{Y}_i$ from a logic regression model for a given individual was considered “accurate” if $Y_i = \hat{Y}_i$ and considered “inaccurate” otherwise. We considered overall model accuracy as well as the rate of false positive ($P(\hat{Y}_i \ge 0.5 \mid Y_i = 0)$) and false negative ($P(\hat{Y}_i < 0.5 \mid Y_i = 1)$) model results.

Table 2
Results of clinical model and logic regression classification models on evaluation data

             Training/Test Samples              Validation Samples
          Ŷe      Ŷ1      Ŷc      Ŷs         Ŷe      Ŷ1      Ŷc      Ŷs
Female, controls vs older cases
auROC    71.38   48.81   75.38   81.31      50.48   57.62   53.33   75.24
acc      71.93   50.88   66.67   82.46      50.00   68.18   54.55   81.82
fn       36.00   68.00   40.00   28.00      57.14   71.43   42.86   42.86
fp       21.88   34.38   28.13    9.38      46.67   13.33   46.67    6.67
Male, controls vs older cases
auROC    81.97   51.46   82.97   74.18      59.52   42.86   59.94   60.00
acc      78.48   73.42   78.48   87.34      51.72   41.38   83.47   58.62
fn       11.48    8.20   11.48    1.64      14.29   14.29    5.66    0.00
fp       55.56   88.89   55.56   50.00      80.00   100.0   93.33   80.00
Female, controls vs younger cases
auROC    80.80   50.15   80.80   86.53      63.81   50.00   63.81   85.71
acc      79.25   56.60   79.25   88.68      61.11   47.22   61.11   83.33
fn       33.33   80.95   33.33   23.81      42.86   66.67   42.59   28.57
fp       12.50   18.75   12.50    3.13      33.33   33.33   33.33    0.00
Male, controls vs younger cases
auROC    84.00   37.01   86.81   73.19      60.19   58.33   59.94   88.11
acc      83.17   46.53   83.17   88.12      81.82   52.07   83.47   94.21
fn        4.82   48.19    4.82    3.61       8.49   50.00    5.66    3.77
fp       72.22   77.78   72.22   50.00      86.67   33.33   93.33   20.00
In an attempt to balance specificity and sensitivity we also calculated the area under the receiver-operating characteristic (ROC) curve (auROC) for the results of each model. The auROC is equal to the value of the Wilcoxon–Mann–Whitney statistic and can be interpreted as the probability that the model will assign a higher probability of significant CHD to a randomly selected positive sample than to a randomly selected negative sample. The auROC calculations were performed with the R package ROCR [34].

The results of the weighted logistic regression models and the logic regression classification and logistic models for each gender and comparison (control vs. younger cases or control vs. older cases) on the evaluation set are presented in Tables 2, 3, and 4. The results of the unweighted logistic regression models are not discussed here due to their similarity to the results of the weighted models. In these tables Ŷ_e = clinical only model, Ŷ_1 or Ŷ_2 = SNP only model, Ŷ_c = clinical+SNP model, Ŷ_s = two-stage predictions using SVM, acc = accuracy, fn = false negative rate, and fp = false positive rate.

These tables show that the two-stage predictions generally yield the best results. In some cases the combined clinical+SNP models perform better on the training set in comparison to the two-stage predictions, but their performance deteriorates on the validation samples. This could be due to the fact that the clinical model and the clinical+SNP model were trained on the training/test samples and tested on the validation samples, while the SNP models were tested on both sets of samples (having already been trained on the samples in the build set). This would also explain the consistency of the SNP model results across both the training/test and validation samples.

Table 3. Results of clinical model and logic regression logistic models on evaluation data (all entries are percentages)

              Training/Test Samples            Validation Samples
            Ŷ_e     Ŷ_1     Ŷ_c     Ŷ_s     Ŷ_e     Ŷ_1     Ŷ_c     Ŷ_s
Female, controls vs older cases
  auROC   71.38   52.50   72.50   84.63   50.48   59.05   48.57   79.53
  acc     71.93   49.12   63.16   84.21   50.00   54.55   31.82   72.73
  fn      36.00   20.00   44.00   12.00   57.14   28.57   42.86   14.29
  fp      21.88   75.00   31.25   18.75   46.67   53.33   80.00   26.67
Male, controls vs older cases
  auROC   81.97   59.38   82.97   88.39   59.52   43.81   63.33   82.86
  acc     78.48   49.36   82.28   91.14   51.72   44.83   51.72   82.76
  fn      11.48   59.02    4.92    6.56   14.29   85.71   14.29   14.29
  fp      55.56   22.22   61.11   16.67   80.00   26.67   80.00   20.00
Female, controls vs younger cases
  auROC   80.80   52.60   81.40   80.21   63.81   50.00   64.13   80.95
  acc     79.25   56.60   70.70   83.02   61.11   47.22   55.56   77.78
  fn      33.33   66.67   47.62   33.33   42.86   66.67   47.62   38.10
  fp      12.50   28.13   15.63    6.25   33.33   33.33   40.00    0.00
Male, controls vs younger cases
  auROC   84.00   47.52   91.43   77.78   60.19   57.83   61.32   80.97
  acc     83.17   49.50   81.13   92.08   81.82   57.20   78.51   91.74
  fn       4.82   49.40    4.82    0.00    8.49   44.34   12.26    4.72
  fp      72.22   55.56   50.00   44.44   86.67   40.00   86.67   33.33

Table 4. Results of clinical model and weighted logistic regression models on evaluation data (all entries are percentages)

              Training/Test Samples            Validation Samples
            Ŷ_e     Ŷ_1     Ŷ_c     Ŷ_s     Ŷ_e     Ŷ_1     Ŷ_c     Ŷ_s
Female, controls vs older cases
  auROC   71.38   54.56   77.38   84.00   50.48   60.48   49.52   75.24
  acc     71.93   56.14   70.18   85.96   50.00   72.73   50.00   81.82
  fn      36.00   64.00   32.00   32.00   57.14   71.43   42.86   42.86
  fp      21.88   28.13   28.13    0.00   46.67    6.67   53.33    6.67
Male, controls vs older cases
  auROC   81.97   47.27   93.62   77.78   59.52   45.00   66.19   73.33
  acc     78.48   65.82   88.61   89.87   51.72   51.72   48.28   72.41
  fn      11.48   22.95    6.56    0.00   14.29   35.71   21.43    0.00
  fp      55.56   72.22   27.78   44.44   80.00   60.00   80.00   53.33
Female, controls vs younger cases
  auROC   80.80   45.47   83.33   81.77   63.81   52.38   49.52   78.57
  acc     79.25   56.60   75.47   84.91   61.11   38.89   50.00   75.00
  fn      33.33   100.0   33.33   33.33   42.86   95.24   42.86   42.86
  fp      12.50    6.25   18.75    3.13   33.33   13.33   53.33    0.00
Male, controls vs younger cases
  auROC   84.00   51.94   95.79   82.13   60.19   62.52   66.19   91.45
  acc     83.17   47.62   93.07   92.08   81.82   50.41   48.28   95.04
  fn       4.82   54.22    2.41    2.41    8.49   52.83   21.43    3.77
  fp      72.22   44.44   27.78   33.33   86.67   26.67   80.00   13.33

Interestingly, the combined clinical+SNP models did not perform better than the clinical only or SNP only models on the validation samples. In many comparisons the clinical and clinical+SNP models performed comparably, with the SNP models performing quite poorly. We surmise that the population under study is quite heterogeneous, and that no one data type provides information predictive of CHD in all subpopulations. The clinical data is predictive for some samples while the genetic data is predictive for others. The results of a weighted average of the predictions from the clinical only and SNP only models (Ŷ̄; results not shown) were not successful because both data types are not relevant for all samples; often the data types provide conflicting information.

Fig 2. Quality of two-stage predictions on evaluation data.

The two-stage predictions are an attempt to use an SVM to define subpopulations for which the clinical data or the genetic data are predictive. Using the SVM results we can identify which data type is predictive for a given sample, leading to more accurate predictions overall. In Figure 2 we display the quality of the two-stage predictions on the evaluation set (train/test subset or validation subset) for each model and each comparison.
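The two-stage rule can be summarised in a short sketch. The Python code below is a simplified illustration under stated assumptions, not the authors' implementation and not a restatement of Equation (3.3): the gate argument stands in for the trained SVM (any function returning True when a sample falls in the subspace where the clinical model is expected to be accurate), and the toy models, field names and cut-offs are all hypothetical.

```python
def gate_training_labels(samples, outcomes, clinical_model, threshold=0.5):
    """Label each sample 1 if the clinical model classified it correctly, else 0.

    These labels are what a gate (an SVM in the paper) would be trained on.
    """
    labels = []
    for x, y in zip(samples, outcomes):
        predicted = 1 if clinical_model(x) >= threshold else 0
        labels.append(1 if predicted == y else 0)
    return labels


def two_stage_predict(samples, clinical_model, genetic_model, gate):
    """Use the clinical model where the gate trusts it, the genetic model elsewhere."""
    predictions = []
    for x in samples:
        model = clinical_model if gate(x) else genetic_model
        predictions.append(model(x))
    return predictions


# Hypothetical stand-ins for the fitted models and for the trained SVM gate.
def clinical_model(x):
    return 0.8 if x["age"] > 60 else 0.2


def genetic_model(x):
    return 0.9 if x["snp"] == 1 else 0.1


def gate(x):
    # Assumption: the clinical model is trusted only for older subjects.
    return x["age"] >= 50


samples = [
    {"age": 70, "snp": 0},
    {"age": 40, "snp": 1},
    {"age": 45, "snp": 0},
]
outcomes = [1, 1, 0]

labels = gate_training_labels(samples, outcomes, clinical_model)
two_stage = two_stage_predict(samples, clinical_model, genetic_model, gate)
print(labels)
print(two_stage)
```

Here the second subject is misclassified by the clinical model, is routed to the genetic model by the gate, and receives a high predicted probability of significant CHD as a result.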
No single model performs consistently best in all comparisons. By averaging the performance measures (auROC or accuracy) of the two-stage predictions on the validation samples across comparisons we find that in terms of auROC the logic regression logistic models perform 1.43 percentage points better than the weighted logistic regression models, which in turn perform 2.38 percentage points better than the logic regression classification models. In terms of accuracy the logic regression logistic models perform 0.19 percentage points better than the weighted logistic regression models, which in turn perform 1.57 percentage points better than the logic regression classification models. Hence we conclude that overall the logic regression logistic models perform best, followed by the weighted logistic regression models and finally the logic regression classification models. However, the difference in average performance between any two model types is quite small.

The only comparison in which model performances are clearly distinguished is the male controls vs. older cases comparison. The relatively poor performance of the two-stage predictions from the logic regression classification models is striking; the results in Table 2 reveal that the logic regression classification model had false positive rates of 89% for the training/test patients and 100% for the validation patients. Unfortunately the clinical model also had a high false positive rate of 80% on the validation patients. Both data types (and consequently the two-stage predictions) failed to predict those with minimal CHD. This is possibly a result of SNP selection; of the 9, 8, and 8 SNPs selected by the weighted logistic, logic regression classification and logistic models, respectively, only 2 appear in all three models and no other SNPs are shared by any two models. No definitive conclusions can be drawn without an independent data set on which to validate our results.

5. Discussion

We have presented a two-stage approach to generating combined predictions from models built from different data sources. One model is built on existing data of multiple types (e.g., traditional clinical risk factors), while a second set of models is built on newly available binary predictors only (in our case genetic SNP data). This two-stage approach uses an SVM to distinguish the covariate subspaces on which the existing data model generates accurate or inaccurate predictions. The existing model is used to generate predictions for samples in the "accurate" subspace while a model built on the newly available data is used to generate predictions for samples in the "inaccurate" subspace. This approach appears to perform well in generating predictions for a heterogeneous population for which no single data type provides predictive information for all samples.

As discussed briefly in Section 1 there exist modeling approaches other than logistic and logic regression models which could have been employed here. We chose logic trees because of their ability to capture higher order interactions, an issue of great importance in regression and a key to variable selection. However, similar models could be constructed by Bayesian model averaging with lower-dimensional logistic regression models that allow for interactions among covariates. We also could have employed neural networks or projection pursuit models. These alternative approaches would require careful prior variable selection in any context where n < p, but would be worth considering in future work.

Our modeling approach is similar in spirit to ensemble methods [11], learning algorithms which construct a set of classifiers and then generate predictions by taking a (weighted) average or vote of their predictions.
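To make the idea of a (weighted) average of classifier predictions concrete, here is a minimal Python sketch. The function weighted_vote and the example probabilities are hypothetical and serve only as an illustration; they do not reproduce any model fitted in this study.

```python
def weighted_vote(probabilities, weights):
    """Combine predicted probabilities from several classifiers by a weighted average."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, probabilities)) / total


# Two classifiers that disagree about the same sample (hypothetical values).
conflicting = [0.9, 0.1]

equal_weights = weighted_vote(conflicting, [1.0, 1.0])   # plain average
single_vote = weighted_vote(conflicting, [1.0, 0.0])     # only the first classifier votes
print(equal_weights, single_vote)
```

With equal weights, two conflicting predictions average to an uninformative value near one half, whereas weights of 1 and 0 simply return the prediction of a single classifier.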
One such approach is boosting [13, 14, 32], a method for converting a weak learning algorithm into one with high accuracy. This is done by training classifiers on weighted versions of the training data, giving higher weight to misclassified samples, and forming the final classifier as a linear combination of the training classifiers. This approach does not apply different models to different covariate subspaces, but does attempt to improve model performance in subspaces where the model performs poorly. Our approach is a type of ensemble method in which each classifier gets either a single, fully weighted vote or no vote depending upon the subspace in which the sample of interest is located. It would be of interest to compare our two-stage predictive approach to an approach aimed at building a boosted classifier from all available covariates. The results of such a comparison would help in determining the necessity of building a subspace-dependent classifier.

Several different model types were used in generating predictions from the newly available binary data, including logistic regression and logic regression models. No single model type performed significantly better than the others, although a slight performance advantage was observed when using the two-stage predictions from the logic regression logistic models. Across comparisons within gender and case the best models generated two-stage predictions with validation accuracies between 81.82% and 94.21%. It should be noted, however, that the sizes of the validation sets for some comparisons are quite small and all comparisons were conducted within a single population (CATHGEN). Also, our inferences are done conditional on a fixed chosen model; the variability of the models is not considered in the inference procedure.
This is a weakness in our approach as model uncertainty can be substantial in high dimensional data contexts. Hence we regard our results as a "proof-of-concept" for our analysis approach. We are planning an analysis of a second, independent population and await the results of such an analysis before making any definitive conclusions regarding the predictive power of our method.

Acknowledgments. The authors wish to thank the following for their assistance: Bertrand Clarke, Department of Statistics, University of British Columbia; Ed Iversen, Department of Statistical Science, Duke University; Pascal Goldschmidt, Dean, Leonard M. Miller School of Medicine, University of Miami.

References

[1] American Heart Association (2006). Heart Disease and Stroke Statistics – 2006 Update 2–10.
[2] Armitage, P. (1955). Tests for linear trends in proportions and frequencies. Biometrics 11 375–386.
[3] Boser, B., Guyon, I. and Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In 5th Annual ACM Workshop on COLT (D. Haussler, ed.) 141–152. ACM Press.
[4] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth Press, Belmont, CA. MR0726392
[5] Chang, C.-C. and Lin, C.-J. (2001). LIBSVM – A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6] Chipman, H., George, E. and McCulloch, R. (2002). Bayesian treed models. Machine Learning 48 299–320.
[7] Clyde, M. (1999). Bayesian model averaging and model search strategies. In Bayesian Statistics 6 (J. Bernardo, J. Berger, A. Dawid and A. Smith, eds.) 157–185. Oxford University Press, Oxford, UK. MR1723497
[8] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning 20 273–297.
[9] Devlin, B., Bacanu, S.-A. and Roeder, K. (2004). Genomic control to the extreme. Nature Genetics 36 1129–1130.
[10] Devlin, B. and Roeder, K. (1999). Genomic control for association studies. Biometrics 55 997–1004.
[11] Dietterich, T. (2000). Ensemble methods in machine learning. Lecture Notes in Comput. Sci. 1857 1–15. Available at citeseer.ist.psu.edu/dietterich00ensemble.html.
[12] Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D. and Weingessel, A. (2006). The e1071 Package: Miscellaneous functions of the department of statistics (e1071). Technische Universität Wien, Austria. Version 1.5-16.
[13] Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation 121 256–285. MR1348530
[14] Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139. MR1473055
[15] Friedman, J. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1–141. MR1091842
[16] Greenland, P., Knoll, M., Stamler, J., Neaton, J., Dyer, A., Garside, D. and Wilson, P. (2003). Major risk factors as antecedents of fatal and nonfatal coronary heart disease events. J. Amer. Medical Association 290 891–897.
[17] Greenland, P., Smith, S. and Grundy, S. (2001). Improving coronary heart disease risk assessment in asymptomatic people: Role of traditional risk factors and noninvasive cardiovascular tests. Circulation 104 1863–1867.
[18] Hauser, E., Crossman, D., Granger, C., Haines, J., Jones, C., Mooser, V., McAdam, B., Winkelmann, B., Wiseman, A., Muhlstein, J., Bartel, A., Dennis, C., Dowdy, E., Estabrooks, S., Eggleston, K., Francis, S., Roche, K., Clevenger, P., Huang, L., Pedersen, B., Shah, S., Schmidt, S., Haynes, C., West, S., Asper, D., Booze, M., Sharma, S., Sundseth, S., Middleton, L., Roses, A., Hauser, M., Vance, J., Pericak-Vance, M. and Kraus, W. (2004). A genomewide scan for early-onset coronary artery disease in 438 families: The GENECARD study. Amer. J. Human Genetics 75 436–447.
[19] Karra, R., Vermullapalli, S., Dong, C., Herderick, E., Song, X., Slosek, K., Nevins, J., West, M., Goldschmidt-Clermont, P. and Seo, D. (2005). Molecular evidence for arterial repair in atherosclerosis. Proc. Nat. Acad. Sci. U.S.A. 102 16789–16794.
[20] Kooperberg, C., Ruczinski, I., LeBlanc, M. and Hsu, L. (2001). Sequence analysis using logic regression. Genetic Epidemiology 21 S626–S631.
[21] Lokhorst, J., Venables, B., Turlach, B. and Maechler, M. (2006). The lasso2 package: L1 constrained estimation aka "lasso." Univ. Western Australia School of Mathematics and Statistics. Version 1.2-5. Available at http://www.maths.uwa.edu.au/~berwin/software/lasso.html.
[22] Magnus, P. and Beaglehole, R. (2001). The real contribution of the major risk factors to the coronary epidemics: time to end the "only-50%" myth. Archives of Internal Medicine 161 2657–2660.
[23] Meyer, D. (2006). Support vector machines: The interface to libsvm in package e1071. Technische Universität Wien, Austria.
[24] Mosca, L. (2002). C-Reactive protein: To screen or not to screen? New England J. Medicine 347 1615–1617.
[25] Osborne, M., Presnell, B. and Turlach, B. (2000). On the LASSO and its dual. J. Comput. Graph. Statist. 9 319–337. MR1822089
[26] Pasternak, R., Abrams, J., Greenland, P., Smaha, L., Wilson, P. and Houston-Miller, N. (2003). Task force #1 – identification of coronary heart disease risk – is there a detection gap? J. American College of Cardiology 41 1863–1874.
[27] R Development Core Team (2006). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at http://www.R-project.org.
[28] Ridker, P., Rifai, N., Rose, L., Buring, J. and Cook, N. (2002). Comparison of C-reactive protein and low-density lipoprotein cholesterol levels in the prediction of first cardiovascular events. New England J. Medicine 347 1557–1565.
[29] Ruczinski, I., Kooperberg, C. and LeBlanc, M. (2002). Logic regression – methods and software. In Proceedings of the MSRI workshop on Nonlinear Estimation and Classification (D. Denison, M. Hansen, B. Holmes, B. Mallick and B. Yu, eds.) 333–344. Springer, New York. MR2005800
[30] Ruczinski, I., Kooperberg, C. and LeBlanc, M. (2003). Logic regression. J. Comput. Graph. Statist. 12 475–511. MR2002632
[31] Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press, Cambridge, MA.
[32] Schapire, R. (1990). The strength of weak learnability. Machine Learning 5 197–227.
[33] Seo, D., Wang, T., Dressman, H., Herderick, E., Iversen, E., Dong, C., Vata, K., Milano, C., Rigat, F., Pittman, J., Nevins, J., West, M. and Goldschmidt-Clermont, P. (2004). Gene expression phenotypes of atherosclerosis. Atherosclerosis, Thrombosis, and Vascular Biology 24 1922–1927.
[34] Sing, T., Sander, O., Beerenwinkel, N. and Lengauer, T. (2005). ROCR: Visualizing classifier performance in R. Bioinformatics 21 3940–3941. Available at http://rocr.bioinf.mpi-sb.mpg.de/.
[35] Sutton, C. (1991). Improving classification trees with simulated annealing. In Proceedings of the 23rd Symposium on the Interface (E. Kazimadas, ed.) 333–344. Interface Foundation of North America.
[36] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
[37] Tzeng, J.-Y., Byerley, W., Devlin, B., Roeder, K. and Wasserman, L. (2003). Outlier detection and false discovery rates for whole-genome DNA matching. J. Amer. Statist. Assoc. 98 236–246. MR1965689
[38] van Laarhoven, P. and Aarts, E. (1987). Simulated Annealing: Theory and Applications. Kluwer Academic Publishers, Norwell, MA.
[39] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York. MR1367965
[40] Wilson, P., D'Agostino, R., Levy, D., Belanger, A., Silbershatz, H. and Kannel, W. (1998). Prediction of coronary heart disease using risk factor categories. Circulation 97 1837–1847.
[41] Xu, H., Gregory, S., Hauser, E., Stenger, J., Pericak-Vance, M., Vance, J., Zuchner, S. and Hauser, M. (2005). SNPselector: a web tool for selecting SNPs for genetic association studies. Bioinformatics 21 4181–4186. Available at http://primer.duhs.duke.edu/.
