A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

A Comparati v e Analysis of Ensemble Classiﬁers: Case Studies in Genomics Sean Whalen and Gaurav P andey Department of Genetics and Genomic Sciences Icahn Institute for Genomics and Multiscale Biology Icahn School of Medicine at Mount Sinai, New Y ork, USA { sean.whalen,gaurav .pandey } @mssm.edu Abstract —The combination of multiple classiﬁers using ensem- ble methods is increasingly important f or making progr ess in a variety of difﬁcult pr ediction pr oblems. W e present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efﬁcacy on real-w orld datasets and draw useful conclusions about their behavior . These methods include simple aggregation, meta-learning, cluster -based meta-learning, and ensemble selection using heterogeneous clas- siﬁers trained on resampled data to improv e the diversity of their predictions. W e present a detailed analysis of these methods across 4 genomics datasets and ﬁnd the best of these methods offer statistically signiﬁcant improvements over the state of the art in their respectiv e domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance. Index T erms —Bioinformatics; Genomics; Supervised learning; Ensemble methods; Stacking; Ensemble selection I . I N T RO D U C T I O N Ensemble methods combining the output of indi vidual clas- siﬁers [1], [2] hav e been immensely successful in producing accurate predictions for many comple x classiﬁcation tasks [3]– [9]. The success of these methods is attributed to their ability to both consolidate accurate predictions and correct errors across many diverse base classiﬁers [10]. Div ersity is ke y to ensemble performance: If there is complete consensus the ensemble cannot outperform the best base classiﬁer , yet an ensemble lacking any consensus is unlikely to perform well due to weak base classiﬁers. Successful ensemble methods establish a balance between the div ersity and accuracy of the ensemble [11], [12]. Howe ver , it remains largely unknown how different ensemble methods achie ve this balance to extract the maximum information from the av ailable pool of base classiﬁers [11], [13]. A better understanding of how different ensemble methods utilize div ersity to increase accuracy using complex datasets is needed, which we attempt to address with this paper . Popular methods like bagging [14] and boosting [15] gener- ate div ersity by sampling from or assigning weights to training examples but generally utilize a single type of base classiﬁer to build the ensemble. Howe ver , such homogeneous ensembles may not be the best choice for problems where the ideal base classiﬁer is unclear . One may instead build an ensemble from the predictions of a wide variety of heter ogeneous base clas- siﬁers such as support vector machines, neural networks, and decision trees. T wo popular heterogeneous ensemble methods include a form of meta-learning called stacking [16], [17] as well as ensemble selection [18], [19]. Stacking constructs a higher-le vel predictiv e model over the predictions of base clas- siﬁers, while ensemble selection uses an incremental strategy to select base predictors for the ensemble while balancing div ersity and performance. Due to their ability to utilize heterogeneous base classiﬁers, these approaches ha ve superior performance across sev eral application domains [6], [20]. Computational genomics is one such domain where classi- ﬁcation problems are especially difﬁcult. This is due in part to incomplete kno wledge of how the cellular phenomenon of interest is inﬂuenced by the variables and measurements used for prediction, as well as a lack of consensus regarding the best classiﬁer for speciﬁc problems. Even from a data perspectiv e, the frequent presence of extreme class imbalance, missing values, heterogeneous data sources of different scale, ov erlapping feature distributions, and measurement noise fur- ther complicate classiﬁcation. These difﬁculties suggest that heterogeneous ensembles constructed from a large and di verse set of base classiﬁers, each contrib uting to the ﬁnal predictions, are ideally suited for this domain. Thus, in this paper we use real-world genomic datasets (detailed in Section II-A) to analyze and compare the performance of ensemble methods for two important problems in this area: 1) prediction of pro- tein functions [21], and 2) predicting genetic interactions [9], both using high-throughput genomic datasets. Constructing accurate predictive models for these problems is notoriously difﬁcult for the abov e reasons, and e ven small impro vements in predictiv e accuracy ha ve the potential for lar ge contrib utions to biomedical knowledge. Indeed, such improvements uncovered the functions of mitochondrial proteins [22] and sev eral other critical protein families. Similarly , the computational discovery of genetic interactions between the human genes EGFR-IFIH1 and FKBP9L-MOSC2 potentially enables novel therapies for glioblastoma [23], the most aggressive type of brain tumor in humans. W orking with important problems in computational ge- nomics, we present a comparativ e analysis of several methods used to construct ensembles from large and diverse sets of base classiﬁers. Sev eral aspects of heterogeneous ensemble T ABLE I D E T A I L S O F G E N E TI C I NT E R AC T IO N ( G I) A N D P ROT E I N F U N CT I O N ( P F ) DAT A S E T S I N CL U D I NG T H E N U MB E R O F F E A T U RE S , N U M B ER O F E X AM P L E S I N T H E M I N O RI T Y ( P O S IT I V E ) A N D M A J O RI T Y ( N E GAT IV E ) C L AS S E S , A N D T OTA L N U M B ER O F E X A M PL E S . Problem Features Positiv es Negati ves T otal GI 152 9,994 125,509 135,503 PF1 300 382 3,597 3,979 PF2 300 344 3,635 3,979 PF3 300 327 3,652 3,979 construction that ha ve not previously been addressed are examined in detail including a novel connection between ensemble selection and meta-learning, the optimization of the di versity/accurac y tradeoff made by these disparate ap- proaches, and the role of calibration in their performance. This analysis sheds light on ho w variants of simple greedy ensemble selection achieve enhanced performance, why meta-learning often out-performs ensemble selection, and several directions for future work. The insights obtained from the performance and behavior of ensemble methods for these complex domain- driv en classiﬁcation problems should hav e wide applicability across div erse applications of ensemble learning. W e begin by detailing our datasets, experimental method- ology , and the ensemble methods studied (namely ensemble selection and stacking) in Section II. This is followed by a discussion of their performance in terms of standard e valuation metrics in Section III. W e next examine ho w the roles of div ersity and accuracy are balanced in ensemble selection and establish a connection with stacking by examining the weights assigned to base classiﬁers by both methods (Section IV -A). In Section IV -B, we discuss the impact of classiﬁer calibration on heterogeneous ensemble performance, an important issue that has only recently receiv ed attention [24]. W e conclude and indicate directions for future work in Section V. I I . M AT E R I A LS A N D M E T H O D S A. Pr oblem Deﬁnitions and Datasets For this study we focus on two important problems in computational genomics: The prediction of protein functions, and the prediction of genetic interactions. Belo w we describe these problems and the datasets used to assess the efﬁcacy of various ensemble methods. A summary of these datasets is giv en in T able I. 1) Pr otein Function Prediction: A key goal in molecular biology is to infer the cellular functions of proteins. T o keep pace with the rapid identiﬁcation of proteins due to adv ances in genome sequencing technology , a large number of computa- tional approaches hav e been dev eloped to predict various types of protein functions. These approaches use various genomic datasets to characterize the cellular functions of proteins or their corresponding genes [21]. Protein function prediction is essentially a classiﬁcation problem using features deﬁned for each gene or its resulting protein to predict whether the protein performs a certain function ( 1 ) or not ( 0 ). W e use the gene expression compendium of Hughes et al. [25] to predict the T ABLE II F E A T U R E M ATR I X O F G EN E T I C I N TE R AC T I ON S W H ER E n RO W S R E PR E S E NT PAI R S O F G E NE S M EA S U R ED B Y F E A T UR E S F 1 . . . F m H A V I NG L A BE L 1 I F T HE Y A R E K N OW N T O I N T E RA CT , 0 I F T HE Y D O N OT , A N D ? I F T H EI R I NT E R AC T IO N H A S N OT B EE N E S T A B L I SH E D . Gene Pair F 1 F 2 · · · F m Interaction? Pair 1 0.5 0.1 · · · 0.7 1 Pair 2 0.2 0.7 · · · 0.8 0 . . . . . . . . . . . . . . . . . . Pair n 0.3 0.9 · · · 0.1 ? functions of roughly 4,000 baker’ s yeast ( S. cer evisiae ) genes. The three most abundant functional labels from the list of Gene Ontology Biological Process terms compiled by Myers et al. [26] are used in our e v aluation. The three corresponding prediction problems are referred to as PF1, PF2, and PF3 respectiv ely and are suitable targets for classiﬁcation case studies due to their difﬁculty . These datasets are publicly av ailable from Pandey et al. [27]. 2) Genetic Interaction Pr ediction: Genetic interactions (GIs) are a category of cellular interactions that are inferred by comparing the effect of the simultaneous knockout of two genes with the effect of knocking them out individually [28]. The knowledge of these interactions is critical for under- standing cellular pathways [29], ev olution [30], and numerous other biological processes. Despite their utility , a general paucity of GI data exists for sev eral organisms important for biomedical research. T o address this problem, Pandey et al. [9] used ensemble classiﬁcation methods to predict GIs between genes from S. cer evisiae (baker’ s yeast) using functional relationships between gene pairs such as correlation between expression proﬁles, extent of co-ev olution, and the presence or absence of physical interactions between their corresponding proteins. W e use the data from this study to assess the ef ﬁcacy of heterogeneous ensemble methods for predicting GIs from a set of 152 features (see T able II for an illustration) and measure the improvement of our ensemble methods ov er this state-of-the-art. B. Experimental Setup A total of 27 heterogeneous classiﬁer types are trained using the statistical language R [31] in combination with its various machine learning packages, as well as the R W eka interface [32] to the data mining software W eka [33] (see T able III). Among these are classiﬁers based on boosting and bagging which are themselves a type of ensemble method, but whose performance can be further improved by inclusion in a heterogeneous ensemble. Classiﬁers are trained using 10-fold cross-validation where each training split is resampled with replacement 10 times then balanced using undersampling of the majority class. The latter is a standard and essential step to pre vent learning decision boundaries biased to the majority class in the presence of extreme class imbalance such as ours (see T able I). In addition, a 5-fold nested cross-validation is performed on each training split to create a validation set for T ABLE III I N DI V I D UA L P E RF O R M AN C E O F 2 7 B A SE C L A SS I FI ER S O N G E N ET I C I N TE R AC T I ON A N D P ROT E I N F U N CT I O N DAT A S ET S E V A L UA T E D U S I N G A C O MB I NATI O N O F R [ 3 1 ] , C A R E T [ 3 4] , A N D T H E RW E K A I N T ER FAC E [ 3 2 ] T O W E K A [ 3 3 ]. D E T A I LS O F E AC H C LA S S I FIE R A R E O M I T TE D F O R B R EV I T Y . R PAC KA G ES I N CL U D E A C I T AT IO N D ES C R I BI N G T H E M E T H OD . F O R A L L O T HE R S , S E E T H E W E KA D O C UM E N T A T I O N . F O R B O O S TI N G M E TH O D S W I T H S E L E CTA B LE BA S E L E A R NE R S , T H E D E FAULT ( N O R MA L LY A D E C I SI O N S T U M P ) I S U S E D . Performance Classiﬁer GI PF1 PF2 PF3 Functions glmboost [35] 0.72 0.65 0.71 0.72 glmnet [36] 0.73 0.63 0.71 0.73 Logistic 0.73 0.61 0.66 0.71 MultilayerPerceptron 0.74 0.64 0.71 0.74 multinom [37] 0.73 0.61 0.67 0.71 RBFClassiﬁer 0.73 0.62 0.69 0.74 RBFNetwork 0.56 0.51 0.52 0.58 SGD 0.73 0.63 0.70 0.73 SimpleLogistic 0.73 0.65 0.72 0.73 SMO 0.73 0.64 0.70 0.73 SPegasos 0.66 0.52 0.56 0.56 V otedPerceptron 0.65 0.62 0.70 0.71 T rees AdaBoostM1 0.71 0.65 0.67 0.73 ADT ree 0.73 0.64 0.67 0.75 gbm [38] 0.77 0.68 0.72 0.78 J48 0.75 0.60 0.65 0.71 LADT ree 0.74 0.64 0.69 0.75 LMT 0.76 0.62 0.71 0.75 LogitBoost 0.73 0.65 0.69 0.75 MultiBoostAB 0.70 0.63 0.66 0.70 RandomT ree 0.71 0.57 0.60 0.63 rf [39] 0.79 0.67 0.72 0.76 Rule-Based JRip 0.76 0.63 0.67 0.70 P AR T 0.76 0.59 0.65 0.73 Other IBk 0.70 0.61 0.66 0.70 pam [40] 0.71 0.62 0.66 0.64 VFI 0.64 0.55 0.56 0.63 the corresponding test split. This validation set is used for the meta-learning and ensemble selection techniques described in Section II-C. The ﬁnal result is a pool of 270 classiﬁers. Performance is measured by combining the predictions made on each test split resulting from cross-validation into a single set and calculating the area under the Receiv er Oper- ating Characteristic curve (A UC). The performance of the 27 base classiﬁers for each dataset is giv en in T able III, where the bagged predictions for each base classiﬁer are av eraged before calculating the A UC. These numbers become important in later discussions since ensemble methods in volve a tradeoff between the div ersity of predictions and the performance of base classiﬁers constituting the ensemble. C. Ensemble Methods 1) Simple Aggre gation: The predictions of each base clas- siﬁer become columns in a matrix where rows are instances and the entry at row i , column j is the probability of instance i belonging to the positive class as as predicted by classiﬁer j . W e ev aluate ensembles using A UC by applying the mean across rows to produce an aggregate prediction for each instance. 2) Meta-Learning: Meta-learning is a general technique for improving the performance of multiple classiﬁers by using the meta information they provide. A common approach to meta- learning is stacked generalization ( stacking ) [17] that trains a higher-le vel ( level 1 ) classiﬁer on the outputs of base ( level 0 ) classiﬁers. Using the standard formulation of Ting and W itten [41], we perform meta-learning using stacking with a level 1 lo- gistic regression classiﬁer trained on the pr obabilistic outputs of multiple heterogeneous lev el 0 classiﬁers. Though other classiﬁers may be used, a simple logistic regression meta- classiﬁer helps avoid ov erﬁtting which typically results in superior performance [41]. In addition, its coefﬁcients hav e an intuitive interpretation as the weighted importance of each lev el 0 classiﬁer [6]. The layer 1 classiﬁer is trained on a validation set created by the nested cross-validation of a particular training split and ev aluated against the corresponding test split to prev ent the leaking of label information. Ov erall performance is ev aluated as described in Section II-B. In addition to stacking across all classiﬁer outputs, we also e valuate stacking using only the aggr e gate output of each resampled (bagged) base classiﬁer . For example, the outputs of all 10 SVM classiﬁers are av eraged and used as a single level 0 input to the meta learner . Intuitiv ely this combines classiﬁer outputs that hav e similar performance and calibration, which allo ws stacking to focus on weights between (instead of within) classiﬁer types. 3) Cluster-Based Meta-Learning: A variant on traditional stacking is to ﬁrst cluster classiﬁers with similar predictions, then learn a separate le vel 1 classiﬁer for each cluster [6]. Alternately , classiﬁers within a cluster can ﬁrst be combined by taking their mean (for example) and then learning a lev el 1 classiﬁer on these per-cluster av eraged outputs. This is a generalization of the aggregation approach described in Section II-C2 but using a distance measure instead of restricting each cluster to bagged homogeneous classiﬁers. W e use hierarchical clustering with 1 − | ρ | (where ρ is Pearson’ s correlation) as a distance measure. W e found little difference between alternate distance measures based on Pearson and Spearman correlation and so present results using only this formulation. For simplicity , we refer to the method of stacking within clusters and taking the mean of le vel 1 outputs as intra-cluster stacking . Its complement, inter-cluster stacking , averages the outputs of classiﬁers within a cluster then performs stacking on the averaged lev el 0 outputs. The intuition for both approaches is to group classiﬁers with similar (but ideally non-identical) predictions together and learn how to best resolve their dis- agreements via weighting. Thus the div ersity of classiﬁer predictions within a cluster is important, and the effecti veness of this method is tied to a distance measure that can utilize both accuracy and div ersity . 4) Ensemble Selection: Ensemble selection is the pro- cess of choosing a subset of all available classiﬁers that perform well together , since including every classiﬁer may decrease performance. T esting all possible classiﬁer com- binations quickly becomes infeasible for ensembles of any practical size and so heuristics are used to approximate the optimal subset. The performance of the ensemble can only improv e upon that of the best base classiﬁer if the ensem- ble has a sufﬁcient pool of accurate and div erse classiﬁers, and so successful selection methods must balance these two requirements. W e establish a baseline for this approach by performing simple greedy ensemble selection, sorting base classiﬁers by their individual performance and iterativ ely adding the best unselected classiﬁer to the ensemble. This approach disregards how well the classiﬁer actually complements the performance of the ensemble. Improving on this approach, Caruana et al. ’ s ensemble selection (CES) [18], [19] begins with an empty ensemble and iterativ ely adds ne w predictors that maximize its performance according to a chosen metric (here, A UC). At each iteration, a number of candidate classiﬁers are randomly selected and the performance of the current ensemble including the candidate is e v aluated. The candidate resulting in the best ensemble performance is selected and the process repeats until a maxi- mum ensemble size is reached. The ev aluation of candidates according to their performance with the ensemble, instead of in isolation, improves the performance of CES over simple greedy selection. Additional improvements over simple greedy selection in- clude 1) initializing the ensemble with the top n base clas- siﬁers, and 2) allowing classiﬁers to be added multiple times. The latter is particularly important as without replacement, the best classiﬁers are added early and ensemble performance then decreases as poor predictors are forced into the ensem- ble. Replacement giv es more weight to the best performing predictors while still allowing for div ersity . W e use an initial ensemble size of n = 2 to reduce the effect of multiple bagged versions of a single high performance classiﬁer dominating the selection process, and (for completeness) ev aluate all candidate classiﬁers instead of sampling. Ensemble predictions are combined using a cumulativ e moving average to speed the ev aluation of ensemble perfor- mance for each candidate predictor . Selection is performed on the validation set produced by nested cross-validation and the resulting ensemble ev aluated as described in Section II-B. D. Diversity Measures The div ersity of predictions made by members of an en- semble determines the ensemble’ s ability to outperform the best indi vidual, a long-accepted property which we explore in the following sections. W e measure div ersity using Y ule’ s Q - statistic [42] by ﬁrst creating predicted labels from thresholded classiﬁer probabilities, yielding a 1 for values greater than 0.5 and 0 otherwise. Giv en the predicted labels produced by each pair of classiﬁers D i and D k , we generate a contingenc y table counting ho w often each classiﬁer produces the correct label in relation to the other: D k correct (1) D k incorrect (0) D i correct (1) N 11 N 10 D i wrong (0) N 01 N 00 The pairwise Q statistic is then deﬁned as: Q i,k = N 11 N 00 − N 01 N 10 N 11 N 00 + N 01 N 10 . (1) This produces values tending towards 1 when D i and D k correctly classify the same instances, 0 when they do not, and − 1 when they are negativ ely correlated. W e ev aluated additional di versity measures such as Cohen’ s κ -statistic [43] but found little practical dif ference between the measures (in agreement with Kunche v a et al. [11]) and focus on Q for its simplicity . Multicore performance and diversity measures are implemented in C++ using the Rcpp package [44]. This prov es essential for their practical use with large ensembles and nested cross validation. W e adjust raw Q values using the transformation 1 − | Q | so that 0 represents no div ersity and 1 represents maximum div ersity for graphical clarity . I I I . E N S E M B L E P E R F O R M A N C E Performance of the methods described in Section II-C is summarized in T able IV. Overall, aggregated stacking is the best performer and edges out CES for all our datasets. The use of clustering in combination with stacking also performs well for certain cluster sizes k . Intra-cluster stacking performs best with cluster sizes 2, 14, 20, and 15 for GI, PF1, PF2, and PF3, respectiv ely . Inter-cluster stacking is optimal for sizes 24, 33, 33, and 36 on the same datasets. Due to the size of the GI dataset, only 10% of the validation set is used for non-aggreg ate stacking and cluster stacking methods. Performance lev els off beyond 10% and so this approach does not signiﬁcantly penalize these methods. This step was not necessary for the other methods and datasets. Ensemble selection also performs well, though we anticipate issues of calibration (detailed in Section IV -B) could hav e a negati ve impact since the mean is used to aggreg ate ensemble predictions. Greedy selection achieves best performance for ensemble sizes of 10, 14, 45, and 38 for GI, PF1, PF2, and PF3, respecti vely . CES is optimal for sizes 70, 43, 34, and 56 for the same datasets. Though the best performing ensembles for both selection methods are close in performance, simple greedy selection is much worse for non-optimal ensemble sizes than CES and its performance typically degrades after the best few base classiﬁers are selected (see Section IV -A). Thus, on av erage CES is the superior selection method. In agreement with Altman et al. [6], we ﬁnd the mean is the highest performing simple aggregation method for combining ensemble predictions. Howe ver , because we are using heterogeneous classiﬁers that may hav e uncalibrated outputs, the mean combines predictions made with different T ABLE IV AU C O F E N S EM B L E L E A RN I N G M E T HO D S F O R P RO TE I N F U N C TI O N A N D G E NE T I C I N T ER AC T I O N D A TA SE T S . M E T HO D S I N C L UD E M EA N AG G R EG ATI O N , G R E EDY E N S EM B L E S E LE C T I ON , S E LE C T IO N W I TH R E PL AC E M E NT ( C E S) , S T A C KI N G W I T H L O GI S T I C R E GR E S S IO N , AG G R EG ATE D S T AC K I NG ( A V E RA G IN G R ES A M P LE D H O MO G E NE O U S B AS E C L AS S I FI ER S B E F O RE S TAC KI N G ) , S TAC K IN G W I TH I N C L U S TE R S T H E N A V E R AG IN G ( I NT R A ) , A N D A V E RA GI N G W I T H IN C L US T E R S T H EN S T AC K I NG ( I N TR A ) . T H E B E S T P E R FO R M I NG BA S E C L A S SI FI E R ( R A ND O M F O RE S T F O R T H E G I DAT A S E T A N D G B M F O R P F S ) I S G IV E N F O R R E FE R E N CE . S T A R R E D V A L U ES A R E G E N ER A T E D F RO M A S U BS A M P LE O F T H E VAL I DATI O N S E T D U E T O I T S S I Z E ; S E E T E X T F O R D E TAI L . Performance Method GI PF1 PF2 PF3 Best Base Classiﬁer 0.79 0.68 0.72 0.78 Mean Aggregation 0.763 0.669 0.732 0.773 Greedy Selection 0.792 0.684 0.734 0.779 CES 0.802 0.686 0.741 0.785 Stacking (Aggregated) 0.812 0.687 0.742 0.788 Stacking (All) 0.809 ∗ 0.684 0.726 0.773 Intra-Cluster Stacking 0.799 ∗ 0.684 0.725 0.775 Inter-Cluster Stacking 0.786 ∗ 0.683 0.735 0.783 T ABLE V P A I RW IS E P E R F OR M A N CE C O M P A R I SO N O F M U L T I PL E N ON - E N SE M B L E A N D E N SE M B L E M E T H OD S AC RO S S D A TAS E T S . O N LY PA I RS W I T H S T A T IS T I C AL LY S I GN I FI C AN T D I FFE R E NC E S , D E T ER M I N ED B Y F R IE D M A N / N E M EN Y I T E S TS AT α = 0 . 05 , A R E S H OW N . Method A Method B p-value Best Base Classiﬁer CES 0.001902 Best Base Classiﬁer Stacking (Aggregated) 0.000136 CES Inter-Cluster Stacking 0.037740 CES Intra-Cluster Stacking 0.014612 CES Mean Aggregation 0.000300 CES Stacking (All) 0.029952 Greedy Selection Mean Aggregation 0.029952 Greedy Selection Stacking (Aggregated) 0.005364 Inter-Cluster Stacking Mean Aggregation 0.047336 Inter-Cluster Stacking Stacking (Aggregated) 0.003206 Intra-Cluster Stacking Stacking (Aggregated) 0.001124 Mean Aggregation Stacking (Aggregated) 0.000022 Stacking (Aggregated) Stacking (All) 0.002472 T ABLE VI G RO U PE D P E RF O R M AN C E C O M P A R I S ON O F M U LTI P L E N O N - E N S EM B L E A N D E N SE M B L E M E T H OD S AC RO S S D A TAS E T S . M E T H OD S S H AR I N G A G RO U P L E TT E R H A V E S T A T IS T I C AL LY S I MI L A R P E R FO R M A NC E , D E TE R M I NE D B Y F RI E D M AN / N E ME N Y I T E ST S A T α = 0 . 05 . A GG R E G A T E D S T AC K I NG A N D C E S D E M ON S T RAT E T H E B E ST P E RF O R M AN C E , W H I LE G R EE DY S E LE C T I ON I S S I M IL A R T O C E S ( B U T N OT S T A C KI N G ) . Group Method Rank Sum a Stacking (Aggregated) 32 ab CES 27 bc Greedy Selection 18 c Inter-Cluster Stacking 17 cd Stacking (All) 16.5 cd Intra-Cluster Stacking 15 cd Best Base Classiﬁer 11 d Mean Aggregation 7.5 scales or notions of probability . This explains its poor perfor- mance compared to the best base classiﬁer in a heterogeneous ensemble and emphasizes the need for ensemble selection or weighting via stacking to take full adv antage of the ensemble. W e discuss the issue of calibration in Section IV -B. Thus, we observe consistent performance trends across these methods. Ho wev er, to draw meaningful conclusions it is criti- cal to determine if the performance differences are statistically signiﬁcant. For this we employ the standard methodology giv en by Dem ˇ sar [45] to test for statistically signiﬁcant perfor- mance dif ferences between multiple methods across multiple datasets. The Friedman test [46] ﬁrst determines if there are statistically signiﬁcant differences between any pair of methods over all datasets, followed by a post-hoc Nemenyi test [47] to calculate a p-value for each pair of methods. This is the non-parametric equiv alent of ANO V A combined with a T ukey HSD post-hoc test where the assumption of normally distributed v alues is remov ed by using rank transformations. As many of the assumptions of parametric tests are violated by machine learning algorithms, the Friedman/Nemenyi test is preferred despite reduced statistical power [45]. Using the Freidman/Nemeyi approach with a cutoff of α = 0 . 05 , the pairwise comparison between our ensemble and non-ensemble methods is shown in T able V. For brevity , only methods with statistically signiﬁcant performance differ - ences are shown. The ranked performance of each method across all datasets is shown in T able VI. Methods sharing a label in the group column have statistically indistinguishable performance based on their summed rankings. This table shows that aggregated stacking and CES have the best perfor- mance, while CES and pure greedy selection have similar per - formance. Howe ver , aggregated stacking and greedy selection do not share a group as their summed ranks are too distant and thus have a signiﬁcant performance dif ference. The remaining approaches including non-aggregated stacking are statistically similar to mean aggregation and motiv ates our inclusion of cluster-based stacking, whose performance may improve gi ven a more suitable distance metric. These rankings statistically reinforce the general trends presented earlier in T able IV. W e note that nested cross-validation, relative to a single validation set, improv es the performance of both stacking and CES by increasing the amount of meta data av ailable as well as the bagging that occurs as a result. Both effects reduce ov erﬁtting but performance is still typically better with smaller ensembles. More nested folds increase the quality of the meta data and thus af fects the performance of these methods as well, though computation time increases substantially and motiv ates our selection of k = 5 nested folds. Finally , we emphasize that each method we ev aluate out- performs the previous state of the art A UC of 0.741 for GI prediction [9]. In particular , stacked aggregation results in the prediction of 988 additional genetic interactions at a 10% false discovery rate. In addition, these heterogeneous ensemble methods out-perform random forests and gradient boosted regression models which are themselv es homogeneous ensembles. This demonstrates the value of heterogeneous 0.60 0.65 0.70 0.75 0.00 0.25 0.50 0.75 1.00 P airwise Div ersity (1 − |Q|) P airwise P erformance (A UC) 0.55 0.60 0.65 0.70 0.75 0.00 0.25 0.50 0.75 P airwise Div ersity (1 − |Q|) P airwise P erformance (A UC) Fig. 1. Performance as a function of diversity for all pairwise combinations of 27 base classiﬁers on the GI (top ﬁgure) and PF3 (bottom ﬁgure) datasets. More div erse combinations typically result in less performance except for high-performance classiﬁers such as random forests and generalized boosted regression models, whose points are shown in red if they are part of a pair . Raw Q values are adjusted so that larger values imply more div ersity . ensembles for improving predictiv e performance. I V . E N S E M B L E C H A R A C T E R I S T I C S A. The Role of Diversity The relationship between ensemble div ersity and perfor- mance has immediate impact on practitioners of ensemble methods, yet has not formally been proven despite extensiv e study [11], [13]. For brevity we analyze this tradeoff using GI and PF3 as representativ e datasets, though the trends observed generalize to PF1 and PF2. Figure 1 presents a high-le vel view of the relationship between performance and div ersity , plotting the div ersity of pairwise classiﬁers against their performance as an ensemble by taking the mean of their predictions. This ﬁgure shows the 0.1 0.2 0.3 0 20 40 60 Iteration Ensem ble Div ersity (1 − |Q|) .76 .77 .78 .79 Ensem ble P erformance (A UC) 0.1 0.2 0.3 0.4 0.5 0 20 40 60 Iteration Ensem ble Div ersity (1 − |Q|) .76 .77 .78 Ensem ble P erformance (A UC) Fig. 2. Ensemble div ersity and performance as a function of iteration number for both greedy selection (bottom curve) and CES (top curve) on the GI and PF3 datasets. For GI (top ﬁgure), greedy selection fails to improve diversity and performance decreases with ensemble size, while CES successfully balances diversity and performance (shown by a shift in color from red to yellow as height increases) and reaches an equilibrium over time. For PF3 (bottom ﬁgure), greedy selection manages to improve di versity around iteration 30 but accuracy decreases, demonstrating that di verse predictions alone are not enough for accurate ensembles. complicated relationship between div ersity and performance that holds for each of our datasets: T wo highly div erse classiﬁers are more likely to perform poorly due to lower prediction consensus. There are exceptions, and these tend to include well-performing base classiﬁers such as random forests and gradient boosted regression models (shown in red in Figure 1) which achiev e high A UC on their o wn and stand to gain from a div erse partner . Div ersity works in tension with performance, and while improving performance depends on div ersity , the wrong kind of div ersity limits performance of the ensemble [48]. Figure 2 demonstrates this tradeoff by plotting ensemble div ersity and performance as a function of the iteration number of the simple greedy selection and CES methods detailed in T ABLE VII T H E M O ST - WE I G H TE D C L AS S I FIE R S P RO D U C ED B Y S TAC K IN G W I TH L O GI S T I C R E G R ES S I O N ( W E I GH T m ) A N D C E S ( W E I GH T c ) F O R T H E P F 3 DAT A S E T , A L O N G W I T H T H E IR A V E R AG E PA IR WI S E D I V ER S I T Y A N D P E RF O R M AN C E . Classiﬁer W eight m W eight c Div . A UC rf 0.25 0.21 0.39 0.71 gbm 0.20 0.27 0.42 0.72 RBFClassiﬁer - 0.05 0.45 0.71 MultilayerPerceptron 0.09 - 0.46 0.70 SGD 0.09 0.04 0.47 0.69 VFI 0.11 0.11 0.71 0.66 IBk 0.09 0.13 0.72 0.68 Section II for the GI (top ﬁgure) and PF3 (bottom ﬁgure) datasets. These ﬁgures rev eal how CES (top curve) success- fully exploits the tradeoff between di versity and performance while a purely greedy approach (bottom curve) actually de- creases in performance over iterations after the best individual base classiﬁers are added. This is shown via coloring, where CES shifts from red to yellow (better performance) as its div ersity increases while greedy selection grows darker red (worse performance) as its diversity only slightly increases. Note that while greedy selection increases ensemble diversity around iteration 30 for PF3, ov erall performance continues to decrease. This demonstrates that div ersity must be balanced with accuracy to create well-performing ensembles. T o illustrate using the PF3 panel of Figure 2, the ﬁrst classi- ﬁers chosen by CES (in order) are rf.1, rf.7, gbm.2, RBFClas- siﬁer .0, MultilayerPerceptron.9, and gbm.3 where numbers indicate bagged versions of a base classiﬁer . RBFClassiﬁer .0 is a lo w performance, high div ersity classiﬁer while the others are the opposite (see T able III for a summary of base classiﬁers). This ensemble shows ho w CES tends to repeatedly select base classiﬁers that improve performance, then selects a more div erse and typically worse performing classiﬁer . Here the former are different bagged versions of a random forest while the latter is RBFClassiﬁer .0. This manifests in the left part of the upper curve where diversity is low and then jumps to its ﬁrst peak. After this, a random forest is added again to balance performance and di versity drops until the next peak. This process is repeated while the algorithm approaches a weighted equilibrium of high performing, low diversity and low performing, high div ersity classiﬁers. This agrees with recent observations that di versity enforces a kind of regularization for ensembles [13], [49]: Performance stops increasing when there is no more div ersity to extract from the pool of possible classiﬁers. W e see this in the PF3 panel of Figure 2 as performance reaches its peak, where small oscillations in diversity represent re-balancing the weights to maintain performance past the optimal ensemble size. Since ensemble selection and stacking are top performers and can both be interpreted as learning to weight different base classiﬁers, we next compare the most heavily weighted classiﬁers selected by CES (W eight c ) with the coefﬁcients of a lev el 1 logistic regression meta-learner (W eight m ). W e com- T ABLE VIII C A ND I DATE C L A SS I FI ER S S O RTE D B Y M E AN PA IR WI S E D I V ER S I T Y A N D P E RF O R M AN C E . T H E M O S T H E A V I L Y W E I GH T E D C L AS S I FI ER S F OR B OT H C E S A N D S TAC K IN G A R E S H OW N I N B O L D . T H IS T R E ND , W HI C H H O L D S AC RO S S D A TA SE T S , S H OW S T H E PA IR I N G O F H I G H - P E R F OR M A N CE L OW - D IV E R S IT Y C LA S S I FIE R S W I T H T H E IR C O MP L E M EN T S , D E MO N S T RATI N G H O W S E EM I N G L Y D I SPA R A T E A P P ROA CH E S C R E A T E A BA L A N CE O F D I V ER S I T Y A N D P E R F OR M A N CE . Classiﬁer Div ersity A UC rf 0.386 0.712 gbm 0.419 0.720 glmnet 0.450 0.694 glmboost 0.452 0.680 RBFClassiﬁer 0.453 0.713 SimpleLogistic 0.459 0.689 MultilayerPer ceptron 0.459 0.706 SMO 0.462 0.695 SGD 0.470 0.693 LMT 0.472 0.692 pam 0.525 0.673 LogitBoost 0.528 0.691 ADT ree 0.539 0.682 V otedPerceptron 0.540 0.694 multinom 0.553 0.677 LADT ree 0.568 0.678 Logistic 0.574 0.669 AdaBoostM1 0.584 0.684 P AR T 0.615 0.652 MultiBoostAB 0.615 0.679 J48 0.632 0.645 JRip 0.687 0.627 VFI 0.713 0.662 IBk 0.720 0.682 RBFNetwork 0.778 0.653 RandomT ree 0.863 0.602 SPegasos 0.980 0.634 pute W eight c as the normalized counts of classiﬁers included in the ensemble, resulting in greater weight for classiﬁers selected multiple times. These weights for PF3 are shown in T able VII. Nearly the same classiﬁers receive the most weight under both approaches (though logistic regression coefﬁcients were not restricted to positiv e v alues so we cannot directly compare weights between methods). Howe ver , the general trend of the relativ e weights is clear and explains the oscillations seen in Figure 2: High performance, lo w di versity classiﬁers are repeatedly paired with higher di versity , lower performing classiﬁers. A more complete picture of selection emerges by examining the full list of candidate base classiﬁers (T able VIII) with the most weighted ensemble classiﬁers shown in bold. The highest performing, lowest div ersity GBM and RF classi- ﬁers appear at the top of the list while VFI and IBk are near the bottom. Though there are more di verse classiﬁers than VFI and IBk, they were not selected due to their lower performance. This example illustrates how diversity and performance are balanced during selection, and also gi ves new insight into the nature of stacking due to the con vergent weights of these seemingly dif ferent approaches. A metric incorporating both measures should increase the performance of hybrid methods such as cluster-based stacking, which we plan to in vestigate in future work. Base Classifiers Ensembles (CES) Ensembles (Greedy) 0.6 0.7 0.7875 0.7900 0.7925 0.7950 0.76 0.77 0.78 0.1 0.2 0.3 0.140 0.145 0.150 0.155 0.15 0.16 0.17 0.18 Brier Score P erformance (A UC) Fig. 3. Performance as a function of calibration as measured by the Brier score [50] for base classiﬁers (left), ensembles for each iteration of CES (middle), and ensembles for each strictly greedy iteration (right). Iterations for greedy selection move from the upper left to the lo wer right, while CES starts in the lower right and moves to the upper left. This shows better-calibrated (lower Brier score) classiﬁers and ensembles have higher average performance and illustrates the iterative performance differences between the methods. Stacking with logistic regression produces outputs with approximately half the best Brier score of CES, explaining the difference in ﬁnal classiﬁer weights leading to the superior performance of stacking. B. The Role of Calibration A ke y factor in the performance dif ference between stacking and CES is illustrated by stacking’ s selection of Multilayer- Perceptron instead of RBFClassiﬁer for PF3. This difference in the relative weighting of classiﬁers, or the exchange of one classiﬁer for another in the ﬁnal ensemble, persists across our datasets. W e suggest this is due to the ability of the layer 1 classiﬁer to learn a function on the probabilistic outputs of base classiﬁers and compensate for potential differences in calibration , resulting in the superior performance of stacking. A binary classiﬁer is said to be well-calibrated if it is cor- rect p percent of the time for predictions of conﬁdence p [24]. Howe ver , accuracy and calibration are related b ut not the same: A binary classiﬁer that ﬂips a fair coin for a balanced dataset will be calibrated but not accurate. Relatedly , many well- performing classiﬁers do not produce calibrated probabilities. Measures such as A UC are not sensiti ve to calibration for base classiﬁers, and the effects of calibration on heterogeneous ensemble learning hav e only recently been studied [24]. This section further in vestigates this relationship. T o illustrate a practical example of calibration, consider a support vector machine. An uncalibrated SVM outputs the distance of an instance from a hyperplane to generate a probability . This is not a true posterior probability of an instance belonging to a class, but is commonly conv erted to such using Platt’ s method [51]. In fact, this is analogous to ﬁtting a layer 1 logistic regression to the uncalibrated SVM outputs with a slight modiﬁcation to av oid overﬁtting. This approach is not restricted to SVMs and additional methods such as isotonic regression are commonly used for both binary and multi-class problems [52]. Regardless of the base classiﬁer , a lack of calibration may effect the performance of ensemble selection methods such as CES since the predictions of many heterogeneous classiﬁers are combined using simple aggregation methods such as the mean. Several methods exist for ev aluating the calibration of probabilistic classiﬁers. One such method, the Brier scor e , assesses how close (on av erage) a classiﬁer’ s probabilistic output is to the correct binary label [50]: B S = 1 N N X i =1 ( f i − o i ) 2 (2) ov er all instances i . This is simply the mean squared error ev aluated in the context of probabilistic binary classiﬁcation. Lower scores indicate better calibration. Figure 3 plots the Brier scores for each base classiﬁer against its performance for the GI dataset as well as the ensemble Brier scores for each iteration of CES and greedy selection. This shows that classiﬁers and ensembles with calibrated outputs generally perform better . Note in particular the calibration and performance of simple greedy selection, with initial iterations in the upper left of the panel showing high performing well-calibrated base classiﬁers chosen for the ensemble, but moving to the lower right as sub-optimal classiﬁers are forced into the ensemble. In contrast, CES starts with points in the lo wer right and moves to the upper left as both ensemble calibration and performance improv e each iteration. The upper left of the CES plot suggests the beneﬁt of additional classiﬁers outweighs a loss in calibration during its ﬁnal iterations. Stacking produces a layer 1 classiﬁer with approximately half the Brier score (0.083) of CES or the best base classiﬁers. Since this approach learns a function ov er probabilities it is able to adjust to the different scales used by potentially ill- calibrated classiﬁers in a heterogeneous ensemble. This ex- plains the difference in the ﬁnal weights assigned by stacking and CES to the base classiﬁers in T able VII: Though the relativ e weights are mostly the same, logistic regression is able to correct for the lack of calibration across classiﬁers and better incorporate the predictions of MultilayerPerceptron whereas CES cannot. In this case, a calibrated MultilayerPerceptron serves to improv e performance of the ensemble and thus stacking outperforms CES. In summary , this section demonstrates the tradeof f between performance and di versity made by CES and examines its connection with stacking. There is signiﬁcant ov erlap in the relativ e weights of the most important base classiﬁers selected by both methods. From this set of classiﬁers, stacking often assigns more weight to a particular classiﬁer as compared to CES and this result holds across our datasets. W e attribute the superior performance of stacking to this dif ference, originat- ing from its ability to accommodate differences in classiﬁer calibration that are likely to occur in large heterogeneous ensembles. This claim is substantiated by its signiﬁcantly lower Brier score compared to CES as well as the correlation between ensemble calibration and performance. This suggests the potential for improving ensemble methods by accommo- dating differences in calibration. V . C O N C L U S I O N S A N D F U T U R E W O R K The aim of ensemble techniques is to combine div erse clas- siﬁers in an intelligent way such that the predictiv e accuracy of the ensemble is greater than that of the best base classiﬁer . Since enumerating the space of all classiﬁer combinations quickly becomes infeasible for ev en relativ ely small ensemble sizes, other methods for ﬁnding well performing ensembles hav e been widely studied and applied in the last decade. In this paper we apply a variety of ensemble approaches to two difﬁcult problems in computational genomics: The prediction of genetic interactions and the prediction of pro- tein functions. These problems are notoriously difﬁcult for their extreme class imbalance, prev alence of missing values, integration of heterogeneous data sources of different scale, and ov erlap between feature distributions of the majority and minority classes. These issues are ampliﬁed by the inherent complexity of the underlying biological mechanisms and in- complete domain knowledge. W e ﬁnd that stacking and ensemble selection approaches offer statistically signiﬁcant improv ements ov er the previous state-of-the-art for GI prediction [9] and moderate improve- ments over tuned random forest classiﬁers which are par- ticularly effecti ve in this domain [5]. Here, even small im- prov ements in accuracy can contribute directly to biomedical knowledge after wet-lab veriﬁcation: These include 988 addi- tional genetic interactions predicted by aggre gated stacking at a 10% false disco very rate. W e also uncov er a novel connection between stacking and Caruana et al. ’ s ensemble selection method (CES) [18], [19], demonstrating how these two dis- parate methods con verge to nearly the same ﬁnal base classiﬁer weights by balancing div ersity and performance in different ways. W e e xplain how v ariations in these weights are related to the calibration of base classiﬁers in the ensemble, and ﬁnally describe how stacking improves accuracy by accounting for differences in calibration. This connection also sho ws how the utilization of diversity is an emergent, not explicit, property of how CES maximizes ensemble performance and suggests directions for future work including formalizing the ef fects of calibration on heterogeneous ensemble performance, modiﬁca- tions to CES which explicitly incorporate di versity [49], and an optimization-based formulation of the div ersity/performance tradeoff for improving cluster-based stacking methods. V I . A C K N OW L E D G E M E N T S W e thank the Genomics Institute at Mount Sinai for their generous ﬁnancial and technical support. R E F E R E N C E S [1] L. Rokach, “Ensemble-Based Classiﬁers, ” Artiﬁcial Intelligence Review , vol. 33, no. 1-2, pp. 1–39, 2009. [2] G. Seni and J. Elder , Ensemble Methods in Data Mining: Impr oving Accuracy Through Combining Predictions . Morgan & Claypool, 2010. [3] J. Shotton, A. Fitzgibbon, M. Cook, T . Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-Time Human Pose Recognition in Parts from Single Depth Images, ” in Proceedings of the 2011 IEEE Confer ence on Computer V ision and P attern Recognition , 2011, pp. 1297–1304. [4] D. Kim, “Forecasting Time Series with Genetic Fuzzy Predictor Ensem- ble, ” IEEE T ransactions on Fuzzy Systems , vol. 5, no. 4, pp. 523–535, 1997. [5] P . Y ang, Y . H. Y ang, B. B. Zhou, and A. Y . Zomaya, “A Review of Ensemble Methods in Bioinformatics, ” Curr ent Bioinformatics , vol. 5, no. 4, pp. 296–308, 2010. [6] A. Altmann, M. Rosen-Zvi, M. Prosperi, E. Aharoni, H. Neuvirth, E. Sch ¨ ulter , J. B ¨ uch, D. Struck, Y . Peres, F . Incardona, A. S ¨ onnerborg, R. Kaiser, M. Zazzi, and T . Lengauer, “Comparison of Classiﬁer Fusion Methods for Predicting Response to Anti HIV -1 Therapy , ” PLoS ONE , vol. 3, no. 10, p. e3470, 2008. [7] M. Liu, D. Zhang, and D. Shen, “Ensemble Sparse Classiﬁcation of Alzheimer’ s Disease, ” Neuroima ge , vol. 60, no. 2, pp. 1106–1116, 2012. [8] A. Khan, A. Majid, and T .-S. Choi, “Predicting Protein Subcellular Location: Exploiting Amino Acid Based Sequence of Feature Spaces and Fusion of Di verse Classiﬁers, ” Amino Acids , vol. 38, no. 1, pp. 347–350, 2010. [9] G. Pandey , B. Zhang, A. N. Chang, C. L. Myers, J. Zhu, V . Kumar , and E. E. Schadt, “An Integrative Multi-Network and Multi-Classiﬁer Approach to Predict Genetic Interactions, ” PLoS Computational Biology , vol. 6, no. 9, p. e1000928, 2010. [10] K. T umer and J. Ghosh, “Error Correlation and Error Reduction in Ensemble Classiﬁers, ” Connection Science , vol. 8, no. 3-4, pp. 385– 404, 1996. [11] L. I. Kunchev a and C. J. Whitaker, “Measures of Div ersity in Classi- ﬁer Ensembles and Their Relationship with the Ensemble Accuracy , ” Machine Learning , vol. 51, no. 2, pp. 181–207, 2003. [12] T . G. Dietterich, “An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization, ” Machine Learning , vol. 40, no. 2, pp. 139–157, 2000. [13] E. K. T ang, P . N. Suganthan, and X. Y ao, “An Analysis of Diversity Measures, ” Machine Learning , vol. 65, no. 1, pp. 247–271, 2006. [14] L. Breiman, “Bagging Predictors, ” Machine Learning , vol. 24, no. 2, pp. 123–140, 1996. [15] R. E. Schapire and Y . Freund, Boosting: F oundations and Algorithms . MIT Press, 2012. [16] C. J. Merz, “Using Correspondence Analysis to Combine Classiﬁers, ” Machine Learning , vol. 36, no. 1-2, pp. 33–58, 1999. [17] D. H. W olpert, “Stacked Generalization, ” Neural Networks , vol. 5, no. 2, pp. 241–259, 1992. [18] R. Caruana, A. Niculescu-Mizil, G. Crew , and A. Ksikes, “Ensemble Selection from Libraries of Models, ” in Pr oceedings of the 21st Inter- national Confer ence on Machine Learning , 2004, pp. 18–26. [19] R. Caruana, A. Munson, and A. Niculescu-Mizil, “Getting the Most Out of Ensemble Selection, ” in Proceedings of the 6th International Confer ence on Data Mining , 2006, pp. 828–833. [20] A. Niculescu-Mizil, C. Perlich, G. Swirszcz, V . Sindhwani, and Y . Liu, “W inning the KDD Cup Orange Challenge with Ensemble Selection, ” Journal of Machine Learning Resear ch Pr oceedings T rack , vol. 7, pp. 23–24, 2009. [21] G. Pandey , V . Kumar , and M. Steinbach, “Computational Approaches for Protein Function Prediction: A Survey, ” University of Minnesota, T ech. Rep., 2006. [22] D. C. Hess, C. L. Myers, C. Huttenhower , M. A. Hibbs, A. P . Hayes, J. Pa w , J. J. Clore, R. M. Mendoza, B. S. Luis, C. Nislow , G. Giaev er , M. Costanzo, O. G. Troyanskaya, and A. A. Caudy , “Computationally Driv en, Quantitativ e Experiments Discover Genes Required for Mito- chondrial Biogenesis, ” PLoS Genetics , vol. 5, no. 3, p. e1000407, 2009. [23] E. Szczurek, N. Misra, and M. V ingron, “Synthetic Sickness or Lethality Points at Candidate Combination Therapy T argets in Glioblastoma, ” International Journal of Cancer , 2013. [24] A. Bella, C. Ferri, J. Hern ´ andez-Orallo, and M. J. Ram ´ ırez-Quintana, “On the Effect of Calibration in Classiﬁer Combination, ” Applied Intelligence , pp. 1–20, 2012. [25] T . R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey , H. Dai, Y . D. He, M. J. Kidd, A. M. King, M. R. Meyer , D. Slade, P . Y . Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty , J. Simon, M. Bard, and S. H. Friend, “Functional Discov ery via a Compendium of Expression Proﬁles, ” Cell , vol. 102, no. 1, pp. 109–126, 2000. [26] C. L. Myers, D. R. Barrett, M. A. Hibbs, C. Huttenhower , and O. G. T royanskaya, “Finding Function: Evaluation Methods for Functional Genomic Data, ” BMC Genomics , vol. 7, no. 187, 2006. [27] G. Pandey , C. L. Myers, and V . Kumar , “Incorporating Functional Inter-Relationships Into Protein Function Prediction Algorithms, ” BMC Bioinformatics , vol. 10, no. 142, 2009. [28] J. L. Hartman, B. Garvik, and L. Hartwell, “Principles for the Buffering of Genetic V ariation, ” Science , vol. 291, no. 5506, pp. 1001–1004, 2001. [29] T . Horn, T . Sandmann, B. Fischer , E. Axelsson, W . Huber , and M. Boutros, “Mapping of Signaling Networks Through Synthetic Ge- netic Interaction Analysis by RNAi, ” Nature Methods , vol. 8, no. 4, pp. 341–346, 2011. [30] B. V anderSluis, J. Bellay , G. Musso, M. Costanzo, B. Papp, F . J. V izeacoumar, A. Baryshnikova, B. J. Andrews, C. Boone, and C. L. Myers, “Genetic Interactions Reveal the Evolutionary Trajectories of Duplicate Genes, ” Molecular Systems Biology , vol. 6, no. 1, 2010. [31] R Core T eam, “R: A Language and Environment for Statistical Com- puting, ” 2012. [32] K. Hornik, C. Buchta, and A. Zeileis, “Open-Source Machine Learning: R Meets W eka, ” Computational Statistics , vol. 24, no. 2, pp. 225–232, 2009. [33] M. Hall, E. Frank, G. Holmes, B. Pfahringer , P . Reutemann, and I. H. W itten, “The WEKA Data Mining Software: An Update, ” SIGKDD Explorations , vol. 11, no. 1, pp. 10–18, 2009. [34] M. Kuhn, “Building Predictiv e Models in R Using the caret Package, ” Journal of Statistical Software , vol. 28, no. 5, pp. 1–26, 2008. [35] T . Hothorn, P . B ¨ uhlmann, T . Kneib, M. Schmid, and B. Hofner, “Model- Based Boosting 2.0, ” Journal of Machine Learning Resear ch , vol. 11, no. Aug, pp. 2109–2113, 2010. [36] J. H. Friedman, T . Hastie, and R. J. Tibshirani, “Regularization Paths for Generalized Linear Models via Coordinate Descent, ” Journal of Statistical Softwar e , vol. 33, no. 1, pp. 1–22, 2010. [37] W . N. V enables and B. D. Ripley , Modern Applied Statistics with S , 4th ed. Springer , 2002. [38] G. Ridgew ay , “gbm: Generalized Boosted Regression Models, ” 2013. [39] A. Liaw and M. Wiener , “Classiﬁcation and Regression by randomFor- est, ” R News , vol. 2, no. 3, pp. 18–22, 2002. [40] T . Hastie, R. J. T ibshirani, B. Narasimhan, and G. Chu, “Diagnosis of Multiple Cancer T ypes by Shrunken Centroids of Gene Expression, ” Pr oceedings of the National Academy of Sciences of the United States of America , vol. 99, no. 10, pp. 6567–6572, 2002. [41] K. M. Ting and I. H. Witten, “Issues in Stacked Generalization, ” Journal of Artiﬁcial Intelligence Research , vol. 10, no. 1, pp. 271–289, 1999. [42] G. U. Y ule, “On the Association of Attributes in Statistics: With Illustrations from the Material of the Childhood Society , ” Philosophical T ransactions of the Royal Society of London, Series A , vol. 194, pp. 257–319, 1900. [43] J. Cohen, “A Coefﬁcient of Agreement for Nominal Scales, ” Educational and Psychological Measurement , vol. 20, no. 1, pp. 37–46, 1960. [44] D. Eddelbuettel and R. Franc ¸ ois, “Rcpp: Seamless R and C++ Integra- tion, ” Journal of Statistical Software , vol. 40, no. 8, pp. 1–18, 2011. [45] J. Dem ˇ sar , “Statistical Comparisons of Classiﬁers ov er Multiple Data Sets, ” Journal of Machine Learning Researc h , vol. 7, no. Jan, pp. 1–30, 2006. [46] M. Friedman, “The Use of Ranks to A void the Assumption of Normality Implicit in the Analysis of V ariance, ” Journal of the American Statistical Association , vol. 32, no. 200, pp. 675–701, 1937. [47] P . B. Nemenyi, “Distribution-Free Multiple Comparisons, ” Ph.D. disser- tation, Princeton Univ ersity , 1963. [48] G. Brown and L. I. Kunchev a, ““Good” and “Bad” Diversity in Majority V ote Ensembles, ” in Pr oceedings of the 9th International Confer ence on Multiple Classiﬁer Systems , 2010, pp. 124–133. [49] N. Li, Y . Y u, and Z.-H. Zhou, “Div ersity Regularized Ensemble Prun- ing, ” in Pr oceedings of the 2012 Eur opean Conference on Machine Learning and Knowledge Discovery in Databases , 2012, pp. 330–345. [50] G. W . Brier , “V eriﬁcation of Forecasts Expressed in T erms of Probabil- ity , ” Monthly W eather Review , vol. 78, no. 1, pp. 1–3, 1950. [51] J. C. Platt, “Probabilistic Outputs for Support V ector Machines and Comparisons to Regularized Likelihood Methods, ” Advances in Large Mar gin Classiﬁers , vol. 10, no. 3, pp. 61–74, 1999. [52] B. Zadrozny and C. Elkan, “T ransforming Classiﬁer Scores Into Accu- rate Multiclass Probability Estimates, ” in Pr oceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining , 2002, pp. 694–699.

A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment