Should we really use post-hoc tests based on mean-ranks?

Alessio Benavoli (alessio@idsia.ch)
Giorgio Corani (giorgio@idsia.ch)
Francesca Mangili (francesca@idsia.ch)

Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA)
Scuola Universitaria Professionale della Svizzera italiana (SUPSI)
Università della Svizzera italiana (USI)
Manno, Switzerland

Abstract

The statistical comparison of multiple algorithms over multiple data sets is fundamental in machine learning. This is typically carried out by the Friedman test. When the Friedman test rejects the null hypothesis, multiple comparisons are carried out to establish which are the significant differences among the algorithms. The multiple comparisons are usually performed with the mean-ranks test. The aim of this technical note is to discuss the inconsistencies of the mean-ranks post-hoc test, with the goal of discouraging its use in machine learning as well as in medicine, psychology, etc. We show that the outcome of the mean-ranks test depends on the pool of algorithms originally included in the experiment. In other words, the outcome of the comparison between algorithms A and B also depends on the performance of the other algorithms included in the original experiment. This can lead to paradoxical situations. For instance, the difference between A and B could be declared significant if the pool comprises algorithms C, D, E and not significant if the pool comprises algorithms F, G, H. To overcome these issues, we suggest instead performing the multiple comparisons with a test whose outcome depends only on the two algorithms being compared, such as the sign test or the Wilcoxon signed-rank test.

Keywords: statistical comparison, Friedman test, post-hoc test

1. Introduction

The statistical comparison of multiple algorithms over multiple data sets is fundamental in machine learning; it is typically carried out by means of a statistical test. The recommended approach is the Friedman test (Demšar, 2006). Being non-parametric, it does not require commensurability of the measures across different data sets, it does not assume normality of the sample means, and it is robust to outliers.

When the Friedman test rejects the null hypothesis of no difference among the algorithms, post-hoc analysis is carried out to assess which differences are significant. A series of pairwise comparisons is performed, adjusting the significance level via the Bonferroni correction or other more powerful approaches (Demšar, 2006; Garcia and Herrera, 2008) to control the family-wise Type I error.

The mean-ranks post-hoc test (McDonald and Thompson, 1967; Nemenyi, 1963) is recommended as the pairwise test for multiple comparisons in most books of nonparametric statistics: see for instance (Gibbons and Chakraborti, 2011, Sec. 12.2.1), (Kvam and Vidakovic, 2007, Sec. 8.2) and (Sheskin, 2003, Sec. 25.2). It is also commonly used in machine learning (Demšar, 2006; Garcia and Herrera, 2008). The mean-ranks test is based on the statistic

    z = \frac{|\bar{R}_A - \bar{R}_B|}{\sqrt{m(m+1)/(6n)}},

where R̄_A, R̄_B are the mean ranks (as computed by the Friedman test) of algorithms A and B, m is the number of algorithms to be compared and n the number of datasets. The mean ranks R̄_A, R̄_B are computed considering the performance of all the m algorithms: thus the outcome of the comparison between A and B also depends on the performance of the other (m − 2) algorithms included in the original experiment. This can lead to paradoxical situations.
For instance, the difference between A and B could be declared significant if the pool comprises algorithms C, D, E and not significant if the pool comprises algorithms F, G, H. The performance of the remaining algorithms should instead be irrelevant when comparing A and B. This problem has been pointed out several times in the past (Miller, 1966; Gabriel, 1969; Fligner, 1984) and also in (Hollander et al., 2013, Sec. 7.3). Yet it is ignored by most of the literature on nonparametric statistics. The issue should not be ignored: it can increase the Type I error when comparing two equivalent algorithms and, conversely, decrease the power when comparing algorithms whose performance is truly different.

In this technical note, these inconsistencies of the mean-ranks test are discussed in detail and illustrated by means of highlighting examples, with the goal of discouraging its use in machine learning as well as in medicine, psychology, etc. To avoid these issues, we instead recommend performing the pairwise comparisons of the post-hoc analysis with the Wilcoxon signed-rank test or the sign test. The decisions of such tests do not depend on the pool of algorithms included in the initial experiment. It is understood that, regardless of the specific test adopted for the pairwise comparisons, it is necessary to control the family-wise Type I error. This can be obtained through the Bonferroni correction or through more powerful approaches (Demšar, 2006; Garcia and Herrera, 2008).

Even better would be the adoption of Bayesian methods for hypothesis testing, which overcome the many drawbacks (Demšar, 2008; Goodman, 1999; Kruschke, 2010) of null-hypothesis significance tests. For instance, Bayesian counterparts of the Wilcoxon test and of the sign test have been presented in (Benavoli et al., 2014a; Benavoli et al., 2014b); a Bayesian approach for comparing cross-validated algorithms on multiple data sets is discussed by (Corani and Benavoli, 2015).

2. Friedman test

The performance of multiple algorithms tested on multiple datasets can be organized in a matrix:

    X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{m1} & X_{m2} & \cdots & X_{mn} \end{pmatrix},    (1)

where X_{ij} denotes the performance of the i-th algorithm on the j-th dataset (for i = 1, ..., m and j = 1, ..., n). The observations (performances) in different columns are assumed to be independent. The algorithms are ranked column-by-column and each entry X_{ij} is replaced by its rank relative to the other observations in the j-th column:

    R = \begin{pmatrix} R_{11} & R_{12} & \cdots & R_{1n} \\ R_{21} & R_{22} & \cdots & R_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ R_{m1} & R_{m2} & \cdots & R_{mn} \end{pmatrix},    (2)

where R_{ij} is the rank of algorithm i in the j-th dataset. The sum of the i-th row, R_i = \sum_{j=1}^{n} R_{ij} for i = 1, ..., m, depends on how the i-th algorithm performs with respect to the other (m − 1) algorithms. Under the null hypothesis of the Friedman test (no difference between the algorithms), the average value of R_i is n(m + 1)/2. The statistic of the Friedman test is

    S = \frac{12}{n\,m(m+1)} \sum_{i=1}^{m} \left( R_i - \frac{n(m+1)}{2} \right)^2,    (3)

which under the null hypothesis has a chi-squared distribution with m − 1 degrees of freedom. For m = 2, the Friedman test corresponds to the sign test.

3. Mean-ranks post-hoc test

If the Friedman test rejects the null hypothesis, one has to establish which are the significant differences among the algorithms. If all classifiers are compared to each other, one has to perform m(m − 1)/2 pairwise comparisons.
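As a concrete illustration of the column-wise ranking step and of the statistic S in Eq. (3), here is a minimal Python sketch (the function name is ours; ties and the computation of the chi-squared p-value are omitted):

```python
import numpy as np

def friedman_statistic(X):
    """Friedman statistic S of Eq. (3) for an m-by-n performance matrix X.

    Algorithms are ranked column-by-column (higher performance = higher rank,
    as in the examples of Section 4); ties are not handled in this sketch.
    """
    m, n = X.shape
    # Rank each column: rank 1 = worst, rank m = best (no ties assumed).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    R = ranks.sum(axis=1)                       # row sums R_i
    S = 12.0 / (n * m * (m + 1)) * np.sum((R - n * (m + 1) / 2.0) ** 2)
    return S, ranks

# Example: three algorithms on two datasets (algorithm 3 always best).
S, ranks = friedman_statistic(np.array([[1., 1.], [2., 2.], [3., 3.]]))
print(S)   # 4.0, to be compared with a chi-squared with m - 1 = 2 d.o.f.
```

For m = 2 the same statistic reduces to the sign test, consistently with the remark at the end of Section 2.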
When performing multiple comparisons, one has to control the family-wise error rate (FWER), namely the probability of at least one erroneous rejection of the null hypothesis among the m(m − 1)/2 pairwise comparisons. In the following examples we control the FWER through the Bonferroni correction, even though more powerful techniques are also available (Demšar, 2006; Garcia and Herrera, 2008). However, our discussion of the shortcomings of the mean-ranks test is valid regardless of the specific approach adopted to control the FWER. The mean-ranks test claims that the i-th and the j-th algorithm are significantly different if

    |\bar{R}_i - \bar{R}_j| \ge z^* \sqrt{\frac{m(m+1)}{6n}},    (4)

where R̄_i = R_i / n is the mean rank of the i-th algorithm and z* is the α/(m(m − 1)) upper standard normal quantile resulting from the Bonferroni correction (Gibbons and Chakraborti, 2011, Sec. 12.2.1). Equation (4) is based on the large-sample (n > 10) approximation of the distribution of the statistic. The actual distribution of the statistic |R̄_i − R̄_j| is derived assuming all the (m!)^n rank configurations in (2) to be equally probable. Under this assumption the variance of |R̄_i − R̄_j| is m(m + 1)/(6n), which originates the term under the square root in (4). Yet the assumption that all rank configurations are equally probable is not tenable: the post-hoc analysis is performed precisely because the null hypothesis of the Friedman test has been rejected.

4. Inconsistencies of the mean-ranks test

We illustrate the inconsistencies of the mean-ranks test by presenting three examples. All examples refer to the analysis of the accuracy of different classifiers on multiple data sets.
We show that the outcome of the test depends both on the actual difference of accuracy between algorithms A and B and on the accuracy of the remaining algorithms.

4.1 Example 1: artificially increasing power

Assume we have tested five algorithms A, B, C, D, E on 20 datasets, obtaining constant accuracies within each block of ten datasets:

                 Datasets 1-10   Datasets 11-20
    A            50 (x10)        80 (x10)
    B            80 (x10)        50 (x10)
    C            55 (x10)        45 (x10)
    D            60 (x10)        85 (x10)
    E            65 (x10)        90 (x10)

The corresponding ranks, where better algorithms are given higher ranks, are:

                 Datasets 1-10   Datasets 11-20
    A            1 (x10)         3 (x10)
    B            5 (x10)         2 (x10)
    C            2 (x10)         1 (x10)
    D            3 (x10)         4 (x10)
    E            4 (x10)         5 (x10)

We aim at comparing A and B. Algorithm B is better than A on the first ten datasets, while A is better than B on the remaining ten. The two algorithms have the same mean performance and their differences are symmetrically distributed; each algorithm wins on half the data sets. Different types of two-sided tests (t-test, Wilcoxon signed-rank test, sign test) return the same p-value, p = 1. The mean-ranks test corresponds in this case to the sign test and thus its p-value is also 1. This is the most extreme result in favor of the null hypothesis.

Now assume that we compare A, B together with C, D, E. In the first ten datasets, algorithm A is worse than C, D, E, which in turn are worse than B. In the remaining ten datasets, C is worse than A, B, which in turn are worse than D, E. The p-value of the Friedman test is p ≈ 10^{-10} and thus it rejects the null hypothesis. We can therefore perform the post-hoc test (4) with z* = 2.807 (the Bonferroni-corrected α/(m(m − 1)) upper standard normal quantile for α = 0.05 and m = 5). The significance level has been adjusted to α/(m(m − 1)), since we are performing m(m − 1)/2 two-sided comparisons. The mean ranks of A and B are respectively 2 and 3.5; thus, since |R̄_A − R̄_B| = 1.5 and z* sqrt(m(m+1)/(6n)) ≈ 1.4, we reject the null hypothesis. The result of the post-hoc test is that algorithms A and B have significantly different performance.

The decisions of the mean-ranks test are not consistent:

- if it compares A, B alone, it does not reject the null hypothesis;
- if it compares A, B together with C, D, E, it rejects the null hypothesis, concluding that A, B have significantly different performance.

The presence of C, D, E artificially introduces a difference between A and B by changing their mean ranks. For instance, D and E always rank better than A, while they never outperform B when it works well (i.e., on datasets one to ten); in a real case study, a similar result would probably indicate that while B is well suited for the first ten datasets, D, E and A are better suited for the last ten. The difference (in rank) between A and B is artificially amplified by the presence of D and E only when B is better than A. The point is that a large difference in the global ranks of two classifiers does not necessarily correspond to a large difference in their accuracies (and vice versa, as we will see in the next example).

This issue can happen in practice (we thank the anonymous reviewer for suggesting this example). Assume that a researcher presents a new algorithm A_0 and some of its weaker variations A_1, A_2, ..., A_k, and compares the new algorithms with an existing algorithm B. When B is better, the ranking is B ≻ A_0 ≻ ... ≻ A_k. When A_0 is better, the ranking is A_0 ≻ A_1 ≻ ... ≻ A_k ≻ B.
Therefore, the presence of A_1, A_2, ..., A_k artificially increases the difference between A_0 and B.

4.2 Example 2: low power due to the remaining algorithms

Assume the performance of algorithms A and B on different data sets to be normally distributed as follows:

    A ~ N(0, 1),    B ~ N(1.5, 1).

The pool of algorithms also comprises C, D, E, whose performance is distributed as follows:

    C ~ N(5, 1),    D ~ N(6, 1),    E ~ N(7, 1).

A collection of 20 data sets is considered. For the sake of simplicity, assume we want to compare only A and B; there is thus no need for a correction for multiple comparisons. When comparing A and B, the power of the two-sided sign test with α = 0.05 is very high: 0.94 (we have evaluated the power numerically by Monte Carlo simulation). The power of the mean-ranks test is instead only 0.046. We can explain the large difference in power as follows. The sign test (under the normal approximation of the distribution of the statistic) claims significance when

    |\bar{R}_A - \bar{R}_B| \ge z^* \sqrt{1/n},

while the mean-ranks test (4) claims significance when

    |\bar{R}_A - \bar{R}_B| \ge z^* \sqrt{m(m+1)/(6n)} = z^* \sqrt{5/n},

with m = 5. Since the algorithms C, D, E have mean performances much larger than those of A and B, the mean-ranks difference |R̄_A − R̄_B| is equal for the two tests. However, the mean-ranks test estimates the variance of the statistic |R̄_A − R̄_B| to be five times larger than the sign test does. The critical value of the mean-ranks test is thus inflated by √5, largely decreasing the power of the test. In fact, for the mean-ranks test the variance of |R̄_A − R̄_B| increases with the number of algorithms included in the initial experiment.
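The power figures of this example can be reproduced numerically. The following Python sketch (our own illustrative code, not the paper's MATLAB scripts) estimates by Monte Carlo the power of the sign test and of the mean-ranks rule (4) at α = 0.05, i.e. z* = 1.96:

```python
import numpy as np

def mc_power(n_datasets=20, trials=5000, z_star=1.96, seed=0):
    """Monte Carlo power of the sign test and of the mean-ranks test for
    A ~ N(0,1), B ~ N(1.5,1), with C ~ N(5,1), D ~ N(6,1), E ~ N(7,1) in the pool."""
    rng = np.random.default_rng(seed)
    means = np.array([0.0, 1.5, 5.0, 6.0, 7.0])       # A, B, C, D, E
    m, n = len(means), n_datasets
    sign_rej = meanranks_rej = 0
    for _ in range(trials):
        X = rng.normal(means[:, None], 1.0, size=(m, n))
        # Column-wise ranks over the whole pool (higher value = higher rank).
        ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
        diff = abs(ranks[0].mean() - ranks[1].mean())  # |mean rank A - mean rank B|
        meanranks_rej += diff >= z_star * np.sqrt(m * (m + 1) / (6 * n))
        # Sign test = the same rule restricted to A and B alone (m = 2).
        ranks2 = np.argsort(np.argsort(X[:2], axis=0), axis=0) + 1
        diff2 = abs(ranks2[0].mean() - ranks2[1].mean())
        sign_rej += diff2 >= z_star * np.sqrt(1.0 / n)  # m(m+1)/6 = 1 for m = 2
    return sign_rej / trials, meanranks_rej / trials

sign_power, meanranks_power = mc_power()
print(sign_power, meanranks_power)   # close to the 0.94 and 0.046 reported above
```

The √5 inflation of the critical value is what drives the gap: with this pool, the mean-ranks rule essentially rejects only when B wins on all 20 datasets.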
4.3 Example 3: real classifiers on UCI data sets

Finally, we compare the accuracies of seven classifiers on 54 datasets. The classifiers are: J48 decision tree (C1); hidden naive Bayes (C2); averaged one-dependence estimator (AODE) (C3); naive Bayes (C4); J48-graft (C5); locally weighted naive Bayes (C6); random forest (C7). The whole set of results is given in the Appendix. Each classifier has been assessed via 10 runs of 10-fold cross-validation. We performed all the experiments using WEKA (http://www.cs.waikato.ac.nz/ml/weka/). All these classifiers are described in (Witten and Frank, 2005). The accuracies are reported in Table 2.

Assume that our aim is to compare C1, C2, C3, C4 alone. We therefore consider just the first four columns of Table 2. The mean ranks are: C2 = 2.676, C4 = 1.917, C1 = 2.518, C3 = 2.888. The Friedman test rejects the null hypothesis. The pairwise comparison for the pair C2, C4 gives the statistic

    z = |\bar{R}_2 - \bar{R}_4| / \sqrt{m(m+1)/(6n)} = 3.06.

Since 3.06 is greater than z* = 2.64 (the Bonferroni-corrected α/(m(m − 1)) upper standard normal quantile for α = 0.05 and m = 4), the mean-ranks procedure finds the algorithms C2 and C4 to be significantly different.

If we instead compare C2, C4 together with C1, C5, the mean ranks are: C2 = 2.713, C4 = 2.102, C1 = 2.528, C5 = 2.657. Again, the Friedman test rejects the null hypothesis. The pairwise comparison for the pair C2, C4 gives the statistic

    z = |\bar{R}_2 - \bar{R}_4| / \sqrt{m(m+1)/(6n)} = 2.46,

which is smaller than z*. Thus the difference between algorithms C2 and C4 is not significant. The accuracies of C2 and C4 are the same in the two cases, but again the decisions of the mean-ranks test are conditional on the group of classifiers being considered.

Consider building a set of four classifiers {C2, C4, Cx, Cy}. By choosing Cx and Cy differently we can build ten different such sets. For each subset we run the mean-ranks test to check whether the difference between C2 and C4 is significant. The difference is claimed to be significant in 7 cases and not significant in 3 cases. Now consider a set of five classifiers {C2, C4, Cx, Cy, Cz}. By choosing Cx, Cy and Cz differently we can again build ten different such sets. This yields 10 further cases in which we compare C2 and C4; their difference is claimed to be significant in 9/10 cases. Table 1 reports the pairwise comparisons for which the statistical decision changes with the pool of classifiers considered.

                 Card=2    Card=3    Card=4
    C2 vs. C4    7/10      9/10      3/5
    C2 vs. C7    1/10      -         -
    C3 vs. C7    2/10      -         -
    C4 vs. C6    9/10      5/10      -

Table 1: Pairwise comparisons that are affected by the performance of the other algorithms (number of decisions that are significantly different / number of subsets). Here Card=2 means that, for each pair Ca, Cb in the left column, we consider the subsets {Ca, Cb, Cx, Cy}; Card=3 the subsets {Ca, Cb, Cx, Cy, Cz}; and Card=4 the subsets {Ca, Cb, Cx, Cy, Cz, Cw}. The symbol "-" means that the comparison does not depend on the subset of algorithms.

The outcome of the mean-ranks test when comparing the same pair of classifiers clearly depends on the pool of alternative classifiers {Cx, Cy, ...} which is assumed.
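The pool dependence illustrated above can also be checked directly in code. The sketch below (our own illustration) applies the mean-ranks rule (4) to the synthetic accuracies of Example 1 in Section 4.1, first for the pool {A, B} (z* = 1.96) and then for {A, B, C, D, E} (z* = 2.807), the quantiles used in that example:

```python
import numpy as np

def mean_ranks_significant(X, i, j, z_star):
    """Decision of the mean-ranks rule (4) for algorithms i and j,
    given the m-by-n accuracy matrix X of the whole pool."""
    m, n = X.shape
    # Column-wise ranks: higher accuracy = higher rank (no ties in this data).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    diff = abs(ranks[i].mean() - ranks[j].mean())
    return diff >= z_star * np.sqrt(m * (m + 1) / (6 * n))

# Accuracies of Example 1: five algorithms, 20 datasets.
acc = np.array([
    [50]*10 + [80]*10,   # A
    [80]*10 + [50]*10,   # B
    [55]*10 + [45]*10,   # C
    [60]*10 + [85]*10,   # D
    [65]*10 + [90]*10,   # E
], dtype=float)

# A vs. B alone (m = 2): mean ranks are both 1.5, so no significance.
print(mean_ranks_significant(acc[:2], 0, 1, z_star=1.96))   # False

# A vs. B within the pool {A,B,C,D,E} (m = 5): mean ranks 2 vs. 3.5.
print(mean_ranks_significant(acc, 0, 1, z_star=2.807))      # True
```

The same pair of algorithms, with identical accuracies, is declared not significantly different or significantly different depending solely on which other algorithms accompany it.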
4.4 Maximum Type I error

A further drawback of the mean-ranks test, not discussed in the previous examples, is that it cannot control the maximum Type I error, that is, the probability of falsely declaring any pair of algorithms to be different regardless of the other m − 2 algorithms. If the accuracies of all algorithms but one are equal, it does not guarantee the family-wise Type I error to be smaller than α when comparing the m − 1 equivalent algorithms. We point the reader to (Fligner, 1984) for a detailed discussion of this aspect.

5. A suggested procedure

Given the above issues, we recommend avoiding the mean-ranks test for the post-hoc analysis. One should instead perform the multiple comparisons using tests whose decisions depend only on the two algorithms being compared, such as the sign test or the Wilcoxon signed-rank test. The sign test is more robust, as it only assumes the observations to be identically distributed; its drawback is low power. The Wilcoxon signed-rank test is more powerful and is thus generally recommended (Demšar, 2006). Compared to the sign test, it makes the additional assumption of a symmetric distribution of the differences between the two algorithms being compared. The decision between the sign test and the signed-rank test thus depends on whether the symmetry assumption is tenable for the analyzed data.

Regardless of the adopted test, the multiple comparisons should be performed adjusting the significance level to control the family-wise Type I error. This can be done using the corrections for multiple comparisons discussed by (Demšar, 2006; Garcia and Herrera, 2008). If we adopt the Wilcoxon signed-rank test in Example 3 for comparing C2, C4, we obtain the p-value 0.0002, independently of the performance of the other algorithms. Thus, for any pool of algorithms C2, C4, Cx, Cy, we always report the same decision: C2 and C4 are significantly different, because the p-value is less than the Bonferroni-corrected significance level α/(m(m − 1)) (in the case m = 4, α/(m(m − 1)) = 0.0042).

6. Software

The MATLAB scripts of the above examples can be downloaded from ipg.idsia.ch/software/meanRanks/matla

7. Conclusions

The mean-ranks post-hoc test is a widely used test for multiple pairwise comparisons. We discuss a number of drawbacks of this test, which we recommend avoiding. We instead recommend adopting the sign test or the Wilcoxon signed-rank test, whose decisions do not depend on the pool of classifiers included in the original experiment. We moreover bring to the attention of the reader the Bayesian counterparts of these tests, which overcome the many drawbacks (Kruschke, 2010, Chap. 11) of null-hypothesis significance testing.

References

A. Benavoli, F. Mangili, G. Corani, M. Zaffalon, and F. Ruggeri. A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In Proceedings of the 30th International Conference on Machine Learning (ICML 2014), pages 1–9, 2014a.

A. Benavoli, F. Mangili, F. Ruggeri, and M. Zaffalon. Imprecise Dirichlet process with application to the hypothesis test on the probability that X ≤ Y. Accepted for publication in Journal of Statistical Theory and Practice, February 2014b. doi: 10.1080/15598608.2014.985997.

G. Corani and A. Benavoli. A Bayesian approach for comparing cross-validated algorithms on multiple data sets. Accepted for publication in Machine Learning, 2015.

Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

Janez Demšar. On the appropriateness of statistical tests in machine learning. In Workshop on Evaluation Methods for Machine Learning, in conjunction with ICML, 2008.

Michael A. Fligner. A note on two-sided distribution-free treatment versus control multiple comparisons. Journal of the American Statistical Association, 79(385):208–211, 1984.

K. Ruben Gabriel. Simultaneous test procedures–some theory of multiple comparisons. The Annals of Mathematical Statistics, pages 224–250, 1969.

Salvador Garcia and Francisco Herrera. An extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons. Journal of Machine Learning Research, 9(12), 2008.

Jean Dickinson Gibbons and Subhabrata Chakraborti. Nonparametric Statistical Inference. Springer, 2011.

Steven N. Goodman. Toward evidence-based medical statistics: The p-value fallacy. Annals of Internal Medicine, 130(12):995–1004, 1999.

Myles Hollander, Douglas A. Wolfe, and Eric Chicken. Nonparametric Statistical Methods, volume 751. John Wiley & Sons, 2013.

John K. Kruschke. Bayesian data analysis. Wiley Interdisciplinary Reviews: Cognitive Science, 1(5):658–676, 2010.

Paul H. Kvam and Brani Vidakovic. Nonparametric Statistics with Applications to Science and Engineering, volume 653. John Wiley & Sons, 2007.

B. J. McDonald and W. A. Thompson, Jr. Rank sum multiple comparisons in one- and two-way classifications. Biometrika, 54(3/4):487–497, 1967.

Rupert G. Miller. Simultaneous Statistical Inference. Springer, 1966.

P. Nemenyi. Distribution-free Multiple Comparisons. Ph.D. thesis, Princeton University, 1963.

David J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, 2003.

Ian H. Witten and Eibe Frank.
Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Appendix: Table of accuracies used in Example 3

Dataset                   C1      C2      C3      C4      C5      C6      C7
anneal                    98.44   98      98      96.43   98.55   98.33   99
audiology                 78.32   73.42   71.66   71.23   78.32   77.41   73.89
wisconsin-breast-cancer   93.7    96.71   96.99   97.14   93.7    97.28   95.57
cmc                       50.71   52.81   51.39   51.05   50.78   50.98   48.67
contact-lenses            81.67   68.33   71.67   71.67   81.67   65      78.33
credit                    86.38   84.64   86.67   86.23   86.52   87.25   85.07
german-credit             72.4    76.6    76.6    76      72.4    75.3    73
pima-diabetes             73.7    74.09   75.01   74.36   73.56   74.75   72.67
ecoli                     81.52   80.04   81.83   82.12   81.52   80.63   78.84
eucalyptus                64.28   63.2    58.71   51.1    64.01   59.52   59.4
glass                     71.58   74.26   73.83   70.63   71.1    75.69   73.33
grub-damage               38.79   36.88   43.92   47.79   39.42   40.13   42.63
haberman                  72.87   71.53   72.52   72.52   72.87   73.52   72.16
hayes-roth                60      56.88   60      60      60      60      59.38
cleveland-14              78.82   81.47   81.8    83.44   78.48   82.78   81.81
hungarian-14              78.64   84.39   84.39   84.74   78.64   84.38   81.97
hepatitis                 79.46   85.13   83.79   82.5    79.46   82.5    81.25
hypothyroid               99.28   99.18   98.54   98.3    99.28   98.62   98.97
ionosphere                91.17   90.88   90.88   89.17   91.74   89.17   91.75
iris                      93.33   92      92.67   92.67   93.33   92      93.33
kr-vs-kp                  99.44   92.46   91.24   87.89   99.37   91.21   98.87
labor                     85      88      84.67   83      85      81.33   84.67
liver-disorders           56.25   56.25   56.25   56.25   56.25   56.25   56.25
lymphography              78.33   85      85.71   84.38   79      86.33   79.62
monks1                    98.74   100     85.44   74.64   98.74   82.21   98.56
monks3                    98.92   97.84   96.75   96.39   98.92   96.39   97.84
monks                     64.72   64.57   63.73   62.24   64.72   64.9    70.72
mushroom                  100     99.96   99.95   95.83   100     99.84   100
nursery                   97.05   94.28   92.71   90.32   97.08   91.61   98.09
optdigits                 78.97   96.17   96.9    92.3    81.01   94.2    91.8
page-blocks               96.62   96.84   96.95   93.51   96.66   94.15   96.97
pasture-production        75      85.83   80.83   80.83   75      81.67   75.83
pendigits                 89.05   97.61   97.82   87.78   89.87   94.81   95.67
postoperative             70      67.78   67.78   66.67   70      66.67   60
primary-tumor             40.11   48.08   47.49   46.89   40.11   49.55   38.31
segment                   94.24   96.36   94.5    91.3    94.03   94.29   96.06
solar-flare-C             88.86   88.24   88.54   86.08   88.86   87.92   86.05
solar-flare-m             90.1    87.02   87.92   87      90.1    86.99   85.46
solar-flare-X             97.84   97.53   97.84   93.17   97.84   94.41   95.99
sonar                     74.48   79.83   81.26   80.29   74.45   80.79   78.36
soybean                   92.39   94.58   93.4    92.08   92.98   93.55   92.68
spambase                  92.81   92.31   93.37   89.85   93.22   90.63   93.65
spect-reordered           78.29   82.07   80.93   79.03   78.29   83.15   80.56
splice                    94.36   96.18   96.21   95.36   94.2    95.89   89.37
squash-stored             70      58      60      61.67   70      63.67   57.67
squash-unstored           76.67   69      70.67   61.67   76.67   68.67   77.33
tae                       47      44.38   47      47      47      47      45.67
credit                    84.93   83.91   85.07   84.2    84.93   85.22   83.33
vowel                     76.67   84.65   77.78   60.3    76.87   77.88   84.95
waveform                  74.38   84.52   84.92   79.86   74.9    83.62   79.68
white-clover              56.9    79.29   68.57   66.9    56.9    64.76   70
wine                      88.79   98.33   98.33   98.89   89.35   98.33   97.22
yeast                     57.01   57.48   56.74   56.8    57.01   57.48   56.26
zoo                       92.18   100     95.09   93.18   92.18   96.18   95.09

Table 2: Accuracy of the classifiers on the different data sets.
