Should we really use post-hoc tests based on mean-ranks?

Alessio Benavoli (alessio@idsia.ch)
Giorgio Corani (giorgio@idsia.ch)
Francesca Mangili (francesca@idsia.ch)

Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA)
Scuola Universitaria Professionale della Svizzera italiana (SUPSI)
Università della Svizzera italiana (USI)
Manno, Switzerland

Abstract

The statistical comparison of multiple algorithms over multiple data sets is fundamental in machine learning. This is typically carried out by the Friedman test. When the Friedman test rejects the null hypothesis, multiple comparisons are carried out to establish which are the significant differences among the algorithms. The multiple comparisons are usually performed with the mean-ranks test. The aim of this technical note is to discuss the inconsistencies of the mean-ranks post-hoc test, with the goal of discouraging its use in machine learning as well as in medicine, psychology, etc. We show that the outcome of the mean-ranks test depends on the pool of algorithms originally included in the experiment. In other words, the outcome of the comparison between algorithms A and B also depends on the performance of the other algorithms included in the original experiment. This can lead to paradoxical situations. For instance, the difference between A and B could be declared significant if the pool comprises algorithms C, D, E and not significant if the pool comprises algorithms F, G, H. To overcome these issues, we suggest instead performing the multiple comparisons with a test whose outcome depends only on the two algorithms being compared, such as the sign test or the Wilcoxon signed-rank test.

Keywords: statistical comparison, Friedman test, post-hoc test

1. Introduction

The statistical comparison of multiple algorithms over multiple data sets is fundamental in machine learning; it is typically carried out by means of a statistical test. The recommended approach is the Friedman test (Demšar, 2006). Being non-parametric, it does not require commensurability of the measures across different data sets, it does not assume normality of the sample means, and it is robust to outliers.

When the Friedman test rejects the null hypothesis of no difference among the algorithms, post-hoc analysis is carried out to assess which differences are significant. A series of pairwise comparisons is performed, adjusting the significance level via the Bonferroni correction or other more powerful approaches (Demšar, 2006; Garcia and Herrera, 2008) to control the family-wise Type I error.

The mean-ranks post-hoc test (McDonald and Thompson, 1967; Nemenyi, 1963) is recommended as the pairwise test for multiple comparisons in most books of nonparametric statistics: see for instance (Gibbons and Chakraborti, 2011, Sec. 12.2.1), (Kvam and Vidakovic, 2007, Sec. 8.2) and (Sheskin, 2003, Sec. 25.2). It is also commonly used in machine learning (Demšar, 2006; Garcia and Herrera, 2008). The mean-ranks test is based on the statistic

    z = \frac{|\bar{R}_A - \bar{R}_B|}{\sqrt{m(m+1)/(6n)}},

where R̄_A, R̄_B are the mean ranks (as computed by the Friedman test) of algorithms A and B, m is the number of algorithms to be compared and n the number of datasets. The mean ranks R̄_A, R̄_B are computed considering the performance of all the m algorithms: thus the outcome of the comparison between A and B also depends on the performance of the other (m − 2) algorithms included in the original experiment. This can lead to paradoxical situations.
For instance, the difference between A and B could be declared significant if the pool comprises algorithms C, D, E and not significant if the pool comprises algorithms F, G, H. The performance of the remaining algorithms should instead be irrelevant when comparing A and B. This problem has been pointed out several times in the past (Miller, 1966; Gabriel, 1969; Fligner, 1984) and also in (Hollander et al., 2013, Sec. 7.3). Yet it is ignored by most of the literature on nonparametric statistics. The issue should not be ignored: it can increase the Type I error when comparing two equivalent algorithms and, conversely, decrease the power when comparing algorithms whose performance is truly different.

In this technical note, these inconsistencies of the mean-ranks test are discussed in detail and illustrated by means of highlighting examples, with the goal of discouraging its use in machine learning as well as in medicine, psychology, etc. To avoid these issues, we instead recommend performing the pairwise comparisons of the post-hoc analysis with the Wilcoxon signed-rank test or the sign test. The decisions of such tests do not depend on the pool of algorithms included in the initial experiment. It is understood that, regardless of the specific test adopted for the pairwise comparisons, it is necessary to control the family-wise Type I error. This can be obtained through the Bonferroni correction or through more powerful approaches (Demšar, 2006; Garcia and Herrera, 2008).

Even better would be the adoption of Bayesian methods for hypothesis testing, which overcome the many drawbacks (Demšar, 2008; Goodman, 1999; Kruschke, 2010) of null-hypothesis significance tests. For instance, Bayesian counterparts of the Wilcoxon test and of the sign test have been presented in (Benavoli et al., 2014a; Benavoli et al., 2014b); a Bayesian approach for comparing cross-validated algorithms on multiple data sets is discussed by (Corani and Benavoli, 2015).

2. Friedman test

The performance of multiple algorithms tested on multiple datasets can be organized in a matrix:

    X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{m1} & X_{m2} & \cdots & X_{mn} \end{pmatrix},    (1)

where X_{ij} denotes the performance of the i-th algorithm on the j-th dataset (for i = 1, ..., m and j = 1, ..., n). The observations (performances) in different columns are assumed to be independent. The algorithms are ranked column-by-column and each entry X_{ij} is replaced by its rank relative to the other observations in the j-th column:

    R = \begin{pmatrix} R_{11} & R_{12} & \cdots & R_{1n} \\ R_{21} & R_{22} & \cdots & R_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ R_{m1} & R_{m2} & \cdots & R_{mn} \end{pmatrix},    (2)

where R_{ij} is the rank of algorithm i in the j-th dataset. The sum of the i-th row, R_i = \sum_{j=1}^{n} R_{ij} for i = 1, ..., m, depends on how the i-th algorithm performs with respect to the other (m − 1) algorithms. Under the null hypothesis of the Friedman test (no difference between the algorithms), the average value of R_i is n(m + 1)/2. The statistic of the Friedman test is

    S = \frac{12}{n\,m(m+1)} \sum_{i=1}^{m} \left( R_i - \frac{n(m+1)}{2} \right)^2,    (3)

which under the null hypothesis has a chi-squared distribution with m − 1 degrees of freedom. For m = 2, the Friedman test corresponds to the sign test.

3. Mean-ranks post-hoc test

If the Friedman test rejects the null hypothesis, one has to establish which are the significant differences among the algorithms. If all classifiers are compared to each other, one has to perform m(m − 1)/2 pairwise comparisons.
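As a concrete illustration of the column-wise ranking step and of the statistic S in Eq. (3), here is a minimal Python sketch (the function name is ours; ties and the computation of the chi-squared p-value are omitted):

```python
import numpy as np

def friedman_statistic(X):
    """Friedman statistic S of Eq. (3) for an m-by-n performance matrix X.

    Algorithms are ranked column-by-column (higher performance = higher rank,
    as in the examples of Section 4); ties are not handled in this sketch.
    """
    m, n = X.shape
    # Rank each column: rank 1 = worst, rank m = best (no ties assumed).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    R = ranks.sum(axis=1)                       # row sums R_i
    S = 12.0 / (n * m * (m + 1)) * np.sum((R - n * (m + 1) / 2.0) ** 2)
    return S, ranks

# Example: three algorithms on two datasets (algorithm 3 always best).
S, ranks = friedman_statistic(np.array([[1., 1.], [2., 2.], [3., 3.]]))
print(S)   # 4.0, to be compared with a chi-squared with m - 1 = 2 d.o.f.
```

For m = 2 the same statistic reduces to the sign test, consistently with the remark at the end of Section 2.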
When performing multiple comparisons, one has to control the family-wise error rate (FWER), namely the probability of at least one erroneous rejection of the null hypothesis among the m(m − 1)/2 pairwise comparisons. In the following examples we control the FWER through the Bonferroni correction, even though more powerful techniques are also available (Demšar, 2006; Garcia and Herrera, 2008). However, our discussion of the shortcomings of the mean-ranks test is valid regardless of the specific approach adopted to control the FWER. The mean-ranks test claims that the i-th and the j-th algorithm are significantly different if

    |\bar{R}_i - \bar{R}_j| \ge z^* \sqrt{\frac{m(m+1)}{6n}},    (4)

where R̄_i = R_i / n is the mean rank of the i-th algorithm and z* is the α/(m(m − 1)) upper standard normal quantile resulting from the Bonferroni correction (Gibbons and Chakraborti, 2011, Sec. 12.2.1). Equation (4) is based on the large-sample (n > 10) approximation of the distribution of the statistic. The actual distribution of the statistic |R̄_i − R̄_j| is derived assuming all the (m!)^n rank configurations in (2) to be equally probable. Under this assumption the variance of |R̄_i − R̄_j| is m(m + 1)/(6n), which originates the term under the square root in (4). Yet the assumption that all rank configurations are equally probable is not tenable: the post-hoc analysis is performed precisely because the null hypothesis of the Friedman test has been rejected.

4. Inconsistencies of the mean-ranks test

We illustrate the inconsistencies of the mean-ranks test by presenting three examples. All examples refer to the analysis of the accuracy of different classifiers on multiple data sets.
We show that the outcome of the test depends both on the actual difference of accuracy between algorithms A and B and on the accuracy of the remaining algorithms.

4.1 Example 1: artificially increasing power

Assume we have tested five algorithms A, B, C, D, E on 20 datasets, obtaining constant accuracies within each block of ten datasets:

                 Datasets 1-10   Datasets 11-20
    A            50 (x10)        80 (x10)
    B            80 (x10)        50 (x10)
    C            55 (x10)        45 (x10)
    D            60 (x10)        85 (x10)
    E            65 (x10)        90 (x10)

The corresponding ranks, where better algorithms are given higher ranks, are:

                 Datasets 1-10   Datasets 11-20
    A            1 (x10)         3 (x10)
    B            5 (x10)         2 (x10)
    C            2 (x10)         1 (x10)
    D            3 (x10)         4 (x10)
    E            4 (x10)         5 (x10)

We aim at comparing A and B. Algorithm B is better than A on the first ten datasets, while A is better than B on the remaining ten. The two algorithms have the same mean performance and their differences are symmetrically distributed; each algorithm wins on half the data sets. Different types of two-sided tests (t-test, Wilcoxon signed-rank test, sign test) return the same p-value, p = 1. The mean-ranks test corresponds in this case to the sign test and thus its p-value is also 1. This is the most extreme result in favor of the null hypothesis.

Now assume that we compare A, B together with C, D, E. In the first ten datasets, algorithm A is worse than C, D, E, which in turn are worse than B. In the remaining ten datasets, C is worse than A, B, which in turn are worse than D, E. The p-value of the Friedman test is p ≈ 10^{-10} and thus it rejects the null hypothesis. We can therefore perform the post-hoc test (4) with z* = 2.807 (the Bonferroni-corrected α/(m(m − 1)) upper standard normal quantile for α = 0.05 and m = 5). The significance level has been adjusted to α/(m(m − 1)), since we are performing m(m − 1)/2 two-sided comparisons. The mean ranks of A and B are respectively 2 and 3.5; thus, since |R̄_A − R̄_B| = 1.5 and z* sqrt(m(m+1)/(6n)) ≈ 1.4, we reject the null hypothesis. The result of the post-hoc test is that algorithms A and B have significantly different performance.

The decisions of the mean-ranks test are not consistent:

- if it compares A, B alone, it does not reject the null hypothesis;
- if it compares A, B together with C, D, E, it rejects the null hypothesis, concluding that A, B have significantly different performance.

The presence of C, D, E artificially introduces a difference between A and B by changing their mean ranks. For instance, D and E always rank better than A, while they never outperform B when it works well (i.e., on datasets one to ten); in a real case study, a similar result would probably indicate that while B is well suited for the first ten datasets, D, E and A are better suited for the last ten. The difference (in rank) between A and B is artificially amplified by the presence of D and E only when B is better than A. The point is that a large difference in the global ranks of two classifiers does not necessarily correspond to a large difference in their accuracies (and vice versa, as we will see in the next example).

This issue can happen in practice (we thank the anonymous reviewer for suggesting this example). Assume that a researcher presents a new algorithm A_0 and some of its weaker variations A_1, A_2, ..., A_k, and compares the new algorithms with an existing algorithm B. When B is better, the ranking is B ≻ A_0 ≻ ... ≻ A_k. When A_0 is better, the ranking is A_0 ≻ A_1 ≻ ... ≻ A_k ≻ B.
Therefore, the presence of A_1, A_2, ..., A_k artificially increases the difference between A_0 and B.

4.2 Example 2: low power due to the remaining algorithms

Assume the performance of algorithms A and B on different data sets to be normally distributed as follows:

    A ~ N(0, 1),    B ~ N(1.5, 1).

The pool of algorithms also comprises C, D, E, whose performance is distributed as follows:

    C ~ N(5, 1),    D ~ N(6, 1),    E ~ N(7, 1).

A collection of 20 data sets is considered. For the sake of simplicity, assume we want to compare only A and B; there is thus no need for a correction for multiple comparisons. When comparing A and B, the power of the two-sided sign test with α = 0.05 is very high: 0.94 (we have evaluated the power numerically by Monte Carlo simulation). The power of the mean-ranks test is instead only 0.046. We can explain the large difference in power as follows. The sign test (under the normal approximation of the distribution of the statistic) claims significance when

    |\bar{R}_A - \bar{R}_B| \ge z^* \sqrt{1/n},

while the mean-ranks test (4) claims significance when

    |\bar{R}_A - \bar{R}_B| \ge z^* \sqrt{m(m+1)/(6n)} = z^* \sqrt{5/n},

with m = 5. Since the algorithms C, D, E have mean performances much larger than those of A and B, the mean-ranks difference |R̄_A − R̄_B| is equal for the two tests. However, the mean-ranks test estimates the variance of the statistic |R̄_A − R̄_B| to be five times larger than the sign test does. The critical value of the mean-ranks test is thus inflated by √5, largely decreasing the power of the test. In fact, for the mean-ranks test the variance of |R̄_A − R̄_B| increases with the number of algorithms included in the initial experiment.
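The power figures of this example can be reproduced numerically. The following Python sketch (our own illustrative code, not the paper's MATLAB scripts) estimates by Monte Carlo the power of the sign test and of the mean-ranks rule (4) at α = 0.05, i.e. z* = 1.96:

```python
import numpy as np

def mc_power(n_datasets=20, trials=5000, z_star=1.96, seed=0):
    """Monte Carlo power of the sign test and of the mean-ranks test for
    A ~ N(0,1), B ~ N(1.5,1), with C ~ N(5,1), D ~ N(6,1), E ~ N(7,1) in the pool."""
    rng = np.random.default_rng(seed)
    means = np.array([0.0, 1.5, 5.0, 6.0, 7.0])       # A, B, C, D, E
    m, n = len(means), n_datasets
    sign_rej = meanranks_rej = 0
    for _ in range(trials):
        X = rng.normal(means[:, None], 1.0, size=(m, n))
        # Column-wise ranks over the whole pool (higher value = higher rank).
        ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
        diff = abs(ranks[0].mean() - ranks[1].mean())  # |mean rank A - mean rank B|
        meanranks_rej += diff >= z_star * np.sqrt(m * (m + 1) / (6 * n))
        # Sign test = the same rule restricted to A and B alone (m = 2).
        ranks2 = np.argsort(np.argsort(X[:2], axis=0), axis=0) + 1
        diff2 = abs(ranks2[0].mean() - ranks2[1].mean())
        sign_rej += diff2 >= z_star * np.sqrt(1.0 / n)  # m(m+1)/6 = 1 for m = 2
    return sign_rej / trials, meanranks_rej / trials

sign_power, meanranks_power = mc_power()
print(sign_power, meanranks_power)   # close to the 0.94 and 0.046 reported above
```

The √5 inflation of the critical value is what drives the gap: with this pool, the mean-ranks rule essentially rejects only when B wins on all 20 datasets.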
4.3 Example 3: real classifiers on UCI data sets

Finally, we compare the accuracies of seven classifiers on 54 datasets. The classifiers are: J48 decision tree (C1); hidden naive Bayes (C2); averaged one-dependence estimator (AODE) (C3); naive Bayes (C4); J48-graft (C5); locally weighted naive Bayes (C6); random forest (C7). The whole set of results is given in the Appendix. Each classifier has been assessed via 10 runs of 10-fold cross-validation. We performed all the experiments using WEKA (http://www.cs.waikato.ac.nz/ml/weka/). All these classifiers are described in (Witten and Frank, 2005). The accuracies are reported in Table 2.

Assume that our aim is to compare C1, C2, C3, C4 alone. We therefore consider just the first four columns of Table 2. The mean ranks are: C2 = 2.676, C4 = 1.917, C1 = 2.518, C3 = 2.888. The Friedman test rejects the null hypothesis. The pairwise comparison for the pair C2, C4 gives the statistic

    z = |\bar{R}_2 - \bar{R}_4| / \sqrt{m(m+1)/(6n)} = 3.06.

Since 3.06 is greater than z* = 2.64 (the Bonferroni-corrected α/(m(m − 1)) upper standard normal quantile for α = 0.05 and m = 4), the mean-ranks procedure finds the algorithms C2 and C4 to be significantly different.

If we instead compare C2, C4 together with C1, C5, the mean ranks are: C2 = 2.713, C4 = 2.102, C1 = 2.528, C5 = 2.657. Again, the Friedman test rejects the null hypothesis. The pairwise comparison for the pair C2, C4 gives the statistic

    z = |\bar{R}_2 - \bar{R}_4| / \sqrt{m(m+1)/(6n)} = 2.46,

which is smaller than z*. Thus the difference between algorithms C2 and C4 is not significant. The accuracies of C2 and C4 are the same in the two cases, but again the decisions of the mean-ranks test are conditional on the group of classifiers being considered.

Consider building a set of four classifiers {C2, C4, Cx, Cy}. By choosing Cx and Cy differently we can build ten different such sets. For each subset we run the mean-ranks test to check whether the difference between C2 and C4 is significant. The difference is claimed to be significant in 7 cases and not significant in 3 cases. Now consider a set of five classifiers {C2, C4, Cx, Cy, Cz}. By choosing Cx, Cy and Cz differently we can again build ten different such sets. This yields 10 further cases in which we compare C2 and C4; their difference is claimed to be significant in 9/10 cases. Table 1 reports the pairwise comparisons for which the statistical decision changes with the pool of classifiers considered.

                 Card=2    Card=3    Card=4
    C2 vs. C4    7/10      9/10      3/5
    C2 vs. C7    1/10      -         -
    C3 vs. C7    2/10      -         -
    C4 vs. C6    9/10      5/10      -

Table 1: Pairwise comparisons that are affected by the performance of the other algorithms (number of decisions that are significantly different / number of subsets). Here Card=2 means that, for each pair Ca, Cb in the left column, we consider the subsets {Ca, Cb, Cx, Cy}; Card=3 the subsets {Ca, Cb, Cx, Cy, Cz}; and Card=4 the subsets {Ca, Cb, Cx, Cy, Cz, Cw}. The symbol "-" means that the comparison does not depend on the subset of algorithms.

The outcome of the mean-ranks test when comparing the same pair of classifiers clearly depends on the pool of alternative classifiers {Cx, Cy, ...} which is assumed.
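The pool dependence illustrated above can also be checked directly in code. The sketch below (our own illustration) applies the mean-ranks rule (4) to the synthetic accuracies of Example 1 in Section 4.1, first for the pool {A, B} (z* = 1.96) and then for {A, B, C, D, E} (z* = 2.807), the quantiles used in that example:

```python
import numpy as np

def mean_ranks_significant(X, i, j, z_star):
    """Decision of the mean-ranks rule (4) for algorithms i and j,
    given the m-by-n accuracy matrix X of the whole pool."""
    m, n = X.shape
    # Column-wise ranks: higher accuracy = higher rank (no ties in this data).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    diff = abs(ranks[i].mean() - ranks[j].mean())
    return diff >= z_star * np.sqrt(m * (m + 1) / (6 * n))

# Accuracies of Example 1: five algorithms, 20 datasets.
acc = np.array([
    [50]*10 + [80]*10,   # A
    [80]*10 + [50]*10,   # B
    [55]*10 + [45]*10,   # C
    [60]*10 + [85]*10,   # D
    [65]*10 + [90]*10,   # E
], dtype=float)

# A vs. B alone (m = 2): mean ranks are both 1.5, so no significance.
print(mean_ranks_significant(acc[:2], 0, 1, z_star=1.96))   # False

# A vs. B within the pool {A,B,C,D,E} (m = 5): mean ranks 2 vs. 3.5.
print(mean_ranks_significant(acc, 0, 1, z_star=2.807))      # True
```

The same pair of algorithms, with identical accuracies, is declared not significantly different or significantly different depending solely on which other algorithms accompany it.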
4.4 Maximum Type I error

A further drawback of the mean-ranks test, not discussed in the previous examples, is that it cannot control the maximum Type I error, that is, the probability of falsely declaring any pair of algorithms to be different regardless of the other m − 2 algorithms. If the accuracies of all algorithms but one are equal, it does not guarantee the family-wise Type I error to be smaller than α when comparing the m − 1 equivalent algorithms. We point the reader to (Fligner, 1984) for a detailed discussion of this aspect.

5. A suggested procedure

Given the above issues, we recommend avoiding the mean-ranks test for the post-hoc analysis. One should instead perform the multiple comparisons using tests whose decisions depend only on the two algorithms being compared, such as the sign test or the Wilcoxon signed-rank test. The sign test is more robust, as it only assumes the observations to be identically distributed; its drawback is low power. The Wilcoxon signed-rank test is more powerful and is thus generally recommended (Demšar, 2006). Compared to the sign test, it makes the additional assumption of a symmetric distribution of the differences between the two algorithms being compared. The decision between the sign test and the signed-rank test thus depends on whether the symmetry assumption is tenable for the analyzed data.

Regardless of the adopted test, the multiple comparisons should be performed adjusting the significance level to control the family-wise Type I error. This can be done using the corrections for multiple comparisons discussed by (Demšar, 2006; Garcia and Herrera, 2008). If we adopt the Wilcoxon signed-rank test in Example 3 for comparing C2, C4, we obtain the p-value 0.0002, independently of the performance of the other algorithms. Thus, for any pool of algorithms C2, C4, Cx, Cy, we always report the same decision: C2 and C4 are significantly different, because the p-value is less than the Bonferroni-corrected significance level α/(m(m − 1)) (in the case m = 4, α/(m(m − 1)) = 0.0042).

6. Software

The MATLAB scripts of the above examples can be downloaded from ipg.idsia.ch/software/meanRanks/matla

7. Conclusions

The mean-ranks post-hoc test is a widely used test for multiple pairwise comparisons. We discuss a number of drawbacks of this test, which we recommend avoiding. We instead recommend adopting the sign test or the Wilcoxon signed-rank test, whose decisions do not depend on the pool of classifiers included in the original experiment. We moreover bring to the attention of the reader the Bayesian counterparts of these tests, which overcome the many drawbacks (Kruschke, 2010, Chap. 11) of null-hypothesis significance testing.

References

A. Benavoli, F. Mangili, G. Corani, M. Zaffalon, and F. Ruggeri. A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In Proceedings of the 30th International Conference on Machine Learning (ICML 2014), pages 1–9, 2014a.

A. Benavoli, F. Mangili, F. Ruggeri, and M. Zaffalon. Imprecise Dirichlet process with application to the hypothesis test on the probability that X ≤ Y. Accepted for publication in Journal of Statistical Theory and Practice, February 2014b. doi: 10.1080/15598608.2014.985997.

G. Corani and A. Benavoli. A Bayesian approach for comparing cross-validated algorithms on multiple data sets. Accepted for publication in Machine Learning, 2015.

Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

Janez Demšar. On the appropriateness of statistical tests in machine learning. In Workshop on Evaluation Methods for Machine Learning, in conjunction with ICML, 2008.

Michael A. Fligner. A note on two-sided distribution-free treatment versus control multiple comparisons. Journal of the American Statistical Association, 79(385):208–211, 1984.

K. Ruben Gabriel. Simultaneous test procedures–some theory of multiple comparisons. The Annals of Mathematical Statistics, pages 224–250, 1969.

Salvador Garcia and Francisco Herrera. An extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons. Journal of Machine Learning Research, 9(12), 2008.

Jean Dickinson Gibbons and Subhabrata Chakraborti. Nonparametric Statistical Inference. Springer, 2011.

Steven N. Goodman. Toward evidence-based medical statistics: The p-value fallacy. Annals of Internal Medicine, 130(12):995–1004, 1999.

Myles Hollander, Douglas A. Wolfe, and Eric Chicken. Nonparametric Statistical Methods, volume 751. John Wiley & Sons, 2013.

John K. Kruschke. Bayesian data analysis. Wiley Interdisciplinary Reviews: Cognitive Science, 1(5):658–676, 2010.

Paul H. Kvam and Brani Vidakovic. Nonparametric Statistics with Applications to Science and Engineering, volume 653. John Wiley & Sons, 2007.

B. J. McDonald and W. A. Thompson, Jr. Rank sum multiple comparisons in one- and two-way classifications. Biometrika, 54(3/4):487–497, 1967.

Rupert G. Miller. Simultaneous Statistical Inference. Springer, 1966.

P. Nemenyi. Distribution-free Multiple Comparisons. Ph.D. thesis, Princeton University, 1963.

David J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, 2003.

Ian H. Witten and Eibe Frank.
Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Appendix: Table of accuracies used in Example 3

Dataset                   C1      C2      C3      C4      C5      C6      C7
anneal                    98.44   98      98      96.43   98.55   98.33   99
audiology                 78.32   73.42   71.66   71.23   78.32   77.41   73.89
wisconsin-breast-cancer   93.7    96.71   96.99   97.14   93.7    97.28   95.57
cmc                       50.71   52.81   51.39   51.05   50.78   50.98   48.67
contact-lenses            81.67   68.33   71.67   71.67   81.67   65      78.33
credit                    86.38   84.64   86.67   86.23   86.52   87.25   85.07
german-credit             72.4    76.6    76.6    76      72.4    75.3    73
pima-diabetes             73.7    74.09   75.01   74.36   73.56   74.75   72.67
ecoli                     81.52   80.04   81.83   82.12   81.52   80.63   78.84
eucalyptus                64.28   63.2    58.71   51.1    64.01   59.52   59.4
glass                     71.58   74.26   73.83   70.63   71.1    75.69   73.33
grub-damage               38.79   36.88   43.92   47.79   39.42   40.13   42.63
haberman                  72.87   71.53   72.52   72.52   72.87   73.52   72.16
hayes-roth                60      56.88   60      60      60      60      59.38
cleveland-14              78.82   81.47   81.8    83.44   78.48   82.78   81.81
hungarian-14              78.64   84.39   84.39   84.74   78.64   84.38   81.97
hepatitis                 79.46   85.13   83.79   82.5    79.46   82.5    81.25
hypothyroid               99.28   99.18   98.54   98.3    99.28   98.62   98.97
ionosphere                91.17   90.88   90.88   89.17   91.74   89.17   91.75
iris                      93.33   92      92.67   92.67   93.33   92      93.33
kr-vs-kp                  99.44   92.46   91.24   87.89   99.37   91.21   98.87
labor                     85      88      84.67   83      85      81.33   84.67
liver-disorders           56.25   56.25   56.25   56.25   56.25   56.25   56.25
lymphography              78.33   85      85.71   84.38   79      86.33   79.62
monks1                    98.74   100     85.44   74.64   98.74   82.21   98.56
monks3                    98.92   97.84   96.75   96.39   98.92   96.39   97.84
monks                     64.72   64.57   63.73   62.24   64.72   64.9    70.72
mushroom                  100     99.96   99.95   95.83   100     99.84   100
nursery                   97.05   94.28   92.71   90.32   97.08   91.61   98.09
optdigits                 78.97   96.17   96.9    92.3    81.01   94.2    91.8
page-blocks               96.62   96.84   96.95   93.51   96.66   94.15   96.97
pasture-production        75      85.83   80.83   80.83   75      81.67   75.83
pendigits                 89.05   97.61   97.82   87.78   89.87   94.81   95.67
postoperative             70      67.78   67.78   66.67   70      66.67   60
primary-tumor             40.11   48.08   47.49   46.89   40.11   49.55   38.31
segment                   94.24   96.36   94.5    91.3    94.03   94.29   96.06
solar-flare-C             88.86   88.24   88.54   86.08   88.86   87.92   86.05
solar-flare-m             90.1    87.02   87.92   87      90.1    86.99   85.46
solar-flare-X             97.84   97.53   97.84   93.17   97.84   94.41   95.99
sonar                     74.48   79.83   81.26   80.29   74.45   80.79   78.36
soybean                   92.39   94.58   93.4    92.08   92.98   93.55   92.68
spambase                  92.81   92.31   93.37   89.85   93.22   90.63   93.65
spect-reordered           78.29   82.07   80.93   79.03   78.29   83.15   80.56
splice                    94.36   96.18   96.21   95.36   94.2    95.89   89.37
squash-stored             70      58      60      61.67   70      63.67   57.67
squash-unstored           76.67   69      70.67   61.67   76.67   68.67   77.33
tae                       47      44.38   47      47      47      47      45.67
credit                    84.93   83.91   85.07   84.2    84.93   85.22   83.33
vowel                     76.67   84.65   77.78   60.3    76.87   77.88   84.95
waveform                  74.38   84.52   84.92   79.86   74.9    83.62   79.68
white-clover              56.9    79.29   68.57   66.9    56.9    64.76   70
wine                      88.79   98.33   98.33   98.89   89.35   98.33   97.22
yeast                     57.01   57.48   56.74   56.8    57.01   57.48   56.26
zoo                       92.18   100     95.09   93.18   92.18   96.18   95.09

Table 2: Accuracy of the classifiers on the different data sets.
