Information-Theoretic Measures for Objective Evaluation of Classifications


Authors: Bao-Gang Hu, Ran He, XiaoTong Yuan

Bao-Gang Hu* (a,b), Ran He (a), XiaoTong Yuan (c)

a NLPR/LIAMA, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
b Beijing Graduate School, Chinese Academy of Sciences, Beijing 100190, China
c Department of Electronic and Computer Engineering, National University of Singapore, Singapore

Abstract

This work presents a systematic study of objective evaluations of abstaining classifications using Information-Theoretic Measures (ITMs). First, we define objective measures as those that do not depend on any free parameter. This definition provides technical simplicity for examining "objectivity" or "subjectivity" directly in classification evaluations. Second, we propose twenty-four normalized ITMs, derived from either mutual information, divergence, or cross-entropy, for investigation. Contrary to conventional performance measures that apply empirical formulas based on users' intuitions or preferences, the ITMs are theoretically more sound for realizing objective evaluations of classifications. We apply them to distinguish "error types" and "reject types" in binary classifications without the need for input data of cost terms. Third, to better understand and select the ITMs, we suggest three desirable features for classification assessment measures, which appear more crucial and appealing from the viewpoint of classification applications. Using these features as "meta-measures", we can reveal the advantages and limitations of ITMs from a higher level of evaluation knowledge. Numerical examples are given to corroborate our claims and compare the differences among the proposed measures. The best measure is selected in terms of the meta-measures, and its specific properties regarding error types and reject types are analytically derived.
Keywords: Abstaining classifications, error types, reject types, entropy, similarity, objectivity

1. Introduction

The selection of evaluation measures for classifications has received increasing attention from researchers in various application fields [1][2][3][4][5][6][7]. It is well known that evaluation measures, or criteria, have a substantial impact on the quality of classification performance. The problem of how to select evaluation measures for the overall quality of classifications is difficult, and there appears to be no universal answer. Up to now, various types of evaluation measures have been used in classification applications. Taking binary classification as an example, more than thirty metrics have been applied for assessing the quality of classifications and their algorithms, as given in Table 1 of Lavesson and Davidsson's paper [5]. Most of the metrics listed in that table can be considered performance-based measures. In practice, other types of evaluation measures, such as Information-Theoretic Measures (ITMs), have also commonly been used in machine learning [8][9]. The typical information-based measure used in classifications is the cross entropy [10]. In a recent work [11], Hu and Wang derived an analytical formula of the Shannon-based mutual information measure with respect to a confusion matrix. Significant benefits were derived from the measure, such as its generality even for classifications with a reject option, and its objectivity in naturally balancing performance-based measures that may conflict with one another (such as precision and recall). The objectivity was achieved from the perspective that an information-based measure does not require knowledge of cost terms in evaluating classifications.
[* Corresponding author. Address: NLPR/LIAMA, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. Tel.: +86-10-62647318; Fax: +86-10-62647458. Email address: hubg@nlpr.ia.ac.cn (Bao-Gang Hu). September 9, 2018]

This advantage is particularly important in studies of abstaining classifications [12][4] and cost-sensitive learning [13][14], where cost terms may be required as input data for evaluations. Generally, if no cost terms are assigned to evaluations, it is implied that zero-one cost functions are applied [15]. In such situations, classification evaluations without a reject option may still be applicable and useful on class-balanced datasets. If no specific cost terms are given, however, problematic, or unreasonable, results will be obtained in situations where the classes in the datasets are highly skewed [3]. In this work, to simplify the discussion, we distinguish, or decouple, two goals in evaluation studies, namely, evaluation of classifiers and evaluation of classifications. The former goal concerns the evaluation of the algorithms in which classifiers are applied; from this evaluation, designers or users can select the best classifier. The latter goal is to evaluate classification results without concern for which classifier is applied; this evaluation aims more at result comparisons or measure comparisons. One typical example was demonstrated by Mackay [16] to highlight the difficulty in classification evaluations. He showed two specific confusion matrices, $C_D$ and $C_E$, in binary classifications with a reject option:

$$C_D = \begin{bmatrix} 74 & 6 & 10 \\ 0 & 9 & 1 \end{bmatrix}, \quad C_E = \begin{bmatrix} 78 & 6 & 6 \\ 0 & 5 & 5 \end{bmatrix}, \quad \text{with} \quad C = \begin{bmatrix} TN & FP & RN \\ FN & TP & RP \end{bmatrix}, \qquad (1)$$

where the confusion matrix is defined as $C$ in Eq.
(1), and "TN", "TP", "FN", "FP", "RN", "RP" represent "true negative", "true positive", "false negative", "false positive", "reject negative", and "reject positive", respectively. For the given data, users may ask "which measures will be proper for ranking them". If directly applying the "True Positive Rate vs. False Positive Rate" curve (also called ROC) or the "Precision-Recall" curve, one may conclude that the performance of $C_E$ is better than that of $C_D$. This conclusion is proper since the two sets of data share the same reject rate (= 11%). Generally, the "Error-Reject" curve is most often adopted in abstaining classifications. Based on this evaluation approach, one may consider that the performances of the two classifications show no difference, because they exhibit the same error rate (= 6%) and reject rate. Mackay [16] first suggested applying a mutual-information based measure for ranking classifications, through which Hu and Wang (referring to M5-M6 in Table 3, [11]) observed that $C_D$ is better than $C_E$. Reviewing the two matrices carefully with respect to the imbalanced classes, one may agree with this observation, because the small class in $C_D$ receives more correct classifications than that in $C_E$. We consider the example designed by Mackay [16] quite stimulating for the study of abstaining classification evaluations. The implications of the example form the motivations of the present work for addressing three related open problems, which are generally overlooked in the study of classification evaluations:

I. How to define "proper" measures in terms of high-level knowledge for abstaining classification evaluations?

II. How to conduct an objective evaluation of classifications without using cost terms?

III. How to distinguish or rank "error types" and "reject types" in classification evaluations?
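The tie between $C_D$ and $C_E$ under the conventional rates can be checked directly from the matrices of Eq. (1). A minimal sketch (the function name is ours; the layout with the reject class in the last column follows Eq. (1)):

```python
def error_and_reject_rates(C):
    """Error and reject rates from an augmented confusion matrix.

    Rows are the actual classes; the last column is the reject class,
    following the layout C = [[TN, FP, RN], [FN, TP, RP]] of Eq. (1).
    """
    m = len(C)                          # number of actual classes
    n = sum(sum(row) for row in C)      # total number of samples
    errors = sum(C[i][j] for i in range(m) for j in range(m) if i != j)
    rejects = sum(C[i][m] for i in range(m))
    return errors / n, rejects / n

C_D = [[74, 6, 10], [0, 9, 1]]
C_E = [[78, 6, 6], [0, 5, 5]]
print(error_and_reject_rates(C_D))  # (0.06, 0.11)
print(error_and_reject_rates(C_E))  # (0.06, 0.11): the two classifications tie
```

Both matrices yield a 6% error rate and an 11% reject rate, so any measure built only on these rates cannot rank them, which is exactly Mackay's point.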
Conventional binary classifications usually distinguish two types of misclassification errors [15][16] if they result in different losses in applications. For example, in medical applications, a "Type I Error" (or "false positive") can be an error of misclassifying a healthy person as abnormal, such as having cancer. On the contrary, a "Type II Error" (or "false negative") is an error where cancer is not detected in a patient. Therefore, "Type II Error" is more costly than "Type I Error". For the same reason that "error types" are identified in binary classifications, there is a need to consider "reject types" if a reject option is applied. Of the existing measures, we consider information-theoretic measures to be the most promising in providing "objectivity" in classification evaluations. A detailed discussion on the definition of "objectivity" is given in Section 3.

This work is an extension of our previous study [11]. However, it aims at a systematic investigation of information measures with a specific focus on "error types" and "reject types". The main contributions of the work derive from the following three aspects:

I. We define the "proper" features, also called "meta-measures", for selecting candidate measures in the context of abstaining classification evaluations. These features will assist users in understanding the advantages and limitations of evaluation measures from a higher level of knowledge.

II. We examine most of the existing information measures in a systematic investigation of "error types" and "reject types" for objective evaluations. We hope that the more than twenty measures investigated are able to enrich the current bank of classification evaluation measures. For the best measure in terms of the meta-measures, we present a theoretical confirmation of its desirable properties regarding error types and reject types.

III.
We reveal the intrinsic shortcomings of information measures in evaluations. The discussions are intended to be applicable to a wider range of classification problems, such as similarity ranking. In addition, we are able to employ the measures reasonably in interpreting classification results.

To address classification evaluations with a reject option, we assume that the only basic data available for classification evaluations is a confusion matrix, without input data of cost terms. The rest of this letter is organized as follows. In Section 2, we present related work on the selection of evaluation measures. In seeking "proper" measures, we propose several desirable features in the context of classifications in Section 3. Three groups of normalized information measures are proposed, along with their intrinsic shortcomings, in Sections 4 to 6, respectively. Several numerical examples, together with discussions, are given in Section 7. Finally, in Section 8 we conclude the work.

2. Related Work

In classification evaluations, a measure based on classification accuracy has traditionally been used with some success in numerous cases [15]. This measure, however, may suffer serious problems in reaching intuitively reasonable results in certain special cases of real-world classification problems [3]. The main reason for this is that a single measure of accuracy does not take error types into account. To overcome the problems of accuracy measures, researchers have developed many sophisticated approaches for classification assessment [17][18]. Among these, two commonly-used approaches are ROC (Receiver Operating Characteristic) curves and AUC (Area Under Curve) measures [1][19].
ROC curves provide users with a very fast evaluation approach via visual inspection, but this is only applicable in limited cases with specific curve forms (for example, when one curve is completely above the other). AUC measures are more generic for ranking classifications without constraints on curve forms. In a study of binary classifications, a formal proof was given by Ling et al. [1] showing that AUC is a better measure than accuracy under the definitions of both statistical consistency and discriminancy. Sophisticated AUC measures were reported recently for improving the robustness [6] and coherency [7] of classifiers. Drummond and Holte [20] proposed a visualization technique called the "Cost Curve", which is able to take cost terms into account for showing confidence intervals on a classifier's performance. Japkowicz [3] presented convincing examples showing the shortcomings of existing evaluation methods, including accuracy, precision vs. recall, and ROC techniques. The findings from these examples further confirmed the need for methods using measure-based functions [21]. The main idea behind measure-based functions is to form a single function with respect to a weighted summation of multiple measures. The measure function is able to balance the trade-off among conflicting measures, such as precision and recall. However, the main difficulty arises in the selection of balancing weights for the measures [5]. In most cases, users rely on their preferences and experience in assigning the weights, which imposes a strong degree of subjectivity on the evaluation results.

Classification evaluations become more complicated if a classifier abstains from making a prediction when the outcome is considered unreliable for a specific sample. In this case, an extra class, known as the "reject" or "unknown" class, is added to the classification.
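The weight-induced subjectivity of measure-based functions can be seen in a small numerical sketch. The two classifiers and the weight settings below are hypothetical; the function is simply a weighted summation of precision and recall:

```python
def measure_function(precision, recall, w_p, w_r):
    """Measure-based function: a weighted summation of two conflicting measures."""
    return w_p * precision + w_r * recall

# Two hypothetical classifiers: A favors precision, B favors recall.
A = (0.9, 0.6)
B = (0.7, 0.8)

# Precision-heavy weights rank A first ...
print(measure_function(*A, 0.8, 0.2), measure_function(*B, 0.8, 0.2))
# ... recall-heavy weights rank B first: the ranking depends on the user's weights.
print(measure_function(*A, 0.2, 0.8), measure_function(*B, 0.2, 0.8))
```

The ranking flips with the choice of weights, which is precisely the subjectivity the text attributes to measure-based functions.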
In recent years, the study of abstaining classifiers has received much attention [22][23][12][4][24]. With the complete data of a full cost matrix, these studies were able to assess the classifications. If one term of the cost matrix was missing, such as a reject cost term, the approaches for classification evaluation generally failed. Moreover, because in most situations the cost terms are given by users, this approach is basically a subjective evaluation in applications. Vanderlooy et al. [25] further investigated the ROC isometrics approach, which does not rely on information from a cost matrix. This approach, however, is only applicable to binary classification problems.

A promising line of study on objective evaluations of classifications is attributed to the introduction of information theory. Kvalseth [26] and Wickens [27] derived normalized mutual information (NMI) measures in relation to a contingency table. Further pioneering studies on the classification problems were conducted by Finn [28] and Forbes [29]. Forbes [29] discussed the problem that NMI does not share a monotonic property with the other performance measures, such as accuracy or F-measure. Several different definitions of information measures have been reported in studies of classification assessment, such as the information scores of Kononenko and Bratko [30] and the KL divergence of Nishii and Tanaka [31]. Yao et al. [8] and Tan et al. [32] summarized many useful information measures for studies of associations and attribute importance. Significant efforts were made on discussing the desired properties of evaluation measures [32]. Principe et al. [9] proposed a framework of information theoretic learning (ITL) that included supervised learning as in classifications. Within this framework, the learning criteria were the mutual information defined from the Shannon and Renyi entropies.
Two quadratic divergences, namely, the Euclidean and Cauchy-Schwartz distances, were also included. From the perspective of information theory, Wang and Hu [33] derived for the first time the nonlinear relations between mutual information and the conventional performance measures (accuracy, recall, and precision) for binary classification problems. They [11] extended the investigation to abstaining classification evaluations for multiple classes. Their method was based solely on the confusion matrix. To gain theoretical properties, they derived extremum theorems concerning mutual information measures. One of the important findings from the local minimum theorem is the theoretical revelation of the non-monotonic property of mutual information measures with respect to the diagonal terms of a confusion matrix. This property may cause irrational evaluation results for some data in classifications. They confirmed this problem by examining specific numerical examples. Theoretical investigations are still missing for other information measures, such as divergence-based and cross-entropy-based ones.

3. Objective Evaluations and Meta-Measures

This work focuses on objective evaluations of classifications. While Berger [34] stressed four points from a philosophical position in support of objective Bayesian analysis, it seems that few studies in the literature address the "objectivity" issue in the study of classification evaluations. Some researchers [32] may call their measures objective without defining them formally. Considering that "objectivity" is a rather philosophical concept without a well-accepted definition, we propose a scheme for defining "objective evaluations" from the viewpoint of practical implementation and examination.

Definition 1. Objective evaluations and measures.
An objective evaluation is an assessment expressed by a function that does not contain any free parameter. This function is called an objective measure.

Remark 1. When a free parameter is used to define a measure, it usually carries a certain degree of subjectivity into evaluations. Therefore, according to this definition, a measure based on cost terms [15] as free parameters does not lead to an objective evaluation. Definition 1 may be conservative but nevertheless provides technical simplicity for examining "objectivity" or "subjectivity" directly with respect to the existence of free parameters. In some situations, Definition 1 can be relaxed to admit free parameters, but they all have to be determined solely from the given dataset.

Definition 2. Datasets in classification evaluations with a reject option.

A reject option is sometimes considered for classifications in which one may assign samples to a reject, or unknown, class. Evaluations of classifications with a reject option apply two datasets, namely, the output (or prediction) dataset $\{y_k\}_{k=1}^{n}$, which is a realization of a discrete random variable $Y$ valued on the set $\{1, 2, \ldots, m+1\}$, and the target dataset $\{t_k\}_{k=1}^{n} \in T$ valued on the set $\{1, 2, \ldots, m\}$, where $n$ is the total number of samples and $m$ is the total number of classes. A sample identified as a reject class is represented by $y_k = m + 1$.

Remark 2. The term "abstaining classifiers" has been widely used for classification problems with a reject option [12][4]. However, most studies of abstaining classifications required cost matrices for their evaluations. The definition given above covers more generic scenarios in classification evaluations, because it does not require knowledge of cost terms for error types and reject types.

Definition 3. Augmented confusion matrix and its constraints [11].
An augmented confusion matrix includes one column for the reject class, which is added to a conventional confusion matrix:

$$C = \begin{bmatrix} c_{11} & c_{12} & \ldots & c_{1m} & c_{1(m+1)} \\ c_{21} & c_{22} & \ldots & c_{2m} & c_{2(m+1)} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ c_{m1} & c_{m2} & \ldots & c_{mm} & c_{m(m+1)} \end{bmatrix}, \qquad (2)$$

where $c_{ij}$ represents the number of samples of the $i$th class that are classified as the $j$th class. The row data correspond to the actual classes, while the column data correspond to the predicted classes. The last column represents the reject class. The relations and constraints of an augmented confusion matrix are:

$$C_i = \sum_{j=1}^{m+1} c_{ij}, \quad C_i > 0, \quad c_{ij} \ge 0, \quad i = 1, 2, \ldots, m, \qquad (3)$$

where $C_i$ is the total number of samples of the $i$th class, which is generally known in classification problems.

Definition 4. Error types and reject types.

Following the conventions in binary classifications [?], we denote $c_{12}$ and $c_{21}$ as "Type I Error" and "Type II Error", respectively, and $c_{13}$ and $c_{23}$ as "Type I Reject" and "Type II Reject", respectively.

Definition 5. Normalized information measure.

A normalized information measure, denoted $NI(T, Y) \in [0, 1]$, is a function based on information theory which represents the degree of similarity between two random variables $T$ and $Y$. In principle, we hope that all NI measures satisfy the three important properties, or axioms, of metrics [15][35], supposing $Z$ is another random variable:

P1: $NI(T, Y) = 1$ iff $T = Y$ (the identity axiom)
P2: $NI(T, Y) + NI(Y, Z) \ge NI(T, Z)$ (the triangle inequality)
P3: $NI(T, Y) = NI(Y, T)$ (the symmetry axiom)

Remark 3. Violations of the properties of metrics are possible while still reaching reasonable evaluations of classifications.
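Definitions 2 to 4 can be exercised in a short sketch: the augmented matrix of Eq. (2) is accumulated from target labels $t_k \in \{1, \ldots, m\}$ and predictions $y_k \in \{1, \ldots, m+1\}$ (label $m+1$ meaning reject), the constraints of Eq. (3) are checked, and the binary error and reject types of Definition 4 are read off. The label arrays and function names below are made up for illustration:

```python
def augmented_confusion_matrix(t, y, m):
    """Accumulate Eq. (2): rows are actual classes 1..m,
    columns are predicted classes 1..m plus the reject class m+1."""
    C = [[0] * (m + 1) for _ in range(m)]
    for tk, yk in zip(t, y):
        C[tk - 1][yk - 1] += 1
    return C

def check_constraints(C):
    """Eq. (3): C_i = sum_j c_ij > 0 and c_ij >= 0 for every row."""
    return all(sum(row) > 0 and min(row) >= 0 for row in C)

# Hypothetical binary data: class labels in {1, 2}; prediction 3 = reject.
t = [1, 1, 1, 1, 2, 2, 2, 2]
y = [1, 1, 2, 3, 2, 2, 1, 3]
C = augmented_confusion_matrix(t, y, m=2)
assert check_constraints(C)

# Definition 4 (binary case): c12/c21 are error types, c13/c23 are reject types.
types = {"Type I Error": C[0][1], "Type II Error": C[1][0],
         "Type I Reject": C[0][2], "Type II Reject": C[1][2]}
print(C)      # [[2, 1, 1], [1, 2, 1]]
print(types)
```

Note that all quantities are obtained from the confusion matrix alone, with no cost terms, matching the evaluation setting assumed in this work.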
For example, the triangle inequality and symmetry properties can be relaxed without changing the ranking orders among classifications if their evaluation measures are applied consistently. However, the identity property is indicated only for the relation $T = Y$ (assuming $T$ is padded with zeros to make it the same size as $Y$), and does not guarantee an exact solution ($t_k = y_k$) in classifications (see Theorems 1 and 4 given later). If a violation of metric properties occurs, the NIs are referred to as measures, rather than metrics. For classification evaluations, we consider the generic properties of metrics not to be as crucial in comparisons as certain specific features. In this work, we focus on specific features that, though not mathematically fundamental, are more necessary in classification applications. To select "better" measures for objective evaluations of classifications, we propose the following three desirable features together with their heuristic justifications.

Feature 1. Monotonicity with respect to the diagonal terms of the confusion matrix.

The diagonal terms of the confusion matrix represent the exact classification numbers for all the samples. In other words, they reflect the coincidence counts between $t$ and $y$ from a similarity viewpoint. When one of these terms changes, the evaluation measure should change monotonically; otherwise, a non-monotonic measure may fail to provide a rational result for ranking classifications correctly. This feature was originally proposed for describing the strength of agreement (or similarity) when the matrix is a contingency table [32].

Feature 2. Variation with reject rate.

To improve classification performance, a reject option is often used in engineering applications [12]. Therefore, we suggest that a measure should be a scalar function of both the classification accuracy and the reject rate.
Such a measure can be evaluated based solely on a given confusion matrix from a single operating point of the classification. This differs from the AUC measures, which are based on an "Error-Reject" curve [16][24] obtained from multiple operating points.

Feature 3. Intuitively consistent costs among error types and reject types.

This feature is derived from the principle of our conventional intuitions when dealing with error types in classifications, extended here to reject types. Two specific intuitions are adopted for binary classifications. First, a misclassification or rejection from a small class causes a greater cost than one from a large class. This intuition represents a property called "within error types and reject types". Second, a misclassification produces a greater cost than a rejection from the same class; this is called the "between error and reject types" property. If a measure is able to satisfy these intuitions, we refer to its associated costs as "intuitively consistent". Exceptions to the intuitions above may exist, but we consider them very special cases.

At this stage, it is worth discussing "objectivity" in evaluations, because one may doubt the correctness of the intentions above, and of the terms "desirable" or "intuitions", in a study of objective evaluations. The three features may seem "problematic" in terms of providing a general concept of "objectivity", because no human bias should enter the objective judgment of evaluation results. The following discussion justifies the proposal of requiring desirable, or proper, features for objective measures. On the one hand, we recognize that any evaluation implies a certain degree of "subjectivity", since evaluations exist only as a result of human judgment.
For example, every selection of evaluation measures, even of objective ones, relies on possible sources of "subjectivity" from users. On the other hand, engineering applications do concern themselves with objective evaluations [29][32]. However, to the authors' best knowledge, a technical definition, or criterion, seems to be missing for determining objective or subjective measures in evaluations of classifications. To overcome possible confusion and vagueness, we set Definition 1 as a practical criterion for examining whether or not a classification evaluation holds "objectivity". If a measure satisfies this definition, it will always retain the property of "objective consistency" in evaluating the given classification results. The three "desirable" features, though based on "intuitions" with "subjectivity", do not destroy the criterion of "objectivity" in classification evaluations. Therefore, it is logically correct to discuss "desirable" features of objective measures as long as the measures satisfy Definition 1, thereby keeping the defined "objectivity".

Note that all the desirable features above are derived from our intuitions on general cases of classification evaluations. Other items may be derived for a wider examination of features. For example, Forbes [29] proposed six "constraints on proper comparative measures", namely, "statistically principled, readily interpretable, generalizable to k-class situations, not different to the special status, reflective of agreement, and insensitive to the segmentation". However, we consider the three features proposed in this work to be more crucial, especially as Feature 3 has never been considered in previous studies of classification evaluations. Although Features 2 and 3 may share a similar meaning, they are presented individually to highlight their specific concerns.
We can also call the desirable features "meta-measures", since they are defined as qualitative, high-level measures about measures. In this work, we apply the meta-measures in our investigation of information measures. Examination with respect to the meta-measures enables clarification of the causes of performance differences among the examined measures in classification evaluations. It will be helpful for users to understand the advantages and limitations of different measures, either objective or subjective ones, from a higher level of evaluation knowledge.

4. Normalized Information Measures based on Mutual Information

All NI measures applied in this work fall into one of three groups, namely, the mutual-information based, divergence based, and cross-entropy based groups. In this section, we focus on the first group. Each measure in this group is derived directly from mutual information, representing the degree of similarity between two random variables. For the purpose of objective evaluations, as suggested by Definition 1 in the previous section, we eliminate all candidate measures defined from the Renyi or Jensen entropies [36][9], since they involve a free parameter. Therefore, to avoid adding free parameters, we only apply the Shannon entropy to information measures [37]:

$$H(Y) = -\sum_{y} p(y) \log_2 p(y), \qquad (4)$$

where $Y$ is a discrete random variable with probability mass function $p(y)$. The mutual information is then defined as [37]:

$$I(T, Y) = \sum_{t} \sum_{y} p(t, y) \log_2 \frac{p(t, y)}{p(t)\, p(y)}, \qquad (5)$$

where $p(t, y)$ is the joint distribution of the two discrete random variables $T$ and $Y$, and $p(t)$ and $p(y)$ are the marginal distributions, which can be derived from:

$$p(t) = \sum_{y} p(t, y), \quad p(y) = \sum_{t} p(t, y). \qquad (6)$$

Sometimes the simplified notation $p_{ij} = p(t, y) = p(t = t_i, y = y_j)$ is used in this work.
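Eqs. (4) to (6) translate directly into code. A minimal sketch with base-2 logarithms (the joint distributions at the end are illustrative only):

```python
import math

def entropy(p):
    """Shannon entropy in bits, Eq. (4); zero-probability terms contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """Mutual information I(T, Y) in bits from a joint distribution, Eq. (5);
    marginals are obtained by summing rows and columns, Eq. (6)."""
    pt = [sum(row) for row in joint]                                   # p(t)
    py = [sum(row[j] for row in joint) for j in range(len(joint[0]))]  # p(y)
    return sum(p * math.log2(p / (pt[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

# T and Y fully dependent: one bit of shared information.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))    # 1.0
# T and Y statistically independent: zero mutual information.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```

The two endpoint cases (full dependence and independence) bracket the values that the normalized measures of Table 1 rescale into $[0, 1]$.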
Table 1 lists the possible normalized information measures within the mutual-information based group. Basically, they all make use of Eq. (5) in their calculations. The main differences are due to the normalization schemes. In applying the formulas for calculating $NI_k$, one generally does not have an exact $p(t, y)$. For this reason, we adopt an empirical joint distribution, defined below, for the calculations.

Definition 6. Empirical joint distribution and empirical marginal distributions [11].

An empirical joint distribution is defined from the frequencies of the given confusion matrix $C$ as:

$$P_e(t, y) = (P_{ij})_e = \frac{c_{ij}}{n}, \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, m+1, \qquad (7a)$$

where $n = \sum C_i$ denotes the total number of samples in the classifications, and the subscript "e" denotes empirical terms. The empirical marginal distributions are:

$$P_e(t = t_i) = \frac{C_i}{n}, \quad i = 1, 2, \ldots, m, \qquad (7b)$$

$$P_e(y = y_j) = \frac{1}{n} \sum_{i=1}^{m} c_{ij}, \quad j = 1, 2, \ldots, m+1. \qquad (7c)$$

Definition 7. Empirical mutual information [11].

The empirical mutual information is given by:

$$I_e(T, Y) = \sum_{t} \sum_{y} P_e(t, y) \log_2 \frac{P_e(t, y)}{P_e(t)\, P_e(y)} = \sum_{i=1}^{m} \sum_{j=1}^{m+1} \frac{c_{ij}}{n} \log_2 \frac{n\, c_{ij}}{C_i \sum_{k=1}^{m} c_{kj}}. \qquad (8)$$

Definitions 6 and 7 provide users with a direct means of applying information measures to the given data of the confusion matrix. For simplicity of analysis and discussion, we adopt the empirical distributions, i.e., $p_{ij} \approx P_{ij}$, for calculating all NIs and deriving the theorems, but remove the associated subscript "e". Note that the notation $NI_2$ in Table 1 differs from the others in its calculation of mutual information: $I_M(T, Y)$ is defined as a "modified mutual information", computed on the intersection of $T$ and $Y$. Hence, when using Eq.
(8), the intersection requires that $I_M(T, Y)$ incorporate a summation of $j$ over 1 to $m$, instead of $m + 1$. This definition is beyond mathematical rigor, but $NI_2$ has the same metric properties as $NI_1$. It was originally proposed to overcome the problem of unchanging NI values when rejections are made within only one class (referring to M9-M10 in Table 3, [11]). The following three theorems are derived for all NIs in this group.

Theorem 1. For all NI measures in Table 1, when $NI(T, Y) = 1$, a classification without a reject class may correspond to the case of either an exact classification ($y_k = t_k$) or a specific misclassification ($y_k \ne t_k$). The specific misclassification can be fully removed by simply exchanging labels in the confusion matrix.

Proof. If $NI(T, Y) = 1$, we can obtain the following conditions from Eq. (8) for classifications without a reject class:

$$p_{ij} = p(t = t_i) \approx P_e(t = t_i) = \frac{C_i}{n} \quad \text{and} \quad p_{kj} = 0, \quad i, j, k = 1, 2, \ldots, m, \quad k \ne i. \qquad (9)$$

These conditions describe the specific confusion matrix in which only one non-zero term appears in each column (with the exception of the last, $(m+1)$th, column). When $j = i$, $C$ is a diagonal matrix representing an exact classification ($y_k = t_k$). Otherwise, a specific misclassification exists, for which a diagonal matrix can be obtained by exchanging labels in the confusion matrix (referring to M11 in Table 4, [11]). ♦

Remark 4. Theorem 1 states that $NI(T, Y) = 1$ is a necessary, but not sufficient, condition for an exact classification.

Theorem 2. For abstaining classification problems, when $NI(T, Y) = 0$, the classifier generally reflects a misclassification. One special case is that in which all samples are assigned to a single one of the $m$ classes, or to the reject class.

Proof. For the NIs defined in Table 1, $NI(T, Y) = 0$ iff $I(T, Y) = 0$.
According to information theory [37], the following condition holds for the given marginal distributions (or the empirical ones if a confusion matrix is used):

  I(T, Y) = 0  iff  p(t, y) = p(t)\, p(y).   (10)

The conditional part of Eq. (10) can be rewritten in the form p_{ij} = p(t = t_i)\, p(y = y_j). From the constraints in (3), p(t = t_i) > 0 (i = 1, 2, \dots, m) can be obtained. For classification solutions, there must exist at least one term with p(y = y_j) > 0 (j = 1, 2, \dots, m+1). Therefore, at least one non-zero term p_{ij} > 0 with i ≠ j must be obtained. This non-zero term corresponds to an off-diagonal term in the confusion matrix, which indicates that a misclassification has occurred. When all samples have been identified as one of the classes (referring to M2 in Table 4, [11]), NI = 0 is obtained. ♦

Remark 5. Eq. (10) gives the statistical reason for zero mutual information: the two random variables are "statistically independent". Theorem 2 demonstrates an intrinsic reason for local minima in NIs.

Theorem 3. The NI measures defined by the Shannon entropy generally do not exhibit a monotonic property with respect to the diagonal terms of a confusion matrix.

Proof. Based on [11], we arrive at simpler conditions for the local minima of I(T, Y) for the given confusion matrix:

  C = \begin{pmatrix} \ddots & 0 & 0 & \\ 0 & c_{i,i} & c_{i,i+1} & 0 \\ 0 & c_{i+1,i} & c_{i+1,i+1} & 0 \\ & 0 & 0 & \ddots \end{pmatrix},  if  c_{i,i}\, c_{i+1,i+1} = c_{i,i+1}\, c_{i+1,i}.   (11)

The local minima are obtained because the four given non-zero terms in Eq. (11) produce zero (or the minimum) contribution to I(T, Y). Suppose a generic form is given for NI(T, Y) = g(I(T, Y)), where g(·) is a normalization function.
From the chain rule of derivatives, it can be seen that the conditions for reaching the local minima do not change. ♦

Remark 6. The non-monotonic property of the information measures implies that these measures may suffer from an intrinsic problem of local minima in classification rankings (referring to M19-M20 in Table 4, [11]). Or, according to Feature 1 of the meta-measures, a rational result for the classification evaluations may not be obtained, owing to the non-monotonic property of the measures. This shortcoming has not been theoretically derived in previous studies ([28][29][32]).

5. Normalized Information Measures based on Information Divergence

In this section, we propose normalized information measures based on the definition of information divergence. In Table 2, we summarize the commonly used divergence measures, denoted D_k(T, Y), which represent the dissimilarity between the two random variables T and Y. In Sections 5 and 6, we apply the following notations for defining marginal distributions:

  p_t(z) = p_t(t = z) = p(t),  and  p_y(z) = p_y(y = z) = p(y),   (12)

where z is a possible scalar value that t or y can take. For a consistent comparison with the previous normalized information measures, we adopt the following transformation on D_k [31]:

  NI_k = \exp(-D_k).   (13)

This transformation provides both inverse and normalization functionalities. It does not introduce any extra parameters, and offers a high degree of simplicity, as in the derivation for examining the local minima. Two more theorems are derived by following an analysis similar to that in the previous section.

Theorem 4. For all NI measures in Table 2, when NI(T, Y) = 1, the classifier corresponds to the case of either an exact classification or a specific misclassification.
Generally, the misclassification in the latter case cannot be removed by switching labels in the confusion matrix.

Proof. When p_y(z) = p_t(z), it is always the case that NI(T, Y) = 1. However, the general conditions for p_y(z) = p_t(z) are as follows:

  p_y(y = z_i) = p_t(t = z_i),  or  \sum_j p_{ji} = \sum_j p_{ij},  i = 1, 2, \dots, m.   (14)

Eq. (14) implies two cases of classifications for D_k(T, Y) = 0, k = 10, \dots, 20. One of these corresponds to an exact classification (y_k = t_k), while the other is the result of a specific misclassification that exhibits y_k ≠ t_k but p_y(z) = p_t(z). In the latter case, switching labels in the confusion matrix to remove the misclassification generally destroys the relation p_y(z) = p_t(z) at the same time. Since this relation is a necessary condition for a perfect classification, the misclassification cannot be removed through a label-switching operation. ♦

Remark 7. Theorem 4 suggests that caution should be applied in interpreting classification evaluations when NI(T, Y) = 1. The maximum of the NIs from the information divergence measures only indicates the equivalence of the marginal probabilities, p_y(z) = p_t(z), which does not always represent an exact classification (y_k = t_k). Theorem 4 thus reveals an intrinsic problem in using such an NI as a measure for similarity evaluations between two datasets, such as in image registration.

Theorem 5. The NI measures based on information divergence generally do not exhibit a monotonic property with respect to the diagonal terms of a confusion matrix.

Proof. The theorem can be proved by examining the existence of multiple maxima for the NI measures based on information divergence. Here we use a binary classification as an example.
The local minima of D_k are obtained when the following conditions hold for a confusion matrix:

  C = \begin{pmatrix} C_1 - d_1 & d_1 & 0 \\ d_2 & C_2 - d_2 & 0 \end{pmatrix}  and  d_1 = d_2,   (15)

where d_1 and d_2 are integers (> 0) counting misclassified samples. The confusion matrix in Eq. (15) produces zero divergence D_k and, therefore, NI_k = 1. However, changing to d_1 ≠ d_2 always results in NI_k < 1. ♦

Remark 8. Theorem 5 indicates another shortcoming of the NIs in the information divergence group from the viewpoint of monotonicity. The reason is once again attributed to the use of marginal distributions in the calculation of divergence. This shortcoming has not been reported in previous investigations ([31][35]).

6. Normalized Information Measures based on Cross-Entropy

In this section, we propose normalized information measures based on cross-entropy, which is defined for discrete random variables as [10]:

  H(T; Y) = -\sum_z p_t(z) \log_2 p_y(z),  or  H(Y; T) = -\sum_z p_y(z) \log_2 p_t(z).   (16)

Note that H(T; Y) differs from the joint entropy H(T, Y) with respect to both notation and definition; the latter is given as [37]:

  H(T, Y) = -\sum_t \sum_y p(t, y) \log_2 p(t, y).   (17)

In fact, from Eq. (16), one can derive the relation between KL divergence (see Table 2) and cross-entropy:

  H(T; Y) = H(T) + KL(T, Y),  or  H(Y; T) = H(Y) + KL(Y, T).   (18)

If H(T) is considered a constant in classification, since the target dataset is generally known and fixed, we can observe from Eq. (18) that cross-entropy shares a similar meaning with KL divergence in representing the dissimilarity between T and Y. From the conditions H ≥ 0 and KL ≥ 0, we are able to realize the normalization of cross-entropy shown in Table 3. Following discussions similar to those in the previous section, we can derive that all information measures listed in Table 3 also satisfy Theorems 4 and 5.

7.
Numerical Examples and Discussions

This section presents several numerical examples together with associated discussions. All calculations for the numerical examples were done using the open-source software Scilab (http://www.scilab.org) and a specific toolbox, freely available as the file "confmatrix2ni.zip" at http://www.openpr.org.cn. The detailed implementation of this toolbox is described in [38]. Table 4 lists six numerical examples of binary classification problems according to the specific scenarios of their confusion matrices. We adopt the notations from [39] for the terms "correct recognition rate (CR)", "error rate (E)", and "reject rate (Rej)" and their relation:

  CR + E + Rej = 1.   (19)

In addition, we define the "accuracy rate (A)" as

  A = \frac{CR}{CR + E}.   (20)

The first four classifications (or models), M1-M4, are provided to show the specific differences with respect to error types and reject types. In this work, we are not concerned with the classifiers applied (say, neural networks or support vector machines), but only with the resulting evaluations from these classifiers. In real applications, it is common to encounter the ranking of classification results as in M1 to M4. The first two classifications, M1 and M2, share the same values of the correct recognition and accuracy rates (CR = A = 99%). The other two classifications, M3 and M4, have the same reject rates (Rej = 1%) and correct recognition rates (CR = 99%); the accuracy rates for M3 and M4 are also the same (A = 100%). This definition is consistent with the conventions in the study of "Accuracy-Reject" curves [16]. Neglecting the specific application backgrounds, users generally form a ranking order of the four classifications so that the "best" one can be selected. The data from other conventional measures, such as Precision, Recall, and F1, are also given in Table 4.
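As a quick illustration of Eqs. (19) and (20), the four rates can be computed directly from an augmented m x (m+1) confusion matrix whose last column counts rejections. This is a minimal sketch, not the interface of the Scilab toolbox [38]; the matrix in the example is a hypothetical one constructed in the spirit of the M3 scenario (one rejected sample in the small class).

```python
def rates(C):
    # C: m x (m+1) confusion matrix; rows are true classes,
    # the last column counts rejected samples.
    n = sum(sum(row) for row in C)
    m = len(C)
    CR = sum(C[i][i] for i in range(m)) / n   # correct recognition rate
    Rej = sum(row[m] for row in C) / n        # reject rate
    E = 1.0 - CR - Rej                        # error rate, from Eq. (19)
    A = CR / (CR + E)                         # accuracy rate, Eq. (20)
    return CR, E, Rej, A

# Hypothetical M3-like matrix: 100 samples, one rejection in the small class.
CR, E, Rej, A = rates([[90, 0, 0],
                       [0, 9, 1]])
```

For this matrix the sketch yields CR = 0.99, E = 0, Rej = 0.01, and A = 1, matching the rates stated above for the M3/M4 scenario.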
Without using extra knowledge about the costs of different error types or reject types, the conventional performance measures are unable to rank the four classifications M1-M4 properly. According to the intuitions of Feature 3 proposed in Section 3, one can obtain two sets of ranking orders for the four classifications M1 to M4, in the forms:

  ℜ(M2) > ℜ(M1),  ℜ(M4) > ℜ(M3),   (21-a)
  ℜ(M4) > ℜ(M2),  ℜ(M3) > ℜ(M1),   (21-b)

where ℜ(·) denotes a ranking operator, so that ℜ(M_i) > ℜ(M_j) expresses that M_i is better than M_j in the ranking. From Eq. (21), one is unable to tell the ranking order between M2 and M3. For a fast comparison, a specific letter is assigned to the ranking order of each model in Table 4 based on Eq. (21):

  ℜ(M4) = A,  ℜ(M3) = B,  ℜ(M2) = B,  ℜ(M1) = C.   (22)

The top rank "A" indicates the "best" classification (M4 in this case) of the four models. Table 4 does not distinguish the ranking order between M2 and M3. However, numerical investigations using information measures will provide the ranking order from the given data. The other two models, M5 and M6, are specifically designed for examining information measures with respect to Theorems 3 and 5 (or Feature 1), respectively. Tables 5 and 6 present the results of the information measures for M1-M6, where the ranking orders among M1-M4 are based on the calculated NI values with the given digits. The following observations can be made from the solutions to the examples.

1) When a normalization function includes the term H(Y), as in the mutual-information group, the associated NI exhibits the desirable feature of varying with the reject rate. NI_2 is effective for this feature even though it only uses H(T) for its normalization.
The effectiveness is attributed to the definition of I_M(T, Y), which calculates mutual information based on the intersection of T and Y.

2) The results for M5 and M6 confirm, respectively, Theorem 3 for local minima and Theorem 5 for maxima of the NIs. The existence of multiple extrema indicates the non-monotonic property with respect to the diagonal terms of the confusion matrix, thereby exhibiting an intrinsic shortcoming of the information measures.

3) For classifications M1 to M4, the meta-measure of Feature 3 suggests the ranking orders shown in Eqs. (21) and (22). However, of all the measures in the three groups, only NI_2 shows consistency with the intuitions on the given examples (Tables 5 and 6). This result indicates that Feature 3 is a difficult property for most information measures to satisfy.

4) None of the performance or information measures investigated in this work fully satisfies the meta-measures. Examining data distinguishability for M1 through M4, we consider the information measures from the mutual-information group more appropriate than those of the other groups (say, NI_12 and NI_22 do not show significant distinguishability, or value differences, over the four models).

The fourth observation supports the proposal of meta-measures for a higher level of classification evaluations. The meta-measures provide users with a simple guideline for selecting "proper" measures according to the specific concerns of their applications. For example, the performance measures (such as A, E, CR, F1, or AUC) satisfy Feature 1, but fail to distinguish error types and reject types directly in an objective evaluation. When Feature 2 or 3 is a main concern, the information measures prove more effective, though not perfect. Of all the information measures investigated, NI_2 is shown to be the "best" for the given examples in terms of Feature 3.
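To make the role of I_M concrete, NI_2 can be computed from a confusion matrix as follows. This is a minimal sketch based on Eq. (8) and Table 1, with j summed over 1..m only (the intersection of T and Y, excluding the reject column), and is not the toolbox implementation of [38].

```python
import math

def ni2(C):
    # NI_2 = I_M(T, Y) / H(T): modified mutual information, computed from
    # the empirical distributions of Eq. (8) but with j restricted to the
    # first m columns, normalized by the target entropy H(T).
    n = sum(sum(row) for row in C)
    m = len(C)
    row_tot = [sum(row) for row in C]                          # C_i
    col_tot = [sum(C[i][j] for i in range(m)) for j in range(m)]
    I_M = 0.0
    for i in range(m):
        for j in range(m):
            if C[i][j] > 0:
                I_M += (C[i][j] / n) * math.log2(
                    n * C[i][j] / (row_tot[i] * col_tot[j]))
    H_T = -sum((Ci / n) * math.log2(Ci / n) for Ci in row_tot if Ci > 0)
    return I_M / H_T
```

With hypothetical counts, an exact classification such as [[50, 0, 0], [0, 50, 0]] returns NI_2 = 1, and a one-sample error in the small class ([[70, 0, 0], [1, 29, 0]], an M1-type matrix) scores lower than the same-size error in the large class ([[69, 1, 0], [0, 30, 0]], an M2-type matrix), as relation (21-a) requires.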
Therefore, more detailed studies, both theoretical and numerical, were made of this promising measure. The theoretical properties of this measure are derived in Appendix A. While Theorem A1 confirms that NI_2 satisfies Feature 3 around the exact classifications, Theorem A2 indicates that this measure is able to adjust the ranking order between a misclassification of a large class and a rejection of a small class. Table 7 shows two sets of confusion matrices that are similar to M1-M4 in Table 4. One can observe the changes in ranking orders among them. These changes numerically confirm Theorem A2 and its critical point, or cross-over point (Ω = C_1/n ≈ 0.942), for the given data.

Further investigations were carried out on three-class problems. Although some NIs could be removed directly based on their poor performance with respect to the meta-measures (such as NI_1 and NI_9 on Feature 2), they were retained to demonstrate pros and cons in the applications. At this stage, we extend the concepts of error types and reject types to multiple classes. Nine examples are specifically designed in Table 8. The ranking order of each model, derived from the intuitions of Feature 3, is shown in Table 8. From Tables 9 and 10, it is interesting to see that NI_2 remains the most appropriate measure for classification evaluations. Using this measure, we can select the "best" and "worst" classifications consistently with our intuition. All other measures perform below our expectations in distinguishing error types and reject types properly.

The numerical study supports the viewpoint that no universally superior measure exists. For example, in comparison with the information measure NI_2, the conventional accuracy measure satisfies Feature 1, but does not qualify for Feature 3.
Thus, any measure, whether performance-based or information-based, should be designed and evaluated within the context of the specific applications. It is evident that the desirable features of the specific applications become more crucial (or "proper") for evaluation measures than some generic mathematical properties. For example, information measures (such as KL divergence) that may not satisfy the properties of a metric (say, symmetry) are able to process classification evaluations that include a reject option; they provide more applicable power than the conventional performance measures in abstaining classifications. However, we still need a complete picture of the information measures with respect to their advantages as well as their limitations. The examples in Tables 4, 7, and 8 only present limited scenarios of variations in confusion matrices. Using the open-source toolbox from [38], one is able to test more scenarios in numerical investigations.

8. Summary

In this work, we investigated objective evaluations of classifications by introducing normalized information measures. We reviewed the related works and discussed objectivity and its formal definition in evaluations. Objective evaluations may be required under different application backgrounds. In classifications, for example, exact knowledge of misclassification costs is sometimes unavailable in evaluations. Moreover, cases of ignorance regarding reject costs appear even more often in scenarios of abstaining classifications. In these cases, although subjective evaluations can be applied, the user-given data for the unknown abstention costs will lead to a much higher degree of uncertainty or inconsistency. We believe that an objective evaluation can be a suitable, as well as a complementary, approach to subjective evaluations.
In some situations, an objective evaluation is considered useful even when subjective evaluations are reasonable for the applications. The results from both objective and subjective evaluations give users an overall picture of the quality of the classification results.

Considering that abstaining classifications are becoming more popular, we focused on the distinctions between error types and reject types within objective evaluations of classifications. First, we proposed three meta-measures for assessing classifications, which seem more relevant and proper than the properties of metrics in the context of classification applications. The meta-measures provide users with useful guidelines for a quick selection of candidate measures. Second, we tried to systematically enrich the classification evaluation bank by including commonly used information measures. Contrary to the conventional performance measures that apply empirical formulas, the information measures are theoretically more sound for objective evaluations of classifications. The key advantage of these measures over the conventional ones is their ability to handle multi-class classification evaluations with a reject option. Third, we revealed theoretically the intrinsic shortcomings of the information measures. These have not been formally reported before in studies of image registration, feature selection, or similarity ranking. The discovery of these shortcomings is very important for users to interpret their results correctly when applying those measures.

Based on the principle of the "No Free Lunch Theorem" [15], we recognize that there are no "universally superior" measures [5]. It is not our aim to replace the conventional performance measures, but to explore information measures systematically in classification evaluations. The theoretical study demonstrates the strengths and weaknesses of the information measures.
Numerical investigations, conducted on binary and three-class classifications, confirmed that objective evaluation is not an easy topic in the study of machine learning. One of the most challenging tasks will be the exploration of novel measures that satisfy all meta-measures as well as the metric properties in objective evaluations of classifications. It is also necessary to define the "ranking order" intuitions among error types and reject types in generic classifications, which will form the basis of quantitative meta-measures. However, this task becomes more difficult when classifications go beyond two classes.

Acknowledgment

This work is supported in part by the Natural Science Foundation of China (#61075051).

Appendix A. Theorems and Sensitivity Functions of NI_2 for Binary Classifications

Theorem A1: For a binary classification defined by:

  C = \begin{pmatrix} TN & FP & RN \\ FN & TP & RP \end{pmatrix},  and   (A1-a)

  C_1 = TN + FP + RN,  C_2 = FN + TP + RP,  C_1 + C_2 = n,   (A1-b)

NI_2 satisfies Feature 3 with respect to error types and reject types around the exact classifications. Specifically, for the four confusion matrices below:

  M_1 = \begin{pmatrix} C_1 & 0 & 0 \\ d & C_2 - d & 0 \end{pmatrix},  M_2 = \begin{pmatrix} C_1 - d & d & 0 \\ 0 & C_2 & 0 \end{pmatrix},
  M_3 = \begin{pmatrix} C_1 & 0 & 0 \\ 0 & C_2 - d & d \end{pmatrix},  M_4 = \begin{pmatrix} C_1 - d & 0 & d \\ 0 & C_2 & 0 \end{pmatrix},   (A2)

the following relations hold:

  NI_2(M_1) < NI_2(M_2)  and  NI_2(M_3) < NI_2(M_4),   (A3-a)
  NI_2(M_1) < NI_2(M_3)  and  NI_2(M_2) < NI_2(M_4),   (A3-b)

where

  C_1 > C_2 > d > 0.   (A3-c)

Proof. For a binary classification, NI_2 is defined by the modified mutual information:

  NI_2 = \frac{I_M(T, Y)}{H(T)},  and

  I_M(T, Y) = \frac{TN}{n} \log_2 \frac{n\,TN}{C_1 (TN + FN)} + \frac{FP}{n} \log_2 \frac{n\,FP}{C_1 (TP + FP)} + \frac{FN}{n} \log_2 \frac{n\,FN}{C_2 (FN + TN)} + \frac{TP}{n} \log_2 \frac{n\,TP}{C_2 (FP + TP)}.   (A4)

Let M_0 be the confusion matrix corresponding to the exact classification:

  M_0 = \begin{pmatrix} C_1 & 0 & 0 \\ 0 & C_2 & 0 \end{pmatrix}.
  (A5)

Based on the definition of I_M in (A4), one can calculate the mutual-information differences between two models. Taking M_0 as the baseline, we obtain the analytical results below for the four models:

  ΔI_10 = I_M(M_1) - I_M(M_0) = \frac{1}{n} \left( C_1 \log_2 \frac{C_1}{C_1 + d} + d \log_2 \frac{d}{C_1 + d} \right),   (A6-a)
  ΔI_20 = I_M(M_2) - I_M(M_0) = \frac{1}{n} \left( C_2 \log_2 \frac{C_2}{C_2 + d} + d \log_2 \frac{d}{C_2 + d} \right),   (A6-b)
  ΔI_30 = I_M(M_3) - I_M(M_0) = \frac{d}{n} \log_2 \frac{C_2}{n},   (A6-c)
  ΔI_40 = I_M(M_4) - I_M(M_0) = \frac{d}{n} \log_2 \frac{C_1}{n}.   (A6-d)

Under the given assumption C_1 > C_2 > d > 0, all the ΔIs above are negative, so their absolute values represent the absolute costs in the classifications. One can directly prove |ΔI_30| > |ΔI_40| from (A6-c) and (A6-d). The procedure for the proof of |ΔI_10| > |ΔI_20| is given below. First, one needs to confirm that the following two functions are strictly decreasing (x_1 < x_2 implies g(x_1) > g(x_2)):

  g_1(x) = \left( \frac{x}{x + d} \right)^{x}  and  g_2(x) = \left( \frac{d}{x + d} \right)^{d},  for x > 0, d > 0.   (A7-a)

Then, from the monotonically decreasing property in (A7-a), one can derive the following relations:

  C_1 > C_2 \Rightarrow \left( \frac{C_1}{C_1 + d} \right)^{C_1} < \left( \frac{C_2}{C_2 + d} \right)^{C_2} < 1  and  \left( \frac{d}{C_1 + d} \right)^{d} < \left( \frac{d}{C_2 + d} \right)^{d} < 1
  \Rightarrow \frac{1}{n} \left| C_2 \log_2 \frac{C_2}{C_2 + d} + d \log_2 \frac{d}{C_2 + d} \right| < \frac{1}{n} \left| C_1 \log_2 \frac{C_1}{C_1 + d} + d \log_2 \frac{d}{C_1 + d} \right|
  \Rightarrow |ΔI_20| < |ΔI_10|.   (A7-b)

The relations in (A3-a) are thus achieved for NI_2, because its normalization term, H(T), is a constant for the given C_1 and C_2. This confirms the satisfaction of Feature 3 within the error types and within the reject types around the exact classifications, respectively. Next comes the proof of relation (A3-b), which states that a misclassification suffers a higher cost than a rejection within the same class. Feature 3 considers this relation a basic property of classifications for the comparison between error and reject types.
The procedure for the proof is:

  C_1 > C_2 \Rightarrow C_1 C_2 + C_1 d > (C_1 + C_2) d = nd \Rightarrow 1 > \frac{C_1}{n} > \frac{d}{C_2 + d}
  \Rightarrow \left| \log_2 \frac{C_1}{n} \right| < \left| \log_2 \frac{d}{C_2 + d} \right|
  \Rightarrow \frac{1}{n} \left| d \log_2 \frac{C_1}{n} \right| < \frac{1}{n} \left| d \log_2 \frac{d}{C_2 + d} \right| < \frac{1}{n} \left| C_2 \log_2 \frac{C_2}{C_2 + d} + d \log_2 \frac{d}{C_2 + d} \right|
  \Rightarrow |ΔI_40| < |ΔI_20|,   (A8-a)

  C_1 + d < n \Rightarrow C_1 (C_1 + d) + nd < C_1 n + nd \Rightarrow \frac{C_1 (C_1 + d) + nd}{n (C_1 + d)} < 1 \Rightarrow \frac{C_1}{n} + \frac{d}{C_1 + d} < 1
  \Rightarrow \frac{d}{C_1 + d} < \frac{C_2}{n} < 1 \Rightarrow \left| \log_2 \frac{C_2}{n} \right| < \left| \log_2 \frac{d}{C_1 + d} \right|
  \Rightarrow \frac{1}{n} \left| d \log_2 \frac{C_2}{n} \right| < \frac{1}{n} \left| C_1 \log_2 \frac{C_1}{C_1 + d} + d \log_2 \frac{d}{C_1 + d} \right|
  \Rightarrow |ΔI_30| < |ΔI_10|.   (A8-b)
♦

Theorem A2: Given the conditions (A1)-(A2) and C_1 > C_2 > d > 0, NI_2 satisfies the following relations:

  NI_2(M_4) > NI_2(M_3) > NI_2(M_2) > NI_2(M_1)  for 0.5 < p_1 < Ω ≤ 1,   (A9-a)
  NI_2(M_4) > NI_2(M_2) > NI_2(M_3) > NI_2(M_1)  for 0.5 < Ω < p_1 ≤ 1,   (A9-b)

where we set p_1 = C_1/n, and Ω is an upper boundary for the validity of (A9-a).

Proof. The first relation states that the ranking order in (A9-a) is valid only for a certain range of p_1. The lower boundary results from the assumption that C_1 is the large class. The upper boundary, Ω, is determined by the cross-over point between the functions of (A6-b) and (A6-c). For a better understanding of the relations (A9), we present the plots of "ΔI vs. p_1" for n = 100 and d = 1 in Fig. A1.

[Figure A1: Plots of "ΔI vs. p_1 (%)" for n = 100 and d = 1. Black solid: ΔI_10; black dash: ΔI_20; blue solid: ΔI_30; blue dash: ΔI_40.]
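The cross-over behavior plotted in Fig. A1 can be reproduced numerically from Eqs. (A6-a)-(A6-d). The sketch below treats C_1 and C_2 as continuous quantities for bisection, although they are integer counts in the paper, and locates the single crossing of ΔI_20 and ΔI_30 for n = 100 and d = 1.

```python
import math

def delta_I(p1, n=100.0, d=1.0):
    # Eqs. (A6-a)-(A6-d): Delta I_k0 = I_M(M_k) - I_M(M_0), k = 1..4,
    # with C1 = p1 * n and C2 = n - C1.
    C1 = p1 * n
    C2 = n - C1
    dI10 = (C1 * math.log2(C1 / (C1 + d)) + d * math.log2(d / (C1 + d))) / n
    dI20 = (C2 * math.log2(C2 / (C2 + d)) + d * math.log2(d / (C2 + d))) / n
    dI30 = (d / n) * math.log2(C2 / n)
    dI40 = (d / n) * math.log2(C1 / n)
    return dI10, dI20, dI30, dI40

def crossover(n=100.0, d=1.0, lo=0.51, hi=0.99, iters=60):
    # Bisection on f(p1) = Delta I_20 - Delta I_30 (cf. Eq. (A10));
    # Fig. A1 shows a single root in the range p1 > 0.5.
    def f(p1):
        dI = delta_I(p1, n, d)
        return dI[1] - dI[2]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For n = 100 and d = 1, the crossing falls near p_1 ≈ 0.94, and below it all four ΔIs are negative with |ΔI_40| < |ΔI_30| < |ΔI_20| < |ΔI_10| (e.g., at p_1 = 0.7), which is the general ranking order of (A9-a).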
To examine the validity range of (A9-a), one needs to calculate the cross-over point by solving the equation:

  f = ΔI_20 - ΔI_30 = \frac{1}{n} \left( C_2 \log_2 \frac{C_2}{C_2 + d} + d \log_2 \frac{d\,n}{C_2 (C_2 + d)} \right) = 0.   (A10)

There exists no closed-form solution for Ω. Based on the monotonicity of the related functions and the relations in (A3), one is able to confirm the conditions in (A9-a) and (A9-b), respectively. Fig. A1 shows numerically that only a single cross-over point appears in the range p_1 > 0.5 (or C_1 > C_2). ♦

Remark A1: We denote by Ω(n, d) the cross-over point obtained from f, with the two independent variables n and d. The value of Ω increases with n, but decreases with d; a numerical solution for Ω is required. The physical interpretation of Ω is a critical point at which a rejection within the small class has the same cost as a misclassification within the large class. This situation generally does not occur except in classifications with largely skewed classes (say, C_1 >> C_2). Therefore, we call the ranking order in (A9-a) the general ranking order, and that in (A9-b) the largely-skewed-class ranking order.

Sensitivity functions: The sensitivity functions are given in the conventional forms for delivering approximation analysis of I_M:

  \frac{\partial I_M}{\partial TN} = \frac{1}{n} \left[ \log_2 \frac{n}{C_1} + \left( \log_2 \frac{TN}{TN + FN} \right) \mathrm{sgn}(TN) \right],   (A11-a)
  \frac{\partial I_M}{\partial TP} = \frac{1}{n} \left[ \log_2 \frac{n}{C_2} + \left( \log_2 \frac{TP}{TP + FP} \right) \mathrm{sgn}(TP) \right],   (A11-b)
  \frac{\partial I_M}{\partial FN} = \frac{1}{n} \left[ \log_2 \frac{n}{C_2} + \left( \log_2 \frac{FN}{FN + TN} \right) \mathrm{sgn}(FN) \right],   (A11-c)
  \frac{\partial I_M}{\partial FP} = \frac{1}{n} \left[ \log_2 \frac{n}{C_1} + \left( \log_2 \frac{FP}{FP + TP} \right) \mathrm{sgn}(FP) \right],   (A11-d)
  \frac{\partial I_M}{\partial RN} = -\frac{\partial I_M}{\partial TN} - \frac{\partial I_M}{\partial FP},   (A11-e)
  \frac{\partial I_M}{\partial RP} = -\frac{\partial I_M}{\partial FN} - \frac{\partial I_M}{\partial TP},   (A11-f)

where sgn(·) is a sign function enforcing the definition H(0) = 0. Only four independent variables describe the sensitivity functions, because of the two constraints in (A1-b).
Hence, a chain rule is applied in deriving the functions (A11-e) and (A11-f). ♦

Remark A2: Using Eq. (A11), we failed to reach conclusions as reasonable as those in Theorem A1, because the first-order differentials may not be sufficient for the analysis around the exact classifications. For example, we obtained the results:

  I(M_1) - I(M_0) ≈ (TP_1 - TP_0) \frac{\partial I_M(M_0)}{\partial TP} + (FN_1 - FN_0) \frac{\partial I_M(M_0)}{\partial FN} = -\frac{d}{n} \log_2 \frac{n}{C_2} + \frac{d}{n} \log_2 \frac{n}{C_2} = 0,   (A12-a)
  I(M_2) - I(M_0) ≈ (TN_1 - TN_0) \frac{\partial I_M(M_0)}{\partial TN} + (FP_1 - FP_0) \frac{\partial I_M(M_0)}{\partial FP} = -\frac{d}{n} \log_2 \frac{n}{C_1} + \frac{d}{n} \log_2 \frac{n}{C_1} = 0.   (A12-b)

This observation suggests that one needs to be cautious when using sensitivity functions for approximation analysis of I_M (or NI_2).

References

[1] C. X. Ling, J. Huang, H. Zhang, AUC: a statistically consistent and more discriminating measure than accuracy, in: the 18th International Joint Conference on Artificial Intelligence (IJCAI-2003), 2003, pp. 519-526.
[2] M. M. S. Beg, A subjective measure of web search quality, Information Sciences 169 (2005) 365-381.
[3] N. Japkowicz, Why question machine learning evaluation methods?, in: AAAI-06 Evaluation Methods for Machine Learning Workshop, 2006, pp. 6-11.
[4] T. Pietraszek, Classification of intrusion detection alerts using abstaining classifiers, Intelligent Data Analysis 11 (2007) 293-316.
[5] N. Lavesson, P. Davidsson, Analysis of multi criteria methods for algorithm and classifier evaluation, in: the 24th Annual Workshop of the Swedish, 2007.
[6] S. Vanderlooy, E. Hüllermeier, A critical analysis of variants of the AUC, Machine Learning 72 (3) (2008) 247-262.
[7] D. J. Hand, Measuring classifier performance: a coherent alternative to the area under the ROC curve, Machine Learning 77 (1) (2009) 103-123.
[8] Y. Y. Yao, S. K. M. Wong, C. J.
Butz, On information-theoretic measures of attribute importance, in: PAKDD, Beijing, China, 1999, pp. 133-137.
[9] J. Principe, D. Xu, Q. Zhao, J. Fisher, Learning from examples with information-theoretic criteria, J. VLSI Signal Processing 26 (2000) 61-77.
[10] C. M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.
[11] B. G. Hu, Y. Wang, Evaluation criteria based on mutual information for classifications including rejected class, Acta Automatica Sinica 34 (2008) 1396-1403.
[12] M.-R. Temanni, S. A. Nadeem, D. Berrar, J. D. Zucker, Optimizing abstaining classifiers using ROC analysis, in: CAMDA, 2007.
[13] P. Domingos, MetaCost: A general method for making classifiers cost-sensitive, in: the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 155-164.
[14] C. Elkan, The foundations of cost-sensitive learning, in: the 17th International Joint Conference on Artificial Intelligence (IJCAI-01), 2001, pp. 973-978.
[15] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification (2nd ed.), John Wiley, NY, 2001.
[16] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, Cambridge, 2003.
[17] C. Ferri, J. Hernández-Orallo, R. Modroiu, An experimental comparison of performance measures for classification, Pattern Recognition Letters 30 (2009) 27-38.
[18] N. Japkowicz, M. Shah, Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, 2011.
[19] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006) 861-874.
[20] C. Drummond, R. C. Holte, Cost curves: An improved method for visualizing classifier performance, Machine Learning 65 (1) (2006) 95-130.
[21] A. Andersson, P. Davidsson, J. Linden, Measure-based classifier performance evaluation, Pattern Recognition Letters 20 (1999) 1165-1173.
[22] C.
de Stefano, C. Sansone, M. Vento, To reject or not to reject: That is the question - an answer in case of neural classifiers, IEEE Trans. on Systems, Man and Cybernetics 30 (2000) 84-94.
[23] T. Landgrebe, D. M. J. Tax, P. Paclik, R. P. W. Duin, The interaction between classification and reject performance for distance-based reject-option classifiers, Pattern Recognition Letters 27 (2006) 908-917.
[24] G. Iannello, G. Percannella, C. Sansone, P. Soda, On the use of classification reliability for improving performance of the one-per-class decomposition method, Data Knowl. Eng. 68 (2009) 1398-1410.
[25] S. Vanderlooy, I. Sprinkhuizen-Kuyper, E. Smirnov, J. van den Herik, The ROC isometrics approach to construct reliable classifiers, Intelligent Data Analysis 13 (2009) 3-37.
[26] T. Kvålseth, Entropy and correlation: Some comments, IEEE Transactions on Systems, Man, and Cybernetics 17 (1987) 517-519.
[27] T. D. Wickens, Multiway Contingency Tables Analysis for the Social Sciences, Lawrence Erlbaum, Hillsdale, New Jersey, 1989.
[28] J. T. Finn, Use of the average mutual information index in evaluating classification error and consistency, Inter. J. of Geographical Information Systems 7 (1993) 349-366.
[29] A. Forbes, Classification-algorithm evaluation: Five performance measures based on confusion matrices, J. Clinical Monitoring and Computing 11 (1995) 189-206.
[30] I. Kononenko, I. Bratko, Information-based evaluation criterion for classifier's performance, Machine Learning 6 (1991) 67-80.
[31] R. Nishii, S. Tanaka, Accuracy and inaccuracy assessments in land-cover classification, IEEE Trans. Geoscience and Remote Sensing 37 (1999) 491-498.
[32] P.-N. Tan, V. Kumar, J. Srivastava, Selecting the right objective measure for association analysis, Information Systems 29 (2004) 293-313.
[33] Y. Wang, B.-G.
Hu, Deri va tions of normalized mutual information in binary classificat ions, in: the 6th International Conference on Fuzzy Systems and Kno wledge Discov ery , 2009, pp. 155–163. [34] J. O. Berger , The case for objecti ve bayesian analysis, Bayesian Analysis 1 (2005) 385–402. [35] M. Li, X. Chen, X. Li, B. Ma, M. V itanyi , The similarit y metric, IEEE Trans. Inf ormation Theory 50 (2004) 3250–3264. [36] C. Tsallis, Possible generaliza tion of boltz mann-gibbs statistic s, J. Stats. Physics 52 (1988) 479–487. [37] T . M. Cov er , J. A. Thomas, Elements of Information Theory , John Wi ley , NY , 1995. [38] B. G. Hu, Informati on measure toolbox for classifier ev alua tion on open source s oftw are scilab, in: IEEE Internatio nal W orkshop on Open- source Softwa re for Scientific Computat ion, 2009, pp. 179–184. [39] C. K. Chow , On optimum recogni tion error and rejec t tradeo ff , IEEE Trans. on Inf ormation Theory 16 (1970) 41–46. [40] A. Strehl, J. Ghosh, Clustering ensembles, a kno wledge reuse framew ork for combinin g multiple partiti ons, J. Machine Learning Research 3 (2002) 583–617 . [41] F . Malv estuto, Statistical treatment of the information content of a database, Information Systems 11 (1986) 211–223. [42] S. Kullback, R. A. Leibler , On information and su ffi ciency , The Annals of Mathemati cal Statistic s 22 (1951) 79–86. [43] D. Johnson, S. Sinanovic , Symmetrizing the kullback-le ibler distance, Rice Univ ersity , W orking Pa per . [44] I. Csiszar , Eine informationthe oretische ungleichung und ihre anwendung auf den bewei s der ergodizit at von marko ff sche n ket ten, Publ. Math. Inst. Hungar . Acad. Sci. 8 (1963) 85–108. [45] J. Lin, Div ergenc e measures based on the shannon entropy , IEEE Tra ns. Information Theory 37 (1991) 145–151. [46] D. Malerba, F . Es posito, M. Monopoli, Comparing dissimilari ty measures for probabilistic symbolic object s, Data Mining III, Series Man- agement Informati on Systems 6 (2002) 31–40. 
Table 1: NI measures within the mutual-information based group.

No.  Name [Reference]                      Formula for NI_k
1    NI based on mutual information [28]   NI_1(T,Y) = I(T,Y) / H(T)
2    NI based on mutual information [11]   NI_2(T,Y) = I_M(T,Y) / H(T)
3    NI based on mutual information [28]   NI_3(T,Y) = I(T,Y) / H(Y)
4    NI based on mutual information        NI_4(T,Y) = (1/2) [ I(T,Y)/H(T) + I(T,Y)/H(Y) ]
5    NI based on mutual information [26]   NI_5(T,Y) = 2 I(T,Y) / ( H(T) + H(Y) )
6    NI based on mutual information [40]   NI_6(T,Y) = I(T,Y) / sqrt( H(T) H(Y) )
7    NI based on mutual information [41]   NI_7(T,Y) = I(T,Y) / H(T,Y)
8    NI based on mutual information [26]   NI_8(T,Y) = I(T,Y) / max( H(T), H(Y) )
9    NI based on mutual information [26]   NI_9(T,Y) = I(T,Y) / min( H(T), H(Y) )

Table 2: Information measures within the divergence based group ( NI_k = exp(-D_k) ).

No.  Name of D_k [Reference]          Formula for D_k
10   ED-Quadratic Divergence [9]      D_10 = QD_ED(T,Y) = sum_z ( p_t(z) - p_y(z) )^2
11   CS-Quadratic Divergence [9]      D_11 = QD_CS(T,Y) = log2 { sum_z p_t(z)^2 sum_z p_y(z)^2 / [ sum_z p_t(z) p_y(z) ]^2 }
12   KL Divergence [42]               D_12 = KL(T,Y) = sum_z p_t(z) log2( p_t(z) / p_y(z) )
13   Bhattacharyya Distance [43]      D_13 = D_B(T,Y) = -log2 sum_z sqrt( p_t(z) p_y(z) )
14   Chi^2 (Pearson) Divergence [44]  D_14 = chi^2(T,Y) = sum_z ( p_t(z) - p_y(z) )^2 / p_y(z)
15   Hellinger Distance [44]          D_15 = H^2(T,Y) = sum_z ( sqrt(p_t(z)) - sqrt(p_y(z)) )^2
16   Variation Distance [44]          D_16 = V(T,Y) = sum_z | p_t(z) - p_y(z) |
17   J Divergence [45]                D_17 = J(T,Y) = sum_z p_t(z) log2( p_t(z)/p_y(z) ) + sum_z p_y(z) log2( p_y(z)/p_t(z) )
18   L (or JS) Divergence [45]        D_18 = L(T,Y) = KL(T,M) + KL(Y,M), M = ( p_t(z) + p_y(z) ) / 2
19   Symmetric Chi^2 Divergence [46]  D_19 = chi^2_S(T,Y) = sum_z ( p_t(z) - p_y(z) )^2 / p_y(z) + sum_z ( p_y(z) - p_t(z) )^2 / p_t(z)
20   Resistor Average Distance [43]   D_20 = D_RA(T,Y) = KL(T,Y) KL(Y,T) / ( KL(T,Y) + KL(Y,T) )

Table 3: NI measures within the cross-entropy based group.

No.  Name                       Formula for NI_k
21   NI based on cross-entropy  NI_21 = H(T) / H(T;Y), H(T;Y) = -sum_z p_t(z) log2 p_y(z)
22   NI based on cross-entropy  NI_22 = H(Y) / H(Y;T), H(Y;T) = -sum_z p_y(z) log2 p_t(z)
23   NI based on cross-entropy  NI_23 = (1/2) [ H(T)/H(T;Y) + H(Y)/H(Y;T) ]
24   NI based on cross-entropy  NI_24 = [ H(T) + H(Y) ] / [ H(T;Y) + H(Y;T) ]

Table 4: Numerical examples in binary classifications (M1-M4 and M6: C1 = 90, C2 = 10; M5: C1 = 95, C2 = 5). (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model       M1          M2          M3          M4          M5          M6
(Ranking)   (C)         (B)         (B)         (A)
C           [90 0 0;    [89 1 0;    [90 0 0;    [89 0 1;    [57 38 0;   [89 1 0;
             1 9 0]      0 10 0]     0 9 1]      0 10 0]     3 2 0]      1 9 0]
CR          0.990       0.990       0.990       0.990       0.590       0.980
Rej         0.000       0.000       0.010       0.010       0.000       0.000
Precision   0.989       1.000       1.000       1.000       0.950       0.989
Recall      1.000       0.989       1.000       1.000       0.600       0.989
F1          0.994       0.994       1.000       1.000       0.735       0.989

Table 5: Results for the models in Table 4 on information measures from the mutual-information and cross-entropy groups. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top; the letter in the Model column is the model's overall ranking from Table 4.

Model  NI_1   NI_2   NI_3   NI_4   NI_5   NI_6   NI_7   NI_8   NI_9   NI_21  NI_22  NI_23  NI_24
M1     0.831  0.831  0.893  0.862  0.860  0.861  0.755  0.831  0.893  0.998  0.998  0.998  0.998
(C)    (D)    (D)    (B)    (D)    (D)    (D)    (D)    (D)    (D)    (A)    (A)    (A)    (A)
M2     0.897  0.897  0.841  0.869  0.868  0.869  0.767  0.841  0.897  0.998  0.998  0.998  0.998
(B)    (C)    (C)    (D)    (C)    (C)    (C)    (C)    (C)    (C)    (A)    (A)    (A)    (A)
M3     1.000  0.929  0.909  0.955  0.952  0.953  0.909  0.909  1.000  0.969  0.000  0.484  0.000
(B)    (A)    (B)    (A)    (A)    (A)    (A)    (A)    (A)    (A)    (D)    (B)    (C)    (B)
M4     1.000  0.997  0.855  0.928  0.922  0.925  0.855  0.855  1.000  0.970  0.000  0.485  0.000
(A)    (A)    (A)    (C)    (B)    (B)    (B)    (B)    (B)    (A)    (C)    (B)    (B)    (B)
M5     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.374  0.548  0.461  0.495
M6     0.731  0.731  0.731  0.731  0.731  0.731  0.576  0.731  0.731  1.000  1.000  1.000  1.000

Table 6: Results for the models in Table 4 on information measures from the divergence group. S = singularity which cannot be removed. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model  NI_10   NI_11   NI_12   NI_13   NI_14   NI_15   NI_16   NI_17   NI_18   NI_19   NI_20
M1     0.9998  0.9998  0.9991  0.9998  0.9988  0.9997  0.9802  0.9983  0.9996  0.9977  0.9996
(C)    (A)     (A)     (B)     (A)     (B)     (A)     (A)     (B)     (A)     (B)     (A)
M2     0.9998  0.9998  0.9992  0.9998  0.9990  0.9997  0.9802  0.9985  0.9996  0.9979  0.9996
(B)    (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)
M3     0.9998  0.9996  0.9849  0.9926  0.9890  0.9898  0.9802  S       0.9897  S       S
(B)    (A)     (D)     (D)     (D)     (D)     (D)     (A)             (D)
M4     0.9998  0.9998  0.9856  0.9928  0.9899  0.9900  0.9802  S       0.9900  S       S
(A)    (A)     (A)     (C)     (C)     (C)     (C)     (A)             (C)
M5     0.7827  0.6473  0.6189  0.8540  0.6002  0.8129  0.4966  0.2775  0.7550  0.0455  0.7406
M6     1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  S

Table 7: Numerical examples in binary classifications (n = 100). (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model       M1a        M2a        M3a        M4a        M1b        M2b        M3b        M4b
C           [94 0 0;   [93 1 0;   [94 0 0;   [93 0 1;   [95 0 0;   [94 1 0;   [95 0 0;   [94 0 1;
             1 5 0]     0 6 0]     0 5 1]     0 6 0]     1 4 0]     0 5 0]     0 4 1]     0 5 0]
CR          0.99       0.99       0.99       0.99       0.99       0.99       0.99       0.99
(Rejection) (0.00)     (0.00)     (0.01)     (0.01)     (0.00)     (0.00)     (0.01)     (0.01)
NI_2        0.756      0.874      0.876      0.997      0.720      0.864      0.849      0.997
(Ranking)   (D)        (C)        (B)        (A)        (D)        (B)        (C)        (A)

Table 8: Classification examples in three classes (C1 = 80, C2 = 15, C3 = 5). (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model       M7           M8           M9           M10          M11
(Ranking)   (C)          (C)          (B)          (B)          (B)
C           [80 0 0 0;   [80 0 0 0;   [80 0 0 0;   [80 0 0 0;   [80 0 0 0;
             0 15 0 0;    0 15 0 0;    0 15 0 0;    1 14 0 0;    0 14 1 0;
             1 0 4 0]     0 1 4 0]     0 0 4 1]     0 0 5 0]     0 0 5 0]
CR          0.99         0.99         0.99         0.99         0.99
Rej         0.00         0.00         0.01         0.00         0.00

Model       M12          M13          M14          M15
(Ranking)   (B)          (B)          (B)          (A)
C           [80 0 0 0;   [79 1 0 0;   [79 0 1 0;   [79 0 0 1;
             0 14 0 1;    0 15 0 0;    0 15 0 0;    0 15 0 0;
             0 0 5 0]     0 0 5 0]     0 0 5 0]     0 0 5 0]
CR          0.99         0.99         0.99         0.99
Rej         0.01         0.00         0.00         0.01

Table 9: Results for the models in Table 8 on information measures from the mutual-information and cross-entropy groups. S = singularity which cannot be removed. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model  NI_1   NI_2   NI_3   NI_4   NI_5   NI_6   NI_7   NI_8   NI_9   NI_21  NI_22  NI_23  NI_24
M7     0.912  0.912  0.957  0.935  0.934  0.934  0.876  0.912  0.957  0.998  0.998  0.998  0.998
(F)    (F)    (F)    (C)    (G)    (G)    (G)    (F)    (H)    (E)    (D)    (C)    (C)    (C)
M8     0.939  0.939  0.958  0.949  0.949  0.949  0.902  0.939  0.958  0.998  0.998  0.998  0.998
(F)    (E)    (E)    (B)    (D)    (D)    (D)    (D)    (D)    (D)    (D)    (C)    (C)    (C)
M9     1.000  0.951  0.961  0.980  0.980  0.980  0.961  0.961  1.000  0.982  0.000  0.491  0.000
(C)    (A)    (D)    (A)    (A)    (A)    (A)    (A)    (A)    (A)    (G)    (G)    (I)    (G)
M10    0.912  0.912  0.938  0.925  0.925  0.925  0.860  0.912  0.938  0.999  0.999  0.999  0.999
(E)    (F)    (F)    (F)    (I)    (I)    (I)    (H)    (H)    (G)    (A)    (A)    (A)    (A)
M11    0.956  0.956  0.941  0.948  0.948  0.948  0.902  0.941  0.956  0.998  0.998  0.998  0.998
(E)    (D)    (C)    (E)    (E)    (E)    (E)    (D)    (C)    (E)    (B)    (C)    (C)    (C)
M12    1.000  0.969  0.943  0.972  0.971  0.971  0.943  0.943  1.000  0.983  0.000  0.492  0.000
(B)    (A)    (B)    (D)    (B)    (B)    (B)    (B)    (B)    (A)    (F)    (G)    (G)    (G)
M13    0.939  0.939  0.915  0.927  0.927  0.927  0.863  0.915  0.939  0.999  0.999  0.999  0.999
(D)    (E)    (E)    (I)    (H)    (H)    (H)    (G)    (G)    (F)    (A)    (A)    (A)    (A)
M14    0.956  0.956  0.916  0.936  0.935  0.936  0.879  0.916  0.956  0.998  0.998  0.998  0.998
(D)    (D)    (C)    (H)    (F)    (F)    (F)    (E)    (F)    (E)    (D)    (C)    (C)    (C)
M15    1.000  0.996  0.919  0.960  0.958  0.959  0.919  0.919  1.000  0.984  0.000  0.492  0.000
(A)    (A)    (A)    (G)    (C)    (C)    (C)    (C)    (E)    (A)    (E)    (G)    (G)    (G)

Table 10: Results for the models in Table 8 on information measures from the divergence group. S = singularity which cannot be removed. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model  NI_10   NI_11   NI_12   NI_13   NI_14   NI_15   NI_16   NI_17   NI_18   NI_19   NI_20
M7     0.9998  0.9998  0.9982  0.9996  0.9974  0.9994  0.9802  0.9966  0.9992  0.9953  0.9992
(F)    (A)     (A)     (D)     (C)     (E)     (D)     (A)     (D)     (D)     (E)     (D)
M8     0.9998  0.9996  0.9979  0.9995  0.9969  0.9993  0.9802  0.9959  0.9990  0.9942  0.9990
(F)    (A)     (E)     (E)     (D)     (F)     (E)     (A)     (F)     (F)     (F)     (F)
M9     0.9998  0.9996  0.9840  0.9924  0.9876  0.9895  0.9802  S       0.9893  S       S
(C)    (A)     (E)     (H)     (G)     (I)     (H)     (A)             (H)
M10    0.9998  0.9997  0.9994  0.9999  0.9992  0.9998  0.9802  0.9988  0.9997  0.9984  0.9997
(E)    (A)     (C)     (A)     (A)     (A)     (A)     (A)     (B)     (A)     (C)     (A)
M11    0.9998  0.9996  0.9982  0.9995  0.9976  0.9994  0.9802  0.9964  0.9991  0.9950  0.9991
(E)    (A)     (E)     (D)     (D)     (D)     (D)     (A)     (E)     (E)     (F)     (E)
M12    0.9998  0.9996  0.9852  0.9927  0.9893  0.9899  0.9802  S       0.9898  S       S
(B)    (A)     (E)     (G)     (F)     (H)     (G)     (A)             (H)
M13    0.9998  0.9997  0.9994  0.9999  0.9992  0.9998  0.9802  0.9989  0.9997  0.9985  0.9997
(D)    (A)     (C)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)
M14    0.9998  0.9997  0.9986  0.9996  0.9982  0.9995  0.9802  0.9972  0.9993  0.9961  0.9993
(D)    (A)     (C)     (C)     (C)     (C)     (C)     (A)     (C)     (C)     (D)     (C)
M15    0.9998  0.9998  0.9856  0.9928  0.9899  0.9900  0.9802  S       0.9900  S       S
(A)    (A)     (A)     (F)     (E)     (G)     (F)     (A)             (G)
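To make the definitions in Tables 1–3 concrete, the sketch below computes a few of the measures for model M1 of Table 4. This is not the authors' Scilab toolbox [38]; it is a minimal Python illustration with our own function names, assuming the usual conventions here: rows of the confusion matrix index the true class T, columns index the classifier output Y (last column = reject), and 0·log 0 terms are dropped.

```python
import math

def distributions(C):
    """Joint and marginal distributions from a confusion matrix
    (rows = true classes; columns = outputs, last column = reject)."""
    n = float(sum(sum(row) for row in C))
    P = [[c / n for c in row] for row in C]        # joint p(t, y)
    pt = [sum(row) for row in P]                   # p_t: true-class marginal
    py = [sum(col) for col in zip(*P)]             # p_y: output marginal
    return P, pt, py

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def mutual_information(P, pt, py):
    return sum(P[i][j] * math.log2(P[i][j] / (pt[i] * py[j]))
               for i in range(len(pt)) for j in range(len(py))
               if P[i][j] > 0)

def cross_entropy(p, q):
    # H(T;Y) = -sum_z p_t(z) log2 p_y(z); pad p for the reject state
    p = p + [0.0] * (len(q) - len(p))
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    # D_12 = KL(T,Y) in bits (Table 2)
    p = p + [0.0] * (len(q) - len(p))
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Model M1 of Table 4: C = [90 0 0; 1 9 0]
C_M1 = [[90, 0, 0],
        [1, 9, 0]]

P, pt, py = distributions(C_M1)
I = mutual_information(P, pt, py)
NI1 = I / entropy(pt)                        # Table 1, NI_1
NI5 = 2 * I / (entropy(pt) + entropy(py))    # Table 1, NI_5
NI12 = math.exp(-kl(pt, py))                 # Table 2, NI_12 = exp(-D_12)
NI21 = entropy(pt) / cross_entropy(pt, py)   # Table 3, NI_21
print(f"{NI1:.3f} {NI5:.3f} {NI12:.4f} {NI21:.3f}")
# → 0.831 0.860 0.9991 0.998
```

The printed values reproduce the M1 entries in Tables 5 and 6 (NI_1 = 0.831, NI_5 = 0.860, NI_12 = 0.9991, NI_21 = 0.998), which also illustrates why the cross-entropy and divergence measures compare only the marginals p_t and p_y, while the mutual-information measures use the full joint distribution.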
