Information-Theoretic Measures for Objective Evaluation of Classifications


Authors: Bao-Gang Hu, Ran He, XiaoTong Yuan

Bao-Gang Hu* (a,b), Ran He (a), XiaoTong Yuan (c)

a NLPR/LIAMA, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
b Beijing Graduate School, Chinese Academy of Sciences, Beijing 100190, China
c Department of Electronic and Computer Engineering, National University of Singapore, Singapore

Abstract

This work presents a systematic study of objective evaluations of abstaining classifications using Information-Theoretic Measures (ITMs). First, we define objective measures as those that do not depend on any free parameter. This definition provides technical simplicity for examining "objectivity" or "subjectivity" directly in classification evaluations. Second, we propose twenty-four normalized ITMs, derived from either mutual information, divergence, or cross-entropy, for investigation. Contrary to conventional performance measures that apply empirical formulas based on users' intuitions or preferences, the ITMs are theoretically more sound for realizing objective evaluations of classifications. We apply them to distinguish "error types" and "reject types" in binary classifications without the need for input data of cost terms. Third, to better understand and select the ITMs, we suggest three desirable features for classification assessment measures, which appear more crucial and appealing from the viewpoint of classification applications. Using these features as "meta-measures", we can reveal the advantages and limitations of ITMs from a higher level of evaluation knowledge. Numerical examples are given to corroborate our claims and compare the differences among the proposed measures. The best measure is selected in terms of the meta-measures, and its specific properties regarding error types and reject types are analytically derived.
Keywords: Abstaining classifications, error types, reject types, entropy, similarity, objectivity

1. Introduction

The selection of evaluation measures for classifications has received increasing attention from researchers in various application fields [1][2][3][4][5][6][7]. It is well known that evaluation measures, or criteria, have a substantial impact on the quality of classification performance. The problem of how to select evaluation measures for the overall quality of classifications is difficult, and there appears to be no universal answer. Up to now, various types of evaluation measures have been used in classification applications. Taking binary classification as an example, more than thirty metrics have been applied for assessing the quality of classifications and their algorithms, as given in Table 1 of Lavesson and Davidsson's paper [5]. Most of the metrics listed in that table can be considered performance-based measures. In practice, other types of evaluation measures, such as Information-Theoretic Measures (ITMs), have also commonly been used in machine learning [8][9]. The typical information-based measure used in classifications is the cross entropy [10]. In a recent work [11], Hu and Wang derived an analytical formula of the Shannon-based mutual information measure with respect to a confusion matrix. Significant benefits were derived from the measure, such as its generality even for classifications with a reject option, and its objectivity in naturally balancing performance-based measures that may conflict with one another (such as precision and recall). The objectivity was achieved from the perspective that an information-based measure does not require knowledge of cost terms in evaluating classifications.
[* Corresponding author. Address: NLPR/LIAMA, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. Tel.: +86-10-62647318; Fax: +86-10-62647458. Email address: hubg@nlpr.ia.ac.cn (Bao-Gang Hu). September 9, 2018]

This advantage is particularly important in studies of abstaining classifications [12][4] and cost-sensitive learning [13][14], where cost terms may be required as input data for evaluations. Generally, if no cost terms are assigned to evaluations, it is implied that zero-one cost functions are applied [15]. In such situations, classification evaluations without a reject option may still be applicable and useful on class-balanced datasets. If no specific cost terms are given, however, problematic, or unreasonable, results will be obtained in situations where the classes in the datasets are highly skewed [3]. In this work, to simplify the discussion, we distinguish, or decouple, two goals in evaluation studies, namely, evaluation of classifiers and evaluation of classifications. The former goal concerns the evaluation of the algorithms in which classifiers are applied; from this evaluation, designers or users can select the best classifier. The latter goal is to evaluate classification results without concern for which classifier is applied; this evaluation aims more at result comparisons or measure comparisons. One typical example was demonstrated by Mackay [16] to highlight the difficulty in classification evaluations. He showed two specific confusion matrices, $C_D$ and $C_E$, in binary classifications with a reject option:

$$C_D = \begin{bmatrix} 74 & 6 & 10 \\ 0 & 9 & 1 \end{bmatrix}, \quad C_E = \begin{bmatrix} 78 & 6 & 6 \\ 0 & 5 & 5 \end{bmatrix}, \quad \text{with} \quad C = \begin{bmatrix} TN & FP & RN \\ FN & TP & RP \end{bmatrix}, \qquad (1)$$

where the confusion matrix is defined as $C$ in Eq.
(1), and "TN", "TP", "FN", "FP", "RN", "RP" represent "true negative", "true positive", "false negative", "false positive", "reject negative", and "reject positive", respectively. For the given data, users may ask "which measures will be proper for ranking them". If directly applying the "True Positive Rate vs. False Positive Rate" curve (also called ROC) or the "Precision-Recall" curve, one may conclude that the performance of $C_E$ is better than that of $C_D$. This conclusion is proper since the two sets of data share the same reject rate (= 11%). Generally, the "Error-Reject" curve is most often adopted in abstaining classifications. Based on this evaluation approach, one may consider that the performances of the two classifications show no difference, because they exhibit the same error rate (= 6%) and reject rate. Mackay [16] first suggested applying a mutual-information based measure for ranking classifications, through which Hu and Wang (referring to M5-M6 in Table 3, [11]) observed that $C_D$ is better than $C_E$. Reviewing the two matrices carefully with respect to the imbalanced classes, one may agree with this observation, because the small class in $C_D$ receives more correct classifications than that in $C_E$. We consider the example designed by Mackay [16] quite stimulating for the study of abstaining classification evaluations. The implications of the example form the motivations of the present work for addressing three related open problems, which are generally overlooked in the study of classification evaluations:

I. How to define "proper" measures in terms of high-level knowledge for abstaining classification evaluations?

II. How to conduct an objective evaluation of classifications without using cost terms?

III. How to distinguish or rank "error types" and "reject types" in classification evaluations?
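The tie between $C_D$ and $C_E$ under the conventional rates can be checked directly from the matrices of Eq. (1). A minimal sketch (the function name is ours; the layout with the reject class in the last column follows Eq. (1)):

```python
def error_and_reject_rates(C):
    """Error and reject rates from an augmented confusion matrix.

    Rows are the actual classes; the last column is the reject class,
    following the layout C = [[TN, FP, RN], [FN, TP, RP]] of Eq. (1).
    """
    m = len(C)                          # number of actual classes
    n = sum(sum(row) for row in C)      # total number of samples
    errors = sum(C[i][j] for i in range(m) for j in range(m) if i != j)
    rejects = sum(C[i][m] for i in range(m))
    return errors / n, rejects / n

C_D = [[74, 6, 10], [0, 9, 1]]
C_E = [[78, 6, 6], [0, 5, 5]]
print(error_and_reject_rates(C_D))  # (0.06, 0.11)
print(error_and_reject_rates(C_E))  # (0.06, 0.11): the two classifications tie
```

Both matrices yield a 6% error rate and an 11% reject rate, so any measure built only on these rates cannot rank them, which is exactly Mackay's point.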
Conventional binary classifications usually distinguish two types of misclassification errors [15][16] if they result in different losses in applications. For example, in medical applications, a "Type I Error" (or "false positive") can be an error of misclassifying a healthy person as abnormal, such as having cancer. On the contrary, a "Type II Error" (or "false negative") is an error where cancer is not detected in a patient. Therefore, "Type II Error" is more costly than "Type I Error". For the same reason that "error types" are identified in binary classifications, there is a need to consider "reject types" if a reject option is applied. Of the existing measures, we consider information-theoretic measures to be the most promising in providing "objectivity" in classification evaluations. A detailed discussion on the definition of "objectivity" is given in Section 3.

This work is an extension of our previous study [11]. However, it aims at a systematic investigation of information measures with a specific focus on "error types" and "reject types". The main contributions of the work derive from the following three aspects:

I. We define the "proper" features, also called "meta-measures", for selecting candidate measures in the context of abstaining classification evaluations. These features will assist users in understanding the advantages and limitations of evaluation measures from a higher level of knowledge.

II. We examine most of the existing information measures in a systematic investigation of "error types" and "reject types" for objective evaluations. We hope that the more than twenty measures investigated are able to enrich the current bank of classification evaluation measures. For the best measure in terms of the meta-measures, we present a theoretical confirmation of its desirable properties regarding error types and reject types.

III.
We reveal the intrinsic shortcomings of information measures in evaluations. The discussions are intended to be applicable to a wider range of classification problems, such as similarity ranking. In addition, we are able to employ the measures reasonably in interpreting classification results.

To address classification evaluations with a reject option, we assume that the only basic data available for classification evaluations is a confusion matrix, without input data of cost terms. The rest of this letter is organized as follows. In Section 2, we present related work on the selection of evaluation measures. In seeking "proper" measures, we propose several desirable features in the context of classifications in Section 3. Three groups of normalized information measures are proposed, along with their intrinsic shortcomings, in Sections 4 to 6, respectively. Several numerical examples, together with discussions, are given in Section 7. Finally, in Section 8 we conclude the work.

2. Related Work

In classification evaluations, a measure based on classification accuracy has traditionally been used with some success in numerous cases [15]. This measure, however, may suffer serious problems in reaching intuitively reasonable results in certain special cases of real-world classification problems [3]. The main reason for this is that a single measure of accuracy does not take error types into account. To overcome the problems of accuracy measures, researchers have developed many sophisticated approaches for classification assessment [17][18]. Among these, two commonly-used approaches are ROC (Receiver Operating Characteristic) curves and AUC (Area Under Curve) measures [1][19].
ROC curves provide users with a very fast evaluation approach via visual inspection, but this is only applicable in limited cases with specific curve forms (for example, when one curve is completely above the other). AUC measures are more generic for ranking classifications without constraints on curve forms. In a study of binary classifications, a formal proof was given by Ling et al. [1] showing that AUC is a better measure than accuracy under the definitions of both statistical consistency and discriminancy. Sophisticated AUC measures were reported recently for improving the robustness [6] and coherency [7] of classifiers. Drummond and Holte [20] proposed a visualization technique called the "Cost Curve", which is able to take cost terms into account for showing confidence intervals on a classifier's performance. Japkowicz [3] presented convincing examples showing the shortcomings of existing evaluation methods, including accuracy, precision vs. recall, and ROC techniques. The findings from these examples further confirmed the need for methods using measure-based functions [21]. The main idea behind measure-based functions is to form a single function with respect to a weighted summation of multiple measures. The measure function is able to balance the trade-off among conflicting measures, such as precision and recall. However, the main difficulty arises in the selection of balancing weights for the measures [5]. In most cases, users rely on their preferences and experience in assigning the weights, which imposes a strong degree of subjectivity on the evaluation results.

Classification evaluations become more complicated if a classifier abstains from making a prediction when the outcome is considered unreliable for a specific sample. In this case, an extra class, known as the "reject" or "unknown" class, is added to the classification.
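The weight-induced subjectivity of measure-based functions can be seen in a small numerical sketch. The two classifiers and the weight settings below are hypothetical; the function is simply a weighted summation of precision and recall:

```python
def measure_function(precision, recall, w_p, w_r):
    """Measure-based function: a weighted summation of two conflicting measures."""
    return w_p * precision + w_r * recall

# Two hypothetical classifiers: A favors precision, B favors recall.
A = (0.9, 0.6)
B = (0.7, 0.8)

# Precision-heavy weights rank A first ...
print(measure_function(*A, 0.8, 0.2), measure_function(*B, 0.8, 0.2))
# ... recall-heavy weights rank B first: the ranking depends on the user's weights.
print(measure_function(*A, 0.2, 0.8), measure_function(*B, 0.2, 0.8))
```

The ranking flips with the choice of weights, which is precisely the subjectivity the text attributes to measure-based functions.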
In recent years, the study of abstaining classifiers has received much attention [22][23][12][4][24]. With the complete data of a full cost matrix, these studies were able to assess the classifications. If one term of the cost matrix was missing, such as a reject cost term, the approaches for classification evaluation generally failed. Moreover, because in most situations the cost terms are given by users, this approach is basically a subjective evaluation in applications. Vanderlooy et al. [25] further investigated the ROC isometrics approach, which does not rely on information from a cost matrix. This approach, however, is only applicable to binary classification problems.

A promising line of study on objective evaluations of classifications is attributed to the introduction of information theory. Kvalseth [26] and Wickens [27] derived normalized mutual information (NMI) measures in relation to a contingency table. Further pioneering studies on the classification problems were conducted by Finn [28] and Forbes [29]. Forbes [29] discussed the problem that NMI does not share a monotonic property with the other performance measures, such as accuracy or F-measure. Several different definitions of information measures have been reported in studies of classification assessment, such as the information scores of Kononenko and Bratko [30] and the KL divergence of Nishii and Tanaka [31]. Yao et al. [8] and Tan et al. [32] summarized many useful information measures for studies of associations and attribute importance. Significant efforts were made on discussing the desired properties of evaluation measures [32]. Principe et al. [9] proposed a framework of information theoretic learning (ITL) that included supervised learning as in classifications. Within this framework, the learning criteria were the mutual information defined from the Shannon and Renyi entropies.
Two quadratic divergences, namely, the Euclidean and Cauchy-Schwartz distances, were also included. From the perspective of information theory, Wang and Hu [33] derived for the first time the nonlinear relations between mutual information and the conventional performance measures (accuracy, recall, and precision) for binary classification problems. They [11] extended the investigation to abstaining classification evaluations for multiple classes. Their method was based solely on the confusion matrix. To gain theoretical properties, they derived extremum theorems concerning mutual information measures. One of the important findings from the local minimum theorem is the theoretical revelation of the non-monotonic property of mutual information measures with respect to the diagonal terms of a confusion matrix. This property may cause irrational evaluation results for some data in classifications. They confirmed this problem by examining specific numerical examples. Theoretical investigations are still missing for other information measures, such as divergence-based and cross-entropy-based ones.

3. Objective Evaluations and Meta-Measures

This work focuses on objective evaluations of classifications. While Berger [34] stressed four points from a philosophical position in support of objective Bayesian analysis, it seems that few studies in the literature address the "objectivity" issue in the study of classification evaluations. Some researchers [32] may call their measures objective without defining them formally. Considering that "objectivity" is a rather philosophical concept without a well-accepted definition, we propose a scheme for defining "objective evaluations" from the viewpoint of practical implementation and examination.

Definition 1. Objective evaluations and measures.
An objective evaluation is an assessment expressed by a function that does not contain any free parameter. This function is called an objective measure.

Remark 1. When a free parameter is used to define a measure, it usually carries a certain degree of subjectivity into evaluations. Therefore, according to this definition, a measure based on cost terms [15] as free parameters does not lead to an objective evaluation. Definition 1 may be conservative but nevertheless provides technical simplicity for examining "objectivity" or "subjectivity" directly with respect to the existence of free parameters. In some situations, Definition 1 can be relaxed to admit free parameters, but they all have to be determined solely from the given dataset.

Definition 2. Datasets in classification evaluations with a reject option.

A reject option is sometimes considered for classifications in which one may assign samples to a reject, or unknown, class. Evaluations of classifications with a reject option apply two datasets, namely, the output (or prediction) dataset $\{y_k\}_{k=1}^{n}$, which is a realization of a discrete random variable $Y$ valued on the set $\{1, 2, \ldots, m+1\}$, and the target dataset $\{t_k\}_{k=1}^{n} \in T$ valued on the set $\{1, 2, \ldots, m\}$, where $n$ is the total number of samples and $m$ is the total number of classes. A sample identified as a reject class is represented by $y_k = m + 1$.

Remark 2. The term "abstaining classifiers" has been widely used for classification problems with a reject option [12][4]. However, most studies of abstaining classifications required cost matrices for their evaluations. The definition given above covers more generic scenarios in classification evaluations, because it does not require knowledge of cost terms for error types and reject types.

Definition 3. Augmented confusion matrix and its constraints [11].
An augmented confusion matrix includes one column for the reject class, which is added to a conventional confusion matrix:

$$C = \begin{bmatrix} c_{11} & c_{12} & \ldots & c_{1m} & c_{1(m+1)} \\ c_{21} & c_{22} & \ldots & c_{2m} & c_{2(m+1)} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ c_{m1} & c_{m2} & \ldots & c_{mm} & c_{m(m+1)} \end{bmatrix}, \qquad (2)$$

where $c_{ij}$ represents the number of samples of the $i$th class that are classified as the $j$th class. The row data correspond to the actual classes, while the column data correspond to the predicted classes. The last column represents the reject class. The relations and constraints of an augmented confusion matrix are:

$$C_i = \sum_{j=1}^{m+1} c_{ij}, \quad C_i > 0, \quad c_{ij} \ge 0, \quad i = 1, 2, \ldots, m, \qquad (3)$$

where $C_i$ is the total number of samples of the $i$th class, which is generally known in classification problems.

Definition 4. Error types and reject types.

Following the conventions in binary classifications [?], we denote $c_{12}$ and $c_{21}$ as "Type I Error" and "Type II Error", respectively, and $c_{13}$ and $c_{23}$ as "Type I Reject" and "Type II Reject", respectively.

Definition 5. Normalized information measure.

A normalized information measure, denoted $NI(T, Y) \in [0, 1]$, is a function based on information theory which represents the degree of similarity between two random variables $T$ and $Y$. In principle, we hope that all NI measures satisfy the three important properties, or axioms, of metrics [15][35], supposing $Z$ is another random variable:

P1: $NI(T, Y) = 1$ iff $T = Y$ (the identity axiom)
P2: $NI(T, Y) + NI(Y, Z) \ge NI(T, Z)$ (the triangle inequality)
P3: $NI(T, Y) = NI(Y, T)$ (the symmetry axiom)

Remark 3. Violations of the properties of metrics are possible while still reaching reasonable evaluations of classifications.
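Definitions 2 to 4 can be exercised in a short sketch: the augmented matrix of Eq. (2) is accumulated from target labels $t_k \in \{1, \ldots, m\}$ and predictions $y_k \in \{1, \ldots, m+1\}$ (label $m+1$ meaning reject), the constraints of Eq. (3) are checked, and the binary error and reject types of Definition 4 are read off. The label arrays and function names below are made up for illustration:

```python
def augmented_confusion_matrix(t, y, m):
    """Accumulate Eq. (2): rows are actual classes 1..m,
    columns are predicted classes 1..m plus the reject class m+1."""
    C = [[0] * (m + 1) for _ in range(m)]
    for tk, yk in zip(t, y):
        C[tk - 1][yk - 1] += 1
    return C

def check_constraints(C):
    """Eq. (3): C_i = sum_j c_ij > 0 and c_ij >= 0 for every row."""
    return all(sum(row) > 0 and min(row) >= 0 for row in C)

# Hypothetical binary data: class labels in {1, 2}; prediction 3 = reject.
t = [1, 1, 1, 1, 2, 2, 2, 2]
y = [1, 1, 2, 3, 2, 2, 1, 3]
C = augmented_confusion_matrix(t, y, m=2)
assert check_constraints(C)

# Definition 4 (binary case): c12/c21 are error types, c13/c23 are reject types.
types = {"Type I Error": C[0][1], "Type II Error": C[1][0],
         "Type I Reject": C[0][2], "Type II Reject": C[1][2]}
print(C)      # [[2, 1, 1], [1, 2, 1]]
print(types)
```

Note that all quantities are obtained from the confusion matrix alone, with no cost terms, matching the evaluation setting assumed in this work.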
For example, the triangle inequality and symmetry properties can be relaxed without changing the ranking orders among classifications if their evaluation measures are applied consistently. However, the identity property is indicated only for the relation $T = Y$ (assuming $T$ is padded with zeros to make it the same size as $Y$), and does not guarantee an exact solution ($t_k = y_k$) in classifications (see Theorems 1 and 4 given later). If a violation of metric properties occurs, the NIs are referred to as measures, rather than metrics. For classification evaluations, we consider the generic properties of metrics not to be as crucial in comparisons as certain specific features. In this work, we focus on specific features that, though not mathematically fundamental, are more necessary in classification applications. To select "better" measures for objective evaluations of classifications, we propose the following three desirable features together with their heuristic justifications.

Feature 1. Monotonicity with respect to the diagonal terms of the confusion matrix.

The diagonal terms of the confusion matrix represent the exact classification numbers for all the samples. In other words, they reflect the coincidence counts between $t$ and $y$ from a similarity viewpoint. When one of these terms changes, the evaluation measure should change monotonically; otherwise, a non-monotonic measure may fail to provide a rational result for ranking classifications correctly. This feature was originally proposed for describing the strength of agreement (or similarity) when the matrix is a contingency table [32].

Feature 2. Variation with reject rate.

To improve classification performance, a reject option is often used in engineering applications [12]. Therefore, we suggest that a measure should be a scalar function of both the classification accuracy and the reject rate.
Such a measure can be evaluated based solely on a given confusion matrix from a single operating point of the classification. This differs from the AUC measures, which are based on an "Error-Reject" curve [16][24] obtained from multiple operating points.

Feature 3. Intuitively consistent costs among error types and reject types.

This feature is derived from the principle of our conventional intuitions when dealing with error types in classifications, extended here to reject types. Two specific intuitions are adopted for binary classifications. First, a misclassification or rejection from a small class causes a greater cost than one from a large class. This intuition represents a property called "within error types and reject types". Second, a misclassification produces a greater cost than a rejection from the same class; this is called the "between error and reject types" property. If a measure is able to satisfy these intuitions, we refer to its associated costs as "intuitively consistent". Exceptions to the intuitions above may exist, but we consider them very special cases.

At this stage, it is worth discussing "objectivity" in evaluations, because one may doubt the correctness of the intentions above, and of the terms "desirable" or "intuitions", in a study of objective evaluations. The three features may seem "problematic" in terms of providing a general concept of "objectivity", because no human bias should enter the objective judgment of evaluation results. The following discussion justifies the proposal of requiring desirable, or proper, features for objective measures. On the one hand, we recognize that any evaluation implies a certain degree of "subjectivity", since evaluations exist only as a result of human judgment.
For example, every selection of evaluation measures, even of objective ones, relies on possible sources of "subjectivity" from users. On the other hand, engineering applications do concern themselves with objective evaluations [29][32]. However, to the authors' best knowledge, a technical definition, or criterion, seems to be missing for determining objective or subjective measures in evaluations of classifications. To overcome possible confusion and vagueness, we set Definition 1 as a practical criterion for examining whether or not a classification evaluation holds "objectivity". If a measure satisfies this definition, it will always retain the property of "objective consistency" in evaluating the given classification results. The three "desirable" features, though based on "intuitions" with "subjectivity", do not destroy the criterion of "objectivity" in classification evaluations. Therefore, it is logically correct to discuss "desirable" features of objective measures as long as the measures satisfy Definition 1, thereby keeping the defined "objectivity".

Note that all the desirable features above are derived from our intuitions on general cases of classification evaluations. Other items may be derived for a wider examination of features. For example, Forbes [29] proposed six "constraints on proper comparative measures", namely, "statistically principled, readily interpretable, generalizable to k-class situations, not different to the special status, reflective of agreement, and insensitive to the segmentation". However, we consider the three features proposed in this work to be more crucial, especially as Feature 3 has never been considered in previous studies of classification evaluations. Although Features 2 and 3 may share a similar meaning, they are presented individually to highlight their specific concerns.
We can also call the desirable features "meta-measures", since they are defined as qualitative, high-level measures about measures. In this work, we apply the meta-measures in our investigation of information measures. Examination with respect to the meta-measures enables clarification of the causes of performance differences among the examined measures in classification evaluations. It will be helpful for users to understand the advantages and limitations of different measures, either objective or subjective ones, from a higher level of evaluation knowledge.

4. Normalized Information Measures based on Mutual Information

All NI measures applied in this work fall into one of three groups, namely, the mutual-information based, divergence based, and cross-entropy based groups. In this section, we focus on the first group. Each measure in this group is derived directly from mutual information, representing the degree of similarity between two random variables. For the purpose of objective evaluations, as suggested by Definition 1 in the previous section, we eliminate all candidate measures defined from the Renyi or Jensen entropies [36][9], since they involve a free parameter. Therefore, to avoid adding free parameters, we only apply the Shannon entropy to information measures [37]:

$$H(Y) = -\sum_{y} p(y) \log_2 p(y), \qquad (4)$$

where $Y$ is a discrete random variable with probability mass function $p(y)$. The mutual information is then defined as [37]:

$$I(T, Y) = \sum_{t} \sum_{y} p(t, y) \log_2 \frac{p(t, y)}{p(t)\, p(y)}, \qquad (5)$$

where $p(t, y)$ is the joint distribution of the two discrete random variables $T$ and $Y$, and $p(t)$ and $p(y)$ are the marginal distributions, which can be derived from:

$$p(t) = \sum_{y} p(t, y), \quad p(y) = \sum_{t} p(t, y). \qquad (6)$$

Sometimes the simplified notation $p_{ij} = p(t, y) = p(t = t_i, y = y_j)$ is used in this work.
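Eqs. (4) to (6) translate directly into code. A minimal sketch with base-2 logarithms (the joint distributions at the end are illustrative only):

```python
import math

def entropy(p):
    """Shannon entropy in bits, Eq. (4); zero-probability terms contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """Mutual information I(T, Y) in bits from a joint distribution, Eq. (5);
    marginals are obtained by summing rows and columns, Eq. (6)."""
    pt = [sum(row) for row in joint]                                   # p(t)
    py = [sum(row[j] for row in joint) for j in range(len(joint[0]))]  # p(y)
    return sum(p * math.log2(p / (pt[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

# T and Y fully dependent: one bit of shared information.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))    # 1.0
# T and Y statistically independent: zero mutual information.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```

The two endpoint cases (full dependence and independence) bracket the values that the normalized measures of Table 1 rescale into $[0, 1]$.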
Table 1 lists the possible normalized information measures within the mutual-information based group. Basically, they all make use of Eq. (5) in their calculations. The main differences are due to the normalization schemes. In applying the formulas for calculating $NI_k$, one generally does not have an exact $p(t, y)$. For this reason, we adopt an empirical joint distribution, defined below, for the calculations.

Definition 6. Empirical joint distribution and empirical marginal distributions [11].

An empirical joint distribution is defined from the frequencies of the given confusion matrix $C$ as:

$$P_e(t, y) = (P_{ij})_e = \frac{c_{ij}}{n}, \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, m+1, \qquad (7a)$$

where $n = \sum C_i$ denotes the total number of samples in the classifications, and the subscript "e" denotes empirical terms. The empirical marginal distributions are:

$$P_e(t = t_i) = \frac{C_i}{n}, \quad i = 1, 2, \ldots, m, \qquad (7b)$$

$$P_e(y = y_j) = \frac{1}{n} \sum_{i=1}^{m} c_{ij}, \quad j = 1, 2, \ldots, m+1. \qquad (7c)$$

Definition 7. Empirical mutual information [11].

The empirical mutual information is given by:

$$I_e(T, Y) = \sum_{t} \sum_{y} P_e(t, y) \log_2 \frac{P_e(t, y)}{P_e(t)\, P_e(y)} = \sum_{i=1}^{m} \sum_{j=1}^{m+1} \frac{c_{ij}}{n} \log_2 \frac{n\, c_{ij}}{C_i \sum_{k=1}^{m} c_{kj}}. \qquad (8)$$

Definitions 6 and 7 provide users with a direct means of applying information measures to the given data of the confusion matrix. For simplicity of analysis and discussion, we adopt the empirical distributions, i.e., $p_{ij} \approx P_{ij}$, for calculating all NIs and deriving the theorems, but remove the associated subscript "e". Note that the notation $NI_2$ in Table 1 differs from the others in its calculation of mutual information: $I_M(T, Y)$ is defined as a "modified mutual information", computed on the intersection of $T$ and $Y$. Hence, when using Eq.
(8), the intersection requires that $I_M(T, Y)$ incorporate a summation of $j$ over 1 to $m$, instead of $m + 1$. This definition is beyond mathematical rigor, but $NI_2$ has the same metric properties as $NI_1$. It was originally proposed to overcome the problem of unchanging NI values when rejections are made within only one class (referring to M9-M10 in Table 3, [11]). The following three theorems are derived for all NIs in this group.

Theorem 1. For all NI measures in Table 1, when $NI(T, Y) = 1$, a classification without a reject class may correspond to the case of either an exact classification ($y_k = t_k$) or a specific misclassification ($y_k \ne t_k$). The specific misclassification can be fully removed by simply exchanging labels in the confusion matrix.

Proof. If $NI(T, Y) = 1$, we can obtain the following conditions from Eq. (8) for classifications without a reject class:

$$p_{ij} = p(t = t_i) \approx P_e(t = t_i) = \frac{C_i}{n} \quad \text{and} \quad p_{kj} = 0, \quad i, j, k = 1, 2, \ldots, m, \quad k \ne i. \qquad (9)$$

These conditions describe the specific confusion matrix in which only one non-zero term appears in each column (with the exception of the last, $(m+1)$th, column). When $j = i$, $C$ is a diagonal matrix representing an exact classification ($y_k = t_k$). Otherwise, a specific misclassification exists, for which a diagonal matrix can be obtained by exchanging labels in the confusion matrix (referring to M11 in Table 4, [11]). ♦

Remark 4. Theorem 1 states that $NI(T, Y) = 1$ is a necessary, but not sufficient, condition for an exact classification.

Theorem 2. For abstaining classification problems, when $NI(T, Y) = 0$, the classifier generally reflects a misclassification. One special case is that in which all samples are assigned to a single one of the $m$ classes, or to the reject class.

Proof. For the NIs defined in Table 1, $NI(T, Y) = 0$ iff $I(T, Y) = 0$.
According to information theory [37], the following condition holds for the given marginal distributions (or the empirical ones if a confusion matrix is used):

  I(T, Y) = 0  iff  p(t, y) = p(t)\, p(y).   (10)

The conditional part of Eq. (10) can be rewritten in the form p_{ij} = p(t = t_i)\, p(y = y_j). From the constraints in (3), p(t = t_i) > 0 (i = 1, 2, \dots, m) can be obtained. For classification solutions, there must exist at least one term with p(y = y_j) > 0 (j = 1, 2, \dots, m+1). Therefore, at least one non-zero term p_{ij} > 0 with i ≠ j must be obtained. This non-zero term corresponds to an off-diagonal term in the confusion matrix, which indicates that a misclassification has occurred. When all samples have been identified as one of the classes (referring to M2 in Table 4, [11]), NI = 0 is obtained. ♦

Remark 5. Eq. (10) gives the statistical reason for zero mutual information: the two random variables are "statistically independent". Theorem 2 demonstrates an intrinsic reason for local minima in NIs.

Theorem 3. The NI measures defined by the Shannon entropy generally do not exhibit a monotonic property with respect to the diagonal terms of a confusion matrix.

Proof. Based on [11], we arrive at simpler conditions for the local minima of I(T, Y) for the given confusion matrix:

  C = \begin{pmatrix} \ddots & 0 & 0 & \\ 0 & c_{i,i} & c_{i,i+1} & 0 \\ 0 & c_{i+1,i} & c_{i+1,i+1} & 0 \\ & 0 & 0 & \ddots \end{pmatrix},  if  c_{i,i}\, c_{i+1,i+1} = c_{i,i+1}\, c_{i+1,i}.   (11)

The local minima are obtained because the four given non-zero terms in Eq. (11) produce zero (or the minimum) contribution to I(T, Y). Suppose a generic form is given for NI(T, Y) = g(I(T, Y)), where g(·) is a normalization function.
From the chain rule of derivatives, it can be seen that the conditions for reaching the local minima do not change. ♦

Remark 6. The non-monotonic property of the information measures implies that these measures may suffer from an intrinsic problem of local minima in classification rankings (referring to M19-M20 in Table 4, [11]). Or, according to Feature 1 of the meta-measures, a rational result for the classification evaluations may not be obtained, owing to the non-monotonic property of the measures. This shortcoming has not been theoretically derived in previous studies ([28][29][32]).

5. Normalized Information Measures based on Information Divergence

In this section, we propose normalized information measures based on the definition of information divergence. In Table 2, we summarize the commonly used divergence measures, denoted D_k(T, Y), which represent the dissimilarity between the two random variables T and Y. In Sections 5 and 6, we apply the following notations for defining marginal distributions:

  p_t(z) = p_t(t = z) = p(t),  and  p_y(z) = p_y(y = z) = p(y),   (12)

where z is a possible scalar value that t or y can take. For a consistent comparison with the previous normalized information measures, we adopt the following transformation on D_k [31]:

  NI_k = \exp(-D_k).   (13)

This transformation provides both inverse and normalization functionalities. It does not introduce any extra parameters, and offers a high degree of simplicity, as in the derivation for examining the local minima. Two more theorems are derived by following an analysis similar to that in the previous section.

Theorem 4. For all NI measures in Table 2, when NI(T, Y) = 1, the classifier corresponds to the case of either an exact classification or a specific misclassification.
Generally, the misclassification in the latter case cannot be removed by switching labels in the confusion matrix.

Proof. When p_y(z) = p_t(z), it is always the case that NI(T, Y) = 1. However, the general conditions for p_y(z) = p_t(z) are as follows:

  p_y(y = z_i) = p_t(t = z_i),  or  \sum_j p_{ji} = \sum_j p_{ij},  i = 1, 2, \dots, m.   (14)

Eq. (14) implies two cases of classifications for D_k(T, Y) = 0, k = 10, \dots, 20. One of these corresponds to an exact classification (y_k = t_k), while the other is the result of a specific misclassification that exhibits y_k ≠ t_k but p_y(z) = p_t(z). In the latter case, switching labels in the confusion matrix to remove the misclassification generally destroys the relation p_y(z) = p_t(z) at the same time. Since this relation is a necessary condition for a perfect classification, the misclassification cannot be removed through a label-switching operation. ♦

Remark 7. Theorem 4 suggests that caution should be applied in interpreting classification evaluations when NI(T, Y) = 1. The maximum of the NIs from the information divergence measures only indicates the equivalence of the marginal probabilities, p_y(z) = p_t(z), which does not always represent an exact classification (y_k = t_k). Theorem 4 thus reveals an intrinsic problem in using such an NI as a measure for similarity evaluations between two datasets, such as in image registration.

Theorem 5. The NI measures based on information divergence generally do not exhibit a monotonic property with respect to the diagonal terms of a confusion matrix.

Proof. The theorem can be proved by examining the existence of multiple maxima for the NI measures based on information divergence. Here we use a binary classification as an example.
The local minima of D_k are obtained when the following conditions hold for a confusion matrix:

  C = \begin{pmatrix} C_1 - d_1 & d_1 & 0 \\ d_2 & C_2 - d_2 & 0 \end{pmatrix}  and  d_1 = d_2,   (15)

where d_1 and d_2 are integers (> 0) counting misclassified samples. The confusion matrix in Eq. (15) produces zero divergence D_k and, therefore, NI_k = 1. However, changing to d_1 ≠ d_2 always results in NI_k < 1. ♦

Remark 8. Theorem 5 indicates another shortcoming of the NIs in the information divergence group from the viewpoint of monotonicity. The reason is once again attributed to the use of marginal distributions in the calculation of divergence. This shortcoming has not been reported in previous investigations ([31][35]).

6. Normalized Information Measures based on Cross-Entropy

In this section, we propose normalized information measures based on cross-entropy, which is defined for discrete random variables as [10]:

  H(T; Y) = -\sum_z p_t(z) \log_2 p_y(z),  or  H(Y; T) = -\sum_z p_y(z) \log_2 p_t(z).   (16)

Note that H(T; Y) differs from the joint entropy H(T, Y) with respect to both notation and definition; the latter is given as [37]:

  H(T, Y) = -\sum_t \sum_y p(t, y) \log_2 p(t, y).   (17)

In fact, from Eq. (16), one can derive the relation between KL divergence (see Table 2) and cross-entropy:

  H(T; Y) = H(T) + KL(T, Y),  or  H(Y; T) = H(Y) + KL(Y, T).   (18)

If H(T) is considered a constant in classification, since the target dataset is generally known and fixed, we can observe from Eq. (18) that cross-entropy shares a similar meaning with KL divergence in representing the dissimilarity between T and Y. From the conditions H ≥ 0 and KL ≥ 0, we are able to realize the normalization of cross-entropy shown in Table 3. Following discussions similar to those in the previous section, we can derive that all information measures listed in Table 3 also satisfy Theorems 4 and 5.

7.
Numerical Examples and Discussions

This section presents several numerical examples together with associated discussions. All calculations for the numerical examples were done using the open-source software Scilab (http://www.scilab.org) and a specific toolbox, freely available as the file "confmatrix2ni.zip" at http://www.openpr.org.cn. The detailed implementation of this toolbox is described in [38]. Table 4 lists six numerical examples of binary classification problems according to the specific scenarios of their confusion matrices. We adopt the notations from [39] for the terms "correct recognition rate (CR)", "error rate (E)", and "reject rate (Rej)" and their relation:

  CR + E + Rej = 1.   (19)

In addition, we define the "accuracy rate (A)" as

  A = \frac{CR}{CR + E}.   (20)

The first four classifications (or models), M1-M4, are provided to show the specific differences with respect to error types and reject types. In this work, we are not concerned with the classifiers applied (say, neural networks or support vector machines), but only with the resulting evaluations from these classifiers. In real applications, it is common to encounter the ranking of classification results as in M1 to M4. The first two classifications, M1 and M2, share the same values of the correct recognition and accuracy rates (CR = A = 99%). The other two classifications, M3 and M4, have the same reject rates (Rej = 1%) and correct recognition rates (CR = 99%); the accuracy rates for M3 and M4 are also the same (A = 100%). This definition is consistent with the conventions in the study of "Accuracy-Reject" curves [16]. Neglecting the specific application backgrounds, users generally form a ranking order of the four classifications so that the "best" one can be selected. The data from other conventional measures, such as Precision, Recall, and F1, are also given in Table 4.
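As a quick illustration of Eqs. (19) and (20), the four rates can be computed directly from an augmented m x (m+1) confusion matrix whose last column counts rejections. This is a minimal sketch, not the interface of the Scilab toolbox [38]; the matrix in the example is a hypothetical one constructed in the spirit of the M3 scenario (one rejected sample in the small class).

```python
def rates(C):
    # C: m x (m+1) confusion matrix; rows are true classes,
    # the last column counts rejected samples.
    n = sum(sum(row) for row in C)
    m = len(C)
    CR = sum(C[i][i] for i in range(m)) / n   # correct recognition rate
    Rej = sum(row[m] for row in C) / n        # reject rate
    E = 1.0 - CR - Rej                        # error rate, from Eq. (19)
    A = CR / (CR + E)                         # accuracy rate, Eq. (20)
    return CR, E, Rej, A

# Hypothetical M3-like matrix: 100 samples, one rejection in the small class.
CR, E, Rej, A = rates([[90, 0, 0],
                       [0, 9, 1]])
```

For this matrix the sketch yields CR = 0.99, E = 0, Rej = 0.01, and A = 1, matching the rates stated above for the M3/M4 scenario.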
Without using extra knowledge about the costs of different error types or reject types, the conventional performance measures are unable to rank the four classifications M1-M4 properly. According to the intuitions of Feature 3 proposed in Section 3, one can obtain two sets of ranking orders for the four classifications M1 to M4, in the forms:

  ℜ(M2) > ℜ(M1),  ℜ(M4) > ℜ(M3),   (21-a)
  ℜ(M4) > ℜ(M2),  ℜ(M3) > ℜ(M1),   (21-b)

where ℜ(·) denotes a ranking operator, so that ℜ(M_i) > ℜ(M_j) expresses that M_i is better than M_j in the ranking. From Eq. (21), one is unable to tell the ranking order between M2 and M3. For a fast comparison, a specific letter is assigned to the ranking order of each model in Table 4 based on Eq. (21):

  ℜ(M4) = A,  ℜ(M3) = B,  ℜ(M2) = B,  ℜ(M1) = C.   (22)

The top rank "A" indicates the "best" classification (M4 in this case) of the four models. Table 4 does not distinguish the ranking order between M2 and M3. However, numerical investigations using information measures will provide the ranking order from the given data. The other two models, M5 and M6, are specifically designed for examining information measures with respect to Theorems 3 and 5 (or Feature 1), respectively. Tables 5 and 6 present the results of the information measures for M1-M6, where the ranking orders among M1-M4 are based on the calculated NI values with the given digits. The following observations can be made from the solutions to the examples.

1) When a normalization function includes the term H(Y), as in the mutual-information group, the associated NI exhibits the desirable feature of varying with the reject rate. NI_2 is effective for this feature even though it only uses H(T) for its normalization.
The effectiveness is attributed to the definition of I_M(T, Y), which calculates mutual information based on the intersection of T and Y.

2) The results for M5 and M6 confirm, respectively, Theorem 3 for local minima and Theorem 5 for maxima of the NIs. The existence of multiple extrema indicates the non-monotonic property with respect to the diagonal terms of the confusion matrix, thereby exhibiting an intrinsic shortcoming of the information measures.

3) For classifications M1 to M4, the meta-measure of Feature 3 suggests the ranking orders shown in Eqs. (21) and (22). However, of all the measures in the three groups, only NI_2 shows consistency with the intuitions on the given examples (Tables 5 and 6). This result indicates that Feature 3 is a difficult property for most information measures to satisfy.

4) None of the performance or information measures investigated in this work fully satisfies the meta-measures. Examining data distinguishability for M1 through M4, we consider the information measures from the mutual-information group more appropriate than those of the other groups (say, NI_12 and NI_22 do not show significant distinguishability, or value differences, over the four models).

The fourth observation supports the proposal of meta-measures for a higher level of classification evaluations. The meta-measures provide users with a simple guideline for selecting "proper" measures according to the specific concerns of their applications. For example, the performance measures (such as A, E, CR, F1, or AUC) satisfy Feature 1, but fail to distinguish error types and reject types directly in an objective evaluation. When Feature 2 or 3 is a main concern, the information measures prove more effective, though not perfect. Of all the information measures investigated, NI_2 is shown to be the "best" for the given examples in terms of Feature 3.
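To make the role of I_M concrete, NI_2 can be computed from a confusion matrix as follows. This is a minimal sketch based on Eq. (8) and Table 1, with j summed over 1..m only (the intersection of T and Y, excluding the reject column), and is not the toolbox implementation of [38].

```python
import math

def ni2(C):
    # NI_2 = I_M(T, Y) / H(T): modified mutual information, computed from
    # the empirical distributions of Eq. (8) but with j restricted to the
    # first m columns, normalized by the target entropy H(T).
    n = sum(sum(row) for row in C)
    m = len(C)
    row_tot = [sum(row) for row in C]                          # C_i
    col_tot = [sum(C[i][j] for i in range(m)) for j in range(m)]
    I_M = 0.0
    for i in range(m):
        for j in range(m):
            if C[i][j] > 0:
                I_M += (C[i][j] / n) * math.log2(
                    n * C[i][j] / (row_tot[i] * col_tot[j]))
    H_T = -sum((Ci / n) * math.log2(Ci / n) for Ci in row_tot if Ci > 0)
    return I_M / H_T
```

With hypothetical counts, an exact classification such as [[50, 0, 0], [0, 50, 0]] returns NI_2 = 1, and a one-sample error in the small class ([[70, 0, 0], [1, 29, 0]], an M1-type matrix) scores lower than the same-size error in the large class ([[69, 1, 0], [0, 30, 0]], an M2-type matrix), as relation (21-a) requires.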
Therefore, more detailed studies, both theoretical and numerical, were made of this promising measure. The theoretical properties of this measure are derived in Appendix A. While Theorem A1 confirms that NI_2 satisfies Feature 3 around the exact classifications, Theorem A2 indicates that this measure is able to adjust the ranking order between a misclassification of a large class and a rejection of a small class. Table 7 shows two sets of confusion matrices that are similar to M1-M4 in Table 4. One can observe the changes in ranking orders among them. These changes numerically confirm Theorem A2 and its critical point, or cross-over point (Ω = C_1/n ≈ 0.942), for the given data.

Further investigations were carried out on three-class problems. Although some NIs could be removed directly based on their poor performance with respect to the meta-measures (such as NI_1 and NI_9 on Feature 2), they were retained to demonstrate pros and cons in the applications. At this stage, we extend the concepts of error types and reject types to multiple classes. Nine examples are specifically designed in Table 8. The ranking order of each model, derived from the intuitions of Feature 3, is shown in Table 8. From Tables 9 and 10, it is interesting to see that NI_2 remains the most appropriate measure for classification evaluations. Using this measure, we can select the "best" and "worst" classifications consistently with our intuition. All other measures perform below our expectations in distinguishing error types and reject types properly.

The numerical study supports the viewpoint that no universally superior measure exists. For example, in comparison with the information measure NI_2, the conventional accuracy measure satisfies Feature 1, but does not qualify for Feature 3.
Thus, any measure, whether performance-based or information-based, should be designed and evaluated within the context of the specific applications. It is evident that the desirable features of the specific applications become more crucial (or "proper") for evaluation measures than some generic mathematical properties. For example, information measures (such as KL divergence) that may not satisfy the properties of a metric (say, symmetry) are able to process classification evaluations that include a reject option; they provide more applicable power than the conventional performance measures in abstaining classifications. However, we still need a complete picture of the information measures with respect to their advantages as well as their limitations. The examples in Tables 4, 7, and 8 only present limited scenarios of variations in confusion matrices. Using the open-source toolbox from [38], one is able to test more scenarios in numerical investigations.

8. Summary

In this work, we investigated objective evaluations of classifications by introducing normalized information measures. We reviewed the related works and discussed objectivity and its formal definition in evaluations. Objective evaluations may be required under different application backgrounds. In classifications, for example, exact knowledge of misclassification costs is sometimes unavailable in evaluations. Moreover, cases of ignorance regarding reject costs appear even more often in scenarios of abstaining classifications. In these cases, although subjective evaluations can be applied, the user-given data for the unknown abstention costs will lead to a much higher degree of uncertainty or inconsistency. We believe that an objective evaluation can be a suitable, as well as a complementary, approach to subjective evaluations.
In some situations, an objective evaluation is considered useful even when subjective evaluations are reasonable for the applications. The results from both objective and subjective evaluations give users an overall picture of the quality of the classification results.

Considering that abstaining classifications are becoming more popular, we focused on the distinctions between error types and reject types within objective evaluations of classifications. First, we proposed three meta-measures for assessing classifications, which seem more relevant and proper than the properties of metrics in the context of classification applications. The meta-measures provide users with useful guidelines for a quick selection of candidate measures. Second, we tried to systematically enrich the classification evaluation bank by including commonly used information measures. Contrary to the conventional performance measures that apply empirical formulas, the information measures are theoretically more sound for objective evaluations of classifications. The key advantage of these measures over the conventional ones is their ability to handle multi-class classification evaluations with a reject option. Third, we revealed theoretically the intrinsic shortcomings of the information measures. These have not been formally reported before in studies of image registration, feature selection, or similarity ranking. The discovery of these shortcomings is very important for users to interpret their results correctly when applying those measures.

Based on the principle of the "No Free Lunch Theorem" [15], we recognize that there are no "universally superior" measures [5]. It is not our aim to replace the conventional performance measures, but to explore information measures systematically in classification evaluations. The theoretical study demonstrates the strengths and weaknesses of the information measures.
Numerical investigations, conducted on binary and three-class classifications, confirmed that objective evaluation is not an easy topic in the study of machine learning. One of the most challenging tasks will be the exploration of novel measures that satisfy all meta-measures as well as the metric properties in objective evaluations of classifications. It is also necessary to define the "ranking order" intuitions among error types and reject types in generic classifications, which will form the basis of quantitative meta-measures. However, this task becomes more difficult when classifications go beyond two classes.

Acknowledgment

This work is supported in part by the Natural Science Foundation of China (#61075051).

Appendix A. Theorems and Sensitivity Functions of NI_2 for Binary Classifications

Theorem A1: For a binary classification defined by:

  C = \begin{pmatrix} TN & FP & RN \\ FN & TP & RP \end{pmatrix},  and   (A1-a)

  C_1 = TN + FP + RN,  C_2 = FN + TP + RP,  C_1 + C_2 = n,   (A1-b)

NI_2 satisfies Feature 3 with respect to error types and reject types around the exact classifications. Specifically, for the four confusion matrices below:

  M_1 = \begin{pmatrix} C_1 & 0 & 0 \\ d & C_2 - d & 0 \end{pmatrix},  M_2 = \begin{pmatrix} C_1 - d & d & 0 \\ 0 & C_2 & 0 \end{pmatrix},
  M_3 = \begin{pmatrix} C_1 & 0 & 0 \\ 0 & C_2 - d & d \end{pmatrix},  M_4 = \begin{pmatrix} C_1 - d & 0 & d \\ 0 & C_2 & 0 \end{pmatrix},   (A2)

the following relations hold:

  NI_2(M_1) < NI_2(M_2)  and  NI_2(M_3) < NI_2(M_4),   (A3-a)
  NI_2(M_1) < NI_2(M_3)  and  NI_2(M_2) < NI_2(M_4),   (A3-b)

where

  C_1 > C_2 > d > 0.   (A3-c)

Proof. For a binary classification, NI_2 is defined by the modified mutual information:

  NI_2 = \frac{I_M(T, Y)}{H(T)},  and

  I_M(T, Y) = \frac{TN}{n} \log_2 \frac{n\,TN}{C_1 (TN + FN)} + \frac{FP}{n} \log_2 \frac{n\,FP}{C_1 (TP + FP)} + \frac{FN}{n} \log_2 \frac{n\,FN}{C_2 (FN + TN)} + \frac{TP}{n} \log_2 \frac{n\,TP}{C_2 (FP + TP)}.   (A4)

Let M_0 be the confusion matrix corresponding to the exact classification:

  M_0 = \begin{pmatrix} C_1 & 0 & 0 \\ 0 & C_2 & 0 \end{pmatrix}.
  (A5)

Based on the definition of I_M in (A4), one can calculate the mutual-information differences between two models. Taking M_0 as the baseline, we obtain the analytical results below for the four models:

  ΔI_10 = I_M(M_1) - I_M(M_0) = \frac{1}{n} \left( C_1 \log_2 \frac{C_1}{C_1 + d} + d \log_2 \frac{d}{C_1 + d} \right),   (A6-a)
  ΔI_20 = I_M(M_2) - I_M(M_0) = \frac{1}{n} \left( C_2 \log_2 \frac{C_2}{C_2 + d} + d \log_2 \frac{d}{C_2 + d} \right),   (A6-b)
  ΔI_30 = I_M(M_3) - I_M(M_0) = \frac{d}{n} \log_2 \frac{C_2}{n},   (A6-c)
  ΔI_40 = I_M(M_4) - I_M(M_0) = \frac{d}{n} \log_2 \frac{C_1}{n}.   (A6-d)

Under the given assumption C_1 > C_2 > d > 0, all the ΔIs above are negative, so their absolute values represent the absolute costs in the classifications. One can directly prove |ΔI_30| > |ΔI_40| from (A6-c) and (A6-d). The procedure for the proof of |ΔI_10| > |ΔI_20| is given below. First, one needs to confirm that the following two functions are strictly decreasing (x_1 < x_2 implies g(x_1) > g(x_2)):

  g_1(x) = \left( \frac{x}{x + d} \right)^{x}  and  g_2(x) = \left( \frac{d}{x + d} \right)^{d},  for x > 0, d > 0.   (A7-a)

Then, from the monotonically decreasing property in (A7-a), one can derive the following relations:

  C_1 > C_2 \Rightarrow \left( \frac{C_1}{C_1 + d} \right)^{C_1} < \left( \frac{C_2}{C_2 + d} \right)^{C_2} < 1  and  \left( \frac{d}{C_1 + d} \right)^{d} < \left( \frac{d}{C_2 + d} \right)^{d} < 1
  \Rightarrow \frac{1}{n} \left| C_2 \log_2 \frac{C_2}{C_2 + d} + d \log_2 \frac{d}{C_2 + d} \right| < \frac{1}{n} \left| C_1 \log_2 \frac{C_1}{C_1 + d} + d \log_2 \frac{d}{C_1 + d} \right|
  \Rightarrow |ΔI_20| < |ΔI_10|.   (A7-b)

The relations in (A3-a) are thus achieved for NI_2, because its normalization term, H(T), is a constant for the given C_1 and C_2. This confirms the satisfaction of Feature 3 within the error types and within the reject types around the exact classifications, respectively. Next comes the proof of relation (A3-b), which states that a misclassification suffers a higher cost than a rejection within the same class. Feature 3 considers this relation a basic property of classifications for the comparison between error and reject types.
The procedure for the proof is:

  C_1 > C_2 \Rightarrow C_1 C_2 + C_1 d > (C_1 + C_2) d = nd \Rightarrow 1 > \frac{C_1}{n} > \frac{d}{C_2 + d}
  \Rightarrow \left| \log_2 \frac{C_1}{n} \right| < \left| \log_2 \frac{d}{C_2 + d} \right|
  \Rightarrow \frac{1}{n} \left| d \log_2 \frac{C_1}{n} \right| < \frac{1}{n} \left| d \log_2 \frac{d}{C_2 + d} \right| < \frac{1}{n} \left| C_2 \log_2 \frac{C_2}{C_2 + d} + d \log_2 \frac{d}{C_2 + d} \right|
  \Rightarrow |ΔI_40| < |ΔI_20|,   (A8-a)

  C_1 + d < n \Rightarrow C_1 (C_1 + d) + nd < C_1 n + nd \Rightarrow \frac{C_1 (C_1 + d) + nd}{n (C_1 + d)} < 1 \Rightarrow \frac{C_1}{n} + \frac{d}{C_1 + d} < 1
  \Rightarrow \frac{d}{C_1 + d} < \frac{C_2}{n} < 1 \Rightarrow \left| \log_2 \frac{C_2}{n} \right| < \left| \log_2 \frac{d}{C_1 + d} \right|
  \Rightarrow \frac{1}{n} \left| d \log_2 \frac{C_2}{n} \right| < \frac{1}{n} \left| C_1 \log_2 \frac{C_1}{C_1 + d} + d \log_2 \frac{d}{C_1 + d} \right|
  \Rightarrow |ΔI_30| < |ΔI_10|.   (A8-b)
♦

Theorem A2: Given the conditions (A1)-(A2) and C_1 > C_2 > d > 0, NI_2 satisfies the following relations:

  NI_2(M_4) > NI_2(M_3) > NI_2(M_2) > NI_2(M_1)  for 0.5 < p_1 < Ω ≤ 1,   (A9-a)
  NI_2(M_4) > NI_2(M_2) > NI_2(M_3) > NI_2(M_1)  for 0.5 < Ω < p_1 ≤ 1,   (A9-b)

where we set p_1 = C_1/n, and Ω is an upper boundary for the validity of (A9-a).

Proof. The first relation states that the ranking order in (A9-a) is valid only for a certain range of p_1. The lower boundary results from the assumption that C_1 is the large class. The upper boundary, Ω, is determined by the cross-over point between the functions of (A6-b) and (A6-c). For a better understanding of the relations (A9), we present the plots of "ΔI vs. p_1" for n = 100 and d = 1 in Fig. A1.

[Figure A1: Plots of "ΔI vs. p_1 (%)" for n = 100 and d = 1. Black solid: ΔI_10; black dash: ΔI_20; blue solid: ΔI_30; blue dash: ΔI_40.]
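The cross-over behavior plotted in Fig. A1 can be reproduced numerically from Eqs. (A6-a)-(A6-d). The sketch below treats C_1 and C_2 as continuous quantities for bisection, although they are integer counts in the paper, and locates the single crossing of ΔI_20 and ΔI_30 for n = 100 and d = 1.

```python
import math

def delta_I(p1, n=100.0, d=1.0):
    # Eqs. (A6-a)-(A6-d): Delta I_k0 = I_M(M_k) - I_M(M_0), k = 1..4,
    # with C1 = p1 * n and C2 = n - C1.
    C1 = p1 * n
    C2 = n - C1
    dI10 = (C1 * math.log2(C1 / (C1 + d)) + d * math.log2(d / (C1 + d))) / n
    dI20 = (C2 * math.log2(C2 / (C2 + d)) + d * math.log2(d / (C2 + d))) / n
    dI30 = (d / n) * math.log2(C2 / n)
    dI40 = (d / n) * math.log2(C1 / n)
    return dI10, dI20, dI30, dI40

def crossover(n=100.0, d=1.0, lo=0.51, hi=0.99, iters=60):
    # Bisection on f(p1) = Delta I_20 - Delta I_30 (cf. Eq. (A10));
    # Fig. A1 shows a single root in the range p1 > 0.5.
    def f(p1):
        dI = delta_I(p1, n, d)
        return dI[1] - dI[2]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For n = 100 and d = 1, the crossing falls near p_1 ≈ 0.94, and below it all four ΔIs are negative with |ΔI_40| < |ΔI_30| < |ΔI_20| < |ΔI_10| (e.g., at p_1 = 0.7), which is the general ranking order of (A9-a).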
To examine the validity range of (A9-a), one needs to calculate the cross-over point by solving the equation:

  f = ΔI_20 - ΔI_30 = \frac{1}{n} \left( C_2 \log_2 \frac{C_2}{C_2 + d} + d \log_2 \frac{d\,n}{C_2 (C_2 + d)} \right) = 0.   (A10)

There exists no closed-form solution for Ω. Based on the monotonicity of the related functions and the relations in (A3), one is able to confirm the conditions in (A9-a) and (A9-b), respectively. Fig. A1 shows numerically that only a single cross-over point appears in the range p_1 > 0.5 (or C_1 > C_2). ♦

Remark A1: We denote by Ω(n, d) the cross-over point obtained from f, with the two independent variables n and d. The value of Ω increases with n, but decreases with d; a numerical solution for Ω is required. The physical interpretation of Ω is a critical point at which a rejection within the small class has the same cost as a misclassification within the large class. This situation generally does not occur except in classifications with largely skewed classes (say, C_1 >> C_2). Therefore, we call the ranking order in (A9-a) the general ranking order, and that in (A9-b) the largely-skewed-class ranking order.

Sensitivity functions: The sensitivity functions are given in the conventional forms for delivering approximation analysis of I_M:

  \frac{\partial I_M}{\partial TN} = \frac{1}{n} \left[ \log_2 \frac{n}{C_1} + \left( \log_2 \frac{TN}{TN + FN} \right) \mathrm{sgn}(TN) \right],   (A11-a)
  \frac{\partial I_M}{\partial TP} = \frac{1}{n} \left[ \log_2 \frac{n}{C_2} + \left( \log_2 \frac{TP}{TP + FP} \right) \mathrm{sgn}(TP) \right],   (A11-b)
  \frac{\partial I_M}{\partial FN} = \frac{1}{n} \left[ \log_2 \frac{n}{C_2} + \left( \log_2 \frac{FN}{FN + TN} \right) \mathrm{sgn}(FN) \right],   (A11-c)
  \frac{\partial I_M}{\partial FP} = \frac{1}{n} \left[ \log_2 \frac{n}{C_1} + \left( \log_2 \frac{FP}{FP + TP} \right) \mathrm{sgn}(FP) \right],   (A11-d)
  \frac{\partial I_M}{\partial RN} = -\frac{\partial I_M}{\partial TN} - \frac{\partial I_M}{\partial FP},   (A11-e)
  \frac{\partial I_M}{\partial RP} = -\frac{\partial I_M}{\partial FN} - \frac{\partial I_M}{\partial TP},   (A11-f)

where sgn(·) is a sign function enforcing the definition H(0) = 0. Only four independent variables describe the sensitivity functions, because of the two constraints in (A1-b).
Hence, a chain rule is applied in deriving the functions (A11-e) and (A11-f). ♦

Remark A2: Using Eq. (A11), we failed to reach conclusions as reasonable as those in Theorem A1, because the first-order differentials may not be sufficient for the analysis around the exact classifications. For example, we obtained the results:

  I(M_1) - I(M_0) ≈ (TP_1 - TP_0) \frac{\partial I_M(M_0)}{\partial TP} + (FN_1 - FN_0) \frac{\partial I_M(M_0)}{\partial FN} = -\frac{d}{n} \log_2 \frac{n}{C_2} + \frac{d}{n} \log_2 \frac{n}{C_2} = 0,   (A12-a)
  I(M_2) - I(M_0) ≈ (TN_1 - TN_0) \frac{\partial I_M(M_0)}{\partial TN} + (FP_1 - FP_0) \frac{\partial I_M(M_0)}{\partial FP} = -\frac{d}{n} \log_2 \frac{n}{C_1} + \frac{d}{n} \log_2 \frac{n}{C_1} = 0.   (A12-b)

This observation suggests that one needs to be cautious when using sensitivity functions for approximation analysis of I_M (or NI_2).

References

[1] C. X. Ling, J. Huang, H. Zhang, AUC: a statistically consistent and more discriminating measure than accuracy, in: the 18th International Joint Conference on Artificial Intelligence (IJCAI-2003), 2003, pp. 519-526.
[2] M. M. S. Beg, A subjective measure of web search quality, Information Sciences 169 (2005) 365-381.
[3] N. Japkowicz, Why question machine learning evaluation methods?, in: AAAI-06 Evaluation Methods for Machine Learning Workshop, 2006, pp. 6-11.
[4] T. Pietraszek, Classification of intrusion detection alerts using abstaining classifiers, Intelligent Data Analysis 11 (2007) 293-316.
[5] N. Lavesson, P. Davidsson, Analysis of multi criteria methods for algorithm and classifier evaluation, in: the 24th Annual Workshop of the Swedish, 2007.
[6] S. Vanderlooy, E. Hüllermeier, A critical analysis of variants of the AUC, Machine Learning 72 (3) (2008) 247-262.
[7] D. J. Hand, Measuring classifier performance: a coherent alternative to the area under the ROC curve, Machine Learning 77 (1) (2009) 103-123.
[8] Y. Y. Yao, S. K. M. Wong, C. J.
Butz, On information-theoretic measures of attribute importance, in: PAKDD, Beijing, China, 1999, pp. 133-137.
[9] J. Principe, D. Xu, Q. Zhao, J. Fisher, Learning from examples with information-theoretic criteria, J. VLSI Signal Processing 26 (2000) 61-77.
[10] C. M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.
[11] B. G. Hu, Y. Wang, Evaluation criteria based on mutual information for classifications including rejected class, Acta Automatica Sinica 34 (2008) 1396-1403.
[12] M.-R. Temanni, S. A. Nadeem, D. Berrar, J. D. Zucker, Optimizing abstaining classifiers using ROC analysis, in: CAMDA, 2007.
[13] P. Domingos, MetaCost: A general method for making classifiers cost-sensitive, in: the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 155-164.
[14] C. Elkan, The foundations of cost-sensitive learning, in: the 17th International Joint Conference on Artificial Intelligence (IJCAI-01), 2001, pp. 973-978.
[15] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification (2nd ed.), John Wiley, NY, 2001.
[16] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, Cambridge, 2003.
[17] C. Ferri, J. Hernández-Orallo, R. Modroiu, An experimental comparison of performance measures for classification, Pattern Recognition Letters 30 (2009) 27-38.
[18] N. Japkowicz, M. Shah, Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, 2011.
[19] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006) 861-874.
[20] C. Drummond, R. C. Holte, Cost curves: An improved method for visualizing classifier performance, Machine Learning 65 (1) (2006) 95-130.
[21] A. Andersson, P. Davidsson, J. Linden, Measure-based classifier performance evaluation, Pattern Recognition Letters 20 (1999) 1165-1173.
[22] C.
de Stefano, C. Sansone, M. Vento, To reject or not to reject: That is the question - an answer in case of neural classifiers, IEEE Trans. on Systems, Man and Cybernetics 30 (2000) 84-94.
[23] T. Landgrebe, D. M. J. Tax, P. Paclik, R. P. W. Duin, The interaction between classification and reject performance for distance-based reject-option classifiers, Pattern Recognition Letters 27 (2006) 908-917.
[24] G. Iannello, G. Percannella, C. Sansone, P. Soda, On the use of classification reliability for improving performance of the one-per-class decomposition method, Data Knowl. Eng. 68 (2009) 1398-1410.
[25] S. Vanderlooy, I. Sprinkhuizen-Kuyper, E. Smirnov, J. van den Herik, The ROC isometrics approach to construct reliable classifiers, Intelligent Data Analysis 13 (2009) 3-37.
[26] T. Kvålseth, Entropy and correlation: Some comments, IEEE Transactions on Systems, Man, and Cybernetics 17 (1987) 517-519.
[27] T. D. Wickens, Multiway Contingency Tables Analysis for the Social Sciences, Lawrence Erlbaum, Hillsdale, New Jersey, 1989.
[28] J. T. Finn, Use of the average mutual information index in evaluating classification error and consistency, Inter. J. of Geographical Information Systems 7 (1993) 349-366.
[29] A. Forbes, Classification-algorithm evaluation: Five performance measures based on confusion matrices, J. Clinical Monitoring and Computing 11 (1995) 189-206.
[30] I. Kononenko, I. Bratko, Information-based evaluation criterion for classifier's performance, Machine Learning 6 (1991) 67-80.
[31] R. Nishii, S. Tanaka, Accuracy and inaccuracy assessments in land-cover classification, IEEE Trans. Geoscience and Remote Sensing 37 (1999) 491-498.
[32] P.-N. Tan, V. Kumar, J. Srivastava, Selecting the right objective measure for association analysis, Information Systems 29 (2004) 293-313.
[33] Y. Wang, B.-G.
Hu, Deri va tions of normalized mutual information in binary classificat ions, in: the 6th International Conference on Fuzzy Systems and Kno wledge Discov ery , 2009, pp. 155–163. [34] J. O. Berger , The case for objecti ve bayesian analysis, Bayesian Analysis 1 (2005) 385–402. [35] M. Li, X. Chen, X. Li, B. Ma, M. V itanyi , The similarit y metric, IEEE Trans. Inf ormation Theory 50 (2004) 3250–3264. [36] C. Tsallis, Possible generaliza tion of boltz mann-gibbs statistic s, J. Stats. Physics 52 (1988) 479–487. [37] T . M. Cov er , J. A. Thomas, Elements of Information Theory , John Wi ley , NY , 1995. [38] B. G. Hu, Informati on measure toolbox for classifier ev alua tion on open source s oftw are scilab, in: IEEE Internatio nal W orkshop on Open- source Softwa re for Scientific Computat ion, 2009, pp. 179–184. [39] C. K. Chow , On optimum recogni tion error and rejec t tradeo ff , IEEE Trans. on Inf ormation Theory 16 (1970) 41–46. [40] A. Strehl, J. Ghosh, Clustering ensembles, a kno wledge reuse framew ork for combinin g multiple partiti ons, J. Machine Learning Research 3 (2002) 583–617 . [41] F . Malv estuto, Statistical treatment of the information content of a database, Information Systems 11 (1986) 211–223. [42] S. Kullback, R. A. Leibler , On information and su ffi ciency , The Annals of Mathemati cal Statistic s 22 (1951) 79–86. [43] D. Johnson, S. Sinanovic , Symmetrizing the kullback-le ibler distance, Rice Univ ersity , W orking Pa per . [44] I. Csiszar , Eine informationthe oretische ungleichung und ihre anwendung auf den bewei s der ergodizit at von marko ff sche n ket ten, Publ. Math. Inst. Hungar . Acad. Sci. 8 (1963) 85–108. [45] J. Lin, Div ergenc e measures based on the shannon entropy , IEEE Tra ns. Information Theory 37 (1991) 145–151. [46] D. Malerba, F . Es posito, M. Monopoli, Comparing dissimilari ty measures for probabilistic symbolic object s, Data Mining III, Series Man- agement Informati on Systems 6 (2002) 31–40. 
Table 1: NI measures within the mutual-information based group.

No.  Name [Reference]                      Formula for NI_k
1    NI based on mutual information [28]   NI_1(T,Y) = I(T,Y) / H(T)
2    NI based on mutual information [11]   NI_2(T,Y) = I_M(T,Y) / H(T)
3    NI based on mutual information [28]   NI_3(T,Y) = I(T,Y) / H(Y)
4    NI based on mutual information        NI_4(T,Y) = (1/2) [ I(T,Y)/H(T) + I(T,Y)/H(Y) ]
5    NI based on mutual information [26]   NI_5(T,Y) = 2 I(T,Y) / ( H(T) + H(Y) )
6    NI based on mutual information [40]   NI_6(T,Y) = I(T,Y) / sqrt( H(T) H(Y) )
7    NI based on mutual information [41]   NI_7(T,Y) = I(T,Y) / H(T,Y)
8    NI based on mutual information [26]   NI_8(T,Y) = I(T,Y) / max( H(T), H(Y) )
9    NI based on mutual information [26]   NI_9(T,Y) = I(T,Y) / min( H(T), H(Y) )

Table 2: Information measures within the divergence based group ( NI_k = exp(-D_k) ).

No.  Name of D_k [Reference]          Formula for D_k
10   ED-Quadratic Divergence [9]      D_10 = QD_ED(T,Y) = sum_z ( p_t(z) - p_y(z) )^2
11   CS-Quadratic Divergence [9]      D_11 = QD_CS(T,Y) = log2 { sum_z p_t(z)^2 sum_z p_y(z)^2 / [ sum_z p_t(z) p_y(z) ]^2 }
12   KL Divergence [42]               D_12 = KL(T,Y) = sum_z p_t(z) log2( p_t(z) / p_y(z) )
13   Bhattacharyya Distance [43]      D_13 = D_B(T,Y) = -log2 sum_z sqrt( p_t(z) p_y(z) )
14   Chi^2 (Pearson) Divergence [44]  D_14 = chi^2(T,Y) = sum_z ( p_t(z) - p_y(z) )^2 / p_y(z)
15   Hellinger Distance [44]          D_15 = H^2(T,Y) = sum_z ( sqrt(p_t(z)) - sqrt(p_y(z)) )^2
16   Variation Distance [44]          D_16 = V(T,Y) = sum_z | p_t(z) - p_y(z) |
17   J Divergence [45]                D_17 = J(T,Y) = sum_z p_t(z) log2( p_t(z)/p_y(z) ) + sum_z p_y(z) log2( p_y(z)/p_t(z) )
18   L (or JS) Divergence [45]        D_18 = L(T,Y) = KL(T,M) + KL(Y,M), M = ( p_t(z) + p_y(z) ) / 2
19   Symmetric Chi^2 Divergence [46]  D_19 = chi^2_S(T,Y) = sum_z ( p_t(z) - p_y(z) )^2 / p_y(z) + sum_z ( p_y(z) - p_t(z) )^2 / p_t(z)
20   Resistor Average Distance [43]   D_20 = D_RA(T,Y) = KL(T,Y) KL(Y,T) / ( KL(T,Y) + KL(Y,T) )

Table 3: NI measures within the cross-entropy based group.

No.  Name                       Formula for NI_k
21   NI based on cross-entropy  NI_21 = H(T) / H(T;Y), H(T;Y) = -sum_z p_t(z) log2 p_y(z)
22   NI based on cross-entropy  NI_22 = H(Y) / H(Y;T), H(Y;T) = -sum_z p_y(z) log2 p_t(z)
23   NI based on cross-entropy  NI_23 = (1/2) [ H(T)/H(T;Y) + H(Y)/H(Y;T) ]
24   NI based on cross-entropy  NI_24 = [ H(T) + H(Y) ] / [ H(T;Y) + H(Y;T) ]

Table 4: Numerical examples in binary classifications (M1-M4 and M6: C1 = 90, C2 = 10; M5: C1 = 95, C2 = 5). (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model       M1          M2          M3          M4          M5          M6
(Ranking)   (C)         (B)         (B)         (A)
C           [90 0 0;    [89 1 0;    [90 0 0;    [89 0 1;    [57 38 0;   [89 1 0;
             1 9 0]      0 10 0]     0 9 1]      0 10 0]     3 2 0]      1 9 0]
CR          0.990       0.990       0.990       0.990       0.590       0.980
Rej         0.000       0.000       0.010       0.010       0.000       0.000
Precision   0.989       1.000       1.000       1.000       0.950       0.989
Recall      1.000       0.989       1.000       1.000       0.600       0.989
F1          0.994       0.994       1.000       1.000       0.735       0.989

Table 5: Results for the models in Table 4 on information measures from the mutual-information and cross-entropy groups. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top; the letter in the Model column is the model's overall ranking from Table 4.

Model  NI_1   NI_2   NI_3   NI_4   NI_5   NI_6   NI_7   NI_8   NI_9   NI_21  NI_22  NI_23  NI_24
M1     0.831  0.831  0.893  0.862  0.860  0.861  0.755  0.831  0.893  0.998  0.998  0.998  0.998
(C)    (D)    (D)    (B)    (D)    (D)    (D)    (D)    (D)    (D)    (A)    (A)    (A)    (A)
M2     0.897  0.897  0.841  0.869  0.868  0.869  0.767  0.841  0.897  0.998  0.998  0.998  0.998
(B)    (C)    (C)    (D)    (C)    (C)    (C)    (C)    (C)    (C)    (A)    (A)    (A)    (A)
M3     1.000  0.929  0.909  0.955  0.952  0.953  0.909  0.909  1.000  0.969  0.000  0.484  0.000
(B)    (A)    (B)    (A)    (A)    (A)    (A)    (A)    (A)    (A)    (D)    (B)    (C)    (B)
M4     1.000  0.997  0.855  0.928  0.922  0.925  0.855  0.855  1.000  0.970  0.000  0.485  0.000
(A)    (A)    (A)    (C)    (B)    (B)    (B)    (B)    (B)    (A)    (C)    (B)    (B)    (B)
M5     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.374  0.548  0.461  0.495
M6     0.731  0.731  0.731  0.731  0.731  0.731  0.576  0.731  0.731  1.000  1.000  1.000  1.000

Table 6: Results for the models in Table 4 on information measures from the divergence group. S = singularity which cannot be removed. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model  NI_10   NI_11   NI_12   NI_13   NI_14   NI_15   NI_16   NI_17   NI_18   NI_19   NI_20
M1     0.9998  0.9998  0.9991  0.9998  0.9988  0.9997  0.9802  0.9983  0.9996  0.9977  0.9996
(C)    (A)     (A)     (B)     (A)     (B)     (A)     (A)     (B)     (A)     (B)     (A)
M2     0.9998  0.9998  0.9992  0.9998  0.9990  0.9997  0.9802  0.9985  0.9996  0.9979  0.9996
(B)    (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)
M3     0.9998  0.9996  0.9849  0.9926  0.9890  0.9898  0.9802  S       0.9897  S       S
(B)    (A)     (D)     (D)     (D)     (D)     (D)     (A)             (D)
M4     0.9998  0.9998  0.9856  0.9928  0.9899  0.9900  0.9802  S       0.9900  S       S
(A)    (A)     (A)     (C)     (C)     (C)     (C)     (A)             (C)
M5     0.7827  0.6473  0.6189  0.8540  0.6002  0.8129  0.4966  0.2775  0.7550  0.0455  0.7406
M6     1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  S

Table 7: Numerical examples in binary classifications (n = 100). (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model       M1a        M2a        M3a        M4a        M1b        M2b        M3b        M4b
C           [94 0 0;   [93 1 0;   [94 0 0;   [93 0 1;   [95 0 0;   [94 1 0;   [95 0 0;   [94 0 1;
             1 5 0]     0 6 0]     0 5 1]     0 6 0]     1 4 0]     0 5 0]     0 4 1]     0 5 0]
CR          0.99       0.99       0.99       0.99       0.99       0.99       0.99       0.99
(Rejection) (0.00)     (0.00)     (0.01)     (0.01)     (0.00)     (0.00)     (0.01)     (0.01)
NI_2        0.756      0.874      0.876      0.997      0.720      0.864      0.849      0.997
(Ranking)   (D)        (C)        (B)        (A)        (D)        (B)        (C)        (A)

Table 8: Classification examples in three classes (C1 = 80, C2 = 15, C3 = 5). (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model       M7           M8           M9           M10          M11
(Ranking)   (C)          (C)          (B)          (B)          (B)
C           [80 0 0 0;   [80 0 0 0;   [80 0 0 0;   [80 0 0 0;   [80 0 0 0;
             0 15 0 0;    0 15 0 0;    0 15 0 0;    1 14 0 0;    0 14 1 0;
             1 0 4 0]     0 1 4 0]     0 0 4 1]     0 0 5 0]     0 0 5 0]
CR          0.99         0.99         0.99         0.99         0.99
Rej         0.00         0.00         0.01         0.00         0.00

Model       M12          M13          M14          M15
(Ranking)   (B)          (B)          (B)          (A)
C           [80 0 0 0;   [79 1 0 0;   [79 0 1 0;   [79 0 0 1;
             0 14 0 1;    0 15 0 0;    0 15 0 0;    0 15 0 0;
             0 0 5 0]     0 0 5 0]     0 0 5 0]     0 0 5 0]
CR          0.99         0.99         0.99         0.99
Rej         0.01         0.00         0.00         0.01

Table 9: Results for the models in Table 8 on information measures from the mutual-information and cross-entropy groups. S = singularity which cannot be removed. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model  NI_1   NI_2   NI_3   NI_4   NI_5   NI_6   NI_7   NI_8   NI_9   NI_21  NI_22  NI_23  NI_24
M7     0.912  0.912  0.957  0.935  0.934  0.934  0.876  0.912  0.957  0.998  0.998  0.998  0.998
(F)    (F)    (F)    (C)    (G)    (G)    (G)    (F)    (H)    (E)    (D)    (C)    (C)    (C)
M8     0.939  0.939  0.958  0.949  0.949  0.949  0.902  0.939  0.958  0.998  0.998  0.998  0.998
(F)    (E)    (E)    (B)    (D)    (D)    (D)    (D)    (D)    (D)    (D)    (C)    (C)    (C)
M9     1.000  0.951  0.961  0.980  0.980  0.980  0.961  0.961  1.000  0.982  0.000  0.491  0.000
(C)    (A)    (D)    (A)    (A)    (A)    (A)    (A)    (A)    (A)    (G)    (G)    (I)    (G)
M10    0.912  0.912  0.938  0.925  0.925  0.925  0.860  0.912  0.938  0.999  0.999  0.999  0.999
(E)    (F)    (F)    (F)    (I)    (I)    (I)    (H)    (H)    (G)    (A)    (A)    (A)    (A)
M11    0.956  0.956  0.941  0.948  0.948  0.948  0.902  0.941  0.956  0.998  0.998  0.998  0.998
(E)    (D)    (C)    (E)    (E)    (E)    (E)    (D)    (C)    (E)    (B)    (C)    (C)    (C)
M12    1.000  0.969  0.943  0.972  0.971  0.971  0.943  0.943  1.000  0.983  0.000  0.492  0.000
(B)    (A)    (B)    (D)    (B)    (B)    (B)    (B)    (B)    (A)    (F)    (G)    (G)    (G)
M13    0.939  0.939  0.915  0.927  0.927  0.927  0.863  0.915  0.939  0.999  0.999  0.999  0.999
(D)    (E)    (E)    (I)    (H)    (H)    (H)    (G)    (G)    (F)    (A)    (A)    (A)    (A)
M14    0.956  0.956  0.916  0.936  0.935  0.936  0.879  0.916  0.956  0.998  0.998  0.998  0.998
(D)    (D)    (C)    (H)    (F)    (F)    (F)    (E)    (F)    (E)    (D)    (C)    (C)    (C)
M15    1.000  0.996  0.919  0.960  0.958  0.959  0.919  0.919  1.000  0.984  0.000  0.492  0.000
(A)    (A)    (A)    (G)    (C)    (C)    (C)    (C)    (E)    (A)    (E)    (G)    (G)    (G)

Table 10: Results for the models in Table 8 on information measures from the divergence group. S = singularity which cannot be removed. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model  NI_10   NI_11   NI_12   NI_13   NI_14   NI_15   NI_16   NI_17   NI_18   NI_19   NI_20
M7     0.9998  0.9998  0.9982  0.9996  0.9974  0.9994  0.9802  0.9966  0.9992  0.9953  0.9992
(F)    (A)     (A)     (D)     (C)     (E)     (D)     (A)     (D)     (D)     (E)     (D)
M8     0.9998  0.9996  0.9979  0.9995  0.9969  0.9993  0.9802  0.9959  0.9990  0.9942  0.9990
(F)    (A)     (E)     (E)     (D)     (F)     (E)     (A)     (F)     (F)     (F)     (F)
M9     0.9998  0.9996  0.9840  0.9924  0.9876  0.9895  0.9802  S       0.9893  S       S
(C)    (A)     (E)     (H)     (G)     (I)     (H)     (A)             (H)
M10    0.9998  0.9997  0.9994  0.9999  0.9992  0.9998  0.9802  0.9988  0.9997  0.9984  0.9997
(E)    (A)     (C)     (A)     (A)     (A)     (A)     (A)     (B)     (A)     (C)     (A)
M11    0.9998  0.9996  0.9982  0.9995  0.9976  0.9994  0.9802  0.9964  0.9991  0.9950  0.9991
(E)    (A)     (E)     (D)     (D)     (D)     (D)     (A)     (E)     (E)     (F)     (E)
M12    0.9998  0.9996  0.9852  0.9927  0.9893  0.9899  0.9802  S       0.9898  S       S
(B)    (A)     (E)     (G)     (F)     (H)     (G)     (A)             (H)
M13    0.9998  0.9997  0.9994  0.9999  0.9992  0.9998  0.9802  0.9989  0.9997  0.9985  0.9997
(D)    (A)     (C)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (A)
M14    0.9998  0.9997  0.9986  0.9996  0.9982  0.9995  0.9802  0.9972  0.9993  0.9961  0.9993
(D)    (A)     (C)     (C)     (C)     (C)     (C)     (A)     (C)     (C)     (D)     (C)
M15    0.9998  0.9998  0.9856  0.9928  0.9899  0.9900  0.9802  S       0.9900  S       S
(A)    (A)     (A)     (F)     (E)     (G)     (F)     (A)             (G)
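To make the definitions in Tables 1–3 concrete, the sketch below computes a few of the measures for model M1 of Table 4. This is not the authors' Scilab toolbox [38]; it is a minimal Python illustration with our own function names, assuming the usual conventions here: rows of the confusion matrix index the true class T, columns index the classifier output Y (last column = reject), and 0·log 0 terms are dropped.

```python
import math

def distributions(C):
    """Joint and marginal distributions from a confusion matrix
    (rows = true classes; columns = outputs, last column = reject)."""
    n = float(sum(sum(row) for row in C))
    P = [[c / n for c in row] for row in C]        # joint p(t, y)
    pt = [sum(row) for row in P]                   # p_t: true-class marginal
    py = [sum(col) for col in zip(*P)]             # p_y: output marginal
    return P, pt, py

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def mutual_information(P, pt, py):
    return sum(P[i][j] * math.log2(P[i][j] / (pt[i] * py[j]))
               for i in range(len(pt)) for j in range(len(py))
               if P[i][j] > 0)

def cross_entropy(p, q):
    # H(T;Y) = -sum_z p_t(z) log2 p_y(z); pad p for the reject state
    p = p + [0.0] * (len(q) - len(p))
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    # D_12 = KL(T,Y) in bits (Table 2)
    p = p + [0.0] * (len(q) - len(p))
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Model M1 of Table 4: C = [90 0 0; 1 9 0]
C_M1 = [[90, 0, 0],
        [1, 9, 0]]

P, pt, py = distributions(C_M1)
I = mutual_information(P, pt, py)
NI1 = I / entropy(pt)                        # Table 1, NI_1
NI5 = 2 * I / (entropy(pt) + entropy(py))    # Table 1, NI_5
NI12 = math.exp(-kl(pt, py))                 # Table 2, NI_12 = exp(-D_12)
NI21 = entropy(pt) / cross_entropy(pt, py)   # Table 3, NI_21
print(f"{NI1:.3f} {NI5:.3f} {NI12:.4f} {NI21:.3f}")
# → 0.831 0.860 0.9991 0.998
```

The printed values reproduce the M1 entries in Tables 5 and 6 (NI_1 = 0.831, NI_5 = 0.860, NI_12 = 0.9991, NI_21 = 0.998), which also illustrates why the cross-entropy and divergence measures compare only the marginals p_t and p_y, while the mutual-information measures use the full joint distribution.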
