Evaluation of Performance Measures for Classifiers Comparison
The selection of the best classification algorithm for a given dataset is a very widespread problem, occurring each time one has to choose a classifier to solve a real-world problem. It is also a complex task with many important methodological decisions to make. Among those, one of the most crucial is the choice of an appropriate measure in order to properly assess the classification performance and rank the algorithms. In this article, we focus on this specific task. We present the most popular measures and compare their behavior through discrimination plots. We then discuss their properties from a more theoretical perspective. It turns out that several of them are equivalent for classifier comparison purposes. Furthermore, some of them can also lead to interpretation problems. Among the numerous measures proposed over the years, it appears that the classical overall success rate and marginal rates are the most suitable for the classifier comparison task.
💡 Research Summary
The paper addresses a fundamental yet often overlooked problem in machine learning practice: selecting an appropriate accuracy measure for comparing multiple classifiers on the same dataset. While a plethora of performance metrics exists—many originally devised for purposes other than classifier evaluation—their sheer number and heterogeneous terminology make it difficult for practitioners to choose the most suitable one. The authors therefore restrict their analysis to a common scenario: flat, mutually‑exclusive, multi‑class classification where each classifier outputs a discrete label. Under these assumptions, the goal is simply to rank classifiers by their empirical performance on a test set.
The literature review shows that previous comparative studies either focus on binary problems, on classifiers that output continuous scores, or on very specific domains, and often rely on correlation analyses that do not directly address ranking consistency. The authors introduce three concepts that guide their evaluation: (i) equivalence – two measures are considered equivalent if they produce the same ranking of classifiers, even if the absolute values differ; (ii) discrimination – the ability of a measure to detect genuine performance differences; and (iii) consistency – the stability of rankings across varying data conditions.
A taxonomy of metrics is presented. First, nominal association measures (e.g., chi‑square, Cramer’s V, Matthews correlation) quantify the statistical dependence between true and predicted class assignments but can assign maximal values to perfectly mis‑classified data, rendering them unsuitable for accuracy assessment. Second, the overall success rate (OSR), defined as the trace of the confusion matrix, is a simple, symmetric, multiclass metric ranging from 0 (total mis‑classification) to 1 (perfect classification). Third, marginal‑rate measures—true‑positive rate (TPR, sensitivity), true‑negative rate (TNR, specificity), positive‑predictive value (PPV, precision), and negative‑predictive value (NPV)—focus on class‑specific error types and are directly interpretable. From these, the F‑measure (harmonic mean of PPV and TPR) and Jaccard coefficient (intersection‑over‑union) are derived; both are class‑specific and symmetric but emphasize overlap rather than overall correctness. The Classification Success Index (CSI) and its per‑class variant (ICSI) are linear combinations of TPR and PPV, while chance‑corrected agreement coefficients (Cohen’s κ, etc.) attempt to remove agreement expected by random guessing.
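To make the taxonomy concrete, the following sketch computes the measures named above from a single confusion matrix. This is an illustrative helper, not code from the paper; the example matrix values are made up.

```python
import numpy as np

def classification_metrics(cm):
    """Global and per-class measures from a confusion matrix.

    cm[i, j] = number of instances of true class i predicted as class j.
    Hypothetical helper illustrating the measures discussed above.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                  # correctly classified, per class
    fn = cm.sum(axis=1) - tp          # false negatives per class
    fp = cm.sum(axis=0) - tp          # false positives per class

    osr = tp.sum() / cm.sum()         # overall success rate: trace / total
    tpr = tp / (tp + fn)              # sensitivity (recall) per class
    ppv = tp / (tp + fp)              # precision per class
    f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of PPV and TPR
    jaccard = tp / (tp + fp + fn)     # intersection over union per class
    return {"OSR": osr, "TPR": tpr, "PPV": ppv, "F1": f1, "Jaccard": jaccard}

# Illustrative 3-class confusion matrix (150 test instances)
cm = [[50, 10, 0],
      [5, 40, 5],
      [0, 10, 30]]
m = classification_metrics(cm)   # OSR = 120/150 = 0.8
```

Note that OSR is a single number for the whole matrix, while the marginal-rate family (TPR, PPV) and the derived F-measure and Jaccard coefficient are vectors with one entry per class; this is why the paper treats them as answering different questions.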
To compare the behavior of these metrics, the authors introduce “discrimination plots,” a novel visualization that maps the ranking changes induced by systematic perturbations of the confusion matrix (e.g., varying class imbalance, injecting false positives). By applying ten representative metrics to synthetic and real datasets, they observe that OSR and the marginal‑rate family (especially the average of TPR and PPV) produce virtually identical rankings across all tested scenarios. In contrast, association measures and composite metrics such as F‑measure or Jaccard are highly sensitive to specific error patterns; small shifts in false‑positive rates can invert the ranking of two classifiers.
The functional analysis in Section 7 further clarifies why OSR and marginal rates are preferable in the defined context. Both are bounded, symmetric, and have a clear probabilistic interpretation (proportion of correctly classified instances or proportion of correctly identified members of a class). They are also invariant to the fixed class‑proportion assumption, which holds when the same test set is used for all classifiers. Chance‑corrected coefficients, while valuable in inter‑rater reliability studies, add little discriminative power here because the expected agreement under chance is already accounted for by the fixed class distribution.
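The chance-correction point can be made explicit with Cohen's kappa. In a minimal sketch (assumed formulation, not the paper's code), the observed agreement p_o is exactly the OSR, and the chance agreement p_e comes from the row and column marginals; whenever p_e is held fixed, kappa = (p_o - p_e) / (1 - p_e) is an increasing function of OSR, so it induces the same ranking:

```python
import numpy as np

def cohens_kappa(cm):
    """Cohen's kappa from a confusion matrix.

    p_o is the observed agreement (identical to the OSR); p_e is the
    agreement expected by chance, computed from the marginals.
    """
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    p_o = np.trace(cm) / n
    p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
    return (p_o - p_e) / (1 - p_e)

# Illustrative 2-class matrix: p_o = 0.60, p_e = 0.54, kappa ~ 0.130
k = cohens_kappa([[45, 15], [25, 15]])
```

In practice p_e still varies slightly with each classifier's prediction marginals, but with a fixed test set the true-class marginals are constant, which is the sense in which the paper argues the correction adds little discriminative power here.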
In conclusion, the paper recommends using the overall success rate for a quick, global assessment and marginal‑rate averages (e.g., the mean of TPR and PPV) when a more nuanced view of class‑specific performance is desired. These metrics are simple to compute, easy to interpret, and, most importantly, produce consistent rankings of classifiers, avoiding the interpretational pitfalls associated with more complex or domain‑specific measures. The authors suggest future work to extend the analysis to hierarchical, multi‑label, and probabilistic‑output classifiers, where the current conclusions may need revision.
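The recommended workflow can be sketched as a small ranking helper. One hedged reading of the "mean of TPR and PPV" is a uniform macro average of the per-class values; the example matrices below are invented for illustration:

```python
import numpy as np

def mean_tpr_ppv(cm):
    """Macro average of per-class TPR and PPV.

    One possible reading of the recommended marginal-rate average:
    per-class sensitivity and precision, each averaged uniformly
    over classes, then averaged together.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    tpr = tp / cm.sum(axis=1)   # per-class sensitivity
    ppv = tp / cm.sum(axis=0)   # per-class precision
    return (tpr.mean() + ppv.mean()) / 2

def rank_classifiers(cms, metric):
    """Return classifier indices ranked best-first under the given metric."""
    scores = [metric(cm) for cm in cms]
    return sorted(range(len(cms)), key=lambda i: -scores[i])

# Two hypothetical classifiers evaluated on the same test set
cms = [[[50, 10], [10, 30]],   # classifier 0
       [[55, 5], [20, 20]]]    # classifier 1
order = rank_classifiers(cms, mean_tpr_ppv)
```

Swapping `mean_tpr_ppv` for a plain OSR function gives the quick global assessment; the paper's finding is that, in the fixed-test-set setting, both choices tend to produce the same ordering.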