Evaluation of Performance Measures for Classifiers Comparison

Reading time: 5 minutes

📝 Abstract

The selection of the best classification algorithm for a given dataset is a very widespread problem, occurring each time one has to choose a classifier to solve a real-world problem. It is also a complex task with many important methodological decisions to make. Among those, one of the most crucial is the choice of an appropriate measure in order to properly assess the classification performance and rank the algorithms. In this article, we focus on this specific task. We present the most popular measures and compare their behavior through discrimination plots. We then discuss their properties from a more theoretical perspective. It turns out several of them are equivalent for classifier comparison purposes. Furthermore, they can also lead to interpretation problems. Among the numerous measures proposed over the years, it appears that the classical overall success rate and marginal rates are the most suitable for the classifier comparison task.

📄 Content

EVALUATION OF PERFORMANCE MEASURES FOR CLASSIFIERS COMPARISON

Vincent Labatut
Galatasaray University, Computer Science Department, Istanbul, Turkey vlabatut@gsu.edu.tr

Hocine Cherifi University of Burgundy, LE2I UMR CNRS 5158, Dijon, France hocine.cherifi@u-bourgogne.fr

ABSTRACT The selection of the best classification algorithm for a given dataset is a very widespread problem, occurring each time one has to choose a classifier to solve a real-world problem. It is also a complex task with many important methodological decisions to make. Among those, one of the most crucial is the choice of an appropriate measure in order to properly assess the classification performance and rank the algorithms. In this article, we focus on this specific task. We present the most popular measures and compare their behavior through discrimination plots. We then discuss their properties from a more theoretical perspective. It turns out several of them are equivalent for classifier comparison purposes. Furthermore, they can also lead to interpretation problems. Among the numerous measures proposed over the years, it appears that the classical overall success rate and marginal rates are the most suitable for the classifier comparison task.

Keywords: Classification, Accuracy Measure, Classifier Comparison, Discrimination Plot.

1 INTRODUCTION

The comparison of classification algorithms is a complex and open problem. First, the notion of performance can be defined in many ways: accuracy, speed, cost, readability, etc. Second, an appropriate tool is necessary to quantify this performance. Third, a consistent method must be selected to compare the measured values.

As performance is most of the time expressed in terms of accuracy, we focus on this point in this work. The number of accuracy measures appearing in the classification literature is extremely large. Some were specifically designed to compare classifiers, but most were initially defined for other purposes, such as measuring the association between two random variables [2], the agreement between two raters [3] or the similarity between two sets [4]. Furthermore, the same measure may have been independently developed by different authors, at different times, in different domains, for different purposes, leading to a very confusing typology and terminology. Besides its purpose or name, what characterizes a measure is the definition of the concept of accuracy it relies on. Most measures are designed to focus on a specific aspect of the overall classification results [5]. This leads to measures with different interpretations, and some do not even have any clear interpretation. Finally, the measures may also differ in the nature of the situations they can handle [6]. They can be designed for binary (only two classes) or multiclass (more than two classes) problems. They can be dedicated to mutually exclusive (one instance belongs to exactly one class) or overlapping classes (one instance can belong to several classes) situations. Some expect the classifier to output a discrete score (Boolean classifiers), whereas others can take advantage of the additional information conveyed by a real-valued score (probabilistic or fuzzy classifiers). One can also oppose flat (all classes on the same level) and hierarchical classification (a set of classes at a lower level constitutes a class at a higher level). Finally, some measures are sensitive to the sampling design used to retrieve the test data [7].
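To make the two measures the abstract singles out concrete, here is a minimal sketch (not the paper's code) of how the overall success rate and the per-class marginal rates can be computed from a confusion matrix; the function and variable names are illustrative choices, not taken from the article.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Build an n_classes x n_classes confusion matrix (rows: true class, columns: predicted class)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def overall_success_rate(cm):
    """Proportion of correctly classified instances: trace of the matrix over the total count."""
    return np.trace(cm) / cm.sum()

def marginal_rates(cm):
    """Per-class marginal rates: recall (diagonal over row sums) and precision (diagonal over column sums)."""
    recall = np.diag(cm) / cm.sum(axis=1)
    precision = np.diag(cm) / cm.sum(axis=0)
    return recall, precision

# Toy example with three mutually exclusive classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
print(overall_success_rate(cm))  # 4 correct out of 6 instances
```

The overall success rate summarizes the whole matrix in one number, while the marginal rates expose class-by-class behavior; the paper's point is that this small family already covers the classifier comparison task well.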

Many different measures exist, and yet there is no such thing as a perfect measure that would be the best in every situation [8]: an appropriate measure must be chosen according to the classification context and objectives. Because of the overwhelming number of measures and of their heterogeneity, choosing the most adapted one is a difficult problem. Moreover, it is not always clear what the measures' properties are, either because they were never rigorously studied, or because specialists do not agree on the question (e.g. the question of chance-correction [9]). Maybe for these reasons, authors very often select an accuracy measure by relying on the tradition or consensus observed in their field. The point is then more to use the same measure as their peers rather than the most appropriate one.
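Chance-correction, the debated point mentioned above, is commonly illustrated by Cohen's kappa, which rescales the observed agreement by the agreement expected if predictions were drawn at random with the same marginal frequencies. The sketch below is a standard formulation of kappa, given here as background rather than as the article's own definition.

```python
import numpy as np

def cohens_kappa(cm):
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                           # observed agreement
    pe = (cm.sum(axis=1) @ cm.sum(axis=0)) / n**2   # agreement expected by chance
    return (po - pe) / (1 - pe)

# Binary example: 85% raw agreement, but class imbalance inflates chance agreement.
cm = np.array([[30, 10],
               [5, 55]])
print(round(cohens_kappa(cm), 3))  # 0.681
```

Here the raw success rate is 0.85, yet kappa is only about 0.68 because a chance-level classifier matching the marginals would already agree 53% of the time; whether such a correction helps or hurts classifier comparison is exactly the kind of disagreement [9] the text refers to.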

In this work, we reduce the complexity of choosing an accuracy measure by restricting our analysis to a very specific but widespread situation. We discuss the case where one wants to select the best classification algorithm to process a given dataset [10]. An appropriate way to perform this task would be to study the data properties first, then to select a suitable classification algorithm and determine the most appropriate parameter values, and finally to use it to build the classifier. But not everyone has the statistical expertise required to perform this analytic work.

This content is AI-processed based on ArXiv data.
