A simple data discretizer


📝 Abstract

Data discretization is an important step in machine learning, since it is easier for classifiers to deal with discrete attributes than with continuous ones. Over the years, several discretization methods, such as Boolean Reasoning, Equal Frequency Binning, and Entropy-based discretization, have been proposed, explored, and implemented. In this article, a simple supervised discretization approach is introduced. Its prime goal is to maximize the classification accuracy of a classifier while minimizing the loss of information during discretization of continuous attributes. The performance of the suggested approach is compared with that of the supervised discretization algorithm Minimum Information Loss (MIL), using the state-of-the-art rule induction algorithm J48 (the Java implementation of the C4.5 classifier). The presented approach is, indeed, a modified version of MIL. The empirical results show that the modified approach performs better in several cases than both the original MIL algorithm and the Minimum Description Length Principle (MDLP).

📄 Content

A Simple Data Discretizer

Gourab Mitra (gourab5139014@gmail.com), Dept. of Computer Science and Engineering, Bloomberg L.P., USA
Shashidhar Sundereisan (shashidhar.sun@gmail.com), Dept. of Computer Science and Engineering, SUNY at Buffalo, USA
Bikash Kanti Sarkar (bk_sarkarbit@hotmail.com), Birla Institute of Technology, Mesra, Ranchi, India

Abstract
Data discretization is an important step in machine learning, since it is easier for classifiers to deal with discrete attributes than with continuous ones. Over the years, several discretization methods, such as Boolean Reasoning, Equal Frequency Binning, and Entropy-based discretization, have been proposed, explored, and implemented. In this article, a simple supervised discretization approach is introduced. Its prime goal is to maximize the classification accuracy of a classifier while minimizing the loss of information during discretization of continuous attributes. The performance of the suggested approach is compared with that of the supervised discretization algorithm Minimum Information Loss (MIL), using the state-of-the-art rule induction algorithm J48 (the Java implementation of the C4.5 classifier). The presented approach is, indeed, a modified version of MIL. The empirical results show that the modified approach performs better in several cases than both the original MIL algorithm and the Minimum Description Length Principle (MDLP).
Keywords: Data mining, Discretization, Classifier, Accuracy, Information Loss
Nomenclature
Ai          any continuous attribute of the classification problem
s           number of class values in the classification problem
dmin, dmax  minimum and maximum values of the continuous attribute
m           total number of examples in the classification problem
n           number of sub-intervals of Ai
c           user-set constant value
Sj          j-th sub-interval of Ai
h           length of each sub-interval
fk          frequency of examples occurring in the sub-interval Sj
TS          expected threshold value for each sub-interval
CTSj        computed threshold value of the j-th sub-interval
U           the union operator
acc         mean accuracy
s.d.        standard deviation
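To make the nomenclature concrete, the following sketch shows how the quantities above relate for the common equal-width case. It is an illustrative assumption, not the paper's MIL algorithm: it takes h = (dmax - dmin) / n, partitions the range of a continuous attribute Ai into n sub-intervals S_1..S_n, and counts the frequency fk of examples in each. The function name and data are hypothetical.

```python
# Illustrative sketch (assumed equal-width partition, not the MIL method):
# split the range [dmin, dmax] of a continuous attribute into n
# sub-intervals of length h, and count the frequency f_k per sub-interval.

def equal_width_subintervals(values, n):
    """Return (h, boundaries, frequencies) for n equal-width sub-intervals."""
    dmin, dmax = min(values), max(values)
    h = (dmax - dmin) / n                      # length of each sub-interval
    freqs = [0] * n
    for v in values:
        # S_j covers [dmin + j*h, dmin + (j+1)*h); dmax is clamped
        # into the last sub-interval so every example is counted.
        j = min(int((v - dmin) / h), n - 1)
        freqs[j] += 1
    boundaries = [dmin + j * h for j in range(n + 1)]
    return h, boundaries, freqs

# Toy attribute with m = 6 examples, split into n = 4 sub-intervals.
h, bounds, f = equal_width_subintervals([1.0, 2.5, 3.0, 4.2, 5.0, 9.0], n=4)
print(h, f)
```

In this toy run, dmin = 1.0 and dmax = 9.0, so h = 2.0 and every example falls into exactly one sub-interval (the frequencies sum to m).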

  1. Introduction

Most real-world problems involve continuous attributes, each of which can take on many values. Owing to the exponential growth of data in database systems, operating on continuous values is considerably more complex than operating on discrete values. Therefore, converting input data sets with continuous attributes into data sets with discrete attributes is necessary to reduce the range of values; this process is known as data discretization. The discrete values obtained from a continuous attribute may be expressed nominally (e.g. "low", "medium", and "high"), or the attribute may already be nominal before transformation. The values of nominal attributes then need to be normalized, e.g. 1 (low), 2 (medium), 3 (high). In Knowledge Discovery in Databases (KDD), discretization is known to be one of the most important data preprocessing tasks, and discretization of continuous attributes prior to the learning process has been studied extensively (Khiops and Boulle, 2004; Catlett, 1991; Ching et al., 1995; Chmielewski and Grzymala-Busse, 1996; Dougherty et al., 1995; Fayyad and Irani, 1993; Kohavi and Sahami, 1996; Pal and Biswas, 2005). Several machine learning algorithms have been developed to mine knowledge from real-world problems. However, many of them (Apte and Hong, 1996; Cendrowska, 1987; Clark and Niblett, 1989) cannot handle continuous attributes at all, whereas every one of them can operate on discretized attributes. Furthermore, even when an algorithm can handle continuous attributes, its performance can be significantly improved by replacing them with their discretized values (Catlett, 1991; Pfahringer, 1995). Additional advantages of operating on discretized attributes are lower memory and processing-time requirements compared to the non-discretized form; processing continuous attributes, by contrast, tends to produce much larger rules. On the other hand, a disadvantage of discretizing a continuous value is, in some cases, the loss of information present in the original data.
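The discretize-then-normalize step described above can be sketched as follows. The cut points, label names, and integer codes here are arbitrary illustrative choices, not values from the paper: a continuous value is first mapped to a nominal label ("low"/"medium"/"high") and the labels are then normalized to integers (1, 2, 3).

```python
# Sketch of the discretization-then-normalization step described above.
# Cut points and codes are hypothetical examples, not the paper's values.

def discretize(value, cuts=(4.0, 7.0), labels=("low", "medium", "high")):
    """Map a continuous value to a nominal label via ordered cut points."""
    for cut, label in zip(cuts, labels):
        if value < cut:
            return label
    return labels[-1]          # value >= all cut points -> highest label

CODES = {"low": 1, "medium": 2, "high": 3}   # nominal -> normalized integer

raw = [2.3, 5.1, 8.7, 6.9]
nominal = [discretize(v) for v in raw]       # nominal labels
normalized = [CODES[x] for x in nominal]     # normalized integer codes
print(nominal, normalized)
```

With these cut points, 2.3 maps to "low" (code 1), 5.1 and 6.9 to "medium" (code 2), and 8.7 to "high" (code 3), illustrating how a continuous attribute is reduced to a small discrete range.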

This content is AI-processed based on ArXiv data.
