The Interaction of Entropy-Based Discretization and Sample Size: An Empirical Study

An empirical investigation of the interaction between sample size and discretization, in this case the entropy-based method CAIM (Class-Attribute Interdependence Maximization), was undertaken to evaluate the impact and potential bias introduced into data mining performance metrics as sample size varies during the discretization process. Of particular interest was the effect of discretizing within cross-validation folds versus discretizing outside the folds. Previous publications have suggested that discretizing externally can bias performance results; however, a thorough review of the literature found no empirical evidence to support such an assertion. This investigation involved construction of over 117,000 models on seven distinct datasets from the UCI (University of California, Irvine) Machine Learning Repository, using multiple modeling methods across a variety of configurations of sample size and discretization, with each unique “setup” being independently replicated ten times. The analysis revealed a significant optimistic bias as sample sizes decreased and discretization was employed. The study also revealed a possible relationship between the interaction that produces such bias and the number and types of predictor attributes, extending the “curse of dimensionality” concept from feature selection into the discretization realm. Directions for further exploration are laid out, as well as some general guidelines for the proper application of discretization in light of these results.


💡 Research Summary

The paper investigates how sample size interacts with an entropy‑based discretization method—CAIM (Class‑Attribute Interdependence Maximization)—and how this interaction influences the apparent performance of predictive models. The authors note that many prior studies have warned that discretizing data outside of cross‑validation folds can introduce optimistic bias, yet they found no empirical evidence confirming this claim. To fill the gap, they designed a large‑scale experiment involving seven publicly available UCI datasets that vary in number of attributes, class imbalance, and proportion of continuous variables. For each dataset they created five subsamples representing 100 %, 70 %, 50 %, 30 %, and 10 % of the original instances, thereby simulating a range of data‑scarcity conditions.
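The subsampling scheme described above can be sketched as follows. Note that `subsample` is a hypothetical helper written for illustration (the paper does not publish its sampling code); it simply draws each fraction of instances without replacement.

```python
import numpy as np

def subsample(n_total, fractions=(1.0, 0.7, 0.5, 0.3, 0.1), seed=0):
    """Return an index set for each subsample fraction, drawn without
    replacement, to simulate increasing data scarcity."""
    rng = np.random.default_rng(seed)
    return {
        f: rng.choice(n_total, size=int(round(f * n_total)), replace=False)
        for f in fractions
    }

# e.g., a dataset with 1,000 instances yields subsamples of 1000/700/500/300/100
idx = subsample(1000)
```

Each index set would then feed the full experimental grid (discretization strategy × learning algorithm × replicate) for that scarcity level.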

Two discretization strategies were compared. In the “in‑fold” approach, CAIM was applied separately to the training portion of each cross‑validation fold, producing fold‑specific cut‑points that were then used to transform the corresponding validation data. In the “out‑of‑fold” approach, the entire training set (for a given subsample) was discretized once, and the resulting cut‑points were reused across all folds. Both strategies were evaluated under a 10‑fold cross‑validation scheme, and each experimental configuration (dataset × subsample size × discretization strategy × learning algorithm) was replicated ten times to obtain stable estimates.
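The two protocols can be sketched in a few lines. This is a minimal illustration of the *timing* difference only: since CAIM is not available in NumPy or scikit-learn, plain quantile binning stands in for the discretizer, and `fit_cutpoints`/`discretize` are hypothetical helpers, not the paper's implementation.

```python
import numpy as np

def fit_cutpoints(x, n_bins=4):
    """Stand-in for CAIM: interior quantile cut-points fitted on the
    given sample (CAIM itself searches class-attribute interdependence)."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(x, qs)

def discretize(x, cuts):
    """Map continuous values to bin indices using the given cut-points."""
    return np.digitize(x, cuts)

rng = np.random.default_rng(0)
x = rng.normal(size=100)                      # one continuous attribute
folds = np.array_split(rng.permutation(100), 10)  # 10-fold CV index sets

# Out-of-fold: cut-points fitted once on the whole training sample,
# then reused unchanged in every fold.
global_cuts = fit_cutpoints(x)

for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(100), val_idx)
    # In-fold: cut-points refitted on this fold's training portion only.
    fold_cuts = fit_cutpoints(x[train_idx])
    x_val_infold = discretize(x[val_idx], fold_cuts)
    # Out-of-fold: the shared cut-points transform the same validation data.
    x_val_outfold = discretize(x[val_idx], global_cuts)
```

The bias the paper measures is the gap between performance metrics computed on `x_val_infold`-style data and `x_val_outfold`-style data, averaged over folds and replicates.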

Five learning algorithms were employed: logistic regression, C4.5 decision trees, k‑nearest neighbours (k = 5), support‑vector machines with an RBF kernel, and random forests. Performance was measured using accuracy, precision, recall, F1‑score, and ROC‑AUC. The primary focus was the difference between in‑fold and out‑of‑fold results, which quantifies any bias introduced by the discretization timing.

The empirical findings are threefold. First, as the sample size shrinks, in‑fold discretization yields increasingly optimistic performance estimates. When the subsample contains 30 % or less of the original data, the average accuracy advantage of in‑fold over out‑of‑fold ranges from 3 to 7 percentage points, with similar gaps observed for F1 and AUC. This bias is attributed to the fact that CAIM optimizes cut‑points to the specific training fold, effectively leaking information about the validation data.

Second, the magnitude of the bias is modulated by the dimensionality and the proportion of continuous attributes. Datasets with more than twenty features and a high share of continuous variables (≈ 70 % or more) exhibit the strongest bias, suggesting that the “curse of dimensionality” extends to the discretization stage: many continuous variables provide a large combinatorial space for CAIM to over‑fit when the number of observations is limited.

Third, the learning algorithm influences bias sensitivity. Tree‑based models, which directly use discretized intervals as split criteria, show the largest performance inflation. Linear models (logistic regression) and kernel‑based SVMs, which can still operate on the original continuous scale or are less dependent on exact cut‑points, display a comparatively modest bias. Nevertheless, for sufficiently large subsamples (≥ 70 % of the original data) the difference between the two discretization strategies becomes statistically insignificant across all algorithms.

Based on these results, the authors propose practical guidelines. When data are scarce (≤ 50 % of the original size), discretization should be performed within each cross‑validation fold, but the number of bins should be constrained or pre‑defined to avoid excessive tailoring to a single fold. For high‑dimensional problems, dimensionality reduction (e.g., PCA, feature selection) should precede discretization, or a minimum‑bin‑count parameter should be enforced to curb over‑fitting. Finally, any comparative study of predictive models must explicitly report the discretization protocol and, where feasible, present results for both in‑fold and out‑of‑fold approaches to reveal potential bias.
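One way to operationalize the minimum-bin-count guideline is to merge under-populated bins after the cut-points have been found. The procedure below is an illustrative reading of that guideline, not an algorithm prescribed by the paper.

```python
import numpy as np

def constrain_bins(x, cuts, min_count=10):
    """Repeatedly merge the most under-populated bin into a neighbour
    by dropping one cut-point, until every bin holds at least
    min_count training instances (or no cut-points remain)."""
    cuts = list(cuts)
    while cuts:
        counts = np.bincount(np.digitize(x, cuts), minlength=len(cuts) + 1)
        small = int(np.argmin(counts))
        if counts[small] >= min_count:
            break
        # Remove the cut-point bordering the under-populated bin.
        cuts.pop(small if small < len(cuts) else small - 1)
    return np.array(cuts)
```

For example, with 100 uniformly spread values and cut-points at 5 and 50, the leftmost bin holds only 5 instances, so its boundary is dropped and a single cut-point at 50 remains.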

In conclusion, the study provides solid empirical evidence that sample size and entropy‑based discretization interact in a way that can substantially inflate performance metrics, especially in small‑sample, high‑dimensional settings. While the findings partially validate the long‑standing claim that external discretization mitigates bias, they also highlight that the choice of discretization timing must be aligned with data characteristics and the intended learning algorithm. Future work is suggested to explore other discretization techniques (e.g., MDLP, ChiMerge), to extend the analysis to non‑tabular data such as text or images, and to investigate Bayesian formulations that explicitly model discretization uncertainty.
