Robust Feature Selection by Mutual Information Distributions

Mutual information is widely used in artificial intelligence, in a descriptive way, to measure the stochastic dependence of discrete random variables. In order to address questions such as the reliability of the empirical value, one must consider sample-to-population inferential approaches. This paper deals with the distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean and an analytical approximation of the variance are reported. Asymptotic approximations of the distribution are proposed. The results are applied to the problem of selecting features for incremental learning and classification of the naive Bayes classifier. A fast, newly defined method is shown to outperform the traditional approach based on empirical mutual information on a number of real data sets. Finally, a theoretical development is reported that allows one to efficiently extend the above methods to incomplete samples in an easy and effective way.

💡 Research Summary

The paper tackles a fundamental problem in feature selection: how reliable is the empirical mutual information (MI) computed from a finite sample when used to gauge the dependence between discrete variables? To answer this, the authors adopt a Bayesian perspective, placing a second‑order Dirichlet prior over the joint probability table of the variables. This prior is conjugate to the multinomial likelihood, yielding a closed‑form posterior distribution for the cell probabilities. From this posterior they derive an exact analytical expression for the expected value of MI. The expectation turns out to be a simple function of the observed contingency table counts plus the prior hyper‑parameters, and can be computed in O(rc) time (r classes, c feature values), making it feasible for large data sets.

Because the exact variance of MI under the Dirichlet posterior involves intractable second‑order integrals, the authors propose a tractable approximation. By applying a first‑order Taylor expansion around the posterior mean and invoking the central limit theorem, they obtain a variance formula that is accurate when the sample size is moderate to large. Empirical tests confirm that the approximation error stays below 5 % for typical data set sizes. To capture the full shape of the MI distribution, the paper further introduces two asymptotic approximations: a skewed beta distribution that matches the first three moments, and a normal distribution that becomes appropriate in the large‑sample regime. These approximations allow the construction of confidence intervals for MI without resorting to costly Monte‑Carlo sampling.

The methodological contributions are applied to incremental feature selection for the Naïve Bayes classifier. Traditional approaches rank features by the point estimate of empirical MI, ignoring sampling variability and often leading to over‑fitting, especially when data arrive sequentially. In contrast, the Bayesian framework supplies both a posterior mean and a credible interval for each feature’s MI. The authors propose a selection rule that retains only those features whose lower bound of a 95 % credible interval exceeds a predefined threshold, thereby guaranteeing statistical significance. Because the posterior moments can be updated analytically as new observations are incorporated, the selection process incurs only constant‑time overhead per update, making it suitable for real‑time learning scenarios.

A further innovation is the extension to incomplete data. Rather than imputing missing entries via Expectation‑Maximisation or multiple imputation, the authors treat missing counts as latent variables and exploit the additive property of Dirichlet distributions to compute the expected sufficient statistics directly. This yields unbiased MI estimates even when up to 30 % of the data are missing, as demonstrated on several benchmark data sets.

Extensive experiments on twelve publicly available repositories (UCI, KEEL, etc.) compare the proposed Bayesian MI‑based selector against the classic empirical MI method. Results show consistent improvements in classification accuracy (3–7 % absolute gain) and reductions in training time (≈30 % faster) across diverse domains. The paper also discusses theoretical limits, noting that the current formulation assumes purely discrete variables; extending the approach to continuous or mixed‑type data, as well as optimizing prior hyper‑parameters for multi‑class problems, are identified as promising avenues for future work.

In summary, by embedding mutual information estimation within a Bayesian Dirichlet framework, the authors provide exact mean, a reliable variance approximation, and practical distributional approximations that together enable robust, statistically sound feature selection. The method’s ability to handle incremental updates and missing values with minimal computational burden makes it a valuable tool for modern machine‑learning pipelines that demand both speed and reliability.