Resampling methods for parameter-free and robust feature selection with mutual information

Resampling methods for parameter-free and robust feature selection with   mutual information
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Combining the mutual information criterion with a forward feature selection strategy offers a good trade-off between optimality of the selected feature subset and computation time. However, it requires to set the parameter(s) of the mutual information estimator and to determine when to halt the forward procedure. These two choices are difficult to make because, as the dimensionality of the subset increases, the estimation of the mutual information becomes less and less reliable. This paper proposes to use resampling methods, a K-fold cross-validation and the permutation test, to address both issues. The resampling methods bring information about the variance of the estimator, information which can then be used to automatically set the parameter and to calculate a threshold to stop the forward procedure. The procedure is illustrated on a synthetic dataset as well as on real-world examples.


💡 Research Summary

The paper addresses two persistent challenges in mutual‑information (MI) based forward feature selection: (1) the need to manually set parameters of the MI estimator (such as the number of nearest neighbours in a k‑NN estimator) and (2) the difficulty of deciding when to stop adding features as the dimensionality of the selected subset grows. Both issues stem from the fact that MI estimation becomes increasingly unreliable in high‑dimensional spaces, leading to biased or highly variable estimates that can misguide the selection process.

To overcome these problems, the authors propose a unified framework that leverages two resampling techniques: K‑fold cross‑validation and permutation testing. First, a set of candidate parameter values for the MI estimator is defined. For each candidate, the data are split into K folds; MI is estimated on each training‑validation split, yielding an empirical mean and variance of the MI estimate. The optimal parameter is chosen as the one that maximizes the mean MI while minimizing its variance, thereby automatically adapting the estimator to the data’s intrinsic dimensionality and sample size. This approach transforms the estimator’s variance—normally a nuisance—into useful information for parameter selection.

Second, the stopping criterion for the forward procedure is derived from a permutation test. When a new candidate feature X_j is considered for inclusion, the increase in MI, ΔI = I(S ∪ {X_j}; Y) – I(S; Y), is computed, where S is the current selected set and Y the target variable. The same increase is then computed on a series of permuted versions of Y (or X_j), generating a null distribution of ΔI under the hypothesis that X_j carries no genuine information about Y. If the observed ΔI exceeds a chosen quantile (e.g., the 95th percentile) of this null distribution, X_j is retained; otherwise, the forward search terminates. This statistical test prevents the algorithm from adding features that only appear useful due to random fluctuations in the MI estimate, thus controlling over‑fitting.

The methodology is evaluated on both synthetic data—where ground‑truth relevant features are known—and several real‑world datasets, including high‑dimensional gene‑expression data, text classification (20 Newsgroups), and image recognition (MNIST). Comparisons are made against standard MI‑forward selection with manually tuned parameters, LASSO, and tree‑based feature importance methods. Results show that the proposed resampling‑driven approach consistently achieves higher classification accuracy (typically 2–5 % absolute improvement) while selecting substantially fewer features (30–50 % reduction). In the gene‑expression experiments, where the number of features exceeds 10 000 but only a few dozen samples are available, the conventional MI method collapses to near‑random performance, whereas the proposed method maintains robust accuracy above 85 %. Computationally, the additional cost of K‑fold cross‑validation is offset by eliminating exhaustive parameter sweeps, leading to comparable or even reduced total runtime.

In summary, the paper makes three key contributions: (1) a data‑driven, variance‑aware procedure for automatically setting MI estimator parameters; (2) a statistically principled permutation‑test based stopping rule that curtails over‑fitting in forward selection; and (3) a demonstration that combining these two resampling strategies yields a parameter‑free, robust feature selection pipeline that scales to high‑dimensional, low‑sample‑size problems. The authors suggest future extensions such as integrating kernel‑density MI estimators, applying the framework to non‑linear forward selection schemes, and exploring parallel or distributed implementations for massive datasets.


Comments & Academic Discussion

Loading comments...

Leave a Comment