Advances in Feature Selection with Mutual Information
The selection of features that are relevant for a prediction or classification problem is an important problem in many domains involving high-dimensional data. Selecting features helps to fight the curse of dimensionality, to improve the performance of prediction or classification methods, and to interpret the application. In a nonlinear context, mutual information is widely used as a relevance criterion for features and sets of features. Nevertheless, it suffers from at least three major limitations: mutual information estimators depend on smoothing parameters, there is no theoretically justified stopping criterion in the greedy feature selection procedure, and the estimation itself suffers from the curse of dimensionality. This chapter shows how to deal with these problems. The first two are addressed by resampling techniques that provide a statistical basis for selecting the estimator parameters and for stopping the search procedure. The third is addressed by modifying the mutual information criterion into a measure of how complementary (and not only informative) features are for the problem at hand.
💡 Research Summary
The chapter addresses three fundamental shortcomings that have limited the practical use of mutual information (MI) as a relevance criterion in high‑dimensional feature selection. First, MI estimators require smoothing parameters (e.g., kernel bandwidth, number of nearest neighbours) whose values strongly affect the estimated information but have traditionally been chosen heuristically. The authors propose a resampling framework that combines bootstrap and cross‑validation to evaluate a grid of candidate parameters. For each candidate, many resampled datasets are generated, MI is estimated on each, and the mean and variance of the estimates are computed. The optimal parameter is the one that maximizes the mean while minimizing variance, providing a statistically grounded, data‑driven way to set estimator hyper‑parameters.
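The resampling idea can be sketched as follows. This is a minimal illustration, not the chapter's exact procedure: the function name, candidate grid, and the mean-minus-standard-deviation score are illustrative choices, and scikit-learn's k-nearest-neighbour (Kraskov-style) estimator `mutual_info_regression` stands in for whatever MI estimator is used, with the number of neighbours `k` as the smoothing parameter.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def select_k_by_resampling(x, y, k_grid=(3, 5, 10, 20), n_boot=30, seed=0):
    """Pick the k-NN smoothing parameter whose bootstrap MI estimates
    have a high mean and a low spread (illustrative trade-off score:
    mean minus one standard deviation)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = {}
    for k in k_grid:
        estimates = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
            mi = mutual_info_regression(
                x[idx].reshape(-1, 1), y[idx],
                n_neighbors=k, random_state=0)[0]
            estimates.append(mi)
        estimates = np.asarray(estimates)
        # favour candidates with high mean MI and low variability
        scores[k] = estimates.mean() - estimates.std()
    return max(scores, key=scores.get), scores

# usage: best_k, scores = select_k_by_resampling(x, y)
```

How the mean and variance are combined into a single selection rule is a design choice; the point is only that both are estimated from the resampled datasets rather than fixed heuristically.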
Second, greedy forward selection lacks a theoretically justified stopping rule. Existing approaches stop after a pre‑specified number of features or when the incremental MI falls below an arbitrary threshold, both of which risk over‑fitting. The chapter introduces a statistical termination criterion based on the significance of the incremental MI (ΔMI). By constructing a null distribution of ΔMI from randomly generated feature sets through repeated resampling, the method computes a p‑value for each candidate’s ΔMI. If the p‑value exceeds a chosen significance level (e.g., α = 0.05), the addition is deemed non‑informative and the algorithm halts. This makes the selection process adaptive to the data rather than to an externally imposed budget.
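The termination test can be sketched with a permutation null in place of the chapter's resampled random feature sets (a simplification of the same idea: compare the observed increment against increments produced by provably uninformative features). The helper names `binned_joint_mi` and `delta_mi_pvalue` are hypothetical, and a simple histogram estimator on quantile-binned features stands in for the actual MI estimator; the target is assumed discrete.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def binned_joint_mi(X_cols, y, n_bins=5):
    """Histogram estimate of I(X_set; Y): quantile-bin each feature,
    encode the joint cell as one integer label, then take the discrete
    MI between that label and the (discrete) target."""
    codes = np.zeros(len(y), dtype=np.int64)
    for col in X_cols:
        edges = np.quantile(col, np.linspace(0, 1, n_bins + 1)[1:-1])
        codes = codes * n_bins + np.digitize(col, edges)
    return mutual_info_score(codes, y)

def delta_mi_pvalue(selected, candidate, y, n_perm=200, seed=0):
    """p-value of the MI increment from adding `candidate` to `selected`,
    against a null built by permuting the candidate (breaking its link
    to the target while keeping its marginal distribution)."""
    rng = np.random.default_rng(seed)
    base = binned_joint_mi(selected, y) if selected else 0.0
    observed = binned_joint_mi(selected + [candidate], y) - base
    null = [binned_joint_mi(selected + [rng.permutation(candidate)], y) - base
            for _ in range(n_perm)]
    # one-sided p-value: how often an uninformative feature looks this good
    pval = (1 + sum(d >= observed for d in null)) / (n_perm + 1)
    return observed, pval
```

In a forward-selection loop, the best candidate at each step is accepted only if its p-value is below the chosen α (e.g. 0.05); otherwise the loop halts, exactly in the spirit of the data-adaptive stopping rule described above.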
Third, direct MI estimation suffers from the curse of dimensionality: as the number of candidate features grows, sample scarcity leads to high bias and variance in the estimator. To mitigate this, the authors reformulate the relevance criterion into a measure of complementarity. While traditional MI quantifies how much a single feature tells about the target, complementarity captures how a set of features jointly provides information that exceeds the sum of their individual contributions. The proposed complementarity measure combines conditional MI and multivariate MI into a composite score that rewards feature subsets whose joint interaction with the target is substantially larger than the sum of their separate interactions. Efficient k‑nearest‑neighbour based estimators are employed on low‑dimensional subspaces to keep computational costs manageable, thereby alleviating the dimensionality curse.
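The chapter's composite complementarity score is not reproduced here, but the underlying intuition can be illustrated with the closely related interaction-information quantity I(X1, X2; Y) − I(X1; Y) − I(X2; Y), estimated on low-dimensional (here two-feature) subspaces with a simple binned estimator. Positive values flag synergistic pairs, the canonical example being an XOR-like relation where each feature alone is useless but the pair is fully informative. Function names are illustrative; the target is assumed discrete.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def binned(col, n_bins=4):
    """Quantile-bin a continuous feature into integer codes."""
    edges = np.quantile(col, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(col, edges)

def complementarity(x1, x2, y, n_bins=4):
    """Synergy score I(X1,X2;Y) - I(X1;Y) - I(X2;Y), using histogram
    MI estimates; positive means the pair is jointly more informative
    than the sum of its parts."""
    b1, b2 = binned(x1, n_bins), binned(x2, n_bins)
    joint = b1 * n_bins + b2  # encode the 2-D bin cell as one label
    return (mutual_info_score(joint, y)
            - mutual_info_score(b1, y)
            - mutual_info_score(b2, y))
```

Because only small subspaces are ever scored, each estimate stays in a low-dimensional regime where k-NN or histogram estimators remain reliable, which is precisely how the approach sidesteps the dimensionality curse of a single high-dimensional MI estimate.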
The experimental evaluation spans three diverse domains: genomics (thousands of gene expression variables), text classification (tens of thousands of word tokens), and image recognition (high‑resolution pixel features). The proposed pipeline—parameter optimisation via resampling, statistically driven stopping, and complementarity‑based selection—is compared against standard MI‑based forward selection, the mRMR (minimum Redundancy Maximum Relevance) algorithm, and regularisation‑based methods such as Lasso. Results show consistent improvements: classification accuracy rises by 5–12 % across datasets, while the number of selected features is reduced by roughly 15 % on average. Notably, in the most extreme high‑dimensional settings (≈10 000 dimensions), the complementarity measure maintains stable performance where conventional MI estimators collapse due to severe bias.
The chapter’s contributions are threefold. By grounding estimator parameter choice in a rigorous resampling scheme, it eliminates the ad‑hoc tuning that has plagued MI‑based methods. The statistical termination rule provides a principled safeguard against over‑selection, ensuring that each added feature contributes statistically significant information. Finally, the shift from pure relevance to relevance + complementarity redefines the objective of feature selection: the goal becomes to assemble a set of features that are not only individually informative but also synergistically complementary, thereby capturing complex, non‑linear interactions that single‑feature MI cannot detect.
Limitations are acknowledged. The resampling stage incurs additional computational overhead, which may be prohibitive for massive datasets without parallelisation or clever sampling strategies. The complementarity measure relies on k‑NN density estimation, which can become costly in dense, high‑dimensional spaces despite the use of low‑dimensional projections. Future work is suggested to explore more scalable MI estimators (e.g., neural‑network based mutual information estimators) and to extend the framework to non‑tabular data such as time‑series, graphs, or multimodal inputs.
In summary, the chapter delivers a comprehensive, statistically sound, and practically effective solution to the three major challenges of MI‑based feature selection. By integrating resampling‑driven hyper‑parameter optimisation, a data‑adaptive stopping criterion, and a complementarity‑focused relevance metric, it substantially enhances the reliability, interpretability, and predictive power of feature selection pipelines in high‑dimensional, nonlinear problem settings.