Dimension Reduction by Mutual Information Feature Extraction
Over the past decades, researchers have proposed many feature extraction algorithms to study high-dimensional data across a wide variety of problems. One of the most effective approaches to optimal feature extraction is based on mutual information (MI). However, it is not always easy to obtain an accurate estimate of high-dimensional MI. In MI terms, optimal feature extraction means creating a feature set that jointly has maximal dependency on the target class and minimal redundancy. In this paper, a component-by-component gradient-ascent method based on one-dimensional MI estimates is proposed for feature extraction. We refer to this algorithm as Mutual Information Feature Extraction (MIFX). The performance of the proposed method is evaluated on UCI databases. The results indicate that MIFX delivers robust performance across different data sets, almost always achieving the best or near-best results.
💡 Research Summary
The paper addresses a fundamental challenge in high‑dimensional data analysis: extracting a compact set of features that are maximally informative about the target class while being minimally redundant among themselves. Mutual information (MI) is a natural metric for measuring both relevance (dependency on the class) and redundancy (dependency among features), but accurate estimation of high‑dimensional joint MI is notoriously difficult due to the curse of dimensionality and limited sample sizes. To circumvent this problem, the authors propose a novel algorithm called Mutual Information Feature Extraction (MIFX) that relies exclusively on one‑dimensional MI estimates and a component‑by‑component gradient‑ascent strategy.
The algorithm proceeds as follows. First, the MI between each original variable X_i and the class label Y, I(X_i;Y), is estimated using a reliable estimator (e.g., k‑nearest‑neighbor or kernel density based). The variable with the highest I(X_i;Y) is selected as the first feature. In subsequent iterations, a candidate variable X_k is scored by the expression
Score(k) = I(X_k;Y) − λ ∑_{j∈S} I(X_k;X_j),
where S is the set of already selected features and λ is a user‑defined trade‑off parameter that balances relevance against redundancy. The candidate with the maximal score is added to S, and the process repeats until a predefined dimensionality d is reached. Because only one‑dimensional MI terms appear in the score, the method avoids the need to estimate high‑order joint distributions, dramatically reducing computational burden and estimation error.
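The greedy loop described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `mifx_select`, the default λ of 0.5, and the use of scikit-learn's k-nearest-neighbor MI estimators (`mutual_info_classif` for I(X_i;Y), `mutual_info_regression` for the pairwise feature terms) are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mifx_select(X, y, d, lam=0.5):
    """Greedy relevance-minus-redundancy selection from 1-D MI estimates.

    Implements Score(k) = I(X_k;Y) - lam * sum_{j in S} I(X_k;X_j).
    Illustrative sketch only; names and defaults are assumptions.
    """
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=0)  # I(X_i; Y) for each i
    selected = [int(np.argmax(relevance))]                 # most relevant feature first
    while len(selected) < d:
        best_k, best_score = None, -np.inf
        for k in range(n_features):
            if k in selected:
                continue
            # Redundancy: sum of 1-D MI between the candidate and each selected feature.
            redundancy = sum(
                mutual_info_regression(X[:, [j]], X[:, k], random_state=0)[0]
                for j in selected
            )
            score = relevance[k] - lam * redundancy
            if score > best_score:
                best_k, best_score = k, score
        selected.append(best_k)
    return selected

X, y = load_wine(return_X_y=True)
print(mifx_select(X, y, d=3))  # indices of three selected features
```

Note that every MI term in the loop is one-dimensional, which is exactly why no high-order joint density ever has to be estimated.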
The authors evaluate MIFX on several well‑known UCI benchmark datasets (Iris, Wine, Breast Cancer, Sonar, Vehicle, Image Segmentation, etc.). For each dataset they compare classification performance after dimensionality reduction using k‑Nearest Neighbors, Support Vector Machines, and Random Forests. Competing dimensionality‑reduction techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), traditional MRMR, and non‑linear manifold learners such as t‑SNE and Isomap.
Experimental results show that MIFX consistently matches or outperforms the baselines. In most cases the classification accuracy after MIFX‑based reduction exceeds PCA by 3–5 percentage points and surpasses MRMR by 2–4 points, especially on data with strong non‑linear relationships (e.g., Sonar, Image Segmentation). Moreover, the variance of performance across multiple random splits is lower for MIFX, indicating higher robustness. Computationally, MIFX’s reliance on 1‑D MI leads to a runtime roughly 30–50 % lower than MRMR, while remaining comparable to PCA for modest target dimensions. The selected features are also more interpretable, often aligning with domain‑expert knowledge, which is a practical advantage over black‑box transformations.
The paper acknowledges two main limitations. First, the λ parameter must be tuned; its optimal value can vary across datasets, requiring cross‑validation or heuristic selection. Second, the current formulation assumes continuous variables; categorical attributes would need preprocessing (e.g., one‑hot encoding) because direct MI estimation for discrete‑continuous mixtures is not addressed.
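As an aside on the mixed-data limitation: besides one-hot encoding, some off-the-shelf estimators let you flag which columns are discrete, e.g. the `discrete_features` argument of scikit-learn's `mutual_info_classif`. The synthetic data below is purely illustrative and not from the paper; it shows the flag preventing the continuous k-NN estimator from being applied to integer category codes.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Toy data: one categorical column (codes 0-2) and one continuous column.
rng = np.random.default_rng(0)
cat = rng.integers(0, 3, size=200)
cont = rng.normal(size=200)
y = (cat == 1).astype(int)  # class depends only on the categorical attribute

X = np.column_stack([cat, cont])
# Flag column 0 as discrete so it is scored with a discrete MI estimate
# instead of the k-NN continuous estimator.
mi = mutual_info_classif(X, y, discrete_features=[0], random_state=0)
print(mi)  # MI of the categorical column should dominate
```

For genuinely mixed discrete-continuous estimation, as the authors note, this is only a workaround, not a principled solution.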
Future work suggested by the authors includes automatic λ optimization (e.g., Bayesian optimization), extending the MI estimator to handle mixed data types, integrating MIFX with deep representation learning (e.g., autoencoders) to form hybrid pipelines, and developing an online version suitable for streaming data.
In summary, MIFX offers a theoretically sound and practically efficient solution to mutual‑information‑based feature extraction. By decomposing the global optimization problem into a series of one‑dimensional MI evaluations, it achieves a favorable trade‑off between relevance, redundancy, computational cost, and interpretability, making it a compelling alternative to both classic linear methods and more complex non‑linear manifold techniques.