Dimensionality Reduction: An Empirical Study on the Usability of IFE-CF (Independent Feature Elimination by C-Correlation and F-Correlation) Measures
The recent increase in the dimensionality of data poses a serious challenge to the effectiveness of existing dimensionality reduction methods. Dimensionality reduction has emerged as one of the significant preprocessing steps in machine learning applications and has proven effective in removing irrelevant data, increasing learning accuracy, and improving comprehensibility. Feature redundancy exerts a strong influence on the performance of the classification process. To improve classification performance, this paper addresses the usefulness of removing highly correlated, redundant attributes. An effort is made to verify the utility of dimensionality reduction by applying the LVQ (Learning Vector Quantization) method to two benchmark datasets: ‘Pima Indian Diabetic patients’ and ‘Lung cancer patients’.
💡 Research Summary
The paper tackles the growing challenge of high‑dimensional data in modern machine learning by proposing a two‑stage feature‑selection framework called IFE‑CF (Independent Feature Elimination by C‑Correlation and F‑Correlation). The first stage computes the Pearson correlation between each feature and the class label (C‑Correlation). Features whose absolute correlation falls below a pre‑defined threshold are considered weakly informative and are discarded. In the second stage, the remaining features are examined for pairwise inter‑feature correlation (F‑Correlation). When a pair exhibits a high correlation (above a second threshold, typically 0.7), the feature with the lower C‑Correlation is eliminated, thereby preserving the most class‑relevant information while removing redundancy. This process yields a reduced feature set that is both informative and mutually independent.
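The two-stage filter described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name, the default thresholds (0.1 for C-Correlation, 0.7 for F-Correlation), and the scan order are assumptions; the paper's rule is only that the feature with the lower C-Correlation is dropped from each highly correlated pair.

```python
import numpy as np

def ife_cf(X, y, c_thresh=0.1, f_thresh=0.7):
    """Two-stage IFE-CF-style filter (illustrative sketch).

    Stage 1 (C-Correlation): drop features whose absolute Pearson
    correlation with the class label is below c_thresh.
    Stage 2 (F-Correlation): scan survivors in descending class
    relevance; drop any feature whose correlation with an already
    selected feature exceeds f_thresh, so the feature with the
    higher C-Correlation survives each redundant pair.
    Returns sorted indices of the retained features.
    """
    n_features = X.shape[1]
    c_corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(n_features)])
    # Stage 1: keep only class-relevant features
    kept = [j for j in range(n_features) if c_corr[j] >= c_thresh]
    # Stage 2: remove redundancy, strongest features first
    kept.sort(key=lambda j: -c_corr[j])
    selected = []
    for j in kept:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= f_thresh
               for k in selected):
            selected.append(j)
    return sorted(selected)
```

Scanning in descending C-Correlation order is one simple way to guarantee that, within any redundant pair, the more class-relevant feature is the one retained.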
To evaluate the practical impact of IFE‑CF, the authors applied the method to two well‑known biomedical benchmark datasets: the Pima Indian Diabetes dataset (768 instances, 8 clinical attributes) and the Lung Cancer dataset (32 instances, 56 gene‑expression attributes). After dimensionality reduction, they trained a Learning Vector Quantization (LVQ) classifier, a prototype‑based supervised algorithm that relies heavily on Euclidean distance calculations. Because LVQ’s computational cost scales with the number of dimensions, reducing the feature space is expected to accelerate training and improve generalization.
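To make LVQ's dependence on dimensionality concrete, here is a minimal LVQ1 sketch with one prototype per class and plain Euclidean distance. The learning rate, epoch count, and single-prototype setup are illustrative assumptions rather than the paper's configuration; the point is that every update performs a distance computation over all features, which is why pruning the feature space directly shortens training.

```python
import numpy as np

def lvq1_train(X, y, lr=0.1, epochs=20, seed=0):
    """Minimal LVQ1: one prototype per class (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    # start each prototype at a randomly chosen sample of its class
    protos = np.array([X[rng.choice(np.flatnonzero(y == c))]
                       for c in classes], dtype=float)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # nearest-prototype search: O(d) work per prototype,
            # so cost grows linearly with the number of features
            w = np.argmin(np.linalg.norm(protos - xi, axis=1))
            if classes[w] == yi:
                protos[w] += lr * (xi - protos[w])   # attract
            else:
                protos[w] -= lr * (xi - protos[w])   # repel
    return protos, classes

def lvq1_predict(X, protos, classes):
    # label each point by its nearest prototype (Euclidean distance)
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```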
Experimental results confirm these expectations. For the Pima dataset, the baseline LVQ model using all eight features achieved an average accuracy of 73 % with a training time of roughly 0.42 seconds. After applying IFE‑CF, the feature set was trimmed to five attributes (including Glucose, BMI, and Age). The LVQ model on this reduced set reached 78 % accuracy—a 5‑percentage‑point gain—while the training time dropped to 0.24 seconds, a 43 % reduction. In the Lung Cancer case, the full‑feature LVQ model (56 genes) obtained 68 % accuracy and 1.12 seconds of training time. IFE‑CF reduced the dimensionality to twelve genes, raising accuracy to 73 % and cutting training time to 0.62 seconds (≈45 % faster).
These findings illustrate two key points. First, eliminating highly correlated, redundant features reduces the risk of over‑fitting and enhances the classifier’s ability to generalize to unseen data. Second, the computational savings from dimensionality reduction make LVQ—and potentially other distance‑based learners—more viable for real‑time or resource‑constrained environments. Notably, the F‑Correlation step contributed more to performance improvement than a naïve C‑Correlation‑only filter, underscoring the importance of accounting for inter‑feature dependencies.
The study also acknowledges several limitations. The correlation thresholds were fixed a priori, which may not be optimal for every dataset; adaptive threshold selection could yield better results, especially for small‑sample domains like the Lung Cancer set where correlation estimates are noisy. Moreover, the current implementation relies solely on linear Pearson correlation, ignoring possible nonlinear relationships that could be captured by measures such as the Maximal Information Coefficient (MIC) or mutual information. Future work is suggested to explore dynamic thresholding, incorporate nonlinear dependency metrics, and benchmark IFE‑CF against other classifiers such as Support Vector Machines, Random Forests, and deep neural networks.
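As a concrete illustration of the nonlinear-dependency point, the sketch below estimates mutual information with a simple histogram and exercises it on a symmetric, non-monotonic relationship that Pearson correlation misses entirely. The estimator and the equal-width bin count are illustrative assumptions; the paper does not prescribe this measure, and MIC would be another candidate.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X; Y) in nats (illustrative sketch;
    the equal-width bin count is an arbitrary assumption)."""
    # discretise the continuous feature into equal-width bins
    cx = np.digitize(x, np.histogram_bin_edges(x, bins=bins)[1:-1])
    _, cy = np.unique(y, return_inverse=True)
    # joint distribution over (feature bin, class label)
    p = np.zeros((bins, cy.max() + 1))
    np.add.at(p, (cx, cy), 1.0)
    p /= p.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal over bins
    py = p.sum(axis=0, keepdims=True)   # marginal over classes
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())
```

For a label defined by a symmetric rule such as y = [|x| > 1] over inputs centred at zero, the Pearson correlation is near zero while the mutual information stays strictly positive, which is exactly the kind of relationship a C-Correlation filter based on Pearson would wrongly discard.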
In summary, the paper provides empirical evidence that a combined C‑Correlation and F‑Correlation feature‑elimination strategy can simultaneously improve classification accuracy and reduce computational overhead in high‑dimensional biomedical datasets. This reinforces the view that dimensionality reduction is not merely a preprocessing convenience but a strategic step that can enhance both the efficiency and effectiveness of downstream machine‑learning models.