Gene selection with guided regularized random forest


The regularized random forest (RRF) was recently proposed for feature selection by building only one ensemble. In RRF the features are evaluated on a part of the training data at each tree node. We derive an upper bound for the number of distinct Gini information gain values in a node, and show that many features can share the same information gain at a node with a small number of instances and a large number of features. Therefore, in a node with a small number of instances, RRF is likely to select a feature not strongly relevant. Here an enhanced RRF, referred to as the guided RRF (GRRF), is proposed. In GRRF, the importance scores from an ordinary random forest (RF) are used to guide the feature selection process in RRF. Experiments on 10 gene data sets show that the accuracy performance of GRRF is, in general, more robust than RRF when their parameters change. GRRF is computationally efficient, can select compact feature subsets, and has competitive accuracy performance, compared to RRF, varSelRF and LASSO logistic regression (with evaluations from an RF classifier). Also, RF applied to the features selected by RRF with the minimal regularization outperforms RF applied to all the features for most of the data sets considered here. Therefore, if accuracy is considered more important than the size of the feature subset, RRF with the minimal regularization may be considered. We use the accuracy performance of RF, a strong classifier, to evaluate feature selection methods, and illustrate that weak classifiers are less capable of capturing the information contained in a feature subset. Both RRF and GRRF were implemented in the “RRF” R package available at CRAN, the official R package archive.


💡 Research Summary

The paper addresses a fundamental limitation of the Regularized Random Forest (RRF) when applied to high‑dimensional, low‑sample‑size data such as gene expression profiles. In RRF each tree node evaluates a subset of the training instances and selects the feature with the largest Gini information gain, while a regularization term λ penalizes the inclusion of new features. The authors first derive an upper bound on the number of distinct Gini gain values that can appear in a node. Because the bound is small when the node contains few samples but many candidate features, many features inevitably share the same gain value. Consequently, RRF may arbitrarily pick a feature that is not truly relevant, leading to unstable feature subsets and potential over‑fitting.
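The tie phenomenon described above is easy to reproduce. The following sketch (not from the paper; variable names and the simulation setup are illustrative) generates a node with only a handful of instances and many random binary features, computes each feature's Gini information gain, and counts how many distinct gain values appear. Because a node with n instances admits at most 2^n distinct binary partitions, the number of distinct gains is necessarily far smaller than the number of features:

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(labels):
    # Gini impurity of a label vector: 1 - sum_k p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(feature, labels):
    # Information gain of splitting the node on a binary feature
    left, right = labels[feature == 0], labels[feature == 1]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # degenerate split: no gain
    n = len(labels)
    return (gini(labels)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# A node with very few instances but many candidate features,
# mimicking the high-dimensional, low-sample-size setting.
n_instances, n_features = 6, 500
X = rng.integers(0, 2, size=(n_instances, n_features))
y = rng.integers(0, 2, size=n_instances)

gains = np.array([gini_gain(X[:, j], y) for j in range(n_features)])
distinct = np.unique(np.round(gains, 12)).size
print(f"{n_features} features share only {distinct} distinct gain values")
```

With hundreds of features collapsing onto a few gain values, the argmax over features is decided largely by ties, which is exactly why RRF can pick a feature that is not strongly relevant in such nodes.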

To overcome this problem, the authors propose the Guided Regularized Random Forest (GRRF). GRRF operates in two stages. In the first stage an ordinary Random Forest (RF) is trained on the full data set, and the resulting global importance scores for all features are recorded. In the second stage RRF is run, but the regularization penalty for each candidate feature is modulated by its RF importance: the effective penalty becomes (1 − γ·importance)·λ, where γ ∈ [0, 1] controls the strength of the guidance. Larger values of γ penalize features with low RF importance more heavily, steering split selection toward globally important genes and yielding more compact, more stable feature subsets.
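The two-stage scheme can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the coefficient form below, coef_i = (1 − γ) + γ·imp′_i with importances normalized to [0, 1], follows the per-feature regularization coefficient exposed by the RRF R package (its `coefReg` argument); the function names are hypothetical.

```python
import numpy as np

def guided_coefficients(rf_importance, gamma):
    """Per-feature regularization coefficients for the second (RRF) stage.

    rf_importance: importance scores from the first-stage ordinary RF.
    gamma in [0, 1]: 0 recovers plain RRF (a uniform coefficient),
    1 ties the penalty entirely to the normalized RF importance.
    """
    imp = np.asarray(rf_importance, dtype=float)
    imp_norm = imp / imp.max()                 # imp'_i in [0, 1]
    return (1.0 - gamma) + gamma * imp_norm    # one coefficient per feature

def regularized_gain(gain, coef, already_selected):
    # A feature already in the selected subset keeps its raw gain;
    # a new feature's gain is shrunk by its guidance coefficient,
    # so low-importance newcomers rarely win a split.
    return gain if already_selected else coef * gain

# Example: four genes with RF importances 0.02, 0.5, 1.0, 0.0 and gamma = 0.8
coefs = guided_coefficients([0.02, 0.5, 1.0, 0.0], gamma=0.8)
print(coefs)
```

With γ = 0.8 the least important gene keeps only 20% of its gain when competing as a new feature, while the most important gene is not penalized at all, which is the guidance effect the summary describes.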

