Robust gene prioritization for Dietary Restriction via Fast-mRMR Feature Selection techniques

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Gene prioritization (identifying genes potentially associated with a biological process) is increasingly tackled with Artificial Intelligence. However, existing methods struggle with the high dimensionality and incomplete labelling of biomedical data. This work proposes a more robust and efficient pipeline that leverages Fast-mRMR Feature Selection to retain only relevant, non-redundant features for classifiers, building simpler, more interpretable and more efficient models. Experiments in our domain of interest, prioritizing genes related to Dietary Restriction (DR), show significant improvements over existing methods and enable us to integrate heterogeneous biological feature sets for better performance, a strategy that previously degraded performance due to noise accumulation. This work focuses on DR given the availability of curated data and expert knowledge for validation, yet the pipeline would be applicable to other biological processes, showing that feature selection is critical for reliable gene prioritization in high-dimensional omics.


💡 Research Summary

The paper addresses the challenging problem of gene prioritization for Dietary Restriction (DR), a biological process linked to longevity, under the realistic conditions of high‑dimensional omics data and Positive‑Unlabeled (PU) labeling. In PU settings only a small set of experimentally validated positive genes is known, while the rest of the genome is unlabeled and true negatives are unavailable. This creates two major obstacles: (i) a feature space where the number of variables far exceeds the number of samples (d ≫ n), and (ii) severe label uncertainty that hampers conventional supervised learning.

To tackle both issues, the authors propose a three‑stage pipeline that integrates Fast‑mRMR (Fast Minimum Redundancy Maximum Relevance) feature selection, robust model training, and consensus ranking. Fast‑mRMR computes mutual information between each feature and the target while penalizing redundancy among features, thereby selecting a compact, informative subset. Because true negatives are missing, the authors treat all unlabeled genes as provisional negatives during the feature‑selection step, relying on the robustness of mRMR to still capture discriminative patterns between the small positive set and the genomic background.
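The relevance-versus-redundancy trade-off described above can be illustrated with a minimal sketch of plain greedy mRMR on discrete features. This is not the optimized Fast‑mRMR implementation used in the paper; the function names `mutual_information` and `mrmr_select`, the toy data, and the choice of the "relevance minus mean redundancy" scoring form are assumptions made for illustration only.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
    return mi


def mrmr_select(features, target, k):
    """Greedily pick k feature names, maximizing relevance minus mean redundancy.

    `features` maps a feature name to its list of discrete values.
    """
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected = []
    # Running sum of MI between each candidate and the features already selected.
    redundancy_sum = {name: 0.0 for name in features}
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name in features:
            if name in selected:
                continue
            redundancy = redundancy_sum[name] / len(selected) if selected else 0.0
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        for name in features:
            if name not in selected:
                redundancy_sum[name] += mutual_information(features[name], features[best_name])
    return selected


# Toy data: 1 = known positive gene, 0 = unlabeled gene treated as a provisional negative.
target = [1, 1, 1, 1, 0, 0, 0, 0]
toy_features = {
    "f_a": [1, 1, 1, 0, 0, 0, 0, 0],
    "f_a_dup": [1, 1, 1, 0, 0, 0, 0, 0],  # exact copy of f_a: relevant but redundant
    "f_b": [0, 1, 1, 1, 0, 0, 0, 1],      # weaker alone, but less redundant with f_a
}
print(mrmr_select(toy_features, target, k=2))
```

In this toy case the duplicate feature is as relevant as `f_a`, but once `f_a` is selected its redundancy penalty outweighs its relevance, so the less redundant `f_b` is picked second.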

The pipeline proceeds as follows. First, Fast‑mRMR is applied to each data source (Gene Ontology, PathDIP, and a large co‑expression matrix) to retain a user‑defined percentage k of the most relevant features. The optimal k is determined inside a nested cross‑validation (CV) scheme: an inner 5‑fold CV optimizes k, while an outer 10‑fold CV evaluates the full workflow. Two state‑of‑the‑art ensemble classifiers—Balanced Random Forest (BRF) and CatBoost—are trained independently on the selected features. After each outer fold, the trained model predicts probabilities for the held‑out test genes. Ten independent runs of the 10×5 CV are performed with different random seeds, and the final score for each gene is the average of its predicted probabilities across runs, which reduces variance and yields a stable ranking.
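The nested cross-validation and score-averaging scheme can be sketched as follows. This is a skeleton under stated assumptions, not the authors' code: `inner_score` and `fit_predict` are hypothetical callbacks standing in for "run feature selection at percentage k, train a classifier, and score or predict", and the folds here are simple shuffled splits (the summary does not say whether stratification is used or which metric selects k).

```python
import random
from statistics import mean


def k_folds(indices, n_folds, rng):
    """Shuffle the indices and split them into n_folds disjoint folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def nested_cv_ranking(n_genes, candidate_ks, inner_score, fit_predict,
                      n_runs=10, n_outer=10, n_inner=5, seed=0):
    """Return (ranking, consensus) where consensus[g] is the mean predicted
    probability of gene g over all runs, and ranking sorts genes by it.

    inner_score(fit_idx, val_idx, k) -> float   (hypothetical: higher is better)
    fit_predict(train_idx, test_idx, k) -> list of probabilities for test_idx
    """
    per_gene = {g: [] for g in range(n_genes)}
    for run in range(n_runs):
        rng = random.Random(seed + run)  # a different seed for every run
        for test in k_folds(range(n_genes), n_outer, rng):
            test_set = set(test)
            train = [g for g in range(n_genes) if g not in test_set]
            inner_folds = k_folds(train, n_inner, rng)

            def mean_inner_score(k):
                # Inner CV uses only the outer training genes, so the
                # held-out test genes never influence the choice of k.
                scores = []
                for val in inner_folds:
                    val_set = set(val)
                    fit = [g for g in train if g not in val_set]
                    scores.append(inner_score(fit, val, k))
                return mean(scores)

            best_k = max(candidate_ks, key=mean_inner_score)
            for gene, prob in zip(test, fit_predict(train, test, best_k)):
                per_gene[gene].append(prob)

    consensus = {g: mean(probs) for g, probs in per_gene.items()}
    ranking = sorted(consensus, key=consensus.get, reverse=True)
    return ranking, consensus
```

Each gene lands in exactly one outer test fold per run, so with the default settings its consensus score is the mean of ten held-out predictions.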

A key technical contribution is the “Feature Bagging” strategy for handling the massive co‑expression dataset (≈45 000 features). The full feature space is partitioned into disjoint blocks; Fast‑mRMR is run on each block to extract the top K features, and the resulting subsets are merged. This approach dramatically lowers memory and computational demands while preserving biologically relevant signals.
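The block-wise idea can be sketched in a few lines. The summary does not specify how blocks are formed or sized, so contiguous blocks of a fixed `block_size` are an assumption here, as are the names `feature_bagging_select` and `select_fn`; in practice `select_fn` would be a Fast‑mRMR run restricted to one block, whereas the toy selector below just sorts by a made-up relevance score.

```python
def feature_bagging_select(feature_names, block_size, top_k, select_fn):
    """Split features into disjoint blocks, select up to top_k per block, merge.

    select_fn(block, k) -> list of k feature names chosen from `block`.
    """
    merged = []
    for start in range(0, len(feature_names), block_size):
        block = feature_names[start:start + block_size]
        # The last block may be shorter than block_size.
        merged.extend(select_fn(block, min(top_k, len(block))))
    return merged


# Toy stand-in for a per-block selector: rank by a hypothetical relevance score.
toy_relevance = {f"f{i}": float(i) for i in range(10)}


def toy_selector(block, k):
    return sorted(block, key=toy_relevance.get, reverse=True)[:k]


selected = feature_bagging_select(list(toy_relevance), block_size=4, top_k=2,
                                  select_fn=toy_selector)
print(selected)
```

Because each selector call only sees one block, peak memory depends on the block size rather than on the full feature count. The trade-off is that redundancy between features in different blocks is not penalized.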

Experimental results are reported on three configurations: (1) single‑source GO, (2) single‑source PathDIP, and (3) a combined GO + PathDIP set. For each configuration both classifiers are evaluated with three training regimes: (a) no feature selection (Original), (b) a PU‑learning baseline, and (c) the proposed Fast‑mRMR pipeline. Across all metrics—AU‑ROC, G‑Mean, and F1‑Score—the Fast‑mRMR approach outperforms the baselines, with statistical significance (p < 0.05). Notably, the combined GO + PathDIP configuration achieves the highest performance (CatBoost AU‑ROC = 0.873, F1‑Score = 0.560), overturning previous findings that naïve integration of heterogeneous data degrades accuracy due to the curse of dimensionality.
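For readers unfamiliar with the three metrics, the sketch below computes them from their standard definitions. It is a plain-Python illustration, not the evaluation code used in the paper; the function names are assumptions, and G‑Mean and F1 are computed from hard 0/1 predictions, whatever decision threshold produced them.

```python
import math


def gmean_and_f1(y_true, y_pred):
    """G-Mean and F1 from binary labels and binary predictions (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on positives
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return g_mean, f1


def auroc(y_true, scores):
    """AU-ROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. Quadratic in the number of genes, fine for a demo."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

G‑Mean rewards balanced performance on both classes, which matters when positives are rare: a classifier that predicts every gene as negative has a specificity of 1 but a G‑Mean of 0.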

Analysis of the feature‑selection thresholds shows that optimal performance is reached with a very small fraction of the original features: roughly 5 % for GO and 25 % for PathDIP. Adding more features beyond these points leads to diminishing returns or even performance loss, confirming that most raw omics variables are redundant noise.

On the high‑dimensional co‑expression data, the Feature‑Bagging Fast‑mRMR pipeline reduces the feature space to fewer than 2 500 variables (≈5 % of the original) and improves AU‑ROC from ~0.50 (random) to 0.542 for BRF and 0.547 for CatBoost. Although statistical significance could not be established due to the limited signal, the consistent direction of improvement suggests that the method scales to feature spaces of this size.

Beyond predictive performance, the authors evaluate computational sustainability using CodeCarbon. While Fast‑mRMR adds an upfront cost for feature selection, it substantially reduces the subsequent training time, leading to lower per‑inference energy consumption and CO₂ emissions. This “green AI” aspect is highlighted as a practical advantage for laboratories with limited computational resources.

Biological interpretation of the selected features reveals that the NRF2 pathway (WikiPathways.37) is consistently ranked among the top predictors, aligning with literature that links NRF2 activation to the protective effects of dietary restriction. Additionally, novel candidate genes, including RRAGD and GCLM, are identified with high predicted probabilities despite being annotated as negatives in the original dataset. RRAGD, for instance, participates in mTORC1 signaling, a pathway central to nutrient sensing and aging, suggesting plausible mechanistic relevance.

In summary, the study demonstrates that integrating Fast‑mRMR feature selection into PU‑learning pipelines for gene prioritization yields (i) significant gains in predictive accuracy, (ii) robust handling of heterogeneous and ultra‑high‑dimensional biological data via Feature Bagging, and (iii) measurable reductions in computational cost and environmental impact. The authors argue that the approach is domain‑agnostic and can be transferred to other biological processes, provided appropriate curated datasets and expert validation are available. Suggested future work includes extending the framework to diseases, comparing alternative feature‑selection methods, and experimentally validating the newly proposed candidate genes.

