Biogeography-Based Informative Gene Selection and Cancer Classification Using SVM and Random Forests

Microarray cancer gene expression data are very high-dimensional. Reducing the dimensionality improves the overall analysis and classification performance. We propose two hybrid techniques, Biogeography-Based Optimization with Random Forests (BBO-RF) and BBO with Support Vector Machines (BBO-SVM), that use gene ranking as a heuristic for microarray gene expression analysis. This heuristic is obtained from an information gain filter ranking procedure. The BBO algorithm generates a population of candidate gene subsets, as part of an ecosystem of habitats, and employs the migration and mutation processes across multiple generations of the population to improve the classification accuracy. The fitness of each gene subset is assessed by the classifiers, SVM and Random Forests. The performance of these hybrid techniques is evaluated on three cancer gene expression datasets retrieved from the Kent Ridge Biomedical datasets collection and the libSVM data repository. Our results demonstrate that genes selected by the proposed techniques yield classification accuracies comparable to those of previously reported algorithms.

💡 Research Summary

The paper addresses the challenge of classifying cancer samples from high‑dimensional microarray gene‑expression data, where the number of measured genes (often tens of thousands) far exceeds the number of available patient samples. To mitigate over‑fitting and reduce computational burden, the authors propose two hybrid feature‑selection pipelines that combine Biogeography‑Based Optimization (BBO) with two powerful classifiers: Random Forests (RF) and Support Vector Machines (SVM).

BBO is an evolutionary meta‑heuristic inspired by the migration of species among habitats. In this work each “habitat” represents a candidate subset of genes. The algorithm iteratively applies migration (sharing of high‑fitness genes among habitats) and mutation (random replacement of low‑importance genes) across a population of habitats. The migration and mutation rates are adaptively linked to the fitness of each habitat, so that high‑accuracy habitats contribute more genetic material while low‑accuracy habitats explore new regions of the search space.
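The migration and mutation loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 0/1 mask encoding, the linear rank-based migration rates, the uniform bit-flip mutation, the elitism step, and the names `bbo_step` and `toy_fitness` are all assumptions made for illustration, and the toy fitness stands in for classifier accuracy.

```python
import random

def bbo_step(population, fitness, rng, mutation_rate=0.05):
    """One BBO generation over habitats encoded as 0/1 gene masks."""
    # Rank habitats from best to worst fitness.
    ranked = sorted(population, key=fitness, reverse=True)
    n = len(ranked)
    # Linear migration model (assumed): the best habitat has the highest
    # emigration rate and the lowest immigration rate.
    emigration = [(n - i) / (n + 1) for i in range(n)]
    immigration = [1.0 - mu for mu in emigration]

    new_population = []
    for i, habitat in enumerate(ranked):
        child = list(habitat)
        for g in range(len(child)):
            # Migration: with probability immigration[i], copy this gene's bit
            # from a source habitat drawn in proportion to its emigration rate.
            if rng.random() < immigration[i]:
                source = rng.choices(ranked, weights=emigration, k=1)[0]
                child[g] = source[g]
            # Mutation (simplified to a uniform bit flip): explore new subsets.
            if rng.random() < mutation_rate:
                child[g] = 1 - child[g]
        new_population.append(child)
    # Elitism (assumed): overwrite the worst habitat's child with the best
    # habitat of the previous generation so the best fitness never drops.
    new_population[-1] = list(ranked[0])
    return new_population

def toy_fitness(mask):
    """Hypothetical stand-in for classifier accuracy: reward three
    'informative' genes and lightly penalise subset size."""
    informative = {0, 3, 7}
    hits = sum(mask[g] for g in informative)
    return hits - 0.1 * sum(mask)

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(10)] for _ in range(8)]
start_best = max(map(toy_fitness, population))
for _ in range(30):
    population = bbo_step(population, toy_fitness, rng)
end_best = max(map(toy_fitness, population))
print(start_best, end_best)
```

In the pipelines summarized here, `toy_fitness` would be replaced by the cross-validated accuracy of an SVM or Random Forest trained on the genes whose mask bit is 1.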

Before BBO begins, the authors compute an information‑gain based ranking of all genes using a simple filter method. This ranking serves two purposes: (1) it biases the initial population toward high‑ranked genes, thereby shrinking the search space, and (2) it guides mutation by preferentially swapping out low‑ranked genes. Consequently, BBO can focus its exploration on promising regions while still retaining the ability to discover novel combinations.
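As a rough illustration of the filter step, the sketch below ranks genes by information gain with respect to the class label. The summary does not say how expression values are discretized, so the median split used here is an assumption, as are the function names and the tiny example matrix.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels):
    """Information gain of one gene after a binary split of its values."""
    # Assumption: binarise at the median expression value.
    threshold = sorted(values)[len(values) // 2]
    low = [y for x, y in zip(values, labels) if x < threshold]
    high = [y for x, y in zip(values, labels) if x >= threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder

def rank_genes(expression, labels):
    """Return gene indices sorted from most to least informative.

    expression[s][g] is the expression level of gene g in sample s.
    """
    n_genes = len(expression[0])
    gains = [information_gain([row[g] for row in expression], labels)
             for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: gains[g], reverse=True)

# Hypothetical data: gene 0 separates the classes, gene 1 does not.
expression = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]]
labels = [0, 0, 1, 1]
ranking = rank_genes(expression, labels)
print(ranking)
```

A ranking like this could then be used to seed the initial habitats with top-ranked genes and to choose which genes mutation swaps out first.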

Fitness evaluation is performed by training either an SVM with an RBF kernel or a Random Forest on the gene subset supplied by a habitat. For SVM, the penalty parameter C and kernel width γ are tuned via internal cross‑validation; for RF, the number of trees and maximum depth are set to standard values, and the classifier’s out‑of‑bag accuracy is used as the fitness measure. The authors employ 10‑fold cross‑validation on each candidate subset and use the average classification accuracy as the objective to be maximized.
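The fitness computation can be sketched as a k-fold cross-validated accuracy restricted to the genes a habitat selects. To keep the example dependency-free, a nearest-centroid classifier is swapped in for the SVM and Random Forest; that substitution, the interleaved fold assignment, and the names `subset_fitness` and `nearest_centroid_predict` are assumptions for illustration only.

```python
def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier (assumption): predict the label of the closest class mean."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def closest(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    return [closest(x) for x in test_x]

def subset_fitness(expression, labels, mask, k=10):
    """Mean k-fold cross-validated accuracy using only genes where mask is 1."""
    selected = [g for g, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0  # an empty gene subset cannot classify anything
    data = [[row[g] for g in selected] for row in expression]
    n = len(data)
    k = min(k, n)
    accuracies = []
    for fold in range(k):
        # Interleaved folds without shuffling: samples fold, fold+k, fold+2k, ...
        test_idx = set(range(fold, n, k))
        train_x = [data[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        test_x = [data[i] for i in sorted(test_idx)]
        test_y = [labels[i] for i in sorted(test_idx)]
        predicted = nearest_centroid_predict(train_x, train_y, test_x)
        correct = sum(p == t for p, t in zip(predicted, test_y))
        accuracies.append(correct / len(test_y))
    return sum(accuracies) / k

# Hypothetical data: gene 0 separates the two classes, gene 1 is noise.
expression = [[0.1, 3.0], [0.9, 3.1], [0.2, 1.0], [1.1, 0.9],
              [0.0, 2.0], [1.0, 2.1], [0.15, 4.0], [0.95, 3.9]]
labels = [0, 1, 0, 1, 0, 1, 0, 1]
fitness_gene0 = subset_fitness(expression, labels, [1, 0], k=4)
fitness_gene1 = subset_fitness(expression, labels, [0, 1], k=4)
fitness_empty = subset_fitness(expression, labels, [0, 0], k=4)
print(fitness_gene0, fitness_gene1, fitness_empty)
```

Because every habitat in every generation triggers a full cross-validation run, this function dominates the cost of a wrapper search, which is one reason compact populations and a modest number of generations matter.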

The experimental protocol uses three publicly available cancer microarray datasets drawn from the Kent Ridge Biomedical repository and the libSVM archive: (1) a leukemia dataset (72 samples, ~7,000 genes), (2) a colon‑cancer dataset (62 samples, ~2,000 genes after preprocessing), and (3) a prostate‑cancer dataset (102 samples, ~12,600 genes). For each dataset the BBO population size is set to 50 and the algorithm runs for 30 generations.

Results show that both BBO‑RF and BBO‑SVM consistently select very compact gene subsets—typically 10 to 30 genes—while achieving classification accuracies that match or slightly exceed previously reported methods, which often use larger gene sets. On the leukemia data, for example, BBO‑SVM reaches 98 % accuracy with only 12 genes, comparable to the best published results that use 30–50 genes. BBO‑RF achieves similar performance and additionally provides an intrinsic importance ranking of the selected genes, which can be valuable for biological interpretation. The authors also report that the hybrid approach converges quickly: fitness improvements plateau after roughly 20 generations, indicating that the migration‑mutation balance effectively explores the solution space without excessive computational cost.

Key insights from the study include: (1) the combination of a filter‑based pre‑ranking with an evolutionary optimizer yields a powerful synergy—filters prune the obvious noise, while BBO refines the subset through global search; (2) using two distinct classifiers as fitness evaluators demonstrates that the selected gene subsets are robust across different learning paradigms; (3) the BBO framework naturally balances exploration (mutation) and exploitation (migration), which is crucial for navigating the astronomically large combinatorial space of possible gene subsets.
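To give a sense of scale for the combinatorial space mentioned in point (3), the short calculation below counts candidate subsets using the approximate leukemia figures quoted earlier (about 7,000 genes, a 12-gene panel). The variable names are illustrative, and the gene count is the rounded value from this summary rather than an exact dataset dimension.

```python
import math

n_genes = 7000     # approximate gene count for the leukemia dataset
subset_size = 12   # size of the BBO-SVM gene subset reported for leukemia

# Number of distinct 12-gene panels that can be drawn from 7,000 genes.
n_subsets_of_size = math.comb(n_genes, subset_size)
# Number of subsets of any size: each gene is either included or excluded.
n_all_subsets = 2 ** n_genes

# Report orders of magnitude by counting decimal digits.
print(f"12-gene subsets: about 10^{len(str(n_subsets_of_size)) - 1}")
print(f"all subsets:     about 10^{len(str(n_all_subsets)) - 1}")
```

Exhaustive evaluation is therefore out of the question even for a fixed panel size, which is why a guided stochastic search such as BBO is used instead.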

The authors conclude that BBO‑based feature selection is a viable alternative to more conventional methods such as genetic algorithms, particle‑swarm optimization, or wrapper‑based recursive elimination. They suggest future work in multi‑objective optimization (simultaneously minimizing gene count and maximizing accuracy), comparison with other meta‑heuristics, deeper biological validation of the discovered gene signatures (e.g., pathway enrichment analysis), and integration with clinical pipelines for real‑world diagnostic support. Overall, the paper contributes a methodologically sound and empirically validated approach for extracting compact, high‑performing gene panels from noisy, high‑dimensional cancer expression data.

📜 Original Paper Content