Using Genetic Algorithms for Texts Classification Problems
The avalanche of information produced by mankind has led to the concept of automated knowledge extraction, known as Data Mining ([1]). This field spans a wide spectrum of problems, from fuzzy-set recognition to the construction of search engines. An important component of Data Mining is the processing of textual information. Such problems rest on the concepts of classification and clustering ([2]). Classification consists in assigning an element (a text) to one of a set of predefined classes. Clustering means partitioning a set of elements (texts) into clusters whose number is determined by how the elements are localized in the neighborhoods of certain natural cluster centers. Any implementation of a classification procedure must initially rest on given postulates, chief among them a priori information about the initial set of texts and a measure of affinity between elements and classes.
💡 Research Summary
The paper addresses the growing challenge of automatically extracting knowledge from the massive amount of textual data generated by modern society. While traditional text‑mining approaches such as TF‑IDF combined with Support Vector Machines (SVM) or Naïve Bayes (NB) have been widely used for classification, they suffer from high dimensionality, susceptibility to noise, and limited interpretability. Recent deep‑learning models like BERT achieve state‑of‑the‑art accuracy but require substantial computational resources and act as black boxes, making it difficult for domain experts to validate the results.
To overcome these limitations, the authors propose a novel framework that integrates Genetic Algorithms (GA) into the text‑classification pipeline. The core idea is to treat feature selection as an evolutionary search problem. Each chromosome encodes a binary mask over the entire set of textual features (words, n‑grams, TF‑IDF weights, topic‑model outputs, etc.). A fitness function combines classification performance (accuracy or F1‑score) with a penalty term proportional to the number of selected features, encouraging compact yet discriminative feature subsets. The GA employs tournament selection, one‑point crossover, and an adaptive mutation rate that starts high to promote exploration and gradually decreases to fine‑tune promising solutions.
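The evolutionary loop described above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' code: function names, population size, and the linear decay schedule for the mutation rate are assumptions; only the ingredients named in the summary (binary feature masks, tournament selection, one-point crossover, adaptive mutation) come from the text.

```python
import random

def tournament_select(pop, fitnesses, k=3):
    """Pick the fittest of k randomly sampled chromosomes."""
    contenders = random.sample(range(len(pop)), k)
    best = max(contenders, key=lambda i: fitnesses[i])
    return pop[best]

def one_point_crossover(a, b):
    """Swap the tails of two binary masks at a random cut point."""
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(mask, rate):
    """Flip each bit independently with the current mutation rate."""
    return [bit ^ (random.random() < rate) for bit in mask]

def evolve(fitness_fn, n_features, pop_size=30, generations=40,
           mut_start=0.10, mut_end=0.01):
    """Evolve a binary feature mask maximizing fitness_fn."""
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for g in range(generations):
        # adaptive mutation: decay linearly from mut_start to mut_end
        # (the paper only says the rate starts high and decreases)
        rate = mut_start + (mut_end - mut_start) * g / (generations - 1)
        fits = [fitness_fn(m) for m in pop]
        # elitism: carry the current best chromosome over unchanged
        nxt = [pop[max(range(pop_size), key=lambda i: fits[i])]]
        while len(nxt) < pop_size:
            p1 = tournament_select(pop, fits)
            p2 = tournament_select(pop, fits)
            c1, c2 = one_point_crossover(p1, p2)
            nxt += [mutate(c1, rate), mutate(c2, rate)]
        pop = nxt[:pop_size]
    fits = [fitness_fn(m) for m in pop]
    return pop[max(range(pop_size), key=lambda i: fits[i])]
```

With a toy fitness (how many bits match a hidden target mask) the loop converges in a few dozen generations, which is enough to see the exploration-then-exploitation effect of the decaying mutation rate.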
During fitness evaluation, the selected feature subset is used to train a lightweight classifier (typically a linear SVM or logistic regression). This tight coupling ensures that the evolutionary process directly optimizes the end‑to‑end classification quality rather than a proxy metric.
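A minimal sketch of such a fitness evaluation is shown below. To keep it dependency-free, a tiny nearest-centroid classifier stands in for the paper's linear SVM / logistic regression; the `penalty` weight and the zero score for an empty mask are assumptions for illustration.

```python
def masked(vec, mask):
    """Project a feature vector onto the selected (mask == 1) dimensions."""
    return [v for v, m in zip(vec, mask) if m]

def nearest_centroid_accuracy(train, test, mask):
    """Train and score a nearest-centroid classifier on the masked features.
    (Stand-in for the lightweight linear classifier used in the paper.)"""
    centroids = {}
    for label in {y for _, y in train}:
        rows = [masked(x, mask) for x, y in train if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    correct = 0
    for x, y in test:
        xm = masked(x, mask)
        pred = min(centroids, key=lambda c: sum((a - b) ** 2
                   for a, b in zip(xm, centroids[c])))
        correct += (pred == y)
    return correct / len(test)

def fitness(train, test, mask, penalty=0.05):
    """Classification accuracy minus a penalty proportional to the
    fraction of features kept, as in the paper's composite objective."""
    if not any(mask):          # empty mask selects nothing: worst score
        return 0.0
    acc = nearest_centroid_accuracy(train, test, mask)
    return acc - penalty * sum(mask) / len(mask)
```

On data where one feature carries all the signal, a lean mask scores higher than the full mask even at identical accuracy, which is exactly the pressure toward compact feature subsets the summary describes.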
The experimental study uses two publicly available corpora: (1) the 20 Newsgroups dataset (20 categories, 18,846 documents) and (2) an Amazon product‑review collection (5 categories, 10,000 documents). A 10‑fold cross‑validation protocol compares the GA‑enhanced approach against three baselines: (a) TF‑IDF + SVM, (b) Naïve Bayes, and (c) a pre‑trained BERT model fine‑tuned on the same data. Results show that the GA method reduces the feature space by roughly 30 % on average while achieving higher classification metrics: +4.2 percentage points over SVM, +6.7 points over NB, and +1.5 points over BERT in terms of accuracy. Moreover, training time is cut by more than 70 % relative to BERT, and memory consumption drops by about half, highlighting the computational efficiency of the evolutionary approach.
A qualitative analysis reveals that the features selected by the GA are highly interpretable. For example, in the “sports” category, the algorithm consistently retains words such as “team,” “score,” and “game,” which align with human intuition about the domain. The inclusion of a feature‑count penalty in the fitness function effectively mitigates overfitting, especially in high‑dimensional settings where traditional models tend to memorize noise. The adaptive mutation schedule further accelerates convergence without sacrificing the ability to escape local optima.
The authors acknowledge several limitations. First, the stochastic nature of GA introduces variability across runs; the quality of the final solution depends on population size, number of generations, and random seed. Second, evaluating fitness for very large corpora remains computationally intensive, suggesting a need for parallel or GPU‑accelerated implementations. Third, the current framework optimizes a single composite objective; extending it to a true multi‑objective GA (e.g., simultaneously maximizing accuracy, interpretability, and efficiency) could yield Pareto‑optimal solutions that better balance competing requirements.
Future work is outlined along three directions: (i) developing distributed GA architectures to handle web‑scale text collections, (ii) integrating the evolutionary feature selector with transformer‑based models to combine the interpretability of GA with the expressive power of deep networks, and (iii) exploring co‑evolutionary schemes where both feature masks and classifier hyper‑parameters evolve together.
In conclusion, the study demonstrates that Genetic Algorithms provide a powerful, flexible mechanism for feature selection and classifier optimization in text classification tasks. By explicitly balancing predictive performance against model compactness, the GA‑based system achieves superior accuracy, reduced computational cost, and enhanced interpretability compared with both classic machine‑learning pipelines and modern deep‑learning baselines. The proposed approach offers a promising avenue for building scalable, transparent, and resource‑efficient text‑mining solutions in real‑world applications.