Text Classification using Association Rule with a Hybrid Concept of Naive Bayes Classifier and Genetic Algorithm
Text classification is the automated assignment of natural language texts to predefined categories based on their content. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Now a day the demand of text classification is increasing tremendously. Keeping this demand into consideration, new and updated techniques are being developed for the purpose of automated text classification. This paper presents a new algorithm for text classification. Instead of using words, word relation i.e. association rules is used to derive feature set from pre-classified text documents. The concept of Naive Bayes Classifier is then used on derived features and finally a concept of Genetic Algorithm has been added for final classification. A system based on the proposed algorithm has been implemented and tested. The experimental results show that the proposed system works as a successful text classifier.
💡 Research Summary
The paper proposes a novel hybrid approach for automatic text classification that integrates association rule mining, a Naïve Bayes classifier, and a Genetic Algorithm (GA). Traditional bag‑of‑words models suffer from high dimensionality, sparsity, and loss of semantic relationships among words. To address these issues, the authors first preprocess a labeled corpus by tokenizing, removing stop‑words, and performing morphological analysis. They then apply an Apriori‑style algorithm to discover frequent itemsets and generate association rules of the form “word A and word B imply word C.” Only rules whose support and confidence exceed predefined thresholds are retained, thereby compressing the feature space while preserving meaningful word co‑occurrence patterns.
Each document is represented as a binary or weighted vector indicating the presence or confidence of the extracted rules. This rule‑based representation serves as input to a Naïve Bayes classifier, which computes class priors and conditional probabilities over the rule features. Because the features already encode inter‑word dependencies, the naïve independence assumption of the Bayes model is less restrictive, allowing more accurate probability estimates than a pure word‑frequency model.
However, the naïve Bayes stage alone may not yield optimal class assignments, especially when rule importance varies across classes. To refine the decision, the authors embed a Genetic Algorithm that evolves candidate labelings. The initial population consists of label vectors derived from the naïve Bayes posterior probabilities. Fitness is evaluated using a weighted combination of accuracy, precision, recall, and F1‑score. Standard GA operators—tournament selection, single‑point crossover, and mutation (label swapping or probability perturbation)—are applied over multiple generations. The fittest individual after convergence provides the final classification for each document.
The method was evaluated on two benchmark corpora: Reuters‑21578 and 20 Newsgroups. Baselines included linear Support Vector Machines, k‑Nearest Neighbors, a conventional Naïve Bayes model, and a simple association‑rule classifier without the GA component. Experiments employed a 70/30 train‑test split and 5‑fold cross‑validation. Results show that the hybrid system achieves an average accuracy of 87.3 %, outperforming SVM (82.1 %) and standard Naïve Bayes (80.4 %). The rule‑based feature extraction reduces dimensionality by roughly 70 %, leading to faster training and lower memory usage. Moreover, the approach demonstrates robustness when training data are scarce, indicating strong generalization capabilities.
The main drawbacks identified are the sensitivity of performance to the support and confidence thresholds used in rule mining, and the increased computational cost introduced by the GA, which roughly doubles processing time compared with the naïve Bayes‑only pipeline. The authors suggest future work on adaptive threshold selection, parallelized GA implementations, integration with deep‑learning embeddings, and real‑time streaming text scenarios. In summary, the paper presents a compelling combination of symbolic rule mining and probabilistic‑evolutionary learning that advances the state of the art in text classification.
Comments & Academic Discussion
Loading comments...
Leave a Comment