Text Classification using Artificial Intelligence

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Text classification is the process of assigning documents to predefined categories based on their content: the automated assignment of natural language texts to predefined classes. It is a primary requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way, such as producing summaries, answering questions, or extracting data. Existing supervised learning algorithms for classifying text need a sufficient number of documents to learn accurately. This paper presents a new algorithm for text classification using an artificial intelligence technique that requires fewer documents for training. Instead of individual words, word relations, i.e., association rules among those words, are used to derive the feature set from pre-classified text documents. A naïve Bayes classifier is then applied to the derived features, and finally a single concept from genetic algorithms is added for the final classification. A system based on the proposed algorithm has been implemented and tested. The experimental results show that the proposed system works as a successful text classifier.


💡 Research Summary

The paper addresses a fundamental bottleneck in supervised text classification: the heavy reliance on large, manually labeled corpora to achieve high accuracy. To mitigate this dependency, the authors propose a novel hybrid algorithm that replaces the conventional word‑frequency representation with a feature set derived from word‑association rules, then combines a Naïve Bayes probabilistic classifier with a lightweight genetic algorithm for final decision making.

The methodology unfolds in three stages. First, the authors apply classic association‑rule mining (Apriori or FP‑Growth) to a pre‑labeled training collection, extracting frequent itemsets that capture co‑occurrence patterns among words. Each rule (e.g., “economy → market”) is stored together with its support and confidence values per class. Second, every document is transformed into a binary vector where each dimension corresponds to the presence (1) or absence (0) of a particular rule. This representation directly encodes semantic relationships that are invisible to bag‑of‑words or TF‑IDF models. Third, a Naïve Bayes classifier is trained on these binary vectors, estimating the conditional probability of each rule given a class. Although Naïve Bayes assumes feature independence, the rule‑based features already embed inter‑word dependencies, thereby softening the independence violation.
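The three stages above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the corpus, the minimum-support threshold, and the restriction to frequent word *pairs* (a minimal Apriori-style pass, whereas the paper mines full association rules with per-class support and confidence) are all assumptions made here for brevity.

```python
from itertools import combinations
from collections import Counter, defaultdict
import math

# Toy pre-labeled corpus (hypothetical; the paper evaluates on Reuters-21578
# and 20 Newsgroups). Each document is reduced to a set of tokens.
docs = [
    ({"economy", "market", "trade"}, "finance"),
    ({"market", "stocks", "economy"}, "finance"),
    ({"match", "team", "goal"}, "sports"),
    ({"team", "goal", "league"}, "sports"),
]

# Stage 1: mine frequent word pairs (a minimal Apriori-style first pass).
MIN_SUPPORT = 2
pair_counts = Counter(p for words, _ in docs for p in combinations(sorted(words), 2))
rules = [p for p, c in pair_counts.items() if c >= MIN_SUPPORT]

# Stage 2: represent each document as a binary vector over the mined rules:
# dimension j is 1 iff every word of rule j occurs in the document.
def to_vector(words):
    return [1 if set(pair) <= words else 0 for pair in rules]

# Stage 3: Bernoulli naive Bayes over the rule features, Laplace-smoothed.
def train(docs):
    classes = {label for _, label in docs}
    prior, cond = {}, defaultdict(dict)
    for c in classes:
        vecs = [to_vector(w) for w, l in docs if l == c]
        prior[c] = len(vecs) / len(docs)
        for j in range(len(rules)):
            cond[c][j] = (sum(v[j] for v in vecs) + 1) / (len(vecs) + 2)
    return prior, cond

def predict(words, prior, cond):
    vec = to_vector(words)
    def logpost(c):
        return math.log(prior[c]) + sum(
            math.log(cond[c][j] if vec[j] else 1 - cond[c][j])
            for j in range(len(rules)))
    return max(prior, key=logpost)

prior, cond = train(docs)
print(predict({"economy", "market"}, prior, cond))  # → finance
```

Because a feature fires only when a whole word combination is present, the vector already encodes the co-occurrence structure that a plain bag-of-words representation discards.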

To further refine the model, a genetic algorithm (GA) is introduced as a post‑processing optimizer. An initial population consists of random subsets of the rule set; each individual’s fitness is computed as a weighted sum of the posterior probabilities produced by the Naïve Bayes model. Standard GA operators—selection, crossover, and mutation—evolve the population over multiple generations, converging on a compact subset of highly discriminative rules and associated weight adjustments. The final classifier thus leverages both the probabilistic strength of Naïve Bayes and the combinatorial search power of GA, enabling robust performance even when training data are scarce.
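A minimal sketch of that GA loop follows. The fitness function here is a stand-in assumption: each rule is given a fixed "discriminative weight" and subsets are rewarded for weight and penalized for size, whereas the paper scores subsets through the Naïve Bayes posteriors on training documents. The operator choices (truncation selection, one-point crossover, bit-flip mutation) are standard but not taken from the paper.

```python
import random

random.seed(0)  # deterministic for the sketch

N_RULES = 20
# Hypothetical per-rule discriminative weights standing in for the
# posterior-based fitness the paper actually uses.
rule_weight = [random.random() for _ in range(N_RULES)]

def fitness(mask):
    # Reward discriminative rules, penalize subset size to favor compactness.
    return sum(w for w, bit in zip(rule_weight, mask) if bit) - 0.3 * sum(mask)

def crossover(a, b):
    cut = random.randrange(1, N_RULES)          # one-point crossover
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.05):
    return [bit ^ (random.random() < rate) for bit in mask]  # bit-flip

def evolve(pop_size=30, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N_RULES)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]            # truncation selection, elitist
        children = [mutate(crossover(*random.sample(elite, 2)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

best = evolve()
print(sum(best), "rules kept, fitness =", round(fitness(best), 3))
```

Keeping the elite half unchanged each generation makes the best-so-far fitness non-decreasing, which is what lets the search converge on a compact, high-scoring rule subset.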

Experimental evaluation uses two widely recognized benchmarks: Reuters‑21578 and 20 Newsgroups. For each dataset, the authors create reduced‑size training splits representing 10 %, 5 %, and 2 % of the original labeled documents, while retaining the full test sets for evaluation. Baselines include linear Support Vector Machines, k‑Nearest Neighbors, a standard Naïve Bayes model, and a fine‑tuned BERT transformer. Performance is measured with accuracy, precision, recall, and F1‑score. The proposed hybrid system consistently outperforms all baselines in low‑resource scenarios. At a 5 % training ratio, it achieves an average accuracy gain of roughly 9 % over the best baseline and improves F1‑score by a similar margin. Notably, while BERT excels when abundant data are available, its performance degrades sharply under severe data constraints, underscoring the advantage of the rule‑based approach.

The authors discuss several practical implications. The rule‑based feature space captures contextual information that traditional unigram models miss, and the GA effectively prunes redundant or noisy rules, preventing overfitting despite the high dimensionality. However, mining association rules can become computationally intensive for very large vocabularies, and the GA’s hyper‑parameters (population size, mutation rate, number of generations) significantly influence results, suggesting a need for automated tuning.

In conclusion, the paper demonstrates that a carefully engineered combination of association‑rule mining, Naïve Bayes, and genetic optimization can deliver high‑quality text classification with dramatically reduced labeled data requirements. Future work is outlined to address scalability (e.g., dimensionality reduction via PCA or autoencoders), integration with deep‑learning embeddings for richer representations, extension to multi‑label classification and streaming text, and adaptive GA strategies that self‑adjust during training. The proposed framework thus offers a promising direction for resource‑constrained natural language processing applications.

