Supervised learning Methods for Bangla Web Document Categorization

Supervised learning Methods for Bangla Web Document Categorization
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

This paper explores the use of machine learning approaches, or more specifically, four supervised learning Methods, namely Decision Tree(C 4.5), K-Nearest Neighbour (KNN), Na"ive Bays (NB), and Support Vector Machine (SVM) for categorization of Bangla web documents. This is a task of automatically sorting a set of documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla language text categorization. Hence, we attempt to analyze the efficiency of those four methods for categorization of Bangla documents. In order to validate, Bangla corpus from various websites has been developed and used as examples for the experiment. For Bangla, empirical results support that all four methods produce satisfactory performance with SVM attaining good result in terms of high dimensional and relatively noisy document feature vectors.


💡 Research Summary

The paper investigates the applicability of four classic supervised learning algorithms—Decision Tree (C4.5), K‑Nearest Neighbour (KNN), Naïve Bayes (NB), and Support Vector Machine (SVM)—to the problem of categorising Bangla (Bengali) web documents. Recognising that most text‑classification research has focused on English and other high‑resource languages, the authors first construct a Bangla corpus by crawling a variety of websites (news portals, blogs, forums) and manually assigning each document to one of ten predefined topical categories such as politics, economics, sports, culture, and science‑technology. The resulting dataset contains roughly 5,000 documents, each pre‑processed through HTML tag removal, duplicate filtering, and encoding normalisation.

Because Bangla exhibits rich morphology, the authors employ a Bangla‑specific stemming and tokenisation tool to perform morphological analysis, followed by removal of a custom stop‑word list. They then represent each document as a TF‑IDF weighted vector, retaining only terms that appear in at least three documents and limiting the vocabulary to the top 2,000 most informative words. This yields high‑dimensional, sparse feature vectors that are also noisy due to spelling variations and web‑specific artefacts.

Four classifiers are trained on this feature space. The Decision Tree uses information‑gain ratio for splitting and applies post‑pruning to mitigate over‑fitting. KNN is configured with k = 5 and cosine similarity as the distance metric. Naïve Bayes adopts the multinomial model with Laplace smoothing to avoid zero‑probability issues. SVM employs a radial‑basis‑function (RBF) kernel; the regularisation parameter C and kernel width γ are optimised via grid search. All experiments are evaluated using 10‑fold cross‑validation, reporting accuracy, precision, recall, and F1‑score.

Results show that SVM consistently outperforms the other methods. It achieves an accuracy of 0.89, precision of 0.88, recall of 0.87, and an F1‑score of 0.87, which is 5–9 percentage points higher than the best alternative. Decision Tree attains 0.73 accuracy but suffers from over‑fitting despite pruning. KNN reaches 0.71 accuracy, with performance heavily dependent on the distance calculation in the high‑dimensional space. Naïve Bayes, while extremely fast to train (under a second) and memory‑efficient, records the lowest accuracy at 0.68, reflecting the limitations of the conditional independence assumption for natural language. In terms of computational cost, Naïve Bayes is the quickest, SVM requires the longest training time (approximately 12 seconds), but all models provide inference speeds suitable for real‑time applications.

The authors discuss several key insights. First, effective morphological preprocessing (stemming, stop‑word removal) is crucial for Bangla because of its complex inflectional patterns; errors at this stage propagate to all downstream classifiers. Second, SVM’s ability to maximise the margin and to handle non‑linear separability through kernels makes it robust against the high dimensionality and noise typical of web‑derived text. Third, tree‑based and instance‑based methods (Decision Tree, KNN) become computationally expensive as the dataset grows and are more prone to over‑fitting, limiting their suitability for large‑scale, production‑level systems. Fourth, Naïve Bayes remains valuable for rapid prototyping or environments with constrained resources, though its predictive performance lags behind the others.

In conclusion, the study provides one of the first systematic evaluations of Bangla text categorisation using mainstream machine‑learning techniques and demonstrates that SVM is the most effective approach under the tested conditions. The paper suggests future work in three directions: (1) comparing these traditional methods with deep‑learning embeddings such as Word2Vec, FastText, or multilingual BERT fine‑tuned on Bangla; (2) extending the framework to multi‑label classification where documents may belong to several categories simultaneously; and (3) assessing the models in streaming or real‑time environments to gauge their scalability and latency characteristics.


Comments & Academic Discussion

Loading comments...

Leave a Comment