Automatic classification of Bengali sentences based on sense definitions present in Bengali WordNet


Based on the sense definitions of words available in the Bengali WordNet, an attempt is made to automatically classify Bengali sentences into different groups according to their underlying senses. The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while information about the different senses of a particular ambiguous lexical item is collected from the Bengali WordNet. On an experimental basis, we have used the Naive Bayes probabilistic model as a classifier of sentences. We have applied the algorithm to 1747 sentences containing a particular Bengali lexical item which, being ambiguous, can trigger different senses and thus give the sentences different meanings. In our experiment we have achieved around 84% accuracy in sense classification over the total input sentences. An analysis of the residual sentences that were misclassified and affected the results shows that, in many cases, ill-formed syntactic structures and sparse semantic information are the main hurdles in the semantic classification of sentences. The study is relevant to automatic text classification, machine learning, information extraction, and word sense disambiguation.


💡 Research Summary

The paper presents a supervised machine‑learning approach for Word Sense Disambiguation (WSD) of Bengali sentences, using the sense definitions stored in the Bengali WordNet and a Naïve Bayes classifier. The authors focus on the ambiguous noun “māthā” (head) which can convey at least three distinct meanings: a literal body part, a figurative/abstract sense (e.g., “beginning” or “top”), and a proper‑name or geographical reference.

Data were drawn from the TDIL (Technology Development for Indian Languages) corpus, a large multilingual collection covering 85 domains such as physics, agriculture, and literature. From this corpus the authors extracted 1,747 sentences containing the lemma “māthā”. Because the raw text exhibited heterogeneous fonts, irregular punctuation, and inconsistent spacing, a manual normalization pipeline was applied: conversion to a uniform Unicode encoding, separation of punctuation marks, removal of stray brackets and line‑break artifacts, and explicit identification of sentence‑ending symbols.
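The normalization steps just described can be illustrated with a short Python sketch (the paper's pipeline was carried out manually; the function name and regular expressions below are illustrative assumptions, not the authors' code). It applies canonical Unicode normalization, removes stray brackets, separates punctuation including the Bengali danda (।), and collapses line-break artifacts:

```python
import re
import unicodedata

def normalize_sentence(text: str) -> str:
    """Illustrative normalization: uniform Unicode encoding, removal of
    stray brackets and line-break artifacts, punctuation separation."""
    # Canonical form so visually identical characters compare equal
    text = unicodedata.normalize("NFC", text)
    # Replace stray brackets with spaces
    text = re.sub(r"[\[\]{}()]", " ", text)
    # Separate the danda and common punctuation from adjoining words
    text = re.sub(r"([।,;!?])", r" \1 ", text)
    # Collapse line breaks and runs of whitespace
    return re.sub(r"\s+", " ", text).strip()
```

Applied to raw corpus text, this yields space-delimited tokens with sentence-ending symbols explicitly identified, which simplifies the later tokenization step.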

Following normalization, a stop‑word removal stage was performed. Bengali’s rich set of post‑positions, conjunctions, interjections, pronouns, and frequently occurring adjectives/adverbs makes automatic frequency‑based stop‑word detection unreliable. Consequently, the authors built a stop‑word list manually with the aid of a standard Bengali dictionary, ensuring that only high‑frequency functional items were filtered while preserving content words that carry sense information.
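A manually curated stop-word filter of this kind reduces, in effect, to set membership over tokens. In the sketch below, the Bengali items are common functional words chosen purely for illustration; they are not the paper's actual list:

```python
# Illustrative manual stop-word list (common Bengali function words);
# the paper's list was compiled from a standard Bengali dictionary.
STOP_WORDS = {"এবং", "কিন্তু", "সে", "তার", "এই", "যে", "ও", "না"}

def remove_stop_words(tokens):
    """Keep only content-bearing tokens that may carry sense information."""
    return [t for t in tokens if t not in STOP_WORDS]
```

Because the list is hand-built rather than frequency-derived, content words that happen to be frequent in the corpus are never filtered out by accident.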

For supervised learning, three training sets were constructed, each representing one of the three senses of “māthā”. The training data were derived from the Bengali WordNet entries for the target word, which provide a gloss (definition), example sentences, synonym sets, part‑of‑speech tags, and hierarchical ontology links. Each sentence was tokenized, and term frequencies were used as features for a multinomial Naïve Bayes model. Prior probabilities were set according to the proportion of sentences belonging to each sense in the training corpus. The classifier then computes the posterior probability for each sense using Bayes’ rule:

 P(sense | sentence) ∝ P(sense) × ∏ P(word | sense).
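The formula above can be sketched as a minimal multinomial Naive Bayes classifier over token counts, computed in log space. Add-one (Laplace) smoothing is included here to avoid zero probabilities for unseen words; the class name and the toy transliterated training data in the test are assumptions for illustration, not the paper's implementation:

```python
import math
from collections import Counter

class NaiveBayesSenseClassifier:
    """Multinomial Naive Bayes: P(sense | sentence) ∝ P(sense) × ∏ P(word | sense),
    with priors set from the proportion of training sentences per sense."""

    def fit(self, sense_to_sentences):
        n = sum(len(s) for s in sense_to_sentences.values())
        self.priors, self.word_counts, self.totals = {}, {}, {}
        self.vocab = set()
        for sense, sentences in sense_to_sentences.items():
            self.priors[sense] = len(sentences) / n
            counts = Counter(w for sent in sentences for w in sent.split())
            self.word_counts[sense] = counts
            self.totals[sense] = sum(counts.values())
            self.vocab |= set(counts)
        return self

    def predict(self, sentence):
        v = len(self.vocab)
        best, best_lp = None, float("-inf")
        for sense in self.priors:
            # Log posterior: log prior plus smoothed log likelihoods
            lp = math.log(self.priors[sense])
            for w in sentence.split():
                lp += math.log((self.word_counts[sense][w] + 1)
                               / (self.totals[sense] + v))
            if lp > best_lp:
                best, best_lp = sense, lp
        return best
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters even for sentence-length inputs.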

During testing, the model assigned each of the 1,747 sentences to the sense with the highest posterior probability. The overall accuracy reached 84%, meaning that 1,470 sentences were correctly disambiguated. The remaining 16% of errors were examined in detail. The dominant error sources were:

  1. Syntactic irregularities – sentences with fragmented or non‑canonical structures provided insufficient contextual clues.
  2. Insufficient lexical cues – some WordNet glosses are brief and abstract, limiting the discriminative power of the feature set.
  3. Morphological complexity – Bengali verbs and nouns undergo extensive inflection; “māthā” appears with various case endings (e.g., māthāy, māthāte) that were not fully normalized, leading to token mismatches.
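The third error source suggests a simple mitigation: stripping case endings before feature extraction so that inflected forms of the target word map to one token. The sketch below uses a small illustrative suffix list (not the paper's analyzer; a full morphological analyzer would be far more thorough):

```python
# Illustrative Bengali case-ending suffixes (e.g. locative "-te", objective
# "-ke", genitive "-r"); chosen for demonstration, not a complete inventory.
CASE_SUFFIXES = ["তে", "কে", "র"]

def strip_case_ending(token: str) -> str:
    """Strip the longest matching case suffix, keeping a non-trivial stem."""
    for suf in sorted(CASE_SUFFIXES, key=len, reverse=True):
        if token.endswith(suf) and len(token) > len(suf) + 1:
            return token[:-len(suf)]
    return token

print(strip_case_ending("মাথাতে"))  # → মাথা
```

Naive suffix stripping of this kind risks over-stemming short words, which is why the sketch refuses to reduce a token below two characters; a proper analyzer avoids this by consulting a lexicon.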

The authors discuss several avenues for improvement. Incorporating a robust Bengali morphological analyzer would allow systematic stemming and case‑ending stripping, thereby reducing sparsity in the feature space. Expanding the WordNet with richer example sentences and sense‑specific corpora would enhance the training data’s representativeness. Moreover, experimenting with alternative classifiers such as Support Vector Machines, Random Forests, or deep neural networks (e.g., Bi‑LSTM with attention) could potentially surpass the Naïve Bayes baseline, especially when combined with word embeddings trained on the same TDIL corpus.

Beyond the immediate WSD task, the methodology has broader implications for Bengali natural language processing. Accurate sense classification can improve downstream applications such as text categorization, information extraction, machine translation, and semantic search. The paper demonstrates that even with limited resources—a modestly sized sense‑annotated corpus and a publicly available WordNet—reasonable performance can be achieved for a morphologically rich language. It also highlights the critical role of careful preprocessing, especially normalization and stop‑word handling, in achieving reliable results.

In conclusion, the study provides a practical, reproducible framework for Bengali WSD, validates the utility of the Bengali WordNet as a knowledge source, and outlines concrete steps for future research to address morphological challenges and explore more sophisticated learning algorithms.

