A Machine Learning Approach for the Identification of Bengali Noun-Noun Compound Multiword Expressions

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This paper presents a machine learning approach for the identification of Bengali multiword expressions (MWEs) that are bigram nominal compounds. Our proposed approach has two steps: (1) candidate extraction using chunk information and various heuristic rules, and (2) training a Random Forest classifier to label each candidate as either a bigram nominal compound MWE or not. A variety of association measures, syntactic and linguistic clues, and a set of WordNet-based similarity features are used for the MWE identification task. The approach presented in this paper can be used to identify bigram nominal compound MWEs in Bengali running text.


💡 Research Summary

The paper tackles the problem of automatically identifying noun‑noun compound multi‑word expressions (MWEs) in Bengali, a language where morphological analysis is challenging and resources are scarce. The authors propose a two‑stage pipeline. In the first stage, candidate bigrams are extracted using chunk information produced by a Bengali chunk parser together with a set of heuristic rules. These rules filter for consecutive nouns (or noun‑like forms) inside the same noun phrase, enforce a minimum frequency threshold, exclude candidates that are preceded or followed by functional particles, and check for the presence of common Bengali affixes such as “‑এর” or “‑এরা”. This preprocessing reduces the search space dramatically, retaining only about 8 % of all possible bigrams as plausible MWE candidates.
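The extraction stage can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `(word, POS, chunk)` token representation, the tag names (`NN`, `NP`), and the `min_freq` threshold are my assumptions, and the particle and affix checks described above are omitted for brevity.

```python
# Hypothetical sketch of the candidate-extraction stage: collect
# consecutive noun-noun bigrams that fall inside the same NP chunk,
# then prune rare bigrams with a minimum-frequency threshold.
# Tag names and the threshold value are illustrative, not the paper's.
from collections import Counter

def extract_candidates(sentences, min_freq=2):
    """sentences: lists of (word, pos_tag, chunk_label) triples."""
    counts = Counter()
    for sent in sentences:
        for (w1, t1, c1), (w2, t2, c2) in zip(sent, sent[1:]):
            # Keep adjacent nouns sharing the same NP chunk label.
            if t1.startswith("NN") and t2.startswith("NN") and c1 == c2 == "NP":
                counts[(w1, w2)] += 1
    # Frequency filter: discard bigrams seen fewer than min_freq times.
    return {bigram: n for bigram, n in counts.items() if n >= min_freq}
```

A bigram such as ("book", "fair") occurring twice inside NP chunks would survive the filter at `min_freq=2`, while a one-off noun pair would be discarded.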

In the second stage, the filtered candidates are classified as true noun‑noun compound MWEs or non‑MWEs using a Random Forest (RF) classifier. The choice of RF is motivated by its robustness to over‑fitting, its ability to handle heterogeneous feature types, and the interpretability it offers through feature‑importance scores. The authors engineer a comprehensive feature set consisting of 45 dimensions, grouped into three main categories:

  1. Statistical association measures – pointwise mutual information (PMI), t‑score, log‑likelihood, Dice coefficient, and several others, capturing the strength of co‑occurrence between the two nouns.
  2. Syntactic and linguistic clues – chunk labels, part‑of‑speech patterns, positional information (sentence‑initial, medial, or final), surrounding context (presence of prepositions or particles), affix detection, and length‑based features such as token count and overall frequency.
  3. WordNet‑based semantic similarity – leveraging the Bengali WordNet, the authors compute path‑based (LCH), information‑content based (Resnik, Jiang‑Conrath), and hybrid (Wu‑Palmer) similarity scores between the synsets of the two nouns. These semantic features aim to capture whether the two words belong to related concepts, a property often indicative of a compound.
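For concreteness, three of the association measures in category 1 can be computed directly from corpus counts. The formulas below are the standard textbook definitions; the function and variable names are my own, not the paper's.

```python
# Standard association measures for a bigram (x, y), computed from
# corpus counts: bigram frequency f_xy, unigram frequencies f_x and
# f_y, and corpus size n.
import math

def association_scores(f_xy, f_x, f_y, n):
    p_xy, p_x, p_y = f_xy / n, f_x / n, f_y / n
    # PMI: how much more often x and y co-occur than chance predicts.
    pmi = math.log2(p_xy / (p_x * p_y))
    # t-score: observed minus expected co-occurrence, scaled by sqrt(f_xy).
    t_score = (f_xy - f_x * f_y / n) / math.sqrt(f_xy)
    # Dice coefficient: co-occurrence relative to total occurrences.
    dice = 2 * f_xy / (f_x + f_y)
    return {"pmi": pmi, "t": t_score, "dice": dice}
```

High values on all three measures indicate a noun pair that co-occurs far more often than its component frequencies alone would predict, which is exactly the signal category 1 feeds to the classifier.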

The experimental setup uses a self‑compiled Bengali corpus of roughly one million tokens drawn from news articles and blogs. A gold‑standard annotation set of 5,000 noun‑noun bigrams, manually labeled by three linguists as either compound MWE or not, serves as the evaluation benchmark. The authors perform 10‑fold cross‑validation and report the following results: Accuracy = 88.3 %, Precision = 85.7 %, Recall = 87.2 %, and F1‑score = 86.4 %. Ablation studies reveal that removing the WordNet‑based semantic features drops the F1‑score to 82.1 %, while relying solely on statistical association measures yields an F1 of 78.5 %. Feature‑importance analysis shows PMI, LCH similarity, chunk label, and candidate frequency as the top contributors, confirming that both statistical co‑occurrence and semantic relatedness are crucial.
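As a quick consistency check, the reported F1-score follows from the reported precision and recall, since F1 is their harmonic mean:

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Plugging in the paper's reported precision (85.7 %) and recall (87.2 %)
# reproduces the reported F1-score of 86.4 %.
print(round(f1(0.857, 0.872), 3))
```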

Error analysis identifies three dominant failure modes: (i) rare noun combinations that lack sufficient frequency evidence, (ii) polysemous nouns whose WordNet senses diverge, leading to low similarity scores, and (iii) inaccuracies in the upstream chunk parser that mis‑segment noun phrases, causing both false positives and false negatives. The authors suggest that expanding the training data, improving the chunker, and incorporating sense‑disambiguation could mitigate these issues.

The paper’s contributions are threefold: (1) a systematic set of heuristic rules for high‑precision candidate extraction in Bengali, (2) an integrated feature framework that combines statistical, syntactic, and semantic cues, and (3) empirical evidence that a Random Forest classifier, enriched with WordNet‑derived similarity features, outperforms purely rule‑based or purely statistical baselines for Bengali noun‑noun compound MWE detection. The authors conclude by outlining future directions, including scaling up to larger web corpora, experimenting with deep contextual embeddings such as multilingual BERT, and extending the methodology to other Indo‑Aryan languages (e.g., Hindi, Marathi) through cross‑lingual transfer learning.

