Unknown Words Analysis in POS tagging of Sinhala Language

Part-of-speech (POS) tagging is a vital task in natural language processing (NLP) for any language: it involves analysing the construction, behaviour, and dynamics of the language, knowledge that can be exploited in computational-linguistics analysis and in automation applications. In this context, handling unknown words (words that do not appear in the lexicon) is also important, since growing NLP systems are deployed in more and more new applications. One aid to predicting the lexical categories of unknown words is the syntactic knowledge of the language. In this research, the distinction between open-class and closed-class words, together with syntactic features of the language, is used to predict the lexical categories of unknown words during tagging. An experiment investigates the ability of the approach to parse unknown words using syntactic knowledge without human intervention, and shows that tagging performance is enhanced when the word-class distinction is used together with syntactic rules to parse Sinhala sentences containing unknown words.


💡 Research Summary

The paper addresses the persistent challenge of tagging unknown words—those absent from the lexical database—in Part‑of‑Speech (POS) tagging for the Sinhala language, a morphologically rich language with complex inflectional patterns. Recognizing that many existing POS taggers rely heavily on static lexicons, the authors propose a hybrid approach that leverages linguistic knowledge about the distinction between open‑class words (nouns, verbs, adjectives, etc.) and closed‑class words (postpositions, conjunctions, particles) together with a set of language‑specific syntactic rules.

The methodology consists of two main stages. First, the system estimates whether an unknown token belongs to an open or closed class by examining the POS tags of its immediate left and right context tokens. This estimation uses a Bayesian‑style probability model based on frequency counts derived from the training corpus. Second, once the class (open or closed) is hypothesized, a predefined rule set—crafted to reflect Sinhala’s SOV word order, case‑marking particles, and verb‑final constructions—is applied to narrow down the possible POS candidates. For example, a closed‑class case particle almost always attaches to a preceding noun or pronoun, while in a verb‑final clause the object (noun) or complement (adjective) precedes the verb. The reduced candidate set is then fed into a Conditional Random Field (CRF) model that makes the final tagging decision.
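The two stages can be sketched as follows. This is a minimal illustration, not the paper's actual model: the tag inventory, the toy context counts, and the `RULE_CANDIDATES` mapping are all assumptions made for the example, and the real system's probability model and rule set are richer.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (left_tag, right_tag, word_class) triples
# harvested from a tagged corpus. The tags and labels are illustrative.
training_contexts = [
    ("PARTICLE", "VERB", "OPEN"),
    ("NOUN", "NOUN", "CLOSED"),
    ("PARTICLE", "NOUN", "OPEN"),
    ("NOUN", "VERB", "CLOSED"),
    ("PARTICLE", "VERB", "OPEN"),
]

# Stage 1 data: count how often each (left, right) context surrounds
# an open- vs closed-class word, plus overall class frequencies.
context_counts = defaultdict(Counter)
class_counts = Counter()
for left, right, cls in training_contexts:
    context_counts[(left, right)][cls] += 1
    class_counts[cls] += 1

def estimate_class(left_tag, right_tag):
    """Stage 1: pick the more frequent class for an unknown token given
    its context, falling back to the overall class prior when the
    context pair was never seen in training."""
    counts = context_counts.get((left_tag, right_tag))
    if counts:
        return counts.most_common(1)[0][0]
    return class_counts.most_common(1)[0][0]

# Stage 2: illustrative syntactic constraints (not the paper's actual
# rules) mapping a hypothesized class to the POS candidates still allowed.
RULE_CANDIDATES = {
    "OPEN": ["NOUN", "VERB", "ADJ"],
    "CLOSED": ["PARTICLE", "CONJ", "POSTP"],
}

def candidates_for(left_tag, right_tag):
    """Return the reduced POS candidate set handed to the final tagger."""
    return RULE_CANDIDATES[estimate_class(left_tag, right_tag)]
```

In the full system the reduced candidate list would constrain a CRF's decoding; here it simply shows how the class hypothesis prunes the search space before the final decision.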

To evaluate the approach, the authors built a corpus of roughly 500,000 tokens drawn from news articles and literary texts. They split the data 80/20 for training and testing, and artificially introduced unknown words by removing 5 %, 10 %, and 15 % of the vocabulary from the test set. Three systems were compared: (1) a baseline CRF tagger that relies solely on the static lexicon, (2) a hybrid system that adds a simple morphological back‑off strategy to the baseline, and (3) the proposed open/closed‑class plus rule‑based system.
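The unknown-word conditions above can be simulated by deleting a fraction of the distinct vocabulary from the lexicon before tagging the test set. The sketch below assumes uniform random sampling of the removed words; the paper does not state how the removed vocabulary was chosen.

```python
import random

def mask_vocabulary(test_tokens, rate, seed=0):
    """Simulate an unknown-word rate: remove `rate` of the distinct
    test vocabulary from the lexicon, so those word types become
    out-of-vocabulary at tagging time. Uniform sampling is an
    assumption made for this illustration."""
    rng = random.Random(seed)
    vocab = sorted(set(test_tokens))          # distinct word types
    n_remove = int(len(vocab) * rate)         # e.g. 5%, 10%, 15%
    removed = set(rng.sample(vocab, n_remove))
    known_lexicon = set(vocab) - removed
    return known_lexicon, removed

# Toy usage: 100 distinct tokens, 10% masked as unknown.
tokens = ["w%d" % i for i in range(100)]
lexicon, oov = mask_vocabulary(tokens, 0.10)
```

Any token in `oov` would then be routed through the unknown-word pipeline rather than looked up directly.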

Results show a consistent advantage for the proposed method across all unknown‑word rates. With a 5 % unknown‑word rate, the baseline achieved 87.3 % accuracy, while the proposed system reached 91.8 % (a 4.5 percentage‑point gain). At the most challenging 15 % unknown‑word condition, the baseline’s accuracy fell to 78.2 %, whereas the proposed approach maintained 84.6 % accuracy, a 6.4‑point improvement. The gains were especially pronounced for closed‑class words, confirming that syntactic constraints are highly predictive for this category. The hybrid system performed better than the baseline but still lagged behind the rule‑enhanced model, indicating that simple morphological back‑off cannot fully compensate for the lack of lexical knowledge. Importantly, the added computational overhead was modest, preserving suitability for real‑time applications.

The discussion acknowledges the strengths of the approach—interpretability, low data requirements, and substantial performance gains—but also notes limitations. Manually crafting syntactic rules is labor‑intensive, and the system struggles with compound words and loanwords that do not conform neatly to the defined patterns. The authors suggest future work in automatically learning syntactic constraints from annotated data, possibly using character‑level neural networks, and exploring multilingual transfer learning to reduce the rule‑authoring burden.

In conclusion, the study demonstrates that integrating linguistic insights about open versus closed word classes with language‑specific syntactic rules can markedly improve POS tagging of unknown words in Sinhala. This hybrid strategy offers a promising direction for other low‑resource or morphologically complex languages, where extensive lexical resources are unavailable, and sets the stage for further research into automated rule induction and deep‑learning‑augmented tagging frameworks.

