Title: Sentiment Analysis and Emotion Classification using Machine Learning Techniques for Nagamese Language – A Low-resource Language
ArXiv ID: 2512.01256
Date: 2025-12-01
Authors: Author information is not given in the paper (the provided text includes no author names or affiliations).
📝 Abstract
The Nagamese language, also known as Naga Pidgin, is an Assamese-lexified creole language that developed primarily as a means of communication in trade between people from Nagaland and people from Assam in north-east India. A substantial amount of work on sentiment analysis has been done for resource-rich languages such as English and Hindi. However, no such work has been done for the Nagamese language. To the best of our knowledge, this is the first attempt at sentiment analysis and emotion classification for the Nagamese language. The aim of this work is to detect sentiments in terms of polarity (positive, negative, and neutral) and the basic emotions contained in Nagamese textual content. We build a sentiment polarity lexicon of 1,195 Nagamese words and use it to build features, along with additional features, for supervised machine learning techniques using Naïve Bayes and Support Vector Machines.
Keywords: Nagamese, NLP, sentiment analysis, machine learning
💡 Deep Analysis
📄 Full Content
Nagamese (Naga Pidgin) is an important creole language that, alongside the tribal languages, is used as a common language across the entire state of Nagaland in the north-eastern part of India. It is an Assamese-lexified creole language (Assam is an Indian state bordering Nagaland) that developed primarily as a means of communication in trade between the people of Nagaland and people from Assam. It is widely used in mass media, including news outlets, radio stations, and state-government media. Nagamese is a resource-poor language, so creating the resources needed for language processing tasks is a challenge.
Sentiment Analysis is a natural language processing (NLP) task that deals with capturing the opinion or sentiment of a person towards something such as an object or an event. The goal of Sentiment Analysis is to find these opinions, identify the sentiments expressed, and then classify them in terms of their polarity: positive, negative, or neutral.
Emotion classification is the task of identifying the emotions, such as anger or joy, expressed by a person.
These tasks are useful in building applications such as predicting market trends based on sentiment analysis or predicting election results based on emotion distribution.
The aim of this work is to identify the sentiment and emotion contained in Nagamese textual content. The objectives of this research are:
- Classification of sentiments in terms of polarity (positive, negative, and neutral) in the textual content of Nagamese language.
- Classification of basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) contained in the textual content of Nagamese language.
- Building sentiment polarity lexicons and using them to build features, along with additional features, for supervised machine learning techniques using Naïve Bayes and Support Vector Machines.
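To make the third objective concrete, the following is a minimal sketch of how a polarity lexicon could be turned into per-sentence features. It is not the authors' implementation: the lexicon entries are English placeholders standing in for Nagamese words (they are not drawn from the paper's lexicon), and the names `PLACEHOLDER_LEXICON`, `tokenize`, and `lexicon_features` are assumptions made for illustration.

```python
import re
from collections import Counter

POLARITIES = ("positive", "negative", "neutral")

# Hypothetical placeholder lexicon: English stand-ins, NOT entries from the
# paper's Nagamese polarity lexicon. Only the data shape is illustrated.
PLACEHOLDER_LEXICON = {
    "good": "positive",
    "happy": "positive",
    "bad": "negative",
    "sad": "negative",
    "today": "neutral",
}


def tokenize(sentence):
    """Lowercase the sentence and return its alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def lexicon_features(sentence, lexicon=PLACEHOLDER_LEXICON):
    """Return [positive, negative, neutral] counts of lexicon hits."""
    counts = Counter(
        lexicon[token] for token in tokenize(sentence) if token in lexicon
    )
    return [counts[polarity] for polarity in POLARITIES]


# Feature vector for one sentence; tokens outside the lexicon are ignored.
features = lexicon_features("Today was a good day, not a bad one.")
```

Count vectors of this kind could then be concatenated with any additional features and passed to a supervised classifier such as Naïve Bayes or a Support Vector Machine.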
The rest of the paper is organized as follows: Section 2 gives an introduction to the Nagamese language, Section 3 gives an overview of related work, Section 4 describes the dataset creation, Section 5 discusses the methodology in detail, Section 6 provides details of the experimental setup, Section 7 discusses the experimental results, and Section 8 draws the conclusion and discusses future work.
This section provides an overview of the character set, syllabic pattern, grammar, lexical, and other resources for the Nagamese Language.
The Nagamese language has 28 phonemes, including 6 vowels and 22 consonants (Sreedhar 1985).
Vowels: i, u, e, ǝ, o, a
Consonants: p, t, c, k, b, d, j, g, pʰ, tʰ, cʰ, kʰ, m, n, ń, s, š, h, r, l, w, y
An example sentence in Nagamese is given below:
‘Moy dos baje pora yeti ase.’ (I am waiting here from 10 o’clock.)
A Nagamese word may consist of one or more syllables, up to a maximum of four (Sreedhar 1985). All monosyllabic words can be grouped into six classes as given in the formula:
The only limitation in the above formula is that V cannot occur alone.
Nagamese also has disyllabic, trisyllabic, and tetrasyllabic words. There are no pentasyllabic words apart from clear compound words.
Nagamese is a Subject-Object-Verb (SOV) language (Singh 2021), similar to other Indian languages such as Hindi and Assamese. The detailed grammar of Nagamese can be found in the works of Sreedhar (1985), Baishya (2003), and Bhattacharjya (2003).
Listed below are some of the lexical and other resources that are available for Nagamese Language.
Since Nagamese is an Assamese-lexified creole language, we present here some of the work that has been done for the Assamese language. Das et al. (2021) built a sentiment polarity classification model for Assamese by applying machine learning classifiers to lexical features such as adjectives, adverbs, and verbs in the news domain. Dev et al. (2021) performed sentiment analysis on Assamese texts using the concepts of the popular sentiment analyzer “Vader”, with “Bengali-Vader” as its backbone.
Applying a supervised learning model requires a training dataset. This section describes the annotated sentiment polarity and emotion datasets. The unlabelled dataset, containing approximately 25k tokens and 1,195 unique words, was created by collecting online Nagamese news articles (https://nagamesekhobor.com/) and Bible passages (https://www.bible.com/). Nagamese Khobor contains a variety of content related to current state affairs, sports, etc. Random news articles were collected from Nagamese Khobor, from which a mixed corpus of 594 Nagamese sentences was extracted. These sentences were distributed equally among the annotators, 198 sentences each. Fig. 1 shows the word cloud for the Nagamese raw dataset based on word frequency in the unlabelled dataset. Fig. 2 shows the sample o