Code-mixed data poses an important challenge for natural language processing because its characteristics differ substantially from the traditional structures of standard languages. In this paper, we propose a novel approach called Sentiment Analysis of Code-Mixed Text (SACMT) to classify sentences into their corresponding sentiment - positive, negative or neutral - using contrastive learning. We utilize the shared parameters of siamese networks to map sentences of code-mixed and standard languages to a common sentiment space. We also introduce a basic clustering-based preprocessing method to capture variations of code-mixed transliterated words. Our experiments reveal that SACMT outperforms the state-of-the-art approaches in sentiment analysis for code-mixed text by 7.6% in accuracy and 10.1% in F-score.
Multilingual societies with significant internet penetration have widely adopted social media platforms, which has led to a proliferation of code-mixed text. Sentiment analysis of code-mixed data on social media platforms enables scrutiny of political campaigns, product reviews, advertisements and other social trends.
Code-mixed text adopts the vocabulary and grammar of multiple languages and often forms new structures shaped by its users. This is challenging for sentiment analysis because traditional semantic analysis approaches do not capture the meaning of such sentences. The scarcity of annotated data available for sentiment analysis also limits advances in the field.
In this paper, we aim to address these limitations and challenges by utilizing a novel unified framework called “Sentiment Analysis of Code-Mixed Text (SACMT)”. The SACMT model consists of twin Bi-directional Long Short-Term Memory Recurrent Neural Networks (Bi-LSTM RNNs) with shared parameters and a contrastive energy function, based on a similarity metric, on top. The energy function suits discriminative training of energy-based models [8].

⋆ These authors have contributed equally to this work.
SACMT learns the shared model parameters and the similarity metric by minimizing the energy function connecting the twin networks. Parameter sharing and the similarity metric ensure that if the sentences fed to the two Bi-LSTM networks carry the same sentiment, their representations lie nearer to each other in the sentiment space; otherwise, they lie farther apart. Hence, the representations of "India match jit gayi" (India won the match) and "Diwali ki shubh kamnaye sabko" (Happy Diwali to everybody) are close to each other, while "India match jit gayi" (India won the match) and "Bhai ki movie flop gayi" (Bhai's movie was a flop) are distant from each other. The learned similarity metric maps the sentiment similarity of sentences into a common sentiment space.
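The parameter-sharing idea above can be sketched in a few lines. The encoder below is a deliberately tiny stand-in for the twin Bi-LSTMs (a single shared weight matrix with mean pooling), chosen only to make the shared-parameter mapping into a common space concrete; the matrix `W`, the input shapes, and the random "sentences" are all illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# One weight matrix W used by BOTH twins, making parameter sharing explicit.
# (Hypothetical toy encoder; the paper uses twin Bi-LSTM RNNs.)
W = rng.normal(size=(8, 4))

def encode(token_vectors):
    """Map a sentence (rows = token vectors) into the common sentiment space."""
    return np.tanh(token_vectors @ W).mean(axis=0)

def similarity(a, b):
    """Cosine similarity between two sentence embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two "sentences" as random token-vector matrices (placeholders for real text).
sent_a = rng.normal(size=(5, 8))
sent_b = rng.normal(size=(7, 8))

emb_a, emb_b = encode(sent_a), encode(sent_b)
print(similarity(emb_a, emb_b))  # a scalar in [-1, 1]
```

Because both twins call the same `encode`, any gradient update to `W` moves both branches identically, which is what keeps same-sentiment sentences comparable in one shared space.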
Transliteration of phonetic languages, like Hindi, into the Roman script creates several variations of the same word. For example, "बहुत" (more) can be transliterated as bahut, bohot or bohut. To solve this challenge, we perform a preprocessing step that clusters multiple variations of a word together using an empirical similarity metric.
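A minimal sketch of such variant clustering is shown below. The specific heuristic (drop vowels, collapse repeated consonants) is an assumption chosen for illustration, not the paper's exact empirical metric; it exploits the fact that transliteration variants of Hindi words tend to differ mainly in their vowels.

```python
def skeleton(word):
    """Crude transliteration key: lowercase, drop vowels, collapse repeats.
    (An illustrative similarity heuristic, not the paper's exact metric.)"""
    consonants = [c for c in word.lower() if c not in "aeiou"]
    collapsed = []
    for c in consonants:
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    return "".join(collapsed)

def cluster(words):
    """Group words whose skeletons match, so spelling variants share a cluster."""
    clusters = {}
    for w in words:
        clusters.setdefault(skeleton(w), []).append(w)
    return clusters

print(cluster(["bahut", "bohot", "bohut", "movie"]))
# the three variants of "bahut" fall into a single cluster
```

Replacing every token with its cluster key before training collapses the variants into one vocabulary entry, which reduces sparsity in the transliterated text.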
The rest of the paper is organized as follows. Section 2 describes previous approaches in the field. Section 3 describes the datasets. Section 4 explains the architecture of SACMT. Section 5 defines the baselines. Sections 6 and 7 present the experimental set-up and results respectively. Finally, Section 8 concludes the paper.
The distributional semantics approach [10] captures word semantics but loses information about word order in the sentence. Another limitation of the technique is that it treats words as immutable; hence, it cannot properly handle spelling errors or out-of-vocabulary words. [12] assigns polarity scores to individual words, and the aggregate sentiment score of the constituent words determines the sentence's polarity. Thus, semantic relations and word order are lost, which leads to incorrect classification. N-grams mitigate this problem but do not eliminate it completely.
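The word-order failure mode of lexicon aggregation is easy to demonstrate. The lexicon and its scores below are invented for illustration (the actual scores in [12] differ); the point is only that summing per-word polarities ignores negation.

```python
# Toy polarity lexicon; words and scores are invented for this example.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "flop": -2.0}

def lexicon_score(sentence):
    """Sum word polarities; the sign of the total gives the predicted sentiment."""
    return sum(LEXICON.get(w, 0.0) for w in sentence.lower().split())

print(lexicon_score("the movie was good"))      # 1.0 -> positive
print(lexicon_score("the movie was not good"))  # 1.0 -> also "positive":
# negation and word order are lost, the failure mode described above
```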
Another line of research [7] utilizes character-level LSTMs to learn sub-word level information from social media text, which is then used to classify sentences against an annotated corpus. The model presents an effective approach for embedding sentences; however, its limitation is the requirement of abundant data.
Siamese networks (shown in Figure 1) enable contrastive learning of a similarity metric without an extensive dependence on hand-crafted features of the input. [3] introduced siamese networks to solve the problem of signature verification. Later, [4] used the architecture with a discriminative loss function for face verification. These networks also effectively enhance the quality of visual search [9,6]. Recently, [5] applied these networks to the problem of community question answering.
Let F_W(X) be a family of functions with parameters W, differentiable with respect to W. The siamese network seeks a value of the parameters W such that the symmetric similarity metric is small if X1 and X2 belong to the same category, and large if they belong to different categories. The scalar energy function S(C, R) that measures the sentiment relatedness between tweets of code-mixed (C) text and a resource-rich (R) language can be defined as:

S(C, R) = || F_W(C) - F_W(R) ||

i.e., the distance between the twin networks' representations of C and R.
In SACMT, we input tweets from both languages to the network and minimize the loss function such that S(C, R) is small if C and R carry the same sentiment, and large otherwise.
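The minimization above can be sketched with the standard margin-based contrastive loss of energy-based training; this is a common formulation for siamese networks and may differ from the paper's exact loss, and the embeddings and margin below are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(emb_c, emb_r, same_sentiment, margin=1.0):
    """Margin-based contrastive loss: pull same-sentiment pairs together,
    push different-sentiment pairs at least `margin` apart.
    (A standard formulation; the paper's exact loss may differ.)"""
    d = np.linalg.norm(emb_c - emb_r)  # the energy S(C, R)
    if same_sentiment:
        return 0.5 * d ** 2            # penalize any distance between the pair
    return 0.5 * max(0.0, margin - d) ** 2  # penalize only pairs inside the margin

# Illustrative sentence embeddings in the common sentiment space.
close = np.array([1.0, 0.0])
far = np.array([0.0, 3.0])

print(contrastive_loss(close, close, True))   # 0.0: matched pair at zero energy
print(contrastive_loss(close, far, False))    # 0.0: already beyond the margin
```

Gradient descent on this loss with respect to the shared encoder parameters is what drives same-sentiment sentences together and different-sentiment sentences apart in the sentiment space.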
We test the architecture on datasets of both code-mixed text (Hindi-English) and social media text in a standard language (English). The following are the datasets considered in our experiments.
- Hindi-English Code-Mixed (HECM): The dataset, proposed in [7], consists of 3879 annotated Hindi-English code-mixed sentences.