Automatic Normalization of Word Variations in Code-Mixed Social Media Text

February 20, 2026

Reading time: 5 minutes


📝 Original Info

Title: Automatic Normalization of Word Variations in Code-Mixed Social Media Text
ArXiv ID: 1804.00804
Date: 2024-03-08
Authors: Author information is not provided in the source.

📝 Abstract

Social media platforms such as Twitter and Facebook are becoming popular in multilingual societies. This trend induces a portmanteau of South Asian languages with English. The blend of multiple languages as code-mixed data has recently become popular in research communities for various NLP tasks. Code-mixed data contain anomalies such as grammatical errors and spelling variations. In this paper, we leverage the contextual property of words, whereby different spelling variations of a word share a similar context in large, noisy social media text. We capture different variations of words belonging to the same context in an unsupervised manner using distributed representations of words. Our experiments reveal that preprocessing the code-mixed dataset with our approach improves performance on state-of-the-art part-of-speech tagging (POS-tagging) and sentiment analysis tasks.

📄 Full Content

Code-mixing is the embedding of linguistic units such as phrases, words or morphemes of one language into the utterance of another language, whereas code-switching refers to the co-occurrence of speech extracts belonging to two different grammatical systems. Prevalent use of social media platforms by multilingual speakers leads to an increase in code-mixing and code-switching [5,10,3,6]. Here we refer to both scenarios as code-mixing. Hindi-English bilingual speakers generate an immense amount of code-mixed social media text (CSMT). [19] noted that the complexity of analyzing CSMT stems from non-adherence to a formal grammar, spelling variations, lack of annotated data, the inherently conversational nature of the text, and code-mixing. Traditional tools presume that texts adhere strictly to a formal structure. Hence, NLP tools designed specifically for CSMT are required and should be improved upon.

Internet usage is steadily increasing in multilingual societies such as India, where there are 22 official languages at the central and state levels, of which Hindi and English are the most prevalent. These multilingual populations actively use code-mixed language on social media to share their opinions. With over 400 million Indians on the Internet, a number predicted to double in the next 5 years, we see huge potential for research on CSMT data.

On analyzing CSMT data, we find the following ways in which it deviates from the formal standard form of the respective language:

- Informal transliteration: These variations are due to the lack of a transliteration standard. Multilingual speakers tend to transliterate words directly from the native script to roman script without following any formal transliteration method, which leads to many phonetic variations. For example, बहुत can be transliterated as bahoot, bahout or bahut, depending on socio-cultural factors such as accent, dialect and region.
- Informal speech: These variations are not language-specific. Speakers tend to write non-standard spellings on social media, and this informal register leads to spelling variations. For example, cooooooool, gud, mistke or lappy.
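The two kinds of variation listed above can be made concrete with a small sketch. This is only an illustration of why the example spellings count as variants of one word, not the normalization method of the paper; the helper names `collapse_elongation` and `consonant_skeleton` are hypothetical, and the vowel-dropping key is a crude heuristic assumed here for demonstration.

```python
import re

def collapse_elongation(word: str, max_run: int = 2) -> str:
    """Shorten any run of one repeated character to at most max_run characters."""
    pattern = r"(.)\1{%d,}" % max_run  # a character followed by max_run or more copies
    return re.sub(pattern, lambda m: m.group(1) * max_run, word)

def consonant_skeleton(word: str) -> str:
    """Rough key (assumed heuristic): drop vowels, then collapse doubled letters."""
    no_vowels = re.sub(r"[aeiou]", "", word.lower())
    return re.sub(r"(.)\1+", r"\1", no_vowels)

# Informal transliteration: the three romanizations share one skeleton.
skeletons = {w: consonant_skeleton(w) for w in ["bahoot", "bahout", "bahut"]}
print(skeletons)

# Informal speech: character elongation is reduced to a bounded run.
print(collapse_elongation("cooooooool"))
```

A surface heuristic like this over-merges unrelated words that happen to share consonants, which is one reason a context-based approach is attractive for this problem.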

Section 3 discusses the variations in more detail. Unless a system captures these variations in code-mixed data, its performance will not be on par with that of corresponding systems on formal standard texts. In this paper, we present a novel approach to automating the normalization of code-mixed informal text. We also compare the performance of current state-of-the-art CSMT sentiment analysis [12] and POS-tagging [4] systems on CSMT data.

Section 2 discusses relevant recent work, and Section 3 describes and explains these variations. Section 4 provides information about the datasets we use in this paper. Section 5 gives the detailed methodology of our approach. The experiments and evaluations are presented in Section 6, and the conclusion is in Section 7.

Analysis of code-mixed languages has recently gained interest owing to the rise of non-native English speakers. [15] normalized code-mixed text by segregating Hindi and English words. Hindi-English (Hi-En) code-mixing allows ease of communication among speakers by providing a much wider variety of phrases and expressions. A common form of code-mixing is romanization [15], which refers to transliteration from a different writing system to the roman script. But this freedom makes formal rules irrelevant, leading to more complexity in NLP tasks on CSMT data, as highlighted by [2,19,1].

Figure 1: Most frequent 50 words with some clusters from CSMT data

Shared tasks [14,17] have taken initiatives toward the analysis of CSMT data. [18] used distributional representations to normalize English social media text by substituting spelling mistakes with their corresponding normal form. Deep learning based solutions [21,16] have also been demonstrated for various NLP tasks. [15] provides a rule-based methodology for normalizing these variably spelled romanized words. However, an automated, unsupervised machine learning model allows the system to be used across languages. This is important for other Indian languages, which exhibit code-mixing and romanization behavior similar to Hindi.
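The idea that spelling variants of a word share similar contexts can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the model used in the paper: it swaps learned distributed representations for plain co-occurrence counts, the toy corpus is invented, and the names `context_vectors` and `cosine` are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences, window=2):
    """Count, for each word, the words that appear within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented toy corpus: three spellings of one Hindi word in similar contexts.
corpus = [
    "movie bahut acchi thi".split(),
    "movie bahoot acchi thi".split(),
    "movie bahout acchi lagi".split(),
    "kal office jana hai".split(),
]
vecs = context_vectors(corpus)

sim_variant = cosine(vecs["bahut"], vecs["bahoot"])    # same surrounding words
sim_partial = cosine(vecs["bahut"], vecs["bahout"])    # overlapping surrounding words
sim_unrelated = cosine(vecs["bahut"], vecs["office"])  # no shared surrounding words
print(sim_variant, sim_partial, sim_unrelated)
```

On a real corpus, dense embeddings trained on large noisy text would replace the count vectors, and candidate variants would typically be filtered further before being mapped to a single normalized form.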

Informal variations in lexical forms have not been thoroughly explored. Hence, we provide our own nomenclature to better address the normalization process and to provide a basis for further discussion.

As explained in Section 1, we observe informal variations of words in CSMT data. We explain the variations here in the Hindi-English context; however, the approach is not language-specific.

- Informal transliterations: The lack of a transliteration standard means that decisions about marking vowels and other sounds rely entirely on the user. This happens when romanizing Hindi, which is written phonetically.

• Long Vowel transliteration: Users are found to be indicating vowel length in


This content is AI-processed based on open access ArXiv data.
