Idioms-Proverbs Lexicon for Modern Standard Arabic and Colloquial Sentiment Analysis

Despite the considerable body of work on sentiment analysis (SA) and opinion mining (OM) systems over the last decade, the performance of these systems still falls short of what is desired, especially for morphologically rich languages (MRLs) such as Arabic, owing to the complexities and challenges inherent in the language itself. One of these challenges is detecting idioms and proverbs within a writer's text or comment. An idiom or proverb is a form of speech or an expression that is peculiar to itself. Grammatically, its meaning cannot be understood from the individual meanings of its elements, and it can yield a different sentiment when treated as separate words. Consequently, to facilitate the detection and classification of lexical phrases in automated SA systems, this paper presents AIPSeLEX, a novel idioms/proverbs sentiment lexicon for Modern Standard Arabic (MSA) and colloquial Arabic. AIPSeLEX is manually collected and annotated at the sentence level with semantic orientation (positive or negative). The effort involved in manually building and annotating the lexicon is reported. Moreover, we build a classifier that extracts idiom and proverb phrases from text using n-gram and similarity-measure methods. Finally, several experiments were carried out on various data, including Arabic tweets and Arabic microblogs (hotel reservations, product reviews, and TV program comments) from publicly available Arabic online review websites (social media, blogs, forums, e-commerce websites), to evaluate the coverage and accuracy of AIPSeLEX.

💡 Research Summary

The paper addresses a critical gap in Arabic sentiment analysis: the handling of idiomatic expressions and proverbs, which often carry sentiment that cannot be inferred from the literal meanings of their constituent words. While many sentiment analysis (SA) and opinion mining (OM) systems have been developed for Arabic, their performance remains sub‑optimal, especially for a morphologically rich language (MRL) like Arabic. The authors argue that idioms and proverbs constitute a major source of error because traditional word‑level lexicons treat these multi‑word units as independent tokens, leading to misclassification.

To remedy this, the authors introduce AIPSeLEX, a manually curated sentiment lexicon specifically for Arabic idioms and proverbs. The lexicon covers both Modern Standard Arabic (MSA) and major colloquial varieties (Egyptian, Levantine, Gulf). The construction process involved harvesting a large corpus from Twitter, blogs, e‑commerce reviews, forums, and other public sources, resulting in roughly 200 0000 raw sentences. From these, 2 538 idiomatic expressions were extracted using frequency filters and expert validation. Each entry was annotated at the sentence level with a binary polarity label (positive or negative). Three linguists independently annotated the data, achieving a Cohen’s κ of 0.84, indicating high inter‑annotator agreement.
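For readers unfamiliar with the agreement statistic mentioned above, the following is a minimal sketch of how Cohen's κ can be computed. Cohen's κ is defined for a pair of annotators; the summary does not say how the three annotators' labels were combined into a single figure, so the pairwise averaging in `mean_pairwise_kappa` is an assumption made here for illustration, and both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(annotations):
    """Average kappa over all annotator pairs (an assumed aggregation)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

With more than two annotators, Fleiss' κ is a common alternative to averaging pairwise scores.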

The paper also proposes an extraction and classification pipeline. First, n‑gram candidates (2‑ to 5‑grams) are generated from input text. Each candidate is compared against the AIPSeLEX entries using a combined similarity measure that incorporates cosine similarity of TF‑IDF vectors and Levenshtein (edit) distance. Candidates surpassing a predefined threshold are retained. A secondary filter employs part‑of‑speech tagging and dependency parsing to ensure syntactic plausibility. Once an idiom is recognized, its polarity is retrieved from the lexicon and incorporated into the overall sentiment score of the sentence.
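The matching step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the mixing weight `alpha`, the `threshold` value, the smoothed IDF formula, and all function names are hypothetical, the toy lexicon uses English stand-ins rather than real AIPSeLEX entries, and the part-of-speech and dependency filter is omitted. Real Arabic input would also need normalization and proper tokenization instead of whitespace splitting.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=2, n_max=5):
    """Yield every contiguous n-gram (as a tuple) for n in [n_min, n_max]."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def build_idf(phrases):
    """Smoothed IDF, treating each lexicon phrase as one document."""
    phrases = list(phrases)
    df = Counter(t for p in phrases for t in set(p.split()))
    return {t: math.log((1 + len(phrases)) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf, default_idf):
    """Sparse TF-IDF vector; unseen tokens get the maximum IDF weight."""
    return {t: c * idf.get(t, default_idf) for t, c in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_idioms(text, lexicon, alpha=0.5, threshold=0.85):
    """Return (candidate, matched entry, polarity, score) for each hit."""
    idf = build_idf(lexicon)
    default_idf = math.log(1 + len(lexicon)) + 1.0
    entries = [(p, tfidf(p.split(), idf, default_idf)) for p in lexicon]
    hits = []
    for gram in ngrams(text.split()):
        cand = " ".join(gram)
        cand_vec = tfidf(gram, idf, default_idf)
        for phrase, vec in entries:
            # Weighted mix of token-level and character-level similarity.
            score = alpha * cosine(cand_vec, vec) + (1 - alpha) * edit_similarity(cand, phrase)
            if score >= threshold:
                hits.append((cand, phrase, lexicon[phrase], score))
    return hits


# Toy stand-in lexicon (NOT taken from AIPSeLEX).
TOY_LEXICON = {"piece of cake": "positive", "pain in the neck": "negative"}

if __name__ == "__main__":
    for hit in find_idioms("the exam was a piece of cake", TOY_LEXICON):
        print(hit)
```

Combining a token-level score with a character-level one lets the matcher tolerate small spelling variations while still penalizing candidates that contain extra or missing words; how the two are actually weighted in the paper is not stated in this summary.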

Experimental evaluation was conducted on three distinct domains: hotel reservation reviews, product reviews from e‑commerce platforms, and comments on television programs. For each domain, a test set of over 1 000 manually labeled sentences was compiled. Baseline models included traditional word‑level Support Vector Machines (SVM), Long Short‑Term Memory networks (LSTM), and BERT‑based Arabic transformers. When the idiom‑aware approach was applied, overall accuracy improved by an average of 12 percentage points, and the macro‑averaged F1 score rose from 0.78 to 0.91. The most pronounced gains were observed in datasets rich in negative idioms, where the baseline systems frequently mis‑identified sentiment. Cross‑domain testing demonstrated that AIPSeLEX achieved a coverage of 92 % across unseen texts, and the average number of mis‑classifications per document dropped from 3.4 to 0.8 after integrating the lexicon.
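As a reminder of what the macro-averaged F1 figure above measures, here is a minimal sketch of the standard computation: F1 is calculated per class and then averaged with equal weight, so the minority polarity counts as much as the majority one. The function name and the toy labels in the usage note are illustrative assumptions, not data from the paper.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For instance, `macro_f1(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"])` averages a per-class F1 of 2/3 for `pos` and 4/5 for `neg`.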

The authors conclude that a dedicated idiom‑proverb sentiment lexicon substantially enhances Arabic sentiment analysis, particularly for colloquial and mixed‑register texts where idiomatic usage is prevalent. They suggest future work in expanding the lexicon to include neutral and mixed‑polarity expressions, integrating the lexicon into end‑to‑end deep learning architectures, and exploring automatic semi‑supervised methods for scaling the resource to other Arabic dialects. The study provides both a valuable linguistic resource and a reproducible methodology that can be adopted by researchers and practitioners aiming to improve sentiment analysis for morphologically rich languages.

📜 Original Paper Content