POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization


Online polarization poses a growing challenge for democratic discourse, yet most computational social science research remains monolingual, culturally narrow, or event-specific. We introduce POLAR, a multilingual, multicultural, and multi-event dataset with over 110K instances in 22 languages drawn from diverse online platforms and real-world events. Polarization is annotated along three axes, namely detection, type, and manifestation, using a variety of annotation platforms adapted to each cultural context. We conduct two main experiments: (1) fine-tuning six pretrained small language models; and (2) evaluating a range of open and closed large language models in few-shot and zero-shot settings. The results show that, while most models perform well in binary polarization detection, they achieve substantially lower performance when predicting polarization types and manifestations. These findings highlight the complex, highly contextual nature of polarization and demonstrate the need for robust, adaptable approaches in NLP and computational social science. All resources will be released to support further research and effective mitigation of digital polarization globally.


💡 Research Summary

The paper introduces POLAR, a large‑scale benchmark designed to address the severe limitations of existing polarization research, which has largely been confined to English or other high‑resource languages, single cultural contexts, and event‑specific datasets. POLAR comprises over 110K online text instances collected from a wide variety of platforms (X, Facebook, Reddit, Bluesky, Threads, YouTube comments, Weibo, Zhihu, etc.) across 22 languages that span seven language families, deliberately balancing high‑, medium‑, and low‑resource languages. The data are anchored in real‑world events such as the Russia‑Ukraine war, the Tigray conflict, elections in the United States and Germany, public‑health crises, migration waves, climate debates, and identity‑based movements, ensuring that the corpus reflects diverse sociopolitical realities.

Polarization is operationalized along three complementary axes: (1) Binary Detection (POLARDETECT) – whether a text expresses any form of polarization; (2) Type Classification (POLARTYPE) – the social dimension underlying the polarization (political, racial/ethnic, religious, gender/sexual, or other); and (3) Manifestation Identification (POLARMANIFEST) – the rhetorical tactics employed, including stereotyping, vilification, dehumanization, extreme language/absolutism, lack of empathy, and invalidation. Annotators were instructed to apply all applicable labels, reflecting the often overlapping nature of polarized discourse.
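The three‑axis taxonomy lends itself to a multi‑hot encoding, since annotators may apply several manifestation labels to the same text. The sketch below uses the label names from the summary above; the encoding itself is illustrative and not the authors' released data format:

```python
# Illustrative encoding of the POLAR taxonomy (not the official release format).
POLAR_TYPES = ["political", "racial/ethnic", "religious", "gender/sexual", "other"]
POLAR_MANIFESTATIONS = [
    "stereotyping", "vilification", "dehumanization",
    "extreme language/absolutism", "lack of empathy", "invalidation",
]

def encode_manifestations(labels):
    """Multi-hot vector: a text may exhibit several rhetorical tactics at once."""
    return [1 if m in labels else 0 for m in POLAR_MANIFESTATIONS]

# A text annotated as both vilifying and dehumanizing:
vec = encode_manifestations({"vilification", "dehumanization"})
```

A multi‑hot target like this is what makes POLARMANIFEST a multi‑label task, in contrast to the single‑label POLARDETECT and POLARTYPE axes.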

To accommodate the cultural and linguistic breadth, the authors devised a cross‑cultural annotation protocol. For high‑resource languages, crowd‑sourcing platforms (Amazon MTurk, Prolific) were used; for low‑resource languages, community‑based annotators with at least a bachelor’s degree were recruited. Annotators underwent rigorous training, pilot testing, and continuous performance monitoring. Only workers achieving a Fleiss’ κ of ≥ 0.8 (or equivalent Krippendorff’s α/Cohen’s κ where noted) were retained. Inter‑annotator agreement varied considerably across languages, ranging from 0.10 (German) to 0.83 (Khmer), with most languages clustering around 0.4–0.6. Notably, Italian, Russian, Burmese, and Polish lack manifestation annotations, highlighting an area for future enrichment.
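For reference, the Fleiss' κ statistic used as the retention threshold above can be computed from a table of per‑item category counts. This is a standard textbook implementation, not the authors' code:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for multiple raters.

    counts[i][j] = number of raters assigning item i to category j.
    Assumes every item was rated by the same number of raters.
    """
    N = len(counts)                          # number of items
    n = sum(counts[0])                       # raters per item
    k = len(counts[0])                       # number of categories
    # Mean per-item agreement across rater pairs.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N
    # Chance agreement from marginal category proportions.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields κ = 1, while agreement no better than chance yields κ ≤ 0, which is why values like 0.10 (German) indicate barely-above-chance labeling on a hard, culturally contested task.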

The dataset construction pipeline involved language‑specific preprocessing (tokenization, length filtering, duplicate removal, anonymization) and a keyword‑driven retrieval strategy tailored by native‑speaker experts for each event. For several languages, existing toxic or hate‑speech corpora (e.g., ToxicN, COLD, Turkish Hate Speech Dataset) were re‑annotated to augment coverage.
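The pipeline steps described above can be sketched as a single pass over raw posts. The regex patterns, placeholder tokens, and length thresholds here are hypothetical stand‑ins for the paper's language‑specific settings:

```python
import re

# Hypothetical preprocessing sketch mirroring the described pipeline;
# patterns and thresholds are illustrative, not the paper's exact settings.
USER_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")

def preprocess(posts, keywords, min_tokens=3, max_tokens=512):
    seen, kept = set(), []
    kw = {k.lower() for k in keywords}
    for text in posts:
        # Anonymization: mask user handles and links.
        clean = URL_RE.sub("[URL]", USER_RE.sub("[USER]", text)).strip()
        tokens = clean.split()
        if not (min_tokens <= len(tokens) <= max_tokens):
            continue                         # length filtering
        if clean.lower() in seen:
            continue                         # duplicate removal
        seen.add(clean.lower())
        if any(k in clean.lower() for k in kw):
            kept.append(clean)               # keyword-driven retrieval
    return kept
```

In the actual benchmark, the keyword lists were curated per event by native‑speaker experts rather than fixed in code.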

Experimental evaluation proceeded in two parts. First, six small pretrained multilingual models (e.g., mBERT, XLM‑R, RemBERT) were fine‑tuned on each of the three tasks, employing multi‑label loss functions where appropriate. Second, a suite of open‑source and commercial large language models (GPT‑3.5, GPT‑4, LLaMA‑2, Claude, Mistral, etc.) was assessed in zero‑shot and few‑shot (5‑example) settings using carefully crafted prompts. Results reveal a clear performance gradient: binary detection achieved relatively high F1 scores (≈ 78–85 % across most languages), whereas type classification dropped to an average of 45–55 % F1, and manifestation identification fell further to 30–45 % F1. The gap was especially pronounced for low‑resource languages, where even the strongest LLMs struggled to surpass baseline performance. Few‑shot prompting offered modest gains over zero‑shot but did not close the gap with fine‑tuned small models.
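A few‑shot setup like the one evaluated above amounts to prepending labeled examples to the query before sending it to an LLM. The prompt wording below is hypothetical; the paper's exact prompts are not reproduced here:

```python
# Illustrative few-shot prompt builder for the binary detection task
# (prompt wording is a hypothetical stand-in for the paper's prompts).
def build_prompt(examples, query, k=5):
    lines = ["Decide whether each text expresses polarization. Answer Yes or No."]
    for text, label in examples[:k]:
        lines.append(f"Text: {text}\nPolarized: {'Yes' if label else 'No'}")
    lines.append(f"Text: {query}\nPolarized:")
    return "\n\n".join(lines)

prompt = build_prompt(
    [("They are all traitors to this country.", 1), ("Lovely sunset tonight.", 0)],
    "Those people ruin everything they touch.",
)
```

Setting `k=0` recovers the zero‑shot condition; the summary's finding is that neither variant matched fine‑tuned small models on the harder type and manifestation tasks.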

These findings underscore that online polarization is not merely a matter of sentiment polarity; it is deeply embedded in cultural narratives, identity politics, and rhetorical strategies that current NLP models capture only superficially. The authors argue that future work must explore (i) culturally aware prompt engineering, (ii) advanced multi‑task and multi‑label training objectives, (iii) data augmentation and cross‑lingual transfer techniques tailored for low‑resource settings, and (iv) integration of multimodal signals (e.g., images, emojis) that often co‑occur with polarized text.

The paper’s contributions are threefold: (1) the release of POLAR, the first multilingual, multicultural, multi‑event dataset for fine‑grained polarization analysis; (2) a rigorously defined taxonomy of polarization types and manifestations, operationalized through a robust, culturally adapted annotation protocol; and (3) comprehensive benchmarking of both small and large language models, exposing current limitations and charting a roadmap for more adaptable, context‑sensitive NLP solutions. All resources—including raw data, annotation guidelines, and code—will be publicly released, inviting the research community to build upon this foundation and to develop mitigation strategies that can be deployed globally to curb the spread of harmful polarized discourse.

