ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data that lacks deep linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. Where broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. The dataset was developed by experts in Arabic linguistics, ensuring cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against a single-pass human baseline (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with error rates reaching 36.5% on diacritics-reliant tasks, well above those on compositional semantics. While top commercial models (Gemini-3-flash at 94.2%) surpass the average single human, a substantial gap persists between commercial giants and Arabic-native models, with the best Arabic-specific model (Jais-2-70B at 83.6%) approaching but not matching human performance.
💡 Research Summary
The paper introduces ALPS (Arabic Linguistic & Pragmatic Suite), a diagnostic benchmark specifically designed to probe deep semantic and pragmatic reasoning in Arabic language models. Unlike most recent Arabic NLP benchmarks that prioritize scale and often rely on synthetic or translated data, ALPS consists of 531 carefully crafted multiple‑choice items covering 15 diagnostic tasks and 47 subtasks. The items are native Arabic, drawn from Classical Arabic, Modern Standard Arabic, Quranic text, and poetry, and were created and validated entirely by experts in Arabic linguistics. The construction pipeline involved item creation, reference consultation, distractor design, and multi‑round expert validation, ensuring cultural authenticity and eliminating translation artifacts.
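To make the dataset's structure concrete, the sketch below shows what a single ALPS item might look like as a record. The field names and example values are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class ALPSItem:
    """Hypothetical schema for one ALPS multiple-choice item (illustrative only)."""
    item_id: str    # e.g., "pragmatics-implicature-017" (invented identifier)
    task: str       # one of the 15 diagnostic tasks
    subtask: str    # one of the 47 subtasks
    source: str     # e.g., "Classical Arabic", "MSA", "Quranic", "Poetry"
    question: str   # native Arabic question text
    choices: list[str] = field(default_factory=list)  # gold answer plus expert-designed distractors
    answer: str = ""  # gold option label, e.g., "B"
```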
Human performance was measured under a strict “single‑pass” closed‑book protocol with four annotators (two PhDs, one MA, one BA). The average accuracy was 84.6%, with substantial variation across tasks (e.g., Speech Act Theory 73.8%, Implicature 86.2%). Inter‑annotator agreement was modest (Fleiss’ κ = 0.23), reflecting genuine task difficulty rather than annotation noise. An expert‑adjudicated “oracle” achieved 99.2% accuracy, confirming that the questions are well‑posed.
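As a point of reference, Fleiss' κ for a setup like this (four annotators labeling every item) can be computed with statsmodels. The snippet below is a minimal sketch using invented labels, not the paper's annotation data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = items, columns = the four annotators; values are the chosen options.
# These labels are invented purely for illustration.
labels = np.array([
    ["A", "A", "B", "A"],
    ["C", "C", "C", "C"],
    ["B", "A", "B", "D"],
    ["D", "D", "D", "A"],
])

# Convert raw labels into an items-by-categories count table, then compute kappa.
table, _ = aggregate_raters(labels)
print(f"Fleiss' kappa = {fleiss_kappa(table):.2f}")
```

On the standard Landis–Koch scale, κ = 0.23 falls in the "fair agreement" band, consistent with the summary's reading that disagreement reflects genuinely hard items rather than careless annotation.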
The authors evaluated 23 models—including commercial giants (Gemini‑3‑flash, GPT‑5.2, Claude‑Opus‑4), open‑source multilingual models, and Arabic‑native models (Jais‑2‑70B, c4ai‑aya‑expanse‑32B)—using a zero‑shot prompting strategy that provides only format instructions, avoiding any in‑context examples. Results reveal a clear hierarchy: top commercial models outperform both open‑source and Arabic‑specific models, with Gemini‑3‑flash reaching 94.2% overall accuracy, surpassing the average human baseline but still 5 pp below the oracle. Notably, tasks that depend on diacritics and morpho‑syntactic cues exhibit a 36.5% error rate, indicating that current LLMs struggle with fine‑grained Arabic grammar (iʿrāb) and diacritic restoration.
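The paper's exact prompt is not quoted in this summary; the function below is a hypothetical reconstruction of what a format-only, zero-shot prompt could look like, with no in-context examples.

```python
def build_zero_shot_prompt(question: str, choices: list[str]) -> str:
    """Format-only zero-shot prompt: instructions and the item, nothing else.

    Hypothetical reconstruction; the actual ALPS prompt wording may differ.
    """
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return (
        "Answer the following multiple-choice question about Arabic.\n"
        "Respond with the letter of the correct option only.\n\n"
        f"{question}\n\n{options}"
    )
```

Constraining the model to a single letter keeps scoring trivial: accuracy reduces to an exact string match against the gold label.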
A deeper split between semantics and pragmatics shows that even the best models retain a gap between the two (e.g., Gemini‑3‑flash scores 93.8% on semantics vs. 90.8% on pragmatics). Arabic‑native models sometimes reverse this pattern, performing slightly better on pragmatic tasks, which suggests that language‑specific pre‑training can aid inference about speaker intent, implicature, and presupposition. However, all models lag on the most challenging pragmatic tasks such as Speech Act Theory (best model: 73.8%) and Implicature, underscoring a fundamental limitation: they excel at rule‑based compositional reasoning but lack robust inference about context, intent, and subtle pragmatic cues.
The paper positions ALPS as a complementary diagnostic tool to large‑scale benchmarks like ORCA, BALSAM, and ArabicMMLU. By focusing on depth rather than breadth, ALPS can expose specific linguistic failure modes, guide targeted model improvements (e.g., better handling of iʿrāb, diacritic‑aware tokenization, pragmatic reasoning modules), and provide a more nuanced “linguistic reasoning” standard for Arabic LLMs. The authors argue that such fine‑grained evaluation is essential for moving Arabic NLP from surface‑level fluency toward genuine linguistic competence.