From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models
Arabic Language Models (LMs) are pretrained predominantly on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA as the standard written variety is commonly used in formal settings, people speak and write online in various dialects spread across the Arab region. This poses limitations for Arabic LMs, since these dialects vary in their similarity to MSA. In this work we study cross-lingual transfer of Arabic models using probing on three Natural Language Processing (NLP) tasks, and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence of negative interference in models trained to support all Arabic dialects. This calls the assumed similarity of the dialects into question, and raises concerns for cross-lingual transfer in Arabic models.
💡 Research Summary
The paper investigates how well Arabic language models pretrained primarily on Modern Standard Arabic (MSA) transfer to the many spoken dialects of Arabic (DA). Recognizing the diglossic nature of Arabic—where MSA dominates formal writing while dialects are used in everyday communication—the authors ask whether MSA‑centric or multi‑dialect models can generalize equitably across the dialectal landscape, and whether negative interference occurs when a single model is trained on all dialects.
To answer these questions, the study combines two complementary analytical methods. First, linear probing is applied to frozen layer‑wise embeddings from each model. Simple multinomial logistic regression classifiers are trained on three downstream tasks: Part‑of‑Speech tagging (POS), Named Entity Recognition (NER), and Sentiment Analysis (SA). By measuring probe accuracy per layer, the authors identify where linguistic information (morphology, syntax, sentiment cues) is most readily extractable. Second, they compute representational similarity between MSA and dialect models using Centered Kernel Alignment (CKA). CKA provides a scale‑invariant similarity score (0–1) for the hidden representations of parallel sentences from the MADAR corpus, offering a task‑agnostic view of how similarly the models encode Arabic input.
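To make the CKA step concrete, here is a minimal NumPy sketch of *linear* CKA, the most common variant (the summary does not specify which kernel the authors use, so linear CKA is an assumption). `X` and `Y` would hold the hidden states of the same parallel sentences from two models at a given layer:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices of shape (n_examples, dim). Returns a score in [0, 1]:
    1 means the representations are identical up to rotation/scaling."""
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Toy demo with random "hidden states" (stand-ins for model activations).
rng = np.random.default_rng(0)
reps_a = rng.normal(size=(200, 64))
print(f"CKA(X, X)  = {linear_cka(reps_a, reps_a):.3f}")   # identical reps
print(f"CKA(X, 2X) = {linear_cka(reps_a, 2 * reps_a):.3f}")  # scale-invariant
```

Linear CKA's invariance to isotropic scaling and orthogonal transforms is what makes it suitable here: two models can encode the same information in rotated coordinate systems and still score near 1.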
A novel aspect of the work is the incorporation of a geographic proximity proxy. Since MSA has no fixed location, the authors adopt Yemeni Arabic as an operational anchor for MSA, based on prior lexical‑semantic studies that suggest a close relationship between Classical Arabic and Yemeni dialects. They then correlate both probing performance and CKA scores with the physical distance between each dialect’s primary country and Yemen, testing the dialect‑continuum hypothesis that geographically closer varieties should be more similar.
The experimental setup includes balanced datasets for each dialect and task, with careful preprocessing (clitic handling for POS, silver‑standard NER for Gulf dialects, and tweet cleaning for SA). Models are all BERT‑style encoders to avoid architectural confounds: an MSA‑only model, a multi‑dialect “MIX” model, and several dialect‑specific models. Probes are trained separately on each layer, and CKA is computed layer‑wise on 2,000 parallel sentences per city‑level dialect.
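The per-layer probing loop can be sketched as follows. This is an illustrative stand-in, not the authors' code: the random `embeddings` array substitutes for real frozen hidden states from a BERT-style encoder, the labels are synthetic, and scikit-learn's `LogisticRegression` is an assumed (but standard) choice for the multinomial probe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Stand-in for one layer's frozen embeddings: (n_examples, hidden_dim).
embeddings = rng.normal(size=(500, 768))
# Synthetic labels, e.g. 3 sentiment classes for the SA task.
labels = rng.integers(0, 3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0
)
# A simple linear probe: multinomial logistic regression on frozen features.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_tr, y_tr)
print(f"probe accuracy for this layer: {probe.score(X_te, y_te):.3f}")
```

In the actual setup this loop would run once per layer and per model, and the resulting accuracy curves reveal at which depth morphology, syntax, or sentiment cues are most extractable.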
Results show that MSA‑centric models do transfer to many dialects, achieving especially high performance on Levantine and Yemeni varieties, which are geographically close to the MSA anchor. However, transfer is far from uniform: performance drops markedly for distant dialects such as Tunisian or Moroccan Arabic. The multi‑dialect model exhibits lower CKA scores and probe accuracies overall, and its higher layers suffer a pronounced decline, indicating negative interference when the model tries to accommodate divergent dialectal patterns simultaneously. Dialect‑specific models outperform the MSA model only when they are trained on substantial amounts of dialect data (on the order of hundreds of thousands of sentences).
Statistical analysis reveals a strong negative correlation (r ≈ –0.71) between geographic distance from Yemen and both probing accuracy and CKA similarity, supporting the dialect‑continuum hypothesis. Moreover, the amount of pretraining data per dialect is a significant predictor of transfer success, suggesting that data scarcity—not just linguistic distance—is a key bottleneck.
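The correlation analysis itself is a standard Pearson computation. The sketch below uses made-up distance and accuracy numbers purely to illustrate the shape of the test; they are not the paper's data, and the reported r ≈ –0.71 comes from the actual dialect measurements:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

# Hypothetical illustration only: distance from the Yemeni anchor (km)
# versus per-dialect probe accuracy.
distance_km = [500, 1500, 2200, 3500, 4800, 6000]
accuracy = [0.88, 0.85, 0.80, 0.74, 0.70, 0.66]
print(f"r = {pearson_r(distance_km, accuracy):.2f}")
```

A strongly negative r, as in the paper, indicates that transfer quality degrades as a dialect's primary country gets farther from the anchor, which is exactly what the dialect-continuum hypothesis predicts.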
The authors discuss the implications of these findings. Negative interference in the multi‑dialect model suggests that naïvely scaling a single Arabic model to cover all dialects may be counterproductive. Potential remedies include language‑specific adapters, layer‑wise weighting schemes, or staged fine‑tuning where a shared backbone is first trained on MSA and then adapted to each dialect. They also note that the geographic proxy is a simplification; future work should develop richer distance metrics that combine lexical, phonological, syntactic, and semantic similarity. Finally, improving the quality and quantity of dialectal corpora—especially for low‑resource varieties—remains essential for building robust, inclusive Arabic NLP systems.
In conclusion, while MSA‑centric models can provide a useful baseline for many Arabic dialects, equitable performance across the entire dialectal spectrum requires careful consideration of linguistic distance, data availability, and model architecture to mitigate negative interference. This study offers a comprehensive diagnostic framework—probing plus CKA—that can be applied to other diglossic language families as well.