Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv paper.

Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high-resource ones. Dialectal varieties, however, remain underrepresented due to limited data and high linguistic variation. In this work, we adapt a pre-trained LLM to improve dialectal performance. Specifically, we combine Low-Rank Adaptation (LoRA) fine-tuning on monolingual and English-dialect parallel data, adapter merging, and dialect-aware MBR decoding to improve dialectal fidelity in both generation and translation. Experiments on Syrian, Moroccan, and Saudi Arabic show that merging and MBR decoding improve dialectal fidelity while preserving semantic accuracy. This combination provides a compact and effective framework for robust dialectal Arabic generation.


💡 Research Summary

The paper addresses the under‑representation of Arabic dialects in large language models (LLMs) by proposing a compact yet effective adaptation pipeline that combines low‑rank adaptation (LoRA) fine‑tuning, adapter merging, and dialect‑aware Minimum Bayes Risk (MBR) decoding. The authors start from two pretrained LLM backbones—Jais‑2 and LLaMA 3.2—and evaluate them on three dialects (Syrian, Moroccan, Saudi) in both monolingual generation and English↔dialect translation tasks defined by the AMIYA shared task.

First, separate LoRA adapters are trained for each dialect and each supervision type: (1) monolingual dialectal text, using a causal language modeling objective to capture dialect-specific vocabulary, morphology, and syntax; and (2) English-dialect parallel data, framed as an instruction-following translation task to enforce semantic alignment while preserving the instruction tokens. Training hyper-parameters are modest (maximum sequence length 512, 5 epochs, learning rate 3e-5, batch size 32, BF16 precision) to stay within memory limits.
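The idea behind LoRA is that instead of updating a full weight matrix W, training only touches two small matrices A (r x d_in) and B (d_out x r), and the effective weight becomes W + (alpha / r) * B @ A. The following toy sketch illustrates that update rule with made-up numbers; it is not the paper's training code, and the matrix sizes are purely illustrative.

```python
# Illustrative LoRA update rule: W_eff = W + (alpha / r) * B @ A.
# All values below are toy numbers chosen for the example.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(W, A, B, alpha, r):
    """Return the effective weight W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)  # low-rank update, rank r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy 2x2 base weight with a rank-1 adapter (r = 1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]           # r x d_in
B = [[0.5], [0.25]]        # d_out x r
W_eff = lora_forward(W, A, B, alpha=2.0, r=1)
print(W_eff)
```

Because only A and B (rank r, here 1) receive gradients, the number of trainable parameters stays a small fraction of the full model, which is what keeps per-dialect adapters cheap to train and store.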

Second, the two adapters (monolingual and translation) are combined using the TIES‑Merging technique, which merges parameter matrices while preserving the distinct contributions of each source of supervision. This merging yields a single dialect‑aware model that balances fluency in the target dialect with cross‑lingual fidelity.
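TIES-Merging combines task-specific parameter deltas in three steps: trim small-magnitude entries, elect a per-coordinate sign, and average only the values that agree with that sign. The sketch below illustrates those three steps on flat toy vectors; it is an assumption-laden illustration of the published algorithm, not the authors' implementation, and the delta values are invented.

```python
# Minimal sketch of TIES-Merging on flat parameter deltas ("task vectors"),
# one per adapter: (1) trim, (2) elect sign, (3) disjoint mean.

def ties_merge(task_vectors, keep_frac=0.5):
    n = len(task_vectors[0])
    k = max(1, int(n * keep_frac))
    trimmed = []
    for tv in task_vectors:
        # 1) Trim: keep only the k largest-magnitude entries, zero the rest.
        threshold = sorted((abs(v) for v in tv), reverse=True)[k - 1]
        trimmed.append([v if abs(v) >= threshold else 0.0 for v in tv])
    merged = []
    for i in range(n):
        col = [tv[i] for tv in trimmed]
        # 2) Elect sign: the sign with the larger total magnitude wins.
        pos = sum(v for v in col if v > 0)
        neg = -sum(v for v in col if v < 0)
        sign = 1.0 if pos >= neg else -1.0
        # 3) Disjoint mean: average only entries agreeing with the elected sign.
        agreeing = [v for v in col if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged

mono = [0.8, -0.1, 0.4, 0.0]   # e.g. monolingual adapter delta (toy values)
trans = [0.6, 0.3, -0.5, 0.1]  # e.g. translation adapter delta (toy values)
print(ties_merge([mono, trans]))
```

The sign-election step is what lets the merged model keep the distinct contributions of each adapter: coordinates where the two sources of supervision disagree are resolved rather than averaged toward zero.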

Third, at inference time the model generates a set of N = 20 stochastic samples per prompt. Each candidate is scored with the ADI2 metric—a product of an Arabic Level of Dialectness (ALDI) score and a dialect identification (NADI) classifier. The candidate with the highest ADI2 score is selected, constituting MBR decoding that explicitly optimizes for dialect authenticity. The authors also experiment with chrF++‑based MBR and a combined ADI2 + chrF++ objective; these variants improve translation quality but degrade dialect fidelity, confirming that the choice of decoding objective strongly influences the trade‑off.
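The selection step above amounts to sampling N candidates and keeping the one with the highest ADI2 score, where ADI2 is the product of the ALDI dialectness score and the NADI dialect-ID probability. The sketch below shows only this reranking logic; the ALDI and NADI scores are stand-in toy numbers, since the real pipeline obtains them from the trained regressor and classifier.

```python
# Sketch of the dialect-aware reranking step: given N sampled candidates,
# score each with ADI2 = ALDI(dialectness) * P(dialect | text) and keep the
# argmax. Candidate texts and scores below are invented for illustration.

def adi2(aldi_score, nadi_prob):
    """ADI2: product of dialectness level and dialect-ID probability."""
    return aldi_score * nadi_prob

def mbr_select(candidates, score_fn):
    """Return the candidate whose score under score_fn is highest."""
    return max(candidates, key=score_fn)

# Toy candidates: (text, ALDI score, NADI probability for the target dialect).
candidates = [
    ("candidate A", 0.90, 0.40),
    ("candidate B", 0.60, 0.80),
    ("candidate C", 0.85, 0.55),
]
best = mbr_select(candidates, lambda c: adi2(c[1], c[2]))
print(best[0])
```

Swapping `adi2` for a chrF++-based scorer (or a weighted sum of both) changes only `score_fn`, which mirrors the paper's finding that the decoding objective directly controls the fidelity/quality trade-off.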

Empirical results show that LLaMA 3.2 excels in monolingual ADI2 (0.78) but fails in translation (chrF++ 0.14), whereas Jais‑2 offers a more balanced profile (ADI2 0.33, chrF++ 0.43). After fine‑tuning Jais‑2, monolingual adapters raise ADI2 to 0.44, translation adapters raise chrF++ to 0.42, and TIES‑Merging yields the best combined scores (ADI2 0.38, chrF++ 0.44). Applying ADI2‑based MBR further boosts monolingual ADI2 to 0.51 and translation ADI2 to 0.36, while maintaining chrF++ around 0.40. The final submission—independent fine‑tuning per dialect, TIES‑Merging, and ADI2‑MBR decoding—achieves the highest automatic ADI2 scores for Syrian and Saudi, the highest chrF++ for English→dialect and MSA→dialect translations, and the best human‑rated fluency for Moroccan.

The authors acknowledge limitations: reliance on the ADI2 automatic classifier, limited size and diversity of training corpora (especially informal code‑switching), and the computational overhead of MBR decoding, which multiplies inference time. Future work may explore more efficient reranking, joint multi‑dialect training, and richer dialectal datasets.

Overall, the study demonstrates that parameter‑efficient LoRA fine‑tuning, thoughtful adapter merging, and metric‑driven MBR decoding can jointly improve both dialectal authenticity and semantic accuracy, offering a practical roadmap for extending LLMs to low‑resource language varieties.

