Grammatical Error Correction for Low-Resource Languages: The Case of Zarma
Grammatical error correction (GEC) aims to improve text quality and readability. Previous work on the task has focused primarily on high-resource languages, leaving low-resource languages without robust tools. To address this shortcoming, we present a study on GEC for Zarma, a language spoken by over five million people in West Africa. We compare three approaches: rule-based methods, machine translation (MT) models, and large language models (LLMs). We evaluate GEC models on a dataset of more than 250,000 examples, including both synthetic and human-annotated data. Our results show that the MT-based approach using M2M100 outperforms the others, with a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluation (AE), and an average score of 3.0 out of 5.0 in manual evaluation (ME) by native speakers for grammatical and logical corrections. The rule-based method is effective for spelling errors but fails on complex, context-level errors. The LLMs, Gemma 2b and MT5-small, show moderate performance. Our work supports the use of MT models to enhance GEC in low-resource settings, and we validate these results on Bambara, another West African language.
💡 Research Summary
The paper addresses the under‑explored problem of grammatical error correction (GEC) for Zarma, a West African language spoken by over five million people, and evaluates three distinct approaches: a rule‑based system, a machine‑translation (MT) model, and large language models (LLMs). Because Zarma lacks a standardized orthography, large annotated corpora, and mature NLP tools, the authors first construct a sizable dataset of 250,000 sentence pairs. This dataset combines 248,000 synthetically corrupted sentences generated by a noise script that introduces insertions, deletions, substitutions, and transpositions (mirroring real‑world typing errors and grammatical mistakes) with 2,000 human‑annotated examples that contain logical and grammatical errors together with explanations. The data are split 80/10/10 for training, validation, and testing.
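A character-level corruption routine of the kind the summary describes, together with the 80/10/10 split, might look like the following sketch. The function names, corruption probability, and alphabet are illustrative assumptions, not details taken from the paper:

```python
import random

def corrupt(sentence: str, rng: random.Random, p: float = 0.1) -> str:
    """Introduce insertions, deletions, substitutions, and transpositions,
    mimicking real-world typing errors (illustrative sketch; the paper's
    actual noise script is not published in this summary)."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    chars, out, i = list(sentence), [], 0
    while i < len(chars):
        op = rng.random()
        if op < p / 4:                        # insertion of a random letter
            out.append(rng.choice(alphabet))
            out.append(chars[i])
        elif op < p / 2:                      # deletion of the current letter
            pass
        elif op < 3 * p / 4:                  # substitution
            out.append(rng.choice(alphabet))
        elif op < p and i + 1 < len(chars):   # transposition of a letter pair
            out.append(chars[i + 1])
            out.append(chars[i])
            i += 1
        else:                                 # keep the character unchanged
            out.append(chars[i])
        i += 1
    return "".join(out)

def split_80_10_10(pairs):
    """80/10/10 train/validation/test split, as used in the paper."""
    n_train, n_val = int(0.8 * len(pairs)), int(0.1 * len(pairs))
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])
```

Pairing each corrupted sentence with its clean original then yields the (source, target) training pairs the later models consume.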
Rule‑Based System – The baseline uses a Bloom filter for fast existence checks, a trie‑based dictionary derived from the Feriji corpus, Levenshtein distance for spelling suggestions, and a hand‑crafted set of Zarma grammar rules (e.g., handling the agglutinative suffix –ey for definite plural). While it achieves respectable spelling detection and correction, its GLEU (0.3124) and M² (0.401) scores reveal limited ability to handle complex syntactic or logical errors.
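The spelling component described above can be approximated with a standard Levenshtein distance plus a dictionary lookup. In this sketch a plain Python set stands in for the Bloom filter and trie, which in the real system only accelerate membership tests; the sample words are invented for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(word, dictionary, max_dist=2):
    """Return dictionary words within max_dist edits, nearest first.
    The set lookup stands in for the Bloom filter + trie fast path."""
    if word in dictionary:
        return [word]
    ranked = sorted(dictionary, key=lambda w: levenshtein(word, w))
    return [w for w in ranked if levenshtein(word, w) <= max_dist]
```

This also illustrates why the baseline tops out at spelling: edit distance has no notion of context, so agreement or logical errors that leave every individual word valid pass through undetected.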
LLM‑Based Methods – Two multilingual LLMs, Gemma‑2B and MT5‑small, are fine‑tuned using two strategies: (1) an “Instruction + Error Explanation” prompt that presents the incorrect sentence, the corrected version, and a brief error cause, encouraging the model to reason about the correction; (2) a straightforward sentence‑pair alignment without explanations. To fit GPU memory constraints, the authors apply QLoRA quantization. In automatic evaluation the LLMs obtain moderate GLEU (≈0.55) and M² (≈0.62) scores, detection rates around 80% and suggestion accuracies near 65%; in manual evaluation by native speakers they receive an average of 2.4 out of 5.0. This shows that while LLMs can learn some correction patterns, their performance is hampered by the scarcity of Zarma‑specific pre‑training data.
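The two fine-tuning strategies differ only in how training examples are serialized. A plausible formatting helper is sketched below; the exact prompt wording is not given in the summary, so this template and the sample sentences are assumed:

```python
from typing import Optional

def format_example(incorrect: str, corrected: str,
                   explanation: Optional[str] = None) -> str:
    """Serialize one training pair. With an explanation this follows
    strategy (1), 'Instruction + Error Explanation'; without one it is
    strategy (2), plain sentence-pair alignment. Template wording is a
    guess, not the paper's actual prompt."""
    if explanation is not None:
        return (f"Correct this Zarma sentence: {incorrect}\n"
                f"Corrected: {corrected}\n"
                f"Error cause: {explanation}")
    return f"{incorrect}\t{corrected}"
```

Strategy (1) gives the model a supervision signal about *why* a correction applies, which is the intuition behind including the error-cause annotations in the 2,000 human-labeled examples.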
MT‑Based Approach (M2M100) – The core contribution is repurposing Facebook’s many‑to‑many multilingual translation model (M2M100) for GEC. The model is fine‑tuned to treat an erroneous Zarma sentence as the “source” and its corrected counterpart as the “target.” Training uses the full mixed dataset, with hyper‑parameters such as batch size = 64, learning rate = 3e‑4, and five epochs. In automatic metrics the MT model achieves a detection rate of 95.82% and a suggestion accuracy of 78.90%, with GLEU = 0.7896 and M² = 0.914, far surpassing the other two methods. Human evaluators rate its output at an average of 3.0 out of 5.0, indicating that it corrects both surface spelling errors and deeper grammatical or logical mistakes more reliably.
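The summary does not define how detection rate and suggestion accuracy are computed. One common token-level reading is sketched below, under the assumption of aligned same-length sentences; the definitions here are illustrative, not the paper's:

```python
def detection_and_suggestion(sources, hypotheses, references):
    """Token-level sketch: the model 'detects' an error when it changes a
    source token that the reference also changes, and the suggestion is
    'accurate' when the changed token matches the reference. These
    definitions are assumed, not taken from the paper."""
    detected = correct = total_errors = 0
    for src, hyp, ref in zip(sources, hypotheses, references):
        s, h, r = src.split(), hyp.split(), ref.split()
        for st, ht, rt in zip(s, h, r):   # aligned, equal-length sentences
            if st != rt:                  # gold marks this token erroneous
                total_errors += 1
                if ht != st:              # model changed it -> detected
                    detected += 1
                    if ht == rt:          # change matches gold -> accurate
                        correct += 1
    det_rate = detected / total_errors if total_errors else 0.0
    sugg_acc = correct / detected if detected else 0.0
    return det_rate, sugg_acc
```

Under such a reading, the reported 95.82% detection rate with 78.90% suggestion accuracy would mean the model flags nearly every gold-annotated error but proposes the exact gold correction for roughly four in five of the flagged tokens.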
Cross‑Language Validation – To test generalizability, the same pipeline is applied to Bambara, another low‑resource West African language. The MT‑based model again outperforms rule‑based and LLM baselines, suggesting that the approach scales across similar typologically low‑resource languages.
Analysis and Implications – The study demonstrates that, in the absence of large native corpora, a multilingual MT model fine‑tuned on a modest amount of synthetic and human‑annotated data can become a strong GEC tool for low‑resource languages. Rule‑based methods remain useful for quick spelling checks but cannot replace models that learn contextual patterns. LLMs, despite their flexibility, still depend heavily on pre‑training data coverage; their moderate results here underline the need for more Zarma‑specific pre‑training or adapter techniques. The MT approach, however, incurs higher computational costs during training and inference, which may limit deployment on low‑power devices. Future work could explore model compression, knowledge distillation, or efficient inference strategies to make the system more practical for real‑world applications.
Overall, the paper provides a clear methodology for building GEC resources, a thorough comparative evaluation, and evidence that multilingual MT models are currently the most effective solution for grammatical error correction in Zarma and potentially other under‑resourced languages.