InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
State-of-the-art large language models (LLMs) still struggle to support effective text generation and chat interfaces for low-resource languages (LRLs). This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework for generating high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality-filtering mechanism: an automated layer based on retrieval-augmented generation (RAG) with n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU for task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.
💡 Research Summary
InstructLR addresses the persistent performance gap of large language models (LLMs) on low‑resource languages (LRLs) by providing a scalable pipeline for creating high‑quality instruction‑tuning datasets. The authors first curate a diverse set of 20 topics drawn from multi‑task benchmarks such as MMLU, covering STEM, humanities, and social sciences. Seed instructions are generated in a high‑resource language (French) using a modified self‑instruct approach with the Mistral‑7B model.
The core novelty lies in translating these seed instructions and generating responses directly in the target LRL (Zarma, Bambara, Fulfulde) using an LLM that possesses basic competence in the language (Gemini 2.5 Pro). Prompt templates embed language‑specific rules for handling proper nouns, loanwords, technical terms, and phonetic adaptation, ensuring that the model does not invent new vocabulary.
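The paper does not reproduce its prompt templates verbatim, but the mechanism described above can be sketched as a parameterized template whose fixed rules constrain vocabulary handling. Everything below (the template wording, the `build_prompt` helper, its parameter names) is a hypothetical illustration, not the authors' actual prompt:

```python
# Hypothetical sketch of a generation prompt embedding language-specific
# rules, in the spirit of the templates described in the paper.
GENERATION_PROMPT = """\
You are a fluent writer of {language}.
Translate the instruction below into {language}, then answer it in {language}.

Rules:
- Keep proper nouns (people, places, brands) in their original spelling.
- Render loanwords and technical terms with standard phonetic adaptation;
  do not invent new vocabulary.
- Use only the established orthography of {language}.

Instruction (French): {seed_instruction}
"""

def build_prompt(language: str, seed_instruction: str) -> str:
    """Fill the template for one French seed instruction."""
    return GENERATION_PROMPT.format(language=language,
                                    seed_instruction=seed_instruction)
```

Because the rules are fixed text while only the language name and seed instruction vary, the same template scales across Zarma, Bambara, and Fulfulde with minimal adjustment.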
To guarantee data quality while keeping human effort low, the pipeline incorporates a dual-layer filtering system. Layer 1 is an automated retrieval-augmented generation (RAG) checker that references a knowledge base of 3,000 clean sentences, 20 grammar rules, and bilingual glossaries, indexed with FAISS. Using n-shot prompting, the RAG system automatically corrects drafts with low-priority issues and flags those with high-priority issues for human review. Layer 2 involves native speakers who validate or edit the flagged drafts. Inter-annotator agreement, measured by Krippendorff's Alpha, reached 0.793, indicating reliable human judgments.
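The retrieval step of Layer 1 can be illustrated with a minimal sketch. The paper uses a FAISS index over embeddings; the version below substitutes a plain bag-of-words cosine similarity so it runs with no dependencies, and the function names and prompt wording are assumptions for illustration only:

```python
from collections import Counter
from math import sqrt

def bow(text: str) -> Counter:
    """Bag-of-words vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(draft: str, knowledge_base: list[str], k: int = 3) -> list[str]:
    """Return the k clean sentences most similar to the draft
    (stand-in for the paper's FAISS nearest-neighbor search)."""
    q = bow(draft)
    return sorted(knowledge_base, key=lambda s: cosine(q, bow(s)), reverse=True)[:k]

def build_check_prompt(draft: str, knowledge_base: list[str]) -> str:
    """Assemble an n-shot checking prompt from the retrieved sentences."""
    shots = "\n".join(f"- {s}" for s in retrieve_examples(draft, knowledge_base))
    return (f"Reference sentences:\n{shots}\n\n"
            f"Check this draft for grammar and orthography; correct minor "
            f"issues, or flag it for human review:\n{draft}")
```

The key design point is that the checker LLM never judges a draft in isolation: it always sees the nearest clean reference sentences, which anchors its corrections to attested orthography rather than its own (often unreliable) priors for the LRL.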
Applying this workflow, the authors produced three 50k instruction-response datasets: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k, all released under CC-BY-SA 4.0 on Hugging Face. Automated filtering reduced the share of drafts needing human review to 9.1% of the total, cutting dataset-creation costs by roughly 88%.
Experimental evaluation involved six open-source models (Gemma-3-270M, Gemma-3-1B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B-Instruct-v0.3, Phi-4) across three training regimes: (1) zero-shot (no fine-tuning), (2) MT-Seed (fine-tuning on the French seed instructions machine-translated with MADLAD-400), and (3) InstructLR (fine-tuning on the newly generated datasets). Fine-tuning employed QLoRA via the Unsloth library for efficiency.
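For readers unfamiliar with QLoRA, the setup can be sketched as a configuration fragment using the Hugging Face `transformers`/`peft` stack (the paper uses Unsloth, which wraps the same ideas). All hyperparameter values below are illustrative assumptions, not the authors' reported settings:

```python
# Configuration sketch only (not a full training script).
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base model -- the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Small trainable low-rank adapters on top of the quantized weights.
lora_config = LoraConfig(
    r=16,                # adapter rank (assumed)
    lora_alpha=32,       # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

The combination keeps memory low enough to fine-tune 7-8B-parameter models on a single GPU, which is what makes running three regimes across six models practical.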
Results show that models fine‑tuned on InstructLR data achieve BLEU scores of 22.8 (Zarma), 30.1 (Bambara), and 28.9 (Fulfulde), dramatically outperforming both zero‑shot baselines and MT‑Seed baselines (the latter often near zero). Human evaluations further reveal a preference for InstructLR outputs over machine‑translated baselines in 78‑84 % of pairwise comparisons.
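The reported scores are presumably corpus-level BLEU on a 0-100 scale (e.g., via SacreBLEU); as a reference for how the metric works, here is a simplified, unsmoothed sentence-level sketch on a 0-1 scale:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped matches
        total = max(sum(c_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # no smoothing: one empty precision zeroes the score
        log_precisions.append(log(overlap / total))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else exp(1 - len(ref) / max(len(cand), 1))
    return bp * exp(sum(log_precisions) / max_n)
```

The zero-overlap early return explains why the MT-Seed baselines can score near zero: if a fine-tuned model produces text with essentially no 4-gram overlap with fluent references (e.g., due to inconsistent orthography), BLEU collapses.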
Key insights include: (i) generating instructions and responses simultaneously in the target language preserves cultural and domain relevance better than translating pre‑existing pairs; (ii) RAG‑based automated correction can reliably filter the majority of drafts, leaving only a small, high‑impact subset for human review; (iii) the framework is language‑agnostic, requiring only modest prompt adjustments to scale to new LRLs.
In summary, InstructLR provides a practical, cost‑effective solution for building instruction‑tuning corpora for under‑represented languages, thereby narrowing the multilingual capability gap of LLMs and enabling broader, more inclusive AI applications.