AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning
Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which remains scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that support research on (i) neural retrieval and ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction, and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive and negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV, JSON, JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modeling. These datasets also come with a methodology that can be generalized to other low-resource languages.
💡 Research Summary
The paper presents two newly released, high‑quality datasets for Amharic, a low‑resource Semitic language, aimed at advancing neural information retrieval (IR) and instruction‑following generative modeling. The first dataset, AmharicIR, consists of 1,091 manually verified query‑positive‑negative document triplets. Queries were created through expert authoring, web‑derived queries, and LLM‑assisted generation. Positive documents are real web pages that satisfy the information need, while negative documents include both hard negatives that share lexical overlap with the query and synthetic documents generated by large language models. All triplets were validated by native speakers for relevance and non‑relevance, and underwent language‑specific normalization to reduce orthographic variation. This resource directly supports contrastive training and benchmarking of neural retrievers such as DPR (dense dual‑encoder), ColBERT‑style late‑interaction models, and SPLADE‑style sparse neural retrievers, filling a gap left by existing multilingual datasets that rely on translation or weak supervision.
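To make the intended use of the triplets concrete, here is a minimal sketch of the margin-based contrastive objective commonly used to train dense retrievers on query-positive-negative triplets. The loss function and margin value are illustrative assumptions, not the paper's prescribed training setup.

```python
import numpy as np

def triplet_loss(q: np.ndarray, pos: np.ndarray, neg: np.ndarray,
                 margin: float = 0.2) -> float:
    """Margin-based triplet loss over cosine similarities.

    Encourages the query embedding `q` to be closer to the positive
    document embedding `pos` than to the negative `neg` by at least
    `margin`. Hypothetical illustration of contrastive training on
    (query, positive, negative) triplets.
    """
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(0.0, margin - cos(q, pos) + cos(q, neg))
```

With a dual encoder (as in DPR), `q`, `pos`, and `neg` would be the encoder outputs for the query and the two documents; hard negatives with high lexical overlap make this objective substantially more informative than random negatives.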
The second dataset, AmharicInstr, contains 6,285 prompt‑response pairs covering a wide range of domains (news, education, health, culture, etc.) and instruction types (summarization, translation, QA, conversational replies, task completion). Initial responses were generated by several LLMs (e.g., GPT‑4, LLaMA) and then rigorously reviewed by native speakers for grammaticality, fluency, relevance to the prompt, and factual plausibility. This quality‑controlled instruction data is intended for instruction tuning of GPT‑style models and for evaluating retrieval‑augmented generation in Amharic.
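A prompt-response record in the JSONL release might look like the following sketch. The field names (`prompt`, `response`, `domain`, `instruction_type`) are hypothetical placeholders; the actual schema is defined by the dataset documentation.

```python
import json

# Hypothetical record layout for one AmharicInstr example.
record = {
    "prompt": "ስለ አዲስ አበባ አጭር ማጠቃለያ ጻፍ።",  # "Write a short summary about Addis Ababa."
    "response": "...",
    "domain": "culture",
    "instruction_type": "summarization",
}

# JSONL: one JSON object per line; ensure_ascii=False keeps Ethiopic
# script readable rather than escaping it to \uXXXX sequences.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
```

Records in this shape can be mapped directly into the chat or instruction templates expected by instruction-tuning frameworks.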
Both datasets follow a reproducible, six‑step pipeline: (1) define coverage (domains, tasks, linguistic phenomena); (2) generate candidates from expert authoring, web harvesting, and LLM assistance; (3) specify explicit supervision criteria (what counts as a positive, a hard negative, or an acceptable response); (4) conduct native‑speaker validation; (5) apply language‑specific normalization and deduplication; (6) release artifacts in standardized CSV, JSON, and JSONL formats with predefined train/validation/test splits and comprehensive documentation (collection process, quality control, intended uses, limitations). The authors emphasize that this methodology is language‑agnostic and can be adapted to other low‑resource languages by swapping domain templates, web sources, and validation guidelines.
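Step (5) of the pipeline can be sketched as follows. Amharic orthography has several homophonic character series that are often collapsed before deduplication; the specific mapping below is a small illustrative subset, not the paper's actual normalization table.

```python
# Hypothetical minimal normalizer: collapse a few commonly merged
# Amharic homophone characters, then drop exact duplicates under
# the normalized form. The paper's real rules may differ.
HOMOPHONES = str.maketrans({"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ዐ": "አ"})

def normalize(text: str) -> str:
    """Reduce orthographic variation so that spelling variants
    of the same word compare equal."""
    return text.translate(HOMOPHONES).strip()

def dedupe(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized form."""
    seen, kept = set(), []
    for t in texts:
        key = normalize(t)
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept
```

In a full pipeline the same normalization would be applied consistently to queries, documents, prompts, and responses, so that train/validation/test splits contain no near-duplicate leakage under the normalized form.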
The paper situates its contributions within prior work: large English retrieval datasets (MS MARCO, BEIR, Natural Questions) provide millions of labeled pairs but are unavailable for Amharic; existing Amharic resources are either tiny (152 query‑document pairs) or weakly supervised (headline‑article pairs with implicit relevance). By providing a manually verified triplet set of over a thousand examples and a sizable, quality‑controlled instruction set, the authors address both the scarcity of explicit contrastive signals for neural retrievers and the lack of prompt‑response data for instruction tuning.
Limitations acknowledged include the modest absolute size compared to English corpora, potential bias introduced by LLM‑generated negatives or responses, and the fact that hard negatives, while lexically similar, may not capture the full complexity of real‑world retrieval noise. Nonetheless, the resources constitute the first comprehensive, native‑speaker validated Amharic benchmark for both retrieval and generation, and are expected to catalyze research on low‑resource language IR, multilingual dense retrieval, hybrid sparse‑dense models, and instruction‑tuned LLMs. Future work suggested includes scaling up the datasets, exploring cross‑lingual transfer, and deeper analysis of bias and factuality in LLM‑generated content.