AmharicStoryQA: A Multicultural Story Question Answering Benchmark in Amharic
With the growing emphasis on multilingual and cultural evaluation benchmarks for large language models, language and culture are often treated as synonymous, and performance is commonly used as a proxy for a model's understanding of a given language. In this work, we argue that such evaluations overlook meaningful cultural variation that exists within a single language. We address this gap by focusing on narratives from different regions of Ethiopia and demonstrate that, despite shared linguistic characteristics, region-specific and domain-specific content substantially influences language evaluation outcomes. To this end, we introduce ***AmharicStoryQA***, a long-sequence story question answering benchmark grounded in culturally diverse narratives from Amharic-speaking regions. Using this benchmark, we reveal a significant narrative understanding gap in existing LLMs, highlight pronounced regional differences in evaluation results, and show that supervised fine-tuning yields uneven improvements across regions and evaluation settings. Our findings emphasize the need for culturally grounded benchmarks that go beyond language-level evaluation to more accurately assess and improve narrative understanding in low-resource languages.
💡 Research Summary
The paper addresses a critical gap in multilingual large language model (LLM) evaluation: the tendency to treat language and culture as synonymous and to evaluate a language as a monolithic entity. Focusing on Amharic, the official language of Ethiopia, the authors demonstrate that significant cultural variation exists across regions within a single language, and that this variation materially affects model performance.
To capture this intra‑language diversity, the authors construct AmharicStoryQA, a long‑sequence story‑based question answering benchmark grounded in culturally diverse narratives from nine Ethiopian regions (including Afar, Amhara, Benishangul‑Gumuz, Harar, Oromia, and several zones of the Southern Nations, Nationalities, and Peoples’ Region). They collect 224 traditional folktales, translate them into English and Amharic, and generate five questions per story using GPT‑4.1. Each question is paired with a correct answer and three distractors (world‑knowledge‑biased, unrelated, and factually false).
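A rough sketch of how this kind of question generation could be scripted is shown below. The prompt wording, JSON schema, and function names are illustrative assumptions, not the paper's actual generation prompt; only the model (GPT‑4.1), the question count, and the distractor types come from the summary.

```python
# Sketch: ask GPT-4.1 for five questions per story, each with a correct
# answer and three typed distractors. Prompt text and schema are assumed.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Read the following folktale and write 5 comprehension questions.
For each question provide: the correct answer, a distractor that relies on
general world knowledge rather than the story, an unrelated distractor, and
a factually false distractor. Return JSON with a "questions" list.

Story:
{story}"""

def generate_questions(story_text: str) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": PROMPT.format(story=story_text)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["questions"]
```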
Human annotators evaluate the generated English questions on three dimensions: correctness & faithfulness to the story, linguistic quality & clarity, and comprehension depth & challenge level. Inter‑rater reliability is measured with Gwet’s AC1, showing strong agreement. The questions are then manually translated into Amharic by experienced translators, and translation quality is assessed using SSA‑COMET scores (reference‑based and reference‑free metrics). Items scoring below 0.65 are manually inspected; a small number of genuine translation errors are corrected, ensuring high fidelity.
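For context, Gwet's AC1 corrects observed agreement for chance agreement estimated from the marginal category proportions. Below is a minimal sketch for two raters following the standard definition; it is not code from the paper.

```python
# Minimal sketch (not the authors' code): Gwet's AC1 for two raters over
# categorical ratings, assuming at least two rating categories.
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """ratings_a, ratings_b: parallel lists of category labels."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)

    # Observed agreement: fraction of items where both raters agree.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: based on the mean proportion of ratings per category.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = {c: counts[c] / (2 * n) for c in categories}
    p_e = sum(pi[c] * (1 - pi[c]) for c in categories) / (q - 1)

    return (p_a - p_e) / (1 - p_e)
```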
The final dataset comprises 571 training instances and 649 test instances, balanced across multiple‑choice (MCQA) and open‑ended generation formats. Token lengths average around 4,000–6,000 per story, with PCA‑based visualizations showing that while stories from different regions occupy a common semantic space, subtle outer contours reveal region‑specific nuances.
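An embedding-plus-PCA view of this kind can be produced along the lines of the sketch below; the encoder model and record fields are assumptions rather than the authors' pipeline.

```python
# Illustrative sketch: embed stories and project to 2-D with PCA to see
# how regions overlap. The embedding model and record fields are assumed.
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

stories = [
    {"region": "Amhara", "text": "..."},  # placeholder story records
    {"region": "Oromia", "text": "..."},
]

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # assumed encoder
coords = PCA(n_components=2).fit_transform(model.encode([s["text"] for s in stories]))

by_region = defaultdict(list)
for (x, y), story in zip(coords, stories):
    by_region[story["region"]].append((x, y))

for region, points in by_region.items():
    xs, ys = zip(*points)
    plt.scatter(xs, ys, label=region)
plt.legend()
plt.title("Story embeddings by region (first two principal components)")
plt.show()
```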
For evaluation, seven open‑source LLMs capable of handling up to 128K‑token contexts are selected (e.g., Gemma‑3‑27B‑IT, Llama 3.1 8B Instruct, Command‑R). Using the EleutherAI lm‑eval harness, the authors conduct zero‑shot testing. MCQA performance is measured via log‑likelihood ranking of answer options; generation quality is assessed with chrF. Results expose pronounced regional disparities: models consistently underperform on stories from the Southern Nations zones and some peripheral regions, with accuracy drops of 10–15% compared to central regions like Amhara.
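The sketch below illustrates what log-likelihood ranking for MCQA looks like in general; it mirrors the approach taken by lm-eval-style harnesses but is not the authors' evaluation code, and the model name and prompt format are placeholders.

```python
# Sketch: score each answer option by the model's log-likelihood of the
# option tokens given the story + question, then pick the highest-scoring one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def option_loglikelihood(context: str, option: str) -> float:
    # Approximate: assumes context tokens are a prefix of the joint tokenization.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities of the option tokens only, conditioned on the context.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    option_positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    option_tokens = full_ids[0, ctx_ids.shape[1]:]
    return sum(log_probs[pos, tok].item()
               for pos, tok in zip(option_positions, option_tokens))

def answer_mcqa(story: str, question: str, options: list[str]) -> str:
    prompt = f"{story}\n\nQuestion: {question}\nAnswer: "
    return max(options, key=lambda o: option_loglikelihood(prompt, o))
```

Generation outputs are scored separately with chrF, which is available, for example, through sacrebleu's CHRF metric.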
Supervised fine‑tuning is performed with LoRA (rank 8, batch 8, three epochs) on three data configurations: English‑only, Amharic‑only, and multilingual (combined). Fine‑tuned models improve overall accuracy, yet gains are uneven across regions. For instance, Amhara stories see an 8% boost, while Oromia or SNNPR zones improve by less than 2%. This unevenness suggests that cultural grounding in the fine‑tuning data is insufficiently diverse or that the model architecture struggles to internalize region‑specific cues.
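A minimal sketch of this kind of LoRA setup follows. The rank, batch size, and epoch count come from the summary; the target modules, learning rate, base model, data path, and the use of TRL's SFTTrainer are assumptions, not the paper's exact recipe.

```python
# Sketch: LoRA fine-tuning with peft + TRL's SFTTrainer.
# Rank 8, batch size 8, and 3 epochs follow the summary; everything else is assumed.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

lora_config = LoraConfig(
    r=8,                                   # LoRA rank, as reported
    lora_alpha=16,                         # assumed
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)

train_args = SFTConfig(
    output_dir="amharicstoryqa-lora",
    per_device_train_batch_size=8,   # batch size 8, as reported
    num_train_epochs=3,              # three epochs, as reported
    learning_rate=2e-4,              # assumed
)

# Hypothetical file; each record is assumed to carry a "text" field with the
# formatted story + question + answer.
dataset = load_dataset("json", data_files="amharicstoryqa_train.jsonl")["train"]

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    args=train_args,
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()
```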
The study also explores the impact of prompt language. Using Amharic prompts yields lower performance than English prompts across both MCQA and generation tasks, highlighting the models’ limited proficiency with Amharic input despite being trained on multilingual corpora.
Key contributions:
- Introduction of AmharicStoryQA, a culturally grounded, long‑context QA benchmark for a low‑resource language, covering nine Ethiopian regions with both MCQA and generation tasks.
- Comprehensive zero‑shot analysis of seven LLMs, revealing that intra‑language cultural variation leads to significant performance gaps.
- Demonstration that supervised LoRA fine‑tuning improves overall scores but does not uniformly close regional gaps, underscoring the need for more diverse, culturally aware training data.
The authors argue that evaluating LLMs solely at the language level masks important cultural nuances, especially in multilingual societies where dialects, histories, and social norms differ markedly within a single language community. AmharicStoryQA provides a concrete resource to probe these nuances and can serve as a template for building similar benchmarks for other low‑resource languages. The work calls for a shift toward culturally grounded evaluation practices to more accurately assess and enhance narrative understanding in LLMs.