Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling
Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates covering 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions, along with systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed at https://huggingface.co/datasets/AlaaAhmed2444/Macaron.
💡 Research Summary
Macaron introduces a novel, template‑first benchmark designed to evaluate multilingual, multicultural reasoning in large language models (LLMs) under controlled conditions. Existing multilingual benchmarks fall into two categories: translation‑parallel datasets that preserve English‑centric scenarios (e.g., XNLI, Global‑MMLU) and culture‑first datasets that are authored independently for each language or region (e.g., CulturalBench, ArabCulture). The former fails to test cultural grounding, while the latter often lacks systematic control over the reasoning skills required and can drift in difficulty across languages. Macaron addresses both issues by factorizing reasoning type and cultural aspect through a set of 100 language‑agnostic templates.
Each template specifies a question skeleton with typed slots that native annotators fill with culture-specific content.
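The template-filling idea can be illustrated with a minimal sketch. Note that the skeleton text, slot names (`festival`, `quantity`, `dish`), and slot values below are invented for illustration and are not drawn from Macaron's actual schema:

```python
# Hypothetical sketch of Macaron-style template filling.
# Slot names and values are illustrative, not the dataset's actual schema.
from string import Template

# A language-agnostic skeleton with typed slots; the reasoning type
# (here, a counting/arithmetic question) is fixed by the skeleton,
# while the cultural content varies with the slot values.
skeleton = Template(
    "During $festival, a family prepares $quantity portions of $dish. "
    "If each guest eats 2 portions, how many guests can they serve?"
)

# Example slot values a native annotator might supply for one culture.
slots = {"festival": "Eid al-Fitr", "quantity": "12", "dish": "kunafa"}

question = skeleton.substitute(slots)
print(question)
```

Because the skeleton is shared across languages and cultures, the reasoning demanded by each instance stays constant while the cultural grounding changes, which is what allows the benchmark to compare difficulty across languages without drift.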