Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv paper.

In this paper, we present a localized and culturally adapted Estonian translation of the test set from the widely used commonsense reasoning benchmark, WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open-source models on the human-translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt. This prompt is specifically tailored to address both the linguistic characteristics of Estonian and the unique translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human-translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Additionally, our experiments indicate that prompt engineering offers limited improvement in translation quality or model accuracy, and they highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of language competency and reasoning in large language models.


💡 Research Summary

The paper presents a comprehensive effort to create an Estonian version of the widely used commonsense reasoning benchmark WinoGrande, and to evaluate how large language models (LLMs) perform on this localized dataset compared with machine‑translated alternatives. The authors first describe the human translation pipeline: a master’s student in translation studies and a professional translator jointly translated the 1,767 test items, preserving the original 70 % lexical overlap between twin sentences, keeping answer options identical in grammatical case, and adapting culturally specific references (geographic locations, foods, brands, animal species, personal names) to Estonian equivalents. Because Estonian is an agglutinative Finno‑Ugric language, special care was taken to maintain number and case agreement; where a literal translation would create ambiguity or make the task solvable by simple morphological cues, the translators re‑phrased the schema (e.g., rendering English “cheap” as Estonian “maitsetu” to preserve the intended contrast). During this process 53 items (3 % of the set) required cultural localization and 89 items (5 %) were found to be ambiguous or incorrectly labeled in the original dataset; these were corrected, and the corrected labels will be released alongside the dataset.

To assess the reliability of the human‑produced data, two additional annotators independently labeled all items. Inter‑annotator agreement was high: Cohen’s κ = 0.816 and Fleiss’s κ = 0.855, indicating very strong consensus. Accuracy of the annotators was 95 % and 92 %, respectively, with a modest number of “undecidable” cases, confirming that the translation retained the intended disambiguation difficulty.
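The agreement statistics above follow the standard chance-corrected formulation. A minimal sketch of Cohen's κ for two annotators (the textbook formula, not code from the paper):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators.

    a, b: equal-length label lists (e.g. answer choices 1/2).
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected if both labeled independently.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

On this scale, the reported κ = 0.816 falls in the range conventionally read as "almost perfect" agreement (κ > 0.8); Fleiss's κ generalizes the same idea to more than two annotators.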

The authors then generated two machine‑translated versions using OpenAI’s GPT‑4 family. The first (GPT‑4o) employed a short, zero‑shot prompt (“translate English to Estonian”). The second (GPT‑4.1) used a detailed prompt that explicitly instructed the model to preserve lexical overlap, keep answer‑option morphology, and perform cultural adaptation. Manual inspection revealed systematic problems in both versions: (i) number or case mismatches (e.g., singular verb agreeing with a plural noun), (ii) loss of meaning (semantic drift), and (iii) creation of grammatical cues that allow a model to pick the correct answer for the wrong reason. Even the detailed prompt reduced the proportion of such errors only modestly; about 15.2 % of the machine‑translated items still suffered semantic loss.
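The contrast between the two prompting strategies can be illustrated with a hypothetical reconstruction. The paper's exact prompt wording is not reproduced here; the detailed variant below simply encodes the constraints the summary describes (lexical overlap, option morphology, cultural adaptation) as system instructions:

```python
def build_translation_messages(sentence, option1, option2, detailed=False):
    """Build chat messages for an English-to-Estonian WinoGrande item.

    Hypothetical sketch: `detailed=True` mirrors the constraints the
    paper's detailed prompt reportedly imposed; `detailed=False` mimics
    the short zero-shot instruction.
    """
    if detailed:
        system = (
            "Translate the WinoGrande item from English to Estonian. "
            "Preserve the lexical overlap between twin sentences, keep "
            "both answer options in the same grammatical case and number, "
            "adapt culturally specific references (names, places, foods, "
            "brands) to Estonian equivalents, and do not introduce "
            "grammatical cues such as agreement that reveal the answer."
        )
    else:
        system = "Translate the following English text to Estonian."
    user = f"Sentence: {sentence}\nOption 1: {option1}\nOption 2: {option2}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```

Messages in this shape can be passed to a chat-completion API; the point of the sketch is that even the detailed instruction set left roughly 15% of items with semantic loss, per the manual inspection above.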

Six LLMs—GPT‑4, GPT‑3.5‑turbo, LLaMA‑2‑13B, Mistral‑7B, Falcon‑40B, and an open‑source instruction‑tuned model—were evaluated on three datasets: the human‑translated Estonian set, the simple‑prompt machine translation, and the detailed‑prompt machine translation. Results show a clear hierarchy: the human‑translated set achieved an average accuracy of 71.3 %, only slightly below the original English benchmark (≈73 %). The simple‑prompt machine translation yielded 58.1 % accuracy, while the detailed‑prompt version improved to 60.4 % but remained substantially lower than the human version. Error analysis indicated that many incorrect predictions on the machine‑translated data were due to models exploiting translation artifacts (e.g., grammatical agreement) rather than genuine commonsense reasoning.
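One common way to score WinoGrande-style items, shown here as an assumed scheme rather than the paper's exact protocol, is option substitution: fill the blank with each candidate and pick the completion the model scores as more plausible.

```python
def evaluate_winogrande(items, score_fn):
    """Accuracy under option-substitution scoring.

    items: dicts with 'sentence' (containing the '_' blank), 'option1',
           'option2', and gold 'answer' ('1' or '2').
    score_fn: returns a plausibility score for a full sentence, e.g.
              the summed token log-probability from a language model.
    """
    correct = 0
    for item in items:
        filled1 = item["sentence"].replace("_", item["option1"])
        filled2 = item["sentence"].replace("_", item["option2"])
        # Predict whichever filled-in sentence the model prefers.
        pred = "1" if score_fn(filled1) >= score_fn(filled2) else "2"
        correct += pred == item["answer"]
    return correct / len(items)
```

Under this kind of scoring, the translation artifacts described above are directly harmful: if the Estonian rendering adds a grammatical agreement cue, the model can prefer the "correct" completion for purely morphological reasons, inflating accuracy without any commonsense reasoning.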

The paper’s contributions are threefold. First, it delivers a high‑quality, culturally adapted Estonian WinoGrande test set, complete with inter‑annotator agreement statistics and corrected labels for previously ambiguous items. Second, it provides two openly released machine‑translated baselines, together with a detailed annotation of translation errors that can serve as a benchmark for future MT research on low‑resource, morphologically rich languages. Third, it empirically demonstrates that, despite advances in LLM‑based translation, human expertise remains essential for preserving the logical structure and cultural relevance of reasoning benchmarks; prompt engineering alone cannot bridge the quality gap.

In the discussion, the authors argue that reliable multilingual evaluation of commonsense reasoning requires more than raw translation. It demands (a) expert linguistic knowledge to handle morphological agreement, (b) cultural adaptation to avoid introducing external knowledge that changes the problem, and (c) systematic error correction to eliminate label‑sentence inconsistencies. They suggest future work in (i) extending the methodology to other Finno‑Ugric and low‑resource languages, (ii) developing multi‑step post‑editing pipelines that combine LLM translation with human verification, and (iii) quantifying how specific translation errors propagate through downstream model performance. By releasing the Estonian dataset on Hugging Face, the authors invite the community to explore multilingual reasoning, improve translation techniques for complex benchmarks, and ultimately build more robust, language‑agnostic AI systems.

