Estonian Native Large Language Model Benchmark
The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted. We introduce a new benchmark for evaluating LLMs in Estonian, based on seven diverse datasets. These datasets assess general and domain-specific knowledge, understanding of Estonian grammar and vocabulary, summarization abilities, contextual comprehension, and more. The datasets are all generated from native Estonian sources without using machine translation. We compare the performance of base models, instruction-tuned open-source models, and commercial models. Our evaluation includes 6 base models and 26 instruction-tuned models. To assess the results, we employ both human evaluation and LLM-as-a-judge methods. Human evaluation scores showed moderate to high correlation with benchmark evaluations, depending on the dataset. Claude 3.7 Sonnet, used as an LLM judge, demonstrated strong alignment with human ratings, indicating that top-performing LLMs can effectively support the evaluation of Estonian-language models.
💡 Research Summary
The paper introduces the first comprehensive benchmark suite for evaluating large language models (LLMs) in Estonian, a low‑resource Uralic language spoken by roughly one million people. Recognizing the scarcity of native‑language evaluation resources, the authors construct seven diverse tasks:

1. Estonian National Exam – 1,614 multiple‑choice questions covering seven school subjects;
2. TrivIA – 800 culturally specific trivia items derived from a popular Estonian board game;
3. Declension – 1,400 adjective‑noun case inflection queries covering all 14 Estonian noun cases;
4. Word Meaning – 1,000 definition‑to‑term mappings;
5. Grammar Correction – 3,731 sentences with native and non‑native errors;
6. News Summarization – 523 short news transcripts from ERR radio, evaluated with ROUGE‑L;
7. Speaker Name Extraction – speaker‑attributed transcripts from the main evening news program.

All datasets are sourced from native Estonian material; no machine translation is used, and a pipeline of OCR, LLM‑assisted structuring, and manual post‑editing ensures high quality.
The evaluation covers six base models (e.g., LLaMA‑2, Mistral) and twenty‑six instruction‑tuned chat models, spanning both open‑source and commercial offerings. Base models are tested in a 5‑shot setting, while instruction‑tuned models are evaluated zero‑shot, reflecting typical usage patterns. Performance metrics are task‑specific: accuracy for the multiple‑choice tasks, exact match and Levenshtein similarity for grammar correction, and ROUGE‑L for summarization.
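The string‑overlap metrics above can be sketched in a few lines. The paper does not publish its exact scoring code, so the normalization choices here (character‑level Levenshtein similarity, whitespace tokenization for ROUGE‑L) are assumptions for illustration:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: longest common subsequence over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == cand[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)
```

For example, a grammar correction that differs from the gold sentence by one character scores close to 1.0 on Levenshtein similarity, while exact match scores 0 — which is why the two are reported together.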
To validate results, the authors combine human judgments with an “LLM‑as‑a‑judge” approach using Anthropic’s Claude 3.7 Sonnet. Human evaluators compare paired model responses via a custom web interface, while Claude Sonnet automatically scores the same outputs using identical criteria. Correlation between human and LLM judgments varies by task, ranging from 0.45 (trivia and word‑meaning) to 0.78 (grammar correction and summarization), indicating that Claude Sonnet aligns well with human preferences on linguistically intensive tasks but less so on culturally specific knowledge.
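Agreement figures of this kind are obtained by correlating the human and judge scores assigned to the same outputs. A minimal sketch using Pearson correlation over paired score lists (the paper does not state which correlation coefficient it uses, so Pearson — and the example scores — are assumptions):

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical human vs. LLM-judge ratings for five model responses.
human = [3.0, 4.0, 2.0, 5.0, 4.0]
judge = [2.5, 4.5, 2.0, 5.0, 3.5]
```

A value near 1.0 would indicate that the LLM judge ranks outputs much as human annotators do, while values near the 0.45 end of the reported range leave substantial room for disagreement on culturally specific tasks.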
Results show commercial models generally outperform open‑source counterparts across most benchmarks, especially in summarization and grammar correction. Among open‑source instruction‑tuned models, LLaMA‑2‑Chat and Mistral‑Instruct achieve competitive scores on declension and word‑meaning tasks, suggesting that model size is less decisive than the relevance of fine‑tuning data for low‑resource languages. The study also highlights that high‑quality, native‑language benchmarks yield stronger correlations with human judgments than translated or synthetic datasets.
Finally, the paper argues that reliable LLM‑as‑judge systems can reduce the cost of human annotation for low‑resource language evaluation, and it outlines future directions: expanding the benchmark to additional domains (legal, medical), incorporating multimodal inputs, and systematically assessing cultural bias and ethical considerations in Estonian LLMs.