TurkBench: A Benchmark for Evaluating Turkish Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English-language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench comprises 8,151 data samples across 21 distinct subtasks, organized into six main evaluation categories: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at https://huggingface.co/turkbench


💡 Research Summary

The paper introduces TurkBench, a comprehensive benchmark specifically designed to evaluate generative large language models (LLMs) in Turkish. Recognizing that most existing evaluation suites such as GLUE, SuperGLUE, MMLU, and HELM are English‑centric and that direct translations fail to capture Turkish’s agglutinative morphology, free word order, and cultural nuances, the authors set out to create a resource that reflects the language’s unique characteristics.

TurkBench comprises 8,151 instances distributed across 21 distinct subtasks, organized into six major categories: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar & Vocabulary, and Instruction Following. All data are sourced from authentic Turkish materials—national exams administered by OSYM, university coursework, sociology texts, scientific Olympiads, and real‑world platforms—and are not adapted from existing English datasets nor synthetically generated. Each sample was authored by domain experts and subsequently validated by human reviewers according to three criteria: factual correctness, grammatical well‑formedness, and cultural sensitivity (details in Appendix 9.1).

The Knowledge segment includes 200 general‑knowledge multiple‑choice questions and 2,373 MMLU‑style items covering 24 academic subjects, drawn from OSYM exams and METU assessments. Language Understanding tasks consist of reading comprehension (482 open‑ended questions), natural language inference (256 three‑way classification items), summarization (262 open‑ended prompts), and semantic textual similarity (225 sentence pairs rated on a 1‑5 scale). Reasoning tasks span mathematical reasoning (500 problems from TUBITAK Science Olympiad and METU exams), complex multi‑step reasoning (100 multiple‑choice items from ALES exams), and commonsense reasoning (241 culturally grounded scenarios).

Content Moderation evaluates safety aspects through toxicity detection, bias detection, and hallucination assessment (truthfulness and faithfulness, each 250 items). The Grammar & Vocabulary category probes linguistic phenomena unique to Turkish: rare words (139), loanwords (165), named‑entity recognition (438), part‑of‑speech tagging (260), and idiom/metaphor identification (150). Finally, Instruction Following comprises 997 prompts that require models to generate appropriate responses in realistic usage contexts.

Evaluation metrics are task‑specific: accuracy for most classification and multiple‑choice items, Pearson/Spearman correlation for semantic similarity, and the “LLM‑as‑a‑Judge” framework for generation‑heavy tasks such as summarization and reading comprehension. The latter leverages a separate, well‑calibrated LLM to provide consistent, scalable judgments, reducing reliance on costly human annotation.
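As a rough sketch of how the first two of these task-specific metrics can be computed, the following pure-Python helpers implement accuracy (for classification and multiple-choice subtasks) and Pearson/Spearman correlation (for the 1-5 semantic-similarity ratings). The function names and implementations are illustrative, not taken from the TurkBench codebase:

```python
from statistics import mean

def accuracy(preds, golds):
    """Fraction of exact matches -- e.g. for multiple-choice and NLI items."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def pearson(xs, ys):
    """Pearson correlation -- e.g. model scores vs. human 1-5 similarity ratings."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation = Pearson over average ranks (ties share a rank)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=vs.__getitem__)
        r = [0.0] * len(vs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vs[order[j + 1]] == vs[order[i]]:
                j += 1  # extend over a run of tied values
            avg = (i + j) / 2 + 1  # average 1-based rank for the tied run
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    return pearson(ranks(xs), ranks(ys))
```

In practice a library such as SciPy provides the same statistics; the point here is only to show what each metric measures on the benchmark's outputs.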

A key contribution is the public release of TurkBench on Hugging Face, accompanied by an automated submission pipeline and an online leaderboard (https://huggingface.co/turkbench). Researchers can upload model outputs, receive immediate scoring, and compare results against other submissions, fostering transparent competition within the Turkish AI community.
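The summary does not specify the submission file format, but a pipeline of this kind typically accepts model outputs keyed by sample ID. The sketch below writes predictions as JSON Lines; the field names (`id`, `prediction`) and the JSONL layout are assumptions for illustration, not the documented TurkBench schema:

```python
import json
import os
import tempfile

def write_submission(predictions, path):
    """Write model outputs as JSON Lines, one object per benchmark sample.

    NOTE: the {"id", "prediction"} field names are an illustrative guess at
    a submission schema, not taken from the TurkBench documentation.
    """
    with open(path, "w", encoding="utf-8") as f:
        for sample_id, pred in predictions.items():
            record = {"id": sample_id, "prediction": pred}
            # ensure_ascii=False keeps Turkish characters readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Hypothetical sample IDs and outputs for two subtasks.
preds = {
    "knowledge-0001": "C",
    "summarization-0042": "Kısa bir özet.",
}
out = os.path.join(tempfile.mkdtemp(), "submission.jsonl")
write_submission(preds, out)
```

One record per line makes the file easy for an automated scorer to stream and validate sample-by-sample before computing leaderboard metrics.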

Compared with prior Turkish benchmarks—Mukayese, TurkishMMLU, TR‑MTEB, Turkish‑PLU, Cetvel—TurkBench distinguishes itself by (i) using entirely native Turkish data rather than translated or repurposed resources, (ii) enforcing rigorous human validation for linguistic and cultural fidelity, (iii) covering a broader spectrum of capabilities including safety and instruction following, and (iv) providing a near‑automatic evaluation infrastructure.

In summary, TurkBench offers a linguistically sound, culturally aware, and technically robust platform for assessing Turkish LLMs across knowledge, reasoning, safety, and language‑specific dimensions. It sets a new standard for multilingual benchmark development and is poised to accelerate progress toward more capable, responsible, and locally relevant AI systems for Turkish‑speaking users.

