FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Reading time: 5 minutes
...

📝 Original Info

  • Title: FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models
  • ArXiv ID: 2512.13330
  • Date: 2025-12-15
  • Authors: Joona Kytöniemi, Jousia Piha, Akseli Reunamo, Fedor Vitiugin, Farrokh Mehryary, Sampo Pyysalo

📝 Abstract

We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
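
To make the task-selection procedure described in the abstract more concrete, the following Python sketch computes the four criteria (monotonicity, signal-to-noise, non-random performance, and model ordering consistency) from per-checkpoint learning curves. It is not the authors' code: the thresholds and exact formulations are illustrative assumptions, and the paper's own definitions may differ.

```python
# Hedged sketch (not the authors' code) of the four task-selection criteria,
# computed from learning curves of per-checkpoint scores. All thresholds and
# exact formulations below are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr

def task_is_robust(curves: dict, random_baseline: float) -> bool:
    """curves: model name -> sequence of per-checkpoint accuracies on one task."""
    for raw_scores in curves.values():
        scores = np.asarray(raw_scores, dtype=float)
        steps = np.arange(len(scores))
        # 1) Monotonicity: accuracy should broadly increase over training.
        if spearmanr(steps, scores)[0] < 0.5:                   # assumed threshold
            return False
        # 2) Signal-to-noise: overall trend should dominate step-to-step noise.
        signal = scores.max() - scores.min()
        noise = np.abs(np.diff(scores)).mean()
        if signal / max(noise, 1e-9) < 2.0:                     # assumed threshold
            return False
        # 3) Non-random performance: final score must clearly beat the baseline.
        if scores[-1] <= random_baseline + 0.02:                # assumed margin
            return False
    # 4) Model ordering consistency: model ranking should agree across the last
    #    two checkpoints (rank correlation used here as a simple proxy).
    names = list(curves)
    final = [curves[n][-1] for n in names]
    prev = [curves[n][-2] for n in names]
    return spearmanr(final, prev)[0] > 0.8                      # assumed threshold
```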

💡 Deep Analysis

Figure 1

📄 Full Content

Large language models (LLMs) have rapidly evolved into a central focus of modern artificial intelligence research, driving substantial progress in natural language understanding and generation. Originating from the Transformer model architecture introduced by Vaswani et al. (2017), these models with billions of trainable parameters are typically trained on unprecedentedly large textual datasets. This extensive training enables them to achieve state-of-the-art performance across a broad spectrum of applications. Crucially, it empowers these models to generalize beyond their original training objectives via in-context learning, allowing them to adapt to novel problems without the need for task-specific parameter updates. This distinct capability highlights their utility as versatile, general-purpose computational systems.

Model evaluation is a crucial part of research and deployment. Most evaluation resources are in English, hindering model development for low-resource languages such as Finnish. We previously sought to mitigate this challenge by introducing the original FIN-bench (Luukkonen et al., 2023), the first medium-scale effort for generative model evaluation in Finnish. Finnish has also been included in EuroEval (Nielsen et al., 2024), MMTEB (Enevoldsen et al., 2025), and GlotEval (Luo et al., 2025). However, these resources have their drawbacks:

• Data quality. The quality of the datasets for benchmarking models of different sizes is not assessed, which can rule out a large proportion of tasks (Kydlíček et al., 2024), or samples are produced by machine translation without human review.

• Task formulation. Task formulations are simplistic, do not account for prompt sensitivity (Voronov et al., 2024), and are poorly suited to evaluating non-instruction-tuned models (Gu et al., 2025).

We present FIN-bench-v2, a broad collection of Finnish benchmark datasets compiled into a unified evaluation suite. We systematically evaluate the quality of the benchmark datasets using various metrics, create a diverse collection of hand-written prompts across all datasets with multiple human annotators, and manually refine the machine-translated GoldenSwag and XED datasets for accurate representation. We release the suite in a format compatible with the widely used Language Model Evaluation Harness (Gao et al., 2024).
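
As a usage illustration, the snippet below runs one FIN-bench-v2 task through the harness's Python API. It assumes the LumiOpen fork keeps the standard lm-eval `simple_evaluate` entry point; the task identifier is inferred from the released figure file names rather than confirmed documentation.

```python
# Hedged usage sketch: run one FIN-bench-v2 task via the Language Model
# Evaluation Harness Python API (assumes the LumiOpen fork keeps the standard
# lm-eval interface; the task name below is an assumption, not confirmed docs).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                       # HuggingFace backend
    model_args="pretrained=LumiOpen/Poro-34B-chat",   # any HF causal LM works here
    tasks=["arc_challenge_fi_mcf_fbv2"],              # assumed task identifier
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])                             # per-task metrics
```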

Our main objectives for FIN-bench-v2 were modernizing the previous version of FIN-bench into a long-term-maintainable, easy-to-use format and expanding the benchmark to be more extensive and reliable for the evaluation of models of different sizes.

The original FIN-bench (Luukkonen et al., 2023) covered a broad, though not comprehensive, range of tasks for evaluating the Finnish language capabilities of LLMs. However, the evaluation libraries on which it relied had become deprecated, making it difficult to use in 2025. We therefore first modernized and ported FIN-bench to work on the LM Evaluation Harness (Gao et al., 2024), converting its datasets into the native format supported by the HuggingFace Datasets library to ensure long-term maintainability and ease of use. This modernization effort later evolved into FIN-bench-v2, a broader initiative to expand and diversify the benchmark’s task coverage. In particular, we sought to introduce new tasks from a variety of domains, including mathematics, geography, and medicine, to make the suite as comprehensive and representative as possible.
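
Assuming the converted tasks are published as ordinary HuggingFace Datasets, inspecting one of them would look roughly like the sketch below; the repository path and field names are placeholders, not confirmed identifiers from the release.

```python
# Placeholder sketch: loading a converted FIN-bench-v2 task with the HuggingFace
# Datasets library. The dataset path is hypothetical; substitute the actual
# repository name from the FIN-bench-v2 release.
from datasets import load_dataset

ds = load_dataset("TurkuNLP/finbench_general_knowledge", split="test")  # hypothetical path
print(ds.features)   # inspect the schema (e.g., question, choices, answer fields)
print(ds[0])         # first example
```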

Beyond standard post-training assessment, we designed the suite to facilitate intermediate feedback during the pre-training phase via model checkpoint evaluation. To accommodate the distinct behaviors of base and fine-tuned models, we sought to implement two separate prompting strategies: Cloze Formulation (CF) and Multiple-choice Formulation (MCF) (Gu et al., 2025). This dual approach addresses established findings that while instruction-tuned models benefit from answer choices embedded in the prompt (MCF), base models typically demonstrate superior performance with standard cloze-style completions (Brown et al., 2020).
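
To make the distinction concrete, here is a small hand-written example (not one of the released prompt variants) of how the same item can be posed under the two formulations: under CF each full answer string is scored as a continuation, while under MCF the options are listed in the prompt and only the answer letters are scored.

```python
# Illustrative CF vs. MCF prompts for one toy Finnish multiple-choice item.
# These strings are written for this summary, not taken from the released prompts.
question = "Mikä on Suomen pääkaupunki?"           # "What is the capital of Finland?"
choices = ["Helsinki", "Tampere", "Turku", "Oulu"]
letters = ["A", "B", "C", "D"]

# Cloze Formulation (CF): score each full answer string as a continuation.
cf_context = f"Kysymys: {question}\nVastaus:"
cf_continuations = [f" {c}" for c in choices]

# Multiple-choice Formulation (MCF): list the options, score only the letter.
mcf_context = (
    f"Kysymys: {question}\n"
    + "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    + "\nVastaus:"
)
mcf_continuations = [f" {l}" for l in letters]
```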

The first step of the benchmark creation was to include all tasks and datasets from the original FIN-bench (Luukkonen et al., 2023). As will be discussed in the following section, each of these tasks and datasets was systematically re-evaluated to determine whether it should be retained, modified, or excluded in the construction of FIN-bench-v2. This reassessment ensured that the updated benchmark remained reliable, relevant, and compatible with our renewed evaluation framework.

To further broaden the scope of FIN-bench-v2 and include new tasks and datasets across a variety of domains, we investigated a wide range of existing datasets as potential candidates. While some of these datasets were already familiar to us through prior experiments and were known to meet our quality standards, others required closer inspection and additional processing. Our final pool of candidate tasks included ARC Challenge (Clark et al., 2018), Belebele (Bandarkar et al., 2024), and GoldenSwag, among others.


This content is AI-processed based on open access ArXiv data.
