Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
💡 Research Summary
This paper investigates the reliability of large language model (LLM) evaluation across languages by disentangling true performance differences from measurement instability. The authors adopt a controlled generation approach: they generate 10,000 synthetic customer‑support dialogues for Estonian, Finnish, Hungarian, and English using identical templates and parameters, ensuring that semantic content is constant while only language‑specific surface properties vary.
First, they validate generation consistency with surface‑level automatic metrics—type‑token ratio (TTR), moving‑average TTR (MA‑TTR), self‑BLEU, and semantic similarity. Semantic similarity remains highly stable across languages (0.89‑0.94), whereas lexical diversity shows predictable differences due to morphological complexity (e.g., higher MA‑TTR in Estonian). This confirms that the underlying content quality is comparable despite surface variations.
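The paper does not include code for these lexical-diversity statistics, but they are standard. A minimal sketch of TTR and moving-average TTR (MA-TTR), assuming pre-tokenized input and an illustrative window size of 50 (the paper's actual window is not stated here), might look like:

```python
def ttr(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def ma_ttr(tokens, window=50):
    """Moving-average TTR: mean TTR over fixed-size sliding windows.
    Averaging over windows reduces plain TTR's sensitivity to text length."""
    if len(tokens) < window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

# Toy inputs: a repetitive text scores lower than a fully varied one.
repetitive = ["hello", "hello", "hello", "world"] * 25
varied = [f"tok{i}" for i in range(100)]
print(ma_ttr(repetitive) < ma_ttr(varied))  # True
```

Higher MA-TTR for Estonian, as reported above, is the expected signature of richer morphology: more distinct surface forms per window of running text.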
Second, they employ an LLM‑as‑a‑judge model (gpt‑5‑mini) to score each dialogue on five dimensions: Grammar (G), Readability (R), Fluency (F) – all surface‑level – and Coherence (C) plus Label Recovery Accuracy (LRA), which require discourse‑level reasoning and instruction‑following. Rankings of models per metric are compared across languages using Kendall’s τ with bootstrap confidence intervals. Surface metrics (G, R, F) exhibit high cross‑language stability (τ ≥ 0.70) and few rank inversions. In contrast, Coherence and LRA show near‑zero or negative τ values (e.g., τ = ‑0.06 for Estonian‑Hungarian Coherence) and statistically significant rank inversions, indicating that the judge’s discourse‑level assessment collapses when transferred to morphologically rich, low‑resource languages.
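The ranking-stability comparison described above can be sketched with SciPy's `kendalltau` plus a percentile bootstrap over models. The function name, the resampling scheme, and the toy per-model scores below are illustrative assumptions, not the authors' released implementation:

```python
import numpy as np
from scipy.stats import kendalltau

def ranking_stability(scores_a, scores_b, n_boot=2000, seed=0):
    """Kendall's tau between two languages' per-model scores for one
    metric, with a bootstrap percentile CI from resampling models."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    tau, _ = kendalltau(scores_a, scores_b)
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        t, _ = kendalltau(scores_a[idx], scores_b[idx])
        if not np.isnan(t):  # a resample of one repeated model yields nan
            boots.append(t)
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return tau, (lo, hi)

# Hypothetical per-model Coherence scores in two languages,
# with a near-inverted ranking between them:
est = [4.1, 3.2, 4.5, 2.8, 3.9]
hun = [2.9, 4.2, 3.0, 4.4, 3.1]
tau, ci = ranking_stability(est, hun)
```

A strongly negative tau here, as with the Estonian-Hungarian Coherence result cited above, means the judge's model ranking in one language is close to the reverse of its ranking in the other.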
Meta‑prompt language sensitivity tests (English vs. Estonian prompts) produce negligible score differences (<0.05), ruling out prompt‑language bias. An ablation across six different judge models (various GPT‑5 variants, Qwen3‑32B, Llama‑4‑Maverick, GPT‑OSS‑120B) shows virtually identical instability patterns (Δ < 0.02), suggesting the issue is systematic rather than model‑specific.
Human annotation is collected from three native Estonian speakers for 100 dialogues, providing a noisy reference (κ ≈ 0.38 for coherence, κ ≈ 0.32 for fluency). Human judgments correlate more strongly with the judge's surface‑level scores (Grammar, Readability, Fluency) than with its Coherence or LRA scores, reinforcing the conclusion that pragmatic judgments are unreliable in zero‑shot cross‑lingual settings.
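The summary reports κ without naming the variant; for three fixed annotators rating every item, a Fleiss-style kappa is a natural fit. A sketch under that assumption, with a made-up count matrix, might look like:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. `counts` is an (items x categories) matrix where
    counts[i, c] is how many annotators put item i in category c; every
    row must sum to the same number of annotators."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    # Per-item agreement: fraction of annotator pairs that agree.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category distribution.
    p_c = counts.sum(axis=0) / (n_items * n_raters)
    p_e = (p_c ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy data: 3 annotators, 4 dialogues, 3 score categories (rows sum to 3).
counts = [[3, 0, 0],
          [0, 3, 0],
          [2, 1, 0],
          [0, 1, 2]]
kappa = fleiss_kappa(counts)
```

Values around 0.3-0.4, like those reported, sit in the "fair to moderate" agreement band, which is why the human scores are treated as a noisy reference rather than a gold standard.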
The authors propose a diagnostic workflow: (1) verify generation consistency with automatic metrics; (2) collect a modest set of expert annotations in the target language; (3) compare judge‑human ranking alignment; (4) calibrate or fine‑tune the judge if correlations are weak. This staged approach offers a “stability gate” that can flag evaluation methods likely to fail on natural data before large‑scale deployment, especially valuable for under‑represented language communities with limited resources.
In summary, while surface‑level evaluations (grammar, readability, fluency) transfer robustly across related Finno‑Ugric languages, discourse‑level assessments (coherence, instruction following) do not. The paper highlights the necessity of language‑specific calibration for pragmatic metrics and provides publicly released code, synthetic data, and protocols to enable replication across other language families.