Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation
Quality Estimation (QE) aims to assess the quality of machine translation (MT) outputs without relying on reference translations, making it essential for real-world, large-scale MT evaluation. Large Language Models (LLMs) have shown significant promise in advancing MT quality estimation. However, most QE approaches rely solely on scalar quality scores, offering no explicit information about the translation errors that should drive these judgments. Moreover, for low-resource languages where annotated QE data is limited, existing approaches struggle to achieve reliable performance. To address these challenges, we introduce the first segment-level QE dataset for English to Malayalam, a severely resource-scarce language pair in the QE domain, comprising human-annotated Direct Assessment (DA) scores and Translation Quality Remarks (TQR), which are short, contextual, free-form annotator comments that describe translation errors. We further introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters using policy rewards derived from DA scores and TQR. Integrating error-aware rewards into ALOPE-RL enables LLMs to reason about translation quality beyond numeric scores. Despite being trained on a small-scale QE dataset, ALOPE-RL achieves state-of-the-art performance on English to Malayalam QE using compact LLMs (≤4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models. Our results demonstrate that error-aware, policy-based learning can deliver strong QE performance under limited data and compute budgets. We release our dataset, code, and trained models to support future research.
💡 Research Summary
The paper tackles two persistent challenges in machine translation quality estimation (QE): the lack of explicit error information in scalar quality scores and the difficulty of achieving strong performance for low‑resource language pairs where annotated QE data are scarce. To address these issues, the authors make three major contributions. First, they construct the first segment‑level QE dataset for English‑to‑Malayalam (En→Ml), a severely under‑represented language pair. The dataset contains 5 K sentence pairs, each annotated with Direct Assessment (DA) scores from three human raters and with Translation Quality Remarks (TQR), short free‑form comments that describe the types of translation errors (Untranslated, Addition, Mistranslation, Fluency Error, Other). TQR provides rich contextual error information without the heavy annotation burden of MQM or ESA, making it a lightweight yet informative supervision signal.
Second, the authors propose ALOPE‑RL (AdaptivE Learning and Optimization for Policy‑based quality Estimation with Reinforcement Learning), a policy‑based reinforcement learning framework that leverages both DA scores and TQR‑derived signals as multi‑objective rewards. Using Group Relative Policy Optimization (GRPO), ALOPE‑RL simultaneously optimizes a scalar regression loss (to match DA) and several auxiliary reward components derived from synthetic explanations generated from TQR. These explanations contain “Identified error categories” and a natural‑language “Description of the translation,” which serve as reference signals for reward computation. By integrating error‑aware feedback, the model learns not only to predict a quality score but also to reason about the underlying causes of translation errors.
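The two ingredients described above, a multi-objective reward and GRPO's group-relative normalization, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the reward weights, the linear decay on the DA gap, and the Jaccard overlap over error categories are all assumptions made here for concreteness.

```python
def reward(pred_score, da_score, pred_errors, ref_errors,
           w_score=0.7, w_err=0.3):
    """Hypothetical multi-objective reward: closeness to the DA score
    plus overlap with TQR-derived reference error categories."""
    # Score reward: 1.0 at a perfect match, decaying linearly
    # with the absolute gap (DA scores assumed to lie in [0, 100]).
    r_score = max(0.0, 1.0 - abs(pred_score - da_score) / 100.0)
    # Error reward: Jaccard overlap between predicted and reference
    # error-category sets (e.g. "Mistranslation", "Fluency Error").
    p, r = set(pred_errors), set(ref_errors)
    r_err = len(p & r) / len(p | r) if (p | r) else 1.0
    return w_score * r_score + w_err * r_err

def group_relative_advantages(rewards):
    """GRPO's key idea: normalize each sampled completion's reward
    against the mean and std of its own sampling group, so no learned
    value network is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((x - mean) ** 2 for x in rewards) / len(rewards)
    std = var ** 0.5
    return [(x - mean) / (std + 1e-8) for x in rewards]
```

In use, several completions are sampled per segment, each is scored with `reward`, and the resulting group-relative advantages weight the policy-gradient update, so completions that both hit the DA score and name the right error types are reinforced.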
Third, the authors demonstrate that strong QE performance can be achieved with modest computational resources. They fine‑tune compact large language models (LLMs) of ≤ 4 B parameters using LoRA adapters and 4‑bit quantization, drastically reducing memory and compute requirements. Despite training on the small En→Ml dataset, ALOPE‑RL outperforms larger LLM‑based baselines and state‑of‑the‑art encoder‑based QE models such as COMET‑Kiwi and C‑K. Specifically, on the En→Ml test set, ALOPE‑RL attains Pearson correlation of 0.78 and Kendall τ of 0.62, surpassing the best encoder baseline (0.71 / 0.55) and a strong GPT‑based metric (0.73 / 0.58).
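The two evaluation metrics quoted above are standard and easy to compute from raw predictions; a self-contained sketch of both (the sample values in the test are illustrative inputs, not the paper's data):

```python
def pearson(x, y):
    """Pearson correlation: linear agreement between predicted and
    human DA scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs,
    measuring ranking agreement (ties counted as neither)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Pearson rewards getting the magnitudes right, while Kendall τ only cares about ordering, which is why QE papers typically report both.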
To assess generalizability, the authors extend experiments to three additional low‑resource pairs—English‑to‑Tamil (En→Ta), English‑to‑Marathi (En→Mr), and English‑to‑Hindi (En→Hi). For these pairs, only DA scores are available, so the authors derive weak supervision from word‑level QE tags (WT‑⟂) obtained via post‑editing alignments, and compare it with TQR‑based supervision where possible. Results show that incorporating TQR consistently yields higher correlations (an average gain of 5–7 percentage points) than using WT‑⟂ alone, confirming that error‑aware rewards improve QE across languages.
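For intuition, one simple way to turn word-level OK/BAD tags into a segment-level weak signal is the fraction of OK tokens; the sketch below is a hypothetical aggregation for illustration, not the exact scheme the paper uses:

```python
def segment_score_from_tags(tags):
    """Map word-level QE tags ('OK' / 'BAD') to a 0-100 segment-level
    weak quality score: the percentage of tokens tagged OK."""
    if not tags:
        return 0.0
    ok = sum(1 for t in tags if t == "OK")
    return 100.0 * ok / len(tags)
```

Such a signal captures error density but, unlike TQR, says nothing about error type or severity, which is consistent with the gap the authors observe between WT‑⟂-only and TQR-based supervision.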
Methodologically, the paper details the prompt engineering used to generate synthetic explanations from TQR, the selection of two strong reasoning models (GPT‑2.5‑pro and D‑S‑V3) for this purpose, and a small human evaluation confirming the quality of the generated explanations. The reinforcement learning loop integrates these explanations as reference signals, and the GRPO algorithm ensures stable policy updates even with limited data.
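A hypothetical prompt template for turning a TQR comment into a structured explanation is shown below. The two requested fields mirror the "Identified error categories" and "Description of the translation" outputs described above, but the wording, field layout, and function name are assumptions, not the paper's actual prompt:

```python
# Illustrative template; the paper's real prompt is not reproduced here.
EXPLANATION_PROMPT = """\
You are an expert English-Malayalam translation evaluator.

Source: {source}
Translation: {translation}
Annotator remark (TQR): {tqr}
DA score (0-100): {da_score}

Using the remark, produce:
1. Identified error categories (choose from: Untranslated, Addition,
   Mistranslation, Fluency Error, Other).
2. Description of the translation: a short explanation of the errors
   and how they justify the score.
"""

def build_prompt(source, translation, tqr, da_score):
    """Fill the template for one annotated segment."""
    return EXPLANATION_PROMPT.format(
        source=source, translation=translation, tqr=tqr, da_score=da_score
    )
```

The filled prompt would be sent to a strong reasoning model, and the returned explanation then serves as the reference signal for the error-aware reward components.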
The contributions are summarized as follows: (1) Release of a novel En→Ml QE dataset with both scalar and error‑focused annotations; (2) Introduction of ALOPE‑RL, a reinforcement‑learning framework that jointly optimizes scalar quality prediction and error‑aware reasoning; (3) Demonstration that compact, quantized LLMs fine‑tuned with LoRA can achieve SOTA QE performance under low‑resource constraints; (4) Empirical evidence that weak error remarks substantially boost QE accuracy and generalize across multiple language pairs.
The work opens several avenues for future research. Automated generation of TQR-like remarks via advanced prompting could further reduce annotation costs. Multi‑language, multi‑domain transfer learning could leverage the error‑aware policy to adapt quickly to new language pairs. Finally, integrating the learned error reasoning into downstream tasks such as automatic post‑editing or interactive translation assistance could create more transparent and controllable MT systems. Overall, the paper presents a compelling case that error‑aware reinforcement learning, even with modest data and compute, can substantially advance reference‑less translation quality estimation.