Polish to English Statistical Machine Translation

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

This research explores the effects of various training settings on a Polish to English statistical machine translation system for spoken language. Selected elements of the TED, Europarl, and OPUS parallel text corpora were used to train the language models and to develop, tune, and test the translation system. The BLEU, NIST, METEOR, and TER metrics were used to evaluate the effects of data preparation on translation quality.


💡 Research Summary

This paper investigates how different training configurations affect the quality of a Polish‑to‑English statistical machine translation (SMT) system, with a particular focus on spoken‑language data. The authors assembled three major parallel corpora—TED Talks (representing informal speech), Europarl (formal parliamentary proceedings), and OPUS (a heterogeneous web‑crawled collection). Each corpus was split into training, development, tuning, and test subsets, and filtered for length and duplication.

Pre‑processing was tailored to the linguistic characteristics of both languages. Polish text was processed with the Morfeusz2 morphological analyzer to separate stems from affixes, reducing the data sparsity caused by Polish's rich inflection. English text underwent tokenization and truecasing, and a custom list of common contractions was used to preserve spoken‑style forms.
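The paper does not include code, but the truecasing step described above can be sketched as follows. This is a toy, frequency‑based truecaser in the spirit of the Moses `truecase.perl` script, not the authors' actual pipeline; the tokenizer is a deliberately simplified stand‑in.

```python
import re
from collections import Counter

def tokenize(text):
    """Split punctuation off words; a simplified stand-in for a real tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

def train_truecaser(corpus_lines):
    """Record the most frequent surface form of each lowercased token."""
    forms = Counter()
    for line in corpus_lines:
        for tok in tokenize(line):
            forms[(tok.lower(), tok)] += 1
    best = {}
    for (low, surf), n in forms.items():
        if low not in best or n > best[low][1]:
            best[low] = (surf, n)
    return {low: surf for low, (surf, _) in best.items()}

def truecase(line, model):
    """Replace each token with its most frequent training-corpus casing."""
    return " ".join(model.get(t.lower(), t) for t in tokenize(line))
```

Training on lowercase‑normalized spoken transcripts this way restores proper‑noun casing ("warsaw" becomes "Warsaw") while leaving unknown tokens untouched.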

Language models (LMs) were built using SRILM with 3‑, 4‑, and 5‑gram orders, applying both Kneser‑Ney smoothing and Bayesian back‑off. Empirical results showed that a 4‑gram Kneser‑Ney LM achieved the best trade‑off between perplexity and translation quality; the 5‑gram model suffered from data sparsity, while the 3‑gram model lacked sufficient context.
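To illustrate the smoothing at work, here is a toy bigram version of interpolated Kneser‑Ney (the paper builds 3‑ to 5‑gram models with SRILM; this sketch omits `<unk>` handling and sentence boundaries). The idea is to subtract a fixed discount from observed counts and redistribute that mass according to how many distinct contexts a word continues.

```python
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney for a bigram model (simplified sketch).
    Returns a function prob(history, word)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    hist = Counter()       # total bigram count for each history word
    followers = Counter()  # number of distinct continuations per history
    cont = Counter()       # number of distinct left contexts per word
    for (v, w), n in bigrams.items():
        hist[v] += n
        followers[v] += 1
        cont[w] += 1
    total_types = len(bigrams)

    def prob(v, w):
        p_cont = cont[w] / total_types           # continuation probability
        if hist[v] == 0:
            return p_cont                        # unseen history: back off fully
        discounted = max(bigrams[(v, w)] - d, 0) / hist[v]
        lam = d * followers[v] / hist[v]         # mass reserved for back-off
        return discounted + lam * p_cont

    return prob
```

Unseen bigrams still receive non‑zero probability through the continuation term, which is exactly what keeps higher‑order models usable on sparse morphologically rich data.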

The translation model employed the Moses phrase‑based framework. Word alignment was performed with GIZA++ using the HMM model and IBM Model 4, and the two directional alignments were symmetrized to improve precision. Two reordering strategies were compared: a distance‑based model and a lexicalized model that conditions reordering on the phrase pairs being translated. The lexicalized approach consistently outperformed the distance‑based one on the spoken‑language test set, yielding an average BLEU increase of 2.3 percentage points and a TER reduction of 1.8 percentage points.
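To make the distance‑based side of this comparison concrete: a Moses‑style distortion penalty charges each jump between consecutively translated source spans by its distance, regardless of the words involved (the lexicalized model instead learns monotone/swap/discontinuous probabilities per phrase pair). A minimal sketch, illustrative only:

```python
def distortion_cost(phrase_spans):
    """Distance-based reordering cost: for each consecutively translated
    source span (start, end), add |start - (previous_end + 1)|."""
    cost = 0
    prev_end = -1  # index of the last source word covered so far
    for start, end in phrase_spans:
        cost += abs(start - (prev_end + 1))
        prev_end = end
    return cost
```

A monotone translation of spans `(0,1), (2,3)` costs 0, while swapping them costs 6, so every reordering is penalized equally whether or not it is linguistically motivated; this word‑blindness is why the lexicalized model wins on spoken language.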

Parameter tuning was carried out with both Minimum Error Rate Training (MERT) and the Margin‑Infused Relaxed Algorithm (MIRA). MIRA proved advantageous because it can optimize multiple evaluation metrics simultaneously, leading to higher METEOR scores without sacrificing BLEU.
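MERT searches for log‑linear feature weights that maximize an automatic metric over n‑best translations of a development set. As a rough, hypothetical illustration of that objective only (real MERT performs an exact line search per feature, and MIRA makes margin‑based online updates; neither is shown here), a naive random search over weights looks like this:

```python
import random

def tune_weights(nbest, n_feats, iters=200, seed=0):
    """Toy stand-in for MERT-style tuning: random search over log-linear
    weights, keeping whichever weights make 1-best rescoring score highest.
    `nbest` is a list of sentences, each a list of
    (feature_vector, sentence_metric_score) candidates."""
    rng = random.Random(seed)

    def avg_metric(w):
        total = 0.0
        for cands in nbest:
            # pick the candidate with the highest weighted feature score
            best = max(cands, key=lambda fb: sum(wi * fi
                                                 for wi, fi in zip(w, fb[0])))
            total += best[1]
        return total / len(nbest)

    best_w = [1.0] * n_feats
    best_score = avg_metric(best_w)
    for _ in range(iters):
        w = [rng.uniform(-1, 1) for _ in range(n_feats)]
        s = avg_metric(w)
        if s > best_score:
            best_w, best_score = w, s
    return best_w, best_score
```

The point of the sketch is the objective, not the search: tuning never touches the models themselves, only the weights that trade their scores off against each other.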

Four automatic metrics—BLEU, NIST, METEOR, and TER—were used for evaluation. The best results were obtained when the training data combined TED and OPUS corpora, achieving BLEU = 31.7, NIST = 7.2, METEOR = 0.48, and TER = 48.2. This configuration outperformed a Europarl‑only baseline by roughly 4.5 BLEU points, demonstrating the complementary effect of mixing informal speech data with broader domain material.
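For reference, sentence‑level BLEU combines modified n‑gram precisions with a brevity penalty (the paper reports corpus‑level scores, which pool counts over the whole test set before taking the geometric mean; this compact sketch is the unsmoothed single‑sentence variant):

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipped (modified) n-gram matches
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_prec += math.log(overlap / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_prec)
```

TER instead counts the minimum number of edits (including phrase shifts) needed to turn the hypothesis into the reference, which is why lower TER and higher BLEU usually, but not always, move together.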

In summary, the study identifies five key factors that drive performance in Polish‑to‑English spoken‑language SMT: (1) morphology‑aware preprocessing for Polish, (2) a 4‑gram Kneser‑Ney language model, (3) lexicalized reordering, (4) a mixed‑domain training set that includes spoken‑style text, and (5) MIRA‑based multi‑metric tuning. The authors argue that these insights are transferable to neural machine translation (NMT) pipelines, where domain adaptation and preprocessing remain critical challenges.

