PJAIT Systems for the IWSLT 2015 Evaluation Campaign Enhanced by Comparable Corpora


In this paper, we attempt to improve Statistical Machine Translation (SMT) systems for a very diverse set of language pairs (in both directions): Czech - English, Vietnamese - English, French - English, and German - English. To accomplish this, we performed translation model training, adapted the training settings for each language pair, and obtained comparable corpora for our SMT systems. Innovative tools and data adaptation techniques were employed. The TED parallel text corpora for the IWSLT 2015 evaluation campaign were used to train language models and to develop, tune, and test the system. In addition, we prepared Wikipedia-based comparable corpora for use with our SMT system; this data was specified as permissible for the IWSLT 2015 evaluation. We explored the use of domain adaptation techniques, symmetrized word alignment models, unsupervised transliteration models, and the KenLM language modeling tool. To evaluate the effects of different preparations on translation results, we conducted experiments and measured quality with the BLEU, NIST, and TER metrics. Our results indicate that our approach had a positive impact on SMT quality.


💡 Research Summary

The paper presents the development and evaluation of statistical machine translation (SMT) systems built by the PJAIT team for the IWSLT 2015 evaluation campaign. The authors focus on four language pairs—Czech‑English, Vietnamese‑English, French‑English, and German‑English—covering both translation directions. Their primary objective is to improve translation quality while staying within the data constraints defined by the IWSLT organizers. To achieve this, the researchers follow a four‑stage methodology.

In the first stage, they construct a baseline system using the publicly available TED talks parallel corpora. Standard preprocessing steps such as tokenization, lower‑casing, and length filtering are applied. Moses is employed to train phrase‑based translation models, and KenLM is used to build 5‑gram language models. This baseline mirrors the typical IWSLT setup and serves as a reference point for later experiments.
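The preprocessing steps described above can be sketched in Python. This is a minimal, illustrative stand-in for the Moses tokenizer and `clean-corpus-n.perl` scripts the paper's pipeline would actually use; the function name, the naive regex tokenizer, and the length thresholds are assumptions for illustration, not the authors' exact settings.

```python
import re

def preprocess(src_lines, tgt_lines, min_len=1, max_len=80):
    """Tokenize, lowercase, and length-filter a parallel corpus.

    A simplified sketch of standard SMT preprocessing; real systems
    use the Moses tokenizer and cleaning scripts instead of this regex.
    """
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        # Naive tokenization: split punctuation off words, then lowercase.
        src_toks = re.findall(r"\w+|[^\w\s]", src.lower(), re.UNICODE)
        tgt_toks = re.findall(r"\w+|[^\w\s]", tgt.lower(), re.UNICODE)
        # Drop pairs that are empty or overly long on either side.
        if (min_len <= len(src_toks) <= max_len
                and min_len <= len(tgt_toks) <= max_len):
            pairs.append((src_toks, tgt_toks))
    return pairs
```

The cleaned token pairs would then feed GIZA++ word alignment and KenLM language-model training, mirroring the baseline setup described above.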

The second stage introduces a novel source of data: Wikipedia‑derived comparable corpora. The team automatically crawls Wikipedia articles in each target language, extracts sentences, and aligns them at the sentence level using document structure, inter‑language links, and metadata. To ensure high alignment quality, they filter candidate pairs based on length ratio, lexical overlap, and translation probability estimates, ultimately obtaining roughly two million high‑quality comparable sentences per language pair. Unlike true parallel data, these sentences are independent renditions of the same topic, providing complementary domain coverage and lexical diversity.
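The filtering criteria mentioned above (length ratio and lexical overlap) can be sketched as simple heuristics. This is an assumed, minimal illustration of that kind of candidate filtering, not the authors' actual mining pipeline; the function names and thresholds are invented for the example, and the translation-probability filter is deliberately omitted.

```python
def length_ratio_ok(src_toks, tgt_toks, max_ratio=2.0):
    """Reject candidate pairs whose token-length ratio is implausible."""
    a, b = len(src_toks), len(tgt_toks)
    return a > 0 and b > 0 and max(a, b) / min(a, b) <= max_ratio

def lexical_overlap(src_toks, tgt_toks):
    """Fraction of shared surface tokens -- a crude anchor/cognate signal,
    useful for names and numbers that survive across languages."""
    s, t = set(src_toks), set(tgt_toks)
    return len(s & t) / max(1, min(len(s), len(t)))

def filter_candidates(candidates, max_ratio=2.0, min_overlap=0.1):
    """Keep sentence pairs passing both heuristic filters.
    A real pipeline would also score pairs with a translation model."""
    return [(s, t) for s, t in candidates
            if length_ratio_ok(s, t, max_ratio)
            and lexical_overlap(s, t) >= min_overlap]
```

Because comparable sentences are independent renditions rather than translations, such filters trade recall for precision: only the most parallel-looking pairs survive into the training data.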

In the third stage, the authors integrate the comparable corpora with the TED data and adapt the SMT pipeline accordingly. They retrain the language model on the combined corpus using KenLM, which yields a richer n‑gram distribution. For word alignment, they employ GIZA++ with symmetrized IBM Model 4, generating bidirectional alignments that reduce systematic alignment errors. To address out‑of‑vocabulary (OOV) issues—particularly prevalent for proper nouns and loanwords in Vietnamese and Czech—they incorporate an unsupervised transliteration model that learns character‑level mappings without supervision. Parameter tuning is performed separately for each language pair using Minimum Error Rate Training (MERT) to optimize BLEU on a held‑out development set.
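The bidirectional alignment symmetrization mentioned above can be illustrated with a toy grow-style heuristic. This is a deliberately simplified sketch: GIZA++ produces the two directional alignments, and Moses' actual `grow-diag-final` heuristic applies additional constraints (tracking which source and target words are still unaligned) that are not reproduced here.

```python
def symmetrize(fwd, bwd):
    """Symmetrize two directional word alignments (simplified grow-diag).

    fwd and bwd are sets of (src_idx, tgt_idx) links from the
    source-to-target and target-to-source alignment runs.
    """
    inter = fwd & bwd                  # high-precision links: start here
    union = fwd | bwd                  # high-recall candidate links
    aligned = set(inter)
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:                       # grow: adopt union links adjacent
        added = False                  # to an already-accepted link
        for (i, j) in sorted(union - aligned):
            if any((i + di, j + dj) in aligned for di, dj in neighbors):
                aligned.add((i, j))
                added = True
    return aligned
```

Starting from the intersection keeps precision high, while the growing step recovers some of the recall of the union, which is the core intuition behind symmetrized alignments.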

The fourth stage consists of extensive evaluation. The authors report results using three automatic metrics: BLEU, NIST, and TER. Across all four language pairs, the system augmented with comparable corpora and domain adaptation outperforms the baseline by an average of 2.5 BLEU points, shows consistent NIST gains, and reduces TER by roughly 1.8 %. The most pronounced improvements appear for Vietnamese‑English and Czech‑English, where the comparable data supplies abundant domain‑specific terminology and phraseology. Error analysis reveals that while the added data helps with lexical choice and idiomatic expressions, challenges remain for long, complex sentences and for handling multiple senses of ambiguous words.
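To make the headline metric concrete, here is a minimal sentence-level BLEU sketch with a brevity penalty. This is an assumed simplification for illustration: official IWSLT scoring uses corpus-level BLEU (typically via `mteval` or `multi-bleu.perl`), and the add-one smoothing used below is just one of several common smoothing choices.

```python
import math
from collections import Counter

def bleu(hypothesis, reference, max_n=4):
    """Smoothed sentence-level BLEU over token lists (simplified)."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hypothesis[i:i + n])
                             for i in range(len(hypothesis) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clipped n-gram matches: Counter & takes the minimum count.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(1, sum(hyp_ngrams.values()))
        # Add-one smoothing so one empty n-gram order doesn't zero the score.
        precisions.append((overlap + 1) / (total + 1))
    # Brevity penalty discourages overly short hypotheses.
    bp = min(1.0, math.exp(1 - len(reference) / max(1, len(hypothesis))))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

NIST weights n-gram matches by informativeness and TER counts edit operations instead, which is why the paper reports all three: gains that appear on one metric but not the others usually signal a metric artifact rather than a real quality improvement.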

In conclusion, the study demonstrates that Wikipedia‑derived comparable corpora can be seamlessly integrated into a conventional phrase‑based SMT workflow, delivering measurable quality gains without violating IWSLT data usage policies. The authors suggest that future work should explore transferring these techniques to neural machine translation architectures, refining sentence alignment algorithms, and developing more sophisticated lexical normalization methods to further close the remaining performance gap.

