Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs
Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained comparable corpora. The task is highly practical: non-parallel multilingual data exist in far greater quantities than parallel corpora, yet parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences, filtering them out from noisy or merely comparable sentence pairs. We describe our implementation of a specialized tool for this task as well as the training and adaptation of a machine translation system that supplies our filter with additional information about the similarity of comparable sentence pairs.
💡 Research Summary
The paper addresses the problem of extracting truly parallel sentence pairs from large, non‑parallel multilingual resources, focusing on Polish–English data drawn from Wikipedia. The authors propose a fully automated pipeline that first builds a subject‑aligned comparable corpus and then filters it to retain only high‑quality parallel sentences.
In the first stage, a language‑independent web crawler starts from a Polish Wikipedia article, follows inter‑language links, and downloads the corresponding English pages. The HTML is stripped of tables, figures, URLs, and other noise, leaving plain text that is stored with a unique topic identifier. This yields a comparable corpus where each document pair shares the same subject but the sentences are not aligned.
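The cleaning step described above can be sketched as a small HTML filter. The following is a minimal illustration, not the authors' implementation: it drops the contents of tables, figures, and scripts while collecting the remaining visible text, then strips bare URLs. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

class WikiTextExtractor(HTMLParser):
    """Collect visible text while skipping tables, figures, scripts, and styles."""
    SKIP = {"table", "figure", "script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting level inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean_article(html: str) -> str:
    """Return plain text with tables/figures removed and URLs stripped."""
    parser = WikiTextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    return re.sub(r"https?://\S+", "", text).strip()
```

In practice the extracted text would be stored under the shared topic identifier so that the Polish and English documents remain paired.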
The second stage performs sentence alignment and filtering. Initial alignment uses HunAlign, which can operate without a pre‑existing bilingual dictionary. When no dictionary is available, HunAlign relies on sentence‑length heuristics to produce a rough alignment, then builds an automatic dictionary from the aligned pairs and re‑aligns using this dictionary. This two‑pass approach improves alignment accuracy but still fails to handle crossing alignments (i.e., reordered sentences).
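The sentence-length heuristic that drives HunAlign's first pass can be illustrated with a toy sketch (this is not HunAlign itself, and the function names are hypothetical): sentences are paired by position and kept only when their character lengths are comparable; the surviving pairs would then seed the automatic dictionary for the second pass.

```python
def length_score(src: str, tgt: str) -> float:
    """Crude length-ratio score in (0, 1]; 1.0 means identical character counts."""
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b) if max(a, b) else 0.0

def rough_align(src_sents, tgt_sents, threshold=0.6):
    """First-pass 1:1 alignment by position, kept only if lengths are comparable.

    The retained pairs could then be used to induce a bilingual word list
    for a dictionary-based second pass.
    """
    return [(s, t) for s, t in zip(src_sents, tgt_sents)
            if length_score(s, t) >= threshold]
```

Note that, like HunAlign, a purely positional scheme such as this cannot recover crossing alignments; that limitation motivates the filtering stage described next.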
To overcome this limitation, the authors develop a custom filtering tool that leverages a specialized Polish‑English statistical machine translation (SMT) system. Every Polish sentence is translated into English using this SMT engine; the resulting “intermediate” English translation is then compared with the original English sentence from the comparable corpus. Similarity is measured by a combination of methods: (1) normalized string‑matching ratios (SequenceMatcher), (2) word‑level overlap after stop‑word removal and stemming, (3) synonym expansion using NLTK’s WordNet, and (4) order‑sensitive block matching. The tool allows users to prioritize speed (high‑precision, low‑recall functions) or accuracy (slower, higher‑recall functions) by setting acceptance thresholds.
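The core idea of the filter can be sketched as follows. This is a simplified stand-in for the authors' tool: it blends a character-level `difflib.SequenceMatcher` ratio with content-word overlap after a toy stop-word removal (the stop list, weights, and threshold are illustrative assumptions; stemming and WordNet expansion are omitted).

```python
from difflib import SequenceMatcher

STOP = {"the", "a", "an", "of", "to", "is", "in", "and"}  # illustrative stop list

def content_words(sent: str) -> set:
    """Lowercased, punctuation-stripped tokens minus stop words."""
    return {w.lower().strip(".,;!?") for w in sent.split()} - STOP

def similarity(mt_output: str, candidate: str) -> float:
    """Blend character-level ratio with content-word Jaccard overlap."""
    char_ratio = SequenceMatcher(None, mt_output.lower(), candidate.lower()).ratio()
    a, b = content_words(mt_output), content_words(candidate)
    word_overlap = len(a & b) / len(a | b) if a | b else 0.0
    return 0.5 * char_ratio + 0.5 * word_overlap

def is_parallel(mt_output: str, candidate: str, threshold: float = 0.6) -> bool:
    """Accept a pair when the MT rendering of the Polish side is close enough
    to the English sentence from the comparable corpus."""
    return similarity(mt_output, candidate) >= threshold
```

Raising the acceptance threshold trades recall for precision, mirroring the speed/accuracy switch the authors expose to users.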
The SMT system itself is trained on 36.7 million Polish‑English sentence pairs extracted from the OPUS project and adapted to Wikipedia using a 79‑million‑sentence English language model. Training employs the Moses toolkit, MGIZA++ for word alignment, a 6‑gram Kneser‑Ney language model, and a bidirectional lexical reordering model (msd). Evaluation against Google and Bing online translators shows the custom system achieving higher BLEU (20.51 vs. 18.15/18.87), NIST, METEOR, and lower TER scores, confirming its superiority for this domain.
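A Moses training run with the settings listed above might look roughly like the following command sketch. All paths and corpus names are hypothetical, and this is a config fragment rather than the authors' actual script; it assumes Moses, MGIZA++, and KenLM are installed.

```shell
# Build a 6-gram LM with KenLM (modified Kneser-Ney smoothing by default).
lmplz -o 6 < mono.en > lm.en.arpa
build_binary lm.en.arpa lm.en.binlm

# Phrase-based training: MGIZA++ alignment, msd bidirectional reordering.
train-model.perl --root-dir work \
  --corpus corpus/train --f pl --e en \
  --alignment grow-diag-final-and \
  --reordering msd-bidirectional-fe \
  --lm 0:6:$PWD/lm.en.binlm:8 \
  --mgiza
```

Domain adaptation to Wikipedia would then amount to interpolating or replacing the language model with one built from Wikipedia-domain monolingual text.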
Experimental validation consists of two parts. First, 20 randomly selected bilingual Wikipedia articles were manually aligned by professional translators to create a gold standard. The pipeline’s HunAlign stage initially aligned roughly 30 % of sentences correctly; after filtering, about 70 % of the remaining pairs were judged correct, with a precision exceeding 80 % and an error rate below 5 %. Second, a separate set of 1,000 sentence pairs (unseen during training) was translated and evaluated with BLEU, NIST, METEOR, and TER, again demonstrating the advantage of the custom SMT engine over generic online services.
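For readers unfamiliar with the headline metric, a self-contained sentence-level BLEU sketch is shown below. This is a simplified smoothed variant for illustration, not the exact scorer used in the paper (corpus-level BLEU and the official `mteval` tooling differ in detail).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: clipped n-gram precisions (add-one smoothed),
    geometric mean, and a brevity penalty for short candidates."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngr, r_ngr = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(c, r_ngr[g]) for g, c in c_ngr.items())
        total = max(sum(c_ngr.values()), 1)
        log_prec += math.log((clipped + 1) / (total + 1))  # smoothed precision
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec / max_n)
```

A perfect match scores 1.0, and scores are typically reported scaled by 100 (so the paper's 20.51 corresponds to roughly 0.21 on this scale).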
Key contributions of the work are: (1) an automated, language‑agnostic crawler that builds subject‑aligned comparable corpora from Wikipedia, (2) a two‑pass HunAlign alignment strategy that generates an automatic bilingual dictionary, (3) a sophisticated filtering mechanism that combines machine translation with multi‑faceted similarity metrics, and (4) a domain‑adapted Polish‑English SMT system that supplies high‑quality translations for the filter. The approach is shown to be effective without requiring expensive linguistic resources or language‑specific grammatical rules, making it readily adaptable to other language pairs.
Future directions include replacing the SMT component with neural machine translation models to further improve translation quality, developing algorithms to detect and correct crossing alignments automatically, and extending the pipeline to additional language pairs (e.g., Russian‑English, Czech‑English). Incorporating sentence‑level embeddings such as LASER or LaBSE for semantic similarity could also enhance the filtering stage. Overall, the paper presents a practical, scalable solution for mining parallel sentences from comparable corpora, addressing a critical bottleneck in multilingual NLP resource creation.