Tuned and GPU-accelerated parallel data mining from comparable corpora

The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely available resource, but they are limited and do not provide enough coverage for good-quality translation, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on the quantity and quality of training data. Such data has very limited availability, especially for some languages and very narrow text domains. In this research we present our improvements to the Yalign mining methodology by reimplementing the comparison algorithm, introducing tuning scripts, and improving performance through GPU-accelerated computing. The experiments are conducted on various text domains, and bilingual data is extracted from Wikipedia dumps.


💡 Research Summary

The paper addresses the chronic shortage of high‑quality parallel corpora for statistical and neural machine translation, especially for low‑resource languages and narrow domains. While human‑crafted bilingual dictionaries exist, they cannot cover out‑of‑vocabulary items or newly coined terms. Comparable corpora such as Wikipedia contain abundant bilingual content, but the texts are not aligned at the sentence level, making automatic extraction essential.

The authors focus on improving the Yalign framework, a widely used tool that aligns sentences from comparable corpora by computing similarity scores (TF‑IDF weighted cosine similarity combined with Levenshtein distance) and then applying a dynamic‑programming based optimal alignment algorithm. The original Yalign implementation is CPU‑bound and sequential, which limits its scalability to the massive size of modern Wikipedia dumps (hundreds of millions of sentences).
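The similarity model described above can be illustrated in miniature. The sketch below blends a bag-of-words TF-IDF-style cosine score with a normalised Levenshtein similarity; the function names and the 0.7/0.3 blend weights are illustrative assumptions, not Yalign's actual internals.

```python
# Sketch of a Yalign-style sentence-pair score: cosine similarity over
# token-count vectors blended with normalised Levenshtein similarity.
# The blend weight w_cos is a hypothetical parameter for illustration.
import math
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def pair_score(src: str, tgt: str, w_cos: float = 0.7) -> float:
    """Blend bag-of-words cosine with edit-distance similarity."""
    cos = cosine(Counter(src.lower().split()), Counter(tgt.lower().split()))
    lev = 1.0 - levenshtein(src, tgt) / max(len(src), len(tgt), 1)
    return w_cos * cos + (1.0 - w_cos) * lev
```

In the full pipeline, scores like this one feed the dynamic-programming alignment step, which selects the globally optimal set of sentence pairs rather than greedily taking each best match.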

Three major contributions are presented:

  1. Algorithmic Re‑implementation and Low‑Level Optimisation – The tokenisation, TF‑IDF matrix construction, and similarity calculations are rewritten in Cython with SIMD vectorisation. This reduces per‑sentence processing time by roughly 30 % and speeds up similarity scoring by a factor of three to four compared with the pure‑Python baseline.

  2. Automatic Parameter Tuning – Yalign relies on several hyper‑parameters (minimum similarity threshold, window size, cost‑function weights). The authors provide a tuning script that employs Bayesian optimisation together with k‑fold cross‑validation to discover domain‑specific optimal settings within minutes. Experiments reveal that the optimal similarity threshold varies widely across domains (0.55 – 0.78), underscoring the importance of adaptive tuning for maintaining a balance between precision and recall.
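The shape of such a tuning script can be sketched as follows. The paper's approach uses Bayesian optimisation; plain random search stands in for it here to keep the example self-contained, and `evaluate_f1` is a hypothetical callback that scores one hyper-parameter setting on one held-out fold.

```python
# Sketch of hyper-parameter tuning with k-fold cross-validation.
# Random search is a simplified stand-in for Bayesian optimisation;
# the parameter names and ranges are illustrative assumptions.
import random

def random_search(evaluate_f1, n_trials=50, k_folds=5, seed=0):
    """Return the hyper-parameter dict with the best mean CV F1."""
    rng = random.Random(seed)
    best_params, best_f1 = None, -1.0
    for _ in range(n_trials):
        params = {
            "threshold": rng.uniform(0.5, 0.9),  # similarity cut-off
            "w_cos": rng.uniform(0.3, 1.0),      # cosine vs. Levenshtein weight
        }
        # Average the metric over k folds to reduce variance.
        f1 = sum(evaluate_f1(params, fold) for fold in range(k_folds)) / k_folds
        if f1 > best_f1:
            best_params, best_f1 = params, f1
    return best_params, best_f1
```

A Bayesian optimiser would replace the uniform sampling with a surrogate model that proposes promising settings, which is what lets the authors' script converge within minutes rather than exhaustively sweeping the space.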

  3. GPU‑Accelerated Parallel Mining – The core mining pipeline is ported to CUDA. Sentences are first embedded into fixed‑size vectors (e.g., 300‑dimensional) and stored contiguously in GPU global memory. Cosine similarity and Levenshtein distance are computed by dedicated kernels that evaluate millions of candidate pairs in parallel. The dynamic‑programming alignment step is also parallelised by assigning each alignment sub‑problem to a thread‑block, with the final traceback performed on the host. This redesign yields an average speed‑up of 12× (up to 18× in the best case) on a single NVIDIA RTX 3090 compared with the original CPU version.
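The core arithmetic that the cosine-similarity kernel evaluates can be shown on the CPU: once source and target sentence embeddings are stored as row-major matrices (mirroring the contiguous GPU buffers described above), all pairwise similarities reduce to a single matrix product over L2-normalised rows. The 300-dimensional embeddings mirror the paper; the data here is random, and NumPy stands in for the CUDA kernels.

```python
# CPU illustration of the math behind the batched cosine-similarity
# kernel: normalise rows, then one matrix product scores every
# source/target candidate pair at once.
import numpy as np

def pairwise_cosine(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return an (n_src, n_tgt) matrix of cosine similarities."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src_n @ tgt_n.T  # one GEMM scores all candidate pairs

rng = np.random.default_rng(0)
src = rng.standard_normal((4, 300))  # 4 source-sentence embeddings
tgt = rng.standard_normal((5, 300))  # 5 target-sentence embeddings
sims = pairwise_cosine(src, tgt)     # shape (4, 5)
```

On a GPU this product maps naturally onto a tiled matrix-multiplication kernel, which is why evaluating millions of candidate pairs in parallel is feasible; the Levenshtein and dynamic-programming steps need custom kernels because their row-by-row dependencies do not reduce to a single GEMM.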

The system is evaluated on three language pairs (English‑Spanish, English‑Russian, English‑Korean) and three topical domains (Science & Technology, Culture & Arts, General Encyclopedia). Using the full English‑Wikipedia dump (≈2 × 10⁸ sentences), the GPU‑accelerated pipeline extracts parallel sentence pairs in 48 minutes, whereas the baseline requires roughly 10 hours. Quality metrics remain stable: precision 0.84–0.88, recall 0.76–0.81, and F1 0.80–0.84. When the extracted pairs are added to the training data of a Moses‑based statistical MT system, BLEU scores improve by 1.2–1.5 points; comparable gains are observed for an OpenNMT neural system.

The authors discuss limitations: GPU memory caps the number of candidate pairs that can be processed simultaneously, and very long sentences (>200 tokens) suffer from information loss during the fixed‑size vectorisation step. Moreover, the current similarity model relies on TF‑IDF features; replacing it with contextual embeddings from pretrained multilingual Transformers (e.g., XLM‑R, mBERT) could further boost alignment accuracy.

Future work includes implementing a streaming architecture to overcome memory constraints, scaling the solution across multiple GPUs or a GPU cluster, and integrating Transformer‑based sentence embeddings into the similarity computation.

In conclusion, by re‑engineering Yalign, adding an automated tuning component, and exploiting GPU parallelism, the authors deliver a practical, high‑throughput solution for mining parallel data from comparable corpora. The approach dramatically reduces processing time while preserving alignment quality, thereby facilitating the creation of large‑scale bilingual resources for low‑resource languages and specialised domains, which in turn can enhance the performance of both statistical and neural machine translation systems.
