Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages on multiple tasks from the MTEB benchmark, evaluated with XLM-RoBERTa and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, fine-tuning the mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to a dataset without it, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
💡 Research Summary
The paper addresses a well‑known limitation of multilingual pretrained language models: they lack explicit cross‑lingual alignment signals, which leads to suboptimal positioning of different languages in the shared representation space. To remedy this, the authors propose a two‑step approach that leverages synthetic multi‑way parallel data and a novel contrastive learning scheme that treats every language as an anchor.
First, they construct a multi‑way parallel corpus. Starting from English sentences sampled equally from OPUS‑Wikipedia (formal text) and OPUS‑OpenSubtitles (conversational text), they select roughly 38 000 sentences from each source, yielding a balanced set of 75 822 sentences in total. Using the state‑of‑the‑art NLLB‑200 3.3B machine‑translation model, each English sentence is translated into six target languages: Chinese, Japanese, French, German, Hindi, and Spanish. For each training instance they randomly pick three of these translations, resulting in a four‑column dataset (English + three random languages). This synthetic multi‑way parallel corpus is inexpensive to produce and can be scaled to many languages as long as decent English‑to‑target NMT models exist.
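The corpus‑construction step can be sketched as follows. This is a minimal illustration, not the authors' code: `translate` is a placeholder standing in for an actual English→X call to an NMT model such as NLLB‑200 3.3B, and the helper names are invented for this sketch.

```python
import random

# The six target languages named in the paper.
TARGETS = ["zh", "ja", "fr", "de", "hi", "es"]

def translate(sentence: str, lang: str) -> str:
    # Placeholder: in practice, call an off-the-shelf NMT model here.
    return f"[{lang}] {sentence}"

def build_multiway_rows(english_sentences, k_extra=3, seed=0):
    """For each English sentence, translate into all six targets, then
    keep English plus k_extra randomly chosen translations, giving a
    four-column multi-way parallel row."""
    rng = random.Random(seed)
    rows = []
    for sent in english_sentences:
        translations = {lang: translate(sent, lang) for lang in TARGETS}
        chosen = rng.sample(TARGETS, k_extra)
        rows.append([sent] + [translations[lang] for lang in chosen])
    return rows

rows = build_multiway_rows(["The cat sat on the mat.", "It is raining."])
```

Because only the English side needs to be curated, swapping in additional target languages amounts to extending `TARGETS`, which is what makes the pipeline easy to scale.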
Second, they adapt the supervised contrastive loss (originally proposed for images) to multilingual text. In a batch, each row contains k (here k = 4) sentences that are semantic equivalents. Instead of fixing English as the sole anchor, they allow every language in the row to serve as an anchor, with the remaining k − 1 sentences acting as positive examples. The loss Lₛᵤₚ maximizes the cosine similarity between an anchor and its positives while pushing away all other sentences in the batch (treated as negatives). A regularization term R penalizes deviation from the original pretrained embeddings, controlled by a scalar λ, ensuring that the model does not drift too far from its initial knowledge.
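The loss described above can be sketched in a few lines of numpy. This is an illustrative re-implementation under stated assumptions, not the paper's code: the temperature `tau` and regularization weight `lam` are arbitrary example values, and the regularizer R is taken to be a simple squared distance to the pretrained embeddings.

```python
import numpy as np

def sup_con_loss(emb, groups, emb_init, tau=0.1, lam=0.01):
    """Multi-anchor supervised contrastive loss: every sentence in a
    multi-way row serves as an anchor, its row-mates are positives, and
    all other sentences in the batch are negatives. A regularizer keeps
    embeddings close to their pretrained values emb_init."""
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb_n @ emb_n.T / tau  # temperature-scaled cosine similarities
    n = len(emb)
    loss = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        log_denom = np.log(np.exp(sim[i][mask]).sum())
        positives = [j for j in range(n) if j != i and groups[j] == groups[i]]
        # average log-likelihood of the anchor's positives vs. the batch
        loss += -sum(sim[i, p] - log_denom for p in positives) / len(positives)
    reg = lam * np.sum((emb - emb_init) ** 2)  # stay near pretrained space
    return loss / n + reg
```

A batch here is a set of rows, with `groups[i]` marking which row sentence `i` came from; a batch whose row-mates already share a direction scores lower than a random one, which is the gradient signal that pulls translations together.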
The authors fine‑tune two widely used multilingual encoders, XLM‑RoBERTa base and multilingual BERT base (mBERT), using this loss on the multi‑way corpus. They evaluate the resulting “aligned” models on four tasks from the Massive Text Embedding Benchmark (MTEB): (1) Bitext mining (BUCC and Tatoeba), (2) Semantic Textual Similarity (STS17, STS22.v2), (3) Classification (several datasets such as Amazon Counterfactual, MassiveIntent, MTOP), and (4) Clustering (mini‑batch k‑means with V‑measure). Bitext mining and STS are evaluated zero‑shot, while classification and clustering involve task‑specific fine‑tuning.
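The clustering score used here, the V-measure, is the harmonic mean of homogeneity and completeness of the predicted clusters against gold labels. As a reference for how it is computed (benchmarks typically call a library implementation such as scikit-learn's), here is a from-scratch numpy sketch:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def v_measure(true_labels, pred_labels):
    """V-measure: harmonic mean of homogeneity (each cluster holds one
    class) and completeness (each class lands in one cluster)."""
    n = len(true_labels)
    h_c = entropy(true_labels)  # H(C), class entropy
    h_k = entropy(pred_labels)  # H(K), cluster entropy
    joint = Counter(zip(true_labels, pred_labels))
    h_ck = 0.0  # H(C|K)
    h_kc = 0.0  # H(K|C)
    for (c_lab, k_lab), n_ck in joint.items():
        p_joint = n_ck / n
        p_k = sum(v for (_, k2), v in joint.items() if k2 == k_lab) / n
        p_c = sum(v for (c2, _), v in joint.items() if c2 == c_lab) / n
        h_ck -= p_joint * np.log(p_joint / p_k)
        h_kc -= p_joint * np.log(p_joint / p_c)
    hom = 1.0 if h_c == 0 else 1 - h_ck / h_c
    comp = 1.0 if h_k == 0 else 1 - h_kc / h_k
    return 0.0 if hom + comp == 0 else 2 * hom * comp / (hom + comp)
```

The measure is invariant to cluster relabeling, so `v_measure([0,0,1,1], [1,1,0,0])` is a perfect 1.0, while an uninformative clustering scores 0.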
Results are striking. In bitext mining, the aligned XLM‑R model raises F1 scores from the low‑20s to the mid‑90s across language pairs (e.g., Chinese‑English from 21.6 to 95.0). STS Spearman correlations improve dramatically, with previously negative or near‑zero scores (e.g., English‑German –1.2) soaring to over 50. Classification accuracy gains average 28.4% relative to the baseline, and clustering V‑measure also shows consistent improvements. Importantly, languages that never appeared in the alignment data still benefit, indicating that the multi‑way supervision induces a more globally coherent multilingual space.
To isolate the contribution of multi‑way parallelism, the authors conduct controlled experiments. They compare (a) a multi‑way setup using only N/6 rows (par‑model‑A) with (b) a bilingual En‑X setup that uses all N rows but only one random target language per row (par‑model‑B). Both configurations have the same total number of sentence pairs, yet the multi‑way model consistently outperforms the bilingual one across all tasks and languages, confirming the value of multiple positives per anchor.
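The matched budget in this comparison can be verified with quick pair counting: a four-column row induces C(4,2) = 6 distinct sentence pairs, so N/6 multi-way rows contain exactly as many pairs as N bilingual rows. A small sanity-check sketch (N is the corpus size reported above; the helper name is illustrative):

```python
from itertools import combinations

def pairs_per_row(k):
    """Distinct unordered sentence pairs induced by a k-column parallel
    row, since any column may serve as anchor or positive."""
    return len(list(combinations(range(k), 2)))

N = 75822  # corpus size used in the paper
multiway_pairs = (N // 6) * pairs_per_row(4)  # N/6 rows, 4 columns each
bilingual_pairs = N * pairs_per_row(2)        # N rows, En-X pairs only
```

With the pair budget equalized this way, any remaining gap between the two models is attributable to the multi-way structure itself rather than to seeing more data.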
Further analysis focuses on Hindi as a target language. Four training regimes are explored: (i) bilingual En‑Hi, (ii) En‑Hi plus two European languages, (iii) En‑Hi plus two Asian languages, and (iv) En‑Hi plus a mix of all remaining languages. The “all‑languages” regime yields the best downstream performance, showing that adding diverse languages during alignment does not hurt and can even boost performance on a specific target language.
Another key ablation studies the effect of anchor choice. When only English is allowed to be the anchor (XLMR‑en‑anchor), performance drops markedly compared to the full multi‑anchor setting (XLMR‑aligned). Conversely, removing the requirement that English appear in every row (XLMR‑en‑ablate) also degrades performance, though less severely, underscoring that English still plays a useful role as a pivot but should not dominate the training signal.
Finally, the authors demonstrate that even models already pretrained for high‑quality sentence embeddings (e.g., mE5) can be further improved by fine‑tuning on a small multi‑way parallel dataset, achieving additional gains in bitext mining. This suggests that multi‑way cross‑lingual supervision is a lightweight, data‑efficient way to boost multilingual encoders for real‑world applications where inference cost and model size matter.
In summary, the paper makes three major contributions: (1) a practical pipeline to generate large‑scale synthetic multi‑way parallel corpora using off‑the‑shelf NMT, (2) a contrastive learning framework that treats every language as an anchor, providing richer alignment signals than traditional bilingual approaches, and (3) extensive empirical evidence that this multi‑way alignment substantially improves multilingual sentence embeddings across a wide range of downstream tasks, including for languages unseen during alignment. The work opens a path toward more universally aligned multilingual models without the need for costly human‑curated parallel data.