Unsupervised Cross-Lingual Part-of-Speech Tagging with Monolingual Corpora Only
Due to the scarcity of part-of-speech annotated data, existing studies on low-resource languages typically adopt unsupervised approaches for POS tagging. Among these, POS tag projection via word alignment transfers POS tags from a high-resource source language to a low-resource target language over parallel corpora, making it particularly suitable for low-resource settings. However, this approach relies heavily on parallel corpora, which are often unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual part-of-speech (POS) tagging framework that relies solely on monolingual corpora by leveraging an unsupervised neural machine translation (UNMT) system. The UNMT system first translates sentences from the high-resource language into the low-resource one, thereby constructing pseudo-parallel sentence pairs. We then train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique to calibrate the projected POS tags on the target side, enabling the training of a more effective POS tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish, and French) and seven target languages (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese, and Turkish). Experimental results show that our method achieves performance comparable to a baseline cross-lingual POS tagger trained on genuine parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, our multi-source projection technique further boosts performance, yielding an average improvement of 1.3% over previous methods.
💡 Research Summary
The paper addresses a fundamental limitation of current unsupervised cross‑lingual part‑of‑speech (POS) tagging methods: the reliance on parallel corpora for word‑alignment based label projection. While zero‑shot transfer using multilingual pretrained models offers a parallel‑free alternative, it suffers from large model sizes, deployment complexity, and reduced robustness to typological divergence. The authors therefore propose a fully unsupervised framework that eliminates the need for any parallel data by leveraging unsupervised neural machine translation (UNMT) to create pseudo‑parallel sentence pairs from monolingual corpora alone.
The pipeline consists of four stages. First, large monolingual corpora for both a high‑resource source language (English, German, Spanish, French) and a low‑resource target language (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese, Turkish) are used to train a UNMT system following Lample et al.’s architecture. Shared BPE vocabularies, denoising auto‑encoding, and iterative back‑translation enable the model to learn a cross‑lingual embedding space and generate translations without any bilingual supervision. Second, the source‑language sentences are translated into the target language, yielding pseudo‑parallel pairs. Third, a conventional word‑alignment tool (e.g., fast_align) aligns each pseudo‑parallel pair, and the source side’s gold POS tags from the Universal Dependencies (UD) treebanks are projected onto the target tokens. Because a single source language can produce noisy alignments and erroneous projections, the authors introduce a multi‑source projection strategy: the same target sentence is generated from multiple source languages, and the resulting projected tags are combined via majority voting or weighted averaging. This calibration mitigates sparse alignments, translation errors, and typological mismatches, effectively increasing the precision of the projected training data. Finally, the calibrated pseudo‑labeled target sentences are used to train a BiLSTM‑CRF POS tagger. The model incorporates pretrained word embeddings, sub‑word representations, and word‑cluster features, mirroring the state‑of‑the‑art architecture used in prior projection‑based work.
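The projection and calibration stages above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' released code: `project_tags` copies source POS tags onto target tokens through word-alignment pairs (the `(src_idx, tgt_idx)` format produced by tools such as fast_align), and `multi_source_vote` implements the majority-voting variant of the multi-source calibration. The placeholder tag `"X"` for unaligned tokens and the helper names are assumptions made for this sketch.

```python
from collections import Counter

def project_tags(src_tags, alignments, tgt_len, unk="X"):
    """Project source POS tags onto target tokens via word alignments.

    src_tags   : POS tags of the source sentence (gold tags, e.g. from UD).
    alignments : list of (src_idx, tgt_idx) pairs, e.g. from fast_align.
    tgt_len    : number of target tokens.
    Unaligned target tokens receive the placeholder tag `unk`.
    """
    votes = [[] for _ in range(tgt_len)]
    for s, t in alignments:
        votes[t].append(src_tags[s])
    # Resolve many-to-one alignments by keeping the most frequent tag.
    return [Counter(v).most_common(1)[0][0] if v else unk for v in votes]

def multi_source_vote(projections, unk="X"):
    """Calibrate projected tags by majority voting across source languages.

    projections : tag sequences for the same target sentence, one per
                  source language; all sequences have equal length.
    Placeholder tags are ignored unless no source provides a real tag.
    """
    calibrated = []
    for position_tags in zip(*projections):
        counts = Counter(t for t in position_tags if t != unk)
        calibrated.append(counts.most_common(1)[0][0] if counts else unk)
    return calibrated
```

For example, projecting `["DET", "NOUN"]` through alignments `[(0, 0), (1, 1)]` onto a three-token target sentence leaves the third token tagged `"X"`; voting over projections from several source languages then overrides noisy single-source tags wherever a majority of sources agree.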
The authors evaluate the approach on 28 language pairs, covering a diverse set of typologically distinct languages. Results on the UD test sets show that the proposed method achieves POS tagging accuracies exceeding 60 % for all target languages, comparable to the baseline that uses genuine parallel corpora. Notably, for languages typologically close to the source (Portuguese, Indonesian, Afrikaans) the system outperforms the baseline by 2.6–3.3 percentage points, reaching accuracies of 92.0 %, 87.1 %, and 89.5 % respectively. The multi‑source projection further improves performance across the board, delivering an average gain of 1.3 % over previous projection‑only methods and up to 0.6 % over the single‑source variant of the authors’ own system.
Key contributions include: (1) Demonstrating that UNMT can serve as a reliable source of pseudo‑parallel data for cross‑lingual POS projection, thereby removing the parallel‑corpus bottleneck. (2) Proposing a calibrated multi‑source projection technique that effectively reduces noise in projected tags. (3) Providing extensive empirical evidence that the fully unsupervised pipeline matches or exceeds the performance of traditional projection methods that depend on parallel data.
The work opens several avenues for future research. More sophisticated alignment models (e.g., neural alignment) could further reduce projection errors. Incorporating language‑agnostic encoders or multilingual UNMT models might streamline the system for many languages simultaneously. Finally, extending the calibration idea to other sequence labeling tasks (e.g., named‑entity recognition, morphological tagging) could broaden the impact of this parallel‑free paradigm.