Effective vocabulary expanding of multilingual language models for extremely low-resource languages

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training. However, few address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target-language corpus. We then screen out a subset of the model's original vocabulary that is biased towards representing the source language (e.g., English), and use bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue pre-training the mPLM on the target-language corpus, starting from these initialized representations. Experimental results show that our proposed method outperforms the baseline, which uses a randomly initialized expanded vocabulary for continued pre-training, on POS tagging and NER, achieving improvements of 0.54% and 2.60%, respectively. Furthermore, our method demonstrates high robustness to the choice of training corpora, and the model's performance on the source language does not degrade after continued pre-training.


💡 Research Summary

The paper addresses a critical gap in multilingual pre‑trained language models (mPLMs): the inability to process languages that were not included in the original training vocabulary, especially ultra‑low‑resource languages with distinct scripts or grammatical structures. While prior work has focused on continued pre‑training to improve performance on already supported low‑resource languages, very few studies have tackled the problem of extending an mPLM to a completely new language. The authors propose a systematic pipeline that (1) expands the model’s vocabulary with sub‑word units derived from a target‑language corpus, (2) selectively screens out a subset of the original vocabulary that predominantly represents a high‑resource source language (typically English), and (3) initializes the embeddings of the newly added tokens using bilingual dictionaries and cross‑lingual (sub)word embeddings. This initialization replaces the common practice of random initialization, which can slow convergence and limit final performance because the embedding layer often accounts for more than half of a model’s parameters (e.g., ~52 % in mBERT).
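The first pipeline step — extending the model's vocabulary with target-language subwords it lacks — can be sketched minimally as a set difference over token lists. This is an illustration only: the token lists below are toy placeholders, and `expand_vocabulary` is a hypothetical helper, not the paper's actual tokenizer machinery.

```python
# Hypothetical sketch of vocabulary expansion: keep the model's original
# vocabulary and append only those target-language subwords it lacks.

def expand_vocabulary(model_vocab, target_subwords):
    """Return (expanded vocab, newly added tokens), preserving the order
    in which new subwords appear in the target list."""
    existing = set(model_vocab)
    new_tokens = [t for t in target_subwords if t not in existing]
    return model_vocab + new_tokens, new_tokens

# Toy vocabularies (placeholders, not real mPLM tokens)
base = ["the", "##ing", "de", "la"]
target = ["de", "##aq", "##chi", "la", "##ri"]
expanded, added = expand_vocabulary(base, target)
# added → ["##aq", "##chi", "##ri"]
```

With a real model, the analogous operation would also require resizing the embedding matrix to match the new vocabulary size; those new rows are exactly what the paper's initialization scheme fills in.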

The method proceeds as follows. First, a high‑resource source language is identified, and the intersection between its monolingual vocabulary and the multilingual model’s vocabulary is extracted to form a source‑language token set V_s. This set is assumed to be well‑represented in a dedicated monolingual PLM, ensuring high‑quality static embeddings E_s. Second, large monolingual corpora for both source and target languages are used to train static word embeddings W_s and W_t. Using a bilingual dictionary, the authors apply an orthogonal mapping (with relaxed isomorphism assumptions) to align the two embedding spaces, yielding cross‑lingual word embeddings. Third, sub‑word embeddings for both languages are computed via a fastText‑style n‑gram averaging: each sub‑word z is represented as the mean of the embeddings of its constituent n‑grams, producing vectors U_s and U_t for source and target vocabularies respectively.
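The two embedding-space steps above — aligning source and target word embeddings via a bilingual dictionary, and composing subword vectors by fastText-style n-gram averaging — can be sketched as follows. This is a simplified illustration: `procrustes_align` and `ngram_avg_embedding` are hypothetical names, the alignment shown is plain orthogonal Procrustes, and the paper's method additionally relaxes the isomorphism assumption.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal mapping W minimizing ||X @ W - Y||_F, where rows of X and Y
    are embeddings of dictionary translation pairs (source, target).
    Classic Procrustes solution: W = U @ Vt from the SVD of X.T @ Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def ngram_avg_embedding(subword, ngram_vecs, n=3):
    """fastText-style composition: the subword's vector is the mean of the
    embeddings of its character n-grams (with boundary markers < and >)."""
    marked = f"<{subword}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    known = [ngram_vecs[g] for g in grams if g in ngram_vecs]
    return np.mean(known, axis=0) if known else None
```

Applying `procrustes_align` to dictionary-pair embeddings yields the mapping into a shared space; applying `ngram_avg_embedding` to every source and target subword then produces the vectors U_s and U_t used in the next step.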

With U_s and U_t in the same vector space, a similarity matrix S is calculated using cosine similarity. For each target sub‑word, the k most similar source sub‑words are selected, and a weighted average of their embeddings (according to the similarity scores) is used to initialize the target token’s embedding E_t. This similarity‑based initialization injects semantic knowledge from the well‑trained source language directly into the new vocabulary, dramatically reducing the “cold‑start” problem.
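The similarity-based initialization described above can be sketched as below, assuming U_t and U_s already live in the same space. The function name and the plain renormalization of the top-k cosine scores are illustrative choices (the paper's exact weighting may differ), and the sketch assumes the selected similarities are positive.

```python
import numpy as np

def init_target_embeddings(U_t, U_s, E_s, k=3):
    """Initialize each new target token's embedding E_t[i] as the
    similarity-weighted average of the mPLM embeddings E_s of its
    k most cosine-similar source subwords."""
    # Row-normalize so that a dot product equals cosine similarity
    Ut = U_t / np.linalg.norm(U_t, axis=1, keepdims=True)
    Us = U_s / np.linalg.norm(U_s, axis=1, keepdims=True)
    S = Ut @ Us.T  # similarity matrix, shape (n_target, n_source)

    E_t = np.zeros((U_t.shape[0], E_s.shape[1]))
    for i, sims in enumerate(S):
        top = np.argsort(sims)[-k:]   # k nearest source subwords
        w = sims[top]
        w = w / w.sum()               # renormalize similarity scores as weights
        E_t[i] = w @ E_s[top]         # weighted average of source embeddings
    return E_t
```

For example, with k = 1 each new token simply inherits the mPLM embedding of its single nearest source subword; larger k smooths over several semantically close neighbors.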

After initialization, the expanded model undergoes continued pre‑training on the target‑language corpus. The authors evaluate the approach on two downstream tasks—part‑of‑speech (POS) tagging and named‑entity recognition (NER)—using XLM‑R as the base model. Compared with a baseline that expands the vocabulary but initializes new tokens randomly, the proposed method yields absolute improvements of 0.54 % in POS tagging and 2.60 % in NER. The larger gain on NER suggests that accurate token embeddings are especially beneficial for recognizing entity boundaries and types. Moreover, experiments with different corpus sizes and domains demonstrate that performance is robust to the choice of training data, and the original English performance remains essentially unchanged after continued pre‑training, confirming that the extension does not degrade the model’s existing multilingual capabilities.

Key contributions of the work are: (1) a computationally efficient strategy for selecting a source‑language subset of the original vocabulary, avoiding the need to align the full multilingual token set; (2) a novel initialization scheme that combines bilingual dictionaries, cross‑lingual word alignment, and sub‑word n‑gram composition to produce semantically informed embeddings for newly added tokens; (3) empirical evidence that this initialization leads to faster convergence and superior downstream performance without harming the source language.

The paper also outlines future directions. Extending the approach to incorporate multiple source languages simultaneously could further enrich the semantic initialization for target languages that share features with several high‑resource languages. Hyper‑parameter studies on the number of nearest neighbors k and the size of the expanded vocabulary could yield more fine‑grained control over trade‑offs between model size and performance. Finally, integrating the vocabulary expansion with adapter modules or parameter‑efficient fine‑tuning techniques may enable even more scalable deployment of multilingual models to the world’s ~7,000 languages.

