Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models
Despite major advances in multilingual modeling, large quality disparities persist across languages. Besides the obvious impact of uneven training resources, typological properties have also been proposed to determine the intrinsic difficulty of modeling a language. The existing evidence, however, is mostly based on small monolingual language models or bilingual translation models trained from scratch. We expand on this line of work by analyzing two large pre-trained multilingual translation models, NLLB-200 and Tower+, which are state-of-the-art representatives of encoder-decoder and decoder-only machine translation, respectively. Based on a broad set of languages, we find that target language typology drives translation quality of both models, even after controlling for more trivial factors, such as data resourcedness and writing script. Additionally, languages with certain typological properties benefit more from a wider search of the output space, suggesting that such languages could profit from alternative decoding strategies beyond the standard left-to-right beam search. To facilitate further research in this area, we release a set of fine-grained typological properties for 212 languages of the FLORES+ MT evaluation benchmark.
💡 Research Summary
This paper investigates why state‑of‑the‑art multilingual machine‑translation (MT) systems still exhibit large quality gaps across languages, even when the amount of training data and script similarity are taken into account. The authors focus on two representative large pre‑trained models: NLLB‑200, a 3.3 B‑parameter encoder‑decoder transformer that supports 202 languages, and Tower+, a 9 B‑parameter decoder‑only large language model (LLM) that has been post‑trained for translation. Both models are evaluated on the FLORES+ benchmark, which provides English source sentences and human translations into more than 200 target languages.
A total of 124 target languages are selected from the 212 languages covered by FLORES+, ensuring a wide typological spread and compatibility with the two models. In addition to English as the source language, six other source languages (Arabic, Italian, Dutch, Turkish, Ukrainian, Vietnamese) are used to verify that the findings are not source‑language specific. Translations are generated with beam search using four beam widths (k = 1, 3, 5, 7). The primary quality metric is chrF++, an F‑score over character n‑grams (augmented with word unigrams and bigrams) that is robust for morphologically rich languages; BLEU and COMET are reported in an appendix and show the same trends.
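To make the metric concrete, here is a minimal pure‑Python sketch of the character n‑gram F‑score at the core of chrF. It omits chrF++'s word unigram/bigram component and sacreBLEU's exact smoothing and averaging details, so it is an illustration rather than the evaluation code used in the paper:

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams; chrF strips whitespace before extraction."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def chrf_score(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average recall-weighted F_beta over character n-grams of order 1..max_n."""
    f_scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            f_scores.append(0.0)
            continue
        # beta = 2 weights recall twice as heavily as precision, as in chrF
        f_scores.append((1 + beta ** 2) * precision * recall
                        / (beta ** 2 * precision + recall))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0
```

For actual benchmarking, the reference implementation in sacreBLEU (`CHRF` with `word_order=2` for chrF++) should be used instead.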
To isolate the effect of language typology, the authors collect a rich set of linguistic features. “Resource‑ness” is approximated by the size of each language’s data in GlotCC, a CommonCrawl‑derived corpus, serving as a proxy for the amount of web text the models likely saw during pre‑training. Six pre‑computed typological distance measures (genetic, geographic, syntactic, inventory, phonological, overall) are taken from the URIEL database, and a binary “same_script” feature captures whether source and target share a writing system. Morphological complexity is represented by ten continuous metrics derived from Universal Dependencies annotations: Information in Word Structure, Word Entropy, Lemma Entropy, Mean Size of Paradigm, Inflectional Synthesis, Morphological Feature Entropy, Inflection Accuracy, and three variants of Type‑Token Ratio (TTR, Root‑TTR, Moving‑Average TTR).
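Several of these lexical‑diversity metrics are straightforward to compute from a token list. The sketch below shows standard formulations of TTR, Root‑TTR, Moving‑Average TTR, and word entropy; it illustrates the quantities involved, not the paper's exact UD‑based extraction pipeline (whose window sizes and preprocessing are assumptions here):

```python
import math
from collections import Counter


def ttr(tokens: list[str]) -> float:
    """Type-token ratio: distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def root_ttr(tokens: list[str]) -> float:
    """Guiraud's root TTR, which dampens TTR's sensitivity to text length."""
    return len(set(tokens)) / math.sqrt(len(tokens))


def mattr(tokens: list[str], window: int = 500) -> float:
    """Moving-average TTR: mean TTR over all sliding windows of fixed size."""
    if len(tokens) <= window:
        return ttr(tokens)
    windows = (tokens[i:i + window] for i in range(len(tokens) - window + 1))
    scores = [ttr(w) for w in windows]
    return sum(scores) / len(scores)


def word_entropy(tokens: list[str]) -> float:
    """Shannon entropy (in bits) of the word-form distribution."""
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Intuitively, a morphologically rich language realizes many distinct surface forms per lemma, pushing TTR and word entropy up relative to an analytic language with the same content.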
Statistical analysis proceeds in two stages. First, mixed‑effects regression models control for resource‑ness, script similarity, and the six typological distances, establishing a baseline. Then, each morphological or word‑order feature is added to assess its unique contribution to translation quality (chrF++). The results are consistent across both models: higher morphological feature entropy and lower inflection accuracy—indicators of greater morphological complexity—are significantly associated with lower chrF++ scores. Likewise, languages with higher word‑order flexibility (i.e., less rigid SVO order) tend to receive poorer translations. These effects persist after accounting for all other covariates, demonstrating that intrinsic typological properties, not just data quantity, drive performance differences.
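The two‑stage design can be sketched with `statsmodels` on synthetic data. All column names below (`chrf`, `log_resource`, `same_script`, `syn_distance`, `feat_entropy`, `src_lang`) are hypothetical stand‑ins, and the random‑effect grouping is an assumption; the paper's exact model specification is not reproduced here:

```python
# Sketch of the two-stage mixed-effects analysis, on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "src_lang": rng.choice(list("ABCDEFG"), n),   # random effect: source language
    "log_resource": rng.normal(0, 1, n),          # proxy for data resourcedness
    "same_script": rng.integers(0, 2, n),         # shared writing system (0/1)
    "syn_distance": rng.normal(0, 1, n),          # e.g. URIEL syntactic distance
    "feat_entropy": rng.normal(0, 1, n),          # morphological feature entropy
})
# Simulated outcome: complexity (feat_entropy) hurts quality, resources help.
df["chrf"] = (50 + 5 * df.log_resource + 3 * df.same_script
              - 2 * df.syn_distance - 4 * df.feat_entropy + rng.normal(0, 2, n))

# Stage 1: baseline controlling for resources, script, and typological distance.
baseline = smf.mixedlm("chrf ~ log_resource + same_script + syn_distance",
                       df, groups=df["src_lang"]).fit()

# Stage 2: add one typological feature and inspect its unique contribution.
extended = smf.mixedlm("chrf ~ log_resource + same_script + syn_distance + feat_entropy",
                       df, groups=df["src_lang"]).fit()
print(extended.params["feat_entropy"])  # negative under this simulation
```

The stage‑2 coefficient on the added feature, with everything else held in the model, is what licenses the claim that typology contributes beyond data quantity and script.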
The decoding experiments reveal an interaction between typology and optimal beam size. For most languages, beam widths of 3 or 5 yield the best trade‑off between quality and efficiency. However, languages that are morphologically complex or have free word order continue to improve when the beam is widened to 7, suggesting that a broader search space helps the model resolve ambiguous morphological endings and reorderings. Conversely, languages that are morphologically simple and share the same script with English achieve near‑optimal scores already at beam = 1 or 3, indicating that aggressive beam expansion offers diminishing returns.
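Why a wider beam helps ambiguous continuations can be shown with a toy left‑to‑right beam search. This is a generic sketch over an arbitrary step‑wise scoring function (the transition table is invented for illustration), not either model's actual decoder:

```python
import math


def beam_search(step_fn, start, beam_width, max_len):
    """Generic left-to-right beam search.

    step_fn(prefix) returns a list of (token, log_prob) continuations;
    an empty list marks the prefix as complete. At each step, only the
    beam_width highest-scoring prefixes survive.
    """
    beams = [(start, 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            steps = step_fn(prefix)
            if not steps:  # finished hypothesis: carry it forward unchanged
                candidates.append((prefix, score))
                continue
            for token, logp in steps:
                candidates.append((prefix + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]


# Toy case where the greedy first step is suboptimal: "b" looks worse
# initially but leads to a much better continuation -- analogous to an
# ambiguous morphological ending that only later context disambiguates.
def toy_step(prefix):
    table = {
        (): [("a", math.log(0.6)), ("b", math.log(0.4))],
        ("a",): [("x", math.log(0.3))],
        ("b",): [("x", math.log(0.9))],
    }
    return table.get(tuple(prefix), [])


greedy = beam_search(toy_step, [], beam_width=1, max_len=2)  # commits to "a"
wide = beam_search(toy_step, [], beam_width=2, max_len=2)    # recovers "b x"
```

With beam width 1 the search commits to the locally best token and ends with total probability 0.6 × 0.3 = 0.18; width 2 keeps the alternative alive and finds the globally better 0.4 × 0.9 = 0.36 path.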
Tower+ shows trends broadly comparable to NLLB‑200’s, but with slightly attenuated sensitivity to typological factors, likely because its decoder‑only architecture relies more heavily on the underlying LLM’s multilingual knowledge. Nevertheless, the same morphological and word‑order features remain statistically significant predictors of translation difficulty for Tower+ as well.
A key contribution of the work is the release of a fine‑grained, continuous typological feature set for all 212 FLORES+ languages. This resource enables future research on language‑specific decoding strategies, adaptive beam sizing, or even model‑level adaptations (e.g., specialized adapters for high‑complexity languages).
In summary, the paper provides strong empirical evidence that even the most powerful multilingual MT systems are not immune to the intrinsic challenges posed by language typology. Morphological richness and flexible syntax systematically lower translation quality, and these languages benefit from wider beam search during inference. The findings argue for a shift away from a one‑size‑fits‑all decoding policy toward language‑aware strategies that consider typological difficulty, paving the way for more equitable MT performance across the world’s linguistic diversity.