Measuring cross-language intelligibility between Romance languages with computational tools
We present an analysis of mutual intelligibility in related languages, applied to languages in the Romance family. We introduce a novel computational metric for estimating intelligibility based on lexical similarity, using the surface and semantic similarity of related words. We use it to measure mutual intelligibility for the five main Romance languages (French, Italian, Portuguese, Spanish, and Romanian), and compare results obtained with the orthographic and phonetic forms of words, with different parallel corpora, and with different vectorial models of word meaning representation. The obtained intelligibility scores confirm intuitions related to intelligibility asymmetry across languages and significantly correlate with results of cloze tests in human experiments.
💡 Research Summary
The paper investigates cross‑language intelligibility among the five major Romance languages—French, Italian, Portuguese, Spanish, and Romanian—by introducing a novel computational metric that integrates surface, phonetic, and semantic similarity of related words. The authors begin by distinguishing between inherent intelligibility (derived from linguistic similarity) and acquired intelligibility (influenced by prior exposure), explicitly focusing on the former. They argue that traditional studies have relied on small, hand‑picked word lists, which fail to capture the full lexical diversity and contextual usage of languages. To overcome this limitation, the study leverages the RoBoCoP database, which contains exhaustive cognate and borrowing pairs across the Romance family, together with two large parallel corpora: RomCro (literary texts translated into Romance languages and Croatian) and EuroParl (European Parliament proceedings).
From RoBoCoP, the authors extract 19,222 cognate tuples (each containing at least two languages) and 46,490 borrowing pairs. These lexical items are aligned with the parallel corpora using spaCy for tokenization, stop‑word removal, accent stripping, and Snowball stemming. Frequency counts of how often each related word pair appears in aligned sentence pairs are recorded, providing a contextual weighting factor that goes beyond mere lexical overlap.
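The frequency-counting step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: whitespace tokenization stands in for the spaCy preprocessing, the function name `count_pair_occurrences` is hypothetical, and the Spanish-Italian sentences and word pairs are invented toy data rather than entries from RoBoCoP, RomCro, or EuroParl.

```python
from collections import Counter

def count_pair_occurrences(aligned_sentences, related_pairs):
    """Count how often each related word pair co-occurs in aligned sentence pairs.

    aligned_sentences: iterable of (source_sentence, target_sentence) strings.
    related_pairs: iterable of (source_word, target_word) tuples.
    Assumption: lowercasing plus whitespace splitting replaces the real
    tokenization, stop-word removal, accent stripping, and stemming.
    """
    counts = Counter()
    for src_sentence, tgt_sentence in aligned_sentences:
        src_tokens = set(src_sentence.lower().split())
        tgt_tokens = set(tgt_sentence.lower().split())
        for src_word, tgt_word in related_pairs:
            # A pair is counted once per aligned sentence pair in which
            # both of its members appear.
            if src_word in src_tokens and tgt_word in tgt_tokens:
                counts[(src_word, tgt_word)] += 1
    return counts

# Invented toy data (Spanish, Italian), for illustration only.
toy_corpus = [
    ("la noche es larga", "la notte è lunga"),
    ("la vida es bella", "la vita è bella"),
    ("una noche de verano", "una notte d'estate"),
]
toy_pairs = [("noche", "notte"), ("vida", "vita")]
pair_counts = count_pair_occurrences(toy_corpus, toy_pairs)
```

The resulting counts could then serve as per-pair weights, so that related words that actually co-occur in translated text contribute more than pairs that are rarely used in context.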
The core of the methodology is a three‑component similarity score. Orthographic similarity is measured by normalized Levenshtein distance on accent‑removed strings; phonetic similarity is measured similarly on phoneme sequences generated by the eSpeak‑NG library. Both scores lie in the interval
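As a rough illustration of the orthographic component, the sketch below computes a normalized Levenshtein similarity on accent-stripped strings. Two details are assumptions not stated in the summary: the distance is normalized by the length of the longer string, and similarity is taken as one minus that normalized distance. The function names are hypothetical. Because `levenshtein` accepts any sequence, the same function could be applied to lists of phonemes, although generating those with eSpeak-NG is not shown here.

```python
import unicodedata

def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def orthographic_similarity(word1, word2):
    """Similarity = 1 - distance / max length (normalization is an assumption)."""
    a = strip_accents(word1.lower())
    b = strip_accents(word2.lower())
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

For example, `orthographic_similarity("noche", "notte")` compares two five-letter strings that differ by two substitutions, and `strip_accents("canção")` yields `"cancao"` before any comparison is made.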