The evolution of languages closely resembles the evolution of haploid organisms. This similarity has recently been exploited \cite{GA,GJ} to construct language trees. The key point is the definition of a distance between all pairs of languages that is analogous to a genetic distance. Many methods have been proposed to define these distances; one of them, used by glottochronology, computes the distance from the percentage of shared ``cognates''. Cognates are words inferred to have a common historical origin, and subjective judgment plays a relevant role in their identification. Here we push the analogy with evolutionary biology further and introduce a genetic distance between language pairs by considering a renormalized Levenshtein distance between words with the same meaning and averaging over all the words contained in a Swadesh list \cite{Sw}. The subjectivity of the process is considerably reduced and reproducibility is greatly facilitated. We test our method on the Indo-European group, considering fifty different languages and the two hundred words of the Swadesh list for each of them. We obtain a tree which closely resembles the one published in \cite{GA}, with some significant differences.
Glottochronology uses the percentage of shared ``cognates'' between languages to calculate their distances. These ``genetic'' distances are logarithmically proportional to divergence times if a constant rate of lexical replacement is assumed. Cognates are words inferred to have a common historical origin; their identification is often a matter of sensitivity and personal knowledge, so subjectivity plays a relevant role. Furthermore, results are often biased, since it is easier for European or American scholars to identify cognates belonging to western languages. For instance, the Spanish word leche and the Greek word gala are cognates: leche comes from the Latin lac, with genitive form lactis, while the genitive form of gala is galactos. This identification is possible because of our historical records; it would hardly have been possible for languages of, say, Central Africa.
Our aim is to avoid this subjectivity and construct a language tree which can be easily replicated by other scholars. To reach this goal, we compare words with the same meaning belonging to different languages, considering only orthographical differences. More precisely, we use a modification of the Levenshtein distance (or edit distance) to measure the distance between pairs of words in different languages. The edit distance is defined as the minimum number of operations needed to transform one word into the other, where an operation is an insertion, deletion, or substitution of a single character. We define the genetic distance between two words as the edit distance divided by the number of characters of the longer of the two. With this definition, the distance can take any value between 0 and 1. To understand why we renormalize, consider two words that differ by a single substitution. If the words are long, they remain very similar despite the substitution; if they are short, say two characters, one substitution is enough to make them completely different. Without renormalization, the distance would be the same in both cases, regardless of word length. With the normalization factor, the genetic distance in the first case is much smaller than in the second.
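The definition above can be illustrated with a minimal Python sketch. The function names `edit_distance` and `normalized_distance` are illustrative and not taken from the original work, and the treatment of two empty strings (distance 0) is an assumption, since that case does not arise in the text.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings count as identical
    return edit_distance(a, b) / longest


# One substitution weighs more in a short word than in a long one.
print(normalized_distance("no", "ne"))              # 0.5
print(normalized_distance("mountain", "fountain"))  # 0.125
```

Both pairs differ by exactly one substitution, yet the normalized value for the two-character pair is four times larger than for the eight-character pair.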
We use the distance between word pairs, as defined above, to construct a distance between pairs of languages. The first step is to find lists of words with the same meaning for all the languages for which we intend to construct the distance. Then, for each language pair, we compute the genetic distance for each pair of words with the same meaning. Finally, the distance between the two languages is defined as the average of the distances between their word pairs. The result is a number between 0 and 1, which we claim to be the genetic distance between the two languages.
The database we use for the present analysis [5] is composed of 50 languages with 200 words for each of them. The words are chosen according to the Swadesh list. All the languages considered belong to the Indo-European group. The database is a selection and modification of the one used in [4], in which some errors have been corrected and many missing words have been added. Only the English alphabet is used in the database (26 characters plus space); languages written in a different alphabet (e.g. Greek) were already transliterated into the English one in [4]. For some of the languages in our lists [5] a few words are still missing: 43 in total, in a database of 9957 words. When a language has one or more missing words, these are simply not considered in the average that defines the distance. This implies that for some pairs of languages the number of compared words is not 200 but a number always greater than or equal to 187. This procedure introduces no bias; the only effect is that the statistics are slightly reduced.
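The averaging step, including the treatment of missing words, might be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analysis: the name `language_distance` is hypothetical, missing words are represented by `None`, and `word_distance` stands for any function implementing the normalized edit distance between two words defined earlier in the text.

```python
from typing import Callable, Optional, Sequence


def language_distance(
    words_a: Sequence[Optional[str]],
    words_b: Sequence[Optional[str]],
    word_distance: Callable[[str, str], float],
) -> float:
    """Average word distance over the meanings present in both lists.

    words_a[i] and words_b[i] must refer to the same meaning;
    None marks a missing word (an assumed encoding).
    """
    if len(words_a) != len(words_b):
        raise ValueError("word lists must cover the same meanings")
    # Keep only meanings for which both languages have a word.
    distances = [
        word_distance(a, b)
        for a, b in zip(words_a, words_b)
        if a is not None and b is not None
    ]
    if not distances:
        raise ValueError("no meanings in common")
    return sum(distances) / len(distances)
```

Dropping a meaning when either word is missing means the average is taken over fewer than 200 pairs for some language pairs, as described above, without changing how the remaining pairs are weighted.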
The result of the analysis described above is a 50 × 50 upper triangular matrix which expresses the 1225 distances between all language pairs. Indeed, our method for computing distances is a very simple operation that does not need any specific linguistic knowledge and requires minimal computing time.
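As a rough sketch of how the upper triangular matrix can be assembled, the following Python snippet enumerates each unordered pair once. The helper name `pairwise_distances` is hypothetical, and `distance` is assumed to be any symmetric function returning the distance between two languages.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, Sequence, Tuple


def pairwise_distances(
    languages: Sequence[Hashable],
    distance: Callable[[Hashable, Hashable], float],
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Distance for every unordered pair of languages.

    Each pair (a, b) appears once, with a preceding b in the input
    order, which corresponds to the strict upper triangle of the matrix.
    """
    return {(a, b): distance(a, b) for a, b in combinations(languages, 2)}


# With 50 languages there are 50 * 49 / 2 = 1225 pairs.
placeholder = lambda a, b: 0.0  # stand-in for a real language distance
print(len(pairwise_distances(range(50), placeholder)))  # 1225
```

Because the distance is symmetric and the diagonal is zero, storing only these 1225 entries loses no information.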
A phylogenetic tree can already be built from this matrix, but this would only give the topology of the tree, whereas the absolute time scale would be missing. In order to obtain this quantitative information, some hypotheses on the time evolution of genetic distances are necessary. We assume that the genetic distance between words on the one hand tends to grow due to random mutations and on the other hand may decrease, since different words may become more similar by accident or, more likely, by language borrowings. Therefore, the distance D between two given languages can be thought to evolve according to the simple differential equation
where Ḋ is the time derivative of D. The parameter α is related to the increasing