A simple branching model that reproduces language family and language population distributions
📝 Abstract
Human history leaves fingerprints in human languages. Little is known about language evolution, and its study is of great importance. Here, we construct a simple stochastic model and compare its results to statistical data on real languages. The model is based on the recent finding that language changes occur independently of population size. We find agreement with the data under the additional assumption that languages may be distinguished by having at least one among a finite, small number of different features. This finite set is also used to define the distance between two languages, in line with the linguistic tradition since Swadesh.
📄 Content
arXiv:0811.2301v1 [physics.soc-ph] 14 Nov 2008
Veit Schwämmle (1, 2) and Paulo Murilo Castro de Oliveira (1, 3)
(1) Laboratoire PMMH, École Supérieure de Physique et de Chimie Industrielles, 10 rue Vauquelin, F-75231 Paris, France.
(2) Centro Brasileiro de Pesquisas Físicas, Rua Xavier Sigaud 150, 22290-180 Rio de Janeiro, Brazil.
(3) Instituto de Física, Universidade Federal Fluminense; Av. Litorânea s/n, Boa Viagem, Niterói 24210-340, RJ, Brazil.

I. INTRODUCTION

The existence of around 6,000 languages on Earth can be explained by their continuous modification. They arise from a tree-like evolution, originating from one or a few proto-languages several thousand years ago. Language evolution is thus a result of the particular history of humankind and its migration patterns. Human genetic evolution is a result of the same history. The parallel between language and genetic evolution has been explored by many researchers since Cavalli-Sforza (for a review, see [1]), and is a valuable tool for uncovering unknown details of the human past.
The complex structure of a language changes slowly in time, for instance as new words appear and are adopted by a majority of speakers. Although this structure is built from individually difficult-to-predict "microscopic" events, such as the appearance of a new word, several universal patterns have been reported. Statistical data analysis has revealed universal laws both in actual language structure and in macroscopic observables such as language populations. The histogram of the latter, which counts languages according to their number of speakers, roughly shows a lognormal shape [2] and will be one of the central points of this paper.

In addition to the present situation of currently spoken languages, we are also interested in their historical course, which results from the branching of ancestor languages. The direct comparison of selected word sets of different languages can be used to estimate their historical distance, an idea pioneered by Morris Swadesh half a century ago [3]. For instance, the measurement of Levenshtein distances between two languages gives an idea of the time at which their first common ancestor language existed. The evaluation of the data from this analysis showed that historical distances accumulate at a certain age independently of the population size [4], suggesting that change occurs at the same rate for all languages [5]. This result challenges the often proposed direct analogy between biological and language evolution and demands different approaches: the global mutation rate of biological species, by contrast, depends on their population size, leading for instance to faster changes of the genetic pool in smaller populations.

The statistical analysis of highly complex systems such as opinion dynamics and stock exchanges, among others, has shown that their patterns can be reproduced by simple agent-based models. The simplification of most of the complex low-level mechanisms to random processes has been shown to be a valid approach.
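To make the word-list comparison mentioned above concrete, the following is a minimal sketch of a Levenshtein distance averaged over aligned word pairs. The function names, the normalization by the longer word's length, and the toy English/German word lists are illustrative assumptions; they are not taken from the paper or from the analyses cited in [4, 5].

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def mean_normalized_distance(words_a, words_b):
    """Average length-normalized distance over aligned, non-empty word pairs.
    Assumption: dividing by the longer word's length is one possible
    normalization, chosen here only for illustration."""
    pairs = list(zip(words_a, words_b))
    total = sum(levenshtein(x, y) / max(len(x), len(y)) for x, y in pairs)
    return total / len(pairs)


# Hypothetical toy word lists (plain orthography, not phonetic transcriptions).
english = ["water", "night", "mother"]
german = ["wasser", "nacht", "mutter"]
distance = mean_normalized_distance(english, german)
```

A smaller average distance suggests a more recent common ancestor, but converting such a number into an actual age requires a calibration that this sketch does not attempt.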
We follow this idea and construct a simple model that uses the main properties found for the evolution of languages. Our stochastic model is based on the following assumptions: (i) Languages evolve in a tree-like structure. (ii) The structure is generated by a probabilistic change rate that does not depend on a language's number of speakers. (iii) The space of possible languages is finite; the same language can be reached via different evolutionary paths. (iv) Each population grows exponentially in time.

Motivated by the increasing popularity of simulating language evolution and competition, several analytical and computational models have concentrated on reproducing the histogram of language sizes. For instance, its lognormal-like shape may be explained by simply assuming independent language population growth. These independent processes naturally lead to a lognormal distribution, as a consequence of the central limit theorem applied to products of random variables [6]. However, the distribution of real languages deviates from lognormal behavior for languages spoken by very few people. Agent-based models
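The central-limit argument for multiplied random variables can be checked numerically with a short sketch. This is not the branching model of the paper: it only assumes that each population is multiplied by an independent random factor at every time step, and the factor range, the number of steps, and all names below are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def simulate_sizes(n_languages=2000, n_steps=200, seed=0):
    """Grow each population independently by multiplying it with a random
    factor at every step (assumed range 0.9 to 1.15, purely illustrative)."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_languages):
        size = 1.0
        for _ in range(n_steps):
            size *= rng.uniform(0.9, 1.15)
        sizes.append(size)
    return sizes


def skewness(values):
    """Sample skewness: close to 0 for a symmetric, Gaussian-like sample."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return sum(((v - mean) / spread) ** 3 for v in values) / len(values)


sizes = simulate_sizes()
log_sizes = [math.log(s) for s in sizes]

# The log of a product is a sum of independent terms, so log_sizes should
# look roughly Gaussian while the raw sizes stay strongly right-skewed.
raw_skew = skewness(sizes)
log_skew = skewness(log_sizes)
```

Comparing `raw_skew` with `log_skew` shows the expected pattern: the logarithms of the sizes are nearly symmetric, which is the signature of an approximately lognormal size distribution.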