We demonstrate that large texts, representing human (English, Russian, Ukrainian) and artificial (C++, Java) languages, display quantitative patterns characterized by Benford-like and Zipf laws. Under Zipf’s law, the frequency of a word is inversely proportional to its rank, whereas the total counts of individual words in a text generate the uneven, Benford-like distribution of leading digits. Excluding the most frequent words substantially improves the agreement of the actual textual data with the Zipfian distribution, whereas the Benford-like distribution of leading digits (arising from the total counts of individual words) is insensitive to the same elimination procedure. The calculated moduli of the slopes of the double-logarithmic rank-frequency plots are markedly larger for the artificial languages (C++, Java) than for the human ones.
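To make the leading-digit statistic concrete, the sketch below is a minimal illustration, not the authors’ actual pipeline; the whitespace tokenization and the file name corpus.txt are assumptions. It counts the occurrences of each distinct word in a text and compares the distribution of the first digits of those counts with Benford’s law, P(d) = log10(1 + 1/d):

```python
import math
from collections import Counter

def leading_digit_distribution(text):
    """Empirical frequencies of the leading digits 1-9 over per-word totals."""
    counts = Counter(text.lower().split())        # total occurrences of each word
    digits = Counter(int(str(n)[0]) for n in counts.values())
    total = sum(digits.values())
    return {d: digits.get(d, 0) / total for d in range(1, 10)}

def benford(d):
    """Benford's law: P(d) = log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

# Hypothetical usage on any large text file (corpus.txt is a placeholder):
text = open("corpus.txt", encoding="utf-8").read()
empirical = leading_digit_distribution(text)
for d in range(1, 10):
    print(f"{d}: empirical {empirical[d]:.3f}   Benford {benford(d):.3f}")
```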
Human languages demonstrate hidden phonetic, semantic, and historical patterns, which may be revealed by computer analysis (Bybee & Hopper, 2001; Johnson, 2008; Gerlach & Altmann, 2013; Aiden & Michel, 2013; Shulzinger & Bormashenko, 2017). The origin of these patterns remains mysterious; however, counter to intuition, significant pattern-generating evolution of linguistic behavior has been observed in computer simulations (Kirby, 2001). From an initially unstructured communication system (a protolanguage), a fully compositional, syntactic meaning-string mapping emerged (Kirby, 2001).
One of the remarkable features of natural languages is Zipf’s law (Zipf, 1965a; Pustet, 2004).
Zipf’s law states that in all natural languages the frequency of a word is inversely proportional to its rank (Zipf, 1965a; Manin, 2009; Fontanari & Perlovsky, 2004; Fontanari & Perlovsky, 2009; Mehri & Lashkari, 2016), namely Eq. 1 takes place:

$$f(r) = \frac{C}{r^{\alpha}} \qquad (1)$$

where $C$ is a constant and $\alpha \cong 1$. Zipf’s distribution was observed for languages as diverse as English, Latin, German, Gothic, Chinese (Mandarin), Lakota, Nootka, and Plains Cree (Pustet, 2004). Baek et al. draw attention to the fact that the value of $\alpha$ is not necessarily close to unity, and a general power law of the form of Eq. 1 is expected for languages (Baek et al., 2011).
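Eq. 1 predicts a straight line of slope $-\alpha$ on a double-logarithmic rank-frequency plot; the modulus of this slope is the quantity compared across languages in the abstract. The following is a minimal sketch of the estimate, assuming plain whitespace tokenization and an ordinary least-squares fit standing in for whatever fitting procedure a given study uses:

```python
import math
from collections import Counter

def zipf_alpha(text):
    """Least-squares estimate of alpha in f(r) = C / r**alpha
    from the log-log rank-frequency data of a text."""
    freqs = sorted(Counter(text.lower().split()).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]   # log rank
    ys = [math.log(f) for f in freqs]                      # log frequency
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return -slope          # Zipf's law predicts a value close to 1
```

The elimination procedure mentioned in the abstract corresponds, in this sketch, to dropping the first few entries of freqs (the most frequent words) before the fit.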
Zipf suggested that the revealed frequency distribution arises from the “least effort” principle governing the transfer of information from speaker to listener (Zipf, 1965b). This hypothesis is supported by recent investigations (Ferrer i Cancho & Solé, 2003; Ferrer i Cancho, 2005). It was also suggested that the recurrence of Zipf’s law in human languages could originate from the pressure for easy and fast communication (Ferrer-i-Cancho, 2016). In contrast, it was suggested that Zipf’s distribution arises from a Poisson process (Eliazar, 2016; Reed, 2001).
However, both the roots and the significance of Zipf’s law in linguistics remain unclear (Fontanari & Perlovsky, 2004). On the one hand, texts produced by the random emission of symbols and spaces, in which words of the same length are equiprobable, also generate word-frequency distributions that follow a generalized Zipf’s law, also called the Zipf-Mandelbrot law (Mandelbrot, 1982; Manin, 2009). It was also noted that Zipf’s law is not valid for the most common words (Tsonis et al., 1997a). Montemurro noted that the Zipf-Mandelbrot law can only describe the statistical behavior of a rather restricted fraction of the total number of words contained in a given corpus (Montemurro, 2001). Significant deviations between the predicted hyperbolic and the real frequencies of different words of a text were reported (Ferrer-i-Cancho & Solé, 2001). Moreover, it was demonstrated that various random sequences of characters (random texts) reproduce Zipf’s law (Ferrer-i-Cancho & Elvevåg, 2010), as the sketch below illustrates.
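The random-text observation is easy to reproduce: emit letters and spaces at random (the “typing monkeys” setup), split on the spaces, and fit the rank-frequency slope. The sketch below reuses the zipf_alpha helper defined above; the alphabet size and the space probability are arbitrary illustrative choices:

```python
import random
import string

def random_text(n_chars=200_000, alphabet=string.ascii_lowercase,
                space_prob=0.2, seed=0):
    """Monkey-typed text: each emitted character is a space with
    probability space_prob, otherwise a uniformly random letter."""
    rng = random.Random(seed)
    return "".join(" " if rng.random() < space_prob else rng.choice(alphabet)
                   for _ in range(n_chars))

# Rank-frequency of the random 'words' is close to a power law:
print(zipf_alpha(random_text()))
```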
On the other hand, quantitative analyses of the evolution of syntactic communication in comparison with animal communication (Nowak et al., 2000) and of the emergence of irregularities in language (Kirby, 2001) latently assume that human lexicons follow the Zipf distribution of word frequencies (Fontanari & Perlovsky, 2009). Zipf himself related the structuring of languages to a “principle of least effort”, resembling the principle of least action in physics (Zipf, 1965b, p. 20; Lanczos, 1970). It is also noteworthy that Zipf-like distributions have been revealed in a variety of statistical problems, including the distribution of firm sizes (Stanley et al., 1995), the distribution of open spaces in urban space (Volchenkov & Blanchard, 2008), and genomic data (Mantegna et al., 1994; Tsonis & Tsonis, 2002). Mandelbrot suggested that Zipf’s hyperbolic frequency distributions are about as prevalent in the social sciences as Gaussian distributions are in the natural sciences (Mandelbrot, 1982). It was also demonstrated that in certain cases Zipf ranking appears as the ordering by growing Kolmogorov complexity (Manin, 2014).
As to the roots of Zipf’s law, Baek et al. suggested that it arises from the phenomenon of random group formation (Baek et al., 2011). The elements of a group can be the citizens of a country and the groups family names, or the elements can be all the words making up a novel and the groups the unique words among them. Thus, it was suggested that Zipf’s law has a purely statistical origin (Baek et al., 2011). It was also conjectured that Zipf’s law is related to the focal expression of a generalized thermodynamic structure (Altamirano & Robledo, 2011). This structure is obtained from a deformed type of statistical mechanics that arises when configurational phase space is incompletely visited in a strict way; the restriction is that the accessible fraction of this space has fractal properties (Altamirano & Robledo, 2011). It was also assumed that Zipf’s law arises from the fractal structure of the language itself (Mehri & Lashkari, 2016).