Background: Zipf's law and Heaps' law are observed in disparate complex systems. Of particular interest, these two laws often appear together. Many theoretical models and analyses have been put forward to understand their co-occurrence in real systems, yet a clear picture of their relation is still lacking. Methodology/Principal Findings: We show that Heaps' law can be considered a derivative phenomenon if the system obeys Zipf's law. Furthermore, we refine the known approximate solution for the Heaps' exponent given the Zipf's exponent. We show that this approximate solution is in fact an asymptotic solution for infinite systems, while in finite-size systems the Heaps' exponent is sensitive to the system size. Extensive empirical analysis of tens of disparate systems demonstrates that our refined results better capture the relation between the Zipf's and Heaps' exponents. Conclusions/Significance: The present analysis provides a clear picture of the relation between Zipf's law and Heaps' law without the help of any specific stochastic model; namely, Heaps' law is indeed a phenomenon derived from Zipf's law. The presented numerical method gives a considerably better estimate of the Heaps' exponent given the Zipf's exponent and the system size. Our analysis provides insights into and implications for real complex systems; for example, one naturally obtains a better explanation of the accelerated growth of scale-free networks.
Giant strides in Complexity Sciences have been the direct outcome of efforts to uncover the universal laws that govern disparate systems. Zipf's law [1] and Heaps' law [2] are two representative examples. In the 1940s, Zipf found a scaling law in the distribution of word frequencies: ranking all the words in descending order of occurrence frequency and denoting by $z(r)$ the frequency of the word with rank $r$, Zipf's law reads $z(r) = z_{\max} \cdot r^{-\alpha}$, where $z_{\max}$ is the maximal frequency and $\alpha$ is the so-called Zipf's exponent. This power-law frequency-rank relation indicates a power-law probability distribution of the frequency itself, $p(z) \sim z^{-\beta}$ with $\beta = 1 + 1/\alpha$ (see Materials and Methods). As a signature of complex systems, Zipf's law is observed everywhere [3], including the distributions of firm sizes [4], wealth and income [5], paper citations [6], gene expressions [7], sizes of blackouts [8], family names [9], city sizes [10], personal donations [11], chess openings [12], traffic loads caused by YouTube videos [13], and so on. Accordingly, many mechanisms have been put forward to explain the emergence of Zipf's law [14,15], such as the rich-get-richer mechanism [16,17], self-organized criticality [18], Markov processes [19], aggregation of interacting individuals [20], optimization designs [21] and the least-effort principle [22], to name just a few.
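The exponent relation can be sketched by a standard argument (the full derivation is given in Materials and Methods). Inverting the rank-frequency relation gives the rank of an element with frequency $z$,
$$r(z) = \left(\frac{z_{\max}}{z}\right)^{1/\alpha},$$
so the number of elements with frequency at least $z$ scales as $z^{-1/\alpha}$, i.e. $P(Z \ge z) \propto z^{-1/\alpha}$. Differentiating this cumulative distribution yields $p(z) \propto z^{-(1/\alpha + 1)}$, hence $\beta = 1 + 1/\alpha$.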
Heaps' law [2] also characterizes natural language: the vocabulary size grows as a sublinear function of document size, $N(t) \sim t^{\lambda}$ with $\lambda < 1$, where $t$ denotes the total number of words and $N(t)$ the number of distinct words. One ingredient causing such sublinear growth may be the memory and bursty nature of human language [23][24][25]. A particularly interesting phenomenon is the coexistence of Zipf's law and Heaps' law. Gelbukh and Sidorov [26] observed these two laws in English, Russian and Spanish texts, with exponents depending on the language. Similar results were recently reported for corpora of web texts [27], including the Industry Sector database, the Open Directory and the English Wikipedia. Besides the statistical regularities of text, the occurrences of tags for online resources [28,29], keywords in scientific publications [30], words contained in web pages returned by web searches [31], and identifiers in modern Java, C++ and C programs [32] also simultaneously display Zipf's law and Heaps' law. Benz et al. [33] reported a Zipf's law for the distribution of features of small organic molecules, together with a Heaps' law for the number of unique features. In particular, Zipf's law and Heaps' law are closely related to evolving networks. It is well known that some networks grow in an accelerating manner [34,35] and have scale-free structures (see, for example, the WWW [36] and the Internet [37]); the former property corresponds to Heaps' law, in that the number of nodes grows sublinearly with the total degree of nodes, while the latter is equivalent to Zipf's law for the degree distribution.
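To make the measurement concrete, the following is a minimal Python sketch (not part of the paper; the whitespace tokenization and the least-squares fit in log-log scale are our own assumptions) that tracks the vocabulary-growth curve $N(t)$ of a token stream and fits the Heaps' exponent:

```python
import numpy as np

def heaps_exponent(tokens):
    """Estimate the Heaps' exponent lambda by a least-squares fit of
    log N(t) against log t over the vocabulary-growth curve."""
    seen = set()
    growth = np.empty(len(tokens))
    for t, w in enumerate(tokens):
        seen.add(w)
        growth[t] = len(seen)          # N(t): distinct tokens after t+1 tokens
    t = np.arange(1, len(tokens) + 1)
    lam, _ = np.polyfit(np.log(t), np.log(growth), 1)
    return lam

# Example usage (hypothetical file name):
# tokens = open("corpus.txt").read().split()
# print(heaps_exponent(tokens))
```

A full-range fit like this can be biased by the early, nearly linear part of the curve, so fitting only the tail of $N(t)$ is a common precaution.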
Baeza-Yates and Navarro [38] showed that the two laws are related: when $\alpha > 1$, if both Zipf's law and Heaps' law hold, then $\lambda = 1/\alpha$. Using a more polished approach, Leijenhorst and Weide [39] generalized this result from Zipf's law to Mandelbrot's law [40], where $z(r) \sim (r_c + r)^{-\alpha}$ and $r_c$ is a constant. Based on a variant of the Simon model [16], Montemurro and Zanette [41,42] showed that Zipf's law results from Heaps' law, with $\alpha$ depending on $\lambda$ and the model parameters. Also based on a stochastic model, Serrano et al. [27] claimed that Zipf's law can give rise to Heaps' law when $\alpha > 1$, with Heaps' exponent $\lambda = 1/\alpha$. In this paper, we prove that for an evolving system with a stable Zipf's exponent, Heaps' law can be derived directly from Zipf's law without the help of any specific stochastic model. The relation $\lambda = 1/\alpha$ is only an asymptotic solution that holds for very large systems with $\alpha > 1$. We refine this result for finite-size systems with $\alpha > 1$ and complement it with the case $\alpha < 1$. In particular, we analyze the effects of system size on the Heaps' exponent, which have so far been ignored in the literature. Extensive empirical analysis of tens of disparate systems, ranging from keyword occurrences in scientific journals to spreading patterns of the novel influenza A (H1N1) virus, demonstrates that the refined results presented here better capture the relation between the Zipf's and Heaps' exponents. In particular, our results agree well with the evolving regularities of accelerating networks and suggest that accelerated growth is necessary to keep a stable power-law degree distribution. Whereas the majority of studies on Heaps' law are limited to linguistics, our work opens the door to a much wider horizon that includes many complex systems.
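To illustrate the size sensitivity, the sketch below is a toy experiment under our own i.i.d.-sampling assumption (not the construction used in this paper): it draws a stream from a fixed Zipfian rank-frequency law with exponent $\alpha$ and fits the resulting Heaps' exponent for several stream lengths. In this simplified setting the fitted $\lambda$ exceeds $1/\alpha$ for short streams and drifts toward $1/\alpha$ only as the stream grows long.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_lambda(alpha, n_types, stream_len):
    """Draw an i.i.d. stream from p_r ∝ r^(-alpha) over n_types ranks and
    fit the Heaps' exponent of the resulting vocabulary-growth curve."""
    p = np.arange(1, n_types + 1, dtype=float) ** (-alpha)
    p /= p.sum()
    tokens = rng.choice(n_types, size=stream_len, p=p)
    seen, growth = set(), np.empty(stream_len)
    for t, w in enumerate(tokens):
        seen.add(w)
        growth[t] = len(seen)
    t = np.arange(1, stream_len + 1)
    lam, _ = np.polyfit(np.log(t), np.log(growth), 1)
    return lam

alpha = 1.5
for T in (10**3, 10**4, 10**5):
    # Compare the fitted Heaps' exponent with the asymptotic value 1/alpha.
    print(T, round(simulate_lambda(alpha, n_types=10**5, stream_len=T), 3), 1 / alpha)
```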
For si