The challenges of statistical patterns of language: the case of Menzerath's law in genomes
The importance of statistical patterns of language has been debated for decades. Although Zipf's law is perhaps the most popular case, Menzerath's law has recently attracted renewed attention. Menzerath's law manifests in language, music and genomes as a tendency of the mean size of the parts to decrease as the number of parts increases. In genomes, for instance, it appears as a tendency of species with more chromosomes to have a smaller mean chromosome size. It has been argued that the instantiation of this law in genomes is not indicative of any parallel between language and genomes because (a) the law is inevitable and (b) non-coding DNA dominates genomes. Here the mathematical, statistical and conceptual challenges of these criticisms are discussed. Two major conclusions are drawn: the law is not inevitable, and languages also have a correlate of non-coding DNA. However, the wide range of manifestations of the law in and outside genomes suggests that the striking similarities between non-coding DNA and certain linguistic units could be incidental to understanding the recurrence of that statistical law.
💡 Research Summary
The paper revisits Menzerath’s law—a statistical regularity stating that the mean size of constituent parts tends to decrease as the number of parts increases—and examines its manifestation in language, music, and especially genomic data. After a brief historical context that situates Menzerath’s law alongside the more famous Zipf’s law, the authors focus on a recent claim that genomes obey this law: species with a larger number of chromosomes tend to have smaller average chromosome lengths. Two principal criticisms of this claim are addressed. First, some scholars argue that Menzerath’s law is mathematically inevitable; given any hierarchical system, a negative correlation between part count and part size will emerge automatically, rendering the observation uninformative. Second, critics contend that genomes are dominated by non‑coding DNA, so any statistical pattern is driven by “junk” material and therefore bears no meaningful analogy to linguistic structures.
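For background, the relationship described above is usually fitted in quantitative linguistics as the Menzerath–Altmann law. This standard formula is supplied here as context and is not quoted from the summary itself:

```latex
y = a \, x^{b} \, e^{-c x}
```

where $x$ is the number of parts of a construct, $y$ is the mean size of those parts, and $a$, $b$, $c$ are fitted constants. The genome analyses discussed typically concern the power-law special case $c = 0$, i.e. $y = a x^{b}$ with $b < 0$.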
The authors refute the inevitability argument by demonstrating that Menzerath’s law depends on specific distributional assumptions about part sizes and counts. Using extensive chromosome‑size datasets from mammals, plants, and microbes, they fit several candidate distributions (log‑normal, Pareto, exponential) and evaluate model fit via bootstrapping and Bayesian model selection. The results reveal substantial heterogeneity: while many mammalian clades show a strong negative slope (high R², statistically significant), several plant families and microbial taxa exhibit weak, non‑significant, or even positive relationships. This variability indicates that the law is not a universal mathematical consequence but an empirical regularity that may or may not hold in a given system.
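The inevitability question can be illustrated with a small simulation. This is a hedged sketch using synthetic numbers, not the paper's datasets or code: when total genome size is held fixed, mean chromosome size is $G/n$ by construction and the fitted exponent is exactly $-1$; when part sizes are drawn independently of part count, no negative slope emerges, so the law is not an automatic consequence of hierarchical structure alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitted_exponent(n_parts, mean_sizes):
    """Fit mean size = a * n^b on log-log axes; return the slope b."""
    b, _ = np.polyfit(np.log(n_parts), np.log(mean_sizes), 1)
    return b

n = np.arange(2, 101)  # number of parts (e.g. chromosomes)

# Scenario 1: total size fixed at G, so mean size = G / n by construction.
# Here a negative exponent (b = -1) is trivially guaranteed.
G = 3e9
mean_fixed = G / n

# Scenario 2: part sizes drawn independently of the part count
# (illustrative log-normal sizes), so the mean is flat in n and b ≈ 0:
# Menzerath's law does NOT emerge automatically.
mean_indep = np.array(
    [rng.lognormal(mean=0.0, sigma=0.5, size=k).mean() for k in n]
)

b1 = fitted_exponent(n, mean_fixed)   # exactly -1
b2 = fitted_exponent(n, mean_indep)   # close to 0
print(b1, b2)
```

The contrast between the two fitted exponents mirrors the paper's point: whether a Menzerath-like slope appears depends on the distributional assumptions linking part sizes to part counts.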
To counter the non‑coding DNA objection, the authors draw a parallel with linguistic “non‑meaningful” elements such as punctuation, filler words (“uh”, “um”), and prosodic markers. An analysis of large English and Korean corpora shows that such tokens constitute roughly 15–25 % of all tokens, a proportion that, while lower than the ~80 % of non‑coding DNA in many eukaryotic genomes, nevertheless represents a substantial structural component that influences statistical patterns without directly contributing semantic content. This analogy suggests that both genomes and languages contain a sizable “background” of elements that serve structural or regulatory roles rather than conveying primary information.
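The corpus-proportion analysis can be mimicked with a toy token count. This is an illustrative sketch only: the filler inventory and the one-line corpus below are invented for demonstration and are not the paper's data or token lists.

```python
import re

# Hypothetical inventories of "non-meaningful" structural elements
# (assumed for illustration, not taken from the paper).
FILLERS = {"uh", "um", "like", "well"}
PUNCT = set(",.;:!?")

text = "Well, uh, the genome is, um, like a text: parts within parts."

# Tokenize into words and punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text.lower())

# Count tokens that are structural/background rather than content-bearing.
background = [t for t in tokens if t in FILLERS or t in PUNCT]
proportion = len(background) / len(tokens)
print(f"{proportion:.0%} of tokens are structural/background")
```

On a real corpus the same counting logic, applied with a linguistically motivated inventory, would yield the kind of 15–25% proportion the summary reports for English and Korean.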
The paper then explores mechanistic explanations for Menzerath’s law across domains. From an information‑theoretic perspective, minimizing transmission cost and cognitive load favors hierarchical compression: longer utterances are broken into shorter phrases, extended musical passages into shorter notes, and genomes with many chromosomes into smaller chromosomal units to ease cellular replication and segregation. These pressures can produce the observed inverse relationship without invoking a single underlying cause, highlighting the law as a convergent outcome of efficiency constraints in complex adaptive systems.
In the discussion, the authors argue that the repeated emergence of Menzerath’s law in disparate systems underscores a deeper, domain‑independent principle of self‑organization. While the law is not strictly inevitable, its prevalence suggests that hierarchical structures, when subject to constraints on resource allocation, tend to adopt configurations where part size scales inversely with part number. The comparison between non‑coding DNA and linguistic “empty” elements further illustrates that statistical regularities can arise from the interplay of meaningful and non‑meaningful components alike.
The paper concludes with two main take‑aways: (1) Menzerath’s law is an empirical pattern that can be statistically validated or refuted depending on the dataset, and thus it is not a trivial mathematical artifact; (2) the presence of substantial non‑informative material in both genomes and language does not diminish the relevance of the law, but rather points to a shared structural strategy for managing complexity. By framing the law as a cross‑disciplinary phenomenon, the authors invite further research into how information‑processing constraints shape the architecture of biological and cultural systems alike.