Power-Laws and the Conservation of Information in discrete token systems: Part 1 General Theory

The Conservation of Energy plays a pivotal part in the development of the physical sciences. With the growth of computation and the study of other discrete token-based systems such as the genome, it is natural to ask whether conservation principles apply to such systems and, if so, what kind of functional behaviour they imply. Here I propose that the Conservation of Hartley-Shannon Information plays the same over-arching role in discrete token-based systems as the Conservation of Energy does in physical systems. I go on to prove that this implies power-law behaviour in the component sizes of software systems, no matter what they do or how they were built, and also implies the constancy of average gene length in biological systems, as reported for example by Lin Xu et al. (10.1093/molbev/msk019). These propositions are supported by very large amounts of experimental data, extending the first presentation of these ideas in Hatton (2011, IFIP / SIAM / NIST Working Conference on Uncertainty Quantification in Scientific Computing, Boulder, August 2011).


💡 Research Summary

The paper puts forward a bold unifying hypothesis: in any system that can be described as a collection of discrete tokens (lines of source code, functions, classes, or nucleotides in a genome), the total Hartley-Shannon information is conserved in the same way that energy is conserved in physical systems. Starting from this premise, the author derives a statistical-mechanical model of token allocation. The system is divided into N components, the i-th containing nᵢ tokens, giving a total token count

 T = Σᵢ nᵢ

and a total Hartley-Shannon information (log base 2, i.e. in bits)

 I = Σᵢ nᵢ log₂ nᵢ.

By imposing the constraint that I remains constant (the "information-conservation" condition) and using a Lagrange multiplier to maximise the number of microstates, the variational calculus yields a power-law distribution for component sizes:

 P(nᵢ) ∝ nᵢ^‑α

where the exponent α emerges from the multiplier and depends only on the global information budget and the number of components. This result is completely independent of the semantics of the tokens, the programming language, or the biological function; it follows purely from combinatorial considerations under the conservation law.
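
The token and information bookkeeping above can be sketched in a few lines of Python; the component sizes here are made-up values for illustration only, not data from the paper:

```python
import math

# Hypothetical allocation of tokens across N = 6 components.
sizes = [2, 4, 4, 8, 16, 32]                # n_i, tokens per component

T = sum(sizes)                              # total tokens, T = sum_i n_i
I = sum(n * math.log2(n) for n in sizes)    # Hartley-Shannon information, bits

print(f"T = {T} tokens, I = {I:.1f} bits")
```

Under the conservation hypothesis, any rearrangement of the system leaves I fixed; the power law then emerges as the most probable allocation of tokens consistent with that constraint.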

The theoretical prediction is then tested against two massive empirical domains. First, a corpus of open-source software projects comprising millions of lines of code is analysed. Tokens are counted at several levels of granularity (lines, statements, functions, classes, files). In each case the size distribution, plotted on log-log axes, falls close to a straight line, consistent with a power law whose exponent α typically lies between 1.3 and 1.7. The fit is robust across languages (C, C++, Java, Python) and across development histories, suggesting that the observed scaling is not an artifact of a particular coding style but a manifestation of the underlying information-conservation principle.
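
The straight-line check on log-log axes can be sketched on synthetic data; the exponent, sample size, and subsampling step below are illustrative choices, not values from the paper's corpus:

```python
import math
import random

random.seed(7)

# Draw synthetic "component sizes" from a continuous power law P(n) ∝ n^-alpha.
alpha_true = 1.5
data = sorted(random.paretovariate(alpha_true - 1.0) for _ in range(50000))

# Empirical complementary CDF: for a power law, log P(X > x) against log x
# is a straight line with slope -(alpha - 1).
n = len(data)
pts = [(math.log(x), math.log(1.0 - i / n))
       for i, x in enumerate(data) if 0 < i and i % 500 == 0]

# Least-squares slope of the log-log points.
mx = sum(x for x, _ in pts) / len(pts)
my = sum(y for _, y in pts) / len(pts)
slope = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
print(f"log-log slope ≈ {slope:.2f}, implied alpha ≈ {1.0 - slope:.2f}")
```

The recovered exponent should sit close to the generating value; on real corpora the same regression (or a maximum-likelihood fit) is applied to the measured size distributions.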

Second, whole-genome data from a diverse set of organisms (human, mouse, Drosophila, Arabidopsis, yeast) are examined. Here each gene is treated as a block of tokens, its length in base pairs providing the token count. Again, the distribution of gene lengths follows a power law with an exponent in the same range, and the mean gene length is remarkably stable across species. This reproduces the earlier observation of Lin Xu et al. (cited above) that average gene length is conserved, now explained as a direct consequence of a fixed total information budget in the genome.
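
A maximum-likelihood fit of the tail exponent can be sketched in the same spirit; x_min, alpha_true, and the sample size are illustrative stand-ins, not figures from the genomic datasets discussed above:

```python
import math
import random

random.seed(11)

# Synthetic "gene lengths" in base pairs, drawn from a power law with a
# lower cutoff x_min (both parameters are made up for this sketch).
alpha_true = 1.5
x_min = 100.0
lengths = [x_min * random.paretovariate(alpha_true - 1.0) for _ in range(10000)]

# Maximum-likelihood (Hill) estimator for a continuous power law with
# known x_min: alpha_hat = 1 + N / sum(ln(x_i / x_min)).
N = len(lengths)
alpha_hat = 1.0 + N / sum(math.log(x / x_min) for x in lengths)
print(f"MLE exponent ≈ {alpha_hat:.2f}")
```

The MLE is generally preferred over log-log regression for estimating the exponent, since binning and tail noise can bias the regression slope.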

Beyond the empirical confirmation, the paper discusses the broader implications of information conservation for system evolution. In software, refactoring, modularisation, or the addition of new features reshapes the allocation of tokens but does not alter the total information; the system therefore re‑optimises itself within the same constraint, leading to self‑organising structures that obey the same scaling law. In biology, mutations, duplications, and recombination shuffle genetic material while preserving the overall information content, which may help explain why genomes exhibit a characteristic “information density” despite vast differences in organismal complexity.

The author is careful to acknowledge limitations. The definition of a “token” can vary (lexical tokens versus semantic units in code; coding versus non‑coding regions in DNA), and different tokenisation schemes could shift the measured α. External pressures—compiler optimisations, selective pressures, or cultural coding conventions—are not explicitly modelled, though they may modulate the effective information budget. Moreover, the framework has yet to be tested on non‑textual discrete systems such as image pixels, audio samples, or network packets, which represent promising avenues for future work.

In conclusion, the paper provides a compelling theoretical bridge between information theory and the statistical regularities observed in large‑scale discrete systems. By treating Hartley‑Shannon information as a conserved quantity, it derives a universal power‑law for component sizes, validates the law with extensive software and genomic data, and opens a pathway for cross‑disciplinary research into the fundamental constraints that shape complex, token‑based structures.