Multifractal information production of the human genome
We determine the Renyi entropies K_q of symbol sequences generated by human chromosomes. These exhibit nontrivial behaviour as a function of the scanning parameter q. In the thermodynamic formalism, there are phase transition-like phenomena close to the q=1 region. We develop a theoretical model for this based on the superposition of two multifractal sets, which can be associated with the different statistical properties of coding and non-coding DNA sequences. This model is in good agreement with the human chromosome data.
💡 Research Summary
The paper investigates the statistical complexity of human genomic sequences by applying concepts from multifractal analysis and thermodynamic formalism. The authors first convert the nucleotide strings of all human chromosomes into symbolic sequences over the four-letter alphabet {A, C, G, T}. Using a sliding window of length L (typically 100–500 base pairs), they compute the empirical frequencies p_i of all possible word patterns within each window. From these frequencies they evaluate the Rényi entropies
K_q = lim_{L→∞} (1/L)·(1/(1–q))·log ∑_i p_i^q
for a broad range of the order parameter q (from –5 to +5). K_q reduces to the topological entropy at q = 0 and to the Shannon entropy per symbol as q → 1; it emphasizes frequent patterns for q > 1 and rare patterns for q < 0.
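A finite-length version of this estimate can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function name `renyi_entropy_rate`, the use of overlapping words, and the plain frequency estimate of p_i are assumptions, and the limit L → ∞ is replaced by a fixed word length.

```python
import math
from collections import Counter

def renyi_entropy_rate(seq, word_len, q):
    """Finite-length estimate of K_q from overlapping words of length word_len.

    Hypothetical helper: p_i is the relative frequency of word i in seq.
    """
    n_words = len(seq) - word_len + 1
    if n_words <= 0:
        raise ValueError("sequence is shorter than the word length")
    # Count every overlapping word of the requested length.
    counts = Counter(seq[i:i + word_len] for i in range(n_words))
    probs = [c / n_words for c in counts.values()]
    if abs(q - 1.0) < 1e-12:
        # q -> 1 limit: Shannon entropy of the word distribution.
        h = -sum(p * math.log(p) for p in probs)
    else:
        h = math.log(sum(p ** q for p in probs)) / (1.0 - q)
    # Divide by the word length to obtain an entropy per symbol.
    return h / word_len
```

Only words that actually occur enter the sum, so negative q is well defined; for short sequences and long words the estimate is dominated by undersampling, which is why the word length has to stay small relative to log(len(seq)).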
The measured K_q(q) curve is markedly nonlinear. In particular, near q ≈ 1 the first derivative dK_q/dq shows a steep, almost discontinuous change, reminiscent of a first‑order phase transition in statistical physics where the free‑energy derivative jumps. This “phase‑transition‑like” behavior suggests that the underlying probability measure of the genome is not monofractal but composed of distinct scaling regimes.
To explain the observation, the authors propose a two‑component multifractal model. They posit that the genome can be regarded as a superposition of two independent multifractal sets: M₁ representing coding DNA (protein‑coding exons) and M₂ representing non‑coding DNA (introns, intergenic regions, repeats, transposable elements, regulatory sequences). Each set possesses its own multifractal spectrum, f₁(α) or f₂(α), and its own scaling exponent, τ₁(q) or τ₂(q). The overall symbolic sequence is modeled as a weighted mixture
S = α·M₁ ⊕ (1–α)·M₂,
where α here denotes the fraction of coding bases (≈0.02 for the human genome), not the singularity strength that appears as the argument of f₁ and f₂. From the mixture, the Rényi entropy can be derived analytically as
K_q = (1/(1–q)) log
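How a superposition of two components can produce transition-like behaviour near q = 1 can be illustrated with a toy calculation. The sketch below is not the paper's derived expression. It assumes that the two components occupy disjoint sets of words, so that ∑_i p_i^q = w^q·exp((1–q)·L·K₁) + (1–w)^q·exp((1–q)·L·K₂), where w plays the role of the coding fraction and K₁, K₂ are the component entropies at the given q. The names `renyi_iid` and `mixture_renyi` and the example sources are hypothetical.

```python
import math

def renyi_iid(probs, q):
    """K_q of an i.i.d. source with single-symbol probabilities probs."""
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** q for p in probs)) / (1.0 - q)

def mixture_renyi(k1, k2, w, q, L):
    """Finite-L K_q of a weighted superposition on disjoint word sets.

    Assumption (not from the paper): the partition sum splits as
    w^q exp((1-q) L k1) + (1-w)^q exp((1-q) L k2).
    """
    if abs(q - 1.0) < 1e-12:
        raise ValueError("q = 1 requires the separate Shannon limit")
    a = q * math.log(w) + (1.0 - q) * L * k1
    b = q * math.log(1.0 - w) + (1.0 - q) * L * k2
    # Log-sum-exp keeps the evaluation stable for large L.
    m = max(a, b)
    log_z = m + math.log(math.exp(a - m) + math.exp(b - m))
    return log_z / ((1.0 - q) * L)
```

Under these assumptions the larger exponential dominates for large L, so the mixture follows the component with the larger K_q for q < 1 and the one with the smaller K_q for q > 1. Unless the two components happen to share the same Shannon entropy, the selected branch switches at q = 1, which gives a kink in (1–q)·K_q there; finite L and the weights w, 1–w smooth this into a steep but continuous change.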