Equilibrium (Zipf) and Dynamic (Grassberger-Procaccia) method based analyses of human texts. A comparison of natural (English) and artificial (Esperanto) languages
Two English texts by Lewis Carroll, one (Alice in Wonderland) also translated into Esperanto, the other (Through the Looking Glass), are compared in order to observe whether natural and artificial languages differ significantly from each other. One-dimensional time-series-like signals are constructed using only word frequencies (FTS) or word lengths (LTS). The data are studied through (i) a Zipf method for sorting out correlations in the FTS and (ii) a method based on the Grassberger-Procaccia (GP) technique for finding correlations in the LTS. Features are compared: different power laws are observed, with characteristic exponents for the ranking properties and for the phase space attractor dimensionality. The Zipf exponent can take values much less than unity (ca. 0.50 or 0.30) depending on how a sentence is defined. This non-universality is conjectured to be a measure of the author's style. Moreover, the attractor dimension $r$ is a simple function of the so-called phase space dimension $n$, i.e., $r = n^{\lambda}$, with $\lambda = 0.79$. Such an exponent is also conjectured to be a measure of the author's creativity. However, even though there are quantitative differences between the original English text and its Esperanto translation, the qualitative differences are very minute, indicating in this case a translation that respects relatively well, along our lines of analysis, the content of the author's writing.
💡 Research Summary
The paper investigates whether natural (English) and artificial (Esperanto) languages exhibit statistically significant differences by analysing two classic works by Lewis Carroll: “Alice in Wonderland” (AWL) and “Through the Looking‑Glass” (TLG). The authors treat each text as two distinct one‑dimensional time series. The first, a Frequency Time Series (FTS), records the occurrence frequency of each word in the order it appears; the second, a Length Time Series (LTS), records the number of characters in each successive word. The FTS is examined using Zipf’s law, which predicts a power‑law relationship between word rank R and frequency f (f ∝ R^‑ζ). The LTS is analysed with the Grassberger‑Procaccia (GP) algorithm, which reconstructs an embedding of the series in an n‑dimensional phase space and computes the correlation integral C_n(l) as a function of distance l. From the scaling C_n(l) ∝ l^r the authors extract an attractor dimension r.
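The rank-frequency relation f ∝ R^‑ζ described above can be illustrated with a short sketch. It shows the textbook version of a Zipf fit (count words, rank them by frequency, fit a straight line on log-log axes) and is not necessarily the authors' exact procedure. The function names, the tokenization rule, and the use of an unweighted least-squares fit are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (an assumed convention)."""
    return re.findall(r"[a-z']+", text.lower())


def rank_frequency(words):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    if len(xs) < 2:
        raise ValueError("need at least two points")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx


def zipf_exponent(words):
    """Estimate zeta in f ~ R**(-zeta) from a list of word tokens."""
    pairs = rank_frequency(words)
    ranks = [r for r, _ in pairs]
    freqs = [f for _, f in pairs]
    return -loglog_slope(ranks, freqs)
```

A call such as `zipf_exponent(tokenize(text))` returns the fitted exponent for a whole text. An unweighted log-log fit gives every rank equal weight, so the many rare words in the tail can dominate the estimate; restricting the rank range is a common, if ad hoc, remedy.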
Key methodological steps:
- Text acquisition – the three files (AWL‑English, AWL‑Esperanto, TLG‑English) were downloaded from a public repository, chapter headings removed, and basic statistics (total words, distinct words, characters, punctuation, sentence count) compiled (Table 1).
- Construction of FTS – each word is replaced by its global frequency, yielding a sequence f(t). Zipf plots (log‑log of f versus R) are generated for the whole text and for individual chapters, using different punctuation marks (period, comma, semicolon, question mark, exclamation point) as sentence delimiters.
- Construction of LTS – each word’s length ℓ(t) is recorded, producing a second time series. The GP method is applied with embedding dimensions n = 2…15 and delay τ = 1. Correlation integrals C_n(l) are computed as a function of the distance l between embedded points and plotted on log‑log axes.
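The LTS construction and the GP step can be sketched as follows. This is a minimal illustration under stated assumptions: words are taken to be runs of ASCII letters, distances are Euclidean, and the attractor dimension is read off as a least-squares slope over a user-supplied list of radii. None of these choices are confirmed by the summary, and all function names are hypothetical.

```python
import math
import re


def length_series(text):
    """Word-length time series: one integer per word, in reading order."""
    return [len(w) for w in re.findall(r"[A-Za-z]+", text)]


def delay_embed(series, n, tau=1):
    """Delay vectors (x_i, x_{i+tau}, ..., x_{i+(n-1)tau}) in n dimensions."""
    last = len(series) - (n - 1) * tau
    return [tuple(series[i + k * tau] for k in range(n)) for i in range(last)]


def correlation_integral(vectors, l):
    """Fraction of distinct pairs of vectors closer than distance l."""
    m = len(vectors)
    if m < 2:
        return 0.0
    close = 0
    for i in range(m - 1):
        for j in range(i + 1, m):
            if math.dist(vectors[i], vectors[j]) < l:
                close += 1
    return 2.0 * close / (m * (m - 1))


def correlation_dimension(series, n, radii, tau=1):
    """Slope of log C_n(l) versus log l over the given radii."""
    vectors = delay_embed(series, n, tau)
    points = []
    for l in radii:
        c = correlation_integral(vectors, l)
        if c > 0:  # log(0) is undefined, so empty radii are skipped
            points.append((math.log(l), math.log(c)))
    if len(points) < 2:
        raise ValueError("need at least two radii with non-zero C_n(l)")
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx
```

The pairwise loop is quadratic in the number of embedded points, so for a full novel a subsample or a spatial index would be needed in practice. Because word lengths are small integers, many pairs sit at identical distances, and the choice of radii strongly affects the fitted slope.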
Findings:
- The classic Zipf exponent ζ≈1 is not observed. Depending on how a “sentence” is defined, ζ takes values around 0.30–0.50. For example, using periods as delimiters yields ζ≈0.33, while semicolons give ζ≈0.50. This non‑universality is interpreted as a quantitative signature of the author’s style, reflecting the distribution of punctuation and the hierarchical structure of the narrative.
- The authors also discuss the Zipf‑Mandelbrot refinement f(R)=A(1+CR)^‑ζ, but they do not present detailed fitting parameters or goodness‑of‑fit statistics, leaving the robustness of this refinement unclear.
- GP analysis reveals that the attractor dimension r scales with the embedding dimension n according to r ≈ n^λ with λ≈0.79 for both English and Esperanto versions of AWL. The authors suggest that this sub‑linear scaling could serve as a measure of “creativity”, arguing that a more complex attractor (higher r for a given n) indicates richer temporal correlations in word length sequences.
- Quantitatively, the Esperanto translation shows a larger vocabulary (distinct words) and a higher punctuation count, yet the Zipf exponents and the r‑vs‑n relationship remain remarkably similar to the English originals. The authors conclude that the translation preserves the statistical structure of the source text, despite the artificial nature of Esperanto.
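The relation r ≈ n^λ reported above implies log r = λ log n with no intercept, so λ can be estimated by a regression through the origin on log-log axes. The sketch below assumes exactly that fitting choice, which the summary does not confirm, and the function name is hypothetical. The demonstration values are synthetic, generated from λ = 0.79, and are not the paper's data.

```python
import math


def fit_lambda(dims, attractor_dims):
    """Least-squares estimate of lambda in r = n**lambda (no prefactor)."""
    num = 0.0
    den = 0.0
    for n, r in zip(dims, attractor_dims):
        if n <= 1 or r <= 0:
            # log(1) = 0 carries no information; log of r <= 0 is undefined
            continue
        ln_n = math.log(n)
        num += ln_n * math.log(r)
        den += ln_n ** 2
    if den == 0:
        raise ValueError("need at least one embedding dimension n > 1")
    return num / den


if __name__ == "__main__":
    # Synthetic illustration only: r values generated from lambda = 0.79.
    dims = list(range(2, 16))
    synthetic_r = [n ** 0.79 for n in dims]
    print(fit_lambda(dims, synthetic_r))
```

Forcing the line through the origin matches the stated form r = n^λ. If a prefactor were allowed (r = A n^λ), an ordinary regression with an intercept would be the appropriate variant, and comparing the two fits is one way to check whether the pure power law is adequate.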
Critical assessment:
The study is innovative in combining a static frequency‑based analysis (Zipf) with a dynamic phase‑space reconstruction (GP) on literary texts. However, several limitations temper the strength of the conclusions. First, the sample size is limited to two works and their translation, which restricts the generalisability of the observed ζ‑range and the r‑vs‑n scaling law. Second, the Zipf‑Mandelbrot fitting lacks quantitative validation (e.g., χ², AIC), making it difficult to assess whether the two‑parameter model truly improves over the simple power law. Third, GP results are known to be sensitive to data length, noise, and the choice of embedding parameters; the authors do not report surrogate data tests, sub‑sampling, or error bars on λ, leaving the claimed “creativity” metric under‑supported. Fourth, the interpretation of r = n^0.79 as a creativity indicator is speculative; without a comparative baseline (e.g., other authors, genres, or random texts) the exponent λ cannot be uniquely linked to creative processes. Finally, the paper’s discussion of “natural vs. artificial language” rests on a single artificial language (Esperanto) and a single translation, which may not capture broader typological differences.
Future work should expand the corpus to include multiple authors, genres, and a broader set of constructed languages, employ rigorous statistical model selection for Zipf‑Mandelbrot fits, and validate GP scaling with surrogate analyses. Incorporating additional dynamical measures (e.g., Lyapunov exponents, entropy rates) could also strengthen claims about temporal complexity and creativity.
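A surrogate analysis of the kind suggested here can be sketched with a simple shuffle test: randomly permuting the series destroys temporal order while preserving the distribution of values, so a statistic that depends on ordering can be compared against its shuffled counterparts. The sketch below uses the lag-1 autocorrelation as a cheap stand-in statistic. This is a substitution made for illustration, not the GP dimension itself, and any ordering-sensitive function could be passed in its place. All names and default values are assumptions.

```python
import random


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation; returns 0.0 for a constant series."""
    m = sum(series) / len(series)
    den = sum((x - m) ** 2 for x in series)
    if den == 0:
        return 0.0
    num = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
    return num / den


def shuffle_surrogate_pvalue(series, statistic, n_surrogates=99, seed=0):
    """One-sided empirical p-value of statistic(series) against shuffles."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = statistic(series)
    exceed = 0
    for _ in range(n_surrogates):
        surrogate = list(series)
        rng.shuffle(surrogate)
        if statistic(surrogate) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (exceed + 1) / (n_surrogates + 1)
```

A small p-value indicates that the ordering of the original series carries structure that shuffling removes. Plain shuffling tests only against an uncorrelated null; surrogates that preserve the power spectrum would be a stricter comparison.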
In summary, the paper demonstrates that both English and Esperanto versions of Carroll’s texts obey modified Zipf scaling and exhibit similar GP attractor dimensionality, suggesting that the translation retains the underlying statistical signatures of the original. While the methodological combination is promising, the conclusions about author style, creativity, and language universality require broader empirical support and more rigorous statistical validation.