Punctuation effects in English and Esperanto texts
M. AUSLOOS
previously at: GRAPES@SUPRATECS,
Université de Liège, Sart-Tilman,
B-4000 Liège, Euroland
nowadays at: 7 rue des Chartreux, B-4122 Plainevaux, Belgium
Abstract
A statistical physics study of punctuation effects on sentence lengths is presented
for written texts: Alice in Wonderland and Through the Looking-Glass. The translation
of the first text into Esperanto is also considered as a test for the role of punctuation
in defining a style, and for contrasting natural and artificial, but written, languages.
Several log-log plots of the sentence length-rank relationship are presented for the
major punctuation marks. Different power laws are observed with characteristic
exponents. The exponent can take a value much less than unity (ca. 0.50 or 0.30)
depending on how a sentence is defined. The texts are also mapped into time series
based on the word frequencies. The quantitative differences between the original and
translated texts are very minute, at the exponent level. It is argued that sentences
seem to be more reliable than word distributions in discussing an author's style.
Key words: texts, sentence statistics, Zipf, ranking, translation, esperanto
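The length-rank analysis summarized in the abstract can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the procedure actually used in the paper: the function names `sentence_lengths` and `rank_exponent`, the whitespace word count, the plain least-squares fit on the log-log data, and the toy text are all hypothetical choices made here. It shows how the set of marks taken to end a "sentence" changes the ranked lengths, and hence the fitted exponent.

```python
import math
import re


def sentence_lengths(text, marks="."):
    """Split `text` at every character in `marks` and return the word
    count of each non-empty piece (words = whitespace-separated tokens)."""
    pattern = "[" + re.escape(marks) + "]"
    pieces = re.split(pattern, text)
    lengths = [len(piece.split()) for piece in pieces]
    return [n for n in lengths if n > 0]


def rank_exponent(lengths):
    """Rank lengths in decreasing order and return minus the least-squares
    slope of log(length) versus log(rank), i.e. the exponent of an
    assumed power law length ~ rank**(-exponent)."""
    if len(lengths) < 2:
        raise ValueError("need at least two sentences to fit a slope")
    ranked = sorted(lengths, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(n) for n in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Toy text (invented for illustration, not taken from the studied books).
toy = "Alice was tired. She sat down; then she slept! Did she dream? Yes."

# Two hypothetical definitions of a sentence: ended by a full stop only,
# or ended by any of several major punctuation marks.
for marks in (".", ".;!?"):
    lengths = sentence_lengths(toy, marks)
    print(repr(marks), sorted(lengths, reverse=True), rank_exponent(lengths))
```

On a real text one would pass the whole book as `text`; a toy sample this short is far too small for the fitted value to mean anything, and a log-log least-squares fit is only one of several ways to estimate such an exponent.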
1 Introduction
Since [1], there has been a relatively interesting set of studies pertaining to the
structure of written texts through techniques based on statistical physics ideas and
methods, usually measuring the word length and/or word frequency distribution.
Without claiming to be exhaustive, let us mention recent studies, well after 2000,
on German [2], Polish [3], English and Irish [3–8], Chinese [7–10], Japanese [11],
Greek [12–14], Turkish [15], Hungarian [16], Welsh [17], Baltic and Slavic [18], but
also on less natural languages like Fortran [19], artificial [20], or Esperanto [21].
Of course these studies are partially a revival of an enormous flurry of studies in
linguistics which started as early as 1930 and included later on work by Zipf and
many others [22–24].
Email address: marcel.ausloos@ulg.ac.be (M. AUSLOOS).
Preprint submitted to Elsevier
2 November 2018
arXiv:1004.4848v1 [cs.CL] 27 Apr 2010
Debates exist as to whether a few texts are sufficiently representative of a language
and how big a lexicon must be before it becomes significant. With this caveat
stated, it is fair to say that several specific features of written
texts have not been studied in detail. The role of punctuation in the structure
of texts is one of these.
According to Wikipedia, the first inscription with punctuation marks is the
Mesha Stele (9th century BC); see http://en.wikipedia.org/wiki/Mesha_Stele. A
long time ago, Greeks and Romans adopted a few punctuation marks (the
dot and combinations, essentially) in order to mark pauses in texts to be
performed. Other historical details on the creation, dissemination, use and types
of punctuation in various languages can be found in
http://en.wikipedia.org/wiki/Punctuation, and
http://grammar.ccc.commnet.edu/grammar/marks/marks.htm.
From these online references, it can be learned that punctuation marks are symbols
that indicate the structure and organization of a written text in a specific
language, for readability as much as for suggesting intonation and pauses
when reading aloud. In written English, punctuation is vital to disambiguate
the meaning of sentences, though this does not go without problems [25,26].
Notice that some modern writers have attempted to go, in some sense, backward.
As far back as 1895, Crane published The Black Riders and Other Lines [27]
in capital letters, the poems appearing without punctuation, an unusual
typographical presentation for the time, and a style considered as garbage
by the critics. In another language, French, Apollinaire [28] published
one of his major pieces, Alcools, without punctuation. Similarly, thereafter,
the French surrealists and dadaists scorned punctuation, like Aragon [29], who
avoided any in most of his poems and prose for or about Elsa Triolet. That
followed from the para-psychological theory put forward by Breton [30] in The
Manifesto, containing new/practical recipes for enhancing the Magic Surrealist
Art, such as: "...Punctuation of course necessarily hinders the stream of
absolute continuity which preoccupies us ...". This was recently "poetically"
reformulated by Hahn [31] in the poem The Pity of Punctuation. Some "maximum"
was likely reached by Joyce [32]. In Ulysses, which symbolically conserves
the structure of Homer's The Odyssey, where there is no punctuation, Joyce
omits punctuation entirely in the last chapter of the novel, consisting of eight
long paragraphs, in order to mimic the uninterrupted flow of naked thoughts.
Thus punctuation could be avoided. Indeed there is some redundancy, since a
capital letter can indicate a new sentence to the reader. One major difficulty
nevertheless occurs in text analysis: it is easier to observe a punctuation
sign in a text than a capital letter.
However, fundamentally, in lite
…(Full text truncated)…