Excess entropy in natural language: present state and perspectives
Łukasz Dębowski
Institute of Computer Science, Polish Academy of Sciences, Warszawa, Poland
ldebowsk@ipipan.waw.pl; www.ipipan.waw.pl/~ldebowsk
(Dated: 11 March 2020)

We review recent progress in understanding the meaning of mutual information in natural language. Let us define words in a text as strings that occur sufficiently often. In a few previous papers, we have shown that a power-law distribution for so defined words (a.k.a. Herdan's law) is obeyed if there is a similar power-law growth of (algorithmic) mutual information between adjacent portions of texts of increasing length. Moreover, the power-law growth of information holds if texts describe a complicated infinite (algorithmically) random object in a highly repetitive way, according to an analogous power-law distribution. The described object may be immutable (like a mathematical or physical constant) or may evolve slowly in time (like cultural heritage). Here we reflect on the respective mathematical results in a less technical way. We also discuss the feasibility of deciding to what extent these results apply to actual human communication.

PACS numbers: 89.75.Da; 89.75.Kd; 02.50.Ey; 89.70.Cf; 89.70.Eg
Keywords: Herdan's law, grammar-based codes, mutual information, language models

In 1990, the German engineer Wolfgang Hilberg published an article [1] in which the graph of conditional entropy of printed English from Claude Shannon's famous work [2] was replotted in log-log scale. Seeing a dozen data points lie on a straightish line, he conjectured that the entropy of a block of n characters drawn from a text in natural language is roughly proportional to √n for n tending to infinity. Although this conjecture was not sufficiently supported by experiment or a rational model, it attracted the interest of a few physicists seeking to understand complex systems [3-6]. As a graduate in physics and a junior computational linguist, I found their publications in 2000. They stimulated me to ponder the interplay of randomness, order, and complexity in language. I felt that a better understanding of Hilberg's conjecture could lead to a better understanding of Zipf's law for the distribution of words [7,8]. Using Hilberg's conjecture, I wished to demonstrate clearly that the monkey-typing model, introduced to explain Zipf's law [9], cannot account for some important purposes of human communication. However, it took a few years to translate these intuitions into a mature mathematical model [10-13]. The model is presented here in an accessible way. I also identify a few problems for future research.

I. INTRODUCTION

The phenomenon of human language communication can be looked upon from various perspectives. These different points of view give rise, respectively, to different mathematical models, which are applied to human language, studied for themselves, or used for different purposes. The clearest dichotomy of mathematical views of language comes from whether we look at individual sentences or above that level.

On the one hand, we may ask how human beings understand individual sentences and what rules are obeyed in their composition.
This interest leads to elaborate theories of phonology, word morphology, syntax, automata, formal and programming languages, mathematical logic, and formal semantics [14-16]. Although fragmented, these fields influence one another. Their common feature is using discrete rather than numerical models. Thus, they may be called non-quantitative linguistics (non-QL).

On the other hand, we may ask how sentences are chained into texts, discourses, or collections of texts typically produced by humans. At this level, rigid structures are less prominent, and quantitative analysis of data, done under the auspices of quantitative linguistics (QL) or corpus linguistics, forms the primary tool of description. However, in spite of a few remarkable observations like Zipf's [7] or Menzerath's [17] laws, QL has not established a coherent mathematical framework so far [18].

Although communication between QL and non-QL is weak, because they use very different mathematical notions, quantitative reflection upon language and the difficulties of its probabilistic modeling inspired a few great mathematicians: A. Markov formulating the notion of a Markov chain [19], C. Shannon establishing information theory [2,20], B. Mandelbrot studying fractals [8,21], and A. Kolmogorov introducing algorithmic complexity [22].

In this paper, I present a conceptual framework for QL which borrows heavily from information theory and yields a macroscopic view onto human communication. Because of the exposed connections among mutual information, power laws, and the emergence of hierarchical patterns in data, I suppose that my results may be interesting for researchers in the domain of complex systems, who consider the power-law growth of mutual information a hallmark of complex behavior [5,6,23]. How to combine the 'macroscopic' QL and the 'microscopic' non-QL into a larger theory of language is a different problem. I consider it worth pursuing but harder.

The central point of my paper is linking Herdan's law, an empirical power law for the number of different words [24-27], with an intuitive idea that texts describe various facts in a highly repetitive and mostly logically consistent way. Thus I will discuss a proposition that can be expressed informally as follows:

(H) If a text of length n describes n^β independent facts in a repetitive way, where β ∈ (0,1), then the text contains at least n^β / log n different words.

Proposition (H) has been formalized and proved by myself in a series of mathematical definitions and theorems [11-13]. It holds under an appropriate quantification over n, which is a combination of an upper and a lower limit over n.

Let me note that Proposition (H) can also be linked to the relaxed Hilberg hypothesis. This conjecture says that the (algorithmic) mutual information between adjacent blocks of text of length n is roughly proportional to n^β [1,3-6,23]. Besides Proposition (H), I have formalized and proved the following two propositions:

(H') If a text of length n describes n^β independent facts in a repetitive way, where β ∈ (0,1), then the mutual information between adjacent blocks of length n exceeds n^β,

and:

(H'') If the mutual information between adjacent blocks of length n exceeds n^β, where β ∈ (0,1), then the text of length n contains at least n^β / log n different words.
The quantifications over n in the formalizations of Propositions (H') and (H'') are analogous to those in Proposition (H). For this reason, Proposition (H) does not follow from the conjunction of Propositions (H') and (H''). All these propositions are, however, true. The significance of the propositions is as follows. On the one hand, Proposition (H') demonstrates that Hilberg's hypothesis can be motivated rationally. On the other hand, Proposition (H'') shows that the hypothesis implies certain empirical regularities, such as Herdan's law, even if there are problems with verifying Hilberg's conjecture directly.

Next, I will introduce the concepts that appear in Propositions (H), (H'), and (H'') and their formal statements. I will also discuss some related problems. The composition of the paper is as follows: In Section II, I introduce the motivating linguistic concepts. In Section III, I discuss the mathematical results. In Section IV, I reflect upon the limitations of these results as a theory of human language or other complex communication systems. Section V contains important remarks for researchers wishing to verify Hilberg's hypothesis experimentally. Section VI concludes the paper.

II. IDEAS IN THE BACKGROUND

Before we embark on discussing formal models, I should introduce the linguistic playground on which the models will be built. First, I will recall empirical laws for the distribution of words. Second, I will introduce grammar-based codes as a method of detecting word boundaries. Third, Hilberg's hypothesis and its generalizations will be presented. In the end, I will discuss the idea of texts that describe infinitely many facts in a highly repetitive and logically consistent way.

A. Zipf's and Herdan's laws

A few famous empirical laws of quantitative linguistics concern the distribution of words. Amongst them, the Zipf-(Mandelbrot) law is the most celebrated [7,8]. According to this law, the word frequency f(w) in a text is an inverse power of the word rank r(w), i.e.,

    f(w) ∝ 1 / (B + r(w))^{1/β}.    (1)

The frequency f(w) of word w is defined as the number of its occurrences in the text, whereas the word rank r(w) is the position of w on the list of words sorted by decreasing frequency. Constant B is positive, whereas constant β ∈ (0,1) is close to 1 for r(w) ≲ 10^3-10^4. For larger ranks, this relationship breaks down and β can drop much closer to 0, depending on the text composition [28,29].

Zipf's law attracts the attention of many theoreticians wishing to explain it. The most famous explanation of Zipf's law is given by the 'monkey-typing' model. In this explanation, the text is assumed to be a sequence of independent identically distributed (IID) variables taking values of both letters and spaces and, as a result, the Zipf-Mandelbrot law is satisfied for strings of letters delimited by spaces [8,9]. Other known explanations involve, e.g., multiplicative processes [30,31], games [32], and information-theoretic arguments [33-36].

In this paper, we will focus on a certain corollary of the Zipf-Mandelbrot law, namely a relationship between the length of the text and the number of different words therein. This relationship is usually called Herdan's or Heaps' law in the English literature [24-27]. It takes the form of an approximate empirical power law

    V ∝ n^β,    (2)

where V is the number of different words and n is the text length (in characters).
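As a quick empirical illustration of law (2), added here rather than taken from the original papers, one can count the distinct words in growing prefixes of a corpus and fit the exponent β by least squares in log-log scale; the file name corpus.txt and the crude alphabetic tokenization are placeholder assumptions:

    import re
    import numpy as np

    def herdan_curve(text, num_points=20):
        """Count the number V of distinct words in growing prefixes.

        Words are approximated as maximal alphabetic strings; prefix
        lengths n are measured in characters, spaced logarithmically.
        """
        lengths = np.unique(
            np.logspace(2, np.log10(len(text)), num_points).astype(int))
        V = [len(set(re.findall(r"[a-z]+", text[:n].lower()))) for n in lengths]
        return lengths, np.array(V)

    text = open("corpus.txt", encoding="utf8").read()  # placeholder corpus
    n, V = herdan_curve(text)
    beta, _ = np.polyfit(np.log(n), np.log(V), 1)      # fit V ~ C * n**beta
    print(f"estimated Herdan exponent: {beta:.2f}")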
We can see that (2), up to a multiplicative logarithmic term, appears in the conclusions of Propositions (H) and (H''). The Herdan-Heaps law can be inferred from the Zipf-Mandelbrot law assuming certain regularity of text growth [37,38]. In particular, if law (1) were satisfied exactly, then (2) would hold automatically. We have the following proposition:

Proposition 1. Let N be the number of all words in the text and V be the number of different words. If (1) is satisfied with B = 0, β constant, and f(w)/N constant with respect to N for the most frequent word w, then we have V ∝ N^β [38].

Proof: The least frequent word u has rank r(u) = V and frequency f(u) = 1 ∝ V^{-1/β}. Hence the proportionality constant in (1) equals V^{1/β}. Thus for the most frequent word w we have f(w) = V^{1/β}. Because f(w)/N was assumed constant, we obtain V ∝ N^β.

In reality, relationships (1) and (2) are quite inexact, and the best fit for (2) yields a β smaller than the one for (1).

In this article, I propose another, probabilistic explanation of Herdan's law. In any such explanation, two postulates are adopted more or less explicitly. The first postulate concerns how words are delimited in the text. The second postulate concerns what kind of stochastic process is suitable for modeling the text. My explanation of Herdan's law targets two modeling challenges:

1. Words can be delimited in the text even when spaces are absent.
2. Texts refer to many facts unknown a priori to the reader, but they usually do so in a consistent and repetitive way.

The necessary notions will be explained one by one. First, we will revisit the concept of a word. Second, we will address the properties of texts.

B. Detecting word boundaries with grammar-based codes

In this section, we will discuss how to delimit words in a text and, consequently, how to count their number. If we agree that texts are sequences of characters taking values of both letters and delimiters (such as spaces), the most obvious choice, suggested by the orthographies of many languages, is to define words as strings of letters separated by delimiters. There are, however, kinds of texts or languages where words are not separated by delimiters on a regular basis (ancient Greek, modern Chinese, or spoken English as a speech signal).

Seeking an absolute criterion for word boundaries, linguists observed that strings of characters that are repeated within the text significantly many times often correspond to whole words or set phrases (multi-word expressions) like United States [39,40]. Another important insight is that the number of so detected 'words' or 'phrases' is a thousand times larger in texts produced by humans than in texts generated by IID sources [41].

A particularly convenient way to detect words, or sufficiently often repeated strings, is to use a grammar-based code that minimizes the length of a certain text encoding [42,43]. Grammar-based codes compress strings by transforming them first into special grammars, called admissible grammars [44], and then encoding the grammars back into strings according to a fixed simple method. An admissible grammar is a context-free grammar that generates a singleton language {w} for some string w [44]. In an admissible grammar there is exactly one rule per nonterminal symbol, and the nonterminals can be ordered so that they are rewritten onto strings of strictly succeeding symbols [44,45]. A particular example of an admissible grammar is as follows:

    A_1 → How much A_5 w A_4 A_2 A_3, if A_2 c A_4 A_3 A_5 ?
    A_2 → a A_5 A_3
    A_3 → chuck
    A_4 → ould
    A_5 → wood

where A_1 is the initial symbol and the other A_i are secondary nonterminal symbols. If we start the derivation with symbol A_1 and follow the rewriting rules, we obtain the text of a verse:

    How much wood would a woodchuck chuck, if a woodchuck could chuck wood?
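To see concretely how such a grammar generates its singleton language, here is a minimal added sketch that expands the grammar above; the dictionary representation and the exact spacing of the terminal strings are my own guesses at the intended tokenization:

    # Right-hand sides are lists whose items are either nonterminals
    # (keys of the dict) or terminal strings, spaces included.
    grammar = {
        "A1": ["How much ", "A5", " w", "A4", " ", "A2", " ", "A3",
               ", if ", "A2", " c", "A4", " ", "A3", " ", "A5", "?"],
        "A2": ["a ", "A5", "A3"],
        "A3": ["chuck"],
        "A4": ["ould"],
        "A5": ["wood"],
    }

    def expand(symbol, rules):
        """Rewrite a symbol recursively until only terminals remain."""
        if symbol not in rules:          # terminal string
            return symbol
        return "".join(expand(s, rules) for s in rules[symbol])

    print(expand("A1", grammar))
    # How much wood would a woodchuck chuck, if a woodchuck could chuck wood?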
Although this cannot be seen in so short a text, secondary nonterminals A_i often correspond to words or set phrases in compressions of longer texts. This correspondence is particularly good if it is additionally required that nonterminals be defined as strings of only terminal symbols [43]. For this reason, the number of different words in an arbitrary text will be modeled, in the formalization of Propositions (H) and (H''), by the number of different nonterminals in a certain admissible grammar.

C. Excess entropy and Hilberg's hypothesis

Having partly described how to detect and count "words" in an arbitrary text, let us refine our ideas about texts typically produced by humans. There are justified opinions that such texts result from a very complicated amalgam of deterministic computation and randomness [46], and this amalgam can be realized very differently in particular texts, as mocked by D. Knuth [47]. To make these intuitions more precise, let us investigate the entropy and algorithmic complexity of texts.

Let us begin with entropy. For a probability space (Ω, J, P), the entropy of a discrete random variable X is defined as

    H_P(X) := -E_P log P(X),    (3)

where E_P is the expectation with respect to P and random variable P(X) takes value P(X = x) for X = x. Subsequently, for a discrete stationary process (X_i)_{i∈Z}, we define the n-symbol entropy

    H_µ(n) := H_P(X_1^n),    (4)

where X_n^m = (X_i)_{n≤i≤m} are blocks of variables and µ = P((X_i)_{i∈Z} ∈ ·) denotes the distribution of (X_i)_{i∈Z} (i.e., µ(A) = P((X_i)_{i∈Z} ∈ A)). On the one hand, if the process is purely random, i.e., the X_i are IID variables, then H_µ(n) ∝ n. On the other hand, we have H_µ(n) = const if the process is in a sense deterministic, i.e., X_i = f(X_1^{i-1}).

Intuitively, texts written by humans are neither deterministic nor purely random. This corresponds to a particular behavior of the entropy H_µ(n). Some insight into this behavior can be obtained by asking people to guess the next character of a text given the context of n previous characters. In one of his very first papers on information theory [2], Shannon performed this experiment. As was later observed by Hilberg [1], Shannon's data points obey the approximate relationship

    H_µ(n) ∝ n^β    (5)

for β ≈ 1/2, n ≲ 100, and H_µ(n) being an estimate of the entropy of n consecutive characters rather than the entropy itself. Hilberg supposed that (5) also holds for much larger n, even for n tending to infinity.
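For small n, relationship (5) can be probed with a naive plug-in estimator that counts n-grams in a sample. The added sketch below (corpus.txt is a placeholder) fits the exponent in log-log scale; note that the plug-in estimate becomes severely biased once the number of distinct n-grams approaches the sample size, which is one reason why estimating H_µ(n) for large n is so hard:

    from collections import Counter
    import numpy as np

    def block_entropy(text, n):
        """Plug-in estimate of the n-symbol entropy H(n) in bits."""
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        p = np.array(list(counts.values()), dtype=float)
        p /= p.sum()
        return -(p * np.log2(p)).sum()

    text = open("corpus.txt", encoding="utf8").read()  # placeholder corpus
    ns = np.arange(1, 11)
    H = np.array([block_entropy(text, n) for n in ns])
    beta, _ = np.polyfit(np.log(ns), np.log(H), 1)     # slope in log-log scale
    print(f"fitted exponent in H(n) ~ n**beta: {beta:.2f}")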
Some parallel research on the entropy of texts in natural language suggests that estimates of entropy depend heavily on the particular text [48,49] and that Shannon's guessing method does not give precise estimates of entropy for large n [50]. Thus Hilberg's conjecture (5) should be modified, and other ways of justifying it should be sought.

First of all, let us recall the concept of the entropy rate

    h_µ := lim_{n→∞} H_µ(n)/n.    (6)

For a stationary process, conjecture (5) implies entropy rate h_µ = 0, which is equivalent to asymptotic determinism, i.e., X_1 = f((X_i)_{i<1}) almost surely. Such asymptotic determinism seems an unrealistic assumption. (But we may be wrong.) Thus let us introduce the block mutual information

    E_µ(n) := I_P(X_1^n; X_{n+1}^{2n}) := H_P(X_1^n) + H_P(X_{n+1}^{2n}) - H_P(X_1^{2n}),    (7)

called the n-symbol excess entropy [6]. E_µ(n) is a convenient measure of the complexity of discrete-valued processes. It vanishes for purely random processes and is bounded for asymptotically deterministic ones. Now let us observe that for a stationary process (X_i)_{i∈Z}, we have H_P(X_{n+1}^{2n}) = H_P(X_1^n), and we obtain

    E_µ(n) ∝ n^β    (8)

if (5) is satisfied. We will call (8) the relaxed Hilberg conjecture. Notice that, unlike in the case of (5), h_µ = 0 does not follow from (8).

Thus, if proportionality (8) were actually satisfied for all n, then texts in natural language could not be produced by generalized 'monkey-typing'. In the generalized 'monkey-typing' model, the text is generated by a finite-state source, a.k.a. a hidden Markov model. Indeed, if the finite-state source has k hidden states, then E_µ(n) ≤ log k [51].

Nonetheless, relationship (8) does not exhaust the problem of reasonable generalizations. Bluntly speaking, it seems impossible to point out a correct reference measure P for texts in natural language [22]. Although researchers in linguistics happen to speak of entropies of a single text [48,49], this is an abuse of concepts, because entropy is a function of a distribution rather than of a text! To render the relaxed Hilberg conjecture for an individual text x_1^n, we should use the prefix algorithmic complexity H(x_1^n) instead of the entropy H_P(X_1^n). Formally, the prefix complexity H(x_1^n) is defined as the length of the shortest self-delimiting program that generates text x_1^n [52]. Thus, for the algorithmic mutual information

    I(x_1^n; x_{n+1}^{2n}) := H(x_1^n) + H(x_{n+1}^{2n}) - H(x_1^{2n}),    (9)

we will call the relationship

    I(x_1^n; x_{n+1}^{2n}) ∝ n^β    (10)

the relaxed Hilberg conjecture for individual texts. This relationship makes quite a lot of sense, because in the probabilistic setting we have

    H_P(X_1^n) ≤ E_P H(X_1^n) ≤ H_P(X_1^n) + C_P(n)    (11)

for any computable measure P and C_P(n) = c_P + 2 log n with c_P < ∞ [53]. We recall that a measure P is called computable when P(X_1^n) can be computed given X_1^n by a fixed Turing machine. Under this assumption, law (8) follows, up to a logarithmic correction, if proportionality (10) holds almost surely for a fixed proportionality constant.
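As a toy check of the contrast drawn above (an added aside, using only the chain rule for Markov chains), the excess entropy of any stationary Markov chain is constant in n, in line with the bound E_µ(n) ≤ log k for finite-state sources:

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def excess_entropy_markov(P, n):
        """E(n) = 2 H(n) - H(2n) for a stationary Markov chain.

        P[i, j] = P(X_{t+1} = j | X_t = i); the stationary distribution
        pi is the left Perron eigenvector of P. By the chain rule,
        H(m) = H(pi) + (m - 1) h, hence E(n) = H(pi) - h for every n.
        """
        evals, evecs = np.linalg.eig(P.T)
        pi = np.real(evecs[:, np.argmax(np.real(evals))])
        pi /= pi.sum()
        h = sum(pi[i] * entropy(P[i]) for i in range(len(pi)))
        H = lambda m: entropy(pi) + (m - 1) * h
        return 2 * H(n) - H(2 * n)

    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    for n in (1, 10, 100, 1000):
        print(n, excess_entropy_markov(P, n))   # the same value for every n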
D. Highly repetitive descriptions of a random world

In this subsection, I want to discuss the question of why texts typically produced by humans diverge from both simple randomness and determinism. This will provide a justification for Hilberg's conjecture. I may point out three plausible reasons:

A. Texts attempt to describe an infinite collection of independent facts that concern either an immutable objective reality or an evolving historical heritage.

B. For some reasons, the immutable objective reality and the historical heritage are described in a highly repetitive and mostly logically consistent way.

C. Any fact about the immutable objective reality can be inferred correctly given sufficiently many texts, according to a fixed inference method, regardless of where we start reading.

As I will show in Subsection III A, the conjunction of propositions A.-C. implies Hilberg's conjecture by a formalization of Proposition (H'). Thus let us inspect these statements more closely.

As for postulate A., there exists a collection of facts about an immutable objective reality which is infinite and algorithmically random. A particular collection of that kind is given by the binary expansion of the halting probability Ω. The expansion of Ω is an algorithmically random sequence and represents a large body of mathematical knowledge in its most condensed form [52,54]. (A sequence (x_i)_{i∈N} is called algorithmically random if its algorithmic complexity satisfies H(x_1^n) ≳ n.) Other plausible choices of immutable and algorithmically random sequences are binary expansions of compressed physical constants.

In contrast, the evolving historical heritage, which is primarily described in texts, admits a broader interpretation. Namely, this heritage encompasses both the culture and the present state of the physical world. We can also agree that the present state of the physical world contains all material aspects of the culture. To make these abstract statements more tangible, let us mention a few examples of what falls under the evolving historical heritage. The scope of culture covers: the vocabulary and grammars of particular languages, fictitious worlds described in novels, and all the heritage of arts, humanities, science, and engineering. The present state of the physical world covers also all facts of biology, geography, and astronomy, including those yet unknown.

To support postulate B., let us consider why the facts mentioned in texts are described in a highly repetitive and mostly logically consistent way. This has more to do with human nature than with properties of the described world itself. As a plausible reason, I suppose that human society develops communication structures to maintain a larger body of knowledge than any individual could manage on his or her own. Thus the primary cause of repetition is probably the requirement that knowledge be passed from generation to generation. Moreover, I suppose that any human mind needs constant restimulation to remember and reorganize the possessed knowledge. The result is that, in fiction as well as in scientific writings, people prefer logically consistent and directed narrations. This consistency also implies repetition.

To argue in favor of postulate C., let us observe the following. In the course of time, the historical heritage undergoes distributed creation, accumulation, description, and lossy transmission from text creators to text addressees. This should be contrasted with the immutable objective reality, which can be discovered and described independently by successive generations of text creators.
Thus it does not sound weird that every fact about the immutable objective reality is ultimately described in some text and repeated infinitely many times afterwards. Moreover, there should exist a fixed method of interpreting texts in natural language to infer these facts. Such a faculty is called human language competence in the linguistic jargon, and it allows knowledge to be passed from generation to generation.

III. MATHEMATICAL SYNTHESIS

The ideas presented in the previous section will now be synthesized as an assortment of theorems and toy examples of stochastic processes. This can be called a formalization of Propositions (H), (H'), and (H''), mentioned in the Introduction. Namely, in a series of theorems I will link Hilberg's conjecture with Herdan's law for the vocabulary size of admissible grammars and with a power law for the number of facts that can be inferred from a given text. Afterwards, I will demonstrate a few simple processes that exhibit all three laws. For simplicity of argumentation, I will discuss the probabilistic Hilberg hypothesis (8) rather than the algorithmic one (10). Respectively, both texts and facts will be modeled by random variables.

In the following, the symbol N denotes the set of positive integers. For a countable alphabet X, the set of nonempty strings is X^+ := ⋃_{n∈N} X^n and the set of all strings is X^* := X^+ ∪ {λ}, where λ stands for the empty string. The length of a string w ∈ X^* is written |w|.

A. Definitions and theorems

In this subsection, I will show how Proposition (H) can be formalized. First, the model of texts and facts is made precise. Second, the model of words is elaborated. Third, I present three previously proved theorems [12] that link Hilberg's conjecture and these two models.

Let (X_i)_{i∈Z} be a discrete stochastic process with variables X_i : Ω → X, where Ω denotes the event space. Process (X_i)_{i∈Z} models an infinite text, where the X_i are characters if X is finite or sentences if X is infinite. Moreover, let Z_k : Ω → {0,1}, where k ∈ N, be equidistributed IID binary variables. Variables Z_k model facts described in the text. Their values (1 = true and 0 = false) can be interpreted as logical values of certain systematically enumerated independent propositions.

More specifically, let us assume that each fact Z_k can be inferred from a half-infinite text according to a fixed method if we start reading it from an arbitrary position, as in postulate C. from Subsection II D. The method to infer these facts will be formalized as certain functions s_k which, given a text, predict whether the k-th fact is true or false. This leads to the following definition.

Definition 1. A stochastic process (X_i)_{i∈Z} is called strongly nonergodic if there exists an IID binary process (Z_k)_{k∈N} with marginal distribution

    P(Z_k = 0) = P(Z_k = 1) = 1/2    (12)

and functions s_k : X^* → {0,1}, where k ∈ N, such that

    lim_{n→∞} P(s_k(X_{t+1}^{t+n}) = Z_k) = 1,  ∀ t ∈ Z, ∀ k ∈ N.    (13)

In the definition above, the facts Z_k are fixed for a given realization of (X_i)_{i∈Z}, but they can be very different for different realizations.
I suppose that such probabilistic modeling of both texts and facts reflects some properties of language, where the reality described in texts is most often created at random during text generation and recalled afterwards. Under this assumption, I will derive an average-case result.

Strong nonergodicity is indeed a stronger condition than nonergodicity. A stationary process is strongly nonergodic when there exists a continuous random variable Θ : Ω → (0,1) measurable with respect to the shift-invariant algebra [10]. Such a variable is an example of a parameter in terms of Bayesian statistics. Taking Θ = Σ_{k=1}^∞ 2^{-k} Z_k corresponds to a uniform prior distribution on Θ.

The number of facts described in the text X_1^n will be identified with the number of Z_k's that may be predicted with probability at least δ given X_1^n. That is, this number is understood as the cardinality card U_δ(n) of the set

    U_δ(n) := {k ∈ N : P(s_k(X_1^n) = Z_k) ≥ δ}.    (14)

There is also another condition on the process (X_i)_{i∈Z}, which is stronger than requiring the entropy rate h_µ > 0.

Definition 2. A process (X_i)_{i∈Z} is called a finite-energy process if

    P(X_{t+|w|+1}^{t+|wu|} = u | X_{t+1}^{t+|w|} = w) ≤ K c^{|u|}

for all t ∈ Z, all u, w ∈ X^*, and certain constants c < 1 and K, as long as P(X_{t+1}^{t+|w|} = w) > 0.

The term "finite-energy process" was coined by Shields [55]. We are unaware of the motivation for this name.

Now let us discuss the adopted model of words. It uses the admissible grammars mentioned in Subsection II B. A function Γ such that Γ(w) is a grammar generating the language {w} for each string w ∈ X^+ is called a grammar transform [44]. Any such grammar Γ(w) is admissible and is given by its set of production rules

    Γ(w) = { A_1 → α_1, A_2 → α_2, ..., A_n → α_n },    (15)

where A_1 is the start symbol, the other A_i are secondary nonterminals, and the right-hand sides of the rules satisfy α_i ∈ ({A_{i+1}, A_{i+2}, ..., A_n} ∪ X)^*. The number of distinct nonterminal symbols in grammar (15) will be called the vocabulary size of Γ(w) and denoted

    V[Γ(w)] := card {A_1, A_2, ..., A_n} = n.    (16)

In the following, let us consider the vocabulary size of admissibly minimal grammar transforms, which were defined exactly in the previous paper [12]. The formal definition is too long to quote here but, briefly speaking, admissibly minimal grammar transforms minimize a certain nice length function of grammars. A simple example of a grammar length function is the Yang-Kieffer length

    |Γ(w)| := Σ_{i=1}^n |α_i|    (17)

for grammar (15), where |α_i| is the length of the right-hand side of rule A_i → α_i [45]. In our application we use a slightly different length function ||Γ(w)||, which measures the length of Γ(w) after a certain reversible binary encoding, and we choose a grammar transform that minimizes ||Γ(w)|| for a given string w. Nonterminals of these so-called admissibly minimal grammar transforms often correspond to words in the linguistic sense [42,43]. Thus we stipulate that the vocabulary size of an admissibly minimal grammar is close to the number of distinct words in the text.
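For the woodchuck grammar of Subsection II B, quantities (16) and (17) can be computed directly from the rule set. This added sketch reuses the dictionary representation introduced there; a nonterminal counts as one symbol and a terminal string of length m as m symbols:

    grammar = {
        "A1": ["How much ", "A5", " w", "A4", " ", "A2", " ", "A3",
               ", if ", "A2", " c", "A4", " ", "A3", " ", "A5", "?"],
        "A2": ["a ", "A5", "A3"],
        "A3": ["chuck"],
        "A4": ["ould"],
        "A5": ["wood"],
    }

    def vocabulary_size(grammar):
        """V[Gamma(w)]: the number of distinct nonterminals, eq. (16)."""
        return len(grammar)

    def yang_kieffer_length(grammar):
        """|Gamma(w)|: the total length of the right-hand sides, eq. (17)."""
        return sum(1 if item in grammar else len(item)
                   for rhs in grammar.values() for item in rhs)

    print(vocabulary_size(grammar))     # 5
    print(yang_kieffer_length(grammar))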
The formalization of Proposition (H) is as follows:

Theorem 1. Let (X_i)_{i∈Z} be a stationary finite-energy strongly nonergodic process over a finite alphabet X. If

    lim inf_{n→∞} card U_δ(n) / n^β > 0    (18)

holds for some β ∈ (0,1) and δ ∈ (1/2, 1), then

    lim sup_{n→∞} E_P V[Γ(X_1^n)] / (n^β (log n)^{-p}) > 0,  p > 1,    (19)

for any admissibly minimal grammar transform Γ.

There are also two similar theorems that link inequalities (18) and (19) with Hilberg's conjecture. These are the formalizations of Propositions (H') and (H''), respectively.

Theorem 2. Let (X_i)_{i∈Z} be a stationary strongly nonergodic process over a finite alphabet X. If (18) holds for some β ∈ (0,1) and δ ∈ (1/2, 1), then we have

    lim sup_{n→∞} E_µ(n) / n^β > 0.    (20)

Theorem 3. Let (X_i)_{i∈Z} be a stationary finite-energy process over a finite alphabet X. Assume that

    lim inf_{n→∞} E_µ(n) / n^β > 0    (21)

holds for some β ∈ (0,1). Then we have (19) for any admissibly minimal grammar transform Γ.

Theorem 1 does not follow from Theorems 2 and 3, because (20) is a weaker condition than (21). However, all these propositions are true, and their proofs are almost simultaneous [12]. By an easy argument, using Lemma 1 from Subsection V B, it can also be shown that the n-symbol excess entropy E_µ(n) in Theorems 1-3 may be replaced with the expected algorithmic information E_P I(X_1^n; X_{n+1}^{2n}).

B. The zoo of Santa Fe processes

Now I will present a few stochastic processes to which my theorems may be applied [10-13]. These processes are merely simple mathematical models that satisfy the hypotheses of Theorems 1, 2, and 3. They model some aspects of human communication, but they do not pretend to be very realistic models of language. The purpose of these constructions is to enhance our imagination and to show that the hypotheses of the theorems can be satisfied.

Quite early in my investigations, I came across the following process. Let the alphabet be X = N × {0,1} and let the process (X_i)_{i∈Z} have the form

    X_i := (K_i, Z_{K_i}),    (22)

where (Z_k)_{k∈N} and (K_i)_{i∈Z} are probabilistically independent. Moreover, let the Z_k be IID with marginal distribution (12), and let (K_i)_{i∈Z} be an ergodic stationary process such that P(K_i = k) > 0 for every natural number k ∈ N. Under these assumptions, it can be demonstrated that (X_i)_{i∈Z} forms a strongly nonergodic process [10]. I will call the process (X_i)_{i∈Z} with variables X_i as in (22) the Santa Fe process, because I discovered it during my visit to the Santa Fe Institute.

Santa Fe process (22) can be interpreted as a sequence of statements which describe a fixed random object (Z_k)_{k∈N} in a repetitive and consistent way. Each statement X_i = (k, z) reveals both the address k of a random bit of (Z_k)_{k∈N} and its value Z_k = z. The description is consistent: if two statements X_i = (k, z) and X_j = (k', z') describe the same bit (k = k'), then they always assert an identical value (z = z'). Moreover, we can see that the revelation of the bit address is important to ensure the existence of functions s_k such that (13) holds. Indeed, we may take

    s_k(v) := 0 if (k,0) ⊑ v and (k,1) is not a substring of v,
              1 if (k,1) ⊑ v and (k,0) is not a substring of v,
              2 else,    (23)

where we write u ⊑ v when a sequence v contains the string u as a substring.
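The following added simulation samples the Santa Fe process and applies the predictors (23), estimating card U_δ(n) empirically; truncating the power law at k_max and scanning only facts k ≤ 200 are numerical conveniences, not parts of the construction:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_santa_fe(n, beta=0.7, k_max=10**5):
        """One realization of the Santa Fe process, eq. (22).

        K_i are IID with P(K = k) ~ k**(-1/beta), truncated at k_max
        as an approximation of eq. (25); Z_k are IID fair bits.
        """
        k = np.arange(1, k_max + 1)
        p = k ** (-1.0 / beta)
        p /= p.sum()
        Z = rng.integers(0, 2, size=k_max + 1)   # Z[k] is the k-th fact
        K = rng.choice(k, size=n, p=p)
        return K, Z

    def s_k(K, Z_seen, k):
        """Predictor (23): decide fact k from the statements read so far."""
        seen0 = np.any((K == k) & (Z_seen == 0))
        seen1 = np.any((K == k) & (Z_seen == 1))
        if seen0 and not seen1:
            return 0
        if seen1 and not seen0:
            return 1
        return 2                                 # undecided

    n, delta, trials = 10**4, 0.9, 50
    hits = np.zeros(200)                         # track facts k = 1..200
    for _ in range(trials):
        K, Z = sample_santa_fe(n)
        Z_seen = Z[K]                            # bit revealed by each statement
        for k in range(1, 201):
            hits[k - 1] += (s_k(K, Z_seen, k) == Z[k])
    print("card U_delta(n) is approximately", np.sum(hits / trials >= delta))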
For these functions s_k, I have shown [12] that the cardinality of the set U_δ(n) obeys

    card U_δ(n) ≥ ( n / (-ζ(β^{-1}) log(1 - δ)) )^β    (24)

for process (22) if the variables K_i are IID and power-law distributed,

    P(K_i = k) = k^{-1/β} / ζ(β^{-1}),  β ∈ (0,1),    (25)

where ζ(α) = Σ_{k=1}^∞ k^{-α} is the zeta function.

In contrast, it can be seen that the cardinality of the set U_δ(n) is of order log n if (X_i)_{i∈Z} is a Bernoulli process with binary variables X_i : Ω → {0,1}, a random parameter Θ = Σ_{k=1}^∞ 2^{-k} Z_k, and conditional distribution

    P(X_1^n | Θ) = Π_{i=1}^n Θ^{X_i} (1 - Θ)^{1-X_i}.    (26)

Next, let us discuss a certain modification of the Santa Fe process. As I have said before, facts mentioned repeatedly in texts fall roughly under two types: (a) facts about objects that do not change in time (like mathematical or physical constants), and (b) facts about objects that evolve with a varied speed (like culture, language, or geography). An attempt to model the latter phenomenon leads to processes that are mixing, as we will see now.

In the following, let us replace the individual variables Z_k in the Santa Fe process with Markov chains (Z_{ik})_{i∈Z}. The Markov chains are formed by iterating a binary symmetric channel. Accordingly, let us put

    X_i = (K_i, Z_{i,K_i}),    (27)

where the processes (K_i)_{i∈Z} and (Z_{ik})_{i∈Z}, k ∈ N, are independent and distributed as follows. First, the variables K_i are distributed according to formula (25), as before. Second, each process (Z_{ik})_{i∈Z} is a Markov chain with marginal distribution

    P(Z_{ik} = 0) = P(Z_{ik} = 1) = 1/2    (28)

and cross-over probabilities

    P(Z_{ik} = z | Z_{i-1,k} = 1 - z) = p_k,  z ∈ {0,1}.    (29)

The random object (Z_k)_{k∈N} described by the original Santa Fe process (22) does not evolve, or rather, no bit Z_k is ever forgotten once revealed. In contrast, the random object (Z_{ik})_{k∈N} described by the modified Santa Fe process (27) is a function of the time i, and the probability that the k-th bit flips at a given instant equals p_k. For p_k = 0, process (27) collapses to process (22).

As I have shown previously [13], the modified Santa Fe process defined in (27) is mixing for p_k ∈ (0,1), and thus ergodic. Moreover, for p_k ∈ [0,1], I have also demonstrated the asymptotics

    lim_{n→∞} E_µ(n) / n^β = (2 - 2^β) Γ(1 - β) / [ζ(β^{-1})]^β    (30)

if lim_{k→∞} p_k / P(K_i = k) = 0 and the K_i obey law (25) [13]. In the equation above, Γ(z) = ∫_0^∞ t^{z-1} e^{-t} dt is the gamma function. Formula (30) follows from approximating an exact expression for E_µ(n) with an integral. Note that (30) holds also in the case of the original strongly nonergodic Santa Fe process (22).
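A realization of the modified process can be sampled analogously. In this added sketch, the choice p_k = P(K = k)^2 is arbitrary, made only to satisfy the condition lim_k p_k / P(K_i = k) = 0 of asymptotics (30):

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_modified_santa_fe(n, beta=0.7, k_max=10**4):
        """One realization of the modified Santa Fe process, eq. (27)."""
        k = np.arange(1, k_max + 1)
        p = k ** (-1.0 / beta)
        p /= p.sum()                        # truncated version of eq. (25)
        p_flip = p ** 2                     # cross-over probabilities p_k
        Z = rng.integers(0, 2, size=k_max)  # Z[k-1]: current value of fact k
        K = rng.choice(k, size=n, p=p)
        X = []
        for i in range(n):
            flips = rng.random(k_max) < p_flip   # every fact evolves one step
            Z = np.where(flips, 1 - Z, Z)
            X.append((int(K[i]), int(Z[K[i] - 1])))
        return X

    X = sample_modified_santa_fe(1000)
    print(X[:5])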
Neither of the processes defined so far is a process over a finite alphabet, as required in Theorems 1-3. To construct the desired processes over a ternary alphabet, I have used stationary (variable-length) coding of processes over one alphabet into processes over another alphabet. This transformation preserves stationarity, (non)ergodicity, and entropy, to some extent [11,13]. Despite the elaborate notation, the idea of this transformation is quite simple.

First, let a function f : X → Y^*, called a coding function, map symbols from alphabet X into strings over another alphabet Y. We define its extension to doubly infinite sequences, f^Z : X^Z → Y^Z ∪ (Y^* × Y^*), as

    f^Z((x_i)_{i∈Z}) := ... f(x_{-1}) f(x_0) . f(x_1) f(x_2) ...,    (31)

where x_i ∈ X and the boldface dot separates the 0-th and the first symbol. Then, for a stationary process (X_i)_{i∈Z} with variables X_i : Ω → X, we define the process

    (Y_i)_{i∈Z} := f^Z((X_i)_{i∈Z})    (32)

with variables Y_i : Ω → Y.

In the following application, let us assume the infinite alphabet X = N × {0,1}, the ternary alphabet Y = {0,1,2}, and the coding function

    f(k, z) = b(k) z 2,    (33)

where b(k) ∈ {0,1}^+ is the binary representation of the natural number k stripped of the leading digit 1.

Transformation (32) does not preserve stationarity in general, but the process (Y_i)_{i∈Z} is asymptotically mean stationary (AMS) for process (27) and coding function (33) [11]. Then, for the distribution ν = P((Y_i)_{i∈Z} ∈ ·) and the shift operation T((y_i)_{i∈Z}) := (y_{i+1})_{i∈Z}, there exists a stationary measure

    ν̄(A) := lim_{n→∞} (1/n) Σ_{i=0}^{n-1} ν ∘ T^{-i}(A),    (34)

called the stationary mean of ν [11,56]. It is convenient to suppose that the probability space (Ω, J, P) is rich enough to support a process (Ȳ_i)_{i∈Z} with the distribution ν̄ = P((Ȳ_i)_{i∈Z} ∈ ·). Process (Ȳ_i)_{i∈Z} will be called the stationary coding of (X_i)_{i∈Z}.

Processes (X_i)_{i∈Z}, (Y_i)_{i∈Z}, and (Ȳ_i)_{i∈Z} have isomorphic shift-invariant algebras for some nice coding functions, called synchronizable injections [11]. Coding function (33) is an instance of such an injection. Thus the processes (Y_i)_{i∈Z} and (Ȳ_i)_{i∈Z} obtained from process (27) using (33) are nonergodic if p_k = 0 and ergodic if p_k ∈ (0,1).

Now let us consider the block mutual information for the stationary coding (Ȳ_i)_{i∈Z} of process (27) using coding function (33). I have shown [13] that

    lim inf_{m→∞} E_ν̄(m) / (m log^{-1} m)^β > 0    (35)

for E_ν̄(m) = I_P(Ȳ_{1:m}; Ȳ_{m+1:2m}) and cross-over probabilities p_k ≤ P(K_i = k). This bound should be contrasted with the inequality

    lim sup_{m→∞} E_ν̄(m) / m^β > 0,    (36)

which follows for p_k = 0 [11,12]. It is an interesting open problem whether (36) can be generalized for p_k > 0.

Another interesting open problem concerns the question whether the stationary coding is a finite-energy process. This property is assumed in Theorems 1 and 3 to bound the length of the longest repeat and hence to bound the vocabulary size [12]. I have shown that the process (Ȳ_i)_{i∈Z} is a finite-energy process for β > 0.7728... and p_k = 0 [11], but I wonder whether this also holds for other exponents β and cross-over probabilities p_k.
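Before leaving this construction, here is a small added sketch of coding function (33); the comment about the role of the digit 2 as a terminator reflects the fact, stated above, that (33) is a synchronizable injection:

    def b(k):
        """Binary representation of k >= 1 stripped of the leading 1."""
        return format(k, "b")[1:]

    def f(k, z):
        """Coding function (33): map a statement (k, z) to a ternary string."""
        return b(k) + str(z) + "2"

    # The digit 2 terminates each codeword, which lets a reader
    # resynchronize after starting at an arbitrary position.
    for k, z in [(1, 0), (2, 1), (5, 1), (12, 0)]:
        print((k, z), "->", f(k, z))
    # (1, 0) -> 02
    # (2, 1) -> 012
    # (5, 1) -> 0112
    # (12, 0) -> 10002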
IV. AFTERTHOUGHTS FOR THEORETICIANS

The translation of abstract mathematical results back into linguistic reality can be challenging. In the following, I want to share a few remarks about the theoretical limitations of my constructions as models of natural language. This part of the paper is born of typical comments I receive about my model.

A. What are those 'facts'?

Many people to whom I have presented the concept of Santa Fe processes ask the question: "What are those 'facts'?" Whereas in models (22) and (27) the facts are just some binary variables, I tried to interpret these variables in Section II D as independent propositions about particular complicated infinite random objects, consistently described in the texts. These objects might be static, like a mathematical or physical constant, or might evolve slowly, like cultural heritage. However, the identification of the sequence of independent facts described in actual texts in natural language is left as a matter of future research.

This does not mean that nothing can be said about the interpretation of facts at the moment. Let me make an important remark. In my model, the probability of mentioning independent propositions in texts obeys a power law. If the same applies to natural language, it seems unlikely that the mentioned facts are the binary digits of the halting probability Ω, which has the appealing property of representing a large body of mathematical knowledge in a concise form [52,54]. Although the digits of Ω have been proved to be in a sense independent (i.e., algorithmically random), I suppose that information relayed by humans in a repetitive way is mostly unrelated to Ω, because human beings do not have supernatural powers to guess the bits of Ω at a power-law rate. The facts that are usually mentioned in texts should concern 'more everyday' objects.

B. Are facts and words the same?

Another type of reaction I have heard is: "But facts and words are the same, so your result about the implication of power laws for them is a tautology!" My short answer to this criticism is: "Words and facts are very different entities, so my result is nontrivial." To support this reply, let us notice the following. F. de Saussure made the famous observation that a linguistic sign is a pair of a word (i.e., a string of characters) and a meaning (roughly, an object to which the word refers) [57]. To a large extent, the mapping between words and objects is one-to-one. Therefore,

    ⟨number of referred objects⟩ ≈ ⟨number of words⟩.

In contrast, I have claimed a relationship

    ⟨number of words⟩ ≳ ⟨number of independent facts⟩ / log⟨length of text⟩.

This inequality can be strict, because objects (say, things, concepts, qualities, or activities) are different entities than facts (i.e., propositions which assume binary values). The inequality is also nontrivial, because propositions usually consist of more than one word.

C. Finite active vocabulary and division of knowledge

An important limitation of my results is their asymptotic character. I have dealt with asymptotic statements because it is simpler to work out a mathematical model in that case. In reality, however, the number of different words actively used by a single person is of the order of 10^3-10^4. For word ranks below that value, Zipf's law (1) is observed with β ≈ 1. In contrast, for larger word ranks, word frequencies decay exponentially in collections of texts written by a single author [29]. It is not known whether a similar breakdown arises for the vocabulary of admissibly minimal grammars or for Hilberg's law (10). This question is worth investigating.
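The two frequency regimes reported in [28,29] can be probed with the added sketch below; corpus.txt, the rank threshold 10^3, and the naive least-squares fit (which is distorted by the plateau of hapax legomena at the largest ranks) are all placeholder choices:

    from collections import Counter
    import re
    import numpy as np

    text = open("corpus.txt", encoding="utf8").read()   # placeholder corpus
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)

    def zipf_beta(lo, hi):
        """Fit f ~ r**(-1/beta) on ranks lo+1..hi; return beta."""
        slope, _ = np.polyfit(np.log(ranks[lo:hi]), np.log(freqs[lo:hi]), 1)
        return -1.0 / slope

    print("beta below rank 10**3:", zipf_beta(0, 10**3))
    print("beta above rank 10**3:", zipf_beta(10**3, len(freqs)))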
Whereas word frequencies ultimately decay exponentially in single-author collections of texts, a different relationship is observed in multi-author text collections. Namely, for r(w) ≳ 10^3-10^4, the exponent in the Zipf-Mandelbrot law (1) switches to β ≈ 0.4 rather than to β ≈ 0 [28,29]. This phenomenon can be interpreted as developing social structures to maintain and transmit a larger body of knowledge than any individual could manage on his or her own. To model this phenomenon properly, we should assume that finite texts produced by single authors are woven into a discourse (a communication network) of yet unrecognized topology, rather than concatenated into an arbitrary infinite sequence (X_i)_{i∈Z}.

D. How does language differ from maths, music, and DNA?

Texts in natural language are not the only type of complex communication system occurring in nature. Examples of other systems are musical transcripts, mathematical writings, computer programs, and genomes (DNA and RNA). One may investigate quantitative laws obeyed in these systems, just as is done for natural language [58,59]. Moreover, although the notion of a word is connected to linguistics, one may investigate Hilberg's conjecture and statistical properties of admissibly minimal grammars for any symbolic sequence. One may also try to interpret or predict the respective experimental results theoretically.

For example, Ebeling et al. estimated the n-symbol entropy by counting n-tuples in samples of texts in natural language and classical music. They confirmed formula (8) for n ≤ 15 characters, with β ≈ 0.5 for natural language texts and β ≈ 0.25 for classical music transcripts [4,60].

Mathematical writings are another interesting communication system which has not been much researched from a quantitative perspective. I suppose that mathematical writings obey the relaxed Hilberg formula (10) similarly to music or novels in natural language, because all these symbolic sequences are produced by humans for humans, whether for work or entertainment. In either case, I suppose that humans need a large degree of repetition to learn from an information source how to react to it properly. Hence relationship (10) should arise. Intuitively, our abilities to use a particular language, to enjoy a particular style of music, or to work in a branch of mathematics are all learned, and learning is only possible if there are some patterns to be learned.

In contrast, computer programs or DNA are sequences that control the behavior of machines like computers or biological cells. These machines can interpret control sequences in a fixed manner, without learning or loss of synchronization caused by other factors. Hence there is less need for repetition in the control sequences. Consequently, relationship (10) and word-like structures need not arise in compiled computer programs or DNA to such an extent as in typical texts in natural language.

V. AFTERTHOUGHTS FOR EXPERIMENTALISTS

The body of theoretical insight gathered so far asks for experimental verification. One can rightly question to what extent these results may be applied to real texts in natural language. The preparation of a sound experimental study requires much more space than is left in this paper, so let me only sketch a few problems.
There are several levels of experimental verification, which correspond to growing difficulty. The easiest thing to do is to check whether Zipf's or Herdan's law is satisfied for languages in which words are delimited by spaces. There are plenty of articles about that, including observations that Zipf's law breaks down for large ranks [28,29]. In contrast, it is a bit harder to verify whether a power law is satisfied for nonterminals in admissibly minimal grammars. The next task is to verify Hilberg's conjecture for a particular text. In the end, the hardest thing to do is to estimate how many facts are jointly described in two given texts.

In this section, I will touch upon some of these questions. First, I discuss some grammar transforms that should not be used to approximate admissibly minimal grammars or to detect word boundaries. Second, I suggest how mutual information can be efficiently estimated.

A. What is the appropriate grammar-based code?

I have claimed that there is a tight relationship between the number of distinct words in a text in natural language and the number of distinct nonterminal symbols in an admissibly minimal grammar for the text. Although this claim is supported by several computational experiments [39,42,43], the regression between these two quantities has not been surveyed directly so far. In fact, investigating this dependence is hard, because computing admissibly minimal grammars is extremely costly, even approximately [42,43].

In contrast, computationally less intensive grammar transforms may detect spurious structures. For example, irreducible grammar transforms [40,44,45] exhibit a power-law growth of vocabulary size for any source with a positive entropy rate [41]. To see this, let us first observe the inequality

    |G| - V[G] ≤ (V[G] + card X)^2,    (37)

where G is an irreducible grammar [44] and |G| is the Yang-Kieffer length of G [12]. Any irreducible grammar satisfies (37), since any concatenation of two symbols may occur on the right-hand sides of its rules only once.

What happens if an irreducible grammar G compresses a text of length n produced by a stationary source with entropy rate h_µ? Then we obtain |G| ≳ h_µ n / log n from the source coding inequality |G| log |G| ≳ h_µ n and the trivial inequality |G| ≤ n. Combining this with (37) yields the power law

    V[G] ≳ sqrt( h_µ n / log n ) - card X - 1.    (38)

In particular, the higher the entropy rate, the more nonterminals are detected by the grammar.

In my opinion, relationship (38) is an artifact. It arises because irreducible grammars minimize a wrongly chosen length function. If we choose a certain different grammar length function [12,42] then, after complete minimization, we obtain admissibly minimal grammars. The number of nonterminals in approximations of such grammars is a thousand times larger for texts in natural language than for IID sources [41]. Thus I suppose that the vocabulary size of admissibly minimal grammars is lower-bounded by the n-symbol excess entropy rather than by the entropy rate.
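To get a feeling for the size of the artifact (38), here is an added back-of-envelope computation with invented but plausible numbers: an IID source over 27 characters with entropy rate 4 bits per character.

    import math

    def min_vocabulary(n, h, alphabet_size):
        """Lower bound (38) on V[G] for an irreducible grammar G that
        compresses n symbols from a source with entropy rate h bits."""
        return math.sqrt(h * n / math.log(n)) - alphabet_size - 1

    for n in (10**4, 10**6, 10**8):
        print(n, round(min_vocabulary(n, 4.0, 27)))
    # even this memoryless source forces thousands of spurious
    # nonterminals for long texts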
B. How to measure mutual information?

The entropy of a long sequence of random variables is hard to estimate. It can be effectively bounded only from above, and there is some systematic nonnegligible error term. We know that an upper bound for the entropy of a text is given by the expected length of any prefix-free code for the text. However, the length of the shortest effectively decodable code for the text equals the algorithmic complexity of the text, which is greater than the entropy. Thus any attempt to estimate Shannon entropy by universal coding ends up estimating algorithmic complexity.

Now, certain care must be given to distinguishing the Shannon entropy H_P(X_1^n) and the algorithmic complexity H(X_1^n). Although we have inequality (11) for any computable measure P, the difference

    E_P H(X_1^n) - H_P(X_1^n)    (39)

can exceed any sublinear function of n if P is stationary but not computable [61]. Whereas the classical proof of the unboundedness of (39) is difficult [61], a similar result can be obtained using Santa Fe process (22). Let P be the probability measure for process (22) and let

    F = P(· | (Z_k)_{k∈N})    (40)

be its conditional measure [10,62]. The values of the conditional measure F depend on the value of the process (Z_k)_{k∈N}. In particular, the measure F is not computable for algorithmically random (Z_k)_{k∈N} (i.e., F(X_1^n) cannot be computed unless (Z_k)_{k∈N} is given). Further,

    H_P(X_1^n) - E_P H_F(X_1^n) = I_P(X_1^n; (Z_k)_{k∈N}) = O(n^β).    (41)

Moreover, we have the source coding inequality H_P(X_1^n) ≤ E_P H(X_1^n) = E_P E_F H(X_1^n). Hence we obtain

    E_P [E_F H(X_1^n) - H_F(X_1^n)] ≥ O(n^β)    (42)

as the desired result, where an analogue of (39) appears. In other words, universal coding bounds suffer from a large systematic error for the Shannon entropy of noncomputable probability measures.

Now I will show that the error of the coding bounds can be greatly reduced for certain noncomputable measures when we instead bound algorithmic complexity. This opens a way to bounding also algorithmic information, which is a difference of complexities. Suppose that

    P = ∫ P(· | Θ) dP,    (43)

where P(· | Θ) are measures of stationary Markov chains for particular values of the transition probabilities Θ, and P(Θ ∈ ·) is an appropriate prior over all transition probabilities for all possible orders of Markov chains. Measures P(· | Θ) are not computable for algorithmically random Θ. It is likely, however, that the Shannon-Fano code yielded by measure (43) is computable and universal, i.e., -log P(X_1^n) can be computed given X_1^n and

    lim_{n→∞} (1/n) E_Q [-log P(X_1^n)] = h_ν    (44)

holds for any stationary measure ν = Q((X_i)_{i∈Z} ∈ ·).

Consider now the pointwise mutual information

    I_P(x_1^n; x_{n+1}^{2n}) := H_P(x_1^n) + H_P(x_{n+1}^{2n}) - H_P(x_1^{2n}),    (45)

using the pointwise entropy

    H_P(x_1^n) := -log P(X_1^n = x_1^n).    (46)

Shannon-Fano coding yields the inequality

    H(x_1^n) ≤ H_P(x_1^n) + C_P(n),    (47)

where x_1^n is arbitrary and C_P(n) = c_P + 2 log n for a certain constant c_P. Thus we define the loss of pointwise mutual information with respect to algorithmic information as

    L_P(x_1^n; x_{n+1}^{2n}) := I_P(x_1^n; x_{n+1}^{2n}) - I(x_1^n; x_{n+1}^{2n}).    (48)
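In practice, the pointwise quantities (45)-(46) are often approximated by substituting the output length of a real compressor for -log P(x_1^n). The added sketch below does this with lzma; it is a crude stand-in for the Shannon-Fano code of measure (43), not the construction worked out here, but it conveys the mechanics of the estimate:

    import lzma

    def code_length(s: bytes) -> int:
        """Compressed length in bits: an upper bound on complexity."""
        return 8 * len(lzma.compress(s))

    def pointwise_mutual_information(x: bytes, y: bytes) -> int:
        """Estimate I(x; y) = H(x) + H(y) - H(xy) via code lengths."""
        return code_length(x) + code_length(y) - code_length(x + y)

    text = open("corpus.txt", "rb").read()   # placeholder corpus
    n = len(text) // 2
    print(pointwise_mutual_information(text[:n], text[n:2 * n]))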
We will next use the following lemma:

Lemma 1. Consider a function G : N → R such that lim_k G(k)/k = 0 and G(n) ≥ 0 for all but finitely many n. Then for infinitely many n, we have 2G(n) - G(2n) ≥ 0 [12].

If (44) indeed holds, then the equality

    lim_{n→∞} (1/n) E_Q H(X_1^n) = h_ν    (49)

for any stationary measure Q, together with Lemma 1 applied to the function G(n) = E_Q [H_P(X_1^n) - H(X_1^n)] + C_P(n), yields

    lim sup_{n→∞} [ E_Q L_P(X_1^n; X_{n+1}^{2n}) + C_P(n) ] ≥ 2 log 2.    (50)

Hence, as long as (44) holds, pointwise mutual information (45) is an upper estimate of algorithmic information, up to a small logarithmic correction C_P(n).

In certain cases of noncomputable Q, quantity (45) is also a lower estimate of algorithmic information. Observe that P is computable. Hence, for all P-algorithmically random sequences (x_i)_{i∈N}, we have by definition

    inf_{n∈N} [ H(x_1^n) + log P(x_1^n) ] > -∞.    (51)

Moreover, we have the following fact:

Theorem 4. The set of P-algorithmically random sequences is the union of the sets of P(·|Θ)-algorithmically random sequences over all parameters Θ that are algorithmically random against the prior P(Θ ∈ ·) [63-65].

Each of those sets of algorithmically random sequences has the respective full measure, so in a sense it contains all outcomes typical of that measure. Let us fix a sequence (x_i)_{i∈N} belonging to one of these sets. Because P is stationary, by (47) and (51) we obtain

    sup_{n∈N} [ L_P(x_1^n; x_{n+1}^{2n}) - 2 C_P(n) ] < ∞.    (52)

Hence, inequality (52) gives an upper bound on the loss (48) for typical realizations of typical Markov chains.

In view of bounds (50) and (52), pointwise mutual information (45) can be considered an interesting estimate of algorithmic information. It could be used for verifying (or rather falsifying) the relaxed algorithmic Hilberg conjecture (10). The details of computing distribution P and pointwise mutual information (45) will be worked out in another paper, however. Although the sketched distribution P is computable, there are some problems with ensuring efficient computability.

VI. CONCLUSION

In this article, I have presented a wide array of interesting issues that arise when quantitative research on language is combined with fundamental research in information theory. As I have indicated in the Introduction, the presented ideas may be inspiring for more general studies of complex systems because of the demonstrated connections among excess entropy, power laws, and the emergence of hierarchical structures in data.

ACKNOWLEDGMENTS

I wish to thank Jacek Koronacki, Jan Mielniczuk, Jim Crutchfield, and two anonymous referees for valuable comments.

[1] W. Hilberg, "Der bekannte Grenzwert der redundanzfreien Information in Texten — eine Fehlinterpretation der Shannonschen Experimente?" Frequenz 44, 243-248 (1990).
[2] C. Shannon, "Prediction and entropy of printed English," Bell Syst. Tech. J. 30, 50-64 (1951).
[3] W. Ebeling and G. Nicolis, "Entropy of symbolic sequences: the role of correlations," Europhys. Lett. 14, 191-196 (1991).
[4] W. Ebeling and T. Pöschel, "Entropy and long-range correlations in literary English," Europhys. Lett. 26, 241-246 (1994).
[5] W. Bialek, I. Nemenman, and N. Tishby, "Complexity through nonextensivity," Physica A 302, 89-99 (2001).
[6] J. P. Crutchfield and D. P. Feldman, "Regularities unseen, randomness observed: The entropy convergence hierarchy," Chaos 15, 25-54 (2003).
[7] G. K. Zipf, The Psycho-Biology of Language: An Introduction to Dynamic Philology, 2nd ed. (The MIT Press, 1965).
Consider now the pointwise mutual information

$$I_P(x_1^n; x_{n+1}^{2n}) := H_P(x_1^n) + H_P(x_{n+1}^{2n}) - H_P(x_1^{2n}), \qquad (45)$$

defined via the pointwise entropy

$$H_P(x_1^n) := -\log P(X_1^n = x_1^n). \qquad (46)$$

Shannon-Fano coding yields the inequality

$$H(x_1^n) \le H_P(x_1^n) + C_n^P \qquad (47)$$

where $x_1^n$ is arbitrary and $C_n^P = c_P + 2\log n$ for a certain constant $c_P$. Thus we define the loss of pointwise mutual information with respect to algorithmic information as

$$L_P(x_1^n; x_{n+1}^{2n}) := I_P(x_1^n; x_{n+1}^{2n}) - I(x_1^n; x_{n+1}^{2n}). \qquad (48)$$

We will use the following lemma:

Lemma 1. Consider a function $G: \mathbb{N} \to \mathbb{R}$ such that $\lim_k G(k)/k = 0$ and $G(n) \ge 0$ for all but finitely many $n$. Then $2G(n) - G(2n) \ge 0$ for infinitely many $n$. (Otherwise $G(2n) > 2G(n)$ for all large $n$, so iterating from a point $n_0$ with $G(2n_0) > 0$ gives $G(2^k n_0) > 2^{k-1} G(2n_0)$, contradicting $\lim_k G(k)/k = 0$.)

If (44) does hold, then equality

$$\lim_{n\to\infty} \frac{1}{n}\, E_Q H(X_1^n) = h_\nu, \qquad (49)$$

valid for any stationary measure $Q$, and Lemma 1 applied to the function $G(n) = E_Q[H_P(X_1^n) - H(X_1^n)] + C_n^P$ yield

$$\limsup_{n\to\infty}\, \big[ E_Q L_P(X_1^n; X_{n+1}^{2n}) + C_n^P \big] \ge 2\log 2, \qquad (50)$$

since, by the stationarity of $Q$, we have $2G(n) - G(2n) = E_Q L_P(X_1^n; X_{n+1}^{2n}) + C_n^P - 2\log 2$. Hence, as long as (44) holds, pointwise mutual information (45) is an upper estimate of algorithmic information, up to the small logarithmic correction $C_n^P$.

In certain cases of noncomputable $Q$, quantity (45) is also a lower estimate of algorithmic information. Observe that $P$ is computable. Hence, for all $P$-algorithmically random sequences $(x_i)_{i\in\mathbb{N}}$, we have by definition

$$\inf_{n\in\mathbb{N}}\, \big[ H(x_1^n) + \log P(x_1^n) \big] > -\infty. \qquad (51)$$

Moreover, we have the following fact:

Theorem 4. The set of $P$-algorithmically random sequences is the union of the sets of $P(\,\cdot \mid \Theta)$-algorithmically random sequences over all parameters $\Theta$ that are algorithmically random against the prior $P(\Theta \in \cdot\,)$ [63-65].

Each of these sets of algorithmically random sequences has full measure with respect to its own measure, so in a sense it contains all outcomes typical of that measure. Let us fix a sequence $(x_i)_{i\in\mathbb{N}}$ belonging to one of these sets. Because $P$ is stationary, by (47) and (51) we obtain

$$\sup_{n\in\mathbb{N}}\, \big[ L_P(x_1^n; x_{n+1}^{2n}) - 2C_n^P \big] < \infty. \qquad (52)$$

Hence, inequality (52) gives an upper bound on loss (48) for typical realizations of typical Markov chains.

In view of bounds (50) and (52), pointwise mutual information (45) can be considered an interesting estimate of algorithmic information. It could be used to verify (or rather to falsify) the relaxed algorithmic Hilberg conjecture (10). The details of computing distribution $P$ and pointwise mutual information (45) will be worked out in another paper, however: although the sketched distribution $P$ is computable, there are some problems with assuring efficient computability.
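As a toy illustration of estimate (45) (not the construction promised above), the pointwise mutual information between two adjacent blocks can be computed directly from the mixture code sketched earlier; `neg_log_mixture_prob` is the hypothetical helper from that sketch.

```python
def pointwise_mutual_information(x):
    """Pointwise mutual information (45), in bits, between the two
    halves of x, with pointwise entropies (46) given by the toy
    mixture code neg_log_mixture_prob of the previous sketch."""
    n = len(x) // 2
    h_first = neg_log_mixture_prob(x[:n])
    h_second = neg_log_mixture_prob(x[n:2 * n])
    h_both = neg_log_mixture_prob(x[:2 * n])
    return h_first + h_second - h_both

# A strongly regular sequence: the halves share structure, so the
# estimate is visibly positive; for i.i.d. fair coin flips it stays
# near zero, up to logarithmic redundancy terms.
print(pointwise_mutual_information([0, 1] * 512))
```

Whether such an estimator inherits bounds (50) and (52) depends on using the actual mixture (43); the toy stand-in only illustrates the computation.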
VI. CONCLUSION

In this article, I have presented a wide array of interesting issues that arise when quantitative research on language is combined with fundamental research in information theory. As indicated in the Introduction, the presented ideas may also inspire more general studies of complex systems because of the demonstrated connections among excess entropy, power laws, and the emergence of hierarchical structure in data.

ACKNOWLEDGMENTS

I wish to thank Jacek Koronacki, Jan Mielniczuk, Jim Crutchfield, and two anonymous referees for valuable comments.

REFERENCES

[1] W. Hilberg, "Der bekannte Grenzwert der redundanzfreien Information in Texten — eine Fehlinterpretation der Shannonschen Experimente?" Frequenz 44, 243–248 (1990).
[2] C. Shannon, "Prediction and entropy of printed English," Bell Syst. Tech. J. 30, 50–64 (1951).
[3] W. Ebeling and G. Nicolis, "Entropy of symbolic sequences: the role of correlations," Europhys. Lett. 14, 191–196 (1991).
[4] W. Ebeling and T. Pöschel, "Entropy and long-range correlations in literary English," Europhys. Lett. 26, 241–246 (1994).
[5] W. Bialek, I. Nemenman, and N. Tishby, "Complexity through nonextensivity," Physica A 302, 89–99 (2001).
[6] J. P. Crutchfield and D. P. Feldman, "Regularities unseen, randomness observed: The entropy convergence hierarchy," Chaos 13, 25–54 (2003).
[7] G. K. Zipf, The Psycho-Biology of Language: An Introduction to Dynamic Philology, 2nd ed. (The MIT Press, 1965).
[8] B. Mandelbrot, "Structure formelle des textes et communication," Word 10, 1–27 (1954).
[9] G. A. Miller, "Some effects of intermittent silence," Amer. J. Psych. 70, 311–314 (1957).
[10] Ł. Dębowski, "A general definition of conditional information and its application to ergodic decomposition," Statist. Probab. Lett. 79, 1260–1268 (2009).
[11] Ł. Dębowski, "Variable-length coding of two-sided asymptotically mean stationary measures," J. Theor. Probab. 23, 237–256 (2010).
[12] Ł. Dębowski, "On the vocabulary of grammar-based codes and the logical consistency of texts," IEEE Trans. Inform. Theor. 57, 4589–4599 (2011).
[13] Ł. Dębowski, "Mixing, ergodic, and nonergodic processes with rapidly growing information between blocks," (2011), in preparation, http://arxiv.org/abs/1103.3952.
[14] N. Chomsky, Syntactic Structures (The Hague: Mouton & Co, 1957).
[15] R. Montague, "English as a formal language," in Linguaggi nella Società e nella Tecnica, edited by B. Visentini et al. (Milan: Edizioni di Comunità, 1970).
[16] J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages and Computation (Addison-Wesley, 1979).
[17] P. Menzerath, "Über einige phonetische Probleme," in Actes du premier Congrès international de linguistes (Leiden: Sijthoff, 1928).
[18] R. Köhler, G. Altmann, and R. G. Piotrowski, eds., Quantitative Linguistik. Ein internationales Handbuch / Quantitative Linguistics. An International Handbook (Walter de Gruyter, 2005).
[19] A. A. Markov, "An example of statistical investigation of the text 'Eugene Onegin' concerning the connection of samples in chains," Science in Context 19, 591–600 (2006).
[20] C. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J. 27, 379–423, 623–656 (1948).
[21] B. Mandelbrot, "An informational theory of the statistical structure of languages," in Communication Theory, edited by W. Jackson (London: Butterworths, 1953) pp. 486–502.
[22] A. N. Kolmogorov, "Three approaches to the quantitative definition of information," Probl. Inform. Transm. 1(1), 1–7 (1965).
[23] W. Bialek, I. Nemenman, and N. Tishby, "Predictability, complexity and learning," Neural Comput. 13, 2409 (2001).
[24] W. Kuraszkiewicz and J. Łukaszewicz, "The number of different words as a function of text length," Pamiętnik Literacki 42(1), 168–182 (1951), in Polish.
[25] P. Guiraud, Les caractères statistiques du vocabulaire (Paris: Presses Universitaires de France, 1954).
[26] G. Herdan, Quantitative Linguistics (Butterworths, 1964).
[27] H. S. Heaps, Information Retrieval—Computational and Theoretical Aspects (Academic Press, 1978).
[28] R. Ferrer i Cancho and R. V. Solé, "Two regimes in the frequency of words and the origins of complex lexicons: Zipf's law revisited," J. Quantit. Linguist. 8(3), 165–173 (2001).
[29] M. A. Montemurro and D. H. Zanette, "New perspectives on Zipf's law in linguistics: from single texts to large corpora," Glottometrics 4, 87–99 (2002).
[30] H. A. Simon, "On a class of skew distribution functions," Biometrika 42, 425–440 (1955).
[31] R. Perline, "Strong, weak and false inverse power laws," Statist. Sci. 20, 68–88 (2005).
[32] P. Harremoës and F. Topsøe, "Maximum entropy fundamentals," Entropy 3, 191–226 (2001).
[33] D. Manin, "Zipf's law and avoidance of excessive synonymy," Cognit. Sci. 32, 1075–1098 (2008).
[34] R. Ferrer i Cancho and R. V. Solé, "Least effort and the origins of scaling in human language," Proc. Natl. Acad. Sci. USA 100, 788–791 (2003).
[35] R. Ferrer i Cancho and A. Díaz-Guilera, "The global minima of the communicative energy of natural communication systems," J. Statist. Mech. 2007, P06009 (2007).
[36] M. Prokopenko, N. Ay, O. Obst, and D. Polani, "Phase transitions in least-effort communications," J. Statist. Mech. 2010, P11025 (2010).
[37] E. Khmaladze, "The statistical analysis of large number of rare events," Technical Report MS-R8804 (Centrum voor Wiskunde en Informatica, Amsterdam, 1988).
[38] A. Kornai, "How many words are there?" Glottometrics 4, 61–86 (2002).
[39] J. G. Wolff, "Language acquisition and the discovery of phrase structure," Lang. Speech 23, 255–269 (1980).
[40] C. G. Nevill-Manning, Inferring Sequential Structure, Ph.D. thesis, University of Waikato (1996).
[41] Ł. Dębowski, "Menzerath's law for the smallest grammars," in Exact Methods in the Study of Language and Text, edited by P. Grzybek and R. Köhler (Mouton de Gruyter, 2007) pp. 77–85.
[42] C. G. de Marcken, Unsupervised Language Acquisition, Ph.D. thesis, Massachusetts Institute of Technology (1996).
[43] C. Kit and Y. Wilks, "Unsupervised learning of word boundary with description length gain," in Proceedings of the Computational Natural Language Learning ACL Workshop, Bergen, edited by M. Osborne and E. T. K. Sang (1999) pp. 1–6.
[44] J. C. Kieffer and E. Yang, "Grammar-based codes: A new class of universal lossless source codes," IEEE Trans. Inform. Theor. 46, 737–754 (2000).
[45] M. Charikar, E. Lehman, A. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat, "The smallest grammar problem," IEEE Trans. Inform. Theor. 51, 2554–2576 (2005).
[46] A. Kilgarriff, "Language is never ever ever random," Corpus Linguist. Linguist. Theor. 1, 263–276 (2005).
[47] D. Knuth, "The complexity of songs," Comm. ACM 27, 345–348 (1984).
[48] L. Hoffmann and R. G. Piotrowski, Beiträge zur Sprachstatistik (Leipzig: Verlag Enzyklopädie, 1979).
[49] N. V. Petrova, "Code — Merkmale des schriftlichen Textes," in Sprachstatistik, edited by P. M. Alexejew, W. M. Kalinin, and R. G. Piotrowski (Berlin: Akademie-Verlag, 1973) pp. 20–70.
[50] T. M. Cover and R. C. King, "A convergent gambling estimate of the entropy of English," IEEE Trans. Inform. Theor. 24, 413–421 (1978).
[51] C. R. Shalizi and J. P. Crutchfield, "Computational mechanics: Pattern and prediction, structure and simplicity," J. Statist. Phys. 104, 817–879 (2001).
[52] G. J. Chaitin, "A theory of program size formally identical to information theory," J. ACM 22, 329–340 (1975).
[53] M. Li and P. M. B. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 2nd ed. (Springer, 1997).
[54] M. Gardner, "The random number Ω bids fair to hold the mysteries of the universe," Sci. Am. 241, 20–34 (1979).
[55] P. C. Shields, "String matching bounds via coding," Ann. Probab. 25, 329–336 (1997).
[56] R. M. Gray and J. C. Kieffer, "Asymptotically mean stationary measures," Ann. Probab. 8, 962–973 (1980).
[57] F. de Saussure, Cours de linguistique générale (Paris: Payot, 1916).
[58] S. R. Ellis and R. J. Hitchcock, "The emergence of Zipf's law: Spontaneous encoding optimization by users of a command language," IEEE Trans. Syst. Man Cyber. 16, 423–427 (1986).
[59] J. D. Wang, H.-C. Liu, J. J. P. Tsai, and K.-L. Ng, "Scaling behavior of maximal repeat distributions in genomic sequences," Int. J. Cogn. Inf. Nat. Intel. 2, 31–42 (2008).
[60] W. Ebeling and G. Nicolis, "Word frequency and entropy of symbolic sequences: a dynamical perspective," Chaos Sol. Fract. 2, 635–650 (1992).
[61] P. C. Shields, "Universal redundancy rates don't exist," IEEE Trans. Inform. Theor. 39, 520–524 (1993).
[62] P. Billingsley, Probability and Measure (Wiley, 1979).
[63] V. G. Vovk and V. V. V'yugin, "On the empirical validity of the Bayesian method," J. Roy. Statist. Soc. B 55, 253–266 (1993).
[64] V. G. Vovk and V. V. V'yugin, "Prequential level of impossibility with some applications," J. Roy. Statist. Soc. B 56, 115–123 (1994).
[65] H. Takahashi, "On a definition of random sequences with respect to conditional probability," Inform. Comput. 206, 1375–1382 (2008).