Linguistic Information Energy
In this treatment a text is considered to be a series of word impulses which are read at a constant rate. The brain then assembles these units of information into higher units of meaning. A classical systems approach is used to model an initial part …
Authors: James Ford
Linguistic Information Energy James Ford jford@CACI.com ABSTRACT In this treatment a text is cons idered to be a series of word impulses which are read at a c onstant rate. The brain then assembles these units of information into higher units of m eaning. A classical systems approach is used to model an initial part of this assembly process. The concepts of linguistic system response, information energy, and ordering en ergy are defined and analyzed. Finally, as a demonstration, informat ion energy is used to estim ate the publication dates of a series of texts and the similarity of a set of texts. Introduction This paper represents an effort to apply a classic systems approach to t he modeling of linguistic processes. Specif ically a model is po sited from which statistics about the effort required by the reader of a given text can be drawn. This effort is termed the information energy. The motivational linguistic model is developed first. This is followed by several computations exploring typical behavior of the statistic. Linguistic Model Definitions. To begin, let the impulsive value of a word, w , be given by ) ( w Ι . This value should represent the work required by the reader in recognizing a word. Words that are rare or long or are spelt in an unPracticed manner should have a higher value. Some examples of ) ( w Ι are given in Table 1. Here ) Pr( w is the probability of w and l(w) is a measure of th e length of w in characters or syllables or other lexical units. These functions will be referred to later. Table 1. Word Impulse Functions Name Formula ) ( 1 w Ι )) Pr( / ) ( log( w w l ) ( 2 w Ι )) log(Pr( ) ( w w l ⋅ − ) ( 3 w Ι )) log(Pr( w − ) ( 4 w Ι )) log(Pr( * )) ( log( w w l − ) ( 5 w Ι ) ( w l ) ( 6 w Ι )) ( log( w l ) ( 3 w Ι is classically the amount of informati on resolved by the occurrence of word w . Now a text can be represented as a time series of word values ) ( ) ( ) ( t i w t Text i i − ⋅ Ι = ∑ δ where ) ( ⋅ δ is the Dirac delta. Note that the time variable t is continuous while i is discrete. Next, define a memory smoothing function (MSF) or forgetting function, ) ( t τ Μ , which represents the fact that a reader doesn’t need to remember all the words previously read to extract meaning, only the most recent ones. ) ( t τ Μ is positive and non-increasing. The subscript τ is the width of the interval over which the MSF is active. As an example of an MSF consider ] ) 1 ( exp[ ) 1 ( ) ( ) ( , k t t U t P t M k ⋅ − + − ⋅ − + + = τ τ τ τ (1) where 1 ) ( = t P τ on the interval [0, τ +1) and is zero elsewhere and 1 ) ( = − t U τ for τ >t and is zero elsewhere. A special case of this, ) ( , 2 t M ∞ , will be used in the following examples. A finite value of k may best represent real linguistic processing but in practice any k , τ pair seems to have a corresponding ∞ , eq τ pair that gives equivalent performance. ) ( t τ Μ is the linguistic system response (LSR) of a single word with an impulse value of 1 although in reality a reader would not continue to read after reading the only word in a single word text. The LSR for a text is then ) ( ) ( ) , ( t Text t t Text LSR ∗ Μ = τ where ∗ is the standard convolution operator [1]. If the LSR is the minimum attention level required to read a text at time t and the energy required to maintain that attention level is equal to it’s square, then the average information energy required by a finite text is dt t Text LSR Text IE T T ∫ = 0 2 1 ) , ( ) ( where T is the time extent of the text. Note that a more general term for this might be decoding energy since the exact choice of word impulse function ) ( w Ι hasn’t been specified. However, the term information energy will be used throughout this paper. The information energy defined here is not related to the information energy of Onicescu [2]. Example. Consider a random text, R , of N words with ) ( w Ι uniformly distributed on [] D , 0 and an MSF of the form ) ( , 2 t M ∞ given by (1). With the exception of the first word, evaluation of the LSR will be the evaluation of the sum of two random variables ) ( ) ( 1 + Ι + Ι = i i w w x and this sum will have the density shown in Figure 1. Figure 1. Probability Density for Word Impulse Sum The expected energy for this sum of variables will be 2 2 2 6 7 ) ( ] [ D dx x F x x E = ⋅ ⋅ = ∫ . Thus the expected information energy will be [] [ ] N N D N D N dx x D R IE E D 6 7 ) 1 ( 3 1 6 7 ) 1 ( 1 ) ( 2 2 0 2 ⋅ − + = ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ ⋅ − + ⋅ = ∫ where the first term in the brackets represents the evaluation of the first word. Note that the termination symbol for the text is not evaluated as a separate word, the reader merely stops reading. Definition. Let O be an ordering applied to a text T such that if ) ( , T O w w j i ∈ and i τ since the ordering O may not change the word order. But it can also be negative. As an example consider the text 3 2 1 w w w T = with ) ( ) ( ) ( 3 1 2 w w w Ι > Ι > Ι then )) ( ( ) ( T O IE T IE > if the MSF is ) ( , 2 t M ∞ . Negative ordering energy was seen less than 10% of the time for 2 = τ and decreased by half for every increase in the value of τ by 1. In practice the ordering energy will be zero 3%-4% of the time. The largest such sentence found thus far, without repeating words, had 5 words (example: “How can they do it”). The author will NOT be offering a prize for the largest non-trivial sentence found with an ordering energy of zero. However, it would be interesting to know. Typically ordering energy isn’t used by itself. Instead, the two statistics unordered information energy IE(T) and ordered information energy IE(O(T)) are handed over in raw form for further processing. Example. Let 2 T be a text with two alternating words 2 1 , w w and cardinality N which is even. As in the previous example, using an MSF of the form ) ( , 2 t M ∞ given by (1), the ordering energy is [] 2 2 1 2 ) ( ) ( 2 ) ( w w N N T OE Ι − Ι − = . S t atistical Demonstration Datasets. Two datasets, the American National Corpus 2 nd edition (ANC) [3] and a so called plain text corpus (PTC) constructed by the author, were used to examine the information energy of English. The PTC is a collection of texts written in a “conversational” style. There are no government reports or scientific papers. There are no spoken transcripts, although some of the texts have quoted dialog uttered by their characters. Currently there are 58 texts 8 of which are shared with the ANC where they were labeled as either fiction or non-fiction/OUP. All were written by a single author. About 75% were written after 1900. The earliest text dates from 1529. Most came from Project Gutenberg [4]. The word probabilities for the word impulse function were calculated from the non-shared parts of both the ANC and PTC. No pre-processing occurred other than the replacement of all numbers with a number symbol. Indeed it is the alternation of what are commonly called stop words with words of high information value that give statistics such as ordering energy a grip on the text. Stemming could reduce the average length of words in a text and cause confusion between different writing styles depending upon which word impulse function was chosen. Computation 1. The first computation is a calculation of the average per- sentence information energy on the first 400 sentences of each text in the PTC making a total of 23,600 sentences and 536,583 words. For this calculation, a sentence terminator was considered the last word or symbol read by the reader with a length of 1 and a probability equal to that of a sentence. The impulsive value of each word was given by the function ) ( 1 w Ι from Table 1 where in this case l(w) is the length of the word in characters. The MSF used was ) ( , t M ∞ τ where τ ranged between 2 and 29. The relationship between the mean information energy and τ was found to be nearly linear over that range. It can be modeled by τ τ ⋅ + − = 7 . 580 1162 ) ( mean IE (2) Figure 2 shows a graph of τ versus the mean information energy divided by τ and illustrates the differences between (2) and the observed data. This grap h peaks at a value of τ equal to 21 while the average sentence length was 22.7 words. As the value of τ nears and then passes the average sentence length the effective value of τ will become less than the actual value slowing the rate at which IE can increase. Figure 2. τ versus IE/ τ Figure 3 shows the probability density estimates of the information energy when normalized (divided) by function (2). Figure 3. Normalized Information Energy Densities The relationship found between the mean information energy, mean IE and the information energy’s standard deviation, SD IE , is 2 5 4 . 4 2903 . 0 4 . 16 mean mean SD IE e IE IE ⋅ − + ⋅ + − = . This model is nearly exact and the 2 nd order quality shows that something other than re-scaling of the data occurs as τ and thus IE increases. Ordering energy exhibited much the same behavior with respect to τ . However, because of the large number of sentences with a zero ordering energy, 3.15% in this case, the density function will be a mixture with a spike at zero. Protocols. The following two computations are focused on finding the best word impulse function from Table 1 and the best value of τ in two typical situations, estimating an abstract parameter from a text and determining the similarity of texts. There are two way of calculating information energy that are of interest, per- sentence information energy, described in Computation 1, and per-document information energy. For the later, unordered information energy is calculated by dumping or zeroing the word impulse history in the MSF at the end of each sentence but, continuing calculation of the LSR with the next sentence. This isolates the influence of each sentence while producing a fin al information energy sum across the entire text. The ordered information energy is calculated from this with a single sort. If the sorting process requires ) log( n n operations this is clearly slower. With per-sentence, the information energy values are averaged over all the sentences. For these computations the PTC was broken into 523 files of approximately 5000 words each. Breaks were made on sentence boundaries. Each computation was cross-validated with 1000 Monte Carlo trials, training on 50% of the 523 files and testing on the other 50%. Cross-validation is useful for identifying st atistics which work well for evaluation of unknowns rather than ordering of given sets. Definition. Let V , for vocabulary, be an ordered collection of words and associated frequency counts. Define the average ) ( ) ( 1 1 ∑ = ⋅ = N r r N V freq r V Z where ) ( r V freq is the frequency count of the rank r word in V . The notation Z is used to refer to Zipf’s Law [5] which states that rank times frequency is nearly constant in observations from certain “natural” systems. Let ALL V be the vocabulary of all the words in a text. Let STOP V be a collection of syntactic glue words, sometimes called stop words, which have little interest by themselves. Finally, define the statistic [] ) ( ) ( ) ( ALL STOP ALL RATIO V Z V Z V Z Z − = . (3) This statistic will be used in the follow ing two computations. Computation 2. Each text in the PTC has an associated date in years which is the earliest of either the publication date or the date of writing when known. The ordered, IEo , and unordered, IEu , information energy is used to estimate that date. Date estimation is cast as a least squares problem with the i th row of the data matrix constructed as i i i i IEu IEo IEu IEo ⋅ , , , 1 . The constant value 1 is used to generate the correct offset for the dates. The estimated dates ar e compared with the actual dates and the average error is returned. Figure 3 shows the average date error versus τ for the six word impulse functions in Table 1 in the per-document case. ) ( 6 w Ι is the winner with a minimum around 110 = τ . Figure 3. Per-Document Estimation Performance When the computation was carried out with the test set equal to the training set, ) ( 1 w Ι was the clear winner at 233 = τ with an error of 37.9 years. This would be the impulse function of choice for ordering a given set. The per-document case gave better performance than the per-sentence case. In the per-sentence case the error curve could drop to a minimum quickly but would not reach the final performance seen in the per-document case. The best result for the per-sentence case was an average error of 46 years for the ) ( 3 w Ι function. When the least squares problem was changed such that the i th row of the data matrix became RATIO RATIO RATIO Z IEo Z IEu IEo IEu Z IEo IEu ⋅ ⋅ ⋅ , , , , , , 1 where RATIO Z is defined in (3), then the average date error was found to be that of Figure 4. Figure 4. Per-Document Expanded Estimation Performance Note that the order of performance is nearly the reverse of Figure 3. Also note that the location of the minimum for ) ( 1 w Ι and ) ( 3 w Ι is for a value of τ close to the average sentence length. Computation 3. After breaking the 58 PTC texts into 523 5000 word chunks, the natural question to ask is whether Humpty-Dumpty [6] can be put back together again. This computation tries to find the best way to do that with information energy. Here 2-dimensional vectors of the form IEu IEo vIE , = represent each text and the average vector for each author’s training set represents that author. T he minimum distance between a vector from the testing set and any author vector classifies the test text. Distance is defined in the following way. For real vectors X and Y and symmetric positive-definite matrix K , let Y K X Y X K • • = | be an inner product. Then the distance between two real vectors U and V is ( ) 2 1 | ) , ( K V U V U V U d − − = . For this computation K is the inverse of the covariance matrix [7] derived from the set of averaged author training vectors. Using the inverse covariance matrix has the effect of de-correlating and equalizing the vector coordinates. Figure 5 shows the performance with respect to τ of each of the word impulse functions from Table 1 for the per-document case. ) ( 5 w Ι and ) ( 6 w Ι are the winners. The best performance in the per-sentence case was an error of 0.74. Figure 5. Per-Document Similarity Performance When the text representation vector was expanded to RATIO Z IEu IEo , , the author matching error became that of Figure 6. Figure 6. Per-Document Expanded Similarity Performance Once again the performance ordering of the word impulse functions has changed with the addition of a new statistic. Conclusions. The choice of word impulse function and parameter τ is dependent on circumstance. ) ( 5 w Ι and ) ( 6 w Ι work well when information energy is the only statistic. However, it is expected that it will be used in conjunction with other statistics. In conjunction w ith RATIO Z , good performance was obtained from the word impulse functions ) ( 1 w Ι and ) ( 3 w Ι , which have a structure similar or equal to the classic information measure )) log(Pr( w − . It should be noted that the statistic RATIO Z gave the best performance improvement over several standard vocabulary richness statistics tested for this purpose (results not reported here). References [1] http://www.kfupm.edu.sa/math/Sc_Events/SignalProcessing/Note3.pdf [2] O. Onicescu, Énergie Informationelle , C. R. Acad. Sci. Paris Sér. A-B 263 (1966), A841-A842. [3] http://www.americannationalcorpus.org/ [4] http://www.gutenberg.org/wiki/Main_Page [5] http://en.wikipedia.org/wiki/Zipfs_law [6] http://en.wikipedia.org/wiki/Humpty-Dumpty [7] http://en.wikipedia.org/wiki/Covariance_matrix
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment