Use of self-correlation metrics for evaluation of information properties of binary strings

It is demonstrated that appropriately chosen computable metrics based on self-correlation properties provide a degree of determinism sufficient to segregate binary strings by level of information content.

Authors: S.Viznyuk

Shannon's classic formula [1]

H = −∑_i p_i · log₂(p_i)   (1)

originally defined as a measure of information contained in the message, has been the cornerstone of most works on information theory [3]. Later the definition was refined [2, 3] to the effect that entropy H is rather a measure of the missing information needed to determine the state of an object. Some valid interpretations of the value provided by (1) are:

a) the minimum number of yes/no answers needed to describe the object in terms of a predefined set of observables with mutually independent probabilities p_i
b) the minimum length to which the message can be compressed, in bits per symbol
c) the minimum channel capacity required to transmit the message, in bits per symbol

The value of entropy provided by (1) depends on the probability distribution p_i of the describing observables. In the case of message strings, the set of observables makes up an alphabet, or codebook. A message and a codebook together allow the information content of the message to be defined using (1). Without a known codebook the information content of the message is not defined, despite efforts [4, 5, 6] toward a definition in terms of algorithmic (Kolmogorov) complexity or effective complexity. It was shown [7] that such definitions always imply a choice of a codebook. If we are to borrow terminology from quantum mechanics, we can draw an analogy between a set of message strings and an ensemble of open quantum systems.
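As a concrete illustration of (1), the sketch below computes the entropy of a byte string, taking the 256 possible byte values as the codebook. This is an illustrative example only, with a hypothetical function name; it is not part of the paper's tooling.

```python
from collections import Counter
from math import log2

def shannon_entropy(data: bytes) -> float:
    """H = -sum(p_i * log2(p_i)) over observed byte frequencies,
    in bits per symbol (codebook: the 256 possible byte values)."""
    counts = Counter(data)
    total = len(data)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# A constant message carries no information under this codebook...
print(shannon_entropy(b"aaaa"))           # 0.0
# ...while uniformly distributed bytes reach the maximum of 8 bits/symbol.
print(shannon_entropy(bytes(range(256)))) # 8.0
```

Note that this value depends entirely on the chosen codebook: the same bytes regrouped into, say, 16-bit symbols would generally give a different entropy, which is the point made in the paragraph above.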
An open quantum system and its environment together can be viewed as a closed quantum system whose state (information content) can be determined by measurement of certain observables. Similarly, the information content of a message can be determined in the context of a given codebook (environment). Without a codebook the information content of the message cannot be measured, just as an operator in quantum mechanics has no eigenvalue for a given open quantum system. The approach in this case is to consider an ensemble of objects. The ensemble is characterized statistically using a set of descriptive observables. The mean values of the observables can be measured (or calculated using a known density matrix) to provide a level of information about the ensemble, though not about specific objects. The reasoning above leads us to believe we are free to choose observable (or computable) parameters for the statistical evaluation of properties of the objects in an ensemble that are not directly observable, for example for the evaluation of information content. The success of the evaluation is measured by the degree of determinism between observed parameter values and the property under evaluation. We consider a large collection of binary computer files of various types as the ensemble of objects. Computable metrics M_F and D_F have been chosen as the parameters for statistical evaluation of the information content of the files. We demonstrate that the chosen metrics provide a degree of determinism sufficient to segregate arbitrary files by level of information content. The degree of determinism increases with file size, resulting in reliable segregation of files larger than 10 Mbits. The logic in choosing a computable metric is based on the premise that non-random messages should have a degree of self-correlation.
We start by defining the correlation metric C_R(n) as

C_R(n) = ∑_{i=1..M} B_i ⊕ B_{i+n},   0 ≤ n < M   (2)

where B_i is the i-th bit of a binary message string B consisting of M bits with values 0 or 1, ⊕ is the logical XOR operator, and B_{i+n} = B_{i+n−M} for i+n > M. The range of possible values of C_R(n) is 0 to M. Next we define the metric M_F(n) = M − 2·C_R(n), and the aggregate metrics M_F and D_F as

M_F = ∑_{n=0..M−1} M_F(n)   (3)

D_F = ∑_{n=0..M−1} (M_F(n)/M)²   (4)

The computation of the M_F and D_F metrics has been performed using the corrbits program (see http://www.phystech.com/download/corrbits.htm) on a large collection of binary computer files of various types:

1. Group 1: Microsoft Excel, Word, PowerPoint, Visio, RTF documents; delimited and position-based non-random data files; Windows and UNIX executables and library files; plain text files; various other types of non-compressed files: TTF, REP, JSP, RDF, TAR, HTM, LOG, XML. The files were collected from different sources. Total number of files in this group: 1122. The size of the files in this group ranges from 13 bytes to 4.3 Mbytes.

2. Group 2: Compressed non-random computer files. The BZIP2 and GZIP compression programs were used to compress files of the same types as in Group 1, and random slices of the compressed files as well as whole files were analyzed with the corrbits program. Total number of files in this group: 1867. The size of the files in this group ranges from 30 bytes to 2.3 Mbytes.

3. Group 3: Random data files obtained from the /dev/random device on several different UNIX servers. Total number of files analyzed in this group: 1334. The size of the files in this group ranges from 14 bytes to 4.5 Mbytes.

Figure 1 shows values of the M_F metric plotted against the size M of the analyzed files.
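Definitions (2)–(4) can be sketched directly in Python. The following is an illustrative reimplementation under the stated definitions, not the corrbits program itself; its O(M²) double loop is only practical for short strings.

```python
def correlation_metrics(bits):
    """Compute the aggregates M_F (eq. 3) and D_F (eq. 4) for a list of
    0/1 bits, using the cyclic wraparound B_{i+n} = B_{i+n-M} from eq. 2.
    Bits are 0-indexed here, versus 1-indexed in the text."""
    M = len(bits)
    mf_n = []
    for n in range(M):
        # C_R(n) = sum_i B_i XOR B_{i+n}, with cyclic wraparound
        c_r = sum(bits[i] ^ bits[(i + n) % M] for i in range(M))
        mf_n.append(M - 2 * c_r)          # M_F(n) = M - 2*C_R(n)
    m_f = sum(mf_n)                        # eq. (3)
    d_f = sum((x / M) ** 2 for x in mf_n)  # eq. (4)
    return m_f, d_f

# Sanity check: an all-zero string has C_R(n) = 0 for every n,
# so M_F(n) = M for all n and M_F = M^2, as noted further below.
M = 64
mf_const, _ = correlation_metrics([0] * M)
print(mf_const == M * M)  # True
```

Note that the n = 0 term always contributes M to M_F and 1 to D_F, since C_R(0) = 0; for random bits the remaining terms fluctuate around zero, which is consistent with the M_F ≈ M and D_F ≈ 1 asymptotics reported for Group 3.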
[Figure 1: M_F values plotted against file size M, in bits, for Group 1 (non-random non-compressed data), Group 2 (compressed non-random data), and Group 3 (random data).]

While at smaller file sizes M the M_F values for all three groups are not clearly segregated, the separation grows with file size, and with M greater than 10 Mbits the segregation gets close to 100%. The asymptotic behavior of the groupings also becomes evident at larger file sizes. M_F values for Group 1 exhibit an M_F ≈ M² asymptotic dependency, while for Group 3 it is M_F ≈ M, and Group 2 approximately follows M_F ≈ M^(3/2). From (3), the metric M_F = M² for strings consisting of all 0 or all 1 bits. The information content of such strings is minimal: 0 according to (1). Therefore a first-order adjustment to the metric M_F has been made:

Adj.M_F = (1 − M_F/M²) · M_F   (5)

[Figure 2: Adjusted M_F values plotted against file size M, in bits, for the same three groups.]

The Adj.M_F metric shows the same asymptotic behavior as M_F while potentially allowing more accurate segregation of files by level of information content.
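The adjustment (5) can be checked directly on the degenerate case that motivates it. The helper name below is hypothetical, chosen for this illustration:

```python
def adjusted_mf(m_f: float, M: int) -> float:
    """First-order adjustment of eq. (5): Adj.M_F = (1 - M_F/M^2) * M_F.
    Sends the degenerate all-0/all-1 strings, for which M_F = M^2,
    to the minimal value 0, matching their zero entropy under eq. (1)."""
    return (1 - m_f / M**2) * m_f

M = 64
# A constant string has M_F = M^2, so the adjusted metric vanishes:
print(adjusted_mf(M * M, M))  # 0.0
# A string with partial self-correlation keeps a positive adjusted value:
print(adjusted_mf(M * M // 4, M) > 0)  # True
```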
As evident from Figure 2, the Adj.M_F metric breaks the ensemble of arbitrary binary strings into "spectrum" bands, acting as a dispersion operator. Other possible metrics may provide different ways of segregating binary message strings. As an example we consider the D_F metric provided by (4). Figure 3 shows values of the D_F metric plotted against the size M of the same collection of files as in Figures 1-2. Figure 3 demonstrates the qualitative difference in asymptotic behavior of the D_F metric between binary strings in Group 1 and Groups 2-3. We can speculate that the difference in asymptotic behavior of the D_F metric between Group 1 and Groups 2-3 is tied to the amount of what can be considered self-contained information. Group 1 contains a significant amount of self-contained data, Group 3 has none, and Group 2 has a very small amount inserted as part of the compression routine metadata. The asymptotic behavior of the D_F metric for Group 1 follows D_F ≈ M/100, and for Groups 2-3, D_F ≈ 1.

[Figure 3: D_F values plotted against file size M, in bits, for the same three groups.]

It has been demonstrated that computable metrics based on self-correlation properties can be used for statistical evaluation of information properties of binary strings. The degree of determinism between the chosen metrics and the evaluated property is shown to increase with the size of the strings. The particular choice of metrics can be expanded in the future, or the algorithms behind their calculation modified, to improve the practical usefulness of the approach.

1. Shannon, C. E. 1948 A Mathematical Theory of Communication.
The Bell System Technical Journal 27, 379-423, 623-656.
2. Cherry, E. C., Discussion on Dr. Shannon's papers, 1953 IEEE Transactions on Information Theory, vol. 1, no. 1, p. 169.
3. Cover, Thomas M., Thomas, Joy A. 2006 Elements of Information Theory. Second Edition. John Wiley & Sons, Inc., ISBN-13 978-0-471-24195-9.
4. Gacs, P., Tromp, J. T., Vitanyi, P. M. Algorithmic Statistics. 2001 IEEE Transactions on Information Theory. http://arxiv.org/abs/math/0006233.
5. Gell-Mann, Murray, and Seth Lloyd. 1996 Information Measures, Effective Complexity, and Total Information. Complexity 2, 44-52.
6. Titchener, Mark R. A Measure of Information. 2000 Proceedings of the Conference on Data Compression. Pages 353-362. ISBN 0-7695-0592-9.
7. McAllister, James W. Effective Complexity as a Measure of Information Content. 2003 Philosophy of Science, vol. 70, pages 302-307.
