Deformed Statistics Formulation of the Information Bottleneck Method

R. C. Venkatesan
Systems Research Corporation, Aundh, Pune 411007, India
Email: ravi@systemsresearchcorp.com

A. Plastino
IFLP, National University La Plata & National Research Council (CONICET), C. C. 727, 1900 La Plata, Argentina
Email: plastino@venus.fisica.unlp.edu.ar

Abstract—The theoretical basis for a candidate variational principle for the information bottleneck (IB) method is formulated within the ambit of the generalized nonadditive statistics of Tsallis. Given a nonadditivity parameter $q$, the role of the additive duality of nonadditive statistics ($q^* = 2-q$) in relating Tsallis entropies for the ranges of the nonadditivity parameter $q < 1$ and $q > 1$ is described. Defining $X$, $\tilde{X}$, and $Y$ to be the source alphabet, the compressed reproduction alphabet, and the relevance variable respectively, it is demonstrated that minimization of a generalized IB (gIB) Lagrangian, defined in terms of the nonadditivity parameter $q^*$, self-consistently yields the nonadditive effective distortion measure to be the $q$-deformed generalized Kullback-Leibler divergence: $D^q_{K-L}[p(Y|X)\|p(Y|\tilde{X})]$. This result is achieved without enforcing any a-priori assumptions. Next, it is proven that the $q^*$-deformed nonadditive free energy of the system is non-negative and convex. Finally, the update equations for the gIB method are derived. These results generalize critical features of the IB method to the case of Tsallis statistics.

I. INTRODUCTION

Rate distortion (RD) theory [1, 2] is a major branch of information theory which provides the theoretical foundations for lossy data compression.
RD theory addresses the problem of determining the minimal amount of entropy (or information) $R$ that should be communicated over a channel, so that the source (input signal/source alphabet/codebook) $X \in \mathcal{X}$ can be approximately reconstructed at the receiver (output signal/reproduction alphabet/quantized codebook) $\tilde{X} \in \tilde{\mathcal{X}}$ without exceeding a given expected distortion $D$. Note that calligraphic fonts are used to denote sets. In turn, the information bottleneck (IB) method is a technique introduced by Tishby, Pereira, and Bialek [3, 4] for finding the best tradeoff between accuracy and complexity (compression) when summarizing (e.g. clustering) a discrete random variable $X$, given a joint probability distribution between $X$ and a relevance variable $Y \in \mathcal{Y}$, i.e. $p(x, y)$. In this regard, the IB method represents a significant qualitative improvement over RD theory. The IB method has acquired immense utility in machine learning theory. For example, the IB method and its modifications have successfully been employed in applications in diverse areas such as genome sequence analysis, astrophysics, and text mining [4]. $q$-Deformed (or Tsallis) statistics [5, 6] has recently been shown to yield interesting improvements concerning RD theory [7]. The present paper extends analogous "q-"considerations to the IB method.

The generalized (nonadditive) statistics of Tsallis has recently been the focus of much attention in statistical physics and allied disciplines. Note that the terms generalized statistics, $q$-deformed statistics, nonadditive statistics, and nonextensive statistics are used interchangeably. Nonadditive statistics, which generalizes the Boltzmann-Gibbs-Shannon (B-G-S) statistics, has recently found much utility in a wide spectrum of disciplines ranging from complex systems and condensed matter physics to financial mathematics.
A continually updated bibliography of works in nonadditive statistics may be found at http://tsallis.cat.cbpf.br/biblio.htm. Since the work on nonextensive source coding by Landsberg and Vedral [8], a number of studies on the information theoretic aspects of generalized statistics pertinent to coding related problems have been performed [9-12]. Most recently, the nonadditive statistics of Tsallis [5, 6] has been utilized to develop a generalized statistics RD theory [7]. This paper [7] investigates nonadditive statistics within the context of RD theory in lossy data compression. The generalized statistics RD model performs variational minimization of the nonadditive RD Lagrangian employing a method developed in [13] to "rescue" the linear constraints originally employed by Tsallis [5]. Nonadditive statistics possesses a number of constraints having different forms [14-16].

RD theory is now briefly described as a precursor to introducing the leitmotif for the IB method. For a source alphabet $X \in \mathcal{X}$ and a reproduction alphabet $\tilde{X} \in \tilde{\mathcal{X}}$, the mapping of $x \in \mathcal{X}$ to $\tilde{x} \in \tilde{\mathcal{X}}$ is characterized by the quantizer $p(\tilde{x}|x)$. The RD function is obtained by minimizing the generalized mutual information (GMI) $I_q(X;\tilde{X})$ (defined in Section 2 of [7]) over all normalized $p(\tilde{x}|x)$.¹ In RD theory, $I_q(X;\tilde{X})$ is known as the compression information (see Section 3 of [7]). Here, $q$ is the nonadditivity parameter [5, 6]. A significant feature of the nonadditive RD model [7] is that the threshold for the compression information is lower than that encountered in RD models derived from B-G-S statistics. This feature augurs well for utilizing Tsallis statistics in data compression applications.

¹ The absence of a definitive nonadditive channel coding theorem sometimes prompts the use of the term generalized mutual entropy instead of GMI [10].
By definition, the nonadditive RD function is [1, 2, 7]

$R_q(D) = \min_{p(\tilde{x}|x):\ \langle d(x,\tilde{x})\rangle_{p(x,\tilde{x})} \le D} I_q(X;\tilde{X});\quad 0 < q < 1,$  (1)

where $R_q(D)$ is the minimum of the compression information. The distortion measure is denoted by $d(x,\tilde{x})$, and is taken to be the Euclidean square distance for most problems in science and engineering [1, 2]. Given $d(x,\tilde{x})$, the partitioning of $\mathcal{X}$ induced by $p(\tilde{x}|x)$ has an expected distortion $D = \langle d(x,\tilde{x})\rangle_{p(x,\tilde{x})}$. Note that in this paper, $\langle\bullet\rangle_{p(\bullet)}$ denotes the expectation with respect to the probability $p(\bullet)$.

RD theory a-priori specifies the nature of the distortion measure, which is tantamount to an a-priori specification of the features of interest in the source alphabet $X$ that are to be contained in the compressed representation $\tilde{X}$. RD theory lacks the framework to specify the features of interest in $X$, to be contained in $\tilde{X}$, that are relevant to a given study. To ameliorate this drawback, the IB method introduces another variable $Y \in \mathcal{Y}$, the relevance variable. Note that $Y$ need not inhabit the same space as $X$. Thus, the crux of the IB method is to simultaneously minimize the compression information $I_q(X;\tilde{X})$ and maximize the relevant information $I_q(\tilde{X};Y)$. More specifically, the IB method extracts structure from the source alphabet via data compression, followed by a quantification of the information contained in the extracted structure with respect to a relevance variable. Consequently, the IB method "squeezes" the information between $X$ and $Y$ through a bottleneck $\tilde{X}$. The IB method is compactly described by the Markov condition [3, 4]

$\tilde{X} \leftrightarrow X \leftrightarrow Y.$  (2)

As discussed in [7], the un-normalized GMI in Tsallis statistics acquires different forms in the regimes $0 < q < 1$ and $q \ge 1$, respectively. For $0 < q < 1$, the GMI is of the form $I_{0<q<1}(X;\tilde{X}) = S_q(X) - S_q(X|\tilde{X})$, whereas for $q > 1$ the GMI is defined by $I_{q>1}(X;\tilde{X}) = S_q(X) + S_q(\tilde{X}) - S_q(X,\tilde{X})$, where $S_q(X)$ and $S_q(\tilde{X})$ are the marginal Tsallis entropies for the random variables $X$ and $\tilde{X}$, and $S_q(X,\tilde{X})$ is the joint Tsallis entropy [7]. Unlike the B-G-S case, $I_{0<q<1}(X;\tilde{X})$ cannot be cast in the form of $I_{q>1}(X;\tilde{X})$, and vice versa [7, 9, 10], the reason being that the sub-additivities $S_q(X|\tilde{X}) \le S_q(X)$ and $S_q(\tilde{X}|X) \le S_q(\tilde{X})$ are not generally valid when $0 < q < 1$. The form $I_{q>1}(X;\tilde{X})$, in particular, possesses a number of important properties such as the generalized data processing inequality and the generalized Fano inequality [10]. The different forms of the GMI for $0 < q < 1$ and $q > 1$ are reconciled by invoking the additive duality of nonadditive statistics [17]. This entails a re-parameterization of the nonadditivity parameter, $q^* = 2-q$, resulting in dual Tsallis entropies.

This paper derives a theoretical basis for the generalized IB (gIB) method, which is a fundamental qualitative extension of the seminal work of Tishby, Pereira, and Bialek [3]. The analysis commences with the minimization of the gIB Lagrangian

$L^q_{gIB}[p(\tilde{x}|x)] = I_q(X;\tilde{X}) - \tilde{\beta}_{gIB} I_q(\tilde{X};Y);\quad q > 1,$  (3)

subject to the normalization of $p(\tilde{x}|x)$. Here, $\tilde{\beta}_{gIB}$ is the gIB tradeoff parameter for the simultaneous minimization and maximization described by (3). From (3), it is easily shown that

$\delta L^q_{gIB}[p(\tilde{x}|x)] = 0 \Rightarrow \frac{\delta I_q(\tilde{X};Y)}{\delta I_q(X;\tilde{X})} = \frac{1}{\tilde{\beta}_{gIB}}.$

Thus, by increasing $\tilde{\beta}_{gIB}$, convex curves akin to the RD curves [1, 2, 7] may be constructed in the "information plane" $\left(I_q(X;\tilde{X}), I_q(\tilde{X};Y)\right)$. These are called relevance-compression curves [4]. Apart from its ability to model long-range interactions when performing clustering of complex data sets, the gIB method also facilitates the analysis of the IB method within the context of predictability [18].
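As an illustrative numerical sketch (not part of the original paper), the $q>1$ form of the GMI entering the gIB Lagrangian (3) can be computed directly from a joint distribution. All function names and the toy distribution below are hypothetical; the $q$-logarithm $\ln_q(x) = (x^{1-q}-1)/(1-q)$ is the standard Tsallis deformation:

```python
import numpy as np

def ln_q(x, q):
    """q-deformed logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q); ln_1 = ln."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """Un-normalized Tsallis entropy S_q = -sum_i p_i^q ln_q(p_i)."""
    p = p[p > 0]
    return -np.sum(p ** q * ln_q(p, q))

def gmi_q_gt_1(p_joint, q):
    """GMI for q > 1: I_q(X; X~) = S_q(X) + S_q(X~) - S_q(X, X~)."""
    p_x = p_joint.sum(axis=1)
    p_xt = p_joint.sum(axis=0)
    return (tsallis_entropy(p_x, q) + tsallis_entropy(p_xt, q)
            - tsallis_entropy(p_joint.ravel(), q))

# Toy joint distribution p(x, x~); q = 1 recovers the Shannon mutual information.
p_joint = np.array([[0.3, 0.1],
                    [0.1, 0.5]])
print(gmi_q_gt_1(p_joint, 1.5))   # q-deformed (Tsallis) mutual information
print(gmi_q_gt_1(p_joint, 1.0))   # Shannon (B-G-S) limit
```

For $q \to 1$ the Shannon mutual information is recovered, matching the B-G-S limit noted in the text.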
Pr edictability may be vie wed as an excursion from the extensiv e B -G-S statistics, and is inherently nonextensive (non additive). I I . T S A L L I S E N T RO P I E S A N D D UA L T S A L L I S E N T RO P I E S The u n-nor malized Tsallis entr opy , condition al Tsallis en- tropy , joint T sallis entropy , the jointly conv ex gen eralized K- Ld, and, th e GMI may thus be wr itten as [ 19,20 ] S q ( X ) = − P x p ( x ) q ln q p ( x ) , S q ˜ X X = − P x P ˜ x p ( x, ˜ x ) q ln q p ( ˜ x | x ) , S q X , ˜ X = − P x P ˜ x p ( x, ˜ x ) q ln q p ( x, ˜ x ) = S q ( X ) + S q ( ˜ X | X ) = S q ( ˜ X ) + S q ( X | ˜ X ) , D q K − L ( p ( X ) k r ( X ) ) = − P x p ( x ) ln q r ( x ) p ( x ) , I 01 ( X ; ˜ X ) and I 01 and 0 < q ∗ < 1 respectively , relate to ea ch other as (Theorem 3, [7]) I q ∗ ( X ; ˜ X ) = − P x P ˜ x p ( x, ˜ x ) ln q ∗ p ( x ) p ( ˜ x ) p ( x, ˜ x ) ( q ∗ → q ) = S q ( X ) + S q ˜ X − S q X ; ˜ X = I q ( X ; ˜ X ) ( q → q ∗ ) = I q ∗ ( ˜ X ; X ) . (11) Here, ” q ∗ → q ” is a re-p arameterizatio n from q ∗ to q , and, ” q → q ∗ ” is a re-parameterization fr om q to q ∗ . I I I . G E N E R A L I Z E D I N F O R M A T I O N B OT T L E N E C K V A R I AT I O N A L P R I N C I P L E A. Self-co nsistent eq uation s Dependin g u pon the ”upstream” and ”downstream” vari- ables in the Markov cond ition (2) , the total probab ility may be expressed as ˜ X ← X ← Y ⇒ p ( x, ˜ x , y ) = p ( x, y ) p ( ˜ x | x ) , ˜ X → X → Y ⇒ p ( x, ˜ x , y ) = p ( x, ˜ x ) p ( y | x ) . (12) The Markov condition ˜ X ← X ← Y yields [3] p ( y | ˜ x ) = 1 p ( ˜ x ) X x p ( y | x ) p ( ˜ x | x ) p ( x ) . (13) Since ˜ X ← X ← Y = Y ← X ← ˜ X , the Markov con dition yields throu gh application of Bayes rule an d consistency [3] p ( y | ˜ x ) = X x ∈ X p ( y | x ) p ( x | ˜ x ) . (14) Thus p ( ˜ x ) = P x,y p ( x, ˜ x, y ) = P x p ( x ) p ( ˜ x | x ) , and, p ( ˜ x, y ) = P x p ( x, ˜ x, y ) = P x p ( x, y ) p ( ˜ x | x ) . 
(15) From (15 ), the following relatio ns are o btained δ p ( ˜ x ) δ p ( ˜ x | x ) = p ( x ) , and, δ p ( ˜ x | y ) δ p ( ˜ x | x ) = p ( x | y ) . (16) B. The variation al principle The g IB Lag rangian (3) c annot be expressed in ter ms o f th e generalized K-L d. As discussed in [7], the ad ditive dua lity is r equired to express the GMI f or q > 1 in terms o f the generalized K- Ld. This is req uired to f ormulate no nadditive numerical schemes akin to the EM algorithm [2 2], using the alternating minimization metho d based on the C sisz ´ ar-T usn ´ ady theory [23] . T he gIB L agrangia n in q ∗ − space is L q ∗ gI B [ p ( ˜ x | x )] = I q ∗ X ; ˜ X − ˜ β gI B I q ∗ ˜ X ; Y ; 0 < q ∗ < 1 , (17) contingen t to the no rmalization of p ( ˜ x | x ) . Here, I q ∗ X ; ˜ X and I q ∗ ˜ X ; Y are obtained from I q ( X ; ˜ X ) and I q ( ˜ X ; Y ) employing (11). V ariational minimization of (17) [7, 13] yields δ δp ( ˜ x | x ) L q ∗ gI B [ p ( ˜ x | x )] = 0 ( a ) ⇒ 1 q ∗ − 1 p ( ˜ x ) p ( ˜ x | x ) 1 − q ∗ + ˜ β gI B 1 − q ∗ P y p ( y | x ) p ( y ) p ( y | ˜ x ) 1 − q ∗ − λ ( x ) p ( x ) = 0 . (18) Here, ( a ) is from Bayes’ theorem p ( ˜ x | x ) p ( ˜ x ) = p ( x | ˜ x ) p ( x ) and p ( ˜ x | y ) p ( ˜ x ) = p ( y | ˜ x ) p ( y ) , a nd, (16 ). Th e term p ( x ) is canceled out. A ln q ∗ ( • ) term is intr oduced in (1 8), by addin g and subtracting ˜ β gI B P y p ( y | x ) 1 − q ∗ , to yield 1 q ∗ − 1 p ( ˜ x ) p ( ˜ x | x ) 1 − q ∗ + ˜ β gI B P y p ( y | x ) ln q ∗ p ( y ) p ( y | ˜ x ) − λ (1) = 0 . (19) In (19), − λ (1) ( x ) = − λ ( x ) p ( x ) + ˜ β gI B P y p ( y | x ) 1 − q ∗ , which is on ly depend ent on x . 
The second term in (19) is expressed as $\tilde{\beta}_{gIB}\sum_y p(y|x)\ln_{q^*}\frac{p(y|x)}{p(y|\tilde{x})}$ employing the $q^*$-deformed subtraction and addition (6), yielding

$\frac{1}{q^*-1}\left(\frac{p(\tilde{x})}{p(\tilde{x}|x)}\right)^{1-q^*} + \tilde{\beta}_{gIB}\sum_y p(y|x)\left[\ln_{q^*}\frac{p(y)}{p(y|\tilde{x})} \ominus_{q^*} \ln_{q^*}\frac{p(y)}{p(y|x)}\right] - \lambda^{(1)}(x) + \tilde{\beta}_{gIB}\sum_y p(y|x)\ln_{q^*}\frac{p(y)}{p(y|x)} = 0$

$\Rightarrow \frac{1}{q^*-1}\left(\frac{p(\tilde{x})}{p(\tilde{x}|x)}\right)^{1-q^*} + \tilde{\beta}_{gIB}\sum_y p(y|x)\ln_{q^*}\frac{p(y|x)}{p(y|\tilde{x})} - \lambda^{(2)}(x) = 0.$  (20)

Here, $-\lambda^{(2)}(x) = -\left[\tilde{\beta}_{gIB}I_{q^*}(x;Y) \oplus_{q^*} \lambda^{(1)}(x)\right]$, with $I_{q^*}(x;Y) = -\sum_y p(y|x)\ln_{q^*}\frac{p(y)}{p(y|x)}$. Multiplying (20) by $p(\tilde{x}|x)$ and summing over $\tilde{x}$ yields

$\lambda^{(2)}(x) = \sum_{\tilde{x}}\frac{p(\tilde{x}|x)}{q^*-1}\left(\frac{p(\tilde{x})}{p(\tilde{x}|x)}\right)^{1-q^*} + \tilde{\beta}_{gIB}\left\langle\sum_y p(y|x)\ln_{q^*}\frac{p(y|x)}{p(y|\tilde{x})}\right\rangle_{p(\tilde{x}|x)}.$  (21)

Defining $\Im_{gIB}(x) = (q^*-1)\lambda^{(2)}(x)$, (20) yields

$p(\tilde{x}|x) = p(\tilde{x})\,\frac{\left[1 - (q^*-1)\frac{\tilde{\beta}_{gIB}}{\Im_{gIB}(x)}\sum_y p(y|x)\ln_{q^*}\frac{p(y|x)}{p(y|\tilde{x})}\right]^{\frac{1}{q^*-1}}}{\Im_{gIB}(x)^{\frac{1}{1-q^*}}}.$  (22)

Setting $\tilde{\beta}_{gIB}(x) = \frac{\tilde{\beta}_{gIB}}{\Im_{gIB}(x)}$, and invoking the additive duality in the numerator ($q^* = 2-q$), (22) yields the canonical transition probability

$p(\tilde{x}|x) = \frac{p(\tilde{x})\exp_q\left(-\tilde{\beta}_{gIB}(x)D^q_{K-L}\left[p(y|x)\|p(y|\tilde{x})\right]\right)}{\tilde{Z}\left(x,\tilde{\beta}_{gIB}(x)\right)},\quad \tilde{Z}\left(x,\tilde{\beta}_{gIB}(x)\right) = \Im_{gIB}(x)^{\frac{1}{1-q^*}}.$  (23)

In (23), $\tilde{\beta}_{gIB}(x)$ is the gIB tradeoff parameter evaluated for each source alphabet $x\in\mathcal{X}$, and $\tilde{Z}\left(x,\tilde{\beta}_{gIB}(x)\right)$ is the partition function. The effective distortion measure has been self-consistently obtained via the variational principle to be $D^q_{K-L}[p(y|x)\|p(y|\tilde{x})]$, without any a-priori assumptions. In the limit $q \to 1$, the B-G-S statistics result [3, 4] is recovered. Solutions of (23) are valid only for $\left\{1 - (1-q)\tilde{\beta}_{gIB}(x)D^q_{K-L}[p(y|x)\|p(y|\tilde{x})]\right\} > 0$.
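The canonical transition probability (23) lends itself to a direct numerical sketch. The snippet below is illustrative and not from the paper: the function names and the toy quantities are hypothetical, the partition function is realized as an explicit normalization over $\tilde{x}$, and the $[\cdot]_+$ clipping inside $\exp_q$ sets $p(\tilde{x}|x) = 0$ whenever the argument of (23) leaves the admissible range:

```python
import numpy as np

def ln_q(x, q):
    """q-deformed logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q)."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(u, q):
    """q-deformed exponential: exp_q(u) = [1 + (1-q)u]_+^(1/(1-q)).
    Clipping the base at zero realizes the Tsallis cut-off."""
    if np.isclose(q, 1.0):
        return np.exp(u)
    base = np.asarray(1.0 + (1.0 - q) * u, dtype=float)
    out = np.zeros_like(base)
    pos = base > 0.0
    out[pos] = base[pos] ** (1.0 / (1.0 - q))
    return out

def d_kl_q(p, r, q):
    """Generalized K-Ld: D_q[p || r] = -sum_y p(y) ln_q(r(y)/p(y))."""
    m = p > 0
    return -np.sum(p[m] * ln_q(r[m] / p[m], q))

def transition_update(p_t, beta_x, D, q):
    """Canonical transition probability (23):
    p(x~|x) ∝ p(x~) exp_q(-beta(x) D_q[p(y|x) || p(y|x~)]),
    normalized over x~ (the role of the partition function Z~)."""
    w = p_t[None, :] * exp_q(-beta_x[:, None] * D, q)
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical toy setting: three source symbols x, two clusters x~.
p_t = np.array([0.6, 0.4])              # current p(x~)
D = np.array([[0.0, 1.2],               # D[x, t] = D_q[p(y|x) || p(y|x~)]
              [0.8, 0.1],
              [0.5, 0.5]])
beta_x = np.full(3, 2.0)                # tradeoff parameter per x
P = transition_update(p_t, beta_x, D, q=1.3)
print(P)                                # each row is a distribution over x~
```

For $q \to 1$ the update collapses to the familiar B-G-S form $p(\tilde{x})\,e^{-\beta D_{K-L}}/Z$.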
The condition $\left\{1 - (1-q)\tilde{\beta}_{gIB}(x)D^q_{K-L}[p(y|x)\|p(y|\tilde{x})]\right\} < 0$ is called the Tsallis cut-off condition [5], and requires setting $p(\tilde{x}|x) = 0$ and stopping the iteration at the given $\tilde{\beta}_{gIB}$.

C. Free energy of the system

The $q^*$-deformed nonadditive free energy of the system is [3]

$F^{q^*}_{gIB}\left[p(\tilde{x}|x); p(\tilde{x}); p(y|\tilde{x})\right] = F^{q^*}_{gIB}[\bullet] = -\left\langle\ln_{q^*}\tilde{Z}\left(x,\tilde{\beta}_{gIB}(x)\right)\right\rangle_{p(x)} = \left\langle\frac{\Im_{gIB}(x)-1}{q^*-1}\right\rangle_{p(x)}.$  (24)

Here, $[\bullet]$ denotes the arguments $[p(\tilde{x}|x); p(\tilde{x}); p(y|\tilde{x})]$ of the free energy. Note $F^{q^*}_{gIB}[\bullet] = \tilde{\beta}_{gIB}F^{Helmholtz}_{q^*}$, where $F^{Helmholtz}_{q^*}$ is the $q^*$-deformed gIB Helmholtz free energy. Invoking (4) and (21), (24) yields

$F^{q^*}_{gIB}[\bullet] = I_{q^*}(X;\tilde{X}) + \tilde{\beta}_{gIB}\left\langle\sum_y p(y|x)\ln_{q^*}\frac{p(y|x)}{p(y|\tilde{x})}\right\rangle_{p(x,\tilde{x})}$
$\overset{(a)}{=} I_{q^*}(X;\tilde{X}) - \tilde{\beta}_{gIB}\left\langle\sum_y p(y|x)\ln_q\frac{p(y|\tilde{x})}{p(y|x)}\right\rangle_{p(x,\tilde{x})}$
$\overset{(b)}{=} I_{q^*}(X;\tilde{X}) - \tilde{\beta}_{gIB}\sum_{x,\tilde{x},y} p(x,\tilde{x},y)\ln_q\frac{p(x,\tilde{x})p(y|\tilde{x})}{p(x,\tilde{x},y)}$
$= D^{q^*}_{K-L}\left[p(X,\tilde{X})\|p(X)p(\tilde{X})\right] + \tilde{\beta}_{gIB}D^q_{K-L}\left[p(X,\tilde{X},Y)\|p(X,\tilde{X})p(Y|\tilde{X})\right].$  (25)

Here, $(a)$ invokes the additive duality in the second term in order to introduce the expected effective distortion in $q$-space, and $(b)$ invokes (12) to obtain the total probability $p(x,\tilde{x},y)$. In (25), $F^{q^*}_{gIB}[\bullet]$ is the sum of two generalized K-Lds having nonadditivity parameters $q^*$ and $q$, where $0 < q^* < 1$ and $q > 1$. From [24], it readily follows that $F^{q^*}_{gIB}[\bullet]$ is non-negative and convex.
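Equation (25), which expresses $F^{q^*}_{gIB}[\bullet]$ as the sum of two generalized K-Lds, can be checked numerically. The sketch below is illustrative (the names and toy distributions are hypothetical): it builds $p(x,\tilde{x},y)$ from the Markov condition (12), and verifies both the non-negativity of the free energy and the additive-duality identity $\ln_{q^*}(x) = -\ln_q(1/x)$ used in step $(a)$:

```python
import numpy as np

def ln_q(x, q):
    """q-deformed logarithm."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def d_kl_q(p, r, q):
    """Generalized K-Ld: D_q[p || r] = -sum p ln_q(r/p)."""
    p, r = p.ravel(), r.ravel()
    m = p > 0
    return -np.sum(p[m] * ln_q(r[m] / p[m], q))

rng = np.random.default_rng(0)
q = 1.4
qstar = 2.0 - q            # additive duality q* = 2 - q
beta = 2.0                 # gIB tradeoff parameter (illustrative value)

# Toy joint p(x, y) and a random quantizer p(x~|x).
p_xy = rng.random((3, 4)); p_xy /= p_xy.sum()
p_t_x = rng.random((3, 2)); p_t_x /= p_t_x.sum(axis=1, keepdims=True)

p_x = p_xy.sum(axis=1)
p_xt = p_x[:, None] * p_t_x                  # p(x, x~)
p_t = p_xt.sum(axis=0)                       # p(x~)
p_ty = np.einsum('xy,xt->ty', p_xy, p_t_x)   # p(x~, y)
p_y_t = p_ty / p_t[:, None]                  # p(y | x~)

# p(x, x~, y) = p(x, y) p(x~|x)  (Markov condition (12))
p_xty = np.einsum('xy,xt->xty', p_xy, p_t_x)
# reference measure p(x, x~) p(y | x~)
r_xty = p_xt[:, :, None] * p_y_t[None, :, :]

# Free energy as the sum of two generalized K-Lds, eq. (25).
F = (d_kl_q(p_xt, np.outer(p_x, p_t), qstar)
     + beta * d_kl_q(p_xty, r_xty, q))
print(F)    # non-negative, consistent with [24]
```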
The expected effective distortion term in (25) is related to the relevant information as

$F^{q^*}_{gIB}[\bullet] = I_{q^*}(X;\tilde{X}) - \tilde{\beta}_{gIB}\underbrace{\sum_{x,\tilde{x}} p(x,\tilde{x})\sum_y p(y|x)\ln_q\frac{p(y|\tilde{x})}{p(y|x)}}_{-D_{eff}}$
$\overset{(a)}{=} I_{q^*}(X;\tilde{X}) - \tilde{\beta}_{gIB}\sum_{x,\tilde{x},y} p(x,\tilde{x},y)^q\left[\ln_q p(y|\tilde{x}) - \ln_q p(y|x)\right]$
$\overset{(b)}{=} I_{q^*}(X;\tilde{X}) + \tilde{\beta}_{gIB}\left[I_q(X;Y) - I_q(\tilde{X};Y)\right]$
$\Rightarrow D_{eff} = I_q(X;Y) - I_q(\tilde{X};Y).$  (26)

Note $I_q(\tilde{X};Y) \le I_q(X;Y)$ by the generalized data processing inequality [10]. Here, $(a)$ employs $\ln_q(x/y) = y^{q-1}\left(\ln_q(x) - \ln_q(y)\right)$, which follows from (6), and $(b)$ adds and subtracts $S_q(Y)$, and invokes (11) and the symmetry of the GMI: $I_q(Y;X) = I_q(X;Y)$, $I_q(Y;\tilde{X}) = I_q(\tilde{X};Y)$. From (22)-(24), an empirical criterion equivalent to the Tsallis cut-off condition, described in terms of the gIB free energy for any $x\in\mathcal{X}$, is: $1 + (q^*-1)F^{q^*}_{gIB}[\bullet](x) < 0$.

IV. THE UPDATE EQUATIONS

Lemma 1: Given a joint distribution $p(x)p(\tilde{x}|x)$, the distribution $p(\tilde{x})$ that minimizes $D^{q^*}_{K-L}[p(X)p(\tilde{X}|X)\|p(X)p(\tilde{X})]$ is the marginal $p^*(\tilde{x}) = \sum_x p(x)p(\tilde{x}|x)$, i.e.

$D^{q^*}_{K-L}\left[p(X)p(\tilde{X}|X)\|p(X)p^*(\tilde{X})\right] = \min_{p(\tilde{x})} D^{q^*}_{K-L}\left[p(X)p(\tilde{X}|X)\|p(X)p(\tilde{X})\right].$  (27)

Also

$\left\langle D^q_{K-L}\left[p(\tilde{x}|x)\|p^*(\tilde{x})\right]\right\rangle_{p(x)} = \min_{p(\tilde{x})}\left\langle D^q_{K-L}\left[p(\tilde{x}|x)\|p(\tilde{x})\right]\right\rangle_{p(x)}.$  (28)

Proof: The positivity condition for (27) is proven in Lemma 1 of [7], with $q^*$ replacing $q$. The positivity condition for (28) is

$-\sum_{x,\tilde{x}} p(x,\tilde{x})\ln_q\frac{p(\tilde{x})}{p(\tilde{x}|x)} + \sum_{x,\tilde{x}} p(x,\tilde{x})\ln_q\frac{p^*(\tilde{x})}{p(\tilde{x}|x)}$
$\overset{(a)}{=} -\sum_{x,\tilde{x}} p(x,\tilde{x})\ln_{q^*}\frac{p(\tilde{x}|x)}{p^*(\tilde{x})} + \sum_{x,\tilde{x}} p(x,\tilde{x})\ln_{q^*}\frac{p(\tilde{x}|x)}{p(\tilde{x})}$
$\overset{(b)}{=} D^{q^*}_{K-L}\left[p^*(\tilde{x})\|p(\tilde{x})\right]\times\left[1 + (q-1)\left\langle D^q_{K-L}\left[p(\tilde{x}|x)\|p(\tilde{x})\right]\right\rangle_{p(x)}\right] > 0;\quad \forall\ 0 < q^* < 1,\ q > 1.$  (29)

Here, $(a)$ invokes the additive duality, and $(b)$ employs the $q$-deformed algebra definition for $\ln_{q^*}\frac{a}{b}$ from (6) [21], multiplying and dividing by $\sum_{x,\tilde{x}} p(x,\tilde{x}) + (1-q^*)\sum_{x,\tilde{x}} p(x,\tilde{x})\ln_{q^*}\frac{p(\tilde{x}|x)}{p(\tilde{x})}$, using $\sum_{x,\tilde{x}} p(x,\tilde{x}) = 1$, and establishes $p^*(\tilde{x}) = \sum_x p(x)p(\tilde{x}|x)$ after subjecting the term within brackets ($[\bullet]$) in (29) to the additive duality ($q = 2-q^*$).

The free energy $F^{q^*}_{gIB}[\bullet]$ is convex only when independently evaluated with respect to one of the convex distribution sets $\{p(\tilde{x}|x)\}$, $\{p(\tilde{x})\}$, and $\{p(y|\tilde{x})\}$. The update equations which minimize the free energy are obtained by projecting the free energy onto each convex distribution while keeping the other two arguments constant.

Theorem 1: Equations (14), (15) and (23) are satisfied at the minima of the free energy (24), for each argument of the free energy, as

$\min_{p(y|\tilde{x})}\ \min_{p(\tilde{x})}\ \min_{p(\tilde{x}|x)} F^{q^*}_{gIB}\left[p(\tilde{x}|x); p(\tilde{x}); p(y|\tilde{x})\right].$  (30)

Denoting the iteration level as $(\tau)$, minimization is performed independently by converging alternating iterations

$p^{(\tau+1)}(\tilde{x}|x) \leftarrow \frac{p^{(\tau)}(\tilde{x})\exp_q\left(-\tilde{\beta}^{(\tau)}_{gIB}(x)D^q_{K-L}\left[p(y|x)\|p^{(\tau)}(y|\tilde{x})\right]\right)}{\tilde{Z}^{(\tau+1)}\left(x,\tilde{\beta}_{gIB}(x)\right)},\quad \tilde{\beta}^{(\tau)}_{gIB}(x) = \frac{\tilde{\beta}_{gIB}}{\Im^{(\tau)}_{gIB}(x)} = \frac{\tilde{\beta}_{gIB}}{\tilde{Z}^{(\tau)}\left(x,\tilde{\beta}_{gIB}(x)\right)^{1-q^*}},$
$p^{(\tau+1)}(\tilde{x}) \leftarrow \sum_x p(x)p^{(\tau+1)}(\tilde{x}|x),$
$p^{(\tau+1)}(y|\tilde{x}) \leftarrow \frac{1}{p^{(\tau+1)}(\tilde{x})}\sum_x p(x,y)p^{(\tau+1)}(\tilde{x}|x).$  (31)

Proof: The outline of the proof is presented herein owing to space constraints. Defining $\tilde{F}^{q^*}_1[\bullet] = F^{q^*}_{gIB}[\bullet] + \tilde{\lambda}(x)\left(\sum_{\tilde{x}} p(\tilde{x}|x) - 1\right)$, and following the procedure in Section III-B of this paper, $\frac{\delta\tilde{F}^{q^*}_1[\bullet]}{\delta p(\tilde{x}|x)} = 0$ exactly yields (23). Minimization with respect to $p(y|\tilde{x})$ affects only $D_{eff}$ in (25).
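The alternating iterations (31) admit a compact numerical sketch. The implementation below is illustrative and deliberately simplified relative to the paper: the tradeoff parameter is held fixed at a constant $\tilde{\beta}_{gIB}$ (the per-$x$ rescaling through the partition function $\tilde{Z}^{(\tau)}$ is omitted), strictly positive distributions are assumed so the Tsallis cut-off is never triggered, and all names are hypothetical:

```python
import numpy as np

def ln_q(x, q):
    """q-deformed logarithm."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(u, q):
    """q-deformed exponential with the Tsallis cut-off [.]_+ ."""
    base = np.asarray(1.0 + (1.0 - q) * u, dtype=float)
    out = np.zeros_like(base)
    pos = base > 0.0
    out[pos] = base[pos] ** (1.0 / (1.0 - q))
    return out

def d_kl_q(p, r, q):
    """Generalized K-Ld: D_q[p || r] = -sum p ln_q(r/p)."""
    m = p > 0
    return -np.sum(p[m] * ln_q(r[m] / p[m], q))

def gib_iterate(p_xy, n_clusters, beta, q, n_iter=100, seed=0):
    """Simplified alternating gIB iterations patterned on (31)."""
    rng = np.random.default_rng(seed)
    nx, ny = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_x = p_xy / p_x[:, None]
    p_t_x = rng.random((nx, n_clusters))           # random initial quantizer
    p_t_x /= p_t_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = p_x @ p_t_x                          # p(x~)   via marginalization
        p_y_t = (p_t_x.T @ p_xy) / p_t[:, None]    # p(y|x~) via Bayes/consistency
        # D[x, t] = D_q[p(y|x) || p(y|x~)]
        D = np.array([[d_kl_q(p_y_x[x], p_y_t[t], q)
                       for t in range(n_clusters)] for x in range(nx)])
        w = p_t[None, :] * exp_q(-beta * D, q)     # canonical update, cf. (23)
        p_t_x = w / w.sum(axis=1, keepdims=True)
    return p_t_x, p_t, p_y_t

# Toy, strictly positive joint distribution p(x, y).
p_xy = np.array([[0.20, 0.05, 0.05],
                 [0.05, 0.20, 0.05],
                 [0.05, 0.05, 0.30]])
p_t_x, p_t, p_y_t = gib_iterate(p_xy, n_clusters=2, beta=5.0, q=1.3)
print(p_t_x)    # converged soft assignment p(x~|x)
```

As noted below, the updates are not globally convergent, so in practice the iteration is restarted from several random initializations.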
Defining $\tilde{F}^{q^*}_2[\bullet] = F^{q^*}_{gIB}[\bullet] + \tilde{\lambda}(\tilde{x})\left(\sum_y p(y|\tilde{x}) - 1\right)$ and invoking $D_{eff} = I_q(X;Y) - I_q(\tilde{X};Y)$ from (26), employing (4), (11), (12), and (15) yields

$-\tilde{\beta}_{gIB}\sum_x p(x,\tilde{x},y)^q\frac{\delta\ln_q p(y|\tilde{x})}{\delta p(y|\tilde{x})} + \tilde{\lambda}(\tilde{x}) = 0$
$\Rightarrow -\frac{1}{p(\tilde{x})^q}\sum_x\frac{p(x,y)^q p(\tilde{x}|x)^q}{p(y|\tilde{x})^q} + \tilde{\lambda}_1(\tilde{x}) = 0;\quad \tilde{\lambda}_1(\tilde{x}) = \frac{\tilde{\lambda}(\tilde{x})}{\tilde{\beta}_{gIB}}$
$\Rightarrow p(y|\tilde{x}) = \frac{1}{p(\tilde{x})}\sum_x p(x,y)p(\tilde{x}|x).$  (32)

From (27), it may be shown that $p(\tilde{x})$ minimizes $I_{q^*}(X;\tilde{X})$. Since $D_{eff}$ is the expectation of a generalized K-Ld, (28) is applied to demonstrate that $p(\tilde{x})$ is a minimizer of $D_{eff}$. Note that the gIB update equations are not globally convergent.

V. CONCLUSIONS AND DISCUSSIONS

Akin to RD theory, the degree of compression may be assessed by the compression information $I_{q^*}(X;\tilde{X})$. However, while the RD method is upper bounded by an a-priori chosen optimal expected distortion $D$, the gIB method is lower bounded by the relevant information $I_{q^*}(\tilde{X};Y)$. It has been demonstrated that in lossy compression, $I_{q^*}(X;\tilde{X})$ is always lower than its counterpart obtained using B-G-S statistics [7]. This observation implies that gIB relevance-compression curves will tend to traverse the forbidden region of an equivalent IB method based on B-G-S statistics. Future work will cast the gIB model within the framework of Bregman divergences [25].

ACKNOWLEDGMENT

RCV gratefully acknowledges support from RAND-MSR contract CSM-DI & S-QIT-101155-03-2009.

REFERENCES

[1] T. Cover and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, 1991.
[2] T. Berger, Rate Distortion Theory, Prentice-Hall, Englewood Cliffs, 1971.
[3] N. Tishby, F. C. Pereira and W. Bialek, "The information bottleneck method", Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, Eds. B. Hajek and R. S. Sreenivas, p. 368, University of Illinois, Urbana, IL, 1999.
[4] N. Slonim, The Information Bottleneck: Theory and Applications, PhD thesis, Hebrew University, Jerusalem, 2003.
[5] C. Tsallis, "Possible generalizations of Boltzmann-Gibbs statistics", J. Stat. Phys., 52, 479, 1988.
[6] M. Gell-Mann and C. Tsallis, Eds., Nonextensive Entropy: Interdisciplinary Applications, Oxford University Press, Oxford, 2004.
[7] R. C. Venkatesan and A. Plastino, "Generalized statistics framework for rate distortion theory", Physica A, 388(12), 2337, 2009.
[8] P. T. Landsberg and V. Vedral, "Distributions and channel capacities in generalized statistical mechanics", Phys. Lett. A, 247, 211, 1998.
[9] T. Yamano, "Information theory based on nonadditive information content", Phys. Rev. E, 63, 046105, 2001.
[10] S. Furuichi, "Information theoretical properties on Tsallis entropies", J. Math. Phys., 47, 023302, 2006.
[11] H. Suyari, "Source coding theorem based on a nonadditive information content", IEEE Trans. on Inform. Theory, 50(8), 1783, 2004.
[12] T. Yamano, "Source coding theorem based on a nonadditive information content", Physica A, 305, 190, 2002.
[13] G. L. Ferri, S. Martinez and A. Plastino, "Equivalence of the four versions of Tsallis statistics", J. Stat. Mech.: Theory and Experiment, 2005(04), P04009, 2005.
[14] E. M. F. Curado and C. Tsallis, "Generalized statistical mechanics: connection with thermodynamics", J. Phys. A: Math. Gen., 24, L69, 1991.
[15] C. Tsallis, R. S. Mendes and A. R. Plastino, "The role of constraints within generalized nonextensive statistics", Physica A, 261, 534, 1998.
[16] S. Martínez, F. Nicolás, F. Pennini and A. Plastino, "Tsallis' entropy maximization procedure revisited", Physica A, 286, 489, 2000.
[17] J. Naudts, "Deformed exponentials and logarithms in generalized thermostatistics", Physica A, 340, 32, 2004.
[18] W. Bialek, I. Nemenman and N. Tishby, "Predictability, complexity, and learning", Neural Comp., 13, 2409, 2001.
[19] C. Tsallis, "Generalized entropy-based criterion for consistent testing", Phys. Rev. E, 58, 1442, 1998.
[20] L. Borland, A. Plastino and C. Tsallis, "Information gain within nonextensive thermostatistics", J. Math. Phys., 39, 6490, 1998.
[21] E. Borges, "A possible deformed algebra and calculus inspired in nonextensive thermostatistics", Physica A, 340, 95, 2004.
[22] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley & Sons, New York, NY, 1996.
[23] I. Csiszár and G. Tusnády, "Information geometry and alternating minimization procedures", Statistics and Decisions, 1, 205, 1984.
[24] S. Furuichi, "On uniqueness theorems for Tsallis entropy and Tsallis relative entropy", IEEE Trans. Inform. Theory, 51, 3638, 2005.
[25] P. Harremoës and N. Tishby, "The information bottleneck revisited or how to choose a good distortion measure", Proceedings of the IEEE Int. Symp. on Information Theory 2007, 566, 2007.