A non-negative expansion for small Jensen-Shannon Divergences

In this report, we derive a non-negative series expansion for the Jensen-Shannon divergence (JSD) between two probability distributions. This series expansion is shown to be useful for numerical calculations of the JSD when the probability distributions are nearly equal, and for which, consequently, small numerical errors dominate evaluation.

Authors: - **Anil Raj** – Department of Applied Physics and Applied Mathematics, Columbia University
- **Chris H. Wiggins** – Department of Applied Physics and Applied Mathematics, Center for Computational Biology and Bioinformatics, Columbia University

Anil Raj,∗ Department of Applied Physics and Applied Mathematics, Columbia University, New York
Chris H. Wiggins,† Department of Applied Physics and Applied Mathematics, and Center for Computational Biology and Bioinformatics, Columbia University, New York
(Dated: October 13, 2021)

In this report, we derive a non-negative series expansion for the Jensen-Shannon divergence (JSD) between two probability distributions. This series expansion is shown to be useful for numerical calculations of the JSD when the probability distributions are nearly equal, and for which, consequently, small numerical errors dominate evaluation.

Keywords: entropy, JS divergence

I. INTRODUCTION

The Jensen-Shannon divergence (JSD) has been widely used as a dissimilarity measure between weighted probability distributions. The direct numerical evaluation of the exact expression for the JSD (involving a difference of logarithms), however, leads to numerical errors when the distributions are close to each other (small JSD); when the element-wise difference between the distributions is $O(10^{-1})$, this naive formula produces erroneous values (sometimes negative) when used for numerical calculations. In this report, we derive a provably non-negative series expansion for the JSD which can be used in the small-JSD limit, where the naive formula fails.

II. SERIES EXPANSION FOR JENSEN-SHANNON DIVERGENCE

Consider two discrete probability distributions $p^1$ and $p^2$ over a sample space $S$ of cardinality $N$, with relative normalized weights $\pi_1$ and $\pi_2$ between them. The JSD between the distributions is then defined as [1]

$$\Delta_{\mathrm{naive}}[p^1, p^2; \pi_1, \pi_2] = H[\pi_1 p^1 + \pi_2 p^2] - \left(\pi_1 H[p^1] + \pi_2 H[p^2]\right) \tag{1}$$

where the entropy (measured in nats) of a probability distribution is defined as

$$H[p] = \sum_{j=1}^{N} h(p_j) = -\sum_{j=1}^{N} p_j \log(p_j). \tag{2}$$
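Eq. (1) translates directly into code. Below is a minimal Python sketch of the naive evaluation (the paper's own implementation is in MATLAB; the function name `jsd_naive` and the zero-probability convention are ours):

```python
import numpy as np

def jsd_naive(pi1, p1, pi2, p2):
    """Direct evaluation of Eq. (1) in nats:
    H[pi1*p1 + pi2*p2] - (pi1*H[p1] + pi2*H[p2])."""
    def entropy(p):
        p = np.asarray(p, dtype=float)
        nz = p > 0                      # convention: 0*log(0) = 0
        return -np.sum(p[nz] * np.log(p[nz]))

    mixture = pi1 * np.asarray(p1, dtype=float) + pi2 * np.asarray(p2, dtype=float)
    return entropy(mixture) - (pi1 * entropy(p1) + pi2 * entropy(p2))
```

For well-separated distributions this direct form is accurate; the paper's point is that when $p^1 \approx p^2$ the two entropy terms nearly cancel, so floating-point round-off can dominate, or even flip the sign of, the result.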
Defining

$$\begin{aligned}
\bar{p}_j &= (p^1_j + p^2_j)/2; &\quad 0 \le \bar{p}_j \le 1; \quad \textstyle\sum_{j=1}^{N} \bar{p}_j = 1 \\
\eta_j &= (p^1_j - p^2_j)/2; &\quad \textstyle\sum_{j=1}^{N} \eta_j = 0 \\
\varepsilon_j &= \eta_j / \bar{p}_j; &\quad -1 \le \varepsilon_j \le 1 \\
\alpha &= \pi_1 - \pi_2; &\quad -1 \le \alpha \le 1
\end{aligned} \tag{3}$$

we have

$$h(\pi_1 p^1_j + \pi_2 p^2_j) = -\left(\pi_1(\bar{p}_j + \eta_j) + \pi_2(\bar{p}_j - \eta_j)\right) \log\left(\pi_1(\bar{p}_j + \eta_j) + \pi_2(\bar{p}_j - \eta_j)\right) = -\bar{p}_j (1 + \alpha\varepsilon_j)\left[\log(\bar{p}_j) + \log(1 + \alpha\varepsilon_j)\right] \tag{4}$$

and

$$\begin{aligned}
\pi_1 h(p^1_j) + \pi_2 h(p^2_j) &= -\pi_1(\bar{p}_j + \eta_j)\log(\bar{p}_j + \eta_j) - \pi_2(\bar{p}_j - \eta_j)\log(\bar{p}_j - \eta_j) \\
&= -\tfrac{1}{2}\bar{p}_j(1+\alpha)(1+\varepsilon_j)\log(\bar{p}_j(1+\varepsilon_j)) - \tfrac{1}{2}\bar{p}_j(1-\alpha)(1-\varepsilon_j)\log(\bar{p}_j(1-\varepsilon_j)) \\
&= -\bar{p}_j(1+\alpha\varepsilon_j)\log(\bar{p}_j) - \tfrac{1}{2}\bar{p}_j(1+\alpha\varepsilon_j)\log(1-\varepsilon_j^2) - \tfrac{1}{2}\bar{p}_j(\alpha+\varepsilon_j)\log\!\left(\frac{1+\varepsilon_j}{1-\varepsilon_j}\right).
\end{aligned} \tag{5}$$

Thus,

$$h(\pi_1 p^1_j + \pi_2 p^2_j) - \left(\pi_1 h(p^1_j) + \pi_2 h(p^2_j)\right) = \frac{1}{2}\bar{p}_j\left[(1+\alpha\varepsilon_j)\log\!\left(\frac{1-\varepsilon_j^2}{(1+\alpha\varepsilon_j)^2}\right) + (\alpha+\varepsilon_j)\log\!\left(\frac{1+\varepsilon_j}{1-\varepsilon_j}\right)\right]. \tag{6}$$

The Taylor series expansion of the logarithm function is given as

$$\log(1+x) = \sum_{i=1}^{\infty} c_i x^i; \qquad c_i = \frac{(-1)^{i+1}}{i}. \tag{7}$$

The logarithms in the expression for the J-S divergence can then be written as

$$\log(1+\varepsilon_j) = \sum_{i=1}^{\infty} c_i \varepsilon_j^i, \qquad \log(1-\varepsilon_j) = \sum_{i=1}^{\infty} (-1)^i c_i \varepsilon_j^i, \qquad \log(1+\alpha\varepsilon_j) = \sum_{i=1}^{\infty} c_i \alpha^i \varepsilon_j^i. \tag{8}$$

We then have $\Delta = \frac{1}{2}\sum_{j=1}^{N} \bar{p}_j \delta_j$, with

$$\begin{aligned}
\delta_j &= (1+\alpha\varepsilon_j)\left[\log(1+\varepsilon_j) + \log(1-\varepsilon_j) - 2\log(1+\alpha\varepsilon_j)\right] + (\alpha+\varepsilon_j)\left[\log(1+\varepsilon_j) - \log(1-\varepsilon_j)\right] \\
&= (1+\alpha\varepsilon_j)\left[\sum_{i=1}^{\infty} c_i \varepsilon_j^i + \sum_{i=1}^{\infty} (-1)^i c_i \varepsilon_j^i - 2\sum_{i=1}^{\infty} c_i \alpha^i \varepsilon_j^i\right] + (\alpha+\varepsilon_j)\left[\sum_{i=1}^{\infty} c_i \varepsilon_j^i - \sum_{i=1}^{\infty} (-1)^i c_i \varepsilon_j^i\right] \\
&= \sum_{i=1}^{\infty} c_i \left(\varepsilon_j^i + \alpha\varepsilon_j^{i+1} + (-1)^i \varepsilon_j^i + (-1)^i \alpha\varepsilon_j^{i+1} - 2\alpha^i \varepsilon_j^i - 2\alpha^{i+1}\varepsilon_j^{i+1} + \alpha\varepsilon_j^i + \varepsilon_j^{i+1} + (-1)^{i+1}\alpha\varepsilon_j^i + (-1)^{i+1}\varepsilon_j^{i+1}\right) \\
&= \sum_{i=1}^{\infty} c_i \left[\left((-1)^i - 2\alpha^i + \alpha + (-1)^{i+1}\alpha + 1\right)\varepsilon_j^i + \left((-1)^i\alpha - 2\alpha^{i+1} + 1 + (-1)^{i+1} + \alpha\right)\varepsilon_j^{i+1}\right].
\end{aligned} \tag{9}$$

∗ Electronic address: ar2384@columbia.edu
† Electronic address: chris.wiggins@columbia.edu
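Before expanding in powers of $\varepsilon_j$, it is worth sanity-checking the algebra: Eq. (6) is an exact identity, so for any single element the bracketed form must reproduce the direct difference of entropy terms. A short Python check (the variable names are ours, chosen to mirror Eq. (3)):

```python
import numpy as np

def h(x):
    # elementwise entropy term: h(x) = -x*log(x)
    return -x * np.log(x)

# one element of each distribution, and the weights (pi1 + pi2 = 1)
p1j, p2j = 0.3, 0.2
pi1, pi2 = 0.6, 0.4

# quantities defined in Eq. (3)
pbar = (p1j + p2j) / 2    # \bar{p}_j
eta = (p1j - p2j) / 2     # \eta_j
eps = eta / pbar          # \varepsilon_j
alpha = pi1 - pi2         # \alpha

# left side: h(pi1*p1j + pi2*p2j) - (pi1*h(p1j) + pi2*h(p2j))
lhs = h(pi1 * p1j + pi2 * p2j) - (pi1 * h(p1j) + pi2 * h(p2j))
# right side: the bracketed expression of Eq. (6), times pbar/2
rhs = 0.5 * pbar * (
    (1 + alpha * eps) * np.log((1 - eps**2) / (1 + alpha * eps) ** 2)
    + (alpha + eps) * np.log((1 + eps) / (1 - eps))
)
print(abs(lhs - rhs))
```

The two sides agree to floating-point precision for any valid choice of the inputs, since no approximation has been made yet.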
When $i = 1$, $\mathrm{coeff}(\varepsilon_j) = c_1(-1 - 2\alpha + \alpha + \alpha + 1) = 0$. The first non-vanishing term in the expansion is then of order 2. Shifting indices of the first term in Eqn. (9) gives

$$\begin{aligned}
\delta_j &= \sum_{i=1}^{\infty} \left[c_{i+1}\left((-1)^{i+1} - 2\alpha^{i+1} + \alpha + (-1)^{i+2}\alpha + 1\right) + c_i\left((-1)^i\alpha - 2\alpha^{i+1} + 1 + (-1)^{i+1} + \alpha\right)\right]\varepsilon_j^{i+1} \\
&= \sum_{i=1}^{\infty} (c_i + c_{i+1})\left[(-1)^i\alpha - 2\alpha^{i+1} + \alpha + 1 + (-1)^{i+1}\right]\varepsilon_j^{i+1} \\
&= \sum_{i=1}^{\infty} \frac{(-1)^{i+1}}{i(i+1)}\left[(-1)^i\alpha - 2\alpha^{i+1} + \alpha + 1 + (-1)^{i+1}\right]\varepsilon_j^{i+1} \\
&= \sum_{i=1}^{\infty} B_i \varepsilon_j^{i+1}
\end{aligned} \tag{10}$$

where

$$B_i = \frac{1 - \alpha + (-1)^{i+1}\left(1 + \alpha - 2\alpha^{i+1}\right)}{i(i+1)} = \begin{cases} 2\left(1-\alpha^{i+1}\right)/(i(i+1)) & i \text{ odd}, \\ -2\left(\alpha-\alpha^{i+1}\right)/(i(i+1)) & i \text{ even}. \end{cases} \tag{11}$$

FIG. 1: Plot comparing the naive and approximate formulae, truncated at different orders, for calculating JSD as a function of the normalized L2-distance ($\|\varepsilon\|$; see Section III) between pairs of randomly generated probability distributions. Best-fit slopes are: -2.05 ($k$ = 3), -5.89 ($k$ = 6), -8.14 ($k$ = 9), -11.91 ($k$ = 12) and -105.43 (comparing naive with $k$ = 100).

FIG. 2: Probability of obtaining (erroneous) negative values, when directly evaluating JSD using its exact expression, plotted as a function of $\|\varepsilon\|$. When implemented in MATLAB, we observe that the naive formula gives negative JSD when $\|\varepsilon\|$ is merely of $O(10^{-6})$.

This series expansion can be further simplified as

$$\delta_j = \sum_{i=1}^{\infty} \left(B_{2i-1} + B_{2i}\varepsilon_j\right)\varepsilon_j^{2i} = \sum_{i=1}^{\infty} B_{2i-1}\left(1 + \frac{B_{2i}}{B_{2i-1}}\varepsilon_j\right)\varepsilon_j^{2i}, \tag{12}$$

$$\frac{B_{2i}}{B_{2i-1}}\varepsilon_j = -\left(\frac{2i-1}{2i+1}\right)\alpha\varepsilon_j. \tag{13}$$

Since $-1 \le \alpha\varepsilon_j \le 1$, we have $-1 \le \frac{B_{2i}}{B_{2i-1}}\varepsilon_j \le 1$. Thus, for every $i$, $(B_{2i-1} + B_{2i}\varepsilon_j)\varepsilon_j^{2i} \ge 0$, making $\delta_j$, and hence the series expansion for $\Delta_{\mathrm{naive}}$, non-negative to all orders.

III. NUMERICAL RESULTS

The accuracy of the truncated series expansion can be compared with that of the naive formula by measuring the JSD between randomly generated probability distributions.
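The closed form for $B_i$ in Eq. (11) and the sign argument around Eqs. (12) and (13) are easy to verify numerically. Here is a Python sketch using exact rational arithmetic (the helper `B` and the sample values of $\alpha$ and $\varepsilon_j$ are ours, chosen for illustration):

```python
from fractions import Fraction

def B(i, alpha):
    # Eq. (11): B_i = (1 - alpha + (-1)^(i+1) * (1 + alpha - 2*alpha^(i+1))) / (i*(i+1))
    return (1 - alpha + (-1) ** (i + 1) * (1 + alpha - 2 * alpha ** (i + 1))) / (i * (i + 1))

alpha = Fraction(3, 10)  # example weight difference, |alpha| <= 1

# check the odd/even closed forms of Eq. (11)
for i in range(1, 8):
    if i % 2 == 1:
        expected = 2 * (1 - alpha ** (i + 1)) / (i * (i + 1))
    else:
        expected = -2 * (alpha - alpha ** (i + 1)) / (i * (i + 1))
    assert B(i, alpha) == expected

# each grouped term (B_{2i-1} + B_{2i}*eps) * eps^{2i} in Eq. (12) is non-negative
eps = Fraction(-9, 10)   # any value in [-1, 1]
for i in range(1, 6):
    term = (B(2 * i - 1, alpha) + B(2 * i, alpha) * eps) * eps ** (2 * i)
    assert term >= 0
```

Using `Fraction` keeps the check free of round-off, so the non-negativity observed is a property of the coefficients themselves, not of floating-point luck.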
Pairs of probability distributions with $-4 \le \log_{10}\|\varepsilon\| < 0$, where $\|\varepsilon\| = \sqrt{\sum_{j=1}^{N}\varepsilon_j^2 / N}$, were randomly generated, and the J-S divergence between each pair was calculated both by a direct evaluation of the exact expression ($\Delta_{\mathrm{naive}}$) and by the approximate expansion ($\Delta_k$; $k \in \{3, 6, 9, 12\}$), where

$$\Delta_k = \frac{1}{2}\sum_{j=1}^{N} \bar{p}_j \delta_{jk}; \qquad \delta_{jk} = \sum_{i=1}^{k} B_i \varepsilon_j^{i+1}. \tag{14}$$

The results shown in Fig. 1 suggest the series expansion to be the more numerically useful formula when the probability distributions differ by $\|\varepsilon\| \sim O(10^{-0.5})$. Fig. 2 further shows that when $\|\varepsilon\| \sim O(10^{-6})$, a direct evaluation of the exact formula for JSD gives negative values (when implemented in MATLAB).

APPENDIX

Here we include the MATLAB code used in the figures for approximate evaluation of JSD using its series expansion.

```matlab
function [JS,epsnorm] = JSapprx(pi1,p1,pi2,p2,order)
% [JS,epsnorm]=JSapprx(pi1,p1,pi2,p2,order) calculates JS
% divergence given two probability distributions and
% their relative weights. JSapprx uses an approximation
% to the JSD by expanding in powers of epsilon=(p1-p2)/(p1+p2)
% and truncating at an order input by the user.
%
% This calculation is described in the technical report
% "A non-negative expansion for small Jensen-Shannon Divergences"
% by Anil Raj and Chris H. Wiggins, October 2008

% average of distributions
pbar=(p1+p2)/2;
% difference of distributions
eta=(p1-p2)/2;
% ratio of difference to average
epsilon=eta./pbar;
% difference in biases, where pi1+pi2=1
alpha=pi1-pi2;

% calculate JS by summing up to order 'order'
js=zeros(size(pbar));
% denominator computed by summing, as well
denominator=0;
for i=2:order
    denominator=denominator+(i-1);
    % numerical coefficient
    c=(-1)^i*(1/denominator);
    Bi=c*(alpha^(mod(i,2))-alpha^i);
    js=js+Bi*(epsilon.^i);
end
% sum over 'j':
JS=pbar'*js/2;
% convert from nats to bits:
JS=JS/log(2);

% norm of epsilon reported as output
if nargout==2
    epsnorm=sqrt(sum(epsilon.^2)/length(pbar));
end
```

[1] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, Jan 1991.
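For readers working outside MATLAB, the appendix code transcribes almost line for line into Python; the sketch below mirrors the MATLAB variable names, though the NumPy port itself is ours, not from the paper:

```python
import numpy as np

def js_apprx(pi1, p1, pi2, p2, order):
    """Truncated series evaluation of the JSD (Eq. 14), returned in bits,
    along with the norm of epsilon."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    pbar = (p1 + p2) / 2              # average of distributions
    eta = (p1 - p2) / 2               # difference of distributions
    epsilon = eta / pbar              # ratio of difference to average
    alpha = pi1 - pi2                 # difference in biases (pi1 + pi2 = 1)

    js = np.zeros_like(pbar)
    denominator = 0                   # accumulated by summing, as in the MATLAB code
    for i in range(2, order + 1):
        denominator += i - 1
        c = (-1) ** i / denominator   # numerical coefficient
        Bi = c * (alpha ** (i % 2) - alpha ** i)
        js += Bi * epsilon ** i
    JS = pbar @ js / 2                # sum over j
    JS /= np.log(2)                   # convert from nats to bits
    epsnorm = np.sqrt(np.sum(epsilon ** 2) / len(pbar))
    return JS, epsnorm
```

As a quick check on nearly equal distributions: for $p^1 = (0.501, 0.499)$ and $p^2 = (0.499, 0.501)$ with equal weights, the truncated series agrees closely with the exact JSD and, by construction, cannot go negative.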
