Chernoff information of exponential families
Frank Nielsen, Senior Member, IEEE

Abstract—Chernoff information upper bounds the probability of error of the optimal Bayesian decision rule for $2$-class classification problems. However, it turns out that in practice the Chernoff bound is hard to calculate or even approximate. In statistics, many usual distributions, such as Gaussians, Poissons or frequency histograms called multinomials, can be handled in the unified framework of exponential families. In this note, we prove that the Chernoff information for members of the same exponential family can either be derived analytically in closed form, or efficiently approximated using a simple geodesic bisection optimization technique based on an exact geometric characterization of the "Chernoff point" on the underlying statistical manifold.

Index Terms—Chernoff information, α-divergences, exponential families, information geometry.
1 INTRODUCTION

Consider the following statistical decision problem of classifying a random observation x as one of two possible classes C_1 and C_2 (say, detecting a target signal from a noise signal). Let w_1 = Pr(C_1) > 0 and w_2 = Pr(C_2) = 1 − w_1 > 0 denote the a priori class probabilities, and let p_1(x) = Pr(x|C_1) and p_2(x) = Pr(x|C_2) denote the class-conditional probabilities, so that p(x) = w_1 p_1(x) + w_2 p_2(x). Bayes decision rule classifies x as C_1 if Pr(C_1|x) > Pr(C_2|x), and as C_2 otherwise. Using Bayes rule¹, we have Pr(C_i|x) = Pr(C_i) Pr(x|C_i) / Pr(x) = w_i p_i(x) / p(x) for i ∈ {1, 2}. Thus Bayes decision rule assigns x to class C_1 if and only if w_1 p_1(x) > w_2 p_2(x), and to C_2 otherwise. Let L(x) = Pr(x|C_1) / Pr(x|C_2) denote the likelihood ratio.
In decision theory [1], Neyman and Pearson proved that the optimal decision test necessarily has the form L(x) ≥ t to accept hypothesis C_1, where t is a threshold value. The probability of error E = Pr(Error) of any decision rule D is E = ∫ p(x) Pr(Error|x) dx, where Pr(Error|x) = Pr(C_1|x) if D wrongly decided C_2, and Pr(C_2|x) if D wrongly decided C_1. Thus Bayes decision rule minimizes by construction the average probability of error:

    E* = ∫ Pr(Error|x) p(x) dx                              (1)
       = ∫ min(Pr(C_1|x), Pr(C_2|x)) p(x) dx.               (2)

The Bayesian rule is also called the maximum a posteriori (MAP) decision rule. Bayes error therefore constitutes the reference benchmark, since no other decision rule can beat its classification performance.

Bounding Bayes error tightly is thus crucial in hypothesis testing. Chernoff derived a notion of information² from this hypothesis testing task (see Section 7 of [2]). To upper bound Bayes error, one replaces the minimum function by a smooth power function: namely, for a, b > 0, we have

    min(a, b) ≤ a^α b^{1−α},  ∀α ∈ (0, 1).                  (3)

Thus we get the following Chernoff bound:

    E* = ∫ min(Pr(C_1|x), Pr(C_2|x)) p(x) dx                (4)
       ≤ w_1^α w_2^{1−α} ∫ p_1^α(x) p_2^{1−α}(x) dx.        (5)

Since the inequality holds for any α ∈ (0, 1), we upper bound the minimum error E* as E* ≤ w_1^α w_2^{1−α} c_α(p_1 : p_2), where

    c_α(p_1 : p_2) = ∫ p_1^α(x) p_2^{1−α}(x) dx

is called the Chernoff α-coefficient. We use the ":" delimiter to emphasize that this statistical measure is usually not symmetric: c_α(p_1 : p_2) ≠ c_α(p_2 : p_1), although c_α(p_2 : p_1) = c_{1−α}(p_1 : p_2). For α = 1/2, we obtain the symmetric Bhattacharyya coefficient [3]: b(p_1, p_2) = c_{1/2}(p_1 : p_2) = ∫ √(p_1(x) p_2(x)) dx = b(p_2, p_1). The optimal Chernoff α-coefficient is found by choosing the best exponent for upper bounding Bayes error [1]:

    c*(p_1 : p_2) = c_{α*}(p_1 : p_2) = min_{α∈(0,1)} ∫ p_1^α(x) p_2^{1−α}(x) dx.   (6)

Since the Chernoff coefficient is a measure of similarity (with 0 < c_α(p_1, p_2) ≤ 1) related to the overlap of the densities p_1 and p_2, we can derive from it a statistical distance measure, called the Chernoff information (or Chernoff divergence):

    C*(p_1 : p_2) = C_{α*}(p_1 : p_2)                                    (7)
                  = −log min_{α∈(0,1)} ∫ p_1^α(x) p_2^{1−α}(x) dx ≥ 0
                  = max_{α∈(0,1)} −log ∫ p_1^α(x) p_2^{1−α}(x) dx.       (8)

In the remainder, we call Chernoff divergence (or Chernoff information) the measure C*(· : ·), and Chernoff α-divergence (of the first type) the functional C_α(p : q) = −log c_α(p : q) (for α ∈ (0, 1)). Chernoff information yields the best achievable exponent for a Bayesian probability of error [1]:

    E* ≤ w_1^{α*} w_2^{1−α*} e^{−C*(p_1 : p_2)}.             (9)

From the Chernoff α-coefficient measure of similarity, we can derive a second type of Chernoff α-divergences [4], defined by

    C′_α(p : q) = (1/(α(1−α))) (1 − c_α(p : q)).

Those second-type Chernoff α-divergences are related to Amari α-divergences [5] by a linear mapping [4] on the exponent α, and to Rényi and Tsallis relative entropies (see Section 4). In the remainder, Chernoff α-divergences refer to the first-type divergence.

In practice, we have statistical knowledge neither of the prior class distributions nor of the class-conditional distributions; rather, we are given a training set of correctly labeled class points. In that case, a simple decision rule, called the nearest neighbor rule³, consists in labeling an observation x according to the label of its nearest neighbor (ground truth). It can be shown that the probability of error of this simple scheme is upper bounded by twice the optimal Bayes error [6], [7]. Thus half of the Chernoff information is somehow contained in the nearest neighbor knowledge, a key component of machine learning algorithms. (It is traditional to improve this classification by taking a majority vote over the k nearest neighbors.)

Chernoff information has appeared in many applications, ranging from sensor networks [8] to visual computing tasks such as image segmentation [9], image registration [10], face recognition [11], feature detection [12], and edge detection [13], just to name a few.

F. Nielsen is with the Sony Computer Science Laboratories (Tokyo, Japan) and École Polytechnique (Palaiseau, France). E-mail: nielsen@lix.polytechnique.fr

1. Bayes rule states that the joint probability of two events equals the product of the probability of one event times the conditional probability of the second event given the first one. That is, Pr(x ∧ θ) = Pr(x) Pr(θ|x) = Pr(θ) Pr(x|θ), so that Pr(θ|x) = Pr(θ) Pr(x|θ) / Pr(x).

2. In information theory, there exist several notions of information, such as Fisher information in statistics or Shannon information in coding theory. Those various definitions gained momentum by asking questions like "How hard is it to estimate/discriminate distributions?" (Fisher) or "How hard is it to compress data?" (Shannon). Those "how hard..." questions were answered by proving lower bounds (Cramér–Rao for Fisher, and entropy for Shannon). Similarly, Chernoff information answers "How hard is it to classify (empirical) data?" by providing a tight bound: the Chernoff (classification) information.
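To make Eqs. (4)–(8) concrete, here is a small numerical sketch (an editorial addition, not part of the original text) that brute-forces the Chernoff information of two univariate Gaussians by discretizing the Chernoff α-coefficient integral and scanning the exponent; all function names and grid sizes are illustrative:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated on the grid x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Two univariate normals with equal variance (sigma = 3) and means 0 and 2.
x = np.linspace(-30.0, 30.0, 20001)
dx = x[1] - x[0]
p = gauss_pdf(x, 0.0, 3.0)
q = gauss_pdf(x, 2.0, 3.0)

def chernoff_coefficient(alpha):
    # c_alpha(p : q) = integral of p^alpha q^(1 - alpha), by a Riemann sum.
    return np.sum(p ** alpha * q ** (1.0 - alpha)) * dx

alphas = np.linspace(0.001, 0.999, 999)
coeffs = np.array([chernoff_coefficient(a) for a in alphas])
alpha_star = alphas[np.argmin(coeffs)]   # best exponent, Eq. (6)
C_star = -np.log(coeffs.min())           # Chernoff information, Eq. (8)

# For equal variances the optimal exponent is 1/2 and C* equals the
# Bhattacharyya distance (mu_p - mu_q)^2 / (8 sigma^2) = 4/72 = 1/18.
print(alpha_star, C_star)
```

For unequal variances the optimum shifts away from 1/2; capturing that shift in closed form is precisely what the machinery of the following sections provides.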
The paper is organized as follows: Section 2 introduces the functional parametric Bregman and Jensen classes of statistical distances. Section 3 concisely describes the exponential families in statistics. Section 4 proves that the Chernoff α-divergence of two members of the same exponential family is equivalent to a skew Jensen divergence evaluated at the corresponding distribution parameters. In Section 5, we show that the optimal Chernoff coefficient, obtained by minimizing the Chernoff coefficient (equivalently, maximizing the skew Jensen divergence), yields an equivalent Bregman divergence, which can be derived from a simple optimality criterion. A closed-form formula follows for the Chernoff information of single-parametric exponential families in Section 5.1. We extend the optimality criterion to the multi-parametric case in Section 5.2. Section 6 characterizes geometrically the optimal solution by introducing concepts of information geometry. Section 7 designs a simple yet efficient geodesic bisection search algorithm for approximating the multi-parametric case. Finally, Section 8 concludes the paper.

2 STATISTICAL DIVERGENCES

Given two probability distributions with respective densities p and q, a divergence D(p : q) measures the distance between those distributions. The classical divergence in information theory [1] is the Kullback–Leibler divergence, also called the relative entropy:

    KL(p : q) = ∫ p(x) log( p(x)/q(x) ) dx.                 (10)

(For probability mass functions, the integral is replaced by a discrete sum.) This divergence is oriented (i.e., KL(p : q) ≠ KL(q : p)) and does not satisfy the triangle inequality of metrics. It turns out that the Kullback–Leibler divergence belongs to a wider class of divergences called Bregman divergences.

3. The nearest neighbor rule postulates that things that "look alike must be alike." See [6].
A Bregman divergence is obtained for a strictly convex and differentiable generator F as:

    B_F(p : q) = ∫ ( F(p(x)) − F(q(x)) − (p(x) − q(x)) F′(q(x)) ) dx.    (11)

The Kullback–Leibler divergence is obtained for the generator F(x) = x log x, the negative Shannon entropy (also called Shannon information). This functional parametric class of Bregman divergences B_F can further be interpreted as limit cases of skew Jensen divergences. A skew Jensen divergence (Jensen α-divergence, or α-Jensen divergence) is defined for a strictly convex generator F as

    J_F^{(α)}(p : q) = ∫ ( αF(p(x)) + (1−α)F(q(x)) − F(αp(x) + (1−α)q(x)) ) dx ≥ 0,  ∀α ∈ (0, 1).   (12)

Note that J_F^{(α)}(p : q) = J_F^{(1−α)}(q : p), and that F is defined up to affine terms. For α → {0, 1}, the Jensen divergence tends to zero and loses its discriminative power. However, interestingly, we have

    lim_{α→0} (1/α) J_F^{(α)}(p : q) = B_F(p : q)   and   lim_{α→1} (1/(1−α)) J_F^{(α)}(p : q) = B_F(q : p),

as proved in [14], [15]. That is, Jensen α-divergences tend asymptotically to (scaled) Bregman divergences.

The Kullback–Leibler divergence also belongs to the class of Csiszár F-divergences (with F(x) = x log x), defined for a convex function F with F(1) = 0:

    I_F(p : q) = ∫ q(x) F( p(x)/q(x) ) dx.                  (13)

Amari's α-divergences are the canonical divergences of α-flat spaces in information geometry [16], defined by

    A_α(p : q) = (4/(1−α²)) (1 − c_{(1−α)/2}(p : q)),                    α ≠ ±1,
    A_{−1}(p : q) = ∫ p(x) log( p(x)/q(x) ) dx = KL(p : q),              α = −1,
    A_{1}(p : q) = ∫ q(x) log( q(x)/p(x) ) dx = KL(q : p),               α = 1.    (14)

Those Amari α-divergences (related to Chernoff α-coefficients, and to Chernoff α-divergences of the second type by a linear mapping of the exponent [4]) are F-divergences for the generator F_α(x) = (4/(1−α²)) (1 − x^{(1+α)/2}), α ∉ {−1, 1}.
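As a quick sanity check of the scaled-Bregman limit (an editorial sketch, not from the paper), take the scalar generator F(x) = x log x and verify numerically that J_F^{(α)}(p : q)/α tends to B_F(p : q) as α → 0:

```python
import math

def F(x):
    # Generator: negative Shannon entropy, F(x) = x log x.
    return x * math.log(x)

def F_prime(x):
    return math.log(x) + 1.0

def bregman(p, q):
    # B_F(p : q) = F(p) - F(q) - (p - q) F'(q), Eq. (11) pointwise.
    return F(p) - F(q) - (p - q) * F_prime(q)

def jensen(alpha, p, q):
    # Skew Jensen divergence J_F^(alpha)(p : q), Eq. (12) pointwise.
    return alpha * F(p) + (1.0 - alpha) * F(q) - F(alpha * p + (1.0 - alpha) * q)

p, q = 2.0, 5.0
for alpha in (1e-2, 1e-4, 1e-6):
    print(jensen(alpha, p, q) / alpha)   # approaches B_F(p : q) as alpha -> 0
print(bregman(p, q))
```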
Next, we introduce a versatile class of probability densities in statistics for which α-Jensen divergences (and hence Bregman divergences) admit closed-form formulas.

3 EXPONENTIAL FAMILIES

A generic class of statistical distributions encapsulating many usual distributions (Bernoulli, Poisson, Gaussian, multinomial, Beta, Gamma, Dirichlet, etc.) is the class of exponential families. We recall their elementary definition here, and refer the reader to [17] for a more detailed overview. An exponential family E_F is a parametric set of probability distributions admitting the following canonical decomposition of their densities:

    p(x; θ) = exp( ⟨t(x), θ⟩ − F(θ) + k(x) ),               (15)

where t(x) is the sufficient statistic, θ ∈ Θ are the natural parameters belonging to an open convex natural space Θ, ⟨·,·⟩ is the inner product (i.e., ⟨x, y⟩ = xᵀy for column vectors), F(·) is the log-normalizer (a C^∞ convex function), and k(x) is the carrier measure.

For example, the Poisson distributions Pr(x = k; λ) = λ^k e^{−λ} / k! for k ∈ ℕ form an exponential family E_F = {p_F(x; θ) | θ ∈ Θ}, with t(x) = x the sufficient statistic, θ = log λ the natural parameter, F(θ) = exp θ the log-normalizer, and k(x) = −log x! the carrier measure.

Since we often deal with applications using multivariate normals, we also report the canonical decomposition of the multivariate Gaussian family. We rewrite the Gaussian density of mean μ and variance-covariance matrix Σ,

    p(x; μ, Σ) = (2π)^{−d/2} (det Σ)^{−1/2} exp( −(x − μ)ᵀ Σ^{−1} (x − μ) / 2 ),

in the canonical form with θ = (Σ^{−1}μ, (1/2)Σ^{−1}) ∈ Θ = ℝ^d × K_{d×d} (K_{d×d} denotes the cone of positive definite matrices), F(θ) = (1/4) tr(θ_2^{−1} θ_1 θ_1ᵀ) − (1/2) log det θ_2 + (d/2) log π the log-normalizer, t(x) = (x, −x xᵀ) the sufficient statistics, and k(x) = 0 the carrier measure.
In that case, the inner product ⟨·,·⟩ is composite, calculated as the sum of a vector dot product and a matrix trace product: ⟨θ, θ′⟩ = θ_1ᵀθ′_1 + tr(θ_2ᵀθ′_2), where θ = (θ_1, θ_2) and θ′ = (θ′_1, θ′_2).

The order of an exponential family is the dimension of its parameter space. For example, the Poisson family is of order 1, univariate Gaussians are of order 2, and d-dimensional multivariate Gaussians are of order d(d+3)/2.

Exponential families bring mathematical convenience for easily solving tasks, like finding maximum likelihood estimators [17]. It can be shown that the Kullback–Leibler divergence of members of the same exponential family is equivalent to a Bregman divergence on the natural parameters [18], thus bypassing the fastidious integral computation of Eq. 10 and yielding a closed-form formula (following Eq. 11):

    KL(p_F(x; θ_p) : p_F(x; θ_q)) = B_F(θ_q : θ_p).         (16)

Note that on the left-hand side the Kullback–Leibler divergence is a distance acting on distributions, while on the right-hand side the Bregman divergence is a distance acting on the corresponding swapped parameters.

Exponential families play a crucial role in statistics as they also bring mathematical convenience for generalizing results. For example, the log-likelihood ratio test for members of the same exponential family writes down as:

    log [ e^{⟨t(x), θ_1⟩ − F(θ_1) + k(x)} / e^{⟨t(x), θ_2⟩ − F(θ_2) + k(x)} ] ≥ log(w_2 / w_1).    (17)

Thus the decision border is a linear bisector in the sufficient statistics t(x):

    ⟨t(x), θ_1 − θ_2⟩ − F(θ_1) + F(θ_2) = log(w_2 / w_1).   (18)

4 CHERNOFF COEFFICIENTS OF EXPONENTIAL FAMILIES

Let us prove that the Chernoff α-divergence of members of the same exponential family is equivalent to an α-Jensen divergence defined for the log-normalizer generator, and evaluated at the corresponding natural parameters.
Without loss of generality, let us consider the reduced canonical form of exponential families, p_F(x; θ) = exp(⟨x, θ⟩ − F(θ)) (assuming t(x) = x and k(x) = 0). Consider the Chernoff α-coefficient of similarity of two distributions p and q belonging to the same exponential family E_F:

    c_α(p : q) = ∫ p^α(x) q^{1−α}(x) dx = ∫ p_F^α(x; θ_p) p_F^{1−α}(x; θ_q) dx    (19)
      = ∫ exp( α(⟨x, θ_p⟩ − F(θ_p)) ) exp( (1−α)(⟨x, θ_q⟩ − F(θ_q)) ) dx
      = ∫ exp( ⟨x, αθ_p + (1−α)θ_q⟩ − (αF(θ_p) + (1−α)F(θ_q)) ) dx
      = exp( F(αθ_p + (1−α)θ_q) − (αF(θ_p) + (1−α)F(θ_q)) ) ∫ exp( ⟨x, αθ_p + (1−α)θ_q⟩ − F(αθ_p + (1−α)θ_q) ) dx
      = exp( F(αθ_p + (1−α)θ_q) − (αF(θ_p) + (1−α)F(θ_q)) ) ∫ p_F(x; αθ_p + (1−α)θ_q) dx
      = exp( −J_F^{(α)}(θ_p : θ_q) ) > 0,

where the fourth line adds and subtracts F(αθ_p + (1−α)θ_q) in the exponent, and the remaining integral equals one since it integrates a density of the family (Θ is convex, so αθ_p + (1−α)θ_q ∈ Θ). It follows that the Chernoff α-divergence (of the first type) is given by

    C_α(p : q) = −log c_α(p : q) = J_F^{(α)}(θ_p : θ_q),    c_α(p : q) = e^{−C_α(p : q)} = e^{−J_F^{(α)}(θ_p : θ_q)}.

That is, the Chernoff α-divergence of members of the same exponential family is equivalent to a Jensen α-divergence on the corresponding natural parameters.

For multivariate normals, we thus easily retrieve the following Chernoff α-divergence between p ∼ N(μ_1, Σ_1) and q ∼ N(μ_2, Σ_2):

    C_α(p : q) = (1/2) log [ det((1−α)Σ_1 + αΣ_2) / ( det(Σ_1)^{1−α} det(Σ_2)^{α} ) ]
               + (α(1−α)/2) (μ_1 − μ_2)ᵀ ((1−α)Σ_1 + αΣ_2)^{−1} (μ_1 − μ_2).    (20)

For α = 1/2, we find the Bhattacharyya distance [3], [19] between multivariate Gaussians.
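The closed form for Gaussians can be cross-checked against direct numerical integration; the sketch below (an editorial addition, not from the paper) uses the convention c_α(p : q) = ∫ p^α q^{1−α} dx adopted in this note, under which the mixed covariance is (1−α)Σ_1 + αΣ_2 (beware that some references swap the roles of α and 1−α):

```python
import numpy as np

def chernoff_alpha_gaussians(alpha, mu1, S1, mu2, S2):
    # Closed-form C_alpha(p : q) for p = N(mu1, S1) and q = N(mu2, S2).
    S = (1.0 - alpha) * S1 + alpha * S2
    d = mu1 - mu2
    quad = 0.5 * alpha * (1.0 - alpha) * (d @ np.linalg.solve(S, d))
    logdet = 0.5 * (np.log(np.linalg.det(S))
                    - (1.0 - alpha) * np.log(np.linalg.det(S1))
                    - alpha * np.log(np.linalg.det(S2)))
    return quad + logdet

# Cross-check against -log int p^alpha q^(1-alpha) dx in dimension 1.
alpha = 0.3
mu1, s1, mu2, s2 = 0.0, 3.0, 2.0, 6.0
x = np.linspace(-80.0, 80.0, 400001)
dx = x[1] - x[0]
p = np.exp(-0.5 * ((x - mu1) / s1) ** 2) / (s1 * np.sqrt(2.0 * np.pi))
q = np.exp(-0.5 * ((x - mu2) / s2) ** 2) / (s2 * np.sqrt(2.0 * np.pi))
numeric = -np.log(np.sum(p ** alpha * q ** (1.0 - alpha)) * dx)
closed = chernoff_alpha_gaussians(alpha, np.array([mu1]), np.array([[s1 ** 2]]),
                                  np.array([mu2]), np.array([[s2 ** 2]]))
print(closed, numeric)
```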
Note that Chernoff α-divergences are related to Rényi α-divergences,

    R_α(p : q) = (1/(α−1)) log ∫ p^α(x) q^{1−α}(x) dx,      (21)

built on the Rényi entropy

    H_R^α(p) = (1/(1−α)) log ∫ p^α(x) dx                    (22)

(and hence, by a monotonic mapping⁴, to Tsallis divergences), so that closed-form formulas for members of the same exponential family follow:

    R_α(p : q) = (1/(1−α)) C_α(p : q),                      (23)
    R_α(p_F(x; θ_p) : p_F(x; θ_q)) = (1/(1−α)) J_F^{(α)}(θ_p : θ_q).    (24)

(Note that R_{1/2}(p : q) is twice the Bhattacharyya distance: R_{1/2}(p : q) = 2 C_{1/2}(p : q).) For example, the Rényi divergence between members p ∼ N(μ_p, Σ_p) and q ∼ N(μ_q, Σ_q) of the normal exponential family is obtained in closed form using Eq. 24:

    R_α(p : q) = (α/2) (μ_p − μ_q)ᵀ ((1−α)Σ_p + αΣ_q)^{−1} (μ_p − μ_q)
               + (1/(2(1−α))) log [ det((1−α)Σ_p + αΣ_q) / ( det(Σ_p)^{1−α} det(Σ_q)^{α} ) ].    (25)

Similarly, for the Tsallis relative entropy, we have:

    T_α(p : q) = (1/(1−α)) (1 − c_α(p : q)),                (26)
    T_α(p_F(x; θ_p) : p_F(x; θ_q)) = (1/(1−α)) (1 − e^{−J_F^{(α)}(θ_p : θ_q)}).    (27)

Note that lim_{α→1} R_α(p : q) = lim_{α→1} T_α(p : q) = KL(p : q) = B_F(θ_q : θ_p), as expected.

So far, particular cases of exponential families have been considered for computing Chernoff α-divergences (but not the Chernoff divergence). For example, Rauber et al. [20] investigated statistical distances for the Dirichlet and Beta distributions (both belonging to the exponential families).

4. The Tsallis entropy H_T^α(p) = (1/(α−1)) (1 − ∫ p(x)^α dx) is obtained from the Rényi entropy (and vice versa) via the mappings H_T^α(p) = (1/(1−α)) (e^{(1−α)H_R^α(p)} − 1) and H_R^α(p) = (1/(1−α)) log(1 + (1−α)H_T^α(p)).
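The relation R_α = C_α/(1−α) of Eq. (23) and the limit lim_{α→1} R_α = KL can be checked numerically; the following sketch (editorial, not from the paper) does so for two univariate Gaussians:

```python
import numpy as np

x = np.linspace(-60.0, 60.0, 400001)
dx = x[1] - x[0]

def pdf(mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

p, q = pdf(0.0, 3.0), pdf(2.0, 6.0)

def renyi(alpha):
    # R_alpha(p : q) = C_alpha(p : q) / (1 - alpha), Eqs. (21) and (23).
    c_alpha = np.sum(p ** alpha * q ** (1.0 - alpha)) * dx   # Chernoff coefficient
    return -np.log(c_alpha) / (1.0 - alpha)

kl = np.sum(p * np.log(p / q)) * dx   # Eq. (10), numerically
for a in (0.9, 0.99, 0.999):
    print(renyi(a))                   # tends to KL(p : q) as alpha -> 1
print(kl)
```

Here the reference value is the Gaussian KL divergence KL(N(0, 9) : N(2, 36)) = log 2 + 13/72 − 1/2 ≈ 0.374.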
The density of a Dirichlet distribution parameterized by a d-dimensional vector p = (p_1, ..., p_d) is

    Pr(X = x; p) = [ Γ(Σ_{i=1}^d p_i) / Π_{i=1}^d Γ(p_i) ] Π_{i=1}^d x_i^{p_i − 1},

with Γ(t) = ∫_0^∞ z^{t−1} e^{−z} dz the gamma function generalizing the factorial (Γ(n+1) = n!). Beta distributions are the particular cases of Dirichlet distributions obtained for d = 2. Rauber et al. [20] report the following closed-form formula for the Chernoff α-divergences:

    C_α(p : q) = log Γ( Σ_{i=1}^d (αp_i + (1−α)q_i) ) + α Σ_{i=1}^d log Γ(p_i) + (1−α) Σ_{i=1}^d log Γ(q_i)
               − Σ_{i=1}^d log Γ(αp_i + (1−α)q_i) − α log Γ( Σ_{i=1}^d p_i ) − (1−α) log Γ( Σ_{i=1}^d q_i ).

Dirichlet distributions are exponential families of order d with natural parameters θ = (p_1 − 1, ..., p_d − 1) and log-normalizer F(θ) = Σ_{i=1}^d log Γ(θ_i + 1) − log Γ(d + Σ_{i=1}^d θ_i) (that is, F(p) = Σ_{i=1}^d log Γ(p_i) − log Γ(Σ_{i=1}^d p_i) in the source parameterization). Our work extends the computation of Chernoff α-divergences to arbitrary exponential families using the natural parameters and the log-normalizer.

Fig. 1. The maximal Jensen α-divergence is a Bregman divergence in disguise: J_F^{(α*)}(p : q) = max_{α∈(0,1)} J_F^{(α)}(p : q) = B_F(p : m_{α*}) = B_F(q : m_{α*}), where m_{α*} = α*p + (1−α*)q.

Since the Chernoff information is defined as the maximal Chernoff α-divergence (which corresponds to minimizing the Chernoff coefficient in the Bayes error upper bound, with 0 < c_α(p, q) ≤ 1), we concentrate on maximizing the equivalent skew Jensen divergence.

5 MAXIMIZING α-JENSEN DIVERGENCES

We now prove that the maximal skew Jensen divergence can be computed as an equivalent Bregman divergence. First, consider univariate generators. Let α* = arg max_{0<α<1} J_F^{(α)}(p : q) denote the optimal exponent.
Following Figure 1, we observe geometrically the relationship [14]:

    J_F^{(α*)}(p : q) = B_F(p : m_{α*}) = B_F(q : m_{α*}),  (28)

where m_α = αp + (1−α)q denotes the α-mixing of p and q. We maximize the α-Jensen divergence by setting its derivative to zero:

    d J_F^{(α)}(p : q)/dα = F(p) − F(q) − (m_α)′ F′(m_α).   (29)

Since the derivative (m_α)′ of m_α with respect to α equals p − q, setting d J_F^{(α)}(p : q)/dα = 0 yields the constraint

    F′(m_{α*}) = ( F(p) − F(q) ) / ( p − q ).               (30)

Geometrically, this means that the tangent line at m_{α*} should be parallel to the chord passing through (p, z = F(p)) and (q, z = F(q)), as illustrated in Figure 1. It follows that

    α* = ( F′^{−1}( (F(p) − F(q)) / (p − q) ) − q ) / ( p − q ).    (31)

Since m_{α*} = α*p + (1−α*)q, we have p − m_{α*} = (1−α*)(p − q); using Eq. 30, it comes

    B_F(p : m_{α*}) = F(p) − F(m_{α*}) − (1−α*)(F(p) − F(q))
                    = α*F(p) + (1−α*)F(q) − F(m_{α*})
                    = J_F^{(α*)}(p : q).

Similarly, we have q − m_{α*} = α*(q − p), and it follows that

    B_F(q : m_{α*}) = F(q) − F(m_{α*}) − (q − m_{α*}) F′(m_{α*})
                    = F(q) − F(m_{α*}) + α*(p − q) (F(p) − F(q)) / (p − q)
                    = α*F(p) + (1−α*)F(q) − F(m_{α*})
                    = J_F^{(α*)}(p : q).                    (32)

Thus we analytically checked the geometric intuition that J_F^{(α*)}(p : q) = B_F(p : m_{α*}) = B_F(q : m_{α*}). Observe that the definition of a Bregman divergence requires computing the gradient ∇F explicitly, whereas the Jensen α-divergence does not need it.

Fig. 2. Plot of the α-divergences of two normal distributions for α ∈ (0, 1): (Top) p ∼ N(0, 9) and q ∼ N(2, 9); (Bottom) p ∼ N(0, 9) and q ∼ N(2, 36). Observe that for equal variances the optimal exponent is α* = 1/2, and the Chernoff divergence reduces to the Bhattacharyya divergence.
(However, the gradient does appear in the computation of the best exponent α*.)

5.1 Single-parametric exponential families

We conclude that the Chernoff information of members of the same exponential family of order 1 always admits a closed-form analytic formula:

    C(p : q) = α*F(θ_p) + (1−α*)F(θ_q) − F( F′^{−1}( (F(θ_p) − F(θ_q)) / (θ_p − θ_q) ) ),    (33)

with

    α* = ( F′^{−1}( (F(θ_p) − F(θ_q)) / (θ_p − θ_q) ) − θ_q ) / ( θ_p − θ_q ).    (34)

Common exponential families of order 1 include the Binomial, Bernoulli, Laplacian (exponential), Rayleigh, Poisson, and fixed-standard-deviation Gaussian families. To illustrate the calculation method, let us instantiate the univariate Gaussian and Poisson distributions.

For univariate Gaussians differing in mean only (i.e., constant standard deviation σ), we have

    θ = μ/σ²,   F(θ) = θ²σ²/2 = μ²/(2σ²),   F′(θ) = θσ² = μ.

We solve for α* using Eq. 34:

    F′(α*θ_p + (1−α*)θ_q) = ( F(θ_p) − F(θ_q) ) / ( θ_p − θ_q )
    μ_p + (1−α*)(μ_q − μ_p) = ( μ_p² − μ_q² ) / ( 2(μ_p − μ_q) ) = ( μ_p + μ_q ) / 2.

It follows that α* = 1/2 as expected, and that the Chernoff information is the Bhattacharyya distance:

    C(p : q) = C_{1/2}(p : q) = J_F^{(1/2)}(θ_p : θ_q)
             = (1/(2σ²)) (μ_p² + μ_q²)/2 − ( (μ_p + μ_q)/2 )² / (2σ²)
             = (μ_p − μ_q)² / (8σ²).

For Poisson distributions (F(θ) = exp θ with θ = log λ, so that F(θ) = λ), the Chernoff divergence is found by first computing

    α* = log( (λ_1 − λ_2) / ( λ_2 log(λ_1/λ_2) ) ) / log(λ_1/λ_2).    (35)

Then, using Eq. 33, we deduce that

    C(λ_1 : λ_2) = λ_2 + α*(λ_1 − λ_2) − exp(m_{α*})
                 = λ_2 + α*(λ_1 − λ_2) − exp( α* log λ_1 + (1−α*) log λ_2 )
                 = λ_2 + α*(λ_1 − λ_2) − λ_1^{α*} λ_2^{1−α*}.    (36)

Plugging Eq. 35 into Eq. 36 and "beautifying" the formula yields the following closed-form solution for the Chernoff information:

    C(λ_1 : λ_2) = λ_1 [ (λ_2/λ_1 − 1) ( log( (λ_2/λ_1 − 1) / log(λ_2/λ_1) ) − 1 ) + log(λ_2/λ_1) ] / log(λ_2/λ_1).    (37)

5.2 Arbitrary exponential families

For multivariate generators F, we consider the restriction of F to the line segment joining the parameters: F_{pq}(α) = F(αθ_p + (1−α)θ_q), so that F_{pq}(1) = F(θ_p) and F_{pq}(0) = F(θ_q). We have

    C_F(p : q) = max_α J_F^{(α)}(θ_p : θ_q) = max_α J_{F_{pq}}^{(α)}(1 : 0).    (38)

We have F′_{pq}(α) = (θ_p − θ_q)ᵀ ∇F(αθ_p + (1−α)θ_q). To get the inverse of F′_{pq}, we need to solve the equation

    (θ_p − θ_q)ᵀ ∇F(α*θ_p + (1−α*)θ_q) = F(θ_p) − F(θ_q).   (39)

Observe that in 1D this equation matches Eq. 30. Solving for α* may not always be possible in closed form. Let θ* = α*θ_p + (1−α*)θ_q; then we need to find α* such that

    (θ_p − θ_q)ᵀ ∇F(θ*) = F(θ_p) − F(θ_q).                  (40)

Now, observe that Eq. 40 is equivalent to the condition

    B_F(θ_p : θ*) = B_F(θ_q : θ*),                          (41)

and that it therefore follows that

    KL(p_F(x; θ*) : p_F(x; θ_p)) = KL(p_F(x; θ*) : p_F(x; θ_q)).    (42)

Thus it can be checked that the Chernoff distribution r* = p_F(x; θ*) is written as

    p_F(x; θ*) = p_F(x; θ_p)^{α*} p_F(x; θ_q)^{1−α*} / ∫ p_F(x; θ_p)^{α*} p_F(x; θ_q)^{1−α*} dx.    (43)

6 THE CHERNOFF POINT

Let us now consider the exponential family

    E_F = { p_F(x; θ) | θ ∈ Θ }                             (44)

as a smooth statistical manifold [16]. Two distributions p = p_F(x; θ_p) and q = p_F(x; θ_q) are geometrically viewed as two points, expressed as θ_p and θ_q coordinates in the natural coordinate system. The Kullback–Leibler divergence between p and q is equivalent to a Bregman divergence on the natural parameters: KL(p : q) = B_F(θ_q : θ_p). For infinitesimally close distributions p ≃ q, the Fisher information provides the underlying Riemannian metric, and equals the Hessian ∇²F(θ) of the log-normalizer for exponential families [16].
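The identity KL = B_F on swapped natural parameters (Eq. 16), restated just above, is easy to check numerically; the sketch below (editorial, not from the paper) does so for two Poisson distributions, comparing the closed form against the defining series of Eq. (10):

```python
import math

def bregman_poisson(theta1, theta2):
    # B_F(theta1 : theta2) for the Poisson log-normalizer F(theta) = exp(theta)
    # (natural parameter theta = log lambda); grad F = exp as well.
    return math.exp(theta1) - math.exp(theta2) - (theta1 - theta2) * math.exp(theta2)

def kl_poisson_direct(lp, lq, kmax=200):
    # KL(p : q) = sum_k p(k) log(p(k)/q(k)) over the support, Eq. (10).
    total = 0.0
    log_pk, log_qk = -lp, -lq        # log Pr(k = 0) for each rate
    for k in range(kmax + 1):
        if k > 0:
            log_pk += math.log(lp) - math.log(k)
            log_qk += math.log(lq) - math.log(k)
        total += math.exp(log_pk) * (log_pk - log_qk)
    return total

lp, lq = 3.0, 5.0
closed = bregman_poisson(math.log(lq), math.log(lp))   # KL(p : q) = B_F(theta_q : theta_p)
print(closed, kl_poisson_direct(lp, lq))
# Both equal lq - lp + lp log(lp / lq).
```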
On statistical manifolds [16], we define two types of geodesics, the mixture (∇^{(m)}) geodesics and the exponential (∇^{(e)}) geodesics:

    ∇^{(m)}(p(x), q(x), λ) = (1−λ) p(x) + λ q(x),           (45)
    ∇^{(e)}(p(x), q(x), λ) = p(x)^{1−λ} q(x)^{λ} / ∫ p(x)^{1−λ} q(x)^{λ} dx.    (46)

Furthermore, to any convex function F we can associate a dual convex conjugate F* (such that F** = F) via the Legendre–Fenchel transformation:

    F*(y) = max_x { ⟨x, y⟩ − F(x) }.                        (48)

The maximum is attained at y = ∇F(x). Moreover, the convex conjugates are coupled by reciprocal inverse gradients: ∇F* = (∇F)^{−1}. Thus a member p of the exponential family can be parameterized by its natural coordinates θ_p = θ(p), or dually by its expectation coordinates η_p = η(p) = ∇F(θ_p). That is, there exists a dual coordinate system on the information manifold E_F of the exponential family.

Note that the Chernoff distribution r* = p_F(x; θ*) of Eq. 43 belongs to the exponential geodesic. The natural parameters of the exponential geodesic are interpolated linearly in the θ-coordinate system; the exponential geodesic segment thus has natural coordinates θ(p, q, λ) = (1−λ)θ_p + λθ_q. Using the dual expectation parameterization η* = ∇F(θ*), we may rewrite the optimality criterion of Eq. 40 equivalently as

    (θ_p − θ_q)ᵀ η* = F(θ_p) − F(θ_q),                      (49)

with η* a point of the exponential geodesic parameterized by expectation parameters (each mixture/exponential geodesic can be parameterized in either the natural or the expectation coordinate system).

From Eq. 41, we deduce that the Chernoff distribution necessarily belongs to the right-sided Bregman Voronoi bisector

    V(p, q) = { x | B_F(θ_p : θ_x) = B_F(θ_q : θ_x) }.      (50)

This bisector is curved in the natural coordinate system, but affine in the dual expectation coordinate system [18].
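The equidistance condition B_F(θ_p : θ*) = B_F(θ_q : θ*) of Eqs. (41) and (50) can be solved by bisection along the e-geodesic, which is exactly the algorithm of Section 7 below. As an editorial sketch (not code from the paper), here it is on the Poisson family, where the closed form of Section 5.1 is available for cross-checking:

```python
import math

def F(theta):           # Poisson log-normalizer; natural parameter theta = log lambda
    return math.exp(theta)

def grad_F(theta):
    return math.exp(theta)

def bregman(t1, t2):
    # B_F(t1 : t2) = F(t1) - F(t2) - (t1 - t2) grad F(t2).
    return F(t1) - F(t2) - (t1 - t2) * grad_F(t2)

def chernoff_point(theta_p, theta_q, iters=60):
    # Bisect theta(lam) = theta_p + lam (theta_q - theta_p) until it is
    # equidistant from theta_p and theta_q w.r.t. B_F (Eq. 41).
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        theta = theta_p + lam * (theta_q - theta_p)
        if bregman(theta_p, theta) < bregman(theta_q, theta):
            lo = lam    # theta is still too close to theta_p
        else:
            hi = lam
    return theta_p + 0.5 * (lo + hi) * (theta_q - theta_p)

l1, l2 = 3.0, 5.0
theta_star = chernoff_point(math.log(l1), math.log(l2))
alpha_star = (theta_star - math.log(l2)) / (math.log(l1) - math.log(l2))
# Closed form (Section 5.1): exp(theta*) = (l1 - l2) / log(l1 / l2).
alpha_closed = (math.log((l1 - l2) / math.log(l1 / l2)) - math.log(l2)) / math.log(l1 / l2)
C = l2 + alpha_star * (l1 - l2) - math.exp(theta_star)   # Chernoff information, Eq. (36)
print(alpha_star, alpha_closed, C)
```

The comparison B_F(θ_p : θ) < B_F(θ_q : θ) is strictly monotone along the geodesic, so the bisection converges to the unique root; 60 iterations already locate the optimal exponent to machine precision.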
Moreover, we have B_F(θ_q : θ_p) = B_{F*}(∇F(θ_p) : ∇F(θ_q)), so that we may express the right-sided bisector equivalently in the expectation coordinate system as

    V(p, q) = { x | B_{F*}(η_x : η_p) = B_{F*}(η_x : η_q) },    (51)

that is, as a left-sided bisector for the dual Legendre convex conjugate F*. Thus the Chernoff distribution r* is viewed as a Chernoff point on the statistical manifold, defined as the intersection of the exponential geodesic (e-geodesic) with the curved bisector {x | B_F(θ_p : θ_x) = B_F(θ_q : θ_x)}. In [18], it is proved that the exponential geodesic intersects the right-sided bisector Bregman-orthogonally. Figure 3 illustrates this geometric property of the Chernoff distribution (which can be viewed indifferently in the natural or expectation parameter space), from which the corresponding best exponent can be retrieved to define the Chernoff information. The following section builds on this exact geometric characterization to design a geodesic bisection optimization method that approximates the optimal exponent arbitrarily finely.

7 A GEODESIC BISECTION ALGORITHM

To find the Chernoff point r* (i.e., the parameter θ* = α*θ_p + (1−α*)θ_q = θ_p + λ*(θ_q − θ_p) with λ* = 1 − α*), a simple bisection algorithm follows. Initially, let λ ∈ [λ_m, λ_M] with λ_m = 0 and λ_M = 1. Compute the midpoint λ_0 = (λ_m + λ_M)/2 and let θ = θ_p + λ_0(θ_q − θ_p). If B_F(θ_p : θ) < B_F(θ_q : θ), recurse on the interval [λ_0, λ_M]; otherwise, recurse on the interval [λ_m, λ_0]. At each stage we halve the λ-range in the θ-coordinate system. Thus we can get an arbitrarily precise approximation of the Chernoff information of members of the same exponential family by walking along the exponential geodesic towards the Chernoff point.

Fig. 3. The Chernoff point r* of p and q is defined as the intersection of the exponential geodesic ∇^{(e)}(p, q) with the right-sided Voronoi bisector V(p, q). In the natural coordinate system, the exponential geodesic is a line segment and the right-sided bisector is curved; in the dual expectation coordinate system, the exponential geodesic is curved and the right-sided bisector is affine.

8 CONCLUDING REMARKS

Chernoff divergence upper bounds asymptotically the optimal Bayes error [1]: E* ≃ e^{−nC(p:q)} as the number n of i.i.d. observations tends to infinity. The Chernoff bound thus provides the best Bayesian error exponent [1], improving over the Bhattacharyya divergence (α = 1/2):

    e^{−nC(p:q)} ≤ e^{−nB(p:q)},                            (52)

at the expense of solving an optimization problem. The probability of misclassification error can also be bounded using other information-theoretic statistical distances [21], [22] (Stein lemma [1]):

    e^{−nC(p:q)} ≥ e^{−nR(p:q)} ≥ e^{−nJ(p:q)},             (53)

where J(p : q) = (KL(p : q) + KL(q : p))/2 denotes half of the Jeffreys divergence (i.e., the arithmetic mean of the sided relative entropies) and R(p : q) = 1/( 1/KL(p : q) + 1/KL(q : p) ) is the resistor-average distance [22].

In this paper, we have shown that the Chernoff α-divergence of members of the same exponential family can be computed from an equivalent α-Jensen divergence on the corresponding natural parameters. We then explained how maximizing the α-Jensen divergence yields a simple gradient constraint; as a byproduct, this shows that the maximal α-Jensen divergence amounts to a Bregman divergence. For single-parametric exponential families (order-1 families, or dimension-wise separable families), we deduced a closed-form formula for the Chernoff divergence (or Chernoff information).
Otherwise, based on the framework of information geometry, we interpreted the optimization task as finding the "Chernoff point" defined by the intersection of the exponential geodesic linking the source distributions with a right-sided Bregman Voronoi bisector. Based on this observation, we designed an efficient geodesic bisection algorithm that approximates the Chernoff information arbitrarily finely.

REFERENCES

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley-Interscience, 1991.
[2] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations," Annals of Mathematical Statistics, vol. 23, pp. 493–507, 1952.
[3] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bulletin of the Calcutta Mathematical Society, vol. 35, pp. 99–110, 1943.
[4] A. Cichocki and S.-i. Amari, "Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities," Entropy, vol. 12, no. 6, pp. 1532–1568, June 2010. [Online]. Available: http://dx.doi.org/10.3390/e12061532
[5] A. Cichocki and S.-i. Amari, "Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities," Entropy, 2010, review submitted.
[6] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[7] C. J. Stone, "Consistent nonparametric regression," Annals of Statistics, vol. 5, no. 4, pp. 595–645, 1977.
[8] J.-F. Chamberland and V. Veeravalli, "Decentralized detection in sensor networks," IEEE Transactions on Signal Processing, vol. 51, no. 2, pp. 407–416, Feb. 2003.
[9] F. Calderero and F. Marques, "Region merging techniques using information theory statistical measures," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1567–1586, 2010.
[10] F. Sadjadi, "Performance evaluations of correlations of digital images using different separability measures," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 4, no. 4, pp. 436–441, Jul. 1982.
[11] S. K. Zhou and R. Chellappa, "Beyond one still image: Face recognition from multiple still images or video sequence," in Face Processing: Advanced Modeling and Methods. Academic Press, 2005.
[12] P. Suau and F. Escolano, "Exploiting information theory for filtering the Kadir scale-saliency detector," in IbPRIA '07: Proceedings of the 3rd Iberian Conference on Pattern Recognition and Image Analysis, Part II. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 146–153.
[13] S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu, "Statistical edge detection: Learning and evaluating edge cues," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 57–74, 2003.
[14] M. Basseville and J.-F. Cardoso, "On entropies, divergences and mean values," in Proceedings of the IEEE Workshop on Information Theory, 1995.
[15] F. Nielsen and S. Boltz, "The Burbea-Rao and Bhattacharyya centroids," Computing Research Repository (CoRR), http://arxiv.org/, April 2010.
[16] S. Amari and H. Nagaoka, Methods of Information Geometry. American Mathematical Society, Oxford University Press, 2000.
[17] F. Nielsen and V. Garcia, "Statistical exponential families: A digest with flash cards," 2009, arXiv.org:0911.4863.
[18] J.-D. Boissonnat, F. Nielsen, and R. Nock, "Bregman Voronoi diagrams," Discrete and Computational Geometry, April 2010. [Online]. Available: http://dx.doi.org/10.1007/s00454-010-9256-1
[19] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Transactions on Communications [legacy, pre-1988], vol. 15, no. 1, pp. 52–60, 1967.
[20] T. W. Rauber, T. Braun, and K. Berns, "Probabilistic distance measures of the Dirichlet and beta distributions," Pattern Recognition, vol. 41, no. 2, pp. 637–645, 2008.
[21] H. Avi-Itzhak and T. Diep, "Arbitrarily tight upper and lower bounds on the Bayesian probability of error," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 1, pp. 89–91, 1996.
[22] D. H. Johnson and S. Sinanovic, "Symmetrizing the Kullback-Leibler distance," Technical report, Mar. 2001.

Frank Nielsen defended his PhD thesis on adaptive computational geometry in 1996 (INRIA/University of Sophia-Antipolis, France), and his accreditation to lead research in 2006. He has been a researcher at Sony Computer Science Laboratories Inc., Tokyo (Japan) since 1997, and a professor at École Polytechnique since 2008. His research focuses on computational information geometry with applications to visual computing.