The Burbea-Rao and Bhattacharyya centroids


Authors: Frank Nielsen, Sylvain Boltz

Published in IEEE Transactions on Information Theory 57(8):5455-5466, 2011.

Abstract — We study the centroid with respect to the class of information-theoretic Burbea-Rao divergences that generalize the celebrated Jensen-Shannon divergence by measuring the non-negative Jensen difference induced by a strictly convex and differentiable function. Although those Burbea-Rao divergences are symmetric by construction, they are not metrics, since they fail to satisfy the triangle inequality. We first explain how a particular symmetrization of Bregman divergences called Jensen-Bregman distances yields exactly those Burbea-Rao divergences. We then proceed by defining skew Burbea-Rao divergences, and show that skew Burbea-Rao divergences amount in limit cases to computing Bregman divergences. We then prove that Burbea-Rao centroids are unique, and can be arbitrarily finely approximated by a generic iterative concave-convex optimization algorithm with guaranteed convergence. In the second part of the paper, we consider the Bhattacharyya distance, commonly used to measure the overlapping degree of probability distributions. We show that Bhattacharyya distances on members of the same statistical exponential family amount to calculating a Burbea-Rao divergence in disguise. Thus we get an efficient algorithm for computing the Bhattacharyya centroid of a set of parametric distributions belonging to the same exponential family, improving over former specialized methods found in the literature that were limited to univariate or "diagonal" multivariate Gaussians. To illustrate the performance of our Bhattacharyya/Burbea-Rao centroid algorithm, we present experimental performance results for k-means and hierarchical clustering methods applied to Gaussian mixture models.

Index Terms — Centroid, Kullback-Leibler divergence, Jensen-Shannon divergence, Burbea-Rao divergence, Bregman divergences, exponential families, Bhattacharyya divergence, information geometry.

(F. Nielsen is with the Department of Fundamental Research of Sony Computer Science Laboratories, Inc., Tokyo, Japan, and the Computer Science Department (LIX) of École Polytechnique, Palaiseau, France; e-mail: Frank.Nielsen@acm.org. S. Boltz is with the Computer Science Department (LIX) of École Polytechnique, Palaiseau, France; e-mail: boltz@lix.polytechnique.fr. Manuscript received April 2010, revised April 2012. This revision includes in the appendix a proof of the uniqueness of the centroid.)

I. INTRODUCTION

A. Means and centroids

In Euclidean geometry, the centroid $c$ of a point set $\mathcal{P} = \{p_1, ..., p_n\}$ is defined as the center of mass $\frac{1}{n}\sum_{i=1}^n p_i$, also characterized as the point minimizing the average squared Euclidean distance:

$$ c = \arg\min_p \sum_{i=1}^n \frac{1}{n}\|p - p_i\|^2. $$

This basic notion of Euclidean centroid can be extended to a mean point $M(\mathcal{P})$ representing the centrality of a given point set $\mathcal{P}$. There are basically two complementary approaches to defining mean values of numbers: (1) by axiomatization, or (2) by optimization, summarized concisely as follows:

• By axiomatization. This approach was historically pioneered by the independent work of Kolmogorov [1] and Nagumo [2] in 1930, and simplified and refined later by Aczél [3].
Without loss of generality, we consider the mean of two non-negative numbers $x_1$ and $x_2$, and postulate the following expected behaviors of a mean function $M(x_1, x_2)$ as common-sense axioms:
– Reflexivity: $M(x, x) = x$;
– Symmetry: $M(x_1, x_2) = M(x_2, x_1)$;
– Continuity and strict monotonicity: $M(\cdot,\cdot)$ is continuous and $M(x_1, x_2) < M(x_1', x_2)$ for $x_1 < x_1'$;
– Anonymity: $M(M(x_{11}, x_{12}), M(x_{21}, x_{22})) = M(M(x_{11}, x_{21}), M(x_{12}, x_{22}))$ (also called bisymmetry, expressing the fact that the mean can be computed as a mean of the row means or, equivalently, as a mean of the column means).

Then one can show that the mean function $M(\cdot,\cdot)$ is necessarily written as

$$ M(x_1, x_2) = f^{-1}\left(\frac{f(x_1) + f(x_2)}{2}\right) \stackrel{\mathrm{def}}{=} M_f(x_1, x_2) \qquad (1) $$

for a strictly increasing function $f$. The arithmetic mean $\frac{x_1 + x_2}{2}$, the geometric mean $\sqrt{x_1 x_2}$ and the harmonic mean $\frac{2}{\frac{1}{x_1} + \frac{1}{x_2}}$ are instances of such generalized means, obtained for $f(x) = x$, $f(x) = \log x$ and $f(x) = \frac{1}{x}$, respectively. Those generalized means are also called quasi-arithmetic means, since they can be interpreted as the arithmetic mean on the sequence $f(x_1), ..., f(x_n)$, the $f$-representation of the numbers. To get geometric centroids, we simply consider means on each coordinate axis independently. The Euclidean centroid is thus interpreted as the Euclidean arithmetic mean. Barycenters (weighted centroids) are similarly obtained using non-negative weights normalized so that $\sum_{i=1}^n w_i = 1$:

$$ M_f(x_1, ..., x_n; w_1, ..., w_n) = f^{-1}\left(\sum_{i=1}^n w_i f(x_i)\right) \qquad (2) $$

Those generalized means satisfy the inequality property

$$ M_f(x_1, ..., x_n; w_1, ..., w_n) \le M_g(x_1, ..., x_n; w_1, ..., w_n) \qquad (3) $$

if and only if the function $g$ dominates $f$, that is, $\forall x,\ g(x) > f(x)$. Therefore the arithmetic mean ($f(x) = x$) dominates the geometric mean ($f(x) = \log x$), which in turn dominates the harmonic mean ($f(x) = \frac{1}{x}$). Note that the inequality of Eq. 3 is not strict, as the means coincide for identical elements: if all $x_i$ are equal to $x$, then $M_f(x_1, ..., x_n) = f^{-1}(f(x)) = x = g^{-1}(g(x)) = M_g(x_1, ..., x_n)$. All those quasi-arithmetic means further satisfy the "interness" property

$$ \min(x_1, ..., x_n) \le M_f(x_1, ..., x_n) \le \max(x_1, ..., x_n), \qquad (4) $$

derived from the limit cases $p \to \pm\infty$ of the power means obtained for $f(x) = x^p$, $p \in \mathbb{R}^* = (-\infty, \infty)\backslash\{0\}$ a non-zero real number. (Besides the min/max operators interpreted as extremal power means, the geometric mean itself can be interpreted as the power mean $(\prod_{i=1}^n x_i^p)^{\frac{1}{p}}$ in the limit case $p \to 0$.)

• By optimization. In this second, alternative approach, the barycenter $c$ is defined with respect to a distance function $d(\cdot,\cdot)$ as the optimal solution of a minimization problem:

$$ (\mathrm{OPT}): \quad \min_x \sum_{i=1}^n w_i d(x, p_i) = \min_x L(x; \mathcal{P}, d), \qquad (5) $$

where the non-negative weights $w_i$ denote the multiplicity or relative importance of the points (by default, the centroid is defined by fixing all $w_i = \frac{1}{n}$). Ben-Tal et al. [4] considered an information-theoretic class of distances called $f$-divergences [5], [6]:

$$ I_f(x, p) = p f\left(\frac{x}{p}\right), \qquad (6) $$

for a strictly convex differentiable function $f(\cdot)$ satisfying $f(1) = 0$ and $f'(1) = 0$. Although those $f$-divergences were primarily investigated for probability measures (in that context, a $d$-dimensional point is interpreted as a discrete and finite probability measure lying in the $(d-1)$-dimensional unit simplex), we can extend the $f$-divergences to positive measures. Since program (OPT) is strictly convex in $x$, it admits a unique minimizer $M(\mathcal{P}; I_f) = \arg\min_x L(x; \mathcal{P}, I_f)$, termed the entropic mean by Ben-Tal et al. [4].
Interestingly, those entropic means are linear scale-invariant (that is, means of homogeneous degree 1):

$$ M(\lambda p_1, ..., \lambda p_n; I_f) = \lambda M(p_1, ..., p_n; I_f). \qquad (7) $$

Nielsen and Nock [7] considered another class of information-theoretic distortion measures $B_F$ called Bregman divergences [8], [9]:

$$ B_F(x, p) = F(x) - F(p) - (x - p)F'(p), \qquad (8) $$

for a strictly convex differentiable function $F$. It follows that (OPT) is convex, and admits the unique minimizer $M(p_1, ..., p_n; B_F) = M_{F'}(p_1, ..., p_n)$, a quasi-arithmetic mean for the strictly increasing and continuous function $F'$, the derivative of $F$ (see the numerical sketch following this list). Observe that information-theoretic distances may be asymmetric (i.e., $d(x, p) \ne d(p, x)$), and therefore one may also define a right-sided centroid $M'$ as the minimizer of

$$ (\mathrm{OPT}'): \quad \min_x \sum_{i=1}^n w_i d(p_i, x). \qquad (9) $$

It turns out that for $f$-divergences we have

$$ I_f(x, p) = I_{f^*}(p, x) \qquad (10) $$

for $f^*(x) = x f(1/x)$, so that (OPT') is solved as an (OPT) problem for the conjugate function $f^*(\cdot)$. In the same spirit, we have

$$ B_F(x, p) = B_{F^*}(F'(p), F'(x)) \qquad (11) $$

for Bregman divergences, where $F^*$ denotes the Legendre convex conjugate [8], [9] (Legendre dual convex conjugates $F$ and $F^*$ necessarily have reciprocal gradients, $F^{*\prime} = (F')^{-1}$; see [7]). Surprisingly, although (OPT') may not be convex in $x$ for Bregman divergences (e.g., $F(x) = -\log x$), (OPT') nevertheless admits a unique minimizer, independent of the generator function $F$: the center of mass $M'(\mathcal{P}; B_F) = \sum_{i=1}^n \frac{1}{n} p_i$. Bregman means are not homogeneous, except for the power generators $F(x) = x^p$, which yield entropic means, i.e., means that can also be interpreted as minimizers of average $f$-divergences [4] (in fact, Amari [10] proved that the intersection of the class of $f$-divergences with the class of Bregman divergences is the class of $\alpha$-divergences). Amari [11] further studied those power means (known as $\alpha$-means in information geometry [12]), and showed that they are linear-scale-free means obtained as minimizers of $\alpha$-divergences, a proper subclass of $f$-divergences. Nielsen and Nock [13] reported an alternative, simpler proof for the $\alpha$-means by showing that the $\alpha$-divergences are Bregman divergences in disguise (namely, representational Bregman divergences for positive measures, but not for normalized distributions [10]). To get geometric centroids, we simply consider multivariate extensions of the optimization task (OPT). In particular, one may consider separable divergences, i.e., divergences that can be assembled coordinate-wise:

$$ d(x, p) = \sum_{i=1}^d d_i(x^{(i)}, p^{(i)}), \qquad (12) $$

with $x^{(i)}$ denoting the $i$-th coordinate. A typical non-separable divergence is the squared Mahalanobis distance [14]

$$ d(x, p) = (x - p)^T Q (x - p), \qquad (13) $$

a Bregman divergence called the generalized quadratic distance, defined for the generator $F(x) = x^T Q x$, where $Q$ is a positive-definite matrix ($Q \succ 0$). For separable distances, the optimization problem (OPT) may be reinterpreted as the task of finding the projection [15] of a point $p$ onto the upper line $U$:

$$ (\mathrm{PROJ}): \quad \inf_{u \in U} d(u, p), \qquad (14) $$

where $U$ is defined by $u_1 = \cdots = u_{d\times n} > 0$, and $p$ is the $(n\times d)$-dimensional point obtained by stacking the $d$ coordinates of each of the $n$ points.
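To make the two notions above concrete, here is a minimal Python sketch (all function names are ours, not from the paper or its library). It checks numerically that the left-sided Bregman centroid of (OPT) coincides with the quasi-arithmetic mean $M_{F'}$ of Eq. 2, for the generator $F(x) = -\log x$, whose induced mean is the harmonic mean:

```python
import numpy as np

def quasi_arithmetic_mean(xs, f, f_inv):
    """Quasi-arithmetic mean M_f of Eq. 2 with uniform weights: f^{-1}(mean of f(x_i))."""
    return f_inv(np.mean(f(xs)))

def bregman(F, grad_F, x, p):
    """Univariate Bregman divergence B_F(x, p) = F(x) - F(p) - (x - p) F'(p) (Eq. 8)."""
    return F(x) - F(p) - (x - p) * grad_F(p)

# Generator F(x) = -log(x); its derivative F'(x) = -1/x is strictly increasing on x > 0.
F, grad_F = lambda x: -np.log(x), lambda x: -1.0 / x

pts = np.array([0.5, 1.0, 2.0, 8.0])

# Brute-force the left-sided Bregman centroid arg min_x (1/n) sum_i B_F(x, p_i) ...
xs = np.linspace(0.1, 10.0, 100001)
avg = np.mean([bregman(F, grad_F, xs, p) for p in pts], axis=0)
x_opt = xs[np.argmin(avg)]

# ... and compare with the quasi-arithmetic mean M_{F'} (here, the harmonic mean).
m = quasi_arithmetic_mean(pts, grad_F, lambda y: -1.0 / y)
print(x_opt, m)   # both approximately 1.103, the harmonic mean of pts
```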
In geometry, means (centroids) play a crucial role in center-based clustering (e.g., k-means [16] for vector quantization applications). Indeed, the mean of a cluster allows one to aggregate data into a single center datum. The notion of mean is thus encapsulated in the broader theory of mathematical aggregators [17]. Results on geometric means can easily be transferred to the field of Statistics [4] by generalizing the optimization task to a random variable $X$ with distribution $F$:

$$ (\mathrm{OPT}): \quad \min_x E[d(x, X)] = \min_x \int_t d(x, t)\,\mathrm{d}F(t), \qquad (15) $$

where $E[\cdot]$ denotes the expectation defined with respect to the Lebesgue-Stieltjes integral. Although this approach is discussed in [4] and is important for defining various notions of centrality in statistics, we shall not cover this extended framework here, for the sake of brevity.

B. Burbea-Rao divergences

In this paper, we focus on the optimization approach (OPT) for defining other (geometric) means, using the class of information-theoretic distances obtained as the Jensen difference for a strictly convex and differentiable function $F$:

$$ d(x, p) = \frac{F(x) + F(p)}{2} - F\left(\frac{x + p}{2}\right) \stackrel{\mathrm{def}}{=} \mathrm{BR}_F(x, p) \ge 0. \qquad (16) $$

Since the differential geometry underlying those Jensen difference distances was seminally studied in papers by Burbea and Rao [18], [19], we term them Burbea-Rao divergences and denote them by $\mathrm{BR}_F$. In the remainder, we consider separable Burbea-Rao divergences; that is, for $d$-dimensional points $p$ and $q$, we define

$$ \mathrm{BR}_F(p, q) = \sum_{i=1}^d \mathrm{BR}_F(p^{(i)}, q^{(i)}), \qquad (17) $$

and study the Burbea-Rao centroids (and barycenters) as the minimizers of the average Burbea-Rao divergence. Those Burbea-Rao divergences generalize the celebrated Jensen-Shannon divergence [20]

$$ \mathrm{JS}(p, q) = H\left(\frac{p + q}{2}\right) - \frac{H(p) + H(q)}{2} \qquad (18) $$

obtained by choosing $F(x) = -H(x)$, the negative Shannon entropy $H(x) = -x\log x$. Generators $F(\cdot)$ of parametric distances are convex functions representing entropies, which are concave functions. Burbea-Rao divergences contain all generalized quadratic distances ($F(x) = x^T Q x = \langle Qx, x\rangle$ for a positive definite matrix $Q \succ 0$, also called squared Mahalanobis distances):

$$
\begin{aligned}
\mathrm{BR}_F(p, q) &= \frac{F(p) + F(q)}{2} - F\left(\frac{p + q}{2}\right) = \frac{2\langle Qp, p\rangle + 2\langle Qq, q\rangle - \langle Q(p + q), p + q\rangle}{4}\\
&= \frac{1}{4}\left(\langle Qp, p\rangle + \langle Qq, q\rangle - 2\langle Qp, q\rangle\right) = \frac{1}{4}\langle Q(p - q), p - q\rangle = \frac{1}{4}\|p - q\|_Q^2.
\end{aligned}
$$

Although the square root of the Jensen-Shannon divergence yields a metric (a Hilbertian metric), this is not true in general for Burbea-Rao divergences. The work closest to this paper is a one-page symposium paper [21] (in the nineties, the IEEE International Symposium on Information Theory (ISIT) published only one-page papers; we are grateful to Prof. Michèle Basseville for sending us the corresponding slides) discussing Ali-Silvey-Csiszár $f$-divergences [5], [6] and Bregman divergences [22], [8], two entropy-based divergence classes. Those information-theoretic distortion classes are compared using quadratic differential metrics, mean values and projections; the notion of skew Jensen differences intervenes in the discussion.
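As a quick numerical illustration of Eqs. 16-18, the following sketch (ours; the helper names are hypothetical) verifies that the separable Burbea-Rao divergence generated by the negative Shannon entropy coincides with the Jensen-Shannon divergence:

```python
import numpy as np

def burbea_rao(F, p, q):
    """Separable Burbea-Rao divergence BR_F (Eqs. 16-17), applied coordinate-wise."""
    return np.sum(0.5 * (F(p) + F(q)) - F(0.5 * (p + q)))

# Generator F(x) = -H(x) = x log x (negative Shannon entropy, coordinate-wise).
neg_entropy = lambda x: x * np.log(x)

def shannon_H(x):
    return -np.sum(x * np.log(x))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# Eq. 18: JS(p, q) = H((p + q)/2) - (H(p) + H(q))/2.
js = shannon_H(0.5 * (p + q)) - 0.5 * (shannon_H(p) + shannon_H(q))
assert np.isclose(burbea_rao(neg_entropy, p, q), js)
```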
C. Contributions and paper organization

The paper is articulated in two parts: the first part studies the Burbea-Rao centroids, and the second part shows applications in Statistics. We summarize our contributions as follows:

• We define the parametric class of (skew) Burbea-Rao divergences, and show that those divergences arise naturally when generalizing the principle of the Jensen-Shannon divergence [20] to Jensen-Bregman divergences. We further prove that, in the limit cases, those skew Burbea-Rao divergences yield Bregman divergences asymptotically.

• We show that the centroids with respect to the (skew) Burbea-Rao divergences are unique. Except for special cases of Burbea-Rao divergences (including the squared Euclidean distance), those centroids are not available in closed form. However, we show that any Burbea-Rao centroid can be estimated efficiently using an iterative convex-concave optimization procedure. As a by-product, we obtain the Bregman sided centroids [7] in closed form in the extremal skew cases.

We then consider applications of Burbea-Rao centroids in Statistics, and show the link with Bhattacharyya distances. A wide class of statistical parametric models can be handled in a unified manner as exponential families [23]. The exponential families contain many of the standard parametric models, including the Poisson, Gaussian, multinomial, and Gamma/Beta distributions, to name a few prominent members. However, only a few closed-form formulas for the statistical Bhattacharyya distances between those densities are reported in the literature (for instance, the Bhattacharyya distance between multivariate normal distributions is given in [24]). For the second part, our contributions are as follows:

• We show that the (skew) Bhattacharyya distances calculated for distributions belonging to the same exponential family are equivalent to (skew) Burbea-Rao divergences. We give the corresponding closed-form formulas for computing Chernoff coefficients and $\alpha$-divergences of exponential families. In the limit case, we obtain an alternative proof that the Kullback-Leibler divergence of members of the same exponential family is equivalent to a Bregman divergence calculated on the natural parameters [14].

• We approximate iteratively the Bhattacharyya centroid of any set of distributions of the same exponential family (including multivariate Gaussians) using the Burbea-Rao centroid algorithm. For the case of multivariate Gaussians, we also design another, tailored iterative scheme based on matrix differentials, generalizing the former univariate study of Rigazio et al. [25]. Thus we obtain both a generic way and a tailored way of computing the Bhattacharyya centroids of arbitrary Gaussians.

• As a field application, we show how to simplify Gaussian mixture models using hierarchical clustering, and show experimentally that the results obtained with the Bhattacharyya centroids compare favorably with former results obtained for Bregman centroids [26]. Our numerical experiments show that the generic method outperforms the alternative tailored method for multivariate Gaussians.
The paper is organized as follows: In Section II, we introduce Burbea-Rao divergences as a natural extension of the Jensen-Shannon divergence using the framework of Bregman divergences. Section III considers the general case of skew divergences, and reveals the asymptotic behavior of extremely skewed Burbea-Rao divergences as Bregman divergences. Section IV defines the (skew) Burbea-Rao centroids, shows that they are unique, and presents a simple iterative algorithm with guaranteed convergence. We then consider applications in Statistics in Section V: after briefly recalling exponential families in §V-A, we show that Bhattacharyya distances and Chernoff/Amari $\alpha$-divergences are available in closed form as Burbea-Rao divergences for distributions of the same exponential family. Section V-C presents an alternative iterative algorithm tailored to computing the Bhattacharyya centroid of multivariate Gaussians, generalizing the former specialized work of Rigazio et al. [25]. In Section V-D, we use those Bhattacharyya/Burbea-Rao centroids to hierarchically simplify Gaussian mixture models, and comment both qualitatively and quantitatively on our experiments on a color image segmentation application. Finally, Section VI concludes the paper by describing further perspectives and hinting at some information-geometric aspects of this work.

II. BURBEA-RAO DIVERGENCES FROM SYMMETRIZATION OF BREGMAN DIVERGENCES

Let $\mathbb{R}^+ = [0, +\infty)$ denote the set of non-negative reals. For a strictly convex (and differentiable) generator $F$, we define the Burbea-Rao divergence as the following non-negative function:

$$ \mathrm{BR}_F : \mathcal{X}\times\mathcal{X} \to \mathbb{R}^+, \quad (p, q) \mapsto \mathrm{BR}_F(p, q) = \frac{F(p) + F(q)}{2} - F\left(\frac{p + q}{2}\right) \ge 0. $$

The non-negativity of those divergences follows straightforwardly from Jensen's inequality. Although Burbea-Rao distances are symmetric ($\mathrm{BR}_F(p, q) = \mathrm{BR}_F(q, p)$), they are not metrics, since they fail to satisfy the triangle inequality. A geometric interpretation of those divergences is given in Figure 1. Note that $F$ is defined up to an affine term $ax + b$.

Fig. 1. Interpreting the Burbea-Rao divergence $\mathrm{BR}_F(p, q)$ as the vertical distance between the midpoint of the segment $[(p, F(p)), (q, F(q))]$ and the point $(\frac{p+q}{2}, F(\frac{p+q}{2}))$ on the graph plot.

Fig. 2. Interpreting the Bregman divergence $B_F(p, q)$ as the vertical distance between the tangent plane $H_q$ at $q$ and its translate $H'_q$ passing through $\hat{p}$ (with identical slope $\nabla F(q)$): $B_F(p, q) = H_q - H'_q$.

We show that Burbea-Rao divergences extend the Jensen-Shannon divergence using the broader concept of Bregman divergences instead of the Kullback-Leibler divergence. A Bregman divergence [22], [8], [9] $B_F$ is defined as the positive tail of the first-order Taylor expansion of a strictly convex and differentiable function $F$:

$$ B_F(p, q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle, \qquad (19) $$

where $\nabla F$ denotes the gradient of $F$ (the vector of partial derivatives $\{\frac{\partial F}{\partial x_i}\}_i$), and $\langle x, y\rangle = x^T y$ the inner product (dot product for vectors).
A Bregman divergence is interpreted geometrically [14] as the vertical distance between the tangent hyperplane $H_q$ at $q$ to the graph plot $\mathcal{F} = \{\hat{x} = (x, F(x)) \mid x \in \mathcal{X}\}$ and its translate $H'_q$ passing through $\hat{p} = (p, F(p))$. Figure 2 depicts the geometric interpretation of the Bregman divergence (to be compared with the Burbea-Rao divergence of Figure 1). Bregman divergences are never metrics, and are symmetric only for the generalized quadratic distances [14] obtained by choosing $F(x) = x^T Q x$ for some positive definite matrix $Q \succ 0$. Bregman divergences encapsulate both statistical and geometric distances:

• the Kullback-Leibler divergence, obtained for $F(x) = x\log x$:

$$ \mathrm{KL}(p, q) = \sum_{i=1}^d p^{(i)} \log\frac{p^{(i)}}{q^{(i)}}, \qquad (20) $$

• the squared Euclidean distance, obtained for $F(x) = x^2$:

$$ L_2^2(p, q) = \sum_{i=1}^d (p^{(i)} - q^{(i)})^2 = \|p - q\|^2. \qquad (21) $$

Basically, there are two ways to symmetrize Bregman divergences (see also the work on Bregman metrization [27], [28]):

• Jeffreys-Bregman divergences. We consider half of the double-sided divergence:

$$ S_F(p; q) = \frac{B_F(p, q) + B_F(q, p)}{2} = \frac{1}{2}\langle p - q, \nabla F(p) - \nabla F(q)\rangle. \qquad (22\text{-}23) $$

Except for the generalized quadratic distances, this symmetric distance cannot be interpreted as a Bregman divergence [14].

• Jensen-Bregman divergences. We consider the Bregman divergences from the source parameters to the average parameter $\frac{p+q}{2}$:

$$ J_F(p; q) = \frac{B_F(p, \frac{p+q}{2}) + B_F(q, \frac{p+q}{2})}{2} = \frac{F(p) + F(q)}{2} - F\left(\frac{p+q}{2}\right) = \mathrm{BR}_F(p, q). \qquad (24) $$

Note that even for the negative Shannon entropy $F(x) = x\log x - x$ (extended to positive measures), those two symmetrizations yield different divergences: while $S_F$ uses the gradient $\nabla F$, $J_F$ relies only on the generator $F$. Both $J_F$ and $S_F$ always have finite values (which may not be the case for Bregman/Kullback-Leibler divergences, which can potentially be unbounded). The first symmetrization approach was historically studied by Jeffreys [29].

The second way of symmetrizing Bregman divergences generalizes the spirit of the Jensen-Shannon divergence [20]

$$ \mathrm{JS}(p, q) = \frac{1}{2}\left(\mathrm{KL}\left(p, \frac{p+q}{2}\right) + \mathrm{KL}\left(q, \frac{p+q}{2}\right)\right) = H\left(\frac{p+q}{2}\right) - \frac{H(p) + H(q)}{2}, \qquad (25\text{-}26) $$

whose non-negativity can be derived from Jensen's inequality, hence its name. The Jensen-Shannon divergence is also called the total divergence to the average, a generalized measure of diversity from the population distributions $p$ and $q$ to the average population $\frac{p+q}{2}$. Those Jensen difference-type divergences are by definition Burbea-Rao divergences. For the Shannon entropy, the two information divergence symmetrizations (the Jensen-Shannon divergence and the Jeffreys $J$-divergence) satisfy the inequality

$$ J(p, q) \ge 4\,\mathrm{JS}(p, q) \ge 0. \qquad (27) $$

Nielsen and Nock [7] investigated the centroids with respect to Jeffreys-Bregman divergences (the symmetrized Kullback-Leibler divergence).
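The two symmetrizations can be checked numerically. The sketch below (our code, a minimal check, not the paper's implementation) verifies that the Jensen-Bregman divergence of Eq. 24 equals the Burbea-Rao divergence (the gradient terms cancel), and illustrates Eq. 27, which reads $S_F \ge 2\,\mathrm{JS}$ in this notation since $J = 2 S_F$:

```python
import numpy as np

F = lambda x: np.sum(x * np.log(x))    # negative Shannon entropy generator
grad_F = lambda x: np.log(x) + 1.0

def bregman(x, y):
    return F(x) - F(y) - np.dot(x - y, grad_F(y))

p = np.array([0.2, 0.5, 0.3]); q = np.array([0.4, 0.4, 0.2]); m = 0.5 * (p + q)

J_F = 0.5 * (bregman(p, m) + bregman(q, m))   # Jensen-Bregman (Eq. 24)
BR  = 0.5 * (F(p) + F(q)) - F(m)              # Burbea-Rao (Eq. 16), here = JS(p, q)
S_F = 0.5 * (bregman(p, q) + bregman(q, p))   # Jeffreys-Bregman (Eqs. 22-23)

assert np.isclose(J_F, BR)   # the gradient terms at the midpoint cancel exactly
assert S_F >= 2.0 * BR       # equivalent to Eq. 27 since J = 2 S_F and BR = JS
```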
III. SKEW BURBEA-RAO DIVERGENCES

We further generalize Burbea-Rao divergences by introducing a positive weight $\alpha \in (0, 1)$ when averaging the source parameters $p$ and $q$:

$$ \mathrm{BR}_F^{(\alpha)} : \mathcal{X}\times\mathcal{X}\to\mathbb{R}^+, \quad \mathrm{BR}_F^{(\alpha)}(p, q) = \alpha F(p) + (1-\alpha)F(q) - F(\alpha p + (1-\alpha)q). $$

We consider the open interval $(0, 1)$, since otherwise the divergence has no discriminatory power (indeed, for $\alpha \in \{0, 1\}$, $\mathrm{BR}_F^{(\alpha)}(p, q) = 0$ for all $p, q$). Although skewed divergences are asymmetric ($\mathrm{BR}_F^{(\alpha)}(p, q) \ne \mathrm{BR}_F^{(\alpha)}(q, p)$), we can swap arguments by replacing $\alpha$ with $1 - \alpha$:

$$ \mathrm{BR}_F^{(\alpha)}(p, q) = \alpha F(p) + (1-\alpha)F(q) - F(\alpha p + (1-\alpha)q) = \mathrm{BR}_F^{(1-\alpha)}(q, p). \qquad (28) $$

Those skew Burbea-Rao divergences similarly arise from a skew Jensen-Bregman counterpart (the gradient terms $\nabla F(\alpha p + (1-\alpha)q)$ cancel perfectly in the weighted sum of skew Bregman divergences):

$$ \alpha B_F(p, \alpha p + (1-\alpha)q) + (1-\alpha) B_F(q, \alpha p + (1-\alpha)q) \stackrel{\mathrm{def}}{=} \mathrm{BR}_F^{(\alpha)}(p, q). $$

In the limit cases $\alpha \to 0$ or $\alpha \to 1$, we have $\mathrm{BR}_F^{(\alpha)}(p, q) \to 0$ for all $p, q$; that is, those divergences lose their discriminatory power at the extremities. However, we show that those skew Burbea-Rao divergences tend asymptotically to Bregman divergences:

$$ B_F(p, q) = \lim_{\alpha\to 0} \frac{1}{\alpha}\mathrm{BR}_F^{(\alpha)}(p, q), \qquad (29) $$

$$ B_F(q, p) = \lim_{\alpha\to 1} \frac{1}{1-\alpha}\mathrm{BR}_F^{(\alpha)}(p, q). \qquad (30) $$

The limit on the right-hand side of Eq. 30 can be expressed alternatively as the one-sided limit

$$ \lim_{\alpha\uparrow 1} \frac{1}{1-\alpha}\mathrm{BR}_F^{(\alpha)}(p, q) = \lim_{\alpha\downarrow 0}\frac{1}{\alpha}\mathrm{BR}_F^{(\alpha)}(q, p), \qquad (31) $$

where the arrows $\uparrow$ and $\downarrow$ denote limits from the left and from the right, respectively (see [30] for notations). The right derivative of a function $f$ at $x$ is defined as $f'_+(x) = \lim_{y\downarrow x}\frac{f(y)-f(x)}{y-x}$. Since $\mathrm{BR}_F^{(0)}(p, q) = 0$ for all $p, q$, the right-hand-side limit of Eq. 31 is the right derivative (see Theorem 1 of [30], which gives a generalized Taylor expansion of convex functions) of the map

$$ L(\alpha) : \alpha \mapsto \mathrm{BR}_F^{(\alpha)}(q, p) \qquad (32) $$

taken at $\alpha = 0$. Thus we have

$$ \lim_{\alpha\downarrow 0}\frac{1}{\alpha}\mathrm{BR}_F^{(\alpha)}(q, p) = L'_+(0), \qquad (33) $$

with

$$ L'_+(0) = \frac{\mathrm{d}^+}{\mathrm{d}\alpha}\left(\alpha F(q) + (1-\alpha)F(p) - F(\alpha q + (1-\alpha)p)\right) = F(q) - F(p) - \langle q - p, \nabla F(p)\rangle = B_F(q, p). \qquad (34\text{-}35) $$

Lemma 1: Skew Burbea-Rao divergences tend asymptotically to Bregman divergences ($\alpha \to 0$) or reverse Bregman divergences ($\alpha \to 1$).

Thus we may scale skew Burbea-Rao divergences so that Bregman divergences belong to the (scaled) skew Burbea-Rao family:

$$ \mathrm{sBR}_F^{(\alpha)}(p, q) = \frac{1}{\alpha(1-\alpha)}\left(\alpha F(p) + (1-\alpha)F(q) - F(\alpha p + (1-\alpha)q)\right). \qquad (36) $$

Moreover, $\alpha$ is then no longer restricted to $(0, 1)$ but ranges over the full real line $\alpha \in \mathbb{R}$, as also noticed in [31]. Setting $\alpha = \frac{1-\alpha'}{2}$ (that is, $\alpha' = 1 - 2\alpha$), we get

$$ \mathrm{sBR}_F^{(\alpha')}(p, q) = \frac{4}{1-\alpha'^2}\left(\frac{1-\alpha'}{2}F(p) + \frac{1+\alpha'}{2}F(q) - F\left(\frac{1-\alpha'}{2}p + \frac{1+\alpha'}{2}q\right)\right). \qquad (37) $$
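A small numerical check of Lemma 1 (Eq. 29), in our own minimal Python; the generator $F(x) = \sum_i x_i \log x_i$ is just one example choice:

```python
import numpy as np

F = lambda x: np.sum(x * np.log(x))
grad_F = lambda x: np.log(x) + 1.0

def skew_BR(alpha, p, q):
    """Skew Burbea-Rao divergence BR_F^{(alpha)}(p, q)."""
    return alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)

def bregman(x, y):
    return F(x) - F(y) - np.dot(x - y, grad_F(y))

p = np.array([0.2, 0.5, 0.3]); q = np.array([0.4, 0.4, 0.2])

# Eq. 29: (1/alpha) BR^{(alpha)}(p, q) -> B_F(p, q) as alpha -> 0.
for alpha in [1e-2, 1e-4, 1e-6]:
    print(alpha, skew_BR(alpha, p, q) / alpha, bregman(p, q))
```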
IV. BURBEA-RAO CENTROIDS

Let $\mathcal{P} = \{p_1, ..., p_n\}$ denote a $d$-dimensional point set. To each point, let us further associate a positive weight $w_i$ (accounting for arbitrary multiplicity) and a scalar $\alpha_i \in (0, 1)$, to define an anchored distance $\mathrm{BR}_F^{(\alpha_i)}(\cdot, p_i)$. Define the skew Burbea-Rao barycenter (or centroid) $c$ — we also call them skew Jensen barycenters or centroids, since they are induced by a divergence built on Jensen's inequality — as the minimizer of the following optimization task:

$$ \mathrm{OPT}: \quad c = \arg\min_x \sum_{i=1}^n w_i \mathrm{BR}_F^{(\alpha_i)}(x, p_i) = \arg\min_x L(x). \qquad (38) $$

Without loss of generality, we consider the argument $x$ in the left position (otherwise, change all $\alpha_i \to 1 - \alpha_i$ to get the right-sided Burbea-Rao centroid). Removing all terms independent of $x$, the minimization program (OPT) amounts to minimizing the energy

$$ E(c) = \left(\sum_{i=1}^n w_i\alpha_i\right)F(c) - \sum_{i=1}^n w_i F(\alpha_i c + (1-\alpha_i)p_i). \qquad (39) $$

Observe that this energy decomposes as the sum of a convex function $(\sum_{i=1}^n w_i\alpha_i)F(c)$ and a concave function $-\sum_{i=1}^n w_i F(\alpha_i c + (1-\alpha_i)p_i)$ (since a sum of $n$ concave functions is concave). We can thus solve this optimization problem iteratively using the Convex-ConCave Procedure [32], [33] (CCCP), starting from an initial position $c_0$ (say, the barycenter $c_0 = \sum_{i=1}^n w_i p_i$) and iteratively updating the barycenter as follows:

$$ \nabla F(c_{t+1}) = \frac{1}{\sum_{i=1}^n w_i\alpha_i}\sum_{i=1}^n w_i\alpha_i \nabla F(\alpha_i c_t + (1-\alpha_i)p_i), \qquad (40) $$

$$ c_{t+1} = \nabla F^{-1}\left(\frac{1}{\sum_{i=1}^n w_i\alpha_i}\sum_{i=1}^n w_i\alpha_i \nabla F(\alpha_i c_t + (1-\alpha_i)p_i)\right). \qquad (41) $$

Since $F$ is strictly convex, the second-order derivative $\nabla^2 F$ is positive definite, and $\nabla F$ is strictly monotone increasing. Thus we can interpret Eq. 41 as a fixed-point equation in the $\nabla F$-representation; each iteration computes a quasi-arithmetic mean. This proves that the Burbea-Rao centroid is always well-defined and unique (see Appendix A for a detailed proof), since there is at most one fixed point of $x = g(x)$ for a strictly monotone increasing function $g(\cdot)$.

In some cases, such as the squared Euclidean distance (or squared Mahalanobis distances), we find closed-form solutions for the Burbea-Rao barycenters. For example, consider the (negative) quadratic entropy $F(x) = \langle x, x\rangle = \sum_{i=1}^d (x^{(i)})^2$ with weights $w_i$ and all $\alpha_i = \frac{1}{2}$ (non-skew, symmetric Burbea-Rao divergences). We have

$$ E(x) = \frac{F(x)}{2} - \sum_{i=1}^n w_i F\left(\frac{p_i + x}{2}\right) = \frac{\langle x, x\rangle}{2} - \frac{1}{4}\sum_{i=1}^n w_i\left(\langle x, x\rangle + 2\langle x, p_i\rangle + \langle p_i, p_i\rangle\right). \qquad (42) $$

The minimum is attained when the gradient vanishes, $\nabla E(x) = 0$, that is, for $x = \bar{p} = \sum_{i=1}^n w_i p_i$, the barycenter of the point set $\mathcal{P}$. For most Burbea-Rao divergences, however, Eq. 42 can only be solved numerically. Observe that in the extremal skew cases ($\alpha \to 0$ or $\alpha \to 1$), we obtain the Bregman centroids in closed form (see Eq. 30). Thus skew Burbea-Rao centroids allow one to obtain a smooth transition from the right-sided centroid (the center of mass) to the left-sided centroid (a quasi-arithmetic mean $M_f$ obtained for $f = \nabla F$, a continuous and strictly increasing function).

Theorem 1: Skew Burbea-Rao centroids are unique. They can be estimated iteratively using the CCCP algorithm. In the extremal skew cases, the Burbea-Rao centroids tend to the left/right-sided Bregman centroids, and have closed-form equations in those limit cases.

To describe the orbit of Burbea-Rao centroids linking the left-sided to the right-sided Bregman centroid, we compute for $\alpha \in [0, 1]$ the skew Burbea-Rao centroids with the following update scheme:

$$ c_{t+1} = \nabla F^{-1}\left(\sum_{i=1}^n w_i \nabla F(\alpha c_t + (1-\alpha)p_i)\right). \qquad (43) $$
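The CCCP update of Eq. 41 is a few lines of code. A hedged sketch (our function names; the extended negative entropy generator $F(x) = \sum_i x_i \log x_i$ is one concrete choice among many, and all $\alpha_i$ are taken equal for simplicity):

```python
import numpy as np

grad_F     = lambda x: np.log(x) + 1.0   # gradient of F(x) = sum_i x_i log x_i
grad_F_inv = lambda y: np.exp(y - 1.0)   # its inverse map (reciprocal gradient)

def burbea_rao_centroid(points, weights, alpha=0.5, iters=50):
    """CCCP fixed-point iteration of Eq. 41, specialized to equal alphas."""
    c = np.average(points, axis=0, weights=weights)   # start at the barycenter
    for _ in range(iters):
        g = np.average([grad_F(alpha * c + (1 - alpha) * p) for p in points],
                       axis=0, weights=weights)
        c = grad_F_inv(g)   # each iteration is a quasi-arithmetic mean
    return c

pts = np.array([[0.2, 0.5, 0.3], [0.4, 0.4, 0.2], [0.1, 0.6, 0.3]])
print(burbea_rao_centroid(pts, weights=[1/3, 1/3, 1/3]))
```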
We may further consider a different convex generator $F_i$ for each point, with the updating scheme

$$ c_{t+1} = \left(\sum_i w_i \nabla F_i\right)^{-1}\left(\frac{1}{\sum_{i=1}^n w_i\alpha_i}\sum_{i=1}^n w_i\alpha_i \nabla F_i(\alpha_i c_t + (1-\alpha_i)p_i)\right). $$

A. Burbea-Rao divergences of a population

Consider now the Burbea-Rao divergence of a population $p_1, ..., p_n$ with respective positive normalized weights $w_1, ..., w_n$. The Burbea-Rao divergence of the population is defined by

$$ \mathrm{BR}_F^w(p_1, ..., p_n) = \sum_{i=1}^n w_i F(p_i) - F\left(\sum_{i=1}^n w_i p_i\right) \ge 0. \qquad (44) $$

This family of diversity measures includes the Jensen-Rényi divergences [34], [35] for $F(x) = -R_\alpha(x)$, where $R_\alpha(x) = \frac{1}{1-\alpha}\log\sum_{j=1}^d p_j^\alpha$ is the Rényi entropy of order $\alpha$. (The Rényi entropy is concave for $\alpha \in (0, 1)$ and tends to the Shannon entropy as $\alpha \to 1$.)

V. BHATTACHARYYA DISTANCES AS BURBEA-RAO DISTANCES

We first briefly recall the versatile class of exponential family distributions in Section V-A. Then we show in Section V-B that the statistical Bhattacharyya/Chernoff distances between exponential family distributions amount to computing a Burbea-Rao divergence.

A. Exponential family distributions in Statistics

Many common statistical parametric distributions $p(x; \lambda)$ (e.g., Gaussian, Poisson, Bernoulli/multinomial, Gamma/Beta, etc.) share properties arising from their common canonical decomposition [9]:

$$ p(x; \lambda) = p_F(x; \theta) = \exp\left(\langle t(x), \theta\rangle - F(\theta) + k(x)\right). \qquad (45) $$

Those distributions are said to belong to the exponential families (see [23] for a tutorial); the distributions may be either discrete or continuous (we do not introduce the unifying framework of probability measures, so as not to burden the paper). An exponential family is characterized by its log-normalizer $F(\theta)$, and a distribution in the family by its natural parameter $\theta$ belonging to the natural space $\Theta$. The log-normalizer $F$ is strictly convex and $C^\infty$, and can also be expressed in the source coordinate system $\lambda$ using the 1-to-1 map $\tau: \Lambda \to \Theta$ that converts parameters from the source coordinate system $\lambda$ to the natural coordinate system $\theta$:

$$ F(\theta) = F(\tau(\lambda)) = (F\circ\tau)(\lambda) = F_\lambda(\lambda), \qquad (46) $$

where $F_\lambda = F\circ\tau$ denotes the log-normalizer expressed in the $\lambda$-coordinates instead of the natural $\theta$-coordinates. The vector $t(x)$ denotes the sufficient statistics, that is, the set of linearly independent functions that allow one to concentrate, without any loss, all the information about the parameter $\theta$ carried in the i.i.d. observations $x_1, x_2, ...$. The inner product $\langle p, q\rangle$ is defined according to the primitive type of $\theta$: it is the multiplication $\langle p, q\rangle = pq$ for scalars, the dot product $\langle p, q\rangle = p^T q$ for vectors, the matrix trace $\langle p, q\rangle = \mathrm{tr}(p^T q) = \mathrm{tr}(p q^T)$ for matrices, etc. For composite types, such as $\theta$ having both a vector part and a matrix part, the composite inner product is defined as the sum of the inner products on the primitive types. Finally, $k(x)$ represents the carrier measure with respect to the counting or Lebesgue measure. Decompositions for most common exponential family distributions are given in [23]. An exponential family $E_F = \{p_F(x; \theta) \mid \theta\in\Theta\}$ is the set of probability distributions obtained for the same log-normalizer function $F$.
Information geometry considers $E_F$ as a manifold and studies its differential-geometric properties [12]. For example, consider the family of Poisson distributions with mass function

$$ p(x; \lambda) = \frac{\lambda^x}{x!}\exp(-\lambda), \qquad (47) $$

for $x \in \mathbb{N}\cup\{0\}$ a non-negative integer. Poisson distributions are univariate exponential families of order 1 (a single parameter $\lambda$). The canonical decomposition yields:
• the sufficient statistic $t(x) = x$,
• the natural parameter $\theta = \log\lambda$,
• the log-normalizer $F(\theta) = \exp\theta$,
• and the carrier measure $k(x) = -\log x!$ (with respect to the counting measure).

Since we deal with applications using multivariate normals in the following, we also report explicitly the canonical decomposition for the multivariate Gaussian family $\{p_F(x; \theta) \mid \theta\in\Theta\}$. We rewrite the usual Gaussian density of mean $\mu$ and variance-covariance matrix $\Sigma$,

$$ p(x; \lambda) = p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\sqrt{\det\Sigma}}\exp\left(-\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{2}\right), \qquad (48\text{-}49) $$

in the canonical form of Eq. 45 with:
• $\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}) \in \Theta = \mathbb{R}^d\times K_{d\times d}$, where $K_{d\times d}$ denotes the cone of positive definite matrices,
• $F(\theta) = \frac{1}{4}\mathrm{tr}(\theta_2^{-1}\theta_1\theta_1^T) - \frac{1}{2}\log\det\theta_2 + \frac{d}{2}\log\pi$,
• $t(x) = (x, -xx^T)$,
• $k(x) = 0$.

In this case, the inner product is composite, calculated as the sum of a dot product and a matrix trace:

$$ \langle\theta, \theta'\rangle = \theta_1^T\theta_1' + \mathrm{tr}(\theta_2^T\theta_2'). \qquad (50) $$

The coordinate transformation $\tau: \Lambda\to\Theta$ is given for $\lambda = (\mu, \Sigma)$ by

$$ \tau(\lambda) = \left(\lambda_2^{-1}\lambda_1, \frac{1}{2}\lambda_2^{-1}\right), \qquad (51) $$

and its inverse mapping $\tau^{-1}: \Theta\to\Lambda$ by

$$ \tau^{-1}(\theta) = \left(\frac{1}{2}\theta_2^{-1}\theta_1, \frac{1}{2}\theta_2^{-1}\right). \qquad (52) $$
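As a sanity check of the Gaussian canonical decomposition, the following sketch (our code; function names are ours) implements the coordinate conversions of Eqs. 51-52 and the log-normalizer $F(\theta)$, and verifies that $\tau^{-1}\circ\tau$ is the identity:

```python
import numpy as np

def tau(mu, Sigma):
    """Source to natural parameters (Eq. 51): theta = (Sigma^{-1} mu, (1/2) Sigma^{-1})."""
    S_inv = np.linalg.inv(Sigma)
    return S_inv @ mu, 0.5 * S_inv

def tau_inv(theta1, theta2):
    """Inverse map (Eq. 52): lambda = ((1/2) theta2^{-1} theta1, (1/2) theta2^{-1})."""
    T_inv = np.linalg.inv(theta2)
    return 0.5 * T_inv @ theta1, 0.5 * T_inv

def F(theta1, theta2, d):
    """Log-normalizer of the multivariate Gaussian family."""
    return (0.25 * np.trace(np.linalg.inv(theta2) @ np.outer(theta1, theta1))
            - 0.5 * np.log(np.linalg.det(theta2)) + 0.5 * d * np.log(np.pi))

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
t1, t2 = tau(mu, Sigma)
mu2, Sigma2 = tau_inv(t1, t2)
assert np.allclose(mu, mu2) and np.allclose(Sigma, Sigma2)
print(F(t1, t2, 2))
```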
B. Bhattacharyya/Chernoff coefficients and α-divergences as skew Burbea-Rao divergences

For arbitrary probability distributions $p(x)$ and $q(x)$ (parametric or not), we measure the amount of overlap between the distributions using the Bhattacharyya coefficient [36]:

$$ C(p, q) = \int \sqrt{p(x)q(x)}\,\mathrm{d}x. \qquad (53) $$

Clearly, the Bhattacharyya coefficient (measuring the affinity between distributions [37]) falls in the unit range:

$$ 0 \le C(p, q) \le 1. \qquad (54) $$

In fact, we may interpret this coefficient geometrically by considering $\sqrt{p(x)}$ and $\sqrt{q(x)}$ as unit vectors: the Bhattacharyya coefficient is then their inner product, representing the cosine of the angle made by the two unit vectors. The Bhattacharyya distance $B : \mathcal{X}\times\mathcal{X}\to\mathbb{R}^+$ is derived from the coefficient [36] as

$$ B(p, q) = -\ln C(p, q). \qquad (55) $$

The Bhattacharyya distance allows one to both upper- and lower-bound the Bayes classification error [38], [39], while no such results exist for the symmetric Kullback-Leibler divergence. Both the Bhattacharyya distance and the symmetric Kullback-Leibler divergence agree with the Fisher information at the infinitesimal level. Although the Bhattacharyya distance is symmetric, it is not a metric. Nevertheless, it can be metrized by transforming it into the Hellinger metric [40]

$$ H(p, q) = \sqrt{\frac{1}{2}\int\left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2\mathrm{d}x}, \qquad (56) $$

such that $0 \le H(p, q) \le 1$. It follows that

$$ H(p, q) = \sqrt{\frac{1}{2}\left(\int p(x)\,\mathrm{d}x + \int q(x)\,\mathrm{d}x - 2\int\sqrt{p(x)}\sqrt{q(x)}\,\mathrm{d}x\right)} = \sqrt{1 - C(p, q)}. \qquad (57) $$

The Hellinger metric is also called the Matusita metric [37] in the literature. Hellinger's thesis was emphasized in the work of Kakutani [41].

We consider a direct generalization of the Bhattacharyya coefficients and divergences called Chernoff divergences. (In the literature, Chernoff information is also defined as $-\log\inf_{\alpha\in[0,1]}\int p^\alpha(x)q^{1-\alpha}(x)\,\mathrm{d}x$; similarly, Chernoff coefficients $C_\alpha(p, q)$ are sometimes defined with a supremum over $\alpha \in [0, 1]$.)

$$ B_\alpha(p, q) = -\ln\int_x p^\alpha(x)q^{1-\alpha}(x)\,\mathrm{d}x = -\ln C_\alpha(p, q) = -\ln\int_x q(x)\left(\frac{p(x)}{q(x)}\right)^\alpha\mathrm{d}x = -\ln E_q[L^\alpha(x)], \qquad (58\text{-}60) $$

defined for some $\alpha \in (0, 1)$ (the Bhattacharyya divergence is obtained for $\alpha = \frac{1}{2}$), where $E[\cdot]$ denotes the expectation and $L(x) = \frac{p(x)}{q(x)}$ the likelihood ratio. The term $\int_x p^\alpha(x)q^{1-\alpha}(x)\,\mathrm{d}x$ is called the Chernoff coefficient. The Bhattacharyya/Chernoff distance between members of the same exponential family yields a weighted asymmetric Burbea-Rao divergence (namely, a skew Burbea-Rao divergence):

$$ B_\alpha(p_F(x; \theta_p), p_F(x; \theta_q)) = \mathrm{BR}_F^{(\alpha)}(\theta_p, \theta_q), \qquad (61) $$

with

$$ \mathrm{BR}_F^{(\alpha)}(\theta_p, \theta_q) = \alpha F(\theta_p) + (1-\alpha)F(\theta_q) - F(\alpha\theta_p + (1-\alpha)\theta_q). \qquad (62) $$

Chernoff coefficients are also related to $\alpha$-divergences, the canonical divergences in $\alpha$-flat spaces in information geometry [12] (p. 57):

$$ D_\alpha(p\|q) = \begin{cases} \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{\frac{1-\alpha}{2}}q(x)^{\frac{1+\alpha}{2}}\,\mathrm{d}x\right), & \alpha\ne\pm 1,\\ \int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}x = \mathrm{KL}(p, q), & \alpha = -1,\\ \int q(x)\log\frac{q(x)}{p(x)}\,\mathrm{d}x = \mathrm{KL}(q, p), & \alpha = 1.\end{cases} \qquad (63) $$

The class of $\alpha$-divergences satisfies the reference duality $D_\alpha(p\|q) = D_{-\alpha}(q\|p)$. Remapping $\alpha' = \frac{1-\alpha}{2}$ ($\alpha = 1 - 2\alpha'$), we transform Amari $\alpha$-divergences into Chernoff $\alpha'$-divergences:

$$ D_{\alpha'}(p, q) = \begin{cases} \frac{1}{\alpha'(1-\alpha')}\left(1 - \int p(x)^{\alpha'}q(x)^{1-\alpha'}\,\mathrm{d}x\right), & \alpha'\notin\{0, 1\},\\ \int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}x = \mathrm{KL}(p, q), & \alpha' = 1,\\ \int q(x)\log\frac{q(x)}{p(x)}\,\mathrm{d}x = \mathrm{KL}(q, p), & \alpha' = 0.\end{cases} \qquad (64) $$

(Chernoff coefficients are also related to the Rényi $\alpha$-divergence generalizing the Kullback-Leibler divergence, $R_\alpha(p\|q) = \frac{1}{\alpha-1}\log\int_x p(x)^\alpha q^{1-\alpha}(x)\,\mathrm{d}x$, built on the Rényi entropy $H_R^\alpha(p) = \frac{1}{1-\alpha}\log\int_x p^\alpha(x)\,\mathrm{d}x$. The Tsallis entropy $H_T^\alpha(p) = \frac{1}{\alpha-1}\left(1 - \int p(x)^\alpha\,\mathrm{d}x\right)$ can be obtained from the Rényi entropy (and vice versa) via the mappings $H_T^\alpha(p) = \frac{1}{1-\alpha}(e^{(1-\alpha)H_R^\alpha(p)} - 1)$ and $H_R^\alpha(p) = \frac{1}{1-\alpha}\log(1 + (1-\alpha)H_T^\alpha(p))$.)

Theorem 2: The Chernoff $\alpha'$-divergence ($\alpha'\notin\{0,1\}$) of distributions belonging to the same exponential family is given in closed form by means of a skew Burbea-Rao divergence:

$$ D_{\alpha'}(p, q) = \frac{1}{\alpha'(1-\alpha')}\left(1 - e^{-\mathrm{BR}_F^{(\alpha')}(\theta_p, \theta_q)}\right), $$

with $\mathrm{BR}_F^{(\alpha)}(\theta_p, \theta_q) = \alpha F(\theta_p) + (1-\alpha)F(\theta_q) - F(\alpha\theta_p + (1-\alpha)\theta_q)$. The Amari $\alpha$-divergence for members of the same exponential family amounts to computing

$$ D_\alpha(p, q) = \frac{4}{1-\alpha^2}\left(1 - e^{-\mathrm{BR}_F^{(\frac{1-\alpha}{2})}(\theta_p, \theta_q)}\right). $$
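Theorem 2 can be probed numerically. The sketch below (ours; it relies only on standard NumPy) integrates the Chernoff coefficient of two univariate Gaussians on a grid and compares it with the closed form $e^{-\mathrm{BR}_F^{(\alpha)}(\theta_p,\theta_q)}$ obtained from the univariate Gaussian log-normalizer, using natural parameters $\theta = (\mu/\sigma^2, -1/(2\sigma^2))$ (as in Table I below):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def theta_of(mu, var):
    """Natural parameters of the univariate Gaussian: (mu/var, -1/(2 var))."""
    return np.array([mu / var, -0.5 / var])

def F(th):
    """Univariate Gaussian log-normalizer: -theta1^2/(4 theta2) + (1/2) log(-pi/theta2)."""
    return -th[0] ** 2 / (4 * th[1]) + 0.5 * np.log(-np.pi / th[1])

mu_p, var_p, mu_q, var_q, a = 0.0, 1.0, 2.0, 3.0, 0.3
x = np.linspace(-20, 20, 200001)
chernoff = np.trapz(gauss_pdf(x, mu_p, var_p) ** a
                    * gauss_pdf(x, mu_q, var_q) ** (1 - a), x)

tp, tq = theta_of(mu_p, var_p), theta_of(mu_q, var_q)
BR = a * F(tp) + (1 - a) * F(tq) - F(a * tp + (1 - a) * tq)
print(chernoff, np.exp(-BR))                              # should agree
print((1 - chernoff) / (a * (1 - a)),                     # D_{alpha'} by integration
      (1 - np.exp(-BR)) / (a * (1 - a)))                  # D_{alpha'} by Theorem 2
```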
We get the following theorem for the Bhattacharyya/Chernoff distances. Let us compute the Chernoff coefficient for distributions belonging to the same exponential family. Without loss of generality, consider the reduced canonical form of exponential families $p_F(x; \theta) = \exp(\langle x, \theta\rangle - F(\theta))$. The Chernoff coefficient $C_\alpha(p, q)$ of members $p = p_F(x; \theta_p)$ and $q = p_F(x; \theta_q)$ of the same exponential family $E_F$ is:

$$
\begin{aligned}
C_\alpha(p, q) &= \int p^\alpha(x) q^{1-\alpha}(x)\,\mathrm{d}x = \int p_F^\alpha(x; \theta_p)\, p_F^{1-\alpha}(x; \theta_q)\,\mathrm{d}x\\
&= \int \exp\left(\alpha(\langle x, \theta_p\rangle - F(\theta_p))\right)\exp\left((1-\alpha)(\langle x, \theta_q\rangle - F(\theta_q))\right)\mathrm{d}x\\
&= \int \exp\left(\langle x, \alpha\theta_p + (1-\alpha)\theta_q\rangle - (\alpha F(\theta_p) + (1-\alpha)F(\theta_q))\right)\mathrm{d}x\\
&= \exp\left(F(\alpha\theta_p + (1-\alpha)\theta_q) - (\alpha F(\theta_p) + (1-\alpha)F(\theta_q))\right)\underbrace{\int p_F(x; \alpha\theta_p + (1-\alpha)\theta_q)\,\mathrm{d}x}_{=1}\\
&= \exp\left(-\mathrm{BR}_F^{(\alpha)}(\theta_p, \theta_q)\right) \ge 0.
\end{aligned}
$$

(The mixed parameter $\alpha\theta_p + (1-\alpha)\theta_q$ belongs to the natural space $\Theta$ by convexity, so the underbraced integral is indeed the unit mass of a member of $E_F$.)

Theorem 3: The skew Bhattacharyya divergence $B_\alpha(p, q)$ is equivalent to a skew Burbea-Rao divergence for members of the same exponential family $E_F$:

$$ B_\alpha(p, q) = B_\alpha(p_F(x; \theta_p), p_F(x; \theta_q)) = -\log C_\alpha(p_F(x; \theta_p), p_F(x; \theta_q)) = \mathrm{BR}_F^{(\alpha)}(\theta_p, \theta_q) \ge 0. $$

In particular, in the limit cases $\alpha \to 0$ or $\alpha \to 1$, the Kullback-Leibler divergence of those exponential family distributions amounts to computing a Bregman divergence [14].

Corollary 1: In the limit cases $\alpha' \in \{0, 1\}$, the $\alpha'$-divergences amount to computing a Kullback-Leibler divergence, which is equivalent to computing a Bregman divergence for the log-normalizer on the swapped natural parameters: $\mathrm{KL}(p_F(x; \theta_p), p_F(x; \theta_q)) = B_F(\theta_q, \theta_p)$.

Proof: The proof relies on the equivalence of Burbea-Rao divergences with Bregman divergences for the extremal values $\alpha' \in \{0, 1\}$:

$$
\begin{aligned}
\mathrm{KL}(p, q) &= \mathrm{KL}(p_F(x; \theta_p), p_F(x; \theta_q)) = \lim_{\alpha'\to 1} D_{\alpha'}(p_F(x; \theta_p), p_F(x; \theta_q)) \qquad (65\text{-}66)\\
&= \lim_{\alpha'\to 1}\frac{1}{\alpha'(1-\alpha')}\left(1 - C_{\alpha'}(p_F(x; \theta_p), p_F(x; \theta_q))\right) \quad \text{(using } e^{-x} \simeq 1 - x \text{ near } 0\text{)}\\
&= \lim_{\alpha'\to 1}\frac{1}{\alpha'(1-\alpha')}\mathrm{BR}_F^{(\alpha')}(\theta_p, \theta_q) = \lim_{\alpha'\to 1}\frac{1}{\alpha'}B_F(\theta_q, \theta_p) = B_F(\theta_q, \theta_p), \qquad (67\text{-}68)
\end{aligned}
$$

where the last line uses $\frac{1}{1-\alpha'}\mathrm{BR}_F^{(\alpha')}(\theta_p, \theta_q) \to B_F(\theta_q, \theta_p)$ as $\alpha' \to 1$ (Eq. 30). Similarly, $\lim_{\alpha'\to 0} D_{\alpha'}(p_F(x; \theta_p), p_F(x; \theta_q)) = \mathrm{KL}(p_F(x; \theta_q), p_F(x; \theta_p)) = B_F(\theta_p, \theta_q)$.

Table I reports the Bhattacharyya distances for members of the same exponential families.

TABLE I. Closed-form Bhattacharyya distances for some classes of exponential families (expressed in source parameters for ease of use).

| Exponential family | $\tau : \lambda \to \theta$ | $F(\theta)$ (up to a constant) | Bhattacharyya/Burbea-Rao $\mathrm{BR}_F(\lambda_p, \lambda_q) = \mathrm{BR}_F(\tau(\lambda_p), \tau(\lambda_q))$ |
|---|---|---|---|
| Multinomial | $\theta_i = \log\frac{p_i}{p_d}$ | $\log(1 + \sum_{i=1}^{d-1} e^{\theta_i})$ | $-\ln\sum_{i=1}^d \sqrt{p_i q_i}$ |
| Poisson | $\theta = \log\lambda$ | $e^{\theta}$ | $\frac{1}{2}(\sqrt{\mu_p} - \sqrt{\mu_q})^2$ |
| Gaussian | $\theta = (\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2})$ | $-\frac{\theta_1^2}{4\theta_2} + \frac{1}{2}\log(-\frac{\pi}{\theta_2})$ | $\frac{1}{4}\frac{(\mu_p-\mu_q)^2}{\sigma_p^2+\sigma_q^2} + \frac{1}{2}\ln\frac{\sigma_p^2+\sigma_q^2}{2\sigma_p\sigma_q}$ |
| Multivariate Gaussian | $\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1})$ | $\frac{1}{4}\mathrm{tr}(\theta_2^{-1}\theta_1\theta_1^T) - \frac{1}{2}\log\det\theta_2$ | $\frac{1}{8}(\mu_p-\mu_q)^T\left(\frac{\Sigma_p+\Sigma_q}{2}\right)^{-1}(\mu_p-\mu_q) + \frac{1}{2}\ln\frac{\det\frac{\Sigma_p+\Sigma_q}{2}}{\sqrt{\det\Sigma_p\det\Sigma_q}}$ |
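Corollary 1 is easy to verify on the Poisson family, where both sides are available in closed form (a minimal check in Python; our code):

```python
import math

F      = math.exp    # Poisson log-normalizer, with theta = log(lambda)
grad_F = math.exp

def bregman(x, y):
    return F(x) - F(y) - (x - y) * grad_F(y)

lp, lq = 3.0, 5.0
# Known closed form: KL(Poisson(lp), Poisson(lq)) = lq - lp + lp log(lp/lq).
kl_closed  = lq - lp + lp * math.log(lp / lq)
# Corollary 1: the same KL as a Bregman divergence on swapped natural parameters.
kl_bregman = bregman(math.log(lq), math.log(lp))
assert math.isclose(kl_closed, kl_bregman)
```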
C. Direct method for calculating the Bhattacharyya centroids of multivariate normals

To the best of our knowledge, the Bhattacharyya centroid has only been studied for univariate or diagonal multivariate Gaussian distributions [42] in the context of speech recognition, where it is reported that it can be estimated using an iterative algorithm (no convergence guarantees are reported in [42]). In order to compare this scheme on multivariate data with our generic Burbea-Rao scheme, we extend the approach of Rigazio et al. [42] to multivariate Gaussians. Plugging the Bhattacharyya distance of Gaussians into the energy function of the optimization problem (OPT), we get

$$ L(c) = \sum_{i=1}^n \frac{1}{8}(\mu_c-\mu_i)^T\left(\frac{\Sigma_c+\Sigma_i}{2}\right)^{-1}(\mu_c-\mu_i) + \frac{1}{2}\log\frac{\det\frac{\Sigma_c+\Sigma_i}{2}}{\sqrt{\det\Sigma_c\det\Sigma_i}}. \qquad (69) $$

This is equivalent to minimizing the following energy:

$$ F(c) = \sum_{i=1}^n (\mu_c-\mu_i)^T(\Sigma_c+\Sigma_i)^{-1}(\mu_c-\mu_i) + 2\log\det(\Sigma_c+\Sigma_i) - \log\det\Sigma_c - \log\left(2^{2d}\det\Sigma_i\right). \qquad (70) $$

In order to minimize $F(c)$, let us differentiate with respect to $\mu_c$. Let $U_i$ denote $(\Sigma_c+\Sigma_i)^{-1}$. Using matrix differentials [43] (p. 10, Eq. 73), we get

$$ \frac{\partial L}{\partial \mu_c} = \sum_{i=1}^n \left(U_i + U_i^T\right)(\mu_c - \mu_i). \qquad (71) $$

Then one can estimate $\mu_c$ iteratively, since $U_i$ depends on $\Sigma_c$, which is unknown. We update $\mu_c$ as follows:

$$ \mu_c(t+1) = \left[\sum_{i=1}^n \left(U_i + U_i^T\right)\right]^{-1}\left[\sum_{i=1}^n \left(U_i + U_i^T\right)\mu_i\right]. \qquad (72) $$

Now let us estimate $\Sigma_c$. We use matrix differentials [43] (p. 9, Eq. 55 for the first term, and p. 8, Eq. 51 for the two others):

$$ \frac{\partial L}{\partial \Sigma_c} = -\sum_{i=1}^n U_i^T(\mu_c-\mu_i)(\mu_c-\mu_i)^T U_i^T + 2\sum_{i=1}^n U_i^T - \sum_{i=1}^n \Sigma_c^{-T}. \qquad (73) $$

Taking into account the fact that $\Sigma_c$ is symmetric, the differential on symmetric matrices is

$$ \frac{\mathrm{d}L}{\mathrm{d}\Sigma_c} = \frac{\partial L}{\partial \Sigma_c} + \left(\frac{\partial L}{\partial \Sigma_c}\right)^T - \mathrm{diag}\left(\frac{\partial L}{\partial \Sigma_c}\right). \qquad (74) $$

Thus, denoting

$$ A = \sum_{i=1}^n 2U_i^T - U_i^T(\mu_c-\mu_i)(\mu_c-\mu_i)^T U_i^T, \qquad (75) $$

and recalling that $\Sigma_c$ is symmetric, one has to solve

$$ n\left(2\Sigma_c^{-1} - \mathrm{diag}(\Sigma_c^{-1})\right) = A + A^T - \mathrm{diag}(A). \qquad (76) $$

Let

$$ B = A + A^T - \mathrm{diag}(A). \qquad (77) $$

Then one can estimate $\Sigma_c$ iteratively as follows:

$$ \Sigma_c^{(k+1)} = 2n\left[B^{(k)} + \mathrm{diag}(B^{(k)})\right]^{-1}. \qquad (78) $$

Let us now compare the two methods — the generic Burbea-Rao scheme and the tailored Gaussian scheme — for computing the Bhattacharyya centroids of multivariate Gaussians.
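The update rules of Eqs. 72 and 78 translate directly into code. A hedged sketch (our implementation; as noted above, this tailored scheme comes with no convergence guarantee, unlike the CCCP scheme):

```python
import numpy as np

def bhattacharyya_centroid_gaussians(mus, Sigmas, iters=30):
    """Tailored fixed-point scheme of Eqs. 72 and 78 (a sketch; no convergence guarantee)."""
    n = len(mus)
    mu_c = np.mean(mus, axis=0)
    Sigma_c = np.mean(Sigmas, axis=0)
    for _ in range(iters):
        U = [np.linalg.inv(Sigma_c + S) for S in Sigmas]
        # Eq. 72: update the mean with the current U_i's.
        W = sum(Ui + Ui.T for Ui in U)
        mu_c = np.linalg.solve(W, sum((Ui + Ui.T) @ m for Ui, m in zip(U, mus)))
        # Eqs. 75, 77, 78: update the covariance.
        A = sum(2 * Ui.T - Ui.T @ np.outer(mu_c - m, mu_c - m) @ Ui.T
                for Ui, m in zip(U, mus))
        B = A + A.T - np.diag(np.diag(A))
        Sigma_c = 2 * n * np.linalg.inv(B + np.diag(np.diag(B)))
    return mu_c, Sigma_c

mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
print(bhattacharyya_centroid_gaussians(mus, Sigmas))
```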
D. Applications to mixture simplification in statistics

Simplifying Gaussian mixtures is important in many applications arising in signal processing [26]. Mixture simplification is also a crucial step when one wants to study the Riemannian geometry induced by the Rao distance with respect to the Fisher metric: the set of mixture models needs to have the same number of components, so we simplify the source mixtures to get a set of Gaussian mixtures of prescribed size. We adapt the hierarchical clustering algorithm of Garcia et al. [26] by replacing the symmetrized Bregman centroid (namely, the Jeffreys-Bregman centroid) with the Bhattacharyya centroid. We consider the task of color image segmentation by learning a Gaussian mixture model for each image. Each image is represented as a set of 5D points (RGB color and xy position). The first experiment, depicted in Figure 3, demonstrates the qualitative stability of the clustering performance. In particular, the hierarchical clustering with respect to the Bhattacharyya distance performs qualitatively much better on the last colormap image. (Reference images and segmentations using Bregman centroids are available at http://www.informationgeometry.org/MEF/)

Fig. 3. Color image segmentation results: (a) source images, (b) segmentation with k = 48 5D Gaussians, and (c) segmentation with k = 16 5D Gaussians.

The second experiment characterizes the numerical convergence of the generic Burbea-Rao method compared to the tailored Gaussian method. Since we presented two novel schemes to compute the Bhattacharyya centroids of multivariate Gaussians, we compare them both in terms of stability and accuracy. Whenever the ratio of the Bhattacharyya distance energy functions between the two estimated centroids exceeds 1%, we consider one of the two estimation methods beaten (namely, the method that yields the larger Bhattacharyya distance). Among the 760 centroids computed to generate Figure 3, 100% were correct with the Burbea-Rao approach, while only 87% were correct with the tailored multivariate Gaussian matrix optimization method. The average number of iterations to reach the 1% accuracy is 4.1 for the Burbea-Rao estimation algorithm, and 5.2 for the alternative method. Thus we experimentally checked that the generic CCCP iterative Burbea-Rao algorithm for computing the Bhattacharyya centroids always converges, and moreover beats the ad-hoc iterative method tailored for multivariate Gaussians.

VI. CONCLUDING REMARKS

In this paper, we have shown that the Bhattacharyya distance for distributions of the same statistical exponential family can be computed equivalently as a Burbea-Rao divergence on the corresponding natural parameters. Those results extend to skew Chernoff coefficients (and Amari α-divergences) and skew Bhattacharyya distances using the notion of skew Burbea-Rao divergences. We proved that (skew) Burbea-Rao centroids are unique, and can be efficiently estimated using an iterative concave-convex procedure with guaranteed convergence. We have shown that extremally skewed Burbea-Rao divergences amount asymptotically to evaluating Bregman divergences. This work emphasizes the attractiveness of exponential families in Statistics: for many statistical distances, one can evaluate them in closed form. For the sake of brevity, we have not mentioned the recent β-divergences and γ-divergences [44], although their distances on exponential families are again available in closed form.

The differential Riemannian geometry induced by the class of such Jensen difference measures was studied by Burbea and Rao [18], [19], who built quadratic differential metrics on probability spaces using Jensen differences. The Jensen-Shannon divergence is also an instance of a broad class of divergences called the $f$-divergences. An $f$-divergence $I_f$ is a statistical measure of dissimilarity defined by the functional $I_f(p, q) = \int p(x) f\left(\frac{q(x)}{p(x)}\right)\mathrm{d}x$. It turns out that the Jensen-Shannon divergence is the $f$-divergence for the generator

$$ f(x) = \frac{1}{2}\left((x+1)\log\frac{2}{x+1} + x\log x\right). \qquad (79) $$

$f$-divergences preserve information monotonicity [44], and their differential geometry was studied by Vos [45]. However, the Jensen-Shannon divergence is a very particular case of Burbea-Rao divergences, since the squared Euclidean distance (another Burbea-Rao divergence) does not belong to the class of $f$-divergences.
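The closing claim about the generator of Eq. 79 can be verified numerically (a minimal check; our code):

```python
import numpy as np

def f_js(x):
    """Generator of Eq. 79: f(x) = (1/2)[(x + 1) log(2/(x + 1)) + x log x]."""
    return 0.5 * ((x + 1) * np.log(2 / (x + 1)) + x * np.log(x))

def f_divergence(p, q, f):
    """Discrete f-divergence I_f(p, q) = sum_i p_i f(q_i / p_i)."""
    return np.sum(p * f(q / p))

p = np.array([0.2, 0.5, 0.3]); q = np.array([0.4, 0.4, 0.2]); m = 0.5 * (p + q)
H = lambda x: -np.sum(x * np.log(x))
js = H(m) - 0.5 * (H(p) + H(q))                  # Jensen-Shannon divergence (Eq. 18)
assert np.isclose(f_divergence(p, q, f_js), js)  # matches the f-divergence form
```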
SOURCE CODE

The generic Burbea-Rao barycenter estimation algorithm shall be released in the jMEF open source library: http://www.informationgeometry.org/MEF/ An applet visualizing the skew Burbea-Rao centroids, ranging from the right-sided to the left-sided Bregman centroids, is available at: http://www.informationgeometry.org/BurbeaRao/

ACKNOWLEDGMENTS

We gratefully acknowledge financial support from the French agency DIGITEO (GAS 2008-16D), the French National Research Agency (ANR GAIA 07-BLAN-0328-01), and Sony Computer Science Laboratories, Inc. We are very grateful to the reviewers for their thorough and thoughtful comments and suggestions. In particular, we are thankful to the anonymous referee who pointed out a rigorous proof of Lemma 1.

APPENDIX: PROOF OF UNIQUENESS OF THE BURBEA-RAO CENTROIDS

Consider, without loss of generality, the Burbea-Rao centroid (also called the Jensen centroid) defined as the minimizer

$$ c = \arg\min_x \sum_{i=1}^n \frac{1}{n} J_F(p_i, x), $$

where $J_F(p, q) = \frac{F(p)+F(q)}{2} - F\left(\frac{p+q}{2}\right) \ge 0$. For the sake of simplicity, let us consider univariate generators. The Jensen divergence may not be convex in its argument, since

$$ J'(x, p) = \frac{F'(x)}{2} - \frac{1}{2}F'\left(\frac{x+p}{2}\right), \qquad J''(x, p) = \frac{1}{2}F''(x) - \frac{1}{4}F''\left(\frac{x+p}{2}\right) $$

can alternatively be positive or negative (see [46]). In general, minimizing an average of non-convex divergences may a priori yield many local minima [47]. It is remarkable that the centroid induced by a Jensen divergence is nevertheless unique, although the problem may not be convex.

The proof of uniqueness of the Burbea-Rao centroid and of the convergence of the CCCP approximation algorithm relies on the "interness" property of quasi-arithmetic means (called compensativeness in [48], a fact following from the monotonicity of the generator function $\nabla F$):

$$ \min_{i=1}^n p_i \le M_{\nabla F}(p_1, ..., p_n) \le \max_{i=1}^n p_i, \quad \text{with} \quad M_{\nabla F}(p_1, ..., p_n) = (\nabla F)^{-1}\left(\sum_{i=1}^n \frac{1}{n}\nabla F(p_i)\right), $$

for a strictly convex function $F$ (and hence a strictly monotone increasing gradient $\nabla F$). The interness property of quasi-arithmetic means ensures that the mean value is contained within the extremal values.

For the sake of simplicity, let us first consider a univariate convex generator $F$ with the $p_i$'s in increasing order: $p_1 \le ... \le p_n$. Initially, $c_0 \in [p_1^{(0)} = p_1,\ p_n^{(0)} = p_n]$. Since $c_1 = M_{\nabla F}\left(p_1^{(1)} = \frac{p_1+c_0}{2}, ..., p_n^{(1)} = \frac{p_n+c_0}{2}\right)$ is a quasi-arithmetic mean, we necessarily have $c_1 \in [p_1^{(1)}, p_n^{(1)}]$ with $p_n^{(1)} - p_1^{(1)} = \frac{c_0 + p_n - c_0 - p_1}{2} = \frac{p_n - p_1}{2}$. Thus the CCCP iterations induce a sequence of iterated quasi-arithmetic means $c_t$ such that

$$ c_t = M_{\nabla F}\left(p_1^{(t)} = \frac{p_1^{(t-1)}+c_{t-1}}{2}, ..., p_n^{(t)} = \frac{p_n^{(t-1)}+c_{t-1}}{2}\right), \quad c_t \in [p_1^{(t)}, p_n^{(t)}], $$

with

$$ p_n^{(t)} - p_1^{(t)} = \frac{c_{t-1} + p_n^{(t-1)} - c_{t-1} - p_1^{(t-1)}}{2} = \frac{p_n^{(t-1)} - p_1^{(t-1)}}{2} = \frac{1}{2^t}(p_n - p_1). $$

It follows that the sequence of centroid approximations $c_t$ converges in the limit to a unique centroid $c^*$. That is, the Burbea-Rao centroid exists and is unique for any strictly convex generator $F$. The centroid can be approximated within relative precision $\frac{1}{2^t}$ after $t$ iterations (linear convergence of the CCCP).

Since the CCCP iterations yield both an approximation $c_t$ and a range $[p_1^{(t)}, p_n^{(t)}]$ containing $c_t$ at the $t$-th iteration, in practice we stop iterating whenever $\frac{p_n^{(t)} - p_1^{(t)}}{p_n^{(t)}}$ falls below a prescribed threshold (for example, taking $\nabla F = \log x$, we find the centroid to machine precision $10^{-12}$ in about 50 iterations). The CCCP algorithm requires $O(nb)$ time to approximate the centroid to $b$ bits of precision. Note that $\lim_{t\to\infty} p_1^{(t)} = \lim_{t\to\infty}\frac{1}{2}\sum_{i=0}^t \frac{1}{2^i}p_1 = p_1$ (and similarly $\lim_{t\to\infty} p_n^{(t)} = p_n$). It follows that $c^* \in [p_1, p_n]$, as expected (the whole initial range is possible; the centroid location depends on the chosen generator $F$). The proof extends naturally to separable multivariate generators by carrying out the analysis on each dimension independently.
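The appendix argument can be observed empirically: a univariate CCCP run (our sketch, reusing the generator $F(x) = x\log x$ from the earlier centroid example) shows the error to the fixed point shrinking roughly by half at each iteration:

```python
import numpy as np

grad_F     = lambda x: np.log(x) + 1.0   # generator F(x) = x log x
grad_F_inv = lambda y: np.exp(y - 1.0)

pts = np.array([0.5, 1.0, 2.0, 8.0])
c = pts.mean()                           # c_0: start inside [p_1, p_n]
history = [c]
for t in range(60):
    c = grad_F_inv(np.mean(grad_F(0.5 * (pts + c))))   # CCCP step, alpha = 1/2
    history.append(c)

c_star = history[-1]                     # converged value, taken as reference
for t in [1, 2, 4, 8, 16]:
    print(t, abs(history[t] - c_star))   # error roughly halves per iteration
```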
REFERENCES

[1] A. N. Kolmogorov, "Sur la notion de la moyenne," Accad. Naz. Lincei Mem. Cl. Sci. Fis. Mat. Natur. Sez., vol. 12, pp. 388–391, 1930.
[2] M. Nagumo, "Über eine Klasse der Mittelwerte," Japanese Journal of Mathematics, vol. 7, pp. 71–79, 1930 (see Collected Papers, Springer, 1993).
[3] J. D. Aczél, "On mean values," Bulletin of the American Mathematical Society, vol. 54, no. 4, pp. 392–400, 1948.
[4] A. Ben-Tal, A. Charnes, and M. Teboulle, "Entropic means," Journal of Mathematical Analysis and Applications, vol. 139, no. 2, pp. 537–551, 1989.
[5] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B, vol. 28, pp. 131–142, 1966.
[6] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observation," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 229–318, 1967.
[7] F. Nielsen and R. Nock, "Sided and symmetrized Bregman centroids," IEEE Transactions on Information Theory, vol. 55, no. 6, pp. 2048–2059, June 2009.
[8] Y. Censor and S. A. Zenios, Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1997.
[9] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, pp. 1–305, January 2008.
[10] S.-I. Amari, "α-divergence is unique, belonging to both f-divergence and Bregman divergence classes," IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 4925–4931, 2009.
[11] S.-I. Amari, "Integration of stochastic models by minimizing α-divergence," Neural Computation, vol. 19, no. 10, pp. 2780–2796, 2007.
[12] S. Amari and H. Nagaoka, Methods of Information Geometry. American Mathematical Society / Oxford University Press, 2000.
[13] F. Nielsen and R. Nock, "The dual Voronoi diagrams with respect to representational Bregman divergences," in International Symposium on Voronoi Diagrams (ISVD), DTU Lyngby, Denmark: IEEE, June 2009.
[14] J.-D. Boissonnat, F. Nielsen, and R. Nock, "Bregman Voronoi diagrams," Discrete & Computational Geometry, 2010 (accepted; extends ACM-SIAM SODA 2007).
[15] I. Csiszár, "Generalized projections for non-negative functions," Acta Mathematica Hungarica, vol. 68, no. 1–2, pp. 161–185, 1995.
[16] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005.
[17] M. Detyniecki, "Mathematical aggregation operators and their application to video querying," Ph.D. dissertation, 2000.
[18] J. Burbea and C. R. Rao, "On the convexity of some divergence measures based on entropy functions," IEEE Transactions on Information Theory, vol. 28, no. 3, pp. 489–495, 1982.
[19] ——, "On the convexity of higher order Jensen differences based on entropy functions," IEEE Transactions on Information Theory, vol. 28, no. 6, pp. 961–, 1982.
[20] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Transactions on Information Theory, vol. 37, pp. 145–151, 1991.
[21] M. Basseville and J.-F. Cardoso, "On entropies, divergences and mean values," in Proceedings of the IEEE Workshop on Information Theory, 1995.
[22] L. M. Bregman, "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming," USSR Computational Mathematics and Mathematical Physics, vol. 7, pp. 200–217, 1967.
[23] F. Nielsen and V. Garcia, "Statistical exponential families: A digest with flash cards," 2009, arXiv.org:0911.4863.
[24] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press Professional, Inc., 1990.
[25] L. Rigazio, B. Tsakam, and J.-C. Junqua, "An optimal Bhattacharyya centroid algorithm for Gaussian clustering with applications in automatic speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, 2000, pp. 1599–1602. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=861998
[26] V. Garcia, F. Nielsen, and R. Nock, "Hierarchical Gaussian mixture model," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2010.
[27] P. Chen, Y. Chen, and M. Rao, "Metrics defined by Bregman divergences: Part I," Commun. Math. Sci., vol. 6, pp. 915–926, 2008.
[28] ——, "Metrics defined by Bregman divergences: Part II," Commun. Math. Sci., vol. 6, pp. 927–948, 2008.
[29] H. Jeffreys, "An invariant form for the prior probability in estimation problems," Proceedings of the Royal Society of London, vol. 186, no. 1007, pp. 453–461, March 1946.
[30] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412, October 2006.
[31] J. Zhang, "Divergence function, duality, and convex analysis," Neural Computation, vol. 16, no. 1, pp. 159–195, 2004.
[32] A. Yuille and A. Rangarajan, "The concave-convex procedure," Neural Computation, vol. 15, no. 4, pp. 915–936, 2003.
[33] B. Sriperumbudur and G. Lanckriet, "On the convergence of the concave-convex procedure," in Neural Information Processing Systems, 2009.
[34] Y. He, A. B. Hamza, and H. Krim, "An information divergence measure for ISAR image registration," in Automatic Target Recognition XI (SPIE), vol. 4379, 2001, pp. 199–208.
[35] A. O. Hero, B. Ma, O. Michel, and J. D. Gorman, "Alpha-divergence for classification, indexing and retrieval," Comm. and Sig. Proc. Lab. (CSPL), Dept. EECS, University of Michigan, Ann Arbor, Tech. Rep. 328, July 2001; presented at the Joint Statistical Meeting.
[36] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bulletin of the Calcutta Mathematical Society, vol. 35, pp. 99–110, 1943.
[37] K. Matusita, "Decision rules, based on the distance, for problems of fit, two samples, and estimation," Annals of Mathematical Statistics, vol. 26, pp. 631–640, 1955.
[38] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Transactions on Communication Technology, vol. 15, no. 1, pp. 52–60, 1967.
[39] F. Aherne, N. Thacker, and P. Rockett, "The Bhattacharyya metric as an absolute similarity measure for frequency coded data," Kybernetika, vol. 34, no. 4, pp. 363–368, 1998.
[40] E. D. Hellinger, "Die Orthogonalinvarianten quadratischer Formen von unendlich vielen Variablen," thesis, University of Göttingen, 1907.
[41] S. Kakutani, "On equivalence of infinite product measures," Annals of Mathematics, vol. 49, pp. 214–224, 1948.
[42] L. Rigazio, B. Tsakam, and J. Junqua, "Optimal Bhattacharyya centroid algorithm for Gaussian clustering with applications in automatic speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, 2000, pp. 1599–1602.
[43] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook. Technical University of Denmark, October 2008. [Online]. Available: http://www2.imm.dtu.dk/pubdb/p.php?3274
[44] A. Cichocki and S.-I. Amari, "Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities," Entropy, 2010, review submitted.
[45] P. Vos, "Geometry of f-divergence," Annals of the Institute of Statistical Mathematics, vol. 43, no. 3, pp. 515–537, 1991.
[46] F. Nielsen and R. Nock, "Skew Jensen-Bregman Voronoi diagrams," Transactions on Computational Science, no. 14, pp. 102–128, 2011.
[47] P. Auer, M. Herbster, and M. Warmuth, "Exponentially many local minima for single neurons," Advances in Neural Information Processing Systems, vol. 8, pp. 316–317, 1995.
[48] J.-L. Marichal, "Aggregation operators for multicriteria decision aid," Institute of Mathematics, University of Liège, Belgium, 1998.

Frank Nielsen received the BSc (1992) and MSc (1994) degrees from École Normale Supérieure (ENS Lyon, France). He prepared his PhD on adaptive computational geometry at INRIA Sophia-Antipolis (France) and defended it in 1996. As a civil servant of the University of Nice (France), he gave lectures at the engineering schools ESSI and ISIA (École des Mines). In 1997, he served in the army as a scientific member of the computer science laboratory of École Polytechnique. In 1998, he joined Sony Computer Science Laboratories, Inc., Tokyo (Japan), where he is a senior researcher. He became a professor in the computer science department of École Polytechnique in 2008. His current research interests include geometry, vision, graphics, learning, and optimization. He is a senior ACM and senior IEEE member.

Sylvain Boltz received the M.S. degree and the Ph.D. degree in computer vision from the University of Nice-Sophia Antipolis, France, in 2004 and 2008, respectively. Since then, he has been a postdoctoral fellow at the VisionLab, University of California, Los Angeles, and a LIX-Qualcomm postdoctoral fellow at École Polytechnique, France.
His research spans computer vision and image and video processing, with a particular interest in applications of information theory and compressed sensing to these areas.
