Barycenter Estimation of Positive Semi-Definite Matrices with Bures-Wasserstein Distance


Authors: Jingyi Zheng, Huajun Huang, Yuyan Yi, Yuexin Li, and Shu-Chin Lin

Abstract—Brain-computer interface (BCI) builds a bridge between the human brain and external devices by recording brain signals and translating them into commands for devices to perform the user's imagined action. The core of the BCI system is the classifier that labels the input signals as the user's imagined action. Classifiers that directly classify covariance matrices using Riemannian geometry are widely used not only in the BCI domain but also in a variety of fields including neuroscience, remote sensing, and biomedical imaging. However, the existing Affine-Invariant Riemannian-based methods treat covariance matrices as positive definite while they are in fact positive semi-definite, especially for high-dimensional data. Besides, the Affine-Invariant Riemannian-based barycenter estimation algorithms become time-consuming, not robust, and prone to convergence issues when the dimension and number of covariance matrices grow large. To address these challenges, in this paper we establish the mathematical foundation for the Bures-Wasserstein distance and propose new algorithms to estimate the barycenter of positive semi-definite matrices efficiently and robustly. Both theoretical and computational aspects of the Bures-Wasserstein distance and the barycenter estimation algorithms are discussed. With extensive simulations, we comprehensively investigate the accuracy, efficiency, and robustness of the barycenter estimation algorithms coupled with the Bures-Wasserstein distance. The results show that Bures-Wasserstein-based barycenter estimation algorithms are more efficient and robust.

Index Terms—Brain-computer interface (BCI), Riemannian manifold, Affine-Invariant distance, Bures-Wasserstein distance, Fréchet mean.

Affiliations: J. Zheng, H. Huang, Y. Yi, and Y. Li are with the Department of Mathematics and Statistics, Auburn University, Auburn, AL 36849, USA. S. Lin is with the Center for Neuropsychiatric Research, National Health Research Institutes, Taiwan. Correspondence e-mail: jingyi.zheng@auburn.edu.

I. INTRODUCTION

Brain-computer interface (BCI) builds a bridge between the human brain and external devices by translating brain signals into instructions for the external devices to perform the user's imagined actions. The core of a BCI system is the classifier that maps the brain signals to one of the commands for the external devices. The brain signals can be captured by multi-channel electroencephalography (EEG) [1], [2], functional magnetic resonance imaging (fMRI) [3], [4], and other neuroimaging techniques, so BCI data are mostly spatial and temporal. To capture the spatial and temporal patterns of BCI data, covariance matrices are widely used, and BCI classifiers are trained by directly classifying covariance matrices using Riemannian geometry [5]–[7]. Beyond BCI, covariance matrices are also widely used to capture the structure of complex data in various fields such as computer vision [8]–[10], natural language processing [11], [12], domain adaptation [13], [14], remote sensing [15], [16], biomedical imaging [3], [4], and many others [17]–[19].
Previous works have shown that covariance matrices are often treated as positive definite matrices and analyzed on the manifold of positive definite matrices, denoted $\mathbb{P}_n$, using the Affine-Invariant Riemannian metric, a popular choice due to its many attractive mathematical properties. Considering $\mathbb{P}_n$ as an open submanifold of $\mathbb{M}_n$, at any point $A \in \mathbb{P}_n$ (i.e., any positive definite matrix $A$), the Affine-Invariant Riemannian metric on the tangent space $T_A \mathbb{P}_n$ is defined as

$\langle X, Y \rangle_A = \operatorname{tr}(A^{-1} X A^{-1} Y).$  (I.1)

The corresponding distance function for two positive definite matrices $A$ and $B$, named the Affine-Invariant Riemannian (AI) distance, is

$d_{AI}(A, B) = \Big( \sum_{i=1}^{n} \log^2 \lambda_i(A^{-1}B) \Big)^{1/2},$  (I.2)

where $\lambda_i(A^{-1}B)$, $i = 1, \dots, n$, are the eigenvalues of $A^{-1}B$. With respect to the metric (I.1), any two points on $\mathbb{P}_n$ (i.e., any two positive definite matrices) are joined by the geodesic

$\gamma_{AI}(t) = A^{1/2} (A^{-1/2} B A^{-1/2})^{t} A^{1/2}, \quad t \in [0, 1].$  (I.3)

The geometric mean of the two matrices, denoted $A \# B$, is the midpoint of this geodesic:

$\gamma_{AI}(1/2) = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}.$  (I.4)

It is easily shown that $d_{AI}(A, \gamma_{AI}(t)) = t\, d_{AI}(A, B)$; in particular, $d_{AI}(A, A \# B) = \tfrac{1}{2} d_{AI}(A, B)$.
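To make the AI geometry concrete, the following Python sketch computes the AI distance (I.2) and the geometric mean (I.4) with NumPy. It is a minimal illustration under the assumption that the inputs are strictly positive definite, as (I.2) requires; it is not the authors' implementation.

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)  # guard against tiny negative round-off
    return (V * np.sqrt(w)) @ V.T

def d_ai(A, B):
    """Affine-Invariant distance (I.2); A and B must be positive definite."""
    lam = np.linalg.eigvals(np.linalg.solve(A, B)).real  # eigenvalues of A^{-1}B
    return np.sqrt(np.sum(np.log(lam) ** 2))

def geometric_mean(A, B):
    """Geodesic midpoint A # B from (I.4)."""
    As = psd_sqrt(A)
    As_inv = np.linalg.inv(As)
    return As @ psd_sqrt(As_inv @ B @ As_inv) @ As
```

Note that both functions call a matrix inverse or a linear solve, which is exactly the computational burden (and instability for near-singular matrices) that motivates the BW alternative below.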
When analyzing a set of matrices, one of the most important summaries is their barycenter. Like the arithmetic mean of a set of numbers, the barycenter measures the central tendency of a set of matrices on the manifold. It also enters the computation of the standard deviation and variance of a set of matrices, and, most importantly, it is the foundation for developing classification models for matrices. Using the AI distance, statistical and computational methods have been developed [5]–[7], [20]–[24] to analyze covariance matrices, including barycenter estimation and classification. However, computing the AI distance involves matrix inversion and eigenvalue decomposition, as shown in (I.2), which are time-consuming and numerically unstable, especially for large matrices. Furthermore, instead of being strictly positive definite, a covariance matrix is in fact positive semi-definite (PSD), especially for high-dimensional data. As the dimension of the covariance matrix increases, the likelihood of zero eigenvalues grows substantially, which invalidates the use of the AI distance. Therefore, we propose to analyze covariance matrices directly on the manifold of PSD matrices, coupled with the Bures-Wasserstein distance.

In this paper, we first establish the mathematical foundation for the Bures-Wasserstein (BW) distance by proving its properties and studying the retraction maps of the manifold of PSD matrices. To estimate the central tendency of a set of matrices, we then propose three algorithms to estimate the Fréchet mean (i.e., barycenter) of matrices under the BW distance. Extensive simulations are conducted to comprehensively investigate the efficiency and robustness of the BW distance, as well as the accuracy, efficiency, and robustness of the proposed barycenter estimation algorithms.

The remainder of the paper is organized as follows. In Section II(A), we establish the mathematical properties of the BW distance and the retraction maps of the manifold. In Section II(B), we propose three algorithms for estimating the barycenter of matrices on the manifold. In Section III, we present the simulation results, discuss the efficiency and robustness of the BW distance and the proposed barycenter estimation algorithms, and give a full comparison with the widely used AI distance. In Section IV, we summarize our contributions and conclude the paper.

II. METHODOLOGY

By viewing $\mathbb{P}_n$ as the quotient manifold $\mathbb{P}_n = \mathrm{GL}(n)/\mathrm{U}(n)$, where $\mathrm{GL}(n)$ is the set of nonsingular matrices, [25] proposed the Bures-Wasserstein (BW) distance

$d_{BW}(A, B) = \big[\operatorname{tr}(A + B) - 2\operatorname{tr}(A^{1/2} B A^{1/2})^{1/2}\big]^{1/2} = \big[\operatorname{tr}(A + B) - 2\operatorname{tr}(AB)^{1/2}\big]^{1/2},$

where $A, B \in \mathbb{P}_n$. Notice that this BW distance $d_{BW}(A, B)$ coincides with the Wasserstein distance between two Gaussian distributions with the same mean and with covariance matrices $A$ and $B$, respectively. The geodesic from $A$ to $B$ is $\gamma(t): [0,1] \to \mathbb{P}_n$,

$\gamma(t) = (1-t)^2 A + t^2 B + t(1-t)\big[(AB)^{1/2} + (BA)^{1/2}\big].$

Indeed, the BW distance and the geodesic extend to the set $\overline{\mathbb{P}}_n$ of $n \times n$ positive semi-definite (PSD) matrices by viewing $\overline{\mathbb{P}}_n = \mathbb{M}(n)/\mathrm{U}(n)$, since $\overline{\mathrm{GL}(n)} = \mathbb{M}(n)$. Therefore, for any two PSD matrices $A, B \in \overline{\mathbb{P}}_n$, the BW distance is

$d_{BW}(A, B) = \big[\operatorname{tr}(A + B) - 2\operatorname{tr}(AB)^{1/2}\big]^{1/2},$  (II.1)

and the geodesic from $A$ to $B$, denoted $A \diamond_t B$, is

$A \diamond_t B := \gamma(t) = (1-t)^2 A + t^2 B + t(1-t)\big[(AB)^{1/2} + (BA)^{1/2}\big].$  (II.2)

In analogy with $A \# B$, the midpoint $A \diamond_{1/2} B$ is called the Wasserstein mean and denoted $A \diamond B$:

$A \diamond_{1/2} B = \tfrac{1}{4}\big[A + B + (AB)^{1/2} + (BA)^{1/2}\big].$  (II.3)
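As a companion sketch (reusing psd_sqrt from the previous snippet), the BW distance (II.1) and the geodesic (II.2) need only PSD square roots, with no matrix inverse, which is the computational advantage over (I.2). This is our illustration of the formulas, not code from the paper.

```python
import numpy as np

def d_bw(A, B):
    """Bures-Wasserstein distance (II.1); A and B may be singular PSD."""
    As = psd_sqrt(A)
    cross = np.trace(psd_sqrt(As @ B @ As))  # tr(AB)^{1/2} = tr(A^{1/2} B A^{1/2})^{1/2}
    return np.sqrt(max(np.trace(A) + np.trace(B) - 2.0 * cross, 0.0))

def cross_sqrt(A, B):
    """(AB)^{1/2} via the similarity (AB)^{1/2} = A^{1/2}(A^{1/2} B A^{1/2})^{1/2} A^{-1/2}.
    The pseudo-inverse mirrors the paper's continuity extension to singular A
    (exact for positive definite A)."""
    As = psd_sqrt(A)
    return As @ psd_sqrt(As @ B @ As) @ np.linalg.pinv(As)

def bw_geodesic(A, B, t):
    """Geodesic A ◇_t B from (II.2); t = 1/2 gives the Wasserstein mean (II.3)."""
    C = cross_sqrt(A, B)  # (AB)^{1/2}; for real symmetric A, B, (BA)^{1/2} = C.T
    return (1 - t) ** 2 * A + t ** 2 * B + t * (1 - t) * (C + C.T)
```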
A. Establish the mathematical foundation for BW distance

Few studies have discussed the mathematical properties of the BW distance [25]–[33], especially on $\overline{\mathbb{P}}_n$. In the following, we study some mathematical properties of the BW distance and the retraction maps of the manifold $\overline{\mathbb{P}}_n$. These properties provide useful insights and intuition about the BW metric and the BW mean. Given $X \in \mathbb{M}_n$, let $|X| = (X^* X)^{1/2}$ denote the PSD part in the polar decomposition of $X$.

Theorem II.1. For $A, B \in \overline{\mathbb{P}}_n$ and $t \in \mathbb{R}$,

$A \diamond_t B = B \diamond_{1-t} A = |(1-t)A^{1/2} + tU^* B^{1/2}|^2,$  (II.4)

in which $U$ is a certain unitary matrix occurring in a polar decomposition of $B^{1/2} A^{1/2}$:

$B^{1/2} A^{1/2} = U |B^{1/2} A^{1/2}| = U (A^{1/2} B A^{1/2})^{1/2},$  (II.5)

or equivalently, $A^{1/2} B^{1/2} = U^* |A^{1/2} B^{1/2}|$. Moreover, when $(1-t)A + t|B^{1/2}A^{1/2}|$ is PSD (e.g., when $t \in [0,1]$),

$|(A \diamond_t B)^{1/2} A^{1/2}| = (1-t)A + t|B^{1/2} A^{1/2}|.$  (II.6)

Proof. If $A^{1/2}$ is nonsingular, i.e., $A \in \mathbb{P}_n$, then for any unitary matrix $U$ in the polar decomposition (II.5), we have

$A^{1/2} U^* B^{1/2} = A^{1/2} (U^* B^{1/2} A^{1/2}) A^{-1/2} = A^{1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2} = (A^{1/2} \cdot A^{1/2} B A^{1/2} \cdot A^{-1/2})^{1/2} = (AB)^{1/2}.$  (II.7a)

If $A^{1/2}$ is singular, we may find a sequence of nonsingular positive definite matrices $\{A_i\}_{i=1}^{\infty} \subseteq \mathbb{P}_n$ such that $\lim_{i \to \infty} A_i = A$. Let $B^{1/2} A_i^{1/2} = U_i |B^{1/2} A_i^{1/2}|$ be the polar decomposition for $i = 1, 2, 3, \dots$ Since the unitary group of degree $n$ is compact, the sequence $\{U_i\}_{i=1}^{\infty}$ has a convergent subsequence $\{U_{i_t}\}_{t=1}^{\infty}$. Let $U := \lim_{t \to \infty} U_{i_t}$. Then

$A^{1/2} U^* B^{1/2} = \lim_{t \to \infty} A_{i_t}^{1/2} U_{i_t}^* B^{1/2} = \lim_{t \to \infty} (A_{i_t} B)^{1/2} = (AB)^{1/2}.$  (II.7b)

In both cases,

$B^{1/2} U A^{1/2} = (A^{1/2} U^* B^{1/2})^* = (BA)^{1/2}.$  (II.8)

So the geodesic from $A$ to $B$ can be expressed as

$A \diamond_t B = B \diamond_{1-t} A = (1-t)^2 A + t^2 B + t(1-t)\big[(AB)^{1/2} + (BA)^{1/2}\big] = |(1-t)A^{1/2} + tU^* B^{1/2}|^2.$

Moreover,

$|(A \diamond_t B)^{1/2} A^{1/2}| = \big|\,|(1-t)A^{1/2} + tU^* B^{1/2}|\, A^{1/2}\big| = \big|[(1-t)A^{1/2} + tU^* B^{1/2}] A^{1/2}\big| = \big|(1-t)A + t|B^{1/2} A^{1/2}|\big|.$  (II.9)

When $(1-t)A + t|B^{1/2} A^{1/2}|$ is PSD, we have $|(A \diamond_t B)^{1/2} A^{1/2}| = (1-t)A + t|B^{1/2} A^{1/2}|$. This proves the theorem.

We remark that (II.9) still holds even when the Hermitian matrix $(1-t)A + t|B^{1/2} A^{1/2}|$ is not PSD. When both $A$ and $B$ are singular, some unitary matrix $U$ that satisfies (II.5) may not satisfy (II.4).

Theorem II.2. Let $A, B \in \overline{\mathbb{P}}_n$.
1) If $r, t \in \mathbb{R}$ and $(1-t)A + t|B^{1/2} A^{1/2}| \in \overline{\mathbb{P}}_n$, then

$A \diamond_r (A \diamond_t B) = A \diamond_{rt} B.$  (II.10)

2) If $r, s, t \in \mathbb{R}$ satisfy $(1-x)A + x|B^{1/2} A^{1/2}| \in \overline{\mathbb{P}}_n$ for $x \in \{s, t\}$, then

$(A \diamond_s B) \diamond_r (A \diamond_t B) = A \diamond_{(1-r)s + rt} B.$  (II.11)

Proof. (II.10) is a special case of (II.11) with $s = 0$. We will show that (II.10) also implies (II.11) after proving (II.10). Suppose $r, t \in \mathbb{R}$ are such that $(1-t)A + t|B^{1/2} A^{1/2}| \in \overline{\mathbb{P}}_n$. Assume first that $A \in \mathbb{P}_n$ (the case of singular $A$ follows by continuous extension). By (II.9) and (II.6),

$\big|[A \diamond_r (A \diamond_t B)]^{1/2} A^{1/2}\big| = \big|(1-r)A + r\,|(A \diamond_t B)^{1/2} A^{1/2}|\big| = \big|(1-r)A + r[(1-t)A + t|B^{1/2} A^{1/2}|]\big| = \big|(1-rt)A + rt|B^{1/2} A^{1/2}|\big| = \big|(A \diamond_{rt} B)^{1/2} A^{1/2}\big|.$  (II.12)

Hence there is a unitary matrix $V$ such that $[A \diamond_r (A \diamond_t B)]^{1/2} A^{1/2} = V (A \diamond_{rt} B)^{1/2} A^{1/2}$. By assumption, $A$ is nonsingular, so that $[A \diamond_r (A \diamond_t B)]^{1/2} = V (A \diamond_{rt} B)^{1/2}$. We get (II.10).

Now suppose $r, s, t \in \mathbb{R}$ satisfy $(1-x)A + x|B^{1/2} A^{1/2}| \in \overline{\mathbb{P}}_n$ for $x \in \{s, t\}$. Again we assume $A \in \mathbb{P}_n$ first and then extend the result continuously to singular $A$. Suppose $t \le s$ without loss of generality. Then by (II.10),

$(A \diamond_s B) \diamond_r (A \diamond_t B) = (A \diamond_s B) \diamond_r [A \diamond_{t/s} (A \diamond_s B)] = (A \diamond_s B) \diamond_r [(A \diamond_s B) \diamond_{1 - t/s} A] = (A \diamond_s B) \diamond_{r(s-t)/s} A = A \diamond_{1 - r(s-t)/s} (A \diamond_s B) = A \diamond_{(1-r)s + rt} B.$

We get (II.11).
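The reparametrization rule of Theorem II.2 can be checked numerically. The sketch below (reusing d_bw and bw_geodesic from the earlier snippets) draws random positive definite matrices and verifies (II.10) up to floating-point error; it is a sanity check for the reader, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n)); A = M @ M.T + 0.1 * np.eye(n)
M = rng.standard_normal((n, n)); B = M @ M.T + 0.1 * np.eye(n)

r, t = 0.7, 0.4  # both in [0, 1], so the PSD condition of Theorem II.2 holds
lhs = bw_geodesic(A, bw_geodesic(A, B, t), r)   # A ◇_r (A ◇_t B)
rhs = bw_geodesic(A, B, r * t)                  # A ◇_{rt} B
assert np.allclose(lhs, rhs, atol=1e-6)
```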
Let $U$ be the unitary matrix in the polar decomposition of $B^{1/2} A^{1/2}$ as in (II.5). Bhatia, Jain, and Lim showed [25, Theorem 1] that

$d_{BW}(A, B) = \| A^{1/2} - B^{1/2} U \|_F.$  (II.13)

This implies the following property of the BW distance.

Theorem II.3. For $A, B \in \overline{\mathbb{P}}_n$ and $t \in \mathbb{R}$, if $(1-t)A + t|B^{1/2} A^{1/2}|$ is PSD (e.g., when $t \in [0,1]$), then

$d_{BW}(A, A \diamond_t B) = |t|\, d_{BW}(A, B).$  (II.14)

Proof. We prove the nonsingular-$A$ case; the singular case follows by continuous extension. Let $U$ and $V$ be the unitary matrices in the polar decompositions of $B^{1/2} A^{1/2}$ and $(A \diamond_t B)^{1/2} A^{1/2}$, respectively:

$B^{1/2} A^{1/2} = U |B^{1/2} A^{1/2}|,$
$(A \diamond_t B)^{1/2} A^{1/2} = V |(A \diamond_t B)^{1/2} A^{1/2}| = V [(1-t)A + t|B^{1/2} A^{1/2}|].$

Taking Hermitian transposes of the above equalities, we get

$A^{1/2} B^{1/2} = |B^{1/2} A^{1/2}| U^*,$
$A^{1/2} (A \diamond_t B)^{1/2} = [(1-t)A + t|B^{1/2} A^{1/2}|] V^*.$

Therefore,

$(A \diamond_t B)^{1/2} V = A^{-1/2} [(1-t)A + t|B^{1/2} A^{1/2}|] = (1-t)A^{1/2} + t B^{1/2} U.$

By [25, Theorem 1],

$d_{BW}(A, A \diamond_t B) = \| A^{1/2} - (A \diamond_t B)^{1/2} V \|_F = \| t A^{1/2} - t B^{1/2} U \|_F = |t|\, d_{BW}(A, B).$

We get (II.14).

Fig. 1: Manifold and its tangent space at a point $A \in \overline{\mathbb{P}}_n$.

As a Riemannian manifold, points in $\overline{\mathbb{P}}_n$ (i.e., PSD matrices) can be projected onto the flat tangent space, as illustrated in Figure 1. The logarithm map sends points on the manifold to the tangent space, while the exponential map sends points on the tangent space back to the manifold. With the log and exp maps, matrices on the manifold can easily be projected onto the flat tangent space, where methods that work in Euclidean space become applicable. The derivation of the log and exp maps is therefore fundamental for the subsequent analysis.

Let $\mathbb{H}_n$ (resp. $\mathbb{U}(n)$) denote the set of $n \times n$ real symmetric (resp. orthogonal) matrices.

Lemma II.4. For $A, B \in \overline{\mathbb{P}}_n$, $X \in \mathbb{H}_n$ such that $\exp_A X$ is well-defined, $U \in \mathbb{U}(n)$, and $t \in \mathbb{R}$,

$d_{BW}(U^* A U, U^* B U) = d_{BW}(A, B),$  (II.15)
$(U^* A U) \diamond_t (U^* B U) = U^* (A \diamond_t B) U,$  (II.16)
$\log_{U^* A U}(U^* B U) = U^* (\log_A B) U,$  (II.17)
$\exp_{U^* A U}(U^* X U) = U^* (\exp_A X) U.$  (II.18)

Proof. (II.15) and (II.16) follow directly from (II.1) and (II.2). Taking $\frac{d}{dt}\big|_{t=0}$ on both sides of (II.16) gives (II.17), and (II.18) then follows.

By the spectral decomposition, every $A \in \overline{\mathbb{P}}_n$ can be written as $A = U \Lambda U^*$ for a nonnegative diagonal matrix $\Lambda$ and a unitary matrix $U \in \mathbb{U}(n)$. Lemma II.4 implies that we can transform the BW metric around $A$ to that around the diagonal matrix $\Lambda$, which simplifies the computations.

The log map on $\overline{\mathbb{P}}_n$ under the BW metric has been described in [25], [32], and [29]. We add an approximation of $\log_A(A + tX)$ when $tX$ is near $0$, as follows.

Theorem II.5. For any $A, B \in \overline{\mathbb{P}}_n$, $X \in \mathbb{H}_n$, and $t \in \mathbb{R}$ sufficiently close to $0$, we have

$\log_A(B) = (AB)^{1/2} + (BA)^{1/2} - 2A,$  (II.19)
$\log_A(A + tX) = tX + O(t^2).$  (II.20)

Proof. By (II.2), the tangent vector of the geodesic $\gamma(t)$ at $A$ is

$\log_A(B) = \gamma'(0) = (AB)^{1/2} + (BA)^{1/2} - 2A.$

For $X \in \mathbb{H}_n$ and $t \in \mathbb{R}$ sufficiently close to $0$, we have $A + tX \in \overline{\mathbb{P}}_n$, so that

$\log_A(A + tX) = [A(A + tX)]^{1/2} + [(A + tX)A]^{1/2} - 2A.$

Moreover,

$[A(A + tX)]^{1/2} = [A^{1/2}(A^2 + tA^{1/2} X A^{1/2}) A^{-1/2}]^{1/2} = A^{1/2}(A^2 + tA^{1/2} X A^{1/2})^{1/2} A^{-1/2},$
$[(A + tX)A]^{1/2} = [A^{-1/2}(A^2 + tA^{1/2} X A^{1/2}) A^{1/2}]^{1/2} = A^{-1/2}(A^2 + tA^{1/2} X A^{1/2})^{1/2} A^{1/2}.$

Let $C := (A^2 + tA^{1/2} X A^{1/2})^{1/2}$. Then

$\log_A(A + tX) = A^{1/2} C A^{-1/2} + A^{-1/2} C A^{1/2} - 2A.$  (II.21)

When $t$ is close to $0$, $C$ can be expanded as a power series in $t$:

$C = (A^2 + tA^{1/2} X A^{1/2})^{1/2} = A + tZ + O(t^2).$  (II.22)

Squaring both sides,

$A^2 + tA^{1/2} X A^{1/2} = [A + tZ + O(t^2)]^2 = A^2 + t(AZ + ZA) + O(t^2).$  (II.23)

Therefore, $AZ + ZA = A^{1/2} X A^{1/2}$. By (II.21) and (II.22),

$\log_A(A + tX) = A^{1/2}(A + tZ)A^{-1/2} + A^{-1/2}(A + tZ)A^{1/2} - 2A + O(t^2) = tA^{-1/2}(AZ + ZA)A^{-1/2} + O(t^2) = tX + O(t^2).$
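The log map (II.19) follows immediately from cross_sqrt in the earlier sketch, and the first-order behavior (II.20) can be observed numerically. As before, this is a hedged illustration, with the matrices kept strictly positive definite.

```python
import numpy as np

def bw_log(A, B):
    """Log map (II.19): log_A(B) = (AB)^{1/2} + (BA)^{1/2} - 2A."""
    C = cross_sqrt(A, B)
    return C + C.T - 2.0 * A

# First-order behavior (II.20): log_A(A + tX) = tX + O(t^2).
rng = np.random.default_rng(1)
n = 5
M = rng.standard_normal((n, n)); A = M @ M.T + 0.5 * np.eye(n)
X = rng.standard_normal((n, n)); X = (X + X.T) / 2.0
for t in [1e-2, 1e-3]:
    err = np.linalg.norm(bw_log(A, A + t * X) - t * X)
    print(f"t={t:.0e}  ||log_A(A+tX) - tX|| = {err:.2e}")  # shrinks like t^2
```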
The exponential map has been studied in [33] and [27]. Here we give a concise form of the exponential map and provide its exact domain. We also provide an approximation of $\exp_A(X)$ when $X$ is a Hermitian matrix near $0$. Let $A \circ B$ denote the Hadamard product of matrices $A$ and $B$ of the same size.

Lemma II.6. Let $A = \operatorname{diag}(\lambda_1, \dots, \lambda_n) \in \mathbb{P}_n$ be a positive diagonal matrix. Denote $W := \big(\tfrac{1}{\lambda_i + \lambda_j}\big)_{n \times n}$. Then for every Hermitian matrix $X \in \mathbb{H}_n$ such that $I_n + W \circ X$ is PSD,

$\exp_A(X) = A + X + (W \circ X) A (W \circ X).$  (II.24)

In particular, for $t \in \mathbb{R}$ sufficiently close to $0$,

$\exp_A(tX) = A + tX + O(t^2).$  (II.25)

Proof. First, let $B = \exp_A(X)$ and

$C := (A^{1/2} B A^{1/2})^{1/2}.$  (II.26)

Then (II.19) and the assumption $A = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ imply that

$2A + X = (AB)^{1/2} + (BA)^{1/2} = A^{1/2}(A^{1/2} B A^{1/2})^{1/2} A^{-1/2} + A^{-1/2}(A^{1/2} B A^{1/2})^{1/2} A^{1/2} = A^{1/2} C A^{-1/2} + A^{-1/2} C A^{1/2} = \Big(\tfrac{\lambda_i + \lambda_j}{\lambda_i^{1/2} \lambda_j^{1/2}}\Big)_{n \times n} \circ C.$

Hence

$C = \Big(\tfrac{\lambda_i^{1/2} \lambda_j^{1/2}}{\lambda_i + \lambda_j}\Big)_{n \times n} \circ (2A + X) = A + \Big(\tfrac{\lambda_i^{1/2} \lambda_j^{1/2}}{\lambda_i + \lambda_j}\Big)_{n \times n} \circ X.$  (II.27)

By (II.26), $C$ must be PSD, which is equivalent to $A^{-1/2} C A^{-1/2} = I_n + W \circ X$ being PSD. In that case,

$B = \exp_A(X) = A^{-1/2} C^2 A^{-1/2} = A^{-1/2} \Big(A + \Big(\tfrac{\lambda_i^{1/2} \lambda_j^{1/2}}{\lambda_i + \lambda_j}\Big)_{n \times n} \circ X\Big)^2 A^{-1/2} = A + X + (W \circ X) A (W \circ X).$

This establishes (II.24). Replacing $X$ by $tX$ for small $t$ yields (II.25).

The exponential map and the approximation derived in Lemma II.6 for a positive diagonal matrix $A$ extend to the general case as follows.

Theorem II.7. Suppose $A \in \overline{\mathbb{P}}_n$ has the spectral decomposition $A = U \Lambda U^*$, where $U$ is unitary and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Denote $W = \big(\tfrac{1}{\lambda_i + \lambda_j}\big)_{n \times n}$. Then for every Hermitian matrix $X \in \mathbb{H}_n$ such that $I_n + W \circ X_U$ is PSD, where $X_U := U^* X U$, we have

$\exp_A(X) = A + X + U[(W \circ X_U) \Lambda (W \circ X_U)] U^*.$  (II.28)

In particular, when $t \in \mathbb{R}$ is sufficiently close to zero, we have

$\exp_A(tX) = A + tX + O(t^2).$  (II.29)

Proof. By Lemma II.6, under the assumption on $X$,

$\exp_A X = \exp_{U \Lambda U^*} X = U \exp_{\Lambda}(U^* X U) U^* = U[\Lambda + X_U + (W \circ X_U) \Lambda (W \circ X_U)] U^* = A + X + U[(W \circ X_U) \Lambda (W \circ X_U)] U^*.$

Replacing $X$ by $tX$ gives (II.29).
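Formula (II.28) translates directly into code. The sketch below is a minimal illustration assuming $A$ is strictly positive definite, so that $W$ is finite; it also checks that exp inverts log, i.e., $\exp_A(\log_A(B))$ recovers $B$ on the valid domain.

```python
import numpy as np

def bw_exp(A, X):
    """Exp map (II.28); assumes A positive definite so W_ij = 1/(lam_i + lam_j) is finite."""
    lam, U = np.linalg.eigh(A)
    W = 1.0 / (lam[:, None] + lam[None, :])
    XU = U.T @ X @ U
    K = W * XU                                  # Hadamard product W ∘ X_U
    return A + X + U @ ((K * lam) @ K) @ U.T    # K Λ K with Λ = diag(lam)

# Round-trip check: exp_A(log_A(B)) == B when I + W ∘ X_U is PSD.
rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n)); A = M @ M.T + 0.5 * np.eye(n)
M = rng.standard_normal((n, n)); B = M @ M.T + 0.5 * np.eye(n)
assert np.allclose(bw_exp(A, bw_log(A, B)), B, atol=1e-6)
```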
Fig. 2: Illustration of the Inductive Mean Algorithm.

B. Estimate the Barycenter of PSD matrices with BW distance

To summarize a set of numbers, the arithmetic mean measures the central tendency, while the standard deviation measures the variation or dispersion of the numbers. Similarly, we need metrics to characterize the distribution of a set of PSD matrices on $\overline{\mathbb{P}}_n$. With the BW distance, we estimate the central tendency of a set of PSD matrices $A_1, \dots, A_m \in \overline{\mathbb{P}}_n$ by the Fréchet mean, defined as

$\mathcal{A}(A_1, \dots, A_m) = \arg\min_{X \in \overline{\mathbb{P}}_n} \sum_{i=1}^{m} d_{BW}^2(A_i, X).$  (II.30)

The Fréchet mean is also called the barycenter in the literature (e.g., [25], [34]). With the Fréchet mean, we can further quantify the dispersion of the matrices around their Fréchet mean via the Fréchet variance

$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} d_{BW}^2(A_i, \mathcal{A}).$  (II.31)

To estimate the Fréchet mean of a set of matrices on $\overline{\mathbb{P}}_n$, we propose three methods: the Inductive Mean Algorithm, the Projection Mean Algorithm, and the Cheap Mean Algorithm. The error tolerance in the stopping criterion of each algorithm is denoted $\varepsilon$. Details are discussed in the following.

Inductive Mean Algorithm. Given $m$ PSD matrices $A_1, \dots, A_m \in \overline{\mathbb{P}}_n$, the Inductive Mean Algorithm estimates their Fréchet mean using the geodesics that connect points on the manifold. The algorithm is summarized below, and the process is illustrated with four matrices in Figure 2; a combined code sketch of this and the next algorithm follows their descriptions.

1) Extend the sequence $\{A_k\}_{k \in \mathbb{N}}$ periodically, so that $A_k = A_{k+m} = A_{k+2m} = \cdots$ for all $k \in \mathbb{N}$.
2) Let $S^{(1)} := A_1$. For $k = 2, 3, \dots$, let

$S^{(k)} := S^{(k-1)} \diamond_{1/k} A_k = \frac{(k-1)^2}{k^2} S^{(k-1)} + \frac{1}{k^2} A_k + \frac{k-1}{k^2} \big[(S^{(k-1)} A_k)^{1/2} + (A_k S^{(k-1)})^{1/2}\big].$

3) The limit of $\{S^{(k)}\}_{k \in \mathbb{N}}$ is the Fréchet mean of $\{A_1, \dots, A_m\}$ under $d_{BW}$:

$\lim_{k \to \infty} S^{(k)} = \mathcal{A}(A_1, \dots, A_m).$  (II.32)

In practice, the iteration stops when $d_{BW}(S^{(k)}, S^{(k-1)}) \le \varepsilon$, and $S^{(k)}$ is taken as the estimated Fréchet mean.

By the design of this algorithm, the points $S^{(k)}$ lie in the compact region bounded by the geodesics connecting $A_1, \dots, A_m$, and the sequence $\{S^{(k)}\}_{k \in \mathbb{N}}$ has a unique convergence point. Therefore, the Inductive Mean Algorithm is valid for estimating the barycenter of PSD matrices. Its computation is simple and is not sensitive to the size and number of matrices, as long as the maximal distance between any two of $A_1, \dots, A_m$ is bounded. Its convergence rate is uniform but slow compared with the following two algorithms.

Projection Mean Algorithm. Unlike the Inductive Mean Algorithm, which works along geodesics on the manifold, the Projection Mean Algorithm on $(\overline{\mathbb{P}}_n, d_{BW})$ leverages the log and exp maps that project matrices between the manifold and the tangent space. The iteration proceeds as follows:

1) Let $S^{(0)} := \frac{1}{m} \sum_{j=1}^{m} A_j$.
2) Given $S^{(\ell)}$ for some $\ell \in \mathbb{N}$, update $S^{(\ell+1)}$ as follows:
(a) Project $A_1, \dots, A_m$ onto the tangent space at $S^{(\ell)}$ by (II.19):

$X_j^{(\ell)} := (S^{(\ell)} A_j)^{1/2} + (A_j S^{(\ell)})^{1/2} - 2S^{(\ell)}.$

(b) Form the arithmetic mean of the projections:

$X^{(\ell)} := \frac{1}{m} \sum_{j=1}^{m} X_j^{(\ell)} = \frac{1}{m} \sum_{j=1}^{m} \big[(S^{(\ell)} A_j)^{1/2} + (A_j S^{(\ell)})^{1/2}\big] - 2S^{(\ell)}.$

(c) Update $S^{(\ell+1)}$ as the exponential of $X^{(\ell)}$ at $S^{(\ell)}$ by (II.28):

$S^{(\ell+1)} := S^{(\ell)} + X^{(\ell)} + U^{(\ell)} \big[(W^{(\ell)} \circ X_U^{(\ell)}) \Lambda^{(\ell)} (W^{(\ell)} \circ X_U^{(\ell)})\big] U^{(\ell)*},$

where $S^{(\ell)} = U^{(\ell)} \Lambda^{(\ell)} U^{(\ell)*}$ with $U^{(\ell)} \in \mathbb{U}(n)$ and $\Lambda^{(\ell)} = \operatorname{diag}(\lambda_1^{(\ell)}, \dots, \lambda_n^{(\ell)})$, $W^{(\ell)} := \big(\tfrac{1}{\lambda_i^{(\ell)} + \lambda_j^{(\ell)}}\big)_{n \times n}$, and $X_U^{(\ell)} := U^{(\ell)*} X^{(\ell)} U^{(\ell)}$.
3) The limit of $\{S^{(\ell)}\}_{\ell \in \mathbb{N}}$ approximates the Fréchet mean of $\{A_1, \dots, A_m\}$ under $d_{BW}$:

$\lim_{\ell \to \infty} S^{(\ell)} = \mathcal{A}(A_1, \dots, A_m).$  (II.33)

In practice, the iteration stops when $d_{BW}(S^{(\ell)}, S^{(\ell-1)}) \le \varepsilon$, and $S^{(\ell)}$ is taken as the estimated Fréchet mean.

Compared with the Inductive Mean Algorithm, the Projection Mean Algorithm converges much faster and at much lower computational cost. For instance, when $\varepsilon = 10^{-3}$, the Projection Mean Algorithm usually converges within 10 iterations regardless of the number and dimension of the matrices, while the Inductive Mean Algorithm can take on the order of $10^3$ iterations.
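For concreteness, here is a compact sketch of both algorithms, reusing d_bw, bw_geodesic, bw_log, and bw_exp from the earlier snippets. It follows the update rules above with the stopping rule $d_{BW}(S_{\text{new}}, S_{\text{old}}) \le \varepsilon$; the iteration caps are our choices for illustration, not values from the paper.

```python
import numpy as np

def inductive_mean(mats, eps=1e-3, max_iter=100_000):
    """Inductive Mean: S^(k) = S^(k-1) ◇_{1/k} A_k, cycling through the list."""
    S = mats[0].copy()
    for k in range(2, max_iter + 1):
        A_k = mats[(k - 1) % len(mats)]        # periodic extension A_k = A_{k+m}
        S_new = bw_geodesic(S, A_k, 1.0 / k)
        if d_bw(S_new, S) <= eps:
            return S_new
        S = S_new
    return S

def projection_mean(mats, eps=1e-3, max_iter=100):
    """Projection Mean: average the log images at S, then map back with exp."""
    S = sum(mats) / len(mats)                  # S^(0): arithmetic mean
    for _ in range(max_iter):
        X = sum(bw_log(S, A) for A in mats) / len(mats)
        S_new = bw_exp(S, X)
        if d_bw(S_new, S) <= eps:
            return S_new
        S = S_new
    return S
```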
Cheap Mean Algorithm. Like the Projection Mean Algorithm, the Cheap Mean Algorithm leverages the log and exp maps to project matrices between the manifold and the tangent space. The difference is that the Cheap Mean Algorithm also updates the original matrices while updating the barycenter. The iterative process is summarized below, with a code sketch after the list:

1) Let $A_k^{(0)} = A_k$ for $k = 1, \dots, m$.
2) Given $A_1^{(\ell)}, \dots, A_m^{(\ell)}$ for some $\ell \in \mathbb{N}$: for each $k \in \{1, \dots, m\}$, project the geodesic curves connecting $A_k^{(\ell)}$ to $A_1^{(\ell)}, \dots, A_m^{(\ell)}$ onto the tangent space at $A_k^{(\ell)}$, take the arithmetic mean of the projections, and let $A_k^{(\ell+1)}$ be the exponential of this mean at $A_k^{(\ell)}$. Specifically, by (II.19), the projection of $A_j^{(\ell)}$ onto the tangent space at $A_k^{(\ell)}$ is

$X_{kj}^{(\ell)} := (A_k^{(\ell)} A_j^{(\ell)})^{1/2} + (A_j^{(\ell)} A_k^{(\ell)})^{1/2} - 2A_k^{(\ell)},$

and the arithmetic mean of the projections is

$X_k^{(\ell)} := \frac{1}{m} \sum_{j=1}^{m} X_{kj}^{(\ell)} = \frac{1}{m} \sum_{j=1}^{m} \big[(A_k^{(\ell)} A_j^{(\ell)})^{1/2} + (A_j^{(\ell)} A_k^{(\ell)})^{1/2}\big] - 2A_k^{(\ell)}.$

With the spectral decomposition $A_k^{(\ell)} = U_k^{(\ell)} \Lambda_k^{(\ell)} U_k^{(\ell)*}$, $U_k^{(\ell)} \in \mathbb{U}(n)$, $\Lambda_k^{(\ell)} = \operatorname{diag}(\lambda_{k1}^{(\ell)}, \dots, \lambda_{kn}^{(\ell)})$, denote

$W_k^{(\ell)} := \Big(\tfrac{1}{\lambda_{ki}^{(\ell)} + \lambda_{kj}^{(\ell)}}\Big)_{n \times n}, \quad X_{k,U}^{(\ell)} := U_k^{(\ell)*} X_k^{(\ell)} U_k^{(\ell)}.$

By (II.28), the exponential of $X_k^{(\ell)}$ at $A_k^{(\ell)}$ is

$A_k^{(\ell+1)} := A_k^{(\ell)} + X_k^{(\ell)} + U_k^{(\ell)} \big[(W_k^{(\ell)} \circ X_{k,U}^{(\ell)}) \Lambda_k^{(\ell)} (W_k^{(\ell)} \circ X_{k,U}^{(\ell)})\big] U_k^{(\ell)*}.$

3) All sequences $\{A_k^{(\ell)}\}_{\ell \in \mathbb{N}}$, $k = 1, \dots, m$, converge to the same limit, called the Cheap Mean $\mathcal{A}'(A_1, \dots, A_m)$:

$\mathcal{A}'(A_1, \dots, A_m) := \lim_{\ell \to \infty} A_1^{(\ell)} = \lim_{\ell \to \infty} A_2^{(\ell)} = \cdots = \lim_{\ell \to \infty} A_m^{(\ell)}.$

In practice, the iteration stops when $d_{BW}\big(\frac{1}{m} \sum_{k=1}^{m} A_k^{(\ell)}, \frac{1}{m} \sum_{k=1}^{m} A_k^{(\ell-1)}\big) \le \varepsilon$, and $\frac{1}{m} \sum_{k=1}^{m} A_k^{(\ell)}$ is taken as the estimated Fréchet mean.

When $m = 2$, the Cheap Mean Algorithm produces the true Fréchet mean. When $m > 2$, the Cheap Mean $\mathcal{A}'(A_1, \dots, A_m)$ is an approximation of the true Fréchet mean [35]. In practice, the computational cost of this algorithm is relatively low and its convergence rate is competitive with that of the Projection Mean Algorithm: it is much faster than the Inductive Mean Algorithm, though slower than the Projection Mean Algorithm.
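A matching sketch of the Cheap Mean update, under the same assumptions and helper functions as before; every matrix in the list is moved at each sweep, and the average of the current list is monitored for the stopping rule.

```python
import numpy as np

def cheap_mean(mats, eps=1e-3, max_iter=100):
    """Cheap Mean: each A_k moves toward the mean of the log images taken at A_k."""
    cur = [A.copy() for A in mats]
    prev_avg = sum(cur) / len(cur)
    for _ in range(max_iter):
        # Synchronous sweep: the comprehension reads the old list before rebinding.
        cur = [bw_exp(Ak, sum(bw_log(Ak, Aj) for Aj in cur) / len(cur))
               for Ak in cur]
        avg = sum(cur) / len(cur)
        if d_bw(avg, prev_avg) <= eps:
            break
        prev_avg = avg
    return avg
```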
III. EXPERIMENTAL RESULTS

Extensive simulations are conducted to comprehensively investigate the robustness of the BW distance and the accuracy, efficiency, and robustness of estimating the Fréchet mean of PSD matrices with the BW distance on the manifold $\overline{\mathbb{P}}_n$. Both efficiency and robustness are compared with the commonly used AI distance (I.2). Since the AI distance only works for positive definite matrices, small values ($O(10^{-3})$) are added to the zero eigenvalues to make PSD matrices positive definite in our simulations. Table I summarizes the parameter settings used in the simulations.

TABLE I: Simulation parameter settings
  Simulation parameter                          Values
  n: Dimension of matrices                      5, 10, 20, 30, 50, 100
  m: Number of matrices                         5, 10, 20, 30, 50, 100
  p: Proportion of close-to-zero eigenvalues    0.1, 0.2, 0.4, 0.6, 0.8

Fig. 3: Empirical distributions of $\Delta_{ij}^{AI}$ and $\Delta_{ij}^{BW}$ with (A) varying matrix dimension and (B) varying proportion of close-to-zero eigenvalues.

A. Robustness of BW distance

The robustness of a distance measure is quantified by the change in distance when the matrices are contaminated by small perturbations: the smaller the change, the more robust the distance measure. In this simulation, we test the robustness of the BW distance, compare it with the AI distance, and investigate the factors that influence the robustness of the BW distance.

Specifically, 100 pairs of positive definite matrices are randomly generated with $n = 10$ and $p = 0.2$, denoted $(A_i, B_i)$, $i = 1, \dots, 100$. Then, for each pair $(A_i, B_i)$, 100 pairs of Hermitian perturbation matrices, denoted $(E_{ij}^A, E_{ij}^B)$, $j = 1, \dots, 100$, are randomly generated with spectral norms on the same scale as the smallest eigenvalues of $(A_i, B_i)$, i.e., $10^{-3}$. The contaminated matrices $(\tilde{A}_{ij}, \tilde{B}_{ij})$ are obtained by adding the perturbation matrices to $(A_i, B_i)$: $\tilde{A}_{ij} = A_i + E_{ij}^A$, $\tilde{B}_{ij} = B_i + E_{ij}^B$. The robustness of a distance measure is quantified by the difference between the distances of the two matrices with and without perturbation:

$\Delta_{ij}^{AI} = d_{AI}(A_i, B_i) - d_{AI}(\tilde{A}_{ij}, \tilde{B}_{ij}),$
$\Delta_{ij}^{BW} = d_{BW}(A_i, B_i) - d_{BW}(\tilde{A}_{ij}, \tilde{B}_{ij}),$

where $i, j = 1, \dots, 100$.

With $p = 0.2$, the empirical distributions of $\Delta_{ij}^{AI}$ and $\Delta_{ij}^{BW}$ with varying matrix dimension $n$ are shown in Figure 3(A). Overall, $\Delta^{BW}$ is on a much smaller scale than $\Delta^{AI}$, which implies that the BW distance is much more robust than the AI distance when matrices are affected by perturbations. Besides, the matrix dimension also affects the robustness of both the BW and AI distances (all p-values < 0.01 for the Kruskal-Wallis and Dunn tests): the higher the matrix dimension, the less robust the distance.

With $n = 10$, the robustness of the distances is also investigated with respect to the proportion of close-to-zero eigenvalues $p$. Figure 3(B) shows the empirical distributions of $\Delta_{ij}^{AI}$ and $\Delta_{ij}^{BW}$ with varying $p$. Overall, the scale of $\Delta^{BW}$ is much smaller than that of $\Delta^{AI}$, i.e., the BW distance is more robust than the AI distance. However, $p$ has less impact on the robustness of both distances than $n$ does (p-values < 0.01 for the Kruskal-Wallis test, but not all p-values are significant for the Dunn tests at the 0.01 significance level, especially for the AI distance).

In summary, the BW distance is more robust when matrices are affected by perturbations. Besides, the BW distance requires much less computing time than the AI distance, as is readily seen by comparing equations (I.2) and (II.1).
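The following sketch mirrors the Section III.A setup: random positive definite matrices with a proportion $p$ of close-to-zero eigenvalues, symmetric perturbations with small spectral norm, and the resulting $\Delta^{AI}$ and $\Delta^{BW}$. The spectrum generator (uniform draws over the stated ranges) and the slightly reduced perturbation scale, chosen so the perturbed matrices stay positive definite, are our assumptions; the paper does not spell out its random-matrix construction.

```python
import numpy as np

def random_pd(n, p, rng, small=1e-3):
    """PD matrix with about p*n eigenvalues of order `small` (assumed generator)."""
    k = int(round(p * n))
    eig = np.concatenate([rng.uniform(1.0, 10.0, n - k),
                          small * rng.uniform(1.0, 2.0, k)])
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return (Q * eig) @ Q.T

def perturb(A, rng, scale=5e-4):
    """Add a symmetric perturbation with spectral norm `scale` (kept just below
    the smallest eigenvalues so the perturbed matrix remains PD -- our choice)."""
    E = rng.standard_normal(A.shape); E = (E + E.T) / 2.0
    return A + scale * E / np.linalg.norm(E, 2)

rng = np.random.default_rng(3)
n, p = 10, 0.2
deltas_ai, deltas_bw = [], []
for _ in range(100):
    A, B = random_pd(n, p, rng), random_pd(n, p, rng)
    At, Bt = perturb(A, rng), perturb(B, rng)
    deltas_ai.append(d_ai(A, B) - d_ai(At, Bt))
    deltas_bw.append(d_bw(A, B) - d_bw(At, Bt))
print("AI spread:", np.std(deltas_ai), " BW spread:", np.std(deltas_bw))
```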
B. Robustness of Barycenter for Two Matrices

The Fréchet mean of two matrices is simply the midpoint of the geodesic, i.e., $A \# B$ (I.4) under the AI distance and $A \diamond_{1/2} B$ (II.3) under the BW distance. The robustness of the barycenter refers to how much the barycenter is affected when the two matrices are contaminated by small perturbations. In this simulation, we investigate the robustness of the two barycenters $A \diamond_{1/2} B$ and $A \# B$ and the contributing factors of the robustness.

200 pairs of positive definite matrices are randomly generated with $n = 10$ and $p = 0.2$, denoted $(A_i, B_i)$, $i = 1, \dots, 200$. For each pair $(A_i, B_i)$, 100 Hermitian perturbation matrices $(E_{ij}^A, E_{ij}^B)$, $j = 1, \dots, 100$, are randomly generated with spectral norms of order $O(10^{-3})$. The Fréchet means of $(A_i, B_i)$ are denoted $M_i^{AI}$ and $M_i^{BW}$ for the AI and BW distances, respectively; for the perturbed matrices, the Fréchet means are denoted $\tilde{M}_{ij}^{AI}$ and $\tilde{M}_{ij}^{BW}$:

$M_i^{AI} = A_i \# B_i, \quad M_i^{BW} = A_i \diamond_{1/2} B_i,$
$\tilde{M}_{ij}^{AI} = \tilde{A}_{ij} \# \tilde{B}_{ij}, \quad \tilde{M}_{ij}^{BW} = \tilde{A}_{ij} \diamond_{1/2} \tilde{B}_{ij},$

where $\tilde{A}_{ij} = A_i + E_{ij}^A$ and $\tilde{B}_{ij} = B_i + E_{ij}^B$, $i = 1, \dots, 200$, $j = 1, \dots, 100$, are the matrices with random perturbations.

Fig. 4: Empirical distributions of $d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})$ and $d_F(M_i^{AI}, \tilde{M}_{ij}^{AI})$ in three representative cases: (1) $0 \le \delta_i \le Q3$, (2) $\delta_i > Q3$, (3) $\delta_i < 0$.

To quantify the robustness of the barycenter, we record the Frobenius distance between the barycenters of the matrices with and without perturbations, i.e., $d_F(M_i^{\circ}, \tilde{M}_{ij}^{\circ})$, for the AI and BW measures, respectively. The smaller the distance $d_F(M_i^{\circ}, \tilde{M}_{ij}^{\circ})$, the more robust the barycenter. The Frobenius distance between two matrices is defined as $d_F(X, Y) = \|X - Y\|_F = \sqrt{\operatorname{tr}[(X - Y)^*(X - Y)]}$.

For each pair $(A_i, B_i)$, the distributions of $\{d_F(M_i^{\circ}, \tilde{M}_{ij}^{\circ}), j = 1, \dots, 100\}$ for the AI and BW distances are compared; Figure 4 shows three representative cases. To quantify the difference between the two distributions $\{d_F(M_i^{AI}, \tilde{M}_{ij}^{AI})\}$ and $\{d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})\}$ for each pair $(A_i, B_i)$, we record the relative difference of the two sample means as the evaluation metric:

$\delta_i = \frac{\bar{d}_F(M_i^{AI}, \tilde{M}_{ij}^{AI}) - \bar{d}_F(M_i^{BW}, \tilde{M}_{ij}^{BW})}{\bar{d}_F(M_i^{BW}, \tilde{M}_{ij}^{BW})}, \quad \text{where } \bar{d}_F(M_i^{\circ}, \tilde{M}_{ij}^{\circ}) = \frac{1}{100} \sum_{j=1}^{100} d_F(M_i^{\circ}, \tilde{M}_{ij}^{\circ}), \quad i = 1, \dots, 200.$

$\delta_i$ quantifies the difference in robustness between the BW and AI barycenters: if $\delta_i$ is positive (e.g., cases 1 and 2 in Figure 4), the BW barycenter is more robust. Table II summarizes the distribution of $\{\delta_i\}$.

In Figure 4, case 1 represents the most common scenario (73%), where $d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})$ is mostly smaller than $d_F(M_i^{AI}, \tilde{M}_{ij}^{AI})$. Case 2 shows the common scenario (25%) where $d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})$ is significantly smaller than $d_F(M_i^{AI}, \tilde{M}_{ij}^{AI})$. Both cases 1 and 2 represent situations where the BW barycenter is more robust than the AI barycenter. Case 3 depicts the least common scenario (2%), where the AI barycenter shows greater robustness than the BW barycenter ($\delta_i < 0$), but with a considerable amount of overlap between the two distributions. Note that the maximum and minimum of $\{\delta_i\}$ are 35.04 and -0.423, respectively, which implies that the AI barycenter is not significantly more robust than the BW barycenter even in the rare occurrence of case 3.

TABLE II: Descriptive statistics of the distribution of $\{\delta_i\}$ with $n = 10$ and $p = 0.2$.
  Min      Q1      Median   Mean    Q3       Max
  -0.42    2.67    8.23     9.53    14.81    35.04

With the proportion of close-to-zero eigenvalues $p = 0.2$, we investigate how the matrix dimension $n$ influences the robustness of the BW and AI barycenters. Figure 5(A) shows the empirical distribution of $\{\delta\}$ with varying $n$; the line plot in the top right corner gives the median of each distribution. Overall, the majority of $\delta$ values lie on the positive side regardless of $n$, while negative $\delta$ values rarely occur in our simulations. This indicates that the BW barycenter is more robust than the AI barycenter, i.e., less sensitive to changes in the matrices. Besides, as $n$ increases, the distribution of $\delta_i$ concentrates toward 0, and a monotonic decrease in the median is observed. In other words, the robustness advantage of the BW barycenter over the AI barycenter is more pronounced for lower-dimensional matrices (p-value < 0.05 for the Kruskal-Wallis test, and 11 of the 15 pairwise Dunn comparisons are significant).
Fig. 5: Empirical distributions of $\delta_i$ with (A) varying matrix dimension and (B) varying proportion of close-to-zero eigenvalues.

With matrix dimension $n = 10$, we also study the impact of $p$ on the robustness of the two barycenters, as shown in Figure 5(B). Overall, the majority of $\delta$ values lie on the positive side, which implies that the BW barycenter is more robust than the AI barycenter. Unlike for $n$, the distributions of $\delta$ with varying $p$ share similar shapes and similar medians (the Kruskal-Wallis test is not significant). Therefore, $p$ has little impact on the difference in robustness between the BW and AI barycenters.

C. Accuracy, Efficiency, and Robustness of Barycenter Estimation for More Than Two Matrices

We now consider a set of positive definite matrices $A_1, \dots, A_m$, $m > 2$, with $p = 0.2$. The number of matrices $m$ and the matrix dimension $n$ are chosen according to the grid given in Table I. In this simulation, we comprehensively study the accuracy, efficiency, and robustness of the three proposed barycenter estimation algorithms using the BW distance. For the stopping criterion, we use $\varepsilon = 10^{-3}$.

1) Accuracy of barycenter estimation with BW distance: Although the true Fréchet mean is unknown, by definition it is the matrix that minimizes the sum of squared BW distances to the matrices in the set. Therefore, the sum of squared distances (SSD) is chosen as the evaluation metric for the accuracy of barycenter estimation: the smaller the SSD, the more accurate the estimate.

$SSD^{\circ} = \sum_{i=1}^{m} d_{BW}^2(A_i, M^{\circ}),$

where $M^{\circ}$ denotes the barycenter estimated by one of the three algorithms; in the following, $I$ denotes the Inductive Mean Algorithm, $P$ the Projection Mean Algorithm, and $C$ the Cheap Mean Algorithm. A code sketch of this metric follows.
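The SSD metric is a one-liner given the helpers above; the comparison loop below is a minimal illustration that binds the three estimators to the functions defined in the Section II-B sketches.

```python
import numpy as np

def ssd(mats, M):
    """Sum of squared BW distances from each A_i to a candidate barycenter M."""
    return sum(d_bw(A, M) ** 2 for A in mats)

# Example comparison for one random draw (helper names from the earlier sketches).
rng = np.random.default_rng(4)
mats = [random_pd(10, 0.2, rng) for _ in range(20)]
for name, algo in [("I", inductive_mean), ("P", projection_mean), ("C", cheap_mean)]:
    print(name, ssd(mats, algo(mats)))
```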
Take $n = 10$ and $m = 100$ as an example. In the $k$-th replication ($k = 1, \dots, 100$), 100 positive definite matrices are randomly generated with dimension 10 and 2 close-to-zero eigenvalues (of order $O(10^{-3})$). The barycenters of the 100 matrices (i.e., $M^I$, $M^P$, $M^C$) are then obtained by the three algorithms, and the SSD of each barycenter is recorded as $SSD_k^I$, $SSD_k^P$, $SSD_k^C$. The process is repeated 100 times, resulting in the collections $\{SSD_k^{\circ}, k = 1, \dots, 100\}$.

Fig. 6: Empirical distributions of $\{SSD^{\circ}\}$ for the three mean algorithms, with matrix dimension 10 and 100 matrices.

Figure 6 shows the empirical distributions of $\{SSD^I\}$, $\{SSD^P\}$, and $\{SSD^C\}$. $\{SSD^P\}$ is significantly smaller than the other two, with p-values < 0.05 in Wilcoxon signed-rank tests. Therefore, the Projection Mean Algorithm is the most accurate of the three methods.

We then vary the matrix dimension $n$ and the number of matrices $m$ and repeat the above procedure. For comparison purposes, we denote $\{SSD_k^I - SSD_k^P\}_{k=1}^{100}$ as $SSD^I - SSD^P$ and $\{SSD_k^C - SSD_k^P\}_{k=1}^{100}$ as $SSD^C - SSD^P$. Figure 7 shows the empirical distributions of $SSD^I - SSD^P$ and $SSD^C - SSD^P$.

Fig. 7: Accuracy of barycenter estimation algorithms. (A) Distribution of $(SSD^I - SSD^P)$. (B) Distribution of $(SSD^C - SSD^P)$. All distributions are significantly different from each other (all p-values < 0.05).

Overall, both $SSD^I - SSD^P$ and $SSD^C - SSD^P$ are positive, which implies that the Projection Mean Algorithm is the most accurate regardless of $n$ and $m$. Compared with $SSD^I - SSD^P$, $SSD^C - SSD^P$ is on a much larger scale except for dimension 5, indicating that the Inductive Mean Algorithm is generally more accurate than the Cheap Mean Algorithm, except for low-dimensional matrices. Besides, the Cheap Mean Algorithm becomes less accurate as the number or dimension of the matrices increases (Figure 7(B)). The difference between the Inductive Mean and Projection Mean Algorithms also increases with the matrix dimension. However, unlike for the Cheap Mean, the influence of the number of matrices $m$ on the Inductive Mean Algorithm is not monotonic: when $n$ is 50, the smallest SSD occurs when $m$ is also 50; when $n$ is 30, the smallest SSD occurs when $m$ is 20. We therefore suspect that the Inductive Mean Algorithm is most accurate when the number of matrices is close to the dimension of the matrices.

In summary, the barycenter estimated by the Projection Mean Algorithm is the most accurate regardless of the number and dimension of the matrices, and the Inductive Mean Algorithm is more accurate than the Cheap Mean Algorithm except for low-dimensional matrices.

2) Efficiency of barycenter estimation with BW distance: To evaluate the efficiency of the barycenter estimation algorithms, running time is chosen as the evaluation metric. The efficiency of the three proposed algorithms with the BW distance is investigated and compared with the same algorithms coupled with the AI distance; the contributing factors that affect the efficiency of each algorithm are also comprehensively studied. As before, positive definite matrices are randomly generated with $n$ and $m$ chosen from Table I, and the running time is averaged over 100 replications. The details of the barycenter algorithms coupled with the AI distance are discussed in [36]. Figure 8 shows the comparison of the running times (in seconds, log scale); a simple timing harness is sketched below.

Fig. 8: Comparison of the efficiency of barycenter estimation algorithms coupled with the BW and AI distances.
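A minimal timing harness of the kind used for Figure 8 might look as follows; the grid values, repetition count, and perf_counter-based timing are our choices for illustration, not the paper's protocol.

```python
import time
import numpy as np

rng = np.random.default_rng(5)
algorithms = {"inductive": inductive_mean,
              "projection": projection_mean,
              "cheap": cheap_mean}
for n in [5, 10, 20]:
    for m in [5, 10, 20]:
        mats = [random_pd(n, 0.2, rng) for _ in range(m)]
        for name, algo in algorithms.items():
            reps = 5                      # the paper averages over 100 replications
            t0 = time.perf_counter()
            for _ in range(reps):
                algo(mats)
            dt = (time.perf_counter() - t0) / reps
            print(f"n={n:3d} m={m:3d} {name:10s} {dt:.4f}s")
```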
Comparing the two distances, both the Projection Mean and Cheap Mean Algorithms perform more efficiently when coupled with the BW distance, regardless of the matrix dimension $n$ and the number of matrices $m$. The Inductive Mean Algorithm is more efficient when coupled with the BW distance for low-dimensional matrices. The impact of the contributing factors (i.e., $n$ and $m$) is similar for both distances: regardless of the distance, the efficiency of the Inductive Mean Algorithm is mainly affected by the matrix dimension, while the Cheap Mean and Projection Mean Algorithms become slower as $n$ or $m$ grows.

For the BW distance, the Projection Mean Algorithm is the most efficient regardless of $n$ and $m$. Depending on $n$, the Cheap Mean Algorithm is more efficient than the Inductive Mean Algorithm when $m$ is small and less efficient when $m$ becomes large. For the AI distance, the Cheap Mean Algorithm is the most efficient when $m$ is small, for instance with fewer than 10 matrices, which is consistent with the finding in [36]. For more matrices (i.e., large $m$), the Projection Mean and Inductive Mean Algorithms become more efficient; depending on the matrix dimension, the Projection Mean Algorithm is more efficient when $n$ is large. However, it is worth noting that the Projection Mean Algorithm does not converge when coupled with the AI distance and $n$ is much larger than $m$ [36]. Overall, the Projection Mean Algorithm coupled with the BW distance is the most efficient regardless of $n$ and $m$.

3) Robustness of barycenter estimation with BW distance: The robustness of barycenter estimation refers to the change in the estimated barycenter caused by perturbations that contaminate the matrices. We therefore use the Frobenius distance between the Fréchet means estimated with and without perturbations as the evaluation metric to quantify the robustness of the algorithms. With $n$ and $m$ chosen from Table I, in the $i$-th replication ($i = 1, \dots, 100$), positive definite matrices $\{A_i^1, \dots, A_i^m\}$ are randomly generated. The contaminated matrices $\{\tilde{A}_{ij}^1, \dots, \tilde{A}_{ij}^m, j = 1, \dots, 100\}$ are then obtained by adding randomly generated Hermitian perturbation matrices $E_{ij}^k$ to $A_i^k$:

$\tilde{A}_{ij}^k = A_i^k + E_{ij}^k, \quad i, j = 1, \dots, 100, \quad k = 1, \dots, m.$

The robustness of barycenter estimation is then quantified by the Frobenius distance between the Fréchet means estimated without and with perturbation:

$d_{F,ij}^{\circ} = \|M_i^{\circ} - \tilde{M}_{ij}^{\circ}\|_F, \quad i, j = 1, \dots, 100,$

where $M_i^{\circ}$ and $\tilde{M}_{ij}^{\circ}$ denote the barycenters estimated by the different algorithms without and with perturbations, respectively. For the robustness comparison, we focus on three algorithms, namely the Inductive Mean with the AI distance, the Inductive Mean with the BW distance, and the Projection Mean with the BW distance, because the Cheap Mean is time-consuming, especially for large $m$, and the Projection Mean with the AI distance does not converge for small $m$. A code sketch of this experiment appears at the end of this subsection.

Fig. 9: Distribution of the Frobenius distance between the Fréchet means estimated by the BW Projection Mean Algorithm for 20 x 20 matrices with and without perturbations.

With $n = 20$, Figure 9 displays the distribution of $\{d_{F,ij}^{BW,P}\}$. The contributing factor $m$ has a significant impact on the robustness of the Projection Mean Algorithm (BW): the estimated barycenter is more robust when there are more matrices (all p-values < 0.05 for the Kruskal-Wallis test and pairwise Wilcoxon rank-sum tests).

To compare the robustness of two algorithms, the following metrics are displayed in Figure 10:

$d_F^{AI,I} - d_F^{BW,I} := \{d_{F,ij}^{AI,I} - d_{F,ij}^{BW,I}, \; i, j = 1, \dots, 100\},$
$d_F^{BW,I} - d_F^{BW,P} := \{d_{F,ij}^{BW,I} - d_{F,ij}^{BW,P}, \; i, j = 1, \dots, 100\}.$

Fig. 10: Comparison of the robustness of barycenter estimation for more than two matrices. (A) Distribution of $d_F^{AI,I} - d_F^{BW,I}$ with different $n$ and $m$. (B) Distribution of $d_F^{BW,I} - d_F^{BW,P}$ with different $n$ and $m$.

Figure 10(A) displays the differences in robustness between the Inductive Mean Algorithms coupled with the AI and BW distances. As $n$ or $m$ increases, the BW Inductive Mean Algorithm becomes significantly more robust than the AI version (all p-values < 0.05 for the Kruskal-Wallis test and pairwise Wilcoxon rank-sum tests). Figure 10(B) compares the BW Inductive and BW Projection Mean Algorithms. The BW Projection Mean Algorithm is more robust than the BW Inductive Mean Algorithm when $m$ is much larger than $n$. For instance, with fewer than twenty $5 \times 5$ matrices, the BW Inductive Mean is more robust, but with 50 or 100 matrices the BW Projection Mean Algorithm becomes significantly more robust (p-values < 0.05 for both the Kruskal-Wallis test and pairwise Wilcoxon rank-sum tests).

In summary, the BW Projection Mean Algorithm is the most robust when $m$ is much larger than $n$, and the BW Inductive Mean Algorithm is more robust than the AI Inductive Mean Algorithm.
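The robustness metric $d_{F,ij}^{\circ}$ can be sketched in a few lines, with the same helper functions and assumed generators as before:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 10, 20
mats = [random_pd(n, 0.2, rng) for _ in range(m)]
M = projection_mean(mats)                 # barycenter of the clean matrices
d_frob = []
for _ in range(100):                      # 100 perturbed replicates
    tilde = [perturb(A, rng) for A in mats]
    M_tilde = projection_mean(tilde)
    d_frob.append(np.linalg.norm(M - M_tilde, "fro"))
print("median d_F:", np.median(d_frob))
```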
IV. CONCLUSION

In this paper, we first establish the mathematical foundation for the BW distance by studying the properties of the BW distance and the retraction maps of the manifold $\overline{\mathbb{P}}_n$. To characterize the distribution of PSD matrices on the manifold, we propose three algorithms to estimate the Fréchet mean (i.e., barycenter) of a set of PSD matrices. With extensive simulation experiments, we comprehensively investigate three aspects: (1) the robustness of the BW distance when quantifying the distance between PSD matrices; (2) the robustness of the BW barycenter of two matrices when the matrices are contaminated by small perturbations; (3) the accuracy, efficiency, and robustness of the three proposed barycenter estimation algorithms. Compared with the AI distance, the BW distance is more robust, especially when matrices are close to being positive semi-definite, which is a common scenario for high-dimensional data. For two matrices, the BW barycenter is more robust than the AI barycenter under noise, especially for positive definite matrices with some small eigenvalues. For more than two matrices, the BW Projection Mean Algorithm outperforms the others in terms of accuracy, efficiency, and robustness. Therefore, the BW distance and the Projection Mean Algorithm are recommended, especially for studying high-dimensional matrices.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees, an Associate Editor, and the Editor for their constructive comments, which improved the quality of this paper. This paper is based upon work supported by the National Science Foundation under Grant No. 2153492.

REFERENCES

[1] R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball, "Deep learning with convolutional neural networks for EEG decoding and visualization," Human Brain Mapping, vol. 38, no. 11, pp. 5391–5420, 2017.
[2] J. Zheng, M. Liang, S. Sinha, L. Ge, W. Yu, A. Ekstrom, and F. Hsieh, "Time-frequency analysis of scalp EEG with Hilbert-Huang transform and deep learning," IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 4, pp. 1549–1559, 2021.
[3] A. Qiu, A. Lee, M. Tan, and M. K. Chung, "Manifold learning on brain functional networks in aging," Medical Image Analysis, vol. 20, no. 1, pp. 52–60, 2015.
[4] G. Varoquaux, F. Baronnet, A. Kleinschmidt, P. Fillard, and B. Thirion, "Detection of brain functional-connectivity difference in post-stroke patients using group-level covariance modeling," in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010: 13th International Conference, Beijing, China, September 20-24, 2010, Proceedings, Part I. Springer, 2010, pp. 200–208.
[5] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, "Riemannian geometry applied to BCI classification," in Latent Variable Analysis and Signal Separation, V. Vigneron, V. Zarzoso, E. Moreau, R. Gribonval, and E. Vincent, Eds. Berlin, Heidelberg: Springer, 2010, pp. 629–636.
[6] ——, "Classification of covariance matrices using a Riemannian-based kernel for BCI applications," Neurocomputing, vol. 112, pp. 172–178, Jul. 2013. [Online]. Available: https://doi.org/10.1016/j.neucom.2012.12.039
[7] A. S. M. Miah, M. R. Islam, and M. K. I. Molla, "EEG classification for MI-BCI using CSP with averaging covariance matrices: an experimental study," in 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2). IEEE, 2019, pp. 1–5.
[8] K.-X. Chen, J.-Y. Ren, X.-J. Wu, and J. Kittler, "Covariance descriptors on a Gaussian manifold and their application to image set classification," Pattern Recognition, vol. 107, p. 107463, Nov. 2020.
[9] F. Porikli, O. Tuzel, and P. Meer, "Covariance tracking using model update based on Lie algebra," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 1. IEEE, 2006, pp. 728–735.
[10] R. Sivalingam, D. Boley, V. Morellas, and N. Papanikolopoulos, "Tensor sparse coding for region covariances," in Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV. Springer, 2010, pp. 722–735.
[11] J. Jagarlamudi, R. Udupa, H. Daumé III, and A. Bhole, "Improving bilingual projections via sparse covariance matrices," in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 930–940.
[12] W. Zhang and P. Fung, "Discriminatively trained sparse inverse covariance matrices for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 5, pp. 873–882, 2014.
[13] Z. Cui, W. Li, D. Xu, S. Shan, X. Chen, and X. Li, "Flowing on Riemannian manifold: Domain adaptation by shifting covariance," IEEE Transactions on Cybernetics, vol. 44, no. 12, pp. 2264–2273, 2014.
[14] Z. Zhang, M. Wang, Y. Huang, and A. Nehorai, "Aligning infinite-dimensional covariance matrices in reproducing kernel Hilbert spaces for domain adaptation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3437–3445.
[15] N. He, L. Fang, S. Li, A. Plaza, and J. Plaza, "Remote sensing scene classification using multilayer stacked covariance pooling," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 12, pp. 6899–6910, 2018.
[16] L. Eklundh and A. Singh, "A comparative analysis of standardised and unstandardised principal components analysis in remote sensing," International Journal of Remote Sensing, vol. 14, no. 7, pp. 1359–1370, 1993.
[17] D. Yang, C. Gu, Z. Dong, P. Jirutitijaroen, N. Chen, and W. M. Walsh, "Solar irradiance forecasting using spatial-temporal covariance structures and time-forward kriging," Renewable Energy, vol. 60, pp. 235–245, 2013.
[18] K. Meyer, "Factor-analytic models for genotype x environment type problems and structured covariance matrices," Genetics Selection Evolution, vol. 41, no. 1, pp. 1–11, 2009.
[19] Z. Huang, R. Wang, S. Shan, and X. Chen, "Face recognition on large-scale video in the wild with hybrid Euclidean-and-Riemannian metric learning," Pattern Recognition, vol. 48, no. 10, pp. 3113–3124, 2015.
[20] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, "Geometric means in a novel vector space structure on symmetric positive-definite matrices," SIAM Journal on Matrix Analysis and Applications, vol. 29, no. 1, pp. 328–347, 2007.
[21] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi, "Kernel methods on the Riemannian manifold of symmetric positive definite matrices," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 73–80.
[22] Z. Huang, R. Wang, S. Shan, X. Li, and X. Chen, "Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification," in International Conference on Machine Learning. PMLR, 2015, pp. 720–729.
[23] Z. Lin, "Riemannian geometry of symmetric positive definite matrices via Cholesky decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 40, no. 4, pp. 1353–1370, 2019.
[24] X. Pennec, "Manifold-valued image processing with SPD matrices," in Riemannian Geometric Statistics in Medical Image Analysis. Elsevier, 2020, pp. 75–134.
[25] R. Bhatia, T. Jain, and Y. Lim, "On the Bures–Wasserstein distance between positive definite matrices," Expositiones Mathematicae, vol. 37, no. 2, pp. 165–191, 2019.
[26] R. Bhatia, T. Jain, and Y. Lim, "Inequalities for the Wasserstein mean of positive definite matrices," Linear Algebra and its Applications, vol. 576, pp. 108–123, 2019.
[27] Y. Thanwerdas and X. Pennec, "O(n)-invariant Riemannian metrics on SPD matrices," Linear Algebra and its Applications, vol. 661, pp. 163–201, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0024379522004360
[28] J. Hwang and S. Kim, "Two-variable Wasserstein means of positive definite operators," Mediterranean Journal of Mathematics, vol. 19, no. 3, p. 110, 2022.
[29] Y. Thanwerdas, "Riemannian and stratified geometries of covariance and correlation matrices," Ph.D. dissertation, Université Côte d'Azur, 2022.
[30] S. Kim and H. Lee, "Inequalities of the Wasserstein mean with other matrix means," Annals of Functional Analysis, vol. 11, pp. 194–207, 2020.
[31] J. Hwang and S. Kim, "Bounds for the Wasserstein mean with applications to the Lie-Trotter mean," Journal of Mathematical Analysis and Applications, vol. 475, no. 2, pp. 1744–1753, 2019.
[32] E. Massart and P.-A. Absil, "Quotient geometry with simple geodesics for the manifold of fixed-rank positive-semidefinite matrices," SIAM Journal on Matrix Analysis and Applications, vol. 41, no. 1, pp. 171–198, 2020.
[33] L. Malagò, L. Montrucchio, and G. Pistone, "Wasserstein Riemannian geometry of Gaussian densities," Information Geometry, vol. 1, pp. 137–179, 2018.
[34] F. Yger, M. Berar, and F. Lotte, "Riemannian approaches in brain-computer interfaces: A review," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 10, pp. 1753–1762, 2017.
[35] D. A. Bini and B. Iannazzo, "A note on computing matrix geometric means," Advances in Computational Mathematics, vol. 35, no. 2-4, pp. 175–192, 2011.
[36] B. Jeuris, R. Vandebril, and B. Vandereycken, "A survey and comparison of contemporary algorithms for computing the matrix geometric mean," Electronic Transactions on Numerical Analysis, vol. 39, pp. 379–402, 2012.
