Factor Analysis and Alternating Minimization
Authors: Lorenzo Finesso, Peter Spreij
Lorenzo Finesso (Institute of Biomedical Engineering, CNR-ISIB, Padova, lorenzo.finesso@isib.cnr.it)
Peter Spreij (Korteweg-de Vries Institute for Mathematics, Universiteit van Amsterdam, Amsterdam, spreij@science.uva.nl)

Dedicated to Giorgio Picci on the occasion of his 65th birthday. Happy Birthday Giorgio!

Abstract. In this paper we make a first attempt at understanding how to build an optimal approximate normal factor analysis model. The criterion we have chosen to evaluate the distance between different models is the I-divergence between the corresponding normal laws. The algorithm that we propose for the construction of the best approximation is of the alternating minimization kind.

1 Introduction

Factor analysis, in its original formulation, is the linear statistical model

    Y = HX + \varepsilon,    (1.1)

where H is a deterministic matrix and X and ε are independent random vectors, the first with dimension smaller than Y, the second with independent components. What makes this model attractive in applied research is the data reduction mechanism built into it: a large number of observed variables Y are explained in terms of a small number of unobserved (latent) variables X, perturbed by the independent noise ε. Under normality assumptions, which are the rule in the standard theory, all the laws of the model are specified by covariance matrices. More precisely, assume that X and ε are zero mean independent normal vectors with Cov(X) = P and Cov(ε) = D, where D is diagonal. It follows from (1.1) that Cov(Y) = HPH⊤ + D. Building a factor analysis model of the observed data requires the solution of a difficult algebraic problem: given Σ0, the covariance matrix of Y, find the triples (H, P, D) such that Σ0 = HPH⊤ + D. Due to the structural constraint on D, which is assumed to be diagonal, the existence and uniqueness of a factor analysis model are not guaranteed. As it turns out, the right tools to deal with this situation come from the theory of stochastic realization; see [5] (trying to spot the master's hand) for an early contribution on the subject.

In the present paper we make a first attempt at understanding how to build an optimal approximate factor analysis model. The criterion we have chosen to evaluate the distance between covariances is the I-divergence between the corresponding normal laws. The algorithm that we propose for the construction of the best approximation is inspired by the alternating minimization procedure of [4] and [6].

2 The model

Consider two independent, zero mean, normal vectors X and ε of respective dimensions k and n. We will assume that Cov(X) = I, the identity matrix, and Cov(ε) = D > 0, a diagonal matrix. Let H be an n × k matrix (in this paper k < n) and let the random vector Y be defined by

    Y = HX + \varepsilon.    (2.1)

Under these assumptions (2.1) is called a factor analysis (FA) model of size k for the vector Y. Notice that allowing Cov(X) = P > 0 does not produce a more general model, as a square root of P can always be absorbed in H. We will say that a normal vector Y admits a FA model of size k if it is equal in distribution to HX + ε for some X and ε as above, i.e. if its covariance Σ0 can be written as Σ0 = HH⊤ + D.
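As a quick numerical illustration of (2.1) (not part of the original paper), the following Python/NumPy sketch samples from an FA model with hypothetical values of H and D and checks that the sample covariance of Y approaches HH⊤ + D:

import numpy as np

rng = np.random.default_rng(0)
n, k, N = 5, 2, 200_000                      # observed dim, latent dim, sample size

H = rng.normal(size=(n, k))                  # hypothetical loading matrix
d = rng.uniform(0.5, 1.5, size=n)            # diagonal of D, so D > 0
D = np.diag(d)

X = rng.normal(size=(N, k))                  # Cov(X) = I
eps = rng.normal(size=(N, n)) * np.sqrt(d)   # independent noise, Cov(eps) = D
Y = X @ H.T + eps                            # the FA model (2.1): Y = H X + eps

# The sample covariance of Y should be close to H H^T + D.
print(np.abs(np.cov(Y, rowvar=False, bias=True) - (H @ H.T + D)).max())

Since Cov(X) = I and Cov(ε) = D, independence gives Cov(Y) = HH⊤ + D, and the printed maximal deviation shrinks as N grows.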
Not every normal vector Y admits a FA model, the hard constraint being imposed by the diagonal structure of D. A probabilistic interpretation stems from Cov(Y | X) = D (see equation (A.1) of the Appendix): the n components of Y are conditionally independent given the k < n components of some vector X. In Remark 3.1 of the next section the condition for the existence of a FA model is slightly reformulated.

Although the construction of an exact FA model is not always possible, one can search for a best approximate model, according to some criterion. In this paper we opt for minimizing the I-divergence (Kullback-Leibler distance) between normal laws. Recall that given two probability measures P1 and P2, defined on the same measurable space and such that P1 ≪ P2, the I-divergence of P1 with respect to P2 is defined as

    D(P_1 \| P_2) = \mathbb{E}_{P_1} \log \frac{dP_1}{dP_2}.

If P1 and P2 are normal measures on the same space R^n, with zero means and strictly positive covariance matrices Σ1 and Σ2 respectively, the I-divergence D(P1 || P2) takes the explicit form

    D(P_1 \| P_2) = \frac{1}{2} \log \frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2} \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) - \frac{n}{2}.    (2.2)

Since the I-divergence only depends on the covariance matrices, we usually write D(Σ1 || Σ2) instead of D(P1 || P2). The approximate factor analysis problem can be posed as follows.

Problem 2.1. Given the positive covariance matrix Σ0 ∈ R^{n×n} and the integer k < n, minimize

    D(\Sigma_0 \| HH^\top + D) = \frac{1}{2} \log \frac{|HH^\top + D|}{|\Sigma_0|} + \frac{1}{2} \operatorname{tr}\big((HH^\top + D)^{-1} \Sigma_0\big) - \frac{n}{2}

over all pairs (H, D), where H ∈ R^{n×k} and D > 0 is of size n and diagonal.

Notice that D(Σ1 || Σ2), computed as in (2.2), can be considered as a divergence between two positive definite matrices, without referring to normal distributions. Hence Problem 2.1 retains a meaning when one refrains from assumptions like normality. Existence of the minimum is guaranteed by the following proposition.

Proposition 2.2. There exist matrices H∗ ∈ R^{n×k} and D∗ > 0 of size n and diagonal minimizing the I-divergence in Problem 2.1.

The proof is deferred to section 4.2, since it uses later results. In order to construct an algorithm for the solution of Problem 2.1 we will imitate the approach of [6]. The algorithm will therefore be derived by a relaxation technique, lifting the original minimization problem to a higher dimensional space. In the larger space a double minimization problem equivalent to Problem 2.1 can be formulated, leading in a natural way to an alternating minimization algorithm.
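Formula (2.2) is straightforward to evaluate numerically. The following sketch (not in the paper; function and variable names are ours) computes the I-divergence of (2.2) and evaluates the objective of Problem 2.1 at an arbitrary pair (H, D):

import numpy as np

def i_divergence(S1, S2):
    """I-divergence (2.2) between zero mean normal laws with covariances S1, S2 > 0."""
    n = S1.shape[0]
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (logdet2 - logdet1) + 0.5 * np.trace(np.linalg.solve(S2, S1)) - n / 2

rng = np.random.default_rng(1)
n, k = 5, 2
A = rng.normal(size=(n, n))
Sigma0 = A @ A.T + n * np.eye(n)             # a strictly positive definite Sigma0
H = rng.normal(size=(n, k))
D = np.diag(rng.uniform(0.5, 1.5, size=n))

print(i_divergence(Sigma0, Sigma0))          # 0.0 up to rounding
print(i_divergence(Sigma0, H @ H.T + D))     # objective of Problem 2.1 at (H, D)

Using slogdet and solve rather than explicit determinants and inverses keeps the evaluation stable for ill-conditioned covariances.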
3 Lifting of the original problem

In this section we embed Problem 2.1 into a higher dimensional space. First we introduce the relevant sets of covariances. Given k < n we denote by

    \mathbf{\Sigma} = \Big\{ \Sigma \in \mathbb{R}^{(n+k)\times(n+k)} : \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} > 0 \Big\},    (3.1)

where Σ11 is n × n. Two subsets of 𝚺 will play a special role. The first is

    \mathbf{\Sigma}_0 = \{ \Sigma \in \mathbf{\Sigma} : \Sigma_{11} = \Sigma_0 \},    (3.2)

where Σ0 is a given covariance. We also consider the subset

    \mathbf{\Sigma}_1 = \Big\{ \Sigma \in \mathbf{\Sigma} : \Sigma = \begin{pmatrix} HH^\top + D & HQ \\ (HQ)^\top & Q^\top Q \end{pmatrix} \Big\},    (3.3)

where H ∈ R^{n×k}, Q ∈ R^{k×k} is invertible, and D > 0 is diagonal. Elements of 𝚺1 will often be denoted by Σ(H, D, Q).

Remark 3.1. Notice that a normal vector Y, with Cov(Y) = Σ0, admits a FA model of size k iff 𝚺0 ∩ 𝚺1 ≠ ∅.

Supposing that this is the case, take Σ ∈ 𝚺0 ∩ 𝚺1; then, for some (H, D, Q),

    \Sigma = \begin{pmatrix} \Sigma_0 & HQ \\ (HQ)^\top & Q^\top Q \end{pmatrix} = \begin{pmatrix} HH^\top + D & HQ \\ (HQ)^\top & Q^\top Q \end{pmatrix} > 0

is a bona fide covariance of a normal vector V of dimension n + k. Partition V = (Y⊤, Z⊤)⊤. It is easy to verify that Cov(Y) = Σ0 = HH⊤ + D is the same as Cov(HX + ε) for some X standard normal and ε normal, independent of X, with diagonal covariance D.

The lifted minimization problem can be posed as follows.

Problem 3.2.

    \min_{\Sigma' \in \mathbf{\Sigma}_0,\, \Sigma_1 \in \mathbf{\Sigma}_1} D(\Sigma' \| \Sigma_1),

which can be viewed as an iterated minimization problem over each of the variables. The two resulting partial minimization problems will be investigated in the following sections. In section 3.3 we show the connection between Problems 2.1 and 3.2. More precisely, we will prove

Proposition 3.3. Let Σ0 be given. It holds that

    \min_{H, D} D(\Sigma_0 \| HH^\top + D) = \min_{\Sigma' \in \mathbf{\Sigma}_0,\, \Sigma_1 \in \mathbf{\Sigma}_1} D(\Sigma' \| \Sigma_1).

3.1 The first partial minimization problem

In this section we consider the first of the two partial minimization problems: we minimize, for a given positive definite matrix Σ ∈ 𝚺, the divergence D(Σ′ || Σ) over Σ′ ∈ 𝚺0. The unique solution to this problem can be computed analytically and follows from

Lemma 3.4. Let (Y, X) be a random vector distributed according to some Q = Q_{Y,X} and let 𝒫 be the set of all distributions P = P_{Y,X} whose marginal satisfies P_Y = P0, for some fixed P0 ≪ Q_Y. Then

    \min_{P \in \mathcal{P}} D(P \| Q) = D(P^* \| Q),

where P∗ is given by the Radon-Nikodym derivative dP∗/dQ = dP0/dQ_Y. Moreover,

    D(P^* \| Q) = D(P_0 \| Q_Y),    (3.4)

and, for any other P ∈ 𝒫, one has the Pythagorean law

    D(P \| Q) = D(P \| P^*) + D(P^* \| Q).    (3.5)

Proof. First we show that (3.4) holds. Recall that Y has law P0 under P∗; then

    D(P^* \| Q) = \mathbb{E}_{P^*} \log \frac{dP^*}{dQ} = \mathbb{E}_{P^*} \log \frac{dP_0}{dQ_Y} = \mathbb{E}_{P_0} \log \frac{dP_0}{dQ_Y} = D(P_0 \| Q_Y).

To show that P∗ is a minimizer it is clearly sufficient to prove that (3.5) holds:

    D(P \| Q) = \mathbb{E}_P \log \frac{dP}{dP^*} + \mathbb{E}_P \log \frac{dP^*}{dQ} = D(P \| P^*) + \mathbb{E}_P \log \frac{dP_0}{dQ_Y} = D(P \| P^*) + \mathbb{E}_{P_0} \log \frac{dP_0}{dQ_Y} = D(P \| P^*) + D(P^* \| Q),

where we used the fact that all P ∈ 𝒫 have marginal P_Y = P0. □

Remark 3.5. The law P∗ is easily characterized in terms of the problem data P0 and Q by noticing that the marginal P∗_Y = P0 and the conditional P∗_{X|Y} = Q_{X|Y}.

We now apply Lemma 3.4 to the case of normal laws and solve the first partial minimization. See also [2] for a different proof.

Proposition 3.6. Let Q and P0 be zero mean normal laws with strictly positive covariances Σ ∈ 𝚺 and Σ0 ∈ R^{n×n} respectively. Then min_{Σ′∈𝚺0} D(Σ′ || Σ) is attained by the zero mean normal law P∗ with covariance

    \Sigma^* = \begin{pmatrix} \Sigma_0 & \Sigma_0 \Sigma_{11}^{-1} \Sigma_{12} \\ \Sigma_{21} \Sigma_{11}^{-1} \Sigma_0 & \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} (\Sigma_{11} - \Sigma_0) \Sigma_{11}^{-1} \Sigma_{12} \end{pmatrix} > 0.

Moreover, D(Σ∗ || Σ) = D(Σ0 || Σ11).

Proof. This follows from Remark 3.5. A direct computation gives the lower left block

    \Sigma^*_{21} = \mathbb{E}_{P^*} XY^\top = \mathbb{E}_{P^*}\big(\mathbb{E}_{P^*}[X|Y] Y^\top\big) = \mathbb{E}_{P^*}\big(\mathbb{E}_Q[X|Y] Y^\top\big) = \mathbb{E}_{P^*}\big(\Sigma_{21} \Sigma_{11}^{-1} Y Y^\top\big) = \Sigma_{21} \Sigma_{11}^{-1} \mathbb{E}_{P_0} Y Y^\top = \Sigma_{21} \Sigma_{11}^{-1} \Sigma_0.

Likewise, we have

    \Sigma^*_{22} = \mathbb{E}_{P^*} XX^\top = \operatorname{Cov}_{P^*}(X|Y) + \mathbb{E}_{P^*}\big(\mathbb{E}_{P^*}[X|Y] \mathbb{E}_{P^*}[X|Y]^\top\big) = \operatorname{Cov}_Q(X|Y) + \mathbb{E}_{P^*}\big(\Sigma_{21} \Sigma_{11}^{-1} Y (\Sigma_{21} \Sigma_{11}^{-1} Y)^\top\big)
        = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} + \mathbb{E}_{P_0}\big(\Sigma_{21} \Sigma_{11}^{-1} Y Y^\top \Sigma_{11}^{-1} \Sigma_{12}\big) = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} + \Sigma_{21} \Sigma_{11}^{-1} \Sigma_0 \Sigma_{11}^{-1} \Sigma_{12}.

Notice that, since Σ > 0 by assumption,

    \Sigma^*_{22} - \Sigma^*_{21} (\Sigma^*_{11})^{-1} \Sigma^*_{12} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} > 0,

which, together with the assumption Σ0 > 0, shows that Σ∗ > 0. The last relation, D(Σ∗ || Σ) = D(Σ0 || Σ11), reflects equation (3.4). □
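Proposition 3.6 is easy to check numerically: build Σ∗ from the stated blocks and compare D(Σ∗ || Σ) with D(Σ0 || Σ11). A small sketch (ours, with random test matrices standing in for the problem data):

import numpy as np

def i_div(S1, S2):
    # I-divergence (2.2), as in the earlier sketch
    n = S1.shape[0]
    return 0.5 * (np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
                  + np.trace(np.linalg.solve(S2, S1)) - n)

rng = np.random.default_rng(2)
n, k = 4, 2
A = rng.normal(size=(n + k, n + k))
S = A @ A.T + (n + k) * np.eye(n + k)        # a generic element of the set bold-Sigma
B = rng.normal(size=(n, n))
Sigma0 = B @ B.T + n * np.eye(n)             # the given covariance

S11, S12 = S[:n, :n], S[:n, n:]
S21, S22 = S[n:, :n], S[n:, n:]

T = np.linalg.solve(S11, S12)                # Sigma_11^{-1} Sigma_12
Sstar = np.block([[Sigma0,       Sigma0 @ T],
                  [T.T @ Sigma0, S22 - T.T @ (S11 - Sigma0) @ T]])

# D(Sigma*||Sigma) = D(Sigma_0||Sigma_11), the normal form of equation (3.4)
print(np.isclose(i_div(Sstar, S), i_div(Sigma0, S11)))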
3.2 The second partial minimization problem

In this section we turn to the second partial minimization problem: we minimize, for a given positive definite matrix Σ ∈ 𝚺, the divergence D(Σ || Σ1) over Σ1 ∈ 𝚺1. Clearly this problem cannot have a unique solution in terms of the matrices H and Q. Indeed, if U is a unitary k × k matrix and H′ = HU, Q′ = U⊤Q, then H′H′⊤ = HH⊤, Q′⊤Q′ = Q⊤Q and H′Q′ = HQ. Nevertheless, the optimal matrices HH⊤, HQ and Q⊤Q are unique, as we will see in Proposition 3.7. First we need to introduce some notation and conventions. If P is a positive definite matrix, we denote by P^{1/2} any matrix satisfying (P^{1/2})⊤(P^{1/2}) = P, and by P^{-1/2} its inverse. If M is any square matrix, we denote by ∆(M) the diagonal matrix with ∆(M)_{ii} = M_{ii}. Recall that we denote by Σ(H, D, Q) a typical element of 𝚺1.

Proposition 3.7. Given Σ ∈ 𝚺, the problem min_{Σ1∈𝚺1} D(Σ || Σ1) is solved by

    Q^* = \Sigma_{22}^{1/2}, \qquad H^* = \Sigma_{12} \Sigma_{22}^{-1/2}, \qquad D^* = \Delta(\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}).

Thus the minimizing matrix Σ∗1 = Σ(H∗, D∗, Q∗) becomes

    \Sigma_1^* = \begin{pmatrix} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} + \Delta(\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}) & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.    (3.6)

Moreover, the Pythagorean law

    D(\Sigma \| \Sigma(H, D, Q)) = D(\Sigma \| \Sigma_1^*) + D(\Sigma_1^* \| \Sigma(H, D, Q))    (3.7)

holds for any Σ(H, D, Q) ∈ 𝚺1, and therefore Σ∗1 is unique.

Proof. It is sufficient to show the validity of (3.7). We first compute 2D(Σ || Σ(H, D, Q)) − 2D(Σ || Σ∗1). It follows from Lemma A.1 that |Σ(H, D, Q)| = |D| · |Q⊤Q|. In view of equation (2.2) the above difference becomes

    \log|D| + \log|Q^\top Q| - \log|D^*| - \log|Q^{*\top} Q^*| + \operatorname{tr}\big(\Sigma(H, D, Q)^{-1} \Sigma\big) - \operatorname{tr}\big(\Sigma_1^{*-1} \Sigma\big).    (3.8)

Using Corollary A.2, we compute

    \Sigma(H, D, Q)^{-1} = \begin{pmatrix} D^{-1} & -D^{-1} H Q^{-\top} \\ -Q^{-1} H^\top D^{-1} & Q^{-1}(H^\top D^{-1} H + I) Q^{-\top} \end{pmatrix},    (3.9)

and hence we get that

    \operatorname{tr}\big(\Sigma(H, D, Q)^{-1} \Sigma\big) = \operatorname{tr}\big(D^{-1}(\Sigma_{11} - H Q^{-\top} \Sigma_{21})\big) + \operatorname{tr}\big({-Q^{-1}} H^\top D^{-1} \Sigma_{12} + Q^{-1}(H^\top D^{-1} H + I) Q^{-\top} \Sigma_{22}\big)
        = \operatorname{tr}\big(D^{-1}(\Sigma_{11} - 2 H Q^{-\top} \Sigma_{21})\big) + \operatorname{tr}\big(Q^{-1}(H^\top D^{-1} H + I) Q^{-\top} \Sigma_{22}\big).    (3.10)

Apply now Lemma A.1 to (3.6) and write ∆ = ∆(Σ11 − Σ12Σ22^{-1}Σ21), to get

    \Sigma_1^{*-1} = \Sigma(H^*, D^*, Q^*)^{-1} = \begin{pmatrix} \Delta^{-1} & -\Delta^{-1} \Sigma_{12} \Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1} \Sigma_{21} \Delta^{-1} & \Sigma_{22}^{-1} \Sigma_{21} \Delta^{-1} \Sigma_{12} \Sigma_{22}^{-1} + \Sigma_{22}^{-1} \end{pmatrix}.

Therefore

    \operatorname{tr}\big(\Sigma_1^{*-1} \Sigma\big) = \operatorname{tr}\big(\Delta^{-1}(\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})\big) + \operatorname{tr} I_k = \operatorname{tr}(\Delta^{-1} \Delta) + k = n + k.    (3.11)

Combining equations (3.8), (3.10), and (3.11), we find that

    2D(\Sigma \| \Sigma(H, D, Q)) - 2D(\Sigma \| \Sigma_1^*) = \log|D| + \log|Q^\top Q| - \log|D^*| - \log|Q^{*\top} Q^*| + \operatorname{tr}\big(D^{-1}(\Sigma_{11} - 2 H Q^{-\top} \Sigma_{21})\big) + \operatorname{tr}\big(Q^{-1}(H^\top D^{-1} H + I) Q^{-\top} \Sigma_{22}\big) - (n + k).    (3.12)

We proceed with the computation of 2D(Σ∗1 || Σ(H, D, Q)):

    2D(\Sigma(H^*, D^*, Q^*) \| \Sigma(H, D, Q)) = \log|D| + \log|Q^\top Q| - \log|D^*| - \log|Q^{*\top} Q^*| - (n + k) + \operatorname{tr}\big(\Sigma(H, D, Q)^{-1} \Sigma(H^*, D^*, Q^*)\big).    (3.13)

Combining equations (3.6) and (3.9) with tr(D^{-1}(Σ12Σ22^{-1}Σ21 + ∆)) = tr(D^{-1}Σ11), we obtain

    \operatorname{tr}\big(\Sigma(H, D, Q)^{-1} \Sigma(H^*, D^*, Q^*)\big) = \operatorname{tr}(D^{-1} \Sigma_{11}) - 2 \operatorname{tr}(D^{-1} H Q^{-\top} \Sigma_{21}) + \operatorname{tr}\big(Q^{-1}(H^\top D^{-1} H + I) Q^{-\top} \Sigma_{22}\big).    (3.14)

Insertion of (3.14) into (3.13) and a comparison with (3.12) yields the result. □
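Since the minimizer is fully explicit, Proposition 3.7 and the Pythagorean law (3.7) can also be verified numerically. In the sketch below (ours, on random test data) we take Q∗ as an upper-triangular Cholesky factor, one admissible choice under the paper's square-root convention Q⊤Q = Σ22:

import numpy as np

def i_div(S1, S2):
    n = S1.shape[0]
    return 0.5 * (np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
                  + np.trace(np.linalg.solve(S2, S1)) - n)

def sigma_of(H, D, Q):
    # assemble Sigma(H, D, Q) as in (3.3)
    return np.block([[H @ H.T + D, H @ Q], [(H @ Q).T, Q.T @ Q]])

def second_partial_min(S, n):
    """Proposition 3.7: the minimizer (H*, D*, Q*) of D(S || .) over bold-Sigma_1."""
    S11, S12, S21, S22 = S[:n, :n], S[:n, n:], S[n:, :n], S[n:, n:]
    Q = np.linalg.cholesky(S22).T                    # Q^T Q = S22
    H = S12 @ np.linalg.inv(Q)                       # H* = S12 S22^{-1/2}
    D = np.diag(np.diag(S11 - S12 @ np.linalg.solve(S22, S21)))  # Delta(...)
    return H, D, Q

rng = np.random.default_rng(3)
n, k = 4, 2
A = rng.normal(size=(n + k, n + k))
S = A @ A.T + (n + k) * np.eye(n + k)

Hs, Ds, Qs = second_partial_min(S, n)
S1star = sigma_of(Hs, Ds, Qs)

# Pythagorean law (3.7) against an arbitrary competitor Sigma(H, D, Q):
H = rng.normal(size=(n, k))
D = np.diag(rng.uniform(0.5, 1.5, size=n))
Q = rng.normal(size=(k, k))                          # invertible with probability one
lhs = i_div(S, sigma_of(H, D, Q))
rhs = i_div(S, S1star) + i_div(S1star, sigma_of(H, D, Q))
print(np.isclose(lhs, rhs))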
Remark 3.8. Notice that the matrix H∗H∗⊤ is strictly dominated by Σ11 (in the sense of positive definite matrices). This easily follows from Σ11 − H∗H∗⊤ = Σ11 − Σ12Σ22^{-1}Σ21 > 0 and the assumption Σ > 0. By the same token D∗ > 0.

3.3 The link to the original problem

We now establish the connection between the lifted problem and the original Problem 2.1.

Proof of Proposition 3.3. Let Σ1 = Σ(H, D, Q) and denote by Σ∗ = Σ∗(Σ1) the solution of the first partial minimization over 𝚺0. We have, for all Σ′ ∈ 𝚺0,

    D(\Sigma' \| \Sigma_1) \ge D(\Sigma^* \| \Sigma_1) = D(\Sigma_0 \| HH^\top + D) \ge \min_{H, D} D(\Sigma_0 \| HH^\top + D),

where we used Proposition 2.2 to write min on the right-hand side. It follows that

    \inf_{\Sigma' \in \mathbf{\Sigma}_0,\, \Sigma_1 \in \mathbf{\Sigma}_1} D(\Sigma' \| \Sigma_1) \ge \min_{H, D} D(\Sigma_0 \| HH^\top + D).

Conversely, let (H∗, D∗) be the minimizer of (H, D) ↦ D(Σ0 || HH⊤ + D), pick an arbitrary invertible Q∗, and let Σ∗ = Σ(H∗, D∗, Q∗) be the corresponding element of 𝚺1. Furthermore, let Σ∗∗ ∈ 𝚺0 be the minimizer of Σ ↦ D(Σ || Σ∗) over 𝚺0. Then

    \min_{H, D} D(\Sigma_0 \| HH^\top + D) = D(\Sigma_0 \| H^* H^{*\top} + D^*) \ge D(\Sigma^{**} \| \Sigma^*) \ge \inf_{\Sigma' \in \mathbf{\Sigma}_0,\, \Sigma_1 \in \mathbf{\Sigma}_1} D(\Sigma' \| \Sigma_1),

which shows the opposite inequality. Finally, to show that we can replace the infima with minima also in the lifted problem, notice that (see Proposition 3.6) D(Σ∗∗ || Σ∗) = D(Σ0 || H∗H∗⊤ + D∗). □

4 Alternating minimization algorithm

In this section we combine the two partial minimization problems above to derive an iterative algorithm for Problem 2.1. It turns out that this algorithm is also instrumental in proving the existence of a solution to Problem 2.1.

4.1 The algorithm

We suppose that the given matrix Σ0 is strictly positive definite. Pick initial values H0, D0, Q0 such that H0 is of full rank, D0 > 0 is diagonal, and Q0 and H0H0⊤ + D0 are invertible. At the t-th iteration the matrices H_t, D_t and Q_t are available. Start by solving the first partial minimization problem with Σ = Σ(H_t, D_t, Q_t). Use the resulting matrix as data for the second partial minimization, the solution of which gives the update rules

    Q_{t+1} = \big( Q_t^\top Q_t - Q_t^\top H_t^\top (H_t H_t^\top + D_t)^{-1} H_t Q_t + Q_t^\top H_t^\top (H_t H_t^\top + D_t)^{-1} \Sigma_0 (H_t H_t^\top + D_t)^{-1} H_t Q_t \big)^{1/2},    (4.1)

    H_{t+1} = \Sigma_0 (H_t H_t^\top + D_t)^{-1} H_t Q_t Q_{t+1}^{-1},    (4.2)

    D_{t+1} = \Delta(\Sigma_0 - H_{t+1} H_{t+1}^\top).    (4.3)

In (4.1) there is some freedom in computing the square root that determines Q_{t+1}. Properly choosing the square root will result in the disappearance of Q_t from the algorithm. This is an attractive feature, since Q_t only serves as an auxiliary variable. One can write the right-hand side of equation (4.1), before taking the square root, as

    Q_t^\top \big( I - H_t^\top (H_t H_t^\top + D_t)^{-1} (H_t H_t^\top + D_t - \Sigma_0)(H_t H_t^\top + D_t)^{-1} H_t \big) Q_t,

and, denoting

    R_t = I - H_t^\top (H_t H_t^\top + D_t)^{-1} (H_t H_t^\top + D_t - \Sigma_0)(H_t H_t^\top + D_t)^{-1} H_t,    (4.4)

a possible square root is given by R_t^{1/2} Q_t. Notice that R_t only involves the iterates H_t and D_t. The update equation (4.2) can therefore be rewritten as

    H_{t+1} = \Sigma_0 (H_t H_t^\top + D_t)^{-1} H_t R_t^{-1/2}.    (4.5)

The final version of the algorithm is given by equations (4.3), (4.4), and (4.5), which, for clarity, we present as

Algorithm 4.1.

    R_t = I - H_t^\top (H_t H_t^\top + D_t)^{-1} (H_t H_t^\top + D_t - \Sigma_0)(H_t H_t^\top + D_t)^{-1} H_t,    (4.6)
    H_{t+1} = \Sigma_0 (H_t H_t^\top + D_t)^{-1} H_t R_t^{-1/2},    (4.7)
    D_{t+1} = \Delta(\Sigma_0 - H_{t+1} H_{t+1}^\top).    (4.8)
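A minimal NumPy implementation of Algorithm 4.1 follows (ours, on random test data); it uses the symmetric square root for R_t^{-1/2}, one admissible choice under the convention above. At convergence the iterates should satisfy the fixed-point relation H = (Σ0 − HH⊤)D^{-1}H of Proposition 4.3(g) below:

import numpy as np

def inv_sqrt_sym(R):
    # R^{-1/2} via the symmetric square root (one admissible choice of root)
    w, V = np.linalg.eigh(R)
    return V @ np.diag(w ** -0.5) @ V.T

def fa_step(Sigma0, H, D):
    """One sweep of Algorithm 4.1, equations (4.6)-(4.8)."""
    M = H @ H.T + D                                     # H_t H_t^T + D_t
    W = np.linalg.solve(M, H)                           # (H_t H_t^T + D_t)^{-1} H_t
    R = np.eye(H.shape[1]) - W.T @ (M - Sigma0) @ W     # (4.6)
    H_new = Sigma0 @ W @ inv_sqrt_sym(R)                # (4.7)
    D_new = np.diag(np.diag(Sigma0 - H_new @ H_new.T))  # (4.8)
    return H_new, D_new

rng = np.random.default_rng(4)
n, k = 6, 2
A = rng.normal(size=(n, n))
Sigma0 = A @ A.T + n * np.eye(n)

H = rng.normal(size=(n, k))                             # full column rank start
D = np.diag(np.diag(Sigma0))                            # a simple positive diagonal start
for _ in range(1000):
    H, D = fa_step(Sigma0, H, D)

# residual of the stationarity relation H = (Sigma0 - H H^T) D^{-1} H, small at convergence
print(np.abs(H - (Sigma0 - H @ H.T) @ np.linalg.solve(D, H)).max())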
In order to avoid taking a square root at each step one can introduce the matrices K_t = H_tQ_t and P_t = Q_t⊤Q_t and write the updates for K_t and P_t. Equations (4.1), (4.2), and (4.3) easily give

Algorithm 4.2.

    K_{t+1} = \Sigma_0 (K_t P_t^{-1} K_t^\top + D_t)^{-1} K_t,    (4.9)
    P_{t+1} = P_t - K_t^\top (K_t P_t^{-1} K_t^\top + D_t)^{-1} (K_t P_t^{-1} K_t^\top - \Sigma_0)(K_t P_t^{-1} K_t^\top + D_t)^{-1} K_t,
    D_{t+1} = \Delta(\Sigma_0 - K_{t+1} P_{t+1}^{-1} K_{t+1}^\top).

After the final iteration, the T-th say, one can take H_T = K_T Q_T^{-1}, where Q_T is a square root of P_T.

Notice that both Algorithm 4.1 and Algorithm 4.2 require the inversion of n × n matrices (e.g. (H_tH_t⊤ + D_t)^{-1}). Applying Corollary A.2 one gets (H_tH_t⊤ + D_t)^{-1}H_t = D_t^{-1}H_t(I + H_t⊤D_t^{-1}H_t)^{-1}, which only requires inverting the diagonal matrix D_t and a k × k matrix. Hence we can replace e.g. (4.5) with

    H_{t+1} = \Sigma_0 D_t^{-1} H_t (I + H_t^\top D_t^{-1} H_t)^{-1} R_t^{-1/2}.    (4.10)

By the same token one can write K_{t+1} = Σ0 D_t^{-1} K_t (P_t + K_t⊤D_t^{-1}K_t)^{-1} P_t to replace (4.9).

Some properties of the algorithm are summarized in the next proposition.

Proposition 4.3. For Algorithm 4.1 the following hold for all t.

(a) D_t > 0 and (D_t)_{ii} ≤ (Σ0)_{ii}.
(b) R_t is invertible.
(c) If H_0 is of full column rank, so is H_t.
(d) H_tH_t⊤ ≤ Σ0.
(e) If Σ0 = H_0H_0⊤ + D_0 then the algorithm stops.
(f) The objective function decreases at each iteration. More precisely, let Σ_{0,t} be the solution of the first partial minimization with data Σ_t = Σ(H_t, D_t, Q_t). Then

    D(\Sigma_0 \| H_{t+1} H_{t+1}^\top + D_{t+1}) - D(\Sigma_0 \| H_t H_t^\top + D_t) = -D(\Sigma_{t+1} \| \Sigma_t) - D(\Sigma_{0,t} \| \Sigma_{0,t+1}).

(g) The limit points (H, D) of the algorithm satisfy the relations

    H = (\Sigma_0 - HH^\top) D^{-1} H, \qquad D = \Delta(\Sigma_0 - HH^\top).

Proof.
(a) This follows from Remark 3.8.
(b) Use the identity I − H_t⊤(H_tH_t⊤ + D_t)^{-1}H_t = (I + H_t⊤D_t^{-1}H_t)^{-1} and the assumption Σ0 > 0.
(c) Use the assumption Σ0 > 0, (a), and (b).
(d) Again from Remark 3.8 and the construction of the algorithm as a combination of the two partial minimization problems.
(e) This is a triviality upon noticing that one can take R_t = I in this case.
(f) It follows from a concatenation of Lemma 3.4 and Proposition 3.7. Notice that we can express the decrease as the sum of two I-divergences, since the Pythagorean law holds for both partial minimizations.
(g) We consider Algorithm 4.2 first. Assume that all variables converge. Then, from (4.9), the limit points K, P, D satisfy the relation K = Σ0 D^{-1} K (P + K⊤D^{-1}K)^{-1} P. Postmultiplication by P^{-1}(P + K⊤D^{-1}K) yields, after rearranging terms, K = (Σ0 − KP^{-1}K⊤) D^{-1} K. Now let Q be a square root of P and H = KQ^{-1} to get the first relation. The rest is trivial. □
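The square-root-free variant is equally short. Below is a sketch of Algorithm 4.2 (ours, on random test data) that prints the Problem 2.1 objective every 100 sweeps to exhibit the monotone decrease of Proposition 4.3(f), and recovers H_T at the end:

import numpy as np

def i_div(S1, S2):
    n = S1.shape[0]
    return 0.5 * (np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
                  + np.trace(np.linalg.solve(S2, S1)) - n)

def fa_step_sqrt_free(Sigma0, K, P, D):
    """One sweep of Algorithm 4.2 (K_t = H_t Q_t, P_t = Q_t^T Q_t): no square roots."""
    M = K @ np.linalg.solve(P, K.T) + D      # K_t P_t^{-1} K_t^T + D_t
    W = np.linalg.solve(M, K)                # M^{-1} K_t
    K_new = Sigma0 @ W                       # (4.9)
    P_new = P - W.T @ (M - Sigma0) @ W
    D_new = np.diag(np.diag(Sigma0 - K_new @ np.linalg.solve(P_new, K_new.T)))
    return K_new, P_new, D_new

rng = np.random.default_rng(5)
n, k = 6, 2
A = rng.normal(size=(n, n))
Sigma0 = A @ A.T + n * np.eye(n)

K = rng.normal(size=(n, k))
P = np.eye(k)
D = np.diag(np.diag(Sigma0))
for t in range(300):
    K, P, D = fa_step_sqrt_free(Sigma0, K, P, D)
    if t % 100 == 0:                         # objective decreases, Prop. 4.3(f)
        print(i_div(Sigma0, K @ np.linalg.solve(P, K.T) + D))

Q_T = np.linalg.cholesky(P).T                # a square root of P_T
H_T = K @ np.linalg.inv(Q_T)                 # recover H_T = K_T Q_T^{-1}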
4.2 Proof of Proposition 2.2

Let D_0 and H_0 be arbitrary and perform one step of the algorithm to get matrices D_1 and H_1. It follows from Proposition 4.3 that D(Σ0 || H_1H_1⊤ + D_1) ≤ D(Σ0 || H_0H_0⊤ + D_0). Moreover, H_1H_1⊤ ≤ Σ0 and D_1 ≤ ∆(Σ0). Hence the search for a minimum can be confined to the set of matrices (H, D) satisfying HH⊤ ≤ Σ0 and D ≤ ∆(Σ0).

Next, we claim that it is also sufficient to restrict the search for a minimum to all matrices (H, D) such that HH⊤ + D ≥ εI for some sufficiently small ε > 0. Indeed, suppose the last inequality is violated, so that HH⊤ + D has an eigenvalue less than ε. Write the spectral decomposition HH⊤ + D = UΛU⊤ and let Σ_U = U⊤Σ0U. Then D(Σ0 || HH⊤ + D) = D(Σ_U || Λ), as one easily verifies. Denoting by λ_i the eigenvalues of HH⊤ + D and by σ_ii the diagonal elements of Σ_U, we can write

    D(\Sigma_U \| \Lambda) = -\frac{1}{2} \log|\Sigma_U| + \frac{1}{2} \sum_i \log \lambda_i + \frac{1}{2} \sum_i \frac{\sigma_{ii}}{\lambda_i} - \frac{n}{2}.

Let λ_{i0} be a minimal eigenvalue and take ε smaller than the minimum of all σ_ii, which is positive, since Σ0 is strictly positive definite. Every term log λ_i + σ_ii/λ_i is bounded below by log ε + 1, whereas the contribution for i = i0 is at least log λ_{i0} + ε/λ_{i0}, which tends to infinity as λ_{i0} → 0. This proves the claim.

So, we have shown that a minimizing pair (H, D) has to satisfy HH⊤ ≤ Σ0, D ≤ ∆(Σ0), and HH⊤ + D ≥ εI for some ε > 0. In other words, we have to minimize the I-divergence over a compact set on which it is clearly continuous. This proves Proposition 2.2. □

A Appendix

For ease of reference we collect here some standard formulas for the normal distribution and some matrix algebra.

A.1 Multivariate normal distribution

Let (X⊤, Y⊤)⊤ be a zero mean normal vector with covariance matrix

    \Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}.

Assume that Σ_YY is invertible. The conditional law of X given Y is normal with

    \mathbb{E}[X|Y] = \Sigma_{XY} \Sigma_{YY}^{-1} Y \quad \text{and} \quad \operatorname{Cov}[X|Y] = \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}.    (A.1)

A.2 Partitioned matrices

Lemma A.1. Let A, D be square matrices. Assume invertibility where required. Then

    \begin{pmatrix} A & C \\ B & D \end{pmatrix} = \begin{pmatrix} I & CD^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A - CD^{-1}B & 0 \\ 0 & D \end{pmatrix} \begin{pmatrix} I & 0 \\ D^{-1}B & I \end{pmatrix},

    \begin{pmatrix} A & C \\ B & D \end{pmatrix} = \begin{pmatrix} I & 0 \\ BA^{-1} & I \end{pmatrix} \begin{pmatrix} A & 0 \\ 0 & D - BA^{-1}C \end{pmatrix} \begin{pmatrix} I & A^{-1}C \\ 0 & I \end{pmatrix},

    \begin{pmatrix} A & C \\ B & D \end{pmatrix}^{-1} = \begin{pmatrix} (A - CD^{-1}B)^{-1} & -(A - CD^{-1}B)^{-1} CD^{-1} \\ -D^{-1}B(A - CD^{-1}B)^{-1} & D^{-1}B(A - CD^{-1}B)^{-1}CD^{-1} + D^{-1} \end{pmatrix}.

Corollary A.2.

    (D - BAC)^{-1} = D^{-1} + D^{-1}B(A^{-1} - CD^{-1}B)^{-1}CD^{-1}.

Proof. For Lemma A.1 a check will suffice. The Corollary follows by using the two decompositions of the Lemma with A replaced by A^{-1} and comparing the two expressions for the lower right block of the inverse matrix. □
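Corollary A.2, a form of the matrix inversion lemma, can be spot-checked numerically; the block sizes and test matrices in this sketch are arbitrary:

import numpy as np

rng = np.random.default_rng(6)
p, q = 3, 4
A = rng.normal(size=(p, p)) + p * np.eye(p)   # generic invertible blocks
Dm = rng.normal(size=(q, q)) + q * np.eye(q)
B = rng.normal(size=(q, p))
C = rng.normal(size=(p, q))

# Corollary A.2: (D - B A C)^{-1} = D^{-1} + D^{-1} B (A^{-1} - C D^{-1} B)^{-1} C D^{-1}
lhs = np.linalg.inv(Dm - B @ A @ C)
Dinv = np.linalg.inv(Dm)
rhs = Dinv + Dinv @ B @ np.linalg.inv(np.linalg.inv(A) - C @ Dinv @ B) @ C @ Dinv
print(np.allclose(lhs, rhs))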
References

[1] T.W. Anderson (1984), An Introduction to Multivariate Statistical Analysis, 2nd ed., Wiley.
[2] E. Cramer (1998), Conditional iterative proportional fitting for Gaussian distributions, J. Multivariate Analysis, 65(2), 261–276.
[3] E. Cramer (2000), Probability measures with given marginals and conditionals: I-projections and conditional iterative proportional fitting, Statistics & Decisions, 18(3), 311–329.
[4] I. Csiszár and G. Tusnády (1984), Information geometry and alternating minimization procedures, Statistics & Decisions, Supplement Issue 1, 205–237.
[5] L. Finesso and G. Picci (1984), Linear statistical models and stochastic realization theory, in Analysis and Optimization of Systems, Part 1 (Nice, 1984), Lecture Notes in Control and Information Sciences 62, Springer, Berlin, 445–470.
[6] L. Finesso and P. Spreij (2006), Nonnegative matrix factorization and I-divergence alternating minimization, Linear Algebra and its Applications, 416, 270–287.