Multivariate data analysis: The French way


Authors: Susan Holmes

IMS Collections, Probability and Statistics: Essays in Honor of David A. Freedman, Vol. 2 (2008) 219–233. © Institute of Mathematical Statistics, 2008. DOI: 10.1214/193940307000000455

Susan Holmes (Stanford University)

Abstract: This paper presents exploratory techniques for multivariate data, many of them well known to French statisticians and ecologists, but few well understood in North American culture. We present the general framework of duality diagrams, which encompasses discriminant analysis, correspondence analysis and principal components, and we show how this framework can be generalized to the regression of graphs on covariates.

1. Motivation

David Freedman is well known for his interest in multivariate projections [5] and his skepticism with regard to model-based multivariate inference, in particular in cases where the number of variables and observations are of the same order (see Freedman and Peters [12, 13]).

Brought up in a completely foreign culture, I would like to share an alien approach to some modern multivariate statistics that is not well known in North American statistical culture. I have written the paper 'the French way', with theorems and abstract formulation at the beginning and examples in the latter sections; Americans are welcome to skip ahead to the motivating examples.

Some French statisticians, fed Bourbakist mathematics and category theory in the 60's and 70's as all mathematicians were in France at the time, suffered from abstraction envy. Having completely rejected the probabilistic enterprise as useless for practical reasons, they composed their own abstract framework for talking about data in a geometrical context. I will explain the framework known as the duality diagram developed by Cazes, Cailliez, Pagès, Escoufier and their followers.
I will try to show how aspects of the general framework are still useful today, and how much every idea, from Benzécri's correspondence analysis to Escoufier's conjoint analysis, has been rediscovered many times. Section 2.1 sets out the abstract picture. Sections 2.2–2.6 treat extensions of classical multivariate techniques — principal components analysis, instrumental variables, canonical correlation analysis, discriminant analysis and correspondence analysis — from this unified view. Section 3 shows how the methods apply to the analysis of network data.

*Supported by NSF Grant DMS-0241246.
1 Stanford University, Department of Statistics, Sequoia Hall, 390 Serra Mall, Stanford, CA 94305-4065, USA, e-mail: susan@stat.stanford.edu
AMS 2000 subject classifications: 62H25, 62H20.
Keywords and phrases: bootstrap, correspondence analysis, duality diagram, RV-coefficient, STATIS.

2. The duality diagram

Established by the French school of "Analyse des Données" in the early 1970's, this approach was only published in a few texts [1] and technical reports [9], none of which were translated into English. My Ph.D. advisor, Yves Escoufier [8, 10], publicized the method to biologists and ecologists, presenting a formulation based on his RV-coefficient that I will develop below. The first software implementation of the duality based methods described here was LEAS (1984), a Pascal program written for Apple II computers. The most recent implementation is the R package ade4 (see Appendix A for a review of various implementations of the methods described here).

2.1. Notation

The data are p variables measured on n observations. They are recorded in a matrix X with n rows (the observations) and p columns (the variables). D is an n × n matrix of weights on the "observations", which is most often diagonal.
We will also use a "neighborhood" relation (thought of as a metric on the observations) defined by taking a symmetric positive definite matrix Q. For example, to standardize the variables Q can be chosen as
$$Q = \operatorname{diag}\left(\frac{1}{\sigma_1^2}, \frac{1}{\sigma_2^2}, \ldots, \frac{1}{\sigma_p^2}\right).$$
These three matrices form the essential "triple" (X, Q, D) defining a multivariate data analysis. As the approach here is geometrical, it is important to see that Q and D define geometries, or inner products, in R^p and R^n respectively, through
$$x^t Q y = \langle x, y\rangle_Q,\quad x, y \in \mathbb{R}^p, \qquad\qquad v^t D w = \langle v, w\rangle_D,\quad v, w \in \mathbb{R}^n.$$
From these definitions we see that there is a close relation between this approach and kernel based methods; for more details see [24].

Q can be seen as a linear function from R^p to R^{p*} = L(R^p), the space of scalar linear functions on R^p. D can be seen as a linear function from R^n to R^{n*} = L(R^n). Escoufier [8] proposed to associate to a data set an operator from the space of observations R^p into the dual of the space of variables R^{n*}. This is summarized in the following diagram [1], which is made commutative by defining V and W as X^t D X and X Q X^t respectively (commutative just says that VQ = X^t D X Q and WD = X Q X^t D). We call VQ the characterizing operator of the diagram.
$$\begin{array}{ccc}
\mathbb{R}^{p*} & \xrightarrow{\;X\;} & \mathbb{R}^{n}\\
Q\,\uparrow\downarrow\,V & & D\,\downarrow\uparrow\,W\\
\mathbb{R}^{p} & \xleftarrow{\;X^{t}\;} & \mathbb{R}^{n*}
\end{array}$$
This is known as the duality diagram because knowledge of the eigendecomposition of X^t D X Q = VQ leads to that of the dual operator X Q X^t D. The main consequence is an easy transition between principal components and principal axes, as we will see in the next section. The terms duality diagram and triple are often used interchangeably.

Remarks.

1.
The duality diagram is equivalent to a triple of three matrices (X, Q, D) such that X is n × p and Q and D are symmetric matrices of the right size (Q is p × p and D is n × n). The operators defined as X Q X^t D = WD and X^t D X Q = VQ are called the characteristic operators of the diagram [8]. We say an operator O is B-symmetric if ⟨x, Oy⟩_B = ⟨Ox, y⟩_B, or equivalently B O = O^t B. In particular, VQ is Q-symmetric and WD is D-symmetric.

2. V = X^t D X will be the variance-covariance matrix if X is centered with regard to D (X^t D 1_n = 0) and D is the diagonal matrix with all elements equal to 1/n.

3. There is an important symmetry between the rows and columns of X in the diagram, and one can imagine situations where the role of observation or variable is not uniquely defined. For instance, in microarray studies the genes can be considered either as variables or as observations. This makes sense in many contemporary situations which evade the more classical notion of n observations seen as a random sample of a population. It is certainly not the case that the 30,000 probes are a sample of genes, since these probes try to be an exhaustive set.

2.1.1. Properties of the diagram

Here are some of the properties that prove useful in various settings:

• Rank of the diagram: X, X^t, VQ and WD all have the same rank r, which will usually be smaller than both n and p.

• For Q and D symmetric matrices, VQ and WD are diagonalisable and have the same eigenvalues. We denote them in decreasing order
$$\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots \geq \lambda_r \geq 0 = \cdots = 0.$$

• Eigendecomposition of the diagram: VQ is Q-symmetric, thus we can find Z such that
$$V Q Z = Z \Lambda, \qquad Z^t Q Z = I_p, \tag{2.1}$$
where Λ = diag(λ_1, λ_2, ..., λ_r, 0, ..., 0) and I_p is the identity matrix in R^p.
This generalized eigendecomposition of VQ is often called the (generalized) PCA of the triple (X, Q, D). In practical computations, we start by finding the Cholesky decompositions of Q and D, which exist as long as these matrices are symmetric and positive definite; call these H^t H = Q and K^t K = D, where H and K are upper triangular. Then we can use the singular value decomposition of K X H^t:
$$K X H^t = U S T^t, \qquad T^t T = I_p,\quad U^t U = I_n,\quad S \text{ diagonal},$$
to give us
$$X = K^{-1} U S T^t (H^t)^{-1} = K^{-1} U S T^t (H^{-1})^t \qquad\text{and}\qquad X^t = H^{-1} T S U^t (K^t)^{-1}.$$
Thus H X^t D X H^t = T S^2 T^t = T \Lambda T^t with Λ = S², and finally we can see that Z = H^{-1} T satisfies (2.1). The renormalized columns of Z, A = Z S, are called the principal axes and satisfy
$$A^t Q A = \Lambda.$$
Similarly, we can define L = K^{-1} U, which satisfies
$$W D L = L \Lambda, \qquad L^t D L = I_n, \tag{2.2}$$
where Λ = diag(λ_1, λ_2, ..., λ_r, 0, ..., 0). C = L S is usually called the matrix of principal components. It is normed so that C^t D C = Λ. When we impose that C or Z be of reduced rank q < min(n, p), we take just their first q columns, and have thus achieved what is known as the generalized PCA of rank q.

• Transition formulæ: Of the four matrices Z, A, L and C we only have to compute one; all the others are obtained by the transition formulæ provided by the duality property of the diagram:
$$X Q Z = L S = C, \qquad X^t D L = Z S = A.$$

• The trace Tr(VQ) = Tr(WD) is often called the inertia of the diagram (inertia in the sense of Huyghens' inertia formula, for instance), the inertia with regard to a point a of a cloud of p_i-weighted points being $\sum_{i=1}^{n} p_i\, d^2(x_i, a)$. When we look at ordinary PCA with Q = I_p, D = (1/n) I_n, and the variables centered, the inertia is the sum of the variances of all the variables.
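As a numerical illustration of the Cholesky-plus-SVD recipe above, here is a minimal sketch in Python/NumPy (the data, metric and weights are simulated and purely illustrative, not part of the paper's analyses):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 4
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                      # center the variables

Q = np.diag(1.0 / X.var(axis=0))         # standardizing metric on R^p
D = np.eye(n) / n                        # uniform weights on the observations

# Cholesky factors H^t H = Q and K^t K = D, with H and K upper triangular.
H = np.linalg.cholesky(Q).T
K = np.linalg.cholesky(D).T

# The SVD of K X H^t yields the generalized PCA of the triple (X, Q, D).
U, s, Tt = np.linalg.svd(K @ X @ H.T, full_matrices=False)
T = Tt.T

Z = np.linalg.solve(H, T)                # Z = H^{-1} T
L = np.linalg.solve(K, U)                # L = K^{-1} U
Lam = np.diag(s**2)                      # Λ = S^2

# Check the eigen-equation (2.1) and the transition formula X Q Z = L S = C.
print(np.allclose(X.T @ D @ X @ Q @ Z, Z @ Lam))   # V Q Z = Z Λ → True
print(np.allclose(X @ Q @ Z, L @ np.diag(s)))      # C = L S    → True
```

The normalizations Z^t Q Z = I_p and L^t D L = I_n can be checked the same way.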
If the variables are standardized (Q is the diagonal matrix of inverse variances), then the inertia is the number of variables p.

2.2. Comparing two diagrams: the RV coefficient

Many problems can be rephrased in terms of comparisons of two "duality diagrams" or, put more simply, two characterizing operators built from two "triples", usually with one of the triples being a response or having constraints imposed on it. We usually try to make one triple match the other in some optimal way. To compare two symmetric operators, there is either a vector covariance
$$\operatorname{covV}(O_1, O_2) = \operatorname{Tr}(O_1^t O_2)$$
or a vector correlation [8]
$$RV(O_1, O_2) = \frac{\operatorname{Tr}(O_1^t O_2)}{\sqrt{\operatorname{Tr}(O_1^t O_1)\operatorname{Tr}(O_2^t O_2)}}.$$
If we were in the special case of comparing two variables X and Y, then the computation of the RV coefficient comparing the two triples (X_{n×1}, 1, (1/n) I_n) and (Y_{n×1}, 1, (1/n) I_n) would give the square of the correlation between the variables, RV = ρ². Thus we see that, in general, the RV coefficient is an extension of the notion of correlation to the multivariate context.

The generalized PCA of rank q of a D-centered matrix X as defined above can be seen as providing the best approximation F in the RV sense. To be more precise, we are looking for the matrix F of rank q which, once inserted in a triple with the same weights D on the observations and no weighting of the variables, maximizes the RV coefficient between characterizing operators. Thus F is the choice of matrix of rank q < p that maximizes
$$RV\!\left(X Q X^t D,\; F F^t D\right) = \frac{\operatorname{Tr}(X Q X^t D\, F F^t D)}{\sqrt{\operatorname{Tr}\big((X Q X^t D)^2\big)\,\operatorname{Tr}\big((F F^t D)^2\big)}}.$$
This maximum is attained when F is chosen as the matrix combining the first q eigenvectors of X Q X^t D, normed so that F^t D F = Λ_q, the diagonal matrix where only the first q eigenvalues are nonzero.
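The RV = ρ² special case above is easy to check numerically; the sketch below (Python/NumPy, simulated data) builds the two single-variable characterizing operators and compares their RV to the squared correlation:

```python
import numpy as np

def rv(O1, O2):
    # Escoufier's vector correlation between two symmetric operators.
    return np.trace(O1.T @ O2) / np.sqrt(np.trace(O1.T @ O1) * np.trace(O2.T @ O2))

rng = np.random.default_rng(2)
n = 50
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)            # a variable correlated with x

# Triples (x, 1, (1/n) I_n) and (y, 1, (1/n) I_n); characterizing operators
# x x^t D and y y^t D, with the variables centered beforehand.
D = np.eye(n) / n
xc, yc = x - x.mean(), y - y.mean()
O1 = np.outer(xc, xc) @ D
O2 = np.outer(yc, yc) @ D

rho = np.corrcoef(x, y)[0, 1]
print(np.allclose(rv(O1, O2), rho**2))    # RV of two single variables = ρ² → True
```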
The maximum RV is
$$RV_{\max} = \sqrt{\frac{\sum_{i=1}^{q}\lambda_i^2}{\sum_{i=1}^{p}\lambda_i^2}}.$$
Of course, classical PCA has D = (1/n) I, Q = I, but the extra flexibility is often useful. We define the distance between the triplets (X, Q, D) and (Z, P, D), where Z is also n × p, as the distance deduced from the RV inner product between the operators X Q X^t D and Z P Z^t D.

In fact, the reason the French like this scheme so much is that most multivariate linear methods can be reframed in these terms. We will give a few examples, such as Principal Component Analysis (PCA in English, ACP in French), Correspondence Analysis (CA in English, AFC in French), Discriminant Analysis (LDA in English, AFD in French), PCA with regard to instrumental variables (PCAIV in English, ACPVI in French) and Canonical Correlation Analysis (CCA in English, AC in French).

2.3. Explaining one diagram by another

Principal Component Analysis with respect to Instrumental Variables is a technique developed by C. R. Rao [25] to find the best set of coefficients in the multivariate regression setting where the response, given by a matrix Y, is multivariate. In terms of diagrams and RV coefficients, this problem can be rephrased as that of finding M to associate to X so that (X, M, D) is as close as possible to (Y, Q, D) in the RV sense. The answer is provided by defining M such that
$$Y Q Y^t D = \lambda X M X^t D.$$
If this is possible, then the two eigendecompositions of the triples give the same answers. We simplify notation with the following abbreviations:
$$X^t D X = S_{xx},\qquad Y^t D Y = S_{yy},\qquad X^t D Y = S_{xy},\qquad R = S_{xx}^{-1} S_{xy} Q S_{yx} S_{xx}^{-1}.$$
Then
$$\|Y Q Y^t D - X M X^t D\|^2 = \|Y Q Y^t D - X R X^t D\|^2 + \|X R X^t D - X M X^t D\|^2.$$
The first term on the right-hand side does not depend on M, and the second term will be zero for the choice M = R.
If we add the extra constraint that we only allow ourselves a rank q approximation, with q < min(rank(X), rank(Y)), the optimal choice of a positive definite matrix M is M = R B B^t R, where the columns of B are the eigenvectors of X^t D X R:
$$B = \left(\frac{1}{\sqrt{\lambda_1}}\beta_1, \ldots, \frac{1}{\sqrt{\lambda_q}}\beta_q\right) \quad\text{such that}\quad
\begin{cases}
X^t D X R\, \beta_k = \lambda_k \beta_k, \quad \beta_k^t R \beta_k = \lambda_k, & k = 1, \ldots, q,\\
\lambda_1 > \lambda_2 > \cdots > \lambda_q. &
\end{cases}$$
The PCA with regard to instrumental variables of rank q is equivalent to the PCA of rank q of the triple (X, R, D), where R = S_{xx}^{-1} S_{xy} Q S_{yx} S_{xx}^{-1}.

2.4. One diagram to replace two diagrams

Canonical correlation analysis was introduced by Hotelling [18] to find the common structure in two sets of variables X_1 and X_2 measured on the same observations. This is equivalent to merging the two matrices columnwise to form a large matrix with n rows and p_1 + p_2 columns, and taking as the weighting of the variables the matrix defined by the two diagonal blocks (X_1^t D X_1)^{-1} and (X_2^t D X_2)^{-1}:
$$Q = \begin{pmatrix} (X_1^t D X_1)^{-1} & 0 \\ 0 & (X_2^t D X_2)^{-1} \end{pmatrix}.$$
The duality diagrams of the two separate triples (X_1, I_{p_1}, D) and (X_2, I_{p_2}, D), with characteristic operators (V_1, W_1) and (V_2, W_2), are replaced by the single merged diagram
$$\begin{array}{ccc}
\mathbb{R}^{(p_1+p_2)*} & \xrightarrow{\;[X_1; X_2]\;} & \mathbb{R}^{n}\\
Q\,\uparrow\downarrow\,V & & D\,\downarrow\uparrow\,W\\
\mathbb{R}^{p_1+p_2} & \xleftarrow{\;[X_1; X_2]^{t}\;} & \mathbb{R}^{n*}
\end{array}$$
This analysis gives the same eigenvectors as the analysis of the triple (X_2^t D X_1, (X_1^t D X_1)^{-1}, (X_2^t D X_2)^{-1}), also known as the canonical correlation analysis of X_1 and X_2. These eigenvectors are known as the canonical variables.

2.5. Discriminant analysis
If we want to find the linear combinations of the original variables X_{n×p} that best characterize the group structure of the points, given by a zero/one group coding matrix Y with as many columns as groups (call this number g), we can phrase the problem as a duality diagram. Suppose that the observations are given individual weights in the diagonal matrix D, and that the variables are centered with regard to these weights.

Let A be the g × p matrix of group means in each of the p variables. This satisfies
$$Y^t D X = \Delta_Y A, \quad\text{where}\quad \Delta_Y = Y^t D Y = \operatorname{diag}(w_1, w_2, \ldots, w_g), \qquad w_k = \sum_{i:\, y_{ik}=1} d_i.$$
The w_k's are the group weights: the sums of the weights, as defined by D, of all the elements in that group. Call T the matrix T = X^t D X; in the standard case with all diagonal elements of D equal to 1/n this is just the standard variance-covariance matrix, otherwise it is a generalization thereof. The generalized between-group variance-covariance is B = A^t \Delta_Y A, and the within-group variance-covariance is the matrix W = (X − Y A)^t D (X − Y A).

Proposition 1 (A generalized Huyghens' formula).
$$T = B + W.$$
Proof. Expanding W gives
$$W = X^t D X - X^t D Y A - A^t Y^t D X + A^t Y^t D Y A = T - A^t \Delta_Y A - A^t \Delta_Y A + A^t \Delta_Y A = T - B.$$

The duality diagram for linear discriminant analysis is
$$\begin{array}{ccc}
\mathbb{R}^{p*} & \xrightarrow{\;A\;} & \mathbb{R}^{g}\\
T^{-1}\,\uparrow\downarrow\,B & & \Delta_Y\,\downarrow\uparrow\,A T^{-1} A^t\\
\mathbb{R}^{p} & \xleftarrow{\;A^{t}\;} & \mathbb{R}^{g*}
\end{array}$$
This corresponds to the triple (A, T^{-1}, \Delta_Y), because (X^t D Y) \Delta_Y^{-1} (Y^t D X) = A^t \Delta_Y A, and gives results equivalent to the triple (Y^t D X, T^{-1}, \Delta_Y^{-1}). The discriminating variables are the eigenvectors of the operator A^t \Delta_Y A T^{-1}. They can also be seen as the PCA with regard to instrumental variables of (Y, \Delta_Y^{-1}, D) with regard to (X, M, D).

2.6. Correspondence analysis
Correspondence analysis can be used to analyse several types of multivariate data, all involving some categorical variables. Here are some examples of the type of data that can be decomposed using this method:

• Contingency tables (cross-tabulation of two categorical variables).
• Multiple contingency tables (cross-tabulation of several categorical variables).
• Binary tables obtained by cutting continuous variables into classes and then recoding both these variables and any extra categorical variables into 0/1 tables, 1 indicating presence in that class. So, for instance, a continuous variable cut into three classes will provide three new binary variables, of which only one can take the value one for any given observation.

To a first approximation, correspondence analysis can be understood as an extension of principal components analysis (PCA) where the variance in PCA is replaced by an inertia proportional to the χ² distance of the table from independence. CA decomposes this measure of departure from independence along axes that are orthogonal according to a χ² inner product. If we are comparing two categorical variables, the simplest possible model is that of independence, in which case the counts in the table would approximately obey the margin products identity. For an m × p contingency table N with n = Σ_{i=1}^m Σ_{j=1}^p n_{ij} observations, associate the frequency matrix F = N/n. Under independence, the approximation
$$n_{ij} \doteq \frac{n_{i\cdot}}{n}\,\frac{n_{\cdot j}}{n}\, n$$
can also be written
$$N \doteq n\, r c^t, \quad\text{where } r = \frac{1}{n} N 1_p \text{ is the vector of row sums of } F \text{ and } c^t = \frac{1}{n} 1_m^t N \text{ gives the column sums}.$$
The departure from independence is measured by the χ² statistic
$$X^2 = \sum_{i,j} \left(n_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{n}\right)^2 \Big/ \frac{n_{i\cdot} n_{\cdot j}}{n}.$$
Under the usual validity assumptions that the cell counts n_{ij} are not too small, this statistic follows a χ² distribution with (m − 1)(p − 1) degrees of freedom if the data are independent. If we do not reject independence, there is no more to be said about the table: there is no interaction of interest to analyse, and in fact no 'multivariate' effect. If, on the contrary, this statistic is large, we decompose it into one-dimensional components.

Correspondence analysis is equivalent to the eigendecomposition of the triple (X, Q, D) with
$$X = D_r^{-1} F D_c^{-1} - 1_m 1_p^t, \qquad Q = D_c, \qquad D = D_r, \qquad D_c = \operatorname{diag}(c),\quad D_r = \operatorname{diag}(r),$$
where (D_r^{-1} F D_c^{-1})^t D_r 1_m = 1_p: the average of each column is one.

Notes:

1. Consider the matrix D_r^{-1} F D_c^{-1} and take the principal components with regard to the weights D_r for the rows and D_c for the columns. The recentered matrix D_r^{-1} F D_c^{-1} − 1_m 1_p^t has a generalized singular value decomposition
$$D_r^{-1} F D_c^{-1} - 1_m 1_p^t = U S V^t, \qquad U^t D_r U = I_m,\quad V^t D_c V = I_p,$$
with total inertia
$$\operatorname{Tr}\!\left[\left(D_r^{-1} F D_c^{-1} - 1_m 1_p^t\right)^t D_r \left(D_r^{-1} F D_c^{-1} - 1_m 1_p^t\right) D_c\right] = \frac{X^2}{n}.$$

2. This is also the PCA of the row profiles D_r^{-1} F, taken with the weight matrix D_r and the metric Q = D_c^{-1}.

3. Notice that
$$\sum_i f_{i\cdot}\left(\frac{f_{ij}}{f_{i\cdot} f_{\cdot j}} - 1\right) = 0 \qquad\text{and}\qquad \sum_j f_{\cdot j}\left(\frac{f_{ij}}{f_{i\cdot} f_{\cdot j}} - 1\right) = 0,$$
so the row and column profiles are centered.

This method has been rediscovered many times, most recently by Jon Kleinberg in his method for analyzing hubs and authorities [19]. See Fouss, Saerens and Renders [11] for a detailed comparison. In statistics, the most commonplace use of correspondence analysis is in ordination or seriation, that is, the search for a hidden gradient in contingency tables.
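The identity between the inertia of the CA triple and X²/n is easy to verify numerically. The following Python/NumPy sketch uses a small simulated contingency table (illustrative only, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
N = rng.integers(1, 20, size=(5, 4)).astype(float)   # toy m x p contingency table
n = N.sum()
F = N / n
r = F.sum(axis=1)                          # row masses
c = F.sum(axis=0)                          # column masses
Dr, Dc = np.diag(r), np.diag(c)

# The CA triple: X = Dr^{-1} F Dc^{-1} - 1_m 1_p^t, Q = Dc, D = Dr.
X = np.linalg.solve(Dr, F) @ np.linalg.inv(Dc) - 1.0

# Inertia of the triple, Tr(X^t Dr X Dc), equals the chi-square statistic over n.
inertia = np.trace(X.T @ Dr @ X @ Dc)
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / n       # expected counts
chi2 = ((N - E) ** 2 / E).sum()
print(np.allclose(inertia, chi2 / n))                # → True
```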
As an example, we take data analyzed by Cox and Brandwood [4] and Diaconis [6], who wanted to seriate Plato's works using the proportion of sentence endings in a given book with a given stress pattern. The seven books studied here are Republic, Laws, Critias, Philebus, Politicus, Sophist and Timaeus; we use abbreviations of these names as our column labels in the data analysis below. The stress patterns use the last five syllables of every sentence and combine long and short syllables (abbreviated by - and U in the data below). Thus there are 32 possible stress patterns, and 32 rows in our contingency table. We propose the use of correspondence analysis on the table of frequencies of sentence endings; for a detailed analysis see Charnomordic and Holmes [2]. The first 10 row profiles (as percentages) are as follows:

         Rep  Laws  Crit  Phil  Pol  Soph  Tim
UUUUU    1.1   2.4   3.3   2.5  1.7   2.8  2.4
-UUUU    1.6   3.8   2.0   2.8  2.5   3.6  3.9
U-UUU    1.7   1.9   2.0   2.1  3.1   3.4  6.0
UU-UU    1.9   2.6   1.3   2.6  2.6   2.6  1.8
UUU-U    2.1   3.0   6.7   4.0  3.3   2.4  3.4
UUUU-    2.0   3.8   4.0   4.8  2.9   2.5  3.5
--UUU    2.1   2.7   3.3   4.3  3.3   3.3  3.4
-U-UU    2.2   1.8   2.0   1.5  2.3   4.0  3.4
-UU-U    2.8   0.6   1.3   0.7  0.4   2.1  1.7
-UUU-    4.6   8.8   6.0   6.5  4.0   2.3  3.3
...etc (there are 32 rows in all)

The eigenvalue decomposition (the scree plot) of the chi-square distance matrix (see [2]) shows that two axes out of a possible 6 (the matrix is of rank 6) provide a summary of 85% of the departure from independence. This suggests that a planar representation will provide a good visual summary of the data.

    Eigenvalue   inertia %   cumulative %
1      0.09170       68.96          68.96
2      0.02120       15.94          84.90
3      0.00911        6.86          91.76
4      0.00603        4.53          96.29
5      0.00276        2.07          98.36
6      0.00217        1.64         100.00

Fig 1. Correspondence analysis of Plato's works.
We can see from the plot that there is a seriation that in most cases follows a parabola or arch [16], from Laws at one extreme, being the latest work, to Republic, the earliest among those studied.

3. From discriminant analysis to networks

Consider a graph with vertices the members of a social group, and edges between two members if they interact. We suppose that each vertex comes with an observation vector x_i, and that each has the same weight 1/n. In the extreme case of discriminant analysis, the graph is supposed to connect all the points of a group in a complete graph, and to be disconnected between observations from different groups. Discriminant analysis is just the explanation of this particular graph by linear combinations of variables. What we propose here is to extend this to more general graphs in a similar way. We will suppose that all the observations are the nodes of the graph and each has the same weight 1/n. The basic decomposition of the variance is written
$$\operatorname{cov}(x_j, x_k) = t_{jk} = \frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k).$$
Call the group means
$$\bar{x}_{gj} = \frac{1}{n_g}\sum_{i \in G_g} x_{ij}, \quad g = 1, \ldots, q, \qquad\text{and note}\qquad
\sum_{i \in G_g}(x_{ij} - \bar{x}_{gj})(\bar{x}_{gj} - \bar{x}_k) = (\bar{x}_{gj} - \bar{x}_k)\sum_{i \in G_g}(x_{ij} - \bar{x}_{gj}) = 0.$$
As in Proposition 1, Huyghens' formula is t_{jk} = w_{jk} + b_{jk}, where
$$w_{jk} = \frac{1}{n}\sum_{g=1}^{q}\sum_{i \in G_g}(x_{ij} - \bar{x}_{gj})(x_{ik} - \bar{x}_{gk}), \qquad
b_{jk} = \sum_{g=1}^{q}\frac{n_g}{n}(\bar{x}_{gj} - \bar{x}_j)(\bar{x}_{gk} - \bar{x}_k),$$
that is, T = W + B. As we showed above, linear discriminant analysis finds the linear combinations a such that
$$\frac{a^t B a}{a^t T a}$$
is maximized. This is equivalent to maximizing the quadratic form a^t B a in a, subject to the constraint a^t T a = 1. As we saw above, the eigenvalue problem
$$B a = \lambda T a, \qquad\text{or}\qquad T^{-1} B a = \lambda a \text{ if } T^{-1} \text{ exists},$$
provides λ as needed. Then a^t B a = λ a^t T a = λ.
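Proposition 1 and the discriminant eigenproblem can be checked on simulated data; the following Python/NumPy sketch (with an arbitrary toy group structure) verifies T = B + W and the relation a^t B a = λ under the constraint a^t T a = 1:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, g = 12, 3, 3
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                       # variables centered w.r.t. D
labels = np.repeat(np.arange(g), n // g)
Y = np.eye(g)[labels]                     # n x g zero/one group coding matrix
D = np.eye(n) / n                         # uniform observation weights

Dy = Y.T @ D @ Y                          # Δ_Y, diagonal matrix of group weights
A = np.linalg.solve(Dy, Y.T @ D @ X)      # g x p matrix of group means
T = X.T @ D @ X                           # total (co)variance
B = A.T @ Dy @ A                          # between-group
W = (X - Y @ A).T @ D @ (X - Y @ A)       # within-group

print(np.allclose(T, B + W))              # Huyghens: T = B + W → True

# Discriminating directions: eigenvectors of T^{-1} B.
lam, vecs = np.linalg.eig(np.linalg.solve(T, B))
a1 = vecs[:, np.argmax(lam.real)].real
a1 /= np.sqrt(a1 @ T @ a1)                # impose a^t T a = 1
print(np.allclose(a1 @ B @ a1, lam.real.max()))   # a^t B a = λ → True
```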
We extend this to graphs by relaxing the group definition, partitioning the variation into local and global components.

3.1. Decomposing the variance into local and global components

Lebart was a pioneer in adapting eigenvector decompositions to cater to spatial structure in the data [20, 21, 22]. We can again decompose the variance into parts, but this time the criterion for the decomposition is not defined by group membership as in LDA, but by the neighborhood relation given by the spatial structure. We call E the set of edges of the undirected neighborhood graph. The usual elementwise definition of covariances is given by
$$\operatorname{cov}(x_j, x_k) = \frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{i'=1}^{n}(x_{ij} - x_{i'j})(x_{ik} - x_{i'k}).$$
For the variances we have
$$\operatorname{var}(x_j) = \frac{1}{2n^2}\left[\sum_{(i,i') \in E}(x_{ij} - x_{i'j})^2 + \sum_{(i,i') \notin E}(x_{ij} - x_{i'j})^2\right].$$
Call M the incidence matrix of the graph, m_{ii'} = 1 iff (i, i') ∈ E. The degree of vertex i is m_i = Σ_{i'=1}^n m_{ii'}; we take the convention that there are no self loops. Then another way of writing the variance formula is
$$\operatorname{var}(x_j) = \frac{1}{2n^2}\left[\sum_{i=1}^{n}\sum_{i'=1}^{n} m_{ii'}(x_{ij} - x_{i'j})^2 + \sum_{(i,i') \notin E}(x_{ij} - x_{i'j})^2\right].$$
Call the local variance
$$\operatorname{var}_{loc}(x_j) = \frac{1}{2m}\sum_{i=1}^{n}\sum_{i'=1}^{n} m_{ii'}(x_{ij} - x_{i'j})^2, \qquad\text{where } m = \sum_{i=1}^{n}\sum_{i'=1}^{n} m_{ii'}.$$
The total variance is the variance of the complete graph. Geary's ratio [14] is used to see whether the variable x_j can be considered independent of the graph structure. If the neighboring values of x_j are positively correlated, then the local variance will be an underestimate of the variance:
$$G = c(x_j) = \frac{\operatorname{var}_{loc}(x_j)}{\operatorname{var}(x_j)}.$$
Call D the diagonal matrix with the total degrees of each node on the diagonal, D = diag(m_i).
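Geary's ratio is easy to compute directly. The sketch below (Python/NumPy, with a toy 6-cycle graph and a simulated variable) also checks that the edgewise sum defining the local variance agrees with the quadratic form x^t(D − M)x/m:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
M = np.zeros((n, n))                      # toy graph: a 6-cycle
for i in range(n):
    M[i, (i + 1) % n] = M[(i + 1) % n, i] = 1.0
m = M.sum()                               # m = Σ_i Σ_i' m_ii'
Dg = np.diag(M.sum(axis=1))               # D = diag(m_i)

x = rng.standard_normal(n)
var_total = x.var()                       # ordinary variance of x
var_loc = (M * (x[:, None] - x[None, :]) ** 2).sum() / (2 * m)
G = var_loc / var_total                   # Geary's ratio for x on this graph

# The edgewise sum agrees with the quadratic form x^t (D - M) x / m.
print(np.allclose(var_loc, x @ (Dg - M) @ x / m))   # → True
```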
For all variables taken together, j = 1, ..., p, write the local covariance matrix
$$V = \frac{1}{2m}\, X^t (D - M) X.$$
If the graph is just made of disjoint groups of the same size, this is proportional to the within-class variance-covariance matrix W; the proportionality can be accomplished by comparing the average of the sums of squares to the average over the neighboring nodes [23]. We can generalize the Geary index to account for irregular graphs coherently. In this case we weight each node by its degree. Then we can write the Geary ratio for any n-vector x as
$$c(x) = \frac{x^t (D - M) x}{x^t D x}, \qquad D = \operatorname{diag}(m_1, m_2, \ldots, m_n).$$
We can ask for the coordinate(s) that are the most correlated with the graph structure: to minimize the Geary ratio, we choose x such that c(x) is minimal. This is equivalent to minimizing x^t (D − M) x under the constraint x^t D x = 1. It can be solved by finding the smallest eigenvalue μ with eigenvector x such that
$$(D - M) x = \mu D x, \qquad D^{-1}(D - M) x = \mu x, \qquad (1 - \mu) x = D^{-1} M x.$$
This is exactly the defining equation of the correspondence analysis of the matrix M. This can be extended to as many coordinates as we like; in particular, we can take the two largest eigenvectors and provide the best planar representation of the graph in this way.

3.2. Regression of graphs on node covariates

The covariables measured on the nodes can be essential to understanding the fine structure of graphs. We call X the n × p matrix of measurements at the vertices of the graph; they may be a combination of both categorical variables (gene families, GO classes) and continuous measurements (expression scores). We can apply the PCAIV method defined in Section 2 to the eigenvectors of the graph defined above. This provides a method that uses the covariates in X to explain the graph.
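The eigen-equations above can be illustrated on a small graph. This Python/NumPy sketch (with a toy path graph) recovers the trivial eigenvalue 1 of D⁻¹M and checks (D − M)x = μDx for the first nontrivial coordinate:

```python
import numpy as np

n = 5
M = np.zeros((n, n))                      # toy graph: the path 0-1-2-3-4
for i in range(n - 1):
    M[i, i + 1] = M[i + 1, i] = 1.0
Dg = np.diag(M.sum(axis=1))               # degree matrix

# Eigen-decomposition of D^{-1} M; eigenvalue 1 (constant vector) is trivial.
vals, vecs = np.linalg.eig(np.linalg.solve(Dg, M))
order = np.argsort(vals.real)[::-1]
vals, vecs = vals.real[order], vecs.real[:, order]

print(np.isclose(vals[0], 1.0))           # trivial eigenvalue → True
x = vecs[:, 1]                            # first nontrivial coordinate
mu = 1.0 - vals[1]                        # corresponding Geary eigenvalue μ
print(np.allclose((Dg - M) @ x, mu * Dg @ x))   # (D - M) x = μ D x → True
```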
To be more precise, given a graph (V, E) with adjacency matrix M, define the Laplacian
$$L = D^{-1}(M - I), \qquad D = \operatorname{diag}(d_1, d_2, \ldots, d_n) \text{ the diagonal matrix of degrees}.$$
Using the eigenanalysis of the graph, we can summarize the graph with a few variables, the first few relevant eigenvectors of L. These can then be regressed on the covariates using Principal Components with respect to Instrumental Variables [25], as defined above, to find the linear combination of node covariates that best explains the graph variables.

Appendix A: Resources

A.1. Reading

There are few references in English explaining the duality/operator point of view, apart from the already cited references of Escoufier [8, 10]. Frédérique Glaçon's PhD thesis [15] (in French) clearly lays out the duality principle before going on to explain its application to the conjoint analysis of several matrices, or data cubes. The interested reader fluent in French could also consult any one of several Masters-level textbooks on the subject for many details and examples:

• Brigitte Escoffier and Jérôme Pagès [7] have a textbook with many examples; although their approach is geometric, they do not delve into the duality diagram beyond explaining, on page 100, its use in the transition formulæ between eigenbases of the different spaces.

• [22] is one of the broader books on multivariate analyses, making connections between modern uses of eigendecomposition techniques, clustering and segmentation. This book is unique in its chapter on stability and validation of results (without going as far as speaking of inference).

• Cailliez and Pagès [1] is hard to find, but was the first textbook completely based on the diagram approach; as was the case in the earlier literature, they use transposed matrices.

A.2. Software
The methods described in this article are all available in the form of R packages, which I recommend. The most complete package is ade4 [3], which covers almost all the problems I mention except that of regressing graphs on covariates. However, a complete understanding of the duality diagram terminology and philosophy is necessary, as these provide the building blocks for all the functions in the form of a class called dudi (which actually stands for duality diagram). One of the most important features of all the 'dudi.*' functions is that when the argument scannf is at its default value TRUE, the first step imposed on the user is the perusal of the scree plot of eigenvalues. This can be very important, as choosing to retain 2 values by default, before consulting the eigenvalues, can lead to the main mistake that can be made when using these techniques: the separation of two close eigenvalues. When two eigenvalues are close, the plane will be stable but not each individual axis or principal component, resulting in erroneous results if, for instance, the 2nd and 3rd eigenvalues were very close and the user chose to take 2 axes [17]. Another useful addition also comes from the ecological community and is called vegan. Here is a list of suggested functions from several packages:

• Principal Components Analysis (PCA) is available in prcomp and princomp in the standard package stats, as pca in vegan, and as dudi.pca in ade4.
• Two versions of PCAIV are available: one is called Redundancy Analysis (RDA) and is available as rda in vegan and as pcaiv in ade4.
• Correspondence Analysis (CA) is available as cca in vegan and as dudi.coa in ade4.
• Discriminant analysis is available as lda in stats and as discrimin in ade4.
• Canonical Correlation Analysis is available as cancor in stats (beware: cca in ade4 is Canonical Correspondence Analysis).
• STATIS (conjoint analysis of several tables) is available in ade4.

Acknowledgments. I would like to thank an anonymous referee for a very careful reading of the original version, Elizabeth Purdom for discussions about multivariate analysis and Yves Escoufier for reading this paper and teaching me much about Duality over the years. Persi Diaconis suggested looking at the Plato data in 1993 and has provided many enlightening discussions about the American way.

References

[1] Cailliez, F. and Pagès, J. (1976). Introduction à l'analyse des données. SMASH, Paris.
[2] Charnomordic, B. and Holmes, S. (2001). Correspondence analysis for microarrays. Statist. Graph. Comput. Newsletter 12 19–25.
[3] Chessel, D., Dufour, A. B. and Thioulouse, J. (2004). The ade4 package – I: One-table methods. R News 4 5–10.
[4] Cox, D. R. and Brandwood, L. (1959). On a discriminatory problem connected with the works of Plato. J. Roy. Statist. Soc. Ser. B 21 195–200. MR0109102
[5] Diaconis, P. and Freedman, D. (1984). Asymptotics of graphical projection pursuit. Ann. Statist. 12 793–815. MR0751274
[6] Diaconis, P. and Salzmann, J. (2007). Projection pursuit for discrete data. In Probability and Statistics: Essays in Honor of David A. Freedman (D. Nolan and T. Speed, eds.) 265–288. IMS, Hayward, CA.
[7] Escoffier, B. and Pagès, J. (1998). Analyses factorielles simples et multiples: Objectifs, méthodes et interprétation. Dunod, Paris.
[8] Escoufier, Y. (1977). Operators related to a data matrix. In Recent Developments in Statistics (J. Barra, F. Brodeau, G. Romier and B. van Cutsem, eds.) 125–131. North Holland, Amsterdam.
MR0461791
[9] Escoufier, Y. (1979). Cours d'analyse des données. Cours Polycopié 7901, IUT, Montpellier.
[10] Escoufier, Y. (1987). The duality diagram: A means of better practical applications. In Developments in Numerical Ecology (P. Legendre and L. Legendre, eds.) 139–156. Springer, Berlin. MR0913539
[11] Fouss, F., Renders, J.-M. and Saerens, M. (2004). Some relationships between Kleinberg's hubs and authorities, correspondence analysis, and the SALSA algorithm. In JADT 2004, International Conference on the Statistical Analysis of Textual Data 445–455. Louvain-la-Neuve.
[12] Freedman, D. and Peters, S. (1984a). Bootstrapping a regression equation: Some empirical results. J. Amer. Statist. Assoc. 79 97–106. MR0742858
[13] Freedman, D. and Peters, S. (1984b). Bootstrapping an econometric model: Some empirical results. J. Business Econom. Statist. 2 150–158.
[14] Geary, R. (1954). The contiguity ratio and statistical mapping. The Incorporated Statistician 5 115–145.
[15] Glaçon, F. (1981). Analyse conjointe de plusieurs matrices de données. Comparaison de différentes méthodes. Ph.D. thesis, Scientific and Medical Univ. Grenoble, Grenoble.
[16] Hill, M. and Gauch, H. (1980). Detrended correspondence analysis, an improved ordination technique. Vegetatio 42 47–58.
[17] Holmes, S. (1985). Outils Informatiques pour l'Evaluation de la Pertinence d'un Resultat en Analyse des Données. Ph.D. thesis, USTL, Montpellier.
[18] Hotelling, H. (1936). Relations between two sets of variates. Biometrika 28 321–377.
[19] Kleinberg, J. M. (1999). Hubs, authorities, and communities. ACM Comput. Surv. 31 Article 5. MR1715662
[20] Lebart, L. (1979). Traitement des Données Statistiques. Dunod, Paris.
[21] Lebart, L., Morineau, A. and Warwick, K. M. (1984). Multivariate Descriptive Statistical Analysis. Wiley, New York.
MR0744990
[22] Lebart, L., Piron, M. and Morineau, A. (2000). Statistique exploratoire multidimensionnelle. Dunod, Paris.
[23] Mom, A. (1988). Méthodologie statistique de la classification des réseaux de transports. Ph.D. thesis, USTL, Montpellier.
[24] Purdom, E. (2006). Comparative multivariate methods. Ph.D. thesis, Stanford Univ., Stanford, CA.
[25] Rao, C. R. (1964). The use and interpretation of principal component analysis in applied research. Sankhyā Ser. A 26 329–359. MR0184375
