ClustOfVar: An R Package for the Clustering of Variables

Journal of Statistical Software
MMMMMM YYYY, Volume VV, Issue II. http://www.jstatsoft.org/

Marie Chavent (University of Bordeaux), Vanessa Kuentz (CEMAGREF Bordeaux), Benoît Liquet (University of Bordeaux), Jérôme Saracco (University of Bordeaux)

Abstract

Clustering of variables is a way to arrange variables into homogeneous clusters, i.e., groups of variables which are strongly related to each other and thus bring the same information. These approaches can then be useful for dimension reduction and variable selection. Several specific methods have been developed for the clustering of numerical variables. However, concerning qualitative variables or mixtures of quantitative and qualitative variables, far fewer methods have been proposed. The R package ClustOfVar was specifically developed for this purpose. The homogeneity criterion of a cluster is defined as the sum of correlation ratios (for qualitative variables) and squared correlations (for quantitative variables) to a synthetic quantitative variable, summarizing "as well as possible" the variables in the cluster. This synthetic variable is the first principal component obtained with the PCAMIX method. Two algorithms for the clustering of variables are proposed: an iterative relocation algorithm and ascendant hierarchical clustering. We also propose a bootstrap approach in order to determine suitable numbers of clusters. We illustrate the methodologies and the associated package on small datasets.

Keywords: dimension reduction, hierarchical clustering of variables, k-means clustering of variables, mixture of quantitative and qualitative variables, stability.

1. Introduction

Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA) are appealing statistical tools for multivariate description of respectively numerical and categorical data.
Rotated principal components fulfill the need to get more interpretable components. Clustering of variables is an alternative since it makes it possible to arrange variables into homogeneous clusters and thus to obtain meaningful structures. From a general point of view, variable clustering lumps together variables which are strongly related to each other and thus bring the same information. Once the variables are clustered into groups such that attributes in each group reflect the same aspect, the practitioner may be spurred on to select one variable from each group. One may also want to construct a synthetic variable. For instance in the case of quantitative variables, a solution is to perform a PCA in each cluster and to retain the first principal component as the synthetic variable of the cluster.

A simple and frequently used approach for clustering a set of variables is to calculate the dissimilarities between these variables and to apply a classical cluster analysis method to this dissimilarity matrix. We can cite the functions hclust of the R package stats (R Development Core Team 2011) and agnes of the package cluster (Maechler, Rousseeuw, Struyf, and Hubert 2005) which can be used for single, complete, average linkage hierarchical clustering. The functions diana and pam of the package cluster can also be used for respectively divisive hierarchical clustering and partitioning around medoids (Kaufman and Rousseeuw 1990). But the dissimilarity matrix has to be calculated first.
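
As a rough illustration of this dissimilarity-based approach, the base R sketch below clusters four simulated quantitative variables with hclust. The data, the variable names and the choice of 1 - r^2 as dissimilarity are assumptions made for this example only; they are not taken from the package.

```r
# Simulated data (illustrative only): two groups of related variables.
set.seed(1)
n  <- 100
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(a1 =  f1 + 0.3 * rnorm(n),
           a2 = -f1 + 0.3 * rnorm(n),   # anti-correlated with a1
           b1 =  f2 + 0.3 * rnorm(n),
           b2 =  f2 + 0.3 * rnorm(n))

# The dissimilarity matrix has to be calculated first. Here we use
# 1 - r^2, which ignores the sign of the correlation.
d_abs <- as.dist(1 - cor(X)^2)

# A classical hierarchical clustering is then applied to the variables.
tree_abs <- hclust(d_abs, method = "average")
groups   <- cutree(tree_abs, k = 2)
print(groups)
```

With this sign-insensitive dissimilarity, a1 and a2 are expected to fall in the same group even though they are anti-correlated. Note that no synthetic variable of the clusters is produced by this route.
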
For quantitative variables many dissimilarity measures can be used: correlation coefficients (parametric or nonparametric) can be converted to different dissimilarities depending on whether the aim is to lump together correlated variables regardless of the sign of the correlation or whether a negative correlation coefficient between two variables shows disagreement between them. For categorical variables, many association measures can be used, such as χ², Rand, Belson, Jaccard, Sokal and Jordan among others. Many strategies can then be applied and it can be difficult for the user to choose one of them. Moreover, no synthetic variables of the clusters are directly provided with this approach.

Besides these classical methods devoted to the clustering of observations, there exist methods specifically devoted to the clustering of variables. The most famous one is the VARCLUS procedure of the SAS software. Recently specific methods based on PCA were proposed by Vigneau and Qannari (2003) with the name Clustering around Latent Variables (CLV) and by Dhillon, Marcotte, and Roshan (2003) with the name Diametrical Clustering. But all these specific approaches work only with quantitative data and, as far as we know, they are not implemented in R.

The aim of the package ClustOfVar is then to propose in R methods specifically devoted to the clustering of variables with no restriction on the type (quantitative or qualitative) of the variables. The clustering methods developed in the package work with a mixture of quantitative and qualitative variables and also work for a set exclusively containing quantitative (or qualitative) variables. In addition note that missing data are allowed: they are replaced by means for quantitative variables and by zeros in the indicator matrix for qualitative variables.
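
To make the replacement rule for missing data concrete, here is a small base R sketch of what it could look like. The toy vectors x and z are made up for this example, and the code is only an illustration of the stated rule, not the package's own implementation.

```r
# Toy data with missing values (illustrative only).
x <- c(1.2, NA, 3.4, 2.0)
z <- factor(c("a", NA, "b", "a"))

# Quantitative variable: a missing value is replaced by the mean of the
# observed values.
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Qualitative variable: build the indicator matrix, with a row of zeros
# for an observation whose category is missing.
G <- sapply(levels(z), function(s) as.numeric(!is.na(z) & z == s))

print(x_imp)
print(G)
```

The second row of G contains only zeros, so the observation with a missing category contributes to no category of z.
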
Two methods are proposed for the clustering of variables: a hierarchical clustering algorithm and a k-means type partitioning algorithm are respectively implemented in the functions hclustvar and kmeansvar. These two methods are based on PCAMIX, a principal component method for a mixture of qualitative and quantitative variables (Kiers 1991). This method includes the ordinary PCA and MCA as special cases. Here we use a Singular Value Decomposition (SVD) approach of PCAMIX (Chavent, Kuentz, and Saracco 2011). These two clustering algorithms aim at maximizing a homogeneity criterion. A cluster of variables is defined as homogeneous when the variables in the cluster are strongly linked to a central quantitative synthetic variable. This link is measured by the squared Pearson correlation for the quantitative variables and by the correlation ratio for the qualitative variables. The quantitative central synthetic variable of a cluster is the first principal component of PCAMIX applied to all the variables in the cluster. Note that the synthetic variables of the clusters can be used for dimension reduction or for recoding purposes. Moreover a method based on a bootstrap approach is also proposed to evaluate the stability of the partitions of variables and can be used to determine a suitable number of clusters. It is implemented in the function stability.

The rest of this paper is organized as follows. Section 2 contains a detailed description of the homogeneity criterion and a description of the PCAMIX procedure for the determination of the central synthetic variable. Section 3 describes the clustering algorithms and the bootstrap procedure. Section 4 provides two data-driven examples in order to illustrate the use of the functions and objects of the package ClustOfVar. Finally, Section 5 gives concluding remarks.

2. The homogeneity criterion

Let $\{x_1, \dots, x_{p_1}\}$ be a set of $p_1$ quantitative variables and $\{z_1, \dots, z_{p_2}\}$ a set of $p_2$ qualitative variables. Let $X$ and $Z$ be the corresponding quantitative and qualitative data matrices of dimensions $n \times p_1$ and $n \times p_2$, where $n$ is the number of observations. For the sake of simplicity, we denote by $x_j \in \mathbb{R}^n$ the $j$-th column of $X$ and by $z_j \in \mathcal{M}_j^n$ the $j$-th column of $Z$, with $\mathcal{M}_j$ the set of categories of $z_j$. Let $P_K = (C_1, \dots, C_K)$ be a partition into $K$ clusters of the $p = p_1 + p_2$ variables.

Synthetic variable of a cluster $C_k$. It is defined as the quantitative variable $y_k \in \mathbb{R}^n$ the "most linked" to all the variables in $C_k$:

$$y_k = \arg\max_{u \in \mathbb{R}^n} \left\{ \sum_{x_j \in C_k} r^2_{u, x_j} + \sum_{z_j \in C_k} \eta^2_{u | z_j} \right\},$$

where $r^2$ denotes the squared Pearson correlation and $\eta^2$ denotes the correlation ratio. More precisely, the correlation ratio $\eta^2_{u | z_j} \in [0, 1]$ measures the part of the variance of $u$ explained by the categories of $z_j$:

$$\eta^2_{u | z_j} = \frac{\sum_{s \in \mathcal{M}_j} n_s (\bar{u}_s - \bar{u})^2}{\sum_{i=1}^{n} (u_i - \bar{u})^2},$$

where $n_s$ is the frequency of category $s$, $\bar{u}_s$ is the mean value of $u$ calculated on the observations belonging to category $s$ and $\bar{u}$ is the mean of $u$.

We have the following important results (Escofier 1979; Saporta 1990; Pagès 2004):

- $y_k$ is the first principal component of PCAMIX applied to $X_k$ and $Z_k$, the matrices made up of the columns of $X$ and $Z$ corresponding to the variables in $C_k$;
- the empirical variance of $y_k$ is equal to:

$$\mathrm{VAR}(y_k) = \sum_{x_j \in C_k} r^2_{x_j, y_k} + \sum_{z_j \in C_k} \eta^2_{y_k | z_j}.$$

The determination of $y_k$ using PCAMIX is carried out according to the following steps:

1. Recoding of $X_k$ and $Z_k$:
   (a) $\tilde{X}_k$ is the standardized version of the quantitative matrix $X_k$,
   (b) $\tilde{Z}_k = J G D^{-1/2}$ is the standardized version of the indicator matrix $G$ of the qualitative matrix $Z_k$, where $D$ is the diagonal matrix of frequencies of the categories and $J = I - \mathbf{1}\mathbf{1}'/n$ is the centering operator, where $I$ denotes the identity matrix and $\mathbf{1}$ the vector with unit entries.
2. Concatenation of the two recoded matrices: $M_k = (\tilde{X}_k \,|\, \tilde{Z}_k)$.
3. Singular Value Decomposition of $M_k$: $M_k = U \Lambda V'$.
4. Extraction/calculation of useful outputs:
   - $\sqrt{n}\, U \Lambda$ is the matrix of the principal component scores of PCAMIX;
   - $y_k$ is the first column of $\sqrt{n}\, U \Lambda$;
   - $\mathrm{VAR}(y_k) = \lambda_1^k$ where $\lambda_1^k$ is the first eigenvalue in $\Lambda$.

Note that we recently developed an R package named PCAmixdata with a function PCAmix which provides the principal components of PCAMIX and a function PCArot which provides the principal components after rotation.

Homogeneity $H$ of a cluster $C_k$. It is a measure of adequacy between the variables in the cluster and its central synthetic quantitative variable $y_k$:

$$H(C_k) = \sum_{x_j \in C_k} r^2_{x_j, y_k} + \sum_{z_j \in C_k} \eta^2_{y_k | z_j} = \lambda_1^k. \tag{1}$$

Note that the first term (based on the squared Pearson correlation $r^2$) measures the link between the quantitative variables in $C_k$ and $y_k$ independently of the sign of the relationship. The second one (based on the correlation ratio $\eta^2$) measures the link between the qualitative variables in $C_k$ and $y_k$. The homogeneity of a cluster is maximum when all the quantitative variables are perfectly correlated (or anti-correlated) with $y_k$ and when all the correlation ratios of the qualitative variables are equal to 1. It means that all the variables in the cluster $C_k$ bring the same information.

Homogeneity $H$ of a partition $P_K$. It is defined as the sum of the homogeneities of its clusters:

$$H(P_K) = \sum_{k=1}^{K} H(C_k) = \lambda_1^1 + \dots + \lambda_1^K, \tag{2}$$

where $\lambda_1^1, \dots, \lambda_1^K$ are the first eigenvalues of PCAMIX applied to the $K$ clusters $C_k$ of $P_K$.

3. The clustering algorithms

The aim is to find a partition of a set of quantitative and/or qualitative variables such that the variables within a cluster are strongly related to each other. In other words the objective is to find a partition $P_K$ which maximizes the homogeneity function $H$ defined in (2). For this, a hierarchical and a partitioning clustering algorithm are proposed in the package ClustOfVar. A bootstrap procedure is also proposed to evaluate the stability of the partitions into $K = 2, 3, \dots, p - 1$ clusters and then to help the user to determine a suitable number of clusters of variables.

The hierarchical clustering algorithm. This algorithm builds a set of $p$ nested partitions of variables in the following way:

1. Step $l = 0$: initialization. Start with the partition in $p$ clusters.
2. Step $l = 1, \dots, p - 2$: aggregate two clusters of the partition in $p - l + 1$ clusters to get a new partition in $p - l$ clusters. For this, choose clusters $A$ and $B$ with the smallest dissimilarity $d$ defined as:

$$d(A, B) = H(A) + H(B) - H(A \cup B) = \lambda_A^1 + \lambda_B^1 - \lambda_{A \cup B}^1, \tag{3}$$

   where $\lambda_A^1$ denotes the first eigenvalue of PCAMIX applied to the variables of cluster $A$. This dissimilarity measures the loss of homogeneity observed when the two clusters $A$ and $B$ are merged. Using this aggregation measure the new partition in $p - l$ clusters maximizes $H$ among all the partitions in $p - l$ clusters obtained by aggregation of two clusters of the partition in $p - l + 1$ clusters.
3. Step $l = p - 1$: stop. The partition in one cluster is obtained.

This algorithm is implemented in the function hclustvar which builds a hierarchy of the $p$ variables. The function plot.hclustvar gives the dendrogram of this hierarchy.
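
The aggregation measure in (3) can be sketched in base R for the special case of quantitative variables only, where PCAMIX reduces to PCA on standardized variables and the homogeneity of a cluster is the first eigenvalue of the correlation matrix of its variables. The function names homogeneity and d_agg and the simulated data are hypothetical, introduced only for this illustration; they are not part of the package.

```r
# Homogeneity of a cluster of quantitative variables: first eigenvalue of
# the correlation matrix of the variables in the cluster.
homogeneity <- function(X, vars) {
  if (length(vars) == 1) return(1)   # a single variable has H = 1
  R <- cor(X[, vars, drop = FALSE])
  eigen(R, symmetric = TRUE, only.values = TRUE)$values[1]
}

# Aggregation measure d(A, B) = H(A) + H(B) - H(A u B).
d_agg <- function(X, A, B) {
  homogeneity(X, A) + homogeneity(X, B) - homogeneity(X, c(A, B))
}

# Simulated data (illustrative only): x1 and x2 are strongly related,
# x3 is unrelated to both.
set.seed(1)
n <- 100
f <- rnorm(n)
X <- cbind(x1 = f + 0.3 * rnorm(n),
           x2 = f + 0.3 * rnorm(n),
           x3 = rnorm(n))

d12 <- d_agg(X, "x1", "x2")   # merging related variables
d13 <- d_agg(X, "x1", "x3")   # merging unrelated variables
print(c(d12 = d12, d13 = d13))
```

For two single variables the measure equals one minus the absolute value of their correlation, so the hierarchical algorithm would merge x1 and x2 before x1 and x3.
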
The height of a cluster $C = A \cup B$ in this dendrogram is defined as $h(C) = d(A, B)$. It is easy to verify that $h(C) \geq 0$ but the property "$A \subset B \Rightarrow h(A) \leq h(B)$" has not been proved yet. Nevertheless, inversions in the dendrogram have never been observed in practice, either on simulated data or on real data sets. Finally the function cutreevar cuts this dendrogram and gives one of the $p$ nested partitions according to the number $K$ of clusters given in input by the user.

The partitioning algorithm. This partitioning algorithm requires the definition of a similarity measure between two variables of any type (quantitative or qualitative). We use for this purpose the squared canonical correlation between two data matrices $E$ and $F$ of dimensions $n \times r_1$ and $n \times r_2$. This correlation, denoted by $s$, can be easily calculated as follows:

$$
s(E, F) =
\begin{cases}
\text{first eigenvalue of the } n \times n \text{ matrix } EE'FF' & \text{if } \min(n, r_1, r_2) = n, \\
\text{first eigenvalue of the } r_1 \times r_1 \text{ matrix } E'FF'E & \text{if } \min(n, r_1, r_2) = r_1, \\
\text{first eigenvalue of the } r_2 \times r_2 \text{ matrix } F'EE'F & \text{if } \min(n, r_1, r_2) = r_2.
\end{cases}
$$

This similarity $s$ can also be defined as follows:

- For two quantitative variables $x_i$ and $x_j$, let $E = \tilde{x}_i$ and $F = \tilde{x}_j$ where $\tilde{x}_i$ and $\tilde{x}_j$ are the standardized versions of $x_i$ and $x_j$. In this case, the squared canonical correlation is the squared Pearson correlation: $s(x_i, x_j) = r^2_{x_i, x_j}$.
- For one qualitative variable $z_i$ and one quantitative variable $x_j$, let $E = \tilde{Z}_i$ and $F = \tilde{x}_j$ where $\tilde{Z}_i$ is the standardized version of the indicator matrix $G_i$ of the qualitative variable $z_i$. In this case, the squared canonical correlation is the correlation ratio: $s(z_i, x_j) = \eta^2_{x_j | z_i}$.
- For two qualitative variables $z_i$ and $z_j$ having $r$ and $s$ categories, let $E = \tilde{Z}_i$ and $F = \tilde{Z}_j$. In this case, the squared canonical correlation $s(z_i, z_j)$ does not correspond to a well-known association measure. Its interpretation is geometrical: the closer $s(z_i, z_j)$ is to one, the closer are the two linear subspaces spanned by the matrices $E$ and $F$. Then the two qualitative variables $z_i$ and $z_j$ bring similar information.

This similarity measure is implemented in the function mixedVarSim.

The clustering algorithm implemented in the function kmeansvar then builds a partition in $K$ clusters in the following way:

1. Initialization step: two possibilities are available.
   (a) A non-random initialization: an initial partition in $K$ clusters is given in input (for instance the partition obtained by cutting the dendrogram of the hierarchy).
   (b) A random initialization:
      i. $K$ variables are randomly selected among the $p$ variables as initial central synthetic variables (named centers hereafter).
      ii. An initial partition into $K$ clusters is built by allocating each variable to the cluster with the closest initial center: the similarity between a variable and an initial center is calculated using the function mixedVarSim.
2. Repeat:
   (a) A representation step: the quantitative central synthetic variable $y_k$ of each cluster $C_k$ is calculated with PCAMIX as defined in Section 2.
   (b) An allocation step: a partition is constructed by assigning each variable to the closest cluster. The similarity between a variable and the central synthetic quantitative variable of the corresponding cluster is calculated with the function mixedVarSim: it is either a squared correlation (if the variable is quantitative) or a correlation ratio (if the variable is qualitative).
3. Stop if there are no more changes in the partition or if a maximum number of iterations (fixed by the user) is reached.
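
The steps above can be sketched in base R for quantitative variables only, where the synthetic variable of a cluster is the first principal component of its standardized variables and the similarity is the squared Pearson correlation. The function kmeans_var_sketch, its arguments and the simulated data are hypothetical and serve only as an illustration; this is not the code of kmeansvar.

```r
# k-means type clustering of quantitative variables (illustrative sketch).
# `centers` gives the column indices of the variables used as initial centers.
kmeans_var_sketch <- function(X, centers, iter.max = 20) {
  Xs <- scale(X)
  K  <- length(centers)
  # Initialization: allocate each variable to the closest initial center.
  cl <- apply(cor(Xs, Xs[, centers, drop = FALSE])^2, 1, which.max)
  for (it in seq_len(iter.max)) {
    if (any(tabulate(cl, nbins = K) == 0)) stop("empty cluster")
    # Representation step: first principal component of each cluster
    # (up to a scaling factor, which does not affect correlations).
    Y <- sapply(seq_len(K), function(k)
      svd(Xs[, cl == k, drop = FALSE], nu = 1, nv = 0)$u[, 1])
    # Allocation step: assign each variable to the center with the
    # largest squared correlation.
    new_cl <- apply(cor(Xs, Y)^2, 1, which.max)
    if (all(new_cl == cl)) break   # no more changes in the partition
    cl <- new_cl
  }
  cl
}

# Simulated data (illustrative only): two groups of related variables.
set.seed(1)
n  <- 100
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(a1 = f1 + 0.3 * rnorm(n), a2 = f1 + 0.3 * rnorm(n),
           b1 = f2 + 0.3 * rnorm(n), b2 = f2 + 0.3 * rnorm(n))

# Initial centers chosen by hand here; sample(ncol(X), 2) would give a
# random initialization.
part <- kmeans_var_sketch(X, centers = c(1, 3))
print(part)
```

The sketch stops with an error if a cluster becomes empty, a case that a complete implementation would have to handle.
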
This iterative procedure kmeansvar provides a partition $P_K$ into $K$ clusters which maximizes $H$, but this optimum is local and depends on the initial partition. A solution to overcome this problem and to avoid the influence of the choice of an arbitrary initial partition is to consider multiple random initializations. In this case, steps 1(b), 2 and 3 are repeated, and we propose to retain as final partition the one which provides the highest value of $H$.

Stability of partitions of variables. This procedure evaluates the stability of the $p$ nested partitions of the dendrogram obtained with hclustvar. It works as follows:

1. $B$ bootstrap samples of the $n$ observations are drawn and the corresponding $B$ dendrograms are obtained with the function hclustvar.
2. The partitions of these $B$ dendrograms are compared with the partitions of the initial hierarchy using the adjusted Rand index. The Rand and the adjusted Rand indices are implemented in the function Rand (see Hubert and Arabie (1985) for details on these indices).
3. The stability of a partition is evaluated by the mean of the $B$ adjusted Rand indices.

The plot of this stability criterion according to the number of clusters can help the user in the choice of a sensible and suitable number of clusters. Note that an error message may appear with this function when a qualitative variable has rare categories. Indeed, if such a rare category disappears in a bootstrap sample of observations, a column of identical values is then formed and the standardization of this variable is not possible in the PCAMIX step.

4. Illustration on simple examples

We illustrate our R package ClustOfVar on two real datasets: the first one only concerns quantitative variables, the second one is a mixture of quantitative and qualitative variables.

4.1. First example: Quantitative data

We use the dataset decathlon which contains $n = 41$ athletes described according to their performances in the $p = 10$ different events of the decathlon.

R> library("ClustOfVar")
R> data("decathlon")
R> head(decathlon[, 1:4])
         100m Long.jump Shot.put High.jump
SEBRLE  11.04      7.58    14.83      2.07
CLAY    10.76      7.40    14.26      1.86
KARPOV  11.02      7.30    14.77      2.04
BERNARD 11.02      7.23    14.25      1.92
YURKOV  11.34      7.09    15.19      2.10
WARNERS 11.11      7.60    14.31      1.98

In order to have an idea of the links between these 10 quantitative variables, we construct a hierarchy with the function hclustvar.

R> X.quanti <- decathlon[, 1:10]
R> tree <- hclustvar(X.quanti)
R> plot(tree)

[Figure 1 panels: "Aggregation levels" (Height against number of clusters) and "Cluster Dendrogram" (Height) of the 10 decathlon variables.]

Figure 1: Graphical output of the function plot.hclustvar.

In Figure 1, the plot of the aggregation levels suggests choosing 3 or 5 clusters of variables. The dendrogram, on the right hand side of this figure, shows the link between the variables in terms of $r^2$. For instance, the two variables "discus" and "shot put" are linked as well as the two variables "Long.jump" and "400m", but the user must keep in mind that the dendrogram does not indicate the sign of these relationships: a careful study of these variables shows that "discus" and "shot put" are correlated whereas "Long.jump" and "400m" are anti-correlated.

The user can use the stability function in order to have an idea of the stability of the partitions of the dendrogram represented in Figure 1.
R> stab <- stability(tree, B = 40)
R> plot(stab, main = "Stability of the partitions")
R> stab$matCR
R> boxplot(stab$matCR, main = "Dispersion of the adjusted Rand index")

On the left of Figure 2, the plot of the mean (over the $B = 40$ bootstrap samples) of the adjusted Rand indices is obtained with the function plot.clustab. It clearly suggests choosing 5 clusters. The boxplots on the right of Figure 2 show the dispersion of these indices over the $B = 40$ bootstrap replications for each partition, and they suggest 3 or 5 clusters. In the following we choose $K = 3$ clusters because PCA applied to each of the 3 clusters gives each time only one eigenvalue greater than 1. The function cutreevar cuts the dendrogram of the hierarchy and gives a partition into $K = 3$ clusters of the $p = 10$ variables:

R> P3 <- cutreevar(tree, 3)
R> cluster <- P3$cluster
R> princomp(X.quanti[, which(cluster == 1)], cor = TRUE)$sdev^2
R> princomp(X.quanti[, which(cluster == 2)], cor = TRUE)$sdev^2
R> princomp(X.quanti[, which(cluster == 3)], cor = TRUE)$sdev^2

The partition $P_3$ is contained in an object of class clustvar. Note that partitions obtained with the kmeansvar function are also objects of class clustvar. The function print.clustvar gives a description of the values of this object.

[Figure 2 panels: "Stability of the partitions" (mean adjusted Rand criterion against number of clusters, from 2 to 9) and "Dispersion of the adjusted Rand index" (boxplots for the partitions P2 to P9).]

Figure 2: Graphical output of the functions stability and plot.clustab.
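
The stability criterion shown in Figure 2 rests on the adjusted Rand index of Hubert and Arabie (1985). The base R sketch below computes this index and a mean over bootstrap samples. The function names adj_rand and cut_vars and the simulated data are hypothetical, and hclust applied to 1 - r^2 is used as a simple stand-in for hclustvar, so this only illustrates the principle of the procedure.

```r
# Adjusted Rand index between two partitions given as label vectors.
adj_rand <- function(p1, p2) {
  tab      <- table(p1, p2)
  sum_ij   <- sum(choose(tab, 2))
  sum_i    <- sum(choose(rowSums(tab), 2))
  sum_j    <- sum(choose(colSums(tab), 2))
  expected <- sum_i * sum_j / choose(sum(tab), 2)
  (sum_ij - expected) / ((sum_i + sum_j) / 2 - expected)
}

# Stand-in for hclustvar followed by cutreevar (quantitative variables only).
cut_vars <- function(X, K) cutree(hclust(as.dist(1 - cor(X)^2)), k = K)

# Simulated data (illustrative only): two groups of related variables.
set.seed(1)
n  <- 100
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(a1 = f1 + 0.3 * rnorm(n), a2 = f1 + 0.3 * rnorm(n),
           b1 = f2 + 0.3 * rnorm(n), b2 = f2 + 0.3 * rnorm(n))

# Compare the partition of the initial hierarchy with the partitions
# obtained on B bootstrap samples of the observations.
ref <- cut_vars(X, 2)
B   <- 20
ari <- replicate(B, adj_rand(ref, cut_vars(X[sample(n, replace = TRUE), ], 2)))
print(mean(ari))
```

A mean close to 1 indicates that the partition into 2 clusters is recovered in nearly all bootstrap samples.
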
R> P3 <- cutreevar(tree, 3, matsim = TRUE)
R> print(P3)

Call:
cutreevar(obj = tree, k = 3)

 name       description
 "$var"     "list of variables in each cluster"
 "$sim"     "similarity matrix in each cluster"
 "$cluster" "cluster memberships"
 "$wss"     "within-cluster sum of squares"
 "$E"       "gain in cohesion (in %)"
 "$size"    "size of each cluster"
 "$scores"  "score of each cluster"

The value $wss is $H(P_K)$ where the homogeneity function $H$ was defined in (2). The gain in cohesion $E is the percentage of homogeneity which is accounted for by the partition $P_K$. It is defined by:

$$E(P_K) = \frac{H(P_K) - H(P_1)}{p - H(P_1)}. \tag{4}$$

The value $sim provides the similarity matrices of the variables in each cluster (calculated with the function mixedVarSim). Note that it is time consuming to compute these similarity matrices when the number of variables is large. Thus they are not calculated by default: matsim=TRUE must be specified in the parameters of the function cutreevar (or kmeansvar) if the user wants this output. We provide below the similarity matrix for the first cluster of this partition into 3 clusters.

R> round(P3$sim$cluster1, digits = 2)
            100m Long.jump 400m 110m.hurdle
100m        1.00      0.36 0.27        0.34
Long.jump   0.36      1.00 0.36        0.26
400m        0.27      0.36 1.00        0.30
110m.hurdle 0.34      0.26 0.30        1.00

The value $cluster is a vector of integers indicating the cluster to which each variable is allocated.

R> P3$cluster
       100m   Long.jump    Shot.put   High.jump        400m 110m.hurdle
          1           1           2           2           1           1
     Discus  Pole.vault    Javeline       1500m
          2           3           2           3

The value $var gives a description of each cluster of the partition. More precisely it provides for each cluster the squared loadings on the first principal component of PCAMIX (which is the central synthetic variable of this cluster). For quantitative variables (resp. qualitative variables), the squared loadings are squared correlations (resp. correlation ratios) with this central synthetic variable.
For instance the squared correlation between the variable "100m" and the central synthetic variable of "cluster1" is 0.68.

R> P3$var
$cluster1
            squared loading
100m              0.6822349
Long.jump         0.6873076
400m              0.6652279
110m.hurdle       0.6427661

$cluster2
          squared loading
Shot.put        0.7861012
High.jump       0.4991778
Discus          0.6023186
Javeline        0.2546550

$cluster3
           squared loading
Pole.vault       0.6237239
1500m            0.6237239

The value $scores is the $n \times K$ matrix of the scores of the $n$ observations on the first principal components of PCAMIX applied to the $K$ clusters: PCAMIX is applied 3 times here, once in each cluster. Each column is then a synthetic variable of a cluster. The central synthetic variable of "cluster1" for instance is the first column of the $41 \times 3$ matrix below. This column gives the scores of the 41 athletes on the first component of PCAMIX applied to the variables of "cluster1" (100m, Long.jump, 400m, 110m.hurdle).

R> head(P3$scores)
          cluster1   cluster2   cluster3
SEBRLE   0.2640687 -1.0353928 -1.4405915
CLAY     1.3816943 -0.3454687 -1.7840860
KARPOV   1.1098485 -0.7209119 -1.7043603
BERNARD -0.1949061  0.7082857 -1.5017373
YURKOV  -2.0319539 -1.8850107  0.2702640
WARNERS  1.1385110  1.0929346 -0.3490226

Note that this $41 \times 3$ matrix of the scores of the 41 athletes in each cluster of variables is of course different from the $41 \times 3$ matrix of the scores of the athletes on the first 3 principal components of PCAMIX (here PCA) applied to the initial dataset. The 3 synthetic variables for instance can be correlated whereas the first 3 principal components of PCAMIX are not correlated by construction. But the matrix of the synthetic variables in $scores can be used, like the matrix of the principal components of PCAMIX, for dimension reduction purposes.

4.2. Second example: A mixture of quantitative and qualitative data

We use the dataset wine which contains $n = 21$ French wines described by $p = 31$ variables.
The first two variables "Label" and "Soil" are qualitative with respectively 3 and 4 categories. The other 29 variables are quantitative.

R> data("wine")
R> head(wine[, c(1:4)])
     Label      Soil      Odor.Intensity Aroma.quality
2EL  Saumur     Env1               3.074         3.000
1CHA Saumur     Env1               2.964         2.821
1FON Bourgueuil Env1               2.857         2.929
1VAU Chinon     Env2               2.808         2.593
1DAM Saumur     Reference          3.607         3.429
2BOU Bourgueuil Reference          2.857         3.111

In order to have an idea of the links between these 31 quantitative and qualitative variables, we construct a hierarchy using the function hclustvar.

R> X.quanti <- wine[, c(3:29)]
R> X.quali <- wine[, c(1, 2)]
R> tree <- hclustvar(X.quanti, X.quali)
R> plot(tree)

[Figure 3 panel: "Cluster Dendrogram" (Height) of the variables of the wine dataset.]

Figure 3: Dendrogram of the hierarchy of the 31 variables of the wine dataset.

In Figure 3, we plot the dendrogram. It shows for instance that the qualitative variable "Label" is linked (in terms of correlation ratio) with the quantitative variable "Phenolic". Based on this dendrogram, the user chooses to cut it into $K = 6$ clusters:

R> part_hier <- cutreevar(tree, 6)
R> part_hier$var$"cluster1"
                     squared loading
Odor.Intensity             0.7617528
Spice.before.shaking       0.6160243
Odor.Intensity.1           0.6663325
Spice                      0.5357837
Bitterness                 0.6620632
Soil                       0.7768805

A close reading of the output for "cluster1" shows that the correlation ratio between the qualitative variable "Soil" and the synthetic variable of the cluster is about 0.78.
The squared correlation between "Odor.Intensity" and the synthetic variable of the cluster is 0.76.

The central synthetic variables of the 6 clusters are in part_hier$scores. This $21 \times 6$ quantitative matrix can replace the original $21 \times 31$ data matrix mixing qualitative and quantitative variables. This matrix of the synthetic variables can then be used for recoding a mixed data matrix (or a qualitative data matrix) into a quantitative data matrix, as is usually done with the matrix of the principal components of PCAMIX.

The function kmeansvar can also provide a partition into $K = 6$ clusters of the 31 variables.

R> part_km <- kmeansvar(X.quanti, X.quali, init = 6, nstart = 10)

The gain in cohesion, defined in (4), of the partition obtained with the k-means type partitioning algorithm and 10 random initializations is smaller than that of the partition obtained with the hierarchical clustering algorithm (51.02 versus 56.84):

R> part_km$E
[1] 51.02414
R> part_hier$E
[1] 56.84082

In practice, simulations and real datasets showed that the quality of the partitions obtained with hclustvar seems to be better than that obtained with kmeansvar. But for large datasets (with a large number of variables), the function hclustvar runs into problems of computation time. In this case, the function kmeansvar will be faster.

5. Concluding remarks

The R package ClustOfVar proposes hierarchical and k-means type algorithms for the clustering of variables of any type (quantitative and/or qualitative).

This package proposes useful tools to visualize the links between the variables and the redundancy in a data set. It is also an alternative to principal component analysis methods for dimension reduction and for recoding qualitative or mixed data matrices into a quantitative data matrix.
The main difference between PCA and the approach of clustering of variables presented in this paper is that the synthetic variables of the clusters can be correlated whereas the principal components are not correlated by construction.

The package ClustOfVar does not perform well with datasets having a very large number of variables: the computational time becomes relatively long. Future work is to propose a new version of the package with versions of the functions hclustvar, kmeansvar and stability developed for parallel computing.

We mention that the package ClustOfVar can deal with missing data. However let us note that the imputation method used in the code is simple and may not perform well when the proportion of missing data is too large. In that case, one of the numerous R packages devoted to missing data imputation should be used prior to ClustOfVar.

References

Chavent M, Kuentz V, Saracco J (2011). "Orthogonal Rotation in PCAMIX." Submitted paper.

Dhillon I, Marcotte E, Roshan U (2003). "Diametrical Clustering for Identifying Anti-correlated Gene Clusters." Bioinformatics, 19(13), 1612–1619.

Escofier B (1979). "Traitement Simultané de Variables Qualitatives et Quantitatives en Analyse Factorielle [Simultaneous Treatment of Qualitative and Quantitative Variables in Factor Analysis]." Les cahiers de l'analyse des données, 4(2), 137–146.

Hubert L, Arabie P (1985). "Comparing Partitions." Journal of Classification, pp. 193–208.

Kaufman L, Rousseeuw P (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.

Kiers H (1991). "Simple Structure in Component Analysis Techniques for Mixtures of Qualitative and Quantitative Variables." Psychometrika, 56, 197–212.

Maechler M, Rousseeuw P, Struyf A, Hubert M (2005). "Cluster Analysis Basics and Extensions." Rousseeuw et al. provided the S original which has been ported to R by Kurt Hornik and has since been enhanced by Martin Maechler: speed improvements, silhouette() functionality, bug fixes, etc. See the 'Changelog' file (in the package source).

Pagès J (2004). "Analyse Factorielle de Données Mixtes [Factor Analysis for Mixed Data]." Revue de Statistique Appliquée, 52(4), 93–111.

R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Saporta G (1990). "Simultaneous Treatment of Quantitative and Qualitative Data." In Atti della XXXV riunione scientifica; società italiana di statistica, pp. 63–72.

Vigneau E, Qannari E (2003). "Clustering of Variables Around Latent Components." Communications in Statistics - Simulation and Computation, 32(4), 1131–1150.

Affiliation:

Marie Chavent
- Univ. Bordeaux, IMB, UMR 5251, F-33400 Talence, France.
- CNRS, IMB, UMR 5251, F-33400 Talence, France.
- INRIA, F-33400 Talence, France.
E-mail: Marie.Chavent@u-bordeaux2.fr

Vanessa Kuentz
Cemagref, UR ADBX, F-33612 Cestas Cedex, France
E-mail: vanessa.kuentz@cemagref.fr

Benoit Liquet
- INSERM, ISPED, Centre INSERM U-897-Epidemiologie-Biostatistique, Bordeaux, F-33000
- Univ. Bordeaux, ISPED, Centre INSERM U-897-Epidemiologie-Biostatistique, Bordeaux, F-33000, France
E-mail: Benoit.Liquet@isped.u-bordeaux2.fr

Jerome Saracco
- IPB, IMB, UMR 5251, F-33400 Talence, France.
- CNRS, IMB, UMR 5251, F-33400 Talence, France.
- INRIA, F-33400 Talence, France.
E-mail: jerome.saracco@math.u-bordeaux1.fr

Journal of Statistical Software, http://www.jstatsoft.org/
published by the American Statistical Association, http://www.amstat.org/
Volume VV, Issue II, MMMMMM YYYY
Submitted: yyyy-mm-dd
Accepted: yyyy-mm-dd
