A fast and recursive algorithm for clustering large datasets with $k$-medians


Authors: Hervé Cardot, Peggy Cénac, Jean-Marie Monnez

Hervé Cardot* (a), Peggy Cénac (a) and Jean-Marie Monnez (b)

(a) Institut de Mathématiques de Bourgogne, UMR 5584, Université de Bourgogne, 9 Avenue Alain Savary, 21078 Dijon, France
(b) Institut Elie Cartan, UMR 7502, Nancy Université, CNRS, INRIA, B.P. 239-F 54506 Vandoeuvre lès Nancy Cedex, France

November 26, 2024

Abstract

Clustering large samples of high-dimensional data with fast algorithms is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967), who introduced a sequential version of the $k$-means algorithm, a new class of recursive stochastic gradient algorithms designed for the $k$-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. Particular attention is paid to the averaged versions, which are known to have better performance, and a data-driven procedure that allows automatic selection of the descent step value is proposed. The performance of the averaged sequential estimator is compared in a simulation study, both in terms of computation speed and accuracy of the estimates, with more classical partitioning techniques such as $k$-means, trimmed $k$-means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated on the determination of television audience profiles with a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours.

Keywords: averaging, high dimensional data, $k$-medoids, online clustering, partitioning around medoids, recursive estimators, Robbins-Monro, stochastic approximation, stochastic gradient.
* Corresponding author.

1 Introduction

Clustering large samples of high-dimensional data with fast algorithms is an important challenge in computational statistics and machine learning, with applications in various domains such as image analysis, biology or computer vision. There is a vast literature on clustering techniques, and recent discussions and reviews may be found in Jain et al. (1999) or Gan et al. (2007). Moreover, as argued in Bottou (2010), the development of fast algorithms is even more crucial when the computation time is limited and the sample is potentially very large, since fast procedures are able to deal with a larger number of observations and thus finally provide better estimates than slower ones. We focus here on partitioning techniques that are able to deal with large samples of data, assuming the number $k$ of clusters is fixed in advance. The most popular clustering methods are probably the non-sequential (Forgy (1965)) and the sequential (MacQueen (1967)) versions of the $k$-means algorithm. They are very fast and only require $O(kn)$ operations, where $n$ is the sample size. They aim at finding local minima of a quadratic criterion, the cluster centers being given by the barycenters of the elements belonging to each cluster. A major drawback of the $k$-means algorithms is that they are based on mean values and, consequently, are very sensitive to outliers. Such atypical values, which may not be uncommon in large samples, can significantly deteriorate the performance of these algorithms even if they represent only a small fraction of the data, as explained in García-Escudero et al. (2010) or Croux et al. (2007). The $k$-medians approach is a first attempt to obtain more robust clustering algorithms; it was suggested by MacQueen (1967) and developed by Kaufman and Rousseeuw (1990).
It consists in considering criteria based on least norms instead of least squared norms, so that the cluster centers are the spatial medians, also called geometric or $L_1$-medians (see Small (1990)), of the elements belonging to each cluster. Note that it has been proved in Laloë (2010) that, under general assumptions, the minimum of the objective function is unique. Many algorithms have been proposed in the literature to find this minimum. The most popular one is certainly the PAM (partitioning around medoids) algorithm, developed by Kaufman and Rousseeuw (1990), which searches for local minima among the elements of the sample. Its computation time is $O(kn^2)$ and, as a consequence, it is not well adapted to large sample sizes. Many strategies have been suggested in the literature to reduce the computation time of this algorithm. For example, subsampling (see e.g. the algorithm CLARA in Kaufman and Rousseeuw (1990) and the algorithm CLARANS in Ng and Han (2002)), local distance computation (Zhang and Couloigner (2005)) or the use of weighted distances during the iteration steps (Park and Jun (2008)) allow one to reduce the computation time significantly without deteriorating the accuracy of the estimated partition. Trimmed $k$-means (see García-Escudero et al. (2008, 2010) and references therein) is also a popular modification of the $k$-means algorithm that is more robust (see García-Escudero and Gordaliza (1999)) in the sense that it has a strictly positive breakdown point, which is not the case for the $k$-medians. Note however that the breakdown point is a pessimistic indicator of robustness, since it is based on the worst possible scenario. For a small fraction of outliers at moderate distance from the cluster centers, $k$-medians remains competitive compared to trimmed $k$-means, as seen in the simulation study.
Furthermore, from a computational point of view, performing trimmed $k$-means requires sorting the data, a step that requires $O(n^2)$ operations in the worst case at each iteration, so that its execution time can get large when one has to deal with large samples. Borrowing ideas from MacQueen (1967) and Hartigan (1975), who first proposed sequential clustering algorithms, and Cardot et al. (2011), who studied the properties of stochastic gradient algorithms that can give efficient recursive estimators of the geometric median in high-dimensional spaces, we propose in this paper a recursive strategy that is able to estimate the cluster centers by minimizing a $k$-medians type criterion. One of the main advantages of our approach, compared to previous ones, is that it can be computed in only $O(kn)$ operations, so that it can deal with very large datasets, and it is more robust than the $k$-means. Note also that, by its recursive nature, it allows automatic updates and does not need to store all the data. A key tuning parameter in our algorithm is the descent step value. We found empirically that reasonable values are given by the empirical $L_1$ loss function. We thus also consider an automatic two-step procedure in which one first runs the sequential version of the $k$-means in order to approximate the value of the $L_1$ loss function and then runs our stochastic $k$-medians with an appropriate descent step. The paper is organized as follows. We first fix notations and present our algorithm. In the third section, we state the almost sure consistency of the stochastic gradient $k$-medians to a stationary point of the underlying objective function. The proof relies heavily on Monnez (2006). In Section 4, we compare on simulations the performance of our technique with the sequential $k$-means, the PAM algorithm and the trimmed $k$-means when the data are contaminated by a small fraction of outliers.
We note that applying averaging techniques (see Polyak and Juditsky (1992)) to our estimator, with a small number of different initialization points, is a very competitive approach even for moderate sample sizes, with much smaller computation times. In Section 5, we illustrate our new clustering algorithm on a large sample, of about 5000 individuals, in order to determine profiles of television audience. A major difference with PAM is that our algorithm searches for a solution in the whole space, whereas PAM, and its refinements CLARA and CLARANS, only look for a solution among the elements of the sample. Consequently, approaches such as PAM are not adapted to the temporal data presented in Section 5, since the data mainly consist of 0s and 1s indicating whether the television is switched on or off during each minute of the day. Proofs are gathered in the Appendix.

2 The stochastic gradient $k$-medians algorithm

2.1 Context and definitions

Let $(\Omega, \mathcal{A}, P)$ be a probability space. Suppose we have a sequence of independent copies $Z_1, \ldots, Z_n$ of a random vector $Z$ taking values in $\mathbb{R}^d$. The aim is to partition $\Omega$ into a finite number $k$ of clusters $\Omega_1, \ldots, \Omega_k$. Each cluster $\Omega_i$ is represented by its center, an element of $\mathbb{R}^d$ denoted by $\theta_i$. From a population point of view, the $k$-means and $k$-medians algorithms aim at finding local minima of the function $g$ mapping $\mathbb{R}^{dk}$ to $\mathbb{R}$ and defined as follows, for $x = (x_1, \ldots, x_k)'$ with $x_i \in \mathbb{R}^d$ for all $i$,

$$g(x) \stackrel{\mathrm{def}}{=} \mathbb{E}\left[\min_{r=1,\ldots,k} \Phi(\|Z - x_r\|)\right], \qquad (1)$$

where $\Phi$ is a real, positive, continuous and non-decreasing function, and the norm $\|\cdot\|$ in $\mathbb{R}^d$ takes account of the dimension $d$ of the data: for $z \in \mathbb{R}^d$, $\|z\|^2 = d^{-1}\sum_{j=1}^{d} z_j^2$. The particular case $\Phi(u) = u^2$ leads to the classical $k$-means algorithm, whereas $\Phi(u) = |u|$ leads to the $k$-medians.
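As an illustration, the empirical counterpart of the criterion (1) with $\Phi(u) = |u|$ can be evaluated as follows. This is a minimal Python sketch of our own (the function names are not from the paper); the dimension-normalized norm follows the definition above.

```python
import numpy as np

def dim_norm(v):
    # dimension-normalized norm: ||v||^2 = (1/d) * sum_j v_j^2
    return np.sqrt(np.mean(v ** 2))

def k_medians_loss(Z, centers):
    # empirical version of g(x) = E[ min_r Phi(||Z - x_r||) ] with Phi(u) = |u|
    # Z: (n, d) array of observations; centers: (k, d) array of candidate centers
    return float(np.mean([min(dim_norm(z - c) for c in centers) for z in Z]))

# toy usage: two well-separated Gaussian clusters in R^2
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
loss = k_medians_loss(Z, np.array([[-3.0, -3.0], [3.0, 3.0]]))
```

For $\Phi(u) = u^2$ one would simply square the inner distance, recovering the $k$-means criterion.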
Before presenting our new recursive algorithm, let us introduce some notations and recall the recursive $k$-means algorithm developed by MacQueen (1967). Let us denote by $I_r$ the indicator function

$$I_r(z; x) = \prod_{j=1}^{k} \mathbb{1}_{\{\|z - x_r\| \leq \|z - x_j\|\}},$$

which is equal to one when $x_r$ is the nearest point to $z$ among the set of points $x_i$, $i = 1, \ldots, k$. The recursive $k$-means algorithm proposed by MacQueen (1967) starts with $k$ arbitrary groups, each containing only one point, $X_1^1, \ldots, X_1^k$. Then, at each iteration, the cluster centers are updated as follows,

$$X_{n+1}^r = X_n^r - a_n^r \, I_r(Z_n; X_n) \, (X_n^r - Z_n), \qquad (2)$$

where for $n \geq 2$, $a_n^r = (1 + n_r)^{-1}$ and $n_r = 1 + \sum_{\ell=1}^{n-1} I_r(Z_\ell; X_\ell)$ is just the number of elements allocated to cluster $r$ until iteration $n - 1$. For $n = 1$, let $a_1^r = 1/2$. This also means that $X_{n+1}^r$ is simply the barycenter of the elements allocated to cluster $r$ until iteration $n$,

$$X_{n+1}^r = \frac{1}{1 + \sum_{\ell=1}^{n} I_r(Z_\ell; X_\ell)} \left( X_1^r + \sum_{\ell=1}^{n} I_r(Z_\ell; X_\ell) \, Z_\ell \right).$$

The interesting point is that this recursive algorithm is very fast and can be seen as a Robbins-Monro procedure.

2.2 Stochastic gradient $k$-medians algorithms

Assuming $Z$ has an absolutely continuous distribution, we have $P(\|Z - x_i\| = \|Z - x_j\|) = 0$ for any $i \neq j$ and $x_i \neq x_j$. Then, the $k$-medians approach relies on looking for minima, which may be local, of the function $g$, which can also be written as follows, for any $x$ such that $x_j \neq x_i$ when $i \neq j$,

$$g(x) = \sum_{r=1}^{k} \mathbb{E}\left[ I_r(Z; x) \, \|Z - x_r\| \right]. \qquad (3)$$

In order to get an explicit Robbins-Monro algorithm representation, it remains to exhibit the gradient of $g$. Let us write $g$ in integral form. Denoting by $f$ the density of the random variable $Z$, we have

$$g(x) = \sum_{r=1}^{k} \int_{\mathbb{R}^d \setminus \{x_r\}} I_r(z; x) \, \|z - x_r\| \, f(z) \, dz.$$

For $j = 1, \ldots, d$, it can be checked easily that

$$\frac{\partial}{\partial x_{rj}} \|z - x_r\| = \frac{x_{rj} - z_j}{\|z - x_r\|},$$

and since $I_r(z; x) \, \dfrac{|x_{rj} - z_j|}{\|z - x_r\|} \, f(z) \leq f(z)$ for $z \neq x_r$, the partial derivatives satisfy

$$\frac{\partial g}{\partial x_{rj}}(x) = \int_{\mathbb{R}^d \setminus \{x_r\}} I_r(z; x) \, \frac{x_{rj} - z_j}{\|z - x_r\|} \, f(z) \, dz.$$

We define, for $x \in \mathbb{R}^{dk}$,

$$\nabla_r g(x) \stackrel{\mathrm{def}}{=} \mathbb{E}\left[ I_r(Z; x) \, \frac{x_r - Z}{\|x_r - Z\|} \right]. \qquad (4)$$

We can now present our stochastic gradient $k$-medians algorithm. Given a set of $k$ distinct initialization points $X_1^1, \ldots, X_1^k$ in $\mathbb{R}^d$, the set of $k$ cluster centers is updated at each iteration as follows. For $r = 1, \ldots, k$ and $n \geq 1$,

$$X_{n+1}^r = X_n^r - a_n^r \, I_r(Z_n; X_n) \, \frac{X_n^r - Z_n}{\|X_n^r - Z_n\|} \qquad (5)$$
$$\phantom{X_{n+1}^r} = X_n^r - a_n^r \, \nabla_r g(X_n) - a_n^r \, V_n^r,$$

with $X_n = (X_n^1, \ldots, X_n^k)$, and

$$V_n^r \stackrel{\mathrm{def}}{=} I_r(Z_n; X_n) \, \frac{X_n^r - Z_n}{\|X_n^r - Z_n\|} - \mathbb{E}\left[ I_r(Z_n; X_n) \, \frac{X_n^r - Z_n}{\|X_n^r - Z_n\|} \;\Big|\; \mathcal{F}_n \right], \qquad \mathcal{F}_n = \sigma(X_1, Z_1, \ldots, Z_{n-1}).$$

The steps $a_n^r$, also called gains, are supposed to be $\mathcal{F}_n$-measurable. We denote by $\nabla g(x) = (\nabla_1 g(x), \ldots, \nabla_k g(x))'$ the gradient of $g$ and define $V_n \stackrel{\mathrm{def}}{=} (V_n^1, \ldots, V_n^k)'$. Let $A_n$ be the diagonal matrix of size $dk \times dk$,

$$A_n = \mathrm{diag}(a_n^1, \ldots, a_n^1, \ldots, a_n^k, \ldots, a_n^k),$$

each $a_n^r$ being repeated $d$ times. Then, the $k$-medians algorithm can be written in matrix form,

$$X_{n+1} = X_n - A_n \nabla g(X_n) - A_n V_n, \qquad (6)$$

which is a classical stochastic gradient descent.

2.3 Tuning the stochastic gradient $k$-medians and its averaged version

The behavior of algorithm (5) depends on the sequence of steps $a_n^r$, $r \in \{1, \ldots, k\}$, and the vector of initialization points $X_1$.
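One iteration of update (5) moves only the center nearest to the incoming observation, by a step of length $a_n^r$ along the unit vector pointing from that center towards the observation. A minimal Python sketch of our own (variable names are ours, not the paper's):

```python
import numpy as np

def kmedians_step(X, z, gains):
    # one stochastic gradient k-medians iteration, update (5):
    # X: (k, d) current centers, z: (d,) new observation,
    # gains: (k,) current step values a_n^r
    r = int(np.argmin(np.linalg.norm(X - z, axis=1)))  # I_r selects the nearest center
    diff = X[r] - z
    nrm = np.linalg.norm(diff)
    if nrm > 0.0:                 # the unit gradient is undefined when z == X[r]
        X[r] = X[r] - gains[r] * diff / nrm   # move a step of length gains[r] towards z
    return X, r
```

Unlike the $k$-means update (2), whose displacement is proportional to the full difference $X_n^r - Z_n$, the displacement here has fixed length $a_n^r$ whatever the distance of $Z_n$ to its nearest center; this is the source of the robustness to outliers.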
These two sets of tuning parameters play distinct roles, and we mainly focus on the choice of the step values, noting that, as for the $k$-means, the estimation results must be compared for different sets of initialization points in order to get a better estimation of the cluster centers. Assuming we have a sample of $n$ realizations $Z_1, \ldots, Z_n$ of $Z$ and a set of initialization points of the algorithm, the selected estimate of the cluster centers is the one minimizing the following empirical risk,

$$R(X_n) = \frac{1}{n} \sum_{i=1}^{n} \sum_{r=1}^{k} I_r(Z_i; X_n) \, \|Z_i - X_n^r\|. \qquad (7)$$

Let us denote by $n_r = 1 + \sum_{\ell=1}^{n-1} I_r(Z_\ell; X_\ell)$ the number of updating steps for cluster $r$ until iteration $n - 1$, for $r \in \{1, \ldots, k\}$. A classical form of the descent steps $a_n^r$ is given by

$$a_n^r = \begin{cases} a_{n-1}^r & \text{if } I_r(Z_n; X_n) = 0, \\[4pt] \dfrac{c_\gamma}{(1 + c_\alpha n_r)^{\alpha}} & \text{otherwise,} \end{cases} \qquad (8)$$

where $c_\gamma$, $c_\alpha$ and $1/2 < \alpha \leq 1$ control the gain. Adopting an asymptotic point of view, one could believe that $\alpha$ should be set to $\alpha = 1$, with suitable constants $c_\alpha$ and $c_\gamma$ which are unknown in practice, in order to attain the optimal parametric rates of convergence of Robbins-Monro algorithms (see e.g. Duflo (1997), Th. 2.2.12). Our experimental results on simulated data have shown that the convergence of algorithm (5) with descent steps defined in (8) is then very sensitive to the values of the parameters $c_\gamma$ and $c_\alpha$, which have to be chosen very carefully. A simulation study performed in the particular case $k = 1$ by Cardot et al. (2010) showed that the direct approach can lead to inaccurate results and is nearly always less effective than the averaged algorithm presented below, even for well chosen descent step values. From an asymptotic point of view, it has been proved in Cardot et al. (2011) that the averaged stochastic gradient estimator of the geometric median, corresponding to $k = 1$, is asymptotically efficient under classical assumptions.
Intuitively, when the algorithm is not too far from the solution, averaging substantially decreases the variability of the initial algorithm, which oscillates around the true solution, and thus greatly improves its performance. Consequently, we prefer to introduce an averaging step (see for instance Polyak and Juditsky (1992) or Pelletier (2000)), which does not slow down the algorithm and provides a much more effective estimator. Our averaged estimator of the cluster centers, which remains recursive, is defined as follows: for $r \in \{1, \ldots, k\}$, $n \geq 1$, and for the value $X_{n+1}^r$ obtained by combining (5) and (8),

$$\bar{X}_{n+1}^r = \begin{cases} \bar{X}_n^r & \text{if } I_r(Z_n; X_n) = 0, \\[4pt] \dfrac{n_r \bar{X}_n^r + X_{n+1}^r}{n_r + 1} & \text{otherwise,} \end{cases} \qquad (9)$$

with starting points $\bar{X}_1^r = X_1^r$, $r = 1, \ldots, k$. Then standard choices (see e.g. Bottou (2010) and references therein) for $\alpha$ and $c_\alpha$ are $\alpha = 3/4$ and $c_\alpha = 1$, so that one only needs to select a value for $c_\gamma$.

3 Almost sure convergence of the algorithm

3.1 A convergence theorem

The following theorem is the main theoretical result of this paper. It states that the recursive algorithm defined in (6) converges almost surely to the set of stationary points of the objective function defined in (3), under the following assumptions.

(H1) a) The random vector $Z$ is absolutely continuous with respect to the Lebesgue measure.
b) $Z$ is bounded: $\exists K > 0$ such that $\|Z\| \leq K$ a.s.
c) $\exists C$ such that $\forall x \in \mathbb{R}^d$ with $\|x\| \leq K + 1$, $\mathbb{E}\left[\frac{1}{\|Z - x\|}\right] < C$.

(H2) a) $\forall n \geq 1$, $\min_r a_n^r > 0$.
b) $\max_r \sup_n a_n^r < \min\left(\frac{1}{2}, \frac{1}{8C}\right)$ a.s.
c) $\sum_{n=1}^{\infty} \max_r a_n^r = \infty$ a.s.
d) $\sup_n \frac{\max_r a_n^r}{\min_r a_n^r} < \infty$ a.s.

(H3) $\sum_{r=1}^{k} \sum_{n=1}^{\infty} (a_n^r)^2 < \infty$ a.s.

(H3') $\sum_{r=1}^{k} \sum_{n=1}^{\infty} \mathbb{E}\left[(a_n^r)^2 \, I_r(Z_n; X_n)\right] < \infty$.

Theorem 1. Assume that $X_1$ is absolutely continuous and that $\|X_1^r\| \leq K$, for $r = 1, \ldots, k$.
Then, under Assumptions (H1a,c), (H2a,b) and (H3) or (H3'), $g(X_n)$ and

$$\sum_{r=1}^{k} \sum_{n=1}^{\infty} a_n^r \, \|\nabla_r g(X_n)\|^2$$

converge almost surely. Moreover, if hypotheses (H1b) and (H2c,d) are also fulfilled, then $\nabla g(X_n)$ and the distance between $X_n$ and the set of stationary points of $g$ converge almost surely to zero.

A direct consequence of Theorem 1 is that if the set of stationary points of $g$ is finite, then the sequence $(X_n)_n$ necessarily converges almost surely towards one of these stationary points, because $X_{n+1} - X_n$ converges almost surely towards zero. By Cesàro mean arguments, the averaged sequence $\bar{X}_n$ also converges almost surely towards the same stationary point.

3.2 Comments on the hypotheses

Note first that if the data do not arrive online and $X_1$ is chosen randomly among the sample units, then $X_1$ is absolutely continuous and $\|X_1^r\| \leq K$, for $r = 1, \ldots, k$, under (H1a,b). Moreover, the absolute continuity of $Z$ is a technical assumption that is required to get decomposition (3) of the $L_1$ error. Proving the convergence in the presence of atoms would require decomposing this error in order to take into account the points which could have a non-null probability of being at the same distance. Such a study is clearly beyond the scope of this paper. Note however that, in the simple case $k = 1$, it has been established in Cardot et al. (2011) that the stochastic algorithm for the functional median is convergent provided that the distribution, which can be a mixture of a continuous and a discrete distribution, does not charge the median. Hypothesis (H1c) is a stronger version of a more classical hypothesis needed to get consistent estimators of the spatial median (see Chaudhuri (1996)). As noted in Cardot et al. (2011), it is closely related to small ball properties of $Z$ and is fulfilled when $P(\|Z - x\| \leq \epsilon) \leq \kappa \epsilon^2$, for a constant $\kappa$ that does not depend on $x$ and $\epsilon$ small enough.
This implies in particular that hypothesis (H1c) can be satisfied only when the dimension $d$ of the data satisfies $d \geq 2$. Hypotheses (H2) and (H3) or (H3') deal with the stepsizes. Considering the general form of the gains $a_n^r$ given in (8), they are fulfilled when the sizes $n_r$ of all the clusters grow to infinity at the same rate and $\alpha \in \, ]1/2, 1]$.

4 A simulation study

We first perform a simulation study to compare our recursive $k$-medians algorithm with the following well known clustering algorithms: the recursive version of the $k$-means (function kmeans in R), trimmed $k$-means (function tkmeans in the R package tclust, with the trimming coefficient set to its default value $\alpha = 0.05$) and PAM (function pam in the R package cluster). Our codes are available on request. Comparisons are first made according to the value of the empirical $L_1$ error (7), which must be as small as possible. We note that the results of our averaged recursive procedure defined by (5), (8) and (9) are very stable when the value of the tuning parameter $c_\gamma$ is not too far from the minimum value of the $L_1$ error, with $\alpha = 3/4$ and $c_\alpha = 1$. This leads us to propose, in Section 4.2, an automatic clustering algorithm which consists in first approximating the $L_1$ error with a recursive $k$-means and then performing our recursive $k$-medians with the selected value of $c_\gamma$, denoted by $c$ in the following. We have no mathematical justification for such an automatic choice of the tuning parameter $c$, but it always worked well in all our simulated experiments. This important point of our algorithm deserves further investigation that is beyond the scope of this paper. Note however that this intuitive approach will certainly be ineffective when the dispersion is very different from one group to another. It would then be possible to consider refinements of the previous algorithm consisting in different values of the tuning parameter $c$ for the different clusters.
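To make the compared procedure concrete, the averaged stochastic $k$-medians, combining the update (5), the gains (8) and the averaging step (9) with the standard choices $\alpha = 3/4$ and $c_\alpha = 1$, can be sketched as follows. This is our own illustrative Python code, not the authors' implementation; only the constant $c$ (playing the role of $c_\gamma$) is left to the user.

```python
import numpy as np

def averaged_kmedians(Z, X_init, c, c_alpha=1.0, alpha=0.75):
    # averaged stochastic gradient k-medians: the raw iterates follow
    # update (5) with gains (8); the returned centers are the
    # recursively averaged iterates of (9)
    X = np.array(X_init, dtype=float)   # (k, d) raw iterates X_n
    Xbar = X.copy()                     # averaged iterates \bar X_n
    n_r = np.zeros(len(X))              # number of updates per cluster
    for z in Z:
        r = int(np.argmin(np.linalg.norm(X - z, axis=1)))
        n_r[r] += 1
        a = c / (1.0 + c_alpha * n_r[r]) ** alpha    # gain (8)
        diff = X[r] - z
        nrm = np.linalg.norm(diff)
        if nrm > 0.0:
            X[r] = X[r] - a * diff / nrm             # update (5)
        # averaging step (9), applied only to the updated center
        Xbar[r] = (n_r[r] * Xbar[r] + X[r]) / (n_r[r] + 1.0)
    return Xbar
```

In the data-driven version described above, $c$ would be set to the empirical $L_1$ error (7) of a preliminary run of the recursive $k$-means.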
We only present here a few simulation experiments which highlight both the strengths and the drawbacks of our recursive $k$-medians algorithm.

4.1 Simulation protocol

Simulation 1: a simple experiment in $\mathbb{R}^2$. We first consider a very simple case and draw independent realizations of the variable $Z$,

$$Z = (1 - \epsilon)\,(\pi_1 Z_1 + \pi_2 Z_2 + \pi_3 Z_3) + \epsilon\, \delta_z, \qquad (10)$$

which is a mixture, with weights $\pi_1 = \pi_2 = \pi_3 = 1/3$, of three bivariate Gaussian random vectors $Z_1$, $Z_2$ and $Z_3$ with mean vectors $\mu_1 = (-3, -3)$, $\mu_2 = (3, -3)$ and $\mu_3 = (4.5, -4.5)$ and covariance matrices

$$\mathrm{Var}(Z_1) = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \quad \mathrm{Var}(Z_2) = \begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, \quad \mathrm{Var}(Z_3) = \begin{pmatrix} 2 & -1 \\ -1 & 3 \end{pmatrix}.$$

The point $z = (-14, 14)$ is an outlier, and the parameter $\epsilon$ controls the level of the contamination. A sample of $n = 450$ realizations of $Z$ is drawn in Figure 1.

Figure 1: Simulation 1. A sample of $n = 450$ realizations of $Z$. An outlier is located at position $(-14, 14)$.

Figure 2: Simulation 2. A sample of $n = 36$ realizations of $Z$ with $d = 50$. The mean values $\mu_1$, $\mu_2$ and $\mu_3$ of the three natural clusters are drawn in bold lines.

Simulation 2: larger dimension with different correlation levels. We also performed a simulation experiment with a mixture of three Gaussian random variables as in (10), but in higher dimensional spaces, with correlation levels that vary from one cluster to another. Now, $Z_1$, $Z_2$ and $Z_3$ are independent multivariate normal distributions in $\mathbb{R}^d$, with means $\mu_{1j} = 2\sin(2\pi j/(d-1))$, $\mu_{2j} = 2\sin(2\pi/3 + 2\pi j/(d-1))$ and $\mu_{3j} = 2\sin(4\pi/3 + 2\pi j/(d-1))$, for $j = 1, \ldots, d$. The covariance functions $\mathrm{Cov}(Z_{ij}, Z_{i\ell}) = 1.5\,\rho_i^{|j-\ell|}$, for $j, \ell \in \{1, \ldots, d\}$ and $i \in \{1, 2, 3\}$, are controlled by a correlation parameter $\rho$, with $\rho_1 = 0.1$, $\rho_2 = 0.5$ and $\rho_3 = 0.9$.
Note that this covariance structure corresponds to autoregressive processes of order one with autocorrelation $\rho_i$. As before, $\delta_z$, with $z = (4, \ldots, 4) \in \mathbb{R}^d$, plays the role of an outlying point. A sample of $n = 36$ independent realizations of $Z$, without outliers, is drawn in Figure 2 for dimension $d = 50$.

4.2 $L_1$ error and sensitivity to the parameter $c$

As noted in Bryant and Williamson (1978), directly comparing the distance of the estimates from the cluster centers $\mu_1$, $\mu_2$ and $\mu_3$ may not be appropriate to evaluate a clustering method. Our comparison is thus first made in terms of the value of the empirical $L_1$ error (7), which should be as small as possible. For all methods, we considered that there were $k = 3$ clusters.

Figure 3: Simulation 1 with $\epsilon = 0.05$ and $n = 250$. Mean empirical $L_1$ error (over 50 replications) for the PAM algorithm (dashed line), the $k$-means ($c = 0$) and the stochastic $k$-medians (bold line), for $c \in \, ]0, 10]$.

We first study the simple case of Simulation 1. The empirical mean $L_1$ error of the PAM algorithm, the $k$-means and the averaged $k$-medians, for 50 replications of samples of size $n = 250$ with a contamination level $\epsilon = 0.05$, is presented in Figure 3. The number of initialization points equals 10 for both the $k$-means and the $k$-medians. When the descent parameter $c$ equals 0, the initialization point is given by the centers estimated by the $k$-means, so that the empirical $L_1$ error corresponds in that case to the $k$-means error, which is slightly above 2.31. We first note that this $L_1$ error is always larger, even if the contamination level is small, than the PAM and the $k$-medians errors, for $c \in \, ]0, 10]$. Secondly, it appears that for $c \in [0.5, 4]$, the $k$-medians $L_1$ error is nearly constant and clearly smaller than the $L_1$ error of the PAM algorithm.
This means that, even if the sample size is relatively small ($n = 250$), the recursive $k$-medians can perform well for values of $c$ that are of the same order as the $L_1$ error.

Figure 4: Simulation 2 with $n = 500$, $d = 50$ and $\epsilon = 0.05$. The mean empirical $L_1$ error (over 50 replications) is represented for the PAM algorithm (dashed line), the MacQueen version of the $k$-means ($c = 0$) and the recursive $k$-medians estimator (bold line), for $c \in \, ]0, 7]$.

We now consider 50 replications of samples drawn from the distribution described in Simulation 2, with $n = 500$, $d = 50$ and $\epsilon = 0.05$. The number of initialization points for the $k$-means and the $k$-medians is now equal to 25, and the empirical mean $L_1$ error is presented in Figure 4. We first note that the performance of the PAM algorithm clearly decreases with the dimension. The $k$-means performs better, even though there are 5% of outliers and it is not designed to minimize an $L_1$ error criterion. This can be explained by the fact that PAM, as well as CLARA and CLARANS, looks for a solution among the elements of the sample. Thus these approaches can hardly explore all the dimensions of the data when $d$ is large and $n$ is not large enough. On the other hand, the $k$-medians and the $k$-means look for a solution in all of $\mathbb{R}^d$, are not restricted to the observed data, and thus provide better results in terms of $L_1$ error. As before, we can also remark that the minimum error, which is around 1.36, is attained for $c$ in the interval $[0.5, 3]$.

Figure 5: Simulation 2 with $n = 1000$, $d = 200$, $\epsilon = 0.05$, and $Z$ multiplied by a factor 10. The mean $L_1$ loss function (over 50 replications) is represented for the PAM algorithm (dashed line), the MacQueen version of the $k$-means ($c = 0$) and our recursive $k$-medians estimator (bold line), for $c \in \, ]0, 40]$.
We finally present results from Simulation 2 in which we consider samples of size $n = 1000$ of the variable $10Z$, with $d = 200$. The contamination level is $\epsilon = 0.05$, and 50 initialization points were considered for the $k$-means and $k$-medians algorithms. Since $Z$ has been multiplied by a factor 10, the minimum of the $L_1$ error is now around 13.6. We remark, as before, that because of the dimension of the data, $d = 200$, PAM is outperformed by the $k$-means and the $k$-medians, even in the presence of a small fraction of outliers ($\epsilon = 0.05$). The minimum of the $L_1$ error for the $k$-medians estimator is again very stable for $c \in [5, 25]$, with smaller values than the $L_1$ error of the $k$-means clustering. As a first conclusion, it appears that for large dimensions the $k$-medians can give results which are much better than PAM in terms of empirical $L_1$ error. We can also note that the averaged recursive $k$-medians is not very sensitive to the choice of the parameter $c$, provided its value is not too far from the minimum value of the $L_1$ error. Thus we only consider, in the following subsection, the data-driven version of our averaged algorithm described in Section 2.3, in which the value of $c$ is chosen automatically, being set to the empirical $L_1$ error of the recursive $k$-means. This data-driven $k$-medians algorithm can be summarized as follows:

1. Run the $k$-means algorithm and get the estimated centers.
2. Set $c$ as the value of the $L_1$ error of the $k$-means, evaluated with formula (7).
3. Run the averaged $k$-medians defined by (5), (8) and (9), with $c$ computed in Step 2 and $c_\alpha = 1$.

4.3 Classification Error Rate

We now make comparisons in terms of classification error, measured by the Classification Error Rate (CER) introduced by Chipman and Tibshirani (2005) and defined as follows. For a given partition $P$ of the sample, let $1_{P(i,i')}$ be an indicator for whether partition $P$ places observations $i$ and $i'$ in the same group.
Consider a partition $Q$ with the true class labels; the CER for partition $P$ is defined as

$$\mathrm{CER} = \frac{2}{n(n-1)} \sum_{i > i'} \left| 1_{P(i,i')} - 1_{Q(i,i')} \right|. \qquad (11)$$

The CER equals 0 if the partitions $P$ and $Q$ agree perfectly, whereas a high value indicates disagreement. Since PAM, the $k$-means and our algorithm are not designed to detect outliers automatically, we only evaluate the CER on the non-outlying pairs of elements $i$ and $i'$ of the sample. We present in Figure 6 results for 500 replications of Simulation 1, with a sample size $n = 500$ and no outliers ($\epsilon = 0$). We note, in this small dimension context with no contamination, that the $L_1$ errors are comparable. Nevertheless, in terms of CER, PAM, the $k$-means and the data-driven $k$-medians algorithms have approximately the same performance. For the trimmed $k$-means, the results are not as effective, since this algorithm automatically classifies 5% of the elements of the sample as outliers. We then consider the same experiment as before, the only difference being that there is now a fraction $\epsilon = 0.05$ of outliers. The results are presented in Figure 7. The $k$-means algorithm is clearly affected by the presence of outliers, and both its $L_1$ error and its CER are now much larger than for the other algorithms.

Figure 6: Simulation 1 with $\epsilon = 0$ and $n = 500$. On the left, the empirical $L_1$ error. On the right, the CER for the $k$-means, PAM, the data-driven recursive $k$-medians algorithm (kmed) and the trimmed $k$-means (tkm).

Figure 7: Simulation 1 with $\epsilon = 0.05$ and $n = 500$. On the left, the empirical $L_1$ error. On the right, the CER for the $k$-means, PAM, the data-driven recursive $k$-medians algorithm (kmed) and the trimmed $k$-means (tkm).
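The CER (11) is straightforward to compute from pairwise co-membership indicators. A small self-contained Python sketch of our own (with this $O(n^2)$ pair enumeration it is only meant for modest sample sizes):

```python
from itertools import combinations

def cer(labels_p, labels_q):
    # Classification Error Rate of Chipman and Tibshirani (2005), eq. (11):
    # the proportion, over the n(n-1)/2 pairs, of pairs on which the two
    # partitions disagree about placing i and i' in the same group
    n = len(labels_p)
    disagreements = sum(
        (labels_p[i] == labels_p[j]) != (labels_q[i] == labels_q[j])
        for i, j in combinations(range(n), 2)
    )
    return 2.0 * disagreements / (n * (n - 1))
```

Note that the CER is invariant to a relabeling of the clusters, since it only depends on co-membership of pairs.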
PAM and the recursive $k$-medians have similar performances, even if PAM is slightly better. The trimmed $k$-means now detects the outliers and also has good performances. If the contamination level increases to $\epsilon = 0.1$, as presented in Figure 8, then PAM and the trimmed $k$-means (with a trimming coefficient $\alpha = 0.05$) are strongly affected in terms of CER and do not recover the true groups. The $k$-medians algorithm is less affected by this larger level of contamination. Its median CER is nearly unchanged, meaning that for at least 50% of the samples it gives a correct partition.

Figure 8: Simulation 1 with $\epsilon = 0.1$ and $n = 1000$. On the left, the empirical $L_1$ error. On the right, the CER for the $k$-means, PAM, the data-driven recursive $k$-medians algorithm (kmed) and the trimmed $k$-means (tkm).

We now consider Simulation 2, with dimension $d = 50$ and a fraction $\epsilon = 0.05$ of outliers. The empirical $L_1$ errors and the CER, for sample size $n = 500$, are given in Figure 9. It clearly appears that PAM has the largest $L_1$ errors, while the trimmed $k$-means and the data-driven $k$-medians have the smallest ones. Intermediate $L_1$ errors are obtained for the $k$-means. In terms of CER, the partitions obtained by the $k$-means and PAM are not effective and do not recover the true partition well in the majority of the samples. The trimmed $k$-means and our algorithm always perform well and have similarly low values in terms of CER.

Figure 9: Simulation 2 with $\epsilon = 0.05$, $n = 500$ and $d = 50$. On the left, the empirical $L_1$ error. On the right, the CER for the $k$-means, PAM, the data-driven recursive $k$-medians algorithm (kmed) and the trimmed $k$-means (tkm).
4.4 Computation time

The codes of all the considered estimation procedures call C routines and provide the same output. Mean computation times, for 100 runs, various sample sizes and numbers of clusters, are reported in Table 1. They are based on one initialization point. From a computational point of view, the recursive $k$-means based on the MacQueen algorithm as well as the averaged stochastic $k$-medians algorithm are always faster than the other algorithms, and the gain increases as the sample size gets larger. For example, when $k = 5$ and $n = 2000$, the stochastic $k$-medians is approximately 30 times faster than the trimmed $k$-means and 350 times faster than the PAM algorithm. The data-driven recursive $k$-medians has a computation time of approximately the sum of the computation times of the recursive $k$-means and the stochastic $k$-medians. This also means that when the allocated computation time is fixed and the dataset is very large, the data-driven recursive $k$-medians can deal with sample sizes that are 15 times larger than the trimmed $k$-means and 175 times larger than the PAM algorithm.

Table 1: Comparison of the mean computation time in seconds, for 100 runs, of the different estimators for various sample sizes and numbers of clusters $k$. The computation times are given for one initialization point.

                        n=250                n=500                 n=2000
Estimator           k=2   k=4   k=5     k=2    k=4    k=5     k=2    k=4     k=5
k-medians           0.33  0.35  0.36    0.45   0.47   0.48    1.14   1.25    1.68
PAM                 1.38  3.34  4.21    5.46   15.12  20.90   111    302.00  596.00
Trimmed k-means     2.20  5.45  6.76    5.32   11.19  13.48   22.97  42.72   51.00
MacQueen            0.21  0.29  0.31    0.25   0.43   0.50    0.60   1.38    1.76

When the sample size or the dimension increases, the computation time is even more critical. For instance, when $d = 1440$ and $n = 5422$ as in the example of Section 5, our data-driven recursive $k$-medians estimation procedure is at least 500 times faster than the trimmed $k$-means.
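The speed of the recursive approach comes from its one-pass nature: each new observation moves only its closest center by a small step along the unit vector towards the observation, following the recursion $X_{n+1}^r = X_n^r - a_n^r (X_n^r - Z_n)/\|X_n^r - Z_n\|$ used in the appendix. As an illustration only (the step schedule $a_n = c/n^{3/4}$, the initialization at the first $k$ points and all names below are our assumptions, not the paper's data-driven procedure), an averaged stochastic gradient $k$-medians can be sketched as:

```python
import numpy as np

def averaged_kmedians(Z, k, c=1.0):
    """One-pass averaged stochastic gradient k-medians (illustrative sketch).

    At each new observation, only the closest center moves by a step of length
    a_n along the unit vector towards the observation; the returned centers are
    the running (Polyak-Ruppert) averages of the iterates.
    """
    Z = np.asarray(Z, dtype=float)
    X = Z[:k].copy()          # initialize centers at the first k observations
    Xbar = X.copy()           # averaged centers
    counts = np.ones(k)       # updates seen per center, for the running average
    for n, z in enumerate(Z, start=1):
        dists = np.linalg.norm(X - z, axis=1)
        r = int(np.argmin(dists))          # index of the closest center
        if dists[r] > 0:
            a_n = c / n ** 0.75            # assumed decreasing step schedule
            X[r] -= a_n * (X[r] - z) / dists[r]
        counts[r] += 1
        Xbar[r] += (X[r] - Xbar[r]) / counts[r]   # running average of iterates
    return Xbar

# Two well-separated groups, interleaved so the first two points differ
rng = np.random.default_rng(1)
Z = np.empty((1000, 2))
Z[0::2] = rng.normal(0, 0.1, (500, 2))   # group around (0, 0)
Z[1::2] = rng.normal(5, 0.1, (500, 2))   # group around (5, 5)
centers = averaged_kmedians(Z, k=2, c=0.5)
print(np.round(centers, 1))  # centers land near (0, 0) and (5, 5)
```

Averaging the iterates, in the spirit of Polyak and Juditsky (1992), is what the paper relies on for good practical accuracy while keeping the cost of each update linear in $d$.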
It takes 5.5 seconds for the sequential $k$-means to converge and then about 3.0 seconds for the averaged $k$-medians, whereas it takes more than 5700 seconds for the trimmed $k$-means.

5 Determining television audience profiles with $k$-medians

The Médiamétrie company provides every day the official estimations of television audience in France. Television consumption can be measured both in terms of how long people watch each channel and when they watch television. Médiamétrie has a panel of about 9000 individuals equipped at home with sensors that are able to record and send the audience of the different television channels. Among this panel, a sample of around 7000 people is drawn every day, and the television consumption of the people belonging to this sample is sent to Médiamétrie at night, between 3 and 5 am. Online clustering techniques are then interesting to determine automatically, the number of clusters being fixed in advance, the main profiles of viewers and then relate these profiles to socio-economic variables. In these samples, Médiamétrie has noted the presence of some atypical behaviors, so that robust techniques may be helpful.

Figure 10: A sample of 5 observations of individual audience profiles measured every minute over a period of 24 hours.

We are interested in building profiles of the evolution along time of the total audience for people who watched television at least one minute on the 6th September 2010. About 1600 people, among the initial sample whose size is around 7000, did not watch television at all that day, so that we finally get a sample of $n = 5422$ individual audiences, aggregated along all television channels and measured every minute over a period of 24 hours.
An observation $Z_i$ is a vector belonging to $[0,1]^d$, with $d = 1440$, each component giving the fraction of time spent watching television during the corresponding minute of the day. A sample of 5 individual temporal profiles is drawn in Figure 10. Clustering techniques based on medoids and representative elements of the sample (e.g. PAM, CLARA and CLARANS) are not really interesting in this context, since they will return centers of the form of the profiles drawn in Figure 10, which are, in great majority, constituted of 0 and 1 and are consequently difficult to interpret. Furthermore, the dimension being very large, $d = 1440$, these algorithms, which do not consider all the dimensions of the data, as seen in the simulation study, will lead to a minimum value of the empirical $L_1$ error (7) that is substantially larger than for the $k$-means and our recursive $k$-medians. Indeed, at the optimum, the value of the $L_1$ empirical error is 0.2455 for the $k$-medians, 0.2471 for the $k$-means and 0.2692 for PAM.

The cluster centers, estimated with our averaged algorithm for $k = 5$, with a parameter value selected automatically, $c = 0.2471$, and 100 different starting points, are drawn in Figure 11. They have been ordered in decreasing order according to the sizes of the clusters and labelled Cl.1 to Cl.5. Cluster 1 (Cl.1) is thus the largest cluster and contains about 35% of the elements of the sample. It corresponds to individuals who do not watch television much during the day, with a cumulative audience of about 42 minutes. At the opposite, cluster 5, which represents about 12% of the sample, is associated with high audience rates during nearly the whole day, with a cumulative audience of about 592 minutes. Clusters 2, 3 and 4 correspond to intermediate consumption levels and can be distinguished according to whether the audience occurs during the evening or at night.
For example, Cluster 4, which represents 16% of the sample, is related to people watching television late at night, with a cumulative audience of about 310 minutes.

Figure 11: Cluster centers for temporal television audience profiles measured every minute over a period of 24 hours.

Appendix: Proof of Theorem 1

The proof of Theorem 1 relies on the following light version of the main theorem in Monnez (2006), Section 2.1, and consists in checking that all the conditions of this theorem are satisfied.

Theorem 2 (Monnez (2006)). Assume that

(A1a) $g$ is a non negative function;

(A1b) there exists a constant $L > 0$ such that, for all $n \ge 1$,
$$g(X_{n+1}) - g(X_n) \le \langle X_{n+1} - X_n, \nabla g(X_n) \rangle + L \|X_{n+1} - X_n\|^2 \quad \text{a.s.};$$

(A1c) the sequence $(X_n)$ is almost surely bounded and $\nabla g$ is continuous almost everywhere on the compact set containing $(X_n)$;

(A2) there exist four sequences of random variables $(B_n)$, $(C_n)$, $(D_n)$ and $(E_n)$ in $\mathbb{R}_+$, adapted to the sequence $(\mathcal{F}_n)$, such that a.s.:

(A2a) $\left\| \sqrt{A_n}\, E[V_n \mid \mathcal{F}_n] \right\|^2 \le B_n g(X_n) + C_n$ and $\sum_{n=1}^{\infty} (B_n + C_n) < \infty$;

(A2b) $E[\|A_n V_n\|^2 \mid \mathcal{F}_n] \le D_n g(X_n) + E_n$ and $\sum_{n=1}^{\infty} (D_n + E_n) < \infty$;

(A3) $\sup_n a_n^r < \min\left(\frac{1}{2}, \frac{1}{4L}\right)$ a.s., $\sum_{n=1}^{\infty} \max_r a_n^r = \infty$ a.s. and $\sup_n \frac{\max_r a_n^r}{\min_r a_n^r} < \infty$ a.s.

Then the distance of $X_n$ to the set of stationary points of $g$ converges almost surely to zero.

Proof of Theorem 1. Let us now check that all the conditions of Theorem 2 are fulfilled in our context.

Step 1: proof of (A1b). Let $A = X_n$ and $B = X_{n+1}$. Since $X_n$ is absolutely continuous with respect to the Lebesgue measure, $\sum_{r=1}^{k} I_r(Z; A) = 1$ a.s.
and one gets
$$g(B) = E\left[ \min_r \|Z - B^r\| \right] = E\left[ \sum_{r=1}^{k} I_r(Z; A)\, \min_j \|Z - B^j\| \right],$$
and it comes
$$g(B) \le \sum_{r=1}^{k} E\left[ I_r(Z; A)\, \|Z - B^r\| \right],$$
which yields
$$g(B) - g(A) \le \sum_{r=1}^{k} E\left[ I_r(Z; A) \left( \|Z - B^r\| - \|Z - A^r\| \right) \right].$$
The application $x \mapsto \|z - x^r\|$ is a continuous function whose gradient $\nabla_r \|z - x^r\| = \frac{x^r - z}{\|x^r - z\|}$ is also continuous for $x^r \ne z$. Then, almost surely for $d \ge 2$, there exists $C^r = A^r + \mu^r (B^r - A^r)$, with $0 \le \mu^r \le 1$, such that
$$\|Z - B^r\| - \|Z - A^r\| = \langle B^r - A^r, \nabla_r \|Z - C^r\| \rangle.$$
Consequently, for all $d \ge 2$,
$$g(B) - g(A) \le \sum_{r=1}^{k} E\left[ I_r(Z; A)\, \langle B^r - A^r, \nabla_r \|Z - C^r\| \rangle \right],$$
so that
$$g(B) - g(A) \le \sum_{r=1}^{k} E\left[ I_r(Z; A)\, \langle B^r - A^r, \nabla_r \|Z - C^r\| - \nabla_r \|Z - A^r\| \rangle \right] + \sum_{r=1}^{k} E\left[ I_r(Z; A)\, \langle B^r - A^r, \nabla_r \|Z - A^r\| \rangle \right] \stackrel{\text{def}}{=} (1) + (2).$$
On the one hand,
$$(2) = \sum_{r=1}^{k} \langle B^r - A^r, \nabla_r g(A) \rangle = \langle B - A, \nabla g(A) \rangle,$$
and on the other hand,
$$(1) \le \sum_{r=1}^{k} \|B^r - A^r\|\, E\left[ \left\| \nabla_r \|Z - C^r\| - \nabla_r \|Z - A^r\| \right\| \right].$$
Hence, since
$$\left\| \nabla_r \|Z - C^r\| - \nabla_r \|Z - A^r\| \right\| = \left\| \frac{C^r - Z}{\|C^r - Z\|} - \frac{A^r - Z}{\|A^r - Z\|} \right\| \le \frac{2\,\|C^r - A^r\|}{\|A^r - Z\|},$$
one gets, with (H1c),
$$(1) \le 2 \sum_{r=1}^{k} \|B^r - A^r\|\, \|C^r - A^r\|\, E\left[ \frac{1}{\|Z - A^r\|} \right] \le 2C \sum_{r=1}^{k} \|B^r - A^r\|^2 = 2C\, \|B - A\|^2.$$
Consequently, we have
$$g(B) - g(A) \le \langle B - A, \nabla g(A) \rangle + 2C\, \|B - A\|^2.$$

Step 2: proof of the assertion: for all $n \ge 1$ and all $r = 1, \ldots, k$, $\|X_n^r\| \le K + 2 \sup_n a_n^r$.

Let us prove by induction on $n$ that for all $n \in \mathbb{N}^*$ and all $r = 1, \ldots, k$, $\|X_n^r\| \le K + 2 \sup_n a_n^r$. This inequality is trivial for $n = 1$: $\|X_1^r\| \le K$. Let $n \in \mathbb{N}^*$ be such that $\|X_n^r\| \le K + 2 \sup_n a_n^r$ for all $r \in \{1, \ldots, k\}$, and let $r \in \{1, \ldots, k\}$. First, assume that $\|X_n^r\| \le K + a_n^r$. Then it comes
$$\|X_{n+1}^r\| \le \|X_n^r\| + a_n^r\, I_r(Z_n; X_n) \le \|X_n^r\| + a_n^r \le K + 2 a_n^r.$$
Now, in the case where $K + a_n^r < \|X_n^r\| \le K + 2 \sup_n a_n^r$, one gets $\|X_n^r\| > K + a_n^r \ge \|Z_n\| + a_n^r$, and then $\|X_n^r - Z_n\| \ge \left| \|X_n^r\| - \|Z_n\| \right| > a_n^r$. Since $X_{n+1}^r = X_n^r$ when $I_r(Z_n; X_n) = 0$, it remains to deal with the unique index $r$ such that $I_r(Z_n; X_n) = 1$. In that case,
$$X_{n+1}^r = X_n^r - a_n^r\, \frac{X_n^r - Z_n}{\|X_n^r - Z_n\|} = \left( 1 - \frac{a_n^r}{\|X_n^r - Z_n\|} \right) X_n^r + \frac{a_n^r\, Z_n}{\|X_n^r - Z_n\|}.$$
By (H1b) and from the inequalities $a_n^r / \|X_n^r - Z_n\| < 1$ and $\|Z_n\| \le K < \|X_n^r\|$, we have
$$\|X_{n+1}^r\| < \left( 1 - \frac{a_n^r}{\|X_n^r - Z_n\|} \right) \|X_n^r\| + \frac{a_n^r\, \|X_n^r\|}{\|X_n^r - Z_n\|} = \|X_n^r\|,$$
which leads to $\|X_{n+1}^r\| \le K + 2 \sup_n a_n^r$ and concludes the proof by induction.

Step 3: proof of (A1c). From the integral form
$$\frac{\partial g}{\partial x_j^r}(x) = \int_{\mathbb{R}^d \setminus \{x^r\}} I_r(z; x)\, \frac{x_j^r - z_j}{\|z - x^r\|}\, f(z)\, dz,$$
it is easy to see that $\frac{\partial g}{\partial x_j^r}$ is a continuous function of $x$.

Step 4: proof of (A2a). The definition of $V_n^r$ implies that $E[V_n^r \mid \mathcal{F}_n] = 0$ and hence $E[V_n \mid \mathcal{F}_n] = 0$.

Step 5: proof of (A2b).
$$E\left[ \|A_n V_n\|^2 \mid \mathcal{F}_n \right] = \sum_{r=1}^{k} E\left[ (a_n^r)^2 \|V_n^r\|^2 \mid \mathcal{F}_n \right] \le \sum_{r=1}^{k} (a_n^r)^2\, E\left[ I_r(Z_n; X_n)\, \frac{\|X_n^r - Z_n\|^2}{\|X_n^r - Z_n\|^2} \,\Big|\, \mathcal{F}_n \right] \le \sum_{r=1}^{k} (a_n^r)^2.$$
Hence, assuming (H3), one gets
$$E\left[ \sum_{n=1}^{\infty} E\left[ \|A_n V_n\|^2 \mid \mathcal{F}_n \right] \right] < \infty.$$
In the case where (H3') holds instead of (H3), one has
$$E\left[ \sum_{n=1}^{\infty} E\left[ \|A_n V_n\|^2 \mid \mathcal{F}_n \right] \right] \le \sum_{n=1}^{\infty} \sum_{r=1}^{k} E\left[ (a_n^r)^2\, I_r(Z_n; X_n) \right] < \infty.$$
Consequently,
$$\sum_{n=1}^{\infty} E\left[ \|A_n V_n\|^2 \mid \mathcal{F}_n \right] < \infty \quad \text{a.s.},$$
which concludes the proof.

Acknowledgements. We thank the anonymous referees for their valuable suggestions. We also thank the Médiamétrie company for allowing us to illustrate our sequential clustering technique with their data.

References

Bottou, L., 2010. Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (Eds.), Compstat 2010. Physica Verlag, Springer, pp.
177–186.

Bryant, P., Williamson, J. A., 1978. Asymptotic behaviour of classification maximum likelihood estimates. Biometrika 65, 273–281.

Cardot, H., Cénac, P., Chaouch, M., 2010. Stochastic approximation to the multivariate and the functional median. In: Lechevallier, Y., Saporta, G. (Eds.), Compstat 2010. Physica Verlag, Springer, pp. 421–428.

Cardot, H., Cénac, P., Zitt, P.-A., 2011. Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient approach. Bernoulli, to appear.

Chaudhuri, P., 1996. On a geometric notion of quantiles for multivariate data. J. Amer. Statist. Assoc. 91 (434), 862–872.

Chipman, H., Tibshirani, R., 2005. Hybrid hierarchical clustering with applications to microarray data. Biostatistics 7, 286–301.

Croux, C., Gallopoulos, E., Van Aelst, S., Zha, H., 2007. Machine learning and robust data mining. Computational Statistics and Data Analysis 52, 151–154.

Duflo, M., 1997. Random Iterative Models. Vol. 34 of Applications of Mathematics (New York). Springer-Verlag, Berlin.

Forgy, E., 1965. Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21, 768–769.

Gan, G., Ma, C., Wu, J., 2007. Data Clustering: Theory, Algorithms, and Applications. SIAM, Philadelphia.

García-Escudero, L., Gordaliza, A., 1999. Robustness properties of k-means and trimmed k-means. Journal of the American Statistical Association 94, 956–969.

García-Escudero, L., Gordaliza, A., Matrán, C., Mayo-Iscar, A., 2008. A general trimming approach to cluster analysis. Annals of Statistics 36, 1324–1345.

García-Escudero, L., Gordaliza, A., Matrán, C., Mayo-Iscar, A., 2010. A review of robust clustering methods. Adv. Data Anal. Classif. 4, 89–109.

Hartigan, J., 1975. Clustering Algorithms. John Wiley & Sons, New York.

Jain, A., Murty, M., Flynn, P., 1999. Data clustering: a review. ACM Computing Surveys 31, 264–323.
Kaufman, L., Rousseeuw, P., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York.

Laloë, T., 2010. L1-quantization and clustering in Banach spaces. Mathematical Methods of Statistics 19, 136–150.

MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), Vol. I: Statistics. Univ. California Press, Berkeley, Calif., pp. 281–297.

Monnez, J.-M., 2006. Almost sure convergence of stochastic gradient processes with matrix step sizes. Statist. Probab. Lett. 76 (5), 531–536.

Ng, R. T., Han, J., 2002. CLARANS: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering 14, 1003–1016.

Park, H.-S., Jun, C.-H., 2008. A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications 36, 3336–3341.

Pelletier, M., 2000. Asymptotic almost sure efficiency of averaged stochastic algorithms. SIAM J. Control Optim. 39 (1), 49–72.

Polyak, B., Juditsky, A., 1992. Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization 30, 838–855.

Small, C. G., 1990. A survey of multidimensional medians. International Statistical Review 58 (3), 263–277.

Zhang, Q., Couloigner, I., 2005. A new and efficient k-medoid algorithm for spatial clustering. Lecture Notes in Computer Science 3482, 181–189.
