Kernelized Locality-Sensitive Hashing for Semi-Supervised Agglomerative Clustering

Large scale agglomerative clustering is hindered by computational burdens. We propose a novel scheme where exact inter-instance distance calculation is replaced by the Hamming distance between Kernelized Locality-Sensitive Hashing (KLSH) hashed values. This results in a method that drastically decreases computation time. Additionally, we take advantage of certain labeled data points via distance metric learning to achieve precision and recall competitive with K-Means, but in much less computation time.

Authors: Boyi Xie, Shuheng Zheng

Kernelized Locality-Sensitive Hashing for Semi-Supervised Agglomerative Clustering

Boyi Xie
Department of Computer Science, Columbia University, New York, NY 10027
xie@cs.columbia.edu

Shuheng Zheng
Department of Industrial Engineering & Operations Research, Columbia University, New York, NY 10027
sz2228@columbia.edu

Abstract

Large scale agglomerative clustering is hindered by computational burdens. We propose a novel scheme where exact inter-instance distance calculation is replaced by the Hamming distance between Kernelized Locality-Sensitive Hashing (KLSH) hashed values. This results in a method that drastically decreases computation time. Additionally, we take advantage of certain labeled data points via distance metric learning to achieve precision and recall competitive with K-Means, but in much less computation time.

1 Introduction

Our proposed research topic is clustering on a scalable dataset with a semi-supervised approach based on hashing methods. In particular, our goal is to explore the underlying data distribution by clustering the data points and differentiating the classes. When a small set of labeled data that comes from only a subset of the classes is given, we want to find the whole data distribution for a complete set of classes. For example, given a set of labels for two classes, can we separate these two classes well and at the same time discover the existence of a third class? This requires using the information from the labeled data to find a transformation metric that splits the two classes well; after this data transformation, we can discover that a third class exists. Suppose there is a handwritten digit recognition task and the dataset contains digits '2', '7' and '4'.
If a general agglomerative clustering is run, it might end up with 2 clusters, with '2' and '7' in one cluster and '4' in the other, due to the similarity of their shapes. However, when a small labeled set of classes '2' and '7' is given, we can learn a degree of granularity for similarity comparison. By using a data transformation that maximally splits '2' and '7' into two clusters, we are able to identify the existence of another cluster, digit '4'. Because agglomerative clustering suffers from computational inefficiency, a major contribution of this paper is to introduce a machine-learned hashing method, kernelized locality-sensitive hashing (KLSH), into agglomerative clustering. This results in efficient clustering computation for large-scale datasets.

Our paper is structured as follows. We provide background study and related work in Section 2. Section 3 presents our algorithms for distance metric learning and KLSH clustering. Section 4 describes the experiments with a discussion of the results, followed by conclusions in Section 5.

2 Related Work

There has been much previous work on cluster seeding to address the limitation that iterative clustering techniques (e.g. K-Means and Expectation Maximization (EM)) are sensitive to the choice of initial starting points (seeds). The problem addressed is how to select seed points in the absence of prior knowledge. Kaufman and Rousseeuw [1] propose an elaborate mechanism: the first seed is the instance that is most central in the data; the rest of the representatives are selected by choosing instances that promise to be closer to more of the remaining instances. Peña et al. [2] empirically compare four initialization methods for the K-Means algorithm and illustrate that the random and Kaufman initializations outperform the other two, since they make K-Means less dependent on the initial choice of seeds.
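As a rough sketch of the Kaufman-style seeding described above (our simplified reading of [1], with illustrative names; this is not the authors' procedure): the first seed is the most central instance, and each later seed is the instance that takes over the most "distance mass" from the seeds chosen so far.

```python
import numpy as np

def kaufman_seeds(X, k):
    """Simplified sketch of Kaufman-style seeding: start from the most central
    instance, then greedily add instances that bring the most remaining points
    closer than their current nearest seed."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    seeds = [int(np.argmin(d.sum(axis=1)))]               # most central point
    while len(seeds) < k:
        nearest = d[:, seeds].min(axis=1)                 # dist to closest chosen seed
        # gain of candidate c: total distance it would save the remaining points
        gains = np.maximum(nearest[None, :] - d, 0).sum(axis=1)
        gains[seeds] = -1                                 # never reselect a seed
        seeds.append(int(np.argmax(gains)))
    return seeds
```

On two well-separated 1-D groups, e.g. `X = np.array([[0.], [1.], [2.], [10.], [11.], [12.]])`, `kaufman_seeds(X, 2)` picks one representative from each group.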
In K-Means++ [3], the random starting points are chosen with specific probabilities: a point p is chosen as a seed with probability proportional to p's contribution to the overall potential (defined by the sum of squared distances between each point and the closest center). By augmenting K-Means with this simple, randomized seeding technique, K-Means++ is O(log k)-competitive with the optimal clustering. Bradley and Fayyad [4] propose refining the initial seeds by taking into account the modes of the underlying distribution. This refined initial seed enables the iterative algorithm to converge to a better local minimum.

Semi-supervised learning can also be seen as unsupervised learning guided by constraints. Noting that clustering is heavily dependent on distance metrics, and that a particular algorithm is merely an executor that follows the rules, [5] pointed out the desire for a systematic way to learn a distance metric for clustering from labeled data. It is based on posing metric learning as a convex optimization problem.

When data size grows exponentially, hashing is a technique especially good at solving large scale problems. [6] described the Locality-Sensitive Hashing (LSH) method, an efficient algorithm for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g. images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The technique is of significant interest in a wide variety of areas of unsupervised learning. Hierarchical clustering tries to solve a similar problem from another perspective: by iteratively finding nearest neighbors, it groups data into clusters. Kernelized LSH was later proposed by [7] for fast image search.
It generalizes LSH to accommodate arbitrary kernel functions, making it possible to preserve the algorithm's sub-linear time similarity search guarantees for a wide class of useful similarity functions.

3 Methods

In this section, we describe our methods to solve a large scale semi-supervised learning problem, first introducing the distance metric learning and then our fast agglomerative clustering method based on kernelized locality-sensitive hashing (KLSH).

3.1 Distance Metric Learning

We consider the circumstance in which the data given to us contains a few labeled points, so we know for sure which points belong to the same or different classes. We have a similarity matrix S and a dissimilarity matrix D. For entry s_{i,j} in similarity matrix S, s_{i,j} = 1 if data x_i and x_j are in the same class and 0 otherwise; similarly for dissimilarity matrix D. Based on [5], we try to learn a distance metric

    ||x − y||_A = sqrt( (x − y)^T A (x − y) ),

where x, y are two data points and A is a positive semi-definite matrix of distance parameters among data points. The idea is to minimize the distance between similar points while keeping dissimilar points apart:

    min_A  Σ_{(x_i, x_j) ∈ S} ||x_i − x_j||²_A
    s.t.   Σ_{(x_i, x_j) ∈ D} ||x_i − x_j||_A ≥ 1,
           A ⪰ 0.

It can be solved efficiently using constrained Newton's descent on the objective function

    Σ_{(x_i, x_j) ∈ S} ||x_i − x_j||²_A − log( Σ_{(x_i, x_j) ∈ D} ||x_i − x_j||_A ).

This semi-supervised part learns a distance metric for data transformation before the main agglomerative clustering.

Algorithm 1 Semi-Supervised Clustering with Hashing
Input: labeled data (limited), data x_1, ..., x_N.
Step 1: Learn distance metric A from the labeled (limited) data.
Step 2: Build the A-distance KLSH table (Algorithm 2).
Step 3: Initialization:
  • Let cluster distribution R_0 = { {x_i}, i = 1, ..., N }, i.e. each data point is an individual cluster C_i.
  • Let proximity matrix P_0 = P(X) of hash keys.
  • Set t = 1.
repeat
  • t = t + 1
  • Find C_i, C_j such that d(C_i, C_j) = min_{r ≠ s} d(C_r, C_s).
  • Merge C_i, C_j into a single cluster C_q and form R_t = (R_{t−1} − {C_i, C_j}) ∪ {C_q}.
  • Define the proximity matrix P_t from P_{t−1} by (a) deleting the two rows and columns corresponding to the merged clusters and (b) adding a row and a column for the new cluster.
until the remaining number of clusters equals a specified k, or the inconsistency coefficient exceeds a threshold.
Step 4: Retrieve the actual data instances from the KLSH hash table for the corresponding clusters.

Algorithm 2 Build A-Distance KLSH Table
Input: data x_1, ..., x_N, distance parameter A.
Step 1: Randomly select p points from the data, denoted x_1, ..., x_i, ..., x_p. Build the kernel K(i, j) = exp(−d(x_i, x_j)² / σ²) for i, j = 1, ..., p, where d(x, y) = sqrt( (x − y)^T A (x − y) ).
Step 2: Apply SVD to K: suppose K = U Σ U^T; then K^{−1/2} = U Σ^{−1/2} U^T.
Step 3: Form a p-dimensional vector e_s in which t randomly chosen dimensions are 1 and the others are 0.
Step 4: Let w = K^{−1/2} e_s. For any x, the bit is created as h(x) = sign( Σ_{i=1}^{p} w_i k(x, x_i) ).

3.2 Clustering with KLSH

The curse of dimensionality is a well-known problem for learning on large scale datasets. It refers to the fact that the cost of computation grows exponentially with the number of data dimensions or the number of data instances. This problem directly affects clustering approaches based on density estimation in input space. For instance, in K-Means or general agglomerative clustering, the cost of iteratively estimating new centroid locations and re-arranging data instances into clusters exerts a significant burden on the performance.
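The hashing scheme of Algorithm 2 can be sketched in a few lines. The following is our illustration, not the authors' code: it assumes the Gaussian kernel over the learned A-distance given in the paper, uses an eigendecomposition of the symmetric kernel matrix for K^{−1/2}, and all function and parameter names are ours.

```python
import numpy as np

def build_klsh_hash(X, A, p=16, t=4, n_bits=32, sigma=1.0, seed=0):
    """Sketch of Algorithm 2: build KLSH bit functions over the A-distance.
    Assumes k(x, y) = exp(-d_A(x, y)^2 / sigma^2), with d_A the learned metric."""
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), size=p, replace=False)]   # Step 1: p random points

    def d_A(u, v):                                           # learned Mahalanobis-style metric
        diff = u - v
        return np.sqrt(diff @ A @ diff)

    K = np.array([[np.exp(-d_A(a, b) ** 2 / sigma ** 2) for b in anchors]
                  for a in anchors])
    vals, U = np.linalg.eigh(K)                              # Step 2: K = U diag(vals) U^T
    K_inv_half = U @ np.diag(1.0 / np.sqrt(np.clip(vals, 1e-12, None))) @ U.T
    E = np.zeros((n_bits, p))                                # Step 3: each row is an e_s
    for row in E:
        row[rng.choice(p, size=t, replace=False)] = 1.0      # t random dimensions set to 1
    W = E @ K_inv_half                                       # Step 4: rows are w = K^{-1/2} e_s

    def hash_bits(x):
        kx = np.array([np.exp(-d_A(x, a) ** 2 / sigma ** 2) for a in anchors])
        return (W @ kx > 0).astype(np.uint8)                 # h(x) = sign(sum_i w_i k(x, x_i))

    return hash_bits
```

Each data point is then represented by its n_bits-long code, and the point of hashing here is precisely to sidestep the per-pair distance cost on large datasets.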
Computing inter-instance distances frequently is especially expensive for datasets in high dimension. Our proposed semi-supervised clustering algorithm using kernelized locality-sensitive hashing (KLSH) in Algorithm 1 aims to solve the large scale agglomerative clustering problem. It first learns a distance metric A from a small set of labeled data (Step 1 in Algorithm 1). The second step is to build the KLSH table that maps the data into hashed bits. In the rest of the procedure, an agglomerative clustering is performed: instead of explicitly computing inter-instance distances, the clustering is done on the KLSH-hashed data points by measuring their Hamming distance. Such a kernelized locality-sensitive hashing method has a high probability of preserving neighborhoods, so it is a reasonable substitute for the exact inter-instance distances.

Table 1: Experiment results comparing four methods: (1) K-Means, (2) K-Means with distance-metric learning, (3) agglomerative clustering using KLSH and (4) agglomerative clustering using KLSH with distance-metric learning. Precision, recall and computation time are reported. All runs are on data with 10 underlying classes and a 32-bit hash code.

# INST. | K-MEANS             | K-MEANS W/ DL       | AGGL. KLSH          | AGGL. KLSH W/ DL
        | PRE   REC   TIME    | PRE   REC   TIME    | PRE   REC   TIME    | PRE   REC   TIME
5000    | .590  .564  13.246  | .537  .496  15.647  | .573  .305  2.155   | .631  .272  2.466
10000   | .568  .540  46.398  | .580  .556  33.736  | .520  .336  7.255   | .613  .250  5.246
15000   | .574  .530  69.252  | .556  .539  186.469 | .584  .180  13.843  | .610  .156  8.077
20000   | .589  .563  79.499  | .455  .448  112.178 | .609  .355  3.052   | .617  .292  18.070
30000   | .523  .503  164.853 | .552  .541  139.773 | .624  .235  58.646  | .548  .306  23.136
50000   | .560  .531  339.599 | .565  .530  333.313 | .579  .230  126.280 | .590  .252  122.558

4 Experiments

Our experiment is based on the MNIST dataset of handwritten digits. We evaluate our KLSH agglomerative clustering algorithm via a comparison to K-Means.

4.1 Datasets

We obtained handwritten digits from the MNIST data repository. There are 10 classes of rasterized images (corresponding to digits '0' to '9'). We used up to 50,000 data points for the experiments.

4.2 Experiment Setup

We ran experiments using both K-Means and KLSH agglomerative clustering, with and without distance metric learning. Hash string length, the number of classes, and the number of data points are varied one at a time. We report precision, recall and the computation time. All the experiments were done on a machine with an 8-core 2.8 GHz Intel processor and 8 GB of RAM.

4.3 Results and Analysis

Tables 1-3 summarize our results; the following are several trends to notice. First, in Table 1 we observed that KLSH agglomerative clustering can achieve the same level of precision for a fraction of the computational cost. The downside is that recall drops by roughly a factor of 2. The decrease in recall is caused by the fact that KLSH cannot recover all of the points in the nearest neighborhood. The addition of distance metric learning has noticeable benefits on performance for KLSH agglomerative clustering. In Table 2, we analyzed the effect of an increase in the number of classes (while fixing the number of data points) on precision, recall, and computation time. Precision remains constant while recall decreases. The computational cost remains relatively independent of the number of clusters. In Table 3, we analyzed the effect of hash string length on clustering validity.
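Because clusters are compared through their hash codes, the per-pair cost in the agglomerative loop reduces to an XOR plus a popcount on, say, 32-bit codes. A minimal sketch (helper names are ours, not from the paper):

```python
def pack_bits(bits):
    """Pack a 0/1 bit vector (e.g. a 32-bit KLSH hash) into a single integer."""
    code = 0
    for b in bits:
        code = (code << 1) | int(b)
    return code

def hamming(code_a, code_b):
    """Hamming distance between two packed hash codes: XOR, then count set bits."""
    return bin(code_a ^ code_b).count("1")
```

For example, `hamming(0b1011, 0b1001)` is 1, since the codes differ in a single bit; this constant-time comparison stands in for the exact A-distance between instances.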
Increasing the length of the hash string increases both precision and recall, and makes it possible to adjust the tradeoff between efficiency and effectiveness. Notice that even with an m = 32 bit binary hash code, there are still 2^32 possible outcomes. If the hashing splits the data well, the number of entries in the table will still be very large. This increases the accuracy of the clustering results but meanwhile leads to a higher computation cost during agglomerative clustering.

According to the results, clustering with KLSH has superior performance when the dataset is large and the number of real clusters is small. Compared to K-Means, it promises a large improvement in speed. When the true cluster number is not large, it achieves high performance in both speed and accuracy. Especially at a lower level of the linkage tree, clustering with bias (distance metric learning) can immediately and correctly cluster similar data instances.

Table 2: Performance for various numbers of underlying classes for agglomerative clustering using KLSH with distance metric learning. In this case, the data size is 20,000 and the hash code is 32-bit. With an increase in the number of classes, precision remains constant while recall decreases.

# CLASSES | PRE  | REC  | TIME
4         | .714 | .465 | 16.527
5         | .629 | .355 | 19.226
6         | .702 | .507 | 18.747
7         | .611 | .317 | 21.675
8         | .654 | .354 | 21.540
9         | .603 | .284 | 24.024
10        | .617 | .292 | 18.070

Table 3: Performance for various numbers of hash code bits for agglomerative clustering using KLSH with distance metric learning. In this case, the data size is 20,000 with 10 underlying classes. Increasing the length of the hash string increases both precision and recall.

# BITS | PRE  | REC  | TIME
8      | .402 | .245 | 0.043
16     | .599 | .111 | 1.000
32     | .617 | .292 | 18.070
64     | .635 | .380 | 92.452

5 Conclusions

General hierarchical clustering methods cannot scale well to large datasets due to the exponentially growing number of calculations of inter-instance distances. Kernelized locality-sensitive hashing (KLSH) has a high probability of preserving neighborhoods and is a reasonable substitute for the exact inter-instance distances. Our proposed KLSH agglomerative clustering alleviates the problem by calculating a reduced-size Hamming distance and achieves efficient clustering computation. The incorporation of distance metric learning marginally improves the precision and recall.

References

[1] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
[2] José Manuel Peña, José Antonio Lozano, and Pedro Larrañaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10):1027–1040, 1999.
[3] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007.
[4] P. S. Bradley and U. M. Fayyad. Refining initial points for k-means clustering. In Jude W. Shavlik, editor, Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, 1998.
[5] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pages 505–512. MIT Press, 2002.
[6] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, January 2008.
[7] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In IEEE International Conference on Computer Vision (ICCV), 2009.
