Large Scale Spectral Clustering Using Approximate Commute Time Embedding

📝 Original Info

  • Title: Large Scale Spectral Clustering Using Approximate Commute Time Embedding
  • ArXiv ID: 1111.4541
  • Date: 2011-07-15
  • Authors: Y. Chen, J. Wang, J. Liu, J. Tang

📝 Abstract

Spectral clustering is a novel clustering method which can detect complex shapes of data clusters. However, it requires the eigendecomposition of the graph Laplacian matrix, whose cost is proportional to $O(n^3)$, and thus it is not suitable for large-scale systems. Recently, many methods have been proposed to reduce the computational time of spectral clustering. These approximate methods usually involve sampling techniques, through which much of the information in the original data may be lost. In this work, we propose a fast and accurate spectral clustering approach using an approximate commute time embedding, which is similar to the spectral embedding. The method requires neither sampling nor the computation of any eigenvector. Instead, it uses random projection and a linear-time solver to find the approximate embedding. Experiments on several synthetic and real datasets show that the proposed approach has better clustering quality and is faster than the state-of-the-art approximate spectral clustering methods.

📄 Full Content

Data clustering is an important problem and has been studied extensively in data mining research [11]. Traditional methods such as k-means or hierarchical techniques usually assume that the data has clusters of convex shape, so that the clusters can be separated linearly using Euclidean distance. Spectral clustering, on the other hand, can detect clusters of more complex geometry and has been shown to be more effective than traditional techniques in different application domains [20,24,17]. The intuition behind spectral clustering is that it maps the data from the original feature space to the eigenspace of the Laplacian matrix, where the clusters are linearly separable and are therefore easier to detect with traditional techniques such as k-means. This technique requires the eigendecomposition of the graph Laplacian, whose cost is proportional to $O(n^3)$, and is therefore not applicable to large graphs.
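The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the paper's implementation: the two-ring toy dataset, the Gaussian similarity with bandwidth `sigma`, the use of the unnormalized Laplacian, and the small `kmeans` helper with farthest-point initialization are all choices made here for the example.

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Toy data (assumption): two concentric rings, which are not linearly separable.
n_per_ring = 40
angles = np.linspace(0.0, 2.0 * np.pi, n_per_ring, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])
outer = 4.0 * inner
X = np.vstack([inner, outer])

# Similarity graph with a Gaussian kernel (sigma is an assumed bandwidth).
sigma = 0.5
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dist / (2.0 * sigma ** 2))
np.fill_diagonal(W, 0.0)

# Unnormalized graph Laplacian L = D - W and its full eigendecomposition.
# This dense step is the O(n^3) bottleneck mentioned in the text.
L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order

# Spectral embedding: eigenvectors of the k smallest eigenvalues, then k-means.
k = 2
embedding = eigvecs[:, :k]
labels = kmeans(embedding, k)
```

In the embedding, each ring collapses to (almost) a single point, so k-means separates the rings even though it could not do so in the original feature space.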

Recent studies try to solve this problem by accelerating the eigendecomposition step. They involve either sampling or low-rank matrix approximation techniques [7,30,31,4]. [7] used the traditional Nyström method to solve the eigensystem on randomly sampled data representatives and then extrapolated the solution to the whole dataset. [31] performed spectral clustering on a small set of data centers chosen by k-means or a random projection tree; every data point was then assigned to the cluster of the center it had been mapped to in the center selection step. A recent work [4] used the idea of sparse coding to approximate the affinity matrix from a number of data representatives so that the eigensystem can be computed very efficiently. However, all of these methods involve sampling. Whether the samples or representatives are chosen uniformly at random or by a more expensive selection procedure, they may not completely represent the whole dataset and may not correctly capture the geometry of the clusters. Moreover, all of them involve computing the eigenvectors of the Laplacian, and they cannot be used directly on graph data, which are widely available in the form of social networks, web graphs, and collaborative filtering graphs.

In this paper, we propose a different approach using an approximate commute time embedding. Commute time is a random-walk-based metric on graphs. The commute time between two nodes i and j is the expected number of steps a random walk starting at i takes to reach j for the first time and then return to i. The fact that commute time is averaged over all paths (and not just the shortest path) makes it more robust to data perturbations. Commute time has found widespread applications in personalized search [23], collaborative filtering [2,6], anomaly detection [13], link prediction in social networks [16], and making search engines robust against manipulation [10]. Commute time can be embedded in an eigenspace of the graph Laplacian matrix in which the squared pairwise Euclidean distances equal the commute times in the similarity graph [6]. Therefore, clustering with the commute time embedding rests on a similar idea to spectral clustering, and the two have quite similar clustering capability.
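The embedding property can be checked numerically on a small graph. The sketch below relies on a standard identity that is not spelled out in the text above: with $L^+$ the pseudoinverse of the graph Laplacian and $V_G$ the sum of all node degrees, the commute time is $c_{ij} = V_G\,(l^+_{ii} + l^+_{jj} - 2\,l^+_{ij})$. The four-node path graph and all variable names are assumptions chosen for illustration.

```python
import numpy as np

# Unweighted path graph 0 - 1 - 2 - 3 (a toy example, not from the paper).
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0

degrees = A.sum(axis=1)
L = np.diag(degrees) - A          # graph Laplacian
vol = degrees.sum()               # graph volume V_G (sum of degrees)

# Eigendecomposition of L; drop the zero eigenvalue of the connected graph.
eigvals, eigvecs = np.linalg.eigh(L)
keep = eigvals > 1e-10
lam, V = eigvals[keep], eigvecs[:, keep]

# Pseudoinverse L^+ built from the nonzero eigenpairs.
L_pinv = (V / lam) @ V.T

def commute_time(i, j):
    """Commute time from the pseudoinverse identity."""
    return vol * (L_pinv[i, i] + L_pinv[j, j] - 2.0 * L_pinv[i, j])

# Commute time embedding: row i holds the coordinates of node i.
embedding = np.sqrt(vol) * V / np.sqrt(lam)

def embedded_sq_distance(i, j):
    """Squared Euclidean distance between nodes i and j in the embedding."""
    return float(np.sum((embedding[i] - embedding[j]) ** 2))
```

For the two endpoints of this path, both `commute_time(0, 3)` and `embedded_sq_distance(0, 3)` evaluate to 18 up to rounding: a random walk needs 9 expected steps to cross the path in each direction.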

A different line of work [19] proposed a semi-supervised framework that uses data labels to improve the efficiency of the power method in finding eigenvectors for spectral clustering. Alternatively, [3] used parallel processing to accelerate spectral clustering in a distributed environment. In our work, we focus only on accelerating spectral clustering on a single machine in an unsupervised manner.

The contributions of this paper are as follows:

• We show the similarity in idea and implementation between spectral clustering and clustering using commute time embedding. The experiments show that they have quite similar clustering capabilities.

• We show the weakness of sampling-based approximate approaches and propose a fast and accurate spectral clustering method using an approximate commute time embedding. The method does not sample the data, does not compute any eigenvector, and can work directly on graph data. Moreover, the approximate embedding can be applied to other applications that make use of commute time.

• We show the effectiveness of the proposed method in terms of accuracy and performance on several synthetic and real datasets. It is more accurate and faster than the state-of-the-art approximate spectral clustering methods.
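To make the second contribution more concrete, the following sketch shows one way random projection and a linear solver can replace the eigendecomposition. It is an illustration under stated assumptions, not the paper's algorithm, whose details lie in the truncated sections. The construction assumed here factors the Laplacian as $L = B^T W B$ (with $B$ the signed edge-node incidence matrix and $W$ the diagonal matrix of edge weights), projects with a random $\pm 1/\sqrt{k_{RP}}$ matrix `Q`, and solves one linear system per projected row. The dense `np.linalg.solve` call on `L + J` is only a stand-in for a linear-time Laplacian solver, and the toy graph, `k_rp`, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weighted graph (assumption): Gaussian similarities between random points.
n = 60
X = rng.normal(size=(n, 2))
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dist / 2.0)
np.fill_diagonal(W, 0.0)

# Edge list (i < j), signed incidence matrix B (m x n), and edge weights w.
rows, cols = np.triu_indices(n, k=1)
m = rows.size
w = W[rows, cols]
B = np.zeros((m, n))
B[np.arange(m), rows] = 1.0
B[np.arange(m), cols] = -1.0

L = B.T @ (w[:, None] * B)        # Laplacian, equal to D - W
vol = W.sum()                     # graph volume (sum of degrees)
J = np.ones((n, n)) / n           # makes L + J invertible on a connected graph

# Random projection: Y = Q W^{1/2} B has only k_rp rows instead of m.
k_rp = 400
Q = rng.choice([-1.0, 1.0], size=(k_rp, m)) / np.sqrt(k_rp)
Y = (Q * np.sqrt(w)) @ B

# Solve L z = y for every row y of Y. Each y is orthogonal to the all-ones
# vector, so solving with L + J returns L^+ y. Dense stand-in for a fast solver.
Z = np.linalg.solve(L + J, Y.T).T   # k_rp x n approximate embedding

def approx_commute_time(i, j):
    """Approximate commute time from the projected embedding."""
    return vol * float(np.sum((Z[:, i] - Z[:, j]) ** 2))

# Exact reference value, feasible here only because the graph is tiny.
L_pinv = np.linalg.inv(L + J) - J

def exact_commute_time(i, j):
    return vol * (L_pinv[i, i] + L_pinv[j, j] - 2.0 * L_pinv[i, j])
```

The columns of `Z` play the role of node coordinates, so they could be passed to k-means in place of a spectral embedding. In this toy setting `k_rp` is set generously to keep the approximation error small; how the projection dimension should be chosen in practice is not covered by the text above.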

The remainder of the paper is organized as follows. Sections 2 and 3 describe the spectral clustering technique and efforts to approximate it to reduce the computational time. Section 4 reviews notations and concepts related to commute time and its embedding, and the relationship between spectral clustering and clustering using commute time embedding. In Section 5, we present a method to approximate spectral clustering with an approximate commute time embedding. In Section 6, we evaluate our approach using experiments on several synthetic and real datasets. Sections 7

…(Full text truncated)…

