Unsupervised and semi-supervised clustering by message passing: Soft-constraint affinity propagation

Reading time: 5 minutes

📝 Original Info

  • Title: Unsupervised and semi-supervised clustering by message passing: Soft-constraint affinity propagation
  • ArXiv ID: 0712.1165
  • Date: 2008-10-20
  • Authors: Michele Leone, Sumedha, Martin Weigt

📝 Abstract

Soft-constraint affinity propagation (SCAP) is a new statistical-physics based clustering technique. First we give the derivation of a simplified version of the algorithm and discuss possibilities of time- and memory-efficient implementations. Later we give a detailed analysis of the performance of SCAP on artificial data, showing that the algorithm efficiently unveils clustered and hierarchical data structures. We generalize the algorithm to the problem of semi-supervised clustering, where data are already partially labeled, and clustering assigns labels to previously unlabeled points. SCAP uses both the geometrical organization of the data and the available labels assigned to few points in a computationally efficient way, as is shown on artificial and biological benchmark data.

📄 Full Content

Clustering is a very important problem in data analysis [2,3]. Starting from a set of data points, one tries to group the data such that points in one cluster are more similar to each other than to points in different clusters. The hope is that such a grouping unveils common functional characteristics. As an example, one of the currently most important application fields for clustering is the computational analysis of biological high-throughput data, such as gene-expression data, where different cell states result in different expression patterns.

If the data are organized in a well-separated way, one can use one of the many unsupervised clustering methods to divide them into classes [2,3]; but if clusters overlap at their borders, or if they have involved shapes, these algorithms in general face problems. However, clustering can still be achieved using a small fraction of previously labeled data (a training set), making the clustering semi-supervised [4,5]. When designing algorithms for semi-supervised clustering, one has to be careful: they should efficiently use both types of information, namely the geometrical organization of the data points and the already assigned labels.

In general there is not only one possible clustering. If one goes to a very fine scale, each single data point can be considered its own cluster. On a very rough scale, the whole data set becomes a single cluster. These two extreme cases may be connected by a full hierarchy of cluster-merging events.

This idea is the basis of the oldest clustering method, which is still among the most popular ones: hierarchical agglomerative clustering [6,7]. It starts with each isolated point as its own cluster, and in each algorithmic step the two closest clusters are merged (with the cluster distance given, e.g., by the minimal distance between pairs of cluster elements), until only one big cluster remains. The process can be visualized by a so-called dendrogram, which clearly shows possible hierarchical structures. The strong point of this algorithm is its conceptual clarity combined with an easy numerical implementation. Its major problem is that it is a greedy, local algorithm: no decision can ever be reversed.
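The merge loop described above fits in a few lines. Below is a minimal single-linkage sketch (function name and interface are ours, and the naive O(n³) pairwise search is for illustration only, not an optimized implementation):

```python
import numpy as np

def single_linkage(points, n_clusters):
    """Greedy agglomerative clustering: repeatedly merge the two clusters
    whose closest pair of members is nearest (single linkage), until
    n_clusters remain. No merge is ever undone."""
    clusters = [[i] for i in range(len(points))]
    # full pairwise distance matrix
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # greedy merge
    return clusters
```

Running the loop all the way down to one cluster and recording each merge distance yields exactly the dendrogram mentioned in the text.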

A second traditional and broadly used clustering method is K-means clustering [8]. In this algorithm, one starts with a random assignment of data points to K clusters, calculates the center of mass of each cluster, reassigns points to the closest cluster center, recalculates the cluster centers, and so on, until the cluster assignment converges. The method can be implemented very efficiently, but it depends strongly on the initial condition and gets trapped in local optima, so the algorithm has to be rerun many times to produce reliable clusterings, which decreases its efficiency. Furthermore, K-means clustering assumes spherical clusters; elongated clusters tend to be divided artificially into sub-clusters.
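A minimal sketch of the Lloyd iteration just described (the function name and `init` parameter are ours; in practice one reruns from many random initializations, as noted above, precisely because the result depends on them):

```python
import numpy as np

def kmeans(points, k, init=None, n_iter=100, rng=None):
    """Lloyd's algorithm: alternate nearest-center assignment and
    center-of-mass recomputation until the assignment converges.
    Converges only to a local optimum depending on the initialization."""
    if init is None:
        rng = rng or np.random.default_rng()
        init = points[rng.choice(len(points), k, replace=False)]
    centers = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # assign every point to its closest center
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centers[None, :], axis=-1), axis=1)
        # recompute centers of mass (keep old center if a cluster empties)
        new = np.array([points[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```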

A first statistical-physics based method is super-paramagnetic clustering [9,5]. The idea is the following: first, the network of pairwise similarities is preprocessed so that only links to the closest neighbors are kept. On this sparsified network a ferromagnetic Potts model is defined. Between the paramagnetic high-temperature phase and the ferromagnetic low-temperature phase, a super-paramagnetic phase can be found in which already large clusters tend to be aligned. Using Monte-Carlo simulations, one measures the probability for any two points to take the same value of their Potts variables; if this probability is large enough, the two points are identified as belonging to the same cluster. This algorithm is very elegant, since it neither assumes any cluster number or structure nor uses greedy methods. Due to the slow equilibration dynamics in the super-paramagnetic regime, however, it requires the implementation of sophisticated cluster Monte-Carlo algorithms. Note that super-paramagnetic clusterings can also be obtained by message-passing techniques, but these require an explicit breaking of the symmetry between the values of the Potts variables to give non-trivial results.
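To make the Potts picture concrete, here is a toy sketch (names and parameters are ours): a ferromagnetic q-state Potts model on a sparse neighbor graph, sampled with naive single-spin Metropolis updates, estimating the pair-alignment probabilities used for cluster identification. As the text stresses, a serious implementation would need cluster Monte-Carlo moves (e.g. Swendsen-Wang) to equilibrate in the super-paramagnetic regime; this single-spin version is for illustration only:

```python
import numpy as np

def potts_pair_probs(adj, q=10, T=0.5, sweeps=4000, rng=None):
    """Estimate, for every pair of sites on the neighbor graph `adj`
    (list of neighbor lists), the probability that their ferromagnetic
    Potts spins take the same value, via single-spin Metropolis sampling.
    Energy: E = -sum over edges of delta(s_i, s_j)."""
    rng = rng or np.random.default_rng(0)
    n = len(adj)
    spins = rng.integers(q, size=n)
    same = np.zeros((n, n))
    samples = 0
    for t in range(sweeps):
        for i in range(n):
            new = rng.integers(q)
            # energy change = aligned neighbors lost minus gained
            dE = sum(1 for j in adj[i] if spins[j] == spins[i]) \
               - sum(1 for j in adj[i] if spins[j] == new)
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                spins[i] = new
        if t > sweeps // 2:            # sample only after burn-in
            same += spins[:, None] == spins[None, :]
            samples += 1
    return same / samples
```

Thresholding the returned matrix (e.g. keeping pairs with alignment probability above 1/2) then yields the clusters, as described in the text.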

In recent years, many new clustering methods have been proposed. One particularly elegant and powerful method is affinity propagation (AP) [12], which also inspired our algorithm. Its approach is slightly different: each data point has to select an exemplar among all other data points, in a way that maximizes the overall similarity between data points and their exemplars. The selection is, however, restricted by a hard constraint: whenever a point is chosen as an exemplar by another point, it is forced to be its own self-exemplar as well. Clusters are consequently given by all points sharing a common exemplar. The number of clusters is regulated by a chemical potential (given in the form of a self-similarity of the data points), and good clusterings are identified via their robustness with respect to changes in this chemical potential.
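A sketch of the original hard-constraint AP message passing of [12] may help make this concrete; the responsibility and availability updates below follow the standard published form (with damping for convergence), while the soft-constraint relaxation that defines SCAP is not shown here. The function name, damping value, and iteration count are our illustrative choices; the diagonal of the similarity matrix `S` carries the self-similarity ("chemical potential") that sets the cluster number:

```python
import numpy as np

def affinity_propagation(S, damping=0.9, n_iter=200):
    """Hard-constraint AP: iterate responsibilities R and availabilities A
    on the similarity matrix S (diagonal = self-similarity/preference).
    The exemplar of point i is argmax_k (A + R)[i, k]."""
    n = len(S)
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(n_iter):
        # responsibilities: r(i,k) = s(i,k) - max_{k'!=k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * Rnew
        # availabilities: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        Anew = Rp.sum(axis=0)[None, :] - Rp
        dA = Anew.diagonal().copy()      # a(k,k) = sum_{i'!=k} max(0, r(i',k))
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, dA)
        A = damping * A + (1 - damping) * Anew
    return np.argmax(A + R, axis=1)      # exemplar index per point
```

In practice a tiny amount of noise is added to `S` to break degeneracies between equivalent solutions; lowering the shared diagonal preference produces fewer, larger clusters, which is exactly the chemical-potential knob mentioned above.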

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.