Distributed Lance-William Clustering Algorithm


📝 Original Info

  • Title: Distributed Lance-William Clustering Algorithm
  • ArXiv ID: 1709.06816
  • Date: 2017-09-21
  • Authors: Gavriel Yarmish (Brooklyn College, City University of New York), Philip Listowsky (Kingsborough College, City University of New York), Simon Dexter (Brooklyn College, City University of New York)

📝 Abstract

Optimal clustering of data into useful categories is an important tool. Dividing similar objects into a smaller number of clusters matters in many applications, including search engines, monitoring of academic performance, biology, and wireless networks. We first discuss a number of clustering methods, then present a parallel algorithm for efficiently clustering objects into groups based on their similarity to each other. The input is an n by n distance matrix holding a distance for each pair of objects; the smaller the number, the more similar the two objects are to each other. We use parallel processors to compute a hierarchical clustering of the n items from this matrix. A further advantage of our method is that the large n by n matrix itself is distributed. We have implemented the algorithm and found it to be scalable both in processing speed and in storage.

📄 Full Content


DISTRIBUTED LANCE-WILLIAM CLUSTERING ALGORITHM
Yarmish, Gavriel, Brooklyn College, City University of New York
Listowsky, Philip, Kingsborough College, City University of New York
Dexter, Simon, Brooklyn College, City University of New York

Keywords: Optimization, Hierarchical Clustering, Distributed Computing, Parallel, K-means, Data Mining


1 Introduction

Dividing similar objects into a smaller number of clusters is of importance in many applications, including search engines, monitoring of academic performance, biology, and wireless networks. We first discuss a number of clustering methods, among them the K-means clustering algorithm and several hierarchical clustering methods. Hierarchical methods have advantages over K-means. Unfortunately, hierarchical clustering can be expensive, and for this reason it has been avoided in some applications. One example is the clustering of candidate protein structures into a limited number of groups (Zheng, Gallicchio, Deng, Andrec, & Levy, 2011); when clustering is used there, it is generally K-means. Hierarchical clustering is itself divided into sub-methods, such as simple hierarchical clustering (SHC) and complete hierarchical clustering (CHC), along with a few others. While SHC has faster algorithms, CHC, which is more useful, is O(n³). This can be prohibitive for large n.

We present a parallel algorithm for the efficient clustering of proteins into groups. The input consists of an n by n distance matrix.

2 Choice of clustering method

2.1 Clustering Methods

Clustering is the generic term for methods of categorizing objects into groups or clusters. There are hierarchical and non-hierarchical methods. Some methods build clusters top-down: they begin with all items in one large cluster and break that cluster down over a number of iterations. Others are bottom-up, or agglomerative.

Non-hierarchical (partition) methods include K-means, graph-theoretic methods, and others. Non-hierarchical clustering methods generally pre-set the number of clusters and then fit the n items into those clusters. The name K-means indicates that there is a pre-set number K of clusters. It is an efficient algorithm but does not have the advantages of hierarchical clustering.
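As a rough illustration of the partition approach, a minimal K-means loop might look like the following sketch (plain NumPy; the function name and parameters are our own, not the paper's):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means sketch: pre-set k, alternate assign/update steps."""
    pts = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct random points as the initial centroids.
    centroids = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = pts[labels == j].mean(axis=0)
    return labels, centroids
```

Note that k is fixed up front, which is exactly the limitation the hierarchical methods below avoid.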
Hierarchical clustering methods, on the other hand, first assign every item to its own cluster, so for n items we start with n clusters. The algorithm then iterates, combining two of the clusters into one larger cluster at each step; after n − 1 such merges there is one large cluster. A snapshot of the clusters at the end of each iteration is preserved in an upside-down tree (a dendrogram). If our goal, for example, is to divide the data into 10 clusters, we simply look at the level of the tree that contains 10 clusters.
Hierarchical clustering thus has numerous advantages. One is that there is no pre-set number of clusters. Another is that its output is a full tree (the dendrogram) of clusterings: the bottom 'row' of the tree has each of the n items in its own cluster, the second row from the bottom shows n − 1 clusters, and the top of the tree shows one large cluster. Once finished, we can choose whichever output level of the tree matches how finely clustered we want the data.
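The bottom-up procedure just described can be sketched naively (this is our own illustration of the sequential baseline, not the paper's distributed code): keep a distance matrix, repeatedly merge the closest pair of clusters, and record each merge so that any level of the dendrogram can be recovered later.

```python
import numpy as np

def agglomerate(dist):
    """Naive hierarchical agglomerative clustering (complete linkage).

    dist: symmetric n x n distance matrix. Returns the list of merges;
    stopping after n - k merges yields a k-cluster snapshot.
    """
    d = np.array(dist, dtype=float)
    n = len(d)
    clusters = {i: [i] for i in range(n)}   # cluster id -> member items
    merges = []
    np.fill_diagonal(d, np.inf)
    active = set(range(n))
    while len(active) > 1:
        # Find the closest pair of active clusters.
        pairs = [(d[i, j], i, j) for i in active for j in active if i < j]
        _, i, j = min(pairs)
        # Merge j into i; complete linkage takes the larger distance.
        for k in active - {i, j}:
            d[i, k] = d[k, i] = max(d[i, k], d[j, k])
        clusters[i] += clusters[j]
        active.remove(j)
        merges.append((i, j, sorted(clusters[i])))
    return merges
```

Each of the n − 1 iterations scans O(n²) pairs, which is the O(n³) total cost the text refers to.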

Hierarchical agglomerative clustering methods include:

  1. single linkage
  2. complete linkage
  3. average linkage
  4. centroid
  5. Ward's method

We found hierarchical bottom-up agglomerative clustering to be useful in many situations, but its O(n³) cost makes it too expensive for many real-world applications.
One exception is single-linkage hierarchical clustering, which can be solved by an algorithm that mimics Prim's minimum spanning tree algorithm.
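Single linkage's special status comes from the fact that its dendrogram can be read off a minimum spanning tree of the distance graph. A Prim-style sketch over the dense distance matrix (our own illustration) runs in O(n²):

```python
import math

def prim_mst_edges(dist):
    """Prim's algorithm on a dense n x n distance matrix, O(n^2).

    Returns MST edges sorted by weight; that order is the single-linkage
    merge order, and cutting the k - 1 heaviest edges yields k clusters.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest edge from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the non-tree vertex closest to the tree.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        # Relax the edges out of u.
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return sorted(edges)
```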

To get a

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
