DISTRIBUTED LANCE-WILLIAMS CLUSTERING ALGORITHM
Yarmish, Gavriel, Brooklyn College, City University of New York
Listowsky, Philip, Kingsborough College, City University of New York
Dexter, Simon, City University of New York, Brooklyn College
Keywords: Optimization, Hierarchical Clustering, Distributed Computing, Parallel, K-means, Data Mining
Abstract: Clustering data into useful categories is an important tool in many applications, including search engines, monitoring of academic performance, biology, and wireless networks. We first discuss a number of clustering methods. We then present a parallel algorithm for the efficient clustering of objects into groups based on their similarity to each other. The input consists of an n-by-n distance matrix containing a distance ranking for each pair of objects; the smaller the number, the more similar the two objects are to each other. We utilize parallel processors to compute a hierarchical clustering of the n items from this matrix. A further advantage of our method is that the large n-by-n matrix itself is distributed across processors. We have implemented our algorithm and found it to be scalable both in terms of processing speed and storage.
1 Introduction
Dividing similar objects into a smaller number of clusters is of importance in many applications, including search engines, monitoring of academic performance, biology, and wireless networks. We first discuss a number of clustering methods, including the K-means clustering algorithm and several hierarchical clustering methods. Hierarchical clustering has advantages over K-means but, unfortunately, can be expensive; for this reason it has been avoided in some applications. One example of such an application is the clustering of candidate protein structures into a limited number of groups (Zheng, Gallicchio, Deng, Andrec, & Levy, 2011). When clustering is used there, it is generally K-means clustering. Hierarchical clustering is itself divided into sub-methods such as single-linkage hierarchical clustering (SHC) and complete-linkage hierarchical clustering (CHC), among others. While SHC has faster algorithms, CHC, which is more useful, is O(n³). This can be prohibitive for large n.
We present a parallel algorithm for the efficient clustering of proteins into groups. The input
consists of an n by n distance matrix.
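To make the input format concrete, here is a minimal sketch (our own illustration, not code from the paper) that builds a symmetric n-by-n Euclidean distance matrix from a list of points; smaller entries mean more similar objects:

```python
import math

def distance_matrix(points):
    """Build a symmetric n x n matrix of pairwise Euclidean distances."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Distances are symmetric, so fill both entries at once.
            dist = math.dist(points[i], points[j])
            d[i][j] = d[j][i] = dist
    return d

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
d = distance_matrix(pts)
# d[0][1] == 1.0 (similar points); d[0][2] is much larger.
```

Any dissimilarity measure would do in place of Euclidean distance; the algorithm only ever consults the matrix, not the original objects.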
2 Choice of Clustering Method
2.1 Clustering Methods
Clustering is the generic term for methods of categorizing objects into groups or clusters. There are hierarchical and non-hierarchical methods. Some methods build clusters top-down: they begin with all items in one large cluster and break that cluster down over a number of iterations. There are also bottom-up, or agglomerative, algorithms.
Non-hierarchical, or partition, methods include K-means, graph-theoretic, and other methods. Non-hierarchical clustering methods generally pre-set the number of clusters and then fit the n items into those clusters. The name K-means indicates that there is a pre-set number of K clusters. It is an efficient algorithm but does not have the advantages of hierarchical clustering.
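As a rough illustration of the pre-set-K idea, here is a sketch of Lloyd's algorithm for K-means in plain Python (our own example, not the paper's code; the data and parameters are hypothetical):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: fix k up front, then alternate assign/update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = tuple(sum(xs) / len(xs) for xs in zip(*members))
    return centers, clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = kmeans(pts, 2)
# The two tight pairs end up in separate clusters.
```

Note that k must be chosen before running; there is no tree of alternatives to consult afterwards, which is exactly the limitation hierarchical methods avoid.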
Hierarchical clustering methods, on the other hand, first assign every item to its own cluster. For n items, we therefore start with n clusters. The algorithm then iterates, combining two of the clusters into one larger cluster at each step; after n−1 iterations there is one large cluster. A snapshot of the clusters is taken at the end of every iteration, and these snapshots are preserved in an upside-down tree (a dendrogram). If our goal, for example, were to divide the data into 10 clusters, we simply look at the level of the tree that has 10 clusters.
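The iterate-and-merge loop can be sketched as follows (a naive O(n³) illustration of our own, using complete linkage, not the paper's distributed algorithm). Each pass merges the two closest clusters and records a snapshot; the snapshots are the levels of the dendrogram:

```python
def agglomerate(dist):
    """Naive bottom-up clustering on an n x n distance matrix.

    Returns one snapshot per level: from n singleton clusters down to one.
    Cluster-to-cluster distance is complete linkage (max pairwise distance).
    """
    n = len(dist)
    clusters = [[i] for i in range(n)]
    snapshots = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
        snapshots.append([c[:] for c in clusters])
    return snapshots

# Two tight pairs: {0,1} and {2,3} merge before the final join.
dist = [[0, 1, 9, 9],
        [1, 0, 9, 9],
        [9, 9, 0, 1],
        [9, 9, 1, 0]]
levels = agglomerate(dist)
# levels runs from 4 singleton clusters down to one cluster of all items.
```

To retrieve k clusters, one simply reads off the snapshot that contains k of them, e.g. `levels[2]` here holds the two-cluster level.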
Thus, there are numerous advantages to hierarchical clustering. One is that there is no pre-set number of clusters. Another is that its output is a full tree (also known as a dendrogram) of clusters: the bottom row of the tree has each of the n items in its own cluster, the second-to-bottom row shows n−1 clusters, and the top of the tree shows one large cluster. Once finished, we can choose any of the output levels of the tree depending on how finely clustered we want the data.
Hierarchical agglomerative clustering methods include:
- single linkage
- complete linkage
- average linkage
- centroid
- Ward's method
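All of the linkage variants above fit the Lance-Williams recurrence, which updates cluster-to-cluster distances after a merge without revisiting the raw points: d(k, i∪j) = αi·d(k,i) + αj·d(k,j) + β·d(i,j) + γ·|d(k,i) − d(k,j)|. A minimal sketch of our own (the coefficients shown for single and complete linkage are the standard ones):

```python
def lance_williams(d_ki, d_kj, d_ij, alpha_i, alpha_j, beta, gamma):
    """Distance from cluster k to the merged cluster (i union j)."""
    return (alpha_i * d_ki + alpha_j * d_kj
            + beta * d_ij + gamma * abs(d_ki - d_kj))

# Standard coefficient choices for two of the linkage variants:
SINGLE   = dict(alpha_i=0.5, alpha_j=0.5, beta=0.0, gamma=-0.5)  # -> min(d_ki, d_kj)
COMPLETE = dict(alpha_i=0.5, alpha_j=0.5, beta=0.0, gamma=0.5)   # -> max(d_ki, d_kj)

d_ki, d_kj, d_ij = 2.0, 6.0, 3.0
lance_williams(d_ki, d_kj, d_ij, **SINGLE)    # 2.0, i.e. min(2.0, 6.0)
lance_williams(d_ki, d_kj, d_ij, **COMPLETE)  # 6.0, i.e. max(2.0, 6.0)
```

Because one update formula covers every variant, an implementation can switch linkage methods by changing only the four coefficients.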
We have found hierarchical bottom-up agglomerative clustering to be useful in many situations, but it has a cost of O(n³). This makes it too expensive to use in many real-world applications.
One exception is single-linkage hierarchical clustering, which can be solved by an algorithm that mimics Prim's minimum spanning tree algorithm.
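The MST connection can be sketched as follows (our own illustration, assuming a full distance matrix): a Prim-style pass builds the minimum spanning tree in O(n²), and cutting its k−1 heaviest edges leaves the k single-linkage clusters:

```python
def prim_mst_edges(dist):
    """Prim's algorithm on a complete graph given as an n x n distance matrix.

    Returns the n-1 MST edges as (weight, u, v) tuples.
    """
    n = len(dist)
    in_tree = [False] * n
    in_tree[0] = True
    # best[v] = (cheapest known edge weight into v, its endpoint in the tree)
    best = [(dist[0][v], 0) for v in range(n)]
    edges = []
    for _ in range(n - 1):
        # Pull in the cheapest vertex not yet in the tree.
        v = min((x for x in range(n) if not in_tree[x]), key=lambda x: best[x][0])
        w, u = best[v]
        edges.append((w, u, v))
        in_tree[v] = True
        # Relax: v may now offer cheaper edges to the remaining vertices.
        for x in range(n):
            if not in_tree[x] and dist[v][x] < best[x][0]:
                best[x] = (dist[v][x], v)
    return edges

dist = [[0, 1, 9, 9],
        [1, 0, 9, 9],
        [9, 9, 0, 1],
        [9, 9, 1, 0]]
edges = prim_mst_edges(dist)
# Dropping the single heaviest edge (weight 9) splits the data into the
# two single-linkage clusters {0, 1} and {2, 3}.
```

This is why single linkage escapes the O(n³) cost: the MST encodes the entire single-linkage merge order.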