Distributed Kernel K-Means for Large Scale Clustering
📝 Abstract
Clustering samples according to an effective metric and/or vector space representation is a challenging unsupervised learning task with a wide spectrum of applications. Among the many clustering algorithms, k-means and its kernelized version still have a wide audience because of their conceptual simplicity and efficacy. However, the systematic application of kernelized k-means is hampered by its inherent quadratic scaling in memory with the number of samples. In this contribution, we devise an approximate strategy to minimize the kernel k-means cost function in which the trade-off between accuracy and speed is automatically governed by the available system memory. Moreover, we define an ad-hoc parallelization scheme well suited for state-of-the-art hybrid CPU-GPU parallel architectures. We prove the effectiveness of both the approximation scheme and the parallelization method on standard UCI datasets and on molecular dynamics (MD) data in the realm of computational chemistry. In this application domain, clustering can play a key role both for quantitatively estimating kinetic rates via Markov State Models and for qualitatively providing a human-readable summary of the underlying chemical phenomenon under study. For these reasons, we selected it as a valuable real-world application scenario.
📄 Content
Marco Jacopo Ferrarotti1, Sergio Decherchi1, 2 and Walter Rocchia1
1 Istituto Italiano di Tecnologia, Genoa, Italy
sergio.decherchi@iit.it
2 BiKi Technologies s.r.l, Genoa, Italy

KEYWORDS

Clustering, Unsupervised Learning, Kernel Methods, Distributed Computing, GPU, Molecular Dynamics
1. INTRODUCTION

Grouping unlabelled data samples into meaningful groups is a challenging unsupervised Machine Learning (ML) problem with a wide spectrum of applications, ranging from image segmentation in computer vision to data modelling in computational chemistry [1]. Since 1957, when k-means was originally introduced, a plethora of clustering algorithms has arisen without a clear all-around winner. Among all the possibilities, k-means as originally proposed is still widely adopted, mainly because of its simplicity and the straightforward interpretation of its results. The applicability of such a simple, yet powerful, algorithm is limited, however, by the fact that, by construction, it can correctly identify only linearly separable clusters, and it requires an explicit feature space (i.e. a vector space where each sample has explicit coordinates). Both limitations can be overcome with the well-known kernel extension of k-means [2].

Computational complexity and memory occupancy are the major drawbacks of kernel k-means: the size of the kernel matrix to be stored, together with the number of kernel function evaluations, scales quadratically with the number of samples. This computational burden has historically limited the success of kernel k-means as an effective clustering technique. In fact, even though the potential of the approach has been theoretically demonstrated, few works in the literature [3] explore more efficient approaches able to overcome the O(N²) computational cost.

We selected a challenging real-world application scenario, namely Molecular Dynamics (MD) simulations of biomolecules in the field of computational chemistry. Such atomistic simulations, obtained by numerical integration of the equations of motion, are a valuable tool in the study of biomolecular processes of paramount importance, such as drug-target interaction [4].
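To make the quadratic cost concrete, the standard (unapproximated) kernel k-means iteration can be sketched as follows: the feature-space distance from a sample to a cluster centroid expands entirely in terms of kernel entries, so the full N × N matrix must be available. This is a minimal illustrative sketch, not the paper's algorithm; the RBF kernel, the deterministic initial labels, and all parameter choices are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Dense kernel matrix: N x M entries -- with X == Y (the whole dataset)
    # this is exactly the O(N^2) memory bottleneck discussed above.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_kmeans(K, k, init, n_iter=50):
    # K: precomputed N x N kernel matrix; init: initial labels in [0, k).
    N = K.shape[0]
    labels = init.copy()
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.empty((N, k))
        for c in range(k):
            mask = labels == c
            nc = max(mask.sum(), 1)
            # ||phi(x) - m_c||^2 = K(x,x) - (2/|C|) sum_j K(x, x_j)
            #                    + (1/|C|^2) sum_{i,j} K(x_i, x_j)
            dist[:, c] = (diag
                          - 2.0 * K[:, mask].sum(axis=1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc ** 2)
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```

Note that every iteration touches all N² kernel entries, which is precisely what the approximation introduced in this paper avoids.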
MD simulations produce an enormous amount of data in the form of conformational frames (i.e. atom positions at a given time step) that need to be processed and converted into humanly readable models to get mechanistic insights. Clustering can play a crucial role here, as demonstrated by the success of recent works [1] and by the popularity of Markov state models [5]. We stress that kernel k-means, by not requiring an explicit feature space, is particularly suited for clustering MD conformational frames, where roto-translational invariance is mandatory.

We introduce here an approximate kernel k-means algorithm together with an ad-hoc distribution strategy particularly suited for massively parallel hybrid CPU/GPU architectures. We reduce the number of kernel evaluations both via a mini-batch approach and via an a priori sparse representation of the cluster centroids. As will become clear, this twofold approximation is controlled via two straightforward parameters: the number of mini-batches 𝐵 and the sparsity degree of the centroid representation 𝑠. These two knobs allow one to finely adapt the algorithm to the available computational resources and to cope with virtually any sample size.

The rest of the paper is organized as follows: in section 2, we briefly review the standard kernel k-means [2] [6] algorithm. In section
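The flavor of the twofold approximation can be conveyed with a toy sketch: samples are visited in B mini-batches, and each centroid is represented by at most s support samples, so each assignment costs O(k·s) kernel evaluations instead of O(N). This is an illustrative assumption-laden sketch, not the authors' scheme; in particular, the mean-similarity proxy, the "most recent s supports" rule, and all names below are invented for illustration.

```python
import numpy as np

def rbf(x, Y, gamma=1.0):
    # Kernel values between one sample x and a set of samples Y.
    return np.exp(-gamma * ((Y - x) ** 2).sum(axis=-1))

def minibatch_sparse_kernel_kmeans(X, k, B=4, s=16, gamma=1.0, seed=0):
    # Toy sketch: each centroid is a list of at most s support indices,
    # so assigning one sample needs k*s kernel evaluations, not N.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    supports = [[int(i)] for i in rng.choice(N, size=k, replace=False)]
    for batch in np.array_split(rng.permutation(N), B):
        for i in batch:
            # Proxy for the feature-space distance: mean similarity to the
            # centroid's supports (ignores the centroid self-term).
            sims = [rbf(X[i], X[idx], gamma).mean() for idx in supports]
            c = int(np.argmax(sims))
            supports[c].append(int(i))
            supports[c] = supports[c][-s:]  # enforce sparsity degree s
    # Final assignment pass over all samples.
    labels = np.array([int(np.argmax([rbf(X[i], X[idx], gamma).mean()
                                      for idx in supports]))
                       for i in range(N)])
    return labels, supports
```

Increasing B shrinks the working set per pass, while s caps both the memory footprint of each centroid and the per-sample kernel-evaluation count, mirroring the accuracy/speed trade-off described above.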