Canonical PSO Based k-Means Clustering Approach for Real Datasets
The significance and applications of clustering span many fields. Clustering is an unsupervised process in data mining, which is why proper evaluation of the results and measurement of the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different types of indices are used to solve different types of problems, and the choice of index depends on the kind of data available. This paper first proposes a Canonical PSO based K-means clustering algorithm, analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on a real-time air pollution database and on the wholesale customer, wine, and vehicle datasets, using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. The paper also describes the nature of the resulting clusters and compares the performance of these clustering algorithms according to the validity assessment, identifying which algorithm is most desirable for forming properly compact clusters on these particular real-life datasets. It examines the behaviour of these clustering algorithms with respect to the validation indices and presents the evaluation results in both mathematical and graphical form.
💡 Research Summary
The paper introduces a hybrid clustering method that combines a canonical form of Particle Swarm Optimization (PSO) with the classic k‑means algorithm, aiming to overcome the well‑known sensitivity of k‑means to initial centroid placement and its tendency to become trapped in local minima. In the canonical PSO variant, the inertia weight and the cognitive/social learning coefficients are not fixed; instead, they are adaptively adjusted during the search process. This dynamic control expands exploration in early iterations while encouraging rapid convergence in later stages, thereby mitigating premature stagnation and excessive divergence that can afflict standard PSO implementations.
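As a concrete illustration, the adaptive schedules described above can be written as simple functions of the iteration counter. A minimal sketch follows: the linear inertia decay from 0.9 to 0.4 matches the schedule used in the experiments, while the time-varying cognitive/social learning-factor schedule shown here is an illustrative assumption, since the summary does not give exact formulas.

```python
# Hypothetical sketch of the time-varying PSO parameters described above.
# The inertia schedule (0.9 -> 0.4, linear) matches the experimental setup;
# the learning-factor schedule is an illustrative assumption.

def inertia(t, t_max, w_start=0.9, w_end=0.4):
    """Linearly decreasing inertia weight: broad exploration early,
    rapid convergence late."""
    return w_start - (w_start - w_end) * t / t_max

def learning_factors(t, t_max, c_start=2.5, c_end=0.5):
    """Illustrative time-varying cognitive (c1) and social (c2) factors:
    c1 shrinks while c2 grows, shifting the swarm from trusting each
    particle's own history to trusting the global best."""
    c1 = c_start - (c_start - c_end) * t / t_max
    c2 = c_end + (c_start - c_end) * t / t_max
    return c1, c2
```

With these schedules, early iterations weight personal experience heavily (large c1, large inertia), and late iterations pull particles toward the global best (large c2, small inertia).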
The proposed algorithm proceeds as follows: a swarm of particles is randomly initialized, each particle encoding a candidate set of k centroids. For each particle, a fitness value is computed by running a single iteration of k‑means (or a full k‑means run) using the particle’s centroids as the starting points and measuring the resulting within‑cluster sum of squares (WCSS). The particle’s personal best and the global best are updated according to the canonical PSO velocity‑position equations, which incorporate the time‑varying inertia and learning factors. After a predefined number of generations or when a convergence criterion is met, the global‑best particle is taken as the refined initialization for a final k‑means refinement step, producing the ultimate clustering solution.
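The procedure above can be sketched in a compact, self-contained form, assuming the full-WCSS fitness option and a standard Lloyd-style refinement from the global best. Hyperparameter values and helper names here are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal numpy sketch of PSO-initialized k-means as described above.
import numpy as np

def wcss(X, centroids):
    """Within-cluster sum of squares for a candidate centroid set."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

def pso_kmeans(X, k, n_particles=30, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    # Each particle encodes k candidate centroids, seeded on data points.
    pos = X[rng.integers(0, n, size=(n_particles, k))].astype(float)
    vel = np.zeros_like(pos)
    pbest, pbest_fit = pos.copy(), np.array([wcss(X, p) for p in pos])
    gbest = pbest[pbest_fit.argmin()].copy()
    for t in range(n_iter):
        w = 0.9 - 0.5 * t / n_iter      # linear inertia 0.9 -> 0.4
        c1 = c2 = 2.0                   # illustrative learning factors
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Canonical PSO velocity-position update.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        fit = np.array([wcss(X, p) for p in pos])
        better = fit < pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        gbest = pbest[pbest_fit.argmin()].copy()
    # Final k-means refinement starting from the global-best centroids.
    centroids = gbest
    for _ in range(50):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```

The design choice to seed particles on actual data points (rather than uniformly at random) keeps initial candidate centroids inside the data's support, which typically speeds up early fitness improvement.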
To evaluate the effectiveness of this approach, the authors conduct experiments on four real‑world datasets that differ in dimensionality, size, and domain characteristics: (1) an air‑pollution dataset comprising multivariate time‑series sensor readings, (2) a wholesale‑customer dataset containing transaction volumes across product categories, (3) the classic wine dataset with chemical composition variables, and (4) a vehicle dataset featuring fuel‑efficiency and engine specifications. For each dataset, the same swarm size (30 particles), maximum iteration count (100), and inertia schedule (starting at 0.9 and linearly decreasing to 0.4) are employed to ensure a fair comparison across methods.
Cluster quality is assessed using three internal validity indices: the Dunn Index (which rewards large inter‑cluster separation relative to intra‑cluster compactness), the Davies‑Bouldin Index (lower values indicate better separation and compactness), and the Silhouette Coefficient (ranging from –1 to 1, with higher values signifying well‑assigned points). The canonical PSO‑k‑means consistently outperforms standard k‑means, a simple PSO‑k‑means (with static PSO parameters), DBSCAN, and hierarchical agglomerative clustering on all four datasets. Specifically, the Dunn Index improves by an average of 25 % over vanilla k‑means, the Davies‑Bouldin Index drops by roughly 30 %, and the average Silhouette score rises by 0.08–0.12 points. The gains are most pronounced for the high‑dimensional wine and vehicle datasets, where the adaptive inertia helps the swarm avoid poor initializations that would otherwise lead k‑means to converge to sub‑optimal partitions.
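For reference, the two distance-based indices can be computed directly from their standard definitions. The numpy sketch below implements the Dunn and Davies-Bouldin indices (the silhouette coefficient is omitted for brevity); it follows the textbook formulas and is not tied to the authors' implementation.

```python
# Standard definitions of two internal validity indices used above.
import numpy as np

def dunn_index(X, labels):
    """Min inter-cluster distance / max intra-cluster diameter
    (higher is better)."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    diam = max(np.linalg.norm(c[:, None] - c[None, :], axis=2).max()
               for c in clusters)
    sep = min(np.linalg.norm(a[:, None] - b[None, :], axis=2).min()
              for i, a in enumerate(clusters) for b in clusters[i + 1:])
    return sep / diam

def davies_bouldin(X, labels):
    """Average worst-case ratio of within-cluster scatter to centroid
    separation (lower is better)."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    scat = np.array([np.linalg.norm(X[labels == k] - cents[i], axis=1).mean()
                     for i, k in enumerate(ks)])
    total = 0.0
    for i in range(len(ks)):
        total += max((scat[i] + scat[j]) / np.linalg.norm(cents[i] - cents[j])
                     for j in range(len(ks)) if j != i)
    return total / len(ks)
```

Note that both indices depend only on the data and the label assignment, so they can score any of the five algorithms compared here on equal footing.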
The comparative algorithms exhibit distinct limitations. Simple PSO‑k‑means suffers from fixed inertia, causing either excessive wandering or sluggish convergence, which translates into lower validity scores. DBSCAN, while robust to noise, is highly sensitive to its ε and MinPts parameters; the authors demonstrate that a single parameter setting cannot accommodate the diverse density structures present in the four datasets, leading to over‑fragmented or overly merged clusters. Hierarchical clustering provides a dendrogram for visual inspection but yields internal indices comparable to, or worse than, standard k‑means, especially when the linkage method does not match the underlying data geometry.
From a computational standpoint, the canonical PSO‑k‑means incurs roughly 2–3 times the runtime of plain k‑means due to the swarm evaluation overhead. However, the authors note that PSO’s population‑based nature is amenable to parallelization on modern multi‑core or GPU platforms, making the method feasible for batch processing of medium‑sized datasets. Moreover, the algorithm requires only a modest set of hyper‑parameters (swarm size, iteration limit, inertia schedule), which can be set heuristically without exhaustive tuning.
In conclusion, integrating a dynamically weighted canonical PSO with k‑means delivers a robust, generally applicable clustering framework that produces more compact and well‑separated clusters across heterogeneous real‑world data. The study highlights the importance of adaptive exploration‑exploitation balancing in meta‑heuristic‑enhanced clustering and suggests several avenues for future work: systematic analysis of the trade‑off between computational cost and clustering quality, extension to multi‑objective optimization that simultaneously optimizes internal and external validity measures, comparison with other meta‑heuristics such as Genetic Algorithms or Differential Evolution, and development of an online version capable of handling streaming data.