A New Clustering Algorithm Based on Near Neighbor Influence


This paper presents Clustering based on Near Neighbor Influence (CNNI), a new clustering algorithm inspired by the idea of near neighbors and the superposition principle of influence. To describe the algorithm clearly, it introduces several important concepts, such as the near neighbor point set, near neighbor influence, and a similarity measure. In simulated experiments on several artificial data sets and seven real data sets, we observe that the algorithm often achieves good clustering quality when its parameters are set properly. Finally, it outlines some research directions for extending this algorithm.


💡 Research Summary

The paper introduces a novel clustering method called Clustering based on Near Neighbor Influence (CNNI). The core idea is to quantify the influence that nearby points exert on a candidate cluster center and to aggregate these influences to decide cluster membership. The authors first define a “near neighbor point set” as all data points lying within a radius δ of a given point, using a distance function d(·,·). For each neighbor, an influence weight is computed, typically via a Gaussian kernel exp(−d²/σ²) or an inverse‑distance function 1/d. The total influence I(p) for a point p is the sum of the influences contributed by all its neighbors, embodying a superposition principle.
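The neighbor set and influence superposition described above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and the Gaussian kernel exp(−d²/σ²) mentioned in the summary; the function names are ours, not the paper's.

```python
import math

def neighbors(points, p, delta):
    """Near neighbor point set: all other points within radius delta of p."""
    return [q for q in points if q is not p and math.dist(p, q) <= delta]

def influence(points, p, delta, sigma):
    """Total influence I(p): superposition of the Gaussian-kernel weights
    exp(-d^2 / sigma^2) contributed by each near neighbor of p."""
    return sum(math.exp(-math.dist(p, q) ** 2 / sigma ** 2)
               for q in neighbors(points, p, delta))

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
# A point inside the dense group accumulates far more influence
# than the isolated point, which has no neighbors within delta.
print(influence(pts, pts[0], delta=1.0, sigma=0.5))
print(influence(pts, pts[3], delta=1.0, sigma=0.5))
```

The inverse-distance variant 1/d would simply replace the kernel expression inside the sum.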

The algorithm proceeds in four steps. (1) Build the δ‑neighborhood for every data point and store the neighbor lists. (2) Compute the influence weight of each neighbor and sum them to obtain I(p). (3) Declare points whose total influence exceeds a threshold τ as core points. (4) Merge core points whose influence regions overlap into the same cluster; non‑core points are assigned to the core point that exerts the greatest influence on them. This process yields clusters that respect both density and shape, because the influence aggregation naturally captures local geometry.
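The four steps can be sketched end to end as follows. This is a simplified reading of the summary, not the authors' exact formulation: it assumes brute-force neighbor search, a Gaussian kernel, and an illustrative "regions overlap" test (core points within 2δ of each other are merged).

```python
import math

def cnni(points, delta, sigma, tau):
    """Sketch of the four steps: neighbor lists, influence sums,
    core-point selection, and merging of overlapping core regions."""
    n = len(points)
    # Step 1: delta-neighborhood of every point (brute force).
    nbrs = [[j for j in range(n) if j != i
             and math.dist(points[i], points[j]) <= delta]
            for i in range(n)]
    # Step 2: total influence I(p) via Gaussian-kernel superposition.
    infl = [sum(math.exp(-math.dist(points[i], points[j]) ** 2 / sigma ** 2)
                for j in nbrs[i]) for i in range(n)]
    # Step 3: core points are those whose total influence exceeds tau.
    core = [i for i in range(n) if infl[i] > tau]
    # Step 4a: merge core points with overlapping influence regions
    # (illustrative rule: centers within 2 * delta), via union-find.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a in core:
        for b in core:
            if a < b and math.dist(points[a], points[b]) <= 2 * delta:
                parent[find(a)] = find(b)
    # Step 4b: a non-core point joins the core point exerting the
    # greatest influence on it (here: the largest kernel weight).
    labels = []
    for i in range(n):
        if i in core:
            labels.append(find(i))
        elif core:
            best = max(core, key=lambda c: math.exp(
                -math.dist(points[i], points[c]) ** 2 / sigma ** 2))
            labels.append(find(best))
        else:
            labels.append(-1)  # no core points at all: mark as noise
    return labels

pts = [(0, 0), (0.2, 0), (0, 0.2), (5, 5), (5.2, 5), (5, 5.2)]
labels = cnni(pts, delta=1.0, sigma=0.5, tau=1.0)
print(labels)  # two dense groups come out as two distinct clusters
```

On this toy input the two tight groups receive two distinct labels, while a stray point far from any core would fall through to the influence-based assignment in step 4b.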

Complexity analysis shows that the dominant cost is the neighbor search, which is O(n·k) where n is the number of points and k is the average number of neighbors within δ. Consequently, the overall runtime is close to linear for moderate‑dimensional data, and memory consumption is limited to the neighbor lists. The authors acknowledge that the choice of δ and τ strongly affects results; they propose heuristics based on the empirical distribution of pairwise distances to set these parameters automatically.
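One way such a distance-distribution heuristic can look is sketched below. The quantile rule here is our illustrative assumption, not the paper's specific formula: δ is set to a low quantile of the empirical pairwise-distance distribution so that neighborhoods stay local.

```python
import math

def suggest_delta(points, quantile=0.1):
    """Illustrative heuristic: pick delta as a low quantile of the
    empirical pairwise-distance distribution, so that on average only
    nearby points fall inside each neighborhood."""
    dists = sorted(math.dist(p, q)
                   for i, p in enumerate(points)
                   for q in points[i + 1:])
    return dists[int(quantile * (len(dists) - 1))]

pts = [(0, 0), (0.2, 0), (0, 0.2), (5, 5), (5.2, 5)]
delta = suggest_delta(pts)
print(delta)  # a small radius reflecting within-group spacing
```

A threshold τ could be chosen analogously, e.g. as a quantile of the influence values computed with the suggested δ.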

Experimental evaluation includes synthetic datasets (circular, spherical, and highly imbalanced clusters) and seven real‑world datasets covering image classification, text clustering, gene expression, and sensor readings. Performance is measured with silhouette scores, precision/recall, and normalized mutual information (NMI). Across most cases, CNNI outperforms classic K‑means (which suffers from sensitivity to initialization) and DBSCAN (which can over‑detect noise or require careful ε tuning). Notably, CNNI excels on data with irregular shapes or large density variations, because the influence model preserves shape information that distance‑only methods lose.
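For reference, the NMI score used in the evaluation can be computed from scratch in a few lines. This is the standard definition with square-root normalization (a common convention; the paper may use a different normalization):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings:
    NMI = I(A;B) / sqrt(H(A) * H(B)), with natural-log entropies."""
    n = len(labels_a)
    pa = Counter(labels_a)
    pb = Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
             for (a, b), c in pab.items())
    ha = -sum(c / n * math.log(c / n) for c in pa.values())
    hb = -sum(c / n * math.log(c / n) for c in pb.values())
    return mi / math.sqrt(ha * hb) if ha and hb else 1.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions score 1
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # independent partitions score 0
```

NMI is label-permutation invariant, which is why it suits clustering: the first call scores the two labelings as identical even though the label names are swapped.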

The paper also discusses limitations. In high‑dimensional spaces, distance calculations become expensive, potentially degrading efficiency. The algorithm’s sensitivity to δ and τ means that robust automatic tuning is essential for practical deployment. Moreover, the current formulation is batch‑oriented; extensions to streaming or online clustering are not addressed.

Future work suggested by the authors includes: (1) integrating efficient nearest‑neighbor structures such as KD‑trees, Ball‑trees, or locality‑sensitive hashing to accelerate high‑dimensional searches; (2) developing adaptive schemes that adjust δ and τ on the fly based on evolving data statistics; (3) combining CNNI with semi‑supervised learning to incorporate partial label information; and (4) designing an online version capable of incremental updates for data streams.
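As a taste of what such acceleration buys, a uniform grid (a simpler cousin of the KD-tree and Ball-tree structures mentioned above, shown here in 2-D for brevity) already reduces each δ-neighborhood query from a scan of all n points to a scan of the 3×3 block of adjacent cells:

```python
import math
from collections import defaultdict

def grid_neighbors(points, delta):
    """Illustrative acceleration (not from the paper): bucket 2-D points
    into a uniform grid with cell size delta, so each query inspects only
    the 3x3 block of neighboring cells instead of all n points."""
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        cells[(int(x // delta), int(y // delta))].append(i)
    nbrs = []
    for i, (x, y) in enumerate(points):
        cx, cy = int(x // delta), int(y // delta)
        found = [j
                 for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                 for j in cells[(cx + dx, cy + dy)]
                 if j != i and math.dist(points[i], points[j]) <= delta]
        nbrs.append(sorted(found))
    return nbrs

pts = [(0.0, 0.0), (0.3, 0.0), (5.0, 5.0)]
print(grid_neighbors(pts, delta=1.0))  # the far point has no neighbors
```

Grids degrade in high dimensions (the adjacent-cell block grows as 3^d), which is exactly why the authors point to KD-trees, Ball-trees, and locality-sensitive hashing for that regime.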

In summary, CNNI offers a conceptually simple yet powerful alternative to existing density‑based clustering techniques. By modeling and superimposing the influence of nearby points, it captures both density and geometric structure, leading to higher-quality clusters on a variety of datasets. The method’s linear‑ish scalability, modest memory footprint, and flexibility in handling arbitrary shapes make it a promising addition to the clustering toolbox, provided that parameter selection and high‑dimensional efficiency are further refined.