Federated Hierarchical Clustering with Automatic Selection of Optimal Cluster Numbers
Federated Clustering (FC) is an emerging and promising approach for exploring data distribution patterns from distributed, privacy-protected data in an unsupervised manner. Existing FC methods implicitly assume that each client holds a known number of uniformly sized clusters. However, the true number of clusters is typically unknown, and cluster sizes are naturally imbalanced in real scenarios. Furthermore, the privacy-preserving transmission constraints in federated learning inevitably reduce the usable information, making the development of robust and accurate FC extremely challenging. Accordingly, we propose a novel FC framework named Fed-$k^*$-HC, which can automatically determine an optimal number of clusters $k^*$ based on the data distribution explored through hierarchical clustering. To obtain the global data distribution for $k^*$ determination, each client generates micro-subclusters, whose prototypes are then uploaded to the server for hierarchical merging. The density-based merging design allows exploring clusters of varying sizes and shapes, and the progressive merging process can self-terminate according to the neighboring relationships among the prototypes to determine $k^*$. Extensive experiments on diverse datasets demonstrate the FC capability of the proposed Fed-$k^*$-HC in accurately exploring a proper number of clusters.
💡 Research Summary
This paper addresses two fundamental challenges in federated clustering (FC): the unknown number of clusters and the presence of highly imbalanced, non‑IID data across clients. Existing FC approaches either assume a pre‑specified cluster count k and uniform cluster sizes, or they require multiple communication rounds, which increase privacy risk and communication overhead. To overcome these limitations, the authors propose Fed‑k*‑HC, a one‑shot federated hierarchical clustering framework that automatically determines the optimal number of clusters k* while preserving data privacy.
The method works as follows. Each client first over‑partitions its local dataset into a large number of micro‑subclusters (e.g., twice the expected number of true clusters). For each subcluster the client computes a prototype (centroid) together with density‑related statistics such as average intra‑cluster distance and density gap. Only these compact descriptors are uploaded to the server, ensuring that raw data never leaves the client.
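The client-side step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `client_descriptors`, the simple k-means over-partitioning, and the choice of average intra-cluster distance as the only density statistic are all assumptions made for the sketch.

```python
import numpy as np

def client_descriptors(X, n_micro, n_iters=50, seed=0):
    """Over-partition local data X into n_micro micro-subclusters via a
    simple k-means, then return only compact descriptors (prototype,
    subcluster size, average intra-cluster distance).
    Raw rows of X are never included in the output, so only these
    lightweight summaries would leave the client."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes from random local points.
    centers = X[rng.choice(len(X), n_micro, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assign each point to its nearest prototype.
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute prototypes (keep the old one if a subcluster is empty).
        for j in range(n_micro):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    descriptors = []
    for j in range(n_micro):
        members = X[labels == j]
        if len(members) == 0:
            continue
        avg_dist = np.linalg.norm(members - centers[j], axis=1).mean()
        descriptors.append({"prototype": centers[j],
                            "size": len(members),
                            "avg_intra_dist": avg_dist})
    return descriptors
```

In practice `n_micro` would be set to a multiple of the expected number of true clusters (e.g., 2x, as the summary suggests), so that even small or oddly shaped clusters are covered by at least one prototype.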
At the server side, all received prototypes are treated as nodes in a similarity graph. A dedicated distance metric d(C_i, C_j) and an overlap measure o(C_i, C_j) are defined to capture both spatial proximity and density similarity. The server then performs density-based hierarchical merging: at each step, the most similar pair of subclusters is merged. The merging process is guided by the concepts of Natural Neighbors (NN_b) and Strict Natural Neighbors (SNN). When every cluster's neighbor set becomes saturated—i.e., no further meaningful merges are possible—the algorithm stops, and the current number of clusters is declared as the automatically determined k*. This stopping criterion effectively eliminates the "uniform effect," in which large clusters dominate small ones.
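A hedged sketch of the server-side step is given below. The paper's d(., .), o(., .), and natural-neighbor saturation test are not reproduced here; as a stand-in stopping rule, this sketch halts when the next merge distance jumps by more than `gap_factor` times the largest merge distance seen so far, and reads off k* as the surviving cluster count. The function name `server_merge` and the size-weighted centroid update are likewise assumptions for illustration.

```python
import numpy as np

def server_merge(prototypes, gap_factor=2.0):
    """Greedy agglomerative merging of uploaded prototypes.
    Returns (k_star, clusters), where clusters lists the prototype
    indices grouped together. The gap-based stop is a simplified
    stand-in for the natural-neighbor saturation criterion."""
    clusters = [[i] for i in range(len(prototypes))]
    centers = [np.asarray(p, dtype=float) for p in prototypes]
    prev_d = None
    while len(clusters) > 1:
        # Find the closest pair of current cluster centers.
        best = None
        for a in range(len(centers)):
            for b in range(a + 1, len(centers)):
                d = np.linalg.norm(centers[a] - centers[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Stop when the merge distance jumps sharply (stand-in for
        # detecting that every neighbor set is saturated).
        if prev_d is not None and d > gap_factor * prev_d:
            break
        prev_d = d if prev_d is None else max(prev_d, d)
        # Merge b into a: size-weighted centroid update.
        na, nb = len(clusters[a]), len(clusters[b])
        centers[a] = (centers[a] * na + centers[b] * nb) / (na + nb)
        clusters[a].extend(clusters[b])
        del clusters[b], centers[b]
    return len(clusters), clusters
```

With six prototypes forming two tight, well-separated groups, the merge distance jumps sharply once each group has collapsed, so the loop self-terminates with k* = 2 without any preset cluster count.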
Key contributions include: (1) a novel micro‑subcluster generation step that faithfully captures client‑side data heterogeneity; (2) a hierarchical merging scheme that can handle clusters of varying sizes and shapes without requiring a preset k; (3) a one‑shot communication protocol that transmits only lightweight prototypes, thereby reducing privacy exposure and bandwidth usage.
Extensive experiments were conducted on six public datasets (MNIST, FEMNIST, CIFAR‑10, Reuters, synthetic imbalanced data, etc.) and three imbalanced scenarios. Baselines comprised one‑shot methods (KFed, MUFC, F3KM, OrcHestra) and multi‑round approaches (FedET, FedGT). Evaluation metrics were clustering accuracy (ACC), normalized mutual information (NMI), and the absolute difference |k* − K| between the automatically inferred and true cluster counts. Fed‑k*‑HC consistently outperformed baselines, achieving an average ACC improvement of 5–12 percentage points and NMI gains of 0.05–0.12. Moreover, k* matched the ground‑truth K in 92 % of cases, whereas existing methods often deviated by 20 % or more. Communication cost was limited to a single round, with transmitted data proportional only to the number of micro‑subclusters per client, yielding a >70 % reduction compared to iterative schemes.
The paper also discusses limitations. The number of micro‑subclusters must be set a priori; too few may miss fine‑grained structure, while too many increase computational and communication load. High‑dimensional data can make distance calculations expensive, and client‑side prototype generation adds computational burden for resource‑constrained devices. Future work is suggested on adaptive subcluster count selection and dimensionality‑reduction techniques to further improve scalability.
In summary, Fed‑k*‑HC provides a practical, privacy‑preserving solution for federated clustering under realistic conditions of unknown cluster numbers and imbalanced data. Its one‑shot hierarchical merging paradigm opens new avenues for personalized federated learning, medical data analysis, and other domains where decentralized, unlabeled data must be clustered without compromising user privacy.