DPMM-CFL: Clustered Federated Learning via Dirichlet Process Mixture Model Nonparametric Clustering
Clustered Federated Learning (CFL) improves performance under non-IID client heterogeneity by clustering clients and training one model per cluster, thereby striking a balance between a single global model and fully personalized models. However, most CFL methods require the number of clusters K to be fixed a priori, which is impractical when the latent structure is unknown. We propose DPMM-CFL, a CFL algorithm that places a Dirichlet Process (DP) prior over the distribution of cluster parameters. This enables nonparametric Bayesian inference to jointly infer both the number of clusters and client assignments while optimizing per-cluster federated objectives, yielding a method in which federated updates and cluster inference are coupled at each round. The algorithm is validated on benchmark datasets under Dirichlet and class-split non-IID partitions.
💡 Research Summary
The paper introduces DPMM‑CFL, a novel clustered federated learning (CFL) framework that eliminates the need to pre‑specify the number of clusters (K) by leveraging a Dirichlet Process (DP) mixture model. Traditional CFL methods such as FeSEM, IFCA, or K‑means‑based approaches require a fixed K, which is unrealistic when the latent client grouping is unknown or evolves over time. DPMM‑CFL places a DP prior G ∼ DP(α, G₀) over cluster parameters, allowing an unbounded number of clusters while the actual number of occupied clusters is bounded by the number of clients M. The base distribution G₀ is chosen as a spherical Gaussian (mean 0, variance 1), providing a weakly informative prior.
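The behavior of a DP prior, where the number of occupied clusters is unbounded in principle yet cannot exceed the number of clients M, can be illustrated with a short Chinese Restaurant Process simulation (a minimal sketch; `sample_crp` and its defaults are illustrative names, not from the paper):

```python
import random

def sample_crp(num_clients: int, alpha: float, seed: int = 0) -> list:
    """Sample cluster assignments from a Chinese Restaurant Process prior.

    Client i joins existing cluster k with probability n_k / (i + alpha)
    and opens a new cluster with probability alpha / (i + alpha), so the
    number of occupied clusters can never exceed num_clients.
    """
    rng = random.Random(seed)
    assignments = []
    counts = []  # counts[k] = number of clients currently in cluster k
    for i in range(num_clients):
        weights = counts + [alpha]  # existing clusters, plus a new one
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)        # open a brand-new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

c = sample_crp(num_clients=200, alpha=1.0)
print(len(set(c)))  # occupied clusters: far fewer than 200 for alpha = 1
```

For α = 1 and M = 200 clients, the expected number of occupied clusters grows only logarithmically in M, which matches the intuition that smaller α favors reuse of existing clusters.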
The algorithm proceeds in communication rounds. At the beginning of each round the server broadcasts the current cluster models Ωₜ₋₁ to all clients according to their previous assignments cₜ₋₁. Each client i initializes its local weights ωₜ,0,i with the model of its assigned cluster and performs Q steps of stochastic gradient descent (SGD) on its private data, producing updated parameters ωₜ,i. After all clients have finished, the server collects the set {ωₜ,i} and performs Bayesian non‑parametric clustering. Because the posterior over assignments p(c | ω, α, G₀) has no closed form when K is unknown, the authors employ split‑merge Markov chain Monte Carlo (MCMC) with restricted Gibbs proposals, rather than simple Gibbs sampling, to avoid getting trapped in local modes. The split‑merge moves are accepted according to a Metropolis–Hastings ratio that incorporates both the proposal probability and the DP‑induced prior p(c | α) together with the marginal likelihood of each cluster p(Wₖ | G₀). This step can create new clusters or merge existing ones, effectively adapting Kₜ at each round.
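The Metropolis–Hastings acceptance decision for a split or merge proposal combines the proposal ratio, the DP-induced prior ratio, and the marginal-likelihood ratio, and is best evaluated in log space. The following is a minimal sketch of that accept/reject step only (the function name and argument layout are illustrative, not the paper's implementation):

```python
import math
import random

def mh_accept(log_q_reverse, log_q_forward,
              log_prior_new, log_prior_old,
              log_lik_new, log_lik_old,
              rng=random.random):
    """Metropolis-Hastings acceptance for a split/merge proposal c -> c*:

        a = min(1, [q(c|c*) / q(c*|c)] * [p(c*|alpha) / p(c|alpha)]
                   * [p(W|c*, G0) / p(W|c, G0)])

    evaluated in log space for numerical stability.  Returns True if the
    proposed assignment c* is accepted.
    """
    log_a = (log_q_reverse - log_q_forward
             + log_prior_new - log_prior_old
             + log_lik_new - log_lik_old)
    return math.log(rng()) < min(0.0, log_a)

# example: a proposal that greatly improves the likelihood is accepted
print(mh_accept(0.0, 0.0, 0.0, 0.0, 50.0, 0.0, rng=lambda: 0.5))  # True
```

The restricted Gibbs scans that refine a proposed split before the acceptance test are omitted here; they only change how `log_q_forward` and `log_q_reverse` are computed.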
Once a new assignment cₜ is sampled, the server aggregates the local updates within each cluster by weighted averaging (weights proportional to each client’s local sample size nᵢ) to obtain updated cluster models Ωₜ,k. These models are then redistributed to the clients belonging to the corresponding clusters, and the next round begins. The coupling of federated optimization (local SGD + weighted averaging) with Bayesian clustering ensures that the model parameters and the cluster structure co‑evolve, leading to a self‑consistent solution.
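The per-cluster aggregation step is standard federated averaging restricted to each cluster's members. A minimal sketch (treating each client model as a flat parameter vector; function and variable names are illustrative):

```python
import numpy as np

def aggregate_clusters(client_weights, sample_sizes, assignments):
    """Per-cluster FedAvg: average the client updates within each cluster,
    weighting client i by its local sample size n_i."""
    cluster_models = {}
    for k in set(assignments):
        members = [i for i, c in enumerate(assignments) if c == k]
        n = np.array([sample_sizes[i] for i in members], dtype=float)
        stacked = np.stack([client_weights[i] for i in members])
        cluster_models[k] = (n[:, None] * stacked).sum(axis=0) / n.sum()
    return cluster_models

# toy example: 4 clients in 2 clusters
w = [np.array([1.0, 0.0]), np.array([3.0, 0.0]),
     np.array([0.0, 2.0]), np.array([0.0, 4.0])]
models = aggregate_clusters(w, sample_sizes=[10, 30, 20, 20],
                            assignments=[0, 0, 1, 1])
print(models[0])  # first coordinate: (10*1 + 30*3) / 40 = 2.5
```

Each resulting `cluster_models[k]` is the model Ωₜ,k that is redistributed to the clients assigned to cluster k at the start of the next round.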
Theoretical foundations are provided: the DP is described via its Pólya urn representation and the Chinese Restaurant Process (CRP) prior p(c | α). With a Normal–Normal conjugate likelihood (ωᵢ | μₖ ∼ N(μₖ, Σ) and μₖ ∼ N(μ₀, Σ₀)), the marginal likelihood of a cluster can be computed analytically, enabling efficient evaluation of the posterior during MCMC. The authors also discuss the role of the concentration parameter α, noting that larger α encourages more clusters, while smaller α favors reuse of existing clusters.
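Under the Normal–Normal conjugate pair with spherical (diagonal) covariances, the posterior predictive density of a client's parameters given a cluster's current members is Gaussian in closed form, and combining it with the CRP prior gives the collapsed assignment probabilities. A minimal sketch under those assumptions (`log_predictive` and `assignment_logits` are illustrative names, not the paper's code):

```python
import numpy as np

def log_predictive(x, cluster_data, sigma2=1.0, sigma02=1.0, mu0=0.0):
    """Log posterior-predictive density p(x | W_k) for a conjugate
    Normal-Normal model with spherical covariances, computed per
    dimension and summed (diagonal-covariance assumption)."""
    n = len(cluster_data)
    if n == 0:
        post_mean, post_var = mu0, sigma02      # prior predictive
    else:
        xbar = np.mean(cluster_data, axis=0)
        post_var = 1.0 / (1.0 / sigma02 + n / sigma2)
        post_mean = post_var * (mu0 / sigma02 + n * xbar / sigma2)
    pred_var = post_var + sigma2
    return np.sum(-0.5 * np.log(2 * np.pi * pred_var)
                  - 0.5 * (x - post_mean) ** 2 / pred_var)

def assignment_logits(x, clusters, alpha):
    """Unnormalized log p(c_i = k | rest): CRP prior times the cluster's
    predictive likelihood; the last entry is a brand-new cluster."""
    logits = [np.log(len(Wk)) + log_predictive(x, Wk) for Wk in clusters]
    logits.append(np.log(alpha) + log_predictive(x, []))
    return np.array(logits)

clusters = [[np.array([0.0, 0.0]), np.array([0.2, 0.1])],
            [np.array([5.0, 5.0])]]
x = np.array([0.1, 0.0])
print(assignment_logits(x, clusters, alpha=1.0))
```

A client whose parameters sit near a cluster's posterior mean receives a much higher logit for that cluster than for a distant or empty one, which is exactly the quantity the MCMC moves need to evaluate cheaply.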
Empirical evaluation uses two standard vision benchmarks, Fashion‑MNIST and CIFAR‑10, each partitioned among 200 simulated clients. Two non‑IID data generation schemes are considered: (1) Dirichlet partitioning, which creates a clear latent clustering structure with K = 10 ground‑truth clusters, and (2) class‑split partitioning, where each cluster receives a random subset of three classes and each client sees two of those classes, resulting in an unknown latent structure. Clients employ a convolutional neural network; the final fully connected layer serves as the representation for clustering. Training hyper‑parameters include learning rate 0.005, momentum 0.9, batch size 32, and Q = 10 local steps per round. The DP concentration α is set to 1.0, and the base Gaussian has unit variance.
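Dirichlet partitioning of a labeled dataset across clients is a common way to generate such non-IID shards. A generic sketch (the function name, the concentration parameter `beta`, and its default are illustrative; the paper's exact partitioning recipe, which also imposes K = 10 ground-truth clusters, is not reproduced here):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta=0.5, seed=0):
    """Split sample indices among clients using per-class Dirichlet
    proportions; smaller beta yields more skewed (more non-IID) shards."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        props = rng.dirichlet([beta] * num_clients)
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cid, shard in enumerate(np.split(idx, cuts)):
            client_idx[cid].extend(shard.tolist())
    return client_idx

# toy example: 10 classes x 100 samples split across 20 clients
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(labels, num_clients=20)
print(sum(len(p) for p in parts))  # 1000: every sample assigned once
```

Every sample index lands on exactly one client, while each client's class mix is skewed by the Dirichlet draw.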
Results compare DPMM‑CFL against FeSEM with multiple fixed K values. In the Dirichlet scenario, DPMM‑CFL automatically infers an average of ~12 clusters, close to the true K = 10, and consistently outperforms FeSEM across the entire range of fixed K. In the class‑split scenario, where the true number of clusters is unknown, DPMM‑CFL discovers ~25 clusters, which coincides with the K at which FeSEM attains its peak performance. Although FeSEM can slightly surpass DPMM‑CFL at very high K (>20), it requires an exhaustive sweep over K, which is computationally expensive; DPMM‑CFL achieves comparable performance without such a sweep. The authors also analyze convergence: the number of clusters Kₜ initially fluctuates due to split‑merge moves, then stabilizes after a transient phase (≈30–40 rounds), after which client assignments remain steady and per‑cluster models converge in accuracy and macro‑F1.
In summary, DPMM‑CFL provides a principled, threshold‑free mechanism for jointly learning cluster assignments and per‑cluster models in federated settings. By integrating DP‑based Bayesian non‑parametric clustering with standard federated averaging, it adapts to unknown and dynamic client heterogeneity, reduces the need for hyper‑parameter tuning, and delivers superior or comparable predictive performance. Future work may explore adaptive tuning of α and G₀, extensions to more complex model architectures, privacy‑preserving posterior sampling, and real‑world deployment on resource‑constrained edge devices.