
📝 Original Info

  • Title:
  • ArXiv ID: 2512.20363
  • Date:
  • Authors: Unknown

📝 Abstract

Federated learning (FL) supports privacy-preserving, decentralized machine learning (ML) model training by keeping data on client devices. However, non-independent and identically distributed (non-IID) data across clients biases updates and degrades performance. To alleviate these issues, we propose Clust-PSI-PFL, a clustering-based personalized FL framework that uses the Population Stability Index (PSI) to quantify the level of non-IID data. We compute a weighted PSI metric, WPSI^L, which we show to be more informative than common non-IID metrics (Hellinger, Jensen-Shannon, and Earth Mover's distance). Using PSI features, we form distributionally homogeneous groups of clients via K-means++; the optimal number of clusters is chosen by a systematic silhouette-based procedure, typically yielding few clusters with modest overhead. Across six datasets (tabular, image, and text modalities), two partition protocols (Dirichlet with parameter α and Similarity with parameter S), and multiple client sizes, Clust-PSI-PFL delivers up to 18% higher global accuracy than state-of-the-art baselines and markedly improves client fairness, with a relative improvement of 37% under severe non-IID data. These results establish PSI-guided clustering as a principled, lightweight mechanism for robust PFL under label skew. Code will be released upon acceptance.

📄 Full Content

1) We employ the PSI as a practical, low-overhead metric to quantify the level of non-IID data in FL, and analyze its properties by contrasting it with existing alternative non-IID metrics. 2) We introduce Clust-PSI-PFL, a clustering-based personalization framework that leverages PSI to partition clients and train per-cluster models, thereby mitigating the degradation caused by increasing non-IID data. 3) Across non-IID scenarios, Clust-PSI-PFL yields up to an 18% boost in global test accuracy with an improvement of 37% in fairness over state-of-the-art baselines.

To the best of our knowledge, this is the first study to integrate PSI into FL both as a non-IID data quantifier and as a basis for client clustering to enhance global model performance.

a) Basics of FL: In Fig. 1 (left), the standard round-based FL pipeline works as follows: a server initializes a global model and broadcasts it to K clients; each client trains locally on private data and returns updates (optionally protected via secure multiparty computation (MPC) [8] or differential privacy (DP) [9]); the server aggregates the model weights (e.g., with FedAvg) to form a new global model and rebroadcasts it, repeating until convergence.
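The weighted aggregation at the heart of this loop can be sketched in a few lines. The snippet below is an illustrative reimplementation, not the paper's code: plain Python lists stand in for real model tensors, and the function name is ours.

```python
def fedavg_aggregate(client_weights, client_sizes):
    """Sample-size-weighted average of client model weights (FedAvg-style).

    client_weights: one flat weight vector per client
    client_sizes:   number of local samples on each client
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_w = [0.0] * dim
    for w, n in zip(client_weights, client_sizes):
        for d in range(dim):
            global_w[d] += (n / total) * w[d]
    return global_w

# Two clients: the one holding 3x the data pulls the average toward its weights.
w_global = fedavg_aggregate([[1.0, 0.0], [0.0, 1.0]], [30, 10])
# w_global = [0.75, 0.25]
```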

Non-IID client data is a primary challenge in FL [2]. Unlike IID regimes, non-IID client data distributions (Fig. 1, right) induce client drift: local parameters W_t^(i) deviate from the global optimum W_t^(Opt), impeding convergence and yielding non-optimal solutions. In addition, aggregating misaligned updates can drive the global model W_t^(Avg) toward a poor local optimum, further degrading accuracy.

Differences in per-client label distributions are a primary source of difficulty for FL, adversely impacting both accuracy and convergence dynamics [10], [11]. To quantify the severity of non-IID data, several measures have been proposed [2]. Within PFL, three predominant mitigation families are widely studied: regularization-based techniques, client-selection strategies, and clustering-based solutions [3], as discussed in the following sections.

b) Non-IID Data Quantification in FL: Prior work measures client-distribution mismatch with three common metrics. Hellinger distance (HD) estimates deviation from a balanced reference and has shown strong discrimination, particularly in cross-device FL [12], [13]. Jensen-Shannon distance (JSD) supports high-level comparisons and the construction of client-to-client similarity graphs to mitigate label-skew effects [14], [15]. Earth Mover’s/Wasserstein distance (EMD) is used to guide client scheduling and assess non-IID severity by comparing client and global label distributions [16].

c) Regularization-based Baselines for Non-IID Data: One of the most widespread solutions to tackle non-IID data effects is FedProx [7]. It generalizes FedAvg by adding a proximal term whose coefficient µ ≥ 0 controls the penalization strength. This penalizes drift and stabilizes optimization under non-IID data.
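The proximal local step can be sketched as follows. This is a minimal sketch, not FedProx's reference implementation: a toy quadratic loss stands in for the real client objective, and all names are illustrative.

```python
def fedprox_local_step(w, w_global, grad_loss, lr, mu):
    """One local SGD step with FedProx's proximal term:
    w <- w - lr * (grad_loss(w) + mu * (w - w_global)).
    mu = 0 recovers plain FedAvg local SGD; a larger mu pulls the
    local model toward the current global model, limiting client drift.
    """
    return [wi - lr * (g + mu * (wi - gi))
            for wi, g, gi in zip(w, grad_loss(w), w_global)]

# Illustrative quadratic local loss 0.5 * ||w - w_star||^2 (minimum at w_star).
w_star = [2.0, -1.0]
grad = lambda w: [wi - si for wi, si in zip(w, w_star)]

w_global = [0.0, 0.0]
w = list(w_global)
for _ in range(100):
    w = fedprox_local_step(w, w_global, grad, lr=0.1, mu=1.0)
# The iterate settles between the local optimum and the global model:
# fixed point (w_star + mu * w_global) / (1 + mu) = [1.0, -0.5].
```

With mu = 0 the same loop would converge to w_star itself, illustrating how the proximal term trades local fit for proximity to the global model.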

Under the FedOpt framework [17], clients perform local stochastic gradient descent (SGD), while the server applies an adaptive optimizer (e.g., Adagrad, Yogi, Adam) to the aggregated updates, thereby improving stability and effectiveness under non-IID data. The variants differ in moment accumulation/normalization (e.g., Yogi's conservative second moment), offering robustness to sparse gradients and non-convexity and reducing hyperparameter sensitivity compared to FedAvg, as supported by theory and experiments.

FedAvgM [18] augments FedAvg with server-side momentum, smoothing aggregated updates and stabilizing training under non-IID data. By maintaining a moving average of the global update direction, it reduces oscillations on skewed partitions and often surpasses FedAvg. The trade-off of FedAvgM is additional hyperparameters (momentum coefficient, server learning rate) that require careful tuning.

d) Selection-based Baselines for Non-IID Data: To mitigate non-IID effects, Power-of-choice (PoC) [19] preferentially samples high-loss clients, balancing convergence speed and bias; it achieves up to 3× faster convergence and ∼10% higher test accuracy than random selection. In contrast to PoC, our PSI-guided clustering focuses training within distributionally coherent client clusters, further improving accuracy and efficiency.
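The server-side momentum described above for FedAvgM can be sketched as follows. This is a hedged sketch under our assumptions, not the paper's code: client deltas are taken to be (local − global) weight differences, and the defaults for beta and the server learning rate are illustrative.

```python
def fedavgm_server_update(w_global, client_deltas, client_sizes,
                          velocity, beta=0.9, server_lr=1.0):
    """Server-side momentum in the style of FedAvgM (sketch).

    The sample-weighted average of client deltas is folded into a moving
    'velocity', which is then applied to the global model, smoothing
    round-to-round oscillations on skewed partitions.
    """
    total = sum(client_sizes)
    avg = [0.0] * len(w_global)
    for delta, n in zip(client_deltas, client_sizes):
        for d in range(len(avg)):
            avg[d] += (n / total) * delta[d]
    velocity = [beta * v + a for v, a in zip(velocity, avg)]
    w_global = [w + server_lr * v for w, v in zip(w_global, velocity)]
    return w_global, velocity

# Two rounds with the same averaged delta: momentum amplifies the move.
w, v = fedavgm_server_update([0.0, 0.0], [[1.0, 2.0]], [4], [0.0, 0.0])
w, v = fedavgm_server_update(w, [[1.0, 2.0]], [4], v)
# After round 2, velocity is approximately [1.9, 3.8] and w is [2.9, 5.8].
```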

HACCS [20] clusters clients by data-histogram similarity and, within each cluster, selects low-latency participants, preserving distributional coverage while favoring fast devices. This yields robustness to individual dropouts (as long as similar clients remain), maintains balanced representation, and accelerates convergence. FedCLS [21] guides client sampling using group-level label information and Hamming distances between one-hot label vectors, preferentially selecting clients with complementary (diverse) label distributions. Compared to FedAvg with random selection, it achieves faster convergence, higher accuracy, and more stable training (well-suited to non-IID federated settings).

e) Clustering-based Baselines for Non-IID Data: Clustered Federated Learning (CFL), proposed by Sattler et al. [22], tackles non-IID data by recursively clustering clients via the cosine similarity of their local updates when training stalls (small global norm, large client norm). It bipartitions to minimize cross-cluster similarity and repeats; the method is model-agnostic, FedAvg-compatible, and requires no preset number of clusters.

FedSoft, by Ruan and Joe-Wong [23], extends clustered FL to soft clustering, letting each client be a mixture of source distributions and jointly learning cluster models plus per-client personalized models. It uses a proximal local objective that encodes cluster models for knowledge transfer with per-round cost comparable to standard FL. The method provides convergence guarantees and outperforms hard-clustering baselines on synthetic and real datasets.

f) Limitations of State-of-the-art Approaches: Progress in non-IID FL spans diagnosis (HD/JSD/EMD), stabilization (FedProx/FedOpt/FedAvgM), client selection (PoC/HACCS/FedCLS), and clustering (CFL/FedSoft). Yet gaps persist: metrics are largely descriptive and label-histogram-based (missing intra-class shifts); regularizers still bias toward a single global model and require tuning; selection via proxies can skew participation and seldom enforces within-round homogeneity; clustering is threshold-sensitive (hard) or complex and tuning-heavy (soft).

We build on and complement these advances by quantifying client-distribution dissimilarities with the PSI to form homogeneous client clusters for FL training, thereby improving performance relative to prior approaches. In our evaluation, we include all aforementioned methods as baselines and additionally report CL and FedAvg.

This section presents our Clust-PSI-PFL approach; a high-level workflow is shown in Fig. 2. Before training, every client transmits a compact label-frequency count (not raw data) to the server. The server aggregates these frequencies to form a reference label distribution and computes the PSI to quantify each client's divergence from that reference. Based on the PSI metric, clients are grouped into distributionally homogeneous clusters, and a dedicated model is trained per cluster.

In the label-skew setting, the client-wise PSI [24] for client i is defined in Eq. 1; this expression coincides with the statistical "Jeffreys divergence" [25]:

PSI_i^L = Σ_{c=1}^{C} ( P_i(y = c) − P(y = c) ) · ln( P_i(y = c) / P(y = c) )   (1)

where P(y = c) denotes the global probability mass function (pmf) of label y for class c, computed at the server from the aggregated label counts reported by clients; P_i(y = c) is the pmf of label y for class c on client i; and C is the total number of classes for y. The superscript L indicates that the computation pertains to the label-skew setting. As discussed earlier, we focus on label skew; attribute and quantity skew are left for future work.

In addition to the aggregate PSI_i^L, we decompose Eq. 1 into class-wise contributions:

PSI_{i,c}^L = ( P_i(y = c) − P(y = c) ) · ln( P_i(y = c) / P(y = c) ),  so that  PSI_i^L = Σ_{c=1}^{C} PSI_{i,c}^L   (2)

The terms PSI_{i,c}^L quantify, for each class c, the i-th client's deviation from the global label distribution. We later assemble the client-feature vector [PSI_i^L, PSI_{i,1}^L, . . . , PSI_{i,C}^L] for clustering clients.
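The per-client feature vector can be computed directly from label counts, as sketched below. The epsilon smoothing for empty classes is our assumption (the paper's handling of zero-count bins is not specified), and the natural logarithm follows the PSI definition.

```python
from math import log

def psi_features(client_counts, global_counts, eps=1e-6):
    """Per-client PSI feature vector [PSI_i, PSI_{i,1}, ..., PSI_{i,C}].

    client_counts / global_counts: per-class label counts.
    PSI_{i,c} = (P_i(c) - P(c)) * ln(P_i(c) / P(c)); PSI_i sums over c.
    eps smoothing for empty classes is our assumption, not from the paper.
    """
    n_i, n = sum(client_counts), sum(global_counts)
    per_class = []
    for cc, gc in zip(client_counts, global_counts):
        p_i = max(cc / n_i, eps)
        p = max(gc / n, eps)
        per_class.append((p_i - p) * log(p_i / p))
    return [sum(per_class)] + per_class

# A client whose label mix matches the global one scores ~0;
# a heavily skewed client scores much higher.
balanced = psi_features([50, 50], [500, 500])
skewed = psi_features([95, 5], [500, 500])
# balanced[0] = 0.0, while skewed[0] > 1
```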

Since PSI is computed on a per-client basis, the federation-wide degree of non-IID data in the label-skew setting is summarized by the weighted average, denoted WPSI^L, as defined in Eq. 3:

WPSI^L = Σ_{i=1}^{K} (n_i / N) · PSI_i^L   (3)

where K denotes the total number of participating clients, n_i is the number of samples on the i-th client, and N = Σ_{i=1}^{K} n_i is the aggregate sample count across all clients. Smaller values of PSI_i^L and WPSI^L reflect greater inter-client homogeneity, whereas larger values denote stronger non-IID data.
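Given per-client PSI values, the federation-wide score is a one-line weighted average; a minimal sketch with illustrative numbers:

```python
def weighted_psi(psi_values, sample_counts):
    """WPSI^L: sample-size-weighted average of per-client PSI values (Eq. 3)."""
    total = sum(sample_counts)
    return sum(p * n / total for p, n in zip(psi_values, sample_counts))

# Three clients: the large, near-IID client dominates the federation score.
wpsi = weighted_psi([0.1, 1.2, 0.9], [800, 100, 100])
# wpsi = 0.8 * 0.1 + 0.1 * 1.2 + 0.1 * 0.9 = 0.29
```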

Upon computing the client PSI at the server, we partition clients into clusters according to their PSI signatures. Since PSI quantifies distributional divergence, grouping clients with similar PSI profiles forms more homogeneous cohorts. Aggregating updates within such cohorts reduces cross-distribution mixing, mitigates the weight-update mismatch induced by non-IID data, and improves convergence stability and speed.

Clients are clustered with K-means++, selected for its robust seeding and broad adoption [26]. Each client is encoded as a feature vector that concatenates the overall PSI and the per-class PSI values (Eqs. 1 and 2). These features are standardized to have a mean of zero and a variance of one. This step is lightweight, adding minimal server-side overhead, since its cost scales mainly with the (typically small) number of classes C. The number of clusters is then selected as follows, where X denotes the standardized client-feature matrix:

1: best_score ← −∞
2: for j = 2 to K − 1 do
3:   Run K-means with K-means++ initialization on X to obtain candidate clusters z(j)
4:   Compute the average silhouette score s(j) on X using the candidate clusters z(j)
5:   if s(j) > best_score then
6:     best_score ← s(j); τ ← j
7:   end if
8: end for
9: return τ, z(τ)
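This selection loop can be sketched in self-contained Python. The sketch below hand-rolls K-means++ and the silhouette score so it runs without dependencies; a real implementation would use a library such as scikit-learn, and the restart count, Lloyd iteration cap, and singleton convention are our assumptions.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(X, k, rng, iters=50):
    """Lloyd's K-means with K-means++ seeding; returns a label per point."""
    centers = [list(rng.choice(X))]
    while len(centers) < k:
        # K-means++: sample the next seed proportionally to squared distance.
        d2 = [min(dist(x, c) ** 2 for c in centers) for x in X]
        centers.append(list(rng.choices(X, weights=d2)[0]))
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(x, centers[j])) for x in X]
        for j in range(k):
            members = [x for x, l in zip(X, labels) if l == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(X, labels):
    """Mean silhouette score; singleton clusters contribute 0 by convention."""
    if len(set(labels)) < 2:
        return -1.0
    scores = []
    for i, x in enumerate(X):
        own = [dist(x, y) for y, l in zip(X, labels) if l == labels[i]]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(own) / (len(own) - 1)  # mean distance within own cluster
        b = min(sum(dist(x, y) for y, l in zip(X, labels) if l == m) /
                sum(1 for l in labels if l == m)
                for m in set(labels) if m != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def select_tau(X, rng, restarts=3):
    """Return the cluster count (and labels) with the best mean silhouette."""
    best_s, best_tau, best_labels = -2.0, None, None
    for j in range(2, len(X)):          # candidate counts j = 2 .. K-1
        for _ in range(restarts):       # a few restarts for robustness
            labels = kmeans(X, j, rng)
            s = silhouette(X, labels)
            if s > best_s:
                best_s, best_tau, best_labels = s, j, labels
    return best_tau, best_labels

# Six clients whose PSI feature vectors form two well-separated groups;
# the silhouette peaks at tau = 2, recovering the groups.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
tau, labels = select_tau(X, random.Random(0))
```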

The cluster count τ is a key hyperparameter in Clust-PSI-PFL. To avoid ad hoc choices, we select τ with the silhouette-based procedure above, which aligns the clustering with the statistical characteristics of the client population. Given the resulting assignments z(τ), we partition clients into τ clusters and train a FedAvg-type FL model for each cluster; the efficacy of this design is validated empirically in Section IV.

We provide a concise analysis of the computational complexity of the Clust-PSI-PFL clustering phase. The cost arises mainly from (i) computing PSI for each client and (ii) selecting the number of clusters via K-means++ with the silhouette score. Computing PSI over K clients and C classes requires O(KC) operations. Let I be the average number of K-means++ iterations, and let the search examine K − 2 candidate cluster counts j ∈ {2, . . . , K − 1}. A single K-means++ run with j clusters costs O(I K j (C + 1)), while computing the silhouette costs O(K² (C + 1)). Summed over the K − 2 candidates, the K-means runs cost O(I K³ (C + 1)) and the silhouette evaluations O(K³ (C + 1)), so the model-selection stage costs O((I + 1) K³ (C + 1)).

Therefore, the total clustering cost is O(KC + (I + 1) K³ (C + 1)).

In practice, C and the candidate range (K -2) are small (and silhouettes can be computed on a subsample), so the overhead is modest; subsequent training proceeds as in FedAvg, except that aggregation is performed independently for each of the τ clusters.

The weighted PSI (WPSI^L) is a well-suited metric to quantify non-IID data in FL. We benchmark it against HD, JSD, and EMD [13], standard divergences for comparing distributions (see Section II). Unlike these baselines, WPSI^L delivers fine-grained, client-level diagnostics with minimal overhead and, thanks to PSI's class-wise decomposition, enables per-class attribution to pinpoint the labels driving divergence, capabilities HD/JSD/EMD generally lack without extra modifications. Thus, WPSI^L is an interpretable and efficient primary metric for non-IID assessment.

To evaluate the practical utility of WPSI^L, we conducted a broad empirical study spanning multiple datasets, levels of non-IID data, partitioning schemes, and random seeds (see Section IV). Centralized datasets were partitioned using the Dirichlet [13] and Similarity [27] partition protocols, after which we computed WPSI^L, HD, JSD, and EMD. Fig. 3 depicts how these metrics vary with the degree of non-IID data, parameterized by α and S. Our findings are as follows: (i) WPSI^L exhibits an exponential-decay trend as the data becomes more IID under both protocols; (ii) JSD and HD vary nearly linearly with non-IID data across both schemes; and (iii) EMD is less reliable as a non-IID quantifier due to non-monotonic behavior, yielding identical values at distinct non-IID levels.

Using the computed metrics as features, we trained LightGBM, Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Regression Tree models to predict the Dirichlet parameter α and the Similarity parameter S. Feature-importance scores from these models (Fig. 4) consistently ranked WPSI^L highest, indicating it is the most informative measure of non-IID structure.

Observation 1: WPSI^L consistently appears as the strongest predictive metric for the degree of non-IID data.

In this section, we describe the simulation environment, datasets, and evaluation protocol to enable reproducibility, and subsequently present the main results of our analyses.

We investigate the following three empirical questions (EQs), which probe core properties of Clust-PSI-PFL and its comparison to state-of-the-art baselines.

[Table I, partially recovered from the extraction: FMNIST [32], image, detect clothing type, 10 classes, 70,000 samples; CIFAR10 [33], image, classify objects, 10 classes, 60,000 samples; Sent140 [34], text, sentiment analysis, 3 classes, 1,600,000 samples; Amazon reviews [35], text, product classification, 6 classes, 50,000 samples.]

• EQ1: How does the silhouette-based procedure for selecting the optimal number of clusters behave across different client configurations?

The datasets used are listed in Table I. These datasets were selected for their widespread use in prior studies on non-IID phenomena in FL.

We employ two widely used partitioning protocols to simulate varying client data distributions. Dirichlet [13] controls non-IID data with a single parameter α: a smaller α generates more non-IID client partitions. In addition, we use the Similarity partition protocol [27], which is governed by a single parameter S ∈ [0, 1]: first, allocate S × 100% of the dataset uniformly at random to clients (IID component); then sort the remaining (1 − S) × 100% by label and distribute it evenly across clients (label-skewed component). Larger S produces more IID partitions, while smaller S induces pathological non-IID data.
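The Similarity protocol just described can be sketched as follows. This is a minimal sketch under the description above: round-robin dealing of the IID slice and contiguous shards for the sorted remainder are our implementation choices, as is the handling of non-divisible sizes.

```python
import random

def similarity_partition(labels, num_clients, s, rng):
    """Similarity partition protocol (sketch).

    An S*100% slice of the indices is spread uniformly at random across
    clients (IID component); the remainder is sorted by label and dealt
    out as contiguous equal shards (label-skewed component).
    """
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    cut = int(s * len(idx))
    iid_part, skew_part = idx[:cut], idx[cut:]
    skew_part.sort(key=lambda i: labels[i])
    clients = [[] for _ in range(num_clients)]
    for pos, i in enumerate(iid_part):          # IID slice: round-robin
        clients[pos % num_clients].append(i)
    shard = -(-len(skew_part) // num_clients)   # ceiling division
    for c in range(num_clients):                # skewed slice: contiguous shards
        clients[c].extend(skew_part[c * shard:(c + 1) * shard])
    return clients

# 1,000 samples, 2 balanced classes, 4 clients.
labels = [i % 2 for i in range(1000)]
# S = 0: pathological skew -- every client ends up with a single label.
skewed = similarity_partition(labels, 4, 0.0, random.Random(0))
# S = 1: fully IID -- every client holds a random mix of both labels.
iid = similarity_partition(labels, 4, 1.0, random.Random(0))
```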

c) Selection of α and S Values: The α values were chosen to cover the entire spectrum of non-IID settings, measured by WPSI^L ranging from 0 (IID) to large values (extreme non-IID). We tested eleven α values within this range but, for brevity, present results for α ∈ {50, 0.7, 0.3, 0.2, 0.09, 0.05}, as they are representative and consistent with the overall findings. Note that the effect of the Dirichlet α parameter is dataset-dependent: the attainable degree of non-IID data varies with the number of classes. In datasets with few classes, very small values (e.g., α < 0.3) are infeasible. The S values for the Similarity partition protocol were selected to span the full non-IID data range, from S = 0 (maximally non-IID) to S = 1 (IID). We evaluated eleven S settings across this interval but, for brevity, report results for S ∈ {1, 0.03, 0}, which are representative of the overall trends.

d) Models: For ACSIncome, we use a single linear layer (equivalent to logistic regression). For Serengeti, we employ a multilayer perceptron with three fully connected hidden layers (500 units each, ReLU activations) and a softmax output layer. For FMNIST, we adopt a convolutional neural network (CNN) with three convolutional layers (8, 16, and 32 channels), followed by max pooling and a fully connected layer with 2048 ReLU units. For CIFAR-10, we use a deeper CNN with three convolutional blocks; each block stacks two 3 × 3 convolution layers (ReLU, same padding) with batch normalization, followed by 2 × 2 max pooling and dropout (rates 0.2/0.3/0.4 across the three blocks). The blocks use 32, 64, and 128 filters, respectively. The resulting features are flattened and passed to a 128-unit ReLU dense layer with batch normalization and dropout of 0.5, followed by a softmax layer over C classes (input shape H × W × 3).

For Sent140, we utilized a sequential NN with an embedding layer, a dropout layer (rate 0.5), an LSTM layer (10 units, dropout 0.2, recurrent dropout 0.2), and a dense output layer with softmax activation. For Amazon reviews, we used a lightweight deep neural network (DNN): a trainable embedding layer (vocabulary size V, embedding dimension d) followed by a Flatten layer, a small dense hidden layer with 8 ReLU units, and a softmax output over C classes. We run each experiment with five different data partitions (i.e., five random seeds) to improve robustness. With standard deviations mostly below 0.02 (see Tables III and IV), additional seeds add little value; five were enough to obtain stable estimates and preserve method rankings while avoiding unnecessary computation.

The candidate hyperparameter grids and the corresponding optimal settings for each baseline are summarized in Table II. The search ranges were chosen in accordance with recommendations from the original baseline papers. Final values were selected via an exhaustive grid search over all parameter combinations. We train for T = 40 communication rounds; at each round, a fraction q = 0.5 of the K clients is sampled uniformly without replacement, and each selected client runs E = 5 local epochs. We use official test sets when available; otherwise, we adopt a train/test split of 80%/20%.

e) Metrics: This section describes the metrics considered in our experiments.

Accuracy. The fraction of correctly classified samples relative to the total number of samples; larger values indicate better model performance.

Because data in FL are partitioned across clients, accuracy can be evaluated at two levels: global (after aggregating across clients) and local (per client).

Local accuracy. Formally, local accuracy is computed individually for each client k as A_k = CC_k / n_k, where CC_k denotes the number of correctly classified samples on client k.

Global accuracy. The global accuracy is calculated as the weighted average of local client accuracies, where the weight corresponds to the number of samples each client holds:

A = Σ_{k=1}^{K} (n_k / N) · A_k,  where N = Σ_{k=1}^{K} n_k,

with K the total number of participating clients and n k the number of data samples on client k.

Client fairness. We measure the distance of each solution from 1.0 (perfect accuracy). For this purpose, we calculate the average distance AD = (1/K) Σ_{i=1}^{K} |A_i − 1.0| and the corresponding standard deviation SDAD = ( (1/K) Σ_{i=1}^{K} ( |A_i − 1.0| − AD )² )^{1/2}. Smaller AD values represent a smaller distance to the perfect model (desired), and smaller SDAD showcases a smaller variation (desired) in such distances.
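The metrics above reduce to a few lines of arithmetic; a minimal sketch (the divide-by-K population standard deviation for SDAD is our assumption):

```python
def fairness_metrics(local_accuracies, sample_counts):
    """Global accuracy, AD, and SDAD from per-client local accuracies.

    Global accuracy is the sample-weighted mean of A_k; AD is the
    (unweighted) mean distance of A_k from perfect accuracy (1.0), and
    SDAD is the standard deviation of those distances.
    """
    n = sum(sample_counts)
    global_acc = sum(a * c / n for a, c in zip(local_accuracies, sample_counts))
    k = len(local_accuracies)
    dists = [abs(a - 1.0) for a in local_accuracies]
    ad = sum(dists) / k
    sdad = (sum((d - ad) ** 2 for d in dists) / k) ** 0.5
    return global_acc, ad, sdad

acc, ad, sdad = fairness_metrics([0.9, 0.7, 0.8], [100, 100, 200])
# acc = (0.9 + 0.7 + 2 * 0.8) / 4 = 0.80; AD = mean(0.1, 0.3, 0.2) = 0.20
```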

This subsection evaluates the global and local performance of Clust-PSI-PFL in comparison to the baseline methods.

a) Number of Clusters Behavior: In this subsection, we answer EQ1 by examining the behavior of the number of clusters (τ ) on PSI-based clusters determined via the systematic approach described in Section III.

Fig. 5 illustrates the silhouette score against the number of clusters for the ACSIncome dataset while varying the number of clients participating in FL. Due to space constraints, we present only the ACSIncome dataset, since its behavior closely mirrors that of the remaining datasets and non-IID data configurations.

Under Dirichlet (α = 0.3), the silhouette peaks at small τ: τ = 3 for K = 10, τ = 4 for K = 50, and τ = 3 for K = 100, indicating a compact 3-4 cluster structure with weaker separation as K grows. Under Similarity (S = 0), the curve is high and nearly flat; applying our rule (choose the smallest τ at the maximum) yields τ = 3 for all K. Overall, the procedure consistently selects a small number of clusters with modest overhead.

b) Global-level Test Clust-PSI-PFL Behavior: At the global level, the server aggregates client updates (within each cluster) to produce a cluster-level model intended to generalize to the full population. This system-wide view evaluates accuracy, convergence speed, and robustness to non-IID data skew; the resulting global metrics summarize the end-to-end effectiveness of the FL pipeline across all participating clients.

To address EQ2, we analyze Fig. 7, which reports the mean global test accuracy of Clust-PSI-PFL (purple) and all baselines as a function of the number of clients, the Dirichlet parameter α, and the Similarity parameter S, together with the corresponding WPSI^L. Across settings, Clust-PSI-PFL consistently surpasses competing baselines, including in highly non-IID data scenarios, achieving up to 18% relative gains over strong contenders such as CFL and FedSoft. Moreover, its variability is low (standard deviation ∼ 0.01), indicating stable training. Performance generally improves as the client population grows (e.g., K = 50 or K = 100), where inter-client distributional differences become more pronounced and clustering is more effective.

To summarize results across datasets, Table III reports mean test accuracy (± standard deviation over five seeds) for Clust-PSI-PFL and all baselines under both partition protocols (Dirichlet and Similarity) and across modalities. This table enables a direct, cross-dataset comparison of behavior under varying levels of non-IID data. For brevity, we show results for K = 100 clients; qualitatively similar trends hold for K ∈ {10, 50}.

Across all six datasets and both partition protocols, Clust-PSI-PFL (ours) attains the best accuracy in the vast majority of settings. Its advantage is most pronounced in the more pathological non-IID regimes (Dirichlet with small α and Similarity with small S), where it delivers large gains over strong baselines (FedAvg, FedAvgM, HACCS, CFL). In near-IID conditions (α = 50, S = 1), it matches the top performers, indicating no loss from clustering when non-IID data is low. Reported standard deviations are small, evidencing stable training. Overall, this analysis confirms that PSI-driven clustering yields consistently superior global accuracy across modalities and non-IID data levels.

Observation 3: Clust-PSI-PFL consistently attains higher global accuracy than competing baselines across the full spectrum of non-IID data scenarios.

c) Local-level Test Clust-PSI-PFL Behavior: Local-level analysis examines the performance of each client's model before and after federated aggregation. This perspective surfaces variability arising from non-IID data distributions, computational limitations, and personalization requirements. By evaluating outcomes at the client granularity, we can quantify the impact of FL on individual nodes, assess fairness, and detect disparities in model quality across clients.

EQ3 is addressed using the evidence in Fig. 8, which plots the empirical cumulative distribution function (ECDF) of local accuracy for Clust-PSI-PFL and all baselines across varying α and S. The ECDF of our method lies consistently closer to the perfect-accuracy target, indicating improved client fairness. At α = 0.3, Clust-PSI-PFL attains AD = 0.17, versus AD ∼ 0.27 for CFL (a 37% relative reduction) and AD ∼ 0.43 for FedAvgM. Under the pathological non-IID scenario (S = 0), our approach achieves AD = 0.01 from the unit-accuracy target with SDAD = 0.02, whereas CFL yields AD = 0.27 and SDAD = 0.28. These results show that Clust-PSI-PFL concentrates client performance nearer to the ideal with markedly lower dispersion.

To summarize local-accuracy results across datasets, Table IV reports the average distance from the perfect model (AD) with its dispersion (SDAD) as AD ± SDAD, for Clust-PSI-PFL and a focused set of strong baselines. Specifically, we include one representative per PFL family: regularization (FedAvgM), selection (HACCS), and clustering (CFL), together with the FedAvg reference, selected because these methods achieved the highest global accuracies in Table III. The columns mirror the non-IID settings for both Dirichlet and Similarity protocols, enabling cross-dataset and cross-modality comparisons as non-IID data varies (lower AD/SDAD is better). For brevity, results are shown for K = 100 clients; qualitatively similar trends were observed for K ∈ {10, 50}.

Table IV shows that Clust-PSI-PFL attains the smallest AD, and typically the smallest SDAD, across nearly all datasets, with gains widening as non-IID intensifies. Under severe Similarity skew (S ∈ {0, 0.03}) it reaches near-zero AD while baselines remain far (e.g., ACSIncome at S = 0: 0.01±0.02 vs. 0.27-0.35; FMNIST: 0.01±0.01 vs. 0.51-0.89). For Dirichlet at moderate non-IID (α ∈ {0.7, 0.3}), it also leads (e.g., ACSIncome α = 0.7: 0.14±0.10 vs. 0.28-0.36); at α ∼ 0.2 CFL can be slightly ahead on some vision sets, but our method dominates as non-IID strengthens and in Similarity. Near-IID (α = 50, S = 1), all methods are comparable, indicating no clustering penalty. Overall, consistently smaller AD/SDAD for Clust-PSI-PFL reflect stronger client fairness and stability.

Observation 4: Clust-PSI-PFL promotes client fairness, attaining a relative improvement of 37% over the baselines under highly non-IID conditions.

V. LIMITATIONS OF CLUST-PSI-PFL

While Clust-PSI-PFL shows consistent gains across datasets and non-IID scenarios, our current implementation does not incorporate privacy-preserving mechanisms such as DP or MPC. Integrating MPC-based secure aggregation does not change model performance (it adds only runtime/communication overhead), whereas DP may affect performance, with the impact governed by the budget ϵ. Second, the present version of Clust-PSI-PFL focuses on label skew: PSI features are derived from class frequencies, and clients are clustered by similarity in label proportions. Other non-IID data types (e.g., attribute skew, quantity imbalance) are not explicitly modeled; thus, when labels are balanced but attribute distributions or data volumes differ, clusters may be suboptimal. We view this as a scoped design choice rather than a fundamental limitation: the PSI machinery can be extended to discretized attributes, marginals, or joint label-attribute bins, and can incorporate sample counts.

In this study, we present Clust-PSI-PFL, a PFL framework that leverages the PSI as a principled metric for client clustering under non-IID data conditions. We demonstrate that PSI reliably quantifies distributional non-IID data across tabular, image, and text modalities, enabling the formation of distributionally coherent client groups and, in turn, yielding substantial gains in both global and local performance, even in highly non-IID regimes. Relative to state-of-the-art baselines, Clust-PSI-PFL achieves up to 18% higher global accuracy while enhancing client fairness, resulting in a relative improvement of 37%. These results position Clust-PSI-PFL as a practical mechanism for robust decentralized learning in non-IID data federated environments.

In future work, we will examine the applicability of PSI beyond label skew to other forms of data skew, such as attribute skew and quantity skew. Moreover, integrating formal privacy mechanisms, such as DP [36] and MPC [8], to strengthen the confidentiality of our FL pipeline lies outside the scope of this study and remains a promising direction for follow-up research.


This content is AI-processed based on open access ArXiv data.
