Dataset Distillation-based Hybrid Federated Learning on Non-IID Data
In federated learning, the heterogeneity of client data strongly affects the performance of model training. Many heterogeneity issues in this process arise from non-independent and identically distributed (non-IID) data. To address the issue of label distribution skew, we propose a hybrid federated learning framework called HFLDD, which integrates dataset distillation to generate approximately independent and identically distributed (IID) data, thereby improving the performance of model training. In particular, we partition the clients into heterogeneous clusters, where the data labels among different clients within a cluster are unbalanced while the data labels among different clusters are balanced. The cluster heads collect distilled data from the corresponding cluster members and conduct model training in collaboration with the server. This training process resembles traditional federated learning on IID data and hence effectively alleviates the impact of non-IID data on model training. We perform a comprehensive analysis of the convergence behavior, communication overhead, and computational complexity of the proposed HFLDD. Extensive experimental results on multiple public datasets demonstrate that, when data labels are severely imbalanced, the proposed HFLDD outperforms the baseline methods in terms of both test accuracy and communication cost.
💡 Research Summary
The paper tackles the long‑standing challenge of non‑independent and identically distributed (non‑IID) data in federated learning (FL), focusing specifically on severe label‑distribution skew. In such settings, local models drift toward disparate optima, and simple weighted averaging (FedAvg) leads to slow convergence and poor global accuracy. Existing remedies—loss‑function redesign, adaptive aggregation, client clustering, or limited data sharing—either compromise privacy, incur high communication overhead, or fail to fully mitigate the statistical heterogeneity.
To address these issues, the authors propose Hybrid Federated Learning with Dataset Distillation (HFLDD), a framework that unifies two complementary ideas: (1) heterogeneous client clustering and (2) dataset distillation.
- Heterogeneous clustering partitions the set of N clients into K clusters such that within each cluster the label distribution is highly unbalanced, while across clusters the label distribution is approximately balanced. The server builds a similarity matrix from soft‑label statistics reported by clients and assigns clusters to minimize inter‑cluster label skew. This contrasts with prior clustering approaches that only group clients with similar data distributions.
- Dataset distillation is applied locally by every non‑head client. Using a first‑order meta‑learning method (Kernel Inducing Points, KIP), each client compresses its large local dataset T into a tiny synthetic set S (|S| ≪ |T|) that preserves the original loss landscape. The distilled data contain synthetic inputs x̂ and soft labels ŷ, and are transmitted to the designated cluster head. Because the synthetic set is orders of magnitude smaller than the raw data, communication cost is dramatically reduced, and privacy is enhanced—distilled samples do not reveal raw images and are more resistant to membership inference attacks.
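The KIP idea above can be illustrated with a simplified sketch. The paper's formulation relies on a neural-tangent-kernel, but the core objective is easier to see with an ordinary RBF kernel: the distilled set S is scored by how well kernel ridge regression fitted on S predicts the labels of the original set T, and distillation minimizes this loss over (X_s, y_s). All names, sizes, and the λ value below are illustrative, not from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def kip_loss(X_s, y_s, X_t, y_t, lam=1e-3):
    """KIP-style objective: mean squared error of kernel ridge
    regression trained on the distilled set (X_s, y_s) when
    predicting the original local set (X_t, y_t)."""
    K_ss = rbf_kernel(X_s, X_s)
    K_ts = rbf_kernel(X_t, X_s)
    alpha = np.linalg.solve(K_ss + lam * np.eye(len(X_s)), y_s)
    preds = K_ts @ alpha
    return float(((preds - y_t) ** 2).mean())

rng = np.random.default_rng(0)
X_t = rng.normal(size=(100, 5))               # "real" local data T
y_t = np.sign(X_t[:, 0:1])                    # toy labels
X_s, y_s = X_t[:10].copy(), y_t[:10].copy()   # tiny synthetic set S, |S| << |T|
print(kip_loss(X_s, y_s, X_t, y_t))
```

In a full implementation, the gradient of this loss with respect to (X_s, y_s) would drive the synthetic samples; the sketch only shows the forward objective.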
The cluster head aggregates the distilled data from its members, forming a cluster‑level dataset that is effectively IID with respect to the global label space. The head then participates in the usual FL round with the central server: model parameters are exchanged, and the server aggregates heads’ updates via standard FedAvg. Consequently, the server‑head interaction behaves like traditional FL on IID data, preserving the well‑studied convergence properties.
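The server-head aggregation described above is standard FedAvg applied to the K cluster heads. A minimal sketch, assuming each head's model is a dict of parameter arrays and that weights are proportional to the amount of (distilled) data each head holds; the names are hypothetical:

```python
import numpy as np

def fedavg(head_params, head_sizes):
    """Standard FedAvg over K cluster heads: a weighted average of
    the heads' model parameters, weighted by head dataset size."""
    total = sum(head_sizes)
    weights = [n / total for n in head_sizes]
    keys = head_params[0].keys()
    return {k: sum(w * p[k] for w, p in zip(weights, head_params))
            for k in keys}

# Toy example: two heads with equal data sizes and a single layer "w".
heads = [{"w": np.array([0.0, 2.0])}, {"w": np.array([2.0, 4.0])}]
global_model = fedavg(heads, head_sizes=[50, 50])
print(global_model["w"])  # equal weights -> elementwise mean
```

Because the heads' cluster-level datasets are approximately IID, this averaging step behaves like FedAvg in the homogeneous-data regime.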
Theoretical contributions include:
- A convergence bound showing that, under the assumption that the distilled dataset approximates the original loss within ε, the global objective f(ω)=∑_{i} (n_i/n) f_i(ω) converges at O(1/√T), identical to FedAvg, even in the presence of severe label skew.
- A closed‑form expression for communication cost per round: C_round = K·|ω| + K·|S|, where |ω| is the model size and |S| the distilled set size. Since |S| ≪ |T| and only the K heads communicate with the server, total communication over T rounds scales as O(T·K·(|ω| + |S|)), a substantial reduction compared with having all N clients transmit full local models or raw data.
- An analysis of computational complexity: client‑side distillation costs O(|S|·d) (d = input dimension), while heads and the server retain the usual O(|ω|·d) cost of local training and aggregation. Thus, the overall system does not impose prohibitive extra computation.
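A back-of-envelope comparison makes the communication claim concrete, using the per-round expression C_round = K·|ω| + K·|S| from the analysis against plain FedAvg, where every client uploads a full model each round. The sizes below are illustrative, not taken from the paper:

```python
# Illustrative per-round upload cost, in transmitted parameter values.
N = 50                    # total clients
K = 5                     # cluster heads (only heads talk to the server)
model_size = 1_000_000    # |w|: parameters exchanged per model
distilled_size = 10_000   # |S|: distilled set size, |S| << |T|

hfldd_round = K * model_size + K * distilled_size   # C_round for HFLDD
fedavg_round = N * model_size                       # all N clients upload

print(hfldd_round, fedavg_round)  # 5_050_000 vs 50_000_000
```

With these (hypothetical) sizes, HFLDD moves roughly an order of magnitude fewer values per round, driven almost entirely by K ≪ N rather than by the distilled-data term.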
Empirical evaluation spans four public benchmarks (CIFAR‑10, FMNIST, SVHN, TinyImageNet) with artificially induced label imbalance ratios of 10:1, 20:1, and 50:1. HFLDD is compared against a broad set of baselines: FedAvg, FedProx, SCAFFOLD, FedDistill variants (FedDM, FedVCK), and clustering‑based HFL methods (FedSeq, Semi‑FL). Results show:
- Accuracy gains of 3–7 % over the best baseline when label skew is extreme, confirming that the distilled, balanced data effectively counteracts drift.
- Communication savings of 30 %–50 % in terms of transmitted bytes and number of rounds needed to reach a target accuracy.
- Robustness to cluster size: moderate numbers of clusters (K = 4–8) provide the best trade‑off between data diversity in distilled sets and head workload.
- Privacy resilience: membership inference attacks succeed significantly less often on distilled data than on raw data, aligning with recent findings that synthetic datasets can act as a privacy shield.
The authors acknowledge limitations: the distillation step adds computational load for low‑power edge devices, and the current head selection is static rather than load‑aware. Future work is suggested on lightweight distillation algorithms, dynamic head scheduling, and formal differential‑privacy analysis of the distilled data.
In summary, HFLDD delivers a practical, privacy‑preserving solution for federated learning under severe non‑IID conditions by (i) reorganizing clients into heterogeneously labeled clusters, (ii) compressing local data into high‑fidelity synthetic sets, and (iii) leveraging the resulting balanced data for efficient server‑head collaboration. The framework retains the theoretical guarantees of classic FL while achieving notable improvements in accuracy and communication efficiency, making it a compelling candidate for real‑world edge AI deployments.