Federated Vision Transformer with Adaptive Focal Loss for Medical Image Classification
While deep learning models like the Vision Transformer (ViT) have achieved significant advances, they typically require large datasets. Data privacy regulations restrict access to many original datasets, especially medical images. Federated learning (FL) addresses this challenge by enabling global model aggregation without data exchange. However, data heterogeneity and class imbalance across local clients hinder model generalization. This study proposes an FL framework leveraging a dynamic adaptive focal loss (DAFL) and a client-aware aggregation strategy for local training. Specifically, we design a dynamic class-imbalance coefficient that adjusts to each client's sample distribution and per-class data distribution, ensuring that minority classes receive sufficient attention and that sparse data are not ignored. To address client heterogeneity, a weighted aggregation strategy is adopted that adapts to data size and characteristics to better capture inter-client variations. Classification results on three public datasets (ISIC, Ocular Disease, and RSNA-ICH) show that the proposed framework outperforms DenseNet121, ResNet50, ViT-S/16, ViT-L/32, FedCLIP, Swin Transformer, CoAtNet, and MixNet in most cases, with accuracy improvements ranging from 0.98% to 41.69%. Ablation studies on the imbalanced ISIC dataset validate the effectiveness of the proposed loss function and aggregation strategy compared to traditional loss functions and other FL approaches. The code is available at: https://github.com/AIPMLab/ViT-FLDAF.
💡 Research Summary
This paper tackles two pervasive challenges in medical imaging AI—data privacy and severe class imbalance—by proposing a federated learning (FL) framework that couples a Vision Transformer (ViT) backbone with a novel Dynamic Adaptive Focal Loss (DAFL) and a distribution‑aware aggregation scheme. The authors argue that traditional FL approaches either treat client‑level heterogeneity and class‑level imbalance separately or rely on static weighting schemes that cannot adapt to the evolving data distribution across communication rounds. To overcome these limitations, they first introduce a dynamic class‑imbalance coefficient β_{k,c}^{(t)} for each client k and class c at round t. This coefficient is computed from the current local class frequencies and the model’s prediction confidence, thereby giving higher weight to minority classes and to samples that the model finds hard to classify. The loss function becomes L_k^{(t)} = −∑_c β_{k,c}^{(t)} (1 − p_{k,c}^{pred})^γ log(p_{k,c}^{pred}), effectively merging focal‑loss hardness modulation with real‑time distribution statistics.
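The DAFL idea above can be sketched in a few lines of numpy. Note that the summary does not give the exact formula for β_{k,c}^{(t)}, so the inverse-frequency-times-(1 − confidence) form used here is an illustrative assumption, not the paper's definition; only the focal-loss shape L = −∑ β (1 − p)^γ log p follows the stated equation.

```python
import numpy as np

def dynamic_coefficients(class_counts, confidences, eps=1e-8):
    """Illustrative beta_{k,c}^{(t)}: rarer classes and low-confidence
    classes get larger coefficients. The paper's exact formula is not
    given in the summary; this combination is an assumption."""
    freqs = class_counts / (class_counts.sum() + eps)
    beta = (1.0 / (freqs + eps)) * (1.0 - confidences)
    return beta / beta.sum()  # normalize so the coefficients sum to 1

def dafl_loss(probs, labels, beta, gamma=2.0, eps=1e-8):
    """Per-batch mean of -beta_c * (1 - p_c)^gamma * log(p_c),
    evaluated at each sample's true class, per the summary's equation."""
    p = probs[np.arange(len(labels)), labels]  # true-class probabilities
    b = beta[labels]                           # per-sample coefficient
    return float(np.mean(-b * (1.0 - p) ** gamma * np.log(p + eps)))
```

In a real client update, `class_counts` would come from the local dataset at round t and `confidences` from the current model's predictions, so the coefficients drift as training progresses.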
Second, they modify the global model aggregation from the standard FedAvg to a weighted scheme ω_k^{(t)} = (N_k · β̄_k^{(t)}) / ∑_j (N_j · β̄_j^{(t)}), where N_k is the local data size and β̄_k^{(t)} is the average imbalance coefficient for client k. This design reduces the dominance of large but internally skewed clients and amplifies the contribution of smaller, more balanced sites, directly addressing the “client‑size vs. client‑bias” dilemma that often degrades FL performance in medical contexts.
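A minimal sketch of this aggregation rule, directly following the ω_k formula in the summary (client parameters are represented here as plain float lists for illustration; in practice they would be model state dicts):

```python
def aggregation_weights(sizes, beta_bars):
    """omega_k = N_k * beta_bar_k / sum_j(N_j * beta_bar_j),
    as given in the summary."""
    raw = [n * b for n, b in zip(sizes, beta_bars)]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_aggregate(client_params, weights):
    """Server-side weighted average of per-client parameter vectors."""
    return [sum(w * p[i] for w, p in zip(weights, client_params))
            for i in range(len(client_params[0]))]
```

With two clients of sizes 100 and 50 but mean coefficients 0.2 and 0.8, the smaller, more balanced client ends up with the larger weight, which is exactly the "client-size vs. client-bias" trade-off the scheme targets.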
The ViT backbone processes each image as a sequence of 16×16 patches, adds a learnable class token and positional embeddings to the patch embeddings, and passes the resulting sequence through a stack of transformer encoder layers; the class-token output is then fed to the classification head.
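The patch-tokenization front end described above can be sketched as follows (dimensions are illustrative of a ViT-S/16-style setup with 224×224 RGB inputs; the class token and positional embeddings are omitted for brevity):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles
    and flatten each tile into one token vector, as in the ViT front end."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    tokens = (image.reshape(H // patch, patch, W // patch, patch, C)
                   .transpose(0, 2, 1, 3, 4)   # group by (row, col) of patches
                   .reshape(-1, patch * patch * C))
    return tokens  # shape: (num_patches, patch*patch*C)
```

For a 224×224×3 image this yields 196 tokens of dimension 768, each of which is then linearly projected to the model width before the transformer layers.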