Impact of Labeling Inaccuracy and Image Noise on Tooth Segmentation in Panoramic Radiographs using Federated, Centralized and Local Learning
Objectives: Federated learning (FL) may mitigate privacy constraints, heterogeneous data quality, and inconsistent labeling in dental diagnostic AI. We compared FL with centralized learning (CL) and local learning (LL) for tooth segmentation in panoramic radiographs across multiple data corruption scenarios. Methods: An Attention U-Net was trained on 2066 radiographs from six institutions under four settings: baseline (unaltered data); label manipulation (dilated or missing annotations); image-quality manipulation (additive Gaussian noise); and exclusion of a faulty client with corrupted data. FL was implemented with the Flower AI framework. Per-client training- and validation-loss trajectories were monitored for anomaly detection, and segmentation quality was evaluated on a hold-out test set using Dice, IoU, HD, HD95, and ASSD; pairwise differences were assessed with the Wilcoxon signed-rank test. CL and LL served as comparators. Results: Baseline: FL achieved a median Dice of 0.94889 (ASSD: 1.33229), slightly better than CL at 0.94706 (ASSD: 1.37074) and LL at 0.93557–0.94026 (ASSD: 1.51910–1.69777). Label manipulation: FL maintained the best median Dice at 0.94884 (ASSD: 1.46487) versus CL’s 0.94183 (ASSD: 1.75738) and LL’s 0.93003–0.94026 (ASSD: 1.51910–2.11462). Image noise: FL led with a Dice of 0.94853 (ASSD: 1.31088); CL scored 0.94787 (ASSD: 1.36131); LL ranged from 0.93179 to 0.94026 (ASSD: 1.51910–1.77350). Faulty-client exclusion: FL reached a Dice of 0.94790 (ASSD: 1.33113), better than CL’s 0.94550 (ASSD: 1.39318). Loss-curve monitoring reliably flagged the corrupted site. Conclusions: FL matches or exceeds CL and outperforms LL across corruption scenarios while preserving privacy. Per-client loss trajectories provide an effective anomaly-detection mechanism, supporting FL as a practical, privacy-preserving approach to scalable clinical AI deployment.
💡 Research Summary
This paper investigates the viability of federated learning (FL) for tooth segmentation in panoramic dental radiographs, comparing it against centralized learning (CL) and local learning (LL) under a variety of data corruption scenarios. The authors assembled a multi‑institutional dataset of 2,066 panoramic images collected from three hospitals and three dental clinics across China, Finland, and Denmark. Images were captured with different manufacturers’ equipment, and annotations were performed by four dentists using the FDI notation, yielding 33 tooth‑specific labels.
All experiments employed an Attention U‑Net architecture, which augments the classic U‑Net with attention gates at each skip connection to focus on anatomically relevant regions while suppressing background noise. Training used the AdamW optimizer (learning rate = 1e‑4) and Dice loss for a total of 50 epochs. In the federated setting, training was distributed over five rounds (10 epochs per round) using the Flower framework and FedAvg aggregation. Each of the five simulated clients received an identically sized split of 331 training and 41 validation images; a held‑out test set of 206 images (10 % of the data) was used for all evaluations.
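The core of FedAvg aggregation, which Flower applies after each round, can be sketched in plain NumPy: server-side parameters are replaced by a per-layer average of client parameters weighted by client dataset size. This is an illustrative sketch independent of the Flower implementation; the function name and data layout are assumptions, not the paper's code.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: per-layer average of client parameters, weighted by
    the number of training examples each client holds.

    client_weights: list (one entry per client) of lists of np.ndarray
    client_sizes:   list of training-set sizes, aligned with client_weights
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    averaged = []
    for layer in range(n_layers):
        acc = np.zeros_like(client_weights[0][layer], dtype=np.float64)
        for weights, n in zip(client_weights, client_sizes):
            acc += (n / total) * weights[layer]
        averaged.append(acc)
    return averaged
```

With the study's five equally sized clients (331 training images each), the weighted average reduces to a simple mean of the client models.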
Four experimental configurations were defined: (1) baseline (unaltered data), (2) label manipulation (mask dilation with kernels of size 3–11 px and 10 % chance of omitting a random tooth mask for one client), (3) image‑quality manipulation (addition of Gaussian noise, μ = 0, σ = 25, to one client’s images), and (4) faulty‑client exclusion (the corrupted client omitted from CL and FL training). For each configuration, the authors measured Dice, Intersection‑over‑Union (IoU), Hausdorff Distance (HD), 95th‑percentile HD (HD95), and Average Symmetric Surface Distance (ASSD). Because none of the metric distributions satisfied normality (Shapiro–Wilk p < 0.05), non‑parametric Wilcoxon signed‑rank tests were applied to assess pairwise differences.
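The two corruption schemes applied to the faulty client can be sketched as follows, a minimal illustration assuming binary per-tooth masks and 8-bit grayscale images; the function names and the exact clipping behavior are assumptions, not the authors' code.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def corrupt_mask(mask, kernel_px, drop_prob=0.10):
    """Label manipulation: dilate a binary tooth mask with a square
    kernel (3-11 px in the study), or drop it entirely with
    probability drop_prob (10 % in the study)."""
    if rng.random() < drop_prob:
        return np.zeros_like(mask)
    structure = np.ones((kernel_px, kernel_px), dtype=bool)
    return ndimage.binary_dilation(mask, structure=structure).astype(mask.dtype)

def add_gaussian_noise(image, sigma=25.0):
    """Image-quality manipulation: additive Gaussian noise
    (mu = 0, sigma = 25), clipped back to the 8-bit range."""
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Dilation inflates tooth boundaries (a systematic over-segmentation of the labels), while the random drop simulates a missing annotation, so the two together cover both biased and absent supervision.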
Results show that FL consistently matches or exceeds CL and markedly outperforms LL across all scenarios. In the baseline, FL achieved a median Dice of 0.94889 (ASSD = 1.332 px), slightly higher than CL (Dice = 0.94706, ASSD = 1.371) and substantially better than LL (Dice range = 0.93557–0.94026, ASSD = 1.519–1.698). Under label manipulation, FL retained a Dice of 0.94884 (ASSD = 1.465) versus CL’s 0.94183 (ASSD = 1.757) and LL’s 0.93003–0.94026 (ASSD = 1.519–2.115). With noisy images, FL led with Dice = 0.94853 (ASSD = 1.311), CL followed closely (Dice = 0.94787, ASSD = 1.361), while LL lagged (Dice = 0.93179–0.94026, ASSD = 1.519–1.774). When the faulty client was excluded, FL still outperformed CL (Dice = 0.94790 vs. 0.94550; ASSD = 1.331 vs. 1.393).
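The overlap metrics behind these comparisons have standard definitions; a minimal sketch for binary masks (the distance-based metrics HD/HD95/ASSD require surface extraction and are omitted here):

```python
import numpy as np

def dice_iou(pred, target, eps=1e-8):
    """Dice = 2|A∩B| / (|A|+|B|); IoU = |A∩B| / |A∪B| for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = 2.0 * inter / (pred.sum() + target.sum() + eps)
    iou = inter / (union + eps)
    return dice, iou

# Pairwise significance, as in the study: Shapiro-Wilk to check
# normality, then the Wilcoxon signed-rank test on paired per-image
# scores, e.g. scipy.stats.wilcoxon(fl_dice_scores, cl_dice_scores).
```

Per-image Dice scores from two methods evaluated on the same test set form natural pairs, which is why the signed-rank test (rather than an unpaired test) is appropriate here.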
A notable contribution is the use of per‑client training and validation loss trajectories as an anomaly‑detection mechanism. The loss curves of the corrupted client diverged early, enabling reliable identification of data quality issues without exposing raw images. This capability is especially valuable in real‑world multi‑center collaborations where data integrity cannot be guaranteed a priori.
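One simple way to operationalize this monitoring is a robust outlier test on the per-client validation losses the server already collects. The paper reports visual inspection of loss curves; the MAD-based rule below is an illustrative assumption, not the authors' procedure.

```python
import numpy as np

def flag_divergent_clients(val_losses, z_thresh=3.0):
    """Flag clients whose latest validation loss deviates from the
    cohort median by more than z_thresh robust (MAD-based) z-scores.

    val_losses: dict {client_id: list of per-round validation losses}
    """
    finals = {c: losses[-1] for c, losses in val_losses.items()}
    vals = np.array(list(finals.values()))
    med = np.median(vals)
    mad = np.median(np.abs(vals - med)) or 1e-8  # guard against zero spread
    # 1.4826 * MAD approximates the standard deviation for Gaussian data
    return [c for c, v in finals.items()
            if abs(v - med) / (1.4826 * mad) > z_thresh]
```

Because the rule only consumes scalar loss values, it preserves the privacy property the paper emphasizes: the server never needs to inspect raw images to spot a corrupted site.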
The authors conclude that FL offers a privacy‑preserving, scalable solution for dental imaging AI, delivering performance comparable to centralized approaches even when faced with heterogeneous labeling errors and image degradation. The study underscores FL’s robustness to realistic data imperfections and its built‑in monitoring tools, positioning it as a practical framework for future large‑scale, multi‑institutional clinical AI deployments. Further work is suggested to extend the methodology to other imaging modalities, incorporate differential privacy or secure aggregation, and evaluate computational overhead in resource‑constrained clinical settings.