Active Label Cleaning for Reliable Detection of Electron Dense Deposits in Transmission Electron Microscopy Images
Automated detection of electron dense deposits (EDD) in glomerular disease is hindered by the scarcity of high-quality labeled data. While crowdsourcing reduces annotation cost, it introduces label noise. We propose an active label cleaning method to efficiently denoise crowdsourced datasets. Our approach uses active learning to select the most valuable noisy samples for expert re-annotation, building high-accuracy cleaning models. A Label Selection Module leverages discrepancies between crowdsourced labels and model predictions for both sample selection and instance-level noise grading. Experiments show our method achieves 67.18% AP₅₀ on a private dataset, an 18.83% improvement over training on noisy labels. This performance reaches 95.79% of that with full expert annotation while reducing annotation cost by 73.30%. The method provides a practical, cost-effective solution for developing reliable medical AI with limited expert resources.
💡 Research Summary
The paper addresses the critical bottleneck of obtaining high‑quality annotated data for automated detection of electron‑dense deposits (EDD) in transmission electron microscopy (TEM) images of glomeruli. While crowdsourcing can provide a large volume of labels at low cost, it inevitably introduces various types of noise—background, missed detections, localization errors, and the particularly troublesome “box‑in‑box” (Bib) noise where a single bounding box encloses multiple deposits. To overcome these challenges, the authors propose an Active Label Cleaning (ALC) framework that combines active learning with a novel Label Selection Module (LSM) to efficiently clean crowdsourced datasets using a limited expert annotation budget.
The method proceeds in two stages. In Stage 1 (Active Learning), a small seed set of images is randomly selected and annotated by expert pathologists, forming an initial clean dataset Dₚ⁰. The same images also have crowdsourced labels D𝚌⁰. Two detection models are trained in parallel: Mₚ on the clean set and Mₐ on a consensus set that merges clean and crowdsourced labels where they agree above an IoU threshold. For every unlabeled image, both models generate predictions (Bᵢₚ and Bᵢₐ). The LSM compares these predictions with the original crowdsourced labels Bᵢ𝚌, categorizing each instance into four regions:
- Red (B_red) – high inconsistency among all three sources, indicating complex noise.
- Green (B_green) – present only in crowdsourced labels, likely containing complex noise (background, localization, Bib).
- Pink (B_pink) – detected only by Mₚ, suggesting missed annotations in the crowd set (simple miss noise).
- Gray (B_gray) – high agreement (IoU > 0.5) across sources, considered clean or only mildly noisy.
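The four-way grading above can be sketched as IoU-based matching between the three box sources. The snippet below is a simplified illustration, not the paper's exact rule: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the `grade_labels` function name and the exact agreement logic are hypothetical, and the 0.5 threshold follows the Gray-region description.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def best_iou(box, boxes):
    """Highest IoU between `box` and any box in `boxes`."""
    return max((iou(box, b) for b in boxes), default=0.0)

def grade_labels(crowd, pred_p, pred_a, thr=0.5):
    """Hypothetical instance-level noise grading in the spirit of the LSM.

    crowd:  crowdsourced boxes B_c for one image
    pred_p: predictions of M_p (trained on clean labels)
    pred_a: predictions of M_a (trained on the consensus set)
    """
    regions = {"red": [], "green": [], "gray": []}
    for b in crowd:
        hit_p = best_iou(b, pred_p) > thr
        hit_a = best_iou(b, pred_a) > thr
        if hit_p and hit_a:
            regions["gray"].append(b)   # all three sources agree: likely clean
        elif hit_p or hit_a:
            regions["red"].append(b)    # partial disagreement: complex noise
        else:
            regions["green"].append(b)  # present only in crowd labels
    # Pink: M_p detections matching no crowd box (likely missed annotations).
    regions["pink"] = [b for b in pred_p if best_iou(b, crowd) <= thr]
    return regions
```

In this toy version, Red is approximated as partial disagreement between the two models; the paper's actual criterion for "high inconsistency among all three sources" may differ.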
Each instance receives a confidence‑based score (s_red from Mₐ, s_green from Mₚ). Image‑level scores are obtained by summing the instance scores within an image (SCORE). The top‑k images with the highest SCORE—those most likely to contain valuable noisy information—are sent to pathologists for re‑annotation. The corrected labels are added to the clean set, the models are retrained, and the process repeats until a stopping criterion is met (g iterations).
In Stage 2 (Label Correction), the final cleaning models M_gₚ and M_gₐ are applied to the remaining unlabeled images. The LSM again classifies the residual crowdsourced labels. Simple‑noise categories (Gray and Pink) are automatically corrected using M_gₚ predictions, because M_gₚ has been trained exclusively on clean data and exhibits superior detection accuracy. Complex‑noise categories (Red and Green) are forwarded to experts for manual correction, with predictions from both models provided as suggestions. To further reduce expert effort, a Bib Correction Module automatically removes Bib‑type noise within the Green set by checking for overlapping high‑IoU boxes (threshold γ) and retaining the most reliable counterpart.
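The Bib Correction Module described above might look roughly like this. The sketch assumes `(x1, y1, x2, y2)` boxes and uses containment (fraction of a model box covered by the crowd box) against a threshold γ as a stand-in for the paper's overlap test; `fix_bib`, `contained_frac`, and the default γ = 0.8 are all hypothetical.

```python
def contained_frac(inner, outer):
    """Fraction of `inner` box area covered by `outer` box."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0

def fix_bib(green_boxes, model_boxes, gamma=0.8):
    """Replace each box-in-box (Bib) label with the model boxes it encloses.

    A crowd box covering two or more model-predicted deposits is treated as
    Bib noise and swapped for those predictions; otherwise it is kept.
    """
    corrected = []
    for g in green_boxes:
        enclosed = [m for m in model_boxes if contained_frac(m, g) >= gamma]
        if len(enclosed) >= 2:          # one crowd box over several deposits
            corrected.extend(enclosed)
        else:
            corrected.append(g)
    return corrected
```

The design intent matches the summary: automatic handling of this one well-defined noise pattern frees expert time for the genuinely ambiguous Red and Green cases.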
The authors evaluate the framework on a private dataset of 1,112 TEM images (2,048 × 2,048 pixels) from 202 patients with membranous nephropathy. Five medical students supplied the crowdsourced annotations, while three senior pathologists supplied the gold‑standard labels. Experiments show that the proposed method achieves an AP₅₀ of 67.18 %, an 18.83 % absolute improvement over training directly on noisy crowdsourced labels. Compared with a model trained on the full expert‑annotated set (≈70.2 % AP₅₀), the ALC approach reaches 95.79 % of that performance while reducing annotation cost by 73.30 %. Ablation studies confirm that prioritizing Red and Green instances for expert review is the key driver of performance gains. The method also outperforms recent noise‑robust detectors (NOTE‑RCNN, OA‑MIL) and prior label‑cleaning approaches (CA‑BBC, Mao et al.) that rely on a single model’s predictions.
Key contributions include: (1) a systematic, instance‑level noise grading mechanism via the LSM; (2) an active‑learning loop that strategically selects the most informative noisy images for expert re‑annotation, maximizing ROI of limited expert time; (3) a hybrid automatic‑manual correction pipeline that automatically fixes simple errors while reserving expert effort for complex cases, further enhanced by a dedicated Bib‑noise remover. The work demonstrates a practical pathway to building reliable medical AI systems when expert annotation resources are scarce. Future directions suggested by the authors involve validating the approach across multiple institutions and disease types, integrating reinforcement‑learning‑based selection policies for the LSM, and embedding the pipeline into real‑time clinical workflows.