Leveraging Multi-Rater Annotations to Calibrate Object Detectors in Microscopy Imaging


Deep learning-based object detectors have achieved impressive performance in microscopy imaging, yet their confidence estimates often lack calibration, limiting their reliability for biomedical applications. In this work, we introduce a new approach to improve model calibration by leveraging multi-rater annotations. We propose to train separate models on the annotations from single experts and aggregate their predictions to emulate consensus. This improves upon label sampling strategies, where models are trained on mixed annotations, and offers a more principled way to capture inter-rater variability. Experiments on a colorectal organoid dataset annotated by two experts demonstrate that our rater-specific ensemble strategy improves calibration performance while maintaining comparable detection accuracy. These findings suggest that explicitly modelling rater disagreement can lead to more trustworthy object detectors in biomedical imaging.


💡 Research Summary

This paper addresses the critical problem of mis‑calibrated confidence scores in deep‑learning‑based object detectors applied to microscopy imaging. While such detectors achieve high detection accuracy, their predicted probabilities are often over‑confident, which hampers reliable decision‑making in biomedical contexts where uncertainty quantification is essential. The authors propose a novel calibration strategy that explicitly leverages multi‑rater annotations rather than treating them as a single noisy label source.

The core idea is to train separate detector models for each expert’s annotation set (Rater‑Specific, RS models) and then ensemble these models at inference time. Each RS model learns the individual bias of its annotator—differences in inclusion criteria, handling of ambiguous structures, and subjective interpretation of organoid boundaries. During inference, predictions from all RS models are grouped by spatial overlap (IoU > 0.5). For each group, confidence scores are averaged over all models, with a model that produced no matching box contributing a confidence of zero. This “agreement‑aware” averaging ensures that detections supported by many raters receive higher confidence, while those supported by only a few are down‑weighted, thereby reflecting inter‑rater disagreement as aleatoric uncertainty.
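The grouping-and-averaging step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names and the greedy one-box-per-model matching are assumptions, and boxes are plain `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def ensemble_detections(per_model_preds, iou_thr=0.5):
    """per_model_preds: one list of (box, score) pairs per ensemble member.

    Greedily groups overlapping boxes across models, then averages scores
    over ALL models, so a member with no matching box contributes zero.
    """
    n_models = len(per_model_preds)
    groups = []  # each group: (representative box, {model_idx: score})
    for m, preds in enumerate(per_model_preds):
        for box, score in preds:
            for rep_box, members in groups:
                if m not in members and iou(box, rep_box) > iou_thr:
                    members[m] = score  # join an existing group
                    break
            else:
                groups.append((box, {m: score}))  # start a new group
    # Absent models count as confidence 0, down-weighting low-agreement boxes.
    return [(box, sum(members.values()) / n_models)
            for box, members in groups]
```

With two members, a box predicted by both at 0.9 and 0.7 ends up at 0.8, while a box only one member predicts at 0.6 is halved to 0.3.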

To evaluate calibration for object detection, the authors use Detection Expected Calibration Error (D‑ECE), an adaptation of the classic ECE metric that bins detections by confidence and compares the precision (fraction of correct detections) within each bin to the mean confidence of that bin. Lower D‑ECE indicates better calibration. They also report standard detection metrics (mean Average Precision, mAP, and mean Average Recall, mAR) to verify that calibration improvements do not sacrifice accuracy.
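A minimal D‑ECE computation might look like the sketch below, assuming equal-width confidence bins weighted by bin size, as in the classic ECE recipe; the exact binning in the paper may differ.

```python
import numpy as np

def detection_ece(confidences, is_true_positive, n_bins=10):
    """D-ECE sketch: bin detections by confidence and compare each bin's
    precision (fraction of true positives) to its mean confidence.
    Lower values indicate better-calibrated scores."""
    conf = np.asarray(confidences, dtype=float)
    tp = np.asarray(is_true_positive, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(conf)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if not mask.any():
            continue  # empty bins contribute nothing
        precision = tp[mask].mean()
        avg_conf = conf[mask].mean()
        ece += (mask.sum() / n) * abs(precision - avg_conf)
    return ece
```

For example, five detections at confidence 0.8 of which four are correct give a precision of 0.8 in that bin and hence a D‑ECE of 0, whereas four detections at 0.9 with only two correct give a D‑ECE of 0.4.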

Experiments are conducted on a colorectal cancer patient‑derived organoid (PDO) dataset consisting of 100 bright‑field images at 4× magnification. Two expert raters independently annotated 80 images (single‑rater set) and jointly annotated the remaining 20 images (consensus set). The authors fine‑tuned Mask‑R‑CNN heads (pre‑trained on a related organoid detection task) on the single‑rater data, creating two families of models: (1) Label‑Sampling (LS) models, trained each epoch on a randomly chosen rater’s labels, and (2) Rater‑Specific (RS) models, trained exclusively on one rater’s annotations. Hyper‑parameter sweeps yielded 20 top LS models and 10 top RS models per rater, selected based on validation mAP.
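The distinction between the two training strategies can be made concrete with a skeleton like the one below. The `train_one_epoch` callback and the label containers are hypothetical stand-ins for the actual Mask R-CNN fine-tuning code, which the paper does not list.

```python
import random

def train_label_sampling(model, rater_labels, n_epochs, train_one_epoch):
    """LS: each epoch, fine-tune on one randomly chosen rater's label set,
    mixing annotation styles within a single model."""
    for _ in range(n_epochs):
        labels = random.choice(rater_labels)  # e.g. rater 1 or rater 2
        train_one_epoch(model, labels)
    return model

def train_rater_specific(model, single_rater_labels, n_epochs, train_one_epoch):
    """RS: fine-tune exclusively on one rater's annotations, so the model
    absorbs that rater's individual bias."""
    for _ in range(n_epochs):
        train_one_epoch(model, single_rater_labels)
    return model
```

In this framing, the RSE ensemble combines RS models trained with the second function (some per rater), while the LSE baseline combines models trained with the first.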

Two ensemble strategies were compared: (a) Label‑Sampling Ensemble (LSE), which aggregates the LS models (baseline), and (b) Rater‑Specific Ensemble (RSE), which aggregates RS models from both raters. Ensembles of varying sizes (2–20 models) were evaluated using 100 bootstrap resamples of the consensus test set. Results show that RSE consistently achieves lower D‑ECE than LSE across all ensemble sizes; the largest ensembles (20 models) yield D‑ECE = 0.08 for RSE versus 0.15 for LSE. Importantly, mAP remains essentially unchanged between the two strategies (≈0.45–0.46), indicating that the calibration gain does not come at the cost of detection performance.
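The bootstrap evaluation over the consensus test set can be sketched as below. This is an approximation under stated assumptions: the paper re-evaluates each resample of test images, which we simplify here by resampling precomputed per-image metric values; the function name and the 95% percentile interval are our own choices.

```python
import numpy as np

def bootstrap_metric(per_image_metric, n_boot=100, seed=0):
    """Resample per-image metric values with replacement n_boot times
    (100 in the paper) and return the mean of the bootstrap means plus
    a 95% percentile interval."""
    rng = np.random.default_rng(seed)
    values = np.asarray(per_image_metric, dtype=float)
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    return means.mean(), np.percentile(means, [2.5, 97.5])
```

Running this once per ensemble size and strategy yields the D‑ECE and mAP comparisons reported above, with the interval indicating variability over the 20 consensus images.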

The authors discuss why D‑ECE does not markedly improve with larger ensembles: all models share the same pre‑trained backbone and are fine‑tuned from identical checkpoints, limiting diversity. They suggest that training from scratch with different random initializations could further enhance ensemble benefits. Computational cost is also noted: inference time scales linearly with ensemble size, but this cost is identical for RSE and LSE.

Limitations include the modest dataset size and the use of only two raters, which restricts the generality of the findings. Future work is proposed to expand to larger, multi‑rater datasets, explore Bayesian or knowledge‑distillation techniques for efficient uncertainty modeling, and investigate alternative aggregation schemes that retain diversity while preserving calibration.

In conclusion, the study demonstrates that explicitly modeling rater disagreement by training and ensembling rater‑specific detectors yields substantially better calibrated confidence estimates than conventional label‑mixing ensembles, without degrading object detection accuracy. This approach offers a practical pathway to more trustworthy AI systems in microscopy and broader biomedical imaging applications, where calibrated uncertainty is a prerequisite for clinical adoption.

