Deep EM with Hierarchical Latent Label Modelling for Multi-Site Prostate Lesion Segmentation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Label variability is a major challenge for prostate lesion segmentation. In multi-site datasets, annotations often reflect centre-specific contouring protocols, causing segmentation networks to overfit to local styles and generalise poorly to unseen sites at inference. We treat each observed annotation as a noisy observation of an underlying latent ‘clean’ lesion mask, and propose a hierarchical expectation-maximisation (HierEM) framework that alternates between: (1) inferring a voxel-wise posterior distribution over the latent mask, and (2) training a CNN with this posterior as a soft target while estimating site-specific sensitivity and specificity under a hierarchical prior. This hierarchical prior decomposes label quality into a global mean with site- and case-level deviations, reducing site-specific bias by penalising the site- and case-level deviation terms. Experiments on three cohorts demonstrate that the proposed hierarchical EM framework enhances cross-site generalisation compared to state-of-the-art methods. In pooled-dataset evaluation, the per-site mean DSC ranges from 29.50% to 39.69%; in leave-one-site-out generalisation, it ranges from 27.91% to 32.67%, yielding statistically significant improvements over comparison methods (p < 0.039). The method also produces interpretable per-site latent label-quality estimates (sensitivity α ranges from 31.5% to 47.3% at specificity β ≈ 0.99), supporting post-hoc analyses of cross-site annotation variability. These results indicate that explicitly modelling site-dependent annotation variability can improve cross-site generalisation.


💡 Research Summary

This paper tackles the pervasive problem of annotation variability in multi‑site prostate lesion segmentation from multiparametric MRI (mpMRI). In practice, each institution follows its own contouring protocol, leading to systematic differences in the ground‑truth masks that cause deep‑learning models to overfit to site‑specific styles and perform poorly on unseen sites. The authors propose to treat every observed annotation as a noisy realisation of an unobserved “clean” lesion mask and to model the noise with site‑ and case‑specific sensitivity (α) and specificity (β) parameters. These parameters are placed under a hierarchical logistic‑normal prior that decomposes label quality into a global mean (µα, µβ) and deviations for each site (a_s, b_s) and each case (u_k, v_k). The prior is regularised with Gaussian (L2) penalties, which shrink site‑specific estimates toward the global mean and prevent degenerate values.
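The hierarchical decomposition described above can be sketched in a few lines of NumPy. Function names, argument order, and penalty weights here are illustrative assumptions, not the paper's exact parameterisation; the key idea is that each case's α and β are the sigmoid of a global mean plus site- and case-level deviations, with the deviations shrunk toward zero by L2 penalties:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def label_quality(mu_alpha, mu_beta, a_s, b_s, u_k, v_k):
    """Logistic-normal decomposition: sensitivity/specificity for case k at
    site s are the sigmoid of a global mean plus site- and case-level
    deviations (a hypothetical sketch of the paper's parameterisation)."""
    alpha_k = sigmoid(mu_alpha + a_s + u_k)  # per-case sensitivity
    beta_k = sigmoid(mu_beta + b_s + v_k)    # per-case specificity
    return alpha_k, beta_k

def l2_prior_penalty(a, b, u, v, lam_site=1.0, lam_case=1.0):
    """Gaussian (L2) shrinkage of site- and case-level deviations toward
    zero; lam_site/lam_case are illustrative weights."""
    return (lam_site * (np.sum(a**2) + np.sum(b**2))
            + lam_case * (np.sum(u**2) + np.sum(v**2)))
```

Because the sigmoid keeps α and β strictly inside (0, 1) and the penalty is minimised when all deviations vanish, the MAP estimate naturally pulls each site toward the global mean unless its annotations clearly deviate.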

Learning proceeds via an Expectation‑Maximisation (EM) algorithm. In the E‑step, given the current network parameters (θ) and label‑quality parameters (ϕ), a voxel‑wise posterior q_k(x)=P(G_k(x)=1|X_k,Y_k) is computed by combining the image‑based prior π_k(x)=σ(f_θ(X_k)_x) from a UNet with the observation model defined by α and β. This posterior serves as a soft consensus mask that fuses image information and annotation reliability. In the M‑step, two sub‑steps are performed: (A) the UNet is updated by minimising a loss that combines soft‑label cross‑entropy and Dice, using q_k as the target; (B) the hierarchical label‑quality parameters are updated by maximising the expected complete‑data log‑likelihood, which depends only on aggregated sufficient statistics (TP_k, P_k, TN_k, N_k). Because the number of parameters is small, the quasi‑Newton L‑BFGS optimiser efficiently finds the MAP estimate under the L2‑regularised prior.
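The E-step posterior follows from Bayes' rule applied per voxel under the Bernoulli observation model P(Y=1|G=1)=α, P(Y=0|G=0)=β. A minimal NumPy sketch, with the exact form of the sufficient statistics an assumption inferred from the symbols TP_k, P_k, TN_k, N_k rather than taken from the paper:

```python
import numpy as np

def e_step_posterior(pi, y, alpha, beta, eps=1e-8):
    """Voxel-wise posterior q(x) = P(G=1 | X, Y).
    pi: network prior sigmoid(f_theta(X)); y: observed binary annotation;
    alpha/beta: current sensitivity/specificity for this case."""
    lik_g1 = np.where(y == 1, alpha, 1.0 - alpha)  # P(Y | G=1)
    lik_g0 = np.where(y == 1, 1.0 - beta, beta)    # P(Y | G=0)
    return pi * lik_g1 / (pi * lik_g1 + (1.0 - pi) * lik_g0 + eps)

def sufficient_stats(q, y):
    """Aggregates plausibly matching (TP_k, P_k, TN_k, N_k): expected true
    positives, expected positive mass, expected true negatives, expected
    negative mass (an assumed definition for illustration)."""
    TP = np.sum(q * (y == 1))
    P = np.sum(q)
    TN = np.sum((1.0 - q) * (y == 0))
    N = np.sum(1.0 - q)
    return TP, P, TN, N
```

With a reliable annotator (high α, β), an annotated voxel pushes q toward 1 and an unannotated voxel toward 0; as α drops, the posterior leans more on the network prior π, which is exactly what lets the model discount low-quality sites.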

The method was evaluated on three distinct sites (total N = 2,857), each providing T2‑weighted, high‑b DWI, and ADC images together with a single expert contour. Two experimental setups were used: (1) a pooled split, where training data from all sites are mixed and performance is evaluated separately on each site's held‑out test cases; (2) a leave‑one‑site‑out (LOSO) split, where one site is completely excluded from training. Baselines included a standard UNet, a bootstrapped self‑training variant, and a non‑hierarchical site‑EM that estimates independent α_s, β_s per site.
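The LOSO protocol can be made concrete with a small generator (site identifiers here are placeholders, not the paper's cohort names):

```python
def loso_splits(sites):
    """Leave-one-site-out: each site in turn is held out entirely for
    testing; the remaining sites form the training pool."""
    for held_out in sites:
        train = [s for s in sites if s != held_out]
        yield train, held_out
```

With three sites this yields three train/test configurations, each measuring generalisation to a site whose annotation style was never seen during training.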

Results show that the proposed Hierarchical EM (HierEM) consistently outperforms all baselines. In the pooled setting, HierEM achieves the highest mean Dice scores across sites (≈39.7 % for Site 1, 29.5 % for Site 2, 35.6 % for Site 3) with comparable or slightly better 95 % Hausdorff distances. In the more challenging LOSO scenario, where baseline Dice drops to 25–33 %, HierEM raises performance to 27–33 % Dice and reduces boundary error to ~20 mm, demonstrating superior cross‑site generalisation. Estimated sensitivities are modest (31.5 %–47.3 %) reflecting the difficulty of detecting small lesions, while specificities are uniformly high (≈0.99), appropriate for the highly imbalanced lesion‑to‑background ratio. Risk‑coverage curves based on voxel‑wise predictive entropy reveal that discarding high‑uncertainty voxels markedly improves Dice, indicating that the model’s uncertainty estimates are well calibrated and useful for selective review.
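The risk-coverage analysis mentioned above can be sketched as follows: voxels are ranked by binary predictive entropy, the most uncertain fraction is discarded, and error on the retained voxels is reported. This simplified sketch uses 1 − accuracy as the risk measure, whereas the paper reports Dice on retained voxels; the function and threshold choices are illustrative:

```python
import numpy as np

def risk_coverage(q, g, coverages=(1.0, 0.9, 0.8)):
    """Rank voxels by binary predictive entropy of posterior q, keep the
    most confident fraction `c`, and report (coverage, risk) pairs, where
    risk = 1 - accuracy against reference mask g on the retained voxels."""
    eps = 1e-12
    q, g = q.ravel(), g.ravel()
    entropy = -(q * np.log(q + eps) + (1.0 - q) * np.log(1.0 - q + eps))
    order = np.argsort(entropy)  # most confident voxels first
    results = []
    for c in coverages:
        keep = order[: max(1, int(c * q.size))]
        pred = (q[keep] > 0.5).astype(int)
        results.append((c, 1.0 - float(np.mean(pred == g[keep]))))
    return results
```

If the uncertainty estimates are well calibrated, risk should fall as coverage shrinks, which is the behaviour the authors report for their entropy-based curves.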

Key contributions are: (1) a principled probabilistic model that explicitly separates global lesion characteristics from site‑specific annotation bias; (2) an EM‑based training loop that jointly optimises the segmentation network and the hierarchical label‑quality parameters; (3) interpretable site‑level sensitivity and specificity estimates that quantify annotation quality; and (4) a demonstration that modelling annotation noise improves robustness to unseen sites without any site‑specific fine‑tuning. The authors suggest future work extending the framework to multi‑annotator settings, incorporating more sites and lesion types, and exploring downstream clinical decision‑making using the provided uncertainty measures.

