Region-Normalized DPO for Medical Image Segmentation under Noisy Judges

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv paper.

While dense pixel-wise annotations remain the gold standard for medical image segmentation, they are costly to obtain and limit scalability. In contrast, many deployed systems already produce inexpensive automatic quality-control (QC) signals, such as model agreement, uncertainty measures, or learned mask-quality scores, which can be used for further model training without additional ground-truth annotation. However, these signals can be noisy and biased, making preference-based fine-tuning susceptible to harmful updates. We study Direct Preference Optimization (DPO) for segmentation from such noisy judges, using proposals generated by a supervised base segmenter trained on a small labeled set. We find that outcomes depend strongly on how preference pairs are mined: selecting the judge's top-ranked proposal can improve peak performance when the judge is reliable, but can amplify harmful errors under weaker judges. We propose Region-Normalized DPO (RN-DPO), a segmentation-aware objective that normalizes preference updates by the size of the disagreement region between masks, reducing the leverage of harmful comparisons and improving optimization stability. Across two medical datasets and multiple regimes, RN-DPO improves sustained performance and stabilizes preference-based fine-tuning, outperforming standard DPO and strong baselines without requiring additional pixel annotations.


💡 Research Summary

Medical image segmentation typically relies on dense pixel‑wise annotations, which are expensive and limit scalability. In contrast, many clinical pipelines already generate inexpensive quality‑control (QC) signals—such as model agreement, uncertainty estimates, or learned mask‑quality scores—that can be used as supervisory feedback without additional ground‑truth masks. However, these signals are noisy and biased, posing a risk for preference‑based fine‑tuning methods like Direct Preference Optimization (DPO), which can be destabilized by erroneous rankings.

This paper investigates DPO for segmentation when the “judge” providing preferences is a noisy, model‑based QC system. The authors first train a base segmenter on a small labeled set (D_seg). For each unlabeled image (D_pref), the current segmenter generates a slate of K diverse candidate masks using stochastic sampling, logit perturbations, thresholding, and simple morphological edits. A separate judge model—trained on a disjoint QC‑labeled subset (size N_QC) and never used to supervise the segmenter—scores each candidate by IoU with its own prediction, thereby ranking the slate.
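The slate-and-judge pipeline above can be sketched as follows. The exact sampling schemes, perturbation scales, and thresholds are not given in the summary, so the choices below (Bernoulli sampling, unit-variance logit noise, a 0.3–0.7 threshold range) are illustrative assumptions:

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def build_slate(probs: np.ndarray, k: int, rng: np.random.Generator):
    """Generate K diverse candidate masks from one probability map via
    stochastic sampling, logit perturbation, and threshold variation
    (illustrative variants; the paper also uses morphological edits)."""
    probs = np.clip(probs, 1e-6, 1.0 - 1e-6)
    logits = np.log(probs) - np.log1p(-probs)  # inverse sigmoid
    slate = []
    for i in range(k):
        if i % 3 == 0:    # Bernoulli sampling from the probability map
            cand = rng.random(probs.shape) < probs
        elif i % 3 == 1:  # perturb logits with Gaussian noise
            cand = (logits + rng.normal(0.0, 1.0, probs.shape)) > 0.0
        else:             # vary the decision threshold
            cand = probs > rng.uniform(0.3, 0.7)
        slate.append(cand)
    return slate

def rank_slate(slate, judge_mask: np.ndarray):
    """Judge scores each candidate by IoU with its own prediction;
    returns candidate indices sorted best-first, plus the raw scores."""
    scores = [iou(c, judge_mask) for c in slate]
    order = [int(i) for i in np.argsort(scores)[::-1]]
    return order, scores
```

Note that the judge never sees ground truth at this stage: its only reference is its own predicted mask, which is exactly why its rankings can be noisy.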

Preference pairs are mined from the ranked slate using several strategies: Top‑vs‑Base (top candidate vs. current prediction), Top‑vs‑Random, Threshold (top vs. first candidate whose score falls below a margin τ), and Random. Standard DPO then optimizes a logistic loss on the log‑likelihood ratio Δ_θ between the preferred (y⁺) and rejected (y⁻) masks, regularized by a fixed reference model (the Stage‑1 checkpoint).
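A minimal sketch of the four mining strategies, assuming `ranked` lists candidate indices best-first and `base_idx` marks the current model's prediction. The precise meaning of the Threshold margin τ is not spelled out in the summary; here it is interpreted as "score drops more than τ below the top candidate":

```python
import numpy as np

def mine_pair(ranked, scores, base_idx, strategy, tau=0.1, rng=None):
    """Pick a (preferred, rejected) index pair from a judge-ranked slate."""
    rng = rng or np.random.default_rng()
    top = ranked[0]
    if strategy == "top_vs_base":
        return top, base_idx
    if strategy == "top_vs_random":
        return top, int(rng.choice(ranked[1:]))
    if strategy == "threshold":
        # rejected = first candidate whose score falls below the top by tau
        for i in ranked[1:]:
            if scores[top] - scores[i] > tau:
                return top, i
        return top, ranked[-1]  # fallback: worst-ranked candidate
    if strategy == "random":
        a, b = rng.choice(len(scores), size=2, replace=False)
        return (int(a), int(b)) if scores[a] >= scores[b] else (int(b), int(a))
    raise ValueError(f"unknown strategy: {strategy}")
```

Top-heavy strategies concentrate updates on the judge's most confident comparisons, which is precisely why they amplify errors when the judge is weak.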

The key contribution is Region‑Normalized DPO (RN‑DPO). In segmentation, many pixels are identical between y⁺ and y⁻, contributing nothing to the likelihood ratio. Standard DPO normalizes the log‑likelihood over the entire image (|Ω|), implicitly scaling each pair by the fraction of disagreement |R|/|Ω|, where R is the set of pixels where the masks differ. This scaling can cause two problems: (1) when the disagreement region is tiny, the update is overly dampened, and (2) when the disagreement region is large—especially under noisy judges—the update can be dominated by potentially misranked comparisons, leading to harmful parameter changes.
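The implicit |R|/|Ω| scaling can be verified numerically: since the per-pixel log-likelihoods of y⁺ and y⁻ are identical outside R, averaging their difference over the full image is exactly the region-averaged difference damped by |R|/|Ω| (the values below are synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
omega = 10_000  # total pixels |Omega|
r = 50          # disagreement pixels |R|

# Per-pixel log-likelihoods; the two masks differ only on the first r pixels.
ll_pos = rng.normal(-0.5, 0.1, omega)
ll_neg = ll_pos.copy()
ll_neg[:r] -= rng.uniform(0.5, 1.0, r)  # rejected mask is worse on R

global_margin = ll_pos.mean() - ll_neg.mean()     # standard DPO's margin
region_margin = (ll_pos[:r] - ll_neg[:r]).mean()  # RN-DPO's margin

# Identity: global margin = (|R|/|Omega|) * region margin.
assert np.isclose(global_margin, (r / omega) * region_margin)
```

With |R|/|Ω| = 0.5%, the standard DPO margin here is 200× smaller than the region-normalized one, illustrating the dampening for small disagreement regions; conversely, a large |R| inflates the update regardless of whether the ranking was correct.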

RN‑DPO replaces the global normalization with a region‑wise normalization: the log‑likelihood is computed only over the disagreement region R, i.e., log π_θ^R(y|x) = (1/|R|) ∑_{i∈R} ℓ_{θ,i}(y_i). The same region‑wise likelihood is computed for the reference model, and the logistic loss is applied to the difference Δ_θ^R − Δ_ref^R. This decouples update magnitude from the size of the disagreement region, preventing large, noisy comparisons from overwhelming the learning signal while preserving gradients when masks differ only slightly.
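A runnable sketch of this loss for a single pair, assuming binary masks with per-pixel Bernoulli likelihoods (the paper's multi-class setting generalizes straightforwardly; β and the likelihood parameterization below are assumptions):

```python
import numpy as np

def pixel_loglik(mask, probs, eps=1e-7):
    """Per-pixel Bernoulli log-likelihood of a binary mask under a prob map."""
    p = np.clip(probs, eps, 1.0 - eps)
    return np.where(mask, np.log(p), np.log(1.0 - p))

def rn_dpo_loss(y_pos, y_neg, probs_theta, probs_ref, beta=0.1):
    """RN-DPO loss: likelihoods are averaged only over the disagreement
    region R = {i : y_pos_i != y_neg_i}, for both policy and reference."""
    R = y_pos != y_neg  # disagreement mask: a single elementwise comparison
    if not R.any():
        return 0.0  # identical masks carry no preference signal

    def margin(probs):
        lp = pixel_loglik(y_pos, probs)[R].mean()
        ln = pixel_loglik(y_neg, probs)[R].mean()
        return lp - ln

    z = beta * (margin(probs_theta) - margin(probs_ref))
    return float(np.log1p(np.exp(-z)))  # -log sigmoid(z)
```

Because both margins are means over R, the loss magnitude is independent of |R|: a pair disagreeing on 10 pixels and one disagreeing on 10,000 exert comparable influence per update.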

Experiments are conducted on two benchmark datasets: (1) JSRT chest X‑ray (multi‑label: heart, lungs, clavicles) and (2) ACDC cardiac MRI (multi‑class). For each dataset, the authors vary (a) the size of the supervised base set (N_seg), (b) the judge's training budget (N_QC) to create weak and strong judges, and (c) the mining strategy. Baselines include vanilla DPO, robust DPO variants (rDPO, β‑DPO), and non‑preference methods such as pseudo‑label filtering and ensemble‑based selection.

Key findings:

  • Stability: RN‑DPO consistently yields smoother loss curves and lower epoch‑to‑epoch variance, indicating more stable optimization.
  • Performance under weak judges: When the judge is trained on only 10 QC examples, RN‑DPO outperforms vanilla DPO by 3–5% Dice on average across all mining strategies, with the greatest gains for Threshold and Random mining, where noisy rankings are common.
  • Peak performance with strong judges: With a strong judge (50 QC examples) and the Top‑vs‑Base strategy, RN‑DPO achieves the highest Dice scores (≈0.89 on JSRT, ≈0.93 on ACDC), surpassing vanilla DPO by 4–5% and matching or exceeding the oracle‑like baselines.
  • Effect of disagreement size: Ablation studies varying the average |R| show that RN‑DPO’s gains are most pronounced when |R| is large, confirming that region normalization mitigates the over‑emphasis on noisy large‑disagreement pairs.

The authors also discuss computational overhead: computing the disagreement mask is a simple binary operation, adding negligible cost compared to the forward/backward passes. RN‑DPO therefore retains the simplicity and efficiency of DPO while improving robustness.

In summary, the paper introduces a principled modification to DPO tailored for dense prediction tasks where preference signals are noisy. By normalizing updates over the actual region of disagreement between masks, RN‑DPO reduces the influence of erroneous large‑scale comparisons and stabilizes training. The extensive empirical evaluation demonstrates that RN‑DPO consistently outperforms standard DPO and other strong baselines across different judge qualities and mining strategies, all without requiring any additional pixel‑level annotations. This work opens a practical pathway for leveraging inexpensive QC signals to continuously improve medical segmentation models in real‑world clinical settings where annotation resources are limited.

