Model Agnostic Preference Optimization for Medical Image Segmentation
Preference optimization offers a scalable supervision paradigm based on relative preference signals, yet prior attempts in medical image segmentation remain model-specific and rely on low-diversity prediction sampling. In this paper, we propose MAPO (Model-Agnostic Preference Optimization), a training framework that uses dropout-driven stochastic segmentation hypotheses to construct preference-consistent gradients without direct ground-truth supervision at the preference stage. MAPO is fully architecture- and dimensionality-agnostic, supporting 2D/3D CNN- and Transformer-based segmentation pipelines. Comprehensive evaluations across diverse medical datasets show that MAPO consistently enhances boundary adherence, reduces overfitting, and yields more stable optimization dynamics than conventional supervised training.
💡 Research Summary
The paper introduces MAPO — Model‑Agnostic Preference Optimization — a training framework that brings preference‑based fine‑tuning to a wide range of medical image segmentation models. Traditional preference‑optimization approaches in medical imaging have been limited to specific architectures (e.g., SAM) and rely on heuristic, low‑diversity sampling methods such as thresholding, which produce weak “good/bad” examples. MAPO overcomes these limitations by exploiting dropout as a source of stochasticity. During inference, the model is run multiple times with different dropout rates, producing a set of K diverse segmentation hypotheses for each input image. Each hypothesis is scored against the ground‑truth mask using Dice; the highest‑scoring hypothesis is automatically labeled as the “good” example (ŷ⁺). A candidate set of lower‑scoring hypotheses is formed by enforcing a minimum Dice gap τ below the best score; the lowest‑scoring member of this set becomes the “bad” example (ŷ⁻). This automatic generation of preference pairs eliminates the need for human‑annotated rankings and dramatically reduces annotation cost.
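The pair-selection rule described above can be sketched in a few lines. The function and variable names here are illustrative, not taken from the paper's code; the hypotheses would in practice come from K stochastic forward passes with dropout enabled.

```python
import numpy as np

def dice(pred, gt, eps=1e-6):
    # Dice coefficient between two binary masks.
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def select_preference_pair(hypotheses, gt, tau=0.05):
    """Pick (y_plus, y_minus) from K stochastic segmentation hypotheses.

    y_plus  : hypothesis with the highest Dice against the ground truth.
    y_minus : lowest-scoring hypothesis whose Dice is at least tau
              below the best score; None if no such candidate exists.
    """
    scores = np.array([dice(h, gt) for h in hypotheses])
    best = int(scores.argmax())
    candidates = [i for i in range(len(hypotheses))
                  if scores[best] - scores[i] >= tau]
    if not candidates:
        return hypotheses[best], None
    worst = min(candidates, key=lambda i: scores[i])
    return hypotheses[best], hypotheses[worst]
```

The τ gap guards against degenerate pairs: if all hypotheses score within τ of each other, no dispreferred example is emitted for that image.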
Once preference pairs are collected, MAPO optimizes the model using Direct Preference Optimization (DPO), a log‑ratio objective that directly increases the likelihood of preferred outputs while decreasing that of dispreferred ones, without training a separate reward model. With π_θ the model being trained, π_ref a frozen reference copy, and β a temperature controlling deviation from the reference, the loss is:

L_DPO = −log σ( β [ log (π_θ(ŷ⁺ | x) / π_ref(ŷ⁺ | x)) − log (π_θ(ŷ⁻ | x) / π_ref(ŷ⁻ | x)) ] )
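A minimal numerical sketch of this objective, assuming the per-mask log-probabilities have already been summed over pixels; the function signature is illustrative rather than the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_plus, logp_minus, ref_logp_plus, ref_logp_minus, beta=0.1):
    # Margin between the preferred and dispreferred log-ratios,
    # each measured relative to a frozen reference model.
    margin = beta * ((logp_plus - ref_logp_plus)
                     - (logp_minus - ref_logp_minus))
    # Loss falls as the policy favors y_plus over y_minus
    # more strongly than the reference does.
    return -np.log(sigmoid(margin))
```

When the policy and reference agree (zero margin), the loss sits at log 2; widening the preferred-over-dispreferred margin drives it toward zero, which is the gradient signal that replaces direct ground-truth supervision.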