Uncovering Modality Discrepancy and Generalization Illusion for General-Purpose 3D Medical Segmentation
While emerging 3D medical foundation models are envisioned as versatile tools that offer general-purpose capabilities, their validation remains largely confined to regional and structural imaging, leaving a significant modality discrepancy unexplored. To provide a rigorous and objective assessment, we curate the UMD dataset comprising 490 whole-body PET/CT and 464 whole-body PET/MRI scans ($\sim$675k 2D images, $\sim$12k 3D organ annotations) and conduct a thorough evaluation of representative 3D segmentation foundation models. Through intra-subject controlled comparisons of paired scans, we isolate imaging modality as the primary independent variable to evaluate model robustness in real-world applications. Our evaluation reveals a stark discrepancy between literature-reported benchmarks and real-world efficacy, particularly when transitioning from structural to functional domains. Such systemic failures underscore that current 3D foundation models are far from achieving truly general-purpose status, necessitating a paradigm shift toward multi-modal training and evaluation to bridge the gap between idealized benchmarking and comprehensive clinical utility. This dataset and analysis establish a foundational cornerstone for future research toward truly modality-agnostic medical foundation models.
💡 Research Summary
The paper “Uncovering Modality Discrepancy and Generalization Illusion for General‑Purpose 3D Medical Segmentation” critically examines the claim that emerging 3D medical foundation models are truly general‑purpose. While recent works have demonstrated impressive zero‑shot performance on a variety of structural imaging tasks (CT, MRI), the authors argue that these evaluations are heavily biased toward high‑contrast anatomical modalities and neglect functional imaging such as PET, which captures metabolic activity with diffuse, low‑contrast signals.
To expose this bias, the authors construct a new benchmark dataset called UMD, comprising 490 paired whole‑body PET/CT scans and 464 paired whole‑body PET/MRI scans. Each pair originates from the same patient in a single diagnostic session, guaranteeing exact spatial alignment between the structural (CT or MRI) and functional (PET) volumes. The dataset includes voxel‑level annotations for 13 organs (liver, kidneys, brain, heart, spleen, aorta, lungs, colon, bladder, pancreas, esophagus, stomach, etc.), amounting to roughly 675 k 2‑D slices and 12 k 3‑D organ masks. By providing ground truth for both modalities, the dataset enables a controlled intra‑subject comparison where imaging modality is the sole independent variable.
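As a rough sketch of how such a paired, intra-subject setup could be consumed in practice, the snippet below loads one subject's aligned structural, functional, and label volumes with nibabel and checks that they share a common voxel grid. The file names, label encoding, and organ-ID mapping are illustrative assumptions, not the UMD dataset's actual layout.

```python
import nibabel as nib
import numpy as np

# Hypothetical file layout for one UMD subject; real names/paths may differ.
ct_img  = nib.load("sub-001_ct.nii.gz")      # structural volume (CT or MRI)
pet_img = nib.load("sub-001_pet.nii.gz")     # paired functional PET volume
lbl_img = nib.load("sub-001_labels.nii.gz")  # voxel-level organ annotations

# Because both scans come from the same diagnostic session, they should share
# one spatial grid; this sanity check is what lets modality be treated as the
# sole independent variable in the intra-subject comparison.
assert ct_img.shape == pet_img.shape == lbl_img.shape
assert np.allclose(ct_img.affine, pet_img.affine)

ct, pet, labels = ct_img.get_fdata(), pet_img.get_fdata(), lbl_img.get_fdata()
organ_ids = {1: "liver", 2: "kidneys", 3: "brain"}  # illustrative subset of the 13 organs
```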
The authors evaluate five state‑of‑the‑art 3D general‑purpose segmentation foundation models: SAM‑Med3D‑turbo (point‑prompt), SegVol (large‑scale volumetric pre‑training), nnInteractive (point‑prompt), VISTA3D (text‑prompt), and SAT‑Pro (text‑prompt). All models are tested in a zero‑shot setting, receiving identical prompts (either a point location or a textual label) for each organ, and are asked to segment that organ in both the structural (CT or MRI) volume and the paired PET volume.
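A minimal sketch of this evaluation protocol is shown below: each organ receives the same prompt (a point derived from the ground-truth mask, or its textual label) in both modalities. The `model.segment(...)` call is a placeholder interface introduced only for illustration; the actual models (SAM‑Med3D, SegVol, nnInteractive, VISTA3D, SAT‑Pro) each expose their own APIs, and the paper does not specify how prompts are generated.

```python
import numpy as np

def centroid_prompt(mask: np.ndarray) -> tuple:
    """Derive a single foreground point prompt from the ground-truth organ mask."""
    coords = np.argwhere(mask > 0)
    return tuple(coords.mean(axis=0).round().astype(int))

def evaluate_zero_shot(model, volumes, labels, organ_ids, prompt_type="point"):
    """Run one foundation model on paired volumes with identical per-organ prompts.

    volumes: dict mapping modality name to volume, e.g. {"ct": ..., "pet": ...}
    labels:  shared ground-truth label volume (both modalities are co-registered)
    """
    predictions = {}
    for modality, volume in volumes.items():
        for organ_id, organ_name in organ_ids.items():
            gt = labels == organ_id
            if prompt_type == "point":
                prompt = {"point": centroid_prompt(gt)}   # spatial prompt
            else:
                prompt = {"text": organ_name}             # textual prompt
            # Placeholder call: stands in for each model's own inference API.
            predictions[(modality, organ_name)] = model.segment(volume, **prompt)
    return predictions
```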
Results reveal a dramatic modality gap. On CT, average Dice scores range from 0.40 to 0.48, with certain high‑contrast organs (liver, brain, lungs) achieving Dice >0.6–0.9. In stark contrast, on PET the same models typically obtain Dice scores near zero (0.00–0.10) for almost all organs. Text‑based models (VISTA3D, SAT‑Pro) essentially fail on PET, while point‑based models only recover modest performance on high‑uptake organs such as the bladder or liver where tracer accumulation creates strong contrast. Statistical analysis confirms that the performance differences between CT/MRI and PET are highly significant (p < 0.01 or p < 0.0001). To contextualize these findings, the authors train task‑specific nnU‑Net models on a tiny subset (10 cases per modality) and show that even these modestly trained models substantially outperform the zero‑shot foundation models on PET, underscoring the latter’s inability to generalize across modalities.
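The summary reports Dice scores and paired significance tests without spelling out the exact procedure; the sketch below shows one conventional way to compute the Dice coefficient and to run a paired Wilcoxon signed-rank test over intra-subject CT/PET scores. The choice of test and the numbers in the arrays are illustrative assumptions, not the authors' reported values.

```python
import numpy as np
from scipy.stats import wilcoxon

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient between a binary prediction and ground truth."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

# dice_ct[i] and dice_pet[i] score the same organ in the same subject, so a
# paired test is appropriate (illustrative values, not results from the paper).
dice_ct  = np.array([0.72, 0.65, 0.81, 0.40, 0.58, 0.91, 0.47, 0.63])
dice_pet = np.array([0.05, 0.02, 0.11, 0.00, 0.03, 0.09, 0.01, 0.04])

stat, p_value = wilcoxon(dice_ct, dice_pet)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p_value:.4g}")
```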
The paper identifies two major pitfalls in current evaluation protocols: (1) entanglement of task complexity with modality, because existing benchmarks use different datasets for different modalities (e.g., CT for abdominal organs, MRI for cardiac or neuro structures), making it impossible to isolate modality effects; and (2) a structural bias in data distribution, where training and testing data are dominated by high‑contrast modalities, leaving functional imaging virtually unseen. Consequently, foundation models learn to rely on sharp anatomical edges rather than the diffuse metabolic patterns characteristic of PET.
The authors argue that the “general‑purpose” label is largely an illusion created by overly optimistic benchmark results that do not reflect real‑world clinical diversity. They call for a paradigm shift toward multi‑modal pre‑training, modality‑agnostic prompts, and evaluation frameworks that deliberately include functional imaging. The UMD dataset, released publicly, provides a clean test‑bed for such future work.
Limitations of the study include its focus on zero‑shot performance; the authors do not explore whether fine‑tuning or domain adaptation could rescue PET performance. Additionally, variations in PET tracer types, acquisition protocols, and noise levels are not dissected, which could further affect model robustness.
In conclusion, this work provides compelling empirical evidence that current 3D medical foundation models are far from truly modality‑agnostic. By rigorously isolating modality as the independent variable, the authors reveal a substantial generalization illusion and set the stage for the next generation of truly multi‑modal medical AI systems.