Not Quite Anything: Overcoming SAMs Limitations for 3D Medical Imaging

February 23, 2026

Reading time: 6 minute

...

📝 Original Info

Title: Not Quite Anything: Overcoming SAMs Limitations for 3D Medical Imaging
ArXiv ID: 2511.19471
Date: 2025-11-22
Authors: Researchers from original ArXiv paper

📝 Abstract

Foundation segmentation models (such as SAM and SAM-2) perform well on natural images but struggle with brain MRIs where structures like the caudate and thalamus lack sharp boundaries and have poor contrast. Rather than fine-tune these models (e.g., MedSAM), we propose a compositional alternative where we treat the foundation model's output as an additional input channel (like an extra color channel) and pass it alongside the MRI to highlight regions of interest. We generate SAM-2 segmentation prompts (e.g., a bounding box or positive/negative points) using a lightweight 3D U-Net that was previously trained on MRI segmentation. However, the U-Net might have been trained on a different dataset. As such it's guesses for prompts are often inaccurate but often in the right region. The edges of the resulting foundation segmentation guesses are then smoothed to allow better alignment with the MRI. We also test prompt-less segmentation using DINO attention maps within the same framework. This "has-a" architecture avoids modifying foundation weights and adapts to domain shift without retraining the foundation model. It achieves 96% volume accuracy on basal ganglia segmentation, which is sufficient for our study of longitudinal volume change. Our approach is faster, more label-efficient, and robust to out-of-distribution scans. We apply it to study inflammation-linked changes in sudden-onset pediatric OCD.

💡 Deep Analysis

Deep Dive into Not Quite Anything: Overcoming SAMs Limitations for 3D Medical Imaging.

Foundation segmentation models (such as SAM and SAM-2) perform well on natural images but struggle with brain MRIs where structures like the caudate and thalamus lack sharp boundaries and have poor contrast. Rather than fine-tune these models (e.g., MedSAM), we propose a compositional alternative where we treat the foundation model’s output as an additional input channel (like an extra color channel) and pass it alongside the MRI to highlight regions of interest. We generate SAM-2 segmentation prompts (e.g., a bounding box or positive/negative points) using a lightweight 3D U-Net that was previously trained on MRI segmentation. However, the U-Net might have been trained on a different dataset. As such it’s guesses for prompts are often inaccurate but often in the right region. The edges of the resulting foundation segmentation guesses are then smoothed to allow better alignment with the MRI. We also test prompt-less segmentation using DINO attention maps within the same framework. This

📄 Full Content

Published as a conference paper at AIAS 2025 Not Quite Anything: Overcoming SAM’s Limitations for 3D Medical Imaging Keith Moore Deptartment of Biomedical Data Science Stanford University Stanford, CA 94304, USA kem1@stanford.edu Abstract Foundation segmentation models (such as SAM and SAM-2) perform well on natural images but struggle with brain MRIs where structures like the caudate and thalamus lack sharp boundaries and have poor contrast. Rather than fine-tune these models (e.g., MedSAM), we propose a compo- sitional alternative where we treat the foundation model’s output as an additional input channel (like an extra color channel) and pass it alongside the MRI to highlight regions of interest. We generate SAM-2 segmentation prompts (e.g., a bounding box or posi- tive/negative points) using a lightweight 3D U-Net that was previously trained on MRI segmentation. However, the U-Net might have been trained on a different dataset. As such it’s guesses for prompts are often inaccurate but often in the right region. The edges of the resulting foundation segmen- tation guesses are then smoothed to allow better alignment with the MRI. We also test prompt-less segmentation using DINO attention maps within the same framework. This “has-a” architecture avoids modifying foundation weights and adapts to domain shift without retraining the foundation model. It achieves 96% volume accuracy on basal ganglia segmentation, which is sufficient for our study of longitudinal volume change. Our approach is faster, more label-efficient, and robust to out-of-distribution scans. We apply it to study inflammation-linked changes in sudden-onset pediatric OCD. 1 Introduction Despite significant progress in segmenting natural images, foundation models like SAM (Kirillov et al., 2023), SAM-2 (Ravi et al., 2024), and DINO (Juneja et al., 2024) fail to properly segment brain Magnetic Resonance Images (MRIs). The issue is that regions in the brain (particularly the subcortical regions) are difficult to distinguish due to limited contrast or clear visual boundaries. Retraining or fine-tuning the model on medical images helps (e.g., MedSAM (Ravi et al., 2024)) but the resulting segmentation is inaccurate even with extensive prompting (see Figure 1 where we are trying to isolate the caudate with SAM). We have had success using supervised models (such as 3D U-Net and UNETR) for brain segmentation (Moore, 2022), but these models are sensitive to dataset distribution shift. We thought SAM might be doing poorly because it lacked 3D spatial information. In SAM-2, we tried encoding the Z dimension along the time dimension (i.e., treating the slices like frames of a movie), but this had no improvement to segmentation accuracy. The other option of reachitecting SAM-2 to add the Z-axies and retrain on MRI images was impractical. This led us to explore a different strategy. Instead of fine-tuning the foundation model, we use the existing inference engine to generate a 2D segmentation guess of each MRI slice as a second “channel” or color. This output is passed alongside the original MRI slice-by-slice into a multi-channel 3D U-Net to reconstruct 1 arXiv:2511.19471v1 [eess.IV] 22 Nov 2025 Published as a conference paper at AIAS 2025 Figure 1: Poor isolation of caudate despite point prompts and bounding box (red-positive, blue-negative, and green-”don’t care”). the segmented volume. Our hypothesis was that the foundation model might help the U-Net find the intended structures. In this architecture, the likely-inaccurate ”guesses” by the foundation model act like an alternate imaging sequence (similar to T2 or FLAIR MRI imaging modalities), drawing attention to features in the T1 image. This “has-a” construction leverages the foundation model for landmark discovery, adapts to distribution shift, and remains fast to train with minimal supervision. 1.1 Clinical Context and Motivation Our clinical objective is to analyze thousands of pediatric MRIs to investigate immune- related inflammation in neuropsychiatric disorders. We are particularly interested in the caudate and thalamus (small subcortical structures comprising approximately 20,000 voxels within an 8.6M voxel 3D image (240×240×155)). The effect we aim to measure is roughly a 10% change in volume across two timepoints requiring a segmentation accuracy exceeding 94% to distinguish signal from noise (see Appendix A.5). Unlike segmentation tasks involving large lesions or tumors, our goal is to detect small volume changes in pre-defined structures. Even small boundary errors can obscure clinically meaningful differences, requiring models with higher edge precision than typical MRI segmentation tools. Real-world scans vary in quality, resolution, and orientation across sites. While improved instrumentation may help, clinical research must often work with heterogeneous and imperfect data. We sought methods that were robust to distribution shift, do not require retraining, and can operate at scale w

…(Full text truncated)…

📄 Read Full PDF on ArXiv