Medical image foundation models (MIFMs) have demonstrated remarkable potential for a wide range of clinical tasks, yet their development is constrained by the scarcity, heterogeneity, and high cost of large-scale annotated datasets. Here, we propose RaSD (Randomized Synthesis and Disentanglement), a scalable framework for pre-training MIFMs entirely on synthetic data. By modeling anatomical structures and appearance variations with randomized Gaussian distributions, RaSD exposes models to abundant multi-scale structural and appearance perturbations, forcing them to rely on invariant, task-relevant anatomical cues rather than dataset-specific textures and thereby enabling robust, transferable representation learning. We pre-trained RaSD on 1.2 million 3D volumes and 9.6 million 2D images, and extensively evaluated the resulting models across 6 imaging modalities, 48 datasets, and 56 downstream tasks. Across all evaluated downstream tasks, RaSD consistently outperforms models trained from scratch, achieves the best performance on 17 tasks, and remains comparable to models pre-trained on large real datasets on most of the rest. These results demonstrate the capacity of synthetic data alone to drive robust representation learning. Our findings establish a paradigm shift in medical AI, demonstrating that synthetic data can serve as a "free lunch" for scalable, privacy-preserving, and clinically generalizable foundation models.
Medical image foundation models (MIFMs), pre-trained on large and diverse datasets, have emerged as a promising paradigm for a broad spectrum of medical imaging tasks, with increasing relevance to clinical translation [1][2][3][4][5]. Through large-scale pre-training (Fig. 1-b), MIFMs acquire transferable visual representations that can be efficiently adapted to specific downstream applications [1], conferring two major advantages: 1) Enhanced performance: exposure to diverse data distributions during pre-training enables MIFMs to encode domain-invariant features, improving accuracy and robustness on fine-tuned tasks. 2) Reduced training costs: their generalization capacity reduces the need for extensive task-specific annotations and computational overhead, thereby streamlining model development (see the sketch below). Such advantages have positioned MIFMs as a versatile tool for critical medical imaging applications [3][4][5] with growing potential for clinical integration.
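To make the "reduced training cost" advantage concrete, the following is a minimal, hedged sketch of the standard adapt-a-pre-trained-backbone workflow: the backbone is frozen and only a small task head is trained on downstream labels. The encoder architecture, checkpoint filename ("mifm_encoder.pt"), and segmentation head here are illustrative placeholders, not an actual MIFM release or the authors' interface.

```python
# Illustrative sketch of adapting a pre-trained backbone to a downstream
# task; the architecture, checkpoint path, and head are placeholders.
import os
import torch
import torch.nn as nn

# Stand-in for a pre-trained MIFM encoder.
encoder = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
)
if os.path.exists("mifm_encoder.pt"):  # hypothetical checkpoint
    encoder.load_state_dict(torch.load("mifm_encoder.pt"))

# Freeze the backbone: only the lightweight task head is trained, so far
# fewer task-specific labels and much less compute are needed.
for p in encoder.parameters():
    p.requires_grad = False

head = nn.Conv2d(64, 2, kernel_size=1)  # per-pixel 2-class task head
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 1, 128, 128)         # toy image batch
y = torch.randint(0, 2, (4, 128, 128))  # toy pixel-wise labels
loss = loss_fn(head(encoder(x)), y)
loss.backward()
opt.step()
print(float(loss))
```

Only the head's parameters receive gradients here, which is precisely why adaptation needs far fewer annotations and less compute than training end to end.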
Despite promising advances in MIFMs [3][6][7][8], data remains a central bottleneck in their pre-training (Fig. 1-a). The acquisition of medical images is both expensive and constrained by regulatory and technical barriers [9]. Variability across institutions, imaging devices, and anatomical sites further exacerbates data heterogeneity [10]. These factors substantially increase the cost of medical data collection, limiting the scale and diversity of the datasets required for model generalization across diverse clinical scenarios [1]. In addition, high-resolution and volumetric data from modalities such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and whole slide imaging (WSI) require significant storage space and high retrieval bandwidth. Supervised learning [11] becomes impractical due to the high cost of manual annotation for large-scale datasets, while self-supervised learning [12][13][14] struggles to provide sufficiently strong supervision signals and often fails to disentangle the complex semantic information in medical images. Collectively, these limitations impair the ability of MIFMs to learn transferable and discriminative features across diverse clinical contexts.
Recent advances have underscored the growing potential of synthetic data in model training [4][15][16][17]. Synthetic data can be generated efficiently and flexibly, producing diverse images with rich annotations at minimal cost. By eliminating concerns over data privacy [9], manual annotation, and storage costs, synthetic data offers a promising solution to the challenges of MIFM pre-training. Synthetic images can be generated on demand with automatically assigned labels, enabling scalable pre-training without relying on costly real datasets or risking patient privacy. A growing body of work has focused on training generative foundation models (FMs) to produce synthetic data tailored to specific downstream tasks [4,16,17]. However, these generative FMs still depend heavily on large quantities of real data during training, which may be costly, sensitive, or limited in coverage, and thus cannot fully escape the practical constraints of real-world data. In contrast, as shown in Fig. 1-a, rule-based or randomized synthesis approaches [18][19][20] suggest the possibility of constructing synthetic datasets without any real data input, particularly in medical imaging domains where structure and anatomy follow relatively well-defined patterns. These observations lead us to formulate a hypothesis: "Synthetic medical image data may serve as a free lunch for MIFM pre-training, making it possible to train MIFMs without real data." Although a preliminary work [20] has explored randomized synthesis for MIFM pre-training, its method focuses on enforcing appearance consistency, which limits performance on tasks where appearance itself is an informative feature.
In this study, we propose a scalable synthetic data-driven pre-training framework for MIFMs, Randomized Synthesis and Disentanglement (RaSD, Fig. 1-c,d), which operates entirely without reliance on real medical data. RaSD synthesizes diverse image-label pairs by sampling from randomized statistical distributions that capture common structural and appearance variations in medical images, eliminating the requirement for physical imaging devices and mitigating privacy risks. Its online synthesis pipeline enables streaming training, avoiding the storage of large datasets. Each synthetic image is generated alongside pixel-wise annotations, obviating manual labeling and facilitating efficient supervised training. By exposing models to large-scale and diverse synthetic data, RaSD fosters the learning of transferable features that generalize beyond the limitations of real data scarcity. This framework establishes a new paradigm in medical AI, liberating MIFM pre-training from its dependence on real data.
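To make this pipeline concrete, below is a minimal sketch of a randomized synthesis loop in the spirit described above: random structures are sampled as smoothed Gaussian random fields at randomized scales, appearance is rendered with per-class Gaussian intensity models plus a smooth bias field, and image-label pairs are streamed online. All function names, shapes, and parameter ranges are illustrative assumptions for exposition, not the authors' implementation.

```python
# Hedged sketch of randomized image-label synthesis; names, shapes, and
# parameter ranges are assumptions, not the RaSD implementation.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def sample_label_map(shape=(128, 128), n_classes=6):
    # One smoothed Gaussian random field per class, each at a randomized
    # spatial scale; the argmax over classes yields a random "anatomy".
    scales = rng.uniform(4.0, 16.0, size=n_classes)
    fields = np.stack(
        [gaussian_filter(rng.standard_normal(shape), s) for s in scales]
    )
    return fields.argmax(axis=0).astype(np.int32)

def render_image(labels, n_classes=6):
    # Per-class Gaussian intensity model: random mean/std per structure.
    means = rng.uniform(0.0, 1.0, size=n_classes)
    stds = rng.uniform(0.01, 0.10, size=n_classes)
    img = means[labels] + stds[labels] * rng.standard_normal(labels.shape)
    # Smooth multiplicative bias field and random blur mimic scanner
    # appearance variation.
    bias = gaussian_filter(rng.standard_normal(labels.shape), sigma=32.0)
    img = img * np.exp(0.5 * bias)
    return gaussian_filter(img, sigma=rng.uniform(0.0, 1.0)).astype(np.float32)

def synth_stream(n_batches):
    # Online synthesis: (image, pixel-wise label) pairs are generated on
    # the fly, so no dataset needs to be stored on disk.
    for _ in range(n_batches):
        labels = sample_label_map()
        yield render_image(labels), labels

for image, labels in synth_stream(2):
    print(image.shape, labels.shape, int(labels.max()))
```

Because every image is rendered from its own label map, dense pixel-wise supervision comes for free, which is the property the framework exploits to enable supervised pre-training without manual annotation.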
Fig. 1 Overview of the proposed RaSD framework. a) Comparison between the conventional real-data paradigm and our RaSD paradigm. Real data-based pre-training of MIFMs suffers from costly and restricted data acquisition, labor-intensive annotation, and privacy risks.