Any-to-All MRI Synthesis: A Unified Foundation Model for Nasopharyngeal Carcinoma and Its Downstream Applications

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Magnetic resonance imaging (MRI) is essential for nasopharyngeal carcinoma (NPC) radiotherapy (RT), but practical constraints, such as patient discomfort, long scan times, and high costs, often lead to incomplete modalities in clinical practice, compromising RT planning accuracy. Traditional MRI synthesis methods are modality-specific, limited in anatomical adaptability, and lacking in clinical interpretability, failing to meet the needs of NPC RT. Here, we developed a unified foundation model integrating contrastive visual representation learning and vision-language alignment (VLA) to enable any-to-all MRI synthesis. The model uses a contrastive encoder to learn modality-invariant representations and a CLIP-based text-informed decoder for semantically consistent synthesis, so that a single unified model supports any-to-all MRI synthesis. Trained on 40,825 images from 13 institutions, it achieves consistently high performance (average SSIM 0.90, PSNR 27 dB) across 26 internal/external validation sites (15,748 images), with superior synthesis fidelity and robustness to noise and domain shifts. Its unified representation also enhances downstream RT-relevant tasks (e.g., segmentation). This work advances digital medicine solutions for NPC care by leveraging foundation models to bridge technical synthesis and clinical utility.


💡 Research Summary

This paper addresses a critical bottleneck in nasopharyngeal carcinoma (NPC) radiotherapy: the frequent absence of one or more MRI sequences due to patient discomfort, long acquisition times, cost constraints, or contraindications to contrast agents. Missing sequences such as T1‑weighted contrast‑enhanced (T1c) or diffusion‑weighted imaging (DWI) degrade tumor delineation, staging, and adaptive treatment planning. Existing MRI synthesis approaches are largely modality‑specific, trained on a single anatomical region (often the brain), and provide little clinical interpretability, limiting their utility for NPC workflows.

The authors propose a unified “any‑to‑all” MRI synthesis framework, named OmniSyn, that can generate any missing MRI contrast from any available source contrast using a single foundation model. The architecture consists of two main components: (1) a contrastive visual encoder that learns modality‑invariant, anatomy‑preserving representations, and (2) a CLIP‑based text‑informed decoder that aligns visual features with clinically relevant language prompts (e.g., “skull base invasion”, “enhanced tumor”). The encoder is pre‑trained with a contrastive loss on paired multi‑contrast images from 13 institutions (40,825 slices), encouraging embeddings of the same anatomy across different modalities to be close while pushing apart embeddings from different patients or scanners. The decoder is fine‑tuned on the synthesis task, receiving both the modality‑invariant embedding and a textual description of the target contrast; a vision‑language alignment loss ensures that the generated image reflects the semantic intent of the prompt.
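The paper does not include code, but the pre-training objective described above, pulling together embeddings of the same anatomy seen under different contrasts while pushing apart other pairs, is the general form of an InfoNCE-style contrastive loss. The sketch below is a minimal numpy illustration of that form; the authors' exact loss, batching, and temperature are assumptions, not the published implementation.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    z_a, z_b: (n, d) L2-normalised embeddings of the same anatomy under
    two different MRI contrasts; row i of z_a pairs with row i of z_b.
    """
    n = len(z_a)
    logits = (z_a @ z_b.T) / temperature      # (n, n) similarity matrix
    idx = np.arange(n)                        # positive pairs on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_p = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()

    # cross-entropy in both directions (a -> b and b -> a)
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the two views of each subject are aligned (diagonal similarities dominate), the loss is near zero; mismatched pairings drive it up, which is exactly the pressure that yields modality-invariant representations.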

Training proceeds in three stages: (i) contrastive pre‑training of the encoder, (ii) vision‑language alignment using a CLIP‑style objective, and (iii) supervised synthesis training of the decoder. This staged approach leverages large‑scale, multi‑institutional data to achieve strong generalization and robustness to domain shifts.

Performance is evaluated on 26 validation sites (15,748 slices) covering both internal and external cohorts with diverse scanner models and acquisition protocols. OmniSyn achieves an average structural similarity index (SSIM) of 0.90 and a peak signal‑to‑noise ratio (PSNR) of 27 dB, outperforming a comprehensive set of baselines: GAN‑based models (pix2pix, CycleGAN), transformer‑based networks (SwinUNet, ResViT), diffusion models (DDPM), and recent language‑guided synthesis methods (BrainMVP, TUMSyn). Quantitatively, OmniSyn records the lowest mean‑squared error (MSE) and the highest SSIM/PSNR across all tested modality pairs (e.g., T1→T1c, T2→T1). Qualitatively, synthesized images exhibit sharper tissue boundaries, reduced hallucination, and better preservation of low‑contrast structures, which are crucial for accurate tumor delineation.
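For readers unfamiliar with the headline metric, PSNR is computed directly from the mean-squared error between the reference and synthesized images. A minimal numpy version (the `data_range` default of 1.0 assumes intensity-normalized images; the paper's exact preprocessing is not specified here):

```python
import numpy as np

def psnr(reference, synthetic, data_range=1.0):
    """Peak signal-to-noise ratio (dB) between a reference and a synthetic image."""
    mse = np.mean((np.asarray(reference, dtype=np.float64)
                   - np.asarray(synthetic, dtype=np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")                       # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```

A PSNR of 27 dB on [0, 1]-normalized images corresponds to a root-mean-squared intensity error of roughly 0.045, i.e. about 4.5 % of the dynamic range.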

Robustness tests introduce controlled degradations such as motion artifacts, Gaussian noise, and intensity non‑uniformities. OmniSyn’s performance degrades minimally compared with baselines, demonstrating resilience to realistic clinical noise.
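The paper does not publish its degradation pipeline; the sketch below shows one plausible way to simulate two of the named corruptions (additive Gaussian noise and a smooth intensity non-uniformity) on a 2D slice. The sinusoidal bias field and the parameter values are illustrative assumptions only.

```python
import numpy as np

def degrade(image, noise_sigma=0.05, bias_strength=0.2, seed=0):
    """Apply additive Gaussian noise and a smooth multiplicative bias field,
    mimicking scanner noise and intensity non-uniformity on a [0, 1] image."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, noise_sigma, image.shape)
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # low-frequency multiplicative bias (hypothetical sinusoidal model)
    bias = 1.0 + bias_strength * (np.sin(np.pi * xx / max(w - 1, 1))
                                  * np.sin(np.pi * yy / max(h - 1, 1)))
    return np.clip(noisy * bias, 0.0, 1.0)
```

Running the synthesis model on such degraded inputs and re-measuring SSIM/PSNR against clean references quantifies the robustness gap the authors report.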

Beyond synthesis, the authors assess downstream utility in three clinically relevant tasks: (1) NPC tumor region‑of‑interest (ROI) segmentation, (2) brain tissue segmentation, and (3) clinical stage prediction. Using the synthesized images as inputs, segmentation models achieve Dice similarity coefficients of ≥0.88, a 3–5 % improvement over models trained on original incomplete data. Stage prediction accuracy also rises by 2–4 % when the text‑aligned representations are incorporated. These results indicate that the unified visual‑language embedding learned by OmniSyn transfers effectively to downstream radiotherapy tasks.
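The Dice similarity coefficient used to score the segmentation tasks is a standard overlap measure; a minimal numpy version is shown below (the convention of returning 1.0 when both masks are empty is a common choice, not necessarily the authors'):

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary segmentation masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0                     # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / total
```

A Dice score of ≥0.88 thus means the predicted and reference tumor masks share at least 88 % of their combined volume, weighted toward the overlap.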

Ablation studies dissect the contributions of each component. Removing the contrastive encoder reduces SSIM by ~0.04, while omitting the text‑informed decoder lowers PSNR by ~2 dB, confirming that both modality‑invariant encoding and semantic guidance are essential. Experiments with different CLIP variants show that fine‑tuned CLIP‑B/16 yields the best zero‑shot alignment, while larger models provide marginal gains at higher computational cost.

Limitations are acknowledged: (i) the reliance on manually crafted textual prompts requires radiology expertise, potentially limiting scalability; (ii) inference time for full 3D volumes remains higher than lightweight GANs, posing challenges for real‑time clinical deployment; (iii) the study focuses on MRI only, whereas integrating CT could further enhance treatment planning. Future work will explore automated prompt generation, model compression techniques, and multimodal (CT‑MRI) fusion within the same foundation framework.

In summary, this work introduces the first foundation‑model‑based, any‑to‑all MRI synthesis system tailored for nasopharyngeal carcinoma. By unifying contrastive visual representation learning with vision‑language alignment, the model delivers high‑fidelity, semantically consistent synthetic MR images across diverse institutions, while simultaneously improving downstream radiotherapy tasks such as tumor segmentation and stage prediction. The approach represents a significant step toward clinically viable, AI‑driven imaging solutions that bridge the gap between technical image synthesis and real‑world oncologic decision making.

