Supervise-assisted Multi-modality Fusion Diffusion Model for PET Restoration

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv paper.

Positron emission tomography (PET) offers powerful functional imaging but involves radiation exposure. Efforts to reduce this exposure by lowering the radiotracer dose or shortening the scan time degrade image quality. Using magnetic resonance (MR) images, with their clearer anatomical information, to restore standard-dose PET (SPET) from low-dose PET (LPET) is a promising approach, but it faces two challenges: structural and textural inconsistencies during multi-modality fusion, and the mismatch introduced by out-of-distribution (OOD) data. In this paper, we propose a supervise-assisted multi-modality fusion diffusion model (MFdiff) to address these challenges and achieve high-quality PET restoration. Firstly, to fully utilize auxiliary MR images without introducing extraneous details into the restored image, a multi-modality feature fusion module is designed to learn an optimized fusion feature. Secondly, using the fusion feature as an additional condition, high-quality SPET images are iteratively generated by a diffusion model. Furthermore, we introduce a two-stage supervise-assisted learning strategy that harnesses both generalized priors from simulated in-distribution datasets and specific priors tailored to in-vivo OOD data. Experiments demonstrate that the proposed MFdiff effectively restores high-quality SPET images from multi-modality inputs and outperforms state-of-the-art methods both qualitatively and quantitatively.


💡 Research Summary

This paper addresses the critical challenge of restoring high‑quality standard‑dose PET (SPET) images from low‑dose PET (LPET) scans, which are acquired to minimize radiation exposure. Leveraging simultaneously acquired MR images, the authors propose a novel framework called MFdiff (Multi‑modality Fusion Diffusion). MFdiff consists of two main components: a multi‑modality feature fusion module and a conditional diffusion restoration module.

The feature fusion module tackles the inherent mismatch between PET and MR modalities. It is divided into an Intra‑Modality Learning (IML) sub‑module and a Cross‑Modality Aggregation (CMA) sub‑module. IML employs two independent Transformer‑based “Modality Encoders”—one for LPET and one for MR—to extract rich global and detailed representations. These representations are further split into global (GP, GM) and detailed (DP, DM) feature maps via a dual‑branch Global/Detailed Encoder. By enforcing consistency on the global features while preserving modality‑specific details, the network avoids injecting irrelevant MR structures into the PET reconstruction. CMA then fuses the modality‑specific and shared features using channel concatenation, element‑wise multiplication, and addition, producing an optimized fusion feature that guides the subsequent diffusion process.
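To make the CMA fusion concrete, here is a minimal NumPy sketch of the described operator set (channel concatenation, element‑wise multiplication, and addition) applied to the global (GP, GM) and detailed (DP, DM) feature maps. The function name, the averaging of the global branches, and the exact operator order are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cross_modality_aggregate(gp, gm, dp, dm):
    """Illustrative CMA-style fusion of global (gp, gm) and detailed
    (dp, dm) feature maps, each of shape (H, W, C). Operator order is
    a sketch, not the authors' exact design."""
    shared = 0.5 * (gp + gm)        # global features, made consistent across modalities
    interact = dp * dm              # element-wise multiplication highlights co-occurring detail
    summed = dp + dm                # addition keeps modality-specific detail from both inputs
    # channel concatenation produces the optimized fusion feature
    return np.concatenate([shared, interact, summed], axis=-1)

# toy example: 4x4 spatial grid, 8 channels per branch
h, w, c = 4, 4, 8
rng = np.random.default_rng(0)
gp, gm, dp, dm = (rng.random((h, w, c)) for _ in range(4))
fused = cross_modality_aggregate(gp, gm, dp, dm)
print(fused.shape)  # (4, 4, 24)
```

The fused tensor then serves as the conditioning input for the diffusion stage described next.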

The conditional diffusion module treats the fused feature as a conditioning signal for a denoising diffusion probabilistic model (DDPM). During the reverse diffusion steps, the fusion feature steers the noise‑removal trajectory, ensuring that the generated SPET respects both the anatomical guidance from MR and the quantitative fidelity of PET. Architectural enhancements such as Gaussian perturbation, invertible residual blocks, and Transformer blocks improve stability, long‑range dependency modeling, and sample diversity, overcoming typical GAN drawbacks like mode collapse and over‑smoothing.
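The reverse process follows the standard DDPM update, with the fusion feature entering only through the noise estimate eps_theta(x_t, t, fusion_feature). The sketch below shows one reverse step under a toy noise schedule; the schedule values and the simple sigma choice are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def ddpm_reverse_step(x_t, t, eps_pred, alphas, alpha_bars, rng):
    """One standard DDPM reverse step x_t -> x_{t-1}.
    eps_pred stands in for eps_theta(x_t, t, fusion_feature), i.e. the
    network's noise estimate given the fused multi-modality condition."""
    a_t, ab_t = alphas[t], alpha_bars[t]
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(a_t)
    if t > 0:
        sigma = np.sqrt(1.0 - a_t)  # simple variance choice; DDPM also admits beta_tilde
        return mean + sigma * rng.standard_normal(x_t.shape)
    return mean                     # final step is deterministic

# toy linear schedule and sampling loop
T = 10
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))     # start from pure noise
for t in reversed(range(T)):
    eps = np.zeros_like(x)          # placeholder for the conditioned network output
    x = ddpm_reverse_step(x, t, eps, alphas, alpha_bars, rng)
print(x.shape)  # (4, 4)
```

In MFdiff the conditioning is what steers this trajectory toward an SPET image consistent with both modalities; here the network is stubbed out to keep the update itself visible.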

To mitigate the scarcity of paired LPET‑SPET data and the prevalence of out‑of‑distribution (OOD) scans in clinical practice, the authors introduce a two‑stage supervise‑assisted learning strategy. In stage 1, a large synthetic dataset—generated via physics‑based simulations of PET acquisition—is used to learn generalized priors, giving the model robust baseline denoising capabilities across a wide range of noise levels and scanner settings. In stage 2, the model is fine‑tuned on a limited set of real in‑vivo OOD data (different scanners, tracers, acquisition times), allowing it to acquire specific priors that adapt to domain shifts while retaining the generalized knowledge from stage 1.
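The two-stage schedule can be sketched as two sequential training loops over different data sources, with a smaller fine-tuning learning rate to preserve the stage-1 priors. The `ToyModel` class, its `diffusion_loss`/`update` methods, and the specific learning rates are hypothetical stand-ins, not the paper's API.

```python
class ToyModel:
    """Hypothetical stand-in for the diffusion network; records each
    gradient step so the two stages can be inspected."""
    def __init__(self):
        self.steps = []
    def diffusion_loss(self, batch):
        return float(sum(batch))        # placeholder for the denoising loss
    def update(self, loss, lr):
        self.steps.append((lr, loss))   # placeholder for an optimizer step

def two_stage_training(model, synthetic_batches, ood_batches,
                       lr_pre=1e-4, lr_ft=1e-5):
    # Stage 1: generalized priors from abundant simulated in-distribution data
    for batch in synthetic_batches:
        model.update(model.diffusion_loss(batch), lr=lr_pre)
    # Stage 2: specific priors from scarce in-vivo OOD data; the lower
    # learning rate limits forgetting of the stage-1 knowledge
    for batch in ood_batches:
        model.update(model.diffusion_loss(batch), lr=lr_ft)

m = ToyModel()
two_stage_training(m, synthetic_batches=[[1, 2]] * 3, ood_batches=[[3]] * 2)
print(len(m.steps))  # 5
```

The key design point is ordering, not architecture: the same objective is optimized in both stages, and only the data distribution and step size change.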

Extensive experiments were conducted on both phantom brain data and in‑vivo brain PET/MR datasets. Quantitative metrics (PSNR, SSIM, RMSE) show that MFdiff outperforms state‑of‑the‑art 2D/3D CNN, GAN‑based, and direct multimodal diffusion approaches by 2–3 dB in PSNR and 0.02–0.04 in SSIM. Qualitative assessments confirm superior preservation of anatomical details and fewer introduced artifacts, especially under OOD conditions. Ablation studies demonstrate that removing any of the IML sub‑module, the CMA sub‑module, or the two‑stage training strategy leads to significant performance degradation, underscoring the importance of each component.
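For reference, PSNR (the headline metric above) is computed from the mean squared error against the ground-truth SPET image; a 2–3 dB gain corresponds to a substantial MSE reduction. This is the standard definition, not code from the paper.

```python
import numpy as np

def psnr(ref, img, data_range=1.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    restoration, using the standard MSE-based definition."""
    mse = np.mean((np.asarray(ref) - np.asarray(img)) ** 2)
    if mse == 0:
        return float("inf")             # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.zeros((8, 8))
noisy = ref + 0.1                       # uniform error of 0.1 -> MSE = 0.01
print(round(psnr(ref, noisy), 2))  # 20.0
```

Because PSNR is logarithmic, a +3 dB improvement halves the MSE, which is why differences of 2–3 dB are considered meaningful in restoration benchmarks.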

In summary, MFdiff delivers a comprehensive solution for PET restoration: (1) a sophisticated fusion mechanism that harmonizes PET and MR information without propagating mismatched features; (2) a conditional diffusion backbone that generates high‑fidelity SPET images with strong sample diversity; and (3) a pragmatic training paradigm that leverages abundant synthetic data while efficiently adapting to scarce, domain‑specific clinical data. The work advances the field toward safer, lower‑dose PET imaging, though future extensions to full 3D volumetric processing and broader clinical applications (e.g., oncology, cardiology) remain open research directions.

