Arxiv 2512.22237
📝 Original Info
- Title: Arxiv 2512.22237
- ArXiv ID: 2512.22237
- Date: 2026-02-23
- Authors: Researchers mentioned in the ArXiv original paper
📝 Abstract
Low-dose PET imaging is crucial for reducing patient radiation exposure but faces challenges like noise interference, reduced contrast, and difficulty in preserving physiological details. Existing methods often neglect both projection-domain physics knowledge and patient-specific meta-information, which are critical for functional-semantic correlation mining. In this study, we introduce a meta-information guided cross-domain synergistic diffusion model (MiG-DM) that integrates comprehensive cross-modal priors to generate high-quality PET images. Specifically, a meta-information encoding module transforms clinical parameters into semantic prompts by considering patient characteristics, dose-related information, and semi-quantitative parameters, enabling cross-modal alignment between textual meta-information and image reconstruction. Additionally, the cross-domain architecture combines projection-domain and image-domain processing. In the projection domain, a specialized sinogram adapter captures global physical structures through convolution operations equivalent to global image-domain filtering. Experiments on the UDPET public dataset and clinical datasets with varying dose levels demonstrate that MiG-DM outperforms state-of-the-art methods in enhancing PET image quality and preserving physiological details.
📄 Full Content
With the development of artificial intelligence, deep learning-based approaches have shown promise in improving PET image quality, leveraging convolutional neural networks (CNNs) [8]-[10] or generative adversarial networks (GANs) [11]-[13] to map noisy input data to high-quality images [14]. For example, Peng et al. [15] proposed a novel PET image reconstruction method by integrating CT images as inputs into a 3D U-Net-based [16] architecture to boost image quality. Chen et al. [17] introduced a 3D image space shuffle U-Net, incorporating shuffle/unshuffle layers into the U-Net architecture for low-dose reconstruction. Yang et al. [18] proposed a conditional weakly-supervised multi-task learning strategy with a multi-channel self-attention module, which improves noise reduction and contrast recovery by incorporating an auxiliary task as an anatomical regularizer. Ouyang et al. [19] employed a basic GAN with task-specific perceptual loss for PET reconstruction through adversarial learning [20]-[22], and incorporated a pre-trained Amyloid classifier for guidance.
Recently, diffusion models have emerged as a powerful class of generative models, capable of learning complex data distributions and generating high-fidelity samples [23], [24]. For example, Gong et al. [25] applied denoising diffusion probabilistic models (DDPM) [26] to PET image reconstruction, integrating MR prior information and PET data-consistency constraints to enhance performance and reduce uncertainty. Shen et al. [27] proposed the bidirectional condition diffusion probabilistic model, which learns a score function network [28] via evidence lower-bound optimization and employs two handcrafted conditions in latent space to generate high-quality images. Jiang et al. [29] created an unsupervised PET enhancement framework using a latent diffusion model trained on full-dose PET images, incorporating PET compression, Poisson diffusion, and CT-guided cross-attention. Han et al. [30] proposed a diffusion model-based PET reconstruction framework with a coarse prediction module and an iterative refinement module, enhanced by auxiliary guidance and contrastive diffusion strategies. Moreover, Huang et al. [31] developed a diffusion-transformer model integrating diffusion and transformer techniques with a joint compact prior to boost image quality and protect lesion details. Pan et al. [32] designed a diffusion-based PET consistency model that enhances low-dose PET image quality by learning a consistency function during reverse diffusion and employing shifted-window vision transformers. Xie et al. [33] established a dose-aware diffusion model for 3D low-dose PET imaging using neighboring slices as conditional information. Nevertheless, most of these methods operate in a single domain, either the image domain or the sinogram domain, without fully exploiting the complementary information between the two domains.
Furthermore, these methods predominantly focus on image-domain features (e.g., texture, intensity distributions), neglecting the complementary value of meta-information. This oversight limits their ability to exploit functional-semantic correlations, leading to suboptimal preservation of physiological details in reconstructed images.
Considering all the above factors, we propose a Meta-information Guided cross-domain synergistic Diffusion Model (MiG-DM) for low-dose PET reconstruction. This model bridges the gap by aligning textual semantic cues with image reconstruction, thereby enhancing both structural fidelity and functional interpretability. At the same time, the cross-domain framework, which jointly processes data in the image and sinogram domains, offers a more comprehensive understanding of the PET imaging process, enabling more effective noise suppression and feature preservation. The main contributions of this work are summarized as follows:
Adaptive MI Encoding for Functional-semantic Deep Coupling in PET Reconstruction. A meta-information (MI) encoding module is introduced to achieve cross-modal alignment of MI semantics with image reconstruction in PET imaging. By converting PET-specific functional parameters into semantic prompts, considering patient characteristics, dose-related information, and semi-quantitative parameters, the module facilitates the creation of semantic prompt vectors through an MI encoder.
Cross-domain Synergy for Global Physical and Local Detail Optimization. A reconstruction framework that integrates projection-domain and image-domain processing is proposed to utilize both global and local information. In the projection domain, there is a specialized sinogram adapter that transforms raw projection data into feature space, effectively capturing the global physical structure of radiation distribution.
The remainder of this paper is organized as follows. Section II provides a concise overview of related works in the field. Section III elaborates on the key idea of the proposed method. The experimental setup and corresponding results are detailed in Section IV. A comprehensive discussion of the findings is presented in Section V, and the paper concludes with a succinct summary in Section VI.
Diffusion models have shown great potential for PET image reconstruction, where the primary goal is to restore high-quality images from extremely noisy data. They can typically be divided into a forward diffusion process and a reverse denoising process. The forward diffusion process acts on a full-dose PET image $x_0$, continuously injecting noise as the time step increases until the image becomes pure Gaussian noise. It can be represented by a Markov chain:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\right),$$
where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Here, $\beta_t$ represents the variance schedule that controls the amount of noise added, and $x_t$ can be sampled in closed form as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. The reconstruction process reverses the forward process by learning to predict the noise at each time step. The reverse denoising process is likewise represented by a Markov chain:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2 \mathbf{I}\right),$$
where $\mu_\theta(x_t, t)$ and $\sigma_t^2$ represent the mean and variance of the Gaussian distribution, respectively. The network is trained with the noise-prediction objective
$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\!\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right],$$
where $\epsilon$ denotes the noise added in the diffusion process, and $\epsilon_\theta(x_t, t)$ represents the predicted noise. Since the diffusion model learns image features at different levels of noise, it can learn the image distribution more efficiently and achieve more outstanding generation quality compared to other generative models. Moreover, by injecting conditions into the reverse denoising process, the diffusion model can achieve accurate generation control and complete conditional generation tasks such as image-to-image [34], [35] or text-to-image [36], [37] translation.
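The forward process above admits a closed-form sample at any timestep. A minimal numerical sketch (the linear β schedule and all variable names are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Minimal sketch of the DDPM forward process with an assumed linear beta schedule.
def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)    # cumulative signal-retention factor
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Closed-form sample x_t ~ q(x_t | x_0)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

betas, alpha_bars = make_schedule()
rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 256))        # stand-in for a full-dose PET slice
xT = q_sample(x0, 999, alpha_bars, rng)
# As t grows, alpha_bar_t -> 0, so x_t approaches pure Gaussian noise.
```

At the final timestep the retained signal fraction is negligible, which is exactly the "pure Gaussian noise" endpoint the text describes.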
For fine-tuning large pre-trained models, full parameter updates incur substantial computational and memory overhead. To address this, low-rank adaptation (LoRA) [38] introduces an efficient decomposition strategy that freezes the original pre-trained weights $W_0 \in \mathbb{R}^{d \times k}$ and learns task-specific updates through two low-rank matrices, $W = W_0 + BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and rank $r \ll \min(d, k)$. The effectiveness of LoRA is grounded in the low intrinsic dimensionality of neural networks [39]. During full fine-tuning, meaningful parameter updates primarily occur within a low-dimensional subspace. The low-rank projection of LoRA effectively captures these essential updates, achieving performance comparable to full fine-tuning with dramatically reduced resource requirements.
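The low-rank update can be sketched in a few lines; the initialization follows the scheme described later in the text, while the dimensions themselves are illustrative:

```python
import numpy as np

# Sketch of the LoRA update W' = W0 + B @ A with frozen W0 (dimensions illustrative).
d, k, r = 64, 64, 4
rng = np.random.default_rng(0)
W0 = rng.standard_normal((d, k))                      # frozen pre-trained weight
A = rng.standard_normal((r, k)) * np.sqrt(2.0 / k)    # Kaiming-style init
B = np.zeros((d, r))                                  # zero init: W' == W0 at start

def lora_forward(x, W0, A, B):
    # Equivalent to applying the frozen weight plus a rank-r correction.
    return x @ (W0 + B @ A).T

x = rng.standard_normal((8, k))
out = lora_forward(x, W0, A, B)
# Trainable parameters: r * (d + k) = 512, versus d * k = 4096 for full fine-tuning.
```

Because B starts at zero, the adapted layer initially reproduces the pre-trained output exactly, and the update grows smoothly during training.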
This section details the MiG-DM framework that synergizes meta-information guidance with cross-domain diffusion modelling for low-dose PET reconstruction.
A. Motivation
Meta-information in PET imaging encompasses patient-specific physiological parameters and metrics, including body height, weight, radiotracer injection dose, and dynamic imaging parameters. These data encode functional and physiological contexts critical for interpreting metabolic activity visualized in PET images. Unlike anatomical imaging modalities like CT and MRI, PET relies on functional signals where meta-information correlates with biological processes such as glucose metabolism or receptor binding. For instance, Halpern et al. [40] demonstrated that overweight patients require an extended PET acquisition protocol. Masuda et al. [41] evaluated the effect of optimizing injected dose on the image quality of overweight patients using lutetium oxyorthosilicate PET/CT with high-performance detector electronics. Gu et al. [42] explored the diagnostic image quality and lesion detectability of BMI-based reduced injection doses for 18F-FDG PET/CT imaging. The relevance of meta-information to PET reconstruction is twofold: (1) it provides quantitative links between imaging signals and physiological reality, enabling more accurate modeling of radiotracer distribution; (2) it serves as prior knowledge to constrain the reconstruction process, particularly in low-dose scenarios where noise and artifacts compromise image quality. However, conventional deep learning-based methods mainly focus on image-domain features such as texture and intensity distributions, neglecting the complementary value of meta-information. This oversight limits their ability to exploit functional-semantic correlations, leading to suboptimal preservation of physiological details in reconstructed images.
Moreover, the sinogram represents the raw projection data in PET imaging, encoding the spatial distribution of coincidence events from radioactive tracers across multiple angular views. Employing projection-domain data for low-dose PET reconstruction offers several key advantages over image-domain methods. First, the projection domain inherently preserves the physical constraints of radiation transport, such as ray trajectories and emission distributions, thereby enabling direct modeling of the underlying physics. Second, unlike image-domain methods that are confined to local convolutional operations as illustrated in Fig. 2, the projection domain facilitates global information processing. Local operations in the projection domain induce global effects in the image domain, making cross-domain reconstruction particularly effective in optimizing both fine-scale and large-scale features for improved fidelity. Third, the sinogram exhibits structured redundancy derived from its geometric relationship between angular and radial coordinates. This structure enables efficient signal extraction without the need to contend with complex spatial textures present in image-domain data, allowing models to focus on genuine signal patterns. Finally, the projection domain is compatible with physical priors such as attenuation maps and system response functions. This compatibility helps mitigate challenges posed by limited data in low-dose scenarios, enhancing reconstruction fidelity by integrating domain-specific knowledge. Therefore, we develop a meta-information guided cross-domain synergistic diffusion model. By integrating a meta-information encoding module with a cross-domain architecture, the proposed model bridges the gap between textual metadata and image reconstruction while optimizing the use of cross-modal prior information. Fig. 3 illustrates the overall procedure of the proposed method.
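The local-to-global property claimed above can be illustrated with a toy back-projection: a single sinogram bin maps onto an entire line of image pixels. The grid and geometry below are simplified stand-ins, not the paper's system model:

```python
import numpy as np

# Toy illustration: back-projecting one sinogram bin (one angle theta, one
# radial offset r) spreads intensity along a full line through the image,
# so a local projection-domain edit has a global image-domain footprint.
def backproject_bin(size, theta, r):
    ys, xs = np.mgrid[0:size, 0:size] - size / 2.0
    # perpendicular distance of each pixel from the line x*cos(theta) + y*sin(theta) = r
    dist = np.abs(xs * np.cos(theta) + ys * np.sin(theta) - r)
    return (dist < 0.5).astype(float)

img = backproject_bin(64, theta=np.pi / 4, r=0.0)
# One projection bin touches on the order of `size` pixels spanning the field of view.
```

This is the converse of the paper's observation: a convolution over sinogram bins acts on many such lines at once, i.e. globally in the image domain.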
In the MiG-DM framework, the incorporation of multi-modal information plays a pivotal role. Subsequently, the adaptive multi-modal encoding guidance process is elaborated through three key components: MI-prompt preparation, MI-encoder design, and contrastive learning.
Achieving precise alignment between meta-information and PET images is crucial, and the formulation of text prompts plays a pivotal role in this process. However, the formulation of such text prompts faces two main challenges. First, conventional text prompts are typically designed for natural images [43], focusing on object categories, and thus fail to capture the complex metabolic information of human anatomy contained in PET images. Second, text prompts generated by visual-language models [44] often concentrate on visual aspects such as color and shape, and are insufficient to deliver the in-depth functional information required for PET imaging. To address these challenges, we introduce an MI-prompt that incorporates effective functional-semantic information, as depicted in Fig. 4.
Inspired by the vision-language model CLIP [45], we design a dedicated learning framework to obtain an MI-encoded feature $F_{MI}$ that is aligned with PET images. First, we pre-train on large-scale paired natural images and texts to enable the model to initially acquire image-text understanding capabilities and semantic alignment between images and descriptive texts. In the fine-tuning process, the pre-trained parameters are copied and frozen as the initialization for the fine-tuning model. The model is then continuously trained on paired data of PET images with multiple dosage levels and MI-prompts. This design allows the model to retain its basic image-text understanding while further learning and mastering MI comprehension abilities. During fine-tuning, a trainable LoRA module [48] is added to each ViT block and Transformer block. In each Transformer block of the MI-encoder, the multi-head attention module operates by modifying the query, key, and value via integration with the LoRA module. The modified query $Q$, key $K$, and value $V$ are computed as follows:
$$Q = (W_Q + B_Q A_Q)\, x, \qquad K = (W_K + B_K A_K)\, x, \qquad V = (W_V + B_V A_V)\, x,$$
where $W_Q$, $W_K$, and $W_V$ represent the pre-trained weights and $A_{(\cdot)}$, $B_{(\cdot)}$ are the corresponding LoRA matrices.
The output of the multi-head attention module is derived as
$$\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V.$$
Additionally, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$, where $r$ represents the rank of the LoRA weights; the smaller $r$ is, the smaller the additional computation introduced by fine-tuning. In the pre-training stage, $A$ is initialized with Kaiming initialization while $B$ is zero-initialized. This allows the LoRA weights to be added smoothly as the fine-tuning process proceeds and prevents the gradient change from being too drastic. A similar approach is applied to add LoRA weights to the weight matrix of the multilayer perceptron (MLP):
$$W_{MLP}' = W_{MLP} + B_{MLP} A_{MLP}.$$
The MLP consists of two linear layers with the Gaussian error linear unit (GeLU) activation function.
In the pre-training and fine-tuning processes, contrastive learning is utilized to align image and MI-prompt embeddings by maximizing the similarity for correct pairings and minimizing it for incorrect ones. Specifically, the image encoder transforms PET image data into feature representations $F_{img}$. Once the PET image features and MI features are mapped into the same feature space, a contrastive loss between the image and MI features is optimized to align the two modalities:
$$\mathcal{L}_{con} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp\!\big(\mathrm{sim}(F_{img}^{i}, F_{MI}^{i})/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(F_{img}^{i}, F_{MI}^{j})/\tau\big)} + \log \frac{\exp\!\big(\mathrm{sim}(F_{MI}^{i}, F_{img}^{i})/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(F_{MI}^{i}, F_{img}^{j})/\tau\big)} \right],$$
where $\tau$ represents a learnable temperature parameter that regulates the contrastive strength and $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity. The loss function optimizes the model in both directions: from MI-prompt to PET image and from PET image to MI-prompt. It compels the model to align similar semantics while amplifying the semantic disparities between distinct features.
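A minimal sketch of this symmetric contrastive objective, in the CLIP style; the fixed temperature here stands in for the learnable parameter τ, and all names are illustrative:

```python
import numpy as np

# CLIP-style symmetric contrastive loss over paired image/MI-prompt embeddings.
def clip_contrastive_loss(img_feat, txt_feat, temperature=0.07):
    img = img_feat / np.linalg.norm(img_feat, axis=1, keepdims=True)
    txt = txt_feat / np.linalg.norm(txt_feat, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # pairwise cosine similarities
    n = logits.shape[0]

    def xent(l):                            # cross-entropy; correct pair on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # image-to-text and text-to-image directions, averaged
    return 0.5 * (xent(logits) + xent(logits.T))

feats = np.eye(4)                           # perfectly matched pairs
matched = clip_contrastive_loss(feats, feats)
mismatched = clip_contrastive_loss(feats, np.roll(feats, 1, axis=0))
```

Correctly paired embeddings drive the loss toward zero, while shuffled pairings are penalized, which is exactly the bidirectional alignment behavior the text describes.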
In this subsection, we first introduce a sinogram adapter (SinoA) designed to learn the distribution of physical information in the projection domain. This information is then transferred to a synergistic diffusion (SD) model consisting of two components, SD1 and SD2. SD1 integrates the projection-domain features derived from SinoA into the image-domain reconstruction process, while SD2 incorporates MI-encoded features to enhance the functional semantics. The image-domain input is obtained as $x = R(y_0)$, where $y_0$ denotes the output sinogram of the SRM and $R$ represents the traditional reconstruction algorithm [49]. The transformation begins with a pixel-unshuffling operation with a scale factor of 4, followed by a convolution block (ConvBlock) to generate the features $F_s^0$. These features are then refined through four processing stages, each comprising three residual blocks [50]. Stages 2 to 4 include additional down-sampling operations with a scale factor of 2, producing the multi-scale feature group $\{F_s^i\}_{i=1}^{4}$.
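The pixel-unshuffle front end of the adapter can be sketched as follows; it is a lossless rearrangement that trades spatial resolution for channels before convolution (array shapes are illustrative):

```python
import numpy as np

# Lossless pixel-unshuffle with scale 4, as used at the front of the sinogram
# adapter: each 4x4 spatial block becomes 16 channels at 1/4 resolution.
def pixel_unshuffle(x, scale=4):
    c, h, w = x.shape
    x = x.reshape(c, h // scale, scale, w // scale, scale)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * scale * scale, h // scale, w // scale)

sino = np.random.default_rng(0).standard_normal((1, 256, 256))  # stand-in sinogram
feat = pixel_unshuffle(sino)   # (1, 256, 256) -> (16, 64, 64), values only rearranged
```

No information is discarded; every input value survives the rearrangement, so subsequent convolutions see the full sinogram content at a larger effective receptive field.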
As the key generation process of MiG-DM, the synergistic diffusion model combines SD1 for projection-domain feature integration with SD2 for meta-information incorporation, achieving high-quality PET image reconstruction with preserved functional semantics. The specific architectures of SD1/SD2 and their connection mode are elaborated below.
To begin with, SD1 and SRM share the same model structure. Recognizing the critical role of the encoder in image understanding and the decoder in feature reconstruction, we inject projection-domain features from SinoA into the encoder of SD1 to enhance both local and global feature comprehension. The final four stages of the SD1 encoder generate feature maps that share resolutions with the projection-domain feature group, so their fusion employs element-wise addition of the corresponding feature maps at each stage.
Notably, projection-domain guidance is restricted to the final four encoder stages of SD1 because parameters in deeper layers converge markedly faster than those in shallow counterparts. Consequently, the deep features can rapidly adapt to external conditioning, whereas shallow layers retain stable, data-driven representations of generic textures and edges. By injecting conditional information only at these deeper levels, we enhance semantic consistency and fine-detail fidelity without unduly perturbing the low-level feature hierarchy.
Then, SD2 and SRM share similar architectures, with the key distinction being the incorporation of cross-attention modules [51] in the final three stages of both encoder and decoder, as well as in the intermediate layers. When processing feature maps at the $i$-th layer of the SD2 decoder, the output is computed as
$$\mathrm{CrossAttn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$
where the image feature serves as the Query while the MI feature $F_{MI}$ provides both the Key and the Value. Finally, we implement a resample strategy [52] to connect SD1 and SD2. Specifically, the reconstructed image $x$ from SD1 undergoes $N$-timestep diffusion:
$$x_N = \sqrt{\bar{\alpha}_N}\, x + \sqrt{1-\bar{\alpha}_N}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}),$$
with $\epsilon$ being a sample from the Gaussian distribution. SD2 then performs reverse reconstruction from timestep $N$ to obtain the high-quality PET image. It is worth noting that SD1 is injected with image-domain and projection-domain information, endowing it with enhanced physical-structure reconstruction capabilities. In contrast, SD2 is injected with MI-prompts, granting it stronger detail and semantic reconstruction capabilities. The resampling step $N$ serves as a hyperparameter that balances structural and semantic quality. Theorem 1 demonstrates that an appropriate $N$ exists that optimizes the reconstructed image quality across the two models. In our experiments, we set $N = 50$. An ablation study and further discussion of this choice are provided in Section V.
The proof is provided in Appendix B.
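The resample bridge itself is just the closed-form forward process applied for N steps to the SD1 output; a sketch under an assumed linear β schedule (variable names illustrative):

```python
import numpy as np

# Sketch of the resample bridge between SD1 and SD2: the SD1 output is
# re-noised for N forward timesteps, then SD2 denoises from timestep N.
T, N = 1000, 50
betas = np.linspace(1e-4, 2e-2, T)
alpha_bars = np.cumprod(1.0 - betas)

def renoise(x_sd1, N, rng):
    eps = rng.standard_normal(x_sd1.shape)
    return np.sqrt(alpha_bars[N - 1]) * x_sd1 + np.sqrt(1.0 - alpha_bars[N - 1]) * eps

rng = np.random.default_rng(0)
x_sd1 = rng.standard_normal((256, 256))   # structural reconstruction from SD1
x_N = renoise(x_sd1, N, rng)
# For small N, alpha_bar_N stays close to 1, so most SD1 structure survives
# and SD2 only needs a short reverse trajectory to refine semantics.
```

This makes the trade-off concrete: a larger N gives SD2 more freedom to reshape semantics at the cost of eroding the structure SD1 established.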
In summary, the complete MiG-DM procedure for low-dose PET reconstruction is presented in Algorithm 1, which runs the reverse denoising loop from the starting timestep down to 0 and outputs the reconstructed image $x_0$.
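The overall control flow of Algorithm 1 can be sketched as follows, with all trained networks replaced by caller-supplied placeholders (an illustrative reading of the pipeline, not the authors' exact pseudocode):

```python
import numpy as np

def mig_dm_reconstruct(lq_sino, mi_prompt, srm, reconstruct, sd1_step, sd2_step,
                       T=1000, N=50, seed=0):
    """Control-flow sketch of the MiG-DM pipeline; networks are caller-supplied."""
    rng = np.random.default_rng(seed)
    alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, T))
    y0 = srm(lq_sino)                    # sinogram restoration (SRM)
    x = reconstruct(y0)                  # classical reconstruction R(y0)
    for t in range(T - 1, -1, -1):       # SD1: sinogram-guided reverse pass
        x = sd1_step(x, t, lq_sino)
    eps = rng.standard_normal(np.shape(x))
    # resample bridge: re-noise the SD1 output for N forward steps
    x = np.sqrt(alpha_bars[N - 1]) * x + np.sqrt(1.0 - alpha_bars[N - 1]) * eps
    for t in range(N - 1, -1, -1):       # SD2: short MI-guided reverse pass
        x = sd2_step(x, t, mi_prompt)
    return x

# Identity placeholders stand in for the trained networks.
identity = lambda x, *args: x
out = mig_dm_reconstruct(np.zeros((64, 64)), "MI-prompt", identity, identity,
                         identity, identity, T=10, N=5)
```

Swapping the placeholders for the trained SRM, SD1, and SD2 networks recovers the SD1-to-SD2 cascade discussed next.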
The SinoA-guided SD1 model and the MI-guided SD2 model play crucial roles. However, the repetition pattern between them is a key factor influencing the reconstruction quality. Therefore, we design two different repetition patterns, “SD1-to-SD2 cascade” and “SD2-to-SD1 cascade”, as shown in Fig. 5. In the “SD1-to-SD2 cascade” mode, the SinoA-guided SD1 reconstruction is performed first, followed by the MI-guided SD2 reconstruction via resampling. In the “SD2-to-SD1 cascade” mode, the MI-guided SD2 reconstruction is carried out first, and then the SinoA-guided SD1 reconstruction is performed. Because meta-information guidance mainly aims to adjust the functional semantics of the PET image, we ultimately select the “SD1-to-SD2 cascade” pattern: if MI-guided reconstruction is performed first, without an established basic morphology, the morphology of the results degrades, and although the SinoA-guided SD1 model is run afterwards to optimize the reconstructed morphology, this in turn degrades the functional semantics. Hence, the second pattern is less favorable than the first one.
Moreover, we choose to split and inject the image-domain PET images, projection-domain sinograms, and MI-prompt into the diffusion model as three separate components. A dedicated network focusing on single-modal data facilitates fast convergence and strong robustness, while the reconstruction probability distribution obtained by separating individual conditions is equivalent to that derived from aggregating all conditions. The relevant theoretical analysis is included in Theorem 2, while the specific experimental results are documented in Section V.
Theorem 2. Let $x_{lq}$ denote a low-dose PET image, $y$ a sinogram, and $m$ meta-information. Assume that the three conditions $x_{lq}$, $y$, and $m$ are conditionally independent given $x_t$. Then the score of the multi-conditional reverse transition can be decomposed as
$$\nabla_{x_t} \log p(x_t \mid x_{lq}, y, m) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(x_{lq} \mid x_t) + \nabla_{x_t} \log p(y \mid x_t) + \nabla_{x_t} \log p(m \mid x_t).$$
In this section, the performance of MiG-DM is compared with state-of-the-art methods, including U-Net [52], MPRNet [16], ViT-Rec [46], Pix2Pix [53], and IDDPM [24]. To ensure comparability and fairness of the experiments, all methods are
conducted on the same datasets. Open-source code is available at: https://github.com/yqx7150/MiG-DM. Datasets: Two distinct datasets were employed to conduct a comprehensive evaluation. The first was the UDPET dataset obtained from the MICCAI 2024 ultra-low-dose PET imaging challenge, which comprises low-dose PET and full-dose PET images with dose reduction factors (DRFs) of 10 and 20. Low-dose images were generated by subsampling full scans and were perfectly aligned with their full-dose PET images. Each patient case included 673 axial 2D slices, cropped to 256×256 to remove background. All imaging data were acquired using the uEXPLORER whole-body PET system for 18F-FDG-PET applications. The training cohort consisted of 101 patients per dose level, providing 67,973 slices for model development, while an independent set of 1,346 slices was reserved for testing. Second, a Clinical dataset of 10 patients was utilized to further assess the generalization capability of these models. Each patient case contained 450 axial 2D image slices, zero-padded to 256×256 for testing. The data were sampled from the DigitMI 930 PET/CT scanner developed by RAYSOLUTION Healthcare Co., Ltd. The scanner integrates fully digital photon detectors and offers an axial field of view of 30.6 cm. Each scan covered 4 to 8 beds, with scan times ranging from 45 seconds to 3 minutes per bed. Low-dose PET data were obtained by resampling the list-mode data into random intervals, retaining a random subset of the data per cycle and discarding the remainder.
Parameter Configuration: For the fine-tuning of the MI-encoder, we set the LoRA rank $r = 4$ and trained for 1000 iterations on paired PET images and MI-prompts using a batch size of 256 and a learning rate of $1\times10^{-4}$ with the Adam optimizer. For the SD1, SD2, and SRM models, SD1 was trained on full-dose PET data, SD2 was trained on full-dose data and corresponding MI-prompts, and SRM was trained on paired low-dose and full-dose sinograms. These three models used a batch size of 8 and a learning rate of $1\times10^{-4}$ with the AdamW optimizer, training for 300,000 iterations. Moreover, we used DDSA to connect the fine-tuned SD1 and SRM, and trained on paired low-dose sinograms and full-dose PET images for 100,000 iterations with a batch size of 8 and a learning rate of $1\times10^{-4}$ using the AdamW optimizer. All training and testing experiments were conducted on two NVIDIA GeForce RTX 3090 GPUs, each with 24 GB of memory. Performance Evaluation: To quantitatively measure the error introduced by MiG-DM, the peak signal-to-noise ratio (PSNR), structural similarity (SSIM), mean squared error (MSE) and learned perceptual image patch similarity (LPIPS) [54] are used to evaluate the quality of the reconstructed images. To further evaluate performance on the Clinical dataset, we additionally incorporated clinical metrics, which include the difference of the maximum standardized uptake value (ΔSUVmax) and the difference of the mean standardized uptake value (ΔSUVmean), both of which represent the quantitative gap between the reconstructed results and the reference for the lesion, as well as the tumor-to-background ratio (TBR) and contrast ratio (CR). The specific expressions for TBR and CR are as follows:
$$\mathrm{TBR} = \frac{\mathrm{SUV}_{max}^{lesion}}{\mathrm{SUV}_{mean}^{bkg}}, \qquad \mathrm{CR} = \frac{\mathrm{SUV}_{max}^{lesion}}{\mathrm{SUV}_{mean}^{liver}},$$
where $\mathrm{SUV}_{max}^{lesion}$ is the maximum standardized uptake value for the lesion, and $\mathrm{SUV}_{mean}^{bkg}$ and $\mathrm{SUV}_{mean}^{liver}$ denote the mean standardized uptake values of the background and liver reference regions, respectively.
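Of the image-quality metrics above, PSNR is straightforward to state concretely; a minimal sketch (`data_range` is an assumption about intensity normalization):

```python
import numpy as np

# Peak signal-to-noise ratio between a reference and a reconstructed image.
def psnr(ref, rec, data_range=1.0):
    mse = np.mean((ref - rec) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.zeros((8, 8))
rec = np.full((8, 8), 0.1)
val = psnr(ref, rec)   # MSE = 0.01 -> PSNR = 20 dB
```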
To assess the efficacy of MiG-DM using the UDPET dataset, Table I displays a comprehensive quantitative analysis across different DRFs and patient weight groups. The results demonstrate that MiG-DM consistently achieves higher PSNR and SSIM values, while exhibiting lower MSE and LPIPS values compared to other reconstruction methods. Specifically, for the ≤60 kg weight group at a DRF of 20, MiG-DM attains a PSNR value of 46.15 dB and an SSIM value of 0.9884, which are among the highest scores recorded. Moreover, MiG-DM outperforms the second-best IDDPM by 0.94 dB in PSNR, 0.0013 in SSIM, 0.6617 in MSE, and 0.0053 in LPIPS. These improvements underscore the superior ability of MiG-DM to maintain image quality and detail preservation, especially under higher dose reduction factors, thereby enhancing the overall diagnostic value of low-dose PET imaging. To more accurately evaluate the agreement between reconstructed images and full-dose images, as shown in Fig. 7, we present the reconstructed images of various methods and their corresponding Bland-Altman plots, which depict the numerical differences between each method's reconstructed images and the full-dose images. Here, the red line represents the mean of the differences, and the blue dashed lines indicate the 95% limits of agreement (LoA). The results demonstrate that the images reconstructed by MiG-DM achieve the lowest mean difference of -0.0018 and the narrowest 95% LoA, ranging from -0.0360 to 0.0324. Meanwhile, the red arrows in the reconstructed images point to the detailed regions. For other methods, reconstruction errors in these detailed regions lead to the presence of extreme values in their Bland-Altman plots and an expansion of the 95% LoA range. Therefore, the smaller mean difference and narrower 95% LoA of MiG-DM indicate that it has distinct advantages in both global image quality and local detailed lesions.
In the left half of Fig. 8, the maximum intensity projection (MIP) image of the patient and the reconstruction results of MiG-DM in the coronal, sagittal and axial views at DRFs of 10 and 20 are presented. MiG-DM obtains superior results from multiple perspectives. For instance, the red boxes in the coronal and sagittal views highlight the detailed reconstruction of the patient's brain, where MiG-DM successfully reconstructs the accurate morphology of white matter and gray matter at both DRF=10 and DRF=20. Additionally, the blue boxes in the coronal and sagittal views, as well as the red box in the axial view, mark the reconstruction results of the patient's abdominal tumor, which exhibits a morphology close to that of the full-dose image under both DRF conditions. In the right half of Fig. 8, box plots of PSNR and SSIM metrics for various methods are presented. At both DRF=10 and DRF=20, MiG-DM has the highest median values, the smallest interquartile range, and fewer extreme values. In contrast, other SOTA methods, even when they exhibit a smaller interquartile range, tend to have lower metrics and a greater number of extreme values.
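The Bland-Altman statistics reported for Fig. 7 reduce to the mean difference and the 95% limits of agreement; a minimal sketch (names and test data illustrative):

```python
import numpy as np

# Bland-Altman agreement statistics: mean difference and 95% limits of
# agreement (mean +/- 1.96 * SD of the per-pixel differences).
def bland_altman(ref, rec):
    diff = (rec - ref).ravel()
    mean_diff = diff.mean()
    half_width = 1.96 * diff.std()
    return mean_diff, (mean_diff - half_width, mean_diff + half_width)

rng = np.random.default_rng(0)
ref = rng.standard_normal((32, 32))               # stand-in full-dose slice
rec = ref + 0.01 * rng.standard_normal((32, 32))  # stand-in reconstruction
md, (lo, hi) = bland_altman(ref, rec)
```

A tighter interval (lo, hi) around a near-zero mean difference corresponds to the narrow LoA that distinguishes MiG-DM in the comparison above.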
Comparison on Clinical Dataset: To evaluate the performance of different methods under clinical conditions, we tested various approaches using the Clinical dataset. Fig. 9 presents the coronal reconstruction results and corresponding error maps of U-Net, MPRNet, ViT-Rec, Pix2Pix, IDDPM and MiG-DM, where the red arrows indicate key lesion locations. For MiG-DM, the error-map values are generally low, with accurate reconstruction of both the shapes and quantitative values of multiple lesions. In contrast, although the reconstruction results of U-Net and IDDPM are visually acceptable, their error maps reveal deficiencies in preserving functional numerical fidelity. Meanwhile, MPRNet, ViT-Rec and Pix2Pix all exhibit blurring and distortion at lesion locations. In summary, MiG-DM ensures reconstruction performance in terms of visual presentation, complex structures and functional semantics simultaneously in clinical assessment.
To quantitatively evaluate the performance of SOTA methods on the Clinical dataset, we employed image quality metrics and PET clinical metrics. Table II reports the image quality metrics of different methods, where MiG-DM achieved the highest PSNR of 34.98 dB and SSIM of 0.9284, as well as the lowest MSE of 5.4466(*E-3) and LPIPS of 0.0516. On the Clinical dataset, MiG-DM demonstrated greater advantages over other methods. Conversely, the performance of some methods, such as U-Net and Pix2Pix, degraded significantly. This indicates that MiG-DM possesses stronger generalization ability and robustness. Table III compares the results of PET clinical metrics, including average ΔSUVmax, ΔSUVmean, TBR and CR. Smaller ΔSUVmax and ΔSUVmean indicate a smaller discrepancy in metabolic indices compared with the reference full-dose images. Higher TBR denotes a greater contrast between the lesion and the background, rendering the lesion more distinct. Similarly, higher CR signifies a stronger contrast between the lesion and the liver, meaning the lesion is more prominent. MiG-DM attains the lowest ΔSUVmax of 1.4751 and ΔSUVmean of 0.0404, significantly outperforming other methods; for example, IDDPM yields a ΔSUVmax of 2.9282 and ΔSUVmean of 0.0848. This highlights the precise metabolic-fidelity preservation capability of MiG-DM. Additionally, in comparison with other methods, MiG-DM attains the highest TBR of 0.6416 and CR of 2.0652. These metrics underscore the superiority and efficacy of MiG-DM in enhancing tissue homogeneity and lesion conspicuity. In conclusion, the cross-domain reconstruction framework ensures that the model maintains superior reconstruction performance across different datasets. Meanwhile, the MI-encoder injects functional semantics into the model, endowing it with reliability in preserving the clinical functional information of PET.
To assess the influence of the MI-encoder and SinoA modules within MiG-DM, an ablation study was conducted on the UDPET dataset at DRF=20, with results presented in Table Ⅳ. We adopted the model without the MI-encoder and SinoA modules as the baseline, which obtained a PSNR of 45.21, SSIM of 0.9871, MSE of 0.8971(*E-3), and LPIPS of 0.0237. With the introduction of the MI-encoder module, the MSE decreased significantly from 0.8971(*E-3) to 0.2388(*E-3), indicating that the functional semantic information introduced by the MI-encoder enhances the model's capability to restore the quantitative values of lesion regions. When only the SinoA module is introduced, PSNR and MSE exhibit slight degradation, while SSIM and LPIPS show marginal improvement. This indicates that the global information introduced by SinoA enhances overall reconstruction quality, such as global contrast, statistical consistency, and perceptual fidelity. Thus, when both the MI-encoder and SinoA modules are integrated, the model synergistically leverages global statistical and functional semantic information, thereby enhancing reconstruction performance along both dimensions.
To evaluate the fine-tuning effect of the LoRA module, we conducted an ablation experiment in which the MI-encoder, with and without the LoRA module, was incorporated into the cross-domain reconstruction framework; the resulting image quality metrics are reported in Table Ⅴ. Compared with the MI-encoder without LoRA, incorporating LoRA yielded improvements of 0.41 dB in PSNR, 0.001 in SSIM, 0.0052(*E-3) in MSE, and 0.0118 in LPIPS. The improvement in LPIPS is particularly significant, indicating that the fine-tuned MI-encoder extracts more effective MI representations. Such high-quality MI representations contribute substantially to the model's perceptual performance and image quality. Fig. 10 visualizes the alignment probability distributions of the MI-encoder with and without LoRA for low-dose images at DRF=10, paired with 1 correct MI-prompt and 8 incorrect MI-prompts. Without LoRA, the alignment probabilities for all MI-prompts are roughly identical, indicating that the MI-encoder fails to identify the MI-prompt that correctly pairs with the PET image. In contrast, when equipped with LoRA, the MI-encoder aligns the image with the correct MI-prompt with a probability of 89.06%. This demonstrates that LoRA enables the MI-encoder to acquire robust MI encoding capabilities through minimal parameter modifications. Furthermore, the MI-prompt consists of three components: patient characteristics, dose-related information, and semi-quantitative assessment. As illustrated in Fig. 10, an error in any one of these components leads to a significant drop in the prediction probability, implying that the three categories of information in the MI-prompt are equally important for expressing functional semantics.
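The alignment probabilities in Fig. 10 are consistent with a CLIP-style softmax over image-prompt similarities. A minimal sketch, assuming cosine similarity and an illustrative temperature (the MI-encoder's actual scoring function is not specified in this section):

```python
import numpy as np

def alignment_probabilities(image_emb, prompt_embs, temperature=0.07):
    """Softmax over cosine similarities between one image embedding and
    several MI-prompt embeddings (1 correct + 8 incorrect in the paper).
    The embeddings and the temperature value are illustrative assumptions."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()
```

A well-aligned encoder concentrates the distribution on the correct prompt (e.g., the reported 89.06%), whereas an unaligned one yields a near-uniform distribution.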
We have demonstrated that the cross-domain reconstruction framework effectively learns both global and local information in PET images, and that introducing the MI-prompt enables the model to acquire valid functional semantic information. It is worth noting that the order of cross-domain information and MI-prompt injection, the number of resampling steps, and the number of sampling steps of the SRM in the projection domain all affect the accuracy and reliability of image reconstruction.
Analysis of Combination Pattern: Table Ⅵ presents the performance of the two combination patterns of MiG-DM on the UDPET dataset. Specifically, the SD1-to-SD2 cascade first acquires global information through SD1, then injects functional semantics via SD2; the SD2-to-SD1 cascade reverses the injection order of the two types of guiding information. The SD1-to-SD2 cascade outperforms the SD2-to-SD1 cascade in terms of PSNR, SSIM, and LPIPS. This indicates that, under the SD1-to-SD2 pattern, the model first fuses global and local visual information to obtain an initial visual reconstruction; subsequently, functional semantic information performs numerically precise adjustments on specific regions of the morphologically intact image, ensuring the consistency of functional semantics. When the order is reversed, however, functional information is adjusted first, and the subsequent injection of global information from the projection domain undermines the previously injected functional information, degrading reconstruction performance.
Resampling is employed to connect SD1 and SD2: a controlled amount of noise is added to the high-quality image distribution generated by SD1, bringing it closer to the denoising trajectory of SD2. Subsequently, functional semantics from the MI-prompt are injected during the denoising process of SD2 to reconstruct the final result. Table Ⅶ reports the reconstruction performance under different numbers of resampling steps.
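The resample step corresponds to the standard DDPM forward marginal q(x_N | x̄_0). A minimal sketch, assuming a generic beta schedule (the schedule used by MiG-DM is not stated here):

```python
import numpy as np

def resample(x0_sd1, betas, n_steps, rng=None):
    """Re-noise SD1's clean output for N forward steps so it lands on
    SD2's denoising trajectory:
        x_N = sqrt(alpha_bar_N) * x0 + sqrt(1 - alpha_bar_N) * eps,
    the standard DDPM forward marginal. The beta schedule is an assumption."""
    if rng is None:
        rng = np.random.default_rng(0)
    alpha_bar = np.cumprod(1.0 - betas)[n_steps - 1]  # cumulative signal retention
    eps = rng.standard_normal(x0_sd1.shape)
    return np.sqrt(alpha_bar) * x0_sd1 + np.sqrt(1.0 - alpha_bar) * eps
```

Small N keeps x_N close to SD1's structurally faithful output; large N discards more of it, leaving room for SD2's semantic guidance, which is exactly the trade-off governed by the resampling step count.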
Step: In the projection domain, sinograms comprise numerous sinusoidal curves with varying amplitudes and phases, most of which exhibit similar shapes and complex textures. Consequently, the number of sampling steps affects the generation of projection-domain sinograms, which in turn influences the quality of the global guiding information and can ultimately degrade image reconstruction quality. We therefore analyze the number of sampling steps in the SRM. Fig. 11 illustrates the sinograms generated by the SRM at 250, 500, 750, and 1000 sampling steps. When the number of sampling steps is low, abnormal projection lines appear in the sinograms, and the reconstructed images of MiG-DM are affected by this erroneous global information, resulting in blurred images with lost details. As the number of sampling steps increases, the quality of sinogram generation improves significantly. For example, at the position circled by the orange dashed line, the previously missing projection lines are fully restored when the sampling step is set to 1000, and the corresponding results also recover accurate lesion contours.
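The effect of the sampling-step count can be illustrated with a deterministic DDIM-style sampler, where `n_steps` evenly subsamples the training schedule (this sampler and the `eps_fn` noise predictor are illustrative stand-ins for the SRM, whose exact sampler is not specified):

```python
import numpy as np

def ddim_sample(eps_fn, shape, betas, n_steps, rng=None):
    """Deterministic DDIM-style sampling with an evenly subsampled schedule.
    `eps_fn(x, t)` stands in for the trained noise predictor; smaller
    `n_steps` means coarser jumps along the reverse trajectory, which is
    the knob varied in Fig. 11 (250/500/750/1000 steps)."""
    if rng is None:
        rng = np.random.default_rng(0)
    alpha_bar = np.cumprod(1.0 - betas)
    ts = np.linspace(len(betas) - 1, 0, n_steps).round().astype(int)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for i, t in enumerate(ts):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[ts[i + 1]] if i + 1 < len(ts) else 1.0
        eps = eps_fn(x, t)
        x0 = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)  # predicted clean image
        x = np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev) * eps
    return x
```

With fewer steps, each jump covers a larger portion of the noise schedule, so errors in the noise prediction accumulate into artifacts such as the abnormal projection lines seen at 250 steps.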
In this study, we proposed a meta-information-guided cross-domain reconstruction framework. The framework simultaneously learns local and global information in PET images through its cross-domain reconstruction architecture, thereby generating results that preserve global structures from the sinogram and local details from the PET images. Subsequently, the MI-encoder extracts meta-information from the MI-prompt, infusing functional semantics into the model. This enables the model to enhance the accuracy of quantitative value prediction on top of high-quality morphological images.
The cascade sampling process, which chains the output of SD1 to the input of SD2, requires careful tuning of the denoising process. The number of resampling steps $N$ is a critical hyper-parameter, and the following analysis provides a theoretical characterization of its optimality.

First, $\bar{x}_0$ is generated by SD1: from Eq. (A.3), the reverse process is executed starting from $x_T \sim \mathcal{N}(0, \mathbf{I})$ to obtain $\bar{x}_0$, whose distribution is denoted $p_1(\bar{x}_0)$. Next, $N$ steps of forward noising are applied to $\bar{x}_0$; from Eq. (A.2), $x_N$ can be computed explicitly as
$$x_N = \sqrt{\bar{\alpha}_N}\,\bar{x}_0 + \sqrt{1-\bar{\alpha}_N}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}),$$
with forward marginal $q(x_N \mid \bar{x}_0)$. Finally, SD2 denoises $x_N$ to produce $x_0$ under the reverse process $p_2(x_0 \mid x_N)$. Consequently, the joint distribution of the cascade sampling process can be formulated as
$$p(x_0, x_N, \bar{x}_0) = p_2(x_0 \mid x_N)\, q(x_N \mid \bar{x}_0)\, p_1(\bar{x}_0),$$
and the marginal distribution of the final sample $x_0$ is given by
$$p(x_0) = \iint p_2(x_0 \mid x_N)\, q(x_N \mid \bar{x}_0)\, p_1(\bar{x}_0)\, \mathrm{d}x_N\, \mathrm{d}\bar{x}_0.$$

Let $S_1$ and $S_2$ denote the physical structure quality scores of $\bar{x}_0$ and $x_0$, respectively, and let $D_1$ and $D_2$ denote the semantic information quality scores of $\bar{x}_0$ and $x_0$, respectively. These scores define a combined quality objective $Q(N)$, which is continuous in $N$ on $[0, T]$. The extreme value theorem therefore guarantees that $Q(N)$ attains a maximum on $[0, T]$, establishing the existence of an optimal resampling step. Using the explicit form in Lemma 2, $Q(N)$ admits a closed-form expression; depending on its shape, the optimum lies either in the interior of $[0, T]$ or at a boundary. In all cases, the existence of an optimal $N$ is guaranteed.

Remark 1. The optimal $N^{\ast}$ balances the structural information from SD1 against the meta-information semantics from SD2, providing the optimal trade-off between the two.
Let $x_{lq}$ denote a low-dose PET image, $y$ a sinogram, and $m$ meta-information. Assume that the three conditions $x_{lq}$, $y$, $m$ are conditionally independent given $x_t$. Then the score of the multi-conditional reverse transition can be decomposed as
$$\nabla_{x_t} \log p(x_t \mid x_{lq}, y, m) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(x_{lq} \mid x_t) + \nabla_{x_t} \log p(y \mid x_t) + \nabla_{x_t} \log p(m \mid x_t).$$
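Under this decomposition, each condition contributes an additive guidance term to the score. A toy numerical check with Gaussian stand-in score functions (the individual scores below are illustrative, not the paper's learned networks):

```python
import numpy as np

def combined_score(x_t, prior_score, cond_scores):
    """Multi-conditional score under conditional independence of the
    conditions given x_t: the joint score is the prior score plus the
    sum of the individual likelihood scores (Bayes' rule in
    log-gradient form)."""
    return prior_score(x_t) + sum(s(x_t) for s in cond_scores)
```

For unit-variance Gaussians, a prior N(0, 1) has score -x and each likelihood N(mu_i, 1) has score -(x - mu_i), so the combined score is -(k + 1)x + sum(mu_i), which the function reproduces term by term.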
TABLE Ⅱ: COMPARISON OF STATE-OF-THE-ART METHODS IN TERMS OF THE AVERAGE PSNR, SSIM, MSE(*E-3), AND LPIPS ON THE CLINICAL DATASET.