Pansharpening is a significant image fusion task that fuses a low-resolution multispectral image (LRMSI) and a high-resolution panchromatic image (PAN) to obtain a high-resolution multispectral image (HRMSI). The development of diffusion models (DMs) and end-to-end (E2E) models has greatly advanced the frontier of pansharpening. DMs use a multi-step diffusion process to obtain an accurate estimate of the residual between the LRMSI and the HRMSI. However, this multi-step process demands substantial computation and is time-consuming. E2E models, in turn, remain limited by their lack of priors and their simple structures. In this paper, we propose a novel four-stage training strategy to obtain a lightweight network, Fose, which fuses a one-step DM and an E2E model. We perform one-step distillation on an enhanced SOTA DM for pansharpening to compress the inference process from 50 steps to only 1 step. We then fuse the E2E model with the one-step DM through lightweight ensemble blocks. Comprehensive experiments on three commonly used benchmarks demonstrate the significant improvement of the proposed Fose. Moreover, we achieve a 7.42× speedup over the baseline DM while delivering much better performance. The code and model are released at https://github.com/Kai-Liu001/Fose.
Pansharpening is an image fusion technique that integrates a low-resolution multispectral image (LRMSI) with a high-resolution single-band panchromatic (PAN) image to produce a high-resolution multispectral product (HRMSI). Owing to physical and cost constraints, spaceborne multispectral and especially hyperspectral sensors typically offer limited spatial resolution that falls short of many downstream application requirements. In contrast, mounting a PAN sensor on the same platform is comparatively inexpensive. This sensor configuration makes pansharpening a de facto capability for modern satellites and a crucial enabler for numerous tasks, including motion detection, change detection [33], and semantic segmentation [39].
Pansharpening dates back to the 1970s [12]. Classical approaches are commonly grouped into three families: component substitution (CS), multi-resolution analysis (MRA), and variational optimization (VO). CS methods project the LRMSI into a transform domain and replace its spatial component(s) with those extracted from the PAN image; they are computationally efficient and yield high spatial fidelity, but often introduce spectral distortion [12,17]. MRA methods perform multiscale decompositions to extract spatial details from the PAN image and inject them into the LRMSI, generally preserving spectral information at the possible expense of spatial sharpness [19,24]. VO formulations provide stronger mathematical guarantees than CS and MRA but typically incur higher computational cost and require careful tuning of multiple hyperparameters [36,37].
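The CS principle above can be made concrete with a minimal sketch. The generalized-IHS variant below is an illustrative choice on our part, not a method evaluated in this paper: the intensity component is taken as the band mean, and the PAN detail is injected into every band.

```python
import numpy as np

def cs_pansharpen(lrmsi_up, pan):
    """Minimal component-substitution sketch (generalized-IHS style).

    lrmsi_up: (C, H, W) multispectral image already upsampled to PAN resolution.
    pan:      (1, H, W) panchromatic image.
    The band mean serves as the spatial (intensity) component, which is
    replaced by the PAN image via additive detail injection.
    """
    intensity = lrmsi_up.mean(axis=0, keepdims=True)  # (1, H, W)
    return lrmsi_up + (pan - intensity)               # broadcast over bands
```

A useful property of this scheme is that the band mean of the fused product equals the PAN image exactly, which is the source of both its spatial sharpness and its tendency toward spectral distortion.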
With the rise of deep learning, data-driven pansharpening has advanced rapidly. PNN was the first model to introduce deep learning to pansharpening [15]. It upsamples the LRMSI to the PAN resolution, concatenates it with the PAN image along the channel dimension, and then learns a supervised mapping to the ground-truth high-resolution MSI through a three-layer CNN. Subsequent end-to-end architectures take the LRMSI and PAN as inputs and produce the HRMSI in a single forward pass, offering low parameter counts and modest computational demands, albeit with limited peak performance. Scaling such networks by simply increasing width or depth often yields diminishing returns and, in practice, can degrade generalization due to overfitting. Diffusion-based models, by contrast, predict the residual between the LRMSI and the HRMSI, typically conditioning on the PAN image. Through a multi-step denoising process, they convert noise into the residual and have shown clear advantages in spatial detail and visual fidelity. However, their iterative sampling is computationally heavy, often one to two orders of magnitude slower than end-to-end models.
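The PNN-style coupled-input pipeline described above (upsample, concatenate, three-layer CNN) can be sketched as follows. This is a hedged toy version: we use 1×1 convolutions for brevity, whereas the actual PNN uses larger spatial kernels, and the weight shapes are illustrative assumptions.

```python
import numpy as np

def upsample_nearest(x, scale):
    """Nearest-neighbor upsampling of a (C, H, W) image by an integer scale."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def conv1x1(x, weight, bias):
    """1x1 convolution: (C_in, H, W) -> (C_out, H, W), weight is (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None]

def pnn_like(lrmsi, pan, layers):
    """Toy PNN-style forward pass: upsample the LRMSI to PAN resolution,
    concatenate along channels, then apply a three-layer CNN with ReLU
    between layers (1x1 convs here stand in for PNN's larger kernels)."""
    scale = pan.shape[-1] // lrmsi.shape[-1]
    x = np.concatenate([upsample_nearest(lrmsi, scale), pan], axis=0)
    for i, (w, b) in enumerate(layers):
        x = conv1x1(x, w, b)
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x
```

For a 4-band LRMSI at 16×16 and a PAN image at 64×64, the network input has 5 channels and the output recovers 4 bands at 64×64.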
To reconcile these trade-offs, we propose Fose, which distills a multi-step diffusion model into a single-step generator and then fuses it with an end-to-end network via a lightweight convolutional adaptor. Concretely, we first strengthen a state-of-the-art diffusion-based pansharpening baseline by integrating adaptive convolution, obtaining a strong reference model without a substantial parameter increase. We then perform single-step distillation using a VSD loss, accelerating sampling by roughly 50× with negligible loss in accuracy. Next, we train a typical end-to-end model to compensate for the degradation introduced by this 50× compression. Finally, we fuse the outputs of the distilled one-step diffusion model and the end-to-end model using a shallow convolutional fusion head. During training of this fusion stage, the backbone models are frozen and only the fusion parameters are updated, enabling rapid convergence and performance that consistently surpasses either component alone. Our contributions are threefold:
• We propose Fose, the first one-step diffusion model for pansharpening, fused with an end-to-end network. Fose achieves a 7.42× speedup over its multi-step baseline model while delivering better performance.
• We propose a four-stage training strategy that progressively improves both performance and speed. In brief, the strategy builds strong baselines for both the DM and the E2E model, performs one-step distillation, and fuses the two with lightweight ensemble connector layers.
• We conduct comprehensive experiments on three commonly used pansharpening datasets to validate the excellent performance of Fose against previous SOTA models. Ablation studies further demonstrate its effectiveness and robustness.
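The final fusion stage, in which both backbones are frozen and only a shallow convolutional head is trained, can be sketched as below. This is a minimal illustration under our own assumptions (a single 1×1 convolution as the head, and an averaging initialization), not the paper's exact layer configuration.

```python
import numpy as np

def fusion_head(y_dm, y_e2e, w, b):
    """Shallow fusion head sketch: concatenate the one-step DM output and
    the E2E output along channels, then map back to C bands with one 1x1
    convolution. In the fusion stage, only (w, b) would be trained while
    both backbone models stay frozen.

    y_dm, y_e2e: (C, H, W) outputs of the two frozen branches.
    w: (C, 2C) head weights, b: (C,) bias.
    """
    x = np.concatenate([y_dm, y_e2e], axis=0)            # (2C, H, W)
    return np.einsum('oc,chw->ohw', w, x) + b[:, None, None]

def averaging_init(c):
    """Initialize the head so it starts as a plain average of the branches,
    letting training learn only the deviation from simple ensembling."""
    w = np.concatenate([np.eye(c), np.eye(c)], axis=1) * 0.5  # (C, 2C)
    return w, np.zeros(c)
```

With the averaging initialization, the head's output is exactly the mean of the two branch outputs before any training, so the fused model can only improve on naive ensembling as the head parameters are updated.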
As a simple yet effective strategy, the representative single-scale coupling model exemplified by PNN employs a compact three-layer CNN that achieved state-of-the-art performance at the time. Subsequently, methods such as Fusion-Net [5] and DCFNet [35] adopted similar coupled-input designs. However, due to the limited feature representation capacity of these architectures, their spectral fidelity and generalization remain insufficient.
In multi-source image fu