SatFusion: A Unified Framework for Enhancing Remote Sensing Image via Multi-Frame and Multi-Source Image Fusion


Remote sensing (RS) imaging is constrained by hardware cost and physical limitations, making high-quality image acquisition challenging and motivating image fusion for quality enhancement. Multi-frame super-resolution (MFSR) and Pansharpening exploit complementary information from multiple frames and multiple sources, respectively, but are usually studied in isolation: MFSR lacks high-resolution structural priors for fine-grained texture recovery, while Pansharpening depends on upsampled multispectral images and is sensitive to noise and misalignment. With the rapid development of the Satellite Internet of Things (Sat-IoT), effectively leveraging large numbers of low-quality yet information-complementary images has become increasingly important. To this end, we propose SatFusion, a unified framework for enhancing RS images via joint multi-frame and multi-source fusion. SatFusion employs a Multi-Frame Image Fusion (MFIF) module to extract high-resolution semantic features from multiple low-resolution multispectral frames, and integrates fine-grained structural information from a high-resolution panchromatic image through a Multi-Source Image Fusion (MSIF) module, enabling robust feature integration with implicit pixel-level alignment. To further mitigate the lack of structural priors in multi-frame fusion, we introduce SatFusion*, which incorporates a panchromatic-guided mechanism into the multi-frame fusion stage. By combining structure-aware feature embedding with transformer-based adaptive aggregation, SatFusion* enables spatially adaptive selection of multi-frame features and strengthens the coupling between multi-frame and multi-source representations. Extensive experiments on the WorldStrat, WV3, QB, and GF2 datasets demonstrate that our methods consistently outperform existing approaches in terms of reconstruction quality, robustness, and generalizability.


💡 Research Summary

The paper introduces SatFusion, a unified deep‑learning framework that simultaneously exploits multiple low‑resolution multispectral (MS) frames and a high‑resolution panchromatic (PAN) image to produce a high‑resolution MS product. Traditional remote‑sensing fusion has been split into two largely independent streams: Multi‑Frame Super‑Resolution (MFSR), which aggregates several homogeneous low‑resolution frames, and pansharpening, which fuses a high‑resolution PAN with a single low‑resolution MS image. MFSR suffers from a lack of structural priors, limiting fine‑grained texture recovery, while pansharpening depends on up‑sampling the MS image to match PAN resolution and is highly sensitive to noise and misalignment.

SatFusion bridges these gaps with three main components. The Multi‑Frame Image Fusion (MFIF) module ingests an arbitrary number of LR‑MS frames, aligns them implicitly, and extracts high‑level semantic features using a combination of 3‑D convolutions, residual blocks, and a transformer encoder. The Multi‑Source Image Fusion (MSIF) module then injects the fine‑grained spatial details of the PAN image by concatenating PAN features with the MFIF output and applying attention‑based fusion, achieving implicit pixel‑level alignment without explicit warping. Finally, a Fusion Composition module merges the multi‑frame and multi‑source representations and supervises the reconstruction with a weighted combination of pixel, spectral, and structural losses.
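The weighted combination of pixel, spectral, and structural losses can be illustrated with a minimal numpy sketch. The paper does not publish its exact loss definitions or weights, so the terms below (L1 pixel error, a SAM-style spectral-angle term, a gradient-difference structural proxy, and the weights `w_pix`, `w_spec`, `w_struct`) are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def pixel_loss(pred, target):
    # L1 error averaged over all pixels and bands.
    return np.abs(pred - target).mean()

def spectral_loss(pred, target, eps=1e-8):
    # Spectral angle between per-pixel band vectors (SAM-style term).
    dot = (pred * target).sum(axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    return np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0)).mean()

def structural_loss(pred, target):
    # Gradient-difference term as a simple structural proxy.
    gx = np.abs(np.diff(pred, axis=0) - np.diff(target, axis=0)).mean()
    gy = np.abs(np.diff(pred, axis=1) - np.diff(target, axis=1)).mean()
    return gx + gy

def composite_loss(pred, target, w_pix=1.0, w_spec=0.1, w_struct=0.1):
    # Hypothetical weights; the paper's actual weighting is not specified here.
    return (w_pix * pixel_loss(pred, target)
            + w_spec * spectral_loss(pred, target)
            + w_struct * structural_loss(pred, target))
```

Supervising all three terms jointly is what lets a single reconstruction target balance radiometric fidelity (pixel), band-ratio preservation (spectral), and edge sharpness (structural).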

To further strengthen the coupling between frames and the structural guide, the authors propose SatFusion*. In this variant the PAN image is down‑sampled and concatenated with the LR‑MS frames before the MFIF encoder, allowing the transformer‑based aggregation to be directly conditioned on PAN‑guided reference vectors. This “position‑dependent fusion reference” enables spatially adaptive weighting of each frame’s contribution, improving robustness to inter‑frame misalignment, blur, and sensor noise.
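The PAN-guided input construction for SatFusion* can be sketched as follows. This is a minimal illustration assuming block-average downsampling and channel concatenation; the function names and the exact downsampling operator are assumptions, not the paper's code:

```python
import numpy as np

def downsample_pan(pan, factor):
    # Block-average the PAN image down to the LR-MS grid.
    h, w = pan.shape
    return pan.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def pan_guided_frames(frames, pan, factor):
    # frames: (T, H, W, C) stack of LR-MS frames; pan: (H*factor, W*factor).
    # Concatenate the downsampled PAN as an extra channel of every frame,
    # so the multi-frame encoder is conditioned on the structural guide.
    pan_lr = downsample_pan(pan, factor)
    guide = np.broadcast_to(pan_lr[None, :, :, None], frames.shape[:3] + (1,))
    return np.concatenate([frames, guide], axis=-1)
```

Because the same guide channel is attached to every frame, the downstream transformer can score each frame's features against a common, position-dependent reference rather than an arbitrary anchor frame.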

Key technical contributions include: (1) a unified architecture that can embed existing MFSR or pansharpening networks without major redesign; (2) implicit alignment that eliminates costly pixel‑wise registration; (3) transformer‑driven adaptive aggregation that scales to any number of input frames, addressing the variable‑frame scenario common in Sat‑IoT deployments; (4) a PAN‑guided fusion stage that supplies high‑frequency structural priors to the multi‑frame pathway, mitigating the texture bottleneck of conventional MFSR.
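Contribution (3), aggregation that scales to any number of frames, works because attention weights are computed per frame against a shared reference, so the frame count is never baked into the model's parameters. A simplified numpy sketch of this idea (single spatial location, dot-product attention; the reference vector and temperature are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_frames(frame_feats, reference, temperature=1.0):
    # frame_feats: (T, D) per-frame feature vectors at one spatial location.
    # reference:   (D,) fusion-reference vector (PAN-guided in SatFusion*).
    # Weights depend only on each frame's similarity to the reference,
    # so T can differ freely between training and inference.
    d = frame_feats.shape[-1]
    scores = frame_feats @ reference / (temperature * np.sqrt(d))
    weights = softmax(scores)
    return weights @ frame_feats, weights
```

Frames whose features align with the reference dominate the fused output, which is how spatially adaptive selection of multi-frame features falls out of the attention mechanism.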

Extensive experiments on four public datasets—WorldStrat, WV3, QB, and GF2—demonstrate that SatFusion consistently outperforms state‑of‑the‑art MFSR methods (e.g., TR‑MISR, RAMS, DeepSUM) and recent pansharpening models (e.g., Pan‑Mamba, FusionNet, diffusion‑based approaches). Quantitatively, SatFusion achieves average PSNR gains of over 1.5 dB and lower ERGAS scores, while visual inspection shows sharper edges and more faithful spectral content. Robustness tests that add varying levels of blur, noise, and misalignment to the MS inputs reveal that SatFusion and SatFusion* degrade far more gracefully than traditional pansharpening pipelines, confirming the benefit of multi‑frame complementary information. Moreover, SatFusion* maintains high performance when the number of input frames during testing differs from that used in training, highlighting its flexibility for real‑world satellite constellations where image availability fluctuates.
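For readers unfamiliar with the reported metrics, PSNR (higher is better) and ERGAS (lower is better) can be computed as below. This is a standard-formula sketch, not the paper's evaluation code; the scale ratio of 4 is an assumption about the LR-to-HR resolution gap:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in dB; +1.5 dB roughly means ~30% lower MSE.
    mse = ((pred - target) ** 2).mean()
    return 10.0 * np.log10(max_val ** 2 / mse)

def ergas(pred, target, scale_ratio=4):
    # ERGAS: relative dimensionless global error, averaged over bands.
    # scale_ratio is the HR/LR resolution ratio (assumed 4 here).
    band_rmse = np.sqrt(((pred - target) ** 2).mean(axis=(0, 1)))
    band_mean = target.mean(axis=(0, 1))
    return 100.0 / scale_ratio * np.sqrt(((band_rmse / band_mean) ** 2).mean())
```

ERGAS normalizes each band's RMSE by that band's mean intensity, so it penalizes spectral distortion even when absolute pixel errors are small.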

In summary, SatFusion offers a comprehensive solution for remote‑sensing image enhancement by jointly optimizing multi‑frame and multi‑source fusion. Its architecture leverages structural guidance, transformer‑based adaptive aggregation, and implicit alignment to deliver superior reconstruction quality, robustness to degradations, and scalability to arbitrary frame counts—features that are especially valuable for emerging Satellite‑IoT scenarios where large volumes of low‑quality, complementary imagery must be turned into actionable high‑resolution products.

