Multi-Expert Learning Framework with the State Space Model for Optical and SAR Image Registration
Optical and Synthetic Aperture Radar (SAR) image registration is crucial for multi-modal image fusion and downstream applications. However, several challenges limit the performance of existing deep learning-based methods in cross-modal image registration: (i) significant nonlinear radiometric variations between optical and SAR images hinder shared feature learning and matching; (ii) limited texture in the images hampers discriminative feature extraction; (iii) the local receptive field of Convolutional Neural Networks (CNNs) restricts the learning of contextual information, while Transformers can capture long-range global features but at high computational cost. To address these issues, this paper proposes a multi-expert learning framework with the State Space Model (ME-SSM) for optical and SAR image registration. First, to improve registration performance under limited texture, ME-SSM constructs a multi-expert learning framework to capture shared features from multi-modal images. Specifically, it extracts features from various transformations of the input image and employs a learnable soft router to dynamically fuse these features, thereby enriching feature representations and improving registration performance. Second, ME-SSM introduces a state space model, Mamba, for feature extraction, which employs a multi-directional cross-scanning strategy to efficiently capture global contextual relationships with linear complexity. ME-SSM can thus expand the receptive field and enhance registration accuracy without incurring high computational costs. Additionally, ME-SSM uses a multi-level feature aggregation (MFA) module to enhance multi-scale feature fusion and interaction. Extensive experiments demonstrate the effectiveness and advantages of the proposed ME-SSM on optical and SAR image registration.
💡 Research Summary
The paper introduces ME‑SSM, a novel framework that tackles three major challenges in optical‑SAR image registration: (i) severe nonlinear radiometric differences, (ii) limited texture information, and (iii) the restricted receptive field of CNNs versus the high computational cost of Transformers. ME‑SSM combines a multi‑expert learning strategy with a state‑space model (Mamba) to deliver both rich local features and efficient global context modeling.
First, the Multi‑Expert Learning Framework (MELF) creates several transformed versions of each input image using affine operations such as rotation, scaling, and flipping. Separate expert networks (which can be CNNs, Vision Transformers, or Mamba blocks) process each transformed image, producing feature maps of identical dimensionality. A learnable soft router then evaluates the statistical characteristics of the original image and assigns dynamic weights to each expert's output, aggregating them into a shared feature representation. This dynamic fusion mitigates the texture‑scarcity problem because transformed views expose complementary structures that a single network would miss.
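The dynamic fusion step can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names are hypothetical, and the choice of global image statistics as the router's input is an assumption based on the description above.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_router_fuse(image_stats, expert_features, router_weights):
    """Fuse per-expert feature maps with learnable soft-router gates.

    image_stats:     (d,) summary statistics of the original image (assumed input)
    expert_features: list of K feature maps, each of shape (c, h, w)
    router_weights:  (K, d) hypothetical learned routing matrix
    """
    logits = router_weights @ image_stats              # one routing score per expert
    gates = softmax(logits)                            # normalized dynamic weights
    fused = sum(g * f for g, f in zip(gates, expert_features))
    return fused, gates

# Toy usage: three experts (e.g. original, rotated, flipped views).
stats = np.array([1.0, 2.0])
experts = [np.ones((1, 2, 2)), np.zeros((1, 2, 2)), np.full((1, 2, 2), 0.5)]
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
fused, gates = soft_router_fuse(stats, experts, weights)
```

Because the gates sum to one, the fused map stays in the same range as the expert outputs, and low-texture inputs can shift weight toward the transformed-view experts.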
Second, global context is captured by integrating the Mamba state‑space model. Mamba replaces the quadratic‑complexity attention of Transformers with a linear‑time SSM that treats the image as a sequence of patches. The authors further enhance this with a multi‑directional cross‑scanning scheme: patches are processed in four directions (left‑to‑right, right‑to‑left, top‑to‑bottom, bottom‑to‑top). The recurrent‑like state updates propagate information across the entire image with O(N) complexity, enabling the model to learn long‑range dependencies crucial for aligning optical and SAR modalities that often share only coarse structural cues.
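The four-direction scan can be illustrated with a toy scalar recurrence. This is a deliberately simplified sketch: real Mamba uses an input-dependent selective-scan parameterization, whereas here a fixed first-order filter `h_t = a*h_{t-1} + b*x_t` stands in for the state update, and all function names are illustrative.

```python
import numpy as np

def ssm_scan(seq, a=0.9, b=0.1):
    """Toy linear-time recurrence h_t = a*h_{t-1} + b*x_t over a 1-D sequence."""
    h = 0.0
    out = np.empty_like(seq)
    for t, x in enumerate(seq):
        h = a * h + b * x          # state carries information from all earlier positions
        out[t] = h
    return out

def cross_scan(feat):
    """Scan an (H, W) feature map in four directions and merge the results."""
    H, W = feat.shape
    row = feat.reshape(-1)         # row-major order of the patch grid
    col = feat.T.reshape(-1)       # column-major order
    s1 = ssm_scan(row).reshape(H, W)                  # left-to-right
    s2 = ssm_scan(row[::-1])[::-1].reshape(H, W)      # right-to-left
    s3 = ssm_scan(col).reshape(W, H).T                # top-to-bottom
    s4 = ssm_scan(col[::-1])[::-1].reshape(W, H).T    # bottom-to-top
    return (s1 + s2 + s3 + s4) / 4.0                  # merge directional outputs
```

Each scan touches every patch exactly once, so the combined pass stays O(N) in the number of patches while every output position receives information from all four directions.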
Third, a Multi‑Level Feature Aggregation (MFA) module is embedded within the Mamba backbone. MFA fuses features from multiple resolutions, allowing high‑resolution texture details to be combined with low‑resolution global cues. This hierarchical fusion improves the precision of pixel‑level correspondences while preserving robustness to large‑scale deformations.
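A minimal sketch of this kind of multi-level aggregation is shown below. It assumes a simple coarse-to-fine pyramid layout and replaces whatever learned fusion MFA actually uses with plain upsample-and-average; names and layout are illustrative, not the paper's implementation.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of an (H, W) map by an integer factor."""
    return np.repeat(np.repeat(feat, factor, axis=0), factor, axis=1)

def aggregate_levels(pyramid):
    """Fuse a feature pyramid into one full-resolution map.

    pyramid: list of (H/2^i, W/2^i) maps, finest level first (assumed layout).
    """
    H, W = pyramid[0].shape
    fused = np.zeros((H, W))
    for i, level in enumerate(pyramid):
        fused += upsample_nearest(level, 2 ** i)   # bring every level to full resolution
    return fused / len(pyramid)                    # simple average in place of learned fusion

# Usage: fine texture map plus two coarser context maps.
pyr = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]
fused = aggregate_levels(pyr)
```

In the fused map, high-resolution levels contribute pixel-level detail while the upsampled coarse levels inject the global cues described above.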
The authors evaluate ME‑SSM on two public datasets: SEN1‑2 and the OS dataset, covering a range of resolutions and transformation scenarios. Using the Correct Matching Rate (CMR) at 1‑pixel and 3‑pixel thresholds, ME‑SSM outperforms state‑of‑the‑art CNN‑based, dilated‑CNN, and Transformer‑based methods. On SEN1‑2, CMR improves by 7.14 % (1‑pixel) and 1.95 % (3‑pixel); on OS, gains of 2.12 % and 0.84 % are reported. Importantly, the parameter count and FLOPs of the Mamba‑based backbone remain roughly 30 % lower than comparable Transformers, confirming the efficiency of the linear‑complexity design.
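The CMR metric used above can be computed as follows. This sketch assumes the standard definition (fraction of predicted correspondences whose pixel error falls within the threshold), which the paper's 1-pixel and 3-pixel reporting is consistent with.

```python
import numpy as np

def correct_matching_rate(pred_pts, gt_pts, threshold):
    """Fraction of predicted keypoints within `threshold` pixels of ground truth.

    pred_pts, gt_pts: (N, 2) arrays of matched (x, y) coordinates.
    """
    errors = np.linalg.norm(pred_pts - gt_pts, axis=1)  # per-match Euclidean error
    return float((errors <= threshold).mean())

# Toy usage with three matches: errors of 0.5, 0.0, and 2.0 pixels.
pred = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]])
gt = np.array([[0.5, 0.0], [5.0, 5.0], [12.0, 10.0]])
cmr_1px = correct_matching_rate(pred, gt, 1.0)
cmr_3px = correct_matching_rate(pred, gt, 3.0)
```

Tightening the threshold from 3 pixels to 1 pixel penalizes coarse alignments, which is why the 1-pixel gains reported above are the more demanding result.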
Ablation studies show that (a) the soft router preferentially emphasizes transformed‑image experts in low‑texture regions, (b) each scanning direction contributes uniquely to capturing vertical and horizontal long‑range dependencies, and (c) MFA significantly boosts performance over a plain Mamba block by integrating multi‑scale information.
In summary, ME‑SSM delivers a three‑fold advantage: (1) dynamic multi‑expert fusion that enriches feature representations under severe texture scarcity, (2) a state‑space model that provides global contextual awareness with linear computational cost, and (3) hierarchical multi‑scale aggregation that refines pixel‑level alignment. The framework is modality‑agnostic and can be extended to other multimodal registration tasks such as LiDAR‑optical or hyperspectral‑SAR fusion, making it a promising building block for future real‑time, multi‑sensor remote sensing systems.