SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

Synthetic Aperture Radar (SAR) and optical imagery provide complementary strengths, forming a critical foundation for overcoming single-modality constraints and enabling cross-modal collaborative processing and intelligent interpretation. However, existing benchmark datasets often suffer from limitations such as a single spatial resolution, insufficient data scale, and low alignment accuracy, making them inadequate for training and generalizing multi-scale foundation models. To address these challenges, we introduce SOMA-1M (SAR-Optical Multi-resolution Alignment), a pixel-level, precisely aligned dataset containing over 1.3 million pairs of georeferenced images at a size of 512 × 512 pixels. The dataset integrates imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth, achieving global multi-scale coverage from 0.5 m to 10 m. It encompasses 12 typical land-cover categories, ensuring scene diversity and complexity. To handle multimodal projection distortion and large-scale registration, we designed a rigorous coarse-to-fine image matching framework that ensures pixel-level alignment. Based on this dataset, we established comprehensive evaluation benchmarks for four hierarchical vision tasks, including image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation, involving over 30 mainstream algorithms. Experimental results demonstrate that supervised training on SOMA-1M significantly enhances performance across all tasks; notably, multimodal remote sensing image (MRSI) matching reaches current state-of-the-art (SOTA) levels. SOMA-1M serves as a foundational resource for robust multimodal algorithms and remote sensing foundation models. The dataset will be released publicly at: https://github.com/PeihaoWu/SOMA-1M.


💡 Research Summary

The paper introduces SOMA‑1M, a large‑scale, multi‑resolution SAR‑optical dataset designed to overcome the three major limitations of existing multimodal remote sensing benchmarks: limited spatial resolution, insufficient data volume, and poor alignment accuracy. SOMA‑1M comprises 1.3 million paired image tiles, each 512 × 512 pixels, drawn from a diverse set of sensors—Sentinel‑1, PIESAT‑1, Capella Space for SAR and Google Earth for optical imagery. The dataset spans three ground‑sample distances (0.5 m, 3 m, and 10 m), providing global coverage while also delivering fine‑grained detail for urban and infrastructure analysis. Twelve representative land‑cover classes are evenly distributed across the collection, ensuring scene diversity.

To achieve pixel‑level registration despite the fundamentally different imaging geometries of side‑looking SAR and nadir‑looking optical sensors, the authors devise a rigorous coarse‑to‑fine alignment pipeline. The coarse stage uses metadata‑based geocoding to obtain an initial alignment, followed by a fine stage that employs multi‑scale pyramids and a deep learning‑based matching network to correct nonlinear geometric distortions. The pipeline yields sub‑pixel alignment errors and is validated with a synthetic deformation test set.
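The two-stage idea can be sketched in a few lines. The snippet below is a simplified illustration, not the authors' pipeline: the coarse stage derives an initial offset from georeferencing metadata (here reduced to a toy geotransform tuple), and the fine stage is stood in for by classical FFT phase correlation, where the paper instead uses a learned matching network over multi-scale pyramids.

```python
import numpy as np

def coarse_offset(gt_sar, gt_opt):
    """Coarse stage: initial pixel offset from georeferencing metadata.
    Each geotransform here is a simplified (x_origin, pixel_size, y_origin)
    tuple, used only for illustration."""
    dx = (gt_opt[0] - gt_sar[0]) / gt_sar[1]
    dy = (gt_opt[2] - gt_sar[2]) / gt_sar[1]
    return dx, dy

def phase_correlation(a, b):
    """Fine-stage stand-in: estimate a translational shift between two tiles
    via FFT phase correlation. The paper refines with a deep matching
    network; this classical method only illustrates the refinement step."""
    spectrum = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    spectrum /= np.abs(spectrum) + 1e-9          # normalize to phase only
    corr = np.abs(np.fft.ifft2(spectrum))
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # wrap shifts larger than half the tile back to negative values
    shifts = [p if p <= s // 2 else p - s for p, s in zip(peak, a.shape)]
    return tuple(int(v) for v in shifts)

# toy check: shift a random tile by (3, 5) pixels and recover the offset
rng = np.random.default_rng(0)
img = rng.random((64, 64))
shifted = np.roll(img, (3, 5), axis=(0, 1))
print(phase_correlation(shifted, img))  # → (3, 5)
```

Phase correlation only recovers translations; the nonlinear distortions between side-looking SAR and nadir-looking optical geometry are precisely why the paper's fine stage relies on a learned matcher instead.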

Four hierarchical vision tasks are benchmarked on SOMA‑1M: (1) image matching, (2) image fusion, (3) SAR‑assisted cloud removal, and (4) cross‑modal translation. For each task, more than 30 state‑of‑the‑art algorithms—including traditional hand‑crafted descriptors, modern attention‑based matchers (SuperGlue, LightGlue, DiffGlue, MambaGlue), deep fusion networks, conditional GANs, and diffusion models—are evaluated. Across all tasks, models pretrained on SOMA‑1M consistently outperform those trained on existing datasets, with matching accuracy improving by over 12 percentage points, PSNR gains of ~1.5 dB in fusion, and notable visual and NDVI recovery improvements in cloud removal. The translation task benefits from the precise geo‑coordinates, achieving higher structural consistency than prior benchmarks.
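The PSNR figure cited for the fusion task is the standard peak signal-to-noise ratio; a minimal reference implementation makes the reported ~1.5 dB gain concrete (the toy values below are illustrative, not from the paper):

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher is better.
    Standard definition: 10 * log10(peak^2 / MSE)."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

# toy example: a uniform error of 5 gray levels on an 8-bit image
ref = np.full((8, 8), 100.0)
est = ref + 5.0
print(round(psnr(ref, est), 2))  # → 34.15
```

On this log scale, a 1.5 dB improvement corresponds to roughly a 30% reduction in mean squared error, which is why dB-level gains in fusion benchmarks are considered substantial.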

The authors also analyze the domain gap introduced by the multi‑resolution nature of the dataset. Experiments reveal that models trained on a single resolution suffer performance drops when applied to other scales, while multi‑scale training strategies (scale‑mixed batches, resolution‑specific adapters) mitigate this effect. Additionally, a simulated cloud subset with varied morphologies and thicknesses provides a controlled testbed for SAR‑assisted reconstruction research.
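The scale-mixed batching strategy can be sketched as below. This is a hypothetical illustration of the idea, not the paper's implementation: every batch draws tiles round-robin from all available ground-sample distances (GSDs), so the model never trains on a single-resolution stream.

```python
import random

def scale_mixed_batches(tiles_by_gsd, batch_size, seed=0):
    """Yield batches that mix tiles from every ground-sample distance.
    `tiles_by_gsd` maps a GSD (e.g. 0.5, 3.0, 10.0) to its tile list.
    Illustrative sketch; names and structure are assumptions."""
    rng = random.Random(seed)
    pools = {gsd: list(tiles) for gsd, tiles in tiles_by_gsd.items()}
    for pool in pools.values():
        rng.shuffle(pool)
    gsds = sorted(pools)
    while all(pools[g] for g in gsds):
        batch = []
        for i in range(batch_size):
            gsd = gsds[i % len(gsds)]  # round-robin across resolutions
            batch.append((gsd, pools[gsd].pop()))
        yield batch

# toy run with the dataset's three resolutions and dummy tile ids
tiles = {0.5: range(6), 3.0: range(6), 10.0: range(6)}
first = next(scale_mixed_batches(tiles, batch_size=6))
print(sorted({gsd for gsd, _ in first}))  # → [0.5, 3.0, 10.0]
```

A resolution-specific adapter would instead route each tile through a small per-GSD module before a shared backbone; both strategies address the same single-resolution domain gap the authors measure.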

In summary, SOMA‑1M delivers an unprecedented combination of scale (1.3 M pairs), resolution diversity (0.5 m–10 m), and pixel‑level alignment, filling a critical gap for foundation‑model pretraining and downstream multimodal remote sensing applications. The dataset, along with code and benchmark results, is publicly released at https://github.com/PeihaoWu/SOMA-1M, positioning it as a foundational resource for the next generation of robust, cross‑modal remote sensing algorithms.

