GeoFormer: A Swin Transformer-Based Framework for Scene-Level Building Height and Footprint Estimation from Sentinel Imagery

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Accurate three-dimensional urban data are critical for climate modelling, disaster risk assessment, and urban planning, yet remain scarce due to reliance on proprietary sensors or poor cross-city generalisation. We propose GeoFormer, an open-source Swin Transformer framework that jointly estimates building height (BH) and footprint (BF) on a 100 m grid using only Sentinel-1/2 imagery and open DEM data. A geo-blocked splitting strategy ensures strict spatial independence between training and test sets. Evaluated over 54 diverse cities, GeoFormer achieves a BH RMSE of 3.19 m and a BF RMSE of 0.05, improving by 7.5% and 15.3% over the strongest CNN baseline, while maintaining a BH RMSE under 3.5 m in cross-continent transfer. Ablation studies confirm that DEM is indispensable for height estimation and that optical reflectance dominates over SAR, though multi-source fusion yields the best overall accuracy. All code, weights, and global products are publicly released.


💡 Research Summary

This paper introduces GeoFormer, an open-source deep learning framework designed to address the scarcity of accurate, globally consistent 3D urban data. GeoFormer jointly estimates two key urban form metrics—average Building Height (BH) and Building Footprint ratio (BF)—at a 100m grid resolution using exclusively freely available Sentinel-1 Synthetic Aperture Radar (SAR), Sentinel-2 multispectral optical imagery, and Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM) data.

The core innovation lies in its architecture and approach. Instead of operating at the per-building or 10 m pixel level, which often suffers from mixed-pixel effects in dense urban areas, GeoFormer adopts a “scene-level” perspective using 100 m grid cells as its fundamental unit. This scale aligns with many city- and continental-scale climate and urban studies. The model employs a Swin Transformer backbone configured to process not just a target grid cell but also its surrounding neighborhood (e.g., a 5x5 window of cells, equivalent to 500 m x 500 m). This allows the model to learn the contextual relationships between a location’s built form and its immediate urban fabric, a significant advantage over CNN baselines that only look at the target cell in isolation. A multi-task learning head enables the simultaneous prediction of BH and BF, both regression tasks.
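The neighborhood-context idea can be illustrated with a minimal sketch (function names, shapes, and the zero-padding choice are illustrative assumptions, not the paper's implementation): for each target cell in a raster of 100 m cells, the model input is the 5x5 block of cells centered on it.

```python
import numpy as np

def context_window(grid, row, col, k=5):
    """Extract the k x k neighborhood of 100 m cells centered on (row, col).

    `grid` holds per-cell features with shape (H, W, C); edges are
    zero-padded so border cells still receive a full k x k context.
    """
    pad = k // 2
    padded = np.pad(grid, ((pad, pad), (pad, pad), (0, 0)))
    return padded[row:row + k, col:col + k, :]

# A 5x5 window over 100 m cells covers a 500 m x 500 m context.
features = np.random.rand(20, 20, 8)   # e.g. stacked SAR + optical + DEM channels
patch = context_window(features, row=10, col=10, k=5)
print(patch.shape)  # (5, 5, 8)
```

Each such patch, rather than the single target cell, would then be fed to the backbone, which is how the model sees the surrounding urban fabric.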

The training and evaluation data were built upon the open SHAFTS reference dataset, covering 54 geographically diverse cities worldwide. The authors implemented a rigorous “geo-blocked” data splitting strategy to ensure strict spatial independence between training, validation, and test sets, preventing data leakage and enabling a true assessment of generalization capability. Extensive preprocessing pipelines on Google Earth Engine ensured temporally consistent and cloud-free satellite inputs.
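A geo-blocked split of this kind can be sketched as follows (the block size and hashing scheme here are illustrative assumptions, not the paper's exact procedure): cells are grouped into coarse spatial blocks, and whole blocks, not individual cells, are assigned to train, validation, or test, so that spatially adjacent cells can never straddle the split.

```python
import hashlib

def geo_block_id(lat, lon, block_deg=0.5):
    """Map a cell's coordinates to a coarse spatial block index."""
    return (int(lat // block_deg), int(lon // block_deg))

def assign_split(block, val_frac=0.1, test_frac=0.2):
    """Deterministically assign an entire block to one split via hashing."""
    h = hashlib.md5(str(block).encode()).hexdigest()
    u = int(h, 16) / 16**32  # uniform in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"

# Neighboring cells fall into the same block and therefore the same split,
# which prevents spatial leakage between training and test sets.
cells = [(39.90, 116.40), (39.91, 116.41), (48.85, 2.35)]
splits = [assign_split(geo_block_id(lat, lon)) for lat, lon in cells]
print(splits)  # the first two (adjacent) cells always share one split
```

Hashing the block index keeps the assignment deterministic and reproducible without storing an explicit split table.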

Comprehensive experiments demonstrate GeoFormer’s superiority. Evaluated across the 54 cities, GeoFormer with a 5x5 context window achieved a BH Root Mean Square Error (RMSE) of 3.19 meters and a BF RMSE of 0.050, outperforming the strongest CNN baseline (a U-Net) by 7.5% and 15.3%, respectively. Crucially, GeoFormer showed remarkable generalization in cross-continent transfer learning experiments, maintaining a BH RMSE under 3.5 meters when applied to continents not seen during training.
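The headline metric is the standard Root Mean Square Error; for reference, a minimal computation (the height values below are toy numbers, not data from the paper):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between predicted and reference values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example with per-cell average building heights in meters.
heights_ref = [12.0, 8.0, 25.0, 6.0]
heights_pred = [10.0, 9.0, 22.0, 7.0]
print(round(rmse(heights_ref, heights_pred), 2))  # → 1.94
```

The same formula applies to the unitless BF ratio, which is why the BH and BF errors are reported on different scales (meters vs. a 0-1 fraction).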

Ablation studies provided critical insights into the model’s data requirements: 1) Removing the DEM data caused a severe degradation in BH accuracy while leaving BF largely unaffected, confirming the indispensable role of topographic context for height estimation. 2) Modality tests revealed that removing the optical (Sentinel-2) channels caused greater performance loss than removing the SAR (Sentinel-1) channels, indicating that multispectral reflectance is the primary driver for retrieval at this scale. 3) Nevertheless, the full fusion of SAR, optical, and DEM data yielded the best overall accuracy, validating the complementary value of multi-source remote sensing. The study also found that excessive model capacity (e.g., a 9x9 window) led to overfitting rather than improved generalization.

By relying solely on open data and achieving state-of-the-art accuracy with strong generalization, GeoFormer represents a significant step toward democratizing access to high-quality 3D urban information. The authors have publicly released all code, pre-trained model weights, and global estimation products to foster reproducibility and further research in urban remote sensing and related fields.

