Building Damage Detection using Satellite Images and Patch-Based Transformer Methods
Rapid building damage assessment is critical for post-disaster response. Damage classification models built on satellite imagery provide a scalable means of obtaining situational awareness, but label noise and severe class imbalance in satellite data create major challenges. The xBD dataset offers a standardized benchmark for building-level damage across diverse geographic regions. In this study, we evaluate Vision Transformer (ViT) performance on the xBD dataset, specifically assessing DINOv2-small and DeiT for multi-class damage classification and investigating how these models distinguish between types of structural damage when trained on noisy, imbalanced data. We propose a targeted patch-based pre-processing pipeline to isolate structural features and minimize background noise during training, and we adopt a frozen-head fine-tuning strategy to keep computational requirements manageable. Model performance is evaluated through accuracy, precision, recall, and macro-averaged F1 scores. We show that small ViT architectures trained with our pipeline achieve competitive macro-averaged F1 relative to prior CNN baselines for disaster classification.
💡 Research Summary
The paper addresses the pressing need for rapid, automated building‑damage assessment after natural disasters by leveraging high‑resolution satellite imagery. While previous work on the xBD benchmark has largely relied on convolutional neural networks (CNNs) such as ResNet, these approaches suffer from severe class imbalance (the “no‑damage” class accounts for roughly 60 % of all building instances) and label noise, which limit macro‑averaged F1 scores. To explore whether Vision Transformers (ViTs) can overcome these limitations, the authors evaluate two lightweight ViT variants—DeiT‑T (≈5 M parameters, supervised ImageNet pre‑training) and DINOv2‑small (≈22 M parameters, self‑supervised LVD‑142M pre‑training)—on the xBD dataset for four‑class damage classification (no‑damage, minor, major, destroyed).
A central contribution is a “patch‑based preprocessing pipeline” designed to reduce background clutter and mitigate the impact of empty pixels. A building polygon is selected at random and its centroid located; a square crop matching the model’s input resolution (224 × 224 for DeiT, 518 × 518 for DINOv2) is extracted around it. An alpha channel is added to track transparency, and the proportion of fully transparent or near‑black pixels (the “empty ratio”) is computed. If the empty ratio exceeds 0.01 %, the crop centre is re‑sampled within a 100‑pixel radius until a sufficiently “filled” patch is obtained. This process removes large swaths of sky, water, and other irrelevant regions, forcing the transformer to focus on structural features of the building itself.
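The resampling loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors’ code: the function names (`empty_ratio`, `sample_patch`), the near‑black threshold, and the retry cap are hypothetical choices; the 0.01 % threshold, 100‑pixel jitter radius, and RGBA alpha tracking come from the summary.

```python
import numpy as np

def empty_ratio(patch: np.ndarray, black_thresh: int = 10) -> float:
    """Fraction of pixels that are fully transparent or near-black.

    `patch` is an H x W x 4 RGBA array; the alpha channel (added during
    preprocessing) marks pixels outside the source image as transparent.
    """
    transparent = patch[..., 3] == 0
    near_black = patch[..., :3].max(axis=-1) < black_thresh
    return float((transparent | near_black).mean())

def sample_patch(image: np.ndarray, cx: int, cy: int, size: int = 224,
                 max_ratio: float = 0.0001, radius: int = 100,
                 max_tries: int = 50, rng=None) -> np.ndarray:
    """Crop a size x size patch around (cx, cy); if the empty ratio
    exceeds max_ratio (0.01 %), jitter the crop centre within `radius`
    pixels and retry."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    x, y = cx, cy
    for _ in range(max_tries):
        x0 = int(np.clip(x - size // 2, 0, max(w - size, 0)))
        y0 = int(np.clip(y - size // 2, 0, max(h - size, 0)))
        patch = image[y0:y0 + size, x0:x0 + size]
        if empty_ratio(patch) <= max_ratio:
            return patch
        # re-sample the crop centre within a 100-pixel radius
        x = cx + int(rng.integers(-radius, radius + 1))
        y = cy + int(rng.integers(-radius, radius + 1))
    return patch  # fall back to the last crop if no clean patch is found
```

In practice the crop size would follow the model in use (224 for DeiT, 518 for DINOv2), and the polygon centroid would come from the xBD building annotations.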
Training is conducted under strict compute constraints using Google Colab Pro+/A100 GPUs. Two fine‑tuning strategies are compared: (1) end‑to‑end (E2E) fine‑tuning, where all model weights are updated, and (2) frozen‑head (FH) fine‑tuning, where only the classification head is trainable. For DeiT, a grid search over learning rates {1e‑7, 1e‑6, 1e‑5, 5e‑5, 1e‑4} and batch sizes {8, 16, 24, 32} identified the best configuration (lr = 1e‑5, batch = 24, epochs = 5). DINOv2 E2E experiments used the two highest DeiT learning rates (1e‑5, 5e‑5) with batch size = 8 and only two epochs due to time limits. FH experiments for both models used lr = 1e‑3, batch = 24, epochs = 10.
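The difference between the two fine‑tuning strategies comes down to which parameters receive gradients. A minimal PyTorch sketch of the FH/E2E switch is below; the actual DeiT‑T and DINOv2‑small backbones would be loaded from a model hub (e.g. timm or Hugging Face), so a toy two‑layer model stands in here, and the helper name `set_finetune_mode` is hypothetical.

```python
import torch
import torch.nn as nn

def set_finetune_mode(model: nn.Module, head_name: str, frozen_head: bool):
    """FH mode: freeze everything except the classification head.
    E2E mode: leave all weights trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = (not frozen_head) or name.startswith(head_name)

# Toy stand-in for a ViT backbone + 4-class damage head.
model = nn.Sequential()
model.add_module("backbone", nn.Linear(384, 384))
model.add_module("head", nn.Linear(384, 4))

set_finetune_mode(model, head_name="head", frozen_head=True)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]

# The optimizer only sees the trainable subset (head weight and bias in FH).
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```

The same switch covers both regimes: the reported FH runs would use `frozen_head=True` with lr = 1e‑3, while the E2E runs set `frozen_head=False` with the smaller learning rates found in the grid search.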
Evaluation employs weighted accuracy, precision, recall, and macro‑averaged F1, with five independent runs on the held‑out test set to smooth out variance caused by the imbalance. The best DeiT‑E2E model achieves 78.2 % accuracy and a macro‑F1 of 0.599, outperforming its FH counterpart by more than 7 percentage points in accuracy and 0.07 in F1. DINOv2‑E2E reaches a macro‑F1 of 0.565, also better than its FH version but still trailing DeiT. Confusion matrices reveal that while “no‑damage” predictions are highly reliable, misclassifications occur among the three damaged categories, reflecting the subtle visual differences and residual class imbalance.
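Macro‑averaged F1 is the headline metric here because it weights the four classes equally, so a model cannot look good by predicting “no‑damage” everywhere. The toy numbers below are hypothetical and only illustrate the metric gap on an imbalanced label set, assuming scikit‑learn is available:

```python
from sklearn.metrics import accuracy_score, f1_score

CLASSES = ["no-damage", "minor", "major", "destroyed"]

# Hypothetical imbalanced test labels: "no-damage" (class 0) dominates.
y_true = [0] * 12 + [1] * 3 + [2] * 3 + [3] * 2
# Hypothetical predictions: perfect on the majority class, weaker on the
# three minority damage classes.
y_pred = [0] * 12 + [0, 0, 1] + [1, 1, 2] + [3, 3]

acc = accuracy_score(y_true, y_pred)            # 16/20 correct
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Accuracy (0.80) is flattered by the majority class, while macro-F1
# (~0.69) exposes the errors among minor/major damage.
```

Per‑class F1 here is roughly 0.92 / 0.33 / 0.50 / 1.00, which mirrors the confusion‑matrix pattern reported in the paper: reliable “no‑damage” predictions, confusion among the damaged categories.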
The study’s key insights are: (i) targeted patch extraction dramatically reduces irrelevant background, improving transformer attention on damage cues; (ii) even modest‑size ViTs can match or exceed CNN baselines when fine‑tuned end‑to‑end, despite limited computational resources; (iii) freezing the transformer backbone hampers performance, underscoring the importance of updating attention weights for this domain.
Limitations include the short training horizon for DINOv2 (only two epochs), lack of explicit techniques for handling label noise, and reliance on a single test split. Future work could explore cost‑sensitive loss functions, synthetic oversampling of minority damage classes, label‑cleaning pipelines, multi‑scale transformer architectures, and longer training schedules on more powerful hardware. Overall, the paper demonstrates that Vision Transformers, combined with a thoughtful preprocessing strategy, constitute a viable and competitive approach for satellite‑based building‑damage detection in disaster response scenarios.