Uncertainty-Gated Region-Level Retrieval for Robust Semantic Segmentation

Reading time: 4 minutes

📝 Original Info

📝 Abstract

Semantic segmentation of outdoor street scenes plays a key role in applications such as autonomous driving, mobile robotics, and assistive technology for visually impaired pedestrians. For these applications, accurately distinguishing between key surfaces and objects such as roads, sidewalks, vehicles, and pedestrians is essential for maintaining safety and minimizing risks. Semantic segmentation must be robust to different environments, lighting and weather conditions, and sensor noise, while being performed in real time. We propose a region-level, uncertainty-gated retrieval mechanism that improves segmentation accuracy and calibration under domain shift. Our best method achieves an 11.3% increase in mean intersection-over-union while reducing retrieval cost by 87.5%, retrieving for only 12.5% of regions compared to 100% for an always-on baseline.

💡 Deep Analysis

Figure 1: Retrieval is used on regions of high uncertainty to improve the robustness of outputs from a lightweight segmentation model.

📄 Full Content

Uncertainty-Gated Region-Level Retrieval for Robust Semantic Segmentation

Shreshth Rajan, Harvard University, Cambridge, MA, USA (shreshthrajan@college.harvard.edu)
Raymond Liu, Harvard University, Cambridge, MA, USA (liur@g.harvard.edu)

I. INTRODUCTION AND RELATED WORK

Semantic segmentation is a computer vision task that assigns a class label to every pixel in an image. In outdoor street scenes, it is used in applications such as autonomous driving, mobile robotics, and visual assistants for visually impaired pedestrians to detect key surfaces and objects such as roads, sidewalks, vehicles, and pedestrians. Performing semantic segmentation accurately and in real time is crucial in these applications, where inaccurate or delayed predictions can pose serious safety risks.

Semantic segmentation of outdoor street scenes suffers from multiple degrees of domain shift: different locations, day/night conditions, weather conditions, and various image artifacts can all severely degrade segmentation model performance [3, 6]. Datasets containing high-quality segmentation labels of outdoor street scenes are typically composed largely of images taken during the daytime in clear weather [2, 7], or are limited to a few cities [10]. Segmentation models trained on images from one or more of these datasets may not generalize well to other domains.

A. Retrieval-Augmented Segmentation

Memory-based retrieval has been explored for semantic segmentation under domain shift. Pin the Memory [5] improved segmentation of urban scenes but required retraining with memory guidance. More recent work on few-shot medical image segmentation [12] retrieves similar samples at the image level using DINOv2 features and SAM2's memory attention, achieving strong performance without retraining. However, image-level retrieval is computationally expensive for dense prediction tasks.

Fig. 1. Retrieval is used on regions of high uncertainty to improve the robustness of outputs from a lightweight segmentation model.

B. Uncertainty Estimation

Test-time augmentation has been used to estimate uncertainty by measuring prediction variability across augmented inputs. Mutual information, which isolates epistemic uncertainty from aleatoric uncertainty, has shown promise for identifying model uncertainty that can be reduced through additional information [4]. Expected Pairwise KL Divergence (EPKL) measures disagreement between ensemble predictions and has achieved strong correlation (r > 0.9) with segmentation quality on medical imaging datasets [1].

C. Foundation Models

DINOv2 [8] is a self-supervised vision transformer trained on 142M images that produces robust semantic features. Region-level representations [9] built on DINOv2 have been shown to be effective for retrieval tasks while being computationally efficient compared to image-level approaches.

We propose a selective, uncertainty-gated retrieval mechanism for domain adaptation that improves segmentation accuracy and calibration under domain shift without retraining (Fig. 1). Given a region of interest in an image, we retrieve similar regions from a memory bank and fuse their corresponding probability maps with the base model logits. Similarity is determined using embeddings from DINOv2 [8]. Retrieval is performed only in regions with high uncertainty, allowing the model to adaptively refine its predictions without excessive latency.
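To make the mechanism concrete, here is a minimal PyTorch sketch of the gating-and-fusion step. It is an illustration under stated assumptions, not the paper's exact formulation: the helpers `embed_region` and `memory_bank.search`, the gate threshold `tau`, the fusion weight `alpha`, and the choice to store one class-probability vector per memory entry are all hypothetical.

```python
import torch

def uncertainty_gated_fusion(probs, image, region_masks, region_uncertainty,
                             embed_region, memory_bank,
                             tau=0.5, alpha=0.5, k=3):
    """Refine per-pixel class probabilities by retrieving only for uncertain regions.

    probs:               (C, H, W) softmaxed base-model output
    image:               (3, H, W) input image
    region_masks:        iterable of (H, W) boolean masks, one per region
    region_uncertainty:  iterable of scalar uncertainty scores, one per region
    embed_region:        assumed helper: (image, mask) -> DINOv2 region embedding
    memory_bank:         assumed helper: .search(query, k) -> [(class_probs, sim), ...]
    """
    probs = probs.clone()
    for mask, u in zip(region_masks, region_uncertainty):
        if u < tau:
            # Gate: confident regions skip retrieval entirely,
            # which is where the retrieval-cost saving comes from.
            continue
        query = embed_region(image, mask)
        neighbors = memory_bank.search(query, k=k)
        sims = torch.tensor([s for _, s in neighbors])
        weights = torch.softmax(sims, dim=0)
        # Similarity-weighted average of the retrieved regions' class probabilities.
        retrieved = sum(w * p for w, (p, _) in zip(weights, neighbors))  # (C,)
        # Convex fusion of base and retrieved probabilities inside the region.
        probs[:, mask] = (1 - alpha) * probs[:, mask] + alpha * retrieved[:, None]
    return probs
```

Because the gate skips confident regions, retrieval cost scales with the fraction of uncertain regions, which is the source of the 87.5% cost reduction reported in the abstract.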
II. INITIAL APPROACH

We implemented region-level uncertainty detection with SegFormer-B0 [11], a powerful and lightweight segmentation model with 3.7M parameters. The SegFormer-B0 model was fine-tuned on the Cityscapes dataset [2], which contains vehicle-egocentric images and corresponding segmentation labels. For evaluation, we used the Cityscapes validation set.

Uncertainty is measured using test-time augmentation. We generate five total predictions: one from the original image and four from augmented versions created using horizontal flip, rescaling (scales 0.9 and 1.1), and color jitter (brightness 0.1, contrast 0.1, saturation 0.05, hue 0.02). Initially, we co
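The test-time augmentation scheme described above translates directly into a small loop. The sketch below assumes `model` maps a (1, 3, H, W) batch to (1, C, h, w) logits; the mutual-information computation is the standard decomposition (entropy of the mean prediction minus the mean per-prediction entropy), and the deterministic brightness shift is only a stand-in for the paper's random color jitter.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def tta_mutual_information(model, image):
    """Per-pixel epistemic uncertainty from five TTA predictions.

    image: (3, H, W) float tensor. Returns an (H, W) mutual-information map.
    """
    H, W = image.shape[-2:]
    views = [
        image,                                                            # original
        TF.hflip(image),                                                  # horizontal flip
        TF.resize(image, [int(H * 0.9), int(W * 0.9)], antialias=True),  # scale 0.9
        TF.resize(image, [int(H * 1.1), int(W * 1.1)], antialias=True),  # scale 1.1
        TF.adjust_brightness(image, 1.1),  # stand-in for random color jitter
    ]
    probs = []
    for i, view in enumerate(views):
        p = model(view.unsqueeze(0)).softmax(dim=1)   # (1, C, h, w) logits -> probs
        p = TF.resize(p, [H, W], antialias=True)      # undo rescaling
        if i == 1:
            p = TF.hflip(p)                           # undo the flip
        probs.append(p.squeeze(0))
    probs = torch.stack(probs)                        # (5, C, H, W)
    mean_p = probs.mean(dim=0)
    entropy_of_mean = -(mean_p * mean_p.clamp_min(1e-8).log()).sum(dim=0)
    mean_entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean(dim=0)
    return entropy_of_mean - mean_entropy             # epistemic part of the uncertainty
```

Per-region gate scores can then be obtained by averaging this map over each region mask.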


Reference

This content is AI-processed from open-access arXiv data (arXiv:2512.18082).
