Arxiv 2512.14312

Reading time: 5 minutes

📝 Original Info

  • Title: Arxiv 2512.14312
  • ArXiv ID: 2512.14312
  • Date: 2025-12-16
  • Authors: Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel

📝 Abstract

In regions of the Middle East and North Africa (MENA), there is a high demand for wastewater treatment plants (WWTPs), which are crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods like YOLOv8 segmentation require extensive manual labeling. However, studies indicate that vision-language models (VLMs) are an efficient alternative, achieving equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams, specifically to identify WWTPs. The YOLOv8 model was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and UAE: ~85% WWTPs (positives), ~15% non-WWTPs (negatives). Evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), used to identify WWTP components such as circular/rectangular tanks and aeration basins, and to distinguish confounders via expert prompts producing JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and an equal number of non-WWTP sites from field/AI data, as 600m×600m GeoTIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs outperforming YOLOv8's true positive rate, with Gemma 3 highest. Results confirm that VLMs, particularly in the zero-shot setting, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
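The abstract describes each sample as a 600m×600m tile in EPSG:4326. As a rough illustration of what that footprint means in degrees, the sketch below converts a centre point into tile bounds. It is not the authors' pipeline: the function name `tile_bounds`, the assumption that tiles are centred on the site, the spherical constant of about 111,320 m per degree of latitude, and the sample coordinates are all assumptions made here for illustration.

```python
import math

# Approximate length of one degree of latitude on a spherical Earth.
# This constant is an assumption of the sketch, not a value from the paper.
METERS_PER_DEGREE_LAT = 111_320.0


def tile_bounds(lat, lon, side_m=600.0):
    """Return (west, south, east, north) in degrees for a square tile.

    Assumes the tile is centred on (lat, lon); the paper does not say so.
    """
    half_lat = (side_m / 2) / METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    half_lon = (side_m / 2) / (METERS_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lon - half_lon, lat - half_lat, lon + half_lon, lat + half_lat)


# Validated WWTP locations per country, as reported in the abstract.
WWTP_SITES = {"UAE": 198, "KSA": 354, "Egypt": 655}

# Illustrative coordinates only (a point near Cairo), not a site from the dataset.
west, south, east, north = tile_bounds(30.0444, 31.2357)
print(f"bounds: {west:.5f}, {south:.5f}, {east:.5f}, {north:.5f}")
print("total WWTP sites:", sum(WWTP_SITES.values()))
```

Because longitude degrees are shorter away from the equator, the tile spans slightly more degrees east to west than north to south at MENA latitudes. A projected coordinate system would give exact metric tiles.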

📄 Full Content

From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region

Akila Premarathna Water Futures Data & Analytics International Water Management Institute Colombo, Sri Lanka A.Premarathna@cgiar.org Kanishka Hewageegana Water Futures Data & Analytics International Water Management Institute Colombo, Sri Lanka K.Hewageegana@cgiar.org

Garcia Andarcia Mariangel Water Futures Data & Analytics International Water Management Institute Colombo, Sri Lanka M.GarciaAndarcia@cgiar.org

In regions of the Middle East and North Africa (MENA), there is a high demand for wastewater treatment plants (WWTPs), which are crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods like YOLOv8 segmentation require extensive manual labeling. However, studies indicate that vision-language models (VLMs) are an efficient alternative, achieving equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams, specifically to identify WWTPs. The YOLOv8 model was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and UAE: ~85% WWTPs (positives), ~15% non-WWTPs (negatives). Evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), used to identify WWTP components such as circular/rectangular tanks and aeration basins, and to distinguish confounders via expert prompts producing JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and an equal number of non-WWTP sites from field/AI data, as 600m×600m GeoTIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs outperforming YOLOv8's true positive rate, with Gemma 3 highest. Results confirm that VLMs, particularly in the zero-shot setting, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
Keywords: MENA, wastewater treatment plants (WWTPs), vision-language models (VLMs), YOLOv8, Zero-shot learning, Few-shot learning

I. INTRODUCTION
Water scarcity is a fundamental problem in arid and semi-arid regions, where efficient water management strengthens economic growth, public well-being, and environmental harmony [1], [2]. In MENA nations, which combine scarce freshwater resources with high urbanization rates, wastewater treatment plants (WWTPs) play a significant role in water reclamation and reuse, reducing pressure on natural water resources [4], [5]. These plants treat domestic, industrial, and agricultural wastewater for irrigation and industrial reuse [6]. However, monitoring and managing WWTPs across large and remote territories is very difficult. This holds in MENA countries such as the United Arab Emirates (UAE), the Kingdom of Saudi Arabia (KSA), and Egypt, where the necessary information or data is often lacking. The main difficulty lies in monitoring the environment and ensuring that plants comply with regulations [8]. High-resolution satellite imagery, combined with remote sensing technologies, offers an advantage for detecting and assessing WWTP infrastructure while reducing reliance on time-consuming ground surveys [9]. Traditional computer vision algorithms, such as the YOLOv8 object detection model, are effective at discriminating between WWTP structures such as circular/rectangular tanks, aeration basins, and clarifiers in aerial imagery [11]. YOLOv8 uses convolutional neural networks to achieve real-time processing and high accuracy [12]. Yet its dependence on large, manually labelled datasets has significant drawbacks, including high labor and time demands, as well as susceptibility to errors in distinguishing WWTPs from visual distractors like industrial tanks or agricultural ponds [13]. These drawbacks are especially critical in resource-constrained environments [14].
The emergence of VLMs, which merge computer vision and natural language processing, offers an alternative that addresses the drawbacks of YOLO [15]. By leveraging pre-trained foundation models, VLMs support zero-shot and few-shot learning, generalizing to new tasks through explanatory prompts without relying on large labelled datasets [16]. Their reasoning capability allows semantic interpretation of images and structured results, i.e., detections in JSON format with confidence scores and descriptive labels, thereby alleviating the annotation burden and improving efficiency in remote sensing tasks for environmental monitoring [14]. This paper compares the efficienc

…(Full text truncated)…
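The introduction mentions VLM detections returned as JSON with confidence scores and descriptive labels, and the abstract reports true positive rate on WWTP images. A minimal sketch of how such replies might be parsed and scored follows. The field names `is_wwtp`, `confidence`, and `description`, the 0.5 threshold, and the sample replies are hypothetical; the paper's actual prompt schema is not shown in the available text.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    is_wwtp: bool
    confidence: float
    description: str


def parse_reply(raw: str) -> Verdict:
    """Parse one model reply. The JSON field names are hypothetical."""
    # Models often wrap JSON in prose or code fences, so keep only the
    # text between the first "{" and the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(raw[start:end + 1])
    # Clamp confidence to [0, 1] in case the model drifts out of range.
    confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    return Verdict(data["is_wwtp"] is True, confidence, str(data.get("description", "")))


def true_positive_rate(verdicts, threshold=0.5):
    """Share of known-WWTP images flagged as WWTP at or above the threshold."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    hits = sum(1 for v in verdicts if v.is_wwtp and v.confidence >= threshold)
    return hits / len(verdicts)


# Made-up replies for images that are all known WWTPs (illustration only).
sample_replies = [
    '```json\n{"is_wwtp": true, "confidence": 0.9, "description": "two circular clarifiers"}\n```',
    '{"is_wwtp": false, "confidence": 0.8, "description": "agricultural pond"}',
    'Answer: {"is_wwtp": true, "confidence": 0.3, "description": "unclear tanks"}',
    '{"is_wwtp": true, "confidence": 1.4}',
]
sample_verdicts = [parse_reply(r) for r in sample_replies]
print(true_positive_rate(sample_verdicts))
```

Scoring only positive images yields a true positive rate but says nothing about false alarms, so a fuller evaluation would also run the same function's complement over the non-WWTP tiles.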

Reference

This content is AI-processed based on ArXiv data.
