📝 Original Info
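- Title: From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region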
- ArXiv ID: 2512.14312
- Date: 2025-12-16
- Authors: Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel
📝 Abstract
In the Middle East and North Africa (MENA) region, demand for wastewater treatment plants (WWTPs) is high, as they are crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods such as YOLOv8 segmentation require extensive manual labeling. Studies indicate, however, that vision-language models (VLMs) are an efficient alternative, achieving equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams, specifically to identify WWTPs. The YOLOv8 model was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and the UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives). The evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral). They are used to identify WWTP components such as circular/rectangular tanks and aeration basins, and to distinguish confounders, via expert prompts that produce JSON outputs with confidence scores and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and an equal number of non-WWTP sites from field/AI data, as 600 m × 600 m GeoTIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs outperforming YOLOv8's true positive rate, with Gemma 3 the highest. Results confirm that VLMs, particularly in the zero-shot setting, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
📄 Full Content
From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment
Plants Using Satellite Imagery in MENA Region
Akila Premarathna
Water Futures Data & Analytics
International Water Management
Institute
Colombo, Sri Lanka
A.Premarathna@cgiar.org
Kanishka Hewageegana
Water Futures Data & Analytics
International Water Management
Institute
Colombo, Sri Lanka
K.Hewageegana@cgiar.org
Garcia Andarcia Mariangel
Water Futures Data & Analytics
International Water Management
Institute
Colombo, Sri Lanka
M.GarciaAndarcia@cgiar.org
In the Middle East and North Africa (MENA) region, demand for
wastewater treatment plants (WWTPs) is high, as they are crucial
for sustainable water management. Precise identification of
WWTPs from satellite images enables environmental monitoring.
Traditional methods such as YOLOv8 segmentation require
extensive manual labeling. Studies indicate, however, that
vision-language models (VLMs) are an efficient alternative,
achieving equivalent or superior results through inherent
reasoning and annotation. This study presents a structured
methodology for VLM comparison, divided into zero-shot and
few-shot streams, specifically to identify WWTPs. The YOLOv8
model was trained on a governmental dataset of 83,566
high-resolution satellite images from Egypt, Saudi Arabia, and
the UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives).
The evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL,
DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral). They
are used to identify WWTP components such as circular/rectangular
tanks and aeration basins, and to distinguish confounders, via
expert prompts that produce JSON outputs with confidence scores
and descriptions. The dataset comprises 1,207 validated WWTP
locations (198 UAE, 354 KSA, 655 Egypt) and an equal number of
non-WWTP sites from field/AI data, as 600 m × 600 m GeoTIFF
images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP
images showed several VLMs outperforming YOLOv8’s true positive
rate, with Gemma 3 the highest. Results confirm that VLMs,
particularly in the zero-shot setting, can replace YOLOv8 for
efficient, annotation-free WWTP classification, enabling scalable
remote sensing.
Keywords: MENA, wastewater treatment plants (WWTPs),
vision-language models (VLMs), YOLOv8, Zero-shot learning,
Few-shot learning
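The abstract describes each sample as a 600 m × 600 m GeoTIFF tile in EPSG:4326. As a rough illustration of what that extent means in degrees, the sketch below derives a bounding box around a centre coordinate. It is not the authors' tiling procedure: the function name `tile_bounds`, the constant of about 111,320 m per degree of latitude, and the spherical-Earth approximation are all assumptions made for this example.

```python
import math

# Assumption: spherical-Earth approximation, ~111.32 km per degree of latitude.
METERS_PER_DEGREE_LAT = 111_320.0


def tile_bounds(lat: float, lon: float, size_m: float = 600.0):
    """Return (min_lon, min_lat, max_lon, max_lat) in degrees for a square
    tile of side size_m centred on (lat, lon).

    Hypothetical helper for illustration only.
    """
    half = size_m / 2.0
    # Degrees of latitude covered by half the tile side.
    dlat = half / METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with cos(latitude), so more degrees
    # are needed to cover the same ground distance away from the equator.
    dlon = half / (METERS_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)


# Example: a tile centred near 30° N, 31° E (illustrative coordinates).
bounds = tile_bounds(30.0, 31.0)
```

Because a degree of longitude covers less ground at higher latitudes, the tile spans more degrees east-west than north-south everywhere except at the equator.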
I. INTRODUCTION
Water scarcity is a fundamental problem in arid and semi-arid
regions, where efficient water management supports economic
growth, public well-being, and environmental sustainability
[1], [2]. In MENA nations, which have scarce freshwater
resources and high urbanization rates, wastewater treatment
plants (WWTPs) play a significant role in water reclamation and
reuse, which reduces pressure on scarce natural water resources
[4], [5]. These plants treat domestic, industrial, and
agricultural wastewater for irrigation and industrial reuse [6].
However, monitoring and managing WWTPs across large and remote
territories is very difficult. This is particularly true in MENA
countries such as the United Arab Emirates (UAE), the Kingdom of
Saudi Arabia (KSA), and Egypt, which often lack the necessary
information or data. The main difficulty lies in monitoring the
environment and ensuring that the plants comply with regulations
[8].
High-resolution satellite imaging, combined with remote sensing
technologies, offers an advantage for detecting and assessing
WWTP infrastructure while reducing reliance on time-consuming
on-ground surveys [9]. Traditional computer vision algorithms,
such as the YOLOv8 object detection model, are effective at
identifying WWTP structures such as circular/rectangular tanks,
aeration basins, and clarifiers in aerial imagery [11]. YOLOv8
uses convolutional neural networks to achieve real-time
processing and high accuracy [12]. Yet its dependence on large,
manually labelled datasets has significant drawbacks, including
high labor and time demands, as well as susceptibility to errors
in distinguishing WWTPs from visual distractors like industrial
tanks or agricultural ponds [13]. These drawbacks are
particularly critical in resource-poor environments [14].
The emergence of VLMs, which merge computer vision and natural
language processing, offers an innovative alternative that
addresses the drawbacks experienced with YOLO [15]. By
leveraging pre-trained foundation models, VLMs support zero-shot
and few-shot learning, generalizing to new tasks through
explanatory prompts without relying on very large labelled
datasets [16]. Their reasoning capability allows semantic
interpretation of images and structured outputs, e.g.,
detections in JSON format with confidence scores and descriptive
labels, thereby alleviating the annotation burden and improving
efficiency in remote sensing tasks for environmental monitoring
[14].
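To make the JSON-style output concrete, the following minimal sketch parses a VLM reply and computes a true positive rate over images known to contain WWTPs. The field names `is_wwtp`, `confidence`, and `description`, the 0.5 threshold, and the sample replies are hypothetical. The source states only that outputs are JSON with confidence and descriptions; it does not give the schema or decision rule.

```python
import json
from dataclasses import dataclass


@dataclass
class Detection:
    is_wwtp: bool
    confidence: float
    description: str


def parse_response(raw: str) -> Detection:
    """Parse one JSON reply from a VLM. Field names are assumptions."""
    data = json.loads(raw)
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Detection(
        is_wwtp=bool(data["is_wwtp"]),
        confidence=confidence,
        description=str(data.get("description", "")),
    )


def true_positive_rate(detections, threshold: float = 0.5) -> float:
    """Share of known-WWTP images flagged as WWTP with confidence at or
    above the threshold. All inputs must come from WWTP images."""
    if not detections:
        raise ValueError("no detections supplied")
    hits = sum(1 for d in detections if d.is_wwtp and d.confidence >= threshold)
    return hits / len(detections)


# Made-up replies for two images that are both known WWTP sites.
replies = [
    '{"is_wwtp": true, "confidence": 0.92, "description": "two circular clarifiers"}',
    '{"is_wwtp": false, "confidence": 0.40, "description": "agricultural pond"}',
]
tpr = true_positive_rate([parse_response(r) for r in replies])
```

Validating the confidence range at parse time is one way to catch malformed model replies before they affect the metric.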
This paper compares the efficienc
…(Full text truncated)…