Automated Marine Biofouling Assessment: Benchmarking Computer Vision and Multimodal LLMs on the Level of Fouling Scale

Notice: This research summary and analysis were generated automatically using AI technology. For absolute accuracy, please refer to the original arXiv source.

Marine biofouling on vessel hulls poses major ecological, economic, and biosecurity risks. Traditional survey methods rely on diver inspections, which are hazardous and limited in scalability. This work investigates automated classification of biofouling severity on the Level of Fouling (LoF) scale using both custom computer vision models and large multimodal language models (LLMs). Convolutional neural networks, transformer-based segmentation, and zero-shot LLMs were evaluated on an expert-labelled dataset from the New Zealand Ministry for Primary Industries. Computer vision models showed high accuracy at extreme LoF categories but struggled with intermediate levels due to dataset imbalance and image framing. LLMs, guided by structured prompts and retrieval, achieved competitive performance without training and provided interpretable outputs. The results demonstrate complementary strengths across approaches and suggest that hybrid methods integrating segmentation coverage with LLM reasoning offer a promising pathway toward scalable and interpretable biofouling assessment.


💡 Research Summary

This paper addresses the pressing need for scalable, safe, and objective assessment of marine biofouling on vessel hulls, a problem that currently relies on hazardous diver inspections and suffers from limited throughput. The authors benchmark two distinct families of AI methods—traditional computer‑vision (CV) models and large multimodal language models (LLMs)—for classifying the Level of Fouling (LoF) scale, a six‑point ordinal rating (0 = clean, 5 = heavy macro‑fouling) mandated by New Zealand’s biosecurity regulations.

The dataset consists of 762 underwater photographs supplied by the New Zealand Ministry for Primary Industries (MPI), each annotated by experts with a LoF label. The class distribution is heavily skewed (LoF 0: 7, LoF 1: 263, LoF 2: 70, LoF 3: 113, LoF 4: 126, LoF 5: 183), a factor that drives much of the performance disparity observed.
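The skew in this distribution is stark: the rarest class (LoF 0) has nearly 40 times fewer examples than the most common one (LoF 1). A minimal sketch of inverse-frequency class weighting, computed from the reported counts; the weighting scheme shown is a common convention, not one the paper necessarily used:

```python
# Inverse-frequency class weights from the reported MPI LoF distribution.
# weight_c = N / (K * n_c) is a common convention, not the paper's method.
lof_counts = {0: 7, 1: 263, 2: 70, 3: 113, 4: 126, 5: 183}

total = sum(lof_counts.values())  # 762 images in total
k = len(lof_counts)               # 6 LoF classes

class_weights = {c: total / (k * n) for c, n in lof_counts.items()}

for c in sorted(class_weights):
    print(f"LoF {c}: n={lof_counts[c]:3d}, weight={class_weights[c]:.2f}")
```

Rare classes (LoF 0) receive a weight more than an order of magnitude larger than common ones (LoF 1), which is exactly the imbalance a class-weighted loss would compensate for.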

Computer‑Vision pipeline
Two convolutional neural networks—ResNet‑18 and ResNet‑50—serve as baseline image classifiers. Both are trained with identical hyper‑parameters (Adam optimizer, 0.001 learning rate, 100 epochs). ResNet‑50 achieves an overall accuracy of 87 %, excelling at the extremes (LoF 0 and LoF 5 > 94 % accuracy) but dropping to ~58 % for the intermediate categories (LoF 2‑4). To improve interpretability, the authors also train SegFormer, a transformer‑based semantic segmentation model, to produce pixel‑level masks for four mutually exclusive classes: water, clean hull, slime, and macro‑fouling. The segmentation achieves a mean IoU of 0.71, enabling precise coverage calculations (e.g., slime covering 8 % of the hull area). By mapping these coverage percentages onto the LoF thresholds (5‑16 % → LoF 3, etc.), the model can directly reproduce the expert decision tree. However, performance degrades under high turbidity and poor lighting, and when slime and macro‑fouling appear visually similar.
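The coverage-to-LoF mapping can be sketched as a simple threshold function. Only the 5‑16 % macro-fouling band for LoF 3 appears in the summary above; the remaining cut-offs and the function name are illustrative placeholders, not the official scale:

```python
# Sketch: mapping segmentation-derived coverage percentages to a LoF label.
# Only the 5-16 % band for LoF 3 comes from the summary; the other
# thresholds below are illustrative placeholders, not the official scale.

def lof_from_coverage(slime_pct: float, macro_pct: float) -> int:
    """Return a Level of Fouling label (0-5) from hull coverage percentages."""
    if macro_pct >= 40:   # placeholder upper band
        return 5
    if macro_pct >= 16:   # placeholder
        return 4
    if macro_pct >= 5:    # 5-16 % macro-fouling -> LoF 3 (from the summary)
        return 3
    if macro_pct > 0:     # any visible macro-fouling
        return 2
    if slime_pct > 0:     # slime only
        return 1
    return 0              # clean hull

print(lof_from_coverage(8.0, 0.0))   # slime-only hull
print(lof_from_coverage(0.0, 12.0))  # macro-fouling in the 5-16 % band
```

Such a function makes the decision path auditable: the predicted label can always be traced back to two measured percentages.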

Multimodal LLM pipeline
The second approach leverages a state‑of‑the‑art vision‑language model accessed via the OpenRouter API (GPT‑4V). The authors employ a structured prompt that embeds the LoF decision tree, class definitions, and optional retrieval‑augmented generation (RAG) of domain‑specific documents. No fine‑tuning is performed; the model operates in a zero‑shot fashion. Across the full test set, the LLM attains an average accuracy of 81 %, notably outperforming the CV baseline on the middle LoF levels (by ≈12 %). Crucially, the LLM returns quantitative explanations (“slime covers ~9 % of the hull, macro‑fouling ~12 %”) alongside the predicted LoF label, providing a level of interpretability absent from pure CV outputs. Nevertheless, the model occasionally hallucinates, confusing slime with macro‑fouling in low‑resolution or heavily color‑distorted images, underscoring the need for post‑hoc verification.
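A request of this shape could be assembled as below. The sketch assumes OpenRouter's OpenAI-compatible chat-completions payload format; the model id, system-prompt wording, and image bytes are illustrative stand-ins, not the authors' actual prompt:

```python
# Sketch: building a zero-shot LoF classification request for an
# OpenAI-compatible vision endpoint such as OpenRouter's chat-completions API.
# Model id, decision-tree text, and image bytes are illustrative assumptions.
import base64
import json

LOF_DECISION_TREE = (
    "LoF 0: clean hull. LoF 1: slime only. "
    "LoF 2-5: increasing macro-fouling coverage (e.g. 5-16 % -> LoF 3). "
    "Report estimated slime and macro-fouling coverage, then the LoF label."
)

def build_request(image_bytes: bytes,
                  model: str = "openai/gpt-4-vision-preview") -> dict:
    """Return the JSON payload for a single-image LoF query."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": LOF_DECISION_TREE},
            {"role": "user", "content": [
                {"type": "text",
                 "text": "Classify this hull image on the LoF scale."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]},
        ],
    }

payload = build_request(b"\xff\xd8fake-jpeg-bytes")
print(json.dumps(payload)[:120])
```

Embedding the decision tree in the system message is what lets the model return coverage estimates alongside the label, rather than a bare class number.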

Key insights

  1. Data imbalance is the primary bottleneck for CV models; techniques such as class‑weighted loss, SMOTE‑style oversampling, or synthetic image generation are recommended.
  2. Segmentation‑based coverage offers a transparent bridge between raw pixel predictions and regulatory thresholds, but requires robust preprocessing (HSV normalization, edge enhancement) to mitigate underwater lighting variability.
  3. Multimodal LLMs bring world knowledge and textual reasoning to bear, allowing zero‑shot performance comparable to trained CV models while delivering human‑readable rationales. Their main drawbacks are occasional hallucinations and higher computational demand.
  4. Hybrid architecture—feeding SegFormer‑derived coverage metrics into the LLM’s prompt—combines the precision of pixel‑level analysis with the explanatory power of language models, representing a promising path toward operational deployment.
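
The hybrid idea in point 4 amounts to templating measured coverage into the LLM's context. A minimal sketch, with prompt wording of our own invention rather than the paper's:

```python
# Sketch: injecting segmentation-derived coverage into the LLM prompt so the
# language model reasons over measured numbers instead of guessed ones.
# Function name and wording are illustrative, not taken from the paper.

def hybrid_prompt(slime_pct: float, macro_pct: float) -> str:
    """Format SegFormer-style coverage metrics as LLM prompt context."""
    return (
        "A segmentation model measured the following hull coverage:\n"
        f"- slime: {slime_pct:.1f} %\n"
        f"- macro-fouling: {macro_pct:.1f} %\n"
        "Using the Level of Fouling decision tree, state the LoF label (0-5) "
        "and justify it with these measurements."
    )

print(hybrid_prompt(8.0, 12.0))
```

This division of labour plays to each component's strength: the segmentation model supplies calibrated pixel-level evidence, and the LLM supplies the threshold reasoning and a human-readable justification.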

Future directions
The authors propose extending the benchmark with (i) advanced data‑augmentation and domain‑adaptation to improve middle‑range LoF classification, (ii) lightweight multimodal models suitable for on‑board inference on energy‑constrained platforms, and (iii) integration of auxiliary metadata (vessel type, inspection location, season) to enrich the decision context. Ultimately, the study establishes a reproducible benchmarking framework and demonstrates that a synergistic CV‑LLM system can deliver scalable, accurate, and interpretable biofouling assessments, supporting New Zealand’s biosecurity objectives and offering a template for global maritime monitoring initiatives.

