Evaluating the Performance of Open-Vocabulary Object Detection in Low-quality Image

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

Open-vocabulary object detection enables models to localize and recognize objects beyond a predefined set of categories and is expected to achieve recognition capabilities comparable to human performance. In this study, we evaluate the performance of existing models on open-vocabulary object detection tasks under low-quality image conditions. To this end, we introduce a new dataset that simulates the low-quality images encountered in the real world. In our evaluation experiments, we find that although open-vocabulary object detection models exhibit no significant decrease in mAP under low-level image degradation, the performance of all models drops sharply under high-level degradation. The OWLv2 models consistently perform better across different types of degradation, while OWL-ViT, GroundingDINO, and Detic show significant performance declines. We will release our dataset and code to facilitate future studies.


💡 Research Summary

This paper investigates how state‑of‑the‑art open‑vocabulary object detection (OVD) models perform when confronted with low‑quality images that are common in real‑world applications such as robotics, autonomous driving, and mobile devices. While most prior OVD work evaluates models on high‑resolution, clean datasets like MS‑COCO or LVIS, the authors argue that degradation types such as compression artifacts, exposure errors, sensor noise, and motion blur can dramatically affect detection quality. To address this gap, they construct a new benchmark, the Low‑Quality Image Dataset, by taking the COCO 2017 validation set (5 000 images) and applying four degradation families—lossy JPEG compression, gamma correction, Gaussian noise, and average blur—each at five severity levels. This yields 100 000 processed images (5 000 × 4 × 5) plus the original set, and the processing pipeline and code are released publicly.
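The degradation pipeline described above can be sketched roughly as follows. This is a minimal NumPy illustration of three of the four degradation families on a grayscale image in [0, 1], not the authors' released code; the exact severity parameterization is an assumption, and the JPEG re-encoding step is omitted because it requires an image codec (e.g., Pillow or OpenCV):

```python
import numpy as np

def apply_gamma(img: np.ndarray, gamma: float) -> np.ndarray:
    """Gamma correction on an image in [0, 1]; gamma > 1 darkens, gamma < 1 brightens."""
    return np.clip(img ** gamma, 0.0, 1.0)

def apply_gaussian_noise(img: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Additive Gaussian sensor noise; sigma is the std in 8-bit intensity units."""
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma / 255.0, img.shape), 0.0, 1.0)

def apply_average_blur(img: np.ndarray, k: int) -> np.ndarray:
    """k x k box (average) blur, with edge replication at the image borders."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    out = np.zeros_like(img)
    # Sum the k*k shifted copies of the padded image, then normalize.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)
```

Running each degradation family at five severity levels over the 5 000 COCO validation images then yields the reported 5 000 × 4 × 5 = 100 000 processed images.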

Six representative OVD models are evaluated on this benchmark: two variants of OWL‑ViT (B/16 and B/32), two variants of the newer OWLv2 (B/16 and L/14), GroundingDINO (Tiny), and Detic. All models receive the same 80 COCO category names as free‑text queries, and inference is performed with a confidence threshold of 0.1. Results are reported as mean Average Precision (mAP) averaged over IoU thresholds from 0.50 to 0.90 in 0.05 steps, a slightly narrower sweep than the standard COCO protocol, which extends to 0.95.
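As a concrete illustration of this protocol (not the authors' evaluation code), the confidence cutoff and the IoU-threshold sweep can be set up as below; `box_iou` is a standard pairwise intersection-over-union for corner-format boxes:

```python
import numpy as np

CONF_THRESHOLD = 0.1  # detections scoring below this are discarded before matching
IOU_THRESHOLDS = np.arange(0.50, 0.95, 0.05)  # 0.50, 0.55, ..., 0.90 (nine thresholds)

def box_iou(a, b):
    """IoU of two boxes given in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```

A prediction counts as a true positive at a given threshold when its IoU with an unmatched ground-truth box of the same category meets that threshold; mAP is then the average precision, averaged over the 80 categories and the nine thresholds.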

The experimental findings reveal two clear patterns. First, under mild degradations (e.g., JPEG quality 80, gamma 0.8–1.2, noise σ 20, blur kernel 6), most models retain performance close to that on pristine images. OWLv2‑L/14 consistently achieves the highest absolute mAP (≈35–36 %). Second, under severe degradations (JPEG quality 0, gamma 2.0, noise σ 50, blur kernel 12), all models suffer substantial drops, but the magnitude varies dramatically. OWL‑ViT models and Detic fall below 15 % mAP, whereas OWLv2‑L/14 still scores around 34 %, demonstrating superior robustness. Detailed tables show that compression primarily harms OWL‑ViT, while OWLv2’s performance remains relatively stable across the entire quality range. Gamma adjustments cause the least performance loss overall, again with OWLv2 leading. Noise and blur introduce roughly linear declines; however, the rate of decline is gentlest for OWLv2 and steepest for OWL‑ViT and Detic.

The authors interpret these results in terms of architectural and training differences. OWLv2 benefits from large‑scale image‑text pre‑training and multi‑scale feature fusion, which appear to endow it with better generalization to various visual corruptions. OWL‑ViT, despite its elegant Vision Transformer design, relies heavily on precise image‑text alignment; thus, compression artifacts and noise quickly disrupt its matching process. GroundingDINO shows moderate resilience but is particularly vulnerable to strong blur, likely because its grounding mechanism depends on fine‑grained spatial cues. Detic’s reliance on image‑level supervision makes it sensitive to any degradation that weakens the correspondence between visual features and textual labels.

The paper acknowledges several limitations. The synthetic degradations, while systematically varied, may not capture the full complexity of real‑world low‑quality captures (e.g., combined motion blur and low light). Moreover, the study evaluates off‑the‑shelf models without any fine‑tuning on degraded data, leaving open the question of how much performance can be recovered through targeted training or loss‑function redesign. Finally, computational efficiency under low‑quality conditions is not examined, which is relevant for edge devices.

Future work is suggested along three axes: (1) collecting and annotating authentic low‑quality image datasets to validate the synthetic benchmark; (2) exploring data‑augmentation strategies, robustness‑oriented loss functions, and domain‑adaptation techniques to improve OVD resilience; and (3) designing lightweight, real‑time OVD architectures that incorporate human‑like quality‑aware processing (e.g., adaptive denoising or exposure correction) before detection.

In summary, the study provides the first large‑scale, systematic evaluation of open‑vocabulary object detectors under realistic image degradations, releases a valuable benchmark for the community, and highlights that while current OVD models excel on clean data, only the OWLv2 family demonstrates notable robustness to severe quality loss. The findings motivate further research on robustness‑enhancing training regimes and model designs that can bridge the gap between laboratory performance and real‑world applicability.

