LAB-Det: Language as a Domain-Invariant Bridge for Training-Free One-Shot Domain Generalization in Object Detection

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Foundation object detectors such as GLIP and Grounding DINO excel on general-domain data but often degrade in specialized, data-scarce settings such as underwater imagery or industrial defects. Typical cross-domain few-shot approaches rely on fine-tuning with scarce target data, incurring cost and overfitting risks. We instead ask: can a frozen detector adapt with only one exemplar per class, without training? To answer this, we introduce training-free one-shot domain generalization for object detection, where detectors must adapt to specialized domains with only one annotated exemplar per class and no weight updates. To tackle this task, we propose LAB-Det, which exploits Language As a domain-invariant Bridge. Instead of adapting visual features, we project each exemplar into a descriptive text that conditions and guides a frozen detector. This linguistic conditioning replaces gradient-based adaptation, enabling robust generalization in data-scarce domains. We evaluate on UODD (underwater) and NEU-DET (industrial defects), two widely adopted benchmarks for data-scarce detection where object boundaries are often ambiguous. LAB-Det achieves up to a 5.4 mAP improvement over state-of-the-art fine-tuned baselines without updating a single parameter. These results establish linguistic adaptation as an efficient and interpretable alternative to fine-tuning in specialized detection settings.


💡 Research Summary

The paper tackles a practical yet under‑explored scenario: adapting a large‑scale, pre‑trained object detector to a highly specialized target domain when only a single labeled exemplar per class is available and no model weights may be updated. The authors formally define this setting as “training‑free one‑shot domain generalization for object detection.” Existing approaches—few‑shot detection, cross‑domain detection, and cross‑domain few‑shot detection—rely on fine‑tuning or on abundant unlabeled target data, both of which are infeasible under extreme data scarcity.

To address the problem, the authors propose LAB‑Det (Language As a domain‑invariant Bridge for Detection). The core insight is that natural language can abstract away low‑level visual variations (color casts, lighting, texture noise) while preserving high‑level semantic cues (shape, material, texture). By converting each one‑shot exemplar into a rich textual description, the detector can be conditioned on language rather than visual features, eliminating the need for gradient‑based adaptation.

The pipeline consists of four stages:

  1. Exemplar‑to‑Language Projection – The support image is first segmented with the Segment‑Anything Model (SAM) using the provided bounding box as a prompt. The resulting object mask, together with a domain‑aware instruction (“This is a d‑domain image. The masked object is a t_c. Describe it in one short sentence using the word t_c”), is fed to the Describe‑Anything Model (DAM). DAM outputs a concise natural‑language sentence describing the object (e.g., “A vertical, elongated, irregularly shaped dark gray streak with a rough texture and uneven edges”). This sentence is tokenized into a set of phrases {p_c,m}.

  2. Language‑conditioned Candidate Detection – All phrases from all classes form a prompt library P. A frozen Grounding DINO detector receives P and an unseen target image, producing a set of candidate boxes {b_i} together with phrase‑box relevance scores s_gd(i,c,m) ∈ [0,1].
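The two stages above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the instruction wording follows the template quoted in Stage 1, while the comma-based phrase split and the max-over-phrases score aggregation are assumptions of this sketch (the SAM segmentation and DAM/Grounding DINO calls are mocked out by plain data).

```python
def build_instruction(domain: str, class_name: str) -> str:
    """Domain-aware instruction fed to the describer (template from Stage 1)."""
    return (f"This is a {domain} image. The masked object is a {class_name}. "
            f"Describe it in one short sentence using the word {class_name}.")

def sentence_to_phrases(sentence: str) -> list[str]:
    """Split a DAM description into phrases {p_c,m}.
    Comma tokenization is an assumption, not the paper's exact rule."""
    return [p.strip(" .") for p in sentence.split(",") if p.strip(" .")]

def aggregate_scores(phrase_scores: dict[str, list[float]]) -> dict[str, float]:
    """Reduce per-phrase relevance scores s_gd(i, c, m) for one candidate box
    to a single score per class. Max over a class's phrases is one plausible
    choice; the paper may aggregate differently."""
    return {cls: max(scores) for cls, scores in phrase_scores.items()}

# Usage, with detector scores for one candidate box mocked as plain floats:
instruction = build_instruction("underwater", "sea cucumber")
phrases = sentence_to_phrases(
    "A vertical, elongated, irregularly shaped dark gray streak "
    "with a rough texture and uneven edges")
class_scores = aggregate_scores({"sea cucumber": [0.12, 0.71, 0.33],
                                 "sea urchin": [0.05, 0.09]})
print(class_scores["sea cucumber"])  # 0.71
```

The prompt library P is then simply the union of each class's phrase list, and the frozen detector scores every candidate box against every phrase in P.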

