From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
Traditional object detection methods operate under the closed-set assumption, where models can only detect a fixed set of object categories predefined in the training set. Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an in-principle unbounded vocabulary, which reduces the cost of training models for specific tasks. However, OVD heavily relies on accurate prompts provided by an “oracle”, which limits its use in critical applications such as driving scene perception. OVD models tend to misclassify near-out-of-distribution (NOOD) objects that have similar features to known classes, and ignore far-out-of-distribution (FOOD) objects. To address these limitations, we propose a framework that enables OVD models to operate in open world settings, by identifying and incrementally learning previously unseen objects. To detect FOOD objects, we propose Open World Embedding Learning (OWEL) and introduce the concept of Pseudo Unknown Embedding, which infers the location of unknown classes in a continuous semantic space based on the information of known classes. We also propose Multi-Scale Contrastive Anchor Learning (MSCAL), which enables the identification of misclassified unknown objects by promoting the intra-class consistency of object embeddings at different scales. The proposed method achieves state-of-the-art performance on standard open world object detection and autonomous driving benchmarks while maintaining its open vocabulary object detection capability.
💡 Research Summary
The paper presents a unified framework that extends open‑vocabulary object detection (OVD) into true open‑world object detection (OWOD), enabling both the discovery of far‑out‑of‑distribution (FOOD) objects and the correct handling of near‑out‑of‑distribution (NOOD) objects while continuously learning new categories without catastrophic forgetting. Traditional closed‑set detectors assume a fixed label set, which is unrealistic for safety‑critical domains such as autonomous driving where unknown objects frequently appear. Recent OVD approaches leverage large vision‑language models (e.g., CLIP) to match image features with text embeddings, allowing zero‑shot detection of any class defined by a prompt. However, OVD still suffers from two major issues: (1) misclassification of NOOD objects that are semantically similar to known classes, and (2) complete failure to detect FOOD objects that lie far from any prompt in the semantic space.
To overcome these limitations, the authors introduce two novel components: (i) Open World Embedding Learning (OWEL) with a Pseudo Unknown Embedding, and (ii) Multi‑Scale Contrastive Anchor Learning (MSCAL).
Pseudo Unknown Embedding
The method first computes the mean text embedding of all known classes, denoted w, using a pretrained CLIP text encoder. A generic “objectness” embedding w₀ (e.g., the word “object”) is also obtained. The unknown‑class embedding w_U is defined as
w_U = w₀ – α·w,
where α is a scalar hyper‑parameter. Because w changes whenever new classes are added, w_U is recomputed at inference time, automatically shifting its position in the continuous semantic space to focus on regions not covered by known classes. This dynamic construction enables the detector to generate a dedicated embedding for FOOD objects without any additional visual data.
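The construction above can be sketched in a few lines. The following NumPy snippet is a minimal illustration of the w_U = w₀ – α·w computation under our own simplifying assumptions (pre-computed text embeddings passed in as arrays, L2-normalized vectors for cosine scoring); the function name and normalization details are placeholders, not the authors' implementation:

```python
import numpy as np

def pseudo_unknown_embedding(known_text_embs, objectness_emb, alpha=0.5):
    """Compute w_U = w0 - alpha * w (illustrative sketch).

    known_text_embs: (num_known, d) array of known-class text embeddings
                     (e.g., from a pretrained CLIP text encoder).
    objectness_emb:  (d,) embedding of a generic prompt such as "object".
    alpha:           scalar hyper-parameter.
    """
    w = known_text_embs.mean(axis=0)                 # mean known-class embedding
    w = w / np.linalg.norm(w)
    w0 = objectness_emb / np.linalg.norm(objectness_emb)
    w_u = w0 - alpha * w                             # shift away from known classes
    return w_u / np.linalg.norm(w_u)                 # renormalize for cosine scoring

# Toy usage with random stand-ins for real text embeddings.
rng = np.random.default_rng(0)
known = rng.normal(size=(5, 512))
obj = rng.normal(size=512)
w_u = pseudo_unknown_embedding(known, obj, alpha=0.5)
```

Because w_U depends only on the current known-class vocabulary, re-running this computation after each incremental learning step suffices to update the unknown-class embedding, with no retraining of the visual branch.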
MSCAL
For each known class i, a class‑specific non‑linear projector g_i maps the multi‑scale feature pyramid P (produced by a DarkNet backbone and a cross‑modal RepVL‑PAN neck) into a lower‑dimensional space Z_i. In this space, a learnable class anchor μ_i is introduced. A contrastive loss is applied across all pyramid levels p: positive samples (spatial locations belonging to class i) are pulled toward μ_i, while samples from other classes and background are pushed away. The loss for class i is:
L_con,i = – (1/|Z_i⁺|) Σ_{j=1}^{p} Σ_{z ∈ Z_{i,j}⁺} log [ exp(sim(z, μ_i)/τ) / Σ_k exp(sim(z, μ_k)/τ) ],
where Z_{i,j}⁺ is the set of positive embeddings for class i at pyramid level j, Z_i⁺ is their union over all levels, sim(·, ·) is a similarity function, τ is a temperature, and the denominator sums over all class anchors.
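This per-class contrastive objective can be sketched as an InfoNCE-style loss. The NumPy code below is an illustrative sketch under our own simplifying assumptions (dot-product similarity, positives supplied as one array per pyramid level, the denominator running over all class anchors); the function name and its interface are placeholders, not the authors' implementation:

```python
import numpy as np

def mscal_loss_for_class(pos_feats_per_level, anchors, class_idx, tau=0.1):
    """Contrastive anchor loss for one class i (illustrative sketch).

    pos_feats_per_level: list of (n_j, d) arrays, the positive embeddings
                         Z_{i,j}+ for class i at each pyramid level j.
    anchors:             (num_classes, d) array of learnable anchors mu_k.
    class_idx:           index i of the target class anchor.
    tau:                 temperature.
    """
    total, count = 0.0, 0
    for z in pos_feats_per_level:                    # sum over pyramid levels j
        logits = z @ anchors.T / tau                 # similarity to every anchor
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -log_probs[:, class_idx].sum()      # pull positives toward mu_i
        count += z.shape[0]
    return total / max(count, 1)                     # average over |Z_i+|
```

Minimizing this loss pulls class-i embeddings at every scale toward the shared anchor μ_i while the softmax denominator pushes them away from the anchors of other classes, which is what enables the detector to flag embeddings that sit close to no known anchor as unknown.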