📝 Original Info
- Title: OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
- ArXiv ID: 2511.21064
- Date: 2025-11-26
- Authors: Chujie Wang, Jianyu Lu, Zhiyuan Luo, Xi Chen, Chu He (Wuhan University)
📝 Abstract
Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD's lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent's state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.
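The abstract describes a closed loop that couples Bandit exploration trajectories with Markov transition matrices for self-supervised Reward Model optimization. As a rough, hypothetical illustration of one ingredient (not the paper's implementation), a transition matrix over the eight state spaces can be estimated from explored state trajectories by simple count normalization:

```python
def transition_matrix(trajectories, n_states=8):
    """Estimate a row-stochastic Markov transition matrix from
    exploration trajectories (lists of state indices).

    The default of 8 states follows the paper's eight state
    spaces; everything else here is an illustrative sketch.
    """
    counts = [[0.0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            counts[s][s_next] += 1
    for row in counts:
        total = sum(row)
        if total:  # leave rows with no observed transitions at zero
            for j in range(n_states):
                row[j] /= total
    return counts
```

Rows with no observed transitions are left as all-zero rather than given an arbitrary prior; any smoothing choice would be a further assumption.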
📄 Full Content
OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning
and Self-Evolving Detection
Chujie Wang*
Jianyu Lu*
Zhiyuan Luo
Xi Chen†
Chu He†
Wuhan University
chujie@whu.edu.cn
Abstract
Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD’s lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent’s state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.
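The Bandit module above is said to generate exploration signals under limited supervision, steering the agent toward uncertain regions. A minimal sketch of such region selection, assuming a classic UCB1 rule and a hypothetical per-region reward bookkeeping (none of these names come from the paper):

```python
import math

def ucb_select(regions, t, c=1.0):
    """Pick the region id with the highest UCB1 score.

    `regions` maps a region id to (mean_reward, visit_count);
    `t` is the current round. Unvisited regions are explored
    first. (Hypothetical interface, for illustration only.)
    """
    best, best_score = None, float("-inf")
    for rid, (mean, n) in regions.items():
        if n == 0:
            return rid  # always try an unexplored region first
        score = mean + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best, best_score = rid, score
    return best
```

The exploration bonus shrinks as a region accumulates visits, which matches the abstract's description of focusing effort on regions whose outcome is still uncertain.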
1. Introduction
Open-Vocabulary Object Detection (OVOD) aims to extend object detectors to arbitrary concepts by exploiting the semantic priors learned from large-scale vision–language pretraining [12, 21, 32, 43, 44]. With improved region–text alignment and large-vocabulary modeling [4, 25, 38–40], recent methods have substantially enhanced open-set recognition. However, although these models are trained with multimodal supervision, their inference still depends on a fixed set of category names. This turns detection into a simple matching routine and creates a clear mismatch between multimodal training and unimodal inference. Consequently, existing methods struggle to reason under visual ambiguity, adapt to unfamiliar contexts, and detect rare or fine-grained categories.

Figure 1. We illustrate the state-transition behavior of OVOD-Agent as it iteratively updates its category hypothesis. Starting from an initial dictionary lookup, the agent applies attribute-aware actions that adjust color, texture, and spatial cues to produce a more accurate and grounded state description. The number of required actions varies across images, from single-step updates to multi-step reasoning.
A growing body of work has shown that the textual space has a much greater impact on OVOD performance than previously assumed. Techniques such as prompt learning [8], prompt diversification [11], attribute-based descriptions [31], and automatic class-name optimization [29] consistently reshape the textual embedding space and lead to marked improvements across OVOD benchmarks, suggesting that this space remains far from saturated [2]. To enrich fine-grained semantics, recent studies introduce LLM-generated priors to expand the textual domain [16, 26]. Yet these approaches remain essentially static: a single adjustment to the descriptor cannot capture the evolving relationships among regions, contexts, and attributes during detection. What is still lacking is a context-dependent, iterative mechanism that can continuously refine textual representations throughout inference.
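The iterative, context-dependent refinement called for here can be pictured as a greedy edit loop over a category description, keeping an attribute edit only when it improves a region–text matching score. This is a hedged sketch: the action set and scoring function are placeholders, not the paper's actual components.

```python
def refine_description(desc, score_fn, actions, max_steps=4, tol=1e-3):
    """Greedily refine a category description.

    `actions` are candidate edits (e.g. appending a color or
    texture cue); `score_fn` is any region-text matching score.
    Both are hypothetical stand-ins for the paper's Visual-CoT
    actions and detector-derived feedback.
    """
    score = score_fn(desc)
    for _ in range(max_steps):
        best_desc, best_score = desc, score
        for act in actions:
            cand = act(desc)
            s = score_fn(cand)
            if s > best_score:
                best_desc, best_score = cand, s
        if best_score - score < tol:  # no edit helps: converged
            break
        desc, score = best_desc, best_score
    return desc
```

The early-exit check mirrors Figure 1's observation that some images need a single-step update while others require multi-step reasoning.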
The step-by-step reasoning paradigm of CoT [36], together with advances in interactive vision–language agents [9, 34, 41], has enabled multi-step visual operations in multimodal systems. Frameworks such as Visual-CoT [22, 35] suggest that visual cues can be aligned with textual semantics through progressive reasoning. In OVOD, CoT-PL [5] further shows that incorporating such reasoning during training can improve the quality of pseudo labels. However, these approaches inherit the agent-style design that places an LLM at the center of the decision process [5, 16], which imposes substantial computational and memory overhead during inference. Some even rely on multiple rounds of human feedback to guide the reasoning procedure [18]. Such designs run counter to the core strengths of object detection—speed, scalability, and ease of deployment—and ultimately make the pipeline unnecessarily heavy for the OVOD setting.

To address these limitations, we introduce OVOD-Agent, a lightweight and LLM-free framework that transforms
…(Full text truncated)…
This content is AI-processed based on ArXiv data.