📝 Original Info
- Title: OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
- ArXiv ID: 2511.21064
- Date: 2025-11-26
- Authors: Chujie Wang, Jianyu Lu, Zhiyuan Luo, Xi Chen, Chu He (Wuhan University)
📝 Abstract
Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD's lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent's state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.
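The abstract describes a closed loop that couples Bandit exploration trajectories with Markov transition matrices for self-supervised Reward Model optimization. As a rough, hypothetical illustration of one ingredient (not the paper's implementation), a transition matrix over the eight state spaces can be estimated from explored state trajectories by simple count normalization:

```python
def transition_matrix(trajectories, n_states=8):
    """Estimate a row-stochastic Markov transition matrix from
    exploration trajectories (lists of state indices).

    The default of 8 states follows the paper's eight state
    spaces; everything else here is an illustrative sketch.
    """
    counts = [[0.0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            counts[s][s_next] += 1
    for row in counts:
        total = sum(row)
        if total:  # leave rows with no observed transitions at zero
            for j in range(n_states):
                row[j] /= total
    return counts
```

Rows with no observed transitions are left as all-zero rather than given an arbitrary prior; any smoothing choice would be a further assumption.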
📄 Full Content
OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning
and Self-Evolving Detection
Chujie Wang*
Jianyu Lu*
Zhiyuan Luo
Xi Chen†
Chu He†
Wuhan University
chujie@whu.edu.cn
Abstract
Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD’s lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent’s state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.
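The Bandit module above is said to generate exploration signals under limited supervision, steering the agent toward uncertain regions. A minimal sketch of such region selection, assuming a classic UCB1 rule and a hypothetical per-region reward bookkeeping (none of these names come from the paper):

```python
import math

def ucb_select(regions, t, c=1.0):
    """Pick the region id with the highest UCB1 score.

    `regions` maps a region id to (mean_reward, visit_count);
    `t` is the current round. Unvisited regions are explored
    first. (Hypothetical interface, for illustration only.)
    """
    best, best_score = None, float("-inf")
    for rid, (mean, n) in regions.items():
        if n == 0:
            return rid  # always try an unexplored region first
        score = mean + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best, best_score = rid, score
    return best
```

The exploration bonus shrinks as a region accumulates visits, which matches the abstract's description of focusing effort on regions whose outcome is still uncertain.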
1. Introduction
Open-Vocabulary Object Detection (OVOD) aims to extend object detectors to arbitrary concepts by exploiting the semantic priors learned from large-scale vision–language pretraining [12, 21, 32, 43, 44]. With improved region–text alignment and large-vocabulary modeling [4, 25, 38–40], recent methods have substantially enhanced open-set recognition. However, although these models are trained with multimodal supervision, their inference still depends on a fixed set of category names. This turns detection into a simple matching routine and creates a clear mismatch between multimodal training and unimodal inference. Consequently, existing methods struggle to reason under visual ambiguity, adapt to unfamiliar contexts, and detect rare or fine-grained categories.

Figure 1. We illustrate the state-transition behavior of OVOD-Agent as it iteratively updates its category hypothesis. Starting from an initial dictionary lookup, the agent applies attribute-aware actions that adjust color, texture, and spatial cues to produce a more accurate and grounded state description. The number of required actions varies across images, from single-step updates to multi-step reasoning.
A growing body of work has shown that the textual space has a much greater impact on OVOD performance than previously assumed. Techniques such as prompt learning [8], prompt diversification [11], attribute-based descriptions [31], and automatic class-name optimization [29] consistently reshape the textual embedding space and lead to marked improvements across OVOD benchmarks, suggesting that this space remains far from saturated [2]. To enrich fine-grained semantics, recent studies introduce LLM-generated priors to expand the textual domain [16, 26]. Yet these approaches remain essentially static: a single adjustment to the descriptor cannot capture the evolving relationships among regions, contexts, and attributes during detection. What is still lacking is a context-dependent, iterative mechanism that can continuously refine textual representations throughout inference.
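The iterative, context-dependent refinement called for here can be pictured as a greedy edit loop over a category description, keeping an attribute edit only when it improves a region–text matching score. This is a hedged sketch: the action set and scoring function are placeholders, not the paper's actual components.

```python
def refine_description(desc, score_fn, actions, max_steps=4, tol=1e-3):
    """Greedily refine a category description.

    `actions` are candidate edits (e.g. appending a color or
    texture cue); `score_fn` is any region-text matching score.
    Both are hypothetical stand-ins for the paper's Visual-CoT
    actions and detector-derived feedback.
    """
    score = score_fn(desc)
    for _ in range(max_steps):
        best_desc, best_score = desc, score
        for act in actions:
            cand = act(desc)
            s = score_fn(cand)
            if s > best_score:
                best_desc, best_score = cand, s
        if best_score - score < tol:  # no edit helps: converged
            break
        desc, score = best_desc, best_score
    return desc
```

The early-exit check mirrors Figure 1's observation that some images need a single-step update while others require multi-step reasoning.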
The step-by-step reasoning paradigm of CoT [36], together with advances in interactive vision–language agents [9, 34, 41], has enabled multi-step visual operations in multimodal systems. Frameworks such as Visual-CoT [22, 35] suggest that visual cues can be aligned with textual semantics through progressive reasoning. In OVOD, CoT-PL [5] further shows that incorporating such reasoning during training can improve the quality of pseudo labels. However, these approaches inherit the agent-style design that places an LLM at the center of the decision process [5, 16], which imposes substantial computational and memory overhead during inference. Some even rely on multiple rounds of human feedback to guide the reasoning procedure [18]. Such designs run counter to the core strengths of object detection—speed, scalability, and ease of deployment—and ultimately make the pipeline unnecessarily heavy for the OVOD setting.

To address these limitations, we introduce OVOD-Agent, a lightweight and LLM-free framework that transforms
…(Full text truncated)…
This content is AI-processed based on ArXiv data.