ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying
Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion, which discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, processing pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens that trigger the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training and performs generative mental simulation during inference without external tool calls. Through a two-stage curriculum that first distills frozen experts into model parameters and then learns task-driven querying via sparsity penalties, ViThinker discovers the minimal sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.
💡 Research Summary
The paper tackles a fundamental limitation of vision‑language models (VLMs) when applying Chain‑of‑Thought (CoT) reasoning: early conversion of visual inputs into text discards continuous information such as geometry, spatial layout, and fine‑grained structure. Recent works (Aurora, ICoT, CoVT, MINT‑CoT) try to enrich CoT with richer visual features but remain passive—they process pre‑computed visual tokens rather than actively deciding what to look at. Inspired by human active perception, the authors propose ViThinker, a framework that endows a VLM with the ability to generate decision (query) tokens (e.g., <query_seg>, <query_depth>) that trigger on‑demand synthesis of expert‑aligned visual features.
ViThinker consists of two main components. (1) An Active Generative Perception module that decouples “when to look” from “how to see.” When a decision token is emitted, a set of four observation tokens is reserved; their hidden states are aligned with frozen expert feature maps (SAM for segmentation, DepthAnything for depth, PIDINet for edges, DINOv2 for patch-level semantics) via lightweight projection heads and a distance loss. This process internalizes the experts’ capabilities into the model’s parametric memory, allowing the model to “recall” the appropriate visual cue without invoking external tools.

(2) A two-stage curriculum. Stage 1 (skill acquisition) trains the model on data where expert outputs are prepended to the input, teaching the semantics of each decision token and how to reconstruct high-fidelity visual embeddings. Stage 2 (policy optimization) presents multiple valid reasoning paths for each problem—ranging from minimal (single expert) to full (all experts)—generated automatically. A sparsity penalty on decision tokens (count of observation tokens per decision) encourages the model to adopt a cognitive budget and select the minimal sufficient set of queries for each reasoning step.
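The Stage-1 alignment described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: it assumes the observation tokens' hidden states are passed through a lightweight linear projection head and pulled toward the frozen expert's feature vectors with a squared L2 distance loss. The function names, dimensions, and the choice of L2 as the distance are all assumptions for illustration.

```python
# Hypothetical sketch of the Stage-1 alignment loss: hidden states of the
# reserved observation tokens are projected through a lightweight linear head
# and matched against (frozen) expert feature vectors with an L2 distance.
# All names and dimensions here are illustrative assumptions.

def linear_project(hidden, weights, bias):
    """Apply a simple linear head: out[j] = sum_i hidden[i] * W[i][j] + b[j]."""
    dim_out = len(bias)
    return [
        sum(h * weights[i][j] for i, h in enumerate(hidden)) + bias[j]
        for j in range(dim_out)
    ]

def alignment_loss(obs_hidden_states, expert_features, weights, bias):
    """Mean squared L2 distance between projected observation-token states
    and the corresponding frozen expert feature vectors."""
    total = 0.0
    for hidden, target in zip(obs_hidden_states, expert_features):
        proj = linear_project(hidden, weights, bias)
        total += sum((p - t) ** 2 for p, t in zip(proj, target))
    return total / len(obs_hidden_states)
```

In a real implementation the projection head would be a learned `nn.Linear` per expert, and `expert_features` would be pooled feature maps from SAM, DepthAnything, PIDINet, or DINOv2; only the projection heads and the VLM receive gradients, while the experts stay frozen.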
The overall training objective for a sample is the minimum loss over all valid chains: L = min_{s ∈ S_valid} L(s), where S_valid is the set of automatically generated valid reasoning chains and L(s) combines the chain’s task loss with the sparsity penalty on its decision tokens. Gradients thus flow through the cheapest reasoning path that still solves the problem.
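The min-over-chains objective can be illustrated with a small sketch. The decomposition of the per-chain loss into a task term plus `lambda_sparsity` times the observation-token count is an assumption consistent with the sparsity penalty described above; the names and the penalty weight are illustrative, not taken from the paper.

```python
# Hedged sketch of the Stage-2 objective: each sample has several valid
# reasoning chains (from a single expert query up to all four experts),
# and the training loss is the minimum over them, so the model is pushed
# toward the cheapest chain that still succeeds.

def chain_loss(task_loss, num_obs_tokens, lambda_sparsity=0.1):
    """Per-chain loss: task loss plus a penalty per observation token
    (the assumed form of the sparsity penalty)."""
    return task_loss + lambda_sparsity * num_obs_tokens

def sample_objective(valid_chains, lambda_sparsity=0.1):
    """L = min over valid chains s of [L_task(s) + lambda * N_obs(s)].

    `valid_chains` is a list of (task_loss, num_obs_tokens) pairs."""
    return min(
        chain_loss(task, n_obs, lambda_sparsity)
        for task, n_obs in valid_chains
    )
```

For example, a chain using all four experts with a slightly lower task loss can still lose to a zero-query chain once the penalty is added, which is exactly the cognitive-budget behavior the curriculum aims for.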