Binary Verification for Zero-Shot Vision


📝 Abstract

We propose a training-free, binary verification workflow for zero-shot vision with off-the-shelf VLMs. It comprises two steps: (i) quantization, which turns the open-ended query into a multiple-choice question (MCQ) with a small, explicit list of unambiguous candidates; and (ii) binarization, which asks one True/False question per candidate and resolves deterministically: if exactly one is True, select it; otherwise, revert to an MCQ over the remaining plausible candidates. We evaluate the workflow on referring expression grounding (REC), spatial reasoning (Spatial-Map, Spatial-Grid, Spatial-Maze), and BLINK-Jigsaw. Relative to answering open-ended queries directly, quantization to MCQ yields large gains, and True/False binarization provides a consistent additional boost. Across all tasks, the same workflow produces significant improvements, indicating generality. Our theory formalizes how open-ended vision queries can be quantized to MCQs and further binarized into True/False verifications, establishing a hardness ladder (T/F ≤ MCQ ≤ K-way). A simple analysis explains why Boolean resolution boosts accuracy. Together, these components yield a simple and unified workflow that emphasizes inference-time design over task-specific training. It offers a practical, drop-in path to stronger zero-shot vision with today’s VLMs.
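The claimed accuracy boost from Boolean resolution can be illustrated with a back-of-envelope Monte-Carlo sketch (an illustration under assumed numbers, not the paper's analysis): suppose the VLM answers each True/False probe correctly with independent probability `p`, and any fallback MCQ is answered correctly with probability `q`. Resolving a unique True deterministically then beats asking the MCQ directly.

```python
import random

def simulate(k=4, p=0.9, q=0.7, trials=100_000, seed=0):
    """Estimate the binarized pipeline's accuracy by simulation.

    Assumptions (illustrative only): one of k candidates is correct;
    each True/False probe is answered correctly with independent
    probability p; any fallback MCQ is correct with probability q.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Candidate 0 is the ground truth; probe every candidate.
        # The correct candidate is marked True with prob p; each
        # distractor is (wrongly) marked True with prob 1 - p.
        answers = [rng.random() < p if i == 0 else rng.random() >= p
                   for i in range(k)]
        trues = [i for i, a in enumerate(answers) if a]
        if len(trues) == 1:
            hits += (trues[0] == 0)    # unique True: accept it
        else:
            hits += (rng.random() < q)  # zero or several True: fall back to MCQ
    return hits / trials

direct = 0.7            # MCQ alone scores q
pipeline = simulate()   # binarize-then-resolve scores noticeably higher
```

With these assumed numbers the pipeline lands near 0.88 versus 0.70 for the direct MCQ, since the exactly-one-True event is both frequent (probability p^k for the correct unique hit) and almost always right.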

📄 Content

Binary Verification for Zero-Shot Vision
Jeffrey Liu, mycube.tv, jeffreyliu@mycube.tv
Rongbin Hu, mycube.tv, rongbinhu@mycube.tv

1. Introduction

Zero-shot vision [9, 36] is the ability of a model to perform a new visual task without any task-specific training or fine-tuning. It is increasingly critical for science and deployment. Compared with supervised learning, which requires task-specific annotations that are often expensive and time-consuming to build, zero-shot vision enables rapid and cost-effective adaptation to new tasks and domains. Moreover, proprietary vision-language models (VLMs) often deliver stronger out-of-the-box performance and ship with mature APIs and infrastructure that simplify integration, but they typically restrict task-specific training beyond prompting, making inference-time methods especially valuable.

Modern VLMs [2, 3, 8, 12, 14, 20, 21, 26, 32] show strong zero-shot ability on tasks such as image captioning, broad scene recognition, and short-form visual QA. In typical zero-shot use, the model answers an open-ended prompt, for example, “Describe this image”. However, for more demanding tasks such as precise grounding, spatial reasoning, or fine-grained recognition, performance is often unreliable without task-specific fine-tuning. This unreliability stems from the interaction of several factors: free-form outputs make small numeric and referential errors both frequent and difficult to detect; model confidences are poorly calibrated, undermining decision thresholds; among other issues.

A fundamental idea in computing is that any number can be represented in binary. Quantizing and binarizing values to bits simplify arithmetic for machines and enable scalable computation. We adopt an analogous view and instantiate the idea of quantization and binarization for zero-shot vision tasks with a simple, training-free workflow. First, we quantize the task by generating candidate units, e.g. detector boxes, grid cells, or entity pairs, so an open query becomes a multiple-choice question (MCQ) with a small set of explicit hypotheses.
Next, we binarize the decision by asking, for each candidate h_i, a single yes/no claim that checks whether h_i matches the description, and the VLM must answer either True or False. Finally, we resolve outcomes deterministically: if exactly one candidate is True, we return it; if multiple answers are True, we restrict to that subset and issue a compact MCQ; if all are False, we fall back to the original MCQ.

We evaluate the unified workflow across three task families using the same VLMs without any task-specific fine-tuning, and observe significant improvements on every task.

• Referring Expression Comprehension (REC). REC [28] is the task of locating the specific object described by a natural-language expression in a given image. On REC (RefCOCO/+/g) [24, 38], our binary verification workflow with GPT-4o reaches ACC@0.5 of 71.7%, 67.1%, and 64.8%, respectively, substantially above zero-shot GroundingDINO [22] (50.4%, 51.4%, 60.4%). Note that the original GPT-4o only reaches ACC@0.5 of 8.4%, 7.3%, and 9.1%, respectively.

arXiv:2511.10983v1 [cs.CV] 14 Nov 2025

• Spatial reasoning (Map, Grid, Maze). For spatial reasoning [31], we evaluate three tasks, namely Spatial-Map, Spatial-Grid, and Spatial-Maze, which probe directional relations, grid area recognition, and path finding, respectively. We take the original task interface as the baseline and then apply our two-step binary verification workflow. On Spatial-Map, the baseline is already MCQ. Binarization improves over MCQ from 78.3% to 84.5% and from
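The quantize-binarize-resolve procedure above can be sketched in a few lines; `ask_true_false` and `ask_mcq` are hypothetical placeholders for whatever VLM query interface is available, not names from the paper:

```python
from typing import Callable, Sequence

def binary_verify(candidates: Sequence[str],
                  ask_true_false: Callable[[str], bool],
                  ask_mcq: Callable[[Sequence[str]], str]) -> str:
    """Resolve a quantized query by per-candidate True/False verification.

    ask_true_false(c) asks the VLM whether candidate c matches the
    description; ask_mcq(cs) asks the VLM to pick one candidate from cs.
    Both stand in for real VLM calls.
    """
    trues = [c for c in candidates if ask_true_false(c)]
    if len(trues) == 1:
        return trues[0]            # exactly one True: accept it
    if len(trues) > 1:
        return ask_mcq(trues)      # several True: compact MCQ over that subset
    return ask_mcq(candidates)     # all False: fall back to the original MCQ
```

For REC the candidates would be detector boxes, for Spatial-Grid the grid cells, and for entity relations the entity pairs; only the candidate generator changes per task, the resolution logic does not.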

This content is AI-processed based on ArXiv data.
