VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs
Recent progress in Vision-Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style tasks and natural-image reasoning tasks. On abstract puzzles, performance remains near random, with an average accuracy of around 28%; natural-image tasks yield better but still weak results, at 45% accuracy. Tool-augmented reasoning brings only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis shows that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and a mere 1% from reasoning alone. This motivates fine-grained diagnostic probe questions targeting specific perception categories (e.g., shape, count, position, 3D/depth), which reveal that certain categories cause far more failures than others. Our benchmark and analysis establish that current VLMs, even with visual reasoning tools, remain unreliable abstract reasoners, mostly due to perception limitations, and offer a principled basis for improving visual reasoning in multimodal systems.
💡 Research Summary
The paper introduces VRIQ (Visual Reasoning IQ), a comprehensive benchmark designed to evaluate the visual reasoning capabilities of Vision‑Language Models (VLMs) across both abstract puzzle‑style and natural‑image domains. VRIQ contains 1,500 expert‑authored multiple‑choice items, evenly split across five reasoning categories—Sequence Completion, Matrix Prediction, Odd‑One‑Out, Figure Rotation, and 3D Visualization—and two visual domains that share identical logical structures. Each item is annotated with eight perceptual dimensions (color, shape, count, position, rotation/orientation, 3D/depth, symmetry/pattern, distractor similarity) to enable fine‑grained analysis.
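The item structure described above can be sketched as a small data model. This is an illustrative assumption about the schema, not the authors' released format; the field and tag names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical vocabularies mirroring the five reasoning categories
# and eight perceptual dimensions described in the summary.
REASONING_CATEGORIES = {
    "sequence_completion", "matrix_prediction", "odd_one_out",
    "figure_rotation", "3d_visualization",
}
PERCEPTUAL_DIMENSIONS = {
    "color", "shape", "count", "position", "rotation_orientation",
    "3d_depth", "symmetry_pattern", "distractor_similarity",
}

@dataclass
class VRIQItem:
    """One multiple-choice benchmark item (assumed structure)."""
    item_id: str
    domain: str                       # "abstract" or "natural"
    category: str                     # one of REASONING_CATEGORIES
    choices: List[str]                # multiple-choice options
    answer_index: int                 # index of the correct choice
    perceptual_tags: List[str] = field(default_factory=list)

    def __post_init__(self):
        assert self.domain in {"abstract", "natural"}
        assert self.category in REASONING_CATEGORIES
        assert 0 <= self.answer_index < len(self.choices)
        assert all(t in PERCEPTUAL_DIMENSIONS for t in self.perceptual_tags)
```

A schema like this makes the later per-dimension failure analysis a simple group-by over `perceptual_tags`.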
The evaluation framework is hierarchical. Tier 1 measures end‑to‑end accuracy, reflecting the combined effect of perception and reasoning. Tier 2 introduces diagnostic probes: Perceptual Probes (P‑probes) that ask trivial visual questions (e.g., “How many red circles are shown?”) and Reasoning Probes (R‑probes) that present the same logical rule in text form with the required visual facts supplied, thereby isolating reasoning ability. Tier 3 categorizes each failure as P‑only (perception bottleneck), R‑only (reasoning deficit), or P+R (both) based on performance on the corresponding probes.
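The Tier 3 attribution rule can be sketched as follows. This is a minimal sketch under the assumption that probe performance is reduced to a pass/fail boolean per item; the paper's exact aggregation thresholds are not specified here.

```python
def attribute_failure(passed_p_probe: bool, passed_r_probe: bool) -> str:
    """Categorize a Tier-1 (end-to-end) failure using Tier-2 probe outcomes.

    P-only : perceptual probe failed, reasoning probe passed
             (perception is the bottleneck).
    R-only : perceptual probe passed, reasoning probe failed
             (reasoning deficit despite correct visual facts).
    P+R    : both probe types failed.
    """
    if not passed_p_probe and passed_r_probe:
        return "P-only"
    if passed_p_probe and not passed_r_probe:
        return "R-only"
    if not passed_p_probe and not passed_r_probe:
        return "P+R"
    # Both probes passed yet the end-to-end item failed:
    # the error is not explained by either probe in isolation.
    return "unattributed"
```

Under this rule, the reported 56% / 43% / 1% split corresponds to the fraction of failed items landing in `P-only`, `P+R`, and `R-only` respectively.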
A broad suite of models is evaluated, ranging from open‑source VLMs (Qwen2.5‑VL 3B/7B/32B, InternVL‑3‑9B, LLaVA‑v1.6 variants) to proprietary large‑scale systems (GPT‑5.1, GPT‑4o, GPT‑4o‑mini, Gemini‑2.5‑pro, OpenAI o3). Results show uniformly low performance: abstract puzzles average 28% accuracy, barely above random guessing, while natural‑image puzzles reach only 45%. Tool‑augmented reasoning (e.g., OpenAI o3) yields only modest gains of 5–7 percentage points. Error attribution reveals that 56% of failures are P‑only, 43% are P+R, and a mere 1% are R‑only, indicating that perception, not reasoning, is the dominant limitation. Failure rates differ across perceptual dimensions: shape and counting errors dominate, while 3D/depth and rotation also cause significant drops. On R‑probes alone, most models achieve high scores, confirming that they can apply logical rules when the correct visual facts are supplied.
The authors conclude that current VLMs lack robust visual perception, especially for abstract geometric attributes and 3D transformations, and that advanced reasoning techniques help only when perception is reliable. They recommend that future work focus on (1) strengthening visual encoders to capture fine‑grained attributes, (2) developing explicit perception‑reasoning interfaces or meta‑reasoning layers, and (3) leveraging VRIQ’s diagnostic probes for staged training and evaluation. All benchmark data, probe questions, and construction protocols will be released to foster community‑wide improvements in multimodal visual reasoning.