SPHINX: A Comprehensive Puzzle Benchmark for Visual Perception and Reasoning
📝 Abstract
We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
📄 Content
SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam, Rochester Institute of Technology, Rochester, NY, USA (ma8235@rit.edu) · Saksham Aggarwal, Rochester Institute of Technology, Rochester, NY, USA (sxavse@rit.edu) · Justin Yang Chae, University of Washington, Seattle, WA, USA (jchae3@uw.edu) · Nidhi Rastogi, Rochester Institute of Technology, Rochester, NY, USA (nxrvse@rit.edu)

Code and dataset available at https://github.com/xashru/sphinx.
- Introduction

Large language models (LLMs) have recently demonstrated striking advances in reasoning, achieving gold-medal-level performance at the International Mathematical Olympiad [8] and strong results across mathematics, logical reasoning, and coding [13, 19, 22, 59, 64]. Because reasoning is a core component of human intelligence, it has become a central benchmark for progress toward Artificial General Intelligence (AGI) [18]. Techniques such as Chain-of-Thought prompting [55], test-time compute scaling [22], and post-training strategies such as rule-based reinforcement learning in DeepSeek-R1 have further improved model performance, helping mitigate reward hacking [19] and allowing more robust generalization across domains [2, 20, 61].

[Figure 1. Radar plot showing accuracies (%) achieved by LVLMs and by humans across the broad SPHINX categories (Geometric Reasoning, Counting, Symmetry & Pattern Recognition, Sequence & Transformation Reasoning, Topological & Graph Reasoning): Human 75.4%, GPT-5 51.1%, GPT-5 Mini 47.1%, Qwen2.5-VL-32B 32.2%.]

In contrast to the rapid progress of LLMs, large vision-language models (LVLMs) remain far less capable of visual reasoning [11, 32, 44, 69]. Unlike text-based systems that can leverage structured prompts and post-training strategies, LVLMs must jointly parse visual inputs and integrate them with language, a substantially more complex challenge [6, 17, 19, 53, 61]. Current models often fail to construct coherent reasoning chains and stumble on tasks trivial to humans [65]. Although reinforcement learning has been applied to strengthen LVLMs [27, 39], progress is constrained by benchmarks that emphasize perception over reasoning, such as referring expression comprehension or math-with-diagram datasets, where models frequently reduce visual inputs to text and rely on language reasoning [62, 71].

[arXiv:2511.20814v1 [cs.CV], 25 Nov 2025]
More recently, several works have begun to investigate abstract visual reasoning (AVR) in LVLMs [6, 12, 23, 25, 32, 62], yet these efforts still fall short of systematically evaluating core perceptual primitives such as symmetry detection, mental rotation, and structured pattern matching. Cognitive science has long established that these abilities underpin fluid intelligence and matrix reasoning [7, 15, 40, 46], implying that practical machine-learning evaluation must directly target such primitives through controlled tasks that disentangle perception from abstraction. To address this gap, we introduce SPHINX, a synthetic environment that programmatically generates visual perception and reasoning tasks centered on symmetry, transformation, and related spatial operations. Each instance includes an unambiguous ground-truth solution, enabling precise evaluation and systematic investigation of failure modes. The framework also supports the generation of scalable datasets for reinforcement learning, paralleling synthetic reasoning environments shown to benefit text-based LLMs [9, 47]. We make the following key contributions:
- We introduce SPHINX, a synthetic environment for generating visual perception and reasoning datasets, comprising 25 tasks in five broad categories (see Figure 1). To the best of our knowledge, this represents the largest-scale synthetic environment designed for such tasks.
- We construct a benchmark dataset with 2,500 questions using SPHINX and evaluate a range of proprietary and open-source LVLMs. We provide a comparative analysis between human performance and LVLM performance across task categories.
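To make the generate-then-verify idea concrete, the following is a minimal sketch of one SPHINX-style task: a procedurally generated symmetry-detection puzzle paired with a checkable ground-truth label, plus a rule-based reward of the kind RLVR relies on. The grid encoding, function names, and 0/1 reward rule are illustrative assumptions, not the paper's actual implementation (SPHINX renders image puzzles; here the instance is a binary grid).

```python
import random


def generate_symmetry_puzzle(size=6, seed=None):
    """Generate one synthetic symmetry-detection instance.

    Builds a binary grid that either has vertical mirror symmetry
    or has that symmetry deliberately broken, and returns the grid
    together with a ground-truth label recomputed from the grid
    itself, so every instance is independently verifiable.
    """
    rng = random.Random(seed)
    # Sample the left half of each row, then mirror it.
    half = [[rng.randint(0, 1) for _ in range(size // 2)] for _ in range(size)]
    symmetric = rng.random() < 0.5
    grid = []
    for row in half:
        mirrored = list(reversed(row))
        if not symmetric:
            # Break the mirror by flipping one cell in the right half.
            mirrored[rng.randrange(len(mirrored))] ^= 1
        grid.append(row + mirrored)
    # Ground truth is verified directly from the final grid, not
    # trusted from the generation flag.
    label = all(row == row[::-1] for row in grid)
    return grid, label


def verifiable_reward(prediction: bool, label: bool) -> float:
    """Rule-based RLVR reward: 1.0 for an exact match, else 0.0."""
    return 1.0 if prediction == label else 0.0
```

Because the label is recomputed from the generated instance, the same generator can serve both as an evaluation harness and as an unlimited source of reward-checked training examples.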