Do AI Models Perform Human-like Abstract Reasoning Across Modalities?
OpenAI’s o3-preview reasoning model exceeded human accuracy on the ARC-AGI-1 benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions the benchmark was designed to test? Here we investigate abstraction abilities of AI models using the closely related but simpler ConceptARC benchmark. Our evaluations vary input modality (textual vs. visual), use of external Python tools, and reasoning effort. Beyond output accuracy, we evaluate the natural-language rules that models generate to explain their solutions, enabling us to assess whether models recognize the abstractions that ConceptARC was designed to elicit. We show that the best models’ rules are frequently based on surface-level “shortcuts,” capturing intended abstractions considerably less often than humans. In the visual modality, AI models’ output accuracy drops sharply; however, our rule-level analysis reveals that a substantial share of their rules capture the intended abstractions, even as the models struggle to apply these concepts to generate correct solutions. In short, we show that using accuracy alone to evaluate abstract reasoning can substantially overestimate AI capabilities in the textual modality and underestimate them in the visual modality. Our results offer a more faithful picture of AI models’ abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.
💡 Research Summary
The paper investigates whether state-of-the-art multimodal reasoning models truly perform human-like abstract reasoning, or merely achieve high accuracy by exploiting surface patterns. To answer this, the authors turn to ConceptARC, a simplified but concept-focused benchmark derived from the Abstraction and Reasoning Corpus (ARC). ConceptARC contains 480 tasks organized around 16 basic spatial and semantic concepts (e.g., “top vs. bottom”, “inside vs. outside”, “same vs. different”). Human participants solve these tasks with 91% accuracy, providing a strong baseline for comparison.
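To make the task format concrete, here is a minimal sketch of a ConceptARC-style task built around the “top vs. bottom” concept. The specific grids and the `move_object_to_top` rule are illustrative inventions, not an actual benchmark task; the encoding (nested lists of integer color codes, with 0 as the empty background) follows ARC-style textual representations.

```python
# Hypothetical mini-task in the spirit of ConceptARC's "top vs. bottom"
# concept (not an actual benchmark task): move the colored object from
# the bottom of the grid to the top. Grids are lists of lists of integer
# color codes, with 0 meaning "empty", as in ARC-style textual encodings.

def move_object_to_top(grid):
    """Apply the intended abstract rule: relocate all nonzero cells
    to the top row, preserving their column positions."""
    height, width = len(grid), len(grid[0])
    out = [[0] * width for _ in range(height)]
    for r, row in enumerate(grid):
        for c, color in enumerate(row):
            if color != 0:
                out[0][c] = color  # same column, top row
    return out

demo_input = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 3, 3],  # a two-cell object (color code 3) on the bottom row
]
assert move_object_to_top(demo_input) == [
    [0, 3, 3],
    [0, 0, 0],
    [0, 0, 0],
]
```

A model that grasps the intended abstraction should state a rule like “move the object to the top”, whereas a shortcut rule might instead describe a coincidental numeric pattern in the demonstrations.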
Four contemporary reasoning models are evaluated: OpenAI’s o3 and o4-mini, Google Gemini 2.5 Pro, and Anthropic Claude Sonnet 4. For contrast, three non-reasoning multimodal models (GPT-4o, Llama 4 Scout, Qwen 2.5 VL 72B) are also tested. Each model is run under a matrix of conditions: (1) input modality – a textual representation of the grids versus raw images; (2) reasoning effort – a low or a medium token budget; (3) tool access – with or without the ability to generate and execute Python code. All experiments are pass@1 (a single attempt per task), with temperature fixed at 1.0 for comparability.
Crucially, models are asked not only for the transformed output grid but also for a natural‑language description of the rule they inferred, packaged as a JSON object. Human‑generated rules from the original ConceptARC study are used as a reference. The authors manually annotate every rule into three categories: (i) “correct‑intended” – the rule aligns with the abstract concept the task designer intended; (ii) “correct‑unintended” – the rule works on the demonstrations but relies on spurious correlations (e.g., ordering of integer color codes); and (iii) “incorrect” – the rule does not capture the demonstrations. Ambiguous or missing rules are marked “unclear” or “non‑responsive”.
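The required model output can be pictured as a small JSON object pairing the predicted grid with the stated rule. The field names below are illustrative, not the paper’s exact schema; the point is simply that both the solution and its explanation arrive in one machine-parseable response.

```python
import json

# Hypothetical response format (field names are illustrative, not the
# paper's exact schema): the transformed grid plus the natural-language
# rule the model claims to have inferred.
response_text = """
{
  "rule": "Move every nonzero object to the top row of the grid.",
  "output_grid": [[0, 3, 3], [0, 0, 0], [0, 0, 0]]
}
"""

response = json.loads(response_text)
grid = response["output_grid"]   # scored automatically for accuracy
rule = response["rule"]          # annotated by hand into the categories
assert isinstance(grid, list) and isinstance(rule, str)
```

Separating the two fields is what enables the paper’s central move: a response can be accuracy-correct while its rule is “correct-unintended”, or accuracy-wrong while its rule is “correct-intended”.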
Key findings:
- Textual modality performance – In the textual setting, o3 reaches 68.3% accuracy with low effort and 77.1% with medium effort; o4-mini, Claude, and Gemini show similar upward trends. However, rule analysis reveals that the majority of generated rules are either “correct-unintended” or outright “incorrect”. Models often latch onto superficial cues, such as the numeric encoding of colors, rather than the underlying spatial concepts. Thus, high grid accuracy masks a reliance on shortcuts.
- Visual modality performance – When grids are supplied as images, grid accuracy collapses dramatically for all models (as low as 6.7% for o3 without tools). Enabling Python tool use yields notable gains (o3 improves to 18.1% with medium effort plus tools). Importantly, a substantial fraction of the rules produced under visual conditions are classified as “correct-intended” (≈40% for o3 with tools), indicating that the models can infer the intended abstractions but struggle to extract the necessary visual information and apply it reliably.
- Effect of reasoning effort – Increasing the token budget consistently improves textual accuracy across models. In the visual case, the benefit is largely mediated by tool use: the extra tokens are spent generating and executing Python code that performs image processing (e.g., detecting the grid size, extracting color values). Without tools, higher effort yields little visual improvement.
- Tool access – Python tool access is the single most impactful factor for visual tasks. Models use libraries such as OpenCV or Pillow to parse the image into a symbolic grid, then apply their inferred abstract rule. This suggests that current multimodal models lack an integrated visual-reasoning pipeline and rely on external code to bridge the gap.
- Non-reasoning models – All three baseline models achieve under 5% accuracy in both modalities and produce almost no meaningful rules, underscoring the importance of explicit reasoning capabilities on this benchmark.
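The grid-extraction step behind the tool-access finding can be sketched as follows. The paper reports that models wrote Pillow or OpenCV code; to keep this sketch dependency-free, the image is mocked as a plain nested list of RGB tuples, and the palette values are illustrative assumptions rather than ARC’s actual rendering.

```python
# Sketch of the image-to-grid conversion a model might perform with
# Python tools. Assumptions (not guaranteed by the paper): the grid
# fills the whole image, the cell count is known, and each color code
# renders as a distinct RGB value. The palette below is illustrative.
ARC_PALETTE = {
    (0, 0, 0): 0,        # black / empty
    (46, 204, 64): 3,    # green (hypothetical RGB)
}

def pixels_to_grid(pixels, rows, cols):
    """Sample the center pixel of each cell and map it to a color code."""
    height, width = len(pixels), len(pixels[0])
    cell_h, cell_w = height / rows, width / cols
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            y = int((r + 0.5) * cell_h)  # center of cell, vertically
            x = int((c + 0.5) * cell_w)  # center of cell, horizontally
            row.append(ARC_PALETTE.get(pixels[y][x], -1))  # -1 = unknown
        grid.append(row)
    return grid

# Mock 20x20 image: the top-left 10x10 block is green, the rest black.
pixels = [[(46, 204, 64) if x < 10 and y < 10 else (0, 0, 0)
           for x in range(20)] for y in range(20)]
assert pixels_to_grid(pixels, 2, 2) == [[3, 0], [0, 0]]
```

Once the image is reduced to such a symbolic grid, the problem becomes equivalent to the textual condition, which is consistent with the finding that tool access recovers part of the textual-modality performance.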
The authors argue that relying solely on grid‑accuracy inflates perceived abstract‑reasoning abilities in textual settings and underestimates them in visual settings. The prevalence of “correct‑unintended” rules demonstrates that large models are adept at discovering spurious patterns that happen to solve the training demonstrations, a well‑documented phenomenon in deep learning. Conversely, the visual results reveal a disconnect between abstract‑concept recognition (as evidenced by rule quality) and concrete execution (as evidenced by low grid accuracy).
Implications and future directions:
- Evaluation methodology: Incorporate rule‑level analysis as a standard complement to accuracy, especially for benchmarks targeting abstraction.
- Architectural improvements: Develop models that can directly extract structured representations from images (e.g., object‑centric encodings) without external code, thereby reducing the reliance on tool calls.
- Meta‑reasoning and tool integration: Design frameworks where tool use is a learned, differentiable component rather than a post‑hoc API call, enabling smoother end‑to‑end reasoning.
- Curriculum and data design: Construct training data that discourages shortcut learning by varying superficial cues (e.g., randomizing color encodings) while preserving the core abstract relations.
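The color-randomization idea in the last point can be sketched concretely: apply one shared random permutation of the color codes to every grid in a task, so that cues like “pick the largest color number” break while the abstract relation between input and output survives. This is an illustrative sketch of the proposal, not code from the paper.

```python
import random

# Sketch of shortcut-resistant data design: permute the nonzero color
# codes consistently across all grids of a task. Superficial numeric
# cues change, but which cells share a color (the abstract relation)
# is preserved. Purely illustrative.
def permute_colors(grids, seed=None):
    """Apply one shared random permutation of codes 1-9 to every grid
    (0 stays fixed as the background color)."""
    rng = random.Random(seed)
    codes = list(range(1, 10))
    shuffled = codes[:]
    rng.shuffle(shuffled)
    mapping = {0: 0, **dict(zip(codes, shuffled))}
    return [[[mapping[v] for v in row] for row in g] for g in grids]

inp = [[0, 2], [2, 5]]
out = [[2, 5], [0, 2]]
new_inp, new_out = permute_colors([inp, out], seed=7)
# The same cells still share a color, so the abstract relation survives
# even though the concrete integer codes have changed.
assert new_inp[0][1] == new_inp[1][0] == new_out[0][0] == new_out[1][1]
```

Training or evaluating on many such permuted variants would penalize rules tied to specific integer encodings, one of the shortcut cues the rule analysis identified.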
In sum, the paper provides a nuanced portrait of current AI reasoning: textual models appear competent on the surface but often cheat with shortcuts; visual models are far from human‑level in execution despite occasionally grasping the right abstract ideas. Accurate assessment of abstract reasoning therefore demands multi‑dimensional metrics that go beyond raw correctness.