Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning
Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their knowledge to novel compositional scenarios, revealing notable limitations in systematic generalization. There has been an ongoing debate about whether neural networks possess the capacity for systematic generalization, with recent studies suggesting that meta-learning approaches designed for compositionality can significantly enhance this ability. However, these insights have largely been confined to linguistic problems, leaving their applicability to other tasks an open question. In this study, we extend meta-learning for compositionality to the domain of abstract spatial reasoning. To this end, we introduce Compositional-ARC, a dataset designed to evaluate the capacity of models to systematically generalize from known geometric transformations (e.g., translation, rotation) of abstract two-dimensional objects to novel combinations of these transformations (e.g., translation+rotation). Our results show that a small transformer-based encoder-decoder model, trained via meta-learning for compositionality, can systematically generalize to previously unseen transformation compositions. Notably, despite having only 5.7M parameters, this model significantly outperforms state-of-the-art LLMs, including o3-mini, GPT-4o, and Gemini 2.0 Flash, all of which fail to exhibit similar systematic behavior, and performs on par with the winning model of the ARC Prize 2024, an 8B-parameter LLM trained via test-time training. Our findings highlight the effectiveness of meta-learning in promoting systematicity beyond linguistic tasks, suggesting a promising direction toward more robust and generalizable models.
💡 Research Summary
The paper introduces Compositional‑ARC, a novel benchmark designed to evaluate systematic generalization in abstract spatial reasoning. Building on the Abstraction and Reasoning Corpus (ARC), the dataset consists of 10 × 10 grid worlds populated with colored abstract objects. Five primitive geometric transformations are defined—translation, rotation, reflection, extension, and color change. Each transformation is triggered by one of three visual indicators: object shape, object color, or the presence of a neighboring “indicator” object. By combining two indicators, level‑1 compositions are formed; combining all three yields level‑2 compositions, which are never seen during training and serve as the test cases.
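The primitive transformations can be pictured as simple operations on sets of colored cells, with a level‑1 composition being plain function composition. The following is a minimal sketch under that assumption; the cell‑set representation and function names are illustrative, not the paper's actual implementation:

```python
# Hypothetical sketch of primitive transformations on a 10 x 10 grid world.
# Objects are sets of (row, col, color) cells; names and signatures are
# illustrative assumptions, not the released code.

GRID = 10

def translate(cells, dr, dc):
    """Shift every cell of an object by (dr, dc), staying on the grid."""
    moved = {(r + dr, c + dc, col) for r, c, col in cells}
    assert all(0 <= r < GRID and 0 <= c < GRID for r, c, _ in moved)
    return moved

def rotate90(cells):
    """Rotate an object 90 degrees clockwise within its bounding box."""
    rows = [r for r, _, _ in cells]
    cols = [c for _, c, _ in cells]
    r0, c0, h = min(rows), min(cols), max(rows) - min(rows)
    return {(r0 + (c - c0), c0 + h - (r - r0), col) for r, c, col in cells}

def recolor(cells, new_color):
    """Change the object's color, leaving its shape in place."""
    return {(r, c, new_color) for r, c, _ in cells}

# A level-1 composition (e.g. translation + rotation) is just function
# composition on the object:
bar = {(1, 1, 3), (2, 1, 3), (3, 1, 3)}    # vertical bar, color 3
composed = rotate90(translate(bar, 0, 2))  # translate, then rotate
```

Under this representation, a level‑2 composition simply chains a third primitive onto the same pipeline.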
An “episode” in Compositional‑ARC contains a small set of few‑shot study examples (12 examples covering primitive transformations and all level‑1 compositions) and a single query that requires applying an unseen level‑2 composition. This mirrors the few‑shot compositional instruction task introduced by Lake & Baroni (2023), but replaces textual pseudo‑language with a visual interpretation grammar. The model must infer the mapping from visual indicators to transformations from the study examples and then apply the inferred rule to the query grid.
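An episode can be thought of as a plain container of study pairs plus one held‑out query. This dataclass is a hypothetical sketch of that structure, not the dataset's released format:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative type aliases: a grid is a 10 x 10 array of color indices,
# and a pair is an (input grid, output grid) demonstration.
Grid = Tuple[Tuple[int, ...], ...]
Pair = Tuple[Grid, Grid]

@dataclass(frozen=True)
class Episode:
    study: List[Pair]  # 12 demonstrations: primitives + all level-1 compositions
    query: Pair        # one unseen level-2 composition to solve

    def __post_init__(self):
        assert len(self.study) == 12, "each episode provides 12 study examples"
```

The model sees `study` in context and must emit the output grid of `query`, so the interpretation grammar is only ever available implicitly, through the demonstrations.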
The authors adapt the Meta‑Learning for Compositionality (MLC) framework to this visual domain. A transformer‑based encoder‑decoder with only 5.7 M parameters is meta‑trained on 100,000 episodes, each with a randomly re‑generated indicator‑to‑transformation mapping. This forces the network to rely on the study examples rather than memorize static input‑output pairs, thereby learning to compose transformations dynamically.
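The key ingredient is that the indicator‑to‑transformation mapping is redrawn for every episode, so the study examples are the only route to the answer. A minimal sketch of that resampling follows; the names and the loop skeleton are assumptions for illustration, not the authors' code:

```python
import random

INDICATORS = ("shape", "color", "neighbor")  # the three visual cues
TRANSFORMS = ("translate", "rotate", "reflect", "extend", "recolor")

def sample_mapping(rng: random.Random) -> dict:
    """Draw a fresh indicator -> transformation assignment for one episode.

    Because this mapping changes every episode, the network cannot memorize
    a fixed rule and must infer it from the study examples each time.
    """
    return dict(zip(INDICATORS, rng.sample(TRANSFORMS, len(INDICATORS))))

rng = random.Random(0)
# Meta-training loop skeleton: one freshly sampled mapping per episode
# (100,000 episodes in the paper; model/episode builders are hypothetical).
for step in range(3):
    mapping = sample_mapping(rng)         # e.g. {'shape': 'rotate', ...}
    # episode = build_episode(mapping)    # hypothetical episode generator
    # loss = train_step(model, episode)   # standard seq2seq objective
```

Each call assigns the three indicators to three distinct transformations drawn from the five primitives, which is what makes the mapping episode‑specific.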
Empirical results show that the 5.7 M model achieves over 92 % accuracy on level‑2 queries, dramatically outperforming several state‑of‑the‑art large language models (o3‑mini, GPT‑4o, Gemini 2.0 Flash), which score below 50 % on the same benchmark. Moreover, the small model matches the performance of the winning ARC Prize 2024 system, an 8 B‑parameter LLM that requires test‑time training, despite itself using no test‑time adaptation at all. Ablation studies demonstrate that relaxing transformation constraints (e.g., allowing multi‑step translations or 45° rotations) only modestly reduces performance, and that scaling the model up or down from 5.7 M leads to over‑ or under‑fitting, respectively, confirming the efficiency of the chosen size.
The contributions are threefold: (1) the release of the Compositional‑ARC dataset, providing a controlled yet challenging benchmark for systematic generalization beyond language; (2) the demonstration that meta‑learning for compositionality extends successfully to visual reasoning tasks; (3) evidence that a modest‑sized model, when trained with the right meta‑learning regime, can surpass much larger LLMs on systematic generalization. Limitations include the restriction to 2‑D grids, a limited set of transformation parameters, and the computational cost of generating large numbers of meta‑learning episodes. Future work is suggested to broaden transformation families, explore 3‑D spatial reasoning, and investigate transfer to real‑world robotic manipulation tasks.
Overall, the study provides strong empirical support for the claim that systematic generalization is not exclusive to linguistic domains and that meta‑learning can be a powerful tool for building more robust, generalizable AI systems.