Bongards at the Boundary of Perception and Reasoning: Programs or Language?
Vision-Language Models (VLMs) have made great strides in everyday visual tasks, such as captioning a natural image, or answering commonsense questions about such images. But humans possess the puzzling ability to deploy their visual reasoning abilities in radically new situations, a skill rigorously tested by the classic set of visual reasoning challenges known as the Bongard problems. We present a neurosymbolic approach to solving these problems: given a hypothesized solution rule for a Bongard problem, we leverage LLMs to generate parameterized programmatic representations for the rule and perform parameter fitting using Bayesian optimization. We evaluate our method on classifying Bongard problem images given the ground truth rule, as well as on solving the problems from scratch.
💡 Research Summary
The paper tackles the classic Bongard problems (BPs), each consisting of twelve abstract images (six positive, six negative) that require a learner to invent new visual primitives and use them to separate the two categories. While modern Vision‑Language Models (VLMs) excel at everyday visual tasks, they struggle with the kind of few‑shot, concept‑creation reasoning that BPs demand. To bridge this gap, the authors propose a neuro‑symbolic framework that combines large language model (LLM)‑driven program synthesis with Bayesian parameter optimization, and falls back to Chain‑of‑Thought (CoT) reasoning when program synthesis fails.
The system consists of two main components. The hypothesis generator feeds all twelve BP images, together with a few exemplar rules from other BPs, into an LLM, prompting it to sample multiple candidate natural‑language rules. By exploiting the curriculum structure of the original Bongard set, the generator supplies both sequential and random exemplar rules to encourage diversity. Each sampled rule is then handed to the verifier, which asks the LLM to translate the rule into a parameterized Python function (e.g., a classify_image routine). The verifier runs Bayesian optimization over the parameters that the LLM flags as uncertain (such as a compactness threshold) for a budget of 15 iterations. If a program attains a training‑set accuracy of at least 0.9, it is considered successful; the best programs are then executed on the held‑out test images, and a majority vote determines the final label. If no program reaches the threshold, the verifier switches to a pure CoT approach, prompting the LLM to label the test images directly, using the five training pairs as in‑context examples.
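The verifier's parameter‑fitting step can be pictured with a toy sketch. Everything below is illustrative, not the paper's actual code: the shape features are hard‑coded stand‑ins for what perception would extract from BP bitmaps, and a random search over the uncertain threshold stands in for the Bayesian optimizer (same 15‑iteration budget, same 0.9 acceptance bar).

```python
import math
import random

# Hypothetical feature summaries: each image reduced to (area, perimeter) of
# its dominant shape. Positives are circles (compact), negatives are thin
# rectangles (elongated). The real system derives features from the bitmaps.
POSITIVE = [(math.pi * r**2, 2 * math.pi * r) for r in (3.0, 4.0, 5.0, 6.0, 7.0)]
NEGATIVE = [(w * h, 2 * (w + h))
            for w, h in ((1.0, 9.0), (0.5, 12.0), (1.5, 8.0), (0.8, 10.0), (2.0, 14.0))]

def classify_image(area, perimeter, compactness_threshold):
    """Rule: 'the shape is compact'. The threshold is the LLM-flagged
    uncertain parameter to be fitted."""
    compactness = 4 * math.pi * area / perimeter**2  # equals 1.0 for a circle
    return compactness >= compactness_threshold

def training_accuracy(threshold):
    hits = sum(classify_image(a, p, threshold) for a, p in POSITIVE)
    hits += sum(not classify_image(a, p, threshold) for a, p in NEGATIVE)
    return hits / (len(POSITIVE) + len(NEGATIVE))

# Stand-in for Bayesian optimization: random search over the uncertain
# parameter for the 15-iteration budget, keeping the best threshold found.
random.seed(0)
best_threshold, best_acc = None, 0.0
for _ in range(15):
    candidate = random.uniform(0.0, 1.0)
    acc = training_accuracy(candidate)
    if acc > best_acc:
        best_threshold, best_acc = candidate, acc

program_succeeds = best_acc >= 0.9  # the paper's acceptance criterion
```

A program that clears the 0.9 bar would then be run on the held‑out pair, with a majority vote across all accepted programs deciding the final labels.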
Two tasks are evaluated. In the verification task, the ground‑truth rule is provided and the system must classify a held‑out image pair; in the solution task, the system must infer the rule from scratch. Experiments compare the proposed method against baseline VLMs (GPT‑4o and Claude 3.7) and against ablations that remove either the program‑based or the language‑based component. Results show that Claude 3.7 consistently outperforms GPT‑4o across all metrics. Program‑based verification excels on spatial and similarity‑based BPs (e.g., distinguishing “elongated” vs. “compact” shapes), achieving >85 % accuracy, while CoT shines on high‑level conceptual or numeric problems where encoding the rule as code is difficult. The hybrid approach yields the highest overall performance, confirming that the two reasoning modes are complementary.
A detailed analysis of the Bayesian optimization process reveals that only a handful of iterations are needed to locate suitable thresholds, and that a single "refine" loop, in which the LLM is asked to rewrite a failing program, improves success rates by roughly 10%. Human performance data from prior work (average 47% correct, top 5 humans ≈63%) serve as a benchmark; the hybrid system approaches the average human level on the solution task, falling just short of the top human performance.
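The refine pass is essentially one retry with feedback. The sketch below is a hypothetical control‑flow skeleton, not the paper's code: `evaluate` stands for the training‑accuracy check after parameter fitting, and `rewrite` stands for the LLM call that receives the failing source and its score (stubbed here with a lookup table).

```python
# Acceptance bar from the paper: a program must reach 0.9 training accuracy.
ACCURACY_BAR = 0.9

def refine_once(program, evaluate, rewrite):
    """Single refine loop: if `program` misses the bar, ask the (stubbed)
    LLM to rewrite it once and re-evaluate the revision."""
    acc = evaluate(program)
    if acc >= ACCURACY_BAR:
        return program, acc
    revised = rewrite(program, acc)
    return revised, evaluate(revised)

# Stub demonstration: the first program underperforms; its rewrite passes.
scores = {"v1": 0.6, "v2": 0.95}
program, acc = refine_once("v1", scores.get, lambda p, a: "v2")
```

With the stub scores, `refine_once` rejects `"v1"` (0.6 < 0.9) and returns the rewritten `"v2"` at 0.95; a real run would instead re‑prompt the LLM with the failing source and its training errors.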
In summary, the paper demonstrates that integrating LLM‑generated, parameterized programs with VLM perception and a fallback CoT reasoning path can effectively solve Bongard problems, a benchmark that sits at the intersection of perception and abstract reasoning. This neuro‑symbolic strategy leverages the expressive power of code for precise geometric reasoning while retaining the flexibility of natural‑language inference, offering a promising direction for future AI systems that must invent and test new visual concepts on the fly.