Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Multimodal Large Language Models (MLLMs) have significantly advanced vision-language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality image-text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with filtering, decoupling generation from learning needs. We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image-text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding the Teacher's targeted augmentation. Independently, the Generator learns image-generation capabilities from accumulated "image-code-instruction" triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic-Solver achieves an average score of 49.11 across six benchmarks using one-quarter of the baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing a new state of the art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).


💡 Research Summary

Socratic‑Geo tackles the acute shortage of high‑quality image‑text‑answer triples for geometric reasoning by introducing a fully autonomous, goal‑driven data synthesis framework that tightly couples data generation with model learning. The system consists of three specialized agents—Teacher, Solver, and Generator—that interact in a closed loop.

The Teacher agent creates parameterized Python scripts that both draw geometric diagrams and formulate the corresponding problem statements. Each script undergoes two self‑verification stages: Reflect checks mathematical solvability (ensuring a unique, correct solution exists), while RePI validates the rendered image for visual correctness and alignment with the textual description. Only scripts that pass both checks are admitted, guaranteeing that every (image, question, answer, code) quadruple is of high fidelity.
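The gating described above can be sketched in miniature. The snippet below is a hypothetical, simplified illustration (the function name, the right-triangle template, and both check bodies are assumptions, not the paper's actual scripts): a parameterized script samples a problem, a Reflect-style check confirms a finite, unique answer exists, and a stubbed RePI-style check confirms the figure to be drawn is non-degenerate before the quadruple is admitted.

```python
import math
import random

def make_problem(seed):
    """Generate one parameterized right-triangle problem (hypothetical sketch).

    Returns a (figure-spec, question, answer) triple, or None if either
    self-verification stage fails, mirroring the Reflect / RePI gating.
    """
    rng = random.Random(seed)
    a, b = rng.randint(3, 12), rng.randint(3, 12)   # sampled script parameters
    answer = math.hypot(a, b)

    # Reflect-style check: mathematical solvability -- a unique, finite,
    # positive answer must follow from this parameterization.
    if not math.isfinite(answer) or answer <= 0:
        return None

    # RePI-style check (rendering stubbed out here): the figure must be
    # visually valid -- distinct vertices and strictly positive area.
    vertices = [(0, 0), (a, 0), (0, b)]
    area = 0.5 * a * b
    if area <= 0 or len(set(vertices)) != 3:
        return None

    question = (f"A right triangle has legs of length {a} and {b}. "
                f"What is the length of the hypotenuse?")
    return vertices, question, round(answer, 2)

# Only problems that pass both checks enter the curriculum.
spec, question, answer = make_problem(seed=0)
```

In the real pipeline the RePI stage inspects the rendered image itself; here it is reduced to a geometric sanity check so the sketch stays self-contained.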

The Solver is trained purely via reinforcement learning using Group Relative Policy Optimization (GRPO). For each problem in the curriculum, the Solver attempts multiple solutions and receives binary rewards (1 for correct, 0 for incorrect). When all attempts fail, the failure log is fed back to the Teacher. The Teacher diagnoses the weakness, minimally modifies the underlying code (e.g., adding auxiliary lines, adjusting angles or lengths), and generates a new, validated problem. This new triple is immediately added to the curriculum, allowing the Solver to learn from data that directly addresses its current deficiencies. The loop thus implements learner‑driven, targeted data augmentation rather than blind random generation.
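The reward handling above can be made concrete. The sketch below (function names are ours, not the paper's) shows the two pieces of the loop: GRPO replaces a learned critic by normalizing each attempt's binary reward against the statistics of its own rollout group, and an all-fail group is the trigger for routing the failure log back to the Teacher.

```python
def grpo_advantages(rewards):
    """Group-relative advantages for one problem's rollout group (sketch).

    `rewards` are the Solver's binary correctness scores (1 or 0) across
    its sampled attempts on a single problem. GRPO normalizes each reward
    by the group mean and standard deviation, so no value network is needed.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:                      # all attempts agree -> no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

def needs_teacher_feedback(rewards):
    """An all-fail group is fed back to the Teacher for targeted augmentation."""
    return sum(rewards) == 0
```

Note that a group where every attempt fails yields zero advantages, so GRPO alone extracts nothing from it; routing exactly these groups to the Teacher is what converts dead gradients into new, targeted curriculum items.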

The Generator operates independently of the reasoning loop. After inventing a new problem, the Teacher translates the programmatic drawing code into a natural-language drawing instruction. The resulting (instruction, image) pairs are collected and used to fine-tune a diffusion-based image-synthesis model via supervised learning. In effect, the Teacher's precise, rule-based drawing knowledge is distilled into the neural weights of the Generator, enabling it to produce high-quality geometric diagrams that faithfully follow textual specifications.
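To make the distillation data concrete, the sketch below packages one accumulated triple into a supervised fine-tuning record. The schema (field names, the `meta` wrapper) is a hypothetical illustration, not the paper's actual format: the Generator trains only on the instruction-to-image mapping, while the source code is kept for traceability.

```python
import json

def to_sft_record(code, image_path, instruction):
    """Package one "image-code-instruction" triple as a fine-tuning record
    for the Generator (hypothetical schema)."""
    return {
        "prompt": instruction,          # conditioning text for the diffusion model
        "image": image_path,            # diagram rendered by the verified script
        "meta": {"source_code": code},  # provenance only, not a training target
    }

record = to_sft_record(
    code="plot_triangle(a=3, b=4)",
    image_path="figs/tri_0001.png",
    instruction="Draw a right triangle with legs 3 and 4 on the coordinate axes.",
)
serialized = json.dumps(record)
```

Because every admitted script has already passed Reflect and RePI, each record inherits the verified image-text alignment for free; no separate filtering pass over the Generator's training set is required.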

Experiments start from only 108 seed problems. Socratic‑Solver achieves an average accuracy of 49.11 % across six benchmark datasets, surpassing strong baselines by 2.43 points while using only a quarter of the data typically required, and delivering a +4.13 % gain over zero‑shot performance. Socratic‑Generator attains a 42.4 % relaxed score on the GenExam‑Math benchmark, setting a new state‑of‑the‑art for open‑source models and approaching the performance of the proprietary Gemini‑2.5‑Flash‑Image (43.1 %).

Key contributions of the paper are: (1) a goal‑driven, programmatic data synthesis pipeline that integrates problem diagnosis, code modification, and dual verification (Reflect and RePI); (2) a multi‑agent interaction framework that closes the loop between data generation and model learning, enabling dynamic curriculum evolution; and (3) the simultaneous optimization of a reinforcement‑learning‑based reasoning model and a diffusion‑based image generator, demonstrating that high performance can be achieved with minimal initial data. By directly generating and verifying image‑code‑text triples, Socratic‑Geo overcomes the limitations of prior text‑only Socratic frameworks, which could not guarantee visual‑textual consistency. The work showcases a scalable path toward autonomous creation of rich multimodal datasets for complex visual‑reasoning tasks.

