Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Although reinforcement learning (RL) has emerged as a promising approach for improving vision-language models (VLMs) and multimodal large language models (MLLMs), current methods rely heavily on manually curated datasets and costly human verification, which limits scalable self-improvement in multimodal systems. To address this challenge, we propose Vision-Zero, a label-free, domain-agnostic multi-agent self-play framework for self-evolving VLMs through competitive visual games generated from arbitrary image inputs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in “Who Is the Spy”-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model’s reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.


💡 Research Summary

Vision‑Zero introduces a label‑free, domain‑agnostic self‑play framework that enables continuous improvement of vision‑language models (VLMs) without human‑annotated data. Inspired by the social deduction game “Who Is the Spy?”, the authors design a visual version where multiple civilian agents receive a real image while a single spy agent receives a blank canvas. The game proceeds in two phases: a clue‑giving phase, where each agent generates a natural‑language hint based on its visual input and the dialogue history, and a voting phase, where civilians vote to identify the spy. The generated dialogue and vote outcomes serve as automatically created training data.
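The two-phase round described above can be sketched as a small simulation. This is a hypothetical illustration, not the authors' implementation: the agent interface (`agent(observation, history, phase)`), the image placeholders, and the majority-vote tallying are all assumptions made for clarity.

```python
import random

def play_spy_round(agents, real_image, blank_image, rng):
    """One round of the visual 'Who Is the Spy' game (illustrative sketch).

    agents: list of callables (observation, history, phase) -> str.
    The spy sees a blank canvas; civilians see the real image.
    Returns a dict with the transcript and vote outcome, which would
    serve as automatically generated training data.
    """
    spy = rng.randrange(len(agents))
    history = []

    # Clue-giving phase: each agent emits a natural-language hint
    # conditioned on its own visual input and the dialogue so far.
    for i, agent in enumerate(agents):
        obs = blank_image if i == spy else real_image
        clue = agent(obs, list(history), "clue")
        history.append((i, clue))

    # Voting phase: civilians vote for the agent they suspect is the spy.
    votes = {}
    for i, agent in enumerate(agents):
        if i == spy:
            continue
        suspect = int(agent(real_image, list(history), "vote"))
        votes[suspect] = votes.get(suspect, 0) + 1
    accused = max(votes, key=votes.get)

    return {"spy": spy, "accused": accused,
            "civilians_win": accused == spy, "transcript": history}
```

In the full framework the agents would be VLM policies producing free-form text; here a trivial stub agent is enough to exercise the round structure.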

To avoid the common stagnation of pure self‑play, the paper proposes Iterative Self‑Play Policy Optimization (Iterative‑SPO), which alternates between self‑play and Reinforcement Learning with Verifiable Rewards (RLVR). In the RLVR stage, the correctness of the spy identification (vote accuracy) is used as a direct reward, and group‑normalization techniques mitigate role bias. This alternating schedule keeps the learning signal informative and prevents convergence to a local equilibrium.
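The alternating structure and the group-normalized reward can be sketched as follows. This is an assumed simplification: the real Iterative-SPO switching criterion and the exact normalization are defined in the paper, while here the phase switch is a fixed step count and the advantage is a plain per-group z-score (GRPO-style), both chosen for illustration.

```python
def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize a group of rollout rewards to zero mean / unit scale.

    Normalizing within a group of rollouts that share the same role and
    prompt is one way to mitigate role bias (e.g. civilians winning more
    often than the spy by default).
    """
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (var ** 0.5 + eps) for r in rewards]

def iterative_spo(model, steps, switch_every, self_play_step, rlvr_step):
    """Alternate between self-play and RLVR phases (illustrative schedule).

    self_play_step / rlvr_step: callables model -> updated model.
    In the actual algorithm the switch would be driven by the training
    signal (e.g. vote accuracy saturating), not a fixed step count.
    """
    phase = "self_play"
    for t in range(steps):
        model = self_play_step(model) if phase == "self_play" else rlvr_step(model)
        if (t + 1) % switch_every == 0:
            phase = "rlvr" if phase == "self_play" else "self_play"
    return model
```

In the RLVR phase, the binary vote-correctness reward (did the civilians identify the spy?) would feed `group_normalized_advantages` to produce the policy-gradient advantages.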

Vision‑Zero requires only raw images; the authors demonstrate its flexibility using three datasets: 2,000 CLEVR synthetic scenes, 1,000 chart images from ChartQA, and 1,000 real‑world photos from ImgEdit. No manual labels are needed, and the rendering/collection cost is minimal.

Experiments on a Qwen2.5‑VL‑7B backbone show that Vision‑Zero consistently outperforms strong baselines that rely on expensive human‑labeled data. Notable gains include improvements of 3–6 percentage points on MathVision (logical and mathematical reasoning), ChartQA (chart question answering), and RealWorldQA (vision‑centric understanding). The results demonstrate that strategic self‑play combined with verifiable reinforcement signals can drive scalable, cost‑effective VLM enhancement across diverse tasks.

In summary, Vision‑Zero contributes (1) a game environment whose required skills align directly with target multimodal tasks, (2) an iterative training algorithm that stabilizes learning and prevents performance plateaus, and (3) a label‑free, domain‑agnostic pipeline that achieves state‑of‑the‑art results, suggesting a promising new direction for large‑scale multimodal model development.

