Training Multi-Image Vision Agents via End2End Reinforcement Learning

Reading time: 1 minute

📝 Original Info

  • Title: Training Multi-Image Vision Agents via End2End Reinforcement Learning
  • ArXiv ID: 2512.08980
  • Date: 2025-12-05
  • Authors: Chengqi Dong, Chuhuai Yue, Hang He, Rongge Mao, Fenghe Tang, S Kevin Zhou, Zekun Xu, Xiaohan Wang, Jiajun Chai, Wei Lin, Guojun Yin

📝 Abstract

Recent VLM-based agents aim to replicate OpenAI O3's "thinking with images" via tool use, but most open-source methods limit input to a single image, falling short on real-world multi-image QA tasks. To address this, we propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning dedicated to complex multi-image tasks. By leveraging a multi-agent system, we generate challenging and visually rich multi-image QA pairs to fully activate the tool-use potential of the base VLM. Through manual verification, we obtain MIFG-QA, comprising 10k samples for training and evaluation. With deeper reasoning steps, VLMs may increasingly ignore visual inputs. We therefore develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content during inference. Benefiting from our well-designed action-trajectory two-level mask strategy, IMAgent achieves stable tool-use behavior via pure RL training without requiring costly supervised fine-tuning data. Extensive experiments demonstrate that IMAgent maintains strong performance on existing single-image benchmarks while achi...
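
The abstract credits stable tool use under pure RL to an "action-trajectory two-level mask strategy" but does not spell it out. Below is a minimal, hypothetical sketch of what such a two-level loss mask could look like for multi-turn tool-use rollouts: tokens produced by tools are excluded from the policy loss (action level), and invalid rollouts are dropped entirely (trajectory level). The segment roles, function name, and validity flag are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a two-level loss mask for RL on tool-use trajectories.
# Action level : only policy-generated ("model") tokens are trained on; tool
#                observations (e.g. re-encoded crops from a visual-confirmation
#                tool) are masked out of the loss.
# Trajectory level: if the rollout is invalid (e.g. malformed tool call), every
#                token is masked, so the sample contributes nothing.
from dataclasses import dataclass
from typing import List

import torch


@dataclass
class Segment:
    token_ids: List[int]   # tokens belonging to this span of the trajectory
    role: str              # "model" (policy output) or "tool" (tool observation)


def build_two_level_mask(segments: List[Segment], trajectory_valid: bool) -> torch.Tensor:
    """Return a 0/1 mask over the concatenated trajectory tokens."""
    pieces = []
    for seg in segments:
        keep = 1.0 if (seg.role == "model" and trajectory_valid) else 0.0
        pieces.append(torch.full((len(seg.token_ids),), keep))
    return torch.cat(pieces)


# Usage: multiply the per-token RL loss by this mask before averaging.
segments = [
    Segment([11, 12, 13], "model"),  # reasoning + tool call emitted by the VLM
    Segment([90, 91], "tool"),       # tool output (e.g. visual-reflection result)
    Segment([14, 15], "model"),      # final answer
]
mask = build_two_level_mask(segments, trajectory_valid=True)
# per_token_loss: (num_tokens,) tensor from the RL objective
# loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```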

📄 Full Content

...(The full content is omitted here due to length. Please see the full text on the site.)
