VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors

Reading time: 5 minutes
...

📝 Original Info

  • Title: VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors
  • ArXiv ID: 2512.06759
  • Date: 2025-12-07
  • Authors: Wenbo Lyu, Yingjun Du, Jinglin Zhao, Xiantong Zhen, Ling Shao

📝 Abstract

Understanding multi-image, multi-turn scenarios is a critical yet underexplored capability for Large Vision-Language Models (LVLMs). Existing benchmarks predominantly focus on static or horizontal comparisons -- e.g., spotting visual differences or assessing appropriateness -- while relying heavily on language cues. Such settings overlook progressive, context-dependent reasoning and the challenge of visual-to-visual inference. To bridge this gap, we present VisChainBench, a large-scale benchmark designed to rigorously evaluate LVLMs' ability to perform multi-step visual reasoning across sequential, interdependent tasks with minimal language guidance. VisChainBench contains 1,457 tasks spanning over 20,000 images across three diverse domains (e.g., daily scenarios, engineering troubleshooting), structured to mimic real-world decision-making processes. Uniquely, the benchmark is constructed using a multi-agent generation pipeline, ensuring high visual diversity and controlled language bias. All the benchmark data and code for benchmark construction are available for viewing and download via the following link: https://huggingface.co/datasets/eyehole/VisChainBench
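For orientation, the released data at the link above can be pulled locally with standard Hugging Face tooling. The sketch below only assumes the repository ID taken from the URL; the local file layout, split names, and field names are not described on this page, so everything beyond the download call is a guess to be checked against the dataset card.

```python
# Minimal sketch for fetching the released benchmark from the Hugging Face
# Hub. Only the repository ID ("eyehole/VisChainBench") comes from the link
# above; split names and per-task fields are assumptions -- consult the
# dataset card for the real schema.
from huggingface_hub import snapshot_download

# Download the raw benchmark files (task definitions, images, and the
# benchmark-construction code mentioned in the abstract).
local_dir = snapshot_download(repo_id="eyehole/VisChainBench",
                              repo_type="dataset")
print("VisChainBench files downloaded to:", local_dir)

# If the repository also exposes a standard `datasets` schema, it may load
# directly; the split name "train" is a guess, not a documented fact.
# from datasets import load_dataset
# tasks = load_dataset("eyehole/VisChainBench", split="train")
```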

💡 Deep Analysis

Figure 1 (overview): the purely visual multi-turn paradigm of VisChainBench, in which context, prompt, and answer are all images.

📄 Full Content

VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors

Wenbo Lyu (1), Yingjun Du (2), Jinglin Zhao (3), Xiantong Zhen (4), Ling Shao (1)
(1) University of Chinese Academy of Sciences, (2) University of Amsterdam, (3) Huazhong University of Science and Technology, (4) United Imaging Healthcare
lvwenbo23@mails.ucas.ac.cn

Abstract

Understanding multi-image, multi-turn scenarios is a critical yet underexplored capability for Large Vision-Language Models (LVLMs). Existing benchmarks predominantly focus on static or horizontal comparisons -- e.g., spotting visual differences or assessing appropriateness -- while relying heavily on language cues. Such settings overlook progressive, context-dependent reasoning and the challenge of visual-to-visual inference. To bridge this gap, we present VisChainBench, a large-scale benchmark designed to rigorously evaluate LVLMs' ability to perform multi-step visual reasoning across sequential, interdependent tasks with minimal language guidance. VisChainBench contains 1,457 tasks spanning over 20,000 images across three diverse domains (e.g., daily scenarios, engineering troubleshooting), structured to mimic real-world decision-making processes. Uniquely, the benchmark is constructed using a multi-agent generation pipeline, ensuring high visual diversity and controlled language bias. All the benchmark data and code for benchmark construction are available for viewing and download via the following link: https://huggingface.co/datasets/eyehole/VisChainBench

Introduction

The rapid development of Large Vision-Language Models (LVLMs) (Li et al. 2025; Bai et al. 2025; Google 2024; Liu et al. 2024a; OpenAI 2024) has led to impressive progress in multimodal understanding, enabling applications ranging from visual question answering (Yu et al. 2023; Masry et al. 2022) to interactive robotics (Zitkovich et al. 2023; Li et al. 2024b). A core capability underlying these advances is the ability to reason over multiple images across extended interactions, a skill essential for real-world tasks such as technical troubleshooting, intelligent device agents, or sequential visual decision-making. For instance, diagnosing a malfunctioning device may require analyzing its defective component, selecting the appropriate repair tool, and verifying the outcome, all through a chain of visual inputs.

In realistic scenarios such as embodied agents (Yang et al. 2025; Koh et al. 2024; Jansen et al. 2024), models are expected to track dynamic environmental changes across image sequences and respond accordingly. Similarly, personal assistant agents must interpret evolving screenshots and autonomously identify pending tasks. These applications require models to infer task objectives purely from visual context, without explicit textual prompts -- a capability largely absent from current benchmark evaluations. This gap raises a critical question: what forms of reasoning must LVLMs perform to succeed in such settings? From assessing the current visual state to anticipating future steps and making decisions under uncertainty with less language guidance, robust multimodal reasoning is key.

Figure 1: Traditional VQA tasks rely on language prompts and responses (top). VisChainBench introduces a purely visual paradigm, where the context, prompt, and answer are all images, challenging models to perform multi-step visual reasoning without textual cues (bottom).
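To make the image-only, multi-turn format summarized in the Figure 1 caption concrete, here is an illustrative sketch of how one such task chain could be replayed against a model. It is not the authors' evaluation harness: the field names (`context_images`, `options`, `answer_index`) and the `query_lvlm` callable are hypothetical stand-ins, since the real task schema lives in the released dataset.

```python
# Illustrative sketch (not the authors' harness) of replaying one image-only
# task chain against an LVLM. All field names and the query_lvlm callable
# are hypothetical stand-ins for whatever model interface you actually use.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VisualTurn:
    context_images: List[str]  # images that set up this turn of the chain
    options: List[str]         # candidate answer images, including distractors
    answer_index: int          # index of the correct option image


@dataclass
class VisualChainTask:
    turns: List[VisualTurn] = field(default_factory=list)


def run_chain(task: VisualChainTask,
              query_lvlm: Callable[[List[str], List[str]], int]) -> float:
    """Replay the turns in order, keeping earlier images in context so the
    model must carry visual state forward; returns per-turn accuracy."""
    history: List[str] = []
    num_correct = 0
    for turn in task.turns:
        history.extend(turn.context_images)            # growing visual context
        prediction = query_lvlm(history, turn.options)
        num_correct += int(prediction == turn.answer_index)
        history.append(turn.options[turn.answer_index])  # carry the outcome on
    return num_correct / len(task.turns)
```

Carrying the chosen image forward after each turn is what makes the setting progressive rather than a collection of independent single-image questions.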
Although many LVLMs achieve strong results on perceptual or single-step tasks, their ability to conduct multi-step, image-to-image, and context-dependent reasoning, as mentioned above, remains underexplored. To address this, we introduce VisChainBench, a benchmark specifically designed to evaluate multi-turn visual reasoning under minimal linguistic guidance. Unlike prior datasets, our benchmark presents tasks as structured visual chains, requiring models to reason progressively across linked visual contexts.

VisChainBench is constructed using a multi-agent pipeline: initial task structures are generated by language models, relevant images are retrieved or synthesized, and human annotators refine both the tasks and their annotations. This process ensures high-quality, diverse, and visually grounded task sequences that challenge model reasoning beyond superficial patterns.

Figure 2: Comparison of task formats in prior benchmarks. Previous benchmarks are often text-heavy and encourage shallow image comparisons, whereas our benchmark emphasizes progressive image-grounded reasoning.

Our benchmark features the following distinctive aspects: (1) Multi-turn, multi-image structure. Each task simulates extended human-AI interactions, involving up to 6 dialogue turns and 27 evolving images, capturing real-world complexity such as procedural workflows and troubleshooting steps. (2) Image-centric reasoning with distractors. Tasks are designed to probe models' ability to reason betwe
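The multi-agent construction recipe described above (language-model agents draft the task structure, images are retrieved or synthesized for each step, and human annotators refine the result) can be read as a three-stage pipeline. The skeleton below is a hypothetical illustration of that flow under those stated assumptions; none of the function names correspond to the released construction code.

```python
# Hypothetical skeleton of a multi-agent benchmark-construction pipeline of
# the kind described above. Function names are illustrative only and do not
# come from the released code; they just mark where each stage would plug in.
from typing import Callable, Dict, List


def draft_task_chain(domain: str,
                     llm_generate: Callable[[str], List[Dict]]) -> List[Dict]:
    """Stage 1: a language-model agent proposes an ordered list of steps."""
    return llm_generate(f"Draft a step-by-step visual task for: {domain}")


def attach_images(steps: List[Dict],
                  image_source: Callable[[str], List[str]]) -> List[Dict]:
    """Stage 2: retrieve or synthesize images (and distractors) per step."""
    for step in steps:
        step["images"] = image_source(step["description"])
    return steps


def human_review(steps: List[Dict]) -> List[Dict]:
    """Stage 3: annotators keep only steps they have explicitly approved."""
    return [step for step in steps if step.get("approved", False)]


def build_task(domain: str, llm_generate, image_source) -> List[Dict]:
    """Chain the three stages into one task-construction pass."""
    steps = draft_task_chain(domain, llm_generate)
    steps = attach_images(steps, image_source)
    return human_review(steps)
```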


Reference

This content is AI-processed based on open access ArXiv data.
