JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Reading time: 5 minutes
...

📝 Original Info

  • Title: JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
  • ArXiv ID: 2512.14620
  • Date: 2025-12-16
  • Authors: Atsuyuki Miyai (University of Tokyo), Shota Onohara (University of Tokyo), Jeonghun Baek (University of Tokyo), Kiyoharu Aizawa (University of Tokyo)

📝 Abstract

This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality.

💡 Deep Analysis

📄 Full Content

JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa (The University of Tokyo)
miyai@cvm.t.u-tokyo.ac.jp, {onohara, baek, aizawa}@hal.t.u-tokyo.ac.jp
Project page: https://mmmu-japanese-benchmark.github.io/JMMMU_Pro/ (Dataset, Code, Leaderboard)

[Figure 1: Building JMMMU-Pro via Vibe Benchmark Construction. JMMMU-Pro extends JMMMU by embedding each question image and text into a single image. To construct JMMMU-Pro, we propose Vibe Benchmark Construction, where an image generation model creates questions, followed by human verification and prompt refinement to ensure quality. Experiments indicate that current open-source LMMs struggle with JMMMU-Pro.]

Abstract

This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.

1 Introduction

With the recent success of large multimodal models (LMMs) in English [34, 23, 24], there has been a growing interest in developing multilingual LMMs [52, 10, 58, 6] and LMMs specialized for non-English languages [41, 46]. Although LMM development in the Japanese domain has emerged [41, 4, 40], progress has been slower than in the English domain, in part due to the limited evaluation benchmarks. Given the large and rapidly growing population of Japanese LMM users, there is an increasing need to establish more Japanese benchmarks that can facilitate the development of LMMs capable of handling the Japanese language and culture seamlessly. Among the several benchmarks for Japanese LMMs [16, 5, 32, 33], one of the most representative is JMMMU (Japanese Massive Multi-discipline Multimodal Understanding Benchmark) [33]. Inspired by the MMMU benchmark [57], JMMMU is the first benchmark designed to evaluate LMMs on extensive, multi-disciplinary tasks in Japanese that require college-level subject knowledge, deliberate reasoning, and cultural understanding.
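The construction loop described above (an image generation model produces a candidate visual question, a human verifies it, and rejected candidates are regenerated with an adjusted prompt) can be summarized as a small human-in-the-loop routine. The sketch below only illustrates that flow and is not the authors' released tooling; the `generate_image`, `human_review`, and `refine_prompt` callables are hypothetical placeholders standing in for a Nano Banana Pro API wrapper and a human annotation step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    """Outcome of a human check on a generated candidate."""
    accepted: bool
    feedback: str = ""


def build_visual_question(
    base_prompt: str,
    generate_image: Callable[[str], bytes],    # placeholder: image generation model call
    human_review: Callable[[bytes], Review],   # placeholder: human checks text fidelity / layout
    refine_prompt: Callable[[str, str], str],  # placeholder: fold reviewer feedback into the prompt
    max_rounds: int = 5,
) -> bytes:
    """Generate -> verify -> regenerate with an adjusted prompt until accepted."""
    prompt = base_prompt
    for _ in range(max_rounds):
        candidate = generate_image(prompt)
        review = human_review(candidate)
        if review.accepted:
            return candidate
        # Adjust the prompt based on what the reviewer flagged
        # (e.g., garbled Japanese text or an unreadable layout).
        prompt = refine_prompt(prompt, review.feedback)
    raise RuntimeError("No accepted candidate within max_rounds; author this item manually")
```

The point of the loop is that the expensive step (authoring realistic visual questions) is delegated to the generator, while humans only verify and steer via prompt adjustments, which is what makes the construction scalable and low-cost.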
JMMMU consists of a culture-agnostic (CA) subset of 720 items, constructed through translation from MMMU, and a culture-specific (CS) subset of 600 items that incorporate Japanese cultural elements. This systematic design enables apples-to-apples comparisons with the original MMMU through the CA subset while simultaneously evaluating cultural understanding through the CS subset. Due to its comprehensive and rigorous evaluation coverage, JMMMU has become a foundational benchmark for the development of Japanese LMMs [41, 42, 51].

A major limitation of existing Japanese benchmarks is that the question image and the question text are provided to the model as separate modalities. This evaluation setup differs substantially from the core human cognitive skill: seamlessly integrating visual and textual information and interpreting them through visual perception. Equipping LMMs with this cognitive ability in Japanese is a crucial step toward developing embodied agents and robotic systems [64, 2, 15, 20] that can autonomously operate and explore real-world environments in Japan through visual perception. Furthermore, from the perspective of current LMMs' use cases, users commonly provide LMMs with screenshots that include both Japanese text and images. Therefore, to foster core human cognitive skills and support a wide range of real-world use cases, it is essential to evaluate LMMs on sufficiently complex tasks where both the question image and the question text are presented through a single image.
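To make the input-format difference concrete: in the image-based setting, the model receives one image that already contains the figure, the Japanese question, and the answer options, rather than an (image, text) pair. The snippet below is a naive illustration of that single-image format using Pillow; the actual benchmark renders realistic layouts (webpages, exam sheets, whiteboards) with an image generation model rather than by programmatic compositing, and the CJK font path is an assumption about the local environment.

```python
from PIL import Image, ImageDraw, ImageFont


def compose_single_image(
    figure_path: str,
    question_text: str,
    options: list[str],
    out_path: str,
    font_path: str = "NotoSansCJK-Regular.ttc",  # assumed locally installed CJK font
) -> None:
    """Stack the question figure and the Japanese question/options into one image."""
    figure = Image.open(figure_path).convert("RGB")
    font = ImageFont.truetype(font_path, size=28)  # a CJK-capable font is required for Japanese text

    lines = [question_text] + [f"{label}. {opt}" for label, opt in zip("ABCD", options)]
    text_height = 40 * len(lines) + 20

    # White canvas tall enough for the figure plus the rendered question text.
    canvas = Image.new("RGB", (figure.width, figure.height + text_height), "white")
    canvas.paste(figure, (0, 0))

    draw = ImageDraw.Draw(canvas)
    for i, line in enumerate(lines):
        draw.text((10, figure.height + 10 + 40 * i), line, fill="black", font=font)

    canvas.save(out_path)
```

Under this format, the evaluated model must read the question and options from pixels alone, which is what forces the integrated visual-textual understanding the benchmark targets.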

Reference

This content is AI-processed based on open access ArXiv data.
