📝 Original Info
- Title: JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
- ArXiv ID: 2512.14620
- Date: 2025-12-16
- Authors: Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa (The University of Tokyo)
📝 Abstract
This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
💡 Deep Analysis
📄 Full Content
JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
Atsuyuki Miyai
Shota Onohara
Jeonghun Baek
Kiyoharu Aizawa
miyai@cvm.t.u-tokyo.ac.jp
{onohara, baek, aizawa}@hal.t.u-tokyo.ac.jp
The University of Tokyo
https://mmmu-japanese-benchmark.github.io/JMMMU_Pro/
[Figure 1 graphic: an example JMMMU question ("この城の名前は何でしょう？" — "What is the name of this castle?", with choices A. 名古屋城, B. 弘前城, C. 彦根城, D. 松本城) shown as a text-set item and as its JMMMU-Pro image-set counterpart, where the question is embedded into the image in various real-world layouts (webpage, iPad, exam sheet, TV show, whiteboard) as well as a random layout.]
Figure 1: Building JMMMU-Pro via Vibe Benchmark Construction. JMMMU-Pro extends JMMMU by embedding each question image and text into a single image. To construct JMMMU-Pro, we propose Vibe Benchmark Construction, where an image generation model creates questions, followed by human verification and prompt refinement to ensure quality. Experiments indicate that current open-source LMMs struggle with JMMMU-Pro.
Abstract
This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
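The construction methodology described above reduces to a generate-verify-regenerate loop. The sketch below is only an illustration of that loop under stated assumptions: `generate_image` and `human_verify` are hypothetical placeholders, not the actual Nano Banana Pro API or the authors' tooling.

```python
# Minimal sketch of the Vibe Benchmark Construction loop described in the paper:
# an image generation model renders a candidate visual question, a human verifies it,
# and the prompt is adjusted and the image regenerated if verification fails.
# `generate_image` and `human_verify` are hypothetical placeholders.

from dataclasses import dataclass


@dataclass
class Candidate:
    prompt: str          # layout/background instructions plus the question text
    image_bytes: bytes   # rendered visual question


def generate_image(prompt: str) -> bytes:
    """Placeholder for a text-to-image call (e.g., an image generation model)."""
    raise NotImplementedError


def human_verify(image_bytes: bytes) -> tuple[bool, str]:
    """Placeholder: a human checks legibility/correctness and returns feedback."""
    raise NotImplementedError


def build_item(base_prompt: str, max_rounds: int = 3) -> Candidate | None:
    prompt = base_prompt
    for _ in range(max_rounds):
        image = generate_image(prompt)
        ok, feedback = human_verify(image)
        if ok:
            return Candidate(prompt=prompt, image_bytes=image)
        # Refine the prompt with the verifier's feedback and regenerate.
        prompt = f"{base_prompt}\nRevision note: {feedback}"
    return None  # discard items that never pass verification
```

As the abstract describes, the human acts only as a verifier and prompt editor while the generative model handles the background and layout variation, which is what keeps construction cost low.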
1 Introduction
With the recent success of large multimodal models (LMMs) in English [34, 23, 24], there has been growing interest in developing multilingual LMMs [52, 10, 58, 6] and LMMs specialized for non-English languages [41, 46]. Although LMM development in the Japanese domain has emerged [41, 4, 40], progress has been slower than in the English domain, in part due to limited evaluation benchmarks. Given the large and rapidly growing population of Japanese LMM users, there is an increasing need to establish more Japanese benchmarks that can facilitate the development of LMMs capable of handling the Japanese language and culture seamlessly.
Among the several benchmarks for Japanese LMMs [16, 5, 32, 33], one of the most representative is JMMMU (Japanese Massive Multi-discipline Multimodal Understanding Benchmark) [33]. Inspired by the MMMU benchmark [57], JMMMU is the first benchmark designed to evaluate LMMs on extensive, multi-disciplinary tasks in Japanese that require college-level subject knowledge, deliberate reasoning, and cultural understanding. JMMMU consists of a culture-agnostic (CA) subset of 720 items, constructed through translation from MMMU, and a culture-specific (CS) subset of 600 items that incorporate Japanese cultural elements. This systematic design enables apples-to-apples comparisons with the original MMMU through the CA subset while simultaneously evaluating cultural understanding through the CS subset. Owing to its comprehensive and rigorous evaluation coverage, JMMMU has become a foundational benchmark for the development of Japanese LMMs [41, 42, 51].
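Because the subset sizes are fixed (720 CA items translated from MMMU, 600 CS items), reporting accuracy per subset is what enables the comparison described above. Below is a minimal scoring sketch, assuming a hypothetical result format with `subset`, `pred`, and `answer` fields; it is not the official evaluation harness.

```python
# Illustrative per-subset scoring (not the official evaluation code): JMMMU-style
# benchmarks report accuracy separately on the culture-agnostic (CA) and
# culture-specific (CS) subsets, so CA scores can be compared directly against
# MMMU while CS scores probe Japanese cultural knowledge.

def subset_accuracy(results: list[dict]) -> dict[str, float]:
    """Assumes entries like {"subset": "CA" | "CS", "pred": "A", "answer": "A"}."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for r in results:
        s = r["subset"]
        totals[s] = totals.get(s, 0) + 1
        correct[s] = correct.get(s, 0) + (r["pred"] == r["answer"])
    return {s: correct[s] / totals[s] for s in totals}
```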
A major limitation of existing Japanese benchmarks is that the question image and the question text are provided to the model as separate modalities. This evaluation setup differs substantially from the core human cognitive skill of seamlessly integrating visual and textual information and interpreting them through visual perception. Equipping LMMs with this cognitive ability in Japanese is a crucial step toward developing embodied agents and robotic systems [64, 2, 15, 20] that can autonomously operate and explore real-world environments in Japan through visual perception. Furthermore, from the perspective of current LMMs' use cases, users commonly provide LMMs with screenshots that include both Japanese text and images. Therefore, to foster core human cognitive skills and support a wide range of real-world use cases, it is essential to evaluate LMMs on sufficiently complex tasks where both the question image and the question text are presented through a single image.
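To make the distinction concrete, the following hypothetical sketch contrasts the two evaluation setups; `lmm_chat` is a stand-in for any chat-style LMM interface and is not a real API.

```python
# Hypothetical sketch contrasting the two evaluation setups.
# `lmm_chat` is a placeholder for a chat-style LMM call, not a real library API.

def lmm_chat(text: str, images: list[bytes]) -> str:
    """Placeholder for a multimodal chat completion; returns the model's answer."""
    raise NotImplementedError


def ask_text_set(question_text: str, question_image: bytes) -> str:
    # JMMMU-style: the question text is machine-readable and passed separately
    # from the accompanying image.
    return lmm_chat(text=question_text, images=[question_image])


def ask_image_set(composed_image: bytes) -> str:
    # JMMMU-Pro-style: the question text is embedded in the image itself, so the
    # model must read and reason over it through visual perception alone.
    instruction = "画像内の問題に答えてください。"  # "Answer the question shown in the image."
    return lmm_chat(text=instruction, images=[composed_image])
```

In the image-set call, the model receives no machine-readable question text, so it must first locate and read the Japanese question inside the image before reasoning about it.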
This content is AI-processed based on open access ArXiv data.