📝 Original Info
- Title: UmniBench: Unified Understanding and Generation Model Oriented Omni-dimensional Benchmark
- ArXiv ID: 2512.17196
- Date: 2025-12-19
- Authors: Kai Liu, Leyang Chen, Wenbo Li, Zhikai Chen, Zhixin Wang, Renjing Pei, Linghe Kong, Yulun Zhang
📝 Abstract
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. However, evaluations of unified multimodal models (UMMs) remain decoupled, assessing their understanding and generation abilities separately with corresponding datasets. To address this, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. First, UmniBench can assess the understanding, generation, and editing abilities within a single evaluation process. Based on human-examined prompts and QA pairs, UmniBench leverages the UMM itself to evaluate its generation and editing abilities with its understanding ability. This simple but effective paradigm allows comprehensive evaluation of UMMs. Second, UmniBench covers 13 major domains and more than 200 concepts, ensuring a thorough inspection of UMMs. Moreover, UmniBench can also decouple and separately evaluate understanding, generation, and editing abilities, providing a fine-grained assessment. Based on UmniBench, we benchmark 24 popular models, including both UMMs and single-ability large models. We hope this benchmark provides a more comprehensive and objective view of unified models and practical support for improving community models.
📄 Full Content
UmniBench: Unified Understanding and Generation Model Oriented Omni-dimensional Benchmark
Kai Liu¹*, Leyang Chen¹*, Wenbo Li², Zhikai Chen³, Zhixin Wang³, Renjing Pei³, Linghe Kong¹†, Yulun Zhang¹†
¹Shanghai Jiao Tong University, ²The Chinese University of Hong Kong, ³Huawei Technologies Ltd.
*Equal contribution. †Corresponding authors: Yulun Zhang (yulun100@gmail.com), Linghe Kong (linghe.kong@sjtu.edu.cn)
Homepage: https://umnibench.github.io/
Abstract
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. However, evaluations of unified multimodal models (UMMs) remain decoupled, assessing their understanding and generation abilities separately with corresponding datasets. To address this, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. First, UmniBench can assess the understanding, generation, and editing abilities within a single evaluation process. Based on human-examined prompts and QA pairs, UmniBench leverages the UMM itself to evaluate its generation and editing abilities with its understanding ability. This simple but effective paradigm allows comprehensive evaluation of UMMs. Second, UmniBench covers 13 major domains and more than 200 concepts, ensuring a thorough inspection of UMMs. Moreover, UmniBench can also decouple and separately evaluate understanding, generation, and editing abilities, providing a fine-grained assessment. Based on UmniBench, we benchmark 24 popular models, including both UMMs and single-ability large models. We hope this benchmark provides a more comprehensive and objective view of unified models and practical support for improving community models.
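The paradigm above can be pictured as a short loop: the model renders a human-examined prompt into an image, then answers the accompanying QA pairs about its own output, and QA accuracy becomes the score. Below is a minimal sketch of that loop; the `umm` object, its `generate`/`answer` methods, and the exact-match scoring are assumptions for illustration, not the paper's actual interface.

```python
# Minimal sketch of the self-evaluation loop described in the abstract.
# `umm` is a hypothetical unified model exposing .generate(prompt) -> image
# and .answer(image, question) -> str; UmniBench's real interface may differ.

def evaluate_generation(umm, benchmark_items):
    """Score a UMM's generation ability using its own understanding."""
    correct, total = 0, 0
    for item in benchmark_items:
        # 1. Generation: the UMM renders the human-examined prompt.
        image = umm.generate(item["prompt"])
        # 2. Understanding: the same UMM answers QA pairs about its own output.
        for qa in item["qa_pairs"]:
            prediction = umm.answer(image, qa["question"])
            correct += int(prediction.strip().lower() == qa["answer"].lower())
            total += 1
    # 3. QA accuracy over the model's own outputs serves as the score.
    return correct / max(total, 1)
```

An editing item would follow the same loop, with `umm.edit(source_image, instruction)` in place of `generate`; the evaluation stays in a single process because the judge and the generator are the same model.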
1. Introduction
A unified multimodal model (UMM) refers to a single model that seamlessly integrates understanding, generation, and editing within a single architecture, with the aim that these capabilities mutually reinforce one another [7, 26, 31]. Recent advances in UMMs have shown impressive capabilities in cutting-edge models, which mainly focus on architecture design, post-training paradigms, and dataset construction. However, current evaluation protocols for these models are largely decoupled [8, 32]: they assess understanding, generation, and editing separately rather than in an integrated manner. Specifically, understanding benchmarks typically take text and images as input and use multiple-choice answers to probe comprehension and reasoning [6].
[Figure 1] Advantages of our proposed UmniBench compared with previous isolated evaluation protocols for UMMs: understanding (Und.), generation (Gen.), and editing are assessed in one model and one process (unified, omni-dimensional, fine-grained, simple, efficient, no leakage), whereas prior protocols rely on many isolated benchmarks, redundant tasks, external reviewers, and separate metrics.
By contrast, generation and editing benchmarks take short textual prompts or text–image pairs as input to assess the quality and consistency of the output image [10]. Although these benchmarks allow accurate evaluation of their corresponding sub-tasks, they are inherently unsuitable for evaluating UMMs: they capture each facet in isolation and do not reflect the holistic competence of unification, namely the seamless integration of understanding, generation, and editing. Consequently, there is an urgent need for a benchmark specifically designed to evaluate the holistic capabilities of unified generation–understanding models.
To address this gap, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. While existing pure generation and editing models have achieved remarkable image quality, they remain deficient in complex instruction following [1]. In particular, these models often struggle to accurately interpret complex user intents, especially when the input requires nontrivial reasoning. Researchers have therefore begun integrating understanding capabilities into generative models, aiming to improve unified models' handling of complex intents in generation and editing tasks. Accordingly, our proposed UmniBench focuses on image generation and editing scenarios that demand substantive understanding and reasoning.
Conventional benchmarks typically rely on external models, hand-crafted rules, or human evaluators to score system outputs [23]. Because UMMs intrinsically possess understanding, generation, and editing capabilities, they can, in principle, serve as their own evaluators.
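Since the same QA pairs could also be posed on human-verified reference images, one plausible way to decouple understanding from generation is to compare QA accuracy on reference images (understanding alone) against accuracy on self-generated images (understanding entangled with generation). The sketch below illustrates this idea; the `reference_image` field and the gap-based attribution are assumptions, as the excerpt does not show UmniBench's actual decoupling protocol.

```python
def decoupled_scores(umm, items):
    """Hypothetical decoupling sketch: QA accuracy on reference images
    isolates understanding, while accuracy on self-generated images also
    reflects generation quality; the gap hints at generation failures."""
    und_hits = gen_hits = total = 0
    for item in items:
        generated = umm.generate(item["prompt"])
        for qa in item["qa_pairs"]:
            gold = qa["answer"].strip().lower()
            # Understanding: answer about a human-verified reference image.
            ref_pred = umm.answer(item["reference_image"], qa["question"])
            und_hits += int(ref_pred.strip().lower() == gold)
            # Generation (entangled with understanding): answer about own output.
            gen_pred = umm.answer(generated, qa["question"])
            gen_hits += int(gen_pred.strip().lower() == gold)
            total += 1
    understanding = und_hits / max(total, 1)
    generation = gen_hits / max(total, 1)
    return {"understanding": understanding,
            "generation": generation,
            "gap": understanding - generation}
```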
[Figure 2] The overview of UmniBench. All 13 domains are enumerated: Mechanics, Spatial, Animal, Plant, Fluid, Playground, Arts and Crafts, Office, Household, Gardening, Cooking, Weather and Environment, and Personal Care. The left-hand and right-hand panels present representative images generated for concepts under specific domains (e.g., Mechanics, Household).
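The figure's 13 domains suggest a simple two-level organization of benchmark items (domain → concept → prompt with QA pairs). A sketch of one plausible layout follows; the domain names come from the figure, while the `BenchmarkItem` structure and the example values are purely illustrative.

```python
from dataclasses import dataclass, field

# Domain names are taken from Figure 2; the concept/prompt structure below
# is illustrative, not the paper's actual data format.
DOMAINS = [
    "Mechanics", "Spatial", "Animal", "Plant", "Fluid", "Playground",
    "Arts and Crafts", "Office", "Household", "Gardening", "Cooking",
    "Weather and Environment", "Personal Care",
]

@dataclass
class BenchmarkItem:
    domain: str    # one of the 13 domains above
    concept: str   # one of the 200+ concepts; "lever" below is hypothetical
    prompt: str    # human-examined generation/editing prompt
    qa_pairs: list = field(default_factory=list)  # [{"question": ..., "answer": ...}]

# Example item with illustrative values:
item = BenchmarkItem(
    domain="Mechanics",
    concept="lever",
    prompt="A seesaw balanced with a heavier child sitting closer to the pivot.",
    qa_pairs=[{"question": "Which child sits closer to the pivot?",
               "answer": "the heavier one"}],
)
assert item.domain in DOMAINS
```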
A natural property of ima
…(Full text truncated)…
Reference
This content is AI-processed based on ArXiv data.