UmniBench: Unified Understanding and Generation Model Oriented Omni-dimensional Benchmark

Reading time: 5 minutes

📝 Original Info

  • Title: UmniBench: Unified Understanding and Generation Model Oriented Omni-dimensional Benchmark
  • ArXiv ID: 2512.17196
  • Date: 2025-12-19
  • Authors: Kai Liu, Leyang Chen, Wenbo Li, Zhikai Chen, Zhixin Wang, Renjing Pei, Linghe Kong, Yulun Zhang

📝 Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. However, evaluations of unified multimodal models (UMMs) remain decoupled, assessing their understanding and generation abilities separately with corresponding datasets. To address this, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. First, UmniBench can assess the understanding, generation, and editing abilities within a single evaluation process. Based on human-examined prompts and QA pairs, UmniBench leverages the UMM itself to evaluate its generation and editing abilities with its understanding ability. This simple but effective paradigm allows comprehensive evaluation of UMMs. Second, UmniBench covers 13 major domains and more than 200 concepts, ensuring a thorough inspection of UMMs. Moreover, UmniBench can also decouple and separately evaluate understanding, generation, and editing abilities, providing a fine-grained assessment. Based on UmniBench, we benchmark 24 popular models, including both UMMs and single-ability large models. We hope this benchmark provides a more comprehensive and objective view of unified models and practical support for improving the performance of community models.

💡 Deep Analysis

Deep Dive into UmniBench: Unified Understanding and Generation Model Oriented Omni-dimensional Benchmark.


📄 Full Content

UmniBench: Unified Understanding and Generation Model Oriented Omni-dimensional Benchmark

Kai Liu1*, Leyang Chen1*, Wenbo Li2, Zhikai Chen3, Zhixin Wang3, Renjing Pei3, Linghe Kong1†, Yulun Zhang1†
1Shanghai Jiao Tong University, 2The Chinese University of Hong Kong, 3Huawei Technologies Ltd.
*Equal contribution. †Corresponding authors: Yulun Zhang (yulun100@gmail.com), Linghe Kong (linghe.kong@sjtu.edu.cn)
Homepage: https://umnibench.github.io/

Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. However, evaluations of unified multimodal models (UMMs) remain decoupled, assessing their understanding and generation abilities separately with corresponding datasets. To address this, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. First, UmniBench can assess the understanding, generation, and editing abilities within a single evaluation process. Based on human-examined prompts and QA pairs, UmniBench leverages the UMM itself to evaluate its generation and editing abilities with its understanding ability. This simple but effective paradigm allows comprehensive evaluation of UMMs. Second, UmniBench covers 13 major domains and more than 200 concepts, ensuring a thorough inspection of UMMs. Moreover, UmniBench can also decouple and separately evaluate understanding, generation, and editing abilities, providing a fine-grained assessment. Based on UmniBench, we benchmark 24 popular models, including both UMMs and single-ability large models. We hope this benchmark provides a more comprehensive and objective view of unified models and practical support for improving the performance of community models.

1. Introduction

A unified multimodal model (UMM) refers to a single model that seamlessly integrates understanding, generation, and editing within a single architecture, with the aim that these capabilities mutually reinforce one another [7, 26, 31]. Recent advances in UMMs have shown impressive capabilities in cutting-edge models, which mainly focus on architecture design, post-training paradigms, and dataset construction. However, current evaluation protocols for these models are largely decoupled [8, 32]: they assess understanding, generation, and editing separately rather than in an integrated manner. Specifically, understanding benchmarks typically take text and images as input and use multiple-choice answers to probe comprehension and reasoning [6]. By contrast, generation and editing benchmarks take short textual prompts or text–image pairs as input to assess the quality and consistency of the output image [10]. Although these benchmarks enable accurate evaluation of their corresponding sub-tasks, they are inherently unsuitable for evaluating UMMs: they capture each facet in isolation and do not reflect the holistic competence of unification, namely the seamless integration of understanding, generation, and editing. Consequently, there is an urgent need for a benchmark specifically designed to evaluate the holistic capabilities of unified generation–understanding models.

[Figure 1. Advantages of the proposed UmniBench (unified, omni-dimensional, fine-grained, simple, efficient, no leakage; three processes in one model) compared with previous isolated evaluation protocols, which rely on external reviewers, task-specific metrics, and redundant benchmarks.]
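For concreteness, the decoupled protocol criticized above (an understanding benchmark that feeds a model an image plus a multiple-choice question and scores only answer accuracy) can be sketched as follows. This is a hypothetical illustration: the `Sample` fields and the `model.answer` interface are assumptions for this sketch, not an API from the paper or any cited benchmark.

```python
# Hypothetical sketch of a conventional, isolated understanding benchmark:
# the model sees an image and a multiple-choice question, and is scored on
# answer accuracy alone, with no link to its generation or editing ability.
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str      # input image
    question: str        # text question about the image
    choices: list[str]   # candidate answers, e.g. ["A) red", "B) blue"]
    answer: str          # ground-truth choice label, e.g. "B"

def understanding_accuracy(model, samples: list[Sample]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    correct = 0
    for s in samples:
        prediction = model.answer(image=s.image_path,
                                  question=s.question,
                                  choices=s.choices)
        correct += int(prediction.strip() == s.answer)
    return correct / len(samples)
```

Generation benchmarks mirror this isolation in the other direction, scoring output images without probing whether the same model understands what it drew; this one-sided scoring is exactly the gap the paper targets.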
To address this gap, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. While existing pure generation and editing models have achieved remarkable image quality, they remain deficient in complex instruction following [1]. In particular, these models often struggle to accurately interpret complex user intents, especially when the input requires nontrivial reasoning. Researchers have therefore begun integrating understanding capabilities into generative models, aiming to improve unified models' handling of complex intents in generation and editing tasks. Accordingly, our proposed UmniBench focuses on image generation and editing scenarios that demand substantive understanding and reasoning.

Conventional benchmarks typically rely on external models, hand-crafted rules, or human evaluators to score system outputs [23]. Because UMMs intrinsically possess understanding, generation, and editing capabilities, they can, in principle, serve as their own evaluators. A natural property of ima

[Figure 2. Overview of UmniBench. The 13 domains are Mechanics, Spatial, Animal, Plant, Fluid, Playground, Arts and Crafts, Office, Household, Gardening, Cooking, Weather and Environment, and Personal Care; the left-hand and right-hand panels show representative images generated for concepts under each domain.]

…(Full text truncated)…
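The self-evaluation paradigm outlined in the introduction, where the UMM grades its own generations by answering human-examined QA pairs about the image it just produced, can be summarized in a short sketch. This is a minimal sketch under stated assumptions: the `umm` object with `generate` and `answer` methods and the `qa_pairs` schema are hypothetical, not the paper's actual pipeline.

```python
# Minimal sketch of UmniBench-style self-evaluation: the same UMM first
# generates an image from a prompt, then answers human-examined QA pairs
# about its own output; QA accuracy serves as the generation score.
# (umm.generate / umm.answer and the qa_pairs fields are assumptions.)

def umnibench_style_score(umm, prompt: str, qa_pairs: list[dict]) -> float:
    """Generate an image with the UMM, then grade it with the UMM's own
    understanding ability via multiple-choice QA over the generated image."""
    image = umm.generate(prompt)  # generation (or editing) step
    correct = 0
    for qa in qa_pairs:           # each qa: {"question", "choices", "answer"}
        pred = umm.answer(image=image,
                          question=qa["question"],
                          choices=qa["choices"])
        correct += int(pred.strip() == qa["answer"])
    return correct / len(qa_pairs)
```

A low score under this loop can stem from weak generation (the image lacks the prompted content) or from weak understanding (the model misreads its own image), which is why the benchmark also supports decoupled, per-ability evaluation.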

Reference

This content is AI-processed based on ArXiv data.
