EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture


We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA primarily consists of 1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation; applying the same compression ratio to images also keeps training balanced between understanding and generation tasks. 2) Channel-wise concatenation instead of token-wise concatenation of visual understanding and generation tokens, which further reduces the number of visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvement across tasks while meeting task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments show that EMMA-4B significantly outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.


💡 Research Summary

The paper introduces EMMA (Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture), a novel and efficient model designed to handle multiple multimodal tasks—including visual understanding, text-to-image generation, and instruction-based image editing—within a single, unified framework. EMMA addresses key inefficiencies prevalent in prior unified models that follow the “unification of architecture formats” paradigm.

The core innovation of EMMA lies in its architectural design centered around efficiency and performance synergy. First, it employs a high-compression autoencoder (DCAE) with a 32x compression ratio for the generation branch. This aligns with the 32x compression achieved by the understanding branch’s visual encoder (a modified SigLIP2 with pixel shuffling), resolving a fundamental mismatch in token scales that plagued previous models like BAGEL. This alignment enables the second key innovation: channel-wise concatenation of visual tokens from the understanding and generation branches. Unlike the token-wise concatenation used in earlier works, which drastically increases the sequence length, channel-wise fusion merges information along the feature dimension without inflating the token count. This leads to a dramatic reduction in visual context tokens (e.g., up to 5x fewer in interactive tasks like image editing), significantly boosting training and inference efficiency.
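The token-count argument above can be made concrete with a small shape sketch. This is an illustrative example, not the paper's implementation: the image size, feature dimension, and variable names are assumptions. It shows why matching 32x compression ratios matters: both branches produce the same number of tokens, so they can be fused along the feature axis without growing the sequence.

```python
import numpy as np

# Hypothetical sizes: a 1024x1024 image, 32x spatial compression in
# both branches, and an illustrative feature dimension d per branch.
tokens_per_side = 1024 // 32          # 32 tokens per side
n_tokens = tokens_per_side ** 2       # 1024 visual tokens per branch
d = 256

und_tokens = np.random.rand(n_tokens, d)   # understanding branch output
gen_tokens = np.random.rand(n_tokens, d)   # generation branch output

# Token-wise concatenation: the sequence length doubles.
token_wise = np.concatenate([und_tokens, gen_tokens], axis=0)
assert token_wise.shape == (2 * n_tokens, d)

# Channel-wise concatenation: the sequence length is unchanged;
# information is merged along the feature dimension instead.
channel_wise = np.concatenate([und_tokens, gen_tokens], axis=1)
assert channel_wise.shape == (n_tokens, 2 * d)
```

Channel-wise fusion is only well-defined here because the two branches share the same compression ratio and thus the same token grid; with mismatched ratios (as in earlier models), token-wise concatenation is the only straightforward option.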

Third, EMMA features a shared-and-decoupled network design. Recognizing that understanding and generation tasks share common low-level abstractions but require specialized high-level processing, the model shares parameters in the shallow layers to foster mutual improvement and knowledge transfer. In the deeper layers, parameters are decoupled to cater to the distinct needs of semantic modeling (for understanding) and joint semantic/detail modeling (for generation).
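The shared-and-decoupled idea can be sketched as a trunk of shared shallow layers followed by task-specific deep branches. The layer count, dimensions, and activation below are illustrative assumptions, not the paper's architecture; each "layer" is reduced to a random linear map for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

def layer(d_in, d_out):
    # Illustrative stand-in for a transformer layer: a weight matrix.
    return rng.standard_normal((d_in, d_out)) * 0.02

# Shallow layers are shared across tasks (depths are assumptions).
shared_layers = [layer(d, d) for _ in range(2)]
# Deep layers are decoupled: one branch per task.
understanding_layers = [layer(d, d) for _ in range(2)]
generation_layers = [layer(d, d) for _ in range(2)]

def forward(x, task):
    # Common low-level abstractions pass through the shared trunk,
    # so gradients from both tasks update these parameters.
    for w in shared_layers:
        x = np.tanh(x @ w)
    # High-level processing uses the task-specific branch.
    branch = understanding_layers if task == "understand" else generation_layers
    for w in branch:
        x = np.tanh(x @ w)
    return x

x = rng.standard_normal((4, d))
assert forward(x, "understand").shape == (4, d)
assert forward(x, "generate").shape == (4, d)
```

The design choice this illustrates: both tasks train the shared trunk (enabling transfer), while each task's deep branch specializes without interference from the other.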

Fourth, to enhance the perceptual capability of the understanding encoder for diverse image types, EMMA incorporates a Mixture-of-Experts (MoE) mechanism into the SigLIP2 backbone. It introduces a dedicated STEM expert alongside the original versatile expert. A router dynamically selects the appropriate expert based on the input image, improving performance on specialized benchmarks with only a modest increase of around 50M parameters.
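The routing mechanism can be sketched as top-1 selection between two experts from a pooled image feature. The pooling strategy, scoring function, and hard top-1 choice are assumptions for illustration; each expert is reduced to a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Two illustrative experts, mirroring the paper's versatile and STEM
# experts; here each is just a weight matrix.
experts = {
    "versatile": rng.standard_normal((d, d)) * 0.02,
    "stem": rng.standard_normal((d, d)) * 0.02,
}
# Router: scores each expert from an image-level feature (assumption:
# mean pooling over tokens followed by a linear scoring layer).
router_w = rng.standard_normal((d, len(experts))) * 0.02

def route(image_tokens):
    pooled = image_tokens.mean(axis=0)        # image-level summary
    scores = pooled @ router_w                # one score per expert
    names = list(experts)
    chosen = names[int(np.argmax(scores))]    # hard top-1 selection
    return chosen, image_tokens @ experts[chosen]

tokens = rng.standard_normal((16, d))
name, out = route(tokens)
assert name in ("versatile", "stem")
assert out.shape == tokens.shape
```

Because only one expert runs per image, adding the STEM expert grows the parameter count (the paper reports roughly 50M extra parameters) without proportionally increasing per-image compute.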

EMMA is trained end-to-end in multiple stages: alignment, pre-training, and supervised fine-tuning. The training corpus is a large-scale, carefully curated mix of multimodal understanding data (including alignment, pre-training, SFT, and specialized STEM tuning data), text-to-image generation data, and high-quality image editing data synthesized using state-of-the-art models and pipelines.

Comprehensive experiments demonstrate that EMMA-4B (with 4 billion parameters) substantially outperforms larger state-of-the-art unified models like BAGEL-7B (with 7 billion parameters) across a wide range of benchmarks. It achieves scores of 73.0 on MMVet (vs. BAGEL’s 67.2) and 0.93 on GenEval (vs. BAGEL’s 0.88). Remarkably, despite being a general-purpose unified architecture, EMMA also achieves competitive results compared to recent task-specific expert models in both understanding (e.g., Qwen3-VL) and generation (e.g., Qwen-Image).

The paper concludes that EMMA establishes a solid foundation for future unified multimodal architectures by successfully balancing and advancing the critical axes of efficiency, performance, and task generality. Its design effectively reduces computational overhead while achieving superior or competitive results, highlighting a promising path forward for capable and practical multimodal foundation models.

