Title: STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning
ArXiv ID: 2512.13752
Date: 2025-12-15
Authors: Jie Qin (Meituan Inc.), Jiancheng Huang (Meituan Inc.), Limeng Qiao (Meituan Inc.), Lin Ma (Meituan Inc.)
📝 Abstract
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, unifying multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity vector quantizer (VQ) to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
📄 Full Content
Technical Report
STAR: STACKED AUTOREGRESSIVE SCHEME FOR UNIFIED MULTIMODAL LEARNING
Jie Qin ∗
Jiancheng Huang ∗
Limeng Qiao ∗
Lin Ma
Meituan Inc.
Project page: https://star-mm-ai.github.io
ABSTRACT
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, unifying multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity vector quantizer (VQ) to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
1 INTRODUCTION
In recent years, the rapid advancement of multimodal large language models (MLLMs) has significantly propelled the progress of artificial general intelligence (AGI) (Touvron et al., 2023; Bi et al., 2024; OpenAI, 2024a; Team et al., 2023; DeepSeek-AI et al., 2025; Yang et al., 2025). Numerous studies have focused on constructing unified models that use a single set of parameters to simultaneously handle different tasks, such as multimodal understanding and generation (Wang et al., 2024; Chen et al., 2025c; Wang et al., 2025; Liao et al., 2025; Deng et al., 2025; Xie et al., 2025; OpenAI, 2025; Chen et al., 2025b). However, these from-scratch-trained models face a critical challenge: inherent conflicts exist between multimodal understanding and generation tasks in both optimization objectives and feature spaces. This often results in joint training sacrificing performance in one or more domains, thereby limiting the overall capability ceiling of unified models.
Against this backdrop, a fundamental research question emerges: Can we continuously enhance a model's image generation capabilities while fully preserving its multimodal understanding abilities? Existing approaches, such as MetaQuery (Pan et al., 2025) and BLIP3-o (Chen et al., 2025a), adopt a warm-started adaptation paradigm, which initializes from a pre-trained multimodal understanding model and augments it with a diffusion-based generator to enhance generation while preserving image-to-text capability. Yet, these approaches typically require constructing feature-transformation bridges between autoregressive and diffusion models or designing complex loss functions, significantly increasing training complexity. Thus, we face a critical challenge: how can we extend a single MLLM in the most streamlined manner possible, enabling it to progressively acquire more sophisticated multimodal capabilities without compromising existing abilities?
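For contrast, the warm-started adaptation pattern described above can be summarized in a short, hypothetical PyTorch sketch. The class name WarmStartedGenerator, the query-token design, and all dimensions are illustrative, loosely following the MetaQuery-style description rather than either paper's actual code: a frozen understanding model produces features that a learnable bridge maps into the conditioning space of a separate diffusion generator.

```python
# Hypothetical sketch of warm-started adaptation; not MetaQuery's or
# BLIP3-o's actual implementation.
import torch
import torch.nn as nn

class WarmStartedGenerator(nn.Module):
    def __init__(self, mllm: nn.Module, hidden: int, cond: int, n_query: int = 64):
        super().__init__()
        self.mllm = mllm
        for p in self.mllm.parameters():       # freeze: preserve understanding
            p.requires_grad_(False)
        # learnable query tokens appended to the text sequence
        self.queries = nn.Parameter(torch.randn(n_query, hidden))
        # feature-transformation bridge from AR space to diffusion conditioning
        self.bridge = nn.Linear(hidden, cond)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        B = text_emb.size(0)
        q = self.queries.expand(B, -1, -1)
        h = self.mllm(torch.cat([text_emb, q], dim=1))   # frozen AR features
        cond = self.bridge(h[:, -q.size(1):])            # read out query slots
        return cond  # fed to a diffusion decoder trained under its own loss
```

The extra moving parts here, the bridge and the separately supervised diffusion decoder, are precisely the training complexity that the streamlined extension question above seeks to avoid.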
To address the aforementioned challenge, we propose STAR (STacked AutoRegressive Scheme for Unified Multimodal Learning), a novel unified learning method based on a stacked autoregressive (AR) paradigm that offers three key design advantages: (i) a task-progressive training strategy; (ii) a stacked autoregressive model; and (iii) an implicit reasoning mechanism.
∗Equal Contribution.
Figure 1: STAR enables unified multimodal learning for understanding, text-to-image, image editing, and reasoning, with a diffusion decoder enhancing the granularity of image outputs. (The figure's panels illustrate text-to-image generation, image editing, and knowledge reasoning, with example prompts such as "Remove the tiger in the water", "Replace the bird in the image with a squirrel", and "The fastest land animal.")
Firstly, the task-progressive training paradigm decomposes unified multimodal learning into an ordered curriculum: understanding, generation, and editing, while freezing the fundamental AR backbone at each extension. This staged training paradigm simultaneously shields existing comprehension capabilities from catastrophic degradation and equips the model with novel generative abilities. Secondly, the stacked autoregressive model extends the frozen fundamental AR model by appending a small set of isomorphic AR modules that share an identical architecture and are initialized from the same parameters. The generation and editing tasks can be optimized with the standard next-token prediction objective.
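To make the stacked extension concrete, below is a minimal PyTorch sketch of this design, assuming a decoder-only transformer; all names are hypothetical and this is not the authors' released implementation. Each appended block is deep-copied from the top base block (one plausible reading of "initialized from the same parameters"), the base blocks are then frozen, and only the appended blocks and the output head receive gradients under the standard next-token prediction loss.

```python
# Hypothetical sketch of the stacked AR scheme; not the authors' code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAR(nn.Module):
    def __init__(self, base_blocks: nn.ModuleList, head: nn.Linear, n_stack: int = 4):
        super().__init__()
        # Isomorphic extension: identical architecture, weights copied
        # from the pretrained base *before* the base is frozen.
        self.stacked = nn.ModuleList(
            [copy.deepcopy(base_blocks[-1]) for _ in range(n_stack)]
        )
        self.base = base_blocks
        for p in self.base.parameters():   # freeze the fundamental AR model
            p.requires_grad_(False)
        self.head = head                   # projection to the token vocabulary

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():              # frozen comprehension backbone
            for blk in self.base:
                h = blk(h)
        for blk in self.stacked:           # trainable stacked modules
            h = blk(h)
        return self.head(h)

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction: predict token t+1 from the prefix up to t."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```

Because gradients flow only through the appended blocks, the frozen understanding pathway is left bitwise unchanged, which is the mechanism by which the stacked design avoids cross-task interference while adding generative capability.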