STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning


📝 Original Info

  • Title: STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning
  • ArXiv ID: 2512.13752
  • Date: 2025-12-15
  • Authors: Jie Qin (Meituan Inc.), Jiancheng Huang (Meituan Inc.), Limeng Qiao (Meituan Inc.), Lin Ma (Meituan Inc.)

📝 Abstract

Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified objective for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity vector quantizer (VQ) to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
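The "high-capacity VQ" in the abstract refers to a vector-quantized image tokenizer; the paper's actual codebook size and training procedure are not given in this excerpt, but the core quantization step can be sketched as follows (function name and shapes are illustrative, not the authors' implementation):

```python
import numpy as np

def vq_quantize(z, codebook):
    """Nearest-neighbor vector quantization: map each latent vector to the
    index of its closest codebook entry under L2 distance.

    z: (N, d) image latents; codebook: (K, d) entries. A larger K (a
    "high-capacity" VQ) gives finer-grained discrete image tokens.
    """
    # Pairwise squared distances between latents and codebook entries: (N, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)          # discrete token ids, fed to the AR model
    return idx, codebook[idx]        # ids and their quantized vectors
```

The discrete indices are what an AR model like STAR would predict token by token; the quantized vectors are what a decoder would reconstruct the image from.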

💡 Deep Analysis

Figure 1

📄 Full Content

Technical Report

STAR: STACKED AUTOREGRESSIVE SCHEME FOR UNIFIED MULTIMODAL LEARNING

Jie Qin∗, Jiancheng Huang∗, Limeng Qiao∗, Lin Ma
Meituan Inc.
Project page: https://star-mm-ai.github.io

1 INTRODUCTION

In recent years, the rapid advancement of multimodal large language models (MLLMs) has significantly propelled the progress of artificial general intelligence (AGI) (Touvron et al., 2023; Bi et al., 2024; OpenAI, 2024a; Team et al., 2023; DeepSeek-AI et al., 2025; Yang et al., 2025).
Numerous studies have focused on constructing unified models that use a single set of parameters to simultaneously handle different tasks, such as multimodal understanding and generation (Wang et al., 2024; Chen et al., 2025c; Wang et al., 2025; Liao et al., 2025; Deng et al., 2025; Xie et al., 2025; OpenAI, 2025; Chen et al., 2025b). However, these from-scratch-trained models face a critical challenge: inherent conflicts exist between multimodal understanding and generation tasks in both optimization objectives and feature spaces. Joint training therefore often sacrifices performance in one or more domains, limiting the overall capability ceiling of unified models.

Against this backdrop, a fundamental research question emerges: can we continuously enhance a model's image generation capabilities while fully preserving its multimodal understanding abilities? Existing approaches, such as MetaQuery (Pan et al., 2025) and BLIP3-o (Chen et al., 2025a), adopt a warm-started adaptation paradigm, which initializes from a pre-trained multimodal understanding model and augments it with a diffusion-based generator to enhance generation while preserving image-to-text capability. Yet these approaches typically require constructing feature-transformation bridges between autoregressive and diffusion models or designing complex loss functions, significantly increasing training complexity. Thus, we face a critical challenge: how can a single MLLM be extended in the most streamlined manner possible, enabling it to progressively acquire more sophisticated multimodal capabilities without compromising existing abilities?

To address this challenge, we propose STAR (STacked AutoRegressive Scheme for Unified Multimodal Learning), a novel unified learning method based on a stacked autoregressive (AR) paradigm that offers three key design advantages: (i) a task-progressive training strategy; (ii) a stacked autoregressive model; and (iii) an implicit reasoning mechanism.

Figure 1: STAR enables unified multimodal learning for understanding, text-to-image, image editing, and reasoning, with a diffusion decoder enhancing the granularity of image outputs. (Panels show text-to-image generation, image editing, and knowledge-reasoning examples.)

Firstly, the task-progressive training paradigm decomposes unified multimodal learning into an ordered curriculum: understanding, generation, and editing, while freezing the fundamental AR backbone at each extension. This staged training paradigm simultaneously shields existing comprehension capabilities from catastrophic degradation and equips the model with novel generative abilities. Secondly, the stacked autoregressive model extends the frozen fundamental AR by appending a small set of isomorphic AR modules that share an identical architecture and are initialized from the same parameters. The generation and editing tasks can be optimized with the standard next-token prediction objective.

∗Equal Contribution.
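The stacking scheme described above can be sketched in PyTorch. This is a toy illustration under stated assumptions, not the paper's implementation: module sizes are arbitrary, a single stacked block stands in for the "small set" of isomorphic modules, and the embedding and output head are left trainable for simplicity. It shows the two key mechanics: the base AR backbone is frozen, and the stacked module is an identical-architecture copy initialized from the base's parameters, trained with standard next-token prediction.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARBlock(nn.Module):
    """Stand-in for one autoregressive transformer block (toy size)."""
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=64, batch_first=True)

    def forward(self, x, causal_mask):
        return self.layer(x, src_mask=causal_mask)

class StackedAR(nn.Module):
    """Frozen base AR block plus a stacked isomorphic block: same
    architecture, initialized from the same parameters (hypothetical names)."""
    def __init__(self, vocab_size=256, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.base = ARBlock(d_model)
        # Isomorphic module: identical architecture, copied initial weights.
        self.stacked = copy.deepcopy(self.base)
        self.head = nn.Linear(d_model, vocab_size)
        # Freeze the fundamental AR backbone so understanding-stage
        # capabilities cannot be degraded by generation/editing training.
        for p in self.base.parameters():
            p.requires_grad = False

    def forward(self, tokens):
        T = tokens.size(1)
        # Boolean causal mask: True entries are disallowed attention positions.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.embed(tokens)
        h = self.base(h, mask)       # frozen
        h = self.stacked(h, mask)    # trainable extension
        return self.head(h)

def next_token_loss(model, tokens):
    """Standard next-token prediction: predict tokens[:, 1:] from the prefix."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
```

During generation/editing training, gradients flow only into the stacked module (and here the embedding/head); the frozen base keeps its parameters untouched, which is how the scheme avoids cross-task interference.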


