Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework
Dual-to-Dual MLLMs are Multimodal Large Language Models that unify multimodal comprehension and generation across text and image modalities. Although they exhibit strong instantaneous learning and generalization capabilities, Dual-to-Dual MLLMs remain deficient in lifelong evolution, which significantly limits their continual adaptation to dynamic real-world scenarios. One key challenge is that learning new tasks inevitably overwrites previously learned knowledge. Beyond traditional catastrophic forgetting, Dual-to-Dual MLLMs face additional challenges, including hallucination, failure to follow instructions, and failures in cross-modal knowledge transfer. However, no standardized continual learning framework for Dual-to-Dual MLLMs has yet been established, leaving these challenges unexplored. In this paper, we therefore establish Continual-NExT, a continual learning framework for Dual-to-Dual MLLMs with carefully designed evaluation metrics. To improve the continual learning capability of Dual-to-Dual MLLMs, we propose MAGE (Mixture and Aggregation of General LoRA and Expert LoRA), an efficient method that facilitates knowledge transfer across modalities and mitigates forgetting. Extensive experiments demonstrate that MAGE outperforms other continual learning methods and achieves state-of-the-art performance.
💡 Research Summary
The paper addresses a critical gap in the field of multimodal large language models (MLLMs): while Dual‑to‑Dual MLLMs—models that accept both text and image inputs and produce either text or image outputs—have demonstrated strong instantaneous learning and generalization, they lack the ability to continuously adapt to evolving real‑world tasks. Existing continual learning research has focused almost exclusively on text‑only models, leaving multimodal comprehension‑generation systems without a standardized framework for lifelong learning.
To fill this void, the authors propose Continual‑NExT, a unified continual learning (UCL) framework specifically designed for Dual‑to‑Dual MLLMs, and MAGE (Mixture and Aggregation of General LoRA and Expert LoRA), a novel parameter‑efficient adaptation method.
Continual‑NExT Framework
Continual‑NExT defines a sequence of six heterogeneous tasks that together cover the full spectrum of multimodal understanding and generation:
- Visual Question Answering (VQA) – text‑image input, text output.
- Image Classification – image input, text output (label).
- Image Generation – text input, image output.
- OCR Token Recognition – image input, text output.
- Visual Grounding – image‑text input, bounding‑box output.
- Image Editing – image‑text input, image output.
Each task is presented as an independent dataset with sizes ranging from 10 K to 100 K examples, mimicking the stochastic nature of real‑world data streams. The framework introduces a comprehensive evaluation suite: besides the classic metrics of Average Accuracy (Avg.Acc), Forgetting, and New Task Accuracy (New.Acc), it adds three diagnostic measures—Average Hallucination Rate (Avg.HAL), Average Instruction‑Unfollowing Rate (Avg.IUF), and Average Other Error Rate (Avg.OTH). These metrics expose the distinct failure modes that arise when a model must both understand and generate across modalities, such as producing hallucinated visual content or ignoring explicit generation instructions.
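The three classic metrics can be computed from an accuracy matrix recorded after each training stage. The sketch below assumes the standard continual-learning definitions (final average accuracy, best-minus-final forgetting, accuracy right after learning each task); the matrix values are illustrative, not results from the paper.

```python
# Sketch: classic continual-learning metrics from an accuracy matrix.
# acc[t][i] = accuracy on task i after finishing training on task t.
# Values below are illustrative, not results from the paper.

def avg_acc(acc):
    """Average accuracy over all tasks after the final training stage."""
    final = acc[-1]
    return sum(final) / len(final)

def forgetting(acc):
    """Mean drop from each task's best accuracy to its final accuracy."""
    T = len(acc)
    drops = []
    for i in range(T - 1):  # the last task cannot be forgotten yet
        best = max(acc[t][i] for t in range(i, T))
        drops.append(best - acc[-1][i])
    return sum(drops) / len(drops)

def new_acc(acc):
    """Average accuracy on each task immediately after learning it."""
    return sum(acc[t][t] for t in range(len(acc))) / len(acc)

acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.75, 0.00],
    [0.65, 0.70, 0.85],
]
print(round(avg_acc(acc), 4))     # 0.7333
print(round(forgetting(acc), 4))  # ((0.80-0.65) + (0.75-0.70)) / 2 = 0.1
print(round(new_acc(acc), 4))     # (0.80 + 0.75 + 0.85) / 3 = 0.8
```

The diagnostic rates (Avg.HAL, Avg.IUF, Avg.OTH) would be tallied analogously, as the fraction of outputs exhibiting each failure mode averaged over tasks.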
MAGE: Structured LoRA Adaptation
MAGE builds on Low‑Rank Adaptation (LoRA) but splits the adaptation parameters into four distinct groups:
- General Image LoRA (G‑LoRA‑I) – learns representations for image inputs.
- General Text LoRA (G‑LoRA‑T) – learns representations for text inputs.
- Expert Image LoRA (E‑LoRA‑I) – dedicated to image generation decoders.
- Expert Text LoRA (E‑LoRA‑T) – dedicated to text generation decoders.
During the training of a new task, only the LoRA modules that correspond to the task’s input and output modalities are activated and updated; the remaining modules stay frozen. This selective updating dramatically reduces interference between tasks that rely on different modality pathways.
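The selection rule can be sketched as a simple lookup from a task's input/output modalities to the trainable LoRA groups. The group names mirror the text above; the task table and function name are illustrative assumptions, not the paper's implementation.

```python
# Sketch: choosing which LoRA groups to unfreeze for a task, based on
# its input and output modalities. The task table is an illustrative
# assumption; group names follow the four groups described in the text.

TASKS = {
    "vqa":            {"in": {"image", "text"}, "out": "text"},
    "classification": {"in": {"image"},         "out": "text"},
    "generation":     {"in": {"text"},          "out": "image"},
    "editing":        {"in": {"image", "text"}, "out": "image"},
}

def active_loras(task):
    """Return the LoRA groups activated (trainable) for this task;
    all other groups stay frozen."""
    spec = TASKS[task]
    groups = set()
    if "image" in spec["in"]:
        groups.add("G-LoRA-I")  # general image-input adapter
    if "text" in spec["in"]:
        groups.add("G-LoRA-T")  # general text-input adapter
    # Exactly one expert group, matching the output modality.
    groups.add("E-LoRA-I" if spec["out"] == "image" else "E-LoRA-T")
    return groups

print(sorted(active_loras("vqa")))         # ['E-LoRA-T', 'G-LoRA-I', 'G-LoRA-T']
print(sorted(active_loras("generation")))  # ['E-LoRA-I', 'G-LoRA-T']
```

Under this routing, image generation and VQA share only the G-LoRA-T pathway, which is how cross-task interference is confined to genuinely shared modality processing.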
To further protect previously acquired knowledge, the authors introduce PEMA (Parameter‑wise Exponential Moving Average), an improvement over the previously used Dynamic EMA (DEMA). PEMA replaces the Hessian‑based approximation with a Fisher‑information‑based weighting, computed via Monte‑Carlo sampling of gradients. Each parameter receives its own EMA coefficient, allowing fine‑grained control over how quickly it adapts versus how strongly it is retained. This eliminates the memory overhead of storing full historical parameter snapshots and provides a more precise, layer‑agnostic regularization.
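A parameter-wise EMA in this spirit can be sketched as follows. The diagonal Fisher estimate (mean squared gradient over sampled batches) is standard, but the mapping from Fisher value to per-parameter retention coefficient is an assumption for illustration, not the paper's exact PEMA formula.

```python
# Sketch of a parameter-wise EMA in the spirit of PEMA: each parameter
# gets its own retention coefficient derived from an empirical diagonal
# Fisher estimate. The Fisher-to-coefficient mapping is an illustrative
# assumption, not the paper's exact formula.
import numpy as np

def empirical_fisher(grads):
    """Diagonal Fisher estimate: mean squared gradient per parameter,
    over Monte-Carlo sampled gradients (rows of `grads`)."""
    return np.mean(np.square(grads), axis=0)

def pema_update(theta_old, theta_new, fisher, beta_max=0.99):
    """Blend old and new weights element-wise; parameters with high
    Fisher information (important to past tasks) retain more of the
    old value."""
    f = fisher / (fisher.max() + 1e-12)   # normalize to [0, 1]
    beta = beta_max * f                   # per-parameter EMA coefficient
    return beta * theta_old + (1.0 - beta) * theta_new

rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 4))  # 8 sampled gradients, 4 parameters
fisher = empirical_fisher(grads)
theta = pema_update(np.ones(4), np.zeros(4), fisher)
print(theta)  # larger values where the Fisher estimate is larger
```

Because only the current parameters and one Fisher vector are kept, there is no need to store historical parameter snapshots, matching the memory argument in the text.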
The final model parameters are obtained by a simple equal‑weight linear sum of the frozen base model and the four LoRA groups:
W = W₀ + W_GI↓·W_GI↑ + W_GT↓·W_GT↑ + W_EI↓·W_EI↑ + W_ET↓·W_ET↑
where the up/down arrows denote the low‑rank matrices inserted into the frozen transformer layers. Although frozen LoRA components still participate in the forward pass, they are not updated, ensuring that the learned modality‑specific knowledge remains stable.
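Numerically, the merge above is a plain sum of low-rank products added onto the frozen base weight. The sketch below assumes dense NumPy matrices with illustrative shapes and rank; it is not the authors' implementation.

```python
# Sketch: merging the frozen base weight with the four low-rank LoRA
# products from the equation above. Shapes and rank are illustrative.
import numpy as np

def merge_weights(W0, lora_pairs):
    """W = W0 + sum of (down @ up) products, one pair per LoRA group."""
    W = W0.copy()
    for down, up in lora_pairs:  # down: (d, r), up: (r, d)
        W += down @ up           # each product is a rank-r update
    return W

d, r = 16, 4  # hidden size and per-group LoRA rank (illustrative)
rng = np.random.default_rng(1)
W0 = rng.normal(size=(d, d))
# One (down, up) pair per group: G-LoRA-I, G-LoRA-T, E-LoRA-I, E-LoRA-T.
pairs = [(rng.normal(size=(d, r)), rng.normal(size=(r, d)))
         for _ in range(4)]
W = merge_weights(W0, pairs)
print(W.shape)  # (16, 16)
```

Since every group contributes additively with equal weight, freezing a group simply fixes its term in the sum, which is why frozen components still shape the forward pass without being updated.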
Empirical Findings
The authors first visualize parameter change heatmaps when switching from VQA to each subsequent task. They observe that tasks sharing the same input modality cause large updates in the shallow half of the network, while tasks sharing the same output modality affect the deeper half. Moreover, the parameter sets for text‑output and image‑output tasks show minimal overlap, confirming the hypothesis that comprehension and generation can be disentangled at the parameter level.
Three LoRA configurations are compared:
- Full LoRA – all LoRA parameters are updated regardless of modality.
- 2‑Split LoRA – separates LoRA into input‑modality groups only.
- 4‑Split LoRA – separates both input and output modalities (the proposed MAGE).
All configurations keep the total rank constant. The 4‑Split version achieves the best trade‑off: it retains higher accuracy on previously learned tasks (lower Forgetting) while suffering only a modest drop in new‑task performance.
Quantitatively, MAGE outperforms state‑of‑the‑art continual learning baselines (e.g., FwT‑Prompt, CIA) across all six tasks. For text‑based tasks, average accuracy improves by 2–4 %; for image‑generation tasks, CLIP‑Score gains 3–5 %. The diagnostic metrics reveal that hallucination rates drop by roughly 40 % and instruction‑unfollowing rates by 35 % compared to baselines, indicating that the structured LoRA approach not only preserves knowledge but also curbs the generation‑specific errors that are unique to multimodal models.
Discussion and Limitations
While MAGE demonstrates strong empirical performance, its effectiveness depends on careful placement and rank selection of LoRA modules. The current study is limited to six tasks; extending the framework to more complex modalities such as video, audio, or 3‑D data will require additional investigation. Moreover, the Fisher‑based EMA introduces extra gradient sampling overhead, which, although lighter than full Hessian storage, still adds computational cost. Future work could explore more efficient approximations or adaptive rank allocation.
Conclusion
Continual‑NExT provides the first comprehensive benchmark and evaluation suite for lifelong multimodal learning in Dual‑to‑Dual MLLMs. The MAGE method, by explicitly separating modality‑comprehension (General LoRA) from modality‑generation (Expert LoRA) and applying a parameter‑wise EMA, offers a practical solution to catastrophic forgetting, hallucination, and instruction‑unfollowing in continual settings. Extensive experiments confirm that MAGE achieves state‑of‑the‑art performance across a diverse set of understanding and generation tasks, paving the way for truly adaptable multimodal AI systems capable of evolving alongside real‑world demands.