Large Continual Instruction Assistant


Continual Instruction Tuning (CIT) continually instructs Large Models to follow human intent, dataset by dataset. We observe that existing gradient updates heavily degrade performance on previous datasets during the CIT process. In contrast, Exponential Moving Average (EMA) can trace previous parameters and thereby reduce forgetting. Nonetheless, its fixed balance weight cannot cope with ever-changing datasets, leading to an imbalance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address this challenge. Starting from the trade-off prerequisite and the EMA update, we formulate an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be determined automatically from the gradients and learned parameters. We therefore propose a stability-plasticity balanced coefficient to avoid knowledge interference. Based on the semantic similarity of the instructions, we determine whether to retrain or expand the training parameters and allocate the most suitable parameters to each testing instance. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. Our code is available at https://github.com/JingyangQiao/CoIN.
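The instruction-routing idea in the abstract (reuse existing parameters for semantically familiar instructions, expand for new ones) can be sketched as follows. The cosine-similarity measure, the threshold value, and the parameter-group bookkeeping are our own illustrative assumptions, not the paper's specification:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def route(instruction_emb, group_embs, threshold=0.8):
    """Pick the most similar existing parameter group, or signal
    that a new group should be allocated ("expand")."""
    if not group_embs:
        return None, "expand"
    best = max(range(len(group_embs)),
               key=lambda i: cosine(instruction_emb, group_embs[i]))
    if cosine(instruction_emb, group_embs[best]) >= threshold:
        return best, "reuse"   # retrain / serve with this group's parameters
    return None, "expand"      # semantically new task: allocate new parameters
```

At test time the same routine selects which parameter group serves an incoming instruction; during training it decides between retraining an existing group and allocating a fresh one.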


💡 Research Summary

The paper tackles the problem of catastrophic forgetting in Continual Instruction Tuning (CIT) for large foundation models (LFMs) and multimodal large language models (MLLMs). Existing approaches either expand the model with new branches, causing memory and compute costs to balloon, or fine-tune all parameters, which quickly erodes previously learned knowledge. While Exponential Moving Average (EMA) updates preserve past parameters by averaging, a fixed EMA weight (β) cannot adapt to the constantly shifting distribution of instruction data, resulting in an imbalance between plasticity (learning new tasks) and stability (retaining old tasks).
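The fixed-weight EMA update at the heart of this trade-off can be sketched in a few lines. This is a generic EMA over parameter vectors, not the paper's exact procedure:

```python
def ema_update(ema_params, new_params, beta):
    """Blend the previous EMA parameters with freshly trained ones.

    beta near 1: stability (EMA barely moves, old knowledge kept)
    beta near 0: plasticity (EMA jumps toward the new task's parameters)
    """
    return [beta * e + (1.0 - beta) * p
            for e, p in zip(ema_params, new_params)]

old = [1.0, 2.0]           # EMA trace of earlier tasks
new = [3.0, 6.0]           # parameters after training on the new task
ema_update(old, new, 0.9)  # stays near old: stability
ema_update(old, new, 0.1)  # moves toward new: plasticity
```

A single fixed `beta` must serve both regimes, which is exactly the imbalance the paper's adaptive weight is designed to remove.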

Core Contributions

  1. Dynamic EMA Weight Derivation – The authors formulate an “ideal state” where (i) loss on the current task remains unchanged when using EMA parameters, and (ii) EMA parameters stay identical to their previous values to protect old knowledge. By expanding the loss function with a second‑order Taylor series around the current parameters and imposing the ideal constraints, they construct a Lagrangian that couples the loss difference and the EMA update. Solving ∂F/∂β = 0 yields a closed‑form expression for the optimal β at each iteration.

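A derivation of this kind can be sketched as follows. The notation (EMA parameters θ̄ₜ, gradient gₜ, Hessian Hₜ, Lagrange multiplier λ) and the resulting expression are our own reconstruction under the two ideal conditions stated above; the paper's exact closed form may differ, for instance in how the Hessian term is approximated:

```latex
% EMA update: \bar{\theta}_t = \beta\,\bar{\theta}_{t-1} + (1-\beta)\,\theta_t.
% Let \Delta_t = \bar{\theta}_{t-1} - \theta_t, so that
% \bar{\theta}_t - \theta_t = \beta\Delta_t and
% \bar{\theta}_t - \bar{\theta}_{t-1} = (\beta - 1)\Delta_t.
%
% Second-order Taylor expansion of the loss around \theta_t (condition i),
% plus a Lagrangian penalty keeping \bar{\theta}_t close to \bar{\theta}_{t-1}
% (condition ii):
\begin{align}
F(\beta) &= \underbrace{g_t^{\top}(\beta\Delta_t)
             + \tfrac{1}{2}\beta^{2}\,\Delta_t^{\top} H_t \Delta_t}_{\mathcal{L}(\bar{\theta}_t)-\mathcal{L}(\theta_t)}
             + \lambda\,(1-\beta)^{2}\lVert\Delta_t\rVert^{2} \\
\frac{\partial F}{\partial \beta}
         &= g_t^{\top}\Delta_t
             + \beta\,\Delta_t^{\top} H_t \Delta_t
             - 2\lambda(1-\beta)\lVert\Delta_t\rVert^{2} = 0 \\
\beta^{*} &= \frac{2\lambda\lVert\Delta_t\rVert^{2} - g_t^{\top}\Delta_t}
                  {\Delta_t^{\top} H_t \Delta_t + 2\lambda\lVert\Delta_t\rVert^{2}}
\end{align}
```

Consistent with the summary, the resulting β* depends only on the current gradients and the learned parameters, so it can be recomputed automatically at every iteration.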

