Preventing Continual-Learning Forgetting in Large Language Models via Deep Alignment

Reading time: 6 minutes

📝 Abstract

Catastrophic forgetting remains a fundamental challenge in continual learning for large language models. Recent work [2] revealed that performance degradation may stem from spurious forgetting caused by task alignment disruption rather than true knowledge loss. However, this foundational work left critical gaps: it only qualitatively describes alignment, relies on post-hoc analysis, and lacks automatic distinction mechanisms. Key Contribution: We extend [2] by introducing the shallow versus deep alignment framework, which provides the first quantitative characterization of alignment depth. We identify that current task alignment approaches suffer from shallow alignment: alignment is maintained only over the first few output tokens (approximately 3-5), making models vulnerable to forgetting. This shallow alignment explains why spurious forgetting occurs, why it is reversible, and why fine-tuning attacks are effective. In this paper, we propose a comprehensive framework that addresses all gaps in [2]: (1) quantitative metrics (0-1 scale) to measure alignment depth across token positions, addressing the qualitative-only limitation; (2) real-time detection methods for identifying shallow alignment and spurious forgetting during training, enabling early intervention; (3) specialized analysis tools for alignment depth visualization and recovery prediction; and (4) adaptive mitigation strategies that automatically distinguish forgetting types and promote deep alignment. Extensive experiments on multiple datasets and model architectures (Qwen2.5-3B to Qwen2.5-32B) demonstrate 86.2-90.6% identification accuracy and show that promoting deep alignment improves robustness against forgetting by 3.3-7.1% over baselines, including the fixed freezing strategy in [2].

📄 Content

Real-Time Detection and Quantitative Analysis of Spurious Forgetting in Continual Learning

Weiwei Wang
Shenzhen Sunline Tech Co., Ltd.
Shenzhen, China
weiweiw404@gmail.com

December 25, 2025
Keywords: Continual learning, catastrophic forgetting, spurious forgetting, shallow alignment, deep alignment, task alignment depth

arXiv:2512.20634v1 [cs.LG] 2 Dec 2025

1 Introduction

Continual learning has emerged as a critical capability for large language models (LLMs) to adapt to new tasks and domains without forgetting previously acquired knowledge. As LLMs are increasingly deployed in dynamic environments where new tasks and domains emerge continuously, the ability to learn new capabilities while preserving existing ones becomes essential for practical applications. However, the phenomenon of catastrophic forgetting, where models lose performance on previous tasks when learning new ones, poses a significant challenge [1]. This problem is particularly acute in resource-constrained scenarios where storing all training data for replay is infeasible, or when privacy concerns prevent data retention. Traditional approaches assume that performance degradation directly indicates knowledge loss, leading to strategies that attempt to preserve all learned parameters or replay all previous data. (see Appendix A.5.1)

Recent research has revealed a more nuanced understanding of forgetting mechanisms. The concept of spurious forgetting, introduced in 2025, suggests that performance degradation may stem from task alignment disruption rather than true knowledge loss [2]. In spurious forgetting, internal representations remain intact, but the alignment between the representations and the output layer is disrupted. This distinction is crucial because spurious forgetting can be reversed through minimal fine-tuning (often requiring only 50-100 samples and 1-3 epochs), whereas true forgetting requires extensive retraining with full datasets. Understanding this distinction opens new opportunities for efficient mitigation: instead of preserving all parameters or replaying all data, we can focus on maintaining or repairing alignment, which is far more computationally efficient.
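The operational test behind this distinction can be sketched in code. The paper's exact detection procedure is not given in this excerpt; the helper below is a hypothetical illustration assuming we have task accuracy before and after continual training, plus the accuracy of a linear probe retrained on frozen hidden states (if the probe still recovers the task, the knowledge survived and only alignment was disturbed):

```python
def classify_forgetting(task_acc_before, task_acc_after,
                        probe_acc_before, probe_acc_after,
                        tol=0.05):
    """Heuristic classifier (illustrative, not the paper's method).

    - End-task accuracy dropped but a probe on frozen representations
      did not -> knowledge is intact, drop is 'spurious' (alignment).
    - Both dropped -> representations degraded, 'true' forgetting.
    - No meaningful task drop -> 'none'.
    """
    task_drop = task_acc_before - task_acc_after
    probe_drop = probe_acc_before - probe_acc_after
    if task_drop <= tol:
        return "none"
    return "spurious" if probe_drop <= tol else "true"


# Task accuracy collapses, but representations still support the task:
print(classify_forgetting(0.90, 0.55, 0.88, 0.86))  # spurious
# Both the task head and the probe collapse:
print(classify_forgetting(0.90, 0.55, 0.88, 0.60))  # true
```

The tolerance `tol` and the probe-based proxy for "representations remain intact" are assumptions for the sketch; in practice the probe would be, e.g., a linear classifier fit on the frozen model's hidden states for the old task.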
However, we identify a fundamental limitation in current task alignment approaches: task alignment is largely only a few tokens deep, what we term shallow alignment. This shallow alignment creates critical vulnerabilities that explain why spurious forgetting occurs, why it is reversible, and why fine-tuning attacks are effective. (see Appendix A.5.2)

The Shallow Alignment Problem: Alignment adapts the model's generative distribution primarily over only the very first few output tokens (approximately 3-5 tokens). This creates a critical vulnerability: if the initial tokens deviate from the expected alignment (due to new-task training, adversarial manipulation, or distribution shift), generation catastrophically falls onto a harmful trajectory of forgetting, even though the underlying representations remain intact.

This shallow alignment problem provides a unified explanation for multiple forgetting phenomena: (1) Spurious forgetting: alignment disruption in the initial tokens lea…
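The abstract mentions a 0-1 alignment-depth metric across token positions, but its definition does not appear in this excerpt. As a minimal sketch of what such a score could look like, assume we have a per-position agreement value for each output token (e.g., the probability mass the model assigns to the aligned reference token at that position); depth is then the fraction of leading positions where agreement holds before the first failure:

```python
def alignment_depth(per_token_agreement, tau=0.5):
    """Illustrative 0-1 alignment-depth score (assumed, not the paper's).

    Counts consecutive output positions from the front whose agreement
    with the aligned reference stays >= tau, normalized by sequence
    length. 1.0 means alignment holds over the whole sequence (deep);
    a small value means only the first few tokens are aligned (shallow).
    """
    if not per_token_agreement:
        return 0.0
    depth = 0
    for agreement in per_token_agreement:
        if agreement < tau:
            break  # first position where alignment fails
        depth += 1
    return depth / len(per_token_agreement)


# Shallow alignment: only the first 3 of 8 positions stay aligned.
print(alignment_depth([0.9, 0.9, 0.8, 0.2, 0.1, 0.1, 0.1, 0.1]))  # 0.375
# Deep alignment: every position stays aligned.
print(alignment_depth([0.9] * 8))  # 1.0
```

Under this reading, the 3-5-token shallow-alignment regime described above corresponds to low depth scores on sequences of realistic length, which is exactly the condition a real-time detector would flag during training.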
