Parameter-Efficient Fine-Tuning for Continual Learning: A Neural Tangent Kernel Perspective
Parameter-efficient fine-tuning for continual learning (PEFT-CL) has shown promise in adapting pre-trained models to sequential tasks while mitigating the catastrophic forgetting problem. However, the mechanisms that dictate continual performance in this paradigm remain elusive. To unravel this mystery, we undertake a rigorous analysis of PEFT-CL dynamics using Neural Tangent Kernel (NTK) theory and derive metrics relevant to continual scenarios. With NTK as a mathematical analysis tool, we recast the challenge of test-time forgetting as quantifiable generalization gaps during training, identifying three key factors that influence these gaps and PEFT-CL performance: training sample size, task-level feature orthogonality, and regularization. To address these challenges, we introduce NTK-CL, a novel framework that eliminates task-specific parameter storage while adaptively generating task-relevant features. In line with this theoretical guidance, NTK-CL triples the feature representation of each sample, theoretically and empirically reducing both task-interplay and task-specific generalization gaps. Grounded in NTK analysis, our framework imposes an adaptive exponential moving average mechanism and constraints on task-level feature orthogonality, maintaining intra-task NTK forms while attenuating inter-task NTK forms. Finally, by fine-tuning the optimizable parameters with appropriate regularization, NTK-CL achieves state-of-the-art performance on established PEFT-CL benchmarks. This work provides a theoretical foundation for understanding and improving PEFT-CL models, offering insights into the interplay between feature representation, task orthogonality, and generalization, and contributing to the development of more efficient continual learning systems.
💡 Research Summary
The paper tackles the problem of catastrophic forgetting in continual learning when only a small set of additional parameters is fine‑tuned on top of a large pre‑trained model (the PEFT‑CL setting). While recent works have shown empirical success, they lack a solid theoretical foundation. The authors adopt Neural Tangent Kernel (NTK) theory—an asymptotic description of infinitely wide neural networks—to analyze the training dynamics of PEFT‑CL and to replace the usual test‑time accuracy‑gap metric with a mathematically tractable generalization gap measured during training.
First, the authors formalize the empirical NTK for a task‑specific subnetwork and derive the continuous‑time dynamics of the network output under gradient descent. They show that, for each task τ, the output evolves according to a linear differential equation driven by the NTK matrix Φτ and the gradient of the loss. Extending this to a sequence of T tasks yields a closed‑form solution where each task’s contribution is filtered by (Φi + λI)⁻¹, highlighting the role of L2 regularization λ in reaching a saddle‑point solution.
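In symbols, the single‑task dynamics and regularized solution described above can be sketched as follows (notation reconstructed from this summary; the paper's exact statement may differ):

```latex
% Continuous-time gradient-descent dynamics for task \tau,
% driven by the empirical NTK matrix \Phi_\tau:
\frac{\mathrm{d} f_\tau(t)}{\mathrm{d} t}
  = -\,\Phi_\tau \,\nabla_{f}\,\mathcal{L}\bigl(f_\tau(t),\, y_\tau\bigr)

% For a squared loss with L2 regularization of strength \lambda,
% the fitted component of the solution is filtered by (\Phi_\tau + \lambda I)^{-1}:
f_\tau(\infty) = \Phi_\tau \bigl(\Phi_\tau + \lambda I\bigr)^{-1} y_\tau
```

The T‑task closed form mentioned in the summary stacks one such filtered term per task, which is why the conditioning of each (Φi + λI)⁻¹ matters for forgetting.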
From this analysis, four theorems are proved:
- Sample‑size effect – The minimum eigenvalue of the NTK grows with the number of training samples, which shrinks the generalization gap at a rate O(1/√N).
- Task‑interplay bound – The interaction term between tasks i and j is proportional to the inner product of their NTK matrices; smaller overlap yields less forgetting.
- Feature orthogonality – Enforcing orthogonality between task‑specific feature spaces reduces inter‑task NTK overlap and thus mitigates forgetting.
- Regularization impact – Proper choice of λ stabilizes the inverse‑NTK term and aligns training with the theoretical saddle‑point.
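The orthogonality and regularization theorems above can be illustrated with a toy linearized kernel. This is a minimal numerical sketch, not the paper's construction: `F_i`/`F_j` are made-up task feature matrices, and a plain Gram matrix stands in for the NTK. It shows that orthogonal task subspaces zero out the cross-task kernel (Theorems 2–3) and that larger λ improves the conditioning of K + λI (Theorem 4).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 32  # samples per task, feature dimension

# Task features confined to disjoint coordinate subspaces (orthogonal case):
# F_i uses the first d/2 coordinates, F_j the last d/2.
F_i = np.pad(rng.standard_normal((n, d // 2)), ((0, 0), (0, d // 2)))
F_j = np.pad(rng.standard_normal((n, d // 2)), ((0, 0), (d // 2, 0)))

# Cross-task Gram matrix (a linear stand-in for the inter-task NTK term).
# It vanishes exactly when the task subspaces are orthogonal.
overlap_orth = np.linalg.norm(F_i @ F_j.T, "fro")
# For contrast: overlap of a task with itself is large.
overlap_shared = np.linalg.norm(F_i @ F_i.T, "fro")

# Theorem 4 intuition: (K + lambda*I) becomes better conditioned as lambda grows.
K = F_i @ F_i.T
conds = [np.linalg.cond(K + lam * np.eye(n)) for lam in (1e-3, 1e-1, 1.0)]

print(overlap_orth, overlap_shared, conds)
```

Smaller inter-task overlap means each task's update perturbs the others' kernels less, which is the mechanism the forgetting bounds formalize.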
Guided by these insights, the authors propose NTK‑CL, a new PEFT‑CL framework that eliminates any task‑specific parameter storage (no extra subnetworks, no prompt pools). Its key components are:
- Triple‑expansion of sample representations – each input is mapped to three distinct feature subspaces, effectively tripling the sample size in the NTK sense, which, according to Theorem 1, reduces the generalization gap.
- Adaptive Exponential Moving Average (EMA) – EMA smooths parameter updates, preserving the intra‑task NTK structure (Theorem 1) and acting as a memory buffer.
- Task‑feature orthogonality constraints – a regularization term penalizes the Frobenius norm of Φi Φjᵀ for i ≠ j, directly implementing the orthogonality principle of Theorem 3.
- Dynamic λ scheduling – λ is updated together with EMA to keep the (Φ + λI)⁻¹ term well‑conditioned, satisfying Theorem 4.
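Two of the components above (EMA smoothing and the orthogonality penalty) reduce to short update rules. The sketch below is illustrative only: the function names and the use of raw feature matrices in place of the paper's Φi Φjᵀ term are assumptions, not the authors' implementation.

```python
import numpy as np

def ema_update(shadow, params, decay=0.99):
    """EMA-style smoothing of parameters: shadow <- decay*shadow + (1-decay)*params.
    (The paper's EMA is adaptive; a fixed decay is used here for simplicity.)"""
    return decay * shadow + (1.0 - decay) * params

def orthogonality_penalty(feats_i, feats_j):
    """Cross-task overlap penalty ||F_i F_j^T||_F^2, a linear stand-in
    for the Frobenius-norm penalty on Phi_i Phi_j^T described above."""
    return float(np.sum((feats_i @ feats_j.T) ** 2))

# Mock training loop: the EMA shadow trails the noisy parameter trajectory.
rng = np.random.default_rng(1)
params = rng.standard_normal(8)
shadow = params.copy()
for _ in range(5):
    params = params - 0.1 * rng.standard_normal(8)  # stand-in gradient step
    shadow = ema_update(shadow, params)
```

In a full implementation the penalty would be added to the task loss with a weight, and λ would be rescheduled alongside the EMA decay as the summary describes.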
The architecture is illustrated in Figure 1c and contrasted with “additional subnetworks” (Figure 1a) and “prompt‑based” (Figure 1b) approaches. NTK‑CL uses a shared backbone (e.g., a Vision Transformer) and only a tiny set of optimizable parameters that generate task‑relevant features on the fly.
Extensive experiments are conducted on standard continual‑learning benchmarks: Split‑CIFAR‑100, DomainNet (multiple visual domains), and the CLBench suite (including multimodal tasks). NTK‑CL is compared against state‑of‑the‑art PEFT‑CL methods such as L2P, Dual‑Prompt, S‑Prompt, CODA‑Prompt, HiDe‑Prompt, and EASE. Results show:
- Higher average accuracy – NTK‑CL improves by 3–5 percentage points over the best baselines.
- Reduced forgetting – the average accuracy drop on previously learned tasks is cut by more than 30%.
- Parameter efficiency – the increase in trainable parameters stays below 0.5 % of the backbone, comparable to or better than competing methods.
- Ablation studies confirm that each component (tripling, EMA, orthogonality, λ‑schedule) contributes significantly; removing any of them degrades performance in line with the theoretical predictions.
The authors discuss the broader implications: NTK provides a principled way to quantify and control forgetting, turning an empirical problem into a set of measurable quantities (sample size, NTK overlap, regularization strength). NTK‑CL demonstrates that these quantities can be directly optimized in practice without sacrificing the lightweight nature of PEFT. Moreover, because NTK theory applies to a wide range of architectures (ResNets, Transformers), the proposed framework can be readily adapted to other domains such as NLP or speech.
In conclusion, the paper makes four major contributions: (1) a novel NTK‑based theoretical analysis of PEFT‑CL, (2) identification of three key factors—training sample size, task‑level feature orthogonality, and regularization—that govern forgetting, (3) the design of NTK‑CL, a parameter‑efficient continual‑learning system that operationalizes these insights, and (4) comprehensive empirical validation showing state‑of‑the‑art performance across diverse benchmarks. This work bridges the gap between rigorous kernel‑theoretic understanding and practical continual‑learning systems, opening avenues for further research on kernel‑guided adaptation in ever‑changing environments.