Although parameter-efficient fine-tuning methods such as LoRA only modify a small subset of parameters, they can still have a significant impact on the model. Our instruction-tuning experiments show that LoRA-based supervised fine-tuning can catastrophically degrade model capabilities, even when training on very small datasets for relatively few steps. That said, we demonstrate that while the most straightforward approach (likely the most used in practice) fails spectacularly, small tweaks to the training procedure with very little overhead can virtually eliminate the problem. In particular, we consider a regularized approximate replay approach that penalizes KL divergence with respect to the initial model and interleaves next token prediction data from an open-access corpus that is different from, yet similar to, the one used in pre-training. When applied to Qwen instruction-tuned models, we find that this recipe preserves general knowledge in the model without hindering plasticity on new tasks, at the cost of only a modest amount of computational overhead.
Continual learning with neural networks has remained a challenging problem for over three decades [95]. Indeed, it is well known that continual learners face a dilemma between prioritizing stability and plasticity in the weights of the model [13]. It is also well documented that when neural networks perform optimization for extended periods on only a single task, they tend to experience the phenomenon of catastrophic forgetting [64], in which plasticity of the network to the current task leads to a substantial reduction in the quality of the model on prior tasks encountered during training. While many different settings have been identified in the literature under the umbrella of continual learning [71], simple instruction-tuning with LLMs is not often considered one of them. As a result, we believe that many practitioners have been wading into the waters of continual learning without even realizing it. The truth is that simple supervised fine-tuning with LLMs, even on small datasets with relatively few steps of optimization, is a setting under the umbrella of continual learning for which we have a strong expectation that the model will experience catastrophic forgetting of past capabilities. We believe that LLMs are rarely evaluated by practitioners for their general capabilities after instruction-tuning, and that this could be a contributing reason for the high failure rate of recent projects seeking business returns from generative AI [15].
Recent studies have demonstrated that catastrophic forgetting is indeed a significant problem during fine-tuning of LLMs [58; 51; 127]. However, these studies focused on full fine-tuning rather than parameter-efficient fine-tuning methods such as LoRA [35], which have become quite popular because of their increased computational and memory efficiency. Intuitively, because LoRA adapters learn very few parameters relative to the base model, it is often believed that forgetting is less of an issue with these models. Our experiments directly contradict this narrative and demonstrate that forgetting is still a very substantial issue during training with LoRA, even when the trainable parameters amount to less than 1% of the size of the base model. Recent papers have considered modifications to LoRA's decomposition itself in order to prevent forgetting [29; 125; 57]. In contrast, we consider simple methods for addressing forgetting that are agnostic to the particular fine-tuning method leveraged, which we show to also be effective when training with LoRA.
The recent work of Shenfeld et al. [102] demonstrated that RL training of LLMs results in considerably less forgetting than is experienced during supervised fine-tuning. The authors provide two reasons for this: 1) RL applied to LLMs is generally KL-regularized to prevent drift from the initial model, and 2) RL training leverages on-policy data to update the model. While we agree with Shenfeld et al. [102]'s insight that these two things are very synergistic when applied together, we also feel it is worth considering how much KL regularization can improve supervised fine-tuning on its own. While it is not often applied in supervised fine-tuning settings, we find that there are hyperparameter values for KL regularization that result in an entirely superior learning process to fine-tuning without it: plasticity to the new task is maintained while forgetting of general knowledge is greatly reduced.
Another long-standing approach to eliminating forgetting in neural networks is experience replay [70; 55; 96], which has been shown to be effective and computationally efficient in the context of LLMs as well [1; 52]. However, pure experience replay is not practical in the current age of open source LLMs because, in nearly all cases, open-weight LLMs do not also publish copies of the data used for training. Part of the underlying reason is the use of proprietary data and the internal value of data produced by paid annotators, which corporations are reluctant to share; various licensing and privacy issues may also arise from publishing the data used for pre-training. As such, in this work we consider a practical alternative that we call approximate replay, where replay samples are drawn to mirror the next token prediction data seen during pre-training while leveraging an open web data corpus that is different from the actual corpus used for training. We find that despite the disconnect in data sources, approximate replay based on this open web data is still very effective at minimizing forgetting during fine-tuning without sacrificing plasticity to the new task. Indeed, the combined approach of approximate regularized replay, which utilizes both approximate replay and KL regularization, provides an exceedingly simple yet computationally efficient solution for mitigating forgetting without diminishing the effectiveness of fine-tuning.
In general, the computational cost (in FLOPs) associated with fine-tuning a model θ over N updates with a batch size of B and context window size of W can be expressed as $c_{\mathrm{FT}} = 2NF_{\theta,B,W}$, where $F_{\theta,B,W}$ is the cost of forward propagation. This is because backward propagation over the full set of parameters θ has the same cost as the forward pass, and both are needed to compute gradients.
Often, full fine-tuning is not necessary. Indeed, adapter methods, in which some new parameters ϕ are learned to alter θ (with |ϕ| ≪ |θ|), have become quite popular for fine-tuning LLMs. For example, LoRA [35] learns a low-rank approximation of each weight matrix. This can be very efficient in terms of parameters; in our experiments, |ϕ| is always less than 0.5% the size of |θ|. Because |ϕ| ≪ |θ|, the cost of backpropagation becomes insignificant relative to the cost of forward propagation, and $c_{\mathrm{LoRA}} \to NF_{\theta,B,W}$ in the limit of very few adapter parameters. In general, $NF_{\theta,B,W} \le c_{\mathrm{LoRA}} \le 2NF_{\theta,B,W}$.
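To make this accounting concrete, the following back-of-the-envelope sketch in Python uses the common approximation that a forward pass costs roughly 2 FLOPs per parameter per token; the model size and training settings in the sketch are purely illustrative assumptions.

```python
# Rough FLOP accounting for fine-tuning cost, following c_FT = 2 * N * F.
# The constants (model size, steps, batch size, context window) are
# illustrative assumptions only.

def forward_flops(n_params: float, batch_size: int, context_len: int) -> float:
    """Approximate forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2.0 * n_params * batch_size * context_len

n_params = 3e9          # e.g. a ~3B-parameter base model
N, B, W = 1000, 8, 512  # updates, batch size, context window

F = forward_flops(n_params, B, W)
c_full = 2 * N * F        # forward + backward over all parameters
c_lora_lower = 1 * N * F  # LoRA limit: backward cost negligible
c_lora_upper = 2 * N * F  # LoRA upper bound: backward as costly as forward

print(f"full fine-tuning  : {c_full:.3e} FLOPs")
print(f"LoRA (lower bound): {c_lora_lower:.3e} FLOPs")
print(f"LoRA (upper bound): {c_lora_upper:.3e} FLOPs")
```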
Ability to Overwrite. While for LoRA |ϕ| ≪ |θ|, it is important to note that LoRA models still have the ability to impact every parameter of the model θ. Indeed, a LoRA adapter can be merged with the base model across all parameters by multiplying the low-rank matrices and adding the result to the base weights [35]. As such, LoRA models are just as prone to overwriting knowledge in the model as full fine-tuning, despite their computational and parameter efficiency.
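To make this point concrete, the minimal PyTorch sketch below (with illustrative shapes, rank, and the usual α/r scaling convention; it is not tied to any particular library's internals) shows that merging a rank-r adapter produces a dense update that can touch every entry of the original weight matrix:

```python
import torch

# Minimal LoRA merge sketch. W is a frozen base weight matrix; A and B are
# the low-rank adapter factors. Even though A and B contain few parameters,
# their product (alpha / r) * B @ A is a dense d_out x d_in matrix, so the
# merged weights can differ from W in every entry.

d_out, d_in, r, alpha = 1024, 1024, 32, 64
W = torch.randn(d_out, d_in)          # frozen base weights
A = torch.randn(r, d_in) * 0.01       # trainable low-rank factor
B = torch.zeros(d_out, r)             # typically initialized to zero
B.normal_(std=0.01)                   # pretend training has updated B

delta = (alpha / r) * (B @ A)         # dense update of shape (d_out, d_in)
W_merged = W + delta                  # merged weights used at inference

print("fraction of entries changed:", (delta != 0).float().mean().item())
```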
Fine-tuning models on only a single task and expecting that the model performs well across a variety of tasks is unlikely to work due to the resulting biased optimization. This biased optimization is well characterized within the formalism of reinforcement learning through the conceptual framework of mixing times [89; 91] and can be directly applied in supervised learning contexts as well [41]. We will now summarize some of these high-level insights to contextualize how the stability-plasticity dilemma [13] arises.
We can consider the current fine-tuning task as constituting a data distribution $d_{\mathrm{current}}(x, y^*)$ over pairs of input contexts x and associated optimal outputs $y^*$. Standard fine-tuning with LoRA then optimizes the objective $J_{\mathrm{current}}(\theta, \phi) = \mathbb{E}_{x, y^* \sim d_{\mathrm{current}}}[\mathcal{L}^{\mathrm{SFT}}_{\theta,\phi}(x, y^*)]$. However, the objective we really care about evaluating our model on is defined over the steady-state distribution obtained by i.i.d. sampling over all future experiences, $d_{\mathrm{future}}(x, y^*)$, with associated objective $J_{\mathrm{eval}}(\theta, \phi) = \mathbb{E}_{x, y^* \sim d_{\mathrm{future}}}[\mathcal{L}^{\mathrm{SFT}}_{\theta,\phi}(x, y^*)]$. The problem is that this distribution is generally unknowable. We can say that it may bear some resemblance to $d_{\mathrm{current}}(x, y^*)$, and it may also bear resemblance to an i.i.d. sampling over all past experiences, $d_{\mathrm{past}}(x, y^*)$. Moreover, there may be additional, entirely novel experiences. So the stability-plasticity dilemma arises from uncertainty about the correct balance of current and past experiences needed to prepare the model as much as possible for the future. Concretely, plasticity measures progress on $J_{\mathrm{current}}(\theta, \phi)$, and stability targets preservation or improvement of $J_{\mathrm{past}}(\theta, \phi) = \mathbb{E}_{x, y^* \sim d_{\mathrm{past}}}[\mathcal{L}^{\mathrm{SFT}}_{\theta,\phi}(x, y^*)]$.
Catastrophic forgetting occurs during fine-tuning because of the distributional mismatch between $d_{\mathrm{current}}(x, y^*)$ and $d_{\mathrm{past}}(x, y^*)$: the more consecutive optimization steps we take on $J_{\mathrm{current}}(\theta, \phi)$, the less likely it is that this optimization also aligns with $J_{\mathrm{past}}(\theta, \phi)$. Of course, forgetting doesn't actually matter when it isn't relevant to $d_{\mathrm{future}}(x, y^*)$, but when we optimize for multiple steps in a row on $d_{\mathrm{current}}(x, y^*)$ we are implicitly conveying to the model that $d_{\mathrm{current}}(x, y^*)$ is the future distribution, even when this is only partially true. As such, we must consider ways to bias the optimization process towards learning the new task (i.e., promote plasticity) without disrupting the general capabilities of the model (i.e., while maintaining stability).
We consider two approaches for biasing optimization in favor of stability in this work: the KL divergence with respect to the base model and an approximation of replay from the pre-training phase using open data. We find that these two approaches are both very computationally efficient while providing significant leverage over balancing the stability-plasticity tradeoff. Moreover, because practitioners have control of the number of LoRA parameters and the replay rate, they can customize the compute overhead added to standard LoRA fine-tuning to meet use case requirements with more compute often leading to better results.
KL regularization is a theoretically appealing and simple approach for allowing the initial model parameters to serve as a Bayesian prior on learning, while also making the model more robust to spurious features. Given a context x, we can express the base LLM's probability of producing output y as $\pi_{\theta}(y|x)$, and the output probabilities of the learned LoRA adapter on top of the base LLM as $\pi_{\theta+\phi}(y|x)$. If we want to promote stability in the model during fine-tuning, rather than optimizing ϕ directly for $J_{\mathrm{current}}(\theta, \phi)$, we can optimize the KL-regularized objective:

$$J_{\mathrm{current}}(\theta, \phi) + \beta\, \mathbb{E}_{x \sim d_{\mathrm{current}}}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta+\phi}(\cdot|x) \,\|\, \pi_{\theta}(\cdot|x) \right) \right].$$
Here β is the KL regularization coefficient, which controls the degree of penalty incurred when drifting from the output probabilities of the base model. Tuning β thus provides direct leverage over the stability-plasticity tradeoff: high values of β prevent forgetting while also preventing plasticity, while low values of β allow for greater degrees of both forgetting of old tasks and plasticity with respect to the new task.
Computational Overhead. When applied to standard fine-tuning, KL regularization increases the computational cost by a factor of roughly 1.5×, as it requires an additional forward propagation through the initial model. However, there is significant synergy between computing the KL divergence and using LoRA: because the model being trained is only a LoRA adapter of the original model, it is possible to perform both forward propagations with little computational overhead when |ϕ| ≪ |θ|.
Memory Overhead. Once again, there is synergy between the memory overhead of computing the KL divergence with respect to the base model and LoRA. During standard fine-tuning with KL regularization, two copies of the base model of size |θ| would need to be stored in memory. With LoRA, however, it is only required to store |θ| + |ϕ| parameters, which adds no memory overhead over standard LoRA fine-tuning.
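A minimal sketch of how this synergy can be exploited in practice is shown below, assuming a Hugging Face PEFT-wrapped causal LM; the function name and batching details are illustrative rather than our exact implementation. The base-model probabilities are obtained by temporarily disabling the adapter, so no second copy of the weights is needed.

```python
import torch
import torch.nn.functional as F

def kl_regularized_loss(model, batch, beta=0.01):
    """SFT cross-entropy plus beta * KL(adapter || base) over output tokens.

    `model` is assumed to be a peft.PeftModel wrapping a causal LM, so that
    `model.disable_adapter()` yields base-model outputs without storing a
    second copy of the weights. `batch` is assumed to contain input_ids,
    attention_mask, and labels. Illustrative sketch only.
    """
    outputs = model(**batch)                      # adapter-enabled forward pass
    sft_loss = outputs.loss                       # standard next-token loss

    with torch.no_grad(), model.disable_adapter():
        base_logits = model(**batch).logits       # frozen base-model forward pass

    log_p = F.log_softmax(outputs.logits, dim=-1)  # adapter distribution
    log_q = F.log_softmax(base_logits, dim=-1)     # base distribution
    # KL(adapter || base), averaged over non-padding token positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    kl = (log_p.exp() * (log_p - log_q) * mask).sum(-1)
    kl = kl.sum() / batch["attention_mask"].sum()

    return sft_loss + beta * kl
```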
Another theoretically appealing and simple-to-implement stability prior is experience replay [70; 55; 96], in which past experiences are interleaved with current experiences during learning. If the environment follows a potentially unknown Markov chain, experience replay provides a nice theoretical solution: the experience replay buffer eventually converges to the steady-state distribution of the encountered contexts, which directly enables the model to combat optimization bias. Practically speaking, however, open source models are typically not released with the actual training data used. As a result, in this work we approximate replay by using an open source web corpus (https://huggingface.co/datasets/Skylion007/openwebtext). The idea is that open source LLMs are trained on a large segment of web data using the next token prediction objective, and that we can use that objective on a similar corpus to approximate the effect of a true experience replay implementation. We only leverage a very small segment of this corpus, so random data should be representative and should not present a tremendous mismatch with what was seen during training.
Computational Overhead. In this setting, we can consider replay as equivalent to augmenting the fine-tuning dataset with more examples drawn from an open web corpus. Concretely, we define a replay rate ρ, which describes the number of next token prediction replay examples of the given maximum context window W added for each example in the dataset. For example, ρ = 0 corresponds to standard fine-tuning without replay, and ρ = 1 corresponds to adding one replay example for each example in the fine-tuning dataset. As such, training with replay takes (ρ + 1)× the amount of compute of standard fine-tuning, i.e. $c_{\mathrm{replay}} = 2(\rho + 1)NF_{\theta,B,W}$.
Memory Overhead. Approximate replay requires (ρ + 1)× more disk space to store the data, but does not necessarily require additional RAM on the CPU or GPU.
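As an illustration of the data-side view of approximate replay (the function and field names below are hypothetical, and tokenization and loss masking are omitted), one can simply interleave ρ open-web documents, trained with the plain next token prediction loss, for every supervised fine-tuning example:

```python
import random

def mix_with_approximate_replay(finetune_examples, replay_pool, replay_rate=1,
                                seed=0):
    """Interleave `replay_rate` replay documents per fine-tuning example.

    `finetune_examples` and `replay_pool` are assumed to be lists of text (or
    pre-tokenized) examples; this sketch only handles the mixing logic.
    """
    rng = random.Random(seed)
    mixed = []
    for example in finetune_examples:
        # Supervised fine-tuning example: loss on the response tokens only.
        mixed.append({"type": "sft", "data": example})
        # rho approximate-replay examples: plain next-token prediction on
        # open web text, standing in for unavailable pre-training data.
        for _ in range(replay_rate):
            mixed.append({"type": "replay", "data": rng.choice(replay_pool)})
    rng.shuffle(mixed)
    return mixed
```

With ρ = 1 this doubles the number of training examples, which matches the (ρ + 1)× compute accounting above.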
Now that we have outlined the approach, we can expand on the theoretical perspective of what it achieves:
KL Regularization as a Bayesian Prior. From a Bayesian perspective, learning can be treated as Maximum A Posteriori (MAP) estimation, from which we obtain the most likely model given both prior beliefs and the data from a new task. To see this, we treat the base model $\pi_{\theta}$ as the prior and take the likelihood to be $p(\mathcal{D}|\phi) = \prod_{(x,y) \in \mathcal{D}} \pi_{\theta+\phi}(y|x)$. The posterior $p(\phi|\mathcal{D})$ is the updated belief (the fine-tuned model) after seeing the data. By Bayes' rule, $p(\phi|\mathcal{D}) \propto p(\mathcal{D}|\phi)\, p(\phi)$. Taking the logarithm, we have $\log p(\phi|\mathcal{D}) = \log p(\mathcal{D}|\phi) + \log p(\phi) + \mathrm{const}$. In the context of KL regularization, $\log p(\mathcal{D}|\phi)$ is the negated standard cross-entropy loss, $\sum_{(x,y) \in \mathcal{D}} \log \pi_{\theta+\phi}(y|x)$, and the negative KL term $-\beta D_{\mathrm{KL}}(\pi_{\theta+\phi} \| \pi_{\theta})$ plays the role of the log-prior. For $p(\phi)$ to be a valid Bayesian prior it must be a proper probability distribution; exponentiating the log-prior, we obtain $p(\phi) = \exp(-\beta D_{\mathrm{KL}}(\pi_{\theta+\phi} \| \pi_{\theta}))/Z$, which is a Boltzmann (or Gibbs) distribution over the space of models with $\beta^{-1}$ the temperature and Z the partition function. This prior explicitly encodes the belief that the most likely model is the base model, with the probability of any other model decaying exponentially as its output distribution diverges from that of the base model. To further understand how a prior on the outputs relates to the weights θ + ϕ, we can use a second-order Taylor expansion to approximate the KL divergence. For a small change in weights $\Delta\theta = \phi = \theta' - \theta$, the KL divergence is approximately

$$D_{\mathrm{KL}}(\pi_{\theta+\phi} \| \pi_{\theta}) \approx \tfrac{1}{2}\, \phi^{\top} F(\theta)\, \phi,$$

where $F(\theta)$ is the Fisher information matrix. Substituting this approximation back into the log-prior gives $\log p(\phi) \approx -\tfrac{\beta}{2}\, \phi^{\top} F(\theta)\, \phi$, which is exactly the log-density of a multivariate Gaussian distribution: $p(\phi) = \mathcal{N}(0, \tfrac{1}{\beta} F(\theta)^{-1})$.
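Putting the pieces together, and absorbing constants into β, the MAP estimate under this Boltzmann prior recovers exactly the KL-regularized fine-tuning objective introduced above (written here as a maximization of log-likelihood plus log-prior):

```latex
\hat{\phi}_{\mathrm{MAP}}
  \;=\; \arg\max_{\phi}\; \log p(\phi \mid \mathcal{D})
  \;=\; \arg\max_{\phi}\;
        \underbrace{\sum_{(x,y)\in\mathcal{D}} \log \pi_{\theta+\phi}(y \mid x)}_{\text{log-likelihood (negative SFT loss)}}
        \;-\;
        \underbrace{\beta\, D_{\mathrm{KL}}\!\left(\pi_{\theta+\phi} \,\middle\|\, \pi_{\theta}\right)}_{\text{log-prior (KL penalty)}}
```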
KL Regularization and Robustness to Spurious Features. Supervised fine-tuning with KL regularization and LoRA adapters can also be theoretically interpreted through the lens of the information bottleneck (IB) objective [21; 132]: $I(Y; \pi_{\theta+\phi}) - \beta I(X; \pi_{\theta+\phi})$, where the log-likelihood term maximizes $I(Y; \pi_{\theta+\phi})$, the mutual information between the output y and $\pi_{\theta+\phi}$ (the model-fitting objective), and the KL term upper-bounds $I(X; \pi_{\theta+\phi})$, the mutual information between the context x and $\pi_{\theta+\phi}$ (the model-compression objective). The IB framework [113] aims to minimize $I(X; \pi_{\theta+\phi})$, the information retained about the input x, in order to promote generalization. However, directly computing $I(X; \pi_{\theta+\phi})$ is computationally intractable for neural networks. To address this issue, one typically applies a variational upper bound, $I(X; \pi_{\theta+\phi}) \le \mathbb{E}[D_{\mathrm{KL}}(\pi_{\theta+\phi} \| \pi_{\theta})]$, as a surrogate [4]. It can be shown that, by minimizing the KL divergence to a reference model, one can effectively minimize the mutual information between the input prompt and the model's internal state [50]. This motivates learning to ignore spurious features in the input prompt (such as specific phrasing or noise) and to keep only the essential features needed to generate the correct answer.
Replay and Steady-State Optimization. In supervised learning, it is typically assumed that an agent's behavior does not impact future experiences. In this case, it is possible to model the agent's experiences as some unknown Markov chain $P(x_{t+1}, y^*_{t+1} | x_t, y^*_t)$, where $x_t$ is the current context and $y^*_t$ is the associated optimal output. While our supervised fine-tuning dataset only represents $B \cdot N$ steps from this chain, what the agent really cares to optimize over is all steps that it will encounter in its lifetime. In fact, it is the disconnect between these two distributions that is the underlying cause of catastrophic forgetting. If the lifetime is sufficiently long (i.e., greater than the mixing time of the chain), then the chain converges to a steady-state distribution $d_{\mathrm{future}}(x, y^*)$ over which we want to minimize $\mathcal{L}^{\mathrm{SFT}}_{\theta,\phi}(x, y^*)$. Replay provides at least an asymptotic solution to this problem without attempting to model the Markov chain directly, because as a replay buffer fills, the sampling distribution from the buffer asymptotically converges to $d_{\mathrm{future}}(x, y^*)$.
Training Datasets. For the training tasks, we consider a set of 5 tasks inspired by the prior work on catastrophic forgetting during supervised fine-tuning of Luo et al. [58], in which the authors selected a subset of the instruction-following tasks considered by Scialom et al. [100]:
Text Simplification (Simp): This task requires the LLM to paraphrase the provided text with a simpler, shorter piece of text [38; 5]. Concretely, the model is instructed to “Reformulate this text with simpler words:”, where the normal article text and the simplified article text (as a target for supervision) are provided by part 1 (for training) and part 2 (for testing) of the dataset https://huggingface.co/datasets/chaojiang06/wiki_auto.
Empathetic Dialogue Generation (EmDg): This task requires the LLM to generate a response to a conversational context under a given emotional situation and was previously considered by Rashkin et al. [79]. Concretely, the model is given an instruction of the form “The associated emotion is {emotion} and the input prompt is {prompt}. Now what would be your response, given the following dialogue context:==={text}”. The training and testing data are pulled from the splits provided at the repository https://huggingface.co/datasets/facebook/empathetic_dialogues.
Inquisitive Question Generation (InqQG): This task requires the LLM to generate a simple question that could be associated with a long-form answer and was previously considered by Fan et al. [27]. Concretely, the model is given an instruction of the form “{text}===Given the above text, write the possible curious question it answers:”. The training and testing data are drawn from the splits provided by the repository https://huggingface.co/datasets/Pavithree/eli5.
Explanation Generation (Exp): This task requires the LLM to generate an explanation about why two sentences are unrelated and was previously considered by Camburu et al. [12]. Concretely, the model is instructed to “Explain why the two following sentences are unrelated: Sentence 1: {first-sentence}; Sentence 2: {second-sentence}”. The data is sampled from both training splits and the testing split of the repository https://raw.githubusercontent.com/OanaMariaCamburu/e-SNLI/master/dataset/.
Headline Generation (HGen): This task requires the LLM to generate headlines for articles and was previously considered by Scialom et al. [100]. However, the data used by Scialom et al. [100] and Luo et al. [58] requires an LDC license, so we opt for the dataset of Leeb & Schölkopf [49] to allow for greater general-purpose reproducibility. Concretely, the model is instructed to “Make a title for this article: {article}”. The training and testing data consist of random subsets of the English titles and articles from the repository https://huggingface.co/datasets/felixludos/babel-briefings, where the data is filtered such that the article is at least 3× longer than the title and the title is at least 3 words long.
Training Procedure. Our training procedure was implemented by extending the Transformers Trainer class and deployed across a cluster of H100 GPUs. We found that the AdamW optimizer with a constant learning rate achieved the same performance as a cosine schedule with warm-up, and we chose a constant learning rate without warm-up for simplicity and to stay consistent with Luo et al. [58]. We followed Luo et al. [58] and set the context length for these tasks to 512. Based on our preliminary runs with the 3B model, we set the LoRA rank to r = 32 and α = 64 such that α/r = 2, the learning rate to 1e-4, the LoRA dropout rate to 0.05, and the batch size to 8 for all experiments. For each task, we sample 1,000 random examples from the training set and 1,000 random examples from the testing set. Our experiments ran on 1 to 4 H100 GPUs at a time, depending on the model size, in order to satisfy GPU memory requirements. All reported results are averages over 7 random seeds for each task and hyperparameter combination.
Model Sizes. We consider a variety of model sizes within the Qwen 2.5 Instruct [77] family of models. Specifically, we ran all experiments across the 1.5B, 3B, 7B, and 14B instruction-tuned models. We build LoRA adapters for the key, value, and output matrices. This corresponds to trainable parameters that are 0.46% the size of the base model for the 1.5B model, 0.39% the size of the base model for the 3B model, 0.22% the size of the base model for the 7B model, and 0.28% the size of the base model for the 14B model.
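For concreteness, a configuration along the following lines would set up this kind of adapter with the Hugging Face peft library. It is a sketch under the assumption that the key, value, and output projections in Qwen 2.5 are named k_proj, v_proj, and o_proj; the module names and Trainer wiring are illustrative rather than our released code.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hyperparameters follow the training procedure described above; the
# target module names are an assumption about Qwen 2.5's layer naming.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,            # alpha / r = 2
    lora_dropout=0.05,
    target_modules=["k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # expect well under 1% of base parameters
```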
Evaluating Plasticity. In order to assess the plasticity, or adaptation performance, of each model on the task it is being trained on, we evaluate on the held-out test set for each task. While previous papers considered different metrics catered to each task [100; 58], we found that this made it difficult to fairly assess average performance across tasks. As a result, we opted for the simple solution of always evaluating performance with the BERTScore [130] F1 metric between the generated response and the gold label on the testing data.
Each model achieved around an 81 average performance across the 5 tasks by this metric prior to training. We denote the amount of plasticity as ↑P in our experiments, which is the average score after training minus the average score before training. We use ↑ to indicate that higher scores are better, with larger values indicating improvement in the ability to generalize to held-out examples from the fine-tuning task and negative values indicating that training actually had a counterproductive effect.

Evaluating Forgetting. We follow the procedure established by Shenfeld et al. [102] for a general purpose evaluation of knowledge in the LLMs across a variety of capabilities to assess catastrophic forgetting. Leveraging the lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness), we evaluate each model before and after training on the average of six datasets, including HellaSwag, HumanEval, IFEval, and MMLU.
We provide the main results of our comprehensive experiments across base model sizes |θ|, replay rates ρ, and KL coefficients β in Tables 1 and 2. The first row of Table 1 reveals the very significant catastrophic forgetting involved in standard LoRA fine-tuning. Interestingly, forgetting seems to get even worse with increased model size in this regime. One potential explanation would be that the larger models have better initial performance and thus more to lose, but our results in Table 2 demonstrate that the relative loss of performance also grows.
Contextualizing How Catastrophic the Forgetting Is. One way to understand the effect of forgetting is in terms of how final general performance compares to that of smaller models. Indeed, the 3B model after training performs worse than the 1.5B model, the 7B model after training is comparable to the 1.5B model, and the 14B model after training is even worse than the 1.5B model. As such, the effect of forgetting is equivalent to significantly downgrading the model in terms of general capabilities, comparable to a substantial loss of parameters.
The Effect of Approximate Replay. Our experiments reveal that approximate replay provides a significant deterrent to forgetting while retaining the plasticity of standard fine-tuning. Indeed, on average, approximate replay alone provides about a 3× reduction in the amount of forgetting without sacrificing plasticity. It does appear, however, that the marginal benefit of replay diminishes as more is invested, particularly in terms of computational overhead. We achieve the best performance with a replay rate of ρ = 3, but ρ = 1 provides the most economical solution when compute is constrained.
The Effect of KL Divergence. Our experiments also demonstrate the ability to manipulate the stability-plasticity tradeoff by setting an appropriate KL coefficient. β = 0.1 provides a very substantial deterrent against drifting from the base model: it virtually eliminates forgetting, but also all but eliminates the plasticity of the model. β = 0.01 provides a better tradeoff; forgetting is still virtually eliminated, and while plasticity is worse than standard fine-tuning, it is not much worse. β = 0.001 allows for much more flexibility in the model and achieves plasticity that even slightly surpasses standard fine-tuning. This improved generalization to the new task makes sense given our remarks in Section 2.4.2. However, β = 0.001 also allows for a significant degree of forgetting. That said, β = 0.001 represents an entirely improved solution over standard fine-tuning with minimal computational and memory overhead, as it still substantially reduces forgetting relative to standard fine-tuning.
Combining Replay and KL Divergence. The best results come from combining approximate replay with KL divergence regularization. For example, replay is able to improve even further on β = 0.001 by retaining plasticity while cutting down even more on the extent of forgetting. Overall, the best combination depends on the perceived tradeoff between stability and plasticity. Approximate replay with β = 0.01 yields virtually no forgetting while experiencing only a mild loss in terms of plasticity in comparison to standard fine-tuning. On the other hand, approximate replay with β = 0.001 provides the same plasticity as standard fine-tuning with an over 7× average reduction in the amount of forgetting experienced.
Our work is related to a variety of directions of study in the continual learning literature.

KL Regularization in RL. As mentioned earlier, KL regularization of the form used in our paper has become commonplace when performing RL with LLMs, where it is generally applied in concert with PPO [73; 56]. We argue in this paper that it should also be widely used during supervised fine-tuning.
Connections to Distillation. Our use of KL regularization during learning also bears similarities to prior work leveraging distillation to aid with continual learning both with [11] and without [84; 53] replay buffers. KL regularization can be seen as a particularly simple form of distillation that focuses only on output probabilities rather than differences at the hidden layers.
Replay Buffer Types. The approximate replay buffer we consider in this work is theoretically related to reservoir sampling based buffers [118; 86] in that it draws a random subset of the data of a prespecified size. Recency-based sampling [69] would not make sense in our setting, as it would likely sample correlated data that would make learning less robust. We also experimented with generative replay approaches [103; 83; 87; 10] based on examples generated by the LLM itself, but found it difficult to get sufficient diversity in the generated experiences to make a meaningful difference in stabilizing learning.
Modular Architectures. Architectures that exploit modular structure, such as sparse mixtures of experts (MoE) [37; 67; 39; 20], have established benefits in the field of continual learning [81; 68; 30] and have become commonly used for training LLMs due to their computational advantages [101]. As discussed by Rosenbaum et al. [99], modular architectures with dynamic composition [97; 14; 16; 98; 112; 48; 133] have the ability to affect the dynamics of transfer and forgetting by allowing the model to orthogonalize weight updates through routing experiences to different modules. That said, routers are not necessarily trained to make routing decisions with gradient interference in mind and may fail to live up to this promise in practice. Thérien et al. [111] recently explored the influence of the MoE router on the dynamics of continual pre-training. We avoid the use of MoE models in our experiments to avoid this potential conflating factor in the results, but believe this is a promising direction for improving fine-tuning of LLMs while preventing forgetting.
Model Merging. Another interesting approach for preventing forgetting in LLMs is model merging [122; 36; 126; 3], with recent work exploring model merging in the context of continual learning [76]. LLMs fine-tuned using LoRA could be merged with the base model or other task-specific models to improve retention of general knowledge [109; 110]. This approach is complementary to the direction considered in our work.
Capacity Regularization. The KL loss and approximate replay both serve to regularize the learning objective during fine-tuning. Another interesting form of regularization is to limit the capacity of the model to acquire potentially spurious knowledge [60; 59; 62; 61; 63]. As we show in Section 2.4.2, the IB theory suggests that we should get this benefit for free when using KL regularization as we do in this work.
The Impact of Model Size on Forgetting. An interesting aspect of our results is that we find catastrophic forgetting in fine-tuning to be even worse for bigger models than it is for smaller models. This directly contradicts findings in the work of Ramasesh et al. [78] suggesting that larger models experience less forgetting than small models. Larger models experiencing less forgetting seems to be generally true with large datasets such as during continual pre-training [1]. However, the numbers in Luo et al. [58] interestingly also suggest that the effect may be the opposite during LLM fine-tuning. In our work we consider a wider array of model sizes and see this result of more forgetting in larger models more consistently and with a greater effect size. This result serves as a cautionary tale to practitioners who may be comforted by the argument of Ramasesh et al. [78] and not worried about forgetting during fine-tuning because they use large models.
In this paper, we have proposed a very simple yet efficient and effective strategy for stabilizing learning during LLM fine-tuning. The potential applications of this approach are vast and we refer readers to Appendix A for an in-depth discussion. Our proposal of approximate regularized replay combines two straightforward approaches in the continual learning literature in KL regularization and experience replay to largely eliminate forgetting during LLM fine-tuning without sacrificing the ability to adapt to the new fine-tuning task. Our approach prioritizes efficiency in leveraging parameter efficient tuning based on LoRA with a customizable degree of additional computational overhead that can be tuned to meet use case requirements. Moreover, our approach prioritizes practicality by leveraging an open source dataset that is used as a proxy for replay in lieu of direct access to the pre-training data used for the model. Our work takes a step in the direction of democratizing the ability to fine-tune LLMs towards specific business needs, which we believe may be a crucial bottleneck in achieving higher rates of success integrating generative AI across business applications.