Efficient Continual Learning for Small Language Models with a Discrete Key-Value Bottleneck


Continual learning remains a challenge across various natural language processing (NLP) tasks, as models updated with new training data often risk catastrophic forgetting of previously acquired knowledge. We introduce a discrete key-value bottleneck (DKVB) for encoder-only language models, enabling efficient continual learning through localized updates. Inspired by the discrete key-value bottleneck from computer vision, we address new, NLP-specific challenges. We compare different bottleneck architectures for NLP and introduce a new, task-independent initialization technique for the discrete keys. We evaluate our DKVB in four continual learning scenarios and show that it alleviates catastrophic forgetting. Our experiments demonstrate that the proposed approach achieves competitive performance compared to popular continual learning methods while incurring lower computational costs. Furthermore, we show that DKVB remains effective even in challenging single-head continual learning scenarios where no task ID is provided.


💡 Research Summary

The paper introduces a Discrete Key‑Value Bottleneck (DKVB) adapted for encoder‑only language models to enable efficient continual learning (CL). Building on a vision‑based DKVB, the authors address three NLP‑specific challenges: (i) the high‑dimensional token‑wise representations produced by transformers, (ii) the choice of pooling strategy, and (iii) the design of an appropriate decoder. They explore multiple architectural variants, including how to partition the encoder output (by hidden dimension versus token dimension), where to apply pooling (before or after the bottleneck), and whether to use a parametric (linear + dropout) or non‑parametric (mean‑pool + softmax) decoder.
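To make the bottleneck mechanics concrete, here is a minimal numpy sketch of one forward pass: the encoder output is split along the hidden dimension into C heads, each head is snapped to its nearest frozen key, the corresponding trainable value code is fetched, and mean pooling is applied after the bottleneck. This is a simplification under stated assumptions: the head size is taken to equal the key dimension and nearest-key lookup uses Euclidean distance; the paper's exact projection and decoder details may differ.

```python
import numpy as np

def dkvb_forward(h, keys, values):
    """Sketch of a discrete key-value bottleneck forward pass.

    h:      (T, D) token-wise representations from a frozen encoder.
    keys:   (C, K, d_key) per-head codebooks, frozen after initialization.
    values: (C, K, d_val) per-head value codes, the only trainable part.
    Assumes D == C * d_key, i.e. the hidden dimension is split across heads.
    """
    C, K, d_key = keys.shape
    T, D = h.shape
    assert D == C * d_key, "hidden dim must split evenly into heads"
    heads = h.reshape(T, C, d_key)                 # split hidden dim into C heads
    out = np.empty((T, C, values.shape[-1]))
    for c in range(C):
        # nearest-key lookup per token (squared Euclidean distance)
        dists = ((heads[:, c, None, :] - keys[c][None]) ** 2).sum(-1)  # (T, K)
        idx = dists.argmin(-1)                                         # (T,)
        out[:, c] = values[c][idx]                 # fetch trainable value codes
    # mean-pool over tokens *after* the bottleneck, then concatenate heads
    return out.mean(axis=0).reshape(-1)            # (C * d_val,)
```

A downstream decoder (parametric or non-parametric) would map this pooled vector to class logits; gradients flow only into the selected value codes, which is what localizes the updates.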

Key initialization is performed in a task‑independent manner: keys are randomly seeded and then updated for three epochs using an exponential moving average (EMA) on a generic corpus. After this pre‑training phase the keys are frozen, while only the value codes are updated during downstream training. This design ensures that new tasks modify only a small subset of parameters, limiting catastrophic forgetting.
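The EMA-based key initialization can be sketched as follows, in the style of exponential-moving-average codebook updates familiar from vector quantization. The decay value and the exact bookkeeping are assumptions for illustration; what the summary fixes is only that keys start random, are updated via EMA on a generic corpus for three epochs, and are then frozen.

```python
import numpy as np

def ema_key_init_step(keys, ema_counts, ema_sums, batch, decay=0.99, eps=1e-5):
    """One EMA update step for task-independent key initialization (sketch).

    keys:       (K, d) current codebook for a single head.
    ema_counts: (K,) running EMA of per-key assignment counts.
    ema_sums:   (K, d) running EMA of summed assigned representations.
    batch:      (N, d) head-sliced representations from a generic corpus.
    Keys drift toward the mean of the representations assigned to them;
    after a few epochs of such updates they are frozen.
    """
    K = keys.shape[0]
    dists = ((batch[:, None, :] - keys[None]) ** 2).sum(-1)  # (N, K)
    one_hot = np.eye(K)[dists.argmin(-1)]                    # (N, K) assignments
    ema_counts = decay * ema_counts + (1 - decay) * one_hot.sum(0)
    ema_sums = decay * ema_sums + (1 - decay) * (one_hot.T @ batch)
    keys = ema_sums / (ema_counts[:, None] + eps)            # re-center keys
    return keys, ema_counts, ema_sums
```

Because the corpus used here is generic rather than task-specific, the resulting codebook is task-independent, and downstream tasks only ever touch the value codes.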

In preliminary experiments on two text‑classification benchmarks (R8 and 20 Newsgroups), the best configuration uses four heads (C = 4) that split the hidden dimension, a codebook size of 4096 per head, key dimension 12, value dimension 2, mean pooling applied after the bottleneck, and a parametric decoder. With a frozen BERT‑base encoder, this setup achieves 96.04 % (±0.26) on R8 and 77.83 % (±0.89) on 20NG, trailing a fully fine‑tuned BERT by only about 2 and 7 percentage points, respectively. The same architecture works equally well with RoBERTa and DistilBERT.

For continual learning evaluation, three incremental scenarios are defined: Domain Incremental Learning (DIL), Class Incremental Learning (CIL), and Task‑type Incremental Learning (TIL). Additionally, a challenging single‑head CIL setting (no task ID, shared output head) is tested. DKVB is compared against a suite of strong baselines—including regularization‑based (EWC), replay‑based, adapter‑based, and other recent CL methods—across five random seeds and shuffled task orders. Accuracy, standard deviation, and per‑epoch runtime are reported.

Results show that DKVB consistently matches or slightly exceeds the baselines in all three scenarios while requiring 30‑45 % less training time per epoch. In the single‑head CIL case, where most methods suffer severe performance collapse, DKVB maintains relatively stable accuracy, demonstrating that the discrete bottleneck provides effective parameter isolation without needing separate task‑specific heads.

The authors argue that DKVB’s advantages stem from three factors: (1) task‑independent key initialization that yields a well‑distributed codebook, (2) localized updates of value codes that prevent global weight drift, and (3) the ability to keep the bulk of the pretrained encoder frozen, drastically reducing computational overhead. Limitations include the memory footprint of large codebooks (K = 4096, C = 4) and the current focus on classification tasks; extending the approach to generation, QA, or multimodal inputs will likely require dynamic key‑value sharing or compression techniques.
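The footprint trade-off is easy to quantify from the reported configuration. The arithmetic below uses only the numbers stated above (C = 4, K = 4096, key dimension 12, value dimension 2); the comparison against BERT-base's roughly 110M parameters is an order-of-magnitude illustration, not a figure from the paper.

```python
# Rough bottleneck parameter count under the reported configuration.
C, K = 4, 4096           # heads, codebook size per head
d_key, d_val = 12, 2     # key and value dimensions

key_params = C * K * d_key     # frozen after task-independent initialization
value_params = C * K * d_val   # the only parameters updated per task

print(key_params, value_params)  # 196608 32768
# ~33k trainable values vs. ~110M parameters in BERT-base: a tiny fraction
# is updated per task, but the codebook itself grows linearly in K and C.
```

This is why large codebooks dominate the memory cost even though the trainable portion stays small.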

In summary, the paper presents a principled adaptation of a discrete bottleneck for small language models, demonstrates its efficacy across multiple continual‑learning regimes, and shows that it can alleviate catastrophic forgetting while being computationally efficient. The work opens avenues for further research on scalable key‑value sharing, dynamic head allocation, and broader NLP task coverage.

