LACE: Loss-Adaptive Capacity Expansion for Continual Learning

Fixed representational capacity is a fundamental constraint in continual learning: practitioners must guess an appropriate model width before training, without knowing how many distinct concepts the data contains. We propose LACE (Loss-Adaptive Capac…

Authors: Shivnath Tathe

LACE: Loss-Adaptive Capacity Expansion for Continual Learning
LA CE: Loss-Adapti v e Capacity Expansion for Continual Learning Shi vnath T athe Independent Resear cher Pune, India sptathe2001@gmail.com ORCID: 0009-0007-7142-1119 Abstract —Fixed repr esentational capacity is a fundamental constraint in continual learning: practitioners must guess an appropriate model width befor e training, without knowing how many distinct concepts the data contains. W e propose LA CE (Loss-Adaptive Capacity Expansion), a simple online mechanism that expands a model’ s repr esentational capacity during training by monitoring its own loss signal. When sustained loss deviation exceeds a threshold — indicating that the current capacity is insufficient for newly encountered data — LA CE adds new dimensions to the pr ojection layer and trains them jointly with existing parameters. Across synthetic and real-data experiments, LA CE triggers expansions exclusively at domain boundaries (100% boundary precision, zero false positives), matches the accuracy of a large fixed-capacity model while starting from a fraction of its dimensions, and produces adapter dimensions that are collectively critical to performance (3% accuracy drop when all adapters removed). W e further demonstrate unsuper - vised domain separation in GPT -2 activations via layer -wise clustering, showing a U-shaped separability curve across layers that motivates adaptive capacity allocation in deep networks. LA CE requir es no labels, no replay buffers, and no external controllers, making it suitable for on-device continual learning under r esource constraints. Index T erms —continual learning, dynamic capacity, adaptive width, loss-based detection, on-device lear ning I . I N T R O D U C T I O N Neural network architectures require practitioners to fix model width — the number of dimensions in each layer — be- fore training begins. This decision is made without knowledge of the true comple xity of the data distribution, the number of distinct concepts to be learned, or ho w that complexity may change over time. In continual learning settings, where data arriv es sequentially from shifting distributions, this constraint becomes especially problematic: a model sized for early tasks may lack capacity for later ones, while a model sized for the full task sequence wastes capacity during early training. Existing approaches to dynamic capacity — progressive neural networks [1], neural architecture search [2], and mixture-of-experts [3] — either require expensi ve search procedures, external controllers, or architectural assumptions that limit deployment on constrained hardware. Adapter-based methods [4], [5] add capacity at fine-tuning time but do not address online capacity allocation during continual pretraining. W e ask a simpler question: can a model detect when its curr ent capacity is insuf ficient and e xpand automatically , using only its own loss as a signal? The intuition is straightforward. When a model encounters a new distribution it cannot represent with existing capacity , training loss rises sharply and remains elev ated. This sustained deviation from the recent loss baseline is a direct, label-free signal that additional representational capacity is needed. W e formalize this as a spike detection mechanism and couple it with a lightweight expansion operation that adds ne w dimensions to the projection matrix. Contributions: 1) A loss-spike-dri ven expansion mechanism with EMA baseline, ratio threshold, confirmation window , and cooldown — requiring no labels, no gradient tracking beyond the standard training loop. 2) Empirical validation showing 100% expansion precision across all experiments: e very e xpansion fires at a genuine domain boundary , with zero false positives over 5,000 training steps. 3) Evidence that dynamically added dimensions are col- lectiv ely critical: removing all adapter dimensions drops accuracy by 3%, while indi vidual dimensions show a distributed representation pattern consistent with super- position [10]. 4) A capacity efficiency result: LA CE matches Fixed- Large accuracy while starting from a Fix ed-Small base, demonstrating that adapti ve allocation outperforms both under-pro visioned and ov er-pro visioned fix ed baselines. 5) Layer-wise unsupervised activ ation clustering on GPT -2 rev ealing a U-shaped domain separability curve, moti- vating where in deep networks capacity expansion is most beneficial. I I . R E L AT E D W O R K A. Continual Learning Catastrophic forgetting [6] is the central challenge in contin- ual learning. Regularization-based methods such as EWC [7] constrain weight updates to preserve prior task performance. Replay-based methods [8] store or generate examples from prior tasks. Architectural methods [1], [9] allocate separate capacity per task. LA CE is complementary to all of these: it addresses when to allocate capacity , not how to pre vent forgetting after allocation. B. Dynamic Capacity Progressiv e Neural Networks [1] add ne w columns per task but require task identity at training time. PackNet [9] prunes and reuses weights but requires a fixed total budget. Neural Architecture Search [2] optimizes architecture globally but is computationally prohibitive for online settings. LA CE dif fers by operating online, requiring no task labels, and using the model’ s own loss as the sole e xpansion trigger . C. Adapters and Low-Rank Methods LoRA [4] and adapter modules [5] add trainable parame- ters to frozen pretrained models. These methods target fine- tuning ef ficiency rather than online capacity allocation during training. LA CE expands the projection matrix directly dur- ing training, which is mechanistically distinct from post-hoc adapter insertion. D. Activation-Based Analysis Superposition in neural networks [10] shows that models store multiple features per dimension when capacity is con- strained. Our ablation results are consistent with this find- ing: individually ablating dimensions has small effect, but removing all adapter dimensions collectiv ely causes significant performance degradation. I I I . M E T H O D A. Pr oblem Setting W e consider a continual learning setting where a model receiv es data from a sequence of distributions D 1 , D 2 , . . . , D T arriving online. The model has a base capacity d base and may expand up to a maximum d max . The expansion budget d adapt = d max − d base represents the maximum number of adapter dimensions av ailable. B. Loss-Based Novelty Detection Let L t denote the training loss at step t and ¯ L t denote the exponential moving average of recent losses over a window of W steps: ¯ L t = 1 W t − 1 X i = t − W L i (1) W e define a loss spike at step t as: spike ( t ) = ( 1 if L t > τ · ¯ L t 0 otherwise (2) where τ > 1 is the spike ratio threshold. T o avoid reacting to transient noise, we require K consecutiv e spikes before triggering expansion: expand ( t ) = ( 1 if P t i = t − K spike ( i ) ≥ K 0 otherwise (3) After expansion, a cooldown of C steps is enforced before the detector can fire again. Additionally , we detect sustained high loss as a secondary signal: if the mean loss over the window exceeds an abso- lute threshold θ for S consecutiv e steps, expansion is also triggered. This handles the case where the model starts with very limited capacity and the loss ne ver stabilizes enough to produce clear spikes. C. Capacity Expansion When expansion is triggered, we extend the projection matrix W ∈ R d active × d in by one dimension: W ←  W w ⊤ new  , w new ∼ N (0 , σ 2 ) (4) The corresponding output mask is updated to activ ate the new dimension: h ′ = ReLU ( W x ) ⊙ m, m i = ( 1 i ≤ d activ e 0 otherwise (5) New dimensions are initialized wit h small random weights ( σ = 0 . 01 ) and trained jointly with existing parameters in subsequent gradient updates. Expansion is bounded by d max . D. T raining Loop Algorithm 1 summarizes the complete LACE training pro- cedure. Algorithm 1 LACE T raining Require: Model f θ , loss windo w W , threshold τ , confirm K , cooldown C 1: Initialize d activ e ← d base , detector window ← [] 2: for each training step t do 3: Sample batch ( x, y ) ∼ D t 4: Compute loss L t = L ( f θ ( x ) , y ) 5: Update θ via gradient descent 6: if t ≥ t warmup and cooldown = 0 then 7: if expand ( t ) = 1 and d activ e < d max then 8: Expand W , increment d activ e 9: Reset cooldown ← C 10: end if 11: end if 12: Append L t to detector window 13: end for E. Unsupervised Activation Clustering (Analysis) T o understand where in deep networks domain information is encoded, we apply online K-means clustering to mean- pooled hidden state activ ations across all 12 layers of GPT - 2 [14]. For each layer l , we reduce activ ations to d pca = 32 dimensions via PCA and cluster using cosine distance thresh- old δ = 0 . 15 . Cluster purity is computed as: purity = 1 N X c max d | C c ∩ D d | (6) where C c is the set of samples in cluster c and D d is the set of samples from domain d . This analysis is used as a diagnostic tool, not as an expansion trigger . I V . E X P E R I M E N T S A. Setup All synthetic experiments use character-lev el tokenization (ASCII, vocab size 128) with sequence length 32. The base model consists of a learned embedding layer , a single projec- tion layer , and a classification head. W e use Adam optimizer with learning rate 3 × 10 − 4 and batch size 64. LACE hyperpa- rameters: W = 50 , τ = 2 . 5 , K = 1 , C = 60 , warmup = 100 steps. Domains are generated synthetically from 10 distinct f ami- lies: scientific te xt, news, dialog, medical, code, poetry , finan- cial, sports, math, and legal. Each family produces structurally and lexically distinct character sequences, ensuring genuine distributional separation. B. Baselines W e compare three configurations throughout: • Dynamic (LA CE) : starts at d base , expands up to d max . • Fixed-Large : fixed at d max from step 0, same maximum budget. • Fixed-Small : fixed at d base , same starting budget as LA CE. C. Experiment 1: Baseline Comparison (10 Domains) W e introduce 10 domains sequentially , one e very 200 steps, ov er 2,000 total training steps. d base = 64 , d max = 84 . T ABLE I B A S E LI N E C O M P A R IS O N — 1 0 D O M A IN S Model Acc Exp d final d avg Precision LA CE (Dynamic) 0.999 9 73 ∼ 68 100% Fixed-Lar ge 0.999 — 84 84 — Fixed-Small 0.998 — 64 64 — LA CE achie ves accuracy matching Fixed-Lar ge while using fewer dimensions on av erage throughout training (Fig. 1). All 9 expansion ev ents fire within one phase window of a domain boundary — 100% boundary precision with zero false positiv es. Fig. 1. Exp 1: Training loss and active dimensions for LACE vs fixed baselines ov er 10 sequential domains. Red dashed lines indicate e xpansion ev ents. D. Experiment 2: F or getting Measurement W e track per -domain accuracy throughout training to mea- sure catastrophic forgetting. Fig. 2 shows that once a domain is learned, accuracy on that domain remains stable throughout subsequent training for both LA CE and Fixed-Large. No significant forgetting is observ ed on this classification task. Fig. 2. Exp 2: Per-domain accuracy ov er time for LA CE (left) and Fixed- Large (right). Both models retain learned domains without forgetting. Limitation: Forgetting is not observed on classification tasks because the output head preserves all class outputs. Generative tasks, where prior knowledge can be overwritten at the token lev el, represent an important direction for future work. E. Experiment 3: Ablation of Adapter Dimensions T o verify that dynamically added dimensions are genuinely used, we ablate each adapter dimension individually and collectiv ely after training. T ABLE II A B LAT IO N R E S U L T S Condition Accuracy Drop Baseline (all dims active) 0.999 — Ablate dim 68 (most critical) 0.981 0.018 Ablate dim 70 0.987 0.013 Ablate all adapter dims 0.969 0.030 Individual dimensions show small drops (0.001–0.018), while removing all adapter dimensions collectiv ely causes a 3% accuracy drop (Fig. 3). This pattern is consistent with distributed representation [10]: information is spread across dimensions rather than stored in dedicated slots. The result confirms that adapter dimensions are genuinely used, not wasted capacity . Fig. 3. Exp 3: Per-dimension accuracy drop (left) and collectiv e ablation (right). Red bars indicate dimensions exceeding 1% individual drop threshold. F . Experiment 4: Confirmation W indow W e compare K = 1 (immediate expansion on first spike) vs K = 3 (expansion after 3 consecutiv e spikes). T ABLE III C O NFI R M A T I O N W I ND O W C O MPA R IS O N Config Expansions Precision Accuracy K = 1 (immediate) 9 100% 0.999 K = 3 (confirmed) 9 100% 1.000 Both configurations achiev e 100% boundary precision. K = 3 uses the same number of expansions and achie ves marginally higher accuracy , suggesting the confirmation window acts as a useful noise filter without sacrificing sensitivity (Fig. 4). Fig. 4. Exp 4: Loss curves and metric comparison for K = 1 vs K = 3 confirmation windows. G. Experiment 5: Capacity W all (50 Domains) T o stress-test the system, we scale to 50 domains (10 families × 5 variants) with d base = 8 and d max = 48 . This configuration forces Fixed-Small into a genuine capacity wall. T ABLE IV C A P A CI T Y W AL L — 5 0 D OM A I N S ( d BA SE = 8 , d M AX = 48 ) Model Final Acc d final Expansions LA CE (Dynamic) 0.676 38 30 Fixed-Lar ge 0.884 48 — Fixed-Small 0.434 8 — Fixed-Small plateaus at 0.434 accurac y — it cannot separate 50 domains with only 8 dimensions (Fig. 5). LA CE signifi- cantly outperforms Fixed-Small (0.676 vs 0.434) by growing from 8 to 38 dimensions. Fixed-Large achiev es the highest accuracy by having full capacity throughout, but requires knowing d max upfront. LA CE starts with the same budget as Fixed-Small and closes 73% of the gap to Fixed-Large without prior knowledge of task complexity . H. Experiment 6: Real-W orld V alidation (W ikipedia → Code → Chat) T o address the limitation of synthetic-only ev aluation, we conduct a real-world experiment using three sequential do- mains drawn from HuggingFace datasets: Wikipedia [11] (en- cyclopedic text), Python code [12] (structured programs), and con versational chat [13] (informal dialogue). These domains Fig. 5. Exp 5: Accuracy over time across 50 domains. Fixed-Small plateaus early due to insufficient capacity . LA CE gro ws adaptiv ely and closes the gap to Fixed-Large. Fig. 6. Exp 5: Active dimensions over time. LA CE grows from d = 8 to d = 38 via 30 expansion e vents, staying well below Fixed-Large’ s constant d = 48 . represent genuinely distinct real-world text distrib utions with ov erlapping vocabulary but fundamentally different structure, syntax, and register . W e use frozen GPT -2 embeddings [14] as input represen- tations ( d emb = 768 ), with LA CE starting at d base = 32 and growing up to d max = 128 . Fixed-S uses d = 32 throughout; Fixed-L uses d = 128 throughout. T ABLE V R E AL - W OR L D E X P E RI M E N T — W I K I PE D I A → C OD E → C H A T Model Acc Exp d final Precision LA CE (Dynamic) 0.796 2 34 100% Fixed-Lar ge 0.821 — 128 — Fixed-Small 0.667 — 32 — LA CE expands exactly twice — once when code is intro- duced (step 300) and once when chat is introduced (step 600) — maintaining 100% boundary precision on real data. Fixed- Small plateaus at 0.667, unable to adapt beyond two domains with only 32 dimensions. LACE outperforms Fixed-Small by 12.9% while using an average of ∼ 33 dimensions compared to Fixed-Lar ge’ s constant 128, achieving 96.9% of Fixed-Large accuracy with 74% fewer average dimensions. These results demonstrate that LA CE’ s loss-spike detection generalizes beyond synthetic domains to real-world text with genuine distributional heterogeneity , directly addressing the concern that 100% boundary precision may be an artifact of synthetic separability . Fig. 7. Real-world experiment: W ikipedia → Code → Chat. LA CE fires precisely at both domain boundaries (orange/purple vertical lines), expands from d = 32 to d = 34 , and outperforms Fixed-Small while approaching Fixed-Lar ge accuracy . I. GPT -2 Layer Analysis T o motiv ate adaptiv e capacity allocation in pretrained mod- els, we analyze domain separability across all 12 layers of GPT -2 [14] using unsupervised activ ation clustering on 600 samples from three domains (scientific, news, dialog). Fig. 8. GPT -2 layer-wise domain separability . Purity drops in middle layers (3–7) where cross-token attention mixes representations, then recovers in deep layers (8–12). Fig. 8 re veals a U-shaped purity curve: early layers separate domains by surface vocab ulary (high purity), middle layers blur domains as the transformer builds contextual represen- tations (purity drops to 0.67), and deep layers recover clean separation at the semantic le vel. This pattern suggests that capacity pressure v aries by layer depth — middle layers, where domains are least separable, are most likely to benefit from adaptiv e capacity expansion. V . D I S C U S S I O N What LA CE detects. The loss-spike detector identifies distributional shift, not semantic features. When a new domain introduces unfamiliar character patterns, the model’ s loss rises because its current weight configuration cannot represent the new distribution. This is a surface-lev el signal, but it is reliable: across all experiments, 100% of expansions occurred at genuine distribution boundaries. Capacity efficiency . In the 10-domain setting, LACE matches Fixed-Large accuracy while using 13% fe wer average dimensions throughout training. In the 50-domain setting, LA CE outperforms Fixed-Small by 57% while using the same starting budget. The cost of adaptiv e expansion — detection ov erhead and occasional wasted expansions — is negligible compared to the benefit of not requiring foreknowledge of task complexity . Distributed representation in adapters. The ablation re- sult (small individual drops, large collective drop) indicates that dynamically added dimensions store information in a distributed fashion. This is consistent with superposition the- ory [10] and suggests that expansion adds genuinely useful representational capacity rather than redundant dimensions. Limitations. Three limitations warrant honest discussion. First, LA CE does not provide a forgetting advantage on classification tasks, where the output head preserves prior class outputs regardless of capacity . Second, the loss-spike detector can be sensitiv e to training noise — the EMA base- line, confirmation window , and cooldo wn mitigate this, but optimal hyperparameters may v ary by task. Third, real-world validation is currently limited to three domains; ev aluation on standard continual learning benchmarks such as Split-CIF AR remains future work. V I . C O N C L U S I O N W e presented LA CE, a simple mechanism for adaptiv e capacity expansion in continual learning. By monitoring the model’ s own loss signal, LACE detects when existing capacity is insufficient and expands the projection matrix with new dimensions trained jointly with existing parameters. Across synthetic experiments spanning 10 to 50 sequential domains, LA CE achiev es 100% expansion precision, outperforms fixed under-pro visioned baselines, and produces adapter dimensions that are collectiv ely necessary for learned performance. The method requires no labels, no replay buf fers, and no external controllers, making it a practical tool for resource-constrained continual learning. The core principle — allocate capacity only when the data demands it — is broadly applicable beyond the specific archi- tecture studied here, and we hope it motiv ates further work on self-supervised capacity management in deep networks. A C K N O W L E D G M E N T The author thanks the open-source communities behind PyT orch, Hugging Face Transformers, and the arXiv preprint platform. R E F E R E N C E S [1] A. Rusu et al., “Progressiv e neural networks, ” arXiv pr eprint arXiv:1606.04671 , 2016. [2] T . Elsken, J. H. Metzen, and F . Hutter , “Neural architecture search: A survey , ” Journal of Machine Learning Resear ch , vol. 20, no. 55, pp. 1–21, 2019. [3] W . Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity , ” Journal of Machine Learning Research , vol. 23, no. 120, pp. 1–39, 2022. [4] E. J. Hu et al., “LoRA: Low-rank adaptation of large language models, ” in International Confer ence on Learning Representations , 2022. [5] J. He et al., “T ow ards a unified view of parameter-efficient transfer learning, ” in International Conference on Learning Representations , 2022. [6] M. McCloskey and N. J. Cohen, “Catastrophic interference in con- nectionist networks: The sequential learning problem, ” Psychology of Learning and Motivation , vol. 24, pp. 109–165, 1989. [7] J. Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks, ” Pr oceedings of the National Academy of Sciences , vol. 114, no. 13, pp. 3521–3526, 2017. [8] D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning, ” in Advances in Neur al Information Processing Systems , 2017. [9] A. Mallya and S. Lazebnik, “P ackNet: Adding multiple tasks to a single network by iterati ve pruning, ” in IEEE Confer ence on Computer V ision and P attern Recognition , 2018. [10] N. Elhage et al., “T oy models of superposition, ” T ransformer Circuits Thr ead , 2022. [Online]. A vailable: https://transformer- circuits.pub/2022/ toy model/index.html [11] Wikimedia Foundation, “W ikipedia dataset, ” Hugging F ace Datasets , 2023. [Online]. A v ailable: https://huggingface.co/datasets/wikimedia/ wikipedia [12] FlyT ech, “Python codes 25k, ” Hugging F ace Datasets , 2023. [Online]. A vailable: https://huggingface.co/datasets/flytech/python- codes- 25k [13] A. Korshuk, “Persona-chat dataset, ” Hugging F ace Datasets , 2022. [Online]. A vailable: https://huggingface.co/datasets/AlekseyK orshuk/ persona- chat [14] A. Radford et al., “Language models are unsupervised multitask learn- ers, ” OpenAI Blog , vol. 1, no. 8, p. 9, 2019. A P P E N D I X Fig. 9. T oy experiment: Unsupervised K-Means activation clustering on 3- pattern synthetic data. Left: true patterns. Right: discovered clusters. Fig. 10. Cluster center similarity matrix for toy K-Means experiment. Fig. 11. GPT -2 activ ation space by layer, colored by true domain. Each subplot sho ws PCA projection of layer activ ations with cluster purity (p) and number of clusters (k). Fig. 12. From-scratch training on 10 sequential domains: loss curv e and acti ve dimensions. Loss spikes at each domain boundary trigger expansions. Fig. 13. Capacity wall experiment: training loss across 50 domains. Fixed- Small loss remains elev ated throughout; LA CE and Fixed-Large conv erge.

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment