Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original ArXiv source.

In this paper, we aim to bridge test-time training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of the FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one with a conventional two-layer MLP design that has a clearer theoretical guarantee; the other shares the GLU-FFN structure of SOTA LLMs and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories – performed in a principled way by reusing model parameters, activations, and/or gradients – is essential for fast convergence, improved generalization, and the prevention of catastrophic forgetting. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. We also test the model's loss of general capability after memorizing a whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimal catastrophic forgetting of the model's existing internal knowledge.


💡 Research Summary

The paper introduces Locas, a novel framework that treats the feed‑forward network (FFN) of modern transformers as a parametric memory consisting of key‑value pairs, and equips it with principled initialization strategies for test‑time training (TTT). Traditional TTT approaches either update the entire model or add randomly initialized small modules, which are computationally expensive and converge slowly, often causing catastrophic forgetting of pre‑trained knowledge. Locas addresses these issues by explicitly expanding the model’s internal memory capacity in a controlled, efficient manner.
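The key–value reading of an FFN can be sketched in a few lines. This is an illustrative NumPy sketch of the general idea, not the paper's implementation; the row-wise key/value layout and the concrete shapes are our own convention:

```python
import numpy as np

# Illustrative "FFN as soft lookup table" view: each of the r hidden units is a
# (key, value) pair. The key scores the input; matched values are mixed into
# the output.
d, r = 16, 64                      # model width, memory size (hypothetical)
rng = np.random.default_rng(0)
K = rng.standard_normal((r, d))    # key matrix: one key vector per memory slot
V = rng.standard_normal((r, d))    # value matrix: one value vector per slot

def ffn_memory(x):
    scores = np.maximum(K @ x, 0.0)  # ReLU scores: how strongly each key matches x
    return scores @ V                # output: score-weighted mixture of values
```

Under this view, "expanding the memory" simply means appending rows to `K` and `V`, which is what both Locas variants do in a principled way.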

Two variants are proposed:

  1. Locas‑MLP – a conventional two‑layer MLP (ReLU) memory. The key matrix $K\in\mathbb{R}^{d\times r}$ and value matrix $V\in\mathbb{R}^{r\times d}$ are initialized using the current activation $A$ and the gradient of the log‑likelihood with respect to the hidden state, $G=\nabla_H\log p(x_t\mid x_{<t})$. Specifically, the new key vector is set to the normalized activation, and the new value vector to a scaled, globally normalized gradient. Under mild assumptions, this yields a step‑wise optimal initialization in both the time‑step and gradient‑update dimensions, providing a theoretical guarantee of the fastest possible improvement per update.

  2. Locas‑GLU – a memory that mirrors the GLU‑FFN architecture used in state‑of‑the‑art large language models (e.g., LLaMA, Mistral). Instead of gradients, it reuses the backbone model's own parameters. For a given context chunk, the intermediate activations of the backbone GLU‑FFN are computed, and the average absolute activation per hidden dimension is used as an importance score. The top‑$r$ most active dimensions are selected, and the corresponding rows of the backbone's gate and up‑projection matrices are cloned (and normalized) to form the memory's key and gate matrices. The down‑projection (value) matrix is initialized to zero, ensuring that the memory contributes nothing initially, akin to LoRA's initialization. This "activation‑guided parameter cloning" aligns the new memory with the principal subspace of the pretrained model, balancing local support for the current context with the model's generalizable features.
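The gradient-guided initialization of Locas‑MLP (item 1 above) can be sketched as appending a single (key, value) slot. This is a hedged sketch with a row-wise memory layout and per-vector normalization; the paper normalizes the gradient globally, and the scale `eta` is a hypothetical knob:

```python
import numpy as np

def init_mlp_memory_slot(K, V, a, g, eta=1.0, eps=1e-8):
    """Append one (key, value) memory slot (sketch, not the paper's exact code).
    a: current activation; g: log-likelihood gradient w.r.t. the hidden state."""
    k = a / (np.linalg.norm(a) + eps)        # new key: unit-norm activation
    v = eta * g / (np.linalg.norm(g) + eps)  # new value: scaled unit-norm gradient
    return np.vstack([K, k]), np.vstack([V, v])
```

Intuitively, a slot initialized this way fires most strongly on inputs resembling the activation `a`, and its output then nudges the hidden state along the log-likelihood gradient, which is what underlies the step-wise optimality claim.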
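The activation-guided cloning of Locas‑GLU (item 2 above) can likewise be sketched. SiLU gating, a row-wise projection layout, and per-row key normalization are assumptions here, not confirmed details of the paper:

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def init_glu_memory(W_gate, W_up, H, r, eps=1e-8):
    """Activation-guided parameter cloning (sketch). W_gate, W_up: backbone
    projections of shape (h, d); H: hidden states of the context chunk, (n, d)."""
    acts = silu(H @ W_gate.T) * (H @ W_up.T)   # backbone GLU activations, (n, h)
    importance = np.abs(acts).mean(axis=0)     # mean |activation| per hidden dim
    top = np.argsort(importance)[-r:]          # indices of the top-r dimensions
    K = W_gate[top]                            # cloned gate rows -> memory keys
    K = K / (np.linalg.norm(K, axis=1, keepdims=True) + eps)  # normalized keys
    U = W_up[top].copy()                       # cloned up-projection rows
    D = np.zeros((r, W_gate.shape[1]))         # zero down-projection: memory starts silent
    return K, U, D
```

The zero down-projection mirrors LoRA's zero-initialized `B` matrix: the attached memory is a no-op until test-time updates write into it.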

To prevent the memory from overwhelming the backbone during inference, two stabilisation mechanisms are introduced:

  • Weight‑norm clipping – each row (or column) vector in the key, gate, and value matrices is clipped only if its L2 norm exceeds 1, effectively bounding the memory’s per‑step contribution within a fixed‑radius ball in output space. This acts as an implicit KL‑divergence constraint with negligible computational overhead.
  • Output scaling factor $\tau$ – the memory's output is multiplied by $\tau$, defined as the average row norm of the backbone's down‑projection matrix divided by the memory width $r$. This adaptively matches the magnitude of the memory's contribution to that of the backbone FFN.
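Both mechanisms are cheap elementwise operations. A minimal sketch, assuming row-major weight matrices (our convention, not necessarily the paper's):

```python
import numpy as np

def clip_row_norms(W, max_norm=1.0):
    """Weight-norm clipping: rescale only rows whose L2 norm exceeds max_norm,
    leaving all other rows untouched."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))

def output_scale(W_down, r):
    """tau: average row norm of the backbone down-projection, divided by the
    memory width r."""
    return np.linalg.norm(W_down, axis=1).mean() / r
```

Because clipping is conditional, well-behaved slots keep their learned scale; only runaway slots are projected back onto the unit ball, which is what bounds the memory's per-step contribution.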

Memory growth is linear in the number of memorised tokens. While standard back‑propagation (BP) suffices for updating the memory, the authors also propose a Non‑Linear SVD (NL‑SVD) algorithm to compress the two‑layer memory after accumulation. NL‑SVD generalises classic SVD to the non‑linear case, preserving dominant activation behaviour while reducing latent dimensionality. Empirical results show NL‑SVD does not outperform simple BP updates and incurs higher computational cost, so BP is recommended in practice.
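The summary does not spell out NL‑SVD itself; as intuition for what such compression does, the purely linear case reduces to a truncated SVD of the composed memory map. This is an illustrative linear analogue only, not the paper's algorithm:

```python
import numpy as np

def compress_linear_memory(K, V, r_new):
    """Truncated-SVD compression of a *linear* two-layer memory y = (K @ x) @ V
    (K, V of shape (r, d)). A simplified analogue of NL-SVD, which additionally
    has to preserve behaviour through the non-linearity."""
    W = K.T @ V                                    # composed (d, d) map
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    K_new = (U[:, :r_new] * s[:r_new]).T           # (r_new, d) compressed keys
    V_new = Vt[:r_new]                             # (r_new, d) compressed values
    return K_new, V_new
```

Keeping the top singular directions preserves the dominant input–output behaviour while shrinking the latent width from $r$ to $r_{\text{new}}$; NL‑SVD extends this idea past the activation function.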

Experiments are conducted on two demanding benchmarks:

  • PG‑19 whole‑book language modelling – Locas‑GLU memorises entire books from the PG‑19 corpus using as little as 0.02 % additional parameters (≈2 M). After memorisation, the model's general knowledge is evaluated on MMLU, showing negligible degradation, indicating successful preservation of pre‑trained capabilities.
  • LoCoMo long‑context dialogue QA – With context windows far shorter than the full dialogue, Locas‑GLU still achieves high answer accuracy, outperforming baselines that either truncate context or rely on full attention (which is infeasible at such lengths). Compared against TempLoRA, Locas provides superior parameter‑ and compute‑efficiency.

Across all baselines—full attention, context truncation, and TempLoRA—Locas demonstrates markedly lower FLOPs and parameter growth while maintaining or improving task performance. Ablation studies confirm that the principled initialization (activation‑guided for GLU, gradient‑guided for MLP) dramatically reduces the number of gradient steps needed for convergence, directly translating into both parameter and compute savings.

Contributions summarized:

  1. Framework – Reinterpretation of transformer FFNs as soft lookup tables, enabling explicit parametric memory expansion at test time.
  2. Two variants – Locas‑MLP with provable step‑wise optimal initialization; Locas‑GLU with activation‑guided cloning compatible with existing GLU‑based LLMs.
  3. Compression – Introduction of NL‑SVD for non‑linear memory compression, with theoretical analysis (Appendix A.1).
  4. Stabilisation – Weight‑norm clipping and adaptive scaling to bound memory influence and prevent catastrophic forgetting.
  5. Empirical validation – Demonstrated on PG‑19 and LoCoMo that a minuscule parameter overhead can store extensive context while preserving the model’s original knowledge.

In essence, Locas offers a practical, theoretically grounded solution for continual learning at inference time, bridging the gap between non‑parametric in‑context learning and heavyweight test‑time fine‑tuning. By leveraging the model’s own representations for memory initialization, it achieves rapid convergence, high memory capacity, and minimal interference with pre‑existing knowledge, marking a significant step forward in efficient, scalable test‑time adaptation of large language models.

