Trained Persistent Memory for Frozen Encoder--Decoder LLMs: Six Architectural Methods
Authors: Hong Jeong
Hong Jeong, Inha University in Tashkent, Uzbekistan (hjeong@postech.ac.kr)

Abstract

Frozen encoder–decoder language models are stateless: the latent representation is discarded after every forward pass, so no information persists across sessions. This paper presents a proof-of-concept pilot study showing that persistent memory in the continuous latent space of a frozen LLM is feasible, even under severe resource constraints (a single frozen Flan-T5-XL backbone, small trainable adapters, a single dataset). We implement six architectural methods spanning three injection points and four write mechanisms; unlike text-level memory systems, every write and read is a differentiable operation on dense vectors. After training only the adapter, the memory bank continues to accumulate at inference time without gradients, enabling conversational learning. Under a forgetting-curve evaluation on LoCoMo at two capacity scales (1× and 10×), the stateless baseline scores exactly zero; at 10× all six trained adapters produce positive memory-recall curves; at 1× three methods collapse, revealing capacity as a critical design parameter. Because the memory bank is a compact numerical array, it can be scaled to arbitrarily large capacity without altering the backbone. We argue that full end-to-end training with larger models, larger data, and orders-of-magnitude larger memory will yield substantially stronger results; this pilot study establishes the feasibility baseline and design-space taxonomy that such efforts require.

1 Introduction

Consider a frozen encoder–decoder model such as Flan-T5 built on a T5-style backbone [1]. The forward pass is:

Z_t = E_frozen(x_t),  ŷ_t = D_frozen(Z_t),  (1)

where E_frozen and D_frozen are fixed pre-trained weights, x_t is the input at turn t, Z_t ∈ ℝ^{n×d} is the encoder output, and ŷ_t is the generated text.
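As a minimal sketch of Eq. (1), with toy NumPy stubs standing in for the frozen networks (a fixed random projection for E_frozen and a scalar readout for D_frozen; none of this is the actual Flan-T5 computation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8  # toy sequence length and hidden size

# Hypothetical stand-ins for the frozen networks.
W_E = rng.standard_normal((d, d))

def E_frozen(x):
    """Frozen encoder stub: x (n, d) token embeddings -> latent Z (n, d)."""
    return x @ W_E

def D_frozen(Z):
    """Frozen decoder stub: reduce the latent to a scalar 'output'."""
    return Z.sum()

# Eq. (1): each turn is processed independently.
for t in range(3):
    x_t = rng.standard_normal((n, d))
    Z_t = E_frozen(x_t)
    y_t = D_frozen(Z_t)
    # No state survives this loop iteration: Z_t is discarded, so the
    # model is stateless across turns.
```

The point of the sketch is only that nothing computed at turn t is available at turn t+1; the persistent-memory variants below change exactly this.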
This system is stateless: Z_t is discarded after each forward pass and the model has no recollection of previous turns. If a user says "I like reading" in session 1 and asks "What do I like?" in session 3, the model cannot answer: there is no state that survives across sessions. This inter-session memory problem is the concrete target of persistent memory. Existing long-term memory solutions such as MemGPT and MemoryBank operate at the text level: they store, summarize, and retrieve natural-language passages through an external database. This paper works at a fundamentally different level, the latent space of the frozen model. The memory bank P_t ∈ ℝ^{n_P×d} holds continuous encoder representations, not strings, so writing and reading are differentiable operations embedded inside the forward pass rather than pre- or post-processing steps around it.

A persistent bank built directly from frozen encoder outputs is not enough. The decoder's cross-attention was pre-trained to read current encoder states, not arbitrarily accumulated cache states, so a naive memory path tends to dilute attention rather than sharpen retrieval as history grows. Recent attention-coupled latent-memory work shows that learned structure inside the memory pathway can instead induce functional specialization and controlled routing [6, 7]. For frozen encoder–decoder LLMs, this suggests that a small trainable adapter is the minimal mechanism needed to write memory in a form the frozen decoder can use. This paper takes the necessary next step: we allow training for a small memory adapter θ_Mem while keeping both encoder and decoder frozen. We augment the stateless system with a persistent memory bank P_t ∈ ℝ^{n_P×d} that persists across turns and sessions:

Z_t = E_frozen(x_t),  P_t = Write(P_{t−1}, Z_t),  ŷ_t = D_frozen(Read(Z_t, P_{t−1})).  (2)

The Write operation updates P from the current latent; the Read operation injects historical context from P into the decoder. The learned parameters θ_Mem enable the adapter to learn how to write memory in a format that the frozen decoder's existing cross-attention can discriminate; this is the capability that zero-training methods provably lack.

This paper is deliberately designed as a low-budget pilot study: we use a single frozen backbone, a single evaluation dataset, and minimal adapter parameters. The goal is not to achieve state-of-the-art recall but to demonstrate feasibility: that persistent latent-space memory can be installed in an existing frozen LLM with inexpensive adapters, and that the resulting system exhibits non-trivial, capacity-dependent memory behaviour. Full end-to-end training with unfrozen, larger-scale LLMs, bigger datasets, and memory banks orders of magnitude larger than ours lies beyond the scope of this pilot but is the natural industrial-scale follow-up that our results motivate.

Empirically, even under these constrained conditions the six designs separate clearly. At 10× capacity all trained adapters rise above the zero baseline on the forgetting curve; at 1× three methods collapse. M.2 XAttn and M.6 Slot dominate at low capacity, while M.4 Hebbian leads at high capacity, revealing memory-bank size as a critical design parameter. After training, θ_Mem is frozen but P_t continues to accumulate at inference time without gradients. We call this conversational learning: each new session enriches P, so a fact stated in session 1 ("I am John Doe") can be recalled in session 10 ("Who am I?") without re-stating it and without million-token context windows; the relevant inter-session history is compressed into P. The analogy to human cognition is deliberate.
Human brains accumulate knowledge through complementary memory systems [8]: episodic memory records specific events ("Alice mentioned a trip to Paris"), semantic memory distills general facts ("Alice likes travel"), procedural memory encodes how-to skills, and working memory holds the active conversational context. All of these are linked through associative retrieval: a cue in working memory can trigger recall from any long-term store. The persistent memory bank P in our framework plays an analogous role: the write rule determines what is stored (episodic vs. semantic), the read rule implements associative retrieval, and the capacity of P constrains how much can be retained, mirroring the interplay of encoding, consolidation, and retrieval in biological memory. A successful method must demonstrate not merely storage but remembering (retrieving the right fact), generalisation (answering questions phrased differently from the original statement), and abstraction (combining multiple facts into a coherent response): the hallmarks of genuine learning from experience.

The six methods differ along three orthogonal design dimensions: (i) where P enters the forward pass (before the encoder, between encoder and decoder, or inside the decoder), (ii) how P is written (attention-coupled update, Hebbian outer product, gated cross-attention, or sparse slot addressing), and (iii) how many parameters are added (all modest relative to the 3B-parameter backbone).

A crucial constraint across all methods is that the frozen decoder is calibrated exclusively for encoder outputs. Let M_E denote the set of representations the encoder can produce; any method that replaces Z with H ∉ M_E will degrade the decoder. All six methods preserve the frozen encoder–decoder route through Z_t and ensure that memory influence enters through controlled, learnable pathways.

Our core evaluation is simple: how much does each method remember?
Every condition, baseline and all six memory methods, sees only the current turn x_t; no method receives the conversation history. The baseline is deliberately short-sighted and retains nothing. Each memory method must accumulate facts into P through its write rule and retrieve them through its read path. By probing the same factual questions across increasing evidence lag, we measure a forgetting curve: the fraction of available headroom that the method's persistent state fills, normalised so that 100% means memory brings F1 to the gold standard and 0% means memory adds nothing. The stateless baseline has no persistent state and is therefore identically zero; stronger methods start higher at short lag and decay more slowly as the evidence recedes into the past.

A secondary question is whether the frozen decoder's cross-attention queries, trained only on encoder outputs, have enough representational slack to attend usefully to memory entries projected by the trained adapter. If not, even perfectly trained θ_Mem will not help, implying that the frozen decoder itself is the bottleneck.

Contributions.

1. Latent-space persistent memory. We formulate the problem of adding persistent memory that lives entirely in the continuous latent space of a frozen encoder–decoder LLM. Unlike text-level memory systems (MemGPT, MemoryBank) that store and retrieve natural-language strings outside the model, our memory bank P_t ∈ ℝ^{n_P×d} holds dense encoder representations; every write and read is a differentiable operation inside the forward pass.

2. Six architectural methods. We design, implement, and release six trained memory adapters that span three injection points (before the encoder, between encoder and decoder, inside the decoder) and four write mechanisms (attention-coupled update, Hebbian outer product, gated cross-attention, sparse slot addressing).
All methods keep every encoder and decoder weight frozen and add only a small set of learnable parameters θ_Mem.

3. Headroom-normalised forgetting-curve evaluation. We introduce an evaluation protocol that measures the fraction of available answer-quality headroom filled by a method's persistent state, as a function of evidence lag. The metric is normalised to a 0–100% scale (100% = perfect recall, 0% = no memory contribution), giving intuitive and comparable scores. The stateless baseline is identically zero by construction.

4. Empirical findings. Under this protocol on LoCoMo, we test at two capacity scales (1× and 10×). At 10× all six trained adapters produce positive memory-recall curves; at 1× three methods collapse, revealing capacity as a critical design parameter. M.2 XAttn and M.6 Slot dominate at low capacity; M.4 Hebbian leads at high capacity. Knowledge accumulation curves confirm that the best methods steadily accumulate facts over 30 sessions.

2 Related Work

Persistent memory is adjacent to, but distinct from, several existing lines of work. Application-level long-term memory systems such as MemGPT [2] and MemoryBank [3] demonstrate that explicit memory can improve LLM behaviour over extended interactions, but they operate at the text level: facts are stored as natural-language strings, retrieval is a search over those strings, and the language model itself is unchanged. LoCoMo [5] provides a public benchmark targeted specifically at very long-term conversational memory and motivates the multi-session evaluation setting adopted here. The present paper operates at a fundamentally different level, the latent space of the frozen model. Our persistent memory bank stores continuous encoder representations, not text; reading and writing are differentiable operations inside the forward pass rather than pre- or post-processing steps.
Rather than proposing a single end-to-end memory agent, we define a taxonomy of six architectural alternatives for latent-space persistent memory, formalize their read and write paths, and specify a public-dataset protocol for comparing them against a stateless baseline under a shared released backbone.

Parameter-efficient adaptation. Several of our six methods adapt ideas originally proposed for parameter-efficient fine-tuning or memory-augmented architectures. Prefix tuning [11] prepends learnable soft tokens to the input; M.1 extends this idea to a persistent memory bank. Flamingo [12] inserts gated cross-attention layers into a frozen decoder for visual grounding; M.2 and M.5 adopt the same parallel-branch topology for memory injection. Memorizing Transformers [13] extend the decoder KV cache with retrieved past representations; M.3 follows the same KV concatenation principle. Linear Transformers and fast weight programmers [14] accumulate an outer-product associative matrix updated at every step; M.4 uses the same Hebbian write rule. Neural Turing Machines [15] maintain addressable memory slots with content-based sparse writes; M.6 inherits this slot-addressing mechanism. Crucially, none of these prior methods were designed for persistent latent-space memory that accumulates across sessions inside a frozen encoder–decoder model. Our contribution is not the individual read or write primitive but the controlled comparison, under a single frozen backbone and a common forgetting-curve evaluation, of how these primitives perform as latent-space persistent memory for inter-session recall.

Attention-coupled latent memory. Recent arXiv work on attention-coupled latent memory explores richer structured bank dynamics than the minimal adapters studied here.
Inhibitory cross-talk across paired banks can drive functional lateralization [6], while a miniature brain-transformer architecture adds thalamic gating, amygdaloid salience, hippocampal lateralization, and prefrontal working memory to shape routing and consolidation [7]. The present paper is narrower: it compares six simpler trainable memory adapters under a fixed frozen encoder–decoder backbone and a common forgetting-curve evaluation.

Cognitive memory systems. Our taxonomy is informed by the cognitive science of human memory. Tulving's distinction between episodic and semantic memory [8] maps onto our design choices: methods with high-capacity slot banks (M.6) store episode-like snapshots, while Hebbian associative memory (M.4) naturally distils co-occurrence statistics akin to semantic memory. The content-gated method (M.5) resembles working-memory gating, selectively admitting relevant information. Modern complementary learning systems theory [9, 10] argues that fast episodic binding and slow semantic consolidation are both necessary; our experiment measures whether any single-mechanism method suffices or whether the task demands a composite architecture.

3 Problem Setting and Notation

Figure 1 shows the stateless frozen encoder–decoder pipeline used as the control architecture throughout the paper.

[Diagram: x_t → E_frozen → Z_t → D_frozen → ŷ_t; the latent is ephemeral, discarded after each turn.]

Figure 1: Frozen encoder–decoder baseline used as the stateless control. The latent representation is consumed within the current turn and then discarded.

The encoder maps x_t to Z_t ∈ ℝ^{n×d} (sequence length n, hidden dimension d). The decoder uses cross-attention to read Z_t at every layer. Both E_frozen and D_frozen are frozen; only the persistent memory P and (optionally) a small set of adapter parameters θ_Mem are modified.
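The stateless control differs from the persistent variants of Eq. (2) only in whether a bank P survives the loop. A minimal NumPy sketch with placeholder read/write rules (the six methods of Sec. 4 differ exactly in how these two functions are defined; the stubs and shapes below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_P = 4, 8, 6  # toy sequence length, hidden size, bank rows

def E_frozen(x):
    """Identity stand-in for the frozen encoder."""
    return x

def D_frozen(Z_aug):
    """Scalar stand-in for the frozen decoder."""
    return Z_aug.mean()

def write(P, Z, gamma=0.95):
    """Placeholder write rule: decayed blend of the bank with the current
    latent (np.resize tiles Z to the bank's shape)."""
    upd = np.resize(Z, P.shape)
    return gamma * P + (1 - gamma) * upd

def read(Z, P):
    """Placeholder read rule: expose the bank as extra context rows."""
    return np.concatenate([Z, P], axis=0)

P = np.zeros((n_P, d))            # persistent bank: survives across turns
for t in range(3):
    Z = E_frozen(rng.standard_normal((n, d)))
    y = D_frozen(read(Z, P))      # ŷ_t = D_frozen(Read(Z_t, P_{t-1}))
    P = write(P, Z)               # P_t = Write(P_{t-1}, Z_t)
```

Unlike the stateless loop, P carries information from turn t into turn t+1, which is the entire effect the evaluation in Sec. 6 tries to measure.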
4 Trained Alternatives

These methods introduce a small set of learnable parameters θ_Mem trained via backpropagation; at inference time (Type 2), P continues to accumulate without gradients. Every method uses content-based addressing: deciding what to store or retrieve based on the semantic content of the current turn rather than fixed positions. Table 1 gives the complete read and write operations for all six methods; the baseline (M.0) is included for reference. Delegated-read methods (M.1, M.3, M.6) project all of P into the decoder's KV cache and let the frozen cross-attention select relevant entries. Explicit-read methods (M.2, M.4, M.5) perform their own content-based retrieval before passing the result to the decoder. This distinction has implications for trainability: explicit-read methods introduce more parameters but give the adapter direct control over retrieval; delegated-read methods rely on the decoder's pre-trained attention patterns, which are already tuned for cross-attention selection.

Table 1: Read and write operations for each method. A = softmax(Z_t W_Q (P W_K)ᵀ / √d). Delegated read means the frozen decoder cross-attention selects from the concatenated KV; explicit read means the adapter performs its own retrieval before passing to the decoder.

Method | Write rule (P_t ←) | Read rule (inject into decoder) | Type
0 Baseline | — | D(Z_t) | —
1 Prefix | γP + Aᵀ_t V, V = Z_t W_V | [Z_t; P W_P] → D (extra KV) | Delegated
2 XAttn | γP + Aᵀ_t V, V = Z_t W_V | softmax(s W^m_Q (P W^m_K)ᵀ / √d)(P W^m_V) | Explicit
3 KV Extension | γP + Aᵀ_t V, V = Z_t W_V | K = [K_Z; P W_{K,m}], V = [V_Z; P W_{V,m}] | Delegated
4 Hebbian | γM + (Z_t W_{K,H})ᵀ(Z_t W_{V,H}) | (Z_t W_{Q,H}) M → W_Mem → D (extra KV) | Explicit
5 Gated | γP + Aᵀ_t V, V = Z_t W_V | g_t ⊙ XAttn(s, P), g_t = σ(W_g [s; c] + b_g) | Explicit
6 Slot | top-k: γP[s] + (1−γ) z̄_t W_u | K = [K_Z; P W_{K,m}], V = [V_Z; P W_{V,m}] | Delegated

4.1 M.1: Memory as Encoder-Input Prefix

Persistent memory is compressed into m soft tokens and prepended to the encoder input, extending the prefix-tuning idea [11] from static task prompts to a dynamic, accumulating memory bank. The encoder integrates memory and current input through self-attention and produces a valid Z ∈ M_E. The decoder remains entirely untouched. Concretely,

S_t = U_P P_{t−1} W_P ∈ ℝ^{m×d},  (3)
x̃_t = [S_t; x_t] ∈ ℝ^{(m+n)×d}, where S_t supplies the m soft tokens,  (4)
Z̃_t = E_frozen(x̃_t),  Z_t = Z̃_t[m:],  (5)
ŷ_t = D_frozen(Z_t),  (6)

where U_P ∈ ℝ^{m×n_P} mixes the n_P memory rows into m prefix slots and W_P ∈ ℝ^{d×d} is a learnable feature projection. If n_P = m, one may set U_P = I, reducing the prefix to S_t = P_{t−1} W_P. Memory is updated via an attention-coupled write rule that injects the current turn's content into the memory bank:

Q = Z_t W_Q,  K = P_{t−1} W_K,  V = Z_t W_V,  A_t = softmax(QKᵀ/√d),  P_t = γ P_{t−1} + Aᵀ_t V.  (7)

Q and K perform content-based addressing between the current latent Z_t and the existing memory P_{t−1}. Crucially, values V are drawn from Z_t, not from P; Aᵀ_t V aggregates the current turn's content into n_P memory rows, weighted by the addressing scores. This ensures that new information enters memory at every turn; without it, an all-zero initialisation would remain zero indefinitely. Figure 2 illustrates how memory is projected into a soft prefix before the frozen encoder, with the current latent driving a write-back update.

[Diagram: P_{t−1} → S_t = U_P P_{t−1} W_P → [S_t; x_t] → E_frozen → Z_t → D_frozen → ŷ_t, with Aᵀ_t V written back into P_t; Z_t ∈ M_E.]

Figure 2: M.1 injects persistent memory as an encoder-input prefix and writes the current latent back into memory through an attention-coupled update. The trainable read-side parameters are {W_P}; the write-side projections {W_Q, W_K, W_V} and decay γ are frozen (Sec. 5).
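As a concrete illustration, the read path of Eqs. (3)–(5) and the write rule of Eq. (7) can be sketched in NumPy with toy dimensions (the frozen encoder is replaced by an identity stand-in, and all shapes are illustrative, not the actual Flan-T5 sizes):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, n_P, m = 5, 8, 6, 3     # toy: seq len, hidden, bank rows, soft tokens
gamma = 0.95

# Trainable read-side mixer and projection (hypothetical shapes, Eq. (3)).
U_P = rng.standard_normal((m, n_P)) * 0.02
W_P = rng.standard_normal((d, d)) * 0.02
# Frozen write-side random projections (Eq. (7)).
W_Q, W_K, W_V = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

P = rng.standard_normal((n_P, d))   # current bank P_{t-1}
x = rng.standard_normal((n, d))     # current-turn embeddings

# Read path, Eqs. (3)-(5): compress P into m soft tokens, prepend, encode,
# then drop the first m rows so the decoder still sees a valid Z.
S = U_P @ P @ W_P                    # (m, d) soft prefix
x_tilde = np.concatenate([S, x], axis=0)
Z_tilde = x_tilde                    # identity stand-in for E_frozen
Z = Z_tilde[m:]                      # (n, d)

# Write path, Eq. (7): values come from Z_t, so new content always enters.
A = softmax((Z @ W_Q) @ (P @ W_K).T / np.sqrt(d))   # (n, n_P) addressing
P_new = gamma * P + A.T @ (Z @ W_V)
```

Note that even with P initialised to zero, A.T @ (Z @ W_V) is generally non-zero, which is exactly the property the surrounding text emphasises.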
Under Type 1 training, θ_Mem is optimised via ∇_{θ_Mem} L; under Type 2, P_t accumulates at inference with θ_Mem frozen.

4.2 M.2: Parallel Decoder Cross-Attention

A parallel cross-attention layer is inserted in each decoder block to attend to P independently of the frozen pathway, following the Flamingo architecture [12], which showed that interleaved cross-attention can inject external information into a frozen LM. The original Z route is untouched; memory influence is additive via a zero-initialised coefficient. This method requires source-level access to the decoder blocks of an open-weight model. Here frozen means that the original decoder weights are not updated; it does not mean that intermediate layers are inaccessible. At decoder layer ℓ with hidden state s^(ℓ)_t:

c^(ℓ)_Mem = XAttn^(ℓ)_Mem(s^(ℓ)_t, P_{t−1}),  (8)
s^(ℓ)′_t = s^(ℓ)_t + XAttn^(ℓ)_frozen(s^(ℓ)_t, Z_t) + β^(ℓ) c^(ℓ)_Mem, with β^(ℓ) initialised to 0.  (9)

At initialisation β^(ℓ) = 0, so the model falls back exactly to the frozen baseline. Memory is updated with the same attention-coupled write rule as M.1:

Q = Z_t W_Q,  K = P_{t−1} W_K,  V = Z_t W_V,  P_t = γ P_{t−1} + Aᵀ_t V,  (10)

where A_t = softmax(QKᵀ/√d) and V is sourced from Z_t so that new content enters memory regardless of P's current state. Figure 3 illustrates the frozen cross-attention path running in parallel with the additive memory branch.

[Diagram: E_frozen(x_t) → Z_t → XAttn_frozen(s, Z_t), in parallel with β^(ℓ)·XAttn_Mem(s, P_{t−1}); s′ = s + frozen + mem; the Z_t pathway is untouched.]

Figure 3: M.2 preserves the frozen cross-attention route over the current encoder latent and adds a parallel decoder memory branch scaled by a learned coefficient. The per-layer parameters are θ_Mem = {W^Mem_Q, W^Mem_K, W^Mem_V, O_Mem, β^(ℓ)}, totalling ~16.8M (0.6%); the cross-attention projections are shared across layers and β^(ℓ) adds one scalar per layer.

Implementation note.
Because injecting into each frozen decoder block requires per-layer hooks, our implementation approximates the per-layer read by computing XAttn_Mem once, using Z_t as a proxy for internal decoder states, and blending with β̄ = mean(β^(ℓ)). The result is passed as additional encoder positions, so the frozen decoder's own per-layer projections provide implicit layer specialisation.

4.3 M.3: Decoder KV Extension

Persistent memory is projected into additional key–value pairs that are concatenated alongside Z in the decoder's cross-attention, following the KV-extension principle of Memorizing Transformers [13], and leaving the original Z positions byte-for-byte preserved. A shared zero-initialised projection maps P into pseudo-encoder hidden states; the frozen decoder then applies its own per-layer W^(ℓ)_K, W^(ℓ)_V to both the original Z positions and the memory extension:

H_Mem = P_{t−1} W_Mem ∈ ℝ^{n_P×d},  (11)
K^(ℓ) = W^(ℓ)_K [Z_t; H_Mem] ∈ ℝ^{(n+n_P)×d_k},  (12)
V^(ℓ) = W^(ℓ)_V [Z_t; H_Mem] ∈ ℝ^{(n+n_P)×d_v},  (13)
ŷ_t = D_frozen(Q^(ℓ), K^(ℓ), V^(ℓ)).  (14)

Queries Q^(ℓ) use the frozen W^(ℓ)_Q. Zero-initialised W_Mem ensures no-regression at init; the frozen per-layer projections provide implicit layer specialisation without extra learned parameters. The write rule is the same attention-coupled update as M.1 (Eq. (7)):

Q = Z_t W_Q,  K = P_{t−1} W_K,  V = Z_t W_V,  P_t = γ P_{t−1} + softmax(QKᵀ/√d)ᵀ V.  (15)

Figure 4 shows decoder keys and values extended with memory-derived entries while the current encoder positions remain unchanged.

[Diagram: E_frozen(x_t) → Z_t; P_{t−1} → K = [K_Z; K_P], V = [V_Z; V_P] → D_frozen(Q, K, V); Z positions unchanged.]

Figure 4: M.3 extends decoder keys and values with learned projections of persistent memory while keeping the current encoder tokens on their original path.
The trainable read-side parameter is W_Mem ∈ ℝ^{d×d}; the write-side projections {W_Q, W_K, W_V} and decay γ are frozen (Sec. 5). Total added: ~4.2M (0.1%).

4.4 M.4: Hebbian / Associative Memory

An outer-product Hebbian rule, the same write primitive used in linear transformers and fast weight programmers [14], accumulates associative structure in a matrix M_t, and the full read path is made explicit by injecting the recalled memory through decoder KV extension. This yields a complete, experimentally realizable architecture rather than a generic "inject somehow" formulation. Let d_h denote the associative-memory dimension:

M̃_t = γ M_{t−1} + (1/n)(Z_t W_{K,H})ᵀ(Z_t W_{V,H}),  M_t = M̃_t / max(∥M̃_t∥_F, 1) ∈ ℝ^{d_h×d_h},  (16)
R_t = (Z_t W_{Q,H}) M_{t−1} ∈ ℝ^{n×d_h},  (17)
H_Mem = R_t W_Mem ∈ ℝ^{n×d},  (18)
ŷ_t = D_frozen(Q^(ℓ), W^(ℓ)_K [Z_t; H_Mem], W^(ℓ)_V [Z_t; H_Mem]).  (19)

Figure 5 shows the Hebbian matrix being read by the current latent and exposed to the decoder as extra key–value memory.

[Diagram: E_frozen(x_t) → Z_t; M_{t−1} → R_t = (Z_t W_{Q,H}) M_{t−1} → [K_Z; K_M], [V_Z; V_M] → D_frozen; M_t = norm(γM + (1/n)(Z W_{K,H})ᵀ(Z W_{V,H})).]

Figure 5: M.4 stores associative structure in a Hebbian memory matrix that is queried by the current latent and passed to the decoder as additional memory. The trainable parameters are θ_Mem = {W_{Q,H}, W_Mem}; the Hebbian write projections {W_{K,H}, W_{V,H}} and decay γ are frozen (Sec. 5).

W_Mem (the read projection from d_h back to d) is zero-initialised so that the memory branch is silent at startup (safe startup). W_{Q,H} uses small random initialisation, N(0, 0.02): because W_Mem is zero, the gradient signal for W_{Q,H} comes from the loss through W_Mem once it becomes non-zero, and conversely the gradient for W_Mem requires R_t ≠ 0, i.e. W_{Q,H} ≠ 0. Zero-initialising both W_{Q,H} and W_Mem would create a gradient deadlock.

4.5 M.5: Context-Gated Decoder Memory Branch

Rather than modifying encoder outputs, a lightweight memory branch is inserted inside the decoder, using a content-dependent gate inspired by Flamingo's tanh-gated cross-attention [12]. The branch reads from P, and the gate controls how strongly its output affects the decoder hidden state:

c^(ℓ)_Mem = XAttn^(ℓ)_Mem(s^(ℓ)_t, P_{t−1}),  (20)
g^(ℓ)_t = σ(W^(ℓ)_g [s^(ℓ)_t; c^(ℓ)_Mem] + b^(ℓ)_g),  (21)
s^(ℓ)′_t = s^(ℓ)_t + XAttn^(ℓ)_frozen(s^(ℓ)_t, Z_t) + g^(ℓ)_t ⊙ c^(ℓ)_Mem.  (22)

With b^(ℓ)_g < 0 at initialisation, the memory branch starts nearly off, so the model initially behaves like the frozen baseline and only later opens the memory pathway where helpful. Figure 6 illustrates the decoder-side branch whose contribution is controlled by the learned context gate.

[Diagram: E_frozen(x_t) → Z_t → XAttn_frozen(s, Z_t); P_{t−1} → XAttn_Mem(s, P) → g_t ⊙ c_Mem → s′; g → 0 ⇒ decoder path unchanged.]

Figure 6: M.5 adds a memory read branch inside the decoder and lets a learned gate decide when the auxiliary memory signal should influence the frozen path. The trainable parameters are θ_Mem = {W^Mem_Q, W^Mem_K, W^Mem_V, W_g, b_g}; the write-side projections {W_Q, W_K, W_V} and decay γ are frozen (Sec. 5).

Implementation note. As with M.2, the per-layer decoder injection is approximated by computing the gated memory read once, with Z_t as proxy for decoder hidden states, and passing the result as additional encoder positions. The write rule is the same attention-coupled update as M.1 (Eq. (7)):

Q = Z_t W_Q,  K = P_{t−1} W_K,  V = Z_t W_V,  P_t = γ P_{t−1} + softmax(QKᵀ/√d)ᵀ V.  (23)

4.6 M.6: Slot-Based Memory with Sparse Write

Memory is organised as S fixed-size slots P ∈ ℝ^{S×d}, adopting the addressable-slot design of Neural Turing Machines [15].
At each turn, only the top-k addressed slots are updated; the read path is an explicit decoder KV extension:

z̄_t = (1/n) Σ_{i=1}^{n} Z_t[i, :],  (24)
a_t = softmax(z̄_t W_a P_{t−1}ᵀ / √d) ∈ ℝ^S,  (25)
m_t[s] = 1[s ∈ top-k(a_t, k)],  (26)
u_t = z̄_t W_u ∈ ℝ^d,  (27)
P_t[s] = (1 − m_t[s]) P_{t−1}[s] + m_t[s](γ P_{t−1}[s] + (1 − γ) u_t),  (28)
H_Mem = P_t W_Mem,  (29)
K^(ℓ) = W^(ℓ)_K [Z_t; H_Mem],  V^(ℓ) = W^(ℓ)_V [Z_t; H_Mem].  (30)

Figure 7 shows sparse writes into a fixed slot bank that is later read by the decoder as structured episodic memory.

[Diagram: Z_t addresses slots s_1 … s_8 of P_{t−1}; only the top-k addressed slots are overwritten this turn; the decoder reads all slots via KV extension.]

Figure 7: M.6 writes to a fixed set of memory slots sparsely and exposes the slot bank to the decoder as an explicit episodic memory. The trainable read-side parameter is W_Mem ∈ ℝ^{d×d}; the write-side projections {W_a, W_u} and decay γ are frozen (Sec. 5).

5 Training and Inference

Each of the six methods introduces learnable parameters θ_Mem trained while both encoder and decoder remain frozen. Training proceeds in two phases.

Gradient flow through frozen networks. "Frozen" means that E and D receive no weight updates; it does not mean that gradients cannot flow through them. Both networks act as fixed differentiable functions: in the backward pass the chain rule propagates the loss gradient through the frozen decoder (and, for M.1, also through the frozen encoder) to reach θ_Mem. This is the same principle underlying prefix tuning, LoRA, and adapter methods: the only parameters that receive gradient updates are those belonging to the memory adapter. The frozen backbone's role is to provide both a fixed forward computation and a fixed gradient signal that the adapter must learn to exploit.
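The claim that frozen weights still transmit gradients can be checked on a toy linear stand-in: the "decoder" below is a fixed matrix that is never updated, yet the chain rule through it yields the correct gradient for the adapter parameter, verified against a finite difference. Everything here (shapes, the linear maps) is an illustrative assumption, not the paper's actual computation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# Frozen "decoder": a fixed linear map. Frozen = never updated, but it
# still defines a gradient that the chain rule passes through.
W_D = rng.standard_normal((d, d))

# Trainable adapter parameter (a read-side projection of one memory row).
W_mem = np.zeros((d, d))

Z = rng.standard_normal(d)   # current latent (stand-in)
P = rng.standard_normal(d)   # one memory row (stand-in)
y = rng.standard_normal(d)   # target

def loss(W):
    h = Z + P @ W            # Read(Z, P) with adapter W
    out = h @ W_D            # frozen decoder
    return 0.5 * np.sum((out - y) ** 2)

# Analytic gradient w.r.t. the adapter only: dL/dW = P^T ((out - y) W_D^T)
h = Z + P @ W_mem
r = h @ W_D - y
grad = np.outer(P, r @ W_D.T)

# Finite-difference check on one entry: the frozen map transmits exactly
# this gradient to the adapter.
eps = 1e-6
dW = np.zeros_like(W_mem); dW[2, 3] = eps
fd = (loss(W_mem + dW) - loss(W_mem - dW)) / (2 * eps)
```

Only W_mem would ever be updated by an optimiser here; W_D stays fixed, which is the whole content of the "frozen but differentiable" point.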
In the first phase (Type 1: supervised learning), gradients flow from the decoder loss through the read pathway and update θ_Mem, thereby learning how to read and write memory effectively:

θ_Mem ← θ_Mem − η ∇_{θ_Mem} L(D_frozen(Read(Z_t, P)), y_t).  (31)

Write projections as fixed random maps. During Type 1 training the write rule P_t = Write(P_{t−1}, Z_t) executes without gradients to prevent the computation graph from growing across the full conversation history. Consequently, the write-side projections (W_Q, W_K, W_V in most methods) receive no gradient updates and remain at their random initialisation throughout training. These projections act as fixed random maps that nonetheless preserve pairwise distances in the input (Johnson–Lindenstrauss property), ensuring that distinct encoder outputs produce distinguishable memory entries. The read-side parameters (W_P, W_Mem, cross-attention heads, gates) are therefore the only parameters that receive gradient updates; they learn to decode whatever structure the fixed write rule deposits in P. Excluding the write-side projections from the optimiser prevents unnecessary weight decay and avoids misleading parameter counts.

In the second phase (Type 2: conversational learning), θ_Mem is frozen but P_t continues to accumulate. Each conversation enriches P, improving the system's responses without any gradient computation:

P_t = Write(P_{t−1}, Z_t) with θ_Mem fixed.  (32)

This is the mechanism we call conversational learning: the system becomes more knowledgeable and personalised with every conversation, exactly through repeated online updates while the frozen model weights remain fixed. Figure 8 contrasts the two phases.

[Diagram: frozen E, D plus memory adapter; Type 1 trains θ_Mem by backpropagation (∇_θ L updates θ_Mem, offline); Type 2 updates P_t from Z_t at every turn with no gradients.]

Figure 8: Two learning phases. Type 1 updates memory adapter parameters offline by backpropagation, whereas Type 2 updates the persistent memory online at each turn with frozen model weights.

Table 2 consolidates the six methods along key design dimensions.

Table 2: Six trained persistent-memory methods compared across key design dimensions. "Primary-path safe" means the original frozen encoder–decoder route through Z_t is preserved, even if an auxiliary decoder-side memory branch is added. "Memory cost" is per turn.

Method | Injection point | Primary-path safe | New params | Mem. cost | Write mechanism
1 Encoder-input prefix | Before E | ✓ | 4.2M | const. | Aᵀ_t V
2 Parallel decoder XAttn | Inside D | ✓ | 16.8M | const. | Aᵀ_t V
3 Decoder KV extension | Inside D xattn | ✓ | 4.2M | const. | Aᵀ_t V
4 Hebbian / associative | Decoder KV | ✓ | 1.0M | O(d_h²) | Hebbian outer prod.
5 Context-gated decoder branch | Inside D | ✓ | 21.0M | const. | Aᵀ_t V
6 Slot-based sparse write | Decoder KV | ✓ | 4.2M | O(S d) | Top-k overwrite

6 Evaluation

The preceding sections specify how each method reads and writes persistent memory; this section specifies what we measure and how we measure it.

6.1 Forgetting-Curve Hypothesis

If persistent memory genuinely converts a stateless model into one that retains conversation-specific information, the effect should be measurable as a function of how far in the past the supporting evidence was written. We therefore formalise evaluation around a forgetting curve rather than a bag of unrelated scalar metrics. For a question q asked after T conversational turns, let E_q denote the set of supporting evidence turns and define the evidence lag by the oldest required support:

ℓ_q = T − min(E_q).  (33)

For each trained method m, we compare two answers to the same question: one with the learned persistent state intact, ŷ^mem_q(m), and one from the same trained model with its persistent memory state forced to zero, ŷ^0_q(m).
The memory recall rate is

    ρ_q(m) = max( 0, [F1(ŷ_q^mem(m), y_q) − F1(ŷ_q^0(m), y_q)] / max(1 − F1(ŷ_q^0(m), y_q), ε) ),    (34)

where y_q is the gold answer and ε > 0 is a small constant that prevents division by zero. The numerator is the F1 gain attributable to persistent memory; the denominator is the headroom, i.e. the maximum possible gain given the zero-memory baseline of that model. The score is therefore 100% when memory brings the answer to perfect F1, and 0% when memory adds nothing.

For the stateless baseline, there is no persistent state to ablate, so the memory recall rate is identically zero for every question:

    ρ_q(baseline) = 0.    (35)

For an effective stateful system, the memory recall rate should be largest at short lag and should gradually decrease as the relevant evidence lies further in the past:

    ρ_q(m) should decrease as ℓ_q increases.    (36)

The central experimental prediction is therefore a family of forgetting curves: the baseline is flat at zero by construction, while stronger trained persistent-memory methods start higher at short lag and decay more slowly at long lag. Without learned memory control, recall-rate curves collapse toward zero in the stateless limit; the test is whether trained adapter parameters θ_Mem can produce large and durable positive curves.

6.2 Benchmarks and Protocol

Equal-input principle. To isolate the effect of persistent memory, every condition—baseline and all six memory methods—receives exactly the same encoder input x_t at each turn: the current conversational turn only. No method is given the full history [x_0, x_1, ..., x_t]. The baseline is therefore intentionally short-sighted: it encodes only the present turn and decodes an answer with no access to prior sessions. Memory methods receive the same x_t but additionally condition the decoder on the persistent state P_{t−1} accumulated from all earlier turns.
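A minimal sketch of the per-question scoring pipeline, combining the evidence lag of Eq. (33) with the headroom-normalised recall rate of Eq. (34). The whitespace-token F1 is a simplification of the usual SQuAD-style metric (no answer normalisation), and all function names are our own:

```python
from collections import Counter

def evidence_lag(total_turns, evidence_turns):
    """Eq. (33): lag from the oldest supporting turn to the end, l_q = T - min(E_q)."""
    return total_turns - min(evidence_turns)

def token_f1(pred, gold):
    """Token-overlap F1 (simplified: whitespace tokens, no normalisation)."""
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)

def memory_recall_rate(f1_mem, f1_zero, eps=1e-6):
    """Eq. (34): F1 gain from memory, normalised by the remaining headroom."""
    return max(0.0, (f1_mem - f1_zero) / max(1.0 - f1_zero, eps))

# A question asked after 300 turns whose oldest evidence is turn 12 has lag 288,
# landing in the hardest bucket, [256, inf).
print(evidence_lag(300, {12, 250}))      # -> 288
# Memory lifts a half-right answer to perfect: fills 100% of the headroom.
print(memory_recall_rate(1.0, 0.5))      # -> 1.0
# Memory adds nothing (or hurts): clamped to 0.
print(memory_recall_rate(0.3, 0.3))      # -> 0.0
```

The clamp at zero matters: a method whose memory actively degrades an answer scores 0 rather than negative, so the forgetting curve measures usable recall only.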
Any non-zero memory recall rate is thus attributable solely to the persistent state P.

All conditions use the same released frozen encoder–decoder backbone, the same tokenizer, and the same decoding rule. Each example is processed turn by turn: after every conversational turn the method-specific write rule updates P_t; when a question is asked, the model answers using the current query and the persistent memory state P_{t−1}. The stateless baseline uses the identical backbone but no persistent memory. Only θ_Mem is optimised on the public training split with a teacher-forced answer loss; the encoder and decoder remain frozen throughout. No method may use an external summarizer, a different retriever, or a different pretrained model.

Implementation details. The frozen backbone is Flan-T5-XL (3B parameters) in bfloat16. All memory adapters are trained with AdamW (learning rate 10⁻⁴, weight decay 10⁻², linear warmup of 200 steps, gradient norm clipped at 1.0) for 10 epochs with batch size 2 and gradient accumulation 8 (effective batch 16). Write-rule updates are detached from the computation graph; truncated backpropagation through time uses a window of k = 8 turns. Shared memory hyper-parameters are: bank size n_P = 64, write decay γ = 0.95. M.4 uses associative dimension d_h = 256; M.6 uses S = 64 slots with top-k = 8 writes per turn. All experiments use a single seed (42) and a single NVIDIA GPU.

The primary benchmark is LoCoMo [5], a long-term conversational-memory dataset with explicit QA supervision and annotated evidence turns. Those evidence annotations are the key ingredient for the rebuilt evaluation: they let us place every question at a precise lag in the past. We do not use MSC [4] as a primary score in this paper, because MSC lacks per-answer evidence locations and therefore cannot support a clean forgetting curve.
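The optimiser schedule from the implementation details above can be written down directly; note that the paper specifies only the 200-step linear warmup, so holding the rate constant afterwards is our assumption:

```python
def learning_rate(step, base_lr=1e-4, warmup_steps=200):
    """Linear warmup to base_lr over warmup_steps, then constant.
    (The constant tail is an assumption; the text states only the warmup.)"""
    return base_lr * min(1.0, step / warmup_steps)

print(learning_rate(0))     # start of warmup   -> 0.0
print(learning_rate(100))   # halfway through   -> 5e-05
print(learning_rate(5000))  # after warmup      -> 0.0001
```

With batch size 2 and gradient accumulation 8, the effective batch of 16 means the 200 warmup steps cover the first 3,200 training examples.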
6.3 Forgetting-Curve Evaluation

We evaluate one quantity only: the forgetting curve, expressed in terms of the headroom-normalised memory recall rate. For a LoCoMo QA pair q with gold answer y_q and annotated evidence turns E_q, let

    ℓ_q = T − min(E_q)    (37)

denote the lag in turns from the oldest required fact to the end of the conversation. For each method m, we compute two answers to the same question:

    ŷ_q^mem(m) = answer from method m after writing the conversation into P,    (38)
    ŷ_q^0(m)   = answer from the same trained method with its persistent state forced to zero.    (39)

The memory recall rate of question q is then

    ρ_q(m) = max( 0, [F1(ŷ_q^mem(m), y_q) − F1(ŷ_q^0(m), y_q)] / max(1 − F1(ŷ_q^0(m), y_q), ε) ).    (40)

The numerator is the F1 gain from persistent memory; the denominator is the headroom—the maximum possible gain given the zero-memory baseline. The score therefore lies on a 0–100% scale: 100% means memory brings the answer to perfect F1, 0% means memory adds nothing. This is not baseline-relative; it is a method-internal quantity: how much of the available improvement room is filled by that method's own memory.

The stateless baseline has no persistent state at all, so

    ρ_q(M.0) = 0 for all questions q.    (41)

We bucket lags into five ranges, [0, 32), [32, 64), [64, 128), [128, 256), and [256, ∞) turns, average ρ_q(m) inside each bucket, and then fit a weighted non-increasing isotonic curve across buckets. The isotonic fit suppresses finite-sample noise while preserving the forgetting prior that memory recall should not improve as lag increases. A stronger method therefore has a higher intercept at short lag and a slower downward decay.

6.4 Results

Figure 9 is the primary empirical result. We evaluate two memory-capacity scales: 1× (n_P = 64, d_h = 256, n_slots = 64) and 10× (n_P = 640, d_h = 810, n_slots = 640).
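The weighted non-increasing isotonic fit used above is the classical pool-adjacent-violators algorithm run with the ordering reversed. A self-contained sketch (the bucket means are illustrative; the weights are the per-bucket sample counts from Table 3):

```python
def isotonic_non_increasing(values, weights):
    """Weighted pool-adjacent-violators fit constrained to be non-increasing."""
    blocks = []  # each block: [pooled mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Pool while the previous block mean is *smaller* than the current one,
        # which violates the non-increasing constraint.
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w, c1 + c2])
    out = []
    for m, w, c in blocks:
        out.extend([m] * c)
    return out

# Illustrative bucket means (%) with the per-bucket sample counts from Table 3:
fit = isotonic_non_increasing([17.2, 13.9, 7.1, 7.5, 7.0], [28, 24, 62, 130, 395])
print([round(x, 2) for x in fit])  # -> [17.2, 13.9, 7.37, 7.37, 7.0]
```

The single violation (7.1 followed by 7.5) is pooled into one weighted mean, (7.1·62 + 7.5·130)/192 ≈ 7.37, so the fitted curve never rises with lag while the weighted mean of the pooled buckets is preserved.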
The short-lag height measures write effectiveness: how much conversation-specific information is available immediately after storage. The slope measures resistance to overwrite and interference: how slowly that information decays as the relevant evidence recedes into the past.

[Figure 9: two panels (1× and 10× capacity) plotting memory recall rate (%) against evidence-lag buckets 0–31, 32–63, 64–127, 128–255, and 256+ turns, with one curve per condition: Baseline, M.1 Prefix, M.2 XAttn, M.3 KV Ext, M.4 Hebbian, M.5 Gated, M.6 Slot.]

Figure 9: Forgetting curves on LoCoMo at two memory-capacity scales. Each point is the memory recall rate from Eq. (40), smoothed by a weighted non-increasing isotonic fit. Left (1×): three methods (M.1, M.3, M.5) collapse; M.2 XAttn and M.6 Slot dominate. Right (10×): all six methods produce non-trivial curves; M.4 Hebbian is strongest at long lag. Higher and flatter curves indicate stronger memory.

Interpretation. The evaluation isolates persistent memory itself rather than generic question-answering ability. A method receives credit only for answer quality that vanishes when its own persistent state is ablated, normalised by the headroom available to that method. The baseline is therefore exactly flat at zero, while trained memory methods show positive short-lag recall followed by gradual decay. Because the score is headroom-normalised, it sits on a 0–100% scale and directly measures what fraction of the remaining improvement room the persistent memory fills.

The figure separates two properties that were conflated by the previous multi-metric protocol. Short-lag height measures whether a method writes useful content into memory at all. Long-lag decay

Table 3: Memory recall rate (%) by lag bucket on LoCoMo (smoothed, non-increasing isotonic fit) at two capacity scales. The baseline is identically zero by construction.
n gives the number of QA pairs per bucket. Higher → stronger memory retention.

                      Lag bucket (turns)
Method          0–31    32–63   64–127  128–255  256+    Mean
n (samples)     28      24      62      130      395

1× capacity (n_P = 64, d_h = 256)
M.0 Baseline    0.00    0.00    0.00    0.00     0.00    0.00
M.1 Prefix      0.02    0.02    0.02    0.02     0.02    0.02
M.3 KV Ext      0.00    0.00    0.00    0.00     0.00    0.00
M.5 Gated       0.36    0.10    0.10    0.10     0.09    0.15
M.4 Hebbian     9.51    9.51    9.23    9.23     9.23    9.34
M.6 Slot        17.21   13.91   7.08    7.08     7.08    10.47
M.2 XAttn       17.85   14.65   9.02    9.02     9.02    11.91

10× capacity (n_P = 640, d_h = 810)
M.0 Baseline    0.00    0.00    0.00    0.00     0.00    0.00
M.5 Gated       11.22   7.62    7.62    7.62     7.62    8.34
M.1 Prefix      10.75   10.75   9.30    9.30     9.20    9.86
M.2 XAttn       11.90   11.90   9.88    9.88     9.88    10.69
M.3 KV Ext      15.58   15.58   9.69    9.69     9.69    12.05
M.6 Slot        13.85   10.60   10.60   10.21    9.66    10.99
M.4 Hebbian     15.86   11.19   10.32   10.32    10.32   11.60

measures whether that content survives overwriting and interference. A stronger architecture is one whose curve stays higher for longer.

1× capacity. At the smaller scale (Figure 9, left; Table 3, upper block), the methods separate into two tiers. M.2 XAttn and M.6 Slot dominate with short-lag recall above 17% and long-lag scores around 9% and 7%, respectively. M.4 Hebbian is the most stable: its curve is nearly flat at ∼9.3% across all buckets, indicating strong resistance to overwrite. In contrast, M.1 Prefix, M.3 KV Ext, and M.5 Gated collapse to near zero—their small memory banks cannot sustain useful state. The 1× ordering is therefore M.2 XAttn > M.6 Slot > M.4 Hebbian ≫ M.5 Gated > M.1 Prefix ≈ M.3 KV Ext ≈ Baseline.

10× capacity. At the larger scale (Figure 9, right; Table 3, lower block), all six methods produce non-trivial forgetting curves. M.4 Hebbian now leads with the highest long-lag score (10.3%) and the best mean (11.6%). M.3 KV Ext, which was dead at 1×, achieves the highest short-lag recall (15.6%) and a strong mean (12.0%). M.6 Slot remains consistently strong across all buckets.
M.5 Gated, which collapsed at 1×, now reaches 11.2% short-lag recall. The capacity effect is the most striking result: three methods (M.1, M.3, M.5) fail completely at 1× but succeed at 10×, demonstrating that memory bank size is a critical hyperparameter, not just a scaling convenience. The methods that succeed at both scales (M.2, M.4, M.6) employ write mechanisms that are inherently more selective: attention-coupled writes, associative updates, or sparse top-k slot addressing.

The fact that all six trained adapters produce non-trivial memory-recall curves at sufficient capacity answers the secondary question posed in the introduction: the frozen decoder's cross-attention does possess sufficient representational slack to attend usefully to memory entries projected by a trained adapter; the bottleneck is the quality of the write and read pathways and the capacity of the memory bank.

6.5 Cumulative Knowledge Curve

The forgetting curve measures how well memory resists decay after information is written. A complementary question is whether persistent memory enables knowledge accumulation: does the model know progressively more as additional sessions are processed?

We formalise this as a cumulative knowledge curve. Let S_1, S_2, ..., S_n denote the sessions in a LoCoMo conversation. After writing all turns in sessions S_1 through S_s, we probe every QA pair whose annotated evidence sessions are fully contained within the processed sessions, i.e. E_q^sess ⊆ {1, ..., s}. The session-level knowledge score is

    K_s = (1 / |Q_{≤s}|) Σ_{q ∈ Q_{≤s}} F1(ŷ_q^mem, y_q),    (42)

where Q_{≤s} is the set of answerable questions at session s. The net knowledge gain is summarised by

    ΔK = K_n − K_1,    (43)

which is positive when the memory accumulates knowledge and zero (or negative) when it merely decays. For the stateless baseline, every probed answer is produced without access to prior context, so K_s is approximately constant across sessions.
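Eqs. (42) and (43) translate directly into code. A question becomes answerable at session s once all of its evidence sessions have been processed; returning 0.0 when no question is yet answerable is our convention, and the annotations below are toy values:

```python
def knowledge_curve(questions, n_sessions):
    """questions: list of (evidence_sessions, f1_with_memory) pairs.
    Returns [K_1, ..., K_n] per Eq. (42): K_s averages F1 over questions
    whose evidence sessions all lie within sessions 1..s."""
    curve = []
    for s in range(1, n_sessions + 1):
        answerable = [f1 for ev, f1 in questions if max(ev) <= s]
        curve.append(sum(answerable) / len(answerable) if answerable else 0.0)
    return curve

# Toy annotations: evidence sessions and the F1 each question receives.
qs = [({1}, 0.5), ({1, 2}, 1.0), ({3}, 0.0)]
K = knowledge_curve(qs, 3)
delta_K = K[-1] - K[0]          # Eq. (43)
print(K, delta_K)               # -> [0.5, 0.75, 0.5] 0.0
```

The toy run also illustrates why ΔK alone can mislead: K_s rises at session 2 and falls back at session 3, netting to zero, which is why the full curve is reported alongside ΔK.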
A rising K_s curve therefore provides direct visual evidence of knowledge growth that is impossible in a memoryless system.

Knowledge-accumulation results (1× capacity). Table 4 reports the terminal knowledge K_30 and net gain ΔK for each method. The stateless baseline accumulates a surprising ΔK = 5.6%—a ceiling effect from the frozen encoder's raw QA ability improving as more context is injected into the current prompt. Among memory methods, M.6 Slot achieves the highest ΔK = 9.7% (terminal K_30 = 9.7%), followed by M.4 Hebbian (ΔK = 7.8%) and M.2 XAttn (ΔK = 7.3%). These are the same three methods that succeed on the forgetting curve at 1× capacity, confirming that write quality determines both retention and accumulation. M.1 Prefix, M.3 KV Ext, and M.5 Gated show near-zero knowledge growth (ΔK < 0.2%), consistent with their collapsed forgetting curves.

Table 4: Knowledge accumulation on LoCoMo (1× capacity). K_30 is the terminal knowledge score after all 30 sessions; ΔK = K_30 − K_1 measures net knowledge gain. Higher values indicate stronger persistent memory.

Method          K_30 (%)   ΔK (%)
M.0 Baseline    5.57       5.57
M.1 Prefix      0.00       0.00
M.3 KV Ext      0.00       0.00
M.5 Gated       0.17       0.17
M.2 XAttn       11.04      7.34
M.4 Hebbian     10.62      7.84
M.6 Slot        9.71       9.71

7 Discussion

7.1 Why training is necessary

If persistent memory is built entirely from frozen encoder outputs and exposed to the decoder without learned projections, the decoder receives states that were never optimised for selective long-range retrieval. The failure is structural: as more history is concatenated or cached, useful entries compete with irrelevant ones inside the same softmax, attention mass disperses, and the contribution of any single remembered fact shrinks. The frozen encoder manifold M_E matches the pre-trained cross-attention interface, but it provides no mechanism for deciding what to retain, compress, or foreground across long lags.
The forgetting-curve results confirm this analysis. At 1× capacity, methods whose projections are closer to the raw encoder output (M.1 Prefix, M.3 KV Ext) collapse entirely, while methods with learned selective writes (M.2 XAttn, M.4 Hebbian, M.6 Slot) achieve recall rates above 9%. At 10× capacity, even the weaker methods recover, but the gap remains: methods with richer write rules consistently outperform simpler ones. Training is therefore necessary—and so is sufficient memory capacity: the adapter must learn to map persistent state back into a representation the frozen decoder can use, and the bank must be large enough to store it.

Table 5: Adapter interference analysis. Tax > 0 means the adapter degrades baseline knowledge when memory is empty; Benefit > 0 means memory helps more than the adapter hurts. Baseline mean F1 = 6.44% at both scales (stateless, independent of adapter capacity).

                1× capacity           10× capacity
Method          Tax (%)   Benefit (%)   Tax (%)   Benefit (%)
M.1 Prefix      +2.38     −6.42         +4.23     +4.00
M.2 XAttn       +3.39     +6.76         +3.39     +5.09
M.3 KV Ext      +2.38     −6.44         +4.23     +6.26
M.4 Hebbian     +3.39     +3.10         +3.39     +5.46
M.5 Gated       +3.39     −6.30         +3.39     +1.83
M.6 Slot        +2.38     +5.75         +4.23     +5.22

7.2 Adapter interference

A trained memory adapter modifies the decoder's cross-attention pathway. Even when the persistent memory is empty—immediately after a reset, or before any conversation has begun—the adapter's projections inject into the decoder and may displace the pre-trained knowledge that the frozen backbone already possesses. We quantify this risk with two complementary metrics.

Adapter tax. Let F̃_base be the raw token-F1 of the stateless baseline (no adapter, no memory) and let F̃_m^0 be the F1 of method m with its adapter attached but its memory state forced to zero. The adapter tax is

    Tax_m = F̃_base − F̃_m^0.
    (44)

Positive values mean that the adapter degrades the model below its unmodified stateless performance; negative values mean the adapter happens to help even without any stored memory.

Net benefit. The quantity reviewers ultimately care about is whether adding persistent memory gives a net improvement over the original model:

    Benefit_m = F̃_m^mem − F̃_base.    (45)

A method is worthwhile if and only if Benefit_m > 0, i.e. the memory's contribution outweighs any adapter interference.

Why interference should be small. Because the backbone is 100% frozen, the adapter's only effect on the decoder is through the additional cross-attention entries it provides. When the memory bank is zeroed, these entries carry near-zero magnitude; a well-trained softmax distributes negligible attention mass to them, leaving the decoder's original computation approximately intact. Methods with explicit gating (M.5, M.6) can learn to suppress the memory pathway entirely when memory is uninformative. The adapter tax is therefore expected to be small, and the net benefit should track the forgetting-curve results closely. Both metrics are computed from the same F1^mem and F1^zero values that the forgetting-curve evaluation already records, so no additional experiments are required. Table 5 reports the mean adapter tax and net benefit across lag buckets.

Empirical observations. At 1× capacity, three methods (M.1 Prefix, M.3 KV Ext, M.5 Gated) show negative net benefit—the adapter hurts more than the memory helps—consistent with the capacity collapse observed in the forgetting curves. In contrast, M.2 XAttn (+6.76%), M.6 Slot (+5.75%), and M.4 Hebbian (+3.10%) produce positive net benefit even at low capacity. At 10× capacity, all six methods yield positive net benefit (range +1.8% to +6.3%), confirming that sufficient memory capacity overcomes the adapter tax.
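Both diagnostics are one-line differences over F1 values that the forgetting-curve evaluation already records. The numbers below are illustrative rather than taken from Table 5:

```python
def adapter_tax(f1_base, f1_adapter_zero_mem):
    """Eq. (44): cost of the attached-but-empty adapter vs. the raw baseline."""
    return f1_base - f1_adapter_zero_mem

def net_benefit(f1_adapter_with_mem, f1_base):
    """Eq. (45): end-to-end gain of adapter + memory over the unmodified model."""
    return f1_adapter_with_mem - f1_base

# Illustrative F1 values (%): baseline 6.4; empty-memory adapter 3.0; with memory 13.2.
print(round(adapter_tax(6.4, 3.0), 2))   # -> 3.4  (the adapter alone hurts)
print(round(net_benefit(13.2, 6.4), 2))  # -> 6.8  (memory more than pays it back)
```

A method is worthwhile exactly when the second number is positive, even if the first is too.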
The tax itself is modest (2–4% across all conditions), validating the theoretical prediction that frozen-backbone adapters introduce only minor interference.

7.3 Limitations and scope

This work is an intentionally constrained pilot study. All six methods are instantiated on a single frozen encoder–decoder backbone (Flan-T5-XL, 3B parameters) with a single evaluation dataset (LoCoMo) and minimal compute. Whether the same architectural principles transfer to decoder-only, encoder-only, or other-scale models must be validated separately. Absolute recall rates remain modest (up to ≈12%), which is expected given that 100% of the backbone weights are frozen and the adapter budget is small.

Critically, these limitations are by design: the purpose of the pilot is to demonstrate feasibility under worst-case resource constraints, not to optimise absolute performance. We expect that relaxing any of the following constraints will yield substantially stronger results:

1. Unfreezing the backbone. End-to-end training would allow the encoder to learn what to write and the decoder to learn how to read persistent memory, rather than forcing the adapter alone to bridge both gaps.
2. Larger models. Bigger LLMs (e.g. 70B+ decoder-only) possess richer internal representations; persistent memory injected into these representations should carry more information per slot.
3. Larger and more diverse data. Training on corpora beyond a single multi-session dialogue benchmark will improve generalisation of the write and read operations.
4. Larger memory banks. Our 10× scale uses n_P = 640 slots. Because the memory bank is a numerical array decoupled from the backbone, it can be scaled by orders of magnitude—potentially millions of slots—with no change to per-turn inference cost.
Pursuing these directions requires industrial-scale compute and is beyond the scope of this study, but the design-space taxonomy and evaluation protocol established here provide the foundation for such work.

7.4 Broader implications

End-to-end training and conversational learning. This pilot study trains only a minimal memory adapter (θ_Mem) while the entire backbone remains frozen. This is the most resource-constrained setting possible, and it already demonstrates non-trivial memory recall. The full potential of persistent latent-space memory would be realised when the entire LLM is trained end-to-end with its memory bank, learning simultaneously what to store and how to use it. An industrial-scale effort—training a 70B+ model on diverse multi-session corpora with a persistent memory bank millions of slots large—would couple the backbone's representation power with the memory's persistence, a combination that our frozen setup deliberately excludes. We expect such a system to outperform our pilot results by a wide margin; our contribution is to show that the underlying mechanism is sound and to map the design space that large-scale training should explore. More broadly, persistent memory opens the door to conversational learning: every interaction updates the bank, and the model becomes more informed with each turn, driven by ordinary dialogue rather than curated datasets or reward signals. Existing LLMs can be retrofitted by installing a memory adapter and retraining—the backbone architecture need not change.

Scalability. A latent memory bank is a compact numerical array whose capacity can grow without increasing the per-turn inference cost of the backbone. Our experiments use at most n_P = 640 slots (≈5 MB at float32); in principle the bank can scale to millions of slots at modest storage cost, far exceeding what a human lifetime of conversation would require.
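The ≈5 MB figure is a quick arithmetic check. Assuming the bank stores one float32 vector of Flan-T5-XL's hidden size (d_model = 2048) per slot, which is our reading of the layout:

```python
def bank_megabytes(n_slots, d_model=2048, bytes_per_elem=4):
    """Size of a latent memory bank holding one d_model float32 vector per slot."""
    return n_slots * d_model * bytes_per_elem / 2**20

print(bank_megabytes(640))        # -> 5.0      (the paper's 10x scale, ~5 MB)
print(bank_megabytes(1_000_000))  # -> 7812.5   (a million slots is ~7.6 GiB)
```

Even the million-slot extreme fits on a single commodity GPU or in host RAM, which is why capacity can be treated as a tunable design parameter rather than a hard constraint.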
Unlike text-level memory systems that must re-tokenise growing text stores—incurring cost proportional to memory size—latent memory is read through a fixed-dimension attention operation, so per-turn inference cost is independent of how much history has been stored.

Latent-space memory as a cognitive substrate. Biological brains do not retain verbatim transcripts; they maintain distributed, continuously updated representations that support recognition, abstraction, and generalisation. Persistent latent-space memory mirrors this organisation more closely than text-level retrieval, and connects naturally to neuroscience-inspired architectures such as complementary learning systems [10] and attention-coupled lateralised memory [6]. Because LLMs already represent knowledge as continuous activations, a persistent memory that operates in the same latent space is a more natural substrate for core cognitive operations—reading, updating, generalisation, and compositional extension—than a symbolic or textual buffer.

8 Conclusion

This paper presents a proof-of-concept pilot study: persistent memory that lives entirely in the latent space of a frozen encoder–decoder LLM (Flan-T5-XL, 3B parameters), with only small trainable memory adapters and minimal compute. Under these deliberately severe constraints, we show that the idea works: six adapter architectures spanning three injection points and four write mechanisms all produce non-trivial memory-recall curves at 10× capacity, while three collapse at 1×—revealing capacity as a critical design parameter. M.2 XAttn and M.6 Slot dominate at low capacity; M.4 Hebbian leads at high capacity. The cumulative knowledge curve confirms that the strongest methods accumulate knowledge steadily over 30 sessions (ΔK up to 9.7%), while collapsed methods show no growth.
The broader implication is that latent-space persistent memory is not one mechanism but a design space—one whose dimensions (write rule, read path, capacity) have measurable consequences that would be invisible in text-level memory systems. Our pilot maps this design space under worst-case conditions; the natural next step is industrial-scale exploration: end-to-end training of large LLMs (70B+) with memory banks scaled to millions of slots and diverse multi-session corpora. Because the memory bank is a compact numerical array decoupled from the backbone, existing pre-trained models can be retrofitted with persistent memory by installing an adapter and retraining—no architectural redesign is required. We believe that full-scale training will show dramatically stronger results; this study establishes the feasibility, the taxonomy, and the evaluation protocol that such efforts need.

Acknowledgements

The author thanks Inha University in Tashkent for research support. This work reflects the author's ongoing inquiry into nature and human cognition.

References

[1] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67, 2020.

[2] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2024.

[3] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In AAAI, 2024.

[4] Jing Xu, Arthur Szlam, and Jason Weston. Beyond goldfish memory: Long-term open-domain conversation. In ACL, 2022.

[5] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents.
In Proceedings of the 62nd Annual Meeting of the ACL, 2024.

[6] Hong Jeong. Inhibitory cross-talk enables functional lateralization in attention-coupled latent memory. arXiv:2603.03355 [q-bio.NC], 2026.

[7] Hong Jeong. A miniature brain transformer: Thalamic gating, hippocampal lateralization, amygdaloid salience, and prefrontal working memory in attention-coupled latent memory. arXiv:2603.07217 [q-bio.NC], 2026.

[8] Endel Tulving. Episodic and semantic memory. In E. Tulving and W. Donaldson, editors, Organization of Memory, pages 381–403. Academic Press, 1972.

[9] James L. McClelland, Bruce L. McNaughton, and Randall C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419–457, 1995.

[10] Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7):512–534, 2016.

[11] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL–IJCNLP, 2021.

[12] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: A visual language model for few-shot learning. In NeurIPS, 2022.

[13] Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In ICLR, 2022.

[14] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In ICML, 2021.
[15] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.