NextMem: Towards Latent Factual Memory for LLM-based Agents
NextMem: Towards Latent Factual Memory for LLM-based Agents

Zeyu Zhang*1  Rui Li*1  Xiaoyan Zhao2  Yang Zhang3  Wenjie Wang4  Xu Chen1  Tat-Seng Chua3

1 Renmin University of China  2 The Chinese University of Hong Kong  3 National University of Singapore  4 University of Science and Technology of China. Correspondence to: Yang Zhang <zhangy@nus.edu.sg>, Xu Chen <xu.chen@ruc.edu.cn>.

Preprint. March 18, 2026.

Abstract

Memory is critical for LLM-based agents to preserve past observations for future decision-making, where factual memory serves as its foundational part. However, existing approaches to constructing factual memory face several limitations. Textual methods impose heavy context and indexing burdens, while parametric methods suffer from catastrophic forgetting and high costs. To address these challenges, we introduce NextMem, a latent factual memory framework that utilizes an autoregressive autoencoder to efficiently construct latent memory while ensuring accurate reconstruction. For better optimization, we propose a two-stage training process, including autoregressive reconstruction alignment and progressive latent substitution. We also incorporate quantization to reduce storage overhead. Extensive experiments demonstrate that NextMem achieves superior performance, and excels in retrieval, robustness, and extensibility properties. We release our code and model checkpoints at https://github.com/nuster1128/NextMem.

1 Introduction

In recent years, LLM-based agents have emerged as a new AI paradigm in many fields (Wang et al., 2024; Xi et al., 2025), such as personal assistants (Li et al., 2024) and academic research (Zhang et al., 2025b). Memory is among their most critical components, responsible for retaining past information to support future decision-making (Zhang et al., 2025c). Although this information is typically stored as multiple levels of memory (Li et al., 2025), factual memory remains their foundation, preserving details of observed facts. Compared to other task-oriented memories, such as preference (Sun et al., 2025) and experience (Zhao et al., 2024), which commonly require task-specific extraction from the original information, factual memory emphasizes its lossless preservation, as we present their comparison in Figure 1.

Figure 1. Comparison between task-oriented and factual memory.

Previous research typically represents memory in two forms: texts and parameters (Zhang et al., 2025c). Textual memory is commonly utilized as context for prompting LLMs, which can be stored and retrieved by databases (Park et al., 2023; Zhong et al., 2024). However, it increases context length and indexing overhead when storing and using a large amount of detailed facts. Other studies incorporate information by modifying the parameters of LLMs (Han et al., 2024; Zhang et al., 2024).
However, they often suffer from catastrophic forgetting and high costs to store detailed facts accurately. Both paradigms have obvious limitations in managing factual memory. Recent studies in LLM reasoning propose to leverage latent embeddings to represent intermediate reasoning steps (Hao et al., 2024; Xu et al., 2025), as well as to compress task instructions into latent embeddings for optimization (Mu et al., 2023; Wu et al., 2025). These studies show the potential of latent representations for storing information. Nevertheless, they primarily focus on improving reasoning efficiency or task adaptation, rather than preserving factual information.

Motivated by these studies, we intend to leverage latent representations to address the limitations in managing factual memory. Our primary goals are two-fold: (1) efficiently transform textual memory into shorter latent representations compatible with LLMs; (2) ensure latent representations can be accurately and efficiently reconstructed into the original memory. We emphasize reconstruction because factual memory requires lossless preservation rather than partial extraction. Therefore, unlike extraction or indexing approaches, our encoding and decoding processes should be reversible.

In this study, we propose a simple yet effective latent memory framework, named NextMem. We design an autoregressive autoencoder built upon LLMs for efficient memory encoding and decoding. Our training process has two critical stages: autoregressive reconstruction alignment and progressive latent substitution, with further quantization to reduce storage cost while preserving accuracy. Additionally, we conduct extensive experiments to verify model effectiveness and reveal key properties and insights.
Our experiments also indicate that the latent memory encoded by NextMem exhibits superior retrieval properties, robustness, and extensibility. To facilitate future research, we have released our code and model checkpoints at https://github.com/nuster1128/NextMem.

Our contributions are summarized as follows:
• We introduce a simple yet effective framework for latent factual memory, with autoregressive reconstruction alignment and progressive latent substitution.
• We integrate quantization methods into the latent memory of our framework, which reduces storage overhead while maintaining competitive performance.
• We validate our approach through extensive experiments and provide further insights. We provide the research community with open-source code and model checkpoints.

2 Preliminaries

2.1 Memory in LLM-based Agents

Unlike vanilla LLMs that generate single-turn responses, LLM-based agents interact iteratively with their environments. We formalize this process as a Markov Decision Process (MDP) (Sutton et al., 1998). Specifically, let S and A denote the state and action spaces. The environment is featured by a transition function E : S × A → S with rewards. At each timestep t, the agent selects an action a_t = f(s_t) based on the current state s_t ∈ S via a policy f. Then, the state is updated by the environment, s_{t+1} = E(s_t, a_t), with a reward r_t, which can be further observed by the agent to take the next action a_{t+1}. This process repeats until task completion, and the objective is to optimize the policy f to maximize the cumulative reward Σ_t r_t.

Memory is fundamental to agent decision-making. To accommodate diverse memory representations, we conceptualize the memory procedure as an encoding-decoding process rather than the traditional paradigm of storage, retrieval, and utilization (Zhang et al., 2025e).
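The encoding-decoding view of agent memory above can be sketched in a few lines. This is a minimal illustration with hypothetical stand-ins (`Memory`, `run_episode`, and the `env_step`/`llm` callables are our naming, not part of the NextMem release):

```python
# Minimal sketch of the encoding-decoding view of agent memory.
# `Memory`, `run_episode`, and the callables below are hypothetical
# stand-ins used only to illustrate the loop, not the released API.
from typing import Any, Callable

class Memory:
    """Abstract memory state M_t with Encode/Decode operations."""
    def __init__(self) -> None:
        self.store: list = []

    def encode(self, state: Any) -> None:
        # M_t = Encode(s_t, M_{t-1}); here we simply append the observation.
        self.store.append(state)

    def decode(self, state: Any) -> list:
        # Decode(s_t, M_t): fetch information relevant to the current state
        # (trivially, everything stored so far).
        return list(self.store)

def run_episode(env_step: Callable, llm: Callable, s0: Any, horizon: int) -> float:
    """Agent loop: a_t = LLM(Decode(s_t, M_t)); maximize the sum of rewards."""
    memory, state, total_reward = Memory(), s0, 0.0
    for _ in range(horizon):
        memory.encode(state)                       # encode upon observation
        action = llm(state, memory.decode(state))  # decode to augment inference
        state, reward = env_step(state, action)    # s_{t+1} = E(s_t, a_t)
        total_reward += reward
    return total_reward
```

Note that encoding happens at every observation while the decoded content may only matter at a later step, matching the asynchronous nature of encoding and decoding described above.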
Under this framework, an agent encodes historical information into memory, which is subsequently decoded to augment LLM inference. Specifically, let M_t denote the memory state at timestep t. The encoding process is formalized as:

  M_t = Encode(s_t, M_{t-1}),

which is executed upon observing the current state s_t. The decoding process is integrated into the agent's policy function:

  f(s_t) = LLM ∘ Decode(s_t, M_t),

where the function LLM represents LLM inference with core capabilities, such as reasoning, inference, and tool calling. Importantly, information encoded into M_t at step t may not be utilized until a later timestep, reflecting the asynchronous nature of encoding and decoding in agentic workflows.

2.2 Representation of Agent Memory

Memory can be manifested in different representations, and our encoding-decoding framework provides a unified perspective on these memory forms as follows.

Textual Memory. Textual memory stores information in text format and leverages in-context learning to integrate relevant data into LLM inference (Zhang et al., 2025c). This approach often employs indexing structures for query-based retrieval. The encoding process is formulated as:

  m_t = LLM(θ; p_extract ∥ s_t) ∈ V^n,
  M_t = M_{t-1} ⊕ {(m_t, Index(m_t))},

where p_extract denotes the prompt used to extract memory from observations, ∥ represents concatenation, and ⊕ signifies a structural merge. Here, we use LLM(θ; ·) to represent LLM inference parameterized by θ over a vocabulary V with sequence length n, and use Index(·) to denote the indexing approach.
Then, the decoding process can be defined as:

  q_t = LLM(θ; p_intent ∥ s_t) ∈ V^n,
  c_t = Retrieval(Index(q_t), M_t),
  f(s_t) = LLM(θ; p_instruct ∥ s_t ∥ c_t),

where p_intent is the prompt for generating the retrieval intent, p_instruct is the task-specific instruction prompt, and Retrieval(·) denotes the operation of fetching relevant context c_t based on the established indexes.

Parametric Memory. Parametric memory incorporates new information by modifying model parameters, which are subsequently integrated with the base parameters at the inference stage. The encoding process can be formalized as:

  Δθ_t = h_meta(s_t, θ_0 ⊕ Δθ_{t-1}) ∈ R^d,
  M_t := θ_0 ⊕ Δθ_t,

where θ_0 denotes the base parameters of the LLM. The function h_meta(·) is designed to predict parameter modifications within the space R^d based on the current observation s_t. The decoding process is defined as:

  f(s_t) = LLM(θ_0 + Δθ_t; p_instruct ∥ s_t),

where Δθ_t represents the cumulative parameter modifications that encapsulate the historical information for the task.

Latent Memory. Beyond the above paradigms, latent memory offers a distinct approach by using an encoder to transform textual information into latent representations, which enables reducing sequence length and inference latency. Its encoding process is formalized as:

  m_t = h_encode(s_t) ∈ R^{L×d},
  M_t = M_{t-1} ⊕ {(m_t, Index(m_t))},

where h_encode transforms information from the textual space V^n into a latent space R^{L×d}, with L ≪ n. For the decoding procedure, we have the following definition:

  q_t = h_encode(s_t) ∈ R^{L×d},
  c_t = Retrieval(Index(q_t), M_t),
  f(s_t) = LLM(θ; p_instruct ∥ s_t, c_t),

where the retrieved latent context c_t can be integrated into the input embeddings for LLM inference.

Figure 2. Overview of NextMem framework.

3 Methods

3.1 Overview

In this paper, we propose a simple yet effective autoregressive autoencoder to generate latent factual memory, as presented in Figure 2(a). Our model can forwardly transform textual information into latent representations that are compatible with LLMs' inputs, and these representations can be accurately decoded back into the original text to ensure fine-grained preservation. We design a two-stage training procedure to establish text-to-text, latent-to-text, and text-to-latent information transformation, which includes autoregressive reconstruction alignment and progressive latent substitution.

3.2 Autoregressive Autoencoder

Our autoencoder is built on a causal language model architecture. It can be decomposed into three primary parts: (1) Embedding Layer (h_emb): it maps discrete input tokens s ∈ V^n into a continuous embedding space R^{n×d}. (2) Transformer Blocks (h_trans): a stack of Transformer decoder layers (Vaswani et al., 2017), where each layer integrates multi-head self-attention and a feed-forward network, with residual connections and normalization.
(3) Language Modeling Head (h_lmh): it projects the final hidden states into a probability distribution over the vocabulary space for next-token prediction.

To improve model efficiency, the encoder and decoder share an identical architecture with two distinct weight sets θ_encode and θ_decode. Furthermore, we introduce a special token [SoD] to signify the start of transformation. During the encoding phase, we append it to the original input sequence s = [s_1, s_2, ..., s_n] ∈ V^n before mapping to the initial input embedding:

  E_0 = h_emb(θ_encode; s ∥ [SoD]) ∈ R^{(n+1)×d}.

The embedding is then processed through the Transformer blocks. We extract the hidden state at the last position of the final layer as the first latent embedding:

  h_1 = h_trans(θ_encode; E_0)_{(T, n+1)} ∈ R^d,

where T denotes the total number of Transformer blocks. We iteratively append the previously obtained latent embedding to the input embeddings. The i-th step is defined as:

  E_i = [E_{i-1}; h_i],
  h_{i+1} = h_trans(θ_encode; E_i)_{(T, n+i+1)}.

The final latent representation is the concatenation of all l generated latent embeddings:

  H^(l) = [h_1; h_2; ...; h_l] ∈ R^{l×d}.

In the decoding phase, we first map the instruction suffix p_suffix (e.g., [SoD]) into the input embedding space:

  E_suffix = h_emb(θ_decode; p_suffix).

The input of the decoder is then formed by concatenating the latent representation H^(l) with this suffix embedding:

  E_input = [H^(l); E_suffix].

The probability distribution over the vocabulary V for next-token prediction is computed as:

  p(V | E_input) = h_lmh ∘ h_trans(θ_decode; E_input).

Based on this distribution, the decoder samples tokens and generates the output sequence o = [o_1, o_2, ..., o_m] ∈ V^m in an autoregressive manner.
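The autoregressive encoding loop above can be sketched with NumPy, using a random linear map as a toy stand-in for the Transformer blocks (the real model runs T causal-attention layers; the stand-in only illustrates the shapes and the append-and-re-encode recurrence):

```python
# Toy sketch of the autoregressive encoding loop of Section 3.2.
# A single tanh-linear map stands in for h_trans; shapes follow the paper:
# n input tokens of dimension d, producing l latent embeddings.
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 6, 8, 3                 # sequence length, hidden dim, latent count
W = rng.standard_normal((d, d))   # stand-in weights for the final layer

def h_trans_last(E: np.ndarray) -> np.ndarray:
    """Return the hidden state at the last position of the final layer."""
    # A real encoder would run T Transformer blocks with causal attention;
    # a single nonlinear map over the last position illustrates the shape.
    return np.tanh(E[-1] @ W)

E = rng.standard_normal((n + 1, d))   # embeddings of s || [SoD]
latents = []
for _ in range(l):
    h = h_trans_last(E)               # h_{i+1} = h_trans(E_i)_{(T, n+i+1)}
    latents.append(h)
    E = np.vstack([E, h])             # E_i = [E_{i-1}; h_i]

H = np.stack(latents)                 # H^(l) in R^{l x d}
```

Each new latent is produced conditioned on the text plus all previously emitted latents, which is what makes the encoding autoregressive rather than a one-shot pooling.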
3.3 Autoregressive Reconstruction Alignment

To optimize our autoencoder, we propose a two-stage training procedure. In the first stage, autoregressive reconstruction alignment enables the model to transform information from textual space to textual space autoregressively, as shown in Figure 2(b). Our training samples are constructed in a self-supervised manner. Specifically, for each original sequence s = [s_1, s_2, ..., s_n], we define the input sequence x and the corresponding target labels y as:

  x = [s_1, s_2, ..., s_n, [SoD], s_1, s_2, ..., s_n],
  y = [[IGN], ..., [IGN] (n times), s_1, ..., s_n, [EoT]],

where [IGN] denotes positions ignored in the loss calculation, and [EoT] is a pre-trained token signifying the end of the text. We then fine-tune the causal language model by maximizing the following likelihood:

  θ*_decode = argmax_{θ_decode} P(θ_decode; Y | X),

where P(θ; Y | X) = h_lmh ∘ h_trans ∘ h_emb(θ; X).

3.4 Progressive Latent Substitution

Following the initial alignment, we introduce progressive latent substitution, which further enables the encoder to transform information into latent representations and reconstruct it. As illustrated in Figure 2(c), it comprises L progressive steps. At the k-th step, for an original sequence s, we first generate a k-length latent representation H^(k) via the encoding process. Then, we substitute the first k blocks of the original sequence (each of block size B) with the latent representations. The remaining textual sequence is:

  s̃^(k) = [s_{k·B+1}, s_{k·B+2}, ..., s_n].

We then construct the input embedding as follows:

  x^(k)(θ; s) = [H^(k); h_emb(θ; s̃^(k) ∥ [SoD] ∥ s)].

Compared to the alignment phase, we substitute several blocks of the original text with their corresponding latent representations H^(k).
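The substituted input construction above can be sketched as follows; the function name and toy shapes are ours, chosen only to make the block-replacement arithmetic concrete:

```python
# Illustrative construction of the substituted input x^(k) of the
# progressive latent substitution stage: the first k blocks (block size B)
# of the original token embeddings are replaced by k latent embeddings.
import numpy as np

def build_substituted_input(latents: np.ndarray,
                            token_embs: np.ndarray,
                            sod_emb: np.ndarray,
                            k: int, B: int) -> np.ndarray:
    """x^(k) = [H^(k); emb(s_{kB+1..n} || [SoD] || s_{1..n})]."""
    remaining = token_embs[k * B:]           # embeddings of s-tilde^(k)
    tail = np.vstack([remaining, sod_emb[None, :], token_embs])
    return np.vstack([latents[:k], tail])

d, n, B, k = 4, 8, 2, 3
embs = np.zeros((n, d))          # toy token embeddings (all zeros)
lat = np.ones((k, d))            # toy latent embeddings (all ones)
x_k = build_substituted_input(lat, embs, np.zeros(d), k, B)
# length: k latents + (n - k*B) remaining + 1 [SoD] + n full-sequence tokens
```

As k grows step by step, an ever-larger prefix of the text is visible only through its latent embeddings, which is the "progressive" part of the curriculum.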
This forces the model to rely on the latent representations to recover the missing textual information. The target label for the decoder is defined as:

  y^(k) = [[IGN], ..., [IGN] (n times), s_1, ..., s_{k·B}, [IGN], ..., [IGN] (n − k·B times), [EoT]].

In this stage, only the encoder parameters θ_encode are optimized, while the decoder parameters θ_decode are kept frozen. This ensures that the encoder learns to produce representations compatible with the pre-aligned decoder. The optimization objective is formulated as:

  θ^(k)_encode = argmax_{θ_encode} P(θ_decode; Y^(k) | X^(k)(θ_encode; S)),

where P(θ_decode; Y^(k) | X^(k)) = h_lmh ∘ h_trans(θ_decode; X^(k)), and S denotes the original text set of the training corpus. For each step k, the encoder parameters are initialized with the weights from the previous step, i.e., θ^(k)_encode ← θ^(k−1)_encode. For the first step, we initialize the encoder using the pre-aligned decoder weights θ*_decode. The final encoder parameters are denoted as θ*_encode = θ^(L)_encode after L steps of progressive substitution. To enhance optimization efficiency and stability, we apply a stop-gradient operation to the hidden state h_{i−1} when computing the gradient ∇_θ h_i. This detachment prevents backpropagation through multiple recurrences, significantly reducing the computational cost.

We employ LoRA (Hu et al., 2022) to implement both the encoder and decoder. This allows us to switch between the encoder and decoder by simply swapping their LoRA adapters while sharing the same backbone, which avoids replicated model loading.

3.5 Latent Memory Quantization

After the two-stage training, we observe that the latent representations exhibit strong robustness (see Section 4.5). To further reduce storage overhead, we employ 4-bit NormalFloat (NF4) quantization (Dettmers et al., 2023) for further compression.
The codebook is defined as a fixed set of NF4 values C = {c_0, c_1, ..., c_15}. The quantization process maps the high-precision latent representation H^(L) to 4-bit indices Q^(L) and an associated scale vector s. First, we compute the quantization scale s ∈ R^d for each feature dimension j by determining the maximum absolute value:

  s_j = max_{1≤i≤l} |H^(L)_{i,j}|.

The input is then normalized element-wise using:

  H̃^(L)_{i,j} = H^(L)_{i,j} / (s_j + ϵ),

where ϵ is a small constant for numerical stability. Subsequently, each normalized element is mapped to its nearest centroid in the codebook C via:

  Q^(L)_{i,j} = argmin_k |H̃^(L)_{i,j} − c_k|.

The resulting indices Q^(L) are stored as 4-bit unsigned integers, while the scales s are cast to the FP8 format (specifically float8_e4m3fn) to further optimize memory efficiency. To reconstruct the approximation Ĥ^(L), we retrieve the codebook values by their indices and rescale them. We have also explored methods for higher sparsity, but they failed to maintain sufficient reconstruction accuracy. More details are provided in Appendix A.

3.6 Efficiency and Scalability Analysis

NextMem improves inference efficiency by compressing extensive textual observations into compact latent representations. It alleviates context window pressure, allowing agents to allocate more token capacity to complex reasoning and long-horizon planning. Additionally, the adoption of shared backbone parameters with LoRA adapters and NF4 quantization minimizes the model's memory overhead while enabling high-density storage of factual records. These optimizations collectively facilitate the scalable deployment of factual memory within resource-constrained agentic workflows.
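The per-dimension absmax quantization of Section 3.5 can be sketched as below. Note that a uniform 16-level codebook stands in for the actual NF4 centroids of Dettmers et al. (2023); the function names are ours:

```python
# Sketch of per-dimension absmax quantization with a 16-entry codebook.
# A uniform codebook stands in for the actual NF4 values (assumption).
import numpy as np

CODEBOOK = np.linspace(-1.0, 1.0, 16)   # stand-in for NF4 centroids c_0..c_15

def quantize(H: np.ndarray, eps: float = 1e-8):
    """Map H (l x d) to 4-bit indices Q and per-dimension scales s."""
    s = np.abs(H).max(axis=0)                                  # s_j = max_i |H_ij|
    H_norm = H / (s + eps)                                     # element-wise normalize
    Q = np.abs(H_norm[..., None] - CODEBOOK).argmin(axis=-1)   # nearest centroid
    return Q.astype(np.uint8), s

def dequantize(Q: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Reconstruct H-hat by codebook lookup and rescaling."""
    return CODEBOOK[Q] * s

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))
Q, s = quantize(H)
H_hat = dequantize(Q, s)
```

In practice the indices would be packed two-per-byte to reach true 4-bit storage, and the scales cast to FP8, as described above.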
4 Experiments

4.1 Experimental Settings

To validate the effectiveness of our model, we conduct extensive experiments and analyze the results from multiple aspects. Our primary evaluation involves three tasks closely related to agent memory: (1) Factual Reconstruction, (2) Contextual Generation, and (3) Dense Passage Retrieval. These tasks correspond to memory storage, utilization, and retrieval, respectively. In addition, we further explore the influence of compression ratio, robustness, forgetting effects, and other features of our approach.

The datasets utilized in our experiments include:
• SQuAD (Rajpurkar et al., 2016): a reading comprehension dataset that requires answering questions according to passages extracted from Wikipedia.
• HotpotQA (Yang et al., 2018): a question answering dataset that requires multi-hop reasoning over information across multiple Wikipedia documents.
• RACE (Lai et al., 2017): a reading comprehension dataset from English exams that tests reasoning and understanding.
• LoCoMo (Maharana et al., 2024): a simulated dataset for evaluating long-term memory of LLM-based agents through multi-session conversations among different speakers.
• LongMemEval (Wu et al., 2024): a dataset constructed to evaluate long-term memory capabilities of personal assistants in user-agent interaction scenarios.

The major baselines in our experiments include:
• DeepSeek-OCR (Wei et al., 2025): a vision-language model that utilizes context optical compression to compress texts into images before performing inference with LLMs. To unify the scale of the hidden representations, we employ the 240×240 pixel (16 latent tokens) pattern.
• ICAE (Ge et al., 2023): a context compression model that converts paragraphs into memory slots for LLM inference, based on optimizable memory tokens. We employ the shortest public checkpoint with 128 tokens.
• DyPRAG (Tan et al., 2025): a framework that converts paragraphs into parametric knowledge (specifically LoRA adapters) at test time using a lightweight hypernetwork.

Our models are denoted as NextMem-Dense for the dense version and NextMem-Sparse for the quantized version, both generating 15 latent tokens. Besides, we also provide special models to facilitate comparison, including Textual Memory (Zhang et al., 2025d) and BGE (Chen et al., 2024).

As for metrics, we utilize F1 Score (Rajpurkar et al., 2016), ROUGE-1, ROUGE-L (Lin, 2004), METEOR (Banerjee & Lavie, 2005), BLEU (Papineni et al., 2002), and BertScore (Zhang et al., 2019) for the factual reconstruction task. Besides, we adopt Accuracy by LLM-as-Judge (Gu et al., 2024) for the contextual generation task, and use Hit@5, Recall@5, MRR@5, MAP@5, DCG@5, and NDCG@5 (Schütze et al., 2008; Karpukhin et al., 2020) for the dense passage retrieval task.

For common settings, we employ Qwen3-8B (Yang et al., 2025) as the primary backbone except for specific checkpoint requirements. We set the chunk size of textual references to 128. Due to the page limitation, we place the full details of reproduction in Appendix B, the influence of block size in Appendix C, forgetting effects in Appendix D, and case studies in Appendix E.

4.2 Major Performances

To provide a more comprehensive evaluation, our major experiments include three diverse tasks, corresponding to the storage, utilization, and retrieval of latent memory.

Task 1: Factual Reconstruction (Memory Storage). Factual memory relies on the precision of details. Therefore, we measure the ability of baselines in information reconstruction. Specifically, we extract reference paragraphs from datasets and apply sentence sampling to improve data diversity and uniformity.
In this task, all baselines are required to encode texts into latent representations and decode them back to the original form before calculating consistency. The results are presented in Table 1. They show that our proposed methods significantly outperform other baselines across various datasets. Specifically, NextMem-Dense achieves the highest scores in most scenarios, substantially exceeding ICAE. In addition, NextMem-Sparse maintains highly competitive performance, which demonstrates the effectiveness of our quantization strategy. In contrast, previous methods such as DyPRAG and DeepSeek-OCR exhibit limited reconstruction capabilities. These results validate the superior ability of NextMem to achieve high-fidelity storage of latent memory.

Table 1. Performance comparison of factual reconstruction (i.e., task 1 for memory storage) across multiple datasets. In each assessment, values in bold represent the best results, and those with an underline represent the second-best results.

Dataset      Method          F1      ROUGE-1  ROUGE-L  METEOR  BLEU    BertScore
HotpotQA     DyPRAG          0.0305  0.0347   0.0337   0.0187  0.0000  0.7983
             DeepSeek-OCR    0.4540  0.5492   0.5374   0.3987  0.2432  0.8664
             ICAE            0.7890  0.8570   0.8340   0.7493  0.5782  0.9581
             NextMem-Dense   0.9820  0.9862   0.9854   0.9820  0.9633  0.9966
             NextMem-Sparse  0.9805  0.9842   0.9833   0.9810  0.9612  0.9962
RACE         DyPRAG          0.0696  0.0826   0.0689   0.0341  0.0000  0.8158
             DeepSeek-OCR    0.4068  0.4509   0.4268   0.3626  0.2371  0.8481
             ICAE            0.6077  0.6775   0.6117   0.5555  0.3503  0.9370
             NextMem-Dense   0.8552  0.8838   0.8580   0.8691  0.6995  0.9735
             NextMem-Sparse  0.8554  0.8833   0.8583   0.8705  0.6975  0.9731
SQuAD        DyPRAG          0.0493  0.0510   0.0477   0.0221  0.0000  0.8040
             DeepSeek-OCR    0.3657  0.4018   0.3755   0.3169  0.1864  0.8289
             ICAE            0.7084  0.7709   0.7163   0.6508  0.4501  0.9536
             NextMem-Dense   0.8920  0.9128   0.8886   0.8958  0.7581  0.9785
             NextMem-Sparse  0.8860  0.9069   0.8826   0.8897  0.7443  0.9778
LoCoMo       DyPRAG          0.0901  0.1143   0.0932   0.0501  0.0000  0.8335
             DeepSeek-OCR    0.5179  0.6421   0.6272   0.4627  0.3139  0.8962
             ICAE            0.6986  0.7815   0.7515   0.7043  0.4730  0.9560
             NextMem-Dense   0.9611  0.9742   0.9704   0.9640  0.9063  0.9946
             NextMem-Sparse  0.9615  0.9741   0.9705   0.9637  0.9070  0.9944
LongMemEval  DyPRAG          0.1338  0.1643   0.1331   0.0643  0.0000  0.8360
             DeepSeek-OCR    0.4685  0.5375   0.5106   0.4116  0.2713  0.8681
             ICAE            0.7015  0.7510   0.7007   0.6634  0.4690  0.9535
             NextMem-Dense   0.9436  0.9620   0.9555   0.9466  0.8784  0.9905
             NextMem-Sparse  0.9362  0.9554   0.9486   0.9397  0.8692  0.9891

Task 2: Contextual Generation (Memory Utilization). For latent memory, beyond preserving fine-grained details, it is crucial that the stored information can be utilized by LLMs for inference. Therefore, we evaluate our framework on memory utilization. Specifically, for each query, we extract its references from the datasets and encode them into latent representations for LLMs to generate responses. We design two settings: (1) Compression (Comp.), where the model performs inference directly using latent representations, and (2) DeCompression (DeComp.), where the inference is based on reconstructed information. Besides the above baselines, we incorporate raw textual memory as an oracle comparison. The results are presented in Table 2.

According to the results, while ICAE shows an advantage in the Comp. setting, our models outperform all baselines in the DeComp. setting. This indicates that while NextMem's latent space is less optimized for direct utilization, its superior reconstruction fidelity allows it to provide highly usable information once decompressed. It also reveals a trade-off between reconstruction accuracy and instruction-following capability, which we leave as a topic for future research. In contrast, DyPRAG and DeepSeek-OCR struggle to support effective generation in either setting.
Task 3: Dense Passage Retrieval (Memory Retrieval). For most agent applications, maintaining long-term memory is essential to support inference at any future point, which makes the retrieval of query-relevant memory a key problem. Since latent representations inherently possess computational properties within the latent space, they are compatible as a retrieval index. Therefore, we further evaluate retrieval performance. Specifically, we generate latent representations for documents, pool them into 1D embeddings, and rank them by matching scores calculated via cosine similarity with query embeddings. We also incorporate BGE as a reference, despite its inability to reconstruct. From the results in Table 3, our methods demonstrate substantial improvements over other baselines. In addition, NextMem also bridges the gap between latent memory and retrieval index. These results indicate that NextMem can effectively unify memory storage and retrieval into a single latent representation, reducing architectural complexity.

Table 2. Performance comparison of contextual generation (i.e., task 2 for memory utilization) across various datasets. For all non-oracle methods, the best results are highlighted in bold, and the second-best results are underlined.

Method          HotpotQA          SQuAD             LoCoMo            LongMemEval
                Comp.   DeComp.   Comp.   DeComp.   Comp.   DeComp.   Comp.   DeComp.
DyPRAG          0.5000  0.3789    0.2659  0.2023    0.0191  0.0239    0.0800  0.0971
DeepSeek-OCR    0.5673  0.3744    0.2124  0.2225    0.0766  0.1435    0.1943  0.1543
ICAE            0.8565  0.8229    0.7775  0.7066    0.5407  0.5215    0.4971  0.5029
NextMem-Dense   0.5179  0.8072    0.3223  0.7572    0.2871  0.5407    0.1971  0.5400
NextMem-Sparse  0.4978  0.8184    0.3092  0.7630    0.2679  0.5263    0.2029  0.5486
*Oracle         —       0.9350    —       0.9335    —       0.6986    —       0.6971

Table 3. Performance comparison of dense passage retrieval (i.e., task 3 for memory retrieval) across various datasets. For all reconstruction models, the best results are highlighted in bold, and the second-best results are underlined.

Dataset      Method          Hit@5   Recall@5  MRR@5   MAP@5   DCG@5   NDCG@5
HotpotQA     DeepSeek-OCR    0.3358  0.1487    0.1659  0.0730  0.2260  0.1171
             ICAE            0.4453  0.2217    0.3187  0.1555  0.4058  0.2126
             NextMem-Dense   0.7245  0.3925    0.5194  0.2793  0.6673  0.3680
             NextMem-Sparse  0.7208  0.4030    0.5107  0.2788  0.6683  0.3687
             *BGE            0.9585  0.6681    0.8063  0.5442  1.1756  0.6438
LoCoMo       DeepSeek-OCR    0.0676  0.0321    0.0269  0.0115  0.0383  0.0206
             ICAE            0.1210  0.0530    0.0577  0.0254  0.0789  0.0411
             NextMem-Dense   0.4377  0.2132    0.2418  0.1183  0.3455  0.1768
             NextMem-Sparse  0.4342  0.2087    0.2310  0.1111  0.3304  0.1692
             *BGE            0.8007  0.4824    0.5061  0.3181  0.7953  0.4166
LongMemEval  DeepSeek-OCR    0.4200  0.3125    0.1528  0.1133  0.2315  0.1747
             ICAE            0.5480  0.4169    0.2437  0.1840  0.3606  0.2596
             NextMem-Dense   0.8220  0.6805    0.5445  0.4350  0.7768  0.5279
             NextMem-Sparse  0.8140  0.6740    0.5428  0.4326  0.7695  0.5244
             *BGE            0.8960  0.7958    0.6934  0.6037  0.9876  0.6793

4.3 Ablation Studies

To evaluate the contribution of each component, we conduct an ablation study on RACE with the following settings: (1) w/o ST, removing [SoD] in inference; (2) w/o PT, removing the progressive latent substitution training; (3) w/o PS, excluding the progressive expansion in latent substitution training. The results in Table 4 show that the removal of any module leads to performance degradation, with progressive latent substitution being the most critical. In addition, [SoD] also plays a vital role, as its absence significantly drops performance. Moreover, the results show that the progressive strategy greatly improves performance. For the sparse model, removing the scaling in quantization (w/o SQ) leads to a drastic decline. These results indicate that each proposed component is important to latent memory storage.
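The dense-retrieval setup of Task 3, which pools each latent representation into a 1D embedding and ranks documents by cosine similarity against the pooled query, can be sketched as follows. Mean pooling and the function names here are our assumptions for illustration; the paper does not commit to a specific pooling operator in this section:

```python
# Sketch of latent-memory retrieval: pool each (l x d) latent representation
# into a 1D embedding, then rank documents by cosine similarity to the query.
# Mean pooling is an assumption made for this illustration.
import numpy as np

def pool(H: np.ndarray) -> np.ndarray:
    """Collapse an (l x d) latent representation into a 1D embedding."""
    return H.mean(axis=0)

def rank_by_cosine(query_lat: np.ndarray, doc_lats: list) -> list:
    """Return document indices sorted by descending cosine similarity."""
    q = pool(query_lat)
    q = q / (np.linalg.norm(q) + 1e-8)
    scores = []
    for H in doc_lats:
        v = pool(H)
        scores.append(float(q @ v / (np.linalg.norm(v) + 1e-8)))
    return sorted(range(len(doc_lats)), key=lambda i: -scores[i])

rng = np.random.default_rng(0)
query = rng.standard_normal((3, 4))
docs = [rng.standard_normal((3, 4)), query.copy(), -query]
order = rank_by_cosine(query, docs)
```

Because the same latent representation serves both reconstruction and ranking, storage and retrieval share a single index, which is the unification the results above point to.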
4.4 Influence of Compression Ratio Scaling

Since the capacity of latent representations is intuitively limited, we further explore the model's scaling behavior in encoding and reconstruction across different text lengths.

Table 4. Results of ablation study. The best results are highlighted in bold, and the second-best results are underlined.

Methods   F1      ROUGE-L  METEOR  BertScore
Dense     0.8552  0.8580   0.8691  0.9735
w/o ST    0.3799  0.3804   0.4048  0.7307
w/o PT    0.0159  0.0138   0.0169  0.7686
w/o PS    0.7389  0.7358   0.7353  0.9502
Sparse    0.8554  0.8583   0.8705  0.9731
w/o SQ    0.0309  0.0290   0.0442  0.7521

The results are illustrated in Figure 3. We find that NextMem maintains higher performance as the input length increases compared with the other models. While all models exhibit performance decay, NextMem shows a much slower and more graceful degradation. In addition, we observe that NextMem has slight performance dips on shorter sequences, possibly due to hallucinations. Crucially, our model maintains high semantic integrity beyond the training length (240 tokens), demonstrating robust extrapolation to out-of-distribution sequence lengths.

Figure 3. Results under varying compression ratios (DeepSeek-OCR, ICAE, NextMem-Dense, NextMem-Sparse).

Figure 4. Robustness results of latent representations under varying levels of Gaussian noise (σ) and NF4 quantization.

4.5 Robustness of Latent Memory

Since latent memory may suffer precision loss or be quantized to save storage, we further evaluate its robustness. Specifically, we perturb the encoded latent memory by adding Gaussian noise ε ~ N(0, σ²). From the results in Figure 4, we observe that our model maintains stable performance under moderate noise (σ ≤ 0.8), and still provides meaningful information under high perturbations.
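The robustness probe in Section 4.5 amounts to adding elementwise Gaussian noise to the latent memory before decoding. A minimal sketch follows; the shapes and the relative-L2 diagnostic are illustrative (a real evaluation would decode the perturbed latent and score the reconstruction):

```python
import numpy as np

def perturb_latent(latent, sigma, rng):
    """Add Gaussian noise eps ~ N(0, sigma^2) elementwise to a latent memory."""
    return latent + rng.normal(0.0, sigma, size=latent.shape)

rng = np.random.default_rng(42)
latent = rng.normal(size=(16, 64))  # toy latent memory: 16 tokens, dim 64

# Sweep noise levels as in Figure 4 and report how far the latent drifts.
for sigma in [0.0, 0.4, 0.8, 1.6]:
    noisy = perturb_latent(latent, sigma, rng)
    rel_err = np.linalg.norm(noisy - latent) / np.linalg.norm(latent)
    print(f"sigma={sigma:.1f} relative-L2={rel_err:.3f}")
```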
Furthermore, the results show that NF4 quantization leads to negligible performance loss, which provides a foundation for NextMem-Sparse. These findings validate that NextMem can effectively preserve information in noisy settings.

4.6 Semantic Assignment Analysis

To explore the semantic assignment of latent memory, we perform an experiment using a paragraph with eight declarative sentences. We iteratively substitute entities from the first sentence to the last, and then revert them. For each step, we compute the distances between the perturbed and original representations. The results in Figure 5 reveal a distinct diagonal pattern, which indicates a strong spatial-semantic mapping within latent memory: specific memory positions are causally responsible for storing information from corresponding parts of the text. This relationship demonstrates that our framework successfully learns an ordered latent space, which is crucial for fine-grained memory editing.

4.7 Training Procedure Analysis

Figure 6 presents the curves of training and evaluation losses, providing further insight into the model's optimization process. Both losses exhibit downward trends throughout the training procedure, eventually converging to low values. In addition, the loss curves display a distinct periodic sawtooth pattern that results from the progressive training strategy: temporary loss spikes occur at the boundaries where new substitution phases are introduced, and are immediately followed by rapid adaptation and continued loss reduction within each step.

Figure 5. Semantic assignment analysis of latent memory (Euclidean and cosine distances).

Figure 6. Curves of training and evaluation loss.

5 Related Works

Memory is important for LLM-based agents to store previous observations for future inference, and is typically represented in textual and parametric forms (Zhang et al., 2025c).
For textual memory, MemoryBank (Zhong et al., 2024) stores conversations and summaries for the agent's reasoning, and MemGPT (Packer et al., 2023) constructs an operating system to manage context memory. For parametric memory, model editing methods modify model parameters to inject knowledge (Zhang et al., 2024). Zhang et al. (2025f) also explore the advantages and limitations of both forms across different tasks. Some recent works employ latent tokens as experience for task learning and context management. For example, Gist (Mu et al., 2023) generates latent tokens from prefix task instructions for subsequent tasks, TokMem (Wu et al., 2025) constructs a trainable memory matrix to convert common instructions into embeddings, and MemGen (Zhang et al., 2025a) integrates previous tokens to obtain implicit representations for current steps.

6 Conclusion

In this paper, we introduce NextMem, an autoregressive autoencoder framework for efficient latent factual memory. By employing a two-stage training process and NF4 quantization, NextMem achieves high-fidelity reconstruction, robust retrieval, and significant storage reduction. Experiments validate its effectiveness, providing a scalable and effective foundation for memory in LLM-based agents.

Impact Statements

This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Banerjee, S. and Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005.

Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., and Liu, Z.
BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216, 2024.

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36:10088–10115, 2023.

Ge, T., Hu, J., Wang, L., Wang, X., Chen, S.-Q., and Wei, F. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945, 2023.

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al. A survey on LLM-as-a-judge. The Innovation, 2024.

Han, Z., Gao, C., Liu, J., Zhang, J., and Zhang, S. Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024.

Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

Karpukhin, V., Oguz, B., Min, S., Lewis, P. S., Wu, L., Edunov, S., Chen, D., and Yih, W.-t. Dense passage retrieval for open-domain question answering. In EMNLP (1), pp. 6769–6781, 2020.

Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.

Li, Y., Wen, H., Wang, W., Li, X., Yuan, Y., Liu, G., Liu, J., Xu, W., Wang, X., Sun, Y., et al. Personal LLM agents: Insights and survey about the capability, efficiency and security. arXiv preprint, 2024.

Li, Z., Song, S., Wang, H., Niu, S., Chen, D., Yang, J., Xi, C., Lai, H., Zhao, J., Wang, Y., et al.
MemOS: An operating system for memory-augmented generation (MAG) in large language models. arXiv preprint, 2025.

Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, 2004.

Maharana, A., Lee, D.-H., Tulyakov, S., Bansal, M., Barbieri, F., and Fang, Y. Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753, 2024.

Mu, J., Li, X., and Goodman, N. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36:19327–19352, 2023.

Packer, C., Fang, V., Patil, S., Lin, K., Wooders, S., and Gonzalez, J. MemGPT: Towards LLMs as operating systems. 2023.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.

Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22, 2023.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

Rajput, S., Mehta, N., Singh, A., Hulikal Keshavan, R., Vu, T., Heldt, L., Hong, L., Tay, Y., Tran, V., Samost, J., et al. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems, 36:10299–10315, 2023.

Schütze, H., Manning, C. D., and Raghavan, P. Introduction to Information Retrieval, volume 39. Cambridge University Press, 2008.

Sun, H., Zhang, Z., and Zeng, S. Preference-aware memory update for long-term LLM agents. arXiv preprint arXiv:2510.09720, 2025.

Sutton, R. S., Barto, A.
G., et al. Reinforcement Learning: An Introduction, volume 1. MIT Press, 1998.

Tan, Y., He, S., Liao, H., Zhao, J., and Liu, K. Dynamic parametric retrieval augmented generation for test-time knowledge enhancement. arXiv preprint arXiv:2503.23895, 2025.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.

Wei, H., Sun, Y., and Li, Y. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.

Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K.-W., and Yu, D. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024.

Wu, Z., Hao, Y., and Mou, L. TokMem: Tokenized procedural memory for large language models. arXiv preprint arXiv:2510.00444, 2025.

Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., et al. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101, 2025.

Xu, Y., Guo, X., Zeng, Z., and Miao, C. SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs. arXiv preprint arXiv:2502.12134, 2025.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, 2018.
Zhang, G., Fu, M., and Yan, S. MemGen: Weaving generative latent memory for self-evolving agents. arXiv preprint arXiv:2509.24704, 2025a.

Zhang, N., Yao, Y., Tian, B., Wang, P., Deng, S., Wang, M., Xi, Z., Mao, S., Zhang, J., Ni, Y., et al. A comprehensive study of knowledge editing for large language models. arXiv preprint arXiv:2401.01286, 2024.

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.

Zhang, W., Li, X., Zhang, Y., Jia, P., Wang, Y., Guo, H., Liu, Y., and Zhao, X. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025b.

Zhang, Z., Dai, Q., Bo, X., Ma, C., Li, R., Chen, X., Zhu, J., Dong, Z., and Wen, J.-R. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025c.

Zhang, Z., Dai, Q., Chen, X., Li, R., Li, Z., and Dong, Z. MemEngine: A unified and modular library for developing advanced memory of LLM-based agents. In Companion Proceedings of the ACM on Web Conference 2025, pp. 821–824, 2025d.

Zhang, Z., Dai, Q., Li, R., Bo, X., Chen, X., and Dong, Z. Learn to memorize: Optimizing LLM-based agents with adaptive memory framework. arXiv preprint arXiv:2508.16629, 2025e.

Zhang, Z., Zhang, Y., Tan, H., Li, R., and Chen, X. Explicit vs implicit memory: Exploring multi-hop complex reasoning over personalized information. arXiv preprint arXiv:2508.13250, 2025f.

Zhao, A., Huang, D., Xu, Q., Lin, M., Liu, Y.-J., and Huang, G. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 19632–19642, 2024.

Zhong, W., Guo, L., Gao, Q., Ye, H., and Wang, Y. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.
19724–19731, 2024.

A Failure Cases

A.1 Failure Cases on Reconstruction

Version 1 (Weighted Logits Combination): Initially, rather than using hidden states, we derived the latent representation by weighting the input embeddings of the α-quantile tokens based on their LM head logits. This approach compresses memory while maintaining high sparsity, effectively lowering storage costs. We use four special tokens, where [SoD]/[EoD] represent the start/end of textual documents, and [SoM]/[EoM] indicate the start/end of latent memory, respectively. Our preliminary experiments involve direct training, without reconstruction alignment and progressive steps. This method fails to yield the desired outcomes: while the loss drops from 2.18 to 1.24, the model merely learns to copy the reference text. We suppose that it is hard to use existing sparse tokens to directly represent latent representations.

Version 2 (Additional Latent Dictionary and Projection Head): To improve upon Version 1, we incorporate an extra latent representation dictionary, alongside an MLP projection head that maps hidden states to this dictionary. Both parts are introduced as optimizable parameters. In addition, we initialize the dictionary using PCA-reduced input embeddings followed by K-means clustering. The latent memory is then formed by combining tokens from the latent dictionary weighted by the projection logits. Despite a loss reduction from 2.4 to 1.7, the model generates only meaningless text. We also find that the latent representations are very similar to one another, indicating poor training of the latent space. Moreover, we find the special tokens [SoD], [SoM], and [EoD] are not adequately learned. Given the requirement for training efficiency, we move away from this approach and pursue subsequent improvements.
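The dictionary initialization used in Version 2 (PCA-reduce the input embeddings, then take K-means centroids as dictionary entries) can be sketched with NumPy alone. The sizes and helper names below are illustrative, not the actual configuration:

```python
import numpy as np

def pca_reduce(X, dim):
    """Project X onto its top `dim` principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

def kmeans_centroids(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the (k, dim) centroid matrix."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def init_latent_dictionary(input_embeddings, pca_dim=32, num_entries=64):
    """PCA reduction followed by K-means, as described for Version 2."""
    return kmeans_centroids(pca_reduce(input_embeddings, pca_dim), num_entries)

rng = np.random.default_rng(0)
emb = rng.normal(size=(512, 128))  # toy vocabulary of token embeddings
dictionary = init_latent_dictionary(emb)
print(dictionary.shape)
```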
Version 3 (Two-Stage Training Strategy with Additional Latent Space): In this version, we introduce a two-stage training approach: autoregressive reconstruction alignment followed by latent space optimization. The first stage focuses on explicit-to-explicit text replication, and the second stage establishes the process of encoding explicit text into latent representations and decoding them back to their original form. We further streamline the model by discarding the redundant special tokens from Version 2, using only [SoD] as a separator. While stage one performs successfully, demonstrating reliable text replication, the loss of stage two stabilizes at around 2.8, showing no meaningful convergence. We suspect this may result from the fact that the LM head maps high-dimensional hidden states onto discrete tokens, losing semantic information. We also suspect that it is hard to train all latent tokens at once, so they should be trained in a progressive way.

Version 4/Final (Progressive Latent Substitution Strategy): In this version, we implement two critical modifications. First, we introduce progressive latent substitution, which replaces the previous one-time optimization approach with a gradual substitution strategy. Second, we transition from the weighted combination of logits to the direct generation of latent hidden states. These changes result in a stable and persistent decline in training loss. Furthermore, based on experiments, we incorporate several key optimizations: (1) we append a special token to explicitly denote the completion of the decoding process; (2) we increase the training to three epochs, ensuring the loss reaches a sufficiently low level following the substitution-based augmentation. After hyperparameter tuning (e.g., of the learning rate), our final models are successfully produced by this approach, with stable and reliable encoding and decoding capability.
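Under our reading of progressive latent substitution, the number of substituted latent positions grows block by block over training steps. A schematic schedule is sketched below; the step count (15) and block size (16) follow the training configuration in Appendix B.2, while the helper name is ours:

```python
def substitution_schedule(num_steps=15, block_size=16):
    """At training step s (1-indexed), the first s * block_size positions
    of the explicit sequence have been replaced by latent tokens."""
    return [s * block_size for s in range(1, num_steps + 1)]

schedule = substitution_schedule()
print(schedule[0], schedule[-1])  # 16 positions at step 1, 240 at step 15
```

Note that 15 steps × block size 16 = 240 positions, consistent with the 240-token training length cited in Section 4.4.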
A.2 Failure Cases on Sparsity

Version 5 (Mixture-of-Experts Strategy): To further minimize the storage overhead of latent memory, we explore sparsification techniques inspired by the Mixture-of-Experts (MoE) framework. We implement a gating function to manage individual experts, which we construct as a mapping codebook initialized via Singular Value Decomposition (SVD) of the input embeddings. In this setup, we treat the hidden states as queries to perform sparsification and approximate reconstruction, optimizing the entire pipeline during the progressive latent substitution phase. However, experimental results show that the training loss stagnates at approximately 3.3, failing to converge further.

Version 6 (RQ-VAE Strategy): We further try the RQ-VAE (Rajput et al., 2023) framework to achieve latent representation sparsification. It discretizes continuous inputs into a hierarchical sequence of codebook indices through a recursive quantization process. Initially, the input is mapped into a latent embedding space. This embedding is then iteratively decomposed across multiple levels: at each stage, the model identifies the nearest codeword from a level-specific codebook based on Euclidean distance and computes a residual to be passed to the subsequent level. The final latent representation is reconstructed by aggregating the selected codewords from all levels through element-wise summation. However, experimental results show that while the training loss decreases to approximately 1.0, the model encounters representation collapse, where latent vectors across all positions become nearly identical. Additionally, the decoded outputs exhibit significant overfitting, primarily reflecting memorized training data rather than generalized features.
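The recursive quantization step described for Version 6 can be sketched as follows: at each level, snap the current residual to its nearest codeword and pass the new residual down, reconstructing by summing the chosen codewords. Codebook sizes and shapes are illustrative, not the RQ-VAE configuration we actually used:

```python
import numpy as np

def rq_encode(z, codebooks):
    """Residual quantization: return the selected index per level and the
    reconstruction as the element-wise sum of the chosen codewords."""
    residual = z.copy()
    indices, recon = [], np.zeros_like(z)
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        dists = ((cb - residual) ** 2).sum(axis=1)  # Euclidean distances
        j = int(np.argmin(dists))
        indices.append(j)
        recon += cb[j]
        residual = residual - cb[j]
    return indices, recon

rng = np.random.default_rng(0)
dim, levels = 8, 3
codebooks = [rng.normal(size=(16, dim)) for _ in range(levels)]
z = rng.normal(size=dim)
idx, recon = rq_encode(z, codebooks)
err = np.linalg.norm(z - recon)
print(idx, round(float(err), 3))
```

In a trained RQ-VAE the codebooks are learned so the residual shrinks level by level; with the random codebooks above the sketch only shows the control flow.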
Version 7 (OMP Strategy): We further employ the Orthogonal Matching Pursuit (OMP) strategy to sparsify the generated latent representations. For each latent vector, OMP iteratively selects the atoms from a pre-defined dictionary that exhibit the highest correlation with the current residual, followed by an orthogonal projection to optimize the corresponding coefficients. The signal is then reconstructed as a sparse linear combination of the selected atoms, with the fidelity of the approximation evaluated by the L2 norm of the residual between the original and reconstructed vectors. However, experimental results reveal significant residuals, indicating that the OMP process fails to achieve an accurate approximation of the original signals. This high reconstruction error suggests that the dictionary may not sufficiently match the original representations.

Version 8 (Explicit Projection): In this version, we explore explicit projection by tasking the trained decoder with translating latent space vectors into their respective logits and explicit tokens. We expect the model to effectively summarize the original explicit text into a single token through this process. However, the model merely selects isolated tokens from the original sequence in a disjointed pattern. It fails to capture the cohesive, compressed semantic information of the entire text segment, instead focusing on individual token-level mappings.

Version 9 (Reparameterization Strategy): Furthermore, we employ Gumbel-Softmax and Gaussian Softmax reparameterization tricks to bridge the gap between discrete token generation and subsequent computations. This approach aims to address the non-differentiability of discrete variables, allowing gradients to flow through the sampling process. However, we find that this strategy leads to significant numerical instability. The training loss fluctuates between 1.0 and 2.0, which leads to poor performance and increases decoding hallucinations in the generated output.

B Details of Reproduction

In this section, we describe the details of the model and experiment reproduction.

B.1 Dataset Preparation

Overview. For each dataset, we standardize the data into a unified format consisting of three primary fields: the question, the answer, and the reference. The processing pipeline for each dataset is as follows:

• HotpotQA [1]: We extract easy-level samples to focus on direct evidence retrieval. For each instance, we reconstruct the reference list by extracting only the specific sentences identified as supporting facts from the original context, rather than using the entire document. The final dataset is divided into 90% for training and 10% for testing.

• RACE [2]: We aggregate the High-level data across the original splits, treating the article context as the reference. We then re-partition the combined pool into 90% for training and 10% for testing.

• SQuAD [3]: To adapt it for the short-term memory task, we filter out all unanswerable questions, retaining only those where the answer is explicitly present in the context. The original context is used as the reference, and the various ground-truth answers are compiled into a list. The processed data is partitioned into training and test sets at a 9:1 ratio.

• LoCoMo [4]: From the LoCoMo dataset, we select samples belonging to category 1 (i.e., multi-hop) and category 5 (i.e., single-hop), which require specific evidence-based reasoning. We map the evidence IDs provided in the original metadata to the corresponding speaker-text turns in the dialogues to form the reference list. This dataset serves as a specialized test suite for long-context conversation understanding.

• LongMemEval [5]: We utilize a cleaned version of the LongMemEval-S dataset for evaluation.
The references for each question are constructed by aggregating relevant dialogue turns from multiple haystack sessions. Unlike the other datasets, this set is primarily used for testing and evaluating memory retrieval capabilities across long-context dialogues.

In addition, the procedure for processing the evaluation data for each task is as follows:

Task 1: Factual Reconstruction. We further construct a reconstructed reference pool derived from the datasets. To manage the data scale and simulate varying context lengths, the script implements a two-stage refinement strategy. First, it performs strided sampling based on a dataset-specific sampling gap to select a representative subset of the data. Second, for every reference associated with the sampled entries, a stochastic truncation is applied: the text is split into sentences, and only the first n sentences (where n is randomly sampled from [1, 8]) are retained. The resulting fragments are normalized to ensure proper punctuation and are finally saved as a flattened list of strings. This procedure effectively creates reference snippets of diversified lengths for evaluating memory reconstruction capabilities.

[1] http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json
[2] http://www.cs.cmu.edu/~glai1/data/race/RACE.tar.gz
[3] https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
[4] https://github.com/snap-research/LoCoMo
[5] https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned

Task 2: Contextual Generation. We utilize the processed testing data for the evaluation of contextual generation.

Task 3: Dense Passage Retrieval. Compared with the initial processing, we further construct the datasets for evaluating memory retrieval by introducing a hit list, which records the precise indices of the ground-truth references within a larger context pool.
While the first process primarily paired questions with their necessary evidence, this version expands the references field to include non-essential or session-wide context and explicitly tracks which specific entries are required to answer the question. This allows a more rigorous evaluation of a model's ability to retrieve accurately.

B.2 Model and Training Configuration

For the autoregressive autoencoder, we employ Qwen3-8B [6] as the backbone model. The maximum lengths for both encoding and output are set to 1024 tokens. During the training phase, the progressive training scheme consists of 15 steps with a block size of 16. Parameter-efficient fine-tuning is performed via LoRA, where we set the rank r = 16, α = 32, and a dropout rate of 0.1. We apply LoRA adapters to the q_proj, k_proj, v_proj, and o_proj modules. The model is trained on 4 NVIDIA A100 GPUs for 3 epochs per step, with an effective batch size of 32 and a learning rate of 5 × 10^-4.

B.3 Baseline Configuration

DeepSeek-OCR. For DeepSeek-OCR, we employ the PIL [7] package to render raw text into images, enabling the model to perform decoding. To align with the model's latent token capacity, the images are generated at a resolution of 240 × 240 pixels, which corresponds to 16 latent tokens. Specifically, the text is rendered in black on a white background using the Times New Roman font (size 12), with 5-pixel padding and an automatic line-wrap mechanism. Following the original model configuration [8], we use the decoding prompt: \nOCR this image.

ICAE. For ICAE, we utilize the official checkpoint mistral_7b_ft_icae [9], based on Mistral-7B-Instruct-v0.2 [10], to generate 128 latent tokens. Its LoRA fine-tuning configuration consists of a rank r = 512 and a dropout rate of 0.05.

DyPRAG. For DyPRAG, we employ the checkpoint llama3-8b-p32-1ep-main-2400sample, built upon Llama-3-8B [11], to generate LoRA adapters.
The LoRA parameters are set to a rank r = 2 and α = 32, with the projector p = 32.

Textual Memory. This model is exclusively employed in Task 2 for a comparison of contextual utilization, using the prompt: Please answer the following question based on the reference. Reference: {reference} Question: {question} Answer:

BGE. This model is exclusively utilized in Task 3 for a comparison of dense passage retrieval performance. Specifically, we employ BGE-M3 [12], loaded via SentenceTransformers [13].

C Influence of Block Size

We further explore how the substitution block size and the final latent length affect performance, as both are significant hyperparameters in our framework. The results are presented in Figure 7(a). We observe a clear positive correlation between performance and latent length L, as increasing the memory capacity allows for more nuanced information storage.

[6] https://huggingface.co/Qwen/Qwen3-8B
[7] https://pypi.org/project/pillow/
[8] https://github.com/deepseek-ai/DeepSeek-OCR/
[9] https://huggingface.co/sggetao/icae
[10] https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
[11] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
[12] https://huggingface.co/BAAI/bge-m3
[13] https://pypi.org/project/sentence-transformers

Figure 7. Results of extensive experiments: (a) results of different block sizes and latent lengths (BertScore, METEOR); (b) results of forgetting effects.

Furthermore, the block size B balances efficiency and accuracy. When B is small (e.g., B = 8), the model struggles to maintain reconstruction for long paragraphs given a fixed latent length. We also find that a large B can be suboptimal for optimization, with performance peaking at B = 16. This is consistent with our findings in Section 4.3 (w/o PS), suggesting that optimizing long token spans in a single stage is inherently difficult.
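The trade-off above can be made concrete: for a fixed latent length, the block size B determines how many progressive substitution stages are needed, so smaller blocks mean more but easier stages and larger blocks mean fewer but harder ones. A trivial sketch (the 240-token length comes from Section 4.4; the helper name is ours):

```python
import math

def num_stages(latent_length, block_size):
    """Number of progressive substitution stages needed to cover the
    full latent length when substituting block_size positions per stage."""
    return math.ceil(latent_length / block_size)

# B = 8 doubles the stage count of the reported peak B = 16; B = 32 halves it.
for B in (8, 16, 32):
    print(B, num_stages(240, B))
```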
D Forgetting Effect of Latent Memory

Furthermore, we simulate the memory decay process by introducing a pooling coefficient α(t) = a^t, where a controls the rate of forgetting. At each time step t, the pooled representation is updated as H(L) = α(t) · H(L) + (1 − α(t)) · H̄(L), where H̄(L) is the average embedding across various lengths. The results are presented in Figure 7(b). We find that performance across all evaluation metrics remains relatively stable during the initial time steps (t ≤ 4). However, as t increases further, we observe a sharp and consistent decline in all scores. This downward trend demonstrates that as the coefficient α(t) decreases, the latent memory effectively forgets specific factual details by gradually converging toward the average representation H̄(L).

E Case Studies

E.1 Case Study on Paragraphs with Various Lengths

Case Study: Below Length Limits. At this length, the model demonstrates near-perfect reconstruction fidelity. The semantic content, syntactic structure, and specific entities are almost identical to the original text. The reconstruction preserves every detail, from Alice's specific job title in Silicon Valley to her dinner preferences. Minor variations are negligible (e.g., capitalizing Master's), illustrating that within this limit, the model can effectively store and retrieve the full information density of the paragraph without loss.

Original Paragraphs (91 words): Alice recently celebrated her 21st birthday with a memorable rooftop party surrounded by her closest friends. She works as a specialized software engineer for a fast-paced technology startup in Silicon Valley. She currently resides in a cozy, sun-drenched apartment in the heart of downtown Los Angeles. She holds a master's degree in Computer Science, having graduated with honors from a prestigious university.
She prefers authentic Italian pasta with fresh basil and creamy tomato sauce for dinner. She bought a new high-performance laptop yesterday afternoon to help with her freelance coding projects.

Reconstruction: Alice celebrated her 21st birthday with a memorable rooftop party surrounded by her closest friends. She works as a specialized software engineer for a fast-paced technology startup in Silicon Valley. She currently resides in a cozy, sun-drenched apartment in the heart of downtown Los Angeles. She holds a Master's degree in Computer Science, having graduated with honors from a prestigious university. She prefers authentic Italian pasta with fresh basil and creamy tomato sauce for dinner. She bought a new high-performance laptop yesterday afternoon to help with her freelance coding projects.

Case Study: At Length Limits. As the text approaches the length limit, the model demonstrates impressive semantic robustness. It successfully preserves the core narrative structure and the vast majority of the character's profile attributes. While minor associative substitutions occur, the model maintains the overall logical coherence and thematic consistency of the original text. For example, it replaces specific models (i.e., Tesla Model 3 to Model S) or instruments (i.e., violin to bass). This indicates that even at full capacity, the model effectively prioritizes the retention of high-level biographical information and functional context.

Original Paragraphs (190 words): Alice recently celebrated her 21st birthday with a memorable rooftop party surrounded by her closest friends. She works as a specialized software engineer for a fast-paced technology startup in Silicon Valley. She currently resides in a cozy, sun-drenched apartment in the heart of downtown Los Angeles. She holds a master's degree in Computer Science, having graduated with honors from a prestigious university.
She prefers authentic Italian pasta with fresh basil and creamy tomato sauce for dinner. She bought a new high-performance laptop yesterday afternoon to help with her freelance coding projects. She visited the interactive science museum last weekend to explore the new exhibition on space travel. Alice will travel to Spain next week to spend ten days exploring the historic streets of Barcelona and Madrid. She plays the electric violin in a local jazz band that performs at community events every month. She drives a sleek, cherry-red Tesla Model 3 that she carefully maintains in pristine condition. She speaks fluent Japanese and English, a skill she developed while growing up in a bilingual household. She adopted a fluffy white cat with bright blue eyes from the animal shelter today.

Reconstruction: Alice celebrated her 21st birthday with a memorable rooftop party surrounded by her closest friends. She worked as a specialized software engineer for a tech startup in Silicon Valley. She currently resides in a cozy, sun-drenched downtown apartment in San Francisco. She holds a Master's degree in Computer Science, with honors, from a prestigious university. She prefers authentic Italian pasta with fresh creamy tomato sauce and basil for dinner. She bought a new high-performance laptop yesterday afternoon to help with her freelance coding assignments. She visited the international space exploration lab last Saturday to explore the new interactive display. She will spend the next two weeks in Spain traveling to explore the historic streets of Madrid and Barcelona. She plays the electric bass in a local music club that performs popular songs in Spanish and English. She drives a sleek, cherry-red Tesla Model S that she maintains in pristine condition. She cherishes the skill she learned in Spanish and English, a language she developed while growing up in a bilingual household.
She adopted a fluffy white bunny from the animal shelter with bright blue eyes today.

Case Study: Over Length Limits. Even when significantly exceeding the designated length limits, the model exhibits a remarkable ability to reconstruct the overarching primary life events. Key elements such as the subject's professional background, academic honors, and upcoming travel plans remain intact. Although the extreme information density leads to some blending of secondary details, the model still manages to capture the essential persona of the original text. This resilience suggests that the model's memory mechanism can maintain a coherent thematic framework even when pushed well beyond its operational window.

Original Paragraphs (263 words): Alice recently celebrated her 21st birthday with a memorable rooftop party surrounded by her closest friends. She works as a specialized software engineer for a fast-paced technology startup in Silicon Valley. She currently resides in a cozy, sun-drenched apartment in the heart of downtown Los Angeles. She holds a master's degree in Computer Science, having graduated with honors from a prestigious university. She prefers authentic Italian pasta with fresh basil and creamy tomato sauce for dinner. She bought a new high-performance laptop yesterday afternoon to help with her freelance coding projects. She visited the interactive science museum last weekend to explore the new exhibition on space travel. Alice will travel to Spain next week to spend ten days exploring the historic streets of Barcelona and Madrid. She plays the electric violin in a local jazz band that performs at community events every month. She drives a sleek, cherry-red Tesla Model 3 that she carefully maintains in pristine condition. She speaks fluent Japanese and English, a skill she developed while growing up in a bilingual household.
She adopted a fluffy white cat with bright blue eyes from the animal shelter today. She attended an energetic outdoor music concert last Friday night despite the rainy weather. She drinks a large iced coffee without sugar every morning to boost her energy for the day. She reads gripping mystery novels by Agatha Christie for at least an hour before sleeping. Alice will move to a more spacious apartment in a quieter neighborhood soon. She is severely allergic to roasted peanuts and always checks ingredients carefully before eating.

Reconstruction: Alice celebrated her 21st birthday with a memorable rooftop party surrounded by her closest friends. She worked as a specialized software engineer at a high-tech startup in Silicon Valley. She resides in a sun-drenched, cozy downtown apartment in the heart of San Francisco. She holds a Master's degree in Computer Science, with honors. She prefers authentic Italian pasta with fresh basil and creamy tomato sauce. She bought a new high-performance laptop yesterday afternoon to help with her freelance drawing work. She visited the interactive science center last week to explore the new space exploration module. Alice is going to Spain next week to spend ten days exploring the historic Spanish cities. She plays the electric violin and explores the local jazz clubs in Madrid. She enjoys a monthly party at a cherry-bright electric car that she drives in a sleek, pristine condition. She maintains a clean, sharp Mercedes E-Class she drives in a professional manner. She learned Spanish and fluent Japanese while growing up, a skill she developed in a bilingual environment. She adopted a fluffy white bunny from the animal shelter today with bright blue eyes. She enjoys an energetic outdoor workout despite the rainy evening. She drank a large iced coffee without sugar last night. She has no dietary restrictions.
E.2 Case Study on Gaussian Noise

The case study demonstrates a clear inverse relationship between the noise level (σ) and the semantic fidelity of the reconstructed text. The settings are aligned with Section 4.5. The degradation process can be categorized into four levels:

• High Fidelity (σ = 0.4): At this level, the reconstruction is nearly identical to the original text. The model preserves all factual details (e.g., age, occupation, location, and degree) with only negligible omissions (e.g., the word "recently"), maintaining near-perfect consistency.

• Minor Semantic Drift (σ = 0.8): The text remains grammatically fluent, but factual hallucinations begin to surface. While the core narrative is intact, specific entities are substituted with contextually plausible but incorrect ones, for instance, changing the residence from Los Angeles to San Francisco. This indicates that moderate noise begins to interfere with the model's ability to retrieve precise tokens.

• Structural Breakdown (σ = 1.2): Consistency degrades significantly as the noise disrupts the logic of the sentence. The model starts to repeat irrelevant descriptors (e.g., the repetitive use of "10th-floor") and fails to complete the paragraph. The original meaning is partially lost, replaced by fragmented imagery.

• Semantic Collapse (σ ≥ 1.6): At extreme noise levels, the model suffers from total divergence or output failure. The reconstruction either hallucinates entirely unrelated content (e.g., introducing a random date 2015-08-23 and changing the subject's gender to "he") or fails to generate any text at all.

Case Study on Gaussian Noise

Original Paragraphs (62 words): Alice recently celebrated her 21st birthday with a memorable rooftop party surrounded by her closest friends. She works as a specialized software engineer for a fast-paced technology startup in Silicon Valley.
She currently resides in a cozy, sun-drenched apartment in the heart of downtown Los Angeles. She holds a master's degree in Computer Science, having graduated with honors from a prestigious university.

Noisy Reconstruction (σ = 0.4): Alice celebrated her 21st birthday with a memorable rooftop party surrounded by her closest friends. She works as a specialized software engineer for a fast-paced technology startup in Silicon Valley. She currently resides in a cozy, sun-drenched apartment in the heart of downtown Los Angeles. She holds a Master's degree in Computer Science, having graduated with honors from a prestigious university.

Noisy Reconstruction (σ = 0.8): Alice celebrated her 21st birthday with a memorable rooftop party surrounded by her closest friends. She works as a specialized software engineer for a fast-paced technology startup in Silicon Valley. She currently resides in a cozy, sun-drenched apartment in the heart of downtown San Francisco. She holds a Master's degree in Computer Science, having graduated from a prestigious university.

Noisy Reconstruction (σ = 1.2): Alice celebrated her 21st birthday with a memorable weekend trip to the 10th-floor rooftop with her closest friends. She works as a specialized software engineer in a fast-paced 10th-floor startup. She currently resides in a cozy, sun-drenched, and dry

Noisy Reconstruction (σ = 1.6): 2015-08-23: The world's most popular celebration of a warm summer night includes a group of close friends. He works as a specialized software engineer for a fast-paced technology company. He resides in a sun-drenched, cozy house in the center of the city.

Noisy Reconstruction (σ = 1.6): (just a single space)
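Both robustness probes, the decay pooling of Appendix D and the Gaussian perturbation studied here, reduce to simple operations on the pooled latent memory. The following NumPy sketch illustrates them; the function names and toy vectors are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def add_gaussian_noise(latent: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Perturb a pooled latent with isotropic Gaussian noise of scale sigma."""
    rng = np.random.default_rng(seed)
    return latent + rng.normal(0.0, sigma, size=latent.shape)

def decay_pool(latent: np.ndarray, mean_latent: np.ndarray, a: float, t: int) -> np.ndarray:
    """Interpolate toward the average embedding with coefficient alpha(t) = a**t.

    As t grows, alpha(t) -> 0 and the latent converges to mean_latent,
    modeling the gradual forgetting of specific factual details.
    """
    alpha = a ** t
    return latent * alpha + mean_latent * (1.0 - alpha)

# Toy vectors standing in for pooled hidden states.
h = np.ones(8)        # latent of one stored paragraph
h_bar = np.zeros(8)   # average embedding across paragraphs

noisy = add_gaussian_noise(h, sigma=0.4)
decayed = decay_pool(h, h_bar, a=0.5, t=4)  # alpha = 0.0625, close to the mean
```

Decoding `noisy` at increasing `sigma` yields the degradation ladder above, while sweeping `t` in `decay_pool` reproduces the forgetting curve of Figure 7 (b).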