Advancing Multimodal Agent Reasoning with Long-Term Neuro-Symbolic Memory
Rongjie Jiang (University of New South Wales, Sydney, Australia; rongjie.jiang@student.unsw.edu.au), Jianwei Wang* (University of New South Wales, Sydney, Australia; jianwei.wang1@unsw.edu.au), Gengda Zhao (University of New South Wales, Sydney, Australia; gengda.zhao@unsw.edu.au), Chengyang Luo (Zhejiang University, Hangzhou, China; luocy1017@zju.edu.cn), Kai Wang (Shanghai Jiao Tong University, Shanghai, China; w.kai@sjtu.edu.cn), Wenjie Zhang (University of New South Wales, Sydney, Australia; wenjie.zhang@unsw.edu.au)

Abstract

Recent advances in large language models have driven the emergence of intelligent agents operating in open-world, multimodal environments. To support long-term reasoning, such agents are typically equipped with external memory systems. However, most existing multimodal agent memories rely primarily on neural representations and vector-based retrieval, which are well-suited for inductive, intuitive reasoning but fundamentally limited in supporting the analytical, deductive reasoning critical for real-world decision making. To address this limitation, we propose NS-Mem, a long-term neuro-symbolic memory framework designed to advance multimodal agent reasoning by integrating neural memory with explicit symbolic structures and rules. Specifically, NS-Mem is organized around three core components of a memory system: (1) a three-layer memory architecture consisting of an episodic layer, a semantic layer, and a logic rule layer; (2) a memory construction and maintenance mechanism, implemented by SK-Gen, that automatically consolidates structured knowledge from accumulated multimodal experiences and incrementally updates both neural representations and symbolic rules; and (3) a hybrid memory retrieval mechanism that combines similarity-based search with deterministic symbolic query functions to support structured reasoning.
Experiments on real-world multimodal reasoning benchmarks demonstrate that NS-Mem achieves an average 4.35% improvement in overall reasoning accuracy over purely neural memory systems, with gains of up to 12.5% on constrained reasoning queries, validating the effectiveness of NS-Mem.

Keywords: Neural-Symbolic AI, Memory Systems, Knowledge Representation, Open-World Agents, Structured Reasoning

Code Availability: The source code of this paper has been made publicly available at https://anonymous.4open.science/r/NSTF-842F.

[Figure 1: An example of a vector-centric multimodal agent on a constrained query. Asked what Jack should do next to make a fruit salad, embedding retrieval over the agent's memories (e.g., "Fruit salad needs fruits, a bowl, and a spoon. Steps: chop, mix, serve."; "Jack has chopped the fruits"; "The bowl at home is broken"; "A store downstairs has bowls, half a minute away") surfaces a content match and answers "Mix the fruits in a bowl", whereas the correct answer is "Buy a bowl".]

1 Introduction

The rapid advancement of Large Language Models (LLMs) has fundamentally redefined the landscape of intelligent agents, empowering them as sophisticated systems that perceive their environment, reason over observations, and execute actions to achieve complex goals [28, 43]. In real-world scenarios, environments are inherently multimodal, spanning modalities such as textual and visual data. Consequently, developing multimodal agents has become a primary frontier in achieving general-purpose autonomy [32, 35]. However, the successful deployment of these agents requires that they continuously accumulate knowledge from streams of heterogeneous observations, organize this information effectively, and retrieve contextually relevant insights to support flexible decision-making under varying constraints. At the core of this capability lies the memory module, the essential substrate for transforming raw multimodal streams into persistent, actionable knowledge.
Recent advances in memory-augmented multimodal agents have made significant progress. MemGPT [24] introduces explicit memory management for LLM-based agents, enabling them to handle information beyond context-window limitations. MovieChat [33] and MA-LMM [11] construct persistent memory structures for long-form video understanding, while VideoAgent [41] employs iterative frame selection guided by memory. M3-Agent [21] further advances this line by introducing VideoGraph, which organizes memories into episodic nodes representing specific events and semantic nodes capturing abstracted concepts, drawing inspiration from human cognitive systems [36]. These approaches typically employ retrieval-augmented generation (RAG) [16] with vector-based similarity search, achieving strong performance on factual recall and semantic matching tasks.

Conference'17, Washington, DC, USA, 2026. ACM ISBN 978-x-xxxx-xxxx-x/YYYY/MM. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Motivations. Despite their progress, most multimodal agent memories are vector-centric, relying primarily on neural embeddings for storage and retrieval, sometimes augmented with lightweight relational structures [40]. Such designs are well-suited for System 1 style reasoning [15], enabling associative inference through semantic similarity, including inductive reasoning that generalizes from past experiences, analogical reasoning that transfers knowledge across related contexts [42], and associative recall that links semantically proximate concepts. These capabilities are effective for fuzzy matching and commonsense retrieval in open-ended environments. However, they remain fundamentally limited in supporting the System 2 style reasoning that is critical for real-world decision making under explicit constraints [29, 37].
This includes deductive reasoning over dependencies, such as understanding prerequisites and ordering; abductive inference from partial observations; and constraint-aware reasoning involving constraint satisfaction or alternative discovery when constraints are violated [2]. These scenarios demand explicit structured reasoning mechanisms [40].

Example 1. Consider an agent in Figure 1 tasked with helping Jack make a fruit salad. The agent knows that fruit salad requires fruits, a bowl, and a spoon, and that the steps are: chop, mix, and serve. From its memory, it knows that Jack has already chopped the fruits (ID 798), the bowl at home is broken (ID 2341), and a store downstairs has bowls, just one minute away (ID 5231). When asked, "What should Jack do next?", a purely vector-based retrieval system can identify memory fragments mentioning "fruit salad" or "chopped fruits" based on semantic similarity. However, it cannot take into account that the bowl at home is broken or that a store nearby has bowls. As a result, it will only suggest mixing the fruits, ignoring the practical constraints.

Neuro-symbolic AI [4, 13] aligns neural representations with System 1 intuitive pattern matching and symbolic rules with System 2 deterministic logic, offering a promising research direction. However, designing a memory system that supports this dual-process architecture poses three fundamental challenges: (1) how to conceive a unified architecture in which neural representations and symbolic rules can coexist and interoperate effectively; (2) how to construct and continuously update such a memory system from large-scale multimodal data streams; and (3) how to effectively retrieve and utilize both neural and symbolic memories to advance the reasoning capabilities of intelligent agents.

Our approaches.
Guided by these challenges, we propose NS-Mem, a neuro-symbolic memory system that combines neural representations and symbolic rules to bridge the gap between System 1 intuitive reasoning and System 2 deterministic reasoning. NS-Mem provides a comprehensive framework comprising three integrated modules: (1) a unified prototype, (2) a scalable construction and update mechanism, and (3) a synergistic retrieval process.

First, we design a hierarchical framework that organizes information across three specialized layers: the episodic layer, the semantic layer, and the logic layer. The episodic layer records fine-grained multimodal observations with timestamps, while the semantic layer manages abstracted entity types and attributes. At the core of our system, the logic layer represents procedural knowledge through logic nodes. Each node couples a neural index with an explicit symbolic structure. The neural index is an aggregated neural embedding that enables discovery via vector similarity, and the symbolic structure is a procedural Directed Acyclic Graph (DAG) that encodes deterministic logical rules and step-wise dependencies.

Second, to construct and maintain the memory, we implement the SK-Gen mechanism, which automatically distills structured knowledge from accumulated multimodal experiences. It extracts temporally ordered action sequences and applies pattern mining to detect recurring procedural knowledge. Once a pattern is identified, the system constructs a symbolic DAG and computes the corresponding neural index. To handle continuous observations, the mechanism supports incremental updates, using exponential moving averages to refine the neural index and structural modifications to update edges in the procedural DAGs and transition frequencies without full reconstruction.

Third, we develop a hybrid retrieval strategy that synergistically combines similarity-based search with deterministic symbolic query functions.
The process begins by classifying incoming queries into factual, constraint, or character-based types to prioritize relevant memory layers. Neural retrieval first identifies relevant nodes through embedding similarity; the system then executes symbolic functions directly on the retrieved structures. This allows the agent to generate precise, reproducible answers that satisfy explicit constraints, effectively merging intuitive semantic matching with rigorous, deterministic reasoning.

Contributions. Our main contributions are as follows:

• We propose NS-Mem, a three-layer neuro-symbolic memory architecture that integrates an episodic layer, a semantic layer, and a neuro-symbolic layer to facilitate both neural discovery via memory prototypes and precise reasoning through symbolic structures with deterministic query functions.
• We develop SK-Gen to automate the extraction of structured knowledge from observations, which supports incremental updates to maintain consistency as new observations arrive.
• We introduce a hybrid retrieval mechanism that classifies queries by type and applies multi-level retrieval with symbolic enhancement for better reasoning.
• We demonstrate through experiments on real-world benchmarks that our approach achieves a 4.35% improvement in reasoning accuracy, with particularly strong gains on constraint-based queries.

2 Related Work

Memory-Augmented Agents in Single Modality. Memory management has evolved to handle long-term context beyond fixed window constraints [26]. Common strategies involve storing entire trajectories, such as dialogues [17, 23, 38, 49] or execution traces [14, 18, 20, 30, 31]. Other methods utilize summaries [14, 38, 49] or latent embeddings [6, 19, 33, 47] for persistence. Specialized architectures like MemGPT [24] and Voyager [39] provide finer control over memory and skill acquisition.
These single-modal systems focus on textual or symbolic data, forming the basis for more complex multimodal designs.

Multimodal Memory Systems. Recent work extends memory to multimodal environments, particularly for long-form video understanding [22]. One approach uses pure neural representations, storing memories as latent embeddings. MA-LMM [11], MovieChat [33], and Flash-VStream [47] employ memory mechanisms to compress video tokens or store encoded visual features [1, 12, 48]. Another direction integrates relational graphs with neural embeddings. M3-Agent [21] organizes memories into entity-centric VideoGraphs. Socratic Models [17, 46] use multimodal models to generate language-based descriptions.

[Figure 2: Overview of the NS-Mem framework. Raw multimodal data is processed through a three-layer memory prototype, maintained via the SK-Gen mechanism for distillation and incremental updates, and accessed through a hybrid retrieval framework designed for complex reasoning.]
While effective for semantic matching via RAG [16], these systems often lack the explicit symbolic structures needed for deterministic reasoning under complex constraints.

Neuro-Symbolic Integration. Neuro-symbolic AI combines neural perception with symbolic logic [9, 10]. Early models like Neuro-Symbolic VQA [45] disentangle perception from reasoning by executing symbolic programs. Modern approaches like Program-aided Language Models [8] and ViperGPT [34] use LLMs to generate executable code. Other methods embed symbolic knowledge into networks [27] or extract rules from representations [7, 44]. However, symbolic execution is often treated as a one-off tool. Our work introduces a Neuro-Symbolic Layer where symbolic structures are persistently stored and updated alongside neural representations, merging retrieval efficiency with reasoning precision.

3 Problem Statement

An intelligent agent perceives its environment through multimodal observations, maintains a memory of past experiences, and leverages these memories to make informed decisions. In modern architectures, Large Language Models (LLMs) serve as the cognitive core, while a memory system M enables the agent to accumulate knowledge over extended time horizons. Such memory-augmented agents must store past experiences, react to new stimuli, and maintain continuity across long operational periods.

Specifically, the agent perceives a continuous stream of multimodal observations O = {o_1, o_2, ...}, where each observation o_t comprises visual frames o_t^v, audio signals o_t^a, and textual descriptions o_t^s (e.g., transcribed speech or subtitles). Given a query q ∈ Q, the agent must retrieve relevant information from M and generate an accurate answer a ∈ A.

Problem Statement. Consider an agent that continuously receives a stream of video observations O.
The objective of a multimodal memory system is to maintain a structured memory M that captures long-term dependencies. For any query q ∈ Q, the agent needs to dynamically retrieve relevant context from M to generate an accurate answer a ∈ A, effectively supporting factual recall, procedural understanding, and constraint-aware reasoning within an online, memory-efficient framework.

4 Overall Framework of NS-Mem

In this section, we present NS-Mem, a neuro-symbolic memory framework designed to unify probabilistic semantic matching with deterministic structural reasoning. The frequently used notations are summarized in Appendix A.1. The overall time complexity analysis is summarized in Appendix A.2.

4.1 Architecture of NS-Mem

Effective reasoning in open-world environments demands a dual capability: the flexibility to recall concrete experiences (System 1) and the rigor to apply deterministic procedural rules (System 2). To unify these capabilities, we introduce a three-layer memory architecture designed to capture diverse forms of knowledge:

Definition 1. The memory system M = (L_epi, L_sem, L_logic, E) consists of three layers: the episodic layer L_epi, which stores timestamped observations; the semantic layer L_sem, which maintains entity coherence; and the logic layer L_logic, which encodes procedural rules. The edges E define the relationships between these layers. Specifically, the logic layer L_logic has directed edges to both the episodic and semantic layers, denoted E_{logic→epi} and E_{logic→sem}, respectively. Moreover, the episodic and semantic layers are interconnected via shared entity anchors, represented by the edges E_{epi↔sem}.

Specifically, the proposed memory system consists of the Episodic Layer, Semantic Layer, and Logic Layer.

(1) Episodic Layer. The episodic layer serves as the observational foundation of the memory system, recording fine-grained event descriptions grounded in multimodal perception.
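The three-layer structure of Definition 1 can be sketched as a minimal data model. This is an illustrative skeleton only: the field and class names below are ours, not the authors' implementation, and the node tuples mirror the definitions given in the rest of this section (e = (t, d, v_e), s = (type, attrs, v_s), N = (id, c, I, G, F)).

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class EpisodicNode:                  # e = (t, d, v_e)
    timestamp: float                 # temporal position in the stream
    description: str                 # atomic event narrative d
    embedding: List[float]           # v_e = phi(d)
    anchors: Set[str] = field(default_factory=set)  # entity-anchor ids

@dataclass
class SemanticNode:                  # s = (type, attrs, v_s)
    type: str                        # knowledge modality
    attrs: str                       # accumulated descriptive content
    embedding: List[float]           # v_s = phi(attrs)
    weight: int = 1                  # reinforcement-based confidence

@dataclass
class LogicNode:                     # N = (id, c, I, G, F)
    id: str
    goal: str                        # goal description c
    index: dict                      # I = {"goal": i_goal, "step": i_step}
    dag: object                      # procedural DAG G
    episodic_links: Set[int] = field(default_factory=set)  # evidence refs

@dataclass
class Memory:                        # M = (L_epi, L_sem, L_logic)
    episodic: List[EpisodicNode] = field(default_factory=list)
    semantic: List[SemanticNode] = field(default_factory=list)
    logic: List[LogicNode] = field(default_factory=list)
```

The cross-layer edges E are represented implicitly here: episodic and semantic nodes share entity-anchor ids, and logic nodes reference episodic evidence through episodic_links.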
Each episodic node e = (t, d, v_e) stores the timestamp t indicating its temporal position within the observation stream, a textual description d, and the corresponding embedding v_e = φ(d) ∈ R^d. The description d is an atomic event narrative that synthesizes visual and auditory signals into a unified textual representation. Each description explicitly references recognized entities through perceptual entity anchors, which are persistent identity nodes established via clustering of face embeddings and voice embeddings (detailed in Section 4.2.1). These entity references create edges from episodic nodes to the corresponding anchors, enabling entity-centric indexing so that all events involving a given identity can be efficiently retrieved across the entire temporal span of the memory.

(2) Semantic Layer. The semantic layer abstracts and consolidates knowledge at a higher level, maintaining entity-centric summaries that evolve as new observations accumulate. Each semantic node s = (type, attrs, v_s) encodes a specific facet of abstracted knowledge, where type categorizes the knowledge modality, attrs accumulates the descriptive content, and v_s = φ(attrs) ∈ R^d is the node embedding. Like episodic nodes, semantic nodes are connected to entity anchors through edges, but they differ fundamentally in their update semantics.
Rather than appending every observation as a new node, the semantic layer employs a reinforcement-based consolidation policy to maintain knowledge coherence: when a new semantic observation s_new arrives, the system computes its embedding similarity against existing semantic nodes that share at least one entity anchor; if the similarity exceeds a positive threshold τ_pos, the existing node's confidence is reinforced by incrementing its associated edge weight, and no new node is created; only when no sufficiently similar node exists is a new semantic node inserted into the layer.

(3) Logic Layer. Each Logic Node N = (id, c, I, G, F) pairs Index Vectors I for neural discovery with a Procedural DAG G for symbolic querying, along with a goal description c and deterministic query functions F. Each node also maintains episodic_links ⊆ L_epi referencing supporting observations for evidence traceability.

Index Vectors. Since a procedure comprises multiple steps, user queries may match either the high-level goal or specific intermediate steps. To accommodate both granularities, we maintain dual-level Index Vectors I = {i_goal, i_step}: the goal-level index i_goal = φ(c) embeds the goal description of the procedure, while the step-level index i_step = (1/|S|) Σ_{s∈S} φ(s) averages the embeddings of all step descriptions S = {s_1, ..., s_n}. This dual-index design mirrors how search engines index both document titles and contents, ensuring that both goal-oriented and step-specific queries can locate the relevant node.

Procedural DAG. Index Vectors solve the discovery problem but cannot answer structural questions such as step ordering or constraint satisfaction. For such queries, we represent explicit symbolic structure as a Procedural DAG G = (V, E, A), where V = {v_0, v_1, ...
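The consolidation policy and the dual-level Index Vectors described above can be sketched as follows. This is a minimal illustration under stated assumptions: φ is replaced by precomputed embeddings, the candidate set is assumed to already share an entity anchor with the new observation, and the τ_pos value is hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def consolidate(new_emb, new_attrs, candidates, tau_pos=0.85):
    """Reinforcement-based consolidation: candidates are existing semantic
    nodes (dicts) that share at least one entity anchor with the new
    observation; tau_pos is an illustrative threshold value."""
    for node in candidates:
        if cosine(new_emb, node["emb"]) > tau_pos:
            node["weight"] += 1          # reinforce existing node, no insert
            return node
    fresh = {"attrs": new_attrs, "emb": new_emb, "weight": 1}
    candidates.append(fresh)             # no sufficiently similar node: insert
    return fresh

def index_vectors(goal_emb, step_embs):
    """Dual-level Index Vectors: i_goal = phi(c), i_step = mean over steps."""
    return {"goal": goal_emb, "step": np.mean(step_embs, axis=0)}
```

Reinforcing an existing node instead of inserting a duplicate is what keeps the semantic layer compact as near-identical conclusions accumulate over a long observation stream.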
, v_n, v_{n+1}} includes the distinguished nodes v_0 = START and v_{n+1} = GOAL, E ⊆ V × V encodes valid step transitions, and A : V → 2^Attr maps each node to relevant attributes. DAGs offer three advantages: expressiveness through concurrent execution paths, constraint-aware filtering for alternatives, and probabilistic semantics via absorbing-Markov-chain modeling. In our implementation, observations from individual videos initially produce single-path DAGs; through knowledge fusion (Section 4.2.3), these merge into multi-path DAGs capturing procedural variations.

(4) Edges. These layers play complementary roles and interact with each other to support complex reasoning and decision-making. The Logic Layer connects to the Episodic Layer through episodic_links: each Logic Node N maintains references to specific episodic observations that serve as evidence grounding for the abstracted procedure, enabling the system to trace back to the underlying observations when needed. In contrast, the Logic Layer relates to the Semantic Layer through conceptual extension: while the Semantic Layer stores static entity attributes, the Logic Layer captures dynamic behavioral patterns involving those entities.

Within each layer, the Episodic and Semantic layers organize around entity anchors, i.e., perceptual nodes representing recognized identities from video observations. Episodic nodes and Semantic nodes both connect to relevant entity anchors, with temporal ordering represented implicitly through timestamps rather than explicit edges. The Logic Layer introduces Logic Nodes that are relatively independent of each other (representing distinct procedures) but connect downward to episodic evidence through episodic_links.

4.2 Memory Construction and Maintenance

Next, we introduce SK-Gen, which automatically constructs and updates the memory. The process is summarized in Algorithm 1.

4.2.1 Memory Construction Pipeline.
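A Procedural DAG with START/GOAL sentinels and edge-level transition counts can be illustrated with the toy class below. The method names are ours, not the paper's deterministic query functions F, and the single-constructor path reflects the statement that one observed execution yields a single-path DAG.

```python
from collections import defaultdict

class ProceduralDAG:
    """Toy G = (V, E, A): START/GOAL sentinels, transition counts N_ij,
    and simple step queries. Illustrative only."""
    START, GOAL = "START", "GOAL"

    def __init__(self, steps):
        # A single observed execution yields a single-path DAG.
        path = [self.START, *steps, self.GOAL]
        self.counts = defaultdict(int)          # (v_i, v_j) -> N_ij
        for u, v in zip(path, path[1:]):
            self.counts[(u, v)] += 1

    def successors(self, v):
        return sorted(b for (a, b) in self.counts if a == v)

    def transition_prob(self, v_i, v_j):
        # Estimated P(v_j | v_i) = N_ij / sum_k N_ik (cf. Section 4.2.2)
        total = sum(n for (a, _), n in self.counts.items() if a == v_i)
        return self.counts[(v_i, v_j)] / total if total else 0.0

    def next_steps(self, done):
        """Valid next actions given a set of completed steps."""
        frontier = {self.START, *done}
        return sorted({b for (a, b) in self.counts
                       if a in frontier and b not in done and b != self.GOAL})
```

A query such as "Jack has chopped the fruits; what next?" maps to next_steps({"chop"}), which is exactly the kind of step-ordering question the Index Vectors alone cannot answer.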
As multimodal video streams arrive, the system segments them into clips and extracts perceptual features: face embeddings via ArcFace [5] and voice embeddings via ERes2Net [3], with detected instances clustered to establish persistent entity anchors. A vision-language model then processes each clip along with the detected face and voice features, generating two types of textual outputs: (1) atomic event descriptions capturing observable actions, dialogues, and scene details with entity references, and (2) high-level conclusions summarizing character attributes, interpersonal relationships, and contextual knowledge. The former become Episodic Nodes e = (t, d, v_e); the latter populate Semantic Nodes s = (type, attrs, v_s) following the layer-specific update policies defined in Section 4.1. Algorithm 1 formalizes the complete construction pipeline, including observation processing (Phase 1) and Logic Node distillation (Phase 2). The pipeline then transforms episodic memories into Logic Nodes through five sequential steps.

Step 1: Action Sequence Extraction. From the observation stream O = {o_1, o_2, ..., o_K} and episodic memories L_epi, we extract temporally ordered action sequences. For each video or session v, we obtain S_v = ExtractActions({e ∈ L_epi : e.video = v}), producing the sequence set S_seq = {S_1, S_2, ..., S_V}, where each S_v = [a_1, a_2, ..., a_L] is an ordered list of actions. Action extraction may use rule-based pattern matching on episodic descriptions or LLM-based conversion to structured action representations.

Step 2: Sequential Pattern Mining. We apply PrefixSpan [25], a sequential pattern mining algorithm, to discover recurring procedural motifs. Unlike set-based mining algorithms, PrefixSpan preserves temporal ordering: the pattern [cut, blanch] is distinct from [blanch, cut].
The algorithm efficiently explores the pattern space through projected databases, outputting all patterns p satisfying support(p) = |{S ∈ S_seq : p ⊆ S}| / |S_seq| ≥ σ, where σ is the minimum support threshold. This yields the candidate patterns P_cand = {p : support(p) ≥ σ}.

Step 3: Knowledge Verification. Frequent patterns are not necessarily meaningful procedures. For instance, [pick_up, put_down] may be frequent but does not constitute coherent knowledge. We employ an LLM-based filter to evaluate whether each candidate represents complete, reusable knowledge: score_p = LLMVerify(p, M_rel), where M_rel contains related memories as context. Only patterns with score_p > τ proceed to structure extraction.

Step 4: DAG Construction. For verified patterns, we construct the Procedural DAG by creating nodes V = {START} ∪ {v_a : a ∈ p} ∪ {GOAL}, edges E = {(v_i, v_{i+1}) : consecutive actions in p}, and extracting attributes A(v) from associated episodic memories. We also establish episodic_links pointing to the specific episodic nodes supporting this pattern.

Step 5: Index Generation. Finally, we compute the Index Vectors of the logic node: i_goal = φ(p.goal) for goal-level matching, and i_step = (1/|p.steps|) Σ_{s∈p.steps} φ(s), the average over step embeddings, for step-level matching. The complete Logic Node N = (id, c, I, G, F) is then added to L_logic.

4.2.2 Incremental Maintenance. When new observations arrive in an open-world setting, we avoid costly full reconstruction by incrementally updating affected Logic Nodes through coupled neural and symbolic refinement.

Matching and Gating. For each new observation o_new, we first identify the best-matching Logic Node via neural discovery: N* = argmax_{N ∈ L_logic} sim(φ(o_new), I_N).
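The support definition above can be made concrete with a few lines of code. This sketch only computes support by brute force over ordered (not necessarily contiguous) subsequences, which is the containment notion PrefixSpan uses; it does not reproduce PrefixSpan's projected-database search, and the helper names are ours.

```python
def is_subsequence(pattern, sequence):
    """Ordered, not necessarily contiguous, containment: [cut, blanch]
    matches [wash, cut, peel, blanch] but [blanch, cut] does not."""
    it = iter(sequence)
    return all(any(a == x for x in it) for a in pattern)

def support(pattern, sequences):
    """support(p) = |{S in S_seq : p subseq of S}| / |S_seq|."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)

seqs = [["wash", "cut", "blanch", "serve"],
        ["cut", "season", "blanch"],
        ["blanch", "cut", "serve"]]
print(support(["cut", "blanch"], seqs))   # 2 of 3 sequences contain it in order
```

With a minimum support σ = 0.5, [cut, blanch] would survive as a candidate here while [blanch, cut] (support 1/3) would not, illustrating why order preservation matters for procedural knowledge.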
Updates proceed only if the similarity exceeds a gating threshold δ, preventing noise from corrupting established knowledge. Observations with similarity below δ may represent novel procedures; these accumulate in a candidate pool until sufficient evidence triggers a new distillation cycle.

Neural Refinement via EMA. As new observations accumulate, the semantic distribution of a procedure may drift: users may describe the same task with varying terminology, or the procedure itself may evolve. Static Index Vectors would become increasingly misaligned with current usage patterns, degrading retrieval accuracy. To maintain alignment while avoiding catastrophic forgetting of historical semantics, we update Index Vectors using an Exponential Moving Average (EMA):

    i_{t+1} = β · i_t + (1 − β) · φ(o_new),  β ∈ [0, 1]    (1)

where i_t is the current index vector and β (default 0.9) controls the decay rate. EMA naturally balances stability with adaptability: high β values yield stable indexes that resist noise, while lower values increase sensitivity to distributional shifts. Unlike direct replacement, which catastrophically overwrites historical information, EMA preserves established semantics while gradually incorporating new linguistic variations.

Symbolic Refinement via Transition Statistics. Beyond neural refinement, the symbolic structure also benefits from new observations. Real-world procedures exhibit variation: some steps are more commonly taken than others, and knowing these frequencies enables probabilistic reasoning about typical execution paths and alternative reliability. To capture this, we maintain edge-level transition counts N_ij for each (v_i, v_j) ∈ E in the Procedural DAG G = (V, E).
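The EMA update of Eq. (1) is a one-liner; the sketch below uses the paper's default β = 0.9 and represents φ(o_new) by a precomputed embedding.

```python
import numpy as np

def ema_update(index, new_emb, beta=0.9):
    """Eq. (1): i_{t+1} = beta * i_t + (1 - beta) * phi(o_new)."""
    return beta * index + (1.0 - beta) * new_emb

i_t = np.array([1.0, 0.0])
i_next = ema_update(i_t, np.array([0.0, 1.0]))
# The index drifts only 10% toward the new embedding per observation,
# so a single noisy clip cannot overwrite established semantics.
```

After n consecutive observations of a genuinely shifted distribution, the old component decays as β^n (about 0.35 after 10 updates at β = 0.9), which is the stability/adaptability trade-off the text describes.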
Each observed transition v_i → v_j increments the count N_ij ← N_ij + 1, yielding the estimated transition probability:

    P̂(v_j | v_i) = N_ij / Σ_{k : (v_i, v_k) ∈ E} N_ik    (2)

When observations reveal a previously unseen but valid action, we expand the graph by inserting new nodes or edges with initialized statistics, thereby increasing coverage of procedural diversity while preserving determinism for the existing structure. A natural concern is whether these incrementally updated statistics actually converge to meaningful values, or might drift arbitrarily. Fortunately, the counting-based estimator enjoys strong theoretical guarantees:

Theorem 1 (Posterior Consistency). As the number of observations n → ∞, the estimated transition probabilities converge almost surely to the true underlying probabilities: P̂(v_j | v_i) → P*(v_j | v_i) a.s.

Proof. See Appendix A.3.

Algorithm 1 SK-Gen: Memory Construction and Maintenance
Require: Observation stream O = {o_1, o_2, ..., o_K}; consolidation thresholds τ_pos, τ_neg; support threshold σ; verification threshold τ; gating threshold δ; EMA coefficient β
Ensure: Memory system M = (L_epi, L_sem, L_logic)
 1: // Phase 1: Observation Processing
 2: A ← ∅; L_epi ← ∅; L_sem ← ∅
 3: for each clip o_k in O do
 4:   F_k ← ArcFace(o_k); U_k ← ERes2Net(o_k)            ▷ Perceptual extraction
 5:   A ← ClusterAndTrack(A, F_k, U_k)                    ▷ Entity anchor update
 6:   D_k, C_k ← VLM(o_k, A)                              ▷ Descriptions & conclusions
 7:   for each description d ∈ D_k do                     ▷ Episodic construction
 8:     e ← (t_k, d, φ(d)); L_epi ← L_epi ∪ {e}
 9:     Link e to each entity anchor a ∈ ParseEntities(d)
10:   for each conclusion s_new ∈ C_k do                  ▷ Semantic consolidation
11:     S_cand ← {s ∈ L_sem : Entities(s_new) ⊆ Entities(s)}
12:     if ∃ s ∈ S_cand : sim(φ(s_new), v_s) > τ_pos then
13:       Reinforce(s)                                    ▷ Edge weights +1
14:     else if ∃ s ∈ S_cand : sim(φ(s_new), v_s) < τ_neg then
15:       Weaken(s)                                       ▷ Edge weights −1; prune if ≤ 0
16:     else
17:       L_sem ← L_sem ∪ {(type_new, s_new, φ(s_new))}
18: // Phase 2: Logic Distillation
19: L_logic ← ∅
20: S_seq ← ExtractActionSequence(O, L_epi)               ▷ Step 1
21: P_cand ← PrefixSpan(S_seq, σ)                         ▷ Step 2
22: for each pattern p ∈ P_cand do
23:   M_rel ← RetrieveRelatedMemories(p, L_epi)
24:   score_p ← LLMVerify(p, M_rel)                       ▷ Step 3
25:   if score_p > τ then
26:     G ← ConstructDAG(p, M_rel)                        ▷ Step 4
27:     i_goal ← φ(p.goal); i_step ← Mean({φ(s) : s ∈ p.steps})  ▷ Step 5
28:     N ← (id, p.goal, {i_goal, i_step}, G, F)
29:     L_logic ← L_logic ∪ {N}
30: // Phase 3: Incremental Maintenance
31: for each new observation o_new do
32:   N* ← argmax_{N ∈ L_logic} sim(φ(o_new), I_N)        ▷ Matching
33:   if sim(φ(o_new), I_{N*}) > δ then                   ▷ Gating
34:     i ← β · i + (1 − β) · φ(o_new) for each i ∈ I_{N*} ▷ Neural refinement
35:     G_{N*} ← UpdateTransitions(G_{N*}, o_new)          ▷ Symbolic refinement
36: return M

4.2.3 Knowledge Fusion. Real-world procedures rarely have a single canonical execution. Different individuals may perform the same task with variations in step ordering, optional steps, or alternative methods. When the same procedure is observed across multiple videos, each observation initially yields a single-path DAG representing one execution variant. Maintaining these as separate Logic Nodes would fragment the knowledge base, making retrieval less effective and preventing the system from recognizing that these variants represent the same underlying procedure.
We propose a Knowledge Fusion phase that merges single-path DAGs into a unified multi-path DAG through three operations: (1) node alignment via embedding similarity and optimal bipartite matching to identify semantically equivalent steps across DAGs; (2) edge union to preserve all observed transitions, creating branching points where procedures diverge; and (3) statistic pooling to combine transition counts via Bayesian conjugacy. The fused DAG captures the full space of procedural variations while maintaining accurate transition statistics, enabling constraint-based queries to explore all valid alternatives.

Conference'17, July 2017, Washington, DC, USA. Jiang et al.

The following theorem demonstrates that the fusion operation is sound, ensuring that it does not introduce spurious paths or corrupt statistics:

Theorem 2 (Fusion Consistency). Assuming input DAGs are valid observations of the same underlying procedure, the fusion operation preserves correctness: aligned nodes correspond to the same action with high probability, pooled parameters equal the posterior from the union of observations, and the fused structure retains all valid alternatives.

Proof. See Appendix A.4. □

4.3 Hybrid Retrieval and Reasoning
With the memory architecture established, the agent now needs to access the right knowledge at query time. This requires not only finding relevant memories across heterogeneous layers, but also extracting structured information from Logic Nodes when queries demand precise, constraint-aware answers.

4.3.1 Query Classification. Different queries demand different retrieval strategies. We classify incoming queries q ∈ Q into three types T ∈ {factual, constraint, character}: Factual queries request event recall or entity attributes and are best served by L_epi and L_sem. Constraint queries impose explicit feasibility constraints and require symbolic operations on G for resolution.
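The statistic-pooling operation, in particular, reduces to summing aligned edge counts: under the Dirichlet-multinomial conjugacy, added counts give exactly the posterior from the union of observations. A small sketch, assuming node alignment has already been resolved and using an illustrative nested-dict DAG encoding:

```python
def fuse_counts(counts_a, counts_b):
    """Pool transition counts from two single-path DAG observations.

    Edges present in only one DAG are kept (edge union); shared edges have
    their counts summed (statistic pooling). Node labels are assumed to be
    pre-aligned by the embedding-similarity matching step."""
    fused = {}
    for src in set(counts_a) | set(counts_b):
        row_a = counts_a.get(src, {})
        row_b = counts_b.get(src, {})
        fused[src] = {dst: row_a.get(dst, 0) + row_b.get(dst, 0)
                      for dst in set(row_a) | set(row_b)}
    return fused

# Two execution variants of the same (hypothetical) procedure:
variant_1 = {"wash": {"chop": 2}}
variant_2 = {"wash": {"chop": 1, "boil": 1}}
fused = fuse_counts(variant_1, variant_2)
# The branch "boil" survives (edge union) and the "chop" counts pool to 3.
```

The resulting multi-path structure is what lets constraint-based queries later enumerate every observed alternative instead of a single canonical execution.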
Character queries seek personality traits, behavioral patterns, or role summaries of specific individuals and benefit from cross-procedure aggregation via the Logic Layer. Classification employs a two-tier approach: a rule-based pre-filter provides fast initial classification based on lexical patterns indicating constraint- or character-based intent, with the remainder defaulting to factual; for ambiguous cases, an LLM-based classifier refines the prediction. The resulting classification T guides subsequent retrieval weighting.

4.3.2 Multi-Granularity Retrieval. A single query may relate to memory at different levels of abstraction: users sometimes ask about high-level goals and sometimes about specific intermediate steps. To handle both, retrieval proceeds in two stages that leverage the dual-level Index Vectors.

Stage I (Neural Discovery) performs a broad similarity search across all memory layers. For Logic Nodes, retrieval scores combine goal-level and step-level matching:

    score(q, N) = α · sim(φ(q), i_goal) + (1 − α) · sim(φ(q), i_step)    (3)

where α ∈ [0, 1] (default 0.3) balances high-level intent matching against specific content matching. This dual-index approach ensures that both goal-oriented and step-specific queries can discover relevant Logic Nodes. The initial retrieval returns candidates R_init(q) = {n ∈ M : score(q, n) > θ}, where θ is the retrieval threshold.

Stage II (Type-Aware Re-ranking) re-weights candidates based on the query classification T to prioritize the most relevant layer:

    score_final(n) = score_init(n) · w_layer(n, T)    (4)

where w_layer assigns higher weights to Episodic/Semantic nodes for T = factual and to Logic Nodes for T ∈ {constraint, character}. This strategy ensures that constraint and character queries surface Logic Nodes while factual queries prioritize episodic evidence.

4.3.3 Symbolic Enhancement for Reasoning.
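The two-stage scoring of Eqs. (3) and (4) can be sketched compactly. The layer-weight values below are illustrative assumptions, since the paper does not list the exact entries of w_layer:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def logic_node_score(q, i_goal, i_step, alpha=0.3):
    # Eq. (3): blend goal-level and step-level similarity (default alpha 0.3).
    return alpha * cosine(q, i_goal) + (1 - alpha) * cosine(q, i_step)

def rerank(score_init, layer, query_type):
    # Eq. (4): type-aware re-weighting. The 1.5 boost is a made-up value
    # standing in for the unspecified w_layer table.
    boost = {("episodic", "factual"), ("semantic", "factual"),
             ("logic", "constraint"), ("logic", "character")}
    w = 1.5 if (layer, query_type) in boost else 1.0
    return score_init * w

# A query aligned with a node's goal embedding but not its step embedding:
s = logic_node_score((1.0, 0.0), (1.0, 0.0), (0.0, 1.0))  # 0.3 * 1 + 0.7 * 0
```

With the default α = 0.3 the step index dominates, so step-specific queries remain discoverable even when their wording is far from the goal description.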
Once a relevant Logic Node is retrieved, its Procedural DAG G provides structured knowledge that can be queried programmatically. This is where the symbolic component becomes essential: rather than asking an LLM to "figure out" step sequences or filter by constraints from unstructured text, we directly traverse the DAG to extract exactly the information needed, whether enumerating valid paths, filtering by attribute constraints, or aggregating cross-procedure statistics. These operations are fast (O(|Π| · L) for path enumeration) and deterministic, ensuring reproducible answers.

Formally, a symbolic query function is a mapping f : (G, x) ↦ y, where x is query-specific input and y is the structured output. We implement three core functions:

(1) getProcedureWithEvidence(goal) → (G, episodic_links): returns the Procedural DAG along with supporting episodic evidence for a specified goal. This function enables evidence-grounded reasoning by providing both the abstract procedure and the concrete observations from which it was derived.

(2) queryStepSequence(goal, C) → Π_C: returns all paths from START to GOAL satisfying constraints C. Formally:

    Π_C = {π ∈ Π(v_0, v_{n+1}) : ∀v ∈ π, A(v) ⊨ C}    (5)

where Π(v_0, v_{n+1}) denotes all paths in G and A(v) ⊨ C indicates that node v's attributes satisfy constraint C. This function handles constraint queries by filtering paths whose every node fulfills the specified feasibility requirements.

(3) aggregateCharacterBehaviors(person) → {N_1, ..., N_k}: returns all Logic Nodes linked to a specified person entity, enabling character-centric aggregation queries. This cross-procedure aggregation cannot be answered by individual embeddings alone.

The following theorem demonstrates that the symbolic query functions are deterministic:

Theorem 3 (Determinism Guarantee).
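As a sketch, queryStepSequence (Eq. 5) amounts to depth-first enumeration with a per-node constraint predicate. The dict-based DAG encoding and the exemption of the START/GOAL sentinels from constraint checks are assumptions made for illustration:

```python
def query_step_sequence(dag, attrs, constraints, start="START", goal="GOAL"):
    """Return all start->goal paths whose every step satisfies `constraints`.

    `dag` maps node -> successor list (assumed acyclic), `attrs` maps
    node -> attribute dict, and `constraints` maps attribute -> required
    value. Sentinel nodes carry no attributes and are exempted here."""
    def ok(v):
        return all(attrs.get(v, {}).get(k) == val
                   for k, val in constraints.items())

    paths = []

    def dfs(v, path):
        if v not in (start, goal) and not ok(v):
            return  # prune: node violates the constraint set C
        if v == goal:
            paths.append(path)
            return
        for nxt in dag.get(v, []):
            dfs(nxt, path + [nxt])

    dfs(start, [start])
    return paths

# Hypothetical two-variant procedure: only one branch uses an oven.
dag = {"START": ["preheat", "pan_fry"], "preheat": ["GOAL"], "pan_fry": ["GOAL"]}
attrs = {"preheat": {"tool": "oven"}, "pan_fry": {"tool": "pan"}}
paths = query_step_sequence(dag, attrs, {"tool": "oven"})
# Only the oven-compatible path survives the filter.
```

Because traversal runs over a static structure, repeated calls return the same path set, which is the determinism property formalized in Theorem 3.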
All symbolic query functions f ∈ F are deterministic: for any fixed G and input x, repeated invocations of f(G, x) always return identical output y.

Proof. See Appendix A.5. □

4.4 Case Study
Figure 3 demonstrates how NS-Mem outperforms vector-centric memory in reasoning tasks. In Q1, NS-Mem successfully infers Jack's intent by linking "building a bed" with "missing screws" via a Logic Node, whereas the baseline fails due to noise. In Q2, NS-Mem arrives at the answer in a single turn by mapping the event "Tom beat Alex" to a pre-structured logic chain, while the baseline requires 3 turns of multi-hop retrieval. This highlights NS-Mem's superior capability in denoising and accelerating complex reasoning.

5 Experiments
Datasets. We evaluate our framework on M3-Bench [21], a comprehensive long-video question answering benchmark designed for memory-augmented agents. The benchmark consists of two primary subsets: M3-Bench-robot, which contains 100 real-world videos captured from a robot's perspective, and M3-Bench-web,

Table 1: Performance comparison on M3-Bench-robot and M3-Bench-web. MD: Multi-Detail, MH: Multi-Hop, CM: Cross-Modal, HU: Human Understanding, GK: General Knowledge. Best results in each column are underlined.
                             M3-Bench-robot                    M3-Bench-web
Method                       MD   MH   CM   HU   GK   All      MD   MH   CM   HU   GK   All
Socratic Models
  Qwen2.5-Omni-7b            2.1  1.4  1.5  1.5  2.1  2.0      8.9  8.8 13.7 10.8 14.1 11.3
  Qwen2.5-VL-7b              2.9  3.8  3.6  4.6  3.4  3.4     11.9 10.5 13.4 14.0 20.9 14.9
  Gemini-1.5-Pro             6.5  7.5  8.0  9.7  7.6  8.0     18.0 17.9 23.8 23.1 28.7 23.2
  GPT-4o                     9.3  9.0  8.4 10.2  7.3  8.5     21.3 21.9 30.9 27.1 39.6 28.7
Online Video Understanding
  MovieChat                 13.3  9.8 12.2 15.7  7.0 11.2     12.2  6.6 12.5 17.4 11.1 12.6
  MA-LMM                    25.6 23.4 22.7 39.1 14.4 24.4     26.8 10.5 22.4 39.3 15.8 24.3
  Flash-VStream             21.6 19.4 19.3 24.3 14.1 19.4     24.5 10.3 24.6 32.5 20.2 23.6
Agent Methods
  M3-Agent                  32.8 29.4 31.2 43.3 19.1 30.7     45.9 28.4 44.3 59.3 53.9 48.9
  NS-Mem                    36.2 31.5 33.8 45.7 26.4 34.7     54.2 34.6 47.8 60.1 59.7 53.6
which includes 920 web-sourced videos. The questions are categorized into five reasoning types: Multi-Detail (MD), Multi-Hop (MH), Cross-Modal (CM), Human Understanding (HU), and General Knowledge (GK). Due to computational constraints, we conduct evaluations on 50 videos (703 questions) for M3-Bench-robot and 550 videos (2,066 questions) for M3-Bench-web.

[Figure 3: Case study on vector-centric memory and NS-Mem, contrasting Q1 ("Why did Jack go out?") and Q2 ("Who finally got the basketball, and how did they get it?"); NS-Mem answers Q2 in 1 turn versus 3 turns for the baseline. Legend: Episodic Node, Semantic Node, Logic Node.]

Table 2: Accuracy by Query Type
Method      Factual   Procedural   Constrained
M3-Agent    52.5      23.8         25.0
NS-Mem      54.3      35.7         37.5
Δ           +1.8      +11.9        +12.5

Table 3: Efficiency Comparison on M3-Bench Datasets
Dataset   Metric            Baseline   NS-Mem   Δ
Robot     Avg. Rounds       4.01       3.38     -15.8%
Robot     Avg. Time (sec)   45.47      42.11    -7.4%
Web       Avg. Rounds       3.14       2.84     -9.6%
Web       Avg. Time (sec)   36.04      34.57    -4.1%

Baseline Methods. Following prior work [21], we compare NS-Mem against three categories of methods:
• Socratic Models: methods that directly query multimodal LLMs (Qwen2.5-Omni-7b, Qwen2.5-VL-7b, Gemini-1.5-Pro, and GPT-4o) without explicit memory;
• Online Video Understanding Methods: methods designed for streaming video processing (MovieChat [33], MA-LMM [11], and Flash-VStream [47]);
• Agent Method: M3-Agent [21], a state-of-the-art approach that uses episodic and semantic memory with vector-only retrieval.

Metrics and Implementation. Accuracy is evaluated using GPT-4o as the judge, following the standard M3-Bench protocol.
For NS-Mem, we set the hidden dimension to 512, the retrieval weight α to 0.3, and the verification threshold τ to 0.25. The incremental maintenance uses an EMA coefficient β of 0.9. All experiments are conducted on a server with an Intel(R) Xeon(R) Silver 4314 CPU, 512 GB of memory, and NVIDIA RTX A5000 GPUs.

5.1 Accuracy Comparison
Exp-1: Overall performance comparison. In this experiment, we evaluate the overall accuracy of NS-Mem against all baselines. The results are summarized in Table 1. As shown in the table, NS-Mem consistently outperforms all baseline methods on both the Robot and Web datasets. Specifically, NS-Mem achieves 53.6% accuracy on M3-Bench-web and 34.7% on M3-Bench-robot, representing absolute improvements of +4.7 and +4.0 points over M3-Agent. In contrast, Socratic Models and streaming methods show significantly lower performance: GPT-4o, for instance, achieves only 8.5% on Robot, 4.1× lower than NS-Mem. This is because our neuro-symbolic architecture provides a structured substrate for reasoning about procedural sequences, which the baselines fail to capture.

[Figure 4: Hyper-parameter analysis across different thresholds and weights. (a) Impact of τ on accuracy across query types (Factual, Procedural, Constraint). (b) Impact of δ on knowledge consolidation and merging. (c) Impact of α on accuracy and efficiency.]

Exp-2: Accuracy over different reasoning types. We further analyze performance across the five reasoning types in Table 1. Notably, NS-Mem demonstrates substantial gains in MH and GK. For MH, we observe a relative gain of 21.8% on Web, because the Procedural DAG enables explicit multi-step path enumeration. For GK, NS-Mem shows a 38.2% relative gain on Robot, benefiting from NS-Nodes that consolidate domain-specific procedural patterns from episodic observations.
This validates that the logic layer effectively abstracts reusable knowledge from concrete experiences.

Exp-3: Performance under different query types. To understand where neuro-symbolic memory provides the most value, we break down the results by query type in Table 2. We observe that procedural and constrained queries benefit most from the neuro-symbolic layer, with relative improvements of 50.0% for both. This is because the Procedural DAG explicitly encodes step-by-step logic, enabling deterministic constraint satisfaction via symbolic functions. In contrast, factual queries show only a 1.8-point improvement, as they primarily require direct recall rather than structured reasoning.

5.2 Efficiency Evaluation
Exp-4: Efficiency. In this experiment, we evaluate the efficiency of NS-Mem in terms of retrieval rounds and time. The results are summarized in Table 3. NS-Mem significantly reduces the number of query rounds from 4.01 to 3.38 on Robot, because symbolic functions like queryStepSequence() can return complete procedural sequences in a single call, eliminating the iterative cycles required by pure neural memory. This reduction leads to concrete time savings of 7.4% on Robot and 4.1% on Web, even with the minimal overhead of symbolic execution.

5.3 Ablation Study
Exp-5: Ablation study. We evaluate the individual contributions of the key components in Table 4. Symbolic reasoning is the most critical component, providing 2.5× larger gains on Web than retrieval enhancement alone. Furthermore, we observe a synergistic interaction: on the Web dataset, the combined gain of the full model (+4.7) exceeds the sum of the individual gains ((+1.1) + (+2.8) = +3.9).
Table 4: Ablation Study on M3-Bench Datasets
Configuration   Description              Web    Robot
Baseline        M3-Agent                 48.9   30.7
w/o Symbolic    +Neuro +DAG              50.0   31.7
w/o Neuro       +Symbolic +DAG           51.7   33.1
Full NS-Mem     +Neuro +Symbolic +DAG    53.6   34.7

Table 5: Incremental Update Evaluation
Metric          Dataset   Static   Dynamic   Δ
Accuracy (%)    Robot     33.8     34.7      +0.9
Accuracy (%)    Web       53.0     53.6      +0.6
Avg. Rounds     Robot     3.52     3.38      -0.14
Avg. Rounds     Web       2.92     2.84      -0.08

This is because the neural component (Index Vectors) improves retrieval precision, providing a more relevant substrate for symbolic reasoning to operate on.

5.4 Maintenance Evaluation
Exp-6: Incremental update performance. In this experiment, we evaluate the capacity of NS-Mem to maintain NS-Nodes as new observations arrive incrementally. We compare the performance of static graphs with dynamic graphs. The results are summarized in Table 5, which shows that dynamic graphs consistently outperform static graphs in both accuracy and efficiency. Specifically, for accuracy, the dynamic approach achieves 34.7% on Robot and 53.6% on Web, representing absolute improvements of +0.9% and +0.6%, respectively. Regarding efficiency, the dynamic graph reduces the average number of rounds from 3.52 to 3.38 on Robot and from 2.92 to 2.84 on Web. This is because the EMA-based refinement effectively incorporates new procedural variations while preserving historical knowledge, preventing semantic staleness without requiring costly full reconstruction.

5.5 Hyper-parameter Analysis
Exp-7: Verification threshold (τ). We analyze the impact of τ on accuracy in Figure 4a. We observe that procedural and constraint queries are highly sensitive to τ, peaking at τ = 0.25. This is because setting τ too low admits spurious patterns that break symbolic logic, while setting it too high overly filters valid procedural knowledge.

Exp-8: Gating threshold (δ).
As shown in Figure 4b, the gating threshold δ controls the balance between knowledge consolidation and noise prevention. The results show that merge error rates drop sharply until δ = 0.5, where they reach an elbow. This validates our choice of δ = 0.5 as the trade-off point for incremental maintenance.

Exp-9: Retrieval weight (α). We evaluate the impact of α in Figure 4c. We observe that overall accuracy peaks at α = 0.3, because a moderate weight balances high-level goal intent with specific experiential grounding, ensuring that Memory Prototypes remain both discoverable and contextually accurate.

6 Conclusion
In this paper, we presented NS-Mem, a long-term neuro-symbolic memory framework that bridges the gap between intuitive neural retrieval and deterministic symbolic reasoning for multimodal agents. By integrating a hierarchical three-layer architecture with explicit logic rules and procedural DAGs, NS-Mem enables agents to handle complex decision-making tasks that require constraint satisfaction and dependency reasoning. Our proposed SK-Gen mechanism further ensures that this memory can be automatically constructed and incrementally maintained from continuous multimodal observations. Extensive experiments on real-world benchmarks demonstrate that NS-Mem significantly outperforms the state-of-the-art approach, particularly in constrained reasoning scenarios where symbolic structures provide rigorous logical grounding.

References
[1] Yongtang Bao, Chenxi Wu, Peng Zhang, Caifeng Shan, Yue Qi, and Xianye Ben. 2024. Boosting micro-expression recognition via self-expression reconstruction and memory contrastive learning. IEEE Transactions on Affective Computing 15, 4 (2024), 2083–2096.
[2] Marcel Binz and Eric Schulz. 2022. Using cognitive psychology to understand GPT-3. CoRR abs/2206.14576 (2022).
arXiv:2206.14576 doi:10.48550/ARXIV.2206.14576
[3] Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, and Jiajun Qi. 2023. An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification. In Interspeech. 3032–3036.
[4] Artur d'Avila Garcez and Luís C. Lamb. 2023. Neurosymbolic AI: the 3rd wave. Artif. Intell. Rev. 56, 11 (2023), 12387–12406. doi:10.1007/S10462-023-10448-W
[5] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4690–4699.
[6] Anxhelo Diko, Tinghuai Wang, Wassim Swaileh, Shiyan Sun, and Ioannis Patras. 2025. ReWind: Understanding Long Videos with Instructed Learnable Memory. In Proceedings of the Computer Vision and Pattern Recognition Conference. 13734–13743.
[7] Richard Evans and Edward Grefenstette. 2018. Learning Explanatory Rules from Noisy Data. In JAIR.
[8] Luyu Gao et al. 2023. PAL: Program-aided Language Models. In ICML.
[9] Artur d'Avila Garcez and Luis C Lamb. 2019. Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. Journal of Applied Logics 6, 4 (2019), 611–632.
[10] Artur d'Avila Garcez and Luis C Lamb. 2023. Neurosymbolic AI: The 3rd Wave. Artificial Intelligence Review (2023).
[11] Bo He et al. 2024. MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding. In CVPR.
[12] Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. 2024. MA-LMM: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13504–13514.
[13] Pascal Hitzler and Md. Kamruzzaman Sarker (Eds.). 2021. Neuro-Symbolic Artificial Intelligence: The State of the Art.
Frontiers in Artificial Intelligence and Applications, Vol. 342. IOS Press. doi:10.3233/FAIA342
[14] Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. 2025. HiAgent: Hierarchical working memory management for solving long-horizon agent tasks with large language model. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 32779–32798.
[15] Daniel Kahneman. 2011. Thinking, Fast and Slow. Macmillan.
[16] Patrick Lewis et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS.
[17] Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, et al. 2023. MM-VID: Advancing video understanding with GPT-4V(ision). arXiv preprint arXiv:2310.19773 (2023).
[18] Na Liu, Liangyu Chen, Xiaoyu Tian, Wei Zou, Kaijiang Chen, and Ming Cui. 2024. From LLM to conversational agent: A memory enhanced architecture with fine-tuning of large language models. arXiv preprint arXiv:2401.02777 (2024).
[19] Weijie Liu, Zecheng Tang, Juntao Li, Kehai Chen, and Min Zhang. 2024. MemLong: Memory-augmented retrieval for long text modeling. arXiv preprint arXiv:2408.16967 (2024).
[20] Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K Choubey, Tian Lan, Jason Wu, Huan Wang, et al. 2024. AgentLite: A lightweight library for building and advancing task-oriented LLM agent system. arXiv preprint arXiv:2402.15538 (2024).
[21] Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, and Wei Li. 2025. Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory. arXiv preprint arXiv:2508.09736 (2025).
[22] Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, and Wei Li. 2025. Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory.
arXiv preprint arXiv:2508.09736 (2025).
[23] Kai Mei, Xi Zhu, Wujiang Xu, Wenyue Hua, Mingyu Jin, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. 2024. AIOS: LLM agent operating system. arXiv preprint arXiv:2403.16971 (2024).
[24] Charles Packer et al. 2023. MemGPT: Towards LLMs as Operating Systems. In NeurIPS.
[25] Jian Pei et al. 2001. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE.
[26] Feng Peiyuan, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, and Hang Li. 2024. AGILE: A novel reinforcement learning framework of LLM agents. Advances in Neural Information Processing Systems 37 (2024), 5244–5284.
[27] Tim Rocktäschel and Sebastian Riedel. 2017. End-to-End Differentiable Proving. In NeurIPS.
[28] Stuart Russell and Peter Norvig. 2010. Artificial Intelligence: A Modern Approach (3rd ed.). Prentice Hall.
[29] Abulhair Saparov and He He. 2023. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/forum?id=qFVVBzXxR2V
[30] Gabriel Sarch, Yue Wu, Michael Tarr, and Katerina Fragkiadaki. 2023. Open-ended instructable embodied agents with memory-augmented large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023. 3468–3500.
[31] Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. 2024. AgentSquare: Automatic LLM agent search in modular design space. arXiv preprint arXiv:2410.06153 (2024).
[32] Mohit Shridhar et al. 2020. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In CVPR.
[33] Enxin Song et al. 2024. MovieChat: From Dense Token to Sparse Memory for Long Video Understanding. In CVPR.
[34] Dídac Surís, Sachit Menon, and Carl Vondrick. 2023. ViperGPT: Visual Inference via Python Execution for Reasoning. In ICCV.
[35] Yansong Tang et al. 2019. COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis. In CVPR.
[36] Endel Tulving. 1972. Episodic and semantic memory. Organization of Memory 1 (1972), 381–403.
[37] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. On the Planning Abilities of Large Language Models - A Critical Investigation. CoRR abs/2305.15771 (2023). arXiv:2305.15771 doi:10.48550/ARXIV.2305.15771
[38] Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, and Zhoujun Li. 2023. Enhancing large language model with self-controlled memory framework. arXiv preprint arXiv:2304.13343 (2023).
[39] Guanzhi Wang et al. 2023. Voyager: An Open-Ended Embodied Agent with Large Language Models. In NeurIPS.
[40] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. 2024. A survey on large language model based autonomous agents. Frontiers Comput. Sci. 18, 6 (2024), 186345. doi:10.1007/S11704-024-40231-1
[41] Xiaohan Wang et al. 2024. VideoAgent: Long-form Video Understanding with Large Language Model as Agent. In ECCV.
[42] Taylor W. Webb, Keith J. Holyoak, and Hongjing Lu. 2022. Emergent Analogical Reasoning in Large Language Models. CoRR abs/2212.09196 (2022). arXiv:2212.09196 doi:10.48550/ARXIV.2212.09196
[43] Michael Wooldridge and Nicholas R Jennings. 1995. Intelligent agents: Theory and practice. The Knowledge Engineering Review 10, 2 (1995), 115–152.
[44] Fan Yang et al. 2017. Differentiable Learning of Logical Rules for Knowledge Base Reasoning. In NeurIPS.
[45] Kexin Yi et al. 2018. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. In NeurIPS.
[46] Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2024. MM-Narrator: Narrating long-form videos with multimodal in-context learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13647–13657.
[47] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. 2024. Flash-VStream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085 (2024).
[48] Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, et al. 2024. InternLM-XComposer2.5-OmniLive: A comprehensive multimodal system for long-term streaming video and audio interactions. arXiv preprint arXiv:2412.09596 (2024).
[49] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19724–19731.

A Appendix
A.1 Notation
Table 6: Key notations used throughout this paper.
Notation                      Description
M                             Memory system M = (L_epi, L_sem, L_logic, E)
L_epi, L_sem, L_logic         Episodic, Semantic, and Logic layers
O = {o_1, o_2, ...}           Multimodal observation stream
o_t = (o^v_t, o^a_t, o^s_t)   Observation at time t (visual, audio, text)
Q, A                          Query space and answer space
q ∈ Q                         A user query in natural language
e = (t, d, v_e)               Episodic node: timestamp, content, episodic embedding
s = (type, attrs, v_s)        Semantic node: entity type, attributes, semantic embedding
N = (id, c, I, G, F)          Logic Node: id, goal description, index vectors, DAG, functions
c                             Natural language goal description
I = {i_goal, i_step}          Index Vectors: goal-level and step-level embeddings
i_goal, i_step                Goal index φ(c) and step index (1/|S|) Σ_s φ(s)
G = (V, E, A)                 Procedural DAG: nodes, edges, attribute function
F                             Set of deterministic symbolic query functions
φ : Text → R^d                Text embedding function
d                             Embedding dimension

A.2 Complexity Analysis
NS-Mem operates continuously through two decoupled processes: incremental maintenance and query-driven reasoning. In the incremental maintenance phase, for each incoming observation at time step t, the system performs pattern mining and prototype updates sequentially. The SK-Gen mechanism processes a sliding window of size w to update the episodic buffer, with a time complexity of O(w × d) for embedding computation, where d is the vector dimension. For structural refinement, updating the transition statistics and edges in the Procedural DAG G = (V, E) takes O(1) via hash-based lookups. The prototype update via EMA requires element-wise operations with a complexity of O(d). Crucially, maintaining the vector index for L_logic involves incremental insertions; using graph-based indexing structures, this requires O(d log|N|), where |N| is the number of Logic Nodes. Therefore, the total maintenance complexity per time step is dominated by the index update O(d log|N|), which is highly efficient compared to batch retraining.
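The O(d) prototype update mentioned above is a plain elementwise exponential moving average. A one-line sketch with list-based vectors (illustrative, not the actual implementation):

```python
def ema_update(index_vec, obs_vec, beta=0.9):
    # O(d) refinement of an Index Vector: i <- beta * i + (1 - beta) * phi(o_new),
    # where beta is the EMA coefficient from Algorithm 1, Phase 3.
    return [beta * i + (1 - beta) * o for i, o in zip(index_vec, obs_vec)]

refined = ema_update([1.0, 0.0], [0.0, 1.0])  # ≈ [0.9, 0.1]
```

With β = 0.9 each new observation shifts the prototype only slightly, which is what keeps historical knowledge intact while still absorbing new procedural variation.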
In the retrieval and reasoning phase, the system first identifies relevant Logic Nodes N using index-based retrieval. For a specific query q, the search complexity is O(d log|N|). Once a relevant node is identified, symbolic reasoning is executed on the associated Procedural DAG G; finding a path or checking constraints takes O(|V| + |E|). Thus, supposing NS-Mem runs for T time steps and handles Q queries, the overall time complexity is O(T × d log|N| + Q × (d log|N| + |V| + |E|)).

A.3 Proof of Theorem 1 (Posterior Consistency)
Proof. We prove consistency for both parameter types.

Node success rates (Beta-Binomial): The posterior mean is

    R̂(v) = (α + s) / (α + β + n)    (6)

where s is the number of successes out of n trials. As n → ∞:

    R̂(v) = (α + s) / (α + β + n) = (α/n + s/n) / (α/n + β/n + 1) → s/n → R*(v)    (7)

by the strong law of large numbers (s/n →a.s. R*(v)).

Edge probabilities (Dirichlet-Multinomial): The posterior mean for edge (u, v_j) is

    p̂_{u,v_j} = (γ_j + c_j) / Σ_i (γ_i + c_i)    (8)

where c_j is the count of transitions to v_j. As the total number of observations n = Σ_i c_i → ∞:

    p̂_{u,v_j} → c_j / n →a.s. p*_{u,v_j}    (9)

again by the strong law of large numbers. Both results follow from Doob's posterior consistency theorem for exponential families with compact parameter spaces. □

A.4 Proof of Theorem 2 (Fusion Consistency)
Proof. We show that the fusion algorithm preserves correctness under the assumption that input DAGs are valid observations of the same true procedure.

Node alignment correctness: Using semantic embeddings with cosine similarity and threshold τ = 0.8, nodes representing the same action (with potentially different surface forms) are correctly matched with high probability. The Hungarian algorithm guarantees optimal bipartite matching.
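The Beta-Binomial posterior mean of Eq. (6) and its convergence to the empirical rate can be checked numerically. A sketch, assuming a uniform Beta(1, 1) prior:

```python
def beta_posterior_mean(successes, trials, alpha=1.0, beta=1.0):
    # Eq. (6): posterior mean of a success rate under a Beta(alpha, beta) prior.
    return (alpha + successes) / (alpha + beta + trials)

# The prior's influence vanishes as n grows (Eq. 7): the estimate tends to s/n.
small_n = beta_posterior_mean(7, 10)            # (1 + 7) / (2 + 10)
large_n = beta_posterior_mean(70_000, 100_000)  # very close to 0.7
```

The small-sample estimate is pulled toward the prior mean 0.5, while the large-sample estimate is indistinguishable from the empirical frequency, illustrating the almost-sure convergence claimed by Theorem 1.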
Parameter fusion correctness: The Bayesian fusion rule

    α_fused = α_1 + α_2 − 1    (10)
    β_fused = β_1 + β_2 − 1    (11)

is equivalent to pooling observations: if DAG 1 observed (s_1, n_1) successes/trials and DAG 2 observed (s_2, n_2), the fused posterior is

    Beta(α_0 + s_1 + s_2, β_0 + (n_1 − s_1) + (n_2 − s_2))    (12)

which is the correct posterior for the combined observations.

Structure preservation: New edges (alternative paths) discovered in one video but not another are added to the fused DAG with appropriately Laplace-smoothed initial probabilities. This ensures no valid alternatives are lost. By Theorem 1, as more videos are fused, parameters converge to the true values, and the structure asymptotically captures all valid paths. □

A.5 Proof of Theorem 3 (Determinism)
Proof. We prove that each of the three symbolic query functions produces deterministic outputs.

Function 1 (getProcedureWithEvidence): Given a goal description, this function retrieves the corresponding Logic Node and returns its Procedural DAG G together with the associated episodic_links. Since the Logic Node is identified through a fixed similarity computation on static Index Vectors, and both G and episodic_links are stored data structures, the output is a deterministic lookup with no stochastic component.

Function 2 (queryStepSequence): Given a goal and constraint set C, this function enumerates all paths Π(v_0, v_{n+1}) from START to GOAL in the fixed DAG G via depth-first traversal, then filters by checking A(v) ⊨ C for every node v along each path. Graph traversal on a static structure is deterministic, and attribute-constraint satisfaction is a Boolean predicate evaluated through set/arithmetic operations. Hence the filtered path set Π_C is uniquely determined by (G, C).
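The fusion rule of Eqs. (10)-(11) subtracts one copy of the shared prior so it is not counted twice. A quick numerical check against the pooled posterior of Eq. (12), assuming a Beta(1, 1) prior and hypothetical counts:

```python
def fuse_beta(a1, b1, a2, b2, a0=1.0, b0=1.0):
    # Eqs. (10)-(11): combine two Beta posteriors that share prior (a0, b0);
    # subtracting (a0, b0) once avoids double-counting the prior.
    return a1 + a2 - a0, b1 + b2 - b0

# DAG 1 saw 3 successes in 5 trials; DAG 2 saw 4 in 6 (made-up numbers).
s1, n1, s2, n2 = 3, 5, 4, 6
post1 = (1 + s1, 1 + (n1 - s1))  # Beta posterior from DAG 1 alone
post2 = (1 + s2, 1 + (n2 - s2))  # Beta posterior from DAG 2 alone
fused = fuse_beta(*post1, *post2)
pooled = (1 + s1 + s2, 1 + (n1 - s1) + (n2 - s2))  # Eq. (12)
# fused == pooled, confirming the conjugacy argument.
```

The same subtraction generalizes to the Dirichlet case by removing one copy of each pseudo-count γ_i.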
Function 3 (aggregateCharacterBehaviors): Given a person entity identifier, this function scans the Logic Layer L_logic and collects all Logic Nodes whose episodic_links reference episodic nodes associated with the specified entity anchor. Both the entity-anchor association and the episodic_links are fixed stored references, making the aggregation a deterministic filtering operation over a static set.

Since none of the three functions involves random sampling, stochastic models, or LLM inference during computation, all outputs are reproducible given identical inputs. □