CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory

February 23, 2026

Reading time: 6 minutes

📝 Original Info

Title: CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory
ArXiv ID: 2512.15813
Date: 2025-12-17
Authors: Nishant Gaurav (nishant@agentr.dev), Adit Akarsh (adit@agentr.dev), Tejas Ravishankar (tejas@agentr.dev), Manoj Bajaj (manoj@agentr.dev)

📝 Abstract

Current tool-using AI agents suffer from a limited action space, context inefficiency, and probabilistic instability. These problems make them unsuitable for repetitive tasks, which are otherwise handled reliably and efficiently by agentic workflows built on platforms like n8n [12] and Zapier [8]. Earlier works such as CodeAct [15], DynaSaur [13], and Code Mode [4] tackle the first two issues by using the whole Python language as the action space, so the number of tools the agent can call becomes infinite. Python code blocks can execute complex actions in a single step and print only the relevant results, which helps keep the context lean. However, the probabilistic instability remains: for the same task in the same environment, the agent can follow different trajectories because of the probabilistic nature of LLMs. Procedural memory is therefore needed for consistency and reliability. This paper proposes CodeMem, an architecture that implements procedural memory via code and can be used to build and run reusable agentic workflows with deterministic reliability.
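To make the "code as action space" idea concrete, here is a minimal sketch of a single code action that chains several tool calls and prints only a short summary. Everything in it is an assumption made for illustration: the tools `list_invoices` and `send_reminder`, the `run_code_action` helper, and the sample data are hypothetical and do not come from the paper. Plain `exec` is used only as a stand-in; it is not a real sandbox.

```python
import contextlib
import io

# Hypothetical in-memory "tools"; a real agent would call external services.
def list_invoices() -> list[dict]:
    return [
        {"id": "inv-1", "customer": "acme", "amount": 120.0, "paid": True},
        {"id": "inv-2", "customer": "acme", "amount": 80.0, "paid": False},
        {"id": "inv-3", "customer": "globex", "amount": 300.0, "paid": False},
    ]

def send_reminder(invoice_id: str) -> str:
    return f"reminder sent for {invoice_id}"

TOOLS = {"list_invoices": list_invoices, "send_reminder": send_reminder}

# Code a model might emit as ONE action: filter, loop, aggregate, and
# print only the summary instead of every intermediate tool result.
ACTION_CODE = """
unpaid = [inv for inv in list_invoices() if not inv["paid"]]
for inv in unpaid:
    send_reminder(inv["id"])
total = sum(inv["amount"] for inv in unpaid)
print(f"{len(unpaid)} reminders sent, total due {total:.2f}")
"""

def run_code_action(code: str, tools: dict) -> str:
    """Run a code action with the tools in scope and return what it printed.

    Assumption: exec() stands in for a sandbox and offers no isolation.
    """
    buffer = io.StringIO()
    namespace = dict(tools)
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    return buffer.getvalue().strip()

# Only this short string would go back into the model's context.
observation = run_code_action(ACTION_CODE, TOOLS)
print(observation)
```

With one tool call per step, the same task would take three model turns and put the full invoice list into the context. Here the context receives a single line.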

💡 Deep Analysis

Deep Dive into CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory.
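The core loop described in the abstract (write code, validate it, save it, rerun it later) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the CodeMem implementation: the `ProceduralMemory` class, its method names, the JSON file layout, and the convention that a workflow defines a function named `workflow` are all hypothetical. `exec` again stands in for a sandbox and provides no isolation.

```python
import json
import tempfile
from pathlib import Path

class ProceduralMemory:
    """Hypothetical store that persists validated workflow code by name."""

    def __init__(self, path: Path):
        self.path = path
        self.workflows = json.loads(path.read_text()) if path.exists() else {}

    def save_if_valid(self, name: str, code: str, check_args: dict, expected) -> bool:
        """Persist the code only if it reproduces a known-good result."""
        try:
            ok = self._call(code, check_args) == expected
        except Exception:
            ok = False
        if ok:
            self.workflows[name] = code
            self.path.write_text(json.dumps(self.workflows, indent=2))
        return ok

    def run(self, name: str, args: dict):
        """Replay a saved workflow: same code and inputs give the same output."""
        return self._call(self.workflows[name], args)

    @staticmethod
    def _call(code: str, args: dict):
        namespace: dict = {}
        exec(code, namespace)  # assumption: stand-in for a real sandbox
        return namespace["workflow"](**args)

# Code a model might have drafted during an exploratory session.
DRAFT = """
def workflow(amounts, rate):
    return round(sum(amounts) * (1 + rate), 2)
"""

store_path = Path(tempfile.mkdtemp()) / "procedures.json"
memory = ProceduralMemory(store_path)
saved = memory.save_if_valid(
    "invoice_total", DRAFT, {"amounts": [100, 50], "rate": 0.1}, 165.0
)

# A later session reloads the store and runs the workflow without an LLM call.
reloaded = ProceduralMemory(store_path)
result = reloaded.run("invoice_total", {"amounts": [10, 20], "rate": 0.5})
print(saved, result)
```

The design point is that the model is only involved while drafting. Once the code passes validation and is stored, repeated runs no longer depend on sampling.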

📄 Full Content

Preprint. arXiv:2512.15813v1 [cs.SE] 17 Dec 2025

1 Introduction

Tool-using language agents have evolved from single-shot chatbots into complex systems capable of planning and state management. Frameworks like CoALA emphasize that capable agents require structured memory, rich action spaces, and iterative decision-making loops [14]. However, most production architectures still rely on token-heavy, tool-centric interaction patterns where the LLM micromanages every step. This paper proposes CodeMem, an architecture that reframes the LLM as an architect of executable workflows. Instead of standard chat-based tool calling, the agent uses a sandbox to write, validate, and save successful logic into a persistent procedural memory bank. This approach solves the reproducibility crisis inherent in probabilistic models by shifting complex logic from volatile context windows into deterministic code.

The rest of the paper is organized as follows. Section 2 summarizes the relevant literature, which forms the basis for CodeMem. Section 3 examines the benefits of CodeAct over ReAct and establishes that code is the right format for capturing procedures. Section 4 covers the key bottlenecks that must be solved. Section 5 proposes the CodeMem architecture, which overcomes the bottlenecks specified in Section 4. Section 6 shows how CodeMem creates procedural memory through a real-world case study. Section 7 presents experiments that demonstrate the benefits of CodeMem quantitatively. Finally, Section 8 summarizes the findings.

2 Related Work

2.1 Procedural Memory in Theory

The CoALA framework organizes agent memory into working, episodic, semantic, and procedural categories [14]. While working memory (context) and semantic memory (RAG) are well-solved, procedural memory (implicit knowledge of how to execute tasks) remains a bottleneck. The CoALA paper suggests three possible approaches for capturing procedural memory in AI agents:

1. Rewriting the code of the agent
2. Editing the weights of the LLMs
3. Editing the instructions of the AI agent

Theoretically, all three approaches make sense, but practical implementations are still missing. LangGraph, a popular agentic framework, calls this out in its documentation: “In practice, it is fairly uncommon for agents to modify their model weights or rewrite their code. However, it is more common for agents to modify their own prompts.” [9]

2.2 Implementing Procedural Memory via Meta Prompting

Frameworks like LangGraph distinguish between short-term memory (thread-scoped checkpoints) and long-term memory (cross-thread persistence) [10].

Short-Term: LangGraph effectively manages conversation state (i.e. the message history, including human, AI, and tool messages) via checkpoints. All kinds of memory (semantic, episodic, and procedural) can be captured in short-term memory, but there is no guarantee that the agent will use this memory reliably. That is why long-term memory is needed.

Long-Term: Semantic and episodic memory can be captured with RAG, and several open-source projects (e.g. Graphiti [6], LightRAG [7]) are used in production. However, capturing procedural memory is still an underexplored topic. LangGraph conceptualizes procedural updates primarily as modifications to graph topology or just the system instructions (e.g., “update the prompt to handle X better next time”) [9].

For example, the LangGraph team built a Tweet generator using external feedback and prompt rewriting to produce high-quality paper summaries for Twitter. In this case, the specific summarization prompt was diff

…(Full text truncated)…
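The distinction drawn in Section 2.2 between thread-scoped short-term memory and cross-thread long-term memory, with procedural updates expressed as prompt rewrites, can be illustrated with a small data structure. This is a plain-Python sketch and deliberately does not use LangGraph's API: the `AgentMemory` class, its fields, and the rule of appending feedback to the instructions are assumptions made for illustration. In a real system an LLM would rewrite the prompt rather than append to it.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Long-term memory: shared by every thread (hypothetical key-value store).
    long_term: dict = field(
        default_factory=lambda: {"instructions": "Summarize the paper."}
    )
    # Short-term memory: message history scoped to a single thread id.
    threads: dict = field(default_factory=dict)

    def add_message(self, thread_id: str, message: str) -> None:
        self.threads.setdefault(thread_id, []).append(message)

    def apply_feedback(self, feedback: str) -> None:
        """Procedural update as a prompt edit.

        Assumption: appending a rule stands in for LLM-driven prompt rewriting.
        """
        self.long_term["instructions"] += f" Rule: {feedback}"

    def build_prompt(self, thread_id: str) -> str:
        history = "\n".join(self.threads.get(thread_id, []))
        return f"{self.long_term['instructions']}\n{history}".strip()

memory = AgentMemory()
memory.add_message("thread-a", "user: the last summary was too long")
memory.apply_feedback("Keep summaries under 280 characters.")

# A new thread sees the updated instructions but not thread-a's messages.
prompt_b = memory.build_prompt("thread-b")
print(prompt_b)
```

The sketch also shows the limitation the paper points at: the updated instructions still pass through the model on every run, so they guide behavior without guaranteeing the same trajectory.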

Reference

This content is AI-processed based on ArXiv data.
