MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
Peng Xia 1*, Jianwen Chen 1*, Xinyu Yang 2*, Haoqin Tu 3*, Jiaqi Liu 1*, Kaiwen Xiong 1*, Siwei Han 1, Shi Qiu 1, Haonian Ji 1, Yuyin Zhou 3, Zeyu Zheng 4, Cihang Xie 3, Huaxiu Yao 1*
1 UNC-Chapel Hill, 2 Carnegie Mellon University, 3 UC Santa Cruz, 4 UC Berkeley; * Core Contributors

Large language model (LLM) agents have rapidly emerged as powerful assistants for complex, multi-step tasks, yet agents deployed in the wild remain largely static, trained once and served unchanged regardless of how user needs evolve. This creates a fundamental tension: they must serve users continuously without interruption, yet their capabilities grow stale as the task distribution drifts with real-world usage. On platforms such as OpenClaw, where a single agent connects to 20+ messaging channels and handles diverse, evolving workloads, existing approaches either store raw trajectories without distilling transferable behavioral knowledge, maintain static skill libraries disconnected from weight optimization, or incur service downtime during retraining. We present MetaClaw, a continual meta-learning framework that jointly maintains a base LLM policy and an evolving skill library of reusable behavioral instructions, improving both through two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories and synthesizes new skills via an LLM evolver, taking effect immediately with zero service downtime. Opportunistic policy optimization performs gradient-based weight updates via cloud LoRA fine-tuning using RL with a process reward model, triggered only during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors configurable sleep hours, system keyboard inactivity, and Google Calendar occupancy.
The two mechanisms are mutually reinforcing: a better policy produces more informative failures for skill synthesis, and richer skills yield higher-reward trajectories for policy optimization. To prevent stale reward contamination, a skill generation versioning mechanism strictly separates support data (failure trajectories consumed by skill evolution) from query data (post-adaptation trajectories used for RL updates). Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without a local GPU. Experiments on MetaClaw-Bench (934 questions, 44 simulated workdays) and AutoResearchClaw (a 23-stage autonomous research pipeline) demonstrate consistent improvements: skill-driven adaptation improves accuracy by up to 32% relative; the full pipeline advances Kimi-K2.5 from 21.4% to 40.6% accuracy (vs. GPT-5.2 baseline 41.1%) with an 8.25x gain in end-to-end task completion; and skill injection alone improves AutoResearchClaw composite robustness by 18.3%. GitHub: https://github.com/aiming-lab/MetaClaw

1 Introduction

Large language model (LLM) agents have demonstrated remarkable capabilities across complex tasks (Yao et al., 2022; Shinn et al., 2023), yet agents deployed in the wild remain largely static, trained once and served unchanged regardless of how the user's needs evolve (Zhang et al., 2025b; Naihin et al., 2023; Song et al.). Consider OpenClaw (OpenClaw, 2026), an open-source CLI agent platform connecting to 20+ messaging channels, where a single user's workload may shift from multi-step file system operations one week to multi-agent messaging workflows the next. As the task distribution drifts, a frozen model becomes increasingly misaligned with actual usage patterns, repeatedly failing on task types underrepresented during pretraining. Existing approaches to agent adaptation fall into three broad categories, each with notable limitations.
Figure 1: Overview of MetaClaw. The framework improves the meta-model M = (θ, S) via two complementary loops operating at different timescales. Skill-driven fast adaptation (left) analyzes failed trajectories and instantly expands the skill library S without parameter updates, taking effect immediately for subsequent tasks. Opportunistic policy optimization (right) accumulates post-adaptation trajectories and, once sufficient data is available, leverages idle signals (sleep, inactivity, calendar) detected by the Opportunistic Meta-Learning Scheduler to trigger RL-based weight updates on θ via cloud LoRA fine-tuning.

Memory-based methods (Shinn et al., 2023; Zhao et al., 2024; Fang et al., 2025; Tang et al., 2025; Ouyang et al., 2025; Chhikara et al., 2025; Liu et al., 2026a) store raw conversation trajectories for future retrieval, but such trajectories are verbose and redundant, preventing the agent from extracting transferable behavioral patterns. Skill-based methods (Xia et al., 2026; Zhang et al., 2025a, 2026b; Wu et al., 2025; Zhang et al., 2026a) compress experience into reusable behavioral instructions, yet treat the resulting skill library as a static database never coordinated with weight optimization. RL-based methods (Schulman et al., 2017; Ahmadian et al., 2024; Shao et al., 2024; Feng et al., 2025; Zheng et al.
, 2025) update model weights, but operate in small-scale or offline settings and ignore a critical data validity problem: once skills have evolved, trajectories collected under the old skill context carry stale rewards that contaminate gradient updates if reused without filtration. A common thread across all three categories is that each addresses only one aspect of adaptation in isolation, leaving the complementary dimensions unexploited.

Our key observation is that two fundamentally different timescales of adaptation are in fact naturally complementary. Behavioral heuristics (e.g., "always verify a file path before reading," "confirm before destructive commands") can be distilled within seconds from a single failed conversation and injected immediately as skill instructions. Improving the model's underlying policy across diverse task types requires gradient-based optimization over many trajectories, on a timescale of minutes to hours. The two mechanisms are also mutually reinforcing: a better policy produces more informative failures for skill synthesis, and richer skills yield higher-reward trajectories for policy optimization. No existing system unifies these two forms of adaptation into a coherent framework that exploits this virtuous cycle.

We present MetaClaw, a continual meta-learning (Finn et al., 2017; Yao et al., 2021) framework that jointly maintains a base LLM policy and an evolving skill library of reusable behavioral instructions. The skill library serves a dual role: as a meta-parameter that accumulates behavioral knowledge across the task stream, and as an adaptation basis from which task-specific skills are retrieved at inference time. MetaClaw improves both components through two mechanisms. Skill-driven fast adaptation performs gradient-free skill evolution: an LLM analyzes failure trajectories and synthesizes new behavioral instructions (Xia et al.
, 2026) that take effect immediately with zero service downtime. Opportunistic policy optimization uses RL with a process reward model (PRM) (Zhang et al., 2025c) to update model weights via cloud (TML, 2026) LoRA fine-tuning (Hu et al., 2021), optimizing post-adaptation performance. Two design principles govern their coordination. First, when to run policy optimization: our Opportunistic Meta-Learning Scheduler (OMLS) monitors three idle signals, i.e., configurable sleep hours, system keyboard inactivity, and Google Calendar event occupancy, and triggers weight updates only during user-inactive windows, eliminating downtime. Second, which data to use: we distinguish support data (failure trajectories consumed by skill evolution) from query data (trajectories collected after new skills take effect). Only query data, reflecting the agent's post-adaptation behavior, is valid for RL; support data carries rewards conditioned on the old skill context and is excluded. Our skill generation versioning mechanism enforces this separation by stamping each trajectory with its skill generation index and flushing stale samples from the training buffer whenever skills evolve.

In summary, our primary contribution is MetaClaw, a continual meta-learning framework that unifies skill-driven fast adaptation with opportunistic policy optimization, enabling deployed LLM agents to evolve continuously through a proxy-based architecture without requiring a local GPU. We evaluate on MetaClaw-Bench, a new benchmark of 934 questions over 44 simulated workdays, where each day forms a sequential, feedback-driven multi-round session of real CLI tasks (file editing, JSON structuring, shell scripting).
Experiments with GPT-5.2 and Kimi-K2.5 show that skill-driven fast adaptation alone improves overall accuracy by up to 32.2% in relative terms; MetaClaw (Full) further advances Kimi-K2.5 from 21.4% to 40.6%, improves end-to-end task completion by 8.25x on Part I and file-check completion by 185% on Part II, and nearly closes the gap with GPT-5.2's baseline. We further validate on AutoResearchClaw, a 23-stage autonomous research pipeline, where skill injection alone improves the composite robustness score by 18.3%, demonstrating cross-domain generalization of MetaClaw's adaptation mechanisms.

2 Problem Setup

We consider a deployed CLI agent that serves a user over a stream of tasks τ_1, τ_2, ... drawn from a non-stationary distribution p_t(τ). Each task τ_i consists of a user instruction and environmental context (file system state, shell history, etc.), and the agent must produce a sequence of actions a_{1:T} to accomplish the task. The agent's behavior at any point in time is fully determined by a meta-model:

    M = (θ, S),    (1)

where θ denotes the parameters of the base LLM policy and S = {s_1, s_2, ..., s_K} is a library of skill instructions, i.e., concise, reusable behavioral directives injected into the agent's system prompt at inference time. Given a task τ, the agent generates actions according to:

    a ∼ π_θ(· | τ, Retrieve(S, τ)),    (2)

where Retrieve(S, τ) ⊆ S selects the most relevant skills for the current task via embedding-based retrieval. The meta-model M evolves over the task stream as the agent accumulates experience. We distinguish two types of trajectory data based on their role in this evolution. Support data D^sup consists of trajectories whose failures drive adaptation of the skill library S; these trajectories are consumed by the adaptation process and reflect pre-adaptation behavior.
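As a concrete illustration of the retrieval step in Eq. (2), the following is a minimal sketch of Retrieve(S, τ). The bag-of-words embedding and the `embed`/`cosine`/`retrieve` helper names are stand-ins for illustration; the paper does not specify its sentence-embedding model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words embedding; a real deployment would use a
    # sentence-embedding model (unspecified in the paper).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(skills: list[str], task: str, k: int = 2) -> list[str]:
    # Retrieve(S, tau): top-k skills by cosine similarity to the task.
    te = embed(task)
    ranked = sorted(skills, key=lambda s: cosine(embed(s), te), reverse=True)
    return ranked[:k]

skills = [
    "verify the file path exists before reading",
    "create a .bak backup before any destructive file operation",
    "normalize timestamps to ISO 8601 with timezone offsets",
]
print(retrieve(skills, "read the config file at the given path", k=1))
# → ['verify the file path exists before reading']
```

Only the retrieved subset is injected into the system prompt, which keeps the prompt short as S grows over the task stream.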
Query data D^qry consists of trajectories collected after adaptation has taken effect; these reflect the agent's post-adaptation behavior and are used to optimize the policy parameters θ. Maintaining a strict separation between support and query data is essential: mixing them would cause θ to be optimized against stale reward signals that no longer reflect the agent's current capabilities. The goal of MetaClaw is to continuously improve M over the task stream, not merely to solve each task in isolation, but to become progressively better at adapting to new tasks as they arrive. This positions MetaClaw as a continual meta-learning system: the agent learns from a non-stationary task stream while simultaneously improving its own adaptation capability.

3 MetaClaw

3.1 Overview

MetaClaw improves the meta-model M = (θ, S) through two complementary mechanisms operating at different timescales (Figure 1). Skill-driven fast adaptation analyzes failure trajectories and synthesizes new skill instructions that are immediately injected into the agent's prompt, evolving S without touching model weights. Opportunistic policy optimization uses post-adaptation trajectories to update θ via reinforcement learning, deferred to user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS). A skill generation versioning mechanism ensures that policy optimization always trains on query data collected under the current skill library, preventing stale reward contamination from support data. The two mechanisms are mutually reinforcing: a better θ produces more informative failures for skill synthesis, and richer skills produce higher-reward trajectories for policy optimization. This virtuous cycle enables the system to learn to become better at adapting. The complete procedure is summarized in Algorithm 1.
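The serve/adapt/optimize loop described above can be condensed into a schematic sketch. This is illustrative only: `policy`, `evolver`, `prm`, and `is_idle` are stand-in callables, and the 0.5 failure threshold is an assumed value; Algorithm 1 gives the authoritative procedure.

```python
def metaclaw_loop(tasks, policy, evolver, prm, is_idle, threshold=2, batch=4):
    # Schematic MetaClaw loop (cf. Algorithm 1): serve each task,
    # distill skills from accumulated failures, and gate RL on idleness.
    skills, g, support, buffer = set(), 0, [], []
    for task in tasks:
        traj = policy(task, skills)              # serve with current skills
        reward = prm(traj)                       # process-reward-model score
        if reward < 0.5:                         # failure -> support data
            support.append(traj)
        else:                                    # success -> query data
            buffer.append((traj, reward, g))
        if len(support) >= threshold:            # skill-driven fast adaptation
            skills |= set(evolver(support))
            buffer = [s for s in buffer if s[2] > g]  # flush stale samples
            support, g = [], g + 1
        if is_idle() and len(buffer) >= batch:
            pass  # opportunistic RL update on theta + hot-swap would run here
    return skills, g

# Deterministic stand-ins for illustration:
policy = lambda task, skills: task
prm = lambda traj: 0.0 if traj.startswith("fail") else 1.0
evolver = lambda support: ["verify file paths before reading"]
skills, g = metaclaw_loop(["fail-1", "fail-2", "ok-1"], policy, evolver, prm, lambda: False)
print(g, skills)  # → 1 {'verify file paths before reading'}
```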
3.2 Skill-Driven Fast Adaptation

Given the current meta-model (θ, S_g), the agent executes tasks and collects trajectories. Trajectories that reveal failure modes form the support set D^sup_g. Skill-driven adaptation evolves the skill library via a gradient-free experience distillation process:

    S_{g+1} = S_g ∪ E(S_g, D^sup_g),    (3)

where E is a skill evolver, an LLM that analyzes failure trajectories and synthesizes new behavioral instructions. The index g denotes the skill generation, incremented each time the library changes. This step modifies only S, leaving θ fixed, and takes effect immediately for all subsequent tasks. Because skill injection operates through the prompt rather than model parameters, fast adaptation incurs zero service downtime. This mechanism is gradient-free by design, not by approximation: the skill library S lives in a discrete natural-language space where gradient descent is ill-defined, so LLM-based failure analysis is the natural adaptation mechanism for this space.

The skill library S plays a dual role in the learning structure. As a meta-parameter, S accumulates behavioral knowledge across the entire task stream, with each skill generation S_{g+1} ⊇ S_g representing the system's growing operational knowledge. As an adaptation basis, Retrieve(S, τ) extracts a task-specific subset at inference time, providing instant specialization without any parameter update. This dual character arises because natural-language instructions are inherently cross-task transferable: a skill distilled from one failure (e.g., "verify file path before reading") generalizes to all tasks involving file operations. Unlike systems where task-specific adaptations are ephemeral and discarded after each task, each adaptation episode in MetaClaw contributes lasting knowledge to the meta-model, making knowledge accumulation a feature rather than a side effect.
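The update in Eq. (3) can be sketched as follows. The prompt wording, class names, and the stubbed evolver are illustrative assumptions; in MetaClaw the evolver E is an LLM call whose exact prompt is not given in the text.

```python
from dataclasses import dataclass, field

@dataclass
class SkillLibrary:
    generation: int = 0
    skills: set[str] = field(default_factory=set)

def evolve(lib: SkillLibrary, failures: list[str], evolver_llm) -> SkillLibrary:
    # S_{g+1} = S_g ∪ E(S_g, D^sup_g): the evolver LLM distills failure
    # trajectories into new behavioral instructions. The prompt below is
    # illustrative, not the paper's actual prompt.
    prompt = (
        "Existing skills:\n" + "\n".join(sorted(lib.skills)) +
        "\n\nFailed trajectories:\n" + "\n".join(failures) +
        "\n\nWrite one new reusable behavioral instruction per failure mode."
    )
    new_skills = evolver_llm(prompt)  # returns a list of instruction strings
    return SkillLibrary(lib.generation + 1, lib.skills | set(new_skills))

# Stub evolver for illustration (a real system would call an LLM here).
stub = lambda prompt: ["always verify a file path before reading"]
lib0 = SkillLibrary(0, {"confirm before destructive commands"})
lib1 = evolve(lib0, ["FileNotFoundError when reading config.json"], stub)
print(lib1.generation, len(lib1.skills))  # → 1 2
```

Because the new library is injected into the prompt on the very next task, the update takes effect with zero downtime, exactly as the section argues.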
3.3 Opportunistic Policy Optimization

After each skill-driven adaptation step, the agent continues serving tasks under the latest skill library. Because policy optimization is deferred to idle windows, the skill library may have advanced through several generations by the time training begins. Let g* denote the current skill generation when a training window opens. The RL buffer B accumulates query trajectories across all post-adaptation generations, and policy optimization updates θ over this buffer:

    θ_{t+1} = θ_t + α ∇_θ E_{(τ, ξ, g′) ∼ B} [ R(π_θ(· | τ, S_{g′})) ],    (4)

where g′ ≤ g* is the skill generation under which each trajectory was collected, and R is a process reward model (PRM) score. The versioning mechanism (Section 3.4) guarantees that B contains only query data, i.e., every sample reflects post-adaptation behavior under its respective skill generation.

Crucially, policy optimization does not optimize θ for raw task performance, but for how well the agent performs after skill adaptation. A better θ yields a meta-model from which skill-driven adaptation produces stronger post-adaptation behavior, resulting in an improved meta-model M′ = (θ_{t+1}, S_{g*}). In practice, policy optimization is realized via cloud LoRA fine-tuning using GRPO, deferred to idle windows by the Opportunistic Meta-Learning Scheduler (Section 3.5). Importantly, training is initiated only after the query buffer B has accumulated a sufficient number of trajectories; launching RL with too few samples leads to high-variance gradient estimates and unstable policy updates. This means policy optimization naturally lags behind skill-driven adaptation by days or longer, reinforcing the asymmetry between the two timescales: skills evolve continuously, while the policy improves in discrete, data-gated steps.
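The two practical ingredients of this section, GRPO-style reward normalization and the data-gated training trigger, can be sketched as follows. This is illustrative: the paper states GRPO is used but reports no hyperparameters, the PRM producing the rewards is unspecified, and `min_batch=64` is an assumed value.

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    # GRPO-style group-normalized advantages: A_i = (r_i - mean) / std.
    # Rewards would come from the (unspecified) process reward model.
    mu = statistics.mean(group_rewards)
    sd = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
    return [(r - mu) / sd for r in group_rewards]

def should_train(buffer_size: int, idle: bool, min_batch: int = 64) -> bool:
    # Data gating: launch an RL step only inside an idle window and only
    # once the query buffer is large enough to keep gradient variance low.
    return idle and buffer_size >= min_batch

print([round(a, 2) for a in grpo_advantages([0.2, 0.8, 0.5, 0.5])])
# → [-1.41, 1.41, 0.0, 0.0]
print(should_train(buffer_size=128, idle=True))  # → True
```

The gate is what produces the "discrete, data-gated steps" described above: skills may evolve several times while `should_train` keeps returning False.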
3.4 Skill Generation Versioning

The support-query separation defined in Section 2 must be enforced in MetaClaw's online setting, where tasks arrive sequentially and skill evolution is triggered asynchronously. Without a dedicated mechanism, support data can leak into the policy optimization buffer.

The problem is concrete: a trajectory (τ_i, ξ_i) that triggers skill evolution from S_g to S_{g+1} carries a reward r_i reflecting performance under S_g, before the new skill existed. If this trajectory enters the RL buffer, policy optimization receives a gradient that penalizes θ for a failure that skill-driven adaptation has already corrected, optimizing for pre-adaptation rather than post-adaptation performance and violating the meta-learning objective in Eq. 4.

We enforce separation via a skill generation version g_i stamped on each collected sample:

• Support set D^sup_g: trajectories collected under S_g whose failures trigger skill evolution S_g → S_{g+1}. These are consumed by the skill evolver and discarded from the RL buffer.

• Query set D^qry_{g+1}: trajectories collected after S_{g+1} takes effect. Only these, reflecting the agent's post-adaptation behavior, are eligible for policy optimization gradient updates.

When the skill generation counter advances from g to g+1, the trainer flushes all samples with version ≤ g from its buffer. This ensures policy optimization always updates θ with respect to the agent's adapted behavior, preserving the integrity of the meta-learning structure.

3.5 Opportunistic Meta-Learning Scheduler

Policy optimization requires a model weight hot-swap upon completion, which briefly interrupts inference. In a deployed interactive system, this creates a tension: policy optimization must run periodically to improve θ, but it must not degrade the user's experience.
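Returning to the flush rule of Section 3.4, a generation-stamped buffer can be sketched as follows. The class and field names are illustrative, not MetaClaw's actual data path.

```python
class VersionedBuffer:
    # Each sample is stamped with the skill generation g under which it
    # was collected; advancing the generation flushes stale samples.
    def __init__(self):
        self.samples = []  # (trajectory, reward, generation)

    def add(self, traj, reward, generation):
        self.samples.append((traj, reward, generation))

    def on_skill_evolution(self, old_generation):
        # Drop every sample with version <= g when S_g -> S_{g+1}:
        # its reward was conditioned on the old skill context.
        self.samples = [s for s in self.samples if s[2] > old_generation]

buf = VersionedBuffer()
buf.add("traj-a", 0.3, 0)   # pre-evolution: stale once skills change
buf.add("traj-b", 0.9, 1)   # post-adaptation query data
buf.on_skill_evolution(old_generation=0)
print([s[0] for s in buf.samples])  # → ['traj-b']
```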
We introduce the Opportunistic Meta-Learning Scheduler (OMLS), a background daemon that defers policy optimization to periods when the user is not actively interacting with the agent. OMLS monitors three complementary idle signals:

(1) Sleep window. The user configures a sleep schedule (e.g., 23:00–07:00). During this window, the system is guaranteed to be idle, providing the largest contiguous training block.

(2) System inactivity. OMLS polls the operating system's input device idle timer (e.g., ioreg HIDIdleTime on macOS). If no keyboard or mouse activity is detected for δ minutes (default: 30), a training window opens. Upon renewed input, the trainer pauses gracefully via mid-batch checkpointing.

(3) Calendar-aware scheduling. OMLS queries the user's Google Calendar API. When the current time falls within a scheduled meeting, the user is presumed unavailable, opening an opportunistic training window. This is the most anticipatory of the three signals: it leverages the user's own schedule to predict idle periods proactively.

A training window opens when any signal indicates user absence and closes when any signal indicates the user has returned. The RL trainer supports pause/resume across fragmented idle windows, accumulating gradient steps opportunistically without requiring a single long contiguous block.

4 Experiments

4.1 Experimental Setup

4.1.1 Benchmark and Evaluation Platform

MetaClaw-Bench. We construct MetaClaw-Bench, a continual agentic benchmark comprising two complementary evaluation parts (934 questions total across 44 simulated workdays) for evaluating an agent's ability to adapt across a sequential stream of real-world CLI tasks. Existing agent benchmarks present tasks as independent episodes, providing no mechanism to assess whether an agent improves from accumulated experience.
MetaClaw-Bench addresses this gap by structuring evaluation as multi-workday simulations in which the agent operates under consistent workspace and policy rulesets that evolve through user feedback.

1) Part I structures evaluation as a 30-workday simulation (346 questions, days 01–30, 10–15 per day). The workspace state (files, configs, project records) persists across rounds within each day, and each question includes the evaluation outcome of the previous round as corrective feedback context. Questions fall into two types: file-check tasks (structured edits or transformations producing output files validated by automated checkers) and multi-choice tasks (conceptual procedural questions on domain-specific rules). Task difficulty increases monotonically with day index, with days 25–30 requiring sophisticated multi-step reasoning. Part I's file-check tasks are heavily execution-oriented, with many interdependent side effects, providing a conservative measure of end-to-end completion.

Algorithm 1 MetaClaw: Continual Meta-Learning for Deployed LLM Agents
Require: Meta-model M = (θ_0, S_0), skill evolver E, task stream {τ_i}, PRM R, OMLS idle detector
Ensure: Continuously improved meta-model M
1:  Initialize skill generation g ← 0, RL buffer B ← ∅
2:  for each task τ_i in stream do
3:    ▷ Serve task with current meta-model
4:    S_{τ_i} ← Retrieve(S_g, τ_i)   // retrieve relevant skills
5:    ξ_i ← Execute(π_θ(· | τ_i, S_{τ_i}))   // collect trajectory
6:    r_i ← R(ξ_i); stamp (τ_i, ξ_i, r_i) with generation g
7:    if ξ_i reveals failure then
8:      Add (τ_i, ξ_i) to support set D^sup_g
9:    else
10:     Add (τ_i, ξ_i, r_i, g) to RL buffer B
11:   end if
12:   ▷ Skill-driven fast adaptation (when failures accumulate)
13:   if |D^sup_g| ≥ threshold then
14:     ΔS ← E(S_g, D^sup_g)   // synthesize new skills from failures
15:     S_{g+1} ← S_g ∪ ΔS   // evolve skill library
16:     Flush all samples with version ≤ g from B   // support-query separation
17:     g ← g + 1
18:   end if
19:   ▷ Opportunistic policy optimization (when user is idle)
20:   if OMLS detects an idle window and |B| ≥ batch size then
21:     θ ← θ + α ∇_θ E_{(τ,ξ,r,g′)∼B}[R(π_θ(· | τ, S_{g′}))]   // RL update
22:     Hot-swap model weights   // deploy updated θ
23:   end if
24: end for

2) Part II extends the evaluation to a 14-workday simulation (588 questions, 42 per day: 434 multi-choice and 154 file-check). Part II's file-check tasks are rule-based transformations where compliance with behavioral heuristics (e.g., schema conventions, timestamp formats) is the primary bottleneck, making them more amenable to skill distillation. This design provides a complementary signal: while Part I stress-tests execution reliability, Part II directly measures how quickly the RL-trained policy internalizes procedural rules across a higher-density task stream.

We report two primary metrics across both parts: overall accuracy (mean per-question score) and file-check completion rate (fraction of file-check outputs passing all automated checker assertions simultaneously). Because the benchmark tasks are authored to simulate realistic deployment rather than collected from actual user sessions, we view both parts as controlled stress tests of continual adaptation under increasing difficulty.

Downstream evaluation: AutoResearchClaw. To test whether MetaClaw's adaptation mechanisms generalize beyond CLI-task benchmarks, we additionally evaluate on AutoResearchClaw (Liu et al., 2026b), a fully autonomous 23-stage research pipeline that transforms a single research idea into a conference-ready paper, covering literature search, hypothesis generation, experiment design, code synthesis, sandbox execution, result analysis, paper drafting, and multi-agent peer review.
Unlik e MetaClaw-Benc h’s structured file-chec k and m ulti-c hoice tasks, AutoResearc hCla w presents an op en-ended, long-horizon agentic w orkload where failures manifest as stage retries, excessiv e refinement cycles, and incomplete pip eline runs. W e rep ort four pip eline-level metrics: stage r etry r ate , r efine cycle c ount , pip eline stage c ompletion (out of 19 scorable stages), and a c omp osite r obustness sc or e (weigh ted av erage of stage completion rate at 40%, retry reduction at 30%, and refine cycle efficiency at 30%). Baselines and Implementation Details. W e ev aluate tw o frontier LLMs as backbone p olicies: GPT- 6 T able 1 Main results on MetaClaw-Benc h Parts I and I I. Acc.: mean p er-question accuracy . Compl.: file-chec k completion rate. MetaClaw (F ull) is ev aluated for Kimi-K2.5 only . Best result p er mo del p er part is b olded . P art I (30 days, 346 Q) P art I I (14 days, 588 Q) Mo del Condition A cc. (%) Compl. (%) A cc. (%) Compl. (%) GPT-5.2 Baseline 41.1 14.7 44.9 58.4 GPT-5.2 MetaCla w (Skills) 44.0 17.1 49.1 67.5 Kimi-K2.5 Baseline 21.4 2.0 21.1 18.2 Kimi-K2.5 MetaCla w (Skills) 28.3 2.0 26.9 33.8 Kimi-K2.5 MetaCla w (F ull) 40.6 16.5 39.6 51.9 5.2 ( Op enAI , 2025 ) and Kimi-K2.5 ( T eam et al. , 2026 ). W e compare three conditions: 1) Baseline: the base mo del serv ed without any adaptation mechanism. 2) MetaClaw (Skills): the base mo del augmented with skill-driv en fast adaptation; after each failed tra jectory , the skill evolv er syn thesizes b ehavioral instructions immediately injected into the system prompt, with top- k retriev al via cosine similarity o v er sentence em b eddings. 3) MetaClaw (F ull): the full pip eline com bining skill-driven fast adaptation with opp ortunistic p olicy optimization via RL (5-day training run), ev aluated for Kimi-K2.5 only , as it requires a cloud LoRA training endp oint configured for the target backbone. All conditions use iden tical prompts and to ol sets. 
This design isolates the individual con tributions of the t w o MetaCla w comp onen ts as defined in Section 3 . F or the AutoResearc hCla w ev aluation, w e deplo y MetaClaw’s skil l-driven fast adaptation within AutoResearc h- Cla w’s pip eline executor. After each pip eline run, failures and warnings from all 23 stages are captured as structured lessons and conv erted into reusable skill files via MetaClaw’s lesson-to-skill evolv er. On subsequent runs, accumulated skills are injected in to the system prompt of all 18 LLM-driven stages. W e run controlled A/B exp eriments with the same researc h topic, backbone LLM, and pip eline configuration, differing only in whether MetaCla w’s skill injection is activ e. 4.2 Main Results T able 1 rep orts p erformance on b oth parts of MetaClaw-Benc h for all five mo del–condition pairs. MetaClaw consisten tly improv es ov er the resp ectiv e baselines across b oth mo dels, both adaptation mo des, and b oth b enc hmark parts. MetaCla w improv es b oth mo dels and the full pip eline yields the largest gains. Results are consistent across b oth b enchmark parts. F or GPT-5.2, MetaClaw (Skills) raises ov erall accuracy from 41.1% to 44.0% on Part I (+7.1% relativ e) and from 44.9% to 49.1% on Part I I (+9.4% relative), with file-chec k completion rising from 14.7% to 17.1% on Part I and from 58.4% to 67.5% on Part I I. F or Kimi-K2.5, MetaClaw (Skills) impro v es accuracy from 21.4% to 28.3% on Part I (+32.2%) and from 21.1% to 26.9% on Part I I (+27.5%). MetaCla w (F ull) yields substantially larger gains: on Part I, accuracy reac hes 40.6% and task completion rises 8.25 × (from 2.0% to 16.5%); on Part I I, accuracy reaches 39.6% and file-c hec k completion jumps from 18.2% to 51.9% (+185% relativ e). Stronger mo dels b enefit less and w eak er mo dels b enefit more. GPT-5.2 starts from a higher baseline (41.1% vs. 21.4% on Part I), lea ving less headro om for skill-driven gains. 
Kimi-K2.5, by contrast, lacks implicit procedural knowledge that the skill library provides explicitly, so skill injection yields larger returns. Notably, MetaClaw (Full) with Kimi-K2.5 (40.6%) nearly closes the gap with GPT-5.2's baseline (41.1%), demonstrating that the combination of skill injection and gradient-based policy optimization can largely compensate for model capability differences. This pattern suggests MetaClaw is particularly valuable for deploying capable but not state-of-the-art models at production scale.

The full pipeline unlocks end-to-end task completion; skills alone do not. On Part I, MetaClaw (Skills) leaves task completion rates unchanged for both models, confirming that skill injection sharpens partial execution quality without reliably enabling zero-defect outputs under heavy execution demands. MetaClaw (Full) closes this gap: Kimi-K2.5's completion rate jumps from 2.0% to 16.5% (8.25x). On Part II, where file-check tasks are rule-based, skills already drive a substantial completion gain (18.2% → 33.8%), and the full pipeline pushes this further to 51.9%, confirming that weight-level optimization provides an additive benefit on top of skill injection regardless of task type. Since MetaClaw-Bench is an authored simulation rather than a collection of real user sessions (see Section 4), the absolute magnitudes of these gains are specific to this benchmark and may not transfer directly to production workloads.

Table 2: MetaClaw (Skills-Only) on AutoResearchClaw, a 23-stage autonomous research pipeline. Skill injection alone yields consistent improvements across all robustness metrics without requiring RL weight updates.

Metric                               Baseline    +MetaClaw (Skills)    Relative Change
Stage retry rate (↓)                 10.5%       7.9%                  ↓ 24.8%
Refine cycle count (↓)               2.0         1.2                   ↓ 40.0%
Pipeline stage completion (↑)        18 / 19     19 / 19               ↑ 5.3%
Composite robustness score (↑)       0.714       0.845                 ↑ 18.3%
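The composite robustness score reported above is defined in Section 4.1 as a 40/30/30 weighted average. A minimal sketch, assuming the three component terms are already normalized to [0, 1] (the paper does not specify its normalization, so the inputs below are hypothetical, not the measured values behind 0.714 or 0.845):

```python
def composite_robustness(stage_completion: float,
                         retry_reduction: float,
                         refine_efficiency: float) -> float:
    # Weighted average per the stated definition: stage completion 40%,
    # retry reduction 30%, refine-cycle efficiency 30%. Inputs are
    # assumed to be pre-normalized to [0, 1].
    return (0.4 * stage_completion
            + 0.3 * retry_reduction
            + 0.3 * refine_efficiency)

# Hypothetical normalized inputs for illustration:
print(round(composite_robustness(1.0, 0.8, 0.6), 2))  # → 0.82
```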
The primary value of these results lies in the consistent directional trends: skill-driven adaptation reliably improves partial execution quality across both models, while weight-level optimization is necessary to unlock end-to-end task completion.

MetaClaw generalizes to open-ended multi-stage pipelines. Table 2 reports MetaClaw's impact on AutoResearchClaw, an evaluation setting structurally different from MetaClaw-Bench. Using skills-only adaptation (no RL), MetaClaw reduces the stage retry rate by 24.8% (from 10.5% to 7.9%) and cuts refine cycles by 40.0% (from 2.0 to 1.2 per stage). Pipeline completion improves from 18/19 to 19/19 stages (+5.3%), and the composite robustness score rises from 0.714 to 0.845, an 18.3% improvement. These gains are achieved without any gradient-based policy updates, demonstrating that MetaClaw's lightweight, zero-downtime skill injection transfers effectively to complex, long-horizon agentic workflows beyond structured CLI tasks.

4.3 Analysis

Per-day accuracy trends. Figure 2 visualizes per-day accuracy (3-day rolling average) for all five conditions. Both models and all conditions show a consistent accuracy decline from days 01–10 (where accuracies routinely exceed 50%) to days 25–30 (where most models fall below 30%), confirming that MetaClaw-Bench exhibits increasing difficulty. MetaClaw's advantage over the baseline is most pronounced in the mid-range days (days 11–22), where tasks require multi-step procedural compliance that is learnable through failure distillation, and MetaClaw (Full) reaches its peak advantage of nearly 0.8 accuracy around days 19–20. The early days (days 01–10) involve simpler manipulations where both conditions perform reasonably, and the late days (days 23–30) are sufficiently complex that accumulated skills are insufficient without stronger model weights, leading all five conditions to converge toward similarly low performance.
Task-type breakdown. Figure 3 decomposes performance by task type, revealing that the two MetaClaw components address fundamentally different bottlenecks. Skills-only adaptation lifts multi-choice pass rates for both models while leaving file-check completion flat, as procedural knowledge helps reasoning but not execution. MetaClaw (Full) reverses this: Kimi-K2.5’s file-check completion rate jumps to match GPT-5.2’s baseline, while multi-choice accuracy slightly decreases as the policy shifts toward file-execution behavior during training.

RL training dynamics. Part II provides a fine-grained view of how policy optimization evolves over time. The file-check completion curve for MetaClaw (Full)–Kimi-K2.5 shows a clear inflection at day 8, after which per-day pass rates escalate rapidly: from ∼9% on days 1–4, through 27–36% on days 5–8, to 55–64% on days 9–10, and ultimately reaching 100% on days 12 and 14. This learning trajectory mirrors the MAML inner-loop update structure: the first several days accumulate support trajectories for skill synthesis and weight updates, the inflection marks when sufficient gradient signal has been collected for the LoRA fine-tune to shift the policy’s execution strategy, and the late-phase convergence indicates that the policy has internalized the procedural rules surfaced by the skill library. The two-phase pattern (skill-driven gains first, RL-driven gains after day 8) directly validates the complementary timescale hypothesis underlying MetaClaw’s design.

Skill library analysis.
Across the 30-day session, MetaClaw’s skill evolver synthesizes skills clustered around three recurring failure categories: (1) temporal format compliance, normalizing natural-language time expressions to ISO 8601 format with timezone offsets; (2) backup-before-modify protocol, creating .bak files before any destructive file operation; and (3) naming convention adherence, following date-prefixed file naming patterns (e.g., 20260408_*.json). These cross-cutting behavioral heuristics generalize across tasks, explaining why a single failure can yield skills that improve performance on subsequent, structurally different questions.

[Figure 2  Per-day accuracy over 30 simulated workdays (3-day rolling average). Solid lines: GPT-5.2; dashed lines: Kimi-K2.5. MetaClaw (Full) dominates in the mid phase (day 11–22) before difficulty outpaces accumulated knowledge in late days.]

[Figure 3  Per-task-type pass rates. File-check is unchanged by skills alone but jumps 8.25× under MetaClaw (Full). Multi-choice improves with skills but slightly decreases under MetaClaw (Full), reflecting a policy shift toward file-execution.]

Cross-domain skill transfer to AutoResearchClaw. The AutoResearchClaw results (Table 2) provide complementary evidence for skill generalization. In this setting, the skill evolver, designed for CLI-task adaptation, synthesizes actionable skills for a fundamentally different workload (multi-stage research automation) without any domain-specific tuning.
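The first category can be made concrete. The paper does not show the distilled skills’ implementations; the sketch below illustrates the kind of normalization a temporal-format-compliance skill prescribes, where the function name, the small phrase table, and the fixed +08:00 offset are all assumptions for the example:

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch: map a few natural-language time expressions to
# ISO 8601 with an explicit timezone offset, as the distilled
# "temporal format compliance" skills prescribe. The phrase table and
# the fixed +08:00 offset are assumptions for this example.
TZ = timezone(timedelta(hours=8))

def normalize_timestamp(text: str, now: datetime) -> str:
    text = text.strip().lower()
    if text == "now":
        dt = now
    elif text == "tomorrow 9am":
        dt = (now + timedelta(days=1)).replace(hour=9, minute=0,
                                               second=0, microsecond=0)
    else:
        # Already machine-readable: parse and re-emit in the target offset.
        dt = datetime.fromisoformat(text).astimezone(TZ)
    return dt.astimezone(TZ).isoformat(timespec="seconds")

now = datetime(2026, 3, 16, 9, 30, 0, tzinfo=TZ)
print(normalize_timestamp("tomorrow 9am", now))  # 2026-03-17T09:00:00+08:00
```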
The 40% reduction in refine cycles indicates that skills distilled from earlier pipeline failures (e.g., citation formatting errors, experiment code validation failures) directly prevent repeated mistakes in subsequent runs. This cross-domain transferability, combined with the zero-downtime deployment model (skill injection operates entirely at the prompt level), confirms that MetaClaw functions as a general-purpose continual learning layer applicable to diverse agentic systems.

Case studies. Table 3 contrasts two representative cases that illustrate the distinct contributions of the two MetaClaw mechanisms. In Case 1, a single distilled skill resolves a compliance error with zero weight update. In Case 2, skill injection provides necessary format context but is insufficient alone; weight-level RL updates are required to reliably execute a structurally complex file operation.

5 Related Work

Skill-based and memory-augmented agents. A line of work augments LLM agents with external memory or reusable skill libraries to improve performance without modifying model weights (Shinn et al., 2023; Zhao et al., 2024; Fang et al., 2025; Tang et al., 2025; Ouyang et al., 2025; Chhikara et al., 2025; Liu et al., 2026a; Wang et al., 2024b). Reflexion (Shinn et al., 2023) stores verbal self-reflections in an episodic buffer, allowing the agent to avoid repeating past mistakes. Mem0 (Chhikara et al., 2025) and SimpleMem (Liu et al., 2026a) maintain longer-horizon memory through hierarchical retrieval. On the skill side, Voyager (Wang et al.) incrementally builds a library of executable code skills from successful episodes, while ExpeL (Zhao et al., 2024) and Agent-KB (Tang et al., 2025) distill cross-task experience into natural-language rules. A key limitation shared by these methods is that the skill library (or memory) is treated as a static artifact (Xia et al.
, 2026): it is never coordinated with weight-level optimization, and successful trajectories are reused indiscriminately without regard for whether the agent’s behavior has changed since they were collected. MetaClaw addresses both gaps by coupling skill evolution with policy optimization through explicit support-query separation.

Reinforcement learning for LLM agents. RLHF (Ouyang et al., 2022) and its variants establish the use of reward signals to fine-tune LLM behavior, and subsequent work applies RL to tool-using and agentic settings (Nakano et al., 2021; Yao et al., 2022). More recently, GRPO (Shao et al., 2024) and DAPO (Yu et al., 2025) demonstrate stable online policy gradient training for reasoning tasks (Schulman et al., 2017; Ahmadian et al., 2024; Shao et al., 2024; Feng et al., 2025; Zheng et al., 2025; Team et al., 2025; Dong et al., 2025). However, these approaches optimize a fixed policy against a fixed reward signal, with no mechanism for the agent to update its behavioral context between rollouts. In deployed interactive settings, they also do not address when to run training or which data remains valid for gradient updates after behavioral changes. MetaClaw targets exactly these practical constraints via opportunistic scheduling and skill-generation versioning.

Table 3  Representative case studies. Case 1 shows skill-driven fast adaptation (MetaClaw Skills, GPT-5.2); Case 2 shows the full pipeline (MetaClaw (Full), Kimi-K2.5). Both recover from score 0 to score 1.0; the mechanisms differ fundamentally.

Case 1: MetaClaw (Skills)
- Model: GPT-5.2
- Day / Round: Day 19 / Round 4
- Task type: File-check
- Task instruction: Update sprint8_board.json: set T-405/T-406 to "done", add completed_at fields.
- Baseline response (score: 0): Reads file; directly overwrites it. Checker detects missing sprint8_board.json.bak → 0.
- MetaClaw response (score: 1.0): Skill distilled from Day 2: “Always create .bak before modifying (P4).” Agent writes sprint8_board.json.bak, applies targeted patch. Checker passes → 1.0.
- Day accuracy (all rounds): Baseline: 43.9%; MetaClaw (Skills): 62.1%; Δ +18.2 pp
- Key mechanism: Skills: one distilled rule generalizes across file types and subsequent days with zero weight update.

Case 2: MetaClaw (Full)
- Model: Kimi-K2.5
- Day / Round: Day 18 / Round 6
- Task type: File-check
- Task instruction: Append a deployment record to deploy_log.json with fields timestamp (ISO 8601+TZ), env, status, and changes.
- Baseline response (score: 0): Uses field name date instead of timestamp; omits changes. Checker rejects schema → 0.
- MetaClaw response (score: 1.0): Skills inject “use ISO 8601 with timezone offset”; skills-only Kimi still omits changes array → 0. After RL: all four fields present, schema valid, backup created → 1.0.
- Day accuracy (all rounds): Baseline: 8.3%; Skills-only: 25.0%; MetaClaw (Full): 80.6%
- Key mechanism: RL: skills supply declarative format context; weight updates internalize the execution reliability that skill injection alone cannot enforce.

Continual and meta-learning. Meta-learning (Finn et al., 2017; Nichol et al., 2018; Hospedales et al., 2021) frames learning as optimizing for fast adaptation to new tasks, typically in an offline episode-based setting. Meta-reinforcement learning extends this idea to sequential decision-making: RL² (Duan et al., 2016) trains a recurrent policy whose hidden state implicitly encodes task context, PEARL (Rakelly et al., 2019) infers a probabilistic context variable for off-policy adaptation, and ProMP (Rothfuss et al., 2019) applies trust-region constraints at the meta-level. These methods demonstrate effective fast adaptation in robotic control and navigation, but operate on simple network architectures with low-dimensional action spaces and assume fixed offline task distributions. Continual learning (Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019; Zenke et al., 2017; Wang et al.
, 2024a, 2022) studies sequential task adaptation without forgetting through regularization, replay, or architectural strategies, yet does not incorporate fast adaptation mechanisms at inference time. Online meta-learning approaches (Finn et al., 2019; Nagabandi et al., 2018; Harrison et al., 2020; Yao et al., 2020) relax the offline assumption and even handle task heterogeneity, but remain grounded in representation learning over simple networks. MetaClaw extends the meta-learning objective to a non-stationary stream of LLM agent tasks where fast adaptation is gradient-free (skill synthesis in discrete natural-language space) and slow adaptation is gradient-based (policy optimization via RL), with a versioning protocol that preserves the support-query structure in an online, asynchronous setting.

6 Conclusion

We presented MetaClaw, a continual meta-learning framework that enables deployed LLM agents to improve autonomously through normal usage. MetaClaw combines two complementary adaptation mechanisms operating at different timescales: fast, inference-time skill injection that distills reusable behavioral knowledge from failures, and slow, gradient-based policy optimization that refines the model during idle windows. Built on a lightweight proxy architecture, the system requires no local GPUs and integrates transparently with existing personal agents and LLM providers. Experiments on MetaClaw-Bench demonstrate consistent improvements across models and adaptation modes, with the full pipeline yielding the largest gains on both partial execution quality and end-to-end task completion. Evaluation on AutoResearchClaw further shows that skill injection generalizes to open-ended research pipelines without any gradient updates. A current limitation is that idle-window detection depends on user configuration, which may not generalize to all deployment environments.
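For concreteness, the idle-window check that this limitation refers to can be pictured as a predicate over the three configured signals (sleep hours, keyboard inactivity, calendar occupancy). The sketch below is not the OMLS implementation; the parameter names, the 30-minute threshold, and the way the signals are combined are all assumptions:

```python
from datetime import datetime, time

def is_idle_window(now: datetime,
                   sleep_start: time, sleep_end: time,
                   seconds_since_keypress: float,
                   calendar_busy: bool,
                   inactivity_threshold_s: float = 1800.0) -> bool:
    """Illustrative OMLS-style check: training may run only when the user
    is plausibly away. The threshold and combination logic are
    assumptions; the paper names only the three signal sources."""
    t = now.time()
    if sleep_start <= sleep_end:
        in_sleep_hours = sleep_start <= t < sleep_end
    else:  # window wraps past midnight, e.g. 23:00-07:00
        in_sleep_hours = t >= sleep_start or t < sleep_end
    keyboard_idle = seconds_since_keypress >= inactivity_threshold_s
    return (in_sleep_hours or keyboard_idle) and not calendar_busy
```

Whether the signals are combined with AND or OR is not specified here; this sketch treats sleep hours and keyboard inactivity as alternative evidence of absence and calendar occupancy as a veto.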
We believe MetaClaw establishes a principled foundation for agents that genuinely learn and evolve in the wild, simply by being used.

References

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024.

Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In International Conference on Learning Representations, 2019.

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.

Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849, 2025.

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory. arXiv preprint arXiv:2508.06433, 2025.

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. arXiv preprint arXiv:2505.10978, 2025.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.

Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine.
Online meta-learning. In International Conference on Machine Learning, pages 1920–1930. PMLR, 2019.

James Harrison, Apoorva Sharma, Chelsea Finn, and Marco Pavone. Continuous meta-learning without tasks. Advances in Neural Information Processing Systems, 33:17571–17581, 2020.

Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5149–5169, 2021.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Oriol Vinyals, Shakir Mohamed, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. SimpleMem: Efficient lifelong memory for LLM agents. arXiv preprint arXiv:2601.02553, 2026a.

Jiaqi Liu, Peng Xia, Siwei Han, Shi Qiu, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Jiawei Zhou, Hongtu Zhu, Yun Li, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. AutoResearchClaw: Fully autonomous research from idea to paper, 2026b. https://github.com/aiming-lab/AutoResearchClaw.

David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 2017.

Anusha Nagabandi, Chelsea Finn, and Sergey Levine. Deep online learning via meta-learning: Continual adaptation for model-based RL. arXiv preprint arXiv:1812.07671, 2018.

Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, and David Bau. Testing language model agents safely in the wild.
arXiv preprint arXiv:2311.10538, 2023.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

OpenAI. Introducing GPT-5.2, 2025. https://openai.com/index/introducing-gpt-5-2.

OpenClaw. OpenClaw — personal AI assistant, 2026. https://github.com/openclaw/openclaw.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. ReasoningBank: Scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140, 2025.

Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, 2019.

Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search. In International Conference on Learning Representations, 2019.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.

Dawn Song, Chenguang Wang, Nicholas Crispino, Ruoxi Jia, Kyle Montgomery, Yujin Potter, Vincent Siu, and Zhun Wang. Agents in the wild: Safety, security, and beyond. In ICLR 2026 Workshop Proposals.

Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, et al. Agent KB: Leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229, 2025.

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi DeepResearch technical report. arXiv preprint arXiv:2510.24701, 2025.

Thinking Machines Lab (TML). Tinker is a training API for researchers, 2026. https://thinkingmachines.ai/tinker/.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research.

Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024a.

Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022.

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024b.

Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. EvolveR: Self-evolving LLM agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079, 2025.

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026.

Huaxiu Yao, Yingbo Zhou, Mehrdad Mahdavi, Zhenhui Li, Richard Socher, and Caiming Xiong. Online structured meta-learning. In Advances in Neural Information Processing Systems, 2020.

Huaxiu Yao, Yu Wang, Ying Wei, Peilin Zhao, Mehrdad Mahdavi, Defu Lian, and Chelsea Finn. Meta-learning with an adaptive task scheduler. Advances in Neural Information Processing Systems, 34:7497–7509, 2021.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, 2017.

Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan.
MemEvolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025a.

Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. AgenTracer: Who is inducing failure in the LLM agentic systems? arXiv preprint arXiv:2509.03312, 2025b.

Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. MemSkill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026a.

Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, et al. MemRL: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192, 2026b.

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10495–10516, 2025c.

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024.

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.

Appendix A  Prompts and Templates

This appendix documents the core prompt templates used in MetaClaw-Bench evaluations and the MetaClaw framework components.

A.1 Agent System Prompt (MetaClaw-Bench Part I)

Part I evaluates agents on OpenClaw CLI tasks via a programmatic rollout loop.
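The rollout loop itself is not reproduced in the paper; a minimal sketch of the contract it implies (the agent proposes one command per step through the run_command tool and signals completion with the "done" sentinel) could look like the following, where llm_propose_command and execute are hypothetical stand-ins for the model call and the sandboxed CLI:

```python
def rollout(task: str, llm_propose_command, execute, max_steps: int = 30) -> list:
    """Drive one Part I episode: ask the policy for a command, run it,
    feed the observation back, and stop on the 'done' sentinel.
    llm_propose_command and execute are hypothetical stand-ins; the
    paper specifies only the tool interface, not this loop."""
    transcript = []
    observation = task
    for _ in range(max_steps):
        command = llm_propose_command(observation, transcript)
        if command == "done":          # agent reports task complete
            break
        output = execute(command)      # run one CLI command, capture output
        transcript.append((command, output))
        observation = output
    return transcript
```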
The agent receives the following fixed system prompt, which may be replaced by a compressed variant (see Section A.6) after the first session:

SYSTEM PROMPT — Part I Agent

You are an expert CLI agent controlling an OpenClaw installation. Your goal is to complete the given task by issuing CLI commands via the run_command tool.

Guidelines:
- Issue ONE command at a time and carefully read the output before proceeding.
- Use 'openclaw status' or similar read commands to inspect state before making changes.
- Handle errors: if a command fails, diagnose from the output and retry differently.
- When the task is fully complete, call run_command with command="done".
- Do NOT ask clarifying questions -- act based on the task description alone.

The single tool exposed to the agent is run_command:

run_command Tool Schema

{
  "name": "run_command",
  "description": "Execute a CLI command and observe the output. When the task is fully complete, call run_command with command=\"done\".",
  "parameters": {
    "type": "object",
    "properties": {
      "command": {
        "type": "string",
        "description": "The exact CLI command to run, e.g. 'openclaw status', 'openclaw agents add --name bot1 --model gpt-4o', 'done'"
      }
    },
    "required": ["command"]
  }
}

A.2 Agent Identity Context (MetaClaw-Bench Part II)

Part II (CALMB-14) injects the following workspace context files into the agent’s session at initialization, defining the agent’s role, user profile, and behavioral principles.

IDENTITY.md — Agent Role Definition

# Identity
You are MetaClaw Agent, an AI assistant integrated into the internal toolchain of Orion Tech.

## Role
You serve as Alex Zhang's primary AI assistant for day-to-day engineering and project management work.

## Context
- Company: Orion Tech -- a B2B SaaS company
- Product: Project Orion, a project management platform
- Team: Backend engineering team
- Principal: Alex Zhang, Backend Tech Lead

## Principles
1. Accuracy comes first -- double-check facts, values, and formats before writing files
2. Be consistent -- use the same conventions across all files
3. Be complete -- fill in all required fields; no placeholders
4. Be professional -- produce output ready for team use

SOUL.md — Core Behavioral Principles

## Reliability
Every file you create or modify should be correct and complete. A half-finished or incorrect file creates more work than it saves.

## Consistency
When you establish a pattern in one file, maintain it across all related files.

## Attention to Detail
Small errors in data files can cascade into larger problems. Pay close attention to field names, data types, value formats, and structural requirements.

## Ownership
Own every assigned task completely. Do not produce output requiring the user to fix obvious issues.

## Professionalism
All output should meet the standard of work that could be shared directly with teammates or stakeholders.

A.3 Task Question Templates

Each day in MetaClaw-Bench presents two question types. Below are representative examples.

Multi-choice question (Part II, Day 01 / r1):

Multi-Choice Question Template

Regarding the source and applicability of the baseline daily revenue of 4500 yuan/store in the financial model, which of the following descriptions are consistent with the project documentation? (Select all correct options)
A. The assumption document labels 4500 yuan as "Source: operational baseline data from market research report", while the original description in the market research report is "median of 86 tier-1 city stores"
B. The rent assumption of 650 yuan/m2 is noted as tier-2 and tier-3 city market rates, but the revenue assumption of 4500 yuan references tier-1 city store data
C. Operations VP Zhang Wei mentioned an estimated daily revenue of approximately 2500-3200 yuan, but this is only a hypothetical opinion with no direct contradiction
D. The validation memo confirms that 4500 yuan "has been cross-validated with operational baseline data", indicating the assumption underwent applicability review

Please answer using \bbox{X} or \bbox{X,Y} format.

Ground truth: A, B. Scoring: max(0, 1 − (FP + FN) / n_options)

File-check question (Part II, Day 01 / r21):

File-Check Question Template

Based on the reference documents in day01/, create a decision log tracking key decisions from the documents. Save as day01/decision_log_r21.json. Include these fields: title, created_at, decisions (array with id, date, decision, rationale, decided_by, review_date).

Eval: python scripts/check_iso8601.py day01/decision_log_r21.json (exit 0 = pass)
Feedback (incorrect): Time/date fields must use ISO 8601 with +08:00 timezone: YYYY-MM-DDTHH:MM:SS+08:00.

Part I task instruction format (real OpenClaw session, train.jsonl):

Part I User Instruction Template

[Sat 2026-02-21 07:25 EST] I grant you read access to /Users/jimchen/Documents/openclaw/skills. Locate gog/skill.md within it. Use skill.md to obtain the instructions for adding a Google Calendar. Execute the instructions in the isolated environment to directly add ten meetings to Google Calendar, starting February 20th, recurring every Friday from 3:30-4:30 PM. Language requirement: Reply in English only. Keep your response concise and task-focused.

A.4 Skill Evolver Prompt Template

After each session in which failed trajectories are collected, the MetaClaw skill evolver submits the following prompt to synthesize new behavioral skills:

Skill Evolver Prompt (skill_evolver.py)

You are a skill engineer for an AI assistant trained with RL.
Your job: analyze the failed conversations below and generate NEW skills that would have prevented those failures.
---

## Failed Conversations

### Failure 1 (reward=0.0)

**Conversation context (last 600 chars):**
```
...[truncated trajectory]
```

**Assistant response (first 500 chars):**
```
[model response]
```

[...up to 6 failures shown...]

---

## Existing Skills (do NOT duplicate any of these)

["skill-name-1", "[category] skill-name-2", ...]

---

## Instructions

Generate **1 to {max_new_skills}** new skills that directly address the failure patterns observed above. Focus on actionable, concrete guidance for future conversations.

Each skill must follow Claude skill format:
- `name`: a lowercase hyphenated slug
- `description`: one sentence -- when to trigger this skill and what it achieves
- `content`: 6-15 lines of actionable Markdown. Include: a heading, numbered steps or bullet points, a concrete example or code snippet, and an Anti-pattern section.
- `category`: one of [coding, research, data_analysis, security, communication, automation, productivity, agentic] or "general" or "common_mistakes"

**Output:** Return ONLY a valid JSON array.

**Example output:**
[
  {
    "name": "dyn-001",
    "description": "Always verify file existence before reading or writing.",
    "content": "## Verify File Existence Before Acting\n\n1. Check: os.path.exists(path)\n2. If missing, ask the user for the correct path.\n**Anti-pattern:** Calling open(path) without checking.",
    "category": "coding"
  }
]

A.5 Skill Injection Format

Retrieved skills are appended to the agent’s system message by SkillManager.format_for_conversation:

Skill Injection Block (appended to system prompt)

## Active Skills

### backup-before-modify
_Always create a .bak copy before modifying any existing file._

## Backup Before Modify
1. Before editing any file, create a backup: cp .bak
2. Verify the backup exists before proceeding.
3. Apply all modifications to the original file.
**Anti-pattern:** Overwriting a file without a backup, leaving no recovery path if the edit is incorrect.
### iso8601-timezone-format
_Use when writing any date/time field to a file._

## ISO 8601 Timestamp with Timezone
Always format timestamps as: YYYY-MM-DDTHH:MM:SS+08:00
- Correct: 2026-03-16T09:30:00+08:00
- Incorrect: 2026-03-16, March 16 at 3pm, 2026-03-16T09:30:00Z
**Anti-pattern:** Omitting the timezone offset or using natural-language date expressions.

A.6 System Prompt Compression Prompt

To prevent context overflow during long sessions, MetaClaw compresses OpenClaw’s native system prompt using the following instruction:

System Prompt Compression Instruction (utils.py)

You are compressing an OpenClaw system prompt. Rewrite it to be under 2000 tokens while preserving behavior. Keep all critical policy and routing rules: (1) tool names and their intended usage constraints, (2) safety and non-delegable prohibitions, (3) skills-selection rules, (4) memory recall requirements, (5) update/config restrictions, (6) reply-tag/messaging rules, (7) heartbeat handling rules. Remove duplicated prose, repeated examples, and decorative language. Prefer compact bullet sections with short imperative statements. Do not invent or weaken any rule. Output only the rewritten system prompt text.

A.7 Part II Implicit Preference Rules

MetaClaw-Bench Part II introduces five implicit preference rules progressively across 14 days. These rules are not stated in the agent’s system prompt; they must be inferred from task feedback and internalized through skill evolution or RL training.
Rule   Category         Requirement                                                   Active from
P1     Timestamp        All date/time fields: YYYY-MM-DDTHH:MM:SS+08:00               Day 01
P2     File naming      Output files: YYYYMMDD_description.ext (snake_case)           Day 04
P3     Metadata         Every output file must include created_at, author, status     Day 06
P4     Backup           Create .bak before modifying any existing file                Day 08
P5     Completion log   Append [DONE] | | to done.log                                 Day 10

Table 4  The five implicit preference rules in MetaClaw-Bench Part II, introduced progressively across 6 learning arcs. Each rule is verified by a dedicated automated checker script.
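Each rule maps to a small automated check. The dedicated checker scripts are not reproduced in the paper, so the regular expressions below are illustrative sketches of what P1 (timestamp) and P2 (file naming) verification might look like:

```python
import re

# Illustrative checkers for preference rules P1 and P2. The actual
# checker scripts are not shown in the paper; these patterns are
# assumptions matching the formats stated in Table 4.
P1_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+08:00$")
P2_FILENAME = re.compile(r"^\d{8}_[a-z0-9]+(_[a-z0-9]+)*\.[a-z0-9]+$")

def check_p1(value: str) -> bool:
    """P1: every date/time field must be YYYY-MM-DDTHH:MM:SS+08:00."""
    return bool(P1_TIMESTAMP.match(value))

def check_p2(filename: str) -> bool:
    """P2: output files must be named YYYYMMDD_description.ext in snake_case."""
    return bool(P2_FILENAME.match(filename))

print(check_p1("2026-03-16T09:30:00+08:00"))  # True
print(check_p2("20260408_deploy_log.json"))   # True
print(check_p1("2026-03-16T09:30:00Z"))       # False: missing +08:00 offset
```

A real checker would additionally walk every field of the produced JSON file and exit nonzero on the first violation, mirroring the "exit 0 = pass" contract described in A.3.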