AutoRefine: From Trajectories to Reusable Expertise for Continual LLM Agent Refinement
Large language model agents often fail to accumulate knowledge from experience, treating each task as an independent challenge. Recent methods extract experience as flattened textual knowledge, which cannot capture the procedural logic of complex subtasks. They also lack maintenance mechanisms, causing repository degradation as experience accumulates. We introduce AutoRefine, a framework that extracts and maintains dual-form Experience Patterns from agent execution histories. For procedural subtasks, we extract specialized subagents with independent reasoning and memory. For static knowledge, we extract skill patterns as guidelines or code snippets. A continuous maintenance mechanism scores, prunes, and merges patterns to prevent repository degradation. Evaluated on ALFWorld, ScienceWorld, and TravelPlanner, AutoRefine achieves success rates of 98.4%, 70.4%, and 27.1%, respectively, with 20-73% step reductions. On TravelPlanner, automatic extraction exceeds manually designed systems (27.1% vs. 12.1%), demonstrating its ability to capture procedural coordination.
💡 Research Summary
AutoRefine addresses two major shortcomings of existing experience‑learning approaches for large‑language‑model (LLM) agents: (1) the inability of flattened textual representations to capture procedural logic, and (2) the lack of a maintenance mechanism that prevents the experience repository from becoming bloated and noisy. The proposed framework introduces a dual‑form Experience Pattern: (i) Sub‑agent Patterns, which encapsulate complex sub‑tasks (e.g., hotel booking, route planning) as specialized, autonomous agents equipped with their own memory and reasoning capabilities; and (ii) Skill Patterns, which store static knowledge as natural‑language guidelines or executable code snippets. Each pattern carries rich metadata—including description, applicable context, retrieval count, usage count, success count, and a dense embedding—for quantitative evaluation.
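The pattern record described above can be sketched as a small data structure. This is an illustrative reading of the summary, not the paper's actual code; the field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of the dual-form Experience Pattern record:
# a pattern is either a Sub-agent Pattern (procedural) or a Skill
# Pattern (static), plus the metadata the summary lists.
@dataclass
class ExperiencePattern:
    kind: str                   # "subagent" or "skill"
    description: str            # what the pattern does
    context: str                # when it is applicable
    body: str                   # guideline text, code snippet, or sub-agent spec
    embedding: List[float] = field(default_factory=list)  # dense vector for retrieval
    retrievals: int = 0         # r: times returned by retrieval
    uses: int = 0               # u: times actually invoked
    successes: int = 0          # s: invocations that contributed to task success

    def record_use(self, succeeded: bool) -> None:
        # Metadata is updated in real time after each invocation.
        self.uses += 1
        if succeeded:
            self.successes += 1
```

The separate `retrievals` counter matters later: maintenance scoring distinguishes patterns that are merely retrieved from patterns that are actually used.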
The system operates in three stages. During task execution, the main agent retrieves the most relevant patterns by generating multiple reformulated queries from the task description, embedding them with Qwen3‑Embedding‑4B, and ranking patterns via cosine similarity (with optional maximal marginal relevance to ensure diversity). Retrieved Skill Patterns are injected into the system prompt or registered as callable tools, while Sub‑agent Patterns are invoked through hierarchical delegation: the main agent detects a matching sub‑task, transfers the relevant context, and lets the sub‑agent run independently, returning its result upon completion. Metadata is updated in real time to record actual usage and success.
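The retrieval step above (cosine ranking with optional maximal marginal relevance) can be sketched as follows. The toy vectors stand in for Qwen3-Embedding-4B outputs, and the function names are illustrative, not the framework's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_rank(query_vec, patterns, k=2, lam=0.5):
    """Maximal marginal relevance: greedily pick patterns that are
    relevant to the query but dissimilar to those already selected.
    `patterns` is a list of (name, embedding) pairs; `lam` trades
    relevance (high lam) against diversity (low lam)."""
    selected, remaining = [], list(patterns)
    while remaining and len(selected) < k:
        def mmr_score(p):
            relevance = cosine(query_vec, p[1])
            redundancy = max((cosine(p[1], s[1]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        remaining.remove(best)
        selected.append(best)
    return [name for name, _ in selected]
```

With a low `lam`, a near-duplicate of an already-selected pattern is skipped in favor of a more diverse one, which is exactly the role MMR plays in the retrieval stage.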
Pattern extraction occurs every K tasks (default K = 10) by batching recent trajectories into success and failure sets. A dedicated extraction agent receives contrastive analysis prompts that ask it to compare successful and failed sequences, identify recurring action subsequences, and articulate the causal principles behind success. The agent then abstracts these principles into either a Sub‑agent Pattern (for procedural logic) or a Skill Pattern (for static guidance). This batch‑level extraction mitigates over‑fitting to individual tasks and captures reusable strategies across episodes.
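The batching step that feeds the extraction agent can be sketched as below. The trajectory format and function name are assumptions; the LLM call that performs the contrastive analysis is elided.

```python
def extraction_batches(trajectories, k=10):
    """Group finished trajectories into batches of k (the default
    extraction interval) and split each batch into success and
    failure sets for contrastive extraction. Leftover trajectories
    that do not fill a batch simply wait for the next interval."""
    usable = len(trajectories) - len(trajectories) % k
    for i in range(0, usable, k):
        batch = trajectories[i:i + k]
        successes = [t for t in batch if t["success"]]
        failures = [t for t in batch if not t["success"]]
        yield successes, failures
```

Feeding the extraction agent whole batches rather than single episodes is what lets it contrast recurring action subsequences across tasks instead of memorizing one trajectory.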
Maintenance runs at exponentially spaced intervals. Each pattern receives a composite score:

score(p) = (s / (u + ε)) · log(1 + u) + (1 + u) / (r + ε)

where s is the number of successful uses, u the number of actual uses, r the number of retrievals, and ε a small constant for numerical stability. The first term rewards effectiveness (success rate, weighted by usage volume) and the second rewards frequency (usage relative to retrievals). Low‑scoring patterns are pruned, and patterns of the same type whose embeddings exceed a similarity threshold are merged, preventing repository explosion while preserving high‑utility knowledge.
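The scoring and pruning steps can be sketched in Python. The `keep_fraction` knob is illustrative rather than the paper's threshold, and the embedding-similarity merge step is elided.

```python
import math

def pattern_score(s, u, r, eps=1e-6):
    """Composite score from the formula above: an effectiveness term
    (success rate s/u, weighted by log-usage) plus a frequency term
    (usage relative to retrieval count)."""
    return (s / (u + eps)) * math.log(1 + u) + (1 + u) / (r + eps)

def maintain(patterns, keep_fraction=0.8):
    """One maintenance pass (sketch): score every pattern and prune the
    low-scoring tail. `patterns` maps a pattern name to its (s, u, r)
    counters; a real pass would also merge same-type patterns whose
    embeddings exceed a similarity threshold."""
    ranked = sorted(patterns.items(),
                    key=lambda kv: pattern_score(*kv[1]),
                    reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return dict(ranked[:keep])
```

Note how the frequency term penalizes patterns that are retrieved often but rarely used, which is precisely the "bloated and noisy repository" failure mode maintenance is meant to prevent.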
Empirical evaluation on three benchmarks demonstrates the effectiveness of AutoRefine. On ALFWorld, the framework achieves 98.4% success with a 73% reduction in the number of steps; on ScienceWorld, 70.4% success with a 20% step reduction; and on TravelPlanner, 27.1% success, more than double the performance of the manually engineered ATLAS system (12.1%). Ablation studies reveal that removing Sub‑agent Patterns causes the largest performance drop, while disabling maintenance leads to a 4.5× growth in repository size and an 8.9× decline in utilization efficiency.
In summary, AutoRefine provides a scalable, self‑sustaining pipeline for LLM agents to distill, store, and reuse procedural expertise. By converting complex sub‑tasks into autonomous sub‑agents and continuously pruning and merging patterns based on empirical utility, the framework enables agents to accumulate knowledge in a manner akin to human learning, opening avenues for long‑term, cross‑domain competence without the need for costly model retraining. Future work may explore automatic generation of sub‑agents, multimodal extensions, and deployment in large‑scale real‑world services.