TodoEvolve: Learning to Architect Agent Planning Systems

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.

💡 Research Summary

TodoEvolve tackles a fundamental limitation of current large‑language‑model (LLM)‑driven agents: their planning modules are hand‑crafted, static, and unable to adapt to the structural diversity of open‑ended tasks. The authors introduce a meta‑planning paradigm that automatically synthesizes and continuously revises task‑specific planning architectures. Central to this approach are two contributions: (1) PlanFactory, a modular design space that unifies a wide spectrum of existing planning paradigms under a common codebase, and (2) Impedance‑Guided Preference Optimization (IGPO), a multi‑objective reinforcement‑learning objective used to train a 14‑billion‑parameter meta‑planner called Todo‑14B.

PlanFactory decomposes any planning system into four orthogonal components: topology (the structural organization of the plan), initialization (how the topology is instantiated), adaptation (when and how the plan is revised during execution), and navigation (the mechanism that issues executable directives to the agent). By re‑implementing ten representative planners—including linear to‑do lists, DAG‑based planners, tree‑structured hierarchies, and multi‑agent coordination frameworks—the authors provide a unified abstraction that can express both single‑agent and multi‑agent, step‑wise or task‑wise, linear or graph‑structured plans. This unified codebase serves both as a data synthesis engine (generating high‑quality planning trajectories) and as a benchmark platform for future research.

The meta‑planner Todo‑14B receives a task description and, using the PlanFactory API, dynamically assembles a customized planning system P* that specifies the four dimensions appropriate for the task. To endow Todo‑14B with the ability to generate such systems, the authors devise IGPO. Unlike conventional supervised fine‑tuning that only imitates existing code, IGPO jointly optimizes three criteria: (i) performance (reward obtained by the generated plan), (ii) stability (consistency of the plan’s structure over time), and (iii) token efficiency (minimizing the number of tokens required to describe the plan). The “impedance” term quantifies the resistance incurred when a plan becomes overly complex, leading to higher token costs and lower stability; IGPO explicitly penalizes such impedance.
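This summary does not state the IGPO loss itself. As a point of reference only, the sketch below implements the standard Direct Preference Optimization (DPO) loss for a single winner/loser pair, which is the usual starting form for preference-optimization objectives. It is a stand-in, not IGPO: the function name, the `beta` default, and the absence of any impedance term are all assumptions of this example.

```python
import math


def dpo_style_loss(logp_w: float, logp_l: float,
                   ref_logp_w: float, ref_logp_l: float,
                   beta: float = 0.1) -> float:
    """Generic DPO loss for one (winner, loser) pair. NOT the paper's IGPO.

    margin = beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]
    loss   = -log(sigmoid(margin))
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# No preference signal: policy and reference agree on both sequences.
neutral = dpo_style_loss(0.0, 0.0, 0.0, 0.0)

# Policy already favors the winner relative to the reference, so the loss is lower.
favoring = dpo_style_loss(-1.0, -5.0, -3.0, -3.0)
```

In a setup like the one described above, the winner and loser would be two generated planning systems for the same task, with log-probabilities taken from the meta-planner and a frozen reference model.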

Training data are constructed through a “Bootstrap‑and‑Filter” pipeline. First, existing planners are broken down into standardized tools within PlanFactory. Then, an evolutionary sampler generates diverse candidate plans for each query, conditioned on the query, a system prompt, tool documentation, and a few static reference plans. Each candidate is executed in the PlanFactory runtime; only those that produce the correct final answer pass an execution‑as‑judge filter, ensuring that the dataset contains only sound architectures. The filtered trajectories are used for supervised fine‑tuning (SFT) to teach basic code‑generation ability, while preference pairs (winner vs. loser plans) are formed for IGPO, where the winner is selected based on a hierarchy: correctness first, then stability, then token efficiency.
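The filter and the winner-selection hierarchy described above can be sketched as a lexicographic comparison. The `Candidate` fields, the stability score, and the choice to pair every two candidates for a query are assumptions of this example; the summary does not say how stability is measured or whether incorrect candidates are kept as losers.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    plan_id: str
    correct: bool     # final answer matched the reference after execution
    stability: float  # hypothetical score, higher is better
    tokens: int       # total tokens consumed, lower is better


def rank_key(c: Candidate) -> Tuple[int, float, int]:
    # Lexicographic order: correctness first, then stability, then fewer tokens.
    return (int(c.correct), c.stability, -c.tokens)


def sft_set(cands: List[Candidate]) -> List[Candidate]:
    # Execution-as-judge filter: keep only candidates with a correct final answer.
    return [c for c in cands if c.correct]


def preference_pairs(cands: List[Candidate]) -> List[Tuple[Candidate, Candidate]]:
    # Build (winner, loser) pairs; exact ties carry no preference and are skipped.
    pairs = []
    for a, b in combinations(cands, 2):
        ka, kb = rank_key(a), rank_key(b)
        if ka == kb:
            continue
        pairs.append((a, b) if ka > kb else (b, a))
    return pairs


candidates = [
    Candidate("a", correct=True, stability=0.9, tokens=1200),
    Candidate("b", correct=True, stability=0.9, tokens=800),
    Candidate("c", correct=False, stability=1.0, tokens=100),
]
sft_ids = [c.plan_id for c in sft_set(candidates)]
pair_ids = [(w.plan_id, l.plan_id) for w, l in preference_pairs(candidates)]
```

Here `b` beats `a` only on token count because the two tie on correctness and stability, while `c` loses to both despite its higher stability and lower token use, since correctness is compared first.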

Empirical evaluation spans five challenging agentic benchmarks (including GAIA, xBench‑DS, and Smolagents) and multiple LLM backbones (GPT‑5‑Mini, Claude‑2, Llama‑2, etc.). Todo‑14B consistently outperforms carefully engineered hand‑crafted planners, achieving up to a 16.37% improvement in success rate on GAIA and boosting GPT‑5‑Mini’s performance on xBench‑DS by 75%. Moreover, the meta‑planner reduces API costs and runtime overhead by roughly 20% thanks to its token‑efficient designs. In multi‑agent scenarios, Todo‑14B’s ability to dynamically restructure topologies (e.g., switching between parallel DAG execution and sequential refinement) yields superior coordination compared to static DAG‑based planners.

The paper’s contributions are threefold: (1) a publicly released, extensible codebase (PlanFactory) that standardizes planning research, (2) a novel meta‑planning framework (TodoEvolve) that treats planning architecture as a learnable object, and (3) a new training objective (IGPO) that balances performance, stability, and efficiency. By moving the focus from policy‑level optimization to architecture‑level synthesis, TodoEvolve opens a new research direction—automatic design of planning systems—potentially reducing the engineering burden for future LLM‑powered agents.

Remaining challenges include the reliance on a high‑quality execution‑as‑judge filter, which may become a bottleneck in more complex or noisy real‑world environments, and the relatively simple impedance model that could be refined with richer cost‑benefit analyses. Future work might explore self‑supervised verification, hierarchical meta‑planning across multiple abstraction layers, and broader generalization to domains beyond the current benchmarks. Overall, TodoEvolve demonstrates that meta‑learning can effectively automate the creation of adaptable, efficient planning modules, marking a significant step toward more autonomous and versatile AI agents.
