Large Language Model (LLM) agents are increasingly deployed in complex, multi-step workflows involving planning, tool use, reflection, and interaction with external knowledge systems. These workflows generate rapidly expanding contexts that must be curated, transformed, and compressed to maintain fidelity, avoid attention dilution, and reduce inference cost. Prior work on summarization and query-aware compression largely ignores the multi-step, plan-aware nature of agentic reasoning. In this work, we introduce PAACE (Plan-Aware Automated Context Engineering), a unified framework for optimizing the evolving state of LLM agents through next-k-task relevance modeling, plan-structure analysis, instruction co-refinement, and function-preserving compression. PAACE comprises (1) PAACE-Syn, a large-scale generator of synthetic agent workflows annotated with stepwise compression supervision, and (2) PAACE-FT, a family of distilled, plan-aware compressors trained from successful teacher demonstrations. Experiments on long-horizon benchmarks (AppWorld, OfficeBench, and 8-Objective QA) demonstrate that PAACE consistently improves agent correctness while substantially reducing context load. On AppWorld, PAACE achieves higher accuracy than all baselines while lowering peak context and cumulative dependency. On OfficeBench and multi-hop QA, PAACE improves both accuracy and F1, achieving fewer steps, lower peak tokens, and reduced attention dependency. Distilled PAACE-FT retains 97 percent of the teacher's performance while reducing inference cost by over an order of magnitude, enabling practical deployment of plan-aware compression with compact models.
LLM-driven agents have emerged as a central paradigm for solving complex, multi-step tasks across domains such as software development, research assistance, operations automation, legal workflows, data analysis, and enterprise decision-making. Systems such as ReAct (Yao et al., 2023a), Toolformer (Schick et al., 2023a), AutoGPT, Devin, WebArena (Zhou et al., 2024), and AgentBench (Liu et al., 2023) highlight the promise and challenges of designing agents with reasoning, planning, and tool-use capabilities. As these systems evolve, a consistent bottleneck has become apparent: context management. An LLM agent's state is represented not by model parameters but by its prompt context: the system instructions, the evolving plan, previous reasoning traces, tool results, user instructions, long-term memories, retrieved knowledge, and intermediate outputs.
Agents do not operate on a single-step basis. They execute a plan P = [τ1, τ2, ..., τn] with dependencies between tasks, and only a subset of the context is relevant at each stage. As tasks grow in depth and breadth, this state becomes increasingly large, noisy, redundant, and expensive to process. Even models with 200k-1M token windows exhibit degraded reasoning quality (“context rot”) when overloaded with irrelevant or poorly structured information. Recent industry reports emphasize that modern agentic failures are overwhelmingly context failures, not model failures. Despite advances in model architecture and context length, agents fail when: crucial information is dropped or buried in irrelevant text, irrelevant details overload the model’s attention, multiple tasks compete for context bandwidth, instructions contradict or drift over time, or context becomes too long to process efficiently. We frame context engineering as learning a state compression policy over an agent’s evolving execution state, trained via outcome-preserving supervision, rather than as heuristic prompt editing.
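The idea of conditioning compression on the next k plan tasks can be illustrated with a toy sketch. Everything here is hypothetical: the `ContextItem` structure, the explicit relevance tags, and the `compress_state` function are stand-ins for a learned policy, which would score relevance with a model rather than rely on labels.

```python
from dataclasses import dataclass, field

@dataclass
class ContextItem:
    text: str
    # Hypothetical tags naming the plan tasks this item supports;
    # a real system would infer relevance, not read it from labels.
    relevant_tasks: set = field(default_factory=set)

def compress_state(context, plan, current_step, k=2):
    """Toy next-k-task filter: keep only items relevant to the
    next k tasks in the plan (a crude proxy for a learned
    state-compression policy)."""
    horizon = set(plan[current_step:current_step + k])
    return [item for item in context if item.relevant_tasks & horizon]

plan = ["fetch_orders", "join_customers", "summarize", "email_report"]
context = [
    ContextItem("raw order rows", {"fetch_orders", "join_customers"}),
    ContextItem("customer table schema", {"join_customers"}),
    ContextItem("email template", {"email_report"}),
]

# At step 1 with k=2 the horizon is {join_customers, summarize},
# so the email template is dropped until it becomes relevant.
kept = compress_state(context, plan, current_step=1, k=2)
```

The contrast with single-query compression is visible even in this sketch: a next-step-only filter (k=1) would also discard the schema once the join completes, whereas widening the horizon preserves items that later tasks depend on.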
While prompt engineering optimizes initial instructions, and RAG systems optimize retrieval, the missing discipline is context engineering: the science of continuously optimizing what the agent sees at each step. Existing approaches partially address pieces of this problem. Classical and instruction-following summarizers such as BART (Lewis et al., 2019), FLAN-T5 (Chung et al., 2022), and compression-oriented methods like LLMLingua (Jiang et al., 2023) generate concise summaries but often remove structural dependencies required for multi-step reasoning. Summaries flatten causal links across steps, which harms agent planning and tool-use workflows. Methods such as Self-RAG (Asai et al., 2024) and LLMLingua-2 (Jiang et al., 2024) optimize relevance for a single upcoming query. However, they do not model next-k steps, multi-hop dependencies, or evolving plans. Provence (Wu et al., 2024) performs binary keep/drop trimming via a relevance classifier but does not support rewriting, instruction refinement, dependency tracking, or structured context shaping.
Modern long-context LLMs (Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4o mini) offer 200k-1M+ token windows, yet still suffer from attention dilution, context “rot,” and quadratic cost growth in practice (Zaheer et al., 2020; Guo et al., 2021). Large windows do not solve the problem of poorly structured or irrelevant context. Memory and reflection systems such as MemAgent (Yu et al., 2025a), Reflexion (Shinn et al., 2023), and Generative Agents (Park et al., 2023) improve retrieval and episodic memory, but do not perform context shaping, state restructuring, or next-instruction refinement. Yu et al. (2025b) optimize next-step relevance through natural-language compression guidelines, but do not model next-k-step plan structure, do not refine instructions, and cannot jointly optimize plan-aware context and instruction transformations. Our primary distinction is conditioning on multiple future plan steps and the global workflow structure; to our knowledge, no existing system does so. Together, these contributions establish plan-aware context engineering as a critical component of robust long-horizon agent design and provide the first end-to-end framework that jointly models plan relevance and context structure to support cost-efficient, high-fidelity agent reasoning.
This section synthesizes prior work across ten research areas relevant to PAACE: (1) summarization and compression, (2) query-aware and task-aware reduction, (3) long-context models, (4) memory architectures for agents, (5) retrieval-augmented agents, (6) agent planning and multi-step reasoning, (7) prompt and instruction optimization, (8) context pruning and selection, (9) multi-agent systems and meta-reasoning, and (10) cognitive frameworks inspiring artificial memory systems. While each domain has contributed techniques relevant to context management, no prior work provides a unified, plan-aware, next-k-task context engineering framework like the proposed method.
Text summarization