Coarse-to-Fine Grounded Memory for LLM Agent Planning
Recent advances in Large Language Models (LLMs) have driven growing interest in LLM-based agents for complex planning tasks. To avoid costly agent training, many studies adopt memory mechanisms that enhance the LLM with offline experiences or online trajectory analysis. However, existing works focus on single-granularity memory derived from dynamic environmental interactions, which is inherently constrained by the quality of the collected experiences. This limitation, in turn, restricts the diversity of knowledge and the flexibility of planning. We propose Coarse-to-Fine Grounded Memory (CFGM), a novel framework that grounds coarse-to-fine memories with an LLM, thereby fully leveraging them for flexible adaptation to diverse scenarios. CFGM grounds environmental information into coarse-grained focus points to guide experience collection in training tasks, followed by the grounding of actionable hybrid-grained tips from each experience. At inference, CFGM retrieves task-relevant experiences and tips to support planning. When facing environmental anomalies, the LLM grounds the current situation into fine-grained key information, enabling flexible self-QA reflection and plan correction.
💡 Research Summary
The paper introduces Coarse‑to‑Fine Grounded Memory (CFGM), a novel memory‑augmentation framework for large‑language‑model (LLM) agents that tackles the limitations of existing single‑granularity memory mechanisms. Traditional approaches store either offline experiences or online trajectory analyses as a monolithic memory, which is heavily dependent on the quality and diversity of collected interactions. CFGM instead structures memory at three hierarchical levels—coarse, hybrid, and fine—and tightly grounds each level with the LLM’s internal knowledge, thereby improving experience collection, knowledge extraction, and real‑time plan correction.
1. Coarse‑grained Focus‑Driven Experience Collection
At training time, the LLM receives a textual description of the environment together with a few manually crafted demonstration trajectories. Using its world knowledge, the model extracts coarse‑grained “focus points” (e.g., key objects, constraints, high‑level subgoals). These focus points guide the agent’s exploration policy: instead of blind trial‑and‑error, the agent conducts a series of Think‑Act cycles that are biased toward the identified focal areas. Each trial produces a trajectory; failures trigger a reflective prompt (LLM Reflect) whose output is accumulated as task‑specific reflection context. Both successful and failed trajectories, together with their associated tasks, are stored in an experience pool B.
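The collection loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: `extract_focus_points`, `llm_think_act`, and `llm_reflect` are hypothetical stand-ins for the LLM prompts (focus extraction, Think-Act, LLM Reflect), here replaced by deterministic stubs so the control flow is runnable.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    task: str
    steps: list
    success: bool

def extract_focus_points(env_description: str, demos: list) -> list:
    """Stub for the LLM call that grounds the environment description and
    demos into coarse-grained focus points (objects, constraints, subgoals)."""
    return ["key objects", "constraints", "high-level subgoals"]

def llm_think_act(task: str, focus_points: list, reflections: list) -> Trajectory:
    """Stub Think-Act rollout; pretends the agent succeeds once at least
    one reflection from a prior failed trial is available as context."""
    return Trajectory(task, steps=["think", "act"], success=bool(reflections))

def llm_reflect(traj: Trajectory) -> str:
    """Stub for the LLM Reflect prompt applied to a failed trajectory."""
    return f"Avoid repeating the failure in task '{traj.task}'."

def collect_experience(task: str, env_description: str, demos: list,
                       max_trials: int = 3) -> list:
    """Focus-biased trials; failures feed reflections into the next trial.
    Both successful and failed trajectories go into the experience pool B."""
    focus_points = extract_focus_points(env_description, demos)
    pool, reflections = [], []
    for _ in range(max_trials):
        traj = llm_think_act(task, focus_points, reflections)
        pool.append(traj)
        if traj.success:
            break
        reflections.append(llm_reflect(traj))
    return pool
```

With the stubs above, one failed trial produces a reflection that lets the second trial succeed, so both trajectories end up in the pool, mirroring how B retains failures alongside successes.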
2. Hybrid‑grained Experience‑wise Tips Extraction
Because many tasks generate both success and failure trajectories, CFGM leverages the contrast between them. For each task, the LLM is prompted (LLM Tips) to compare the successful and failed runs, extracting “tips” that blend three granularity levels: (i) high‑level principles (e.g., “always verify the door is unlocked before entering”), (ii) mid‑level strategies (e.g., “use a nearby object to pry open a stuck latch”), and (iii) low‑level execution details (e.g., “click at coordinates (x, y) for 0.3 s”). When only a successful trajectory exists, the model still extracts a moderate set of tips focusing on what made the run succeed. All tips are stored in a task‑indexed dictionary TD for later retrieval.
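A minimal sketch of this contrastive extraction, under the same caveat: `llm_extract_tips` is a hypothetical stub for the paper's "LLM Tips" prompt, and the tip strings are invented placeholders. The point is the structure — contrast success against failures when both exist, fall back to success-only tips otherwise, and index everything by task in TD.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    task: str
    steps: list
    success: bool

def llm_extract_tips(success_traj: Trajectory, failed_trajs: list) -> list:
    """Stub: emit (granularity, tip) pairs. With failures available, the
    contrast yields tips at all three levels; success-only runs yield a
    smaller set focused on what made the run work."""
    tips = [("principle", "verify preconditions before acting")]
    if failed_trajs:
        tips.append(("strategy", "retry with an alternative tool on failure"))
        tips.append(("detail", "re-check the target state after each action"))
    return tips

def build_tip_dictionary(experience_pool: list) -> dict:
    """Group trajectories by task, then store extracted tips per task (TD)."""
    by_task = {}
    for traj in experience_pool:
        by_task.setdefault(traj.task, []).append(traj)
    TD = {}
    for task, trajs in by_task.items():
        successes = [t for t in trajs if t.success]
        failures = [t for t in trajs if not t.success]
        if successes:  # tips require at least one successful run
            TD[task] = llm_extract_tips(successes[0], failures)
    return TD
```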
3. Fine‑grained Trajectory Information Adaptive Planning
During inference, the agent first retrieves the top‑k most similar experiences from B using a Faiss index over textual embeddings. The retrieved successful trajectories and their associated tips are concatenated into “Similar Trajectories (ST)” and “Experience Tips (ET)” strings, which are supplied as context to the planning LLM. While executing, the agent monitors observations for predefined anomaly triggers (e.g., unexpected object positions, resource depletion). Upon detection, a Key Information Extraction (KIE) model parses the current trajectory to produce a structured set of fine‑grained key variables (KI). A Key Information Reflection (KIR) model then formulates self‑question‑answer pairs that combine KI, the current trajectory, and ST, allowing the LLM to generate a corrective plan (ref_i). This plan is injected back into the ongoing trajectory, enabling immediate adaptation.
Experimental Validation
CFGM was evaluated on three diverse interactive planning benchmarks: AlfWorld (text‑based household tasks), WebShop (e‑commerce navigation), and ScienceWorld (complex scientific problem solving). Compared with strong baselines such as ExpeL, AutoGuide, and QuBE, CFGM achieved 12‑18 % higher success rates and demonstrated markedly better robustness under environment perturbations. Ablation studies revealed that (a) coarse‑grained focus points improve experience quality by 27 %, (b) hybrid‑grained tips increase policy ROUGE‑L scores by 0.45 on average, and (c) fine‑grained self‑QA raises error‑recovery rates by 31 %. Importantly, CFGM works with closed‑source LLMs (e.g., GPT‑4) without any parameter fine‑tuning, relying solely on prompt engineering and external memory retrieval.
Key Contributions and Implications
- Introduces a systematic, three‑level grounding of memory that aligns external experience data with the LLM’s internal knowledge.
- Demonstrates that guiding exploration with LLM‑derived focus points yields higher‑quality, more diverse experience pools.
- Shows that hybrid‑grained tips capture both generalizable principles and task‑specific tricks, enriching the agent’s decision‑making context.
- Provides a flexible, fine‑grained self‑reflection mechanism that can adapt plans on the fly when faced with unforeseen anomalies.
Overall, CFGM offers a new paradigm for memory‑augmented LLM agents: by converting raw interaction data into structured, multi‑granular knowledge anchored in the model’s own understanding, agents become more data‑efficient, knowledge‑rich, and resilient. This approach opens pathways for deploying LLM‑driven autonomous systems—robots, virtual assistants, automated research tools—in complex, dynamic environments where human‑level adaptability is required.