Experience-based Knowledge Correction for Robust Planning in Minecraft


Large Language Model (LLM)-based planning has advanced embodied agents in long-horizon environments such as Minecraft, where acquiring latent knowledge of goal (or item) dependencies and feasible actions is critical. However, LLMs often begin with flawed priors and fail to correct them through prompting, even with feedback. We present XENON (eXpErience-based kNOwledge correctioN), an agent that algorithmically revises knowledge from experience, enabling robustness to flawed priors and sparse binary feedback. XENON integrates two mechanisms: Adaptive Dependency Graph, which corrects item dependencies using past successes, and Failure-aware Action Memory, which corrects action knowledge using past failures. Together, these components allow XENON to acquire complex dependencies despite limited guidance. Experiments across multiple Minecraft benchmarks show that XENON outperforms prior agents in both knowledge learning and long-horizon planning. Remarkably, with only a 7B open-weight LLM, XENON surpasses agents that rely on much larger proprietary models. Project page: https://sjlee-me.github.io/XENON


💡 Research Summary

The paper tackles a fundamental problem in embodied AI: large language models (LLMs) used for planning in complex, long‑horizon environments such as Minecraft often start with inaccurate or incomplete priors about item dependencies and feasible actions. Prior work has tried to fix these flaws by fine‑tuning the LLM, injecting curated knowledge from wikis, or prompting the model to self‑correct. However, the parametric knowledge encoded in the LLM is stubborn; repeated prompting rarely eliminates systematic errors, especially when only sparse binary success/failure feedback is available.

XENON (eXpErience‑based kNOwledge correctioN) proposes a different paradigm: instead of asking the LLM to correct itself, the agent maintains external knowledge structures that are updated algorithmically from experience. Two synergistic components are introduced:

  1. Adaptive Dependency Graph (ADG) – an external directed‑acyclic graph that stores the agent’s current belief about item prerequisites. The graph is initialized using the LLM’s predictions, which may contain hallucinated or missing edges. As the agent successfully obtains items, it records the actual requirement set R_exp(v) (the items consumed to craft v) and replaces the incoming edges of v with this observed set. When an item repeatedly fails to be obtained, a revision counter C(v) triggers the RevisionByAnalogy procedure. If C(v) exceeds a threshold c₀, the item is flagged as potentially hallucinated; its descendants are recursively revised to remove the invalid dependency. Otherwise, the procedure revises the requirement set by borrowing prerequisite items from the top‑K most similar previously‑obtained items, where similarity is measured by cosine similarity of Sentence‑BERT embeddings of item names. This analogy‑based revision allows the graph to converge toward the true latent dependency DAG without any direct supervision.

  2. Failure‑aware Action Memory (FAM) – a memory that maps each target item to a high‑level action (mine, craft, smelt). After each low‑level execution, binary feedback updates FAM: successful actions are cached, while failures increment per‑action counters. When an action repeatedly fails, FAM marks it as invalid and constrains the LLM’s next prompt to explore under‑used alternatives. Crucially, FAM distinguishes between failures caused by a wrong action versus those caused by an incorrect dependency; the latter triggers ADG’s revision mechanism, creating a feedback loop that jointly refines both knowledge sources.
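The ADG update rules above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name, the illustrative thresholds `c0` and `top_k`, and the token-overlap `similarity` function (standing in for the paper's Sentence-BERT cosine similarity over item-name embeddings) are all assumptions.

```python
# Minimal sketch of the Adaptive Dependency Graph (ADG) update rules.
# Assumptions: token-overlap similarity replaces Sentence-BERT cosine
# similarity; threshold values are illustrative, not taken from the paper.
from collections import defaultdict


def similarity(a: str, b: str) -> float:
    """Token-overlap stand-in for Sentence-BERT cosine similarity."""
    ta, tb = set(a.split("_")), set(b.split("_"))
    return len(ta & tb) / len(ta | tb)


class AdaptiveDependencyGraph:
    def __init__(self, llm_prior: dict, c0: int = 3, top_k: int = 2):
        self.req = {v: set(r) for v, r in llm_prior.items()}  # incoming edges per item
        self.fail_count = defaultdict(int)                    # revision counter C(v)
        self.obtained = {}                                    # observed R_exp(v)
        self.c0, self.top_k = c0, top_k

    def on_success(self, item: str, consumed: set) -> None:
        # Replace the incoming edges of `item` with the observed requirement set.
        self.obtained[item] = set(consumed)
        self.req[item] = set(consumed)
        self.fail_count[item] = 0

    def on_failure(self, item: str) -> None:
        self.fail_count[item] += 1
        if self.fail_count[item] > self.c0:
            self._flag_hallucinated(item)   # C(v) exceeded the threshold c0
        else:
            self._revise_by_analogy(item)

    def _revise_by_analogy(self, item: str) -> None:
        # Borrow prerequisites from the top-K most similar obtained items.
        ranked = sorted(self.obtained, key=lambda o: similarity(item, o), reverse=True)
        borrowed = set().union(*(self.obtained[o] for o in ranked[: self.top_k])) if ranked else set()
        if borrowed:
            self.req[item] = borrowed

    def _flag_hallucinated(self, item: str) -> None:
        # Treat `item` as hallucinated: remove it as a prerequisite everywhere.
        for v in self.req:
            self.req[v].discard(item)
```

With this sketch, one successful craft overwrites the LLM's hallucinated prerequisite set, a single failure triggers analogy-based borrowing, and repeated failures prune the item from its descendants' requirements.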

The overall loop proceeds as follows: an unobtained item is selected as an exploratory goal; ADG traverses backward to produce a list of prerequisite items; for each prerequisite, FAM either reuses a cached successful action or asks the LLM to suggest a new high‑level action, conditioned on the action history. The low‑level controller executes the resulting language sub‑goals, returns binary success/failure, and the experience is used to update ADG and FAM. Algorithm 1 in the appendix formalizes this process.
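One iteration of this loop can be sketched as below. This is a simplified illustration under stated assumptions: the `FailureAwareActionMemory` class, the `max_fails` cutoff, and the `llm_propose`/`execute` callables are hypothetical stand-ins for the paper's FAM, LLM prompting, and low-level controller; the ADG traversal is flattened to a single prerequisite layer.

```python
# Sketch of one XENON-style planning iteration: reuse cached actions where
# possible, otherwise ask the LLM for an action restricted to under-used
# alternatives, then update the memory from binary feedback.
# All names and thresholds here are illustrative assumptions.
ACTIONS = ("mine", "craft", "smelt")


class FailureAwareActionMemory:
    def __init__(self, max_fails: int = 2):
        self.cached = {}     # item -> known-good high-level action
        self.fails = {}      # (item, action) -> failure count
        self.max_fails = max_fails

    def valid_actions(self, item: str) -> list:
        # Exclude actions that have failed too often for this item.
        return [a for a in ACTIONS if self.fails.get((item, a), 0) < self.max_fails]

    def record(self, item: str, action: str, success: bool) -> None:
        if success:
            self.cached[item] = action
        else:
            self.fails[(item, action)] = self.fails.get((item, action), 0) + 1


def plan_step(goal, adg_req, fam, llm_propose, execute) -> bool:
    # Traverse the dependency graph backward: prerequisites first, then the goal.
    for item in list(adg_req.get(goal, ())) + [goal]:
        action = fam.cached.get(item) or llm_propose(item, fam.valid_actions(item))
        success = execute(action, item)   # low-level controller: binary feedback
        fam.record(item, action, success)
        if not success:
            return False  # the caller would update the ADG on dependency failures
    return True
```

Repeated calls to `plan_step` show the intended behavior: a wrong action fails until its counter hits `max_fails`, after which the LLM is constrained to the remaining alternatives and the first success is cached for reuse.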

Experimental evaluation spans three Minecraft testbeds: (i) basic resource gathering, (ii) multi‑step crafting chains, and (iii) long‑horizon sequential goal achievement. XENON is compared against state‑of‑the‑art agents such as ADAM, Optimus‑1, and DECKARD, which either rely on large proprietary LLMs or on weak self‑correction mechanisms. Using only a 7‑billion‑parameter open‑weight model (Qwen2.5‑VL‑7B), XENON achieves higher success rates than GPT‑4‑based baselines, improves the true‑dependency metric N_true by 20‑30 percentage points, and reduces the number of invalid actions dramatically. Ablation studies confirm that both ADG and FAM are essential: removing ADG leads to persistent hallucinated dependencies, while removing FAM causes the agent to repeat the same wrong actions.

Contributions and impact:

  • Introduces a novel “knowledge externalization” framework where LLMs provide initial priors but the agent’s knowledge is stored and corrected outside the model.
  • Demonstrates that algorithmic revision based on analogical reasoning and failure‑aware action caching can replace costly fine‑tuning or massive model scaling.
  • Shows that lightweight open‑source LLMs can outperform much larger proprietary models when paired with robust experience‑driven knowledge management.

Limitations and future work: The current setup only uses binary feedback, which limits the granularity of failure diagnosis. Complex failures that involve both a wrong prerequisite and an unsuitable action may still be ambiguous. Moreover, similarity based on textual embeddings may not capture deeper semantic or physical relationships between items. Future directions include incorporating richer feedback signals (partial success, resource consumption), extending the revision mechanisms to other domains (robotic manipulation, simulation), and exploring more sophisticated analogical reasoning or graph‑neural‑network‑based updates.

In summary, XENON provides a compelling blueprint for building practical embodied agents that can learn and correct planning knowledge from minimal supervision, opening the door for scalable, lightweight AI systems in richly structured environments.

