ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning
The ACPBench dataset provides atomic reasoning tasks required for efficient planning. It aims to distill the complex plan-generation task into separate atomic reasoning tasks in their simplest possible form, boolean or multiple-choice questions in which the model must choose the right answer from the provided options. While the aim of ACPBench is to test the simplest form of reasoning about action and change, a model tasked with planning does not typically have options to choose from; the reasoning required for planning therefore dictates an open-ended, generative form of these tasks. To that end, we introduce ACPBench Hard, a generative version of ACPBench with open-ended questions that the model must answer. Models that perform well on these tasks could in principle be integrated into a planner or used directly as a policy. We discuss the complexity of these tasks, as well as the complexity of validating the correctness of their answers, and present validation algorithms for each task. Equipped with these validators, we test the performance of a variety of models on our tasks and find that for most tasks the performance of even the largest models is still subpar. Our experiments show that no single model consistently outperforms the others on these tasks, and with few exceptions all tested language models score below 65%, indicating that even current frontier language models have a long way to go before they can reliably reason about planning. In fact, even the so-called reasoning models struggle with these tasks. The ACPBench Hard collection is available at the following link: https://ibm.github.io/ACPBench
💡 Research Summary
The paper introduces ACPBench Hard, a generative, open‑ended extension of the previously released ACPBench benchmark. While ACPBench evaluated atomic reasoning tasks for planning using Boolean or multiple‑choice questions, real planners must generate actions from a large, unrestricted action space. To bridge this gap, the authors transform the seven core reasoning tasks of ACPBench—Applicability, Progression, Reachability, Action‑Reachability, Validation, Justification, and Landmarks—into free‑form natural‑language questions that require a language model to produce the exact answer a symbolic planner would output.
For each task the authors design a dedicated symbolic validator that can automatically score a model’s response. Applicability asks the model to list all actions whose preconditions hold in the current state. Progression requires the model to enumerate the positive (add) and negative (delete) effects of a given action. Reachability asks which specific fact can never become true in any reachable state, while Action‑Reachability asks which action can never become applicable. Validation presents a plan (a sequence of actions) that has been deliberately corrupted by inserting an inapplicable action; the model must identify the first failing step. Justification asks the model to simplify a valid plan by removing one or two consecutive actions (or to insert them if removal would break validity). Landmarks asks the model to list facts that must hold on every valid solution path to the goal.
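The state-level checks behind the Applicability and Progression tasks can be sketched in a few lines under a STRIPS-style encoding, where a state is a set of ground facts and each action carries precondition, add, and delete sets. This is an illustrative sketch, not the authors' actual validator code, and all names below are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical STRIPS-style action: precondition, add, and delete sets of ground facts.
@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset

def applicable_actions(state, actions):
    """Applicability: all actions whose preconditions hold in `state`."""
    return [a for a in actions if a.pre <= state]

def progress(state, action):
    """Progression: the successor state after applying `action`."""
    assert action.pre <= state, "action is not applicable"
    return (state - action.delete) | action.add

# Toy BlocksWorld-flavored example.
stack = Action("stack(a,b)",
               pre=frozenset({"holding(a)", "clear(b)"}),
               add=frozenset({"on(a,b)", "clear(a)", "handempty"}),
               delete=frozenset({"holding(a)", "clear(b)"}))
state = frozenset({"holding(a)", "clear(b)"})
print([a.name for a in applicable_actions(state, [stack])])  # ['stack(a,b)']
print(sorted(progress(state, stack)))  # ['clear(a)', 'handempty', 'on(a,b)']
```

A validator along these lines can score a model's free-form answer by comparing the set of actions or facts the model lists against the set computed symbolically.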
The dataset is built from the same 13 PDDL domains used in ACPBench (e.g., Logistics, BlocksWorld). For each domain, the authors automatically generate a large set of instances, extract the current state, goal, and relevant actions, and then produce the corresponding open‑ended questions. They also store the symbolic representation of each instance so that the validators can compute the ground‑truth answer. To keep the tasks tractable, they limit the number of applicable actions per state to ten for the Applicability task, and for the PSPACE‑hard Reachability and Action‑Reachability tasks they pre‑compute which facts and actions are provably unreachable using static predicates and delete‑relaxed approximations. In total, roughly ten thousand questions are generated across the seven tasks.
The authors evaluate a broad spectrum of contemporary large language models, including GPT‑4, Claude 2, Gemini‑1.5, LLaMA 2‑70B, and the “reasoning” models o1‑preview and o1‑mini. Results show that even the largest models achieve only modest accuracy, often below 65% on most tasks. Performance is especially poor on Reachability, Action‑Reachability, Landmarks, and Applicability, where models frequently hallucinate non‑existent actions or facts, or fail to recognize unreachable conditions. The next‑action and Progression tasks see relatively higher scores (up to 89% for o1‑preview), but consistency across tasks remains limited. Notably, no single model dominates across all tasks; the reasoning‑focused o1‑preview performs best on a few tasks but does not surpass general‑purpose LLMs on the majority.
Complexity analysis reveals that Applicability and Progression can be validated in polynomial time (O(|F||A|) and O(|F|) respectively), whereas Reachability and Action‑Reachability are PSPACE‑hard in the general case. The authors therefore rely on sound approximations (static predicate grounding, delete‑relaxation) to generate a tractable set of “unreachable” examples for validation. This design enables fully automated evaluation while still reflecting the intrinsic difficulty of the underlying planning problems.
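The delete-relaxed approximation mentioned above can be illustrated as a simple fixpoint computation: ignoring delete effects only enlarges the set of reachable facts, so any fact absent from the relaxed fixpoint is provably unreachable in the original problem. The following is a minimal sketch under an assumed encoding of each action as a (precondition set, add set) pair; the function name is hypothetical:

```python
def relaxed_reachable_facts(init, actions):
    """Delete-relaxed reachability: compute the least fixpoint of facts
    derivable from `init` when delete effects are ignored. Because the
    relaxation over-approximates the truly reachable facts, anything
    outside the result is a sound 'unreachable' example."""
    reachable = set(init)
    changed = True
    while changed:
        changed = False
        for pre, add in actions:  # each action: (precondition set, add set)
            if pre <= reachable and not add <= reachable:
                reachable |= add
                changed = True
    return reachable

# Toy example: fact "c" is never added by any action, so it is
# unreachable even in the relaxation.
actions = [({"a"}, {"b"}), ({"b"}, {"a"})]
print(sorted(relaxed_reachable_facts({"a"}, actions)))  # ['a', 'b']
```

The fixpoint runs in polynomial time, which is what makes it usable for generating ground-truth "unreachable" labels despite the PSPACE-hardness of exact reachability.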
In conclusion, ACPBench Hard provides a rigorous, generative benchmark that isolates the fundamental reasoning components required by symbolic planners. The extensive experimental results demonstrate that current state‑of‑the‑art language models are far from reliable enough to be integrated directly into planning pipelines or to serve as stand‑alone policies. The paper suggests future directions such as improving model architectures for relational reasoning, incorporating domain‑specific prompting, and exploring hybrid symbolic‑neural approaches to close the gap between language model capabilities and the exacting demands of automated planning.