GNNVerifier: Graph-based Verifier for LLM Task Planning
Authors: Yu Hao, Qiuyu Wang, Cheng Yang
Yu Hao, Beijing University of Posts and Telecommunications, Beijing, China (haoyuu@bupt.edu.cn)
Qiuyu Wang, Beijing University of Posts and Telecommunications, Beijing, China (autumn@bupt.edu.cn)
Cheng Yang*, Beijing University of Posts and Telecommunications, Beijing, China (yangcheng@bupt.edu.cn)
Yawen Li, Beijing University of Posts and Telecommunications, Beijing, China (warmly0716@126.com)
Zhiqiang Zhang, Ant Group, Beijing, China (lingyao.zzq@antn.com)
Chuan Shi, Beijing University of Posts and Telecommunications, Beijing, China (shichuan@bupt.edu.cn)

Abstract

Large language models (LLMs) facilitate the development of autonomous agents. As a core component of such agents, task planning aims to decompose complex natural language requests into concrete, solvable sub-tasks. Since LLM-generated plans are frequently prone to hallucinations and sensitive to long-context prompts, recent research has introduced plan verifiers to identify and correct potential flaws. However, most existing approaches still rely on an LLM as the verifier via additional prompting for plan review or self-reflection. LLM-based verifiers can be misled by plausible narration and struggle to detect failures caused by structural relations across steps, such as type mismatches, missing intermediates, or broken dependencies. To address these limitations, we propose a graph-based verifier for LLM task planning. Specifically, the proposed method has four major components. Firstly, we represent a plan as a directed graph with enriched attributes, where nodes denote sub-tasks and edges encode execution order and dependency constraints. Secondly, a graph neural network (GNN) performs structural evaluation and diagnosis, producing a graph-level plausibility score for plan acceptance as well as node/edge-level risk scores to localize erroneous regions.
Thirdly, we construct controllable perturbations from ground truth plan graphs, and automatically generate training data with fine-grained annotations. Finally, guided by the feedback from our GNN verifier, we enable an LLM to conduct local edits (e.g., tool replacement or insertion) to correct the plan when the graph-level score is insufficient. Extensive experiments across diverse datasets, backbone LLMs, and planners demonstrate that our GNNVerifier achieves significant gains in improving plan quality. Our data and code are available at https://github.com/BUPT-GAMMA/GNNVerifier.

CCS Concepts
• Computing methodologies → Planning and scheduling; Machine learning.

Keywords
Large Language Models, Graph Neural Networks, Task Planning, Verifier

* Corresponding author.

1 Introduction

Large language models (LLMs) have achieved remarkable progress in natural language understanding and complex reasoning, gradually evolving from text generators into general-purpose agents [5, 33]. In this context, task planning has become a core component of such agents: it aims to decompose a complex user request expressed in natural language into a sequence of concrete, solvable sub-tasks and to organize them in an appropriate order [2, 25, 35, 36]. Task planning has been widely adopted in practical tool-augmented agent scenarios, such as retrieval-augmented question answering [17], office workflow automation [32], and multimodal content generation and editing [34, 40].

Existing task planning methods often rely on LLMs to generate multi-step plans via prompting [25, 29, 36]. A common practice is to pack tool documentation, constraint rules, and exemplars into the context, so that the model performs task decomposition through in-context learning and outputs a step sequence and tool-calling chain [25, 36].
However, these methods typically lead to increasingly long contexts, which can dilute attention and increase hallucinations, yielding plans that look plausible but are non-executable or internally inconsistent [13, 14].

Against this background, prior work has introduced plan verifiers to improve planning reliability [10, 11, 16]. A plan verifier independently evaluates a generated plan for a given user request, determines whether it should be accepted, and localizes potential issues to trigger correction. Most existing approaches still rely on an LLM as the verifier via additional prompting for plan review or self-reflection [22, 27]. However, such methods have clear limitations. As shown in Figure 1, on the one hand, LLM-based verifiers can be misled by fluent step descriptions, mistaking plausible narration for correct execution [8, 28]. On the other hand, many failures arise from structural relations across steps (e.g., type mismatches, missing critical intermediate steps, or broken dependencies), which are difficult to detect and pinpoint by reading steps in isolation [28, 37].

Figure 1: An illustrative example of structural inconsistencies in LLM-generated plans. Although all steps are fluent and seemingly plausible, two image artifacts (poster.png and chart.png) are mistakenly input into the opposite downstream tools from their originally intended ones, which is difficult to detect by reading steps in isolation.

Based on these insights, we propose a graph-based verifier for LLM task planning. We first represent a plan as a directed plan graph [20, 21] with enriched attributes, where nodes correspond to sub-tasks and edges encode execution order and dependency constraints. Then we use a graph neural network (GNN) to conduct structural evaluation and diagnosis. Our model outputs (i) a graph-level plausibility score to decide whether to accept the
current plan, and (ii) node- and edge-level risk scores to localize erroneous regions. Moreover, since existing data lack fine-grained annotations for erroneous plan graphs, we construct controllable perturbations from ground truth plan graphs and automatically generate supervision signals. This enables the verifier to learn to distinguish correct plans from incorrect ones while marking high-risk regions on the graph. Finally, we revise the plan graph based on feedback from the GNN verifier: when the graph-level score is not sufficiently high, an LLM is allowed to perform local edits at high-risk locations based on tool replacement or insertion. Overall, our GNN verifier can leverage the structural information in the plan graph to uncover potential issues and help the LLM correct them. Experimental results show that, compared with the best baseline, our GNNVerifier achieves 2.13%/9.22%/15.96% relative improvements on node/edge/graph-level metrics, respectively.

We summarize our main contributions as follows:
• We innovatively propose to use graph-based verifiers instead of LLM-based ones to effectively identify structural issues in LLM-generated task plans.
• By modeling LLM-generated plans as directed, attributed graphs, we propose a GNN verifier to predict graph-level plausibility scores and node/edge-level risk scores. The scores are then utilized for plan correction.
• We systematically compare against state-of-the-art baselines across diverse combinations of datasets, backbone LLMs, and planners. Results show that our GNNVerifier achieves significant gains in improving plan quality.

2 Related Work

2.1 Planning with Large Language Models

Existing paradigms in LLM-based planning can generally be categorized into two primary directions: Intrinsic LLM Planning and External Planner Integration.

Intrinsic LLM Planning refers to approaches where the LLM utilizes its own knowledge and capabilities to generate sub-task sequences.
Typical methods like CoT [33], ReAct [42], and HuggingGPT [25] break down complex tasks into manageable subgoals and plan for each successively. Following this line, Adapt [23] further dynamically adjusts the planning process based on both task complexity and the LLM's inherent capabilities. Alternatively, ToT [41], GoT [4], and CoT-SC [31] explore the solution space by generating multiple candidate trajectories. In these frameworks, LLMs also function as evaluators to assess these trajectories and identify the most effective plan.

External Planner Integration adopts a symbolic component or a small neural network to handle intricate constraints, assisting the LLM for better planning performance. For example, some works [7, 9, 19] use LLMs to translate natural language problems into the Planning Domain Definition Language (PDDL), which is then solved by classical symbolic solvers. Besides, sub-tasks in planning can form a graph where nodes represent tasks and edges represent dependencies. Empirical investigation of GNN4Plan [37] indicates that planning failures can be ascribed to the LLMs' inefficacy in accurately discerning the structure of the plan graph. Consequently, GNNs can assist LLMs by effectively handling these constraints.

2.2 Verifiers for Large Language Models

Verifiers can deliver meaningful feedback to LLMs. Early research on verifiers predominantly concentrated on mathematical tasks and code generation, where tasks are objectively verifiable through unit tests or final answers. In these methods, verifiers can be categorized into Outcome Reward Models (ORMs) [6], which assess the correctness of the final answer, and Process Reward Models (PRMs) [18], which evaluate the reasoning trajectory step-by-step. Recent studies have extended the application of verifiers to more challenging domains, such as open-domain question answering and plan generation.
For instance, VersaPRM [3] introduces a general PRM trained on synthetic Chain-of-Thought (CoT) traces and counterfactual variants. Since these areas are often characterized by high subjectivity and inherent difficulties in verification [38], generative verifiers have also been employed in these fields to generate natural language critiques that assist in self-correction [22].

In the domain of planning, a few verifiers have emerged to provide modification feedback to initial plans [15]. For example, VeriPlan [16] employs model checking against LLM outputs, incorporating multiple user-control features. Another work iteratively validates sub-tasks against module descriptions, refining the plan until full consistency is achieved [11].

However, existing methods often overlook the structural dependencies within the plan graph, which are pivotal for robust planning performance [37]. To address this limitation, we propose a graph-based verifier that evaluates the initial plan by generating node-, edge-, and graph-level scores, thereby providing structure-aware feedback for plan correction.

3 Methodology

3.1 Problem Formalization

In this subsection, we first introduce LLM-based task planning, and then formalize the problem of graph-based plan verification.

Figure 2: The overall framework of our proposed GNNVerifier with four main components: (a) attributed plan graph construction, (b) graph encoding for verification, (c) perturbation-based supervision, and (d) verification-guided local correction.
Task Planning. Given a natural language user request $r$ and the tool set $\mathcal{T}$, task planning aims to decompose $r$ into a sequence of concrete, solvable sub-tasks and arrange them in an appropriate order. Following common settings in tool-augmented agents [24–26], we represent a plan as (i) a step text sequence $S_r = \langle s_1, \dots, s_m \rangle$ and (ii) an aligned tool trajectory $\tau_r = \langle t_1, \dots, t_m \rangle$ with $t_i \in \mathcal{T}$. Here, $s_i$ describes the intent of the $i$-th sub-task, and $t_i$ is the tool selected to execute that step. Many LLM-based planners [25, 26, 29, 37] generate such sequences by first decomposing the request and then selecting appropriate tools. We do not assume any specific internal mechanism of the planner, and only require that its output can be converted into the above form.

Dependency Graph. Following previous work [26, 37], we use a dependency graph to represent the tool set and their relations. Formally, the dependency graph is defined as $G_{\text{tool}} = (\mathcal{T}, \mathcal{D})$, where $\mathcal{T} = \{t_1, t_2, \dots, t_{|\mathcal{T}|}\}$ denotes the set of available tools. Each tool $t \in \mathcal{T}$ is associated with a textual description $\text{desc}(t)$ and input/output type sets $\text{in}(t)$ and $\text{out}(t)$. $\mathcal{D} \subseteq \mathcal{T} \times \mathcal{T}$ is a set of directed edges; for any $(t_u, t_v) \in \mathcal{D}$, the interface type compatibility constraint holds: $\text{out}(t_u) \cap \text{in}(t_v) \neq \emptyset$. Equivalently, $\mathcal{D}$ induces directed tool neighborhoods $\mathcal{N}^{\text{tool}}_{\text{out}}(t) = \{t' \in \mathcal{T} \mid (t, t') \in \mathcal{D}\}$ and $\mathcal{N}^{\text{tool}}_{\text{in}}(t) = \{t' \in \mathcal{T} \mid (t', t) \in \mathcal{D}\}$. This graph captures the minimal executability requirement of tool interfaces and serves as the basis for subsequent plan representations and constraint construction.

Plan Graph. Given the planning result $(S_r, \tau_r)$ for request $r$, we convert it into a directed plan graph $G_r = (V_r, E_r)$.
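A minimal sketch of this conversion from planner output to plan graph; the function and field names are hypothetical, and the virtual start-node handling follows the construction described in this section:

```python
# Convert a planner output (step texts + aligned tool trajectory + dependency
# edges) into a directed plan graph, wiring a virtual "Start" node to every
# zero-in-degree node. Tool names and the dict layout are illustrative only.
def build_plan_graph(steps, tool_traj, edges):
    """steps and tool_traj are aligned lists; edges are (i, j) index pairs."""
    nodes = {i: {"tool": t, "step": s}
             for i, (t, s) in enumerate(zip(tool_traj, steps))}
    in_deg = {i: 0 for i in nodes}
    for (_, v) in edges:
        in_deg[v] += 1
    # Connect the virtual Start node to all roots (zero in-degree nodes).
    start_edges = [("Start", i) for i in nodes if in_deg[i] == 0]
    return nodes, list(edges) + start_edges

nodes, E = build_plan_graph(
    steps=["generate a poster", "compress the poster"],
    tool_traj=["text2image", "image_compressor"],
    edges=[(0, 1)],
)
# E == [(0, 1), ("Start", 0)]: only node 0 is a root, so Start feeds it.
```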
Each node $v_i \in V_r$ corresponds to a tool invocation instance and is denoted as $v_i = (t_i, s_i)$, where $t_i \in \mathcal{T}$ is the invoked tool and $s_i$ is the natural language description of the step. A directed edge $(u, v) \in E_r$ encodes the execution order and dependency constraints between tool invocations. To uniformly handle missing prerequisite steps, we introduce a virtual start node Start and connect it to all tool nodes with zero in-degree.

Graph-based Plan Verification. A graph-based plan verifier assesses the feasibility and plausibility of a plan produced by a planner under a given user request, and outputs feedback signals that can be used to improve the plan. Formally, given request $r$ and its plan graph $G_r$, a verifier can be abstracted as a mapping

$\mathcal{V}: (r, G_r) \mapsto o_r$,  (1)

where $o_r$ denotes the verification signals. These signals may be a global quality assessment, optionally augmented with localized diagnostic feedback (e.g., indicating potentially problematic steps/dependencies), or natural language critiques and suggested edits. The verifier output is typically used as conditional input to a subsequent corrector (e.g., an LLM-based corrector) to produce an improved plan $G'_r$.

3.2 Framework Overview

As shown in Figure 2, our method learns a graph-based verifier with four components: (1) Attributed plan graph construction converts the planner output into a directed plan graph with enriched node/edge features; (2) Graph encoding for verification applies a GNN to predict a graph-level plausibility score and node/edge-level risks; (3) Perturbation-based supervision perturbs ground truth graphs to derive graph/node/edge-level training signals; (4) Verification-guided local correction constrains an LLM to perform replacements/insertions on high-risk regions for correction.

3.3 Attributed Plan Graph Construction

In this subsection, we design rich node and edge features to construct attributed plan graphs for better verification.
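The feature construction that follows repeatedly consults the dependency graph $G_{\text{tool}}$ from Section 3.1. Its interface-compatibility rule can be sketched as below; the tool names and I/O type sets are hypothetical, for illustration only:

```python
# Sketch of the tool dependency graph G_tool = (T, D): a directed edge
# (t_u, t_v) exists iff out(t_u) ∩ in(t_v) ≠ ∅.
tools = {
    "text2image": {"in": {"text"}, "out": {"image"}},
    "image_edit": {"in": {"image"}, "out": {"image"}},
    "captioner":  {"in": {"image"}, "out": {"text"}},
}

def type_compatible(t_u, t_v):
    """Interface type compatibility constraint from Section 3.1."""
    return bool(tools[t_u]["out"] & tools[t_v]["in"])

# Build the edge set D and the directed neighborhoods N_out / N_in.
D = {(u, v) for u in tools for v in tools if u != v and type_compatible(u, v)}
N_out = {t: {v for (u, v) in D if u == t} for t in tools}
N_in = {t: {u for (u, v) in D if v == t} for t in tools}
```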
3.3.1 Preprocessing. To enrich the features of plan graph $G_r = (V_r, E_r)$, we precompute two terms from the dependency graph $G_{\text{tool}} = (\mathcal{T}, \mathcal{D})$ and the training data: (i) for each tool $t$, we encode its description $\text{desc}(t)$ into a semantic embedding $\mathbf{e}(t) = \text{Enc}(\text{desc}(t))$, and compute the Top-$K$ neighborhood $\mathcal{N}_K(t) \subset \mathcal{T}$ ranked by cosine similarity; (ii) for any length-$n$ path $(t_1 \to t_2 \to \cdots \to t_n)$, we denote its occurrence count in the training set as $f_n(t_1, \dots, t_n)$ with $n \in \{2, 3, 4\}$.

3.3.2 Node Features. Each node is a tool invocation instance $v_i = (t_i, s_i)$, where $t_i \in \mathcal{T}$ is the selected tool and $s_i$ is the step text. We construct node features by integrating: (i) tool semantics $\mathbf{e}(t_i) = \text{Enc}(\text{desc}(t_i))$ and step semantics $\mathbf{e}(s_i) = \text{Enc}(s_i)$; (ii) multi-hot encodings of tool I/O types $\mathbf{x}_{\text{in}}(t_i)$ and $\mathbf{x}_{\text{out}}(t_i)$; and (iii) a lightweight step–tool alignment scorer $g(s, t)$ to measure intent–tool consistency. Concretely, for each ground truth node $v_i = (t_i, s_i)$ in the training data, we build a candidate set $\{t_i\} \cup \mathcal{N}_K(t_i)$, treat $t_i$ as the positive and the remaining candidates as negatives, and compute a logit via

$g(s_i, t) = \text{MLP}_{\text{align}}([\mathbf{e}(s_i); \mathbf{e}(t)])$.  (2)

We train $g$ with a softmax cross-entropy loss over $\{t_i\} \cup \mathcal{N}_K(t_i)$, encouraging fine-grained discrimination among similar tools. We then define the step–tool alignment feature

$\Delta_i = g(s_i, t_i) - \max_{t' \in \mathcal{N}_K(t_i)} g(s_i, t')$.  (3)

A smaller $\Delta_i$ indicates that the step text is harder to distinguish among similar tools. We concatenate these features and obtain the node embedding via an MLP (multi-layer perceptron), i.e., $\mathbf{x}_{v_i} = \text{MLP}_{\text{node}}(\cdot)$. The Start node is represented with a learnable embedding.

3.3.3 Edge Features. For each directed edge $(u, v) \in E_r$, we construct three features.

I/O compatibility.
Based on type constraints from $G_{\text{tool}}$, we define

$\text{compat}(u, v) = \dfrac{|\text{out}(t_u) \cap \text{in}(t_v)|}{\max(|\text{in}(t_v)|, 1)}$,  (4)

which measures whether two adjacent tools are connectable at the interface level and how strong the connection is.

Pairwise co-occurrence. Using length-2 statistics, we set the transition strength as $\log(1 + f_2(t_u, t_v))$, indicating whether this adjacency is common.

Multi-step relationship. To reflect whether transitioning from $u$ to $v$ typically requires intermediate tools, we define

$m(u, v) = \max_{\pi: t_u \to \cdots \to t_v,\, |\pi| = n} \log(1 + f_n(\pi)), \quad n \in \{3, 4\}$,  (5)

where $\pi$ is a length-$n$ directed tool sequence from $t_u$ to $t_v$. A larger $m(u, v)$ suggests that $t_u \to t_v$ may correspond to an unreliable shortcut (e.g., over-compression or missing steps). We concatenate these features and obtain the edge embedding via an MLP, i.e., $\mathbf{x}_{uv} = \text{MLP}_{\text{edge}}(\cdot)$. For the virtual edges from Start, we use a learnable edge embedding.

3.4 Graph Encoding for Verification

In this subsection, we present how to encode the attributed plan graph for graph/node/edge-level scoring.

3.4.1 GNN-based Representation. We encode the plan graph with a directed, edge-aware GNN conditioned on the request embedding $\mathbf{e}(r) = \text{Enc}(r)$. Let $\mathbf{h}^{(0)}_v = \mathbf{x}_v$; at layer $\ell$, we aggregate messages from incoming and outgoing neighbors separately:

$\mathbf{m}^{(\ell)}_{v,\text{in}} = \sum_{u \in \mathcal{N}_{\text{in}}(v)} \phi^{(\ell)}_{\text{in}}([\mathbf{h}^{(\ell)}_u; \mathbf{x}_{uv}; \mathbf{e}(r)]), \quad \mathbf{m}^{(\ell)}_{v,\text{out}} = \sum_{w \in \mathcal{N}_{\text{out}}(v)} \phi^{(\ell)}_{\text{out}}([\mathbf{h}^{(\ell)}_w; \mathbf{x}_{vw}; \mathbf{e}(r)])$,  (6)

and update node representations by

$\mathbf{h}^{(\ell+1)}_v = \text{MLP}^{(\ell)}\big((1 + \epsilon^{(\ell)})\, \mathbf{h}^{(\ell)}_v + \mathbf{m}^{(\ell)}_{v,\text{in}} + \mathbf{m}^{(\ell)}_{v,\text{out}}\big)$,  (7)

where $\phi^{(\ell)}$ is a small MLP and $\epsilon^{(\ell)}$ is learnable. After $L$ layers, we obtain final node representations $\{\mathbf{h}_v\}_{v \in V_r}$ and a graph representation $\mathbf{h}_G = \text{READOUT}(\{\mathbf{h}_v\}_{v \in V_r})$.

3.4.2 Verifier Outputs.
Given request $r$ and plan graph $G_r = (V_r, E_r)$, a graph-based plan verifier is expected to provide both a global quality assessment and localized risk diagnosis to support subsequent local correction. We let the verifier output three probabilistic signals: a graph-level score $S_r \in (0, 1)$ measuring the overall plausibility of the plan under request $r$; node-level risks $\mathbf{p}^V_r = \{P^V_r(v)\}_{v \in V_r}$ indicating the probability that a particular step selects an incorrect tool; and edge-level risks $\mathbf{p}^E_r = \{P^E_r(u, v)\}_{(u,v) \in E_r}$ indicating the probability that two adjacent steps are unreliably connected (e.g., missing intermediate steps). Concretely, we compute logits with three prediction heads $f_g$, $f_v$, $f_e$ (lightweight MLPs) and map them to probabilities via the sigmoid function $\sigma(\cdot)$:

$z(G_r) = f_g(\mathbf{h}_G), \quad S_r = \sigma(z(G_r))$,
$z^V_r(v) = f_v(\mathbf{h}_v), \quad P^V_r(v) = \sigma(z^V_r(v))$,
$z^E_r(u, v) = f_e([\mathbf{h}_u; \mathbf{h}_v; \mathbf{x}_{uv}]), \quad P^E_r(u, v) = \sigma(z^E_r(u, v))$.  (8)

3.5 Perturbation-based Supervision

In this subsection, we perturb the ground truth plan graphs in the training data to generate supervision signals.

3.5.1 Perturbation Operators. Since real planning data typically lacks fine-grained annotations for incorrect plans and error locations, we construct controllable perturbations from the ground truth plan graph $G^{\text{gt}}_r$ to obtain perturbed graphs $G^{\text{pert}}_r$ with an operation log $\text{ops}(G^{\text{pert}}_r)$. We consider two common failure modes.

Wrong Tool. We select a node $v_i = (t_i, s_i)$ and REPLACE its tool with $t'_i$. The substitute is sampled preferentially from the similar tool neighborhood $\mathcal{N}_K(t_i)$ to form hard negatives. To avoid trivially infeasible samples, we require the replaced tool to satisfy interface connectivity with adjacent nodes (otherwise resample), while keeping $s_i$ unchanged.

Missing Step.
We simulate missing intermediate steps and over-simplification by operating on a consecutive directed sub-chain $t_u \to t_{u+1} \to \cdots \to t_v$ in $G^{\text{gt}}_r$, using two forms:

DROP(span) deletes at least one intermediate node and directly connects the endpoint nodes to form $t_u \to t_v$. We require the newly created shortcut edge(s) to satisfy interface type connectivity; otherwise we resample the span, so that the perturbed plan remains type-executable while semantically missing steps.

COMPRESS(span → 1) replaces the intermediate nodes with a single tool node $t^*$, yielding $t_u \to t^* \to t_v$, where $t^*$ represents an over-generalized or improper merge of the original span. To keep the perturbation executable, we require both endpoint edges to satisfy interface connectivity; otherwise we resample $t^*$ or reselect the span.

When sampling spans, we leverage tool sequence statistics $f_n(\cdot)$ to prioritize locally plausible paths, improving the realism and difficulty of perturbations. The two operators can be applied independently or composed multiple times on the same $G^{\text{gt}}_r$, yielding candidates with different severities and error locations. Detailed perturbation specifications are provided in Appendix A.1.

3.5.2 Supervision Signals. We derive graph-level soft targets and local targets automatically from perturbation logs.

Graph-level target. For each perturbed graph $G^{\text{pert}}_r$, we define a non-negative perturbation cost $c(G^{\text{pert}}_r) = \sum_{o \in \text{ops}(G^{\text{pert}}_r)} \eta(o)$, where $\eta(o)$ is an operation-type-dependent penalty that reflects the strength of the applied perturbation; a larger cost indicates more severe corruption. The detailed computation of $\eta(o)$ is provided in Appendix A.2. We then map the cost to a graph-level soft target:

$y(G^{\text{pert}}_r) = \exp\left(-\dfrac{c(G^{\text{pert}}_r)}{\tau}\right) \in (0, 1]$,  (9)

where $\tau$ is a hyperparameter. For the ground truth plan graph $G^{\text{gt}}_r$, we set $c(G^{\text{gt}}_r) = 0$ and thus $y(G^{\text{gt}}_r) = 1$.

Node-level target.
We set $\ell^{\text{node}}_v = 1$ if node $v$ is replaced by REPLACE or introduced as the compressed node $t^*$ in COMPRESS; otherwise $\ell^{\text{node}}_v = 0$.

Edge-level target. We set $\ell^{\text{edge}}_{uv} = 1$ for shortcut edges created by DROP and edges created by COMPRESS; otherwise $\ell^{\text{edge}}_{uv} = 0$.

3.5.3 Two-stage Training. We adopt a two-stage training strategy: Stage I learns graph-level scoring, and Stage II learns fine-grained local diagnosis. The detailed training losses are described in Appendix A.3.

Stage I: Graph-level training. In the first stage, we train the graph encoder and graph-level prediction head to produce a continuous score for overall plan quality. Since we have a set of perturbed graphs generated for the same request $r$, we employ a margin-based ranking loss $\mathcal{L}_{\text{rank}}$. We further integrate it with the typical binary cross-entropy loss $\mathcal{L}_{\text{graph}}$ as $\mathcal{L}_{\text{stage1}} = \mathcal{L}_{\text{rank}} + \lambda_{\text{graph}} \mathcal{L}_{\text{graph}}$.

Stage II: Node/Edge-level training. In Stage II, we freeze most encoder parameters and fine-tune only the last GNN layer, as well as the node/edge-level prediction heads. Here we combine the binary cross-entropy losses for nodes and edges as $\mathcal{L}_{\text{stage2}} = \mathcal{L}_{\text{node}} + \lambda_{\text{edge}} \mathcal{L}_{\text{edge}}$.

3.6 Verification-guided Local Correction

At inference time, we improve the original plan based on the graph-level score $S_r$, node-level risks $\mathbf{p}^V_r$, and edge-level risks $\mathbf{p}^E_r$.

3.6.1 Decision and Editable Region. We apply local correction to the original plan if and only if $S_r < \tau_G$. When local correction is triggered, we threshold the continuous risks to obtain editable regions:

$\mathcal{V}_{\text{edit}} = \{v \in V_r : P^V_r(v) \geq \tau_V\}, \quad \mathcal{E}_{\text{edit}} = \{(u, v) \in E_r : P^E_r(u, v) \geq \tau_E\}$,  (10)

where $\tau_G, \tau_V, \tau_E \in (0, 1)$ are the graph-level acceptance threshold, node-risk threshold, and edge-risk threshold, respectively.
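The decision rule and Eq. (10) can be sketched as follows; the threshold values and risk dictionaries here are hypothetical, for illustration only:

```python
# Correction is triggered only when the graph score falls below tau_G; the
# continuous node/edge risks are then thresholded into editable sets.
def editable_regions(S_r, p_V, p_E, tau_G, tau_V, tau_E):
    if S_r >= tau_G:
        return None  # plan accepted as-is, no correction triggered
    V_edit = {v for v, p in p_V.items() if p >= tau_V}
    E_edit = {e for e, p in p_E.items() if p >= tau_E}
    return V_edit, E_edit

regions = editable_regions(
    S_r=0.42,
    p_V={"A": 0.9, "B": 0.2},
    p_E={("A", "B"): 0.8},
    tau_G=0.6, tau_V=0.5, tau_E=0.5,
)
# regions == ({"A"}, {("A", "B")}): only node A and edge A->B are editable
```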
$\mathcal{V}_{\text{edit}}$ indicates positions that may contain wrong tool selections, while $\mathcal{E}_{\text{edit}}$ indicates potentially unreliable transitions (e.g., missing intermediate steps or excessive compression).

3.6.2 LLM-guided Correction. We design a constrained editing prompt and feed it back to the LLM planner for correction. The prompt includes $r$, $G_r$, $S_r$, the editable sets $\mathcal{V}_{\text{edit}}$ and $\mathcal{E}_{\text{edit}}$, the local risks $\mathbf{p}^V_r$ and $\mathbf{p}^E_r$, as well as the candidate tools $\{C_{\text{rep}}(v)\}$ for node replacement and $\{C_{\text{ins}}(u, v)\}$ for edge insertion. We present the details about candidate tools in Appendix B.1. The LLM output is restricted to a sequence of at most $K_{\max}$ edits, where each edit must be chosen from a predefined set of operations:

• replace_on_node(node_id, candidate_id, step): for a high-risk node $v_i \in \mathcal{V}_{\text{edit}}$, select $t \in C_{\text{rep}}(v_i)$ to replace the tool and generate the updated step text step;
• insert_on_edge(edge_id, candidate_id, step): for a high-risk edge $(u, v) \in \mathcal{E}_{\text{edit}}$, select $t \in C_{\text{ins}}(u, v)$ to insert a new node between the endpoints, rewire the graph accordingly, and generate the inserted step text step;
• no_change(): perform no modification.

Then we apply the edits accordingly to obtain a corrected plan $G'_r$. We accept $G'_r$ if its graph-level score is higher than that of the original $G_r$; otherwise, we keep the plan unchanged.

4 Experiments

We conduct extensive experiments to answer the following research questions (RQs): RQ1: How does the proposed GNNVerifier improve plan quality compared to state-of-the-art baselines across diverse datasets and planners? RQ2: How do the key designs contribute individually to the overall effectiveness? RQ3: How does the verification-guided correction reduce each type of plan error across different datasets? RQ4: How do the learned graph-, node-, and edge-level embedding spaces separate incorrect samples from correct ones?
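As a concrete illustration of the edit operations defined in Section 3.6.2, a simplified sketch of applying a sequence of edits to a plan graph; the dict-based representation and tool names are hypothetical, and the LLM's step-text generation is omitted:

```python
# Apply replace_on_node / insert_on_edge edits to a (nodes, edges) plan graph.
def apply_edits(nodes, edges, edits):
    nodes = {k: dict(v) for k, v in nodes.items()}  # copy, don't mutate input
    edges = set(edges)
    for op in edits:
        if op["kind"] == "replace_on_node":
            nodes[op["node_id"]].update(tool=op["candidate"], step=op["step"])
        elif op["kind"] == "insert_on_edge":
            u, v = op["edge_id"]
            new_id = max(nodes) + 1  # assumes integer node ids
            nodes[new_id] = {"tool": op["candidate"], "step": op["step"]}
            edges.discard((u, v))  # rewire u -> new -> v
            edges |= {(u, new_id), (new_id, v)}
        # "no_change" edits are simply skipped
    return nodes, edges

plan_nodes = {0: {"tool": "text2image", "step": "generate a poster"},
              1: {"tool": "image_compressor", "step": "compress the poster"}}
new_nodes, new_edges = apply_edits(
    plan_nodes, {(0, 1)},
    [{"kind": "insert_on_edge", "edge_id": (0, 1),
      "candidate": "image_resize", "step": "resize before compressing"}],
)
# new_edges == {(0, 2), (2, 1)}: node 2 is the inserted image_resize step
```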
Besides, we present two additional experiments in the appendix: hyperparameter analysis (Appendix D.5) and case study (Appendix D.6).

4.1 Experimental Setup

Datasets. We evaluate on two task-planning benchmarks: TaskBench [26] and UltraTool [12]. TaskBench contains three datasets that require multi-step tool invocations: HuggingFace, which focuses on planning for AI/ML workflows over HuggingFace models (e.g., model retrieval, selection, and pipeline composition); Multimedia, which covers user-centric multimedia tasks such as file downloading, editing, and format conversion; and DailyLife, which includes everyday service APIs such as web search, shopping, and information retrieval. To assess scalability on larger dependency graphs, we additionally conduct the same experiments on UltraTool. All datasets provide a user request together with a ground truth plan graph (tool nodes and dependency links). We follow the data preprocessing and format conventions in prior work (e.g., GNN4Plan [37]) to ensure fair comparison. Detailed dataset descriptions are provided in Appendix C.

Table 1: Performance comparison across four datasets on GPT-4o: Node-F1 (n-F1), Link-F1 (l-F1), and Accuracy (Acc) are reported in %. The best results are highlighted in boldface, and the second-best results are underlined.

Method          | HuggingFace       | Multimedia        | DailyLife         | UltraTool
                | n-F1  l-F1  Acc   | n-F1  l-F1  Acc   | n-F1  l-F1  Acc   | n-F1  l-F1  Acc
Direct
  Raw           | 79.60 55.27 34.20 | 85.94 64.36 49.60 | 97.12 84.21 72.80 | 73.07 46.36 36.80
  +Refine       | 78.39 52.86 31.20 | 87.02 65.98 48.40 | 96.93 60.12 48.20 | 73.30 38.92 30.80
  +VeriCoder    | 79.78 55.45 34.00 | 86.69 66.17 50.70 | 97.44 56.81 46.80 | 73.86 34.11 26.45
  +VeriPlan     | 79.64 55.95 34.80 | 86.67 66.14 51.60 | 97.12 84.22 72.80 | 78.26 53.89 42.40
  +GNNVerifier  | 82.82 60.71 43.80 | 88.09 70.74 58.60 | 97.51 87.45 78.76 | 76.89 52.82 42.80
ReAct
  Raw           | 79.91 53.35 31.80 | 86.50 64.53 48.80 | 96.59 57.01 45.20 | 73.89 39.35 32.40
  +Refine       | 78.19 52.81 30.80 | 86.61 65.60 47.80 | 96.25 51.89 38.08 | 74.59 36.34 29.80
  +VeriCoder    | 80.94 58.31 38.20 | 90.37 73.67 55.80 | 97.34 36.09 20.08 | 73.43 28.51 21.44
  +VeriPlan     | 80.16 54.04 33.40 | 87.45 68.52 52.00 | 96.62 57.15 45.20 | 77.91 46.31 34.47
  +GNNVerifier  | 82.40 60.66 43.60 | 90.46 73.73 59.00 | 96.59 85.83 76.35 | 77.91 52.06 44.20
GNN4Plan
  Raw           | 78.68 57.99 41.20 | 84.91 69.41 57.00 | 97.25 87.51 78.80 | 71.68 46.99 37.80
  +Refine       | 77.52 51.92 32.40 | 87.64 69.82 54.20 | 96.63 53.46 43.09 | 75.60 42.59 34.61
  +VeriCoder    | 80.01 56.33 38.55 | 87.79 70.40 56.20 | 97.65 50.37 42.48 | 75.17 36.31 27.25
  +VeriPlan     | 78.68 57.99 41.20 | 84.91 69.41 57.00 | 97.25 87.51 78.80 | 71.68 46.99 37.80
  +GNNVerifier  | 82.69 60.79 43.80 | 88.96 71.24 60.60 | 97.25 87.61 78.98 | 76.44 53.46 43.80

Evaluation. For both benchmarks, we follow the public split used in GNN4Plan [37]: 3000 instances for training and 500 instances for testing on each dataset. We further hold out 10% of the training set as a validation split for hyperparameter and threshold selection. We adopt the standard metrics from previous work [26, 37]: Node F1-score (n-F1), which measures the accuracy of invoked tasks (tool nodes); Link F1-score (l-F1), which measures the accuracy of invoked dependencies (directed links); and Accuracy (Acc), which measures the task-level success rate by checking whether the predicted tasks and dependencies exactly match the ground truth. All experiments are conducted three times, and the results are reported as the average value.

Backbone LLMs. We report the main results with two backbone LLMs, GPT-4o [1] and Qwen3-235B-A22B-Instruct-2507 [39]. To ensure a fair comparison, within each experimental setting we use the same backbone LLM consistently across all modules, including the planner and the correction module. All additional experiments and analyses are conducted under GPT-4o.

Planners.
We apply our verifier to three planning methods: Direct [26], where the LLM generates a complete plan in a single shot given the user request and tool descriptions; ReAct [42], where the LLM alternates between reasoning and acting to incrementally construct the plan; and GNN4Plan [37], which uses a GNN to guide plan construction by selecting tool nodes and tool-invocation edges.

Baselines. On top of each planner, we compare several plan-correction strategies: Refine [22], an LLM-only self-refinement that revises the initial plan without explicit diagnostic signals; VeriCoder [11], which employs an LLM-based plan verification assistant to check whether the decomposed sub-tasks are consistent with the user specification and to provide actionable suggestions when inconsistencies are found; and VeriPlan [16], which couples an LLM with an external verifier that checks whether a plan satisfies explicit constraints and uses the detected violations to guide a correction step. In our implementation, the verifier checks edges against the dependency graph specification and prompts the LLM to correct invalid dependencies. For a controlled comparison, all verifiers perform exactly one revision step.

Implementation Details. All experiments are conducted on 8× NVIDIA A800-40G GPUs. We use E5-large [30] as the text encoder to obtain unified semantic representations for the user request, step texts, and tool descriptions. For each tool $t$, we precompute its semantic neighborhood $\mathcal{N}_K(t)$ by cosine similarity with $K = 10$. Our verifier uses a 3-layer GNN. The GNN and all MLPs use a hidden dimension of 1024, with ReLU activations and a dropout rate of 0.1. Specifically, the feature concatenations (node/edge) and the step–tool alignment module are implemented as single-layer MLPs, while the GNN message/update functions $\phi^{(\ell)}_{\text{in/out}}$ and the update MLP in Eq. (7) use two-layer MLPs.
For the step–tool alignment module, early stopping is configured with a patience of 5 based on the training loss. The graph representation is obtained by mean pooling. We train the GNN verifier with a learning rate of 2 × 10^-5 and a batch size of 512.

Table 2: Ablation studies on HuggingFace and Multimedia: Node-F1, Link-F1, and Accuracy are reported in %.

                        HuggingFace                Multimedia
Variant            n-F1↑  l-F1↑  Acc↑        n-F1↑  l-F1↑  Acc↑
Direct
  Raw             79.60  55.27  34.20       85.94  64.36  49.60
  w/o GNN         75.78  51.77  33.00       84.40  66.35  50.20
  w/o Stage-II    79.92  55.51  35.40       86.03  66.05  50.40
  w/o Stage-I     79.60  57.86  39.80       85.94  68.97  56.40
  w/o Node Feat.  80.36  57.94  35.60       86.01  64.53  50.00
  w/o Edge Feat.  80.53  56.48  37.20       86.44  66.61  51.80
  w/o Graph FB    80.19  55.97  35.60       84.14  64.47  49.00
  w/o Node FB     80.45  58.06  34.80       86.03  64.70  50.00
  w/o Edge FB     80.36  56.21  34.80       86.30  67.59  52.20
  Full            82.82  60.71  43.80       88.09  70.74  58.60
ReAct
  Raw             79.91  53.35  31.80       86.50  64.53  48.80
  w/o GNN         75.58  52.59  31.00       84.92  65.38  46.80
  w/o Stage-II    80.21  54.12  35.20       87.73  65.28  48.80
  w/o Stage-I     79.91  56.50  41.00       86.50  69.25  56.40
  w/o Node Feat.  80.12  54.40  36.40       87.15  64.83  49.00
  w/o Edge Feat.  81.23  54.04  37.60       87.61  67.36  51.00
  w/o Graph FB    77.29  51.24  32.80       84.87  65.98  49.20
  w/o Node FB     79.59  50.77  31.20       87.35  65.82  49.20
  w/o Edge FB     79.56  52.34  30.80       88.67  67.42  51.60
  Full            82.40  60.66  43.60       90.46  73.73  59.00
GNN4Plan
  Raw             78.68  57.99  41.20       84.91  69.41  57.00
  w/o GNN         76.91  52.70  33.00       83.39  66.66  50.00
  w/o Stage-II    79.41  55.26  39.20       85.66  70.04  58.40
  w/o Stage-I     79.60  57.86  39.80       85.91  69.41  57.00
  w/o Node Feat.  79.22  59.12  42.00       85.37  70.03  57.00
  w/o Edge Feat.  80.70  58.38  42.60       86.19  69.84  58.20
  w/o Graph FB    77.07  56.04  41.20       84.96  69.45  57.00
  w/o Node FB     80.33  59.01  42.00       84.96  70.45  57.60
  w/o Edge FB     80.33  59.01  42.00       86.87  69.41  57.60
  Full            82.69  60.79  43.80       88.96  71.24  60.60
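The verifier architecture described above can be illustrated with a toy numpy sketch (our own simplification, not the released implementation): directed message passing with separate in/out transforms, a mean-pooled graph readout, and sigmoid heads for graph-, node-, and edge-level scores. Dimensions are shrunk from 1024 and all weights are random, so this shows the dataflow only.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyGNNVerifier:
    """Toy sketch: 3 rounds of directed message passing over the plan graph,
    mean-pooling readout, and per-graph/node/edge scoring heads."""

    def __init__(self, dim=8, layers=3):
        self.layers = layers
        # separate weights for messages along incoming vs. outgoing edges
        self.W_in = [rng.normal(0, 0.1, (dim, dim)) for _ in range(layers)]
        self.W_out = [rng.normal(0, 0.1, (dim, dim)) for _ in range(layers)]
        self.W_upd = [rng.normal(0, 0.1, (3 * dim, dim)) for _ in range(layers)]
        self.w_graph = rng.normal(0, 0.1, dim)      # graph-level scoring head
        self.w_node = rng.normal(0, 0.1, dim)       # node-level risk head
        self.w_edge = rng.normal(0, 0.1, 2 * dim)   # edge-level risk head

    def forward(self, h, edges):
        for l in range(self.layers):
            m_in = np.zeros_like(h)    # aggregated messages from predecessors
            m_out = np.zeros_like(h)   # aggregated messages from successors
            for u, v in edges:
                m_in[v] += relu(h[u] @ self.W_in[l])
                m_out[u] += relu(h[v] @ self.W_out[l])
            h = relu(np.concatenate([h, m_in, m_out], axis=1) @ self.W_upd[l])
        g = h.mean(axis=0)                       # mean-pooling graph readout
        graph_score = sigmoid(g @ self.w_graph)
        node_risk = sigmoid(h @ self.w_node)
        edge_risk = {(u, v): sigmoid(np.concatenate([h[u], h[v]]) @ self.w_edge)
                     for u, v in edges}
        return graph_score, node_risk, edge_risk

# a 3-step plan 0 -> 1 -> 2 with random node features
feats = rng.normal(size=(3, 8))
score, node_risk, edge_risk = TinyGNNVerifier().forward(feats, [(0, 1), (1, 2)])
```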
The correction-specific thresholds are provided in Appendix B.2, while hyperparameters related to loss weighting and soft targets are tuned on the validation set and analyzed in Appendix D.5.

4.2 Main Results (RQ1)

Table 1 reports the performance of different planner–verifier combinations on four datasets with GPT-4o. The counterpart results with Qwen3-235B-A22B-Instruct-2507 are provided in Appendix D.1. Overall, our graph-based verifier improves the node-level and edge-level planning metrics (n-F1 and l-F1) in most settings, and consistently achieves the best task-level accuracy.

Broad generalization across datasets. Averaged over the three planners and compared to the best baseline, VeriPlan, our method achieves relative improvements in n-F1/l-F1/Acc of 3.95%/8.44%/19.93% on HuggingFace, 3.27%/5.70%/10.96% on Multimedia, 0.12%/13.99%/18.95% on DailyLife, and 1.49%/7.58%/14.07% on UltraTool. The improvements hold across datasets with substantially different tool inventories and task distributions, indicating strong generalization. Notably, the n-F1 gain on DailyLife is small because the baseline node predictions are already near-saturated (n-F1 is typically above 96% across planners in Table 1), leaving limited room for further improvement.

Robust improvements across planners. Averaged over the four datasets and compared to the best baseline, VeriPlan, our method yields relative improvements in n-F1/l-F1/Acc of 1.06%/4.43%/11.09% for Direct, 1.53%/20.47%/35.19% for ReAct, and 3.86%/4.28%/5.76% for GNN4Plan. The consistent gains across planners reflect improved robustness to the heterogeneous error patterns produced by different planners.

Stronger performance on the task-level metric. The n-F1 and l-F1 metrics quantify local matching quality for nodes and edges, whereas Acc is a task-level success rate that requires the predicted plan graph to be correct as a whole (both nodes and links).
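Read literally, the three metrics can be computed set-wise from predicted and ground-truth nodes and links. The sketch below reflects our reading of these definitions; the benchmark's exact implementation (averaging across samples, handling of duplicates or empty graphs) may differ.

```python
def f1(pred, gold):
    """Set-based F1 between predicted and ground-truth items."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def evaluate(pred_nodes, pred_links, gold_nodes, gold_links):
    return {
        "n-F1": f1(pred_nodes, gold_nodes),   # invoked tool nodes
        "l-F1": f1(pred_links, gold_links),   # invoked directed links
        # Acc: the whole plan graph must match exactly
        "Acc": float(set(pred_nodes) == set(gold_nodes)
                     and set(pred_links) == set(gold_links)),
    }

# one of two nodes matches, the link does not, so the plan fails task-level Acc
m = evaluate(["asr", "translate"], [("asr", "translate")],
             ["asr", "summarize"], [("asr", "summarize")])
```

This makes concrete why Acc is the strictest of the three: a single wrong node or link zeroes it while the F1 scores degrade gracefully.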
By modeling plans as attributed graphs and performing message passing over execution edges, our verifier better captures local context and structural consistency, leading to more reliable task accuracy across all datasets and planners.

4.3 Ablation Studies (RQ2)

We conduct ablation studies on all four datasets. Table 2 reports results on HuggingFace and Multimedia, while results on DailyLife and UltraTool are deferred to Appendix D.2. We compare our method with eight ablated variants to isolate the effects of (i) the structural modeling and training objectives of the GNN verifier, (ii) the proposed advanced node/edge attributes, and (iii) the multi-granularity feedback used to guide LLM correction.

Necessity of graph-structured modeling. To verify that the improvements come from structure-aware reasoning rather than merely richer features, we replace the GNN verifier with an MLP (w/o GNN), thereby removing message passing on the plan graph. Notably, this ablation is not only substantially worse than the full model, but is also worse than having no verifier at all (Raw) in several settings. For instance, under the GNN4Plan planner, Acc drops from 41.20% (Raw) to 33.00% on HuggingFace and from 57.00% to 50.00% on Multimedia. This indicates that without relational modeling, verifier feedback can become harmful: an MLP tends to score nodes/edges largely independently and cannot aggregate contextual evidence along execution dependencies, making its predictions less precise and less globally consistent.

Effectiveness of two-stage training. We ablate the two-stage training scheme by removing Stage II (w/o Stage-II) or Stage I (w/o Stage-I), testing whether graph-level quality learning and local error localization can substitute for each other.
The full model consistently performs best, indicating that the two stages provide complementary signals: Stage I provides a global quality signal to assess overall plan soundness, while Stage II provides precise local diagnostics that enable targeted and effective edits.

Gains from advanced node/edge attributes. To verify our feature design, we keep only the basic features and ablate the advanced ones separately: (i) w/o node feat., which removes the step semantics and step–tool matching features; and (ii) w/o edge feat., which removes the co-occurrence statistics. Results show that both ablations consistently underperform the full model across all planners on both HuggingFace and Multimedia, suggesting a general benefit rather than a dataset-specific artifact.

Feedback signals for LLM-guided correction. We ablate the feedback provided to the LLM during local correction by removing graph-level scoring feedback (w/o graph FB), node-level risk feedback (w/o node FB), or edge-level risk feedback (w/o edge FB), together with their associated prompts, to test whether correction requires multi-granularity diagnostics rather than a single signal. Results show that the three feedback signals are complementary: graph feedback provides global calibration, node feedback supports tool replacement, and edge feedback supports dependency correction; using all of them yields the most stable improvements.

Figure 3: Analysis of planning error types before and after correction across four datasets ((a) HuggingFace, (b) Multimedia, (c) DailyLife, (d) UltraTool).

Figure 4: The t-SNE visualization of (a) graph-, (b) node-, and (c) edge-level embeddings on UltraTool.

4.4 Error Analysis (RQ3)

Figure 3 presents the distribution of five error types with the Direct planner, comparing the raw plans (Before) and the plans after correction by our method (After).
We define the error types as: (1) Wrong Tool (a step selects an incorrect tool); (2) Missing Tool (the plan omits one or more necessary intermediate steps); (3) Dependency Error (a predicted edge is I/O-incompatible and thus does not exist in the dependency graph); (4) Edge Fail (the edge exists in the dependency graph but violates the user request); and (5) Other (e.g., empty outputs or redundant tools).

Across the four datasets, the raw plans are often coupled with local structural defects along execution dependencies. After applying our method, the error counts decrease broadly across types, with the most visible drops in Wrong Tool and Missing Tool, indicating that risk-guided local editing can effectively replace confused tools and insert missing intermediate steps. Dependency Error also decreases, reflecting that the type constraints of the dependency graph provide a reliable guardrail during candidate generation and correction. Edge Fail decreases as well, suggesting that the verifier can detect and correct type-compatible yet incorrect transitions by leveraging graph context rather than relying on interface compatibility alone. More detailed results are provided in Appendix D.3.

4.5 Visualization of GNN Embeddings (RQ4)

Figure 4 visualizes the learned graph-, node-, and edge-level embeddings on the largest dataset, UltraTool, with the Direct planner. Overall, the t-SNE plots show a clear separation between correct and incorrect test samples with limited overlap across all three granularities. At the graph level (Figure 4(a)), correct plans form compact regions that are well separated from incorrect ones. This supports using the graph score for reliable acceptance-versus-correction decisions.
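The acceptance-versus-correction decision driven by the graph score, together with the thresholded exposure of risky nodes and edges (Appendix B.2/B.3), can be sketched as follows. The threshold values and function names here are placeholders of ours; the paper tunes the actual thresholds on a validation split.

```python
def triage(graph_score, node_risk, edge_risk,
           tau_G=0.5, tau_V=0.5, tau_E=0.5, top_k=3):
    """Accept the plan when the graph score clears tau_G; otherwise expose
    the riskiest nodes/edges (above their thresholds, at most top_k each)
    to the LLM corrector. Threshold values are illustrative placeholders."""
    if graph_score >= tau_G:
        return {"action": "accept", "nodes": [], "edges": []}
    nodes = sorted((v for v, r in node_risk.items() if r > tau_V),
                   key=lambda v: -node_risk[v])[:top_k]
    edges = sorted((e for e, r in edge_risk.items() if r > tau_E),
                   key=lambda e: -edge_risk[e])[:top_k]
    return {"action": "correct", "nodes": nodes, "edges": edges}

# low graph score -> correction is triggered; node 2 and edge (0, 1) exposed
decision = triage(0.21,
                  {0: 0.10, 1: 0.40, 2: 0.85},
                  {(0, 1): 0.72, (1, 2): 0.30})
```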
At the node and edge levels (Figures 4(b) and (c)), nodes and edges from correct plans occupy more coherent clusters, while incorrect ones are scattered into different regions, suggesting that the verifier learns distinct patterns for identifying incorrect tool selections and transitions. The visualization suggests that our verifier learns consistently discriminative representations across the graph, node, and edge granularities. More detailed visualization results and discussions are provided in Appendix D.4.

5 Conclusion

In this paper, we propose an effective graph-based verifier for LLM task planning. By representing a generated plan as a directed graph with enriched attributes, we use a GNN to perform structural evaluation and diagnosis. We automatically generate training data for graph/node/edge-level scoring by perturbing ground-truth plan graphs. Experimental results show that the feedback from our GNNVerifier can effectively assist an LLM in correcting the original plans. For future work, a possible direction is to move from static verification to execution-aware verification and correction by integrating online signals from real tool calls.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint (2023).

[2] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. 2022. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022).

[3] Anonymous. 2026. Unified Plan Verification with Static Rubrics and Dynamic Policies for Reliable LLM Planning. https://openreview.net/forum?id=qDFegAnCin. Under review.
[4] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17682–17690.

[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.

[6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021).

[7] Gautier Dagan, Frank Keller, and Alex Lascarides. 2023. Dynamic planning with an LLM. arXiv preprint arXiv:2308.06391 (2023).

[8] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. 2024. A survey on LLM-as-a-judge. The Innovation (2024).

[9] Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. 2023. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. Advances in Neural Information Processing Systems 36 (2023), 79081–79094.

[10] Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, and Gokhan Tur. 2025. Plan Verification for LLM-Based Embodied Task Completion Agents. arXiv preprint arXiv:2509.02761 (2025).

[11] Chia-Tung Ho, Haoxing Ren, and Brucek Khailany. 2025. VerilogCoder: Autonomous Verilog coding agents with graph-based planning and abstract syntax tree (AST)-based waveform tracing tool. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 300–307.
[12] Shijue Huang, Wanjun Zhong, Jianqiao Lu, Qi Zhu, Jiahui Gao, Weiwen Liu, Yutai Hou, Xingshan Zeng, Yasheng Wang, Lifeng Shang, et al. 2024. Planning, creation, usage: Benchmarking LLMs for comprehensive tool utilization in real-world complex scenarios. arXiv preprint arXiv:2401.17167 (2024).

[13] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys 55, 12 (2023), 1–38.

[14] Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. 2024. LLMs can't plan, but can help planning in LLM-modulo frameworks. arXiv preprint (2024).

[15] Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B Murthy. 2024. Position: LLMs can't plan, but can help planning in LLM-modulo frameworks. In Forty-first International Conference on Machine Learning.

[16] Christine P Lee, David Porfirio, Xinyu Jessica Wang, Kevin Chenkai Zhao, and Bilge Mutlu. 2025. VeriPlan: Integrating formal verification and LLMs into end-user planning. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–19.

[17] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.

[18] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. In The Twelfth International Conference on Learning Representations.

[19] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone.
2023. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477 (2023).

[20] Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, and Dongkuan Xu. 2024. ToolNet: Connecting large language models with massive tools via tool graph. arXiv preprint arXiv:2403.00839 (2024).

[21] Elias Lumer, Pradeep Honaganahalli Basavaraju, Myles Mason, James A Burke, and Vamse Kumar Subbiah. 2025. Graph RAG-Tool Fusion. arXiv preprint arXiv:2502.07223 (2025).

[22] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651 (2023).

[23] Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. 2024. ADaPT: As-needed decomposition and planning with language models. In Findings of the Association for Computational Linguistics: NAACL 2024. 4226–4252.

[24] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36 (2023), 68539–68551.

[25] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. Advances in Neural Information Processing Systems 36 (2023), 38154–38180.

[26] Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. 2024. TaskBench: Benchmarking large language models for task automation. Advances in Neural Information Processing Systems 37 (2024), 4540–4574.

[27] Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023.
Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366 (2023).

[28] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. On the planning abilities of large language models: A critical investigation. Advances in Neural Information Processing Systems 36 (2023), 75993–76005.

[29] Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint (2023).

[30] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022).

[31] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).

[32] Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, and Jingbo Shang. 2024. OfficeBench: Benchmarking language agents across multiple applications for office automation. arXiv preprint arXiv:2407.19056 (2024).

[33] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.

[34] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671 (2023).

[35] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. 2023. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework.
arXiv preprint arXiv:2308.08155 (2023).

[36] Qinzhuo Wu, Wei Liu, Jian Luan, and Bin Wang. 2024. ToolPlanner: A tool-augmented LLM for multi-granularity instructions with path planning and feedback. arXiv preprint arXiv:2409.14826 (2024).

[37] Xixi Wu, Yifei Shen, Caihua Shan, Kaitao Song, Siwei Wang, Bohang Zhang, Jiarui Feng, Hong Cheng, Wei Chen, Yun Xiong, et al. 2024. Can graph learning improve planning in LLM-based agents? Advances in Neural Information Processing Systems 37 (2024), 5338–5383.

[38] Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, et al. 2025. RAG-Gym: Systematic optimization of language agents for retrieval-augmented generation. arXiv preprint arXiv:2502.13957 (2025).

[39] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).

[40] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 2023. MM-ReAct: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023).

[41] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023).

[42] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
A Details of Perturbation-based Supervision

A.1 Perturbation Operators

Given a request r and its ground-truth plan graph G_r^gt, we generate a set of type-executable but semantically corrupted graphs {G_r^pert}, together with an operation log ops(G_r^pert) for each sample, which is then used to derive graph/node/edge supervision targets.

A.1.1 Sampling numbers and operation budgets. For each G_r^gt, we sample the number of perturbed graphs as

C ~ Categorical({2, 3, 4}; [0.25, 0.50, 0.25]). (11)

For each perturbed graph, we sample an operation budget:

B ~ Categorical({1, 2, 3}; [0.60, 0.30, 0.10]). (12)

This distribution is skewed toward small B because realistic plans often fail due to one primary issue, while still allowing occasional multi-error cases to improve robustness. For each operation, we sample the operator family with equal probability: Pr(Wrong Tool) = Pr(Missing Step) = 0.50.

A.1.2 Wrong Tool. Given a ground-truth plan graph G_r^gt = (V_r, E_r), we randomly sample a node v_i = (t_i, s_i) ∈ V_r and REPLACE its tool t_i with an alternative tool t'_i, while keeping the step text s_i unchanged. To form hard negatives, t'_i is sampled preferentially from the similar-tool neighborhood N_K(t_i). To avoid trivially infeasible negatives, we require the replacement to preserve interface-level connectivity under the dependency graph, i.e., it must be connectable to both adjacent tools:

t'_i ∈ N_out^tool(t_{i-1}) ∩ N_in^tool(t_{i+1}), (13)

where endpoint nodes only check the existing side. We then draw t'_i from two complementary subsets to control perturbation hardness: (i) Semantic-neighbor confusion. With probability 0.75, we sample t'_i from N_K(t_i) after applying the connectivity filter in Eq. (13), producing hard negatives that are type-executable and semantically close to t_i, which resembles the fine-grained confusions frequently made by real planners.
(ii) Mild noise. With probability 0.25, we sample t'_i from the remaining feasible tools that pass Eq. (13) but are not in N_K(t_i). This injects lightweight noise while still respecting the executability constraint, covering less frequent but possible tool-selection mistakes. If the feasible set is empty, we resample i until a valid replacement is found.

A.1.3 Missing Step. When Missing Step is chosen, we sample a consecutive directed subchain t_u → t_{u+1} → ... → t_v in G_r^gt while keeping the perturbed plan type-executable under the dependency graph. Then we sample the subtype with Pr(DROP) = Pr(COMPRESS) = 0.50.

DROP(span). We sample the span length:

m ~ Categorical({1, 2, 3, 4, ≥5}; [0.55, 0.25, 0.10, 0.07, 0.03]), (14)

where the distribution is skewed toward short spans to reflect that real planners more often skip one or two critical intermediate steps, while still assigning a small probability to longer spans to generate harder cases with more severe missing-step errors. We then delete m consecutive nodes and directly connect the endpoint tools by a shortcut edge t_u → t_v (or Start → t_v when the span starts from the beginning). When t_u ≠ Start, we require the shortcut edge to be type-executable, i.e., t_v ∈ N_out^tool(t_u); otherwise we resample the span (or m) until the shortcut is feasible. To make the perturbation realistic, we mix two sampling modes over feasible spans. With probability 0.75, we sample spans proportionally to a motif weight derived from tool-sequence counts f_n(·) (to favor locally plausible shortcuts that a planner might produce), and with probability 0.25 we sample uniformly among feasible spans to cover long-tail omissions.

COMPRESS(span → 1). We select a consecutive span and replace it by a single shortcut tool t*, forming t_u → t* → t_v (or Start → t* → t_v when the span starts from Start).
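The categorical draws in Eqs. (11), (12), and (14) are simple to implement. The sketch below (function and label names are ours) samples the perturbation layout for one ground-truth plan: how many corrupted graphs to build, each graph's operation budget, the operator family per operation, and, for Missing Step, the subtype and span length. Graph surgery and feasibility filtering are omitted.

```python
import random

random.seed(0)  # deterministic for illustration

def sample_categorical(values, probs):
    """One draw from Categorical(values; probs), as in Eqs. (11), (12), (14)."""
    return random.choices(values, weights=probs, k=1)[0]

def sample_perturbation_layout():
    # number of perturbed graphs per ground-truth plan, Eq. (11)
    num_graphs = sample_categorical([2, 3, 4], [0.25, 0.50, 0.25])
    layouts = []
    for _ in range(num_graphs):
        # operation budget per perturbed graph, Eq. (12)
        budget = sample_categorical([1, 2, 3], [0.60, 0.30, 0.10])
        ops = []
        for _ in range(budget):
            op = sample_categorical(["WrongTool", "MissingStep"], [0.5, 0.5])
            if op == "MissingStep":
                sub = sample_categorical(["DROP", "COMPRESS"], [0.5, 0.5])
                if sub == "DROP":
                    # span length, Eq. (14); 5 stands in for the ">=5" bucket
                    m = sample_categorical([1, 2, 3, 4, 5],
                                           [0.55, 0.25, 0.10, 0.07, 0.03])
                else:
                    # COMPRESS spans are sampled uniformly over m in {2, 3, 4}
                    m = sample_categorical([2, 3, 4], [1, 1, 1])
                ops.append((op, sub, m))
            else:
                ops.append((op, None, 1))
        layouts.append(ops)
    return layouts

layouts = sample_perturbation_layout()
```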
In implementation, we restrict the span length to m ∈ {2, 3, 4} and sample a feasible span uniformly from all candidate spans of these lengths. The shortcut tool t* is enumerated from the dependency graph as a bridge candidate that preserves type executability:

t* ∈ N_out^tool(t_u) ∩ N_in^tool(t_v), (15)

and for spans starting from Start we drop the left constraint and only require t* ∈ N_in^tool(t_v). To prefer plausible but improper compressions, we rank candidates using the span text s_span (the concatenation of the deleted step texts) and sample:

Pr(t*) ∝ exp(g(s_span, t*)), (16)

so that t* tends to be semantically aligned with the span while still representing an over-generalized merge. If no feasible t* exists, we resample the span. The two perturbation types can be applied independently or composed multiple times on the same G_r^gt, yielding candidate graphs with varying severities and error locations.

A.2 Details of Supervision Signals

Graph-level target. For each G_r^pert, we compute a non-negative perturbation cost c(G_r^pert) = Σ_{o ∈ ops(G_r^pert)} η(o), where η(o) is the cost of executing one perturbation operation o. We then map it to the graph-level target y(G_r^pert) = exp(−c(G_r^pert)/τ) as in Eq. (9). In implementation, we use the following operation-type costs:

REPLACE. For a replacement at node v_i = (t_i, s_i) where the tool becomes t'_i, we set

η(REPLACE) = 1 − g(s_i, t'_i). (17)

This assigns a smaller cost to harder near-synonym confusions that still align well with the step text.

DROP(span). Let D be the deleted tool set and m = |D|. We use

η(DROP) = m · Σ_{t ∈ D} max(0, s(r, t)), (18)

where s(r, t) is the semantic similarity between the request and the tool representation. This penalizes dropping longer spans and dropping tools that are relevant to the request.

COMPRESS(span → 1).
For a span of length m compressed into a shortcut tool t* with span text s_span (the concatenation of the deleted step texts), we set

η(COMPRESS) = (m − 1) · (1 − g(s_span, t*)). (19)

This assigns a larger cost to longer compressions and to shortcuts that poorly cover the original span.

A.3 Details of Two-stage Training

For each request r, we construct a training set that contains the ground-truth plan graph G_r^gt and a set of perturbed graphs {G_r^pert}. Each graph G_r is associated with (i) a graph-level target y(G) ∈ (0, 1], (ii) node-level targets {ℓ_v^node}_{v ∈ V_r}, and (iii) edge-level targets {ℓ_uv^edge}_{(u,v) ∈ E_r}. We train the verifier in two stages: Stage I learns stable graph-level scoring, and Stage II learns fine-grained node/edge diagnosis with minimal drift of the global capability.

Stage I: Graph-level training. We supervise the graph-level logit z(G_r) with the graph-level soft target y(G_r) using a binary cross-entropy regression loss:

L_graph = BCEWithLogits(z(G_r), y(G_r)). (20)

In addition, for the same request r we typically have multiple graphs with different perturbation costs c(·), which naturally induce an ordering P(r) = {(i, j) | c(G_{r,i}) < c(G_{r,j})}, and we apply a margin ranking loss:

L_rank = Σ_{(i,j) ∈ P(r)} max(0, m_ij − z(G_{r,i}) + z(G_{r,j})), (21)

where m_ij = 0.2 · (c(G_{r,j}) − c(G_{r,i})). This formulation encourages the verifier to assign higher scores to less corrupted graphs, while enforcing larger separation for larger corruption gaps. The overall Stage I objective is

L_stage1 = L_rank + λ_graph · L_graph, (22)

where λ_graph balances calibration against ranking consistency.

Stage II: Node/Edge-level training. Stage II focuses on local diagnosis signals. We freeze most of the encoder parameters learned in Stage I, and fine-tune only the last GNN layer together with the node- and edge-level prediction heads.
For each node v ∈ V_r and each directed edge (u, v) ∈ E_r, we compute logits z_r^V(v) and z_r^E(u, v), and optimize them with weighted binary cross-entropy losses:

L_node = Σ_{v ∈ V_r} WBCEWithLogits(z_r^V(v), ℓ_v^node; ω_node),
L_edge = Σ_{(u,v) ∈ E_r} WBCEWithLogits(z_r^E(u, v), ℓ_uv^edge; ω_edge). (23)

Since positive targets are rarer than negative ones, we set ω_node > 1 and ω_edge > 1 to mitigate class imbalance. Both weights are computed automatically from the positive/negative target counts in the training data. The overall Stage II objective is

L_stage2 = L_node + λ_edge · L_edge, (24)

where λ_edge controls the strength of edge-level diagnosis relative to node-level diagnosis.

B Details of Local Correction

This appendix details the implementation of our verification-guided local correction module. Given the verifier outputs on a plan graph, we expose a small set of high-risk nodes/edges together with constrained candidate tools, and require the LLM to return minimal, schema-valid edit operations.

B.1 Candidate Tools

In this section, we detail how we construct tool candidate sets from G_tool for node replacement and edge insertion during local correction, and how we rank and select the final candidates.

Replacement candidates (node-level). For any high-risk node v_i = (t_i, s_i) ∈ V_edit, we select candidates from the dependency graph G_tool and the similar-tool neighborhood:

C_rep(v_i) = N_K(t_i) ∩ N_out^tool(t_{i-1}) ∩ N_in^tool(t_{i+1}). (25)

When v_i is a root node (zero in-degree), we treat t_{i-1} as Start and only keep N_in^tool(t_{i+1}) in the above filtering. We rank t ∈ C_rep(v_i) by

score_rep(t) = g(s_i, t) + s(r, t), (26)

and finally return the Top-3 candidates.

Insertion candidates (edge-level). For any high-risk edge (u, v) ∈ E_edit, we enumerate bridging tools from the dependency graph G_tool:

C_ins(u, v) = N_out^tool(t_u) ∩ N_in^tool(t_v).
(27) When u = Start, we drop the constraint induced by N_out^tool(t_u) and only require t ∈ N_in^tool(t_v). We rank t ∈ C_ins(u, v) by

score_ins(t) = 0.8 · s(r, t) + 0.2 · (log(1 + f_2(t_u, t)) + log(1 + f_2(t, t_v))), (28)

where f_2(·, ·) is the frequency of length-2 tool sequences, and finally return the Top-3 candidates. We choose the weight 0.8 to prioritize request–tool relevance when inserting bridging tools (to avoid introducing irrelevant steps), while the co-occurrence statistics f_2(·, ·) serve as an auxiliary signal to favor more common local transitions.

B.2 Threshold Selection

The thresholds τ_G, τ_V, τ_E directly control whether correction is triggered and how large the editable region is. We calibrate them on the validation set in two steps: (i) Sweep τ_V and τ_E over a grid using P_r^V(v) and P_r^E(u, v), and evaluate them as binary detectors of ground-truth error locations: a node (or edge) is predicted as positive if its risk exceeds the threshold. We then choose the thresholds that maximize the validation F1 score. (ii) With τ_V and τ_E fixed, we tune the graph-level acceptance threshold τ_G for deciding whether to invoke the LLM corrector. We sweep τ_G on the validation set and apply local correction only to samples with S_r < τ_G. We select the τ_G that maximizes the end-to-end validation accuracy of the final plan, reflecting the trade-off between correcting low-quality plans and avoiding over-editing reasonably good plans.

B.3 Inputs and Outputs

Inputs. Given a request r and the current plan graph G_r = (V_r, E_r), the verifier outputs a graph-level score S_r, node-level risks p_r^V, and edge-level risks p_r^E. Editable regions V_edit and E_edit are obtained by thresholding the risks as in Eq. (10). When local correction is triggered
# DEPENDENCY GRAPH #
{{ dependency_graph }}
# USER REQUEST #
{{ user_request }}
# CURRENT PLAN (JSON) #
{ "task_steps": {{ task_steps }}, "task_nodes": {{ task_nodes }}, "task_links": {{ task_links }} }
# GNN EVALUATION #
The GNN is a plan evaluator and reports:
- Graph score S ∈ (0, 1): higher suggests stronger alignment with the user request.
- Node risk ∈ (0, 1): higher indicates a node may use an incorrect tool.
- Edge risk ∈ (0, 1): higher indicates an edge (including START → root node and tool-to-tool edges) may be incomplete and need an inserted tool.
# TOP RISK NODES #
{node_diag_str}
# TOP RISK EDGES #
{edge_diag_str}
# CANDIDATES (node_id -> candidate_id -> tool) #
{{ node_candidates_json }}
# CANDIDATES (edge_id -> candidate_id -> tool) #
{{ edge_candidates_json }}
# GOAL #
You are a plan refinement assistant. Analyze the user request, the current plan, and the GNN diagnostics to decide whether and how to improve the plan so it better satisfies the request.
# COMMON FAILURE MODES #
1) Wrong Tool: a tool is semantically similar but incorrect → replace the node.
2) Missing Step: especially missing preprocessing → insert a tool on the risky edge.
# TASK #
Based on the user request, the current plan, and the GNN diagnostics, first decide whether any change is necessary. If no change is needed, return empty edits. If changes are needed, select replacement/insertion tools only from the provided candidates and propose minimal edits. If improvements are not clear with the provided candidates, do not modify. Return EDIT OPERATIONS only (no analysis text). It is valid to return no edits.
# RULES #
1. Output JSON ONLY (no extra text).
2. Modify at most 3 places and do not use the same candidate tool in multiple edits.
3. Allowed ops:
   - replace_on_node(node_id, candidate_id, step)
   - insert_on_edge(edge_id, candidate_id, step)
   - no_change()
4. candidate_id must be an integer index from the candidate list.
5.
Each node/edge may remain unchanged; prefer fewer edits and only change when necessary.
6. For every edit, provide a new step text aligned with the chosen tool and the request, and keep steps aligned 1-to-1 with nodes (same count, same order).
7. The updated steps/tools should solve the request better than the current plan; otherwise do not modify.
# OUTPUT FORMAT (minimal edits) #
{{ "edits": [ {{"op":"replace_on_node","node_id":0,"candidate_id":1,"step":"Step 1: ..."}}, {{"op":"insert_on_edge","edge_id":0,"candidate_id":2,"step":"Step 2: ..."}} ] }}
If the current workflow is already optimal, return: {{"edits":[]}}
# RESULT #

Figure 5: Prompt template for verification-guided local correction.

(i.e., $S_r < \tau_G$), we rank nodes in $\mathcal{V}_{\text{edit}}$ by $P_{V_r}(v)$ and edges in $\mathcal{E}_{\text{edit}}$ by $P_{E_r}(u, v)$, and expose only the Top-3 risky locations to the LLM. For each exposed high-risk node $v_i \in \mathcal{V}_{\text{edit}}$, we provide a Top-3 replacement list $\mathcal{C}_{\text{rep}}(v_i)$; for each exposed high-risk edge $(u, v) \in \mathcal{E}_{\text{edit}}$, we provide a Top-3 insertion list $\mathcal{C}_{\text{ins}}(u, v)$.

Outputs. The LLM is required to output JSON only, with a single field edits. Each element in edits specifies one edit operation: replace_on_node or insert_on_edge. Returning an empty list ("edits": []) indicates no change. We use the following minimal schema:

# USER REQUEST #
{{ user_request }}
# ERROR #
{{ error_msg }}
# CANDIDATES (node_id -> candidate_id -> tool) #
{{ node_candidates_json }}
# CANDIDATES (edge_id -> candidate_id -> tool) #
{{ edge_candidates_json }}
# TASK #
Fix the edit operations to satisfy all constraints. Only output:
{{"edits":[ {{"op":"replace_on_node","node_id":0,"candidate_id":1,"step":"Step 1: ..."}}, {{"op":"insert_on_edge","edge_id":0,"candidate_id":2,"step":"Step 2: ..."}} ]}}
Or {{\"edits\":[]}} / no_change().
# RESULT #

Figure 6: Retry prompt template for schema-valid edit operations.
{ "edits": [
    {"op":"replace_on_node", "node_id": i, "candidate_id": j, "step": "Step k: ..."},
    {"op":"insert_on_edge", "edge_id": i, "candidate_id": j, "step": "Step k: ..."}
] }

where node_id indexes a tool node in the current plan graph, and edge_id indexes an exposed risky edge (including the virtual Start → root edge when applicable). candidate_id must be an integer index into the corresponding candidate list. Before applying edits, we enforce the following constraints: (1) Edit budget: at most 3 edits; (2) Index validity: node_id/edge_id must refer to an exposed location; (3) Candidate validity: candidate_id must be within the Top-3 candidate list of that location; (4) No repeated tool: the same candidate tool cannot be used in multiple edits; (5) Step text: each edit must include a non-empty step description aligned with the chosen tool. If any constraint is violated or the output cannot be parsed as JSON, we trigger a single retry. We provide the prompt template in Figure 5 to drive the LLM to perform constrained local edits based on the verifier diagnostics and the candidate tools from the dependency graph.

B.4 Retry and Acceptance Rules

If the LLM output cannot be parsed as valid JSON or violates any constraint in Sec. B.3, we prompt the LLM once more with (i) the request $r$, (ii) an error message summarizing the violated constraint(s), and (iii) the same exposed candidate lists, asking it to only fix the edit operations and return JSON. After applying the edits to obtain a corrected plan graph $G'_r$, we re-run the verifier to compute $S'_r$. We accept the correction only if it improves the graph-level score, i.e., $S'_r > S_r$; otherwise we keep the original plan. This conservative acceptance rule prevents the corrector from making unnecessary or harmful changes when the constrained edits do not yield a clear improvement.
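The constraints of Sec. B.3 can be checked mechanically before any edit is applied. The following is a minimal sketch of such a validator; the function and argument names are our own illustration, not the paper's released code, and `candidates` maps an exposed location to its Top-3 tool list:

```python
def validate_edits(edits, exposed_nodes, exposed_edges, candidates, max_edits=3):
    """Check LLM edit operations against the constraints of Sec. B.3.

    Returns a list of violation messages; an empty list means the edits
    pass the edit-budget, index, candidate, tool-uniqueness, and step checks.
    """
    errors = []
    if len(edits) > max_edits:  # (1) edit budget
        errors.append(f"edit budget exceeded: {len(edits)} > {max_edits}")
    used_tools = set()
    for e in edits:
        op = e.get("op")
        if op == "replace_on_node":
            key = ("node", e.get("node_id"))
            if e.get("node_id") not in exposed_nodes:  # (2) index validity
                errors.append(f"node_id {e.get('node_id')} is not an exposed location")
        elif op == "insert_on_edge":
            key = ("edge", e.get("edge_id"))
            if e.get("edge_id") not in exposed_edges:
                errors.append(f"edge_id {e.get('edge_id')} is not an exposed location")
        else:
            errors.append(f"unknown op: {op}")
            continue
        cands = candidates.get(key, [])
        cid = e.get("candidate_id")
        if not isinstance(cid, int) or not 0 <= cid < len(cands):  # (3) candidate validity
            errors.append(f"candidate_id {cid} out of range for {key}")
        else:
            tool = cands[cid]
            if tool in used_tools:  # (4) no repeated tool
                errors.append(f"tool {tool} used in multiple edits")
            used_tools.add(tool)
        if not str(e.get("step", "")).strip():  # (5) non-empty step text
            errors.append("empty step description")
    return errors

# A schema-valid single replacement passes with no violations.
ok = validate_edits(
    [{"op": "replace_on_node", "node_id": 0, "candidate_id": 1, "step": "Step 1: ..."}],
    exposed_nodes={0}, exposed_edges=set(),
    candidates={("node", 0): ["Tool A", "Tool B", "Tool C"]})
```

A non-empty return value would be summarized into the `error_msg` slot of the retry prompt (Figure 6).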
The retry uses a lightweight fix-up prompt that only exposes the error message and the same candidate lists, as shown in Figure 6.

C Dataset Details

TaskBench [26] is a multi-domain benchmark for tool-based task planning. For each domain, it defines a dependency graph whose nodes are tools/APIs and whose directed edges represent dependencies between tools. Each instance pairs a natural language user request with decomposed step texts and a ground-truth tool invocation graph (as nodes and dependency links), generated under controlled graph structure templates (e.g., single-node, chain, and directed acyclic graph) and filtered by quality checks. In our experiments, we use the three TaskBench domains (HuggingFace, Multimedia, and DailyLife) and directly treat each instance as a plan graph, consistent with prior graph-based planning work. We follow the official formatting and preprocessing conventions.

UltraTool [12] targets large-scale tool-use planning with broad tool coverage and multi-step tool invocation graphs. Each instance contains a user request and a ground-truth tool invocation plan. Following the same preprocessing protocol as GNN4Plan [37], we derive a challenging benchmark with a larger global dependency graph. Tool descriptions are refined under the same protocol to ensure semantic consistency.

Table 3: Performance comparison across four datasets on Qwen3: Node-F1, Link-F1, and Accuracy are reported in %. The best results are highlighted in boldface, and the second-best results are underlined.
Planner    Method        HuggingFace         Multimedia          DailyLife           UltraTool
                         n-F1  l-F1  Acc     n-F1  l-F1  Acc     n-F1  l-F1  Acc     n-F1  l-F1  Acc
Direct     Raw           82.41 58.63 34.20   89.40 71.56 53.40   97.59 67.04 54.20   64.84 37.67 31.20
           +Refine       80.99 56.53 33.20   89.52 72.29 54.40   96.93 62.56 49.20   73.31 30.24 23.40
           +VeriCoder    81.47 57.56 34.20   89.33 72.50 53.80   97.05 44.49 29.46   72.34 29.77 23.05
           +VeriPlan     82.47 58.92 34.60   89.57 72.65 54.80   97.59 67.04 54.20   71.91 44.13 35.20
           +GNNVerifier  85.10 64.67 45.80   90.84 75.88 63.20   97.59 84.74 75.55   72.81 45.21 37.47
ReAct      Raw           81.89 56.90 34.60   90.06 56.11 42.20   97.07 48.15 32.60   72.97 31.15 24.20
           +Refine       78.20 50.75 28.00   88.50 53.76 38.00   94.17 53.62 33.20   73.11 34.12 27.20
           +VeriCoder    78.39 51.31 26.60   90.53 72.96 53.40   96.30 31.79 15.40   72.80 26.06 19.60
           +VeriPlan     81.85 56.95 34.60   90.15 73.33 54.80   97.07 48.15 32.60   75.83 36.48 27.40
           +GNNVerifier  83.59 62.82 45.00   91.56 76.02 63.80   97.07 86.57 77.35   75.99 50.47 42.60
GNN4Plan   Raw           80.56 61.76 44.20   87.25 73.25 62.00   97.54 84.63 75.40   64.97 40.71 34.40
           +Refine       77.22 49.20 23.80   88.50 70.15 51.40   94.09 63.99 49.60   73.98 44.42 36.40
           +VeriCoder    80.33 54.94 31.80   88.87 72.46 53.80   96.57 44.60 28.60   72.04 29.72 22.60
           +VeriPlan     80.56 61.76 44.20   87.25 73.25 62.00   97.54 84.63 75.40   66.04 41.39 35.20
           +GNNVerifier  83.30 63.94 46.00   88.55 74.13 62.00   97.54 84.63 75.40   73.11 45.78 38.68

D Additional Experimental Results

D.1 Main Results on Qwen3

Table 3 reports the results on Qwen3-235B-A22B-Instruct-2507. Compared to the best baseline VeriPlan, the average relative improvements are 2.90% / 7.77% / 20.63% on HuggingFace, 1.49% / 3.10% / 10.14% on Multimedia, 0 / 28.09% / 40.75% on DailyLife, and 3.80% / 15.95% / 21.42% on UltraTool (n-F1 / l-F1 / Acc). Across planners, the corresponding improvements are 1.41% / 11.44% / 24.17% for Direct, 0.96% / 28.37% / 53.11% for ReAct, and 3.35% / 2.85% / 2.44% for GNN4Plan.
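The averaged relative improvements quoted here follow the usual definition, 100 × (ours − baseline) / baseline, averaged over planners. As a sanity check, the HuggingFace n-F1 figure can be reproduced from the Table 3 entries (the helper name below is ours):

```python
def rel_improvement(ours, baseline):
    """Relative improvement in percent: 100 * (ours - baseline) / baseline."""
    return 100.0 * (ours - baseline) / baseline

# HuggingFace n-F1, +GNNVerifier vs. the best baseline +VeriPlan (Table 3).
pairs = [(85.10, 82.47),   # Direct
         (83.59, 81.85),   # ReAct
         (83.30, 80.56)]   # GNN4Plan
avg = sum(rel_improvement(o, b) for o, b in pairs) / len(pairs)
print(avg)  # close to the reported 2.90
```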
Across GPT-4o and Qwen3, we observe consistent trends: our graph-based verifier improves n-F1 and l-F1 in most planner–dataset combinations and yields the largest gains on task accuracy, indicating good generalization across backbone LLMs. On DailyLife, both backbones show near-saturated n-F1, while l-F1 and Acc improve substantially, suggesting that the remaining errors are mainly structural/link-related.

D.2 Additional Ablation Studies

Table 4 reports ablation results on DailyLife and UltraTool. Overall, the trends are consistent with HuggingFace and Multimedia: ablating any key component (graph-structured modeling, two-stage training, advanced node/edge attributes, or multi-granularity feedback) leads to performance degradation, and the full model achieves the best overall results across planners and metrics. In particular, removing the GNN or any feedback signal generally hurts link-level quality and task accuracy, confirming that structure-aware reasoning and complementary diagnostic feedback are both important for stable correction. A notable difference is DailyLife, where n-F1 is already near-saturated across variants (many settings share the same n-F1), so improvements mainly manifest in l-F1 and Acc, indicating limited headroom on node identification while structural/dependency correction remains the primary bottleneck.

D.3 Additional Error Analysis

Figures 7 and 8 illustrate the distribution of five error types for the ReAct and GNN4Plan planners, respectively, comparing the raw plans (Before) with the plans after correction by our method (After). This analysis aims to verify the generalizability and effectiveness of our correction method across different planner frameworks.

ReAct.
Across most datasets, our method reduces planning errors, with clear improvements on Missing Tool and Edge Fail, suggesting that verification-guided local correction can effectively insert missing intermediate steps and fix semantically incorrect yet type-compatible transitions. Notably, on DailyLife, the Wrong Tool count remains unchanged and is already small, indicating that its errors are dominated by other types and thus leave limited room for tool-replacement gains; Dependency Error also stays negligible with a minor fluctuation (2 → 3), likely due to its small error mass.

GNN4Plan. A notable difference is that Dependency Error is already eliminated (0 both Before and After) across datasets, which aligns with GNN4Plan constructing plans over a dependency graph structure that enforces type compatibility. Despite this, our method continues to reduce non-structural errors such as Missing Tool and Edge Fail, and it also decreases Wrong Tool on several datasets. Similar to ReAct, DailyLife shows no change in Wrong Tool and only a small error mass overall, implying limited headroom for further reductions in that type.

Table 4: Ablation studies on DailyLife and UltraTool: Node-F1, Link-F1, and Accuracy are reported in %. The best results are highlighted in boldface, and the second-best results are underlined.

Planner    Variant          DailyLife           UltraTool
                            n-F1  l-F1  Acc     n-F1  l-F1  Acc
Direct     Raw              97.12 84.21 72.80   73.07 46.36 36.80
           w/o GNN          95.47 84.64 74.00   72.62 45.71 34.00
           w/o Stage-II     97.12 87.28 78.60   71.33 47.40 35.60
           w/o Stage-I      97.12 87.28 78.60   73.33 47.56 37.60
           w/o Node Feat.   97.12 87.28 78.60   75.15 50.49 41.00
           w/o Edge Feat.   96.67 86.68 76.20   75.04 50.09 41.40
           w/o Graph FB     97.12 87.28 78.60   73.97 51.29 40.80
           w/o Node FB      97.12 87.28 78.60   73.20 51.29 42.80
           w/o Edge FB      97.12 85.13 73.80   75.13 47.06 36.80
           Full             97.51 87.45 78.76   76.89 52.82 42.80
ReAct      Raw              96.59 57.01 45.20   73.89 39.35 32.40
           w/o GNN          94.22 62.58 51.00   74.05 39.12 32.40
           w/o Stage-II     96.59 83.38 68.00   74.73 50.58 42.40
           w/o Stage-I      96.58 85.80 76.20   73.97 46.26 38.40
           w/o Node Feat.   96.59 80.33 70.40   76.51 51.90 42.20
           w/o Edge Feat.   94.72 83.31 70.40   76.32 51.88 44.20
           w/o Graph FB     96.59 80.83 69.60   75.24 45.28 37.40
           w/o Node FB      96.59 80.56 70.40   75.73 45.75 38.00
           w/o Edge FB      96.59 81.40 68.60   73.89 40.05 32.40
           Full             96.59 85.83 76.35   77.91 52.06 44.20
GNN4Plan   Raw              97.25 87.51 78.80   71.68 46.99 37.80
           w/o GNN          95.42 84.75 74.40   71.37 46.61 31.80
           w/o Stage-II     97.25 87.61 78.80   72.73 51.18 41.20
           w/o Stage-I      97.25 87.61 78.80   71.88 47.52 38.20
           w/o Node Feat.   97.25 87.61 78.80   73.90 52.99 43.40
           w/o Edge Feat.   96.73 68.93 76.40   73.76 52.26 43.80
           w/o Graph FB     97.25 87.61 78.80   72.83 52.49 42.00
           w/o Node FB      97.25 87.61 78.80   73.55 52.56 42.20
           w/o Edge FB      97.25 87.51 78.80   72.94 47.69 37.80
           Full             97.25 87.61 78.98   76.44 53.46 43.80

D.4 Additional Visualization of GNN Embeddings

Figures 9 and 10 show the learned embeddings on the largest dataset, UltraTool, under ReAct and GNN4Plan, respectively. Similar to the Direct planner (Figure 4), correct and incorrect samples are largely separable at the graph, node, and edge levels, with limited overlap. This indicates that our verifier learns discriminative embeddings that can be reliably mapped to graph-/node-/edge-level scores, enabling effective correctness discrimination and providing informative signals for acceptance versus correction and for localizing potential errors.
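Such visualizations are produced by projecting the learned embeddings to 2-D with t-SNE. A minimal sketch is below; the embeddings here are fabricated Gaussian stand-ins (two clusters mimicking correct vs. incorrect plans), since the paper's actual verifier embeddings are not reproduced:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for graph-level embeddings: correct plans clustered around one
# centroid, incorrect plans around another (hypothetical 16-dim features).
correct = rng.normal(loc=+1.0, scale=0.3, size=(30, 16))
incorrect = rng.normal(loc=-1.0, scale=0.3, size=(30, 16))
emb = np.vstack([correct, incorrect])

# Project to 2-D for plotting; perplexity must be < number of samples.
proj = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)
print(proj.shape)  # (60, 2)
```

The first 30 rows of `proj` would be scattered with one color and the rest with another; well-separated clusters correspond to the separability discussed above.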
D.5 Hyperparameter Analysis

We tune three key hyperparameters via grid search: the soft-target temperature $\tau$, the weight $\lambda_{\text{graph}}$ for the graph objective in Stage I, and the weight $\lambda_{\text{edge}}$ for the edge objective in Stage II. Specifically, we search $\tau \in \{0.2, 0.4, 0.6, 0.8, 1.0\}$, $\lambda_{\text{graph}} \in \{0.1, 0.5, 1.0, 1.5, 2.0\}$, and $\lambda_{\text{edge}} \in \{0.5, 1.0, 1.5, 2.0\}$. All hyperparameter analyses are conducted with GPT-4o on the validation split of all datasets. We select $\tau$ and $\lambda_{\text{graph}}$ by maximizing the validation graph AUC (ROC-AUC) of the graph score, since both mainly affect graph quality learning in Stage I. We select $\lambda_{\text{edge}}$ by maximizing a validation local AUC, computed as the average of the ROC-AUC for node risk prediction and edge risk prediction, since it primarily controls the strength of local diagnosis learning in Stage II. All AUC values are computed against the self-supervised labels induced by our perturbation operators.

Figure 11 shows that $\tau$ typically benefits from moderate values: AUC improves from $\tau = 0.2$ to around $0.6$ on all datasets, while larger $\tau$ starts to hurt DailyLife and UltraTool by overly smoothing the distinction between mild and severe perturbations. For $\lambda_{\text{graph}}$, the trend is more dataset-dependent: HuggingFace favors a stronger graph-level regression signal (peaking at larger $\lambda_{\text{graph}}$), whereas DailyLife prefers smaller weights, indicating an interaction between global calibration and the ranking constraint. In contrast, $\lambda_{\text{edge}}$ exhibits a clearer pattern: increasing $\lambda_{\text{edge}}$ generally degrades AUC on HuggingFace and UltraTool, and Multimedia peaks around $1.0$ but drops sharply for larger values, suggesting that over-weighting the edge objective can dominate training and reduce the discriminability of local diagnosis. We therefore select the best configuration for each dataset based on its validation performance and keep it fixed in all main experiments.
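The AUC-based selection can be sketched as follows. The validation labels, the `roc_auc` implementation (a simple rank-based AUC, not the paper's code), and the mapping from $\tau$ to class separation are all fabricated toys for illustration:

```python
import numpy as np

def roc_auc(labels, scores):
    """Rank-based ROC-AUC: probability that a positive outscores a negative."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]
    return ((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / (len(pos) * len(neg))

def select_tau(grid, score_fn, labels):
    """Pick the tau that maximizes validation AUC of the graph score.

    `score_fn(tau)` is a hypothetical hook returning the graph scores
    produced by a verifier trained with that tau.
    """
    return max(grid, key=lambda tau: roc_auc(labels, score_fn(tau)))

# Toy validation set: three positives, three negatives, fixed base scores.
labels = np.array([1, 1, 1, 0, 0, 0])
base = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.6])
# Hypothetical tau -> class-separation mapping, peaking at tau = 0.6.
separation = {0.2: 0.15, 0.4: 0.45, 0.6: 0.75, 0.8: 0.55, 1.0: 0.25}
score_fn = lambda tau: separation[tau] * labels + base

best = select_tau([0.2, 0.4, 0.6, 0.8, 1.0], score_fn, labels)
print(best)  # 0.6
```

In practice, the same sweep is repeated for $\lambda_{\text{graph}}$, and for $\lambda_{\text{edge}}$ the objective is the average of the node-risk and edge-risk AUCs instead of the graph-score AUC.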
(a) HuggingFace (b) Multimedia (c) DailyLife (d) UltraTool
Figure 7: Analysis of planning error types before and after correction under ReAct across datasets with GPT-4o.

(a) HuggingFace (b) Multimedia (c) DailyLife (d) UltraTool
Figure 8: Analysis of planning error types before and after correction under GNN4Plan across datasets with GPT-4o.

(a) Graph (b) Node (c) Edge
Figure 9: t-SNE visualization of graph-, node-, and edge-level embeddings on UltraTool under ReAct with GPT-4o.

(a) Graph (b) Node (c) Edge
Figure 10: t-SNE visualization of graph-, node-, and edge-level embeddings on UltraTool under GNN4Plan with GPT-4o.

Figure 11: Hyperparameter sensitivity on validation AUC across datasets with GPT-4o.

D.6 Case Study

Figure 12 presents an example from the Multimedia dataset with the Direct planner. The request asks to apply reverb to a narration, mix it with background music, generate a waveform image of the combined audio, colorize the waveform image, and then retrieve similar waveform images online. In the visualization, green nodes denote tools correctly selected by the planner, red nodes denote incorrect selections, gray nodes are candidate tools provided by the dependency graph, and yellow nodes are the corrected tools chosen from the candidates; the node- and edge-level risk scores predicted before correction are also annotated on the corresponding nodes and edges. Our method corrects two representative failure modes in this case: a confusable tool choice at the first step and an invalid downstream dependency. Guided by the verifier's risk scores, we replace the mistaken tool with a semantically closer alternative from its similar-tool neighborhood, and we simultaneously correct the incorrect transition by selecting a type-compatible retrieval tool under the dependency graph constraints, resulting in an executable plan that better matches the user request.
Figure 12: Case study on Multimedia.