CABTO: Context-Aware Behavior Tree Grounding for Robot Manipulation


Authors: Yishuai Cai, Xinglin Chen, Yunxin Mao

Yishuai Cai 1,2,4 *†, Xinglin Chen 1,2,4 *, Yunxin Mao 1, Kun Hu 1, Minglong Li 1 ‡, Yaodong Yang 3,4, Yuanpei Chen 2,3,4
1 National University of Defense Technology  2 PsiBot  3 Peking University  4 PKU-Psibot Lab
{caiyishuai, chenxinglin, liminglong10}@nudt.edu.cn

Abstract

Behavior Trees (BTs) offer a powerful paradigm for designing modular and reactive robot controllers. BT planning, an emerging field, provides theoretical guarantees for the automated generation of reliable BTs. However, BT planning typically assumes that a well-designed BT system is already grounded—comprising high-level action models and low-level control policies—which often requires extensive expert knowledge and manual effort. In this paper, we formalize the BT Grounding problem: the automated construction of a complete and consistent BT system. We analyze its complexity and introduce CABTO (Context-Aware Behavior Tree grOunding), the first framework to efficiently solve this challenge. CABTO leverages pre-trained Large Models (LMs) to heuristically search the space of action models and control policies, guided by contextual feedback from BT planners and environmental observations. Experiments spanning seven task sets across three distinct robotic manipulation scenarios demonstrate CABTO's effectiveness and efficiency in generating complete and consistent behavior tree systems.

Introduction

Robot manipulation necessitates both reliable high-level planning and robust low-level control policies. Recently, Behavior Trees (BTs) (Ögren and Sprague 2022; Colledanchise and Ögren 2018) have emerged as a highly reliable and robust control architecture for intelligent robots, recognized for their modularity, interpretability, reactivity, and safety.
Many methods have been proposed to automatically generate BTs for task execution, including evolutionary computing (Neupane and Goodrich 2019; Colledanchise, Parasuraman, and Ögren 2019) and machine learning approaches (Banerjee 2018; French et al. 2019). In particular, BT planning (Cai et al. 2021; Chen et al. 2024; Cai et al. 2025a) has shown significant promise, primarily due to its strong theoretical guarantees: the BTs generated by such methods are provably successful in achieving goals within a finite time horizon.

Despite these advancements, BT planning critically assumes the prior existence of a well-grounded BT system.

* These authors contributed equally.
† This work was completed during an internship at PsiBot.
‡ Corresponding Author.
Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Constructing such a system, encompassing both high-level action models and their corresponding low-level control policies, typically requires substantial human expertise and effort. Specifically, for high-level planning, the BT system must contain a sufficient and appropriately modeled set of condition and action nodes, enabling their assembly into BTs capable of accomplishing diverse tasks. Concurrently, for low-level execution, these nodes must be reliably linked to executable control policies that ensure environmental transitions occur precisely as specified by the action models, ideally with high success rates.

In this paper, we formally define the BT grounding problem: the automated construction of a complete and consistent BT system for a given task set. We characterize a well-designed BT system by two critical properties: (1) Completeness: a complete BT system can generate solution BTs for all tasks within the specified task set through high-level BT planning, based on its action models.
(2) Consistency: a consistent BT system ensures that its control policies lead to state transitions that precisely match their corresponding action models during low-level BT execution. Figure 1 illustrates these concepts. For instance, a BT system with action set {a2, a3} is incomplete if it can only produce a solution BT for a subset of tasks, such as {p2, p3}. Conversely, an action set like {a1, a2} that successfully generates solution BTs for all three tasks exemplifies completeness. However, a1 would be inconsistent if its control policy fails to achieve Holding(apple) as declared by its action model. Furthermore, a2 is also inconsistent because its policy cannot put the apple in the drawer without the precondition IsOpen(drawer). In contrast, an action like a3, whose policy perfectly aligns with its action model's state transitions, is considered consistent.

We demonstrate a naive algorithm that solves the BT grounding problem via exhaustive search. While this approach effectively illustrates the core concepts, its exponential time complexity renders it impractical for deployment.

Large Models (LMs), pre-trained on extensive corpora, images, or datasets in other modalities, have shown significant abilities in searching, reasoning, and planning (Zhou et al. 2024; Valmeekam et al. 2023; Cai et al. 2025b). Leveraging the advantages of LMs, we propose the first framework for efficiently solving the BT grounding problem, named Context-Aware Behavior Tree grOunding (CABTO).

Figure 1: Concepts involved in the BT grounding problem. (a) A BT is a directed rooted tree with behavior nodes and control nodes. (b) The solution is a complete and consistent BT system for the given task set. (c) A complete BT system can generate solution BTs for all tasks during high-level BT planning based on action models.
(d) A consistent BT system ensures its control policies result in state transitions consistent with their action models during low-level BT execution.

CABTO mainly utilizes pre-trained LMs to heuristically search the space of action models and control policies based on the contexts of BT planning and environmental feedback. CABTO includes three phases: (1) High-level model proposal. Given a task set, we first use Large Language Models (LLMs) to generate promising action models and use a sound and complete BT planning algorithm to evaluate their completeness. The contexts in this phase are planning details. (2) Low-level policy sampling. We then employ Vision-Language Models (VLMs) to sample promising policy types, as well as their hyperparameters, for the explored action models. A matching policy and its corresponding action model together form a consistent action. The contexts in this phase are environment feedback. (3) Cross-level refinement. If the algorithm fails to find any policy for a given action model, the model is deemed inconsistent. In this case, the contexts of both high-level planning information and low-level environment feedback are combined to help the VLMs refine the action model and generate more promising candidates.

The key contributions of this work are as follows:

• We formally define the BT grounding problem as the construction of a complete and consistent BT system for a given task set. We provide a formal analysis and present a naive algorithm that elucidates the foundational concepts for solving this problem.

• We propose CABTO, the first framework for efficiently solving the BT grounding problem. CABTO strategically utilizes pre-trained LMs to heuristically explore the space of action models and control policies, informed by both BT planning contexts and environmental feedback.
• We empirically validate CABTO's superior effectiveness and efficiency in automatically generating complete and consistent BT systems across 7 diverse task sets in 3 distinct robotic manipulation scenarios. Comprehensive ablation studies further investigate the impact of LMs, control policy types, and cross-level refinement.

Related Work

Behavior Tree Generation. Most existing BT generation methods focus on constructing the BT structure while assuming predefined execution policies. Heuristic search approaches, including grammatical evolution (Neupane, Goodrich, and Mercer 2018), genetic programming (Lim, Baumgarten, and Colton 2010), and Monte Carlo DAG Search (Scheide, Best, and Hollinger 2021), have been widely studied. Machine learning methods, such as reinforcement learning (Banerjee 2018; Pereira and Engel 2015) and imitation learning (French et al. 2019), as well as formal synthesis approaches like LTL (Neupane and Goodrich 2023) and its variants (Tadewos, Newaz, and Karimoddini 2022), have also been explored. However, these methods often require complex environment modeling or cannot guarantee BT reliability. In contrast, BT planning (Cai et al. 2025b; Chen et al. 2024; Cai et al. 2025a) based on STRIPS-style action models (Fikes and Nilsson 1971) provides interpretable environment modeling while ensuring both reliability and robustness.

High-Level Action Models. Action models define the blueprints of actions that drive state transitions in a system (Arora et al. 2018). They are widely used in classical planning (Hoffmann and Nebel 2001; Bonet, Loerincs, and Geffner 1997), task and motion planning (TAMP) (Yang et al. 2024; Kumar et al. 2024), and symbolic problem solving (Pan et al. 2023; Fikes and Nilsson 1971). To reduce expert design effort, many methods learn action models from plan execution traces (Mahdavi et al. 2024; Bachor and Behnke 2024; Mordoch et al. 2024; Liu et al. 2023), employing inductive learning (Liang et al. 2025), evolutionary algorithms (Newton and Levine 2010), reinforcement learning (Rodrigues et al. 2012), and transfer learning (Zhuo and Yang 2014). However, these approaches typically assume the traces are already available, overlooking how to obtain them through low-level execution—an obstacle to practical deployment.

Low-Level Control Policies. Modern low-level robot manipulation policies can be broadly categorized into three types: (1) End-to-end policies, which directly map proprioceptive inputs to joint controls via reinforcement learning (Bai et al. 2025; Chen et al. 2023), imitation learning (Zare et al. 2024), and, more recently, Vision-Language-Action Models (VLAs) fine-tuned from large vision-language models (Zhong et al. 2025; Kim et al. 2024; Zhen et al. 2024). (2) Hierarchical policies, which decompose control into structured modules leveraging representations such as rigid-body poses (Kaelbling and Lozano-Pérez 2011), constraints (Huang et al. 2025), affordances (Huang et al. 2023), waypoints (Zhang et al. 2024), or skills and symbolic codes (Haresh et al. 2024; Mu et al. 2024). These approaches exploit expert knowledge to improve interpretability and extend long-horizon capabilities. (3) Rule-based policies, built solely on expert-designed control algorithms (Thomason, Kingston, and Kavraki 2024; Sundaralingam et al. 2023), offer strong robustness for specific tasks but struggle to generalize to unseen scenarios.

Preliminaries

Behavior Tree. A BT T is a rooted directed tree where internal nodes are control flow nodes and leaf nodes are execution nodes (Colledanchise and Ögren 2018). The tree is executed via periodic "ticks" from the root. The core nodes include: (1) Condition: returns success if a state proposition holds, else failure. (2) Action: performs tasks and returns success, failure, or running.
(3) Sequence (→): succeeds only if all children succeed (AND logic). (4) Fallback (?): fails only if all children fail (OR logic).

BT System. Following (Cai et al. 2021), a BT can be represented as a four-tuple T = ⟨n, h, π, r⟩, where n is the number of binary propositions describing the world state. Here, h: 2^n → 2^n denotes the action model representing the intended state transition; π: 2^n → 2^n denotes the control policy representing the actual execution effect; and r: 2^n → {success, running, failure} partitions the state space according to the BT's return status. A BT system is defined as Φ = ⟨C, A⟩. Each action a ∈ A is a tuple ⟨h_a, π_a⟩, where h_a = ⟨pre_h(a), add_h(a), del_h(a)⟩ is its action model (intended effect) and π_a = ⟨pre_π(a), add_π(a), del_π(a)⟩ is its control policy (actual effect). The preconditions pre_h(a), pre_π(a), add effects add_h(a), add_π(a), and delete effects del_h(a), del_π(a) are all subsets of the condition node set C. In a well-designed BT system, provided that the current state s_t satisfies the precondition (i.e., s_t ⊇ pre_h(a)), the state transition upon completion of action a after k time steps satisfies:

s_{t+k} = h_a(s_t) = π_a(s_t) = s_t ∪ add(a) \ del(a)    (1)

where h_a(s_t) and π_a(s_t) denote the states resulting from the action model and the control policy execution, respectively.

BT Planning. Given a BT system Φ, a BT planning problem is defined as p = ⟨S, s_0, g⟩, where S is the finite set of environment states, s_0 is the initial state, and g is the goal condition.
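The tick semantics of the four core node types can be sketched in a few lines of Python. This is a minimal illustrative interpreter, not the authors' implementation; the class and status names are ours:

```python
# Minimal behavior-tree tick semantics: Sequence is AND-like, Fallback is OR-like.
SUCCESS, FAILURE, RUNNING = "success", "failure", "running"

class Condition:
    def __init__(self, prop):
        self.prop = prop              # proposition tested against the world state

    def tick(self, state):
        return SUCCESS if self.prop in state else FAILURE

class Action:
    def __init__(self, effect):
        self.effect = effect          # callable that mutates state and returns a status

    def tick(self, state):
        return self.effect(state)

class Sequence:                       # succeeds only if all children succeed
    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            status = child.tick(state)
            if status != SUCCESS:     # propagate the first failure/running status
                return status
        return SUCCESS

class Fallback:                       # fails only if all children fail
    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            status = child.tick(state)
            if status != FAILURE:     # propagate the first success/running status
                return status
        return FAILURE
```

A typical goal-check-then-act pattern is `Fallback(Condition(goal), Sequence(Condition(pre), Action(act)))`: once the goal proposition holds, the condition short-circuits and the action is never re-ticked.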
A condition c ⊆ C is a subset of a state s and can be an atomic condition node or a sequence node with atomic condition nodes as children. If c ⊆ s, then c holds in s. A sound and complete BT planning algorithm, like BT Expansion (Cai et al. 2021), ensures a solution BT T in finite time if p is solvable. Such a BT T can transition the state from s_0 to s_n = π_T(s_0) ⊇ g in a finite number of steps n.

Algorithm 1: Naive Algorithm for BT Grounding
Input: Problem ⟨P, C_P, H_P, Π_P⟩
Output: Solution Φ = ⟨C, A⟩
 1: A ← ∅                          ▷ initialize grounded actions
 2: for pre ∈ 2^{C_P}, add ∈ 2^{C_P}, del ∈ 2^{C_P} do
 3:   h ← ⟨pre, add, del⟩          ▷ create an action model
 4:   if h ∈ H_P then
 5:     for each policy π ∈ Π_P do
 6:       if Consistent(h, π) then
 7:         a ← ⟨h, π⟩             ▷ create a consistent action
 8:         A ← A ∪ {a}            ▷ add the consistent action
 9:         break
10:       end if
11:     end for
12:   end if
13: end for
14: C ← ⋃_{a ∈ A} pre_h(a) ∪ add_h(a) ∪ del_h(a)
15: return Φ = ⟨C, A⟩

Problem Formulation

In this paper, we focus on the automatic construction of the BT system, and therefore need to formally define the properties that describe a well-designed BT system.

Definition 0.1 (Completeness). A BT system Φ is complete in the task set P if, ∀p ∈ P, any complete BT planning algorithm can produce a BT T that solves the task p according to its action models.

The completeness of the BT system Φ describes whether its condition nodes C and action nodes A are sufficient to solve all of the tasks in the given task set at the planning level.

Definition 0.2 (Consistency). An action a is consistent if pre_π(a) ⊆ pre_h(a), add_π(a) = add_h(a), and del_π(a) = del_h(a). That is, the control policy π_a is capable of inducing state transitions that match its action model. A BT system Φ is consistent if ∀a ∈ A, a is consistent.
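Definition 0.2 and the exhaustive traversal of Algorithm 1 condense into a short Python sketch. This is illustrative only; the dict-of-sets representation of action models and all function names are our own:

```python
from itertools import chain, combinations

def is_consistent(h, pi):
    """Definition 0.2: pre_pi ⊆ pre_h, add_pi = add_h, del_pi = del_h.

    h and pi are dicts with 'pre', 'add', 'del' sets of condition atoms.
    """
    return (pi["pre"] <= h["pre"]
            and pi["add"] == h["add"]
            and pi["del"] == h["del"])

def powerset(conds):
    conds = list(conds)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(conds, r) for r in range(len(conds) + 1))]

def naive_grounding(conds, valid_models, policies):
    """Algorithm 1: enumerate all O(2^(3n)) ⟨pre, add, del⟩ triples,
    keep the first consistent policy found for each valid model."""
    actions = []
    for pre in powerset(conds):
        for add in powerset(conds):
            for dele in powerset(conds):
                if (pre, add, dele) not in valid_models:  # validity filter (line 4)
                    continue
                h = {"pre": pre, "add": add, "del": dele}
                for pi in policies:
                    if is_consistent(h, pi):
                        actions.append((h, pi))
                        break
    c = set()                                             # induce condition set C (line 14)
    for h, _ in actions:
        c |= h["pre"] | h["add"] | h["del"]
    return c, actions
```

Even on this toy representation, the triple power-set loop makes the exponential blow-up concrete: n atomic conditions already yield 2^(3n) candidate models before any policy is tried.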
The consistency of the BT system Φ describes whether all action nodes can be successfully executed and cause the state to transition exactly as specified by their action models. Both completeness and consistency are essential for constructing a BT system that enables embodied robots to complete tasks. We then define the BT grounding problem as follows:

Problem 1 (BT Grounding). A BT grounding problem is a tuple ⟨P, C_P, H_P, Π_P⟩, where P is the finite task set, C_P is the finite set of valid condition nodes, H_P is the finite set of valid action models, and Π_P is the set of valid control policies. A solution to this problem is a BT system Φ = ⟨C, A⟩ that is complete and consistent in the task set P, where C ⊆ C_P and, for each a ∈ A, a = ⟨h_a, π_a⟩ with h_a ∈ H ⊆ H_P and π_a ∈ Π ⊆ Π_P.

Figure 2: The framework of CABTO includes three phases: (1) high-level model proposal leverages the planning contexts for the LLMs to heuristically explore the space of action models; (2) low-level policy sampling leverages the execution contexts for the VLMs to heuristically explore the space of control policies; (3) cross-level refinement leverages both planning and execution contexts to refine inconsistent action models.

Methodology

This section first presents a naive algorithm and a formal analysis to establish the foundational principles of the BT grounding problem. We then detail the CABTO framework, encompassing high-level model proposal, low-level policy sampling, and cross-level refinement. Finally, we provide the implementation details of the CABTO system.

Naive Algorithm for BT Grounding. Algorithm 1 outlines a naive approach to BT grounding. Given the problem tuple ⟨P, C_P, H_P, Π_P⟩, the algorithm initializes an empty action set A (line 1) and exhaustively traverses the power set of action components (line 2). For each candidate action model h, the algorithm first verifies its validity (lines 3–4).
It excludes models based on domain-independent constraints (e.g., models with add ∩ del ≠ ∅) or domain-dependent constraints (e.g., mutually exclusive preconditions), though the latter typically require extensive expert knowledge. Even when restricted to domain-independent constraints, exploring the model space entails an exponential complexity of O(2^(3n)). The algorithm then retrieves a control policy π ∈ Π_P (line 5) and verifies its consistency with h (line 6). Upon a successful match, it instantiates a consistent action a = ⟨h, π⟩ (line 7) and appends it to A. Finally, the algorithm induces the condition set C from the union of all atomic conditions in A (line 14) and returns the resulting BT system Φ. While exhaustive and correct, this algorithm faces significant limitations: (1) the exponential complexity of exploring H_P, and (2) the practical difficulty of designing Π_P and verifying policy consistency. Notably, automatically synthesizing low-level control policies to achieve specific effects remains a fundamental challenge in robotics (Kumar et al. 2023).

To overcome these limitations, we propose CABTO, a principled framework designed for efficient BT grounding. As illustrated in Algorithm 2, CABTO decomposes the grounding process into three phases, leveraging multi-modal contexts to circumvent exhaustive search. The following sections detail the implementation and context acquisition strategies employed in each phase.

High-Level Model Proposal. CABTO first initializes the grounded action set A as empty (line 1) and defines the unexplored model space H_U as the complete set of potential action models H_P (line 2). The process commences with an initial proposal phase, where the LLM receives a structured textual prompt defining the task set P. This context encapsulates goal states and initial conditions formalized as first-order logic propositions, alongside the semantic descriptions of scene objects.
Leveraging this task-specific context, the LLM identifies a subset of promising models H_E from H_P by specifying their symbolic preconditions and effects in a programmatic format (line 3):

H_E = LLM(P, H_P)    (2)

Empirical results demonstrate that for simple task sets, this initial proposal phase often yields sufficient action models to satisfy the majority of requirements in P.

To accommodate complex scenarios where initial proposals may be incomplete, CABTO employs a refinement loop that iterates until the task set P is verified as fully solvable using the validated grounded actions in A (line 4).

Algorithm 2: CABTO
Input: Problem ⟨P, C_P, H_P, Π_P⟩
Output: Solution Φ = ⟨C, A⟩
 1: A ← ∅                          ▷ initialize grounded actions
 2: H_U ← H_P                      ▷ initialize model search space
 3: H_E ← LLM(P, H_P)              ▷ Equation 2
 4: while H_U ≠ ∅ and not AllSolvable(P, A) do
 5:   // high-level model proposal
 6:   repeat
 7:     I_fail ← {I_p | p ∈ P, BTPlanning(p, H_E) fails}
 8:     if I_fail ≠ ∅ then
 9:       H′ ← LLM(P, H_U, I_fail) ▷ Equation 3
10:       H_U ← H_U \ H′, H_E ← H_E ∪ H′
11:     end if
12:   until I_fail = ∅ or H_U = ∅
13:   // low-level policy sampling
14:   for each h ∈ H_E \ H do
15:     n ← 0, π ← null, Consistent(h, π) ← False
16:     while n < N_max and not Consistent(h, π) do
17:       π ← VLM(h, Π_P, I_e)     ▷ Equation 4
18:       Sample a scenario s_0 where pre(h) ⊆ s_0
19:       s_t, I_e ← Execute(π, s_0)
20:       if s_t ⊇ (pre(h) ∪ add(h) \ del(h)) then
21:         Consistent(h, π) ← True
22:         A ← A ∪ {⟨h, π⟩}, H ← H ∪ {h}
23:       end if
24:       n ← n + 1
25:     end while
26:     // cross-level refinement
27:     if not Consistent(h, π) then
28:       h′ ← VLM(h, H_U, {I_p}, Π_P, {I_e}) ▷ Equation 5
29:       H_U ← H_U \ {h′}, H_E ← H_E ∪ {h′}
30:     end if
31:   end for
32:   H_E ← H                      ▷ prune H_E to the validated set
33: end while
34: C ← ⋃_{⟨h,π⟩ ∈ A} (pre(h) ∪ add(h) ∪ del(h))
35: return Φ = ⟨C, A⟩

Within this loop, the algorithm assesses the completeness of the current candidate set H_E by attempting to synthesize BTs for all tasks in P through BT planning. Any planning failure triggers the aggregation of diagnostic data into a failure set I_fail (line 7). Each entry I_p ∈ I_fail encapsulates critical diagnostics, such as the topological sketches of incomplete BTs and the count of expanded conditions. These metrics provide essential semantic cues, aiding the LLM in identifying symbolic gaps to propose more promising action models:

H′ ← LLM(P, H_U, I_fail)    (3)

Subsequently, proposed models are transferred from H_U to the candidate set H_E (line 10). This heuristic search iterates until the task set P is logically spanned by a complete BT system.

Low-Level Policy Sampling. This phase verifies the physical consistency of candidate action models h ∈ H_E (line 14). For each model h, we initialize the trial counter n = 0 and the policy π as null. To bridge the gap between abstract symbolic reasoning and precise physical execution, we propose a hierarchical framework that integrates Molmo (Deitke et al. 2025) with programmatic code generation. Specifically, within a budget of N_max attempts (line 16), the VLM acts as a programmatic sampler (line 17) that translates high-level semantic intentions into grounded control policies:

π ← VLM(h, Π_P, I_e)    (4)

where Π_P represents the set of available control interfaces. These interfaces comprise Molmo-based (Deitke et al. 2024) perception APIs for extracting environmental keypoints, cuRobo-based (Sundaralingam et al. 2023) motion control APIs (a 7-DoF IK solver) for the robotic arm, and gripper actuation commands. The execution context I_e serves as a critical nexus for closed-loop iterative refinement. It encapsulates multi-modal diagnostic data, including egocentric visual observations, previously synthesized control code, post-hoc visual feedback, and categorical success/failure signals.
By maintaining this high-fidelity temporal trace, the VLM can effectively anchor its subsequent sampling within the physical constraints evidenced by prior execution attempts. Specifically, the VLM selectively invokes Molmo-based perception tools conditioned on the logical semantics of h. When precise spatial grounding is necessitated, the VLM leverages Molmo to extract functional affordances and task-relevant keypoints—such as optimal grasp points or target placement coordinates—directly from the visual observation V. Subsequently, the VLM synthesizes these grounded keypoints and parameterized APIs into executable Pythonic code that instantiates the specific control policy π.

To validate the policy π, we initialize a simulation s_0 such that the initial state satisfies the precondition pre(h) (line 18). After execution (line 19), we check whether the terminal state s_t achieves the expected symbolic effects: s_t ⊇ (pre(h) ∪ add(h) \ del(h)) (line 20). Upon verification, the grounded action ⟨h, π⟩ is appended to the action set A.

Cross-Level Refinement. If the budget of N_max sampling attempts fails to yield a valid policy, the action model h is deemed physically inconsistent. While a naive approach would be to discard the model and restart the high-level proposal in the next iteration, we instead leverage both planning and execution contexts to refine h (line 28):

h′ ← VLM(h, H_U, {I_p}, Π_P, {I_e})    (5)

Here, the VLM synthesizes the planning context {I_p}, which defines the functional necessity of h within successful symbolic sequences (∀I_p ∈ {I_p}, h ∈ T_p), and the execution context {I_e}, which comprises multi-modal diagnostic data such as egocentric pre/post-action imagery and binary feedback. By integrating these cross-level insights, the VLM identifies underlying failures—such as omitted spatial preconditions or inaccurate symbolic effects—to synthesize a rectified action model h′.
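The sampling-and-verification loop (Algorithm 2, lines 16–25) amounts to the following sketch, with the VLM call and the simulator stubbed out as plain callables. All names here are hypothetical and the action-model representation is our own:

```python
def expected_effects(h):
    """Symbolic post-state required by the model: (pre ∪ add) \\ del."""
    return (h["pre"] | h["add"]) - h["del"]

def ground_policy(h, sample_policy, execute, n_max=3):
    """Try up to n_max sampled policies; return the first consistent one, else None.

    sample_policy stands in for the VLM call of Equation 4; execute stands in for
    running the policy in simulation from a state satisfying pre(h).
    """
    context = None                            # execution context I_e fed back to the sampler
    for _ in range(n_max):
        pi = sample_policy(h, context)        # VLM proposes a policy type + hyperparameters
        s_t, context = execute(pi, h["pre"])  # run from a state where pre(h) holds
        if expected_effects(h) <= s_t:        # terminal state covers the expected effects
            return pi                         # grounded action ⟨h, π⟩ found
    return None                               # budget exhausted: trigger cross-level refinement
```

Returning `None` is exactly the branch that hands h, together with the accumulated planning and execution contexts, to the cross-level refinement step.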
Upon completing the refinement loop, a knowledge synchronization step (line 32) updates H_E ← H, ensuring the explored pool consists exclusively of models verified by physical policies. This update provides a grounded, reliable action library for subsequent planning iterations (line 7). The cycle repeats until a set of grounded actions A renders all tasks in P solvable, after which the condition set C is extracted to define the final grounded state space (line 34).

Table 1: High-level model proposal results (averaged over 10 trials, max FC = 3) for GPT-3.5-Turbo vs. GPT-4o. "w/o" denotes without planning contexts; "w" denotes with planning contexts.

| Robot | Task Set | Acts | Conds | Steps | ASR w/o → w (GPT-3.5-Turbo) | CSR w/o → w (GPT-3.5-Turbo) | FC (GPT-3.5-Turbo) | ASR w/o → w (GPT-4o) | CSR w/o → w (GPT-4o) | FC (GPT-4o) |
|---|---|---|---|---|---|---|---|---|---|---|
| Franka | Cover | 2.0 | 4.4 | 4.0 | 60.0% → 66.7% | 40% → 50% | 1.6 | 100.0% → 100.0% | 100% → 100% | 0.0 |
| Franka | Blocks | 2.3 | 3.1 | 4.1 | 70.0% → 70.0% | 30% → 50% | 2.0 | 60.0% → 80.0% | 50% → 80% | 1.1 |
| Dual-Franka | Pour | 5.5 | 8.1 | 3.6 | 80.0% → 96.7% | 70% → 90% | 0.5 | 66.7% → 100.0% | 60% → 100% | 0.6 |
| Dual-Franka | Handover | 5.0 | 3.3 | 2.7 | 80.0% → 90.0% | 70% → 90% | 0.5 | 56.7% → 90.0% | 30% → 90% | 1.3 |
| Dual-Franka | Storage | 6.0 | 5.4 | 2.2 | 56.7% → 73.3% | 0% → 60% | 2.0 | 53.3% → 76.7% | 20% → 70% | 1.7 |
| Fetch | Tidy Home | 5.6 | 6.0 | 2.9 | 53.3% → 56.7% | 40% → 50% | 0.7 | 53.3% → 90.0% | 30% → 90% | 1.3 |
| Fetch | Cook Meal | 6.8 | 7.9 | 5.1 | 70.0% → 70.0% | 50% → 60% | 0.7 | 73.3% → 100.0% | 60% → 100% | 0.4 |
| Total | | 4.7 | 5.5 | 3.5 | 67.1% → 74.8% | 42.9% → 64.3% | 1.1 | 66.2% → 91.0% | 50% → 90.0% | 0.9 |

Figure 3: Configurations of the single-arm and dual-arm Franka manipulation tasks in Isaac Sim.

Figure 4: The deployment of CABTO in OmniGibson: given a task set, CABTO generates a complete and consistent BT system. For a specific task, BT planning is used to generate the solution BT. The BT is then executed, enabling the robot to successfully achieve the goal.
Experimental Setup

Task Sets. We evaluate the robustness and adaptability of CABTO on a comprehensive suite of seven robotic manipulation task sets, encompassing 21 unique goals (three goals per task set) across three distinct robotic platforms. These scenarios are strategically designed to cover a spectrum of physical and logical challenges: Single-Arm Franka (T1: Cover, T2: Blocks), Dual-Arm Franka (T3: Pour, T4: Handover, T5: Storage), and Mobile Fetch (T6: Tidy Home, T7: Cook Meal). As summarized in Table 1, these tasks range from fundamental pick-and-place and stacking (T1–T2) to complex bimanual coordination for cooperative transport and exchange (T3–T5), and long-horizon mobile manipulation involving articulated objects and semantic state changes (T6–T7). To quantify solution complexity, Table 1 reports the resulting BT attributes for each task set, including the number of unique action predicates (Acts), condition predicates (Conds), and the total execution steps (Steps).

Environment. Fetch robot experiments were conducted in OmniGibson (Li et al. 2023) for its realistic physics, while Franka tasks were designed in Isaac Sim to enable flexible object configuration (Figures 3 and 4). All experiments are conducted on a single NVIDIA RTX 4090 GPU.

Metrics. We evaluate the completeness of the high-level model using two primary metrics: (1) Average Planning Success Rate (ASR): the mean planning success rate across all individual tasks within a given task set. (2) Complete Planning Success Rate (CSR): the success rate at which all tasks within the set are successfully planned simultaneously. We also report the average number of Feedback Cycles (FC).
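Under these definitions, ASR and CSR can be computed from per-trial planning outcomes as follows. This is an illustrative sketch; the trial representation (one dict of task → success flag per trial) is our assumption, not the paper's evaluation code:

```python
def asr_csr(trials):
    """Compute ASR and CSR for one task set.

    trials: non-empty list of dicts mapping task name -> bool (planned successfully),
    one dict per independent trial.
    ASR averages success over every (trial, task) pair; CSR counts the fraction of
    trials in which every task in the set was planned successfully.
    """
    n_tasks = len(trials[0])
    asr = sum(sum(t.values()) for t in trials) / (len(trials) * n_tasks)
    csr = sum(all(t.values()) for t in trials) / len(trials)
    return asr, csr
```

CSR is necessarily no larger than ASR: a trial counts toward CSR only when all of its per-task successes already counted toward ASR.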
Table 2: Evaluation results of low-level policy sampling using the VLM for five typical action models.

| Action | End-to-end: OpenVLA | Hierarchical: VoxPoser | Hierarchical: ReKep | Hierarchical: Molmo+cuRobo | Rule-based: APIs | Molmo+cuRobo+APIs (w/o contexts) | Molmo+cuRobo+APIs (with contexts) |
|---|---|---|---|---|---|---|---|
| Pick(obj) | 4/10 | 4/10 | 6/10 | 5/10 | 6/10 | 6/10 | 7/10 |
| Place(obj, loc) | 5/10 | 3/10 | 7/10 | 5/10 | 6/10 | 6/10 | 8/10 |
| Open(container) | 1/10 | 1/10 | 1/10 | 3/10 | 1/10 | 2/10 | 4/10 |
| Close(container) | 2/10 | 2/10 | 3/10 | 4/10 | 2/10 | 3/10 | 5/10 |
| Toggle(switch) | 2/10 | 1/10 | 4/10 | 6/10 | 5/10 | 5/10 | 7/10 |
| Total | 28% | 22% | 42% | 46% | 40% | 44% | 62% |

Table 3: Success rate (SR%) of VLM-based cross-level refinement for action models. Results are averaged over 10 trials (N_FC ≤ 3). Pre, Add, and Del denote the action precondition, add effect, and delete effect, respectively.

| Action | Defect Type & Description | Textual Baseline | w/o Feedback | with Feedback | Avg. FC |
|---|---|---|---|---|---|
| PutIn(obj, container) | Pre: missing IsOpen(container) due to closed lid | 10% | 40% | 80% | 1.1 |
| Stack(obj_a, obj_b) | Pre: missing Clear(obj_b) due to surface obstruction | 20% | 30% | 70% | 2.1 |
| Lift(box_big, r1, r2) | Pre: missing Holding(r2, box_big) in dual-arm coordination | 10% | 80% | 90% | 0.3 |
| Pick(robot, obj) | Add: unverified InReach(robot, obj) (kinematic constraint) | 20% | 50% | 90% | 0.8 |
| Put(obj, loc) | Del: stale At(obj, loc_old) resulting in location redundancy | 0% | 20% | 40% | 2.4 |
| Total | | 12% | 44% | 74% | 1.3 |

Evaluation of High-Level Model Proposal

Ablating Planning Contexts. Planning context feedback proved crucial for performance (Table 1). Its inclusion consistently boosted goal success rates and system completeness, most notably for GPT-4o, where completeness jumped from 50% to over 90%. The performance gains were most significant in the complex dual-arm and mobile manipulation tasks, demonstrating that structured, symbolic feedback from a formal planner can empower LLMs to resolve intricate logical challenges.
Comparison of LLMs. As shown in Table 1, while GPT-3.5 and GPT-4o performed comparably without planning context feedback, GPT-4o's superiority became evident with it. Guided by this feedback, GPT-4o achieved over a 90% complete planning success rate, in stark contrast to approximately 60% for GPT-3.5. This underscores GPT-4o's advanced capacity for leveraging contextual feedback in complex reasoning tasks like BT grounding.

Evaluation of Low-Level Policy Sampling

We evaluate the performance of three policy types for low-level policy sampling; details of these policies are given in the Appendix. We select five typical action models to test their performance, as shown in Table 2. The algorithms show different strengths across actions: ReKep and the rule-based method excel at grasping, while Molmo+cuRobo performs better on the Open/Close and Toggle actions, owing to its semantics-based keypoint extraction, which accurately identifies object handles and hinges. We use GPT-4o as the VLM in this experiment.

Ablating Execution Contexts. Table 2 presents the success rate (SR) of control policies for five typical action models, where the VLM samples the policy type and its hyperparameters based on the execution contexts. The results report the SR without execution contexts and with up to three sample attempts. It is evident that the VLM can effectively sample low-level policies, and that the execution contexts further improve the SR of the actions.

Evaluation of Cross-Level Refinement

Ablating Environment Feedback. Table 3 catalogs action models that exhibited inconsistencies, where the predicted high-level effect diverged from the low-level execution outcome or resulted in an error. Through an iterative feedback process, the VLM demonstrated the potential to successfully correct these high-level representations, underscoring the critical role of direct environmental feedback.
However, the efficacy of this approach is currently limited for abstract concepts lacking direct visual correlates, such as the symbolic target in Put(obj, loc).

Deployment

Figure 4 depicts the deployment of our pipeline, where CABTO generates a complete and consistent BT system for the given task set. The robot successfully executes the planned BT actions sequentially for every task.

Conclusion

In this work, we first formalize the BT grounding problem and propose CABTO, a framework that leverages LMs to automatically construct complete and consistent BT systems guided by planning and environmental feedback. The effectiveness of our approach is validated across seven robotic manipulation task sets. Future work will focus on enhancing LM inference and low-level robotic skills via fine-tuning, and on addressing the transfer to physical systems.

Acknowledgments

This work was supported by the National Science Fund for Distinguished Young Scholars (Grant No. 62525213), the National Natural Science Foundation of China (Grant No. 62572480), and the University Youth Independent Innovation Science Foundation (Grant No. ZK25-11).

References

Arora, A.; Fiorino, H.; Pellier, D.; Métivier, M.; and Pesty, S. 2018. A Review of Learning Planning Action Models. The Knowledge Engineering Review, 33: e20.
Bachor, P.; and Behnke, G. 2024. Learning Planning Domains from Non-Redundant Fully-Observed Traces: Theoretical Foundations and Complexity Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 20028–20035.
Bai, F.; Li, Y.; Chu, J.; Chou, T.; Zhu, R.; Wen, Y.; Yang, Y.; and Chen, Y. 2025. Retrieval Dexterity: Efficient Object Retrieval in Clutters with Dexterous Hand. arXiv preprint arXiv:2502.18423.
Banerjee, B. 2018. Autonomous Acquisition of Behavior Trees for Robot Control. 3460–3467.
Bonet, B.; Loerincs, G.; and Geffner, H. 1997.
A Robust and Fast Action Selection Mechanism for Planning. In AAAI, 714–719.
Cai, Y.; Chen, X.; Cai, Z.; Mao, Y.; Li, M.; Yang, W.; and Wang, J. 2025a. MRBTP: Efficient Multi-Robot Behavior Tree Planning and Collaboration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 14548–14557.
Cai, Y.; Chen, X.; Mao, Y.; Li, M.; Yang, S.; Yang, W.; and Wang, J. 2025b. HBTP: Heuristic Behavior Tree Planning with Large Language Model Reasoning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), 13706–13713.
Cai, Z.; Li, M.; Huang, W.; and Yang, W. 2021. BT Expansion: A Sound and Complete Algorithm for Behavior Planning of Intelligent Robots with Behavior Trees. In AAAI, 6058–6065. AAAI Press.
Chen, X.; Cai, Y.; Mao, Y.; Li, M.; Yang, W.; Xu, W.; and Wang, J. 2024. Integrating Intent Understanding and Optimal Behavior Planning for Behavior Tree Generation from Human Instructions. In IJCAI.
Chen, Y.; Wang, C.; Fei-Fei, L.; and Liu, C. K. 2023. Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation. arXiv preprint arXiv:2309.00987.
Colledanchise, M.; and Ögren, P. 2018. Behavior Trees in Robotics and AI: An Introduction. CRC Press.
Colledanchise, M.; Parasuraman, R.; and Ögren, P. 2019. Learning of Behavior Trees for Autonomous Agents. IEEE Transactions on Games, 11(2): 183–189.
Deitke, M.; Clark, C.; Lee, S.; Tripathi, R.; Yang, Y.; Park, J. S.; Salehi, M.; Muennighoff, N.; Lo, K.; Soldaini, L.; et al. 2024. Molmo and Pixmo: Open Weights and Open Data for State-of-the-Art Multimodal Models. arXiv preprint arXiv:2409.17146.
Deitke, M.; Clark, C.; Lee, S.; Tripathi, R.; Yang, Y.; Park, J. S.; Salehi, M.; Muennighoff, N.; Lo, K.; Soldaini, L.; et al. 2025. Molmo and Pixmo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 91–104.
Fikes, R.
E.; and Nilsson, N. J. 1971. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intelligence, 2(3–4): 189–208.
French, K.; Wu, S.; Pan, T.; Zhou, Z.; and Jenkins, O. C. 2019. Learning Behavior Trees from Demonstration. In 2019 International Conference on Robotics and Automation (ICRA), 7791–7797. IEEE.
Haresh, S.; Dijkman, D.; Bhattacharyya, A.; and Memisevic, R. 2024. ClevrSkills: Compositional Language and Visual Reasoning in Robotics.
Hoffmann, J.; and Nebel, B. 2001. The FF Planning System: Fast Plan Generation through Heuristic Search. Journal of Artificial Intelligence Research, 14: 253–302.
Huang, W.; Wang, C.; Li, Y.; Zhang, R.; and Fei-Fei, L. 2025. ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation. In Conference on Robot Learning, 4573–4602. PMLR.
Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; and Fei-Fei, L. 2023. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. arXiv preprint arXiv:2307.05973.
Kaelbling, L. P.; and Lozano-Pérez, T. 2011. Hierarchical Task and Motion Planning in the Now. In 2011 IEEE International Conference on Robotics and Automation, 1470–1477. IEEE.
Kim, M. J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.; Lam, G.; Sanketi, P.; et al. 2024. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv preprint arXiv:2406.09246.
Kumar, N.; Ramos, F.; Fox, D.; and Garrett, C. R. 2024. Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints. arXiv preprint arXiv:2411.08253.
Kumar, V.; Shah, R.; Zhou, G.; Moens, V.; Caggiano, V.; Gupta, A.; and Rajeswaran, A. 2023. RoboHive: A Unified Framework for Robot Learning. In Advances in Neural Information Processing Systems, volume 36, 44323–44340. Curran Associates, Inc.
Li, C.; Zhang, R.; Wong, J.; Gokmen, C.; Srivastava, S.; Martín-Martín, R.; Wang, C.; Levine, G.; Lingelbach, M.; Sun, J.; Anvari, M.; Hwang, M.; Sharma, M.; Aydin, A.; Bansal, D.; Hunter, S.; Kim, K.-Y.; Lou, A.; Matthews, C. R.; Villa-Renteria, I.; Tang, J. H.; Tang, C.; Xia, F.; Savarese, S.; Gweon, H.; Liu, K.; Wu, J.; and Fei-Fei, L. 2023. BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation. In Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, 80–93. PMLR.
Liang, Y.; Kumar, N.; Tang, H.; Weller, A.; Tenenbaum, J. B.; Silver, T.; Henriques, J. F.; and Ellis, K. 2025. VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning. In The Thirteenth International Conference on Learning Representations.
Lim, C.-U.; Baumgarten, R.; and Colton, S. 2010. Evolving Behaviour Trees for the Commercial Game DEFCON. In Applications of Evolutionary Computation: EvoApplications 2010, Istanbul, Turkey, April 7–9, 2010, Proceedings, Part I, 100–110. Springer.
Liu, B.; Jiang, Y.; Zhang, X.; Liu, Q.; Zhang, S.; Biswas, J.; and Stone, P. 2023. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency. arXiv.
Mahdavi, S.; Aoki, R.; Tang, K.; and Cao, Y. 2024. Leveraging Environment Interaction for Automated PDDL Generation and Planning with Large Language Models. arXiv preprint arXiv:2407.12979.
Mordoch, A.; Scala, E.; Stern, R.; and Juba, B. 2024. Safe Learning of PDDL Domains with Conditional Effects. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 34, 387–395.
Mu, Y.; Chen, J.; Zhang, Q.; Chen, S.; Yu, Q.; Ge, C.; Chen, R.; Liang, Z.; Hu, M.; Tao, C.; et al. 2024.
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis. arXiv preprint arXiv:2402.16117.
Neupane, A.; and Goodrich, M. 2019. Learning Swarm Behaviors Using Grammatical Evolution and Behavior Trees. In IJCAI, 513–520. IJCAI Organization.
Neupane, A.; and Goodrich, M. A. 2023. Designing Behavior Trees from Goal-Oriented LTLf Formulas. arXiv preprint arXiv:2307.06399.
Neupane, A.; Goodrich, M. A.; and Mercer, E. G. 2018. GEESE: Grammatical Evolution Algorithm for Evolution of Swarm Behaviors. 999–1006.
Newton, M. A.; and Levine, J. 2010. Implicit Learning of Compiled Macro-Actions for Planning. In ECAI 2010, 323–328. IOS Press.
Ögren, P.; and Sprague, C. I. 2022. Behavior Trees in Robot Control Systems. Annual Review of Control, Robotics, and Autonomous Systems, 5: 81–107.
Pan, L.; Albalak, A.; Wang, X.; and Wang, W. Y. 2023. Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning. arXiv preprint arXiv:2305.12295.
Pereira, R. d. P.; and Engel, P. M. 2015. A Framework for Constrained and Adaptive Behavior-Based Agents. arXiv preprint arXiv:1506.02312.
Rodrigues, C.; Gérard, P.; Rouveirol, C.; and Soldano, H. 2012. Active Learning of Relational Action Models. In Inductive Logic Programming: 21st International Conference, ILP 2011, Windsor Great Park, UK, July 31–August 3, 2011, Revised Selected Papers 21, 302–316. Springer.
Scheide, E.; Best, G.; and Hollinger, G. A. 2021. Behavior Tree Learning for Robotic Task Planning through Monte Carlo DAG Search over a Formal Grammar. 4837–4843. Xi'an, China: IEEE. ISBN 978-1-72819-077-8.
Sundaralingam, B.; Hari, S. K. S.; Fishman, A.; Garrett, C.; Van Wyk, K.; Blukis, V.; Millane, A.; Oleynikova, H.; Handa, A.; Ramos, F.; et al. 2023. cuRobo: Parallelized Collision-Free Robot Motion Generation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 8112–8119. IEEE.
Tadewos, T.
G.; Newaz, A. A. R.; and Karimoddini, A. 2022. Specification-Guided Behavior Tree Synthesis and Execution for Coordination of Autonomous Systems. Expert Systems with Applications, 201: 117022.
Thomason, W.; Kingston, Z.; and Kavraki, L. E. 2024. Motions in Microseconds via Vectorized Sampling-Based Planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 8749–8756.
Valmeekam, K.; Marquez, M.; Olmo, A.; Sreedharan, S.; and Kambhampati, S. 2023. PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. In Advances in Neural Information Processing Systems, volume 36, 38975–38987. Curran Associates, Inc.
Yang, Z.; Garrett, C.; Fox, D.; Lozano-Pérez, T.; and Kaelbling, L. P. 2024. Guiding Long-Horizon Task and Motion Planning with Vision Language Models. arXiv preprint arXiv:2410.02193.
Zare, M.; Kebria, P. M.; Khosravi, A.; and Nahavandi, S. 2024. A Survey of Imitation Learning: Algorithms, Recent Developments, and Challenges. IEEE Transactions on Cybernetics.
Zhang, K.; Ren, P.; Lin, B.; Lin, J.; Ma, S.; Xu, H.; and Liang, X. 2024. PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation. arXiv preprint arXiv:2410.10394.
Zhen, H.; Qiu, X.; Chen, P.; Yang, J.; Yan, X.; Du, Y.; Hong, Y.; and Gan, C. 2024. 3D-VLA: A 3D Vision-Language-Action Generative World Model. arXiv preprint arXiv:2403.09631.
Zhong, Y.; Bai, F.; Cai, S.; Huang, X.; Chen, Z.; Zhang, X.; Wang, Y.; Guo, S.; Guan, T.; Lui, K. N.; et al. 2025. A Survey on Vision-Language-Action Models: An Action Tokenization Perspective. arXiv preprint arXiv:2507.01925.
Zhou, A.; Yan, K.; Shlapentokh-Rothman, M.; Wang, H.; and Wang, Y.-X. 2024. Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models.
In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, 62138–62160. PMLR.
Zhuo, H. H.; and Yang, Q. 2014. Action-Model Acquisition for Planning via Transfer Learning. Artificial Intelligence, 212: 80–103.
