Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning

Jooyoung Kim, Wonje Choi, Younguk Song, Honguk Woo*
Department of Computer Science and Engineering, Sungkyunkwan University
{onsaemiro, wjchoi1995, syw2045, hwoo}@skku.edu

Abstract

Recent advances in Vision-Language Models (VLMs) have enabled video-instructed robotic programming, allowing agents to interpret video demonstrations and generate executable control code. We formulate video-instructed robotic programming as a cross-domain adaptation problem, where perceptual and physical differences between demonstration and deployment induce procedural mismatches. However, current VLMs lack the procedural understanding needed to reformulate causal dependencies and achieve task-compatible behavior under such domain shifts. We introduce NeSyCR, a neurosymbolic counterfactual reasoning framework that enables verifiable adaptation of task procedures, providing reliable synthesis of code policies. NeSyCR abstracts video demonstrations into symbolic trajectories that capture the underlying task procedure. Given deployment observations, it derives counterfactual states that reveal cross-domain incompatibilities. By exploring the symbolic state space with verifiable checks, NeSyCR proposes procedural revisions that restore compatibility with the demonstrated procedure. NeSyCR achieves a 31.14% improvement in task success over the strongest baseline, Statler, showing robust cross-domain adaptation across both simulated and real-world manipulation tasks.

1. Introduction

Advances in foundation models have accelerated progress toward general-purpose embodied intelligence, enabling robots to interpret human instructions and execute complex tasks as autonomous control policies [7, 15, 25, 51].
In particular, Large Language Models (LLMs) with code-writing capabilities have inspired the Code-as-Policies paradigm, in which executable control code is synthesized from language instructions using predefined APIs [24, 26, 36, 54]. Furthermore, Vision-Language Models (VLMs) have extended this paradigm toward a general form of video-instructed robotic programming, where robotic programs are generated from instructional video demonstrations by translating observed task sequences into structured task specifications that can be compiled into control code [56, 58, 61, 65]. By capturing richer perceptual context and task intent from demonstrations, these methods enable more grounded robotic programming than text instructions alone.

*Honguk Woo is the corresponding author.

In video-instructed robotic programming, domain gaps between the demonstration and deployment are inevitable due to inherent differences in environmental layouts, object properties, and morphological constraints [11, 16, 45, 46]. While sensory observations can reveal physical discrepancies between the demonstration and the deployment environment, these observations alone do not explain how structural differences disrupt the underlying task procedure or how actions will causally unfold under altered conditions.

To address this limitation, we present NeSyCR, a neurosymbolic framework that formulates cross-domain adaptation from demonstrations as counterfactual reasoning. At its core, NeSyCR infers how task outcomes would change under altered domain factors and proposes alternative actions that revise the demonstrated behavior to restore task-oriented behavioral compatibility. Given a demonstration, NeSyCR constructs a symbolic world model that encodes the task procedure as a symbolic abstraction of the trajectory.
Leveraging this model, NeSyCR performs neurosymbolic counterfactual adaptation to revise the procedure for the deployment domain, enabling verifiable adaptation of the demonstrated procedure across domains.

As shown in Figure 1, the demonstration domain involves a human performing a drawer-organizing task with objects such as a magnetic hook and a box of screws placed on the table. In the deployment domain, however, the human hand is replaced by a robotic gripper, the magnetic hook is already arranged inside the drawer, and the screws are scattered across the surface. These differences create a procedural incompatibility: direct grasping becomes infeasible, requiring the magnetic hook to be repurposed as an auxiliary tool for gathering the screws. Furthermore, this modification introduces a cascading incompatibility: objects to be organized later in the procedure can obstruct the magnet, rendering it inaccessible. Resolving these coupled incompatibilities requires reordering the procedure so that the magnet is retrieved and used to aggregate the screws before the target objects are placed. However, a direct demo-to-code approach using VLM-based prompting would still attempt to grasp the scattered screws with the gripper, which fails due to their small size and dispersion. In contrast, NeSyCR identifies these incompatibilities through VLM-based symbolic translation and symbolic forward simulation. By combining the VLM's commonsense reasoning with symbolic verification, it produces an adapted and coherent procedure that completes the task. Further details are illustrated in Figure 5.

Figure 1. Overview of NeSyCR in a drawer-organizing task scenario. (Left) Illustration of the domain gap between the demonstration and deployment. (Middle) Overview of the NeSyCR framework, which generates an adapted procedure via neurosymbolic counterfactual reasoning. (Right) Outcome of the adapted procedure, showing that NeSyCR successfully executes the task via a grounded code policy.

To construct the symbolic world model, the VLM abstracts the demonstration into a symbolic trajectory that captures objects and their spatial and temporal relations, while the symbolic tool verifies these representations for logical consistency. In a deployment domain, NeSyCR identifies structural variations through this symbolic world model, detecting where the demonstrated procedure fails to reach the goal state under counterfactual states derived from target-domain observations. The VLM then proposes alternative action operators for the incompatible steps, and the symbolic tool verifies their causal validity, yielding revised procedural steps that restore compatibility. This reasoning process produces an adapted task specification that remains logically valid while preserving the task intent of the demonstration. The adapted specification is then compiled into a reliable, deployment-grounded code policy.

We evaluate NeSyCR on a diverse set of video-instructed robotic programming scenarios, deploying robotic agents in both simulated and real-world environments. The cross-domain setting between demonstration and deployment is characterized by domain factors that capture variations in environmental and embodiment configurations. The scenarios consist of long-horizon manipulations involving multiple subtask types and requiring up to 116 visual-motor API calls, yielding compositional and procedurally complex tasks. NeSyCR outperforms the strongest baseline, Statler [66], achieving an average improvement of 31.14% in task success rate, as reported in Tables 1 and 2.

Our contributions are summarized as follows: (1) We present the NeSyCR framework that casts cross-domain adaptation from demonstrations as counterfactual reasoning for video-instructed robotic programming. (2) We implement a VLM-symbolic tool pipeline that proposes and verifies alternative action steps, ensuring procedural compatibility across domains. (3) We evaluate NeSyCR across simulated and real-world robotic manipulation tasks using an experimental design that provides high granularity over domain factors and task complexity, enabling precise analysis of its cross-domain procedural adaptation.

2. Problem formulation

We formulate the embodied domain as a tuple M = (S, A, T, g), where S denotes the state space, A the action space, T : S × A → S the transition function, and g ∈ G the goal state. Under partial observability [52], the agent receives observations o_t ∈ O at each timestep t via an observation function Ω : S × A → O, which maps the underlying state to perceptual observations (e.g., RGB images).
In cross-domain settings, a demonstration D = ({o_t}_{t=1}^N, ℓ) specifies a domain M_S = (S_S, A_S, T_S, g) performed under specific environmental and embodiment configurations, optionally with a language description ℓ. The agent must achieve the same goal in a deployment domain M_T = (S_T, A_T, T_T, g), where S_S ≠ S_T, A_S ≠ A_T, or T_S ≠ T_T. Our objective is to optimize a policy π_θ from a single demonstration to maximize task success in the deployment domain:

π_θ* = argmax_{π_θ} E_{τ ∼ p(·|π_θ, M_T)} [ SR(τ, g) − λ D(τ, D) ]   (1)

where τ = {o_1, a_1, ..., o_N, a_N} denotes a trajectory sampled from p(τ | π_θ, M_T), induced by executing π_θ in M_T. Here, SR(τ, g) measures task success, D(τ, D) measures the deviation between τ and D, and λ is a weighting factor. Thus, the policy π_θ* aims to maximize task success in the deployment domain while maintaining alignment with the demonstration. Each timestep t corresponds to a semantically coherent interval, where the observation o_t reflects the causal effect of action a_t [22, 27]. We represent π_θ as executable control code such that τ corresponds to its execution trace during deployment [36, 58].

3. Neurosymbolic Counterfactual Reasoning

We formulate video-instructed robotic programming as a cross-domain adaptation problem, in which the agent must transform a recorded instructional video demonstration into a logically verified and deployable task specification for execution in deployment domains with diverse environmental or embodiment conditions. Such domain gaps alter how the agent interacts with its environment, thereby inducing procedural discrepancies between the demonstrated and deployable behaviors.
Rather than directly imitating the demonstrated behavior, the agent must reason about whether and how the demonstrated procedure should be revised under structural variations, thereby adjusting its actions to preserve task-level consistency.

We address this challenge through neurosymbolic counterfactual reasoning (NeSyCR), which bridges the domain gaps between demonstrations and target observations by enabling procedure adaptation. NeSyCR translates a demonstration into a symbolic representation of preconditions, actions, and effects, thereby constructing a symbolic world model. Given a target observation, the framework identifies its corresponding counterfactual situation by verifying which preconditions in the procedure are satisfied or violated. Based on this, NeSyCR adapts the demonstrated procedure by adding or removing actions so that the resulting procedure aligns the counterfactual state with the desired preconditions and effects. Specifically, NeSyCR integrates VLM-based procedural alternative generation with symbolic verification and operates in two phases: (i) symbolic world model construction and (ii) neurosymbolic counterfactual adaptation. In phase (i), the demonstration is abstracted into a compact symbolic state sequence, from which a symbolic world model is constructed. In phase (ii), the framework contrasts this symbolic world model with target-domain observations to derive counterfactual states. The symbolic tool identifies where the demonstrated procedure fails, the VLM proposes alternatives to repair these steps, and the symbolic tool verifies their causal validity.

Figure 2. Symbolic world model construction
Through this neurosymbolic loop, NeSyCR produces a causally coherent, deployment-grounded task specification that is compiled into an executable code policy.

3.1. Symbolic world model construction

NeSyCR translates the demonstration into a symbolic world model that encodes the causal structure of the demonstrated behavior, ensuring its reproducibility within a symbolic state space. As shown in Figure 2, rather than treating the demonstration as a raw sequence of observations, the VLM abstracts it into symbolic transitions. Consecutive observations are parsed into scene graphs representing symbolic states, from which the VLM predicts a symbolic operator that specifies the preconditions and effects of the executed action. The symbolic tool then verifies the consistency of these transitions and integrates them into a unified world model that is logically coherent and reconstructable with respect to the demonstrated behavior. This symbolic world model W serves as a plan verification model and is expressed in a STRIPS-style formalism [17], supporting forward execution and logical validation.

W = (Q, P, A, Φ),  Φ : (s, a) ↦ s′   (2)

Here, Q denotes the set of objects identified in the scene, P the set of predicates representing object affordances and spatial relations, and A the set of symbolic actions, each defined by preconditions and effects over P. Φ denotes the symbolic tool (e.g., VAL [18, 23]) responsible for forward simulation and consistency verification. Given a current symbolic state s ∈ 2^P and an action a ∈ A whose preconditions are satisfied in s, Φ applies the effects of a to produce the next symbolic state s′.

Symbolic state translation. To obtain a symbolic state sequence from the demonstration, NeSyCR prompts the VLM Ψ to generate grounded scene graphs for key frames [34, 60].
At each timestep t, Ψ extracts object entities and spatial relations from the image observation o_t and language description ℓ, forming a symbolic state s_t:

Ψ({o_1, ..., o_N}; ℓ) = {s_1, ..., s_N},  Q = ⋃_{t=1}^N Object(s_t),  P = ⋃_{t=1}^N Predicate(s_t)   (3)

where Object(s_t) and Predicate(s_t) denote the objects and predicates grounded in s_t, respectively. The resulting sequence {s_1, ..., s_N} represents a dynamic scene graph as a sequence of symbolic states derived from the demonstration, grounded in the object set Q and predicate set P.

Symbolic dynamics reconstruction. Given the symbolic state sequence {s_1, ..., s_N}, NeSyCR abduces a set of symbolic action operators A that capture the causal transitions between consecutive states. For each state pair (s_t, s_{t+1}), the VLM Ψ predicts an action operator a_t = Ψ(s_t, s_{t+1}) whose effects correspond to the state difference s_{t+1} \ s_t, and whose preconditions hold in s_t. Each a_t is represented as a tuple (name, pre, eff) and appended to A. To verify that A is consistent with the symbolic state sequence, the symbolic tool Φ performs forward simulation by applying each a_t to s_t and ensuring that the resulting symbolic state satisfies s_{t+1} at every step.

∀t ∈ [1, N−1], Φ(s_t, a_t) ⊨ s_{t+1} ⇒ Verified(W)   (4)

If this condition holds over the entire trajectory, W is constructed and verified, and the predicted action sequence is adopted as the demonstrated procedure π = {a_1, ..., a_{N−1}}.

3.2. Neurosymbolic counterfactual adaptation

Based on the symbolic world model, NeSyCR performs counterfactual adaptation in a neurosymbolic manner, identifying causal inconsistencies induced by target-domain constraints and revising the demonstrated procedure to restore causal consistency.
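The construction-time check of Eq. (4) can be sketched with a minimal STRIPS-style simulator. This is an illustrative sketch only: the object names and predicates are invented, effects are split into add/delete sets as in standard STRIPS, and set equality is used as a strict form of the satisfaction check.

```python
from dataclasses import dataclass

# Minimal STRIPS-style operator: preconditions, add effects, delete effects.
# All names and predicates below are illustrative, not the paper's actual domain.
@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset      # predicates that must hold before execution
    add: frozenset      # predicates made true by the action
    delete: frozenset   # predicates made false by the action

def apply(state: frozenset, a: Action) -> frozenset:
    """Forward simulation Phi(s, a) -> s'; raises if preconditions fail."""
    if not a.pre <= state:
        raise ValueError(f"precondition violated for {a.name}")
    return (state - a.delete) | a.add

def verified(states, actions) -> bool:
    """Eq. (4): every simulated transition must reproduce the observed state
    (equality used here as a strict reading of the satisfaction relation)."""
    for s, a, s_next in zip(states, actions, states[1:]):
        try:
            if apply(s, a) != s_next:
                return False
        except ValueError:
            return False
    return True

# Toy pick-and-place trajectory abstracted from a demonstration.
s0 = frozenset({"on(cup,table)", "empty(hand)"})
s1 = frozenset({"holding(cup)"})
s2 = frozenset({"on(cup,shelf)", "empty(hand)"})
grasp = Action("grasp_cup", frozenset({"on(cup,table)", "empty(hand)"}),
               frozenset({"holding(cup)"}),
               frozenset({"on(cup,table)", "empty(hand)"}))
place = Action("place_cup", frozenset({"holding(cup)"}),
               frozenset({"on(cup,shelf)", "empty(hand)"}),
               frozenset({"holding(cup)"}))
print(verified([s0, s1, s2], [grasp, place]))  # True
```

If verification fails at any step, the operator for that transition is repredicted, matching line 5 of Algorithm 1.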
As shown in Figure 3, given an observation from the deployment domain, the VLM generates a target symbolic state that serves as a counterfactual configuration by intervening on variables reflecting the target-domain conditions. Using this counterfactual state, the symbolic tool performs forward simulation along the demonstrated procedure to identify actions whose preconditions are inconsistent with the counterfactual setting, thereby revealing cross-domain inconsistencies that hinder execution. To resolve these conflicts, the VLM abduces alternative procedure steps through the insertion or removal of actions, while the symbolic tool iteratively verifies their logical consistency. By iterating this exploration, the adapted procedure is refined into a deployment-grounded task specification, from which an executable code policy is synthesized.

Figure 3. Neurosymbolic counterfactual adaptation

Counterfactual identification. As the first step of neurosymbolic counterfactual adaptation, NeSyCR identifies causal inconsistencies in the demonstrated procedure through forward simulation under counterfactual conditions. The VLM Ψ generates a counterfactual state ŝ_1 that reflects the target observation, while the symbolic tool Φ simulates each action a_t in the demonstrated procedure to estimate its outcome in the deployment domain.

ŝ_{t+1} = Φ(ŝ_t, a_t),  t = 1, ..., N−1   (5)

An action is regarded as inconsistent when its preconditions are not satisfied in the current counterfactual state or when its effects fail to reproduce the expected predicates in the corresponding next state s_{t+1}.
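A minimal sketch of this identification step, assuming a set-based stand-in for the symbolic tool Φ; the dict-based operator encoding and the predicate names are illustrative, and delete effects are ignored for brevity.

```python
# Roll the demonstrated actions forward from a counterfactual initial state
# and flag steps whose preconditions fail or whose (add-)effects no longer
# reproduce the demonstrated next state. All predicates are illustrative.

def simulate(state, action):
    """Phi(s, a): returns the next state, or None if preconditions are violated."""
    if not action["pre"] <= state:
        return None
    return (state - action["del"]) | action["add"]

def identify_inconsistent(s_hat_1, demo_actions, demo_states):
    """Indices of demonstrated actions that are inconsistent under the
    counterfactual initial state (precondition or effect mismatch)."""
    bad, s_hat = [], s_hat_1
    for t, (a, s_next) in enumerate(zip(demo_actions, demo_states[1:])):
        nxt = simulate(s_hat, a)
        if nxt is None or not a["add"] <= s_next:
            bad.append(t)
            nxt = s_hat  # keep rolling forward for illustration
        s_hat = nxt
    return bad

# Demonstration: grasp the screws directly. Counterfactual deployment:
# the screws are scattered, so the grasp precondition no longer holds.
demo_states = [frozenset({"graspable(screws)"}), frozenset({"holding(screws)"})]
grasp = {"name": "grasp_screws",
         "pre": frozenset({"graspable(screws)"}),
         "add": frozenset({"holding(screws)"}),
         "del": frozenset({"graspable(screws)"})}
counterfactual = frozenset({"scattered(screws)"})
print(identify_inconsistent(counterfactual, [grasp], demo_states))  # [0]
```

The flagged indices are exactly the steps handed to the exploration phase for repair.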
Inconsistent(a_t) ⇔ pre(a_t) ⊈ ŝ_t ∨ eff(a_t) ⊈ s_{t+1}   (6)

Such inconsistencies in the procedure must be resolved to restore causal consistency across the entire procedure in the deployment domain.

Counterfactual exploration. To resolve these inconsistencies, NeSyCR performs counterfactual exploration within the symbolic state space, grounding the task procedure through additive and subtractive modifications. For each inconsistent action identified, the VLM Ψ proposes alternative actions whose effects restore the violated preconditions of the subsequent valid action a_{t+1}. If no such alternative is applicable or the action becomes redundant to the task objective, it is removed from the procedure. Let the action revised through counterfactual exploration be

ã_t = Ψ(ŝ_t, a_t; s_{t+1}) if Inconsistent(a_t), and ã_t = a_t otherwise.   (7)

The adapted procedure π̃, derived from Eq. (7), restores procedural compatibility in the deployment domain by satisfying

∀t ∈ [1, N−1], Φ(ŝ_t, ã_t) = ŝ_{t+1}, until ŝ_{t+1} ⊨ s_N.   (8)

Once π̃ reaches the goal condition specified by s_N, Ψ translates π̃ into a code policy π_θ = Ψ(π̃), ensuring that it remains grounded in the deployment domain. Algorithm 1 summarizes the overall process of NeSyCR.

4. Evaluation

We evaluate NeSyCR through experiments designed to address the following questions: (Q1) How does NeSyCR perform compared to existing baselines in cross-domain demo-to-code settings? (Q2) How robust is NeSyCR to increasing domain gaps between the demonstration and deployment? (Q3) How does NeSyCR respond as the complexity gap between the demonstration and deployment tasks grows? (Q4) What is the contribution of each component of NeSyCR?

4.1. Experiment setting

Cross-domain settings. Our cross-domain settings are defined by domain factors that introduce perceptual and physical variations between the demonstration and deployment domains.
We consider five such factors that induce procedural differences in task execution: (1) Obstruction introduces interfering objects that require additional resolving steps. (2) Object affordance alters object states and inter-object relations, yielding new affordances and relational dependencies. (3) Kinematic configuration changes the robot's joint structure, affecting its reachable workspace and motion constraints. (4) Gripper type modifies the end-effector design, altering feasible grasp actions and contact affordances. (5) Combination jointly applies multiple domain factors, ranging from environmental factors (i.e., (1)-(2)) to embodiment factors (i.e., (3)-(4)), to create diverse and complex cross-domain scenarios. For demonstrations, we use data collected directly from both simulated and real environments, including robot executions and human-performed instructional videos.

Algorithm 1 NeSyCR Framework
Input: Demonstration D = ({o_t}_{t=1}^N, ℓ), Target Observation ô, VLM Ψ, Symbolic Tool Φ, Symbolic Action Set A = {}
1: /* Symbolic world model construction */
2: Get Q, P, and {s_t}_{t=1}^N via Eq. (3)
3: for t = 1, ..., N−1 do
4:   Get grounded action a_t = Ψ(s_t, s_{t+1})
5:   Verify a_t using Φ; otherwise repredict a_t  (cf. Eq. (4))
6:   A ← A ∪ {a_t}
7: end for
8: Get verified symbolic world model W = (Q, P, A, Φ)
9: /* Neurosymbolic counterfactual adaptation */
10: Initialize adapted procedure π̃ = [ ], t ← 0
11: repeat
12:   t ← t + 1; Get counterfactual state ŝ_t = Ψ(s_t; ô)
13:   Do forward simulation using Eq. (5)
14:   if Inconsistent(a_t) then
15:     Get interpolated action ã_t = Ψ(ŝ_t, a_t; s_{t+1})
16:   else
17:     ã_t = a_t
18:   end if  (cf. Eq. (6) & Eq. (7))
19:   Verify ã_t using Φ and get ŝ_{t+1} via Eq. (8)
20:   Append ã_t to π̃
21: until ŝ_{t+1} ⊨ s_N
22: Synthesize code policy π_θ = Ψ(π̃)
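A compressed sketch of the adaptation loop in Algorithm 1, with a hand-written proposer standing in for the VLM Ψ and a set-based simulator standing in for Φ; all action names, predicates, and the bounded-exploration limit are illustrative assumptions.

```python
def apply(state, a):
    """Set-based stand-in for the symbolic tool Phi; None if preconditions fail."""
    if not a["pre"] <= state:
        return None
    return (state - a["del"]) | a["add"]

def adapt(s_hat, demo_actions, goal, propose):
    """Keep consistent demonstrated actions; insert verified alternatives
    (additive revision) or drop unrepairable steps (subtractive revision)."""
    plan = []
    for a in demo_actions:
        nxt = apply(s_hat, a)
        for _ in range(5):  # bounded exploration for this sketch
            if nxt is not None:
                break
            alt = next((x for x in propose(s_hat, a)
                        if apply(s_hat, x) is not None), None)
            if alt is None:
                break  # no verified repair: drop the action
            plan.append(alt)
            s_hat = apply(s_hat, alt)
            nxt = apply(s_hat, a)
        if nxt is not None:
            plan.append(a)
            s_hat = nxt
    return [a["name"] for a in plan], s_hat >= goal

# Demonstrated grasp assumes gathered screws; deployment shows them scattered.
grasp = {"name": "grasp_screws", "pre": frozenset({"gathered(screws)"}),
         "add": frozenset({"holding(screws)"}), "del": frozenset({"gathered(screws)"})}
sweep = {"name": "sweep_with_magnet",
         "pre": frozenset({"scattered(screws)", "have(magnet)"}),
         "add": frozenset({"gathered(screws)"}), "del": frozenset({"scattered(screws)"})}

def propose(state, failed_action):  # stub standing in for the VLM's proposals
    return [sweep]

plan, reached = adapt(frozenset({"scattered(screws)", "have(magnet)"}),
                      [grasp], frozenset({"holding(screws)"}), propose)
print(plan, reached)  # ['sweep_with_magnet', 'grasp_screws'] True
```

The verified insertion of `sweep_with_magnet` before the demonstrated grasp mirrors the additive revision of Eq. (7); an exhausted proposer triggers the subtractive case.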
For deployment, simulated experiments are conducted in Genesis [70], a physics-based general-purpose robotics platform, while real-world evaluations are performed on a 7-DoF Franka Emika Research 3.

Benchmark tasks. The deployment setting comprises long-horizon manipulation scenarios (up to 116 API calls per scenario) focused on table organization and object rearrangement [31, 59]. Each scenario is composed of primitive subtasks (e.g., pick&place, sweep, rotate, slide), forming procedurally compositional manipulation sequences. This compositional structure allows systematic control over task complexity. For evaluation, scenarios are categorized into 3 complexity levels (Low, Medium, and High) according to the number, diversity, and dependency depth of subtasks. Higher levels correspond to longer action horizons and stronger inter-subtask dependencies. A total of 440 scenarios are categorized into three complexity levels: 160 for Low, 160 for Medium, and 120 for High.

Table 1. Performance on cross-domain demo-to-code tasks using simulation-based demonstration and deployment.

Cross-domain factor: Obstruction and Object affordance
Method          | Low: SR / GC / PD                    | Medium: SR / GC / PD                  | High: SR / GC / PD
Demo2Code       | 26.67±5.76 / 55.00±4.24 / 5.00±3.42  | 25.00±5.64 / 51.11±4.37 / 10.67±5.11  | 22.50±6.69 / 61.25±4.38 / 6.94±2.20
GPT4V-Robotics  | 71.67±5.87 / 82.50±3.91 / 29.15±5.34 | 41.67±6.42 / 68.89±4.11 / 43.18±5.09  | 20.00±6.41 / 57.50±4.49 / 35.94±10.68
Critic-V        | 45.00±6.48 / 65.83±4.52 / 10.37±4.58 | 35.00±6.21 / 64.44±4.18 / 24.97±4.95  | 15.00±5.72 / 58.75±3.96 / 45.83±9.50
MoReVQA         | 53.33±6.49 / 71.67±4.35 / 32.29±5.17 | 43.33±6.45 / 75.00±3.23 / 25.17±6.07  | 27.50±7.15 / 73.75±3.22 / 40.91±8.10
Statler         | 61.67±6.33 / 75.83±4.37 / 38.92±4.86 | 41.67±6.42 / 71.67±3.62 / 49.07±5.95  | 5.00±3.49 / 60.62±3.09 / 84.38±15.62
LLM-DM          | 51.67±6.51 / 67.50±4.87 / 66.02±4.19 | 20.00±5.21 / 47.78±4.37 / 77.13±3.21  | 12.50±5.30 / 53.12±4.49 / 91.25±5.45
NeSyCR          | 86.67±4.43 / 92.50±2.61 / 1.92±1.35  | 75.00±5.64 / 90.56±2.25 / 10.57±2.22  | 60.00±7.84 / 83.12±3.94 / 12.50±2.91

Cross-domain factor: Kinematic configuration and Gripper type
Method          | Low: SR / GC / PD                    | Medium: SR / GC / PD                  | High: SR / GC / PD
Demo2Code       | 33.33±6.14 / 36.67±6.04 / 0.00±0.00  | 25.00±5.64 / 28.33±5.70 / 0.00±0.00   | 20.00±6.41 / 20.00±6.41 / 7.81±2.29
GPT4V-Robotics  | 60.00±6.38 / 68.33±5.44 / 17.22±4.91 | 56.67±6.45 / 81.67±3.01 / 38.86±4.38  | 30.00±7.34 / 76.88±3.40 / 41.15±9.50
Critic-V        | 36.67±6.27 / 54.17±5.22 / 0.00±0.00  | 25.00±5.64 / 59.44±3.81 / 8.00±5.45   | 22.50±6.69 / 66.25±3.96 / 8.33±2.08
MoReVQA         | 48.33±6.51 / 63.33±5.16 / 13.10±5.70 | 31.67±6.06 / 70.56±3.08 / 46.43±6.43  | 25.00±6.93 / 69.38±3.52 / 20.00±7.95
Statler         | 68.33±6.06 / 74.17±5.25 / 29.43±5.14 | 50.00±6.51 / 70.56±4.35 / 53.35±4.31  | 42.50±7.92 / 74.38±4.24 / 50.00±5.60
LLM-DM          | 53.33±6.49 / 69.17±4.77 / 58.33±5.32 | 35.00±6.21 / 62.22±4.15 / 81.69±3.70  | 27.50±7.15 / 67.50±4.12 / 76.14±9.79
NeSyCR          | 93.33±3.25 / 96.67±1.62 / 0.00±0.00  | 78.33±5.36 / 85.00±4.15 / 5.11±2.47   | 52.50±8.00 / 78.12±4.22 / 10.12±3.18

Cross-domain factor: Combination (Obstruction, Object affordance, Kinematic configuration, and Gripper type)
Method          | Low: SR / GC / PD                    | Medium: SR / GC / PD                  | High: SR / GC / PD
Demo2Code       | 30.00±7.34 / 46.25±6.55 / 0.00±0.00  | 20.00±6.41 / 48.33±5.96 / 5.56±3.64   | 7.50±4.22 / 28.75±5.77 / 12.50±6.25
GPT4V-Robotics  | 55.00±7.97 / 70.00±5.88 / 15.76±5.80 | 35.00±7.64 / 69.17±4.53 / 32.38±7.25  | 15.00±5.72 / 63.12±3.80 / 36.46±13.35
Critic-V        | 40.00±7.84 / 60.00±5.99 / 5.00±3.42  | 37.50±7.75 / 69.17±4.68 / 5.93±3.41   | 27.50±7.15 / 66.25±4.44 / 16.48±4.15
MoReVQA         | 55.00±7.97 / 72.50±5.36 / 14.85±4.88 | 45.00±7.97 / 68.33±5.46 / 48.49±8.61  | 25.00±6.93 / 62.50±4.56 / 25.62±6.81
Statler         | 67.50±7.50 / 81.25±4.63 / 30.12±5.19 | 52.50±8.00 / 82.50±3.15 / 64.42±4.22  | 32.50±7.50 / 73.12±4.04 / 61.06±5.26
LLM-DM          | 32.50±7.50 / 56.25±5.71 / 82.05±3.80 | 22.50±6.69 / 60.00±4.65 / 80.56±2.65  | 10.00±4.80 / 46.88±4.92 / 81.25±3.61
NeSyCR          | 80.00±6.41 / 88.75±3.79 / 2.81±1.97  | 65.00±7.64 / 83.33±4.30 / 6.60±2.53   | 47.50±8.00 / 75.00±4.48 / 17.11±3.16

Baselines. We implement six state-of-the-art baselines for video-instructed robotic programming, grouped into three categories: (1) VLM-based code policy synthesis, which generates task specifications from demonstrations and synthesizes code policies from the generated specifications, represented by Demo2Code [58]; (2) VLM-based reasoning, which produces target-domain task specifications through the reasoning capabilities of VLMs, including GPT4V-Robotics [55], Critic-V [69], and MoReVQA [41]; and (3) world-model-based approaches, which construct LLM-based or neurosymbolic world models that support the generation of target task specifications, including Statler [66] and LLM-DM [21]. The baselines in (2) and (3) are not designed for code policy synthesis; in our evaluation, we apply their adaptation mechanisms to task specification generation, from which code policies are synthesized.

Evaluation metrics. To evaluate the objectives in Eq. (1), we use several metrics. Success Rate (SR) is the percentage of tasks completed in full, with a task counted as successful only when all subtasks are achieved. Goal Condition (GC) measures the proportion of success conditions achieved, reflecting the degree of subtask completion [49]. Procedure Deviation (PD) quantifies the alignment between the adapted and demonstrated procedures using a success-weighted, length-normalized edit distance over their subtask-achievement sequences [19, 28].

4.2. Main result

To address (Q1), we evaluate NeSyCR in a simulated cross-domain environment where domain gaps are systematically induced through controlled variations of our domain factors. This setting allows fine-grained and reproducible analysis of code policy performance under different environmental and embodiment changes.
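The core of the PD metric, a length-normalized edit distance over subtask sequences, can be sketched as follows; the success weighting of the full metric [19, 28] is omitted here, so this is one plausible reading rather than the exact formula.

```python
def edit_distance(a, b):
    """Levenshtein distance between two subtask sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def procedure_deviation(executed, demonstrated):
    """Length-normalized edit distance in percent; 0 means exact imitation."""
    norm = max(len(executed), len(demonstrated), 1)
    return 100.0 * edit_distance(executed, demonstrated) / norm

# Hypothetical subtask-achievement sequences: one inserted repair step.
demo = ["open_drawer", "grasp_hook", "place_hook", "close_drawer"]
run  = ["open_drawer", "grasp_hook", "sweep_screws", "place_hook", "close_drawer"]
print(procedure_deviation(run, demo))  # 20.0
```

Under this reading, a method that copies the demonstration verbatim scores 0 even when it fails, which is consistent with the low PD but low SR observed for Demo2Code.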
As shown in Table 1, we compare NeSyCR with six state-of-the-art baselines across environmental, embodiment, and combined cross-domain scenarios. NeSyCR consistently outperforms all baselines in both SR and GC across every domain factor. On average, it achieves a 27.73% higher SR and 15.16% higher GC than GPT4V-Robotics, the strongest VLM-based reasoning baseline. Similarly, NeSyCR improves SR by 24.77% and GC by 13.20% over Statler, the strongest world-model-based baseline. These results demonstrate that NeSyCR provides substantially more robust cross-domain adaptation than existing methods.

The baselines in the VLM-based reasoning category, including GPT4V-Robotics, Critic-V, and MoReVQA, struggle to maintain consistent task specifications under domain shifts. In particular, erroneous feedback generation in Critic-V and error propagation across multi-stage reasoning in MoReVQA lead to substantial performance degradation, causing 40.91% and 32.95% drops in SR, respectively. Although Statler leverages symbolic state representations and supports VLM-guided simulative planning, its lack of symbolic tool integration and dependence on replanning from scratch result in substantial performance degradation, especially in high-complexity scenarios. LLM-DM, while employing symbolic tools, depends on constructing complete domain knowledge from a single demonstration, which frequently produces invalid or illogical plans, resulting in a 41.82% drop in SR. Demo2Code, which lacks a dedicated adaptation mechanism, exhibits a 49.09% drop in SR compared to NeSyCR. In terms of PD, however, Demo2Code achieves a score comparable to NeSyCR, with NeSyCR showing only a 1.73 average increase across domain factors. This is because Demo2Code tends to imitate the demonstration without accounting for the deployment domain, yielding procedurally similar but frequently unreliable code policies.
Table 2. Performance on cross-domain demo-to-code tasks using real-world demonstrations and deployment (Medium-Complexity; cross-domain factor: Combination).

Method          | SR           | GC           | PD
Demo2Code       | 0.00±0.00    | 25.00±3.57   | −
GPT4V-Robotics  | 50.00±18.90  | 75.00±9.64   | 0.00±0.00
Statler         | 50.00±18.90  | 67.86±15.45  | 42.86±24.74
NeSyCR          | 87.50±12.50  | 98.21±1.79   | 24.49±11.54

Furthermore, we evaluate NeSyCR in real-world deployment using a physical robot, with results presented in Table 2. Using demonstrations captured from human video recordings, we evaluate performance on the same task executed under an obstructive domain gap using a Franka Research 3 arm. In the demonstration, a human organizes objects into two parallel drawers. In the deployment domain, these drawers sit at a perpendicular angle, creating a mutual-interference constraint: one drawer cannot open unless the other is closed. Because the objects are arranged in stacks requiring ordered handling, the agent must alternate drawer operations, correctly inferring and maintaining the interference constraint throughout the procedure. NeSyCR effectively adapts to domain changes, achieving an 87.50% higher SR than Demo2Code and a 37.50% higher SR than both Statler and GPT4V-Robotics. GPT4V-Robotics often overlooks the constraints imposed by the perpendicular orientation and the stacked-object setting, which require alternating drawer operations; instead, it attempts to open both drawers at once. Statler maintains a world state but, lacking any explicit mechanism for enforcing this constraint, intermittently attempts to open both drawers at once throughout the procedure. In contrast, NeSyCR explicitly encodes the interference as a symbolic precondition and uses symbolic tools to verify its satisfaction throughout the entire procedure.
This enables an algorithmic process for detecting and resolving incompatible procedural steps regardless of horizon length, supporting reliable real-world control.

4.3. Analysis and ablation

Figure 4. Analysis on (a) domain gap and (b) task complexity gap: comparison of SR and PD across different levels of each gap.

Analysis on domain gap. To address (Q2), we examine the robustness of NeSyCR as the domain gap between demonstration and deployment increases in Figure 4a. We amplify variations in environmental factors while keeping task complexity fixed, creating increasingly challenging cross-domain evaluation settings. As the gap widens, the number of obstructing objects grows, requiring additional alternative actions to maintain procedural compatibility. For all gap levels, NeSyCR explores and revises procedural steps more accurately than the baselines, achieving higher SR and lower PD. However, VLM-based baselines struggle to infer verified alternative action operators required to restore procedural compatibility, leading to a sharp performance drop from the Moderate to the High gap level.

Analysis on task complexity gap. To address (Q3), we evaluate the performance of NeSyCR under increasing task complexity gaps between the demonstration and deployment domains, as shown in Figure 4b. Task complexity is controlled by increasing the number of subtasks and the depth of their dependencies while fixing a domain factor such as Obstruction. When the complexity gap is moderate (e.g., up to 3 additional subtasks), NeSyCR effectively adapts the procedure. Yet, as the gap widens, the deployment-domain task scenario diverges from the demonstrated one, causing a pronounced performance drop, as the demonstration no longer provides sufficient guidance and a fundamentally new demonstration becomes necessary.
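Inferring verified alternative action operators, as discussed above, amounts to a propose-and-verify loop: a neural model suggests actions, and a symbolic check accepts only those whose preconditions hold. The sketch below illustrates this division of labor under stated assumptions: `propose` stands in for a VLM call, and the precondition/effect tables are invented for illustration, not NeSyCR's actual operators:

```python
# Propose-and-verify loop: a stubbed neural proposer suggests actions;
# a symbolic check accepts only those whose preconditions hold.

PRECONDITIONS = {  # illustrative domain knowledge
    "open(drawer)": {"closed(drawer)", "clear(drawer)"},
    "remove(box)": {"on(box, drawer)"},
}
EFFECTS = {  # (add set, delete set) per action
    "open(drawer)": ({"open(drawer)"}, {"closed(drawer)"}),
    "remove(box)": ({"clear(drawer)"}, {"on(box, drawer)"}),
}

def verified(action, state):
    """Symbolic check: do the action's preconditions hold in state?"""
    return PRECONDITIONS[action] <= state

def apply(action, state):
    add, delete = EFFECTS[action]
    return (state - delete) | add

def propose(goal_action, state):
    """Stand-in for the VLM proposer: prefer the goal action, fall
    back to any other action that passes verification."""
    candidates = [goal_action] + [a for a in PRECONDITIONS if a != goal_action]
    for a in candidates:
        if verified(a, state):
            return a
    return None

state = {"closed(drawer)", "on(box, drawer)"}  # a box obstructs the drawer
plan = []
while "open(drawer)" not in state:
    action = propose("open(drawer)", state)
    assert action is not None, "no verifiable action available"
    plan.append(action)
    state = apply(action, state)

# The verifier rejects opening the obstructed drawer first, so the
# repaired plan removes the box before opening.
assert plan == ["remove(box)", "open(drawer)"]
```

The key property is that unverifiable proposals never enter the plan, which is what purely VLM-based baselines lack when the domain gap demands extra alternative actions.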
Table 3. Ablation study of NeSyCR (Medium-Complexity; cross-domain factor: Combination)

Method                        | SR           | GC           | PD
NeSyCR                        | 68.42 ± 7.64 | 84.21 ± 4.49 | 6.60 ± 2.53
NeSyCR w/o Eq. (8)            | 50.00 ± 8.22 | 78.07 ± 4.03 | 14.65 ± 3.20
NeSyCR w/o Eq. (6)            | 47.37 ± 8.21 | 77.19 ± 4.00 | 8.55 ± 2.61
NeSyCR w/o Eq. (6) & Eq. (8)  | 39.47 ± 8.04 | 74.56 ± 4.25 | 9.63 ± 3.90
NeSyCR w/o Eq. (4)            | 34.21 ± 7.80 | 67.54 ± 4.79 | 6.84 ± 3.89

Ablation study. To address (Q4), we conduct an ablation study to assess the contribution of each component within NeSyCR. As shown in Table 3, removing either the neurosymbolic counterfactual adaptation or the symbolic world model leads to a significant drop in SR and an increase in PD, underscoring the importance of these components for effective cross-domain demo-to-code performance. Specifically, removing the verification of alternative actions (w/o Eq. (8)) results in an 18.42% drop in SR, as the lack of verification allows the VLM to generate semantically plausible but logically inconsistent actions. Removing the counterfactual identification (Eq. (6)) prevents the framework from obtaining feedback from the symbolic tool on which procedural steps require revision, leading to a 21.05% drop in SR. When both components are removed (w/o Eq. (8) & Eq. (6)), SR decreases by 28.95%. Removing the symbolic world model (w/o Eq. (4)) causes the most severe degradation, as the framework can no longer perform symbolic tool-based verification, resulting in the lowest SR of 34.21%.

Figure 5. Visualization of a cross-domain demo-to-code task

Visualization of adaptation. To elaborate on the example in Figure 1, we visualize the demonstration and the adapted procedure in the real-world deployment scenario. As shown in Figure 5, the demonstration domain features a human performing a drawer-organizing task with a magnetic hook and a box of screws on the table.
In the deployment domain, however, the human hand is replaced by a robotic gripper, the magnetic hook is already inside the drawer, and the screws are scattered across the surface. First, NeSyCR detects that the magnetic hook already occupies the target position in the deployment domain and removes the redundant steps related to placing it. The VLM then repurposes the hook as an intermediate tool for aggregating the scattered screws, introducing actions for retrieving the hook and placing it over them. These alternative steps, however, can be obstructed by objects that must also be placed in the drawer. NeSyCR resolves this potential failure by reordering the steps, performing the retrieval and aggregation steps before the organizing steps and producing a final procedure that restores compatibility with the deployment domain.

5. Related work

Foundation models provide commonsense knowledge and reasoning capabilities that enable generalized embodied control, leveraging pretrained language and vision knowledge to generate executable control code from instructions or demonstrations [36, 54, 58, 61]. Learning from demonstrations is common for embodied agents, but policies trained via behavioral cloning or inverse reinforcement learning struggle to generalize under perceptual and physical changes [5, 30, 40, 47]. Prior work has explored feature alignment, state-transition matching, and video-based imitation [10, 44, 67, 68]. Neurosymbolic approaches combine neural adaptability with symbolic reasoning to improve correctness and interpretability in embodied planning, with recent methods using LLMs or VLMs to derive symbolic representations that are verified by symbolic solvers [13, 37, 50, 53], whereas our work integrates such neurosymbolic reasoning with counterfactual inference to revise demonstrated procedures for cross-domain adaptation. Further discussion of related work is in Appendix A.

6.
Conclusion

In this work, we presented NeSyCR, a neurosymbolic framework for cross-domain code synthesis in video-instructed robotic programming, grounded in counterfactual reasoning. By constructing a verifiable symbolic world model and performing neurosymbolic counterfactual adaptation, NeSyCR converts video demonstrations into executable code policies that remain valid under perceptual and physical variations. Extensive experiments in simulation and real-world settings show that NeSyCR consistently outperforms existing baselines, maintaining higher task success and procedural consistency even as domain gaps and task complexity increase. As shown in Figure 4, when the task complexity gap between the demonstration and deployment domains becomes large, the target task can no longer be solved via demonstration-based counterfactual reasoning alone. Future work will extend this boundary through causal re-grounding, enabling NeSyCR to infer new causal relations and remain robust in novel contexts.

Acknowledgement

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (RS-2022-II220043, Adaptive Personality for Intelligent Agents; RS-2022-II221045, Self-directed Multi-modal Intelligence for Solving Unknown, Open Domain Problems; RS-2025-02218768, Accelerated Insight Reasoning via Continual Learning; RS-2025-25442569, AI Star Fellowship Support Program (Sungkyunkwan Univ.); and RS-2019-II190421, Artificial Intelligence Graduate School Program (Sungkyunkwan University)), the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.
RS-2026-25474409), IITP-ITRC (Information Technology Research Center) grant funded by the Korea government (MSIT) (IITP-2025-RS-2024-00437633, 10%), IITP-ICT Creative Consilience Program grant funded by the Korea government (MSIT) (IITP-2026-RS-2020-II201821), and by Samsung Electronics Co., Ltd.

References

[1] Constructions Aeronautiques, Adele Howe, Craig Knoblock, ISI Drew McDermott, Ashwin Ram, Manuela Veloso, Daniel Weld, David Wilkins Sri, Anthony Barrett, Dave Christianson, et al. PDDL: the planning domain definition language. Technical Report, 1998. 13
[2] Sudhir Agarwal, Anu Sreepathy, David H Alonso, and Prarit Lamba. LLM + reasoning + planning for supporting incomplete user queries in presence of APIs. arXiv preprint arXiv:2405.12433, 2024. 13
[3] Sanghyun Ahn, Wonje Choi, Junyong Lee, Jinwoo Park, and Honguk Woo. Towards reliable code-as-policies: A neuro-symbolic framework for embodied task planning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 13
[4] Montserrat Gonzalez Arenas, Ted Xiao, Sumeet Singh, Vidhi Jain, Allen Ren, Quan Vuong, Jake Varley, Alexander Herzog, Isabel Leal, Sean Kirmani, et al. How to prompt your robot: A promptbook for manipulation skills with code as policies. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. 13
[5] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 2009. 8, 13
[6] Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. From pixels to predicates: Learning symbolic world models via pretrained vision-language models. arXiv preprint arXiv:2501.00296, 2024. 13
[7] Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al.
Do as I can, not as I say: Grounding language in robotic affordances. In Proceedings of the 6th Conference on Robot Learning, 2023. 1, 13
[8] Kaylee Burns, Ajinkya Jain, Keegan Go, Fei Xia, Michael Stark, Stefan Schaal, and Karol Hausman. GenCHiP: Generating robot policy code for high-precision and contact-rich manipulation tasks. In Proceedings of the 37th IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024. 13
[9] Yongchao Chen, Yilun Hao, Yang Zhang, and Chuchu Fan. Code-as-symbolic-planner: Foundation model-based robot planning via symbolic code generation. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025. 13
[10] Sungho Choi, Seungyul Han, Woojun Kim, Jongseong Chae, Whiyoung Jung, and Youngchul Sung. Domain adaptive imitation learning with visual observation. Advances in Neural Information Processing Systems, 2023. 8, 13
[11] Wonje Choi, Woo Kyung Kim, SeungHyun Kim, and Honguk Woo. Efficient policy adaptation with contrastive prompt ensemble for embodied agents. arXiv preprint arXiv:2412.11484, 2024. 1, 13
[12] Wonje Choi, Jooyoung Kim, and Honguk Woo. NeSyPr: Neurosymbolic proceduralization for efficient embodied reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 13
[13] Wonje Choi, Jinwoo Park, Sanghyun Ahn, Daehee Lee, and Honguk Woo. NeSyC: A neuro-symbolic continual learner for complex embodied tasks in open domains. arXiv preprint arXiv:2503.00870, 2025. 8
[14] Cristina Cornelio and Mohammed Diab. RECOVER: A neuro-symbolic framework for failure detection and recovery. arXiv preprint arXiv:2404.00756, 2024. 13
[15] Danny Driess, Fei Xia, Mehdi S. M.
Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 1, 13
[16] Arnaud Fickinger, Samuel Cohen, Stuart Russell, and Brandon Amos. Cross-domain imitation learning via optimal transport. arXiv preprint, 2021. 1, 13
[17] Richard E Fikes and Nils J Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 1971. 3
[18] Maria Fox and Derek Long. PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research, 2003. 4
[19] Maria Fox, Alfonso Gerevini, Derek Long, Ivan Serina, et al. Plan stability: Replanning versus plan repair. In ICAPS, 2006. 6, 18
[20] Martin Gebser, Roland Kaminski, Benjamin Kaufmann, and Torsten Schaub. Multi-shot ASP solving with clingo. Theory and Practice of Logic Programming, 2019. 13
[21] Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. Advances in Neural Information Processing Systems, 2023. 6, 24
[22] Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, and Xuelong Li. Learning an actionable discrete diffusion policy via large-scale actionless video pre-training. Advances in Neural Information Processing Systems, 2024. 3
[23] Richard Howey, Derek Long, and Maria Fox. VAL: Automatic plan validation, continuous effects and mixed initiative planning using PDDL. In 16th IEEE International Conference on Tools with Artificial Intelligence, 2004.
4
[24] Siyuan Huang, Zhengkai Jiang, Hao Dong, Yu Qiao, Peng Gao, and Hongsheng Li. Instruct2Act: Mapping multi-modality instructions to robotic actions with large language model. arXiv preprint, 2023. 1, 13
[25] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022. 1, 13
[26] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023. 1, 13
[27] William Huey, Huaxiaoyue Wang, Anne Wu, Yoav Artzi, and Sanjiban Choudhury. Imitation learning from a single temporally misaligned video. arXiv preprint, 2025. 3
[28] Mert Inan, Aishwarya Padmakumar, Spandana Gella, Patrick Lange, and Dilek Hakkani-Tur. Multimodal contextualized plan prediction for embodied task completion. arXiv preprint arXiv:2305.06485, 2023. 6, 18
[29] Adam Ishay, Zhun Yang, and Joohyung Lee. Leveraging large language models to generate answer set programs. arXiv preprint arXiv:2307.07699, 2023. 13
[30] Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, 2022. 8, 13
[31] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA: General robot manipulation with multimodal prompts. In Fortieth International Conference on Machine Learning, 2023. 5, 15
[32] Byeonghwi Kim, Jinyeon Kim, Yuyeong Kim, Cheolhong Min, and Jonghyun Choi. Context-aware planning and environment-aware memory for instruction following embodied agents.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023. 13
[33] Bowen Li, Tom Silver, Sebastian Scherer, and Alexander Gray. Bilevel learning for bilevel planning. arXiv preprint arXiv:2502.08697, 2025. 13
[34] Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, and Xuming He. From pixels to graphs: Open-vocabulary scene graph generation with vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 4
[35] Yin Li, Liangwei Wang, Shiyuan Piao, Boo-Ho Yang, Ziyue Li, Wei Zeng, and Fugee Tsung. MCCoder: Streamlining motion control with LLM-assisted code generation and rigorous verification. arXiv preprint, 2024. 13
[36] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2022. 1, 3, 8, 13
[37] Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B Tenenbaum, Tom Silver, João F Henriques, and Kevin Ellis. VisualPredicator: Learning abstract world models with neuro-symbolic predicates for robot planning. arXiv preprint arXiv:2410.23156, 2024. 8, 13
[38] Xinrui Lin, Yangfan Wu, Huanyu Yang, Yu Zhang, Yanyong Zhang, and Jianmin Ji. CLMASP: Coupling large language models with answer set programming for robotic task planning. arXiv preprint, 2024. 13
[39] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023. 13
[40] YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.
8, 13
[41] Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. MoReVQA: Exploring modular reasoning models for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 6, 23
[42] Theo X Olausson, Alex Gu, Benjamin Lipkin, Cedegao E Zhang, Armando Solar-Lezama, Joshua B Tenenbaum, and Roger Levy. LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. arXiv preprint, 2023. 13
[43] Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint, 2023. 13
[44] Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. DexMV: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision, 2022. 8, 13
[45] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017. 1, 13
[46] Dripta S Raychaudhuri, Sujoy Paul, Jeroen Vanbaar, and Amit K Roy-Chowdhury. Cross-domain imitation from observations. In International Conference on Machine Learning, 2021. 1, 13
[47] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011. 8, 13
[48] Sangwoo Shin, Daehee Lee, Minjong Yoo, Woo Kyung Kim, and Honguk Woo. One-shot imitation in a non-stationary environment via multi-modal skill. In International Conference on Machine Learning, 2023.
13
[49] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. 6, 17
[50] Tom Silver, Soham Dan, Kavitha Srinivas, Joshua B Tenenbaum, Leslie Kaelbling, and Michael Katz. Generalized planning in PDDL domains with pretrained large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024. 8, 13
[51] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the 19th IEEE/CVF International Conference on Computer Vision, 2023. 1, 13
[52] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018. 2
[53] Hao Tang, Darren Key, and Kevin Ellis. WorldCoder, a model-based LLM agent: Building world models by writing code and interacting with the environment. In Advances in Neural Information Processing Systems, 2024. 8, 13
[54] Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. ChatGPT for robotics: Design principles and model abilities. Published by Microsoft, 2023. 1, 8, 13
[55] Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, and Katsushi Ikeuchi. GPT-4V(ision) for robotics: Multimodal task planning from human demonstration. IEEE Robotics and Automation Letters, 2024. 6, 23
[56] Beichen Wang, Juexiao Zhang, Shuwen Dong, Irving Fang, and Chen Feng. VLM see, robot do: Human demo video to robot action plan via vision language model. arXiv preprint arXiv:2410.08792, 2024. 1, 13
[57] Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, and Xiaolong Wang.
GenSim: Generating robotic simulation tasks via large language models. arXiv preprint, 2023. 13
[58] Yuki Wang, Gonzalo Gonzalez-Pumariega, Yash Sharma, and Sanjiban Choudhury. Demo2Code: From summarizing demonstrations to synthesizing code via extended chain-of-thought. Advances in Neural Information Processing Systems, 2023. 1, 3, 6, 8, 13, 23
[59] Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. RoboGen: Towards unleashing infinite data for automated robot learning via generative simulation. arXiv preprint, 2023. 5, 15
[60] Zuoxu Wang, Zhijie Yan, Shufei Li, and Jihong Liu. Ind-VisSGG: VLM-based scene graph generation for industrial spatial intelligence. Advanced Engineering Informatics, 2025. 4
[61] Senwei Xie, Hongyu Wang, Zhanqi Xiao, Ruiping Wang, and Xilin Chen. Robotic programmer: Video instructed policy code generation for robotic manipulation. arXiv preprint arXiv:2501.04268, 2025. 1, 8, 13
[62] Dongil Yang, Minjin Kim, Sunghwan Mac Kim, Beongwoo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, and Jinyoung Yeo. LLM meets scene graph: Can large language models understand and generate scene graphs? A benchmark and empirical study. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. 28
[63] Yijun Yang, Tianyi Zhou, Kanxue Li, Dapeng Tao, Lusong Li, Li Shen, Xiaodong He, Jing Jiang, and Yuhui Shi. Embodied multi-modal agent trained by an LLM from a parallel textworld. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 13
[64] Zhun Yang, Adam Ishay, and Joohyung Lee. Coupling large language models with logic programming for robust and general reasoning from text. arXiv preprint arXiv:2307.07696, 2023. 13
[65] Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, and Pieter Abbeel.
Video2Policy: Scaling up manipulation tasks in simulation through internet videos. arXiv preprint arXiv:2502.09886, 2025. 1, 13
[66] Takuma Yoneda, Jiading Fang, Peng Li, Huanyu Zhang, Tianchong Jiang, Shengjie Lin, Ben Picker, David Yunis, Hongyuan Mei, and Matthew R Walter. Statler: State-maintaining language models for embodied reasoning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. 2, 6, 24
[67] Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, and Lerrel Pinto. Visual imitation made easy. In Conference on Robot Learning, 2021. 8, 13
[68] Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint, 2018. 8, 13
[69] Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, et al. Critic-V: VLM critics help catch VLM errors in multimodal reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025. 6, 23
[70] Xian Zhou, Yiling Qiao, Zhenjia Xu, TH Wang, Z Chen, J Zheng, Z Xiong, Y Wang, M Zhang, P Ma, et al. Genesis: A generative and universal physics engine for robotics and beyond. arXiv preprint, 2024. 5, 13
[71] Wang Bill Zhu, Miaosen Chai, Ishika Singh, Robin Jia, and Jesse Thomason. PSALM-V: Automating symbolic planning in interactive visual environments with large language models. arXiv preprint arXiv:2506.20097, 2025. 13

Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning
Supplementary Material

Contents

A. Further Related Works
B. Environment Settings
B.1. Genesis
B.2. Real-world
B.3. Metrics
B.4. NeSyCR Implementation
B.5. Baselines
C. Additional Experiments
C.1. Experiments on VLM State Translation
C.2. Ablations on VLM choice
C.3. Robustness to Scene Graph Perturbation
C.4. Running Time Analysis
D. Additional Visualizations
D.1. Figure 5 in Main Paper Visualizations
D.2. Real-world Experiment Visualizations
E. Prompts
E.1. Demo2Code
E.2. GPT4V-Robotics
E.3. Critic-V
E.4. MoReVQA
E.5. Statler
E.6. LLM-DM
E.7. NeSyCR
E.8. Common

A. Further Related Works

Foundation models for embodied control. Foundation models have emerged as a promising alternative to conventional neural policies for embodied control, offering broad commonsense knowledge and reasoning capabilities that enable generalized task planning without extensive task-specific data [7, 15, 25, 32, 51, 63]. By leveraging pretrained language and vision representations, these models interpret human instructions and generate goal-directed behaviors for embodied agents. Recent studies have utilized LLMs and VLMs with code-writing capabilities to synthesize executable control code for embodied agents. LLM-based approaches map language instructions to code policies [4, 8, 24, 26, 35, 36, 54, 57], while VLM-based methods extend this to video demonstrations [56, 58, 61, 65]. In this work, we enhance the VLM-based paradigm with robust cross-domain transfer to bridge mismatches between demonstrations and deployment.

Cross-domain adaptation from demonstrations. A well-established approach for training embodied agents is to learn from demonstrations, which enables policy acquisition without explicit rewards or manual supervision [5, 30, 40, 47]. However, policies learned through behavioral cloning or inverse reinforcement learning often struggle to generalize under significant domain shifts, such as variations in perceptual and physical factors, between expert demonstrations and the agent's deployment [11, 16, 45, 46, 48].
Prior works have explored adaptation strategies based on feature alignment, policy fine-tuning, and state-transition matching, as well as video-based imitation methods that align visual representations across domains (e.g., human-to-robot transfer) [10, 44, 67, 68]. Despite these efforts, achieving robust generalization under perceptual and physical variations remains challenging, especially with limited demonstrations and procedurally complex tasks. In this work, we focus on domain gaps that induce procedural discrepancies and address them through counterfactual reasoning.

Neurosymbolic approaches for embodied agents. Neurosymbolic methods integrate the adaptability of neural networks with the formal reasoning capabilities of symbolic tools, improving the correctness and interpretability of task-level planning and reasoning. Recent studies have leveraged foundation models to produce symbolic formulations that are subsequently verified by symbolic tools [29, 42, 43, 64]. In embodied control, symbolic formalisms such as PDDL [1] and ASP [20] were employed, using LLMs to encode domain knowledge and translate task descriptions into problem instances [2, 3, 9, 12-14, 33, 38, 39, 50]. Furthermore, recent works have incorporated VLMs to enable perceptually grounded symbolic reasoning, leveraging the models' vision-language pretraining to extract predicates, infer object relations, and construct symbolic world models that support embodied planning [6, 37, 53, 71]. Our work integrates neurosymbolic reasoning with counterfactual inference, where the VLM proposes alternative actions while a symbolic tool validates and adjusts the resulting procedures to ensure reliable task execution across domains.

B. Environment Settings

B.1. Genesis

Genesis [70] is a GPU-accelerated physics simulation platform designed for general-purpose robotic learning and evaluation.
The platform provides a Python-based API that enables flexible implementation of custom environments with configurable scene layouts, object properties, and task specifications. We leverage Genesis to construct the simulated environment for evaluating our models on the cross-domain demo-to-code tasks, as shown in Table 1 of the main paper. Our evaluation framework encompasses two key dimensions: (1) cross-domain settings (rows of Table 1) that systematically vary environmental and embodiment factors, and (2) tasks with graduated complexity levels (columns of Table 1) ranging from simple operations to complex multi-step procedures. We provide detailed descriptions of each dimension below.

B.1.1. Cross-domain settings

To implement the cross-domain settings, we define five cross-domain factors grouped into three evaluation categories: (1) Obstruction and Object Affordance, which assess performance under environmental shifts; (2) Kinematic Configuration and Gripper Type, which assess performance under embodiment shifts; and (3) Combination, which evaluates robustness under both types of shifts simultaneously. These cross-domain settings can be applied to each task, allowing us to introduce controlled domain gaps between the demonstration and deployment domains. We describe each cross-domain setting in detail below.

Obstruction and Object affordance. To implement this environmental factor, we introduce mechanisms to control obstruction and object-affordance levels for each subtask. The obstruction level ranges from 0 to 2, with higher levels indicating increased domain gap and task complexity. As the obstruction level increases, target objects become more occluded, requiring the agent to resolve the obstruction before initiating the task. The scenes for the cross-domain settings for each subtask type are depicted in Figure 6, with their explanations provided in Table 4.

Figure 6.
Example scene of Obstruction and Object affordance

Task | Level 1 | Level 2
Pick & Place | A cube block stacked on the preceding block. | Multiple cube blocks stacked on the preceding block.
Sweep | A chess piece starts in the wrong box. | Multiple chess pieces start in the wrong box.
Rotate | Hinge lid starts closed. | Objects start stacked on top of the closed lid.
Slide | Target drawer starts closed. | Target drawer starts closed; another obstructs from above.

Table 4. Description of task variations across obstruction levels

Kinematic configuration and Gripper type. To implement this embodiment factor, we configure the robot with a 7-DoF vacuum suction gripper to compare it against a 9-DoF finger gripper. The finger gripper grasps objects by opening and closing its fingers, whereas the vacuum gripper secures objects by activating and deactivating its suction mechanism. As each embodiment provides a distinct Action API set, the code policy must reorganize how actions are invoked and composed to conform to the target APIs. When adaptation is incomplete, this restructuring of API usage does not occur. The example scenes for the cross-domain setting are depicted in Figure 7.

Figure 7. Example scene of Kinematic configuration and Gripper type

Combination. The Combined setting implements all cross-domain factor variations simultaneously, including obstruction and object-affordance levels, kinematic configuration, and gripper type. This setup creates the most challenging evaluation condition by introducing domain gaps across both environmental and embodiment dimensions, requiring comprehensive adaptation from the demonstration to the deployment domain.

B.1.2. Benchmark tasks

To assess the effectiveness of the generated code policy in the embodied domain, we design and implement four representative subtasks widely used in the area of robotic manipulation research [31, 59] in Genesis.
These subtasks are composed to generate a single long-horizon task, with the difficulty of each subtask controlled by both domain factors and the complexity level, allowing for systematic evaluation. The example scenes for each subtask are depicted in Figure 8.

• Pick&Place. The robot picks up cube blocks from the table and places them into a box.
• Sweep. The robot sweeps chess pieces across the board, pushing each piece into its corresponding box. The task succeeds only if all chess pieces are pushed inside their correct boxes.
• Rotate. The robot picks up fruits, places them into a hinged container, and closes the container by rotating the lid around its axis. The task is considered successful only if all fruits are placed inside the container and the lid is fully closed.
• Slide. The robot picks up cylinder blocks, places them into a drawer, and closes the drawer by sliding it shut. The task is considered successful only if all cylinder blocks are placed inside the drawer and the drawer is fully closed.

Figure 8. Example scenes for the subtasks

Task complexity level. By controlling the individual complexity level of each subtask and the total number of subtasks used to compose a single task, we control the task complexity used for evaluation. We divide the complexity level into three categories (low, medium, and high), where the low level consists of two subtasks, and the medium and high levels consist of three and four subtasks, respectively. As complexity increases, not only does the number of subtasks grow, but the number of task objects in each subtask also increases, making the task more complex and long-horizon. Example scenes for each task complexity level are depicted in Figure 9.

Figure 9. Example scenes for each task complexity level

B.1.3. Evaluation

We evaluate scenarios constructed by combining domain settings with task complexity levels.
In total, 220 scenario configurations are generated and categorized into three complexity levels: 80 Low, 80 Medium, and 60 High. Each configuration is evaluated on scenes initialized with two different random seeds, yielding a total of 440 evaluation scenarios as depicted in Table 5.

Domain Factors \ Task Complexity | Low | Medium | High | Total
Obstruction & Object Affordance | 60 | 60 | 40 | 160
Kinematic Config & Gripper | 60 | 60 | 40 | 160
Combination | 40 | 40 | 40 | 120
Total Scenarios | 160 | 160 | 120 | 440

Table 5. Experimental scenarios for model evaluation

B.1.4. Primitive APIs

To connect the generated code policy with robot control, we expose a set of primitive APIs that the VLM-generated code policy can invoke during execution. Specifically, we provide two primitive interfaces: a Perception API and an Action API. The complete specifications of these APIs are detailed in Table 6 and Table 7.

B.2. Real-world

Environment setup. We conducted the real-world experiments using a 7-DoF Franka Emika Research 3 robotic arm with a two-finger gripper on a tabletop workspace. An Intel RealSense D435 RGB-D camera was mounted above the table to provide top-down RGB and depth observations. The captured images were then processed by an object detection and segmentation module to extract the categories and bounding boxes of task-relevant objects.

Object configuration. Across all experiments, including the main scenario in Figure 1 of the main paper, we use an object pool consisting of three drawers, a magnetic hook, various types of screws, and other auxiliary objects for intermediate manipulations. The initial positions of all objects are randomized for each trial. Figure 10 shows a representative example of the environment used in the experiments.

B.3. Metrics

We employ three metrics that assess different aspects of task completion and procedure adherence.
Perception API

Primitive API | Description
is_obj_visible(obj_name) | Returns a boolean indicating whether the object is present in the scene's object list.
get_obj_names() | Returns a list of all object names currently present in the scene.
get_obj_pos(obj_name) | Returns the (x, y, z) position of the corresponding object.
get_obj_bbox(obj_name) | Returns its axis-aligned bounding box [min, max] in world coordinates.
get_obj_size(obj_name) | Returns its size vector computed as the difference between the max and min corners of its bounding box.
gripper_is_open() | Returns a boolean indicating whether the gripper is currently open.
obj_in_gripper(obj_name) | Returns a boolean indicating whether the object is currently within the gripper's grasp or suction region.
get_empty_floor_xy(obj_name) | Returns a collision-free (x, y) position on the floor where an object can be placed without overlapping existing objects.

Table 6. Perception API primitives

Action API

Primitive API | Description
move_gripper_to(obj_name, depth) | Moves the end-effector toward the object.
move_to_position(pos) | Moves the end-effector to a target position.
move_parallel(move_dir, offset) | Moves the end-effector parallel to the workspace plane in the specified direction by a given offset.
grasp_handle(handle_name) | Grasps the handle when the end-effector is sufficiently close.
release_handle() | Releases the currently grasped handle and opens the gripper.
open_gripper() | Opens the finger gripper.
close_gripper() | Closes the finger gripper.
attach_vacuum_handle(handle_name) | Vacuum suction tool counterpart to grasp_handle.
detach_vacuum_handle() | Vacuum suction tool counterpart to release_handle.
deactivate_vacuum() | Vacuum suction tool counterpart to open_gripper.
activate_vacuum() | Vacuum suction tool counterpart to close_gripper.

Table 7. Action API primitives

Success Rate (SR). The Success Rate measures the percentage of tasks that are completed in full.
A task is counted as successful only when all subtasks are achieved:

    SR = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\text{all subtasks completed in task } i]

where N is the total number of tasks and \mathbb{1}[\cdot] is the indicator function.

Goal Condition (GC). The Goal Condition metric measures the proportion of success conditions achieved, reflecting the degree of subtask completion [49]:

    GC = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{number of achieved subtasks in task } i}{\text{total number of subtasks in task } i}

This metric provides a more granular view of partial task completion compared to SR.

Figure 10. Example scenes from the real-world environment used in our experiments

Procedure Deviation (PD). The Procedure Deviation quantifies the alignment between the adapted procedure and the demonstrated procedure using a length-normalized edit distance over the subtask-achievement sequences of succeeded tasks [19, 28]:

    PD = \frac{1}{N_{\text{success}}} \sum_{i \in \mathcal{S}} \frac{\mathrm{EditDistance}(S^{\text{adapt}}_i, S^{\text{demo}}_i)}{\max(|S^{\text{adapt}}_i|, |S^{\text{demo}}_i|)}

where \mathcal{S} is the set of successfully completed tasks, N_{\text{success}} = |\mathcal{S}| is the number of successful tasks, S^{\text{adapt}}_i and S^{\text{demo}}_i are the subtask-achievement sequences for the adapted and demonstrated procedures in task i, respectively, and \mathrm{EditDistance}(\cdot, \cdot) [28] computes the edit distance between two sequences.

B.4. NESYCR Implementation

In this section, we describe the implementation details of NESYCR, which is built upon two complementary reasoning engines: a vision-language model (VLM) and a symbolic tool. For the VLM component, we employ the general-purpose GPT-5 model and specialize its behavior via stage-specific prompting to satisfy the distinct functional requirements of each module. For the symbolic tool, we use the open-source PDDLGym¹ library to handle PDDL parsing and state grounding. Because NESYCR only requires forward state progression, rather than full planning, we implement a symbolic execution logic in Python that performs precondition verification and effect application.
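As a concrete illustration, this forward progression can be rendered in a few lines of Python, under the simplifying assumption that a symbolic state is a plain set of ground-predicate strings and an operator carries pre/add/del predicate sets; negated or quantified preconditions (as in Figure 13) would need richer handling than this subset check.

```python
# Sketch of the symbolic forward executor: precondition verification
# followed by effect application. States are sets of ground predicates;
# operators are dicts with "pre", "add", and "del" sets (an assumed,
# simplified interface, not the exact NESYCR data structures).

def symbolic_execute(state, op):
    """Returns (next_state, violated). `violated` collects unsatisfied
    preconditions and is empty on success."""
    violated = op["pre"] - state
    if violated:                                   # precondition verification
        return None, violated
    return (state - op["del"]) | op["add"], set()  # effect application

def symbolic_verify(state, op):
    """PASS with the successor state, or FAIL with the violated predicates."""
    next_state, violated = symbolic_execute(state, op)
    if violated:
        return "FAIL", violated                    # inconsistent
    return "PASS", next_state                      # consistent
```

For example, an operator that closes the gripper around an object would require `(GripperOpen)` in its preconditions, delete it, and add `(GripperHolding ...)`; running `symbolic_verify` on a state lacking `(GripperOpen)` returns FAIL together with that missing predicate.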
We provide the pseudo-code for the symbolic tool in Algorithm 2. To use the symbolic representation, a predicate set that determines the scope of symbolic abstraction must be supplied. We design a predicate set aimed at capturing generalizable relations common across embodied domains, including both physical relations and embodiment-specific predicates. Although richer or more domain-specific relations could be incorporated, we restrict ourselves to general predicates, as predicate invention lies outside the focus of this work. The predicate set used in the main experiment is provided in Figure 11; note that this same predicate set is also shared among baselines that use symbolic representations.

1 https://github.com/tomsilver/pddlgym

Algorithm 2 Symbolic Tool Φ
 1: /* Forward execution with precondition check */
 2: function SYMBOLICEXECUTE(s, a)
 3:   if pre(a) ⊈ s then
 4:     return ⊥, pre(a) \ s            ▷ Precondition Verification
 5:   end if
 6:   s′ ← (s \ del(a)) ∪ add(a)        ▷ Effect Application
 7:   return s′, ∅
 8:
 9: /* Verification process */
10: function SYMBOLICVERIFY(s, a)
11:   s′, V ← SYMBOLICEXECUTE(s, a)
12:   if V ≠ ∅ then
13:     return FAIL, V                  ▷ Inconsistent
14:   end if
15:   return PASS, s′                   ▷ Consistent

Predicate Set

Predicate | Definition
# Physics-related
(OverOf ?a ?b) | ?a is vertically above ?b without contact.
(OnTopOf ?a ?b) | ?a is resting on and supported by ?b.
(InsideOf ?a ?b) | ?a is contained within ?b.
(Open ?x) | Container ?x is fully open.
(Closed ?x) | Container ?x is fully closed.
# Embodiment-specific
(FingerGripper) | Robot uses a two-finger gripper.
(VacuumSuction) | Robot uses a vacuum suction tool.
(GripperSurrounding ?x) | Gripper encloses ?x without closing.
(GripperHolding ?x) | Gripper is closed and holding ?x.
(GripperOpen) | Gripper is open.
(GripperClosed) | Gripper is closed.
(VacuumAligned ?x) | Vacuum is aligned with ?x but inactive.
(VacuumAttached ?x) | Vacuum is active and attached to ?x.
(VacuumActive) | Vacuum suction is on.
(VacuumInactive) | Vacuum suction is off.

Figure 11. Predicate set used for symbolic state representation

B.4.1. Symbolic World Model Construction

Figure 12. NESYCR symbolic world model construction high-level flow (demonstration → symbolic states → action operators → demonstrated procedure, via translate, reconstruct, simulate, and verify steps)

VLM. In symbolic world model construction, the VLM performs: (1) symbolic state translation, where we process the demonstration using a VLM with a sequence of multi-view images and objects to extract grounded scene graphs representing symbolic states at each timestep, forming an ordered sequence of symbolic states from the first timestep to the final timestep that captures object entities and spatial relations; and (2) symbolic dynamics reconstruction, where the VLM analyzes consecutive state pairs at each timestep to predict action operators. For each transition between timestep t and t+1, the VLM is provided with the previous state, the current state, and their state difference (additions and deletions of predicates). The VLM then predicts action operators specifying: (i) an action semantic description, (ii) preconditions required for execution, and (iii) effects produced after execution. Examples of the symbolic states and the action operators are provided in Figure 13.

Symbolic Tool. In symbolic world model construction, the symbolic tool performs action verification through the following process: for each predicted action operator at timestep t, the tool (i) verifies that the current symbolic state fulfills all preconditions specified in the operator, (ii) applies the action's effects to the current state to produce the resulting next state, and (iii) validates that this resulting state matches the expected symbolic state at timestep t+1.
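A minimal sketch of this three-step check, under the assumption that symbolic states are frozensets of ground-predicate strings; the VLM prediction and PDDL grounding are abstracted away, and the operator below is a plain dict standing in for a predicted action operator.

```python
# Sketch of dynamics-reconstruction verification: an operator predicted for
# the transition s_t -> s_{t+1} is accepted only if replaying it from s_t
# reproduces s_{t+1}. The example transition and operator are hypothetical.

def diff_states(s_prev, s_next):
    """Additions and deletions of predicates between consecutive states."""
    return s_next - s_prev, s_prev - s_next

def verify_operator(s_prev, s_next, op):
    """op: dict with 'pre', 'add', 'del' frozensets of ground predicates."""
    if not op["pre"] <= s_prev:                  # (i) precondition check
        return False
    replayed = (s_prev - op["del"]) | op["add"]  # (ii) effect application
    return replayed == s_next                    # (iii) match expected state

# Hypothetical transition: the gripper closes around an orange.
s_t  = frozenset({"(GripperOpen)", "(OnTopOf orange table)"})
s_t1 = frozenset({"(GripperClosed)", "(GripperHolding orange)",
                  "(OnTopOf orange table)"})
add, delete = diff_states(s_t, s_t1)
op = {"pre": frozenset({"(GripperOpen)"}), "add": add, "del": delete}
```

Here `verify_operator(s_t, s_t1, op)` passes; an operator whose effects fail to reproduce the observed next state would be rejected at step (iii) and re-predicted.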
If verification fails, the VLM is triggered to re-predict the action operator until the entire sequence is verified.

Symbolic State Example
- (Closed top_drawer)
- (GripperClosed)
- (GripperHolding green_cylinder)
- (OnTopOf apple floor)
...
- (OnTopOf toy_box floor)
- (OnTopOf yellow_cube floor)
- (Open bottom_drawer)
- (Open hinge_lid)

Action Operator Example
Name: MoveGripperToSurroundOrange
Preconditions:
- (GripperSurrounding orange)
- (GripperOpen)
- (forall (?y - thing) (not (GripperHolding ?y)))
Effects:
- (GripperClosed)
- (GripperHolding orange)
- (not (GripperOpen))
- (not (GripperSurrounding orange))

Figure 13. Example of symbolic state and action operator extracted from the demonstration

B.4.2. Neurosymbolic Counterfactual Adaptation

Figure 14. NESYCR neurosymbolic counterfactual adaptation high-level flow (deployment → counterfactual state → inconsistency check → alternative action → adapted procedure specification, via translate, simulate, intervene, and yield steps)

VLM. In neurosymbolic counterfactual adaptation, the VLM is used for: (1) counterfactual state translation, where the VLM observes the deployment scene to generate a counterfactual initial state that reflects target-domain conditions while maintaining compatibility with the symbolic predicates from the world model; and (2) counterfactual exploration, where for each action identified as inconsistent during symbolic forward simulation, the VLM proposes alternative action operators by analyzing the current counterfactual state, the incompatible original action, and the violated preconditions, generating interventions whose effects restore the violated preconditions or achieve equivalent outcomes under the deployment domain constraints.

Symbolic Tool.
In neurosymbolic counterfactual adaptation, the symbolic tool performs two functions: (1) counterfactual identification, where it simulates the demonstrated procedure in the counterfactual setting by iteratively computing the next counterfactual state, applying each demonstrated action to the current counterfactual state, checking whether the action's preconditions are satisfied in the current state, and flagging actions as inconsistent when some preconditions are violated; and (2) action verification, where it validates each VLM-proposed alternative action by verifying that its preconditions are satisfied in the counterfactual state. The identification-exploration loop continues iteratively until either the adapted procedure successfully reaches the goal condition while maintaining causal consistency throughout, or the maximum number of iterations is reached. An example of counterfactual identification and exploration is provided in Figure 15. The main hyperparameters of NESYCR are listed in Table 8.

Hyperparameter | Value
Max explorations (Obstruction/Kinematic) | 10 (low/medium), 20 (high)
Max explorations (Combination) | 15 (low/medium), 30 (high)

Table 8. NESYCR hyperparameters

Identification Example
Current Counterfactual State:
- (Closed bottom_drawer)
- (Closed hinge_lid)
- ...
Incompatible Action: OpenGripperToDropOrangeIntoHingeBody
- Preconditions:
  - (not (Closed hinge_lid))
  - ...
- Effects:
  - (InsideOf orange hinge_body)
  - ...
Violated Precondition:
- (not (Closed hinge_lid))

Exploration Example
<<<<<<< SEARCH
OpenGripperToDropOrangeIntoHingeBody
- Preconditions:
  - (not (Closed hinge_lid))
  - ...
- Effects:
  - (InsideOf orange hinge_body)
  - ...
...
=======
MoveHeldOrangeOverFloor
- Preconditions:
  - (GripperHolding orange)
  - ...
- Effects:
  - (OverOf orange floor)
  - ...
...
>>>>>>> REPLACE

Figure 15. Example of counterfactual identification and exploration for the demonstration

B.5.
Baselines

All baselines receive a single demonstration as input, comprising a sequence of multi-view images, an instruction, and an object set, supplemented with task context, and produce a target task specification in the form of a high-level procedure. Once the task specification is generated, the code policy synthesis process remains consistent across all baseline models. An example of a task specification is provided in Figure 16.

Task Specification Example
1. Move gripper to surround green cylinder
2. Close gripper on green cylinder
3. Move held green cylinder over bottom drawer
4. Release green cylinder into bottom drawer
...
46. Move gripper to surround red cube
47. Close gripper on red cube
48. Move held red cube over toy box
49. Open gripper to place red cube into toy box

Figure 16. Example of task specification

Table 9 summarizes the characteristics of each baseline method across five dimensions: Natively Code indicates whether the method generates code policies in its original work or has been adapted for our evaluation. Explicit Adapt denotes explicit mechanisms for adapting demonstrations to the deployment domain. Replanning indicates generating compatible procedures by replanning from scratch in the deployment domain, rather than adapting demonstrations. World Model refers to the use of structured world representations. Symbolic Tool denotes the use of an external symbolic tool.

Baseline | Natively Code | Explicit Adapt | Replanning | World Model | Symbolic Tool
Demo2Code | ✓ | ✗ | ✗ | ✗ | ✗
GPT4V-Robotics | ✓ | ✓ | ✓ | ✗ | ✗
Critic-V | ✗ | ✓ | ✗ | ✗ | ✗
MoReVQA | ✗ | ✓ | ✗ | ✗ | ✗
Statler | ✗ | ✓ | ✓ | ✓ | ✗
LLM-DM | ✗ | ✓ | ✓ | ✓ | ✓

Table 9. Comparison of baseline methods

• VLM-based code policy synthesis uses VLMs to generate task specifications from visual demonstrations and synthesizes code policies, without incorporating an explicit adaptation mechanism.
– Demo2Code [58] generates task specifications by recursively summarizing the demonstration, with deployment domain information provided at intermediate steps.

• VLM-based reasoning utilizes the reasoning capabilities of VLMs to perform adaptation, generating task specifications tailored to the target domain.
– GPT4V-Robotics [55] generates task information from demonstrations and uses a VLM to generate grounded target specifications by visually observing the target scene.
– Critic-V [69] uses a VLM-based critic to iteratively refine the initial task specification through generated critiques, ensuring compatibility with the deployment domain.
– MoReVQA [41] follows a multi-stage modular reasoning process. It generates task information from demonstrations and uses VQA-style VLM querying to generate grounded target specifications.

• World-model-based approaches leverage LLM-based or neurosymbolic world models to support target task specification generation through high-level replanning mechanisms.
– Statler [66] equips LLMs with explicit world state representations that serve as memory throughout the replanning process, facilitating consistent reasoning over extended time horizons.
– LLM-DM [21] constructs explicit PDDL world models from demonstrations using LLMs, then employs domain-independent symbolic planners to generate target task specifications by searching for high-level plans in the problem file for the deployment domain.

Demo2Code. This method generates a code policy from demonstrations by recursively summarizing the demonstration through extended chain-of-thought to yield a task specification, which acts as a seed for generating the code policy. While the original work does not assume a cross-domain setting, we modify the implementation by injecting deployment domain information in the final stage of summarization.

Figure 17.
Demo2Code high-level flow

We implement Demo2Code referring to the official repository². The implementation of the adapted Demo2Code is composed of: (1) observation prediction, where we batch-process source demonstration timesteps using a VLM with multi-view images (top, front, back) to extract an ordered sequence of high-level observations; and (2) action prediction, where the VLM predicts high-level actions from the observation sequence while being provided with deployment domain information to generate domain-adapted actions that serve as the task specification.

2 https://github.com/portal-cornell/demo2code

GPT4V-Robotics. This method generates task specifications from visual demonstrations by first extracting domain-agnostic task descriptions from source demonstrations, then grounding them to deployment environments through visual scene understanding. While the original work does not assume a cross-domain setting, we adapt the model by providing target scene information during the action planning stage. Note that we retain the name GPT4V-Robotics from the original work, though we use GPT-5 as the base VLM for consistency.

Figure 18. GPT4V-Robotics high-level flow

We implement GPT4V-Robotics referring to the official repository³. The implementation of the adapted GPT4V-Robotics is composed of: (1) task description generation, where a VLM analyzes the source demonstration to extract high-level task understanding and domain knowledge; and (2) action planning, where the VLM generates an ordered sequence of grounded actions by observing the deployment scene.

Critic-V. This method enhances VLM multimodal reasoning through iterative refinement with natural language critiques from visual observations.
We adapt this framework for cross-domain demo-to-code by using it to iteratively refine action plans generated from source demonstrations until they align with the visual analysis of the deployment scene, and use the action plans to yield specifications for generating code policies.

Figure 19. Critic-V high-level flow

We implement Critic-V referring to the official repository⁴. The implementation of the adapted Critic-V is composed of: (1) initial action generation, where a VLM generates an initial action sequence from source demonstrations; (2) critique generation, where a VLM critic observes the deployment scene to identify incompatibilities in the actions and provides natural language feedback; and (3) action refinement, where the VLM refines the actions based on the feedback. Steps (2) and (3) repeat until the critic determines no issues exist or a maximum number of iterations is reached.

Hyperparameter | Value
Max critique refinement (Obstruction/Kinematic) | 10 (low/medium), 20 (high)
Max critique refinement (Combination) | 15 (low/medium), 30 (high)

Table 10. Critic-V hyperparameters

MoReVQA. This method operates through a three-stage modular pipeline consisting of event parsing, grounding, and reasoning. We adapt it for cross-domain demo-to-code by using these stages to construct subgoals and query the deployment scene for achieving each subgoal, ultimately generating a grounded specification that is used to produce the final code policy. We implement MoReVQA based on the official supplementary material provided⁵. The implementation of the adapted MoReVQA is composed of: (1) M1 Event Parsing, which processes the input instruction from the demonstration, converts it into a parsed event, and stores it in shared memory.
(2) M2 Grounding uses both the demonstration's task description and the parsed event stored in shared memory to generate subgoals that are adapted to the deployment scene. (3) M3 Reasoning combines the parsed event in memory with the VQA results derived from the deployment scene and generates the actions required to achieve each subgoal.

3 https://github.com/microsoft/GPT4Vision-Robot-Manipulation-Prompts
4 https://github.com/kyrieLei/Critic-V
5 https://juhongm999.github.io/morevqa

Figure 20. MoReVQA high-level flow

Statler. This method ensures consistent state tracking across planning steps by maintaining an explicit world state representation as a memory. We adapt this framework for cross-domain demo-to-code scenarios by extracting task descriptions from demonstrations to re-generate action plans in the deployment domain, and use the action plans to yield specifications for generating code policies.

Figure 21. Statler high-level flow

We implement Statler referring to the official repository⁶. The implementation of the adapted Statler is composed of: (1) task description generation, where a VLM analyzes the source demonstration to extract high-level task understanding and domain knowledge; (2) initial state prediction, where the VLM observes the deployment scene to predict the initial state using the same predefined predicates as NESYCR; and (3) incremental action planning, where at each step the VLM generates the next several actions conditioned on the current state, task description, and previous actions, then updates the state representation accordingly.
This state-action cycle repeats iteratively until the goal is reached or a maximum number of iterations is reached.

LLM-DM. This method employs a neurosymbolic planning approach by generating PDDL (Planning Domain Definition Language) files via a VLM and leveraging a symbolic solver to derive an action plan. We adapt this framework for cross-domain demo-to-code by using VLMs to predict a domain file from the source demonstration and a problem file from deployment information, then iteratively refining the domain file until a valid plan is found through PDDL solving, and using the plan as a specification to generate the code policy.

6 https://github.com/ripl/statler

Hyperparameter | Value
Max planning iteration (Obstruction/Kinematic) | 10 (low/medium), 20 (high)
Max planning iteration (Combination) | 15 (low/medium), 30 (high)

Table 11. Statler hyperparameters

Figure 22. LLM-DM high-level flow

We implement LLM-DM referring to the official repository⁷, with our own implementation for the problem file prediction. The implementation of the adapted LLM-DM consists of: (1) domain prediction, where VLMs analyze source demonstrations to propose additional predicates beyond those in NESYCR, along with actions, for constructing a PDDL domain file; (2) problem prediction, where VLMs observe the deployment scene to formulate a PDDL problem file using the proposed predicates; (3) PDDL-based solving, where a symbolic solver attempts to synthesize an action sequence satisfying the given files; and (4) domain refinement, where upon solving failure, VLMs analyze the failure and refine the domain model. This plan-refine cycle repeats until a valid plan is found or maximum attempts are reached. For the symbolic solver, we use Fast Downward as provided in the official PDDLGym Planners repository⁸.
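The plan-refine cycle described above can be sketched as a simple retry loop. In this sketch, `solve` and `refine_domain` are assumed interfaces standing in for the Fast Downward invocation and the VLM refinement step, respectively; they are not the actual LLM-DM implementation.

```python
# Hedged sketch of an LLM-DM-style plan-refine cycle. `solve` returns
# (plan, failure_info): a plan on success, or None plus diagnostic
# information the refiner can use. Both callbacks are hypothetical.

def plan_with_refinement(domain, problem, solve, refine_domain,
                         max_attempts=10):
    """Iteratively refine the PDDL domain until the solver returns a plan,
    or the attempt budget is exhausted."""
    for _ in range(max_attempts):
        plan, failure = solve(domain, problem)
        if plan is not None:
            return plan, domain                      # valid plan found
        domain = refine_domain(domain, failure)      # analyze and refine
    return None, domain                              # attempts exhausted
```

The per-category budgets in Table 12 (e.g., 10 refinements for low/medium complexity under Obstruction/Kinematic) would correspond to `max_attempts` here.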
Hyperparameter | Value
Max domain refinement (Obstruction/Kinematic) | 10 (low/medium), 20 (high)
Max domain refinement (Combination) | 15 (low/medium), 30 (high)

Table 12. LLM-DM hyperparameters

7 https://github.com/GuanSuns/LLMs-World-Models-for-Planning
8 https://github.com/ronuchit/pddlgym_planners

C. Additional Experiments

C.1. Experiments on VLM State Translation

To evaluate the performance of symbolic state translation, we conduct experiments following [62], with results presented in Table 13. The results show that the VLM reliably translates raw frames into scene graphs, achieving an F1-score above 0.82 against the ground-truth symbolic states of real-world scenes using GPT-5.

Method | Precision | Recall | F1
NESYCR (GPT-5) | 0.97 | 0.71 | 0.82
NESYCR (GPT-5-mini) | 0.88 | 0.75 | 0.80

Table 13. Performance on VLM symbolic state translation (real-world)

C.2. Ablations on VLM choice

Table 14 reports the ablation study on VLM choice for NESYCR. The results show that while GPT-5 achieves the best performance (with 80.00% SR and 88.33% GC), smaller models like GPT-5-mini and the GPT-4 series still maintain competitive results, with SR ranging from 63% to 70% and GC above 75%. This shows that NESYCR remains effective across varying model scales.

Method | SR | GC | PD
NESYCR (GPT-5) | 80.00 ± 7.43 | 88.33 ± 4.60 | 3.75 ± 2.61
NESYCR (GPT-5-mini) | 70.00 ± 8.51 | 76.67 ± 7.08 | 11.43 ± 4.04
NESYCR (GPT-4.1) | 66.67 ± 8.75 | 80.00 ± 5.67 | 9.00 ± 4.16
NESYCR (GPT-4o) | 63.33 ± 8.95 | 75.00 ± 6.67 | 11.05 ± 4.39

Table 14. Ablation on VLM choice (Combined setting)

C.3. Robustness to Scene Graph Perturbation

To evaluate the robustness of NESYCR to imperfect symbolic state translations, we conduct experiments under two types of scene graph perturbations: (1) randomly dropping 10% of relations, and (2) injecting 10% noisy (incorrect) relations. As shown in Table 15, NESYCR maintains stable performance even when perturbations are applied to all scene graphs along the demonstration.
This robustness stems from the fact that perturbation-induced precondition violations are handled identically to violations arising from true cross-domain mismatches, through the same verification and refinement loops employed in both the symbolic world modeling and adaptation phases.

Perturbation | None | 10% Drop | 10% Noise
NESYCR | 65.00 ± 7.64 | 55.00 ± 7.97 | 60.00 ± 7.84

Table 15. Robustness to scene graph perturbation

C.4. Running Time Analysis

Table 16 reports the running time comparison between iterative methods, decomposed into VLM inference time and symbolic tool execution time. NESYCR incurs minimal overhead compared to other iterative baselines, as the symbolic tool operates as a forward executor that performs sequential state transitions with complexity linear in the demonstration length, rather than exponential symbolic search. The primary computational bottleneck is VLM inference, a cost shared across all iterative baselines; the symbolic tool itself contributes marginal latency.

Method | VLM | Symbolic | Total
Critic-V | 196.23s | – | 196.23s
Statler | 83.96s | – | 83.96s
LLM-DM | 91.56s | 0.40s | 91.96s
NESYCR | 118.22s | 0.01s | 118.23s

Table 16. Running time comparison between iterative methods

D. Additional Visualizations

D.1. Figure 5 in Main Paper Visualizations

In this section, we illustrate the adaptation process of NESYCR in Figure 23, and the execution of the generated code policy for our main scenario in Figure 24.

Left. The demonstration-derived procedure includes the redundant action "Release Magnetic Hook Into Bottom Drawer", since the action requires the target object not to be in the target position, whereas the magnetic hook already occupies its target position in the deployment domain, violating the precondition (not (InsideOf magnetic_hook bottom_drawer)) identified by the symbolic tool.
Based on the violated precondition and the current state showing that the magnetic hook is already in place, the VLM generates an exploration that removes the redundant action chunk. Middle. A domain gap exists in the deployment domain: the fine-grained objects, the black screws, are scattered rather than gathered, violating the precondition for the action "Grasp Black Screws" (not (and (Finegrained black_screws) (not (Gathered black_screws)))). Based on the violated precondition and the current state showing that the magnetic hook is present in the scene, the VLM generates an exploration that repurposes the magnetic hook as an aggregation tool, inserting new actions to collect the scattered screws before attempting to grasp them. Right. "Grasp Magnetic Hook" in the newly added hook-retrieval action chunk causes a precondition violation (not (and (OnTopOf silver_screws magnetic_hook) (not (OnTopOf gold_screws bottom_drawer)) (not (OnTopOf bracket bottom_drawer)))), indicating that multiple objects are stacked on top of the hook. The VLM resolves this by reordering the procedure: the hook retrieval and screw aggregation steps are moved earlier in the sequence, before organizing the intermediate objects. This resolves the violation, producing a final procedure compatible with the deployment domain.

Figure 23. An expanded visualization of the adaptation process illustrated in Figure 1

[Figure 24 panels: "Organize the objects into drawers" — pick magnetic hook to gather screws, move gathered screws into drawer, pick & place intermediate objects, place objects in red/blue drawers while opening and closing the red and blue drawers.]

Figure 24. An expanded visualization of the task execution illustrated in Figure 1

D.2.
Real-World Experiment Visualizations

In this section, we illustrate the execution of the generated code policy for our real-world experiment in Figure 25.

[Figure 25 panels: "Organize the objects into drawers" — pick magnetic hook to gather screws, move gathered screws into drawer, pick & place intermediate objects, place objects in red/blue drawers while opening and closing the red and blue drawers.]

Figure 25. Visualization of the real-world experiment in Table 2

E. Prompts

E.1. Demo2Code

Demo2Code — Observation Prediction

[System]
You are a scene descriptor for robotics. The user will provide three synchronized images of the scene (top/front/back views). Your task is to describe what is happening in the scene and how objects are positioned and oriented, based on the provided scene and additional information.
Output requirements:
• Provide one clear and concise scene description including the position and orientation of the objects.
• Start the scene description with an unordered dash (-).

[User]
Your task is to describe what is happening in the scene and how objects are positioned and oriented, based on the provided scene and additional information. Below is an example.
[Example]
Now provide the scene description for the following scene and information.
Information
• Instruction: {instruction}
• Objects in scene: {objects}
• Object state in scene: {object state}
Observation Description

Demo2Code — Action Prediction

[System]
You are an action predictor for robotics. The user will provide an observation sequence consisting of scene descriptions for each timestep and additional scene information.
Your task is to infer the high-level action that occurs between each pair of consecutive timesteps, based on the provided observation sequence and scene information.
Output requirements:
• Provide one clear and concise action description, including the semantic and movement details of objects, for each transition between timesteps.
• Start the action descriptions with ordered numbering (1., 2., ...).
• Use the scene information and frames to identify possible inconsistencies between the observations and the scene; if any exist, infer the adapted actions.

[User]
Your task is to infer the high-level action that occurs between each pair of consecutive timesteps, based on the provided observation sequence and scene information. Below are the definitions of the predicates.
{predicates}
Below is an example.
[Example]
Now provide the action descriptions for the following observation sequence and scene information.
Information
• Instruction: {instruction}
• Objects in scene: {objects}
• Object state in scene: {object state}
• Observation Descriptions: {observations}
Action Descriptions

E.2. GPT4V-Robotics

GPT4V-Robotics — Domain Description

[System]
You are a domain descriptor for robotics. The user will provide three synchronized image sequences of the demonstration (top/front/back views). Your task is to describe the domain and goal of the task being demonstrated in single-line natural language, based on the provided demonstration and information.
Output requirements:
• Produce one clear and specific domain description that can help future task planning for the same domain in new scenes, including the specific goal information.
• Start and end the domain description with triple backticks ( ``` ).

[User]
Your task is to describe the domain and goal of the task being demonstrated in single-line natural language, based on the provided demonstration and information. Below is an example.
[Example]
Now provide the domain description for the following demonstration and information.
Information
• Instruction: {instruction}
• Objects in scene: {objects}
• Object state for each timestep: {object state}
Domain Description

GPT4V-Robotics — Action Planning

[System]
You are a task planner for robotics. The user will provide three synchronized images of the scene (top/front/back views). Your goal is to generate a task plan for the scene based on the provided task information and deployment scene.
Output requirements:
• Provide a step-by-step action plan to accomplish the task, with each action a clear and concise single line.
• Start the action plan with ordered numbering (1., 2., ...).

[User]
Your goal is to generate a task plan for the scene based on the provided task information and deployment scene. Below are the definitions of the predicates.
{predicates}
Below is an example.
[Example]
Now provide the action plan for the following information.
Information
• Domain Description: {domain description}
• Instruction: {instruction}
• Objects in scene: {objects}
• Gripper state in scene: {gripper state}
Action Plan

E.3. Critic-V

Critic-V — Feedback Generation

[System]
You are a feedback generator for robotics. The user will provide a source action plan with task information and three synchronized images of the deployment scene (top/front/back views). Your task is to analyze the source action plan and give feedback on how to refine it to succeed at the task in the deployment scene.
Output requirements:
• Provide clear and concise feedback on how to correct the action plan for the deployment scene; if no problem exists, state "No issues".
• Only provide feedback for the most critical issue that would lead to task failure in the deployment scene.
• Start the feedback with an unordered dash (-).

[User]
Your task is to analyze the source action plan and give feedback on how to refine it to succeed at the task in the deployment scene.
Below is an example.
[Example]
Now provide the feedback for the following initial action plan and information.
Information
• Instruction: {instruction}
• Objects in scene: {objects}
• Object state in scene: {object state}
• Initial Action Plan: {demo summary}
Correction Feedback

Critic-V — Correction Proposal

[System]
You are an action plan corrector for robotics. The user will provide an action plan with feedback for better task success. Your task is to generate a SEARCH/REPLACE patch that applies the feedback to the action plan.
Output requirements:
• Use commonsense reasoning to propose a patch that applies the feedback, ensuring the action plan is executable in the deployment scene.
• First provide a reasoning about the root cause of the feedback, how to realize the feedback as an actual action patch, and why your proposed patch works.
• Propose one SEARCH block and one REPLACE block that can be applied to the original action plan to apply the feedback.
• Format your response in the following way:
Information
• (your reasoning here starting with a dash)
Correction patch:
<<<<<<< SEARCH
ActionToRemove_1
ActionToRemove_2
...
=======
ActionToAdd_1
ActionToAdd_2
...
>>>>>>> REPLACE

[User]
Your task is to generate a SEARCH/REPLACE patch that applies the feedback to the action plan. Below is an example.
[Example]
Now provide the correction patch for the following action plan and feedback.
Information
• Instruction: {instruction}
• Objects in scene: {objects}
• Action Plan: {demo summary}
• Feedback: {feedback}
Correction rationale

E.4. MoReVQA

MoReVQA — Event Parsing

[System]
You are a parsed-event maker for event parsing. The user will provide a question.
Output requirements:
1. question: Change the instruction into a question.
2. conjunction: One of (And / Or / None).
3. parse event: A list of sub-events split by the conjunction. If none exists, include one event.
4.
event object: Specify the main object(s) for each sub-event.
5. classify: One of (which / where / why / how).

[User]
You are a parsed-event maker for event parsing. Below is an example.
[Example]
Now make the Parsed Event based on the provided information.
Information
• Instruction: {instruction}
Answer

MoReVQA — Event Grounding

[System]
You are a grounding module that aligns parsed events with the visual frames of a target scene. The user will provide feedback (optional), a domain description, the target scene, target objects, parsed events, and event objects. Your task is to create specific parsed events and objects grounded in the given scene and target objects, ensuring physical feasibility.
Output requirements: JSON format.
• parse event: scene-grounded, physically feasible.
• event object: specify the subject and target for each grounded event.

[User]
You are a grounding module that aligns parsed events with the visual frames of a target scene. Below is an example.
[Example]
Now make the specific parsed events and event objects.
Information
• Domain Description: {domain description}
• Target Objects: {target objects}
• Event Queue: {event queue}
• Event Object: {event object}
Specific parsed events and event objects

MoReVQA — M2 Verified API Generator

[System]
You are an API generation module that verifies whether a given question is successfully executed. Your task is to generate a verify API that determines whether the question can succeed.

[User]
You are an API generation module that verifies whether a given question is successfully executed. Below is an example.
[Example]
Now generate the M2 verify API.
Information
• Question: {question}
M2 verify API

MoReVQA — Verify API Executor

[System]
You are a verify-API executor that determines whether the question can succeed if the events are successfully executed. The user will provide the target scene, event queue, and verify API.
Your task is to verify whether the question itself can succeed, assuming all events execute successfully.
Output requirements: JSON format.
• event queue: executed event queue.
• verified: true if the question can succeed; plural or "all" still counts as true with one representative.
• reason: explanation if false; otherwise "None".
• verify action: returns true if the question can succeed in the target scene.

[User]
You are a verify-API executor that evaluates whether the given question can succeed. Below is an example.
[Example]
Now generate the verified API.
Information
• verify api: {verify api}
• question: {question}
• event queue: {event queue}
Result

MoReVQA — M3 VQA API Generation Module

[System]
You are an M3 VQA API generation module. Your task is to generate exactly three procedural VQA sub-questions for each event.
Assumptions:
• All required objects exist in the scene, and the action is feasible.
• Do NOT ask existence questions.
• Focus on procedural HOW questions.
Input:
• question: event queue
Output:
• For each event queue, output:
– One top-level VQA prompt phrased as "How to ?"
– Exactly three sub-questions specifying steps or constraints
• Prefer imperative, scene-grounded, concise wording.
Output format:
vqa("How to ?")
vqa(["", "", ""])

[User]
You are an M3 VQA API generation module that generates exactly three procedural VQA sub-questions. Below is an example.
[Example]
Now generate the M3 VQA API.
Information
• Event Queue: {event queue}
M3 VQA API

MoReVQA — VQA API Executor

[System]
You are a VQA API executor that answers questions about the target scene to evaluate whether demonstrated actions can be successfully reproduced. The user will provide:
• the target scene
• the event queue and event objects
Output: Answer the VQA.

[User]
You are a VQA API executor that evaluates whether the demonstrated actions can be successfully reproduced in the given target scene.
Below is an example:
[Example]
Now make the answer for the VQA.
Information
• VQA: {vqa}
Result

MoReVQA — Action Planning

[System]
You are a task planner for robotics. The user will provide three synchronized images of the scene. Your goal is to generate task plans based on the task information and the deployment scene.
Output requirements:
• Provide a step-by-step action plan.
• Each action must be a clear and concise single-line instruction.
• Start with ordered numbering (1., 2., ...).

[User]
Your goal is to generate an action plan for the scene based on the provided task information and deployment scene. Below are the definitions of the predicates:
{predicates}
Below is an example:
[Example]
Now provide the action plan for the following information.
Information
• Event Queue: {event queue}
• Objects in scene: {target objects}
• Object state in scene: {target object state}
• VQA Answer: {vqa answer}
Action Plan

E.5. Statler

Statler — Action Planning

[System]
You are a task planner for robotics. The user will provide three synchronized images of the scene (top/front/back views). Your goal is to continue the task plan with a few actions and predict the state after the plan, based on the provided task information and the deployment scene.
Output requirements:
• Provide a step-by-step partial action plan (at most 10 steps) to head toward the goal; if the goal is reached, output "Goal reached".
• Start the action plan with ordered numbering (1., 2., ...).
• After the action plan, output the predicted state as a list of grounded atoms that hold true for the scene after executing the action plan.
• Start each grounded atom of the state with an unordered dash (-).

[User]
Your goal is to continue the task plan with a few actions for the scene based on the provided task information and deployment scene. Below are the definitions of the predicates.
{predicates}
Below is an example.
[Example]
Now use the same format for the following information.
Information
• Domain Description: {domain description}
• Instruction: {instruction}
• Objects in scene: {objects}
Previous Actions and Current State
• Previous Actions: {demo summary}
• Current State: {current state}
Action Plan (at most 10 steps)
Predicted State

E.6. LLM-DM

LLM-DM — Action Recommendation

[System]
You are a PDDL action recommender for robotics. The user will provide a description of the domain and the task instruction, along with the objects in the scene. Your task is to recommend useful PDDL actions to solve the task, in natural language.
Output requirements:
• The actions should have a name and a short description of what the action does in natural language.
• Start the actions with ordered numbering (1., 2., ...).
• Make the actions general enough to be reusable in different tasks within the same domain.

[User]
Your task is to recommend useful PDDL actions to solve the task, in natural language. Below is an example.
[Example]
Now recommend the actions based on the provided demonstration information.
Demonstration Information
• Domain Description: {domain description}
• Instruction: {instruction}
• Objects in scene: {objects}
Action Recommendations

LLM-DM — Predicate Proposal

[System]
You are a predicate proposer for robotics. The user will provide the target initial state, base predicates, and domain/action descriptions in natural language. Your task is to propose a set of untyped predicates with associated descriptions that is useful for defining the actions in PDDL format.
Output requirements:
• Provide one clear and specific list of untyped predicates with descriptions that is useful for defining the actions in PDDL format.
• Start the predicates with an unordered dash (-).

[User]
Your task is to propose a set of untyped predicates with associated descriptions that is useful for defining the actions in PDDL format. Below are the definitions of the existing predicates. Do not invent predicates with duplicated meanings.
{predicates}
Below is an example.
[Example]
Now propose predicates based on the provided domain description and action descriptions.
Domain Description
• {domain description}
Action Descriptions
• {action description}
Predicate Proposal

LLM-DM — Action Construction

[System]
You are an action constructor for robotics. The user will provide an action description in natural language and a set of predicates. Your task is to convert the action description into a PDDL-style action definition based on the provided domain description and predicate set.
Output requirements:
• Produce one clear PDDL-style action definition for the action in the action description.
• The action definition should include the action name, parameters, preconditions, and effects.
• Start the name with "-" (dash followed by a space) under the Name: section; the action name should not overlap with predicate names.
• Start the parameters with ordered numbering (1., 2., ...) under the Parameters: section.
• Start and end the preconditions and effects with triple backticks ( ``` ) under the Preconditions: and Effects: sections respectively.
• All predicates used in the action definition must be from the provided predicates.
• Use predicate names exactly as given in the provided predicate list; do not invent new predicates or rename/alias any predicate.
• You can use (not ...) and (and ...) in the preconditions and effects; do not use (or ...) or (when ...).

[User]
Your task is to convert the action description into a PDDL-style action definition based on the provided domain description and predicate set. Below is an example.
[Example]
Now convert the action description into a PDDL-style action definition.
Action Description
• {action description}
Predicates
• {predicates}
Action Definition

LLM-DM — Problem Prediction

[System]
You are a PDDL problem predictor for robotics.
The user will provide a domain description, a list of objects, a predicate set, and three synchronized images of the deployment scene (top/front/back views). Your task is to predict the PDDL-style problem definition for the target initial scene.
Output requirements:
• Produce one clear and specific PDDL-style problem definition for the target initial scene.
• The objects, initial state, and goal state should be in PDDL format, separately wrapped in triple backticks ( ``` ) under the "Objects:", "Initial state:", and "Goal state:" sections respectively.
• Start the content of the objects, initial state, and goal state with "(:objects", "(:init", and "(:goal" respectively.
• All predicates in the initial state and goal state must be from the provided predicate set.

[User]
Your task is to predict the PDDL-style problem definition for the target initial scene. Below is an example.
[Example]
Now predict the PDDL-style problem definition based on the provided information.
Domain Information
• Domain description: {domain description}
• Predicates: {predicates}
Problem Information
• Instruction: {instruction}
• Objects in scene: {objects}
• Object state in scene: {object state}
Predicted Problem Definition

LLM-DM — Domain Refinement

[System]
You are a PDDL domain refiner for robotics. The user will provide a problematic domain PDDL that failed to generate a plan for the given problem PDDL, and optionally some context about the failure if available. Your task is to diagnose the domain PDDL based on the problem PDDL and apply all necessary fixes to make the problem solvable.
Output requirements:
• First provide a reasoning about what is wrong with the original domain and how you fixed it.
• Start the refinement rationale with "-" (dash followed by a space) under the "Refinement Rationale:" section.
• Produce one refined domain PDDL that can solve the given problem PDDL; keep the name of the domain the same as the original.
• Start and end the refined domain PDDL with triple backticks ( ``` ) under the "Refined Domain PDDL:" section.

[User]
Your task is to diagnose the domain PDDL based on the problem PDDL and apply all necessary fixes to make the problem solvable. Below are the definitions of the predicates.
{predicates}
Now return the refined domain as a whole.
Domain PDDL
• {domain pddl}
Problem PDDL
• {problem pddl}
Context about failure (if any):
• {failure context}
Refinement Rationale:
Refined Domain PDDL:

LLM-DM — Plan Translation

[System]
You are a plan translator for robotics. The user will provide a target plan in PDDL format. Your task is to translate the plan into clear and concise natural language steps, preserving the order and intention of each action.
Output requirements:
• Translate each action into natural language that describes what the robot is doing.
• Start the actions with ordered numbering (1., 2., ...).

[User]
Your task is to translate the plan into clear, concise, human-readable task steps, preserving the order and intention of each action. Below is an example.
[Example]
Now translate the PDDL-style plan into natural language based on the provided information.
Target plan:
• {target plan}
Natural-language Plan:

E.7. NeSyCR

NeSyCR — Action Prediction

[System]
You are an action predictor for robotics. The user will provide state transition information consisting of the previous state, current state, state difference, and scenes. Your task is to predict the executed action's semantics, preconditions, and effects based on the provided state transition information.
Output requirements:
• First provide a reasoning about what action was executed, why these preconditions were necessary, and why these effects occurred.
• Then produce one clear and specific action semantic inferred from the state transition.
• Start the action semantic with "-" (dash followed by a space).
• If the preconditions are empty, leave them as "None".
The effects can never be empty.
• Use negation in preconditions and effects when relevant (e.g., (not (GripperHolding block))).
• Use forall quantifiers when the preconditions or effects should apply to all possible objects (e.g., (forall (?x - thing) (not (GripperHolding ?x)))).
• Except for the forall quantifier, all atoms must be fully grounded with specific object names.

[User]
Your task is to predict the executed action's semantics, preconditions, and effects based on the provided state transition information. Below are the definitions of the predicates.
{predicates}
Below are examples.
[Example]
Now predict the action based on the following information.
State Description:
• Instruction: {instruction}
• Objects in scene: {objects}
• Previous state (before action): {prev state}
• Current state (after action): {curr state}
• State difference: {state diff}
Prediction Rationale:
Predicted Action:
Action Semantic:
Precondition:
Effect:

NeSyCR — Refinement Proposal

[System]
You are an action plan refiner for robotics. Given the task information, action information, and error details, propose a refinement patch so that the target state meets the previously unmet preconditions of the erroneous action.
Output requirements:
• Use commonsense reasoning to propose a patch that resolves the failure, ensuring the erroneous action's preconditions are satisfied.
• First provide a reasoning about the root cause of the error, how to make the erroneous action's preconditions satisfied, and why your proposed patch works.
• Propose one SEARCH block and one REPLACE block that make the preconditions of the erroneous action satisfied.
• The SEARCH block must contain a continuous, consecutive sequence of actions from the action plan in their exact form.
• Format your response in the following way:
Refinement rationale:
- (your reasoning here starting with a dash)
Refinement patch:
<<<<<<< SEARCH
ActionToRemove_1
- Preconditions: ...
- Effects: ...
ActionToRemove_2
- Preconditions: ...
- Effects: ...
...
=======
ActionToAdd_1
- Preconditions: ...
- Effects: ...
ActionToAdd_2
- Preconditions: ...
- Effects: ...
...
>>>>>>> REPLACE

[User]
Revise the action sequence with a SEARCH/REPLACE patch so that the unmet preconditions of the erroneous action are satisfied. Below are the predicate definitions:
{predicates}
Now produce your refinement patch for the information below.
Task Information
• Instruction: {instruction}
• Objects in scene: {objects}
Action Information
• Previously executed actions: {executed actions}
• Erroneous action: {erroneous action}
• Remaining actions (excluding erroneous action): {remaining actions}
Error Detail
• State when error occurred: {error state}
• Unfulfilled Preconditions: {unfulfilled preconditions}
Refinement rationale:
Refinement patch:

E.8. Common

Common — State Prediction

[System]
You are a state predictor for robotics. The user will provide three synchronized images of the scene (top/front/back views) and partial ground-truth object states. Your task is to output the state as a list of grounded atoms that additionally hold true for the provided scene and information.
Output requirements:
• Start the grounded atoms with unordered dashes (-).
• Only include atoms that are strongly supported by the images; output "None" if no additional atoms can be inferred.
• Do not repeat information already provided in the object state.

[User]
Your task is to output the state as a list of grounded atoms that additionally hold true for the provided scene and information. Below are the definitions of the predicates.
{predicates}
Below are examples.
[Example]
Now provide the state predictions for the following information.
Information:
• Instruction: {instruction}
• Objects in scene: {objects}
• Object state in scene: {object state}
State Predictions:

Common — Code Generation

[System]
You are a Python code generator for robotics.
The user will provide two things:
1) A set of imported Python modules and docstrings describing the available functions.
2) A specification containing:
• Instruction: a natural language command for the robot
• Objects in scene: a list of available objects
• Demonstration summary: a sequence of high-level action steps that serves as a high-level plan
Output requirements:
• Use only the provided Python libraries and functions. Do not import new libraries, create new APIs, or change function signatures.
• Adhere strictly to the given docstrings. Call functions exactly as defined, with the allowed parameters.
• Return only the Python code, enclosed in triple backticks ( ```python ... ``` ).
• Treat the demonstration summary as a high-level plan; follow its sequence based on the instruction and available objects.
• Do not add extra steps unless they are implied by the demonstration summary or the instruction.

[User]
Your task is to write robot control scripts in Python code. The Python code should be general and applicable to different robotics environments. Below are the imported Python libraries and functions that you can use; you cannot import new libraries.
[Libraries and Functions]
Below are the docstrings for these imported library functions that you must follow. You cannot add additional parameters to these functions.
[API Docstring]
Below are examples.
[Example]
Now generate Python code that follows the given specification.
Specification:
• Instruction: {instruction}
• Objects in scene: {objects}
• Demonstration summary: {demo summary}
Generated Code:
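The SEARCH/REPLACE patch format specified in the Critic-V and NeSyCR prompts can be applied to an action plan with a simple contiguous-sequence match. A minimal sketch follows; the function name, plan encoding as a list of action lines, and parsing details are illustrative assumptions, not the paper's implementation:

```python
def apply_patch(plan, patch_text):
    """Apply a SEARCH/REPLACE patch (the marker format shown in the prompts above)
    to a list of action lines. Per the prompt constraints, the SEARCH block must
    match a continuous, consecutive subsequence of the plan exactly."""
    _, _, rest = patch_text.partition("<<<<<<< SEARCH\n")
    search_part, _, rest = rest.partition("=======\n")
    replace_part, _, _ = rest.partition(">>>>>>> REPLACE")
    search = [ln for ln in search_part.splitlines() if ln.strip()]
    replace = [ln for ln in replace_part.splitlines() if ln.strip()]
    # Slide over the plan looking for the exact contiguous SEARCH sequence.
    for i in range(len(plan) - len(search) + 1):
        if plan[i:i + len(search)] == search:
            return plan[:i] + replace + plan[i + len(search):]
    raise ValueError("SEARCH block not found as a contiguous action sequence")
```

An empty REPLACE block deletes the matched actions, which is how a redundant action chunk (such as the release step in the adaptation example of Section D.1) would be removed:

```python
plan = ["OpenBottomDrawer", "ReleaseMagneticHookIntoBottomDrawer", "CloseBottomDrawer"]
patch = "<<<<<<< SEARCH\nReleaseMagneticHookIntoBottomDrawer\n=======\n>>>>>>> REPLACE"
apply_patch(plan, patch)  # drops the redundant release action
```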