CAPER: Constrained and Procedural Reasoning for Robotic Scientific Experiments
Robotic assistance in scientific laboratories requires procedurally correct long-horizon manipulation, reliable execution under limited supervision, and robustness in low-demonstration regimes. Such conditions greatly challenge end-to-end vision-language-action (VLA) models, whose assumptions of recoverable errors and data-driven policy learning often break down in protocol-sensitive experiments. We propose CAPER, a framework for Constrained And ProcEdural Reasoning for robotic scientific experiments, which explicitly restricts where learning and reasoning occur in the planning and control pipeline. Rather than strengthening end-to-end policies, CAPER enforces a responsibility-separated structure: task-level reasoning generates procedurally valid action sequences under explicit constraints, mid-level multimodal grounding realizes subtasks without delegating spatial decision-making to large language models, and low-level control adapts to physical uncertainty via reinforcement learning with minimal demonstrations. By encoding procedural commitments through interpretable intermediate representations, CAPER prevents execution-time violations of experimental logic, improving controllability, robustness, and data efficiency. Experiments on a scientific workflow benchmark and a public long-horizon manipulation dataset demonstrate consistent improvements in success rate and procedural correctness, particularly in low-data and long-horizon settings.
💡 Research Summary
The paper introduces CAPER (Constrained And Procedural Reasoning), a modular framework designed to enable robotic assistants to safely and reliably execute scientific experiments that demand strict adherence to protocols, long‑horizon manipulation, and operation under sparse supervision. The authors argue that end‑to‑end vision‑language‑action (VLA) models, which rely on dense trajectory data and the ability to recover from errors, are ill‑suited for research‑and‑development (R&D) laboratory settings where mistakes are often irreversible and demonstrations are limited.
CAPER separates the overall control pipeline into three distinct layers:
- Task‑level planner: Using a Meta‑Llama‑3.1‑8B‑Instruct model guided by chain‑of‑thought (CoT) prompting, the system first interprets the high‑level goal, extracts prerequisite conditions, decomposes the goal into a sequence of symbolic subtasks, and finally validates the plan through an iterative verification‑correction loop. This planning occurs entirely in the language domain, producing a procedurally valid symbolic plan S* that is independent of any visual input.
- Mid‑level multimodal planner: This stage grounds each symbolic subtask into concrete robot actions. A conditional diffusion model predicts short‑horizon future frames conditioned on the current observation and the subtask description, providing visual context that highlights potential collisions or spatial conflicts. A vision‑language model (GPT‑4o) then maps the subtask together with the current and predicted frames into a structured set of predefined action primitives (move, grasp, pour, stir). The use of a frozen CLIP encoder and cross‑attention ensures tight integration of language and vision while keeping spatial decision‑making out of the LLM.
- Low‑level controller: A continuous‑control policy, trained with reinforcement learning (e.g., DDPG), receives the action primitives and raw observations to generate joint‑level commands. The reward function balances progress toward the target, successful grasps, and collision avoidance. Reinforcement learning is confined to this layer, allowing the policy to adapt to physical uncertainties without re‑introducing perceptual or procedural ambiguity.
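The reward shaping described for the low-level controller can be sketched as a simple step-reward function. This is an illustrative toy, not the paper's implementation: the weights and the exact form of each term are assumptions.

```python
import math

# Toy sketch of the low-level reward terms described above: dense progress
# toward the target, a sparse bonus for successful grasps, and a collision
# penalty. The weights and shaping are illustrative assumptions.

def step_reward(ee_pos, target_pos, prev_dist, grasped, collided,
                w_progress=1.0, w_grasp=5.0, w_collision=10.0):
    dist = math.dist(ee_pos, target_pos)   # current end-effector distance
    r = w_progress * (prev_dist - dist)    # reward for moving closer
    if grasped:
        r += w_grasp                       # sparse grasp bonus
    if collided:
        r -= w_collision                   # safety penalty
    return r, dist

# Example step: the gripper moves from 0.30 m to 0.25 m from the target,
# with no grasp event and no collision.
r, d = step_reward((0.0, 0.0, 0.25), (0.0, 0.0, 0.0),
                   prev_dist=0.30, grasped=False, collided=False)
```

Confining reinforcement learning to this layer means the policy only ever has to trade off these physical terms; procedural validity is already guaranteed by the layers above.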
The authors formalize the overall policy as π(aₜ|oₜ,G) = π_exec(aₜ|oₜ,S*) with S* = π_plan(G), highlighting the clean factorization between symbolic planning and execution. This separation improves robustness, interpretability, and data efficiency because each module can be optimized independently and only on the uncertainties it is best suited to handle.
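This factorization, including the task-level verification-correction loop, can be sketched with toy stand-ins for each module. The `propose`, `verify`, and `execute` functions below are illustrative placeholders for the paper's LLM planner, constraint checker, and RL controller, and the ordering constraint is a made-up example.

```python
from dataclasses import dataclass

# Hypothetical sketch of CAPER's factorization: pi_plan produces a symbolic
# plan S* once from the goal alone; pi_exec then conditions on (o_t, S*).
# All module bodies below are toy stand-ins, not the paper's models.

PRIMITIVES = {"move", "grasp", "pour", "stir"}

@dataclass
class Subtask:
    name: str
    target: str

def plan_with_verification(goal, propose, verify, max_rounds=3):
    """Task level (pi_plan): propose a symbolic plan, then run a
    verification-correction loop until no constraint violations remain."""
    plan = propose(goal)
    for _ in range(max_rounds):
        bad = verify(plan)                 # indices of violating subtasks
        if not bad:
            return plan                    # procedurally valid plan S*
        # Correction step: here we simply drop violating subtasks; the
        # paper instead re-prompts the LLM with the detected violations.
        plan = [s for i, s in enumerate(plan) if i not in bad]
    raise RuntimeError("no procedurally valid plan found")

def ground(subtask):
    """Mid level: map a symbolic subtask onto a predefined primitive.
    The real system also conditions on current and predicted frames."""
    assert subtask.name in PRIMITIVES, f"unknown primitive {subtask.name}"
    return {"primitive": subtask.name, "target": subtask.target}

def execute(action):
    """Low level (pi_exec): stand-in for the RL controller."""
    return f"{action['primitive']}({action['target']}) done"

# Toy goal with a simple ordering constraint: pouring from a container
# is invalid before it has been grasped.
def propose(goal):
    return [Subtask("pour", "beaker_A"), Subtask("grasp", "beaker_A"),
            Subtask("pour", "beaker_A")]

def verify(plan):
    grasped, bad = set(), []
    for i, s in enumerate(plan):
        if s.name == "pour" and s.target not in grasped:
            bad.append(i)
        elif s.name == "grasp":
            grasped.add(s.target)
    return bad

plan = plan_with_verification("transfer reagent", propose, verify)
log = [execute(ground(s)) for s in plan]
```

The key property mirrored here is that the premature `pour` is caught and corrected before execution begins, so the controller never sees a procedurally invalid action.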
Experiments are conducted in a CoppeliaSim (V‑REP) simulation of a UR3 arm equipped with typical laboratory objects (beakers, petri dishes, cylinders, etc.). The framework is evaluated on a newly created scientific‑workflow benchmark that emphasizes procedural correctness, as well as on a public long‑horizon manipulation dataset. Across both domains, CAPER achieves consistent gains: success rates improve by 12–18 percentage points and procedural correctness by 15–22 points, especially in low‑data regimes (10–20 demonstrations) and tasks with more than 20 steps. Ablation studies demonstrate that the verification‑correction loop in the task‑level planner and the visual prediction component in the mid‑level planner are critical for preventing irreversible protocol violations.
In summary, CAPER demonstrates that explicitly encoding procedural commitments as interpretable intermediate representations, and assigning distinct uncertainty domains to separate modules, yields a robot system that is both data‑efficient and safe for scientific experimentation. The authors suggest future work on transferring the approach to real laboratory hardware, handling more complex chemical protocols, and integrating human‑in‑the‑loop collaboration.