RoboCritics: Enabling Reliable End-to-End LLM Robot Programming through Expert-Informed Critics
End-user robot programming grants users the flexibility to re-task robots in situ, yet it remains challenging for novices due to the need for specialized robotics knowledge. Large Language Models (LLMs) hold the potential to lower the barrier to robot programming by enabling task specification through natural language. However, current LLM-based approaches generate opaque, “black-box” code that is difficult to verify or debug, creating tangible safety and reliability risks in physical systems. We present RoboCritics, an approach that augments LLM-based robot programming with expert-informed motion-level critics. These critics encode robotics expertise to analyze motion-level execution traces for issues such as joint speed violations, collisions, and unsafe end-effector poses. When violations are detected, critics surface transparent feedback and offer one-click fixes that forward structured messages back to the LLM, enabling iterative refinement while keeping users in the loop. We instantiated RoboCritics in a web-based interface connected to a UR3e robot and evaluated it in a between-subjects user study (n=18). Compared to a baseline LLM interface, RoboCritics reduced safety violations, improved execution quality, and shaped how participants verified and refined their programs. Our findings demonstrate that RoboCritics enables more reliable and user-centered end-to-end robot programming with LLMs.
💡 Research Summary
RoboCritics addresses a critical gap in large‑language‑model (LLM) based robot programming: the lack of transparent, safety‑aware verification of generated code. While LLMs such as GPT‑4o can translate natural‑language task descriptions into executable robot scripts, the resulting programs are “black‑box” and often ignore low‑level physical constraints, leading to potential collisions, excessive joint speeds, or unsafe end‑effector poses when deployed on real hardware.
The authors propose a modular framework that augments the LLM pipeline with expert‑informed “critics” that operate directly on motion‑level execution traces. After a user provides a high‑level task (e.g., “pick the green apple and place it in the white box”), the LLM generates a Python program using a predefined robot API library. The program is then executed either in simulation or on a physical UR3e arm while a trace logger (Lively) records joint angles, link frames, pairwise proximities, and timestamps at each timestep.
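To make the trace format concrete, one plausible per-timestep record might look like the following. This is an illustrative sketch only; the field names and layout are assumptions, not Lively's actual logging schema.

```python
from dataclasses import dataclass

@dataclass
class TraceStep:
    """One logged timestep of robot execution (hypothetical schema)."""
    timestamp: float                            # seconds since program start
    joint_angles: list[float]                   # radians, one entry per joint
    link_frames: dict[str, list[float]]         # link name -> [x, y, z, qx, qy, qz, qw]
    proximities: dict[tuple[str, str], float]   # (link, object) -> distance in meters

# The logger would append one TraceStep per control timestep:
trace: list[TraceStep] = []
```

Critics then run as pure functions over this list, which keeps them modular and replayable offline.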
Five motion‑level critics analyze this trace:
- Workspace‑usage critic computes the convex hull of all link positions and flags warnings if the occupied volume exceeds 50 % of the allowed workspace, or errors if the hull breaches workspace boundaries.
- Collision critic measures axis‑aligned bounding‑box distances between the gripper and environment objects, issuing errors on penetration and warnings when distances fall below a configurable threshold.
- Joint‑speed critic estimates instantaneous joint velocities from finite differences of successive joint angles over their timestamps, warning when any joint exceeds a safe limit (default 1 rad/s).
- Gripper‑pose critic checks whether the gripper’s orientation during grasp/release is physically feasible given the object geometry.
- End‑effector safety critic ensures the end‑effector does not adopt unsafe poses (e.g., extreme roll/pitch) within the task space.
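As an illustration of how such a check operates on the trace, the joint-speed critic can be sketched as a finite-difference pass over consecutive timesteps. This is a minimal sketch under assumptions: the dictionary-based trace format, the severity label, and the default threshold are illustrative, not the authors' implementation.

```python
def joint_speed_critic(trace, speed_limit=1.0):
    """Flag timesteps where any joint's estimated angular speed exceeds
    speed_limit (rad/s). Each trace entry is assumed to be a dict with
    'timestamp' (seconds) and 'joint_angles' (radians per joint)."""
    findings = []
    for prev, curr in zip(trace, trace[1:]):
        dt = curr["timestamp"] - prev["timestamp"]
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        for j, (a0, a1) in enumerate(zip(prev["joint_angles"],
                                         curr["joint_angles"])):
            speed = abs(a1 - a0) / dt  # finite-difference velocity estimate
            if speed > speed_limit:
                findings.append({
                    "severity": "Warning",
                    "joint": j,
                    "time": curr["timestamp"],
                    "speed": speed,
                })
    return findings
```

Because the critic consumes only the logged trace, the same check works identically on a simulated run and on a replay of the physical robot's execution.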
When a critic returns a Warning or Error, RoboCritics automatically generates a natural‑language explanation together with a concrete code modification (e.g., insert `reduce_speed(20)` before a fast motion). The user can approve this “one‑click fix,” after which the structured feedback is sent back to the LLM via a Retrieval‑Augmented Generation (RAG) memory that stores the original task description, the generated program, and all critic outputs. The LLM then re‑generates an improved script that incorporates the suggested changes. Users can subsequently simulate the revised program to verify that the fix aligns with their intent before deploying it on the physical robot.
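The structured message forwarded to the LLM after a one-click fix might take a form like the following. The keys, values, and hypothetical `reduce_speed`/`move_to` API calls here are assumptions for illustration, not the system's actual schema.

```python
import json

# Hypothetical structured critic feedback appended to the RAG memory
# alongside the task description and the generated program.
critic_feedback = {
    "task": "pick the green apple and place it in the white box",
    "critic": "joint_speed",
    "severity": "Warning",
    "explanation": ("Joint 3 exceeded the safe angular speed limit "
                    "during the approach motion."),
    "suggested_fix": "insert reduce_speed(20) before move_to(apple_pose)",
}

# Serialized and included in the context of the next LLM generation call.
prompt_context = json.dumps(critic_feedback, indent=2)
```

Keeping the feedback machine-readable lets the LLM ground its revision in the specific violation rather than re-generating the program from scratch.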
To evaluate the approach, the authors conducted a between‑subjects user study with 18 participants, split into a RoboCritics condition (critics enabled) and a baseline condition (LLM only). Both groups performed the same set of pick‑and‑place tasks using natural language prompts. The study measured safety violations, task success rate, execution time, and subjective workload. Results showed that the critic‑enabled condition reduced safety violations by 73 %, increased successful execution by 41 %, and cut overall task time by 28 %. Participants also reported higher confidence in the generated code and engaged more frequently in verification and refinement activities, indicating that the transparent feedback loop improved mental models of the robot’s behavior.
Key contributions of the paper are:
- A motion‑level critic framework that formalizes robotics expertise as modular constraint checks directly on execution traces.
- Integration of structured critic feedback into an LLM via RAG, enabling iterative, user‑in‑the‑loop refinement of robot programs.
- A functional prototype (web interface + UR3e) and empirical evidence that critics improve safety and reliability in end‑to‑end LLM‑based robot programming.
- Design guidelines for presenting critic warnings, offering one‑click automated fixes, and maintaining human oversight.
The authors acknowledge limitations: the current set of five critics covers only a subset of possible safety concerns; thresholds are manually tuned by experts, which may not generalize across robots or tasks; and the system has not yet been tested on multi‑robot or highly dynamic environments. Future work includes learning adaptive thresholds from data, extending critics to cover force/torque limits and dynamic obstacles, and incorporating multimodal sensing (vision, audio) to enrich the verification loop.
Overall, RoboCritics demonstrates that coupling LLM‑generated code with expert‑informed, motion‑level verification dramatically enhances the trustworthiness of robot programming for non‑expert users, paving the way for broader adoption of natural‑language interfaces in collaborative robotics.