Test-Driven Agentic Framework for Reliable Robot Controller


In this work, we present a test-driven, agentic framework for synthesizing deployable low-level robot controllers for navigation tasks. Given either a 2D map image for an ultrasonic sensor-based robot or a 3D robotic simulation environment, our framework iteratively refines the generated controller code using diagnostic feedback from structured test suites until the task succeeds. We propose a dual-tier repair strategy that alternates between prompt-level refinement and direct code editing. We evaluate the approach on 2D navigation tasks and on 3D navigation in the Webots simulator. Experimental results show that test-driven synthesis substantially improves controller reliability and robustness over one-shot controller generation, especially when the initial prompt is underspecified. The source code and demonstration videos are available at: https://shivanshutripath.github.io/robotic_controller.github.io.


💡 Research Summary

The paper introduces a test‑driven, agentic framework for automatically synthesizing low‑level robot controllers for navigation tasks, addressing the brittleness of one‑shot large language model (LLM) code generation. The authors target two representative settings: (1) a 2‑D map‑based environment, where a raw RGB map image together with start and goal coordinates is provided, and (2) a 3‑D physics‑based simulation using the Webots platform. In both cases, the goal is to produce a self‑contained Python script (controller.py) that can be deployed without further LLM assistance.

The core of the framework is a closed‑loop “generate‑evaluate‑repair” cycle. An initial prompt (PROMPT_FIXED) containing non‑negotiable requirements (allowed libraries, mandatory functions) is combined with a dynamically updated “AUTO_REPAIR_RULES” section. This prompt is fed to an LLM (code_gen.py) together with environment context (an occupancy grid and metadata in 2‑D, or the world file and robot specifications in 3‑D). The LLM outputs a candidate controller.
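The prompt‑assembly step can be sketched roughly as below. The names PROMPT_FIXED and AUTO_REPAIR_RULES follow the paper's terminology, but the concrete layout, the `build_prompt` helper, and the example requirements are assumptions, not the authors' exact implementation:

```python
# Sketch of prompt assembly (section names follow the paper; the
# exact contents and format are assumptions for illustration).
PROMPT_FIXED = (
    "Write a self-contained Python controller (controller.py).\n"
    "Allowed libraries: numpy only.\n"
    "Mandatory functions: plan_path(grid, start, goal), step(sensors).\n"
)

def build_prompt(env_context: str, auto_repair_rules: list) -> str:
    """Combine fixed requirements, environment context, and any
    repair rules accumulated from earlier failed test runs."""
    sections = [PROMPT_FIXED, "ENVIRONMENT:\n" + env_context]
    if auto_repair_rules:
        sections.append("AUTO_REPAIR_RULES:\n" +
                        "\n".join("- " + r for r in auto_repair_rules))
    return "\n\n".join(sections)

prompt = build_prompt("20x20 occupancy grid, start=(1,1), goal=(18,18)",
                      ["Distances are in meters, not pixels."])
```

On the first iteration the rules list is empty, so the generator sees only the fixed requirements plus environment context; rules accumulate as tests fail.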

The candidate is immediately validated by a structured PyTest suite that checks for syntactic correctness, proper API usage, safety constraints (collision avoidance), and task success (reaching the goal). Test failures are captured in a detailed report. A second LLM (repair.py) consumes this report and decides between two repair strategies:

  1. Code‑level repair – the LLM directly edits the generated controller to fix the specific errors (e.g., import mistakes, wrong unit conversions, missing sensor handling). This is attempted for a bounded number of edit attempts (repair patience J).

  2. Prompt‑level repair – if code‑level edits do not resolve the failures within the edit budget, the framework augments the original prompt with a summary of the failures (the “AUTO_REPAIR_RULES” section). The enriched prompt is then sent back to the code generator, producing a new controller candidate.

The loop repeats until either all tests pass or a maximum iteration count K is reached. This dual‑tier repair strategy allows the system to compensate for underspecified prompts by progressively enriching the specification while also correcting concrete coding mistakes.
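The dual‑tier loop described above can be sketched as follows. Here `generate`, `run_tests`, and `repair` stand in for the paper's code_gen.py, PyTest suite, and repair.py respectively; their signatures and the default budgets are assumptions:

```python
def synthesize_controller(generate, run_tests, repair,
                          max_iters_K=5, repair_patience_J=3):
    """Dual-tier generate-evaluate-repair loop (sketch).

    generate(rules)         -> controller source, given accumulated repair rules
    run_tests(code)         -> list of failure messages ([] means all tests pass)
    repair(code, failures)  -> edited controller source (code-level repair)
    """
    auto_repair_rules = []                   # prompt-level knowledge, grows over time
    for _ in range(max_iters_K):
        code = generate(auto_repair_rules)   # tier 2: prompt-level (re)generation
        for _ in range(repair_patience_J):
            failures = run_tests(code)
            if not failures:
                return code                  # all tests pass: deployable controller
            code = repair(code, failures)    # tier 1: direct code-level edit
        # edit budget J exhausted: escalate failures into the prompt
        auto_repair_rules.extend(failures)
    return None                              # synthesis failed within K iterations
```

The inner loop corresponds to code‑level repair with patience J; only when that budget is spent do the failure summaries flow back into the prompt for regeneration.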

For the 2‑D case, the authors implement a robust map‑preprocessing pipeline: grayscale conversion, Otsu thresholding, polarity evaluation (dark vs. light obstacles), morphological cleaning, and obstacle inflation. A grid search over polarity, cleanup strength, and inflation radius yields candidate occupancy grids. Each candidate is scored based on path length, clearance, and smoothness, and the best grid is selected and stored in params.json. This metadata is then fed to the controller synthesis stage.
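Two of the pipeline stages, Otsu thresholding and obstacle inflation, can be sketched in plain NumPy as below (the paper's pipeline also includes polarity evaluation and morphological cleaning, which are omitted here; the function names and the 4‑neighbour dilation are illustrative assumptions):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_w = np.cumsum(hist)                    # class-0 pixel counts per threshold
    cum_m = np.cumsum(hist * np.arange(256))   # class-0 intensity mass per threshold
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0, w1 = cum_w[t], total - cum_w[t]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_m[t] / w0
        m1 = (cum_m[-1] - cum_m[t]) / w1
        var = w0 * w1 * (m0 - m1) ** 2         # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def inflate(occ, radius):
    """Inflate obstacles by `radius` cells via iterated 4-neighbour dilation."""
    out = occ.copy()
    for _ in range(radius):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]
        grown[:-1, :] |= out[1:, :]
        grown[:, 1:] |= out[:, :-1]
        grown[:, :-1] |= out[:, 1:]
        out = grown
    return out
```

Inflation by roughly the robot's radius lets the downstream planner treat the robot as a point, which is the usual motivation for this step.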

In the 3‑D case, the controller directly interacts with Webots APIs (differential‑drive commands, ultrasonic range sensors) without any map preprocessing. The world file (envt.wbt) provides geometry, obstacle layout, and robot dimensions, which are included in the prompt to ground the generated code.
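A minimal controller loop of the kind the framework must generate looks like the sketch below. The `Robot`, `getDevice`, `setVelocity`, and `enable` calls follow the standard Webots Python API, but the device names and the obstacle rule are assumptions, and the fallback stub exists only so the sketch runs outside Webots:

```python
try:
    from controller import Robot           # real Webots API, available inside Webots
except ImportError:                        # fallback stub so the sketch runs anywhere
    class _Stub:
        def __init__(self): self._t = 0
        def getBasicTimeStep(self): return 32
        def getDevice(self, name): return _Stub()
        def step(self, ms):
            self._t += ms
            return -1 if self._t > 96 else 0   # simulate three steps, then stop
        def setPosition(self, p): pass
        def setVelocity(self, v): pass
        def enable(self, ms): pass
        def getValue(self): return 1000.0      # "far" ultrasonic reading
    Robot = _Stub

robot = Robot()
timestep = int(robot.getBasicTimeStep())

# Device names ("left motor", "us0") depend on the robot model: assumptions here.
left, right = robot.getDevice("left motor"), robot.getDevice("right motor")
sonar = robot.getDevice("us0")
sonar.enable(timestep)
for m in (left, right):
    m.setPosition(float("inf"))   # switch motor to velocity-control mode
    m.setVelocity(0.0)

steps = 0
while robot.step(timestep) != -1:
    # Simple reactive rule: reverse the left wheel (turn) when an obstacle is close.
    close = sonar.getValue() < 400.0
    left.setVelocity(-1.0 if close else 2.0)
    right.setVelocity(2.0)
    steps += 1
```

The generated controller must follow exactly this structure (sensor enable before the step loop, `setPosition(inf)` before velocity commands), which is the kind of API-usage constraint the test suite checks.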

Experimental evaluation consists of multiple independent runs (R = 10) for each setting, comparing the proposed agentic loop against a baseline one‑shot generation. Success is defined as passing the full test suite and achieving collision‑free navigation to the goal within a time budget. Results show a substantial increase in success rate (from roughly 85 % for one‑shot to over 96 % with the agentic framework). The improvement is especially pronounced when the initial prompt lacks critical details such as unit conventions or sensor specifications; the repair loop automatically injects these missing pieces. Moreover, the authors observe that code‑level repairs tend to fix low‑level syntax and runtime errors, while prompt‑level repairs address higher‑level logical omissions.

Key contributions highlighted by the authors are:

  • A novel agentic workflow that iteratively refines both prompts and code to synthesize reliable controllers from vague natural‑language specifications.
  • A unified methodology applicable to both 2‑D image‑based maps and 3‑D simulated worlds.
  • Empirical evidence that test‑driven synthesis dramatically improves reliability and robustness over single‑shot generation.

The paper also discusses limitations. The approach relies on manually crafted PyTest suites, which may be labor‑intensive for complex or highly dynamic environments. The current implementation is Python‑centric and may not directly translate to real‑time embedded controllers without additional optimization or compilation steps. Finally, the repair process depends on the capabilities of the underlying LLM; newer or domain‑specific models could require re‑tuning of prompts and repair rules.

Future work directions include automated test generation (e.g., model‑based or coverage‑guided), extension to multi‑modal sensor inputs (camera, LiDAR), integration of a compilation stage to produce C/C++ binaries for real‑time deployment, and incorporating optional human‑in‑the‑loop feedback to blend expert insight with automated repair.

In summary, the test‑driven agentic framework presented in this work offers a practical pathway to bridge the gap between high‑level task description and low‑level executable robot control code, leveraging structured testing and dual‑tier LLM‑based repair to achieve high reliability without continuous human intervention.

