Schema First Tool APIs for LLM Agents: A Controlled Study of Tool Misuse, Recovery, and Budgeted Performance
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Tool use has become central to modern LLM agents, yet interface design is rarely isolated as an experimental variable. This paper studies whether schema-based tool contracts and structured validation diagnostics improve reliability under strict interaction budgets. We evaluate three conditions that preserve identical tool semantics and information content: free-form documentation, JSON Schema specifications, and JSON Schema with structured diagnostics. We implement a deterministic software-engineering sandbox with logs, metrics, configurations, and repository tasks, and evaluate a fully crossed pilot with one open local model, three seeds, three interface conditions, and four budgets. We report end-task success, interface misuse, execution failures, semantic misuse, recovery behavior, and overhead. In this pilot, success remains zero across conditions, while schema conditions reduce interface misuse but not semantic misuse. The evidence supports a precise interpretation: interface formalization improves contract adherence, but semantic action quality and timeout-sensitive tasks remain the dominant bottlenecks under constrained local inference.


💡 Research Summary

This paper presents a rigorously controlled study of how the design of tool interfaces influences the reliability of large language model (LLM) agents when operating under strict interaction budgets. The authors isolate the interface representation as the sole independent variable while keeping the model, prompting, tool semantics, and execution environment constant. Three interface conditions are compared: (A) free‑form natural‑language documentation, (B) strict JSON Schema specifications, and (C) JSON Schema augmented with structured, field‑level validation diagnostics. All conditions are derived from the same canonical contract for each tool, guaranteeing information equivalence across conditions.

A deterministic software‑engineering sandbox is built to emulate realistic debugging, monitoring, and configuration tasks. The sandbox contains otherwise immutable artifacts (logs, metrics, configuration files, a small repository with unit tests) that can be inspected or modified only through tool calls such as search_logs, get_metric, read_file, grep_repo, run_tests, and apply_patch. Each tool has a well‑defined input contract; the executors are identical across conditions, and runtime precondition failures produce a fixed error message to avoid confounding effects.
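To make the idea of a canonical contract concrete, the sketch below shows what a JSON‑Schema‑style contract for a tool like search_logs could look like, together with a minimal validator. The field names, the enum values, and the simplified validation logic are assumptions for illustration; the paper's actual contracts and validator are not reproduced here.

```python
# Hypothetical canonical contract for the search_logs tool.
# Field names ("query", "max_results", "level") are assumed, not from the paper.
SEARCH_LOGS_SCHEMA = {
    "type": "object",
    "required": ["query", "max_results"],
    "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer"},
        "level": {"type": "string", "enum": ["DEBUG", "INFO", "WARN", "ERROR"]},
    },
}

# Minimal stand-in for a full JSON Schema validator: it checks only
# required fields, primitive types, unknown fields, and enum membership.
TYPE_MAP = {"string": str, "integer": int, "object": dict}

def validate_call(args, schema):
    """Return a list of field-level error objects; empty means valid."""
    errors = []
    for field in schema.get("required", []):
        if field not in args:
            errors.append({"field": field, "error": "missing required field"})
    for field, value in args.items():
        spec = schema["properties"].get(field)
        if spec is None:
            errors.append({"field": field, "error": "unknown field"})
            continue
        if not isinstance(value, TYPE_MAP[spec["type"]]):
            errors.append({"field": field, "error": f"expected {spec['type']}"})
        elif "enum" in spec and value not in spec["enum"]:
            errors.append({"field": field, "error": f"must be one of {spec['enum']}"})
    return errors
```

Under this framing, condition A would render the contract as prose, condition B would show the schema itself, and condition C would additionally feed the field-level error objects back to the agent on an invalid call.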

The agent is allowed a maximum of B steps, with B ∈ {3, 5, 8, 12}. At each step it may emit a tool call or a final answer. Success is determined by a deterministic checker that evaluates the final answer against a hidden ground‑truth issue. The study measures six primary outcomes: task success (S), interface misuse rate (I), execution failure rate (E), recovery probability after an invalid call (R), semantic misuse rate (M), and token overhead (O). Semantic misuse is automatically labeled by checking whether a schema‑valid call aligns with any pre‑approved high‑level trace for the task.
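The budgeted interaction protocol described above can be sketched as a simple loop. The function and field names below are assumptions, and the recovery bookkeeping (a valid call immediately following an invalid one) is a simplified reading of the paper's recovery metric R, not its exact definition.

```python
def run_episode(agent_step, execute_tool, check_answer, budget):
    """Budgeted loop: at each of at most `budget` steps the agent emits
    either a tool call or a final answer. Returns per-episode counts
    feeding the paper's outcome rates (S, I, E, R)."""
    stats = {"success": False, "invalid_calls": 0, "exec_failures": 0,
             "recoveries": 0, "total_calls": 0}
    pending_error = None  # diagnostics from the previous invalid call, if any
    for _ in range(budget):
        action = agent_step(pending_error)
        if action["type"] == "final_answer":
            # Deterministic checker scores the answer against hidden ground truth.
            stats["success"] = check_answer(action["answer"])
            break
        stats["total_calls"] += 1
        ok, diagnostics = execute_tool(action["tool"], action["args"])
        if ok:
            if pending_error is not None:
                stats["recoveries"] += 1  # valid call right after an invalid one
            pending_error = None
        else:
            if diagnostics.get("kind") == "interface":
                stats["invalid_calls"] += 1  # schema/contract violation
            else:
                stats["exec_failures"] += 1  # runtime precondition failure
            pending_error = diagnostics
    return stats
```

The misuse rate I would then be `invalid_calls / total_calls` aggregated over episodes, and R the fraction of invalid calls followed by a correction within the remaining budget.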

Key findings:

  1. Interface misuse reduction (H1) – Both schema‑based conditions (B and C) cut the interface misuse rate by roughly 40‑60 % compared with free‑form documentation. The structured contract clearly guides the model away from missing fields, type mismatches, and enum violations.
  2. Recovery improvement (H2) – Structured diagnostics (C) provide concrete field‑level error objects, leading to a modest increase in the probability that the model corrects its mistake. However, the gain is not statistically significant, indicating that more precise error messages alone do not guarantee successful recovery.
  3. Semantic misuse persists – Across all conditions, the semantic misuse rate remains high (≈60 %). Calls that satisfy the schema but are irrelevant to the task (wrong tool, wrong intent, redundant actions) are unaffected by the presence of a schema or diagnostics. This demonstrates that contract rigor does not substitute for higher‑level planning competence.
  4. Task success remains zero – Even with the most generous budget (B = 12), no condition achieves end‑to‑end success. The reduction in interface errors does not translate into task completion because execution timeouts, planning errors, and the limited reasoning capacity of the local model dominate.
  5. Token overhead – Adding schemas and diagnostics increases prompt length by roughly 12 % in tokens, a modest cost that does not materially affect the measured outcomes.
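The semantic‑misuse label from the findings above can be sketched as a prefix check against pre‑approved high‑level traces. Representing traces as sequences of tool names is an assumption here; the paper's traces may carry more structure (arguments, intents).

```python
def is_semantic_misuse(call_history, approved_traces):
    """Label the latest schema-valid call as semantic misuse if the
    call history so far is not a prefix of any pre-approved trace.
    Traces are simplified here to lists of tool names."""
    n = len(call_history)
    return not any(call_history == trace[:n] for trace in approved_traces)
```

This captures why schemas cannot help with this failure mode: a call can pass every contract check yet still fall outside every approved plan for the task.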

The authors conclude that schema‑first interfaces act as a low‑cost “compiler” for LLM agents, effectively preventing syntactic API errors. However, achieving reliable task completion under tight budgets requires addressing the deeper problem of semantic planning and tool‑selection strategies. Future work should explore hybrid approaches that combine contract enforcement with meta‑prompting, chain‑of‑tools orchestration, or external planners, as well as dynamic budget‑aware decision making.

Overall, the paper contributes a reproducible benchmark suite, a clear taxonomy of failure modes, and empirical evidence that formalizing tool contracts improves interface adherence but does not solve the fundamental challenges of meaningfully using tools within constrained LLM deployments.

