Towards Reliable Benchmarking: A Contamination-Free, Controllable Evaluation Framework for Multi-step LLM Function Calling


Existing benchmarks for tool-augmented language models (TaLMs) lack fine-grained control over task difficulty and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that stress-tests TaLMs on synthetically generated multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG, where models must infer the correct sequence of calls to compute a target value. FuncBenchGen allows precise control over task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding pretraining and test-time leakage. Our evaluation shows that reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly ahead of the other models evaluated. Performance declines sharply as dependency depth increases, and connected distractors, i.e., irrelevant functions that share type-compatible variables with relevant functions, prove especially difficult to handle. Moreover, strong models often make syntactically valid function calls yet propagate incorrect or stale argument values across steps, revealing brittle state tracking in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., improving GPT-5's success rate from 62.5% to 81.3%.


💡 Research Summary

Paper Overview
The authors introduce FuncBenchGen, a novel evaluation framework designed to benchmark tool‑augmented large language models (TaLMs) on multi‑step function‑calling tasks. Existing benchmarks suffer from two major shortcomings: (1) limited control over task difficulty (e.g., number of required calls, dependency depth, presence of distractor APIs) and (2) data contamination, where benchmark examples may already be present in the model’s pre‑training corpus or become searchable at test time. FuncBenchGen addresses both by generating synthetic function sets and tasks on‑the‑fly, guaranteeing that no pre‑training data overlaps with the evaluation material, and by representing function dependencies as a hidden directed acyclic graph (DAG). The model only sees the list of function signatures and must infer the correct call order to compute a target variable.
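The planning problem the model faces can be sketched in a few lines. Assuming each schema lists typed input variables and a single output (the function name `plan_calls` and the schema format here are illustrative, not the paper's actual representation), a solver with full knowledge of the graph would resolve dependencies backward from the target:

```python
# Minimal sketch: given function signatures, find an order of calls that
# computes a target variable. Schemas map name -> (input_vars, output_var).
# This is the ground-truth view; the evaluated LLM must infer it implicitly.
def plan_calls(schemas, target, known):
    """Return function names in a valid call order for `target`,
    or None if no chain of producers reaches the `known` variables."""
    producer = {out: (fn, ins) for fn, (ins, out) in schemas.items()}
    order, resolving = [], set()

    def resolve(var):
        if var in known:
            return True
        if var in resolving or var not in producer:
            return False  # cycle guard / no function produces this variable
        resolving.add(var)
        fn, ins = producer[var]
        if not all(resolve(v) for v in ins):
            return False
        if fn not in order:
            order.append(fn)  # dependencies were appended first
        resolving.discard(var)
        return True

    return order if resolve(target) else None
```

For example, with `schemas = {"f": (["a"], "b"), "g": (["b"], "c"), "h": (["z"], "w")}`, where `h` is a distractor, `plan_calls(schemas, "c", {"a"})` yields the two-step chain `["f", "g"]`.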

Framework Design

  1. Graph‑Based Task Generation – Users specify four independent parameters:

    • n_core – number of core functions that are truly required to solve the problem.
    • d – maximum dependency depth (length of the longest path).
    • n_conn – number of connected irrelevant functions (CINs) that share type‑compatible variables with core functions.
    • n_dis – number of disconnected irrelevant functions (DINs) that are isolated from the solution DAG.
      The generator first builds a valid core path of length d, then adds remaining core nodes randomly while preserving acyclicity, and finally inserts CINs as children of random core nodes and DINs as isolated sub‑graphs.
  2. Function Schema Creation – Each node becomes a function schema containing: a random name, typed input parameters, a single typed output, and a natural‑language description. Types and sub‑types are the only semantic link; variable names are irrelevant.

  3. Execution Model – At evaluation time each variable is assigned a three‑digit integer. Functions return the correct output only when exact expected inputs are supplied; otherwise they emit a random value, mimicking real‑world APIs that may silently fail on malformed arguments. The LLM can issue multiple calls per turn, receive the deterministic outputs, and continue until it decides to stop.
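The three steps above can be sketched together. All names (`generate_task`, `make_function`), the node-labeling scheme, and the sampling details are assumptions for illustration; the paper's generator may differ:

```python
import random

def generate_task(n_core, d, n_conn, n_dis, seed=0):
    """Sketch of the task-graph generator: edges are (producer, consumer)
    pairs; the target is the output of the last core-path function."""
    assert n_core >= d + 1, "a core path of depth d needs d+1 functions"
    rng = random.Random(seed)
    core = [f"core{i}" for i in range(n_core)]
    # 1) valid core path of length d: core0 -> core1 -> ... -> core<d>
    edges = [(core[i], core[i + 1]) for i in range(d)]
    # 2) remaining core nodes feed a function already on the path,
    #    keeping them required without creating cycles
    for i in range(d + 1, n_core):
        edges.append((core[i], core[rng.randrange(d + 1)]))
    # 3) connected distractors (CINs) consume a type-compatible core output
    for j in range(n_conn):
        edges.append((core[rng.randrange(n_core)], f"cin{j}"))
    # 4) disconnected distractors (DINs) stay isolated from the solution DAG
    dins = [f"din{k}" for k in range(n_dis)]
    return edges, core[d], dins

def make_function(expected_inputs, correct_output, rng):
    """Execution-model sketch: the correct value is returned only on the
    exact expected arguments; otherwise a random three-digit integer is
    emitted, mimicking APIs that silently fail on malformed arguments."""
    def fn(**kwargs):
        if kwargs == expected_inputs:
            return correct_output
        return rng.randrange(100, 1000)
    return fn
```

A silent wrong-value return (rather than an exception) is what makes stale-argument errors hard for the model to detect downstream.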

Experimental Setup
The authors evaluate seven state‑of‑the‑art models (GPT‑5, GPT‑4‑Turbo, Claude‑3, Llama‑2‑70B, Mistral‑Large, etc.) across a systematic grid of configurations, probing five research questions (RQ1–RQ5) concerning core set size, distractor impact, depth sensitivity, scaling with larger function sets and reasoning budgets, and failure‑type mitigation.

Key Findings

  • Reasoning‑Optimized Models Lead – GPT‑5 and Claude‑3 consistently outperform general‑purpose models by 12–18 percentage points, confirming that models tuned for chain‑of‑thought reasoning are better at inferring hidden dependencies from signatures alone.

  • Depth Is the Dominant Difficulty Factor – Success rates drop sharply as depth increases: from ~70% at depth 5 to below 15% at depth 10, and near 0% at depth 20. This reveals a severe limitation in long‑horizon state tracking and memory for current LLMs.

  • Connected Distractors Are Highly Disruptive – Adding as few as five CINs reduces success by 15–25 pp, whereas the same number of DINs has negligible effect. The models appear to over‑generalize based on type compatibility, treating any function with matching input/output types as potentially relevant.

  • State Propagation Errors Dominate – Even when syntactically correct calls are made, models frequently reuse stale variable values or pass incorrect arguments to downstream calls. This “state‑tracking” failure is the most common error class observed.

  • Simple Mitigation Yields Large Gains – The authors propose a lightweight augmentation: at each turn, the system restates all known variable names and their current values to the LLM. This explicit reminder dramatically improves performance: GPT‑5's success rate jumps from 62.5% to 81.3%, and other models see 10–15 pp gains.
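The mitigation amounts to prepending a state summary to each turn. A minimal sketch, assuming the resolved variables are tracked in a dict (the helper name `restate_state` and the message wording are illustrative, not the paper's exact prompt):

```python
def restate_state(known_vars):
    """Build a per-turn reminder that explicitly restates every variable
    resolved so far, so the model need not track state implicitly."""
    if not known_vars:
        return "No variables have been resolved yet."
    lines = [f"- {name} = {value}" for name, value in sorted(known_vars.items())]
    return "Known variable values so far:\n" + "\n".join(lines)
```

The resulting string would be appended to the conversation before each model turn, e.g. `restate_state({"b": 412, "a": 137})` produces a two-line bullet list of the current bindings.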

Implications and Future Directions
FuncBenchGen establishes a contamination‑free, highly controllable benchmark that can isolate specific weaknesses of tool‑using LLMs. By decoupling task difficulty into orthogonal dimensions (size, depth, distractor connectivity), researchers can perform fine‑grained ablations and track progress over time. The framework also opens avenues for extending evaluation to real APIs, dynamic typing, multi‑output functions, and human‑in‑the‑loop feedback loops. Moreover, the success of the variable‑restate mitigation suggests that future model architectures or prompting strategies should incorporate explicit state‑refresh mechanisms, perhaps via internal memory modules or external state‑management APIs.

In summary, the paper makes three major contributions: (1) a novel, fully synthetic, contamination‑free benchmark generator for multi‑step function calling, (2) a comprehensive empirical study revealing depth and connected distractor sensitivity as primary bottlenecks, and (3) a simple yet effective mitigation technique that substantially boosts performance across a range of models. These contributions set a new standard for evaluating and improving the reliability of tool‑augmented LLMs in complex, multi‑turn environments.

