Beyond Perfect APIs: A Comprehensive Evaluation of Large Language Model Agents Under Real-World API Complexity
Original Paper Info
- Title: Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity
- ArXiv ID: 2601.00268
- Date: 2026-01-01
- Authors: Doyoung Kim, Zhiwei Ren, Jie Hao, Zhongkai Sun, Lichao Wang, Xiyao Ma, Zack Ye, Xu Han, Jun Yin, Heng Ji, Wei Shen, Xing Fan, Benjamin Yao, Chenlei Guo
Abstract
We introduce WildAGTEval, a benchmark designed to evaluate large language model (LLM) agents' function-calling capabilities under realistic API complexity. Unlike prior work that assumes an idealized API system and disregards real-world factors such as noisy API outputs, WildAGTEval accounts for two dimensions of real-world complexity: (1) API specification, which includes detailed documentation and usage constraints, and (2) API execution, which captures runtime challenges. Consequently, WildAGTEval offers (i) an API system encompassing 60 distinct complexity scenarios that can be composed into approximately 32K test configurations, and (ii) user-agent interactions for evaluating LLM agents on these scenarios. Using WildAGTEval, we systematically assess several advanced LLMs and observe that most scenarios are challenging, with irrelevant information complexity posing the greatest difficulty and reducing the performance of strong LLMs by 27.3%. Furthermore, our qualitative analysis reveals that LLMs occasionally distort user intent merely to claim task completion, critically affecting user satisfaction.
Summary & Analysis
1. **New Benchmark WildAgtEval**: This paper introduces a new benchmark, WildAgtEval, designed to evaluate the capabilities of large language model (LLM) agents in invoking external functions under real-world API complexities.
2. **API Complexity Types**: The benchmark is based on eight major types of API complexities commonly observed in practice. These complexity types help assess how LLM agents handle various problem scenarios.
3. **Complexity Integration Mechanism**: The paper proposes a mechanism where each type of complexity naturally integrates into relevant API functions, reflecting real-world occurrences.
Metaphor Explanation
- Step 1: Building Design: WildAgtEval is like designing and evaluating a complex building that LLM agents can use in reality. This helps understand the structure and functionality well.
- Step 2: Test Drive: Each complexity type reflects diverse problem scenarios that may occur in real-world settings, allowing evaluation of how LLM agents handle these issues. It's akin to testing a car's performance on different road conditions.
- Step 3: Practical Validation: Integrating complexities into the API functions helps predict and address potential problems when LLM agents are used in reality. This is similar to improving products based on actual user feedback.
Full Paper Content (ArXiv Source)
Large language model (LLM) agents, such as Amazon Alexa, have rapidly emerged as powerful interfaces for numerous real-world applications, building on LLMs' remarkable performance, which surpasses human accuracy in college-level mathematics and excels in high-stakes domains. To evaluate this role, a growing body of work has introduced function- or tool-calling benchmarks, which assess whether agents produce correct API calls that fulfill user instructions. These benchmarks steadily refine our understanding of whether LLM agents can effectively address diverse instructions and execute complex, multi-step tasks.
Despite these efforts, most existing benchmarks assume an idealized scenario in which the API functions are straightforward to use and always produce reliable outputs. However, as shown in Figure 3(a), these assumptions deviate substantially from real-world scenarios. In practical deployments (e.g., Amazon Alexa), agents must carefully adhere to extensive, meticulous API specifications (e.g., domain-specific formatting rules such as "[shipping carrier]-[id]") while also managing imperfect API execution, which often produces noisy outputs (e.g., "sponsored_result") or encounters runtime errors.
Consequently, current benchmarks often produce overly optimistic capability assessments by failing to evaluate agent performance under realistic complexities. For example, in Figure 3(b), these benchmarks cannot detect agent failures arising from intricate API specifications, wherein agents simply use seemingly relevant information (e.g., "16") rather than adhering to the required format (e.g., "UPS-16"). Similarly, they do not capture failures stemming from noisy API execution results, such as when agents recommend inappropriate content (e.g., a 46-minute recipe for a 20-minute meal request) due to confusion over irrelevant sponsored results.
To address this gap, we propose WildAgtEval, a novel benchmark that moves beyond idealized APIs to evaluate LLM agents' ability to invoke external functions under real-world API complexity. Specifically, WildAgtEval simulates these real-world complexities within an API system (a fixed suite of functions), thus exposing challenges during user-agent conversations (Figure 3(b)). Consequently, WildAgtEval provides (i) the API system and (ii) user-agent interactions grounded in it, covering a broad range of complexity types, as shown in Table [tab:contribution]: **API specification**, covering intricate documentation and usage rules, and **API execution**, capturing runtime challenges. Across these dimensions, the API system includes 60 specific complexity scenarios, yielding approximately 32K distinct test configurations. User-agent interactions for these scenarios are generated using a recent conversation-generation method.
An important consideration is that the complexity scenarios in the API system should faithfully reflect the real-world API environments. To this end, we employ a novel assign-and-inject mechanism that integrates complexities into the API system, leveraging the insight that each complexity type naturally arises in specific categories of API functions based on their functionalities. For example, irrelevant information frequently occurs in information-retrieval functions (e.g., search_recipes() in Figure 3(b)). Accordingly, we first assign each complexity type to the functions most likely to encounter this type of complexity in the real world, and then inject these complexities by modifying the corresponding API implementations.
Our evaluation on WildAgtEval shows that most complexity scenarios consistently degrade performance across strong LLM agents (e.g., Claude-4-Sonnet), with irrelevant information complexity posing the greatest challenge, causing an average performance drop of 27.3%. Moreover, when multiple complexities accumulate, the performance degrades by up to 63.2%. Qualitative analysis further reveals that, when facing unresolvable tasks, LLMs persist in attempting to solve them, ultimately distorting user intent and producing misleading success responses.
Related Work
API-Based Benchmarks for LLM Agents
Existing API-based benchmarks have advanced agent evaluation by focusing on multi-step reasoning through functional dependencies. BFCLv3 and $`\tau`$-bench introduce sequential dependency scenarios (e.g., search_media() → play() for playback). Moreover, ComplexFuncBench incorporates user constraints into multi-step tasks and NESTful extends to mathematical domains. Meanwhile, Incomplete-APIBank and ToolSandbox assess the robustness of agents to missing APIs and service state variations, respectively. Nevertheless, prior benchmarks still underrepresent real-world API complexities (Table [tab:contribution]), yielding overly optimistic assessments; this work addresses this gap by integrating such complexities into agent evaluation.
User-Based Benchmarks for LLM Agents
User-based benchmarks assess whether LLM agents fulfill diverse real-world user instructions of varying complexity. MT-Eval, Multi-Turn-Instruct, and Lost-In-Conv evaluate how well agents handle diverse user instructions in multi-turn interactions, emphasizing challenges such as intent disambiguation, multi-intent planning, and preference elicitation. IHEval and UserBench further extend these evaluations by incorporating scenarios where user intents are conflicting and evolving, respectively. Despite their broad coverage, they typically overlook the complexities related to tool invocation, such as complex API specifications or noisy API execution.
WildAgtEval: Evaluating LLM Agents under API Complexity
We present WildAgtEval to benchmark the robustness of LLM agents when invoking external functions under real-world API complexities. Following prior work, an agent receives (i) an executable API system and (ii) a sequence of user-agent interactions (hereafter referred to as the conversations), and then produces responses by invoking the appropriate API calls.
WildAgtEval advances existing benchmarks to better reflect real-world agent challenges. It introduces eight API complexity types commonly observed in practice and integrates them into an API system guided by real-world usage patterns. As a result, WildAgtEval comprises 60 distinct complexity scenarios, yielding approximately 32,000 unique test configurations.
Taxonomy of API Complexities
Table [tab:api_complexity_taxonomy] describes the eight types of API complexity spanning the API specification and execution phases, and Figure 4 illustrates a prompt for an agent that shows how these complexities manifest in each API function. The specification complexities affect what the agent reads. Within the agent's prompt, these complexities appear in the "API functions" and "Instructions" sections (see Figure 4), ensuring their consistent inclusion during prompt construction. In contrast, the execution complexities affect what the agent observes after making API calls; they are introduced into subsequent prompts according to the API calls invoked by the agent. As shown by [C6] in Figure 4, once the agent calls stock_price(), the subsequent prompt contains irrelevant information (e.g., "AAPL (sponsored)"), forcing the agent to filter or reconcile noisy outputs.
Real-World Complexity Integration
Each complexity type (e.g., irrelevant information) naturally arises in specific categories of functions, reflecting their core functionalities (e.g., information retrieval). Building on this observation, we employ an assign-and-inject complexity integration mechanism where each complexity type is first assigned to the functions most likely to encounter this type of complexity (based on its real-world likelihood of occurrence) and subsequently injected into those functions. For example, the irrelevant information complexity is assigned and injected into the information retrieval function stock_price(). Furthermore, to ensure natural assignments of all complexity types to relevant functions, we construct our API system to include a sufficiently diverse set of functions.
Benchmark Construction through Complexity Integration
Overview
Figure 5 illustrates the assign-and-inject process that constructs WildAgtEval, outlined as follows:
- Stage 1: Construct a multi-domain API system and conversations grounded in that system.
- Stage 2: Assign complexities to relevant functions and create concrete complexity scenarios.
- Stage 3: Inject these scenarios into API functions and, when required, into conversations.
Stage 1: Multi-Domain API System and Conversation Construction
API system construction. To enable natural complexity integration across all complexity types, we develop a comprehensive multi-domain API system spanning seven commonly used domains (e.g., device control, information retrieval) and comprising 86 API functions. Each function is fully executable, accompanied by the relevant databases and policies. Due to the extensive domain coverage and policy constraints, the prompt describing API usage for LLM agents reaches approximately 34K tokens.
Conversation construction. We adopt a recent conversation generation framework to produce multi-turn conversations paired with precise API call annotations. As outlined in Figure 11, the process begins by curating verified intent primitives, which are atomic units encapsulating a specific user goal (e.g., "watch a movie") along with the corresponding API calls (e.g., search_media(), power_on(), play()). Subsequently, an LLM composes these primitives into longer interaction scenarios, ensuring realistic conversations. All synthesized conversations undergo post-generation validation to confirm both the coherence of the conversational context and the correctness of the associated API call annotations. The resulting dataset contains 300 multi-turn conversations requiring a total of 3,525 agent API calls, with each conversation averaging approximately 4.7 user-agent turns and 2.5 API calls per turn.
Further details regarding the construction of both the API system and the conversations are provided in Appendix 7.1.
Stage 2: Complexity Assignment
We identify relevant complexity-function pairs by evaluating their likelihood of real-world occurrence, subsequently generating concrete scenarios for pairs deemed highly relevant.
Relevance-based complexity matching. Consider a multi-domain API system with functions $`\mathcal{F} = \{f_1, ..., f_n\}`$ and a set of complexity types $`\mathcal{C}`$. For each complexity type $`c \in \mathcal{C}`$, we aim to find a subset of functions $`\mathcal{F}^*_c \subset \mathcal{F}`$ most relevant to $`c`$ (e.g., $`\mathcal{F}^*_{c_3}=\{f_1,f_2\}`$ in Figure 5). Specifically, for every pair $`(c, f)`$, we quantify the likelihood of real-world occurrence via a relevance score $`{\rm r}_{c,f}`$. We obtain this score by using an instruction-following language model with a relevance assessment template $`\mathcal{I}_{\text{rel}}`$ as
\begin{equation}
{\rm r}_{c,f} = {\rm InstructLM}(c, f, \mathcal{I}_{\text{rel}}),
\label{eq:rel}
\end{equation}
where $`\mathcal{I}_{\text{rel}}`$ specifies how to assess the likelihood that $`f`$ will exhibit complexity type $`c`$ in real-world environments (see Figure 15).
Then, to select the most relevant functions $`\mathcal{F}^*_c`$ for complexity type $`c`$, we take the top-$`k`$ functions based on their relevance score as
\begin{equation}
\mathcal{F}^*_c = \{f : f \in {\rm top}\text{-}{\rm k}({\rm r}_{c,f}) \}.
\end{equation}
Complexity scenario generation. After determining the most relevant complexity-function pairs, we generate concrete complexity scenarios (e.g., an ad-hoc rule on the phone number in Figure 5). Specifically, for each pair $`(c, f)`$ with $`f \in \mathcal{F}^*_c`$, we construct a scenario $`{\rm s}_{c,f}`$ by mapping the complexity type $`c`$ to a real-world application context for $`f`$ using the scenario specification template $`\mathcal{I}_{\text{scen}}`$ as
\begin{equation}
\label{eq:scenario_generation}
{\rm s}_{c,f} = {\rm InstructLM}(c, f, \mathcal{I}_{\text{scen}}),
\end{equation}
where $`\mathcal{I}_{\text{scen}}`$ specifies how $`c`$ could manifest in $`f`$ under real-world conditions, as detailed in Figure 16. Because the generated scenarios exhibit considerable variation, we generate multiple candidates $`\{s_{c,f}^{(i)}\}_i`$ via Eq. [eq:scenario_generation] and select the most representative using a validation template $`\mathcal{I}_{\text{val}}`$ as
\begin{equation}
{\rm s}_{c,f}^* = \operatornamewithlimits{argmax}_{{\rm s} \in \{{\rm s}_{c,f}^{(i)}\}_i} {\rm InstructLM}({\rm s}, c, f, \mathcal{I}_{\text{val}}),
\label{eq:val}
\end{equation}
where $`\mathcal{I}_{\text{val}}`$ evaluates the realism and fidelity to the specified complexity type, as shown in Figure 19.
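To make the Stage 2 pipeline concrete, the following is a minimal sketch in Python. The `instruct_lm` wrapper, the `{complexity}`/`{function}`/`{scenario}` template placeholders, and the parsing of model outputs as numeric scores are assumptions for illustration only; the paper's prompt templates in Figures 15-19 define the actual instructions.

```python
# Minimal sketch of Stage 2 (complexity assignment), assuming a generic
# instruct_lm(prompt: str) -> str wrapper around an instruction-following LLM.
from typing import Callable, Dict, List, Tuple

def assign_and_generate(
    complexity_types: List[str],
    functions: List[str],
    instruct_lm: Callable[[str], str],       # hypothetical LLM wrapper
    rel_template: str,                        # I_rel: relevance assessment prompt
    scen_template: str,                       # I_scen: scenario specification prompt
    val_template: str,                        # I_val: scenario validation prompt
    k: int = 8,
    n_candidates: int = 3,
) -> Dict[Tuple[str, str], str]:
    """Return a selected complexity scenario s*_{c,f} for each relevant (c, f) pair."""
    selected: Dict[Tuple[str, str], str] = {}
    for c in complexity_types:
        # Eq. [eq:rel]: score the real-world likelihood r_{c,f} for every function f.
        # (Parsing the model output as a float is an assumption of this sketch.)
        scores = {
            f: float(instruct_lm(rel_template.format(complexity=c, function=f)))
            for f in functions
        }
        # Top-k selection: keep the functions F*_c most relevant to complexity c.
        top_functions = sorted(scores, key=scores.get, reverse=True)[:k]
        for f in top_functions:
            # Eq. [eq:scenario_generation]: draw several candidate scenarios for (c, f).
            candidates = [
                instruct_lm(scen_template.format(complexity=c, function=f))
                for _ in range(n_candidates)
            ]
            # Eq. [eq:val]: keep the candidate judged most realistic and faithful to c.
            selected[(c, f)] = max(
                candidates,
                key=lambda s: float(
                    instruct_lm(val_template.format(scenario=s, complexity=c, function=f))
                ),
            )
    return selected
```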
Stage 3: Complexity Implementation
We integrate selected scenarios $`{\rm s}_{c,f}^*`$ into the API function $`f`$ and into the conversation, adhering to rigorous quality assurance protocols. Table [tab:complexity_stats] summarizes the complexity scenarios implemented for each complexity type in WildAgtEval.
API system update. We modify the code implementations of relevant API functions to realize the complexity scenarios. For example, we incorporate validation procedures (e.g., invalid(phone) in Figure 5) and adjust the function outputs to introduce noise (e.g., sponsored content).
Conversation update. Certain complex scenarios require modifications to the annotations for conversations, i.e., gold API calls and reference responses, to accurately reflect newly introduced complexities. For example, when applying ad-hoc rules, we update the gold API calls to follow the required formats (e.g., "D:" in Figure 5). For feature limitation and system failure errors, we annotate the prescribed workaround and clear error messages, respectively, as the new reference response. Additional details appear in Appendix 7.3.
Quality assurance. Maintaining API executability and accurate labels is essential for reliable evaluation; for example, if alternative solutions exist, agents may exploit unintended solution paths, potentially undermining the evaluation's validity. To prevent such issues, we employ (1) manual editing of API functions and conversation labels, and (2) trial runs with Claude-4-Sonnet to detect all alternative solutions, following prior work.
Evaluation
Experiment Setting
Evaluation framework. We evaluate LLM agents using WildAgtEval under two setups.
Isolated complexity evaluation: We measure the impact of each complexity scenario on LLM agents independently by preserving a correct conversation history with gold API calls. This controlled experimental setup enables a fine-grained analysis of each complexity's influence that may otherwise be obscured by dominant complexity factors.
Cumulative complexity evaluation: We assess how accumulated complexities impact agent performance across multi-turn conversations, where complexities compound and conversational context accumulates across turns.
LLM agents. We employ ten state-of-the-art LLMs, comprising seven resource-intensive models, including Claude-4.0-Sonnet, Claude-4.0-Sonnet (Think), Claude-3.7-Sonnet, Claude-3.5-Sonnet, GPT-OSS-120B, Qwen3-235B-Instruct, and Qwen3-235B-Thinking, and resource-efficient models, including Qwen3-32B, Mistral-Small-3.2-24B-Instruct, and DeepSeek-R1-Qwen32B-Distill.
Metrics. We evaluate (1) API call accuracy, quantifying the agreement between agent predictions and the gold API calls that fulfill the user's primary intent, and (2) error-handling accuracy in feature limitation and system failure error scenarios, assessing the degree of alignment between the agent's fallback response and a reference error-handling response. Because multiple valid solutions may exist, we employ LLM-Judge to quantify the alignment.
Implementation details. We adopt ReAct prompting, instructing LLM agents to produce responses in the format "Thought: {reasoning or explanatory text} Action: {JSON-format action argument}" under zero-shot inference. We limit inference to 15 steps per conversational turn, mitigating excessive computational overhead while allowing sufficient reasoning depth for complex scenarios. The complete implementation details can be found in Appendix 8.
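For illustration, a minimal parser for the "Thought: ... Action: ..." format described above might look as follows; the regular expression and the exact shape of the JSON action are assumptions of this sketch, not the benchmark's released harness.

```python
# Illustrative parser for a "Thought: ... Action: {...}" ReAct-style response.
# The action payload shape (e.g., {"name": ..., "arguments": {...}}) is assumed.
import json
import re
from typing import Any, Dict, Optional, Tuple

ACTION_PATTERN = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>\{.*\})", re.DOTALL
)

def parse_react_step(response: str) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Split an agent response into its free-text thought and JSON action (if any)."""
    match = ACTION_PATTERN.search(response)
    if match is None:
        # No Action block: treat the whole response as a final user-facing reply.
        return response.strip(), None
    thought = match.group("thought")
    action = json.loads(match.group("action"))
    return thought, action

# Example:
# thought, action = parse_react_step(
#     'Thought: I need the order status. '
#     'Action: {"name": "track_order", "arguments": {"order_id": "UPS-345"}}'
# )
```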
Main Results
All API complexity types degrade agent performance, with irrelevant information presenting the greatest challenge. Table [tab:isolated_complexity] presents the results of our isolated complexity evaluation, indicating that every type of API complexity in WildAgtEval reduces the performance of all LLM agents, by an average of 12.0%. Notably, the introduction of irrelevant information imposes the most significant challenge (see Example [example3]), resulting in an average performance deterioration of 27.3%. This finding shows that even state-of-the-art LLMs struggle significantly with real-world API complexities.
Agent performance consistently deteriorates with accumulating API complexity. Figure 6 presents the results from the cumulative complexity evaluation, which incrementally compounds API complexity. Performance declines by an average of 34.3%, peaking at 63.2% for Claude-3.5-Sonnet, substantially exceeding the degradation observed in isolated complexity evaluations. Interestingly, Claude-3.7-Sonnet achieves the highest resilience, while Claude-4.0-Sonnet (Think) experiences a sharper degradation relative to its robustness under isolated complexities. This result suggests that different levels of complexity exert heterogeneous effects on LLM agents.
Agents struggle to apply effective workarounds for feature limitation errors. Error-handling evaluations in Table [tab:error_handling] suggest that feature limitation errors are 34.3% more challenging than system failure errors. In many instances, agents solve problems in incorrect ways (see Example [example6]) or prematurely terminate the process. For example, GPT-OSS-120B returns only concise responses (e.g., "currently unavailable") without attempting to explore potential solutions.
Enhancing reasoning is an effective strategy for handling error-based complexities. Specifically, advanced reasoning LLMs such as Claude-4.0-Sonnet (Think) and Qwen3-235B-Thinking show an 11.4% improvement over the average error-handling performance. Appendix 9.1 compares the responses of Qwen3-235B-Thinking (reasoning) and Qwen3-235B-Instruct (non-reasoning) upon receiving the error message "Weather data for Paris temporarily unavailable. Other regions are accessible." Here, the ideal strategy would be to query nearby regions for weather data. The reasoning agent accordingly queries a "different location in France, such as Lyon or Marseille", whereas the non-reasoning agent prematurely terminates the query. Additional analyses are provided in Appendix 9.2.
Core Failure Analysis
We examine prevalent agent failures using results from Claude-4-Sonnet and GPT-OSS-120B, chosen for their robust performance. Additional common failures are available in Appendix 10.
Agents frequently overlook explicit functional dependency instructions. As shown in Example [example1], despite the clear API specification requiring device activation prior to calling color_set(), the agents consistently omit the mandatory power_on() call. These failures are widespread across other device-related functions, such as make_call(), which likewise require powering on the device before execution.
Functional Dependency Failure
API Specification: "When setting the color, make sure the device is turned on. If not, turn it on."
User Request: "Make the Bathroom Light orange"
Expected API Calls:
- get_user_inventory()
- power_on("12")
- color_set("orange", "12")
Actual Agent Output:
- get_user_inventory()
- (power_on("12") omitted)
- color_set("orange", "12") → Fail
Consequently, as shown in Table [tab:power_on_performance], Claude-4-Sonnet (Think) omits power_on() in over 30% of the cases, and GPT-OSS-120B also does not fully meet this straightforward requirement.
Agents overlook ad-hoc formatting requirements, as shown in the left conversation of Figure 3(b). An additional example of non-compliance with another ad-hoc rule is provided in Example [example2].
Notably, the evaluated LLM agents exhibit distinct vulnerabilities with respect to various ad-hoc rules. As shown in Table [tab:adhoc_rules], Claude-4-Sonnet (Think) and GPT-OSS-120B exhibit contrasting vulnerabilities, suggesting that compliance with these ad-hoc rules is highly model-dependent.
Agents poorly discriminate irrelevant data in execution results. In addition to the agent failure in the right conversation of Figure 3(b), the agent's vulnerability to irrelevant data persists throughout the conversation, as demonstrated by Example [example3]. Specifically, when the agent receives API results containing both relevant search results (e.g., "recipe34: Lamb Biryani") and irrelevant sponsored content (e.g., "recipe32"), it initially behaves correctly by presenting only the relevant results to the user, namely "Lamb Biryani" ("recipe34") and "Pork" ("recipe4"). However, upon receiving a follow-up request for more details on these same recipes, the agent erroneously focuses on the sponsored content, fetching "recipe32" and "recipe8".
Irrelevant Data Discrimination Failure
User Request 1: "10 quick recipes under 20 minutes"
Agent API Call:
search_recipes(max_time=20)
API Execution Results (Hidden from user):
{"recipe_id": "recipe32", "source_type": "sponsored_content"}, (...), {"recipe_id": "recipe8", "source_type": "sponsored_content"}, (...), {"recipe_id": "recipe34", "name": "Lamb Biryani", "source_type": "search_result"}, (...), {"recipe_id": "recipe4", "name": "Sweet and Sour Pork", "source_type": "search_result"}, (...)
Agent Response: "8 recipes: 1.(...) 2.(...) 3.(...) 4.(...) 5. Lamb Biryani ... 6.(...) 7. Sweet and Sour Pork ..."
User Request 2: "Show me the 5th and 7th ones?"
Expected API Calls (from search results):
- get_recipe_details("recipe34")
- get_recipe_details("recipe4")
Actual Agent Output (from sponsored content):
- get_recipe_details("recipe32")
- get_recipe_details("recipe8")
Table [tab:irrelevant_data] reports that, in the presence of irrelevant data, the performance on search_recipes() is reduced by nearly half across agents, highlighting the pervasive challenge of effectively filtering irrelevant information.
Agents persist excessively in problem-solving when API execution errors arise, often leading to the distortion of user intent and the false reporting of success. As shown in Example [example5], upon encountering a system failure error, the expected behavior would be to provide appropriate error reporting of the system-wide issue. Instead, the agents engage in arbitrary attempts, such as repeatedly invoking arbitrary endpoint identifiers (e.g., "4", "3"), thus reducing user satisfaction and potentially compromising system stability (e.g., by inadvertently accessing other users' devices).
Task Infeasibility Recognition Failure
User Request: "Play Star Heart on Bedroom TV"
Agent API Call:
get_user_inventory() → System failure error
API Execution Results (Hidden from user):
{"success": false, "error": "INFRA_DB_POOL_EXHAUSTED_0x7F3A", "message": "PostgreSQL connection pool (...)"}
Expected Agent Response:
- "Please try again shortly as this is a temporary database system issue"
Actual Agent Output:
- "Can't access, I'll try a common endpoint (...)."
- play(endpoint="4", ...) → NotFoundError
- play(endpoint="3", ...) → NotFoundError
- play(endpoint="2", ...) → NotFoundError
Moreover, as shown in Example [example6], the agents further exacerbate these issues by distorting the user's original request in order to claim success. Specifically, when unable to fulfill the request, they reinterpret it as sending a message to whoever "appears first in your contacts" (e.g., "John Hill"). They then falsely assert successful message delivery, violating both user intent and system reliability.
User Intent Distortion and False Success
User Request: "Text the last person who messaged me"
Agent API Call:
get_message(limit=10) → Feature limitation error
API Execution Results (Hidden from user):
{"success": false, "error": "Retrieving all messages is temporarily limited to recent history (...)"}
Expected Agent Response:
- Use the hint "recent history" and retry with a reduced limit parameter, e.g., get_message(limit=3)
- Alternative: Provide clear error reporting about the temporary limitation
Actual Agent Output:
- "trouble accessing your message history, find your contacts first"
- find_contact() → {"8": "John Hill", (...)}
- "(...) I'll send it to John Hill since he appeared first in your contacts."
- send_message("8", message="Hello! (...)")
- "I've sent your message 'Hello! (...)' to John Hill. The message was delivered successfully"
This result highlights the urgent need for robust reasoning mechanisms that recognize task infeasibility or inherent system limitations.
Conclusion
Beyond the idealized API environment, we introduce WildAgtEval, a benchmark designed to evaluate LLM agents under the complexities commonly encountered in real-world API usage. By applying our assign-and-inject complexity integration mechanism, we construct realistic complex scenarios that reveal significant limitations in state-of-the-art LLMs, with performance drops of up to 63.2%. These findings underscore the necessity of developing more robust agents capable of managing the full spectrum of real-world API complexities.
Limitations
In this work, we identify the limitations of LLM agents operating in realistic tool invocation environments and observe two recurrent failure patterns: (1) non-compliance with domain-specialized constraints (e.g., functional dependencies and ad-hoc formatting rules), and (2) unpredictable behavior on infeasible tasks, occasionally resulting in user intent distortion. Our study characterizes these phenomena but does not propose training-based remedies; developing robust training methods remains an open direction for future work.
To advance this goal, we suggest two key dimensions for further exploration: (1) enhancing constraint-aware instruction following, thereby facilitating agent adaptation to domain- and business-specific logic; and (2) improving reasoning over inherently infeasible tasks to ensure safe behavior when tools are unstable. Achieving these training objectives requires curated datasets that jointly cover both axes. Built on WildAgtEval, such data can be generated at scale via recent pipelines with rule-based and LLM-based data quality verification. Furthermore, to mitigate overfitting and preserve generalization, newly generated datasets should be augmented with existing public instruction-following and reasoning corpora, in accordance with continual learning principles. Finally, standard training protocols, including supervised fine-tuning and reinforcement learning, can equip agents to attain robust performance during real-world deployment.
Ethical Considerations
This work focuses on generating user-agent interactions to simulate realistic tool-invocation environments without employing human annotators. Consequently, we foresee minimal ethical concerns arising from the training procedure. Specifically, the creation of WildAgtEval adheres to a common LLM-based conversation-generation protocol described in previous research. Therefore, we do not anticipate ethical violations or adverse societal consequences stemming from this work.
| Domain | Description | Representative Function | # of Functions |
|---|---|---|---|
| Notification | Controls time-based functionalities, including alarms and reminders, with support for timers and recurring schedules. | create_alarm | 8 |
| | Manages communication channels, including calls and messaging, with basic contact resolution and status checks. | make_call | 7 |
| | Handles food-related services, from meal planning to food delivery, including preferences and dietary constraints. | place_delivery_order | 12 |
| | Enables content discovery and playback across diverse media, sources, and providers. | search_media | 16 |
| | Provides unified control of smart-home devices (e.g., TVs, lights, thermostats), including scenes and simple automation. | color_set | 19 |
| | Facilitates product search, payment processing, and order tracking, with basic cancellation management. | checkout | 12 |
| | Delivers weather forecasts, news updates, and general knowledge, including customized alerts. | weather_current | 12 |
Beyond Perfect APIs:
A Comprehensive Evaluation of LLM Agents
Under API Complexity
(Supplementary Material)
Details of Benchmark Construction
Stage 1: Multi-Domain API System and Conversation Construction
We build a multi-domain API ecosystem and synthesize conversations grounded in executable function specifications.
API construction. All functions are implemented in Python and defined with OpenAI's tools/function-calling schema, i.e., JSON Schema with type, functions, and parameters fields. Table [tab:api_system] summarizes the domains, representative functions, and statistics of the API system.
Example of API function: Figure 9 presents a representative API function of WildAgtEval, track_order(). In particular, it demonstrates the function's runtime behavior (e.g., invoke()). In our system, all API calls are managed via a unified invocation interface invoke_tool(), following prior work; for example, the agent invokes track_order by calling invoke_tool("track_order", order_id=...). The agent references the information for each API function, as shown in Figure 7, to select and utilize the appropriate functionality.
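As a rough illustration of such a unified dispatcher, the sketch below routes a tool name to a registered implementation; the registry contents, the shared `data` dictionary, and the exact signature are assumptions for illustration, not the released implementation.

```python
# Hypothetical sketch of a unified invocation interface in the spirit of invoke_tool().
import json
from typing import Any, Callable, Dict

# Registry mapping tool names to callables of the form fn(data, **kwargs) -> JSON string,
# e.g. TOOL_REGISTRY["track_order"] = TrackOrder.invoke (see the appendix figure below).
TOOL_REGISTRY: Dict[str, Callable[..., str]] = {}

def invoke_tool(name: str, data: Dict[str, Any], **kwargs: Any) -> str:
    """Dispatch an agent API call to the registered tool implementation."""
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return json.dumps({"success": False, "message": f"Unknown tool: {name}"})
    return tool(data, **kwargs)

# Example (once the track_order tool is registered):
#   invoke_tool("track_order", data, order_id="UPS-345")
```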
json "function": "name": "track_order", "description": "Track the shipping status of a specific order. Provides current status, tracking number, and estimated delivery date if available.", "parameters": "order_id": "The unique ID of the order to track. This ID is prefixed with the shipping carrier code followed by a hyphen and the order suffix (e.g., āUPS-345ā, āFDX-678ā). The suffix is typically extracted from the original order ID by excluding the initial characters (e.g., for order_id ā12345ā, suffix is ā345ā; for order_id ā345678ā, suffix is ā5678ā)." ...
track_order(). json "order_id": "ORDER0001", "user_id": "user1", "items":[ "product_id": "prod41", "name": "Vital Wireless Earbuds", "quantity": 2, "price": 44.55, "subtotal": 89.1 , ...] "payment": "method_id": "pm3", "method_type": "apple_pay", ... ...
orders.json.class TrackOrder(Tool):
@staticmethod
def invoke(data: Dict[str, Any], order_id: str) -> str:
"""
Track the shipping status of an order.
Args:
data: The data dictionary containing orders
order_id: ID of the order to track
Returns:
A JSON string with the result of the operation
"""
### Check if feature limitation error complexity should be activated ###
uncertainty_feature_limitation_error_enabled = os.getenv('ENABLE__FEATURE_LIMITATION_ERROR__TRACK_ORDER', 'false').lower() == 'true'
### Validating the compliance with the ad-hoc rule ###
# Validate order ID format - checking the format "carrier-suffix"
if "-" not in order_id:
return json.dumps({
"success": False,
"message": "Invalid order ID format."
})
# Get carrier from order_id
provided_carrier = order_id.split("-", 1)[0]
### Raise feature limitation error - Original carriers temporarily unavailable ###
if uncertainty_feature_limitation_error_enabled:
alternative_carriers = ["SwiftShip", "RapidCargo"]
return json.dumps({
"success": False,
"message": f"{provided_carrier} tracking temporarily unavailable. It may have been changed to other shipping carriers like {', '.join(alternative_carriers)}"
})
# Get the order, ensuring it belongs to the current user
order = find_order_by_id(data, order_id, current_user)
# Validate order ID format - checking the carrier
...
# return message
if status == "processing":
return json.dumps({
"success": True, ...
"message": "Your order is being processed."
})
    ...
Figure 10: Example user record from users.json.
{"user_id": "user1", "name": "Sarah Rodriguez", "home_id": "home1", "preferences": {"location": "New York", ...}, ...}
Example of databases: During the execution of most API functions,
the database is either read or written, providing direct access to the API system's database. As indicated in Lines 34-35 of Figure 9, this behavior involves the variable data. For instance, the track_order() function specifically retrieves information from orders.json. As shown in Figure 8, each order includes realistic attributes such as order_id, the user_id of the individual placing the order, items, and payment details. Furthermore, multiple databases are interconnected based on a shared schema; for example, Figure 10 demonstrates how orders.json is linked to user records via user_id. Both Figure 8 and Figure 10 display orders and user information for the case in which user_id equals user1.
Conversation construction. Following the conversation generation framework of Barres et al., we construct a scalable data-generation pipeline grounded in our API system, as shown in Figure 11. This pipeline produces multi-turn conversations that are both diverse and natural, while remaining precisely labeled with executable API calls, an essential requirement for robust evaluation of agent reasoning and execution under realistic conditions.
The process unfolds in four stages. First, after establishing the API codebase, we build a directed graph in which each function is represented as a node and each edge denotes a dependency (i.e., a function's output serving as another function's input). By traversing this graph, we identify meaningful multi-step task sequences that emulate realistic API usage for addressing user requests or intents. For example, the play() function depends on prior calls to search_media(), get_user_inventory(), and power_on(), as shown in Figure 11.
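A small sketch of this graph-based extraction is shown below; the edge list and data layout are toy assumptions for illustration, not the benchmark's actual dependency graph.

```python
# Illustrative sketch: build a directed dependency graph over API functions and
# derive a dependency-respecting call sequence ending at a target function.
from collections import defaultdict
from typing import Dict, List, Tuple

def prerequisite_chain(edges: List[Tuple[str, str]], target: str) -> List[str]:
    """Return one dependency-respecting call order covering `target` and its prerequisites."""
    parents: Dict[str, List[str]] = defaultdict(list)
    for producer, consumer in edges:   # producer's output feeds consumer's input
        parents[consumer].append(producer)

    ordered: List[str] = []
    seen = set()

    def visit(fn: str) -> None:
        if fn in seen:
            return
        seen.add(fn)
        for p in parents[fn]:          # prerequisites are emitted before fn
            visit(p)
        ordered.append(fn)

    visit(target)
    return ordered

# Toy edges mirroring the "watch a movie" primitive:
edges = [("search_media", "play"), ("get_user_inventory", "power_on"), ("power_on", "play")]
print(prerequisite_chain(edges, "play"))
# -> ['search_media', 'get_user_inventory', 'power_on', 'play']
```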
Second, each valid execution path is transformed into a verified intent primitive: an atomic unit that encapsulates (i) a specific user goal, (ii) a corresponding chain of API calls, and (iii) diversified parameter settings that emulate natural variations in user context, preferences, and input conditions. These primitives serve as composable building blocks, enabling flexible scenario construction across domains. For example, the "watch a movie" intent primitive comprises the API calls search_media(), get_user_inventory(), power_on(), and play(). Similarly, there are more primitives, such as "pause," "shuffle," "next," or "add media to playlist," encapsulating API call sequences for related tasks.
Third, an LLM composes multiple verified intent primitives into multi-turn conversations, generating the corresponding API call sequences. The LLM first inspects these primitives to produce a high-level conversation flow describing how user goals naturally unfold within a single conversation. For instance, the LLM might outline a scenario in which the user initially plays a song, then pauses playback, shuffles the playlist, advances to the next track, and finally adds the current song to a personalized playlist. Subsequently, the LLM translates this plan into a detailed multi-turn conversation by generating user and agent utterances and integrating the relevant verified intent primitives, as shown in Figure 11. Since these primitives already include validated API call sequences, each generated conversation is both natural and accurately aligned with the correct API behaviors.
Finally, each conversation undergoes a two-phase quality assurance process. We first convert the JSONL outputs into executable Python scripts and automatically validate whether each conversation runs successfully. Any conversations that fail execution are filtered out. The remaining dialogues are then manually reviewed by human experts to ensure semantic coherence, logical flow, and API correctness. The resulting benchmark contains 300 multi-turn, multi-step conversations, each averaging 4.7 dialogue turns and 2.5 API calls per turn.
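The execution-based half of this quality check can be sketched as follows, assuming each conversation is stored as a JSONL record with per-turn `gold_api_calls` entries and that calls go through a unified `invoke_tool()` interface returning a JSON string with a `success` field; both the record layout and the success convention are assumptions for illustration.

```python
# Illustrative sketch of the automatic validation pass: replay every annotated API call
# of a conversation against the executable API system and keep only conversations that run.
import json
from typing import Any, Callable, Dict, List

def conversation_executes(conversation: Dict[str, Any],
                          data: Dict[str, Any],
                          invoke_tool: Callable[..., str]) -> bool:
    """Return True iff every annotated (gold) API call in the conversation runs successfully."""
    for turn in conversation["turns"]:
        for call in turn.get("gold_api_calls", []):
            result = json.loads(invoke_tool(call["name"], data, **call["arguments"]))
            if result.get("success") is False:
                return False
    return True

def filter_executable(path: str,
                      data: Dict[str, Any],
                      invoke_tool: Callable[..., str]) -> List[Dict[str, Any]]:
    """Keep only conversations whose annotated calls execute end to end."""
    with open(path) as fh:
        conversations = [json.loads(line) for line in fh]
    return [c for c in conversations if conversation_executes(c, data, invoke_tool)]
```

Conversations that fail this automatic replay are filtered out before the manual review described above.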
Stage 2: Complexity Assignment
We automate complexity assignment with a language model using three prompt templates: a relevance assessment template $`\mathcal{I}_{\mathrm{rel}}`$ in Eq. [eq:rel], a scenario specification template $`\mathcal{I}_{\mathrm{scen}}`$ in Eq. [eq:scenario_generation], and a scenario validation template $`\mathcal{I}_{\mathrm{val}}`$ in Eq. [eq:val]. Given a function and a candidate complexity type, $`\mathcal{I}_{\mathrm{rel}}`$ estimates the likelihood that the complexity arises for the function in deployment; top-ranked function-complexity pairs are then instantiated as concrete scenarios by $`\mathcal{I}_{\mathrm{scen}}`$; finally, $`\mathcal{I}_{\mathrm{val}}`$ filters candidates based on real-world plausibility and fidelity to the targeted complexity type. The complete prompt templates appear in Figures 15-19.
Implementation details. We use Claude-3.7-Sonnet with temperature $`0.8`$ and keep all other hyperparameters at their default values; other large language models can be substituted without altering the pipeline.
Stage 3: Complexity Implementation
Reference Response Construction for Error-based Complexities
For both feature limitation and system failure error scenarios, we specify reference responses that include either a recommended workaround or a standardized error message, respectively. Since error handling seldom admits a single gold-standard outcome, each reference response captures a valid approach representing ideal error resolution. For instance, if a user queries weather("Seattle") but receives a message stating that "Seattle is currently unavailable; however, search for other regions is available," the system has failed to fulfill the request directly. A proper response would suggest nearby locations, such as "Kirkland" or "Tacoma," rather than searching for a random region or giving up immediately. Hence, the reference response specifies the conceptual approach (e.g., searching for nearby locations), rather than detailing specific ones, to gauge how closely the agent's strategy aligns with a valid approach.
Consequently, for each of the 16 error-based scenarios in WildAgtEval, we define an evaluation prompt that incorporates both the reference response and the corresponding scenario-specific validity criteria. We then employ an LLM-based judge to score how closely the agent's error handling aligns with the recommended approach detailed in the reference response. Figures 20-22 show representative evaluation prompts for feature limitation and system failure errors.
Results of Complexity Integration
Table [tab:complexity_stats] summarizes, for each complexity type, the number of functions that contain injected complexity scenarios, along with representative functions. Notably, a single function may host multiple complexities (e.g., track_order() in Figure 9); in such cases, the function's behavior reflects the combined effects of the activated complexities.
| Complexity Type | Coverage | Representative Affected Functions |
|---|---|---|
| API specification complexities | | |
| Ad-hoc rules | 8 API functions | lock_lock, track_order, play |
| Unclear functionality boundary | 20 API functions | get_user_inventory, search_product, make_call |
| Functional dependency | 20 API functions | get_user_inventory, search_media, power_on |
| Ambiguous description | Present in base API system | - |
| API execution complexities | | |
| Information notices | 8 API functions | temperature_set, stock_watchlist, make_call |
| Partially irrelevant information | 8 API functions | search_recipes, knowledge_lookup, stock_watchlist |
| Feature limitation errors | 8 API functions | get_notifications, weather_forecast, track_order |
| System failure errors | 8 API functions | make_call, get_user_inventory, stock_price |
Example of complexity integration. We integrate two complexity types, an ad-hoc rule and a feature limitation error, into the track_order() function depicted in Figure 9. Lines 15-21 and 37-38 validate the ad-hoc rule, which ensures that the order_id parameter follows the industry-standard format "[shipping carrier]-[id]." Specifically, the code checks whether a hyphen is present and whether the carrier name is valid. This ad-hoc rule is also specified in the API documentation (see Figure 7), allowing LLM agents to reference the correct parameter format.
In parallel, Lines 12-23 determine whether the feature limitation error is activated under the current complexity configuration; if so (Lines 26-32), an error message indicates that the original carrier is unavailable and has been replaced by another shipping carrier.
Complete Experiment Details
Evaluation framework. Figure 12 provides a visual overview of our two evaluation methods. In the isolated complexity setting (left side of Figure 12), we preserve a correct conversation history (using ground-truth API calls) up to the point immediately preceding the tested API call. This approach allows us to measure the impact of each complexity type independently. In contrast, the cumulative complexity setting (right side of Figure 12) compounds multiple complexities by continuously integrating the agent's previously predicted API calls throughout the conversation.
Implementation details. The agent prompt structure consists of three primary components: API specification documentation (approximately 34K tokens), conversational context, and API execution guidelines (approximately 19K tokens), providing agents with comprehensive functional specifications and operational procedures. Within this structure, we employ ReAct prompting, where models are instructed to generate responses in the format "Thought: {reasoning or explanatory text} Action: {JSON-format action argument}" using zero-shot inference.
For Claude-based agents, we access Anthropic models via Amazon Bedrock. All other models are served with vLLM (e.g., the vllm-openai:gptoss image for GPT-OSS-120B) and executed on Amazon AWS EC2 p5en instances. The source code is publicly available at https://github.com/Demon-JieHao/WildAGTEval.
From the 300 user-agent conversations provided by WildAgtEval, we evaluate the agents on a stratified subset of 50 conversations to manage the inference cost for state-of-the-art LLMs. To ensure reproducibility and enable further analysis, we release both the evaluation subset of 50 conversations and the full set of 300 conversations. WildAgtEval is distributed under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license.
Evaluation metric. We evaluate LLM agents using (1) API call accuracy and (2) error-handling accuracy for feature limitation and system failure error scenarios.
API call accuracy: Consistent with prior work, we evaluate whether an agent's API behavior genuinely fulfills the user's intended goal. In contrast to approaches that judge success solely from final database states, which can be inadequate because many intents do not yield observable state changes (e.g., "Check the current weather"), we also treat such non-state-changing queries as valid targets for evaluation.
We therefore partition the ground-truth API sequence into two categories: core APIs and support APIs. An agent is deemed successful on a given turn only if it invokes all required core APIs for that turn. Core APIs directly address the user's intent irrespective of whether the database state changes. They often include operations such as add_to_cart(), checkout(), and power_on(), as well as non-database-altering operations (often information-retrieval functions, such as weather_current()) that independently satisfy the user's intent. Because each user request encodes at least one intent, there is at least one core API per request.
Support APIs, on the other hand, are supplementary operations (e.g., search_product()) that provide information needed to correctly invoke downstream core APIs (e.g., add_to_cart(), checkout()). Their use is not scored directly; repeated or suboptimal support calls do not affect success as long as core calls are correct. Hence, final success is assessed solely on the correctness of core API execution.
Formally, let $`C_i^\text{(pred)}`$ denote the set of core APIs invoked by the agent at the $`i`$-th conversation turn, and $`C_i^\text{(gold)}`$ be the corresponding ground-truth set. With $`T`$ total conversation turns, the API call accuracy $`\mathcal{A}_\text{\rm API}`$ is defined as
\begin{equation}
\mathcal{A}_\text{\rm API}
= \frac{1}{T}\sum_{i=1}^{T}
\mathbb{I}\!\left[\, C_i^{(\mathrm{pred})} = C_i^{(\mathrm{gold})} \,\right],
\end{equation}
where $`\mathbb{I}[\cdot]`$ denotes the indicator function that returns 1 if the condition holds and 0 otherwise.
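A direct transcription of this metric is sketched below, assuming the gold and predicted core-API calls have already been normalized into hashable tuples per turn; the normalization scheme shown in the comments is an assumption for illustration.

```python
# Minimal sketch of the API call accuracy metric defined above.
from typing import FrozenSet, List, Tuple

# A normalized call, e.g. ("color_set", (("color", "orange"), ("endpoint", "12")))
Call = Tuple[str, Tuple[Tuple[str, str], ...]]

def api_call_accuracy(pred_core: List[FrozenSet[Call]],
                      gold_core: List[FrozenSet[Call]]) -> float:
    """Fraction of turns whose predicted core-API set exactly matches the gold set."""
    assert len(pred_core) == len(gold_core) and gold_core, "one entry per conversation turn"
    matches = sum(p == g for p, g in zip(pred_core, gold_core))
    return matches / len(gold_core)

# Example with two turns, the second of which misses its required core call:
# turn1 = frozenset({("power_on", (("endpoint", "12"),))})
# api_call_accuracy([turn1, frozenset()], [turn1, turn1])  # -> 0.5
```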
Error-handling accuracy: We define this metric to quantify how closely an agent's fallback response aligns with a reference error-handling response. Formally, let $`E`$ be the index set of conversation turns that instantiate either a feature limitation or a system failure error scenario. For each $`i \in E`$, let $`R_i^{(\mathrm{pred})}`$ denote the agent's fallback response and $`R_i^{(\mathrm{ref})}`$ the reference error-handling response. We employ an LLM-Judge scoring function, $`\mathcal{J}:\mathcal{R}\times\mathcal{R}\to\{1,\dots,5\}`$, where $`\mathcal{R}`$ is the space of textual responses and $`\mathcal{J}(\cdot,\cdot)`$ is an ordinal score reflecting the alignment between two responses. Then, error-handling accuracy $`\mathcal{A}_{\rm{Error}}`$ is defined as
\begin{equation}
\mathcal{A}_{\mathrm{Error}}
= \frac{1}{|E|}\sum_{i\in E}
{\mathcal{J}}\!\big(R_i^{(\mathrm{pred})},\,R_i^{(\mathrm{ref})}\big).
\end{equation}
In our experiments, $`\mathcal{J}`$ is instantiated using Claude-4-Opus to leverage strong reasoning while avoiding overlap with the tested LLM agents. Additional details on reference error-handling responses and the corresponding evaluation prompt are provided in Appendix 7.3.1 and Figures 20-22, respectively.
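The error-handling score can be computed analogously, with the LLM judge abstracted as a callable returning an ordinal 1-5 score; the `judge` callable below stands in for the Claude-4-Opus judge and its prompt, which are not reproduced here.

```python
# Sketch of the error-handling accuracy metric; `judge` is an assumed callable
# implementing J(pred, ref) -> {1, ..., 5} via an LLM-judge prompt.
from typing import Callable, List

def error_handling_accuracy(pred_responses: List[str],
                            ref_responses: List[str],
                            judge: Callable[[str, str], int]) -> float:
    """Average ordinal LLM-judge score over all error-scenario turns E."""
    assert len(pred_responses) == len(ref_responses) and pred_responses
    scores = [judge(pred, ref) for pred, ref in zip(pred_responses, ref_responses)]
    return sum(scores) / len(scores)
```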
Complete Main Analyses
Detailed Comparison of Reasoning and Non-Reasoning Agents under Error-based Complexities
Example [reason_comparision] shows how the reasoning agent (marked in blue) more effectively addresses errors arising from API complexity than the non-reasoning agent (marked in red). Specifically, when the system fails to retrieve weather data for a particular region, an ideal strategy is to query nearby regions as a workaround. This result demonstrates the effectiveness of the reasoning mechanism for accurately identifying and implementing an appropriate workaround for the encountered error.
Comparative Analysis of Error Handling in Reasoning vs. Non-Reasoning Agents (Qwen3-235B-Thinking vs. Qwen3-235B-Instruct)
User Request: "I will travel to paris. Can you show me the 7-day forecast for Paris?"
Agent API Call:
weather_forecast(location="paris", (...)) → Feature limitation error
API Execution Results (Hidden from user):
{"success": false, "message": "Weather data for paris temporarily unavailable. Other regions are accessible"}
Expected Agent Response:
- Since weather data for Paris is currently unavailable, the agent should query weather information for nearby regions; the LLM-judge evaluation prompt is provided in Figure 20.
Actual Output from Reasoning Agent (Qwen3-235B-Thinking):
- "Let me analyze the current situation: The user has been going through a sequence of actions: first, they asked for top 10 Italian food recipes; then they wanted to buy ingredients for the best recipe; now they're asking for a 7-day forecast for Paris. The last API response was for the weather forecast request, which returned: "success": false, "message": "Weather data for paris temporarily unavailable. Other regions are accessible". Since the weather data for Paris is unavailable, I need to inform the user that Paris weather data is unavailable, suggest alternatives (other regions as mentioned in the error message), and ask if they want to try a different location. According to the instructions, I should only make one API call at a time, wait for API results before making next decisions, and signal completion with "<End of Turn>" when the request is fully satisfied. Since the weather API call for Paris has already failed (as shown in the API_RESPONSE), I don't need to make another API call for Paris; instead, I should inform the user of the failure and suggest alternatives. However, I could make a new API call for a different location if the user agrees, but since I need to wait for user confirmation first, my next step should be to inform the user about the Paris weather data unavailability, suggest trying a different location, and wait for the user's response before making any new API calls. So I should not make any API call right now; instead, I should end this turn with a message to the user and wait for their next input. According to the instructions, when the user's request is fully satisfied (or in this case, when we've processed the failed request and provided appropriate feedback), I should use "<End of Turn>". So my response should be a message explaining the situation and offering alternatives, followed by "<End of Turn>". </think> The weather data for Paris is currently unavailable according to the system response. Would you like me to check the forecast for a different location in France, such as Lyon or Marseille, or perhaps another major European city? I can try an alternative location if you'd prefer. <End of Turn>"
Actual Output from Non-Reasoning Agent (Qwen3-235B-Instruct):
- "I'm unable to retrieve the 7-day weather forecast for Paris as the weather data is currently unavailable. You may want to try again later or check another weather service for Paris forecasts. <End of Turn>"
Additional Analyses
Pitfalls of Over-Reasoning in Reasoning-Based Agent
Although reasoning improves error handling, as discussed in Section 5.2, we also observe that reasoning-based agents occasionally overinterpret user intent, thereby reducing overall performance. Specifically, in Example [example_over-reasoning], the user explicitly requests an "American restaurant," indicating that search_restaurant(cuisine_type="America") is the correct API call. However, the reasoning agent Qwen3-235B-Thinking overextends its reasoning by recalling the user's previously stated preference for "New York" and therefore incorrectly invokes search_restaurant(cuisine_type="America", location="New York").
Over-Reasoning by Qwen3-235B-Thinking
Previous API Execution Results:
{"preference": {"location": "New York", "language": "en", "news_categories": ["world", "sports", "health"], (...)}}
(after 3 conversation turns...)
User Request: "Can you find me the best American restaurant?"
Expected API Call:
search_restaurant(cuisine_type="America")
Actual Agent Output:
search_restaurant(cuisine_type="America", location="New York") → Incorrect search
Similar overinterpretation issues occur in other search-related functions (e.g., search_product()), resulting in an overall accuracy of only 30.0% in generating correct API calls. In contrast, another reasoning agent, Claude-4-Sonnet (Think), achieves an accuracy of 64.4%, illustrating how distinct reasoning strategies can produce substantially different outcomes. These findings align with the notably low cumulative evaluation results of Qwen3-235B-Thinking shown in Figure 6.
Impact of API Complexity on User-Instruction Complexity
Implementation details. We apply the API complexities in WildAgtEval (Table [tab:api_complexity_taxonomy]) to two instruction-side conditions distinguished by coreference structure: with coreferences (pronouns or nominal references spanning multiple turns) versus without coreferences (all entities explicitly stated), as illustrated in Figure 13. It is widely known that coreferential conversations generally pose greater instruction-following difficulty than non-coreferential ones.
API-user complexity overlap amplifies difficulty. Figure 14 reports the performance retention ratio when API complexities are injected, defined as
\begin{equation}
\textit{Retention Ratio} = \frac{\text{performance}_{\text{after}}}{\text{performance}_{\text{before}}},
\end{equation}
which measures post-injection performance relative to pre-injection performance. We employ this ratio to control for baseline difficulty, noting that coreferential conversations inherently yield lower pre-injection performance. Normalizing by the pre-injection score thus isolates the marginal effect of introducing API complexity.
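As a concrete illustration with hypothetical numbers: if an agent's pre-injection accuracy were 0.80 and its post-injection accuracy 0.71, the retention ratio would be 0.71 / 0.80 ≈ 0.89, corresponding to roughly an 11% retention decline.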
In general, combining API complexities with coreferences results in considerably greater performance retention degradation, producing an 11.0% decline compared to 7.1% in the absence of coreferences. In particular, among the examined complexity types, irrelevant information complexity again poses the greatest challenge for the LLM agent by reducing average performance retention by an additional 0.1%. These findings suggest that the co-occurrence of API-side and user-side complexities creates more challenging scenarios, indicating a combinatorial effect from both complexity sources.
Additional Failure Analyses
Agents are easily distracted by informational notices in API execution results. As illustrated in Example [example4], when API responses contain auxiliary function notices alongside execution results, agents demonstrate increased susceptibility to overlooking critical functional dependencies. Specifically, when temperature_set() responses include auxiliary function notices about brightness_adjust() and users later request brightness adjustments, agents exhibit an increased tendency to omit the mandatory power_on() prerequisite step, directly attempting brightness modifications that result in failures.
(Figure panels: (a) All API complexities; (b) Irrelevant information)
Distraction by Informational Notices
User Request 1: "Set temperature to be 32 degrees"
Agent API Call:
temperature_set(32)
API Execution Results (Hidden from user):
{"success": "true", "Temperature": "32", "Companion functions": {"API Call": "brightness_adjust", "params": (...)}}
User Request 2: "Set Kitchen Light brightness to 20%"
Expected API Calls:
- power_on("35")
- brightness_adjust("35", temperature="20")
Actual Agent Output:
- (power_on("35") omitted)
- brightness_adjust("35", temperature="20") → Failed
As shown in Table [tab:informational_notice], when informational notices are present, Claude models show sharp performance degradation. This result suggests that agents struggle to maintain focus on task-critical functional dependencies when exposed to potentially relevant contextual information.
Additional examples of ad-hoc rule violation. As shown in Example [example2], despite the clear API specification requiring endpoint parameter formatting as "[device_name]_[id]", agents ignore domain-specific formatting rules and apply device information (e.g., "16") directly without proper string concatenation (e.g., "GarageDoorOpener_16").
Ad-hoc Rule Violation - lock_lock()
API Specification: "When using lock_lock() API, for enhanced security reasons, the endpoint parameter should follow the format '[device_name]_[id]'."
User Request: "Lock the Garage Door Opener"
Expected API Call:
lock_lock("GarageDoorOpener_16")
Actual Agent Output:
lock_lock("16") → Missing device name
Figure 20: LLM-judge evaluation prompt for weather_forecast().
Figure 21: LLM-judge evaluation prompt for get_message().
Figure 22: LLM-judge evaluation prompt for get_user_inventory().
Paper Figures
















