Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity
Introduction
Large language model (LLM) agents, such as Amazon Alexa, have rapidly emerged as powerful interfaces for numerous real-world applications, building on the remarkable performance of LLMs, which surpass human accuracy in college-level mathematics and excel in other high-stakes domains. To evaluate this role, a growing body of work has introduced function- or tool-calling benchmarks, which assess whether agents produce correct API calls that fulfill user instructions. These benchmarks have steadily refined our understanding of whether LLM agents can effectively address diverse instructions and execute complex, multi-step tasks.
Despite these efforts, most existing benchmarks assume an idealized scenario in which API functions are straightforward to use and always produce reliable outputs. As shown in Figure 3(a), however, these assumptions deviate substantially from real-world conditions. In practical deployments (e.g., Amazon Alexa), agents must carefully adhere to extensive, meticulous API specifications (e.g., domain-specific formatting rules such as “[shipping carrier]-[id]”) while also managing imperfect API execution, which often produces noisy outputs (e.g., “sponsored_result”) or encounters runtime errors.
Consequently, current benchmarks often yield overly optimistic capability assessments because they fail to evaluate agent performance under realistic complexities. For example, in Figure 3(b), such benchmarks cannot detect agent failures arising from intricate API specifications, wherein agents use seemingly relevant information (e.g., “16”) rather than adhering to the required format (e.g., “UPS-16”). Similarly, they do not capture failures stemming from noisy API execution results, such as when agents recommend inappropriate content (e.g., a 46-minute recipe for a 20-minute meal request) because they are confused by irrelevant sponsored results.
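The tracking-ID failure above can be made concrete with a minimal sketch. The function name, carrier list, and error message here are hypothetical illustrations of a format-constrained API, not part of the benchmark itself:

```python
import re

# Hypothetical server-side validator for the documented
# "[shipping carrier]-[id]" formatting rule.
TRACKING_ID_PATTERN = re.compile(r"^(UPS|FEDEX|DHL)-\d+$")

def track_package(tracking_id: str) -> str:
    """Reject calls that ignore the documented ID format."""
    if not TRACKING_ID_PATTERN.match(tracking_id):
        return "error: tracking_id must follow '[shipping carrier]-[id]'"
    return f"status for {tracking_id}: in transit"

# An agent that copies only the seemingly relevant "16" fails,
# while the specification-compliant call succeeds.
print(track_package("16"))       # rejected: format violation
print(track_package("UPS-16"))   # accepted
```

A benchmark whose APIs accept any plausible argument would never surface the first failure mode; only a specification-enforcing API does.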
To address this gap, we propose , a novel benchmark that moves beyond idealized APIs to evaluate LLM agents’ ability to invoke external functions under real-world API complexity. Specifically, simulates these complexities within an API system—a fixed suite of functions—thereby exposing challenges during user–agent conversations (Figure 3(b)). Consequently, provides (i) the API system and (ii) user–agent interactions grounded in it, covering a broad range of complexity types, as shown in Table [tab:contribution]: **API specification**, covering intricate documentation and usage rules, and **API execution**, capturing runtime challenges. Across these dimensions, the API system includes 60 specific complexity scenarios, yielding approximately 32K distinct test configurations. User–agent interactions for these scenarios are generated using a recent conversation-generation method.
An important consideration is that the complexity scenarios in the API system should faithfully reflect real-world API environments. To this end, we employ a novel assign-and-inject mechanism that integrates complexities into the API system, leveraging the insight that each complexity type naturally arises in specific categories of API functions, depending on their functionality. For example, irrelevant information frequently occurs in information-retrieval functions (e.g., search_recipes() in Figure 3(b)). Accordingly, we first assign each complexity type to the functions most likely to encounter it in the real world, and then inject these complexities by modifying the corresponding API implementations.
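The assign-and-inject idea can be sketched as follows. All names and data here are illustrative assumptions; the excerpt does not show the benchmark's actual implementation:

```python
# Clean information-retrieval function (illustrative).
def search_recipes(query: str) -> list[dict]:
    return [{"title": "20-min pasta", "minutes": 20}]

def inject_irrelevant_info(fn):
    """Wrap a retrieval function so its output also contains a noisy
    entry (e.g., a sponsored result) that the agent must ignore."""
    def wrapped(*args, **kwargs):
        results = fn(*args, **kwargs)
        results.append({"title": "46-min roast", "minutes": 46,
                        "sponsored_result": True})
        return results
    return wrapped

# Assign: irrelevant-information complexity maps onto retrieval functions,
# where it naturally occurs in real deployments.
ASSIGNMENT = {"irrelevant_info": [search_recipes]}

# Inject: replace the implementation with its complexity-carrying wrapper.
search_recipes = inject_irrelevant_info(search_recipes)
```

The design choice mirrored here is that complexities are attached to function categories (assign) before the implementations are modified (inject), so each complexity appears only where it plausibly arises in the real world.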
Our evaluation on shows that most complexity scenarios consistently degrade performance across strong LLM agents (e.g., Claude-4-Sonnet), with irrelevant-information complexity posing the greatest challenge, causing an average performance drop of 27.3%. Moreover, when multiple complexities accumulate, performance degrades by up to 63.2%. Qualitative analysis further reveals that, when facing unresolvable tasks, LLMs persist in attempting to solve them, ultimately distorting user intent and producing misleading success responses.