Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity
Introduction
Large language model (LLM) agents, such as Amazon Alexa, have rapidly emerged as powerful interfaces for numerous real-world applications, building on the remarkable performance of LLMs, which surpass human accuracy in college-level mathematics and excel in other high-stakes domains. To evaluate this role, a growing body of work has introduced function- or tool-calling benchmarks, which assess whether agents produce correct API calls that fulfill user instructions. These benchmarks have steadily refined our understanding of whether LLM agents can effectively address diverse instructions and execute complex, multi-step tasks.
Despite these efforts, most existing benchmarks assume an idealized scenario in which API functions are straightforward to use and always produce reliable outputs. As shown in Figure 3(a), however, these assumptions deviate substantially from real-world conditions. In practical deployments (e.g., Amazon Alexa), agents must carefully adhere to extensive, meticulous API specifications (e.g., domain-specific formatting rules such as “[shipping carrier]-[id]”) while also managing imperfect API execution, which often produces noisy outputs (e.g., “sponsored_result”) or encounters runtime errors.
Consequently, current benchmarks often yield overly optimistic capability assessments because they fail to evaluate agent performance under realistic complexities. For example, in Figure 3(b), such benchmarks cannot detect agent failures arising from intricate API specifications, wherein agents use seemingly relevant information (e.g., “16”) rather than adhering to the required format (e.g., “UPS-16”). Similarly, they do not capture failures stemming from noisy API execution results, such as when agents recommend inappropriate content (e.g., a 46-minute recipe for a 20-minute meal request) because they are confused by irrelevant sponsored results.
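The tracking-ID failure above can be made concrete with a minimal sketch. The function name, carrier list, and error message here are hypothetical illustrations of a format-constrained API, not part of the benchmark itself:

```python
import re

# Hypothetical server-side validator for the documented
# "[shipping carrier]-[id]" formatting rule.
TRACKING_ID_PATTERN = re.compile(r"^(UPS|FEDEX|DHL)-\d+$")

def track_package(tracking_id: str) -> str:
    """Reject calls that ignore the documented ID format."""
    if not TRACKING_ID_PATTERN.match(tracking_id):
        return "error: tracking_id must follow '[shipping carrier]-[id]'"
    return f"status for {tracking_id}: in transit"

# An agent that copies only the seemingly relevant "16" fails,
# while the specification-compliant call succeeds.
print(track_package("16"))       # rejected: format violation
print(track_package("UPS-16"))   # accepted
```

A benchmark whose APIs accept any plausible argument would never surface the first failure mode; only a specification-enforcing API does.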
To address this gap, we propose , a novel benchmark that moves beyond idealized APIs to evaluate LLM agents’ ability to invoke external functions under real-world API complexity. Specifically, simulates these complexities within an API system—a fixed suite of functions—thereby exposing challenges during user–agent conversations (Figure 3(b)). Consequently, provides (i) the API system and (ii) user–agent interactions grounded in it, covering a broad range of complexity types, as shown in Table [tab:contribution]: **API specification**, covering intricate documentation and usage rules, and **API execution**, capturing runtime challenges. Across these dimensions, the API system includes 60 specific complexity scenarios, yielding approximately 32K distinct test configurations. User–agent interactions for these scenarios are generated using a recent conversation-generation method.
An important consideration is that the complexity scenarios in the API system should faithfully reflect real-world API environments. To this end, we employ a novel assign-and-inject mechanism that integrates complexities into the API system, leveraging the insight that each complexity type naturally arises in specific categories of API functions, depending on their functionality. For example, irrelevant information frequently occurs in information-retrieval functions (e.g., search_recipes() in Figure 3(b)). Accordingly, we first assign each complexity type to the functions most likely to encounter it in the real world, and then inject these complexities by modifying the corresponding API implementations.
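The assign-and-inject idea can be sketched as follows. All names and data here are illustrative assumptions; the excerpt does not show the benchmark's actual implementation:

```python
# Clean information-retrieval function (illustrative).
def search_recipes(query: str) -> list[dict]:
    return [{"title": "20-min pasta", "minutes": 20}]

def inject_irrelevant_info(fn):
    """Wrap a retrieval function so its output also contains a noisy
    entry (e.g., a sponsored result) that the agent must ignore."""
    def wrapped(*args, **kwargs):
        results = fn(*args, **kwargs)
        results.append({"title": "46-min roast", "minutes": 46,
                        "sponsored_result": True})
        return results
    return wrapped

# Assign: irrelevant-information complexity maps onto retrieval functions,
# where it naturally occurs in real deployments.
ASSIGNMENT = {"irrelevant_info": [search_recipes]}

# Inject: replace the implementation with its complexity-carrying wrapper.
search_recipes = inject_irrelevant_info(search_recipes)
```

The design choice mirrored here is that complexities are attached to function categories (assign) before the implementations are modified (inject), so each complexity appears only where it plausibly arises in the real world.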
Our evaluation on shows that most complexity scenarios consistently degrade performance across strong LLM agents (e.g., Claude-4-Sonnet), with irrelevant-information complexity posing the greatest challenge, causing an average performance drop of 27.3%. Moreover, when multiple complexities accumulate, performance degrades by up to 63.2%. Qualitative analysis further reveals that, when facing unresolvable tasks, LLMs persist in attempting to solve them, ultimately distorting user intent and producing misleading success responses.