Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis
Accepted as a conference paper at ICLR 2026

TALK, EVALUATE, DIAGNOSE: USER-AWARE AGENT EVALUATION WITH AUTOMATED ERROR ANALYSIS

Penny Chong^1*, Harshavardhan Abichandani^1, Jiyuan Shen^1, Atin Ghosh^1, Min Pyae Moe^1, Yifan Mai^2, Daniel Dahlmeier^1
^1 SAP, ^2 Stanford University
{penny.chong, harshavardhan.abichandani, jiyuan.shen}@sap.com
{atin.ghosh, min.pyae.moe, d.dahlmeier}@sap.com
yifan@cs.stanford.edu

ABSTRACT

Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups, regex match, etc., adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role or expertise in the interaction, providing incomplete insights into the agent's performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency, and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose).^1 (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals—such as tool signatures and responses—as natural language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both turn efficiency and intermediate progress of the agent, complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors and providing actionable feedback for agent improvement.
We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance, with peaks of 8-10% on our proposed metrics, after incorporating the identified error remedies into the agent's design.

1 INTRODUCTION

Large Language Model (LLM) agents (Liu et al., 2023; Jang et al., 2025; Koh et al., 2024) are increasingly being adopted for many real-world tasks in various domains due to their potential to fully automate mundane workflows and enhance productivity. However, evaluation of agents remains a challenge today due to the heterogeneous domains the agents operate in. As every domain comes with its own goals, creating a scalable, unified evaluation framework that reliably assesses agent performance across diverse tasks is non-trivial. Existing works (Qian et al., 2024; Lu et al., 2024; Barres et al., 2025; Chang et al., 2024) each propose their own evaluation methods, e.g., checking database states, tool signatures, or exact matches, which differ in scope and assumptions, making unification challenging. Moreover, since agent behavior is heavily influenced by the conversation trajectory with the user, current assessment methods that overlook the user's role in the interaction may fail to comprehensively capture the agent's performance.

Given that agents are non-deterministic and it is difficult to craft reference conversations, a common practice is to dynamically simulate the user responses in the conversation loop with the agent (Yao et al., 2024). This has been adopted as a common practice for agent evaluation because static user setups, where user messages are predetermined, do not work.

* Corresponding author.
^1 Code and dataset are available in the repository https://github.com/SAP-samples/agent-quality-inspect.
This is because the agent's responses to earlier predetermined user inputs may diverge from the reference conversation for which the static messages were curated. However, most works employing dynamic conversation have limitations because they do not systematically separate user persona from task instructions, thus failing to account for the impact of user behavior (independent of the task) on agent performance and providing incomplete insights. This matters because good agents ask clarifying questions when given incomplete input, while poor agents do not; thus, systematic testing is essential for fair comparison across agents and tasks. Despite the complexity of agent trajectories, existing works (Qiao et al., 2024; Xiao et al., 2024; Qian et al., 2025) often stop at metric reporting.

To address these shortcomings, we propose the TED framework (Talk, Evaluate, Diagnose). (1) In the Talking stage, we decouple user personas from task instructions and introduce a user-aware agent evaluation framework based on reusable, generic persona templates, enabling diverse and systematic creation of test scenarios. (2) In the Evaluation stage, we adapt existing datasets by representing subgoals—such as tool signatures and responses—as natural language grading notes, and evaluate them with LLM-as-a-judge. We propose new metrics that capture not only partial progress and task success, but also the efficiency of task progression, measured in conversational turns. (3) In the Diagnosis stage, we introduce an automated error analysis tool that examines inconsistencies of both agents and LLM-as-a-judge, automatically identifies errors, and offers actionable feedback.

We summarize our contributions as follows:

i) Propose an agent evaluation framework applicable across heterogeneous agent domains that is built on reusable, generic expert and non-expert persona templates that systematically assess the impact of the user's role on agent performance.
ii) Introduce a benchmark by adapting existing datasets to grading notes—natural language checklists of subgoals. Grading notes serve as assertion criteria for LLM-as-a-judge, which scores the agent performance based on its trajectory log without requiring access to the environment.

iii) Introduce new metrics to accompany the user-aware evaluation setup, which are essential for capturing an agent's progress with respect to the number of conversational turns.

iv) Propose an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors and providing actionable feedback for agent improvement.

2 RELATED WORKS

Conversation simulation. A majority of agents today are conversational and involve invoking multiple tools to solve a task. With complex tasks requiring human interaction, the literature (Yao et al., 2024; Wang et al., 2023; Xiao et al., 2024) has adopted a dynamic setup using an LLM-simulated user (user proxy) for automated testing of agents. However, existing dynamic evaluation methods face several limitations: some rely on user instruction prompts that are tightly coupled with specific agents, scenarios, and personas (Yao et al., 2024), while others omit user personas altogether (Lu et al., 2024)—both of which limit the reusability of evaluation methods across different domains. Although agent performance is influenced by the behavior of the user proxy, this dependency is rarely analyzed systematically due to user personas being inconsistently defined across samples (Huang et al., 2025). While prior work (Barres et al., 2025) introduced a systematic evaluation of "easy" and "hard" personas for one of the domains, their telco-specific user prompt templates are not generic and limit reusability across domains.
Our TED framework differs from prior work by allowing end-users to systematically test the agent with reusable, generic expert and non-expert personas that are agent- or task-agnostic. We demonstrate this in our experiments on the τ²-bench (Barres et al., 2025) and ToolSandbox (Lu et al., 2024) datasets, which span various domains such as airline booking, messaging, setting reminders, etc., all evaluated using the same user persona templates without any tuning.

Metrics and error analysis. To evaluate agent performance, most prior work (Wang et al., 2023; Xie et al., 2024) relies on success rate. However, the metric focuses solely on the final outcome and provides only a coarse-grained assessment of agent behavior. This was first addressed by AgentBoard (Chang et al., 2024), which introduced progress rate as a fine-grained metric, but in a multi-step agent-environment setting without conversation simulation. We extend this to multi-turn settings and propose metrics that combine turn-level efficiency and progress rate. Unlike MINT (Wang et al., 2023), which measures only final success after t interactions, our turn-aware evaluation captures per-turn progress and efficiency, offering a richer measure of agent performance in complex tasks. Given the non-deterministic behavior of agents, Yao et al. (2024) report the pass@k and pass^k metrics. In line with the pass@k metric used to assess whether at least one out of k trials is successful, we also report new metrics that capture the best-case performance under the stochastic runs. Instead of checking goal attainment via direct database lookups, tool signatures, etc., we represent all subgoals as grading notes. This approach abstracts complex goals, is user-friendly, and does not require system-state access, making our evaluation applicable to both agents that modify the system state and those that do not.
While prior work uses natural language for only some assertions (Barres et al., 2025), we extend this to cover tool calls and end responses. Similar to Cui et al. (2025), we identify common errors made by LLM agents; however, our approach discovers these errors in an unsupervised manner via automatic analysis of real-time logs rather than relying on predefined categories.

3 TALK, EVALUATE, DIAGNOSE: TED FRAMEWORK

We define an LLM agent as an automated system that performs tasks via interactions with users, tools, and the environment. Its action space includes tool use, responses to users, and internal reasoning. After each action, the agent receives partial state information, such as API responses or a subsequent user utterance. To systematically evaluate agents, we introduce the TED framework—Talk, Evaluate, and Diagnose—as complementary and interdependent stages. In the Talking stage, diverse user-agent interactions are simulated to study how robustly agents complete tasks for different types of users, such as non-expert users who require more conversational turns. Traditional metrics like success or progress rates often fail to capture subtleties of turn efficiency, motivating metrics that consider both task progress and turn efficiency during the Evaluation stage. Moreover, evaluation using LLM-as-a-judge is subject to stochasticity and potential errors. The Diagnosis stage helps extract meaningful insights from inconsistencies and errors made by both the agent and the LLM-as-a-judge. Together, these stages form a unified framework, as detailed in the following subsections.

3.1 THE TALKING STAGE

Dynamic evaluation with expert and non-expert user personas.
Existing methods that use an LLM-simulated user, also known as a user proxy (Yao et al., 2024; Lu et al., 2024), are constrained by either tightly coupled or missing user personas, hindering systematic analysis of the effect of user behavior on agent performance. Tightly coupling task complexity and user persona makes it challenging to isolate their individual impacts on agent performance. For instance, when an agent answers technical legal questions, the outcome may differ depending on whether the user is an expert or a layperson, even if the task complexity remains constant. However, if both the task and the user expertise (as determined by the user persona) vary simultaneously, it becomes difficult to determine which factor is driving performance differences. In this work, we propose a scalable, dynamic agent evaluation framework that leverages reusable, generic expert and non-expert user personas to simulate realistic user interactions across a wide range of scenarios.

Let P = {p_expert, p_non-expert} denote the set of persona prompts with different user expertise levels, I be the set of task instructions, and U be the set of full user prompts consumed by the LLM-simulated user. We abstract the full user prompt templating process as a function f, combining a user persona prompt p with a task instruction i:

u = f(p, i),    (1)

where p ∈ P, i ∈ I, and u ∈ U. The function f includes general rules for the user proxy, along with a two-step process—reflection followed by response. For each agent and task instruction sample i, we vary only the persona prompt p to generate u_expert and u_non-expert. Refer to Appendix A.3 for the prompt f and the user persona template p. An example of a task instruction i is shown in Fig. 5.

3.2 THE EVALUATION STAGE

We define the set of grading notes G as natural language texts used as assertion-based ground truths by LLM-as-a-judge. Each subgoal is represented by one such grading note.^2
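The user-prompt templating u = f(p, i) from Eq. (1) can be sketched as follows. The rule and persona strings below are illustrative placeholders, not the actual TED templates (those are given in Appendix A.3):

```python
# Minimal sketch of u = f(p, i): combine general user-proxy rules, a
# persona template p, and a task instruction i into one user prompt.
# All strings here are hypothetical stand-ins for the paper's prompts.

GENERAL_RULES = (
    "You simulate a user talking to an agent. First reflect on the agent's "
    "last message, then respond in character."
)

PERSONAS = {
    "expert": "You know the domain well; give precise, complete information.",
    "non_expert": (
        "You are unfamiliar with the domain; give partial details and only "
        "reveal more when asked."
    ),
}

def build_user_prompt(persona: str, task_instruction: str) -> str:
    """f(p, i): assemble the full user prompt u for the LLM-simulated user."""
    return "\n\n".join(
        [GENERAL_RULES, PERSONAS[persona], f"Task: {task_instruction}"]
    )

prompt = build_user_prompt("non_expert", "Cancel reservation ABC123.")
```

For a given task instruction, only the persona key is varied, yielding the u_expert and u_non-expert pair described above.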
Unlike prior work that uses keypoints (Hao et al., 2025) or limited natural language assertions (Barres et al., 2025), we expand coverage to include tool calls, their order, and key agent responses in G. While we adopt the notion of milestones (key events that must happen) (Lu et al., 2024) for the set G, we do not follow their DAG-based construction method. An example of a grading note is: "Agent should enable Wifi." More examples are in Appendix A.12.

^2 A subgoal is represented by a grading note, which is a natural language text.

3.2.1 LLM-AS-A-JUDGE AND MAXPROGRESSRATE@k

LLM-as-a-judge. We extend beyond the multi-step agent-environment setting and exact match metric (Chang et al., 2024) by evaluating agents in a multi-turn user-agent setup, where grading notes serve as subgoals to assess both intermediate and final states, tool calls, as well as the agent's output responses. Let D = {(i, G_i) | i ∈ I} be the test dataset, where i ∈ I is a task instruction, G_i = {g_{i,1}, g_{i,2}, ..., g_{i,n_i}} is the set of grading notes associated with the task instruction i, and |G_i| is the number of subgoals, i.e., grading notes. We denote the corresponding agent trajectory, which includes information on tool calls, agent responses, and user utterances for the entire conversation up to the final conversational turn, as τ_i. For a task sample (i, G_i), the progress of the agent given its trajectory τ_i is defined as the proportion of subgoals achieved:

progress(i, G_i, τ_i) = (1 / |G_i|) Σ_{j=1}^{|G_i|} LLM_judge(i, g_{i,j}, τ_i),    (2)

where LLM_judge(·) returns 1 if the subgoal g_{i,j} is achieved, and 0 otherwise. We define the progress rate as the average progress across all samples in the dataset D, i.e., progress_rate = E_{(i,G_i)∼D}[progress(i, G_i, τ_i)].
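Eq. (2), combined with the majority vote over repeated judge runs described below it, can be sketched as follows. Here `judge_votes` is a hypothetical stand-in for repeated LLM-as-a-judge calls:

```python
from collections import Counter

def progress(grading_notes, judge_votes):
    """Eq. (2): the fraction of subgoals judged as achieved.

    judge_votes[g] is a list of 0/1 verdicts from repeated judge runs on
    grading note g; the majority vote is the final per-subgoal score.
    (A stand-in for real LLM-as-a-judge calls.)
    """
    achieved = 0
    for g in grading_notes:
        # most_common(1) returns [(value, count)] for the majority verdict
        achieved += Counter(judge_votes[g]).most_common(1)[0][0]
    return achieved / len(grading_notes)

notes = ["enable wifi", "call get_status", "confirm with user"]
votes = {
    "enable wifi": [1, 1, 1],        # consistently achieved
    "call get_status": [1, 0, 1],    # majority says achieved
    "confirm with user": [0, 0, 1],  # majority says missed
}
score = progress(notes, votes)  # (1 + 1 + 0) / 3
```

The dataset-level progress rate is then just the mean of this quantity over all task samples.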
Using LLM-as-a-judge with grading notes reduces the need for custom dataset-specialized evaluation harnesses and infrastructure. In this formulation, the judge is queried once for every subgoal. However, to ensure reliability, we run the judge multiple times and take a majority vote as the final score. We discuss the stability of the judge further in Section 3.3. The LLM_judge(·) prompt is provided in Appendix A.4.

From pass@k to MaxProgressRate@k. Given the non-deterministic nature of agent behavior, a commonly used evaluation metric is pass@k = E_task[1 − C(n−c, k) / C(n, k)] (Yao et al., 2024), which measures the probability that at least one trial succeeds when sampling k out of n total trials; c denotes the number of successful trials. Each trial represents a complete multi-turn conversation, consisting of multiple back-and-forth user-agent exchanges. By this definition, when n = k, the pass@k metric evaluates to 1 if at least one of the k trials for a given task is successful, and 0 otherwise. The metric then corresponds to the expected maximum success per task, averaged over all tasks, measuring the agent's best performance across the trials:

pass@k = E_{(i,G_i)∼D}[ max { success(i, G_i, τ_i^l) | l = 1, ..., k } ],  where success(·) ∈ {0, 1}.    (3)

The notation success(i, G_i, τ_i^l) for a given sample (i, G_i) indicates whether the agent with trajectory τ_i^l successfully completes the task on the l-th trial, with a value of 1 for success and 0 for failure. By taking the maximum success over k trials via the max{·} operator, we capture the agent's best performance across these trials. We then relax the strict success condition in Equation 3 by defining a thresholded progress-based success criterion:

pass@k = E_{(i,G_i)∼D}[ max { 1{progress(i, G_i, τ_i^l) ≥ threshold} | l = 1, ..., k } ],    (4)

where 1{·} is the indicator function and threshold ∈ [0, 1] defines the minimum progress for a trial to be considered successful. Setting threshold = 1 counts only trials with full subgoal completion (i.e., progress(i, G_i, τ_i^l) = 1) as successful, and treats any partial progress (i.e., progress(i, G_i, τ_i^l) < 1) as failure.

Nonetheless, Equation 4 applies a hard threshold—treating all progress below the threshold as failure—and discards the agent's fine-grained progress. To retain this information, we define a soft version, MaxProgressRate@k, to evaluate the agent's best performance based on the maximum progress achieved at the final conversational turn, across k trials, averaged over all samples:

MaxProgressRate@k = E_{(i,G_i)∼D}[ max { progress(i, G_i, τ_i^l) | l = 1, ..., k } ].    (5)

3.2.2 PROGRESS AND TURN-LEVEL EFFICIENCY

The turns within each conversational trial are interdependent: errors in earlier turns can propagate and impact task success. While the MaxProgressRate@k metric in Equation 5 captures non-determinism by measuring the agent's best performance across the k trials and evaluates fine-grained progress only at the final conversational turn, it does not assess how quickly progress is made throughout the conversation. This gap in evaluation leads us to consider two distinct scenarios: i) where making early progress matters, and ii) where it does not.

i) Early progress matters. In this subsection, we view progress as a function of conversational turns, and for notational simplicity, we denote the progress at turn t by p(t) := progress(i, G_i, τ_i^l[1:t]), where τ_i^l[1:t] denotes the segment of the agent trajectory τ_i^l from the first turn up to turn t. Let p(t): [0, T] → R represent the discrete progress values at each turn.
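The trial-level aggregates of Eqs. (3)-(5) can be sketched as below: the unbiased pass@k estimator, the thresholded variant at n = k, and MaxProgressRate@k for a single task. Here `trial_progress` is assumed to hold the final-turn progress of each of the k trials:

```python
from math import comb

def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    with c successful trials out of n total trials."""
    if n - c < k:
        # Every size-k sample must contain at least one success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_thresholded(trial_progress, threshold=1.0):
    """Eq. (4) at n = k for one task: 1 if any trial's final
    progress meets the threshold, else 0."""
    return float(max(1 if p >= threshold else 0 for p in trial_progress))

def max_progress_rate_at_k(trial_progress):
    """Eq. (5) for one task: best final-turn progress across the k trials."""
    return max(trial_progress)
```

Averaging the last two quantities over all task samples gives the dataset-level pass@k and MaxProgressRate@k reported in the experiments.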
For computing AUC, we treat the discrete values p(t) as a continuous, monotonically increasing function obtained via linear interpolation. The function measures the agent's progress at turn t by the proportion of achieved subgoals, i.e., grading notes, assuming previously completed milestones cannot be undone. The AUC of the continuous progress function is then defined as AUC = ∫_0^T p(t) dt, where T is the maximum number of turns of a conversation. For a given task sample (i, G_i) ∈ D, we define p_1(t) and p_2(t) to be the progress functions of two agents, respectively. Consider the case where both agents start from 0 progress, i.e., p_1(0) = p_2(0) = 0, and the first agent is strictly more efficient than the second, i.e., p_1(t) ≥ p_2(t), ∀t ∈ (0, T]. We then have:

AUC_1 = ∫_0^T p_1(t) dt > ∫_0^T p_2(t) dt = AUC_2.    (6)

In this scenario, an efficient agent—compared to a less efficient one—will achieve a higher AUC score. The AUC rewards the agent for achieving subgoals early, which is crucial for long-horizon tasks such as navigation (Shridhar et al., 2020; Chevalier-Boisvert et al., 2018), where finding the right room or object early often reduces downstream confusion. Likewise, in multi-step planning tasks like web browsing (Zhou et al., 2023), early retrieval of relevant results significantly narrows the search space, increasing the likelihood of success.

ii) Early progress does not matter. While the AUC metric favors early progress, one may argue that this is unnecessary in tasks like booking a trip, where reserving a plane and a hotel are interchangeable subtasks, and order should not affect the outcome. In such cases, completing the simpler subtask with fewer subgoals first, followed by the more complex one (or vice versa), should not affect the final score, i.e., the case where two agents start with zero progress and reach the same progress within the same number of conversational turns, despite the differences in trajectories.
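Since p(t) is piecewise-linear under the interpolation above, its integral reduces to the trapezoid rule over turns. A sketch for a single trial (assuming turn-indexed progress values with p(0) = 0; the paper's exact interpolation and any normalization by T may differ):

```python
def progress_auc(progress_by_turn):
    """AUC = integral of the linearly interpolated progress curve p(t),
    computed with the trapezoid rule. progress_by_turn[t] = p(t) for
    t = 0..T, with p(0) = 0."""
    return sum(
        0.5 * (progress_by_turn[t] + progress_by_turn[t + 1])
        for t in range(len(progress_by_turn) - 1)
    )

# An agent that completes subgoals early accumulates more area:
early = [0.0, 0.8, 1.0, 1.0]  # most subgoals done by turn 1
late = [0.0, 0.1, 0.5, 1.0]   # progress arrives in the final turns
assert progress_auc(early) > progress_auc(late)
```

Both curves end at full progress, so a final-turn metric would score them identically; the AUC separates them by when the progress was made.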
To handle scenarios where early progress is not vital, one can weight the increase in progress uniformly by computing the progress per turn (PPT), forming a telescoping series:

PPT = (1/T) Σ_{t=0}^{T−1} [p(t+1) − p(t)] = p(T) / T,    (7)

where p(t) is the discrete progress value at turn t, T is the minimum number of conversational turns to reach the final achieved progress p(T), and p(0) = 0. To align with the MaxProgressRate@k metric from Equation 5, we report both MaxAUC@k and MaxPPT@k, averaged over the task samples, while setting n = k. Further details are in Appendix A.5.

3.3 THE DIAGNOSIS STAGE

Automated Error Analysis. Although a majority of existing works (Xiao et al., 2024; Qian et al., 2025) stop at reporting final dataset metrics, we argue that evaluation should also include error analysis and actionable improvements. While using grading notes and LLM-as-a-judge simplifies our evaluation, the inherent non-determinism of LLMs remains a challenge. Our proposed metrics aggregate results using a majority vote from judge runs and the best agent performance across k trials. However, the aggregation overlooks consistency—an essential aspect of robust agent evaluation. To address this, we further introduce an automated error analysis tool that analyzes both judge and agent inconsistencies by plotting sample-level progress expectations and variances, offering deeper insights on top of the final aggregated metrics.
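The telescoping structure of Eq. (7) means the per-turn increments collapse to p(T)/T; a one-function sketch that makes this explicit:

```python
def progress_per_turn(progress_by_turn):
    """Eq. (7): PPT as the mean per-turn increment, which telescopes
    to p(T) / T. progress_by_turn[t] = p(t) for t = 0..T, p(0) = 0."""
    T = len(progress_by_turn) - 1
    increments = [
        progress_by_turn[t + 1] - progress_by_turn[t] for t in range(T)
    ]
    # Telescoping: the increments sum to the final progress p(T).
    assert abs(sum(increments) - progress_by_turn[T]) < 1e-12
    return sum(increments) / T
```

Because only p(T) and T matter, two trajectories that reach the same final progress in the same number of turns get the same PPT regardless of when the progress was made, which is exactly the order-insensitivity motivated above.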
[Figure 1: Our proposed two-step automated error discovery approach that automatically identifies common errors of the agent based on judge and agent inconsistencies: (i) low-level error identification, x = f_iden(ε); (ii) semantic clustering of error types, C = f_clus(X, G). Identical error colors indicate that similar low-level errors are clustered into the same high-level category.]

For each subgoal g_{i,j} ∈ G_i, we define a binary random variable Z_{i,j}, where Z_{i,j} = 1 if the agent achieves the j-th subgoal under the given trajectory, and 0 otherwise. Let the probability of achieving the subgoal g_{i,j} be Pr(Z_{i,j} = 1) = z_{i,j}. The progress for the sample (i, G_i) is defined as the proportion of subgoals the agent achieved, i.e., progress(i, G_i, τ_i^l) = Σ_j Z_{i,j} / |G_i|. Its expectation and variance are given by:

E[progress(i, G_i, τ_i^l)] = Σ_j z_{i,j} / |G_i|;   Var[progress(i, G_i, τ_i^l)] = Σ_j z_{i,j}(1 − z_{i,j}) / |G_i|²,    (8)

where z_{i,j} = (1/Q) Σ_{q=1}^{Q} z_{i,j}^{(q)} is estimated by averaging over Q judge runs per subgoal, generalizing the single binary judge output in Equation 2 to a probabilistic estimate. Plotting E[progress(i, G_i, τ_i^l)] and Var[progress(i, G_i, τ_i^l)] for each task (i, G_i) ∈ D captures the judge's inconsistency through the variance, while the agent's inconsistency is reflected in the different expected progress values across the k trials.
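The per-sample moments of Eq. (8) can be estimated directly from the Q judge runs per subgoal. A sketch, assuming `judge_scores` is a list of per-subgoal binary score lists:

```python
def progress_moments(judge_scores):
    """Eq. (8): expectation and variance of sample-level progress.

    judge_scores[j] holds the Q binary judge outputs for subgoal j;
    their mean z_j estimates Pr(Z_j = 1)."""
    m = len(judge_scores)  # |G_i|: number of subgoals
    z = [sum(runs) / len(runs) for runs in judge_scores]
    mean = sum(z) / m
    # Each subgoal contributes Bernoulli variance z_j (1 - z_j).
    var = sum(zj * (1 - zj) for zj in z) / m ** 2
    return mean, var

# Two subgoals, Q = 4 judge runs each; the second subgoal's verdicts
# are inconsistent, so it contributes all of the variance:
mean, var = progress_moments([[1, 1, 1, 1], [1, 0, 1, 0]])
```

A nonzero variance flags judge inconsistency on a sample, while spread in the expected progress across the k trials flags agent inconsistency, matching the diagnostic plot described above.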
Building on this, we propose an automated error discovery approach that automatically identifies the common errors of the agent based on judge and agent inconsistencies. Our approach consists of two steps: (1) low-level error identification, and (2) semantic clustering of error types. For every binary score z_{i,j}^{(q)} from the judge, there is a corresponding explanation e_{i,j}^{(q)}. We define e_{i,j} = {e_{i,j}^{(1)}, ..., e_{i,j}^{(Q)}} and the error candidate set E = {(g_{i,j}, e_{i,j}) | Pr(Z_{i,j} = 1) < 1} to be the set of tuples of subgoals and corresponding explanations where the judge's prediction is inconsistent or indicates the subgoal may not have been achieved. For each candidate error ε ∈ E, we first perform the low-level error identification step, followed by a semantic clustering step:

x = f_iden(ε);   C = f_clus(X, G),    (9)

where f_iden(·) and f_clus(·) are the error identification and clustering prompt functions, respectively, x ∈ X is the low-level error, and C is the cluster label. The clustering step merges semantically similar errors into the same group and provides a high-level error summary. We illustrate this two-step process in Fig. 1. Note that errors with the same color are merged into one cluster label. We also show preliminary results demonstrating agent improvement by leveraging the identified errors. For a detailed algorithm of our automated error analysis method, and the f_iden(·) and f_clus(·) prompt templates, refer to Appendix A.6.

4 DATASETS AND EXPERIMENTAL SETUP

We use two agent benchmarks: τ²-bench (Barres et al., 2025) and ToolSandbox (Lu et al., 2024). For τ²-bench, we use 21 and 25 samples from the airline and retail domains, respectively. For the airline domain, we further divide the samples into "easy" and "hard".
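The two-step discovery of Eq. (9) can be sketched as follows. Here `identify_error` and `cluster_errors` are toy keyword heuristics standing in for the actual f_iden and f_clus LLM prompt functions (Appendix A.6); only the candidate-set construction follows the definition of E above:

```python
def build_error_candidates(grading_notes, judge_scores, explanations):
    """E = {(g, e) : Pr(Z = 1) < 1}: subgoals the judge did not
    consistently mark as achieved, paired with its explanations."""
    return [
        (g, explanations[g])
        for g in grading_notes
        if sum(judge_scores[g]) / len(judge_scores[g]) < 1.0
    ]

def identify_error(candidate):
    """Stand-in for f_iden(ε): name a low-level error for a candidate."""
    g, _explanations = candidate
    return f"subgoal not met: {g}"

def cluster_errors(low_level_errors):
    """Stand-in for f_clus(X, G): group similar low-level errors
    into high-level categories (toy prefix-based grouping)."""
    clusters = {}
    for err in low_level_errors:
        key = err.split(":")[0]  # toy grouping key, not semantic clustering
        clusters.setdefault(key, []).append(err)
    return clusters
```

In the actual pipeline, both steps are LLM calls over the judge's free-text explanations, so semantically similar errors (e.g., different missing-tool failures) are merged into one cluster rather than matched on surface strings.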
For ToolSandbox, we select 37 base scenarios and exclude variants with different initial messages or multi-turn conversations, as these can be effectively simulated using our dynamic user proxy—where both initial and subsequent messages are generated dynamically, and the non-expert user persona effectively simulates multi-turn conversations. Our setup offers greater variability than the original variants with fixed initial messages. The base scenarios consist of a variety of task-oriented domains ranging from contact updates and messaging to reminders, currency conversion, etc.

Table 1: Overall performance of different agent models on τ²-bench airline and ToolSandbox, using gpt-4.1 as user proxy and LLM-as-a-judge. Results are displayed as scores for Expert Persona | Non-expert Persona. For metrics with @k, the number of trials is n = k = 20 for τ²-bench and n = k = 8 for ToolSandbox. MaxProgressRate@k is abbreviated as MaxProg@k.

Agent Model     MeanProg@k    MaxProg@k     MaxAUC@k      MaxPPT@k      pass@k

τ²-bench Airline Domain (Easy)
gpt-4.1         0.95 | 0.82   1.00 | 1.00   0.99 | 0.81   0.80 | 0.50   1.00 | 1.00
gpt-4o          0.79 | 0.86   1.00 | 1.00   0.96 | 0.86   0.70 | 0.53   1.00 | 1.00
gpt-4o-mini     0.70 | 0.61   0.90 | 0.90   0.85 | 0.73   0.60 | 0.37   0.80 | 0.80
gpt-5           0.92 | 0.92   1.00 | 1.00   0.97 | 0.88   0.67 | 0.54   1.00 | 1.00
mistral-nemo    0.87 | 0.49   1.00 | 0.80   0.97 | 0.67   0.67 | 0.48   1.00 | 0.60
mistral-large   0.65 | 0.53   1.00 | 1.00   0.96 | 0.79   0.60 | 0.42   1.00 | 1.00

ToolSandbox Dataset
gpt-4.1         0.91 | 0.87   0.98 | 0.97   0.96 | 0.92   0.84 | 0.73   0.92 | 0.92
gpt-4o          0.95 | 0.94   0.99 | 1.00   0.98 | 0.96   0.94 | 0.81   0.95 | 0.97
gpt-4o-mini     0.91 | 0.85   0.95 | 0.93   0.94 | 0.90   0.89 | 0.77   0.89 | 0.84
gpt-5           0.78 | 0.78   0.97 | 0.91   0.95 | 0.84   0.83 | 0.66   0.95 | 0.84
mistral-nemo    0.72 | 0.71   0.92 | 0.96   0.88 | 0.87   0.76 | 0.65   0.84 | 0.92
mistral-large   0.82 | 0.79   0.94 | 0.95   0.93 | 0.91   0.87 | 0.75   0.89 | 0.89
We use only milestones (key events that must happen) and convert them into grading notes. Importantly, any existing benchmark can be adapted to fit our evaluation framework by converting the ground truths into grading notes. We set the maximum number of turns to 15 for τ²-bench and 8 for ToolSandbox. Each sample is evaluated over multiple agent trials: n = 20 trials for τ²-bench and n = 8 trials for ToolSandbox. We report metrics at k = n trials. We use the gpt-4.1 model as LLM-as-a-judge for grading the subgoals and for error identification and clustering in our experiments. Unless specified, the user proxy also uses the gpt-4.1 model. More details are in Appendix A.12.

5 RESULTS AND DISCUSSION

5.1 MAIN RESULTS

Table 1 summarizes the overall performance of various agent models on the τ²-bench airline domain and ToolSandbox, with gpt-4.1 serving as the user proxy. Additional results for the τ²-bench retail domain are provided in Appendix A.10, Table 15. On easy airline samples, metrics such as MaxProgressRate@k and pass@k tend to saturate, with most models achieving near-perfect scores. MeanProg@k, which measures the average progress rate across all k trials, captures how consistently agents can achieve the subgoals. However, even MeanProg@k can remain high for strong models, making it less effective at distinguishing between top-performing agents. While MaxProgressRate@k gives us the best agent performance over k trials, it fails to give any meaningful distinction between models, especially for easy samples. By incorporating MaxAUC@k and MaxPPT@k, we obtain a more comprehensive evaluation of agent performance. For example, on τ²-bench, gpt-4o-mini (expert) and mistral-large (expert) achieve similar MeanProg@k scores (differing by only 5%). However, MaxAUC@k shows a larger difference of 10% (0.96 vs 0.85) and a change in rankings.
Further comparison of the MaxAUC@k with the MaxPPT@k scores for the two models suggests that mistral-large achieves greater turn-level efficiency and faster progress in the initial turns, but both models have equal average progress over turns, as indicated by the identical MaxPPT@k scores. A similar pattern persists in the ToolSandbox dataset, where models such as gpt-5 and mistral-nemo show larger differences on the MaxAUC@k and MaxPPT@k metrics when interacting with the expert user, but a smaller difference on MaxProgressRate@k.

[Figure 2: Progress curves for selected ToolSandbox samples. (a) search reminder with recency upcoming: mistral-nemo (non-expert, purple; AUC=0.88, PPT=0.20) vs. gpt-5 (non-expert, blue; AUC=0.61, PPT=0.20). (b) find current city low battery mode: mistral-nemo (expert, purple; AUC=0.77) vs. gpt-5 (non-expert, blue; AUC=0.64). (c) add reminder content and date and time: gpt-4.1 (non-expert, green; AUC=0.50) vs. mistral-large (non-expert, red; AUC=0.34).]

[Figure 3: E[progress(i, G_i, τ_i^l)] (dot) and Var[progress(i, G_i, τ_i^l)] (error bar) on the τ²-bench gpt-4o-mini agent using non-expert user proxy gpt-4.1. Each dot is an agent trajectory from a single trial, and each task sample from the airline domain is evaluated using n = k = 20 agent trials. We display only three example samples here. Sample 14 belongs to the hard split and the others to the easy split.]
We also examine the impact of user persona on agent performance. The non-expert user simulates an inexperienced user, resulting in agents taking more conversational turns to complete the task. This is consistently reflected in the MaxAUC@k scores, which are lower for the non-expert than for the expert user persona across all models and datasets. This is because the expert persona provides clearer and more informative input, enabling the agents to complete tasks faster. The baseline metric MaxProgressRate@k, which measures the agent's best performance at the end of the conversation, overlooks turn count and thus shows similar agent performance when interacting with expert versus non-expert users. The τ2-bench agent with gpt-4o-mini achieves the same MaxProgressRate@k score of 0.9 for expert and non-expert users, but more conversational turns are required during the interaction with the non-expert user, as shown by the lower MaxAUC@k and MaxPPT@k scores compared to the expert user. Another interesting observation is that in some cases (e.g., gpt-5 vs mistral-large on τ2-bench), both models achieve the same MaxProgressRate@k = 1 for both user personas. However, when examining the MaxAUC@k metric, we see that the performance gap between the two models is notably larger for agents interacting with the non-expert user than with the expert user. This highlights that the type of user interaction can significantly influence agent performance and should be considered an important dimension when evaluating agents.

Moreover, we know that the AUC metric emphasizes early progress, while PPT weights the increases in progress uniformly across turns. To illustrate this, we analyze performance at the sample level. In Fig. 2, we showcase instances where two agents reach the same final progress but differ in the number of subgoals achieved at various turns.
For example, in the ToolSandbox sample search reminder with recency upcoming, both gpt-5 (non-expert) and mistral-nemo (non-expert) achieve a PPT of 0.20, yet their AUC scores are 0.61 and 0.88, respectively. The progress rate curves in Fig. 2a demonstrate that mistral-nemo makes rapid early progress, while gpt-5's progress is more gradual. A closer examination of the agent trajectories (see Appendix A.13) reveals that mistral-nemo executes tool calls in the initial turn and then seeks clarification, whereas gpt-5 begins by asking clarifying questions before invoking tools. Since the AUC is sensitive to agents that achieve subgoals earlier in the interaction, mistral-nemo has a much higher AUC score than gpt-5, compared to the PPT metric, which simply averages the increases uniformly. Hence, in scenarios where early progress is less relevant, the PPT metric may be more suitable.

5.2 COMPARISON OF DIFFERENT AGENT MODELS USING THE EVALUATION PARADIGMS FROM THE ORIGINAL τ2-BENCH AND TOOLSANDBOX PAPERS

Beyond demonstrating the additional insights provided by our metrics, we also compare the TED framework as a whole—including metrics and user proxy—with the evaluation paradigms from τ2-bench (Barres et al., 2025) and ToolSandbox (Lu et al., 2024).

5.2.1 τ2-BENCH ORIGINAL PAPER EVALUATION APPROACH

In Tables 2 and 3, we evaluate the τ2-bench agent on the airline and retail domains across different LLM models using the original evaluation approach of τ2-bench (Barres et al., 2025). For a fair comparison with our TED framework, the same set of samples and the same number of trials n = k = 20 are used in the experiments. Comparing Barres et al.
(2025)'s approach (Table 2, left column) with our TED framework (Table 1) on the easy split of the τ2-bench airline domain, we observe that metrics such as pass@k (Yao et al., 2024) and pass^k (Barres et al., 2025) saturate for several models in Table 2, where performances cluster into two groups, revealing minimal information about the performance ranking of the different models. On the pass^4 metric, which is a form of consistency or reliability measure, gpt-4.1 performs best, followed by gpt-4o, gpt-4o-mini, mistral-large, and finally mistral-nemo when evaluated using Barres et al. (2025)'s original approach on the airline easy split. Comparing this performance ranking with the TED-evaluated results in Table 1, we observe a change in rankings for several models. For instance, gpt-4o-mini on some metrics and setups in Table 1 performs worse than mistral-nemo, which contradicts the ranking in Table 2. In Table 1, mistral-nemo performs better than gpt-4o-mini across the different metrics for the expert user setup. For the non-expert user setup, we observe that mistral-nemo performs worse than gpt-4o-mini on the MaxAUC@k metric (0.67 vs 0.73) but better on the MaxPPT@k metric (0.48 vs 0.37). This suggests that under the non-expert user interaction, the agent with gpt-4o-mini exhibits a steeper performance improvement in the early interaction turns before plateauing (reflected by a higher AUC), but a lower average performance gain (indicated by a lower PPT) compared to mistral-nemo. This finding highlights the importance of systematically analyzing an agent's performance not only based on the underlying LLM models but also based on the conversation quality between user and agent as determined by the user's expertise.
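For reference, pass@k and pass^k are typically computed with the standard combinatorial estimators over n trials with c successes; the sketch below follows those estimators, though the benchmark's own code should be consulted for the authoritative definitions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k trials,
    drawn without replacement from n trials with c successes,
    succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """pass^k as a consistency measure: probability that ALL k
    sampled trials succeed. Standard estimator; check against the
    benchmark's reference implementation."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)
```

For example, with n = 20 trials and c = 10 successes, pass@1 = pass^1 = 0.5, while pass^4 drops sharply; this is why pass^k is the harsher reliability measure discussed above.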
Additionally, we see the benefit of a joint interpretation of multiple metrics (AUC and PPT) for a better understanding of the progression of agent performance, which existing coarse-grained metrics like pass@k and pass^k fail to provide. Similarly, a change in model rankings can be observed for the τ2-bench agent on the full airline (easy+hard) and retail domains. In Table 2 (right column), the best-performing model on the full airline domain is gpt-4.1, but this is not true for the TED-evaluated results in Table 14, where the mistral models, when interacting with the expert user, generally show better or on-par performance with the gpt-4.1 model on the MaxAUC@k, MaxPPT@k, and pass@k metrics. A similar behavior can be observed for the mistral-nemo and gpt-4o-mini models. As shown in the right column of Table 2, the agent using gpt-4o-mini outperforms mistral-nemo. This finding is consistent with our results for the non-expert user persona setting in Table 14, but it does not align with the expert user setting, where we observe a reversal in the relative ranking, with mistral-nemo outperforming gpt-4o-mini. We further illustrate this with an example in Figs. 4a and 4b by plotting the metric values for the two LLM models on a given sample from the airline domain. Based on Fig. 4a, we observe that the agent using the mistral-nemo model, when interacting with an expert user, achieves the maximum progress value of 1 in the first turn, resulting in an AUC = 1, which surpasses the performance of

Table 2: Comparison of different agent models using the evaluation paradigm from the original τ2-bench paper (Barres et al., 2025) and their performance on the pass@k, pass^4, and pass^k metrics in the airline domain. The number of trials used is n = k = 20.
Agent Model      Airline (Easy)                 Airline (Easy + Hard)
                 pass@k   pass^4   pass^k       pass@k   pass^4   pass^k
gpt-4.1          1.00     0.60     0.20         0.76     0.29     0.10
gpt-4o           0.80     0.40     0.20         0.57     0.13     0.05
gpt-4o-mini      1.00     0.28     0.20         0.67     0.10     0.05
mistral-nemo     0.80     0.06     0.00         0.43     0.01     0.00
mistral-large    1.00     0.16     0.00         0.71     0.07     0.00

Table 3: Comparison of different agent models using the evaluation paradigm from the original τ2-bench paper (Barres et al., 2025) and their performance on the pass@k, pass^4, and pass^k metrics in the retail domain. The number of trials used is n = k = 20.

Agent Model      pass@k   pass^4   pass^k
gpt-4.1          1.00     0.54     0.20
gpt-4o           1.00     0.43     0.04
gpt-4o-mini      0.92     0.23     0.04
mistral-nemo     0.76     0.08     0.00
mistral-large    0.96     0.33     0.08

gpt-4o-mini (i.e., AUC = 0.982), which completes all the defined subgoals only in the second turn. On the other hand, Barres et al. (2025)'s original evaluation approach in Fig. 4b aligns with our results in the non-expert user setting, where the gpt-4o-mini model performs better than the mistral-nemo model (AUC = 0.714 vs AUC = 0.4892). This further justifies the importance of considering the user-agent conversation quality during evaluation. Likewise, when comparing results on the retail domain, gpt-4.1 appears to be the best-performing model under the evaluation paradigm from the original τ2-bench paper (Barres et al., 2025) based on Table 3, but this is not necessarily the case when comparing with the results obtained from our TED framework in Table 15. These observations show the significance of reporting multiple metrics, e.g., AUC and PPT, to fully understand the task progression of the agent and the impact of the user's role on agent performance.
Figure 4: (a) τ2-bench airline sample 8: mistral-nemo expert (AUC=1.0, PPT=1.0) vs. gpt-4o-mini expert (AUC=0.982, PPT=0.5), and mistral-nemo non-expert (AUC=0.482, PPT=0.25) vs. gpt-4o-mini non-expert (AUC=0.714, PPT=0.125). (b) τ2-bench airline sample 8: Barres et al. (2025)'s original evaluation approach with n = k = 20 trials (pass@k / pass^2 / pass^4 / pass^k: mistral-nemo 1.0 / 0.0 / 0.0 / 0.0; gpt-4o-mini 1.0 / 0.15 / 0.01 / 0.0).

5.2.2 TOOLSANDBOX ORIGINAL PAPER EVALUATION APPROACH

We evaluated the ToolSandbox agent across different LLM models using the original evaluation approach from ToolSandbox (Lu et al., 2024) and report the models' performance in Table 4. For a fair comparison with our TED framework in Table 1, the same set of samples and the same number of trials n = k = 8 are used in the experiments. We do not report the final graph similarity score, which is a composite of milestone and minefield scores, because minefields are not considered in our setup. Instead, we consider only the milestone similarity score Score_M+ and report the best performance of each model by taking the maximum milestone similarity score across the k trials, denoted MaxScore_M+@k.

Table 4: Comparison of different agent models using the evaluation paradigm from the original ToolSandbox paper (Lu et al., 2024) and their performance on the milestone similarity Score_M+ metric. The number of trials used is n = k = 8.
Agent Model      Base Scenarios        Base Scenarios + Variants
                 MaxScore_M+@k         MaxScore_M+@k
gpt-4.1          0.92                  0.90
gpt-4o           0.94                  0.91
gpt-4o-mini      0.90                  0.88
mistral-nemo     0.87                  0.85
mistral-large    0.91                  0.89

In Table 4, we present the performance of different agent models on base scenarios (left column), as well as on both base scenarios and variants (right column). Based on the table, we observe that both setups exhibit the same performance ranking across models, with gpt-4o achieving the best performance, followed by gpt-4.1, and mistral-nemo performing the worst. However, this ordering differs from our main observation in Table 1, which is obtained using the TED framework. Our TED-based evaluation on the ToolSandbox dataset in Table 1 also shows gpt-4o outperforming gpt-4.1 across different metrics and user personas, but reveals a different performance ranking between mistral-nemo and gpt-4o-mini. Based on Table 1, mistral-nemo outperforms gpt-4o-mini on the pass@k metric for the non-expert user persona setting (0.92 vs 0.84) and vice versa (0.84 vs 0.89) for the expert user. For these two models, only the expert user setting in Table 1 aligns with the ranking observed in Table 4. This finding suggests that Lu et al. (2024)'s evaluation paradigm and the Score_M+ metric in Table 4 may not always paint a complete picture, as they do not effectively disentangle user behavior from the evaluation process. These observations highlight the importance of agent evaluation frameworks that account for user behavior, and demonstrate the additional value provided by our TED framework that existing evaluation frameworks lack.

5.3 ERROR ANALYSIS

Besides reporting results at the dataset level, we analyze the expectations and variances over the multiple judge runs and the k agent trials, as shown in Fig. 3.
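The per-trajectory statistics behind this analysis can be sketched as follows; this is our illustration of the quantities plotted as dots and error bars in Fig. 3, not the paper's exact tooling.

```python
import numpy as np

def trajectory_stats(judge_scores):
    """`judge_scores` is a 2-D array: one row per agent trajectory,
    one column per judge run over that trajectory. Returns the
    per-trajectory expectation and variance of the progress rate.
    Low variance across judge runs suggests a confident judge;
    clusters of trajectory means at distinct levels point to
    subgoals the agent consistently passes or fails."""
    s = np.asarray(judge_scores, dtype=float)
    return s.mean(axis=1), s.var(axis=1)
```

A trajectory graded 0.6 by every judge run gets a zero-variance error bar, whereas disagreeing judge runs (say 0.2, 0.8, 0.5) yield a visible bar, which is the inconsistency signal discussed for trajectories 5 and 18 below.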
The different colored points for each sample represent the different agent trial runs, which we refer to as trajectories in our discussion. For sample 14, we can clearly see that half of the agent runs have an expected per-sample progress of 0.6, while 35% of them are at 0.4, suggesting that there are one or more subgoals where the agent consistently fails or succeeds. Upon examining the trajectories, we identify the problematic subgoal to be "Agent should cancel reservation K1NW8N." Notably, the trials in these two clusters have zero variance in the judge runs, indicating that the judge is highly confident in its assessment. For this sample, only trajectories 5 and 18 show relatively large variance bars, suggesting inconsistency in the judge model itself. We then apply our automated error analysis to identify the common errors made by the agent. Our tool identifies four distinct errors for sample 14, as shown in Fig. 5. In trajectory 6, the agent did not check the details of the existing flight, which was supposed to be basic economy. Thus, the agent did not cancel the previous flight when attempting to reschedule, causing a discrepancy in the final payment output. This error was consistently captured by our judge, as indicated by the zero-variance bar.

Figure 5: τ2-bench airline sample 14. The blue box shows a truncated task instruction i ∈ I for the non-expert user proxy gpt-4.1 model. The green boxes contain the truncated dialogue for trajectory 6 (left) and trajectory 18 (right). The agent model is gpt-4o-mini. The top-right box shows the errors identified. Zoom in for a larger view.

On the other hand, trajectory 18 involves a different payment-related error whereby the agent hallucinates the value $2613.00, which exceeds the actual cost. This spurious value prevented the agent from calling book reservation, triggering a cascade of three subsequent errors.

5.4 INCORPORATING IDENTIFIED ERRORS INTO THE AGENT'S DESIGN

To demonstrate the effectiveness of our identified errors in improving the agent, we incorporate these errors into the design of the agent using several strategies on τ2-bench's airline special split (selected due to its low progress rate and progress-per-turn) and ToolSandbox in Table 5. We examine several in-context learning approaches to incorporate the found errors into the agent instruction. We explore simple strategies such as Human Notes (HN)³, where TED errors are manually refined before using them in the agent instructions, and the Error Insertion (EI) approach, where TED errors are directly used without modification. Moreover, to demonstrate the effectiveness of the TED errors with another in-context learning approach, the HiTEC-ICL (abbreviated HTC) method (Cui et al., 2025), we compare the authors' generic global errors against a variant that uses our TED errors, denoted HTC† in Table 5.

The standard HTC method constructs global and local error checklists, which are injected as metadata into the user instruction, simulating an artificial conversation loop to better guide the agent. The predefined global errors are highly generic, such as Empty Parameter Value Error or Missing Required Parameter Error. After the first round of user prompting with the global error checklist, the user model is also prompted again in a second round with the local error checklist, enabling the agent to recognize local errors for the relevant tools. For the HTC† variant, the generic global errors are replaced with TED errors, while the use of the local error checklist remains consistent with the original method, where detailed tool information is provided to the LLM to generate tool-specific error patterns. Based on Table 5, all four approaches show positive gains in agent performance for a majority of the metrics on the ToolSandbox dataset.
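The simplest of these strategies, Error Insertion, amounts to appending the discovered errors to the agent's system prompt as failure modes to avoid. The sketch below illustrates this; the prompt wording is ours and purely illustrative, not the paper's.

```python
def insert_errors(agent_system_prompt, ted_errors):
    """Error Insertion (EI) sketch: errors surfaced by the automated
    analysis are appended verbatim to the agent's system prompt.
    The framing sentence is illustrative, not the paper's exact text."""
    bullets = "\n".join(f"- {e}" for e in ted_errors)
    return (agent_system_prompt
            + "\n\nCommon failure modes observed in past runs. "
              "Avoid repeating them:\n" + bullets)
```

The Human Notes variant would differ only in that a human rewrites each error into an instruction (see footnote 3) before it is appended.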
For the τ2-bench special split, we observe mixed trends, with both positive and negative performance gains over the baseline agents among the four approaches. Specifically, for gpt-4o-mini, we observe a significantly larger performance gain on τ2-bench for the HTC† variant using TED errors as compared to the standard HTC method that uses predefined

³ E.g., "You must strictly check and double confirm all requirements for change flight actions before calling the tool. If you are unsure or confused, always ask clarifying questions to the user."

Table 5: Agent improvement results over the baseline agent for Expert Persona | Non-expert Persona, using gpt-4.1 as user proxy and LLM-as-a-judge. Blue ↑ shows an improvement over the baseline agent, and red ↓ shows a decrease. We compare the gain in performance for several agent models using the Error Insertion (EI), Human Notes (HN), HiTEC-ICL (HTC) (Cui et al., 2025), and HTC† variant methods.
Agent Model          MeanProg@k                MaxProg@k                 MaxAUC@k                  MaxPPT@k

τ2-bench Airline Domain (Special Split: Samples 7, 14, 21, 23, and 29)
gpt-4o-mini + EI     0.37 ↑0.09 | 0.31 ↓0.01   0.66 ↑0.06 | 0.61 ±0.00   0.59 ↑0.05 | 0.39 ↓0.02   0.27 ↑0.03 | 0.11 ↓0.01
gpt-4o-mini + HN     0.30 ↑0.02 | 0.33 ↑0.01   0.63 ↑0.02 | 0.61 ±0.00   0.53 ↓0.01 | 0.45 ↑0.04   0.24 ±0.00 | 0.14 ↑0.02
gpt-4o-mini + HTC    0.29 ↑0.01 | 0.26 ↓0.06   0.54 ↓0.06 | 0.51 ↓0.10   0.50 ↓0.04 | 0.41 ±0.00   0.32 ↑0.08 | 0.13 ↑0.01
gpt-4o-mini + HTC†   0.33 ↑0.05 | 0.33 ↑0.01   0.61 ↑0.01 | 0.65 ↑0.04   0.57 ↑0.03 | 0.43 ↑0.02   0.29 ↑0.05 | 0.13 ↑0.01
gpt-4.1 + EI         0.52 ±0.00 | 0.38 ±0.00   0.78 ↑0.04 | 0.77 ↓0.08   0.66 ↑0.02 | 0.41 ↓0.06   0.26 ↓0.01 | 0.10 ↓0.02
gpt-4.1 + HN         0.53 ↑0.01 | 0.41 ↑0.03   0.78 ↑0.04 | 0.85 ±0.00   0.66 ↑0.02 | 0.55 ↑0.08   0.27 ±0.00 | 0.14 ↑0.02
gpt-4.1 + HTC        0.54 ↑0.02 | 0.39 ↑0.01   0.82 ↑0.08 | 0.73 ↓0.12   0.67 ↑0.03 | 0.41 ↓0.06   0.27 ±0.00 | 0.10 ↓0.02
gpt-4.1 + HTC†       0.55 ↑0.03 | 0.40 ↑0.02   0.71 ↓0.03 | 0.82 ↓0.03   0.63 ↓0.01 | 0.51 ↑0.04   0.24 ↓0.03 | 0.12 ±0.00

ToolSandbox Dataset
gpt-4o-mini + EI     0.87 ↓0.04 | 0.89 ↑0.04   0.97 ↑0.02 | 0.98 ↑0.05   0.94 ±0.00 | 0.91 ↑0.01   0.85 ↓0.04 | 0.66 ↓0.11
gpt-4o-mini + HN     0.88 ↓0.03 | 0.91 ↑0.06   0.96 ↑0.01 | 0.96 ↑0.03   0.95 ↑0.01 | 0.92 ↑0.02   0.90 ↑0.01 | 0.74 ↓0.03
gpt-4o-mini + HTC    0.93 ↑0.02 | 0.92 ↑0.07   0.97 ↑0.02 | 0.98 ↑0.05   0.96 ↑0.02 | 0.95 ↑0.05   0.94 ↑0.05 | 0.83 ↑0.06
gpt-4o-mini + HTC†   0.92 ↑0.01 | 0.91 ↑0.06   0.98 ↑0.03 | 0.98 ↑0.05   0.97 ↑0.03 | 0.95 ↑0.05   0.93 ↑0.04 | 0.82 ↑0.05
gpt-4.1 + EI         0.95 ↑0.03 | 0.93 ↑0.06   0.99 ↑0.01 | 0.99 ↑0.02   0.97 ↑0.01 | 0.93 ↑0.01   0.87 ↑0.03 | 0.76 ↑0.03
gpt-4.1 + HN         0.95 ↑0.03 | 0.97 ↑0.10   0.98 ±0.00 | 0.99 ↑0.02   0.97 ↑0.01 | 0.95 ↑0.03   0.91 ↑0.07 | 0.83 ↑0.10
gpt-4.1 + HTC        0.95 ↑0.03 | 0.95 ↑0.08   0.98 ±0.00 | 0.99 ↑0.02   0.96 ±0.00 | 0.95 ↑0.03   0.89 ↑0.05 | 0.75 ↑0.02
gpt-4.1 + HTC†       0.95 ↑0.03 | 0.95 ↑
0.08   1.00 ↑0.02 | 0.99 ↑0.02   0.99 ↑0.03 | 0.96 ↑0.04   0.92 ↑0.08 | 0.82 ↑0.09

generic global errors. However, for gpt-4.1 on the τ2-bench samples, there is no clear trend suggesting that one performs better than the other. We also find that the Error Insertion strategy improves several setups—notably gpt-4o-mini on τ2-bench (+9% in MeanProg@k, +5% in MaxAUC@k)—implying that awareness of common failures helps the agent perform better, with only a few setups showing declines. In contrast, the Human Notes strategy gives more consistent improvements across setups than Error Insertion. In particular, we observe a significant gain in MaxPPT@k of 7-10% for gpt-4.1 using Human Notes on the ToolSandbox dataset. We want to emphasize that our work only evaluates the usefulness of the TED errors—as indicated by improvements over baseline agents (blue upward arrows)—and does not propose a new in-context learning method to construct better agent instruction prompts.

5.5 EXTENDED EXPERIMENTS AND HUMAN VALIDATION

We also present results on the full split of the τ2-bench airline and retail domains in Appendix A.10, and include an ablation on user model variation in Appendix A.11. To validate our evaluation, we conduct human studies on the correctness of the user proxy, the LLM-as-a-judge, and the identified TED errors, as reported in Appendices A.7, A.8, and A.9, respectively. Results show that the user proxy behaves correctly in most cases, with only 6–12% errors due to instruction-following issues. The LLM-as-a-judge human study also shows high agreement between the human raters and the LLM-as-a-judge predictions. These observations suggest both components in the TED framework remain reliable and cost-effective. More details are discussed in Appendices A.7 to A.9.
6 CONCLUSION AND FUTURE WORK

In this work, we introduced the TED framework, which redefines agent evaluation. We showed that including error insights in the agent's design leads to gains, with peaks of 8% for the MaxAUC@k and 10% for the MaxPPT@k metrics. In the future, we plan to explore the applicability of our metrics to non-task-oriented domains, such as open-ended dialogue with conversational agents, where the expected responses of the agent can be assessed using our grading notes. Limitations of our approach and LLM usage are discussed in Appendices A.1 and A.2, respectively.

REFERENCES

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ2-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint, 2025.

Ma Chang, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. AgentBoard: An analytical evaluation board of multi-turn LLM agents. Advances in Neural Information Processing Systems, 37:74325-74362, 2024.

Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. arXiv preprint arXiv:1810.08272, 2018.

Yue Cui, Liuyi Yao, Shuchang Tao, Weijie Shi, Yaliang Li, Bolin Ding, and Xiaofang Zhou. Enhancing tool learning in large language models with hierarchical error checklists. arXiv preprint arXiv:2506.00042, 2025.

Yupu Hao, Pengfei Cao, Zhuoran Jin, Huanxuan Liao, Yubo Chen, Kang Liu, and Jun Zhao. Evaluating personalized tool-augmented LLMs from the perspectives of personalization and proactivity. arXiv preprint arXiv:2503.00771, 2025.

Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, and Chien-Sheng Wu.
CRMArena-Pro: Holistic assessment of LLM agents across diverse business scenarios and interactions. arXiv preprint arXiv:2505.18878, 2025.

Kyochul Jang, Donghyeon Lee, Kyusik Kim, Dongseok Heo, Taewhoo Lee, Woojeong Kim, and Bongwon Suh. DiCE-Bench: Evaluating the tool-use capabilities of large language models in multi-round, multi-party dialogues. arXiv preprint arXiv:2506.22853, 2025.

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024.

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023.

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, et al. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. arXiv preprint arXiv:2408.04682, 2024.

Cheng Qian, Peixuan Han, Qinyu Luo, Bingxiang He, Xiusi Chen, Yuji Zhang, Hongyi Du, Jiarui Yao, Xiaocheng Yang, Denghui Zhang, et al. EscapeBench: Towards advancing creative intelligence of language model agents. arXiv preprint arXiv:2412.13549, 2024.

Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. SMART: Self-aware agent for tool overuse mitigation. arXiv preprint arXiv:2502.11435, 2025.

Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. Benchmarking agentic workflow generation. arXiv preprint arXiv:2410.07869, 2024.

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht.
ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.

Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. arXiv preprint arXiv:2309.10691, 2023.

Ruixuan Xiao, Wentao Ma, Ke Wang, Yuchuan Wu, Junbo Zhao, Haobo Wang, Fei Huang, and Yongbin Li. FlowBench: Revisiting and benchmarking workflow-guided planning for LLM-based agents. arXiv preprint arXiv:2406.14884, 2024.

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040-52094, 2024.

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

APPENDIX CONTENTS

A.1  Limitation of our approach ........ 17
A.2  LLM Usage ........ 17
A.3  Prompt templates for user proxy ........ 17
A.4  Prompt template for LLM-as-a-judge ........ 20
A.5  Additional details on progress and turn-level efficiency metrics ........ 22
A.6  Additional details on automated error analysis ........
22
A.7  Human validation on the correctness of user proxy ........ 26
A.8  Human validation on the correctness of LLM-as-a-judge ........ 26
A.9  Human validation on the correctness of the identified TED errors ........ 27
A.10 Additional experiments on the τ2-bench dataset ........ 28
A.11 Ablation study on the different user models ........ 34
A.12 Additional information on dataset and experimental setup ........ 35
A.13 Additional details on agent trajectories ........ 40

A APPENDIX

A.1 LIMITATION OF OUR APPROACH

Our approach, which uses grading notes and LLM-as-judge, simplifies evaluation by relying solely on the agent's trajectory, without requiring access to the underlying environment. However, this approach has certain limitations. While we show that multiple judge runs improve reliability and our automated error analysis tool helps in debugging, the method cannot verify whether the database state has actually changed when such modifications are not reflected in the trajectory. Consequently, this limits our ability to capture silent failures that produce no observable outputs.

A.2 LLM USAGE

An LLM is used to assist the writing of UI code for the automated error analysis tool. In addition, we use an LLM to refine the prompt templates used in our experiments. An LLM is also used to refine and polish the text in the paper to improve clarity and presentation.

A.3 PROMPT TEMPLATES FOR USER PROXY

The following are the reusable, generic expert and non-expert user prompt templates, followed by the templates for the two-step function f in equation 1.
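The two-step reflection-then-response loop that these templates implement can be sketched as follows. The function and parameter names are ours, and `llm(system, user)` is an assumed single-completion interface; the real templates below carry far more instruction than this skeleton.

```python
def user_proxy_turn(llm, persona_prompt, task_summary, chat_history,
                    termination_msg="TERMINATE"):
    """Sketch of the two-step function f: a reflection call over the
    chat history, then a response call conditioned on that reflection.
    Illustrative only; the actual prompt bodies are given in the
    templates of Appendix A.3."""
    system = persona_prompt.format(user_task_summary=task_summary,
                                   agent_desc="(none provided)")
    # Step 1: Reflection Phase — think about the assistant's last reply.
    reflection = llm(system, f"[Chat History]\n{chat_history}\n\n"
                             "Step 1: reflect only; do not reply yet.")
    # Step 2: Response Generation Phase — produce the user's next message.
    response = llm(system, f"[Chat History]\n{chat_history}\n"
                           f"[Reflection]\n{reflection}\n\n"
                           "Step 2: write the user's next message, or "
                           f"'{termination_msg}' if the task is done.")
    return reflection, response
```

Each simulated user turn therefore costs two LLM calls, with the reflection fed back into the second call as context.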
The {user task summary} placeholder corresponds to the task instruction i ∈ I, and the {agent desc} placeholder corresponds to the agent description. For the two-step generation process f (reflection followed by response), the placeholders {chat history} and {termination msg} represent the user-agent chat history up to the current stage of the conversation and the termination message that the user should produce at the end of the dialogue, respectively. The placeholder {reflection history} represents the user reflection history up to the current stage of the conversation.

Generic expert user persona prompt template:

You are acting as an expert LLM-simulated user who fully understands the AI assistant system and goal. Always respond naturally in clear, concise language that fits the expert user role and goal. Provide complete and precise information in your responses. Generate one line at a time. Do not give away all the instructions at once. Only provide the information that is necessary for the current step.
You are provided with the following user task summary:
[user task summary]
{user task summary}
You understand the system well and will provide thorough, accurate responses using only the information provided in the [user task summary] section. If the AI assistant returns output in JSON format, respond only to the content inside the JSON as if the format does not matter.
---
The following provides an overview of the AI assistant if available.
[AI Assistant Description]: {agent desc}
---
When you as an expert LLM-simulated user are analysing the real-time chat history, carry out a two-step process as the user: first, a Reflection Phase, followed by a Response Generation Phase.

Generic non-expert user persona prompt template:

You are simulating a clueless, casual NON-expert user who is interacting with an AI assistant.
You don't fully understand how the AI system works, and you tend to give vague or incomplete instructions — often leaving out key steps or context.
When you respond:
Speak naturally, casually, like someone who's unsure how to talk to an AI.
Be brief and only provide part of the needed information. Do not give a full picture unless the assistant directly asks for it.
Only share details that are directly related to what was just asked or prompted — not more.
Never proactively explain your reasoning or provide background info unless the assistant digs into it.
You are working toward the following general task:
[User Task Summary]
{user task summary}
But since you're not an expert, you'll just sort of "feel your way through it" and leave lots of gaps in your instructions. NEVER provide COMPLETE instructions. ALWAYS OMIT some variables and missing key context. If the assistant returns something in structured formats like JSON, you can just react casually to the content. Treat the format like it doesn't matter.
---
The following provides an overview of the AI assistant if available.
[AI Assistant Description]: {agent desc}
---
When you as a clueless, casual NON-expert user are analysing the real-time chat history, carry out a two-step process as the user: first, a Reflection Phase, followed by a Response Generation Phase.
When simulating your process during the conversation: You go through two internal steps each time:
1. Reflection Phase (internal thought): Take a quick look at the current chat history. Think to yourself: "Okay, what did the assistant just say or ask? What should I probably say next without overexplaining?" Remember: you're not confident in how this system works, so don't try to be precise.
2. Response Generation Phase (your reply): Now write a short, casual message that gives only partial information based on what the assistant asked. Leave things unclear unless the assistant is persistent.
The reflection-step prompt template in the two-step function f in equation 1:

---
The following [Chat History] (if available) provides context and indicates the CURRENT stage of your conversation as an LLM-simulated user with the AI assistant.
[Chat History]
{chat history}
---
Step 1: Reflection Phase
Given the [Chat History], REFLECT carefully on the AI assistant's last response and what the LLM-simulated user is trying to accomplish based on the [user task summary]. Briefly address:
- Your role as the LLM-simulated user.
- The current stage of the conversation. You SHOULD NOT skip any user instructions as mentioned in the [user task summary].
- The assistant's last reply in the [Chat History].
IMPORTANT CLARIFICATION:
- Review the entire [Chat History] and the [user task summary] and consider what your next response as an LLM-simulated user should be.
- At times, the AI assistant's last message may overlap with or anticipate a future user turn. In such cases, treat it strictly as the AI assistant's response, not a replacement for the user message.
Do NOT generate the LLM-simulated user response yet. RESPOND only with a REFLECTION.
**IMPORTANT** Remember your user persona as written in the system prompt (e.g., expert or non-expert user) and respond with an appropriate reflection.
TERMINATE ONLY IF the conversation is at its FINAL STAGE, where the agent has completed all the tasks wanted by the user as shown in the [user task summary]. If the conversation has concluded, prepare to respond with {termination msg} in the next response generation phase. Otherwise, DO NOT consider termination if the current conversation is not at its final stage.
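Together with the response-step template that follows, the reflection-step template above forms the two-call function f from equation 1: one LLM call produces a private reflection, a second produces the actual user utterance. A hedged sketch of that loop; the `llm` callable, template strings, and termination marker are illustrative stand-ins, not the authors' implementation:

```python
# Hedged sketch of the two-step user-proxy function f (equation 1):
# one LLM call for the Reflection Phase, then one for the Response
# Generation Phase. `llm` is any callable (system_prompt, prompt) -> str.

def user_proxy_turn(llm, persona_prompt, reflection_tpl, response_tpl,
                    chat_history, reflection_history,
                    termination_msg="###STOP###"):
    # Step 1: reflect on the current chat state (no user reply yet).
    reflection = llm(
        persona_prompt,
        reflection_tpl.replace("{chat history}", chat_history)
                      .replace("{termination msg}", termination_msg))
    reflection_history.append(reflection)
    # Step 2: generate the next user utterance, conditioned on the reflection.
    response = llm(
        persona_prompt,
        response_tpl.replace("{chat history}", chat_history)
                    .replace("{reflection history}",
                             "\n".join(reflection_history)))
    # The dialogue ends once the termination message appears in the response.
    return response, termination_msg in response

# Usage with a scripted stub standing in for a real model:
replies = iter(["The agent has completed everything; prepare to terminate.",
                "Great, thanks! ###STOP###"])
response, done = user_proxy_turn(
    lambda system, prompt: next(replies),
    "You are an expert LLM-simulated user.",
    "Step 1 given [Chat History]: {chat history}",
    "Step 2 given [Reflection]: {reflection history}",
    "assistant: Your booking is confirmed.",
    [])
```

Separating reflection from generation lets the proxy decide privately whether the task is finished before committing to a visible reply, which is why the templates instruct it to "prepare" the termination message in the reflection phase.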
The response-step prompt template in the two-step function f in equation 1:

---
The following [Chat History] (if available) provides context and indicates the CURRENT stage of your conversation as an LLM-simulated user with the AI assistant.
[Chat History]
{chat history}
---
The following is the LLM-simulated user reflection.
[Reflection]
{reflection history}
---
Step 2: Response Generation Phase
Given the [Chat History] and [Reflection], GENERATE the LLM-simulated user's NEXT RESPONSE such that it:
i) Naturally continues the conversation WITHOUT ADDING A NEW TASK that is NOT found in the [user task summary]. You SHOULD NOT skip any tasks for the LLM-simulated user.
ii) Avoids revealing or repeating the AI assistant's answers.
iii) Responds appropriately to the assistant's actual reply, even if vague or off-track.
If the AI assistant's last message echoes or resembles any part of a user message, it is the AI assistant's response, NOT a new user turn. Note that suggestions or recommendations by the AI assistant should NEVER be MISTAKEN for actual actions taken.
GENERATE the LLM-simulated USER RESPONSE based on the [Reflection]. Return ONLY the LLM-simulated user response.
**IMPORTANT** Remember your user persona as written in the system prompt (e.g., expert or non-expert user) and respond with an appropriate response.
TERMINATE ONLY IF the conversation is at its FINAL STAGE, where the agent has completed all the tasks wanted by the user as shown in the [user task summary]. If the conversation has concluded, respond with {termination msg}. Otherwise, DO NOT consider termination if the current conversation is not at its final stage.

A.4 PROMPT TEMPLATE FOR LLM-AS-A-JUDGE

The following is the LLM judge(i, g_i,j, τ_i) prompt template from equation 2.
The {user task summary} placeholder corresponds to the task instruction i ∈ I, and the {grading note} placeholder corresponds to the j-th grading note g_i,j for task instruction i. The remaining placeholders {trajectory}, {agent responses}, and {dynamicDialogue} represent the agent's trajectory, the agent's response output, and the user-agent dialogue, respectively, as extracted from τ_i.

Prompt Template for LLM-as-a-judge (equation 2):

You are provided with a sample that contains several key components centered around an interaction between an agent and a simulated user, referred to as the user proxy. The user proxy represents a human-in-the-loop, engaging with the agent by posing questions and guiding the conversation throughout the dialogue.
The [User Summary Instructions] section outlines the user's goals, expectations, and the overall task the agent is expected to complete. The [Agent Responses Submission] section captures the agent's actual responses to the user proxy at each turn of the interaction. The [Agent Intermediate Trajectories] section provides the detailed step-by-step reasoning and actions taken by the agent. Finally, the [Dynamic Dialogue] section presents the full conversation between the agent and the user proxy.
As a judge, your task is to determine whether the subgoal specified in the [Ground Truth Subgoal] has been achieved. To make this assessment, evaluate the [Agent Intermediate Trajectories] and [Agent Responses Submission] to ascertain whether the subgoal has been successfully completed.
When the completion of a subgoal depends on the availability of a specific tool, and that TOOL is ABSENT, the subgoal must be marked as INCOMPLETE, REGARDLESS of the correctness of the agent's responses in [Agent Responses Submission]. Tool availability is a hard constraint; no amount of valid reasoning can compensate for its absence.
Suggesting or recommending a tool is acceptable only if the subgoal does not require the tool to be used. However, CONCEPTUALIZING or HALLUCINATING a tool call, i.e., falsely implying the tool was invoked, is NOT acceptable, and the subgoal must be marked as INCOMPLETE.
Additionally, if the agent intermediate trajectories are left BLANK or DO NOT contain tools relevant to the subgoal, the ground truth subgoal is NOT MET and the subgoal is considered INCOMPLETE, REGARDLESS of the correctness of the agent's responses in [Agent Responses Submission]. A thought in [Agent Intermediate Trajectories] CANNOT be considered a substitute for a tool call, and the ground truth subgoal is considered INCOMPLETE.
[BEGIN DATA]
************
[User Summary Instructions]: {user task summary}
************
[Ground Truth Subgoal]: {grading note}
************
[Agent Intermediate Trajectories]: {trajectory}
************
[Agent Responses Submission]: {agent responses}
************
{dynamicDialogue}
[END DATA]
During assessment, focus solely on the factual content and goal completion while disregarding any differences in style, grammar, punctuation, or syntax. Begin by presenting a concise argument to confirm the validity of your conclusion. Avoid simply stating the correct answer at the outset. Decide what type of tool is required and then end with your answer formatted as 'GRADE: $LETTER' (without quotes), where LETTER is one of C or I. Reply with 'GRADE: C' (without quotes) to indicate COMPLETE if the agent has successfully achieved the subgoal. Otherwise, reply with 'GRADE: I' (without quotes) to indicate INCOMPLETE if the agent did not achieve the subgoal.
A.5 ADDITIONAL DETAILS ON PROGRESS AND TURN-LEVEL EFFICIENCY METRICS

To align with the MaxProgressRate@k metric from equation 5, which evaluates the agent's best performance across k trials, we report both the maximum AUC and the maximum PPT over the k trials, averaged over the task samples:

MaxAUC@k = E_{(i, G_i) ~ P_D} [ max { AUC_l | l = 1, ..., k } ].    (10)

MaxPPT@k = E_{(i, G_i) ~ P_D} [ max { PPT_l | l = 1, ..., k } ].    (11)

We show in our experiments that the proposed metrics provide interesting insights into agent behavior that existing metrics fail to capture.

A.6 ADDITIONAL DETAILS ON AUTOMATED ERROR ANALYSIS

The error candidate set E = {(g_i,j, e_i,j) | Pr(Z_i,j = 1) < 1} covers two situations: i) all judge trials consistently score 0; ii) the judge model disagrees across the multiple judge trials. In the first case, we can select any e_i,j of the Q trials to obtain the final x_i,j; in our implementation, we select the first explanation e^(1)_i,j. In the second case, f_iden is applied to every explanation e^(q)_i,j, and a further selective prompt function f_selective decides the low-level error x_i,j. We illustrate the entire algorithm in the pseudo-code below:

Algorithm 1: Automated Error Analysis Method
Input: Judge outputs {(Z^(q)_i,j, e^(q)_i,j, g_i,j)} for samples i, subgoals j, trials q = 1, ..., Q.
Output: High-level error types C

// STEP 1: Low-level error identification
Initialize E ← ∅, X ← ∅
for each (i, j) do
    if (∀q, Z^(q)_i,j = 0) or (0 ∈ Z_i,j ∧ 1 ∈ Z_i,j) then
        E ← E ∪ {(g_i,j, e_i,j)}
        if ∀q, Z^(q)_i,j = 0 then
            // Consistent failure: select the first judge explanation for error identification
            x_i,j ← f_iden(g_i,j, e^(1)_i,j)
        else
            // Disagreement across judge trials: iterate over all judge explanations
            Tmp ← []
            for q = 1 to Q do
                Tmp ← Tmp ∪ {f_iden(g_i,j, e^(q)_i,j)}
            end
            x_i,j ← f_selective(Tmp)
        end
        X ← X ∪ {x_i,j}
    end
end
// STEP 2: Semantic clustering of error types
G ← {g_i,j | (i, j) ∈ E}
C ← f_clus(X, G)
return high-level error types C

Prompt Template for f_iden in Automated Error Analysis:

You are tasked with summarizing the error type in a concise and abstract manner based on the provided explanation. This explanation is generated by a judge model, which evaluates whether the agent's response satisfies the specified subgoals. In the explanation, a grade of "C" (Complete) indicates success, while "I" (Incomplete) indicates failure. Your goal is to produce an error type that:
• Clearly captures the core failure or issue at an abstract level.
• Avoids restating the explanation verbatim.
• Is short, specific, and phrased like a category label rather than a long sentence.
• Does not include sensitive details or unnecessary context.
• If the error type involves tool usage, explicitly include the tool name in the error type.
You will be provided with:
• Ground truth subgoals
• The judge model's explanation of the agent's response
[BEGIN DATA]
***
[Ground Truth Subgoals]: {subgoals}
***
[Explanation]: {explanation}
***
[END DATA]
Please return your output in the following **strict JSON format**:
{
  "error_type": "",
  "explanation": ""
}

Prompt Template for f_selective in Automated Error Analysis:

You are given multiple independent predictions of the same data row. Each prediction includes an error type assigned by a model. Your task is to determine the **most probable true error type** using a majority voting approach.
Instructions:
1. Review all provided error types carefully.
2. Group similar or semantically equivalent error types together, even if their wording differs.
3. Count how many times each grouped error type appears.
4. Select the error type with the highest count as the final result.
5. If there is a tie:
   - Prefer the error type that is more specific and informative.
   - If still tied, choose the one most consistent with the majority wording.
6. Output ONLY the most probable error type, without extra commentary.
[BEGIN DATA]
[Error Types]: {error type list}
[END DATA]
Please return your output in the following **strict JSON format**:
{
  "most_probable_error_type": ""
}

Prompt Template for f_clus in Automated Error Analysis:

You are tasked with clustering the following error types based on their semantic similarity. The goal is to group related error types under broader, more abstract categories to reduce redundancy and improve generalization.
Important:
• The cluster label should be primarily grounded in the provided subgoals.
• Only if you are very certain that an error type is entirely unrelated to the subgoals should you create a new cluster label not derived from them.
• Always aim to preserve the subgoal's intent when naming clusters.
Guidelines:
• Error types are not mutually exclusive and may overlap in meaning.
• Each cluster should reflect the most abstract and inclusive label that unifies all error types within it.
• Do not merge error types referring to different tools into a single cluster.
• Clusters involving tool usage must be separated by tool name.
• The cluster label for each tool-related cluster must explicitly include that tool's name.
• Minimize the number of clusters while maintaining clear and meaningful distinctions.
• Avoid overly specific wording; cluster labels should be reusable in other contexts where the same subgoal applies.
[BEGIN DATA]
***
[Subgoals]: {subgoals}
***
[Error Types]: {error types}
***
[END DATA]
Please return your output in the following strict JSON format:
{
  "clusters": [
    {
      "cluster_label": "",
      "error_types": ["", "", ...],
      "error_ids": ["", "", ...]
    },
    ...
  ]
}

A.7 HUMAN VALIDATION ON THE CORRECTNESS OF THE USER PROXY

To ensure the reliability of our user proxy simulation, we manually validate user proxy utterances through human evaluation on 16 randomly selected expert and 16 non-expert user-agent dialogues from the τ2-bench airline domain and the ToolSandbox dataset. We categorize errors into three types: (1) user role confusion, where the user mistakes their role for the agent's; (2) failure to follow the specified task instructions i ∈ I, termed missing or violated instructions; and (3) nonsensical or erroneous user responses. Based on Table 6, we observe that the user proxy generally behaves as expected, except for a small number of cases where it does not follow the task instructions. While no AI system can be expected to achieve 100% accuracy, the low number of such errors supports our belief in the user proxy's suitability for agent evaluation.

Table 6: Correctness of user proxy. Both agent and user proxy use the gpt-4.1 model.
User Persona | Role confusion | Missing or violated instructions | Erroneous responses
Expert       | 0.0            | 0.06                             | 0.0
Non-expert   | 0.0            | 0.125                            | 0.0

A.8 HUMAN VALIDATION ON THE CORRECTNESS OF LLM-AS-A-JUDGE

To ensure the reliability of our LLM-as-a-judge evaluation, which uses grading notes (i.e., subgoals) as ground truths, we conducted multiple runs of the judge and used a majority vote to determine the final scores. The human validation results shown in Tables 7 to 10 are conducted on these majority-vote outcomes. For Table 7, with the gpt-5 agent model, we randomly select 10 samples containing both expert and non-expert users from the τ2-bench airline domain, and another 10 samples from the ToolSandbox dataset, resulting in 42 and 31 subgoals, respectively. Since each subgoal is evaluated independently, we report human validation results using Cohen's Kappa at the subgoal level. Human annotators are asked to label each subgoal prediction as "success", "failure", or "ambiguous", where "ambiguous" denotes cases that could not be clearly classified as either "success" or "failure". "Ambiguous" cases are treated as being in agreement with the judge prediction. In Table 7, we observe a high Cohen's Kappa score (0.84 to 0.92), indicating almost perfect agreement between the LLM-as-a-judge and the human rater. Similarly high agreement scores for the gpt-4.1, gpt-4o, and gpt-4o-mini agents are shown in Tables 8, 9, and 10, respectively. These agreement scores reinforce that our approach using grading notes and LLM-as-a-judge offers a reliable, scalable, and cost-effective alternative to other, more complex evaluation methods.

Table 7: Agreement between human and LLM-as-a-judge measured using Cohen's Kappa. The agent uses the gpt-5 model while the LLM-as-a-judge and the user proxy use the gpt-4.1 model.
Dataset                                        | Cohen's Kappa
τ2-bench airline (42 subgoals from 10 samples) | 0.84
ToolSandbox (31 subgoals from 10 samples)      | 0.92

Table 8: Agreement between human and LLM-as-a-judge measured using Cohen's Kappa. The agent, LLM-as-a-judge, and user proxy use the gpt-4.1 model.

Dataset                                        | Cohen's Kappa
τ2-bench airline (37 subgoals from 9 samples)  | 0.71
ToolSandbox (40 subgoals from 10 samples)      | 0.60

Table 9: Agreement between human and LLM-as-a-judge measured using Cohen's Kappa. The agent uses the gpt-4o model while the LLM-as-a-judge and the user proxy use the gpt-4.1 model.

Dataset                                        | Cohen's Kappa
τ2-bench airline (27 subgoals from 10 samples) | 0.90
ToolSandbox (32 subgoals from 10 samples)      | 0.80

Table 10: Agreement between human and LLM-as-a-judge measured using Cohen's Kappa. The agent uses the gpt-4o-mini model while the LLM-as-a-judge and the user proxy use the gpt-4.1 model.

Dataset                                        | Cohen's Kappa
τ2-bench airline (30 subgoals from 10 samples) | 0.93
ToolSandbox (35 subgoals from 10 samples)      | 0.94

A.9 HUMAN VALIDATION ON THE CORRECTNESS OF THE IDENTIFIED TED ERRORS

To ensure the relevance and accuracy of the errors identified by our TED framework, we manually validate them against reference labels. Since errors are identified based on subgoals, we again report human validation results at the subgoal level. Human annotators first identify and categorize errors using only the agent's trajectory and the ground-truth subgoals; these annotated errors serve as reference labels. To measure disagreement, the annotators then evaluate whether the TED-identified errors are semantically consistent with their own reference error annotations. Across the different agent models, we observe only a small disagreement (6–23%) between the errors identified by TED and the reference errors. The detailed results are shown in Tables 11, 12, and 13.
This level of disagreement is reasonable, as error identification is challenging even for human annotators.

Table 11: Correctness of the identified TED errors. The agent, LLM-as-a-judge, and user proxy use the gpt-4.1 model.

Dataset                                        | Per-subgoal error disagreement with human
τ2-bench airline (37 subgoals from 9 samples)  | 0.22
ToolSandbox (40 subgoals from 10 samples)      | 0.23

Table 12: Correctness of the identified TED errors. The agent uses the gpt-4o model while the LLM-as-a-judge and the user proxy use the gpt-4.1 model.

Dataset                                        | Per-subgoal error disagreement with human
τ2-bench airline (27 subgoals from 10 samples) | 0.19
ToolSandbox (32 subgoals from 10 samples)      | 0.09

Table 13: Correctness of the identified TED errors. The agent uses the gpt-4o-mini model while the LLM-as-a-judge and the user proxy use the gpt-4.1 model.

Dataset                                        | Per-subgoal error disagreement with human
τ2-bench airline (30 subgoals from 10 samples) | 0.23
ToolSandbox (35 subgoals from 10 samples)      | 0.06

A.10 ADDITIONAL EXPERIMENTS ON THE τ2-BENCH DATASET

Besides analyzing agent performance on the easy airline-domain samples, we extend our analysis to include hard samples as well. Table 14 presents the performance of various agent models under the easy and more challenging settings in the airline domain. Including hard samples yields several insights into agent behavior. First, we observe a general decline across all metrics, reflecting the increased difficulty of the samples. MeanProg@k and pass@k scores drop substantially for all models, indicating that agents are less likely to achieve full task completion on harder samples. Meanwhile, MaxProgressRate@k remains relatively high for most models (e.g., gpt-4.1, gpt-5, and mistral-nemo for the expert persona), suggesting that most agents achieve near-complete progress in at least one of the n = k trials.
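To make the distinction between these metrics concrete, the following sketch shows how two agents can tie on MaxProgressRate@k yet separate on MaxAUC@k and MaxPPT@k. The curve definitions here are simplified assumptions (cumulative subgoal progress per turn, AUC as the normalized area under that curve, PPT as average progress per turn), not the paper's exact formulas:

```python
# Simplified illustration (not the paper's exact definitions): a trial is a
# list of cumulative progress values (fraction of subgoals met) after each
# user turn.

def final_progress(trial):
    return trial[-1]                      # progress rate of one trial

def auc(trial):
    return sum(trial) / len(trial)        # normalized area under the curve

def ppt(trial):
    return trial[-1] / len(trial)         # average progress gained per turn

def max_at_k(trials, metric):
    # Best score across the k trials of one task, as in equations (10)-(11);
    # averaging this over tasks gives MaxAUC@k / MaxPPT@k.
    return max(metric(t) for t in trials)

# Two agents that both reach full completion, at different speeds:
fast = [[0.5, 1.0, 1.0]]                  # front-loads its progress
slow = [[0.1, 0.2, 0.3, 0.6, 1.0]]        # needs many turns to finish

assert max_at_k(fast, final_progress) == max_at_k(slow, final_progress) == 1.0
print(max_at_k(fast, auc), max_at_k(slow, auc))  # the fast agent scores higher
```

Both agents finish the task, so a completion-only metric cannot distinguish them; the area-under-curve and per-turn views reward the agent that gets there in fewer turns.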
When we shift our focus to MaxAUC@k, the model ranking changes noticeably (e.g., gpt-5, mistral-nemo, and mistral-large for the expert persona). This shift highlights that while many models can come close to completion on the difficult tasks, only a few do so efficiently. The effect is more pronounced for the non-expert persona: LLM agents using the gpt-4.1 and gpt-4o models maintain high MaxProgressRate@k (0.96 and 0.88), but their MaxAUC@k scores are much lower (0.68 and 0.65). Another interesting observation is that the non-expert persona sometimes has a higher MaxProgressRate@k (e.g., gpt-5, gpt-4o-mini). Upon closer examination, we found instances where the expert persona provided all relevant information in the very first turn, which overwhelmed the agent and led to hallucinations. In contrast, the non-expert persona distributes information gradually over multiple turns, allowing the agent to respond more effectively. This finding highlights the impact of user interaction style on agent performance.

Table 14: Overall performance of different agent models on the τ2-bench airline domain, using gpt-4.1 as user proxy and LLM-as-a-judge. The dataset contains easy and hard samples. Results are displayed as scores for Expert Persona | Non-expert Persona. For metrics with @k, the number of trials is n = k = 20.
Agent Model   | MeanProg@k  | MaxProg@k   | MaxAUC@k    | MaxPPT@k    | Pass@k
τ2-bench Airline Domain (Easy + Hard)
gpt-4.1       | 0.75 | 0.67 | 0.94 | 0.96 | 0.85 | 0.68 | 0.44 | 0.25 | 0.81 | 0.86
gpt-4o        | 0.63 | 0.55 | 0.91 | 0.88 | 0.84 | 0.65 | 0.44 | 0.23 | 0.81 | 0.71
gpt-4o-mini   | 0.53 | 0.53 | 0.86 | 0.88 | 0.79 | 0.66 | 0.43 | 0.26 | 0.62 | 0.76
gpt-5         | 0.80 | 0.77 | 0.96 | 0.97 | 0.89 | 0.77 | 0.47 | 0.30 | 0.91 | 0.91
mistral-nemo  | 0.67 | 0.36 | 0.96 | 0.71 | 0.87 | 0.56 | 0.44 | 0.25 | 0.86 | 0.38
mistral-large | 0.54 | 0.51 | 0.94 | 0.93 | 0.87 | 0.68 | 0.43 | 0.24 | 0.86 | 0.76

Table 15: Overall performance of different agent models on the τ2-bench retail domain, using gpt-4.1 as user proxy and LLM-as-a-judge. Results are displayed as scores for Expert Persona | Non-expert Persona. For metrics with @k, the number of trials is n = k = 20.

Agent Model   | MeanProg@k  | MaxProg@k   | MaxAUC@k    | MaxPPT@k    | Pass@k
τ2-bench Retail Domain
gpt-4.1       | 0.92 | 0.88 | 0.99 | 1.00 | 0.94 | 0.83 | 0.51 | 0.24 | 0.92 | 0.96
gpt-4o        | 0.87 | 0.75 | 1.00 | 0.99 | 0.95 | 0.83 | 0.54 | 0.25 | 1.00 | 0.96
gpt-4o-mini   | 0.81 | 0.76 | 1.00 | 0.98 | 0.95 | 0.82 | 0.54 | 0.25 | 1.00 | 0.88
mistral-nemo  | 0.57 | 0.58 | 0.99 | 0.97 | 0.86 | 0.79 | 0.29 | 0.23 | 0.88 | 0.84
mistral-large | 0.80 | 0.81 | 1.00 | 1.00 | 0.94 | 0.83 | 0.49 | 0.23 | 0.96 | 1.00

In Table 15, we further evaluate our framework on the retail domain. In contrast to the airline domain, retail tasks appear more solvable, with most models achieving near-perfect MaxProgressRate@k. However, the trend between the personas remains consistent, with expert personas consistently achieving higher MaxAUC@k scores than non-expert personas. An interesting observation appears for agents with the gpt-4o and gpt-4o-mini models in the non-expert user setting.
While both models exhibit nearly identical MeanProgressRate@k, MaxProgressRate@k, MaxAUC@k, and MaxPPT@k scores, the agent with gpt-4o-mini shows a lower pass@k score than gpt-4o. This suggests that gpt-4o-mini frequently reaches a state of near-completion (i.e., MaxProgressRate@k close to 1) but often fails to complete the full task, whereas gpt-4o attains a higher pass@k score. This further strengthens our justification for fine-grained metrics that quantify the agent's progression.

Figure 6: Scatter plots on the τ2-bench airline domain (easy + hard) illustrating the relationship between MaxAUC@k and MaxProgressRate@k (left subplot), and between MaxPPT@k and MaxProgressRate@k (right subplot), using the setting n = k = 20 trials.

Figure 7: Dataset-level performance of different agent models on the τ2-bench airline domain (easy), with error bars representing 95% confidence intervals. The top graph shows MaxProgressRate@k, the middle graph MaxAUC@k, and the bottom graph MaxPPT@k.

Figure 8: Dataset-level performance of different agent models on the ToolSandbox dataset, with error bars representing 95% confidence intervals. The top graph shows MaxProgressRate@k, the middle graph MaxAUC@k, and the bottom graph MaxPPT@k.

Figure 9: Dataset-level performance of different agent models on the τ2-bench airline domain (easy + hard), with error bars representing 95% confidence intervals. The top graph shows MaxProgressRate@k, the middle graph MaxAUC@k, and the bottom graph MaxPPT@k.

Figure 10: Dataset-level performance of different agent models on the τ2-bench retail domain, with error bars representing 95% confidence intervals. The top graph shows MaxProgressRate@k, the middle graph MaxAUC@k, and the bottom graph MaxPPT@k.

A.11 ABLATION STUDY ON THE DIFFERENT USER MODELS

Besides varying the model for agents, we also conducted an ablation study by varying the user models across the GPT family, as shown in Table 16. A similar trend emerges as in earlier analyses: the gap across traditional metrics becomes narrower, and the agent performance difference between expert and non-expert user personas is relatively small. This further highlights the importance of our proposed metrics in capturing turn-level agent behavior beyond what conventional metrics do. As expected, the agent interacting with the expert user achieves consistently higher performance on the MaxAUC@k and MaxPPT@k metrics, confirming our hypothesis that agents can resolve tasks more efficiently when expert users understand the system well and provide complete information.
Interestingly, if we use stronger models such as gpt-5 for the user proxy, we see a smaller gap between the expert and non-expert personas. This suggests that as the capability of the user-proxy model improves, a proxy with a non-expert persona behaves more like an expert and achieves performance closer to expert-level outcomes. Based on this observation, we believe that varying the user model also changes the effective user expertise, which can potentially simulate different user expertise levels.

Table 16: Overall performance of a gpt-4.1 agent with different user proxy models on the τ2-bench airline domain, using the gpt-4.1 model as LLM-as-a-judge. The dataset contains easy and hard samples. Results are displayed as scores for Expert Persona | Non-expert Persona. For metrics with @k, the number of trials is n = k = 20.

User Model   | MeanProg@k  | MaxProg@k   | MaxAUC@k    | MaxPPT@k    | pass@k
τ2-bench Airline Domain (Easy + Hard)
gpt-4.1      | 0.75 | 0.67 | 0.94 | 0.96 | 0.85 | 0.68 | 0.44 | 0.25 | 0.81 | 0.86
gpt-4o       | 0.72 | 0.61 | 0.95 | 0.95 | 0.82 | 0.66 | 0.36 | 0.22 | 0.81 | 0.81
gpt-4o-mini  | 0.73 | 0.64 | 0.95 | 0.95 | 0.80 | 0.65 | 0.35 | 0.22 | 0.86 | 0.86
gpt-5        | 0.71 | 0.73 | 0.92 | 0.95 | 0.85 | 0.83 | 0.46 | 0.37 | 0.76 | 0.86

A.12 ADDITIONAL INFORMATION ON DATASET AND EXPERIMENTAL SETUP

τ2-bench. For τ2-bench, we split the samples in the airline domain into "easy" and "hard" subsets using the pass^k metric with n = k = 4. We consider samples that are completed in all 4 independent runs as "easy". The original samples are annotated with tool signatures and natural language assertions. Since these assertions closely align with our grading notes, we use them for evaluation.

ToolSandbox. For ToolSandbox, we do not split the samples, as we consider all of them to be easy samples. We use gpt-4.1 to convert the milestones into grading notes.
Each data sample or scenario consists of a set of milestones M = {m_1, m_2, ..., m_n} and may include a directed acyclic graph (DAG) of dependencies E = {(i, j)}, where each edge (i, j) indicates that milestone m_j depends on m_i. The conversion process extracts key information from each milestone, such as required tool calls, expected agent-to-user communications, and ground-truth state changes, and assembles this together with the DAG structure into a structured prompt. The prompt template is used to produce actionable grading notes, and it expresses dependencies using connectors like "before" or "after".

The ToolSandbox dataset contains multiple variations of the same scenario. For example, the base scenario find_days_till_holiday has variants like find_days_till_holiday_alt (which starts with an alternate input message) and find_days_till_holiday_multiple_user_turn (which intentionally provides less information to force a multi-turn conversation). These variations serve as a crude simulation of user expertise, and their grading notes do not differ significantly from the base scenario. Since we have our own generic user persona templates and these variants share similar milestones, we use only the base scenario (e.g., find_days_till_holiday) and ignore the variants. After this process, we manually reviewed and refined the dataset to ensure that the generated grading notes were correct and meaningful. Our setup offers greater variability than the original versions with fixed initial messages. The prompt template used for each scenario is as follows:

Prompt Template for Creating Grading Notes (Section 4):

You are creating grading notes for agent evaluation that include dependency relationships. Convert these milestones into concise statements about what the agent should accomplish, including any sequence requirements.
SCENARIO: {scenario_name}
DESCRIPTION: {scenario_description}
TOTAL MILESTONES: {total_milestones}
MILESTONE DEPENDENCIES (DAG edges): {milestone_edge_list}
DEPENDENCY ANALYSIS: {human_readable_description_of_milestone_dependencies}
MILESTONES:
Milestone 0: {details_including_constraint_type}
Milestone 1: {details_including_constraint_type}
...

RULES:
1. Create one or MORE natural language subgoals per milestone as needed to capture all required actions
2. Mention specific tool names when relevant: "Agent should call tool_name"
3. Use natural language to describe the purpose: "Agent should call search_contacts to find Homer's information"
4. Include sequence requirements when dependencies exist: "before", "after", "then", "first"
5. Break down complex milestones into multiple subgoals if needed
6. Use format: "Agent should [natural action description]"
7. Focus on what needs to be accomplished; be specific and actionable

EXAMPLES OF GOOD NATURAL LANGUAGE GRADING NOTES:
• "Agent should call set_wifi_status to turn off wifi"
• "Agent should inform the user that wifi is turned off"
• "Agent should enable cellular service before sending message"
• "Agent should update contact phone number after finding the contact"

SPECIAL HANDLING FOR COMMUNICATION MILESTONES:
• If target data has sender=AGENT and recipient=USER with content, the grading note should be: "Agent should inform/tell the user [content]"
• If target data has sender=EXECUTION_ENVIRONMENT and recipient=AGENT with tool trace, focus on the tool call requirement
• Focus on what the agent needs to DO or COMMUNICATE, not technical database states

CONSTRAINT TYPES:
• snapshot_similarity: Agent should achieve the target state
• addition_similarity: Agent should add/create the target data
• removal_similarity: Agent should remove/delete the target data
• update_similarity: Agent should modify/update the target data
RESPONSE FORMAT:
Return a JSON array where each element can be either a single string or an array of strings for that milestone:
{json_schema}
Each milestone can have one or multiple grading notes as subgoals. Include dependency relationships when they exist.

Additionally, we provide examples of the generated grading notes for ToolSandbox (Lu et al., 2024) below. The full dataset is released together with our code.

Sample: modify_contact_with_message_recency
• Agent should call get_current_timestamp to retrieve the current time
• Agent should call search_contacts to find the contact information
• Agent should call search_messages after getting the current timestamp to find the last person the user sent a message to
• Agent should update the contact's phone number to +10293847563 after identifying the person is Homer S.
• Agent should inform the user: 'The phone number of the person you last talked to has been updated to +10293847563' after updating the contact

Figure 11: Example of generated grading notes for the ToolSandbox sample modify_contact_with_message_recency.

Sample: update_contact_relationship_with_relationship_twice_multiple_user_turn
• Agent should call search_contacts to find contacts with the relationship 'friend'.
• Agent should call modify_contact to update Fredrik Thordendal's relationship to 'enemy' after finding the contact.
• Agent should call modify_contact to update John Petrucci's relationship to 'enemy' after finding the contact.
• Agent should inform the user: 'Fredrik Thordendal and John Petrucci are now your enemies.'
• Agent should again call modify_contact to update Fredrik Thordendal's and John Petrucci's relationships from 'enemy' back to 'friend'.

Figure 12: Example of generated grading notes for the ToolSandbox sample update_contact_relationship_with_relationship_twice_multiple_user_turn.
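The two mechanical pieces of this conversion step, rendering the milestone DAG edges into the dependency fields of the prompt and flattening the model's JSON-array response into a single list of grading notes, can be sketched roughly as follows. The function names and exact field formats are our assumptions, not the released code.

```python
import json

# Illustrative sketch (assumed helper names): render milestone DAG edges as
# the MILESTONE DEPENDENCIES / DEPENDENCY ANALYSIS prompt fields, and flatten
# the JSON-array response (strings or lists of strings) into grading notes.

def describe_dependencies(edges):
    """edges: list of (i, j) pairs, meaning milestone j depends on milestone i."""
    edge_list = ", ".join(f"({i}, {j})" for i, j in edges)
    analysis = "\n".join(
        f"Milestone {j} must happen after milestone {i}." for i, j in edges
    )
    return edge_list, analysis

def flatten_grading_notes(response_text):
    """Parse the response format above: a JSON array whose elements are
    either a string or a list of strings (multiple subgoals per milestone)."""
    parsed = json.loads(response_text)
    notes = []
    for element in parsed:
        if isinstance(element, str):
            notes.append(element)
        else:
            notes.extend(element)
    return notes

# Hypothetical example: milestone 2 depends on milestones 0 and 1.
edge_list, analysis = describe_dependencies([(0, 2), (1, 2)])
notes = flatten_grading_notes(
    '["Agent should call search_contacts to find the contact", '
    '["Agent should update the contact", '
    '"Agent should inform the user after updating the contact"]]'
)
```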
Sample: find_current_city_low_battery_mode
• Agent should ensure low battery mode is disabled
• Agent should enable WiFi
• Agent should enable WiFi after ensuring low battery mode is disabled
• Agent should enable location services
• Agent should enable location services after ensuring low battery mode is disabled
• Agent should call get_current_location to retrieve the user's location
• Agent should inform the user: You are currently in Cupertino

Figure 13: Example of generated grading notes for the ToolSandbox sample find_current_city_low_battery_mode.

Sample: convert_currency
• Agent should call convert_currency with the arguments: amount=2048, from_currency_code='USD', to_currency_code='CNY'

Figure 14: Example of generated grading notes for the ToolSandbox sample convert_currency.

Sample: search_message_with_recency_oldest
• Agent should call get_current_timestamp to retrieve the current timestamp
• Agent should call search_messages to find the oldest message
• Agent should call search_messages after getting the current timestamp
• Agent should inform the user: Your oldest message says 'Hey kid, you want some GPU?'.

Figure 15: Example of generated grading notes for the ToolSandbox sample search_message_with_recency_oldest.

Sample: search_reminder_with_recency_yesterday
• Agent should call get_current_timestamp to retrieve the current date and time
• Agent should call search_reminder to find reminders for the day before the retrieved timestamp
• Agent should call search_reminder after getting the current timestamp
• Agent should inform the user: Your reminder from yesterday says 'Look for Company SF tickets'.

Figure 16: Example of generated grading notes for the ToolSandbox sample search_reminder_with_recency_yesterday.
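In our framework the grading notes above are scored by LLM-as-a-judge, but the "after" sequence constraints they encode could, for intuition, also be checked programmatically when a chronological tool-call trace is available. The sketch below is illustrative only and not part of the framework; the trace format (an ordered list of tool names) is an assumption.

```python
# Illustrative only: the paper evaluates grading notes with LLM-as-a-judge.
# This sketch shows a programmatic analogue for order constraints of the
# form "Agent should call X after Y", given a chronological tool-call trace.

def order_satisfied(trace, earlier, later):
    """True if the first call to `later` occurs after the first call to
    `earlier` in the chronological list of tool-call names."""
    try:
        return trace.index(earlier) < trace.index(later)
    except ValueError:  # one of the required calls never happened
        return False

# Hypothetical trace for a search_messages-style scenario:
trace = ["get_current_timestamp", "search_messages", "search_contacts"]
ok = order_satisfied(trace, "get_current_timestamp", "search_messages")
```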
Sample: search_reminder_with_recency_yesterday
Answer AI assistant's questions given the following task you (LLM-simulated user) want AI assistant to complete: Ask AI assistant to find the content of your (LLM-simulated user's) reminder from yesterday. It should say Look for Company SF tickets. Do not leak this information. You do not have any more information. When AI assistant completed the task, even if you (LLM-simulated user) don't have enough information to validate the correctness, break out of the simulation and end the conversation using the termination message. Start your first message as: "What's on my reminder yesterday?"

Sample: send_message_with_phone_number_and_content
Answer AI assistant's questions given the following task you (LLM-simulated user) want AI assistant to complete: Send a message to +12453344098 saying: How's the new album coming along. You do not have more information. When AI assistant completed the task, even if you (LLM-simulated user) don't have enough information to validate the correctness, break out of the simulation and end the conversation using the termination message.

Figure 17: Example of user task instructions for the ToolSandbox samples.

Sample: 1
Domain: airline
Reason for Call: You had a mixup with your assistant and booked multiple flights for the same day.
Known Information: You are Sophia Martin. Your user id is sophia_martin_4574.
Task Instructions: You want to first check if there are cases like this in your profile. You want the agent to fix the situation for you. You just know that you will be arriving in New York from Dallas on May 17 and will be in Boston on May 22. You want to let the agent figure out which flights should be cancelled. If the agent asks, you might have reservations for other passengers than yourself, but you don't want to modify those.
The task is considered complete if the instruction goal is satisfied, or you are transferred to another agent, or you find yourself in a situation in which the scenario does not provide enough information for you to continue the conversation.

Sample: 2
Domain: airline
Reason for Call: You just faced some money issue and want to downgrade all business flights to economy, without changing the flights or passengers.
Known Information: Your name is Omar Davis. Your user id is omar_davis_3817.
Task Instructions: You are fine with refunding to original payment for each reservation. You want to know how much money you have saved in total. You are emotional and a bit angry, but you are willing to cooperate with the agent.
The task is considered complete if the instruction goal is satisfied, or you are transferred to another agent, or you find yourself in a situation in which the scenario does not provide enough information for you to continue the conversation.

Figure 18: Example of user task instructions for the τ2-bench airline samples.

A.13 ADDITIONAL DETAILS ON AGENT TRAJECTORIES

GPT-5 Dialogue Excerpt:
UserProxy: Hey, um, can you check what my reminder is for today? I'm not sure how to see it.
Tau2Agent: What time zone are you in? Do you want all reminders scheduled for today, or just the next upcoming one?
UserProxy: Uh, I'm in Eastern time, I think. Just, like, whatever's next for today is fine.
Tau2Agent: Got it — I'll look for the next reminder scheduled for today in Eastern time (currently UTC-4). To pull your reminders, I need the identifier for your reminders profile.
︙
UserProxy: Uh, I don't really remember what email I used, sorry. I think the reminder is about buying something? Not sure what time though. Does that help?
Tau2Agent: Thanks — that helps.
I searched for reminders in Eastern time, and found this next one: "Buy a nice rich navy bathing dress"…

Task Instruction: …Ask AI assistant to find the content of your (LLM-simulated user's) upcoming reminder today. It should say Buy a nice rich navy bathing dress. IMPORTANT: Do NOT leak this information…

Mistral-Nemo Dialogue Excerpt:
UserProxy: Hey, um, can you check what my reminder is for today? I think I set something but I'm not totally sure.
Tau2Agent: I couldn't find any reminders that you have set for today. If you still believe you have a reminder, it is possible that it was set for yesterday or tomorrow.
UserProxy: Uh, yeah, maybe check for tomorrow then? I'm not really sure what day it was set for.
Tau2Agent: I couldn't find any reminders that you have set for tomorrow. If you still believe you have a reminder, it is possible that it was set for yesterday or in the future.
︙
UserProxy: Hmm, I guess just check for any reminders in the future then? Maybe I set it for a random day.
Tau2Agent: I found a reminder that is set to go off in the future. Here are the details: Reminder ID: 54f20be9-eac3-5e76-8e57-724e48d3d477 Content: Buy a nice rich navy bathing dress…

Figure annotations: "No Tool Calls Made" (gpt-5 excerpt); "3 Tool Calls Made" (Mistral-Nemo excerpt).

Grading Notes:
Agent should call get_current_timestamp to retrieve the current time
Agent should call search_reminder to find upcoming reminders
Agent should call search_reminder after retrieving the current timestamp
Agent should inform the user: 'Buy a nice rich navy bathing dress'

Figure 19: Agent dialogue excerpt for gpt-5 and Mistral-Nemo, illustrating how Mistral-Nemo makes tool calls early and quickly satisfies the grading notes, while gpt-5 asks clarifying questions and progresses more gradually. Despite these differing strategies, both agents complete the task in the same number of turns.
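The excerpt above illustrates trajectories making different amounts of intermediate progress against the same grading notes. As a rough illustration of how per-trial progress can be aggregated across k trials, consider the simplified sketch below; the exact definitions of MeanProg@k, MaxProg@k, and pass@k are given in the main paper, and the formulas here are simplified assumptions (per-trial progress as the fraction of grading notes the judge marked satisfied).

```python
# Simplified illustration (assumed formulas; see the main paper for the
# exact metric definitions): per-trial progress is the fraction of grading
# notes marked satisfied by the judge, aggregated across k trials.

def progress_metrics(satisfied_per_trial, total_notes):
    """satisfied_per_trial: number of satisfied grading notes in each of
    k independent trials; total_notes: grading notes in the scenario."""
    progress = [s / total_notes for s in satisfied_per_trial]
    return {
        "mean_prog": sum(progress) / len(progress),  # MeanProg-style mean
        "max_prog": max(progress),                   # MaxProg-style best trial
        # pass-style: at least one trial satisfies every grading note
        "pass": float(any(s == total_notes for s in satisfied_per_trial)),
    }

# Hypothetical judge outcomes over k = 3 trials of a 4-note scenario:
m = progress_metrics([2, 4, 3], total_notes=4)
# m["mean_prog"] == 0.75, m["max_prog"] == 1.0, m["pass"] == 1.0
```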